
Core model FAQ
Our system is designed to predict price movements of financial assets on crypto and stock markets based on comprehensive data analysis.
We combine fundamental, technical, cluster, mathematical, and candlestick analysis to obtain the most accurate and reliable forecasts. In this documentation, we will examine in detail each stage of the system's operation and the technologies and methodologies used.
Stage 1. Data Collection and Preparation
Data Sources
Our system utilizes various data sources to ensure completeness and timeliness of information:
Exchange APIs:
CCXT API: Provides real-time data on prices, trading volumes, and order books.
Native exchange APIs.
News Aggregators:
CryptoPanic: Collecting news and events affecting the market.
CoinDesk: Market analysis articles and reviews.
Medium: Project analysis articles and news.
Social Media:
Twitter: Monitoring tweets of key figures and overall community sentiment.
Reddit (r/CryptoCurrency): Analyzing discussions, trends, and social sentiments.
Data Collection Methods
API Requests:
Using RESTful APIs for regular data retrieval.
Setting up WebSocket connections for real-time data reception.
Web Scraping:
Parsing web pages for additional information and project analysis (while adhering to legal conditions and website rules).
Scheduling Tools:
Cron, Anacron, Celery for automating data collection processes at scheduled times.
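As a minimal sketch of the REST-based collection described above (assuming the CCXT library with Binance as an illustrative exchange; the symbol, timeframe, and limit are placeholders), a scheduler such as cron or Celery would simply call this function at the configured interval:

```python
import ccxt
import pandas as pd

def fetch_ohlcv(symbol="BTC/USDT", timeframe="1h", limit=500):
    """Fetch recent OHLCV candles through the unified CCXT interface."""
    exchange = ccxt.binance()  # illustrative exchange; any CCXT-supported exchange works
    candles = exchange.fetch_ohlcv(symbol, timeframe=timeframe, limit=limit)
    df = pd.DataFrame(candles, columns=["timestamp", "open", "high", "low", "close", "volume"])
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    return df
```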
Preliminary Data Processing
Data Cleaning
Duplicate Removal:
Using unique identifiers and timestamps to filter out duplicate records.
Handling Missing Values:
Filling missing values with means or medians.
Time series interpolation.
Removing records with critical missing data.
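A minimal sketch of these cleaning steps using Pandas (the column names timestamp, close, and volume are illustrative):

```python
import pandas as pd

def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    # Remove duplicates using the timestamp as the unique key
    df = df.drop_duplicates(subset="timestamp").sort_values("timestamp")
    # Interpolate short gaps in the price series
    df["close"] = df["close"].interpolate(method="linear")
    # Fill remaining gaps with the median, then drop rows still missing critical fields
    df["volume"] = df["volume"].fillna(df["volume"].median())
    return df.dropna(subset=["close", "volume"])
```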
Data Transformation
Price Data Scaling:
Min-Max normalization:
Scaling data to the range [0,1].
Z-Score normalization:
Converting data to standardized form with mean 0 and standard deviation 1.
Feature Engineering
Creating new features by calculating technical indicators from the original dataset:
RSI
MACD
Bollinger Bands
Momentum
OBV
CMF
VWMACD
MFI
AO
Fibonacci levels
Social Metrics:
Analyzing social media mention counts and performing sentiment analysis.
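A minimal sketch of computing a few of these indicator features with TA-Lib (the periods 14, 12/26/9, and 20 are conventional defaults used for illustration):

```python
import talib

def add_indicators(df):
    close = df["close"].to_numpy(dtype=float)
    volume = df["volume"].to_numpy(dtype=float)
    df["rsi_14"] = talib.RSI(close, timeperiod=14)
    macd, macd_signal, _ = talib.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)
    df["macd"], df["macd_signal"] = macd, macd_signal
    upper, _, lower = talib.BBANDS(close, timeperiod=20)
    df["bb_upper"], df["bb_lower"] = upper, lower
    df["obv"] = talib.OBV(close, volume)
    df["mfi"] = talib.MFI(df["high"].to_numpy(dtype=float), df["low"].to_numpy(dtype=float),
                          close, volume, timeperiod=14)
    return df
```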
Data Labeling
Historical Example Classification
"Buy":
If after a certain period, the asset price increased by a specified percentage.
"Hold":
If price changes were insignificant.
"Sell":
If the price decreased by a specified percentage.
Window-based Labeling
Applying a window of a specific size for sequential labeling of time series.
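A minimal sketch of this labeling rule (the 24-bar horizon and the 2% threshold are illustrative values):

```python
import numpy as np
import pandas as pd

def label_examples(close: pd.Series, horizon: int = 24, threshold: float = 0.02) -> pd.Series:
    """Label each bar "Buy"/"Sell"/"Hold" from its forward return over `horizon` bars."""
    future_return = close.shift(-horizon) / close - 1.0
    labels = pd.Series("Hold", index=close.index)
    labels[future_return >= threshold] = "Buy"
    labels[future_return <= -threshold] = "Sell"
    labels[future_return.isna()] = np.nan  # tail of the series has no forward window yet
    return labels
```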
Tools and Technologies
Programming Languages:
• Python: Primary language for data processing.
Data Processing Libraries:
• Pandas: For table and time series manipulation.
• NumPy: For high-performance computations.
API Clients:
• CCXT: Unified interface for various cryptocurrency exchanges.
Data Storage:
• MongoDB: For storing unstructured data.
• PostgreSQL: For relational data.
Fundamental Analysis
Data Collection:
Project Information:
Roadmaps
Development team details
Partnerships
Development Activity:
Number of commits
Open pull requests on GitHub
Events and News:
Update announcements
Listings on exchanges
Feature Extraction:
Sentiment Analysis of News:
Using NLP models to determine article tone
Metrics:
Market capitalization
Trading volumes
Circulating supply
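One simple way to score news tone is NLTK's VADER analyzer, sketched below (the headline is made up; heavier NLP models can be substituted in production):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

headline = "Exchange announces listing of the token after a major partnership"
score = sia.polarity_scores(headline)["compound"]  # compound score in [-1, 1]
sentiment = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
```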
Technical Analysis
Indicators:
Moving Averages (SMA, EMA):
Various periods (5, 10, 20, 50, 100, 200 days)
Relative Strength Index (RSI):
Identifying overbought or oversold conditions in assets
Moving Average Convergence Divergence (MACD):
Determining trend changes
Bollinger Bands:
Identifying support and resistance levels
Momentum:
Assessing price change speed
On-Balance Volume (OBV):
Analyzing market movement direction
Chaikin Money Flow (CMF):
Predicting price movements based on money flow
Volume Weighted MACD (VWMACD):
Trend confirmation that takes trading volume into account
Awesome Oscillator (AO):
Analyzing rising and falling price momentum
Process:
Calculation Tools:
Using TA-Lib, pandas-ta libraries for calculating indicators
Employing functions from statsmodels library for additional analytical tasks
mplfinance for creating price profiles and technical analysis charts
ta-lib-python for accessing a wide range of technical indicators via API
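As a small example of the charting step with mplfinance (the moving-average periods and style are illustrative; the DataFrame needs a DatetimeIndex and Open/High/Low/Close/Volume columns):

```python
import mplfinance as mpf

# Candlestick chart with 20- and 50-period moving averages and a volume panel
mpf.plot(df, type="candle", mav=(20, 50), volume=True, style="charles")
```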
Cluster Analysis
Order Book Data:
Buy and Sell Orders:
Analysis of volume distribution across price levels
Identification of patterns in order behavior (convergent or divergent orders)
Methods:
Clustering:
k-Means: Grouping orders based on price, time, and volume
DBSCAN: Identifying clusters of large orders or anomalies
Hierarchical clustering: Building hierarchical cluster structures to detect relationships between different types of orders
Goals:
Determining Support and Resistance Levels:
Using clusters to identify stable price levels
Analyzing order behavior around these levels
Identifying Large Trader Activity:
Recognizing patterns characteristic of large traders
Analyzing temporal dependencies between different types of orders to identify strategies of large traders
Using machine learning to predict the probability that a specific order belongs to a large trader
Additional Methods:
Network Analysis:
Building graphs of relationships between different types of orders to detect complex patterns of activity
Anomaly Detection:
Using Isolation Forest or One-Class SVM to identify unusual scenarios in order data
Time Series Analysis:
Applying Prophet for predicting trading volume behavior based on historical data
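A minimal sketch of clustering order-book data with Scikit-learn (the features, cluster count, and DBSCAN parameters are illustrative, and the orders here are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

# One row per order: price, volume, and time spent in the book (synthetic data)
rng = np.random.default_rng(0)
orders = np.column_stack([
    rng.normal(27000, 150, 1000),   # price
    rng.lognormal(0, 1, 1000),      # volume
    rng.exponential(60, 1000),      # seconds in the book
])
X = StandardScaler().fit_transform(orders)

# k-Means groups orders into broad price/volume clusters (candidate support/resistance zones)
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# DBSCAN finds dense pockets of orders and marks outliers (label -1) as anomalies
dbscan_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
```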
Mathematical Analysis
Statistical Models:
Autocorrelation:
Analysis of dependence of current values on past values
Correlation Coefficients:
Relationship between individual assets and the overall market
Volatility Models:
• GARCH (Generalized Autoregressive Conditional Heteroskedasticity):
Forecasting future volatility based on historical data
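A minimal sketch of these models, using statsmodels for autocorrelation and, as an assumption, the arch package for GARCH(1,1); close_prices is a placeholder array of historical prices:

```python
import numpy as np
from statsmodels.tsa.stattools import acf
from arch import arch_model  # assumption: the `arch` package is used for GARCH

returns = 100 * np.diff(np.log(close_prices))  # percentage log returns

# Autocorrelation: dependence of current returns on past returns
autocorr = acf(returns, nlags=20)

# GARCH(1,1): forecast conditional volatility from historical returns
fitted = arch_model(returns, vol="Garch", p=1, q=1).fit(disp="off")
vol_forecast = fitted.forecast(horizon=5)
```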
Candlestick Analysis
Japanese Candlestick Patterns:
• Popular reversal patterns
• Popular continuation patterns
Recognition Algorithms:
• Application of libraries:
• TA-Lib, candle-patterns for automatic pattern detection
Outcomes:
• Signals indicating potential trend change or continuation
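A minimal sketch of pattern detection with TA-Lib (the three patterns are illustrative examples; open_, high, low, and close are placeholder NumPy arrays):

```python
import talib

# TA-Lib pattern functions return 0 (no pattern), +100 (bullish), or -100 (bearish) per bar
engulfing = talib.CDLENGULFING(open_, high, low, close)            # reversal pattern
hammer = talib.CDLHAMMER(open_, high, low, close)                  # reversal pattern
three_soldiers = talib.CDL3WHITESOLDIERS(open_, high, low, close)  # bullish pattern

bullish_signal = (engulfing > 0) | (hammer > 0) | (three_soldiers > 0)
```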
Tools and Libraries:
• TA-Lib: Library for technical analysis
• Scikit-learn: For clustering and other machine learning models
• Statsmodels: For statistical analysis and volatility modeling
• NLTK, SpaCy: For natural language processing when working with textual data
Data Merging Methodology
Combining data from various sources:
Using time-based keys for synchronizing time series from different data sources
Joining by asset identifiers for correct mapping of data related to the same asset
Creating a single dataframe:
cuDF DataFrame is used as the basis for storing combined data
Alternative solutions are considered for large datasets and distributed computing:
Apache Spark for processing large datasets on a cluster
PySpark for working with data in a distributed environment
Apache Arrow for efficient data exchange between different systems
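A minimal sketch of the merge step with cuDF (prices, indicators, and sentiment are placeholder cuDF DataFrames that share timestamp and asset columns):

```python
import cudf

merged = prices.merge(indicators, on=["timestamp", "asset"], how="left")
merged = merged.merge(sentiment, on=["timestamp", "asset"], how="left")
merged = merged.sort_values(["asset", "timestamp"]).reset_index(drop=True)
```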
Data Normalization
Reasons for normalization:
• Different scales of features can negatively affect model training. Normalization ensures feature compatibility and improves training efficiency.
Methods of normalization:
o Standardization (Z-score):
Formula: z = (x - μ) / σ
Implemented using cuDF functions for fast calculation of the mean and standard deviation across columns on the GPU
o Scaling to the range [0,1]:
Formula: x_scaled = (x - x_min) / (x_max - x_min)
Performed using cuDF functions to find the minimum and maximum values and transform the data
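A minimal sketch of both normalization methods on a cuDF DataFrame:

```python
import cudf

def normalize(df: cudf.DataFrame, columns):
    for col in columns:
        # Standardization: z = (x - μ) / σ, computed column-wise on the GPU
        df[f"{col}_z"] = (df[col] - df[col].mean()) / df[col].std()
        # Min-max scaling to [0, 1]
        col_min, col_max = df[col].min(), df[col].max()
        df[f"{col}_scaled"] = (df[col] - col_min) / (col_max - col_min)
    return df
```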
Handling Missing Values
Methods of filling:
o Interpolation:
Linear or polynomial interpolation of time series using cuDF functions to fill missing values based on neighboring values
o Constant filling:
Filling missing values with zero, mean, or median of the feature using cuDF methods
Removing features or records with a large number of missing values if filling is not advisable
Correlation Checking
Correlation matrix:
Calculated using cuDF functions to identify strongly correlated features
For a wider range of statistical functions, cuML from RAPIDS is used
• Feature Selection:
Excluding features with high correlation to reduce redundancy and prevent overfitting
Using methods such as Variance Inflation Factor (VIF) for quantitative assessment of multicollinearity between features
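A minimal sketch of the correlation check and VIF calculation (features is a placeholder DataFrame; statsmodels expects CPU data, so a cuDF frame is converted with to_pandas first):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

features_pd = features.to_pandas() if hasattr(features, "to_pandas") else features
corr_matrix = features_pd.corr()  # pairwise correlations between features

# VIF per feature: values above roughly 5-10 indicate problematic multicollinearity
vif = pd.Series(
    [variance_inflation_factor(features_pd.values, i) for i in range(features_pd.shape[1])],
    index=features_pd.columns,
)
redundant = vif[vif > 10].index.tolist()
```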
Neural Network Architecture
Modular Structure:
Separate inputs and layers for different types of data
Layer Combination:
Connecting outputs of the individual modules before feeding them into common layers
Convolutional Layers (CNN)
Purpose:
• Extracting local patterns:
Detecting complex dependencies in time series
Basic Configuration:
• Number of layers:
3 convolutional layers followed by pooling layers
• Layer Parameters:
Filters: 64, 128, 256
Kernel size: 3, 5
Activation: ReLU
Recurrent Layers (LSTM)
Purpose:
• Accounting for temporal dependencies:
Remembering past states for future predictions
Configuration:
• Number of layers:
2 LSTM layers
• Layer Parameters:
Memory units: 256, 512
Dropout: 0.2 - 0.3
Fully Connected Layers and Output Layer
Fully Connected Layers
• Feature aggregation:
Transforming data into a form suitable for classification or regression
• Parameters:
Number of neurons: 128, 64
Activation: ReLU
Output Layer
• Classification:
Activation: Softmax
• Regression:
Activation: Linear
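The framework is not fixed by this description; the sketch below assumes TensorFlow/Keras and wires the configuration listed above (three Conv1D layers with 64/128/256 filters, two LSTM layers with 256/512 units, dense layers of 128/64 neurons, and a softmax output) into one modular model with a separate input for aggregated text/social features:

```python
from tensorflow.keras import layers, models

def build_model(seq_len, n_price_features, n_text_features, n_classes=3):
    # Convolutional branch: local patterns in the price/indicator sequences
    price_in = layers.Input(shape=(seq_len, n_price_features))
    x = price_in
    for filters in (64, 128, 256):
        x = layers.Conv1D(filters, kernel_size=3, activation="relu", padding="same")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)

    # Recurrent branch: temporal dependencies across the sequence
    x = layers.LSTM(256, return_sequences=True, dropout=0.2)(x)
    x = layers.LSTM(512, dropout=0.3)(x)

    # Separate module for aggregated text/social features
    text_in = layers.Input(shape=(n_text_features,))
    t = layers.Dense(64, activation="relu")(text_in)

    # Combine module outputs before the common fully connected layers
    h = layers.concatenate([x, t])
    h = layers.Dense(128, activation="relu")(h)
    h = layers.Dense(64, activation="relu")(h)

    # Classification head: probabilities for "Buy", "Hold", "Sell"
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(inputs=[price_in, text_in], outputs=out)
```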
Regularization and Optimization
Regularization
• Dropout:
Disabling a portion of neurons during training to prevent overfitting
• L1/L2 regularization:
Adding penalties for weight coefficients
Optimization
• Optimizers:
Adam
Learning Rate Scheduler
• Loss function:
Categorical cross-entropy
MSE (Mean Squared Error)
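A short continuation of the Keras sketch above showing how these choices come together at compile time (ReduceLROnPlateau is one possible scheduler; the learning rate and patience values are illustrative):

```python
from tensorflow.keras import optimizers, callbacks

model = build_model(seq_len=128, n_price_features=32, n_text_features=8)  # from the sketch above
model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",  # "mse" for the regression variant
    metrics=["accuracy"],
)

# Reduce the learning rate when the validation loss stops improving
lr_schedule = callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3)
```
Dropout is already built into the LSTM layers of the sketch; L1/L2 penalties can be attached to the dense layers via the kernel_regularizer argument.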
Predictions and Interpretation
Probability Generation:
Obtaining probabilities for each class ("Buy", "Hold", "Sell")
Result Interpretation:
Threshold settings:
Establishing thresholds for decision-making
Interpretation Methods:
SHAP (SHapley Additive exPlanations):
Evaluating the contribution of each feature to the prediction
LIME (Local Interpretable Model-agnostic Explanations):
Explaining predictions for specific instances
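A minimal sketch of turning class probabilities into a decision with a confidence threshold (the 0.6 threshold is illustrative; price_batch and text_batch are placeholder model inputs):

```python
import numpy as np

CLASSES = ["Buy", "Hold", "Sell"]
CONFIDENCE_THRESHOLD = 0.6  # illustrative decision threshold

probs = model.predict([price_batch, text_batch])  # shape: (n_samples, 3)

def to_recommendation(p):
    idx = int(np.argmax(p))
    # Fall back to "Hold" when the model is not confident enough
    return CLASSES[idx] if p[idx] >= CONFIDENCE_THRESHOLD else "Hold"

recommendations = [to_recommendation(p) for p in probs]
```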
Recommendation Formation
Business Logic:
Determining actions based on predictions and risk management
• Reporting:
Creating reports explaining reasons for recommendations
Including visualizations and charts
Visualization of Results
Graphs:
Price charts with overlaid signals
Heat maps of feature importance
• Interactive Panels:
Using Dash, Streamlit for creating web applications
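A tiny Streamlit sketch of such a panel (the CSV file and its columns are hypothetical exports of the model's signals):

```python
import pandas as pd
import streamlit as st

st.title("Model signals")
df = pd.read_csv("signals.csv", parse_dates=["timestamp"])  # hypothetical export
st.line_chart(df.set_index("timestamp")["close"])
st.dataframe(df[df["signal"] != "Hold"][["timestamp", "signal", "confidence"]])
```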
Training Process
Data Split:
Training set: 70%
Validation set: 15%
Test set: 15%
Cross-validation:
K-Fold Cross-Validation for assessing model stability
Evaluation Metrics:
Accuracy, Precision, Recall, F1-score
For multi-class classification, macro- and weighted-averaged versions of these metrics are used
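A minimal sketch of the split, cross-validation, and metric reporting with Scikit-learn (X, y, and y_pred are placeholders for the feature matrix, labels, and model predictions):

```python
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import classification_report

# 70 / 15 / 15 split: hold out 30%, then split it evenly into validation and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# K-Fold cross-validation on the training portion to assess stability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# After training: per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_test, y_pred, target_names=["Buy", "Hold", "Sell"]))
```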
Application of Data Augmentation Methods
Purpose:
Increasing data volume and improving model generalization ability
Methods:
Noise transformations:
Adding small Gaussian noise
Temporal distortions:
Stretching or compressing time series
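A minimal sketch of both augmentation methods on a single 1-D window (the noise level and warp factor are illustrative):

```python
import numpy as np

def add_gaussian_noise(window: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Jitter each value with small Gaussian noise proportional to its magnitude."""
    return window * (1.0 + np.random.normal(0.0, sigma, size=window.shape))

def time_warp(window: np.ndarray, factor: float = 1.1) -> np.ndarray:
    """Stretch (factor > 1) or compress (factor < 1) a series, then resample to its original length."""
    original = np.arange(len(window))
    warped_axis = np.linspace(0, len(window) - 1, int(len(window) * factor))
    stretched = np.interp(warped_axis, original, window)
    return np.interp(original, np.linspace(0, len(window) - 1, len(stretched)), stretched)
```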
Model Update
Regular Retraining:
Periodic updating of the model with new data
Performance Monitoring:
Tracking metrics on new data
Alerting for declining model quality
Hyperparameter Adaptation:
Random Search, Bayesian Optimization
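One possible way to automate the random search, sketched with KerasTuner as an assumption (the input shape, hyperparameter choices, and trial budget are illustrative; X_train_seq and y_train_onehot are placeholders):

```python
import keras_tuner as kt  # assumption: KerasTuner is used for the search
from tensorflow.keras import layers, models

def build_tunable(hp):
    model = models.Sequential([
        layers.Input(shape=(128, 32)),  # illustrative (sequence length, feature count)
        layers.Conv1D(hp.Choice("filters", [64, 128]), 3, activation="relu", padding="same"),
        layers.LSTM(hp.Choice("lstm_units", [256, 512]),
                    dropout=hp.Float("dropout", 0.2, 0.3, step=0.05)),
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_tunable, objective="val_accuracy", max_trials=20, overwrite=True)
tuner.search(X_train_seq, y_train_onehot, validation_split=0.15, epochs=10)
```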