Core model FAQ

Our system is designed to predict price movements of financial assets on crypto and stock markets based on comprehensive data analysis.

We combine fundamental, technical, cluster, mathematical, and candlestick analysis to obtain the most accurate and reliable forecasts. This documentation thoroughly examines each stage of the system's operation and the technologies and methodologies used.


Stage 1. Data Collection and Preprocessing

Data Sources

Our system utilizes various data sources to ensure completeness and timeliness of information:

  1. Exchange APIs:

    • CCXT API: Provides real-time data on prices, trading volumes, and order books.

    • Native exchange APIs.

  2. News Aggregators:

    • CryptoPanic: Collecting news and events affecting the market.

    • CoinDesk: Market analysis articles and reviews.

    • Medium: Project analysis articles and news.

  3. Social Media:

    • Twitter: Monitoring tweets of key figures and overall community sentiment.

    • Reddit (r/CryptoCurrency): Analyzing discussions, trends, and social sentiments.

Data Collection Methods

API Requests:

  • Using RESTful APIs for regular data retrieval.

  • Setting up WebSocket connections for real-time data reception.

Web Scraping:

  • Parsing web pages for additional information and project analysis (while adhering to legal conditions and website rules).

Scheduling Tools:

  • Cron, Anacron, Celery for automating data collection processes at scheduled times.
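
As an illustration of the REST-based collection above, here is a minimal sketch of pulling OHLCV candles through CCXT; the exchange, trading pair, and timeframe are placeholder values, not the production configuration.

```python
# Minimal sketch: pulling OHLCV candles through the CCXT REST interface.
# The exchange, symbol, and timeframe below are illustrative placeholders.
import ccxt
import pandas as pd

exchange = ccxt.binance()  # any CCXT-supported exchange works the same way

def fetch_candles(symbol: str = "BTC/USDT", timeframe: str = "1h", limit: int = 500) -> pd.DataFrame:
    """Fetch recent OHLCV candles and return them as a timestamped DataFrame."""
    raw = exchange.fetch_ohlcv(symbol, timeframe=timeframe, limit=limit)
    df = pd.DataFrame(raw, columns=["timestamp", "open", "high", "low", "close", "volume"])
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    return df.set_index("timestamp")

if __name__ == "__main__":
    print(fetch_candles().tail())
```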

Preliminary Data Processing

Data Cleaning

Duplicate Removal:

  • Using unique identifiers and timestamps to filter out duplicate records.

Handling Missing Values:

  • Filling missing values with means or medians.

  • Time series interpolation.

  • Removing records with critical missing data.

Data Transformation

Price Data Scaling:

  • Min-Max normalization:

    • Scaling data to the range [0,1].

  • Z-Score normalization:

    • Converting data to standardized form with mean 0 and standard deviation 1.

Feature Engineering

Creating new features by calculating technical indicators from the original dataset:

  • RSI

  • MACD

  • Bollinger Bands

  • Momentum

  • OBV

  • CMF

  • VWMACD

  • MFI

  • AO

  • Fibonacci levels

Social Metrics:

  • Social media mention counts and sentiment analysis scores.
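
As a sketch of the social metrics above, the snippet below computes a mention count and an average sentiment score with NLTK's VADER analyzer; the example posts are placeholders for the Twitter/Reddit feeds.

```python
# Illustrative sketch: turning raw social-media mentions into simple features
# (mention count and an average sentiment score) using NLTK's VADER analyzer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time download of the VADER lexicon
analyzer = SentimentIntensityAnalyzer()

mentions = [
    "BTC breaking out, this rally looks strong",  # placeholder posts; in production
    "Another rug pull, the market is bleeding",   # these come from Twitter/Reddit feeds
]

mention_count = len(mentions)
avg_sentiment = sum(analyzer.polarity_scores(t)["compound"] for t in mentions) / mention_count
print(mention_count, round(avg_sentiment, 3))
```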

Data Labeling

Historical Example Classification

  • "Buy":

    • If, after a defined period, the asset price increased by a specified percentage.

  • "Hold":

    • If price changes were insignificant.

  • "Sell":

    • If the price decreased by a specified percentage.

Window-based Labeling

  • Applying a window of a specific size for sequential labeling of time series.
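
A minimal sketch of this labeling scheme is shown below; the look-ahead horizon and the percentage thresholds are illustrative parameters, not the production settings.

```python
# Sketch of window-based labeling: each observation is labeled "Buy", "Sell",
# or "Hold" from the forward return over a fixed horizon. Thresholds and the
# horizon length are illustrative, not the production settings.
import pandas as pd

def label_prices(close: pd.Series, horizon: int = 24, up: float = 0.02, down: float = -0.02) -> pd.Series:
    future_return = close.shift(-horizon) / close - 1.0  # return over the look-ahead window
    labels = pd.Series("Hold", index=close.index)
    labels[future_return >= up] = "Buy"
    labels[future_return <= down] = "Sell"
    labels[future_return.isna()] = None  # tail of the series has no future price yet
    return labels
```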

Tools and Technologies

Programming Languages:

• Python: Primary language for data processing.

Data Processing Libraries:

• Pandas: For table and time series manipulation.

• NumPy: For high-performance computations.

API Clients:

• CCXT: Unified interface for various cryptocurrency exchanges.

Data Storage:

• MongoDB: For storing unstructured data.

• PostgreSQL: For relational data.

Stage 2. Feature Extraction

Fundamental Analysis

Data Collection:

Project Information:

  • Roadmaps

  • Development team details

  • Partnerships

Development Activity:

  • Number of commits

  • Open pull requests on GitHub

Events and News:

  • Update announcements

  • Listings on exchanges

Feature Extraction:

Sentiment Analysis of News:

  • Using NLP models to determine article tone

Metrics:

  • Market capitalization

  • Trading volumes

  • Circulating supply

Technical Analysis

Indicators:

Moving Averages (SMA, EMA):

  • Various periods (5, 10, 20, 50, 100, 200 days)

Relative Strength Index (RSI):

  • Identifying overbought or oversold conditions in assets

Moving Average Convergence Divergence (MACD):

  • Determining trend changes

Bollinger Bands:

  • Identifying support and resistance levels

Momentum:

  • Assessing price change speed

On-Balance Volume (OBV):

  • Analyzing market movement direction

Chaikin Money Flow (CMF):

  • Predicting price movements based on cash flow

Volume Weighted MACD (VWMACD):

  • Tracking trend changes while accounting for trading volume

Awesome Oscillator (AO):

  • Gauging bullish versus bearish market momentum

Process:

Calculation Tools:

  • Using TA-Lib, pandas-ta libraries for calculating indicators

  • Employing functions from statsmodels library for additional analytical tasks

  • mplfinance for creating price profiles and technical analysis charts

  • ta-lib-python for accessing a wide range of technical indicators via API
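
For illustration, a short sketch of indicator calculation with TA-Lib follows; it assumes a DataFrame with open/high/low/close/volume columns from the data collection stage, and the chosen periods are conventional defaults rather than the production values.

```python
# Sketch of indicator calculation with TA-Lib, assuming a DataFrame `df`
# with open/high/low/close/volume columns produced at the data-collection stage.
import pandas as pd
import talib

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    close = df["close"].to_numpy(dtype=float)
    df["sma_50"] = talib.SMA(close, timeperiod=50)
    df["rsi_14"] = talib.RSI(close, timeperiod=14)
    macd, macd_signal, macd_hist = talib.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)
    df["macd"], df["macd_signal"] = macd, macd_signal
    upper, middle, lower = talib.BBANDS(close, timeperiod=20)
    df["bb_upper"], df["bb_lower"] = upper, lower
    df["obv"] = talib.OBV(close, df["volume"].to_numpy(dtype=float))
    return df
```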

Cluster Analysis

Order Book Data:

Buy and Sell Orders:

  • Analysis of volume distribution across price levels

  • Identification of patterns in order behavior (convergent or divergent orders)

Methods:

Clustering:

  • k-Means: Grouping orders based on price, time, and volume

  • DBSCAN: Identifying clusters of large orders or anomalies

  • Hierarchical clustering: Building hierarchical cluster structures to detect relationships between different types of orders
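
A minimal sketch of the clustering step with scikit-learn is shown below; the synthetic order-book levels and the cluster parameters are illustrative placeholders.

```python
# Sketch of order-book clustering with scikit-learn: k-Means groups price levels
# by price and resting volume, DBSCAN flags dense pockets of unusually large orders.
# The synthetic order book below stands in for real exchange data.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
order_book = np.column_stack([
    rng.normal(27000, 50, 500),    # price levels
    rng.lognormal(0.0, 1.0, 500),  # order sizes
])

X = StandardScaler().fit_transform(order_book)
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)  # -1 marks outliers / anomalies
```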

Goals:

Determining Support and Resistance Levels:

  • Using clusters to identify stable price levels

  • Analyzing order behavior around these levels

Identifying Large Trader Activity:

  • Recognizing patterns characteristic of large traders

  • Analyzing temporal dependencies between different types of orders to identify strategies of large traders

  • Using machine learning to predict the probability that a specific order belongs to a large trader

Additional Methods:

Network Analysis:

  • Building graphs of relationships between different types of orders to detect complex patterns of activity

Anomaly Detection:

  • Using Isolation Forest or One-Class SVM to identify unusual scenarios in order data

Time Series Analysis:

  • Applying Prophet for predicting trading volume behavior based on historical data
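
As an illustration of the anomaly detection step, the sketch below applies scikit-learn's Isolation Forest to a placeholder feature matrix; the contamination rate is an assumed example value.

```python
# Sketch of anomaly detection on order features with Isolation Forest.
# The contamination rate and feature matrix are illustrative placeholders.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
features = rng.normal(size=(1000, 3))   # e.g. price offset, size, inter-arrival time
features[:10] *= 8                      # a few injected outliers

detector = IsolationForest(contamination=0.01, random_state=1)
flags = detector.fit_predict(features)  # -1 = anomalous order, 1 = normal
anomalies = np.where(flags == -1)[0]
```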

Mathematical Analysis

Statistical Models:

Autocorrelation:

  • Analysis of dependence of current values on past values

Correlation Coefficients:

  • Relationship between individual assets and the overall market

Volatility Models:

• GARCH (Generalized Autoregressive Conditional Heteroskedasticity):

  • Forecasting future volatility based on historical data
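
A minimal GARCH(1,1) sketch is shown below using the `arch` Python package; the choice of library and the simulated return series are assumptions for illustration, since the document names only the model.

```python
# Sketch of GARCH(1,1) volatility forecasting using the `arch` package
# (the specific library is an assumption; the document only names the model).
import numpy as np
import pandas as pd
from arch import arch_model

rng = np.random.default_rng(2)
returns = pd.Series(rng.normal(0, 1.5, 1000))  # placeholder for percentage returns

model = arch_model(returns, vol="GARCH", p=1, q=1, mean="Constant")
fitted = model.fit(disp="off")
forecast = fitted.forecast(horizon=5)          # variance forecast for the next 5 steps
print(forecast.variance.iloc[-1])
```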

Candlestick Analysis

Japanese Candlestick Patterns:

• Popular reversal patterns

• Popular continuation patterns

Recognition Algorithms:

• Application of libraries:

  • TA-Lib and candle-patterns for automatic pattern detection

Outcomes:

• Signals indicating potential trend change or continuation
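
For illustration, the sketch below runs two of TA-Lib's candlestick pattern detectors; Engulfing and Hammer are example patterns, and any other CDL* function is called the same way.

```python
# Sketch of automatic candlestick-pattern detection with TA-Lib's CDL* functions.
# Engulfing and Hammer are shown as examples of reversal patterns.
import pandas as pd
import talib

def detect_patterns(df: pd.DataFrame) -> pd.DataFrame:
    o, h, l, c = (df[col].to_numpy(dtype=float) for col in ("open", "high", "low", "close"))
    df["engulfing"] = talib.CDLENGULFING(o, h, l, c)  # +100 bullish, -100 bearish, 0 none
    df["hammer"] = talib.CDLHAMMER(o, h, l, c)
    return df
```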

Tools and Libraries:

• TA-Lib: Library for technical analysis

• Scikit-learn: For clustering and other machine learning models

• Statsmodels: For statistical analysis and volatility modeling

• NLTK, SpaCy: For natural language processing when working with textual data

Stage 3. Feature Union and Normalization

Methodology of Union

Combining data from various sources:

  • Using time-based keys for synchronizing time series from different data sources

  • Joining by asset identifiers for correct mapping of data related to the same asset

  • Creating a single dataframe:

    • cuDF DataFrame is used as the basis for storing combined data

    • Alternative solutions are considered for large datasets and distributed computing:

      • Apache Spark for processing large datasets on a cluster

      • PySpark for working with data in a distributed environment

      • Apache Arrow for efficient data exchange between different systems
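
A minimal sketch of the time-keyed join is shown below in pandas for readability; the production pipeline uses a cuDF DataFrame with a largely compatible API, and the column names are illustrative.

```python
# Sketch of combining sources on time keys and asset identifiers.
# pandas is shown for clarity; cuDF exposes a largely compatible API.
# Column names are illustrative placeholders.
import pandas as pd

def combine(prices: pd.DataFrame, sentiment: pd.DataFrame) -> pd.DataFrame:
    """Attach the most recent sentiment reading (per asset) to each price bar."""
    prices = prices.sort_values("timestamp")
    sentiment = sentiment.sort_values("timestamp")
    return pd.merge_asof(
        prices, sentiment,
        on="timestamp", by="asset_id",  # time-based key + asset identifier
        direction="backward",           # only use information already known at that time
    )
```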

Data Normalization

Reasons for normalization:

• Different scales of features can negatively affect model training. Normalization ensures feature compatibility and improves training efficiency.

Methods of normalization:

Standardization (Z-score):

  • Formula: z = (x - μ) / σ

  • Implemented using cuDF functions for fast calculation of the mean and standard deviation across columns on the GPU

Scaling to range [0,1] (Min-Max):

  • Formula: x_scaled = (x - x_min) / (x_max - x_min)

  • Performed using cuDF functions to find minimum and maximum values and perform the subsequent data transformation
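
Both schemes reduce to simple column-wise expressions, sketched below; pandas syntax is shown, and cuDF exposes a largely compatible API for running the same operations on the GPU.

```python
# Sketch of both normalization schemes applied column-wise.
# pandas shown for clarity; cuDF offers a largely compatible API on GPU.
import pandas as pd

def zscore(df: pd.DataFrame) -> pd.DataFrame:
    return (df - df.mean()) / df.std()              # z = (x - mu) / sigma

def minmax(df: pd.DataFrame) -> pd.DataFrame:
    return (df - df.min()) / (df.max() - df.min())  # x_scaled in [0, 1]
```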

Handling Missing Values

Methods of filling:

Interpolation:

  • Linear or polynomial interpolation of time series using cuDF functions to fill missing values based on neighboring values

Constant filling:

  • Filling missing values with zero, mean, or median of the feature using cuDF methods

Removal:

  • Removing features or records with a large number of missing values if filling is not advisable
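
A short sketch of these filling strategies follows, written with pandas methods (cuDF offers similar ones); the 50% missing-data cutoff for dropping columns is an illustrative choice.

```python
# Sketch of the filling strategies above (pandas shown; cuDF offers similar methods).
import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    df = df.interpolate(method="linear", limit_direction="both")  # time-series interpolation
    df = df.fillna(df.median(numeric_only=True))                  # fall back to per-column medians
    return df.dropna(axis=1, thresh=int(0.5 * len(df)))           # drop columns that are mostly missing
```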

Correlation Checking

Correlation matrix:

  • Calculated using cuDF functions to identify strongly correlated features

  • For a wider range of statistical functions, cuML from RAPIDS is used

Feature Selection:

  • Excluding features with high correlation to reduce redundancy and prevent overfitting

  • Using methods such as Variance Inflation Factor (VIF) for quantitative assessment of multicollinearity between features
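
A minimal VIF sketch with statsmodels is shown below; the cutoff used to drop a feature (commonly around 10) is a judgment call rather than a fixed rule.

```python
# Sketch of multicollinearity screening with the Variance Inflation Factor
# from statsmodels; features with a VIF above a chosen cutoff (e.g. 10) are dropped.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(features: pd.DataFrame) -> pd.DataFrame:
    X = features.dropna().to_numpy(dtype=float)
    return pd.DataFrame({
        "feature": features.columns,
        "vif": [variance_inflation_factor(X, i) for i in range(X.shape[1])],
    })
```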

Stage 4. Neural Network Data Processing

Neural Network Architecture

Modular Structure:

  • Separate inputs and layers for different types of data

  • Connecting outputs of various modules before feeding into common layers

Layer Combination:

  • Joining outputs of different modules before inputting into common layers

Convolutional Layers (CNN)

Purpose:

• Extracting local patterns:

  • Detecting complex dependencies in time series

Basic Configuration:

• Number of layers:

  • 3 convolutional layers, each followed by a pooling layer

• Layer Parameters:

  • Filters: 64, 128, 256

  • Kernel size: 3, 5

  • Activation: ReLU

Recurrent Layers (LSTM)

Purpose:

• Accounting for temporal dependencies:

  • Remembering past states for future predictions

Configuration:

• Number of layers:

  • 2 LSTM layers

• Layer Parameters:

  • Memory units: 256, 512

  • Dropout: 0.2 - 0.3

Fully Connected Layers and Output Layer

Fully Connected Layers

• Feature aggregation:

  • Transforming data into a form suitable for classification or regression

• Parameters:

  • Number of neurons: 128, 64

  • Activation: ReLU

Output Layer

• Classification:

  • Softmax activation

• Regression:

  • Linear activation

Regularization and Optimization

Regularization

• Dropout:

  • Disabling a portion of neurons during training to prevent overfitting

• L1/L2 regularization:

  • Adding penalties for weight coefficients

Optimization

• Optimizers:

  • Adam

  • Learning Rate Scheduler

• Loss function:

  • Categorical cross-entropy

  • MSE (Mean Squared Error)
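
A compact sketch of the stack described in this stage is shown below. TensorFlow/Keras is an assumption, as the document specifies layer types and sizes but not a framework, and the input shape and learning rate are illustrative placeholders.

```python
# A minimal Keras sketch of the CNN -> LSTM -> dense stack described above.
# TensorFlow/Keras is an assumption (the document does not name a framework),
# and the input shape is an illustrative placeholder.
from tensorflow.keras import layers, models, optimizers

def build_model(timesteps: int = 128, n_features: int = 32, n_classes: int = 3):
    inputs = layers.Input(shape=(timesteps, n_features))
    x = inputs
    for filters, kernel in [(64, 3), (128, 3), (256, 5)]:        # 3 conv blocks with pooling
        x = layers.Conv1D(filters, kernel, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.LSTM(256, return_sequences=True, dropout=0.2)(x)  # 2 LSTM layers
    x = layers.LSTM(512, dropout=0.3)(x)
    x = layers.Dense(128, activation="relu")(x)                  # fully connected layers
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)   # classification head
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```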

Stage 5. Result Generation

Predictions and Interpretation

Probability Generation:

  • Obtaining probabilities for each class ("Buy", "Hold", "Sell")

Result Interpretation:

  • Threshold settings:

    • Establishing thresholds for decision-making

  • Interpretation Methods:

    • SHAP (SHapley Additive exPlanations):

      • Evaluating the contribution of each feature to the prediction

    • LIME (Local Interpretable Model-agnostic Explanations):

      • Explaining predictions for specific instances
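
A minimal sketch of threshold-based decision-making on the softmax output follows; the class ordering and the 0.6 confidence cutoff are illustrative assumptions.

```python
# Sketch of threshold-based decision-making on the softmax output.
# The class order and the confidence cutoff are illustrative assumptions.
import numpy as np

CLASSES = ["Buy", "Hold", "Sell"]

def to_signal(probabilities: np.ndarray, threshold: float = 0.6) -> str:
    """Return the predicted action, falling back to 'Hold' below the confidence threshold."""
    best = int(np.argmax(probabilities))
    return CLASSES[best] if probabilities[best] >= threshold else "Hold"

print(to_signal(np.array([0.72, 0.18, 0.10])))  # -> "Buy"
```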

Recommendation Formation

Business Logic:

  • Determining actions based on predictions and risk management

Reporting:

  • Creating reports explaining reasons for recommendations

  • Including visualizations and charts

Visualization of Results

Graphs:

  • Price charts with overlaid signals

  • Heat maps of feature importance

Interactive Panels:

  • Using Dash, Streamlit for creating web applications

Stage 6. Model Training and Adaptation

Training Process

Data Split:

  • Training set: 70%

  • Validation set: 15%

  • Test set: 15%

Cross-validation:

  • K-Fold Cross-Validation for assessing model stability

Evaluation Metrics:

  • Accuracy, Precision, Recall, F1-score

  • For multi-class classification problems, macro- and weighted-averaged versions of these metrics are used
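
A short sketch of the 70/15/15 split and the macro/weighted metrics with scikit-learn follows; the `model` object and its fit/predict interface are placeholders for the trained network.

```python
# Sketch of the 70/15/15 split and of macro/weighted metrics with scikit-learn.
# `features`, `labels`, and `model` are placeholders for the assembled dataset
# and the trained network; shuffle is disabled to respect the time ordering.
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate(features, labels, model):
    X_train, X_rest, y_train, y_rest = train_test_split(features, labels, test_size=0.30, shuffle=False)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, shuffle=False)
    model.fit(X_train, y_train)        # validation split reserved for tuning
    y_pred = model.predict(X_test)
    return {
        "precision_macro": precision_score(y_test, y_pred, average="macro"),
        "recall_macro": recall_score(y_test, y_pred, average="macro"),
        "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
    }
```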

Application of Data Augmentation Methods

Purpose:

  • Increasing data volume and improving model generalization ability

Methods:

  • Noise transformations:

    • Adding small Gaussian noise

  • Temporal distortions:

    • Stretching or compressing time series
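
Both augmentation methods reduce to a few lines of NumPy, sketched below; the noise scale and warp factor are illustrative parameters.

```python
# Sketch of the two augmentation methods: additive Gaussian noise and
# time-axis stretching/compression via resampling. Parameters are illustrative.
import numpy as np

def add_gaussian_noise(series: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    return series + np.random.normal(0.0, sigma * series.std(), size=series.shape)

def time_warp(series: np.ndarray, factor: float = 1.1) -> np.ndarray:
    """Stretch (factor > 1) or compress (factor < 1) a series, then resample to its original length."""
    warped_len = int(len(series) * factor)
    warped = np.interp(np.linspace(0, len(series) - 1, warped_len), np.arange(len(series)), series)
    return np.interp(np.linspace(0, warped_len - 1, len(series)), np.arange(warped_len), warped)
```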

Model Update

Regular Retraining:

  • Periodic updating of the model with new data

Performance Monitoring:

  • Tracking metrics on new data

  • Alerting for declining model quality

Hyperparameter Adaptation:

  • Random Search, Bayesian Optimization
