
Core model FAQ
Our system is designed to predict price movements of financial assets on crypto and stock markets based on comprehensive data analysis.
We combine fundamental, technical, cluster, mathematical, and candlestick analysis to obtain the most accurate and reliable forecasts. In this documentation, we will examine in detail each stage of the system's operation and the technologies and methodologies used.
Stage 1. Data Collection and Preparation
Data Sources
Our system utilizes various data sources to ensure completeness and timeliness of information:
Exchange APIs:
CCXT API: Provides real-time data on prices, trading volumes, and order books.
Native exchange APIs.
News Aggregators:
CryptoPanic: Collecting news and events affecting the market.
CoinDesk: Market analysis articles and reviews.
Medium: Project analysis articles and news.
Social Media:
Twitter: Monitoring tweets of key figures and overall community sentiment.
Reddit (r/CryptoCurrency): Analyzing discussions, trends, and social sentiments.
Data Collection Methods
API Requests:
Using RESTful APIs for regular data retrieval.
Setting up WebSocket connections for real-time data reception.
Web Scraping:
Parsing web pages for additional information and project analysis (while adhering to legal conditions and website rules).
Scheduling Tools:
Cron, Anacron, Celery for automating data collection processes at scheduled times.
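As a minimal sketch of the REST-based collection described above (assuming the CCXT library with Binance as an illustrative exchange; the symbol, timeframe, and limit are placeholders), a scheduler such as cron or Celery would simply call this function at the configured interval:

```python
import ccxt
import pandas as pd

def fetch_ohlcv(symbol="BTC/USDT", timeframe="1h", limit=500):
    """Fetch recent OHLCV candles through the unified CCXT interface."""
    exchange = ccxt.binance()  # illustrative exchange; any CCXT-supported exchange works
    candles = exchange.fetch_ohlcv(symbol, timeframe=timeframe, limit=limit)
    df = pd.DataFrame(candles, columns=["timestamp", "open", "high", "low", "close", "volume"])
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    return df
```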
Preliminary Data Processing
Data Cleaning
Duplicate Removal:
Using unique identifiers and timestamps to filter out duplicate records.
Handling Missing Values:
Filling missing values with means or medians.
Time series interpolation.
Removing records with critical missing data.
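A minimal sketch of these cleaning steps using Pandas (the column names timestamp, close, and volume are illustrative):

```python
import pandas as pd

def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    # Remove duplicates using the timestamp as the unique key
    df = df.drop_duplicates(subset="timestamp").sort_values("timestamp")
    # Interpolate short gaps in the price series
    df["close"] = df["close"].interpolate(method="linear")
    # Fill remaining gaps with the median, then drop rows still missing critical fields
    df["volume"] = df["volume"].fillna(df["volume"].median())
    return df.dropna(subset=["close", "volume"])
```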
Data Transformation
Price Data Scaling:
Min-Max normalization:
Scaling data to the range [0,1].
Z-Score normalization:
Converting data to standardized form with mean 0 and standard deviation 1.
Feature Engineering
Creating new features by calculating technical indicators from the original dataset:
RSI
MACD
Bollinger Bands
Momentum
OBV
CMF
VWMACD
MFI
AO
Fibonacci levels
Social Metrics:
Analyzing social media mention counts and performing sentiment analysis.
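A minimal sketch of computing a few of these indicator features with TA-Lib (the periods 14, 12/26/9, and 20 are conventional defaults used for illustration):

```python
import talib

def add_indicators(df):
    close = df["close"].to_numpy(dtype=float)
    volume = df["volume"].to_numpy(dtype=float)
    df["rsi_14"] = talib.RSI(close, timeperiod=14)
    macd, macd_signal, _ = talib.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)
    df["macd"], df["macd_signal"] = macd, macd_signal
    upper, _, lower = talib.BBANDS(close, timeperiod=20)
    df["bb_upper"], df["bb_lower"] = upper, lower
    df["obv"] = talib.OBV(close, volume)
    df["mfi"] = talib.MFI(df["high"].to_numpy(dtype=float), df["low"].to_numpy(dtype=float),
                          close, volume, timeperiod=14)
    return df
```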
Data Labeling
Historical Example Classification
"Buy":
If after a certain period, the asset price increased by a specified percentage.
"Hold":
If price changes were insignificant.
"Sell":
If the price decreased by a specified percentage.
Window-based Labeling
Applying a window of a specific size for sequential labeling of time series.
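A minimal sketch of this labeling rule (the 24-bar horizon and the 2% threshold are illustrative values):

```python
import numpy as np
import pandas as pd

def label_examples(close: pd.Series, horizon: int = 24, threshold: float = 0.02) -> pd.Series:
    """Label each bar "Buy"/"Sell"/"Hold" from its forward return over `horizon` bars."""
    future_return = close.shift(-horizon) / close - 1.0
    labels = pd.Series("Hold", index=close.index)
    labels[future_return >= threshold] = "Buy"
    labels[future_return <= -threshold] = "Sell"
    labels[future_return.isna()] = np.nan  # tail of the series has no forward window yet
    return labels
```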
Tools and Technologies
Programming Languages:
• Python: Primary language for data processing.
Data Processing Libraries:
• Pandas: For table and time series manipulation.
• NumPy: For high-performance computations.
API Clients:
• CCXT: Unified interface for various cryptocurrency exchanges.
Data Storage:
• MongoDB: For storing unstructured data.
• PostgreSQL: For relational data.
Fundamental Analysis
Data Collection:
Project Information:
Roadmaps
Development team details
Partnerships
Development Activity:
Number of commits
Open pull requests on GitHub
Events and News:
Update announcements
Listings on exchanges
Feature Extraction:
Sentiment Analysis of News:
Using NLP models to determine article tone
Metrics:
Market capitalization
Trading volumes
Circulating supply
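One simple way to score news tone is NLTK's VADER analyzer, sketched below (the headline is made up; heavier NLP models can be substituted in production):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

headline = "Exchange announces listing of the token after a major partnership"
score = sia.polarity_scores(headline)["compound"]  # compound score in [-1, 1]
sentiment = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
```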
Technical Analysis
Indicators:
Moving Averages (SMA, EMA):
Various periods (5, 10, 20, 50, 100, 200 days)
Relative Strength Index (RSI):
Identifying overbought or oversold conditions in assets
Moving Average Convergence Divergence (MACD):
Determining trend changes
Bollinger Bands:
Identifying support and resistance levels
Momentum:
Assessing price change speed
On-Balance Volume (OBV):
Analyzing market movement direction
Chaikin Money Flow (CMF):
Predicting price movements based on money flow
Volume Weighted MACD (VWMACD):
Trend confirmation that takes trading volume into account
Awesome Oscillator (AO):
Analyzing rising and falling price momentum
Process:
Calculation Tools:
Using TA-Lib, pandas-ta libraries for calculating indicators
Employing functions from statsmodels library for additional analytical tasks
mplfinance for creating price profiles and technical analysis charts
ta-lib-python for accessing a wide range of technical indicators via API
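As a small example of the charting step with mplfinance (the moving-average periods and style are illustrative; the DataFrame needs a DatetimeIndex and Open/High/Low/Close/Volume columns):

```python
import mplfinance as mpf

# Candlestick chart with 20- and 50-period moving averages and a volume panel
mpf.plot(df, type="candle", mav=(20, 50), volume=True, style="charles")
```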
Cluster Analysis
Order Book Data:
Buy and Sell Orders:
Analysis of volume distribution across price levels
Identification of patterns in order behavior (convergent or divergent orders)
Methods:
Clustering:
k-Means: Grouping orders based on price, time, and volume
DBSCAN: Identifying clusters of large orders or anomalies
Hierarchical clustering: Building hierarchical cluster structures to detect relationships between different types of orders
Goals:
Determining Support and Resistance Levels:
Using clusters to identify stable price levels
Analyzing order behavior around these levels
Identifying Large Trader Activity:
Recognizing patterns characteristic of large traders
Analyzing temporal dependencies between different types of orders to identify strategies of large traders
Using machine learning to predict the probability that a specific order belongs to a large trader
Additional Methods:
Network Analysis:
Building graphs of relationships between different types of orders to detect complex patterns of activity
Anomaly Detection:
Using Isolation Forest or One-Class SVM to identify unusual scenarios in order data
Time Series Analysis:
Applying Prophet for predicting trading volume behavior based on historical data
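A minimal sketch of clustering order-book data with Scikit-learn (the features, cluster count, and DBSCAN parameters are illustrative, and the orders here are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

# One row per order: price, volume, and time spent in the book (synthetic data)
rng = np.random.default_rng(0)
orders = np.column_stack([
    rng.normal(27000, 150, 1000),   # price
    rng.lognormal(0, 1, 1000),      # volume
    rng.exponential(60, 1000),      # seconds in the book
])
X = StandardScaler().fit_transform(orders)

# k-Means groups orders into broad price/volume clusters (candidate support/resistance zones)
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# DBSCAN finds dense pockets of orders and marks outliers (label -1) as anomalies
dbscan_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
```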
Mathematical Analysis
Statistical Models:
Autocorrelation:
Analysis of dependence of current values on past values
Correlation Coefficients:
Relationship between individual assets and the overall market
Volatility Models:
• GARCH (Generalized Autoregressive Conditional Heteroskedasticity):
Forecasting future volatility based on historical data
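A minimal sketch of these models, using statsmodels for autocorrelation and, as an assumption, the arch package for GARCH(1,1); close_prices is a placeholder array of historical prices:

```python
import numpy as np
from statsmodels.tsa.stattools import acf
from arch import arch_model  # assumption: the `arch` package is used for GARCH

returns = 100 * np.diff(np.log(close_prices))  # percentage log returns

# Autocorrelation: dependence of current returns on past returns
autocorr = acf(returns, nlags=20)

# GARCH(1,1): forecast conditional volatility from historical returns
fitted = arch_model(returns, vol="Garch", p=1, q=1).fit(disp="off")
vol_forecast = fitted.forecast(horizon=5)
```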
Candlestick Analysis
Japanese Candlestick Patterns:
• Popular reversal patterns
• Popular continuation patterns
Recognition Algorithms:
• Application of libraries:
• TA-Lib, candle-patterns for automatic pattern detection
Outcomes:
• Signals indicating potential trend change or continuation
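A minimal sketch of pattern detection with TA-Lib (the three patterns are illustrative examples; open_, high, low, and close are placeholder NumPy arrays):

```python
import talib

# TA-Lib pattern functions return 0 (no pattern), +100 (bullish), or -100 (bearish) per bar
engulfing = talib.CDLENGULFING(open_, high, low, close)            # reversal pattern
hammer = talib.CDLHAMMER(open_, high, low, close)                  # reversal pattern
three_soldiers = talib.CDL3WHITESOLDIERS(open_, high, low, close)  # bullish pattern

bullish_signal = (engulfing > 0) | (hammer > 0) | (three_soldiers > 0)
```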
Tools and Libraries:
• TA-Lib: Library for technical analysis
• Scikit-learn: For clustering and other machine learning models
• Statsmodels: For statistical analysis and volatility modeling
• NLTK, SpaCy: For natural language processing when working with textual data
Data Merging Methodology
Combining data from various sources:
Using time-based keys for synchronizing time series from different data sources
Joining by asset identifiers for correct mapping of data related to the same asset
Creating a single dataframe:
cuDF DataFrame is used as the basis for storing combined data
Alternative solutions are considered for large datasets and distributed computing:
Apache Spark for processing large datasets on a cluster
PySpark for working with data in a distributed environment
Apache Arrow for efficient data exchange between different systems
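A minimal sketch of the merge step with cuDF (prices, indicators, and sentiment are placeholder cuDF DataFrames that share timestamp and asset columns):

```python
import cudf

merged = prices.merge(indicators, on=["timestamp", "asset"], how="left")
merged = merged.merge(sentiment, on=["timestamp", "asset"], how="left")
merged = merged.sort_values(["asset", "timestamp"]).reset_index(drop=True)
```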
Data Normalization
Reasons for normalization:
• Different scales of features can negatively affect model training. Normalization ensures feature compatibility and improves training efficiency.
Methods of normalization:
o Standardization (Z-score):
Formula: z = (x - μ) / σ
Implemented using cuDF functions for fast calculation of the mean and standard deviation across columns on the GPU
o Scaling to the range [0,1]:
Formula: x_scaled = (x - x_min) / (x_max - x_min)
Performed using cuDF functions to find the minimum and maximum values and transform the data
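A minimal sketch of both normalization methods on a cuDF DataFrame:

```python
import cudf

def normalize(df: cudf.DataFrame, columns):
    for col in columns:
        # Standardization: z = (x - μ) / σ, computed column-wise on the GPU
        df[f"{col}_z"] = (df[col] - df[col].mean()) / df[col].std()
        # Min-max scaling to [0, 1]
        col_min, col_max = df[col].min(), df[col].max()
        df[f"{col}_scaled"] = (df[col] - col_min) / (col_max - col_min)
    return df
```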
Handling Missing Values
Methods of filling:
o Interpolation:
Linear or polynomial interpolation of time series using cuDF functions to fill missing values based on neighboring values
o Constant filling:
Filling missing values with zero, mean, or median of the feature using cuDF methods
Removing features or records with a large number of missing values if filling is not advisable
Correlation Checking
Correlation matrix:
Calculated using cuDF functions to identify strongly correlated features
For a wider range of statistical functions, cuML from RAPIDS is used
• Feature Selection:
Excluding features with high correlation to reduce redundancy and prevent overfitting
Using methods such as Variance Inflation Factor (VIF) for quantitative assessment of multicollinearity between features
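A minimal sketch of the correlation check and VIF calculation (features is a placeholder DataFrame; statsmodels expects CPU data, so a cuDF frame is converted with to_pandas first):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

features_pd = features.to_pandas() if hasattr(features, "to_pandas") else features
corr_matrix = features_pd.corr()  # pairwise correlations between features

# VIF per feature: values above roughly 5-10 indicate problematic multicollinearity
vif = pd.Series(
    [variance_inflation_factor(features_pd.values, i) for i in range(features_pd.shape[1])],
    index=features_pd.columns,
)
redundant = vif[vif > 10].index.tolist()
```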
Neural Network Architecture
Modular Structure:
Separate inputs and layers for different types of data
Layer Combination:
Connecting outputs of the individual modules before feeding them into common layers
Convolutional Layers (CNN)
Purpose:
• Extracting local patterns:
Detecting complex dependencies in time series
Basic Configuration:
• Number of layers:
3 convolutional layers followed by pooling layers
• Layer Parameters:
Filters: 64, 128, 256
Kernel size: 3, 5
Activation: ReLU
Recurrent Layers (LSTM)
Purpose:
• Accounting for temporal dependencies:
Remembering past states for future predictions
Configuration:
• Number of layers:
2 LSTM layers
• Layer Parameters:
Memory units: 256, 512
Dropout: 0.2 - 0.3
Fully Connected Layers and Output Layer
Fully Connected Layers
• Feature aggregation:
Transforming data into a form suitable for classification or regression
• Parameters:
Number of neurons: 128, 64
Activation: ReLU
Output Layer
• Classification:
Activation: Softmax
• Regression:
Activation: Linear
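The framework is not fixed by this description; the sketch below assumes TensorFlow/Keras and wires the configuration listed above (three Conv1D layers with 64/128/256 filters, two LSTM layers with 256/512 units, dense layers of 128/64 neurons, and a softmax output) into one modular model with a separate input for aggregated text/social features:

```python
from tensorflow.keras import layers, models

def build_model(seq_len, n_price_features, n_text_features, n_classes=3):
    # Convolutional branch: local patterns in the price/indicator sequences
    price_in = layers.Input(shape=(seq_len, n_price_features))
    x = price_in
    for filters in (64, 128, 256):
        x = layers.Conv1D(filters, kernel_size=3, activation="relu", padding="same")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)

    # Recurrent branch: temporal dependencies across the sequence
    x = layers.LSTM(256, return_sequences=True, dropout=0.2)(x)
    x = layers.LSTM(512, dropout=0.3)(x)

    # Separate module for aggregated text/social features
    text_in = layers.Input(shape=(n_text_features,))
    t = layers.Dense(64, activation="relu")(text_in)

    # Combine module outputs before the common fully connected layers
    h = layers.concatenate([x, t])
    h = layers.Dense(128, activation="relu")(h)
    h = layers.Dense(64, activation="relu")(h)

    # Classification head: probabilities for "Buy", "Hold", "Sell"
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(inputs=[price_in, text_in], outputs=out)
```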
Regularization and Optimization
Regularization
• Dropout:
Disabling a portion of neurons during training to prevent overfitting
• L1/L2 regularization:
Adding penalties for weight coefficients
Optimization
• Optimizers:
Adam
Learning Rate Scheduler
• Loss function:
Categorical cross-entropy
MSE (Mean Squared Error)
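A short continuation of the Keras sketch above showing how these choices come together at compile time (ReduceLROnPlateau is one possible scheduler; the learning rate and patience values are illustrative):

```python
from tensorflow.keras import optimizers, callbacks

model = build_model(seq_len=128, n_price_features=32, n_text_features=8)  # from the sketch above
model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",  # "mse" for the regression variant
    metrics=["accuracy"],
)

# Reduce the learning rate when the validation loss stops improving
lr_schedule = callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3)
```
Dropout is already built into the LSTM layers of the sketch; L1/L2 penalties can be attached to the dense layers via the kernel_regularizer argument.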
Predictions and Interpretation
Probability Generation:
Obtaining probabilities for each class ("Buy", "Hold", "Sell")
Result Interpretation:
Threshold settings:
Establishing thresholds for decision-making
Interpretation Methods:
SHAP (SHapley Additive exPlanations):
Evaluating the contribution of each feature to the prediction
LIME (Local Interpretable Model-agnostic Explanations):
Explaining predictions for specific instances
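A minimal sketch of turning class probabilities into a decision with a confidence threshold (the 0.6 threshold is illustrative; price_batch and text_batch are placeholder model inputs):

```python
import numpy as np

CLASSES = ["Buy", "Hold", "Sell"]
CONFIDENCE_THRESHOLD = 0.6  # illustrative decision threshold

probs = model.predict([price_batch, text_batch])  # shape: (n_samples, 3)

def to_recommendation(p):
    idx = int(np.argmax(p))
    # Fall back to "Hold" when the model is not confident enough
    return CLASSES[idx] if p[idx] >= CONFIDENCE_THRESHOLD else "Hold"

recommendations = [to_recommendation(p) for p in probs]
```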
Recommendation Formation
Business Logic:
Determining actions based on predictions and risk management
• Reporting:
Creating reports explaining reasons for recommendations
Including visualizations and charts
Visualization of Results
Graphs:
Price charts with overlaid signals
Heat maps of feature importance
• Interactive Panels:
Using Dash, Streamlit for creating web applications
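A tiny Streamlit sketch of such a panel (the CSV file and its columns are hypothetical exports of the model's signals):

```python
import pandas as pd
import streamlit as st

st.title("Model signals")
df = pd.read_csv("signals.csv", parse_dates=["timestamp"])  # hypothetical export
st.line_chart(df.set_index("timestamp")["close"])
st.dataframe(df[df["signal"] != "Hold"][["timestamp", "signal", "confidence"]])
```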
Training Process
Data Split:
Training set: 70%
Validation set: 15%
Test set: 15%
Cross-validation:
K-Fold Cross-Validation for assessing model stability
Evaluation Metrics:
Accuracy, Precision, Recall, F1-score
For multi-class classification, macro- and weighted-averaged versions of these metrics are used
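A minimal sketch of the split, cross-validation, and metric reporting with Scikit-learn (X, y, and y_pred are placeholders for the feature matrix, labels, and model predictions):

```python
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import classification_report

# 70 / 15 / 15 split: hold out 30%, then split it evenly into validation and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# K-Fold cross-validation on the training portion to assess stability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# After training: per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_test, y_pred, target_names=["Buy", "Hold", "Sell"]))
```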
Application of Data Augmentation Methods
Purpose:
Increasing data volume and improving model generalization ability
Methods:
Noise transformations:
Adding small Gaussian noise
Temporal distortions:
Stretching or compressing time series
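A minimal sketch of both augmentation methods on a single 1-D window (the noise level and warp factor are illustrative):

```python
import numpy as np

def add_gaussian_noise(window: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Jitter each value with small Gaussian noise proportional to its magnitude."""
    return window * (1.0 + np.random.normal(0.0, sigma, size=window.shape))

def time_warp(window: np.ndarray, factor: float = 1.1) -> np.ndarray:
    """Stretch (factor > 1) or compress (factor < 1) a series, then resample to its original length."""
    original = np.arange(len(window))
    warped_axis = np.linspace(0, len(window) - 1, int(len(window) * factor))
    stretched = np.interp(warped_axis, original, window)
    return np.interp(original, np.linspace(0, len(window) - 1, len(stretched)), stretched)
```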
Model Update
Regular Retraining:
Periodic updating of the model with new data
Performance Monitoring:
Tracking metrics on new data
Alerting for declining model quality
Hyperparameter Adaptation:
Random Search, Bayesian Optimization
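One possible way to automate the random search, sketched with KerasTuner as an assumption (the input shape, hyperparameter choices, and trial budget are illustrative; X_train_seq and y_train_onehot are placeholders):

```python
import keras_tuner as kt  # assumption: KerasTuner is used for the search
from tensorflow.keras import layers, models

def build_tunable(hp):
    model = models.Sequential([
        layers.Input(shape=(128, 32)),  # illustrative (sequence length, feature count)
        layers.Conv1D(hp.Choice("filters", [64, 128]), 3, activation="relu", padding="same"),
        layers.LSTM(hp.Choice("lstm_units", [256, 512]),
                    dropout=hp.Float("dropout", 0.2, 0.3, step=0.05)),
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_tunable, objective="val_accuracy", max_trials=20, overwrite=True)
tuner.search(X_train_seq, y_train_onehot, validation_split=0.15, epochs=10)
```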