The volume of data generated globally keeps growing every year; by some estimates, around 2.5 quintillion bytes of data are created daily, driven by the rapid expansion of connected devices and digital transformation. High-load systems demand precise solutions to ensure stable performance under this volume of requests. Alexey Kish, a lead developer at Semrush, shares his expertise in managing high-load systems, from automating CI/CD pipelines and setting up metrics to optimizing databases and preventing data duplication.
Tell us about your key projects related to high-load systems.
Nearly all the projects I’ve worked on can be classified as high-load, but I’d highlight two of the most significant ones.
The first project was a domain categorization tool for a company's internal users. Its goal was to unify domain categories and distribute them across company products, creating a single trusted source. We collaborated with a data science team that developed a model for data processing, and our role as engineers was to implement it. The system had to handle up to 50,000 requests per second and support data for 200–300 million domains, with real-time updates.
The solution comprised two services: an API to handle requests and a service for running the model calculations. The database size was approximately 1 terabyte. Internal teams used this system for analytics, such as traffic categorization, advertising flow analysis, and client funnel optimization.
The second project, EyeOn, was a market monitoring tool developed as part of Semrush’s marketing suite. It allowed users to track market changes, analyze competitors, and receive alerts about key events, such as price changes on competitor websites. The system processed data from multiple sources: social media, ad platforms, website pages, and blogs.
The collected data was stored in PostgreSQL, and over three years the storage volume grew to 5 terabytes. Tables contained up to 10 billion rows, and we optimized queries so that even complex operations completed within 400 milliseconds. The system analyzed around 200,000 domains daily, collecting data on ads, new pages, social media, and blogs.
What key factors do you consider when designing the architecture for such systems?
I always start with analyzing business needs. It’s essential to understand the purpose of the system and ensure it meets current requirements while anticipating future growth. This includes forecasting loads: we estimate expected user numbers, request volumes, and other metrics, then triple those figures to account for uncertainty.
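A back-of-the-envelope version of that forecasting step might look like the sketch below; every number in it is a hypothetical assumption, and the final multiplication is the "triple the figures" rule mentioned above.

```python
# Rough capacity estimate. All traffic numbers are hypothetical; the 3x
# headroom factor is the "triple the figures" rule described above.

expected_users = 50_000            # assumed daily active users
requests_per_user_per_day = 200    # assumed average API calls per user
peak_to_average_ratio = 4          # assumed traffic spikiness
headroom_factor = 3                # triple the estimate for uncertainty

average_rps = expected_users * requests_per_user_per_day / 86_400  # seconds per day
peak_rps = average_rps * peak_to_average_ratio
design_rps = peak_rps * headroom_factor

print(f"average: {average_rps:.0f} rps, "
      f"peak: {peak_rps:.0f} rps, "
      f"design target: {design_rps:.0f} rps")
```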
A critical step is creating a user interaction model. For instance, in the EyeOn project, users interacted with projects and competitors, but data enrichment occurred on the system side. Once this model is established, we choose an architectural approach: monolith or microservices. If system components have vastly different load profiles, service-oriented architecture is the preferred choice.
Special attention is given to storage selection. For systems dealing with large datasets, it’s crucial to use scalable solutions capable of handling growth. A poorly thought-out storage choice can lead to complex and expensive migrations down the line.
The design approach also depends on context. In startups, we prioritized budget constraints, while at Semrush, we focused on delivery speed and stability, allowing us more flexibility in resource allocation without exceeding budget limits.
How do you handle data aggregation to ensure real-time information accuracy?
Solutions depend on the product’s specifics and its requirements. For instance, in messaging systems where speed is critical, data is duplicated across several regions. Messages are stored on the servers closest to the user to minimize latency, ensuring fast access even if users switch regions.
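Purely as an illustration (the region names and latencies below are made up), routing a user to the closest replica can start from something as simple as comparing measured round-trip times:

```python
# Toy example of region selection: send the user's traffic to the replica
# with the lowest measured latency. Regions and latency values are
# hypothetical, e.g. refreshed periodically from health probes.

REGION_LATENCY_MS = {
    "eu-west": 18.0,
    "us-east": 95.0,
    "ap-south": 210.0,
}

def pick_region(latencies: dict[str, float]) -> str:
    """Return the region with the lowest round-trip time."""
    return min(latencies, key=latencies.get)

print(pick_region(REGION_LATENCY_MS))   # -> "eu-west"
```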
In other cases, the approach may differ. For example, in the EyeOn project, data was aggregated daily. This eliminated the issue of incomplete datasets and avoided scenarios where users might encounter fragmented information.
For systems requiring real-time updates, such as trading platforms, we use synchronization time windows. Data from different vendors may arrive with delays ranging from milliseconds to half a second, so we standardize it within a shared window, say one second. This approach collects and processes all data, maintaining accuracy while minimizing latency for the user.
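A minimal sketch of that windowing idea, assuming timestamped quotes arriving from several vendors and a shared one-second alignment window:

```python
# Group vendor updates into fixed one-second windows so that data arriving
# with different delays lands on a common timeline. The tick format and
# vendor names are assumptions made for illustration.
from collections import defaultdict

WINDOW_MS = 1_000  # shared synchronization window, as in the example above

def window_key(timestamp_ms: int) -> int:
    """Map an event timestamp to the start of its one-second window."""
    return timestamp_ms - (timestamp_ms % WINDOW_MS)

def aggregate(ticks):
    """ticks: iterable of (vendor, timestamp_ms, price), in arrival order."""
    windows = defaultdict(dict)
    for vendor, ts, price in ticks:
        # Keep the last quote seen per vendor within each window.
        windows[window_key(ts)][vendor] = price
    return dict(windows)

ticks = [
    ("vendor_a", 1_700_000_000_120, 101.2),
    ("vendor_b", 1_700_000_000_480, 101.4),   # arrives ~360 ms later
    ("vendor_a", 1_700_000_001_050, 101.3),   # falls into the next window
]
print(aggregate(ticks))
```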
The key is finding an optimal balance between update speed and the accuracy of aggregated data, considering both system load and user expectations.
When do you use PostgreSQL and ClickHouse, and how do you decide between them?
The choice between PostgreSQL and ClickHouse depends on workload profiles, data requirements, and ease of administration. PostgreSQL is ideal for transactional tasks where atomicity and data consistency are critical. It’s a classic relational database operating on rows, well-suited for structured data, with scalability and automated maintenance available in managed cloud services. There it is also straightforward to deploy: a database can be provisioned in a few clicks, resources scale automatically, and routine maintenance is handled for you.
ClickHouse, on the other hand, is specialized for analytics. It’s a columnar database optimized for handling large data volumes and executing aggregation queries efficiently. It excels at analytical workloads due to its column-based storage but lacks transactional support and strict consistency. ClickHouse is ideal for scenarios requiring fast data processing and real-time aggregation. However, it’s more challenging to administer, especially when deploying a self-managed cluster.
Infrastructure also influences the decision. PostgreSQL is included in most cloud platforms (e.g., Google Cloud), making it accessible and cost-effective. ClickHouse can be deployed as a standalone cluster but requires administrative resources. Cloud-based ClickHouse solutions are typically more expensive.
In practice, the choice between PostgreSQL and ClickHouse comes down to a trade-off between ease of use and performance requirements. For instance, in one of our projects, we chose PostgreSQL for its simplicity and automation, even though ClickHouse could have handled analytical tasks more efficiently.
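To make that trade-off concrete, here is a sketch of the same daily aggregation expressed against both engines; the table, columns, connection settings, and client libraries (psycopg2 and clickhouse-driver) are assumptions for illustration, not necessarily what the project used.

```python
# The same analytical aggregation against PostgreSQL and ClickHouse.
# Table, column names and connection settings are hypothetical.
import psycopg2
from clickhouse_driver import Client

PG_QUERY = """
    SELECT domain, date_trunc('day', event_time) AS day, count(*) AS hits
    FROM page_events
    GROUP BY domain, day
    ORDER BY hits DESC
    LIMIT 100
"""

CH_QUERY = """
    SELECT domain, toDate(event_time) AS day, count() AS hits
    FROM page_events
    GROUP BY domain, day
    ORDER BY hits DESC
    LIMIT 100
"""

def top_domains_postgres(dsn: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(PG_QUERY)
        return cur.fetchall()

def top_domains_clickhouse(host: str):
    return Client(host=host).execute(CH_QUERY)
```

The results are equivalent; the practical difference is that the columnar engine reads only the columns involved in the query, which is what makes this kind of aggregation cheap on very large tables.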
How does CI/CD automation work in high-load projects, and what impact does it have on stability?
CI/CD automation ensures stability by minimizing disruption during updates. High-load systems are under constant pressure because users interact with them continuously, so any update must be seamless to avoid degrading the user experience.
One of the most effective tools for achieving this is Kubernetes. It enables horizontal scaling of applications, which is critical for high-load projects. If an application supports multiple instances, a new version can run in parallel with the current one: Kubernetes gradually shifts traffic to the updated version and removes the old one only after its active connections have drained. If the new version fails its health checks, the rollout stops and can be rolled back, so the previous version keeps serving traffic.
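On the application side this only works if instances shut down gracefully. A minimal sketch of that idea, with the stdlib HTTP server and port being illustrative assumptions rather than an actual stack: catch SIGTERM, stop accepting new requests, and let in-flight ones finish before exiting.

```python
# Graceful-shutdown sketch for rolling updates: on SIGTERM, stop the accept
# loop and let current requests finish before the process exits.
# The stdlib HTTP server and port are illustrative assumptions.
import signal
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

server = HTTPServer(("0.0.0.0", 8080), Handler)

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM when the pod is being replaced during a
    # rollout; shutdown() stops serve_forever() once in-flight work is done.
    threading.Thread(target=server.shutdown).start()

signal.signal(signal.SIGTERM, handle_sigterm)
server.serve_forever()   # returns once shutdown() has completed
```

Combined with readiness probes and a sufficient termination grace period, this is what lets the old version drain without dropping user requests.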
What methods and algorithms do you use to handle data duplication in systems?
Data duplication can be both beneficial and problematic, depending on the context. For example, it is used deliberately to accelerate access in messaging systems, where data is replicated across regions. In most other cases, though, duplicates risk compromising data integrity.
To minimize duplication, we start by defining clear data ownership zones, establishing a single trustworthy source. This is often a dedicated service responsible for storage, using traditional databases like PostgreSQL or MySQL, which ensure consistency and atomicity.
When duplication is unavoidable, as in analytical databases like ClickHouse, we filter data during processing. These databases are optimized for large datasets, allowing real-time deduplication.
If the issue arises during pre-aggregation, we employ big data sorting algorithms. This is particularly important for resource-limited tasks requiring large-scale data processing while retaining only unique values. This approach effectively addresses duplication, ensuring both accuracy and system performance.
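A minimal sketch of that sort-based approach, assuming records arrive as (key, payload) pairs that have already been split into key-sorted runs (for example by an external sort):

```python
# Sort-based deduplication: merge pre-sorted runs lazily and keep a single
# record per key. Memory use stays bounded by the number of runs, which is
# the point when the data does not fit in RAM. The (key, payload) record
# format is an assumption made for illustration.
import heapq
from itertools import groupby
from operator import itemgetter

def dedupe_sorted_runs(*runs):
    """Each run is an iterable of (key, payload) tuples sorted by key."""
    merged = heapq.merge(*runs, key=itemgetter(0))
    for _key, group in groupby(merged, key=itemgetter(0)):
        yield next(group)   # keep the first record seen for each key

run_a = [("a.com", 1), ("c.com", 3)]
run_b = [("a.com", 9), ("b.com", 2), ("c.com", 4)]
print(list(dedupe_sorted_runs(run_a, run_b)))
# -> [('a.com', 1), ('b.com', 2), ('c.com', 3)]
```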
How do you set up metrics and logs for monitoring high-load services?
Metrics and logs are the foundation of stable system operation, especially in high-load environments. Without them, it’s impossible to understand system behavior or detect problems promptly.
- Standard Baseline Metrics. Start with basic metrics (e.g., RED metrics): latency, request rate, and error rate. These provide an immediate overview of system health. Latency is typically measured using histograms to group requests by key characteristics.
- Structured and Clear Logs. Logging should be structured so that a single log entry clearly indicates what and where something went wrong. Neglecting proper logging often makes troubleshooting failures much harder.
- Tracing and Request Correlation IDs. Tracing enables tracking the full lifecycle of a request through the system: identifying which service triggered another, how many requests were generated, and how they were processed. Correlation IDs are logged for each request, allowing the complete chain of interactions to be reconstructed easily (a minimal sketch of these three points follows this list).
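Here is that sketch, using the prometheus_client library for the RED metrics and the standard logging module for structured entries carrying a correlation ID; metric names, labels, and the JSON log format are illustrative assumptions, not a production setup.

```python
# RED metrics plus structured logs that carry a correlation ID.
# Metric names, labels and the log format are illustrative assumptions.
import json
import logging
import time
import uuid

from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Request count", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["route"])

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("api")

def handle(route: str, func):
    # In a real service the correlation ID would be taken from the incoming
    # request headers so the whole chain of calls shares one ID.
    correlation_id = str(uuid.uuid4())
    start = time.perf_counter()
    status = "200"
    try:
        return func()
    except Exception:
        status = "500"
        raise
    finally:
        LATENCY.labels(route=route).observe(time.perf_counter() - start)
        REQUESTS.labels(route=route, status=status).inc()
        logger.info(json.dumps({
            "correlation_id": correlation_id,
            "route": route,
            "status": status,
        }))

handle("/categories", lambda: "ok")
```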
What advice can you give on database optimization for better performance?
The cornerstone of database optimization is selecting the right storage solution and using it appropriately. The market offers a vast array of databases, each tailored to specific needs: transactional, analytical, document-based NoSQL, caching services, or hybrid options like Tarantool.
Performance optimization requires understanding the application’s load profile and leveraging the strengths of the chosen database. Instead of trying to “tune” the database with additional modules or code modifications, focus on two areas:
- Query Optimization. Write SQL that interacts efficiently with the database.
- Data Organization. Structure data to align with the database’s characteristics.
For example, working with PostgreSQL, we managed a table with 10 billion rows, performing grouping and filtering operations at high speed. This was achieved through proper data organization and leveraging PostgreSQL’s built-in features without modifying the database itself.
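As an illustration of what such data organization can look like in PostgreSQL (the table, partitioning scheme, and index below are hypothetical, not the actual schema), range partitioning by time plus an index matched to the dominant filter and group-by pattern is often what keeps queries over billions of rows fast:

```python
# Hypothetical DDL: time-based range partitioning and a query-matched index,
# applied via psycopg2 purely for illustration.
import psycopg2

DDL = """
CREATE TABLE page_events (
    domain      text        NOT NULL,
    event_time  timestamptz NOT NULL,
    kind        text        NOT NULL
) PARTITION BY RANGE (event_time);

CREATE TABLE page_events_2024_01 PARTITION OF page_events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Index chosen to match the typical "filter by domain, group by day" queries.
CREATE INDEX ON page_events_2024_01 (domain, event_time);
"""

with psycopg2.connect("dbname=analytics") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```

With partitions pruned by the time filter and an index that matches the query pattern, the planner touches only a small slice of the table instead of scanning all of it.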
Deep-level optimization, such as altering database code, is justified only in rare cases where standard solutions fall short and migrating to another storage system isn’t feasible. However, such situations are exceptions. In most cases, choosing the right storage system and using its features effectively is sufficient.