Managing Global Round-the-Clock Platform Operations and Rapid Growth
Over the past several weeks, the real-time reporting publishers have come to expect from Sovrn has experienced delays. We believe strongly in transparency and in continually learning and improving, so we want to provide a fuller explanation of the reasons behind the delays and what we are doing to fix the underlying causes.
It is important to know that, in spite of the reporting delays, there has been zero data loss, thanks to the robust data protection infrastructure Sovrn has in place. All data is and has been intact. The core issues had to do with the processing of that data into publisher dashboards and reporting tables.
To understand the delays in data reporting, it is important to first outline the three tiers that comprise the Sovrn platform.
Tier 1
The delivery tier is a distributed “pod” architecture with points of presence close to the large population centers and demand (advertiser) buyer systems. Proximity to media centers such as New York, Los Angeles, Chicago, London, etc. enables extremely fast exchange auction mechanics, advertising creative delivery, and ultimately less latency for the reader of the site. With increased adoption of new auction dynamics like header bidding, the need to scale for high volume with predictable response times is becoming even more important.
Tier 2
The data tier includes the capability to capture, store, and then process data at extremely high speed. Sovrn’s Publisher network exceeds 100,000 sites, where more than 1B people consume 90 billion pages of content every month. The systems required to process that volume of data are the same as, or very similar to, those used by large platform companies (think: Netflix, Spotify, Snap, Twitter). The reporting issues publishers have seen were due to processing and network capacity bottlenecks at our primary data centers as data volumes increased in late January.
Tier 3
Finally, the analytics tier includes individual publisher reports. The data tier feeds the analytics tier, so as processing slowed and network capacity became constrained, publishers saw delayed reporting in their dashboards and other reports.
Growth: The Best Kind of Challenge
Sovrn’s data processing layer consists of a data ingest queue, a centralized data store, and real-time and batch data processing pipelines that supply the analytics data. The infrastructure is designed to “scale out” by adding servers as needed. What we discovered was that the data processing architecture we’d deployed made it difficult to perform critical systems maintenance while simultaneously running at the extreme volume we began to encounter in late January. Frankly, we were a victim of our own growth. We’ve seen an 8x increase in data volume, driven primarily by a rapid expansion of header bidding implementations but also by video and Signal adoption. When we attempted to perform regular system maintenance and add servers, the infrastructure became unstable. Recovering from this instability required several in-place upgrades, patches, and reconfigurations.
We should reiterate that the ad delivery tier (Tier 1) was never impacted. Moreover, all data was intact. This was fundamentally a bottleneck in the data tier that impacted data processing and resulted in several weeks of slow-populating dashboards and publisher reports.
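To illustrate why maintenance becomes risky at high volume, here is a minimal sketch of the headroom arithmetic. The function name and all of the numbers are hypothetical and chosen only for illustration; they are not Sovrn's actual figures or code. The idea is simply that when ingest runs close to total processing capacity, taking even one server out of rotation can push capacity below the ingest rate, and a backlog starts to build.

```python
def backlog_growth(ingest_rate, per_server_rate, servers_online, hours):
    """Units of data left unprocessed after `hours`, starting from an empty queue.

    All rates are in the same (hypothetical) units of data per hour.
    """
    capacity = per_server_rate * servers_online
    # A backlog only builds when data arrives faster than it is processed.
    return max(0.0, (ingest_rate - capacity) * hours)


# Hypothetical cluster: 10 servers, each processing 10 units per hour.
# Ingest of 95 units per hour leaves only 5% headroom.
all_online = backlog_growth(ingest_rate=95, per_server_rate=10, servers_online=10, hours=4)
one_offline = backlog_growth(ingest_rate=95, per_server_rate=10, servers_online=9, hours=4)
print(all_online, one_offline)
```

With all ten hypothetical servers online the queue stays empty; with one offline for a four-hour maintenance window, the same ingest rate leaves 20 units unprocessed.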
Key Lessons
- Sovrn is a 24×7, always-on system. We operate in North America, Europe, and Asia. It is simply not possible to take the platform down for maintenance. This means that configurations can become out of date, and the platform can become unstable, when regular maintenance (patches, etc.) is deferred.
- The platform requires a more flexible architecture that provides elastic capacity and the ability to remove or replace select components for maintenance without impacting overall processing.
- The platform must be able to quickly and efficiently spike capacity to keep up with both current and backfill data processing. As volume grows, the processing capacity required to catch up from data delays can increase dramatically, further slowing things down.
- Our commitment to the publisher is to provide timely and transparent communication of causes, impacts, and responses when issues like the above occur.
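The point about backfill in the third lesson can be made concrete with a small sketch. The function and the numbers below are hypothetical, not Sovrn's actual figures: only the capacity left over after keeping up with current data is available for backfill, so catch-up time is the backlog divided by that surplus.

```python
import math


def catch_up_hours(backlog, ingest_rate, capacity):
    """Hours needed to clear `backlog` while still processing new data.

    `ingest_rate` and `capacity` are in the same (hypothetical) units per hour.
    """
    surplus = capacity - ingest_rate
    if surplus <= 0:
        # No spare capacity: the backlog never shrinks.
        return math.inf
    return backlog / surplus


# Hypothetical backlog of 240 units with ingest at 95 units per hour.
print(catch_up_hours(backlog=240, ingest_rate=95, capacity=100))  # 5 units/hour spare
print(catch_up_hours(backlog=240, ingest_rate=95, capacity=125))  # 30 units/hour spare
```

In this illustration, raising capacity by 25% (from 100 to 125) cuts the catch-up time from 48 hours to 8, which is why the ability to spike capacity temporarily matters more as steady-state volume approaches the capacity limit.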
What Publishers Can Expect Going Forward
Now that the systems are stabilized and reporting is current, we are moving to upgrade the processing capacity and to guarantee that reporting data is available within an hour.
Each third-party component (MapR, Aerospike, etc.) will be certified by its vendor and validated with quality assurance testing, acceptance testing, and parallel processing to confirm data validity. We will deploy these fixes in phases, which allows for publisher feedback and rapid response during deployment.
We believe these investments will continue to provide publishers access to unique, high-value data that is delivered consistently, reliably, and in real time.