Scalability was one of the primary concerns when we started building the tool. Essentially, the tool gathers numbers about links you post, it is quite straightforward. To gather these numbers, our tool uses many external APIs and in a way acts as a sort of proxy between the user and many other 3rd party API providers, on top of which some internal indicators are derived. Many tools allow you to do that, but, regarding scalability, some ways are better than others. Much better actually. Gathering information for 1000 urls a day is different than doing it on 1 million, lots of challenges came in the way.
TECHNOLOGYDeciding on which platform to use, we ended up using the well-known combo Python-Django-Celery. It is the one i have most experience with, and the task is really I/O bound therefore it is not one of those cases in which writing everything in C makes a big difference. This combo also allows us to code things pretty quickly, testing various methods and combinations. The real complexity is in the Celery backend, which is where the data gathering takes place.
WORKFLOWRequests could come in through API or through the Web interface. Web interface is a better example because that is the only way now to send multiple urls at once. When URLs enter into the system, each one of those is done in parallel. For every url, there are two rounds of data gathering, the first gets part of the final results, and then a second round gets the results that are dependent on the first round of numbers.
All these single rounds of API calls are done asynchronously, not sequentially. We make heavy use of Celery advanced features such as tasksets and chords to make sure we squeeze every bit of performance we can from the system.
Each background task takes then care of storing these numbers in a PostgreSql database server, which they later get pulled back in the Web interface (or API results)