Developing a robust scraping solution requires handling many moving parts, because reliability issues are intrinsic to the problem: network traffic issues, frequent site content changes, and anti-scraping blockers. A reliable system should give good visibility into day-to-day operations, self-heal transient errors, and provide tools to handle permanent errors.
The diagram below outlines a minimally viable solution for building a production-grade scraping robot. The proposed architecture cleanly decouples activities using a message broker, which results in cleaner code and better operational visibility into the system.
One can build the proposed solution from open-source components such as RabbitMQ, BeautifulSoup, Scrapy, MongoDB, InfluxDB, PostgreSQL, Nagios, and Python/Django. The rest of this document discusses the architecture and the essential components required to deploy a scraping solution using AWS cloud-native resources.
A spider is responsible for scraping the required content from a targeted website. Refer to A reliable web scraping robot – design Insights for more details on spider development. Designing a configurable spider with options such as the ones sketched below will reduce unnecessary network blockages and improve overall spider reliability.
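For illustration, here is a minimal sketch of such a configurable spider in Scrapy; the target URL, the CSS selectors, and the setting values are hypothetical placeholders, not part of the original design.

```python
# A minimal, configurable Scrapy spider sketch. The target URL, the
# CSS selectors, and the setting values are hypothetical placeholders.
import scrapy


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    start_urls = ["https://example.com/products"]  # placeholder target

    # Per-spider settings that throttle traffic and identify the bot,
    # reducing the chance of being blocked by the target site.
    custom_settings = {
        "DOWNLOAD_DELAY": 2,        # seconds between requests
        "CONCURRENT_REQUESTS": 4,   # keep parallelism modest
        "RETRY_TIMES": 3,           # self-heal transient HTTP errors
        "USER_AGENT": "Mozilla/5.0 (compatible; ExampleBot/1.0)",
        "ROBOTSTXT_OBEY": True,
    }

    def parse(self, response):
        # Selectors depend entirely on the target site's markup.
        for item in response.css("div.product"):
            yield {
                "title": item.css("h2::text").get(),
                "price": item.css("span.price::text").get(),
            }
        # Follow pagination links, if any.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Keeping the delay, concurrency, and retry values in configuration rather than code lets the same spider be tuned per site without a redeployment.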
A scheduler is responsible for scheduling scraping jobs and publishing them to the Jobs topic. The spiders subscribed to this topic pick up these jobs and scrape the specified sites. A typical scheduler wakes up every five minutes, publishes the jobs that are due, and reschedules jobs that failed with transient spider errors, as in the sketch below.
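As a sketch of this flow, the snippet below publishes due jobs to an SNS topic using boto3; the topic ARN, the `get_due_jobs` helper, and the job payload shape are all hypothetical assumptions, not part of the original design.

```python
# A minimal scheduler sketch using boto3. The topic ARN, the job
# source, and the job payload shape are hypothetical placeholders.
import json
import boto3

sns = boto3.client("sns")
JOBS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:scraping-jobs"  # placeholder


def get_due_jobs():
    """Hypothetical job source; in practice this would query a job store
    for jobs that are due, including transient failures to be retried."""
    return [{"job_id": "42", "site": "https://example.com", "retries": 0}]


def run_scheduler():
    # Invoked every five minutes, e.g. by an EventBridge schedule rule.
    for job in get_due_jobs():
        sns.publish(
            TopicArn=JOBS_TOPIC_ARN,
            Message=json.dumps(job),
        )
```

Publishing to a topic rather than invoking spiders directly keeps the scheduler decoupled from however many spider workers are subscribed.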
Analysis and storage activities vary from business case to business case, and a discussion of these activities is beyond the scope of this document. Refer to our recent article, A reliable web scraping robot – design Insights, for more details on this subject.
A launcher is responsible for setting up the infrastructure and rolling out changes to the software components: it sets up the topics, queues, Lambda functions, and subscriptions. The launcher can be run as a command-line program or embedded in a CI/CD pipeline; a provisioning sketch follows.
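A minimal provisioning sketch with boto3 might look like the following; the resource names are placeholders, and the queue access policy and Lambda function wiring are omitted for brevity.

```python
# A minimal launcher sketch using boto3: create a topic, a queue, and
# subscribe the queue to the topic. Names and ARNs are placeholders.
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")


def provision():
    # create_topic/create_queue return the existing resource if one
    # with the same name and attributes already exists, so the launcher
    # can be re-run safely from a CI/CD pipeline.
    topic_arn = sns.create_topic(Name="scraping-jobs")["TopicArn"]
    queue_url = sqs.create_queue(QueueName="scraping-jobs-queue")["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Deliver raw job messages from the topic into the queue. A queue
    # access policy allowing SNS to send messages is also required
    # (omitted here), as is creating the Lambda functions themselves.
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        Attributes={"RawMessageDelivery": "true"},
    )
```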
The architecture emphasizes error recovery and a flexible workflow by choosing Lambda and SQS. Using Lambda functions for scraping and result processing simplifies deployment and offers maximum flexibility for workflow changes.
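To illustrate the Lambda/SQS pairing, here is a sketch of a handler wired to an SQS event source; the message shape and the `scrape_site` helper are hypothetical.

```python
# A minimal sketch of a Lambda handler triggered by SQS. The message
# shape and the scrape_site helper are hypothetical placeholders.
import json


def scrape_site(site):
    """Hypothetical scraping entry point, e.g. wrapping a spider."""
    return {"site": site, "status": "ok"}


def handler(event, context):
    # With an SQS event source mapping, Lambda delivers batches of
    # messages under event["Records"]. Raising an exception returns the
    # batch to the queue for retry after the visibility timeout, which
    # is how transient errors self-heal in this design.
    for record in event["Records"]:
        job = json.loads(record["body"])
        scrape_site(job["site"])
```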
First-time scraping of a new site can become a tedious chore because of the 15-minute maximum execution time of Lambda functions. One may consider using ECS/MQ instead of Lambda/SQS for large implementations.
Even though one can use the AWS console to manage day-to-day operations, building an operations UI with real-time dashboards will give the Ops team better control.
Mobigesture has delivered text mining and analytical data visualization solutions to the customer care, skincare, and hospitality industries. Our team is proficient in building scraping spiders, cloud-native resources, back-office integrations, NLP, predictive & deep learning, big data, AWS, Azure, GCP, Python Django/Flask, NodeJS, MongoDB, Elastic Search, React, Angular, D3, and mobile hybrid frameworks.
For demos or POCs, please write to us at contact@mobigesture.com to learn more about our offerings.