
Companies scrape the internet for many reasons - price monitoring, counterfeit detection, copyright-infringement tracking, reputation management, product information, lead prioritization, communication compliance, SEO audits, and website testing. Corporate websites, news aggregators, e-commerce portals, blogs, and social media sites are all potential scraping targets.

The collected data is cleansed, validated, and converted into a structured format before being handed over to the analytics engine. The analytics engine applies natural language processing, classification, clustering, and scoring algorithms. The results are either displayed as interactive charts or sent to the back-office system.

It’s not just about coding; it’s all about reliability.

Developers with decent Python skills can quickly build a basic spider using Requests, Scrapy, XPath, BeautifulSoup, and Selenium. A mature spider, however, needs to handle the following situations (a throttling and retry sketch follows the list).

  • Handle dynamic portals where content is assembled by JavaScript.
  • Improve crawling speed, especially when Selenium is in the picture.
  • Throttle page clicks to avoid real-time blocking.
  • Handle lazy loading or infinite scrolling.
  • Avoid hidden link honeypots.
  • Handle static and real-time IP blockers.
  • Handle Turing tests – CAPTCHAs, mouse clicks, and keystrokes.
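The sketch below illustrates the throttling and retry concerns from the list above, using Requests and BeautifulSoup against a hypothetical target URL and CSS selector; Selenium-driven pages need the same pacing applied to browser actions.

```python
import random
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/products?page={}"  # hypothetical target
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; demo-spider/0.1)"}

def fetch(url, retries=3, backoff=5):
    """GET a page with a simple retry on transient failures."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=30)
            if resp.status_code == 200:
                return resp.text
            if resp.status_code in (429, 503):  # throttled or temporarily down
                time.sleep(backoff * attempt)
                continue
            resp.raise_for_status()
        except requests.RequestException:
            time.sleep(backoff * attempt)
    return None

def crawl(pages=5):
    for page in range(1, pages + 1):
        html = fetch(BASE_URL.format(page))
        if html is None:
            continue
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.select(".product"):   # hypothetical selector
            yield item.get_text(strip=True)
        time.sleep(random.uniform(2, 6))       # throttle page requests

if __name__ == "__main__":
    for record in crawl():
        print(record)
```

Randomized delays and backoff-on-429 are the simplest defenses against real-time blocking; production spiders usually layer rotating proxies and browser automation on top of this pattern.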

The job is not done until a robust system is in place.

Deploying robust and reliable scrapers is never easy. No matter how good your spiders are, they inevitably degrade over time as websites change continuously. Another major challenge is scraping a large number of sites at good speed. A sound scraping system should:

  • Include a messaging backbone to manage the workflow - job scheduling, data cleanup & ingestion, error recovery, NLP, and deep learning activities.
  • Send alerts via email, text, or Slack, or raise a ticket when a scraping error occurs.
  • Achieve fast scraping – concurrent scraping jobs and error recovery.
  • Include data stores capable of handling relational, key-value, NoSQL, time-series, and full-text search workloads.
  • Automatically recover jobs with transient failures – site down, slow loading, IP blocks, and throttling (see the sketch after this list).
  • Manually recover from permanent errors – Turing tests, site changes.
  • Provide dashboards and drill-down reports that display real-time activity, trends, and data-collection anomalies.
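One way to wire the automatic-retry and alerting behaviour described in the list is to run each scraping job as a task on a messaging backbone. The sketch below uses Celery with a Redis broker; the broker URL, the `send_alert` helper, and the error classification are assumptions for illustration.

```python
import requests
from celery import Celery

app = Celery("scraper", broker="redis://localhost:6379/0")  # hypothetical broker

def send_alert(message):
    """Hypothetical helper: push a notification to email/Slack/ticketing."""
    print(f"ALERT: {message}")

class TransientScrapeError(Exception):
    """Site down, slow loading, IP block, or throttling."""

@app.task(bind=True, max_retries=5, default_retry_delay=300)
def scrape_site(self, url):
    try:
        resp = requests.get(url, timeout=30)
        if resp.status_code in (429, 502, 503):
            raise TransientScrapeError(f"{url} returned {resp.status_code}")
        resp.raise_for_status()
        return resp.text
    except (TransientScrapeError, requests.Timeout) as exc:
        # Transient failure: let the queue retry with a delay.
        raise self.retry(exc=exc)
    except Exception as exc:
        # Permanent failure (CAPTCHA, layout change): escalate to a human.
        send_alert(f"Manual recovery needed for {url}: {exc}")
        raise
```

Because each job is an independent task on the queue, concurrency, retries, and escalation are handled by the backbone rather than by each spider.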

Continuous data quality assurance systems produce accurate insights.

Data preprocessing stages vary from one business case to another. The following are the essential activities:

  • Data cleansing: Not everything you gather from the internet is clean, organized, and useful. First, cleanse the data: normalize Unicode characters, remove non-printable characters, drop duplicates, and strip inadvertent content such as HTML, CSS, and JavaScript (see the sketch after this list).
  • Data validation: Data validation is the semantic verification of incoming data. Typical challenges are data-type mismatches, incomplete data sets, and missing values. Automate the verification steps with JSON schema validators, regular expressions, and RPA tools, and also provide an option to inspect and correct the incoming data manually.
  • Ingestion: Persist the structured data into relational, key-value, document, or time-series data stores based on the analytics strategy.
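A compact sketch of the cleansing and validation steps above, using the standard library plus the jsonschema package; the product record and schema are hypothetical.

```python
import re
import unicodedata

from jsonschema import ValidationError, validate

# Hypothetical schema for a scraped product record.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
    },
    "required": ["name", "price", "currency"],
}

TAG_RE = re.compile(r"<[^>]+>")                # residual HTML tags
NON_PRINTABLE_RE = re.compile(r"[\x00-\x1f\x7f]")

def cleanse(text):
    """Normalize Unicode and drop markup and non-printable characters."""
    text = unicodedata.normalize("NFKC", text)
    text = TAG_RE.sub("", text)
    text = NON_PRINTABLE_RE.sub("", text)
    return text.strip()

def validate_record(record):
    """Return (is_valid, error_message) for a cleansed record."""
    try:
        validate(instance=record, schema=PRODUCT_SCHEMA)
        return True, None
    except ValidationError as exc:
        return False, exc.message

raw = {"name": " <b>Acme\u00a0Widget</b> Pro ", "price": 19.99, "currency": "USD"}
raw["name"] = cleanse(raw["name"])
print(raw["name"], validate_record(raw))
```

Records that fail validation can be routed to a manual-inspection queue instead of being dropped silently.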

Improve your decision-making with custom applications and data-driven insights.

The analysis phase varies significantly from one business case to another. Standard techniques include natural language processing, classification, clustering, and scoring algorithms. Some business case-specific approaches are:

Reputation Management

Mine customer reviews on social media, online shopping portals, and industry-specific blogs - using language processors customized to handle industry-specific jargon - for sentiment, brand loyalty, product merits, issues, and fake product sales. Customizing the language processor accurately enough to improve content coverage is a challenging task. Stanford NLP, NLTK, n-gram analysis, and VADER will get you the best results.
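Since VADER is mentioned above, here is a minimal review-scoring sketch with NLTK's VADER analyzer; the sample reviews are hypothetical, and industry-specific jargon would still need a customized lexicon.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

analyzer = SentimentIntensityAnalyzer()

reviews = [  # hypothetical scraped reviews
    "The serum absorbs quickly and my skin feels amazing.",
    "Arrived late and the packaging was damaged - not buying again.",
]

for review in reviews:
    compound = analyzer.polarity_scores(review)["compound"]
    label = ("positive" if compound >= 0.05
             else "negative" if compound <= -0.05
             else "neutral")
    print(f"{label:8s} {compound:+.2f}  {review}")
```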

Lead Prioritization

Scrape lead information from selected portals, extract metrics from the scraped content, run clustering algorithms to segment the leads, and finally score each lead. Typical components are K-means clustering, the Gini index, logistic regression, SMOTE oversampling, and natural language processors.
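The clustering-then-scoring flow can be sketched with scikit-learn and imbalanced-learn, as below; the lead features and conversion labels are synthetic stand-ins for scraped metrics.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical lead metrics: [visits, pages_viewed, days_since_contact]
X = rng.random((500, 3))
y = (rng.random(500) < 0.15).astype(int)   # imbalanced: ~15% converted

# Segment the leads and feed the segment back in as an extra feature.
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X = np.column_stack([X, segments])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Oversample the minority class before fitting the scoring model.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

scorer = LogisticRegression(max_iter=1000).fit(X_res, y_res)
lead_scores = scorer.predict_proba(X_test)[:, 1]   # probability of conversion
print("Top lead score:", round(float(lead_scores.max()), 3))
```

In practice the scores would be written back to the CRM so that sales teams can work the highest-probability leads first.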

Price Monitoring

Scrape e-commerce, partner, and rival sites to capture pricing, packaging details, and imagery. Monitor price changes across regions and seasons, and flag outliers. Detect changes in images and product information and notify users. Use kNN, K-means, SVM, OpenCV, and image and content diff tools to accomplish the task.
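A small sketch of the image-diff step with OpenCV follows; the snapshot file paths and the change threshold are hypothetical, and a production system would also diff the structured product fields.

```python
import cv2
import numpy as np

def image_changed(old_path, new_path, threshold=0.02):
    """Return True if more than `threshold` of the pixels differ between snapshots."""
    old = cv2.imread(old_path, cv2.IMREAD_GRAYSCALE)
    new = cv2.imread(new_path, cv2.IMREAD_GRAYSCALE)
    if old is None or new is None:
        raise FileNotFoundError("could not read one of the snapshots")
    # Resize to a common shape so the pixel-wise diff is well defined.
    new = cv2.resize(new, (old.shape[1], old.shape[0]))
    diff = cv2.absdiff(old, new)
    changed_ratio = np.count_nonzero(diff > 25) / diff.size
    return changed_ratio > threshold

# Hypothetical snapshots captured on consecutive scrape runs.
if image_changed("product_2024_01_01.png", "product_2024_01_02.png"):
    print("Product imagery changed - notify the pricing team.")
```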

Conclusion

A solution designed with messaging and instrumentation components results in a reliable system capable of handling hundreds of asynchronous tasks. In the event of a failure, the system can perform self-healing or alert-escalation activities.

The scraping system needs close monitoring and frequent intervention due to its inherently capricious environment – frequent content changes, layout changes, and network blockages. Engaging experts who have experience building robust scraping systems will significantly reduce the learning curve and the time to market for your solution.

About Mobigesture

Mobigesture has delivered text mining and analytical data visualization solutions to the customer care, skincare, and hospitality industries. Our team is proficient in scraping spiders, cloud-native resources, back-office integrations, NLP, predictive & deep learning, big data, AWS, Azure, GCP, Python Django/Flask, NodeJS, MongoDB, Elastic Search, React, Angular, D3, and Mobile Hybrid frameworks.

For demos or POCs, please write to us at contact@mobigesture.com to learn more about our offerings.