Case Study Details

Web Crawling Optimisation

Focus

Development of statistical models to optimise the timing of scraping data from websites.

Approach

The client was an online platform that collected, stored and analysed a vast quantity of media data from a significant number of webpages.

This required a sophisticated web-crawling strategy: optimising the time and frequency of crawls to minimise the delay between a target site publishing new content and that content being retrieved, while ensuring crawls ran only when there was new content to fetch.

The solution used stochastic point processes to crawl each site at a time and frequency that maximised the likelihood of finding new content.
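The case study does not specify the model used. As a minimal illustrative sketch, assuming each page's updates follow a homogeneous Poisson process, one can estimate an update rate from observed change history and schedule the next crawl for when the probability of new content crosses a threshold. The function names and parameters below are hypothetical, not from the project.

```python
import math

def estimate_update_rate(change_count, observation_hours):
    """Maximum-likelihood estimate of a Poisson update rate:
    observed changes per hour over the observation window."""
    return change_count / observation_hours

def next_crawl_delay(rate, target_prob=0.8):
    """Hours to wait so that P(at least one update) reaches target_prob.
    Under a homogeneous Poisson process, P(update within t) = 1 - exp(-rate * t),
    so t = -ln(1 - target_prob) / rate."""
    return -math.log(1.0 - target_prob) / rate

# Example: a page that changed 12 times over 48 hours of observation.
rate = estimate_update_rate(12, 48)    # 0.25 updates/hour
delay = next_crawl_delay(rate, 0.8)    # ~6.4 hours until next crawl
```

Pages that update often get short delays and pages that rarely change get long ones, which is one way to spend crawl budget only where new content is likely.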

Outcome

  • Optimised crawling schedule
  • Efficient use of cloud compute resources
  • Increased currency of content

Project information

  • Sector: Private
  • Client: Media Analysis Platform
  • Technical Expertise
    • Machine Learning
    • Statistical Analysis
    • Data Search & Retrieval