How I Analyzed Millions of Log Files with Python

Python for SEO

A few months ago I published my log file analyzer tool, and I have finally found the time to write about log files, how to use the tool, and the challenges I faced along the way. So, here we go.

What Are Log Files?

A log file is a server record that contains the history of page requests for a website, from both humans and robots.

Log files are an extremely valuable source of 100% accurate data that lets us understand what happens when a search engine crawls our website.

Typically, there will be one line per request, and this will include some or all of the following:

  • Time the request was made
  • URL requested
  • User agent
  • Response code
  • Size of response
  • IP address of client making the request
  • Time taken to serve the request
  • Referrer, the page that provided the link to make this request
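To make the field list concrete, here is a minimal parsing sketch. It assumes the common Apache/Nginx "combined" log format, which may not match your server's configuration; the pattern and the names `LOG_PATTERN` and `parse_line` are illustrative, and the sample line is made up. The combined format does not record the time taken to serve a request, so that field is left out.

```python
import re
from datetime import datetime

# Assumption: Apache/Nginx "combined" format. Adjust the pattern to your server's format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # Servers log "-" when no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

# Made-up sample line for illustration only.
sample = (
    '66.249.66.1 - - [10/Oct/2020:13:55:36 +0000] '
    '"GET /products/shoes HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
record = parse_line(sample)
```

Each parsed record is a plain dictionary, so it can be written to a CSV row or filtered by `status` or `user_agent` later on.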

What Do Log Files Reveal?

  1. See what Googlebot is actually consuming
  2. Show how much “crawl budget” is being wasted, and where
  3. Compare Googlebot Mobile vs. Googlebot Desktop (a mobile-first index indicator)
  4. Find and fix accessibility errors such as 404 and 500 errors
  5. Locate the most crawled pages and the pages that aren’t crawled often

Additionally, we’re able to track the HTTP status codes for all pages on the site, identify broken pages and pages with server errors, and fix them accordingly.

By doing so, we improve site visibility in the SERPs, with a focus on the most important pages we’d like to rank for, and we optimize Google’s crawl budget so that its bot can crawl and index our brands more efficiently and easily.

How to Manage HUGE Log Files

When dealing with big websites with millions of users, the full log file can be massive, and working with it can take a lot of time and resources.

What you can do is strip out just the Googlebot user agents from the server logs. Doing so helped drop the overall server log size by over 90% (at least in this case). It can be done with simple grep commands on Linux.

Pulling every log line that contains ‘Googlebot’ will match all of Google’s search crawler user agents, for both desktop and mobile.
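For readers who prefer to stay in Python, the same filtering idea can be sketched as follows. This is an illustration of the technique, not the code from the tool: the function names are hypothetical, the match is case-insensitive (plain grep is case-sensitive unless told otherwise), and the file is streamed line by line so a large log never has to fit in memory.

```python
def keep_googlebot_lines(lines, needle="googlebot"):
    """Yield only the lines that mention the needle, ignoring case."""
    for line in lines:
        if needle in line.lower():
            yield line

def filter_log_file(src_path, dst_path):
    """Stream src_path line by line and write the matching lines to dst_path."""
    kept = 0
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in keep_googlebot_lines(src):
            dst.write(line)
            kept += 1
    return kept

# Made-up lines for illustration only.
sample_lines = [
    '1.2.3.4 "GET / HTTP/1.1" 200 "Mozilla/5.0 (compatible; Googlebot/2.1)"\n',
    '5.6.7.8 "GET / HTTP/1.1" 200 "Mozilla/5.0 (Windows NT 10.0) Firefox/80.0"\n',
    '9.9.9.9 "GET /a.png HTTP/1.1" 200 "Googlebot-Image/1.0"\n',
]
kept_lines = list(keep_googlebot_lines(sample_lines))
```

Note that a plain substring match also keeps crawlers such as Googlebot-Image, and anything that merely claims to be Googlebot, so further filtering is still needed afterwards.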

In addition, because a user agent string is easy to fake, you can run a reverse DNS lookup to verify whether a web crawler accessing your server really is Googlebot.
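A minimal sketch of that check with Python's standard `socket` module is shown below. It follows the usual two-step pattern: reverse-resolve the IP, confirm the host name ends in `googlebot.com` or `google.com`, then forward-resolve the host name and confirm it maps back to the same IP. The function name and the injectable `reverse_lookup` and `forward_lookup` parameters are assumptions made here so the logic can be tried without network access; the forward lookup used by default covers IPv4 only.

```python
import socket

# Host name suffixes expected for Google's crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve ip, check the host name, then forward-resolve it back."""
    # Default to real DNS lookups; tests can pass in fakes instead.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The host name must resolve back to the original IP address.
        return ip in forward_lookup(host)
    except OSError:
        # Covers socket.herror and socket.gaierror (lookup failures).
        return False
```

Calling `is_verified_googlebot(ip)` with only an IP address performs live DNS queries, which is slow for millions of rows, so it makes sense to verify each distinct IP once and cache the answer.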

Data Challenges

1. The zipped log files were uploaded from the server to an FTP server. When unzipped, each one was a txt file of about 250 MB (and that file held only about 6 hours’ worth of data, so try to imagine the size after 30 days, multiplied by a number of websites).

Each text file had about 1 million rows, which added up to about 30 GB of data after 30 days. With this tool I was able to reduce that dramatically, so 30 days of data weighed only 11 MB!

2. Merging all the log files so the data can be analyzed together: 1,000 txt files into one CSV file
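One way to approach the merge step is sketched below with only the standard library. It is an illustration under assumptions, not the tool's actual code: `merge_logs_to_csv`, its `columns` argument and the `to_row` callback (which turns one raw line into a list of values, or returns `None` to skip it) are hypothetical names, and the demo uses two tiny throwaway files with a simplified three-column layout.

```python
import csv
import glob
import os
import tempfile

def merge_logs_to_csv(input_dir, output_csv, columns, to_row):
    """Write one CSV row per log line found in the .txt files of input_dir.

    to_row turns a raw line into a list of values, or None to skip the line.
    """
    written = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(columns)
        # Sorted so the merged file has a stable, repeatable order.
        for path in sorted(glob.glob(os.path.join(input_dir, "*.txt"))):
            with open(path, encoding="utf-8", errors="replace") as src:
                for line in src:
                    row = to_row(line.rstrip("\n"))
                    if row is not None:
                        writer.writerow(row)
                        written += 1
    return written

# Demo on two tiny throwaway files; real logs have more columns.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "1.1.1.1 /home 200\n"),
                       ("b.txt", "2.2.2.2 /shop 404\n\n")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)
    merged_path = os.path.join(tmp, "merged.csv")
    total = merge_logs_to_csv(
        tmp, merged_path, ["ip", "url", "status"],
        lambda line: line.split() if line else None,  # skip empty lines
    )
    with open(merged_path, newline="", encoding="utf-8") as f:
        merged_rows = list(csv.reader(f))
```

Because every file is read line by line and written straight to the output, memory use stays flat no matter how many files are merged.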

3. Removing redundant rows and modifying the column titles

4. Filtering Googlebot hits: 95% of the Googlebot hits in the log files I get are not relevant to me. Those hits come either from SEO crawler tools that emulate Googlebot or from other Google bots, such as the ones for ads, videos, etc.
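As a rough illustration of the user agent part of this filtering, the sketch below keeps only the search crawler and labels it as mobile or desktop. It is a heuristic based on the published user agent strings (the `Googlebot/` token plus the word `Mobile`), not the tool's actual logic; `classify_googlebot` is a hypothetical name, the Chrome version in the sample is a placeholder, and tools that emulate Googlebot still have to be excluded by verifying the IP address.

```python
def classify_googlebot(user_agent):
    """Return "mobile", "desktop", or None for anything that is not the search crawler."""
    if "Googlebot/" not in user_agent:
        # AdsBot-Google, Googlebot-Image, Googlebot-Video, regular browsers, ...
        return None
    return "mobile" if "Mobile" in user_agent else "desktop"

# Sample user agent strings for illustration; the Chrome version is a placeholder.
DESKTOP_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/99.0.0.0 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
ADS_UA = "AdsBot-Google (+http://www.google.com/adsbot.html)"
IMAGE_UA = "Googlebot-Image/1.0"

labels = [classify_googlebot(ua) for ua in (DESKTOP_UA, MOBILE_UA, ADS_UA, IMAGE_UA)]
```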

Solution

  1. Works automatically for multiple websites: choose a website → get the data.
  2. Get 90% more data compared to any other SEO crawler, Google Analytics (GA) or Google Search Console (GSC).
  3. Filter, merge and save the data to a CSV file.
  4. Analyze the log file data and draw graphs (automatically) for the relevant data points.
    • Google Bot Hits by Day: Mobile vs. Desktop
      • Green bars: Google Bot Mobile
      • Purple bars: Google Bot Desktop
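The numbers behind a "hits by day, mobile vs. desktop" bar chart can be produced with a small aggregation like the one below. This is a sketch with made-up sample data and a hypothetical function name; it assumes each hit has already been reduced to a `(day, device)` pair where the device is either `"mobile"` or `"desktop"`.

```python
from collections import defaultdict
from datetime import date

def hits_by_day_and_device(hits):
    """Return {day: {"mobile": n, "desktop": m}} from (day, device) pairs."""
    table = defaultdict(lambda: {"mobile": 0, "desktop": 0})
    for day, device in hits:
        table[day][device] += 1
    # Sort by day so the bars come out in chronological order.
    return dict(sorted(table.items()))

# Made-up hits for illustration only.
sample_hits = [
    (date(2020, 10, 11), "mobile"),
    (date(2020, 10, 10), "mobile"),
    (date(2020, 10, 10), "desktop"),
    (date(2020, 10, 10), "mobile"),
]
daily = hits_by_day_and_device(sample_hits)
```

The resulting dictionary has one entry per day with both counts, which is the shape most plotting libraries expect for a grouped bar chart.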

Google Bot Hits By Device

Google Bot Hits By Day

Response Code Pie Graph

Response Code % Total

Daily Hits by Response Code
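For the response code charts, the share of each status code is a simple count divided by the total. The sketch below shows that calculation on made-up data; `status_share` is a hypothetical helper, not a function from the tool.

```python
from collections import Counter

def status_share(status_codes):
    """Return {status: percent of total hits}, rounded to one decimal place."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: round(100 * n / total, 1) for code, n in sorted(counts.items())}

# Made-up status codes: five 200s and one each of 301, 404 and 500.
shares = status_share([200, 200, 200, 301, 404, 200, 500, 200])
# shares == {200: 62.5, 301: 12.5, 404: 12.5, 500: 12.5}
```

Grouping the same counts by day instead of over the whole period gives the data for a daily hits by response code chart.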

Blending Data

Next, I used Microsoft Power BI to create SEO dashboards with data coming from log files, Google Analytics and Google Search Console.

Here’s the final result:

Here’s a link to the code. Please note that you’ll probably have to adjust it so you can use it for your specific needs.
