Getting Lighthouse scores from HTTPArchive for sites in India.

Paul Kinlan

I’m about to go on a short trip to India, and I’ve been thinking about longer-term developer relations work for Chrome and the web in the region. As with most trips, I like to do a bit of research ahead of time so I can better understand what the web looks like from the perspective of the country I am visiting.

I’ve been following a bunch of the updates to HTTPArchive over the past couple of months, and it’s been amazing to see the improvements to the types of data it collects and stores in its BigQuery tables. One specific piece of information that is of massive interest to me is the Lighthouse data generated on each run of HTTPArchive. I was keen to see if I could use it to get a snapshot and a high-level understanding of how people might experience the web in the country.

The good news is that it’s not too hard to analyse the Lighthouse data in HTTPArchive.

For my needs though, the harder part is to get a lock on what a ‘top site’ in any given country is, especially when I am thinking about developer relations work that we could and should be doing.

Here is how I broke the problem down. In each country there are many types of developers that build for the web, and personally I tend to bucket them into three groups: those whose current projects target the local market; those that target a foreign market (i.e. building for export); and those that target a global audience.

When I think about the above three groups, it’s nearly impossible to work out the intent of the site and the people behind it. But there are some heuristics that you can use to at least help you reason and understand the data.

For my analysis I didn’t think I could get a list of the top sites visited by users in India, so I made a simple assumption that ‘.in’ domains are likely to be built for people in India. Focusing on ‘.in’ domains does not give 100% sensitivity or specificity for the question of ‘Indian sites’ (users all over the world like to use experiences that aren’t locked to their country’s TLD), but it seems like a decent measure of the state of Indian sites as a first pass.
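The ‘.in’ heuristic can be sketched outside BigQuery too. This is a minimal Python illustration, not part of the original analysis: `is_dot_in_site` is a hypothetical helper name, and the example URLs are made up. It checks the hostname rather than the whole URL string, so a path that happens to end in ‘.in’ does not count.

```python
from urllib.parse import urlsplit


def is_dot_in_site(url):
    """Return True when the URL's hostname ends with the '.in' TLD.

    Hypothetical helper: only the hostname is inspected, so
    'http://example.com/page.in/' is not treated as an Indian site.
    """
    host = urlsplit(url).hostname or ""
    return host.endswith(".in")


# Made-up example URLs, for illustration only.
urls = [
    "http://www.example.in/",
    "https://shop.example.co.in/",
    "http://example.com/page.in/",
]
dot_in_urls = [u for u in urls if is_dot_in_site(u)]
```

As the post notes, this is only a first-pass signal: it misses Indian sites on other TLDs and includes ‘.in’ sites built for other audiences.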

This type of analysis turns out to be pretty easy. You open up BigQuery, find the latest table that contains the Lighthouse data ([httparchive:lighthouse.2018_08_01_mobile] in this case), and run the following query.

SELECT
  url,
  JSON_EXTRACT(report, '$.categories.seo.score') AS [seo_score],
  JSON_EXTRACT(report, '$.categories.pwa.score') AS [pwa_score],
  JSON_EXTRACT(report, '$.categories.performance.score') AS [speed_score],
  JSON_EXTRACT(report, '$.categories.accessibility.score') AS [accessibility_score]
FROM
  [httparchive:lighthouse.2018_08_01_mobile]
WHERE
  url LIKE '%.in/'

The above query keeps only URLs ending in ‘.in/’ and returns the Lighthouse score for each of the Lighthouse test categories. The Lighthouse data is stored as a JSON object, so you have to extract the required components with a JSONPath expression, an XPath-like syntax for JSON.
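For readers who want to see what those paths pick out, here is a rough Python equivalent of the extraction step. It is a sketch under assumptions: `category_scores` is a hypothetical helper, and the sample report is a made-up fragment that only mimics the `categories.<name>.score` shape the query relies on.

```python
import json

# The four categories selected by the query above.
CATEGORIES = ("seo", "pwa", "performance", "accessibility")


def category_scores(report_json):
    """Pull each category score out of a Lighthouse report JSON string.

    Mirrors the '$.categories.<name>.score' paths; a category that is
    missing from the report yields None instead of raising.
    """
    report = json.loads(report_json)
    categories = report.get("categories", {})
    return {name: categories.get(name, {}).get("score") for name in CATEGORIES}


# Made-up report fragment, for illustration only.
sample_report = '{"categories": {"seo": {"score": 0.9}, "pwa": {"score": 0.35}}}'
scores = category_scores(sample_report)
```

Returning None for a missing category keeps incomplete reports visible rather than silently counting them as a score of 0.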

The number of results is actually pretty large and not of much use to present here, but I did pivot these into a histogram.

Score Range  SEO Score  PWA Score  Speed Score  A11Y Score
0                    0         46          279          25
0.5                 84      13992         6502        3973
0.7               3391       1400         2222        7585
0.8               1438         19         1147        2374
0.9               2762          9         1545        1069
1                 7752         13         3189         434
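The pivot into a histogram can be sketched in Python. This is not the method used to produce the table: it assumes each value in the Score Range column is the inclusive upper edge of a bucket (the post does not say), and `bucket` and `histogram` are hypothetical helper names.

```python
from bisect import bisect_left

# Values copied from the "Score Range" column. Treating them as inclusive
# upper edges of each bucket is an assumption.
BIN_EDGES = [0.0, 0.5, 0.7, 0.8, 0.9, 1.0]


def bucket(score):
    """Return the smallest bin edge that is >= score (clamped to the last edge)."""
    index = min(bisect_left(BIN_EDGES, score), len(BIN_EDGES) - 1)
    return BIN_EDGES[index]


def histogram(scores):
    """Count scores per bucket, skipping missing (None) values."""
    counts = {edge: 0 for edge in BIN_EDGES}
    for score in scores:
        if score is None:
            continue
        counts[bucket(score)] += 1
    return [counts[edge] for edge in BIN_EDGES]


# Made-up scores, one bucket count per entry in BIN_EDGES.
example_counts = histogram([0, 0.3, 0.5, 0.65, 0.85, 1.0, None])
```

Running `histogram` once per score column (SEO, PWA, speed, accessibility) would give one column of a table shaped like the one above.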

Further drill-down and analysis is needed to understand exactly which specific issues are affecting the scores. However, in some cases, such as the ‘PWA Score’, I’ve seen enough site scores in the past to know which issues affect the overall score, and I can see some of the challenges ahead of us now.

Next up: try to find a way to get the sites that Indian users frequent. Hint: it’s here
