Thomas Farrar
04/05/2020
Web Scraping of Static HTML Content vs Dynamic Javascript-Rendered Content
In the previous post, we learned how to create a dataset of English Premier League football match data by scraping the HTML code of Match Centre webpages. In that case, the data we wanted was stored between HTML tags (nodes) in the HTML code itself. Thus, once we had downloaded the HTML document for each match using the read_html function from R’s xml2 package, we had the data. It only remained to locate and extract the exact bits of data we wanted. This required the use of functions from R’s rvest package that help us to interpret an HTML document.
The reality, however, is that many webpages today do not merely consist of static HTML content but also have Javascript-rendered content. This allows webpages to be more dynamic and interactive. A webpage with Javascript-rendered content requires a very different approach to web scraping. As an example, take the Historical Results section for the LOTTO game on the South African National Lottery website. When you open this page, you will see links corresponding to the past few draw numbers. If you click on one of these draw numbers, you will see results for that specific draw, such as which balls were drawn, the number of winners in each prize category, the amount of winnings in each prize category, and below that, more information such as the total sales and the (estimated) next jackpot. Observe, however, that the webpage URL has not changed; it is still ‘https://www.nationallottery.co.za/lotto-history’. We have not moved to a different webpage; we have merely changed the content on the same webpage. Moreover, if we scrape the HTML code for this URL, we will not find the data for any LOTTO draw.
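You can verify this for yourself with the static-scraping tools from the previous post: download the page’s HTML and search its text for a value you just saw in the rendered results (a winning amount, say). A minimal sketch follows, with a placeholder search string that you should replace with a value from the draw you opened.
# Download the static HTML of the history page and extract its visible text
page <- xml2::read_html("https://www.nationallottery.co.za/lotto-history")
pagetext <- rvest::html_text(page)
# Replace "65122.80" with any value you saw in the rendered draw results;
# the search should come up empty because that content is Javascript-rendered
grepl("65122.80", pagetext, fixed = TRUE)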
This has two implications for the web scraper. Firstly, we cannot use the URL to cycle through all of the past LOTTO draws (as we did with Premier League matches in the previous post), because the URL remains the same. We must find another way to cycle through all of the past draws. Secondly, instead of merely downloading the HTML content of this webpage (which does not contain the data we want), we need to find out how the webpage uses Javascript to render the draw data and imitate that procedure in R. Before we dig into the details of doing this, a bit more background on our objective.
The National Lottery’s LOTTO Game
One of the games of the South African National Lottery is LOTTO. Draws take place twice per week, on Wednesdays and Saturdays. Each play consists of selecting six numbers between 1 and 52 without repeats. At the draw, a machine knocks around 52 numbered balls and six are drawn without replacement, plus a seventh called the bonus ball. To win the jackpot (Level 1 prize), your six numbers must exactly match the six balls drawn from the machine (order does not matter). The jackpot prize ranges from about R3 million up to over R100 million depending on the draw. Progressively smaller prizes can be won by matching five numbers plus the bonus ball, five numbers without bonus ball, four numbers plus bonus ball, and so on. The lowest prize-winning outcome (Level 8) corresponds to matching two numbers plus the bonus ball; this earns the player R20.
A student of probability may find it an interesting exercise to work out the probabilities of winning each of the prize levels using combinations. Interestingly, there were formerly only 49 balls in the game. The seemingly inconsequential increase from 49 balls to 52 with effect from Draw No. 1732 (August 2017) has actually increased the number of possible combinations by over 45%, lengthening the odds of winning a jackpot from 1 in 13,983,816 to 1 in 20,358,520.
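These jackpot odds are easy to verify in R with the choose function, which counts the number of ways to select six numbers from the pool:
# Number of possible six-number selections before and after the change
choose(49, 6)                       # 13983816 combinations with 49 balls
choose(52, 6)                       # 20358520 combinations with 52 balls
choose(52, 6) / choose(49, 6) - 1   # an increase of about 46%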
What would scraping historical draw results allow us to do? One thing it would not allow us to do is improve our chances of winning a jackpot. The statistical principle of independence means that knowing which numbers have been drawn more or less frequently in the past tells us nothing about which numbers might be drawn next. Two things we could do with the scraped data, however, are (1) test whether the data fit the theoretical probability distributions, both for the balls drawn and for the number of jackpots won (if not, this could suggest either that the machines are faulty or that some sort of fraud has occurred); and (2) examine patterns of LOTTO sales over the years. To do either of these, we need the data, and so back to our web scraping problem!
Web Scraping of Javascript-Rendered Content Using R
Certain R packages have functions that will make scraping this data very easy; in fact, much easier than it was to scrape the Premier League match data from HTML code. We will need to install the packages httr, jsonlite, and (if we have not done so already) rvest and xlsx:
install.packages(c("httr", "jsonlite", "rvest", "xlsx"))
Now, to harvest Javascript-rendered data from a webpage it is absolutely essential to use Developer tools in Google Chrome (or a similar tool in another browser). First open the Historical Results page and then press Ctrl+Shift+i to open Developer tools. Click on the ‘Network’ tab in Developer tools and then click on the most recent draw (No. 2013 as I write this). You will see two items appear in Developer tools; we are interested in the item of type ‘xhr’ circled in red in the screenshot below. This gives us a record of the request that fetched the dynamic content when we clicked on the draw.
Right-click on the text beginning with ‘index.php?’ and select ‘Copy’ and then ‘Copy link address.’ Paste this URL into your R script and save it as a character string:
requesturl <- "https://www.nationallottery.co.za/index.php?task=results.redirectPageURL&Itemid=265&option=com_weaver&controller=lotto-history"
Now, left-click on the text beginning with ‘index.php?’ and you will see a tab called ‘Headers’ open with some information under ‘General’. Often, Javascript content is rendered with a ‘GET’ request, but in this case we can see that the Request Method is ‘POST’. (This is the same kind of request that is made when you submit a form on a webpage.) Since we have the URL and method of the request, we are nearly ready to run this request within R and thus scrape the data for this draw. But how does the webpage know which draw’s results to return? (If you open the results of another draw with Developer tools running, you will see that the Request URL is the same.)
Scroll down to the bottom of the Headers tab and you will see three variable names and values under ‘Form Data’:
The variable drawNumber, with a value of 2013, is clearly what is going to tell the script which draw’s results to return. Now we are ready to emulate the POST request within R using the POST function from the httr package. Among the arguments we pass to the function are the request URL and the three variables that we found under ‘Form Data’ in Developer tools.
response <- httr::POST(url = requesturl, body = list(gameName = "LOTTO", drawNumber = "2013", isAjax = "true"), encode = "form")
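Before parsing the response, it is worth a quick check that the request actually succeeded; the HTTP status code should be 200:
# Check that the POST request succeeded (status code 200 means success)
httr::status_code(response)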
Now we have extracted the data, but it is encoded in JSON (JavaScript Object Notation) format. Fortunately, the function parse_json in R package jsonlite can parse it for us:
myjson <- jsonlite::parse_json(response)
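Before extracting anything from myjson, it can help to get an overview of how the parsed object is organised. The str function prints a compact summary of a nested list; the max.level argument limits how many levels deep it goes:
# Print a compact summary of the nested list structure (three levels deep)
str(myjson, max.level = 3)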
The object myjson is a list containing another list object named data, containing two other list objects named drawDetails and totalWinnerRecord. The drawDetails object has the data we need stored under various names like ball1 (the first ball drawn), ball2, etc. We can convert all of this to a vector with one simple step:
mydrawdata <- unlist(myjson$data$drawDetails)
mydrawdata
## drawNumber drawDate nextDrawDate ball1
## "2013" "2020/04/15" "2020/04/18" "24"
## ball2 ball3 ball4 ball5
## "48" "33" "51" "28"
## ball6 bonusBall div1Winners div1Payout
## "36" "46" "0" "0"
## div2Winners div2Payout div3Winners div3Payout
## "1" "65122.8" "20" "5662.9"
## div4Winners div4Payout div5Winners div5Payout
## "70" "2022.5" "1288" "184.7"
## div6Winners div6Payout div7Winners div7Payout
## "1822" "113.5" "25295" "50"
## div8Winners div8Payout rolloverAmount rolloverNumber
## "19247" "20" "2066940.84" "1"
## totalPrizePool totalSales estimatedJackpot guaranteedJackpot
## "4481277.24" "9958035" "4500000" "0"
## drawMachine ballSet status nwwinners
## "RNG2" "RNG" "published" "0"
## kznwinners fswinners winners millionairs
## "0" "0" "47743" "0"
## gpwinners wcwinners ncwinners ecwinners
## "47743" "0" "0" "0"
## mpwinners lpwinners
## "0" "0"
We now have the data for Draw No. 2013 saved in a convenient format. All that remains is to cycle through all the past LOTTO draws and compile their data into a single spreadsheet. The earliest draw for which results are available on the website is Draw No. 1506, from 6 June 2015. Thus we can proceed as below. (Advanced R programmers will probably prefer to use lapply and the pipe operator %>% from the magrittr package rather than a for loop; a sketch of that alternative is given at the end of this section.)
firstdrawno <- 1506
lastdrawno <- 2018 # Change to most recent draw number
ndraws <- length(firstdrawno:lastdrawno)
lottotable <- matrix(nrow = ndraws, ncol = length(mydrawdata))
jsondat <- vector("list", ndraws)
for (d in firstdrawno:lastdrawno) {
response <- httr::POST(url = requesturl, body = list(gameName = "LOTTO", drawNumber = as.character(d),
isAjax = "true"), encode = "form")
jsondat[[d - firstdrawno + 1]] <- unlist(jsonlite::parse_json(response)$data$drawDetails)
}
lottotable <- as.data.frame(do.call(rbind, jsondat), stringsAsFactors = FALSE)
## Warning in (function (..., deparse.level = 1) : number of columns of result is
## not a multiple of vector length (arg 1)
names(lottotable)
## [1] "drawNumber" "drawDate" "nextDrawDate"
## [4] "ball1" "ball2" "ball3"
## [7] "ball4" "ball5" "ball6"
## [10] "bonusBall" "div1Winners" "div1Payout"
## [13] "div2Winners" "div2Payout" "div3Winners"
## [16] "div3Payout" "div4Winners" "div4Payout"
## [19] "div5Winners" "div5Payout" "div6Winners"
## [22] "div6Payout" "div7Winners" "div7Payout"
## [25] "div8Winners" "div8Payout" "rolloverAmount"
## [28] "rolloverNumber" "totalPrizePool" "totalSales"
## [31] "estimatedJackpot" "guaranteedJackpot" "drawMachine"
## [34] "ballSet" "status" "nwwinners"
## [37] "kznwinners" "fswinners" "winners"
## [40] "millionairs" "gpwinners" "wcwinners"
## [43] "ncwinners" "ecwinners" "mpwinners"
## [46] "lpwinners"
# Earlier draws have fewer fields (hence the warning above); take the column names from the most recent draw
names(lottotable) <- names(jsondat[[length(jsondat)]])
# Change columns containing numbers from character to numeric
numericcols <- c(1, 4:32, 36:37)
lottotable[numericcols] <- sapply(lottotable[numericcols], as.numeric)
# Write to Excel
xlsx::write.xlsx2(lottotable[1:37], file = "LOTTO_draw_results.xlsx", row.names = FALSE)
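One caveat: the xlsx package depends on Java (via rJava). If that is an obstacle on your system, writing a plain CSV file with base R works just as well:
# Alternative output that avoids the Java dependency of the xlsx package
write.csv(lottotable[1:37], file = "LOTTO_draw_results.csv", row.names = FALSE)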
The Excel spreadsheet LOTTO_draw_results.xlsx should now exist in your working directory and contain the results of all LOTTO draws from 2015 to the present, as in the screenshot below. (If you are not sure of your working directory, just run the command getwd().)
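As mentioned earlier, more advanced R programmers might prefer to replace the download loop with lapply and the pipe operator %>% from the magrittr package. A minimal sketch of that alternative (functionally equivalent to the for loop above) could look like this:
library(magrittr)  # provides the pipe operator %>%
# Download and parse the details of every draw, then bind the results together
jsondat <- lapply(firstdrawno:lastdrawno, function(d) {
  httr::POST(url = requesturl,
             body = list(gameName = "LOTTO", drawNumber = as.character(d),
                         isAjax = "true"),
             encode = "form") %>%
    jsonlite::parse_json() %>%
    {unlist(.$data$drawDetails)}
})
lottotable <- as.data.frame(do.call(rbind, jsondat), stringsAsFactors = FALSE)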
Analysis
We can easily verify the total number of jackpot winners over all the draws covered by our data set:
attach(lottotable)
njackpotwinners <- sum(div1Winners)
njackpotwinners
## [1] 117
# Ratio of number of jackpot winners to number of draws
njackpotwinners / ndraws
## [1] 0.2280702
This works out to roughly one jackpot winner for every four to five draws. But how many total plays have occurred in that time? The data contains the total sales for each draw, so if we know the price per play we can work out the number of plays. The price per play is currently R5.00, but up until Draw No. 1607 (21 May 2016) it was R3.50. Thus,
totalplays <- rep(NA_real_, ndraws)
for (i in 1:ndraws) {
if (drawNumber[i] <= 1607) {
totalplays[i] <- totalSales[i] / 3.5
} else {
totalplays[i] <- totalSales[i] / 5
}
}
sum(totalplays)
## [1] 1879514323
The total number of plays is approaching two billion! So, does the observed frequency of jackpot winners fit the probability distribution? We will have to consider Draws No. 1506-1731 (when there were 49 balls) separately from Draws No. 1732 to the present (with 52 balls). Treating each play as an independent trial with jackpot probability 1 in 13,983,816 (or 1 in 20,358,520 after the change to 52 balls), we can compare the observed counts of jackpot-winning and non-winning plays against these probabilities using a Pearson chi-squared goodness-of-fit test:
obs_jackpots49 <- sum(div1Winners[drawNumber <= 1731])
obs_nojackpot49 <- sum(totalplays[drawNumber <= 1731]) - obs_jackpots49
chisq.test(x = c(obs_jackpots49, obs_nojackpot49), p = c(1, choose(49, 6) - 1), rescale.p = TRUE)
##
## Chi-squared test for given probabilities
##
## data: c(obs_jackpots49, obs_nojackpot49)
## X-squared = 1.3928, df = 1, p-value = 0.2379
obs_jackpots52 <- sum(div1Winners[drawNumber >= 1732])
obs_nojackpot52 <- sum(totalplays[drawNumber >= 1732]) - obs_jackpots52
chisq.test(x = c(obs_jackpots52, obs_nojackpot52), p = c(1, choose(52, 6) - 1), rescale.p = TRUE)
##
## Chi-squared test for given probabilities
##
## data: c(obs_jackpots52, obs_nojackpot52)
## X-squared = 0.80276, df = 1, p-value = 0.3703
Both chi-squared goodness-of-fit tests have \(p\)-values well above 0.05, so the observed number of jackpot wins in the LOTTO game appears consistent with the theoretical probabilities.
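As an extra sanity check on what these tests are comparing, we can place the observed jackpot counts next to the expected counts implied by the theoretical probabilities:
# Expected numbers of jackpot-winning plays under the theoretical probabilities
expected49 <- sum(totalplays[drawNumber <= 1731]) / choose(49, 6)
expected52 <- sum(totalplays[drawNumber >= 1732]) / choose(52, 6)
round(c(observed49 = obs_jackpots49, expected49 = expected49,
        observed52 = obs_jackpots52, expected52 = expected52), 1)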
Finally, let us look at how the number of LOTTO plays per draw has changed over time.
# Fixing an error in the year in the `drawDate` of Draws No. 1514-1524
drawDate[drawNumber %in% 1514:1524] <- gsub("2016", "2015", drawDate[drawNumber %in% 1514:1524])
par(mar = c(4, 4, 1, 1))
plot(as.POSIXct(drawDate), totalplays, type = "l", xlab = "Draw Date", ylab = "No. of Plays")
abline(v = as.POSIXct(drawDate[drawNumber == 1607]), lty = "dotted")
We can observe a sudden downward shift in mid-2016 at the point when the price per play increased from R3.50 to R5.00 (represented by the dotted line on the graph). We can also observe a huge spike in the number of plays for LOTTO Draw No. 1783 (27 Jan 2018). A glance at the data reveals that this draw had a guaranteed jackpot prize of R110 million, the largest guaranteed jackpot in the game’s history. There also seems to be a slight downward trend in plays per draw from 2018 to 2020, which could be due to the country’s economic struggles. There is also a sharp drop in plays in the most recent draws, in March-May 2020, which is of course due to the national lockdown. During this period it was still possible to play LOTTO online but not in-store.
We can also observe some interesting patterns in when people tend to play LOTTO more.
dayofweek <- weekdays(as.POSIXct(drawDate))
# Median number of plays on Wednesday draws
median(totalplays[dayofweek == "Wednesday"])
## [1] 3051419
# Median number of plays on Saturday draws
median(totalplays[dayofweek == "Saturday"])
## [1] 3940625
# Median number of plays by day of month
dayofmonth <- substr(drawDate, start = 9, stop = 10)
plot(aggregate(totalplays ~ dayofmonth, FUN = median), pch = 20,
xlab = "Day of Month", ylab = "Median Number of Plays")
We observe that the median number of plays for Saturday draws is nearly 1 million more than the median number of plays for Wednesday draws. We can also see that if we compute the median number of plays for each day of the month, the medians are generally higher at the end and beginning of the month than in the middle of the month. This makes sense given that most South Africans receive their salary near or at the end of the month.
Conclusion
Over the past two articles we have learned how to scrape or harvest data from webpages where the data is stored statically in the HTML code and from webpages where the data is rendered dynamically by Javascript. Some R programming ability was required to extract exactly the data we need from the scraped objects, format it nicely into a data.frame, and analyse it. However, very little knowledge was required of HTML, Javascript, or generally how the Internet works. We just needed to get a little bit of information about the webpages we wanted to scrape from Developer tools in Google Chrome (or a similar tool in another browser). The bottom line: web scraping is a very powerful tool in the data scientist’s arsenal, and a surprisingly easy one to use.