Thomas Farrar
21/04/2020
Big Data and Web Scraping
According to some experts, the amount of data in the world is expected to grow by over 60% per year between now and 2025. It is also predicted that 80% of the world’s data will be unstructured by 2025. Structured data is data that is organised in a predefined structure, such as in tables; unstructured data is the rest. Thus, for instance, an MS Excel spreadsheet would typically contain structured data, whereas a Twitter feed would typically contain unstructured data.
In this article we will introduce the reader to web scraping, which refers to extracting data from the World Wide Web and (where necessary) cleaning the data to get it in a usable (e.g. structured) format. In light of the above predictions, it is clear that web scraping will be an increasingly useful skill going forward.
Before going further, an ethical disclaimer is needed. Web scraping could be a tool in the toolkit of an unscrupulous hacker. However, what is being advocated here is not stealing secure, proprietary data but rather harvesting publicly available data. The data we will be scraping is already accessible to anyone with an Internet connection and a web browser; we are just accessing it in a more efficient and systematic way.
Harvesting HTML-Rendered Content in R
In order to proceed you will need to download R and ideally RStudio as your programming user interface.
We will see that you can get up and running as a web scraper in R very quickly, without a lot of expertise either in how the Internet works or in R programming. This is because a lot of the functionality is available in ready-made R packages. I will attempt to very briefly explain how web scraping works, and it will necessarily be in layman’s terms because I am no expert in it myself.
The basic structure of most websites is written in HTML code. When you load a webpage in your browser, your browser interprets the HTML code and renders the webpage for you. Traditionally, any data that existed on the webpage (e.g., tables, numbers, text) was already present in the HTML code, but wrapped in various ‘tags’ that determine how the content is formatted. For instance, if a piece of text was to be displayed as a clickable link, it would be wrapped in <a> and </a> tags.
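For readers who have never looked at raw HTML, here is a tiny, made-up fragment (purely illustrative, stored as an R character string; it is not taken from any real website). A browser would display only the sentence ‘Visit the example home page.’, with ‘example home page’ rendered as a clickable link; the tags themselves are never shown.
mysnippet <- '<html>
  <head><title>My Page</title></head>
  <body>
    <p>Visit the <a href="https://www.example.com">example home page</a>.</p>
  </body>
</html>'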
The R packages rvest and xml2 contain the functions we need to very easily extract the HTML code and HTML-rendered content from any publicly accessible webpage. We first need to install the rvest package (which will automatically result in xml2 also being installed). We will also need the xlsx package to write our results to an Excel spreadsheet.
install.packages("rvest")
install.packages("xlsx")
Then we can call the function read_html from the xml2 package and pass it any webpage URL, and it will save the HTML code in an R object.
A Basic Example: Google South Africa Homepage
As a first example, let us scrape the Google South Africa homepage.
myurl <- "https://www.google.co.za"
mygoogle <- xml2::read_html(myurl)
The object mygoogle now stores the HTML code for the Google homepage. A simple way to view the content we have scraped is with the function rvest::html_text, which eliminates the HTML tags and preserves the content in between them as raw text in a character vector. If we do this with the Google homepage and display the result, it is not very impressive: we basically get a lot of what looks like gibberish.
googlechar <- rvest::html_text(mygoogle)
googlechar
## [1] "Google(function(){window.google={kEI:'wdeeXoHtEI6bkwWcx6DoBA',kEXPI:'31',kBL:'2amb'};google.sn='webhp';google.kHL='en-ZA';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var c;a&&(!a.getAttribute||!(c=a.getAttribute(\"eid\")));)a=a.parentNode;return c||google.kEI};google.getLEI=function(a){for(var c=null;a&&(!a.getAttribute||!(c=a.getAttribute(\"leid\")));)a=a.parentNode;return c};google.ml=function(){return null};google.time=function(){return Date.now()};google.log=function(a,c,b,d,g){if(b=google.logUrl(a,c,b,d,g)){a=new Image;var e=google.lc,f=google.li;e[f]=a;a.onerror=a.onload=a.onabort=function(){delete e[f]};google.vel&&google.vel.lu&&google.vel.lu(b);a.src=b;google.li=f+1}};google.logUrl=function(a,c,b,d,g){var e=\"\",f=google.ls||\"\";b||-1!=c.search(\"&ei=\")||(e=\"&ei=\"+google.getEI(d),-1==c.search(\"&lei=\")&&(d=google.getLEI(d))&&(e+=\"&lei=\"+d));d=\"\";!b&&google.cshid&&-1==c.search(\"&cshid=\")&&\"slh\"!=a&&(d=\"&cshid=\"+google.cshid);b=b||\"/\"+(g||\"gen_204\")+\"?atyp=i&ct=\"+a+\"&cad=\"+c+e+f+\"&zx=\"+google.time()+d;/^http:/i.test(b)&&\"https:\"==window.location.protocol&&(google.ml(Error(\"a\"),!1,{src:b,glmm:1}),b=\"\");return b};}).call(this);(function(){google.y={};google.x=function(a,b){if(a)var c=a.id;else{do c=Math.random();while(google.y[c])}google.y[c]=[a,b];return!1};google.lm=[];google.plm=function(a){google.lm.push.apply(google.lm,a)};google.lq=[];google.load=function(a,b,c){google.lq.push([[a],b,c])};google.loadAll=function(a,b){google.lq.push([a,b])};}).call(this);google.f={};(function(){\ndocument.documentElement.addEventListener(\"submit\",function(b){var a;if(a=b.target){var c=a.getAttribute(\"data-submitfalse\");a=\"1\"==c||\"q\"==c&&!a.elements.q.value?!0:!1}else a=!1;a&&(b.preventDefault(),b.stopPropagation())},!0);document.documentElement.addEventListener(\"click\",function(b){var a;a:{for(a=b.target;a&&a!=document.documentElement;a=a.parentElement)if(\"A\"==a.tagName){a=\"1\"==a.getAttribute(\"data-nohref\");break a}a=!1}a&&b.preventDefault()},!0);}).call(this);\nvar a=window.location,b=a.href.indexOf(\"#\");if(0<=b){var c=a.href.substring(b+1);/(^|&)q=/.test(c)&&-1==c.indexOf(\"#\")&&a.replace(\"/search?\"+c.replace(/(^|&)fp=[^&]*/g,\"\")+\"&cad=h\")};#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}\nbody,td,a,p,.h{font-family:arial,sans-serif}body{margin:0;overflow-y:scroll}#gog{padding:3px 8px 0}td{line-height:.8em}.gac_m td{line-height:17px}form{margin-bottom:20px}.h{color:#36c}.q{color:#00c}.ts td{padding:0}.ts{border-collapse:collapse}em{font-weight:bold;font-style:normal}.lst{height:25px;width:496px}.gsfi,.lst{font:18px arial,sans-serif}.gsfs{font:17px arial,sans-serif}.ds{display:inline-box;display:inline-block;margin:3px 0 4px;margin-left:4px}input{font-family:inherit}body{background:#fff;color:#000}a{color:#11c;text-decoration:none}a:hover,a:active{text-decoration:underline}.fl a{color:#36c}a:visited{color:#551a8b}.sblc{padding-top:5px}.sblc a{display:block;margin:2px 0;margin-left:13px;font-size:11px}.lsbb{background:#eee;border:solid 1px;border-color:#ccc #999 #999 
#ccc;height:30px}.lsbb{display:block}.ftl,#fll a{display:inline-block;margin:0 12px}.lsb{background:url(/images/nav_logo229.png) 0 -261px repeat-x;border:none;color:#000;cursor:pointer;height:30px;margin:0;outline:0;font:15px arial,sans-serif;vertical-align:top}.lsb:active{background:#ccc}.lst:focus{outline:none}(function(){var src='/images/nav_logo229.png';var iesg=false;document.body.onload = function(){window.n && window.n();if (document.images){new Image().src=src;}\nif (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf.q.focus();}\n}\n})(); Search Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in (function(){var id='tsuid1';document.getElementById(id).onclick = function(){if (this.form.q.value){this.checked = 1;if (this.form.iflsig)this.form.iflsig.disabled = false;}\nelse top.location='/doodles/';};})();Advanced search(function(){var a,b=\"1\";if(document&&document.getElementById)if(\"undefined\"!=typeof XMLHttpRequest)b=\"2\";else if(\"undefined\"!=typeof ActiveXObject){var c,d,e=[\"MSXML2.XMLHTTP.6.0\",\"MSXML2.XMLHTTP.3.0\",\"MSXML2.XMLHTTP\",\"Microsoft.XMLHTTP\"];for(c=0;d=e[c++];)try{new ActiveXObject(d),b=\"2\"}catch(h){}}a=b;if(\"2\"==a&&-1==location.search.indexOf(\"&gbv=2\")){var f=google.gbvu,g=document.getElementById(\"gbv\");g&&(g.value=a);f&&window.setTimeout(function(){location.href=f},0)};}).call(this);.szppmdbYutt__middle-slot-promo{font-size:small;margin-bottom:32px}.szppmdbYutt__middle-slot-promo a.ZIeIlb{display:inline-block;text-decoration:none}.szppmdbYutt__middle-slot-promo img{border:none;margin-right:5px;vertical-align:middle}Stay Home. Save Lives#gws-output-pages-elements-homepage_additional_languages__als{font-size:small;margin-bottom:24px}#SIvCob{display:inline-block;line-height:28px;}#SIvCob a{padding:0 3px;}.H6sW5{display:inline-block;margin:0 2px;white-space:nowrap}.z4hgWe{display:inline-block;margin:0 2px}Google offered in: Afrikaans Sesotho isiZulu IsiXhosa Setswana Northern Sotho Advertising<U+00A0>ProgramsBusiness SolutionsAbout GoogleGoogle.com© 2020 - Privacy - Terms(function(){window.google.cdo={height:0,width:0};(function(){var a=window.innerWidth,b=window.innerHeight;if(!a||!b){var 
c=window.document,d=\"CSS1Compat\"==c.compatMode?c.documentElement:c.body;a=d.clientWidth;b=d.clientHeight}a&&b&&(a!=google.cdo.width||b!=google.cdo.height)&&google.log(\"\",\"\",\"/client_204?&atyp=i&biw=\"+a+\"&bih=\"+b+\"&ei=\"+google.kEI);}).call(this);})();(function(){google.kEXPI='0,202123,3,1151620,5663,730,224,5104,207,3204,10,1051,175,364,925,510,4,60,239,337,241,309,74,246,5,860,99,51,9,335,108,88,334,42,88,108,45,337,34,138,5,89,278,98,98,419133,706743,1197733,416,329118,1294,12383,4855,32691,15248,864,28687,369,8819,8384,4859,996,365,9290,3023,4745,3118,7915,1808,4020,978,7931,5192,103,2056,920,873,1217,2382,593,2784,3646,11306,2902,319,4518,2777,520,399,2277,8,2796,1593,1279,390,1822,530,149,1103,841,518,1137,1,277,57,48,158,4100,312,1136,3,2063,606,1839,184,1777,143,377,1947,245,502,1482,93,328,1284,17,446,2480,2246,474,1339,29,719,1039,3229,2843,7,438,379,4782,8547,2662,642,1407,1042,2459,1226,1462,3935,1274,108,1712,1697,906,2,940,533,313,1769,2397,1387,3583,449,226,657,338,830,840,480,606,1349,3,12,334,201,29,157,813,865,378,1634,1908,438,266,149,189,354,2959,502,1,1539,47,268,131,28,130,1,70,795,1228,2598,1345,46,1093,82,569,4,1337,191,17,619,1,22,283,69,4,279,381,629,152,65,243,566,188,22,271,128,95,367,284,385,2,18,5,2,12,22,64,129,941,413,272,1817,278,211,120,174,142,525,118,94,15,117,573,127,4,634,15,6,169,237,19,126,336,702,724,88,165,43,834,678,1131,788,5824815,3277,32,1802585,6996022,549,333,444,1,2,80,1,900,896,1,8,1,2,2551,1,748,141,59,736,563,1,4265,1,1,1,1,137,1,1193,1259,142,3,5,91,63,4,109,1,46,7,4,2,10,6,1,1,36,1,1,1,3,20742588,3220019';})();(function(){var u='/xjs/_/js/k\\x3dxjs.hp.en.8jTnuCMMiFI.O/m\\x3dsb_he,d/am\\x3dAAMCbAQ/d\\x3d1/rs\\x3dACT90oFSe3F4u6cscTHTUw58MPqxYvNkww';\nsetTimeout(function(){var b=document;var a=\"SCRIPT\";\"application/xhtml+xml\"===b.contentType&&(a=a.toLowerCase());a=b.createElement(a);a.src=u;google.timers&&google.timers.load&&google.tick&&google.tick(\"load\",\"xjsls\");document.body.appendChild(a)},0);})();(function(){window.google.xjsu='/xjs/_/js/k\\x3dxjs.hp.en.8jTnuCMMiFI.O/m\\x3dsb_he,d/am\\x3dAAMCbAQ/d\\x3d1/rs\\x3dACT90oFSe3F4u6cscTHTUw58MPqxYvNkww';})();function _DumpException(e){throw e;}\nfunction _F_installCss(c){}\n(function(){google.jl={em:[],emw:false,lls:'default',pdt:0,snet:true,uwp:true};})();(function(){var pmc='{\\x22d\\x22:{},\\x22sb_he\\x22:{\\x22agen\\x22:true,\\x22cgen\\x22:true,\\x22client\\x22:\\x22heirloom-hp\\x22,\\x22dh\\x22:true,\\x22dhqt\\x22:true,\\x22ds\\x22:\\x22\\x22,\\x22ffql\\x22:\\x22en\\x22,\\x22fl\\x22:true,\\x22host\\x22:\\x22google.co.za\\x22,\\x22isbh\\x22:28,\\x22jsonp\\x22:true,\\x22msgs\\x22:{\\x22cibl\\x22:\\x22Clear Search\\x22,\\x22dym\\x22:\\x22Did you mean:\\x22,\\x22lcky\\x22:\\x22I\\\\u0026#39;m Feeling Lucky\\x22,\\x22lml\\x22:\\x22Learn more\\x22,\\x22oskt\\x22:\\x22Input tools\\x22,\\x22psrc\\x22:\\x22This search was removed from your \\\\u003Ca href\\x3d\\\\\\x22/history\\\\\\x22\\\\u003EWeb History\\\\u003C/a\\\\u003E\\x22,\\x22psrl\\x22:\\x22Remove\\x22,\\x22sbit\\x22:\\x22Search by image\\x22,\\x22srch\\x22:\\x22Google Search\\x22},\\x22ovr\\x22:{},\\x22pq\\x22:\\x22\\x22,\\x22refpd\\x22:true,\\x22rfs\\x22:[],\\x22sbpl\\x22:16,\\x22sbpr\\x22:16,\\x22scd\\x22:10,\\x22stok\\x22:\\x22NyTRBYu_yyJ-6iCGKbuJZboi-S4\\x22,\\x22uhde\\x22:false}}';google.pmc=JSON.parse(pmc);})();"
This is a reality of web scraping: we are always going to have a lot of useless data mixed in with the useful data in our output, so we need ways of separating the wheat from the chaff. In R, if we have converted the HTML document to raw text, we can use string manipulation functions such as gregexpr and substr to locate and extract the useful content. Alternatively, we can use the function html_nodes in the rvest package to locate content stored within a particular HTML tag. As a guide to doing so, it is useful to open the page and its source code in our web browser. In Google Chrome, we can visit the Google homepage and then press Ctrl+Shift+i to open Developer tools. Then we can scroll through the HTML code and see the HTML tags surrounding various content.
For example, suppose we want to know the languages in which Google is available in South Africa. Using the raw text approach, we notice that the words ‘Google offered in’ appear before these languages. We can first remove all characters other than alphanumeric characters, punctuation, and spaces (otherwise some special characters will cause an error), then search for the location of the expression ‘Google offered in’ and take a substring beginning at that location and ending 120 characters later.
googlechar2 <- gsub("[^[:alnum:][:punct:][:space:]]", "", googlechar)
textlocation <- gregexpr("Google offered in", googlechar2)[[1]]
substr(googlechar2, start = textlocation, stop = textlocation + 120)
## [1] "Google offered in: Afrikaans Sesotho isiZulu IsiXhosa Setswana Northern Sotho AdvertisingProgramsBusines"
Voilà, the list of languages. Now, if instead we wanted to extract the title of the webpage, the raw text approach would be less effective: the word ‘title’ appears nowhere in the resulting string, even after converting it to lowercase. (Note that gregexpr returns -1 when it does not find the expression in the string.)
gregexpr("title", tolower(googlechar2))[[1]]
## [1] -1
## attr(,"match.length")
## [1] -1
If we know a little bit of HTML, or otherwise if we inspect the HTML code in Google Chrome Developer tools, we will observe that the webpage title appears within a <title> tag. The tags are eliminated when we call html_text. However, we can instead call html_nodes, which extracts all instances of a specific tag or node, and then run html_text on the result to get the text within the <title> tag.
titletag <- rvest::html_nodes(mygoogle, "title")
titletag
## {xml_nodeset (1)}
## [1] <title>Google</title>\n
rvest::html_text(titletag)
## [1] "Google"
And voilà! The title of the page is ‘Google’.
Scraping Barclays Premier League Match Data for 2019-2020
Okay, scraping the Google homepage is not that interesting. However, for fans of English football, scraping all the match data for the 2019-2020 Premier League season may provide temporary relief of withdrawal symptoms. First, we need a website that stores the match data in static HTML format. Fox Sports Australia does the trick. You can see on this page the result of every match during the season, and can filter by team and/or round (matchday no.). If you hover over a particular match result, you can see a link to a Match Centre page with more detailed results, such as Liverpool’s 4-1 defeat of Norwich City on 10 August 2019. There, on the Stats tab, we can see various statistics for both teams such as Possession, Shots, Passing, etc. At the bottom we can also see various statistics for each individual player in a table, one for each team. Our goal here is to create a spreadsheet storing the Match Centre statistics for all of the matches played this season.
Downloading the Match Centre HTML Documents
If we examine the URL for the Match Centre page, we can see that the page for each match is uniquely specified by its last few characters, in the format EPL2019-20rrmm/stats, where rr is the round number (from 01 to 29, the last matchday played before play was suspended) and mm is the match number on the day (from 01 to 10, since there are 20 teams).
Thus, by cycling through the various values of rr and mm, we can scrape the data for all matches played to date. We will store the HTML document for each Match Centre page in a list object called fullhtml.
totmatchdays <- 29                          # matchdays played before the season was suspended
ro <- c(paste0("0", 1:9), 10:totmatchdays)  # round numbers as two-character strings
ma <- c(paste0("0", 1:9), 10)               # match numbers as two-character strings
nmatch <- totmatchdays * 10                 # 10 matches per matchday
urls <- rep(NA, nmatch)
fullhtml <- vector("list", nmatch)
i <- 0
for (r in ro) {
  for (m in ma) {
    i <- i + 1
    print(i)  # progress indicator
    # Two postponed matches were renumbered as match 11 of rounds 25 and 26 (explained below)
    if (r == "18" && m == "06") {
      urls[i] <- "https://www.foxsports.com.au/football/premier-league/match-centre/EPL2019-202511/stats"
    } else if (r == "26" && m == "05") {
      urls[i] <- "https://www.foxsports.com.au/football/premier-league/match-centre/EPL2019-202611/stats"
    } else {
      urls[i] <- paste0("https://www.foxsports.com.au/football/premier-league/match-centre/EPL2019-20",
                        r, m, "/stats")
    }
    # Matches 01 and 03 of round 28 were postponed and not yet played, so skip them
    if (!(r == "28" && (m == "01" || m == "03"))) {
      loadedcorrectly <- FALSE
      while (loadedcorrectly == FALSE) {
        # Re-download until the page contains the expected numbers of <title> and <figcaption> tags
        fullhtml[[i]] <- xml2::read_html(urls[i])
        titlenodes <- rvest::html_nodes(fullhtml[[i]], "title")
        figcaptionnodes <- rvest::html_nodes(fullhtml[[i]], "figcaption")
        if (length(titlenodes) == 2 && length(figcaptionnodes) == 25) loadedcorrectly <- TRUE
      }
    }
  }
}
Note that, by trial and error, I discovered that there was no match no. 06 in round 18 and no match no. 05 in round 26; these two matches were apparently postponed and rescheduled to other matchdays, and consequently there is a match no. 11 in rounds 25 and 26. We handled this issue using if statements. Also, matches no. 01 and 03 in round 28 were postponed and have not yet been played; thus the total number of matches played to date is 288. We now have the Match Centre HTML document for the ith match stored as fullhtml[[i]]. The while loop is designed to ensure that the HTML document has downloaded correctly before continuing to the next match, as the downloads quite often fail.
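One caveat (my own addition, not part of the original workflow): if a connection fails outright, read_html throws an error rather than returning an incomplete document, which would halt the loop. A hedged sketch of a more defensive download helper, wrapping the call in tryCatch with a short pause between attempts, is shown below; the function name safe_read is arbitrary.
safe_read <- function(url, max_tries = 5) {
  for (attempt in seq_len(max_tries)) {
    # Return NULL instead of stopping if the download fails
    doc <- tryCatch(xml2::read_html(url), error = function(e) NULL)
    if (!is.null(doc)) return(doc)
    Sys.sleep(2)  # brief pause before retrying
  }
  stop("Failed to download ", url, " after ", max_tries, " attempts")
}
One could then use fullhtml[[i]] <- safe_read(urls[i]) in place of the plain read_html call inside the while loop.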
Now we have scraped the HTML for all 288 matches; all that remains is to process the data. In this case we are going to use the html_nodes approach rather than the html_text approach, because most of the information we need is stored within identifiable HTML tags. To see this, we need to open the Match Centre page for any one of the matches (say, that first Liverpool-Norwich match) and press Ctrl+Shift+i to open Developer tools.
Extracting the Team Names
If we click on the Elements tab and press Ctrl+F to do a search, we can search for ‘title’ and find the <title> tag, which contains the text Liverpool vs Norwich City Live Statistics - Premier League Round 1, 2019. Thus, we observe that we can get the names of the home and away teams respectively by taking the string within the <title> tag and extracting the word(s) before ‘vs’ and the word(s) between ‘vs’ and ‘Live Statistics’:
i <- 1 # For first match only
mytitle <- rvest::html_text(rvest::html_nodes(fullhtml[[i]], "title"))
hometeam <- substr(mytitle[1], start = 1, stop = gregexpr("vs", mytitle[1])[[1]] - 2)
awayteam <- substr(mytitle[1], start = gregexpr("vs", mytitle[1])[[1]] + 3,
stop = gregexpr("Live Statistics", mytitle[1])[[1]] - 2)
Extracting the Match Score
Next, we need the match score. Returning to Google Chrome, we see the 4-1 score displayed just after the line ‘Saturday August 10 5:00am AEST, Anfield’. If we press Ctrl+F in Developer tools and search for ‘Anfield’, we find it in an element <div class="styles__GameDetails-c8wynr-2 gocbqD">Saturday August 10 5:00am AEST, Anfield</div>. A few lines below this we find an element with code <div class="styles__MatchScore-sc-3rdpqd-0 fcKRjY">4</div>. This ‘4’ is the ‘4’ in the 4-1 score line. Shortly below that, we have <div class="styles__MatchScore-sc-3rdpqd-0 fcKRjY">1</div>, representing the ‘1’ (Norwich City’s goal tally). Thus, we observe that the match scores are in <div> elements with the string MatchScore included in their class attributes. We can use this information together with rvest functions to find the <div> tags containing the scores and extract the scores from within them:
mydivs <- rvest::html_nodes(fullhtml[[i]], "div")
mydivclass <- rvest::html_attr(mydivs, name = "class")
mydivtext <- rvest::html_text(mydivs)
matchscorelocation <- which(gregexpr("MatchScore", mydivclass) != -1)
scores <- mydivtext[matchscorelocation]
scores
## [1] "4" "1"
The html_nodes call extracts all <div> tags in the document (there are 883). The html_attr call gets the class attribute of each of these <div> tags. The html_text call gets the raw text within each of the <div> tags in the document. The command which(gregexpr("MatchScore", mydivclass) != -1) tells us which of the 883 <div> tags contain the string MatchScore within their class attribute; for the first match, the values are 717 and 725. The line scores <- mydivtext[matchscorelocation] gets the raw text within the 717th and 725th <div> tags and saves it to the variable scores. As we can see from the output, it has correctly recorded the home team’s and away team’s goal tallies respectively.
Extracting the Match Statistics
Next, we need the match statistics. The Match Centre report gives 14 statistics for the two sides: Possession (%), Territory (%), Shots (on target, off target, blocked), Passing (passes completed, corners), Defence (effective tackles, clearances, saves), and Discipline (offsides, fouls conceded, yellow cards, red cards).
For possession, we observe that Liverpool had 55% possession in the first match. Using Ctrl+F in Developer tools, we find the string ‘55%’ in the following tag: <div class="styles__BaseFont-sc-1cu7fqd-0 styles__TeamLabel-sc-1cu7fqd-4 lgyDTW">55%</div>. The string TeamLabel seems to be unique to the possession values, so we can use it to extract the possession percentages. Similarly, Liverpool had 60% territory, and we find the string ‘60%’ in the following tag: <div class="styles__StatsComparisonBar-ar67kb-0 igxOSR">60%</div>. The string StatsComparisonBar-ar67kb can be used to extract the territory percentages. Since these are both <div> tags, we can use the variables already defined:
tags, we can use the variables already defined:
possessionlocation <- which(gregexpr("TeamLabel", mydivclass) != -1)
possession <- mydivtext[possessionlocation]
possession
## [1] "55%" "45%"
territorylocation <- which(gregexpr("StatsComparisonBar-ar67kb", mydivclass) != -1)
territory <- mydivtext[territorylocation]
territory
## [1] "60%" "40%"
The rest of the match statistics can all be dealt with at once. Noting that Liverpool had 392 passes completed, we search for ‘392’ in Developer tools until we find it in the following tag: <figcaption aria-label="392 (58%)" class="src__FigCaption-hoDWbO styles__Caption-kybn6w-0 eYOVXI">392</figcaption>. We can soon verify that all 12 remaining statistics, for both teams, are stored within <figcaption> tags. There are only 25 <figcaption> tags in the document, and the final 24 contain the values of the other 12 statistics, so we can just get the raw text contained by the <figcaption> tags, excluding the first one:
myfigcaptions <- rvest::html_nodes(fullhtml[[i]], "figcaption")
otherstats <- rvest::html_text(myfigcaptions)[-1]
otherstats
## [1] "7" "3" "5" "5" "3" "2" "392" "284" "11" "2" "1" "10" "8" "30"
## [15] "2" "4" "0" "5" "9" "9" "0" "2" "0" "0"
By comparing with the Match Centre page in the browser, we verify that the output 7, 3, 5, 5, … corresponds to Liverpool Shots on Target, Norwich City Shots on Target, Liverpool Shots Off Target, Norwich City Shots Off Target, etc.
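As a quick sanity check (this snippet is my own addition; the statistic names are the same ones that will become column headings below), we can pair the alternating values with their labels:
statnames <- c("Shots_On_Target", "Shots_Off_Target", "Shots_Blocked",
               "Passes_Completed", "Corners", "Effective_Tackles", "Clearances",
               "Saves", "Offsides", "Fouls_Conceded", "Yellow_Cards", "Red_Cards")
# Odd positions hold the home team's values, even positions the away team's
data.frame(Stat = statnames,
           Home = otherstats[seq(1, 24, by = 2)],
           Away = otherstats[seq(2, 24, by = 2)])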
Cycling through All the Matches and Writing the Output
We now have a procedure for extracting all of the team-level match statistics from the Match Centre HTML document. It only remains to do this for all i values from 1 to 290 in a for loop (excluding 271 and 273, the matches that have not yet taken place), saving the values for each match in a matrix, and then to write the matrix to a spreadsheet.
matchdata <- matrix(nrow = 0, ncol = 19)
for (i in setdiff(1:nmatch, c(271, 273))) {
# Get name of home team and away team
mytitle <- rvest::html_text(rvest::html_nodes(fullhtml[[i]], "title"))
hometeam <- substr(mytitle[1], start = 1, stop = gregexpr("vs", mytitle[1])[[1]] - 2)
awayteam <- substr(mytitle[1], start = gregexpr("vs", mytitle[1])[[1]] + 3,
stop = gregexpr("Live Statistics", mytitle[1])[[1]] - 2)
# Error checking
if (hometeam == "" || awayteam == "") {
print(i)
stop(paste0(i, "th Team Name missing! Run `fullhtml[[i]] <- xml2::read_html(urls[i])` again for current i"))
}
# Get scores of home team and away team
mydivs <- rvest::html_nodes(fullhtml[[i]], "div")
mydivclass <- rvest::html_attr(mydivs, name = "class")
mydivtext <- rvest::html_text(mydivs)
matchscorelocation <- which(gregexpr("MatchScore", mydivclass) != -1)
scores <- mydivtext[matchscorelocation]
# Get possession and territory values of home team and away team
possessionlocation <- which(gregexpr("TeamLabel", mydivclass) != -1)
possession <- mydivtext[possessionlocation]
territorylocation <- which(gregexpr("StatsComparisonBar-ar67kb", mydivclass) != -1)
territory <- mydivtext[territorylocation]
# Error checking
if (length(possession) == 0 || length(territory) == 0) stop("possession or territory empty! Run `fullhtml[[i]] <- xml2::read_html(urls[i])` again for current i")
# Get other 12 team statistics for home and away team
myfigcaptions <- rvest::html_nodes(fullhtml[[i]], "figcaption")
otherstats <- rvest::html_text(myfigcaptions)[-1]
# Error checking
if (length(otherstats) == 0) stop("otherstats empty! Run `fullhtml[[i]] <- xml2::read_html(urls[i])` again for current i")
# Get matchday (round) number from URL
matchday <- substr(urls[i], start = 77, stop = 78)
# Add row to matchdata matrix with home team's data
matchdata <- rbind(matchdata, t(c(hometeam, awayteam, matchday, TRUE, scores[1],
possession[1], territory[1], otherstats[which(1:length(otherstats) %% 2 == 1)])))
# Add row to matchdata matrix with away team's data
matchdata <- rbind(matchdata, t(c(awayteam, hometeam, matchday, FALSE, scores[2],
possession[2], territory[2], otherstats[which(1:length(otherstats) %% 2 == 0)])))
}
# Create column headings
colnames(matchdata) <- c("Team", "Opponent", "Matchday", "Home", "Score", "Possession_Pct",
"Territory_Pct", "Shots_On_Target", "Shots_Off_Target", "Shots_Blocked",
"Passes_Completed", "Corners", "Effective_Tackles", "Clearances",
"Saves", "Offsides", "Fouls_Conceded", "Yellow_Cards", "Red_Cards")
# Change matrix to data.frame and change stat columns from character to numeric
matchdata <- as.data.frame(matchdata, stringsAsFactors = FALSE)
matchdata[6:7] <- sapply(matchdata[6:7], function(x) gsub("%", "", x))
matchdata[c(3, 5:19)] <- sapply(matchdata[c(3, 5:19)], as.numeric)
# Write to Excel
xlsx::write.xlsx2(matchdata, file = "matchdata2019_20.xlsx", row.names = FALSE)
The Excel spreadsheet matchdata2019_20.xlsx should now exist in your working directory and should contain the data for all 288 matches played so far this season. (If you are not sure of your working directory, just run the command getwd().) It should look like the screenshot below:
In case any Match Centre page failed to download correctly and slipped through the error checking mechanism in the while loop earlier, I have included further error checks at this stage. If R throws one of these (or any other) errors when you run the code, you simply need to run the command fullhtml[[i]] <- xml2::read_html(urls[i]) for the current i value (the one on which the error was thrown) and then run your for loop again.
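If you want extra reassurance that the file was written correctly, a quick optional check (my own addition) is to read it back into R; with 288 matches and two rows per match, the result should have 576 rows and 19 columns.
check <- xlsx::read.xlsx2("matchdata2019_20.xlsx", sheetIndex = 1)
dim(check)  # should report 576 rows (two per match) and 19 columns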
Some Quick Analysis
Now, what fun would it be to create such a cool data set and not do anything with it? For example, maybe you’d like to know how many goals Liverpool has scored this season:
sum(matchdata$Score[matchdata$Team == "Liverpool"])
## [1] 66
(We can verify that this value is correct by comparing it with the ‘GF’ column in the Barclays Premier League Table.) Or maybe we would like a graph of the number of passes completed per game by Manchester City.
plot(matchdata$Passes_Completed[matchdata$Team == "Manchester City"],
type = "p", pch = 20, xlab = "Matchday", ylab = "Passes Completed")
lines(matchdata$Passes_Completed[matchdata$Team == "Manchester City"])
We can see that there seems to be a slight upward trend in number of passes completed per match by Manchester City, with an outlier (very small value) on Matchday 19. Who was Manchester City’s opponent on Matchday 19?
matchdata$Opponent[matchdata$Team == "Manchester City" & matchdata$Matchday == 19]
## [1] "Wolverhampton Wanderers"
Finally, can we get a table with the number of fouls conceded by each team, ranked from least to most?
totalfouls <- aggregate(Fouls_Conceded ~ Team, data = matchdata, FUN = sum)
totalfouls[order(totalfouls$Fouls_Conceded), ]
## Team Fouls_Conceded
## 10 Liverpool 243
## 3 Bournemouth 269
## 13 Newcastle United 276
## 14 Norwich City 278
## 11 Manchester City 283
## 4 Brighton & Hove Albion 288
## 6 Chelsea 291
## 17 Tottenham Hotspur 296
## 19 West Ham United 301
## 15 Sheffield United 303
## 1 Arsenal 307
## 9 Leicester City 309
## 20 Wolverhampton Wanderers 316
## 7 Crystal Palace 317
## 12 Manchester United 320
## 5 Burnley 324
## 2 Aston Villa 327
## 8 Everton 346
## 18 Watford 350
## 16 Southampton 356
We could also create a nice data set from the player-level statistics at the bottom of each Match Centre webpage, but this post is already very long. I will just observe that doing so is much easier, because the player data is already stored in an HTML table on the webpage, which means that we can just run the command rvest::html_table(fullhtml[[i]]) and we will have a list containing the two tables of player statistics from the ith match. It is then just a matter of combining the two tables from each match into one big table.
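For the curious, a rough sketch of that idea might look as follows (assuming the first two tables returned are the home and away player tables; this code was not run for this article):
playertables <- rvest::html_table(fullhtml[[i]], fill = TRUE)
homeplayers <- cbind(Team = hometeam, playertables[[1]])
awayplayers <- cbind(Team = awayteam, playertables[[2]])
playerdata <- rbind(homeplayers, awayplayers)  # one table of player statistics for match i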
Conclusion
In this article we looked at how to do web scraping in R where the content we want to scrape is in static HTML format. In the next article we will look at how to scrape JavaScript-rendered content from webpages, which is more advanced but, thanks to ready-made tools in R, can actually be easier.