class: center, middle, inverse, title-slide .title[ # Web scraping ] .author[ ###
Termeh Shafie ] --- layout: true --- class: middle ## Scraping the web --- ## Scraping the web: what? why? - An increasing amount of data is available on the web -- - These data are provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors -- - Web scraping is the process of extracting this information automatically and transforming it into a structured dataset -- - Two different scenarios: - Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy). - Web APIs (application programming interface): the website offers a set of structured HTTP requests that return JSON or XML files. --- class: middle ## Web Scraping with rvest --- ## Hypertext Markup Language - Most of the data on the web is still largely available as HTML - It is structured (hierarchical / tree based), but it's often not available in a form useful for analysis (flat / tidy). ```html <html> <head> <title>This is a title</title> </head> <body> <p align="center">Hello world!</p> </body> </html> ``` --- ## rvest .pull-left[ - The **rvest** package makes basic processing and manipulation of HTML data straightforward - It's designed to work with pipelines built with `%>%` ] .pull-right[ <img src="img/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" /> ] --- ## Core rvest functions - `read_html` - Read HTML data from a URL or character string - `html_element` - Select a specified element from an HTML document - `html_elements` - Select specified elements from an HTML document - `html_table` - Parse an HTML table into a data frame - `html_text` - Extract tag pairs' content - `html_name` - Extract tags' names - `html_attrs` - Extract all of each tag's attributes - `html_attr` - Extract tags' attribute value by name --- ## SelectorGadget .pull-left-narrow[ - Open-source tool that eases CSS selector generation and discovery - Easiest to use with the [Chrome
Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) - Find out more on the [SelectorGadget vignette](https://cran.r-project.org/web/packages/rvest/vignettes/rvest.html) ] .pull-right-wide[ <img src="img/selector-gadget/selector-gadget.png" width="75%" style="display: block; margin: auto;" /> ] --- ## Using the SelectorGadget <img src="img/selector-gadget/selector-gadget.gif" width="80%" style="display: block; margin: auto;" /> --- <img src="img/selector-gadget/selector-gadget-1.png" width="95%" style="display: block; margin: auto;" /> --- <img src="img/selector-gadget/selector-gadget-2.png" width="95%" style="display: block; margin: auto;" /> --- <img src="img/selector-gadget/selector-gadget-3.png" width="95%" style="display: block; margin: auto;" /> --- <img src="img/selector-gadget/selector-gadget-4.png" width="95%" style="display: block; margin: auto;" /> --- <img src="img/selector-gadget/selector-gadget-5.png" width="95%" style="display: block; margin: auto;" /> --- ## Using the SelectorGadget Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs <img src="img/selector-gadget/selector-gadget.gif" width="65%" style="display: block; margin: auto;" /> --- class: middle ## Scraping top 250 movies on IMDB --- ## Top 250 movies on IMDB Take a look at the source code and look for the `table` tag: <br> http://www.imdb.com/chart/top .pull-left[ <img src="img/imdb-top-250.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="img/imdb-top-250-source.png" width="94%" style="display: block; margin: auto;" /> ] --- ## First check if you're allowed! ```r library(robotstxt) paths_allowed("http://www.imdb.com") ``` ``` ## [1] TRUE ``` vs. e.g. ```r paths_allowed("http://www.facebook.com") ``` ``` ## [1] FALSE ``` --- ## Plan <img src="img/plan.png" width="90%" style="display: block; margin: auto;" /> --- ## Plan 1.
Read the whole page 2. Scrape movie titles and save as `titles` 3. Scrape years movies were made in and save as `years` 4. Scrape IMDB ratings and save as `ratings` 5. Create a data frame called `imdb_top_250` with variables `title`, `year`, and `rating` --- class: middle ## Step 1. Read the whole page --- ## Read the whole page ```r page <- read_html("https://www.imdb.com/chart/top/") page ``` ``` ## {html_document} ## <html lang="en-US" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml"> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ... ## [2] <body>\n<div> <img height="1" width="1" style="display: ... ``` --- ## A webpage in R - Result is a list with 2 elements ```r typeof(page) ``` ``` ## [1] "list" ``` -- - that we need to convert to something more familiar, like a data frame.... ```r class(page) ``` ``` ## [1] "xml_document" "xml_node" ``` --- class: middle ## Step 2. Scrape movie titles and save as `titles` --- ## Scrape movie titles <img src="img/titles.png" width="70%" style="display: block; margin: auto;" /> --- ## Scrape the nodes .pull-left[ ```r page %>% html_elements(".titleColumn a") ``` ] .pull-right[ <img src="img/titles.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extract the text from the nodes .pull-left[ ```r page %>% html_elements(".titleColumn a") %>% html_text() ``` ] .pull-right[ <img src="img/titles.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Save as `titles` .pull-left[ ```r titles <- page %>% html_elements(".titleColumn a") %>% html_text() titles ``` ] .pull-right[ ```r knitr::include_graphics("img/titles.png") ``` ] --- class: middle ## Step 3. 
Scrape years movies were made in and save as `years` --- ## Scrape years movies were made in <img src="img/years.png" width="70%" style="display: block; margin: auto;" /> --- ## Scrape the nodes .pull-left[ ```r page %>% html_elements(".secondaryInfo") ``` ] .pull-right[ <img src="img/years.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extract the text from the nodes .pull-left[ ```r page %>% html_elements(".secondaryInfo") %>% html_text() ``` ] .pull-right[ <img src="img/years.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Clean up the text We need to go from `"(1994)"` to `1994`: - Remove `(` and `)`: string manipulation - Convert to numeric: `as.numeric()` --- ## stringr .pull-left-wide[ - **stringr** provides a cohesive set of functions designed to make working with strings as easy as possible - Functions in stringr start with `str_*()`, e.g. - `str_remove()` to remove a pattern from a string ```r str_remove(string = "jello", pattern = "el") ``` ``` ## [1] "jlo" ``` - `str_replace()` to replace a pattern with another ```r str_replace(string = "jello", pattern = "j", replacement = "h") ``` ``` ## [1] "hello" ``` ] .pull-right-narrow[ <img src="img/stringr.png" width="100%" style="display: block; margin: auto auto auto 0;" /> ] --- ## Clean up the text ```r page %>% html_elements(".secondaryInfo") %>% html_text() %>% str_remove("\\(") # remove ( ``` --- ## Clean up the text ```r page %>% html_elements(".secondaryInfo") %>% html_text() %>% str_remove("\\(") %>% # remove ( str_remove("\\)") # remove ) ``` --- ## Convert to numeric ```r page %>% html_elements(".secondaryInfo") %>% html_text() %>% str_remove("\\(") %>% # remove ( str_remove("\\)") %>% # remove ) as.numeric() ``` --- ## Save as `years` .pull-left[ ```r years <- page %>% html_elements(".secondaryInfo") %>% html_text() %>% str_remove("\\(") %>% # remove ( str_remove("\\)") %>% # remove ) as.numeric() years ``` ] .pull-right[ <img src="img/years.png" width="100%"
style="display: block; margin: auto;" /> ] --- class: middle ## Step 4. Scrape IMDB ratings and save as `ratings` --- ## Scrape IMDB ratings <img src="img/ratings.png" width="70%" style="display: block; margin: auto;" /> --- ## Scrape the nodes .pull-left[ ```r page %>% html_elements("strong") ``` ] .pull-right[ <img src="img/ratings.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extract the text from the nodes .pull-left[ ```r page %>% html_elements("strong") %>% html_text() ``` ] .pull-right[ <img src="img/ratings.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Convert to numeric .pull-left[ ```r page %>% html_elements("strong") %>% html_text() %>% as.numeric() ``` ] .pull-right[ <img src="img/ratings.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Save as `ratings` .pull-left[ ```r ratings <- page %>% html_elements("strong") %>% html_text() %>% as.numeric() ratings ``` ] .pull-right[ <img src="img/ratings.png" width="100%" style="display: block; margin: auto;" /> ] --- class: middle ## Step 5. Create a data frame called `imdb_top_250` --- ## Create a data frame: `imdb_top_250` ```r imdb_top_250 <- tibble( title = titles, year = years, rating = ratings ) imdb_top_250 ``` --- ## Clean up / enhance May or may not be a lot of work depending on how messy the data are - Add a variable for rank ```r imdb_top_250 <- imdb_top_250 %>% mutate(rank = 1:nrow(imdb_top_250)) %>% relocate(rank) ``` --- ```r imdb_top_250 %>% print(n = 20) ``` --- class: middle ## What next? --- .question[ Which years have the most movies on the list? ] -- ```r imdb_top_250 %>% count(year, sort = TRUE) ``` --- .question[ Which 1995 movies made the list? ] -- ```r imdb_top_250 %>% filter(year == 1995) %>% print(n = 8) ``` --- .question[ Visualize the average yearly rating for movies that made it on the top 250 list over time. 
] -- .panelset[ .panel[.panel-name[Plot] <img src="img/imdbplot.png" width="65%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r imdb_top_250 %>% group_by(year) %>% summarise(avg_score = mean(rating)) %>% ggplot(aes(y = avg_score, x = year)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(x = "Year", y = "Average score") ``` ] ] --- class: middle ## When you re-run all the previous scraping code on IMDb, it will not work. Why?! --- .pull-left[ what the page used to look like (static): <img src="img/imdb-top-250.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ what the page looks like now (dynamic): <img src="img/imdb-new.png" width="100%" style="display: block; margin: auto;" /> ] --- .pull-left[ what the page used to look like (static): <img src="img/imdb-top-250.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ what the page looks like now (dynamic): <img src="img/imdb-new.png" width="100%" style="display: block; margin: auto;" /> ] - Solution 1: extract 25 movies at a time - Solution 2: use a different source (but the rating variable is then lost) --- ## Solution 1 ```r page <- "https://www.imdb.com/chart/top/" doc <- read_html(page) title <- doc %>% html_elements("li h3.ipc-title__text") %>% html_text() year <- doc %>% html_elements(".sc-b189961a-8.kLaxqf.cli-title-metadata-item") %>% html_text() year <- as.numeric(year[seq(1, length(year), 3)]) ratings <- doc %>% html_elements(".ipc-rating-star.ipc-rating-star--base.ipc-rating-star--imdb.ratingGroup--imdb-rating") %>% html_text() ratings <- as.numeric(str_extract(ratings, "[0-9]\\.[0-9]")) imdb_top_25 <- tibble(title = title, year = year, rating = ratings) ``` ---
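## Solution 1: checking the rating regex

The `str_extract()` step can be tested in isolation. On the new chart the rating text looks roughly like `"9.3 (2.9M)"` (the vote counts below are illustrative), and the regular expression keeps only the leading score:

```r
library(stringr)

# Illustrative rating strings, shaped like the new IMDb chart text
ratings_raw <- c("9.3 (2.9M)", "9.2 (2.0M)", "9.0 (912K)")

# Keep the "digit.digit" score and convert to numeric
as.numeric(str_extract(ratings_raw, "[0-9]\\.[0-9]"))
```

```
## [1] 9.3 9.2 9.0
```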
--- ## Solution 2 [https://icheckmovies.com/lists/imdbs+top+250/](https://www.icheckmovies.com/lists/imdbs+top+250/) .middle[ <img src="img/sol2.png" width="70%" style="display: block; margin: auto;" /> ] --- ## Solution 2 .middle[ <img src="img/sol2-2.png" width="90%" style="display: block; margin: auto;" /> ] --- ## Solution 2 ```r page <- "https://www.icheckmovies.com/lists/imdbs+top+250/" doc <- read_html(page) title <- doc %>% html_elements("h2 a") %>% html_text() title <- title[-1] year <- doc %>% html_elements(".info a:nth-child(1)") %>% html_text() imdb_top_250 <- tibble(title = title, year = year) ``` ---
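## Solution 2: one more cleaning step

With Solution 2 the scraped `year` comes back as character, so convert it before reusing the earlier `count()` and `filter()` code. A minimal sketch, with a few toy rows standing in for the scraped tibble:

```r
library(dplyr)

# Toy stand-in for the scraped result (three real top-250 titles)
imdb_top_250 <- tibble(
  title = c("The Shawshank Redemption", "The Godfather", "12 Angry Men"),
  year  = c("1994", "1972", "1957")  # scraped as character
)

imdb_top_250 %>%
  mutate(year = as.numeric(year)) %>%
  filter(year < 1990)
```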
--- .midi[ .your-turn[ ### .hand[Your turn!] - Use either of the alternative approaches shown for scraping the top 250 movies and try (as best as you can) to answer the questions posed in this slide set! - Can you come up with a better solution? ] ] --- class: middle ## Ethics --- ## "Can you?" vs "Should you?" <img src="img/ok-cupid-1.png" width="60%" style="display: block; margin: auto;" /> .footnote[.small[ Source: Brian Resnick, [Researchers just released profile data on 70,000 OkCupid users without permission](https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release), Vox. ]] --- ## "Can you?" vs "Should you?" <img src="img/ok-cupid-2.png" width="70%" style="display: block; margin: auto;" /> --- class: middle ## Challenges --- ## Unreliable formatting at the source <img src="img/unreliable-formatting.png" width="70%" style="display: block; margin: auto;" /> --- ## Data broken into many pages <img src="img/many-pages.png" width="70%" style="display: block; margin: auto;" /> --- class: middle ## Workflow --- ## Screen scraping vs. APIs Two different scenarios for web scraping: - Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy) - Web APIs (application programming interface): the website offers a set of structured HTTP requests that return JSON or XML files --- ## A new R workflow - When working in an R Markdown/Quarto document, your analysis is re-run each time you knit/render - If you scrape the web in an R Markdown/Quarto document, you'd be re-scraping the data each time you knit, which is undesirable (and not *nice*)! - An alternative workflow: - Use an R script to save your code - Save the interim data scraped by the script as CSV or RDS files - Use the saved data in your analysis in your R Markdown document
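The alternative workflow boils down to two files. A minimal sketch, with illustrative file names and a toy one-row dataset standing in for the scraped result:

```r
library(dplyr)

## scrape-imdb.R -- run by hand, occasionally
# imdb_top_250 <- read_html("https://www.imdb.com/chart/top/") %>% ...
# (scraping code from the earlier slides goes here; toy stand-in below)
imdb_top_250 <- tibble(title = "The Shawshank Redemption",
                       year = 1994, rating = 9.3)
dir.create("data", showWarnings = FALSE)
saveRDS(imdb_top_250, "data/imdb_top_250.rds")

## analysis.Rmd -- re-run on every knit; reads the saved file, never scrapes
imdb_top_250 <- readRDS("data/imdb_top_250.rds")
```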