Title: These Functions Fetch and Extract Text Content from Specified Web Pages
Description: The 'scrapeR' package provides functions that fetch and extract text content from specified web pages. It handles HTTP errors and parses HTML efficiently, and it can process hundreds of websites at a time via the scrapeR_in_batches() function.
Authors: Mathieu Dubeau [aut, cre, cph]
Maintainer: Mathieu Dubeau <[email protected]>
License: MIT + file LICENSE
Version: 0.1.8
Built: 2024-10-31 16:32:35 UTC
Source: https://github.com/cran/scrapeR
The scrapeR function fetches and extracts text content from the specified web page. It handles HTTP errors and parses HTML efficiently.

Usage:
scrapeR(url)
Arguments:
url: A character string specifying the URL of the web page to be scraped.
Details:
The function uses tryCatch to handle potential web-scraping errors. It fetches the webpage content, checks for HTTP errors, and then parses the HTML content to extract text. The text from HTML nodes such as headings and paragraphs is combined into a single string.
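A minimal sketch of this flow, assuming the httr and rvest packages. The function name fetch_page_text and the selector set "h1, h2, h3, p" are illustrative assumptions, not the package's exact internals:

library(httr)
library(rvest)

# Hypothetical re-creation of the fetch/check/parse/combine flow
fetch_page_text <- function(url) {
  tryCatch({
    resp <- GET(url)                              # fetch the page
    if (http_error(resp)) {
      NA                                          # HTTP error: signal failure with NA
    } else {
      page <- read_html(resp)                     # parse the HTML
      nodes <- html_nodes(page, "h1, h2, h3, p")  # assumed node selection
      paste(html_text(nodes), collapse = " ")     # combine into one string
    }
  }, error = function(e) NA)                      # any other failure also yields NA
}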
Value:
A character string containing the combined text from the specified HTML nodes of the web page. Returns NA if an error occurs or if the page content is not accessible.
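Because failures are signaled with NA rather than a thrown error, callers can guard on the return value; for example:

scraped_text <- scrapeR("http://www.example.com")
if (is.na(scraped_text)) {
  message("Page could not be scraped")
}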
Note:
This function requires the httr and rvest packages. Ensure that these dependencies are installed and loaded in your R environment.
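A standard way to install and load both dependencies from CRAN:

install.packages(c("httr", "rvest"))
library(httr)
library(rvest)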
Author(s):
Mathieu Dubeau, Ph.D.
References:
Refer to the rvest package documentation for the underlying HTML parsing and extraction methods.
See Also:
GET, read_html, html_nodes, html_text
Examples:
url <- "http://www.example.com"
scraped_text <- scrapeR(url)
The scrapeR_in_batches function processes a dataframe in batches, scraping web content from the URLs in a specified column and writing the scraped content to a column in df.
Usage:
scrapeR_in_batches(df, url_column, extract_contacts)
Arguments:
df: A dataframe containing the URLs to be scraped.
url_column: The name of the column in df that contains the URLs to be scraped.
extract_contacts: A logical flag indicating whether the scraped content should be searched for email addresses and phone numbers; defaults to FALSE.
Details:
This function divides the input dataframe into batches of a fixed size (default: 100). For each batch, it extracts the combined text content from the web pages at the URLs in the specified column and appends the results to df. A throttling mechanism pauses between batches, reducing the load on the servers being scraped.
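A minimal sketch of the batch-and-throttle pattern described above. The batch size of 100 follows the default noted here; the pause length is an assumed value, and the per-URL call to scrapeR() stands in for the package's internal scraping step:

library(scrapeR)

df <- data.frame(url = c("http://site1.com", "http://site2.com"),
                 stringsAsFactors = FALSE)
batch_size <- 100
pause_seconds <- 5  # assumed throttle length, not taken from the package
df$content <- NA_character_

for (start in seq(1, nrow(df), by = batch_size)) {
  end <- min(start + batch_size - 1, nrow(df))
  # Scrape every URL in the current batch
  df$content[start:end] <- sapply(df$url[start:end], scrapeR)
  # Throttle: pause before the next batch to reduce server load
  if (end < nrow(df)) Sys.sleep(pause_seconds)
}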
Value:
The scraped text is returned in a content column and, if extract_contacts is TRUE, extracted contact details are returned in email and phone_number columns.
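Given a dataframe df with a url column (as in the examples below), and assuming the augmented dataframe is returned by the call (an assumption based on the description above), the new columns can be inspected directly:

result <- scrapeR_in_batches(df, url_column = "url", extract_contacts = TRUE)
head(result$content)
head(result$email)         # populated only when extract_contacts = TRUE
head(result$phone_number)  # populated only when extract_contacts = TRUE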
Note:
Ensure that the httr, rvest, and stringr packages are installed and loaded. Also, handle large datasets and output files with care to avoid memory issues.
Author(s):
Mathieu Dubeau, Ph.D.
References:
Refer to the rvest and httr package documentation for the underlying web-scraping methods.
See Also:
GET, read_html, html_nodes, html_text, write.table
Examples:
mock_scrapeR <- function(url) {
  return(paste("Scraped content from", url))
}

df <- data.frame(url = c("http://site1.com", "http://site2.com"),
                 stringsAsFactors = FALSE)

## Not run:
scrapeR_in_batches(df, url_column = "url", extract_contacts = FALSE)
## End(Not run)