I am trying to scrape all the data from the table on this website (https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s) but can't seem to figure out how I would go about scraping all of the subsequent pages. This is the code to scrape the first page of results into a CSV file:
import csv
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'

response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")

fileList = []

# For the table-header cells
tableHeader = soup.find('tr', attrs={'class': 'table-header'})
rowList = []
for cell in tableHeader.findAll('th'):
    cellText = cell.text.replace(' ', '').replace('\n', '')
    rowList.append(cellText)
fileList.append(rowList)

# For the table body cells
table = soup.find('tbody', attrs={'class': 'stripe'})
for row in table.findAll('tr'):
    rowList = []
    for cell in row.findAll('td'):
        cellText = cell.text.replace(' ', '').replace('\n', '')
        if cellText == "Details":
            continue
        rowList.append(cellText)
    fileList.append(rowList)

outfile = open("./prison-inmates.csv", "w")
writer = csv.writer(outfile)
writer.writerows(fileList)
How do I get to the next page of results?
Code taken from this tutorial (http://first-web-scraper.readthedocs.io/en/latest/)
Although I wasn't able to get your posted code to run, I did find that the original tutorial code you linked to can be changed on the url = line to:
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' \
+ '?max_rows=250'
Running python scrape.py then successfully outputs inmates.csv with all available records.
In short, this works by reframing the problem:
instead of asking "How do I get to the next page?"
we ask "How do I remove pagination?"
We make the page send all records at once, so there is no pagination to deal with in the first place.
This allows us to use the original tutorial code to save the complete set of records.
Explanation
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' uses the new URL. The old URL from the tutorial, http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp, redirects to this new one but doesn't work with our solution, so we can't use the old URL.
\ is a line continuation that lets the line of code carry on to the next line, for readability.
+ concatenates the strings so we can append the ?max_rows=250.
So the result is equivalent to url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250'
?max_rows=<number-of-records-to-display> is a query string I found that works for this particular Current Detainees page. It can be discovered by first noticing the Page Size text entry field meant for users to set a custom number of rows per page; it shows a default value of 50. Examine its HTML, for example in Firefox (52.7.3): press Ctrl+Shift+I to open the Web Developer Inspector, click the Select element button (the icon resembles a box outline with a mouse cursor arrow on it), then click the input field containing 50. The HTML pane below highlights: <input class="mrcinput" name="max_rows" size="3" title="max_rowsp" value="50" type="text">. This means the page submits a form variable named max_rows, a number that defaults to 50. Some web pages, depending on how they are coded, recognize such variables when they are appended to the URL as a query string, so it is worth trying ?max_rows= plus a number of your choice. At the time I started, the page said 250 Total Items, so I tried 250 by changing my browser address bar to load: https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250. It successfully displayed 250 records, making pagination unnecessary, so this ?max_rows=250 is what we use to form the URL in our script.
However, the page now says 242 Total Items, so it seems they are removing inmates, or at least the inmate records listed. You could use ?max_rows=242, but ?max_rows=250 will still work because 250 is larger than the current total of 242; as long as the value is larger than the total, the page will not need to paginate, and you get all the records on one page.
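For completeness, the same idea can be expressed with requests' params argument instead of hand-building the query string. This is only a sketch: it assumes the server keeps accepting the max_rows variable, and the tbody selector is borrowed from your posted code, so adjust it if the new page's markup differs.

import requests
from bs4 import BeautifulSoup

base_url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s'
# Equivalent to appending ?max_rows=250; raise the number if the total grows
response = requests.get(base_url, params={'max_rows': 250})
soup = BeautifulSoup(response.content, 'html.parser')

# The tutorial's parsing/CSV code can then run on this soup unchanged
rows = soup.find('tbody', attrs={'class': 'stripe'}).findAll('tr')
print(len(rows))  # should report every detainee on a single page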
Warranty
This isn't a universal solution for scraping table data when you encounter pagination. It works for this Current Detainees page and for pages that happen to be coded the same way.
This is because pagination isn't implemented uniformly; any code or solution depends on how the page implements it. Here we use ?max_rows=.... Another website, even if it has adjustable per-page limits, may use a different name for this max_rows variable, or ignore query strings altogether, so our solution may not work there.
Scalability issues: if, on a different website, you need millions of records, a download-all-at-once approach like this can run into memory limits on the server side as well as on your computer, and either could time out before delivering or processing everything. A different approach, resembling the pagination you originally asked about, would be more suitable.
So in the future, if you need to download large numbers of records, this download-all-at-once approach will likely run you into memory-related trouble; but for scraping this particular Current Detainees page, it will get the job done.
Related
I have two lists of baseball players that I would like to scrape data for from the website fangraphs. I am trying to figure out how to have Selenium search for the first player in the list (which redirects to that player's profile), scrape the data I am interested in, and then search for the next player, until the loop completes for both lists. I have written other scrapers with Selenium, but I haven't come across this situation where I need to perform a search, collect the data, then perform the next search, and so on...
Here is a smaller version of one of the lists:
batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']

driver.get('https://www.fangraphs.com/')
search_box = driver.find_element_by_xpath('/html/body/form/div[3]/div[1]/div[2]/header/div[3]/nav/div[1]/div[2]/div/div/input')
search_box.click()

for batter in batters:
    search_box.send_keys(batter)
    search_box.send_keys(Keys.RETURN)
This will obviously search all the names at once, so I guess I'm trying to figure out how to code the logic of searching one by one, not performing the next search until I have collected the data from the previous search. Any help is appreciated, cheers.
With Selenium, you would just iterate through the names, "type" each one into the search bar, click/go to the link, scrape the stats, then repeat. You have it set up to do that; you just need to add the scraping part. So something like:
batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']

driver.get('https://www.fangraphs.com/')
search_box = driver.find_element_by_xpath('/html/body/form/div[3]/div[1]/div[2]/header/div[3]/nav/div[1]/div[2]/div/div/input')
search_box.click()

for batter in batters:
    search_box.send_keys(batter)
    search_box.send_keys(Keys.RETURN)

    ## CODE THAT SCRAPES THE DATA ##
    ## CODE THAT STORES IT SOMEWAY TO APPEND AFTER EACH ITERATION ##
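One practical detail to watch for: after a search navigates away from the home page, the search_box element you located earlier typically goes stale, so you usually need to go back and re-locate the field on every iteration. A minimal sketch of that loop shape, reusing the xpath from the question (Keys comes from selenium.webdriver.common.keys):

for batter in batters:
    driver.get('https://www.fangraphs.com/')  # start from the home page each time
    search_box = driver.find_element_by_xpath('/html/body/form/div[3]/div[1]/div[2]/header/div[3]/nav/div[1]/div[2]/div/div/input')
    search_box.send_keys(batter)
    search_box.send_keys(Keys.RETURN)  # lands on the player's page (assuming the first hit is the right one)
    # ... scrape what you need from driver.page_source here ...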
However, they have an API, which is a far better solution than Selenium. Why?
APIs are consistent. Parsing HTML with Selenium and/or BeautifulSoup relies on the HTML structure. If they ever change the layout of the website, your scraper may break because tags that used to be there aren't anymore, or new tags and attributes appear. But the underlying data that is rendered in the HTML comes from the API in a nice JSON format, and that will rarely change unless they do a complete overhaul of the data structure.
It's far more efficient and quicker. There's no need to have Selenium open a browser, search, load/render the content, then scrape, then repeat. You get the response in one request.
You'll get far more data than you intended, which (imo) is a good thing. I'd rather have more data and "trim" off what I don't need. Much of the time you'll see very interesting and useful data that you otherwise wouldn't have known was there.
So I'm not sure what you are after specifically, but this will get you going. You'll have to sift through the statsData to figure out what you want, but if you tell me what you are after, I can help get that into a nice table for you. Or if you want to figure it out yourself, look up pandas and the .json_normalize() function with that. Parsing nested json can be tricky (but it's also fun ;-) )
Code:
import requests

# Get teamIds
def get_teamIds():
    team_id_dict = {}
    url = 'https://cdn.fangraphs.com/api/menu/menu-standings'
    jsonData = requests.get(url).json()
    for team in jsonData:
        team_id_dict[team['shortName']] = str(team['teamid'])
    return team_id_dict

# Get Player IDs
def get_playerIds(team_id_dict):
    player_id_dict = {}
    for team, teamId in team_id_dict.items():
        url = 'https://cdn.fangraphs.com/api/depth-charts/roster?teamid={teamId}'.format(teamId=teamId)
        jsonData = requests.get(url).json()
        print(team)
        for player in jsonData:
            if 'oPlayerId' in player.keys():
                player_id_dict[player['player']] = [str(player['oPlayerId']), player['position']]
            else:
                player_id_dict[player['player']] = ['N/A', player['position']]
    return player_id_dict

team_id_dict = get_teamIds()
player_id_dict = get_playerIds(team_id_dict)

batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']
for player in batters:
    playerId = player_id_dict[player][0]
    pos = player_id_dict[player][1]
    url = 'https://cdn.fangraphs.com/api/players/stats?playerid={playerId}&position={pos}'.format(playerId=playerId, pos=pos)
    statsData = requests.get(url).json()
Output: a look at the nested stats JSON you get back for each player.
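If you go the pandas route mentioned above, here is a minimal sketch of flattening the nested response (this assumes statsData is the JSON object produced by the last loop; the exact keys depend on what the endpoint returns for that player):

import pandas as pd

# statsData comes from the loop above; json_normalize spreads nested dicts
# into flat columns so you can inspect what's available
df = pd.json_normalize(statsData)
print(df.columns.tolist())
print(df.head())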
Trying to scrape information from www.archive.org, which contains historic product data. My code below tries to click on every product listed, scrape the information for each product, and do the same on subsequent pages.
The problem is that it SKIPS some products (20 in particular), even though the xpath:
products = response.xpath("//article[contains(@class,'product result-prd')]")
is the same for all products. Please see my complete code below.
import scrapy

class CurrysSpider(scrapy.Spider):
    name = 'currys_mobiles_2015'
    #allowed_domains = ['www.currys.co.uk']
    start_urls = ['https://web.archive.org/web/20151204170941/http://www.currys.co.uk/gbuk/phones-broadband-and-sat-nav/mobile-phones-and-accessories/mobile-phones/362_3412_32041_xx_xx/xx-criteria.html']

    def parse(self, response):
        products = response.xpath("//article[contains(@class,'product result-prd')]")  # done
        for product in products:
            brand = product.xpath(".//span[@data-product='brand']/text()").get()  # done
            link = product.xpath(".//div[@class='productListImage']/a/@href").get()  # done
            price = product.xpath(".//strong[@class='price']/text()").get().strip()  # done
            description = product.xpath(".//ul[@class='productDescription']/li/text()").getall()  # done
            absolute_url = link  # done
            yield scrapy.Request(url=absolute_url, callback=self.parse_product,
                                 meta={'brand_name': brand,
                                       'product_price': price,
                                       'product_description': description})  # done

        # process next page
        next_page_url = response.xpath("//ul[@class='pagination']//li[last()]//@href").get()
        absolute_next_page_url = next_page_url
        if next_page_url:
            yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)

    def parse_product(self, response):
        .....
I have noticed this problem in many websites that I tried to scrape, and I am not sure why some products are skipped, since the xpath is the same for all of the product listings.
Would appreciate some feedback on this.
Try to take a look at whether those products are present in the page HTML or loaded via JS.
Just press Ctrl+U and check the HTML body for those products.
It's possible the individual pages are not loading properly, likely due to JS loading, as the rest of your code looks fine (though I would recommend using normalize-space() inside the XPath instead of .strip() for the price).
To test this (on Chrome), visit your target web page, open Chrome DevTools (F12), click "Console" and press Ctrl+Shift+P to pull up the command window.
Next, type "Disable JavaScript" and select that option when it shows up. Now press Ctrl+R to refresh the page; this is the "view" that your web scraper gets. Check your XPath expressions now.
If you do have issues, consider using scrapy-splash or scrapy-selenium to load this JS.
EDIT: I would also check for the possibility of a memory leak. According to the Scrapy docs, using the meta attribute in your callbacks can sometimes cause leaks.
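If the meta dict is the concern, newer Scrapy versions (1.7+) also let you pass scraped fields to the callback via cb_kwargs instead. A sketch of that variant, reusing the selectors from the question (the spider name here is made up):

import scrapy

class CurrysCbKwargsSpider(scrapy.Spider):
    name = 'currys_cb_kwargs_sketch'  # hypothetical name
    start_urls = ['https://web.archive.org/web/20151204170941/http://www.currys.co.uk/gbuk/phones-broadband-and-sat-nav/mobile-phones-and-accessories/mobile-phones/362_3412_32041_xx_xx/xx-criteria.html']

    def parse(self, response):
        for product in response.xpath("//article[contains(@class,'product result-prd')]"):
            yield scrapy.Request(
                url=product.xpath(".//div[@class='productListImage']/a/@href").get(),
                callback=self.parse_product,
                cb_kwargs={'brand_name': product.xpath(".//span[@data-product='brand']/text()").get()},
            )

    def parse_product(self, response, brand_name):
        # values passed via cb_kwargs arrive as ordinary keyword arguments
        yield {'brand': brand_name, 'url': response.url}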
I am working on a project where I am crawling thousands of websites to extract text data; the end use case is natural language processing.
EDIT: since I am crawling hundreds of thousands of websites I cannot tailor scraping code to each one, which means I cannot search for specific element IDs; the solution I am looking for is a general one.
I am aware of solutions such as the .get_text() function from Beautiful Soup. The issue with this method is that it gets all the text from the website, much of it irrelevant to the main topic of that particular page. For the most part a page is dedicated to a single main topic, but along the sides and at the top and bottom there may be links or text about other subjects, promotions, or other content.
With the .get_text() function you get all the text on the page in one go; the problem is that it combines the relevant parts with the irrelevant ones. Is there another function, similar to .get_text(), that returns all the text as a list where every list item is a specific section of the text? That way it can be known where new subjects start and end.
As a bonus, is there a way to identify the main body of text on a web page?
Below are some snippets you could use to query the data the way you want, using BeautifulSoup4 and Python 3:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')
# Print the first child of the body (soup.body.contents is a list of all children)
print(soup.body.contents[0])
# Print the first div found on the html page
print(soup.find('div'))
# Print all divs on the html page in list form
print(soup.find_all('div'))
# Print the element with id 'required_element_id'
print(soup.find(id='required_element_id'))
# Print all html elements (in list form) that match the given CSS selector
print(soup.select('.some-class'))
# Print the value of the given attribute
print(soup.find(id='someid').get("attribute-name"))
# You can also break your one large query into multiple queries
parent = soup.find(id='someid')
# getText() return the text between opening and closing tag
print(parent.select(".some-class")[0].getText())
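For the "list of sections" part of the question, one rough approach (my own assumption, not a built-in BeautifulSoup feature) is to collect text per block-level element, so each list item corresponds to one chunk of the page:

# Reuses the soup object created above.
# The tag list is an assumption about what marks a "section" boundary.
blocks = soup.find_all(['h1', 'h2', 'h3', 'h4', 'p', 'li'])
chunks = [block.get_text(" ", strip=True) for block in blocks]
chunks = [c for c in chunks if c]  # drop empty chunks
print(chunks)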
For more advanced requirements, you can check out Scrapy as well. Let me know if you face any challenges implementing this or if your requirement is something else.
I want to get several "pages" of a website, and for some reason the correct URL does not give the expected result.
I checked the URL that should be used and it works just fine in the browser, so I tried building the URLs by changing a variable.
for i in range(1, 100):
    MLinks.append("https://#p" + str(i))

for i in range(1, 100):
    x = i - 1
    MainR = requests.get(MLinks[x])
    SMHTree = html.fromstring(MainR.content)
    MainData = SMHTree.xpath('//@*')
    j = 0
    while j < len(MainData):
        if 'something' in MainData[j]:
            PLinks.append(MainData[j])  # Links of products
        j = j + 1
I expect to get every page, but when I read the contents I always get the contents of the first page.
I assume the URLs you are requesting look like this:
https://somehost.com/products/#p1
https://somehost.com/products/#p2
https://somehost.com/products/#p3
...
That is, the second line of your code would actually be
MLinks.append("https://somehost.com/products/#p" + str(i))
When doing the request, the server never sees the part after the # (this part is called an anchor, or fragment). So the server just receives 100 requests for "https://somehost.com/products/", which all give the same result. See this page for further explanation: https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL.
Anchors are sometimes used by client-side JavaScript to load pages dynamically. This means that if you open "https://somehost.com/products/" and navigate to "https://somehost.com/products/#p5", the client-side JavaScript will notice it and (usually) issue a request to some other URL to load the products on page 5. This other URL will not be "https://somehost.com/products/#p5"! To find out what this URL is, open your browser's developer tools and see what network requests are made when you navigate to a different product page.
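Once you have found the real request in the network panel, you can call it directly with requests. A purely hypothetical sketch (the host, path, and 'page' parameter below are made up; use whatever your browser's network panel actually shows):

import requests
from lxml import html

for page in range(1, 100):
    # hypothetical endpoint; replace with the URL observed in the network panel
    response = requests.get('https://somehost.com/products/list', params={'page': page})
    tree = html.fromstring(response.content)
    links = tree.xpath('//a[contains(@class, "product")]/@href')  # adjust to the real markup
    print(page, len(links))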
This is a follow-up question to my earlier question on looping through multiple web pages. I am new to programming... so I appreciate your patience and very explicit explanations!
I have programmed a loop through many web pages. On each page, I want to scrape data, save it to a variable or a csv file (whichever is easier/more stable), then click on the "next" button, scrape data on the second page and append it to the variable or csv file, etc.
Specifically, my code looks like this:
url="http://www.url.com"
driver = webdriver.Firefox()
driver.get(url)
(driver.page_source).encode('utf-8')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
wait = WebDriverWait(driver, 10)
while True:
# some code to grab the data
job_tag={'class': re.compile("job_title")}
all_jobs=soup.findAll(attrs=job_tag)
jobs=[]
for text in (all_jobs):
t=str(''.join(text.findAll(text=True)).strip())
jobs.append(t)
writer=csv.writer(open('test.csv','a', newline=''))
writer.writerows(jobs)
# click next link
try:
element=wait.until(EC.element_to_be_clickable((By.XPATH, "//*[#id='reviews']/a/span[starts-with(.,'Next')]")))
element.click()
except TimeoutException:
break
It runs without error, but
1) the file collects the data of the first page over and over again, but not the data of the subsequent pages, even though the loop performs correctly (ultimately, I do not really mind duplicate entries, but I do want data from all pages).
I suspect that I need to "redefine" the soup for each new page; I am looking into how to make bs4 access those URLs.
2) the last page has no "next" button, so the code does not append the last page's data (I get that error when I use 'w' instead of 'a' in the csv line, with only the data of the second-to-last page being written to the csv file).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
Thanks!
I am suspecting that I need to "redefine" the soup for each new page
Indeed, you should. You see, your while loop runs with soup always referring to the same old object you made before entering that while loop. You should rebind soup to a new BeautifulSoup instance built from the next page's URL, which is most likely the one you find behind the anchor (the a tag) you've located in those last lines:
element=wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='reviews']/a/span[starts-with(.,'Next')]")))
You could access it with just your soup (note that I haven't tested this for correctness: without the actual source of the page, I'm guessing):
next_link = soup.find(id='reviews').a.get('href')
And then, at the end of your while loop, you would rebind soup:
soup = BeautifulSoup(urllib.request.urlopen(next_link).read())
You should still add a try - except clause to capture the error it'll generate on the last page when it cannot find the last "Next" link and then break out of the loop.
Note that selenium is most likely not necessary for your use-case, bs4 would be sufficient (but either would work).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
The writer instance you've created expects an iterable for its writerows method. You are passing it a single string (which might have commas in it, but that's not what csv.writer will look at: it will add commas (or whichever delimiter you specified in its construction) between every two items of the iterable). A Python string is iterable (per character), so writer.writerows("some_string") doesn't result in an error. But you most likely wanted this:
for text in (all_jobs):
    t = [x.strip() for x in text.find_all(text=True)]
    jobs.append(t)
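To see the difference concretely, here is a tiny standalone sketch of how csv.writer treats a list of strings versus a list of lists (the file names are just examples):

import csv

rows_wrong = ["Data Analyst", "Web Developer"]      # a list of strings
rows_right = [["Data Analyst"], ["Web Developer"]]  # a list of lists, one field per row

with open('wrong.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows_wrong)  # each character ends up in its own cell

with open('right.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows_right)  # each job title is a single cell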
As a follow-up on the comments:
You'll want to update the soup based on the new URL, which you retrieve from the 1, 2, 3 Next >> pagination links (they're in a div container with a specific id, so easy to extract with just BeautifulSoup). The code below is a fairly basic example that shows how this is done. Extracting the things you find relevant is done by your own scraping code, which you'll have to add as indicated in the example.
# Python 3.x
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.indeed.com/cmp/Wesley-Medical-Center/reviews'
base_url_parts = urllib.parse.urlparse(url)
while True:
    raw_html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(raw_html)

    # scrape the page for the desired info
    # ...

    last_link = soup.find('div', id='company_reviews_pagination').find_all('a')[-1]
    if last_link.text.startswith('Next'):
        next_url_parts = urllib.parse.urlparse(last_link['href'])
        url = urllib.parse.urlunparse((base_url_parts.scheme, base_url_parts.netloc,
                                       next_url_parts.path, next_url_parts.params,
                                       next_url_parts.query, next_url_parts.fragment))
        print(url)
    else:
        break