This is a follow-up question to my earlier question on looping through multiple web pages. I am new to programming... so I appreciate your patience and very explicit explanations!
I have programmed a loop through many web pages. On each page, I want to scrape data, save it to a variable or a csv file (whichever is easier/more stable), then click on the "next" button, scrape data on the second page and append it to the variable or csv file, etc.
Specifically, my code looks like this:
url="http://www.url.com"
driver = webdriver.Firefox()
driver.get(url)
(driver.page_source).encode('utf-8')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
wait = WebDriverWait(driver, 10)
while True:
# some code to grab the data
job_tag={'class': re.compile("job_title")}
all_jobs=soup.findAll(attrs=job_tag)
jobs=[]
for text in (all_jobs):
t=str(''.join(text.findAll(text=True)).strip())
jobs.append(t)
writer=csv.writer(open('test.csv','a', newline=''))
writer.writerows(jobs)
# click next link
try:
element=wait.until(EC.element_to_be_clickable((By.XPATH, "//*[#id='reviews']/a/span[starts-with(.,'Next')]")))
element.click()
except TimeoutException:
break
It runs without error, but
1) the file collects the data of the first page over and over again, but not the data of the subsequent pages, even though the loop performs correctly (ultimately, I do not really mind duplicate entries, but I do want data from all pages).
I suspect that I need to "redefine" the soup for each new page; I am looking into how to make bs4 access those URLs.
2) the last page has no "next" button, so the code does not append last page's data (I get that error when I use 'w' instead of 'a' in the csv line, with the data of the second-to-last page writing into the csv file).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
Thanks!
I am suspecting that I need to "redefine" the soup for each new page
Indeed, you should. You see, your while loop runs with soup always referring to the same old object you made before entering that while loop. You should rebind soup to a new BeautifulSoup instance built from the next page's URL, which is most likely the one you find behind the anchor (tag a) you've located in those last lines:
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='reviews']/a/span[starts-with(.,'Next')]")))
You could access it with just your soup (note that I haven't tested this for correctness: without the actual source of the page, I'm guessing):
next_link = soup.find(id='reviews').a.get('href')
And then, at the end of your while loop, you would rebind soup:
soup = BeautifulSoup(urllib.request.urlopen(next_link).read())
You should still add a try/except clause to catch the error this will generate on the last page, when it cannot find a "Next" link, and then break out of the loop.
Note that selenium is most likely not necessary for your use-case, bs4 would be sufficient (but either would work).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
The writer instance you've created expects an iterable for its writerows method. You are passing it rows that are each a single string (which might have commas in them, but that's not what csv.writer will look at: it will add commas (or whichever delimiter you specified in its construction) between every two items of the iterable). A Python string is iterable (per character), so writer.writerows("some_string") doesn't result in an error. But you most likely wanted this:
for text in all_jobs:
    t = [x.strip() for x in text.find_all(text=True)]
    jobs.append(t)
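To make the one-letter-per-cell behaviour concrete, here is a small, self-contained demonstration (my own illustration, not part of the original answer) of how csv.writer treats a list of strings versus a list of lists:

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# a list of strings: each string is treated as a row, each character becomes a cell
writer.writerows(["Data Analyst"])
# a list of lists: each inner list is a row, each element becomes a cell
writer.writerows([["Data Analyst", "New York"]])

print(buf.getvalue())
# D,a,t,a, ,A,n,a,l,y,s,t
# Data Analyst,New York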
As a follow-up on the comments:
You'll want to update the soup based on the new url, which you retrieve from the 1, 2, 3 Next >> (it's in a div container with a specific id, so easy to extract with just BeautifulSoup). The code below is a fairly basic example that shows how this is done. Extracting the things you find relevant is done by your own scraping code, which you'll have to add as indicated in the example.
# Python 3.x
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

url = 'http://www.indeed.com/cmp/Wesley-Medical-Center/reviews'
base_url_parts = urllib.parse.urlparse(url)

while True:
    raw_html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(raw_html, "html.parser")

    # scrape the page for the desired info
    # ...

    last_link = soup.find('div', id='company_reviews_pagination').find_all('a')[-1]
    if last_link.text.startswith('Next'):
        next_url_parts = urllib.parse.urlparse(last_link['href'])
        url = urllib.parse.urlunparse((base_url_parts.scheme, base_url_parts.netloc,
                                       next_url_parts.path, next_url_parts.params,
                                       next_url_parts.query, next_url_parts.fragment))
        print(url)
    else:
        break
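For illustration only, here is one way the scraping placeholder could be filled in, wrapped as a small helper (the job_title class pattern is taken from the question and may not match the live page):

import csv
import re

def scrape_jobs(soup, csv_path='test.csv'):
    """Collect the job titles on one page and append them to a CSV file."""
    # same class pattern as in the question; adjust it to the real page markup
    all_jobs = soup.find_all(attrs={'class': re.compile("job_title")})
    # one row per job, one cell per row (note the list around each title)
    rows = [[job.get_text(strip=True)] for job in all_jobs]
    with open(csv_path, 'a', newline='') as f:
        csv.writer(f).writerows(rows)

You would call scrape_jobs(soup) at the point marked # scrape the page for the desired info.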
Related
I've been at this all day and I'm getting a little overwhelmed, so let me explain: as a personal project, I want to scrape all the links using the acestream:// protocol from a website and turn them into a playlist for Acestream. For now I can either extract the links from the whole site (something like the site map) or extract the acestream links from one specific subpage. One of the problems I have is that the same acestream link appears several times on the page,
so obviously I get the same link multiple times when I only want it once. Also, I don't know how (I'm very new to this) to make the script take its starting URLs from a list of links in a .csv instead of hard-coding a single link, because I need to get an acestream link from each link in the .csv. Sorry for the long post, I hope it's not a nuisance.
I hope this is understandable; I translated it with Google Translate.
from bs4 import BeautifulSoup
import requests

# creating empty list
urls = []

# function created
def scrape(site):
    # getting the request from url
    r = requests.get(site)
    # converting the text
    s = BeautifulSoup(r.text, "html.parser")
    for i in s.find_all("a"):
        href = i.attrs['href']
        if href.startswith("acestream://"):
            site = site + href
            if site not in urls:
                urls.append(site)
                print(site)
                # calling the scrape function itself
                # generally called recursion
                scrape(site)

# main function
if __name__ == "__main__":
    site = "https://www.websitehere.com/index.htm"
    scrape(site)
Based on your last comment and your code, you can read in a .csv using pandas:
import pandas as pd

file_path = r'C:\<path to your csv>'
df = pd.read_csv(file_path)
csv_links = df['<your_column_name_for_links>'].to_list()
With this, you can get the URLs from the .csv. Just change the values in the <>.
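Putting the two pieces together, a minimal sketch (the file name links.csv and the column name link are placeholders of mine, and scrape is the function from your question):

import pandas as pd

# hypothetical file and column name; adjust both to your actual .csv
df = pd.read_csv(r'C:\path\to\links.csv')
csv_links = df['link'].to_list()

# call the scrape function from the question once per starting URL
for link in csv_links:
    scrape(link)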
I want to retrieve all links from a website that contain a specific phrase.
An example on a public website would be to retrieve all videos from a large youtube channel (for example Linus Tech Tips):
from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")

current_link = ''
for link in soup.find_all('a'):
    current_link = link.get('href')
    print(current_link)
Now I have 3 problems here:
How do I get only hyperlinks containing a phrase like "watch?v="
Most hyperlinks aren't shown. In the browser they only appear when you scroll down, and BeautifulSoup only finds the links that are present without scrolling. How can I retrieve all hyperlinks?
All hyperlinks appear two times. How can I only choose each hyperlink once?
Any suggestions?
How do I get only hyperlinks containing a phrase like "watch?v="
Add a single if statement above your print statement
if 'watch?v=' in current_link:
    print(current_link)
All hyperlinks appear two times. How can I only choose each hyperlink once?
Store each hyperlink as a key in a dictionary and set the value to any arbitrary number (a dictionary only stores each key once, so you won't be able to add duplicates).
Something like this:
myLinks = {}  # declare a dictionary variable to hold your data

if 'watch?v=' in current_link:
    print(current_link)
    myLinks[current_link] = 1
You can iterate over the keys (links) in the dictionary like this:
for link, val in myLinks.items():
    print(link)
This will print all the links in your dictionary
Most hyperlinks aren't shown. In the browser they only appear when you scroll down, and BeautifulSoup only finds the links that are present without scrolling. How can I retrieve all hyperlinks?
I'm not sure how to get around the scripting on the page you've pointed us to directly, but you could always crawl the links you get from the initial scrape, rip new links off the side panels, and traverse them; this should give you most, if not all, of the links you want.
To do so you would want another dictionary to store the already traversed links, so you can check whether you've visited a page before. You can check for a key in a dictionary like so:
if key in myDict:
    print('myDict has this key already!')
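For illustration only, a rough sketch of that crawl-and-track approach (the helper name get_links, the visit cap, and the URL patterns are my own assumptions, not tested against YouTube's actual markup):

from bs4 import BeautifulSoup as bs
import requests

def get_links(url):
    """Return all hrefs found on a single page."""
    html = requests.get(url)
    soup = bs(html.content, "html.parser")
    return [a.get('href') for a in soup.find_all('a') if a.get('href')]

video_links = {}  # links we actually want (watch?v=)
visited = {}      # pages we have already crawled
to_visit = ['https://www.youtube.com/user/LinusTechTips/videos']

while to_visit and len(visited) < 50:  # small cap so the sketch stays bounded
    page = to_visit.pop()
    if page in visited:
        continue
    visited[page] = 1
    for href in get_links(page):
        if 'watch?v=' in href:
            video_links[href] = 1
        elif href.startswith('/'):
            # relative link on the same site: queue it for a later visit
            to_visit.append('https://www.youtube.com' + href)

for link in video_links:
    print(link)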
I would use the requests library. For Python 3:
import requests

SearchString = "https://SampleURL.com"
response = requests.get(SearchString, stream=True)
zeta = str(response.content)
with open("File.txt", "w") as l:
    l.write(zeta)

# And now open up the file with the information written to it
x = open("File.txt", "r")
jello = []
for line in x:
    jello.append(line)

t = jello[0].split('"salePrice":', 1)[1].split(",", 1)[0]
# You'll notice above that I use the keyword "salePrice"; this should be a unique identifier in the page's source.
# Typically F12 in Chrome and then navigating until the item is highlighted gives you the XPath if you right-click and copy.
# Now this will only return a single result; you'll want to use a for loop to iterate over File.txt until you find all the separate results.
I hope this helps. I'll keep an eye on this thread if you need more help.
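As an aside, a hedged sketch of the "for loop over File.txt" idea the comments mention, using re.findall instead of repeated splits (the "salePrice" key and the sample URL are assumptions carried over from the snippet above):

import re

import requests

response = requests.get("https://SampleURL.com")
# collect every value that follows a "salePrice": key in the raw page source
prices = re.findall(r'"salePrice":\s*([^,}]+)', response.text)
for price in prices:
    print(price.strip())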
Part One and Three:
Create a list and append links to the list:
from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")

links = []  # see here
for link in soup.find_all('a'):
    links.append(link.get('href'))  # and here
Then create a set and convert it back to list to remove duplicates:
links = list(set(links))
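If you care about keeping the original order while removing duplicates, a minor variation of my own on the set approach relies on dictionaries preserving insertion order (Python 3.7+):

# preserves first-seen order, unlike converting to a set
links = list(dict.fromkeys(links))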
Now return the items of interest:
clean_links = [i for i in links if 'watch?v=' in i]
Part Two:
In order to navigate through the site you may need more than just Beautiful Soup. Scrapy has a great API that allows you to pull down a page and explore how you want to parse parent and child elements with xpath. I highly encourage you to try Scrapy and use the interactive shell to tweak your extraction method.
HELPFUL LINK
I am trying to scrape all the data from the table on this website (https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s) but can't seem to figure out how I would go about scraping all of the subsequent pages. This is the code to scrape the first page of results into a CSV file:
import csv
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, "html.parser")

fileList = []

# For the table-header cells
tableHeader = soup.find('tr', attrs={'class': 'table-header'})
rowList = []
for cell in tableHeader.findAll('th'):
    cellText = cell.text.replace(' ', '').replace('\n', '')
    rowList.append(cellText)
fileList.append(rowList)

# For the table body cells
table = soup.find('tbody', attrs={'class': 'stripe'})
for row in table.findAll('tr'):
    rowList = []
    for cell in row.findAll('td'):
        cellText = cell.text.replace(' ', '').replace('\n', '')
        if cellText == "Details":
            continue
        rowList.append(cellText)
    fileList.append(rowList)

outfile = open("./prison-inmates.csv", "w")
writer = csv.writer(outfile)
writer.writerows(fileList)
How do I get to the next page of results?
Code taken from this tutorial (http://first-web-scraper.readthedocs.io/en/latest/)
Although I wasn't able to get your posted code to run, I did find that the original tutorial code you linked to can be changed on the url = line to:
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' \
+ '?max_rows=250'
Running python scrape.py then successfully outputs inmates.csv with all available records.
In short, this works by asking a different question: instead of "How do I get to the next page?", we pursue "How do I remove pagination?". We make the page send all records at once, so there is no pagination to deal with in the first place. This allows us to use the original tutorial code to save the complete set of records.
Explanation
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' uses the new URL. The old URL in the tutorial, http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp, redirects to this new URL but doesn't work with our solution, so we can't use the old URL.
\ is a line continuation, allowing the line of code to carry on to the next line for readability.
+ concatenates the two strings so we can add the ?max_rows=250.
So the result is equivalent to url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250'
?max_rows=<number-of-records-to-display> is a query string I found that works for this particular Current Detainees page. You can find it by first noticing the Page Size text entry field meant for users to set a custom number of rows per page; it shows a default value of 50. Examine its HTML, for example in Firefox (52.7.3): press Ctrl+Shift+I to open the Web Developer Inspector, click the Select element button (the icon resembling a box outline with a mouse cursor arrow on it), then click on the input field containing 50. The HTML pane below highlights: <input class="mrcinput" name="max_rows" size="3" title="max_rowsp" value="50" type="text">. This means the page submits a form variable named max_rows, a number defaulting to 50. Some web pages, depending on how they are coded, recognize such variables when appended to the URL as a query string, so it is worth trying ?max_rows= plus a number of your choice. At the time, the page said 250 Total Items, so I tried 250 by changing my browser address bar to load https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250. It successfully displayed 250 records, making pagination unnecessary, so ?max_rows=250 is what we use to form the URL in our script (an alternative way to build this URL is sketched below).
However, the page now says 242 Total Items, so it seems they are removing inmates, or at least the inmate records listed. You can use ?max_rows=242, but ?max_rows=250 will still work because 250 is larger than the total number of records (242); as long as it is larger, the page will not need to paginate, and you get all the records on one page.
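If you would rather not hand-build the query string, the same URL can be assembled with urllib.parse.urlencode (a small alternative of my own, not part of the original tutorial):

from urllib.parse import urlencode

base = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s'
# max_rows=250 asks the server for up to 250 rows on a single page
url = base + '?' + urlencode({'max_rows': 250})
print(url)
# https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250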
Warranty
This isn't a universal solution for scraping table data when encountering pagination. It works for this Current Detainees page and pages that may have been coded in the same way
This is because pagination isn't universally implemented, so any code or solution would depend on how the page implements pagination. Here we use ?max_rows=.... However another website, even if they have adjustable per-page limits, may use a different name for this max_rows variable, or ignore query strings altogether and so our solution may not work on a different website
Scalability issues: if you are dealing with a different website where you need millions of records, for example, a download-all-at-once approach like this can run into memory limits, both on the server side and on your computer, and either side could time out and fail to finish delivering or processing. A different approach, resembling the pagination you originally asked for, would definitely be more suitable.
So in the future if you need to download large amounts of records, this download-all-at-once approach will likely run you into memory-related trouble, but for scraping this particular Current Detainees page, it will get the job done.
I am trying to web-scrape Airbnb. I had working code, but it seems they have updated everything on the page. It intermittently returns the correct output and sometimes fails: it will return a NoneType error between the 3rd and 17th page at random. Is there a way to make it keep trying, or is my code incorrect?
for page in range(1, pages + 1):
    # get page urls
    page_url = url + '&page={0}'.format(page)
    print(page_url)

    # get page
    # browser.get(page_url)
    source = requests.get(page_url)
    soup = BeautifulSoup(source.text, 'html.parser')

    # get all listings on page
    div = soup.find('div', {'class': 'row listing-cards-row'})

    # loop through to get all info needed from cards
    for pic in div.find_all('div', {'class': 'listing-card-wrapper'}):
        print(...)
The last for loop is where my error starts to occur. This happens in some of my other functions too, where it sometimes works and sometimes doesn't. I have already given the lxml parser a try as well.
After reviewing the soup a couple of times, I noticed that every few runs of the program the source code tags would change. I threw in some exception handling and it seems to have fixed my "None" issue.
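For anyone hitting the same error, a hedged sketch of the kind of guard that avoids the NoneType crash (the search URL is a placeholder of mine, and the class names are taken from the question, so they may not match Airbnb's current markup):

import requests
from bs4 import BeautifulSoup

url = 'https://www.airbnb.com/s/homes?query=example'  # placeholder search URL
pages = 17

for page in range(1, pages + 1):
    page_url = url + '&page={0}'.format(page)
    source = requests.get(page_url)
    soup = BeautifulSoup(source.text, 'html.parser')

    div = soup.find('div', {'class': 'row listing-cards-row'})
    if div is None:
        # this response used different markup (or was blocked); skip instead of crashing
        print('no listing container on', page_url)
        continue

    for pic in div.find_all('div', {'class': 'listing-card-wrapper'}):
        print(pic.get_text(strip=True))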
I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class': 'name_location'})
print divs
# output: []
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []

while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")

    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass

    # Find the next page to scrape
    prev = soup.find("prev_signature")

    # Check if another page of results exists - if not, break from the loop
    if prev == None:
        break

    # Get the previous URL
    url = prev.contents[0]

    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data to go through, and I have no idea whether this is against the website's terms of service, so you would need to check that yourself.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on the site in a short amount of time, slowing down legitimate users' requests, not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer
Scrape smartly
I took a quick glance at that page and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server by requesting only the data you need, and it will also be easier for you to process the data because it will be in a nice format.
Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
What do you mean by "not working"? An empty list, or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
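As a quick, hedged sanity check (same URL and urllib2 usage as in the question; this only tells you whether the class name appears anywhere in the raw HTML the server returns):

import urllib2

html = urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read()
# if this prints False, the names are loaded by JavaScript/AJAX and will never
# appear in the HTML that BeautifulSoup receives
print('name_location' in html)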