I am trying to scrape a page that has many links to pages containing ads. Currently, I navigate to the first page with the list of ads and collect the links to the individual ads. Then I check that I haven't already scraped any of those links by pulling data from my database. The code below gets all the href attributes and joins them into a list, then crosschecks that list against the list of already-scraped links stored in my database. So it returns a list of the links I haven't scraped yet.
@staticmethod
def _scrape_home_urls(driver):
    home_url_list = list(home_tab.find_element_by_tag_name('a').get_attribute('href') for home_tab in driver.find_elements_by_css_selector('div[class^="nhs_HomeResItem clearfix"]'))
    return (home_url for home_url in home_url_list if home_url not in (url[0] for url in NewHomeSource.outputDB()))
Once it has scraped all the links on that page, it goes to the next one. I tried to reuse the function by calling _scrape_home_urls() again:
NewHomeSource.unique_home_list = NewHomeSource._scrape_home_urls(driver)
for x in xrange(0, limit):
    try:
        home_url = NewHomeSource.unique_home_list.next()
    except StopIteration:
        page_num = int(NewHomeSource.current_url[NewHomeSource.current_url.rfind('-')+1:]) + 1  # extract page number from url and get the next page by adding 1. example: /.../.../page-3
        page_url = NewHomeSource.current_url[:NewHomeSource.current_url.rfind('-')+1] + str(page_num)
        print page_url
        driver.get(page_url)
        NewHomeSource.current_url = driver.current_url
        NewHomeSource.unique_home_list = NewHomeSource._scrape_home_urls(driver)
        home_url = NewHomeSource.unique_home_list.next()
    # and then I use home_url to do some processing within the loop
Thanks in advance.
It looks to me like your code would be a lot simpler if you put the logic that scrapes successive pages into a generator function. This would let you use for loops rather than messing around and calling next on the generator objects directly:
def urls_gen(driver):
    while True:
        for url in NewHomeSource._scrape_home_urls(driver):
            yield url
        page_num = int(NewHomeSource.current_url[NewHomeSource.current_url.rfind('-')+1:]) + 1  # extract page number from url and get the next page by adding 1. example: /.../.../page-3
        page_url = NewHomeSource.current_url[:NewHomeSource.current_url.rfind('-')+1] + str(page_num)
        print page_url
        driver.get(page_url)
        NewHomeSource.current_url = driver.current_url
This will transparently skip over pages that don't have any unprocessed links. The generator function yields the url values indefinitely. To iterate on it with a limit like your old code did, use enumerate and break when the limit is reached:
for i, home_url in enumerate(urls_gen(driver)):
    if i == limit:
        break
    # do stuff with home_url here
I've not changed your code other than what was necessary to change the iteration. There are quite a few other things that could be improved, however. For instance, using a shorter variable than NewHomeSource.current_url would make the lines that figure out the page number and the next page's URL much more compact and readable. It's also not clear to me where that variable is initially set. If it's not used anywhere outside of this loop, it could easily be changed to a local variable in urls_gen.
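For example, here is a rough sketch of urls_gen with the page URL kept in a local variable (untested, and assuming current_url isn't needed elsewhere):

def urls_gen(driver):
    current_url = driver.current_url
    while True:
        for url in NewHomeSource._scrape_home_urls(driver):
            yield url
        # extract the page number and build the next page's URL, e.g. /.../page-3 -> /.../page-4
        prefix, _, num = current_url.rpartition('-')
        page_url = prefix + '-' + str(int(num) + 1)
        print page_url
        driver.get(page_url)
        current_url = driver.current_url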
Your _scrape_home_urls function is probably also very inefficient. It looks like it runs a database query for every url it returns, rather than a single query before checking all of the urls. Maybe that's what you want it to do, but I suspect it would be much faster done another way.
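For instance, a sketch that runs the query once and filters against a set (assuming NewHomeSource.outputDB() returns rows whose first column is the URL, as your generator expression suggests):

@staticmethod
def _scrape_home_urls(driver):
    # one database query up front; membership tests against a set are then O(1)
    already_scraped = set(url[0] for url in NewHomeSource.outputDB())
    home_tabs = driver.find_elements_by_css_selector('div[class^="nhs_HomeResItem clearfix"]')
    hrefs = [tab.find_element_by_tag_name('a').get_attribute('href') for tab in home_tabs]
    return (href for href in hrefs if href not in already_scraped)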
I have two lists of baseball players that I would like to scrape data for from the website FanGraphs. I am trying to figure out how to have Selenium search the first player in the list (which redirects to that player's profile), scrape the data I am interested in, and then search the next player, repeating until the loop has run through both lists. I have written other scrapers with Selenium, but I haven't come across this situation where I need to perform a search, collect the data, then perform the next search, and so on.
Here is a smaller version of one of the lists:
batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']
driver.get('https://www.fangraphs.com/')
search_box = driver.find_element_by_xpath('/html/body/form/div[3]/div[1]/div[2]/header/div[3]/nav/div[1]/div[2]/div/div/input')
search_box.click()
for batter in batters:
    search_box.send_keys(batter)
    search_box.send_keys(Keys.RETURN)
This will obviously send all the names into the search box one after another, so I'm trying to figure out how to code the logic of searching one by one, without performing the next search until I have collected the data for the previous one. Any help is appreciated, cheers.
With Selenium, you would just iterate through the names, "type" each one into the search bar, click/go to the link, scrape the stats, then repeat. You have it set up to do that; you just need to add the scraping part. So something like:
batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']
driver.get('https://www.fangraphs.com/')
search_box = driver.find_element_by_xpath('/html/body/form/div[3]/div[1]/div[2]/header/div[3]/nav/div[1]/div[2]/div/div/input')
search_box.click()
for batter in batters:
    search_box.send_keys(batter)
    search_box.send_keys(Keys.RETURN)
    ## CODE THAT SCRAPES THE DATA ##
    ## CODE THAT STORES IT SOMEWAY TO APPEND AFTER EACH ITERATION ##
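One practical wrinkle (an assumption about how the site behaves, not something verified here): after pressing RETURN the browser navigates away, so the original search_box element can go stale. A sketch that starts from the home page and re-locates the input on every pass, keeping the rendered page for later parsing:

results = {}
for batter in batters:
    # go back to the home page so the same locator applies, and re-find the
    # input because the old element reference goes stale after navigation
    driver.get('https://www.fangraphs.com/')
    search_box = driver.find_element_by_xpath('/html/body/form/div[3]/div[1]/div[2]/header/div[3]/nav/div[1]/div[2]/div/div/input')
    search_box.send_keys(batter)
    search_box.send_keys(Keys.RETURN)
    # keep the rendered HTML around to scrape once the loop is done
    results[batter] = driver.page_source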
However, they have an API, which is a far better solution than Selenium. Why?
APIs are consistent. Parsing HTML with Selenium and/or BeautifulSoup relies on the HTML structure. If they ever change the layout of the website, your scraper may break, as certain tags that used to be there may not be there anymore, or new tags and attributes may be added to the HTML. But the underlying data rendered in the HTML comes from the API in a nice JSON format, and that will rarely change unless they do a complete overhaul of the data structure.
It's far more efficient and quicker. There's no need to have Selenium open a browser, search, load/render the content, then scrape, then repeat. You get the response in one request.
You'll get far more data than you intended, and (imo) that is a good thing. I'd rather have more data and "trim" off what I don't need. A lot of the time you'll see very interesting and useful data that you otherwise wouldn't have known was there.
So I'm not sure what you are after specifically, but this will get you going. You'll have to sift through statsData to figure out what you want, but if you tell me what you are after, I can help get that into a nice table for you. Or, if you want to figure it out yourself, look up pandas and its .json_normalize() function. Parsing nested JSON can be tricky (but it's also fun ;-) )
Code:
import requests

# Get team IDs
def get_teamIds():
    team_id_dict = {}
    url = 'https://cdn.fangraphs.com/api/menu/menu-standings'
    jsonData = requests.get(url).json()
    for team in jsonData:
        team_id_dict[team['shortName']] = str(team['teamid'])
    return team_id_dict

# Get player IDs
def get_playerIds(team_id_dict):
    player_id_dict = {}
    for team, teamId in team_id_dict.items():
        url = 'https://cdn.fangraphs.com/api/depth-charts/roster?teamid={teamId}'.format(teamId=teamId)
        jsonData = requests.get(url).json()
        print(team)
        for player in jsonData:
            if 'oPlayerId' in player.keys():
                player_id_dict[player['player']] = [str(player['oPlayerId']), player['position']]
            else:
                player_id_dict[player['player']] = ['N/A', player['position']]
    return player_id_dict

team_id_dict = get_teamIds()
player_id_dict = get_playerIds(team_id_dict)

batters = ['Freddie Freeman', 'Bryce Harper', 'Jesse Winker']
for player in batters:
    playerId = player_id_dict[player][0]
    pos = player_id_dict[player][1]
    url = 'https://cdn.fangraphs.com/api/players/stats?playerid={playerId}&position={pos}'.format(playerId=playerId, pos=pos)
    statsData = requests.get(url).json()
Output: here's just a look at what you get.
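As a rough illustration of the pandas route mentioned above (a sketch only; the exact shape of statsData is an assumption and needs checking against the real response):

import pandas as pd

# statsData is the parsed JSON from the last request in the loop above;
# json_normalize flattens nested dicts into dotted column names
df = pd.json_normalize(statsData)
print(df.columns.tolist())  # inspect what's available before trimming
print(df.head())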
I've been thinking over this problem for a while now. For a personal project, I need to get any and every link from the specified initial webpages that isn't an external link, i.e. doesn't leave the initial website.
I'm already using bs4 to scrape webpages for links; however, I can't find a way to keep doing this for every scraped link without eventually reaching the maximum recursion depth.
My previous attempt consisted of something like this:
link_list = []
link_buffer = ["www.example.com"]

def get_links(current_link):
    new_links = []
    soup = BeautifulSoup(current_link, "html.parser")
    for link in soup.find_all("a"):
        if link.has_attr("href"):
            ...  # check here if the link does not leave the website
            new_links.append(link)
    return new_links

def get_all_the_links(list_of_links):
    target = list_of_links.pop()
    link_list.append(target)
    ...
    for link in get_links(target):
        ...
        if link not in link_list:
            list_of_links.append(link)
    if len(list_of_links) != 0:  # recursiveness
        get_all_the_links(list_of_links)

get_all_the_links(link_buffer)
I also looked over scrapy, however I found it too complicated for what I'm trying to do since I just plan on saving these links to a text file and then processing them later.
The code sample you've provided is a bit messy, but overall you just need an extra storage variable for visited links, and a check whether a link is already known before processing it:
visited_links = set()
...

def get_all_the_links(list_of_links):
    target = list_of_links.pop()
    if target in visited_links:
        # Ignore the current link and move on to next one.
        get_all_the_links(list_of_links)
    else:
        visited_links.add(target)
        ...
        # process link here...
        ...
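If recursion depth is still a concern, the same idea can be written iteratively with a queue, which sidesteps Python's recursion limit entirely. A sketch (requests is used for fetching here, and the same-site check is left as a placeholder, as in the question):

from collections import deque

import requests
from bs4 import BeautifulSoup

def get_all_the_links(start_url):
    visited_links = set()
    queue = deque([start_url])
    while queue:
        target = queue.popleft()
        if target in visited_links:
            continue
        visited_links.add(target)
        soup = BeautifulSoup(requests.get(target).text, "html.parser")
        for tag in soup.find_all("a", href=True):
            link = tag["href"]
            # check here that the link does not leave the website
            # (and resolve relative links) before queueing it
            if link not in visited_links:
                queue.append(link)
    return visited_links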
I'm new to Python and I'm struggling with passing a list as an argument to a function.
I've written a block of code that takes a URL, extracts all links from the page and puts them into a list (links = []). I want to pass this list to a function that filters out any link that is not from the same domain as the starting link (i.e. the first in the list) and outputs a new list (filtered_list = []).
This is what I have:
import requests
from bs4 import BeautifulSoup

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)

print(filter_links(links))
When I run this, I get an unfiltered list and below that, I get None.
Eventually I want to pass the filtered list to a function that grabs the HTML from each page of the domain linked on the homepage, but I am trying to tackle this problem one step at a time. Any tips would be much appreciated, thank you :)
EDIT
I can now pass the list of URLs to the filter_links() function; however, I'm filtering out too much. Eventually I want to pass several different start URLs through this program, so I need a generic way of filtering URLs that are within the same domain as the starting URL. I have used the built-in startswith method, but it's filtering out everything except the starting URL itself. I think I could use regex, but this should work too?
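One generic way to do that comparison (a sketch, untested against your actual link list): compare each link's network location to the start URL's with urllib.parse, and keep relative links too, since those stay on the same site by definition:

from urllib.parse import urlparse

def filter_links(links, start_url):
    start_netloc = urlparse(start_url).netloc
    filtered_links = []
    for link in links:
        netloc = urlparse(link).netloc
        # keep relative links (empty netloc) and links on the same host
        if netloc in ('', start_netloc):
            filtered_links.append(link)
    return filtered_links

print(filter_links(links, start_url))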
This is a follow-up question to my earlier question on looping through multiple web pages. I am new to programming... so I appreciate your patience and very explicit explanations!
I have programmed a loop through many web pages. On each page, I want to scrape data, save it to a variable or a csv file (whichever is easier/more stable), then click on the "next" button, scrape data on the second page and append it to the variable or csv file, etc.
Specifically, my code looks like this:
import csv
import re
import urllib.request

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

url = "http://www.url.com"
driver = webdriver.Firefox()
driver.get(url)
(driver.page_source).encode('utf-8')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
wait = WebDriverWait(driver, 10)

while True:
    # some code to grab the data
    job_tag = {'class': re.compile("job_title")}
    all_jobs = soup.findAll(attrs=job_tag)
    jobs = []
    for text in all_jobs:
        t = str(''.join(text.findAll(text=True)).strip())
        jobs.append(t)
    writer = csv.writer(open('test.csv', 'a', newline=''))
    writer.writerows(jobs)
    # click next link
    try:
        element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='reviews']/a/span[starts-with(.,'Next')]")))
        element.click()
    except TimeoutException:
        break
It runs without error, but
1) the file collects the data of the first page over and over again, but not the data of the subsequent pages, even though the loop performs correctly (ultimately, I do not really mind duplicate entries, but I do want data from all pages).
I suspect that I need to "redefine" the soup for each new page; I am looking into how to make bs4 access those URLs.
2) the last page has no "next" button, so the code does not append last page's data (I get that error when I use 'w' instead of 'a' in the csv line, with the data of the second-to-last page writing into the csv file).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
Thanks!
I am suspecting that I need to "redefine" the soup for each new page
Indeed, you should. You see, your while loop runs with soup always referring to the same old object you made before entering that while loop. You should rebind soup to a new BeautifulSoup instance, built from the URL you find behind the anchor (tag a) that you've located in those last lines:
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='reviews']/a/span[starts-with(.,'Next')]")))
You could access it with just your soup (note that I haven't tested this for correctness: without the actual source of the page, I'm guessing):
next_link = soup.find(id='reviews').a.get('href')
And then, at the end of your while loop, you would rebind soup:
soup = BeautifulSoup(urllib.request.urlopen(next_link).read())
You should still add a try/except clause to capture the error it'll generate on the last page, when it cannot find a "Next" link, and then break out of the loop.
Note that selenium is most likely not necessary for your use-case, bs4 would be sufficient (but either would work).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
The writer instance you've created expects an iterable for its writerows method. You are passing it a single string (which might have commas in it, but that's not what csv.writer looks at: it adds commas (or whichever delimiter you specified in its construction) between every two items of the iterable). A Python string is iterable (per character), so writer.writerows("some_string") doesn't result in an error. But you most likely wanted this:
for text in all_jobs:
    t = [x.strip() for x in text.find_all(text=True)]
    jobs.append(t)
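To make the difference concrete, here is a small illustration (the job titles are made up): writerows writes one row per inner iterable, one cell per item, whereas a bare string gets split into single characters:

import csv

jobs = [['Registered Nurse', 'ICU'], ['Lab Technician', 'Radiology']]  # made-up rows
with open('test.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(jobs)  # two rows, two cells each
    # writer.writerows(['Registered Nurse'])  # one row, one letter per cell -- the symptom described above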
As a follow-up on the comments:
You'll want to update the soup based on the new url, which you retrieve from the 1, 2, 3 Next >> (it's in a div container with a specific id, so easy to extract with just BeautifulSoup). The code below is a fairly basic example that shows how this is done. Extracting the things you find relevant is done by your own scraping code, which you'll have to add as indicated in the example.
# Python 3.x
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

url = 'http://www.indeed.com/cmp/Wesley-Medical-Center/reviews'
base_url_parts = urllib.parse.urlparse(url)

while True:
    raw_html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(raw_html)

    # scrape the page for the desired info
    # ...

    last_link = soup.find('div', id='company_reviews_pagination').find_all('a')[-1]
    if last_link.text.startswith('Next'):
        next_url_parts = urllib.parse.urlparse(last_link['href'])
        url = urllib.parse.urlunparse((base_url_parts.scheme, base_url_parts.netloc,
                                       next_url_parts.path, next_url_parts.params,
                                       next_url_parts.query, next_url_parts.fragment))
        print(url)
    else:
        break
I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
This just prints an empty list: []
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []

while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")

    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass

    # Find the next page to scrape
    prev = soup.find("prev_signature")

    # Check if another page of results exists - if not, break from the loop
    if prev == None:
        break

    # Get the previous URL
    url = prev.contents[0]

    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data there to go through, and I have no idea if this is against the terms of service of the website, so you would need to check that.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on the site in a short amount of time, slowing down legitimate users' requests. Not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer
Scrape smartly
I took a quick glance at that page and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server by only requesting the data you need. It will also be easier for you to actually process the data because it will be in a nice format.
Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
What do you mean by "not working"? An empty list or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
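A quick way to confirm that (a small check, reusing the urllib2 call from the question): look for the class name in the raw HTML before parsing; if it isn't there, the signatures are being injected by JavaScript rather than present in the static markup:

import urllib2

html = urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read()
print 'name_location' in html  # False means the markup isn't in the static page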