requests.get not getting correct information - python

I want to get several "pages" of a website, but for some reason the correct URL does not give the expected result.
I checked the URL that should be used in a browser and it works just fine, and I tried changing the variables.
for i in range(1, 100):
    MLinks.append("https://#p" + str(i))

for i in range(1, 100):
    x = i - 1
    MainR = requests.get(MLinks[x])
    SMHTree = html.fromstring(MainR.content)
    MainData = SMHTree.xpath('//@*')
    j = 0
    while j < len(MainData):
        if 'somthing' in MainData[j]:
            PLinks.append(MainData[j])  # Links of products
        j = j + 1
I am expecting to get every page, but when I read the contents I always get the contents of the first page.

I assume the URLs you are requesting look like this:
https://somehost.com/products/#p1
https://somehost.com/products/#p2
https://somehost.com/products/#p3
...
That is, the second line of your code would actually be
MLinks.append("https://somehost.com/products/#p" + str(i))
When doing the request, the server never sees the part after the # (this part is called an anchor). So the server just receives 100 requests for "https://somehost.com/products/", which all give the same results. See this website explaining it further: https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL.
Anchors are sometimes used by client-side JavaScript to load pages dynamically. What this means is that if you open "https://somehost.com/products/" and navigate to "https://somehost.com/products/#p5", the client-side JavaScript will notice it and (usually) issue a request to some other URL to load the products on page 5. This other URL will not be "https://somehost.com/products/#p5"! To find out what this URL is, open the developer tools of your browser and see what network requests are made when you navigate to a different product page.
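You can confirm the first part with the standard library: the fragment is split off on the client and never included in the request. Below is a sketch; the real paginated endpoint is a guess here (replace ?page=N with whatever URL actually shows up in the Network tab):
from urllib.parse import urlparse

import requests
from lxml import html

parts = urlparse("https://somehost.com/products/#p5")
print(parts.fragment)  # 'p5'  -- kept on the client, never sent to the server
print(parts.path)      # '/products/'  -- all the server ever sees

# Hypothetical: if the developer tools show the page fetching something like
# https://somehost.com/products/?page=5, request that URL directly instead.
PLinks = []
for i in range(1, 100):
    r = requests.get("https://somehost.com/products/", params={"page": i})
    tree = html.fromstring(r.content)
    for value in tree.xpath('//@*'):   # attribute values, as in your code
        if 'somthing' in value:
            PLinks.append(value)       # links of products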

Related

How to scrape paginated table

I am trying to scrape all the data from the table on this website (https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s) but can't seem to figure out how I would go about scraping all of the subsequent pages. This is the code to scrape the first page of results into a CSV file:
import csv
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")

fileList = []

# For the table-header cells
tableHeader = soup.find('tr', attrs={'class': 'table-header'})
rowList = []
for cell in tableHeader.findAll('th'):
    cellText = cell.text.replace(' ', '').replace('\n', '')
    rowList.append(cellText)
fileList.append(rowList)

# For the table body cells
table = soup.find('tbody', attrs={'class': 'stripe'})
for row in table.findAll('tr'):
    rowList = []
    for cell in row.findAll('td'):
        cellText = cell.text.replace(' ', '').replace('\n', '')
        if cellText == "Details":
            continue
        rowList.append(cellText)
    fileList.append(rowList)

outfile = open("./prison-inmates.csv", "w")
writer = csv.writer(outfile)
writer.writerows(fileList)
How do I get to the next page of results?
Code taken from this tutorial (http://first-web-scraper.readthedocs.io/en/latest/)
Although I wasn't able to get your posted code to run, I did find that the original tutorial code you linked to can be changed on the url = line to:
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' \
+ '?max_rows=250'
Running python scrape.py then successfully outputs inmates.csv with all available records.
In short, this works by reframing the question: instead of "How do I get to the next page?", we pursue "How do I remove pagination?" We cause the page to send all records at once, so there is no pagination to deal with in the first place. This allows us to use the original tutorial code to save the complete set of records.
Explanation
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' uses the new URL. The old URL in the tutorial, http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp, redirects to this new URL but doesn't work with our solution, so we can't use the old URL.
\ is a line continuation, letting the line of code carry on to the next line for readability.
+ concatenates the strings, so we can add the ?max_rows=250.
So the result is equivalent to url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250'
?max_rows=<number-of-records-to-display> is a query string I found that works for this particular Current Detainees page. It can be found by first noticing the Page Size text entry field, meant for users to set a custom number of rows per page; it shows a default value of 50. Examine its HTML code, for example in the Firefox browser (52.7.3): press Ctrl+Shift+I to open Firefox's Web Developer Inspector window, click the Select element button (the icon resembles a box outline with a mouse cursor arrow on it), then click on the input field containing 50. The HTML pane below highlights: <input class="mrcinput" name="max_rows" size="3" title="max_rowsp" value="50" type="text">. This means the page submits a form variable named max_rows, which is a number defaulting to 50. Some web pages, depending on how they are coded, recognize such variables when they are appended to the URL as a query string, so it is worth trying ?max_rows= plus a number of your choice. At the time I started, the page said 250 Total Items, so I tried the custom number 250 by changing my browser address bar to load https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250. It successfully displayed 250 records, making pagination unnecessary, so this ?max_rows=250 is what we use to form the URL used by our script.
However, the page now says 242 Total Items, so it seems they are removing inmates, or at least the inmate records listed. You could use ?max_rows=242, but ?max_rows=250 will still work because 250 is larger than the total number of records (242); as long as the number is larger than the total, the page will not need to paginate, and you get all the records on one page.
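If you'd rather not concatenate the query string by hand, requests can append it for you via its params argument. A minimal sketch of the same idea, reusing the selectors from your posted code:
import requests
from bs4 import BeautifulSoup

base_url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s'
# requests encodes this as ?max_rows=250, the same URL we built by hand above
response = requests.get(base_url, params={'max_rows': 250})
soup = BeautifulSoup(response.content, 'html.parser')

# Sanity check: with pagination gone, the body table should hold every record
table = soup.find('tbody', attrs={'class': 'stripe'})
print(len(table.findAll('tr')))  # expect the full record count, not 50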
Warranty
This isn't a universal solution for scraping table data when you encounter pagination. It works for this Current Detainees page and for pages that happen to be coded in the same way.
That is because pagination isn't implemented in any universal way, so any code or solution depends on how the page implements it. Here we use ?max_rows=.... Another website, even if it has an adjustable per-page limit, may use a different name for this max_rows variable, or ignore query strings altogether, so our solution may not work on a different website.
Scalability issues: if you are in a situation with a different website where you need millions of records, for example, a download-all-at-once approach like this can run into memory limits on both the server side and your own computer; either could time out and fail to finish delivering or processing the data. A different approach, resembling the pagination you originally asked about, would be more suitable there.
So if you need to download large numbers of records in the future, this download-all-at-once approach will likely run you into memory-related trouble, but for scraping this particular Current Detainees page, it will get the job done.

Python Requests/Selenium with BeautifulSoup not returning find_all every time

I am trying to webscrape Airbnb. I had working code, but it seems they have updated everything on the page. It intermittently returns the correct output, and sometimes it fails: it returns a NoneType error between the 3rd and 17th page, seemingly at random. Is there a way for it to keep trying, or is my code incorrect?
for page in range(1, pages + 1):
    # get page url
    page_url = url + '&page={0}'.format(page)
    print(page_url)
    # get page
    # browser.get(page_url)
    source = requests.get(page_url)
    soup = BeautifulSoup(source.text, 'html.parser')
    # get all listings on page
    div = soup.find('div', {'class': 'row listing-cards-row'})
    # loop through to get all info needed from cards
    for pic in div.find_all('div', {'class': 'listing-card-wrapper'}):
        print(...)
The last for loop is where my error starts to occur. This happens in some of my other functions too, where it sometimes works and sometimes doesn't. I have already given the lxml parser a try as well.
After reviewing the soup a few times, I noticed that on some runs of the program the tags in the source code would change. I threw in some exceptions and it seems to have fixed my "None" issue.
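One way to make it keep trying is sketched below, along the lines of that fix: wrap the request in a small retry helper and skip the page if the listings div never shows up. The search URL and page count are hypothetical stand-ins for the url and pages variables in the question:
import time

import requests
from bs4 import BeautifulSoup

def get_listings(page_url, retries=3, delay=2):
    """Fetch a search page and return its listing cards, retrying a few times
    when the expected container is missing from the returned markup."""
    for _ in range(retries):
        source = requests.get(page_url)
        soup = BeautifulSoup(source.text, 'html.parser')
        div = soup.find('div', {'class': 'row listing-cards-row'})
        if div is not None:
            return div.find_all('div', {'class': 'listing-card-wrapper'})
        time.sleep(delay)  # back off, then request the page again
    return []  # give up on this page instead of raising a NoneType error

url = 'https://www.airbnb.com/s/homes?items_offset=0'  # hypothetical search URL
pages = 17
for page in range(1, pages + 1):
    page_url = url + '&page={0}'.format(page)
    for pic in get_listings(page_url):
        print(pic)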

Create a new instance of a generator in python

I am trying to scrape a page which has many links to pages which contain ads. What I am currently doing to navigate it is going to the first page with the list of ads and getting the links for the individual ads. After that, I check that I haven't already scraped any of the links by pulling data from my database. The code below basically gets all the href attributes and joins them as a list. Afterwards, I cross-check it against the list of links I have stored in my database of pages I have already scraped. So basically it returns a list of the links I haven't scraped yet.
@staticmethod
def _scrape_home_urls(driver):
    home_url_list = list(home_tab.find_element_by_tag_name('a').get_attribute('href')
                         for home_tab in driver.find_elements_by_css_selector('div[class^="nhs_HomeResItem clearfix"]'))
    return (home_url for home_url in home_url_list
            if home_url not in (url[0] for url in NewHomeSource.outputDB()))
Once it scrapes all the links of that page, it goes to the next one. I tried to reuse it by calling _scrape_home_urls() again
NewHomeSource.unique_home_list = NewHomeSource._scrape_home_urls(driver)
for x in xrange(0, limit):
    try:
        home_url = NewHomeSource.unique_home_list.next()
    except StopIteration:
        page_num = int(NewHomeSource.current_url[NewHomeSource.current_url.rfind('-')+1:]) + 1  # extract page number from url and get the next page by adding 1. example: /.../.../page-3
        page_url = NewHomeSource.current_url[:NewHomeSource.current_url.rfind('-')+1] + str(page_num)
        print page_url
        driver.get(page_url)
        NewHomeSource.current_url = driver.current_url
        NewHomeSource.unique_home_list = NewHomeSource._scrape_home_urls(driver)
        home_url = NewHomeSource.unique_home_list.next()

    # and then I use the home_url to do some processing within the loop
Thanks in advance.
It looks to me like your code would be a lot simpler if you put the logic that scrapes successive pages into a generator function. This would let you use for loops rather than messing around and calling next on the generator objects directly:
def urls_gen(driver):
    while True:
        for url in NewHomeSource._scrape_home_urls(driver):
            yield url
        page_num = int(NewHomeSource.current_url[NewHomeSource.current_url.rfind('-')+1:]) + 1  # extract page number from url and get the next page by adding 1. example: /.../.../page-3
        page_url = NewHomeSource.current_url[:NewHomeSource.current_url.rfind('-')+1] + str(page_num)
        print page_url
        driver.get(page_url)
        NewHomeSource.current_url = driver.current_url
This will transparently skip over pages that don't have any unprocessed links. The generator function yields the url values indefinitely. To iterate on it with a limit like your old code did, use enumerate and break when the limit is reached:
for i, home_url in enumerate(urls_gen(driver)):
    if i == limit:
        break
    # do stuff with home_url here
I've not changed your code other than what was necessary to change the iteration. There are quite a few other things that could be improved, however. For instance, using a shorter variable than NewHomeSource.current_url would make the lines that figure out the page number and then the next page's URL much more compact and readable. It's also not clear to me where that variable is initially set. If it's not used anywhere outside of this loop, it could easily be changed to a local variable in urls_gen.
Your _scrape_home_urls function is probably also very inefficient. It looks like it does a database query for every url it returns, rather than a single lookup before checking all of the urls. Maybe that's what you want it to do, but I suspect it would be much faster done another way.
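For instance, something along these lines (a sketch, not tested against your code, and assuming NewHomeSource.outputDB() returns rows whose first element is the url, as your membership test implies) would hit the database once per page instead of once per link:
@staticmethod
def _scrape_home_urls(driver):
    # Load the already-scraped urls into a set once; membership tests are then O(1)
    seen = {row[0] for row in NewHomeSource.outputDB()}
    home_url_list = [
        home_tab.find_element_by_tag_name('a').get_attribute('href')
        for home_tab in driver.find_elements_by_css_selector(
            'div[class^="nhs_HomeResItem clearfix"]')
    ]
    return (home_url for home_url in home_url_list if home_url not in seen)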

Writing CSV file while looping through web pages

This is a follow-up question to my earlier question on looping through multiple web pages. I am new to programming... so I appreciate your patience and very explicit explanations!
I have programmed a loop through many web pages. On each page, I want to scrape data, save it to a variable or a csv file (whichever is easier/more stable), then click on the "next" button, scrape data on the second page and append it to the variable or csv file, etc.
Specifically, my code looks like this:
url="http://www.url.com"
driver = webdriver.Firefox()
driver.get(url)
(driver.page_source).encode('utf-8')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
wait = WebDriverWait(driver, 10)
while True:
# some code to grab the data
job_tag={'class': re.compile("job_title")}
all_jobs=soup.findAll(attrs=job_tag)
jobs=[]
for text in (all_jobs):
t=str(''.join(text.findAll(text=True)).strip())
jobs.append(t)
writer=csv.writer(open('test.csv','a', newline=''))
writer.writerows(jobs)
# click next link
try:
element=wait.until(EC.element_to_be_clickable((By.XPATH, "//*[#id='reviews']/a/span[starts-with(.,'Next')]")))
element.click()
except TimeoutException:
break
It runs without error, but
1) the file collects the data of the first page over and over again, but not the data of the subsequent pages, even though the loop performs correctly (ultimately, I do not really mind duplicate entries, but I do want data from all pages).
I am suspecting that I need to "redefine" the soup for each new page; I am looking into how to make bs4 access those URLs.
2) the last page has no "next" button, so the code does not append the last page's data (I get that error when I use 'w' instead of 'a' in the csv line, with the data of the second-to-last page written into the csv file).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
Thanks!
I am suspecting that I need to "redefine" the soup for each new page
Indeed, you should. You see, your while loop runs with soup always referring to the same old object you made before entering that while loop. You should rebind soup to a new BeautifulSoup instance built from the next page's URL, which is most likely the URL you find behind the anchor (tag a) that you've located in those last lines:
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='reviews']/a/span[starts-with(.,'Next')]")))
You could access it with just your soup (note that I haven't tested this for correctness: without the actual source of the page, I'm guessing):
next_link = soup.find(id='reviews').a.get('href')
And then, at the end of your while loop, you would rebind soup:
soup = BeautifulSoup(urllib.request.urlopen(next_link).read())
You should still add a try-except clause to capture the error it'll generate on the last page, when it cannot find a "Next" link, and then break out of the loop.
Note that selenium is most likely not necessary for your use-case, bs4 would be sufficient (but either would work).
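Putting those pieces together, a minimal bs4-only sketch of the loop might look like this (untested; it assumes the "Next" anchor is the first a inside the element with id "reviews" and that its href is an absolute URL):
import urllib.request

from bs4 import BeautifulSoup

url = "http://www.url.com"  # the start URL from your code
while True:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html.parser')

    # ... scrape the job titles from this soup and append the rows to the CSV ...

    try:
        url = soup.find(id='reviews').a['href']  # follow the "Next" link
    except (AttributeError, TypeError, KeyError):
        break  # no "Next" link on the last page, so stop looping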
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
The writer instance you've created expects an iterable for its writerows method. You are passing it a single string (which might have commas in it, but that's not what csv.writer looks at: it adds commas (or whichever delimiter you specified in its construction) between every 2 items of the iterable). A Python string is iterable (per character), so writer.writerows("some_string") doesn't result in an error. But you most likely wanted this:
for text in all_jobs:
    t = [x.strip() for x in text.find_all(text=True)]
    jobs.append(t)
As a follow-up on the comments:
You'll want to update the soup based on the new url, which you retrieve from the 1, 2, 3 Next >> (it's in a div container with a specific id, so easy to extract with just BeautifulSoup). The code below is a fairly basic example that shows how this is done. Extracting the things you find relevant is done by your own scraping code, which you'll have to add as indicated in the example.
# Python 3.x
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

url = 'http://www.indeed.com/cmp/Wesley-Medical-Center/reviews'
base_url_parts = urllib.parse.urlparse(url)
while True:
    raw_html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(raw_html, 'html.parser')

    # scrape the page for the desired info
    # ...

    last_link = soup.find('div', id='company_reviews_pagination').find_all('a')[-1]
    if last_link.text.startswith('Next'):
        next_url_parts = urllib.parse.urlparse(last_link['href'])
        url = urllib.parse.urlunparse((base_url_parts.scheme, base_url_parts.netloc,
                                       next_url_parts.path, next_url_parts.params,
                                       next_url_parts.query, next_url_parts.fragment))
        print(url)
    else:
        break

urllib.open() can't handle strings with a # in them?

I'm working on a small project, a site scraper, and I've run into a problem that (I think) has to do with urllib.open(). So, let's say I want to scrape Google's homepage, a concatenated query, and then a search query. (I'm not actually trying to scrape from Google, but I figured they'd be easy to demonstrate on.)
from bs4 import BeautifulSoup
import urllib

url = urllib.urlopen("https://www.google.com/")
soup = BeautifulSoup(url)
parseList1 = []
for i in soup.stripped_strings:
    parseList1.append(i)
parseList1 = list(parseList1[10:15])

# Second URL
url2 = urllib.urlopen("https://www.google.com/" + "#q=Kerbal Space Program")
soup2 = BeautifulSoup(url2)
parseList2 = []
for i in soup2.stripped_strings:
    parseList2.append(i)
parseList2 = list(parseList2[10:15])

# Third URL
url3 = urllib.urlopen("https://www.google.com/#q=Kerbal Space Program")
soup3 = BeautifulSoup(url3)
parseList3 = []
for i in soup3.stripped_strings:
    parseList3.append(i)
parseList3 = list(parseList3[10:15])

print " 1 "
for i in parseList1:
    print i
print " 2 "
for i in parseList2:
    print i
print " 3 "
for i in parseList3:
    print i
This prints out:
1
A whole nasty mess of scraped code from Google
2
3
Which leads me to believe that the # symbol might be preventing the url from opening?
The concatenated string doesn't throw any errors for concatenation, yet still doesn't read anything in.
Does anyone have any idea why that would happen? I never thought that a # inside a string would have any effect on the code. I figured this would be some silly error on my part, but if it is, I can't see it.
Thanks
Browsers do not send the URL fragment (the part starting with "#") to servers.
RFC 1808 (Relative Uniform Resource Locators): "Note that the fragment identifier (and the "#" that precedes it) is not considered part of the URL. However, since it is commonly used within the same string context as a URL, a parser must be able to recognize the fragment when it is present and set it aside as part of the parsing process."
You get the right result in a browser because the browser sends a request to https://www.google.com, the URL fragment is detected by JavaScript (similar to the spell checking there; most web sites won't do this), the browser then sends a new AJAX request (https://www.google.com?q=xxxxx), and finally renders the page with the JSON data it gets back. urllib cannot execute JavaScript for you.
To fix your problem, just replace https://www.google.com/#q=Kerbal Space Program with https://www.google.com/?q=Kerbal Space Program
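If you want to build such query strings programmatically, the standard library will handle the URL-encoding (the space, for example) for you. A minimal sketch, shown here in Python 3 (your snippet uses Python 2, where the equivalent functions are urllib.urlencode and urllib.urlopen); note that Google may still block or redirect plain script requests, so the point is only how the URL is formed:
from urllib.parse import urlencode
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Build a proper query string: the spaces are URL-encoded and there is no '#'
query = urlencode({"q": "Kerbal Space Program"})  # 'q=Kerbal+Space+Program'
url = "https://www.google.com/?" + query

soup = BeautifulSoup(urlopen(url), "html.parser")
print(list(soup.stripped_strings)[10:15])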
