Python web scraping code won't open links

This is from the book "automate the boring stuff with python".
At first I made a .bat file and ran it with arguments from cmd, but it didn't open any pages in Chrome. I looked it up on here and changed the code; it still executes without errors and prints the print line, but it doesn't open tabs as it should.
What am I doing wrong? Thanks in advance.
#! python3
# lucky.py opens several google search matches
import requests, sys, webbrowser, bs4

searchTerm1 = 'python'
print('Googling...')
res = requests.get('https://www.google.com/search?={0}'.format(searchTerm1))
res.raise_for_status()

# retrieve top search result links
soup = bs4.BeautifulSoup(res.text, "html.parser")

# open a browser tab for each result
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

The short answer is that your URL is not returning results. Here's a URL that provides results: https://www.google.com/search?q=python.
I changed that one line in your code to use the template "https://www.google.com/search?q={0}" and saw that linkElems was non-empty.
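For reference, a minimal sketch of the corrected request (only the parameter name changes; note that Google's markup for result links changes over time, so the .r a selector is not guaranteed to keep working):

import requests, bs4, webbrowser

searchTerm1 = 'python'
# q= is Google's search query parameter; the original '?={0}' sends no parameter name at all
res = requests.get('https://www.google.com/search?q={0}'.format(searchTerm1))
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
print(len(linkElems))  # non-zero once the query actually returns results
for elem in linkElems[:5]:
    webbrowser.open('http://google.com' + elem.get('href'))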

In short, webbrowser is not opening any pages because numOpen is 0, so the for loop iterates over 0 items and the code inside the loop body (webbrowser.open) never executes.
The longer, more detailed explanation of why numOpen is 0 is that a redirect occurs on the initial GET request for your custom Google query. See this answer for ways to circumvent these issues; there are numerous options, and the easiest is probably to use the Google search API.
As a result of the redirect, your BeautifulSoup search does not return any results, so linkElems is an empty list and numOpen is set to 0. Since there are no list elements, the for loop body never runs.
You can debug issues like this yourself in a quick-and-dirty (though imperfect) way by adding print statements throughout the script to see which ones fail to execute, and by printing the variables to inspect their values.
As an aside, the shebang should also be set to #!/usr/bin/env python3 rather than simply #! python3. Reference here.
Hope this helps

Related

Scraping with lxml and python requests.

Okay, I am at it again and really trying to figure this stuff out with lxml and Python. The last time I asked a question I was using XPath and had to figure out how to handle the case where the direct XPath source itself would change. I have edited my code to go after the class instead. I keep running into a problem where it pulls up the address in memory and not the text that I want. Before anyone says there is a library for what I want to do, this is not about that; it is about allowing me to understand this code. Here is what I have so far, but when I print it out I get an error, and even if I add [0] and print prices[0].text it still gives me nothing. Any help would be cool.
from lxml import html
import requests
import time

while True:
    page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
    content = html.fromstring(page.content)
    # This will create a list of prices:
    prices = content.find_class('price')
    print(prices.text)
    time.sleep(.5)
Probably a formatting issue from posting but your while loop is not indented.
Try my code below:
while True:
    page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
    content = html.fromstring(page.content)
    prices = content.find_class('price')
    # You need to access the 'text_content' method
    text = [p.text_content() for p in prices]
    for t in text:
        if t.strip():  # skip whitespace-only entries to prevent the multiple blank lines
            print(t)
    time.sleep(0.5)
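To make the "address in memory" symptom concrete, here is a short sketch (assuming the request above succeeded and the page still exposes elements with class price): find_class returns a list of elements, and printing an element shows its repr rather than its text, which is why text_content() is needed.

prices = content.find_class('price')  # a list of lxml elements, not a single element
if prices:
    first = prices[0]
    print(first)                  # e.g. <Element span at 0x7f...> -- the "address in memory"
    print(first.text_content())   # the actual text inside that element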

Why does beautifulsoup return an empty list (but not when run in interpreter?)

I am writing a simple program to scrape flight data off of travel websites.
When I test this in the Python interpreter, it works fine. However, when I run it as a .py file, it returns an empty list.
I know that the program is able to get the data from the site, because I can print the entire source no problem. However, for some reason the find_all method is returning an empty list.
The part that should work:
import urllib2
from bs4 import BeautifulSoup

raw_page_data = urllib2.urlopen(query_url)
soup = BeautifulSoup(raw_page_data, 'html.parser')
raw_prices = soup.find_all(class_='price-whole')
for item in raw_prices:
    for character in item:
        if character.isdigit() == True:
            print character
Usually, in the interpreter, this returns a list of prices. I cannot print raw_prices at all, as it is empty. I believe that the error is with the find_all method, because the output is [].
I have been told that, perhaps, the JavaScript on the site is not loading the elements I am trying to find. However, this makes no sense since the entirety of the page source is loaded, and printable. If I try print soup,
it prints the entire source (which contains the elements I am trying to find).
I tried with different parser as well, to check if there was a bug.
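A quick check along those lines (a sketch; query_url is defined earlier in the full script) is to test whether the class name appears in the HTML that was actually returned:

print('price-whole' in str(soup))                # False means the prices are injected by JavaScript
print(len(soup.find_all(class_='price-whole')))  # 0 reproduces the empty-list symptom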
Any help greatly appreciated cheers!

Some HTML code disappears when we use requests in Python

I am crawling some data from a website. I need to recover some links from a list of products. First I identified one of the links with inspect element:
Then I used requests to save all the source code of that page in a text file:
source_code = requests.get(link)
plain_text = source_code.text
Then I used my text editor to search for the link, and it did not find it. I'm working with BeautifulSoup4, and I have already tried several different ways to crawl the page to get the list of products, but they all give the same result.
My suspicion is that the list of products is generated by some code (probably JavaScript) when someone enters the page, but I am not sure. I have spent several hours trying to make this work, so any hint would be appreciated.
Python never ceases to amaze me. I found a Python library that uses PhantomJS; it allows us to run JavaScript code inside a Python program. After a lot of work, I will answer my own question:
from ghost import Ghost
import re

def filterProductLinks(links):  # filter out the useless links using a regex
    pLinks = list()
    for l in links:
        if re.match(".*productDetails.*", str(l)):
            pLinks.append(l)
    return pLinks  # list of item URLs (40 max)

def getProductLinks(url):  # get the links generated by the JavaScript code
    ghost = Ghost(wait_timeout=100)
    ghost.open(url)
    links = ghost.evaluate("""
        var links = document.querySelectorAll("a");
        var listRet = [];
        for (var i=0; i<links.length; i++){
            listRet.push(links[i].href);
        }
        listRet;
        """)
    pLinks = filterProductLinks(links[0])
    return pLinks

# Test
pLinks = getProductLinks('http://www.lider.cl/walmart/catalog/category.jsp?id=CF_Nivel3_000042&pId=CF_Nivel1_000003&navAction=jump&navCount=0#categoryCategory=CF_Nivel3_000042&pageSizeCategory=20&currentPageCategory=1&currentGroupCategory=1&orderByCategory=lowestPrice&lowerLimitCategory=0&upperLimitCategory=0&&504')
for l in pLinks:
    print l
print len(pLinks)
The JavaScript code is not mine. I took it from a Ghost.py documentation page: Ghost.py Documentation

Writing CSV file while looping through web pages

This is a follow-up question to my earlier question on looping through multiple web pages. I am new to programming... so I appreciate your patience and very explicit explanations!
I have programmed a loop through many web pages. On each page, I want to scrape data, save it to a variable or a csv file (whichever is easier/more stable), then click on the "next" button, scrape data on the second page and append it to the variable or csv file, etc.
Specifically, my code looks like this:
import re
import csv
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

url = "http://www.url.com"
driver = webdriver.Firefox()
driver.get(url)
(driver.page_source).encode('utf-8')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
wait = WebDriverWait(driver, 10)

while True:
    # some code to grab the data
    job_tag = {'class': re.compile("job_title")}
    all_jobs = soup.findAll(attrs=job_tag)
    jobs = []
    for text in (all_jobs):
        t = str(''.join(text.findAll(text=True)).strip())
        jobs.append(t)
    writer = csv.writer(open('test.csv', 'a', newline=''))
    writer.writerows(jobs)
    # click next link
    try:
        element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='reviews']/a/span[starts-with(.,'Next')]")))
        element.click()
    except TimeoutException:
        break
It runs without error, but
1) the file collects the data of the first page over and over again, but not the data of the subsequent pages, even though the loop performs correctly (ultimately, I do not really mind duplicate entries, but I do want data from all pages).
I suspect that I need to "redefine" the soup for each new page; I am looking into how to make bs4 access those URLs.
2) the last page has no "next" button, so the code does not append last page's data (I get that error when I use 'w' instead of 'a' in the csv line, with the data of the second-to-last page writing into the csv file).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
Thanks!
I am suspecting that I need to "redefine" the soup for each new page
Indeed, you should. You see, your while loop runs with soup always referring to the same old object you made before entering that while loop. You should rebind soup to a new BeautifulSoup instance, most likely built from the URL you find behind the anchor (tag a) which you've located in this line:
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='reviews']/a/span[starts-with(.,'Next')]")))
You could access it with just your soup (note that I haven't tested this for correctness: without the actual source of the page, I'm guessing):
next_link = soup.find(id='reviews').a.get('href')
And then, at the end of your while loop, you would rebind soup:
soup = BeautifulSoup(urllib.request.urlopen(next_link).read())
You should still add a try-except clause to capture the error it will generate on the last page, when it cannot find the "Next" link, and then break out of the loop.
Note that selenium is most likely not necessary for your use-case, bs4 would be sufficient (but either would work).
Also, although it is a minor issue, the data gets written one letter per cell in the csv, even though when I run that portion in Python with bs4, the data is correctly formatted. What am I missing?
The writer instance you've created expects an iterable of rows for its writerows method. You are passing it a single string (which might have commas in it, but that's not what csv.writer looks at: it adds commas, or whichever delimiter you specified in its construction, between every two items of the iterable). A Python string is itself iterable (per character), so writer.writerows("some_string") doesn't raise an error. But you most likely wanted this:
for text in (all_jobs):
    t = [x.strip() for x in text.find_all(text=True)]
    jobs.append(t)
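For a concrete picture of the one-letter-per-cell effect, here is a small self-contained sketch (not part of your script) showing the difference between string rows and list rows:

import csv, io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(["Data Analyst"])          # a row that is a string: each character becomes a cell
writer.writerows([["Data Analyst", "NY"]])  # a row that is a list: one cell per list item
print(buf.getvalue())
# D,a,t,a, ,A,n,a,l,y,s,t
# Data Analyst,NY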
As a follow-up on the comments:
You'll want to update the soup based on the new url, which you retrieve from the 1, 2, 3 Next >> (it's in a div container with a specific id, so easy to extract with just BeautifulSoup). The code below is a fairly basic example that shows how this is done. Extracting the things you find relevant is done by your own scraping code, which you'll have to add as indicated in the example.
# Python 3.x
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

url = 'http://www.indeed.com/cmp/Wesley-Medical-Center/reviews'
base_url_parts = urllib.parse.urlparse(url)
while True:
    raw_html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(raw_html)
    # scrape the page for the desired info
    # ...
    last_link = soup.find('div', id='company_reviews_pagination').find_all('a')[-1]
    if last_link.text.startswith('Next'):
        next_url_parts = urllib.parse.urlparse(last_link['href'])
        url = urllib.parse.urlunparse((base_url_parts.scheme, base_url_parts.netloc,
                                       next_url_parts.path, next_url_parts.params,
                                       next_url_parts.query, next_url_parts.fragment))
        print(url)
    else:
        break

urllib.urlopen() can't handle strings with a # in them?

I'm working on a small project, a site scraper, and I've run into a problem that (I think) lies with urllib.urlopen(). So, let's say I want to scrape Google's homepage, a concatenated query, and then a search query. (I'm not actually trying to scrape from Google, but I figured it would be easy to demonstrate on.)
from bs4 import BeautifulSoup
import urllib

url = urllib.urlopen("https://www.google.com/")
soup = BeautifulSoup(url)
parseList1 = []
for i in soup.stripped_strings:
    parseList1.append(i)
parseList1 = list(parseList1[10:15])

# Second URL
url2 = urllib.urlopen("https://www.google.com/" + "#q=Kerbal Space Program")
soup2 = BeautifulSoup(url2)
parseList2 = []
for i in soup2.stripped_strings:
    parseList2.append(i)
parseList2 = list(parseList2[10:15])

# Third URL
url3 = urllib.urlopen("https://www.google.com/#q=Kerbal Space Program")
soup3 = BeautifulSoup(url3)
parseList3 = []
for i in soup3.stripped_strings:
    parseList3.append(i)
parseList3 = list(parseList3[10:15])

print " 1 "
for i in parseList1:
    print i
print " 2 "
for i in parseList2:
    print i
print " 3 "
for i in parseList3:
    print i
This prints out:
1
A whole nasty mess of scraped code from Google
2
3
Which leads me to believe that the # symbol might be preventing the url from opening?
The concatenated string doesn't throw any errors for concatenation, yet still doesn't read anything in.
Does anyone have any idea on why that would happen? I never thought that a # inside a string would have any effect on the code. I figured this would be some silly error on my part, but if it is, I can't see it.
Thanks
Browsers do not send the URL fragment part (the part introduced by "#") to servers.
RFC 1808 (Relative Uniform Resource Locators): Note that the fragment identifier (and the "#" that precedes it) is not considered part of the URL. However, since it is commonly used within the same string context as a URL, a parser must be able to recognize the fragment when it is present and set it aside as part of the parsing process.
You get the right result in a browser because the browser sends a request to https://www.google.com, the URL fragment is detected by JavaScript running on the page (most web sites won't do this), the browser then sends a new AJAX request (https://www.google.com?q=xxxxx), and finally renders the page with the JSON data it gets back. urllib cannot execute JavaScript for you.
To fix your problem, just replace https://www.google.com/#q=Kerbal Space Program with https://www.google.com/?q=Kerbal Space Program
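As a side note, the spaces in the query should also be URL-encoded. Here is a minimal Python 3 sketch of building the query-style URL (your code is Python 2, but the idea carries over):

from urllib.parse import urlencode

# Use ? (a real query string, which is sent to the server) instead of # (a fragment, which is not),
# and let urlencode handle the spaces in the search term.
params = urlencode({'q': 'Kerbal Space Program'})
url = 'https://www.google.com/?' + params
print(url)  # https://www.google.com/?q=Kerbal+Space+Program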
