urllib.urlopen() can't handle strings with a # in them? - python

I'm working on a small project, a site scraper, and I've run into a problem that (I think) lies with urllib.urlopen(). So, let's say I want to scrape Google's homepage, a concatenated query, and then a search query. (I'm not actually trying to scrape from Google, but I figured they'd be easy to demonstrate on.)
from bs4 import BeautifulSoup
import urllib
url = urllib.urlopen("https://www.google.com/")
soup = BeautifulSoup(url)
parseList1=[]
for i in soup.stripped_strings:
    parseList1.append(i)
parseList1 = list(parseList1[10:15])
#Second URL
url2 = urllib.urlopen("https://www.google.com/"+"#q=Kerbal Space Program")
soup2 = BeautifulSoup(url2)
parseList2=[]
for i in soup2.stripped_strings:
    parseList2.append(i)
parseList2 = list(parseList2[10:15])
#Third URL
url3 = urllib.urlopen("https://www.google.com/#q=Kerbal Space Program")
soup3 = BeautifulSoup(url3)
parseList3=[]
for i in soup3.stripped_strings:
    parseList3.append(i)
parseList3 = list(parseList3[10:15])
print " 1 "
for i in parseList1:
    print i
print " 2 "
for i in parseList2:
    print i
print " 3 "
for i in parseList3:
    print i
This prints out:
1
A whole nasty mess of scraped code from Google
2
3
Which leads me to believe that the # symbol might be preventing the URL from opening?
The concatenated string doesn't throw any errors for concatenation, yet still doesn't read anything in.
Does anyone have any idea why that would happen? I never thought that a # inside a string would have any effect on the code. I figured this would be some silly error on my part, but if it is, I can't see it.
Thanks

Browsers should not send the URL fragment (the part that starts with "#") to servers.
RFC 1808 (Relative Uniform Resource Locators):
Note that the fragment identifier (and the "#" that precedes it) is not considered part of the URL. However, since it is commonly used within the same string context as a URL, a parser must be able to recognize the fragment when it is present and set it aside as part of the parsing process.
You get the right result in a browser because the browser sends a request to https://www.google.com, the URL fragment is detected by JavaScript (this is similar to the spell checking here; most web sites won't do this), the browser then sends a new AJAX request (https://www.google.com?q=xxxxx), and finally renders the page with the JSON data it gets back. urllib cannot execute JavaScript for you.
To fix your problem, just replace https://www.google.com/#q=Kerbal Space Program with https://www.google.com/?q=Kerbal Space Program
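A minimal sketch of that fix applied to the second request in the question, using urllib.urlencode so the space in the query is percent-encoded (note that Google may still redirect or block scripted searches, so treat this as illustrative rather than guaranteed):
import urllib
from bs4 import BeautifulSoup
# Percent-encode the query so the space does not break the URL
query = urllib.urlencode({'q': 'Kerbal Space Program'})
url2 = urllib.urlopen("https://www.google.com/?" + query)
soup2 = BeautifulSoup(url2)
print list(soup2.stripped_strings)[10:15]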

Related

requests.get not getting correct information

I want to get several "pages" of a website, and for some reason the correct URL does not give the expected result.
I looked at the URL that should be used and it works just fine, so I tried varying it with a variable:
for i in range(1,100):
    MLinks.append("https://#p" + str(i))
for i in range(1,100):
    x = i-1
    MainR = requests.get(MLinks[x])
    SMHTree = html.fromstring(MainR.content)
    MainData = SMHTree.xpath('//@*')
    j=0
    while j <len(MainData):
        if 'somthing' in MainData[j] :
            PLinks.append(MainData[j]) #Links of products
        j=j+1
I am expecting to get every page but when I am reading the contents I always get the contents of the first page.
I assume the URLs you are requesting look like this:
https://somehost.com/products/#p1
https://somehost.com/products/#p2
https://somehost.com/products/#p3
...
That is, the second line of your code would actually be
MLinks.append("https://somehost.com/products/#p" + str(i))
When doing the request, the server never sees the part after the # (this part is called an anchor). So the server just receives 100 requests for "https://somehost.com/products/", which all give the same results. See this website explaining it further: https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL.
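You can see that split for yourself with the standard library (a small illustration, not part of your script):
from urlparse import urlsplit
parts = urlsplit("https://somehost.com/products/#p5")
print(parts.path)      # '/products/' - this is all the server ever sees
print(parts.fragment)  # 'p5' - this part stays on the client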
Anchors are sometimes used by client-side JavaScript to load pages dynamically. What this means is that if you open "https://somehost.com/products/" and navigate to "https://somehost.com/products/#p5", the client-side JavaScript will notice it and (usually) issue a request to some other URL to load the products on page 5. This other URL will not be "https://somehost.com/products/#p5"! To find out what this URL is, open the developer tools of your browser and see what network requests are made when you navigate to a different product page.
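If the developer tools show that the products are fetched with an ordinary query parameter, the loop could look roughly like the sketch below. The ?page= parameter and the //@href extraction are purely hypothetical stand-ins here; use whatever URL and selector the Network tab and the real markup call for.
import requests
from lxml import html
PLinks = []
for i in range(1, 100):
    # Hypothetical endpoint - substitute the URL your browser's Network tab actually shows
    MainR = requests.get("https://somehost.com/products/", params={"page": i})
    SMHTree = html.fromstring(MainR.content)
    for value in SMHTree.xpath('//@href'):
        if 'somthing' in value:
            PLinks.append(value)  # links of products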

python web scraping code won't open links

This is from the book "automate the boring stuff with python".
At first I made a .bat file and ran it with arguments from cmd; it didn't open any pages in Chrome. I looked it up on here and changed the code, and it still executes perfectly and prints the print line, but it doesn't open tabs as it should.
What am I doing wrong? Thanks in advance
#! python3
# lucky.py opens several google search matches
import requests,sys,webbrowser,bs4
searchTerm1 = 'python'
print('Googling...')
res = requests.get('https://www.google.com/search?={0}'.format(searchTerm1))
res.raise_for_status()
#retrieve top search result links
soup = bs4.BeautifulSoup(res.text,"html.parser")
#open a browser tab for each result.
linkElems = soup.select('.r a')
numOpen = min(5,len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
The short answer is that your URL is not returning results. Here's a URL that provides results: https://www.google.com/search?q=python.
I changed the one line in your code to use this template: "https://www.google.com/search?q={0}", and I saw that linkElems was non-trivial.
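That is, the request line becomes (everything else in the script stays the same):
res = requests.get('https://www.google.com/search?q={0}'.format(searchTerm1))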
In short, webbrowser is not opening any pages because numOpen is 0, so the for loop iterates over 0 items and the code inside it (webbrowser.open) never runs.
The longer, more detailed explanation is that numOpen ends up as 0 because of a redirect that occurs on the initial GET request for your custom Google query. See this answer for ways to work around that; there are numerous options, and the easiest is probably to use the Google search API.
Because of the redirect, your BeautifulSoup selection returns no results, so linkElems is empty, numOpen is set to 0, and the for loop never executes.
You can debug issues like this on your own in a quick-and-dirty (but not perfect) way by adding print statements throughout the script, seeing which of them fail to execute, and printing the variables to inspect their values.
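For example, a few throwaway print statements like these (just a suggestion of where to look) would make the problem obvious:
print('Status code:', res.status_code)   # did the request succeed, or was it redirected/blocked?
print('Links found:', len(linkElems))    # 0 means the '.r a' selector matched nothing
print('numOpen:', numOpen)               # 0 means the loop body never runs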
As an aside, the shebang should also be set to #!/usr/bin/env python3 rather than simply #! python3. Reference here.
Hope this helps

Trying to match a regular expression on a website using Mechanize and python

I'm trying to eventually populate a Google Sheet from data I'm scraping from Wikipedia. (I'll deal with the robots.txt file later; I'm just trying to figure out how to do this conceptually.) My code is below. I'm trying to read the page in as a string and then run a regex search on it. My goal is to isolate the specs on the page and at least store them as a value, but I'm having a problem: the search keeps coming up as "did not find".
Be gentle, I'm a noob - thanks in advance for your help!
import mechanize
import re
import gspread
br = mechanize.Browser()
pagelist=["https://en.wikipedia.org/wiki/Tesla_Model_S"]
wheelbase = ''
length =''
width= ''
height =''
pages=len(pagelist)
i=0
br.open(pagelist[0])
page = br.response()
print page.read()
pageAsaString = str(page.read())
match = re.search('Wheelbase',pageAsaString)
if match:
print 'found', match.group()
else:
print 'did not find'
I get the page just fine - the reason you're seeing the "did not find" message is that your print 'did not find' block isn't properly indented. This matters in Python! Bump it over 4 spaces:
if match:
    print 'found', match.group()
else:
    print 'did not find'
There's one other thing. I'm not familiar with Mechanize, but you're calling read() on the page twice, and the first call exhausts it. Once you've read() the page in print page.read(), there isn't anything left to consume and assign to pageAsaString - you've already read to the end of the page. So you'll want to read the page once and save the result to a variable first. Check out the documentation for IO operations here.
After fixing the indentation and removing print page.read(), everything appeared to work just fine.
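For reference, a minimal sketch of the read-once approach:
page = br.response()
pageAsaString = page.read()    # read once and keep the whole response body
print pageAsaString            # reuse the saved string instead of calling read() again
match = re.search('Wheelbase', pageAsaString)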
Since you're starting out, I highly recommend reading Dive Into Python. Good luck with your project!

web scraping in python

I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')
results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")
    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass
    # Find the next page to scrape
    prev = soup.find("prev_signature")
    # Check if another page of results exists - if not, break from the loop
    if prev == None:
        break
    # Get the previous URL
    url = prev.contents[0]
    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))
# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data to go through, and I have no idea whether this is against the terms of service of the website, so you would need to check that.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on the site in a short amount of time, slowing down legitimate users' requests - not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer (see the sketch after this list)
Scrape smartly
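A minimal sketch of spacing requests with a timer (the URL below is just a placeholder for whatever pages or endpoint you actually need):
import time
import urllib2
# Placeholder URLs - substitute the real pages or API endpoint
urls = ["http://www.example.com/?page={0}".format(i) for i in range(1, 4)]
for url in urls:
    data = urllib2.urlopen(url).read()
    # ... process data here ...
    time.sleep(5)  # pause between requests to keep the load on the server low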
I took a quick glance at that page, and it appears they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server by requesting only the data you need, and it will also be easier for you to process the data because it will be in a nice format.
Edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
What do you mean by "not working" - an empty list or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
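A quick diagnostic sketch for that: dump the div classes that are actually present in the HTML you received and see whether name_location is among them.
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
# List every class seen on a div in the downloaded page
print set(cls for div in soup.findAll('div') for cls in div.get('class', []))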

Screen scraping in LXML with python-- extract specific data

I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say the type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using Xpath with lxml, but have no experience and every single construction comes back with a blank array.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a textNode attribute, "wholeText" or "textContent" - but I don't know how to use that knowledge programmatically.
Any ideas?
import lxml.html
import urllib
site = 'http://thinkexist.com/search/searchquotation.asp'
userInput = raw_input('Search for: ').strip()
url = site + '?' + urllib.urlencode({'search':userInput})
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')
print quotes[0].text_content()
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
If it's not necessary for you to implement this via XPath, you may use the BeautifulSoup library like this (let the myXml variable contain the page HTML source):
soup = BeautifulSoup(myXml)
for a in soup.findAll('a', {'class' : 'sqq'}):
    # this is your quote
    print a.contents
Anyway, read the BS documentation; it may be very useful for some scraping needs that don't require the power of XPath.
You could open the html source to find out the exact class you are looking for. For example, to grab the first StackOverflow username encountered on the page you could do:
#!/usr/bin/env python
from lxml import html
url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print tree.findtext(path)
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print a.text_content()
# -> Parseltongue
