Scraping web data from a website that loads data in a streaming fashion - Python

I'm trying to scrape some data off the FEC.gov website using Python for a project of mine. Normally I use mechanize and BeautifulSoup to do the scraping.
I've been able to figure out most of the issues, but I can't seem to get around one problem: it seems like the data is streamed into the table and mechanize.Browser() just stops listening.
So here's the issue:
If you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A you get the first 500 contributors whose last name starts with A and who have given money to candidate P80003338; however, if you use browser.open() on that URL, all you get is the first ~5 rows.
I'm guessing it's because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a time.sleep(10) between the .open() and the .read(), but that didn't make much difference.
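Roughly, the code in question looks like this (simplified):
import time
import mechanize

# simplified version of what I'm doing: open the page, wait, then read --
# the read() still only returns the first handful of rows
br = mechanize.Browser()
response = br.open('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
time.sleep(10)
html = response.read()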
And I checked: there's no JavaScript or AJAX on the site (or at least none visible in 'view-source'), so I don't think it's a JavaScript issue.
Any thoughts or suggestions? I could use Selenium or something similar, but that's something I'm trying to avoid.
-Will

Why not use an HTML parser like lxml with XPath expressions?
I tried
>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>
Similarly, you can create whatever XPath expressions you need to work with.
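For example, if you wanted the link targets instead of the link text, swapping text() for @href should work (an untested sketch):
>>> links = data.xpath('//table[2]/*/td[1]/a/@href')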

Instead of using mechanize, why don't you use something like requests?
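A minimal fetch might look like this (a sketch, not tested against the live site; requests downloads the full body before returning, so there is no separate open/read step to get out of sync):
import requests

url = 'http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A'
# .text holds the complete response body once get() returns
html = requests.get(url).text
print(len(html))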


Python: a script to find the website language

Hello everyone,
I am trying to write a program in Python to automatically check a website's language. My script looks at the HTML source, identifies where the string 'lang' appears, and prints the corresponding language. I use the 'requests' module.
import requests

request = requests.get('https://en.wikipedia.org/wiki/Main_Page')
splitted_text = request.text.split()
matching = [s for s in splitted_text if "lang=" in s]
language_website = matching[0].split('=')[1]
print(language_website[1:3])
# -> en
I have tested it over several websites, and it works (given the language is correctly configured in the HTML in the first place, which is likely for the websites I consider in my research).
My question is: is there a more straightforward/consistent/systematic way to achieve the same thing? How would one look at the HTML using Python and return the language the website is written in? Is there a quicker way using lxml, for instance (one that does not involve splitting strings like I do)?
I know the question of how to find a website language has been asked before, and the method using the HTML header to retrieve the language was mentioned, but it was not developed and no code was suggested, so I think this post is reasonably different.
Thank you so very much! Have a wonderful day,
Berti
You can try this:
import requests
request = requests.head('https://en.wikipedia.org/wiki/Main_Page')
print(request.headers["Content-language"])
If you are interested in getting the data from the page source, this might help:
import requests
import lxml.html

request = requests.get('https://en.wikipedia.org/wiki/Main_Page')
root = lxml.html.fromstring(request.text)
# //html/@lang is reliable long-term, since lang is a standard attribute
language_construct = root.xpath("//html/@lang")
language = "Not found in page source"
if language_construct:
    language = language_construct[0]
print(language)
Note: this approach will not give a result for every webpage, only those whose HTML declares a language code. See https://www.w3schools.com/tags/ref_language_codes.asp for more.
Combining the above responses:
import requests
request = requests.head('https://en.wikipedia.org/wiki/Main_Page')
print(request.headers.get("Content-language", "Not found in headers"))
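If you want both checks in one place, a small helper could try the header first and fall back to the lang attribute on the <html> tag (a sketch; detect_language is just a name I made up):
import requests
import lxml.html

def detect_language(url):
    # prefer the HTTP Content-Language header; fall back to the lang
    # attribute on the <html> tag; return None if neither is present
    response = requests.get(url)
    header_lang = response.headers.get('Content-Language')
    if header_lang:
        return header_lang
    lang_attr = lxml.html.fromstring(response.text).xpath('//html/@lang')
    return lang_attr[0] if lang_attr else None

print(detect_language('https://en.wikipedia.org/wiki/Main_Page'))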

How to check if a string is present in a website with Python

I'm trying to check whether a string like "data=sold" is present in a website.
Right now I'm using requests and a while loop, but I need it to be faster:
response = requests.get(link)
if ('data=sold' in response.text):
It works well, but it is not fast. Is there a way to "request" only the part of the website I need, to make the search faster?
I think your response.text is HTML, right?
To avoid searching the raw string, you can try Beautiful Soup (docs here):
from bs4 import BeautifulSoup

html = response.text
bs = BeautifulSoup(html, 'html.parser')
# collect the data-sold attribute of every <ul> that carries one
[item['data-sold'] for item in bs.find_all('ul', attrs={'data-sold': True})]
You can see another reference here.
Or maybe think about a parallel for loop in Python; we can make many requests at the same time (see the sketch below).
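For instance, a sketch with the standard library's thread pool (the URL list is a placeholder):
import requests
from concurrent.futures import ThreadPoolExecutor

links = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

def contains_sold(link):
    # fetch one page and report whether the marker string appears in it
    response = requests.get(link)
    return link, 'data=sold' in response.text

# run several requests concurrently instead of one after another
with ThreadPoolExecutor(max_workers=8) as pool:
    for link, found in pool.map(contains_sold, links):
        print(link, found)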
As already commented, whether you can request only part of the page depends on the website/server. For an ordinary website, I would think it's not possible.
If the website is really big, the only way I can currently think of to make the search faster is to process the data just in time. When you call requests.get(link), the whole site is downloaded before you can process the data. You could try calling
r = requests.get(link, stream=True)
instead, and then iterate through the lines as they arrive:
for line in r.iter_lines():
    if b'data=sold' in line:
        print("hooray")
        break  # stop downloading once the string is found
Of course, you could also analyze the raw stream and just skip x bytes, or use the aiohttp library... maybe you need to give some more information about your problem.
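As a sketch of the "skip bytes" idea: if the server honours HTTP Range requests, you can ask for just the first part of the page (many servers ignore this for dynamic pages, so check the status code):
import requests

link = 'https://example.com/listing'  # placeholder URL
# ask for only the first 10 KB; the server is free to ignore this
response = requests.get(link, headers={'Range': 'bytes=0-9999'})
if response.status_code == 206:  # 206 Partial Content: the server complied
    print('data=sold' in response.text)
else:
    print('server ignored the Range header and sent the full page')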

How can I dependably web-scrape a largely unattached line effectively?

Sorry if that was a vague title. I'm trying to scrape the number of the newest XKCD web-comic on a consistent basis. I saw that http://xkcd.com/ always has its newest comic on the front page, along with a line further down the page saying:
Permanent link to this comic: http://xkcd.com/1520/
where 1520 is the number of the newest comic on display. I want to scrape this number; however, I can't find any good way to do so. Currently all my attempts look really hackish, like:
soup = BeautifulSoup(urllib.urlopen('http://xkcd.com/').read())
test = soup.find_all('div')[7].get_text().split()[20][-5:-1]
I mean, that technically works, but if anything on the website gets moved in the slightest it could break horribly. I know there has to be a better way to just search for http://xkcd.com/####/ within the front page and return ####, but I can't seem to find it. The "Permanent link to this comic: http://xkcd.com/1520/" line just seems to be floating around without any kind of tag, class, or ID. Can anyone offer any assistance?
Usually I insist on using HTML parsers. Here, since we are looking for specific text in the HTML (not checking any tags), it is pretty much okay to apply a regular expression search with:
Permanent link to this comic: http://xkcd.com/(\d+)/
saving the digits in a group.
Demo:
>>> import re
>>> import requests
>>>
>>>
>>> data = requests.get("http://xkcd.com/").content
>>> pattern = re.compile(r'Permanent link to this comic: http://xkcd.com/(\d+)/')
>>> print pattern.search(data).group(1)
1520
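As an aside, if memory serves, xkcd also publishes the current comic's metadata as JSON at http://xkcd.com/info.0.json, which sidesteps HTML scraping entirely (a sketch):
import requests

# the "num" field holds the number of the latest comic
latest = requests.get("http://xkcd.com/info.0.json").json()
print latest["num"]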

Python Regex scraping data from a webpage

My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to do a findall on Groupon's page to find data like this (from this page: http://www.groupon.de/alle-deals/muenchen/restaurant-296):
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get the 'deals/muenchen-special/Casa-Lavecchia/24788330'.
I tried the whole night but was unable to find a correct regex. I tried:
import urllib2
import re
Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$',Page_Web):
print m
But it doesn't print anything.
In order to extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html)
scriptResults = soup('script',{'type' : 'text/javascript'})
js_block = scriptResults[12]
Starting from this, you can parse it with a regex if you want, or try to interpret the JS (there are some threads on Stack Overflow about that).
Anyway, like the others said, you should use the Groupon API...
P.S.
The block that you are parsing can easily be read as a dictionary; it is actually already a list of dictionaries if you look closely...
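For instance, a quick regex sketch over the raw HTML (untested; it assumes the page still embeds the "dealPermaLink":"..." pairs shown in the question):
import re
import urllib2

html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
# grab everything between the quotes after "dealPermaLink":
for link in re.findall(r'"dealPermaLink":"([^"]+)"', html):
    print link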
How about changing RESATAURANT1 to RESTAURANT1, for starters?

Screen scraping in lxml with Python -- extract specific data

I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say they type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using XPath with lxml, but I have no experience and every single construction comes back with a blank array.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a textNode attribute, "wholeText" or "textContent", but I don't know how to use that knowledge programmatically.
Any ideas?
import lxml.html
import urllib
site = 'http://thinkexist.com/search/searchQuotation.asp'
userInput = raw_input('Search for: ').strip()
url = site + '?' + urllib.urlencode({'search':userInput})
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')
print quotes[0].text_content()
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
If it's not necessary for you to implement this via XPath, you may use the BeautifulSoup library like this (let the myXml variable contain the page's HTML source):
from bs4 import BeautifulSoup

soup = BeautifulSoup(myXml)
for a in soup.findAll('a', {'class': 'sqq'}):
    # this is your quote
    print a.contents
Anyway, read the BS documentation; it may be very useful for scraping needs that don't require the power of XPath.
You could open the HTML source to find out the exact class you are looking for. For example, to grab the first Stack Overflow username encountered on the page, you could do:
#!/usr/bin/env python
from lxml import html
url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print tree.findtext(path)
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print a.text_content()
# -> Parseltongue
