How can I reliably web-scrape a line that isn't attached to any tag? - python

Sorry if that was a vague title. I'm trying to scrape the number of XKCD web-comics on a consistent basis. I saw that http://xkcd.com/ always has its newest comic on the front page, along with a line further down the page saying:
Permanent link to this comic: http://xkcd.com/1520/
Where 1520 is the number of the newest comic on display. I want to scrape this number; however, I can't find any good way to do so. Currently all my attempts look really hackish, like:
soup = BeautifulSoup(urllib.urlopen('http://xkcd.com/').read())
test = soup.find_all('div')[7].get_text().split()[20][-5:-1]
I mean... that technically works, but if anything on the website gets moved in the slightest, it could break horribly. I know there has to be a better way to just search for http://xkcd.com/####/ within the front page and return ####, but I can't seem to find it. The Permanent link to this comic: http://xkcd.com/1520/ line just seems to be floating around without any kind of tag, class, or ID. Can anyone offer any assistance?

Usually I insist on using HTML parsers. Here, though, since we are looking for a specific piece of text in the HTML (not checking any tags), it is pretty much okay to apply a regular expression search:
Permanent link to this comic: http://xkcd.com/(\d+)/
saving the digits in a capture group.
Demo:
>>> import re
>>> import requests
>>>
>>>
>>> data = requests.get("http://xkcd.com/").content
>>> pattern = re.compile(r'Permanent link to this comic: http://xkcd.com/(\d+)/')
>>> print pattern.search(data).group(1)
1520
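If you are on Python 3, a minimal equivalent sketch (using requests' decoded .text, escaping the dots in the URL, and guarding the match instead of assuming one exists) would be:
import re
import requests

# .text is already a decoded str, so the regex can be applied directly
data = requests.get("http://xkcd.com/").text
pattern = re.compile(r'Permanent link to this comic: http://xkcd\.com/(\d+)/')
match = pattern.search(data)
if match:  # guard against layout changes instead of raising AttributeError
    print(match.group(1))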

Related

Python: a script to find the website language

Hello everyone,
I am trying to write a program in Python to automatically check a website's language. My script looks at the HTML header, identifies where the string 'lang' appears, and prints the corresponding language. I use the 'requests' module.
request = requests.get('https://en.wikipedia.org/wiki/Main_Page')
splitted_text = request.text.split()
matching = [s for s in splitted_text if "lang=" in s]
language_website = matching[0].split('=')[1]
print(language_website[1:3])
# Output: en
I have tested it over several websites, and it works (given the language is correctly configured in the HTML in the first place, which is likely for the websites I consider in my research).
My question is: is there a more straightforward / consistent / systematic way to achieve the same thing? How would one look at the HTML using Python and return the language the website is written in? Is there a quicker way, using lxml for instance, that does not involve parsing strings like I do?
I know the question of how to find a website language has been asked before, and the method using the HTML header to retrieve the language was mentioned, but it was not developed and no code was suggested, so I think this post is reasonably different.
Thank you so very much! Have a wonderful day,
Berti
You can try this:
import requests
request = requests.head('https://en.wikipedia.org/wiki/Main_Page')
print(request.headers["Content-language"])
If you are interested in getting the data from the page source, this might help.
import requests
import lxml.html

request = requests.get('https://en.wikipedia.org/wiki/Main_Page')
root = lxml.html.fromstring(request.text)
language_construct = root.xpath("//html/@lang")  # this xpath is reliable in the long term, since it targets a standard construct
language = "Not found in page source"
if language_construct:
    language = language_construct[0]
print(language)
Note: this approach will not give a result for every web page, only those that declare a language code in their HTML.
Refer to https://www.w3schools.com/tags/ref_language_codes.asp for more.
Combining the above responses:
import requests
request = requests.head('https://en.wikipedia.org/wiki/Main_Page')
print(request.headers.get("Content-language", "Not found in page source"))
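To combine both ideas fully, here is a minimal sketch (detect_language is just an illustrative name): try the Content-Language header first, then fall back to the lang attribute in the page source.
import requests
import lxml.html

def detect_language(url):
    # Try the HTTP Content-Language header first (a cheap HEAD request)
    head = requests.head(url, allow_redirects=True)
    lang = head.headers.get("Content-Language")
    if lang:
        return lang
    # Fall back to the lang attribute on the <html> tag
    page = requests.get(url)
    root = lxml.html.fromstring(page.text)
    langs = root.xpath("//html/@lang")
    return langs[0] if langs else "unknown"

print(detect_language('https://en.wikipedia.org/wiki/Main_Page'))  # e.g. 'en'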

web-scraping, regex and iteration in python

I have the following url 'http://www.alriyadh.com/file/278?&page=1'
I would like to write a regex to access the urls from page=2 to page=12.
For example, this url is needed: 'http://www.alriyadh.com/file/278?&page=4', but not page=14.
I reckon what will work is a function that iterates over the specified pages to access all the urls within them. I have tried this regex, but it does not work:
'.*?=[2-9]'
My aim is to get the content from those urls using the newspaper package. I simply want this data for my research.
Thanks in advance
This does not require a regex; a simple preset loop will do.
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.alriyadh.com/file/278?&page='
for page in range(2, 13):  # pages 2 through 12 inclusive
    html = requests.get(url + str(page)).text
    soup = bs(html, 'html.parser')
    # process soup here
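Since the stated goal is to pull the content with the newspaper package, a follow-up sketch using newspaper's Article API might look like this (whether these listing pages parse into meaningful article text is something you would have to verify):
from newspaper import Article

for page in range(2, 13):
    article = Article('http://www.alriyadh.com/file/278?&page=' + str(page))
    article.download()
    article.parse()
    print(article.title)
    print(article.text[:200])  # first 200 characters as a sanity check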
Here's a regex to access the proper range (i.e. 2-12):
([2-9]|1[012])
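For example, a small sketch anchoring that range to the page parameter so that page=14 is rejected (the trailing $ assumes the page number ends the url, as in the examples above):
import re

pattern = re.compile(r'page=([2-9]|1[0-2])$')
urls = ['http://www.alriyadh.com/file/278?&page=' + str(n) for n in (1, 4, 12, 14)]
for u in urls:
    match = pattern.search(u)
    print(u, '->', match.group(1) if match else 'no match')
# page=4 and page=12 match; page=1 and page=14 do not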
Judging by what you have now, I am unsure that your regex will work as you intend it to. Perhaps I am misinterpreting your regex altogether, but is the '?=' intended to be a lookahead?
Or are you actually searching for a '?' immediately followed by a '=' immediately followed by any number 2-9?
How familiar are you with regexes in general? This particular one seems too vague to find a meaningful match.

BeautifulSoup find and find_all not working as expected

I just started using BeautifulSoup and I am encountering a problem. I set up an HTML snippet below and make a BeautifulSoup object:
html_snippet = '<p class="course"><span class="text84">Ae 100. Research in Aerospace. </span><span class="text85">Units to be arranged in accordance with work accomplished. </span><span class="text83">Open to suitably qualified undergraduates and first-year graduate students under the direction of the staff. Credit is based on the satisfactory completion of a substantive research report, which must be approved by the Ae 100 adviser and by the option representative. </span> </p>'
subject = BeautifulSoup(html_snippet)
I have tried several find and find_all operations, like the ones below, but all I get is None or an empty list:
subject.find(text = 'A')
subject.find(text = 'Research')
subject.next_element.find('A')
subject.find_all(text = 'A')
When I created the BeautifulSoup object from an HTML file on my computer before, the find and find_all operations all worked fine. However, when I pulled the html_snippet from a webpage read online through urllib2, I started getting problems.
Can anyone point out where the issue is?
Pass the argument like this:
import re
subject.find(text=re.compile('A'))
The default behavior of the text filter is to match against a tag's entire text. Passing in a regular expression lets you match on fragments.
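A quick illustration of the difference, using a shortened version of the snippet from the question:
import re
from bs4 import BeautifulSoup

html_snippet = '<p class="course"><span class="text84">Ae 100. Research in Aerospace. </span></p>'  # shortened from the question
subject = BeautifulSoup(html_snippet, 'html.parser')
print(subject.find(text='Research'))              # None: no tag's entire text is exactly 'Research'
print(subject.find(text=re.compile('Research')))  # 'Ae 100. Research in Aerospace. '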
EDIT: To match only bodies beginning with A, you can use the following:
subject.find(text=re.compile('^A'))
To match only bodies containing words that begin with A, you can use:
subject.find_all(text = re.compile(r'\bA'))
It's difficult to tell more specifically what you're looking for; let me know if I've misinterpreted what you're asking.

Scraping webdata from a website that loads data in a streaming fashion

I'm trying to scrape some data off of the FEC.gov website using python for a project of mine. Normally I use python mechanize and beautifulsoup to do the scraping.
I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is streamed into the table and mechanize.Browser() just stops listening.
So here's the issue:
If you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A ... you get the first 500 contributors whose last name starts with A and have given money to candidate P80003338 ... however, if you use browser.open() at that url all you get is the first ~5 rows.
I'm guessing it's because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a time.sleep(10) between the .open() and .read(), but that didn't make much difference.
And I checked: there's no JavaScript or AJAX on the website (or at least none is visible when you use 'view-source'). So I don't think it's a JavaScript issue.
Any thoughts or suggestions? I could use selenium or something similar but that's something that I'm trying to avoid.
-Will
Why not use an HTML parser like lxml with XPath expressions?
I tried
>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>
Similarly, you can create XPath expressions of your choice to work with.
Instead of using mechanize, why don't you use something like requests?
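For instance, a minimal sketch pairing requests with the same XPath as the answer above, to confirm that the full table comes back in one response:
import requests
import lxml.html

response = requests.get('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
tree = lxml.html.fromstring(response.text)
names = tree.xpath('//table[2]/*/td[1]/a/text()')
print(len(names))  # should print 500, matching the lxml transcript above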

Screen scraping in LXML with python-- extract specific data

I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say they type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using XPath with lxml, but I have no experience, and every single construction comes back with a blank array.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a textNode attribute, "wholeText" or "textContent", but I don't know how to use that knowledge programmatically.
Any ideas?
import lxml.html
import urllib
site = 'http://thinkexist.com/search/searchquotation.asp'
userInput = raw_input('Search for: ').strip()
url = site + '?' + urllib.urlencode({'search':userInput})
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')
print quotes[0].text_content()
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
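If the search can come back empty, as it did in the question, a small guard avoids an IndexError. A sketch in the same Python 2 style, reusing root and userInput from the snippet above:
quotes = root.xpath('//a[@class="sqq"]')
if quotes:
    print quotes[0].text_content()
else:
    print 'No quotes found for:', userInput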
If it's not necessary for you to implement this via XPath, you may use the BeautifulSoup library like this (let the myXml variable contain the page HTML source):
soup = BeautifulSoup(myXml)
for a in soup.findAll('a', {'class': 'sqq'}):
    # this is your quote
    print a.contents
Anyway, read the BS documentation; it may be very useful for some scraping needs that don't require the power of XPath.
You could open the html source to find out the exact class you are looking for. For example, to grab the first StackOverflow username encountered on the page you could do:
#!/usr/bin/env python
from lxml import html
url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print tree.findtext(path)
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print a.text_content()
# -> Parseltongue
