Python, parsing HTML

Thanks to the kind users of this site, I have some idea of how to use re as an alternative to a non-standard Python module so that my script will work with minimal overhead. Today, I've been experimenting with parsing modules and came across BeautifulSoup. This is all great, but I don't understand it.
For educational purposes, I'd like to extract the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web crawler; I'm not trying to crawl the whole site, just to extract the information from this page to learn how parsing modules work!):
Movie Title
Quality
Torrent Link
There are 22 of these items, and I want them stored in lists, in order, i.e. item_1, item_2, and so on. Each list needs to contain these three items. For instance:
item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]
And then, to make matters simple, I just want to print every item to the console. To make things more difficult, however, these items don't have identifiers on the page, so the info needs to be strictly ordered. This is all good, but all I'm getting is either the entire source contained in each list item, or empty items! An example item divider is as follows:
<div class="browse-info">
<span class="info">
<h3>James Bond: Casino Royale (2006)</h3>
<p><b>Size:</b> 1018.26 MB</p>
<p><b>Quality:</b> 720p</p>
<p><b>Genre:</b> Action | Crime</p>
<p><b>IMDB Rating:</b> 7.9/10</p>
<span>
<p class="peers"><b>Peers:</b> 698</p>
<p class="peers"><b>Seeds:</b> 356</p>
</span>
</span>
<span class="links">
View Info<span></span>
Download<span></span>
</span>
</div>
Any ideas? Would someone please do me the honours of giving me an example of how to do this? I'm not sure BeautifulSoup accommodates all of my requirements! P.S. Sorry for the poor English, it's not my first language.

from bs4 import BeautifulSoup
import urllib2

f = urllib2.urlopen('http://yify-torrents.com/browse-movie')
html = f.read()
soup = BeautifulSoup(html)

In [25]: for i in soup.findAll("div", {"class": "browse-info"}):
    ...:     name = i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text == "Quality:":
    ...:             quality = x.parent.text
    ...:     link = i.find('a', {"class": "std-btn-small mleft torrentDwl"})['href']
    ...:     print [name, quality, link]
    ...:
[u'James Bond: Casino Royale (2006)', u'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
[u'Pitch Perfect (2012)', u'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...
or to get exactly the output you wanted:
In [26]: for i in soup.findAll("div", {"class": "browse-info"}):
    ...:     name = i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text == "Quality:":
    ...:             quality = x.parent.find(text=True, recursive=False).strip()
    ...:     link = i.find('a', {"class": "std-btn-small mleft torrentDwl"})['href']
    ...:     print [name, quality, link]
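If you want the triples stored rather than just printed, as the question asks, a small variation (just a sketch reusing the same selectors) collects them into a list of lists:
items = []  # items[0] plays the role of item_1, items[1] of item_2, ...
for i in soup.findAll("div", {"class": "browse-info"}):
    name = i.find('a').text
    for x in i.findAll('b'):
        if x.text == "Quality:":
            quality = x.parent.find(text=True, recursive=False).strip()
    link = i.find('a', {"class": "std-btn-small mleft torrentDwl"})['href']
    items.append([name, quality, link])
for item in items:
    print item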

As you requested, I'll paste a simple example of a parser. As you can see, it uses lxml. With lxml you have two ways to work with the DOM tree: one is XPath and the other is CSS selectors.
I preferred XPath.
import lxml.html
import urllib

def parse():
    url = 'https://sometotosite.com'
    doc = lxml.html.fromstring(urllib.urlopen(url).read())
    main_div = doc.xpath("//div[@id='line']")[0]
    main = {}
    tr = []
    for el in main_div.getchildren():
        if el.xpath("descendant::a[contains(@name,'tn')]/text()"):
            category = el.xpath("descendant::a[contains(@name,'tn')]/text()")[0]
            main[category] = ''
            tr = []
        else:
            for element in el.getchildren():
                if '&#8212' in lxml.html.tostring(element):
                    tr.append(element)
            print category, tr

parse()
LXML official site
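For comparison, here is a minimal sketch of the CSS-selector route mentioned above, run against a made-up fragment (newer lxml versions need the cssselect package installed):
import lxml.html

doc = lxml.html.fromstring("<div id='line'><a name='tn1'>Category</a></div>")
# cssselect() translates the CSS selector into XPath internally
for a in doc.cssselect("div#line a[name^='tn']"):
    print a.text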

Related

Beautiful Soup: Extracting all the <br/> from the <strong>

I have a really silly and annoying problem: I'm trying to convert HTML into Markdown, but my HTML is sloppily formatted. I keep having stuff like this:
<strong>Ihre Aufgaben:<br/></strong>
or
<strong> <br/>Über die XXXX GmbH:<br/></strong>
which is totally valid HTML.
However, the library I use to convert to Markdown (html2text) converts it to:
**Ihre Aufgaben:\n**
and
** \nÜber die XXXX GmbH:\n**
which is an already reported issue, because the Markdown is then invalid and cannot be rendered properly.
My approach to this problem was the following:
Use BeautifulSoup to find all the strong tags that lead to this problem.
Classify the <br/> tags into two groups: the ones coming before the text and the ones coming after the text.
Unwrap the ones after the text in order to push them out of the <strong>.
My code (not so nicely formatted yet):
soup = BeautifulSoup(html)
emphased = soup.find_all('strong')
for single in emphased:
    children = single.children
    before = 0
    foundText = None
    after = 0
    for child in children:
        if not isinstance(child, NavigableString):
            if foundText:
                after += 1
                child.unwrap()
            else:
                before += 1
                # DOES NOT WORK
                child.unwrap()
        else:
            foundText = single.get_text().strip()
What is my current problem?
I want to unwrap the <br/> tags before the content and put them in front of the <strong> element, and I cannot achieve that (and didn't find how to proceed in the docs).
What do I want to achieve more generally?
I want to transform this:
<strong> <br/>Über die XXXX GmbH: </strong>
into
# Note the space
(whitespace)<br/><strong>Über die XXXX GmbH:</strong>(whitespace)
It doesn't have to use Beautiful Soup; I'm just not aware of other solutions.
Thanks in advance!
Per your example, you can extract all the br tags from the strong ones and prepend them, replacing the original tag with the new one.
Here is a snippet:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<strong>Ihre Aufgaben:<br/></strong>", "html.parser")
for strong in soup.find_all("strong"):
    [s.extract() for s in strong.find_all('br')]
    strong.string = strong.get_text(strip=True)
    strong.replaceWith(BeautifulSoup(" %s%s " % ("<br/>", strong), "html.parser"))
print soup
Which outputs:
<br/><strong>Ihre Aufgaben:</strong>
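That snippet always puts the <br/> in front. For the more general transform the question asks for (leading brs move before the <strong>, trailing ones after it), a sketch along these lines might work, assuming brs only appear at the edges as in the examples:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(u"<strong> <br/>Über die XXXX GmbH:<br/></strong>", "html.parser")
for strong in soup.find_all("strong"):
    seen_text = False
    for child in list(strong.children):  # snapshot, since we mutate while iterating
        if isinstance(child, NavigableString) and child.strip():
            seen_text = True
        elif child.name == "br":
            # extract() detaches the br; insert_before/insert_after re-attach
            # it outside the <strong>, on the side where it was found
            if seen_text:
                strong.insert_after(child.extract())
            else:
                strong.insert_before(child.extract())
print soup  # -> <br/><strong> Über die XXXX GmbH:</strong><br/>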

Beautiful Soup / Regular Expressions: Extract a portion of text from NavigableString

I'm really new to learning Python, so this could be really obvious, but I have extracted a NavigableString from BeautifulSoup and I need to find data in the string. However, it's not as easy as some of the examples I've seen online.
My end goal is to create a dictionary that looks something like this:
dict = {'Fandom':'Undertale (Video Game)', 'Works':15341}
Here are two examples of the strings:
<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li>
<a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
I've already succeeded in extracting the fandom from the string, but now I need the works count in parentheses. How would I use Beautiful Soup and/or regular expressions to do this?
I also need to do error handling, because while a fandom will always be displayed, it may not have a works count next to it.
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>
Here are the relevant pieces of code:
for each_f in cate:
    #print(each_f)
    result = each_f.find('a')
    if result != -1:
        # here is where I grab the Fandom vals
        fandom_name = result.contents
        #print(result.contents)
NOTE: I know I'm missing the code to append to the dictionary; I haven't made it that far yet. I'm just trying to get the values to print to the screen.
Use dict.fromkeys(('Fandom', 'Works')) to get:
In [17]: dict.fromkeys(('Fandom', 'Works'))
Out[17]: {'Fandom': None, 'Works': None}
Use zip to combine the keys with the strings in the li tag; zip stops at the shorter of the two sequences:
zip(('Fandom', 'Works'),li.stripped_strings)
[('Fandom', 'Undertale (Video Game)'), ('Works', '(15341)')]
[('Fandom', 'Sherlock Holmes & Related Fandoms'), ('Works', '(101015)')]
[('Fandom', 'Composer - Fandom')]
Then we update the dict with that data:
In [20]: for li in soup.find_all('li'):
    ...:     d = dict.fromkeys(('Fandom', 'Works'))
    ...:     out = zip(('Fandom', 'Works'), li.stripped_strings)
    ...:     d.update(out)
    ...:     print(d)
Output:
{'Works': '(15341)', 'Fandom': 'Undertale (Video Game)'}
{'Works': '(101015)', 'Fandom': 'Sherlock Holmes & Related Fandoms'}
{'Works': None, 'Fandom': 'Composer - Fandom'}
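Since the question also mentions regular expressions: the zip approach leaves the count as '(15341)'. Here is a small sketch of a helper (hypothetical name) that strips the parentheses and handles the missing-count case:
import re

def works_count(raw):
    # raw is the second string from li.stripped_strings, or None if absent
    if raw is None:
        return None
    m = re.match(r'\((\d+)\)$', raw)
    return int(m.group(1)) if m else None

print(works_count('(15341)'))  # -> 15341
print(works_count(None))       # -> None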
You can use stripped_strings and unpack the values to get your blocks of text. You can store the results in a dict so that you can use them later.
Example:
from bs4 import BeautifulSoup

example = """<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li><a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>"""

soup = BeautifulSoup(example, "html.parser")
Fandom = {"Fandom": []}
for li in soup.find_all("li"):
    try:
        fandom, count = li.stripped_strings
        Fandom["Fandom"].append({fandom.strip(): count[1:-1]})
    except:
        fandom = li.text.strip()
        Fandom["Fandom"].append({fandom.strip(): 0})
print(Fandom)
This outputs:
{'Fandom': [{'Undertale (Video Game)': '15341'}, {'Sherlock Holmes & Related Fandoms': '101015'}, {'Composer - Fandom': 0}]}
The try-except will catch any unpacking that doesn't contain exactly two values: the fandom title and the works count.
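A slightly tighter variant (just a sketch of the loop body): the unpacking mismatch raises a ValueError, so catching that instead of using a bare except avoids masking unrelated errors:
try:
    fandom, count = li.stripped_strings
except ValueError:
    # fandom present but no works count next to it
    fandom, count = li.text.strip(), None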

Python webscraping and getting contents of first div tag of its class

I'm working with Python 3.3 and this website:
http://www.nasdaq.com/markets/ipos/
My goal is to read only the companies that are in the Upcoming IPO section. It is in a div tag with class="genTable thin floatL". There are two divs with this class, and the target data is in the first one.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
# I tried adding [0] so it only returns divs in the first "genTable thin floatL" div
for divparent in soup.find_all('div', attrs={'class': 'genTable thin floatL'})[0]:
    for div in soup.find_all('div', attrs={'class': 'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))
I'd like it to return only
3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.
But it prints all div classes matching the re.match specification, and multiple times as well. I tried inserting [0] in the divparent loop to retrieve only the first one, but this caused the repetition instead.
EDIT: Here is the updated code according to warunsl's solution. This works.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
divparent = soup.find_all('div', attrs={'class': 'genTable thin floatL'})[0]
table = divparent.find('table')
for div in table.find_all('div', attrs={'class': 'ipo-cell-height'}):
    s = div.string
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))
You mentioned that there are two elements that fit the 'class':'genTable thin floatL' criterion. So running a for loop over its first element does not make sense.
So replace your outer for loop with
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
Now you need not do a soup.find_all again. Doing so will search the entire document. You need to restrict the search to the divparent. So, you do:
table = divparent.find('table')
The remainder of the code to extract the dates and the company name would be the same, except that they will be with reference to the table variable.
for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)
Hope it helps.
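As a side note, the same table can be reached more compactly (a sketch, assuming a BeautifulSoup version whose select() supports these CSS selectors):
table = soup.select('div.genTable.thin.floatL table')[0]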

Parsing nested span tags using beautifulsoup

I am trying to pull company information from the following website:
http://www.theglobeandmail.com/globe-investor/markets/stocks/summary/?q=T-T
I see from the page source that there are nested span statements like:
<li class="clearfix">
<span class="label">Low</span>
<span class="giw-a-t-sc-data">36.39</span>
</li>
<li class="clearfix">
<span class="label">Bid<span class="giw-a-t-sc-bidSize smallsize">x0</span></span>
<span class="giw-a-t-sc-data">36.88</span>
</li>
The code I wrote will grab (Low, 36.39) without problem. I have spent hours reading this forum and others trying to get bs4 to also break out (Bid, 36.88). The problem is, Bid comes out as "None" because of the nested span tags.
I am an old C programmer (GNU Cygwin), and this Python/BeautifulSoup stuff is new to me. I love it though; awesome potential for interesting and time-saving scripts.
Can anyone help with this question? I hope I have posed it well enough.
Please keep it simple, because I am definitely a newbie.
Thanks in advance.
If you want to get data from a website, I recommend PyQuery (https://pypi.python.org/pypi/pyquery). Just like BeautifulSoup, it uses lxml for fast XML/HTML parsing, and you can access the HTML elements with jQuery-like selectors.
import pyquery

# you can also pass the HTML source that you want to parse
root = pyquery.PyQuery("http://www.theglobeandmail.com/globe-investor/markets/stocks/summary/?q=T-T")
spanlist = root("li.clearfix > span")
for span in spanlist:
    print span.text
Output:
Open
36.45
Previous Close
36.28
High
37.36
Low
36.39
Bid
36.88
(Just the first ten lines of output, but I think you get my point: few lines, great result...)
Almost the same with BeautifulSoup4
>>> import bs4
>>> text = "<li ..."  # HTML source code from the question
>>> root = bs4.BeautifulSoup(text)
>>> [span.text for span in root.select("li.clearfix > span")]
[u'Low', u'36.39', u'Bidx0', u'36.88']
And now structured:
>>> [(span.text, span.findNextSibling('span').text) for span in root.select("li.clearfix > span.label")]
[(u'Low', u'36.39'), (u'Bidx0', u'36.88')]
Print in separate columns:
>>> for span in root.select("li.clearfix > span.label"):
...     print "%s\t%s" % (span.text, span.findNextSibling('span').text)
Low 36.39
Bidx0 36.88
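To look values up by label later, the same pairs can feed a dict (a small sketch):
>>> data = dict((span.text, span.findNextSibling('span').text)
...             for span in root.select("li.clearfix > span.label"))
>>> data[u'Low']
u'36.39'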
This works way, way better than what I had, but there are still some issues. I am posting the full script so you can see what I am up to. I will spend some time and effort investigating the issues, but this will help me learn Python and BeautifulSoup better anyway.
"""
This program imports a list of stock ticker symbols from "ca_stocks.txt".
It then goes to the Globe website and gets current company stock data,
and writes this data to a CSV file in the form:
index, ticker, date&time, dimension, measure
"""
import urllib2
import csv, os
import datetime
import bs4

os.chdir('D:\\02 - \\003 INVESTMENTS\\Yahoo Finance Data')
symbolfile = open('ca_stocks2.txt')
symbolslist = symbolfile.read().split('\n')

def pairs(l, n):
    # l = list
    # n = number
    return zip(*[l[i::n] for i in range(n)])

def main():
    i = 0
    while i < len(symbolslist):
        print symbolslist[i]
        url = urllib2.urlopen("http://www.theglobeandmail.com/globe-investor/markets/stocks/summary/?q=" + symbolslist[i])
        root = bs4.BeautifulSoup(url)
        dims = [[]] * 40
        mess = [[]] * 40
        j = 0
        for span in root.select("li.clearfix > span.label"):
            #print "%s\t%s" % (span.text, span.findNextSibling('span').text)
            dims[j] = span.text
            mess[j] = span.findNextSibling('span').text
            j += 1
        nowtime = datetime.datetime.now().isoformat()
        # the with statement closes the file automatically
        with open('globecdndata.csv', 'ab') as f:
            fw = csv.writer(f, dialect='excel')
            for s in range(0, 37):
                csvRow = s, symbolslist[i], nowtime, dims[s], mess[s]
                print csvRow
                fw.writerow(csvRow)
        i += 1

if __name__ == "__main__":
    main()
I know this is ugly code, but hey, I am learning. The output to CSV looks like this now:
(4, 'T-T', '2013-11-09T19:32:32.416000', u'Bidx0', u'36.88')
(5, 'T-T', '2013-11-09T19:32:32.416000', u'Askx0', u'36.93')
(6, 'T-T', '2013-11-09T19:32:32.416000', u'52-week High05/22', u'37.94')
The date "05/22" changes every time the price breaks out to a new high or low, which is not ideal for the name of a dimension (field).
(7, 'T-T', '2013-11-09T19:32:32.416000', u'52-week Low06/27', u'29.52')
(35, 'T-T', '2013-11-09T19:32:32.416000', u'Top 1000 Ranking:', u'Profit: 28Revenue: 34Assets: 36')
For some reason, it has lumped these dimensions (fields) and measures (data) all together. Hmm...
This is a list of some of the problems, but like I said, I should be able to figure this out now. I've learned a lot, thanks. Having someone who knows what they are doing provide some input is awesome.
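One way to normalize labels like u'52-week High05/22' (a hedged sketch with a hypothetical helper, assuming the date is always appended as MM/DD) is to split the trailing date off the dimension name:
import re

def split_label(label):
    # returns (name, date), where date is None when no MM/DD suffix exists
    m = re.match(r'(.*?)(\d{2}/\d{2})?$', label)
    return m.group(1).strip(), m.group(2)

print split_label(u'52-week High05/22')  # -> (u'52-week High', u'05/22')
print split_label(u'Bidx0')              # -> (u'Bidx0', None)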

Python/lxml/Xpath: How do I find the row containing certain text?

Given the URL http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView, how would you capture and print the contents of an entire row of data?
For example, what would it take to get an output that looked something like:
"Cash & Short Term Investments 144,841 169,760 189,252 86,743 57,379"? Or something like "Property, Plant & Equipment - Gross 725,104 632,332 571,467 538,805 465,493"?
I've been introduced to the basics of XPath through sites like http://www.techchorus.net/web-scraping-lxml. However, XPath syntax is still largely a mystery to me.
I have already done this successfully in BeautifulSoup. I like the fact that BeautifulSoup doesn't require me to know the structure of the file; it just looks for the element containing the text I search for. Unfortunately, BeautifulSoup is too slow for a script that has to do this THOUSANDS of times. The source code for my task in BeautifulSoup is (with title_input equal to "Cash & Short Term Investments"):
page = urllib2.urlopen(url_local)
soup = BeautifulSoup(page)
soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
list_output = soup_line_item.findAll('td')  # list of <td> elements
So what would the equivalent code in lxml be?
EDIT 1: The URLs were concealed the first time I posted. I have now fixed that.
EDIT 2: I have added my BeautifulSoup-based solution to clarify what I'm trying to do.
EDIT 3: +10 to root for your solution. For the benefit of future developers with the same question, I'm posting here a quick-and-dirty script that worked for me:
#!/usr/bin/env python
import urllib
import lxml.html
url = 'balancesheet.html'
result = urllib.urlopen(url)
html = result.read()
doc = lxml.html.document_fromstring(html)
x = doc.xpath(u'.//th[div[text()="Cash & Short Term Investments"]]/following-sibling::td/text()')
print x
In [18]: doc.xpath(u'.//th[div[text()="Cash & Short Term Investments"]]/following-sibling::td/text()')
Out[18]: [' 144,841', ' 169,760', ' 189,252', ' 86,743', ' 57,379']
or you can define a little function to get the rows by text:
In [19]: def func(doc, txt):
    ...:     exp = u'.//th[div[text()="{0}"]]' \
    ...:           u'/following-sibling::td/text()'.format(txt)
    ...:     return [i.strip() for i in doc.xpath(exp)]
In [20]: func(doc,u'Total Accounts Receivable')
Out[20]: ['338,594', '270,133', '214,169', '244,940', '236,331']
or you can get all the rows to a dict:
In [21]: d = {}

In [22]: for i in doc.xpath(u'.//tbody/tr'):
    ...:     if len(i.xpath(u'.//th/div/text()')):
    ...:         d[i.xpath(u'.//th/div/text()')[0]] = \
    ...:             [e.strip() for e in i.xpath(u'.//td/text()')]

In [23]: d.items()[:3]
Out[23]:
[('Accounts Receivables, Gross',
  ['344,241', '274,894', '218,255', '247,600', '238,596']),
 ('Short-Term Investments',
  ['27,165', '26,067', '24,400', '851', '159']),
 ('Cash & Short Term Investments',
  ['144,841', '169,760', '189,252', '86,743', '57,379'])]
Let html hold the HTML source code:
import lxml.html

doc = lxml.html.document_fromstring(html)
rows_element = doc.xpath('/html/body/div/div[2]/div/div[5]/div/div/table/tbody/tr')
for row in rows_element:
    print row.text_content()
Not tested, but it should work.
P.S. Install XPath Checker or FireFinder in Firefox to help you with XPath.
