Python/lxml/XPath: How do I find the row containing certain text?

Given the URL http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView, how would you capture and print the contents of an entire row of data?
For example, what would it take to get an output that looked something like:
"Cash & Short Term Investments 144,841 169,760 189,252 86,743 57,379"? Or something like "Property, Plant & Equipment - Gross 725,104 632,332 571,467 538,805 465,493"?
I've been introduced to the basics of XPath through sites like http://www.techchorus.net/web-scraping-lxml. However, XPath syntax is still largely a mystery to me.
I have already done this successfully in BeautifulSoup. I like the fact that BeautifulSoup doesn't require me to know the structure of the file; it just looks for the element containing the text I search for. Unfortunately, BeautifulSoup is too slow for a script that has to do this THOUSANDS of times. The source code for my task in BeautifulSoup is (with title_input equal to "Cash & Short Term Investments"):
page = urllib2.urlopen(url_local)
soup = BeautifulSoup(page)
soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
list_output = soup_line_item.findAll('td')  # List of elements
So what would the equivalent code in lxml be?
EDIT 1: The URLs were concealed the first time I posted. I have now fixed that.
EDIT 2: I have added my BeautifulSoup-based solution to clarify what I'm trying to do.
EDIT 3: +10 to root for your solution. For the benefit of future developers with the same question, I'm posting here a quick-and-dirty script that worked for me:
#!/usr/bin/env python
import urllib
import lxml.html
url = 'balancesheet.html'
result = urllib.urlopen(url)
html = result.read()
doc = lxml.html.document_fromstring(html)
x = doc.xpath(u'.//th[div[text()="Cash & Short Term Investments"]]/following-sibling::td/text()')
print x

In [18]: doc.xpath(u'.//th[div[text()="Cash & Short Term Investments"]]/following-sibling::td/text()')
Out[18]: [' 144,841', ' 169,760', ' 189,252', ' 86,743', ' 57,379']
or you can define a little function to get the rows by text:
In [19]: def func(doc, txt):
    ...:     exp = u'.//th[div[text()="{0}"]]' \
    ...:           u'/following-sibling::td/text()'.format(txt)
    ...:     return [i.strip() for i in doc.xpath(exp)]
In [20]: func(doc,u'Total Accounts Receivable')
Out[20]: ['338,594', '270,133', '214,169', '244,940', '236,331']
or you can get all the rows to a dict:
In [21]: d={}
In [22]: for i in doc.xpath(u'.//tbody/tr'):
    ...:     if len(i.xpath(u'.//th/div/text()')):
    ...:         d[i.xpath(u'.//th/div/text()')[0]] = \
    ...:             [e.strip() for e in i.xpath(u'.//td/text()')]
In [23]: d.items()[:3]
Out[23]:
[('Accounts Receivables, Gross',
  ['344,241', '274,894', '218,255', '247,600', '238,596']),
 ('Short-Term Investments',
  ['27,165', '26,067', '24,400', '851', '159']),
 ('Cash & Short Term Investments',
  ['144,841', '169,760', '189,252', '86,743', '57,379'])]

Assuming html holds the HTML source code:
import lxml.html
doc = lxml.html.document_fromstring(html)
rows_element = doc.xpath('/html/body/div/div[2]/div/div[5]/div/div/table/tbody/tr')
for row in rows_element:
    print row.text_content()
Not tested, but it should work.
P.S. Install XPath Checker or FireFinder in Firefox to help you work out XPath expressions.

Related

code for counting word frequency in website using Python doesn't output the right frequency

I'd like to count the frequency of a list of words on a specific website. The code, however, doesn't return the same number of matches that a manual Ctrl+F search does.
What am I doing wrong?
Here's my code:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import re
url='https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr=[]
wanted = ['tender','2020','date']
for word in wanted:
    a = requests.get(url).text.count(word)
    dic = {'phrase': word,
           'frequency': a,
           }
    fr.append(dic)
    print('Frequency of', word, 'is:', a)
data = pd.DataFrame(fr)
Refer to the comments in your question to see why using requests might be a bad idea to count the frequency of a word in the "visible spectrum" of a webpage (what you actually see in the browser).
If you want to go about this with selenium, you could try:
from selenium import webdriver
url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
driver = webdriver.Chrome(chromedriver_location)
driver.get(url)
body = driver.find_element_by_tag_name('body')
fr = []
wanted = ['tender', '2020', 'date']
for word in wanted:
    freq = body.text.lower().count(word)  # .lower() to account for count's case-sensitive behaviour
    dic = {'phrase': word, 'frequency': freq}
    fr.append(dic)
    print('Frequency of', word, 'is:', freq)
which gave me the same results that a CTRL + F does.
You can test BeautifulSoup too (which you're importing by the way) by modifying your code a little bit:
import requests
from bs4 import BeautifulSoup
url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr = []
wanted = ['tender','2020','date']
a = requests.get(url).text
soup = BeautifulSoup(a, 'html.parser')
for word in wanted:
    freq = soup.get_text().lower().count(word)
    dic = {'phrase': word, 'frequency': freq}
    fr.append(dic)
    print('Frequency of', word, 'is:', freq)
That gave me the same results, except for the word tender, which according to BeautifulSoup appears 12 times, and not 11. Test them out for yourself and see what suits you.
When I tried your code on the word "Tender", a=requests.get(url).text.count(word) returned many more results than Ctrl+F did, which was weird because I was expecting it to return fewer (text.count is case-sensitive, HTML sometimes breaks elements across multiple lines, and all that).
But by printing the variable a and reading through it, you'll notice there are elements that aren't displayed on the page, and also plenty of occurrences of "Tender" inside tags.
I'd advise you to use BeautifulSoup, or find some way to avoid going through the invisible text.
And by the way, a small thing: assign requests.get(url).text to a variable outside the loop so you don't send a request on every iteration.
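For instance, a minimal sketch of that refactor (same URL and word list as in the question; requests.get is called once, outside the loop):
import requests

url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
wanted = ['tender', '2020', 'date']

page_text = requests.get(url).text.lower()  # one request, reused for every word
fr = [{'phrase': word, 'frequency': page_text.count(word)} for word in wanted]
print(fr)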

Display XML tree structure with BeautifulSoup

When working with a new XML structure, it is always helpful to see the big picture first.
When loading it with BeautifulSoup:
import requests, bs4
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = bs4.BeautifulSoup(s, 'xml')
print(x)
is there a built-in way to display its tree structure with different depths?
Example for https://www.w3schools.com/xml/cd_catalog.xml, with maxdepth=0, it would be:
CATALOG
with maxdepth=1, it would be:
CATALOG
  CD
  CD
  CD
  ...
and with maxdepth=2, it would be:
CATALOG
  CD
    TITLE
    ARTIST
    COUNTRY
    COMPANY
    PRICE
    YEAR
  CD
    TITLE
    ARTIST
    COUNTRY
    COMPANY
    PRICE
    YEAR
  ...
Here's a quick way to do it: use the prettify() function to structure the document, then get the indentation and opening tag names via regex (it catches uppercase words inside opening tags in this case). If the indentation from prettify() meets the depth specification, print the tag with the specified indentation size.
import requests, bs4
import re
maxdepth = 1
indent_size = 2
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = bs4.BeautifulSoup(s, 'xml').prettify()
for line in x.split("\n"):
    match = re.match(r"(\s*)<([A-Z]+)>", line)
    if match and len(match.group(1)) <= maxdepth:
        print(indent_size * match.group(1) + match.group(2))
I have used xmltodict 0.12.0 (installed via anaconda), which did the job for XML parsing, though not for depth-wise viewing. The result works much like any other dictionary. From here, a recursion with depth counting should be the way to go (see the sketch after the code below).
import requests, xmltodict, json
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = xmltodict.parse(s, process_namespaces=True)
for key in x:
    print(json.dumps(x[key], indent=4, default=str))
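Building on that "recursion with depth counting" idea, here is a minimal sketch over the xmltodict result; the print_tree helper below is my own illustration, not part of xmltodict:
import requests
import xmltodict

def print_tree(node, name, depth, maxdepth, indent=2):
    """Recursively print element names down to maxdepth."""
    print(' ' * indent * depth + name)
    if depth >= maxdepth or not isinstance(node, dict):
        return
    for key, value in node.items():
        # a list value means the tag repeats (e.g. many <CD> elements)
        children = value if isinstance(value, list) else [value]
        for child in children:
            print_tree(child, key, depth + 1, maxdepth, indent)

s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = xmltodict.parse(s, process_namespaces=True)
root_name = next(iter(x))                 # 'CATALOG'
print_tree(x[root_name], root_name, 0, maxdepth=2)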
Here is one solution without BeautifulSoup.
import requests
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
array = []
tab_size = 2
target_depth = 2
for element in s.split('\n'):
    depth = (len(element) - len(element.lstrip())) / tab_size
    if depth <= target_depth:
        print(' ' * int(depth) + element)

Can't isolate desired results out of crude ones

I've created a script in Python to get the names of neighborhoods from a webpage. I've used the requests library along with the re module to parse the content out of a script tag on that site. When I run the script, I get the neighborhood names just fine. However, the problem is that I've used the line if not item.startswith("NY:"): continue to get rid of unwanted results from that page. I don't want to rely on this hardcoded NY: portion to do the trick.
website link
I've tried with:
import re
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
data = json.loads(re.findall(r'data-hypernova-key[^{]+(.*)--></script>',resp.text)[0])
items = data['searchPageProps']['filterPanelProps']['filterInfoMap']
for item in items:
    if not item.startswith("NY:"): continue
    print(item)
Result I'm getting (desired result):
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
If I do not use this line if not item.startswith("NY:"):continue, the results are something like:
rating
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
NY:New_York:Staten_Island:Lighthouse_Hill
NY:New_York:Queens:Rochdale
NY:New_York:Queens:Pomonok
BusinessParking.validated
food_court
NY:New_York:Queens:Little_Neck
The bottom line is that I want everything that starts with NY:New_York:. What I mean by unwanted results are rating, BusinessParking.validated, food_court and so on.
How can I get the neighborhoods without hardcoding any part of the search within the script?
I'm not certain what your complete data set looks like, but based on your sample,
you might use something like:
if ':' not in item:
    continue

# or perhaps:
if item.count(':') < 3:
    continue

# I'd prefer a list comprehension if I didn't need the other data
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap'] if ':' in x]
If that doesn't work for what you're trying to achieve, then you could just use a variable for the state, as in the sketch below.
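A minimal sketch of that variable-for-the-state idea (state_prefix is a hypothetical name; you could set it once, or derive it from the find_loc query parameter):
state_prefix = 'NY:'  # hypothetical: configured once rather than buried in the loop
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap']
         if x.startswith(state_prefix)]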
Another solution, using BeautifulSoup, which doesn't involve regex or hardcoding "NY:New_York", is below; it's convoluted, but mainly because Yelp buried its treasure several layers deep...
So for future reference:
from bs4 import BeautifulSoup as bs
import json
import requests

link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
soup = bs(resp.text, 'html.parser')  # this line was missing from the original snippet
target = soup.find_all('script')[14]
content = target.text.replace('<!--', '').replace('-->', '')
js_data = json.loads(content)
And now the fun of extracting NYC info from the json begins....
for a in js_data:
    if a == 'searchPageProps':
        level1 = js_data[a]
        for b in level1:
            if b == 'filterPanelProps':
                level2 = level1[b]
                for c in level2:
                    if c == 'filterSets':
                        level3 = level2[c][1]
                        for d in level3:
                            if d == 'moreFilters':
                                level4 = level3[d]
                                for e in range(len(level4)):
                                    print(level4[e]['title'])
                                    print(level4[e]['sectionFilters'])
                                    print('---------------')
The output is the name of each borough plus a list of all neighborhoods in that borough. For example:
Manhattan
['NY:New_York:Manhattan:Alphabet_City',
'NY:New_York:Manhattan:Battery_Park',
'NY:New_York:Manhattan:Central_Park', 'NY:New_York:Manhattan:Chelsea',
'...]
etc.

Parsing nested span tags using BeautifulSoup

I am trying to pull company information from the following website:
http://www.theglobeandmail.com/globe-investor/markets/stocks/summary/?q=T-T
I see from their page source that there are nested span elements like:
<li class="clearfix">
<span class="label">Low</span>
<span class="giw-a-t-sc-data">36.39</span>
</li>
<li class="clearfix">
<span class="label">Bid<span class="giw-a-t-sc-bidSize smallsize">x0</span></span>
<span class="giw-a-t-sc-data">36.88</span>
</li>
The code I wrote will grab (Low, 36.39) without problem. I have spent hours reading this forum and others trying to get bs4 to also break out (Bid, 36.88). The problem is, Bid comes out as "None" because of the nested span tags.
I am an old C programmer (GNU Cygwin), and this Python/BeautifulSoup stuff is new to me. I love it, though; there's awesome potential for interesting and time-saving scripts.
Can anyone help with this question? I hope I have posed it well enough.
Please keep it simple because I am definitely a newbie.
Thanks in advance.
If you want to get data from a website, I recommend PyQuery (https://pypi.python.org/pypi/pyquery). Just like BeautifulSoup, it uses lxml for fast XML/HTML parsing, and you can access HTML elements with jQuery-like selectors.
import pyquery
root = pyquery.PyQuery("http://www.theglobeandmail.com/globe-investor/markets/stocks/summary/?q=T-T") # you can also pass the HTML-source, that you want to parse
spanlist = root("li.clearfix > span")
for span in spanlist: print span.text
Output:
Open
36.45
Previous Close
36.28
High
37.36
Low
36.39
Bid
36.88
(Just the first ten lines of output, but I think you get my point: few lines, great result...)
Almost the same with BeautifulSoup4
>>> import bs4
>>> text = "<li ..." # HTML-source-code from Question
>>> root = bs4.BeautifulSoup(text)
>>> [ span.text for span in root("li.clearfix > span") ]
[u'Low', u'36.39', u'Bidx0', u'36.88']
And now structured:
>>> [ ( span.text, span.findNextSibling('span').text) for span in root.select("li.clearfix > span.label") ]
[(u'Low', u'36.39'), (u'Bidx0', u'36.88')]
Print in separate columns:
>>> for span in root.select("li.clearfix > span.label"):
...     print "%s\t%s" % (span.text, span.findNextSibling('span').text)
Low 36.39
Bidx0 36.88
So it is working way, way better than what I had, but there are still some issues. I am posting the full script so you can see what I am up to. I will spend some time and effort investigating the issues, but this will help me learn Python and BeautifulSoup better anyway.
"""
This program imports a list of stock ticker symbols from "ca_stocks.txt"
It then goes to the Globe website and gets current company stock data
It then writes this data to a CSV file in the form
index, ticker, date&time, dimension, measure
"""
import urllib2
import csv, os
import datetime
import re  # regular expressions library
import bs4
#from bs4 import BeautifulStoneSoup as bss
#from time import gmtime, strftime
#from lxml import etree
import pyquery
#import dataextract as tde

os.chdir('D:\\02 - \\003 INVESTMENTS\\Yahoo Finance Data')
symbolfile = open('ca_stocks2.txt')
symbolslist = symbolfile.read().split('\n')

def pairs(l, n):
    # l = list
    # n = number
    return zip(*[l[i::n] for i in range(n)])

def main():
    i = 0
    while i < len(symbolslist):
        print symbolslist[i]
        url = urllib2.urlopen("http://www.theglobeandmail.com/globe-investor/markets/stocks/summary/?q=" + symbolslist[i])
        root = bs4.BeautifulSoup(url)
        [span.text for span in root("li.clearfix > span")]
        [(span.text, span.findNextSibling('span').text) for span in root.select("li.clearfix > span.label")]
        dims = [[]] * 40
        mess = [[]] * 40
        j = 0
        for span in root.select("li.clearfix > span.label"):
            #print "%s\t%s" % (span.text, span.findNextSibling('span').text)
            dims[j] = span.text
            mess[j] = span.findNextSibling('span').text
            j += 1
        nowtime = datetime.datetime.now().isoformat()
        with open('globecdndata.csv', 'ab') as f:
            fw = csv.writer(f, dialect='excel')
            for s in range(0, 37):
                csvRow = s, symbolslist[i], nowtime, dims[s], mess[s]
                print csvRow
                fw.writerow(csvRow)
        f.close()
        i += 1

if __name__ == "__main__":
    main()
I know this is ugly code, but hey, I am learning. The output to CSV looks like this now:
(4, 'T-T', '2013-11-09T19:32:32.416000', u'Bidx0', u'36.88')
(5, 'T-T', '2013-11-09T19:32:32.416000', u'Askx0', u'36.93')
(6, 'T-T', '2013-11-09T19:32:32.416000', u'52-week High05/22', u'37.94')
The date, "05/22" would change every time the price breaks out to a new high or low. This is not ideal for the name of a dimension (field).
(7, 'T-T', '2013-11-09T19:32:32.416000', u'52-week Low06/27', u'29.52')
(35, 'T-T', '2013-11-09T19:32:32.416000', u'Top 1000 Ranking:', u'Profit: 28Revenue: 34Assets: 36')
For some reason, it has lumped these dimensions (fields) and measures (data) all together. Hmm...
Those are some of the problems. But, like I said, I should be able to figure this out now. I've learned a lot, thanks. Having someone who knows what they're doing provide some input is awesome.
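For the fused "52-week High05/22" labels, a possible fix (a sketch, assuming the date suffix always has the MM/DD form) is to split the label text with a regex before writing the CSV row; split_label is a hypothetical helper:
import re

def split_label(label):
    """Split u'52-week High05/22' into (u'52-week High', u'05/22').
    Labels without a trailing MM/DD date come back unchanged, with None for the date."""
    m = re.match(r'^(.*?)(\d{2}/\d{2})$', label)
    if m:
        return m.group(1).strip(), m.group(2)
    return label, None

print split_label(u'52-week High05/22')  # (u'52-week High', u'05/22')
print split_label(u'Bidx0')              # (u'Bidx0', None)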

Python, parsing HTML

Thanks to the kind users of this site, I have some idea of how to use re as an alternative to a non-standard Python module so that my script will work with minimal overhead. Today, I've been experimenting with parsing modules. I've come across BeautifulSoup; this is all great, but I don't understand it.
For educational purposes, I'd like to extract the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web crawler, I'm not trying to crawl the whole site, just extract the information from this page to learn how parsing modules work!):
Movie Title
Quality
Torrent Link
There are 22 of these items, and I want them stored in lists in order, i.e. item_1, item_2, with each list containing these three pieces of information. For instance:
item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]
And then, to keep things simple, I just want to print every item to the console. To make things more difficult, however, these items don't have identifiers on the page, so the info needs to be kept strictly in order. This is all good, but all I'm getting is either the entire source contained in each list item, or empty items! An example item block is as follows:
<div class="browse-info">
<span class="info">
<h3>James Bond: Casino Royale (2006)</h3>
<p><b>Size:</b> 1018.26 MB</p>
<p><b>Quality:</b> 720p</p>
<p><b>Genre:</b> Action | Crime</p>
<p><b>IMDB Rating:</b> 7.9/10</p>
<span>
<p class="peers"><b>Peers:</b> 698</p>
<p class="peers"><b>Seeds:</b> 356</p>
</span>
</span>
<span class="links">
View Info<span></span>
Download<span></span>
</span>
</div>
Any ideas? Would someone please do me the honour of giving me an example of how to do this? I'm not sure BeautifulSoup accommodates all of my requirements! P.S. Sorry for the poor English, it's not my first language.
from bs4 import BeautifulSoup
import urllib2
f=urllib2.urlopen('http://yify-torrents.com/browse-movie')
html=f.read()
soup=BeautifulSoup(html)
In [25]: for i in soup.findAll("div", {"class": "browse-info"}):
    ...:     name = i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text == "Quality:":
    ...:             quality = x.parent.text
    ...:     link = i.find('a', {"class": "std-btn-small mleft torrentDwl"})['href']
    ...:     print [name, quality, link]
    ...:
[u'James Bond: Casino Royale (2006)', u'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
[u'Pitch Perfect (2012)', u'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...
or to get exactly the output you wanted:
In [26]: for i in soup.findAll("div", {"class": "browse-info"}):
    ...:     name = i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text == "Quality:":
    ...:             quality = x.parent.find(text=True, recursive=False).strip()
    ...:     link = i.find('a', {"class": "std-btn-small mleft torrentDwl"})['href']
    ...:     print [name, quality, link]
As you requested, I'm pasting a simple example of a parser. As you can see, it uses lxml. With lxml you have two ways to work with the DOM tree: one is XPath and the other is CSS selectors.
I preferred XPath (a CSS-selector sketch follows the code below).
import lxml.html
import decimal
import urllib

def parse():
    url = 'https://sometotosite.com'
    doc = lxml.html.fromstring(urllib.urlopen(url).read())
    main_div = doc.xpath("//div[@id='line']")[0]

    main = {}
    tr = []
    for el in main_div.getchildren():
        if el.xpath("descendant::a[contains(@name,'tn')]/text()"):
            category = el.xpath("descendant::a[contains(@name,'tn')]/text()")[0]
            main[category] = ''
            tr = []
        else:
            for element in el.getchildren():
                if '&#8212' in lxml.html.tostring(element):
                    tr.append(element)
            print category, tr

parse()
LXML official site
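For comparison, a minimal sketch of the CSS-selector route mentioned above, using lxml's cssselect() (this needs the separate cssselect package; the div.browse-info markup is taken from the question, not from the answerer's site):
import urllib
import lxml.html

html = urllib.urlopen('http://yify-torrents.com/browse-movie').read()
doc = lxml.html.fromstring(html)

# CSS selectors instead of XPath: grab the <h3> title inside each movie block
for info in doc.cssselect('div.browse-info'):
    titles = info.cssselect('span.info h3')
    if titles:
        print titles[0].text_content()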
