Retrieving data using Beautiful Soup - python

So I've been trying to retrieve some data using BeautifulSoup but I've hit a brick wall.
<tr data-name="A Color Similar to Slate">
<th class="unique"><span style='color: #7D6D00'>A Color Similar to Slate</span></th>
<td class=unique>0/10</td>
<td class="unique" data-conversion="14 ref">35,000</td>
<td class="unique" data-conversion="13.02 ref">32,550</td>
<td class="unique" data-conversion="13.51 ref">33,775</td>
<td class="unique" style="text-align: center;"><a class="item-link-backpack" href="http://backpack.tf/stats/Unique/A+Color+Similar+to+Slate/tradable/craftable"><img src="/img/bptf-icon.png" alt="View on Backpack.tf"/></a></td>
</tr>
What I'd like my script to do is to take an input (in this case a "A Color Similar to Slate" string) and have it return the data below(0/10, 14 ref etc) so that I can compare it to a different set of data. How can I make it work?

similar_color = soup.find('tr', {'data-name': 'A Color Similar to Slate'})
for value in similar_color.find_all('td'):
    print(value.text)
This should result in:
0/10
35,000
and so on. However, it seems you want the text value for some cells and the data-conversion value for others. To get the latter, replace the print(value.text) line with:
print(value.attrs.get('data-conversion'))
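To cover the "take an input and return the data" part of the question, the lookup can be wrapped in a small function. This is only a sketch: the name row_values and the shortened HTML sample are made up for illustration, and "html.parser" is chosen just so the example has no extra dependencies.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the row from the question
HTML = """<table><tr data-name="A Color Similar to Slate">
<th class="unique"><span>A Color Similar to Slate</span></th>
<td class="unique">0/10</td>
<td class="unique" data-conversion="14 ref">35,000</td>
<td class="unique" data-conversion="13.02 ref">32,550</td>
</tr></table>"""


def row_values(soup, item_name):
    """Return (text, data-conversion) pairs for every td in the named row."""
    row = soup.find("tr", attrs={"data-name": item_name})
    if row is None:
        # No row carries this data-name, so there is nothing to report
        return []
    return [(td.get_text(strip=True), td.get("data-conversion"))
            for td in row.find_all("td")]


soup = BeautifulSoup(HTML, "html.parser")
values = row_values(soup, "A Color Similar to Slate")
print(values)
```

Cells without a data-conversion attribute come back with None in the second slot, so the caller can decide which of the two values to compare.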

In case you want to use it on other HTML files with a similar layout:
from bs4 import BeautifulSoup
html= """<tr data-name="A Color Similar to Slate">
<th class="unique"><span style='color: #7D6D00'>A Color Similar to Slate</span></th>
<td class=unique>0/10</td>
<td class="unique" data-conversion="14 ref">35,000</td>
<td class="unique" data-conversion="13.02 ref">32,550</td>
<td class="unique" data-conversion="13.51 ref">33,775</td>
<td class="unique" style="text-align: center;"><a class="item-link-backpack" href="http://backpack.tf/stats/Unique/A+Color+Similar+to+Slate/tradable/craftable"><img src="/img/bptf-icon.png" alt="View on Backpack.tf"/></a></td>
</tr>"""
soup = BeautifulSoup(html)
texts = [i.get_text() for i in soup.find_all() if i.get_text()]
print(texts[texts.index('A Color Similar to Slate'):])
This checks all the tags, not just td. The output is ['A Color Similar to Slate', 'A Color Similar to Slate', 'A Color Similar to Slate', '0/10', '35,000', '32,550', '33,775']

Related

Python: Accessing a new <tr> while inside a different <tr> with BeautifulSoup4

I am trying to gather some data by web scraping a local HTML file using BeautifulSoup4. The problem is that the information I'm trying to get is on different rows that have the same class tags, and I'm not sure how to access them. The following HTML screenshot contains the two rows I'm accessing, with the data I need highlighted (sensitive info is scribbled out).
The code I have currently is:
# Assumed import; the question does not show how bs was defined
from bs4 import BeautifulSoup as bs

def find_data(fileName):
    with open(fileName) as html_file:
        soup = bs(html_file, "lxml")
    hline1 = soup.find("td", class_="headerTableEntry")
    hline2 = hline1.find_next_sibling("td")
    hline3 = hline2.find_next_sibling("td")
    hline4 = hline3.find_next_sibling("td", class_="headerTableEntry")
    line1 = hline1.text
    line2 = hline2.text
    line3 = hline3.text
    # Nothing yet for lines 4, 5, 6
The first 3 lines work great and give 13, 39, and 33.3% as they should. But for line 4 (which should be in the second <tr> tag, the first <td> tag with class=headerTableEntry) I get the error "'NoneType' object is not callable".
My question is, is there a different way to go at this so I can access all 6 data cells or is there a way to edit how I wrote line 4 to work? Thank you for your help, it is very much appreciated!
The second <tr> tag is not inside the first one: as you can see, the first <tr> tag is closed with </tr>. The next <td> you want is therefore not a sibling of the previous one, which is why find_next_sibling returns None. It sits within the next <tr> tag.
Pandas is a great package for parsing HTML <table> tags (which this is). Its read_html function can use lxml or BeautifulSoup under the hood. Just get the full table and slice it for the columns you want:
html_file = '''<table>
<tr>
<td class="headerName">File:</td>
<td class="HeaderValue">Some Value</td>
<td></td>
<td class="headerName">Lines:</td>
<td class="headerTableEntry">13</td>
<td class="headerTableEntry">39</td>
<td class="headerTableEntry" style="back-ground-color:LightPink">33.3 %</td>
</tr>
<tr>
<td class="headerName">Date:</td>
<td class="HeaderValue">2020-06-18 11:15:19</td>
<td></td>
<td class="headerName">Branches:</td>
<td class="headerTableEntry">10</td>
<td class="headerTableEntry">12</td>
<td class="headerTableEntry" style="back-ground-color:#FFFF55">83.3 %</td>
</tr>
</table>'''
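from io import StringIO
import pandas as pd
# Newer pandas versions expect a file-like object rather than a literal HTML string
df = pd.read_html(StringIO(html_file))[0]
df = df.iloc[:, 3:]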
So for your code:
def find_data(fileName):
    with open(fileName) as html_file:
        df = pd.read_html(html_file)[0].iloc[:, 3:]
    print(df)
Output:
print (df)
           3   4   5       6
0     Lines:  13  39  33.3 %
1  Branches:  10  12  83.3 %
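If you would rather stay with BeautifulSoup, a possible alternative is to search the whole document instead of walking siblings, since find_all is not limited to a single <tr>. The sketch below uses a trimmed copy of the table above; the names SAMPLE, cells and per_row are illustrative only, and "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two rows from the answer above
SAMPLE = """<table>
<tr>
<td class="headerName">Lines:</td>
<td class="headerTableEntry">13</td>
<td class="headerTableEntry">39</td>
<td class="headerTableEntry">33.3 %</td>
</tr>
<tr>
<td class="headerName">Branches:</td>
<td class="headerTableEntry">10</td>
<td class="headerTableEntry">12</td>
<td class="headerTableEntry">83.3 %</td>
</tr>
</table>"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# All six cells in document order, regardless of which row they sit in
cells = [td.get_text(strip=True)
         for td in soup.find_all("td", class_="headerTableEntry")]

# The same cells grouped by the row that contains them
per_row = [[td.get_text(strip=True)
            for td in tr.find_all("td", class_="headerTableEntry")]
           for tr in soup.find_all("tr")]

print(cells)
print(per_row)
```

Grouping by row keeps the "Lines" and "Branches" values apart, which may be easier to work with than one flat list.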

Joining several regular expressions into one in Python

I have a page.htm file:
</td></tr>
<tr>
<td height="120" class="box_pic">
<img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105">
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></span></td>
</td></tr>
<tr>
<td height="120" class="box_pic">
<img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105">
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></span></td>
</td></tr>
<tr>
<td height="120" class="box_pic">
<img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXfdgfdgZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105">
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=87159" target="_blank">ACat56</a></span></td>
It has 3 sets of data which I need:
1) 1322679 79159 ABird
2) 1546679 78759 ADog
3) 5622679 87159 ACat56
I have 3 regular expressions which can extract these elements from the page:
import re
with open('page.htm', 'r') as our_file:
    page = our_file.read()
result = re.findall(r'view\.php\?item=(\d+)', page)
result2 = re.findall(r'user&id=(\d+)', page)
result3 = re.findall(r'user&id=.*>(\w+)', page)
print(result, len(result))
print(result2, len(result2))
print(result3, len(result3))
The result I get:
['1322679', '1546679', '5622679'] 3
['79159', '78759', '87159'] 3
['ABird', 'ADog', 'ACat56'] 3
Do you know a way to join these 3 expressions into ONE, so that:
1) the file is analyzed 1 time instead of 3 times
2) only ONE re.findall() is used
3) the data is joined in the way I need:
a) 1322679 79159 ABird
b) 1546679 78759 ADog
c) 5622679 87159 ACat56
The resulting expression should be something like this:
result = re.findall(r'view\.php\?item=(\d+) SOMETHING_HERE user&id=(\d+) SOMETHING_HERE .*>(\w+)', page)
Here is how to do it properly with an HTML parser in Python 2:
from urlparse import parse_qs, urlparse
from bs4 import BeautifulSoup

def only(x):
    # Return the single element of x, failing loudly if there is not exactly one
    x = list(x)
    assert len(x) == 1
    return x[0]

def url_params(a):
    return parse_qs(urlparse(a['href']).query)

def main():
    with open('page.htm') as f:
        soup = BeautifulSoup(f, 'html.parser')
    rows = soup.find_all('tr', recursive=False)
    # Data is in alternating rows, so take pairs of rows at a time
    for row1, row2 in zip(rows[::2], rows[1::2]):
        a = only(row1.select('td.box_pic a'))
        item_id = only(url_params(a)['item'])
        a = only(row2.select('a.usernick'))
        user_id = only(url_params(a)['id'])
        nick = a.text
        print item_id, user_id, nick

main()
Output:
1322679 79159 ABird
1546679 78759 ADog
5622679 87159 ACat56
Now, this may not be as concise as the re method, but this code is aware of how the input is meant to be structured and that makes it robust. If the structure of the input changes, e.g. the format of the URLs or the shape of the HTML, this code will either continue to work correctly or it will raise an error to tell you that things aren't as expected. The re method may very easily continue to run but give you incorrect results, which is not a situation you want. And if you want to extract more information in the future, it's very easy to add the necessary lines without interfering with the existing code.
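For readers on Python 3, the same idea might look roughly like the sketch below. Note the assumptions: the page excerpt in the question does not show the view.php links, so the fragment here is a hypothetical reconstruction, and query_param is a helper name invented for this example. In Python 3 the URL helpers live in urllib.parse.

```python
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup

# Hypothetical fragment: picture rows are assumed to link to view.php?item=<id>
page = """<table>
<tr><td class="box_pic"><a href="/view.php?item=1322679"><img src="a.jpg"></a></td></tr>
<tr><td class="box_prc"><a class="usernick" href="/index.php?action=user&id=79159">ABird</a></td></tr>
<tr><td class="box_pic"><a href="/view.php?item=1546679"><img src="b.jpg"></a></td></tr>
<tr><td class="box_prc"><a class="usernick" href="/index.php?action=user&id=78759">ADog</a></td></tr>
</table>"""


def query_param(a_tag, name):
    """Return the first value of query parameter `name` from the tag's href."""
    return parse_qs(urlparse(a_tag["href"]).query)[name][0]


soup = BeautifulSoup(page, "html.parser")
rows = soup.find_all("tr")

records = []
# Rows alternate between picture and user, so walk them in pairs
for pic_row, user_row in zip(rows[::2], rows[1::2]):
    item_link = pic_row.select_one("td.box_pic a")
    user_link = user_row.select_one("a.usernick")
    records.append((query_param(item_link, "item"),
                    query_param(user_link, "id"),
                    user_link.get_text()))

print(records)
```

Unlike the only() helper above, select_one simply returns None when nothing matches, so a structural change would surface as an error on the next line rather than as an assertion.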
Finally, I found a solution that satisfies all the requirements:
import re
with open('page.htm', 'r') as our_file:
    page = our_file.read()
# Strip all whitespace (including newlines), so re.DOTALL is not needed below
page = re.sub(r'\s', '', page)
result = re.findall(r'view\.php\?item=(\d+).*?user&id=(\d+).*?>(\w+)', page)
print(result, len(result))
With this:
1) the results are in the needed order
2) only 1 expression is used
Result:
[('1322679', '79159', 'ABird'), ('1546679', '78759', 'ADog'), ('5622679', '87159', 'ACat56')] 3
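A variation on the same idea is to leave the page untouched and pass re.DOTALL as a flag, so that "." can cross line breaks. This is a sketch under assumptions: the fragment below is hypothetical (the excerpt in the question does not include the view.php links), and the pattern uses [^>]* to skip the rest of the opening tag before the nickname.

```python
import re

# Hypothetical fragment with the item link and user link on separate lines
page = '''
<td class="box_pic"><a href="/view.php?item=1322679"><img src="x.jpg"></a></td>
<td class="box_prc"><a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></td>
<td class="box_pic"><a href="/view.php?item=1546679"><img src="y.jpg"></a></td>
<td class="box_prc"><a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></td>
'''

# re.DOTALL lets ".*?" span newlines; the lazy quantifier stops at the
# nearest following user&id=, which keeps each item paired with its user
pattern = re.compile(
    r'view\.php\?item=(\d+).*?user&id=(\d+)[^>]*>(\w+)',
    re.DOTALL,
)

rows = pattern.findall(page)
for item_id, user_id, nick in rows:
    print(item_id, user_id, nick)
```

Because the text is not rewritten first, the file is still read only once and scanned in a single findall call.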

Parsing webpage in Python using Beautiful Soup

This is the Web page Source code which I am scraping using Beautiful Soup.
<tr>
<td>
1
</td>
<td style="cipher1">
<img class="cipher2" src="http://cipher3.png" alt="cipher4" title="cipher5" />
<span class="cipher8">t</span>cipher9
</td>
<td>
112
</td>
<td>
3510
</td>
// Pattern Repeated
<tr >
<td>
2
</td>
<td style="cipher1">
I wrote some code using BeautifulSoup, but I am getting more results than I want because the pattern occurs multiple times.
I have used
row1 = soup.find_all('a', class_="cipher7")
for row in row1:
    f.write(row['title'] + "\n")
But with this I get every occurrence of 'cipher7', since it appears multiple times in the web page.
What I can use to narrow it down is
<td style="cipher1">...
since it is unique to the things I want.
So, how do I modify my code to do this?
You can use a convenient select method which takes a CSS selector as an argument:
row = soup.select("td[style=cipher1] > a.cipher7")
You can first find the td tags (since you said they are unique) and then find the specified <a> tags within them.
all_as = []
rows = soup.find_all('td', {'style': 'cipher1'})
for row in rows:
    # extend (rather than append) keeps all_as a flat list of <a> tags
    all_as.extend(row.find_all('a', class_="cipher7"))
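To see the CSS selector approach filter out unwanted links, here is a small self-contained sketch. The HTML is invented for the demonstration (the excerpt in the question does not show the <a class="cipher7"> tags), and the title values are placeholders.

```python
from bs4 import BeautifulSoup

# Invented markup: only links directly inside td[style="cipher1"] are wanted
html = """<table>
<tr><td>1</td>
<td style="cipher1"><a class="cipher7" href="#" title="wanted-1">x</a></td></tr>
<tr><td>2</td>
<td style="other"><a class="cipher7" href="#" title="unwanted">y</a></td></tr>
<tr><td>3</td>
<td style="cipher1"><a class="cipher7" href="#" title="wanted-2">z</a></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# ">" restricts the match to <a> tags that are direct children of the td
titles = [a["title"] for a in soup.select('td[style="cipher1"] > a.cipher7')]
print(titles)
```

If the <a> tags may be nested deeper inside the td, replacing ">" with a space (descendant combinator) would be the usual adjustment.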

Python + BS: picking a specific word (by location) from a webpage table

Hello all. I want to pick a word at a specific location in a table on a webpage. The source code looks like this:
table = '''
<TABLE class=form border=0 cellSpacing=1 cellPadding=2 width=500>
<TBODY>
<TR>
<TD vAlign=top colSpan=3><IMG class=ad src="/images/ad.gif" width=1 height=1></TD></TR>
<TR>
<TH vAlign=top width=22>Code:</TH>
<TD class=dash vAlign=top width=5 lign="left"> </TD>
<TD class=dash vAlign=top width=30 align=left><B>BAN</B></TD></TR>
<TR>
<TH vAlign=top>Color:</TH>
<TD class=dash vAlign=top align=left> </TD>
<TD class=dash vAlign=top align=left>White</TD></TR>
<TR>
<TD colSpan=3> </TD></TR></TBODY></TABLE>
'''
I want to pick the color word here (it could be "White", "red" or something else). What I tried is:
soup = BeautifulSoup(table)
for a in soup.find_all('table')[0].find_all('tr')[2:3]:
    print a.text
It gives:
Color:
 
White
It looks like 4 lines. I tried to add them to a list and then remove the unwanted ones, but without success.
What's the best way to pick only the color from the table?
Many thanks.
This will match all instances of 'white', case-insensitively:
soup = BeautifulSoup(table)
res = []
for a in soup.find_all('table')[0].find_all('tr')[2:3]:
    if 'white' in a.text.lower():
        text = a.text.encode('ascii', 'ignore').replace(':','').split()
        res.append(text)
A slightly better implementation:
# this will iterate through all 'table' tags and the 'tr' tags within each 'table'
res = [tr.text.encode('ascii', 'ignore').replace(':','').split()
       for table in soup.findAll('table') for tr in table.findAll('tr')
       if 'color' in tr.text.lower()]
print res
[['Color', 'White']]
To return only the colors themselves, do:
# Assuming the same format throughout the HTML;
# if the format changes, just add more logic
tr.text.encode('ascii', 'ignore').replace(':','').split()[1]
...
print res
['White']
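Another option, sketched here for Python 3, is to locate the header cell by its label and read the last data cell of the same row, which avoids depending on the row's position or on the color being "White". The function name value_for_label is hypothetical, and the table is a shortened copy of the one in the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question
table = """<TABLE class=form>
<TBODY>
<TR><TH vAlign=top>Code:</TH><TD class=dash>&nbsp;</TD><TD class=dash><B>BAN</B></TD></TR>
<TR><TH vAlign=top>Color:</TH><TD class=dash>&nbsp;</TD><TD class=dash>White</TD></TR>
</TBODY></TABLE>"""


def value_for_label(soup, label):
    """Return the text of the last td in the row whose th matches `label`."""
    for th in soup.find_all("th"):
        if th.get_text(strip=True) == label:
            cells = th.find_parent("tr").find_all("td")
            return cells[-1].get_text(strip=True) if cells else None
    return None  # label not present in the table


soup = BeautifulSoup(table, "html.parser")
color = value_for_label(soup, "Color:")
print(color)
```

The parser lowercases tag names, so searching for "th" and "td" works even though the source markup is uppercase.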

How to extract info from varying table entries: Text vs. DIV vs. SPAN

I am new to Python and have searched the internet for an answer to my problem, but so far without success.
The problem: My aim is to extract data from websites, more specifically from the tables in these websites. The relevant snippet of the website code is in "data" in my Python example here:
from bs4 import BeautifulSoup
data = '''<table class="ds-table">
<tr>
<td class="data-label">year of birth:</td>
<td class="data-value">1994</td>
</tr>
<tr>
<td class="data-label">reporting period:</td>
<td class="data-value">
<span class="editable" id="c-scope_beginning_date">
? </span>
-
<span class="editable" id="c-scope_ending_date">
? </span>
</td>
</tr>
<tr>
<td class="data-label">reporting cycle:</td>
<td class="data-value">
<span class="editable" id="c-periodicity">
- </span>
</td>
</tr>
<tr>
<td class="data-label">grade:</td>
<td class="data-value">1.3, upper 10% of class</td>
</tr>
<tr>
<td class="data-label">status:</td>
<td class="data-value"></td>
</tr>
</table>
<table class="ds-table">
<tr>
<td class="data-label">economics:</td>
<td class="data-value"><span class="positive-value"></span></td>
</tr>
<tr>
<td class="data-label">statistics:</td>
<td class="data-value"><span class="negative-value"></span></td>
</tr>
<tr>
<td class="data-label">social:</td>
<td class="data-value"><div id="music_id" class="trigger"><span class="negative-value"></span></div></td>
</tr>
<tr>
<td class="data-label">misc:</td>
<td class="data-value">
<div id="c_assurance" class="">
<span class="positive-value"></span> </div>
</td>
</tr>
<tr>
<td class="data-label">recommendation:</td>
<td class="data-value">
<span class="negative-value"></span> </td>
</tr>
</table>'''
soup = BeautifulSoup(data)
For class="data-label", I have so far successfully implemented...
box_cdl = []
for i, cdl in enumerate(soup.findAll('td', attrs={'class': 'data-label'})):
    box_cdl.append(cdl.contents[0])
print box_cdl
...which extracts the text from the columns, in the (for me satisfying) output:
[u'year of birth:',
u'reporting period:',
u'reporting cycle:',
u'grade:',
u'status:',
u'economics:',
u'statistics:',
u'social:',
u'misc:',
u'recommendation:']
Where I get stuck is the part for class="data-value", with its div and span fields, where some of the relevant information is hidden in the span class. Moreover, the number of tr rows can change from website to website, e.g. "status" may come after "reporting cycle" (instead of "grade").
However, when I do...
box_cdv = []
for j, cdv in enumerate(soup.findAll('td', attrs={'class': 'data-value'})):
    box_cdv.append(cdv.contents[0])
print box_cdv
...I get the error:
Traceback (most recent call last):
  File "<ipython-input-53-7d5c095cf647>", line 3, in <module>
    box_cdv.append(cdv.contents[0])
IndexError: list index out of range
What I would like to get instead is something like this (corresponding to the above "data"-example):
[u'1994',
u'? - ?',
u'-',
u'1.3, upper 10% of class',
u'',
u'positive-value',
u'negative-value',
u'negative-value',
u'positive-value',
u'negative-value']
The Question: how can I extract this information and collect the relevant data from each tr row, given that the appropriate extraction code depends on the type of category (year of birth, reporting period, ..., recommendation)?
Or, asked differently: what code extracts, depending on the category (year of birth, reporting period, ..., recommendation), the corresponding value (1994, ..., negative-value)?
Since the number and type of table entries can differ between websites, a simple "on the i-th entry do the following" procedure is not applicable. I think what I am looking for is something like "if you find the text "recommendation:", then extract the class type from the span field". But unfortunately I do not have any clue how to translate that into Python.
Any help is highly appreciated.
You get that error because one of the tags doesn't have any children, so its contents list is empty and indexing it with [0] raises the IndexError.
You can approach this in the following way:
1) Search for the data-label tags;
2) Find the next TD sibling;
3) Check if the sibling has text;
3 a) If so, create a dict entry with the data-label as the key and the sibling text as its value;
3 b) If not, check if the sibling's first child has a class containing -value;
4) Parse the data.
Example:
soup = BeautifulSoup(data, 'lxml')
result = {}
for tag in soup.find_all("td", {"class": "data-label"}):
    NextSibling = tag.find_next("td", {"class": "data-value"}).get_text(strip=True)
    if not NextSibling and len(tag.find_next("td").select('span[class*=-value]')) > 0:
        NextSibling = tag.find_next("td").select('span[class*=-value]')[0]["class"][0]
    result[tag.get_text(strip=True)] = NextSibling
print(result)
Result:
{
'year of birth:': '1994',
'reporting period:': '?-?',
'reporting cycle:': '-',
'grade:': '1.3, upper 10% of class',
'status:': '',
'economics:': 'positive-value',
'statistics:': 'negative-value',
'social:': 'negative-value',
'misc:': 'positive-value',
'recommendation:': 'negative-value'
}
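If the values are wanted row by row in page order, with the spacing shown in the question ('? - ?'), a variation along the following lines may work. It is a sketch on a reduced copy of the data; cell_value is a made-up helper name, and passing " " as the separator to get_text is what keeps the pieces of "reporting period" apart.

```python
from bs4 import BeautifulSoup

# Reduced copy of the tables from the question
data = """<table class="ds-table">
<tr><td class="data-label">year of birth:</td><td class="data-value">1994</td></tr>
<tr><td class="data-label">reporting period:</td>
<td class="data-value">
<span class="editable" id="c-scope_beginning_date">
? </span>
-
<span class="editable" id="c-scope_ending_date">
? </span>
</td></tr>
<tr><td class="data-label">status:</td><td class="data-value"></td></tr>
<tr><td class="data-label">social:</td>
<td class="data-value"><div id="music_id" class="trigger"><span class="negative-value"></span></div></td></tr>
</table>"""


def cell_value(td):
    """Prefer the cell's visible text; otherwise fall back to a *-value span class."""
    text = td.get_text(" ", strip=True)
    if text:
        return text
    marker = td.select_one('span[class$="-value"]')
    return marker["class"][0] if marker else ""


soup = BeautifulSoup(data, "html.parser")

# One (label, value) pair per table row, in document order
pairs = [(row.select_one("td.data-label").get_text(strip=True),
          cell_value(row.select_one("td.data-value")))
         for row in soup.find_all("tr")]

print(pairs)
```

Working per row means the label and value always come from the same tr, so the order of categories on a given website does not matter.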
