python beautiful soup extract data - python

I am parsing an HTML document using BeautifulSoup 4.0.
Here is an example of a table in the document:
<tr>
<td class="nob"></td>
<td class="">Time of price</td>
<td class=" pullElement pullData-DE000BWB14W0.teFull">08/06/2012</td>
<td class=" pullElement pullData-DE000BWB14W0.PriceTimeFull">11:43:08 </td>
<td class="nob"></td>
</tr>
<tr>
<td class="nob"></td>
<td class="">Daily volume (units)</td>
<td colspan="2" class=" pullElement pullData-DE000BWB14W0.EWXlume">0</td>
<td class="nob"></td>
</tr>
I would like to extract 08/06/2012, 11:43:08, Daily volume, 0, etc.
This is my code to find the specific table and all of its data:
from bs4 import BeautifulSoup

html = open("some_file.html")
soup = BeautifulSoup(html)
t = soup.find(id="ctnt-2308")
dat = [[str(td) for td in row.find_all("td")] for row in t.find_all("tr")]
I get a list of data that needs to be organized.
Any suggestions on how to do this in a simple way?
Thank you

list(soup.stripped_strings)
will give you all the strings in that soup (with leading and trailing whitespace stripped)
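Applied to the question's table, that looks something like this (a minimal sketch reusing the file name and id from the question's code; the label/value pairing at the end is just one way to organize the result):
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("some_file.html"), "html.parser")
t = soup.find(id="ctnt-2308")

# every non-empty string in the table, with surrounding whitespace stripped
print(list(t.stripped_strings))
# ['Time of price', '08/06/2012', '11:43:08', 'Daily volume (units)', '0']

# one way to organize it: pair the label in each row with the values that follow it
rows = [[td.get_text(strip=True) for td in tr.find_all("td") if td.get_text(strip=True)]
        for tr in t.find_all("tr")]
data = {row[0]: row[1:] for row in rows if row}
print(data)
# {'Time of price': ['08/06/2012', '11:43:08'], 'Daily volume (units)': ['0']}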

Related

How to parse column values and their hrefs with Selenium

I'm new to Selenium and to parsing data from websites.
The problem is: I have a website table with the following HTML code:
<table width="580" cellspacing="1" cellpadding="3" bgcolor="#ffffff" id="restab">
<tbody>
<tr align="center" valign="middle">
<td width="40" bgcolor="#555555"><font color="#ffffff">№</font></td>
<td width="350" bgcolor="#555555"><font color="#ffffff">Название организации</font></td>
<td width="100" bgcolor="#555555"><font color="#ffffff">Город</font></td>
<td width="60" bgcolor="#555555"><span title="Число публикаций данной организации на eLibrary.Ru"><font color="#ffffff">Публ.</font></span></td><td width="30" bgcolor="#555555"><span title="Число ссылок на публикации организации"><font color="#ffffff">Цит.</font></span></td>
</tr>
<tr valign="middle" bgcolor="#f5f5f5" id="a18098">
<td align="center"><font color="#00008f">1</font></td>
<td align="left"><font color="#00008f"><a href="org_about.asp?orgsid=18098">
"Академия информатизации образования" по Ленинградской области</a></font></td>
<td align="center"><font color="#00008f">Гатчина</font></td>
<td align="right"><font color="#00008f">0<img src="/pic/1pix.gif" hspace="16"></font></td>
<td align="center"><font color="#00008f">0</font></td>
</tr>
<tr valign="middle" bgcolor="#f5f5f5" id="a17954">
<td align="center"><font color="#00008f">2</font></td>
<td align="left"><font color="#00008f"><a href="org_about.asp?orgsid=17954">
"Академия талантов" Санкт-Петербурга</a></font></td>
<td align="center"><font color="#00008f">Санкт-Петербург</font></td>
<td align="right"><font color="#00008f">3<img src="/pic/stat.gif" width="12" height="13" hspace="10" border="0"></font></td>
<td align="center"><font color="#00008f">0</font></td>
</tr>
</tbody>
</table>
I need to get all of the table values plus the href of each link in the left td.
I tried to use XPath, but it throws an error. How can I do this better?
In the end I need a dataframe with the table values plus an extra column holding the href from the left column.
First try to use pandas.read_html(). See code example below.
If that doesn't work, then use the right-click menu in a browser such as Mozilla Firefox (Inspect Element) or Google Chrome (Developer Tools) to find the CSS selector or XPath, then feed that selector or XPath into Selenium.
Another useful tool for finding complicated CSS selectors/XPaths is the Inspector Gadget browser plug-in.
import pandas as pd
# this is the website you want to read ... table with "Minimum Level for Adult Cats"
str_url = 'http://www.felinecrf.org/catfood_data_how_to_use.htm'
# use pandas.read_html()
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
list_df = pd.read_html(str_url, match='DMA')
print('Number of dataframes on the page: ', len(list_df))
print()
for idx, each_df in enumerate(list_df):
    print(f'Show dataframe number {idx}:')
    print(each_df.head())
    print()
# use table 2 on the page
df_target = list_df[2]
# create column headers
# https://chrisalbon.com/python/data_wrangling/pandas_rename_column_headers/
header_row = df_target.iloc[0]
# Replace the dataframe with a new one which does not contain the first row
df_target = df_target[1:]
# Rename the dataframe's column values with the header variable
df_target.columns = header_row
print(df_target.head())
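Note that pandas.read_html() only keeps the cell text, so the href column the question asks for will be lost. One way to add it is a BeautifulSoup pass over the same markup (a sketch, assuming the page HTML is already in a string page_html, e.g. from Selenium's driver.page_source; the column names below are only illustrative):
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_html, 'html.parser')
records = []
for tr in soup.select('table#restab tr[id]'):   # data rows carry an id, the header row does not
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    link = tr.find('a')
    cells.append(link['href'] if link else None)  # href of the left-column link, or None
    records.append(cells)

df = pd.DataFrame(records, columns=['N', 'Organization', 'City', 'Publ', 'Cit', 'href'])
print(df)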

Retrieving the name of a class attribute with lxml

I am working on a Python project using lxml to scrape a page, and I am having trouble retrieving the name of a span's class attribute. The HTML snippet is below:
<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>
....
How do I retrieve the value of the span's class attribute below:
<span class="brand">carlos santos</span>
You can use the following XPath to get the class attribute of the span element that is a direct child of the td with class product:
//td[@class="product"]/span/@class
Working demo example:
from lxml import html
raw = '''<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>'''
root = html.fromstring(raw)
span = root.xpath('//td[@class="product"]/span/@class')[0]
print(span)
Output:
brand
Or with BeautifulSoup:
from bs4 import BeautifulSoup

html_doc = '''<tr class="nogrid">
<td class="date">12th January 2016</td>
<td class="time">11:22pm</td>
<td class="category">Clothing</td>
<td class="product">
<span class="brand">carlos santos</span>
</td>
<td class="size">10</td>
<td class="name">polo</td>
</tr>'''
soup = BeautifulSoup(html_doc, 'lxml')
result = soup.find('span')['class']  # class is a multi-valued attribute, so result == ['brand']
print(result[0])  # brand

how to find class with beautifulsoup

<body>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%">
date
</td>
<td class="font_34 xicolor_42">
19 Eylül 2013
</td>
</tr>
<tr>
<td style="height:10px" colspan="3"></td>
</tr>
<tr>
<td class="bgcolor_09" style="height:5px" colspan="3"></td>
</tr>
<tr>
<td style="height:10px" colspan="3"></td>
</tr>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%">
Size
</td>
<td class="font_34 xicolor_42">
650 cm
</td>
</tr>
The class names are the same, and the classes are in the same table.
How can I find the correct data? For example, if "date" doesn't exist in <td class="padding_25 font_7 bold xicolor_07">, don't pull the date and move on to the next piece of data.
If this is your HTML and you can change it, you should be using semantic HTML to mark up your elements with class, id, or name attributes that describe the meaning of the data, not its appearance. Then you will have an unambiguous way of selecting the required tags.
As it is, you have to do something like this:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)  # html: the markup from the question
date_tag = soup.find('td', text=re.compile(r'^\s*date\s*$'))  # first <td> whose text is "date"
if date_tag:
    date_value = date_tag.find_next_sibling('td').text.strip()
    print(date_value)
# prints: 19 Eylül 2013

scraping tables with beautifulsoup

I seem to be stuck. If I had the following table:
<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
What would be the ideal way to use find_all to pull the 44,000 td from the table?
If it is a recurring position in the table you would like to scrape, I would use Beautiful Soup to extract all the elements in the table and then pick out that data. See the pseudo-code below.
known_position = 5
tds = soup.find_all('td')  # soup: the parsed table
number = tds[known_position].text
On the other hand, if you're specifically searching for a given number, I would just iterate over the list.
tds = soup.find_all('td')
for td in tds:
    if td.text.strip() == 'number here':
        pass  # do your stuff
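Put together against the table from the question, both ideas might look like this (a sketch, assuming the snippet is stored in a string called html; index 5 is simply where 44,000 sits in that row):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')

# by position: 44,000 is the sixth <td> in the row
print(tds[5].get_text(strip=True))   # 44,000

# by value: scan every <td> for the number you want
for td in tds:
    if td.get_text(strip=True) == '44,000':
        print('found:', td.get_text(strip=True))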

BeautifulSoup help: how to extract content from improper tags in an HTML file?

<tr>
<td nowrap> good1 </td>
<td class = "td_left" nowrap=""> 1 </td>
</tr>
<tr0>
<td nowrap> good2 </td>
<td class = "td_left" nowrap=""> </td>
</tr0>
How can I parse it using Python? Please help.
I want to get the result as a list: ['good1', 1, 'good2', None]
Find all tr tags and get all the tds from them:
from bs4 import BeautifulSoup
page = """<tr>
<td nowrap> good1 </td>
<td nowrap class = "td_left"> 1 </td>
</tr>
<tr>
<td nowrap> good2 </td>
<td nowrap class = "td_left"> 2 </td>
</tr>"""
soup = BeautifulSoup(page, 'html.parser')
rows = soup.find_all('tr')
print([td.text.strip() for row in rows for td in row.find_all('td')])
prints:
['good1', '1', 'good2', '2']
Note that strip() gets rid of leading and trailing whitespace.
Hope that helps.
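If you want exactly the ['good1', 1, 'good2', None] shape asked for in the question (empty cells as None, numeric strings as ints), a small conversion step on the original, imperfect snippet works too (a sketch; html.parser keeps the stray <tr0> tags without complaint):
from bs4 import BeautifulSoup

page = """<tr>
<td nowrap> good1 </td>
<td class="td_left" nowrap=""> 1 </td>
</tr>
<tr0>
<td nowrap> good2 </td>
<td class="td_left" nowrap=""> </td>
</tr0>"""

def convert(text):
    # empty cell -> None, digit string -> int, everything else stays a string
    if not text:
        return None
    return int(text) if text.isdigit() else text

soup = BeautifulSoup(page, 'html.parser')
print([convert(td.get_text(strip=True)) for td in soup.find_all('td')])
# ['good1', 1, 'good2', None]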
