Beautifulsoup to parse html table for text and links


I have a table with several columns. The last column may contain links to documents; the number of links per cell is not fixed (zero or more).
<tbody>
<tr>
<td>
<h2>Table Section</h2>
</td>
</tr>
<tr>
<td>
Object 1
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
</td>
<td class="text-nowrap"></td>
</tr>
<tr>
<td>
Object 2
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
</td>
<td>
<ul>
<li>
<small>
<a href="link_to.doc">Title</a>Notes
</small>
</li>
<li>
<small>
<a href="another_link_to.doc">Title2</a>Notes2
</small>
</li>
</ul>
</td>
</tr>
</tbody>
Basic parsing is not a problem. I'm stuck on getting those links, with their titles and notes, and appending them to a Python list (or NumPy array).
from bs4 import BeautifulSoup
with open("new 1.html", encoding="utf8") as dump:
    soup = BeautifulSoup(dump, features="lxml")
data = []
table_body = soup.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)
    a = row.find_all('a')
    for ele1 in a:
        if ele1.get('href') != "#":
            data.append([ele1.get('href')])
print(*data, sep='\n')
Output:
['Table Section']
['Object 1', 'Param 1', 'Param 2', '']
['Object 2', 'Param 1', 'Param 2', 'TitleNotes\n\t\t\t \n\n\n\nTitle2Notes2']
['link_to.doc']
['another_link_to.doc']
Is there any way to append the links to the row's own list instead? I would like the list for the second row to look like this:
['Object 2', 'Param 1', 'Param 2', 'Title', 'Notes', 'link_to.doc', ' Title2', 'Notes2', 'another_link_to.doc']

Something like this: find every <small> element and read the href attribute of its second child, which is the link.
from bs4 import BeautifulSoup
html = '''<tbody>
<tr>
<td>
<h2>Table Section</h2>
</td>
</tr>
<tr>
<td>
Object 1
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
</td>
<td class="text-nowrap"></td>
</tr>
<tr>
<td>
Object 2
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
</td>
<td>
<ul>
<li>
<small>
<a href="link_to.doc">Title</a>Notes
</small>
</li>
<li>
<small>
<a href="another_link_to.doc">Title2</a>Notes2
</small>
</li>
</ul>
</td>
</tr>
</tbody>'''
soup = BeautifulSoup(html, features="lxml")
smalls = soup.find_all('small')
links = [s.contents[1].attrs['href'] for s in smalls]
print(links)
Output:
['link_to.doc', 'another_link_to.doc']
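To get the row layout the question asks for (title, notes, and link appended to the row's own list), one possible sketch follows. It assumes each link is an <a href=...> inside a <small>, followed directly by its note text; the sample markup, the parse_rows name, and the use of html.parser are illustrative choices, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: each <small> holds one <a href=...> followed by note text.
html = """
<table><tbody>
<tr><td>Object 1</td><td>Param 1</td><td></td></tr>
<tr><td>Object 2</td><td>Param 1</td>
<td><ul>
<li><small><a href="link_to.doc">Title</a>Notes</small></li>
<li><small><a href="another_link_to.doc">Title2</a>Notes2</small></li>
</ul></td></tr>
</tbody></table>
"""

def parse_rows(markup):
    """Return one flat list per row; links in the last cell expand to title, notes, href."""
    soup = BeautifulSoup(markup, "html.parser")
    data = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue
        # Plain text for every cell except the last one
        record = [td.get_text(strip=True) for td in cells[:-1]]
        last = cells[-1]
        links = [a for a in last.find_all("a") if a.get("href") not in (None, "#")]
        if not links:
            # No usable links: keep the cell text as a single column
            record.append(last.get_text(strip=True))
        for a in links:
            title = a.get_text(strip=True)
            # Note text is taken from the text nodes that follow the link
            notes = "".join(s for s in a.next_siblings if isinstance(s, str)).strip()
            record.extend([title, notes, a["href"]])
        data.append(record)
    return data

rows = parse_rows(html)
print(*rows, sep="\n")
```

Because each row stays a single list, rows with different numbers of links simply have different lengths, which a plain Python list handles but a rectangular NumPy array would not.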

Related

How to get a text of certain elements BeautifulSoup Python

I have this kind of html code
<tr>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>
Name Name Name
</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>25.01.1980</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
</tr>
<tr>...</tr>
<tr>...</tr>
I need to get the text of the 3rd and 5th td of every tr.
Apparently this doesn't work:
from bs4 import BeautifulSoup
import index
soup = BeautifulSoup(index.index_doc, 'lxml')
for i in soup.find_all('tr')[2:]:
    print(i[2].text, i[4].text)

You could use CSS selectors and the pseudo-class :nth-of-type() to select your elements (assuming you need the date, I selected the 6th td rather than the 5th):
data = [e.get_text(strip=True) for e in soup.select('tr td:nth-of-type(3),tr td:nth-of-type(6)')]
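And to get a list of (name, date) tuples, pair the even-indexed items with the odd-indexed ones (this assumes every matched row contains both cells):
list(zip(data[::2], data[1::2]))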
Example
from bs4 import BeautifulSoup
html = '''
<tr>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>
Name Name Name
</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>25.01.1980</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
</tr>
<tr>...</tr>
<tr>...</tr>
'''
soup = BeautifulSoup(html, 'html.parser')
data = [e.get_text(strip=True) for e in soup.select('tr td:nth-of-type(3),tr td:nth-of-type(6)')]
# Pair each name (even index) with the date that follows it (odd index)
list(zip(data[::2], data[1::2]))
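If some rows have fewer cells than others, pairing values from one flat list can drift out of step. A row-by-row variant avoids that; this is only a sketch with made-up sample rows, and the six-cell threshold is an assumption based on the markup shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two full rows and one short row that must be skipped.
html = """
<table>
<tr><td>a</td><td>b</td><td><p><sup> Name One </sup></p></td><td>d</td><td>e</td><td><p><sup>25.01.1980</sup></p></td></tr>
<tr><td>only one cell</td></tr>
<tr><td>a</td><td>b</td><td>Name Two</td><td>d</td><td>e</td><td>01.02.1990</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
pairs = []
for tr in soup.find_all("tr"):
    # Only direct <td> children, so nested tables would not shift the indexes
    cells = tr.find_all("td", recursive=False)
    if len(cells) < 6:
        continue  # row is too short to hold both values
    pairs.append((cells[2].get_text(strip=True), cells[5].get_text(strip=True)))

print(pairs)
```

Indexing the cell list (cells[2]) is also what the question's i[2] was reaching for: a Tag is not indexable by position, but the list returned by find_all() is.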

How to extract text from table moving between <tr> tags using Beautifulsoup

I need to extract text from a table using BeautifulSoup.
Below are the HTML, the code I have written, and its output.
HTML:
<div class="Tech">
<div class="select">
<span>Selection is mandatory</span>
</div>
<table id="product">
<tbody>
<tr class="feature">
<td class="title" rowspan="3">
<h2>Information</h2>
</td>
<td class="label">
<h3>Design</h3>
</td>
<td class="checkbox">product</td>
</tr>
<tr>
<td class="label">
<h3>Marque</h3>
</td>
<td class="checkbox">
<input type="checkbox">
<label>retro</label>
<a href="link">
Landlord
</a>
</td>
</tr>
<tr>
<td class="label">
<h3>Model</h3>
</td>
<td class="checkbox">model123</td>
</tr>
import requests
from bs4 import BeautifulSoup
url='someurl.com'
source2= requests.get(url,timeout=30).text
soup2=BeautifulSoup(source2,'lxml')
element2= soup2.find('div',class_='Tech')
pin= element2.find('table',id='product').tbody.tr.text
print(pin)
Output that I am getting is:
Information
Design
product
How do I move between <tr>s? I need the output to be: model123.

To get output model123, you can try:
# search <h3> that contains "Model"
# (newer soupsieve versions spell this :-soup-contains; plain :contains is deprecated)
h3 = soup2.select_one('h3:-soup-contains("Model")')
# search next <td>
model = h3.find_next("td").text
print(model)
Prints:
model123
Or without CSS selectors:
model = (
    soup.find(lambda tag: tag.name == "h3" and tag.text.strip() == "Model")
    .find_next("td")
    .text
)
print(model)
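To move between rows more generally, rather than jumping straight to one heading, a possible approach is to loop over every <tr> and collect label/value pairs in a dictionary. The sketch below uses a trimmed copy of the question's table; the specs name and the choice of html.parser are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question
html = """
<table id="product"><tbody>
<tr class="feature"><td class="title" rowspan="3"><h2>Information</h2></td>
<td class="label"><h3>Design</h3></td><td class="checkbox">product</td></tr>
<tr><td class="label"><h3>Marque</h3></td>
<td class="checkbox"><input type="checkbox"><label>retro</label><a href="link">Landlord</a></td></tr>
<tr><td class="label"><h3>Model</h3></td><td class="checkbox">model123</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
specs = {}
for tr in soup.select("table#product tr"):
    label = tr.find("td", class_="label")
    value = tr.find("td", class_="checkbox")
    if label is not None and value is not None:
        # A space separator keeps "retro" and "Landlord" from running together
        specs[label.get_text(strip=True)] = value.get_text(" ", strip=True)

print(specs["Model"])
```

With the dictionary in hand, any row can be read by its label, for example specs["Design"] or specs["Marque"].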

Extract heading from various tags using beautiful soup

How can I extract the table headings for both table types from the HTML below using Beautiful Soup?
<body>
<p>some other data 1</p>
<p>Table1 heading</p>
<div></div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data1_00</p></td>
<td><p>data1_01</p></td>
</tr>
<tr>
<td><p>data1_10</p></td>
<td><p>data1_11</p></td>
</tr>
</tbody></table></div>
</div>
<br><br>
<div>some other data 2</div>
<div>Table2 heading</div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data2_00</p></td>
<td><p>data2_01</p></td>
</tr>
<tr>
<td><p>data2_10</p></td>
<td><p>data2_11</p></td>
</tr>
</tbody></table></div>
</div>
</body>
On the first table, the heading is inside a <p> tag, and on the second table the heading is inside a <div> tag. Also, there is an empty <div> tag just above the first table.
How can I extract both table headings?
Currently I am searching for the previous <div> above the current table using table.find_previous('div'), and the text inside it is saved as the heading.
from bs4 import BeautifulSoup
import urllib.request
htmlpage = urllib.request.urlopen(url)
page = BeautifulSoup(htmlpage, "html.parser")
all_divtables = page.find_all('table')
for table in all_divtables:
    curr_div = table
    while True:
        curr_div = curr_div.find_previous('div')
        if len(curr_div.find_all('table')) > 0:
            continue
        else:
            heading = curr_div.text.strip()
            print(heading)
            break
Desired output:
Table1 heading
Table2 heading

You can call find_previous() with a lambda that selects the nearest preceding tag which contains no table and whose text is not empty:
data = '''<body>
<p>some other data 1</p>
<p>Table1 heading</p>
<div></div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data1_00</p></td>
<td><p>data1_01</p></td>
</tr>
<tr>
<td><p>data1_10</p></td>
<td><p>data1_11</p></td>
</tr>
</tbody></table></div>
</div>
<br><br>
<div>some other data 2</div>
<div>Table2 heading</div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data2_00</p></td>
<td><p>data2_01</p></td>
</tr>
<tr>
<td><p>data2_10</p></td>
<td><p>data2_11</p></td>
</tr>
</tbody></table></div>
</div>
<div>some other data 3</div>
<div>Table3 heading</div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data2_00z</p></td>
<td><p>data2_01z</p></td>
</tr>
<tr>
<td><p>data2_10z</p></td>
<td><p>data2_11z</p></td>
</tr>
</tbody></table></div>
</div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data2_00x</p></td>
<td><p>data2_01x</p></td>
</tr>
<tr>
<td><p>data2_10x</p></td>
<td><p>data2_11x</p></td>
</tr>
</tbody></table></div>
</div>
</body>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for table in soup.select('table'):
    for i in table.find_previous(lambda t: not t.find('table') and t.text.strip() != ''):
        if i.find_parents('table'):
            continue
        print(i)
        print('*' * 80)
Prints:
Table1 heading
********************************************************************************
Table2 heading
********************************************************************************
Table3 heading
********************************************************************************
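The same idea can be written with a named filter function that also rejects tags located inside a table, which makes the nested loop unnecessary. This is a sketch on a shortened copy of the question's markup; is_heading_candidate and headings are hypothetical names.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question
html = """<body>
<p>some other data 1</p>
<p>Table1 heading</p>
<div></div>
<div><div><table><tbody><tr><td><p>data1_00</p></td></tr></tbody></table></div></div>
<div>some other data 2</div>
<div>Table2 heading</div>
<div><div><table><tbody><tr><td><p>data2_00</p></td></tr></tbody></table></div></div>
</body>"""

def is_heading_candidate(tag):
    # Non-empty text, no table inside, and not itself part of a table
    return (
        tag.get_text(strip=True) != ""
        and tag.find("table") is None
        and tag.find_parent("table") is None
    )

soup = BeautifulSoup(html, "html.parser")
headings = []
for table in soup.find_all("table"):
    prev = table.find_previous(is_heading_candidate)
    headings.append(prev.get_text(strip=True) if prev is not None else None)

print(headings)
```

Note that a table with no heading of its own would pick up the nearest earlier text outside any table, so the result may need a further check in that case.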

urldata='''<body>
<p>some other data 1</p>
<p>Table1 heading</p>
<div></div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data1_00</p></td>
<td><p>data1_01</p></td>
</tr>
<tr>
<td><p>data1_10</p></td>
<td><p>data1_11</p></td>
</tr>
</tbody></table></div>
</div>
<br><br>
<div>some other data 2</div>
<div>Table2 heading</div>
<div>
<div><table width="15%"><tbody>
<tr>
<td><p>data2_00</p></td>
<td><p>data2_01</p></td>
</tr>
<tr>
<td><p>data2_10</p></td>
<td><p>data2_11</p></td>
</tr>
</tbody></table></div>
</div>
</body>'''
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urldata, 'lxml')
# Match text nodes that contain the word "heading"
results = soup.body.find_all(string=re.compile('heading'))
for result in results:
    print(result)
Output:
Table1 heading
Table2 heading

Extracting multiple table data using python and beautiful soup

<div class="row margin_30">
<div class="col-md-12 col-sm-12 col-xs-12 col-lg-12">
<div class="table-responsive table-border-radius">
<table class="table table-hover result-table-new1 " style="margin:0">
<thead class="">
<tr class="">
<th style="text-align:center;">Pl</th>
<th>H.No</th>
<th>Horse/Pedigree</th>
<th>Desc</th>
<th>Trainer</th>
<th>Jockey</th>
<th>Wt</th>
<th>Al</th>
<th>Dr</th>
<th>Sh</th>
<th>Won By</th>
<th>Dist Win</th>
<th>Rtg</th>
<th>Odds</th>
<th>Time</th>
</tr>
</thead>
<tbody class="">
<tr class="dividend_tr" >
<td>1 </td>
<td style="text-align: center;">7 </td>
<td class="race_card_td"><h5 style="font-size:16px">
<a href="http://www.indiarace.com/Home/horseStatistics/55234/SILKEN
STRIKER">
SILKEN STRIKER </a></h5>
<h6 class="margin_remove">Sussex(GB)-Flying Rani </h6>
</td>
<td>
4y b g </td>
<td>
Irfan Ghatala </td>
<td>
Anjar Alam </td>
<td>
56 </td>
<td>
- </td>
<td>
6 </td>
<td>
A </td>
<td>
5 1/2 </td>
<td>
</td>
<td>
12 </td>
<td>
</td>
<td>
1:14.57 </td>
</tr>
<tr class="dividend_tr" >
<td>
2 </td>
<td style="text-align: center;">
5 </td>
<td class="race_card_td">
<h5 style="font-size:16px">
<a href="http://www.indiarace.com/Home/horseStatistics/55737/ULTIMATE
POWER">
ULTIMATE POWER </a>
</h5>
<h6 class="margin_remove">
Epicentre(USA)-Methodical </h6>
</td>
<td>
4y b g </td>
<td>
V Lokanath </td>
<td>
Darshan R N </td>
<td>
57 </td>
<td>
-1 </td>
<td>
3 </td>
<td>
A </td>
<td>
5 </td>
<td>
5.5 </td>
<td>
14 </td>
<td>
</td>
<td>
1:15.47 </td>
</tr>
</tbody>
</table>
</div>
I want the following output using Beautiful Soup, and I want to store it in a CSV file. The actual page [http://www.indiarace.com/Home/racingCenterEvent?venueId=3&event_date=2018-08-10&race_type=RESULTS] has multiple tables and many rows. Also, I need to write a function to get data from different pages.
[Result][1]
[1]: https://i.stack.imgur.com/4LYt8.jpg
Any help would be greatly appreciated.

It's pretty simple: find all the tables, then iterate over the tr and td elements as needed. You can use pandas to save the scraped data. I have parsed the tables for you (the rest is up to you); see the code below.
import requests
from bs4 import BeautifulSoup
url = 'http://www.indiarace.com/Home/racingCenterEvent?venueId=3&event_date=2018-08-10&race_type=RESULTS'
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser')
tables = soup.find_all('table', attrs={'class': 'result-table-new1'})
for table in tables:
    for row in table.find_all('tr'):
        # Print each row's text on a single line
        print(row.text.replace('\n', ' '))
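For the CSV part of the question, the standard csv module is one option besides pandas. The sketch below runs on a trimmed, made-up version of the results table and writes to an in-memory buffer; table_rows is a hypothetical helper, and for a real file the buffer would be replaced by open(path, "w", newline="").

```python
import csv
import io
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of one results table
html = """<table class="table result-table-new1">
<thead><tr><th>Pl</th><th>H.No</th><th>Horse/Pedigree</th></tr></thead>
<tbody><tr class="dividend_tr"><td>1 </td><td>7 </td>
<td><h5><a href="#">SILKEN STRIKER </a></h5><h6>Sussex(GB)-Flying Rani </h6></td></tr></tbody>
</table>"""

def table_rows(table):
    """Yield one list of cleaned cell texts per <tr>, header row included."""
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        # Collapse internal whitespace so multi-line cells become one line
        yield [" ".join(c.get_text(" ", strip=True).split()) for c in cells]

soup = BeautifulSoup(html, "html.parser")
buffer = io.StringIO()
writer = csv.writer(buffer)
for table in soup.select("table.result-table-new1"):
    writer.writerows(table_rows(table))

csv_text = buffer.getvalue()
print(csv_text)
```

Wrapping the parsing in a function that takes the page HTML would let the same code be reused for other pages, as the question asks.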

How to extract items line by line

I need some help here using beautifulsoup4 to extract data from my inventory webpage.
The webpage was written in the following format: name of the item, followed by a table listing the multiple rows of details for that particular inventory.
I am interested in getting the item name, actual qty and expiry date.
How do I go about doing it given such HTML structure (see appended)?
<div style="font-weight: bold">Item X</div>
<table cellspacing="0" cellpadding="0" class="table table-striped report-table" style="width: 800px">
<thead>
<tr>
<th> </th>
<th>Supplier</th>
<th>Packing</th>
<th>Default Qty</th>
<th>Expensive</th>
<th>Reorder Point</th>
<th>Actual Qty</th>
<th>Expiry Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>350.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td>2</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>15200.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td>3</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>210.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td colspan="5"> </td>
<td>Total Qty 15760.00</td>
<td> </td>
</tr>
</tbody>
</table>
<div style="font-weight: bold">Item Y</div>
<table cellspacing="0" cellpadding="0" class="table table-striped report-table" style="width: 800px">
<thead>
<tr>
<th> </th>
<th>Supplier</th>
<th>Packing</th>
<th>Default Qty</th>
<th>Expensive</th>
<th>Reorder Point</th>
<th>Actual Qty</th>
<th>Expiry Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>271.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>2</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>500.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>3</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>69.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>4</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>475.00</td>
<td>01-01-2020</td>
</tr>
<tr>
<td colspan="5"> </td>
<td>Total Qty 1315.00</td>
<td> </td>
</tr>
</tbody>
</table>

Here is one way to do it. The idea is to iterate over the items - div elements with bold substring inside the style attribute. Then, for every item, get the next table sibling using find_next_sibling() and parse the row data into a dictionary for convenient access by a header name:
from bs4 import BeautifulSoup
data = """your HTML here"""
soup = BeautifulSoup(data, "lxml")
for item in soup.select("div[style*=bold]"):
    item_name = item.get_text()
    table = item.find_next_sibling("table")
    headers = [th.get_text(strip=True) for th in table('th')]
    for row in table('tr')[1:-1]:
        row_data = dict(zip(headers, [td.get_text(strip=True) for td in row('td')]))
        print(item_name, row_data['Actual Qty'], row_data['Expiry Date'])
    print("-----")
Prints:
Item X 350.00 31-05-2019
Item X 15200.00 31-05-2019
Item X 210.00 31-05-2019
-----
Item Y 271.00 31-01-2020
Item Y 500.00 31-01-2020
Item Y 69.00 31-01-2020
Item Y 475.00 01-01-2020
-----
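A variant of the same approach that returns the values instead of printing them, and that skips the "Total Qty" row by comparing cell counts rather than by position, might look like this. The sample markup is shortened to four columns, and extract_inventory is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of one item block
html = """
<div style="font-weight: bold">Item X</div>
<table>
<thead><tr><th> </th><th>Supplier</th><th>Actual Qty</th><th>Expiry Date</th></tr></thead>
<tbody>
<tr><td>1</td><td>Company 1</td><td>350.00</td><td>31-05-2019</td></tr>
<tr><td>2</td><td>Company 1</td><td>210.00</td><td>31-05-2019</td></tr>
<tr><td colspan="2"> </td><td>Total Qty 560.00</td><td> </td></tr>
</tbody>
</table>
"""

def extract_inventory(markup):
    """Return (item name, actual qty, expiry date) tuples for every data row."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for item in soup.select('div[style*="bold"]'):
        table = item.find_next_sibling("table")
        if table is None:
            continue
        headers = [th.get_text(strip=True) for th in table.select("thead th")]
        for tr in table.select("tbody tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if len(cells) != len(headers):
                continue  # summary rows such as "Total Qty" use colspan
            row = dict(zip(headers, cells))
            records.append((item.get_text(strip=True), row["Actual Qty"], row["Expiry Date"]))
    return records

records = extract_inventory(html)
print(records)
```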

One solution is to iterate through each row tag i.e. <tr> and then just figure out what the column cell at each index represents and access columns that way. To do so, you can use the find_all method in BeautifulSoup, which will return a list of all elements with the tag given.
Example:
from bs4 import BeautifulSoup
html_doc = """YOUR HTML HERE"""
soup = BeautifulSoup(html_doc, 'html.parser')
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 0:
        # This is the header row (it has only <th> cells)
        continue
    if len(cells) < 8:
        # Shorter rows, such as the "Total Qty" summary, have no such column
        continue
    # If you want to access the text of the Default Qty column, for example
    default_qty = cells[3].text
Note that when the tr tag is the header row, there are no td tags (only th tags), so len(cells) == 0 in that case.

You can select all divs and walk through to find the next table.
If you go over the rows of the table with the exception of the last row, you can extract text from specific cells and build your inventory list.
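from bs4 import BeautifulSoup
markup = """your HTML here"""
soup = BeautifulSoup(markup, "html5lib")
inventory = []
for itemdiv in soup.select('div'):
    table = itemdiv.find_next('table')
    for supply_row in table.tbody.select('tr')[:-1]:
        # Columns: S/N, Supplier, Packing, Default Qty, Expensive, Reorder Point, Actual Qty, Expiry Date
        sn, supplier, _, _, _, _, actual_qty, exp = supply_row.select('td')
        # Build a list (not a map object) so the item name can be inserted
        item = [node.text.strip() for node in (sn, supplier, actual_qty, exp)]
        item[1:1] = [itemdiv.text]
        inventory.append(item)
print(inventory)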
You can use the csv library to write the inventory like so:
import csv
# In Python 3, open the file in text mode with newline='' for the csv module
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(('S/N', 'Item', 'Supplier', 'Actual Qty', 'Expiry Date'))
    writer.writerows(inventory)
