For a project, I'm trying to scrape the macronutrients from this website. The table I'm after is the one called 'Voedingswaarden', and I only want the information marked in red. The problem I found is that the table has no TH elements: the header cells are also TDs, with the same class name 'column' as the value cells. How can I separate those two kinds of TDs so I have one for the label and one for the value for a Pandas DataFrame?
Thanks for any help you can provide.
Just in addition to @Jurakin, who decomposes elements from the tree: you could also select only the elements you need with CSS selectors, so the tree is not affected at all. stripped_strings will extract the label/value text pairs you can build your DataFrame on.
EDIT
As you only want to scrape the red-marked parts, you can use the same method, but you have to call set_index(0) followed by .T to transpose the frame and turn the first column into headers.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.jumbo.com/producten/jumbo-scharrelkip-filet-800g-515026BAK', headers={'User-Agent': 'Mozilla/5.0'}, cookies={'CONSENT': 'YES+'}).text, 'html.parser')

pd.DataFrame(
    (e.stripped_strings for e in soup.select('table tr:not(:has(th,td.sub-label,td[colspan]))')),
).set_index(0).T
Output
  Energie          Vetten Koolhydraten Vezels Eiwitten   Zout
1 kJ 450/kcal 106  0.8 g  0.0 g        0.0 g  24.7 g   0.14 g
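For clarity, here is the set_index(0).T step in isolation on a tiny hand-made frame (just a sketch, not the scraped data):

```python
import pandas as pd

# two label/value pairs, as the row selector would yield them
df = pd.DataFrame([['Energie', 'kJ 450/kcal 106'], ['Vetten', '0.8 g']])

# make column 0 the index, then transpose so the labels become headers
wide = df.set_index(0).T
print(wide)
```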
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.jumbo.com/producten/jumbo-scharrelkip-filet-800g-515026BAK', headers={'User-Agent': 'Mozilla/5.0'}, cookies={'CONSENT': 'YES+'}).text, 'html.parser')

pd.DataFrame(
    (e.stripped_strings for e in soup.select('table tr:not(:has(th,td.sub-label,td[colspan]))')),
    columns=list(soup.select_one('table tr').stripped_strings)
)
Output
  Voedingswaarden         per 100g
0         Energie  kJ 450/kcal 106
1          Vetten            0.8 g
2    Koolhydraten            0.0 g
3          Vezels            0.0 g
4        Eiwitten           24.7 g
5            Zout           0.14 g
You can remove those tr elements that have a child td with the sub-label class or a colspan attribute, then pass the table to pd.read_html to create a data frame.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.jumbo.com/producten/jumbo-scharrelkip-filet-800g-515026BAK"
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5026.0 Safari/537.36 Edg/103.0.1254.0"

# get page source code
page = requests.get(url, headers={"user-agent": user_agent})
page.raise_for_status()
soup = BeautifulSoup(page.content, "html.parser")

# find table
table = soup.find("table")
# check existence of the table
assert table, "no table found"
print(table)

# select garbage:
# - td containing sub-label in class
# - td with colspan attribute
garbage = table.find_all("td", class_="sub-label") \
        + table.find_all("td", colspan=True)
for item in garbage:
    # remove item together with its parent tr
    item.parent.decompose()

# load html into dataframe
df = pd.read_html(str(table))[0]
print(df)
This is the table printed from code:
<table aria-label="Table containing info" class="jum-table striped" data-v-038af5f8="" data-v-e2cf3b44="">
<thead data-v-038af5f8="" data-v-e2cf3b44="">
<tr class="header-row" data-v-038af5f8="" data-v-e2cf3b44="">
<th class="header-column" data-v-038af5f8="" data-v-e2cf3b44="" id="Voedingswaarden"> Voedingswaarden </th>
<th class="header-column" data-v-038af5f8="" data-v-e2cf3b44="" id="per 100g"> per 100g </th>
</tr>
</thead>
<tbody data-v-038af5f8="" data-v-e2cf3b44="">
<tr class="row" data-v-038af5f8="" data-v-e2cf3b44="">
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">Energie</td>
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">kJ 450/kcal 106</td>
</tr>
<tr class="row" data-v-038af5f8="" data-v-e2cf3b44="">
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">Vetten</td>
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">0.8 g</td>
</tr>
<tr class="row" data-v-038af5f8="" data-v-e2cf3b44="">
<!-- contains sub-label class -->
<td class="column sub-label" data-v-038af5f8="" data-v-e2cf3b44="">-waarvan verzadigde vetzuren</td>
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">0.4 g</td>
</tr>
<tr class="row" data-v-038af5f8="" data-v-e2cf3b44="">
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">Koolhydraten</td>
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">0.0 g</td>
</tr>
<tr class="row" data-v-038af5f8="" data-v-e2cf3b44="">
<!-- contains sub-label class -->
<td class="column sub-label" data-v-038af5f8="" data-v-e2cf3b44="">-waarvan suikers</td>
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">0.0 g</td>
</tr>
<tr class="row" data-v-038af5f8="" data-v-e2cf3b44="">
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">Vezels</td>
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">0.0 g</td>
</tr>
<tr class="row" data-v-038af5f8="" data-v-e2cf3b44="">
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">Eiwitten</td>
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">24.7 g</td>
</tr>
<tr class="row" data-v-038af5f8="" data-v-e2cf3b44="">
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">Zout</td>
<td class="column" data-v-038af5f8="" data-v-e2cf3b44="">0.14 g</td>
</tr>
<tr class="row" data-v-038af5f8="" data-v-e2cf3b44="">
<!-- contains colspan attribute -->
<td class="column" colspan="2" data-v-038af5f8="" data-v-e2cf3b44="">*Het zoutgehalte bestaat uit van nature voorkomend natrium.</td>
</tr>
</tbody>
</table>
Output dataframe:
Voedingswaarden per 100g
0 Energie kJ 450/kcal 106
1 Vetten 0.8 g
2 Koolhydraten 0.0 g
3 Vezels 0.0 g
4 Eiwitten 24.7 g
5 Zout 0.14 g
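To see the decompose step in isolation, here it is on a reduced static fragment of the same table (a sketch, not the full page), showing which rows survive:

```python
from bs4 import BeautifulSoup

# reduced stand-in for the Jumbo table: one header row, one keeper,
# one sub-label row and one colspan footnote row
html = """
<table>
<tr><th>Voedingswaarden</th><th>per 100g</th></tr>
<tr><td class="column">Vetten</td><td class="column">0.8 g</td></tr>
<tr><td class="column sub-label">-waarvan verzadigde vetzuren</td><td class="column">0.4 g</td></tr>
<tr><td class="column" colspan="2">*Het zoutgehalte bestaat uit van nature voorkomend natrium.</td></tr>
</table>
"""

table = BeautifulSoup(html, 'html.parser').find('table')
for td in table.find_all('td', class_='sub-label') + table.find_all('td', colspan=True):
    td.parent.decompose()  # remove the whole offending row

rows = [list(tr.stripped_strings) for tr in table.find_all('tr')]
print(rows)
```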
Trying to grab some table data from a website.
Here's a sample of the HTML, which can be found at https://www.madeinalabama.com/warn-list/:
<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing * </td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO. </td>
<td>Opelika </td>
<td> 146 </td>
</tr>
<tr>
<td>Closing * </td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS </td>
<td>Daphne </td>
<td> 72 </td>
</tr>
I'm trying to grab the data corresponding to the 6th td for each table row.
I tried this:
url = 'https://www.madeinalabama.com/warn-list/'
browser = webdriver.Chrome()
browser.get(url)
elements = browser.find_elements_by_xpath("//table/tbody/tr/td[6]").text
and elements comes back as this:
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="7d2f8991-d30b-4bc0-bfa5-4b7e909fb56c")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="ba0cd72d-d105-4f8c-842f-6f20b3c2a9de")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="1ec14439-0732-4417-ac4f-be118d8d1f85")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="d8226534-4fc7-406c-935a-d43d6d777bfb")>]
Desired output is a simple df like this:
Planned # Affected Employees
146
72
.
.
.
Please can someone explain how to do this using Selenium's find_elements_by_xpath? We have ample BeautifulSoup explanations.
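As an aside on the Selenium error in the question: find_elements_by_xpath returns a list of WebElements, and .text only exists on the individual elements, so it has to be applied per element. The pattern, shown with stand-in objects because no browser is available here:

```python
import pandas as pd

class FakeElement:
    """Stand-in for a selenium WebElement; only the .text attribute matters here."""
    def __init__(self, text):
        self.text = text

# in real code: elements = browser.find_elements_by_xpath("//table/tbody/tr/td[6]")
elements = [FakeElement(" 146 "), FakeElement(" 72 ")]

# take .text per element, not on the list itself
values = [el.text.strip() for el in elements]
df = pd.DataFrame({'Planned # Affected Employees': values})
print(df)
```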
You can use the pd.read_html() function:
txt = '''<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing * </td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO. </td>
<td>Opelika </td>
<td> 146 </td>
</tr>
<tr>
<td>Closing * </td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS </td>
<td>Daphne </td>
<td> 72 </td>
</tr>'''
import pandas as pd

df = pd.read_html(txt)[0]
print(df)
Prints:
Closing or Layoff Initial Report Date Planned Starting Date Company City Planned # Affected Employees
0 Closing * 09/17/2019 11/14/2019 FLOWERS BAKING CO. Opelika 146
1 Closing * 08/05/2019 10/01/2019 INFORM DIAGNOSTICS Daphne 72
Then:
print(df['Planned # Affected Employees'])
Prints:
0 146
1 72
Name: Planned # Affected Employees, dtype: int64
EDIT: Solution with BeautifulSoup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')
all_data = []
for tr in soup.select('.warn-data tr:has(td)'):
    *_, last_column = tr.select('td')
    all_data.append(last_column.get_text(strip=True))

df = pd.DataFrame({'Planned': all_data})
print(df)
Prints:
Planned
0 146
1 72
Or:
soup = BeautifulSoup(txt, 'html.parser')
all_data = [td.get_text(strip=True) for td in soup.select('.warn-data tr > td:nth-child(6)')]
df = pd.DataFrame({'Planned': all_data})
print(df)
You could also do td:nth-last-child(1), assuming it's the last child:
soup.select('div.warn-data > table > tbody > tr > td:nth-last-child(1)')
Example
from bs4 import BeautifulSoup
html = """
<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing *</td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO.</td>
<td>Opelika</td>
<td> 146 </td>
</tr>
<tr>
<td>Closing *</td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS</td>
<td>Daphne</td>
<td> 72 </td>
</tr>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div.warn-data > table > tbody > tr > td:nth-last-child(1)')
for index, item in enumerate(elements):
    print(index, item.text)
Using beautifulsoup4 to scrape an HTML table.
I want to display the values from one of the table rows and remove any empty td fields.
The rows in the source being scraped all share the same classes, so is there any way to pull the data from just one row, using data-name="Georgia" from the HTML source below?
Using: beautifulsoup4
Current code
import bs4 as bs
from urllib.request import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'My new User-Agent'  # Set this to a string you want for your user agent

myopener = MyOpener()
sauce = myopener.open('')
soup = bs.BeautifulSoup(sauce, 'lxml')
#table = soup.table
table = soup.find('table')
table_rows = table.find_all_next('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
HTML SOURCE
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Kazakhstan">★</span>
Kazakhstan
</td>
<td class="text--green">
81
</td>
<td class="text--green">
9
</td>
<td class="text--green">
12.5
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
0
</td>
<td class="text--yellow">
0
</td>
</tr>
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Georgia">★</span>
Georgia
</td>
<td class="text--green">
75
</td>
<td class="text--green">
0
</td>
<td class="text--green">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
10
</td>
<td class="text--yellow">
1
</td>
</tr>
Are you talking about something like:
tr.find_all('td', {'data-name': True})
That should find any td that has a data-name attribute. I could be reading your question all wrong, though.
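If the goal is specifically the Georgia row, one option is to find the span by its data-name and walk up to the enclosing tr. A sketch against a reduced version of the HTML from the question:

```python
from bs4 import BeautifulSoup

# reduced fragment: note the data-name sits on the span, not the td
html = """
<table>
<tr>
<td class="text--gray"><span class="save-button" data-name="Kazakhstan">★</span> Kazakhstan</td>
<td class="text--green">81</td>
</tr>
<tr>
<td class="text--gray"><span class="save-button" data-name="Georgia">★</span> Georgia</td>
<td class="text--green">75</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
span = soup.find('span', attrs={'data-name': 'Georgia'})  # locate the marker span
row = span.find_parent('tr')                              # climb to its row
values = [td.get_text(strip=True) for td in row.find_all('td')]
print(values)
```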
HTML-page structure:
<table>
<tbody>
<tr>
<th>Timestamp</th>
<th>Call</th>
<th>MHz</th>
<th>SNR</th>
<th>Drift</th>
<th>Grid</th>
<th>Pwr</th>
<th>Reporter</th>
<th>RGrid</th>
<th>km</th>
<th>az</th>
</tr>
<tr>
<td align="right"> 2019-12-10 14:02 </td>
<td align="left"> DL1DUZ </td>
<td align="right"> 10.140271 </td>
<td align="right"> -26 </td>
<td align="right"> 0 </td>
<td align="left"> JO61tb </td>
<td align="right"> 0.2 </td>
<td align="left"> F4DWV </td>
<td align="left"> IN98bc </td>
<td align="right"> 1162 </td>
<td align="right"> 260 </td>
</tr>
<tr>
<td align="right"> 2019-10-10 14:02 </td>
<td align="left"> DL23UH </td>
<td align="right"> 11.0021 </td>
<td align="right"> -20 </td>
<td align="right"> 0 </td>
<td align="left"> JO61tb </td>
<td align="right"> 0.2 </td>
<td align="left"> F4DWV </td>
<td align="left"> IN98bc </td>
<td align="right"> 1162 </td>
<td align="right"> 260 </td>
</tr>
</tbody>
</table>
and so on tr-td...
My code:
from bs4 import BeautifulSoup as bs
import requests
import csv

base_url = 'some_url'

session = requests.Session()
request = session.get(base_url)

val_th = []
val_td = []

if request.status_code == 200:
    soup = bs(request.content, 'html.parser')
    table = soup.findChildren('table')
    tr = soup.findChildren('tr')
    my_table = table[0]
    my_tr_th = tr[0]
    my_tr_td = tr[1]
    rows = my_table.findChildren('tr')
    row_th = my_tr_th.findChildren('th')
    row_td = my_tr_td.findChildren('td')
    for r_th in row_th:
        heading = r_th.text
        val_th.append(heading)
    for r_td in row_td:
        data = r_td.text
        val_td.append(data)

with open('output.csv', 'w') as f:
    a_pen = csv.writer(f)
    a_pen.writerow(val_th)
    a_pen.writerow(val_td)
1) My code only writes one row of td values. How do I make sure all the td rows on the page end up in the csv?
2) There are many td tags on the page.
3) If, instead of my_tr_td = tr[1], I write my_tr_td = tr[1:50], it's an error.
How do I write the data from all the tr/td rows to a csv file?
Thanks in advance.
Let's try it this way:
import lxml.html
import csv
import requests

url = "http://wsprnet.org/drupal/wsprnet/spots"
res = requests.get(url)
doc = lxml.html.fromstring(res.text)

cols = []
# first, we need to extract the column headers, stuck all the way at the top,
# with the first one in a particular location and format
cols.append(doc.xpath('//table/tr/node()/text()')[0])
for item in doc.xpath('//table/tr/th'):
    typ = str(type(item.getnext()))
    if not 'NoneType' in typ:
        cols.append(item.getnext().text)

# now for the actual data
inf = []
for item in doc.xpath('//table//tr//td'):
    inf.append(item.text.replace('\xa02', '').strip())  # text info needs to be cleaned

# this will take all the data and split it into rows for each column
rows = [inf[x:x+len(cols)] for x in range(0, len(inf), len(cols))]

# finally, write to file:
with open("output.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(cols)
    for l in rows:
        writer.writerow(l)
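Sticking with BeautifulSoup instead of lxml, the same all-rows-to-csv idea can be sketched against a static fragment shaped like the table in the question (the StringIO buffer stands in for output.csv):

```python
from bs4 import BeautifulSoup
import csv
import io

# static stand-in for the live page, with the question's row shape
html = """
<table>
<tr><th>Timestamp</th><th>Call</th><th>MHz</th></tr>
<tr><td> 2019-12-10 14:02 </td><td> DL1DUZ </td><td> 10.140271 </td></tr>
<tr><td> 2019-10-10 14:02 </td><td> DL23UH </td><td> 11.0021 </td></tr>
</table>
"""

table = BeautifulSoup(html, 'html.parser').find('table')
header = [th.get_text(strip=True) for th in table.find_all('th')]
# one list of cell texts per tr that actually has td children
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr') if tr.find('td')]

buf = io.StringIO()  # swap for open('output.csv', 'w', newline='') to write a real file
writer = csv.writer(buf)
writer.writerow(header)
writer.writerows(rows)
print(buf.getvalue())
```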
I'm trying to parse rows within a table (the departure board times) from the following:
buscms_widget_departureboard_ui_displayStop_Callback("
<div class='\"livetimes\"'>
<table class='\"busexpress-clientwidgets-departures-departureboard\"'>
<thead>
<tr class='\"rowStopName\"'>
<th colspan='\"3\"' data-bearing='\"SW\"' data-lat='\"51.7505683898926\"' data-lng='\"-1.225102186203\"' title='\"oxfajmwg\"'>
Divinity Road
</th>
<tr>
<tr class='\"textHeader\"'>
<th colspan='\"3\"'>
text 69325694 to 84637 for live times
</th>
<tr>
<tr class='\"rowHeaders\"'>
<th>
service
</th>
<th>
destination
</th>
<th>
time
</th>
<tr>
</tr>
</tr>
</tr>
</tr>
</tr>
</tr>
</thead>
<tbody>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 21:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"5'>
5 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 22:11:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"27'>
27 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4 (OBC)
</td>
<td class='\"colDestination\"' title='\"Abingdon\"'>
Abingdon
</td>
<td 22:29:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"22:29\"'>
22:29
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 22:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"65'>
65 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 23:09:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"23:09\"'>
23:09
</td>
</tr>
</tbody>
</table>
</div>
<div class='\"scrollmessage_container\"'>
<div class='\"scrollmessage\"'>
</div>
</div>
<div class='\"services\"'>
<a class='\"service' href='\"#\"' onclick="\"serviceNameClick('');\"" selected\"="">
all
</a>
<a class='\"service\"' href='\"#\"' onclick="\"serviceNameClick('4');\"">
4
</a>
</div>
<div class="dptime">
<span>
times generated at:
</span>
<span>
21:43
</span>
</div>
");
In particular, I'm trying to extract all the departure times - so I'd like to capture the minutes from departure - for example 12 minutes away.
I have the following code:
# import libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page)
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
I'm not sure how to find the minutes from departure from the above? Is it something like:
minutes_from_depart = soup.find("tbody", attrs={'td': 'mins'})
Could you try this?
import urllib.request
from bs4 import BeautifulSoup
import re

quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
page = urllib.request.urlopen(quote_page).read()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())

minutes = soup.find_all("td", class_=re.compile(r"colDepartureTime"))
for elements in minutes:
    print(elements.getText())
So I got to my answer with the following code - which was actually quite easy once I had played around with the soup.find_all function:
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page)
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('td', class_='\\"colDepartureTime\\"'):
    print(link.get_text())
I get the following output:
10:40
10 mins
21 mins
30 mins
40 mins
50 mins
60 mins
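The odd-looking class value works because the JSONP payload escapes its quotes, so the parsed attribute literally contains backslash-quote characters. A minimal reproduction (sketch):

```python
from bs4 import BeautifulSoup

# the attribute value literally contains \" because the payload escapes its quotes
html = r'''<td class='\"colDepartureTime\"'>5 mins</td>'''
soup = BeautifulSoup(html, 'html.parser')
cells = soup.find_all('td', class_='\\"colDepartureTime\\"')
print([c.get_text() for c in cells])
```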
I want to use BeautifulSoup to pick up the 'Model Type' values on a company's webpages, which come from code like the below:
it forms 2 tables shown on the webpage, side by side.
updated source code of the webpage
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using the following, however it doesn't get the 'VIP QB662FG' wanted:
from bs4 import BeautifulSoup
import urllib2
import re

url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
in what way I can get it? I'm using Python 2.7.
You are looking for the next row, then the cell in the same position in that row. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
value = header_text.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
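A self-contained version of that one-step select, run against the fragment from the question (wrapped in a table tag, which the fragment itself omits, purely so it parses):

```python
from bs4 import BeautifulSoup

html = """
<table class="tableforms">
<tr class="tableheader">
<td width="12%"> </td>
<td style="TEXT-ALIGN: left" width="12%">Group </td>
<td style="TEXT-ALIGN: left" width="15%">Model Type </td>
<td style="TEXT-ALIGN: left" width="15%">Design Year </td></tr>
<tr class="row1">
<td width="10%"> </td>
<td class="row1">South West</td>
<td>VIP QB662FG (Registered) </td>
<td>2013 (Registered) </td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
cleaned = value.replace('(Registered)', '').strip()
print(cleaned)
```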
Find all tr elements and output each one's third child, unless it's the first row:
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data, 'html.parser')
# the fragment has no <table> tag, so iterate over the tr elements directly
rows = soup.find_all('tr')
for tr in rows[1:]:  # skip the header row
    cells = tr.find_all('td')
    if len(cells) > 2:
        print cells[2].get_text().replace('(Registered)', '').strip()
I think you can do it as follows:
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
    dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print dico['Model Type']
It prints: u'VIP QB662FG'