Beautiful Soup scrape for "Worldwide" - python

I'm trying to scrape some Box Office Mojo pages for worldwide box office gross figures using Beautiful Soup. My code below grabs the Domestic figures just fine, but it won't work when I sub in "Worldwide" for "Domestic Total Gross." Maybe because "Worldwide" shows up on the page more than once or something.
Any help on fixing it? I'll paste the source code for the two portions as well. Thanks!
Source code below
<center><table border="0" border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcdc" width="95%"><tr bgcolor="#ffffff"><td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$172,825,435</b></font></td></tr><tr bgcolor="#ffffff"><td valign="top">Distributor: <b>MGM</b></td><td valign="top">Release Date: <b><nobr>December 16, 1988</nobr></b></td></tr><tr bgcolor="#ffffff"><td valign="top">Genre: <b>Drama</b></td><td valign="top">Runtime: <b>2 hrs. 13 min.</b></td></tr><tr bgcolor="#ffffff"><td valign="top">MPAA Rating: <b>R</b></td><td valign="top">Production Budget: <b>$25 million</b></td></tr></table> </td>
...skip...
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td width="35%" align="right"> <b>$354,825,435</b></td>
<td width="25%"> </td>
</tr>
Python code below
BOG_titles = ['=RainMan.htm']

def get_movie_value(soup, field_name):
    obj = soup.find(text=re.compile(field_name))
    if not obj:
        return "Nothing"
    next_sibling = obj.findNextSibling()
    if next_sibling:
        return next_sibling.text
    else:
        return "Still Nothing"

BOG_data = []
for x in BOG_titles:
    y = 'http://www.boxofficemojo.com/movies/?id' + x
    page = urllib2.urlopen(y)
    soup = BeautifulSoup(page)
    m = get_movie_value(soup, "Worldwide")
    title_string = soup.find('title').text
    title = title_string.split('(')[0].strip()
    BOG_data.append([title, m])

Use the table inside the div.mp_box structure to get what you want:
In [1]: from bs4 import BeautifulSoup
In [2]: import requests
In [3]: r = requests.get("http://www.boxofficemojo.com/movies/?id=rainman.htm").content
In [4]: soup = BeautifulSoup(r,"lxml")
In [5]: table = soup.select_one("div.mp_box table")
In [6]: print(table)
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$172,825,435</b></td>
<td align="right" width="25%">   <b>48.7%</b></td>
</tr>
<tr>
<td width="40%">+ Foreign:</td>
<td align="right" width="35%"> $182,000,000</td>
<td align="right" width="25%">   51.3%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$354,825,435</b></td>
<td width="25%"> </td>
</tr>
</table>
In [7]: rows = table.select("tr")
In [8]: rows[0].select_one("td + td").text
Out[8]: u'\xa0$172,825,435'
In [9]: rows[1].select_one("td + td").text
Out[9]: u'\xa0$182,000,000'
In [10]: rows[-1].select_one("td + td").text
Out[10]: u'\xa0$354,825,435'
To use the text without specifying the row:
In [27]: soup = BeautifulSoup(r,"lxml")
In [28]: table = soup.select_one("div.mp_box table")
In [29]: print(table.find("b", text="Domestic:").find_next("td").text)
 $172,825,435
In [30]: print(table.find("b", text="Worldwide:").find_next("td").text)
 $354,825,435
In [31]: print(table.find("td", text="+ Foreign:").find_next("td").text)
 $182,000,000
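The extracted values carry a leading non-breaking space (u'\xa0') plus currency formatting. If you need the figures as numbers, a small helper (hypothetical, not part of the answer above) can normalize them to integers:

```python
def parse_money(text):
    """Turn a scraped figure like '\xa0$354,825,435' into an int."""
    cleaned = text.replace("\xa0", "").replace("$", "").replace(",", "")
    return int(cleaned.strip())

print(parse_money("\xa0$354,825,435"))  # -> 354825435
```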


grabbing child from html table

Trying to grab some table data from a website.
Here's a sample of the HTML, which can be found at https://www.madeinalabama.com/warn-list/:
<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing * </td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO. </td>
<td>Opelika </td>
<td> 146 </td>
</tr>
<tr>
<td>Closing * </td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS </td>
<td>Daphne </td>
<td> 72 </td>
</tr>
I'm trying to grab the data corresponding to the 6th td for each table row.
I tried this:
url = 'https://www.madeinalabama.com/warn-list/'
browser = webdriver.Chrome()
browser.get(url)
elements = browser.find_elements_by_xpath("//table/tbody/tr/td[6]")
and elements comes back as a list of raw WebElements:
[<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="7d2f8991-d30b-4bc0-bfa5-4b7e909fb56c")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="ba0cd72d-d105-4f8c-842f-6f20b3c2a9de")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="1ec14439-0732-4417-ac4f-be118d8d1f85")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="d8226534-4fc7-406c-935a-d43d6d777bfb")>]
Desired output is a simple df like this:
Planned # Affected Employees
146
72
.
.
.
Could someone please explain how to do this using Selenium's find_elements_by_xpath? There are already ample BeautifulSoup explanations.
You can use the pd.read_html() function:
import pandas as pd
txt = '''<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing * </td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO. </td>
<td>Opelika </td>
<td> 146 </td>
</tr>
<tr>
<td>Closing * </td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS </td>
<td>Daphne </td>
<td> 72 </td>
</tr>'''
df = pd.read_html(txt)[0]
print(df)
Prints:
Closing or Layoff Initial Report Date Planned Starting Date Company City Planned # Affected Employees
0 Closing * 09/17/2019 11/14/2019 FLOWERS BAKING CO. Opelika 146
1 Closing * 08/05/2019 10/01/2019 INFORM DIAGNOSTICS Daphne 72
Then:
print(df['Planned # Affected Employees'])
Prints:
0 146
1 72
Name: Planned # Affected Employees, dtype: int64
EDIT: Solution with BeautifulSoup:
soup = BeautifulSoup(txt, 'html.parser')
all_data = []
for tr in soup.select('.warn-data tr:has(td)'):
    *_, last_column = tr.select('td')
    all_data.append(last_column.get_text(strip=True))
df = pd.DataFrame({'Planned': all_data})
print(df)
Prints:
Planned
0 146
1 72
Or:
soup = BeautifulSoup(txt, 'html.parser')
all_data = [td.get_text(strip=True) for td in soup.select('.warn-data tr > td:nth-child(6)')]
df = pd.DataFrame({'Planned': all_data})
print(df)
You could also do td:nth-last-child(1), assuming it's the last child:
soup.select('div.warn-data > table > tbody > tr > td:nth-last-child(1)')
Example
from bs4 import BeautifulSoup
html = """
<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing *</td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO.</td>
<td>Opelika</td>
<td> 146 </td>
</tr>
<tr>
<td>Closing *</td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS</td>
<td>Daphne</td>
<td> 72 </td>
</tr>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div.warn-data > table > tbody > tr > td:nth-last-child(1)')
for index, item in enumerate(elements):
    print(index, item.text)
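The cell texts printed above come back padded with spaces (e.g. ' 146 '). Python's int() tolerates surrounding whitespace, so converting the scraped strings to numbers is direct; a small sketch with the values from the rows above:

```python
# Cell texts as scraped from the last column of each row
scraped = [" 146 ", " 72 "]

# int() ignores leading/trailing whitespace, so no explicit strip() is needed
counts = [int(s) for s in scraped]
print(counts)  # -> [146, 72]
```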

How to read the data from HTML file and write the data to CSV file using python?

I have an .html report file that contains data in tables along with pass/fail criteria, and I want this data written to a .csv file using Python 3.
How should I proceed?
For example, the data will be like this:
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
Assuming you know the header and really only need the associated percentage, with bs4 4.7.1+ you can use :contains to target the header and then take the next td. In practice you would read your HTML from file into the html variable shown.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
'''
soup = bs(html, 'lxml') # 'html.parser' if lxml not installed
header = 'Test Sequence Work Progress'
result = soup.select_one('td:contains("' + header + '") + td').text
df = pd.DataFrame([result], columns = [header])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )
import csv
from bs4 import BeautifulSoup
out = open('out.csv', 'w', encoding='utf-8')
path="my.html" #add the path of your local file here
soup = BeautifulSoup(open(path), 'html.parser')
for link in soup.find_all('p'):  # add the tag you want to extract
    a = link.get_text()
    out.write(a)
    out.write('\n')
out.close()
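A note on the CSV side: writing raw get_text() output line by line, as above, gives no quoting or delimiter handling. Once rows are extracted as lists of strings, the stdlib csv module takes care of that; a minimal sketch with illustrative row values (not taken from the actual report):

```python
import csv
import io

# Rows as they might come back from a table scrape, e.g.
# [[td.get_text(strip=True) for td in tr.find_all('td')] for tr in trs]
rows = [
    ["Metric", "Percentage", "Total"],
    ["Test Sequence Work Progress", "100.0%", "24"],
]

buf = io.StringIO()  # stand-in for open('out.csv', 'w', newline='')
writer = csv.writer(buf)
writer.writerows(rows)  # quotes cells containing commas/newlines automatically

print(buf.getvalue())
```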

Using BeautifulSoup to pick up text in table, on webpages

I want to use BeautifulSoup to pick up the ‘Model Type’ values on a company’s webpages, which come from code like the below.
It forms 2 tables shown on the webpage, side by side.
Updated source code of the webpage:
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using the following, however it doesn’t get the ‘VIP QB662FG’ value wanted:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
in what way I can get it? I'm using Python 2.7.
You are looking for the next row, then the next cell in the same position. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
header_cell = header_text.parent
value = header_cell.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
Find all tr elements and output each one's third child, skipping the first (header) row:
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data)
# the snippet has no enclosing <table> tag, so locate the header row's parent
table = soup.find('tr', {'class': 'tableheader'}).parent
for i, tr in enumerate(table.findChildren('tr')):
    if i > 0:  # skip the header row
        for idx, td in enumerate(tr.findChildren('td')):
            if idx == 2:
                print td.get_text().replace('(Registered)', '').strip()
I think you can do it as follows:
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
    dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print dico['Model Type']
It prints : u'VIP QB662FG'
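The index loop above pairs header cells with value cells by position; the same pairing reads more cleanly with zip. A minimal sketch using plain strings standing in for the getText() results:

```python
# Cell texts as extracted from the header <tr> and the value <tr>
headers = ["", "Group", "Model Type", "Design Year"]
values = ["", "South West", "VIP QB662FG (Registered)", "2013 (Registered)"]

# Pair each header with its value and strip the "(Registered)" suffix
dico = {
    h.strip(): v.replace("(Registered)", "").strip()
    for h, v in zip(headers, values)
}

print(dico["Model Type"])  # -> VIP QB662FG
```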

Cant extract tables from a html code

I am working to parse an HTML table given below (it's a section of the complete HTML code), but the code is not working. Can someone please help me? There is an error saying "table has no attribute findall".
The code is:
import re
import HTMLParser
from urllib2 import urlopen
import urllib2
from bs4 import BeautifulSoup
url = 'http://164.100.47.132/LssNew/Members/Biography.aspx?mpsno=4064'
url_data = urlopen(url).read()
html_page = urllib2.urlopen(url)
soup = BeautifulSoup(html_page)
title = soup.title
final_tit = title.string
table = soup.find('table',id = "ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1")
tr = table.findall('tr')
for tr in table:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text + "|",
    print
<table style="WIDTH: 565px">
<tr>
<td vAlign="top" align="left"><img id="ctl00_ContPlaceHolderMain_Bioprofile1_Image1" src="http://164.100.47.132/mpimage/photo/4064.jpg" style="height:140px;border-width:0px;" /></td>
<td vAlign="top"><table cellspacing="0" rules="all" border="2" id="ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1" style="border-color:#FAE3C3;border-width:2px;border-style:Solid;width:433px;border-collapse:collapse;">
<tr>
<td>
<table align="center" height="30px">
<tr valign="top">
<td align="center" valign="top" class="gridheader1">Aaroon Rasheed,Shri J.M.</td>
</tr>
</table>
<table height="110px">
<tr>
<td align="left" class="darkerb" width="133px" valign="top">Constituency :</td>
<td align="left" valign="top" class="griditem2" width="300px">Theni (Tamil Nadu )</td>
</tr>
<tr>
<td align="left" width="133px" class="darkerb" valign="top">
Party Name :</td>
<td align="left" width="300px" valign="top" class="griditem2">Indian National Congress(INC)</td>
</tr>
<tr>
<td align="left" class="darkerb" valign="top" width="133px">
Email Address :
</td>
<td align="left" valign="top" class="griditem2" width="300px">jm.aaronrasheed#sansad.nic.in</td>
</tr>
</table>
</td>
</tr>
</table></td>
</tr>
</table>
The method is called find_all(), not findall (BeautifulSoup 4 also keeps the camelCase findAll as a legacy alias):
tr = table.find_all('tr')

Extract a row in html file without the html tags

I need to extract a row containing a specific string, but my code below returns the HTML tags along with it.
from BeautifulSoup import BeautifulSoup
import re
import os
import codecs
import sys
get_company = "ABB LTD"
OUTFILE = os.path.join('company', 'a', 'viewids')
soup = BeautifulSoup(open("/company/a/searches/a"))
rows = soup.findAll("table",{"id":"cos"})[0].findAll('tr')
userrows = [t for t in rows if t.findAll(text=re.compile(get_company))]
print userrows
This is my table format
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
<tr>
<th>Company Name</th>
<th>CIK Number</th>
<th>SIC Code</th>
</tr>
<tr valign="top">
<td>A CONSULTING TEAM INC</td>
<td align="right">1040792</td>
<td align="right">7380</td>
</tr>
<tr valign="top">
<td>A J&J PHARMA CORP</td>
<td align="right">1140452</td>
<td align="right">9995</td>
</tr>
</table>
So if I need A J&J PHARMA CORP's CIK number, how do I get it? Right now it gives me output like this:
[<tr valign="top">
<td>A J&J PHARMA CORP</td>
<td align="right">1140452</td>
<td align="right">9995</td>
</tr>]
import re
from BeautifulSoup import BeautifulSoup
html= '''
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
<tr>
<th>Company Name</th>
<th>CIK Number</th>
<th>SIC Code</th>
</tr>
<tr valign="top">
<td>A CONSULTING TEAM INC</td>
<td align="right">1040792</td>
<td align="right">7380</td>
</tr>
<tr valign="top">
<td>A J&J PHARMA CORP</td>
<td align="right">1140452</td>
<td align="right">9995</td>
</tr>
</table>
'''
soup = BeautifulSoup(html)
table = soup.find("table", {"id":"cos"})
td = table.find('td', text='A J&J PHARMA CORP')
# ^ This returns the text node, not the td.
print(td.parent.parent.findAll('td')[1].string)
prints
1140452
