How to extract elements from html with BeautifulSoup - python

I am beginning to learn Python and would like to use BeautifulSoup to extract the elements from the HTML below.
This HTML is taken from a voice recording system that logs the date and time in local time and UTC, the call duration, the called number and name, the calling number and name, etc.
There are usually hundreds of these entries.
What I am attempting to do is extract the elements and print each entry on one line in comma-delimited format, in order to compare them with call detail records from call manager. This will help to verify that all calls were recorded and none were missed.
I believe BeautifulSoup is the right tool to do this.
Could someone point me in the right direction?
<tbody>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>16:24:47</td>
<td class="formRowLight" >24/10/16 07:24:47</td>
<td class="formRowLight" >00:45</td>
<td class="formRowLight" >31301</td>
<td class="formRowLight" >Joe Smith</td>
<td class="formRowLight" >31111</td>
<td class="formRowLight" >Jane Doe</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >1432875648934</td>
<td align="center" class"formRowLight"> </td>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>17:33:02</td>
<td class="formRowLight" >24/10/16 08:33:02</td>
<td class="formRowLight" >00:58</td>
<td class="formRowLight" >35664</td>
<td class="formRowLight" >Billy Bob</td>
<td class="formRowLight" >227045665</td>
<td class="formRowLight" >James Dean</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >9934959586849</td>
<td align="center" class"formRowLight"> </td>
</tr>
</tbody>

pandas.read_html() would make things much easier: it converts the tabular data from the HTML table into a DataFrame which, if needed, you can later dump to CSV.
Here is some sample code to get you started:
from io import StringIO
import pandas as pd
data = """
<table>
<thead>
<tr>
<th>Date</th>
<th>Name</th>
<th>ID</th>
</tr>
</thead>
<tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">Joe Smith</td>
<td class="formRowLight">1432875648934</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">Billy Bob</td>
<td class="formRowLight">9934959586849</td>
</tr>
</tbody>
</table>"""
# wrap the string in StringIO: newer pandas versions warn when given literal HTML
df = pd.read_html(StringIO(data))[0]
print(df.to_csv(index=False))
Prints:
Date,Name,ID
24/10/1616:24:47,Joe Smith,1432875648934
24/10/1617:33:02,Billy Bob,9934959586849
FYI, read_html() relies on an HTML parser under the hood: lxml by default, or BeautifulSoup (with html5lib) as a fallback.

import requests
from bs4 import BeautifulSoup

url = "your url"  # placeholder: put the address of the page here
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
mylist = []
rows = soup.find_all('tr', {"class": "formRowLight"})
for row in rows:
    # find_next returns only the first matching <td> after each <tr>
    text = row.find_next('td', {"class": "formRowLight"}).text
    mylist.append(text)
print(mylist)
But you need to edit this code a little to prevent any duplicated content.

Yes, BeautifulSoup is a good tool to reach for in this problem. Something to get you started would be as follows:
from bs4 import BeautifulSoup
with open("my_log.html") as log_file:
    html = log_file.read()
soup = BeautifulSoup(html)
# normally you specify a parser too, `(html, 'lxml')` for example
# without specifying a parser, it will warn you and select one automatically
table_rows = soup.find_all("tr")  # get list of all <tr> tags
for row in table_rows:
    table_cells = row.find_all("td")  # get list of all <td> tags in row
    joined_text = ",".join(cell.get_text() for cell in table_cells)
    print(joined_text)
However, pandas's read_html may make this a bit more seamless, as mentioned in another answer to this question. Arguably pandas may be a better hammer to hit this nail with, but learning to use BeautifulSoup for this will also give you the skills to scrape all kinds of HTML in the future.
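As a rough sketch of where the BeautifulSoup approach can go next, the version below collects the cells of each row and writes them with the csv module instead of joining by hand. It is not the only way to do it. The sample rows are trimmed to four columns from the question, `table_to_rows` is a made-up helper name, and `html.parser` is chosen only because it ships with Python.

```python
import csv
import io

from bs4 import BeautifulSoup

# Two rows shaped like the ones in the question (trimmed to four columns).
html = """
<table><tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">00:45</td>
<td class="formRowLight">31301</td>
<td class="formRowLight">Joe Smith</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">00:58</td>
<td class="formRowLight">35664</td>
<td class="formRowLight">Billy Bob</td>
</tr>
</tbody></table>
"""


def table_to_rows(markup):
    """Return one list of cell strings per <tr> that contains <td> cells."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # separator=" " keeps date and time apart where the cell contains a <br>
        cells = [td.get_text(separator=" ", strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


rows = table_to_rows(html)

# csv.writer quotes any field that itself contains a comma
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `newline=''` to `csv.writer` in place of the `StringIO` buffer.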

First, get a list of HTML strings. To do that, follow the question "Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag elements".
Then run the following over that list to fetch the value of each element you want:
for element in html_list:
    output = soup.select(element)[0].text
    print("%s ," % output)
This will give you what you want.
Hope that helps!

Related

Beautiful soup: why not printing inside the for-loop in my code?

from bs4 import BeautifulSoup
import numpy as np
import requests
from selenium import webdriver
from nltk.tokenize import sent_tokenize,word_tokenize
html = webdriver.Firefox(executable_path=r'D:\geckodriver.exe')
html.get("https://www.tsa.gov/coronavirus/passenger-throughput")
def TSA_travel_numbers(html):
    print('NASEEF')
    soup = BeautifulSoup(html,'lxml')
    print('naseef2')
    for i,rows in enumerate(soup.find_all('tr',class_='view-content')):
        print('naseef3')
        for texts in soup.find('td',header = 'view-field-2021-throughput-table-column'):
            print('naseef4')
            number = texts.text
            if number is None:
                continue
            print('Naseef')
TSA_travel_numbers(html.page_source)
As you can see, NASEEF and naseef2 get printed to the console, but naseef3 and naseef4 do not. The code raises no error and runs fine, so I don't know what is happening here. Can anyone point out what is going on?
In other words, execution never enters the for loops in that function.
Please help, and thanks in advance!
Your page does not contain <tr> tags with a class of view-content, so find_all is correctly returning no results. If you remove the class restriction, you get many results:
>>> soup.find_all('tr', limit=2)
[<tr>
<th class="views-align-center views-field views-field-field-today-date views-align-center" id="view-field-today-date-table-column" scope="col">Date</th>
<th class="views-align-center views-field views-field-field-2021-throughput views-align-center" id="view-field-2021-throughput-table-column" scope="col">2021 Traveler Throughput </th>
<th class="views-align-center views-field views-field-field-2020-throughput views-align-center" id="view-field-2020-throughput-table-column" scope="col">2020 Traveler Throughput </th>
<th class="views-align-center views-field views-field-field-2019-throughput views-align-center" id="view-field-2019-throughput-table-column" scope="col">2019 Traveler Throughput </th>
</tr>, <tr>
<td class="views-field views-field-field-today-date views-align-center" headers="view-field-today-date-table-column">5/9/2021 </td>
<td class="views-field views-field-field-2021-throughput views-align-center" headers="view-field-2021-throughput-table-column">1,707,805 </td>
<td class="views-field views-field-field-2020-throughput views-align-center" headers="view-field-2020-throughput-table-column">200,815 </td>
<td class="views-field views-field-field-2019-throughput views-align-center" headers="view-field-2019-throughput-table-column">2,419,114 </td>
</tr>]
Once you change that, the inner loop is looking for <td> tags with a header of view-field-2021-throughput-table-column. There are no such tags in the page either, but there are those which have a headers field with that name.
This line is also wrong:
number = texts.text
...because texts is a NavigableString and does not have the text attribute.
Additionally, the word naseef is not really clear as to what it means, so it's better to replace that with more descriptive strings. Finally, you don't really need the Selenium connection or the tokenizer, so for the purposes of this example we can leave those out. The resulting code looks like this:
from bs4 import BeautifulSoup
import numpy as np
import requests
html = requests.get("https://www.tsa.gov/coronavirus/passenger-throughput").text
def TSA_travel_numbers(html):
    print('Entering parsing function')
    soup = BeautifulSoup(html,'lxml')
    print('Parsed HTML to soup')
    for i,rows in enumerate(soup.find_all('tr')):
        print('Found <tr> tag number', i)
        for texts in soup.find('td',headers = 'view-field-2021-throughput-table-column'):
            print('found <td> tag with headers')
            number = texts
            if number is None:
                continue
            print('Value is', number)
TSA_travel_numbers(html)
Its output looks like:
Entering parsing function
Parsed HTML to soup
Found <tr> tag number 0
found <td> tag with headers
Value is 1,707,805
Found <tr> tag number 1
found <td> tag with headers
Value is 1,707,805
Found <tr> tag number 2
found <td> tag with headers
...
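Note that the output repeats 1,707,805 for every row. That is because the inner loop calls soup.find, which always returns the first match in the whole document, rather than searching inside the current row. A possible variation, assuming you want one value per row, is to call find on each row instead. The snippet below uses the header row and the one data row shown earlier (two columns kept, class lists shortened); `throughput_by_row` is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Header row plus the single data row from the find_all output above.
html = """
<table>
<tr>
<th class="views-field" id="view-field-today-date-table-column" scope="col">Date</th>
<th class="views-field" id="view-field-2021-throughput-table-column" scope="col">2021 Traveler Throughput </th>
</tr>
<tr>
<td class="views-field" headers="view-field-today-date-table-column">5/9/2021 </td>
<td class="views-field" headers="view-field-2021-throughput-table-column">1,707,805 </td>
</tr>
</table>
"""


def throughput_by_row(markup, column="view-field-2021-throughput-table-column"):
    """Collect the text of the <td> whose headers attribute names `column`, row by row."""
    soup = BeautifulSoup(markup, "html.parser")
    values = []
    for row in soup.find_all("tr"):
        # search inside this row only, not the whole document
        cell = row.find("td", headers=column)
        if cell is None:
            # the header row holds <th> cells, so there is nothing to read here
            continue
        values.append(cell.get_text(strip=True))
    return values


values = throughput_by_row(html)
print(values)
```

On the live page this should yield one entry per data row, though that depends on the page still using the same headers values.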

Structuring a table using Scrapy Data

I have a website that contains tables (trs and tds). I want to create a structured CSV file from the table data. I'm trying to create field names from the scraped table as those field names can change depending upon the month or selections.
While I have been successful at iterating through the table and actually scraping the data I want to use as my field names I have yet to figure out how to yield that data into the CSV file.
Right now I have them scraped into an Item named "h1header" and when yielded to a CSV file they appear as rows under that item key "h1header" so:
Project Owning Org
Project Date Range
Fee Factor
Project Organization
Project Manager
Fee Calculation Method
Project Code
Project Lead
Status
Project Title
Total Project Value
Condition
External System Code
Funded Value
Billing Type
What I would ultimately like is the following:
Project Owning Org, Project Date Range, Fee Factor, Project Organization ...etc
so instead of rows they are columns and then I can populate the multiple tables on the page that are formatted with the same h1header with the data as field values of those columns.
Below is an example of the html that I'm scraping. This particular tbody.h1 repeats multiple times on the page depending on the results.
<table class="report">
<tbody class="h1"><tr><td colspan="22">
<table class="report" >
<tbody class="h1">
<tr>
<td class="label">Project Owning Organization:</td><td>1.02.10</td>
<td class="label">Project Date Range:</td><td>8/12/2020 - 8/11/2021</td>
<td class="label">Fee Factor:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Organization:</td><td>1.2.26.1</td>
<td class="label">Project Manager:</td><td>Smith, John</td>
<td class="label">Fee Calculation Method:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Code:</td><td>PROJECT.001</td>
<td class="label">Project Lead:</td><td>Doe, Jane</td>
<td class="label">Status:</td><td>Backlog</td>
</tr>
<tr>
<td class="label">Project Title:</td><td>Scrapy Project</td>
<td class="label">Total Project Value:</td><td>1,438.00</td>
<td class="label">Condition:</td><td>Green<img src="/images/status_green.png" alt="Green"
title="Green"></td>
</tr>
<tr>
<td class="label">External System Code:</td><td>—</td>
<td class="label">Funded Value:</td><td>1,438.00</td>
<td class="label">Billing Type:</td><td>FP</td>
</tr>
</tbody>
There are other tables within this html (tbody.h1 and tbody.detail) where I will then need to append columns to the above.
I've done this in Java using Beautiful Soup by creating and writing to arrays then ultimately exporting those built arrays as csv files. Python Scrapy is FAR easier to get the data than Java was and I'm sure I'm over complicating this but am stuck trying to figure it out so any guidance would be appreciated!
Try this.
from simplified_scrapy import SimplifiedDoc, req, utils
html = '''
<table class="report">
<tbody class="h1"><tr><td colspan="22">
<table class="report" >
<tbody class="h1">
<tr>
<td class="label">Project Owning Organization:</td><td>1.02.10</td>
<td class="label">Project Date Range:</td><td>8/12/2020 - 8/11/2021</td>
<td class="label">Fee Factor:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Organization:</td><td>1.2.26.1</td>
<td class="label">Project Manager:</td><td>Smith, John</td>
<td class="label">Fee Calculation Method:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Code:</td><td>PROJECT.001</td>
<td class="label">Project Lead:</td><td>Doe, Jane</td>
<td class="label">Status:</td><td>Backlog</td>
</tr>
<tr>
<td class="label">Project Title:</td><td>Scrapy Project</td>
<td class="label">Total Project Value:</td><td>1,438.00</td>
<td class="label">Condition:</td><td>Green<img src="/images/status_green.png" alt="Green"
title="Green"></td>
</tr>
<tr>
<td class="label">External System Code:</td><td>—</td>
<td class="label">Funded Value:</td><td>1,438.00</td>
<td class="label">Billing Type:</td><td>FP</td>
</tr>
</tbody>
</table>
</tbody>
</table>
'''
# html = req.get('your url')
# html = utils.getFileContent('your file path')
# header = []
rows = []
doc = SimplifiedDoc(html)
tds = doc.selects('table.report>table.report>td')
row = []
for i in range(0,len(tds),2):
    # header.append(tds[i].text.strip(':'))
    row.append(tds[i+1].text)
# rows.append(header)
rows.append(row)
utils.save2csv('test.csv', rows, mode='a')
dabingsou, thank you for the inspiration. While your solution didn't work for my code your idea to use a csv utility other than what was bundled with Scrapy was the solution I was looking for!
My code was very similar to what you wrote with the exception of your MUCH more simplistic way of only looping once through the headers! Below is my code that I utilized and then simply added the csv package to write the file perfectly! This code utilizes Scrapy vs simple-scrapy and allows me to scrape the page using scrapy-splash.
import csv

def h1header_scrape(self, response):
    td_labels = response.css('tbody.h1 td.label')
    h1headers = []
    lastitemnum = len(td_labels)-1 #This provides the last item number and subtracts the duplicate "Project Organization" label in the first tr in the css
    lastheader = response.css('tbody.h1 td.label::text')[lastitemnum].get() #Using the count defined above this gets the last header name in h1 and uses it to stop the for loop below
    for td_label in td_labels:
        columns = td_label.css('td.label')
        header = columns.css('.label::text').get()
        h1header = columns.css('.label::text').get().replace(":","")
        h1headers.append(h1header)
        if header == lastheader:
            break
    print(h1headers)
    with open('testfile.csv','w',newline='') as file:
        writer = csv.writer(file)
        writer.writerow(h1headers)
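For anyone who also wants the values next to the headers, here is a rough sketch of pairing each label cell with the cell that follows it, so that labels become column names and values become a single row. It uses BeautifulSoup rather than Scrapy, only so that it runs without a spider. The two sample rows are copied from the question, and `label_value_pairs` is a hypothetical helper name.

```python
import csv
import io

from bs4 import BeautifulSoup

# Two of the label/value rows from the question's tbody.h1 block.
html = """
<table class="report">
<tbody class="h1">
<tr>
<td class="label">Project Code:</td><td>PROJECT.001</td>
<td class="label">Project Lead:</td><td>Doe, Jane</td>
<td class="label">Status:</td><td>Backlog</td>
</tr>
<tr>
<td class="label">Project Title:</td><td>Scrapy Project</td>
<td class="label">Total Project Value:</td><td>1,438.00</td>
<td class="label">Condition:</td><td>Green<img src="/images/status_green.png" alt="Green" title="Green"></td>
</tr>
</tbody>
</table>
"""


def label_value_pairs(markup):
    """Map each td.label text (trailing colon removed) to the text of the next <td>."""
    soup = BeautifulSoup(markup, "html.parser")
    record = {}
    for label in soup.select("tbody.h1 td.label"):
        value = label.find_next_sibling("td")
        name = label.get_text(strip=True).rstrip(":")
        record[name] = value.get_text(strip=True) if value is not None else ""
    return record


record = label_value_pairs(html)

# DictWriter turns the keys into the header row and the values into one data row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```

With several tbody.h1 blocks on a page, the same function could be run once per block and each resulting dict written as its own row, provided every block carries the same labels.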

Beautiful Soup Not Finding Basic HTML Data

I'm trying to extract data from a page using BeautifulSoup. I obtain my HTML data (type: bs4.element.ResultSet) and it contains multiple lines such as the following, which I would like to compile into a list:
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
But when I run a line such as one of those shown below...
labels = soup.find_all("va-infobox-label")
labels = soup.find(colspan="1", style="")
...I get an attribute error. Alternatively running...
labels = soup.find_all("va-infobox-label")
...returns a syntax error
What command or tool should I be using if not find to obtain all lines containing va-infobox-label? My end goal is to obtain a list of labels from this HTML, one of which will be 'weight' as per my example (title="">Weight<).
If you need to replicate the error:
import requests
from bs4 import BeautifulSoup
as_val_url = 'https://escapefromtarkov.gamepedia.com/AS_VAL'
as_val_page = requests.get(as_val_url)
as_val_soup = BeautifulSoup(as_val_page.content, 'html.parser')
soup = as_val_soup.find_all(id="va-infobox0-content")
labels = soup.find_all("va-infobox-label")
If a glance at the HTML would help you, a public 'beautified' copy of it is present in my pastebin. My example is from line 36.
You can use soup.select to search via CSS selectors or soup.find_all as below
from bs4 import BeautifulSoup
from io import StringIO
data = '''
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Slot</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">Primary</td>
</tr>
<tr class="va-infobox-spacing">
<td class="va-infobox-spacing-v" colspan="3"></td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">2.587 kg</td>
</tr>
<tr class="va-infobox-spacing">
<td class="va-infobox-spacing-v" colspan="3"></td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Grid size</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">5x2</td>
</tr>
'''
f = StringIO(data)
soup = BeautifulSoup(f, 'html.parser')
for e in soup.find_all('td', {'class': 'va-infobox-label'}):
    print('find_all', e)
for e in soup.select('.va-infobox-label'):
    print('select', e)
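As for the attribute error in the question, it most likely comes from calling find_all on the result of an earlier find_all: that result is a ResultSet (a list of tags), not a single tag. Also, soup.find_all("va-infobox-label") searches for a tag named va-infobox-label, not for a class. Below is a minimal sketch of one way around both. It assumes the element with id va-infobox0-content is a table wrapping the rows, which the question does not show.

```python
from bs4 import BeautifulSoup

# The wrapping <table id="va-infobox0-content"> is an assumption for illustration.
html = """
<table id="va-infobox0-content">
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Slot</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">Primary</td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">2.587 kg</td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Grid size</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">5x2</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None), so further searches can be chained on it
infobox = soup.find(id="va-infobox0-content")

# class is a reserved word in Python, so the keyword argument is spelled class_
labels = [
    td.get_text(strip=True)
    for td in infobox.find_all("td", class_="va-infobox-label")
]
print(labels)
```

If find_all is kept for the first step, the second search has to be applied to each tag in the ResultSet, for example in a loop.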

Python: robust xpath for table in tr and td tags, eliminate unwanted data

I need a robust XPath for this URL: "http://www.screener.com/v2/stocks/view/5131"
However, there is too much blank space before the data I want, so my current approach is not robust.
The part I need is 11.48,9.05,11.53 from the html below:
<div class="table-responsive">
<table class="table table-hover">
<tr>
<th>Financial Year</th>
<th class="number">Revenue ('000)</th>
<th class="number">Net ('000)</th>
<th class="number">EPS</th>
<th></th>
</tr>
<tr>
<td>30 Nov, 2017</td>
<td class="number">205,686</td>
<td class="number">52,812</td>
<td class="number">11.48</td>
<td></td>
</tr>
<tr>
<td>30 Nov, 2016</td>
<td class="number">191,301</td>
<td class="number">41,598</td>
<td class="number">9.05</td>
<td></td>
</tr>
<tr>
<td>30 Nov, 2015</td>
<td class="number">225,910</td>
<td class="number">51,082</td>
<td class="number">11.53</td>
<td></td>
</tr>
My code as below
from lxml import html
import requests
page = requests.get('http://www.screener.com/v2/stocks/view/5131')
output = html.fromstring(page.content)
output.xpath('//tr/td/following-sibling::td/text()')
How should the code be changed so that it robustly gets the three numbers from the table shown above?
I just want the output 11.48,9.05,11.53, but I am unable to get rid of the rest of the data inside the table.
Try below XPath to get desired output:
//div[@id="annual"]//tr/td[position() = last() - 1]/text()
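A small sketch of that expression in use with lxml, run against the table from the question instead of the live page. Attribute tests in XPath are written with @, as in @id. The id="annual" on the wrapping div is taken from the answer's expression and added to the sample by hand, since the question's snippet does not show it.

```python
from lxml import html

# Table from the question; id="annual" on the div is an assumption taken from the XPath.
markup = """
<div class="table-responsive" id="annual">
<table class="table table-hover">
<tr>
<th>Financial Year</th><th class="number">Revenue ('000)</th>
<th class="number">Net ('000)</th><th class="number">EPS</th><th></th>
</tr>
<tr>
<td>30 Nov, 2017</td><td class="number">205,686</td>
<td class="number">52,812</td><td class="number">11.48</td><td></td>
</tr>
<tr>
<td>30 Nov, 2016</td><td class="number">191,301</td>
<td class="number">41,598</td><td class="number">9.05</td><td></td>
</tr>
<tr>
<td>30 Nov, 2015</td><td class="number">225,910</td>
<td class="number">51,082</td><td class="number">11.53</td><td></td>
</tr>
</table>
</div>
"""

tree = html.fromstring(markup)

# last() - 1 picks the second-to-last <td> of each row, which is the EPS column here;
# the header row is skipped automatically because it holds <th> cells, not <td>
raw = tree.xpath('//div[@id="annual"]//tr/td[position() = last() - 1]/text()')
eps = [value.strip() for value in raw]
print(",".join(eps))
```

Selecting by position relative to the last cell only stays robust as long as the trailing empty column is present, so it is worth checking against the real page.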

Beautifulsoup Unable to Find Classes with Hyphens in Their Name

I am using BeautifulSoup4 on a MacOSX running Python 2.7.8. I am having difficulty extracting information from the following html code
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
I can't seem to grab the table or any of its contents because BS and/or Python doesn't seem to recognize values with hyphens. So the usual code, something like
Table = soup.find('tbody',{'class':'yui-dt-data'})
or
Row2 = Table.find('tr',{'id':'yui-rec2'})
just returns an empty object (not None, simply empty). I'm not new to BS4 or Python and I've extracted information from this site before, but the class names are different now than when I previously did it. Now everything has hyphens. Is there any way to get Python to recognize the hyphen or a workaround?
I need to have my code be general so that I can run it across numerous pages that all have the same class name. Unfortunately, the id attribute in <tbody> is unique to that particular table, so I can't use that to identify this table across webpages.
Any help would be appreciated. Thanks in advance.
The following code:
from bs4 import BeautifulSoup
htmlstring = """ <tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<tr id="yui-rec1" class="yui-dt-odd">
<tr id="yui-rec2" class="yui-dt-even">"""
soup = BeautifulSoup(htmlstring)
Table = soup.find('tbody', attrs={'class': 'yui-dt-data'})
print("Table:\n")
print(Table)
tr = Table.find('tr', attrs={'class': 'yui-dt-odd'})
print("tr:\n")
print(tr)
outputs:
Table:
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr></tr></tbody>
tr:
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr>
Even though the html you supplied isn't by itself valid, it seems that BS is making a guess about how it should be, because soup.prettify() yields
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2">
</tr>
</tr>
</tr>
</tbody>
Though I'm guessing those tr's aren't supposed to be nested.
Could you try running that exact code and seeing what the output is?
For people looking for a way to find a tag with a hyphen in its attribute names, there is an answer in the documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
This segment of code will cause an error:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You should do this instead:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
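To put the two cases side by side, here is a short sketch using the markup from the question (wrapped in a table tag, which is an addition for illustration). Hyphens in attribute values need no special handling. Only hyphenated attribute names need attrs=, because they cannot be written as Python keyword arguments.

```python
from bs4 import BeautifulSoup

# Markup from the question; the outer <table> is added for illustration.
html = """
<table>
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Hyphenated *values* work with the ordinary keyword arguments
table = soup.find("tbody", class_="yui-dt-data")
row2 = table.find("tr", id="yui-rec2")

# class is multi-valued, so matching one class out of several also works
even_rows = table.find_all("tr", class_="yui-dt-even")

# Hyphenated attribute *names* go through attrs=
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")
divs = data_soup.find_all(attrs={"data-foo": "value"})

print(row2["id"], [row["id"] for row in even_rows], len(divs))
```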
Just use select. bs4 4.7.1
import requests
from bs4 import BeautifulSoup as bs
html = '''
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
'''
soup = bs(html, 'lxml')
soup.select('.yui-dt-data')
