Getting data out of a <td> element using Python

I am writing an agent for Plex and I am scraping the following HTML table. I am rather new to Python and web scraping in general.
I am trying to get to the data XXXXXXXXXX.
THE DATA
<table class="d">
<tbody>
<tr>
<th class="ch">title</th>
<th class="ch">released</th>
<th class="ch">company</th>
<th class="ch">type</th>
<th class="ch">rating</th>
<th class="ch">category</th>
</tr>
<tr>
<td class="cd" valign="top">
XXXXXXXXXX
</td>
<td class="cd">2015</td>
<td class="cd">My Films</td>
<td class="cd"> </td>
<td class="cd"> </td>
<td class="cd">General Hardcore</td>
</tr>
</tbody>
</table>
THE CODE
This is a segment of the code I am using:
myTable = HTML.ElementFromURL(searchQuery, sleep=REQUEST_DELAY).xpath('//table[contains(@class,"d")]/tr')
self.log('SEARCH:: My Table: %s', myTable)
# This logs the following
#2019-12-26 00:26:49,329 (17a4) : INFO (logkit:16) - GEVI - SEARCH:: My Table: [<Element tr at 0x5225c30>, <Element tr at 0x5225c00>]
for myRow in myTable:
    siteTitle = myRow[0]
    self.log('SEARCH:: Site Title: %s', siteTitle)
    siteTitle = myRow[0].text_content().strip()
    self.log('SEARCH:: Site Title: %s', siteTitle)
    # This logs the following for <tr>/<th> - ROW 1
    # 2019-12-26 00:26:49,335 (17a4) : INFO (logkit:16) - GEVI - SEARCH:: Site Title: <Element th at 0x5225180>
    # 2019-12-26 00:26:49,342 (17a4) : INFO (logkit:16) - GEVI - SEARCH:: Site Title: title
    # This logs the following for <tr>/<td> - ROW 2
    # 2019-12-26 00:26:49,362 (17a4) : INFO (logkit:16) - GEVI - SEARCH:: Site Title: <Element td at 0x52256f0>
    # 2019-12-26 00:26:49,369 (17a4) : INFO (logkit:16) - GEVI - SEARCH:: Site Title: #### this is my issue... should be XXXXXXXXXX
    # I can get the href using the following code
    siteURL = myRow.xpath('.//td/a')[0].get('href')
THE QUESTIONS
A. How do I get the value 'XXXXXXXXXX'? I tried using XPath, but it picked up data from another table on the same page.
B. Is there a better way of getting the href attribute?
OTHER
The Python libraries I am using are:
import datetime, linecache, platform, os, re, string, sys, urllib
I cannot use BeautifulSoup: as this is an agent for Plex, anyone who wanted to use the agent would have to install BeautifulSoup themselves, so that is a no-go.

How's this?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<table class="d">
<tbody>
<tr>
<th class="ch">title</th>
<th class="ch">released</th>
<th class="ch">company</th>
<th class="ch">type</th>
<th class="ch">rating</th>
<th class="ch">category</th>
</tr>
<tr>
<td class="cd" valign="top">
XXXXXXXXXX
</td>
<td class="cd">2015</td>
<td class="cd">My Films</td>
<td class="cd"> </td>
<td class="cd"> </td>
<td class="cd">General Hardcore</td>
</tr>
</tbody>
</table>'''
doc = SimplifiedDoc(html)
table = doc.getElement('table','d') # doc.getElement(tag='table',attr='class',value='d')
trs = table.trs.contains('<a ') # table.getElementsByTag('tr').contains('<a ')
for tr in trs:
    a = tr.a
    print(a)
    print(a.text) # XXXXXXXXXX

Related

Python : Scrape each info in table without class using beautifulsoup4

I'm new to Python and I have a problem scraping, with beautifulsoup4, a table containing information about a book, because the tr and td elements of the table don't have class names.
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
here is the table in the website:
<table class="table table-striped">
<tr>
<th>
UPC
</th>
<td>
a897fe39b1053632
</td>
</tr>
<tr>
<th>
Product Type
</th>
<td>
Books
</td>
</tr>
<tr>
<th>
Price (excl. tax)
</th>
<td>
£51.77
</td>
</tr>
<tr>
<th>
Price (incl. tax)
</th>
<td>
£51.77
</td>
</tr>
<tr>
<th>
Tax
</th>
<td>
£0.00
</td>
</tr>
<tr>
<th>
Availability
</th>
<td>
In stock (22 available)
</td>
</tr>
<tr>
<th>
Number of reviews
</th>
<td>
0
</td>
</tr>
</table>
The only thing I have learned is matching by class name, for example: book_price = soup.find('td', class_='book-price').
But in this situation I am blocked...
Is there something like "find and pair the first th tag with the first td, the second th tag with the second td", and so on?
I am picturing something like this:
import requests
from bs4 import BeautifulSoup
book_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
page = requests.get(book_url)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('table').prettify()
table_infos = soup.find('table')
for info in table_infos.findAll('tr'):
    upc = ...
    price = ...
    tax = ...
Thank you!
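
One way to "find and pair" them: since each tr in this table holds exactly one th (the label) and one td (the value), you can collect the pairs into a dict keyed by the label. A minimal sketch along those lines (the lookup keys come from the table shown above):
import requests
from bs4 import BeautifulSoup

book_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
page = requests.get(book_url)
soup = BeautifulSoup(page.content, 'html.parser')

# Each row pairs one <th> label with one <td> value.
info = {}
for row in soup.find('table').find_all('tr'):
    info[row.th.get_text(strip=True)] = row.td.get_text(strip=True)

print(info['UPC'])                # a897fe39b1053632
print(info['Price (excl. tax)']) # £51.77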

grabbing child from html table

Trying to grab some table data from a website.
Here's a sample of the HTML, which can be found at https://www.madeinalabama.com/warn-list/:
<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing * </td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO. </td>
<td>Opelika </td>
<td> 146 </td>
</tr>
<tr>
<td>Closing * </td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS </td>
<td>Daphne </td>
<td> 72 </td>
</tr>
I'm trying to grab the data corresponding to the 6th td for each table row.
I tried this:
url = 'https://www.madeinalabama.com/warn-list/'
browser = webdriver.Chrome()
browser.get(url)
elements = browser.find_elements_by_xpath("//table/tbody/tr/td[6]")
and elements comes back as this:
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="7d2f8991-d30b-4bc0-bfa5-4b7e909fb56c")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="ba0cd72d-d105-4f8c-842f-6f20b3c2a9de")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="1ec14439-0732-4417-ac4f-be118d8d1f85")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="d8226534-4fc7-406c-935a-d43d6d777bfb")>]
Desired output is a simple df like this:
Planned # Affected Employees
146
72
.
.
.
Could someone please explain how to do this using selenium's find_elements_by_xpath? We have ample BeautifulSoup explanations.
You can use the pd.read_html() function:
import pandas as pd

txt = '''<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing * </td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO. </td>
<td>Opelika </td>
<td> 146 </td>
</tr>
<tr>
<td>Closing * </td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS </td>
<td>Daphne </td>
<td> 72 </td>
</tr>'''
df = pd.read_html(txt)[0]
print(df)
Prints:
Closing or Layoff Initial Report Date Planned Starting Date Company City Planned # Affected Employees
0 Closing * 09/17/2019 11/14/2019 FLOWERS BAKING CO. Opelika 146
1 Closing * 08/05/2019 10/01/2019 INFORM DIAGNOSTICS Daphne 72
Then:
print(df['Planned # Affected Employees'])
Prints:
0 146
1 72
Name: Planned # Affected Employees, dtype: int64
EDIT: Solution with BeautifulSoup:
soup = BeautifulSoup(txt, 'html.parser')
all_data = []
for tr in soup.select('.warn-data tr:has(td)'):
    *_, last_column = tr.select('td')
    all_data.append(last_column.get_text(strip=True))
df = pd.DataFrame({'Planned': all_data})
print(df)
Prints:
Planned
0 146
1 72
Or:
soup = BeautifulSoup(txt, 'html.parser')
all_data = [td.get_text(strip=True) for td in soup.select('.warn-data tr > td:nth-child(6)')]
df = pd.DataFrame({'Planned': all_data})
print(df)
You could also use td:nth-last-child(1), assuming it's the last child:
soup.select('div.warn-data > table > tbody > tr > td:nth-last-child(1)')
Example
from bs4 import BeautifulSoup
html = """
<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing *</td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO.</td>
<td>Opelika</td>
<td> 146 </td>
</tr>
<tr>
<td>Closing *</td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS</td>
<td>Daphne</td>
<td> 72 </td>
</tr>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div.warn-data > table > tbody > tr > td:nth-last-child(1)')
for index, item in enumerate(elements):
    print(index, item.text)
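
For the selenium route the question explicitly asks about: find_elements (plural) returns a list of WebElements, which is exactly what the output above shows, so the text has to be read from each element individually. A minimal sketch, assuming Selenium 4's By-based API (the find_elements_by_xpath spelling was removed in Selenium 4):
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.madeinalabama.com/warn-list/'
browser = webdriver.Chrome()
browser.get(url)

# One WebElement per row's 6th cell; read .text from each element, not the list.
cells = browser.find_elements(By.XPATH, '//div[@class="warn-data"]//table/tbody/tr/td[6]')
values = [cell.text.strip() for cell in cells]
browser.quit()

df = pd.DataFrame({'Planned # Affected Employees': values})
print(df)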

How to read the data from HTML file and write the data to CSV file using python?

I have an .html report which consists of data in tables plus pass/fail criteria, and I want this data written to a .csv file using Python 3.
Please suggest how to proceed.
For example, the data will be like this:
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
Assuming you know the header and really only need the associated percentage, with bs4 4.7.1 you can use :contains to target the header and then take the adjacent td. In practice you would read your HTML from a file into the html variable shown.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
'''
soup = bs(html, 'lxml') # 'html.parser' if lxml not installed
header = 'Test Sequence Work Progress'
result = soup.select_one('td:contains("' + header + '") + td').text
df = pd.DataFrame([result], columns = [header])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )
import csv
from bs4 import BeautifulSoup
out = open('out.csv', 'w', encoding='utf-8')
path="my.html" #add the path of your local file here
soup = BeautifulSoup(open(path), 'html.parser')
for link in soup.find_all('p'): # add the tag you want to extract
    a = link.get_text()
    out.write(a)
    out.write('\n')
out.close()
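
Neither snippet writes the table itself out as CSV rows. A sketch for that, combining the pieces above: it assumes the report is saved locally as my.html and that the outer table is the class="coverage" one from the question:
import csv
from bs4 import BeautifulSoup

with open('my.html', encoding='utf-8') as f:  # local copy of the report
    soup = BeautifulSoup(f, 'html.parser')

table = soup.find('table', class_='coverage')
with open('data.csv', 'w', newline='', encoding='utf-8-sig') as out:
    writer = csv.writer(out)
    # recursive=False so the rows/cells of the nested metricbar table
    # don't show up as extra rows/cells of the outer table.
    for tr in table.find_all('tr', recursive=False):
        cells = tr.find_all(['th', 'td'], recursive=False)
        writer.writerow(cell.get_text(strip=True) for cell in cells)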

Parsing webpage with robobrowser and beautifulsoup

I'm new to web scraping and am trying to parse a website after doing a form submission with RoboBrowser. I get the correct data back (I can view it when I do print(browser.parsed)), but am having trouble parsing it. The relevant part of the source code of the webpage looks like this:
<div id="ii">
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
If I do:
in: browser.select('#t1b')
I get:
out: [<td id="t1b" scope="row">Participant Name</td>]
instead of JONES, JOHN.
The only way I've been able to get the relevant data is:
browser.select('tr')
This returns a list of the 29 'tr' elements, each of which I can convert to text and search for the relevant info.
I've also tried creating a BeautifulSoup object:
x = browser.select('#ii')
soup = BeautifulSoup(x[0].text, "html.parser")
but it loses all tags/ids and so I can't figure out how to search within it.
Is there an easy way to loop through each 'tr' element and get the actual data rather than the label, as opposed to repeatedly converting to a string variable and searching through it?
Thanks
Get all the "label" td elements, take each one's next td sibling value, and collect the results into a dict:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
</table>
"""
soup = BeautifulSoup(data, 'html5lib')
data = {
    label.get_text(strip=True): label.find_next_sibling("td").get_text(strip=True)
    for label in soup.select("tr > td[scope=row]")
}
pprint(data)
Prints:
{'Date of Birth': '11/25/2016',
'ID (ID Number)': '1234567',
'Participant Name': 'JONES, JOHN',
'Race / Ethnicity': 'White',
'Sex': 'MALE'}

scraping tables with beautifulsoup

I seem to be stuck. If I had the following table:
<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
What would be the ideal way to use find_all to pull the 44,000 td from the table?
If the cell you want sits at a recurring position in the table, I would use Beautiful Soup to extract all the td elements and then index into that list. See the pseudocode below.
known_position = 5
tds = soup.find_all('td')
number = tds[known_position].text
On the other hand, if you're specifically searching for a given value, I would just iterate over the list:
tds = soup.find_all('td')
for td in tds:
    if td.text.strip() == 'number here':
        # do your stuff
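
For a self-contained, runnable version against the question's table (a sketch; the cells carry stray whitespace, so text is stripped before comparing):
from bs4 import BeautifulSoup

html = '''<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center"> 40 </td>
<td align="center"> 44 </td>
<td align="center"><font color="green"><b>+4</b></font></td>
<td align="center"> 1,000</td>
<td align="center"> 15,000 </td>
<td align="center"> 44,000 </td>
<td align="center"><font color="green"><b><nobr>+193.33%</nobr></b></font></td>
</tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')

# By fixed position (44,000 is the sixth cell, index 5) ...
print(tds[5].get_text(strip=True))  # 44,000

# ... or by value, if the position isn't stable.
match = next(td for td in tds if td.get_text(strip=True) == '44,000')
print(match.get_text(strip=True))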
