I am having trouble scraping HTML with Beautiful Soup. I cannot figure out how to go through the whole HTML document to find the rest of the things I am looking for.
I have code that finds and prints the word "Totem" in the HTML below. I want to be able to cycle through the HTML and find the remaining "One, Two, Three" and "Rent".
Code that works to find the first tag and text:
print(html.find('td', {'class': 'play'}).next_sibling.next_sibling.text)
Let the below be the sample html to scrape:
<tr>
<td class="play">
<span class="play-button as_audio-button"></span>
<audio class="as_audio_preview" src="https://shopify.audiosalad.com/" >foo</audio>
</td>
**<td>Totem</td>**
<!--<td>$0.99</td>-->
<td class="buy">
<tr>
<td class="play">
<span class="play-button as_audio-button"></span>
<audio class="as_audio_preview" src="https://shopify.audiosalad.com/" >foo</audio>
</td>
**<td>One, Two, Three</td>**
<!--<td>$0.99</td>-->
<td class="buy">
<tr>
<td class="play">
<span class="play-button as_audio-button"></span>
<audio class="as_audio_preview" src="https://shopify.audiosalad.com/" >foo</audio>
</td>
**<td>Rent</td>**
<!--<td>$0.99</td>-->
<td class="buy">
Try this. It should fetch you the content you are after:
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
for items in soup.find_all(class_="play"):
    data = items.find_next_sibling().text
    print(data)
Or, you can try like this as well:
for items in soup.find_all(class_="play"):
    data = items.find_next("td").text
    print(data)
Output:
Totem
One, Two, Three
Rent
You have to iterate over the elements, like this:
for td in html.find_all('td', {'class': 'play'}):
    print(td.next_sibling.next_sibling.text)
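The doubled next_sibling in the loop above is needed because the first sibling after the td is the whitespace text node between the tags. A minimal sketch that sidesteps this by asking for the next td sibling directly; the trimmed sample_html and the variable names are assumptions made up for illustration:

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the question's markup (closing tags added
# so the example is well formed).
sample_html = """
<table>
<tr>
<td class="play"><span class="play-button"></span></td>
<td>Totem</td>
<!--<td>$0.99</td>-->
<td class="buy"></td>
</tr>
<tr>
<td class="play"><span class="play-button"></span></td>
<td>One, Two, Three</td>
<td class="buy"></td>
</tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

titles = []
for play_cell in soup.find_all("td", class_="play"):
    # find_next_sibling("td") skips text nodes and comments,
    # so no chained .next_sibling.next_sibling is needed.
    title_cell = play_cell.find_next_sibling("td")
    if title_cell is not None:
        titles.append(title_cell.get_text(strip=True))

print(titles)
```

Checking for None keeps the loop from raising AttributeError if a row has no cell after the play button.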
This is the html I have on a website:
<table class="table table-fixed table-header-right text-medium">
<tbody><tr><th class="no-border">Certification Number</th><td class="no-border">48487270</td></tr>
<tr>
<th>Label Type</th>
<td>
<img width="69" height="38" class="margin-right-min" alt="" aria-hidden="true" src="https://i.psacard.com/psacard/images/cert/table-image-ink.png" style="">
<span class="inline-block padding-top-min">with fugitive ink technology</span>
</td>
</tr>
<tr><th>Reverse Cert Number/Barcode</th><td>Yes</td></tr>
<tr><th>Year</th><td>2020</td></tr>
<tr><th>Brand</th><td>TOPPS</td></tr>
<tr><th>Sport</th><td>BASEBALL CARDS</td></tr>
<tr><th>Card Number</th><td>20</td></tr>
<tr><th>Player</th><td>ARISTIDES AQUINO</td></tr>
<tr><th>Variety/Pedigree</th><td></td></tr>
<tr><th>Grade</th><td>NM-MT 8</td></tr>
</tbody></table>
I am trying to figure out a way to read the year into a variable. I normally find elements with XPath, but since these tags are repeated so many times with no other indicators, I am unsure how to go about this. The year will change, so I can't search by its text. Any help would be appreciated.
Use BeautifulSoup to find the <th> tag with the text 'Year'. Then find the next <td> tag and extract the text from that:
from bs4 import BeautifulSoup
html = '''<table class="table table-fixed table-header-right text-medium">
<tbody><tr><th class="no-border">Certification Number</th><td class="no-border">48487270</td></tr>
<tr>
<th>Label Type</th>
<td>
<img width="69" height="38" class="margin-right-min" alt="" aria-hidden="true" src="https://i.psacard.com/psacard/images/cert/table-image-ink.png" style="">
<span class="inline-block padding-top-min">with fugitive ink technology</span>
</td>
</tr>
<tr><th>Reverse Cert Number/Barcode</th><td>Yes</td></tr>
<tr><th>Year</th><td>2020</td></tr>
<tr><th>Brand</th><td>TOPPS</td></tr>
<tr><th>Sport</th><td>BASEBALL CARDS</td></tr>
<tr><th>Card Number</th><td>20</td></tr>
<tr><th>Player</th><td>ARISTIDES AQUINO</td></tr>
<tr><th>Variety/Pedigree</th><td></td></tr>
<tr><th>Grade</th><td>NM-MT 8</td></tr>
</tbody></table>'''
soup = BeautifulSoup(html, 'html.parser')
year = soup.find('th', text='Year').find_next('td').text
print(year)
Output:
2020
First, find the web elements with the driver's findElements function, using the class name.
Then you can get individual elements from that list with list.get(index).
Or,
you can store all the td/th elements in a list and then search the list for the "Year" entry you are looking for.
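The "store the cells, then search" idea can also be sketched with BeautifulSoup by pairing each th with the td in the same row. This is only an illustration under assumptions: the shortened sample_html and the names rows and year are made up here, and it presumes every row of interest has exactly one th and one td.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
sample_html = """
<table>
<tbody><tr><th class="no-border">Certification Number</th><td class="no-border">48487270</td></tr>
<tr><th>Year</th><td>2020</td></tr>
<tr><th>Brand</th><td>TOPPS</td></tr>
<tr><th>Variety/Pedigree</th><td></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Map each header label to the text of the cell next to it.
rows = {}
for tr in soup.find_all("tr"):
    th = tr.find("th")
    td = tr.find("td")
    if th is not None and td is not None:
        rows[th.get_text(strip=True)] = td.get_text(" ", strip=True)

year = rows.get("Year")
print(year)
```

Looking values up by label means the code does not depend on the row order or on the year's value.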
I'm just starting to code in Python, and a friend asked me for an application that finds specific data on the web and presents it nicely.
I have already found a good website that contains the data. I can extract the basic info, but the challenge is to go deeper.
Using BS4 in Python 3.4, I have arrived at this example markup:
<tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="1" something="1something6" something_id="6something0">
<td class="text-center td_something">
<div>
Super String of Something
</div>
</td>
<td class="text-center">08/26 15:00</td>
<td class="text-center something_status">
<span class="something_status_something">Full</span>
</td>
</tr>
<tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="0" something="1something4" something_id="6something7">
<td class="text-center td_something">
<div>
Super String of Something
</div>
</td>
<td class="text-center">05/26 15:00</td>
<td class="text-center something_status">
<span class="something_status_something"></span>
</td>
</tr>
What I want to do now is find the date string, but only if the parent has data-something="1" and not if it has data-something="0".
I can scrape all dates with:
soup.find_all(lambda tag: tag.name == 'td' and tag.get('class') == ['text-center'] and not tag.has_attr('style'))
but that does not check the parent. That is why I tried:
def KieMeWar(tag):
    return tag.name == 'td' and tag.parent.name == 'tr' and tag.parent.attrs == {"data-something": "1"} #and tag.get('class') == ['text-center'] and not tag.has_attr('style')
soup.find_all(KieMeWar)
The result is an empty set. What is wrong, or what is the easiest way to get what I am aiming for?
P.S. This is an example excerpt of the full code; that is why I check for the absence of style, even though it does not appear here but does later.
BeautifulSoup's findAll (find_all in current bs4) has the attrs kwarg, which is used to find tags with a given attribute:
import bs4
soup = bs4.BeautifulSoup(html)
trs = soup.findAll('tr', attrs={'data-something':'1'})
That finds all tr tags with the attribute data-something="1". Afterwards, you can loop through the trs and grab the 2nd td tag to extract the date
for t in trs:
    print(str(t.findAll('td')[1].text))
>>> 08/26 15:00
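As for why the original KieMeWar filter returned nothing: tag.parent.attrs is the dictionary of all attributes on the tr, so comparing it for equality with {"data-something": "1"} fails whenever the row carries any other attribute. A hedged sketch of the same filter that checks only the one attribute; the shortened sample_html and the function name is_wanted_date_cell are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# Shortened version of the rows from the question.
sample_html = """
<table>
<tr class=" " data-something="1" something_id="a">
<td class="text-center td_something"><div>Super String of Something</div></td>
<td class="text-center">08/26 15:00</td>
<td class="text-center something_status"><span>Full</span></td>
</tr>
<tr class=" " data-something="0" something_id="b">
<td class="text-center td_something"><div>Super String of Something</div></td>
<td class="text-center">05/26 15:00</td>
<td class="text-center something_status"><span></span></td>
</tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def is_wanted_date_cell(tag):
    """True for a plain text-center td whose parent row has data-something="1"."""
    return (
        tag.name == "td"
        and tag.get("class") == ["text-center"]
        and not tag.has_attr("style")
        and tag.parent is not None
        and tag.parent.name == "tr"
        # Compare one attribute, not the whole attrs dictionary.
        and tag.parent.get("data-something") == "1"
    )


dates = [td.get_text(strip=True) for td in soup.find_all(is_wanted_date_cell)]
print(dates)
```

This keeps the class and style checks from the question, so it does not rely on the date being the second cell in the row.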
Due to some help I received from here, I was able to put together a Python script that allows me to pull numbers from HTML data. However, for some reason, not all of the numbers are being pulled even though I believe I am using the correct criteria in my findall() method. Here is what my script looks like:
search1 = []
soup = BeautifulSoup(data_payload, 'html.parser')
data = soup.findAll("td", {"class":"confluenceTd"})
for d in data:
    m = re.search('([0-9]+)',str(d.findAll(text=True)))
    if m:
        search1.append(m.group(0))
print search1
Here is a sample of HTML where the numbers ARE all pulled:
<td class="confluenceTd">
<span>
AutoRun
</span>
</td>
<td class="confluenceTd">
<span>
1514444
</span>
</td>
<td class="confluenceTd" colspan="1">
<span style="color: rgb(0,0,0);">
61888758
</span>
And in the sample below, only "63811289" is getting pulled but nothing else:
<td class="confluenceTd" colspan="1">
<p>
CSY: 63811289, 62277372, 612377891, 653856796
</p>
<p>
RTY: 54346678
</p>
</td>
Any help would be appreciated. Thanks.
re.search only returns the first match: https://docs.python.org/2/library/re.html#re.search
You can pull out all of the strings with re.finditer():
for i in re.finditer('([0-9]+)',str(d.findAll(text=True))):
    search1.append(i.group(0))
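To make the difference concrete, here is a small standard-library sketch comparing re.search with re.findall, which returns every match as a list. The string cell_text is an assumption standing in for the text extracted from the second sample cell.

```python
import re

# Stand-in for the text of the <td> that holds several numbers.
cell_text = "CSY: 63811289, 62277372, 612377891, 653856796 RTY: 54346678"

# re.search stops at the first match.
first_match = re.search(r"[0-9]+", cell_text)
first_number = first_match.group(0) if first_match else None

# re.findall collects every non-overlapping match.
all_numbers = re.findall(r"[0-9]+", cell_text)

print(first_number)
print(all_numbers)
```

With no capture group in the pattern, re.findall returns the full matched strings, so the list can be appended to search1 with extend().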
As a novice with bs4 I'm looking for some help in working out how to extract the text from a series of webpage tables, one of which is like this:
<table style="padding:0px; margin:1px" width="715px">
<tr>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
</td>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Order: </strong></span>
Strigiformes
</td>
<td height="22" width="33%">
<span class="darkGreenText"><strong> Family: </strong></span>
Tytonidae
</td>
<td height="22" width="66%" colspan="2">
<span class="darkGreenText"><strong> Status: </strong></span>
Least Concern
</td>
</tr>
</table>
Desired output:
Name: Tyto alba
Order: Strigiformes
Family: Tytonidae
Status: Least Concern
I've tried using [index] as recommended (https://stackoverflow.com/a/35050622/1726290),
and also next_sibling (https://stackoverflow.com/a/23380225/1726290) but I'm getting stuck as one part of the text I need is tagged and the second part is not. Any help would be appreciated.
It seems like what you want is to call get_text(strip=True) on the BeautifulSoup Tag. Assuming raw_html is the HTML you pasted above:
htmlSoup = BeautifulSoup(raw_html)
for tag in htmlSoup.select('td'):
    print(tag.get_text(strip=True))
which prints:
Name:Tyto alba
Order:Strigiformes
Family:Tytonidae
Status:Least Concern
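The output above has no space after the colon, while the desired output does. One possible way to get it, sketched under the assumption that a single space between the label and the value is what is wanted, is to pass a separator to get_text. The simplified sample_html and the name lines are made up for illustration.

```python
from bs4 import BeautifulSoup

# Simplified copy of the table from the question.
sample_html = """
<table>
<tr>
<td>
<span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
</td>
<td>
<span class="darkGreenText"><strong> Order: </strong></span>
Strigiformes
</td>
<td>
<span class="darkGreenText"><strong> Family: </strong></span>
Tytonidae
</td>
<td colspan="2">
<span class="darkGreenText"><strong> Status: </strong></span>
Least Concern
</td>
</tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The first argument is the separator placed between the stripped strings.
lines = [td.get_text(" ", strip=True) for td in soup.find_all("td")]

for line in lines:
    print(line)
```

With strip=True, whitespace-only strings are dropped before joining, so each cell yields the label and the value joined by one space.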
I'm trying to scrape data off a table on a web page using Python, BeautifulSoup, Requests, as well as Selenium to log into the site.
Here's the table I'm looking to get data for...
<div class="sastrupp-class">
<table>
<tbody>
<tr>
<td class="key">Thing I dont want 1</td>
<td class="value money">$1.23</td>
<td class="key">Thing I dont want 2</td>
<td class="value">99,999,999</td>
<td class="key">Target</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 3</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 4</td>
<td class="value percentage">1.23%</td>
<td class="key">Thing I dont want 5</td>
<td class="money value">$1.23</td>
</tr>
</tbody>
</table>
</div>
I can find the "sastrupp-class" fine, but I don't know how to look through it and get to the part of the table I want.
I figured I could just look for the class that I'm searching for like this...
output = soup.find('td', {'class':'key'})
print(output)
but that doesn't return anything.
Important to note:
1. The <td>s inside the table have the same class name as the one that I want. If I can't separate them out, I'm OK with that, although I'd rather just return the one I want.
2. There are other <div>s with class="sastrupp-class" on the site.
I'm obviously a beginner at this so let me know if I can help you help me.
Any help/pointers would be appreciated.
1) First off, to get your 'Target' you need find_all, not find. Then, assuming you know exactly which position your target is in (in the example you gave it is index 2), the solution could look like this:
from bs4 import BeautifulSoup
html = """(YOUR HTML)"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'sastrupp-class'})
all_keys = table.find_all('td', {'class': 'key'})
my_key = all_keys[2]
print(my_key.text)  # prints 'Target'
2)
There are other <div>s with class="sastrupp-class" on the site
Again, use find_all to get all of them and then select the one you need by its index.
Example HTML:
<body>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Target</div>
</body>
To extract the target, you can just:
all_divs = soup.find_all('div', {'class':'sastrupp-class'})
target = all_divs[3] # assuming you know exactly which index to look for
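If the position is not known in advance, an alternative to indexing is to search by the key's text and then take the value cell next to it. This is a sketch under assumptions: the trimmed sample_html is made up (the $9.99 value is invented only so the result is distinguishable), the key text is assumed to be exactly "Target", and the string argument assumes a reasonably recent bs4 (older releases call it text).

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical page with two divs sharing the same class.
sample_html = """
<div class="sastrupp-class">Don't need this</div>
<div class="sastrupp-class">
<table><tbody><tr>
<td class="key">Thing I dont want 1</td>
<td class="value money">$9.99</td>
<td class="key">Target</td>
<td class="money value">$1.23</td>
</tr></tbody></table>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

target_value = None
for div in soup.find_all("div", class_="sastrupp-class"):
    # Look for the key cell whose text is exactly "Target".
    key_cell = div.find("td", class_="key", string="Target")
    if key_cell is None:
        continue  # this div does not hold the table we want
    # class_="value" matches any td that has "value" among its classes.
    value_cell = key_cell.find_next_sibling("td", class_="value")
    if value_cell is not None:
        target_value = value_cell.get_text(strip=True)
    break

print(target_value)
```

Searching by text avoids hard-coded indexes, which break as soon as the site adds or reorders cells or divs.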