Web scraping in Python - how to capture all <a> elements

I'm using beautifulsoup4 to scrape data from the lyrics.com website, specifically this link: https://www.lyrics.com/album/1447935.
From this block, I'm trying to extract both <a> elements:
[<table class="tdata">
<colgroup>
<col style="width: 50px;"/>
<col style="width: 430px;"/>
<col style="width: 80px;"/>
<col style="width: 80px;"/>
</colgroup>
<thead>
<tr>
<th>#</th>
<th>Song</th>
<th>Duration</th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td class="tal qx">1</td>
<td class="tal qx">
<strong>
Make You Feel My Love
</strong>
</td>
<td class="tal qx">3:32</td>
<td class="tal vam rt">
</td></tr><tr><td class="tal qx">2</td>
<td class="tal qx">
<strong>
Painting Pictures
</strong>
</td>
<td class="tal qx">3:33</td>
<td class="tal vam rt"> </td>
</tr>
</tbody>
</table>]
This is my code:
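import requests as r
from bs4 import BeautifulSoup as bs

# album_url is defined elsewhere in the program, e.g. "/album/1447935"
url = "http://www.lyrics.com" + album_url
page = r.get(url)
soup = bs(page.content, "html.parser")
songs = [a.get('href') for a in (table.find('a') for table in soup.findAll('table')) if a]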
However, it's only returning the first <a>:
['/lyric/15183453/Make+You+Feel+My+Love']
What could be wrong?
Edit: Thank you all for the answers! I upvoted but I don't have enough rep for it to show

This will work:
songs = [song['href'] for song in soup.select('table a')]
Output:
['/lyric/15183453/Make+You+Feel+My+Love', '/lyric/15183454/Painting+Pictures']

Was able to make it work with:
for a in soup.findAll('a'):
    if a.parent.name == 'strong':
        if a.parent.parent.name == 'td':
            print(a["href"])
Still not sure why the other method doesn't work, though, since I've used it elsewhere in my program with no issues.

Other solutions work fine, however I prefer using good old selectors
from bs4 import BeautifulSoup as bs
import requests as req
page = req.get('https://www.lyrics.com/album/1447935')
soup = bs(page.content, 'html.parser')
links = soup.select('table.tdata a[href]')
print(links)
This will print
[Make You Feel My Love, Painting Pictures]
If you aren't familiar with selectors: this matches every a element that has an href attribute and sits inside a table element with the class tdata. To get the URLs themselves, read the href of each match, e.g. [a['href'] for a in links].

Looks like you want table.findAll instead of table.find.
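To see why the original comprehension stops after one link: find returns only the first match inside each table, while find_all (findAll is the older spelling) returns every match. Below is a minimal sketch, assuming a stripped-down version of the table. The <a> tags and their href values are reconstructed from the output shown in the question and are not copied from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the hrefs come from the output in the question.
html = """
<table class="tdata">
  <tbody>
    <tr><td><strong><a href="/lyric/15183453/Make+You+Feel+My+Love">Make You Feel My Love</a></strong></td></tr>
    <tr><td><strong><a href="/lyric/15183454/Painting+Pictures">Painting Pictures</a></strong></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() yields at most one <a> per table, so one table gives one link.
first_only = [a.get("href")
              for a in (table.find("a") for table in soup.find_all("table"))
              if a]

# find_all() yields every <a> in every table.
all_links = [a.get("href")
             for table in soup.find_all("table")
             for a in table.find_all("a")]

print(first_only)
print(all_links)
```

The find-based version probably worked elsewhere in the program because those pages had one link per table.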

Related

BS4 get the TH data within the table

I am trying to read data from a website which has a table like this:
<table border="0" width = "100%">
<tr>
<th width="33%">Product</th>
<th width="34%">ID</th>
<th width="33%">Serial</th>
</tr>
<tr>
<td align='center'>
<a target="_TOP" href="Link1.html">ProductName</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=1'></a>
</td>
<td align='center'>
<a target="_TOP" href="Link2.html">ProductID</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=2'></a>
</td>
<td align='center'>
<a target="_TOP" href="Link3.html">ProductSerial</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=3'></a>
</td>
</tr>
</table>
and all I want from this table is the ProductID, which is the content inside the <a> tag.
The problem is that I am trying to use BS4 to find the tag and read its content, but how do I accurately point BS4 to it?
I have tried:
import bs4

with open("src/file.html", 'r') as inf:
    html = inf.read()
soup = bs4.BeautifulSoup(html, features="html.parser")
for container in soup.find_all("table", {"td": ""}):
    print(container)
But it doesn't work. Is there any way to achieve this, i.e. to read the content inside the a tag?

You can use the :nth-of-type CSS selector:
print(soup.select_one("td:nth-of-type(2) a:nth-of-type(1)").text)
Output:
ProductID
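For reference, here is a self-contained sketch of that selector run against a trimmed copy of the table from the question (the image file names are placeholders). It also shows a second, hypothetical variant that finds the column by its header text instead of a hard-coded position, which may hold up better if the column order changes.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table; the img sources are placeholders.
html = """
<table border="0" width="100%">
<tr><th>Product</th><th>ID</th><th>Serial</th></tr>
<tr>
<td><a href="Link1.html">ProductName</a><br><a href="Link2.html"><img src="x1.png"></a></td>
<td><a href="Link2.html">ProductID</a><br><a href="Link2.html"><img src="x2.png"></a></td>
<td><a href="Link3.html">ProductSerial</a><br><a href="Link2.html"><img src="x3.png"></a></td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Second <td> among its siblings, then the first <a> inside it.
by_position = soup.select_one("td:nth-of-type(2) a:nth-of-type(1)").text

# Alternative: look up the column index from the header row.
headers = [th.get_text(strip=True) for th in soup.find_all("th")]
col = headers.index("ID")
data_row = soup.find_all("tr")[1]  # first row after the header row
by_header = data_row.find_all("td")[col].find("a").get_text(strip=True)

print(by_position, by_header)
```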

Beautiful Soup Not Finding Basic HTML Data

I'm trying to extract data from a page using BeautifulSoup. I obtain my HTML data (type: bs4.element.ResultSet) and it contains multiple lines such as the following, which I would like to compile into a list:
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
But when I run a line such as one of those shown below...
labels = soup.find_all("va-infobox-label")
labels = soup.find(colspan="1", style="")
...I get an attribute error. Alternatively running...
labels = soup.find_all("va-infobox-label")
...returns a syntax error
What command or tool should I be using if not find to obtain all lines containing va-infobox-label? My end goal is to obtain a list of labels from this HTML, one of which will be 'weight' as per my example (title="">Weight<).
If you need to replicate the error:
import requests
from bs4 import BeautifulSoup
as_val_url = 'https://escapefromtarkov.gamepedia.com/AS_VAL'
as_val_page = requests.get(as_val_url)
as_val_soup = BeautifulSoup(as_val_page.content, 'html.parser')
soup = as_val_soup.find_all(id="va-infobox0-content")
labels = soup.find_all("va-infobox-label")
If a glance at the HTML would help you, a public 'beautified' copy of it is present in my pastebin. My example is from line 36.

You can use soup.select to search via CSS selectors or soup.find_all as below
from bs4 import BeautifulSoup
from io import StringIO
data = '''
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Slot</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">Primary</td>
</tr>
<tr class="va-infobox-spacing">
<td class="va-infobox-spacing-v" colspan="3"></td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">2.587 kg</td>
</tr>
<tr class="va-infobox-spacing">
<td class="va-infobox-spacing-v" colspan="3"></td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Grid size</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">5x2</td>
</tr>
'''
f = StringIO(data)
soup = BeautifulSoup(f, 'html.parser')
for e in soup.find_all('td', {'class': 'va-infobox-label'}):
    print('find_all', e)
for e in soup.select('.va-infobox-label'):
    print('select', e)
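As for the attribute error in the question: find_all returns a ResultSet (a list of tags), and a ResultSet has no find_all method of its own. Separately, the first positional argument of find_all is a tag name, so find_all("va-infobox-label") looks for a <va-infobox-label> element rather than a class. The sketch below uses a small inline stand-in for the page, not the real wiki HTML, and shows one way around both issues.

```python
from bs4 import BeautifulSoup

# Small stand-in for the infobox; the real page has many more rows.
html = """
<table id="va-infobox0-content">
<tr><td class="va-infobox-label">Slot</td><td class="va-infobox-content">Primary</td></tr>
<tr><td class="va-infobox-label">Weight</td><td class="va-infobox-content">2.587 kg</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet; calling find_all() on it fails.
results = soup.find_all(id="va-infobox0-content")
try:
    results.find_all("td")
    error_raised = False
except AttributeError:
    error_raised = True

# find() returns a single Tag, which can be searched again.
infobox = soup.find(id="va-infobox0-content")
labels = [td.get_text(strip=True)
          for td in infobox.find_all("td", class_="va-infobox-label")]

print(error_raised, labels)
```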

Is it possible that BeautifulSoup can not parse a table in html documents?

Here is an example of the code I used to scrape the table:
with open ('text.txt', 'w') as algroo:
    for row in RoOtbody.find_all('tr'):
        for cell in row.find_all('td'):
            algroo.write(cell.text)
        algroo.write('\n')
I already used Selenium and requests to extract the outer html from the webpage. I also tried to use html.parser and lxml.
The html looks like this:
<tr class="table">
<td class="table" valign="top">
<p class="tbl-hdr">HS heading</p>
</td>
<td class="table" valign="top">
<p class="tbl-hdr">Desccription of product</p>
</td>
<td class="table" colspan="2" valign="top">
<p class="tbl-hdr">Working or processing, carried out on non-originating
materials, which confers originating status</p>
</td>
</tr>
The problem is that when I open the txt file, all the cell contents are in a single column, literally like this:
HS heading
Desccription of product
Working or processing, carried out on non-originating materials, which confers originating status
In all the tutorials I watched and read, they should be in the same row, like this:
HS headingDesccription of productWorking or processing, carried out on non-originating materials, which confers originating status
Can anyone help me, please?

I don't know if this will help you
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<tr class="table">
<td class="table" valign="top">
<p class="tbl-hdr">HS heading</p>
</td>
<td class="table" valign="top">
<p class="tbl-hdr">Desccription of product</p>
</td>
<td class="table" colspan="2" valign="top">
<p class="tbl-hdr">Working or processing, carried out on non-originating
materials, which confers originating status</p>
</td>
</tr>'''
doc = SimplifiedDoc(html)
tr = doc.tr # get first tr
print (tr.text)
print (tr.getText(' '))
tds = tr.tds # get all td
for td in tds:
    print (td.text)
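The same result should also be reachable with BeautifulSoup alone. The extra line breaks most likely come from the whitespace around each <p> inside the cells, which cell.text returns verbatim. The sketch below collapses that whitespace and writes one line per row. The tab separator and the wrapping <table> tag are assumptions added for the example.

```python
from bs4 import BeautifulSoup

html = '''<table><tr class="table">
<td class="table" valign="top">
<p class="tbl-hdr">HS heading</p>
</td>
<td class="table" valign="top">
<p class="tbl-hdr">Desccription of product</p>
</td>
<td class="table" colspan="2" valign="top">
<p class="tbl-hdr">Working or processing, carried out on non-originating
materials, which confers originating status</p>
</td>
</tr></table>'''

soup = BeautifulSoup(html, "html.parser")

lines = []
for row in soup.find_all("tr"):
    # split() + join() collapses newlines and repeated spaces inside a cell.
    cells = [" ".join(cell.get_text().split()) for cell in row.find_all("td")]
    lines.append("\t".join(cells))  # tab-separated: an arbitrary choice

print("\n".join(lines))
```

Writing "\n".join(lines) to the file instead of printing it gives one table row per line.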

Fetch html table row data using python 3.6 beautiful soup

I have the HTML table below and want to fetch the table data, i.e. "Revenues ($M) $135,987", which is in the first row of the table. How can I achieve this using Python and BeautifulSoup?
<table data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0">
<thead data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0">
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.0" width="200">
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-$ millions">
$ millions
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-% change">
% change
</th>
</tr>
</thead>
<tbody data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M)">
<td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).0">
Revenues ($M)
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1">
$135,987
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).2">
27.1%
</td>
</tr>
Script to extract data from direct source:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://fortune.com/fortune500/amazon-com/')
soup = bs(r.content, 'html.parser')
result = soup.find('div', {'class': 'small-12 columns'})
table = result.find_all('table')[0] # Grab the first table
print(table.find('td', {'data-reactid': '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'}).text)

Select the td whose 'data-reactid' has the value '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1' and read its text.
from bs4 import BeautifulSoup
html = """<table data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0">
<thead data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0">
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.0" width="200">
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-$ millions">
$ millions
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-% change">
% change
</th>
</tr>
</thead>
<tbody data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M)">
<td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).0">
Revenues ($M)
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1">
$135,987
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).2">
27.1%
</td>
</tr>
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M)">
<td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).0">
Profits ($M)
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).1">
$2,371.0
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).2">
297.8%
</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('td', {'data-reactid': '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'}).text)
Outputs:
$135,987
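The data-reactid values look auto-generated and may change between page loads, so matching on the visible row label could be less brittle. Here is a minimal sketch of that idea, using a simplified copy of the table with the data-reactid attributes removed. The value_for helper is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Simplified copy of the table; data-reactid attributes are omitted.
html = """
<table>
<tbody>
<tr><td class="title">Revenues ($M)</td><td>$135,987</td><td>27.1%</td></tr>
<tr><td class="title">Profits ($M)</td><td>$2,371.0</td><td>297.8%</td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def value_for(label):
    """Return the text of the cell that follows the title cell named `label`."""
    for title in soup.find_all("td", class_="title"):
        if title.get_text(strip=True) == label:
            return title.find_next_sibling("td").get_text(strip=True)
    return None  # label not present in the table


revenue = value_for("Revenues ($M)")
print(revenue)
```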
Updated in response to comment:
The page is rendered with JavaScript, so you can use Selenium to render it.
First install Selenium:
sudo pip3 install selenium
Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. You can use a headless version of Chrome ("Chrome Canary") if you are on Windows or Mac.
import bs4 as bs
from selenium import webdriver
browser = webdriver.Chrome()
url = "http://fortune.com/fortune500/amazon-com/"
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
# print (soup)
tds = soup.find_all('td')
print(tds[1].text)
Or for other non-selenium methods see my answer to Scraping Google Finance (BeautifulSoup)

Beautifulsoup Unable to Find Classes with Hyphens in Their Name

I am using BeautifulSoup4 on a MacOSX running Python 2.7.8. I am having difficulty extracting information from the following html code
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
I can't seem to grab the table or any of its contents because BS and/or Python doesn't seem to recognize values with hyphens. So the usual code, something like
Table = soup.find('tbody',{'class':'yui-dt-data'})
or
Row2 = Table.find('tr',{'id':'yui-rec2'})
just returns an empty object (not None, simply empty). I'm not new to BS4 or Python and I've extracted information from this site before, but the class names are different now than when I previously did it. Now everything has hyphens. Is there any way to get Python to recognize the hyphen, or is there a workaround?
I need to have my code be general so that I can run it across numerous pages that all have the same class name. Unfortunately, the id attribute in <tbody> is unique to that particular table, so I can't use that to identify this table across webpages.
Any help would be appreciated. Thanks in advance.

The following code:
from bs4 import BeautifulSoup
htmlstring = """ <tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<tr id="yui-rec1" class="yui-dt-odd">
<tr id="yui-rec2" class="yui-dt-even">"""
soup = BeautifulSoup(htmlstring)
Table = soup.find('tbody', attrs={'class': 'yui-dt-data'})
print("Table:\n")
print(Table)
tr = Table.find('tr', attrs={'class': 'yui-dt-odd'})
print("tr:\n")
print(tr)
outputs:
Table:
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr></tr></tbody>
tr:
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr>
Even though the html you supplied isn't by itself valid, it seems that BS is making a guess about how it should be, because soup.prettify() yields
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2">
</tr>
</tr>
</tr>
</tbody>
Though I'm guessing those tr's aren't supposed to be nested.
Could you try running that exact code and seeing what the output is?

For people looking for a way to find a tag with a hyphen in one of its attribute names, there is an answer in the documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
This segment of code will cause an error:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You should do this instead:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
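Note that the restriction applies to hyphens in attribute names used as keyword arguments. Hyphens inside attribute values, such as class or id values, are ordinary strings and need no special handling. The sketch below shows the common lookups side by side on a small hypothetical table. The data-foo attribute is added only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; data-foo is added for the demo.
html = """
<table>
<tbody class="yui-dt-data" data-foo="value">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even"><td>a</td></tr>
<tr id="yui-rec1" class="yui-dt-odd"><td>b</td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Hyphens in values are fine with keyword arguments.
by_class = soup.find("tbody", class_="yui-dt-data")
by_id = soup.find("tr", id="yui-rec1")

# Hyphens in attribute names need the attrs dict.
by_attrs = soup.find_all(attrs={"data-foo": "value"})

# CSS selectors handle hyphenated class names as well.
by_css = soup.select("tr.yui-dt-even")

print(by_class is not None, by_id["id"], len(by_attrs), [tr["id"] for tr in by_css])
```

If these lookups work on a saved copy of the page but not on the fetched one, the table is probably built by JavaScript after the page loads and is absent from the raw HTML.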

Just use select. bs4 4.7.1
import requests
from bs4 import BeautifulSoup as bs
html = '''
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
'''
soup = bs(html, 'lxml')
soup.select('.yui-dt-data')
