Pulling text from a scraped page with BeautifulSoup - python

New to programming and web scraping and having some trouble getting BeautifulSoup to pull only the text from a given page.
Here's what I'm working with right now:
import requests
from bs4 import BeautifulSoup
url = 'https://www.tsn.ca/panarin-tops-2019-free-agent-frenzy-class-1.1303592'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
players = soup.find_all('td').text
print(players)
Which returns the following:
Traceback (most recent call last):
  File "tsn.py", line 10, in <module>
    players = soup.find_all('td').text
  File "/home/debian1/.local/lib/python3.5/site-packages/bs4/element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I have also seen .get_text() used in BS documentation but that returns the same error.

Your approach was almost correct. find_all() returns a list-like ResultSet, so all you have to do is iterate over it and read the text of each element. I have corrected the code and put it below.
import requests
from bs4 import BeautifulSoup
url = 'https://www.tsn.ca/panarin-tops-2019-free-agent-frenzy-class-1.1303592'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
# This is how you should have extracted the text from the ResultSet
players = [elem.text for elem in soup.find_all('td')]
print(players)

find_all() returns a list of all elements that match your specifications. Even if only a single item or no item is found, it returns [item] or [] respectively. To get the text, you need to loop over the items like this:
players_list = soup.find_all('td')
for player in players_list:
    print(player.text)
I use .getText() in my scripts; .text is a property that returns the same result as calling .get_text() with no arguments.
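A minimal sketch to check this on an inline snippet. The markup below is made up for illustration and is not taken from the TSN page. In bs4, .text is a property backed by get_text(), and getText() is an alias for the same method, so all three should produce the same strings when called without arguments.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page
html = (
    "<table>"
    "<tr><td>1</td><td>Artemi Panarin</td></tr>"
    "<tr><td>2</td><td>Erik Karlsson</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (list-like), not a single Tag
cells = soup.find_all("td")

# Three spellings of the same operation, applied per element
via_text = [cell.text for cell in cells]
via_get_text = [cell.get_text() for cell in cells]
via_getText = [cell.getText() for cell in cells]

print(via_text)
```

get_text() also accepts arguments such as strip=True, which the .text property cannot take, so the method form is handy when the cells contain surrounding whitespace.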

That error indicates that you should iterate over each item like this:
players = [item.text for item in soup.find_all('td')] # Iterate over every item and extract the text
print(players)
print("".join(players)) # If you want all the text in one string
Hope this helps!

This is a working script:
import requests
from bs4 import BeautifulSoup
url = 'https://www.tsn.ca/panarin-tops-2019-free-agent-frenzy-class-1.1303592'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
players = []
tbl = soup.find('table', attrs={'class':'stats-table-scrollable article-table'})
tbl_body = tbl.find('tbody')
rows = tbl_body.find_all('tr')
for row in rows:
    columns = row.find_all('td')
    columns = [c.text for c in columns]
    players.append(columns[1])
print(players)
Result:
['Artemi Panarin', 'Erik Karlsson', 'Sergei Bobrovsky', 'Matt Duchene', 'Jeff Skinner', 'Anders Lee', 'Joe Pavelski', 'Brock Nelson', 'Tyler Myers', 'Mats Zuccarello', 'Alex Edler', 'Gustav Nyquist', 'Jordan Eberle', 'Micheal Ferland', 'Jake Gardiner', 'Ryan Dzingel', 'Kevin Hayes', 'Brett Connolly', 'Marcus Johansson', 'Braydon Coburn', 'Wayne Simmonds', 'Brandon Tanev', 'Joonas Donskoi', 'Colin Wilson', 'Ron Hainsey']

Related

List Converts to Blank Dataframe

My list xfrs returns a blank DataFrame when I convert it. Does anyone see any issues with the code?
I can append to the list and print it fine, but the resulting DataFrame, transfers, is blank.
url2 = 'https://247sports.com/Season/2020-Football/TransferPortalPositionRanking/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = requests.get(url2, headers = headers)
soup = BeautifulSoup(response.content, 'html.parser')
xfrs = []
schools = []
for li in soup.findAll('li', attrs={'class':'transfer-player'}):
    xfrs.append(li.find('a').contents)
    schools.append(li.find('li', attrs={'class':'destination'}))
transfers = pd.DataFrame(xfrs, columns=['Players'])
print(transfers)
As mentioned, .contents returns a list of BeautifulSoup objects, so you need to use, for example, .text to get the name. Also take care with your selection; it should be more specific.
To store the scraped data in a DataFrame, try collecting it as a list of dicts:
data.append({
    'Player': li.h3.text,
    'Destination': destination['alt'] if (destination := li.select_one('img[class="logo"]')) else None
})
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup
url2 = 'https://247sports.com/Season/2020-Football/TransferPortalPositionRanking/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = requests.get(url2, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
data = []
for li in soup.find_all('li', attrs={'class':'transfer-player'}):
    # The walrus operator (:=) requires Python 3.8 or newer
    data.append({
        'Player': li.h3.text,
        'Destination': destination['alt'] if (destination := li.select_one('img[class="logo"]')) else None
    })
pd.DataFrame(data)
Output
Player          Destination
JT Daniels      Georgia
KJ Costello     Mississippi State
Jamie Newman    Georgia
...             ...

Attribute Error :: 'list' object has no attribute 'split'

I am trying to split the links of the images.
What is wrong with my code?
mainURL = "https://w.cima4u.ws/category/%d8%a7%d9%81%d9%84%d8%a7%d9%85-%d9%83%d8%b1%d8%aa%d9%88%d9%86-movies-anime/"
headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"}
s = requests.Session()
r = s.get(mainURL)
soup = BeautifulSoup(r.content, "html.parser")
for movie in soup.findAll('li', {'class':'MovieBlock'}):
    movieLink = movie.find('a')
    imageLink = movie.find('div', {'class':'Half1'})
    imageLink = (['style'])
    imageLink = imageLink.split("url(")[1][:-2]
    print(imageLink)
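Since you didn't add the full stack trace, I suppose the error originates in this line:
imageLink = imageLink.split("url(")[1][:-2]
split() can only be called on a string, not on a list. In this case imageLink is a list, because the previous line, imageLink = (['style']), rebinds it to the literal list ['style'] instead of reading the tag's style attribute.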
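A minimal sketch of the likely intended fix, assuming the div carries an inline style attribute of the form background-image:url(...); (the sample value below is made up, since the real page markup is not shown). On the real page the string would come from indexing the tag, for example imageLink['style'], rather than from the literal list ['style']. The helper name extract_image_url is hypothetical.

```python
import re


def extract_image_url(style):
    """Return the URL inside url(...) in a CSS style string, or None if absent."""
    match = re.search(r"url\(\s*['\"]?(.*?)['\"]?\s*\)", style)
    return match.group(1) if match else None


# Hypothetical style value, standing in for imageLink['style']
style = "background-image:url(https://example.com/poster.jpg);"

# Regex version: tolerates optional quotes and a missing trailing semicolon
print(extract_image_url(style))

# The slicing from the question works too, but only when the string ends in ");"
print(style.split("url(")[1][:-2])
```

The regex variant returns None instead of raising IndexError when a block has no url(...) in its style, which may matter if some list items lack a background image.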

Parsing a table - tr.findall('td') - TypeError: 'NoneType' object is not callable

Does anyone know what causes this error? The error message doesn't make much sense to me, because I followed everything the person was typing. And yes, the website is a demo website for web scraping purposes.
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"}
response = requests.get("https://shubhamsayon.github.io/python/demo_html", headers = headers)
webpage = response.content
soup = BeautifulSoup(webpage, "html.parser")
for tr in soup.find_all('tr'):
    topic = "TOPIC: "
    url = "URL: "
    values = [data for data in tr.findall('td')]
    for value in values:
        print(topic, value.text)
        topic = url
C:\Users\Andy\PycharmProjects\pythonProject\venv\Scripts\python.exe C:/Users/Andy/PycharmProjects/pythonProject/main.py
Traceback (most recent call last):
  File "C:\Users\Andy\PycharmProjects\pythonProject\main.py", line 14, in <module>
    values = [data for data in tr.findall('td')]
TypeError: 'NoneType' object is not callable
Process finished with exit code 1
The method is called find_all(), not findall(). BeautifulSoup treats an unknown attribute such as tr.findall as a lookup for a child tag named findall, which returns None, and calling None raises this TypeError.
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"}
response = requests.get("https://shubhamsayon.github.io/python/demo_html", headers = headers)
webpage = response.content
soup = BeautifulSoup(webpage, "html.parser")
for tr in soup.find_all('tr'):
    topic = "TOPIC: "
    url = "URL: "
    values = [data for data in tr.find_all('td')]
    for value in values:
        print(topic, value.text)
        topic = url
Output:
TOPIC: __str__ vs __repr__ In Python
URL: https://blog.finxter.com/python-__str__-vs-__repr__/
....
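To see why the misspelling produces this particular TypeError, here is a small sketch on an inline table. The markup is hypothetical and only mimics the two-column layout of the demo page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-column table standing in for the demo page
html = "<table><tr><td>Topic A</td><td>https://example.com/a</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
tr = soup.find("tr")

# Unknown attribute names are treated as a search for a child tag of that
# name, so tr.findall behaves like tr.find("findall"). There is no <findall>
# tag inside the row, so the lookup yields None.
misspelled = tr.findall
print(misspelled)  # None, so misspelled('td') would raise the TypeError

# The correctly spelled method returns the cells
cells = [td.text for td in tr.find_all("td")]
print(cells)
```

The older camelCase spelling findAll() also exists in bs4, which is probably why the all-lowercase findall looks plausible at first glance.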
You can also use the pandas module to fetch the table directly from the URL:
import pandas as pd
df=pd.read_html("https://shubhamsayon.github.io/python/demo_html")[0]
df
Output:
   TOPIC                                            LINK
0  __str__ vs __repr__ In Python                    https://blog.finxter.com/python-__str__-vs-__r...
1  How to Read a File Line-By-Line and Store Into.  https://blog.finxter.com/how-to-read-a-file-li...
2  How To Convert a String To a List In Python?     https://blog.finxter.com/how-to-convert-a-stri...
3  How To Iterate Through Two Lists In Parallel?    https://blog.finxter.com/how-to-iterate-throug...
4  Python Scoping Rules – A Simple Illustrated.     https://blog.finxter.com/python-scoping-rules-...
5  Flatten A List Of Lists In Python                https://blog.finxter.com/flatten-a-list-of-lis...

scrape data from a table that has a "show all" button

I am trying to scrape the "ALL EQUITIES" table at the following link, which has a "show all" button:
https://www.trading212.com/en/Trade-Equities
I need the expanded table, not just the rows that are visible before the table is expanded.
Here is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
header = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"}
url = 'https://www.trading212.com/en/Trade-Equities'
r = requests.get(url, headers = header)
soup = bs(r.content, 'html.parser')
all_equities = soup.find('table' , class_ = 'I cant find the name of the class')
print(all_equities)
The contents are actually in a div, not a table. You can grab all of the content by using the class that is on each of the divs.
all_equities = soup.find_all('div' , class_ = 'js-search-row')
will give you a list of all of the divs with the equities in them.
Give this code a try:
all_equities = soup.find_all('div' , class_ = 'd-row js-search-row js-acc-wrapper')
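The two answers above pass different values to class_, and both can work. A small sketch of the difference on inline markup follows; the class names are copied from the answers, while the row contents are made up. Passing a single class matches any element that has that class among others, and passing the full string matches the exact value of the class attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names come from the answers above
html = """
<div class="d-row js-search-row js-acc-wrapper">Equity A</div>
<div class="d-row js-search-row js-acc-wrapper">Equity B</div>
<div class="d-row other">Not an equity row</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches every div that has js-search-row as one of its classes
by_single_class = soup.find_all("div", class_="js-search-row")

# Matches only divs whose class attribute is exactly this string
by_full_string = soup.find_all("div", class_="d-row js-search-row js-acc-wrapper")

# CSS selector form: order of classes does not matter here
by_css_selector = soup.select("div.d-row.js-search-row")

names = [div.get_text(strip=True) for div in by_single_class]
print(names)
```

The single-class form is usually less brittle, because the exact-string form stops matching if the site reorders or adds classes.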

Scrape and return a value from within a div class with Python

Any idea how I can retrieve the price (now 2917.99) from the source of this page: https://www.emag.ro/televizor-led-smart-samsung-138-cm-55ru7402-4k-ultra-hd-ue55ru7402uxxh/pd/DTN2XZBBM/
If I look up the class p.product-new-price I get None.
I have managed to get the title, but not the price.
What I have done so far:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.emag.ro/televizor-led-smart-samsung-138-cm-55ru7402-4k-ultra-hd-ue55ru7402uxxh/pd/DTN2XZBBM/'
headers = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('title')
div = soup.find('div', {"class" : 'product-new-price'})
text = div.string
print(text)
The markup looks like this, and I want to extract 2917 as an int:
<div class="product-highlight product-page-pricing">
<p class="product-new-price">
2.917<sup>99</sup> <span>Lei</span>
Thank you very much!
OK, with minor modifications:
For me, the class product-new-price is on the p element, not on a div.
I am assuming there will always be a <sup> tag after the main price.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.emag.ro/televizor-led-smart-samsung-138-cm-55ru7402-4k-ultra-hd-ue55ru7402uxxh/pd/DTN2XZBBM/'
headers = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('title')
p = soup.find('p', {"class" : 'product-new-price'})
# Get the text before <sup> tag
value = p.find('sup').previous_sibling.strip()
print("Value: {}".format(value))
# Keep only numbers
value = ''.join(c for c in value if c.isdigit())
price = int(value)
print("Price: {}".format(price))
The above prints:
$ python3 ./test.py
Value: 2.917
Price: 2917
With small changes you can also add the missing .99 if it is required.
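One way to keep the .99, sketched under the assumption that the text before <sup> uses "." as a thousands separator and that <sup> holds the decimal digits, as in the snippet from the question. parse_price is a hypothetical helper; on the real page its arguments would come from p.find('sup').previous_sibling and p.find('sup').text.

```python
from decimal import Decimal


def parse_price(whole_text, cents_text):
    """Combine strings such as '2.917' and '99' into Decimal('2917.99')."""
    # Keep only digits, which drops the thousands separator and whitespace
    whole = "".join(c for c in whole_text if c.isdigit())
    cents = "".join(c for c in cents_text if c.isdigit())
    # Fall back to zero cents when the <sup> part is empty
    return Decimal("{}.{}".format(whole, cents or "0"))


# Sample values copied from the markup in the question
price = parse_price("2.917", "99")
print("Price: {}".format(price))
```

Decimal is used here instead of float so the cents are stored exactly; int(price) still gives 2917 when only the whole part is needed.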
