Using findAll two times with BeautifulSoup - Python

I'm scraping a webpage with some tables. I want to build two lists, and the site uses the class 'txt' for two data types. I need to extract those data types separately, so I'm trying to filter out the first type, extract it, and then do the same for the other type.
I made this code:
import requests
from bs4 import BeautifulSoup

# url and header are defined earlier in the original question
r = requests.get(url, headers=header)
soup = BeautifulSoup(r.content, 'html.parser')
page = soup.find('div', class_='content')
labels = page.findAll('td', class_='label')
Output:
[<td class="label w15"><span class="help tips" title="Code">?</span><span class="txt">Paper</span></td>,
<td class="label destaque w2"><span class="help tips" title="Last value">?</span><span class="txt">Value</span></td>]
I need what is inside those <span class="txt">Paper</span> spans.
When I try this:
myfilter = labels.findAll('span', class_='txt')
I get this error:
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Why? How can I do this?

As the error message says, you can't use a list of results as a result by itself. You need to loop over them.
myfilter = []
for label in labels:
    myfilter.extend(label.findAll('span', class_='txt'))
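If you prefer, a CSS selector lets BeautifulSoup do the descent in one call, so no explicit loop is needed. A minimal sketch, assuming the same page object as above:
# td.label span.txt matches the spans from the question's markup
myfilter = page.select('td.label span.txt')
texts = [span.get_text() for span in myfilter]  # ['Paper', 'Value', ...]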


How to get the desired value in BeautifulSoup?

Suppose we have the html code as follows:
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'lxml')
I want to get the name xyz. Then, I write
soup.find('div',{'class':'name'})
However, it returns abc.
How to solve this problem?
Beautiful Soup returns the first div it finds whose classes include name; the first div has both the dt and name classes, so that is the one it selects.
Searching by div alone still matches both divs. Since soup('div') returns an array, you can grab the second div with print(soup('div')[1].text). If you want to print all the divs, use this code:
for i in range(len(soup('div'))):
    print(soup('div')[i].text)
And as pointed out in Ankur Sinha's answer, if you want to select all the divs that have only class name, then you have to use select, like this:
soup.select('div[class=name]')[0].get_text()
But if there are multiple divs that satisfy this property, use this:
for i in range(len(soup.select('div[class=name]'))):
    print(soup.select('div[class=name]')[i].get_text())
Just to expand on Ankur Sinha's answer: when you use select (or even just soup()), you get back an array, because there can be multiple matches; that's why I used len() to get its length and then ran a for loop over the indices starting from 0.
Indexing like that gives you a specific div rather than the whole array, and if you called get_text() on the array itself you would get an error, because the array is NOT text.
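A cleaner way to write the same loop is to iterate over the list that select returns instead of indexing into it. A small sketch reusing the html string from the question:
from bs4 import BeautifulSoup

html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'lxml')

# select() returns a list of tags; loop over it directly
for div in soup.select('div[class=name]'):
    print(div.get_text())  # xyz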
This blog was helpful in doing what you would like, which is to explicitly find a tag with a specific class attribute:
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'html.parser')
soup.find(lambda tag: tag.name == 'div' and tag['class'] == ['name'])
Output:
<div class="name">xyz</div>
You can also do it without a lambda, using select to find the exact class name, like this:
soup.select("div[class = name]")
Will give:
[<div class="name">xyz</div>]
And if you want the value between tags:
soup.select("div[class=name]")[0].get_text()
Will give:
xyz
In case you have multiple divs with class = 'name', you can do:
for i in range(len(soup.select("div[class=name]"))):
    print(soup.select("div[class=name]")[i].get_text())
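For contrast, a class selector like div.name matches any div that has name among its classes, so it would also pick up the first div; the attribute selector div[class=name] only matches when the whole class attribute is exactly name. A small sketch with the question's html:
from bs4 import BeautifulSoup

html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'lxml')

print(soup.select('div.name'))         # [<div class="dt name">abc</div>, <div class="name">xyz</div>]
print(soup.select('div[class=name]'))  # [<div class="name">xyz</div>]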
Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
This might work for you; note that it is contingent on the target div being the second div in the html.
from bs4 import BeautifulSoup

html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, features='lxml')
print(soup('div')[1].text)

Problem with printing in a single line extracted text from multiple links that are contained in divs with the same class

I am trying to extract text from a page that has several divs with the same class. Each div contains a different number of links with text. The extracted text from each div needs to be printed on a single line.
If, for example, one div contains three links and the other div contains two links, I want to extract the text from the three links in the first div and print the results on a single line, then extract the text from the two links in the second div and print it on a new line. I also want to store the extracted data for each div as a single item in an array.
The code below correctly prints the combined data; however, in addition to the extracted text, it also prints the <a> tags and the URLs. I tried to add the text attribute (content.text), but I got the following error:
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("URL")
bs = BeautifulSoup(html.read(), "html.parser")
int_array = []
int_data = bs.findAll("div", {"class": "new_titles"})
for div in int_data:
    content = div.find_all("a")
    int_array.append(content)
    print(content)
Try the code below. I think this is what you are after.
bs = BeautifulSoup(html.read(), "html.parser")
int_array = []
int_data = bs.findAll("div", {"class": "new_titles"})
for div in int_data:
    item = [a.text.strip() for a in div.find_all("a")]
    content = ' '.join(item)
    int_array.append(content)
    print(content)
The error message says it all: you are treating a list of hyperlinks (div.find_all("a") will give you many) like a single item if you just put .text after it.
Similar to the <div> elements, you need to loop over the links and make use of the text of each individual link.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://stackoverflow.com/questions/57732994/problem-with-printing-in-a-single-line-extracted-text-from-multiple-links-that-a/57733094?noredirect=1#comment101906332_57733094")
bs = BeautifulSoup(html.read(), "html.parser")
int_data = bs.findAll("div")
for div in int_data:
    int_array = []
    content = div.find_all("a")
    for link in content:
        int_array.append(link.text.replace("\n", "").replace("\r", ""))
    print("***" + " ".join(int_array) + "***")

How to locate td class in html code using python?

I have a class in my HTML code. I need to locate the td with class "CURRENTLOCATION" using Python.
CODE :
<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>
Below are the codes I tried.
First attempt:
My_result = page_soup.find_element_by_class_name('CURRENTLOCATION')
Getting "TypeError: 'NoneType' object is not callable" error. Second attempt:
My_result = page_soup.find(‘td’, attrs={‘class’: ‘CURRENTLOCATION’})
Getting "invalid character in identifier" error.
Can anyone please help me locate a class in html code using python?
from bs4 import BeautifulSoup

sdata = '<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
# html.parser keeps the bare <td> fragment intact (lxml would discard a stray table cell)
soup = BeautifulSoup(sdata, 'html.parser')
mytds = soup.findAll("td", {"class": "CURRENTLOCATION"})
for td in mytds:
    print(td)
I tried your code (the second example), and the problem is the quotation marks you use. They are curly apostrophes (’, Unicode code point \u2019), while the Python interpreter requires straight single (') or double (") quotation marks.
Changing them, I can find the tag:
>>> bs.find('td', attrs={'class': 'CURRENTLOCATION'})
<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>
About your first example: I do not know where you found a reference to the method find_element_by_class_name (it looks like Selenium's API rather than BeautifulSoup's), but it is not implemented by the BeautifulSoup class. The class instead implements the __getattr__ method, a special method that is invoked any time you try to access a non-existing attribute. Here is an excerpt of that method:
def __getattr__(self, tag):
    #print "Getattr %s.%s" % (self.__class__, tag)
    if len(tag) > 3 and tag.endswith('Tag'):
        ...  # (branch body elided in this excerpt)
    # We special case contents to avoid recursion.
    elif not tag.startswith("__") and not tag == "contents":
        return self.find(tag)
When you try to access the attribute find_element_by_class_name, you are actually looking for a tag with the same name.
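You can see that fallback, and the exact TypeError from the question, with a minimal sketch (the td fragment is borrowed from the question; any soup behaves the same way):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<td class="CURRENTLOCATION">Metrics</td>', 'html.parser')

# an unknown attribute falls through __getattr__ and becomes a tag search,
# i.e. soup.find('find_element_by_class_name'), which returns None here
print(soup.find_element_by_class_name)  # None

# calling that None reproduces the error from the first attempt
try:
    soup.find_element_by_class_name('CURRENTLOCATION')
except TypeError as err:
    print(err)  # 'NoneType' object is not callable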
There is a function in BeautifulSoup for this.
You can get all the desired tags and specify the attributes you are looking for in the find_all function. It returns a list of all the elements that fulfill the criteria.
from bs4 import BeautifulSoup

text = '<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
# html.parser keeps the bare <td> fragment intact (lxml would discard a stray table cell)
soup = BeautifulSoup(text, 'html.parser')
output_list = soup.find_all('td', {"class": "CURRENTLOCATION"})  # all td tags whose class attribute is set to CURRENTLOCATION
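To pull just the text out of the matched cell, index into the list before asking for its text. A small usage sketch building on output_list above:
# output_list is a plain list of matching tags, not a single tag
if output_list:
    print(output_list[0].get_text(strip=True))  # Metrics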

Scraping PFR Football Data with Python for a Beginner

Background: I'm trying to scrape some tables from this pro-football-reference page. I'm a complete newbie to Python, so a lot of the technical jargon ends up lost on me, and in trying to understand how to solve the issue, I can't figure it out.
Specific issue: because there are multiple tables on the page, I can't figure out how to get Python to target the one I want. I'm trying to get the Defense & Fumbles table. The code below is what I've got so far; it's from this tutorial, which uses a page from the same site, but one that only has a single table.
sample code:
#url we are scraping
url = "https://www.pro-football-reference.com/teams/nwe/2017.htm"
#html from the given url
html=urlopen(url)
# make soup object of html
soup = BeautifulSoup(html)
# we see that soup is a beautifulsoup object
type(soup)
#
column_headers = [th.getText() for th in
soup.findAll('table', {"id": "defense").findAll('th')]
column_headers #our column headers
Attempts made: I realized that the tutorial's method would not work for me, so I attempted to change the soup.findAll portion to target the specific table. But I repeatedly get an error saying:
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
When changing it to find, the error becomes:
AttributeError: 'NoneType' object has no attribute 'find'
I'll be absolutely honest that I have no idea what I'm doing or what these mean. I'd appreciate any help in figuring out how to target that data and then scrape it.
Thank you.
You're missing a "}" in the dict after the word "defense". Try the code below and see if it works.
column_headers = [th.getText() for th in
soup.findAll('table', {"id": "defense"}).findAll('th')]
First off, you want to use soup.find('table', {"id": "defense"}).findAll('th') - find one table, then find all of its 'th' tags.
The other problem is that the table with id "defense" is commented out in the html on that page:
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
<table class="sortable stats_table" id="defense" data-cols-to-freeze=2><caption>Defense & Fumbles Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
etc. I assume that javascript is un-hiding it. BeautifulSoup doesn't parse the text of comments, so you'll need to find the text of all the comments on the page as in this answer, look for one with id="defense" in it, and then feed the text of that comment into BeautifulSoup.
Like this:
from bs4 import Comment
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
defenseComment = next(c for c in comments if 'id="defense"' in c)
defenseSoup = BeautifulSoup(str(defenseComment), 'html.parser')
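From there, the header extraction from the question should work against the re-parsed comment. A sketch, assuming the comment really does contain the id="defense" table:
# search the soup built from the comment text, not the original page soup
defense_table = defenseSoup.find('table', {"id": "defense"})
column_headers = [th.getText() for th in defense_table.findAll('th')]
print(column_headers)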

Beautiful Soup 'ResultSet' object has no attribute 'text'

from bs4 import BeautifulSoup
import urllib.request
import win_unicode_console
win_unicode_console.enable()
link = ('https://pietroalbini.io/')
req = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
url = urllib.request.urlopen(req).read()
soup = BeautifulSoup(url, "html.parser")
body = soup.find_all('div', {"class":"wrapper"})
print(body.text)
Hi, I have a problem with Beautiful Soup. If I run this code without ".text" at the end, it shows me a list of divs, but if I add ".text" at the end, I get this error:
Traceback (most recent call last):
File "script.py", line 15, in
print(body.text)
AttributeError: 'ResultSet' object has no attribute 'text'
find_all returns a ResultSet object, which you can iterate over using a for loop. What you can do is:
for wrapper in soup.find_all('div', {"class": "wrapper"}):
    print(wrapper.text)
If you type:
print(type(body))
you'll see that body is <class 'bs4.element.ResultSet'>. It holds all the elements that match the class. You can either iterate over them:
for div in body:
    print(div.text)
Or, if you know there is only one such div, you can use find instead:
div = soup.find('div', {"class":"wrapper"})
div.text
Probably should have posted this as an answer, so, as stated in the comments almost verbatim, your code should be the following:
for div in body:
    print(div.text)
Or whatever naming scheme you prefer.
The find_all method returns a ResultSet, loosely speaking a list, of the items BeautifulSoup has found matching your criteria after parsing the source webpage's HTML, either recursively or non-recursively depending upon how you search.
As the error says, the resulting set of objects has no text attribute, since it isn't a single element but rather a collection of them.
However, the items inside the result set (should any be found) do.
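A compact way to get the text of every matching element, reusing the wrapper divs from the question, is a list comprehension over the result set. A small sketch:
texts = [div.text.strip() for div in soup.find_all('div', {"class": "wrapper"})]
print(texts)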
You can view the documentation here
