Finding the number of divs with a certain id in BeautifulSoup?


I am trying to find a way to count the number of divs with the id "blue". Is this possible in BeautifulSoup? Here is my code:
import BeautifulSoup
scanning = True
soup = BeautifulSoup.BeautifulSoup("<html><body><div id='blue'></div><div id='blue'></div><div id='purple'></div></body></html>")
blues = []
blues.append(soup.find("div", {"id": "blue"}))
print len(blues)

The find method fetches only the first occurrence, hence the output of 1. If you use find_all, it finds all of the occurrences and returns them as a list. In this case divs becomes a list of every div with id="blue", and you can check the length of that.
from bs4 import BeautifulSoup  # find_all and the parser argument belong to Beautiful Soup 4
soup = BeautifulSoup("<html><body><div id='blue'></div><div id='blue'></div><div id='purple'></div></body></html>", 'html.parser')
divs = soup.find_all("div", {"id": "blue"})
print(len(divs))

If only divs have the id blue, then you could just use a CSS selector (find_all("#blue") would look for a tag literally named "#blue" and match nothing):
divs = soup.select("#blue")
blues = len(divs)
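For comparison, here is a minimal self-contained sketch of both counting approaches using Beautiful Soup 4. It reuses the HTML string from the question; the variable names blue_divs and blue_by_css are just illustrative.

```python
from bs4 import BeautifulSoup

html = ("<html><body><div id='blue'></div><div id='blue'></div>"
        "<div id='purple'></div></body></html>")
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet with every match (empty if none)
blue_divs = soup.find_all("div", id="blue")
print(len(blue_divs))  # 2

# select takes a CSS selector and also returns a list of every match
blue_by_css = soup.select("div#blue")
print(len(blue_by_css))  # 2
```

Both calls return an empty list when nothing matches, so len() is safe without a separate None check.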

Related

I can't scrape salary text using beautiful soup

I want to scrape the text value of the salary (Confidential) using Beautiful Soup. Can someone help me figure this out?
import requests
from bs4 import BeautifulSoup
result=requests.get("https://wuzzuf.net/jobs/p/WBHqaf7WeZYe-Senior-Python-Developer-Trufla-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3")
src=result.content
soup = BeautifulSoup(src,"lxml")
After this I used
salary=soup.find_all("span",{"class":"css-4xky9y"})
but it returns an empty list.
ـــــــــــــــــــــــــــــــــــــــــــــــــــ
import requests
from bs4 import BeautifulSoup
result=requests.get("https://wuzzuf.net/jobs/p/WBHqaf7WeZYe-Senior-Python-Developer-Trufla-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3")
src=result.content
soup = BeautifulSoup(src,"lxml")
salary=soup.find("div",{'id':'app'})
salary_text=salary.contents[0]
h=salary_text.contents[4]
print(h)
When I print h, it gives me the value.
Please help me find the text value of the salary.
I have been trying for the past 5 days using the approaches mentioned above.

If the salary information is in a <span> tag, you can use a code snippet like:
salary = soup.find("span", {"class": "salary"}).text
If this doesn't work, or you are unable to find the specific tag and class, you can always use the find_all() method to search for <span> tags, then filter through the resulting list to find the tag that contains the salary information.
for span in soup.find_all("span"):
    if span is not None and \
       span.text is not None and \
       "salary" in span.text:
        salary = span.text
        break
else:
    # the loop finished without break, so no span mentioned a salary
    salary = None # or np.nan
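As a rough sketch of the same filtering idea, the snippet below runs against invented markup (the real page's structure is not known here, so the HTML and the helper name find_salary are assumptions for illustration). It uses the string argument of find with a case-insensitive regular expression instead of a manual loop.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, NOT the real page; it only illustrates the search
html = """
<div id="app">
  <span class="job-title">Senior Python Developer</span>
  <span class="css-4xky9y">Confidential</span>
  <span>Salary: Confidential</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def find_salary(soup):
    """Return the text of the first <span> mentioning 'salary', else None."""
    # string= matches against tag.string, so this only finds spans
    # whose sole child is a single piece of text
    span = soup.find("span", string=re.compile("salary", re.IGNORECASE))
    return span.get_text(strip=True) if span else None

salary = find_salary(soup)
print(salary)
```

Returning None when nothing matches avoids the AttributeError you would get from calling .text on a failed find.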

Webscraping - Beautifulsoup4 - Accessing indexed item in a find_all loop

How do I make it so that I can choose an item in the list in that for loop?
When I print it without brackets, I get the full list and every index seems to be the proper item that I need
for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname)
However, when I print it by specifying an index, whether it is inside the loop or outside, it returns "list index out of range"
for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname[2])
What's my problem here? How do I change my code so that I can scrape all the h3 names, yet at the same time be able to choose specific indexed h3 names when I want to?
Here's the entire code:
import requests
from bs4 import BeautifulSoup
source = requests.get("https://ca1lib.org/s/ginger") #gets the source of the site and returns it
soup = BeautifulSoup(source.text, 'html5lib')
for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname[2])

At first glance, assuming that each h3 element contains several book names ("book1" \n "book2" \n "book3"), your problem could be that some h3 elements split into fewer than 3 items, so bookname[2] indexes past the end of a shorter list.
On the other hand, if each h3 element holds only 1 item (h3 book1 h3), your loop visits one h3 tag per iteration (so the first iteration sees "h3 book1 h3" and the second "h3 book2 h3"). In that case you should build a list of all the h3.a.text values and then index into that list.
Hope this helps!

I forgot to append. I figured it out.
Here's my final code:
import requests
from bs4 import BeautifulSoup
source = requests.get("https://ca1lib.org/s/ginger") #gets the source of the site and returns it
soup = BeautifulSoup(source.text, 'html.parser')
liste = []
for h3_tag in soup.find_all('h3', itemprop="name"):
    liste.append(h3_tag.a.text.split("\n"))
    #bookname = h3.a.text #string
    #bookname = bookname.split('\n') #becomes list
print(liste[5])
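A compact variant of the same idea, as a sketch: collect one cleaned name per h3 into a flat list, then index the list after the loop. The HTML below is made up to stand in for the search results page, and booknames is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real search results
html = """
<h3 itemprop="name"><a href="/book/1">
Ginger Tea
</a></h3>
<h3 itemprop="name"><a href="/book/2">The Ginger Man</a></h3>
<h3 itemprop="name"><a href="/book/3">Ginger Snaps</a></h3>
"""
soup = BeautifulSoup(html, "html.parser")

# One entry per <h3>; get_text(strip=True) drops the surrounding newlines,
# and "if h3.a" skips any heading that has no link inside it
booknames = [
    h3.a.get_text(strip=True)
    for h3 in soup.find_all("h3", itemprop="name")
    if h3.a
]

print(booknames)
print(booknames[2])  # indexing happens on the full list, outside the loop
```

Because each name is stripped rather than split on newlines, every list entry is a plain string and the index refers to a book, not to a fragment of one heading.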

Python 2 Beautiful Soup, get text from all tags

Trying to get the text from all tags that have the class task-topic-deprecated, however I only seem to be able to get one.
Not a duplicate of BeautifulSoup get_text from find_all - This issue uses multiple class names and so the working syntax is slightly different, class_ as opposed to attrs={'class':'
Source page:
https://developer.apple.com/documentation/cfnetwork?language=objc
The output would be any string that is struck out on the page above:
CFFTPCreateParsedResourceListing
kCFFTPResourceGroup
...etc
find_next() doesn't seem to move to the next item how I am expecting it to, and prints out the text I have already.
page = requests.get("https://developer.apple.com/documentation/cfnetwork?language=objc")
soup = BeautifulSoup(page.content, 'html.parser')
aRow = soup.find('a', attrs={'class':'task-topic-deprecated has-adjacent-element symbol-name'}).get_text()
print aRow
bRow = soup.find('a', attrs={'class':'task-topic-deprecated has-adjacent-element symbol-name'}).find_next().get_text()
print bRow
cRow = soup.find('a', attrs={'class':'task-topic-deprecated has-adjacent-element symbol-name'}).find_next().find_next().get_text()
print cRow
CFFTPCreateParsedResourceListing
CFFTPCreateParsedResourceListing
CFFTPCreateParsedResourceListing
Also tried putting it in a loop from various things I have found on Stack Overflow, but it seems to still only grab 1 item as per above.
Also tried with xPath, but this doesn't grab anything and prints out a blank list
tree = html.fromstring(page.content)
allItems = tree.xpath('//a[@class="task-topic-deprecated has-adjacent-element symbol-name"]/text()')
print allItems

I think you are doing it wrong: instead of find, you can use the find_all method to get every result.
for i in soup.find_all('a', class_='task-topic-deprecated has-adjacent-element symbol-name'):
    print i.get_text()
Maybe this could help.
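One caveat worth sketching: passing the whole multi-class string to class_ matches only when the class attribute is exactly that string, in that order. A CSS selector on the single class of interest does not depend on order. The markup below is invented for the example (including the SomeCurrentSymbol name), and the code is written for Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; note the different class order on the last link
html = """
<a class="task-topic-deprecated has-adjacent-element symbol-name">CFFTPCreateParsedResourceListing</a>
<a class="symbol-name has-adjacent-element">SomeCurrentSymbol</a>
<a class="symbol-name task-topic-deprecated has-adjacent-element">kCFFTPResourceGroup</a>
"""
soup = BeautifulSoup(html, "html.parser")

# select matches every <a> that carries the class, whatever else it has
deprecated = [a.get_text(strip=True) for a in soup.select("a.task-topic-deprecated")]
print(deprecated)

# The exact-string form only matches the first link here
exact = soup.find_all("a", class_="task-topic-deprecated has-adjacent-element symbol-name")
print(len(exact))
```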

Extracting only the bullet points after a 'strong' title from a website using python

I want to extract only the points listed as bullets under the title 'WHAT RESPONDENTS ARE SAYING …' in this webpage.
I am able to achieve it with this code:
import requests
URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
r = requests.get(URL)
page = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'lxml')
strong_el = soup.find('strong',text='WHAT RESPONDENTS ARE SAYING …')
strong_el.find_all_next('li')[9]
But the problem here is that I have to know how many bullet points are listed (there are 10 in this case, so it returns valid values up to [9]). What is the best way to extract all of the bullet points without knowing how many of them are listed? Also, I need only the text and not the HTML.

You can use find_next_sibling to get the ul element next to strong which contains these li elements. Then get all the children of ul which are li elements:
ul_tag = strong_el.find_next_sibling('ul')
# recursive=False keeps only direct li children and skips whitespace text nodes
for li_tag in ul_tag.find_all('li', recursive=False):
    print(li_tag.get_text())

You should find the ul tag first, since it contains all the li tags:
In [3]: ul = strong_el.find_next('ul')
In [4]: for li in ul.find_all('li'):
   ...:     print(li.text)
out:
“Demand very steady to start the year.” (Chemical Products)
“January revenue target slightly lower following a big December shipment month.” (Computer & Electronic Products)
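Put together as a standalone sketch, the approach looks like this. The HTML is a small stand-in assumed for illustration (the real page layout may differ); the two bullet texts are the ones shown in the output above.

```python
from bs4 import BeautifulSoup

# Assumed structure: a <strong> title followed somewhere later by a <ul>
html = """
<p><strong>WHAT RESPONDENTS ARE SAYING …</strong></p>
<ul>
  <li>“Demand very steady to start the year.” (Chemical Products)</li>
  <li>“January revenue target slightly lower following a big December shipment month.” (Computer &amp; Electronic Products)</li>
</ul>
<ul><li>Unrelated list item</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

strong_el = soup.find("strong", string="WHAT RESPONDENTS ARE SAYING …")
# find_next walks forward in document order and stops at the first <ul>
ul = strong_el.find_next("ul")
# Text only, no HTML, and no need to know the number of bullets in advance
bullets = [li.get_text(strip=True) for li in ul.find_all("li")]

for bullet in bullets:
    print(bullet)
```

Scoping the search to the first ul after the title keeps later, unrelated lists out of the result, which find_all_next('li') would include.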

How to get a nested element in beautiful soup

I am struggling with the syntax required to grab some hrefs in a td.
The table, tr and td elements don't have any classes or ids.
If I wanted to grab the anchor in this example, what would I need?
< tr >
< td > < a >...
Thanks

As per the docs, you first make a parse tree:
import BeautifulSoup
html = "<html><body><tr><td><a href='foo'/></td></tr></body></html>"
soup = BeautifulSoup.BeautifulSoup(html)
and then you search in it, for example for <a> tags whose immediate parent is a <td>:
for ana in soup.findAll('a'):
    if ana.parent.name == 'td':
        print ana["href"]

Something like this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [td.find('a') for td in soup.findAll('td')]
That should find the first "a" inside each "td" in the html you provide. You can tweak td.find to be more specific or else use findAll if you have several links inside each td.
UPDATE: re Daniele's comment, if you want to make sure you don't have any None values in the list, then you could modify the list comprehension thus:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [a for a in (td.find('a') for td in soup.findAll('td')) if a]
Which basically just adds a check to see if you have an actual element returned by td.find('a').
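The answers above use the older BeautifulSoup 3 API (findAll, Python 2 print). A rough Beautiful Soup 4 equivalent, assuming the bs4 package and an invented sample table, can express "an a directly inside a td" as a CSS child selector:

```python
from bs4 import BeautifulSoup

# Invented sample table; the middle cell has no link on purpose
html = (
    "<html><body><table><tr>"
    "<td><a href='foo'>first</a></td>"
    "<td>no link here</td>"
    "<td><a href='bar'>second</a></td>"
    "</tr></table></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# "td > a[href]" selects <a> tags that are direct children of a <td>
# and that actually have an href attribute
hrefs = [a["href"] for a in soup.select("td > a[href]")]
print(hrefs)
```

Cells without a link simply produce no match, so there are no None entries to filter out afterwards.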
