I can't scrape salary text using Beautiful Soup - python

I need help scraping the text value of the salary (Confidential) using Beautiful Soup.
import requests
from bs4 import BeautifulSoup
result=requests.get("https://wuzzuf.net/jobs/p/WBHqaf7WeZYe-Senior-Python-Developer-Trufla-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3")
src=result.content
soup = BeautifulSoup(src,"lxml")
After this I used
salary=soup.find_all("span",{"class":"css-4xky9y"})
but it returns an empty list.
ـــــــــــــــــــــــــــــــــــــــــــــــــــ
import requests
from bs4 import BeautifulSoup
result=requests.get("https://wuzzuf.net/jobs/p/WBHqaf7WeZYe-Senior-Python-Developer-Trufla-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3")
src=result.content
soup = BeautifulSoup(src,"lxml")
salary=soup.find("div",{'id':'app'})
salary_text=salary.contents[0]
h=salary_text.contents[4]
print(h)
When I print h, it gives me the value.
Please help me find the text value of the salary.
I have been trying for the past 5 days using the approaches above.

If the salary information is in a <span> tag, you can use a snippet like this:
salary = soup.find("span", {"class": "salary"}).text
If this doesn't work, or you can't find the specific tag and class, you can use the find_all() method to get every <span> tag, then filter the resulting list for the tag that contains the salary information.
for span in soup.find_all("span"):
    if span is not None and \
       span.text is not None and \
       "salary" in span.text:
        salary = span.text
        break
else:
    salary = None  # or np.nan
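The two approaches above can be combined into one helper. This is only a sketch run against a made-up HTML fragment, not the real job page: the markup, the `salary` class name, and the helper name `find_salary_text` are assumptions for illustration. It tries the class lookup first, falls back to scanning every <span>, and never calls a method on None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page will differ.
SAMPLE_HTML = """
<div id="app">
  <span class="title">Senior Python Developer</span>
  <span>Salary: Confidential</span>
</div>
"""

def find_salary_text(html, class_name="salary"):
    """Return the text of the first span that looks like a salary, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # First attempt: a span carrying the expected class (it may not exist).
    tag = soup.find("span", class_=class_name)
    if tag is not None:
        return tag.get_text(strip=True)
    # Fallback: scan every span for the word "salary", ignoring case.
    for span in soup.find_all("span"):
        text = span.get_text(strip=True)
        if "salary" in text.lower():
            return text
    return None

salary = find_salary_text(SAMPLE_HTML)
print(salary)
```

Comparing on `text.lower()` matters here because a label such as "Salary" would not match a case-sensitive search for "salary".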

Related

Extracting only the bullet points after a 'strong' title from a website using python

I want to extract only the points listed as bullets under the title 'WHAT RESPONDENTS ARE SAYING …' in this webpage.
I am able to achieve it with this code:
import requests
URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
r = requests.get(URL)
page = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'lxml')
strong_el = soup.find('strong',text='WHAT RESPONDENTS ARE SAYING …')
strong_el.find_all_next('li')[9]
But the problem here is that I have to know how many bullet points are listed (there are 10 in this case, so it returns valid values up to [9]). What is the best way to extract all of the bullet points without knowing how many there are? Also, I need only the text and not the HTML.
You can use find_next_sibling to get the ul element next to strong which contains these li elements. Then get all the children of ul which are li elements:
ul_tag = strong_el.find_next_sibling('ul')
for li_tag in ul_tag.children:
    print(li_tag.string)
You should find the ul tag first; it contains all the li tags.
In [3]: ul = strong_el.find_next('ul')
In [4]: for li in ul.find_all('li'):
   ...:     print(li.text)
out:
“Demand very steady to start the year.” (Chemical Products)
“January revenue target slightly lower following a big December shipment month.” (Computer & Electronic Products)
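For reference, here is a self-contained sketch of the same idea that collects the bullet text into a list. The HTML fragment is made up to mimic the structure described in the question; the real page's markup is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure described in the question.
SAMPLE_HTML = """
<p><strong>WHAT RESPONDENTS ARE SAYING</strong></p>
<ul>
  <li>First comment. (Chemical Products)</li>
  <li>Second comment. (Machinery)</li>
</ul>
<ul><li>Unrelated list item</li></ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
strong_el = soup.find("strong", string="WHAT RESPONDENTS ARE SAYING")
# find_next searches forward in document order, so it also works when the
# <ul> is not a direct sibling of the <strong> tag.
ul = strong_el.find_next("ul")
# Only the <li> tags inside that one <ul> are collected, however many exist.
bullets = [li.get_text(strip=True) for li in ul.find_all("li")]
print(bullets)
```

Scoping find_all('li') to the ul is what removes the need to know the number of bullets in advance.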

Finding the number of divs with a certain id in BeautifulSoup?

I am trying to find a way to count the number of divs with the id "blue". Is this possible in BeautifulSoup? Here is my code:
import BeautifulSoup
scanning = True
soup = BeautifulSoup.BeautifulSoup("<html><body><div id='blue'></div><div id='blue'></div><div id='purple'></div></body></html>")
blues = []
blues.append(soup.find("div", {"id": "blue"}))
print len(blues)
The find method only fetches the first occurrence, hence the output of 1. If you use find_all, it finds all of the occurrences and returns them as a list. In this case divs becomes a list of every div with id="blue", and you can check its length.
from bs4 import BeautifulSoup  # find_all and the parser argument need bs4
soup = BeautifulSoup("<html><body><div id='blue'></div><div id='blue'></div><div id='purple'></div></body></html>", 'html.parser')
divs = soup.find_all("div", {"id": "blue"})
print(len(divs))
If only divs have the id blue, then you could use a CSS selector with select() instead (find_all does not accept selectors):
divs = soup.select("#blue")
blues = len(divs) if divs else 0
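A runnable sketch comparing the two ways of counting, using the HTML string from the question. It assumes bs4 with the built-in html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = ("<html><body><div id='blue'></div><div id='blue'></div>"
        "<div id='purple'></div></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Keyword form of the attribute filter.
count_find_all = len(soup.find_all("div", id="blue"))
# CSS attribute selector; select() always returns a list, possibly empty.
count_select = len(soup.select('div[id="blue"]'))
print(count_find_all, count_select)
```

Both calls return a list, and the length of an empty list is 0, so no extra guard is needed when nothing matches.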

Python 2.7 : Can't figure out how to parse a tree with BeautifulSoup4

I am trying to parse this site to create 5 lists, one for each day and filled with one string for each announcement. For example
[in] custom_function(page)
[out] [[<MONDAYS ANNOUNCEMENTS>],
[<TUESDAYS ANNOUNCEMENTS>],
[<WEDNESDAYS ANNOUNCEMENTS>],
[<THURSDAYS ANNOUNCEMENTS>],
[<FRIDAYS ANNOUNCEMENTS>]]
But I can't figure out the correct way to do this.
This is what I have so far
from bs4 import BeautifulSoup
import requests
import datetime
url = "http://mam.econoday.com/byweek.asp?day=7&month=4&year=2014&cust=mam&lid=0"
# Get the text of the webpage
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
full_table_1 = soup.find('table', 'eventstable')
I figured out that what I want is in the highlighted tag, but I'm not sure how to get to that exact tag and then parse the times and announcements into a list. I've tried multiple methods, but it just keeps getting messier.
What do I do?
The idea is to find all td elements with events class, then read div elements inside:
data = []
for day in soup.find_all('td', class_='events'):
    data.append([div.text for div in day.find_all('div', class_='econoevents')])
print data
prints:
[[u'Gallup US Consumer Spending Measure8:30 AM\xa0ET',
u'4-Week Bill Announcement11:00 AM\xa0ET',
u'3-Month Bill Auction11:30 AM\xa0ET',
...
],
...
]
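In the output above, each title runs straight into its time ("Measure8:30 AM"). One possible way to keep them apart is to pass a separator to get_text. The sketch below uses Python 3 and a made-up table: the class names come from the answer, but the inner structure of each div is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the class names are taken from the answer.
SAMPLE_HTML = """
<table class="eventstable"><tr>
  <td class="events">
    <div class="econoevents"><a>4-Week Bill Announcement</a><br>11:00 AM ET</div>
    <div class="econoevents"><a>3-Month Bill Auction</a><br>11:30 AM ET</div>
  </td>
  <td class="events">
    <div class="econoevents"><a>Redbook</a><br>8:55 AM ET</div>
  </td>
</tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="eventstable")
week = []
for day in table.find_all("td", class_="events"):
    # A separator keeps the title and the time from running together.
    week.append([div.get_text(" ", strip=True)
                 for div in day.find_all("div", class_="econoevents")])
print(week)
```

Each inner list corresponds to one td with the events class, so the outer list has one entry per day column.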

Python webscraping and getting contents of first div tag of its class

I'm working with Python 3.3 and this website:
http://www.nasdaq.com/markets/ipos/
My goal is to read only the companies that are in the Upcoming IPO table. It is in the div tag with class="genTable thin floatL". There are two divs with this class, and the target data is in the first one.
Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'}) [0]: # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))
I'd like it to return only
3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.
But it prints every div that matches the re.match pattern, and prints them multiple times as well. I tried adding [0] to the for divparent loop to retrieve only the first one, but this caused the repetition instead.
EDIT: Here is the updated code according to warunsl's solution. This works.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table= divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))
You mentioned that there are two elements that fit the 'class':'genTable thin floatL' criteria. So running a for loop over its first element does not make sense.
So replace your outer for loop with
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
Now you need not do a soup.find_all again. Doing so will search the entire document. You need to restrict the search to the divparent. So, you do:
table = divparent.find('table')
The remainder of the code to extract the dates and the company name would be the same, except that they will be with reference to the table variable.
for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)
Hope it helps.
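The scoping idea can be tried offline with a small sketch. The fragment below is invented (two containers with the same class, one row each) and only assumes the class names quoted in the question; it is not the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two containers share the same class attribute.
SAMPLE_HTML = """
<div class="genTable thin floatL"><table>
  <tr><td><div class="ipo-cell-height">3/7/2014</div></td>
      <td><div class="ipo-cell-height">RECRO PHARMA, INC.</div></td></tr>
</table></div>
<div class="genTable thin floatL"><table>
  <tr><td><div class="ipo-cell-height">1/2/2014</div></td>
      <td><div class="ipo-cell-height">OTHER TABLE CO</div></td></tr>
</table></div>
"""

DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}$")
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# find() returns only the first match, so no [0] indexing is needed.
first_block = soup.find("div", class_="genTable thin floatL")

results = []
# Searching from first_block, not soup, ignores the second container.
for div in first_block.find_all("div", class_="ipo-cell-height"):
    text = div.get_text(strip=True)
    if DATE_RE.match(text):
        company = div.find_next("div", class_="ipo-cell-height")
        results.append("{} - {}".format(text, company.get_text(strip=True)))
print(results)
```

Calling find_all on first_block rather than on soup is the whole fix: the search then covers only that element's descendants.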

How to get a nested element in beautiful soup

I am struggling with the syntax required to grab some hrefs in a td.
The table, tr and td elements don't have any classes or ids.
If I wanted to grab the anchor in this example, what would I need?
< tr >
< td > < a >...
Thanks
As per the docs, you first make a parse tree:
import BeautifulSoup
html = "<html><body><tr><td><a href='foo'/></td></tr></body></html>"
soup = BeautifulSoup.BeautifulSoup(html)
and then you search in it, for example for <a> tags whose immediate parent is a <td>:
for ana in soup.findAll('a'):
    if ana.parent.name == 'td':
        print ana["href"]
Something like this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [td.find('a') for td in soup.findAll('td')]
That should find the first "a" inside each "td" in the html you provide. You can tweak td.find to be more specific or else use findAll if you have several links inside each td.
UPDATE: re Daniele's comment, if you want to make sure you don't have any None's in the list, then you could modify the list comprehension thus:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [a for a in (td.find('a') for td in soup.findAll('td')) if a]
Which basically just adds a check to see if you have an actual element returned by td.find('a').
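The answers above use the older BeautifulSoup 3 API. With bs4, CSS selectors are one way to express the same two searches. This is a sketch on a made-up table, not the asker's page:

```python
from bs4 import BeautifulSoup

# Hypothetical table with no classes or ids, as described in the question.
html = """
<table>
  <tr><td><a href="foo">Foo</a></td><td>no link here</td></tr>
  <tr><td><span><a href="nested">Nested</a></span></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
# "td > a" matches anchors whose immediate parent is a <td>.
direct_links = [a["href"] for a in soup.select("td > a[href]")]
# "td a" matches anchors anywhere inside a <td>.
all_links = [a["href"] for a in soup.select("td a[href]")]
print(direct_links, all_links)
```

Because select returns only elements that matched, no None check is needed, and the [href] part skips anchors without an href attribute.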
