How does BeautifulSoup get the content inside a span? - python

I'm trying to parse fixture contents from a website. I managed to parse the Match column, but I'm having difficulty parsing the date and time column.
My program:
import re
import pytz
import requests
import datetime
from bs4 import BeautifulSoup
from espncricinfo.exceptions import MatchNotFoundError, NoScorecardError
from espncricinfo.match import Match
bigbash_article_link = "http://www.espncricinfo.com/ci/content/series/1128817.html?template=fixtures"
r = requests.get(bigbash_article_link)
bigbash_article_html = r.text
soup = BeautifulSoup(bigbash_article_html, "html.parser")
bigbash1_items = soup.find_all("span",{"class": "fixture_date"})
bigbash_items = soup.find_all("span",{"class": "play_team"})
bigbash_article_dict = {}
date_dict = {}
for div in bigbash_items:
    a = div.find('a')['href']
    bigbash_article_dict[div.find('a').string] = a
print(bigbash_article_dict)
for div in bigbash1_items:
    a = div.find('span').string
    date_dict[div.find('span').string] = a
print(date_dict)
When I execute this I get the output of print(bigbash_article_dict), but print(date_dict) gives me an error. How can I parse the date and time content?

Following your code, you want to get the content inside the span tag.
So you should use div.contents to get the contents of the span.
Your question should really be "how does BeautifulSoup get the content inside a span".
e.g.
div= <span class="fixture_date">
Thu Feb 22
</span>
div.contents[0].strip()= Thu Feb 22
------------
for div in bigbash1_items:
    print("div=",div)
    print("div.contents[0].strip()=",div.contents[0].strip(),"\r\n------------\r\n")

Elements with class fixture_date don't have a <span>, they are the span. You can get the data from them directly.
So instead of this:
div.find('span').string
You can do this:
div.string
From the structure of the website, this would return the date on odd iterations (1, 3, ..) and time on even iterations (2, 4, ..).
Oh, and I'd advise you to use meaningful variable names, so rename div to span,
because in your code all the div variables actually contain <span> tags ;)
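To make the two answers above concrete, here is a minimal sketch that reads each span's text directly and pairs the alternating date and time values. The HTML snippet and the time values are made up for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the answers above: date and time spans alternate.
html = """
<span class="fixture_date">
 Thu Feb 22
</span>
<span class="fixture_date">08:40 GMT</span>
<span class="fixture_date">
 Fri Feb 23
</span>
<span class="fixture_date">03:30 GMT</span>
"""

soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span", {"class": "fixture_date"})

# get_text(strip=True) reads the span's own text and trims surrounding whitespace.
values = [span.get_text(strip=True) for span in spans]

# Items 1, 3, ... (indexes 0, 2, ...) hold dates; items 2, 4, ... hold times.
fixtures = list(zip(values[0::2], values[1::2]))
print(fixtures)
```

Pairing with zip assumes the spans strictly alternate; if a fixture is missing its time, the pairs after it would shift.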

Related

I can't scrape salary text using Beautiful Soup

I'd like help figuring this out: I want to scrape the text value of the salary (Confidential) using Beautiful Soup.
import requests
from bs4 import BeautifulSoup
result=requests.get("https://wuzzuf.net/jobs/p/WBHqaf7WeZYe-Senior-Python-Developer-Trufla-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3")
src=result.content
soup = BeautifulSoup(src,"lxml")
After this I used
salary=soup.find_all("span",{"class":"css-4xky9y"})
but it returns an empty list.
ـــــــــــــــــــــــــــــــــــــــــــــــــــ
import requests
from bs4 import BeautifulSoup
result=requests.get("https://wuzzuf.net/jobs/p/WBHqaf7WeZYe-Senior-Python-Developer-Trufla-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3")
src=result.content
soup = BeautifulSoup(src,"lxml")
salary=soup.find("div",{'id':'app'})
salary_text=salary.contents[0]
h=salary_text.contents[4]
print(h)
When I print h, it gives me the value.
Please help me find the text value of the salary.
I have been trying for the past 5 days using the approaches mentioned above.
If the salary information is in a <span> tag, you can use a code snippet like,
salary = soup.find("span", {"class": "salary"}).text
If this doesn't work, or you are unable to find the specific tag and class, you can always use the find_all() method to search for <span> tags, then filter the resulting list to find the tag that contains the salary information.
for span in soup.find_all("span"):
    if span is not None and \
            span.text is not None and \
            "salary" in span.text:
        salary = span.text
        break
else:
    salary = None # or np.nan
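The filtering idea above can be taken one step further when the label and its value sit in neighbouring tags. The sketch below is only an illustration: the markup, the "Salary:" label, and the helper name value_after_label are assumptions, not the real structure of the page. Note that the comparison is made case-insensitive here, whereas the loop above matches "salary" exactly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may use different tags and class names.
html = """
<div id="app">
  <span>Experience Needed:</span><span>3 to 5 years</span>
  <span>Salary:</span><span>Confidential</span>
</div>
"""


def value_after_label(soup, label):
    """Return the text of the span that follows the span containing `label`."""
    for span in soup.find_all("span"):
        if label.lower() in span.get_text().lower():
            following = span.find_next_sibling("span")
            return following.get_text(strip=True) if following else None
    return None


soup = BeautifulSoup(html, "html.parser")
print(value_after_label(soup, "salary"))
```

If the value is filled in by JavaScript after the page loads, it will not be present in the HTML that requests downloads, and no selector will find it.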

url split with beautifulsoup

When I run the code below,
link = f"https://www.ambalajstore.com/kategori/bardak-tabak?siralama=fiyat:asc&stoktakiler=1&tp=1"
response = requests.get(link)
html_icerigi = response.content
corba = BeautifulSoup(html_icerigi,"html.parser")
for a in corba.find_all("div",{"class":"paginate-content"}):
    x = corba.find_all("div",{"class":"paginate-content"})
    print(x)
I get results:
[<div class="paginate-content">
<a class="paginate-element-active" href="javascript:void(0);">1</a>
2
3
4
..
13
</div>]
What I need is just the number 13 (the last number) in the last line.
Can you help me with how to do this?
You can do it like this:
corba.find("div",{"class":"paginate-content"}).find_all('a')[-1].text
This will give you the text content of the last item (13 in your case).
Since x is a list containing one div, you can also get it like this:
x[0].find_all('a')[-1].text
You should also handle the case where no anchor tag is found.
There are different possible approaches to scraping the text of your element.
Using a CSS selector that targets the last element of its type:
corba.select_one('.paginate-content a:last-of-type').text
Picking the last element by its list index:
corba.find('div',{'class':'paginate-content'}).find_all('a')[-1].text
Example
from bs4 import BeautifulSoup
import requests
url = 'https://www.ambalajstore.com/kategori/bardak-tabak?siralama=fiyat:asc&stoktakiler=1&tp=1'
req = requests.get(url)
corba = BeautifulSoup(req.content, 'html.parser')
corba.select_one('.paginate-content a:last-of-type').text
Output
13
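One answer above mentions handling the case where no anchor tag is found. A minimal sketch of that guard follows; the function name last_page_number and the sample markup (including the href values) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def last_page_number(html):
    """Return the last page number in the pagination block, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": "paginate-content"})
    if container is None:
        return None
    links = container.find_all("a")
    if not links:
        return None
    text = links[-1].get_text(strip=True)
    # Only convert when the link text is purely numeric.
    return int(text) if text.isdigit() else None


# Hypothetical markup shaped like the output shown in the question.
sample = """
<div class="paginate-content">
  <a class="paginate-element-active" href="javascript:void(0);">1</a>
  <a href="?tp=2">2</a>
  <a href="?tp=13">13</a>
</div>
"""
print(last_page_number(sample))
print(last_page_number("<p>no pagination here</p>"))
```

Returning None instead of raising lets the caller decide what to do with a page that has no pagination, for example treating it as a single page.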

Python BeautifulSoup webcrawling getting text tag inside link

I need to get the information within the <b> tags for each website.
response = requests.get(href)
soup = BeautifulSoup(response.content, "lxml") # or BeautifulSoup(response.content, "html5lib")
tempWeekend = []
print(soup.findAll('b'))
The soup.findAll('b') line prints all the b tags in the site, how can I limit it to just the dates that I want?
The website is http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm, under the weekend tab.
It is often easiest to search using CSS selectors, e.g.
soup.select('table.chart-wide > tr > td > nobr > font > a > b')
Sadly, if the tags are not further identified, there is no way to select specific ones. How should BeautifulSoup be able to distinguish between them? If you roughly know what to expect in the tags you need, you could iterate over all of them and check whether they match:
def find_matching_b(soup, whatever):
    # Return the first <b> tag whose text equals the expected value.
    for b in soup.findAll('b'):
        if b.get_text() == whatever:
            return b
or something like that...
Or you could get the surrounding tags, i.e. 'a' in your example, check whether that matches, and then get the next occurrence of 'b'.
Why not search for all the b tags, and choose the ones which contain a month?
import requests
from bs4 import BeautifulSoup
s = requests.get('http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm').content
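soup = BeautifulSoup(s, "lxml") # or BeautifulSoup(s, "html5lib")
months = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}
dates = []
for i in soup.find_all('b'):
    words = i.text.split()
    # Skip empty tags and keep only those whose first word is a month abbreviation
    if words and words[0].upper() in months:
        dates.append(i.text)
print(dates)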
(Note: I did not check the exact abbreviations that the website uses. Please check these first and accordingly modify the code)
Looking at that page, it doesn't have any divs, classes, or ids, which makes it tough. The only pattern I could see was that the <b> tag directly before the dates was <b>Date:</b>. I would iterate over the <b> tags and then collect the tags after I hit the one with Date in it.
I would try something like:
dates = []
all_a = soup.find_all('a', href=True)  # only anchors that actually have an href
for a in all_a:
    if '?yr=?' in a['href']:
        dates.append(a.get_text())
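The idea of collecting the <b> tags that come after <b>Date:</b> could look roughly like the sketch below. The markup and the date strings are invented for illustration, so the real page will need its own checks, such as a condition for when to stop collecting.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page layout is not reproduced here.
html = """
<table>
  <tr><td><b>Title</b></td><td><b>Date:</b></td>
      <td><b>Nov 22-24</b></td><td><b>Nov 29-Dec 1</b></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

dates = []
seen_marker = False
for b in soup.find_all("b"):
    text = b.get_text(strip=True)
    if seen_marker:
        # Everything after the marker is treated as a date.
        dates.append(text)
    elif text == "Date:":
        seen_marker = True

print(dates)
```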

Python 2.7 : Can't figure out how to parse a tree with BeautifulSoup4

I am trying to parse this site to create 5 lists, one for each day and filled with one string for each announcement. For example
[in] custom_function(page)
[out] [[<MONDAYS ANNOUNCEMENTS>],
[<TUESDAYS ANNOUNCEMENTS>],
[<WEDNESDAYS ANNOUNCEMENTS>],
[<THURSDAYS ANNOUNCEMENTS>],
[<FRIDAYS ANNOUNCEMENTS>]]
But I can't figure out the correct way to do this.
This is what I have so far
from bs4 import BeautifulSoup
import requests
import datetime
url = "http://mam.econoday.com/byweek.asp?day=7&month=4&year=2014&cust=mam&lid=0"
# Get the text of the webpage
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
full_table_1 = soup.find('table', 'eventstable')
I figured out that what I want is in the highlighted tag, but I'm not sure how to get to that exact tag and then parse the times/announcements into a list. I've tried multiple methods, but it just keeps getting messier.
What do I do?
The idea is to find all td elements with the events class, then read the div elements inside each one:
data = []
for day in soup.find_all('td', class_='events'):
    data.append([div.text for div in day.find_all('div', class_='econoevents')])
print data
prints:
[[u'Gallup US Consumer Spending Measure8:30 AM\xa0ET',
u'4-Week Bill Announcement11:00 AM\xa0ET',
u'3-Month Bill Auction11:30 AM\xa0ET',
...
],
...
]
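In the output above, each event name runs straight into its time. A possible variation, sketched here in Python 3 on made-up markup that reuses the class names from the answer, passes a separator to get_text so the pieces stay apart. The inner tag layout and the "Redbook" entry are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup using the class names from the answer above.
html = """
<table class="eventstable"><tr>
  <td class="events">
    <div class="econoevents"><a>4-Week Bill Announcement</a><br/>11:00 AM ET</div>
    <div class="econoevents"><a>3-Month Bill Auction</a><br/>11:30 AM ET</div>
  </td>
  <td class="events">
    <div class="econoevents"><a>Redbook</a><br/>8:55 AM ET</div>
  </td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

week = []
for day in soup.find_all("td", class_="events"):
    # A separator keeps the event name and its time from running together.
    week.append([div.get_text(" ", strip=True)
                 for div in day.find_all("div", class_="econoevents")])

print(week)
```

The result is one inner list per td, which matches the one-list-per-day shape asked for in the question.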

Python webscraping and getting contents of first div tag of its class

I'm working with Python 3.3 and this website:
http://www.nasdaq.com/markets/ipos/
My goal is to read only the companies that are in the Upcoming IPO section. It is in the div tag with class="genTable thin floatL". There are two divs with this class, and the target data is in the first one.
Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'}) [0]: # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))
I'd like it to return only
3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.
But it prints all divs that match the re.match pattern, and multiple times as well. I tried inserting [0] on the for divparent loop to retrieve only the first one, but this caused the repeating problem instead.
EDIT: Here is the updated code according to warunsl's solution. This works.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table= divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))
You mentioned that there are two elements that fit the 'class':'genTable thin floatL' criteria. So running a for loop over its first element does not make sense.
So replace your outer for loop with:
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
Now you need not do a soup.find_all again. Doing so will search the entire document. You need to restrict the search to the divparent. So, you do:
table = divparent.find('table')
The remainder of the code to extract the dates and the company name would be the same, except that they will be with reference to the table variable.
for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)
Hope it helps.
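The key point of the answer, searching from the first container instead of from soup, can be checked on a small self-contained sample. The markup below is invented: two containers share the same classes, and the second one holds a placeholder company name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two containers share the same classes.
html = """
<div class="genTable thin floatL"><table><tr>
  <td><div class="ipo-cell-height">3/7/2014</div></td>
  <td><div class="ipo-cell-height">RECRO PHARMA, INC.</div></td>
</tr></table></div>
<div class="genTable thin floatL"><table><tr>
  <td><div class="ipo-cell-height">1/2/2014</div></td>
  <td><div class="ipo-cell-height">SOME OTHER COMPANY</div></td>
</tr></table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, equivalent to find_all(...)[0].
first = soup.find("div", attrs={"class": "genTable thin floatL"})

results = []
# Searching from `first` ignores everything outside that container.
for div in first.find_all("div", attrs={"class": "ipo-cell-height"}):
    text = div.get_text(strip=True)
    if re.match(r"\d{1,2}/\d{1,2}/\d{4}$", text):
        results.append((text, div.find_next("div").get_text(strip=True)))

print(results)
```

Calling soup.find_all inside the loop instead would search the whole document again, which is what produced the repeated rows in the original code.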
