url split with beautifulsoup - python

When I run the code below,
link = f"https://www.ambalajstore.com/kategori/bardak-tabak?siralama=fiyat:asc&stoktakiler=1&tp=1"
response = requests.get(link)
html_icerigi = response.content
corba = BeautifulSoup(html_icerigi,"html.parser")
for a in corba.find_all("div",{"class":"paginate-content"}):
    x = corba.find_all("div",{"class":"paginate-content"})
    print(x)
I get results:
[<div class="paginate-content">
<a class="paginate-element-active" href="javascript:void(0);">1</a>
2
3
4
..
13
</div>]
What I need is just the last number in that block (13).
Can you help me with how to do this?

You can do it like this:
corba.find("div",{"class":"paginate-content"}).find_all('a')[-1].text
This gives you the text content of the last link (13 in your case).

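Since x is a list containing a single div, you can take its first element and read the last link from it:
x[0].find_all('a')[-1].text
You should also handle the case where no anchor tag is found, since indexing an empty result with [-1] raises an IndexError.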
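As a minimal sketch of that kind of guard, the helper below returns None instead of raising when the container or its links are missing. The function name last_page_number and the href values in the sample markup are made up for illustration; only the class names come from the question.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's output; the href values are invented.
html = """
<div class="paginate-content">
<a class="paginate-element-active" href="javascript:void(0);">1</a>
<a href="?tp=2">2</a>
<a href="?tp=13">13</a>
</div>
"""

def last_page_number(markup):
    """Return the last page number as an int, or None if it cannot be found."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", {"class": "paginate-content"})
    if container is None:
        return None  # no pagination block on the page
    links = container.find_all("a")
    if not links:
        return None  # pagination block present but empty
    text = links[-1].get_text(strip=True)
    return int(text) if text.isdigit() else None

print(last_page_number(html))
```

Returning None leaves it to the caller to decide what a missing pagination block means, for example treating the category as a single page.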

There are several possible approaches to get the text of your element.
Using a CSS selector that matches the last element of its type:
corba.select_one('.paginate-content a:last-of-type').text
Picking the last element by its list index:
corba.find('div',{'class':'paginate-content'}).find_all('a')[-1].text
Example
from bs4 import BeautifulSoup
import requests
url = 'https://www.ambalajstore.com/kategori/bardak-tabak?siralama=fiyat:asc&stoktakiler=1&tp=1'
req = requests.get(url)
corba = BeautifulSoup(req.content, 'html.parser')
print(corba.select_one('.paginate-content a:last-of-type').text)
Output
13

Related

scraping B in <span><span> flow text </span> B </span> using BeautifulSoup

I have the following bs4 element tag :
<span><span>some content</span> B</span>
The length of string B is unknown (I named it B for simplicity).
How can I use BeautifulSoup to extract "B"? Or is my only option to extract the text and then apply some regex techniques?
Thanks
Edit : Complete code
def get_doc_yakarouler(license_plate, url='https://www.yakarouler.com/car_search/immat?immat='):
    response = requests.get(url+license_plate)
    content = response.content
    doc = BeautifulSoup(content,'html.parser')
    result = doc.span.text
    if 'identifié' in result:
        return doc
    else:
        return f"La plaque {license_plate} n'est pas recensé sur yakarouler"
doc = get_doc_yakarouler('AA300AA')
span = doc.find_all('span')
motorisation_tag = span[1]
I want to extract "1.6 TDI".
I found a solution using motorisation_tag.text.replace(u'\xa0', ' ').split(' ')[1], but I would like to know whether this is possible directly with bs4.
Assuming you have a variable span that represents the outer <span> tag, you can extract 'B' with span.contents[1]. This works because .contents returns a list of the tag's children, in this case [<span>some content</span>, ' B'], so the 'B' text is the second element of the list. Be aware that if there is a space before B, as in your HTML sample, the space will be included in the string.
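A short sketch of the .contents approach described above, using the simplified HTML from the question. The variable name outer_text is only illustrative, and the call to .strip() is an addition that removes the leading space.

```python
from bs4 import BeautifulSoup

html = "<span><span>some content</span> B</span>"
span = BeautifulSoup(html, "html.parser").find("span")

# .contents lists the direct children: the inner <span> tag, then the string ' B'
print(span.contents)

# Take the second child and strip the surrounding whitespace
outer_text = span.contents[1].strip()
print(outer_text)
```

Indexing by position assumes the inner tag always comes first; if the real page orders its children differently, the index has to change with it.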
from bs4 import BeautifulSoup as bs, NavigableString
html = '<span><span>some content</span> B</span>'
soup = bs(html, 'html.parser')
span = soup.find("span")
# First approach: find the first string that is a direct child of the tag (no recursion into the inner span)
outer_text_1 = span.find(string=True, recursive=False)
# Second approach: loop through the tag's contents and keep only the strings, skipping nested tags
outer_text_2 = ' '.join([t for t in span.contents if isinstance(t, NavigableString)])
print(outer_text_1) # output: " B" (the leading space is kept)
print(outer_text_2) # output: " B" (the leading space is kept)

how BeautifulSoup get the content inside a span?

I'm trying to parse fixture contents from a website. I managed to parse the Match column, but I'm having difficulty parsing the date and time column.
My program
import re
import pytz
import requests
import datetime
from bs4 import BeautifulSoup
from espncricinfo.exceptions import MatchNotFoundError, NoScorecardError
from espncricinfo.match import Match
bigbash_article_link = "http://www.espncricinfo.com/ci/content/series/1128817.html?template=fixtures"
r = requests.get(bigbash_article_link)
bigbash_article_html = r.text
soup = BeautifulSoup(bigbash_article_html, "html.parser")
bigbash1_items = soup.find_all("span",{"class": "fixture_date"})
bigbash_items = soup.find_all("span",{"class": "play_team"})
bigbash_article_dict = {}
date_dict = {}
for div in bigbash_items:
    a = div.find('a')['href']
    bigbash_article_dict[div.find('a').string] = a
print(bigbash_article_dict)
for div in bigbash1_items:
    a = div.find('span').string
    date_dict[div.find('span').string] = a
print(date_dict)
When I execute this, print(bigbash_article_dict) produces output, but print(date_dict) gives me an error. How can I parse the date and time content?
Following your code, you want to get the content inside the span tag.
So you should use div.contents to get the contents of the span.
Your question should really be: how does BeautifulSoup get the content inside a span?
For example:
div= <span class="fixture_date">
Thu Feb 22
</span>
div.contents[0].strip()= Thu Feb 22
------------
for div in bigbash1_items:
    print("div=",div)
    print("div.contents[0].strip()=",div.contents[0].strip(),"\r\n------------\r\n")
Elements with class fixture_date don't have a <span>, they are the span. You can get the data from them directly.
So instead of this:
div.find('span').string
You can do this:
div.string
From the structure of the website, this returns the date on odd iterations (1, 3, ...) and the time on even iterations (2, 4, ...).
I'd also advise you to make the variable name meaningful: rename div to span,
because in your code all the div variables actually contain <span> tags ;)
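If the spans really do alternate between date and time as described, one way to pair them up is to slice the list into its odd and even entries. This is only a sketch: the markup and the time values below are invented for illustration, and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Invented markup: fixture_date spans alternating date, time, date, time
html = """
<span class="fixture_date">
 Thu Feb 22
</span>
<span class="fixture_date">
 08:40 GMT
</span>
<span class="fixture_date">
 Fri Feb 23
</span>
<span class="fixture_date">
 09:00 GMT
</span>
"""

soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span", {"class": "fixture_date"})

# The elements are the spans themselves, so read their text directly
texts = [span.get_text(strip=True) for span in spans]

# Entries 0, 2, 4, ... are dates and entries 1, 3, 5, ... are times
fixtures = list(zip(texts[0::2], texts[1::2]))
print(fixtures)
```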

get last page number - web scraping

I am trying to scrape a site with multiple pages. I would like to build a function that returns the number of pages within a set of pages.
Here is an example starting page.
There are 29 sub pages within that leading page, ideally the function would therefore return 29.
By subpage I mean, page 1 of 29, 2 of 29 etc etc.
This is the HTML snippet which contains the last page information, from the link posted above.
<div id="paging-wrapper-btm" class="paging-wrapper">
<ol class="page-nos"><li ><span class="selected">1</span></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</li></ol>
I have the following code, which finds all ol tags, but I can't figure out how to access the contents of each 'a' inside them.
a = soup.find_all('ol')
b = [x['a'] for x in a]  # <-- this part returns an error
< further processing >
Any help/suggestions much appreciated.
Ah, I found a simple solution.
for item in soup.select("ol a"):
    x = item.text
    print(x)
I can then sort the results and select the largest number.
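A sketch of that last step, assuming the pagination links carry plain page numbers as their text. The markup below is a simplified, invented stand-in for the real page, and falling back to 1 when no numeric link is found is an assumption about how a single-page listing should be treated.

```python
from bs4 import BeautifulSoup

# Simplified, invented pagination markup using the page-nos class from the question
html = """
<ol class="page-nos">
  <li><span class="selected">1</span></li>
  <li><a href="?pge=1">2</a></li>
  <li><a href="?pge=2">3</a></li>
  <li><a href="?pge=28">29</a></li>
  <li><a href="?pge=1">Weiter »</a></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only link texts that are purely numeric, skipping "Weiter »" and similar
page_numbers = [
    int(text)
    for text in (a.get_text(strip=True) for a in soup.select("ol.page-nos a"))
    if text.isdigit()
]

# Assumption: a listing without numbered links has exactly one page
last_page = max(page_numbers) if page_numbers else 1
print(last_page)
```

Taking the maximum avoids relying on the position of the last numbered link, which shifts depending on whether a "next" link is present.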
Try this:
ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols] # Finds all a's inside each ol in the ols list
all_as = []
for a in list_of_as: # Expands each sublist of a's and puts all of them in one list
    all_as.extend(a)
print(all_as)
The following would extract the last page number:
from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text, "html.parser")
ol = soup.find('ol', class_='page-nos')
pages = [li.text for li in ol.find_all('li')]
last_page = pages[-2]  # the last entry is the "next" link, so take the one before it
print(last_page)
Which for your website will display:
30

Python BeautifulSoup webcrawling getting text tag inside link

I need to get the information within the <b> tags for each website.
response = requests.get(href)
soup = BeautifulSoup(response.content, "lxml") # or BeautifulSoup(response.content, "html5lib")
tempWeekend = []
print(soup.findAll('b'))
The soup.findAll('b') line prints all the b tags in the site, how can I limit it to just the dates that I want?
The website is http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm, under the weekend tab.
It is often easiest to search using CSS selectors, e.g.
soup.select('table.chart-wide > tr > td > nobr > font > a > b')
Sadly, if the tags are not further identified, there is no way to select specific ones. How would BeautifulSoup be able to distinguish between them? If you know roughly what to expect in the tags you need, you could iterate over all of them and check whether they match:
for b in soup.findAll('b'):
    if b.text == whatever:  # whatever is a placeholder for the text you expect
        return b  # assumes this loop sits inside a function
or something like that...
Or you could get the surrounding tags, i.e. 'a' in your example, check whether that matches, and then get the next occurrence of 'b'.
Why not search for all the b tags, and choose the ones which contain a month?
import requests
from bs4 import BeautifulSoup
s = requests.get('http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm').content
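soup = BeautifulSoup(s, "lxml") # or BeautifulSoup(s, "html5lib")
months = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()
dates = []
for i in soup.find_all('b'):
    words = i.text.split()
    # Keep only <b> tags whose first word is a month abbreviation; skip empty tags
    if words and words[0].upper() in months:
        dates.append(i.text)
print(dates)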
(Note: I did not check the exact abbreviations that the website uses. Please check these first and accordingly modify the code)
Looking at that page, it doesn't have any divs or class or id attributes, which makes it tough. The only pattern I could see was that the <b> tag directly before the dates was <b>Date:</b>. I would iterate over the <b> tags and then collect the tags after I hit the one with Date in it.
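A rough sketch of that idea follows. The markup is invented to mimic the pattern described, not copied from the site, and the rule of stopping at the next label ending in a colon is an added assumption so that unrelated <b> tags further down the page are not collected.

```python
from bs4 import BeautifulSoup

# Invented markup: a "Date:" label followed by date cells, then another label row
html = """
<table>
<tr><td><b>Weekend</b></td></tr>
<tr><td><b>Date:</b></td><td><b>Nov 22-24</b></td><td><b>Nov 29-Dec 1</b></td></tr>
<tr><td><b>Rank:</b></td><td><b>1</b></td><td><b>1</b></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

dates = []
collecting = False
for b in soup.find_all("b"):
    text = b.get_text(strip=True)
    if text == "Date:":
        collecting = True      # start collecting after the marker tag
    elif collecting and text.endswith(":"):
        break                  # assumed stop rule: the next label ends the date row
    elif collecting:
        dates.append(text)

print(dates)
```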
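I would try something like:
dates = []
all_a = site.find_all('a')  # site is the BeautifulSoup object for the page
for a in all_a:
    if '?yr=?' in a.get('href', ''):  # .get avoids a KeyError on links without an href
        dates.append(a.get_text())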

Python 2.7 : Can't figure out how to parse a tree with BeautifulSoup4

I am trying to parse this site to create 5 lists, one for each day and filled with one string for each announcement. For example
[in] custom_function(page)
[out] [[<MONDAYS ANNOUNCEMENTS>],
[<TUESDAYS ANNOUNCEMENTS>],
[<WEDNESDAYS ANNOUNCEMENTS>],
[<THURSDAYS ANNOUNCEMENTS>],
[<FRIDAYS ANNOUNCEMENTS>]]
But I can't figure out the correct way to do this.
This is what I have so far
from bs4 import BeautifulSoup
import requests
import datetime
url = 'http://mam.econoday.com/byweek.asp?day=7&month=4&year=2014&cust=mam&lid=0'
# Get the text of the webpage
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
full_table_1 = soup.find('table', 'eventstable')
I figured out that what I want is in the highlighted tag, but I'm not sure how to get to that exact tag and then parse the times and announcements into a list. I've tried multiple methods, but it just keeps getting messier.
What do I do?
The idea is to find all td elements with events class, then read div elements inside:
data = []
for day in soup.find_all('td', class_='events'):
    data.append([div.text for div in day.find_all('div', class_='econoevents')])
print(data)
prints:
[[u'Gallup US Consumer Spending Measure8:30 AM\xa0ET',
u'4-Week Bill Announcement11:00 AM\xa0ET',
u'3-Month Bill Auction11:30 AM\xa0ET',
...
],
...
]
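In the output above, each string has the announcement name and its time run together. One possible way to separate them is a regular expression, sketched below in Python 3 on the three sample strings shown. The pattern and the helper name split_event are assumptions based on those samples only; an announcement whose name ends in a digit would be ambiguous.

```python
import re

# Sample strings copied from the printed output above
raw = [
    "Gallup US Consumer Spending Measure8:30 AM\xa0ET",
    "4-Week Bill Announcement11:00 AM\xa0ET",
    "3-Month Bill Auction11:30 AM\xa0ET",
]

# Name (as short as possible), then a time such as "8:30 AM", then "ET";
# \s also matches the non-breaking space \xa0 in Python 3 str patterns
pattern = re.compile(r"^(?P<name>.*?)(?P<time>\d{1,2}:\d{2} [AP]M)\s*ET$")

def split_event(text):
    """Return (name, time), or (text, None) if the text does not fit the pattern."""
    match = pattern.match(text)
    if match is None:
        return text, None
    return match.group("name").strip(), match.group("time")

events = [split_event(item) for item in raw]
for name, time in events:
    print(name, "|", time)
```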
