get last page number - web scraping - python

I am trying to scrape a site with multiple pages. I would like to build a function that returns the number of pages within a set of pages.
Here is an example starting page.
There are 29 subpages within that leading page, so ideally the function would return 29.
By subpage I mean page 1 of 29, page 2 of 29, and so on.
This is the HTML snippet which contains the last page information, from the link posted above.
<div id="paging-wrapper-btm" class="paging-wrapper">
<ol class="page-nos"><li ><span class="selected">1</span></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</a></li></ol>
I have the following code, which will find all ol tags, but I can't figure out how to access the contents of each 'a'.
a = soup.find_all('ol')
b = [x['a'] for x in a]  # <-- this part returns an error
< further processing >
Any help/suggestions much appreciated.

Ah.. I found a simple solution.
for item in soup.select("ol a"):
    x = item.text
    print(x)
I can then sort and select the largest number.
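For reference, a minimal sketch of that last step (assuming soup is the BeautifulSoup object for the listing page above; non-numeric entries such as the "Weiter »" link are skipped):
# Collect the numeric page labels and take the largest one.
page_numbers = [int(a.text) for a in soup.select("ol a") if a.text.strip().isdigit()]
last_page = max(page_numbers) if page_numbers else 1
print(last_page)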

Try this:
ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols]  # finds all a's inside each ol in the ols list
all_as = []
for a in list_of_as:  # expand each sublist of a's and put all of them in one list
    all_as.extend(a)
print(all_as)
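If the goal is still the last page number, one possible way to finish from here (my addition, not part of the original answer) is to keep only the numeric labels and take the maximum:
# Pull the numeric labels out of the collected anchors and take the max.
numbers = [int(a.text) for a in all_as if a.text.strip().isdigit()]
if numbers:
    print(max(numbers))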

The following would extract the last page number:
from bs4 import BeautifulSoup
import requests

html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text, "html.parser")
ol = soup.find('ol', class_='page-nos')
pages = [li.text for li in ol.find_all('li')]
last_page = pages[-2]  # the last <li> is the "Weiter" (next) link, so take the second-to-last
print(last_page)
Which for your website will display:
30

Related

url split with beautifulsoup

When I run the code below,
link = f"https://www.ambalajstore.com/kategori/bardak-tabak?siralama=fiyat:asc&stoktakiler=1&tp=1"
response = requests.get(link)
html_icerigi = response.content
corba = BeautifulSoup(html_icerigi,"html.parser")
for a in corba.find_all("div", {"class": "paginate-content"}):
    x = corba.find_all("div", {"class": "paginate-content"})
    print(x)
I get results:
[<div class="paginate-content">
<a class="paginate-element-active" href="javascript:void(0);">1</a>
2
3
4
..
13
</div>]
What I need is just the last number in the last line (13 in this case).
Can you help me with how to do this?
You can do it like this:
corba.find("div",{"class":"paginate-content"}).find_all('a')[-1].text
This will give you the text content of the last item (13 in your case).
Since you have only one div in x, you can get it as follows:
x[0].find_all('a')[-1].text
You can also handle the case where no anchor tag is found.
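A minimal sketch of that guard, reusing corba from the question:
# Fall back gracefully when the paginate div or its <a> tags are missing.
div = corba.find("div", {"class": "paginate-content"})
anchors = div.find_all("a") if div else []
last_page = anchors[-1].text if anchors else None  # None when no anchor is found
print(last_page)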
There are different possible approaches to scraping the text of your element.
Using a CSS selector to pick the last element of its type:
corba.select_one('.paginate-content a:last-of-type').text
Or picking the last element by its list index:
corba.find('div',{'class':'paginate-content'}).find_all('a')[-1].text
Example
from bs4 import BeautifulSoup
import requests
url = 'https://www.ambalajstore.com/kategori/bardak-tabak?siralama=fiyat:asc&stoktakiler=1&tp=1'
req = requests.get(url)
corba = BeautifulSoup(req.content, "html.parser")
corba.select_one('.paginate-content a:last-of-type').text
Output
13

Extract href link from html in python

I get this output list of HTML:
HyperSense Software
QSS Technosoft - A CMMI Level 3 Certified Company
and more in the same format. How do I extract the href link from each of them?
My code
from urllib.request import urlopen
from bs4 import BeautifulSoup

mainurl = "https://www.appfutura.com/app-developers"
html = urlopen(mainurl).read()
main_soup = BeautifulSoup(html, "lxml")
allurl = main_soup.find_all('h3')
for i in allurl:
    for a in i:
        print(a)
How can I extract href in this loop?
You're close. One small change in your for loop:
for i in allurl:
    print(i.a["href"])
This gets the child with tag "a" and then the "href" attribute for that tag.
If you aren't sure how many "a" tags there are in each "h3" block, or there are more than one, you can use another for loop (or depending on what you're doing, list comprehensions):
for i in allurl:
    aa = i.find_all('a')
    for j in aa:
        print(j["href"])
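The list-comprehension version mentioned above would look something like this (same names as in the loops):
# Flatten the hrefs of every <a> inside each <h3> into a single list.
hrefs = [j["href"] for i in allurl for j in i.find_all('a')]
print(hrefs)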
I found a way using a CSS selector:
urllist = []
mainurl = "https://www.appfutura.com/app-developers"
html = urlopen(mainurl).read()
main_soup = BeautifulSoup(html, "lxml")
elms = main_soup.select("h3 a")
for i in elms:
    urllist.append(i.attrs["href"])
print(urllist)
Thanks !!

how to get text after a specific p tag in beautifulsoup?

How do I get all the text after the third p tag in this code, using BeautifulSoup web scraping?
questions = soup.find('div',{'class':'entry-content'})
exp = questions.p[3].text
(There should be a way to do something like this, but I can't get it.)
Can anyone here help? I'd be very thankful.
Try the code below, if that helps:
# This will fetch the first div with class entry-content.
# In case that is not the first div, use find_all instead and select the
# appropriate div with the help of indexing.
questions = soup.find('div', class_='entry-content')

# This will get all the p tags present in questions.
p_tags = questions.find_all('p')

lst = []
for tag in p_tags[3:]:
    lst.append(tag.text)

# This will get you the text of the 4th <p> tag.
exp = p_tags[3].text
This:
questions = soup.find('div',{'class':'entry-content'})
only finds the first match. You need:
questions = soup.find_all('div',{'class':'entry-content'})
to find all of them, and then you can index with [3].
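A short sketch of that indexing idea (using the entry-content div from the question; find_all returns a list, so the fourth <p> is index [3] and everything from there on is the slice [3:]):
p_tags = soup.find('div', {'class': 'entry-content'}).find_all('p')
fourth_p_text = p_tags[3].text            # the specific tag
rest_text = [p.text for p in p_tags[3:]]  # all text from the 4th <p> onwards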

Python script extract data from HTML page

I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.
I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of the specified team (an input), then returns the sum of that teams ranks from all links.
Thankfully, I found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8
...which is GREAT!
But I must have something wrong, because I'm getting 0.
Here's my code:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []
for table_row in soup.select(".expand-section li"):
    table_cells = table_row.findAll('li')
    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text)
    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + int(rank)
print(total_rank)
Check out that link to double-check I have the correct class specified. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first tag, I dunno.
I don't use Python much, so I'm unfamiliar with its debugging tools. If anyone wants to point me to one of those, that would be great!
First, the team stats and player stats sections are contained in a div with class="column large-2". The team stats are in the first occurrence. Then you can find all of the href tags within it. I've combined both in a one-liner.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contained "#" (part of navigation links) so I excluded them.
links = [a['href'] for a in teamstats if a['href'] != '#']
Here is a sample of output:
links
Out[84]:
['/ncaa-basketball/stat/points-per-game',
'/ncaa-basketball/stat/average-scoring-margin',
'/ncaa-basketball/stat/offensive-efficiency',
'/ncaa-basketball/stat/floor-percentage',
'/ncaa-basketball/stat/1st-half-points-per-game',
I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list, so stat_links ends up being an empty list; therefore the iteration over stat_links never gets carried out and total_rank never gets incremented. I suggest you fiddle around with the way you find the list elements.
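Building on the selector from the other answer, a sketch of a link-collection loop that does return results might look like this (the class name 'column large-2' and the base URL are assumptions taken from that answer, to be verified against the live page):
from urllib.parse import urljoin

# Collect the team-stat hrefs and turn the relative paths into absolute
# URLs before requesting them.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
links = [urljoin('https://www.teamrankings.com', a['href'])
         for a in teamstats if a['href'] != '#']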

Python webscraping and getting contents of first div tag of its class

I'm working with Python 3.3 and this website:
http://www.nasdaq.com/markets/ipos/
My goal is to read only the companies that are in the Upcoming IPO section. It is in the div tag with class="genTable thin floatL". There are two divs with this class, and the target data is in the first one.
Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]:  # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))
I'd like it to return only
3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.
But it prints all div classes matching the re.match specification, and multiple times as well. I tried inserting [0] on the for divparent loop to retrieve only the first one, but this caused the repeating problem instead.
EDIT: Here is the updated code according to warunsl's solution. This works.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table = divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))
You mentioned that there are two elements that fit the 'class':'genTable thin floatL' criteria, so running a for loop over the first element does not make sense.
So replace your outer for loop with
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
Now you need not do a soup.find_all again. Doing so will search the entire document. You need to restrict the search to the divparent. So, you do:
table = divparent.find('table')
The remainder of the code to extract the dates and the company name would be the same, except that they will be with reference to the table variable.
for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)
Hope it helps.
