How to do simple pagination loop with beautifulsoup

How to do simple pagination loop with beautifulsoup - python

I need your help to have an explanation on how to do pagination and loop on 5 different pages but with the same URL (http://www.chartsinfrance.net/charts/albums.php,p2) with just the last word of the URL who change for the number of the page.
I can scrape data of the first page but I don't understand how to get other URLs and scrape all the data in one loop and having like the 250 songs in one execution of the script!
import requests
from bs4 import BeautifulSoup
req = requests.get('http://www.chartsinfrance.net/charts/albums.php')
soup = BeautifulSoup(req.text, "html.parser")
charts = soup.select('.c1_td1')
Auteurs=[]
Titre=[]
Rang=[]
Evolution=[]
for chart in charts:
Rang = chart.select_one('.c1_td2').get_text()
Auteurs = chart.select_one('.c1_td5 a').get_text()
Evolution = chart.select_one('.c1_td3').get_text()
Titre = chart.select_one('.c1_td5 .noir11').get_text()
print('--------')
print(Auteurs)
print(Titre)
print(Rang)
print(Evolution)

You can put your code to while ... loop, where you load soup, get information about songs and then select link to next page.
If the link to next page exists, load new soup and continue the loop.
If not, break the loop.
For example:
import requests
from bs4 import BeautifulSoup
url = 'http://www.chartsinfrance.net/charts/albums.php'
while True:
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
charts = soup.select('.c1_td1')
Auteurs=[]
Titre=[]
Rang=[]
Evolution=[]
for chart in charts:
Rang = chart.select_one('.c1_td2').get_text()
Auteurs = chart.select_one('.c1_td5 a').get_text()
Evolution = chart.select_one('.c1_td3').get_text()
Titre = chart.select_one('.c1_td5 .noir11').get_text()
print('--------')
print(Auteurs)
print(Titre)
print(Rang)
print(Evolution)
next_link = soup.select_one('a:contains("→ Suite du classement")')
if next_link:
url = 'http://www.chartsinfrance.net' + next_link['href']
else:
break
Prints:
--------
Lady Gaga
Chromatica
1
Entrée
--------
Johnny Hallyday
Johnny 69
2
Entrée
--------
...
--------
Bof
Pulp Fiction
248
-115
--------
Trois Cafés Gourmands
Un air de Live
249
-30
--------
Various Artists
Salut Les Copains 60 Ans
250
Entrée

Related

How to webscrape multiple items in <tr> bracket and split them in 3 variables with python?

I am trying to scrape a website with multiple brackets. My plan is to have 3 variables (oem, model, leadtime) to generate the desired output. However, I cannot figure out how to scrape this webpage in 3 variables. Given I am new to python and BeautifulSoup, I highly appreciate your feedback.
Desired output with 3 varibales and the command:
print(oem, model, leadtime)
Renault, Mégane E-Tech, 12 Monate
Nissan, Ariya, 6 Monate
...
Volvo, XC90, 10-12 Monate
Output as of now:
Renault Mégane E-Tech12 Monate
Nissan Ariya6 Monate
Peugeot e-2086-7 Monate
KIA Sportage5-6 Monate6-7 Monate (Hybrid)
Jeep Compass3-5 Monate3-5 Monate (Hybrid)
VW Taigo3-6 Monate
...
XC9010-12 Monate
Code as of now:
from bs4 import BeautifulSoup
import requests
#Inputs/URLs to scrape:
URL = ('https://www.carwow.de/neuwagen-lieferzeiten#gref')
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
for card in overview.find_all('tbody'):
for model2 in card.find_all('tr'):
model = model2.text.replace('Angebote vergleichen', '')
#oem?-->this needs to be defined
#leadtime?--> this needs to defined
print(model)

The brand name is inside h3 tag. You can get the parent with this approach .find_all("div", {"class": "expandable-content-container"})
from bs4 import BeautifulSoup
import requests
#Inputs/URLs to scrape:
URL = ('https://www.carwow.de/neuwagen-lieferzeiten#gref')
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
for el in overview.find_all("div", {"class": "expandable-content-container"}):
header = el.find("h3").text.strip()
if not header.startswith("Top 10") and not header.endswith("?"):
for row in el.find_all("tr")[1:]:
model_monate = ", ".join(
list(map(lambda x: x.text, row.find_all("td")[:-1]))
)
print(f"{el.find('h3').text.strip()}, {model_monate}")
print("----")

the parts of the car model info that you're trying to scrape are actually stored in separate td tags, meaning, you can just access their index to get corresponding info, try the code below.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.carwow.de/neuwagen-lieferzeiten#gref").text
soup = BeautifulSoup(response, 'html.parser')
for tbody in soup.select('tbody'):
for tr in tbody:
brand = tr.select('td > a')[0].get('href').split('/')[3].capitalize()
model = tr.select('td > a')[0].get('href').split('/')[4].capitalize()
monate = tr.select('td')[1].getText(strip=True)
print(f'{brand}, {model}, {monate}')

bs4 findAll not collecting all of the data from the other pages on the website

I'm trying to scrape a real estate website using BeautifulSoup.
I'm trying to get a list of rental prices for London. This works but only for the first page on the website. There are over 150 of them so I'm missing out on a lot of data. I would like to be able to collect all the prices from all the pages. Here is the code I'm using:
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home'
response = requests.get(url)
response.status_code
data = soup(response.content, 'lxml')
prices = []
for line in data.findAll('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}):
price = str(line).split('>')[2].split(' ')[0].replace('£', '').replace(',','')
price = int(price)
prices.append(price)
Any idea as to why I can't collect the prices from all the pages using this script?
Extra question : is there a way to access the price using soup, IE with doing any list/string manipulation? When I call data.find('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}) I get a string of the following form <div class="css-1e28vvi-PriceContainer e2uk8e7" data-testid="listing-price"><p class="css-1o565rw-Text eczcs4p0" size="6">£3,012 pcm</p></div>
Any help would be much appreciated!

You can append &pn=<page number> parameter to the URL to get next pages:
import re
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home&pn="
prices = []
for page in range(1, 3): # <-- increase number of pages here
data = soup(requests.get(url + str(page)).content, "lxml")
for line in data.findAll(
"div", {"class": "css-1e28vvi-PriceContainer e2uk8e7"}
):
price = line.get_text(strip=True)
price = int(re.sub(r"[^\d]", "", price))
prices.append(price)
print(price)
print("-" * 80)
print(len(prices))
Prints:
...
1993
1993
--------------------------------------------------------------------------------
50

How to get all products from a beautifulsoup page

I want to get all the products on this page:
nike.com.br/snkrs#estoque
My python code is this:
produtos = []
def aviso():
print("Started!")
request = requests.get("https://www.nike.com.br/snkrs#estoque")
soup = bs4(request.text, "html.parser")
links = soup.find_all("a", class_="btn", text="Comprar")
links_filtred = list(set(links))
for link in links_filtred:
if(produto not in produtos):
request = requests.get(f"{link['href']}")
soup = bs4(request.text, "html.parser")
produto = soup.find("div", class_="nome-preco-produto").get_text()
if(code_formated == ""):
code_formated = "\u200b"
print(f"Nome: {produto} Link: {link['href']}\n")
produtos.append(link["href"])
aviso()
Guys, this code gets the products from the page, but not all yesterday, I suspect that the content is dynamic, but how can I get them all with request and beautifulsoup? I don't want to use Selenium or an automation library, how do I do that? I don't want to have to change my code a lot because it's almost done, how do I do that?

DO NOT USE requests.get if you are dealing with the same HOST.
Reason: read-that
import requests
from bs4 import BeautifulSoup
import pandas as pd
def main(url):
allin = []
with requests.Session() as req:
for page in range(1, 6):
params = {
'p': page,
'demanda': 'true'
}
r = req.get(url, params=params)
soup = BeautifulSoup(r.text, 'lxml')
goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
for x in soup.select('.aspect-radio-box')]
allin.extend(goal)
df = pd.DataFrame(allin, columns=['Title', 'Url'])
print(df)
main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]

To get the data you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
where providing a page number between 1-5 to p= in the URL.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup
url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"
for page in range(1, 6):
response = requests.get(url.format(page=page))
soup = BeautifulSoup(response.content, "html.parser")
print(soup.find_all("a", class_="btn", text="Comprar"))

How to open scraped links one by one automatically using Python?

So here is my situation: Let's say you search on eBay for "Motorola DynaTAC 8000x". The bot that I build is going to scrape all the links of the listings. My goal is now, to make it open those scraped links one by one.
I think something like that would be possible with using loops, but I am not sure on how to do it. Thanks in advance!
Here is the code of the bot:
import requests
from bs4 import BeautifulSoup
url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=Motorola+DynaTAC+8000x&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")
listings = soup.select("li a")
for a in listings:
link = a["href"]
if link.startswith("https://www.ebay.com/itm/"):
print(link)

To get information from the link you can do:
import requests
from bs4 import BeautifulSoup
url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=Motorola+DynaTAC+8000x&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")
listings = soup.select("li a")
for a in listings:
link = a["href"]
if link.startswith("https://www.ebay.com/itm/"):
s = BeautifulSoup(requests.get(link).content, "lxml")
price = s.select_one('[itemprop="price"]')
print(s.h1.text)
print(price.text if price else "-")
print(link)
print("-" * 80)
Prints:
...
Details about  MOTOROLA DYNATAC 8100L- BRICK CELL PHONE VINTAGE RETRO RARE MUSEUM 8000X
GBP 555.00
https://www.ebay.com/itm/393245721991?hash=item5b8f458587:g:c7wAAOSw4YdgdvBt
--------------------------------------------------------------------------------
Details about  MOTOROLA DYNATAC 8100L- BRICK CELL PHONE VINTAGE RETRO RARE MUSEUM 8000X
GBP 555.00
https://www.ebay.com/itm/393245721991?hash=item5b8f458587:g:c7wAAOSw4YdgdvBt
--------------------------------------------------------------------------------
Details about  Vintage Pulsar Extra Thick Brick Cell Phone Has Dynatac 8000X Display
US $3,000.00
https://www.ebay.com/itm/163814682288?hash=item26241daeb0:g:sTcAAOSw6QJdUQOX
--------------------------------------------------------------------------------
...

Is there a way to parse data from multiple pages from a parent webpage?

So I have been going to a website to get NDC codes https://ndclist.com/?s=Solifenacin and I need to get 10 digit NDC codes, but on the current webpage there is only 8 digit NDC codes shown like this picture below
So I click on the underlined NDC code. And get this webpage.
So I copy and paste these 2 NDC codes to an excel sheet, and repeat the process for the rest of the codes on the first webpage I've shown. But this process takes a good bit of time, and was wondering if there was a library in Python that could copy and paste the 10 digit NDC codes for me or store them in a list and then I could print the list once I'm finished with all the 8 digit NDC codes on the first page. Would BeautifulSoup work or is there a better library to achieve this process?
EDIT <<<<
I actually need to go another level deep and I've been trying to figure it out, but I've been failing, apparently the last level of webpage is this dumb html table, and I only need one element of the table. Here is the last webpage after you click on the 2nd level codes.
Here is the code that I have, but it's returning a tr and None object once I run it.
url ='https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
link_url = a['href']
print('Processin link {}...'.format(link_url))
soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
for b in soup2.select('#product-packages a'):
link_url2 = b['href']
print('Processing link {}... '.format(link_url2))
soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
for link in soup3.findAll('tr', limit=7)[1]:
print(link.name)
all_data.append(link.name)
print('Trospium')
print(all_data)

Yes, BeautifulSoup is ideal in this case. This script will print all 10 digits codes from the page:
import requests
from bs4 import BeautifulSoup
url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
link_url = a['href']
print('Processin link {}...'.format(link_url))
soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
for link in soup2.select('#product-packages a'):
print(link.text)
all_data.append(link.text)
# In all_data you have all codes, uncoment to print them:
# print(all_data)
Prints:
Processin link https://ndclist.com/ndc/0093-5263...
0093-5263-56
0093-5263-98
Processin link https://ndclist.com/ndc/0093-5264...
0093-5264-56
0093-5264-98
Processin link https://ndclist.com/ndc/0591-3796...
0591-3796-19
Processin link https://ndclist.com/ndc/27241-037...
27241-037-03
27241-037-09
... and so on.
EDIT: (Version where I get the description too):
import requests
from bs4 import BeautifulSoup
url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
link_url = a['href']
print('Processin link {}...'.format(link_url))
soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
for code, desc in zip(soup2.select('a > h4'), soup2.select('a + p.gi-1x')):
code = code.get_text(strip=True).split(maxsplit=1)[-1]
desc = desc.get_text(strip=True).split(maxsplit=2)[-1]
print(code, desc)
all_data.append((code, desc))
# in all_data you have all codes:
# print(all_data)
Prints:
Processin link https://ndclist.com/ndc/0093-5263...
0093-5263-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5263-98 90 TABLET, FILM COATED in 1 BOTTLE
Processin link https://ndclist.com/ndc/0093-5264...
0093-5264-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5264-98 90 TABLET, FILM COATED in 1 BOTTLE
Processin link https://ndclist.com/ndc/0591-3796...
0591-3796-19 90 TABLET, FILM COATED in 1 BOTTLE
...and so on.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to do simple pagination loop with beautifulsoup - python

Related

How to webscrape multiple items in <tr> bracket and split them in 3 variables with python?

bs4 findAll not collecting all of the data from the other pages on the website

How to get all products from a beautifulsoup page

How to open scraped links one by one automatically using Python?

Is there a way to parse data from multiple pages from a parent webpage?

Categories

Resources