I want to iterate through all pages from this URL, https://www.iata.org/en/about/members/airline-list/, and dump the results into a .csv file.
How can I add code to the script below to iterate through the pages?
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.iata.org/en/about/members/airline-list/'
headers = {
    'accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

data = []
while True:
    print(url)
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('table.datatable').prettify())[0])
    # follow the "next page" link while there is one
    if soup.select_one('span.pagination-link.is-active + div a[href]'):
        url = soup.select_one('span.pagination-link.is-active + div a')['href']
    else:
        break

df = pd.concat(data)
df.to_csv('airline-list.csv', encoding='utf-8-sig', index=False)
Try this approach:
for i in range(1, 30):
    url = f'https://www.iata.org/en/about/members/airline-list/?page={i}&search=&ordering=Alphabetical'
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('table.datatable').prettify())[0])
To get data dynamically, use:
import pandas as pd
import requests
import bs4

url = 'https://www.iata.org/en/about/members/airline-list/?page={page}&search=&ordering=Alphabetical'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

# Total number of pages
html = requests.get(url.format(page=1), headers=headers)
soup = bs4.BeautifulSoup(html.text, 'html.parser')
pages = int(soup.find_all('a', {'class': 'pagination-link'})[-2].text)

data = []
for page in range(1, pages + 1):
    html = requests.get(url.format(page=page), headers=headers)
    data.append(pd.read_html(html.text)[0])

df = pd.concat(data)
Output:
>>> df
Airline Name IATA Designator 3 digit code ICAO code Country / Territory
0 ABX Air GB 832 ABX United States
1 Aegean Airlines A3 390 AEE Greece
2 Aer Lingus EI 53 EIN Ireland
3 Aero Republica P5 845 RPB Colombia
4 Aeroflot SU 555 AFL Russian Federation
.. ... ... ... ... ...
3 WestJet WS 838 WJA Canada
4 White coloured by you WI 97 WHT Portugal
5 Wideroe WF 701 WIF Norway
6 Xiamen Airlines MF 731 CXA China (People's Republic of)
7 YTO Cargo Airlines YG 860 HYT China (People's Republic of)
[288 rows x 5 columns]
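To dump the combined frame to a .csv file, as the question asked, you can then reuse the line from the original script:
df.to_csv('airline-list.csv', encoding='utf-8-sig', index=False)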
I have a fairly basic Python script that scrapes a property website and stores the address and price in a CSV file. There are over 5000 listings to go through, but my current code times out after a while (about 2000 listings) and the console shows 302 and CORS policy errors.
import requests
import itertools
from bs4 import BeautifulSoup
from csv import writer
from random import randint
from time import sleep
from datetime import date

url = "https://www.propertypal.com/property-for-sale/northern-ireland/page-"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
filename = date.today().strftime("ni-listings-%Y-%m-%d.csv")

with open(filename, 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)

    # for page in range(1, 3):
    for page in itertools.count(1):
        req = requests.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')
        for li in soup.find_all('li', class_="pp-property-box"):
            title = li.find('h2').text
            price = li.find('p', class_="pp-property-price").text
            info = [title, price]
            thewriter.writerow(info)
        sleep(randint(1, 5))

# this script scrapes all pages and records all listings and their prices in a daily csv
As you can see, I added sleep(randint(1, 5)) to introduce random intervals, but I possibly need to do more. Of course I want to scrape the site in its entirety as quickly as possible, but I also want to be respectful to the site being scraped and minimise the burden on it.
Can anyone suggest updates? P.S. forgive any rookie errors, I'm very new to Python/scraping!
This is one way of getting that data - bear in mind there are only 251 pages, with 12 properties on each, not over 5k:
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup as bs

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'accept': 'application/json',
    'accept-language': 'en-US,en;q=0.9',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin'
}

s = requests.Session()
s.headers.update(headers)

big_list = []
for x in tqdm(range(1, 252)):
    soup = bs(s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}').text, 'html.parser')
    # print(soup)
    properties = soup.select('li.pp-property-box')
    for p in properties:
        name = p.select_one('h2').get_text(strip=True) if p.select_one('h2') else None
        url = 'https://www.propertypal.com' + p.select_one('a').get('href') if p.select_one('a') else None
        price = p.select_one('p.pp-property-price').get_text(strip=True) if p.select_one('p.pp-property-price') else None
        big_list.append((name, price, url))

big_df = pd.DataFrame(big_list, columns=['Property', 'Price', 'Url'])
print(big_df)
Result printed in terminal:
100% 251/251 [03:41<00:00, 1.38it/s]
Property Price Url
0 22 Erinvale Gardens, Belfast, BT10 0FS Asking price£165,000 https://www.propertypal.com/22-erinvale-gardens-belfast/777820
1 Laurel Hill, 37 Station Road, Saintfield, BT24 7DZ Guide price£725,000 https://www.propertypal.com/laurel-hill-37-station-road-saintfield/751274
2 19 Carrick Brae, Burren Warrenpoint, Newry, BT34 3TH Guide price£265,000 https://www.propertypal.com/19-carrick-brae-burren-warrenpoint-newry/775302
3 7b Conway Street, Lisburn, BT27 4AD Offers around£299,950 https://www.propertypal.com/7b-conway-street-lisburn/779833
4 Hartley Hall, Greenisland From£280,000to£397,500 https://www.propertypal.com/hartley-hall-greenisland/d850
... ... ... ...
3007 8 Shimna Close, Newtownards, BT23 4PE Offers around£99,950 https://www.propertypal.com/8-shimna-close-newtownards/756825
3008 7 Barronstown Road, Dromore, BT25 1NT Guide price£380,000 https://www.propertypal.com/7-barronstown-road-dromore/756539
3009 39 Tamlough Road, Randalstown, BT41 3DP Offers around£425,000 https://www.propertypal.com/39-tamlough-road-randalstown/753299
3010 Glengeen House, 17 Carnalea Road, Fintona, BT78 2BY Offers over£180,000 https://www.propertypal.com/glengeen-house-17-carnalea-road-fintona/750105
3011 Walnut Road, Larne, BT40 2WE Offers around£169,950 https://www.propertypal.com/walnut-road-larne/749733
[3012 rows x 3 columns]
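If you want the daily CSV file the original script produced, a minimal follow-up (reusing the date-based filename from the question) would be:
from datetime import date
big_df.to_csv(date.today().strftime("ni-listings-%Y-%m-%d.csv"), index=False)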
See relevant documentation for Requests: https://requests.readthedocs.io/en/latest/
For Pandas: https://pandas.pydata.org/docs/
For BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/
And for TQDM: https://pypi.org/project/tqdm/
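If you still want to throttle the requests to stay gentle on the site, as the question asked, one small tweak (a sketch, with a delay range you would pick yourself) is to pause between page requests inside the loop above:
from time import sleep
from random import uniform

for x in tqdm(range(1, 252)):
    soup = bs(s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}').text, 'html.parser')
    # ... same parsing as above ...
    sleep(uniform(0.5, 1.5))  # hypothetical delay; adjust to whatever the site can comfortably handle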
import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'https://locations.atipt.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

r = requests.get('https://locations.atipt.com/al')
soup = BeautifulSoup(r.content, 'html.parser')
tra = soup.find_all('ul', class_='list-unstyled')

productlinks = []
for links in tra:
    for link in links.find_all('a', href=True):
        comp = baseurl + link['href']
        productlinks.append(comp)

temp = []
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    tag = soup.find_all('div', class_='listing content-card')
    for pro in tag:
        for tup in pro.find_all('p'):
            temp.append([text for text in tup.stripped_strings])

df = pd.DataFrame(temp)
print(df)
This is the output I get
9256 Parkway E Ste A
Birmingham,
Alabama 35206
but I don't know how to name the columns in the DataFrame. I would like '9256 Parkway E Ste A' under Address, 'Birmingham' under City, and 'Alabama 35206' under State. If that is possible, I'd appreciate the help.
import pandas as pd

temp = []
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    tag = soup.find_all('div', class_='listing content-card')
    for pro in tag:
        data = [tup.text for tup in pro.find_all('p')]
        address = "".join(data[:2])
        splitdata = data[2].split(",")
        city = splitdata[0]
        splitsecond = splitdata[-1].split("\xa0")
        state = splitsecond[0]
        postalcode = splitsecond[-1]
        temp.append([address, city, state, postalcode])

df = pd.DataFrame(temp, columns=["Address", "City", "State", "Postalcode"])
df
Output:
Address City State Postalcode
0 634 1st Street NSte 100 Alabaster AL 35007
1 9256 Parkway ESte A Birmingham AL 35206
....
If you want to add call details, just add this statement after postalcode:
callNumber = pro.find("span", class_="directory-phone").get_text(strip=True).split("\n")[-1].lstrip()
and append it to the temp list.
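Put together, the end of the inner loop would then look roughly like this (a sketch based on the snippet above; the extra "Phone" column name is my own choice):
        postalcode = splitsecond[-1]
        callNumber = pro.find("span", class_="directory-phone").get_text(strip=True).split("\n")[-1].lstrip()
        temp.append([address, city, state, postalcode, callNumber])
df = pd.DataFrame(temp, columns=["Address", "City", "State", "Postalcode", "Phone"])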
Newbie here trying to scrape https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/
This is the function I've written. Is it a problem with the XPath?
def parse(self, response):
    for row in response.xpath("//table[#id='table3']"):
        name = row.xpath(".//tr//td[1]//text()").get()
        yield {
            'name': name
        }
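Note that XPath selects attributes with @, not # (the # syntax is CSS). A corrected version of the function might look like this (a sketch, assuming the static HTML really contains a table with id='table3'):
def parse(self, response):
    # iterate over the table's rows, not the table element itself
    for row in response.xpath("//table[@id='table3']//tr"):
        name = row.xpath(".//td[1]//text()").get()
        if name:
            yield {'name': name}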
pandas can scrape tables quickly:
import pandas as pd
import requests
url = 'https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
df = pd.read_html(r.text)
The table can be found in df[0]. df[0].head():
         Country     Cases  Deaths         Region
0  United States  34592377  621293  North America
1          India  30585229  402758           Asia
2         Brazil  18769808  524475  South America
3         France   5786203  111161         Europe
4         Russia   5635294  138579         Europe
You can save it with df[0].to_csv('filename.csv').
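If the layout ever changes and that table is no longer the first one on the page, you can also select it by its header text, e.g. (a small sketch using read_html's match argument):
df = pd.read_html(r.text, match='Country')[0]  # keep only tables whose text contains 'Country'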
Never mind, I got it to work. I just needed to turn JavaScript off when selecting the XPath.
While web scraping a list of college names and sites, I used a regex to separate the sites and append them to the college_site list, but Python raises 'list index out of range', even though the loop starts at the start and ends at the end of the loop! print(line[num]) works, but appending to the list doesn't. Where do I need to make a change?
My code is:
import requests
from bs4 import BeautifulSoup
import json
import re

URL = 'http://doors.stanford.edu/~sr/universities.html'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

college_site = []

def college():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    site = "\w+\.+\w+\)"
    for ol in soup.find_all('ol'):
        for num in range(len((ol.get_text()))):
            line = ol.get_text().split()
            if (re.search(site, line[num])):
                college_site.append(line[num])
                # works if i put: print(line[num])
    with open('E:\Python\mails for college\\test2\sites.json', 'w') as sites:
        json.dump(college_site, sites)

if __name__ == '__main__':
    college()
The problem is this part: for num in range(len((ol.get_text()))). You want to index into the list of words, but your loop counter runs over every character of the text, so it quickly goes past the end of line. The fix is simple.
change:
for num in range(len((ol.get_text()))):
    line = ol.get_text().split()
to:
line = ol.get_text().split()
for num in range(len(line)):
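The difference is easy to see on a toy string:
text = "MIT (mit.edu)"
len(text)          # 13 -- range(len(text)) walks over every character
len(text.split())  # 2  -- range(len(text.split())) walks over the words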
full example:
import requests
from bs4 import BeautifulSoup
import json
import re

URL = 'http://doors.stanford.edu/~sr/universities.html'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

college_site = []

def college():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    site = r"\w+\.+\w+\)"  # raw string so the backslashes reach the regex engine intact
    for ol in soup.find_all('ol'):
        line = ol.get_text().split()
        for num in range(len(line)):
            if re.search(site, line[num]):
                college_site.append(line[num])
    with open(r'E:\Python\mails for college\test2\sites.json', 'w') as sites:
        json.dump(college_site, sites)

if __name__ == '__main__':
    college()
To get the list of universities and their links, you can use this example:
import requests
from bs4 import BeautifulSoup
import json

URL = 'http://doors.stanford.edu/~sr/universities.html'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

college_sites = []

def college():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    for li in soup.select('ol li'):
        college_name = li.a.get_text(strip=True)
        college_link = li.a.find_next_sibling(text=True).strip()
        print(college_name, college_link)
        college_sites.append((college_name, college_link))
    with open('data.json', 'w') as sites:
        json.dump(college_sites, sites, indent=4)

if __name__ == '__main__':
    college()
Prints:
Abilene Christian University (acu.edu)
Adelphi University (adelphi.edu)
Agnes Scott College (scottlan.edu)
Air Force Institute of Technology (afit.af.mil)
Alabama A&M University (aamu.edu)
Alabama State University (alasu.edu)
Alaska Pacific University
Albertson College of Idaho (acofi.edu)
Albion College (albion.edu)
Alderson-Broaddus College
Alfred University (alfred.edu)
Allegheny College (alleg.edu)
...
and saves data.json:
[
    [
        "Abilene Christian University",
        "(acu.edu)"
    ],
    [
        "Adelphi University",
        "(adelphi.edu)"
    ],
    [
        "Agnes Scott College",
        "(scottlan.edu)"
    ],
    ...
New to Python; I wrote the following code:
import bs4
from urllib.request import urlopen as Open
from urllib.request import Request
from bs4 import BeautifulSoup as soup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}
results = "https://www.otodom.pl/sprzedaz/mieszkanie/?nrAdsPerPage=72&search%5Border%5D=created_at_first%3Adesc&page=1"

req = Request(url=results, headers=headers)
html = Open(req).read()
page_soup = soup(html, "html.parser")
total_pages = int(page_soup.find("div", {"class": "after-offers clearfix"}).find("ul", {"class": "pager"}).findAll("li")[4].text)

page_number = 0
if page_number < total_pages:
    page_number = page_number + 1
    results = "https://www.otodom.pl/sprzedaz/mieszkanie/?nrAdsPerPage=72&search%5Border%5D=created_at_first%3Adesc&page=" + str(page_number)
    print(results)
    req = Request(url=results, headers=headers)
    html = Open(req).read()
    page_soup = soup(html, "html.parser")
    listings = page_soup.findAll("article", {"data-featured-name": "listing_no_promo"})
    print(len(listings))
I would have expected the end result to be a stream of printed-out links and the number of listings on each page, yet all I get is:
https://www.otodom.pl/sprzedaz/mieszkanie/?nrAdsPerPage=72&search%5Border%5D=created_at_first%3Adesc&page=1
72
Any help would be appreciated, many thanks in advance!
In your script there is no loop to build page_soup from each new page; the if block runs only once.
This script scrapes the total number of pages and then iterates over them, printing each offer's name and link:
import requests
from bs4 import BeautifulSoup as soup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}
results = "https://www.otodom.pl/sprzedaz/mieszkanie/?nrAdsPerPage=72&search%5Border%5D=created_at_first%3Adesc&page={}"

with requests.session() as s:
    req = s.get(results.format(1), headers=headers)
    page_soup = soup(req.text, "html.parser")
    total_pages = int(page_soup.find("div", {"class": "after-offers clearfix"}).find("ul", {"class": "pager"}).findAll("li")[4].text)
    print(total_pages)

    cnt = 1
    for page in range(1, 10):  # <--- change 10 to total_pages to scrape all pages
        req = s.get(results.format(page), headers=headers)
        page_soup = soup(req.text, "html.parser")
        for a in page_soup.select('h3 a[data-featured-name="listing_no_promo"]'):
            name, link = a.find_next('span', {'class': 'offer-item-title'}).text, a['href']
            print('{:<4} {:<50} {}'.format(cnt, name, link))
            cnt += 1
Prints:
1645
1 Biuro Sprzedaży Mieszkań 2 Pokoje Bezpośrednio https://www.otodom.pl/oferta/biuro-sprzedazy-mieszkan-2-pokoje-bezposrednio-ID43LEw.html#b3d6f6add3
2 Przestronne mieszkanie na nowej inwestycji - 2020 https://www.otodom.pl/oferta/przestronne-mieszkanie-na-nowej-inwestycji-2020-ID43LEt.html#b3d6f6add3
3 Kapitalny remont, Grabiszyńska, parking https://www.otodom.pl/oferta/kapitalny-remont-grabiszynska-parking-ID43LE0.html#b3d6f6add3
4 Przestronne mieszkanie przy ulicy Żurawiej https://www.otodom.pl/oferta/przestronne-mieszkanie-przy-ulicy-zurawiej-ID43LDZ.html#b3d6f6add3
5 Katowice Bezpośrednio 3 Pokoje https://www.otodom.pl/oferta/katowice-bezposrednio-3-pokoje-ID43LDX.html#b3d6f6add3
6 2 Pokojowe mieszkanie na osiedlu zamkniętym Łomian https://www.otodom.pl/oferta/2-pokojowe-mieszkanie-na-osiedlu-zamknietym-lomian-ID43LDV.html#b3d6f6add3
7 Słoneczne 3 pokojowe w doskonałej lokalizacji ! https://www.otodom.pl/oferta/sloneczne-3-pokojowe-w-doskonalej-lokalizacji-ID43LDS.html#b3d6f6add3
8 Inteligenty apartament Zajezdnia Wrzeszcz https://www.otodom.pl/oferta/inteligenty-apartament-zajezdnia-wrzeszcz-ID43LDR.html#b3d6f6add3
9 Mieszkanie, 32,04 m², Szczecin https://www.otodom.pl/oferta/mieszkanie-32-04-m-szczecin-ID43LDN.html#b3d6f6add3
10 M-3 Teofilów Na Sprzedaż https://www.otodom.pl/oferta/m-3-teofilow-na-sprzedaz-ID43LDI.html#b3d6f6add3
11 2-Pokojowe Mieszkanie https://www.otodom.pl/oferta/2-pokojowe-mieszkanie-ID43LDH.html#b3d6f6add3
12 2 duże pokoje w centrum Gdańsk ul. Zakopiańska https://www.otodom.pl/oferta/2-duze-pokoje-w-centrum-gdansk-ul-zakopianska-ID43LDE.html#b3d6f6add3
13 M2 na Zabobrzu III https://www.otodom.pl/oferta/m2-na-zabobrzu-iii-ID43LDx.html#b3d6f6add3
14 Mieszkanie 2 pokojowe ,atrakcyjna cena https://www.otodom.pl/oferta/mieszkanie-2-pokojowe-atrakcyjna-cena-ID43LDv.html#b3d6f6add3
15 M2 Centrum Miasta, I piętro https://www.otodom.pl/oferta/m2-centrum-miasta-i-pietro-ID43LDt.html#b3d6f6add3
16 Rodzinny 3 Pokojowy Apartament z Ogródkiem https://www.otodom.pl/oferta/rodzinny-3-pokojowy-apartament-z-ogrodkiem-ID43LDr.html#b3d6f6add3
17 2 pokoje. Aneks kuchenny. 45,5 m. Balkon https://www.otodom.pl/oferta/2-pokoje-aneks-kuchenny-45-5-m-balkon-ID43LDp.html#b3d6f6add3
... and so on.