I'll try to keep a long story short, so I apologize in advance; feel free to ask more questions for clarity.
Essentially, I am trying to write a web-scraping script that takes info from Zillow and puts it into a pandas DataFrame, so that I can learn both pandas and beautifulsoup4 in the process. I am trying to avoid using the Zillow API, but it seems it might be my only option. When I scrape the location the user inputs, it only returns 7 properties. I was told this is because of the JavaScript Zillow uses ("lazy loading" or "infinite scrolling"): the other properties aren't loaded until the user scrolls. I tried using Selenium instead of requests, but I end up getting hit with the bot-verification captcha. I tried using headers and everything but can't seem to figure out a solution other than the API.
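For reference, the Selenium attempt is just the usual scroll-to-bottom pattern to trigger the lazy loading (only a sketch; the city URL and the 2-second waits are arbitrary, and this is the point where I get the bot check):
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homes/for_sale/Pennsauken/")

# scroll a few times so the lazy-loaded cards get added to the DOM
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # crude wait for the new cards to render

html = driver.page_source  # hand this to BeautifulSoup instead of response.content
driver.quit()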
Here's my code BEFORE using selenium (aka when it semi-worked):
from bs4 import BeautifulSoup
import pandas as pd
from uszipcode import SearchEngine
import requests

search = SearchEngine()
zipcode = input("What is your zipcode: ")
zipcode_info = search.by_zipcode(zipcode)

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as session:
    url = "https://www.zillow.com/homes/for_sale/" + zipcode_info.major_city + "/"
    response = session.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')
soup.prettify()

df = pd.DataFrame()
address = list()
price = list()
bed_bath = list()
links = list()

properties = soup.find_all("li", attrs={"class": "ListItem-c11n-8-73-8__sc-10e22w8-0 srp__hpnp3q-0 enEXBq with_constellation"})
for li in properties:
    try:
        address.append(li.find("a", attrs={"class": "StyledPropertyCardDataArea-c11n-8-73-8__sc-yipmu-0 lhIXlm property-card-link"}).text)
    except:
        pass
    try:
        price.append(li.find("span", attrs={"data-test": "property-card-price"}).text)
    except:
        pass
    try:
        span = li.find("span", attrs={"class": "StyledPropertyCardHomeDetails-c11n-8-73-8__sc-1mlc4v9-0 jlVIIO"})
        for subspan in span:
            bed_bath.append(subspan.find("b").text)
    except:
        pass
    try:
        links.append(li.find("a", attrs={"data-test": "property-card-link"}).get("href"))
    except:
        pass

df['Address'] = address
df['Price'] = price
df['Links'] = links
print(df)
And the output is:
Address Price Links
0 525 W River Dr, Pennsauken, NJ 08110 $259,900 https://www.zillow.com/homedetails/525-W-River...
1 7519 Remington, Merchantville, NJ 08109 $270,000 https://www.zillow.com/homedetails/7519-Reming...
2 2269 Marlon Ave, Pennsauken, NJ 08110 $220,000 https://www.zillow.com/homedetails/2269-Marlon...
3 8129 River Rd, Pennsauken, NJ 08110 $324,999 https://www.zillow.com/homedetails/8129-River-...
4 1653 Springfield Ave, Pennsauken, NJ 08110 $259,900 https://www.zillow.com/homedetails/1653-Spring...
5 5531 Jackson Ave, Pennsauken, NJ 08110 $265,000 https://www.zillow.com/homedetails/5531-Jackso...
6 8141 Stow Rd, Pennsauken, NJ 08110 $359,000 https://www.zillow.com/homedetails/8141-Stow-R...
7 2203 42nd St, Pennsauken, NJ 08110 $275,000 https://www.zillow.com/homedetails/2203-42nd-S...
Related
I have a fairly basic Python script that scrapes a property website, and stores the address and price in a csv file. There are over 5000 listings to go through but I find my current code times out after a while (about 2000 listings) and the console shows 302 and CORS policy errors.
import requests
import itertools
from bs4 import BeautifulSoup
from csv import writer
from random import randint
from time import sleep
from datetime import date
url = "https://www.propertypal.com/property-for-sale/northern-ireland/page-"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
filename = date.today().strftime("ni-listings-%Y-%m-%d.csv")

with open(filename, 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)

    # for page in range(1, 3):
    for page in itertools.count(1):
        req = requests.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')
        for li in soup.find_all('li', class_="pp-property-box"):
            title = li.find('h2').text
            price = li.find('p', class_="pp-property-price").text
            info = [title, price]
            thewriter.writerow(info)
        sleep(randint(1, 5))

# this script scrapes all pages and records all listings and their prices in a daily csv
As you can see, I added sleep(randint(1, 5)) to add random intervals, but I possibly need to do more. Of course I want to scrape the pages in their entirety as quickly as possible, but I also want to be respectful to the site being scraped and minimise burdening it.
Can anyone suggest updates? PS: forgive rookie errors, I'm very new to Python/scraping!
This is one way of getting that data - bear in mind there are 251 pages only, with 12 properties on each of them, not over 5k:
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
'accept': 'application/json',
'accept-language': 'en-US,en;q=0.9',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'same-origin'
}
s = requests.Session()
s.headers.update(headers)
big_list = []
for x in tqdm(range(1, 252)):
    soup = bs(s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}').text, 'html.parser')
    # print(soup)
    properties = soup.select('li.pp-property-box')
    for p in properties:
        name = p.select_one('h2').get_text(strip=True) if p.select_one('h2') else None
        url = 'https://www.propertypal.com' + p.select_one('a').get('href') if p.select_one('a') else None
        price = p.select_one('p.pp-property-price').get_text(strip=True) if p.select_one('p.pp-property-price') else None
        big_list.append((name, price, url))
big_df = pd.DataFrame(big_list, columns=['Property', 'Price', 'Url'])
print(big_df)
Result printed in terminal:
100%
251/251 [03:41<00:00, 1.38it/s]
Property Price Url
0 22 Erinvale Gardens, Belfast, BT10 0FS Asking price£165,000 https://www.propertypal.com/22-erinvale-gardens-belfast/777820
1 Laurel Hill, 37 Station Road, Saintfield, BT24 7DZ Guide price£725,000 https://www.propertypal.com/laurel-hill-37-station-road-saintfield/751274
2 19 Carrick Brae, Burren Warrenpoint, Newry, BT34 3TH Guide price£265,000 https://www.propertypal.com/19-carrick-brae-burren-warrenpoint-newry/775302
3 7b Conway Street, Lisburn, BT27 4AD Offers around£299,950 https://www.propertypal.com/7b-conway-street-lisburn/779833
4 Hartley Hall, Greenisland From£280,000to£397,500 https://www.propertypal.com/hartley-hall-greenisland/d850
... ... ... ...
3007 8 Shimna Close, Newtownards, BT23 4PE Offers around£99,950 https://www.propertypal.com/8-shimna-close-newtownards/756825
3008 7 Barronstown Road, Dromore, BT25 1NT Guide price£380,000 https://www.propertypal.com/7-barronstown-road-dromore/756539
3009 39 Tamlough Road, Randalstown, BT41 3DP Offers around£425,000 https://www.propertypal.com/39-tamlough-road-randalstown/753299
3010 Glengeen House, 17 Carnalea Road, Fintona, BT78 2BY Offers over£180,000 https://www.propertypal.com/glengeen-house-17-carnalea-road-fintona/750105
3011 Walnut Road, Larne, BT40 2WE Offers around£169,950 https://www.propertypal.com/walnut-road-larne/749733
3012 rows × 3 columns
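To get back to the daily CSV the question writes, you can dump the resulting frame with the same date-based filename:
from datetime import date

big_df.to_csv(date.today().strftime("ni-listings-%Y-%m-%d.csv"), index=False)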
See relevant documentation for Requests: https://requests.readthedocs.io/en/latest/
For Pandas: https://pandas.pydata.org/docs/
For BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/
And for TQDM: https://pypi.org/project/tqdm/
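On the original point about being respectful to the site and the intermittent failures: one option (a sketch, not tested against this site) is to mount automatic retries with backoff for transient errors on the requests session, and keep a short pause between pages:
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

s = requests.Session()
retry = Retry(total=5, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retry))

for x in range(1, 252):
    resp = s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}')
    # ... parse resp.text with BeautifulSoup exactly as above ...
    time.sleep(1)  # short, fixed pause between pages to keep the load down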
I am trying to scrape Yellow Pages. Everything is working fine except scraping the phone numbers: each one sits in a div with class 'popover-phones', which contains an a tag whose href holds the phone number. Can anyone assist me, please? (See the Yellow Pages inspection screenshot.)
import requests
from bs4 import BeautifulSoup
import json
from csv import writer

url = 'https://yellowpages.com.eg/en/category/charcoal'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
articles = soup.find_all('div', class_='col-xs-12 item-details')
for item in articles:
    address = item.find('a', class_='address-text').text
    company = item.find('a', class_='item-title').text
    telephone = item.find('div', class_='popover-phones')
    print(company, address, telephone)
The phone numbers you see are loaded from an external URL. To get all phone numbers from the page, you can use the next example:
import requests
from bs4 import BeautifulSoup
url = "https://yellowpages.com.eg/en/category/charcoal"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for p in soup.select("[data-tooltip-phones]"):
    phone_url = "https://yellowpages.com.eg" + p["data-tooltip-phones"]
    title = p.find_previous(class_="item-title").text
    phones = requests.get(phone_url).json()
    print(title, *[b for a in phones for b in a])
Prints:
2 Bacco 02-3390-8764
3 A Group International 0120-3530-005 057-2428-449
3 A Group International 0120-3833-500 0120-3530-005
Abdel Karim 0122-3507-461
Abdel Sabour Zidan 03-4864-641
Abou Aoday 0111-9226-536 0100-3958-351
Abou Eid For Charcoal Trading 0110-0494-770
Abou Fares For Charcoal Trade 0128-3380-916
Abou Karim Store 0100-6406-939
Adel Sons 0112-1034-398 0115-0980-776
Afandina 0121-2414-087
Ahmed El Fahham 02-2656-0815
Al Baraka For Charcoal 0114-6157-799 0109-3325-720
Al Ghader For Import & Export 03-5919-355 0111-0162-602 0120-6868-434
Al Mashd For Coal 0101-0013-743 0101-0013-743
Al Zahraa Co. For Exporting Charcoal & Agriculture Products 040-3271-056 0100-0005-174 040-3271-056
Alex Carbon Group 03-3935-902
Alwaha Charcoal Trade Est. 0100-4472-554 0110-1010-810 0100-9210-812
Aly Abdel Rahman For Charcoal Trade 03-4804-440 0122-8220-661
Amy Deluxe Egypt 0112-5444-410
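If you want the results in a CSV file (the question already imports writer from csv), a minimal sketch of the same loop writing rows instead of printing could look like this; the output filename is just an example:
import requests
from bs4 import BeautifulSoup
from csv import writer

url = "https://yellowpages.com.eg/en/category/charcoal"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

with open("charcoal.csv", "w", encoding="utf8", newline="") as f:
    out = writer(f)
    out.writerow(["Company", "Phones"])
    for p in soup.select("[data-tooltip-phones]"):
        phone_url = "https://yellowpages.com.eg" + p["data-tooltip-phones"]
        title = p.find_previous(class_="item-title").text
        phones = requests.get(phone_url).json()  # list of lists, as in the print above
        out.writerow([title, " ".join(b for a in phones for b in a)])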
I want to iterate over all pages from this URL, https://www.iata.org/en/about/members/airline-list/, and dump the results into a .csv file.
How could a piece of code that iterates through the pages be added to the current code below?
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request
url = 'https://www.iata.org/en/about/members/airline-list/'
req = Request(url, headers={
    'accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'})
data = []
while True:
    print(url)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('table.datatable').prettify())[0])
    if soup.select_one('span.pagination-link.is-active + div a[href]'):
        url = soup.select_one('span.pagination-link.is-active + div a')['href']
    else:
        break
df = pd.concat(data)
df.to_csv('airline-list.csv', encoding='utf-8-sig', index=False)
Try this approach:
for i in range(1, 30):
    url = f'https://www.iata.org/en/about/members/airline-list/?page={i}&search=&ordering=Alphabetical'
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('table.datatable').prettify())[0])
To get data dynamically, use:
import pandas as pd
import requests
import bs4
url = 'https://www.iata.org/en/about/members/airline-list/?page={page}&search=&ordering=Alphabetical'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
# Total number of pages
html = requests.get(url.format(page=1), headers=headers)
soup = bs4.BeautifulSoup(html.text)
pages = int(soup.find_all('a', {'class': 'pagination-link'})[-2].text)
data = []
for page in range(1, pages+1):
    html = requests.get(url.format(page=page), headers=headers)
    data.append(pd.read_html(html.text)[0])
df = pd.concat(data)
Output:
>>> df
Airline Name IATA Designator 3 digit code ICAO code Country / Territory
0 ABX Air GB 832 ABX United States
1 Aegean Airlines A3 390 AEE Greece
2 Aer Lingus EI 53 EIN Ireland
3 Aero Republica P5 845 RPB Colombia
4 Aeroflot SU 555 AFL Russian Federation
.. ... ... ... ... ...
3 WestJet WS 838 WJA Canada
4 White coloured by you WI 97 WHT Portugal
5 Wideroe WF 701 WIF Norway
6 Xiamen Airlines MF 731 CXA China (People's Republic of)
7 YTO Cargo Airlines YG 860 HYT China (People's Republic of)
[288 rows x 5 columns]
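The question also wants the result dumped to a .csv file; as in the original snippet, that is one more line after the concat:
df.to_csv('airline-list.csv', encoding='utf-8-sig', index=False)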
import requests
from bs4 import BeautifulSoup
import pandas as pd
baseurl='https://locations.atipt.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get('https://locations.atipt.com/al')
soup = BeautifulSoup(r.content, 'html.parser')
tra = soup.find_all('ul', class_='list-unstyled')
productlinks = []
for links in tra:
    for link in links.find_all('a', href=True):
        comp = baseurl + link['href']
        productlinks.append(comp)

temp = []
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    tag = soup.find_all('div', class_='listing content-card')
    for pro in tag:
        for tup in pro.find_all('p'):
            temp.append([text for text in tup.stripped_strings])

df = pd.DataFrame(temp)
print(df)
This is the output I get
9256 Parkway E Ste A
Birmingham,
Alabama 35206
but I don't know how to name the columns in the data frame. I want to assign 'Address' to 9256 Parkway E Ste A, 'City' to Birmingham, and 'State' to Alabama 35206. If that is possible, kindly help me with this.
temp = []
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    tag = soup.find_all('div', class_='listing content-card')
    for pro in tag:
        data = [tup.text for tup in pro.find_all('p')]
        address = "".join(data[:2])
        splitdata = data[2].split(",")
        city = splitdata[0]
        splitsecond = splitdata[-1].split("\xa0")
        state = splitsecond[0]
        postalcode = splitsecond[-1]
        temp.append([address, city, state, postalcode])

import pandas as pd
df = pd.DataFrame(temp, columns=["Address", "City", "State", "Postalcode"])
df
Output:
Address City State Postalcode
0 634 1st Street NSte 100 Alabaster AL 35007
1 9256 Parkway ESte A Birmingham AL 35206
....
If you want to add the call details, just add this statement after postalcode:
callNumber = pro.find("span", class_="directory-phone").get_text(strip=True).split("\n")[-1].lstrip()
and append it to the temp list.
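Put together, a sketch of the modified loop body and DataFrame (the 'Phone' column name is just a suggestion):
callNumber = pro.find("span", class_="directory-phone").get_text(strip=True).split("\n")[-1].lstrip()
temp.append([address, city, state, postalcode, callNumber])

df = pd.DataFrame(temp, columns=["Address", "City", "State", "Postalcode", "Phone"])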
I'm working on a university project and need to get data online. I would like to get some data from this website:
https://www.footballdatabase.eu/en/transfers/-/2020-10-03
For the 3rd of October I managed to get the first 19 rows, but there are 6 pages and I'm struggling to activate the button that loads the next page.
This is the html code for the button:
2
My code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.footballdatabase.eu/en/transfers/-/2020-10-03"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
Players = pageSoup.find_all("span", {"class": "name"})
Team = pageSoup.find_all("span", {"class": "firstteam"})
Values = pageSoup.find_all("span", {"class": "transferamount"})
Values[0].text
PlayersList = []
TeamList = []
ValuesList = []
j = 1
for i in range(0, 20):
    PlayersList.append(Players[i].text)
    TeamList.append(Team[i].text)
    ValuesList.append(Values[i].text)
    j = j + 1
df = pd.DataFrame({"Players": PlayersList, "Team": TeamList, "Values": ValuesList})
Thank you very much!
You can use the requests module to simulate the Ajax call. For example:
import requests
from bs4 import BeautifulSoup
data = {
    'date': '2020-10-03',
    'pid': 1,
    'page': 1,
    'filter': 'full',
}
url = 'https://www.footballdatabase.eu/ajax_transfers_show.php'

for data['page'] in range(1, 7):  # <--- adjust number of pages here
    soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
    for line in soup.select('.line'):
        name = line.a.text
        first_team = line.select_one('.firstteam').a.text if line.select_one('.firstteam').a else 'Free'
        second_team = line.select_one('.secondteam').a.text if line.select_one('.secondteam').a else 'Free'
        amount = line.select_one('.transferamount').text
        print('{:<30} {:<20} {:<20} {}'.format(name, first_team, second_team, amount))
Prints:
Bruno Amione Belgrano Hellas Vérone 1.7 M€
Ismael Gutierrez Betis Deportivo Atlético B 1 M€
Vitaly Janelt Bochum Brentford 500 k€
Sven Ulreich Bayern Munich Hambourg SV 500 k€
Salim Ali Al Hammadi Baniyas Khor Fakkan Prêt
Giovanni Alessandretti Ascoli U-20 Recanatese Prêt
Gabriele Bellodi AC Milan U-20 Alessandria Prêt
Louis Britton Bristol City B Torquay United Prêt
Juan Brunetta Godoy Cruz Parme Prêt
Bobby Burns Barrow Glentoran Prêt
Bohdan Butko Shakhtar Donetsk Lech Poznan Prêt
Nicolò Casale Hellas Vérone Empoli Prêt
Alessio Da Cruz Parme FC Groningue Prêt
Dalbert Henrique Inter Milan Rennes Prêt
...and so on.
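If you want the same DataFrame shape as in the question rather than printed lines, a small sketch that accumulates rows from the same Ajax loop might look like:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.footballdatabase.eu/ajax_transfers_show.php'
data = {'date': '2020-10-03', 'pid': 1, 'page': 1, 'filter': 'full'}

rows = []
for data['page'] in range(1, 7):
    soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
    for line in soup.select('.line'):
        first = line.select_one('.firstteam')
        rows.append({
            'Players': line.a.text,
            'Team': first.a.text if first.a else 'Free',
            'Values': line.select_one('.transferamount').text,
        })

df = pd.DataFrame(rows)
print(df)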