Optimising Python script for scraping to avoid getting blocked / draining resources - python

I have a fairly basic Python script that scrapes a property website and stores the address and price in a CSV file. There are over 5,000 listings to go through, but my current code times out after a while (about 2,000 listings) and the console shows 302 and CORS policy errors.
import requests
import itertools
from bs4 import BeautifulSoup
from csv import writer
from random import randint
from time import sleep
from datetime import date
url = "https://www.propertypal.com/property-for-sale/northern-ireland/page-"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
filename = date.today().strftime("ni-listings-%Y-%m-%d.csv")
with open(filename, 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)
    # for page in range(1, 3):
    for page in itertools.count(1):
        req = requests.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')
        for li in soup.find_all('li', class_="pp-property-box"):
            title = li.find('h2').text
            price = li.find('p', class_="pp-property-price").text
            info = [title, price]
            thewriter.writerow(info)
        sleep(randint(1, 5))
# this script scrapes all pages and records all listings and their prices in daily csv
As you can see, I added sleep(randint(1, 5)) to insert random delays, but I probably need to do more. Of course I want to scrape the site in its entirety as quickly as possible, but I also want to be respectful of the site being scraped and minimise the burden I put on it.
Can anyone suggest improvements? P.S. forgive any rookie errors - I'm very new to Python/scraping!

This is one way of getting that data - bear in mind there are only 251 pages, with 12 properties on each of them, so roughly 3,000 listings rather than over 5k:
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'accept': 'application/json',
    'accept-language': 'en-US,en;q=0.9',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin'
}
s = requests.Session()
s.headers.update(headers)
big_list = []
for x in tqdm(range(1, 252)):
    soup = bs(s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}').text, 'html.parser')
    # print(soup)
    properties = soup.select('li.pp-property-box')
    for p in properties:
        name = p.select_one('h2').get_text(strip=True) if p.select_one('h2') else None
        url = 'https://www.propertypal.com' + p.select_one('a').get('href') if p.select_one('a') else None
        price = p.select_one('p.pp-property-price').get_text(strip=True) if p.select_one('p.pp-property-price') else None
        big_list.append((name, price, url))
big_df = pd.DataFrame(big_list, columns = ['Property', 'Price', 'Url'])
print(big_df)
Result printed in terminal:
100% 251/251 [03:41<00:00, 1.38it/s]
Property Price Url
0 22 Erinvale Gardens, Belfast, BT10 0FS Asking price£165,000 https://www.propertypal.com/22-erinvale-gardens-belfast/777820
1 Laurel Hill, 37 Station Road, Saintfield, BT24 7DZ Guide price£725,000 https://www.propertypal.com/laurel-hill-37-station-road-saintfield/751274
2 19 Carrick Brae, Burren Warrenpoint, Newry, BT34 3TH Guide price£265,000 https://www.propertypal.com/19-carrick-brae-burren-warrenpoint-newry/775302
3 7b Conway Street, Lisburn, BT27 4AD Offers around£299,950 https://www.propertypal.com/7b-conway-street-lisburn/779833
4 Hartley Hall, Greenisland From£280,000to£397,500 https://www.propertypal.com/hartley-hall-greenisland/d850
... ... ... ...
3007 8 Shimna Close, Newtownards, BT23 4PE Offers around£99,950 https://www.propertypal.com/8-shimna-close-newtownards/756825
3008 7 Barronstown Road, Dromore, BT25 1NT Guide price£380,000 https://www.propertypal.com/7-barronstown-road-dromore/756539
3009 39 Tamlough Road, Randalstown, BT41 3DP Offers around£425,000 https://www.propertypal.com/39-tamlough-road-randalstown/753299
3010 Glengeen House, 17 Carnalea Road, Fintona, BT78 2BY Offers over£180,000 https://www.propertypal.com/glengeen-house-17-carnalea-road-fintona/750105
3011 Walnut Road, Larne, BT40 2WE Offers around£169,950 https://www.propertypal.com/walnut-road-larne/749733
3012 rows × 3 columns
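If you would rather not hard-code the 252-page range, here is a minimal sketch of letting the loop stop itself - it reuses the s session and bs import above, and assumes an empty li.pp-property-box selection means you have run past the last page:
import itertools
for x in itertools.count(1):
    soup = bs(s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}').text, 'html.parser')
    properties = soup.select('li.pp-property-box')
    if not properties:
        break  # assumption: a page with no listings means we are past the last page
    # ... collect name/price/url exactly as in the loop above ...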
See relevant documentation for Requests: https://requests.readthedocs.io/en/latest/
For Pandas: https://pandas.pydata.org/docs/
For BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/
And for TQDM: https://pypi.org/project/tqdm/
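If requests start failing after a couple of thousand pages, one option is to let the session retry with exponential backoff rather than hammering the site. Below is a minimal sketch using urllib3's Retry, which Requests uses under the hood - the status codes and limits are assumptions, not values documented by propertypal.com:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
retry = Retry(
    total=5,                                     # give up after 5 attempts
    backoff_factor=2,                            # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # assumed set of retryable status codes
    allowed_methods=["GET"],                     # called method_whitelist on urllib3 < 1.26
)
s = requests.Session()
s.headers.update(headers)  # same headers dict as above
s.mount('https://', HTTPAdapter(max_retries=retry))
r = s.get('https://www.propertypal.com/property-for-sale/northern-ireland/page-1', timeout=30)
r.raise_for_status()  # surface hard failures instead of silently parsing an error page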

Related

Webscrape e-commerce

I'm a beginner at web scraping using Python; however, I need to use it frequently.
I'm trying to scrape an e-shop for mobiles to get each item's name and price.
website: https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false
My code, using the "User-Agent" technique, is below:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}
web_page = requests.get(url,headers=headers)
soup = BeautifulSoup(web_page.content, "html.parser")
product_list = soup.find_all('div', class_='col-md-6 col-lg-4 mb-4')
product_list
Output: [] -> an empty list
I'm not sure I'm doing this right; also, when I look at the page source code, I find no product information.
That page is loaded initially, then further hydrated from an API (which returns HTML).
This is one way to get those products sold by Orange Egypt:
from bs4 import BeautifulSoup as bs
import requests
from tqdm import tqdm ## if using jupyter notebook, import as: from tqdm.notebook import tqdm
import pandas as pd
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
s = requests.Session()
s.headers.update(headers)
big_list = []
for x in tqdm(range(1, 16)):
    url = f'https://shop.orange.eg/en/catalog/ListCategoryProducts?IsMobile=false&pagenumber={x}&categoryId=24'
    r = s.get(url)
    soup = bs(r.text, 'html.parser')
    devices = soup.select('[class^="card device-card"]')
    for d in devices:
        product_title = d.select_one('h4[class^="card-title"] a[name="ancProduct"]').get('title')
        product_price = d.select_one('h4[class^="card-title"] a[name="ancProduct"]').get('data-gtm-click-price')
        product_link = d.select_one('h4[class^="card-title"] a[name="ancProduct"]').get('href')
        big_list.append((product_title, product_price, product_link))
df = pd.DataFrame(big_list, columns = ['Product', 'Price', 'Url'])
print(df)
Result:
Product Price Url
0 Samsung Galaxy Z Fold4 5G 46690.0000 //shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-z-fold4-5g
1 ASUS Vivobook Flip 14 9999.0000 //shop.orange.eg/en/devices/tablets-and-laptops/asus-vivobook-flip-14
2 Acer Aspire 3 A315-56 7299.0000 //shop.orange.eg/en/devices/tablets-and-laptops/acer-aspire-3-a315-56
3 Lenovo IdeaPad 3 15IGL05 5777.0000 //shop.orange.eg/en/devices/tablets-and-laptops/lenovo-tablets/lenovo-ideapad-3-15igl05
4 Lenovo IdeaPad Flex 5 16199.0000 //shop.orange.eg/en/devices/tablets-and-laptops/lenovo-tablets/lenovo-ideapad-flex-5
... ... ... ...
171 Eufy P1 Scale Wireless Smart Digital 699.0000 //shop.orange.eg/en/devices/accessories/scale-wireless/eufy-p1-scale-wireless-smart-digital
172 Samsung Smart TV 50AU7000 9225.0000 //shop.orange.eg/en/devices/smart-tv/samsung-tv-50tu7000
173 Samsung Smart TV 43T5300 6999.0000 //shop.orange.eg/en/devices/smart-tv/samsung-tv-43t5300
174 Samsung Galaxy A22 4460.0000 //shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-a22
175 Eufy eufycam 2 2 plus 1 kit 4999.0000 //shop.orange.eg/en/devices/accessories/camera-wireless/eufy-eufycam-2-2-plus-1-kit
176 rows × 3 columns
For TQDM visit https://pypi.org/project/tqdm/
For Requests documentation, see https://requests.readthedocs.io/en/latest/
Also for pandas: https://pandas.pydata.org/pandas-docs/stable/index.html
And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
The webpage is loaded dynamically from an external source via AJAX, so you have to use the API URL instead.
Example:
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}
ajax_url = 'https://shop.orange.eg/en/catalog/ListCategoryProducts'
params = {
    'IsMobile': 'false',
    'pagenumber': '2',
    'categoryId': '24'
}
for params['pagenumber'] in range(1, 2):  # range(1, 2) only fetches page 1; widen the range for more pages
    web_page = requests.get(ajax_url, headers=headers, params=params)
    time.sleep(5)
    soup = BeautifulSoup(web_page.content, "html.parser")
    product_list = soup.find_all('div', class_='col-md-6 col-lg-4 mb-4')
    for product in product_list:
        title = product.h4.get_text(strip=True)
        print(title)
Output:
Samsung MobilesSamsung Galaxy Z Fold4 5G
Tablets and LaptopsASUS Vivobook Flip 14
Tablets and LaptopsAcer Aspire 3 A315-56
Lenovo TabletsLenovo IdeaPad 3 15IGL05
Lenovo TabletsLenovo IdeaPad Flex 5
Samsung MobilesSamsung Galaxy S22 Ultra 5G
WearablesApple Watch Series 7
Samsung MobilesSamsung Galaxy Note 20 Ultra
GamingLenovo IdeaPad Gaming 3
Tablets and LaptopsSamsung Galaxy Tab S8 5G
Wireless ChargerLanex Charger Wireless Magnetic 3-in-1 15W
AccessoriesAnker Sound core R100

How do I scrape all movie titles, dates and reviews on the website below? https://www.nollywoodreinvented.com/list-of-all-reviews

I have tried the code below, but it only brings back the first page and does not fully load the reviews for the movies. I am interested in getting all the movie titles, movie dates, and reviews.
from bs4 import BeautifulSoup
import requests
url = 'https://www.nollywoodreinvented.com/list-of-all-reviews'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0'}
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.text, 'lxml')
movie_div = soup.find_all('div', class_='article-panel')
title = []
for div in movie_div:
    images = div.find_all('div', class_='article-image-wrapper')
    for image in images:
        image = image.find_all('div', class_='article-image')
        for img in image:
            title.append(img.a.img['title'])
date = []
for div in movie_div:
    date.append(div.find('div', class_='authorship type-date').text.strip())
info = []
for div in movie_div:
    info.append(div.find('div', class_='excerpt-text').text.strip())
import pandas as pd
movie = pd.DataFrame({'title': title, 'date': date, 'info': info}, index=None)
movie.head()
There is a backend API which serves up the HTML you are scraping. You can see it in action if you open your browser's Developer Tools > Network tab > Fetch/XHR and click on the 2nd or 3rd page link. We can recreate that POST request with Python like the below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
pages = 3
results_per_page = 500 #max 500 I think
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
url = 'https://www.nollywoodreinvented.com/wp-admin/admin-ajax.php'
output = []
for page in range(1, pages+1):
    payload = {
        'action': 'itajax-sort',
        'view': 'grid',
        'loop': 'main loop',
        'location': '',
        'thumbnail': '1',
        'rating': '1',
        'meta': '1',
        'award': '1',
        'badge': '1',
        'authorship': '1',
        'icon': '1',
        'excerpt': '1',
        'sorter': 'recent',
        'columns': '4',
        'layout': 'full',
        'numarticles': str(results_per_page),
        'largefirst': '',
        'paginated': str(page),
        'currentquery[category__in][]': '2648',
        'currentquery[category__in][]': '2649'  # duplicate dict key: only this last value is actually sent
    }
    resp = requests.post(url, headers=headers, data=payload).json()
    print(f'Scraping page: {page} - results: {results_per_page}')
    soup = BeautifulSoup(resp['content'], 'html.parser')
    for film in soup.find_all('div', class_='article-panel'):
        try:
            title = film.find('h3').text.strip()
        except AttributeError:
            continue
        date = datetime.strptime(film.find('span', class_='date').text.strip(), "%B %d, %Y").strftime('%Y-%m-%d')
        likes = film.find('span', class_='numcount').text.strip()
        if not likes:
            likes = 0
        full_stars = [1 for _ in film.find_all('span', class_='theme-icon-star-full')]
        half_stars = [0.5 for _ in film.find_all('span', class_='theme-icon-star-half')]
        stars = (sum(full_stars) + sum(half_stars)) / 2.0
        item = {
            'title': title,
            'date': date,
            'likes': likes,
            'stars': stars
        }
        output.append(item)
df= pd.DataFrame(output)
df.to_csv('nollywood_data.csv',index=False)
print('Saved to nollywood_data.csv')

I am scraping an HTML table and get the error: AttributeError: 'NoneType' object has no attribute 'select'

I am scraping an HTML table and get the error AttributeError: 'NoneType' object has no attribute 'select'. How can I solve it?
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}
r = requests.get("https://capitalonebank2.bluematrix.com/sellside/Disclosures.action")
soup = BeautifulSoup(r.content, "lxml")
table = soup.find('table', attrs={'style': "border"})
all_data = []
for row in table.select("tr:has(td)"):
    tds = [td.get_text(strip=True) for td in row.select("td")]
    all_data.append(tds)
df = pd.DataFrame(all_data, columns=header)
print(df)
It appears that the website you are trying to scrape blocks requests sent by the requests library. To deal with the issue, I used the Selenium library, which automates browsing in a real browser. The code below collects the titles given in the table.
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
browser = webdriver.Chrome()
browser.get("https://capitalonebank2.bluematrix.com/sellside/Disclosures.action")
soup = BeautifulSoup(browser.page_source, "lxml")
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"}
all_data = [i.text.strip() for i in soup.select("option")]
df = pd.DataFrame(all_data, columns=["Titles"])
print(df)
Output:
Titles
0 Agree Realty Corporation (ADC)
1 American Campus Communities, Inc. (ACC)
2 Antero Midstream Corporation (AM)
3 Antero Resources Corporation (AR)
4 Apache Corp. (APA)
.. ...
126 W. P. Carey Inc. (WPC)
127 Washington Real Estate Investment Trust (WRE)
128 Welltower Inc. (WELL)
129 Western Midstream Partners, LP (WES)
130 Whiting Petroleum Corporation (WLL)
If you have not used Selenium before, do not forget to install chromedriver.exe and add it to the PATH environment variable. You can also give the location of the driver to the constructor manually.
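For example, here is a minimal sketch of pointing Selenium at a specific driver binary - the path is a placeholder; the executable_path form matches the Selenium 3 style API used in this answer, while Selenium 4 prefers a Service object:
from selenium import webdriver
# Selenium 3 style, matching the find_elements_by_* calls used in this answer
browser = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe")  # placeholder path
# Selenium 4 equivalent:
# from selenium.webdriver.chrome.service import Service
# browser = webdriver.Chrome(service=Service(r"C:\path\to\chromedriver.exe"))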
Updated code to extract extra information
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
browser = webdriver.Chrome()
browser.get("https://capitalonebank2.bluematrix.com/sellside/Disclosures.action")
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"}
for title in browser.find_elements_by_css_selector('option'):
    title.click()
    time.sleep(1)
    browser.switch_to.frame(browser.find_elements_by_css_selector("iframe")[1])
    table = browser.find_element_by_css_selector("table table")
    soup = BeautifulSoup(table.get_attribute("innerHTML"), "lxml")
    all_data = []
    ratings = {"BUY": [], "HOLD": [], "SELL": []}
    lists_ = []
    for row in soup.select("tr")[-4:-1]:
        info_list = row.select("td")
        count = info_list[1].text
        percent = info_list[2].text
        IBServ_count = info_list[4].text
        IBServ_percent = info_list[5].text
        lists_.append([count, percent, IBServ_count, IBServ_percent])
    ratings["BUY"] = lists_[0]
    ratings["HOLD"] = lists_[1]
    ratings["SELL"] = lists_[2]
    print(ratings)
    browser.switch_to.default_content()

BeautifulSoup: organize data into a DataFrame table

I have been working with BeautifulSoup to try to organize some data that I am pulling from a website (HTML). I have been able to boil the data down, but I am getting stuck on how to:
eliminate unneeded info
organize the remaining data to be put into a pandas DataFrame
Here is the code I am working with:
import urllib.request
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
import requests
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})
url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'
page = requests.get(url,headers = headers)
soup = bs(page.text)
names = soup.body.findAll('tr')
function_names = re.findall('th class="\w+', str(names))
function_names = [item[10:] for item in function_names]
description = soup.body.findAll('td')
#description = re.findall('td class="\w+', str(description))
data = pd.DataFrame({'Title':function_names,'Info':description})
The error I have been getting is that the array lengths don't match up, which I know to be true, but when I uncomment the second description line it removes the numbers I want from there, and even then the table isn't organizing itself properly.
What I would like the output to look like is:
(headers) title: location | studio | 1 BR | 2 BR | 3 BR
(new line) data: Lehi, UT | $1,335 | $1,309 | $1,454 | $1,580
That is really all that I need but I can't get BS or Pandas to do it properly.
Any help would be greatly appreciated!
Try the following approach. It first extracts all of the data in the table and then transposes it (columns swapped with rows):
import urllib.request
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'
page = requests.get(url, headers=headers)
soup = bs(page.text, 'lxml')
table = soup.find("table", class_="rentTrendGrid")
rows = []
for tr in table.find_all('tr'):
    rows.append([td.text for td in tr.find_all(['th', 'td'])])
#header_row = rows[0]
rows = list(zip(*rows[1:]))  # transpose the table
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
Giving you the following kind of output:
Studio 1 BR 2 BR 3 BR
0 0 729 1,041 1,333
1 $1,335 $1,247 $1,464 $1,738
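To make the zip(*rows[1:]) transpose concrete, here is a tiny worked example - the header labels are made up, since the real first row is discarded by rows[1:]:
rows = [
    ['Bedrooms', 'Avg Sq Ft', 'Avg Rent'],  # header row (assumed labels), dropped by rows[1:]
    ['Studio', '0', '$1,335'],
    ['1 BR', '729', '$1,247'],
    ['2 BR', '1,041', '$1,464'],
    ['3 BR', '1,333', '$1,738'],
]
rows = list(zip(*rows[1:]))  # transpose: each original column becomes a row
# rows is now:
# [('Studio', '1 BR', '2 BR', '3 BR'),
#  ('0', '729', '1,041', '1,333'),
#  ('$1,335', '$1,247', '$1,464', '$1,738')]
# so rows[0] supplies the column names and rows[1:] the data rows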

Unsure how to web-scrape a specific value that could be in several different places

So I've been working on a web-scraping program and have been having some difficulties with one of the last bits.
There is this website that shows records of in-game fights like so:
Example 1: https://zkillboard.com/kill/44998359/
Example 2: https://zkillboard.com/kill/44917133/
I am trying to always scrape the full information of the player who scored the killing blow. That means their name, their corporation name, and their alliance name.
The information for the above examples are:
Example 1: Name = Happosait, Corp. = Arctic Light Inc., Alliance = Arctic Light
Example 2: Name = Lord Veninal, Corp. = Sniggerdly, Alliance = Pandemic Legion
While the "Final Blow" is always listed in the top right with the name, the name does not have the corporation and alliance with it as well. The full information is always listed below in the right-hand column, "## Involved", but their location in that column depends on how much damage they did in the fight, so it is not always on top, or anywhere specific for that matter.
So while I can get their names with:
kbPilotName = soup.find_all('td', style="text-align: center;")[0].find_all('a', href=re.compile('/character/'))[0].img.get('alt')
How can I get the rest of their information?
There is a textarea element containing all the data you are looking for. It's all in one text, but it's structured. You can choose a different way to parse it, but here is an example using regex:
import re
from bs4 import BeautifulSoup
import requests
url = 'https://zkillboard.com/kill/44998359/'
pattern = re.compile(r"(?s)Name: (.*?)Security: (.*?)Corp: (.*?)Alliance: (.*?)")
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    data = soup.select('form.form textarea#eft')[0].text
    for name, security, corp, alliance in pattern.findall(data):
        print(name.strip())
Prints:
Happosait (laid the final blow)
Baneken
Perkel
Tibor Vherok
Kheo Dons
Kayakka
Lina Ectelion
Jay Burner
Zalamus
Draacan Ferox
Luwanii
Jousen Momaki
Varcuntis Morannear
Grimm K-Man
Wob'Niar
Godfrey Silvarna
Quintus Corvus
Shadow Altair
Sieren
Isha Vir
Argyrosdraco
Jack None
Strixi
Alternative solution (parsing "involved" page):
from bs4 import BeautifulSoup
import requests
url = 'https://zkillboard.com/kill/44998359/'
involved_url = 'https://zkillboard.com/kill/44998359/involved/'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    session.get(url)
    response = session.get(involved_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    for row in soup.select('table.table tr.attacker'):
        name, corp, alliance = row.select('td.pilot > a')
        print(name.text, corp.text, alliance.text)
