How to increase the speed of my program in Python? - python

I am working on a web scraping project, and I have to get links from 19062 facilities. If I use a for loop, it will take almost 3 hours to complete. I tried making a generator but failed to make any logic, and I am not sure that it can be done using a generator. So, is there any Python expert who has an idea to get what I want faster? In my code, I execute it for just 20 ids. Thanks
import requests, json
from bs4 import BeautifulSoup as bs
url = 'https://hilfe.diakonie.de/hilfe-vor-ort/marker-json.php?ersteller=&kategorie=0&text=& n=55.0815&e=15.0418321&s=47.270127&w=5.8662579&zoom=20000'
res = requests.get(url).json()
url_1 = 'https://hilfe.diakonie.de/hilfe-vor-ort/info-window-html.php?id='
# extracting all the id= from .json res object
id = []
for item in res['items'][0]["elements"]:
id.append(item["id"])
# opening a .json file and making a dict for links
file = open('links.json', 'a')
links = {'links': []}
def link_parser(url, id):
resp = requests.get(url + id).content
soup = bs(resp, "html.parser")
link = soup.select_one('p > a').attrs['href']
links['links'].append(link)
# dumping the dict into links.json file
for item in id[:20]:
link_parser(url_1, item)
json.dump(links, file)
file.close()

In web scraping, speed is not a good idea! You will be hitting the server numerous times a second and will most likely get blocked if you use a For Loop. A generator will not make this quicker. Ideally, you want to hit the server once and process the data locally.
If it were me, I would want to use a framework like Scrapy that encourages good practice and various Spider classes to support standard techniques.

Related

Recursive crawling with BeautifulSoup really slow

I'm building a crawler that downloads all .pdf Files of a given website and its subpages. For this, I've used built-in functionalities around the below simplified recursive function that retrieves all links of a given URL.
However this becomes quite slow, the longer it crawls a given website (may take 2 minutes or longer per URL).
I can't quite figure out what's causing this and would really appreciate suggestions on what needs to be changed in order to increase the speed.
import re
import requests
from bs4 import BeautifulSoup
pages = set()
def get_links(page_url):
global pages
pattern = re.compile("^(/)")
html = requests.get(f"https://www.srs-stahl.de/{page_url}").text
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", href=pattern):
if "href" in link.attrs:
if link.attrs["href"] not in pages:
new_page = link.attrs["href"]
print(new_page)
pages.add(new_page)
get_links(new_page)
get_links("")
It is not that easy to figure out what activly slow down your crawling - It is maybe the way you crawl, server of the website, ...
In your code, you request a URL, grab the links and call the functions itself in the first iteration, so you only append requested urls.
You may want to work with "queues" to keep the processes more transparent.
One advantage is that if the script aborts, you have this information stored and can access it to start from the urls you already have collected to visit. Quite the opposite of your for loop, which may have to start at an earlier point to ensure it get all urls.
Another point is, you request the PDF files, but without using the response in any way. Wouldn't it make more sense to either download and save them directly or skip the request and at least keep the links in separate "queue" for post processing?
Collected information in comparison - Based on iterations
Code in question:
pages --> 24
Example code (without delay):
urlsVisited --> 24
urlsToVisit --> 87
urlsToDownload --> 67
Example
Just to demonstrate, feel free to create defs, classes and structure to your needs. Note added some delay, but you can skip it if you like. "Queues" to demonstrate the process are lists but should be files, database,... to store your data safely.
import requests, time
from bs4 import BeautifulSoup
baseUrl = 'https://www.srs-stahl.de'
urlsToDownload = []
urlsToVisit = ["https://www.srs-stahl.de/"]
urlsVisited = []
def crawl(url):
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
for a in soup.select('a[href^="/"]'):
url = f"{baseUrl}{a['href']}"
if '.pdf' in url and url not in urlsToDownload:
urlsToDownload.append(url)
else:
if url not in urlsToVisit and url not in urlsVisited:
urlsToVisit.append(url)
while urlsToVisit:
url = urlsToVisit.pop(0)
try:
crawl(url)
except Exception as e:
print(f'Failed to crawl: {url} -> error {e}')
finally:
urlsVisited.append(url)
time.sleep(2)

Web scraping lazy list (lazy loading) using python request (without selenium/scarpy)

I have written a simple script for myself as practice to find who had bought same tracks as me on bandcamp to ideally find accounts with similar tastes and so more same music on their accounts.
The problem is that fan list on a album/track page is lazy loading. Using python's requests and bs4 I am only getting 60 results out of potential 700.
I am trying to figure out how to send request i.e. here https://pitp.bandcamp.com/album/fragments-distancing to open more of the list. After finding what request is send when I click in finder, I used that json request to send it using requests, although without any result
res = requests.get(track_link)
open_more = {"tralbum_type":"a","tralbum_id":3542956135,"token":"1:1609185066:1714678:0:1:0","count":100}
for i in range(0,3):
requests.post(track_link, json=open_more)
Will appreciate any help!
i think that just typing a ridiculous number for count will do. i did some automation on your script too if you want to get data on other albums
from urllib.parse import urlsplit
import json
import requests
from bs4 import BeautifulSoup
# build the post link
get_link="https://pitp.bandcamp.com/album/fragments-distancing"
link=urlsplit(get_link)
base_link=f'{link.scheme}://{link.netloc}'
post_link=f"{base_link}/api/tralbumcollectors/2/thumbs"
with requests.session() as s:
res = s.get(get_link)
soup = BeautifulSoup(res.text, 'lxml')
# the data for tralbum_type and tralbum_id
# are stored in a script attribute
key="data-band-follow-info"
data=soup.select_one(f'script[{key}]')[key]
data=json.loads(data)
open_more = {
"tralbum_type":data["tralbum_type"],
"tralbum_id":data["tralbum_id"],
"count":1000}
r=s.post(post_link, json=open_more).json()
print(r['more_available']) # if not false put a bigger count

Multiple scraping flat-file for list of urls

I made a script for scraping pages of some shop looking for out of stock items. It looks like this:
import requests
from bs4 import BeautifulSoup
urls = ['https://www.someurla','https://www.someurlb']
for url in urls:
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')
if len(soup.find_all('li',class_='out-of-stock')) > 0:
print(soup.title)
Now, I would like to somehow make this list or URLs available for updating without intervention in this little script. So, I think about some detached file that would serve as a flat database. I think it would be more appropriate than some relational DB, because I don't need it really.
I would like to get some opinion from more experienced Python users is this appropriate approach, and if it is what is the best way to do this with text or with .py file. What libraries are good for this task? On the other hand are there better approaches?
Go with a simple JSON file. Something like this:
import os
import json
url_file = '<path>/urls.json'
urls = []
if os.path.isfile(url_file):
with open(url_file, 'rb') as f:
urls = json.load(f)['urls']
else:
print('No URLs found to load')
print(urls)
# hook in your script here...
JSON structure for this particular example:
{
"urls": [
"http://example.com",
"http://google.com"
]
}

Error while scraping image with beautifulsoup

The original code is here : https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So i am trying to adapt a Python script to collect pictures from a website to get better at web scraping.
I tried to get images from "https://500px.com/editors"
The first error was
The code that caused this warning is on line 12 of the file/Bureau/scrapper.py. To get rid of this warning, pass the additional argument
'features="lxml"' to the BeautifulSoup constructor.
So I did :
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag in 500px.
But now the script stopped running and nothing happened.
In the end it looks like this :
import requests
from bs4 import BeautifulSoup
import urllib.request
import random
url = "https://500px.com/editors"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")
for link in soup.find_all("a",{"class":"photo_link "}):
href = link.get('href')
print(href)
img_name = random.randrange(1,500)
full_name = str(img_name) + ".jpg"
urllib.request.urlretrieve(href, full_name)
print("loop break")
What did I do wrong?
Actually the website is loaded via JavaScript using XHR request to the following API
So you can reach it directly via API.
Note that you can increase parameter rpp=50 to any number as you want for getting more than 50 result.
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
print(item['url'])
also you can access the image url itself in order to write it directly!
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
print(item['image_url'][-1])
Note that image_url key hold different img size. so you can choose your preferred one and save it. here I've taken the big one.
Saving directly:
import requests
with requests.Session() as req:
r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
result = []
for item in r['photos']:
print(f"Downloading {item['name']}")
save = req.get(item['image_url'][-1])
name = save.headers.get("Content-Disposition")[9:]
with open(name, 'wb') as f:
f.write(save.content)
Looking at the page you're trying to scrape I noticed something. The data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page due to the fact that it does not run JS on the pages it's pulling. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are to either see what APIs they're calling to get this data (check the inspector in your web browser for network calls, if you're lucky they may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is selenium, though I've never used it so I'm not fully aware of its capabilities; I imagine the tooling around this would drastically increase the complexity of what you're trying to do.

Is time.sleep() enough to safely create a delay for a simple webscraper?

I'm using a webscraping code, without a headless browser, in order to scrape about 500 inputs from Transfer Mrkt for a personal project.
According to best practices, I need to randomize the web scraping pattern I have, along with using a delay and dealing with errors/loading delays, in order to successfully scrape Transfer Markt without getting raising any flags.
I understand how Selenium and Chromedriver can help with all of these in order to scrape more safely, but I've used requests and BeautifulSoup to create a much simpler webscraper:
import requests, re, ast
from bs4 import BeautifulSoup import pandas as pd
i = 1
url_list = []
while True:
page = requests.get('https://www.transfermarkt.us/spieler-statistik/wertvollstespieler/marktwertetop?page=' + str(i), headers = {'User-Agent':'Mozilla/5.0'}).text
parsed_page = BeautifulSoup(page,'lxml')
all_links = []
for link in parsed_page.find_all('a', href=True): link = str(link['href']) all_links.append(link)
r = re.compile('.*profil/spieler.*')
player_links = list(filter(r.match, all_links))
for plink in range(0,25):
url_list.append('https://www.transfermarkt.us' + player_links[plink])
i += 1
if i > 20: break
final_url_list = []
for i in url_list:
int_page = requests.get(i, headers = {'User-Agent':'Mozilla/5.0'}).text
parsed_int_page = BeautifulSoup(int_page,'lxml')
graph_container = parsed_int_page.find('div', class_='large-7 columns small-12 marktwertentwicklung-graph')
graph_a = graph_container.find('a')
graph_link = graph_a.get('href')
final_url_list.append('https://www.transfermarkt.us' + graph_link)
for url in final_url_list:
r = requests.get('https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290', headers = {'User-Agent':'Mozilla/5.0'})
p = re.compile(r"'data':(.*)}\],")
s = p.findall(r.text)[0]
s = s.encode().decode('unicode_escape')
data = ast.literal_eval(s)
#rest of the code to write scraped info below this
I was wondering if this is generally considered a safe enough way to scrape a website like Transfer Mrkt if I add the time.sleep() method from the time library, as detailed here, in order to create a delay - long enough to allow the page to load, like 10 seconds - to scrape the 500 inputs successfully without raising any flags.
I would also forego using randomized clicks (which I think can only be done with selenium/chromedriver) to mimic human behavior, and was wondering if that too would be ok to exclude in order to scrape safely.

Categories