I am trying to scrape data from a few thousand pages. The code I have works fine for about 100 pages, but then slows down dramatically. I am pretty sure that my Tarzan-like code could be improved so that the speed of the webscraping process increases. Any help would be appreciated. TIA!
Here is the simplified code:
import csv
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

csvfile = open('test.csv', 'w', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)

list_url = ["http://www.randomsite.com"]
i = 1

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, "lxml")

        #### scrape the page for the desired info

        i = i + 1
        n = str(i)

        # Zip the data
        output_data = zip(variable_1, variable_2, variable_3, ..., variable_10)

        # Write the observations to the CSV file
        writer = csv.writer(open('test.csv', 'a', newline='', encoding='cp850', errors='replace'))
        writer.writerows(output_data)
        csvfile.flush()

        # Build the URL of the next page
        base = "http://www.randomsite.com/page"
        base2 = base + n
        url_part2 = "/otherstuff"
        url_test = base2 + url_part2

        try:
            if url_test != None:
                url = url_test
                print(url)
            else:
                break
        except:
            break

csvfile.close()
EDIT: Thanks for all the answers, I learned quite a lot from them. I am (slowly!) learning my way around Scrapy. However, I found that the pages are available via bulk download, which will be an even better way to solve the performance issue.
The main bottleneck is that your code is synchronous (blocking). You don't proceed to the next URL until you finish processing the current one.
You need to make things asynchronous, either by switching to Scrapy, which solves this problem out of the box, or by building something yourself with, for example, grequests.
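For a rough idea of the grequests route, here is a minimal sketch; the URL list and the response handling are illustrative assumptions, not your original code:

import grequests

# Hypothetical page URLs following the question's numbering scheme
urls = ["http://www.randomsite.com/page%d/otherstuff" % i for i in range(1, 101)]

# Build unsent requests lazily, then let grequests fire them concurrently;
# `size` caps how many requests are in flight at once.
pending = (grequests.get(u) for u in urls)
for response in grequests.map(pending, size=10):
    if response is not None and response.ok:
        # parse response.text here, e.g. with BeautifulSoup
        print(response.url, len(response.text))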
If you want to go really fast without a lot of complex code, you need to A) separate the requests from the parsing (because the parsing blocks the thread you'd otherwise use to make the next request), and B) make requests concurrently and parse them concurrently. So, I'd do a couple of things:
Request all pages asynchronously using eventlet. I've struggled with async HTTP frameworks in Python and find eventlet the easiest to learn. (A sketch of the whole pipeline follows this list.)
Every time you successfully fetch a page, store the HTML somewhere. A) You could write it to individual HTML files locally, but you'll have a lot of files on your hands. B) You could probably store this many records as strings (str(source_code)) and put them in a native data store, so long as it's hashed (probably a set or dict). C) You could use a super lightweight but not particularly performant database like TinyDB and stick the page source in JSON files. D) You could use a third-party library's data structures for high-performance computing, like a Pandas DataFrame or a NumPy array. They can easily store this much data but may be overkill.
Parse each document separately after it's been retrieved. Parsing with lxml will be extremely fast, so depending on how fast you need to go you may be able to get away with parsing the files sequentially. If you want to go faster, look up a tutorial on multiprocessing in Python. It's pretty darn easy to learn, and you'd be able to concurrently parse X documents, where X is the number of available cores on your machine.
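A rough sketch of that pipeline: the answer suggests eventlet for the fetching, but as a simpler stand-in this sketch uses concurrent.futures threads for the concurrent requests and the standard multiprocessing module for the parsing; the URL list and the parse function are made up for illustration:

import multiprocessing
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup

# Hypothetical URL list following the question's numbering scheme
URLS = ["http://www.randomsite.com/page%d/otherstuff" % i for i in range(1, 101)]

def fetch(url):
    # Network-bound: threads overlap the waiting on I/O
    return url, urllib.request.urlopen(url).read()

def parse(item):
    url, html = item
    soup = BeautifulSoup(html, "lxml")   # lxml keeps the parsing fast
    return url, soup.title.string if soup.title else None

if __name__ == "__main__":
    # Step 1: fetch concurrently and keep the raw HTML in memory
    with ThreadPoolExecutor(max_workers=20) as executor:
        pages = list(executor.map(fetch, URLS))

    # Step 2: parse concurrently, one worker process per CPU core
    with multiprocessing.Pool() as workers:
        for url, title in workers.map(parse, pages):
            print(url, title)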
Perhaps this is just a bug in the simplification, but it looks like you are opening 'test.csv' multiple times while closing it only once. Not sure if that's the cause of the unexpected slowdown once the number of URLs grows above 100, but if you want all data to go into one CSV file, you should stick to opening the file and the writer once at the top, like you're already doing, and not do it again inside the loop.
Also, in the logic of constructing the new URL: isn't url_test != None always true? Then how are you exiting the loop? On the exception when urlopen fails? Then you should have the try-except around that. This is not a performance issue, but any kind of clarity helps.
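For illustration, a minimal restructuring along those lines; the page-numbering scheme is carried over from the question, and the stopping condition is an assumption:

import csv
import urllib.error
import urllib.request
from bs4 import BeautifulSoup

with open('test.csv', 'w', newline='', encoding='cp850', errors='replace') as csvfile:
    writer = csv.writer(csvfile)          # open the file and the writer once
    page = 1
    while True:
        url = "http://www.randomsite.com/page%d/otherstuff" % page
        try:
            raw_html = urllib.request.urlopen(url).read()
        except urllib.error.URLError:
            break                          # stop when the next page cannot be fetched
        soup = BeautifulSoup(raw_html, "lxml")
        # ... scrape the page and build output_data here ...
        # writer.writerows(output_data)
        page += 1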
Related
I am currently working on a data scraping project which requires me to load and save my data in every loop. You might wonder why I would do that? Well, before this I scraped without loading and saving my data between every loop, and I lost all my data if the script crashed anywhere before the last iteration (which happened every time, due to a timeout, a weird URL, or anything else you can imagine).
All in all, the method applied right now works fine, but after some 20k iterations my storage files grow to ~90 MB, and the script becomes slower and slower, forcing me to create a new data-saving file. The code below shows the basic functioning of my script.
import numpy as np

# List with URLs to scrape (220k URLs)
URLS = np.load("/some_directory/URLS.npy").tolist()

# Already checked URLs
Checked_URLS = np.load("/some_directory/checked_URLS.npy").tolist()

# Perform scraping and add each new URL to the checked list
for i in URLS:
    if i not in Checked_URLS:
        Checked_URLS.append(i)
        np.save("/some_directory/checked_URLS.npy", Checked_URLS)

        NEW_DATA = Bunch_of_scraping_code

        # Load the numpy list with data collected from previous URLs,
        # append the new scraping data and save the list again
        FOUND_DATA = np.load("/some_directory/FOUND_DATA.npy", allow_pickle=True).tolist()
        FOUND_DATA.append(NEW_DATA)
        np.save("/some_directory/FOUND_DATA.npy", FOUND_DATA)
I am sure there must be a more Pythonic way that does not require loading the entire lists into Python on every loop. Or perhaps another way that does not require writing at all? I write my lists to .npy because, as far as I am aware, this is the most efficient way to handle large list files.
I have tried to save my data directly into pandas, but this made everything much worse and slower. All help will be appreciated!
I think that a much more efficient approach would be to wrap the code that might crash with try/except and log the problematic URLs for further investigation, instead of repeating the same operations over and over again:
for url in URLS:
    try:
        NEW_DATA = Bunch_of_scraping_code   # the part that may crash
        FOUND_DATA = np.load("/some_directory/FOUND_DATA.npy", allow_pickle=True).tolist()
        FOUND_DATA.append(NEW_DATA)
        np.save("/some_directory/FOUND_DATA.npy", FOUND_DATA)
    except Exception as ex:
        # if the code crashes for any reason, print a message and continue with the next URL
        print(f"Failed to process url: {url}: {ex}")
I am writing a plugin for a piece of software I am using. The goal is to have a button that will automate downloading data from a government site.
I am still a bit new to Python, but I have managed to get it working exactly as I want. However, I want to avoid a case where my plugin makes hundreds of download requests at once, as that could impact the website's performance. The function below is what I use to download the files.
How can I make sure that what I am doing will not request thousands of files within a few seconds, thus overloading the website? Is there a way to make the function below wait for one download to complete before starting another?
import requests
from os.path import join

def downloadFiles(fileList, outDir):
    # Download list of files, one by one
    for url in fileList:
        file = requests.get(url)
        fileName = url.split('/')[-1]
        open(join(outDir, fileName), 'wb').write(file.content)
This code is already sequential and it will wait for a download to finish before starting a new one. It's funny, usually people ask how to parallelize stuff.
If you want to slow it down further, you can add a time.sleep() call between downloads.
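A minimal sketch of the same function with a pause between requests; the one-second delay is an arbitrary choice:

import time
import requests
from os.path import join

def downloadFiles(fileList, outDir, delay=1.0):
    # Download the files one by one, pausing between requests so the
    # server is never hit with a burst of downloads
    for url in fileList:
        response = requests.get(url)
        fileName = url.split('/')[-1]
        with open(join(outDir, fileName), 'wb') as f:
            f.write(response.content)
        time.sleep(delay)   # wait `delay` seconds before the next request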
If you want to be fancier, you can use something like a dedicated rate-limiting library.
I have a list of about 100,000 URLs saved on my computer (that 100,000 can very quickly multiply into several million). For every URL, I check that webpage and collect all additional URLs on the page, but only if each additional link is not already in my large list. The issue is reloading that huge list into memory on every iteration so I can consistently have an accurate list: the amount of memory used will probably very soon become far too much, and, even more importantly, the time it takes between reloads keeps growing, which is severely holding up the progress of the project.
My list is saved in several different formats. One format has all links contained in a single text file, where I use open(filetext).readlines() to turn it straight into a list. The other format, which seems more helpful, is a folder tree with all the links, which I turn into a list using os.walk(path).
I'm really unsure of any other way to do this recurring conditional check more efficiently, without the ridiculous use of memory and loading time. I tried using a queue as well, but it was such a benefit to be able to see the text output of these links that queueing became unnecessarily complicated. Where else can I even start?
The main issue is not loading the list into memory; that should be done only once at the beginning, before scraping the webpages. The issue is finding whether an element is already in the list: the in operation is too slow for a large list.
You should look into several things, among which sets and pandas. The first one will probably be the optimal solution.
Now, since you thought of using a folder tree with the URLs as folder names, I can think of one way which could be faster. Instead of creating the list with os.walk(path), check whether the folder is already present. If it is not, it means you did not have that URL yet. This is basically a fake graph database. To do so, you could use the function os.path.isdir(). If you want a true graph DB, you could look into OrientDB for instance.
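A tiny sketch of that check; url_to_dirname is a hypothetical helper standing in for whatever scheme you already use to turn a URL into a folder name:

import os

def url_to_dirname(url):
    # Hypothetical mapping from a URL to a file-system-safe folder name
    return url.replace("://", "_").replace("/", "_")

def is_known(url, root="/some_directory/url_tree"):
    path = os.path.join(root, url_to_dirname(url))
    if os.path.isdir(path):
        return True          # seen before
    os.makedirs(path)        # mark the URL as seen
    return False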
Have you considered mapping a table of IP addresses to URLs? Granted, this would only work if you are seeking unique domains rather than thousands of pages on the same domain. The advantage is that you would be dealing with a compact numeric address (at most 12 digits for IPv4). The downside is the need for additional tabulated data structures and additional processing to map the data.
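If that route is of interest, a rough sketch of resolving each domain once and remembering the result; as noted above, this only makes sense when you care about unique domains:

import socket
from urllib.parse import urlparse

seen_ips = set()

def domain_is_new(url):
    # Resolve the domain to its IPv4 address and remember it;
    # repeated domains (and domains sharing an IP) are skipped
    ip = socket.gethostbyname(urlparse(url).hostname)
    if ip in seen_ips:
        return False
    seen_ips.add(ip)
    return True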
Memory shouldn't be an issue. If each URL takes 1 KiB to store in memory (very generous), 1 million URLs will take 1 GiB. You say you have 8 GiB of RAM.
You can keep the known URLs in memory in a set and check for containment using in or not in. Python's set uses hashing, so checking containment is O(1) on average, which is in general a lot quicker than the linear search of a list.
You can recursively scrape the pages:
def main():
    with open('urls.txt') as urls:
        known_urls = set(line.strip() for line in urls)   # strip the trailing newlines
    for url in list(known_urls):
        scrape(url, known_urls)

def scrape(url, known_urls):
    new_urls = _parse_page_urls(url)
    for new_url in new_urls:
        if new_url not in known_urls:
            known_urls.add(new_url)
            scrape(new_url, known_urls)
I have been trying to write a large amount (>800 MB) of data to a JSON file; after a fair amount of trial and error I ended up with this code:
import json

def write_to_cube(data):
    # Read the whole existing file, merge in the new data, then rewrite everything
    with open('test.json') as file1:
        temp_data = json.load(file1)
    temp_data.update(data)
    with open('test.json', 'w') as f:
        json.dump(temp_data, f)
To run it, just call the function: write_to_cube({"some_data": data})
Now, the problem with this code is that it's fast for small amounts of data, but the trouble starts once test.json grows past 800 MB: updating or adding data takes ages.
I know there are external libraries such as simplejson or jsonpickle, but I am not quite sure how to use them.
Is there any other way to approach this problem?
Update:
I am not sure how this can be a duplicate; the other questions say nothing about writing or updating a large JSON file, only about parsing:
Is there a memory efficient and fast way to load big json files in python?
Reading rather large json files in Python
Neither of the above resolves this question. They don't say anything about writing or updating.
So the problem is that you have a long-running operation. Here are a few approaches that I usually take:
Optimize the operation: this rarely works. I wouldn't recommend any superb library that parses the JSON a few seconds faster.
Change your logic: if the purpose is to save and load data, you probably want to try something else, like storing your objects in a binary or incremental store instead (see the sketch after this list).
Threads and callbacks, or deferred objects in some web frameworks: in web applications the operation sometimes takes longer than a request can wait, so we do the operation in the background (some cases are: zipping files and then sending the zip to the user's email, or sending SMS by calling another third party's API).
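As one concrete illustration of the second point, here is a sketch using the standard library's shelve module, which keeps a dict-like object on disk so an update only writes the keys you change instead of rewriting an 800 MB file; the store name is arbitrary:

import shelve

def write_to_cube(data):
    # Open (or create) a persistent dict-like store and merge the new data in;
    # only the changed entries are written, the rest of the store is untouched
    with shelve.open('test_store') as db:
        db.update(data)

# Usage, mirroring the original call:
# write_to_cube({"some_data": data})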
I need to loop through a very large text file, several gigabytes in size (a zone file to be exact). I need to run a few queries for each entry in the zone file, and then store the results in a searchable database.
My weapons of choice at the moment, mainly because I know them, are Python and MySQL. I'm not sure how well either will deal with files of this size, however.
Does anyone with experience in this area have any suggestions on the best way to open and loop through the file without overloading my system? How about the most efficient way to process the file once I can open it (threading?) and store the processed data?
You shouldn't have any real trouble storing that amount of data in MySQL, although you will probably not be able to store the entire database in memory, so expect some IO performance issues. As always, make sure you have the appropriate indices before running your queries.
The most important thing is to not try to load the entire file into memory. Loop through the file, don't try to use a method like readlines which will load the whole file at once.
Make sure to batch the requests. Load up a few thousand lines at a time and send them all in one big SQL request.
This approach should work:
def push_batch(batch):
    # Send a big INSERT request to MySQL with all the rows in `batch`
    pass

current_batch = []
with open('filename') as f:
    for line in f:
        current_batch.append(line)
        if len(current_batch) > 1000:
            push_batch(current_batch)
            current_batch = []
    push_batch(current_batch)   # don't forget the final, partially filled batch
Zone files are pretty regularly formatted; consider whether you can get away with just using LOAD DATA INFILE. You might also consider creating a named pipe, pushing partially formatted data into it from Python, and using LOAD DATA INFILE to read it into MySQL.
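For reference, a minimal sketch of issuing LOAD DATA INFILE through a DB-API cursor; the table and column layout are assumptions, and local_infile must be enabled on both the client and the server:

def bulk_load(cursor, path):
    # Assumed table: CREATE TABLE zone_records (name VARCHAR(255), rtype VARCHAR(16), value TEXT)
    # `path` must be trusted, since it is interpolated directly into the statement
    sql = (
        "LOAD DATA LOCAL INFILE '%s' "
        "INTO TABLE zone_records "
        "FIELDS TERMINATED BY '\\t' "
        "LINES TERMINATED BY '\\n'" % path
    )
    cursor.execute(sql)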
MySQL has some great tips on optimizing inserts, some highlights:
Use multiple value lists in each INSERT statement (see the sketch after this list).
Use INSERT DELAYED, particularly if you are pushing from multiple clients at once (e.g. using threading).
Lock your tables before inserting.
Tweak the key_buffer_size and bulk_insert_buffer_size.
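A sketch of the multiple-value-list idea with a DB-API cursor; the table and columns are assumptions, and each item in batch is taken to be a (name, rtype, value) tuple:

def push_batch(cursor, batch):
    # One INSERT with many value lists is far cheaper than one INSERT per row
    if not batch:
        return
    placeholders = ", ".join(["(%s, %s, %s)"] * len(batch))
    flat_values = [field for row in batch for field in row]
    cursor.execute(
        "INSERT INTO zone_records (name, rtype, value) VALUES " + placeholders,
        flat_values
    )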
The fastest processing will be done in MySQL, so consider whether you can get away with doing the queries you need after the data is in the db, not before. If you do need to do operations in Python, threading is not going to help you: only one thread of Python code can execute at a time (the GIL), so unless you're doing something which spends a considerable amount of time in C, or interfaces with external resources, you're only ever going to be running in one thread anyway.
The most important optimization question is what is bounding the speed: there's no point spinning up a bunch of threads to read the file if the database is the bounding factor. The only way to really know is to try it and make tweaks until it is fast enough for your purpose.
Zack Bloom's answer is excellent and I upvoted it. Just a couple of thoughts:
As he showed, just using with open(filename) as f: / for line in f is all you need to do. open() returns a file object that is itself an iterator, giving you one line at a time from the file.
If you want to slurp every line into your database, do it in the loop. If you only want lines that match a certain regular expression, that's easy:
import re

pat = re.compile(some_pattern)

with open(filename) as f:
    for line in f:
        if not pat.search(line):
            continue
        # do the work to insert the line here
With a file that is multiple gigabytes, you are likely to be I/O bound. So there is likely no reason to worry about multithreading or whatever. Even running several regular expressions is likely to crunch through the data faster than the file can be read or the database updated.
Personally, I'm not much of a database guy and I like using an ORM. The last project I did database work on, I was using Autumn with SQLite. I found that the default for the ORM was to do one commit per insert, and it took forever to insert a bunch of records, so I extended Autumn to let you explicitly bracket a bunch of inserts with a single commit; it was much faster that way. (Hmm, I should extend Autumn to work with a Python with statement, so that you could wrap a bunch of inserts into a with block and Autumn would automatically commit.)
http://autumn-orm.org/
Anyway, my point was just that with database stuff, doing things the wrong way can be very slow. If you are finding that the database inserting is your bottleneck, there might be something you can do to fix it, and Zack Bloom's answer contains several ideas to start you out.