I am writing a plugin for a piece of software I am using. The goal is to have a button that automates downloading data from a government site.
I am still fairly new to Python, but I have managed to get it working exactly the way I want. However, I want to avoid a case where my plugin makes hundreds of download requests at once, as that could impact the website's performance. The function below is what I use to download the files.
How can I make sure that my code will not request thousands of files within a few seconds and overload the website? Is there a way to make the function below wait for one download to complete before starting another?
import requests
from os.path import join

def downloadFiles(fileList, outDir):
    # Download list of files, one by one
    for url in fileList:
        file = requests.get(url)
        fileName = url.split('/')[-1]
        open(join(outDir, fileName), 'wb').write(file.content)
This code is already sequential and it will wait for a download to finish before starting a new one. It's funny, usually people ask how to parallelize stuff.
If you want to slow it down further, you can add a time.sleep() call inside the loop.
If you want to be fancier, you can use something like the sketch below.
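This is only a minimal sketch of the sleep-based version; the one-second default delay is an arbitrary choice, so tune it to what the site can tolerate:

import time
import requests
from os.path import join

def downloadFiles(fileList, outDir, delay=1.0):
    # Download the files one by one, pausing between requests
    for url in fileList:
        response = requests.get(url)
        fileName = url.split('/')[-1]
        with open(join(outDir, fileName), 'wb') as f:
            f.write(response.content)
        time.sleep(delay)  # wait before issuing the next request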
Related
The website "Download the GLEIF Golden Copy and Delta Files" has buttons that download data I want to retrieve automatically with a Python script. Usually when I want to download a file I use mget or similar, but that will not work here (at least I don't think it will).
For some reason I cannot fathom, the producers of the data seem to want to force you to download the files manually. I really need to automate this to reduce the number of steps for my users (and, frankly, for me), since there are a great many files besides these and I want to automate as many of them as possible (ideally all).
So my question is this - is there some kind of python package for doing this sort of thing? If not a python package, is there perhaps some other tool that is useful for it? I have to believe this is a common annoyance.
Yup, you can use BeautifulSoup to scrape the URLs then download them with requests.
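A rough sketch of that approach; the page URL and the file-extension filter are assumptions, and if the page builds its links with JavaScript you would need the site's API or a browser-automation tool instead:

import requests
from bs4 import BeautifulSoup

# Assumed page URL; adjust to the actual download page
PAGE_URL = "https://www.gleif.org/en/lei-data/gleif-golden-copy/download-the-golden-copy"

html = requests.get(PAGE_URL).text
soup = BeautifulSoup(html, "html.parser")

# Collect every link that looks like a data file, then fetch each one in turn
for a in soup.find_all("a", href=True):
    href = a["href"]
    if href.endswith((".zip", ".csv", ".xml")):
        fileName = href.split("/")[-1]
        with open(fileName, "wb") as f:
            f.write(requests.get(href).content)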
We have a Python script that automates batch processing of time-series image data downloaded from the internet. The current script requires all the data to be downloaded before execution, which wastes time. We want to modify it by writing a scheduler that calls the script whenever a single file has finished downloading. How can I tell, using Python, that a file has been downloaded completely?
If you download the file with Python, then you can just do the image processing operation after the file download operation finishes. An example using requests:
import requests
import mymodule  # The module containing your custom image-processing function

for img in ("foo.png", "bar.png", "baz.png"):
    response = requests.get("http://www.example.com/" + img)
    image_bytes = response.content
    mymodule.process_image(image_bytes)
However, with the sequential approach above you will spend a lot of time waiting for responses from the remote server. To make this faster, you can download and process multiple files at once using asyncio and aiohttp. There's a good introduction to downloading files this way in Paweł Miech's blog post Making 1 million requests with python-aiohttp. The code you need will look something like the example at the bottom of that blog post (the one with the semaphore).
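Roughly along these lines; this is a minimal sketch rather than the blog post's exact code, and the URL list, concurrency limit, and the print call standing in for your processing step are all placeholders:

import asyncio
import aiohttp

URLS = ["http://www.example.com/foo.png", "http://www.example.com/bar.png"]
LIMIT = 10  # maximum number of requests in flight at once

async def fetch(sem, session, url):
    async with sem:  # the semaphore caps how many downloads run concurrently
        async with session.get(url) as resp:
            return url, await resp.read()

async def main():
    sem = asyncio.Semaphore(LIMIT)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(sem, session, url) for url in URLS]
        for url, data in await asyncio.gather(*tasks):
            print(url, len(data))  # replace with your image-processing call

asyncio.run(main())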
I recently started coding, but then took a brief break. I have started a new job and I am under some confidentiality restrictions. I need to make sure Python and pandas are secure before I do this; I will also be talking with IT on Monday.
I was wondering whether pandas is a purely local library, or whether the data gets sent to or from somewhere else. If I write something with pandas, will the data be stored somewhere by pandas?
The best example of what I am doing is a Medium article about pulling data out of tables that don't have CSV exports:
https://medium.com/@ageitgey/quick-tip-the-easiest-way-to-grab-data-out-of-a-web-page-in-python-7153cecfca58
Creating a DataFrame out of a dict, doing vectorized operations on its rows, printing out slices of it, etc. are all completely local. I'm not sure why this matters. Is your IT department going to say, "Well, this looks fishy—but some random guy on the internet says it's safe, so forget our policies, we'll allow it"? But, for what it's worth, you have this random guy on the internet saying it's safe.
However, pandas can be used to make network requests. Some of the IO functions can take a URL instead of a filename or file object. Some of them can also use another library that does so; e.g., if you have lxml installed, read_html will pass the filename to lxml to open, and if that filename is an HTTP URL, lxml will go fetch it.
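For instance, the only difference between a purely local read and one that touches the network is the argument you pass; the file names and URLs here are just placeholders:

import pandas as pd

# Purely local: nothing leaves your machine.
df_local = pd.read_csv("data/local_file.csv")

# This one makes an HTTP request, because the "path" is a URL.
df_remote = pd.read_csv("https://example.com/some_table.csv")

# read_html can also fetch over the network (handing the URL to lxml if it is installed).
tables = pd.read_html("https://example.com/page_with_tables.html")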
This is rarely a concern, but if you want to get paranoid, you could imagine ways in which it might be.
For example, let's say your program is parsing user-supplied CSV files and doing some data processing on them. That's safe; there's no network access at all.
Now you add a way for the user to specify CSV files by URL, and you pass them into read_csv and go fetch them. Still safe; there is network access, but it's transparent to the end user and obviously needed for the user's task; if this weren't appropriate, your company wouldn't have asked you to add this feature.
Now you add a way for CSV files to reference other CSV files: if column 1 is #path/to/other/file, you recursively read and parse path/to/other/file and embed it in place of the current row. Now, what happens if I can give one of your users a CSV file where, buried at line 69105, there's #http://example.com/evilendpoint?track=me (an endpoint which does something evil, but then returns something that looks like a perfectly valid thing to insert at line 69105 of that CSV)? Now you may be facilitating my hacking of your employees, without even realizing it.
Of course this is a more limited version of exactly the same functionality that's in every web browser with HTML pages. But maybe your IT department has gotten paranoid, clamped down security on browsers, and written an application-level sniffer to detect suspicious follow-up requests from HTML, and hasn't thought to do the same thing for references in CSV files.
I don't think that's a problem a sane IT department should worry about. If your company doesn't trust you to think about these issues, they shouldn't hire you and assign you to write software that involves scraping the web. But then not every IT department is sane about what they do and don't get paranoid about. ("Sure, we can forward this under-1024 port to your laptop for you… but you'd better not install a newer version of Firefox than 16.0…")
I am trying to scrape data from a few thousand pages. The code I have works fine for about 100 pages, but then slows down dramatically. I am pretty sure that my Tarzan-like code could be improved so that the speed of the web-scraping process increases. Any help would be appreciated. TIA!
Here is the simplified code:
import csv
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

csvfile = open('test.csv', 'w', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)
list_url = ["http://www.randomsite.com"]
i = 1

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, "lxml")

        #### scrape the page for the desired info

        i = i + 1
        n = str(i)

        # Zip the data
        output_data = zip(variable_1, variable_2, variable_3, ..., variable_10)

        # Write the observations to the CSV file
        writer = csv.writer(open('test.csv', 'a', newline='', encoding='cp850', errors='replace'))
        writer.writerows(output_data)
        csvfile.flush()

        base = "http://www.randomsite.com/page"
        base2 = base + n
        url_part2 = "/otherstuff"
        url_test = base2 + url_part2

        try:
            if url_test != None:
                url = url_test
                print(url)
            else:
                break
        except:
            break

csvfile.close()
EDIT: Thanks for all the answers, I learned quite a lot from them. I am (slowly!) learning my way around Scrapy. However, I found that the pages are available via bulk download, which will be an even better way to solve the performance issue.
The main bottleneck is that your code is synchronous (blocking). You don't proceed to the next URL until you finish processing the current one.
You need to make things asynchronous, either by switching to Scrapy, which solves this problem out of the box, or by building something yourself with, for example, grequests.
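A rough grequests sketch; the URL pattern and the concurrency size are assumptions:

import grequests
from bs4 import BeautifulSoup

urls = ["http://www.randomsite.com/page%d/otherstuff" % i for i in range(1, 101)]

pending = (grequests.get(u) for u in urls)
responses = grequests.map(pending, size=10)  # at most 10 requests in flight at once

for resp in responses:
    if resp is not None and resp.ok:
        soup = BeautifulSoup(resp.text, "lxml")
        # scrape the page for the desired info here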
If you want to go really fast without a lot of complex code, you need to A) Segregate the requests from the parsing (because the parsing is blocking the thread you'd otherwise use to make the request), and B) Make requests concurrently and parse them concurrently. So, I'd do a couple things:
Request all pages asynchronously using eventlet. I've struggled with async HTTP frameworks in Python and find eventlet the easiest to learn. (A minimal sketch follows these steps.)
Every time you successfully fetch a page, store the HTML somewhere. A) You could write it to individual HTML files locally, but you'll have a lot of files on your hands. B) You could probably store this many records as strings (str(source_code)) in a built-in data structure, so long as it's hashed (probably a dict keyed by URL, or a set). C) You could use a super-lightweight but not particularly performant database like TinyDB and stick the page source in JSON files. D) You could use a third-party library's data structures built for high-performance computing, like a pandas DataFrame or a NumPy array. They can easily store this much data, but may be overkill.
Parse each document separately after it's been retrieved. Parsing with lxml will be extremely fast, so depending on how fast you need to go you may be able to get away with parsing the files sequentially. If you want to go faster, look up a tutorial on multiprocessing in Python. It's pretty darn easy to learn, and you'd be able to concurrently parse X documents, where X is the number of available cores on your machine.
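Here's a minimal sketch of step 1 (the asynchronous fetch with eventlet); the URL pattern and the pool size are assumptions:

import eventlet
eventlet.monkey_patch()  # patch blocking I/O so green threads can overlap requests

import urllib.request

urls = ["http://www.randomsite.com/page%d/otherstuff" % i for i in range(1, 101)]

def fetch(url):
    # Return the URL together with the raw page source
    return url, urllib.request.urlopen(url).read()

pool = eventlet.GreenPool(20)  # at most 20 requests in flight at once
pages = {}                     # url -> raw HTML, to be parsed later in a separate step
for url, body in pool.imap(fetch, urls):
    pages[url] = body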
Perhaps this is just an artifact of the simplification, but it looks like you are opening 'test.csv' multiple times while closing it only once. I'm not sure that's the cause of the unexpected slowdown once the number of URLs grows above 100, but if you want all the data to go into one CSV file, you should stick to opening the file and the writer once at the top, like you're already doing, and not do it again inside the loop.
Also, in the logic that constructs the new URL: isn't url_test != None always true? Then how are you exiting the loop? On the exception when urlopen fails? In that case the try/except should be around that call. This is not a performance issue, but any kind of clarity helps.
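A sketch of the suggested structure, with the file opened once and the try/except around the call that can actually fail; the URL pattern is taken from the question and the stopping condition is an assumption:

import csv
import urllib.error
import urllib.request
from bs4 import BeautifulSoup

with open('test.csv', 'w', newline='', encoding='cp850', errors='replace') as csvfile:
    writer = csv.writer(csvfile)
    i = 1
    while True:
        url = "http://www.randomsite.com/page%d/otherstuff" % i
        try:
            raw_html = urllib.request.urlopen(url).read()
        except urllib.error.URLError:
            break  # stop once a page no longer exists
        soup = BeautifulSoup(raw_html, "lxml")
        # scrape the page and call writer.writerows(...) here
        i += 1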
I'm trying to download a spreadsheet from Google Drive inside a program I'm writing (so the data can be easily updated across all users), but I've run into a few problems:
First, and perhaps foolishly, I want to use only the basic Python distribution, so I'm not requiring people to download multiple modules to run it. The urllib.request module seems to work well enough for basic downloading, specifically the urlopen() function, when I've tested it on normal webpages (more on why I say "normal" below).
Second, most questions and answers on here deal with retrieving a .csv from the spreadsheet. While this might work even better than trying to parse the feeds (and I have actually gotten it to work), using only the basic address means only the first sheet is downloaded, and I need to add a non-obvious gid to get the others. I want to have the program independent of the spreadsheet, so I only have to add new data online and the clients are automatically updated; trying to find a gid programmatically gives me trouble because:
Third, I can't actually get the feeds (interface described here) to be downloaded correctly. That does seem to be the best way to get what I want—download the overview of the entire spreadsheet, and from there obtain the addresses to each sheet—but if I try to send that through urlopen(feed).read() it just returns b''. While I'm not exactly sure what the problem is, I'd guess that the webpage is empty very briefly when it's first loaded, and that's what urlopen() thinks it should be returning. I've included what little code I'm using below, and was hoping someone had a way of working around this. Thanks!
import urllib.request as url
key = '1Eamsi8_3T_a0OfL926OdtJwLoWFrGjl1S2GiUAn75lU'
gid = '1193707515'
# Single sheet in CSV format
# feed = 'https://docs.google.com/spreadsheets/d/' + key + '/export?format=csv&gid=' + gid
# Document feed
feed = 'https://spreadsheets.google.com/feeds/worksheets/' + key + '/private/full'
csv = url.urlopen(feed).read()
(I don't actually mind publishing the key/gid, because I am planning on releasing this if I ever finish it.)
Requires OAuth2 or a password.
If you log out of Google and try again in your browser, it fails (it failed when I logged out), so it looks like it requires a Google account.
I did have it working with an application password a while ago, but I now use OAuth2. Both are quite a bit of messing about compared to CSV.
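If the spreadsheet can be shared or published so that no login is needed, the CSV-export route from the commented-out line in the question works with only the standard library; whether your sheet is accessible without authentication is the assumption here:

import urllib.request

key = '1Eamsi8_3T_a0OfL926OdtJwLoWFrGjl1S2GiUAn75lU'
gid = '1193707515'
export_url = ('https://docs.google.com/spreadsheets/d/' + key +
              '/export?format=csv&gid=' + gid)

csv_bytes = urllib.request.urlopen(export_url).read()
print(csv_bytes.decode('utf-8').splitlines()[:5])  # first few rows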
This sounds like a perfect use case for a wrapper library I once wrote. Let me know if you find it useful.