How to find if a file has been downloaded completely using Python?

We have a Python script that automates batch processing of time-series image data downloaded from the internet. The current script requires all of the data to be downloaded before it runs, which wastes time. We want to modify it by writing a scheduler that calls the script whenever a single file has finished downloading. How can we tell that a file has been downloaded completely using Python?

If you download the files with Python, then you can simply run the image-processing step as soon as each download finishes. An example using requests:
import requests

import mymodule  # The module containing your custom image-processing function

for img in ("foo.png", "bar.png", "baz.png"):
    response = requests.get("http://www.example.com/" + img)
    image_bytes = response.content
    mymodule.process_image(image_bytes)
However, with the sequential approach above you will spend a lot of time waiting for responses from the remote server. To make this faster, you can download and process multiple files at once using asyncio and aiohttp. There's a good introduction to downloading files this way in Paweł Miech's blog post Making 1 million requests with python-aiohttp. The code you need will look something like the example at the bottom of that blog post (the one with the semaphore).
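Roughly, that pattern applied to this use case looks something like the sketch below (untested; the URL list and mymodule.process_image are the same placeholders as above, and the semaphore limit of 10 concurrent downloads is an arbitrary choice):

import asyncio

import aiohttp

import mymodule  # The module containing your custom image-processing function

URLS = ["http://www.example.com/" + img for img in ("foo.png", "bar.png", "baz.png")]

async def fetch_and_process(session, semaphore, url):
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        async with session.get(url) as response:
            image_bytes = await response.read()
    mymodule.process_image(image_bytes)

async def main():
    semaphore = asyncio.Semaphore(10)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch_and_process(session, semaphore, url) for url in URLS))

asyncio.run(main())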

Related

Running python scripts from within java class Android Studio

I want to integrate python with my android app. What I want is:
I will write a python script and I will put it somewhere in my project folder. And I should be able to call that script from my activity class .java file.
In simple words, I want to build my complete app in Android Studio only but I want to use python for some parts of my code.
I know of only one thing that perfectly fits these criteria and that's "Chaquopy", but I don't want to use Chaquopy.
Can you please suggest something else?
Thank you
I would use Flask and have an endpoint like so:
from flask import Flask, jsonify, make_response

app = Flask(__name__)

@app.route('/data', methods=['GET'])
def meth():
    ret = ...  # your Python code goes here and produces ret
    return make_response(jsonify({'results': ret}), 200)
I've actually set up an endpoint for you to use here that takes a PNG file, uses Pillow to resize it to 1200 pixels, and returns the new PNG.
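If you want to host something similar yourself, a rough sketch of such an endpoint might look like the following (untested; the /data route, the 'image' form field, and the 1200-pixel target width are assumptions based on the description above, with the upload arriving via POST):

from io import BytesIO

from flask import Flask, request, send_file
from PIL import Image

app = Flask(__name__)

@app.route('/data', methods=['POST'])
def resize_png():
    # Read the uploaded PNG, scale it to 1200 pixels wide, and send it back.
    img = Image.open(request.files['image'])
    ratio = 1200 / img.width
    resized = img.resize((1200, int(img.height * ratio)))
    buf = BytesIO()
    resized.save(buf, format='PNG')
    buf.seek(0)
    return send_file(buf, mimetype='image/png')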
Your task is now to display the PNG however you would in Java.
EDIT: There are many, many approaches to read data from the HTTP endpoint in Java, one is given below, using okhttp:
Request request = new Request.Builder().url("https://hd1-martin.herokuapp.com/data").build();
Response rawResponse = new OkHttpClient().newCall(request).execute();
byte[] response = rawResponse.body().bytes();
This is the limit of my expertise; I'm now leaving the rest in your capable hands to get the new image to display in Android, with the following hint: I suspect you're going to be writing the bytes (response in the snippet above) to a file and loading it into an Android ImageView.
Hope that helps.

automatically download list of files from a text file every X hours on Windows

I have a Python script that collects source links and writes each link as a new line in a text file. This file is updated continuously, as I generally run the Python script for most of the day. The text file contains duplicate links, so I usually have to remove the duplicates manually before downloading.
Right now I manually put the text file into JDownloader and have the downloads saved to a folder on my PC, skipping any duplicates (downloads from before). However, this is a manual process, and by the time I get to it, some of the links are dead.
Is there a way to automate this? For example, every few hours it could take the text file as it is at that moment, download all the links to the folder (no subfolders), and skip all previously downloaded content (same filename). Or is manually using JDownloader my best option?
I think you need a scheduled job. On Windows you can set one up with Task Scheduler, which lets you schedule your script's execution.
This tutorial may help you.
In your script you can download each file using the Requests library.
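A rough, untested sketch of what that scheduled script could look like (links.txt and the output folder are placeholders), which Task Scheduler would then run every few hours:

import os

import requests

LINKS_FILE = "links.txt"      # the text file your other script keeps appending to
OUT_DIR = r"C:\downloads"     # target folder, no subfolders

def download_new_links():
    with open(LINKS_FILE) as f:
        # A set drops duplicate links within the file
        links = {line.strip() for line in f if line.strip()}
    for url in links:
        filename = url.split("/")[-1]
        dest = os.path.join(OUT_DIR, filename)
        if os.path.exists(dest):
            continue  # skip anything downloaded on a previous run
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            continue  # dead or unreachable link, skip it
        with open(dest, "wb") as out:
            out.write(response.content)

if __name__ == "__main__":
    download_new_links()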

Python requests, how to avoid DDOSing someone

I am writing a plugin for a piece of software I use. The goal is to have a button that automates downloading data from a government site.
I am still a bit new to Python, but I have managed to get it working exactly as I want. However, I want to avoid a case where my plugin makes hundreds of download requests at once, as that could impact the website's performance. The function below is what I use to download the files.
How can I make sure that it will not request thousands of files within a few seconds and overload the website? Is there a way to make the function below wait for one download to complete before starting another?
import requests
from os.path import join

def downloadFiles(fileList, outDir):
    # Download list of files, one by one
    for url in fileList:
        file = requests.get(url)
        fileName = url.split('/')[-1]
        open(join(outDir, fileName), 'wb').write(file.content)
This code is already sequential and it will wait for a download to finish before starting a new one. It's funny, usually people ask how to parallelize stuff.
If you want to slow it down further, you can add a time.sleep() to your code.
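As an illustration of the time.sleep() suggestion, here is a variation of the function above with a fixed pause between requests (the one-second default is an arbitrary choice; adjust it to whatever the site can tolerate):

import time
from os.path import join

import requests

def downloadFiles(fileList, outDir, delay=1.0):
    # Download the files one by one, pausing between requests so the
    # server never sees a burst of traffic.
    for url in fileList:
        response = requests.get(url)
        fileName = url.split('/')[-1]
        with open(join(outDir, fileName), 'wb') as f:
            f.write(response.content)
        time.sleep(delay)  # politeness delay between downloads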
If you want to be fancier, you can use something like this.

Dropbox Python API Upload multiple files

I'm trying to upload a set of pd.DataFrames as CSV to a folder in Dropbox using the Dropbox Python SDK (v2). The set of files is not particularly big, but there are many of them. Using batches will help to reduce the API calls and comply with the developer recommendations outlined in the documentation:
"The idea is to group concurrent file uploads into batches, where files in each batch are uploaded in parallel via multiple API requests to maximize throughput, but the whole batch is committed in a single, asynchronous API call to allow Dropbox to coordinate the acquisition and release of namespace locks for all files in the batch as efficiently as possible."
Following several answers on SO (see the most relevant to my problem here) and this answer from the SDK maintainers on the Dropbox forum, I tried the following code:
commit_info = []
for df in list_pandas_df:
    df_raw_str = df.to_csv(index=False)
    upload_session = dbx.files_upload_session_start(df_raw_str.encode())
    commit_info.append(
        dropbox.files.CommitInfo(path="/path/to/db/folder.csv")
    )
dbx.files_upload_finish_batch(commit_info)
Nonetheless, when reading the files_upload_finish_batch docstring I noticed that the function only takes a list of CommitInfo as an argument (documentation), which is confusing since the non-batch version (files_upload_session_finish) does take a CommitInfo object with a path, and a cursor object with data about the session.
I'm fairly lost in the documentation, and even the source code is not much help for understanding how batching works for uploading several files (as opposed to uploading a single large file). What am I missing here?
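For context, going by the SDK reference, the session/cursor/commit pieces seem to be wired together roughly as follows (an untested sketch; the access token and target paths are placeholders):

import dropbox

dbx = dropbox.Dropbox("YOUR_ACCESS_TOKEN")

entries = []
for i, df in enumerate(list_pandas_df):
    data = df.to_csv(index=False).encode()
    # Each file gets its own upload session; close=True marks its data as complete.
    session = dbx.files_upload_session_start(data, close=True)
    cursor = dropbox.files.UploadSessionCursor(
        session_id=session.session_id, offset=len(data))
    commit = dropbox.files.CommitInfo(path="/path/to/db/file_%d.csv" % i)
    entries.append(dropbox.files.UploadSessionFinishArg(cursor=cursor, commit=commit))

# Commit the whole batch in a single asynchronous call
# (method name as given in the SDK reference).
result = dbx.files_upload_session_finish_batch(entries)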

Caching a downloaded file based on modified time using dogpile

I'm writing a program that downloads a large file (~150 MB) and parses the data into a more useful text-format file. Downloading, and especially parsing, are slow (~20 minutes in total), so I'd like to cache the result.
Downloading produces a bunch of files, and parsing produces a single file, so I could manually check whether these files exist and, if so, check their modified times; however, since I'm already using dogpile with a Redis backend for web-service calls elsewhere in the code, I was wondering if dogpile could be used for this too?
So my question is: can dogpile be used to cache a file based on its modified time?
Why don't you divide the program into several parts:
downloader
parser & saver
worker with results
You can use a cache variable to store the value you need, and update it whenever the file is updated.
import json
import os
import threading
import time

_lock_services = threading.Lock()
tmp_file = "/tmp/txt.json"
update_time_sec = 3300

with _lock_services:
    # If the file was created more than 55 minutes (3300 s) ago, reset it;
    # here you can check whether the file was updated and refresh your
    # cache variable accordingly.
    if os.path.getctime(tmp_file) < (time.time() - update_time_sec):
        os.system("%s > %s" % ("echo '{}'", tmp_file))
    with open(tmp_file, "r") as json_data:
        cache_variable = json.load(json_data)
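On the dogpile question itself, one possible pattern (an untested sketch; parse_file and the Redis settings are placeholders) is to fold the file's modification time into the cache key, so a new entry is generated whenever the file changes:

import os

from dogpile.cache import make_region

region = make_region().configure(
    "dogpile.cache.redis",
    expiration_time=24 * 3600,  # arbitrary upper bound on cache lifetime
    arguments={"host": "localhost", "port": 6379},
)

def parse_cached(path):
    # Keying on the modified time means a changed file produces a new key,
    # so the stale cached result is simply never looked up again.
    key = "parsed:%s:%s" % (path, os.path.getmtime(path))
    return region.get_or_create(key, lambda: parse_file(path))  # parse_file: your slow parser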
