Caching a downloaded file based on modified time using dogpile - python

I'm writing a program that downloads a large file (~150MB) and parses the data into a more useful text format file. The process of downloading, and especially parsing, are slow (~20 minutes in total), so I'd like to cache the result.
The result of downloading are a bunch of files, and the result of parsing is a single file, so I can manually check if these files exist and if so, check their modified time; however, as I'm already using a dogpile with a redis backend for web service calls in other places in the code, I was wondering if dogpile could be used for this?
So my question is: can dogpile be used to cache a file based on its modified time?

Why don't you divide the program into several parts:
downloader
parser & saver
worker with results
You can use a cache variable to store the value you need, and update it whenever the file is updated.
import os
import json
import time
import threading

_lock_services = threading.Lock()
tmp_file = "/tmp/txt.json"
update_time_sec = 3300

with _lock_services:
    # if the file was created more than 55 minutes ago,
    # refresh it here and update your cache variable
    if os.path.getctime(tmp_file) < (time.time() - update_time_sec):
        os.system("echo '{}' > %s" % tmp_file)
    with open(tmp_file, "r") as json_data:
        cache_variable = json.load(json_data)
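
To the original question: dogpile doesn't watch files for you, but you can get mtime-based invalidation by folding the file's modified time into the cache key. A minimal sketch, assuming the same Redis backend already used for the web-service calls; the region settings and the expensive_parse function are illustrative, not from the question:
import os
from dogpile.cache import make_region

region = make_region().configure(
    'dogpile.cache.redis',
    expiration_time=86400,
    arguments={'host': 'localhost', 'port': 6379},
)

@region.cache_on_arguments()
def parse_download(path, mtime):
    # mtime is only here to shape the cache key: a new modified time
    # produces a new key, so the slow parse re-runs for the updated file
    return expensive_parse(path)  # hypothetical parsing function

def get_parsed(path):
    return parse_download(path, os.path.getmtime(path))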


run c++ script multiple times which reads csv file [closed]

I have a C++ console app that needs to be run multiple times. Every time it runs it reads a large CSV file that doesn't change, which is a slow process.
Is there any way to just "load" the file into memory once and not have to read it every time I run the app?
I was thinking of the way R and Python work: you load a CSV as a dataframe and can then use it in other scripts without loading it every time.
Each time your C++ app exits, its memory is freed, which means your data won't be kept around for the next run. So if you store your data on your app's heap, it has to be read from the file each time you run the app.
If you really want to avoid reading your data from the filesystem, the easiest path is to use a separate process, i.e. an in-memory database such as Redis or SQLite, so you can read your CSV once, store it in the DB's memory, and then access the data from your C++ app.
List of In-memory DB
In your case, I would suggest choosing Redis (easier than SQLite since you don't need to create tables).
If you're not familiar with it, it's quite simple to get started: Redis is a key-value storage system.
You just need to install a Redis server for your environment, and you can use a C++ lib to immediately store and retrieve data in it. All you have to do is use two types of commands: SET (when you read your CSV file for the first time) and GET (when you access the data for your processing).
The simplest way to store your data is probably to store each line of your CSV with a key composed of the filename and the line number. For instance, if your file name is artists_rock.csv, you can do this to store line 909:
SET ART_ROCK_909 "Lennon;John;Liverpool"
and you can get your record like that:
GET ART_ROCK_909
The key format is up to you, so that one makes it easy to iterate or access a line directly, just as if you were reading your file.
And if you use a C++ lib to parse your CSV records (meaning you never manipulate the original strings), you can also store each record as a Redis hash and manipulate it with HSET and HGET. The previous example would look like this:
HSET ART_ROCK_909 name "Lennon" firstname "John" birthplace "Liverpool"
and you would access data with
HGET ART_ROCK_909 birthplace
All you need to do is choose a C++ lib to talk to your Redis server. There are many wrappers for the hiredis C library, such as redis-plus-plus, which you can find on GitHub.
Here is some sample code to get you started.
To keep the same example as above, the corresponding code would look like this:
#include <sw/redis++/redis++.h>
using namespace sw::redis;

int main() {
    try {
        // Create a Redis object, which is movable but NOT copyable.
        auto redis = Redis("tcp://127.0.0.1:6379");
        auto line = my_csv_reading_function();  // your own function returning the CSV line as a std::string
        redis.set("ART_ROCK_909", line);
        auto val = redis.get("ART_ROCK_909");   // val is of type OptionalString. See the 'API Reference' section for details.
    } catch (const Error &err) {
        // handle connection/command errors here
    }
}
Assuming most of the time is spent on CSV parsing, and not on file-system operations, you can store your parsed data in a new file using fwrite. On the second execution, read that file using fread instead of parsing the CSV file.
Pseudo-code:
data = allocate()
open 'file.parsed'
if successful:
    fread('file.parsed', data)   # this is supposed to be fast
else:
    parse('file.csv', data)      # this is slow; will do only once
    fwrite('file.parsed', data)

How to find if a file has been downloaded completely using python?

We have a Python script that automates the batch processing of time-series image data downloaded from the internet. The current script requires all data to be downloaded before execution, which takes a lot of time. We want to modify the script by writing a scheduler that calls it whenever a single data file has finished downloading. How can I tell that a file has been downloaded completely using Python?
If you download the file with Python, then you can just do the image processing operation after the file download operation finishes. An example using requests:
import requests
import mymodule # The module containing your custom image-processing function
for img in ("foo.png", "bar.png", "baz.png"):
response = requests.get("http://www.example.com/" + img)
image_bytes = response.content
mymodule.process_image(image_bytes)
However, with the sequential approach above you will spend a lot of time waiting for responses from the remote server. To make this faster, you can download and process multiple files at once using asyncio and aiohttp. There's a good introduction to downloading files this way in Paweł Miech's blog post Making 1 million requests with python-aiohttp. The code you need will look something like the example at the bottom of that blog post (the one with the semaphore).
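A minimal sketch of that concurrent approach, assuming the same hypothetical mymodule.process_image function and example URLs as above; the semaphore limit is arbitrary:
import asyncio
import aiohttp
import mymodule  # your custom image-processing module, as above

async def fetch_and_process(session, semaphore, url):
    async with semaphore:                         # cap the number of downloads in flight
        async with session.get(url) as response:
            image_bytes = await response.read()   # the full body, so the file is completely downloaded
    mymodule.process_image(image_bytes)

async def main():
    semaphore = asyncio.Semaphore(10)
    urls = ["http://www.example.com/" + img for img in ("foo.png", "bar.png", "baz.png")]
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch_and_process(session, semaphore, url) for url in urls))

asyncio.run(main())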

How to keep track of the files I read into a database in python?

Situation
I get a ton of json files from a remote data source. I organize these files into an archive, then read them into a database. The archive exists to rebuild the database, if necessary.
The json files are generated remotely and sent to my server periodically, and the reading-in process happens continuously. On more than one occasion we had a power loss to our servers overnight or over the weekend, which was a huge problem for database loading: the processes halted, and since I didn't know what had been loaded and what hadn't, I had to roll back to some previously known state and rebuild out of the archive.
To fix this problem, my master loader daemon (written in python) now uses the logging package to track what files it has loaded. The basic workflow of the loader daemon is
cp json file to archive
rm the original
insert the archived copy into the database (it's MariaDB)
commit to database
log filename of loaded json file
I'm not so much worried about duplicates in the database, but I don't want gaps; that is, things in the archive that are not in the database. This method has worked so far and seems to guarantee that there are no gaps.
For my logging, it basically looks like this. When the daemon starts up on a set of received files' names, it checks for duplicates that have already been loaded to the destination database and then loads all the non-duplicates. It is possible to get duplicates from my remote data source.
import logging

def initialize_logs(filenames, destination):
    try:
        with open("/data/dblogs/{0}.log".format(destination), 'r') as already_used:
            seen = set(line.rstrip("\n") for line in already_used)
    except FileNotFoundError:
        print("Log file for {0} not found. Repair database".format(destination))
        quit()
    fnamelog = logging.getLogger('filename.log')
    fnamelog.setLevel(logging.INFO)
    fh = logging.FileHandler("/data/dblogs/{0}.log".format(destination))
    fh.setLevel(logging.INFO)
    fnamelog.addHandler(fh)
Then, as I process the json files, I log each file added using
fnamelog.info(filename)
The database loader runs parallelized, so I originally chose the logging package for its built-in concurrency protections. There are a variety of databases, and not every database pulls all of the data from the json files. Some databases hold more information but cover a shorter time span, usually one to two months. In that case it is nice to have a log file listing all the json files in a given database: if I want to add more to it later, I don't have to worry about what is already in there, because the log file is keeping track.
Problem
A year has passed. I have kept getting json files, and I am now receiving around a million files per month. The text logging of each filename as it is processed is clumsy, but it still works...for now. There are multiple databases, but for the largest ones the log file is over half a GB. I feel like this logging solution will not work well for much longer.
What options are available in python to track which filenames have been inserted into a database, when there are over 10 million filenames per database, and rising?
One approach would be to log the files in a table in the database itself rather than in a text log file. If you added some columns for things like import date or file name, that might give you a little more flexibility when you need to find information in these logs, and it would also let you perform periodic maintenance (for example, deleting log records that are more than a few months old if you know you will never need to look at them again).
If you decide to keep using text-based log files, you might consider breaking them up so you don't wind up with a giant monolithic log file. When you install things like Apache that log lots of data, you'll see it automatically sets up log rotation to compress and archive log files periodically...
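A rough sketch of the table-based logging idea, assuming a MariaDB connection via mysql-connector-python; the table and column names are illustrative, not something from the question:
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="loader",
                               password="...", database="archive_meta")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS loaded_files (
        filename    VARCHAR(255) PRIMARY KEY,
        imported_at DATETIME DEFAULT CURRENT_TIMESTAMP
    )
""")

def mark_loaded(filename):
    # INSERT IGNORE skips filenames that are already recorded
    cur.execute("INSERT IGNORE INTO loaded_files (filename) VALUES (%s)", (filename,))
    conn.commit()

# periodic maintenance: drop log records older than a few months
cur.execute("DELETE FROM loaded_files WHERE imported_at < NOW() - INTERVAL 3 MONTH")
conn.commit()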
You don't say what type of database you are using, but the general approach to take is:
1) Make a hash of each json file. SHA-256 is widely available. If you are concerned about performance, see this post: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
2) Make the hash field a unique key in your database, and before you do the other operations, try to insert it. If you can't, the record already exists and the transaction will abort.
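A short sketch of step 1, hashing each file in a streaming fashion so large files never have to be read into memory at once; the table shape mentioned in the comment is only an example:
import hashlib

def file_sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Step 2: with a UNIQUE constraint on the hash column, inserting a duplicate
# fails (or is ignored), e.g.:
#   INSERT IGNORE INTO loaded_files (sha256) VALUES ('<hex digest>');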
Program 1:
Foreach file in the input directory:
    INSERT IGNORE into the database the MD5 of the file
    "mv" the file to the archive directory
Program 2, a "keep-alive" program.
It is run via cron every minute and tries to launch Program 1, but does not start it if it is already running.
Notes:
'mv' and 'cron' assume Unix. If using Windows, do the equivalent.
'mv' is atomic, so the file will be either in one directory or the other; there is no hassle of knowing whether it has been 'processed'. (So, I wonder why you even have a database table??)
Since the INSERT and the mv cannot realistically be done 'atomically', the IGNORE is what keeps my plan safe.
The "is it running" can be handled in a number of ways, either in Program 1 or 2.
You could add a timestamp and/or filename to the table containing the md5; whatever you like.
Since it is not a good idea to have even 10K files in a directory, you should use something other than the pair of flat directories I have envisioned.
You are getting only about 1 file every 3 seconds. This is not a heavy load unless the files are huge. Then it becomes an I/O problem, not a database problem.
I have a feeling that either I am missing a hidden 'requirement', or you are being extra paranoid. I don't really understand what you need to do with the files.
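For the "is it running" check mentioned in the notes, one Unix-only possibility is an advisory lock file; a sketch, with the path to Program 1 being purely illustrative:
import fcntl
import subprocess
import sys

with open("/tmp/program1.lock", "w") as lock:
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(0)  # a previous run is still going; cron will try again next minute
    # the lock is held for as long as the file stays open, i.e. while Program 1 runs
    subprocess.run(["/usr/local/bin/program1"], check=False)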

How to dynamically rename the hdf5 file from psychopy's iohub

I'm using the Psychopy 1.82.01 Coder and its iohub functionality (on Ubuntu 14.04 LTS). It is working but I was wondering if there is a way to dynamically rename the hdf5 file it produces during an experiment (such that in the end, I know which participant it belongs to and two participants will get two files without overwriting one of them).
It seems to me that the filename is determined in this file: https://github.com/psychopy/psychopy/blob/df68d434973817f92e5df78786da313b35322ae8/psychopy/iohub/default_config.yaml
But is there a way to change this dynamically?
If you want to create a different hdf5 file for each experiment run, then the options depend on how you are starting the ioHub process. Assuming you are using the psychopy.iohub.launchHubServer() function to start ioHub, then you can pass the 'experiment_code' kwarg to the function and that will be used as the hdf5 file name.
For example, if you created a script with the following code and ran it:
import psychopy.iohub as iohub
io = iohub.launchHubServer(experiment_code="exp_sess_1")
# your experiment code here ....
# ...
io.quit()
An ioHub hdf5 file called 'exp_sess_1.hdf5' will be created in the same folder as the script file.
As a side note, you do not have to save each experiment session's data into a separate hdf5 file. The ioHub hdf5 file structure is designed to save data from multiple participants / sessions in a single file. Each time the experiment is run, a unique session code is required, and the data from each run is saved in the hdf5 file with a session id that is associated with that session code.
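If the goal is one file per participant, one possibility (a sketch, not tested against 1.82.01; the dialog fields are illustrative) is to build experiment_code from a participant ID collected at runtime:
from psychopy import gui
import psychopy.iohub as iohub

info = {"participant": ""}
gui.DlgFromDict(info, title="Session info")  # ask for the participant ID

io = iohub.launchHubServer(experiment_code="exp_{0}".format(info["participant"]))
# your experiment code here ....
io.quit()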

Custom Log File That Doesn't Grow Past a Given Size

I need to write some output to a log file every time certain events in my application are triggered. Right now I'm doing this in the simplest way imaginable:
with open('file.log', 'a+') as file:
    file.write('some info')
This works, but the problem is I don't want the file to grow indefinitely. I want a hard size cutoff of, say 25 MB. But I don't want to simply clear the file when it reaches that size.
Instead, I want it to function how I believe other log files work, clearing the oldest data from the top of the file and appending the new data to the end of the file. What would be a good approach to achieving this in Python? Note that I can't use Python's logging library because this is for a Celery application and (a) I already have logging set up for purposes unrelated to this and (b) Celery and the logging library do not play well together AT ALL.
import os
statinfo = os.stat('somefile.txt')
This returns a stat result; statinfo.st_size is the size of somefile.txt in bytes.
Then you can do something like:
if statinfo.st_size > 25000000:
You can implement this however you want. You could read up to the number of lines that are going to be replaced, delete those, and save the remainder to a temporary file; then write the data you wish and append the temporary file. This would not be very efficient, but it would work.
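A rough sketch of that approach; the 25 MB cap and file name come from the question, and the line-by-line trimming is deliberately simple rather than efficient:
import os

MAX_BYTES = 25 * 1024 * 1024
LOG_PATH = 'file.log'

def append_with_cap(message):
    data = message + '\n'
    if os.path.exists(LOG_PATH) and os.path.getsize(LOG_PATH) + len(data) > MAX_BYTES:
        with open(LOG_PATH, 'r') as f:
            lines = f.readlines()
        # drop the oldest lines until the new message fits under the cap
        while lines and sum(len(line) for line in lines) + len(data) > MAX_BYTES:
            lines.pop(0)
        with open(LOG_PATH, 'w') as f:
            f.writelines(lines)
    with open(LOG_PATH, 'a+') as f:
        f.write(data)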
