I have a C++ console app that is run multiple times. Every time it runs, it reads a large CSV file that never changes, which is a slow process.
Is there any way to just "load" the file into memory once and not have to read it every time I run the program?
I was thinking of the way R and Python work: you load a CSV as a dataframe once and can use it in other R scripts without loading it every time.
Each time your C++ app exits, its memory is freed, which means your data won't be kept for the next run. So if you store your data in your app's own heap, it has to be read from the file every time you run the app.
If you really want to avoid reading your data from the filesystem each time, the easiest path is to use a separate process, i.e. an in-memory database such as Redis or SQLite, so you can read your CSV once, store it in the DB's memory, and then access the data from your C++ app.
List of In-memory DB
In your case, I would suggest choosing Redis (easier than SQLite, since you don't need to create tables).
If you're not familiar with it, it's quite simple to get started: Redis is a key-value storage system.
You just need to install a Redis server for your environment, and you can use a C++ lib to immediately store and retrieve data in it. All you have to do is use two types of command: SET (when you read your CSV file for the first time) and GET (when you access the data for your processing).
The simplest way to store your data is probably to store each line of your CSV under a key composed of the filename and the line number. For instance, if your file name is artists_rock.csv, you can do this to store line 909:
SET ART_ROCK_909 "Lennon;John;Liverpool"
and you can get your record like that:
GET ART_ROCK_909
The key format is up to you; this one makes it easy to iterate or to access a line directly, just as if you were reading your file.
And if you use a C++ lib to parse your CSV records (meaning you never manipulate the original strings), you can also store each record as a Redis hash and manipulate it with HSET and HGET. The previous example would look like this:
HSET ART_ROCK_909 name "Lennon" firstname "John" birthplace "Liverpool"
and you would access data with
HGET ART_ROCK_909 birthplace
All you need to do is choose a C++ lib to talk to your Redis server. There are many wrappers for the hiredis C library, such as redis-plus-plus, which you can find on GitHub.
Here is some getting-started sample code.
To keep the same example as above, the corresponding code would look like this:
#include <sw/redis++/redis++.h>

using namespace sw::redis;

try {
    // Create a Redis object, which is movable but NOT copyable.
    auto redis = Redis("tcp://127.0.0.1:6379");

    // my_csv_reading_function() stands in for your own code that returns one CSV line.
    auto line = my_csv_reading_function();

    redis.set("ART_ROCK_909", line);
    auto val = redis.get("ART_ROCK_909"); // val is of type OptionalString. See the 'API Reference' section for details.
} catch (const Error &err) {
    // Handle connection or command errors here.
}
Assuming most of the time is spent on CSV parsing, and not on file-system operations, you can store your parsed data to a new file using fwrite. On subsequent executions, read that file back with fread instead of parsing the CSV file again.
Pseudo-code:
data = allocate()
open 'file.parsed'
if successful:
    fread('file.parsed', data)   # this is supposed to be fast
else:
    parse('file.csv', data)      # this is slow; will do only once
    fwrite('file.parsed', data)
I have a JSON file that has the following format:
{
    "items": {
        "item_1_name": { ...item properties... },
        "item_2_name": { ...item properties... },
        ...
    }
}
At my last count, there are over 13K items stored in the JSON file, and the file itself is nearly 75 MB on disk.
Now, I have a program that needs to query (read-only) that data. Each query operation takes an item name and reads its properties. Each invocation of the program may involve from a few to several dozen query ops.
Naturally, loading the JSON file from disk and parsing it takes time and space: it takes 0.76 seconds to load and parse, and the parsed data occupies 197 MB in memory. That means on each invocation of the program, I have to wait nearly a second before it can do anything with the data. I want to make the program respond faster.
So I have another approach: create a SQLite database file from that JSON file. Afterwards, the program needs to query against the database, instead of querying against the data directly parsed from the JSON file.
However, the SQLite approach has one drawback: unlike json.load(), it doesn't parse the whole file and keep it in memory (assuming a cache miss), and I'm not sure whether the disk I/O incurred by the query ops will offset the benefit over the JSON approach.
So my question is: from your experience, is this use case suitable for SQLite?
I think this depends entirely on how you're querying the data. From the way you describe it, you're querying by an ID only, so you're not going to get the best of what SQLite has to offer in terms of efficiency. It should work just fine for your use case, but it would really excel at returning all records matching a value, all records with values between two integers, and so on. A third option worth considering is a minimal key/value store, such as a Python dictionary stored as a pickle, or a really simple Redis service. Either of these will let you query by ID faster than reading a large JSON string.
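As a rough illustration of the ID-only case, here is a minimal sketch that builds an SQLite index from the JSON file once and then queries single items by name; the file and table names are just assumptions for the example:

import json
import sqlite3

# One-time conversion: store each item as (name, json_blob).
con = sqlite3.connect("items.db")
con.execute("CREATE TABLE IF NOT EXISTS items (name TEXT PRIMARY KEY, props TEXT)")
with open("items.json") as f:                      # assumed source file name
    data = json.load(f)
con.executemany("INSERT OR REPLACE INTO items VALUES (?, ?)",
                ((name, json.dumps(props)) for name, props in data["items"].items()))
con.commit()

# Each later invocation only reads the rows it actually needs.
row = con.execute("SELECT props FROM items WHERE name = ?", ("item_1_name",)).fetchone()
props = json.loads(row[0]) if row else None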
Here's the problem:
In a laboratory, very large microscopy data is created (from 1GB to 200GB per file).
We store the metadata as JSONs in MongoDB, but we cannot find a suitable local/open-source platform to store the files themselves.
We have tried Hadoop, but it is a very complex framework and we do not need most of its features. We only need a blob/object storage, if possible with a Python API, to read and write data via a self-built GUI.
We have already evaluated Ceph, OpenStack Swift, OwnCloud, Gluster, etc., but each of them fails for us because of its maximum file size; many of those mentioned have a limit of 5 GB per file.
What is the best way to store these files?
We need the following features:
Python (and REST) API
No Max-Limit size
Open Source / Local Software
Object / Blob Storage
If possible replication of the data
Unfortunately, for compliance reasons, cloud solutions are not an option.
Have you had a look at OMERO? It sounds as if it covers most of your requirements, although I don't know how far you can get with its Python API.
For cases like these, sometimes the best thing to do is use the built-in file system to store your files.
How many files do you need to keep? A plain file system with a file share works really well for storing large binary data. You can store your metadata in MongoDB, along with the path to the directory.
One thing you might or might not need to worry about is how many files you need to store. In my experience, if you're storing thousands of files, you need to work out how to distribute them across folders. If you store the hash of the object, you can write a function that calculates which directory to put the file in based on that hash. If you're familiar with git, this is exactly how it stores objects.
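A minimal sketch of that git-style layout, assuming a storage root of /data/blobstore and a two-character fan-out (both just examples):

import hashlib
import shutil
from pathlib import Path

STORE_ROOT = Path("/data/blobstore")                 # assumed storage root

def store_file(src):
    """Copy src into a directory derived from its SHA-256, like git's object store."""
    h = hashlib.sha256()
    with open(src, "rb") as f:                       # hash in chunks; the files can be huge
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    target_dir = STORE_ROOT / digest[:2]             # fan out on the first two hex characters
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / digest[2:]
    shutil.copy2(src, target)
    return target                                    # store this path in MongoDB next to the metadata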
vaex is a library for loading dataframes larger than the system memory allows. If you store your metadata in MongoDB and include a field for the filename, you keep your query abilities while the data itself stays on the filesystem in a usable way.
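A minimal sketch of that metadata-plus-filename idea with pymongo (the database, collection, and field names are assumptions):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
meta = client["lab"]["microscopy_metadata"]          # assumed database and collection names

# Store the metadata plus the path of the file on disk.
meta.insert_one({"sample_id": "S-0042",
                 "filename": "/data/microscopy/S-0042.tif",
                 "size_gb": 12.4})

# Later: look up the metadata, then open the file directly from the filesystem.
doc = meta.find_one({"sample_id": "S-0042"})
with open(doc["filename"], "rb") as f:
    header = f.read(1024)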
Situation
I get a ton of json files from a remote data source. I organize these files into an archive, then read them into a database. The archive exists to rebuild the database, if necessary.
The json files are generated remotely and sent to my server periodically, and the reading-in process happens continuously. On more than one occasion, we had a power loss to our servers overnight or over the weekend. This was a huge problem for database loading, since the processes halted and I didn't know what had been loaded and what hadn't, so I had to roll back to some previously known state and rebuild from the archive.
To fix this problem, my master loader daemon (written in Python) now uses the logging package to track which files it has loaded. The basic workflow of the loader daemon is:
cp json file to archive
rm the original
insert the archived copy into the database (it's MariaDB)
commit to database
log filename of loaded json file
I'm not so much worried about duplicates in the database, but I don't want gaps; that is, things in the archive that are not in the database. This method has worked so far and seems guaranteed to prevent any gaps.
My logging basically looks like this. When the daemon starts up on a set of received file names, it checks for duplicates that have already been loaded into the destination database and then loads all the non-duplicates. It is possible to get duplicates from my remote data source.
import logging

def initialize_logs(filenames, destination):
    try:
        with open("/data/dblogs/{0}.log".format(destination), 'r') as already_used:
            seen = set([line.rstrip("\n") for line in already_used])
    except FileNotFoundError:
        print("Log file for {0} not found. Repair database".format(destination))
        quit()

    fnamelog = logging.getLogger('filename.log')
    fnamelog.setLevel(logging.INFO)
    fh = logging.FileHandler("/data/dblogs/{0}.log".format(destination))
    fh.setLevel(logging.INFO)
    fnamelog.addHandler(fh)
Then, as I process the json files, I log each file added using
fnamelog.info(filename)
The database loader runs parallelized, so I originally chose the logging package for its built-in concurrency protections. There are a variety of databases, and not every database pulls all of the data from the json files. Some databases with more information cover a shorter span of time, usually one to two months. In that case, it is nice to have a log file listing all json files in a given database, so if I want to add more to it, I don't have to worry about what is already in there; the log file keeps track.
Problem
A year has passed, and I have kept getting json files; I am now receiving around a million files per month. Text-logging each filename as it is processed is clumsy, but it still works...for now. There are multiple databases, and for the largest ones the log file is over half a GB. I feel this logging solution will not work well for much longer.
What options are available in python to track which filenames have been inserted into a database, when there are over 10 million filenames per database, and rising?
One approach would be to log the files in a table in the database itself rather than in a text log file. If you added columns for things like import date or file name, that would give you some flexibility when you need to find information in these logs, and it would also let you perform periodic maintenance (for example, deleting log records that are more than a few months old, if you know you won't ever need to look at them again).
If you decide to keep using text-based log files, you might consider breaking them up so you don't end up with one giant monolithic log file. When you install things like Apache that log lots of data, you'll see that they automatically set up log rotation to compress and archive log files periodically.
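A minimal sketch of the in-database approach; the table and column names are assumptions, and get_mariadb_connection() is a hypothetical stand-in for however you already connect:

# Hypothetical helper: obtain a DB-API connection to the MariaDB instance you already use.
conn = get_mariadb_connection()
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS loaded_files (
                   filename    VARCHAR(255) PRIMARY KEY,
                   imported_at DATETIME DEFAULT CURRENT_TIMESTAMP
               )""")

# After a json file's contents are inserted, record its name as part of the same commit.
filename = "2019-06-01T00-00-00.json"                # example: the file just loaded
cur.execute("INSERT IGNORE INTO loaded_files (filename) VALUES (%s)", (filename,))
conn.commit()

# Periodic maintenance, e.g. drop entries older than six months.
cur.execute("DELETE FROM loaded_files WHERE imported_at < NOW() - INTERVAL 6 MONTH")
conn.commit()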
You don't say what type of database you are using, but the general approach to take is:
1) Make a hash of each json file. SHA-256 is widely available. If you are concerned about performance, see this post: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
2) Make the hash field a unique key in your database, and before you do the other operations, try to insert it. If you can't, the record already exists and the transaction will abort.
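A minimal sketch of step 1, hashing the file in chunks so even large files never sit fully in memory (the table and column names in the comment are just examples):

import hashlib

def file_sha256(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Step 2: with a UNIQUE key on the hash column, a duplicate insert fails (or is skipped
# with INSERT IGNORE), e.g.:
#   cur.execute("INSERT IGNORE INTO processed (sha256) VALUES (%s)", (file_sha256(path),))
# If cur.rowcount == 0, the file was already loaded and the rest of the work can be skipped.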
Program 1:
    Foreach file in input directory:
        INSERT IGNORE into database the MD5 of the file
        "mv" the file to the archive directory
Program 2, a "keep-alive" program.
It is run via cron every minute and tries to launch Program 1, but it won't start it if it is already running.
Notes:
'mv' and 'cron' assume Unix. If using Windows, do the equivalent.
'mv' is atomic, so the file will be either in one directory or the other; no hassle of knowing whether it is 'processed'. (So, I wonder why you even have a database table??)
Since the INSERT and the mv cannot realistically be done "atomically" together, here is why my plan is still safe: IGNORE.
The "is it running" check can be handled in a number of ways, either in Program 1 or in Program 2 (one lock-file approach is sketched after these notes).
You could add a timestamp and/or filename to the table containing the md5; whatever you like.
Since it is not a good idea to have even 10K files in a directory, you should use something other than the pair of flat directories I have envisioned.
You are getting only about 1 file every 3 seconds. This is not a heavy load unless the files are huge. Then it becomes an I/O problem, not a database problem.
I have a feeling that either I am missing a hidden 'requirement', or you are being extra paranoid. I don't really understand what you need to do with the files.
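As one of the possible ways to handle the "is it running" check mentioned in the notes, here is a minimal sketch using an advisory lock file (the lock path is just an example, and this is Unix-only):

import fcntl
import sys

# Try to take an exclusive, non-blocking lock; if another run holds it, exit quietly.
lock = open("/var/run/loader.lock", "w")             # example lock-file path
try:
    fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    sys.exit(0)                                      # previous run still going; let cron retry later

# ... run Program 1 here; the lock is released automatically when the process exits ...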
I'm still learning Python and am currently developing an API (an artificial personal assistant, e.g. Siri or Cortana). I was wondering if there is a way to update code by input. For example, if I had a list, would it be possible to PERMANENTLY add a new item, even after the program has finished running?
I read that you would have to use SQLite; is that true? And are there any other ways?
Hello J Nowak
I think what you want to do is save the input data to a file (e.g. a txt file).
You can view the link below which will show you how to read and write to a text file.
How to read and write to text file in Python
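For example, a minimal sketch that keeps a list in a plain text file across runs (the file name is just an example):

import os

ITEMS_FILE = "items.txt"                             # example file name

# Load whatever was saved by previous runs.
items = []
if os.path.exists(ITEMS_FILE):
    with open(ITEMS_FILE) as f:
        items = [line.rstrip("\n") for line in f]

# Add a new item from user input and save it permanently.
items.append(input("New item: "))
with open(ITEMS_FILE, "w") as f:
    f.write("\n".join(items) + "\n")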
There are plenty of ways to make your data persistent.
It depends on the task, on the environment etc.
Just a couple examples:
Files (JSON, DBM, Pickle)
NoSQL Databases (Redis, MongoDB, etc.)
SQL Databases (both serverless and client server: sqlite, MySQL, PostgreSQL etc.)
The most simple/basic approach is to use files.
There are even modules that allow you to do it transparently.
You just work with your data as always.
See shelve for example.
From the documentation:
A “shelf” is a persistent, dictionary-like object. The difference with
“dbm” databases is that the values (not the keys!) in a shelf can be
essentially arbitrary Python objects — anything that the pickle module
can handle. This includes most class instances, recursive data types,
and objects containing lots of shared sub-objects. The keys are
ordinary strings.
Example of usage:
import shelve

s = shelve.open('test_shelf.db')
try:
    s['key1'] = {'int': 10, 'float': 9.5, 'string': 'Sample data'}
finally:
    s.close()
You work with s just as you would with a normal dictionary, and it is automatically saved on disk (in the file test_shelf.db in this case).
So your dictionary is persistent and will not lose its values after the program restarts.
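For instance, a later run of the program can read the value straight back from the same file (a small sketch reusing the file from the example above):

import shelve

# A later run: the same file still holds the dictionary stored earlier.
s = shelve.open('test_shelf.db')
try:
    print(s['key1'])    # {'int': 10, 'float': 9.5, 'string': 'Sample data'}
finally:
    s.close()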
More on it:
https://docs.python.org/2/library/shelve.html
https://pymotw.com/2/shelve/
Another option is to use pickle, which also gives you persistence, but not magically: you will need to read and write the data on your own.
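A minimal sketch of the pickle route (the file name is just an example):

import pickle

data = {'int': 10, 'float': 9.5, 'string': 'Sample data'}

# You do the writing and reading yourself, unlike with shelve.
with open('test_data.pickle', 'wb') as f:
    pickle.dump(data, f)

with open('test_data.pickle', 'rb') as f:
    restored = pickle.load(f)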
Comparison between shelve and pickle:
What is the difference between pickle and shelve?
I'm writing a program that downloads a large file (~150 MB) and parses the data into a more useful text-format file. The downloading, and especially the parsing, are slow (~20 minutes in total), so I'd like to cache the result.
The result of downloading is a bunch of files, and the result of parsing is a single file, so I could manually check whether these files exist and, if so, check their modified time; however, since I'm already using dogpile with a Redis backend for web service calls elsewhere in the code, I was wondering whether dogpile could be used for this as well?
So my question is: can dogpile be used to cache a file based on its modified time?
Why don't you divide the program into several parts:
downloader
parser & saver
worker with results
You can use a cache variable to store the value you need, and update it whenever the file is updated.
import os
import time
import json
import threading

_lock_services = threading.Lock()
tmp_file = "/tmp/txt.json"
update_time_sec = 3300

with _lock_services:
    # If the file is older than update_time_sec (55 minutes here), reset it;
    # this is also where you can check whether the file changed and refresh your cache variable.
    if os.path.getctime(tmp_file) < (time.time() - update_time_sec):
        os.system("%s > %s" % ("echo '{}'", tmp_file))

    with open(tmp_file, "r") as json_data:
        cache_variable = json.load(json_data)