I have a Python program that is controlling some machines and stores some data. The data is produced at a rate of about 20 rows per second (and about 10 columns or so). The whole run of this program can be as long as one week, as a result there is a large dataframe.
What are safe and correct ways to store this data? With safe I mean that if something fails in the day 6, I will still have all the data from days 1→6. With correct I mean not re-writing the whole dataframe to a file in each loop.
My current solution is a CSV file, I just print each row manually. This solution is both safe and correct, but the problem is that CSV does not preserve data types and also occupies more memory. So I would like to know if there is a binary solution. I like the feather format as it is really fast, but it does not allow to append rows.
I can think of two easy options:
store chunks of data (e.g. every 30 seconds or whatever suits your use case) into separate files; you can then postprocess them back into a single dataframe.
store each row into an SQL database as it comes in. Sqlite will likely be a good start, but I'd maybe really go for PostgreSQL. That's what databases are meant for, after all.
Related
I have a calculator that iterates a couple of hundred object and produces Nx1 arrays for each of those objects. N here being 1-10m depending on configurations. Right now I am summing over these by using a generator expression, so memory consumption is low. However, I would like to store the Nx1 arrays to file, so I can do other computations.(Compute quantiles, partial sums etc. pandas style) Preferably I would like to use pa.memory_map on a single file (in order to have dataframes not loaded into memory), but I can not see how I can produce such a file without generating the entire result first. (Monte Carlo results on 200-500*10m floats).
If I understand correctly RecordBatchStreamWriter needs a part of the entire table, and I can not produce only a part of it. The parts the calculator produces is the columns, not parts of all columns. Is there any way of writing "columns" one by one? Either by appending, or create an empty arrow file which can be filled? (schema known).
As I see it, my alternative is to write several files and use "dataset" /tabular data to "join" them together. My "other computations" would then have to filter or pull parts into memory as I can`t see in the docs that "dataset()" work with memory_map.The result set is to big to fit in memory. (At least on the server it is running on)
I`m on day 2 of digging the docs and trying to understand how it all works, so apologies if the "lingo" is not all correct.
On further inspection, it looks like all files used in datasets() must have same schema, so I can not split "columns" in separate files either, can I..
EDIT
After wrestling with this library, I now produce single column files which I later combine in a single file. However, in following the suggested solution visible memory consumption (task manager) skyrockets in the step of combining the files. I would expect peaks for every "rowgroup" or combined recordbatch, but instead steadily increase to use all memory. A snip of this step:
readers = [pa.ipc.open_stream(file) for file in self.tempfiles]
combined_schema = pa.unify_schemas([r.schema for r in readers])
with pa.ipc.new_stream(
os.path.join(self.filepath, self.outfile_name + ".arrow"),
schema=combined_schema,
) as writer:
for group in zip(*readers):
combined_batch = pa.RecordBatch.from_arrays(
[g.column(0) for g in group], names=combined_schema.names
)
writer.write_batch(combined_batch)
From this link I would expect that running memory consumption to be that of combined_batch and some.
You could do the write in two passes.
First, write each column to its own file. Make sure to set a row group size small enough that a table consisting of one row group from each file comfortably fits into memory.
Second, create a streaming reader for each file you created and one writer. Read a single row group from each one. Create a table by combining all of the partial columns and write the table to your writer. Repeat until you've exhausted all your readers.
I'm not sure that memory mapping is going to help you much here.
I’m currently using Postgres database to store survey answers.
My problem I’m facing is that I need to generate pivot table from Postgres database.
When the dataset is small, it’s easy to just read whole data set and use Pandas to produce the pivot table.
However, my current database now has around 500k rows, and it’s increasing around 1000 rows per day. Reading whole dataset is not effective anymore.
My question is that do I need to use HDFS to store data on disk and supply it to Pandas to do pivoting?
My customers need to view pivot table output nearly real time. Do we have any way to solve it?
My theory is that I’ll create pivot table output of 500k rows and store the output somewhere, then when new data gets saved into database, I’ll only need to merge the new data with existing pivot table. I’m not quite sure if Pandas supports this way, or it needs a full dataset to do pivoting?
Have you tried using pickle. I'm a data scientist and use this all the time with data sets of 1M+ rows and several hundred columns.
In your particular case I would recommend the following.
import pickle
save_data = open('path/file.pickle', 'wb') #wb stands for write bytes
pickle.dump(pd_data, save_data)
save_data.close()
In the above code what you're doing is saving your data in a compact format that can quickly be loaded using:
pickle_data = open('path/file.pickle', 'rb') #rb stands for read bytes
pd_data = pickle.load(pickle_data)
pickle_data.close()
At which point you can append your data (pd_data) with the new 1,000 rows and save it again using pickle. If your data will continue to grow and you expect memory to become a problem I suggest identifying a way to append or concatenate the data rather than merge or join since the latter two can also result in memory issues.
You will find that this will cut out significant load time when reading something off your disk (I use Dropbox and its still lightning fast). What I usually do in order to reduce that even further is segment my data sets into groups of rows & columns and then write methods that load the pickled data as need be (super useful graphing).
After the recommendation from Jeff's Answer to check out this Google Forum, I still didn't feel satisfied on what the conclusion was regarding the appendCSV method. Below, you can see my implementation of reading many XLS files. Is there a way to significantly increase the speed of this? It currently takes over 10 minutes for around 900,000 rows.
listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
data = pd.read_excel(a_file, sheetname=0, skiprows=range(1,2), header=1)
data.rename(columns={'Alphabeta':'AlphaBeta'}, inplace=True)
frame = frame.append(data)
# Save to CSV..
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")
The very first important point
Optimize only code that is required to be optimized.
If you need to convert all you files just once then you have already made a great job, congrats! If you, however, need to reuse it really often (and by really I mean that there is a source that produce your Excel files with a speed at least of 900K rows per 10 minutes and you need to parse them in real-time) then what you need to do is to analyze your profiling results.
Profiling analysis
Sorting your profile in descending order by 'cumtime', which is cumulative execution time of function including its subcalls, you will discover that out of ~2000 seconds of runtime ~800 seconds are taken by 'read_excel' method and ~1200 seconds are taken by 'to_csv' method.
If then you will sort profile by 'tottime' which is total execution time of functions themselves you will find out that top time consumers are populated with functions that are connected with reading and writing lines and conversion between formats. So, the real problem is that either libraries you use are slow, or the amount of data you are parsing is really huge.
Possible solutions
For the first reason, please keep in mind that parsing Excel lines and converting them could be a really complex task. It is hard to advice you without having an example of your input data. But there could be a real time loss just because the library you are using is for everything and it does hard work parsing rows several times when you actually do not need it, because your rows have very simple structure. In this case you may try to switch to different libraries, that does not perform complex parsing of input data, for example use xlrd for reading data from Excel. But in title you mentioned that input files are also CSVs so if this is applicable in your case then load lines with just:
line.strip().split(sep)
instead of complex Excel format parsing. And of course if your rows are simple than you can always use
','.join(list_of_rows)
to write CSV instead of using complex DataFrames at all. However, if your files contain Unicode symbols, complex fields and so on then these libraries are probably the best choice.
For the second reason - 900K rows could contain from 900K to infinite bytes, so it is really hard to understand whether your data input is really so big, without an example again. If you have really a lot of data then probably there is not too much you could do and you just have to wait. And remember that disk is actually a very slow device. Usual disks could provide you with ~100Mb/s at its best so if you are copying (because ultimately that is what you are doing) 10Gb of data then you can see that at least 3-4 minutes will be required for just physically reading raw data and writing the result. But in case if you are not using your disk bandwidth for 100% (for example if parsing one row with library that you are using takes comparable time with just reading this row from disk) you might also try to increase speed of your code by asynchronous data reading with multiprocessing map_async instead of cycle.
If you are using pandas, you could do this:
dfs = [pd.read_excel(path.join(dir, name), sep='\t', encoding='cp1252', error_bad_lines=False ) for name in os.listdir(dir) if name.endswith(suffix)]
df = pd.concat(dfs, axis=0, ignore_index=True)
This is screaming fast compared to other methods of getting data into pandas. Other tips:
You can also speed this up by specifying dtype for all columns.
If you are doing read_csv, use the engine='c' to speed up the import.
Skip rows on error
I'm working on implementing a relatively large (5,000,000 and growing) set of time series data in an HDF5 table. I need a way to remove duplicates on it, on a daily basis, one 'run' per day. As my data retrieval process currently stands, it's far easier to write in the duplicates during the data retrieval process than ensure no dups go in.
What is the best way to remove dups from a pytable? All of my reading is pointing me towards importing the whole table into pandas, and getting a unique- valued data frame, and writing it back to disk by recreating the table with each data run. This seems counter to the point of pytables, though, and in time I don't know that the whole data set will efficiently fit into memory. I should add that it is two columns that define a unique record.
No reproducible code, but can anyone give me pytables data management advice?
Big thanks in advance...
See this releated question: finding a duplicate in a hdf5 pytable with 500e6 rows
Why do you say that this is 'counter to the point of pytables'? It is perfectly possible to store duplicates. The user is responsible for this.
You can also try this: merging two tables with millions of rows in python, where you use a merge function that is simply drop_duplicates().
I'm currently rewriting some python code to make it more efficient and I have a question about saving python arrays so that they can be re-used / manipulated later.
I have a large number of data, saved in CSV files. Each file contains time-stamped values of the data that I am interested in and I have reached the point where I have to deal with tens of millions of data points. The data has got so large now that the processing time is excessive and inefficient---the way the current code is written the entire data set has to be reprocessed every time some new data is added.
What I want to do is this:
Read in all of the existing data to python arrays
Save the variable arrays to some kind of database/file
Then, the next time more data is added I load my database, append the new data, and resave it. This way only a small number of data need to be processed at any one time.
I would like the saved data to be accessible to further python scripts but also to be fairly "human readable" so that it can be handled in programs like OriginPro or perhaps even Excel.
My question is: whats the best format to save the data in? HDF5 seems like it might have all the features I need---but would something like SQLite make more sense?
EDIT: My data is single dimensional. I essentially have 30 arrays which are (millions, 1) in size. If it wasn't for the fact that there are so many points then CSV would be an ideal format! I am unlikely to want to do lookups of single entries---more likely is that I might want to plot small subsets of data (eg the last 100 hours, or the last 1000 hours, etc).
HDF5 is an excellent choice! It has a nice interface, is widely used (in the scientific community at least), many programs have support for it (matlab for example), there are libraries for C,C++,fortran,python,... It has a complete toolset to display the contents of a HDF5 file. If you later want to do complex MPI calculation on your data, HDF5 has support for concurrently read/writes. It's very well suited to handle very large datasets.
Maybe you could use some kind of key-value database like Redis, Berkeley DB, MongoDB... But it would be nice some more info about the schema you would be using.
EDITED
If you choose Redis for example, you can index very long lists:
The max length of a list is 232 - 1 elements (4294967295, more than 4
billion of elements per list). The main features of Redis Lists from
the point of view of time complexity are the support for constant time
insertion and deletion of elements near the head and tail, even with
many millions of inserted items. Accessing elements is very fast near
the extremes of the list but is slow if you try accessing the middle
of a very big list, as it is an O(N) operation.
I would use a single file with fixed record length for this usecase. No specialised DB solution (seems overkill to me in that case), just plain old struct (see the documentation for struct.py) and read()/write() on a file. If you have just millions of entries, everything should be working nicely in a single file of some dozens or hundreds of MB size (which is hardly too large for any file system). You also have random access to subsets in case you will need that later.