Efficiently rewriting lines in a large text file with Python

I'm trying to generate a large data file (in the GBs) by iterating over thousands of database records. At the top of the file is a line for each "feature" that appears later in the file. They look like:
#attribute 'Diameter' numeric
#attribute 'Length' real
#attribute 'Qty' integer
Lines containing data using these attributes look like:
{0 0.86, 1 0.98, 2 7}
However, since my data is sparse, each record from my database may not have every attribute, and I don't know what the complete feature set is in advance. I could, in theory, iterate over my database records twice: the first time accumulating the feature set, and the second time outputting my records. But I'm trying to find a more efficient method.
I'd like to try a method like the following pseudo-code:
fout = open('output.dat', 'w')
known_features = set()
for record in records:
    if record has unknown features:
        jump to top of file
        delete existing "#attribute" lines and write new lines
        jump to bottom of file
    fout.write(record)
It's the jump-to/write/jump-back part I'm not sure how to pull off. How would you do this in Python?
I tried something like:
fout.seek(0)
for new_attribute in new_attributes:
    fout.write(new_attribute)
fout.seek(0, 2)
but this overwrites both the attribute lines and the data lines at the top of the file, rather than simply inserting new lines starting at the seek position I specify.
How do you obtain a word-processor's "insert" functionality in Python without loading the entire document into memory? The final file is larger than all my available memory.

Why don't you get a list of all the features and their data types, and list them first? If a feature is missing, replace it with a known value; NULL seems appropriate.
This way your records will be complete (in length), and you don't have to hop around the file.
The other approach is to write two files: one containing all your features, the other all your rows. Once both files are generated, append the feature file to the top of the data file, as sketched below.
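A rough sketch of that two-file approach (records, record.features, and format_record are hypothetical stand-ins, not from the question; features are assumed to be (name, type) pairs):

known_features = []

# Pass over the records once, streaming data rows to a temporary file
# while accumulating the feature list in order of first appearance.
with open('data.tmp', 'w') as fdata:
    for record in records:
        for feature in record.features:
            if feature not in known_features:
                known_features.append(feature)
        fdata.write(format_record(record, known_features))

# Write the header, then append the data file in chunks so the
# multi-GB payload never has to fit in memory.
with open('output.dat', 'w') as fout:
    for name, ftype in known_features:
        fout.write(f"#attribute '{name}' {ftype}\n")
    with open('data.tmp') as fdata:
        for chunk in iter(lambda: fdata.read(1 << 20), ''):  # 1 MiB at a time
            fout.write(chunk)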
FWIW, word processors load files into memory for editing, and then write the entire file back out. This is why you can't load a file larger than the addressable/available memory in a word processor, or in any other program that is not implemented as a stream reader.

Why don't you build the output in memory first (e.g. as a dict) and write it to a file after all the data is known?

Related

Xarray to merge two hdf5 files with different dimension lengths

I have some instrumental data saved in HDF5 format as multiple 2-D arrays along with the measurement time. As the attached figures below show, d1 and d2 are two independent files which the instrument recorded at different times. They have the same data variables; the only difference is the length of phony_dim_0, which represents the total number of data points, varying with measurement time.
These files need to be loaded into specific software provided by the instrument company to obtain meaningful results. I want to merge multiple files with Python xarray while keeping their original format, and then load one merged file into the software.
Here is my attempt:
import os
import numpy as np
import xarray

files = os.listdir("DATA_PATH")
d1 = xarray.open_dataset(files[0])
d2 = xarray.open_dataset(files[1])
## copy a new one to save the merged data array.
d0 = d1
vars_ = [c for c in d1]
for var in vars_:
    d0[var].values = np.vstack([d1[var], d2[var]])
The error shows like this:
replacement data must match the Variable's shape. replacement data has shape (761, 200); Variable has shape (441, 200)
I thought about two solutions to this problem:
expanding the dimension length to the total length of all merged files.
creating a new empty dataframe in the same format as d1 and d2.
However, I still could not figure out the functions to achieve that. Any comments or suggestions would be appreciated.
Supplemental information
dataset example [d1],[d2]
I'm not familiar with xarray, so I can't help with your code. However, you don't need xarray to copy HDF5 data; h5py is designed to work nicely with HDF5 data as NumPy arrays, and is all you need to merge the data.
A note about xarray: it uses different nomenclature than HDF5 and h5py. Xarray refers to the files as 'datasets', and calls the HDF5 datasets 'data variables'. HDF5/h5py nomenclature is more frequently used, so I am going to use it for the rest of my post.
There are some things to consider when merging datasets across 2 or more HDF5 files. They are:
Consistency of the data schema (which you have checked).
Consistency of attributes. If datasets have different attribute names or values, the merge process gets a lot more complicated! (Yours appear to be consistent.)
It's preferable to create resizable datasets in the merged file. This simplifies the process, as you don't need to know the total size when you initially create the dataset. Better yet, you can add more data later (if/when you have more files).
I looked at your files. You have 8 HDF5 datasets in each file. One nice thing: the datasets are resizable, which simplifies the merge process. Also, although your datasets have a lot of attributes, they appear to be common to both files. That also simplifies the process.
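As an aside, here is a minimal illustration of what "resizable" means in h5py (the file name, dataset name, and sizes are made up for this example, not taken from your files): creating a dataset with maxshape=(None, 200) leaves axis 0 free to grow later.

import h5py
import numpy as np

with h5py.File('example.h5', 'w') as f:
    # maxshape=(None, 200) marks axis 0 as unlimited.
    dset = f.create_dataset('data', shape=(441, 200),
                            maxshape=(None, 200), dtype='f8')
    dset[:] = np.random.random((441, 200))
    # Later, the dataset can grow along axis 0:
    dset.resize(441 + 320, axis=0)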
The code below goes through the following steps to merge the data:
1. Open the new merge file for writing.
2. Open the first data file (read-only).
3. Loop through all datasets:
a. Use the group copy function to copy each dataset (the data plus the maxshape parameters, and attribute names and values).
4. Open the second data file (read-only).
5. Loop through all datasets and do the following:
a. Get the sizes of the 2 datasets (existing and to be added).
b. Increase the size of the existing HDF5 dataset with the .resize() method.
c. Write the values from the new dataset to the end of the existing dataset.
At the end, it loops through all 3 files and prints the shape and maxshape of every dataset (for visual comparison).
Code below:
import h5py

files = ['211008_778183_m.h5', '211008_778624_m.h5', 'merged_.h5']

# Create the merge file:
with h5py.File('merged_.h5', 'w') as h5fw:
    # Open first HDF5 file and copy each dataset.
    # Will use maxshape and attributes from the existing dataset.
    with h5py.File(files[0], 'r') as h5fr:
        for ds in h5fr.keys():
            h5fw.copy(h5fr[ds], h5fw, name=ds)
    # Open second HDF5 file and copy data from each dataset.
    # Resizes the existing dataset as needed to hold the new data.
    with h5py.File(files[1], 'r') as h5fr:
        for ds in h5fr.keys():
            ds_a0 = h5fw[ds].shape[0]
            add_a0 = h5fr[ds].shape[0]
            h5fw[ds].resize(ds_a0 + add_a0, axis=0)
            h5fw[ds][ds_a0:] = h5fr[ds][:]

# Print shape and maxshape for every dataset in all 3 files:
for fname in files:
    print(f'Working on file: {fname}')
    with h5py.File(fname, 'r') as h5f:
        for ds, h5obj in h5f.items():
            print(f'for: {ds}; shape={h5obj.shape}, maxshape={h5obj.maxshape}')

Reading the last batch of data added to an hdf5 file using Python

I have a program that will add a variable number of rows of data to an hdf5 file as shown below.
data_without_cosmic.to_hdf(new_file, key='s', append=True, mode='r+', format='table')
new_file is the file name and data_without_cosmic is a pandas data frame with 'x', 'y', 'z', and 'i' columns representing positional data and a scalar quantity. I may add several data frames of this form to the file each time I run the full program. For each data frame I add, the 'z' values are a constant value.
The next time I use the program, I would need to access the last batch of rows that was added to the data in order to perform some operations. I wondered if there was a fast way to retrieve just the last data frame that was added to the file or if I could group the data in some way as I add it in order to be able to do so.
The only other way I can think of to achieve my goal is reading the entire file and then checking the z values from the bottom up until they change, but that seems a little excessive. Any ideas?
P.S. I am very inexperienced with working with HDF5 files, but I read that they are efficient to work with.
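Since the 'z' values are constant within each appended batch, one possible approach (a sketch under that assumption, not from the thread, reusing the new_file name and key from the snippet above; format='table' supports column selection and row ranges) is to read back only the 'z' column, find where the last batch starts, and then read just those rows:

import pandas as pd

# Read only the 'z' column; much cheaper than loading the full table.
z = pd.read_hdf(new_file, key='s', columns=['z'])['z']

# The last batch starts one past the last row whose z differs from the
# final z value (or at 0 if the file holds a single batch).
changed = z.ne(z.iloc[-1]).to_numpy().nonzero()[0]
start = changed[-1] + 1 if len(changed) else 0

# Pull in just those rows.
last_batch = pd.read_hdf(new_file, key='s', start=start)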

How to load certain aspect of json file to rdd easily

I have a JSON file with over a million rows, so I am trying to minimize the number of times I have to run through it all to get one aspect of it into an RDD.
Right now, I load each row into a list:
import json

data = []
with open('in/json-files/sites.json') as f:
    for line in f:
        data.append(json.loads(line))
Then, I make another list and import the aspect into that:
data_companies = []
for line in range(1, len(data)):
    data_companies.append(data[line]['company'])
Then, I parallelize this into an RDD so that I can analyze it. I am worried about how much memory this will take up, so is there an easier and faster way to do this? I have tried loading the JSON file like this, but it won't work:
data.append(json.loads(line['company']))
As your data is structured (JSON), you can look into Spark SQL:
https://spark.apache.org/docs/2.4.0/sql-programming-guide.html
https://spark.apache.org/docs/2.4.0/sql-data-sources-json.html
You can load your JSON directly into a DataFrame and select the particular column to do your analysis, as sketched below.
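A minimal sketch of that approach (the session setup is generic boilerplate; the file path and column name come from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sites").getOrCreate()

# Spark parses line-delimited JSON natively; no manual loop over the file.
df = spark.read.json("in/json-files/sites.json")
companies = df.select("company")
companies.show()

# If an RDD is still required downstream:
companies_rdd = companies.rdd.map(lambda row: row["company"])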

How to copy a partial or skeleton h5py file

I have a few questions wrapped up in this issue. I realize this might be a convoluted post and can provide extra details.
A code package I use can produce large .h5 files (source.h5) (100+ GB), where almost all of the data resides in one dataset (group2/D). I want to make a new .h5 file (dest.h5) with Python that contains all datasets of source.h5 except group2/D, without copying the entire file. I will then condense group2/D after some postprocessing and write a new group2/D to dest.h5 with much less data. However, I need to keep source.h5, because this postprocessing may need to be performed multiple times into multiple destination files.
source.h5 is always structured the same and cannot be changed in either source.h5 or dest.h5, where each letter is a dataset:
group1/A
group1/B
group2/C
group2/D
I thus want to initially make a file with this format:
group1/A
group1/B
group2/C
and, again, fill in group2/D later. Simply copying source.h5 multiple times is always possible, but I'd like to avoid copying a huge file a bunch of times, because disk space is limited and this isn't a one-off case.
I searched and found this question (How to partially copy using python an Hdf5 file into a new one keeping the same structure?) and tested if dest.h5 would be the same as source.h5:
import h5py

fs = h5py.File('source.h5', 'r')
fd = h5py.File('dest.h5', 'w')
fs.copy('group1', fd)
fd.create_group('group2')
fs.copy('group2/C', fd['/group2'])
fs.copy('group2/D', fd['/group2'])
fs.close()
fd.close()
but the code package I use couldn't read the file I created (which it must be able to do), implying there was some critical data loss in this operation (the file sizes also differ by 7 kB). I'm assuming the problem arose when I created group2 manually, because I checked with numpy that the values in the group1 datasets exactly match in both source.h5 and dest.h5. Before digging into what data is missing, I wanted to get a few things out of the way:
Question 1: Is there .h5 file metadata that accompanies each group or dataset? If so, how can I see it so I can create a group2 in dest.h5 that exactly matches the one in source.h5? Is there a way to see if 2 groups (not datasets) exactly match each other?
Question 2: Alternatively, is it possible to simply copy the data structure of a .h5 file (i.e. groups and datasets with empty lists as a skeleton file) so that fields can be populated later? Or, as a subset of this question, is there a way to copy a blank dataset to another file such that any metadata is retained (assuming there is some)?
Question 3: Finally, to avoid all this, is it possible to just copy a subset of source.h5 to dest.h5? With something like:
fs.copy(['group1','group2/C'], fd)
Thanks for your time. I appreciate you reading this far
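Regarding Question 1: yes, HDF5 groups and datasets can carry attributes (key/value metadata), which h5py exposes through .attrs. A minimal sketch, assuming the file and group names from the question, for listing and comparing them:

import h5py

with h5py.File('source.h5', 'r') as fs, h5py.File('dest.h5', 'r') as fd:
    for name in ('group1', 'group2'):
        a, b = fs[name].attrs, fd[name].attrs
        # Compare attribute names first, then inspect the values.
        print(name, 'same attribute names:', set(a) == set(b))
        for key in a:
            print(' ', key, '=', a[key])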

make custom spreadsheets with python

I have a pandas data frame with two columns:
years of experience and salary
I want to save a CSV file with these two columns and also have some stats at the head of the file, as in the image:
Is there any option to handle this with pandas or any other library, or do I have to write a script that builds the file line by line, adding the commas between fields?
Pandas does not support what you want to do here. The problem is that your format is not valid CSV. The RFC for CSV states that "each record is located on a separate line", implying that a line corresponds to a record, with an optional header line. Your format adds the average and max values, which do not correspond to records.
As I see it, you have three paths from here: (i) you create two separate data frames and map them to CSV files (to be super precise it would be 3), one with your records, one with the additional values; (ii) you write your data frame to CSV first, then open that file and insert your additional values at the top (sketched below); (iii) if your goal is an import into Excel, however, @gefero's suggestion is the right hint: try using the xlsxwriter package to write directly to cells in a spreadsheet.
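A minimal sketch of option (ii), folded into one pass by writing the stat lines to an open handle before letting pandas append the regular CSV below them (the column names, stat labels, and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'years_experience': [1, 3, 5],
                   'salary': [40000, 55000, 70000]})

with open('salaries.csv', 'w', newline='') as f:
    # Stat rows first...
    f.write(f'average,{df["salary"].mean()}\n')
    f.write(f'max,{df["salary"].max()}\n')
    # ...then the header and records; to_csv accepts an open file handle.
    df.to_csv(f, index=False)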
You can read the file back as two separate parts (stats and CSV):
Reading stats:
import pandas

number_of_stats_rows = 3
stats = pandas.read_csv(file_path, nrows=number_of_stats_rows, header=None).fillna('')
Reading remaining file:
other_data = pandas.read_csv(file_path, skiprows=number_of_stats_rows).fillna('')
Take a look at xlsxwriter. Perhaps it's what you are looking for.
