Reading the last batch of data added to an HDF5 file using Python

I have a program that will add a variable number of rows of data to an hdf5 file as shown below.
data_without_cosmic.to_hdf(new_file,key='s', append=True, mode='r+', format='table')
Here new_file is the file name and data_without_cosmic is a pandas data frame with 'x', 'y', 'z', and 'i' columns representing positional data and a scalar quantity. I may add several data frames of this form to the file each time I run the full program. Within each data frame I add, the 'z' value is constant.
The next time I use the program, I need to access the last batch of rows that was added to the file in order to perform some operations. Is there a fast way to retrieve just the last data frame that was added, or can I group the data in some way as I add it so that I can do so?
The only other way I can think of is to read the entire file and then check the z values from the bottom up until the value changes, but that seems a little excessive. Any ideas?
P.S. I am very inexperienced with hdf5 files, but I read that they are efficient to work with.
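One possible way to group the data as it is added is to tag every batch with an increasing id and store that id as a queryable data column. The sketch below assumes PyTables is installed and that the file is only ever written through these helpers; the `batch` column and the names `append_batch` and `read_last_batch` are hypothetical, not part of pandas.

```python
import pandas as pd


def append_batch(path, df, key="s"):
    """Append df as one batch, tagging its rows with an increasing batch id."""
    with pd.HDFStore(path, mode="a") as store:
        if key in store:
            nrows = store.get_storer(key).nrows
            # Read only the final row to learn the most recent batch id.
            last_id = int(store.select(key, start=nrows - 1)["batch"].iloc[0])
        else:
            last_id = -1
        tagged = df.assign(batch=last_id + 1)
        # data_columns makes 'batch' usable in a where= query later on.
        store.append(key, tagged, format="table", data_columns=["batch"])


def read_last_batch(path, key="s"):
    """Return only the rows of the most recently appended batch."""
    with pd.HDFStore(path, mode="r") as store:
        nrows = store.get_storer(key).nrows
        last_id = int(store.select(key, start=nrows - 1)["batch"].iloc[0])
        return store.select(key, where=f"batch == {last_id}")
```

If changing the stored columns is not an option, recording `store.get_storer(key).nrows` before each append and later calling `store.select(key, start=that_row_count)` should give the same rows without the extra column.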

Related

Python reads only one column from my CSV file

First post here.
I am very new to programming, so sorry if this is confusing.
I built a database by collecting several different datasets online. All of the data is in one xlsx file (one column per variable), which I then converted to csv because my teacher only showed us how to use csv files in Python.
I installed pandas and had it read my csv file, but it doesn't seem to understand that I have multiple columns; it reads everything as one column. As a result, I can't get info on each column (and so I can't transform the data).
I tried df.info() and df.info(verbose=True, show_counts=True), but both give the same result.
len(df.columns) = 1, which shows it doesn't see that each variable has its own column.
len(df) = 1923, which is right.
I was expecting this: https://imgur.com/a/UROKtxN (different project, not the same database)
database used: https://imgur.com/a/Wl1tsYb
And I get this instead: https://imgur.com/a/iV38YNe
database used: https://imgur.com/a/VefFrL4
I don't know, they look pretty similar, so why doesn't it work? :((
Thanks.
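A common cause of this symptom is that the csv was exported with a delimiter other than a comma (Excel writes ';' in many locales), although the screenshots above are not enough to confirm that. A minimal sketch under that assumption, using made-up sample data in place of the real file:

```python
import io

import pandas as pd

# Hypothetical sample standing in for the exported file, with ';' between
# fields instead of ','.
raw = "name;age;city\nAnn;31;Paris\nBob;45;Lyon\n"

# With the default sep=',' each whole line lands in a single column.
wrong = pd.read_csv(io.StringIO(raw))

# Naming the delimiter explicitly splits the fields as intended.
right = pd.read_csv(io.StringIO(raw), sep=";")

# Alternatively, sep=None with the Python engine asks pandas to detect it.
sniffed = pd.read_csv(io.StringIO(raw), sep=None, engine="python")

print(len(wrong.columns), len(right.columns), len(sniffed.columns))
```

Opening the csv in a plain text editor shows which character actually separates the fields.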

displaying multiple pandas function created on python in the same csv file

How can I write multiple pandas dataframes created in Python to the same csv file?
I have multiple data tables saved as pandas dataframes, and I want to output all of them into the same CSV for ease of access. However, I am not sure of the best way to go about this, as I want to maintain each dataframe's inherent structure (i.e. columns and index) so that I can combine them all into a single dataframe.

You have 2 choices:
Either you combine them first (pd.concat()), with all the advantages and limitations of that approach, and then call .to_csv, which writes one file. If they are structurally the same, this is great because you will be able to read the file again.
Or you call .to_csv() multiple times and save the output in a "buffer", which you can then write out (see here). This is probably the only way if your DataFrames are structurally very different, but reading them back later will be a mess.
Is .json output an option for what you want to do?
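The two choices above could look roughly like this. The frames `df_a` and `df_b` are made up for illustration, and the "# name" label line in the second variant is just one possible convention, not something pandas defines.

```python
import io

import pandas as pd

# Hypothetical frames for illustration.
df_a = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7]})
df_b = pd.DataFrame({"id": [3], "score": [0.9]})

# Choice 1: same structure, so stack the rows and write once.
combined = pd.concat([df_a, df_b], ignore_index=True)
csv_single = combined.to_csv(index=False)

# Choice 2: write each frame into one shared buffer, separated by a label
# line and a blank line. Reading this back needs custom parsing.
buffer = io.StringIO()
for name, frame in {"df_a": df_a, "df_b": df_b}.items():
    buffer.write(f"# {name}\n")
    frame.to_csv(buffer, index=False)
    buffer.write("\n")
csv_multi = buffer.getvalue()
```

The first variant reads back with a single pd.read_csv call; the second keeps each frame's own header but has to be split apart by hand.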

Thanks a lot for the comment, Kingotto. I used the first option and added this code, which arranged my dataframes side by side and exported the result to csv:
frames = pd.concat([file_1, file_2, file_3], axis=1)
# save the dataframe
frames.to_csv('Combined.csv', index=False)

changing output from column to rows in python to csv

I am trying to output a few separate variables (separate data sets created previously in my code) as a single csv file. I have found the most luck using np.asarray and np.savetxt. I am now trying to format the csv output file and want to have my variables read in each column (header then data below for each variable being written to csv). I have successfully had the data transfer to the csv file along with adding column title headers, but I cannot get the variables to format from one row into separate rows.
I have tried changing the order from 'C' to 'F' in np.asarray. I have also tried a few things with Python's csv writing library, but np.savetxt and np.asarray have been the best routes so far.
My code for this is as follows. Each variable in csvData is listed as 'float64' in my variable explorer, if that helps at all.
csvData = np.asarray([[timestep], [IQR_bot], [IQR_top], [median],
                      [prcnt95], [prcnt5], [KFTG_tot]], dtype=object, order='C')
np.savetxt("pathout.csv", csvData, fmt='%s',
           header='Timestep, IQR_bot,IQR_top,median,prcnt95,prcnt5, KFTG_top')
I want each input variable from csvData to be its own separate column of data tied with the respective header listed in np.savetxt. This code is not currently throwing any error messages, but the output format is not what I want it to be.
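One way to get one column per variable is to stack the arrays column-wise before saving and to pass a comma delimiter. This is only a sketch: it assumes each variable is a 1-D float array of the same length, and the sample values (and the reduced set of three variables) are made up.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for the question's variables, assumed to be
# 1-D float arrays of equal length.
timestep = np.array([0.0, 1.0, 2.0])
median = np.array([10.0, 11.0, 12.0])
prcnt95 = np.array([15.0, 16.0, 17.0])

# column_stack puts each 1-D array into its own column: shape (3, 3).
table = np.column_stack([timestep, median, prcnt95])

out_path = os.path.join(tempfile.mkdtemp(), "pathout.csv")
np.savetxt(
    out_path,
    table,
    delimiter=",",  # separate fields with commas
    header="Timestep,median,prcnt95",
    comments="",  # drop the default '# ' prefix on the header line
    fmt="%g",
)
```

Without `delimiter=","` savetxt separates values with spaces, so a spreadsheet program would show each row in a single cell.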

make custom spreadsheets with python

I have a pandas data frame with two columns:
year experience and salary
I want to save a csv file with these two columns and also have some stats at the head of the file as in the image:
Is there any option to handle this with pandas or any other library, or do I have to write a script that writes it line by line, adding the commas between fields?

Pandas does not support what you want to do here. The problem is that your format is not valid CSV. The RFC for CSV states that "Each record is located on a separate line", implying that a line corresponds to a record, with an optional header line. Your format adds the average and max values, which do not correspond to records.
As I see it, you have three paths from here:
i. Create two separate data frames and write them to separate csv files (to be really precise, it would be three): one with your records, one with the additional values.
ii. Write your data frame to csv first, then open that file and insert your additional values at the top.
iii. If your goal is an import into Excel, however, @gefero's suggestion is the right hint: try using the XlsxWriter package to write directly to cells in a spreadsheet.
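A variation on the second path is to write the summary lines first and then let pandas write the table to the same open file handle, which avoids reopening and rewriting the file. A rough sketch; the column names, sample values, and the exact layout of the summary lines are assumptions, since the image from the question is not available.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data; the real frame has experience and salary columns.
df = pd.DataFrame(
    {"years_experience": [1, 3, 5], "salary": [40000, 50000, 60000]}
)

out_path = os.path.join(tempfile.mkdtemp(), "salaries.csv")
with open(out_path, "w", newline="") as fh:
    # Summary lines first; these are not records.
    fh.write(f"average salary,{df['salary'].mean()}\n")
    fh.write(f"max salary,{df['salary'].max()}\n")
    fh.write("\n")
    # Then the regular table, written to the same open handle.
    df.to_csv(fh, index=False)
```

The table part can later be read back with pandas.read_csv(out_path, skiprows=3), since three lines precede the header row.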

You can read the file as two separate parts (stats and csv)
Reading stats:
number_of_stats_rows = 3
stats = pandas.read_csv(file_path, nrows=number_of_stats_rows, header=None).fillna('')
Reading remaining file:
other_data = pandas.read_csv(file_path, skiprows=number_of_stats_rows).fillna('')

Take a look at XlsxWriter. Perhaps it's what you are looking for.

Efficiently rewriting lines in a large text file with Python

I'm trying to generate a large data file (in the GBs) by iterating over thousands of database records. At the top of the file is a line for each "feature" that appears later in the file. They look like:
#attribute 'Diameter' numeric
#attribute 'Length' real
#attribute 'Qty' integer
Lines containing data using these attributes look like:
{0 0.86, 1 0.98, 2 7}
However, since my data is sparse data, each record from my database may not have each attribute, and I don't know what the complete feature set is in advance. I could, in theory, iterate over my database records twice, the first time accumulating the feature set, and then the second time to output my records, but I'm trying to find a more efficient method.
I'd like to try a method like the following pseudo-code:
fout = open('output.dat', 'w')
known_features = set()
for record in records:
    if record has unknown features:
        jump to top of file
        delete existing "#attribute" lines and write new lines
        jump to bottom of file
    fout.write(record)
It's the jump-to/write/jump-back part I'm not sure how to pull off. How would you do this in Python?
I tried something like:
fout.seek(0)
for new_attribute in new_attributes:
    fout.write(new_attribute)
fout.seek(0, 2)
but this overwrites both the attribute lines and the data lines at the top of the file, rather than inserting new lines at the seek position I specify.
How do you obtain a word-processor's "insert" functionality in Python without loading the entire document into memory? The final file is larger than all my available memory.

Why don't you get a list of all the features and their data types; list them first. If a feature is missing, replace it with a known value - NULL seems appropriate.
This way your records will be complete (in length), and you don't have to hop around the file.
The other approach is to write two files. One contains all your features, the other all your rows. Once both files are generated, append the data file to the end of the feature file.
FWIW, word processors load files in memory for editing; and then they write the entire file out. This is why you can't load a file larger than the addressable/available memory in a word processor; or any other program that is not implemented as a stream reader.
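The two-file idea could be sketched as follows. This is only an illustration under simplifying assumptions: records arrive as dicts of feature name to value, every attribute is declared as `numeric`, and the helper name `write_sparse_file` and the `.body` scratch file are hypothetical.

```python
import os
import shutil


def write_sparse_file(records, out_path):
    """Write '#attribute' header lines followed by sparse data lines.

    records: iterable of dicts mapping feature name -> value (a
    hypothetical stand-in for the database rows).
    """
    feature_index = {}  # feature name -> column number, in first-seen order
    body_path = out_path + ".body"

    # Single pass over the records, streaming data lines to a scratch file.
    with open(body_path, "w") as body:
        for record in records:
            cells = []
            for name, value in record.items():
                idx = feature_index.setdefault(name, len(feature_index))
                cells.append((idx, value))
            cells.sort()
            body.write("{" + ", ".join(f"{i} {v}" for i, v in cells) + "}\n")

    # The full feature set is now known: write the header, then copy the
    # scratch file after it in chunks rather than loading it into memory.
    with open(out_path, "w") as out, open(body_path) as body:
        for name in feature_index:
            out.write(f"#attribute '{name}' numeric\n")
        shutil.copyfileobj(body, out)
    os.remove(body_path)
```

The database is iterated only once; the cost is one extra sequential copy of the data lines on disk.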

Why don't you build the output in memory first (e.g. as a dict) and write it to a file after all data is known?
