Pytables check if column exists - python

Is it possible to use PyTables (or pandas) to detect whether an HDF file's table contains a certain column? To load the HDF file I use:
from pandas.io.pytables import HDFStore
# this doesn't read the full file which is good
hdf_store = HDFStore('data.h5', mode='r')
# returns a "Group" object, not sure if this could be used...
hdf_store.get_node('tablename')
I could also use Pytables directly instead of Pandas. The aim is not to load all the data of the hdf file as these files are potentially large and I only want to establish whether a certain column exists.

I may have found a solution, but am unsure of (1) why it works and (2) whether this is a robust solution.
import tables

# Note: `openFile` was the pre-3.0 PyTables name; newer versions use `open_file`.
h5 = tables.open_file('data.h5', mode='r')
df_node = getattr(h5.root, 'tablename')
# Not sure why `axis0` contains the column data, but it seems consistent
# with the tested h5 files.
columns = df_node.axis0[:]
columns contains a numpy array with all the column names.

The accepted solution doesn't work for me with Pandas 0.20.3 and PyTables 3.3.0 (the HDF file was created using Pandas). However, this works:
pd.HDFStore('data.hd5', mode='r').get_node('/path/to/pandas/df').table.colnames
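Putting the pieces together, a hedged sketch of the actual existence check (the file name, node path and column name are placeholders, and it assumes a table-format store whose colnames include the column of interest):
import pandas as pd

# placeholders: adjust the file name, node path and column name to your store
with pd.HDFStore('data.h5', mode='r') as store:
    colnames = store.get_node('/path/to/pandas/df').table.colnames
    print('my_column' in colnames)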

Related

Renaming the columns in Vaex

I initially tried to read a 4 GB csv file with pandas pd.read_csv, but my system runs out of memory (I guess) and the kernel restarts or the system hangs.
So I tried using the vaex library to convert the csv to HDF5 and do operations (aggregations, group by) on it. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv', column_names=None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv', header=None, convert=True, chunk_size=5000000)
But I'm still getting the first record of the csv file as the header (the column names, to be precise) and I'm unable to change the column names. I tried to find a function to change the names but didn't come across any. Please help me with that. Thanks :)
The column names 1559104, 10289, 991... are actually the first record in the csv, and somehow vaex is taking the first row as my column names, which I want to avoid.
vaex.from_csv is a wrapper around pandas.read_csv with a few extra options for the conversion.
Reading the pandas documentation: header='infer' (which is the default) tells the csv reader to automatically infer the column names; otherwise the first row of the file is used as the header. Alternatively, you can pass the column names manually via the names kwarg. The same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options, then use those options with vaex together with the convert and chunk_size arguments.
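For instance, a hedged sketch of what that could look like here (the column names below are made-up placeholders for the real ones):
import vaex

# a sketch, assuming the file has no header row; the column names are placeholders
df = vaex.from_csv('Wager-Win_April-Jul.csv',
                   header=None,
                   names=['col1', 'col2', 'col3'],
                   convert=True,
                   chunk_size=5000000)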

How to store complex csv data in django?

I am working on a Django project where a user can upload a csv file that is stored in the database. In most of the csv files I have seen, the first row contains the header and the values follow underneath, but in my case the headers are in a column, like this (my csv data):
I don't understand how to save this type of data in my Django model.
You can transpose your data. I think that is more appropriate for your dataset in order to do real analysis. Usually things such as id values would be the row index and names such as company_id, company_name, etc. would be the columns. This will allow you to do further analysis (mean, std, variances, pct_change, group_by) and use pandas to its fullest. That said:
import pandas as pd
df = pd.read_csv('yourcsvfile.csv')
df2 = df.T
Also, as @H.E. Lee pointed out, in order to save your model to your database you can either use the dataframe's to_sql method to save it in MySQL (via your connection), use to_json and then import the data if you're using MongoDB, or manually write your own transformation to your database.
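Continuing the snippet above, a minimal sketch of the to_sql route (the connection string and table name are placeholders; this assumes SQLAlchemy and a MySQL driver are installed):
from sqlalchemy import create_engine

# placeholders: replace the connection string and table name with your own
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')
df2.to_sql('company_data', con=engine, if_exists='replace')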
You can flip it with the built-in csv module quite easily, no need for cumbersome modules like pandas (which in turn requires NumPy). Since you didn't specify the Python version you're using, and this procedure differs slightly between versions, I'll assume Python 3.x:
import csv

# use open("file.csv", "rb") in Python 2.x
with open("file.csv", "r", newline="") as f:  # open the file for reading
    data = list(map(list, zip(*csv.reader(f))))  # read the CSV and flip it
If you're using Python 2.x you should also use itertools.izip() instead of zip() and you don't have to turn the map() output into a list (it already is).
Also, if the rows are uneven in your CSV you might want to use itertools.zip_longest() (itertools.izip_longest() in Python 2.x) instead.
Either way, this will give you a 2D list data where the first element is your header and the rest of them are the related data. What you plan to do from there depends purely on your DB... If you want to deal with the data only, just skip the first element of data when iterating and you're done.
Given your data it may be best to store each row as a string entry using TextField. That way you can be sure not to lose any structure going forward.
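A minimal sketch of that idea, with a hypothetical model name and assuming the uploaded file has already been split into lines:
from django.db import models

class CsvRow(models.Model):
    # one raw CSV line per entry; the model and field names are hypothetical
    content = models.TextField()

# `lines` is assumed to be an iterable of raw CSV lines from the upload
CsvRow.objects.bulk_create([CsvRow(content=line) for line in lines])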

numpy and pytables changing values when only using some specific columns

I have a problem when reading a csv file and then reordering the data into different hdf5 tables. I noticed that the values in the csv file did not match those in the hdf5 tables, and managed to make a small example of the problem.
The csv file I use looks like this (column names and values), called test.csv:
a,b,c,d
1,10,3,5
2,15,5,7
3,20,7,12
4,25,9,25
I read this using numpy.genfromtxt. Then I create an hdf5 file using pytables. The code I use is as follows:
import numpy
import tables
csv = 'test.csv'
data = numpy.genfromtxt(csv, delimiter=",", names=True)
hdf = 'out.h5'
file = tables.open_file(hdf, mode="a")
group = file.create_group('/', 'test', 'test')
table = file.create_table(group, 'test', data, 'test data')
table = file.create_table(group, 'test2', data[['a', 'b']], 'test data')
This creates two tables in the hdf5 file, test and test2, with test containing all the columns in test.csv and test2 only containing the 'a' and 'b' columns.
However, when I open the hdf5 file in a viewer, it looks like this:
[screenshots of the test and test2 tables omitted]
Notice the values in test2 being all scrambled and wrong.
I have made a workaround, but it is slow and requires looping over all values in Python, so I would rather understand why this is happening and whether there is a particular solution to the problem and a way to get the correct values in test2.
I am using Python 3.5.3, numpy 1.14.0 and pytables 3.4.2.
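One vectorized alternative to the slow Python-loop workaround, continuing the snippet above, might be to copy the wanted fields into a fresh, tightly packed structured array before handing it to create_table (a sketch only, not verified against these exact versions):
import numpy

# copy just the wanted columns into a new, packed structured array
# ('test2_fixed' is a hypothetical table name, to avoid clashing with test2)
subset = numpy.empty(data.shape, dtype=[('a', data.dtype['a']),
                                        ('b', data.dtype['b'])])
subset['a'] = data['a']
subset['b'] = data['b']
table2 = file.create_table(group, 'test2_fixed', subset, 'test data')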

pandas write dataframe to parquet format with append

I am trying to write a pandas dataframe to the parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?
The write syntax is
df.to_parquet(path, mode='append')
The read syntax is
pd.read_parquet(path)
It looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't implement it.
Below is from pandas doc:
DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)
We have to pass in both engine and **kwargs:
engine : {'auto', 'pyarrow', 'fastparquet'}
**kwargs : additional arguments passed to the parquet library.
The **kwargs we need to pass here is append=True (handled by fastparquet).
import pandas as pd
import os.path

file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

if not os.path.isfile(file_path):
    df.to_parquet(file_path, engine='fastparquet')
else:
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, you will see the error below:
AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times writes the same data three times over. Inspecting the metadata shows that this resulted in 3 row groups.
Note:
Appending could be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups: compression works better, since compression operates within a row group only, and there is less overhead spent on storing statistics, since each row group stores its own statistics.
To append, do this:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"
# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)
# Write direct to your parquet file
pq.write_to_dataset(table, root_path=output)
This will automatically append into your table.
I used the aws wrangler library. It works like a charm.
Below are the reference docs:
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream and used the kinesis-python library to consume the messages and write to S3. I have not included the JSON processing logic, as this post deals with the problem of being unable to append data to S3. This was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler

import awswrangler as wr
import pandas as pd

evet_data = pd.DataFrame({'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
                         columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# print(evet_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)
try:
    wr.s3.to_parquet(
        df=evet_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as e:
    print(str(e))
Happy to help with any clarifications. In a few other posts I have read suggestions to read the data and overwrite it again, but as the data gets larger that slows the process down and is inefficient.
There is no append mode in pandas.to_parquet(). What you can do instead is read the existing file, add the new data, and write it back, overwriting the original.
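A minimal sketch of that pattern (the file name and the new data are placeholders):
import pandas as pd

# placeholders for the file being grown and the rows to add
new_rows = pd.DataFrame({'col1': [5], 'col2': [6]})

existing = pd.read_parquet('output.parquet')
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_parquet('output.parquet')  # overwrites the original file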
Use the fastparquet write function
from fastparquet import write
write(file_name, df, append=True)
The file must already exist as I understand it.
API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
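For completeness, a short sketch of the create-then-append flow with fastparquet.write (the file name and DataFrames are placeholders):
import pandas as pd
from fastparquet import write

# placeholder frames standing in for successive chunks of data
df_first = pd.DataFrame({'x': [1, 2]})
df_next = pd.DataFrame({'x': [3, 4]})

write('data.parquet', df_first)              # first call creates the file
write('data.parquet', df_next, append=True)  # later calls append row groups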
Pandas to_parquet() can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if it already exists. To append to a parquet dataset, just add a new file to the same parquet directory.
import datetime, os
import pandas as pd

os.makedirs(path, exist_ok=True)

# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))

# read
pd.read_parquet(path)

Pyfits or astropy.io.fits add row to binary table in fits file

How can I add a single row to a binary table inside a large FITS file using pyfits, astropy.io.fits or maybe some other Python library?
This file is used as a log, so every second a single row will be added; eventually the size of the file will reach gigabytes, so reading the whole file and writing it back, or keeping a copy of the data in memory and writing it to the file every second, is practically impossible. With pyfits or astropy.io.fits, so far I could only read everything into memory, add the new row and then write it back.
Example: I create a FITS file like this:
import numpy, pyfits
data = numpy.array([1.0])
col = pyfits.Column(name='index', format='E', array=data)
cols = pyfits.ColDefs([col])
tbhdu = pyfits.BinTableHDU.from_columns(cols)
tbhdu.writeto('test.fits')
And I want to add some new value to the column 'index', i.e. add one more row to the binary table.
Solution: This is a trivial task for the cfitsio library (method fits_insert_row(...)), so I use a Python module which is based on it: https://github.com/esheldon/fitsio
Here is the solution using fitsio. To create a new FITS file, one can do:
import fitsio, numpy
from fitsio import FITS,FITSHDR
fits = FITS('test.fits','rw')
data = numpy.zeros(1, dtype=[('index','i4')])
data[0]['index'] = 1
fits.write(data)
fits.close()
To append a row:
fits = FITS('test.fits','rw')
#you can actually use the same already opened fits file,
#to flush the changes you just need: fits.reopen()
data = numpy.zeros(1, dtype=[('index','i4')])
data[0]['index'] = 2
fits[1].append(data)
fits.close()
Thank you for your help.
