I have a problem when reading a csv file and then reordering the data into different hdf5 tables: the values in the csv file do not match those in the hdf5 tables. I managed to make a small example of the problem.
The csv file I use looks like this (table names and values) and is called test.csv:
a,b,c,d
1,10,3,5
2,15,5,7
3,20,7,12
4,25,9,25
I read this using numpy.genfromtxt, then create an hdf5 file using pytables. The code I use is as follows:
import numpy
import tables
csv = 'test.csv'
data = numpy.genfromtxt(csv, delimiter=",", names=True)
hdf = 'out.h5'
file = tables.open_file(hdf, mode="a")
group = file.create_group('/', 'test', 'test')
table = file.create_table(group, 'test', data, 'test data')
table = file.create_table(group, 'test2', data[['a', 'b']], 'test data')
This creates two tables in the hdf5 file, test and test2, with test containing all the columns in test.csv and test2 only containing the 'a' and 'b' columns.
However, when I open the hdf5 file in a viewer, it looks like this:
[viewer screenshots of the test and test2 tables omitted]
Notice that the values in test2 are all scrambled and wrong.
I have made a workaround, but it is slow and requires looping over all values in Python, so I would rather understand why this is happening, whether there is a proper solution to the problem, and how to get the correct values in test2.
I am using Python 3.5.3, numpy 1.14.0 and pytables 3.4.2.
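For what it's worth, a non-looping alternative to the slow workaround might be to copy the selected fields into a fresh, contiguous structured array before passing it to create_table (i.e., in place of the data[['a', 'b']] line above). The helper below is hypothetical, not part of the original code, and only loops over the two field names rather than over every value:

import numpy

def extract_fields(arr, fields):
    """Copy the named fields of a structured array into a new, packed array."""
    out = numpy.empty(arr.shape, dtype=[(f, arr.dtype[f]) for f in fields])
    for f in fields:
        out[f] = arr[f]
    return out

# Build a contiguous two-column copy and store that instead of a field view.
subset = extract_fields(data, ['a', 'b'])
table2 = file.create_table(group, 'test2', subset, 'test data')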
Let's say I have a csv file containing a single column:
Dog
Cat
Bird
This column is shared with a txt file I have that has two columns of interest:
Cat 8.9
Bird 12.2
Dog 2.1
One column contains the identifiers (species name); the other contains, let's say, speed in mph.
I want to parse through the rows of the csv file, look up each species' speed in my txt file, and add it as an extra column to the csv file. That is, I want to join these two files on the common species key, and I specifically just want to append the speed from the txt file (and not any other columns that may be in it) to the csv file. The csv file also has superfluous columns that I would like to keep, but I don't want any of the superfluous columns from the txt file (say it also had heights, weights, and so on; I don't want to add those to my csv file).
Let's say that this file is very large; I have 1,000,000 species in both files.
What would be the shortest and cleanest way to do this?
Should I write some type of a nested loop, should I use a dictionary, or would Pandas be an effective method?
Note: let's say I also want to compute the mean speed of all the species, i.e. take the mean of the speed column. Would numpy be the most effective method for this?
Additional note: All the species names are in common, but the order in the csv and txt files is different (as indicated in my example). How would I correct for this?
Additional note 2: This is a totally contrived example since I've been learning about dictionaries and input/output files, and reviewing loops that I wrote in the past.
Note 3: The txt file should be tab separated, but the csv file is (obviously) comma separated.
You could do everything needed with the built-in csv module. As I mentioned in a comment, it can be used to read the text file since it's tab-delimited (as well as read the csv file and write an updated version of it).
You seem to indicate there are other fields besides "animal" and "speed" in both files, but the code below assumes each file has only one or both of those columns.
import csv

csv_filename = 'the_csv_file.csv'
updated_csv_filename = 'the_csv_file2.csv'
txt_filename = 'the_text_file.txt'

# Create a lookup table from the text file.
with open(txt_filename, 'r', newline='') as txt_file:
    # Use the csv module to read the tab-delimited text file.
    reader = csv.DictReader(txt_file, ('animal', 'speed'), delimiter='\t')
    lookup = {row['animal']: row['speed'] for row in reader}

# Read the csv file and create an updated version of it.
with open(csv_filename, 'r', newline='') as csv_file, \
     open(updated_csv_filename, 'w', newline='') as updated_csv_file:
    reader = csv.reader(csv_file)
    writer = csv.writer(updated_csv_file)
    for row in reader:
        # Insert the looked-up value (speed) into the row following the first
        # column (animal) and copy any columns following that.
        row.insert(1, lookup[row[0]])  # Insert looked-up speed into column 2.
        writer.writerow(row)
Given the two input files in your question, here are the contents of the updated csv file (assuming there were no additional columns):
Dog,2.1
Cat,8.9
Bird,12.2
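The question also asks about the mean speed; with the lookup dictionary built above, that can be computed in plain Python (a small sketch; remember the csv module reads everything as strings):

# Values from the text file were read as strings, so convert before averaging.
speeds = [float(v) for v in lookup.values()]
mean_speed = sum(speeds) / len(speeds)
print(mean_speed)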
This is probably most easily achieved with pandas DataFrames.
You can load both your CSV and text file using the read_csv function (just adjust the separator for the text file to tab) and use the join function to join the two DataFrames on the columns you want to match, something like:
column = pd.read_csv('data.csv')
data = pd.read_csv('data.txt', sep='\t')
# join matches against the other frame's index, so make 'species' the index first
joined = column.join(data.set_index('species'), on='species')
result = joined[['species', 'speed', 'other column you want to keep']]
If you want to conduct more in depth analysis of your data or your files are too large for memory, you may want to look into importing your data into a dedicated database management system like PostgreSQL.
EDIT: If your files don't contain column names, you can load them with custom names using pd.read_csv(file_path, header=None, names=['name1','name2']) as described here. Also, columns can be renamed after loading using dataframe.rename(columns = {'oldname':'newname'}, inplace = True) as seen here.
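For the mean-speed note in the question, pandas can compute it directly from the joined DataFrame (assuming the speed column ends up named 'speed' as above):

# Mean of the speed column across all species.
mean_speed = joined['speed'].mean()
print(mean_speed)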
You can just use the merge() method of pandas like this:
import pandas as pd
df_csv = pd.read_csv('csv.csv', header=None)
df_txt = pd.read_fwf('txt.txt', header=None)
result = pd.merge(df_txt,df_csv)
print(result)
Gives the following output:
      0     1
0   Cat   8.9
1  Bird  12.2
2   Dog   2.1
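Since Note 3 in the question says the txt file is tab separated, a small variant (my assumption, not part of the original answer) is to read it with an explicit separator instead of relying on fixed-width parsing:

import pandas as pd

# Same merge, but parse the txt file as tab-separated rather than fixed-width.
df_csv = pd.read_csv('csv.csv', header=None)
df_txt = pd.read_csv('txt.txt', sep='\t', header=None)
result = pd.merge(df_txt, df_csv)
print(result)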
I have a very large CSV File (~12Gb) that looks something like this:
posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0
I want to convert this CSV file into the HDF5 format using the library h5py, while also lowering the total file size by setting the field / index types, e.g.:
Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32 or something along those lines.
Note: I need to chunk the data in some form when I read it in to avoid Memory Errors.
However, I am unable to get the desired result. What I have tried so far:
Using Pandas own methods following this guide: How to write a large csv file to hdf5 in python?
This creates the file, but I'm somehow unable to change the types, and the file remains too big (~10.7 GB). The field types are float64 and int64.
I also tried to split the CSV up into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing those lines using sed. Then I tried out the following code:
import pandas as pd
import h5py

PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa"  # xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)

with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
    dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")
Sadly this created the file and the table but didn't write any data into it.
Expectation
Creating an HDF5 file containing the data of a large CSV file while also changing the variable type of each index.
If something is unclear, please ask me for clarification. I'm still a beginner!
Have you considered the numpy module?
It has a handy function (genfromtxt) to read CSV data with headers into a Numpy array. You define the dtype. The array is suitable for loading into HDF5 with the h5py.create_dataset() function.
See code below. I included 2 print statements. The first shows the dtype names created from the CSV headers. The second shows how you can access the data in the numpy array by field (column) name.
import h5py
import numpy as np

PATH_csv = 'SO_55576601.csv'
csv_dtype = ('f8', 'f8', 'f8', 'i4', 'i4', 'i4')

csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)
print(csv_data.dtype.names)
print(csv_data['posX'])

with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)
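Regarding the chunking requirement in the question, one possible approach (a sketch only, not tested at the 12 GB scale; the file names and chunk size below are assumptions) is to read the CSV in chunks with pandas, cast each chunk to the narrower dtypes, and append to a resizable h5py dataset:

import h5py
import pandas as pd

csv_path = 'big_file.csv'   # hypothetical input path
h5_path = 'big_file.h5'     # hypothetical output path

# Narrower types requested in the question.
col_dtypes = {'posX': 'f4', 'posY': 'f4', 'posZ': 'f4',
              'eventID': 'i4', 'parentID': 'i4', 'clockTime': 'i4'}

with h5py.File(h5_path, 'w') as h5f:
    dset = None
    # Read the CSV in fixed-size chunks so the whole file never sits in memory.
    for chunk in pd.read_csv(csv_path, dtype=col_dtypes, chunksize=1000000):
        records = chunk.to_records(index=False)
        if dset is None:
            # Create a 1-D resizable dataset from the first chunk.
            dset = h5f.create_dataset('CSV_data', data=records,
                                      maxshape=(None,), chunks=True)
        else:
            # Grow the dataset and write the new chunk at the end.
            dset.resize(dset.shape[0] + records.shape[0], axis=0)
            dset[-records.shape[0]:] = records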
Is there any way to write a value to a specific place in a given .csv file using pandas or the csv module?
I have tried using csv.reader to read the file and find a line which fits my requirements, but I couldn't figure out a way to replace the value in the file with mine.
What I am trying to achieve here is that I have a spreadsheet of names and values. I am using JSON to update the values from the server and after that I want to update my spreadsheet also.
The latest solution I came up with was to create a separate sheet from which I would get the updated data, but it is not working, and there is no fixed order in which the dict is written to the file.
import csv

def updateSheet(fileName, aValues):
    with open(fileName + ".csv") as workingSheet:
        writer = csv.DictWriter(workingSheet, aValues.keys())
        writer.writeheader()
        writer.writerow(aValues)
I will appreciate any guidance and tips.
You can try operating on the specified csv file this way:
import pandas as pd

a = ['one', 'two', 'three']
b = [1, 2, 3]

# Build named columns as Series and combine them side by side into a DataFrame.
english_column = pd.Series(a, name='english')
number_column = pd.Series(b, name='number')
predictions = pd.concat([english_column, number_column], axis=1)

# Or build the DataFrame directly and write it out as csv.
save = pd.DataFrame({'english': a, 'number': b})
save.to_csv('b.csv', index=False, sep=',')
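If the goal is really to change a single value in an existing csv file, as the question describes, one common pattern is to load it into a DataFrame, update the cell by label, and write the whole file back. A sketch with hypothetical column names 'name' and 'value' and a hypothetical file name:

import pandas as pd

df = pd.read_csv('sheet.csv')                     # hypothetical file name
df.loc[df['name'] == 'some_name', 'value'] = 42   # value received from the server
df.to_csv('sheet.csv', index=False)               # rewrite the file with the update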
How can I add a single row to binary table inside large fits file using pyfits, astropy.io.fits or maybe some other python library?
This file is used as a log, so a single row will be added every second, and eventually the size of the file will reach gigabytes. Reading the whole file and writing it back, or keeping a copy of the data in memory and writing it to the file every second, is therefore not feasible. With pyfits or astropy.io.fits, so far I could only read everything into memory, add the new row, and then write it back.
Example. I create a fits file like this:
import numpy, pyfits
data = numpy.array([1.0])
col = pyfits.Column(name='index', format='E', array=data)
cols = pyfits.ColDefs([col])
tbhdu = pyfits.BinTableHDU.from_columns(cols)
tbhdu.writeto('test.fits')
And I want to add some new value to the column 'index', i.e. add one more row to the binary table.
Solution
This is a trivial task for the cfitsio library (method fits_insert_row(...)), so I use a python module which is based on it: https://github.com/esheldon/fitsio
And here is the solution using fitsio. To create a new fits file, one can do:
import fitsio, numpy
from fitsio import FITS,FITSHDR
fits = FITS('test.fits','rw')
data = numpy.zeros(1, dtype=[('index','i4')])
data[0]['index'] = 1
fits.write(data)
fits.close()
To append a row:
fits = FITS('test.fits','rw')
#you can actually use the same already opened fits file,
#to flush the changes you just need: fits.reopen()
data = numpy.zeros(1, dtype=[('index','i4')])
data[0]['index'] = 2
fits[1].append(data)
fits.close()
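For the one-row-per-second log use case from the question, the same append call can simply run in a loop; a minimal sketch built on the snippet above:

import time
import numpy
from fitsio import FITS

fits = FITS('test.fits', 'rw')
row = numpy.zeros(1, dtype=[('index', 'i4')])
for i in range(3, 10):        # a real logger would run indefinitely
    row[0]['index'] = i
    fits[1].append(row)       # appends in place, no full rewrite of the table
    time.sleep(1)
fits.close()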
Thank you for your help.
Is it possible to use PyTables (or pandas) to detect whether an hdf file's table contains a certain column? To load the hdf file I use:
from pandas.io.pytables import HDFStore
# this doesn't read the full file which is good
hdf_store = HDFStore('data.h5', mode='r')
# returns a "Group" object, not sure if this could be used...
hdf_store.get_node('tablename')
I could also use PyTables directly instead of pandas. The aim is not to load all the data in the hdf file, as these files are potentially large and I only want to establish whether a certain column exists.
I may have found a solution, but am unsure of (1) why it works and (2) whether this is a robust solution.
import tables

h5 = tables.open_file('data.h5', mode='r')   # open_file is the current spelling of openFile
df_node = getattr(h5.root, 'tablename')
# Not sure why `axis0` contains the column data, but it seems consistent
# with the tested h5 files.
columns = df_node.axis0[:]
columns contains a numpy array with all the column names.
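A minimal membership check built on this (with the assumption that, under Python 3, PyTables returns the axis0 values as byte strings, so the name may need encoding):

# Check for a column by name; encode because the stored names come back as bytes.
has_column = 'mycolumn'.encode() in columns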
The accepted solution doesn't work for me with Pandas 0.20.3 and PyTables 3.3.0 (the HDF file was created using Pandas). However, this works:
pd.HDFStore('data.hd5', mode='r').get_node('/path/to/pandas/df').table.colnames
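The same lookup can be wrapped in a small existence check (assuming, as above, that the frame was stored in table format under '/path/to/pandas/df'):

import pandas as pd

with pd.HDFStore('data.hd5', mode='r') as store:
    colnames = store.get_node('/path/to/pandas/df').table.colnames
    has_column = 'my_column' in colnames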