Convert bytea to ndarray of np.float32 - Python

I have an ndarray of np.float32 that is saved in a Postgres database in the bytea format:
import pandas as pd
import numpy as np
import sqlite3
myndarray=np.array([-3.55219245e-02, 1.33227497e-01, -4.96977456e-02, 2.16857344e-01], dtype=np.float32)
myarray=[myndarray.tobytes()]
mydataframe=pd.DataFrame(myarray, columns=['Column1'])
mydataframe.to_sql('mytable', sqlite3.connect("/tmp/floats.sqlite"))
In SQLITE3, this will produce:
CREATE TABLE IF NOT EXISTS "mytable" ("index" INTEGER, "Column1" TEXT);
INSERT INTO mytable VALUES(0,X'707f11bdca6c083edd8f4bbdda0f5e3e');
In Postgresql, this will produce:
mydatabase=# select * from mytable;
index | Column1
-------+------------------------------------
0 | \x707f11bdca6c083edd8f4bbdda0f5e3e
That column is in bytea format. How do I convert that \x707f... back to myndarray? I'm no expert here; I've found a lot of obscure documentation about frombuffer(), Python 2's buffer(), and memoryview(), but I am far from a proper result.
My best so far is:
np.frombuffer(bytearray('707f11bdca6c083edd8f4bbdda0f5e3e', 'utf-8'), dtype=np.float32)
which is completely wrong (myndarray has 4 values; here the 32 hex characters are read as 32 raw ASCII bytes, which decode to 8 garbage float32 values):
[2.1627062e+23 1.6690035e+22 3.3643249e+21 5.2896255e+22 2.1769183e+23
 1.6704162e+22 2.0823326e+23 5.2948159e+22]

After a lot of trial and error (again, I don't know Python), I found a solution.
ndarray=np.frombuffer(bytes.fromhex("707f11bdca6c083edd8f4bbdda0f5e3e"), np.float32)
print(ndarray)
# [-0.03552192 0.1332275 -0.04969775 0.21685734]
print(type(ndarray))
# <class 'numpy.ndarray'>
print(type(ndarray[0]))
# <class 'numpy.float32'>
Now, the full example:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

ngine=create_engine('postgresql://postgres:mypassword@localhost/mydatabase')
myndarray=np.array([-3.55219245e-02, 1.33227497e-01, -4.96977456e-02, 2.16857344e-01], dtype=np.float32)
print(myndarray)
# [-0.03552192 0.1332275 -0.04969775 0.21685734]
myarray=[myndarray.tobytes()]
mydataframe=pd.DataFrame(myarray, columns=['Column1'])
mydataframe.to_sql('mytable', ngine)
# SELECT * FROM mytable
#
#index Column1
# 0 \x707f11bdca6c083edd8f4bbdda0f5e3e
bytea=pd.read_sql(sql='SELECT * FROM mytable', con=ngine).iloc[0]['Column1']
print(type(bytea))
# <class 'str'>
print(bytea)
# \x707f11bdca6c083edd8f4bbdda0f5e3e
print(bytea[2:])
# 707f11bdca6c083edd8f4bbdda0f5e3e
# Surprise! The \x is not interpreted!
ndarray=np.frombuffer(bytes.fromhex(bytea[2:]), np.float32)
print(ndarray)
# [-0.03552192 0.1332275 -0.04969775 0.21685734]
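An alternative that skips the hex parsing entirely (a sketch, assuming the psycopg2 driver, which hands bytea columns back as raw memoryview/bytes objects instead of a \x string):
import numpy as np
import psycopg2

conn = psycopg2.connect("dbname=mydatabase user=postgres password=mypassword host=localhost")
cur = conn.cursor()
cur.execute('SELECT "Column1" FROM mytable LIMIT 1')
raw = cur.fetchone()[0]                       # memoryview of the bytea value
ndarray = np.frombuffer(bytes(raw), dtype=np.float32)
print(ndarray)
# [-0.03552192  0.1332275  -0.04969775  0.21685734]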
Thanks, @hpaulj, @PranavHosangadi, @juanpa.arrivillaga, @ACarter.

Related

How to convert array to dataframe

This is my output from the previous code:
from pandas import DataFrame
cursor.execute("SELECT * FROM DOCTOR")
row = cursor.fetchall()
row
How do I convert this array into a data frame? The expected output is shown in the picture below.
import numpy as np
df = DataFrame(data=np.array(row),
               index=np.arange(len(row)),
               columns=['Doctor_id','Doctor_Name','Doctor_Spl'])
You appear to be using a plain cursor from a SQL client library.
Pandas has its own read_sql function that wraps a SQL client library and returns a DataFrame.
You can also do most SQL queries directly on DataFrame objects if you start with read_sql_table - https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql
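For example, a minimal sketch of that route (the connection string here is hypothetical; replace it with your own database URL):
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details; adjust to your database.
engine = create_engine('postgresql://user:password@localhost/mydb')

# pandas runs the query and hands back a DataFrame directly.
df = pd.read_sql("SELECT * FROM DOCTOR", engine)
df.columns = ['Doctor_id', 'Doctor_Name', 'Doctor_Spl']
print(df)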
Convert the fetched rows to a NumPy array and use that to build the data frame:
import numpy as np
import pandas as pd
cursor.execute("SELECT * FROM DOCTOR")
row = cursor.fetchall()
numpy_array_row = np.array(row)
data_frame = pd.DataFrame(numpy_array_row, columns=['Doctor_id', 'Doctor_Name', 'Doctor_Spl'])
# print your data_frame and voila :)
This can be done using the pandas.DataFrame() method.
In your case it would look like this:
DataFrame(row, columns=('Doctor_id', 'Doctor_Name', 'Doctor_Spl'))
One possible solution:
rows = [
    ("D001", "TIM1", "orthopedic"),
    ("D002", "Ram", "General medicine"),
    ("D003", "Sarala", "gynaecology"),
    ("D004", "TIM", "Ayurvedic"),
    ("D005", "viny", "Hemeopathy"),
]
df = pd.DataFrame(rows, columns=["Doctor_id", "Doctor_Name", "Doctor_Spl"])
df.index += 1 # <-- if you need index starting from `1`
print(df)
Prints:
  Doctor_id Doctor_Name        Doctor_Spl
1      D001        TIM1        orthopedic
2      D002         Ram  General medicine
3      D003      Sarala       gynaecology
4      D004         TIM         Ayurvedic
5      D005        viny        Hemeopathy

"IndexError: too many indices for array" while merging VIPERS and PRIMUS

Hi, I'm trying to extract RA, Dec, and redshift information from two surveys (PRIMUS and VIPERS) and collect them into a single ndarray.
The code is as follows:
from astropy.io import fits
import numpy as np

hdulist_PRIMUS = fits.open('data/PRIMUS_2013_zcat_v1.fits')
data_PRIMUS = hdulist_PRIMUS[1].data
data_PRIMUS = np.column_stack((data_PRIMUS['RA'], data_PRIMUS['DEC'],
                               data_PRIMUS['Z'], data_PRIMUS['FIELD']))
data_PRIMUS = np.array(filter(lambda x: x[3].strip() == 'xmm', data_PRIMUS))[:, :3]
data_PRIMUS = np.array(map(lambda x: [float(x[0]), float(x[1]), float(x[2])], data_PRIMUS))

hdulist_VIPERS = fits.open('data/VIPERS_W1_SPECTRO_PDR2.fits')
data_VIPERS = hdulist_VIPERS[1].data
data_VIPERS = np.column_stack((data_VIPERS['alpha'], data_VIPERS['delta'], data_VIPERS['zspec']))

from astropy import units as u
from astropy.coordinates import SkyCoord

PRIMUS_catalog = SkyCoord(ra=data_PRIMUS[:, 0]*u.degree, dec=data_PRIMUS[:, 1]*u.degree)
VIPERS_catalog = SkyCoord(ra=data_VIPERS[:, 0]*u.degree, dec=data_VIPERS[:, 1]*u.degree)

idx, d2d, d3d = PRIMUS_catalog.match_to_catalog_sky(VIPERS_catalog)

feasible_indices = np.array(map(
    lambda x: x[0],
    filter(lambda x: x[1].value > 1e-3, zip(idx, d2d))))

data_VIPERS = data_VIPERS[feasible_indices]
data_HZ = np.vstack((data_PRIMUS, data_VIPERS))
When I run this, I get an "IndexError: too many indices for array".
Datasets:
PRIMUS Redshift Catalog - https://primus.ucsd.edu/version1.html
VIPERS Redshift Catalog - https://projects.ift.uam-csic.es/skies-universes/VIPERS/photometry/
I think there are a few ways you're doing this where you're making it harder for yourself by not using existing, available tools effectively. For example, since you are working with tabular data from a FITS file, you can take advantage of Astropy's Table interface:
>>> from astropy.table import Table
>>> primus = Table.read('PRIMUS_2013_zcat_v1.fits')
(for this particular file I got some warnings about some of the headers in the table being non-standard, but this can be ignored).
If you want to do some operations on just a few columns of the table, you can do this easily. For example, rather than selecting a few columns and stacking them into a new array, as you did:
np.column_stack((data_PRIMUS['RA'], data_PRIMUS['DEC'],
                 data_PRIMUS['Z'], data_PRIMUS['FIELD']))
you can select a subset of columns from the table like so:
>>> primus[['RA', 'DEC', 'Z', 'FIELD']]
<Table length=213696>
RA DEC Z FIELD
degree degree
float64 float64 float32 bytes13
------------------ ------------------- ---------- -------------
52.892275339281994 -27.833172368069615 0.3420992 calib
52.88448889270391 -27.85252305560996 0.4824943 calib
52.880363885710295 -27.86221750021335 0.33976158 calib
52.88334306466262 -27.86937808271639 0.6134631 calib
52.8866138857103 -27.871773055662942 0.58744365 calib
52.885607068267845 -27.889578785511922 0.26873255 calib
... ... ... ...
34.54856 -4.5544 0.8544105 xmm
34.56942 -4.57564 0.6331108 xmm
34.567412432719756 -4.572718190305209 1.1456184 xmm
34.57134 -4.56414 0.6346616 xmm
34.58088 -4.56804 1.081143 xmm
34.58686 -4.57449 0.7471819 xmm
Then it seems you select the RA, DEC, and Z columns where the field is xmm by using a filter function, but as these are Numpy arrays you can use the filtering capabilities built into Numpy array indexing, as well as Table indexing. The only tricky part is that since these are fixed width character fields you do still need to perform comparisons correctly. You can use Numpy's string functions like np.char.startswith for this:
>>> primus = primus[np.char.startswith(primus['FIELD'], b'xmm')]
In the process of doing a performance comparison, I realized this line is where you're probably getting the error IndexError: too many indices for array:
>>> np.array(filter(lambda x: x[3].strip() == 'xmm', primus))
array(<filter object at 0x7f5170981940>, dtype=object)
In Python 3, the filter function returns an iterable, so wrapping it in np.array() just makes a 0-D array containing this Python object; it's probably not what you intended, so it fails here (this is where looking at the traceback might have been useful). Even if you wrapped the filter() call in list() it wouldn't work, because np.array() only takes homogeneous arrays normally. So an approach like the one I gave is perfectly sufficient (though there may be slightly more efficient ways). It also makes the next line:
np.array(map(lambda x: [float(x[0]), float(x[1]), float(x[2])], data_PRIMUS))
unnecessary. In particular, the first three columns are already in floating point format so this would not be necessary anyways.
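To make that failure concrete, here is a minimal sketch (with made-up rows, not the FITS data) of what handing a filter object to np.array() does in Python 3:
import numpy as np

rows = [(34.5, -4.5, 0.85, 'xmm'), (52.9, -27.8, 0.34, 'calib')]
wrapped = np.array(filter(lambda r: r[3] == 'xmm', rows))
print(wrapped.shape, wrapped.dtype)   # () object -- a 0-d array holding the filter object
# wrapped[:, :3]                      # IndexError: too many indices for array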
Some similar advice applies to the other parts of your code. I'd have written it more like this:
import numpy as np
from astropy.table import Table, vstack
from astropy import units as u
from astropy.coordinates import SkyCoord
primus = Table.read('PRIMUS_2013_zcat_v1.fits')
primus_field = primus['FIELD']
primus = primus[['RA', 'DEC', 'Z']]
primus = primus[np.char.startswith(primus_field, b'xmm')]
vipers = Table.read('VIPERS_W1_SPECTRO_PDR2.fits')[['alpha', 'delta', 'zspec']]
primus_catalog = SkyCoord(ra=primus['RA']*u.degree, dec=primus['DEC']*u.degree)
vipers_catalog = SkyCoord(ra=vipers['alpha']*u.degree, dec=vipers['delta']*u.degree)
idx, d2d, d3d = primus_catalog.match_to_catalog_sky(vipers_catalog)
feasible_indices = idx[d2d > 1e-3 * u.degree]
vipers = vipers[feasible_indices]
vipers.rename_columns(['alpha', 'delta', 'zspec'], ['RA', 'DEC', 'Z'])
hz = vstack([primus, vipers])
Please let me know if there are any parts of this you have questions on.

Error when trying to save hdf5 row where one column is a string and the other is an array of floats

I have two columns: one is a string, and the other is a NumPy array of floats.
import numpy as np
import pandas as pd

a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])
I would like to store them in a row as separate columns in an HDF5 file. For that I am using pandas:
hdf_key = 'hdf_key'
store5 = pd.HDFStore('file.h5')
z = pd.DataFrame(
    {
        'string': [a],
        'array': [b]
    })
store5.append(hdf_key, z, index=False)
store5.close()
However, I get this error
TypeError: Cannot serialize the column [array] because
its data contents are [mixed] object dtype
Is there a way to store this to h5? If so, how? If not, what's the best way to store this sort of data?
I can't help you with pandas, but I can show you how to do this with PyTables.
Basically you create a table referencing either a numpy recarray or a dtype that defines the mixed datatypes.
Below is a super simple example to show how to create a table with 1 string and 4 floats. Then it adds rows of data to the table.
It shows 2 different methods to add data:
1. A list of tuples (1 tuple for each row) - see append_list
2. A numpy recarray (with dtype matching the table definition) - see simple_recarr in the for loop
To get the rest of the arguments for create_table(), read the Pytables documentation. It's very helpful, and should answer additional questions. Link below:
PyTables User's Guide
import tables as tb
import numpy as np

with tb.open_file('SO_55943319.h5', 'w') as h5f:
    my_dtype = np.dtype([('A','S16'),('b',float),('c',float),('d',float),('e',float)])
    dset = h5f.create_table(h5f.root, 'table_data', description=my_dtype)
    # Append one row using a list:
    append_list = [('test string', -2.355, 1.957, 1.266, -6.913)]
    dset.append(append_list)
    # Append rows using a recarray that matches the table's dtype:
    simple_recarr = np.recarray((1,), dtype=my_dtype)
    for i in range(5):
        simple_recarr['A'] = 'string_' + str(i)
        simple_recarr['b'] = 2.0*i
        simple_recarr['c'] = 3.0*i
        simple_recarr['d'] = 4.0*i
        simple_recarr['e'] = 5.0*i
        dset.append(simple_recarr)

print('done')
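If you later want the data back in pandas, the table reads straight into a DataFrame (a small sketch, assuming the file and table created above):
import pandas as pd
import tables as tb

with tb.open_file('SO_55943319.h5', 'r') as h5f:
    recs = h5f.root.table_data.read()       # structured numpy array
df = pd.DataFrame(recs)
df['A'] = df['A'].str.decode('utf-8')       # byte strings back to Python str
print(df)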

Store ndarray in a PyTable (and how to define the Col()-type)

TL;DR: I have a PyTable with a float32 Col and get an error when writing a numpy-float32-array into it. (How) can I store a numpy-array (float32) in the Column of a PyTables table?
I'm new to PyTables - following a recommendation of TFtables (a lib to use HDF5 in TensorFlow), I'm using it to store all my HDF5 data (currently distributed in batches over several files, each with three datasets) within a table in a single HDF5 file. The datasets are
'data' : (n_elements, 1024, 1024, 4)#float32
'label' : (n_elements, 1024, 1024, 1)#uint8
'weights' : (n_elements, 1024, 1024, 1)#float32
where the n_elements are distributed over several files that I want to merge into one now (to allow unordered access).
So when I build my table, I figured each dataset represents a column. I built everything in a generic way that allows to do this for an arbitrary number of datasets:
# gets dtypes (and shapes) of the dsets (accessed by dset_keys = ['data', 'label', 'weights']
dtypes, shapes = _determine_shape(hdf5_files, dset_keys)
# to dynamically generate a table, I'm using a dict (not a class as in the PyTables tutorials)
# the dict is (conform with the doc): { 'col_name' : Col()-class-descendent }
table_description = {dset_keys[i]: tables.Col.from_dtype(dtypes[i]) for i in range(len(dset_keys))}
# create a file, a group-node and attach a table to it
h5file = tables.open_file(destination_file, mode="w", title="merged")
group = h5file.create_group("/", 'main', 'Node for data table')
table = h5file.create_table(group, 'data_table', table_description, "Collected data with %s" % (str(val_keys)))
The dtypes that I get for each dset (read with h5py) are obviously the ones of the numpy arrays (ndarray) that reading the dset returns: float32 or uint8. So the Col()-types are Float32Col and UInt8Col. I naively assumed that I can now write a float32-array into this col, but filling in data with:
dummy_data = np.zeros([1024,1024,3], np.float32) # normally data read from other files
sample = table.row
sample['data'] = dummy_data
results in TypeError: invalid type (<class 'numpy.ndarray'>) for column ``data``. So now I feel stupid for assuming I'd be able to write an array in there, BUT there are no "ArrayCol()" types offered, neither are there any hints in the PyTables doc as to whether or how it is possible to write an array into a column. How do I do this?
There are "shape" arguments in the Col() class and it's descendents, so it should be possible, otherwise what are these for?!
I know it's a bit late, but I think the answer to your problem lies in the shape parameter for Float32Col.
Here's how it's used in the documentation:
from tables import *
from numpy import *

# Describe a particle record
class Particle(IsDescription):
    name        = StringCol(itemsize=16)   # 16-character string
    lati        = Int32Col()               # integer
    longi       = Int32Col()               # integer
    pressure    = Float32Col(shape=(2,3))  # array of floats (single-precision)
    temperature = Float64Col(shape=(2,3))  # array of doubles (double-precision)

# Open a file in "w"rite mode
fileh = open_file("tutorial2.h5", mode = "w")
# Get the HDF5 root group
root = fileh.root
# Create the groups:
for groupname in ("Particles", "Events"):
    group = fileh.create_group(root, groupname)
# Now, create and fill the tables in Particles group
gparticles = root.Particles
# Create 3 new tables
for tablename in ("TParticle1", "TParticle2", "TParticle3"):
    # Create a table
    table = fileh.create_table("/Particles", tablename, Particle, "Particles: "+tablename)
    # Get the record object associated with the table:
    particle = table.row
    # Fill the table with 257 particles
    for i in xrange(257):
        # First, assign the values to the Particle record
        particle['name'] = 'Particle: %6d' % (i)
        particle['lati'] = i
        particle['longi'] = 10 - i
        ########### Detectable errors start here. Play with them!
        particle['pressure'] = array(i*arange(2*3)).reshape((2,4))  # Incorrect
        #particle['pressure'] = array(i*arange(2*3)).reshape((2,3)) # Correct
        ########### End of errors
        particle['temperature'] = (i**2)    # Broadcasting
        # This injects the Record values
        particle.append()
    # Flush the table buffers
    table.flush()
Here's the link to the part of the documentation I'm referring to
https://www.pytables.org/usersguide/tutorials.html
Edit: I just saw that the tables.Col.from_type(type, shape) allows using the precision of a type (float32 instead of float alone). The rest stays the same (takes a string and shape).
The factory function tables.Col.from_kind(kind, shape) can be used to construct a Col-Type that supports ndarrays. What "kind" is and how to use this isn't documented anywhere I found; however with trial and error I found that allowed "kind"s are strings of basic datatypes. I.e.: 'float', 'uint', ... without the precision (NOT 'float64')
Since I get numpy.dtypes from h5py reading a dataset (dset.dtype), these have to be cast to str and the precision needs to be removed.
In the end the relevant lines look like this:
# get key, dtype and shapes of elements per dataset from the datasource files
val_keys, dtypes, element_shapes = _get_dtypes(datasources, element_axis=element_axis)
# for storing arrays in columns apparently one has to use "kind"
# "kind" cannot be created with dtype but only a string representing
# the dtype w/o precision, e.g. 'float' or 'uint'
dtypes_kind = [''.join(i for i in str(dtype) if not i.isdigit()) for dtype in dtypes]
# create table description as dictionary
description = {val_keys[i]: tables.Col.from_kind(dtypes_kind[i], shape=element_shapes[i]) for i in range(len(val_keys))}
Then writing data into the table finally works as suggested:
sample = table.row
sample[key] = my_array
Since it all felt a bit "hacky" and isn't documented well, I am still wondering whether this is really an intended use of PyTables, and will leave this question open for a bit to see if someone knows more about this...
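For reference, the shape-based route from the other answer also works with the dataset shapes from my question, without going through from_kind. A minimal sketch (file and node names are made up):
import numpy as np
import tables as tb

# Declare each column with an explicit shape instead of deriving a "kind".
description = {
    'data':    tb.Float32Col(shape=(1024, 1024, 4)),
    'label':   tb.UInt8Col(shape=(1024, 1024, 1)),
    'weights': tb.Float32Col(shape=(1024, 1024, 1)),
}

with tb.open_file('merged.h5', mode='w', title='merged') as h5file:
    group = h5file.create_group('/', 'main', 'Node for data table')
    table = h5file.create_table(group, 'data_table', description, 'Collected data')
    sample = table.row
    sample['data'] = np.zeros((1024, 1024, 4), dtype=np.float32)
    sample['label'] = np.zeros((1024, 1024, 1), dtype=np.uint8)
    sample['weights'] = np.zeros((1024, 1024, 1), dtype=np.float32)
    sample.append()
    table.flush()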

Python: Numpy, the data from genfromtxt cannot be transposed

For example, if I have a data.csv file, it looks like:
A B
1| 1 2
2| 3 4
If I load this data this way:
import numpy as np
from StringIO import StringIO

with open('test.csv', 'rb') as text:
    b = np.genfromtxt(StringIO(text.read()),
                      delimiter=",",
                      dtype=[('x', int), ('y', int)])  # <=== in my case I want my columns'
                                                       #      dtypes in different formats
print b.transpose()
it still gives me this, which has not been transposed:
[(1,2)
(3,4)]
But if I use dtype = 'str' (which means all columns have the same format), np.transpose() works fine.
===============================================
But if I do it this way:
import numpy as np
a = np.array([(1,2), (3,4)])
print a.transpose()
it works correctly
======= MY QUESTION =======
If I load NumPy data from a txt or csv file and define each column's data type, is there any way to make the transpose work properly?
This is what it is designed to do; see the documentation:
T Same as self.transpose(), except that self is returned if self.ndim < 2.
Here you have ndim of 1 for the b array (you may think it has a shape of (2,2) but it has a shape of (2,), be aware), so T returns itself.
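If you need a plain 2-D array that transposes the way you expect, one option (a sketch; numpy.lib.recfunctions.structured_to_unstructured needs NumPy 1.16 or newer) is to convert the structured array first:
import numpy as np
from numpy.lib import recfunctions as rfn

# The same structured layout genfromtxt produced for b.
b = np.array([(1, 2), (3, 4)], dtype=[('x', int), ('y', int)])
plain = rfn.structured_to_unstructured(b)   # ordinary (2, 2) int array
print(plain.transpose())
# [[1 3]
#  [2 4]]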
