I want to read data from a MySQL table and use NumPy to turn the result into a NumPy array. The table contains varchar(128), int, bigint and float columns, so I thought I might read all of the data as strings at first, trying numpy.fromiter:
select_sql = "select * from fb_web_active_group_members_user_mbkmeansclustering_ng_six_test"
count = cur.execute(select_sql)
if count:
user_level_cluster_data = cur.fetchall()
user_level_cluster_data_df = numpy.fromiter(user_level_cluster_data,dtype = numpy.str,count = -1)
But it raises an error:
File "F:/MyDocument/F/My Document/Training/Python/PyCharmProject/FaceBookCrawl/FB_group_user_stability.py", line 21, in get_pre_new_user_level_data
user_level_cluster_data_df = numpy.fromiter(user_level_cluster_data,dtype = numpy.str,count = -1)
ValueError: Must specify length when using variable-size data-type.
Could you please tell me the reason and how to resolve it? Also, if I want to read all the data from the MySQL table with their own data types (not reading them all as strings first), e.g. the varchar(128) data as strings, the int columns as ints, the float columns as floats..., how should I do that?
dtype needs to be the entire, full dtype for a whole record. Your current error occurs because NumPy strings are fixed-capacity, so you'd need to say dtype='S128' for example, to get strings up to 128 characters in capacity. But your actual dtype probably consists of several columns, so you might want something like this:
dtype=[('colA', 'i4'), ('colB', 'f8'), ('colC', 'S128')]
Also note that fromiter() may not be helping you, since you're using fetchall() which I think returns a list anyway. You can simply do:
np.array(user_level_cluster_data, dtype)
Or if you want to use fromiter(), you should pass it the count parameter and use lazy fetching instead of fetchall().
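For illustration, here is a minimal sketch with made-up column names and types mirroring the table layout (varchar(128), int, bigint, float); fetchall() gives a sequence of row tuples, which np.array() can turn into a structured array directly:
import numpy as np

# hypothetical record layout: one field per table column
row_dtype = np.dtype([
    ('user_name', 'S128'),   # varchar(128) -> fixed-width byte string
    ('group_id',  'i4'),     # int
    ('member_id', 'i8'),     # bigint
    ('score',     'f8'),     # float
])

# stand-in for cur.fetchall(): a list of row tuples
rows = [(b'alice', 3, 123456789012, 0.75),
        (b'bob',   7, 987654321098, 0.42)]

arr = np.array(rows, dtype=row_dtype)
print(arr['score'])         # access a whole column, as float64
print(arr[0]['user_name'])  # access one field of one row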
I have a dataframe that I want to save in appendable format to an HDF5 file. The dataframe looks like this:
column1
0 [0, 1, 2, 3, 4]
And the code that replicates the issue is:
import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5))]})
test.to_hdf('test','testgroup',format="table")
Unfortunately, it returns this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-65-c2dbeaca15df> in <module>
1 test = pd.DataFrame({"column1":[list(range(0,5))]})
----> 2 test.to_hdf('test','testgroup',format="table")
/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors, columns)
4979 error_column_label = columns[i] if len(columns) > i else f"No.{i}"
4980 raise TypeError(
-> 4981 f"Cannot serialize the column [{error_column_label}]\n"
4982 f"because its data contents are not [string] but "
4983 f"[{inferred_type}] object dtype"
TypeError: Cannot serialize the column [column1]
because its data contents are not [string] but [mixed] object dtype
I am aware that I can save each value in a separate column. This does not help my extended use case, as there might be variable length lists.
I know I could convert the list to a string and then recreate it based on the string, but if I start converting each column to string, I might as well use a text format, like csv, instead of a binary one like hdf5.
Is there a standard way of saving lists into hdf5 table format?
Python Lists present a challenge when writing to HDF5 because they may contain different types. For example, this is a perfectly valid list: [1, 'two', 3.0]. Also, if I understand your Pandas 'column1' dataframe, it may contain different length lists. There is no (simple) way to represent this as an HDF5 dataset.
[That's why you got the [mixed] object dtype message. The conversion of the dataframe creates an intermediate object that is written as a dataset. The dtype of the converted list data is "O" (object), and HDF5 doesn't support this type.]
However, all is not lost. If we can make some assumptions about your data, we can wrangle it into an HDF5 dataset. Assumptions: 1) all df list entries are the same type (int in this case), and 2) all df lists are the same length. (We can handle different length lists, but it is more complicated.) Also, you will need to use a different package to write the HDF5 data (either PyTables or h5py). PyTables is the underlying package for Pandas HDF5 support and h5py is widely used. The choice is yours.
Before I post the code, here is an outline of the process:
1. Create a NumPy record array (aka recarray) from the dataframe.
2. Define the desired type and shape for the HDF5 dataset (as an Atom for PyTables, or a dtype for h5py).
3. Create the dataset with the Atom/dtype definition above (could be done on 1 line, but it is easier to read this way).
4. Loop over the rows of the recarray (from Step 1), and write the data to the rows of the dataset. This converts each list to the equivalent array.
Code to create recarray (adds 2 rows to your dataframe):
import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5)), list(range(10,15)), list(range(100,105))]})
# create recarray from the dataframe (select test[['column1']] first if you only want that column)
rec_arr = test.to_records(index=False)
PyTables specific code to export data:
import tables as tb
with tb.File('74489101_tb.h5', 'w') as h5f:
    # define "atom" with type and shape of column1 data
    df_atom = tb.Atom.from_type('int32', shape=(len(rec_arr[0]['column1']),))
    # create the dataset
    test = h5f.create_array('/', 'test', shape=rec_arr.shape, atom=df_atom)
    # loop over recarray and populate dataset
    for i in range(rec_arr.shape[0]):
        test[i] = rec_arr[i]['column1']
    print(test[:])
h5py specific code to export data:
import h5py
with h5py.File('74489101_h5py.h5', 'w') as h5f:
df_dt = (int,(len(rec_arr1[0]['column1']),))
test = h5f.create_dataset('test', shape=rec_arr1.shape, dtype=df_dt )
for i in range(rec_arr1.shape[0]):
test[i] = rec_arr1[i]['column1']
print(test[:])
I am running into a weird inconsistency. I had to learn the difference between immutable and mutable data types. For my purposes, I need to convert my pandas DataFrame to NumPy, apply operations and convert it back, as I do not wish to alter my input.
So I am converting it as follows:
mix=pd.DataFrame(array,columns=columns)
def mix_to_pmix(mix,p_tank):
    previous=0
    columns,mix_in=np.array(mix) #<---
    mix_in*=p_tank
    previous=0
    for count,val in enumerate(mix_in):
        mix_in[count]=val+previous
        previous+=val
    return pd.DataFrame(mix_in,columns=columns)
This works perfectly fine, but the line:
columns,mix_in=np.array(mix)
does not seem to behave consistently; in this case:
def to_molfrac(mix):
    columns,mix_in=np.array(mix)
    shape=mix_in.shape
    for i in range(shape[0]):
        mix_in[i,:]*=1/max(mix_in[i,:])
    for k in range(shape[1]-1,0,-1):
        mix_in[:,k]+=-mix_in[:,k-1]
    mix_in=mix_in/mix_in.sum(axis=1)[:,np.newaxis]
    return pd.DataFrame(mix_in,columns=columns)
I receive the error:
ValueError: too many values to unpack (expected 2)
The input of the latter function is the output of the previous function. So it should be the same case.
It's impossible to understand the input of to_molfrac and mix_to_pmix without an example.
But pandas objects have a .values attribute which allows you to access the underlying NumPy array, so it's probably better to use mix_in = mix.values instead:
columns, values = df.columns, df.values
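A minimal sketch of why the unpacking approach is fragile: np.array(mix) contains only the values (not the column names), and unpacking a 2-D array assigns one row per target name, so the first function presumably only worked because that frame happened to have exactly two rows. Using .columns and .values works for any shape (the sample frame below is made up):
import numpy as np
import pandas as pd

# made-up 3x3 frame; unpacking np.array(mix) here would raise
# "too many values to unpack (expected 2)"
mix = pd.DataFrame([[0.2, 0.3, 0.5],
                    [0.1, 0.4, 0.5],
                    [0.6, 0.2, 0.2]], columns=['A', 'B', 'C'])

columns, mix_in = mix.columns, mix.values  # works regardless of row count
print(mix_in.shape)  # (3, 3)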
I am filling a numpy array in Python (could change this to a list if necessary), and I want to fill it with column headings, then enter a loop and fill the table with values. I am struggling with which type to use for the array. I have something like this so far...
info = np.zeros(shape=(no_of_label+1,19),dtype = np.str) #Creates array to store coordinates of particles
info[0,:] = ['Xpos','Ypos','Zpos','NodeNumber','BoundingBoxTopX','BoundingBoxTopY','BoundingBoxTopZ','BoundingBoxBottomX','BoundingBoxBottomY','BoundingBoxBottomZ','BoxVolume','Xdisp','Ydisp','Zdisp','Xrot','Yrot','Zrot','CC','Error']
for i in np.arange(1,no_of_label+1,1):
    info[i,:] = [C[0],C[1],C[2],i,int(round(C[0]-b)),int(round(C[1]-b)),int(round(C[2]-b)),int(round(C[0]+b)),int(round(C[1]+b)),int(round(C[2]+b)),volume,0,0,0,0,0,0,0,0] # Fills an array with label No., size of box, and co-ords
np.savetxt(save_path+Folder+'/Data_'+Folder+'.csv',information,fmt = '%10.5f' ,delimiter=",")
There are other things in the loop, but they are irrelevant. C is an array of floats, b is an int.
I also need to be able to save it as a csv file as shown in the last line, and open it in excel.
What I have now returns all the values as integers, when I need C[0], C[1], C[2] to be floating point.
Thanks in advance!
It depends on what you want to do with this array, but I think you want to use dtype=object instead of np.str. You can do that explicitly by changing np.str to object, or here is how I would write the first part of your code:
import numpy as np
labels = ['Xpos','Ypos','Zpos','NodeNumber','BoundingBoxTopX','BoundingBoxTopY',
'BoundingBoxTopZ','BoundingBoxBottomX','BoundingBoxBottomY','BoundingBoxBottomZ',
'BoxVolume','Xdisp','Ydisp','Zdisp','Xrot','Yrot','Zrot','CC','Error']
no_of_label = len(labels)
#make a list of length ((no_of_label+1)*19) and convert it to an array and reshape it
info = np.array([None]*((no_of_label+1)*19)).reshape(no_of_label+1, 19)
info[0] = labels
Again, there is probably a better way of doing this if you have a specific application in mind, but this should let you store different types of data in the same 2D array.
I have solved it as follows:
info = np.zeros(shape=(no_of_label+1,19),dtype=float)
for i in np.arange(1,no_of_label+1,1):
    info[i-1] = [C[0],C[1],C[2],i,int(round(C[0]-b)),int(round(C[1]-b)),int(round(C[2]-b)),int(round(C[0]+b)),int(round(C[1]+b)),int(round(C[2]+b)),volume,0,0,0,0,0,0,0,0]
np.savetxt(save_path+Folder+'/Data_'+Folder+'.csv',information,fmt = '%10.5f' ,delimiter=",",header='Xpos,Ypos,Zpos,NodeNumber,BoundingBoxTopX,BoundingBoxTopY,BoundingBoxTopZ,BoundingBoxBottomX,BoundingBoxBottomY,BoundingBoxBottomZ,BoxVolume,Xdisp,Ydisp,Zdisp,Xrot,Yrot,Zrot,CC,Error',comments='')
I used the header argument built into NumPy's savetxt. Thanks everyone!
I want to import a dirty CSV file into a NumPy object. A small number of values are apparently not integers or floats, because the output does not have the correct dtype.
The code I use:
d.data = np.genfromtxt(inputtable, delimiter=";",skip_header=2, comments="#", dtype=np.float)
I would like to know if there is an easy way to just replace all non-floats with -1, so that I do not need to find these values by hand in the 10,000+ rows.
You just have to provide a set of callbacks as the converters argument, as documented here:
converters : variable, optional
The set of functions that convert the data of a column to a value. The converters can also be used to provide a default value for missing
data: converters = {3: lambda s: float(s or 0)}.
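For instance, here is a minimal sketch (with a small made-up semicolon-delimited sample and a hypothetical to_float helper) where every column gets a converter that falls back to -1 whenever a field cannot be parsed as a float:
import io
import numpy as np

dirty = io.StringIO("1.0;2.5;oops\n3.0;n/a;4.5\n")

def to_float(field):
    try:
        return float(field)
    except (TypeError, ValueError):
        return -1.0

n_cols = 3  # assumed known here; otherwise peek at one line of the file
data = np.genfromtxt(dirty, delimiter=";",
                     converters={i: to_float for i in range(n_cols)})
print(data)
# [[ 1.   2.5 -1. ]
#  [ 3.  -1.   4.5]]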
I have trouble getting numpy to load tabular data and automatically generate column names. It seems pretty simple but I cannot nail it.
If I knew the number of columns I could easily create the names parameter, but I don't have this knowledge, and I would like to avoid prior introspection of the data file.
How can I force numpy to generate the column names, or use tuple-like dtype automatically, when I have no knowledge how many columns there are in file? I want to manipulate the column names after reading the data.
My approaches so far:
data = np.genfromtxt(tar_member, unpack = True, names = '') - I wanted to force automatic generation of column names by giving some "empty" parameter. Results in the error ValueError: size of tuple must match number of fields.
data = np.genfromtxt(tar_member, unpack = True, names = True) - "Works" but consumes 1st row of data.
data = np.genfromtxt(tar_member, unpack = True, dtype = None) - Worked for data with mixed types. Automatic type guessing expanded dtype into a tuple, and assigned the names. However, for data where everything was actually float, dtype was set to float64, and I got ValueError: there are no fields defined when I tried accessing data.dtype.names.
I know there is a cleaner way to do this, but if you don't mind forcing the issue you can generate your dtype structure and assign it directly to the data array.
import numpy

x = numpy.random.rand(10, 10)
numpy.savetxt('test.out', x, delimiter=',')
dataa = numpy.genfromtxt('test.out', delimiter=",", dtype=None)
if dataa.dtype.names is None:  # then dataa is homogeneous
    l1 = [('f%d' % z, dataa.dtype) for z in range(dataa.shape[1])]
    dataa.dtype = numpy.dtype(l1)
dataa.dtype
dataa.dtype.names