how to append an array into a list - python

I have a problem appending an array created with the numpy library to a list. Here is my Python code:
import numpy as np

# to get the array's column names
columnData = [x[0] for x in curHeader.description]
# to get the data result
rData = curHeader.fetchall()

listData = []
# loop over the data
for i in rData:
    arrayData = np.asarray(dict(zip(columnData, i)))
    # print data
    print(arrayData)
    # {'KD_VAL': 'USD', 'FOB': None, 'FREIGHT': None, 'CIF': 33090.0}
    # sample: append data into the list
    listData.append(arrayData)

# Convert listData to json
# Insert json into MongoDB using insert_many
Unfortunately, the array can't be inserted into MongoDB, even though my code doesn't produce an error. Is there any logic missing?
Thanks!

Sorry guys, I found an answer after several trials.
The point is simply to change the code to listData.append(arrayData.tolist()).
Thanks to @Santosh Kumar.
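For context, here is a minimal sketch of that fix in place (the MongoDB collection name and connection details are placeholders, not from the original post). np.asarray() wraps each dict in a 0-d object array, and .tolist() unwraps it back into a plain dict that PyMongo can serialize:
import numpy as np
from pymongo import MongoClient

my_collection = MongoClient()['mydb']['mycollection']  # hypothetical connection details

listData = []
for i in rData:
    arrayData = np.asarray(dict(zip(columnData, i)))
    # .tolist() on a 0-d object array returns the underlying dict
    listData.append(arrayData.tolist())

my_collection.insert_many(listData)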

If it is possible to skip using numpy arrays entirely, this will work:
list_data = []
for row in rData:
    list_data.append(dict(zip(columnData, row)))
# my_mongo_collection.insert_many(list_data)

Numpy issue with converting list into array

To start this topic off: I've created a stock market environment whose observation is returned through the function below. The field 'df' is a pandas DataFrame loaded from a CSV file, and I return one step (row) of the DataFrame to build the observation. My issue is that when I assign the data to the observation field, it returns different values than those in the data sheet.
def _next_observation(self):
    observation = [
        self.df.loc[self.current_step, 'Open'],
        self.df.loc[self.current_step, 'High'],
        self.df.loc[self.current_step, 'Low'],
        self.df.loc[self.current_step, 'Close'],
        self.df.loc[self.current_step, 'Volume'],
        self.account_quantity
    ]
    # Add indicators
    if self.indicators is not None:
        for _ in range(len(self.indicators)):
            observation.append(self.df.loc[self.current_step, self.indicators[_][0]])
    print(observation)                         # prints normally
    self.observation = np.array(observation)   # Hmmmmmm
    print(self.observation)                    # prints strangely
    exit(1)
    return self.observation
The first observation returned in the step, which is incorrect, is listed below:
[ 5.17610000e-01 5.17810000e-01 5.15010000e-01 5.15370000e-01
5.18581850e+06 0.00000000e+00 3.76286621e+01 5.15838144e-01
-1.86428571e-05]
I have narrowed the issue down to a single line of code.
The correct data, presented as a list rather than a numpy array, is below:
[0.51761, 0.51781, 0.51501, 0.51537, 5185818.5, 0, 37.62866206258322, 0.5158381442961018, -1.864285714292535e-05]
If anyone has any tips on how to solve this issue, please let me know; I don't understand why this is happening. I usually don't ask for help, but this is a first. I also have an agent (A2C) that keeps returning 0 as its action, and I believe the data is to blame.
Sincerely, Richard
The data is just in exponential (scientific) notation but is identical. To suppress exponential notation in numpy you can do the following:
numpy.set_printoptions(suppress=True)
or you can use formatted printing of variables such as:
for item in mylist:
    print(f"{item:0.3f}")

Is there a way to extend a PyTables EArray in the second dimension?

I have a 2D array that can grow to larger sizes than I'm able to fit in memory, so I'm trying to store it in an h5 file using PyTables. The number of rows is known beforehand, but the length of each row is not known and varies between rows. After some research, I thought something along these lines would work, where I can set the extendable dimension as the second dimension.
import os
import tempfile
import numpy as np
from tables import open_file, Int32Atom

filename = os.path.join(tempfile.mkdtemp(), 'example.h5')
h5_file = open_file(filename, mode="w", title="Example Extendable Array")
h5_group = h5_file.create_group("/", "example_on_dim_2")
e_array = h5_file.create_earray(h5_group, "example", Int32Atom(shape=()), (100, 0))  # assume the number of rows is 100

# Add an item at index 2
print(e_array[2])  # should print an empty array
e_array[2] = np.append(e_array[2], 5)  # add the value 5 to row 2
print(e_array[2])  # should print [5], currently prints an empty array
I'm not sure if it's possible to add elements in this way (I might have misunderstood the way earrays work), but any help would be greatly appreciated!
You're close, but you have a small misunderstanding of some of the arguments and behavior. When you create the EArray with shape=(100, 0), you don't have any data yet, just an object designated to have 100 rows to which columns can be appended. Also, you need to use e_array.append() to add data, not np.append(). Finally, if you are going to create a very large array, consider setting the expectedrows= parameter for improved performance as the EArray grows.
Take a look at this code.
import tables as tb
import numpy as np

filename = 'example.h5'
with tb.File(filename, mode="w", title="Example Extendable Array") as h5_file:
    h5_group = h5_file.create_group("/", "example_on_dim_2")
    # Assume the number of rows is 100
    #e_array = h5_file.create_earray(h5_group, "example", Int32Atom(shape=()), (100, 0))
    e_array = h5_file.create_earray(h5_group, "example", atom=tb.IntAtom(), shape=(100, 0))
    print(e_array.shape)
    e_array.append(np.arange(100, dtype=int).reshape(100, 1))  # append a column of values
    print(e_array.shape)
    print(e_array[2])  # prints [2]
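As a follow-up sketch (an extension of the snippet above, not part of the original answer): reopening the file and appending further (100, k) blocks keeps growing the second dimension, and a full row can then be read back with ordinary slicing.
import tables as tb
import numpy as np

with tb.open_file('example.h5', mode="a") as h5_file:
    e_array = h5_file.root.example_on_dim_2.example
    # each appended block must have 100 rows; its column count is what grows
    e_array.append(np.zeros((100, 2), dtype=int))
    print(e_array.shape)   # now (100, 3)
    print(e_array[2, :])   # row 2 across all columns, e.g. [2 0 0]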
Here is an example showing how to create a VLArray (Variable Length). It is similar to the EArray example above, and follows the example from the Pytables doc (link in comment above). However, although a VLArray supports variable length rows, it does not have a mechanism to add items to an existing row (AFAIK).
import tables as tb
import numpy as np

filename = 'example_vlarray.h5'
with tb.File(filename, mode="w", title="Example Variable Length Array") as h5_file:
    h5_group = h5_file.create_group("/", "vl_example")
    vlarray = h5_file.create_vlarray(h5_group, "example", tb.IntAtom(), "ragged array of ints")
    # Append some (variable length) rows:
    vlarray.append(np.array([0]))
    vlarray.append(np.array([1, 2]))
    vlarray.append([3, 4, 5])
    vlarray.append([6, 7, 8, 9])
    # Now, read it through an iterator:
    print('-->', vlarray.title)
    for x in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, x))

pytables and pandas string padding question

I've created a dataset using the hdf5cpp library with a fixed-size string (a requirement). However, when loading with pytables or pandas, the strings are always represented like:
b'test\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff
That is, the string value 'test' with the padding after it. Does anyone know a way to suppress or not show this padding data? I really just want 'test' shown. I realise this may be correct behaviour.
My hdf5cpp setup for strings:
strType = H5Tcopy(H5T_C_S1);
status = H5Tset_size(strType, 36);
H5Tset_strpad(strType, H5T_STR_NULLTERM);
I can't help with your C Code. It is possible to work with padded strings in Pytables. I can read data written by a C application that creates a struct array of mixed types, including padded strings. (Note: there was an issue related to copying a NumPy struct array with padding. It was fixed in 3.5.0. Read this for details: PyTables GitHub Pull 720.)
Here is an example that shows proper string handling with a file created by PyTables. Maybe it will help you investigate your problem. Checking the dataset's properties would be a good start.
import tables as tb
import numpy as np

arr = np.empty((10), 'S10')
arr[0] = 'test'
arr[1] = 'one'
arr[2] = 'two'
arr[3] = 'three'

with tb.File('SO_63184571.h5', 'w') as h5f:
    ds = h5f.create_array('/', 'testdata', obj=arr)
    print(ds.atom)
    for i in range(4):
        print(ds[i])
        print(ds[i].decode('utf-8'))
The example below was added to demonstrate a compound dataset with an int and a fixed-width string. This is called a Table in PyTables (Arrays always contain homogeneous values). This can be done a number of ways. I show the 2 methods I prefer:
1. Create a record array and reference it with the description= or obj= parameter. This is useful when you already have all of your data AND it will fit in memory.
2. Create a record array dtype and reference it with the description= parameter, then add the data with the .append() method. This is useful when all of your data will NOT fit in memory, OR you need to add data to an existing table.
Code below:
recarr_dtype = np.dtype(
    {'names': ['ints', 'strs'],
     'formats': [int, 'S10']})
a = np.arange(5)
b = np.array(['a', 'b', 'c', 'd', 'e'])
recarr = np.rec.fromarrays((a, b), dtype=recarr_dtype)

with tb.File('SO_63184571.h5', 'w') as h5f:
    ds1 = h5f.create_table('/', 'compound_data1', description=recarr)
    for i in range(5):
        print(ds1[i]['ints'], ds1[i]['strs'].decode('utf-8'))

    ds2 = h5f.create_table('/', 'compound_data2', description=recarr_dtype)
    ds2.append(recarr)
    for i in range(5):
        print(ds2[i]['ints'], ds2[i]['strs'].decode('utf-8'))

converting google datastore query result to pandas dataframe in python

I need to convert a Google Cloud Datastore query result to a dataframe, to create a chart from the retrieved data. The query:
def fetch_times(limit):
    start_date = '2019-10-08'
    end_date = '2019-10-19'
    query = datastore_client.query(kind='ParticleEvent')
    query.add_filter(
        'published_at', '>', start_date)
    query.add_filter(
        'published_at', '<', end_date)
    query.order = ['-published_at']
    times = query.fetch(limit=limit)
    return times
creates a JSON-like string of the results for each entity returned by the query:
Entity('ParticleEvent', 5942717456580608) {'gc_pub_sub_id': '438169950283983', 'data': '605', 'event': 'light intensity', 'published_at': '2019-10-11T14:37:45.407Z', 'device_id': 'e00fce6847be7713698287a1'}>
I thought I had found something that would translate the results to JSON, which I could then convert to a dataframe, but I get an error that the properties attribute does not exist:
def to_json(gql_object):
    result = []
    for item in gql_object:
        result.append(dict([(p, getattr(item, p)) for p in item.properties()]))
    return json.dumps(result, cls=JSONEncoder)
Is there a way to iterate through the query results and get them into a dataframe, either directly or by converting to JSON first?
Datastore entities can be treated as Python base dictionaries! So you should be able to do something as simple as...
df = pd.DataFrame(datastore_entities)
...and pandas will do all the rest.
If you needed to convert the entity key, or any of its attributes to a column as well, you can pack them into the dictionary separately:
for e in entities:
    e['entity_key'] = e.key
    e['entity_key_name'] = e.key.name  # for example

df = pd.DataFrame(entities)
You can use pd.read_json to read your json query output into a dataframe.
Assuming the output is the string that you have shared above, the following approach can work.
# `line` is the entity string shown in the question
# Extract the beginning of the dictionary
startPos = line.find("{")
df = pd.DataFrame([eval(line[startPos:-1])])
Output looks like :
gc_pub_sub_id data event published_at \
0 438169950283983 605 light intensity 2019-10-11T14:37:45.407Z
device_id
0 e00fce6847be7713698287a1
Here, line[startPos:-1] is essentially the entire dictionary in the string input. Using eval, we can convert it into an actual dictionary. Once we have that, it can easily be converted into a dataframe object.
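For reference, a self-contained sketch of the same approach using the sample entity string from the question; ast.literal_eval is substituted here for eval as a safer equivalent for this literal-only dictionary (that substitution is not from the original answer):
import ast
import pandas as pd

line = ("Entity('ParticleEvent', 5942717456580608) {'gc_pub_sub_id': '438169950283983', "
        "'data': '605', 'event': 'light intensity', "
        "'published_at': '2019-10-11T14:37:45.407Z', "
        "'device_id': 'e00fce6847be7713698287a1'}>")

startPos = line.find("{")
record = ast.literal_eval(line[startPos:-1])  # drop the trailing '>' and parse the dict
df = pd.DataFrame([record])
print(df)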
The original poster found a workaround, which is to convert each item in the query result object to a string, and then manually parse the string to extract the needed data into a list.
The return value of the fetch function is google.cloud.datastore.query.Iterator, which behaves like a List[dict], so the output of fetch can be passed directly into pd.DataFrame.
import pandas as pd
df = pd.DataFrame(fetch_times(10))
This is similar to @bkitej's answer, but I added the use of the original poster's function.

Error when trying to save hdf5 row where one column is a string and the other is an array of floats

I have two columns: one is a string, and the other is a numpy array of floats.
import numpy as np

a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])
I would like to store them in a row as separate columns in an hdf5 file. For that I am using pandas:
import pandas as pd

hdf_key = 'hdf_key'
store5 = pd.HDFStore('file.h5')
z = pd.DataFrame({
    'string': [a],
    'array': [b]
})
store5.append(hdf_key, z, index=False)
store5.close()
However, I get this error
TypeError: Cannot serialize the column [array] because
its data contents are [mixed] object dtype
Is there a way to store this to h5? If so, how? If not, what's the best way to store this sort of data?
I can't help you with pandas, but I can show you how to do this with pytables.
Basically you create a table referencing either a numpy recarray or a dtype that defines the mixed datatypes.
Below is a super simple example to show how to create a table with 1 string and 4 floats. Then it adds rows of data to the table.
It shows 2 different methods to add data:
1. A list of tuples (1 tuple for each row) - see append_list
2. A numpy recarray (with dtype matching the table definition) - see simple_recarr in the for loop
To get the rest of the arguments for create_table(), read the PyTables documentation. It's very helpful and should answer additional questions. Link below:
PyTables User's Guide
import tables as tb
import numpy as np

with tb.open_file('SO_55943319.h5', 'w') as h5f:
    my_dtype = np.dtype([('A', 'S16'), ('b', float), ('c', float), ('d', float), ('e', float)])
    dset = h5f.create_table(h5f.root, 'table_data', description=my_dtype)

    # Append one row using a list:
    append_list = [('test string', -2.355, 1.957, 1.266, -6.913)]
    dset.append(append_list)

    # Append rows using a recarray matching the table definition:
    simple_recarr = np.recarray((1,), dtype=my_dtype)
    for i in range(5):
        simple_recarr['A'] = 'string_' + str(i)
        simple_recarr['b'] = 2.0 * i
        simple_recarr['c'] = 3.0 * i
        simple_recarr['d'] = 4.0 * i
        simple_recarr['e'] = 5.0 * i
        dset.append(simple_recarr)

print('done')
