I am new to HDF5 and was trying to store a DataFrame row in HDF5 format. I want to append a row at different locations within the file; however, every time I append, it shows up as an array in a single column rather than a single value in multiple columns.
I have tried both h5py and pandas, and it seems like pandas is the better option for appending. I have also tried a lot of different methods. Truly, any help would be greatly appreciated.
Here is my code sending an array into the hdf5 file multiple times:
import pandas as pd
import numpy as np

data = np.zeros((1, 48), dtype=float)
columnName = ['Hello' + str(y) for (x, y), item in np.ndenumerate(data)]
df = pd.DataFrame(data=data, columns=columnName)

# keyword fixes: complib (not comlib) and data_columns (not column_data)
file = pd.HDFStore('file.hdf5', mode='a', complevel=9, complib='blosc')
for x in range(0, 11):
    file.put('/data', df, data_columns=columnName, append=True, format='table')
file.close()
This seems to work fine:
In [243]: store = pd.HDFStore('test.h5')
In [247]: store.put('foo',df,append=True,format='table')
In [248]: store.put('foo',df,append=True,format='table')
In [249]: store.put('foo',df,append=True,format='table')
In [250]: store['foo']
Out[250]:
Hello0 Hello1 Hello2 Hello3 Hello4 ... Hello43 Hello44 Hello45 Hello46 Hello47
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
[3 rows x 48 columns]
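Note that each appended row keeps the index value 0 from the single-row frame. If unique row labels matter, a hedged follow-up is to reset the index after reading back:
store['foo'].reset_index(drop=True)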
Related
I have a Pandas dataframe, wt, with a datetime index and three columns, as well as a dataframe t with the same datetime index and three other columns, shown below:
wt
date 0 1 2
2004-11-19 0.2 0.3 0.5
2004-11-22 0.0 0.0 0.0
2004-11-23 0.0 0.0 0.0
2004-11-24 0.0 0.0 0.0
2004-11-26 0.0 0.0 0.0
2004-11-29 0.0 0.0 0.0
2004-11-30 0.0 0.0 0.0
t
date GLD SPY TLT
2004-11-19 0.009013068949977443 -0.011116725618999457 -0.007980218051028332
2004-11-22 0.0037963376507370583 0.004769204564810003 0.005211874008610895
2004-11-23 -0.00444938820912133 0.0015256823190370472 0.0012398557258792575
2004-11-24 0.006703910614525022 0.0023696682464455776 0.0
2004-11-26 0.005327413984461682 -0.0007598784194529085 -0.00652932567826181
2004-11-29 0.002428792227864962 -0.004562737642585524 -0.010651558073654366
2004-11-30 -0.006167400881057272 0.0006790595025889523 -0.004237773450922022
2004-12-01 0.005762411347517871 0.011366528119433505 -0.0015527950310557648
I'm currently using the Pandas iterrows method to run through each row for processing, and as a first step, I check whether the row entries are non-zero, as below:
for dt, row in t.iterrows():
    if sum(wt.loc[dt]) <= 0:
        ...
Based on this, I'd like to assign values to dataframe wt if non-zero values don't currently exist. How can I retrieve the next row for a given dt entry (e.g., '11/22/2004' for dt = '11/19/2004')?
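A minimal sketch of one approach, assuming dt is present in wt's index and is not the last label:
pos = wt.index.get_loc(dt)   # integer position of dt in the index
next_dt = wt.index[pos + 1]  # the following date, e.g. 2004-11-22 for 2004-11-19
next_row = wt.iloc[pos + 1]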
Part 2
As an addendum, I'm setting this up with a for loop for testing, but would like to use a list comprehension once complete. Processing will return the wt dataframe described above, as well as an intermediate, secondary dataframe, again with a datetime index and a single column (sample below):
r
date r
2004-11-19 0.030202
2004-11-22 -0.01047
2004-11-23 0.002456
2004-11-24 -0.01274
2004-11-26 0.00928
Is there a way to use a list comprehension to return both the wt and r dataframes above without simply creating two separate comprehensions?
Edit
I was able to get the desired results by changing my approach, so I'm adding this for clarification (the referenced dataframes are as described above). I wonder if there's any way to apply list comprehensions to this.
r = pd.DataFrame(columns=['ret'], index=wt.index.copy())
dts = wt.reset_index().date
for i, dt in enumerate(dts):
    row = t.loc[dt]
    dt_1 = dts.shift(-1).iloc[i]
    try:
        wt.loc[dt_1] = ((wt.loc[dt].tolist() * (1 + row)).transpose()
                        / np.dot(wt.loc[dt].tolist(), (1 + row))).tolist()
        r.loc[dt] = np.dot(wt.loc[dt], row)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print(f'Error calculating for date {dt}')
        continue
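For what it's worth, the wt update above is sequential (each row depends on the previous one), so that part seems bound to stay a loop; here is a hedged sketch of collecting just r with a single comprehension afterwards, assuming wt has already been filled:
r = pd.DataFrame({'ret': [np.dot(wt.loc[dt], t.loc[dt]) for dt in wt.index]},
                 index=wt.index.copy())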
I am new to pandas, so sorry for the naiveté.
I have two dataframes.
One is out.hdf:
999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074
999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292
999999 2014 1 9 12 27 10.98 13.265 -89.835 55.0 4.5 0.0 0.0 0.0 603947030
999999 2014 1 9 20 57 44.88 23.273 -80.778 15.0 5.1 0.0 0.0 0.0 603947340
and another one is out.res (the first column is station name):
061Z 56.72 0.0 P 603879074
061Z 29.92 0.0 P 603879074
0614 46.24 0.0 P 603879292
109C 87.51 0.0 P 603947030
113A 66.93 0.0 P 603947030
113A 26.93 0.0 P 603947030
121A 31.49 0.0 P 603947340
The last columns in both dataframes are ID.
I want to create a new dataframe which puts lines with the same ID from the two dataframes together, in this way: first read a line from hdf, then put the lines from res with the same ID beneath it, but don't keep the ID column from res.
The new dataframe:
"999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074"
061Z 56.72 0.0 P
061Z 29.92 0.0 P
"999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292"
0614 46.24 0.0 P
"999999 2014 1 9 12 27 10.98 13.265 -89.835 55.0 4.5 0.0 0.0 0.0 603947030"
109C 87.51 0.0 P
113A 66.93 0.0 P
113A 26.93 0.0 P
"999999 2014 1 9 20 57 44.88 23.273 -80.778 15.0 5.1 0.0 0.0 0.0 603947340"
121A 31.49 0.0 P
My code to do this is:
import csv
import pandas as pd
import numpy as np

path = './'
hdf = pd.read_csv(path + 'out.hdf', delimiter='\t', header=None)
res = pd.read_csv(path + 'out.res', delimiter='\t', header=None)
### creating input in the format of ph2dt-jp/ph
with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')
    i = 0
    with open('./out.hdf', 'r') as a_file:
        for line in a_file:
            liney = line.strip()
            writer.writerow(np.array([liney]))
            print(liney)
            j = 0
            with open('./out.res', 'r') as b_file:
                for resline in b_file:
                    if res.iloc[j, 4] == hdf.iloc[i, 14]:
                        strng = res.iloc[j, [0, 1, 2, 3]]
                        print(strng)
                        writer.writerow(np.array(strng))
                    j += 1
            i += 1
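As an aside, here is a hedged sketch of the same interleaving using a boolean mask instead of re-reading the files (it assumes hdf and res as read above, with the ID in column 14 of hdf and column 4 of res):
with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')
    for _, event in hdf.iterrows():
        writer.writerow(['\t'.join(map(str, event))])  # whole hdf line as one field
        matches = res.loc[res[4] == event[14], [0, 1, 2, 3]]
        for _, station in matches.iterrows():
            writer.writerow(station)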
The goal is to keep just unique stations in the 3rd dataframe. I used these commands for res to keep unique stations before creating the 3rd dataframe:
res.drop_duplicates([0], keep = 'last', inplace = True)
and
res.groupby([0], as_index = False).last()
and it works fine. The problem is that for a large data set, with thousands of lines, using these commands causes some lines of the res file to be omitted from the 3rd dataframe.
Could you please let me know what I should do to get the same result for a large dataset?
I am going crazy; thanks in advance for your time and help.
I found the problem and hope it is helpful for others in the future.
In the large data set, the duplicated stations were repeated many times, but not consecutively. drop_duplicates() was keeping just one of them.
However, I wanted to drop only consecutive duplicate stations, not all of them. I've done this using shift:
unique_stations = res.loc[res[0].shift() != res[0]]
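For illustration, a small hedged example of the shift trick on a toy frame (the station codes are made up):
import pandas as pd

res = pd.DataFrame({0: ['061Z', '061Z', '0614', '061Z']})
# keep a row only when its station differs from the previous row's,
# so consecutive duplicates collapse but later reappearances survive
unique_stations = res.loc[res[0].shift() != res[0]]
print(unique_stations)
#       0
# 0  061Z
# 2  0614
# 3  061Z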
I created a dataframe of zeros using this syntax:
ltv = pd.DataFrame(data=np.zeros([actual_df.shape[0], 6]),
columns=['customer_id',
'actual_total',
'predicted_num_purchases',
'predicted_value',
'predicted_total',
'error'], dtype=np.float32)
It comes out perfectly, as expected:
customer_id | actual_total | predicted_num_purchases | predicted_value | predicted_total | error
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
When I run this syntax:
ltv['customer_id'] = actual_df['customer_id']
I get all NaNs in ltv['customer_id']. What is causing this, and how can I prevent it from happening?
NB: I also checked actual_df and there are no NaNs inside of it.
You need the same index values in both (and also the same length in both DataFrames).
So the first solution is to create a default RangeIndex in actual_df; in ltv the index is not specified, so a default one is created:
actual_df = actual_df.reset_index(drop=True)
ltv['customer_id'] = actual_df['customer_id']
Or add an index parameter to the DataFrame constructor:
ltv = pd.DataFrame(data=np.zeros([actual_df.shape[0], 6]),
columns=['customer_id',
'actual_total',
'predicted_num_purchases',
'predicted_value',
'predicted_total',
'error'], dtype=np.float32,
index=actual_df.index)
ltv['customer_id'] = actual_df['customer_id']
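To see why the NaNs appear in the first place, here is a small hedged illustration of index alignment during column assignment (toy data):
import pandas as pd

ltv = pd.DataFrame({'customer_id': [0.0, 0.0]})                      # index 0, 1
actual_df = pd.DataFrame({'customer_id': ['a', 'b']}, index=[7, 8])  # different index
ltv['customer_id'] = actual_df['customer_id']  # aligns on index, no overlap
print(ltv)  # customer_id is NaN in both rows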
Another option (more complicated than jezrael's great answer) is using pd.concat() followed by .drop(). Note that pd.concat() with axis=1 also aligns on the index, so both frames need matching indexes (reset first), and ignore_index=True would discard the column names:
ltv = pd.concat([ltv.drop(columns=['customer_id']).reset_index(drop=True),
                 actual_df[['customer_id']].reset_index(drop=True)], axis=1)
I have some files, and I am running a loop over all of them, doing some calculations. I would like to get a new df with the file names on the row side and the calculated value for each file in the corresponding row.
The code is:
results = []
file_name = '{}'
for file in folder:
    df = pd.read_csv(file_name.format(file))
    print("reading file ", file)
    results.append(df['old_calc'])  # this is the data I want to save to the new df, and I need its .sum()
The above code doesn't work as expected, as it is giving me:
old calc old calc old calc old calc old calc old calc old calc
4 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 59.0 0.0 0.0
6 0.0 0.0 58.4 0.0 0.0 0.0
7 0.0 0.0 8.4 -79.1 0.0 0.0
8 0.0 0.0 120.9 0.0 0.0 0.0
The expected result would be a new df named results:
file1 0
file2 0
file3 187.7
file4 20.1
file5 0
Thanks for the help.
This is one way you can extract the data you need:
dfs = {file: pd.read_csv(file) for file in folder}
result_dict = {k: v['old_calc'].sum() for k, v in dfs.items()}
result_df = pd.DataFrame.from_dict(result_dict, orient='index')
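If you want a named column instead of the default 0, from_dict also accepts a columns argument when orient='index' (the name old_calc_sum below is just illustrative):
result_df = pd.DataFrame.from_dict(result_dict, orient='index', columns=['old_calc_sum'])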
I'm currently dealing with a set of similar DataFrames that have a double header.
They have the following structure:
age height weight shoe_size
RHS height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
height weight shoe_size age
RHS weight shoe_size age
0 1.0 1.0 NaN NaN
1 1.0 2.0 0.0 2.0
2 1.0 NaN 0.0 5.0
3 1.0 2.0 0.0 NaN
4 0.0 1.0 0.0 3.0
Actually, the main differences are the ordering of the first header row, which could be made the same for all of them, and the position of the RHS label in the second header row. I'm wondering if there is an easy way of saving/reading all these DataFrames to/from a single CSV file instead of having a different CSV file for each of them.
Unfortunately, there isn't any reasonable way to store multiple dataframes in a single CSV such that retrieving each one would not be excessively cumbersome, but you can use pd.ExcelWriter and save to separate sheets in a single .xlsx file:
import pandas as pd

with pd.ExcelWriter('file.xlsx') as writer:  # saves and closes the file on exit
    for i, df in enumerate(df_list):
        df.to_excel(writer, sheet_name='sheet{}'.format(i))
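A hedged read-back sketch (sheet_name=None loads every sheet into a dict of DataFrames; header=[0, 1] restores the double header):
dfs = pd.read_excel('file.xlsx', sheet_name=None, header=[0, 1], index_col=0)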
Returning to your example (with random numbers instead of your values):
import pandas as pd
import numpy as np
h1 = [['age', 'height', 'weight', 'shoe_size'],['RHS','height','weight','shoe_size']]
df1 = pd.DataFrame(np.random.randn(3, 4), columns=h1)
h2 = [['height', 'weight', 'shoe_size','age'],['RHS','weight','shoe_size','age']]
df2 = pd.DataFrame(np.random.randn(3, 4), columns=h2)
First, reorder your columns (How to change the order of DataFrame columns?):
df3 = df2[h1[0]]
Then, concatenate the two dataframes (Merge, join, and concatenate):
df4 = pd.concat([df1,df3])
I don't know how you want to deal with the second row of your header (for now, it just uses two sub-columns, which is not very elegant). If, from your point of view, this row is meaningless, just reset your header as you like before concatenating:
df1.columns=h1[0]
df3.columns=h1[0]
df5 = pd.concat([df1,df3])
Finally, save it in CSV format (pandas.DataFrame.to_csv):
df4.to_csv('file_name.csv',sep=',')
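To read it back later with the double header intact, a hedged sketch:
df = pd.read_csv('file_name.csv', header=[0, 1], index_col=0)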