export to csv and read multiIndex dataframe pandas - python

I need to export to csv and then import again a DataFrame that looks like this:
price ... hold buy balance long_size short_size minute hour day week month
close high low open CCI12 ROC12 CCI15 ROC15 CCI21 ROC21 ...
Time
2015-01-02 14:20:00 97.8515 97.8595 97.8205 97.8345 91.168620 0.000557 95.323467 0.000394 68.073065 0.000348 ... 0.0 0.0 0.0 0.0 0.0 8.660254e-01 -0.500000 0.974928 1.205367e-01 5.000000e-01
where the row index is the timestamp and the first 39 columns are subcolumns of 'price', while the remaining ones are on the same level as 'price'. The MultiIndex looks like this:
MultiIndex(levels=[['price', 'tick_counts', 'sell', 'hold', 'buy', 'balance', 'long_size', 'short_size', 'minute', 'hour', 'day', 'week', 'month'], [0, 'close', 'high', 'low', 'open', 'CCI12', 'ROC12', 'CCI15', 'ROC15', 'CCI21', 'ROC21', 'CCI30', 'ROC30', 'CCI40', 'ROC40', 'CCI100', 'ROC100', 'SMA12', 'EWMA12', 'SMA21', 'EWMA21', 'SMA26', 'EWMA26', 'SMA50', 'EWMA50', 'SMA100', 'EWMA100', 'SMA200', 'EWMA200', 'MACD', 'UpperBB10', 'LowerBB10', 'UpperBB20', 'LowerBB20', 'UpperBB30', 'LowerBB30', 'UpperBB40', 'LowerBB40', 'UpperBB50', 'LowerBB50', '']],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 0, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40]])
I have no idea how to preserve this structure easily when exporting with df.to_csv() and importing with pd.read_csv(). All my attempts have been a mess so far.
EDIT: if I simply use df.to_csv("/", index=True) as suggested and then read it back with pd.read_csv("/"), I get:
Unnamed: 0 price price.1 price.2 price.3 price.4 price.5 price.6 price.7 price.8 ... hold buy balance long_size short_size minute hour day week month
0 NaN close high low open CCI12 ROC12 CCI15 ROC15 CCI21 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Time NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2015-01-02 14:20:00 97.85149999999999 97.8595 97.82050000000001 97.83449999999999 91.16862020296143 0.0005572768080819476 95.32346677471595 0.0003936082115872622 68.07306512447788 ... 0.0 0.0 0.0 0.0 0.0 8.660254e-01 -0.500000 0.974928 1.205367e-01 5.000000e-01
where the second layer of the header has become the first row of the DataFrame.
EDIT2: Never mind, I've just discovered HDF5; apparently, unlike CSV, it preserves the structure even with a MultiIndex without additional work, so I will use df.to_hdf().

I think if you use df.to_csv("/", index=True) it saves the DataFrame with its index, and you can then read it back as normal.
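For completeness, the CSV round trip can be made to keep the MultiIndex if you tell read_csv that the first two rows are header levels and the first column is the index. A minimal sketch, assuming a made-up file name (multiindex.csv) and the df from the question:
import pandas as pd

# Write: both column-header levels and the Time index end up in the CSV
df.to_csv('multiindex.csv', index=True)

# Read back: header=[0, 1] rebuilds the two-level columns,
# index_col=0 restores the Time index, parse_dates turns it back into timestamps.
# Columns whose second level was empty may come back as 'Unnamed: ...' and need renaming.
df2 = pd.read_csv('multiindex.csv', header=[0, 1], index_col=0, parse_dates=True)

# The HDF5 route from EDIT2 also preserves the structure as-is (requires PyTables):
# df.to_hdf('data.h5', key='df')
# df2 = pd.read_hdf('data.h5', 'df')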

Related

labeling rows in dataframe, based on dynamic conditions

I need some help with labeling data inside a dataframe, based on dynamic conditions.
I have a dataframe
import pandas as pd

df3 = pd.DataFrame({
    'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'John'],
    'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
    'age': [30, 30, 25, 25, 25, 30, 45, 15, 30],
    'group': [0, 0, 0, 0, 0, 0, 0, 0, 0],
    'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 2],
    'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
df3['agemore'] = (df3['age'] > 20)
df3
So I need to take the first person with id=1 and group=0 and label him with group=1 (on all of his rows).
This person appears on 3 rows (indexes 0, 1, 8) and has agemore=True, product_type = 1, 1, 2 and quantity = 10, 15, 10.
The conditions for finding matching persons are based on the product_type, quantity and agemore columns.
The slice for the first person:
df6 = df3.loc[lambda df: (df['id'] == 1) & (df['product_type'] == 1), :]
df6
I need to take agemore = True, product_type = 1 (which is on two rows) and the quantities of this product type (10, 15) as the conditions.
Then I will look for persons who have agemore = True, product_type = 2 (it's a cross-column search; this is also on two rows) and quantities of product_type = 2 equal to (10, 15). The matched person has id 2, so I must put this person in group 1 as well.
Then take the next person with the lowest id and group=0, take his conditions, look for similar persons, group them together, and so on.
The output I would like to have:
df4 = pd.DataFrame({
    'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'John'],
    'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
    'age': [30, 30, 25, 25, 25, 30, 45, 15, 30],
    'group': [1, 1, 1, 1, 1, 2, 2, 3, 1],
    'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 2],
    'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
df4
A second data set (set2):
import pandas as pd

data = pd.DataFrame({
    'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'John'],
    'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
    'age': [30, 30, 25, 25, 25, 30, 45, 15, 3],
    'group': [0, 0, 0, 0, 0, 0, 0, 0, 0],
    'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 1],
    'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
data['agemore'] = (data['age'] > 20)
rm1991, thanks for clarifying your question.
From the information provided, I gathered that you are trying to group customers by their behavior and age group. I can also infer that the IDs are assigned to customers when they first make a transaction with you, which means that the higher the ID value, the newer the customer is to the company.
If this is the case, I would suggest you use an unsupervised learning method to cluster the data points by their similarity regarding the product type, quantity purchased, and age group. Have a look at the SKLearn suite of clustering algorithms for further information.
NB: upon further clarification from rm1991, it seems that product_type is not a "clustering" criterion.
I have replicated your output using only Pandas logic within a loop, as you can see below:
import pandas as pd

data = pd.DataFrame({
    'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'John'],
    'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
    'age': [30, 30, 25, 25, 25, 30, 45, 15, 30],
    'group': [0, 0, 0, 0, 0, 0, 0, 0, 0],
    'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 2],
    'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
data['agemore'] = (data['age'] > 20)

group_val = 0
for id in data['id'].unique():
    age_param = list(set([age_bool for age_bool in data.loc[data['id'] == id, 'agemore']]))
    # Product type removed as per latest requirements
    # product_type_param = list(set([prod_type for prod_type in data.loc[data['id'] == id, 'product_type']]))
    quantity_param = list(set([qty for qty in data.loc[data['id'] == id, 'quantity']]))

    if data.loc[(data['id'] == id)
                & (data['group'] == 0), :].shape[0] > 0:
        group_val += 1
        data.loc[(data['group'] == 0)
                 & (data['agemore'].isin(age_param))
                 # Product_type removed as per latest requirements
                 # & (data['product_type'].isin(product_type_param))
                 & (data['quantity'].isin(quantity_param)), 'group'] = group_val
Now the output does match what you've posted earlier:
first_name id age group product_type quantity agemore
0 John 1 30 1 1 10 True
1 John 1 30 1 1 15 True
2 Jane 2 25 1 2 10 True
3 Jane 2 25 1 1 10 True
4 Jane 2 25 1 2 15 True
5 Marry 3 30 2 1 30 True
6 Victoria 4 45 2 2 30 True
7 Gabriel 5 15 3 1 10 False
8 John 1 30 1 2 10 True
It remains unclear to me why Victoria, with ID = 4, would be assigned to the same group as Marry (ID = 3), given that they have not purchased the same product_type.
I hope this is helpful.
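As a quick sanity check against the df4 posted in the question as the desired output, a small sketch (assuming the loop above has just been run on data):
# Expected groups taken from the question's df4
expected = [1, 1, 1, 1, 1, 2, 2, 3, 1]
assert data['group'].tolist() == expected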

Fill list with last value if date gap is greater than N seconds

Suppose I have the list data:
import numpy as np
import datetime
np.random.seed(0)
aux = [10,30,50,60,70,110,120]
base = datetime.datetime(2018, 1, 1, 22, 34, 20)
data = [[base + datetime.timedelta(seconds=s),
round(np.random.rand(),3)] for s in aux]
This returns:
data ==
[[datetime.datetime(2018, 1, 1, 22, 34, 30), 0.549],
[datetime.datetime(2018, 1, 1, 22, 34, 50), 0.715],
[datetime.datetime(2018, 1, 1, 22, 35, 10), 0.603],
[datetime.datetime(2018, 1, 1, 22, 35, 20), 0.545],
[datetime.datetime(2018, 1, 1, 22, 35, 30), 0.424],
[datetime.datetime(2018, 1, 1, 22, 36, 10), 0.646],
[datetime.datetime(2018, 1, 1, 22, 36, 20), 0.438]]
What I want to do is fill the spaces where the gaps between the dates are greater than 10 seconds, using the last previous value. For this example, the output should be:
desired_output ==
[[datetime.datetime(2018, 1, 1, 22, 34, 30), 0.549],
[datetime.datetime(2018, 1, 1, 22, 34, 40), 0.549],
[datetime.datetime(2018, 1, 1, 22, 34, 50), 0.715],
[datetime.datetime(2018, 1, 1, 22, 35), 0.715],
[datetime.datetime(2018, 1, 1, 22, 35, 10), 0.603],
[datetime.datetime(2018, 1, 1, 22, 35, 20), 0.545],
[datetime.datetime(2018, 1, 1, 22, 35, 30), 0.424],
[datetime.datetime(2018, 1, 1, 22, 35, 40), 0.424],
[datetime.datetime(2018, 1, 1, 22, 35, 50), 0.424],
[datetime.datetime(2018, 1, 1, 22, 36), 0.424],
[datetime.datetime(2018, 1, 1, 22, 36, 10), 0.646],
[datetime.datetime(2018, 1, 1, 22, 36, 20), 0.438]]
I can't think of any smart way to do this. All dates are separated by multiples of 10 seconds. Any ideas?
Option 1: with Pandas
If you're open to using Pandas, it makes reindexing operations like this easy:
>>> import pandas as pd
>>> df = pd.DataFrame(data, columns=['date', 'value'])
>>> ridx = df.set_index('date').asfreq('10s').ffill().reset_index()
>>> ridx
date value
0 2018-01-01 22:34:30 0.549
1 2018-01-01 22:34:40 0.549
2 2018-01-01 22:34:50 0.715
3 2018-01-01 22:35:00 0.715
4 2018-01-01 22:35:10 0.603
5 2018-01-01 22:35:20 0.545
6 2018-01-01 22:35:30 0.424
7 2018-01-01 22:35:40 0.424
8 2018-01-01 22:35:50 0.424
9 2018-01-01 22:36:00 0.424
10 2018-01-01 22:36:10 0.646
11 2018-01-01 22:36:20 0.438
.asfreq('10s') expands the index to a regular 10-second frequency, inserting NaN where rows are missing. .ffill() ("forward fill") then replaces those NaNs with the last-seen valid value.
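If asfreq feels opaque, an equivalent route (a sketch using the same df as above) is to build the full 10-second index explicitly and reindex with forward fill:
full_idx = pd.date_range(df['date'].min(), df['date'].max(), freq='10s')
ridx2 = (df.set_index('date')
           .reindex(full_idx, method='ffill')   # forward-fill onto the complete index
           .rename_axis('date')
           .reset_index())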
To get back to the data structure that you have now (though note that the elements will be 2-tuples, rather than lists of length 2):
>>> native_ridx = list(zip(ridx['date'].dt.to_pydatetime().tolist(), ridx['value']))
>>> from pprint import pprint
>>> pprint(native_ridx[:5])
[(datetime.datetime(2018, 1, 1, 22, 34, 30), 0.549),
(datetime.datetime(2018, 1, 1, 22, 34, 40), 0.549),
(datetime.datetime(2018, 1, 1, 22, 34, 50), 0.715),
(datetime.datetime(2018, 1, 1, 22, 35), 0.715),
(datetime.datetime(2018, 1, 1, 22, 35, 10), 0.603)]
To confirm:
>>> assert all(tuple(i) == j for i, j in zip(desired_output, native_ridx))
Option 2: Native Python
import datetime

import numpy as np  # np.nan is used below as the placeholder for a missing leading value


def make_daterange(
    start: datetime.datetime,
    end: datetime.datetime,
    incr=datetime.timedelta(seconds=10)
):
    yield start
    while start < end:
        start += incr
        yield start


def reindex_ffill(data: list, incr=datetime.timedelta(seconds=10)):
    dates, _ = zip(*data)
    data = dict(data)
    start, end = min(dates), max(dates)
    daterng = make_daterange(start, end, incr)
    # If the initial value is not valid, the element at [0][0] will be NaN
    lastvalid = np.nan
    get = data.get
    for date in daterng:
        value = get(date)
        if value:
            yield date, value
            lastvalid = value
        else:
            yield date, lastvalid
Example:
>>> pynative_ridx = list(reindex_ffill(data))
>>> assert all(tuple(i) == j for i, j in zip(desired_output, pynative_ridx))

timedeltas for a groupby column in pandas [duplicate]

This question already has an answer here:
How to calculate time difference by group using pandas?
(1 answer)
Closed 4 years ago.
For a given DataFrame df:
import datetime
import pandas as pd

timestamps = [
datetime.datetime(2018, 1, 1, 10, 0, 0, 0), # person 1
datetime.datetime(2018, 1, 1, 10, 0, 0, 0), # person 2
datetime.datetime(2018, 1, 1, 11, 0, 0, 0), # person 2
datetime.datetime(2018, 1, 2, 11, 0, 0, 0), # person 2
datetime.datetime(2018, 1, 1, 10, 0, 0, 0), # person 3
datetime.datetime(2018, 1, 2, 11, 0, 0, 0), # person 3
datetime.datetime(2018, 1, 4, 10, 0, 0, 0), # person 3
datetime.datetime(2018, 1, 5, 12, 0, 0, 0) # person 3
]
df = pd.DataFrame({'person': [1, 2, 2, 2, 3, 3, 3, 3], 'timestamp': timestamps })
I want to calculate for each person (df.groupby('person')) the time differences between all timestamps of that person, which I would do with diff().
df.groupby('person').timestamp.diff()
only gets me halfway, because the mapping back to the person is lost.
How could a solution look like?
I think you should use
df.groupby('person').timestamp.transform(pd.Series.diff)
The problem is that diff does not aggregate values, so a possible solution is transform:
df['new'] = df.groupby('person').timestamp.transform(pd.Series.diff)
print (df)
person timestamp new
0 1 2018-01-01 10:00:00 NaT
1 2 2018-01-01 10:00:00 NaT
2 2 2018-01-01 11:00:00 0 days 01:00:00
3 2 2018-01-02 11:00:00 1 days 00:00:00
4 3 2018-01-01 10:00:00 NaT
5 3 2018-01-02 11:00:00 1 days 01:00:00
6 3 2018-01-04 10:00:00 1 days 23:00:00
7 3 2018-01-05 12:00:00 1 days 02:00:00
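For what it's worth, groupby(...).diff() already returns a Series aligned with the original index, so on the pandas versions I have tried, a direct assignment gives the same result (a minimal alternative sketch, not part of the original answer):
df['new'] = df.groupby('person')['timestamp'].diff()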

loop for computing average of selected data in dataframe using pandas

I have a 3 row x 96 column dataframe. I'm trying to compute the average of the two rows beneath the index (row1:96) for every 12 data points. Here is my dataframe:
Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 \
0 1461274.92 1458079.44 1456807.1 1459216.08 1458643.24 1457145.19
1 478167.44 479528.72 480316.08 475569.52 472989.01 476054.89
2 ------ ------ ------ ------ ------ ------
Run 7 Run 8 Run 9 Run 10 ... Run 87 \
0 1458117.08 1455184.82 1455768.69 1454738.07 ... 1441822.45
1 473630.89 476282.93 475530.87 474200.22 ... 468525.2
2 ------ ------ ------ ------ ... ------
Run 88 Run 89 Run 90 Run 91 Run 92 Run 93 \
0 1445339.53 1461050.97 1446849.43 1438870.43 1431275.76 1430781.28
1 460076.8 473263.06 455885.07 475245.64 483875.35 487065.25
2 ------ ------ ------ ------ ------ ------
Run 94 Run 95 Run 96
0 1436007.32 1435238.23 1444300.51
1 474328.87 475789.12 458681.11
2 ------ ------ ------
[3 rows x 96 columns]
Currently I am trying to use df.irow(0) to select all the data in row index 0.
something along the lines of:
selection = np.arange(0, 13)
for i in selection:
    new_df = pd.DataFrame()
    data = df.irow(0)
    ........
then I get lost.
I just don't know how to link this range with the dataframe in order to compute the mean for every 12 data points in each column.
To summarize, I want the average for every 12 runs in each column. So, I should end up with a separate dataframe with 2 * 8 average values (96/12).
Any ideas?
Thanks.
You can do a groupby on axis=1 (using some dummy data I made up):
>>> h = df.iloc[:2].astype(float)
>>> h.groupby(np.arange(len(h.columns))//12, axis=1).mean()
0 1 2 3 4 5 6 7
0 0.609643 0.452047 0.536786 0.377845 0.544321 0.214615 0.541185 0.544462
1 0.382945 0.596034 0.659157 0.437576 0.490161 0.435382 0.476376 0.423039
First we extract the data and force it to float (the presence of the ------ row means that you've probably got an object dtype, which will make the mean unhappy).
Then we make an array saying what groups we want to put the different columns in:
>>> np.arange(len(df.columns))//12
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7], dtype=int32)
which we feed as an argument to groupby. .mean() handles the rest.
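To make that concrete end-to-end, here is a minimal, self-contained sketch on made-up numbers in the same shape as your frame (two numeric rows, a '------' row, 96 'Run' columns); it assumes a pandas version where groupby(..., axis=1) is still available, as newer releases deprecate the axis=1 form:
import numpy as np
import pandas as pd

# Dummy data in the question's shape: two numeric rows plus a '------' row,
# spread over 96 'Run N' columns (the values are made up).
cols = ['Run %d' % i for i in range(1, 97)]
df = pd.DataFrame([np.random.rand(96) * 1e6,
                   np.random.rand(96) * 1e5,
                   ['------'] * 96], columns=cols)

h = df.iloc[:2].astype(float)                # drop the '------' row and force floats
groups = np.arange(len(h.columns)) // 12     # 0,0,...,0,1,1,... -> 8 groups of 12 runs
averages = h.groupby(groups, axis=1).mean()  # 2 x 8 frame of group means
print(averages)
The result is a 2 x 8 frame, one column per block of 12 consecutive runs.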
It's always best to try to use pandas methods when you can, rather than iterating over the rows. The DataFrame's iloc method is useful for extracting any number of rows.
The following example shows you how to do what you want in a two-column DataFrame. The same technique will work independent of the number of columns:
In [14]: df = pd.DataFrame({"x": [1, 2, "-"], "y": [3, 4, "-"]})
In [15]: df
Out[15]:
x y
0 1 3
1 2 4
2 - -
In [16]: df.iloc[2] = df.iloc[0:2].sum()
In [17]: df
Out[17]:
x y
0 1 3
1 2 4
2 3 7
However, in your case you want to sum each group of eight cells in df.iloc[2], so you might be better off simply taking the result of the summing expression with the statement
ds = df.iloc[0:2].sum()
which with your data will have the form
col1 0
col2 1
col3 2
col4 3
...
col93 92
col94 93
col95 94
col96 95
(These numbers are representative, you will obviously see your column sums). You can then turn this into a 12x8 matrix with
ds.values.reshape(12, 8)
whose value is
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63],
[64, 65, 66, 67, 68, 69, 70, 71],
[72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87],
[88, 89, 90, 91, 92, 93, 94, 95]])
but summing this array will give you the sum of all elements, so instead create another DataFrame with
rs = pd.DataFrame(ds.values.reshape(12, 8))
and then sum that:
rs.sum()
giving
0 528
1 540
2 552
3 564
4 576
5 588
6 600
7 612
dtype: int64
You may find in practice that it is easier to simply create two 12x8 matrices in the first place, which you can add together before creating a dataframe which you can then sum. Much depends on how you are reading your data.

Can't call to_json on pivoted pandas dataframe: returns "ValueError: Label array sizes do not match corresponding data shape"

I have a database containing time series data from sensors. The graphing library I would like to use on the front end requires that the data be reshaped into one column per sensor instead of the vertical format in my dataset:
>>> for d in dataset: print d
...
[datetime.datetime(2014, 9, 26, 0, 56, 0, 598000), u'motion', 0.0]
[datetime.datetime(2014, 9, 26, 0, 56, 7, 698000), u'motion', 1.0]
[datetime.datetime(2014, 9, 26, 0, 58, 20, 298000), u'motion', 0.0]
[datetime.datetime(2014, 9, 26, 2, 21, 27, 893000), u'door', 0.0]
[datetime.datetime(2014, 9, 26, 2, 21, 37, 793000), u'door', 1.0]
[datetime.datetime(2014, 9, 26, 2, 21, 53, 893000), u'door', 0.0]
With some help from stackoverflow and the pandas documentation (thanks!) I figured out how to pivot the data:
>>> import pandas as pd
>>> pd.__version__
'0.14.1'
>>>
>>> df = pd.DataFrame(dataset, columns=['tstamp', 'tag', 'value'])
>>> dfp = df.pivot('tstamp', 'tag')
>>> dfp
value
tag door motion
tstamp
2014-09-26 00:56:00.598000 NaN 0
2014-09-26 00:56:07.698000 NaN 1
2014-09-26 00:58:20.298000 NaN 0
2014-09-26 02:21:27.893000 0 NaN
2014-09-26 02:21:37.793000 1 NaN
2014-09-26 02:21:53.893000 0 NaN
>>>
Now I'm stuck trying to output the data in JSON:
>>> dfp.to_json()
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/var/www/environment/default/local/lib/python2.7/site-packages/pandas/core/generic.py", line 853, in to_json
default_handler=default_handler)
File "/var/www/environment/default/local/lib/python2.7/site-packages/pandas/io/json.py", line 34, in to_json
date_unit=date_unit, default_handler=default_handler).write()
File "/var/www/environment/default/local/lib/python2.7/site-packages/pandas/io/json.py", line 77, in write
default_handler=self.default_handler)
ValueError: Label array sizes do not match corresponding data shape
I am new to pandas so I am guessing that I need to fix my "label arrays". What do I do? I can see that
>>> dfp.keys()
MultiIndex(levels=[[u'value'], [u'door', u'motion']],
labels=[[0, 0], [0, 1]],
names=[None, u'tag'])
But I'm not sure what to do next.
The pivot is making a DataFrame whose columns have a MultiIndex. Since the top level, value, is the same for all columns, you could simply drop it:
dfp.columns = dfp.columns.droplevel(0)
and then calling to_json works:
In [20]: dfp.to_json()
Out[20]: '{"door":{"1411692960598":null,"1411692967698":null,"1411693100298":null,"1411698087893":0.0,"1411698097793":1.0,"1411698113893":0.0},"motion":{"1411692960598":0.0,"1411692967698":1.0,"1411693100298":0.0,"1411698087893":null,"1411698097793":null,"1411698113893":null}}'
Or, better yet, specify the values column when calling pivot:
In [26]: dfp = df.pivot(index='tstamp', columns='tag', values='value'); dfp
Out[26]:
tag door motion
tstamp
2014-09-26 00:56:00.598000 NaN 0
2014-09-26 00:56:07.698000 NaN 1
2014-09-26 00:58:20.298000 NaN 0
2014-09-26 02:21:27.893000 0 NaN
2014-09-26 02:21:37.793000 1 NaN
2014-09-26 02:21:53.893000 0 NaN
and now calling to_json works out-of-the-box, since the columns index is flat.
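As a side note, if the front end prefers ISO-8601 timestamps over the epoch-millisecond keys shown above, recent pandas versions let you ask to_json for them (a small sketch; the default orient is unchanged):
json_str = dfp.to_json(date_format='iso')   # ISO-8601 date strings instead of epoch milliseconds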
