Conditional transform on pandas - python

I've got a very simple problem, but I can't seem to get it right.
Consider this dataframe
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B'],
                   'time': [20, 21, 22, 20, 21],
                   'price': [3.1, 3.5, 3.0, 2.3, 2.1]})
group price time
0 A 3.1 20
1 A 3.5 21
2 A 3.0 22
3 B 2.3 20
4 B 2.1 21
Now I want to take the standard deviation of the price of each group, but conditional on it being before time 22 (let's call it early_std). I want to then create a variable with that information.
The expected result is
group price time early_std
A 3.1 20 0.282843
A 3.5 21 0.282843
A 3.0 22 0.282843
B 2.3 20 0.141421
B 2.1 21 0.141421
This is what I tried:
df['early_std'] = df[df.time < 22].groupby('group') \
                    .price.transform(lambda x: x.std())
This almost works but it gives a missing value on time = 22:
group price time early_std
0 A 3.1 20 0.282843
1 A 3.5 21 0.282843
2 A 3.0 22 NaN
3 B 2.3 20 0.141421
4 B 2.1 21 0.141421
I also tried with apply and I think it works, but I need to reset the index, which is something I'd rather avoid (I have a large dataset and I need to do this repeatedly)
early_std2 = df[df.time < 22].groupby('group').price.std()
df.set_index('group', inplace=True)
df['early_std2'] = early_std2
price time early_std early_std2
group
A 3.1 20 0.282843 0.282843
A 3.5 21 0.282843 0.282843
A 3.0 22 NaN 0.282843
B 2.3 20 0.141421 0.141421
B 2.1 21 0.141421 0.141421
Thanks!

It looks like you only need to add fillna() to your first snippet to fill in the missing std values:
df['early_std'] = df[df.time < 22].groupby('group')['price'].transform(pd.Series.std)
df['early_std'] = df.groupby('group')['early_std'].apply(lambda x: x.fillna(x.max()))
df
To get:
group price time early_std
0 A 3.1 20 0.283
1 A 3.5 21 0.283
2 A 3.0 22 0.283
3 B 2.3 20 0.141
4 B 2.1 21 0.141
EDIT: I have changed ffill to a more general fillna, but you could also use chained .bfill().ffill() to achieve the same result.
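A sketch of the chained variant mentioned above, kept group-wise so values never spill across groups:
df['early_std'] = df[df.time < 22].groupby('group')['price'].transform(pd.Series.std)
df['early_std'] = df.groupby('group')['early_std'].transform(lambda s: s.bfill().ffill())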

Your second approach is very close to what you are trying to achieve.
This may not be the most efficient method but it worked for me:
df['early_std'] = 0
for index, value in early_std2.items():
    df.loc[df.group == index, 'early_std'] = value  # assign the group's std to every matching row
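If the loop is slow on a large frame, a vectorized alternative (a sketch using Series.map, not part of the original answer) maps the per-group result straight onto the group column:
df['early_std'] = df['group'].map(early_std2)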

Related

Pandas combine two dataseries into one series

I need to combine the dataseries rateScore and rate into one.
This is the current DataFrame I have
rateScore rate
10 NaN 4.5
11 2.5 NaN
12 4.5 NaN
13 NaN 5.0
..
235 NaN 4.7
236 3.8 NaN
This needs to be something like this:
rateScore
10 4.5
11 2.5
12 4.5
13 5.0
..
235 4.7
236 3.8
The rate column needs to be dropped after merging the series, and for each row the index number needs to stay the same.
You can try the following with fillna(), redefining the rateScore column and dropping rate:
df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from a second Series:
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
You can also use add with fill_value:
df['rateScore'] = df['rateScore'].add(df['rate'],fill_value=0)
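In either of the last two approaches, the rate column can then be dropped to match the requested output:
df = df.drop(columns='rate')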

How to make different dataframes of different lengths become equal in length (downsampling and upsampling)

I have many dataframes (timeseries) that are of different lengths ranging between 28 and 179. I need to make them all of length 104. (upsampling those below 104 and downsampling those above 104)
For upsampling, the linear method can be sufficient to my needs. For downsampling, the mean of the values should be good.
To get all files to be the same length, I thought that I need to make all dataframes start and end at the same dates.
I was able to downsample all of them to the size of the smallest dataframe (i.e. 28) using the lines of code below:
df.set_index(pd.date_range(start='1/1/1991', periods=len(df), end='1/1/2000'), inplace=True)
resampled = df.resample('120D').mean()
However, this will not give me good results when I feed them into the model I need them for as it shrinks the longer files so much thus distorting the data.
This is what I tried so far:
df.set_index(pd.date_range(start='1/1/1991', periods=len(df), end='1/1/2000'), inplace=True)
if df.shape[0] > 100:
    resampled = df.resample('D').mean()
elif df.shape[0] < 100:
    resampled = df.astype(float).resample('33D').interpolate(axis=0, method='linear')
else:
    break
Now, in the above lines of code, I am getting the files to be the same length (length 100). The downsampling part works fine too.
What's not working is the interpolation in the upsampling part. It just returns dataframes of length 100 with the first value of every column copied over to all the rows.
What I need is to make them all size 104 (average size). This means any df of length>104 needs to downsampled and any df of length<104 needs to be upsampled.
As an example, please consider the two dfs as follows:
>>df1
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
>>df2
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 6 -3 -2
4 4 0 -4
5 8 2 -6
6 10 4 -8
7 12 6 -10
Suppose the avg length is 6, the expected output would be:
df1 upsampled to length 6 using interpolation - e.g. resample(rule).interpolate().
And df2 downsampled to length 6 using resample(rule).mean() .
Update:
If I could get all the files to be upsampled to 179, that would be fine as well.
I assume the problem is that when you resample in the up-sampling case, the other values are not kept. With your example df1, you can see it by using asfreq on one column:
print(df1.set_index(pd.date_range(start='1/1/1991', periods=len(df1), end='1/1/2000'))[1]
          .resample('33D').asfreq().isna().sum(0))
#99 rows are nan on the 100 length resampled dataframe
So when you interpolate instead of using asfreq, it actually interpolates from just the first value, meaning that the first value is "repeated" over all the rows.
To get the result you want, apply mean before interpolating, even in the up-sampling case:
print(df1.set_index(pd.date_range(start='1/1/1991', periods=len(df1), end='1/1/2000'))[1]
          .resample('33D').mean().interpolate().head())
1991-01-01 3.000000
1991-02-03 3.060606
1991-03-08 3.121212
1991-04-10 3.181818
1991-05-13 3.242424
Freq: 33D, Name: 1, dtype: float64
and you will get the values you want.
To conclude, I think you can use the same command in both the up-sampling and down-sampling cases:
resampled = (df.set_index(pd.date_range(start='1/1/1991', periods=len(df), end='1/1/2000'))
               .resample('33D').mean().interpolate())
Because the interpolate would not affect the result in the down-sampling case.
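If you need to land on a specific length such as 104, one rough way (a sketch that is not part of the original answer, with the rule derived from the span of the date range) is:
import pandas as pd

TARGET_LEN = 104
idx = pd.date_range(start='1/1/1991', periods=len(df), end='1/1/2000')
span_days = (idx[-1] - idx[0]).days                 # 3287 days for this range
rule_days = max(1, round(span_days / TARGET_LEN))   # about 32, so roughly 104 bins
resampled = (df.set_index(idx)
               .resample(f'{rule_days}D').mean().interpolate())
The exact row count can be off by one or two, so trim or pad afterwards if the model needs exactly 104 rows.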
Here is my version using the skimage.transform.resize() function:
df1 = pd.DataFrame({
    'a': [3, 5, 9, 11],
    'b': [-1, -3, -5, -7],
    'c': [0, 2, 0, -2]
})
df1
a b c
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
import pandas as pd
import numpy as np
from skimage.transform import resize

def df_resample(df1, num=1):
    df2 = pd.DataFrame()
    for key, value in df1.items():
        temp = value.to_numpy() / value.abs().max()                          # normalize
        resampled = resize(temp, (num, 1), mode='edge') * value.abs().max()  # de-normalize
        df2[key] = resampled.flatten().round(2)
    return df2
df2 = df_resample(df1, 20) # resampling rate is 20
df2
a b c
0 3.0 -1.0 0.0
1 3.0 -1.0 0.0
2 3.0 -1.0 0.0
3 3.4 -1.4 0.4
4 3.8 -1.8 0.8
5 4.2 -2.2 1.2
6 4.6 -2.6 1.6
7 5.0 -3.0 2.0
8 5.8 -3.4 1.6
9 6.6 -3.8 1.2
10 7.4 -4.2 0.8
11 8.2 -4.6 0.4
12 9.0 -5.0 0.0
13 9.4 -5.4 -0.4
14 9.8 -5.8 -0.8
15 10.2 -6.2 -1.2
16 10.6 -6.6 -1.6
17 11.0 -7.0 -2.0
18 11.0 -7.0 -2.0
19 11.0 -7.0 -2.0
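The same helper also downsamples when num is smaller than the input length, e.g. df_resample(longer_df, 6) for the longer frame in the question (longer_df here is just a placeholder name for it).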

optimize a date function for timeseries

I have a dataset with a Time (seconds) column, and I want to create another column with the full date and time (yy/mm/dd h:m:s). I wrote the function below, but it takes too long for larger datasets. Any advice on how I can optimize it?
The Time column looks like this:
0 0.0
1 0.2
2 0.4
3 0.6
4 0.8
5 1.0
6 1.2
7 1.4
8 1.6
9 1.8
10 2.0
11 2.2
12 2.4
13 2.6
14 2.8
15 3.0
16 3.2
17 3.4
18 3.6
19 3.8
20 4.0
The function:
def calcul_datetime(ds, h, m, s, Y, M, D):
    for index, row in ds.iterrows():
        if row.Time.is_integer() == True and row.Time != 0:
            s = s + 1
            if s == 60:
                m = m + 1
                s = 0
            if m == 60:
                h = h + 1
                m = 0
            if h == 24:
                D = D + 1
                h = 0
        ds.iloc[index, 9] = "{Y}-{M}-{D} {heure}:{min}:{sec}".format(Y=Y, M=M, D=D, heure=h, min=m, sec=s)
    return ds
The datetime module has pretty much everything you need to do computations on times and dates.
Specifically, the datetime.datetime.fromtimestamp method converts seconds since the epoch to a datetime.datetime object. This object then has a convenient method for string representation: datetime.datetime.strftime.
Here's an example:
import datetime
some_arbitrary_time_value = 10000000
as_a_datetime = datetime.datetime.fromtimestamp(some_arbitrary_time_value)
as_a_string = as_a_datetime.strftime("%Y-%m-%d %H:%M:%S")
print(as_a_string)
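Since the question is about a whole Time column of second offsets, a vectorized sketch with pandas (the start timestamp below is a placeholder for the real Y/M/D h:m:s origin, and pd.to_timedelta stands in for the per-row loop) could be:
import pandas as pd

start = pd.Timestamp('1970-01-01 00:00:00')   # placeholder origin; use your real start date and time
ds['datetime'] = (start + pd.to_timedelta(ds['Time'], unit='s')).dt.strftime('%Y-%m-%d %H:%M:%S')
This avoids iterrows entirely, which is usually where the slowdown comes from.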

Reshaping Pandas dataframe by months

The task is to transform the below table
import pandas as pd
import numpy as np
index = pd.date_range('2000-1-1', periods=700, freq='D')
df = pd.DataFrame(np.random.randn(700), index=index, columns=["values"])
df.groupby(by=[df.index.year, df.index.month]).sum()
In[1]: df
Out[1]:
values
2000 1 1.181000
2 -8.005783
3 6.590623
4 -6.266232
5 1.266315
6 0.384050
7 -1.418357
8 -3.132253
9 0.005496
10 -6.646101
11 9.616482
12 3.960872
2001 1 -0.989869
2 -2.845278
3 -1.518746
4 2.984735
5 -2.616795
6 8.360319
7 5.659576
8 0.279863
9 -5.220678
10 5.077400
11 1.332519
such that it looks like this
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2000 1.2 -8.0 6.6 -6.3 1.2 0.4 -1.4 -3.1 0.0 -6.6 9.6 3.9
2001 -0.9 -2.8 -1.5 3.0 -2.6 8.3 5.7 0.3 -5.2 5.1 1.3
Additionally I need to add an extra column which sums the yearly values like this
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Year
2000 1.2 -8.0 6.6 -6.3 1.2 0.4 -1.4 -3.1 0.0 -6.6 9.6 3.9 4.7
2001 -0.9 -2.8 -1.5 3.0 -2.6 8.3 5.7 0.3 -5.2 5.1 1.3 10.7
Is there a quick pandas pivotal way to solve this?
use strftime('%b') in your groupby
df['values'].groupby([df.index.year, df.index.strftime('%b')]).sum().unstack()
To preserve order of months
df['values'].groupby([df.index.year, df.index.strftime('%b')], sort=False).sum().unstack()
With 'Year' at end
df['values'].groupby([df.index.year, df.index.strftime('%b')], sort=False).sum() \
    .unstack().assign(Year=df['values'].groupby(df.index.year).sum())
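Since the question asks for a pivot-style spelling, here is an equivalent sketch with pd.pivot_table (month numbers pivoted to columns, then renamed via calendar.month_abbr; an alternative, not part of the answer above):
import calendar

out = pd.pivot_table(df, index=df.index.year, columns=df.index.month,
                     values='values', aggfunc='sum')
out.columns = [calendar.month_abbr[m] for m in out.columns]
out['Year'] = out.sum(axis=1)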
You can do something like this:
import pandas as pd
import numpy as np
index = pd.date_range('2000-1-1', periods=700, freq='D')
df = pd.DataFrame(np.random.randn(700), index=index, columns=["values"])
l = [df.index.strftime("%Y"), df.index.strftime("%b"), df.index.strftime("%d")]
df.index = l
df=df.groupby(level=[-3,-2]).sum().unstack(-1)
df['Year'] = df.sum(axis=1)
df
The only change is that you need to unstack the DataFrame to convert it into a wide format. Once you have the integer month numbers, you can convert them into datetimes by specifying the %m directive as the format, and then retrieve their string representation with strftime.
Calculate the year by taking the sum across columns with axis=1.
np.random.seed(314)
fr = df.groupby([df.index.year, df.index.month]).sum().unstack(fill_value=0)
fr.columns = pd.to_datetime(fr.columns.droplevel(0), format='%m').strftime('%b')
fr['Year'] = fr.sum(1)
You can add the extra Year column with
df['Year'] = df.sum(axis=1)
This sums the dataframe row-wise (due to axis=1) and stores the result in a new column.

how do I transform a DataFrame in pandas with a function applied to many slices in each row?

I want to apply a function f to many slices within each row of a pandas DataFrame.
For example, DataFrame df would look as such:
df = pandas.DataFrame(np.round(np.random.normal(size=(2,49)), 2))
So, I have a dataframe of 2 rows by 49 columns, and my function needs to be applied to every consecutive slice of 7 data points in both rows, so that the resulting dataframe has the same shape as the input dataframe.
I was doing it as such:
df1=df.copy()
df1.T[:7], df1.T[7:14], df1.T[14:21],..., df1.T[43:50] = f(df.T.iloc[:7,:]), f(df.T.iloc[7:14,:]),..., f(df.T.iloc[43:50,:])
As you can see that's a whole lot of redundant code.. so I would like to create a loop or something so that it applies the function to every 7 subsequent data point...
I have no idea how to approach this. Is there a more elegant way to do this?
I thought I could maybe use a transform function for this, but in the pandas documentation I can only see that applied to a dataframe that has been grouped and not on slices of the data....
Hopefully this is clear.. let me know.
Thank you.
To avoid redundant code you can just do a loop like this:
STEP = 7
for i in range(0, df.shape[1], STEP):  # step through the 49 columns in blocks of 7
    df1.T[i:i+STEP] = f(df1.T[i:i+STEP])  # could also do an apply here somehow, depending on what you want to do
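If you'd rather not write the loop at all, a compact variant (a sketch, assuming f takes a block of columns and returns one of the same shape) slices the columns directly and glues the results back together:
import pandas as pd

STEP = 7
df1 = pd.concat(
    [f(df.iloc[:, i:i+STEP]) for i in range(0, df.shape[1], STEP)],
    axis=1,
)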
Don't Repeat Yourself
You don't provide any examples of your desired output, so here's my best guess at what you want...
If your data are lumped into groups of seven, then you need to come up with a way to label them as such.
In other words, if you want to work with arbitrary arrays, use numpy. If you want to work with labeled, meaningful data and its associated metadata, then use pandas.
Also, pandas works more efficiently when operating on (and displaying!) row-wise data. So that means storing the data long (49x2), not wide (2x49).
Here's an example of what I mean, using a smaller array with grouping labels assigned to the columns ahead of time.
Let's say you're reading in some wide-ish data like the following:
import pandas
import numpy
from io import StringIO # python 3
# from StringIO import StringIO # python 2
datafile = StringIO("""\
A,B,C,D,E,F,G,H,I,J
0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9
2.0,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9
""")
df = pandas.read_csv(datafile)
print(df)
A B C D E F G H I J
0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
You could add a cluster value to the columns, like so:
cluster_size = 3
col_vals = []
for n, col in enumerate(df.columns):
    cluster = int(n / cluster_size)
    col_vals.append((cluster, col))
df.columns = pandas.Index(col_vals)
print(df)
0 1 2 3
A B C D E F G H I J
0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
By default, the groupby method tries to group rows, but you can group columns (I just figured this out) by passing axis=1 when you create the object. So the sum of each cluster of columns for each row is as follows:
df.groupby(axis=1, level=0).sum()
0 1 2 3
0 0.3 1.2 2.1 0.9
1 3.3 4.2 5.1 1.9
2 6.3 7.2 8.1 2.9
But again, if all you're doing is more "global" operations, there's no need for any of this.
In-place column cluster operation
df[0] *= 5
print(df)
0 1 2 3
A B C D E F G H I J
0 0 2.5 5 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 25 27.5 30 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 50 52.5 55 2.3 2.4 2.5 2.6 2.7 2.8 2.9
In-place row operation
df.T[0] += 20
0 1 2 3
A B C D E F G H I J
0 20 22.5 25 20.3 20.4 20.5 20.6 20.7 20.8 20.9
1 25 27.5 30 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 50 52.5 55 2.3 2.4 2.5 2.6 2.7 2.8 2.9
Operate on the entire dataframe at once
def myFunc(x):
    return 5 + x**2

myFunc(df)
0 1 2 3
A B C D E F G H I J
0 405 511.25 630 417.09 421.16 425.25 429.36 433.49 437.64 441.81
1 630 761.25 905 6.69 6.96 7.25 7.56 7.89 8.24 8.61
2 2505 2761.25 3030 10.29 10.76 11.25 11.76 12.29 12.84 13.41
