I have a pandas DataFrame:
df = pd.DataFrame()
df['city'] = ['NY','NY','LA','LA']
df['hour'] = ['0','12','0','12']
df['value'] = [12,24,3,9]
city hour value
0 NY 0 12
1 NY 12 24
2 LA 0 3
3 LA 12 9
I want, for each city, to divide each row's value by the previous row's value and write the result into a new dataframe. The desired output is:
city ratio
NY 2
LA 3
What's the most pythonic way to do this?
First divide by the shifted values per group:
df['ratio'] = df['value'].div(df.groupby('city')['value'].shift(1))
print (df)
city hour value ratio
0 NY 0 12 NaN
1 NY 12 24 2.0
2 LA 0 3 NaN
3 LA 12 9 3.0
Then remove NaNs and select only the city and ratio columns:
df = df.dropna(subset=['ratio'])[['city', 'ratio']]
print (df)
city ratio
1 NY 2.0
3 LA 3.0
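The two steps can also be chained into a single expression; a minimal sketch, assuming df is the original frame with the city, hour and value columns:
# a sketch: compute the per-city ratio, drop the NaN first row of each group,
# and keep only the city and ratio columns
res = (df.assign(ratio=df['value'].div(df.groupby('city')['value'].shift(1)))
         .dropna(subset=['ratio'])[['city', 'ratio']])
print(res)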
You can use pct_change, which computes (current - previous) / previous, so adding 1 gives current / previous:
In [20]: df[['city']].assign(ratio=df.groupby('city').value.pct_change().add(1)).dropna()
Out[20]:
city ratio
1 NY 2.0
3 LA 3.0
This'll do it (using named aggregation, since the dict-of-name syntax for agg was removed in newer pandas):
df.groupby('city')['value'].agg(ratio=lambda x: x.max() / x.min()).reset_index()
# city ratio
#0 LA 3
#1 NY 2
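Note this matches the desired output only because each city has exactly two rows with the larger value second. A sketch of a variant that instead divides the last value by the first within each city, assuming the original df and that rows are already ordered by hour:
# a sketch: last value divided by first value per city (assumes rows sorted by hour)
df.groupby('city')['value'].agg(lambda x: x.iloc[-1] / x.iloc[0]).reset_index(name='ratio')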
This is one way using a custom function. It assumes you want to ignore the NaN rows in the result of dividing one series by a shifted version of itself.
def divider(x):
    return x['value'] / x['value'].shift(1)
res = df.groupby('city').apply(divider)\
        .dropna().reset_index()\
        .rename(columns={'value': 'ratio'})\
        .loc[:, ['city', 'ratio']]
print(res)
city ratio
0 LA 3.0
1 NY 2.0
One way is:
df.groupby(['city']).apply(lambda x: x['value'] / x['value'].shift(1))
For further improvement:
print(df.groupby(['city']).apply(lambda x: (x['value'] / x['value'].shift(1)).fillna(method='bfill')).reset_index().drop_duplicates(subset=['city']).drop('level_1', axis=1))
city value
0 LA 3.0
2 NY 2.0
Related
I have the following dataframe and subsequent EWMA function:
import numpy as np
import pandas as pd
from functools import partial
#Create DF
d = {'Name': ['Jim', 'Jim','Jim','Jim','Jim','Jim','Jim','Jim',], 'col2': [5,5,5,5,5,5,5,5]}
df1 = pd.DataFrame(data=d)
#EWMA 5
alpha = 1-np.log(2)/3
window5 = 5
weights5 = list(reversed([(1-alpha)**n for n in range(window5)]))
ewma5 = partial(np.average, weights=weights5)
df1['Rolling5'] = df1.groupby('Name')['col2'].transform(lambda x: x.rolling(5).apply(ewma5))
df1
Which results in a Rolling5 column that is NaN for the first four rows.
I have specified a rolling window of 5, but does anyone know how I can get the EWMA to calculate for the first to fourth rows even though there aren't 5 values?
E.g. for row 1, calculate for just row 1 (which would just be the same value), and for row 2 calculate the EWMA of rows 1 and 2. Also open to more efficient ways of doing this!
Thanks very much!
You can use ewm and set min_periods in rolling to 1:
def f(x):
    return x.ewm(alpha=1-np.log(2)/3).mean().iloc[-1]

df1['Rolling5'] = df1.groupby('Name')['col2'].transform(
    lambda x: x.rolling(5, min_periods=1).apply(f))
Comparing with the original:
df1['Rolling5_original'] = df1.groupby('Name')['col2'].transform(
    lambda x: x.rolling(5).apply(ewma5))
df1['Rolling5'] = df1.groupby('Name')['col2'].transform(
    lambda x: x.rolling(5, min_periods=1).apply(f))
df1
Output (note the table below uses col2 values 1 to 8 rather than the constant 5 from the question, so the difference is visible):
Name col2 Rolling5_original Rolling5
0 Jim 1 NaN 1.000000
1 Jim 2 NaN 1.812315
2 Jim 3 NaN 2.736992
3 Jim 4 NaN 3.710959
4 Jim 5 4.702821 4.702821
5 Jim 6 5.702821 5.702821
6 Jim 7 6.702821 6.702821
7 Jim 8 7.702821 7.702821
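If the 5-row cap isn't actually needed, an uncapped expanding EWMA can be computed with ewm directly; a sketch ('EWMA_expanding' is a hypothetical column name), which agrees with Rolling5 on the first five rows and diverges afterwards:
# a sketch (assumption): an uncapped expanding EWMA for comparison
df1['EWMA_expanding'] = df1.groupby('Name')['col2'].transform(
    lambda x: x.ewm(alpha=1 - np.log(2) / 3).mean())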
You're close: if you specify min_periods=1, the windows coming out of rolling start at size 1, expand up to 5, and then stay there. For the average, we pass only the matching tail of the weights to cover the cases where the window falls short:
weights = (1 - alpha) ** np.arange(5)[::-1]
df1["rolling_5"] = (df1.col2
                       .rolling(5, min_periods=1)
                       .apply(lambda win: np.average(win, weights=weights[-win.size:]))
                   )
to get
Name col2 rolling_5
0 Jim 5 5.0
1 Jim 5 5.0
2 Jim 5 5.0
3 Jim 5 5.0
4 Jim 5 5.0
5 Jim 5 5.0
6 Jim 5 5.0
7 Jim 5 5.0
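To see how the weight slicing handles a short window: the very first row's window has size 1, so only the last (and heaviest) weight, (1 - alpha)**0 == 1, is used. A quick check, assuming the weights array defined above:
# a sketch: the first window contains a single value, so only the last weight is used
win = df1['col2'].iloc[:1]
print(np.average(win, weights=weights[-win.size:]))   # 5.0 for this data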
I have dataframe 1:
Place
0 New York
1 Los Angeles 1
2 Los Angeles- 2
3 Dallas -1
4 Dallas - 2
5 Dallas3
dataframe 2
Place target value1 value2
New York 1000 a b
Los Angeles 1500 c d
Dallas 1 2000 e f
Desired dataframe
Place target value1 value2
New York 1000 a b
Los Angeles 1 750 c d
Los Angeles- 2 750 c d
Dallas -1 666.6 e f
Dallas - 2 666.6 e f
Dallas3 666.6 e f
Explanation: We have to merge dataframe1 and dataframe2 on 'Place'. We have 1 New York, 2 Los Angeles and 3 Dallas rows in dataframe1, but only one of each in dataframe2. So we split the target based on the count of each place (matching on the name only, ignoring numbers) in df1 and assign value1 and value2 to the respective places.
Is there any way to handle all the spelling variations, whitespace and special characters using regex and obtain the desired dataframe?
This is the exact solution:
def extract_city(col):
    return col.str.extract(r'([a-zA-Z]+(?:\s+[a-zA-Z]+)*)')[0]

df = pd.merge(df1, df2, left_on=extract_city(df1['Place']), right_on=extract_city(df2['Place']))
df = df.drop(['key_0', 'Place_y'], axis=1).rename({'Place_x': 'Place'}, axis=1)
df['target'] /= df.groupby(extract_city(df['Place']))['Place'].transform('count')
df
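To see what the helper keeps, here is a sketch of what extract_city yields for the Place column of dataframe 1 (assuming df1 holds the Place values shown above, e.g. built as in the answer below):
# a sketch: the regex keeps only the letters (and internal spaces) of each place name
print(extract_city(df1['Place']).tolist())
# ['New York', 'Los Angeles', 'Los Angeles', 'Dallas', 'Dallas', 'Dallas']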
An alternative method to do this is as follows:
import pandas as pd
df1 = pd.DataFrame({'Place':['New York','Los Angeles 1','Los Angeles- 2','Dallas -1','Dallas - 2','Dallas3']})
print (df1)
# create a column to compare both dataframes: remove digits, '-' and spaces
df1['Place_compare'] = df1.Place.str.replace(r'\d+|-| ', '', regex=True)
df2 = pd.DataFrame({'Place':['New York','Los Angeles','Dallas 1'],
                    'target':[1000,1500,2000],
                    'value1':['a','c','e'],
                    'value2':['b','d','f']})
print (df2)
# create a column to compare both dataframes: remove digits, '-' and spaces
df2['Place_compare'] = df2.Place.str.replace(r'\d+|-| ', '', regex=True)
# count how many times each unique Place value occurs in df1 and assign to df2
df2['counts'] = df2['Place_compare'].map(df1['Place_compare'].value_counts())
#calculate new target based on number of occurrences of Place in df1
df2['new_target'] = (df2['target'] / df2['counts']).round(2)
# repeat the rows by the number of times each appears in counts
df2 = df2.reindex(df2.index.repeat(df2['counts']))
#drop temp columns
df2.drop(['counts','Place_compare','target'], axis=1, inplace=True)
#rename new_target as target
df2 = df2.rename({'new_target': 'target'}, axis=1)
print (df2)
The output of this will be:
Dataframe1:
Place
0 New York
1 Los Angeles 1
2 Los Angeles- 2
3 Dallas -1
4 Dallas - 2
5 Dallas3
Dataframe2:
Place target value1 value2
0 New York 1000 a b
1 Los Angeles 1500 c d
2 Dallas 1 2000 e f
Updated DataFrame with repeated values:
Place value1 value2 target
0 New York a b 1000.00
1 Los Angeles c d 750.00
1 Los Angeles c d 750.00
2 Dallas 1 e f 666.67
2 Dallas 1 e f 666.67
2 Dallas 1 e f 666.67
I'm looking for help creating a sub-dataframe from an existing dataframe using an np.nansum-like function. I want to convert this table into a matrix of non-null column sums:
dan ste bob
t1 na 2 na
t2 2 na 1
t3 2 1 na
t4 1 na 2
t5 na 1 2
t6 2 1 na
t7 1 na 2
For example, when 'dan' is non-null (t2, t3, t4, t6, t7), the sum of 'ste' is 2 and the sum of 'bob' is 5. When 'ste' is non-null, the sum of 'dan' is 4.
dan ste bob
dan 0 2 5
ste 4 0 2
bob 4 1 0
Any ideas?
Thanks in advance!
I ended up using a modified version of matt's function below:
def nansum_matrix_create(df):
    rows = []
    for col in list(df.columns.values):
        col_sums = df[df[col] != 0].sum()
        rows.append(col_sums)
    return pd.DataFrame(rows, columns=df.columns, index=df.columns)
Use pd.DataFrame.notnull to get where non-nulls are.
Then use pd.DataFrame.dot to get the crosstab.
Finally, use np.eye to zero out the diagonal.
df.notnull().T.dot(df.fillna(0)) * (1 - np.eye(df.shape[1]))
dan ste bob
dan 0.0 2.0 5.0
ste 4.0 0.0 2.0
bob 4.0 1.0 0.0
Note:
I used this to ensure my values were numeric.
df = df.apply(pd.to_numeric, errors='coerce')
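Unpacking the one-liner, a sketch assuming the same df:
import numpy as np

mask = df.notnull()                        # True where a value is present
sums = mask.T.dot(df.fillna(0))            # sums.loc[i, j] = sum of column j over rows where column i is non-null
result = sums * (1 - np.eye(df.shape[1]))  # zero out the diagonal
print(result)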
Assuming your dataframe doesn't have a large number of columns, this function should do what you want and be fairly performant. I have implemented this using a for loop across columns, so there may be a more performant / elegant solution out there.
import numpy as np
import pandas as pd

# Initialise dataframe
df = {"dan": [np.nan, 2, 2, 1, np.nan, 2, 1],
      "ste": [2, np.nan, 1, np.nan, 1, 1, np.nan],
      "bob": [np.nan, 1, np.nan, 2, 2, np.nan, 2]}
df = pd.DataFrame(df)[["dan", "ste", "bob"]]
def matrix_create(df):
    rows = []
    for col in df.columns:
        subvals, index = [], []
        for subcol in df.columns:
            index.append(subcol)
            if subcol == col:
                subvals.append(0)
            else:
                subvals.append(df[~pd.isnull(df[col])][subcol].sum())
        rows.append(subvals)
    return pd.DataFrame(rows, columns=df.columns, index=index)

matrix_create(df)
I have some simple data in a dataframe consisting of three columns [id, country, volume] where the index is 'id'.
I can perform simple operations like:
df_vol.groupby('country').sum()
and it works as expected. When I attempt to use rank() it does not work as expected and the result is an empty dataframe.
df_vol.groupby('country').rank()
The result is not consistent and in some cases it works. The following also works as expected:
df_vol.rank()
I want to return something like:
vols = []
for _, df in f_vol.groupby('country'):
vols.append(df['volume'].rank())
pd.concat(vols)
Any ideas why would be much appreciated!
You can select the column with [] so the function is called only on the volume column:
df_vol.groupby('country')['volume'].rank()
Sample:
df_vol = pd.DataFrame({'country':['en','us','us','en','en'],
                       'volume':[10,10,30,20,50],
                       'id':[1,1,1,2,2]})
print(df_vol)
country id volume
0 en 1 10
1 us 1 10
2 us 1 30
3 en 2 20
4 en 2 50
df_vol['r'] = df_vol.groupby('country')['volume'].rank()
print (df_vol)
country id volume r
0 en 1 10 1.0
1 us 1 10 1.0
2 us 1 30 2.0
3 en 2 20 2.0
4 en 2 50 3.0
I have a DataFrame named au from which I want to drop the rows whose date matches the date of the observation where bal==bal.max(). For example, if bal==bal.max() falls on 2009-08-01, then I want to drop all observations for which date=='2009-08-01'. Below is what I've tried, but both attempts result in ValueError: Series lengths must match to compare
au = au[au.date != au.date[au.bal==au.bal.max()]]
au = au[au.date != au.date[au.bal==au.bal.max()].values]
Using idxmax and #jezrael's setup
setup
au = pd.DataFrame({'bal':[1,2,3,4],
                   'date':['2009-08-01','2009-08-01','2009-08-02', '2009-08-02'],
                   'C':[7,8,9,1]})
solution
dmax = au.date.loc[au.bal.idxmax()]
au[au.date != dmax]
C bal date
0 7 1 2009-08-01
1 8 2 2009-08-01
I think you need to get a scalar from the one-item Series, either with item() or from values by selecting the first value with [0]:
au = pd.DataFrame({'bal':[1,2,3,4],
                   'date':['2009-08-01','2009-08-01','2009-08-02', '2009-08-02'],
                   'C':[7,8,9,1]})
print (au)
C bal date
0 7 1 2009-08-01
1 8 2 2009-08-01
2 9 3 2009-08-02
3 1 4 2009-08-02
print (au[au.date != au.loc[au.bal==au.bal.max(), 'date'].item()])
C bal date
0 7 1 2009-08-01
1 8 2 2009-08-01
Solution with idxmax: first create a Series indexed by date:
print (au.set_index('date').bal.idxmax())
2009-08-02
au = au[au.date != au.set_index('date').bal.idxmax()]
print (au)
C bal date
0 7 1 2009-08-01
1 8 2 2009-08-01