I am tackling an issue in pandas:
I would like to group a DataFrame by an index column, then perform a transform(np.gradient) (i.e. compute the derivative over all values in a group). This doesn't work if a group is too small (fewer than 2 elements), so I would like to just return 0 in that case.
The following code returns an error:
import pandas as pd
import numpy as np
data = pd.DataFrame(
    {
        "time": [0, 0, 1, 2, 2, 3, 3],
        "position": [0.1, 0.2, 0.2, 0.1, 0.2, 0.1, 0.2],
        "speed": [150.0, 145.0, 149.0, 150.0, 150.0, 150.0, 150.0],
    }
)
derivative = data.groupby("time").transform(np.gradient)
Gives me a ValueError:
ValueError: Shape of array too small to calculate a numerical gradient, at least (edge_order + 1) elements are required.
The desired output for the example DataFrame above would be:
time  position_km  derivative
0     0.1          -5.0
      0.2          -5.0
1     0.2           0.0
2     0.1           0.0
      0.2           0.0
3     0.1           0.0
      0.2           0.0
Does anyone have a good idea on how to solve this, e.g. using a lambda function in the transform?
derivative = data.groupby("time").transform(lambda x: np.gradient(x) if len(x) > 1 else 0)
does exactly what I wanted. Thanks @Chrysophylaxs
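For reference, a minimal end-to-end sketch of that solution on the example data; the output in the comment is what np.gradient should produce per group (values approximate):
import pandas as pd
import numpy as np

data = pd.DataFrame(
    {
        "time": [0, 0, 1, 2, 2, 3, 3],
        "position": [0.1, 0.2, 0.2, 0.1, 0.2, 0.1, 0.2],
        "speed": [150.0, 145.0, 149.0, 150.0, 150.0, 150.0, 150.0],
    }
)

# Groups with a single row get 0 instead of raising the gradient error
derivative = data.groupby("time").transform(
    lambda x: np.gradient(x) if len(x) > 1 else 0
)
# derivative holds the per-group gradient of each column, roughly:
#    position  speed
# 0       0.1   -5.0
# 1       0.1   -5.0
# 2       0.0    0.0
# 3       0.1    0.0
# 4       0.1    0.0
# 5       0.1    0.0
# 6       0.1    0.0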
Possible option:
def gradient_group(group):
    # Groups that are too small for np.gradient just get 0
    if group.shape[0] < 2:
        return 0
    return np.gradient(group)

derivative = data.groupby('time').transform(gradient_group)
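transform is used here (rather than apply) so that the result keeps the original row alignment; with apply the returned values are not guaranteed to line up with the original index, which makes assigning them back to a column unreliable. If you only need the derivative of a single column, you can for example transform just that column and assign it back (the column name speed_derivative here is just illustrative):
data['speed_derivative'] = data.groupby('time')['speed'].transform(gradient_group)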
With a dataframe like this:
index col_1 col_2 ... col_n
0 0.2 0.1 0.3
1 0.2 0.1 0.3
2 0.2 0.1 0.3
...
n 0.4 0.7 0.1
How can one get the norm for each column, where the norm is the sqrt of the sum of the squares?
I am able to do this for each column sequentially, but am unsure how to vectorize it (avoiding a for loop):
import pandas as pd
import numpy as np
norm_col_1 = np.linalg.norm(df['col_1'])
norm_col_2 = np.linalg.norm(df['col_2'])
norm_col_n = np.linalg.norm(df['col_n'])
The answer would be a new Series like this:
       norms
col_1  0.111
col_2  0.202
col_3  0.550
...
col_n  0.100
You can pass the entire DataFrame to np.linalg.norm, along with an axis argument of 0 to tell it to apply it column-wise:
np.linalg.norm(df, axis=0)
To create a series with appropriate column names, try:
results = pd.Series(data=np.linalg.norm(df, axis=0), index=df.columns)
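A quick runnable check of that, using a small made-up DataFrame (the column names and values here are purely illustrative):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col_1': [0.2, 0.2, 0.2, 0.4],
    'col_2': [0.1, 0.1, 0.1, 0.7],
    'col_3': [0.3, 0.3, 0.3, 0.1],
})

# One norm per column, labelled with the column names
results = pd.Series(data=np.linalg.norm(df, axis=0), index=df.columns)
print(results)
# col_1    0.529150
# col_2    0.721110
# col_3    0.529150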
I'm probably missing something, but I was not able to find a solution for this.
Is there a way in Python to fill a new column with the sum of the values in each row that satisfy a certain condition?
In Excel I would put the following formula in the new column and copy it down:
=SUMIF(A1:C1, ">0")
val1   val2   val3   output
0.5    0.7    -0.9   1.2
0.3    -0.7          0.3
-0.5   -0.7   -0.9   0
Also in my extracts, there are a few blank values. Can you please help me understand what code should be written for this?
df['total'] = df[['A','B']].sum(axis=1).where(df['A'] > 0, 0)
I came across the code above, but it checks only one condition. What I need is the sum over all columns whose values match the given condition.
Thanks!
pandas can handle that out of the box:
import pandas as pd
df = pd.DataFrame([[0.5,.7,-.9],[0.3,-.7,None],[-0.5,-.7,-.9]], columns=['val1','val2','val3'])
df['output'] = df[df>0].sum(axis=1)
Another way, somewhat similar to SUMIF:
# this is the "IF"
is_positive = df.loc[:, "val1": "val3"] > 0
# this is selecting the parts where condition holds & sums
df["output"] = df.loc[:, "val1": "val3"][is_positive].sum(axis=1)
where axis=1 in the last line sums along rows, giving:
>>> df
val1 val2 val3 output
0 0.5 0.7 -0.9 1.2
1 0.3 -0.7 NaN 0.3
2 -0.5 -0.7 -0.9 0.0
Use DataFrame.clip before sum:
df['total'] = df[['val1','val2','val3']].clip(lower=0).sum(axis=1)
#solution by Nk03 from comments
cols = ['val1','val2','val3']
df['total'] = df[cols].mask(df[cols]<0).sum(axis=1)
EDIT: To build the mask from a different set of columns, convert those columns to a NumPy array first:
df['total'] = df.loc[:, "D":"F"].mask(df.loc[:, "A":"C"].to_numpy() == 'Y', 0).sum(axis=1)
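A small sketch of that EDIT with made-up flag columns A-C and value columns D-F (all names and values here are illustrative, not from the question):
import pandas as pd

df = pd.DataFrame({
    'A': ['Y', 'N', 'N'],
    'B': ['N', 'Y', 'N'],
    'C': ['N', 'N', 'N'],
    'D': [1.0, 2.0, 3.0],
    'E': [4.0, 5.0, 6.0],
    'F': [7.0, 8.0, 9.0],
})

# Zero out each value in D-F whose corresponding flag in A-C is 'Y', then sum per row.
# .to_numpy() drops the A-C labels so the mask aligns with D-F by position.
df['total'] = df.loc[:, "D":"F"].mask(df.loc[:, "A":"C"].to_numpy() == 'Y', 0).sum(axis=1)
print(df['total'])
# 0    11.0   (1.0 masked because A == 'Y')
# 1    10.0   (5.0 masked because B == 'Y')
# 2    18.0   (nothing masked)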
You can do it in the following way:
df["total"] = df.apply(lambda x: sum(x), axis=1).where((df['A'] > 0) & (df['B'] > 0) & (another_condition) & (another_condition), 0)
Note that this code takes the sum across all columns at once.
For taking sum of specific columns you can do the following:
df['total'] = df[['A','B','C','D','E']].sum(axis=1).where((df['A'] > 0) & (df['B'] > 0) & (another_condition) & (another_condition), 0)
I'm struggling with how to clean up a dataframe. What I would like to do is truncate all items (i.e. floor()), and for any items below or above a min/max, replace them with the min or max as applicable.
If my min and max are 1 and 5 respectively, 1.2 would truncate to 1, 9.6 would map to 5, -1.2 would map to 1, and 3.5 would truncate to 3.
Other than brute-force iteration using iterrows(), I haven't been able to get this to work. Lots of stuff on finding the min and max, but not on applying a min and max.
May I please ask if anyone has some suggestions? Thank you.
You can use applymap, for example:
from numpy import floor
MAX, MIN = 5, 1
df = df.applymap(lambda val: MAX if val > MAX else int(floor(val)) if val > MIN else MIN)
You can use df.clip and cast to int:
df = pd.DataFrame({
    'A': [1.2, 3.5],
    'B': [9.6, -1.2]
})
df.clip(1,5).astype('int')
Out:
A B
0 1 5
1 3 1
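One detail worth noting (my observation, not part of the original answer): astype('int') truncates toward zero, which only coincides with floor() for non-negative values; because clip(1, 5) first forces every value into [1, 5], the truncation here behaves exactly like flooring.
pd.Series([-1.7, 3.5]).astype('int')             # [-1, 3]: truncation, not floor
pd.Series([-1.7, 3.5]).clip(1, 5).astype('int')  # [1, 3]: same as floor after clipping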
If you want float values, you can floor the dataframe with np.floor (which conveniently returns a pd.DataFrame) and then clip.
import numpy as np
np.floor(df)
Out:
A B
0 1.0 9.0
1 3.0 -2.0
np.floor(df).clip(1,5)
Out:
A B
0 1.0 5.0
1 3.0 1.0
Micro-Benchmark
With Python 3.6.9 and pandas 1.1.5 on a Google Colab instance.
Results: (perfplot timing chart not reproduced here)
Code used for the benchmark:
import pandas as pd
import numpy as np
import perfplot
def make_data(n=100):
    return pd.DataFrame(
        np.random.uniform(-1.2, 9.6, (n, 10))
    )

def clip_castint(df):
    return df.clip(1, 5).astype('int')

def clip_npfloor(df):
    return np.floor(df.clip(1, 5))

from numpy import floor
def applymap(df):
    MAX, MIN = 5, 1
    return df.applymap(lambda val: MAX if val > MAX else int(floor(val)) if val > MIN else MIN)

perfplot.show(
    setup=make_data,
    kernels=[clip_castint, clip_npfloor, applymap],
    n_range=[2**k for k in range(10, 22)],
    xlabel="df(rows, 10)"
)
I am calculating percentage change for a panel dataset that has both positive and negative values. If the values at dates n and n+1 are both negative and the value at n is greater than at n+1, for instance n = -2 and n+1 = -4, the calculated percentage change is ((n+1) - n)/n = ((-4) - (-2))/(-2) = 1. The change is a downtrend and should therefore be negative, but the result has the opposite sign. In other software I normally make the denominator an absolute value, ((n+1) - n)/abs(n), to preserve the direction of the trend. Just wondering whether I can do the same with pandas pct_change, i.e. set the denominator to an absolute value. Many thanks. I have solved the question based on the answer from Leo.
Here is a data example if one wants to play around with it:
import pandas as pd
df= {'id':[1,1,2,2,3,3],'values':[-2,-4,-2,2,1,5]}
df=pd.DataFrame(data=df)
df['pecdiff'] = (df.groupby('id')['values']
                 .apply(lambda x: x.diff() / x.abs().shift())
                 ).fillna(method='bfill')
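For the example above, that should give roughly the following pecdiff column (each group's leading NaN is back-filled from the next row of the same id):
# id  values  pecdiff
# 1   -2      -1.0
# 1   -4      -1.0
# 2   -2       2.0
# 2    2       2.0
# 3    1       4.0
# 3    5       4.0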
If I understood correctly, the expected_change line should solve your problem. For comparison, I put pandas' method and what you need side by side.
The following code:
import pandas as pd
df = pd.DataFrame([-2,-4,-2,2,1], columns = ['Values'])
df['pct_change'] = df['Values'].pct_change()
# This should help you out:
df['expected_change'] = df['Values'].diff() / df['Values'].abs().shift()
df
gives this output. Note that the signs differ for rows 1 through 3:
Values pct_change expected_change
0 -2 NaN NaN
1 -4 1.0 -1.0
2 -2 -0.5 0.5
3 2 -2.0 2.0
4 1 -0.5 -0.5
I have a dataframe with two columns, score and order_amount. I want to find the score Y that represents the Xth percentile of order_amount. I.e. if I sum up all of the values of order_amount where score <= Y I will get X% of the total order_amount.
I have a solution below that works, but it seems like there should be a more elegant way with pandas.
import pandas as pd
test_data = {'score': [0.3, 0.1, 0.2, 0.4, 0.8],
             'value': [10, 100, 15, 200, 150]}
df = pd.DataFrame(test_data)
df
score value
0 0.3 10
1 0.1 100
2 0.2 15
3 0.4 200
4 0.8 150
# Now we can order by `score` and use `cumsum` to calculate what we want
df_order = df.sort_values('score')
df_order['percentile_value'] = 100*df_order['value'].cumsum()/df_order['value'].sum()
df_order
score value percentile_value
1 0.1 100 21.052632
2 0.2 15 24.210526
0 0.3 10 26.315789
3 0.4 200 68.421053
4 0.8 150 100.000000
# Now can find the first value of score with percentile bigger than 50% (for example)
df_order[df_order['percentile_value']>50]['score'].iloc[0]
Use Series.searchsorted:
idx = df_order['percentile_value'].searchsorted(50)
print (df_order.iloc[idx, df.columns.get_loc('score')])
0.4
Or get the first value of the filtered Series with next and iter; if there is no match, some default value is returned:
s = df_order.loc[df_order['percentile_value'] > 50, 'score']
print (next(iter(s), 'no match'))
0.4
One line solution:
out = next(iter((df.sort_values('score')
                   .assign(percentile_value = lambda x: 100*x['value'].cumsum()/x['value'].sum())
                   .query('percentile_value > 50')['score'])), 'no match')
print (out)
0.4
Here is another way, starting from the original dataframe, using np.percentile:
df = df.sort_values('score')
df.loc[np.searchsorted(df['value'],np.percentile(df['value'].cumsum(),50)),'score']
Or Series.quantile:
df.loc[np.searchsorted(df['value'],df['value'].cumsum().quantile(0.5)),'score']
Or similarly with iloc, if index is not default:
df.iloc[np.searchsorted(df['value'],
                        np.percentile(df['value'].cumsum(), 50)),
        df.columns.get_loc('score')]
0.4
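For reuse, the question's own sort/cumsum approach can be wrapped in a small helper; this is just a sketch (the function name and the score_col/value_col parameters are mine, not from any of the answers):
def score_at_percentile(df, pct, score_col='score', value_col='value'):
    # Sort by score, compute each row's cumulative share of the total value,
    # then take the first score whose cumulative share reaches pct percent.
    ordered = df.sort_values(score_col)
    share = 100 * ordered[value_col].cumsum() / ordered[value_col].sum()
    idx = share.searchsorted(pct)
    return ordered[score_col].iloc[idx]

score_at_percentile(df, 50)   # 0.4 for the example data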