How do I compute the difference between rows in column X across groups, rather than within groups? Within each group the diff value should be forward-filled (ffill).
import pandas as pd

df = pd.DataFrame({
'Time' : [1,1,2,2,3,3],
'X' : [1,1,3,3,6,6],
'Y' : [1,1,1,1,2,2],
})
Neither of these attempts gives me what I want:
df['X'] = df['X'].diff()
df['X'] = df.groupby('Time')['X'].diff()
Intended Output:
Time X Y
0 1 0 1
1 1 0 1
2 2 2 1
3 2 2 1
4 3 3 2
5 3 3 2
If the values inside a group are all equal (even when the number of rows per group varies), you can do this by subtracting the previous group's value from every row of the current group:
df['X'] - df['Time'].map(df.groupby('Time')['X'].max().shift()).fillna(df['X'])
0 0.0
1 0.0
2 2.0
3 2.0
4 3.0
5 3.0
dtype: float64
Details
The first piece is to find the unique value of each group (I use max(), but first() or min() work just as well, since the values within a group are equal):
df.groupby('Time')['X'].max()
Time
1 1
2 3
3 6
Name: X, dtype: int64
Next, shift them down (_ refers to the previous step's result, REPL-style):
_.shift()
Time
1 NaN
2 1.0
3 3.0
Name: X, dtype: float64
Map it back to "Time" (the grouper):
df['Time'].map(_)
0 NaN
1 NaN
2 1.0
3 1.0
4 3.0
5 3.0
Name: Time, dtype: float64
Fill the first group of NaNs with "X":
_.fillna(df['X'])
0 1.0
1 1.0
2 1.0
3 1.0
4 3.0
5 3.0
Name: Time, dtype: float64
Now you have your RHS. Just subtract this from "X" and you're done.
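Putting the walkthrough together as one assignment (a minimal sketch, starting from the fresh df defined above; fillna handles the first group, whose shifted value is NaN):

# previous group's value, broadcast back to every row via map
prev = df['Time'].map(df.groupby('Time')['X'].max().shift())
df['X'] = df['X'] - prev.fillna(df['X'])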
If every group has the same fixed number of rows you can do
>>> df.X = df.X.diff(periods=2).fillna(0) # assumes all groups have two rows
>>> df
Time X Y
0 1 0.0 1
1 1 0.0 1
2 2 2.0 1
3 2 2.0 1
4 3 3.0 2
5 3 3.0 2
Related
I have a dataset as below:
import math
import numpy as np

data = {
    "A": [math.nan, math.nan, 1, math.nan, 2, math.nan, 3, 5],
    "B": np.random.randint(1, 5, size=8),
}
dt = pd.DataFrame(data)
My desired output is: wherever column A has a NaN, replace it with double the value of column B from the same row. Given that, below is my dataset:
A B
NaN 1
NaN 1
1.0 3
NaN 2
2.0 3
NaN 1
3.0 1
5.0 3
My desired output is:
A B
2 1
2 1
1 3
4 2
2 3
2 1
3 1
5 3
My current solution is below, and it does not work:
dt[pd.isna(dt["A"])]["A"] = dt[pd.isna(dt["A"])]["B"].apply( lambda x:2*x )
print(dt)
In your case, use fillna:
dt.A.fillna(dt.B * 2, inplace=True)
dt
A B
0 2.0 1
1 2.0 1
2 1.0 3
3 4.0 2
4 2.0 3
5 2.0 1
6 3.0 1
7 5.0 3
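For context, the original attempt fails because dt[pd.isna(dt["A"])]["A"] is chained indexing: the assignment lands on a temporary copy, not on dt itself. Two equivalent fixes that avoid the trap (a minimal sketch of the same idea):

# explicit .loc mask: select rows and column in a single indexing step
mask = dt['A'].isna()
dt.loc[mask, 'A'] = dt.loc[mask, 'B'] * 2
# or, without mutating in place:
dt['A'] = dt['A'].fillna(dt['B'] * 2)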
I have two DataFrames (example below). I would like to delete any row in df1 containing a value equal to df2['Patnum'] wherever df2['City'] is NaN.
For example: I would want to drop rows 1 and 3 in df1, since they contain the value 4, and Patnum 4 in df2 has a missing value in df2['City'].
How would I do this?
df1
Citer Citee
0 1 2
1 2 4
2 3 5
3 4 7
df2
Patnum City
0 1 new york
1 2 amsterdam
2 3 copenhagen
3 4 nan
4 5 sydney
expected result:
df1
Citer Citee
0 1 2
1 3 5
IIUC, stack, isin, and dropna.
The idea is to stack the frame, mark matching values with a True/False mask, drop those values, then unstack and drop any row left with a NaN.
val = df2[df2['City'].isna()]['Patnum'].values
df3 = df1.stack()[~df1.stack().isin(val)].unstack().dropna(how="any")
Citer Citee
0 1.0 2.0
2 3.0 5.0
Details
df1.stack()[~df1.stack().isin(val)]
0 Citer 1
Citee 2
1 Citer 2
2 Citer 3
Citee 5
3 Citee 7
dtype: int64
print(df1.stack()[~df1.stack().isin(val)].unstack())
Citer Citee
0 1.0 2.0
1 2.0 NaN
2 3.0 5.0
3 NaN 7.0
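If you don't need the stack/unstack detour, a plain row mask gives the same rows and keeps the integer dtype (a minimal sketch, reusing val from above):

# keep only rows where no cell matches a Patnum with a missing City
df1[~df1.isin(val).any(axis=1)]

   Citer  Citee
0      1      2
2      3      5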
I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to do is find the rate of change of data_col using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row and rate_of_change is calculated separately for each ID.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check whether ID_col is the same as in the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use Series.diff inside a groupby apply:
df.groupby('ID_col').apply(
lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64
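A variant that skips apply entirely: groupby(...).diff() already returns a result aligned with the original index, so it can be assigned straight back (a minimal sketch):

g = df.groupby('ID_col')
df['rate_of_change'] = g['data_col'].diff() / g['time_in_hours'].diff().abs()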
I often get tables containing similar information from different sources for QC. Sometimes I want to put these two tables side by side and output them to Excel to show others, so we can resolve discrepancies. To do so I want a 'lazy' merge with pandas DataFrames.
say, I have two tables:
df a: df b:
n I II n III IV
0 a 1 2 0 a 1 2
1 a 3 4 1 a 0 0
2 b 5 6 2 b 5 6
3 c 9 9 3 b 7 8
I want to have results like:
a merge b
n I II III IV
0 a 1 2 1 2
1 a 3 4
2 b 5 6 5 6
3 b 7 8
4 c 9 9
Of course, this is what I got with merge():
a.merge(b, how='outer', on="n")
n I II III IV
0 a 1 2 1.0 2.0
1 a 1 2 0.0 0.0
2 a 3 4 1.0 2.0
3 a 3 4 0.0 0.0
4 b 5 6 5.0 6.0
5 b 5 6 7.0 8.0
6 c 9 9 NaN NaN
I feel there must be an easy way to do this, but all my solutions were convoluted.
Is there a parameter in merge or concat for something like "no_copy"?
It doesn't look like you can do this with the given columns alone; you need to introduce a cumulative count column as an extra merge key. Consider this solution:
>>> import pandas
>>> dfa = pandas.DataFrame( {'n':['a','a','b','c'] , 'I' : [1,3,5,9] , 'II':[2,4,6,9]}, columns=['n','I','II'])
>>> dfb = pandas.DataFrame( {'n':['a','b','b'] , 'III' : [1,5,7] , 'IV':[2,6,8] }, columns=['n','III','IV'])
>>>
>>> dfa['nCC'] = dfa.groupby( 'n' ).cumcount()
>>> dfb['nCC'] = dfb.groupby( 'n' ).cumcount()
>>> dm = dfa.merge(dfb, how='outer', on=['n','nCC'] )
>>>
>>>
>>> dfa
n I II nCC
0 a 1 2 0
1 a 3 4 1
2 b 5 6 0
3 c 9 9 0
>>> dfb
n III IV nCC
0 a 1 2 0
1 b 5 6 0
2 b 7 8 1
>>> dm
n I II nCC III IV
0 a 1.0 2.0 0 1.0 2.0
1 a 3.0 4.0 1 NaN NaN
2 b 5.0 6.0 0 5.0 6.0
3 c 9.0 9.0 0 NaN NaN
4 b NaN NaN 1 7.0 8.0
>>>
It has the gaps (and lack of duplication) where you want them, although the index isn't quite identical to your desired output. Because NaNs are involved, the numeric columns get coerced to float64.
Adding the cumulative count essentially pairs duplicate keys across both sides: for a given key, the first occurrence in dfa matches the first occurrence in dfb, the second matches the second, and so on for every key.
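If you also want the row order from your desired output, sort on the merge keys and drop the helper column afterwards (a small sketch continuing from dm above):

# order rows by key, then occurrence number, then discard the helper key
dm = dm.sort_values(['n', 'nCC']).reset_index(drop=True).drop(columns='nCC')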
I'm trying to create a total column that sums the numbers from one column grouped by the values of another. I can do this with .groupby(), but that produces a truncated column, whereas I want a column the same length as the original.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where each 'a' column has the same total as the other.
Taking the sum of a groupby operation in pandas produces a result only as long as the number of unique grouping keys. Use transform to produce a column of the same length ("like-indexed") as the original DataFrame, without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
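The same pattern works for any reduction, not just sum; transform always returns a result aligned with the original index (a quick sketch, with 'avg' as a made-up column name):

# per-group mean, broadcast back to every row of its group
df['avg'] = df.groupby('a')['b'].transform('mean')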