I have an indexed dataset like this
np.random.seed(1)
df = pd.DataFrame({'A': [1, 1, 2, 2],
'B': [1, 2, 3, 4],
'C': np.random.randn(4)},
index = [5,242,12,634])
Now I'm trying to get the difference of C by group like so
df.groupby('A').agg('diff')
which gives me the output
B C
5 NaN NaN
242 1.0 -2.492028
12 NaN NaN
634 1.0 -0.455332
I'm trying to get a resulting dataframe with only 2 rows, which contain the differences like so
B C
1.0 -2.492028
1.0 -0.455332
How can I achieve this?
First, diff is not an aggregation function: it returns output of the same length as the original dataframe. If you want the differences without the NaN rows, drop them with dropna:
out = df.groupby('A').diff().dropna()
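A minimal sketch of the point above, rebuilding the dataframe from the question. It only illustrates that diff() keeps one row per input row (with NaN at the start of each group) and that dropna() removes those rows; the variable name diffs is introduced here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': np.random.randn(4)},
                  index=[5, 242, 12, 634])

# diff() is a transformation, not an aggregation: the result has one row
# per input row, and the first row of every group is NaN.
diffs = df.groupby('A').diff()

# Dropping the NaN rows leaves one row per within-group difference.
out = diffs.dropna()
print(out)
```

Note that dropna() removes any row containing a NaN, so a group with a single row contributes nothing to out.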
I have 2 dataframes:
df = pd.DataFrame({'A': [1, 2, 3],
'B': [4, 5, 6]},
index=['X', 'Y', 'Z'])
and
df1 = pd.DataFrame({'M': [10, 20, 30],
'N': [40, 50, 60]},
index=['S', 'T', 'U'])
I want to append a row from the df dataframe to df1.
I use the following code to extract the row:
row = df.loc['Y']
when i print this i get:
A 2
B 5
Name: Y, dtype: int64
A and B are the key values (the column heading names), so I transpose this with
row_25 = row.transpose()
i print row_25 and get:
A 2
B 5
Name: Y, dtype: int64
this is the same as row, so it seems the transpose didn't happen
i then add this code to add the row to df1:
result = pd.concat([df1, row_25], axis=0, ignore_index=False)
print(result)
When I print result I get:
M N 0
S 10.0 40.0 NaN
T 20.0 50.0 NaN
U 30.0 60.0 NaN
A NaN NaN 2.0
B NaN NaN 5.0
i want A and B to be column headings (key values) and the name of row (Y) to be the row index.
what am i doing wrong?
Try
pd.concat([df1, df.loc[['Y']]])
It generates:
M N A B
S 10.0 40.0 NaN NaN
T 20.0 50.0 NaN NaN
U 30.0 60.0 NaN NaN
Y NaN NaN 2.0 5.0
Not sure if this is what you want.
To exclude column names 'M' and 'N' from the result you can rename the columns beforehand:
>>> df1.columns = ['A', 'B']
>>> pd.concat([df1, df.loc[['Y']]])
A B
S 10 40
T 20 50
U 30 60
Y 2 5
You need double square brackets because single square brackets return a 1D Series, for which transposing is a no-op (that is why row.transpose() changed nothing). Double brackets return a 2D DataFrame. In general, a list inside the brackets selects several labels at once, like df.loc[['X', 'Y']]; NumPy calls this 'fancy indexing'.
If you are allergic to double brackets, use
pd.concat([df1.rename(columns={'M': 'A', 'N': 'B'}),
df.filter(items=['Y'], axis=0)])
Finally, if you really want to transpose something, you can convert the series to a frame and transpose it:
>>> df.loc['Y'].to_frame().T
A B
Y 2 5
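To make the Series/DataFrame distinction concrete, here is a small sketch using the df from the question. The names as_series, as_frame and via_transpose are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]},
                  index=['X', 'Y', 'Z'])

as_series = df.loc['Y']                    # single brackets: 1D Series
as_frame = df.loc[['Y']]                   # double brackets: one-row DataFrame
via_transpose = df.loc['Y'].to_frame().T   # Series -> column frame -> row frame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
print(via_transpose)
```

Both as_frame and via_transpose have Y as the row index and A, B as columns, which is the shape pd.concat needs in order to append the row.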
I have two dataframes like so:
data = {'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]}
data2 = {'A': [3, 2, 1, 0, 3, 2], 'B': [1, 2, 3, 4, 20, 2], 'C':[5,3,2,1, 5, 1]}
df1 = pd.DataFrame.from_dict(data)
df2 = pd.DataFrame.from_dict(data2)
Now I did a groupby of df2 for C
values_to_map = df2.groupby(['A', 'B']).mean().to_dict()
Now I would like to add a column new_c to df1, filled in where the columns A and B match:
A B new_c
0 3 1 5.0
1 2 2 2.0
2 1 3 2.0
3 0 4 1.0
where new_c is the average of C for every (A, B) pair from df2
Note that A and B don't have to be keys of the dataframe (i.e. they aren't unique identifiers which is why I want to map it with a dictionary originally, but failed with multiple keys)
How would I go about that?
Thank you for looking into it with me!
I found a solution to this
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = df1.apply(lambda x: values_to_map[x['A'], x['B']], axis=1)
Thanks for looking into it!
You can also use np.vectorize, passing the two columns as separate arguments:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = np.vectorize(lambda a, b: values_to_map.get((a, b), np.nan))(df1['A'], df1['B'])
You can first form a MultiIndex from the [["A", "B"]] subset of the frame df1 and use its map function to map the A-B pairs to the desired grouped mean values:
cols = ["A", "B"]
mapper = df2.groupby(cols).C.mean()
df1["new_c"] = pd.MultiIndex.from_frame(df1[cols]).map(mapper)
to get
>>> df1
A B new_c
0 3 1 5.0
1 2 2 2.0
2 1 3 2.0
3 0 4 1.0
(if an A-B pair in df1 isn't found in df2's groups, new_c corresponding to that pair will be NaN with this method.)
Note that neither pandas' apply nor np.vectorize is a truly "vectorized" routine; both loop in Python under the hood. However, they might be fast enough for one's purposes and might prove more readable in places.
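A different technique from the ones shown above, sketched here as one possible alternative: compute the group means as a small frame and left-merge it onto df1. The names means and result are introduced for illustration, and the data is the same as in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]})
df2 = pd.DataFrame({'A': [3, 2, 1, 0, 3, 2],
                    'B': [1, 2, 3, 4, 20, 2],
                    'C': [5, 3, 2, 1, 5, 1]})

# One row per (A, B) pair, holding the mean of C for that pair.
means = (df2.groupby(['A', 'B'], as_index=False)['C']
            .mean()
            .rename(columns={'C': 'new_c'}))

# A left merge keeps every row of df1; pairs missing from df2 get NaN.
result = df1.merge(means, on=['A', 'B'], how='left')
print(result)
```

Because means has exactly one row per (A, B) pair, the merge does not duplicate rows of df1, even when A and B are not unique in df1.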
I am converting a piece of code written in R to python. The following code is in R. df1 and df2 are the dataframes. id, case, feature, feature_value are column names. The code in R is
for(i in 1:dim(df1)[1]){
temp = subset(df2,df2$id == df1$case[i],select = df1$feature[i])
df1$feature_value[i] = temp[,df1$feature[i]]
}
My code in python is as follows.
for i in range(0,len(df1)):
    temp=np.where(df1['case'].iloc[i]==df2['id']),df1['feature'].iloc[i]
    df1['feature_value'].iloc[i]=temp[:,df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How to rectify this error? Appreciate any help.
Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
case feature
0 1 3
1 2 4
2 3 5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
id age
0 1 45
1 3 63
2 7 39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
case feature feature_value
0 1 3 3.0
1 2 4 NaN
2 3 5 5.0
Instead of a lambda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
    if df['case'] in id_list:
        feature_value = df['feature']
    else:
        feature_value = np.nan
    return feature_value
df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
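A rough sketch of that merge-based idea, using the sample df1 and df2 from above. It assumes the id values in df2 are unique (otherwise the merge would duplicate rows of df1) and that df1 has a default integer index; the names merged and matched are introduced for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})

# indicator=True adds a '_merge' column saying where each row was found:
# 'both' means the case value has a matching id in df2.
merged = df1.merge(df2[['id']], how='left',
                   left_on='case', right_on='id', indicator=True)
matched = merged['_merge'] == 'both'

# Keep 'feature' where a match exists, NaN elsewhere.
df1['feature_value'] = merged['feature'].where(matched)
print(df1)
```

If only the yes/no match is needed, df1['case'].isin(df2['id']) gives the same boolean mask without a merge.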
===================
To address the follow-up question in the comments:
You can do this by merging the dataframes and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
case feature names id age a b c
0 1 3 a 1.0 45.0 30.0 40.0 50.0
1 2 4 b NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0
Then you can create the following function:
def get_feature_values_2(df):
    if pd.notnull(df['id']):
        feature_value = df['feature']
        column_of_interest = df['names']
        feature_extended_value = df[column_of_interest]
    else:
        feature_value = np.nan
        feature_extended_value = np.nan
    return feature_value, feature_extended_value
# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
case feature names id age a b c feature_value \
0 1 3 a 1.0 45.0 30.0 40.0 50.0 3.0
1 2 4 b NaN NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0 5.0
feature_extended_value
0 30.0
1 NaN
2 51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
case feature names feature_value feature_extended_value
0 1 3 a 3.0 30.0
1 2 4 b NaN NaN
2 3 5 c 5.0 51.0
In the code below, I am replacing all NaN values in column b with a blank string if the corresponding value in column a is 1.
The code works, but I have to type df.loc[df.a == 1, 'b'] twice.
Is there a shorter/better way to do it?
import pandas as pd
df = pd.DataFrame({
'a': [1, None, 3],
'b': [None, 5, 6],
})
filtered = df.loc[df.a == 1, 'b']
filtered.fillna('', inplace=True)
df.loc[df.a == 1, 'b'] = filtered
print(df)
How about using NumPy's where function to check the values in a and b and replace them? See the mockup below; I have used a new column 'c' to illustrate.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a': [1, None, 3],
'b': [None, 5, 6],
})
#replace b value if the corresponding value in column a is 1 and column b is NaN
df['c'] = np.where(((df['a'] == 1) & (df['b'].isna())), df['a'], df['b'])
df
original dataframe:
a b
0 1.0 NaN
1 NaN 5.0
2 3.0 6.0
result:
a b c
0 1.0 NaN 1.0
1 NaN 5.0 5.0
2 3.0 6.0 6.0
Use np.where() to do it in one line
import numpy as np
df['b'] = np.where((df['b'].isnull()) & (df['a']==1),'',df['b'])
Use Series.fillna only for matched values by condition:
df.loc[df.a == 1, 'b'] = df['b'].fillna('')
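A hedged sketch of the same idea with the mask stored once, so the condition is not typed twice. The cast to object dtype is an assumption added here: writing a string into a float column triggers a dtype warning or error in recent pandas versions, so the column is converted up front.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, None, 3],
    'b': [None, 5, 6],
})

# Assumption: allow mixed strings and numbers by making the column object dtype.
df['b'] = df['b'].astype(object)

# Build the row mask once and reuse it on both sides of the assignment.
mask = df['a'] == 1
df.loc[mask, 'b'] = df.loc[mask, 'b'].fillna('')
print(df)
```

Only rows where a equals 1 are touched; NaN values in other rows of b would be left as they are.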
I'm wondering if there is a concise way to exclude all columns with more than N NaNs, while exempting one column from this rule.
For example:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5]],
columns=list('ABCD'))
Results in:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Running the following, I get:
df.dropna(thresh=2, axis=1)
B D
0 2.0 0
1 4.0 1
2 NaN 5
I would like to keep column 'C'. I.e., to perform this thresholding except on column 'C'.
Is that possible?
You can put the column back once you've done the thresholding. If you do this all on one line, you don't even need to store a reference to the column.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5]],
columns=list('ABCD'))
df.dropna(thresh=2, axis=1).assign(C=df['C'])
You could also do
C = df['C']
df = df.dropna(thresh=2, axis=1)
df = df.assign(C=C)
As suggested by @Wen, you can also do an indexing operation that won't remove column C to begin with.
threshold = 2
df = df.loc[:, (df.isnull().sum(0) < threshold) | (df.columns == 'C')]
The index here for the column will select columns that have fewer than threshold NaN values, or whose name is C. If you wanted to include more than just one column in the exception, you can chain more conditions with the "or" operator |. For example:
df = df.loc[
:,
(df.isnull().sum(0) < threshold) |
(df.columns == 'C') |
(df.columns == 'D')]
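When the list of exempt columns grows, chaining conditions gets unwieldy. The sketch below wraps the same boolean-mask idea in a helper; drop_sparse_columns and its parameters max_nans and keep are hypothetical names introduced here, not part of pandas.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_nans, keep=()):
    """Keep columns with at most max_nans NaNs, plus any column named in keep."""
    nan_counts = frame.isnull().sum(axis=0)
    selected = (nan_counts <= max_nans) | frame.columns.isin(list(keep))
    return frame.loc[:, selected]

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

result = drop_sparse_columns(df, max_nans=1, keep=['C'])
print(result)
```

Here max_nans=1 corresponds to threshold = 2 in the answer above (fewer than 2 NaNs), and the original column order is preserved.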
df.loc[:,(df.isnull().sum(0)<=1)|(df.isnull().sum(0)==len(df))]
Out[415]:
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5
As per Zero's suggestion
df.loc[:,(df.isnull().sum(0)<=1)|(df.isnull().all(0))]
EDIT :
df.loc[:,(df.isnull().sum(0)<=1)|(df.columns=='C')]
Another take that blends some concepts from other answers.
df.loc[:, df.isnull().assign(C=False).sum().lt(2)]
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5