import pandas as pd
df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [1, 3, 8, 10],
                   'C': ['alpha', 'bravo', 'charlie', 'delta']})
Now, I would like to group the data using my own lambdas, but they behave differently from what I expect. The lambda in the following example should return the first value of the column within each group:
df.groupby('A', as_index=False).agg({'B': 'mean',
                                     'C': lambda x: x[0]})
The code throws KeyError: 0, which is unclear to me, since ['alpha', 'bravo'][0] gives 'alpha'.
So, overall, the desired output is:
A B C
0 0 2 'alpha'
1 1 9 'charlie'
If selecting the first value in each group is needed, use Series.iat or Series.iloc to select by position:
df1 = df.groupby('A', as_index = False).agg({'B':'mean', 'C': lambda x: x.iat[0]})
Another solution is to use GroupBy.first:
df1 = df.groupby('A', as_index = False).agg({'B':'mean', 'C': 'first'})
print (df1)
A B C
0 0 2 alpha
1 1 9 charlie
Can you add an explanation of why the lambda doesn't work?
The problem is that the second group's index starts at 2, not 0, which raises the error: x[0] selects by index label 0, and that label does not exist in the second group:
df1 = df.groupby('A', as_index = False).agg({'B':'mean', 'C': lambda x: print (x[0])})
print (df1)
alpha <- only the first group's first value is printed, because only that group contains the index label 0
alpha
alpha
So if the index label 0 is assigned to the first row of each group, it works with this sample data:
df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [1, 3, 8, 10],
                   'C': ['alpha', 'bravo', 'charlie', 'delta']}, index=[0, 1, 0, 1])
print (df)
A B C
0 0 1 alpha <- index is 0
1 0 3 bravo
0 1 8 charlie <- index is 0
1 1 10 delta
df1 = df.groupby('A', as_index = False).agg({'B':'mean', 'C': lambda x: x[0]})
print (df1)
A B C
0 0 2 alpha
1 1 9 charlie
A small explanation of why your lambda function won't work.
When we use groupby, we get a GroupBy object back:
g = df.groupby('A')
print(g)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023AA1BB41D0>
When we access the elements in our groupby object, we get grouped dataframes back:
for idx, d in g:
    print(d, '\n')
A B C
0 0 1 alpha
1 0 3 bravo
A B C
2 1 8 charlie
3 1 10 delta
So that's why we need to treat these elements as DataFrames. As jezrael already pointed out in his answer, there are several ways to access the first value of your C column:
for idx, d in g:
    print(d['C'].iat[0])
    print(d['C'].iloc[0], '\n')
alpha
alpha
charlie
charlie
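To make the label mismatch concrete, here is a small sketch (using the same sample frame) that prints each group's index labels:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [1, 3, 8, 10],
                   'C': ['alpha', 'bravo', 'charlie', 'delta']})

# Each group keeps the row labels it had in the original frame,
# so only the first group contains the label 0.
for name, group in df.groupby('A'):
    print(name, list(group.index))
# group 0 has labels [0, 1]; group 1 has labels [2, 3],
# which is why x[0] raises KeyError for the second group.
```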
Related
I have a data frame and a dictionary like this:
thresholds = {'column':{'A':10,'B':11,'C':9}}
df:
Column
A 13
A 7
A 11
B 12
B 14
B 14
C 7
C 8
C 11
For every index group, I want to calculate the count of values less than the threshold and greater than the threshold value.
So my output looks like this:
df:
Values<Thr Values>Thr
A 1 2
B 0 3
C 2 1
Can anyone help me with this?
You can use:
import numpy as np
t = df.index.to_series().map(thresholds['column'])
out = (pd.crosstab(df.index, np.where(df['Column'].gt(t), 'Values>Thr', 'Values≤Thr'))
.rename_axis(index=None, columns=None)
)
Output:
Values>Thr Values≤Thr
A 2 1
B 3 0
C 1 2
Syntax variant:
out = (pd.crosstab(df.index, df['Column'].gt(t))
.rename_axis(index=None, columns=None)
.rename(columns={False: 'Values≤Thr', True: 'Values>Thr'})
)
Applying it to many columns, based on the keys in the dictionary:
def count(s):
    t = s.index.to_series().map(thresholds.get(s.name, {}))
    return (pd.crosstab(s.index, s.gt(t))
              .rename_axis(index=None, columns=None)
              .rename(columns={False: 'Values≤Thr', True: 'Values>Thr'})
           )
out = pd.concat({c: count(df[c]) for c in df})
NB. The key of the dictionary must match exactly. I changed the case for the demo.
Output:
Values≤Thr Values>Thr
Column A 1 2
B 0 3
C 2 1
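For reference, here is the crosstab approach assembled into one runnable snippet with the sample data (the construction of df from the question's listing is my assumption):

```python
import numpy as np
import pandas as pd

thresholds = {'column': {'A': 10, 'B': 11, 'C': 9}}
df = pd.DataFrame({'Column': [13, 7, 11, 12, 14, 14, 7, 8, 11]},
                  index=['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'])

# Map each index label to its threshold, then cross-tabulate the index
# against whether each value exceeds its threshold.
t = df.index.to_series().map(thresholds['column'])
out = (pd.crosstab(df.index, np.where(df['Column'].gt(t), 'Values>Thr', 'Values≤Thr'))
         .rename_axis(index=None, columns=None))
print(out)
```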
Here's another option:
import pandas as pd
df = pd.DataFrame({'Column': [13, 7, 11, 12, 14, 14, 7, 8, 11]})
df.index = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
thresholds = {'column':{'A':10,'B':11,'C':9}}
df['smaller'] = df['Column'].groupby(df.index).transform(lambda x: x < thresholds['column'][x.name]).astype(int)
df['greater'] = df['Column'].groupby(df.index).transform(lambda x: x > thresholds['column'][x.name]).astype(int)
df.drop(columns=['Column'], inplace=True)
# group by index summing the greater and smaller columns
sums = df.groupby(df.index).sum()
sums
I have two dataframes like so:
data = {'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]}
data2 = {'A': [3, 2, 1, 0, 3, 2], 'B': [1, 2, 3, 4, 20, 2], 'C':[5,3,2,1, 5, 1]}
df1 = pd.DataFrame.from_dict(data)
df2 = pd.DataFrame.from_dict(data2)
Now I did a groupby of df2 to get the mean of C:
values_to_map = df2.groupby(['A', 'B']).mean().to_dict()
Now I would like to map df1['new C'] where the columns A and B match.
A B new_C
0 3 1 1.0
1 2 2 2.0
2 1 3 2.0
3 0 4 12.5
where new_C is basically the average of C for every pair (A, B) from df2.
Note that A and B aren't keys of the dataframe (i.e. they aren't unique identifiers), which is why I originally wanted to map it with a dictionary, but failed with multiple keys.
How would I go about that?
Thank you for looking into it with me!
I found a solution to this
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = df1.apply(lambda x: values_to_map[x['A'], x['B']], axis=1)
Thanks for looking into it!
Just use np.vectorize:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = np.vectorize(lambda a, b: values_to_map[a, b])(df1['A'], df1['B'])
You can first form a MultiIndex from the [["A", "B"]] subset of the frame df1 and use its map function to map the A-B pairs to the desired grouped mean values:
cols = ["A", "B"]
mapper = df2.groupby(cols).C.mean()
df1["new_c"] = pd.MultiIndex.from_frame(df1[cols]).map(mapper)
to get
>>> df1
A B new_c
0 3 1 5.0
1 2 2 2.0
2 1 3 2.0
3 0 4 1.0
(if an A-B pair in df1 isn't found in df2's groups, new_c corresponding to that pair will be NaN with this method.)
Note that neither pandas' apply nor np.vectorize are "vectorized" routines. However, they might be fast enough for one's purposes and might prove more readable in places.
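Another option, not shown above: skip the dictionary entirely and left-merge df1 against the grouped means (a sketch using the question's data; the column name new_c follows the answers above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]})
df2 = pd.DataFrame({'A': [3, 2, 1, 0, 3, 2], 'B': [1, 2, 3, 4, 20, 2],
                    'C': [5, 3, 2, 1, 5, 1]})

# Per-(A, B) means of C; a left merge then keeps every row of df1 and
# fills NaN where a pair has no match, like the MultiIndex.map approach.
means = df2.groupby(['A', 'B'], as_index=False)['C'].mean().rename(columns={'C': 'new_c'})
df1 = df1.merge(means, on=['A', 'B'], how='left')
print(df1)
```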
I would like to sort a dataframe by index, and then by alphabetical order, in case some index values are identical.
df = pd.DataFrame()
df['values'] = ['d', 'c', 'b', 'a']
df['index'] = [2, 0, 1, 0]
df = df.set_index('index')
df.sort_index(inplace=True)
Which outputs:
values
index
0 c
0 a
1 b
2 d
However, I am expecting:
values
index
0 a
0 c
1 b
2 d
Is there any way to achieve this consistently? Thank you.
According to the official documentation, you can pass the index name into sort_values:
df.sort_values(['index','values'])
Output:
values
index
0 a
0 c
1 b
2 d
Fun: You can also sort by values, then sort again by index with a stable algorithm:
df.sort_values('values').sort_index(kind='mergesort')
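A quick runnable check that both approaches agree (assuming a pandas version where sort_values accepts index level names, 1.1+):

```python
import pandas as pd

df = pd.DataFrame({'values': ['d', 'c', 'b', 'a']},
                  index=pd.Index([2, 0, 1, 0], name='index'))

# One step: sort by the index level and the column together.
a = df.sort_values(['index', 'values'])

# Two steps: sort by the column first; a stable index sort then keeps
# the alphabetical order among rows that share an index value.
b = df.sort_values('values').sort_index(kind='mergesort')

print(a['values'].tolist())  # ['a', 'c', 'b', 'd']
print(a.equals(b))
```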
I have two dataframes of different sizes (df1 and df2). I would like to remove from df1 all the rows that are present in df2.
So if I have df2 equals to:
A B
0 wer 6
1 tyu 7
And df1 equals to:
A B C
0 qwe 5 a
1 wer 6 s
2 wer 6 d
3 rty 9 f
4 tyu 7 g
5 tyu 7 h
6 tyu 7 j
7 iop 1 k
The final result should be like so:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
I was able to achieve my goal using a for loop, but I would like to know if there is a more elegant and efficient way to perform this operation.
Here is the code I wrote in case you need it:
import pandas as pd
df1 = pd.DataFrame({'A': ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B': [5, 6, 6, 9, 7, 7, 7, 1],
                    'C': ['a', 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A': ['wer', 'tyu'],
                    'B': [6, 7]})
for i, row in df2.iterrows():
    df1 = df1[(df1['A'] != row['A']) & (df1['B'] != row['B'])].reset_index(drop=True)
Use merge with an outer join, filter with query, and finally remove the helper column with drop:
df = (pd.merge(df1, df2, on=['A', 'B'], how='outer', indicator=True)
        .query("_merge != 'both'")
        .drop('_merge', axis=1)
        .reset_index(drop=True))
print (df)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
The cleanest way I found was to use drop from pandas using the index of the dataframe you want to drop:
df1.drop(df2.index, axis=0,inplace=True)
You can use np.in1d to check whether each row of df1 exists in df2, and then use it as an inverted mask to select rows from df1:
df1[~df1[['A','B']].apply(lambda x: np.in1d(x,df2).all(),axis=1)]\
.reset_index(drop=True)
Out[115]:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
pandas has a method called isin; however, this relies on unique indices. We can define a lambda function that builds a usable key from the existing 'A' and 'B' columns of df1 and df2. We then negate the result (as we want the values not in df2) and reset the index:
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
unique_ind = lambda df: df['A'].astype(str) + '_' + df['B'].astype(str)
print(df1[~unique_ind(df1).isin(unique_ind(df2))].reset_index(drop=True))
printing:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
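A related variant (my own sketch, not from the answers above) avoids the string concatenation by building a MultiIndex over the key columns and testing pair membership directly:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B': [5, 6, 6, 9, 7, 7, 7, 1],
                    'C': ['a', 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A': ['wer', 'tyu'], 'B': [6, 7]})

# Row-wise membership test on the (A, B) pairs; keep the rows of df1
# whose pair does not appear anywhere in df2.
pairs = pd.MultiIndex.from_frame(df1[['A', 'B']])
keep = ~pairs.isin(list(pd.MultiIndex.from_frame(df2)))
out = df1[keep].reset_index(drop=True)
print(out)
```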
I think the cleanest way can be:
We have a base dataframe D and want to remove a subset D1; let the output be D2:
D2 = pd.DataFrame(D, index = set(D.index).difference(set(D1.index))).reset_index()
I find this other alternative useful too:
pd.concat([df1,df2], axis=0, ignore_index=True).drop_duplicates(subset=["A","B"],keep=False, ignore_index=True)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
keep=False drops both duplicates.
It doesn't require the two dataframes to share all their columns, so I find it a bit easier.
I am trying to create a program that will delete a column in a pandas DataFrame if the column's sum is less than 10.
I currently have the following solution, but I was curious if there is a more pythonic way to do this.
df = pandas.DataFrame(AllData)
col_sums = df.sum()  # sum of each column
badCols = list()
for index in range(len(col_sums)):
    if col_sums.iloc[index] < 10:
        badCols.append(index)
df = df.drop(df.columns[badCols], axis=1)
In my approach, I create a list of the positions of columns whose sums are less than 10, then drop those columns. Is there a better approach for doing this?
You can call sum to generate a Series giving the sum of each column, then use it to build a boolean mask against your columns and filter the df. DF generation code borrowed from @Alexander:
In [2]:
df = pd.DataFrame({'a': [1, 10], 'b': [1, 1], 'c': [20, 30]})
df
Out[2]:
a b c
0 1 1 20
1 10 1 30
In [3]:
df.sum()
Out[3]:
a 11
b 2
c 50
dtype: int64
In [6]:
df[df.columns[df.sum()>10]]
Out[6]:
a c
0 1 20
1 10 30
You can accomplish your objective with a one-liner by using a list comprehension and items to identify all columns that meet your criteria.
df = pd.DataFrame({'a': [1, 10], 'b': [1, 1], 'c': [20, 30]})
>>> df
a b c
0 1 1 20
1 10 1 30
df.drop([col for col, val in df.sum().items() if val < 10], axis=1, inplace=True)
>>> df
a c
0 1 20
1 10 30
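The same filter can also be written without drop at all, using .loc with a boolean mask over the column sums (keeping sums of at least 10, per the question's "less than 10" criterion):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 10], 'b': [1, 1], 'c': [20, 30]})

# Keep only the columns whose sum is >= 10; 'b' (sum 2) is removed.
df = df.loc[:, df.sum() >= 10]
print(df.columns.tolist())  # ['a', 'c']
```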