I have a Pandas DataFrame with columns A, B, C, D, date. I want to filter out duplicates of A and B, keeping the row with the most recent value in date.
So if I have two rows that look like:
A B C D date
1 1 2 3 1/1/18
1 1 2 3 1/1/17
The correct output would be:
A B C D date
1 1 2 3 1/1/18
I can do this by looping through, but I'd like to use df.groupby(['A', 'B']) and then aggregate by taking the largest value for date in each group.
I tried:
df.groupby(['A', 'B']).agg(lambda x: x.iloc[x.date.argmax()])
But I get:
AttributeError: 'Series' object has no attribute 'date'
Any idea what I'm doing incorrectly?
Edit: Hmm if I do:
df.groupby(['A', 'B']).date.max()
I get mostly what I want but I lose columns D and C...
You can do it with:
df.date=pd.to_datetime(df.date)
df.sort_values('date').drop_duplicates(['A','B'],keep='last')
A B C D date
0 1 1 2 3 2018-01-01
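An equivalent sketch, assuming date has already been converted to datetime as above: idxmax returns the label of each group's most recent row, and loc pulls those full rows back out.
# per (A, B) group, find the label of the newest date, then select those rows
df.loc[df.groupby(['A', 'B'])['date'].idxmax()]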
Try df.groupby(['A', 'B']).apply(lambda g: g.loc[g['date'].idxmax()]) instead.
The AttributeError arises because agg passes each column to the lambda one at a time as a Series, so there is no date column reachable from inside it; apply passes the whole group as a DataFrame, where column access works.
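A minimal runnable sketch of that fix, using the sample data from the question:
import pandas as pd

df = pd.DataFrame([[1, 1, 2, 3, '1/1/18'],
                   [1, 1, 2, 3, '1/1/17']],
                  columns=['A', 'B', 'C', 'D', 'date'])
df['date'] = pd.to_datetime(df['date'])  # compare real dates, not strings

# apply receives each group as a DataFrame, so column access works;
# idxmax gives the label of the newest row and loc retrieves it whole
df.groupby(['A', 'B']).apply(lambda g: g.loc[g['date'].idxmax()])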
df = pd.DataFrame([[1, 1, 2, 3, '1/1/18'],
[1, 1, 2, 3, '1/1/17']],
columns=['A', 'B', 'C', 'D', 'date'])
Output:
A B C D date
0 1 1 2 3 1/1/18
1 1 1 2 3 1/1/17
Grouping and removing duplicates:
df.groupby(['A', 'B']).agg(
{
'date': 'max'
})
Output:
date
A B
1 1 1/1/18
This should work, but note it relies on the string '1/1/18' comparing greater than '1/1/17'; it behaves correctly in general only if the 'date' column is first converted to a datetime object.
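For example, a short sketch of that conversion (pandas parses these US-style month/day/year strings by default):
# parse the strings into real datetimes before taking the max
df['date'] = pd.to_datetime(df['date'])
df.groupby(['A', 'B']).agg({'date': 'max'})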
Suppose you have a dataframe with an "id" column and a column of values:
df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'vals': [1, 2, 3]})
df1
id vals
0 a 1
1 b 2
2 c 3
You also have a series that contains lists of "id" values that correspond to those in df1:
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])
df2
id
0 [b, c]
1 [a, c]
2 [a, b]
Now, you need a computationally efficient method for taking the mean of the "vals" column in df1 using the corresponding ids in df2 and creating a new column in df1. For instance, for the first row (index=0) we would take the mean of the values for ids "b" and "c" in df1 (since these are the id values in df2 for index=0):
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
You could do it this way:
df1['avg_vals'] = df2.apply(lambda x: df1.loc[df1['id'].isin(x), 'vals'].mean())
df1
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
...but suppose it is too slow for your purposes; I need something much more computationally efficient if possible. Thanks for your help in advance.
Let us try:
df1['new'] = pd.DataFrame(df2.tolist()).replace(dict(zip(df1.id, df1.vals))).mean(1)
df1
Out[109]:
id vals new
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
Try something like:
df1['avg_vals'] = (df2.explode()
.map(df1.set_index('id')['vals'])
.groupby(level=0)
.mean()
)
output:
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
Thanks to @Beny and @mozway for their answers. But these still were not performing as efficiently as I needed. I was able to take some of @mozway's answer and add a merge and groupby to it, which sped things up:
df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'vals': [1, 2, 3]})
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])
df2 = df2.explode().reset_index(drop=False)
df1['avg_vals'] = pd.merge(df1, df2, left_on='id', right_on=0, how='right').groupby('index')['vals'].mean()
df1
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
In my application I am multiplying two Pandas Series which both have multiple index levels. Sometimes, a level contains only a single unique value, in which case I don't get all the index levels from both Series in my result.
To illustrate the problem, let's take two series:
s1 = pd.Series(np.random.randn(4), index=[[1, 1, 1, 1], [1,2,3,4]])
s1.index.names = ['A', 'B']
A B
1 1 -2.155463
2 -0.411068
3 1.041838
4 0.016690
s2 = pd.Series(np.random.randn(4), index=[['a', 'a', 'a', 'a'], [1,2,3,4]])
s2.index.names = ['C', 'B']
C B
a 1 0.043064
2 -1.456251
3 0.024657
4 0.912114
Now, if I multiply them, I get the following:
s1.mul(s2)
A B
1 1 -0.092822
2 0.598618
3 0.025689
4 0.015223
While my desired result would be
A C B
1 a 1 -0.092822
2 0.598618
3 0.025689
4 0.015223
How can I keep index level C in the multiplication?
I have so far been able to get the right result as shown below, but I would much prefer a neater solution that keeps my code simpler and more readable.
s3 = s2.mul(s1).to_frame()
s3['C'] = 'a'
s3.set_index('C', append=True, inplace=True)
You can use Series.unstack with DataFrame.stack:
s = s2.unstack(level=0).mul(s1, level=1, axis=0).stack().reorder_levels(['A','C','B'])
print (s)
A C B
1 a 1 0.827482
2 -0.476929
3 -0.473209
4 -0.520207
dtype: float64
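If you need this for more than one pair of series, here is a small helper sketch generalizing the same unstack/mul/stack pattern (the name broadcast_mul and its arguments are assumptions for illustration, not pandas API):
def broadcast_mul(s1, s2, shared, extra, order):
    # move s2's non-shared level into the columns, multiply row-wise on the
    # shared level (broadcasting over s1's remaining level), then stack back
    return (s2.unstack(level=extra)
              .mul(s1, level=shared, axis=0)
              .stack()
              .reorder_levels(order)
              .sort_index())

s = broadcast_mul(s1, s2, shared='B', extra='C', order=['A', 'C', 'B'])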
I want to know how to group by a single column and join the strings of multiple other columns row-wise.
Here's an example dataframe:
df = pd.DataFrame(np.array([['a', 'a', 'b', 'b'], [1, 1, 2, 2],
['k', 'l', 'm', 'n']]).T,
columns=['a', 'b', 'c'])
print(df)
a b c
0 a 1 k
1 a 1 l
2 b 2 m
3 b 2 n
I've tried something like,
df.groupby(['b', 'a'])['c'].apply(','.join).reset_index()
b a c
0 1 a k,l
1 2 b m,n
But that is not my required output.
Desired output:
b a c
0 1 a,a k,l
1 2 b,b m,n
How can I achieve this? I need a scalable solution because I'm dealing with millions of rows.
I think you need to group by column b only, and then, if necessary, select the list of columns to aggregate with GroupBy.agg:
df1 = df.groupby('b')[['a','c']].agg(','.join).reset_index()
#alternative if want join all columns without b
#df1 = df.groupby('b').agg(','.join).reset_index()
print (df1)
b a c
0 1 a,a k,l
1 2 b,b m,n
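A variation on the same idea, in case you later need a different function per column: pass a dict to agg.
# spell out the aggregation per column; swap in other functions as needed
df1 = df.groupby('b').agg({'a': ','.join, 'c': ','.join}).reset_index()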
I know that by using set_index I can convert an existing column into a DataFrame index, but is there a way to specify, directly in the DataFrame constructor, that one of the data columns should be used as an index (instead of turning it into a column)?
Right now I initialize a DataFrame from data records, then use set_index to make the column into an index.
DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}], index= ['a', 'b'], columns=('c', 'd'))
I want:
      c  d
a b
1 1   2  1
  2   2  2
Instead I get:
c d
a 2 1
b 2 2
You can use MultiIndex.from_tuples:
d = [{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]
print (pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d], names=('a','b')))
MultiIndex(levels=[[1], [1, 2]],
           labels=[[0, 0], [0, 1]],
           names=['a', 'b'])
df = pd.DataFrame(d,
                  index = pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d],
                                                    names=('a','b')),
                  columns=('c', 'd'))
print (df)
c d
a b
1 1 2 1
2 2 2
You can just chain a set_index call onto the constructor, without specifying the index and columns params:
In [19]:
df=pd.DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]).set_index(['a','b'])
df
Out[19]:
c d
a b
1 1 2 1
2 2 2
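If you really do want the index set at construction time, DataFrame.from_records accepts field names for its index parameter; a minimal sketch with the same records (shown for a list of dicts, as above):
d = [{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]
# from_records lifts fields 'a' and 'b' straight into a MultiIndex
df = pd.DataFrame.from_records(d, index=['a', 'b'])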
Hello, I have the following DataFrame:
df =
ID Value
a 45
b 3
c 10
And another DataFrame with the numeric ID of each value:
df1 =
ID ID_n
a 3
b 35
c 0
d 7
e 1
I would like to have a new column in df with the numeric ID, so:
df =
ID Value ID_n
a 45 3
b 3 35
c 10 0
Thanks
Use pandas merge:
import pandas as pd
df1 = pd.DataFrame({
'ID': ['a', 'b', 'c'],
'Value': [45, 3, 10]
})
df2 = pd.DataFrame({
'ID': ['a', 'b', 'c', 'd', 'e'],
'ID_n': [3, 35, 0, 7, 1],
})
df1.set_index(['ID'], drop=False, inplace=True)
df2.set_index(['ID'], drop=False, inplace=True)
print(pd.merge(df1, df2, on="ID", how='left'))
output:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
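Note that merge matches on columns by itself, so the two set_index calls above are optional; a shorter sketch on the raw frames should give the same table:
# no index manipulation needed; merge joins on the 'ID' columns directly
print(pd.merge(df1, df2, on="ID", how="left"))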
You could use join(),
In [14]: df1.join(df2)
Out[14]:
Value ID_n
ID
a 45 3
b 3 35
c 10 0
If you want index to be numeric you could reset_index(),
In [17]: df1.join(df2).reset_index()
Out[17]:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You can do this in a single operation. join works on the index, which you don't appear to have set. Just set df's index to ID, join df1 after also setting its index to ID, and then reset the index to return your original dataframe with the new column added.
>>> df.set_index('ID').join(df1.set_index('ID')).reset_index()
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
Also, because you don't do an inplace set_index on df1, its structure remains the same (i.e. you don't change its indexing).
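A map-based alternative sketch, using the df and df1 names from the question: build a lookup Series from df1 and map it over df's ID column.
# lookup table mapping ID -> ID_n
lookup = df1.set_index('ID')['ID_n']
# align each ID in df against the lookup and attach the result as a new column
df['ID_n'] = df['ID'].map(lookup)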