I have a dataframe that looks like this:
df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'C', 'D', 'C', 'D', 'D', 'A', 'A'],
})
I want to create a unique id based on consecutive groups in the type column, but the id should still keep incrementing on every row where type equals 'A'.
Eventually the output dataframe should look like this:
df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'C', 'D', 'C', 'D', 'D', 'A', 'A'],
    'id': [1, 2, 3, 4, 4, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12],
})
Any help would be much appreciated
You can use shift with cumsum to create the group key, then give the 'A' rows a unique key of their own:
# Number rows within each consecutive run of equal types
s = df.groupby(df.type.ne(df.type.shift()).cumsum()).cumcount().astype(str)
df['new'] = df['type']
# Append that counter to the 'A' rows so every 'A' becomes unique
df.loc[df.new.eq('A'), 'new'] += s
# Increment the id whenever the (modified) value changes
df['new'] = df['new'].ne(df['new'].shift()).cumsum()
df
Out[58]:
type new
0 A 1
1 A 2
2 A 3
3 B 4
4 B 4
5 B 4
6 A 5
7 A 6
8 C 7
9 D 8
10 C 9
11 D 10
12 D 10
13 A 11
14 A 12
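To see the mechanics, here is a sketch of the intermediate values for the sample data (reconstructed by hand from the steps above):
# The grouping key numbers each consecutive run of equal types:
# key = df.type.ne(df.type.shift()).cumsum()
# key: [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 7, 8, 8]
# s counts positions inside each run:
# s:   ['0','1','2','0','1','2','0','1','0','0','0','0','1','0','1']
# After the .loc step, df['new'] holds:
# ['A0','A1','A2','B','B','B','A0','A1','C','D','C','D','D','A0','A1']
# Every 'A' is now unique within its run, so ne(shift()).cumsum()
# advances the id on each 'A' while runs like 'B','B','B' share one id.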
Suppose I have the following dataframe:
data = {'ID': ['A', 'B', 'C', 'A', 'C', 'O', 'B', 'A', 'B', 'O'],
        'Item': ['Apple', 'Banana', 'Carrot', 'Apple', 'Carrot', 'Orange', 'Banana', 'Apple', 'Banana', 'Orange'],
        'Cost': [10, 12, 15, 13, 54, 20, 73, 22, 19, 32]}
dataframe = pd.DataFrame(data)
dataframe
And I want to replace the cost of each item with the cost of its previous occurrence using pandas, deleting the first instance of each item. So the above dataframe would become
data2 = {'ID': ['A', 'C', 'B', 'A', 'B', 'O'],
         'Item': ['Apple', 'Carrot', 'Banana', 'Apple', 'Banana', 'Orange'],
         'Cost': [10, 15, 12, 13, 73, 20]}
dataframe2 = pd.DataFrame(data2)
dataframe2
What's a good way to do it?
Use groupby and head with a negative number to exclude the last occurrence of each item:
>>> dataframe.groupby('Item').head(-1)
ID Item Cost
0 A Apple 10
1 B Banana 12
2 C Carrot 15
3 A Apple 13
5 O Orange 20
6 B Banana 73
You can use groupby on Item with shift as well. This gives you the output in the same order you expected:
dataframe['Cost'] = dataframe.groupby('Item')['Cost'].shift(fill_value=0)
dataframe[dataframe['Cost'] != 0]
This gives the expected output:
ID Item Cost
3 A Apple 10
4 C Carrot 15
6 B Banana 12
7 A Apple 13
8 B Banana 73
9 O Orange 20
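One caveat with the fill_value=0 trick: if a real cost of 0 ever appears, its row would be dropped too. A variant of my own (not from the original answer) that shifts within each Item and drops the NaN first occurrences avoids that:
import pandas as pd

# dataframe as defined in the question
dataframe = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'A', 'C', 'O', 'B', 'A', 'B', 'O'],
    'Item': ['Apple', 'Banana', 'Carrot', 'Apple', 'Carrot', 'Orange',
             'Banana', 'Apple', 'Banana', 'Orange'],
    'Cost': [10, 12, 15, 13, 54, 20, 73, 22, 19, 32],
})

# Shift Cost down within each Item group; the first occurrence of each
# Item has no predecessor and becomes NaN (Cost turns into float)
dataframe['Cost'] = dataframe.groupby('Item')['Cost'].shift()

# Drop the NaN rows, i.e. the first instance of each item
result = dataframe.dropna(subset=['Cost'])
print(result)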
I have the following dataframe -
df = pd.DataFrame({
'ID': [1, 2, 2, 3, 3, 3, 4],
'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
'7/9/2019', '16/11/2019']
})
This is my desired output -
desired_df = pd.DataFrame({
'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
'Prior_Current': ['a', 'a1', 'b', 'c', 'c1', 'd', 'e', 'f', 'f1', 'g',
'g1'],
'Start_Date': ['', '1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
'6/5/2019', '7/9/2019', '', '16/11/2019'],
'End_Date': ['1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
'6/5/2019', '7/9/2019', '', '16/11/2019', '']
})
I tried the following -
keys = ['Prior', 'Current']
df2 = (
pd.melt(df, id_vars='ID', value_vars=keys, value_name='Prior_Current')
.merge(df[['ID', 'Date']], how='left', on='ID')
)
df2['Start_Date'] = np.where(df2['variable'] == 'Prior', df2['Date'], '')
df2['End_Date'] = np.where(df2['variable'] == 'Current', df2['Date'], '')
df2.sort_values(['ID'], ascending=True, inplace=True)
But this does not seem to be working. Please help.
You can use stack and pivot_table:
k = df.set_index(['ID', 'Date']).stack().reset_index()
df = k.pivot_table(index=['ID', 0], columns='level_2', values='Date',
                   aggfunc=''.join, fill_value='').reset_index()
df.columns = ['ID', 'prior-current', 'start-date', 'end-date']
OUTPUT:
ID prior-current start-date end-date
0 1 a 1/1/2019
1 1 a1 1/1/2019
2 2 b 5/1/2019
3 2 c 5/1/2019 10/2/2019
4 2 c1 10/2/2019
5 3 d 15/3/2019
6 3 e 15/3/2019 6/5/2019
7 3 f 6/5/2019 7/9/2019
8 3 f1 7/9/2019
9 4 g 16/11/2019
10 4 g1 16/11/2019
Explanation:
After set_index / stack / reset_index, k will look like this:
ID Date level_2 0
0 1 1/1/2019 Prior a
1 1 1/1/2019 Current a1
2 2 5/1/2019 Prior b
3 2 5/1/2019 Current c
4 2 10/2/2019 Prior c
5 2 10/2/2019 Current c1
6 3 15/3/2019 Prior d
7 3 15/3/2019 Current e
8 3 6/5/2019 Prior e
9 3 6/5/2019 Current f
10 3 7/9/2019 Prior f
11 3 7/9/2019 Current f1
12 4 16/11/2019 Prior g
13 4 16/11/2019 Current g1
Now we can pivot with ID and column 0 as the index, level_2 as the columns, and the Date column as the values.
Finally, we need to rename the columns to get the desired result.
My approach is to build the target df step by step. The first step is an extension of your code using melt() and merge(). The merge is done on the 'Current' and 'Prior' columns to get the start and end dates.
df = pd.DataFrame({
'ID': [1, 2, 2, 3, 3, 3, 4],
'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
'7/9/2019', '16/11/2019']
})
df2 = (pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current')
         .drop(columns='variable')  # drop(columns=...); the positional axis argument was removed in pandas 2.0
         .drop_duplicates()
         .sort_values('ID'))
df2 = df2.merge(df[['Current', 'Date']], how='left',
                left_on='Prior_Current', right_on='Current').drop(columns='Current')
df2 = df2.merge(df[['Prior', 'Date']], how='left',
                left_on='Prior_Current', right_on='Prior').drop(columns='Prior')
df2 = df2.fillna('').reset_index(drop=True)
df2.columns = ['ID', 'Prior_Current', 'Start_Date', 'End_Date']
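As a quick sanity check (assuming the desired_df from the question is in scope):
print(df2.equals(desired_df))  # True: df2 matches the desired output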
An alternative way is to define a custom function to look up the date, then apply it with a lambda:
def get_date(x, col):
    # Return the Date of the row where df[col] equals x, or '' if absent
    try:
        return df['Date'][df[col] == x].values[0]
    except IndexError:
        return ''

df2 = (pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current')
         .drop(columns='variable')
         .drop_duplicates()
         .sort_values('ID')
         .reset_index(drop=True))
df2['Start_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Current'))
df2['End_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Prior'))
Output: both versions reproduce the desired_df from the question.
I need to take the mean() of the first 2 values of each category with groupby; how do I define that?
df looks like:
category value
-> a 2
-> a 5
a 4
a 8
-> b 6
-> b 3
b 1
-> c 2
-> c 2
c 7
Reading only the arrowed rows, the output should look like:
category mean
a 3.5
b 4.5
c 2
How can I do this? I am trying the following, but I do not know where to restrict it to only the first 2 observations of each category:
output = df.groupby(['category'])['value'].mean().reset_index()
Your help is appreciated, thanks in advance.
You can also do this via groupby() and agg():
out = df.groupby('category', as_index=False)['value'].agg(lambda x: x.head(2).mean())
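This produces the same result as the versions below:
  category  value
0        a    3.5
1        b    4.5
2        c    2.0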
Try apply on each group of values and use head(2) to take just the first 2 values, then mean:
import pandas as pd
df = pd.DataFrame({
'category': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b',
6: 'b', 7: 'c', 8: 'c', 9: 'c'},
'value': {0: 2, 1: 5, 2: 4, 3: 8, 4: 6, 5: 3, 6: 1, 7: 2,
8: 2, 9: 7}
})
output = df.groupby('category', as_index=False)['value'] \
.apply(lambda a: a.head(2).mean())
print(output)
Output:
category value
0 a 3.5
1 b 4.5
2 c 2.0
Or create a boolean index to filter df with:
m = df.groupby('category').cumcount().lt(2)
output = df[m].groupby('category')['value'].mean().reset_index()
print(output)
category value
0 a 3.5
1 b 4.5
2 c 2.0
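For reference (my addition, values worked out from the sample df): the mask m flags the first two rows of each category, since cumcount numbers the rows of each group from 0:
print(m.tolist())
# [True, True, False, False, True, True, False, True, True, False]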
I am trying to match columns of 2 dataframes and I am getting the output False. Is there a way to find the "not equal" data?
code:
rounds2['company_permalink'].equals(companies['permalink'])
output:
False
If I have understood the question correctly:
import pandas as pd

data1 = {'col_1': [22, 34, 23, 43], 'col_2': ['a', 'b', 'c', 'd']}
data2 = {'col_1': [22, 66, 23, 88], 'col_2': ['x', 'b', 'c', 'y']}
df1 = pd.DataFrame.from_dict(data1)
df2 = pd.DataFrame.from_dict(data2)

# Boolean mask of positions where col_1 differs (the indexes must align)
not_equal_col1 = df1['col_1'] != df2['col_1']
print(df1[not_equal_col1])
print(df2[not_equal_col1])
Output:
col_1 col_2
1 34 b
3 43 d
col_1 col_2
1 66 b
3 88 y
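On pandas 1.1 or later you can also use Series.compare, which lines up both sides of every mismatch in a single frame:
diff = df1['col_1'].compare(df2['col_1'])
print(diff)
#    self  other
# 1    34     66
# 3    43     88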
I would like to create a rank per year (so in 2012, Manager B is 1; in 2011, Manager B is 1 again). I struggled with the pandas rank function for a while and DO NOT want to resort to a for loop.
s = pd.DataFrame([['2012','A',3],['2012','B',8],['2011','A',20],['2011','B',30]], columns=['Year','Manager','Return'])
Out[1]:
Year Manager Return
0 2012 A 3
1 2012 B 8
2 2011 A 20
3 2011 B 30
The issue I'm having is with the additional code (didn't think this would be relevant before):
s = pd.DataFrame([['2012', 'A', 3], ['2012', 'B', 8], ['2011', 'A', 20], ['2011', 'B', 30]], columns=['Year', 'Manager', 'Return'])
b = pd.DataFrame([['2012', 'A', 3], ['2012', 'B', 8], ['2011', 'A', 20], ['2011', 'B', 30]], columns=['Year', 'Manager', 'Return'])
s = s.append(b)
s['Rank'] = s.groupby(['Year'])['Return'].rank(ascending=False)
raise Exception('Reindexing only valid with uniquely valued Index '
Exception: Reindexing only valid with uniquely valued Index objects
Any ideas?
This is the real data structure I am using, and I have been having trouble re-indexing.
It sounds like you want to group by the Year, then rank the Returns in descending order.
import pandas as pd
s = pd.DataFrame([['2012', 'A', 3], ['2012', 'B', 8], ['2011', 'A', 20], ['2011', 'B', 30]],
columns=['Year', 'Manager', 'Return'])
s['Rank'] = s.groupby(['Year'])['Return'].rank(ascending=False)
print(s)
yields
Year Manager Return Rank
0 2012 A 3 2
1 2012 B 8 1
2 2011 A 20 2
3 2011 B 30 1
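A side note not in the original answer: rank() resolves ties with method='average' by default, so duplicated Returns (like those in the appended data below) yield ranks such as 1.5. Pass method to control this:
# 'dense' or 'min' give integer ranks when Returns tie
s['Rank'] = s.groupby(['Year'])['Return'].rank(ascending=False, method='dense')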
To address the OP's revised question: the error
Exception: Reindexing only valid with uniquely valued Index objects
(some pandas versions word this as ValueError: cannot reindex from a duplicate axis) occurs when trying to groupby/rank on a DataFrame with duplicate values in the index. You can avoid the problem by constructing s to have unique index values after appending:
s = pd.DataFrame([['2012', 'A', 3], ['2012', 'B', 8], ['2011', 'A', 20], ['2011', 'B', 30]], columns=['Year', 'Manager', 'Return'])
b = pd.DataFrame([['2012', 'A', 3], ['2012', 'B', 8], ['2011', 'A', 20], ['2011', 'B', 30]], columns=['Year', 'Manager', 'Return'])
s = s.append(b, ignore_index=True)
yields
Year Manager Return
0 2012 A 3
1 2012 B 8
2 2011 A 20
3 2011 B 30
4 2012 A 3
5 2012 B 8
6 2011 A 20
7 2011 B 30
If you've already appended new rows using
s = s.append(b)
then use reset_index to create a unique index:
s = s.reset_index(drop=True)
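A final note for newer pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0. On current versions, pd.concat is the replacement and accepts the same ignore_index flag:
import pandas as pd

# Equivalent to s.append(b, ignore_index=True)
s = pd.concat([s, b], ignore_index=True)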