I'm trying to create a 'total' column that sums the numbers from one column, grouped by the values in another column. I can do this with .groupby(), but that produces a truncated column (one row per group), whereas I want a column that is the same length as the original.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where every row with the same 'a' value gets the same total.
A groupby sum in pandas returns one row per group, so the result is only as long as the number of unique values in the grouping column. Use transform to produce a like-indexed column (the same length as the original data frame) without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
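For comparison, the same like-indexed column could be built with an explicit merge against the aggregated sums; transform is just shorter and skips the join. A minimal sketch, assuming the same df as above:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 3, 3], 'b': [1, 2, 3, 4, 5, 6]})
# aggregate one sum per group, then repeat it onto every row of that group via a left join
sums = df.groupby('a', as_index=False)['b'].sum().rename(columns={'b': 'total'})
df = df.merge(sums, on='a', how='left')

transform('sum') does this repetition in a single step, which is why it is the usual choice.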
Suppose I have the following pandas dataframe X in long format:
rank group ID
1 1 3
2 1 1
3 1 2
4 1 4
1 2 2
2 2 1
1 3 1
2 3 4
3 3 3
4 3 2
1 4 1
1 5 3
2 5 2
3 5 1
1 6 1
And I would like to reshape it to the following wide format according to the following rules:
split the ID column into 4 columns n1, n2, n3, n4, one for each of the 4 persons appearing in the ID column.
for column ni, i=1,2,3,4, the entry in the j-th row is 5 minus the ranking of the i-th person in the j-th group. For example, in group 3, person 4 gets rank 2, hence the 3rd row of the n4 column is 5-2=3.
If person i doesn't exist in group j, then the j-th entry in column ni is NA.
So basically I want to create a "score system" for person i according to the ranking: the person ranked 1 gets the highest score and the person ranked 4 gets the lowest score (or NA if there aren't that many people in the group).
i.e.:
group n1 n2 n3 n4
1 3 2 4 1
2 3 4 NA NA
3 4 1 2 3
4 4 NA NA NA
5 2 3 4 NA
6 4 NA NA NA
I hope I have explained it in an understandable manner. Thank you.
Reshape the dataframe with pivot, then subtract every value from 5 (rsub) and add an 'n' prefix to the column names:
df.pivot(index='group', columns='ID', values='rank').rsub(5).add_prefix('n')
ID n1 n2 n3 n4
group
1 3.0 2.0 4.0 1.0
2 3.0 4.0 NaN NaN
3 4.0 1.0 2.0 3.0
4 4.0 NaN NaN NaN
5 2.0 3.0 4.0 NaN
6 4.0 NaN NaN NaN
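If you need group back as an ordinary column and plain column names (without the ID axis label), a reset_index/rename_axis chain tidies it up. A small sketch, assuming the same df as in the answer above:

out = (df.pivot(index='group', columns='ID', values='rank')
         .rsub(5)                      # 5 minus each rank
         .add_prefix('n')              # columns 1..4 become n1..n4
         .rename_axis(columns=None)    # drop the 'ID' label on the column axis
         .reset_index())               # turn the 'group' index back into a column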
I created this dataframe and calculated the price gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the 0 with the difference from the next lower price of the same group?
for example:
neighborhood:a, bed:1, bath:1, price:5
neighborhood:a, bed:1, bath:1, price:5
neighborhood:a, bed:1, bath:1, price:3
neighborhood:a, bed:1, bath:1, price:2
I get price differences of 0, 2, 1, NaN, and I'm looking for 2, 2, 1, NaN (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
data=[
[1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],[5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns = ['id','neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1))
You can first remove duplicates across all the columns used for the groupby plus the price, compute diff on that filtered data to create the new column, and finally merge it back to the original with a left join:
df1 = (df.dropna()
.sort_values('price',ascending=False)
.drop_duplicates(['neighborhoodname','beds','baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Alternatively, you can use a lambda to back-fill the 0 values within each group; doing the back-fill per group avoids wrong output for one-row groups (where values could otherwise leak in from another group):
import numpy as np  # needed for np.nan

df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'], group_keys=False)['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
I have a data frame where there are several groups of numeric series where the values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A', 'A','A','A', 'B','B','B','B'], 'Indicator': [1,2,3,4,1,2,3,4], 'Cumulative1': [1,3,6,7,2,4,6,9], 'Cumulative2': [1,3,4,6,1,5,7,12]})
In [74]:df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with starting point being the earliest entry in 'Indicator'.
My approach is to use diff():
In[82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()
Out[82]: df
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
How do I avoid the NaN in an elegant/Pythonic way? The correct values are to be found in the original Cumulative series.
Secondly, how do I elegantly apply this computation to all such series, say:
cols = ['Cumulative1', 'Cumulative2']
Thirdly, I have a lot of data that needs this computation -- is this the most efficient way?
You do not want to avoid NaNs, you want to fill them with the start values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or selected) columns, run the groupby diff over every column of interest at once:
sources = ['Cumulative1', 'Cumulative2']
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to fill the NaNs in a loop:
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
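Putting the pieces together, a minimal end-to-end sketch (assuming the df from the question; the Discrete* names just follow the pattern above):

import pandas as pd

df = pd.DataFrame({'Cat': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Indicator': [1, 2, 3, 4, 1, 2, 3, 4],
                   'Cumulative1': [1, 3, 6, 7, 2, 4, 6, 9],
                   'Cumulative2': [1, 3, 4, 6, 1, 5, 7, 12]})

sources = ['Cumulative1', 'Cumulative2']
for s in sources:
    t = 'Discrete' + s[len('Cumulative'):]
    # diff within each Cat, then fill the first row of every group with its cumulative start value
    df[t] = df.groupby('Cat')[s].diff().combine_first(df[s])

The groupby diff is vectorized, so the loop only runs over the handful of columns, not over the rows, which keeps this reasonably efficient on large data.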
I have a question about a dataframe which looks like this (example):
index ID time value
0 1 2h 10
1 1 2.15h 15
2 1 2.30h 5
3 1 2.45h 24
4 2 2.15h 6
5 2 2.30h 12
6 2 2.45h 18
7 3 2.15h 2
8 3 2.30h 1
I would like to keep only the rows whose times appear for every ID, i.e. the maximal set of overlapping times.
So:
index ID time value
1 1 2.15h 15
2 1 2.30h 5
4 2 2.15h 6
5 2 2.30h 12
7 3 2.15h 2
8 3 2.30h 1
I know I could create a df of unique times, merge each ID onto it separately, and then keep only the times for which every ID is filled, but that is quite impractical. I have looked but have not found a smarter way. Does someone have an idea how to make this more practical?
Use:
cols = df.groupby(['ID', 'time']).size().unstack().dropna(axis=1).columns
df = df[df['time'].isin(cols)]
print (df)
ID time value
1 1 2.15h 15
2 1 2.30h 5
4 2 2.15h 6
5 2 2.30h 12
7 3 2.15h 2
8 3 2.30h 1
Details:
First aggregate the DataFrame with groupby and size, then reshape with unstack; NaNs are created for non-overlapping values:
print (df.groupby(['ID', 'time']).size().unstack())
time 2.15h 2.30h 2.45h 2h
ID
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 NaN
3 1.0 1.0 NaN NaN
Remove the columns containing NaN with dropna and take the column names:
print (df.groupby(['ID', 'time']).size().unstack().dropna(axis=1))
time 2.15h 2.30h
ID
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
And finally filter with isin and boolean indexing:
df = df[df['time'].isin(cols)]
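An alternative that avoids the reshape entirely is to count, per time, how many distinct IDs have a row and keep only the times seen by every ID. A sketch, starting again from the original df of the question:

# keep rows whose time appears for every ID in the frame
n_ids = df['ID'].nunique()
df = df[df.groupby('time')['ID'].transform('nunique') == n_ids]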
I want to insert a pandas dataframe into another pandas dataframe at certain indices.
Lets say we have this dataframe:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
I can then change values at certain indices as follows:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
original_df.iloc[[0,2],[0,1]] = 2
0 1 2
0 2 2 3
1 4 5 6
2 2 2 9
However, if I use the same technique to insert another dataframe, it doesn't work:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df_to_insert = pd.DataFrame([[10,11],[12,13]])
original_df.iloc[[0,2],[0,1]] = df_to_insert
0 1 2
0 10.0 11.0 3.0
1 4.0 5.0 6.0
2 NaN NaN 9.0
I am looking for a way to get the following result:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It seems to me that with the syntax I am using, the values from df_to_insert are aligned by their own index and column labels rather than placed by position. Is there a way for me to avoid this?
When you do the insert, make sure you convert the df to its values: pandas is index-sensitive, which means it will always try to align on the index and columns during the assignment.
original_df.iloc[[0,2],[0,1]] = df_to_insert.values
original_df
Out[651]:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It does work with an array rather than a df:
original_df.iloc[[0,2],[0,1]] = np.array([[10,11],[12,13]])
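In recent pandas, .to_numpy() is the recommended spelling of .values; either way, the point is to hand iloc a plain array so that no label alignment happens. A short sketch, assuming the frames from the question:

import pandas as pd

original_df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df_to_insert = pd.DataFrame([[10, 11], [12, 13]])

# .to_numpy() strips the index and column labels, so the 2x2 block is written purely by position
original_df.iloc[[0, 2], [0, 1]] = df_to_insert.to_numpy()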