This is very similar to this question, except I want my code to apply to all the columns of a DataFrame, however many there are, instead of to specific columns.
I have a DataFrame, and I'm trying to get a sum of each row to append to the dataframe as a column.
df = pd.DataFrame([[1,0,0],[20,7,1],[63,13,5]],columns=['drinking','drugs','both'],index = ['First','Second','Third'])
        drinking  drugs  both
First          1      0     0
Second        20      7     1
Third         63     13     5
Desired output:
        drinking  drugs  both  total
First          1      0     0      1
Second        20      7     1     28
Third         63     13     5     81
Current code:
df['total'] = df.apply(lambda row: (row['drinking'] + row['drugs'] + row['both']),axis=1)
This works great. But what if I have another dataframe, with seven columns, which are not called 'drinking', 'drugs', or 'both'? Is it possible to adjust this function so that it applies to the length of the dataframe? That way I can use the function for any dataframe at all, with a varying number of columns, not just a dataframe with columns called 'drinking', 'drugs', and 'both'?
Something like:
df['total'] = df.apply(for col in df: [code to calculate sum of each row]),axis=1)
You can use sum:
df['total'] = df.sum(axis=1)
If you need to sum only some of the columns, select a subset first:
df['total'] = df[['drinking', 'drugs', 'both']].sum(axis=1)
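Since the question asks for something that works on any DataFrame, here is a minimal sketch of the same idea for a frame that also holds a non-numeric column. The `note` column is made up for illustration; `numeric_only=True` tells `sum` to skip it rather than try to add strings to numbers.

```python
import pandas as pd

# Same data as in the question, plus a hypothetical text column 'note'.
df = pd.DataFrame(
    [[1, 0, 0, "a"], [20, 7, 1, "b"], [63, 13, 5, "c"]],
    columns=["drinking", "drugs", "both", "note"],
    index=["First", "Second", "Third"],
)

# Sum across each row, ignoring columns that are not numeric.
df["total"] = df.sum(axis=1, numeric_only=True)
print(df)
```

Note that running the assignment a second time would include the existing `total` column in the sum, so compute it once or drop the column first.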
What about something like this:
df.loc[:, 'Total'] = df.sum(axis=1)
with the output:
Out[4]:
        drinking  drugs  both  Total
First          1      0     0      1
Second        20      7     1     28
Third         63     13     5     81
It will sum all columns by row.
I've got a dataframe with lots of rows and columns on score results. I want to aggregate them and look at how many records there are with each score. Ideally the output would look like:
df=
Score count
10 576
9 306
8 644
7 829
etc...
I've been using the below code to try to get the main dataframe into that format.
df = df[['score']]
df = df.groupby(['score'])['score'].count()
df = df.reset_index()
This code works for the most part. The second line gets me the aggregated figures for each score, but the scores themselves end up in the index rather than in their own column, which is why I try to reset the index.
However, I keep getting the error: ValueError: cannot insert score, already exists.
Is there any way to get around this so I can have two columns, score and count?
Sol 1
df = df.groupby(['score']).count()
df = df.reset_index()
df
score count
0 7 1
1 8 1
2 9 1
3 10 1
Sol 2
df = df[['score']]
df = df.groupby(['score']).size()
df = df.reset_index()
df
score 0
0 7 1
1 8 1
2 9 1
3 10 1
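The error in the question comes up because the counted Series is itself named score, the same as the index being reset. A possible variant of Sol 2, sketched here on made-up scores, passes `name` to `Series.reset_index` so the counts get their own column label instead of `0`:

```python
import pandas as pd

# Made-up scores, for illustration only.
df = pd.DataFrame({"score": [10, 9, 9, 8, 8, 8]})

# size() returns a Series indexed by score; name= labels the values column.
counts = df.groupby("score").size().reset_index(name="count")
print(counts)
```

This should give a plain two-column frame, score and count, sorted by score.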
Here's the code I currently have:
df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
which returns something like the following (I've made these values up):
CRIME_RATING
mean count
0 3.000000 1
1 3.118397 39
2 2.790698 32
3 5.125000 18
4 4.000000 1
5 4.222222 22
but I'd quite like to exclude indexes 0 and 4 from the resulting dataframe given that they both have a count of 1. Can this be done?
Use Series.ne to keep the rows whose count is not equal to 1, selecting the MultiIndex column with a tuple and filtering with boolean indexing:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
df2 = df1[df1[('CRIME_RATING','count')].ne(1)]
If you want to avoid the MultiIndex, use named aggregation:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg(mean = ('CRIME_RATING','mean'),
count = ('CRIME_RATING','count'))
df2 = df1[df1['count'].ne(1)]
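For a quick check of the named-aggregation version, here is a small self-contained sketch. The council names and ratings are invented, and `summary` and `filtered` are just illustrative variable names.

```python
import pandas as pd

# Invented data: council "A" has a single record, so it should be dropped.
df = pd.DataFrame({
    "LOCAL_COUNCIL": ["A", "B", "B", "C", "C", "C"],
    "CRIME_RATING": [3, 2, 4, 5, 5, 2],
})

# Named aggregation gives flat column names instead of a MultiIndex.
summary = (
    df.groupby("LOCAL_COUNCIL")
      .agg(mean=("CRIME_RATING", "mean"), count=("CRIME_RATING", "count"))
      .reset_index()
)

# Keep only the groups whose count is not 1.
filtered = summary[summary["count"].ne(1)]
print(filtered)
```

If the rule is really "more than one record", `summary["count"].gt(1)` expresses that a bit more directly.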
I have a pandas df that looks like this:
import pandas as pd
df = pd.DataFrame({0:[1],5:[1],10:[1],15:[1],20:[0],25:[0],
30:[1],35:[1],40:[0],45:[0],50:[0]})
df
The column names reflect coordinates. I would like to retrieve the start and end coordinate of columns with consecutive equal numbers.
The output should be something like this:
# start,end
0,15
20,25
30,35
40,50
IIUC, use groupby with diff and cumsum to split the groups:
s=df.T.reset_index()
s=s.groupby(s[0].diff().ne(0).cumsum())['index'].agg(['first','last'])
Out[241]:
first last
0
1 0 15
2 20 25
3 30 35
4 40 50
Use cumsum to identify the groups, then groupby:
s = df.iloc[0].diff().ne(0).cumsum()
(df.columns.to_series()
.groupby(s).agg(['min','max'])
)
Output:
min max
0
1 0 15
2 20 25
3 30 35
4 40 50
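Putting it together on the frame from the question, a runnable sketch might look like the following. The `start` and `end` column names are my own choice to match the desired output, and it assumes the frame has a single row.

```python
import pandas as pd

df = pd.DataFrame({0: [1], 5: [1], 10: [1], 15: [1], 20: [0], 25: [0],
                   30: [1], 35: [1], 40: [0], 45: [0], 50: [0]})

# The run id increases by one every time the value changes along the row.
row = df.iloc[0]
run_id = row.diff().ne(0).cumsum()

# Group the column labels by run id and take the first and last coordinate.
runs = (
    df.columns.to_series()
      .groupby(run_id)
      .agg(["min", "max"])
      .rename(columns={"min": "start", "max": "end"})
      .reset_index(drop=True)
)
print(runs)
```

Using min and max relies on the column labels being in increasing order; if they are not, 'first' and 'last' as in the other answer would be the safer pair.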
I have an index in a pandas DataFrame in which index values repeat. I want to re-index it as a MultiIndex where the repeated index values are grouped.
The index looks like this:
For example, I would like all the 112335586 index values to be grouped under the same index entry.
I have looked at the question "Create pandas dataframe by repeating one row with new multiindex", but there the index values are pre-defined, which is not possible here because my DataFrame is far too large to hard-code them.
I also looked at the MultiIndex documentation, but it also pre-defines the values for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10 1
10 2
20 3
20 4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
s.index = [s.index, s2]
print (s)
EVENT_ID
10 0 1
1 2
20 0 3
1 4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
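Here is the same approach as a self-contained sketch without `inplace`, on a tiny made-up frame. The `value` column and the `out` name are hypothetical, and it assumes the repeating index is named EVENT_ID.

```python
import pandas as pd

# Made-up frame whose index repeats each EVENT_ID twice.
df = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.Index([10, 10, 20, 20], name="EVENT_ID"),
)

out = df.reset_index()
# cumcount numbers the rows within each EVENT_ID: 0, 1, 2, ...
out["sub_idx"] = out.groupby("EVENT_ID").cumcount()
out = out.set_index(["EVENT_ID", "sub_idx"])
print(out)
```

Nothing is hard-coded here, so it should scale to a large frame as long as the index has a name to group on.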
I have a pandas dataframe defined as:
A B SUM_C
1 1 10
1 2 20
I would like to do a cumulative sum of SUM_C and add it as a new column to the same dataframe. In other words, my end goal is to have a dataframe that looks like below:
A B SUM_C CUMSUM_C
1 1 10 10
1 2 20 30
The question "Using cumsum in pandas on group()" shows how to generate a new DataFrame in which the SUM_C column is replaced with its cumulative sum. However, I want to add the cumulative sum as a new column to the existing DataFrame.
Thank you
Just apply cumsum on the pandas.Series df['SUM_C'] and assign it to a new column:
df['CUMSUM_C'] = df['SUM_C'].cumsum()
Result:
df
Out[34]:
A B SUM_C CUMSUM_C
0 1 1 10 10
1 1 2 20 30
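If the running total should restart for each value of A (the question mentions grouping, so this may be what is wanted), a possible variant is to call cumsum on the grouped column. The extra rows with A equal to 2 are made up to show the reset.

```python
import pandas as pd

# The two rows from the question plus two invented rows for a second group.
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [1, 2, 1, 2],
    "SUM_C": [10, 20, 5, 7],
})

# The result keeps the original row index, so it can be assigned directly.
df["CUMSUM_C"] = df.groupby("A")["SUM_C"].cumsum()
print(df)
```

With only one distinct value in A, this gives the same column as the plain `df['SUM_C'].cumsum()` above.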