pandas DataFrame sum method works counterintuitively - python

my_df = DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
my_df.sum(axis="rows")
O/P is
a 22
b 26
c 30
// I expect it to sum by rows thereby giving
0 6
1 15
2 24
3 33
my_df.sum(axis="columns") //helps achieve this
Why does it work counterintuitively?
In a similar context, the drop method works as it should, i.e. when I write
my_df.drop(['a'],axis="columns")
// This drops column "a".
Am I missing something? Please enlighten.

Short version
It is a naming convention. Summing over the columns gives a row-wise sum. You are looking for axis='columns'.
Long version
OK, that was interesting. In pandas, axis 0 refers to the index (rows) and axis 1 refers to the columns.
However looking in the docs we find that the allowed params are:
axis : {index (0), columns (1)}
'rows' is accepted as an alias for 'index' (axis 0), so your call behaves the same as the default axis=0. The axis names the dimension that gets collapsed, so this can be read as: summing over the columns returns the row sums, and summing over the index returns the column sums. What you want to use is axis=1 or axis='columns', which results in your desired output:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
print(df.sum(axis=1))
Returns:
0 6
1 15
2 24
3 33
dtype: int64
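As a quick check of the spellings discussed above, here is a minimal sketch using the same frame. It assumes a reasonably recent pandas in which 'rows' is treated as an alias of 'index'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(4, 3), columns=list("abc"))

# axis=0, axis="index" and axis="rows" collapse the rows: one total per column
col_totals = df.sum(axis=0)
assert col_totals.equals(df.sum(axis="index"))
assert col_totals.equals(df.sum(axis="rows"))

# axis=1 and axis="columns" collapse the columns: one total per row
row_totals = df.sum(axis=1)
assert row_totals.equals(df.sum(axis="columns"))

print(col_totals.tolist())  # [22, 26, 30]
print(row_totals.tolist())  # [6, 15, 24, 33]
```

The same reading explains drop: drop(['a'], axis="columns") looks for the label 'a' along the columns axis, so nothing about the two methods is inconsistent.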

Related

Conditionally dropping columns in a pandas dataframe

I have this dataframe and my goal is to remove any columns that have less than 1000 entries.
Prior to pivoting the df I know I have 880 unique well_id's with entries ranging from 4 to 60k+. I know I should end up with 102 well_id's.
I tried to accomplish this in a very naïve way by collecting the wells that I am trying to remove in an array and using a loop but I keep getting a 'TypeError: Level type mismatch' but when I just use del without a for loop it works.
#this works
del df[164301.0]
del df['TB-0071']
# this doesn't work
for id in unwanted_id:
    del df[id]
Any help is appreciated, Thanks.
You can use dropna method:
df.dropna(axis='columns', thresh=1000) # thresh is how many non-NA values a column needs in order to be kept
The advantage of this method is that you don't need to create a list.
Also don't forget to add the usual inplace = True if you want the changes to be made in place.
You can use pandas drop method:
df.drop(columns=['colName'], inplace=True)
You can actually pass a list of columns names:
unwanted_id = [164301.0, 'TB-0071']
df.drop(columns=unwanted_id, inplace=True)
Sample:
df[:5]
from to freq
0 A X 20
1 B Z 9
2 A Y 2
3 A Z 5
4 A X 8
df.drop(columns=['from', 'to'])
freq
0 20
1 9
2 2
3 5
4 8
And to get those column names with more than 1000 unique values, you can use something like this:
counts = df.nunique()[df.nunique()>1000].to_frame('uCounts').reset_index().rename(columns={'index':'colName'})
counts
colName uCounts
0 to 1001
1 freq 1050
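For the original goal (dropping columns with fewer than 1000 entries), a minimal sketch that avoids both the loop and the list of unwanted ids could look like the following. The frame here is made up for illustration: the column 'W-0002' and all the values are hypothetical, and "entries" is assumed to mean non-NA cells.

```python
import numpy as np
import pandas as pd

# Hypothetical pivoted frame: one column per well_id, NaN where there is no entry
rng = np.random.default_rng(0)
rows = np.arange(1500)
df = pd.DataFrame({
    "TB-0071": rng.random(1500),                 # 1500 entries
    164301.0: np.where(rows < 40, 1.0, np.nan),  # 40 entries
    "W-0002": np.where(rows < 1200, 2.0, np.nan) # 1200 entries
})

min_entries = 1000
counts = df.count()                      # non-NA entries per column
kept = df.loc[:, counts >= min_entries]  # boolean mask over the columns

print(list(kept.columns))  # ['TB-0071', 'W-0002']
```

Under that assumption, df.dropna(axis='columns', thresh=min_entries) should give the same result in one call.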

why pandas.DataFrame.sum(axis=0) returns sum of values in each column where axis =0 represent rows?

In pandas, axis=0 represents rows and axis=1 represents columns.
Therefore, to get the sum of values in each row in pandas, I call df.sum(axis=0).
But it returns the sum of values in each column, and vice versa. Why???
import pandas as pd
df=pd.DataFrame({"x":[1,2,3,4,5],"y":[2,4,6,8,10]})
df.sum(axis=0)
Dataframe:
x y
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
Output:
x 15
y 30
Expected Output:
0 3
1 6
2 9
3 12
4 15
I think the right way to interpret the axis parameter is what axis you sum 'over' (or 'across'), rather than the 'direction' the sum is computed in. Specifying axis = 0 computes the sum over the rows, giving you a total for each column; axis = 1 computes the sum across the columns, giving you a total for each row.
I was reading the source code of the pandas project, and I think this comes from NumPy, where the axis is used the same way (0 sums vertically and 1 horizontally). In addition, pandas uses NumPy under the hood to compute this sum.
The pandas source linked here is where I saw it rely on the numpy.cumsum function.
The NumPy documentation is linked as well.
If you are looking for a way to remember how to use the axis parameter, anant's answer is a good approach: read it as the axis you sum over. So when 0 is specified you are computing the sum over the rows (iterating over the index, to be more compliant with the pandas docs). When axis is 1 you are iterating over the columns.
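To make the NumPy comparison concrete, here is a small sketch using the frame from the question. It only illustrates that the two libraries agree on what each axis collapses; it is not a claim about how pandas computes the sum internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})
arr = df.to_numpy()  # shape (5, 2): 5 rows, 2 columns

# axis=0 is the axis that disappears: 5 rows collapse into one value per column
print(arr.sum(axis=0).tolist())  # [15, 30]
print(df.sum(axis=0).tolist())   # [15, 30]

# axis=1 is the axis that disappears: 2 columns collapse into one value per row
print(arr.sum(axis=1).tolist())  # [3, 6, 9, 12, 15]
print(df.sum(axis=1).tolist())   # [3, 6, 9, 12, 15]
```

Looking at the shapes helps: summing a (5, 2) array over axis 0 leaves shape (2,), and over axis 1 leaves shape (5,).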

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine. df1 = df[['task_id','duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The code below retains both columns. Note that it counts the occurrences of each (task_id, duration) pair rather than summing duration.
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
import numpy as np
import pandas as pd
X = np.random.choice([0,1,2], 20)
Y = np.random.uniform(2,10,20)
df = pd.DataFrame({'task_id':X, 'duration':Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').sum().reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y'])*100
df = df.drop('duration_y', axis=1) # Drops the summed duration; remove this line if you want to see it.
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
disclaimer: All data is randomly generated in setup, however, calculations are straightforward and should be correct for any case.
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)
# find each task_id as relative percentage of whole
df1['pct'] = df1['duration']/(df1['duration'].sum())
df1 = pd.DataFrame(df1)
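Another option, sketched here with made-up numbers, is groupby(...).transform('sum'), which returns one value per original row, so every row and both original columns are kept. Taking an explicit .copy() of the column subset is one common way to avoid the "set on a copy of a slice" warning from the question. Note that pct here is each row's share of its own task_id total, as in the merge answer; whether that or the share of the grand total is wanted is an assumption.

```python
import pandas as pd

# Hypothetical data standing in for the asker's frame
df = pd.DataFrame({
    "task_id": [1, 1, 2, 2, 2],
    "duration": [10.0, 30.0, 5.0, 5.0, 10.0],
})

# Explicit copy, so adding columns does not touch a slice of df
df1 = df[["task_id", "duration"]].copy()

# transform('sum') broadcasts each group's total back onto its rows
df1["total"] = df1.groupby("task_id")["duration"].transform("sum")
df1["pct"] = df1["duration"] / df1["total"] * 100

print(df1)
```

For the share of the grand total instead, the divisor would be df1['duration'].sum().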

Function within a function involving each column of a DataFrame in Python

As the question states, I'm trying to learn how to run a function on each element belonging to a column within a DataFrame without having to define that column directly. The point is that I would like to be able to enter any given set of DataFrame and find each element within each column that fulfills a particular condition.
The sample that I've included illustrates what I'm trying to do. I know the below doesn't work and I thought that writing def fun(dataframe[column]) would do the trick but the syntax is incorrect, unfortunately.
Basically, the reason for this is that I have multiple sets of data where I'd like to locate each element that is above a set threshold.
Thanks a lot in advance!
df=pd.DataFrame(np.random.randint(0,100,size=(3, 3)), columns=list('ABC'))
def fun(dataframe):
    for column in dataframe:
        def fun(column):
            mean= sum(column)/len(column)
            print (mean)
            for element in column:
                if element < mean*1.1:
                    element = 0
                print (element)
fun(df)
As @MadPhysicist mentioned in a comment, pandas was created to reduce the need for explicit for-looping.
If I understand your specific case correctly, you intend to replace with zero any element that is less than 1.1 times the mean value of its column. Here's one way to do that in idiomatic pandas:
# Set a random seed for repeatability
np.random.seed(314159)
# Create example data
df = pd.DataFrame(np.random.randint(0,100,size=(3, 3)), columns=list('ABC'))
df
A B C
0 11 34 93
1 79 0 81
2 66 43 71
# By default, df.mean() computes the mean of each numeric column (not row)
df.mean()
A 52.000000
B 25.666667
C 81.666667
dtype: float64
# We can use boolean indexing to replace values less than
# 1.1 * column mean with zero
# docs: https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
df[df < 1.1 * df.mean()] = 0
df
A B C
0 0 34 93
1 79 0 0
2 66 43 0
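Since the question asks for something reusable on any DataFrame, the same idea can be wrapped in a function. This is only a sketch: the name zero_below_threshold and the factor parameter are made up for illustration, and it uses DataFrame.mask so that the input is left untouched and a new frame is returned.

```python
import pandas as pd

def zero_below_threshold(frame, factor=1.1):
    """Return a copy of frame with values below factor * column mean set to 0."""
    threshold = factor * frame.mean()        # one threshold per column
    return frame.mask(frame < threshold, 0)  # replace where the condition is True

# Same values as the example above, written out explicitly
df = pd.DataFrame({"A": [11, 79, 66], "B": [34, 0, 43], "C": [93, 81, 71]})
result = zero_below_threshold(df)
print(result)
```

Calling it with a different factor, for example zero_below_threshold(df, factor=1.0), zeroes everything below the plain column mean.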

Pandas groupby function

Suppose I have the data set below in a dataframe, df:
import pandas as pd
df = pd.DataFrame({'ID' : ['A','A','A','B','B','B'], 'Date' : ['1-Jan','2-Jan','3-Jan','1-Jan','2-Jan','3-Jan'],'VAL' : [45,23,54,65,76,23]})
I am trying to insert a column, say 'new_col', that calculates the percent change in VAL that is grouped by ID. So, for example, I would want the percent change from 45 to 23, 23 to 54, and then restart for ID 'B'. The below code works but it calculates the percent change regardless of ID.
df['new_col'] = (df['VAL'] - df['VAL'].shift(1)) / df['VAL'].shift(1)
I tried adding the group by function in front of it but I am still getting an error:
df['new_col'] = df.groupby('ID')[(df['VAL'] - df['VAL'].shift(1)) / df['VAL'].shift(1)]
^^^^^^^^^^^^^^^^
You can't just stick your expression in brackets onto the groupby like that. What you need to do is use apply to apply a function that calculates what you want. What you want can be calculated more simply using the diff method:
>>> df.groupby('ID')['VAL'].apply(lambda g: g.diff()/g.shift())
0 NaN
1 -0.488889
2 1.347826
3 NaN
4 0.169231
5 -0.697368
dtype: float64
As DSM notes in a comment, in this case you can do it directly with the pct_change method:
>>> df.groupby('ID')['VAL'].pct_change()
0 NaN
1 -0.488889
2 1.347826
3 NaN
4 0.169231
5 -0.697368
dtype: float64
However, it is good to be aware of how to do it with apply because you'll need to do things that way if you want to do a more complex operation on the groups (i.e., an operation for which there is no predefined one-shot method).
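To get the result into a column, as the question asks, the grouped result can be assigned straight back. The sketch below uses the question's data; the second column name, new_col_manual, is made up, and transform is used there instead of apply on the assumption that a result aligned with the original index is what is wanted.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B"],
    "Date": ["1-Jan", "2-Jan", "3-Jan", "1-Jan", "2-Jan", "3-Jan"],
    "VAL": [45, 23, 54, 65, 76, 23],
})

# Built-in: percent change within each ID, NaN on each group's first row
df["new_col"] = df.groupby("ID")["VAL"].pct_change()

# Same calculation written by hand; transform keeps the original row index
df["new_col_manual"] = df.groupby("ID")["VAL"].transform(
    lambda g: g.diff() / g.shift()
)

print(df)
```

Both columns should hold the values shown above, with NaN at rows 0 and 3 where a new ID starts.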
