I want to apply cumsum to a DataFrame in pandas in Python, but leave the zeros as zeros. Simply put, the cumulative sum should be written only into the non-zero cells. Suppose I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 0, 1],
                   'b': [2, 5, 0, 0],
                   'c': [0, 1, 2, 5]})
a b c
0 1 2 0
1 2 5 1
2 0 0 2
3 1 0 5
and the result should be
a b c
0 1 2 0
1 3 7 1
2 0 0 3
3 4 0 8
Any ideas how to do that while avoiding loops? In R there is the ave function, but I'm very new to Python and don't know the equivalent.
You can mask the df so that you only overwrite the non-zero cells:
In [173]:
df[df!=0] = df.cumsum()
df
Out[173]:
a b c
0 1 2 0
1 3 7 1
2 0 0 3
3 4 0 8
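A non-mutating variant of the same idea, a minimal sketch using where: keep the cumulative sums only where the original cells are non-zero, and put 0 elsewhere.
result = df.cumsum().where(df != 0, 0)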
I've generated the following dummy data, where the number of rows per id ranges from 1 to 5:
import pandas as pd, numpy as np
import random
import functools
import operator
uid = functools.reduce(operator.iconcat, [np.repeat(x,random.randint(1,5)).tolist() for x in range(10)], [])
df = pd.DataFrame(columns=list("ABCD"), data= np.random.randint(1, 5, size=(len(uid), 4)))
df['id'] = uid
df.head()
A B C D id
0 1 1 2 2 0
1 1 2 4 4 0
2 2 3 3 2 0
3 4 3 3 1 1
4 1 3 4 4 1
I would like to group by id and then sum all the values across all columns, i.e.:
A B C D id
0 1 1 2 2 0
1 1 2 4 4 0
2 2 3 3 2 0 = (1+1+2+2) + (1+2+4+4) + (2+3+3+2) = 27
Then duplicate the value for the group so the result would be:
A B C D id val
0 1 1 2 2 0 27
1 1 2 4 4 0 27
2 2 3 3 2 0 27
I've tried calling df.groupby('id').sum(), but it sums each column separately.
You can use set_index, then stack, followed by groupby and sum, then Series.map:
df['val'] = df['id'].map(df.set_index("id").stack().groupby(level=0).sum())
Or, as suggested by @jezrael, sum has a level=0 argument which does the same as the above:
df['val'] = df['id'].map(df.set_index("id").stack().sum(level=0))
A B C D id val
0 4 2 4 1 0 34
1 3 4 4 2 0 34
2 2 4 3 1 0 34
3 1 1 1 3 1 6
4 2 3 1 4 2 50
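Note that the level argument to sum was deprecated in pandas 1.3 and removed in 2.0, so the groupby(level=0) form in the first snippet is the future-proof spelling.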
You can first sum all columns except id and then use GroupBy.transform:
df['val'] = df.drop('id', axis=1).sum(axis=1).groupby(df['id']).transform('sum')
Another idea:
df['val'] = df['id'].map( df.groupby('id').sum().sum(axis=1))
M = df.groupby("id").sum()
print(M)
print(np.sum(M, axis=1))
First, M holds the sum of each column grouped by id. Summing each row of M (axis=1) then gives the wanted total per id.
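Putting that together as a runnable sketch, assuming the df with an id column built in the question:

M = df.groupby("id").sum()          # per-id sum of each column
totals = M.sum(axis=1)              # per-id grand total across A..D
df["val"] = df["id"].map(totals)    # broadcast the total back onto each row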
SQL: SELECT MAX(A), MIN(B), C FROM Table GROUP BY C
I want to do the same operation in pandas on a DataFrame. The closest I got was:
DF2= DF1.groupby(by=['C']).max()
where I end up getting the max of both columns. How do I apply more than one aggregation while grouping?
You can use the agg function:
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
Sample:
print(DF1)
A B C D
0 1 5 a a
1 7 9 a b
2 2 10 c d
3 3 2 c c
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
print(DF2)
A B
C
a 7 5
c 3 2
See GroupBy-fu: improvements in grouping and aggregating data in pandas for a nice explanation.
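In pandas 0.25+ you can also use named aggregation, which does the same thing while naming the output columns explicitly; a minimal sketch on the same DF1:

DF2 = DF1.groupby('C').agg(A_max=('A', 'max'), B_min=('B', 'min'))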
Try the agg() function:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=list('ABC'))
print(df)
print(df.groupby('C').agg({'A': max, 'B':min}))
Output:
A B C
0 2 3 0
1 2 2 1
2 4 0 1
3 0 1 4
4 3 3 2
5 0 4 3
6 2 4 2
7 3 4 0
8 4 2 2
9 3 2 1
10 2 3 1
11 4 1 0
12 4 3 2
13 0 0 1
14 3 1 1
15 4 1 1
16 0 0 0
17 4 0 1
18 3 4 0
19 0 2 4
A B
C
0 4 0
1 4 0
2 4 2
3 0 4
4 0 1
Alternatively, you may want to check the pandas.read_sql_query() function.
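A hedged sketch of that route, assuming a SQLAlchemy engine object named engine and a table named Table (both hypothetical here):

import pandas as pd
# Push the aggregation down to the database, then read the result.
DF2 = pd.read_sql_query(
    "SELECT MAX(A) AS A, MIN(B) AS B, C FROM Table GROUP BY C",
    engine,
)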
You can use the agg function:
import pandas as pd
import numpy as np
df.groupby('something').agg({'column1': np.max, 'columns2': np.min})
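In newer pandas versions, string aliases are preferred over the numpy callables (and avoid deprecation warnings); the same hypothetical example:

df.groupby('something').agg({'column1': 'max', 'columns2': 'min'})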
I have a DataFrame with two categorical columns, each with 150 categories. A value that appears in column A may not appear in column B. For example:
a = pd.DataFrame({'A':list('bbaba'), 'B':list('cccaa')})
a['A'] = a['A'].astype('category')
a['B'] = a['B'].astype('category')
The output is
Out[217]:
A B
0 b c
1 b c
2 a c
3 b a
4 a a
And after converting the categories to integer codes:
cat_columns = a.select_dtypes(['category']).columns
a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
a
The output is
Out[220]:
A B
0 1 1
1 1 1
2 0 1
3 1 0
4 0 0
My problem is that in column A, b is coded as 1, but in column B, c is coded as 1. However, I want something like this:
Out[220]:
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
where 2 corresponds to c.
Please note that I have 150 different labels.
Using pd.Categorical() you can specify a list of categories:
In [44]: cats = a[['A','B']].stack().sort_values().unique()
In [45]: cats
Out[45]: array(['a', 'b', 'c'], dtype=object)
In [46]: a['A'] = pd.Categorical(a['A'], categories=cats)
In [47]: a['B'] = pd.Categorical(a['B'], categories=cats)
In [48]: a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
In [49]: a
Out[49]:
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
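An equivalent sketch using a shared CategoricalDtype (reusing cats from In [44]), which guarantees both columns use one category-to-code mapping:

from pandas.api.types import CategoricalDtype

shared = CategoricalDtype(categories=cats)  # one dtype for both columns
a['A'] = a['A'].astype(shared)
a['B'] = a['B'].astype(shared)
codes = a[['A', 'B']].apply(lambda x: x.cat.codes)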
We can use pd.factorize on all the values at once:
pd.DataFrame(
    pd.factorize(a.values.ravel())[0].reshape(a.shape),
    a.index, a.columns
)
A B
0 0 1
1 0 1
2 2 1
3 0 2
4 2 2
Or, if you want to factorize by sorted category value, use the sort=True argument:
pd.DataFrame(
    pd.factorize(a.values.ravel(), True)[0].reshape(a.shape),
    a.index, a.columns
)
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
Or equivalently with np.unique:
pd.DataFrame(
    np.unique(a.values.ravel(), return_inverse=True)[1].reshape(a.shape),
    a.index, a.columns
)
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
If you are only interested in converting to categorical codes and being able to access the mapping via a dictionary, pd.factorize may be more convenient.
Algorithm for getting unique values across columns via @AlexRiley:
a = pd.DataFrame({'A':list('bbaba'), 'B':list('cccaa')})
fact = dict(zip(*pd.factorize(pd.unique(a[['A', 'B']].values.ravel('K')))[::-1]))
b = a.applymap(fact.get)
Result:
A B
0 0 2
1 0 2
2 1 2
3 0 1
4 1 1
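Note: DataFrame.applymap was deprecated in pandas 2.1 in favor of the element-wise DataFrame.map, so on newer versions b = a.map(fact.get) is the equivalent spelling.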
I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A B
1 0
2 0
3 0
0 1
2 1
3 1
0 2
1 2
3 2
0 3
1 3
2 3
will be reduced to only one row per unordered pair. For instance, (1, 3) is a combination I only want returned once; if the same pair exists with the columns flipped, (3, 1), it can be removed. The table I'm looking to get is:
A B
0 2
0 3
1 0
1 2
1 3
2 3
Where there is only one occurrence of each pair of values that are mirrored if the columns are flipped.
I think you can use apply with sorted, plus drop_duplicates:
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Faster solution with numpy.sort:
df = pd.DataFrame(np.sort(df.values, axis=1), index=df.index,
                  columns=df.columns).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Solution without sorting with DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Loading the data:
import numpy as np
import pandas as pd
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a,B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
I am trying to convert a data set with 100,000 rows and 3 columns into a pivot table. While the following code runs without an error, the values are displayed as NaN.
df1 = pd.pivot_table(df_TEST, values='actions', index=['sku'], columns=['user'])
It is not taking the values (which range from 1 to 36) from the DataFrame. Has anyone come across this situation?
This can happen when you are doing a pivot, since not every combination of index and column values is present in the data. For example:
In [10]: df_TEST
Out[10]:
a b c
0 0 0 0
1 0 1 0
2 0 2 0
3 1 1 1
4 1 2 3
5 1 4 5
Now, when you do pivot on this,
In [9]: df_TEST.pivot_table(index='a', values='c', columns='b')
Out[9]:
b 0 1 2 4
a
0 0 0 0 NaN
1 NaN 1 3 5
Note that you got NaN at index 0, column 4, since there is no entry in df_TEST with a = 0 and b = 4.
Typically you fill such values with zeros.
In [11]: df_TEST.pivot_table(index='a', values='c', columns='b').fillna(0)
Out[11]:
b 0 1 2 4
a
0 0 0 0 0
1 0 1 3 5
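Equivalently, pivot_table accepts a fill_value argument, so the fillna step can be folded in:
In [12]: df_TEST.pivot_table(index='a', values='c', columns='b', fill_value=0)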