Create new column on grouped data frame - python

I want to create a new column that is calculated per group, using multiple columns from the current data frame. Basically something like this in R (tidyverse):
require(tidyverse)
data <- data_frame(
  a = c(1, 2, 1, 2, 3, 1, 2),
  b = c(1, 1, 1, 1, 1, 1, 1),
  c = c(1, 0, 1, 1, 0, 0, 1)
)
data %>%
  group_by(a) %>%
  mutate(d = cumsum(b) * c)
In pandas I think I should use groupby and apply to create the new column and then assign it to the original data frame. This is what I've tried so far:
import numpy as np
import pandas as pd
def create_new_column(data):
    return np.cumsum(data['b']) * data['c']

data = pd.DataFrame({
    'a': [1, 2, 1, 2, 3, 1, 2],
    'b': [1, 1, 1, 1, 1, 1, 1],
    'c': [1, 0, 1, 1, 0, 0, 1],
})
# assign - throws error
data['d'] = data.groupby('a').apply(create_new_column)
# assign without index - incorrect order in output
data['d'] = data.groupby('a').apply(create_new_column).values
# assign to sorted data frame
data_sorted = data.sort_values('a')
data_sorted['d'] = data_sorted.groupby('a').apply(create_new_column).values
What is the preferred way (ideally without sorting the data) to achieve this?

Add the parameter group_keys=False to avoid the MultiIndex, so the result can be assigned back as a new column:
data['d'] = data.groupby('a', group_keys=False).apply(create_new_column)
An alternative is to remove the first level of the index:
data['d'] = data.groupby('a').apply(create_new_column).reset_index(level=0, drop=True)
print (data)
a b c d
0 1 1 1 1
1 2 1 0 0
2 1 1 1 2
3 2 1 1 2
4 3 1 0 0
5 1 1 0 0
6 2 1 1 3
Detail:
print (data.groupby('a').apply(create_new_column))
a
1 0 1
2 2
5 0
2 1 0
3 2
6 3
3 4 0
dtype: int64
print (data.groupby('a', group_keys=False).apply(create_new_column))
0 1
2 2
5 0
1 0
3 2
6 3
4 0
dtype: int64
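If the per-group computation is a built-in like cumsum, there is also a fully vectorized route that avoids apply altogether. This is only a sketch of the same result, not part of the answer above: groupby cumsum returns a Series aligned to the original index, so it can be assigned directly and no sorting or index manipulation is needed.
data['d'] = data.groupby('a')['b'].cumsum() * data['c']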

Now you can also implement it in Python with datar, in exactly the way you did it in R:
>>> from datar.all import c, f, tibble, cumsum, group_by, mutate
>>>
>>> data = tibble(
... a = c(1, 2, 1, 2, 3, 1, 2),
... b = c(1, 1, 1, 1, 1, 1, 1),
... c = c(1, 0, 1, 1, 0, 0, 1),
... )
>>>
>>> (data >>
... group_by(f.a) >>
... mutate(d=cumsum(f.b) * f.c))
a b c d
0 1 1 1 1
1 2 1 0 0
2 1 1 1 2
3 2 1 1 2
4 3 1 0 0
5 1 1 0 0
6 2 1 1 3
[Groups: ['a'] (n=3)]
I am the author of the package. Feel free to submit issues if you have any questions.

Related

How does nunique work with the given table values?

yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
output:
A B
0 1 1
1 2 1
2 3 1
yf.nunique(axis=0)
output:
A 3
B 1
yf.nunique(axis=1)
output:
0 1
1 2
2 2
Could you please explain how axis=0 and axis=1 work? For axis=0, why are A=2 and B=1 ignored? I also wonder whether nunique takes the index into account.
You can count the number of unique values per column or per row with DataFrame.nunique.
yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
print (yf)
A B
0 1 1
1 2 1
2 3 1
print (yf.nunique(axis=0))
A 3
B 1
dtype: int64
print (yf.nunique(axis=1))
0 1
1 2
2 2
dtype: int64
It means:
A is 3, because there are 3 unique values in column A
0 is 1, because there is 1 unique value in row 0
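To make the axis behaviour concrete, the per-row count can be spelled out with a plain Python set. This is only an illustrative sketch, not how nunique works internally:
print (yf.apply(lambda row: len(set(row)), axis=1))
0    1
1    2
2    2
dtype: int64
Row 0 holds the values 1 and 1, so one unique value; rows 1 and 2 each hold two distinct values.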

Pandas/Numpy How to generate a rolling count column?

I have a table with two columns. The second column is binary, holding 0 or 1 values. I would like to keep a running count of these values until the value switches. For example, I would like to add a 'count' column that looks like this:
Date sig count
2000-01-03 0 1
2000-01-04 0 2
2000-01-05 1 1
2000-01-06 1 2
2000-01-07 1 3
2000-01-08 1 4
2000-01-09 0 1
2000-01-010 0 2
2000-01-011 0 3
2000-01-012 0 4
2000-01-013 0 5
Is there a simple way of doing this with pandas, numpy or simply python without having to iterate or use loops?
In numpy you can find the indexes where the different groups start and the counts of these groups, then apply np.add.accumulate on a sequence of ones with some of them replaced:
def accumulative_count(sig):
    marker_idx = np.flatnonzero(np.diff(sig)) + 1
    counts = np.diff(marker_idx, prepend=0)
    counter = np.ones(len(sig), dtype=int)
    counter[marker_idx] -= counts
    return np.add.accumulate(counter)

df['count'] = accumulative_count(df['sig'])
Sample run:
sig = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
marker_idx = np.flatnonzero(np.diff(sig)) + 1
counts = np.diff(marker_idx, prepend=0,)
counter = np.ones(len(sig), dtype=int)
counter[marker_idx] -= counts
>>> marker_idx #starts of groups
array([2, 6], dtype=int64)
>>> counts #counts of groups
array([2, 4], dtype=int64)
>>> counter # a sequence of ones with some of them replaced
array([ 1, 1, -1, 1, 1, 1, -3, 1, 1, 1, 1])
>>> np.add.accumulate(counter) #output
array([1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 5], dtype=int32)
df['count'] = df.groupby((df['sig'] != df['sig'].shift(1)).cumsum()).cumcount()+1
In [1571]: df
Out[1571]:
Date sig count
0 2000-01-03 0 1
1 2000-01-04 0 2
2 2000-01-05 1 1
3 2000-01-06 1 2
4 2000-01-07 1 3
5 2000-01-08 1 4
6 2000-01-09 0 1
7 2000-01-010 0 2
8 2000-01-011 0 3
9 2000-01-012 0 4
10 2000-01-013 0 5
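The key piece in the one-liner above is the group key: comparing sig with its shifted copy flags the first row of every run, and the cumulative sum turns those flags into a run identifier, so cumcount restarts for every run. A sketch of the intermediate values with the sample data:
runs = (df['sig'] != df['sig'].shift(1)).cumsum()
print (runs.tolist())
[1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3]
df['count'] = df.groupby(runs).cumcount() + 1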

How to take one part of a Series in Pandas?

How do I only take the values on the right? Ex. only have an array/list of [120, 108, 82...]
d = daily_counts.loc[daily_counts['workingday'] == "yes", 'casual']
d
You can simply use the tolist(), values, or to_numpy() methods. Here is a toy example:
>>> df = pd.DataFrame({'a':[1,2,3,4,5,7,1,4]})
>>>
a
0 1
1 2
2 3
3 4
4 5
5 7
6 1
7 4
>>> df['a'].value_counts() # generates output similar to yours
>>>
4 2
1 2
7 1
5 1
3 1
2 1
Name: a, dtype: int64
>>> df['a'].value_counts().tolist() #extracting as a list
[2, 2, 1, 1, 1, 1]
>>> df['a'].value_counts().values #extracting as a numpy array
array([2, 2, 1, 1, 1, 1])
>>> df['a'].value_counts().to_numpy() #extracting as a numpy array
array([2, 2, 1, 1, 1, 1])
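Applied to the selection in the question (assuming daily_counts is the DataFrame you already have), any of the three forms works the same way:
d = daily_counts.loc[daily_counts['workingday'] == "yes", 'casual']
d.tolist()      # plain Python list
d.to_numpy()    # numpy array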

dataframe new column based on groupby operations

import pandas
import numpy
df = pandas.DataFrame({'id_1' : [1,2,1,1,1,1,1,2,2,2,2],
'id_2' : [1,1,1,1,1,2,2,2,2,2,2],
'v_1' : [2,1,1,3,2,1,2,4,1,1,2],
'v_2' : [1,1,1,1,2,2,2,1,1,2,2],
'v_3' : [3,3,3,3,4,4,4,3,3,3,3]})
In [4]: df
Out[4]:
id_1 id_2 v_1 v_2 v_3
0 1 1 2 1 3
1 2 1 1 1 3
2 1 1 1 1 3
3 1 1 3 1 3
4 1 1 2 2 4
5 1 2 1 2 4
6 1 2 2 2 4
7 2 2 4 1 3
8 2 2 1 1 3
9 2 2 1 2 3
10 2 2 2 2 3
sub = df[(df['id_1'] == 1) & (df['id_2'] == 1)].copy()
sub['v_4'] = numpy.where(sub['v_1'] == sub['v_2'].shift(), 'A', \
numpy.where(sub['v_1'] == sub['v_3'].shift(), 'B', 'C'))
In [6]: sub
Out[6]:
id_1 id_2 v_1 v_2 v_3 v_4
0 1 1 2 1 3 C
2 1 1 1 1 3 A
3 1 1 3 1 3 B
4 1 1 2 2 4 C
I have a dataframe as defined above. I would like to perform an operation that categorizes, for each group of (id_1, id_2), whether v_1 equals the previous v_2 or the previous v_3.
I have done the operation on a sub-DataFrame, and I would like a one-line way to combine the following groupby with the operation I performed on the sub-DataFrame.
gbdf = df.groupby(by=['id_1', 'id_2'])
I have tried something like
gbdf['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A', \
numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
and the error was
'DataFrameGroupBy' object does not support item assignment
I also tried
df['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A', \
numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
which I believe gave a wrong result: it does not align the groupby output with the original ordering.
I am wondering whether there is an elegant way to achieve this.
This gets you a list of dataframes that match the content of the dataframe sub, but for all results of the .groupby():
import numpy
import pandas
source = pandas.DataFrame(
{'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})
def add_v4(df):
    df['v_4'] = numpy.where(df['v_1'] == df['v_2'].shift(), 'A',
                numpy.where(df['v_1'] == df['v_3'].shift(), 'B', 'C'))
    return df
dfs = [add_v4(pandas.DataFrame(slice)) for _, slice in source.groupby(by=['id_1', 'id_2'])]
print(dfs)
About this line:
dfs = [add_v4(pandas.DataFrame(slice)) for _, slice in source.groupby(by=['id_1', 'id_2'])]
It's a list comprehension that gets all the slices from the groupby and turns them into actual new dataframes before passing them to add_v4, which returns the modified dataframe to be added to the list.
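If you would rather keep the result on the original frame in its original row order, a groupby shift aligns with the original index, so the same numpy.where logic works without splitting the frame at all. A sketch, not a second tested answer:
g = source.groupby(['id_1', 'id_2'])
source['v_4'] = numpy.where(source['v_1'] == g['v_2'].shift(), 'A',
                numpy.where(source['v_1'] == g['v_3'].shift(), 'B', 'C'))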

Conditional multi column matching (reviewed with new example)

I tried to rework my question in order to match the quality criteria and spent more time trying to achieve the result on my own.
Given are two DataFrames
a = DataFrame({"id" : ["id1"] * 3 + ["id2"] * 3 + ["id3"] * 3,
"left" : [6, 2, 5, 2, 1, 4, 5, 2, 4],
"right" : [1, 3, 4, 6, 5, 3, 6, 3, 2]
})
b = DataFrame({"id" : ["id1"] * 6 + ["id2"] * 6 + ["id3"] * 6,
"left_and_right" : range(1,7) * 3,
"boolen" : [0, 0, 1, 0, 1, 0, 1, 0, 0 , 1, 1, 0, 0, 0, 1, 0, 0, 1]
})
The expected result is
result = DataFrame({"id" : ["id1"] * 3 + ["id2"] * 3 + ["id3"] * 3,
"left" : [6, 2, 5, 2, 1, 4, 5, 2, 4],
"right" : [1, 3, 4, 6, 5, 3, 6, 3, 2],
"NEW": [0, 1, 1, 0, 1, 1, 1, 1, 0]
})
So I want to check, for each row of DataFrame b, whether there is a row in DataFrame a where a.id == b.id, and then whether b.left_and_right equals a.left or a.right. If such a row is found and b.boolen is True/1 for either the value of a.left or a.right, the value of a.NEW in that row should also be True/1.
I hope the example illustrates it better than my words.
To sum it up: for each row where the id is the same in both DataFrames, if b.boolen is True/1 for a value in b.left_and_right and that value appears in a.left or a.right, the new value a.NEW should also be True/1.
I have tried using the pd.match() and pd.merge() functions in combination with the & and | operators, but could not achieve the desired result.
Some time ago I asked a very similar question dealing with a similar problem in R (the data was organized in a slightly different way, so it was a bit different), but now I am failing with the same approach in Python.
Related question: Conditional matching of two lists with multi-column data.frames
Thanks
Just use boolean masks with & (and) and | (or):
In [11]: (a.A == b.A) & ((a.B == b.E) | (a.C == b.E)) # they all satisfy this requirement!
Out[11]:
0 True
1 True
2 True
3 True
dtype: bool
In [12]: b.D[(a.A == b.A) & ((a.B == b.E) | (a.C == b.E))]
Out[12]:
0 0
1 1
2 0
3 0
Name: D, dtype: int64
In [13]: a['NEW'] = b.D[(a.A == b.A) & ((a.B == b.E) | (a.C == b.E))]
In [14]: a
Out[14]:
A B C NEW
0 foo 1 4 0
1 goo 2 3 1
2 doo 3 1 0
3 boo 4 2 0
Update with the slightly different question:
In [21]: merged = pd.merge(a, b, on='id')
In [22]: matching = merged[(merged.left == merged.left_and_right) | (merged.right == merged.left_and_right)]
In [23]: (matching.groupby(['id', 'left', 'right'])['boolen'].sum()).reset_index()
Out[23]:
id left right boolen
0 id1 2 3 1
1 id1 5 4 1
2 id1 6 1 0
3 id2 1 5 2
4 id2 2 6 0
5 id2 4 3 1
6 id3 2 3 1
7 id3 4 2 0
8 id3 5 6 1
Note there is a 2 here, so perhaps you want to care about those > 0.
In [24]: (matching.groupby(['id', 'left', 'right'])['boolen'].sum() > 0).reset_index()
Out[24]:
id left right boolen
0 id1 2 3 True
1 id1 5 4 True
2 id1 6 1 False
3 id2 1 5 True
4 id2 2 6 False
5 id2 4 3 True
6 id3 2 3 True
7 id3 4 2 False
8 id3 5 6 True
You may want to rename the boolen column to NEW.
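If you want the flag back on a as the NEW column from the expected result, one way (a sketch, using max as the "any True in the group" reduction) is to merge the aggregated flag back on the three key columns:
flags = (matching.groupby(['id', 'left', 'right'])['boolen'].max()
                 .rename('NEW').reset_index())
result = a.merge(flags, on=['id', 'left', 'right'], how='left')
result['NEW'] = result['NEW'].fillna(0).astype(int)
This reproduces the NEW column from the question in a's original row order.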
