pandas grouping multiple column data at a column level [duplicate] - python

This question already has answers here:
How do I Pandas group-by to get sum?
(11 answers)
Aggregation in Pandas
(2 answers)
Closed 2 years ago.
I'm trying to group the data in columns A, B, C, and D by the Report column.
report  A  B  C  D
1       1  0  0  0
1       0  1  0  0
2       0  0  0  1
2       0  0  1  0
3       1  0  0  0
3       0  1  0  0
4       0  0  1  0
4       1  0  0  0
Here is what I'm trying to achieve:
report  A  B  C  D
1       1  1  0  0
2       0  0  1  1
3       1  1  0  0
4       1  0  1  0
Is there a straightforward way to achieve this result?
Thank you very much!!
I appreciate any support!

I believe you're looking for groupby with sum in pandas. The script below produces the expected result.
import pandas as pd
report = [1, 1, 2, 2, 3, 3, 4, 4]
A = [1, 0, 0, 0, 1, 0, 0, 1]
B = [0, 1, 0, 0, 0, 1, 0, 0]
C = [0, 0, 0, 1, 0, 0, 1, 0]
D = [0, 0, 1, 0, 0, 0, 0, 0]
df = pd.DataFrame(list(zip(report, A, B, C, D)), columns = ['report', 'A', 'B', 'C', 'D'])
print(df.groupby(['report']).sum())
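If you would rather keep report as a regular column instead of it becoming the index, a small variant (assuming the same df as above) is to pass as_index=False:
# keep 'report' as a column rather than the group index
print(df.groupby('report', as_index=False).sum())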

Related

I am trying to put an array into a pandas DataFrame

import pandas as pd
import numpy as np
zeros=np.zeros((6,6))
arra=np.array([zeros])
rownames=['A','B','C','D','E','F']
colnames=[['one','tow','three','four','five','six']]
df=pd.DataFrame(arra,index=rownames,columns=colnames)
print(df)
Error:
ValueError: Must pass 2-d input. shape=(1, 6, 6)
My desired output is:
A B C D E F
one 0 0 0 0 0 0
tow 0 0 0 0 0 0
three 0 0 0 0 0 0
four 0 0 0 0 0 0
five 0 0 0 0 0 0
six 0 0 0 0 0 0
Try this
pd.DataFrame(np.zeros((6,6)), columns=list('ABCDEF'), index=['one','tow','three','four','five','six'])
If you want to initialize your DataFrame with a single value, you don't need to bother creating a 2D array; just pass the desired scalar to the DataFrame constructor and it will be broadcast:
import pandas as pd
rownames = ['A', 'B', 'C', 'D', 'E', 'F']
colnames = ['one', 'tow', 'three', 'four', 'five', 'six']
df = pd.DataFrame(0, index=rownames, columns=colnames)
print(df)
Output:
one tow three four five six
A 0 0 0 0 0 0
B 0 0 0 0 0 0
C 0 0 0 0 0 0
D 0 0 0 0 0 0
E 0 0 0 0 0 0
F 0 0 0 0 0 0
Try this
zeros=np.zeros((6,6), dtype=int)
df=pd.DataFrame(zeros, columns=['A','B','C','D','E','F'], index=['one','tow','three','four','five','six'])
Note that in your question 'A','B','C','D','E','F' are the column names and 'one','tow','three','four','five','six' are the index labels; in your code you have mixed up which goes with the rows and which with the columns.
The reason you got that error is the line arra=np.array([zeros]), which wraps the 2-D array in an extra dimension and produces a 3-D array of shape (1, 6, 6) (note the triple '[[[' in the repr below), but the DataFrame constructor needs a 2-D array.
array([[[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]]])
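A quick shape check makes the difference visible (a minimal sketch, not part of the original answer):
import numpy as np
zeros = np.zeros((6, 6))
print(zeros.shape)              # (6, 6)   -- 2-D, what the DataFrame constructor expects
print(np.array([zeros]).shape)  # (1, 6, 6) -- the extra leading dimension triggers the ValueError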
Hope this helped!

Find the rows that share the value

I need to find the rows where columns A, B, and C all have the value 1, and then create a new column that holds the result.
My idea is to use np.where() with some condition, but I don't know the correct way to approach this problem. From what I have read, I'm not supposed to iterate through a DataFrame but instead use pandas' vectorized methods?
import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 1, 0],
                    'B': [1, 1, 0, 1],
                    'C': [0, 1, 1, 1]},
                   index=[0, 1, 2, 4])
print(df1)
What I am after is this:
A B C TRUE
0 0 1 0 0
1 1 1 1 1 <----
2 1 0 1 0
4 0 1 1 0
If the data is always 0/1, you can simply take the product per row:
df1['TRUE'] = df1.prod(axis=1)
output:
A B C TRUE
0 0 1 0 0
1 1 1 1 1
2 1 0 1 0
4 0 1 1 0
This is what you are looking for:
df1["TRUE"] = (df1==1).all(axis=1).astype(int)

Create new column on grouped data frame

I want to create a new column that is calculated per group, using multiple columns from the current data frame. Basically something like this in R (tidyverse):
require(tidyverse)
data <- data_frame(
  a = c(1, 2, 1, 2, 3, 1, 2),
  b = c(1, 1, 1, 1, 1, 1, 1),
  c = c(1, 0, 1, 1, 0, 0, 1)
)
data %>%
  group_by(a) %>%
  mutate(d = cumsum(b) * c)
In pandas I think I should use groupby and apply to create the new column and then assign it back to the original data frame. This is what I've tried so far:
import numpy as np
import pandas as pd
def create_new_column(data):
    return np.cumsum(data['b']) * data['c']

data = pd.DataFrame({
    'a': [1, 2, 1, 2, 3, 1, 2],
    'b': [1, 1, 1, 1, 1, 1, 1],
    'c': [1, 0, 1, 1, 0, 0, 1],
})
# assign - throws error
data['d'] = data.groupby('a').apply(create_new_column)
# assign without index - incorrect order in output
data['d'] = data.groupby('a').apply(create_new_column).values
# assign to sorted data frame
data_sorted = data.sort_values('a')
data_sorted['d'] = data_sorted.groupby('a').apply(create_new_column).values
What is the preferred way (ideally without sorting the data) to achieve this?
Add the parameter group_keys=False to avoid the MultiIndex, so the result can be assigned back to a new column:
data['d'] = data.groupby('a', group_keys=False).apply(create_new_column)
An alternative is to remove the first level of the index:
data['d'] = data.groupby('a').apply(create_new_column).reset_index(level=0, drop=True)
print (data)
a b c d
0 1 1 1 1
1 2 1 0 0
2 1 1 1 2
3 2 1 1 2
4 3 1 0 0
5 1 1 0 0
6 2 1 1 3
Detail:
print (data.groupby('a').apply(create_new_column))
a
1  0    1
   2    2
   5    0
2  1    0
   3    2
   6    3
3  4    0
dtype: int64
print (data.groupby('a', group_keys=False).apply(create_new_column))
0 1
2 2
5 0
1 0
3 2
6 3
4 0
dtype: int64
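A variant that avoids apply entirely (a sketch, not from the original answer) is the built-in groupby cumsum, which already aligns with the original index:
# per-group running sum of b, multiplied row-wise by c
data['d'] = data.groupby('a')['b'].cumsum() * data['c']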
You can also implement it in Python with datar, in exactly the same way you did it in R:
>>> from datar.all import c, f, tibble, cumsum, group_by, mutate
>>>
>>> data = tibble(
... a = c(1, 2, 1, 2, 3, 1, 2),
... b = c(1, 1, 1, 1, 1, 1, 1),
... c = c(1, 0, 1, 1, 0, 0, 1),
... )
>>>
>>> (data >>
... group_by(f.a) >>
... mutate(d=cumsum(f.b) * f.c))
a b c d
0 1 1 1 1
1 2 1 0 0
2 1 1 1 2
3 2 1 1 2
4 3 1 0 0
5 1 1 0 0
6 2 1 1 3
[Groups: ['a'] (n=3)]
I am the author of the package. Feel free to submit issues if you have any questions.

Add values for matching column and row names

Quick question that I'm brain-farting on how to best implement. I am generating a matrix to add up how many times two items are found next to each other in a list across a large number of permutations of this list. My code looks something like this:
agreement_matrix = pandas.DataFrame(0, index=names, columns=names)
for list in bunch_of_lists:
    for i in range(len(list)-1):
        agreement_matrix[list[i]][list[i+1]] += 1
It generates an array like:
A B C D
A 0 2 1 1
B 2 0 1 1
C 1 1 0 2
D 1 1 2 0
And because I don't care about order, I want to add the symmetric values together so it looks like this:
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
Is there any fast/simple way to achieve this? I've been toying around with both doing it after generation and trying to do it as I add values.
Use np.triu and np.tril:
np.triu(df) + np.tril(df).T
array([[0, 4, 2, 2],
[0, 0, 2, 2],
[0, 0, 0, 4],
[0, 0, 0, 0]])
Then wrap the result in the DataFrame constructor to restore the labels:
pd.DataFrame(np.triu(df) + np.tril(df).T, df.index, df.columns)
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
Since the example matrix is symmetric, doubling it and taking the upper triangle gives the same result:
np.triu(df.values*2)  # equivalent to np.triu(df.values + df.values.T) when df is symmetric
Out[595]:
array([[0, 4, 2, 2],
[0, 0, 2, 2],
[0, 0, 0, 4],
[0, 0, 0, 0]], dtype=int64)
Then you do
pd.DataFrame(np.triu(df.values*2), df.index, df.columns)
Out[600]:
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
A pandas solution to avoid the first loop:
import numpy as np
import pandas as pd

values = ['ABCD'[i] for i in np.random.randint(0, 4, 100)]  # sample data
df = pd.DataFrame(values)
df[1] = df[0].shift()    # pair each item with the previous one
df = df.iloc[1:]         # drop the first row, which has no pair
df.values.sort(axis=1)   # sort each pair so (B, A) counts as (A, B)
df[2] = 1                # helper column of ones to count occurrences
res = df.pivot_table(2, 0, 1, np.sum, 0)  # values=2, index=0, columns=1, aggfunc=np.sum, fill_value=0
#
#1 A B C D
#0
#A 2 14 11 16
#B 0 5 9 13
#C 0 0 10 17
#D 0 0 0 2
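A closely related shortcut (a sketch, assuming the same df with the sorted pair columns 0 and 1 built above) is pd.crosstab, which tabulates the pairs directly without the helper column:
# count each sorted (first, second) pair directly
res = pd.crosstab(df[0], df[1])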

pandas DataFrame set non-contiguous sections

I have a DataFrame like the one below, and I would like B to be 1 for n rows after a 1 in column A (below, n = 2):
index A B
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
7 0 0
8 1 0
9 0 1
I think I can do it using .ix, similar to this example, but I'm not sure how. I'd like to do it in a single pandas-style selection command if possible. (Ideally not using rolling_apply.)
Modifying a subset of rows in a pandas dataframe
EDIT: the application is that a 1 in column A is "ignored" if it falls within n rows of the previous 1. As per the comments, with n = 2, these examples:
A = [1, 0, 1, 0, 1], B should be [0, 1, 1, 0, 0]
A = [1, 1, 0, 0], B should be [0, 1, 1, 0]
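No answer is shown for this one, but here is a plain-loop sketch of the rule described in the edit (a hypothetical helper, just to pin down the semantics for n = 2; not the vectorized pandas selection the question asks for):
def mark_after_ones(a, n=2):
    # B is 1 for the n rows after an "accepted" 1 in A; a 1 that falls
    # within n rows of the previously accepted 1 is ignored.
    b = [0] * len(a)
    last = -(n + 1)  # index of the last accepted 1
    for i, val in enumerate(a):
        if val == 1 and i - last > n:
            last = i
            for j in range(i + 1, min(i + 1 + n, len(a))):
                b[j] = 1
    return b

print(mark_after_ones([1, 0, 1, 0, 1]))  # [0, 1, 1, 0, 0]
print(mark_after_ones([1, 1, 0, 0]))     # [0, 1, 1, 0]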
