I want to transform the following data from df1 to df2:
df1:
ID a b c d a-d c-d a-c-d
0 1 0 0 0 0 0 0 1
1 2 0 0 1 0 1 0 0
2 3 0 1 0 0 0 1 0
3 4 0 0 0 0 1 0 1
4 5 0 0 1 1 0 0 0
And df2 is:
ID a b c d
0 1 1 0 1 1
1 2 1 0 1 1
2 3 0 1 1 1
3 4 2 0 1 2
4 5 0 0 1 1
Basically, I want the total count of "a", summed from all the columns whose name contains the letter "a". E.g. in the 4th row of df1, two of the columns whose names contain "a" hold a 1; summing them gives a total of 2 a's for that row. I want a single "a" column in the new dataset (df2). Note that a 1 for "a-c-d" counts as a 1 for EACH of "a", "c" and "d".
If you know the unique categories in advance (e.g. ["a", "b", "c", "d"]), you can take a little shortcut: rely on df.filter to gather all of the columns containing that letter, then use .sum(axis=1) to sum across those columns and create your expected summary value:
data = {"ID": df["ID"]}
for letter in ["a", "b", "c", "d"]:
    data[letter] = df.filter(like=letter).sum(axis=1)
final_df = pd.DataFrame(data)
print(final_df)
ID a b c d
0 1 1 0 1 1
1 2 1 0 1 1
2 3 0 1 1 1
3 4 2 0 1 2
4 5 0 0 1 1
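One caveat worth flagging with this shortcut (an observation, not part of the answer above): filter(like=...) does substring matching, so it only stays correct while every category is a single character that never appears inside another column name. If categories could be multi-character, matching on the split name parts is safer; a minimal sketch with a hypothetical "ab" column:

```python
import pandas as pd

# Hypothetical frame: "ab" contains the letter "a" but is a separate category.
df = pd.DataFrame({
    "ID": [1, 2],
    "a": [0, 0],
    "ab": [1, 0],
    "a-d": [0, 1],
})

def category_total(frame, category):
    # Split each column name on "-" and keep only the columns whose
    # parts match the category exactly (no substring matches).
    cols = [c for c in frame.columns if category in c.split("-")]
    return frame[cols].sum(axis=1)

print(category_total(df, "a").tolist())   # [0, 1] -- counts "a" and "a-d", skips "ab"
```

By contrast, df.filter(like="a") would also pick up "ab" and overcount.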
Let's try melt to stack the column names, then str.split followed by explode to split the a,b,c,d and duplicate the data:
(df1.melt('ID', var_name='col')
    .assign(col=lambda x: x['col'].str.split('-'))
    .explode('col')
    .pivot_table(index='ID', columns='col', values='value', aggfunc='sum')
    .reset_index()
)
Output:
col ID a b c d
0 1 1 0 1 1
1 2 1 0 1 1
2 3 0 1 1 1
3 4 2 0 1 2
4 5 0 0 1 1
Something like split, then explode, then groupby with sum:
out = (df.T.reset_index()
         .assign(index=lambda x: x['index'].str.split('-'))
         .explode('index')
         .groupby('index').sum()
         .T)
Out[102]:
index ID a b c d
0 1 1 0 1 1
1 2 1 0 1 1
2 3 0 1 1 1
3 4 2 0 1 2
4 5 0 0 1 1
Well, just to complete the answers here, a more manual method is as follows (the combined columns then still need to be dropped):
df1.loc[:, 'a'] = df1.loc[:, 'a'] + df1.loc[:, 'a-d'] + df1.loc[:, 'a-c-d']
df1.loc[:, 'c'] = df1.loc[:, 'c'] + df1.loc[:, 'c-d'] + df1.loc[:, 'a-c-d']
df1.loc[:, 'd'] = df1.loc[:, 'd'] + df1.loc[:, 'a-d'] + df1.loc[:, 'c-d'] + df1.loc[:, 'a-c-d']
df1 = df1.drop(columns=['a-d', 'c-d', 'a-c-d'])
Output:
ID a b c d
0 1 1 0 1 1
1 2 1 0 1 1
2 3 0 1 1 1
3 4 2 0 1 2
4 5 0 0 1 1
Related
I have this data for example:
A B C Class_label
0 1 1 B_C
1 1 1 A_B_C
0 0 1 C
How do I obtain this classified-label column, and also count the 1s per row and display that as well, using a pandas DataFrame?
Use DataFrame.assign to add both new columns: DataFrame.dot with the column names builds the labels, and sum counts the 1s; only the numeric columns are used, selected via DataFrame.select_dtypes:
import numpy as np

df1 = df.select_dtypes(np.number)
df = df.assign(classifiedlabel=df1.dot(df1.columns + '_').str[:-1],
               countones=df1.sum(axis=1))
print(df)
A B C D classifiedlabel countones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
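Why the dot trick works: multiplying an integer by a string in Python repeats the string (0 * 'A_' == '', 1 * 'A_' == 'A_'), so the row-by-column product concatenates exactly the names of the columns holding a 1, and .str[:-1] strips the trailing separator. A minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 0], "B": [1, 1, 0], "C": [1, 1, 1]})

# Each row of the matrix product is e.g. 0*"A_" + 1*"B_" + 1*"C_" == "B_C_".
labels = df.dot(df.columns + "_").str[:-1]
print(labels.tolist())  # ['B_C', 'A_B_C', 'C']
```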
If the classifiedlabel column already exists, the simplest approach is to sum only the numeric columns (recent pandas versions raise on mixed dtypes otherwise):
df["countones"] = df.sum(axis=1, numeric_only=True)
print (df)
A B C D classifiedlabel countones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
If values are 1/0 then you can use:
df.assign(
    count=df.select_dtypes('number').sum(axis=1)
)
Output:
A B C D classifiedlabel count
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
Try:
df["number_of_ones"] = (df == 1).astype(int).sum(axis=1)
print(df)
A B C D classifiedlabel number_of_ones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
I have a df in python that looks something like this:
'A'
0
1
0
0
1
1
1
1
0
I want to create another column that adds cumulative 1's from column A, and starts over if the value in column A becomes 0 again. So desired output:
'A' 'B'
0 0
1 1
0 0
0 0
1 1
1 2
1 3
1 4
0 0
This is what I am trying, but it's just replicating column A:
df.B[df.A ==0] = 0
df.B[df.A !=0] = df.A + df.B.shift(1)
Let us do cumcount within groups delimited by the zeros (via eq(0).cumsum()), zeroing the result where A is 0:
df['B']=(df.groupby(df.A.eq(0).cumsum()).cumcount()).where(df.A==1,0)
Out[81]:
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 0
dtype: int64
Use shift with ne and groupby.cumsum:
df['B'] = df.groupby(df['A'].shift().ne(df['A']).cumsum())['A'].cumsum()
print(df)
A B
0 0 0
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 0 0
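Both answers build a counter that restarts at each zero. For strictly 0/1 data there is an equivalent sketch that groups on the running count of zeros and cumsums the values themselves:

```python
import pandas as pd

s = pd.Series([0, 1, 0, 0, 1, 1, 1, 1, 0], name="A")

# Every 0 bumps the group id (eq(0).cumsum()), so cumsum of the 0/1
# values inside each group counts the consecutive ones so far.
b = s.groupby(s.eq(0).cumsum()).cumsum()
print(b.tolist())  # [0, 1, 0, 0, 1, 2, 3, 4, 0]
```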
My goal here is to concat() alternating groups from two dataframes.
desired result :
group ordercode quantity
0 A 1
B 1
C 1
D 1
0 A 1
B 3
1 A 1
B 2
C 1
1 A 1
B 1
C 2
My dataframe:
import pandas as pd
df1=pd.DataFrame([[0,"A",1],[0,"B",1],[0,"C",1],[0,"D",1],[1,"A",1],[1,"B",2],[1,"C",1]],columns=["group","ordercode","quantity"])
df2=pd.DataFrame([[0,"A",1],[0,"B",3],[1,"A",1],[1,"B",1],[1,"C",2]],columns=["group","ordercode","quantity"])
print(df1)
print(df2)
I have used dfff = pd.concat([df1, df2]).sort_index(kind="merge")
but I got the result below:
group ordercode quantity
0 0 A 1
0 0 A 1
1 B 1
1 B 3
2 C 1
3 D 1
4 1 A 1
4 1 A 1
5 B 2
5 B 1
6 C 1
6 C 2
You can see that the concatenation here interleaves row by row, not group by group. It should print like:
group 0 of df1
group 0 of df2
group 1 of df1
group 1 of df2, and so on
Note:
I have created these DataFrames using the groupby() function (as_matrix() is deprecated; to_numpy() is the modern equivalent):
df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x[0]] * x[1], 1).to_numpy()),
                  columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df)//3, 1)) * 4)[0:len(df)]
df = df.groupby(['group', 'ordercode']).sum()
Question:
Where did I go wrong? The sort is being done by index. I have used .set_index("group") but it didn't work either.
Use cumcount to build a helper column, then sort with sort_values:
df1['g'] = df1.groupby('ordercode').cumcount()
df2['g'] = df2.groupby('ordercode').cumcount()
dfff = pd.concat([df1,df2]).sort_values(['group','g']).reset_index(drop=True)
print (dfff)
group ordercode quantity g
0 0 A 1 0
1 0 B 1 0
2 0 C 1 0
3 0 D 1 0
4 0 A 1 0
5 0 B 3 0
6 1 C 2 0
7 1 A 1 1
8 1 B 2 1
9 1 C 1 1
10 1 A 1 1
11 1 B 1 1
and finally remove the helper column:
dfff = dfff.drop('g', axis=1)
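For completeness, a sketch that reproduces the desired interleaving without any helper column: tag each frame with a source key in pd.concat, then do a stable sort on group followed by the key (kind='mergesort' preserves the original row order within ties):

```python
import pandas as pd

df1 = pd.DataFrame([[0, "A", 1], [0, "B", 1], [0, "C", 1], [0, "D", 1],
                    [1, "A", 1], [1, "B", 2], [1, "C", 1]],
                   columns=["group", "ordercode", "quantity"])
df2 = pd.DataFrame([[0, "A", 1], [0, "B", 3],
                    [1, "A", 1], [1, "B", 1], [1, "C", 2]],
                   columns=["group", "ordercode", "quantity"])

# keys= records which frame each row came from; the stable sort then
# orders blocks as (group 0, df1), (group 0, df2), (group 1, df1), ...
out = (pd.concat([df1, df2], keys=[0, 1], names=["src", None])
         .reset_index("src")
         .sort_values(["group", "src"], kind="mergesort")
         .drop(columns="src")
         .reset_index(drop=True))
print(out)
```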
I have a list. I want to set the index of a dataframe to the cartesian product of the list values with the dataframe's rows, i.e.
li = ['A','B']
df = pd.DataFrame([[0,0,0],[1,1,1],[2,2,2]])
I want the resulting dataframe to be like
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
How can I do this?
Option 1
pd.concat with keys argument
pd.concat([df] * len(li), keys=li)
0 1 2
A 0 0 0 0
1 1 1 1
2 2 2 2
B 0 0 0 0
1 1 1 1
2 2 2 2
To replicate your output exactly:
pd.concat([df] * len(li), keys=li).reset_index(1, drop=True)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Option 2
np.tile and np.repeat
pd.DataFrame(np.tile(df, [len(li), 1]), np.repeat(li, len(df)), df.columns)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Use MultiIndex.from_product with reindex:
mux = pd.MultiIndex.from_product([li, df.index])
df = df.reindex(mux, level=1).reset_index(level=1, drop=True)
print (df)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Or you can use:
li = [['A', 'B']]
df['New'] = li * len(df)
(df.set_index([0, 1, 2])['New']
   .apply(pd.Series)
   .stack()
   .to_frame()
   .rename(columns={0: 'keys'})
   .reset_index()
   .drop('level_3', axis=1)
   .sort_values('keys'))
Out[698]:
0 1 2 keys
0 0 0 0 A
2 1 1 1 A
4 2 2 2 A
1 0 0 0 B
3 1 1 1 B
5 2 2 2 B
I have a data frame looking like this:
A B C D
2 5 0 9
2 0 8 0
0 0 8 9
2 0 8 0
0 5 0 9
2 0 8 9
0 5 0 9
2 5 8 0
I want to check each value in a column and create a new column from it, such that each row of the new column is 1 if the value is greater than 0 and 0 otherwise.
I did it using a for loop this way:
#Generate a data frame example
df = pd.DataFrame(np.random.randint(5, size=(10, 8)), columns = ["A", "B", "C", "D", "E","F","G", "H"])
# create a label out of it
for label in df.columns.values:
    df['label_' + label] = df[label].apply(lambda a: 0 if a == 0 else 1)
    df.drop(label, axis=1)
My questions are:
1- How can I do the same task but without a for loop?
2- How can I drop the columns after manipulating them? I already tried .drop(label, axis=1) but it did not work.
IIUC you could do -
df_out = (df>0).astype(int)
df_out.columns = ['label_'+i for i in df.columns]
A vectorized way to create those new labels would be using NumPy's string functions:
df_out.columns = np.char.add('label_', df.columns.values.astype(str))
Or a nice one-liner as suggested by @Ted Petrou -
(df>0).astype(int).add_prefix('label_')
Another option (note that fillna('1') stores the string '1', so cast with .astype(int) if numeric labels are needed):
df2 = df.mask(df != 0).fillna('1').add_prefix('label_')
print(df2)
label_A label_B label_C label_D label_E label_F label_G label_H
0 1 1 1 1 1 1 1 1
1 1 1 0 1 1 1 0 0
2 1 0 0 1 1 0 0 1
3 1 1 1 0 1 1 1 0
4 1 1 1 1 1 1 1 1
5 1 0 0 1 1 1 1 0
6 1 1 1 1 1 1 1 1
7 1 0 1 0 1 0 1 0
8 1 1 1 1 1 1 1 1
9 0 1 1 1 0 1 1 1
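Coming back to the second sub-question, which none of the answers address directly: .drop(label, axis=1) returns a new frame, so it appears to do nothing unless the result is assigned back (df = df.drop(label, axis=1)) or inplace=True is passed. With the vectorized comparison, rebinding the name sidesteps the drop entirely; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(5, size=(10, 8)),
                  columns=list("ABCDEFGH"))

# ne(0) marks nonzero cells as True; rebinding df to the result also
# "drops" the original columns, which is what the loop + drop attempted.
df = df.ne(0).astype(int).add_prefix("label_")
print(df.head())
```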