Concat() alternate groups between two DataFrames - pandas / Python 3

My goal here is to concat() alternating groups between two DataFrames.
Desired result:
group  ordercode  quantity
0      A          1
       B          1
       C          1
       D          1
0      A          1
       B          3
1      A          1
       B          2
       C          1
1      A          1
       B          1
       C          2
My dataframe:
import pandas as pd
df1=pd.DataFrame([[0,"A",1],[0,"B",1],[0,"C",1],[0,"D",1],[1,"A",1],[1,"B",2],[1,"C",1]],columns=["group","ordercode","quantity"])
df2=pd.DataFrame([[0,"A",1],[0,"B",3],[1,"A",1],[1,"B",1],[1,"C",2]],columns=["group","ordercode","quantity"])
print(df1)
print(df2)
I have used dfff = pd.concat([df1, df2]).sort_index(kind="merge"),
but I got the result below:
  group ordercode  quantity
0     0         A         1
0     0         A         1
1               B         1
1               B         3
2               C         1
3               D         1
4     1         A         1
4     1         A         1
5               B         2
5               B         1
6               C         1
6               C         2
You can see that the concatenation interleaves individual rows rather than whole groups.
It should print:
group 0 of df1
group 0 of df2
group 1 of df1
group 1 of df2, and so on.
Note:
I created these DataFrames using the groupby() function:
# .as_matrix() was removed in pandas 1.0; .to_numpy() is the replacement
df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x.iloc[0]] * x.iloc[1], axis=1).to_numpy()),
                  columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df)//3, 1)) * 4)[0:len(df)]
df = df.groupby(['group', 'ordercode']).sum()
Question:
Where did I go wrong? It seems to sort by the index.
I also tried .set_index("group"), but it didn't work either.

Use cumcount to build a helper column, then sort on it with sort_values:
df1['g'] = df1.groupby('ordercode').cumcount()
df2['g'] = df2.groupby('ordercode').cumcount()
dfff = pd.concat([df1,df2]).sort_values(['group','g']).reset_index(drop=True)
print (dfff)
    group ordercode  quantity  g
0       0         A         1  0
1       0         B         1  0
2       0         C         1  0
3       0         D         1  0
4       0         A         1  0
5       0         B         3  0
6       1         C         2  0
7       1         A         1  1
8       1         B         2  1
9       1         C         1  1
10      1         A         1  1
11      1         B         1  1
and finally remove the helper column:
dfff = dfff.drop('g', axis=1)
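If you prefer something more explicit, here is a minimal sketch (an alternative, not part of the answer above; run it on the original df1/df2, before the helper column is added) that slices each group out of both frames and interleaves them in order:
parts = []
for g in sorted(set(df1['group']).union(df2['group'])):
    parts.append(df1[df1['group'] == g])   # group g of df1 first
    parts.append(df2[df2['group'] == g])   # then group g of df2
dfff = pd.concat(parts, ignore_index=True)
This reproduces the desired ordering directly, at the cost of a Python-level loop over the groups.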

Related

How do you count the common 1's in pandas data frame?

I have this data, for example:
   A  B  C Class_label
0  0  1  1         B_C
1  1  1  1       A_B_C
2  0  0  1           C
How do I obtain the classified-label column, and also count the common 1's and display that, using a pandas DataFrame?
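For reproducibility, here is a construction of the sample frame the answers below appear to use (an assumption on my part; note it contains an extra D column that the question's example does not show):
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 0, 0],
                   "B": [1, 1, 0, 1],
                   "C": [0, 0, 1, 1],
                   "D": [1, 1, 0, 0]})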
Use DataFrame.assign to add the new columns: DataFrame.dot with the column names builds the labels, and sum counts the 1's, with only the numeric columns selected by DataFrame.select_dtypes:
df1 = df.select_dtypes(np.number)
df = df.assign(classifiedlabel=df1.dot(df1.columns + '_').str[:-1],
               countones=df1.sum(axis=1))
print (df)
   A  B  C  D classifiedlabel  countones
0  0  1  0  1             B_D          2
1  1  1  0  1           A_B_D          3
2  0  0  1  0               C          1
3  0  1  1  0             B_C          2
If the classifiedlabel column already exists, the simplest option is sum alone (restricted to numeric columns so the string column is skipped):
df["countones"] = df.sum(axis=1, numeric_only=True)
print (df)
   A  B  C  D classifiedlabel  countones
0  0  1  0  1             B_D          2
1  1  1  0  1           A_B_D          3
2  0  0  1  0               C          1
3  0  1  1  0             B_C          2
If the values are 1/0 then you can use:
df.assign(count=df._get_numeric_data().sum(axis=1))
Output:
   A  B  C  D classifiedlabel  count
0  0  1  0  1             B_D      2
1  1  1  0  1           A_B_D      3
2  0  0  1  0               C      1
3  0  1  1  0             B_C      2
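Note that _get_numeric_data is a private pandas method and may change without notice; a sketch of the same idea using the public select_dtypes:
df = df.assign(count=df.select_dtypes("number").sum(axis=1))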
Try:
df["number_of_ones"] = (df == 1).astype(int).sum(axis=1)
print(df)
   A  B  C  D classifiedlabel  number_of_ones
0  0  1  0  1             B_D               2
1  1  1  0  1           A_B_D               3
2  0  0  1  0               C               1
3  0  1  1  0             B_C               2

Transform dataset based on conditional addition - complicated

I want to transform the following data from df1 to df2:
df1:
   ID  a  b  c  d  a-d  c-d  a-c-d
0   1  0  0  0  0    0    0      1
1   2  0  0  1  0    1    0      0
2   3  0  1  0  0    0    1      0
3   4  0  0  0  0    1    0      1
4   5  0  0  1  1    0    0      0
And df2 is:
   ID  a  b  c  d
0   1  1  0  1  1
1   2  1  0  1  1
2   3  0  1  1  1
3   4  2  0  1  2
4   5  0  0  1  1
Basically, I want to get the total value of "a" from all the columns in which the letter "a" appears in the column name. E.g. in the 4th row of df1 there are 2 column names in which the letter "a" appears, so summing all the "a"s in that row gives a total of 2. I want a single column for "a" in the new dataset (df2), and likewise for the other letters. Note that a 1 for "a-c-d" counts as a 1 for EACH of "a", "c", and "d".
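For reference, a minimal construction of df1 from the table above (my own reconstruction; the answers below refer to it as df or df1):
import pandas as pd

df1 = pd.DataFrame({"ID":    [1, 2, 3, 4, 5],
                    "a":     [0, 0, 0, 0, 0],
                    "b":     [0, 0, 1, 0, 0],
                    "c":     [0, 1, 0, 0, 1],
                    "d":     [0, 0, 0, 0, 1],
                    "a-d":   [0, 1, 0, 1, 0],
                    "c-d":   [0, 0, 1, 0, 0],
                    "a-c-d": [1, 0, 0, 1, 0]})
df = df1  # the first answer calls this frame df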
If you know the unique categories in advance (e.g. ["a", "b", "c", "d"]) then you can take a little shortcut and rely on df.filter to gather all of the columns containing that letter, then use .sum(axis=1) to sum across those columns to create your expected summary value:
data = {"ID": df["ID"]}
for letter in ["a", "b", "c", "d"]:
data[letter] = df.filter(like=letter).sum(axis=1)
final_df = pd.DataFrame(data)
print(final_df)
   ID  a  b  c  d
0   1  1  0  1  1
1   2  1  0  1  1
2   3  0  1  1  1
3   4  2  0  1  2
4   5  0  0  1  1
Let's try melt to stack the column names, then str.split followed by explode to split the a,b,c,d and duplicate the data:
(df1.melt('ID', var_name='col')
    .assign(col=lambda x: x['col'].str.split('-'))
    .explode('col')
    .pivot_table(index='ID', columns='col', values='value', aggfunc='sum')
    .reset_index()
)
Output:
col  ID  a  b  c  d
0     1  1  0  1  1
1     2  1  0  1  1
2     3  0  1  1  1
3     4  2  0  1  2
4     5  0  0  1  1
Something like split, then explode, then groupby with sum:
out = (df.T.reset_index()
         .assign(index=lambda x: x['index'].str.split('-'))
         .explode('index')
         .groupby('index').sum().T)
Out[102]:
index  ID  a  b  c  d
0       1  1  0  1  1
1       2  1  0  1  1
2       3  0  1  1  1
3       4  2  0  1  2
4       5  0  0  1  1
Well, just to complete the answers here, a more manual method is the following:
df1.loc[:, 'a'] = df1.loc[:, 'a'] + df1.loc[:, 'a-d'] + df1.loc[:, 'a-c-d']
df1.loc[:, 'c'] = df1.loc[:, 'c'] + df1.loc[:, 'c-d'] + df1.loc[:, 'a-c-d']
df1.loc[:, 'd'] = df1.loc[:, 'd'] + df1.loc[:, 'a-d'] + df1.loc[:, 'c-d'] + df1.loc[:, 'a-c-d']
# drop the combined columns so only ID, a, b, c, d remain
df1 = df1.drop(columns=['a-d', 'c-d', 'a-c-d'])
Output:
   ID  a  b  c  d
0   1  1  0  1  1
1   2  1  0  1  1
2   3  0  1  1  1
3   4  2  0  1  2
4   5  0  0  1  1
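A more general version of the same idea (a sketch; base_cols is my own assumed list) matches each letter against the dash-separated tokens of the column names, which avoids the accidental substring matches that filter(like=...) could produce if base column names had more than one letter:
base_cols = ["a", "b", "c", "d"]
out = df1[["ID"]].copy()
for letter in base_cols:
    # sum every column whose dash-separated name contains this exact token
    cols = [c for c in df1.columns if letter in c.split("-")]
    out[letter] = df1[cols].sum(axis=1)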

Insert values into columns without NaN

I'm trying to calculate counts of some values in a data frame like:
user_id  event_type
1        a
1        a
1        b
2        a
2        b
2        c
and I want to get a table like:
user_id  event_type  event_type_a  event_type_b  event_type_c
1        a           2             1             0
1        a           2             1             0
1        b           2             1             0
2        a           1             1             1
2        b           1             1             1
2        c           1             1             1
I've tried code like:
df[' event_type_a'] = df['user_id', 'event_type'].where(df['event_type']=='a').groupby([user_id]).count()
and got a table like:
user_id  count_a
1        2
2        1
How should I insert these values into the original df, filling all rows without NaN items?
Maybe there is a method like, for example, "insert into df_1['column'] from df_2['column'] where df_1['user_id'] == df_2['user_id']".
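For reference, the input frame can be built like this (a minimal sketch):
import pandas as pd

df = pd.DataFrame({"user_id":    [1, 1, 1, 2, 2, 2],
                   "event_type": ["a", "a", "b", "a", "b", "c"]})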
Use crosstab with add_prefix for the new column names, then join:
df2 = pd.crosstab(df['user_id'],df['event_type'])
#alternatives
#df2 = df.groupby(['user_id','event_type']).size().unstack(fill_value=0)
#df2 = df.pivot_table(index='user_id', columns='event_type', fill_value=0, aggfunc='size')
df = df.join(df2.add_prefix('event_type_'), on='user_id')
print (df)
   user_id event_type  event_type_a  event_type_b  event_type_c
0        1          a             2             1             0
1        1          a             2             1             0
2        1          b             2             1             0
3        2          a             1             1             1
4        2          b             1             1             1
5        2          c             1             1             1
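For reference, the intermediate df2 produced by crosstab (computed from the sample data) is:
event_type  a  b  c
user_id
1           2  1  0
2           1  1  1
join with on='user_id' then aligns these per-user counts to every row of the original frame.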
Here is another way of getting df2, as Jez mentioned, but slightly different: since I use transform and do not aggregate, df2 has the same length as the original df.
df2 = df.set_index('user_id').event_type.str.get_dummies().groupby(level=0).transform('sum')
df2
Out[11]:
         a  b  c
user_id
1        2  1  0
1        2  1  0
1        2  1  0
2        1  1  1
2        1  1  1
2        1  1  1
Then use concat:
df2.index=df.index
pd.concat([df,df2],axis=1)
Out[19]:
   user_id event_type  a  b  c
0        1          a  2  1  0
1        1          a  2  1  0
2        1          b  2  1  0
3        2          a  1  1  1
4        2          b  1  1  1
5        2          c  1  1  1

How to apply cumulative count on multiple columns of a dataframe

Dataframe:
   a  b  c
0  0  1  1
1  0  1  1
2  0  0  1
3  0  0  1
4  1  1  0
5  1  1  1
6  1  1  1
7  0  0  1
I am trying to apply a cumulative count (cumcount) to multiple columns of the dataframe; I have tried applying the cumulative count by grouping each column separately. Is there an easy way to achieve the expected output?
I have tried this code, but it is not working:
li = []
for column in df.columns:
    li.append(df.groupby(column)[column].cumcount())
pd.concat(li, axis=1)
Expected output:
   a  b  c
0  1  1  1
1  1  2  2
2  1  1  3
3  1  1  4
4  1  1  1
5  2  2  1
6  3  3  2
7  1  1  3
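For reference, a minimal construction of the input frame:
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 0, 0, 1, 1, 1, 0],
                   "b": [1, 1, 0, 0, 1, 1, 1, 0],
                   "c": [1, 1, 1, 1, 0, 1, 1, 1]})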
Create consecutive groups by comparing with the shifted values, apply cumcount to each column, and finally set 1 where the original value is 0 via a boolean mask:
df = (df.ne(df.shift()).cumsum()
        .apply(lambda x: df.groupby(x).cumcount() + 1)
        .mask(df == 0, 1))
print (df)
   a  b  c
0  1  1  1
1  1  2  2
2  1  1  3
3  1  1  4
4  1  1  1
5  2  2  1
6  3  3  2
7  1  1  3
Another solution, if performance is important: count only the 1 values, and finally set 1 everywhere else with np.where:
a = df == 1
b = a.cumsum()
arr = np.where(a, b - b.mask(a).ffill().fillna(0).astype(int), 1)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df)
   a  b  c
0  1  1  1
1  1  2  2
2  1  1  3
3  1  1  4
4  1  1  1
5  2  2  1
6  3  3  2
7  1  1  3

return rows with unique pairs across columns

I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A  B
1  0
2  0
3  0
0  1
2  1
3  1
0  2
1  2
3  2
0  3
1  3
2  3
will be reduced to only the rows whose pair does not reappear with the columns flipped. For instance, (1, 3) is a combination I only want returned once; if the same pair exists with the columns flipped, (3, 1), it can be removed. The table I'm looking to get is:
A  B
0  2
0  3
1  0
1  2
1  3
2  3
Where there is only one occurrence of each pair of values that are mirrored if the columns are flipped.
I think you can use apply with sorted + drop_duplicates:
df = df.apply(sorted, axis=1).drop_duplicates()
# note: in recent pandas versions this returns a Series of lists rather than
# a DataFrame; the numpy solution below avoids that issue and is faster
print (df)
   A  B
0  0  1
1  0  2
2  0  3
4  1  2
5  1  3
8  2  3
Faster solution with numpy.sort:
df = pd.DataFrame(np.sort(df.values, axis=1),
                  index=df.index, columns=df.columns).drop_duplicates()
print (df)
   A  B
0  0  1
1  0  2
2  0  3
4  1  2
5  1  3
8  2  3
A solution without sorting, using DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
   A  B
0  0  1
1  0  2
2  0  3
4  1  2
5  1  3
8  2  3
Loading the data:
import numpy as np
import pandas as pd

# the strings are space-separated, so split() must not look for "\t"
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a, B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
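To match the other answers' output you can then drop the helper column (a small addition, not part of the original answer):
out = df.drop_duplicates("trans").drop(columns="trans")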
