I am trying to create a "two-entry table" (a cross table) from many columns in my df. I tried pivot_table / crosstab / groupby, but the results of these functions are not what I need, since they do not form a "two-entry table".
For example, if I have a dataframe like this:
df
A B C D E
1 0 0 1 1
0 1 0 1 0
1 1 1 1 1
I would like to transform my df into one that reads like a "two-entry table":
A B C D E
A 2 1 1 2 2
B 1 2 1 2 1
C 1 1 1 1 1
D 2 2 1 3 2
E 2 1 1 2 2
To explain the first row: A-A = 2 because column A has two 1s; A-B = 1 because A and B share a 1 only in the third row; A-C = 1 because they share a 1 only in the third row; A-D = 2 because they share 1s in the first and third rows; and A-E = 2 because they share 1s in the first and third rows.
Use pd.DataFrame.dot with T:
df.T.dot(df)  # or: df.T @ df
Output:
A B C D E
A 2 1 1 2 2
B 1 2 1 2 1
C 1 1 1 1 1
D 2 2 1 3 2
E 2 1 1 2 2
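As a runnable sketch, rebuilding the question's frame: for 0/1 columns, `X.T @ X` counts, for each pair of columns, the rows where both hold a 1.

```python
import pandas as pd

# rebuild the 0/1 frame from the question
df = pd.DataFrame({"A": [1, 0, 1], "B": [0, 1, 1], "C": [0, 0, 1],
                   "D": [1, 1, 1], "E": [1, 0, 1]})

# the dot product of two 0/1 columns is the number of rows
# where both columns are 1, so X.T @ X is the co-occurrence table
co = df.T.dot(df)  # equivalently: df.T @ df
print(co)
```

The result is symmetric, with each column's own count of 1s on the diagonal.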
I have a dataframe df which looks like this, where the OUTPUT column is to be calculated:
ID input OUTPUT
1 A,B 1
1 B,C,D 2
1 C 1
2 E,f 1
2 A,B,C 3
3 E 0
Can anyone please help me calculate the OUTPUT column based on the input values? Whenever the ID changes, the comparison resets: the last element of the previous row is not compared, even if it is different.
In the first row the output is 1 because there is one change, from A to B.
In the second row there are two changes, B to C and C to D.
In the third row it is 1 because the last element of the previous list was D, and D to C is one change.
When the ID changes we do not compare with the previous row, so for E,f there is only one change, E to f.
Here's one approach:
# get the last element of the previous row within each ID
prev_row = df.groupby('ID').input.shift().str.split(',').str[-1]
# prepend it to the current row, build a set and
# count the number of distinct elements, minus one
df['OUTPUT'] = (prev_row.str.cat(df.input, sep=',')
                        .fillna(df.input)
                        .str.split(',')
                        .map(set)
                        .str.len()
                        .sub(1))
print(df)
ID input OUTPUT
0 1 A,B 1
1 1 B,C,D 2
2 1 C 1
3 2 E,f 1
4 2 A,B,C 3
5 3 E 0
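For reference, this approach can be run end-to-end as a self-contained sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 1, 2, 2, 3],
                   "input": ["A,B", "B,C,D", "C", "E,f", "A,B,C", "E"]})

# last element of the previous row, restarting at every ID change
prev_row = df.groupby("ID")["input"].shift().str.split(",").str[-1]

# prepend it to the current row, then count distinct elements minus one
df["OUTPUT"] = (prev_row.str.cat(df["input"], sep=",")
                        .fillna(df["input"])
                        .str.split(",")
                        .map(set)
                        .str.len()
                        .sub(1))
print(df)
```

Note that `str.cat` with a NaN on the left yields NaN, which `fillna` then replaces with the row's own input, so the first row of each ID is counted on its own.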
I appended a new row to the test data for testing; see the output below:
import numpy as np
import pandas as pd

# flag rows where the ID changes
df["idchng"] = df.ID.diff().ge(1)
# last element of the previous row's input
df["lastch"] = df.input.str.rpartition(",")[2].shift()
print(df, "\n")

# prepend the previous row's last element unless the ID changed
df["inp2"] = np.where(df.idchng, df.input, df.lastch.str.cat(df.input, sep=","))
df.inp2.iat[0] = df.input.iat[0]  # first row has no predecessor

def diffstr(s):
    # count positions where an element differs from its predecessor
    ser = pd.Series(s.split(","))
    return ser.ne(ser.shift()).sum() - 1

df["RSLT"] = df.inp2.map(diffstr)
df = df.drop(columns=["inp2", "lastch", "idchng"])
print(df, "\n")
Outputs:
# test data:
ID input OUTPUT
0 1 A,B 1
1 1 B,C,D 2
2 1 C 1
3 2 E,f 1
4 2 A,B,C 3
5 3 E 0
6 4 A,A,B,A,C,D,A,E 6
ID input OUTPUT idchng lastch
0 1 A,B 1 False NaN
1 1 B,C,D 2 False B
2 1 C 1 False D
3 2 E,f 1 True C
4 2 A,B,C 3 False f
5 3 E 0 True C
6 4 A,A,B,A,C,D,A,E 6 True E
ID input OUTPUT RSLT
0 1 A,B 1 1
1 1 B,C,D 2 2
2 1 C 1 1
3 2 E,f 1 1
4 2 A,B,C 3 3
5 3 E 0 0
6 4 A,A,B,A,C,D,A,E 6 6
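The ne/shift comparison above correctly handles repeated consecutive elements (e.g. the appended A,A,B,A,C,D,A,E row). A compact variant along the same lines, counting adjacent changes directly with a plain Python helper, might look like this (a sketch, not from either answer):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 1, 2, 2, 3, 4],
                   "input": ["A,B", "B,C,D", "C", "E,f", "A,B,C", "E",
                             "A,A,B,A,C,D,A,E"]})

# last element of the previous row, restarting at every ID change
prev = df.groupby("ID")["input"].shift().str.split(",").str[-1]

def count_changes(s):
    # number of adjacent positions whose elements differ
    parts = s.split(",")
    return sum(a != b for a, b in zip(parts, parts[1:]))

# prepend the carried-over element (empty for the first row of an ID),
# then count element-to-element changes
df["OUTPUT"] = (prev.fillna("")
                    .str.cat(df["input"], sep=",")
                    .str.lstrip(",")
                    .map(count_changes))
print(df)
```

Unlike the set-based answer, this counts every change in sequences such as A,B,A, where the set size would undercount.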
df I have:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
df I want:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
I am able to get df want by using:
df.loc['d']=df.loc['b']-df.loc['a']
However, my actual df has 'a','b','c' rows for multiple IDs 'X', 'Y' etc.
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
How can I create the same output with multiple IDs?
My original method:
df.loc['d'] = df.loc['b'] - df.loc['a']
fails with KeyError: 'b'.
Desired output:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
IIUC,
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]
print(df.sort_index())
Or maybe
k = (df.groupby(df.index.get_level_values(0), as_index=False)
       .apply(lambda s: pd.DataFrame([s.loc[(s.name, 'b')].values - s.loc[(s.name, 'a')].values],
                                     columns=s.columns,
                                     index=pd.MultiIndex(levels=[[s.name], ['d']],
                                                         codes=[[0], [0]])))
       .reset_index(drop=True, level=0))
pd.concat([k, df]).sort_index()
Data reshaping is a useful trick if you want to manipulate a particular level of a MultiIndex. See the code below:
result = (df.unstack(0).T
.assign(d=lambda x:x.b-x.a)
.stack()
.unstack(0))
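Run end-to-end on the question's data (a sketch, with the same index labels), this reshaping approach works as follows:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["X", "Y"], ["a", "b", "c"]])
df = pd.DataFrame([[1, 2, 3], [2, 1, 4], [1, 1, 1],
                   [1, 2, 3], [2, 2, 4], [1, 1, 1]],
                  index=idx, columns=["A", "B", "C"])

result = (df.unstack(0).T                    # index: (column, ID), columns: a, b, c
            .assign(d=lambda x: x.b - x.a)   # the new label 'd' is now just a column
            .stack()                         # back to a long Series
            .unstack(0))                     # restore A/B/C as columns

print(result)
```

The trick is that after `unstack(0).T`, the inner index level (a, b, c) becomes ordinary columns, so `assign` can compute `d` in one vectorized step for every ID at once.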
Use pd.IndexSlice to slice a and b. Call diff, slice on b, and rename it to d. Finally, append it to the original df:
idx = pd.IndexSlice
df1 = df.loc[idx[:, ['a', 'b']], :].diff().loc[idx[:, 'b'], :].rename({'b': 'd'})
# note: DataFrame.append was removed in pandas 2.0; use pd.concat([df, df1]) there
df2 = df.append(df1).sort_index().astype(int)
Out[106]:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
I'm trying to calculate counts of some values in a dataframe like
user_id event_type
1 a
1 a
1 b
2 a
2 b
2 c
and I want to get a table like
user_id event_type event_type_a event_type_b event_type_c
1 a 2 1 0
1 a 2 1 0
1 b 2 1 0
2 a 1 1 1
2 b 1 1 1
2 c 1 1 1
I've tried code like
df['event_type_a'] = df[['user_id', 'event_type']].where(df['event_type'] == 'a').groupby('user_id').count()
and got a table like
user_id count_a
1 2
2 1
How should I insert these values into the original df, filling all rows without NaN items?
Maybe there is a method like, for example, "insert into df_1['column'] from df_2['column'] where df_1['user_id'] == df_2['user_id']"
Use crosstab with add_prefix for the new column names, then join:
df2 = pd.crosstab(df['user_id'],df['event_type'])
#alternatives
#df2 = df.groupby(['user_id','event_type']).size().unstack(fill_value=0)
#df2 = df.pivot_table(index='user_id', columns='event_type', fill_value=0, aggfunc='size')
df = df.join(df2.add_prefix('event_type_'), on='user_id')
print (df)
user_id event_type event_type_a event_type_b event_type_c
0 1 a 2 1 0
1 1 a 2 1 0
2 1 b 2 1 0
3 2 a 1 1 1
4 2 b 1 1 1
5 2 c 1 1 1
Here is another way of getting the df2 Jez mentioned, but slightly different: since I use transform rather than an aggregation, df2 has the same length as the original df.
df2= df.set_index('user_id').event_type.str.get_dummies().groupby(level=0).transform('sum')
df2
Out[11]:
a b c
user_id
1 2 1 0
1 2 1 0
1 2 1 0
2 1 1 1
2 1 1 1
2 1 1 1
Then use concat:
df2.index=df.index
pd.concat([df,df2],axis=1)
Out[19]:
user_id event_type a b c
0 1 a 2 1 0
1 1 a 2 1 0
2 1 b 2 1 0
3 2 a 1 1 1
4 2 b 1 1 1
5 2 c 1 1 1
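Putting the crosstab answer together as a self-contained sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 2],
                   "event_type": ["a", "a", "b", "a", "b", "c"]})

# per-user counts of each event type, one column per type
counts = pd.crosstab(df["user_id"], df["event_type"])

# join broadcasts each user's counts onto every one of that user's rows,
# so no NaN values appear
df = df.join(counts.add_prefix("event_type_"), on="user_id")
print(df)
```

The `on='user_id'` argument is what makes `join` match on the column rather than the row index, which is exactly the "insert ... where user_id matches" behavior the question asks for.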
I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A B
1 0
2 0
3 0
0 1
2 1
3 1
0 2
1 2
3 2
0 3
1 3
2 3
will be reduced to only one row per unordered pair. For instance, 1 and 3 is a combination I only want returned once; if the same pair exists with the columns flipped (3 and 1), it can be removed. The table I'm looking for is:
A B
0 2
0 3
1 0
1 2
1 3
2 3
Where there is only one occurrence of each pair of values that are mirrored if the columns are flipped.
I think you can use apply sorted + drop_duplicates:
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Faster solution with numpy.sort:
df = pd.DataFrame(np.sort(df.values, axis=1),
                  index=df.index, columns=df.columns).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Solution without sorting with DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Loading the data:
import numpy as np
import pandas as pd
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a,B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
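A related variant (not from the answers above): if you want to keep the first occurrence of each pair with its original (A, B) orientation rather than the sorted values, compute the sorted version only as a duplicate-detection key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2],
                   "B": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]})

# sort each row only to build a key for duplicate detection,
# leaving the original column values untouched in the output
key = pd.DataFrame(np.sort(df.values, axis=1), index=df.index)
out = df[~key.duplicated()]
print(out)
```

Here `key.duplicated()` marks every row whose sorted pair was already seen, so the boolean mask keeps exactly the first appearance of each unordered pair.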
I want to create a new column containing the column name of the max value in each row. A tie should include both column names.
A B C D
TRDNumber
ALB2008081610 3 1 1 1
ALB200808167 1 3 4 1
ALB200808168 3 1 3 1
ALB200808171 2 2 5 1
ALB2008081710 1 2 2 5
Desired output
A B C D Best
TRDNumber
ALB2008081610 3 1 1 1 A
ALB200808167 1 3 4 1 C
ALB200808168 3 1 3 1 A,C
ALB200808171 2 2 5 1 C
ALB2008081710 1 2 2 5 D
I have tried the following code
df.groupby(['TRDNumber'])[cols].max()
you can do:
>>> f = lambda r: ','.join(df.columns[r])
>>> df.eq(df.max(axis=1), axis=0).apply(f, axis=1)
TRDNumber
ALB2008081610 A
ALB200808167 C
ALB200808168 A,C
ALB200808171 C
ALB2008081710 D
dtype: object
>>> df['best'] = _
>>> df
A B C D best
TRDNumber
ALB2008081610 3 1 1 1 A
ALB200808167 1 3 4 1 C
ALB200808168 3 1 3 1 A,C
ALB200808171 2 2 5 1 C
ALB2008081710 1 2 2 5 D
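The answer's approach as a self-contained sketch, rebuilding the question's frame (note that `df.idxmax(axis=1)` alone would return only the first max, dropping ties such as A,C):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [3, 1, 3, 2, 1], "B": [1, 3, 1, 2, 2],
     "C": [1, 4, 3, 5, 2], "D": [1, 1, 1, 1, 5]},
    index=pd.Index(["ALB2008081610", "ALB200808167", "ALB200808168",
                    "ALB200808171", "ALB2008081710"], name="TRDNumber"))

# True wherever a cell equals its row's maximum (ties give several Trues)
is_max = df.eq(df.max(axis=1), axis=0)

# join the matching column names per row
df["Best"] = is_max.apply(lambda r: ",".join(is_max.columns[r]), axis=1)
print(df)
```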