I'm a complete newbie at pandas so a simpler (though maybe not the most efficient or elegant) solution is appreciated. I don't mind a bit of brute force if I can understand the answer better.
If I have the following DataFrame:

   A  B  C
0  0  0  1
1  0  1  1
I want to loop through columns "A", "B" and "C" in that order. On each iteration, I want to select all the rows for which the current column is 1 and none of the previous columns are, save the result, and also use it in the next iteration.
So when looking at column A, I wouldn't select anything. When looking at column B, I would select the second row because B==1 and A==0. When looking at column C, I would select the first row because C==1 while A==0 and B==0.
Create a boolean mask:
# True where the cell is 1 and it is the first 1 in its row
# (the row-wise cumulative sum is still 1 at that cell)
m = (df == 1) & (df.cumsum(axis=1) == 1)

# column -> row indices selected for that column (skipping empty columns)
d = {col: df[m[col]].index.tolist() for col in df.columns if m[col].sum()}
Output:
>>> m
       A      B      C
0  False  False   True
1  False   True  False
2  False  False   True

>>> d
{'B': [1], 'C': [0, 2]}
I slightly modified your dataframe:
>>> df
   A  B  C
0  0  0  1
1  0  1  1
2  0  0  1
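To see why the cumsum trick works, look at the intermediate row-wise cumulative sum (shown for the modified df above): a cell is kept exactly when it holds a 1 and the running sum at that cell is still 1, i.e. no earlier column in that row was 1.

>>> df.cumsum(axis=1)
   A  B  C
0  0  0  1
1  0  1  2
2  0  0  1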
Update
For the expected output on my sample:
for col in df.columns:
    if m[col].sum():
        print(f"\n=== {col} ===")
        print(df[m[col]])
Output:
=== B ===
   A  B  C
1  0  1  1

=== C ===
   A  B  C
0  0  0  1
2  0  0  1
It seems like you need a direct use of idxmax:
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
>>> df.idxmax()
A    0
B    1
C    0
dtype: int64
The values above are the indexes for which your constraints are met: 1 for B means the second row was "selected", and likewise 0 for C means the first row was. The only issue is that, if nothing is found, it will also return 0.
To address that, you can use where:

>>> df.idxmax().where(~df.eq(0).all())
A    NaN
B    1.0
C    0.0
dtype: float64

This makes sure that NaN is returned for all-zero columns.
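If you prefer a clean integer result without the all-zero columns, you can drop the NaNs afterwards (a small sketch on the same df):

>>> df.idxmax().where(~df.eq(0).all()).dropna().astype(int)
B    1
C    0
dtype: int64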
Related
I have a huge dataset with more than 100 columns that contain non-null values I want to replace (leaving all the null values as is). Some columns, however, should stay untouched.
I am planning to do the following:
1) find unique values in these columns
2) replace these values with 1
Problem:
1) something like this is hardly usable for 100+ columns:
np.unique(df[['Col1', 'Col2']].values)
2) how do I then apply loc to all these columns? The code below does not work:
df_2.loc[df_2[['col1','col2','col3']] !=0, ['col1','col2','col3']] = 1
Maybe there is a more reasonable and elegant way to solve the problem. Thanks!
Use DataFrame.mask:
c = ['col1', 'col2', 'col3']
# replace values where the condition (!= 0) holds, leave the rest untouched
df_2[c] = df_2[c].mask(df_2[c] != 0, 1)
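Note that mask is the mirror image of where: it replaces values where the condition is True. A minimal sketch on a throwaway Series:

s = pd.Series([0, 5, 0])
s.mask(s != 0, 1)  # -> 0, 1, 0

One caveat for the original requirement of leaving nulls untouched: NaN != 0 evaluates to True, so NaNs would be replaced too. If they must survive, add a notna() guard, e.g. df_2[c].mask(df_2[c].ne(0) & df_2[c].notna(), 1).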
Or compare by not-equal with DataFrame.ne and cast the boolean mask to integers with DataFrame.astype:
df_2 = pd.DataFrame({
'A':list('abcdef'),
'col1':[0,5,0,5,5,0],
'col2':[7,8,9,0,2,0],
'col3':[0,0,5,7,0,0],
'E':[5,0,6,9,2,0],
})
c = ['col1','col2','col3']
df_2[c] = df_2[c].ne(0).astype(int)
print(df_2)

   A  col1  col2  col3  E
0  a     0     1     0  5
1  b     1     1     0  0
2  c     0     1     1  6
3  d     1     0     1  9
4  e     1     1     0  2
5  f     0     0     0  0
EDIT: To select columns by position, use DataFrame.iloc:

import numpy as np

idx = np.r_[6:71, 82]
df_2.iloc[:, idx] = df_2.iloc[:, idx].ne(0).astype(int)
Or with the first solution:

df_2.iloc[:, idx] = df_2.iloc[:, idx].mask(df_2.iloc[:, idx] != 0, 1)
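For reference, np.r_ concatenates slices and scalars into a single index array, so np.r_[6:71, 82] targets columns 6 through 70 plus column 82:

>>> np.r_[2:5, 7]
array([2, 3, 4, 7])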
I have a dataframe:
id  to  from  flag
 1   a     x     1
 1   a     y     0
 2   c     z     1
 2   c     m     1
 2   b     v     0
 2   b     p     0
and I want to groupby(['id', 'to']) and return a list of the elements in from that have a flag 1 only. If no element has a flag 1, then the resulting output should be 'None'. The desired output should be:
id  to  from
 1   a  ['x']
 2   c  ['z','m']
 2   b  None
I can do it with apply, i.e.

out_df = df.groupby(['id', 'to']).apply(
    lambda x: match_to_list(x['from'], x['flag'])).reset_index()
where:
def match_to_list(to, flag):
    matches = list(to.iloc[flag.nonzero()[0]])
    if len(matches) == 0:
        return 'None'
    else:
        return matches
but this is taking too long and I think there must be a better way that I am missing.
Any help/insights would be much appreciated! TIA
IIUC, first create the full (id, to) index with a MultiIndex, then do groupby with agg and reindex:
idx = pd.MultiIndex.from_tuples(
    list(map(tuple, df[['id', 'to']].drop_duplicates().values)))

yourdf = (df.loc[df.flag == 1]
            .groupby(['id', 'to'])['from']
            .agg(list)
            .reindex(idx)
            .reset_index())
yourdf
Out[13]:
  level_0 level_1    from
0       1       a     [x]
1       2       c  [z, m]
2       2       b     NaN
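To get columns named id and to instead of level_0 and level_1 after reset_index, name the MultiIndex levels first (a small tweak, same data):

idx.names = ['id', 'to']
yourdf = df.loc[df.flag == 1].groupby(['id', 'to'])['from'].agg(list).reindex(idx).reset_index()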
Or just using apply, less efficient but more readable:

df.groupby(['id', 'to']).apply(
    lambda x: x['from'][x['flag'] == 1].tolist() if (x['flag'] == 1).any() else None
).reset_index()
Out[17]:
   id to       0
0   1  a     [x]
1   2  b    None
2   2  c  [z, m]
I have this table1:
   A  B  C  D
0  1  2  k  l
1  3  4  e  r
df.dtypes gets me this:
A     int64
B     int64
C    object
D    object
dtype: object
Now, I want to create a table2 which only includes objects (column C and D) using this command table2=df.select_dtypes(include=[object]).
Then, I want to encode table2 using this command: pd.get_dummies(table2).
It gives me this table2:
   C  D
0  0  1
1  1  0
The last thing I want to do is append both tables together (table 1 + table 2), so that the final table looks like this:
   A  B  C  D
0  1  2  0  1
1  3  4  1  0
Can somebody help?
This should do it:
table2 = df.select_dtypes(include=[object])
df.select_dtypes(include=[int]).join(table2.apply(lambda x: pd.factorize(x, sort=True)[0]))

It first factorizes the object-typed columns of table2 (instead of using the dummies generator) and then joins the result back to the int-typed columns of the original dataframe.
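A quick check on the sample frame (note that with sort=True the codes follow alphabetical order, so they may not match a hand-made encoding exactly):

import pandas as pd

df = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'C': ['k', 'e'], 'D': ['l', 'r']})
table2 = df.select_dtypes(include=[object])
df.select_dtypes(include=[int]).join(table2.apply(lambda x: pd.factorize(x, sort=True)[0]))

   A  B  C  D
0  1  2  1  0
1  3  4  0  1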
Assuming what you're trying to do is have a 1 in column C replace the value e, and a 1 in column D replace the value l. Otherwise, as mentioned elsewhere, there will be one column per possible value.
df = pd.DataFrame({'A': [1,2], 'B': [2,4], 'C': ['k','e'], 'D': ['l','r']})
df

   A  B  C  D
0  1  2  k  l
1  2  4  e  r
df.dtypes

A     int64
B     int64
C    object
D    object
dtype: object
Now, if you want to drop the e and l dummies so that each original column yields k-1 indicator columns for its k categories, you can use the drop_first argument.
df = pd.get_dummies(df, drop_first=True)
df

   A  B  C_k  D_r
0  1  2    1    0
1  2  4    0    1
Note that the dtypes are not int64 like columns A and B.
df.dtypes

A       int64
B       int64
C_k     uint8
D_r     uint8
dtype: object
If it's important that they are the same type, you can of course change them as appropriate. In the general case, you may want to keep names like C_k and D_r so you know what the dummies correspond to. If not, you can rename based on the '_' separator (the default of get_dummies' prefix_sep argument), splitting out the part of the column name before it.
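A general-case sketch (assuming every dummy name is prefix + '_' + value and no original column name contains '_'):

df.rename(columns=lambda c: c.split('_')[0], inplace=True)

Or, for a simple case like this: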
df.rename({'C_k': 'C', 'D_r': 'D'}, axis=1, inplace=True)
df

   A  B  C  D
0  1  2  1  0
1  2  4  0  1
Given this df:
Name  i    j  k
A     1    0  3
B     0    5  4
C     0    0  4
D     0  NaN  5
My goal is to add a column "Final" that takes the first value that is not 0, going through i, j, k in order:

Name  i    j  k  Final
A     1    0  3      1
B     0    5  4      5
C     0    0  4      4
D     0  NaN  5    NaN  <-- this one is tricky: the null in column j does count here
Here is my attempt: df['Final'] = df[['i', 'j', 'k']].bfill(axis=1).iloc[:, 0]. This doesn't work since it always takes the value of column i (bfill only fills NaN gaps, it doesn't skip zeros). Any help would be appreciated. :)
Many thanks!
If by "taking values in column order", you mean "taking the first non-zero value in each row, or zero if all values are zero", you could use DataFrame.lookup after doing a boolean comparison:
In [113]: df["final"] = df.lookup(df.index, (df[["i","j","k"]] != 0).idxmax(axis=1))
In [114]: df
Out[114]:
  Name  i    j  k  final
0    A  1  0.0  3    1.0
1    B  0  5.0  4    5.0
2    C  0  0.0  4    4.0
3    D  0  NaN  5    NaN
where first we compare everything with zero:
In [115]: df[["i","j","k"]] != 0
Out[115]:
       i      j     k
0   True  False  True
1  False   True  True
2  False  False  True
3  False   True  True
and then we use idxmax to find the first True (or the first False if you have a row of zeroes):
In [116]: (df[["i","j","k"]] != 0).idxmax(axis=1)
Out[116]:
0    i
1    j
2    k
3    j
dtype: object
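Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; the same row-wise pick can be done with plain numpy indexing (a sketch, assuming the same df):

import numpy as np

cols = ["i", "j", "k"]
picked = (df[cols] != 0).idxmax(axis=1)      # first non-zero column per row
pos = df[cols].columns.get_indexer(picked)   # its positional index
df["final"] = df[cols].to_numpy()[np.arange(len(df)), pos]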
Is this what you need?

df['Final'] = (df[['i', 'j', 'k']]
               .mask((df == '') | (df == 0))
               .bfill(axis=1)
               .iloc[:, 0][(df != '').all(1)])
df
Out[1290]:
  Name  i  j  k  Final
0    A  1  0  3    1.0
1    B  0  5  4    5.0
2    C  0  0  4    4.0
3    D  0     5    NaN
Using pandas.Series.nonzero, the solution can be expressed succinctly.
# assumes "Name" is the index, so every column of df is numeric
df['Final'] = df.apply(lambda x: x.iloc[x.nonzero()[0][0]], axis=1)
How this works:
nonzero() returns the indices of elements that are not zero (and will match np.nan as well).
We take the first index location and return the value at that location to construct the Final Column.
We apply this on the dataframe using axis=1 to apply it row by row.
A benefit of this approach is that it does not depend on naming individual columns ['i', 'j', 'k']
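Note that Series.nonzero was removed in pandas 1.0; the same idea still works by dropping to numpy first (a sketch, same assumption about the index):

df['Final'] = df.apply(lambda x: x.iloc[x.to_numpy().nonzero()[0][0]], axis=1)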
How do I check whether the column values in a pandas table are the same and put the result in a fourth column:
original

   red  blue  green
a    1     1      1
b    1     2      1
c    2     2      2
becomes:

   red  blue  green  match
a    1     1      1      1
b    1     2      1      0
c    2     2      2      1
Originally I only had 2 columns and it was possible to achieve something similar by doing this:
df['match']=df['blue']-df['red']
but this won't work with 3 columns.
Your help is greatly appreciated!
To make it more generic, compare row values with the apply method.
Using set()
In [54]: df['match'] = df.apply(lambda x: len(set(x)) == 1, axis=1).astype(int)
In [55]: df
Out[55]:
   red  blue  green  match
a    1     1      1      1
b    1     2      1      0
c    2     2      2      1
Alternatively, use pd.Series.nunique to count the unique values in each row.
In [56]: (df.apply(pd.Series.nunique, axis=1) == 1).astype(int)
Out[56]:
a    1
b    0
c    1
dtype: int32
Or, take the first column's values with df.iloc[:, 0] and match the whole frame against them with eq:
In [57]: df.eq(df.iloc[:, 0], axis=0).all(axis=1).astype(int)
Out[57]:
a    1
b    0
c    1
dtype: int32
You can try this:
df["match"] = df.apply(lambda x: int(x[0]==x[1]==x[2]), axis=1)
where:
x[0]==x[1]==x[2]: tests the equality of the first 3 columns
axis=1: row-wise, so the lambda receives one row at a time
Alternatively, you can call the column by their name too:
df["match"] = df.apply(lambda x: int(x["red"]==x["blue"]==x["green"]), axis=1)
This is more convenient if you have many columns and want to compare a subset of them without knowing their number.
If you want to compare all the columns, use John Galt's solution
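If you do want the subset version without chained equality tests, a vectorized sketch (assuming the columns to compare are listed explicitly):

cols = ['red', 'blue', 'green']
df['match'] = df[cols].nunique(axis=1).eq(1).astype(int)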