I've seemingly simple problem, based on condition e.g. that value in dataframe is smaller than two, change value to 1, in opposite case to 0. Kind of "if-else".
Toy exmample, input:
a b
0 1 -5
1 2 0
2 3 10
Output:
a b
0 1 1
1 0 1
2 0 0
Here is my solution:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1,2,3], 'b': [-5, 0, 10]})
arr = np.where(df < 2, 1, 0)
df_fin = pd.DataFrame(data=arr, index=df.index, columns=df.columns)
I don't like direct dependency on numpy and it also a little looks verbose to me. Could it be done in more cleaner, idiomatic way?
General solutions:
Pandas is built in numpy, so in my opinion only need import. Here is possible set values in df[:]
import numpy as np
df[:] = np.where(df < 2, 1, 0)
print (df)
a b
0 1 1
1 0 1
2 0 0
A bit overcomplicated if use only pandas functions:
m = df < 2
df = df.mask(m, 1).where(m, 0)
Replace to 0,1 solution:
Convert mask for map True to 1 and False to 0 by DataFrame.view or like in another answer:
df = (df < 2).view('i1')
Pandas' replace might be handy here:
df.lt(2).replace({False : 0, True: 1})
Out[7]:
a b
0 1 1
1 0 1
2 0 0
or you just convert the booleans to integers:
df.lt(2).astype(int)
Out[9]:
a b
0 1 1
1 0 1
2 0 0
Related
I am trying to add another column based on the value of two columns. Here is the mini version of my dataframe.
data = {'current_pair': ['"["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]"', '"["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]"', '"["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]"','"["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]"', '"["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]"'],
'B': [1, 0, 1, 1, 0]
}
df = pd.DataFrame(data)
df
current_pair B
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0
I want the result to be:
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
I used the numpy select commands:
conditions=[(data['B']==1 & data['current_pair'].str.contains('Emo/', na=False)),
(data['B']==1 & data['current_pair'].str.contains('Neu/', na=False)),
data['B']==0]
choices = [0, 1, 2]
data['C'] = np.select(conditions, choices, default=np.nan)
Unfortunately, it gives me this dataframe without recognizing anything with "1" in column "C".
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 0
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 0
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
Any help counts! thanks a lot.
There is problem with () after ==1 for precedence of operators:
conditions=[(data['B']==1) & data['current_pair'].str.contains('Emo/', na=False),
(data['B']==1) & data['current_pair'].str.contains('Neu/', na=False),
data['B']==0]
I think some logic went wrong here; this works:
df.assign(C=np.select([df.B==0, df.current_pair.str.contains('Emo/'), df.current_pair.str.contains('Neu/')], [2,0,1]))
Here is a slightly more generalized suggestion, easily applicable to more complex cases. You should, however mind execution speed:
import pandas as pd
df = pd.DataFrame({'col_1': ['Abc', 'Xcd', 'Afs', 'Xtf', 'Aky'], 'col_2': [1, 2, 3, 4, 5]})
def someLogic(col_1, col_2):
if 'A' in col_1 and col_2 == 1:
return 111
elif "X" in col_1 and col_2 == 4:
return 999
return 888
df['NewCol'] = df.apply(lambda row: someLogic(row.col_1, row.col_2), axis=1, result_type="expand")
print(df)
Why can't I chain the get_dummies() function?
import pandas as pd
df = (pd
.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
.drop(columns=['sepal_length'])
.get_dummies()
)
This works fine:
df = (pd
.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
.drop(columns=['sepal_length'])
)
df = pd.get_dummies(df)
DataFrame.pipe can be helpful in chaining methods or function calls which are not natively attached to the DataFrame, like pd.get_dummies:
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies)
Or with lambda:
df = (
df.drop(columns=['sepal_length'])
.pipe(lambda current_df: pd.get_dummies(current_df))
)
Sample DataFrame:
df = pd.DataFrame({'sepal_length': 1, 'a': list('ABACC'), 'b': list('ACCAB')})
df:
sepal_length a b
0 1 A A
1 1 B C
2 1 A C
3 1 C A
4 1 C B
Sample Output:
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies)
df:
a_A a_B a_C b_A b_B b_C
0 1 0 0 1 0 0
1 0 1 0 0 0 1
2 1 0 0 0 0 1
3 0 0 1 1 0 0
4 0 0 1 0 1 0
You can't chain the pd.get_dummies() method since it is not a pd.DataFrame method. However, assuming -
You have a single column left after you drop your columns in the previous step in the chain.
Your column is a string column dtype.
... you can use pd.Series.str.get_dummies() which is a series level method.
### Dummy Dataframe
# A B
# 0 1 x
# 1 2 y
# 2 3 z
pd.read_csv(path).drop(columns=['A'])['B'].str.get_dummies()
x y z
0 1 0 0
1 0 1 0
2 0 0 1
NOTE: Make sure that before you call the get_dummies() method, the data type of the object is series. In this case, I fetch column ['B'] to do that, which kinda makes the previous pd.DataFrame.drop() method unnecessary and useless :)
But this is only for example's sake.
When one stores the groupby object before calling apply, the index is dropped somewhere. How can this happen?
MWE:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 0, 0], 'b': list(range(4))})
df.groupby('a').apply(lambda x: x)
a b
0 1 0
1 1 1
2 0 2
3 0 3
dfg = df.groupby('a')
dfg.apply(lambda x: x)
b
0 0
1 1
2 2
3 3
EDIT:
I was on pandas 0.23.2 but this bus is not reproducible with pandas 0.24.x. So upgrading is a solution.
convert this series or dataframe in pandas python
1 2 3
to
1 0 0
0 2 0
0 0 3
how to do it in pandas way. this can be achieved by loops and something like it
you can use numpy to do this.
the following code will work
import numpy as np
a = np.zeros((3, 3), int)
np.fill_diagonal(a, [1,2,3])
print(a)
Output:
array([[1, 0, 0],
[0, 2, 0],
[0, 0, 3]])
To convert it into dataframe, just do the following
import pandas as pd
d = pd.DataFrame(a)
print(d)
0 1 2
0 1 0 0
1 0 2 0
2 0 0 3
I want to process a pandas dataframe with rank-hot encoding instead of one-hot encoding.
For example take this pandas dataframe:
df = pd.DataFrame([[1,2],[3,2],[2,2]], columns=['colA', 'colB'])
print(df)
>> colA colB
0 1 2
1 3 0
2 2 3
How it should look in the end:
print(df)
>> colA_0 colA_1 colA_2 colA_3 colB_0 colB_1 colB_2 colB_3
0 1 1 0 0 1 1 1 0
1 1 1 1 1 1 0 0 0
2 1 1 1 0 1 1 1 1
This worked on small dataFrames:
def rankHotEncode(row):
newFeatures = {}
for i, v in row.iteritems():
for k in range(MULTIPLYFEATURES):
newFeatures[i + repr(k)] = 1 if v >= k else 0
return pd.Series(newFeatures)
df.apply(rankHotEncode, axis=1)
The solution should not be hardcoded and efficient for order ~100.000 rows.
How can I improve the provided solution to make it more efficient or what is the best way to do this?
You can use scikit-learn oneHotEncoder with numpy.cumsum. While it involves some copies, it is quite efficient as it does not deal with the matrix row by row. Here is a sample code using it.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2],[3,0],[2,3]], columns=['colA', 'colB'])
print(df)
n_values = df.max().values + 1
enc = OneHotEncoder(sparse=False, n_values=n_values, dtype=int)
enc.fit(df)
encoded_columns = [
'{}_{}'.format(col_name, i)
for col_name, n_value in zip(df.columns, n_values)
for i in range(n_value)
]
one_hot = enc.transform(df)
rank_hot = np.zeros_like(one_hot)
for col_start, col_end in zip(enc.feature_indices_[:-1], enc.feature_indices_[1:]):
one_hot_col_reversed = one_hot[:, col_start: col_end][:, ::-1]
rank_hot[:, col_start: col_end] = np.cumsum(one_hot_col_reversed, axis=1)[:, ::-1]
encoded_df = pd.DataFrame(rank_hot, columns=encoded_columns)
It outputs for your example
print(encoded_df)
>> colA_0 colA_1 colA_2 colA_3 colB_0 colB_1 colB_2 colB_3
0 1 1 0 0 1 1 1 0
1 1 1 1 1 1 0 0 0
2 1 1 1 0 1 1 1 1