Flip-flop mechanism in pandas - Python

I am looking for a flip-flop type mechanism in pandas.
data3 is the output I'd like to create:
data3 should become 1 when data1 is 1, and stay 1 until data2 signals a 1; it should then stay 0 until data1 signals a 1 again, and so on.
I could use .iterrows(), but I am wondering if there is a faster vectorized way?
import pandas as pd

data1 = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
data2 = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]
data3 = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1]  # desired output
df = pd.DataFrame()
df["d1"] = data1
df["d2"] = data2

One option: map set points (d1 == 1) to True and reset points (d2 == 1) to False, leave everything else NaN, then forward-fill so the last state latches:
df['out'] = (
    df['d1'].map({1: True})            # set points
    .fillna(df['d2'].map({1: False}))  # reset points
    .ffill()                           # latch the last state
    .fillna(0)                         # rows before the first signal
    .astype(int)
)
print(df)
If you want to give priority to d2 over d1 when both signal on the same row:
df['out'] = (
    df['d2'].map({1: False})
    .fillna(df['d1'].map({1: True}))
    .ffill().fillna(0).astype(int)
)
Output:
d1 d2 out
0 1 0 1
1 1 0 1
2 0 1 0
3 0 0 0
4 0 0 0
5 1 0 1
6 0 0 1
7 0 0 1
8 0 1 0
9 0 1 0
10 1 0 1
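An equivalent way to see the latch, sketched with plain set/reset assignments (assuming numpy is available): set points become 1, reset points become 0, everything else stays NaN, and ffill carries the last state forward. Note that written this way d2 wins when both signal on the same row, because the reset assignment runs last; the sample data has no such ties, so it reproduces data3 as well.
import numpy as np

s = pd.Series(np.nan, index=df.index)
s[df['d1'].eq(1)] = 1  # set: latch turns on
s[df['d2'].eq(1)] = 0  # reset: latch turns off (runs last, so d2 wins ties)
df['out2'] = s.ffill().fillna(0).astype(int)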

Related

How to pivot dataframe into ML format

My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.
I have a DF that looks like this:
month day week_day classname_en origin destination
0 1 7 2 1 2 5
1 1 2 6 2 1 167
2 2 1 5 1 2 54
3 2 2 6 4 1 6
4 1 2 6 5 6 1
But I want to turn it into something like:
month_1 month_2 ... classname_en_1 classname_en_2 ... origin_1 origin_2 ... destination_1
0 1 0 1 0 0 1 0
1 1 0 0 1 1 0 0
2 0 1 1 0 0 1 0
3 0 1 0 0 1 0 0
4 1 0 0 0 0 0 1
Basically, turn all values into columns, with binary rows: 1 if the value is present, 0 if not.
I don't know if this is at all possible with a single function, but I would appreciate any and all help!
To expand on @Corralien's answer:
It is indeed one way to do it, but since you are preparing data for ML, you might introduce a bug.
With the code above you get a matrix with 20 features. Now, say you want to predict on some data that suddenly has one more month value than your training data: the matrix for your prediction data would then have 21 features, so you cannot feed it into your fitted model.
To overcome this you can use one-hot encoding from scikit-learn. It makes sure that you always have the same number of features on new data as on your training data.
import pandas as pd
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
pd.get_dummies(df_train)
# output
age color_blue color_red
0 10 0 1
1 15 1 0
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
pd.get_dummies(df_new)
#output
age color_blue color_green color_red
0 10 0 0 1
1 15 1 0 0
2 20 0 1 0
As you can see, a new color_green column has appeared, so the layout of the binary color representation has changed.
If we instead use OneHotEncoder, you can avoid all of those issues:
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_ohe_transformed = ohe.fit_transform(df_train[["color"]])  # creates a sparse matrix
ohe_features = ohe.get_feature_names_out()  # ['color_blue', 'color_red']
pd.DataFrame(color_ohe_transformed.todense(), columns=ohe_features, dtype=int)
# output
color_blue color_red
0 0 1
1 1 0
# now transform new data
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})
new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)
#output
color_blue color_red
0 0 1
1 1 0
2 0 0
Note that in the last row both color_blue and color_red are zero, since that row has color="green", which was not present in the training data.
Also note that the todense() function is only used here to illustrate how it works. Usually you would keep the result a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
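A minimal sketch of that pattern, assuming scipy is installed:
import scipy.sparse as sp

age = sp.csr_matrix(df_train[["age"]].values)  # make the dense column sparse too
X_train = sp.hstack([color_ohe_transformed, age])  # shape (2, 3): color_blue, color_red, age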
Use pd.get_dummies:
out = pd.get_dummies(df, columns=df.columns)
print(out)
# Output
month_1 month_2 day_1 day_2 day_7 week_day_2 week_day_5 ... origin_2 origin_6 destination_1 destination_5 destination_6 destination_54 destination_167
0 1 0 0 0 1 1 0 ... 1 0 0 1 0 0 0
1 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 1
2 0 1 1 0 0 0 1 ... 1 0 0 0 0 1 0
3 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0
4 1 0 0 1 0 0 0 ... 0 1 1 0 0 0 0
[5 rows x 20 columns]
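A small version note: recent pandas releases return boolean dummy columns by default, so passing dtype=int keeps the 0/1 output shown above:
out = pd.get_dummies(df, columns=df.columns, dtype=int)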
You can use the get_dummies function of pandas to convert row values into columns based on the data.
For that, your code will be:
import pandas as pd

df = pd.DataFrame({
    'month': [1, 1, 2, 2, 1],
    'day': [7, 2, 1, 2, 2],
    'week_day': [2, 6, 5, 6, 6],
    'classname_en': [1, 2, 1, 4, 5],
    'origin': [2, 1, 2, 1, 6],
    'destination': [5, 167, 54, 6, 1]
})
response = pd.get_dummies(df, columns=df.columns)
print(response)
Result: the same 5-row x 20-column dummy matrix shown in the previous answer.

Pandas: occurrence matrix from one hot encoding from pandas dataframe

I have a dataframe in one-hot format:
dummy_data = {'a': [0, 0, 1, 0], 'b': [1, 1, 1, 0], 'c': [0, 1, 0, 1], 'd': [1, 1, 1, 0]}
data = pd.DataFrame(dummy_data)
Output:
a b c d
0 0 1 0 1
1 0 1 1 1
2 1 1 0 1
3 0 0 1 0
I am trying to get the occurrence matrix from this dataframe. When I have the column names in lists instead of one-hot, like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
              .reindex(unique_categories, axis=0)
              .reindex(unique_categories, axis=1)
              .fillna(0))
Output:
val_y a b c d
val_x
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
How to get the occurrence matrix directly from one hot dataframe?
You can have some fun with matrix math! The dot product of the transposed one-hot frame with itself counts, for each pair of columns, the rows in which both are 1; masking the diagonal removes the self-co-occurrences.
import numpy as np

u = np.diag(np.ones(data.shape[1], dtype=bool))  # boolean diagonal mask
data.T.dot(data) * (~u)
a b c d
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
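As a quick sanity check, the one-hot frame collapses back to the list form used in the question, so both routes describe the same data:
raw_back = data.apply(lambda r: list(data.columns[r == 1]), axis=1).tolist()
print(raw_back)  # [['b', 'd'], ['b', 'c', 'd'], ['a', 'b', 'd'], ['c']]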

Python - Column-wise keep first unique value

I have a dataframe with multiple columns that represent whether or not something existed, but they are ordinal in nature. Something could have existed in all 3 categories, but I only want to indicate the highest level it existed in.
So for a given row, I only want a single 1 value, and I want it kept at the highest level it was found at.
For this row:
1,1,0, I would want the row changed to 1,0,0
and for this row:
0,1,1, I would want the row changed to 0,1,0
Here is a sample of what the data could look like, and the expected output:
import pandas as pd

# input data
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'level1': [0, 0, 0, 0, 1],
                   'level2': [1, 0, 1, 0, 1],
                   'level3': [0, 1, 1, 1, 0]})

# expected output:
new_df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                       'level1': [0, 0, 0, 0, 1],
                       'level2': [1, 0, 1, 0, 0],
                       'level3': [0, 1, 0, 1, 0]})
Using numpy.zeros and filling via numpy.argmax:
import numpy as np

out = np.zeros(df.iloc[:, 1:].shape, dtype=int)
out[np.arange(len(out)), np.argmax(df.iloc[:, 1:].values, 1)] = 1  # 1 at each row's first maximum
df.iloc[:, 1:] = out
Using broadcasting with argmax:
a = df.iloc[:, 1:].values
df.iloc[:, 1:] = (a.argmax(axis=1)[:, None] == range(a.shape[1])).astype(int)
Both produce:
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
You can use advanced indexing with NumPy. Updating the underlying NumPy array works here because the dataframe holds a single int dtype block.
idx = df.iloc[:, 1:].eq(1).values.argmax(1)
df.iloc[:, 1:] = 0
df.values[np.arange(df.shape[0]), idx + 1] = 1  # +1 skips the id column
print(df)
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
numpy.eye
v = df.iloc[:, 1:].values
i = np.eye(3, dtype=np.int64)
a = v.argmax(1)
df.iloc[:, 1:] = i[a]  # each argmax picks the matching identity row
df
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
cumsum and mask
df.set_index('id').pipe(
    lambda d: d.mask(d.cumsum(1) > 1, 0)
).reset_index()
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
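To see why this works: the row-wise cumsum marks every position from the first 1 onward, so masking where it exceeds 1 zeroes everything after the first hit:
print(df.set_index('id').cumsum(1))
#     level1  level2  level3
# id
# 1        0       1       1
# 2        0       0       1
# 3        0       1       2
# 4        0       0       1
# 5        1       2       2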
You can use get_dummies() on the column label of each row's maximum (idxmax):
df[df.filter(like='level').columns] = pd.get_dummies(df.filter(like='level').idxmax(1))
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
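A caveat worth noting: get_dummies only creates columns for labels that actually win idxmax somewhere, so if one level never wins, its column is silently dropped. A reindex, sketched below, keeps the column set stable:
level_cols = df.filter(like='level').columns
df[level_cols] = (pd.get_dummies(df[level_cols].idxmax(1))
                  .reindex(columns=level_cols, fill_value=0))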

Creating a dataframe with binary valued columns with pandas using values from an existing dataframe

I am trying to create a new dataframe with binary (0 or 1) values from an existing dataframe. For every row in the given dataframe, the program should take the value from each cell and set 1 in the correspondingly named column of the row with the same index in the new dataframe.
I have tried executing the following code snippet.
for col in products:
    index = 0
    for item in products[col]:
        products_coded.loc[index, 'prod_' + str(item)] = 1
        index = index + 1
It works for a small number of rows, but it takes a lot of time on any large dataset. What would be the best way to get the desired outcome?
I think you need to:
- first call get_dummies after casting the values to strings
- aggregate by column name with max
- convert the column labels to int for correct ordering
- reindex to fix the ordering and append missing columns, replacing NaNs with 0 via fill_value=0 and dropping the 0 column by starting the range at 1
- add_prefix to rename the columns
df = pd.DataFrame({'B': [3, 1, 12, 12, 8],
                   'C': [0, 6, 0, 14, 0],
                   'D': [0, 14, 0, 0, 0]})
print (df)
B C D
0 3 0 0
1 1 6 14
2 12 0 0
3 12 14 0
4 8 0 0
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
       .max(level=0, axis=1)
       .rename(columns=lambda x: int(x))
       .reindex(columns=range(1, df.values.max() + 1), fill_value=0)
       .add_prefix('prod_'))
print (df1)
prod_1 prod_2 prod_3 prod_4 prod_5 prod_6 prod_7 prod_8 prod_9 \
0 0 0 1 0 0 0 0 0 0
1 1 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0
prod_10 prod_11 prod_12 prod_13 prod_14
0 0 0 0 0 0
1 0 0 0 0 1
2 0 0 1 0 0
3 0 0 1 0 1
4 0 0 0 0 0
Another similar solution:
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
       .max(level=0, axis=1))
df1.columns = df1.columns.astype(int)
df1 = (df1.reindex(columns=range(1, df1.columns.max() + 1), fill_value=0)
       .add_prefix('prod_'))
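A version note: DataFrame.max(level=0, axis=1) was removed in pandas 2.0. Under recent versions the same duplicate-column aggregation can be sketched by grouping the transposed frame:
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
       .T.groupby(level=0).max().T)  # merge duplicate column labels with max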

How to transpose a list to a square matrix

I would like to transpose a list of items into a square matrix format using Python.
I tried pivot_table in pandas but it didn't work.
Here is my code; the input is a two-column CSV file:
import csv
from collections import defaultdict

import pandas as pd

with open(path_to_file, "r") as f:
    reader = csv.reader(f, delimiter=',')
    data = list(reader)
row_count = len(data)
print(row_count - 1)

df = pd.read_csv(path_to_file)
groups = df.groupby(['transmitter chan', 'receiver chan'])
max_for_AS = defaultdict(int)
df = df.assign(ID=[0 + i for i in range(len(df))])
print(df)

for g in groups:
    transmitter, count = g[0][0], len(g[1])
    max_for_AS[transmitter] = max(max_for_AS[transmitter], count)

for g in groups:
    transmitter, receiver, count = g[0][0], g[0][1], len(g[1])
    if count == max_for_AS[transmitter]:
        dataFinal = "{} , {} , {}".format(transmitter, receiver, count)
        print(dataFinal)
Data:
V1 V2 count
0 A R 1
1 Z T 4
2 E B 9
3 R O 8
4 T M 7
5 Y K 5
6 B I 6
7 T Z 2
8 A O 7
9 Y B 8
I think you need:
df = pd.read_csv(path_to_file)
df1 = df.pivot(index='V1', columns='V2', values='count').fillna(0).astype(int)
# or equivalently
df1 = df.set_index(['V1', 'V2'])['count'].unstack(fill_value=0)
But if there are duplicates in V1 and V2, you need to aggregate them:
df1 = df.pivot_table(index='V1', columns='V2', values='count', fill_value=0)
# or equivalently
df1 = df.groupby(['V1', 'V2'])['count'].mean().unstack(fill_value=0)
# to restore the original ordering, add reindex
df1 = df1.reindex(index=df.V1.unique(), columns=df.V2.unique())
print (df1)
V2 R T B O M K I Z
V1
A 1 0 0 7 0 0 0 0
Z 0 4 0 0 0 0 0 0
E 0 0 9 0 0 0 0 0
R 0 0 0 8 0 0 0 0
T 0 0 0 0 7 0 0 2
Y 0 0 8 0 0 5 0 0
B 0 0 0 0 0 0 6 0
Since it's not clear what you are trying to achieve, I'll approach this answer with an assumption.
I assume that you have a pandas dataframe. If that's true, to get the transpose of it using NumPy, you might have to:
Convert the dataframe (df) to a NumPy ndarray like this: df = df.values
Find the transpose using numpy.transpose on the result of step 1
Edit:
Better way: you can also just do df.transpose() (or the df.T shorthand).
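A minimal sketch of both routes on a toy frame (assuming numpy is available):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
arr_t = np.transpose(df.values)  # NumPy route: plain ndarray, labels lost
df_t = df.transpose()            # pandas route: keeps labels, same as df.T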
