classifying a series to a new column in pandas - python

I want to be able to take my current set of data, which is filled with ints, and classify them according to certain criteria. The table looks something like this:
[in]> df = pd.DataFrame({'A':[0,2,3,2,0,0],'B': [1,0,2,0,0,0],'C': [0,0,1,0,1,0]})
[out]>
   A  B  C
0  0  1  0
1  2  0  0
2  3  2  1
3  2  0  0
4  0  0  1
5  0  0  0
I'd like to classify these in a separate column by string. Being more familiar with R, I tried to create a new column with the rules in that column's definition. Following that, I attempted it with .ix and lambdas, both of which resulted in type errors (between ints and Series). I'm under the impression that this is a fairly simple question. Although the following is completely wrong, here is the logic from attempt 1:
df['D'] = (
    if ((df['A'] > 0) & (df['B'] == 0) & df['C']==0):
        return "c1";
    elif ((df['A'] == 0) & ((df['B'] > 0) | df['C'] >0)):
        return "c2";
    else:
        return "c3";)
for a final result of:
   A  B  C     D
0  0  1  0  "c2"
1  2  0  0  "c1"
2  3  2  1  "c3"
3  2  0  0  "c1"
4  0  0  1  "c2"
5  0  0  0  "c3"
If someone could help me figure this out it would be much appreciated.

I can think of two ways. The first is to write a classifier function and then .apply it row-wise:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[0,2,3,2,0,0],'B': [1,0,2,0,0,0],'C': [0,0,1,0,1,0]})
>>>
>>> def classifier(row):
...     if row["A"] > 0 and row["B"] == 0 and row["C"] == 0:
...         return "c1"
...     elif row["A"] == 0 and (row["B"] > 0 or row["C"] > 0):
...         return "c2"
...     else:
...         return "c3"
...
>>> df["D"] = df.apply(classifier, axis=1)
>>> df
   A  B  C   D
0  0  1  0  c2
1  2  0  0  c1
2  3  2  1  c3
3  2  0  0  c1
4  0  0  1  c2
5  0  0  0  c3
and the second is to use boolean indexing with .loc:
>>> df = pd.DataFrame({'A':[0,2,3,2,0,0],'B': [1,0,2,0,0,0],'C': [0,0,1,0,1,0]})
>>> df["D"] = "c3"
>>> df.loc[(df["A"] > 0) & (df["B"] == 0) & (df["C"] == 0), "D"] = "c1"
>>> df.loc[(df["A"] == 0) & ((df["B"] > 0) | (df["C"] > 0)), "D"] = "c2"
>>> df
   A  B  C   D
0  0  1  0  c2
1  2  0  0  c1
2  3  2  1  c3
3  2  0  0  c1
4  0  0  1  c2
5  0  0  0  c3
Which one is clearer depends upon the situation. Usually the more complex the logic the more likely I am to wrap it up in a function I can then document and test.
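
A third option, not part of the original answer but worth knowing, is numpy's np.select, which evaluates an ordered list of conditions vectorized and takes a default for the else branch:
>>> import numpy as np
>>> conditions = [
...     (df["A"] > 0) & (df["B"] == 0) & (df["C"] == 0),   # -> "c1"
...     (df["A"] == 0) & ((df["B"] > 0) | (df["C"] > 0)),  # -> "c2"
... ]
>>> df["D"] = np.select(conditions, ["c1", "c2"], default="c3")
Like the .loc version it avoids a Python-level loop, while keeping the condition order as explicit as the function version.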

Pandas Dataframe - Adding Else?

I want to generate Test Data for my Bayesian Network.
This is my current Code:
data = np.random.randint(2, size=(5, 6))
columns = ['p_1', 'p_2', 'OP1', 'OP2', 'OP3', 'OP4']
df = pd.DataFrame(data=data, columns=columns)
df.loc[(df['p_1'] == 1) & (df['p_2'] == 1), 'OP1'] = 1
df.loc[(df['p_1'] == 1) & (df['p_2'] == 0), 'OP2'] = 1
df.loc[(df['p_1'] == 0) & (df['p_2'] == 1), 'OP3'] = 1
df.loc[(df['p_1'] == 0) & (df['p_2'] == 0), 'OP4'] = 1
print(df)
So whenever, for example, p_1 is 1 and p_2 is 1, OP1 should be 1 and all the other values in that column should be 0.
When p_1 is 1 and p_2 is 0, then OP2 should be 1 and all others 0, and so on.
But my current output is the following:
   p_1  p_2  OP1  OP2  OP3  OP4
0    0    0    0    0    0    1
1    1    0    1    1    1    1
2    0    0    1    1    0    1
3    0    1    1    1    1    1
4    1    0    0    1    1    0
Is there any way to fix it? What did I do wrong?
I didn't really understand the solutions to other people's questions, so I thought I'd ask here.
I hope that someone can help me.
The problem is that when you instantiate df, the "OP" columns already have some values:
data = np.random.randint(2, size=(5, 6))
columns = ['p_1', 'p_2', 'OP1', 'OP2', 'OP3', 'OP4']
df = pd.DataFrame(data=data, columns=columns)
df
   p_1  p_2  OP1  OP2  OP3  OP4
0    1    1    0    1    0    0
1    0    0    1    1    0    1
2    0    1    1    1    0    0
3    1    1    1    1    0    1
4    0    1    1    0    1    0
One way of fixing it while keeping your code is to force all "OP" columns to 0 beforehand:
df["OP1"] = df["OP2"] = df["OP3"] = df["OP4"] = 0
But then you are generating too many random numbers. I'd do this instead:
data = np.random.randint(2, size=(5, 2))
columns = ['p_1', 'p_2']
df = pd.DataFrame(data=data, columns=columns)
df["OP1"] = ((df['p_1'] == 1) & (df['p_2'] == 1)).astype(int)  # OP1 corresponds to p_1 == 1 and p_2 == 1
You can define tuples of test values and create the new columns by casting the boolean masks to integers, mapping True/False to 1/0:
vals = [(1,1),(1,0),(0,1),(0,0)]
for i, (a, b) in enumerate(vals, 1):
    df[f'OP{i}'] = ((df['p_1'] == a) & (df['p_2'] == b)).astype(int)
print(df)
   p_1  p_2  OP1  OP2  OP3  OP4
0    0    0    0    0    0    1
1    0    1    0    0    1    0
2    0    1    0    0    1    0
3    0    1    0    0    1    0
4    1    0    0    1    0    0
In your solution, set the columns to 0 first, because 1 values are already present in the original random DataFrame:
cols = ['OP1', 'OP2', 'OP3', 'OP4']
df[cols] = 0
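
Drawing the two answers together, a minimal corrected version of the original script might look like this (only p_1 and p_2 are drawn at random, as the first answer suggests; the OP columns are derived with the loop from the second):
import numpy as np
import pandas as pd

# only the parent columns are random
df = pd.DataFrame(np.random.randint(2, size=(5, 2)), columns=['p_1', 'p_2'])

# each OP column is 1 exactly for its (p_1, p_2) combination, 0 otherwise
vals = [(1,1),(1,0),(0,1),(0,0)]
for i, (a, b) in enumerate(vals, 1):
    df[f'OP{i}'] = ((df['p_1'] == a) & (df['p_2'] == b)).astype(int)
print(df)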

Use pandas to select the lagged row along with current row based on criteria

I have a dataframe like as shown below
person_id source_system r_diff
1 A NULL
1 B 0
1 B 9
1 A 15
1 A 574
1 B 0
1 A 63
1 A 136
1 B 0
I would like to select data based on an OR of two rules:
a) select all records where source_system = B
b) select rows n and n-1 where r_diff = 0.
For example, in the above data, r_diff = 0 in row numbers 2, 6 and 9. So I would like to select rows 1,2 and 5,6 and 8,9; you can see how I have chosen the n and n-1 rows.
I tried the below
df['flag_1'] = np.where((df['source_system'] == 'B'), '1','0')
df['flag_2'] = np.where((df['r_diff'] == 0), '1','0')
df['flag_3'] = np.where(df['r_diff'].shift(-1) == 0, '1','0')
df = df[((df['flag_1'] == '1') or (df['flag_2'] == '1') or (df['flag_3'] == '1'))]
I expect my output to be like as shown below
person_id  source_system  r_diff
1          A              NULL
1          B              0
1          B              9
1          A              574
1          B              0
1          A              136
1          B              0
I think you are close; you can assign each mask to a variable and chain them with | for bitwise OR:
m1 = df['source_system'] == 'B'
m2 = df['r_diff'] == 0
m3 = df.groupby('person_id')['r_diff'].shift(-1) == 0
df = df[m1 | m2 | m3]
print (df)
   person_id source_system  r_diff
0          1             A     NaN
1          1             B     0.0
2          1             B     9.0
4          1             A   574.0
5          1             B     0.0
7          1             A   136.0
8          1             B     0.0
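
For intuition on m3: groupby(...).shift(-1) pulls each group's next r_diff value onto the current row, so the mask is True exactly on the n-1 row before each zero. A small hypothetical frame for illustration:
import pandas as pd

d = pd.DataFrame({'person_id': [1, 1, 1, 2, 2],
                  'r_diff':    [5, 0, 7, 3, 0]})
print(d.groupby('person_id')['r_diff'].shift(-1))
# 0    0.0   <- next row's r_diff is 0, so this n-1 row gets selected
# 1    7.0
# 2    NaN   <- group boundary: person 1 has no next row here
# 3    0.0
# 4    NaN
# Name: r_diff, dtype: float64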

Pandas: occurrence matrix from one hot encoding from pandas dataframe

I have a dataframe, it's in one hot format:
dummy_data = {'a': [0,0,1,0],'b': [1,1,1,0], 'c': [0,1,0,1],'d': [1,1,1,0]}
data = pd.DataFrame(dummy_data)
Output:
   a  b  c  d
0  0  1  0  1
1  0  1  1  1
2  1  1  0  1
3  0  0  1  0
I am trying to get the occurrence matrix from the dataframe. If I have the column names in lists instead of one-hot format, like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
final = pd.crosstab(df.val_x, df.val_y)
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
              .reindex(unique_categories, axis=0)
              .reindex(unique_categories, axis=1)).fillna(0)
Output:
val_y  a  b  c  d
val_x
a      0  1  0  1
b      1  0  1  3
c      0  1  0  1
d      1  3  1  0
How can I get the occurrence matrix directly from the one-hot dataframe?
You can have some fun with matrix math!
u = np.diag(np.ones(data.shape[1], dtype=bool))  # boolean identity mask
data.T.dot(data) * (~u)                          # zero out the diagonal
   a  b  c  d
a  0  1  0  1
b  1  0  1  3
c  0  1  0  1
d  1  3  1  0
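
The reason this works: for a 0/1 matrix X, entry (i, j) of XᵀX counts the rows in which columns i and j are both 1, and multiplying by ~u zeroes out the diagonal of self-counts. An equivalent self-contained sketch, using np.fill_diagonal instead of the boolean mask:
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [0,0,1,0], 'b': [1,1,1,0],
                     'c': [0,1,0,1], 'd': [1,1,1,0]})

cooc = data.T.dot(data)           # co-occurrence counts, incl. self-counts
np.fill_diagonal(cooc.values, 0)  # drop each column's count with itself
print(cooc)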

pd.get_dummies() for DataFrames with multi-index columns

Using get_dummies(), it is possible to create one-hot encoded dummy variables for categorical data. For example:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['b', 'a', 'c']})
print(pd.get_dummies(df))
# A_a A_b B_a B_b B_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
So far, so good. But how can I use get_dummies() in combination with multi-index columns? The default behavior is not very practical: The multi-index tuple is converted into a string and the same suffix mechanism applies as with the simple-index columns.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
                   ('ii','B'): ['b', 'a', 'c']})
ret = pd.get_dummies(df)
print(ret)
print(type(ret.columns[0]))
# ('i','A')_a ('i','A')_b ('ii','B')_a ('ii','B')_b ('ii','B')_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# str
What I would like to get, however, is that the dummies create a new column level:
ret = pd.get_dummies(df, ???)
print(ret)
print(type(ret.columns[0]))
# i ii
# A B
# a b a b c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# tuple
#
# Note that the ret would be equivalent to the following:
# ('i','A','a') ('i','A','b') ('ii','B','a') ('ii','B','b') ('ii','B','c')
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
How could this be achieved?
Update: I placed a feature request for better support of multi-index data-frames in get_dummies: https://github.com/pandas-dev/pandas/issues/26560
You can parse the column names and rename them:
import ast

def parse_dummy(x):
    # rsplit on the last '_' so category values containing '_' stay intact
    parts = x.rsplit('_', 1)
    return ast.literal_eval(parts[0]) + (parts[1],)

ret.columns = pd.Series(ret.columns).apply(parse_dummy)
# (i, A, a) (i, A, b) (ii, B, a) (ii, B, b) (ii, B, c)
#0 1 0 0 1 0
#1 0 1 1 0 0
#2 1 0 0 0 1
Note that this DataFrame is not the same as a DataFrame with three-level multiindex column names.
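
If you do want genuine MultiIndex columns rather than flat tuple labels, the tuples produced by parse_dummy can be converted afterwards, for example:
ret.columns = pd.MultiIndex.from_tuples(ret.columns)
print(ret)
#    i     ii
#    A     B
#    a  b  a  b  c
# 0  1  0  0  1  0
# 1  0  1  1  0  0
# 2  1  0  0  0  1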
I had a similar need, but with a more complex DataFrame: a MultiIndex as the row index, and numerical columns which should not be converted to dummies. So my case required scanning through the columns, expanding into dummies only the columns of dtype='object', and building each new column name as a concatenation of the original column name and the value of the dummy variable, because I didn't want to add a new column index level.
Here is the code. First, build a dataframe in the format I need:
import pandas as pd
import numpy as np
df_size = 3
objects = ['obj1','obj2']
attributes = ['a1','a2','a3']
cols = pd.MultiIndex.from_product([objects, attributes], names=['objects', 'attributes'])
lab1 = ['car','truck','van']
lab2 = ['bay','zoe','ros','lol']
time = np.arange(df_size)
node = ['n1','n2']
idx = pd.MultiIndex.from_product([time, node], names=['time', 'node'])
df = pd.DataFrame(np.random.randint(10,size=(len(idx),len(cols))),columns=cols,index=idx)
c1 = map(lambda i:lab1[i],np.random.randint(len(lab1),size=len(idx)))
c2 = map(lambda i:lab2[i],np.random.randint(len(lab2),size=len(idx)))
df[('obj1','a3')]=list(c1)
df[('obj2','a2')]=list(c2)
print(df)
objects           obj1            obj2
attributes   a1   a2     a3  a1   a2  a3
time node
0    n1       6    5  truck   3  ros   3
     n2       5    6    car   9  lol   7
1    n1       0    8  truck   7  zoe   8
     n2       4    3  truck   8  bay   3
2    n1       5    8    van   0  bay   0
     n2       4    8    car   5  lol   4
And here is the code to dummify only the object columns:
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product(
        [[t[0]], [f'{t[1]}_{c}' for c in dummy_block.columns]],
        names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
objects     obj1                                obj2
attributes    a1  a2  a3_car  a3_truck  a3_van  a1  a2_bay  a2_lol  a2_ros  a2_zoe  a3
time node
0    n1        6   5       0         1       0   3       0       0       1       0   3
     n2        5   6       1         0       0   9       0       1       0       0   7
1    n1        0   8       0         1       0   7       0       0       0       1   8
     n2        4   3       0         1       0   8       1       0       0       0   3
2    n1        5   8       0         0       1   0       1       0       0       0   0
     n2        4   8       1         0       0   5       0       1       0       0   4
It can easily be changed to answer the original use case by adding one more level to the columns MultiIndex.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
                   ('ii','B'): ['b', 'a', 'c']})
print(df)
   i ii
   A  B
0  a  b
1  b  a
2  a  c
df.columns = pd.MultiIndex.from_tuples([t + ('',) for t in df.columns])
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product(
        [[t[0]], [t[1]], [c for c in dummy_block.columns]],
        names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
   i     ii
   A     B
   a  b  a  b  c
0  1  0  0  1  0
1  0  1  1  0  0
2  1  0  0  0  1
Note that it still works if there are numerical columns; it just adds an empty additional level for them in the columns index as well.
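
As an aside, not from either answer: a shorter route to the three-level result in the original question is to run get_dummies column by column and let pd.concat build the upper levels from the tuple keys. A sketch, assuming every column should be dummified:
import pandas as pd

df = pd.DataFrame({('i', 'A'): ['a', 'b', 'a'],
                   ('ii', 'B'): ['b', 'a', 'c']})

# each tuple key supplies the two upper column levels, get_dummies the third
ret = pd.concat([pd.get_dummies(df[col]) for col in df.columns],
                axis=1, keys=list(df.columns))
print(ret)
#    i     ii
#    A     B
#    a  b  a  b  c
# 0  1  0  0  1  0
# 1  0  1  1  0  0
# 2  1  0  0  0  1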

Compare two columns using pandas 2

I'm comparing two columns in a dataframe (A & B). I have a method that works (C5). It came from this question:
Compare two columns using pandas
I wondered why I couldn't get the other methods (C1 - C4) to give the correct answer:
df = pd.DataFrame({'A': [1,1,1,1,1,2,2,2,2,2],
                   'B': [1,1,1,1,1,1,0,0,0,0]})
#df['C1'] = 1 [df['A'] == df['B']]
df['C2'] = df['A'].equals(df['B'])
df['C3'] = np.where((df['A'] == df['B']),0,1)

def fun(row):
    if ['A'] == ['B']:
        return 1
    else:
        return 0

df['C4'] = df.apply(fun, axis=1)
df['C5'] = df.apply(lambda x : 1 if x['A'] == x['B'] else 0, axis=1)
Use:
df = pd.DataFrame({'A': [1,1,1,1,1,2,2,2,2,2],
                   'B': [1,1,1,1,1,1,0,0,0,0]})
For C1 and C2 you need to compare the columns with == or eq to get a boolean mask, then convert it to integers, mapping True/False to 1/0 (note that Series.equals returns a single boolean for the whole column, which is why C2 broadcast one value to every row):
df['C1'] = (df['A'] == df['B']).astype(int)
df['C2'] = df['A'].eq(df['B']).astype(int)
For C3 it is necessary to swap the order to 1,0, because a matching condition should yield 1:
df['C3'] = np.where((df['A'] == df['B']),1,0)
In the C4 function, the values were not selected from the row Series; the row reference was missing:
def fun(row):
    if row['A'] == row['B']:
        return 1
    else:
        return 0

df['C4'] = df.apply(fun, axis=1)
The C5 solution is correct:
df['C5'] = df.apply(lambda x : 1 if x['A'] == x['B'] else 0, axis=1)
print (df)
   A  B  C1  C2  C3  C4  C5
0  1  1   1   1   1   1   1
1  1  1   1   1   1   1   1
2  1  1   1   1   1   1   1
3  1  1   1   1   1   1   1
4  1  1   1   1   1   1   1
5  2  1   0   0   0   0   0
6  2  0   0   0   0   0   0
7  2  0   0   0   0   0   0
8  2  0   0   0   0   0   0
9  2  0   0   0   0   0   0
IIUC you need this:
def fun(row):
    if row['A'] == row['B']:
        return 1
    else:
        return 0
