Compare two columns using pandas 2 - python

I'm comparing two columns in a dataframe (A & B). I have a method that works (C5). It came from this question:
Compare two columns using pandas
I wondered why I couldn't get the other methods (C1 - C4) to give the correct answer:
df = pd.DataFrame({'A': [1,1,1,1,1,2,2,2,2,2],
'B': [1,1,1,1,1,1,0,0,0,0]})
#df['C1'] = 1 [df['A'] == df['B']]
df['C2'] = df['A'].equals(df['B'])
df['C3'] = np.where((df['A'] == df['B']),0,1)
def fun(row):
    if ['A'] == ['B']:
        return 1
    else:
        return 0
df['C4'] = df.apply(fun, axis=1)
df['C5'] = df.apply(lambda x : 1 if x['A'] == x['B'] else 0, axis=1)

Use:
df = pd.DataFrame({'A': [1,1,1,1,1,2,2,2,2,2],
'B': [1,1,1,1,1,1,0,0,0,0]})
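For C1 and C2, compare the columns with == or Series.eq to get a boolean mask, then convert the mask to integers so that True/False become 1/0. Series.equals does not work here because it returns a single boolean for the whole column rather than one value per row: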
df['C1'] = (df['A'] == df['B']).astype(int)
df['C2'] = df['A'].eq(df['B']).astype(int)
For C3, swap the last two arguments of np.where: rows where the condition matches should get 1, and the others 0:
df['C3'] = np.where((df['A'] == df['B']),1,0)
In C4, the function compares the literal lists ['A'] and ['B'] instead of the values in the row, so row is missing from the lookup:
def fun(row):
    if row['A'] == row['B']:
        return 1
    else:
        return 0
df['C4'] = df.apply(fun, axis=1)
C5 is already correct:
df['C5'] = df.apply(lambda x : 1 if x['A'] == x['B'] else 0, axis=1)
print (df)
A B C1 C2 C3 C4 C5
0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1
5 2 1 0 0 0 0 0
6 2 0 0 0 0 0 0
7 2 0 0 0 0 0 0
8 2 0 0 0 0 0 0
9 2 0 0 0 0 0 0
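The difference between Series.equals and Series.eq is easy to check directly. The following is a minimal sketch using the same data as above; the variable names whole_column, per_row and flags are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'B': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]})

# equals() answers "are the two Series identical?" with one boolean
whole_column = df['A'].equals(df['B'])

# eq() (same as ==) compares element by element and returns a boolean Series
per_row = df['A'].eq(df['B'])

# cast the mask to int to get 1 for a match and 0 otherwise
flags = per_row.astype(int)
print(whole_column)
print(flags.tolist())
```

Assigning whole_column to a new column would broadcast that single value to every row, which is why C2 in the question came out constant.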

IIUC you need this:
def fun(row):
    if row['A'] == row['B']:
        return 1
    else:
        return 0

Related

Pandas Dataframe - Adding Else?

I want to generate Test Data for my Bayesian Network.
This is my current Code:
data = np.random.randint(2, size=(5, 6))
columns = ['p_1', 'p_2', 'OP1', 'OP2', 'OP3', 'OP4']
df = pd.DataFrame(data=data, columns=columns)
df.loc[(df['p_1'] == 1) & (df['p_2'] == 1), 'OP1'] = 1
df.loc[(df['p_1'] == 1) & (df['p_2'] == 0), 'OP2'] = 1
df.loc[(df['p_1'] == 0) & (df['p_2'] == 1), 'OP3'] = 1
df.loc[(df['p_1'] == 0) & (df['p_2'] == 0), 'OP4'] = 1
print(df)
So whenever, for example, p_1 is 1 and p_2 is 1, OP1 should be 1 as well, and all the other OP columns should be 0.
When p_1 is 1 and p_2 is 0, OP2 should be 1 and all the others 0, and so on.
But my current Output is the following:
p_1 p_2 OP1 OP2 OP3 OP4
0 0 0 0 0 1
1 0 1 1 1 1
0 0 1 1 0 1
0 1 1 1 1 1
1 0 0 1 1 0
Is there any way to fix it? What did I do wrong?
I didn't really understand the solutions to other people's questions, so I thought I'd ask here.
I hope that someone can help me.
The problem is that when you instantiate df, the "OP" columns already have some values:
data = np.random.randint(2, size=(5, 6))
columns = ['p_1', 'p_2', 'OP1', 'OP2', 'OP3', 'OP4']
df = pd.DataFrame(data=data, columns=columns)
df
p_1 p_2 OP1 OP2 OP3 OP4
0 1 1 0 1 0 0
1 0 0 1 1 0 1
2 0 1 1 1 0 0
3 1 1 1 1 0 1
4 0 1 1 0 1 0
One way of fixing it with your code is forcing all "OP" columns to 0 before:
df["OP1"] = df["OP2"] = df["OP3"] = df["OP4"] = 0
But then you are generating too many random numbers. I'd do this instead:
data = np.random.randint(2, size=(5, 2))
columns = ['p_1', 'p_2']
df = pd.DataFrame(data=data, columns=columns)
df["OP1"] = ((df['p_1'] == 1) & (df['p_2'] == 1)).astype(int)
You can define the (p_1, p_2) pairs to test as tuples and create each new column by casting the boolean mask to integers, which maps True/False to 1/0:
vals = [(1,1),(1,0),(0,1),(0,0)]
for i, (a, b) in enumerate(vals, 1):
    df[f'OP{i}'] = ((df['p_1'] == a) & (df['p_2'] == b)).astype(int)
print(df)
p_1 p_2 OP1 OP2 OP3 OP4
0 0 0 0 0 0 1
1 0 1 0 0 1 0
2 0 1 0 0 1 0
3 0 1 0 0 1 0
4 1 0 0 1 0 0
To keep your own solution, set the OP columns to 0 first, because the original DataFrame already contains random 1 values in them:
cols = ['OP1', 'OP2', 'OP3', 'OP4']
df[cols] = 0
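As a quick sanity check of the mask-based approach, here is a minimal sketch with fixed values instead of random ones, so the result is reproducible. The dictionary name pairs and the check row_totals are illustrative and not part of the original answers:

```python
import pandas as pd

# fixed inputs instead of np.random.randint, so the outcome is predictable
df = pd.DataFrame({'p_1': [1, 1, 0, 0, 1],
                   'p_2': [1, 0, 1, 0, 0]})

# which (p_1, p_2) combination switches on which OP column
pairs = {'OP1': (1, 1), 'OP2': (1, 0), 'OP3': (0, 1), 'OP4': (0, 0)}

for name, (a, b) in pairs.items():
    # boolean mask cast to int: 1 where both conditions hold, else 0
    df[name] = ((df['p_1'] == a) & (df['p_2'] == b)).astype(int)

# every row should have exactly one OP column set to 1
row_totals = df[list(pairs)].sum(axis=1)
print(df)
```

Because each column is built from scratch, no stale random values can survive in the OP columns.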

Count number of occurrences of a column changing value from 0 to 1 and 0 to 2

Suppose I have df1['col1'] and df2['col2'], two columns of equal length.
The values in the columns are 0, 1 and 2 only.
Each pair of corresponding rows describes a transition of one element from its value in df1 to its value in df2. How do I count the transitions 0 -> 2 and 1 -> 2?
As an example, I need to count the changes in the 2 columns below
print(df1['orig_label'][0:5])
0 2
1 2
2 0
3 0
4 1
Name: orig_label, dtype: int64
print(df2['predicted_label'][0:5])
0 1
1 1
2 0
3 2
4 2
Name: predicted_label, dtype: int64
Expected output: 1 and 1, which are the counts of the 0->2 and 1->2 transitions.
This is a use case for pandas.crosstab:
pd.crosstab(df1['col1'], df2['col2'])
Output:
col2 0 1 2
col1
0 1 0 1
1 0 0 1
2 0 2 0
To get only 0->2 and 1->2:
pd.crosstab(df1['col1'], df2['col2']).loc[[0,1], [2]]
Output:
col2 2
col1
0 1
1 1
How about this?:
df = pd.DataFrame(data={"col1":[0,1,2,3,4], "col2":[2,2,0,0,1]}, columns=["col1", "col2"])
df["0->2"] = df.apply(lambda row: 1 if row["col1"] == 0 and row["col2"] == 2 else 0, axis=1)
df["1->2"] = df.apply(lambda row: 1 if row["col1"] == 1 and row["col2"] == 2 else 0, axis=1)
print("N 0->2 = {}".format(df["0->2"].sum()))
print("N 1->2 = {}".format(df["1->2"].sum()))
This has the potential downside of adding two additional columns to your original dataframe, but you could also just create them as separate series objects if you don't want to do that:
df = pd.DataFrame(data={"col1":[0,1,2,3,4], "col2":[2,2,0,0,1]}, columns=["col1", "col2"])
zeroToTwo = df.apply(lambda row: 1 if row["col1"] == 0 and row["col2"] == 2 else 0, axis=1)
oneToTwo = df.apply(lambda row: 1 if row["col1"] == 1 and row["col2"] == 2 else 0, axis=1)
print("N 0->2 = {}".format(zeroToTwo.sum()))
print("N 1->2 = {}".format(oneToTwo.sum()))
Here is a way using zip() and value_counts()
(pd.Series(['0-2' if (col1 == 0 and col2 == 2)
            else '1-2' if (col1 == 1 and col2 == 2)
            else None
            for col1, col2 in zip(df['col1'], df['col2'])])
   .value_counts())
Output
0-2 1
1-2 1
dtype: int64
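If only a couple of specific transitions are needed, summing a boolean mask avoids both apply and the helper columns. This is a sketch on the five sample rows from the question; the names orig, pred, zero_to_two and one_to_two are made up for the example:

```python
import pandas as pd

# the five sample rows shown in the question
orig = pd.Series([2, 2, 0, 0, 1], name='orig_label')
pred = pd.Series([1, 1, 0, 2, 2], name='predicted_label')

# True where a row starts at the first value and ends at the second;
# summing a boolean Series counts the True entries
zero_to_two = int(((orig == 0) & (pred == 2)).sum())
one_to_two = int(((orig == 1) & (pred == 2)).sum())

# crosstab gives the full transition table in one call
table = pd.crosstab(orig, pred)

print(zero_to_two, one_to_two)
print(table)
```

The mask version is convenient for one or two counts; the crosstab is the better fit when every transition is of interest.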

Concatenate column names by using the binary values in the columns

Currently, I have a dataframe as follows:
date A B C
02/19/2020 0 0 0
02/20/2020 0 0 0
02/21/2020 1 1 1
02/22/2020 0 1 0
02/23/2020 0 1 1
02/24/2020 0 0 1
02/25/2020 1 0 1
02/26/2020 1 0 0
The binary columns contain integers. The "date" column is a DateTime object. I want to create a new categorical column that is based on the binary columns as follows
date A B C new
02/19/2020 0 0 0 "None"
02/20/2020 0 0 0 "None"
02/21/2020 1 1 1 A+B+C
02/22/2020 0 1 0 B
02/23/2020 0 1 1 B+C
02/24/2020 0 0 1 C
02/25/2020 1 0 1 A+C
02/26/2020 1 0 0 A
How can I achieve this?
Use DataFrame.dot to matrix-multiply the binary columns by the column names: select every column except the first by position with DataFrame.iloc, append the separator to each of those column names, and strip the trailing separator with str[:-1]:
df['new'] = df.iloc[:, 1:].dot(df.columns[1:] + '+').str[:-1]
#set empty string to None
df.loc[df['new'].eq(''), 'new'] = None
print (df)
date A B C new
0 02/19/2020 0 0 0 None
1 02/20/2020 0 0 0 None
2 02/21/2020 1 1 1 A+B+C
3 02/22/2020 0 1 0 B
4 02/23/2020 0 1 1 B+C
5 02/24/2020 0 0 1 C
6 02/25/2020 1 0 1 A+C
7 02/26/2020 1 0 0 A
If NaN is acceptable instead of None:
df['new'] = df.iloc[:, 1:].dot(df.columns[1:] + '+').str[:-1].replace('', np.nan)
print (df)
date A B C new
0 02/19/2020 0 0 0 NaN
1 02/20/2020 0 0 0 NaN
2 02/21/2020 1 1 1 A+B+C
3 02/22/2020 0 1 0 B
4 02/23/2020 0 1 1 B+C
5 02/24/2020 0 0 1 C
6 02/25/2020 1 0 1 A+C
7 02/26/2020 1 0 0 A
Or, if the first column can be moved into the index (as a DatetimeIndex), use:
df1 = df.set_index('date')
df1['new'] = df1.dot(df1.columns + '+').str[:-1]
df1.loc[df1['new'].eq(''), 'new'] = None
You can iterate over the DataFrame to compute the new column's values and then add the column.
This is a basic example:
new_column = []
for i, row in df.iterrows():
    row_val = None
    if row["A"]:
        if row_val:
            row_val += "+A"
        else:
            row_val = "A"
    if row["B"]:
        if row_val:
            row_val += "+B"
        else:
            row_val = "B"
    if row["C"]:
        if row_val:
            row_val += "+C"
        else:
            row_val = "C"
    if row_val is None:
        row_val = "None"
    new_column.append(row_val)
df["new_column_name"] = new_column
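The same per-row logic can be written more compactly with str.join, which also generalises to any number of binary columns. This is only a sketch: the frame is rebuilt from the table in the question, and the list name flag_cols is an assumption for the example:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2020-02-19', periods=8),
    'A': [0, 0, 1, 0, 0, 0, 1, 1],
    'B': [0, 0, 1, 1, 1, 0, 0, 0],
    'C': [0, 0, 1, 0, 1, 1, 1, 0],
})

flag_cols = ['A', 'B', 'C']  # the binary columns to combine

# join the names of the columns whose flag is set; an empty join
# result is falsy, so "or" substitutes the string 'None'
df['new'] = df[flag_cols].apply(
    lambda row: '+'.join(name for name, flag in row.items() if flag) or 'None',
    axis=1,
)
print(df)
```

Like the iterrows loop, this runs Python code for each row, so the dot-based approach is likely to be faster on large frames.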

Pandas: Count values on a row basis

I have a numeric DataFrame, for example:
x = np.array([[1,2,3],[-1,-1,1],[0,0,0]])
df = pd.DataFrame(x, columns=['A','B','C'])
df
A B C
0 1 2 3
1 -1 -1 1
2 0 0 0
And I want to count, for each row, the number of positive values, negative values and values equal to 0. I've been trying the following:
df['positive_count'] = df.apply(lambda row: (row > 0).sum(), axis = 1)
df['negative_count'] = df.apply(lambda row: (row < 0).sum(), axis = 1)
df['zero_count'] = df.apply(lambda row: (row == 0).sum(), axis = 1)
But I'm getting the following result, which is obviously incorrect:
A B C positive_count negative_count zero_count
0 1 2 3 3 0 1
1 -1 -1 1 1 2 0
2 0 0 0 0 0 5
Anyone knows what might be going wrong, or could help me find the best way to do what I'm looking for?
Thank you.
There are some ways, but one option is using np.sign and get_dummies:
u = (pd.get_dummies(np.sign(df.stack()))
       .groupby(level=0).sum()   # sum(level=0) was removed in pandas 2.0
       .rename({-1: 'negative_count', 1: 'positive_count', 0: 'zero_count'}, axis=1))
u
negative_count zero_count positive_count
0 0 0 3
1 2 0 1
2 0 3 0
df = pd.concat([df, u], axis=1)
df
A B C negative_count zero_count positive_count
0 1 2 3 0 0 3
1 -1 -1 1 2 0 1
2 0 0 0 0 3 0
np.sign treats zero differently from positive and negative values, so it is ideal to use here.
Another option is groupby and value_counts:
(np.sign(df)
.stack()
.groupby(level=0)
.value_counts()
.unstack(1, fill_value=0)
.rename({-1: 'negative_count', 1: 'positive_count', 0: 'zero_count'}, axis=1))
negative_count zero_count positive_count
0 0 0 3
1 2 0 1
2 0 3 0
Slightly more verbose but still worth knowing about.
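The wrong numbers in the question come from the fact that each new count column becomes part of the row seen by the next apply call (for the last row, zero_count reaches 5 because positive_count and negative_count are themselves 0). A minimal sketch that keeps the original approach but restricts it to the original columns, with values as an illustrative name:

```python
import numpy as np
import pandas as pd

x = np.array([[1, 2, 3], [-1, -1, 1], [0, 0, 0]])
df = pd.DataFrame(x, columns=['A', 'B', 'C'])

# take the original columns once, before any count column is added
values = df[['A', 'B', 'C']]

# comparing the frame gives a boolean frame; sum(axis=1) counts True per row
df['positive_count'] = (values > 0).sum(axis=1)
df['negative_count'] = (values < 0).sum(axis=1)
df['zero_count'] = (values == 0).sum(axis=1)
print(df)
```

This also avoids apply entirely, since the comparisons and the row-wise sum are vectorised.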

classifying a series to a new column in pandas

I want to be able to take my current set of data, which is filled with ints, and classify them according to certain criteria. The table looks something like this:
[in]> df = pd.DataFrame({'A':[0,2,3,2,0,0],'B': [1,0,2,0,0,0],'C': [0,0,1,0,1,0]})
[out]>
A B C
0 0 1 0
1 2 0 0
2 3 2 1
3 2 0 0
4 0 0 1
5 0 0 0
I'd like to classify these in a separate column by string. Being more familiar with R, I tried to create a new column with the rules in that column's definition. After that I tried .ix and lambdas, both of which resulted in type errors (between ints and Series). I'm under the impression that this is a fairly simple question. Although the following is completely wrong, here is the logic from attempt 1:
df['D']=(
if ((df['A'] > 0) & (df['B'] == 0) & df['C']==0):
return "c1";
elif ((df['A'] == 0) & ((df['B'] > 0) | df['C'] >0)):
return "c2";
else:
return "c3";)
for a final result of:
A B C D
0 0 1 0 "c2"
1 2 0 0 "c1"
2 3 2 1 "c3"
3 2 0 0 "c1"
4 0 0 1 "c2"
5 0 0 0 "c3"
If someone could help me figure this out it would be much appreciated.
I can think of two ways. The first is to write a classifier function and then .apply it row-wise:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[0,2,3,2,0,0],'B': [1,0,2,0,0,0],'C': [0,0,1,0,1,0]})
>>>
>>> def classifier(row):
...     if row["A"] > 0 and row["B"] == 0 and row["C"] == 0:
...         return "c1"
...     elif row["A"] == 0 and (row["B"] > 0 or row["C"] > 0):
...         return "c2"
...     else:
...         return "c3"
...
>>> df["D"] = df.apply(classifier, axis=1)
>>> df
A B C D
0 0 1 0 c2
1 2 0 0 c1
2 3 2 1 c3
3 2 0 0 c1
4 0 0 1 c2
5 0 0 0 c3
and the second is to use advanced indexing:
>>> df = pd.DataFrame({'A':[0,2,3,2,0,0],'B': [1,0,2,0,0,0],'C': [0,0,1,0,1,0]})
>>> df["D"] = "c3"
>>> df.loc[(df["A"] > 0) & (df["B"] == 0) & (df["C"] == 0), "D"] = "c1"
>>> df.loc[(df["A"] == 0) & ((df["B"] > 0) | (df["C"] > 0)), "D"] = "c2"
>>> df
A B C D
0 0 1 0 c2
1 2 0 0 c1
2 3 2 1 c3
3 2 0 0 c1
4 0 0 1 c2
5 0 0 0 c3
Which one is clearer depends upon the situation. Usually the more complex the logic the more likely I am to wrap it up in a function I can then document and test.
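A third option, not mentioned in the answer above, is numpy.select, which takes a list of conditions and a matching list of labels and is close in spirit to the if/elif/else the question sketched. A minimal version, assuming the same rules (the names conditions and labels are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 2, 3, 2, 0, 0],
                   'B': [1, 0, 2, 0, 0, 0],
                   'C': [0, 0, 1, 0, 1, 0]})

# conditions are checked in order; the first one that is True wins
conditions = [
    (df['A'] > 0) & (df['B'] == 0) & (df['C'] == 0),
    (df['A'] == 0) & ((df['B'] > 0) | (df['C'] > 0)),
]
labels = ['c1', 'c2']

# rows matching no condition get the default label
df['D'] = np.select(conditions, labels, default='c3')
print(df)
```

Note that each comparison is wrapped in its own parentheses, because & and | bind more tightly than == and > in Python.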
