Checking column values in python pandas - python

How do I check whether the column values in a pandas DataFrame are the same, and create the result in a fourth column:
original
red blue green
a 1 1 1
b 1 2 1
c 2 2 2
becomes:
red blue green match
a 1 1 1 1
b 1 2 1 0
c 2 2 2 1
Originally I only had 2 columns and it was possible to achieve something similar by doing this:
df['match']=df['blue']-df['red']
but this won't work with 3 columns.
Your help is greatly appreciated!
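For reference, a minimal sketch that rebuilds the example frame (index labels a/b/c taken from the listing above):
import pandas as pd

df = pd.DataFrame({'red': [1, 1, 2],
                   'blue': [1, 2, 2],
                   'green': [1, 1, 2]},
                  index=['a', 'b', 'c'])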

To make it more generic, compare the row values with the apply method.
Using set()
In [54]: df['match'] = df.apply(lambda x: len(set(x)) == 1, axis=1).astype(int)
In [55]: df
Out[55]:
red blue green match
a 1 1 1 1
b 1 2 1 0
c 2 2 2 1
Alternatively, use pd.Series.nunique to count the number of unique values in each row.
In [56]: (df.apply(pd.Series.nunique, axis=1) == 1).astype(int)
Out[56]:
a 1
b 0
c 1
dtype: int32
Or, take the first column's values with df.iloc[:, 0] and compare the whole frame against them with eq:
In [57]: df.eq(df.iloc[:, 0], axis=0).all(axis=1).astype(int)
Out[57]:
a 1
b 0
c 1
dtype: int32
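On newer pandas (0.20+), DataFrame.nunique also accepts axis=1 directly, so the row-wise apply can be skipped entirely; a sketch on the original three columns:
# count distinct values per row and flag rows where the count is 1
df['match'] = df[['red', 'blue', 'green']].nunique(axis=1).eq(1).astype(int)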

You can try this:
df["match"] = df.apply(lambda x: int(x[0]==x[1]==x[2]), axis=1)
where:
x.iloc[0] == x.iloc[1] == x.iloc[2]: tests the first three columns for equality (positional .iloc avoids the deprecated integer lookup on a labeled row)
axis=1: apply the function row-wise, i.e. across the columns
Alternatively, you can call the column by their name too:
df["match"] = df.apply(lambda x: int(x["red"]==x["blue"]==x["green"]), axis=1)
This is more convenient if you have many columns and want to compare only a subset of them, without relying on their positions.
If you want to compare all the columns, use John Galt's solution.
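For a subset, a minimal sketch (the chosen columns here are just for illustration):
cols = ['red', 'blue']                    # any subset of column names
sub = df[cols]
# 1 where every column in the subset equals the first one, else 0
df['match'] = sub.eq(sub.iloc[:, 0], axis=0).all(axis=1).astype(int)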

Related

How do I iteratively select rows in pandas based on column values?

I'm a complete newbie at pandas, so a simpler (though maybe not the most efficient or elegant) solution is appreciated. I don't mind a bit of brute force if it helps me understand the answer better.
If I have the following Dataframe:
A B C
0 0 1
0 1 1
I want to loop through columns "A", "B" and "C" in that order, and during each iteration select all the rows for which the current column is "1" and none of the previous columns are, saving the result and also using it in the next iteration.
So when looking at column A, I wouldn't select anything. Then when looking at column B I would select the second row because B==1 and A==0. Then when looking at column C I would select the first row because A==0 and B==0.
Create a boolean mask:
m = (df == 1) & (df.cumsum(axis=1) == 1)
d = {col: df[m[col]].index.tolist() for col in df.columns if m[col].sum()}
Output:
>>> m
A B C
0 False False True
1 False True False
2 False False True
>>> d
{'B': [1], 'C': [0, 2]}
I slightly modified your dataframe:
>>> df
A B C
0 0 0 1
1 0 1 1
2 0 0 1
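A sketch that rebuilds that modified frame, if you want to run the snippets yourself:
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0],
                   'B': [0, 1, 0],
                   'C': [1, 1, 1]})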
Update
For the expected output on my sample:
for col in df.columns:
    if m[col].sum():
        print(f"\n=== {col} ===")
        print(df[m[col]])
Output:
=== B ===
A B C
1 0 1 1
=== C ===
A B C
0 0 0 1
2 0 0 1
Seems like you need a direct use of idxmax
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
>>> df.idxmax()
A 0
B 1
C 0
dtype: int64
The values above are the row positions at which your constraints are met: 1 for B means the second row was "selected", and 0 for C means the first row was. The only issue is that if a column contains no 1 at all, idxmax still returns 0 (the position of its first maximum).
To address that, you can use where
>>> df.idxmax().where(~df.eq(0).all())
This will make sure that NaNs are returned for all-zero columns.
A NaN
B 1.0
C 0.0
dtype: float64
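If you would rather drop the all-zero columns than keep NaNs, one possible follow-up on the same idea (my addition, not part of the original answer):
# keep only columns that actually contain a 1, with integer row positions
hits = df.idxmax().where(~df.eq(0).all()).dropna().astype(int)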

How to keep columns based on a given row's values

Here is how the data looks in the df dataframe:
A B C D
0.js 2 1 1 -1
1.js 3 -5 1 -4
total 5 -4 2 -5
And I would like to get a new dataframe df1:
A C
0.js 2 1
1.js 3 1
total 5 2
So basically it should look like this:
df1 = df[df["total"] > 0]
but it should filter on a row instead of a column, and I can't figure it out.
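To make the answers below easy to run, a sketch of the frame (index labels taken from the question):
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 5], 'B': [1, -5, -4],
                   'C': [1, 1, 2], 'D': [-1, -4, -5]},
                  index=['0.js', '1.js', 'total'])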
You want to use .loc[:, column_mask] i.e.
In [11]: df.loc[:, df.sum() > 0]
Out[11]:
A C
0.js 2 1
1.js 3 1
total 5 2
# or, building the mask from the 'total' row (the last row):
In [12]: df.loc[:, df.iloc[-1] > 0]
Out[12]:
A C
0.js 2 1
1.js 3 1
total 5 2
Use .where to set non-positive values to NaN and then dropna with axis=1:
df.where(df.gt(0)).dropna(axis=1)
A C
0.js 2 1
1.js 3 1
total 5 2
Note that this drops any column containing a non-positive value anywhere, which happens to coincide with filtering on the 'total' row for this data.
You can use loc with boolean indexing, or reindex:
df.loc[:, df.columns[(df.loc['total'] > 0)]]
OR
df.reindex(df.columns[(df.loc['total'] > 0)], axis=1)
Output:
A C
0.js 2 1
1.js 3 1
total 5 2

Is there a way to put a dataframe as the value of a specific column in pandas python?

I have a set of data that has column names and values to create a dataframe.
However, one of the column values is another dataframe. Is it possible to do this in pandas, or is each column value meant to be a single value?
For example what I am trying to achieve would look something like this;
df
out:
A B C
0 A1 B1  D  E
         D1 E1
         F  G
         F1 G1
This is where letters that have numbers with them are the values, and just letters are the column names.
Yes it is possible to put another dataframe (or any type of object) in a pandas cell.
In[2]: df1 = pd.DataFrame({'a':range(2)})
df1
Out[2]:
a
0 0
1 1
In[3]: df2 = pd.DataFrame({'x':range(3), 'y':range(3)})
df2
Out[3]:
x y
0 0 0
1 1 1
2 2 2
In[4]: df1['b'] = [df2, {'cat':'meow', 'otter':'clap'}]
df1
Out[4]:
a b
0 0 x y 0 0 0 1 1 1 2 2 2
1 1 {u'otter': u'clap', u'cat': u'meow'}
In[5]: df1.at[0, 'b']   # .get_value() was removed in modern pandas; .at is its replacement
Out[5]:
x y
0 0 0
1 1 1
2 2 2
As you can see, it's not very readable to print a dataframe containing another dataframe. If you want it to look like your example, you should go with a MultiIndex, as Wen suggested.
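For illustration, a minimal MultiIndex sketch of that flatter layout (my own guess at the column structure, not the code from Wen's answer):
import pandas as pd

# top-level columns A, B, C; C gets the sub-columns D and E
cols = pd.MultiIndex.from_tuples([('A', ''), ('B', ''), ('C', 'D'), ('C', 'E')])
df = pd.DataFrame([['A1', 'B1', 'D1', 'E1']], columns=cols)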

Finding duplicate rows in a Pandas Dataframe then Adding a column in the Dataframe that states if the row is a duplicate

I have a pandas dataframe that contains a column with possible duplicates. I would like to create a column that will produce a 1 if the row is duplicate and 0 if it is not.
So if I have:
A|B
1 1|x
2 2|y
3 1|x
4 3|z
I would get:
A|B|C
1 1|x|1
2 2|y|0
3 1|x|1
4 3|z|0
I tried df['C'] = np.where(df['A']==df['A'], '1', '0') but this just created a column of all 1's in C.
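Rebuilding the frame from the question, so the answers below can be run as-is (index values taken from the listing):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 3], 'B': ['x', 'y', 'x', 'z']},
                  index=[1, 2, 3, 4])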
You need Series.duplicated with the parameter keep=False to mark all duplicates first, then cast the boolean mask (Trues and Falses) to 1s and 0s with astype(int), and if necessary cast to str:
df['C'] = df['A'].duplicated(keep=False).astype(int).astype(str)
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
If you need to check duplicates in columns A and B together, use DataFrame.duplicated:
df['C'] = df.duplicated(subset=['A','B'], keep=False).astype(int).astype(str)
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
And a numpy.where solution:
df['C'] = np.where(df['A'].duplicated(keep=False), '1', '0')
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
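Another possibility, if you'd rather avoid duplicated: count the group sizes and flag groups larger than one. A sketch that gives the same 1/0 result on this data (as integers; cast to str if needed):
df['C'] = df.groupby(['A', 'B'])['A'].transform('size').gt(1).astype(int)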

concat two dataframes using python

We have one dataframe like
-0.140447131 0.124802527 0.140780106
0.062166349 -0.121484447 -0.140675515
-0.002989106 0.13984927 0.004382326
and the other as
1
1
2
We need to concat both dataframes like
-0.140447131 0.124802527 0.140780106 1
0.062166349 -0.121484447 -0.140675515 1
-0.002989106 0.13984927 0.004382326 2
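One way to rebuild the two frames; the column names a/b/c and d are assumptions that match the labels used in the first answer below:
import pandas as pd

df1 = pd.DataFrame([[-0.140447131, 0.124802527, 0.140780106],
                    [0.062166349, -0.121484447, -0.140675515],
                    [-0.002989106, 0.13984927, 0.004382326]],
                   columns=['a', 'b', 'c'])
df2 = pd.DataFrame({'d': [1, 1, 2]})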
Let's say your first dataframe is like
In [281]: df1
Out[281]:
a b c
0 -0.140447 0.124803 0.140780
1 0.062166 -0.121484 -0.140676
2 -0.002989 0.139849 0.004382
And, the second like,
In [283]: df2
Out[283]:
d
0 1
1 1
2 2
Then you could create a new column for df1 using df2:
In [284]: df1['d_new'] = df2['d']
In [285]: df1
Out[285]:
a b c d_new
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 1
2 -0.002989 0.139849 0.004382 2
The assumption, however, is that both dataframes share a common index.
Use pd.concat and set axis=1 to concatenate along columns:
df_new = pd.concat([df1, df2], axis=1)
>>> df_new
a b c d
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 1
2 -0.002989 0.139849 0.004382 2
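If the two frames do not share a common index (say, one came out of a filter), aligning them positionally first is one option; a sketch:
# drop both indexes so concat pairs rows by position instead of by label
df_new = pd.concat([df1.reset_index(drop=True),
                    df2.reset_index(drop=True)], axis=1)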
