Is there any way to add a column with a specific condition? - python

There is a DataFrame.
I would like to add a column 'e' after checking the condition below:
if a row's 'c' value is in column 'a' AND its 'd' value is in column 'b' at the same row (i.e. the ('c', 'd') pair appears somewhere as an ('a', 'b') pair), then the value of 'e' is 'OK';
else ''.
import pandas as pd
import numpy as np
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9']}
df = pd.DataFrame(A)
The result I want to get is
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9'], 'e':['OK','','','']}

You can merge df with itself on ['a', 'b'] on the left and ['c', 'd'] on the right. If index 'survives' the merge, then e should be OK:
df['e'] = np.where(
    df.index.isin(df.merge(df, left_on=['a', 'b'], right_on=['c', 'd']).index),
    'OK', '')
df
Output:
   a  b  c  d   e
0  0  4  1  1  OK
1  2  5  2  4
2  1  1  3  2
3  4  7  6  9
P.S. Before the merge, we need to convert a and b columns to str type (or c and d to numeric), so that we can compare c and a, and d and b:
df[['a', 'b']] = df[['a', 'b']].astype(str)
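As an alternative sketch (not part of the answer above, just for comparison), you can build the set of (a, b) pairs and test each row's (c, d) pair against it. This assumes df as defined in the question, with 'a' and 'b' already cast to str as just shown:
# Alternative sketch: membership test against the set of (a, b) pairs.
# Assumes df from the question with 'a'/'b' already converted to str.
pairs = set(zip(df['a'], df['b']))
df['e'] = ['OK' if (c, d) in pairs else '' for c, d in zip(df['c'], df['d'])]
print(df)
#    a  b  c  d   e
# 0  0  4  1  1  OK
# 1  2  5  2  4
# 2  1  1  3  2
# 3  4  7  6  9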

Related

Determining if a Pandas dataframe row has multiple specific values

I have a Pandas data frame represented by the one below:
  A  B  C  D
| 1  1  1  3 |
| 1  1  1  2 |
| 2  3  4  5 |
I need to iterate through this data frame, looking for rows where the values in columns A, B, and C match; if they do, compare the values in column D for those rows and delete the row with the smaller value. So the example above would look like this afterwards:
  A  B  C  D
| 1  1  1  3 |
| 2  3  4  5 |
I've written the following code, but something isn't right and it's causing an error. It also looks more complicated than it may need to be, so I am wondering if there is a better, more concise way to write this.
for col, row in df.iterrows():
    df1 = df.copy()
    df1.drop(col, inplace = True)
    for col1, row1 in df1.iterrows():
        if df[0].iloc[col] == df1[0].iloc[col1] & df[1].iloc[col] == df1[1].iloc[col1] & \
           df[2].iloc[col] == df1[2].iloc[col1] & df1[3].iloc[col1] > df[3].iloc[col]:
            df.drop(col, inplace = True)
Here is one solution:
df[~((df[['A', 'B', 'C']].duplicated(keep=False)) &
     (df.groupby(['A', 'B', 'C'])['D'].transform(min) == df['D']))]
Explanation:
df[['A', 'B', 'C']].duplicated(keep=False)
returns a mask for rows with duplicated values of ['A', 'B', 'C'] columns
df.groupby(['A', 'B', 'C'])['D'].transform(min)==df['D']
returns a mask for rows that have the minimum value for ['D'] column, for each group of ['A', 'B', 'C']
The combination of these masks selects all rows with duplicated ['A', 'B', 'C'] values and the minimum 'D' for their group. With ~ we select all rows except those.
Result for the provided input:
   A  B  C  D
0  1  1  1  3
2  2  3  4  5
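For clarity, here is the same logic as a step-by-step sketch (my own breakdown of the one-liner above), with the two masks computed separately:
import pandas as pd

# Step-by-step sketch of the two masks described above.
df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 1, 3], 'C': [1, 1, 4], 'D': [3, 2, 5]})

dup_mask = df[['A', 'B', 'C']].duplicated(keep=False)                  # rows whose A/B/C combination repeats
min_mask = df.groupby(['A', 'B', 'C'])['D'].transform(min) == df['D']  # smallest D within each A/B/C group

print(df[~(dup_mask & min_mask)])  # drop the duplicated rows that hold the group minimum of D
#    A  B  C  D
# 0  1  1  1  3
# 2  2  3  4  5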
You can group by all the columns that have to be equal (using groupby(['A', 'B', 'C'])) and then exclude the row with the minimum value of D (using func) whenever a group has more than one unique value, which gives the boolean indices of the rows to be retained:
def func(x):
    if len(x.unique()) != 1:
        return x != x.min()
    else:
        return x == x

df[df.groupby(['A', 'B', 'C'])['D'].apply(lambda x: func(x))]
   A  B  C  D
0  1  1  1  3
2  2  3  4  5
If only the row with the maximum group value in D has to be retained, then you can use the following:
df[df.groupby(['A', 'B', 'C'])['D'].apply(lambda x: x == x.max())]
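A roughly equivalent sketch using transform instead of apply (my own variant, not the answer's code), which keeps the rows whose D equals the group maximum:
# Variant sketch: boolean mask via transform('max') rather than groupby.apply.
keep_max = df.groupby(['A', 'B', 'C'])['D'].transform('max') == df['D']
print(df[keep_max])
#    A  B  C  D
# 0  1  1  1  3
# 2  2  3  4  5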

How to groupby and join multiple rows from multiple columns at a time?

I want to know how to group by a single column and join the strings of multiple columns for each group.
Here's an example dataframe:
df = pd.DataFrame(np.array([['a', 'a', 'b', 'b'], [1, 1, 2, 2],
                            ['k', 'l', 'm', 'n']]).T,
                  columns=['a', 'b', 'c'])
print(df)
   a  b  c
0  a  1  k
1  a  1  l
2  b  2  m
3  b  2  n
I've tried something like,
df.groupby(['b', 'a'])['c'].apply(','.join).reset_index()
   b  a    c
0  1  a  k,l
1  2  b  m,n
But that is not my required output.
Desired output:
   a    b    c
0  1  a,a  k,l
1  2  b,b  m,n
How can I achieve this? I need a scalable solution because I'm dealing with millions of rows.
I think you need to group by column b only and then, if necessary, pass the list of columns to the aggregation with GroupBy.agg:
df1 = df.groupby('b')['a','c'].agg(','.join).reset_index()
#alternative if want join all columns without b
#df1 = df.groupby('b').agg(','.join).reset_index()
print (df1)
   b    a    c
0  1  a,a  k,l
1  2  b,b  m,n
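For completeness, here is a self-contained sketch of the same approach; note that newer pandas versions require the grouped columns to be selected with a list:
import numpy as np
import pandas as pd

# Self-contained sketch of the groupby/agg join above (list-style column
# selection, which newer pandas versions require).
df = pd.DataFrame(np.array([['a', 'a', 'b', 'b'], [1, 1, 2, 2],
                            ['k', 'l', 'm', 'n']]).T,
                  columns=['a', 'b', 'c'])

df1 = df.groupby('b')[['a', 'c']].agg(','.join).reset_index()
print(df1)
#    b    a    c
# 0  1  a,a  k,l
# 1  2  b,b  m,n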

pandas: Extracting the index of the maximum value in an expanding window

In a pandas DataFrame, I can create a Series B with the maximum value of another Series A, from the first row to the current one, by using an expanding window:
df['B'] = df['A'].expanding().max()
I can also extract the value of the index of the maximum overall value of Series A:
idx_max_A = df['A'].idxmax().value
What I want is an efficient way to combine both; that is, to create a Series B that holds the value of the index of the maximum value of Series A from the first row up to the current one. Ideally, something like this...
df['B'] = df['A'].expanding().idxmax().value
...but, of course, the above fails because the Expanding object does not have idxmax. Is there a straightforward way to do this?
EDIT: For illustration purposes, for the following DataFrame...
df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])
...I'd like to create an additional column B so that the DataFrame contains the following:
   A  B
a  1  a
b  2  b
c  1  b
d  3  d
e  0  d
I believe you can use expanding + max + groupby:
v = df.expanding().max().A
df['B'] = v.groupby(v).transform('idxmax')
df
   A  B
a  1  a
b  2  b
c  1  b
d  3  d
e  0  d
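A short sketch of the intermediate steps (my own breakdown), using the example DataFrame from the question:
# v is the running maximum of A: a=1, b=2, c=2, d=3, e=3.
v = df.expanding().max().A
# Grouping v by its own values, idxmax returns the first index label at which
# each running maximum appears; transform('idxmax') broadcasts that label back
# onto every row of the group, giving B = ['a', 'b', 'b', 'd', 'd'].
print(v.groupby(v).idxmax())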
It seems idxmax is a function in the latest version of pandas, which I don't have yet. Here's a solution not involving groupby or idxmax:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])
temp = df.A.expanding().max()
df['B'] = temp.apply(lambda x: temp[temp == x].index[0])
df
   A  B
a  1  a
b  2  b
c  1  b
d  3  d
e  0  d
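Another possible sketch (my own, not from either answer): compute the positions where the running maximum changes with NumPy and forward-fill them. Note that on ties it points at the most recent occurrence of the maximum, whereas idxmax picks the first:
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])

a = df['A'].to_numpy()
is_new_max = a >= np.maximum.accumulate(a)                        # True where the running max is (re)set
pos = np.maximum.accumulate(np.where(is_new_max, np.arange(len(a)), 0))
df['B'] = df.index[pos]                                           # B = ['a', 'b', 'b', 'd', 'd']
print(df)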

How can I mark a row when it meets a condition?

If I have a dataframe,
df = pd.DataFrame({
    'name' : ['A', 'B', 'C'],
    'john_01' : [1, 2, 3],
    'mary_02' : [4, 5, 6],
})
I'd like to attach a '#' mark to the name if the value in column 'name' is in a list containing 'A' and 'B'. Then I can see something like the result below. Does anyone know how to do this with pandas in an elegant way?
name_list = ['A','B','D'] # But we only have A and B in df.
   john_01  mary_02 name
0        1        4   #A
1        2        5   #B
2        3        6    C
If name_list is the same length as the Series name, then you could try this:
df1['name_list'] = ['A','B','D']
df1.loc[df1.name == df1.name_list, 'name'] = '#' + df1.name
This would only prepend a '#' when the value of name and name_list are the same for the current index.
In [81]: df1
Out[81]:
   john_01  mary_02 name name_list
0        1        4   #A         A
1        2        5   #B         B
2        3        6    C         D
In [82]: df1.drop('name_list', axis=1, inplace=True) # Drop assist column
If the two are not the same length - and therefore you don't care about index - then you could try this:
In [84]: name_list = ['A','B','D']
In [87]: df1.loc[df1.name.isin(name_list), 'name'] = '#' + df1.name
In [88]: df1
Out[88]:
   john_01  mary_02 name
0        1        4   #A
1        2        5   #B
2        3        6    C
I hope this helps.
Use df.loc[row_indexer,column_indexer] operator with isin method of a Series object:
df.loc[df.name.isin(name_list), 'name'] = '#'+df.name
print(df)
The output:
   john_01  mary_02 name
0        1        4   #A
1        2        5   #B
2        3        6    C
http://pandas.pydata.org/pandas-docs/stable/indexing.html
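For completeness, a self-contained sketch of the loc/isin approach, with df and name_list taken from the question:
import pandas as pd

df = pd.DataFrame({
    'name' : ['A', 'B', 'C'],
    'john_01' : [1, 2, 3],
    'mary_02' : [4, 5, 6],
})
name_list = ['A', 'B', 'D']   # only A and B are present in df

df.loc[df['name'].isin(name_list), 'name'] = '#' + df['name']
print(df)
#   name  john_01  mary_02
# 0   #A        1        4
# 1   #B        2        5
# 2    C        3        6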
You can use isin to check whether the name is in the list, and use numpy.where to prepend #:
df['name'] = np.where(df['name'].isin(name_list), '#', '') + df['name']
df
Out:
   john_01  mary_02 name
0        1        4   #A
1        2        5   #B
2        3        6    C
import pandas as pd

def exclude_list(x):
    list_exclude = ['A', 'B']
    if x in list_exclude:
        x = '#' + x
    return x

df = pd.DataFrame({
    'name' : ['A', 'B', 'C'],
    'john_01' : [1, 2, 3],
    'mary_02' : [4, 5, 6],
})

df['name'] = df['name'].apply(lambda row: exclude_list(row))
print(df)

How to drop rows not containing string type in a column in Pandas?

I have a csv file with four columns. I read it like this:
df = pd.read_csv('my.csv', error_bad_lines=False, sep='\t', header=None, names=['A', 'B', 'C', 'D'])
Now, field C contains string values, but some rows contain non-string (float or numeric) values instead. How can I drop those rows? I'm using version 0.18.1 of Pandas.
Setup
df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
print df
   A  B    C  D
0  a  b    c  d
1  e  f  1.2  g
Notice you can see what the individual cell types are.
print type(df.loc[0, 'C']), type(df.loc[1, 'C'])
<type 'str'> <type 'float'>
mask and slice
print df.loc[df.C.apply(type) != float]
   A  B  C  D
0  a  b  c  d
more general
print df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))]
   A  B  C  D
0  a  b  c  d
You could also attempt a float conversion to determine whether the value can be interpreted as a float:
def try_float(x):
    try:
        float(x)
        return True
    except:
        return False

print df.loc[~df.C.apply(try_float)]
   A  B  C  D
0  a  b  c  d
The problem with this approach is that you'll exclude strings that can be interpreted as floats.
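A tiny illustration of that caveat, using a hypothetical extra row and the try_float helper defined above: the string '1.5' passes try_float and would be dropped even though it is a str, while the type check keeps it.
import pandas as pd

# df2 is a hypothetical example frame with an extra row whose C is the string '1.5'.
df2 = pd.DataFrame([['a', 'b', 'c', 'd'],
                    ['e', 'f', 1.2, 'g'],
                    ['h', 'i', '1.5', 'j']], columns=list('ABCD'))

print(df2.loc[~df2.C.apply(try_float)])      # keeps only the row with C == 'c'
print(df2.loc[df2.C.apply(type) != float])   # the type check also keeps C == '1.5'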
I compared times for the few options I've provided, and also jezrael's solution, on small dataframes and on a dataframe with 500,000 rows.
Checking whether the type is float seems to be the most performant, with the numeric check right behind it. If you need to handle both int and float, I'd go with jezrael's answer. If you can get away with checking for float only, use that one.
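If you want to run the comparison yourself, here is a minimal timing-harness sketch (my own; no numbers claimed):
import timeit
import pandas as pd

# Build a larger frame and time the float-type check against the to_numeric mask.
big = pd.DataFrame({'C': ['a', 1.2] * 250000})

t_type = timeit.timeit(lambda: big.C.apply(type) != float, number=10)
t_numeric = timeit.timeit(lambda: pd.to_numeric(big.C, errors='coerce').isnull(), number=10)
print(t_type, t_numeric)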
You can use boolean indexing with a mask created by to_numeric with the parameter errors='coerce' - you get NaN where there are string values. Then check isnull:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':['a',8,9],
                   'D':[1,3,5]})
print (df)
   A  B  C  D
0  1  4  a  1
1  2  5  8  3
2  3  6  9  5
print (pd.to_numeric(df.C, errors='coerce'))
0    NaN
1    8.0
2    9.0
Name: C, dtype: float64
print (pd.to_numeric(df.C, errors='coerce').isnull())
0     True
1    False
2    False
Name: C, dtype: bool
print (df[pd.to_numeric(df.C, errors='coerce').isnull()])
   A  B  C  D
0  1  4  a  1
Use pandas.DataFrame.select_dtypes method.
Ex.
df.select_dtypes(exclude='object')
or
df.select_dtypes(include=['int64','float','int'])
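A quick usage sketch on the setup frame from the first answer; note that select_dtypes filters columns by dtype (a mixed str/float column has object dtype), so dropping individual rows still needs one of the row masks shown above:
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
print(df.dtypes)                           # every column is object dtype here
print(df.select_dtypes(exclude='object'))  # -> empty: no column is purely numeric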
