Drop columns if rows contain a specific value in Pandas - python

I am starting to learn Pandas. I have seen a lot of questions here on SO where people ask how to delete a row if a column matches a certain value.
In my case it is the opposite. Imagine having this dataframe:
What I want is: if any column contains the value salty in any of its rows, that column should be deleted, giving as a result:
I have tried several variations of this:
if df.loc[df['A'] == 'salty']:
    df.drop(df.columns[0], axis=1, inplace=True)
But I am quite lost trying to find documentation on how to delete a column based on a row value of that column. That code is a mix of finding a specific column and always deleting the first column (my idea was to search for the value in each row of a column, for ALL columns, in a for loop).

Perform a comparison across your values, then use DataFrame.any to get a mask to index:
df.loc[:, ~(df == 'Salty').any()]
If you insist on using drop, this is how you need to do it. Pass the labels of the matching columns:
df.drop(columns=df.columns[(df == 'Salty').any()])
df = pd.DataFrame({
    'A': ['Mountain', 'Salty'], 'B': ['Lake', 'Hotty'], 'C': ['River', 'Coldy']})
df
A B C
0 Mountain Lake River
1 Salty Hotty Coldy
(df == 'Salty').any()
A True
B False
C False
dtype: bool
df.loc[:, ~(df == 'Salty').any()]
B C
0 Lake River
1 Hotty Coldy
df.columns[(df == 'Salty').any()]
# Index(['A'], dtype='object')
df.drop(columns=df.columns[(df == 'Salty').any()])
B C
0 Lake River
1 Hotty Coldy

The following locates the index of the rows where your desired column matches a specific value and then drops those rows. I think this is probably the more straightforward way of accomplishing this:
df.drop(df.loc[df['Your column name here'] == 'Match value'].index, inplace=True)
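To illustrate (a small sketch, reusing the example frame from the answer above, not part of the original post), note that this removes the matching rows and leaves the columns in place:
import pandas as pd

df = pd.DataFrame({'A': ['Mountain', 'Salty'],
                   'B': ['Lake', 'Hotty'],
                   'C': ['River', 'Coldy']})

# drops row 1, where column 'A' equals 'Salty'; all columns survive
df.drop(df.loc[df['A'] == 'Salty'].index, inplace=True)
print(df)
#           A     B      C
# 0  Mountain  Lake  River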

Here's one possibility:
df = df.drop([col for col in df.columns if df[col].eq('Salty').any()], axis=1)
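Applied to the same example frame (a small sketch, reusing the dataframe from the first answer), this keeps only the columns that never contain 'Salty':
import pandas as pd

df = pd.DataFrame({'A': ['Mountain', 'Salty'],
                   'B': ['Lake', 'Hotty'],
                   'C': ['River', 'Coldy']})

# drop every column that contains 'Salty' in any of its rows
df = df.drop([col for col in df.columns if df[col].eq('Salty').any()], axis=1)
print(df)
#        B      C
# 0   Lake  River
# 1  Hotty  Coldy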

Related

Concatenating values into column from multiple rows

I have a dataframe containing only duplicate "MainID" rows. One MainID may have multiple secondary IDs (SecID). I want to concatenate the values of SecID if there is a common MainID, joined by ':' in SecID col. What is the best way of achieving this? Yes, I know this is not best practice, however it's the structure the software wants.
I need to keep the df structure and the values in the rest of the df; they will always match the other duplicated row. Only SecID will be different.
Current:
data={'MainID':['NHFPL0580','NHFPL0580','NHFPL0582','NHFPL0582'],'SecID':['G12345','G67890','G11223','G34455'], 'Other':['A','A','B','B']}
df=pd.DataFrame(data)
print(df)
MainID SecID Other
0 NHFPL0580 G12345 A
1 NHFPL0580 G67890 A
2 NHFPL0582 G11223 B
3 NHFPL0582 G34455 B
Intended Structure
MainID SecID Other
NHFPL0580 G12345:G67890 A
NHFPL0582 G11223:G34455 B
Try:
df.groupby('MainID').apply(lambda x: ':'.join(x.SecID))
The above code returns a pd.Series; as @Guy suggested, you need .reset_index(name='SecID') if you want it back as a DataFrame.
The solution to the edited question:
df = df.groupby(['MainID', 'Other']).apply(lambda x: ':'.join(x.SecID)).reset_index(name='SecID')
You can then change the column order
cols = df.columns.tolist()
df = df[[cols[i] for i in [0, 2, 1]]]
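Putting the pieces together, a runnable sketch with the data from the question:
import pandas as pd

data = {'MainID': ['NHFPL0580', 'NHFPL0580', 'NHFPL0582', 'NHFPL0582'],
        'SecID': ['G12345', 'G67890', 'G11223', 'G34455'],
        'Other': ['A', 'A', 'B', 'B']}
df = pd.DataFrame(data)

# join the SecID values that share a MainID (and Other), then restore the column order
df = df.groupby(['MainID', 'Other']).apply(lambda x: ':'.join(x.SecID)).reset_index(name='SecID')
df = df[['MainID', 'SecID', 'Other']]
print(df)
#       MainID          SecID Other
# 0  NHFPL0580  G12345:G67890     A
# 1  NHFPL0582  G11223:G34455     B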

Compare values of one column of dataframe in another dataframe

I have 2 dataframes. df1 is
DATE
2020-05-20
2020-05-21
and df2 is
ID NAME DATE
1 abc 2020-05-20
2 bcd 2020-05-20
3 ggg 2020-05-25
4 jhg 2020-05-26
I want to compare the values of df1 with df2. For example, take the first value of df1, i.e. 2020-05-20, find it in df2, filter on it, and show the subset of filtered rows.
My code is
for index, row in df1.iterrows():
    x = row['DATE']
    if x == df2['DATE']:
        print('Found')
        new = df2[df2['DATE'] == x]
        print(new)
    else:
        print('Not Found')
But I am getting the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
x == df2['DATE'] is a pd.Series (of Booleans), not a single value. You have to reduce that to a single Boolean value in order to evaluate that in a condition.
You can either use .any() or .all() depending on what you need. I assumed you need .any() here.
for index, row in df1.iterrows():
    x = row['DATE']
    if (x == df2['DATE']).any():
        print('Found')
        new = df2[df2['DATE'] == x]
        print(new)
    else:
        print('Not Found')
Also see here for a pure pandas solution for this.
You can create one extra column in df1 and use np.where to fill it.
import numpy as np
df1['Match'] = np.where(df1.DATE.isin(df2.DATE),'Found', 'Not Found')
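With the two frames from the question, this sketch tags each date in df1:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'DATE': ['2020-05-20', '2020-05-21']})
df2 = pd.DataFrame({'NAME': ['abc', 'bcd', 'ggg', 'jhg'],
                    'DATE': ['2020-05-20', '2020-05-20', '2020-05-25', '2020-05-26']})

# 'Found' where the date exists anywhere in df2, 'Not Found' otherwise
df1['Match'] = np.where(df1.DATE.isin(df2.DATE), 'Found', 'Not Found')
print(df1)
#          DATE      Match
# 0  2020-05-20      Found
# 1  2020-05-21  Not Found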
This can also be done as a merge, which I think makes it a bit clearer since it's a single line with no branching. You can also add the validate parameter to make sure that each key is unique in either the left or right dataset.
import pandas
df1 = pandas.DataFrame(['2020-05-20', '2020-05-21'], columns=['DATE'])
df2 = pandas.DataFrame({'Name': ['abc', 'bcd', 'ggg', 'jgh'],
                        'DATE': ['2020-05-20', '2020-05-20', '2020-05-25', '2020-05-26']})
df3 = df1.merge(right=df2, on='DATE', how='left')
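For example, adding indicator=True (and validate='one_to_many', since dates repeat in df2) yields a per-row flag much like 'Found'/'Not Found'; a sketch building on the merge above:
import pandas

df1 = pandas.DataFrame(['2020-05-20', '2020-05-21'], columns=['DATE'])
df2 = pandas.DataFrame({'Name': ['abc', 'bcd', 'ggg', 'jgh'],
                        'DATE': ['2020-05-20', '2020-05-20', '2020-05-25', '2020-05-26']})

# _merge is 'both' for dates present in df2 and 'left_only' otherwise
df3 = df1.merge(right=df2, on='DATE', how='left',
                validate='one_to_many', indicator=True)
print(df3)
#          DATE Name     _merge
# 0  2020-05-20  abc       both
# 1  2020-05-20  bcd       both
# 2  2020-05-21  NaN  left_only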

Check for populated columns and delete values based on hierarchy?

I have a dataframe (very simplified version below):
d = {'col1': [1, '', 2], 'col2': ['', '', 3], 'col3': [4, 5, 6]}
df = pd.DataFrame(data=d)
I need to loop through the dataframe and check how many columns are populated per row. If the row has just one column populated, then I can continue on to the next row. If, however, the row has more than one non-NaN value, I need to make all the columns into NaNs apart from one, based on some hierarchy.
For example, let's say the hierarchy is:
col1 is the most important
col2 second etc.
Therefore, if there were two or more columns with data and one of them happened to be column 1, I would drop all other column values, otherwise I would defer to check if col2 has a value etc and then repeat for the next row.
I have something like this as an idea:
nrows = df.shape[0]
for index in range(0, nrows):
    print(index)
    # check if the row has only one column populated
    if (df.iloc[[index]].notna().sum() == 1):
        continue
    # check if more than one column is populated for that row
    elif (df.iloc[[index]].notna().sum() >= 1):
        if (index['col1'].notna() == True):
            df.loc[:, df.columns != 'col1'] == 'NaN'
            # continue down the hierarchy
but this is not correct, as it gives True/False for every column and I cannot read it the way I need.
Any suggestions very welcome! I was thinking of creating some sort of key, but I feel there may be a simpler way to get there with the code I already have?
Edit:
Another important point which I should have included is that my index is not integers - it is unique identifiers which look something like this: '123XYZ', which is why I used range(0,n) and reshaped the df.
I didn't test this thoroughly, but something like this should work:
import numpy as np
import pandas as pd

hierarchy = ['col1', 'col2', 'col3']

# rows with two or more populated (non-NaN) values are the ones that need collapsing
inds = df.notna().sum(axis=1)
inds = inds[inds >= 2].index

for i in inds:                        # i is an index label, so use .loc rather than .iloc
    for col in hierarchy:
        if not pd.isna(df.loc[i, col]):
            tmp = df.loc[i, col]      # remember the highest-priority populated value
            df.loc[i, :] = np.nan     # blank out the whole row
            df.loc[i, col] = tmp      # restore only that value
            break                     # move on to the next row
Note I'm assuming that you actually mean NaN and not the empty string like you have in your example. If you want to look for empty strings, then inds and the if statement above would change.
I also think this should be faster than checking every row, since it only loops through the rows with more than one populated value.
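If speed matters, a vectorized sketch that avoids the Python loop entirely (my own variation, again assuming NaN rather than empty strings, with a hypothetical string index as described in the edit):
import numpy as np
import pandas as pd

# hypothetical frame with a non-integer index, as described in the edit
df = pd.DataFrame({'col1': [1, np.nan, 2],
                   'col2': [np.nan, np.nan, 3],
                   'col3': [4, 5, 6]},
                  index=['123XYZ', '456ABC', '789DEF'])

hierarchy = ['col1', 'col2', 'col3']

keep = df[hierarchy].notna().idxmax(axis=1)   # highest-priority populated column per row
multi = df.notna().sum(axis=1) >= 2           # rows that actually need collapsing

# cell-level mask: keep everything in single-value rows, only the chosen column elsewhere
mask = pd.DataFrame({c: keep.eq(c) for c in df.columns}, index=df.index)
mask.loc[~multi] = True

out = df.where(mask)                          # NaN wherever the mask is False
print(out)
#         col1  col2  col3
# 123XYZ   1.0   NaN   NaN
# 456ABC   NaN   NaN   5.0
# 789DEF   2.0   NaN   NaN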

Update dataframe according to another dataframe based on certain conditions

I have two dataframes, df1 and df2. df1 has columns A, B, C, D, E, F and df2 has A, B, J, D, E, K. I want to update the second dataframe with the rows of the first, but only when the first two columns have the same value in both dataframes. For each row where the following two conditions are true:
df1.A = df2.A
df1.B = df2.B
then update accordingly:
df2.D = df1.D
df2.E = df1.E
My dataframes have different number of rows.
When I tried this code I get a TypeError: cannot do positional indexing with these indexers of type 'str'.
for a in df1:
    for t in df2:
        if df1.iloc[a]['A'] == df2.iloc[t]['A'] and df1.iloc[a]['B'] == df2.iloc[t]['B']:
            df2.iloc[t]['D'] = df1.iloc[a]['D']
            df2.iloc[t]['E'] = df1.iloc[a]['E']
The Question:
You'd be better served merging the dataframes than doing nested iteration.
df2 = df2.merge(df1[['A', 'B', 'D', 'E']], on=['A', 'B'], how='left', suffixes=['_old', ''])
df2['D'] = df2['D'].fillna(df2['D_old'])
df2['E'] = df2['E'].fillna(df2['E_old'])
del df2['D_old']
del df2['E_old']
The first line attaches new D and E columns to df2, taking the values from the matching rows of df1 and renaming df2's old columns with an '_old' suffix.
The next two lines fill in the rows for which df1 had no matching row, and the last two delete the now outdated versions of the columns.
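As a quick illustration (a hedged sketch with made-up frames, since the question didn't include sample data):
import pandas as pd

# hypothetical frames matching the described column layouts
df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'C': [0, 0],
                    'D': [10, 20], 'E': [100, 200], 'F': [0, 0]})
df2 = pd.DataFrame({'A': [1, 3], 'B': ['x', 'z'], 'J': [7, 8],
                    'D': [-1, -2], 'E': [-3, -4], 'K': [9, 9]})

df2 = df2.merge(df1[['A', 'B', 'D', 'E']], on=['A', 'B'], how='left', suffixes=['_old', ''])
df2['D'] = df2['D'].fillna(df2['D_old'])
df2['E'] = df2['E'].fillna(df2['E_old'])
df2 = df2.drop(columns=['D_old', 'E_old'])
print(df2)
# the matched (A=1, B='x') row picks up D=10, E=100 from df1;
# the unmatched (A=3, B='z') row keeps its original D=-2, E=-4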
The Error:
Your TypeError happened because for a in df1: iterates over the columns of a dataframe, which are strings here, while .iloc only takes integers. Additionally, though you didn't get to this point, to set a value you'd need both index and column contained within the brackets.
So if you did need to set values by row, you'd want something more like
for a, row1 in df1.iterrows():
    for t, row2 in df2.iterrows():
        if df1.loc[a, 'A'] == ...
Though I'd strongly caution against doing that. If you find yourself thinking about it, there's probably either a much faster, less painful way to do it in pandas, or you're better off using another tool less focused on tabular data.

How to count number of rows in a group that have an exact string match in pandas?

I have a dataframe where I want to group by some column, and then count the number of rows that have some exact string match for some other column. Assume all dtypes are 'object'.
In pseudo-code I'm looking for something like:
df.groupby('A').filter(x['B'] == '0').size()
I want to group by column 'A', then count the number of rows of column 'B' that have an exact string match to the string '0'.
edit: I found an inelegant solution:
def counter(group):
    i = 0
    for item in group:
        if item == '0':
            i = i + 1
    return i
df.groupby('A')['B'].agg(counter)
There must be a better way.
I don't see much wrong with the solution you proposed in your question. If you wanted to make it a one liner you could do the following:
data = np.array(list('abcdefabc')).reshape((3, 3))
df = pd.DataFrame(data, columns=list('ABC'))
df
A B C
0 a b c
1 d e f
2 a b c
df.groupby('A').agg(lambda x:list(x).count('c'))
B C
A
a 0 2
d 0 0
This has the advantage of giving the count for every column of the original dataframe at once.
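For the exact case in the question (group by 'A', count rows where 'B' equals the string '0'), a more direct sketch using the question's column names is to compare first and sum per group:
import pandas as pd

# toy frame using the question's column names; 'B' holds strings
df = pd.DataFrame({'A': ['x', 'x', 'y', 'y', 'y'],
                   'B': ['0', '1', '0', '0', '2']})

# exact string match gives booleans; summing them per group counts the matches
counts = df['B'].eq('0').groupby(df['A']).sum()
print(counts)
# A
# x    1
# y    2
# Name: B, dtype: int64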
Try creating a temp column which indicates whether the value is zero or not, and then make a pivot table based on that column.
Hope this helps.
Let me know if it worked.
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'one', 'one', 'one', 'one', 'one',
                         'two', 'two', 'two', 'two', 'two', 'two', 'two'],
                   'B': [1, 2, 3, 0, 2, 3, 0, 2, 3, 2, 0, 3, 44, 55]})

# create a new column saying whether the value is zero or not
df['C'] = df['B'].apply(lambda x: 'EQUALS_ZERO' if x == 0 else 'NOT_EQUAL_ZERO')

# make a pivot table based on that column
# this will give you a count for both ==0 and !=0
x = pd.pivot_table(df, index=['A'], values='B', columns='C', aggfunc='count', fill_value=0)
print(x)
