Pandas remove rows where several columns are not nan - python

I have a dataframe that looks like this:
A B C D E
0 P 10 NaN 5.0 9.0
1 Q 19 NaN NaN 4.0
2 R 8 NaN 3.0 7.0
3 S 20 NaN 3.0 7.0
4 T 4 NaN 2.0 NaN
And I have a list: [['A', 'B', 'D', 'E'], ['A', 'B', 'D'], ['A', 'B', 'E']]
I am iterating over the list and getting only those rows from the dataframe, for which the columns specified by the list are not empty.
I have tried with the following code:
test_df = pd.DataFrame([['P', 10, np.nan, 5, 9], ['Q', 19, np.nan, np.nan, 4], ['R', 8, np.nan, 3, 7],
                        ['S', 20, np.nan, 3, 7], ['T', 4, np.nan, 2, np.nan]], columns=list('ABCDE'))
priority_list = [list('ABDE'), list('ABD'), list('ABE')]
for elem in priority_list:
    test_df = test_df.loc[test_df[elem].notna()]
    print(test_df)
But this is throwing the following error:
File "C:\Python37\lib\site-packages\pandas\core\indexing.py", line 879, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\Python37\lib\site-packages\pandas\core\indexing.py", line 1097, in _getitem_axis
raise ValueError("Cannot index with multidimensional key")
ValueError: Cannot index with multidimensional key
How can I overcome this issue and check multiple columns for non-NA values in the dataframe?

Use DataFrame.all with axis=1 to test whether all selected values in each row are True:
priority_list = [list('ABDE'), list('ABD'), list('ABE')]
for elem in priority_list:
    test_df = test_df.loc[test_df[elem].notna().all(axis=1)]
    print(test_df)
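The error in the question comes from test_df[elem].notna() being a two-dimensional boolean DataFrame, while .loc expects one boolean per row. Here is a minimal sketch of the fix on the question's data for a single column list. DataFrame.dropna(subset=...) is shown as what should be an equivalent spelling; the names cols, mask, kept and same are only illustrative.

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame(
    [['P', 10, np.nan, 5, 9],
     ['Q', 19, np.nan, np.nan, 4],
     ['R', 8, np.nan, 3, 7],
     ['S', 20, np.nan, 3, 7],
     ['T', 4, np.nan, 2, np.nan]],
    columns=list('ABCDE'),
)
cols = ['A', 'B', 'D', 'E']  # illustrative: one entry of priority_list

# notna() gives a boolean DataFrame; all(axis=1) reduces it to one boolean per row
mask = test_df[cols].notna().all(axis=1)
kept = test_df.loc[mask]

# dropna with subset= drops rows that have a NaN in any of the listed columns
same = test_df.dropna(subset=cols)

print(kept)
```

Note that the loop in the answer reassigns test_df on every pass, so each later column list filters the already filtered frame.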

Related

Find highest two numbers on every row in pandas dataframe and extract the column names

I have a code with multiple columns and I would like to add two more, one for the highest number on the row, and another one for the second highest. However, instead of the number, I would like to show the column name where they are found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number on every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis = 1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sort the values in each row and assign the top two (this gives the values, not the column names):
import numpy as np
cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print (df)
A B C D E max2 max1
0 1 2 3 4 5 4 5
1 5 6 7 8 9 8 9
2 10 11 12 13 14 13 14
To assign them in max1, max2 order instead, reverse the two selected columns:
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: To get the top two column names and the top two values, use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
#values in numpy array
vals = df[cols].to_numpy()
#columns names in array
cols = np.array(cols)
#get indices that would sort an array in descending order
arr = np.argsort(-vals, axis=1)
#top 2 columns names
df[['top1','top2']] = cols[arr[:, :2]]
#top 2 values
df[['max1','max2']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print (df)
A B C D E top1 top2 max1 max2
0 1 2 3 40 5 D E 40 5
1 50 6 7 8 9 A E 50 9
2 10 11 12 13 14 E D 14 13
Another approach: get the first max, then remove it and take the max again to get the second max.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
max1 = df.max(axis=1)
maxcolum1 = df.idxmax(axis=1)
# hide each row's own maximum (compared row by row), then take the max of what is left
masked = df.mask(df.eq(max1, axis=0))
max2 = masked.max(axis=1)
maxcolum2 = masked.idxmax(axis=1)
df2 = pd.DataFrame({'max1': max1, 'max2': max2, 'maxcol1': maxcolum1, 'maxcol2': maxcolum2})
df.join(df2)
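For smaller frames, a row-wise Series.nlargest can return both the names and the values in one pass. This is only a sketch, slower than the NumPy version above on large data; the helper top_two and the name result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
                   'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']


def top_two(row):
    """Hypothetical helper: names and values of the two largest entries in a row."""
    best = row.nlargest(2)  # Series sorted descending, indexed by column name
    return pd.Series(
        [best.index[0], best.index[1], best.iloc[0], best.iloc[1]],
        index=['top1', 'top2', 'max1', 'max2'],
    )


# apply(axis=1) calls top_two once per row and stacks the returned Series
result = df.join(df[cols].apply(top_two, axis=1))
print(result)
```

On ties, nlargest keeps the first occurrence by default, so the leftmost of two equal columns is reported first.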

Replace values of a dataframe with the value of another dataframe

I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
The index of both dataframes is 'A'.
How can I replace the values of df1's column 'B' with the values of df2's column 'B'?
RESULT of df1:
A B
1 8
3 10
5 12
Maybe Series.isin() is what you're looking for (this assumes every value of df1['A'] appears exactly once in df2['A'], and in the same order):
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
One possible solution:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=('_x', ''))\
    .drop(columns=['B_x'])
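A lookup with Series.map is another way to express the same replacement. This is a sketch that assumes the values in df2['A'] are unique; the name lookup is illustrative, and the fillna step (keeping the old 'B' where no match exists) is an added assumption about the desired behaviour.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12],
                    'C': ['K', 'D', 'E', 'F', 'G']})

# Series keyed by 'A' that returns the matching 'B' from df2 (keys assumed unique)
lookup = df2.set_index('A')['B']

# map each key in df1 to df2's value; keep the old value where no key matches
df1['B'] = df1['A'].map(lookup).fillna(df1['B']).astype(df1['B'].dtype)
print(df1)
```

Unlike the .values assignment, this does not depend on the row order of either frame.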

Pandas dataframe conditional substitution and columnwise trimming

Current Pandas DataFrame
fn1 = pd.DataFrame([['A', 'NaN', 'NaN', 9, 6], ['B', 'NaN', 2, 'NaN', 7], ['C', 3, 2, 'NaN', 10], ['D', 'NaN', 7, 'NaN', 'NaN'], ['E', 'NaN', 'NaN', 3, 3], ['F', 'NaN', 'NaN', 7,'NaN']], columns = ['Symbol', 'Condition1','Condition2', 'Condition3', 'Condition4'])
fn1.set_index('Symbol', inplace=True)
Condition1 Condition2 Condition3 Condition4
Symbol
A NaN NaN 9 6
B NaN 2 NaN 7
C 3 2 NaN 10
D NaN 7 NaN NaN
E NaN NaN 3 3
F NaN NaN 7 NaN
I'm currently working with a Pandas DataFrame that looks like the link above. I'm trying to go column by column to substitute values that are not 'NaN' with the 'Symbol' associated with that row then collapse each column (or write to a new DataFrame) so that each column is a list of 'Symbol's that were present for each 'Condition' as shown in the desired output:
Desired Output
I've been able to get the 'Symbols' that were present for each condition into a list of lists (see below) but want to maintain the same column names and had trouble adding them to an ever-growing new DataFrame because the lengths are variable and I'm looping through columns.
ls2 = []
for col in fn1.columns:
    fn2 = fn1[fn1[col] > 0]
    ls2.append(list(fn2.index))
Where fn1 is the DataFrame that looks like the first image and I had made the 'Symbol' column the index.
Thank you in advance for any help.
Another answer would be slicing, just like below (explanations in comments):
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({
    "Symbol": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "Condition1": [1, np.nan, 3, np.nan, np.nan, np.nan, 7, np.nan, np.nan, 8, 12],
    "Condition2": [np.nan, 2, 2, 7, np.nan, np.nan, 5, 11, 14, np.nan, np.nan],
})
new_df = pd.concat(
    [
        # get the symbols of rows without null and ignore the index (as your output suggests)
        df["Symbol"][df[column].notnull()].reset_index(drop=True)
        for column in list(df)[1:]  # iterate over all columns except "Symbol"
    ],
    axis=1,  # column-wise concatenation
)
# Rename columns
new_df.columns = list(df)[1:]
# You can leave NaNs or replace them with empty string, your choice
new_df.fillna("", inplace=True)
Output of this operation will be:
  Condition1 Condition2
0          a          b
1          c          c
2          g          d
3          j          g
4          k          h
5                     i
If you need any further clarification, post a comment down below.
You can map the symbols to each of the columns, and then take the set of non-null values (this assumes 'Symbol' is still a regular column of fn1, not the index).
fn1_symbols = fn1.apply(lambda x: x.map(fn1['Symbol'].to_dict()))
condition_symbols = {col: sorted(set(fn1_symbols[col].dropna())) for col in fn1.columns[1:]}
This will give you a dictionary:
{'Condition1': ['B', 'D'],
'Condition2': ['C', 'H'],
'Condition3': ['D', 'H', 'J'],
'Condition4': ['D', 'G', 'H', 'K']}
I know you asked for a DataFrame, but since the lists have different lengths, a DataFrame is not a natural fit. If you still want one, you could run this code:
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in condition_symbols.items() ]))
This gives you the following output:
Condition1 Condition2 Condition3 Condition4
0 B C D D
1 D H H G
2 NaN NaN J H
3 NaN NaN NaN K
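For comparison, here is a sketch that works directly on the frame from the question. It assumes the missing entries are real NaN values (np.nan) rather than the string 'NaN', and that 'Symbol' has been set as the index. The names symbols_by_condition and collapsed are illustrative.

```python
import numpy as np
import pandas as pd

fn1 = pd.DataFrame(
    [['A', np.nan, np.nan, 9, 6],
     ['B', np.nan, 2, np.nan, 7],
     ['C', 3, 2, np.nan, 10],
     ['D', np.nan, 7, np.nan, np.nan],
     ['E', np.nan, np.nan, 3, 3],
     ['F', np.nan, np.nan, 7, np.nan]],
    columns=['Symbol', 'Condition1', 'Condition2', 'Condition3', 'Condition4'],
).set_index('Symbol')

# For each condition, the index labels (symbols) of the rows that are not NaN
symbols_by_condition = {
    col: fn1[col].dropna().index.tolist() for col in fn1.columns
}

# Wrapping each list in a Series lets pandas pad the shorter columns with NaN
collapsed = pd.DataFrame(
    {col: pd.Series(vals) for col, vals in symbols_by_condition.items()}
)
print(collapsed)
```

The column names are preserved because the dictionary keys are taken from fn1.columns.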

cannot add multiple column with values in Python Pandas

I want to add the data of reference to data, so I use
data[reference.columns]=reference
but it only creates the columns with no values. How can I add the values?
Your two DataFrames are indexed differently, so when you do data[reference.columns] = reference it tries to align the new columns on indices. Since the indices of reference are not in data (or only align for index=0) it adds the columns, but fills the values with NaN.
It looks like you want to add multiple static columns to data with the values from reference. You can just assign these:
for col in reference.columns:
    data[col] = reference[col].values[0]
Here's an illustration of the issue.
import pandas as pd
data = pd.DataFrame({'id': [1, 2, 3, 4],
'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
'val2': ['A', 'B', 'C', 'D']})
These have the same indices ranging from 0-3.
data[reference.columns] = reference
Outputs
id val1 id2 val2
0 1 A 1 A
1 2 B 2 B
2 3 C 3 C
3 4 D 4 D
But, if these DataFrames have different indices (that only partially overlap):
data = pd.DataFrame({'id': [1, 2, 3, 4],
'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
'val2': ['A', 'B', 'C', 'D']})
reference.index=[3,4,5,6]
data[reference.columns]=reference
Outputs:
id val1 id2 val2
0 1 A NaN NaN
1 2 B NaN NaN
2 3 C NaN NaN
3 4 D 1.0 A
As only the index value of 3 is shared.
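If the rows of the two frames are meant to pair up by position rather than by index label, one option is to give reference the same index as data before combining. This is a sketch that assumes both frames have the same number of rows; aligned and combined are illustrative names.

```python
import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3, 4],
                     'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
                          'val2': ['A', 'B', 'C', 'D']})
reference.index = [3, 4, 5, 6]  # deliberately different from data's 0..3

# Copy reference and overwrite its index so row i lines up with row i of data
aligned = reference.copy()
aligned.index = data.index

# join aligns on the index, which now matches exactly
combined = data.join(aligned)
print(combined)
```

This discards whatever meaning the original index of reference had, so it only makes sense when position is the intended key.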

Reshape rows to columns in pandas dataframe

In pandas how to go from a:
a = pd.DataFrame({'foo': ['m', 'm', 'm', 's', 's', 's'],
'bar': [1, 2, 3, 4, 5, 6]})
>>> a
bar foo
0 1 m
1 2 m
2 3 m
3 4 s
4 5 s
5 6 s
to b:
b = pd.DataFrame({'m': [1, 2, 3],
's': [4, 5, 6]})
>>> b
m s
0 1 4
1 2 5
2 3 6
I tried solutions in other answers, e.g. here and here but none seemed to do what I want.
Basically, I want to swap rows with columns and drop the index, but how to do it?
a.set_index(
    [a.groupby('foo').cumcount(), 'foo']
).bar.unstack()
This is my solution
a = pd.DataFrame({'foo': ['m', 'm', 'm', 's', 's', 's'],
'bar': [1, 2, 3, 4, 5, 6]})
a.pivot(columns='foo', values='bar').apply(lambda x: pd.Series(x.dropna().values))
foo m s
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
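The float values in that output come from the NaN padding that pivot introduces before dropna. Here is a sketch that avoids the padding by numbering the rows within each group first; the helper column name pos is made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({'foo': ['m', 'm', 'm', 's', 's', 's'],
                  'bar': [1, 2, 3, 4, 5, 6]})

# cumcount numbers the rows inside each 'foo' group: 0, 1, 2, 0, 1, 2
b = (
    a.assign(pos=a.groupby('foo').cumcount())
     .pivot(index='pos', columns='foo', values='bar')
)

# Drop the leftover axis names so the result looks like the target frame
b.index.name = None
b.columns.name = None
print(b)
```

This is the same idea as the set_index / unstack answer, just spelled with pivot.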
