Pandas dataframe conditional substitution and columnwise trimming - python

Current Pandas DataFrame
import numpy as np
import pandas as pd

fn1 = pd.DataFrame([['A', np.nan, np.nan, 9, 6],
                    ['B', np.nan, 2, np.nan, 7],
                    ['C', 3, 2, np.nan, 10],
                    ['D', np.nan, 7, np.nan, np.nan],
                    ['E', np.nan, np.nan, 3, 3],
                    ['F', np.nan, np.nan, 7, np.nan]],
                   columns=['Symbol', 'Condition1', 'Condition2', 'Condition3', 'Condition4'])
fn1.set_index('Symbol', inplace=True)
        Condition1  Condition2  Condition3  Condition4
Symbol
A              NaN         NaN           9           6
B              NaN           2         NaN           7
C                3           2         NaN          10
D              NaN           7         NaN         NaN
E              NaN         NaN           3           3
F              NaN         NaN           7         NaN
I'm currently working with the Pandas DataFrame shown above. I'm trying to go column by column, substitute the values that are not NaN with the 'Symbol' associated with that row, and then collapse each column (or write to a new DataFrame) so that each column is a list of the 'Symbol's that were present for that 'Condition', as shown in the desired output:
Desired Output
I've been able to get the 'Symbols' that were present for each condition into a list of lists (see below) but want to maintain the same column names and had trouble adding them to an ever-growing new DataFrame because the lengths are variable and I'm looping through columns.
ls2 = []
for col in fn1.columns:
    fn2 = fn1[fn1[col] > 0]
    ls2.append(list(fn2.index))
Where fn1 is the DataFrame that looks like the first image and I had made the 'Symbol' column the index.
Thank you in advance for any help.

Another approach is boolean slicing, as below (explanations in the comments):
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({
    "Symbol": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "Condition1": [1, np.nan, 3, np.nan, np.nan, np.nan, 7, np.nan, np.nan, 8, 12],
    "Condition2": [np.nan, 2, 2, 7, np.nan, np.nan, 5, 11, 14, np.nan, np.nan],
})
new_df = pd.concat(
    [
        df["Symbol"][df[column].notnull()].reset_index(drop=True)  # symbols where this column is non-null, index dropped (as your output suggests)
        for column in list(df)[1:]  # iterate over all columns except "Symbol"
    ],
    axis=1,  # column-wise concatenation
)
# Rename columns
new_df.columns = list(df)[1:]
# You can leave NaNs or replace them with an empty string, your choice
new_df.fillna("", inplace=True)
Output of this operation will be:
  Condition1 Condition2
0          a          b
1          c          c
2          g          d
3          j          g
4          k          h
5                     i
If you need any further clarification, post a comment down below.
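For reference, the same concat idea can be applied directly to the asker's fn1 (where 'Symbol' is the index) by masking the index itself. This is a sketch assuming the fn1 layout shown in the question, rebuilt with real NaNs:

```python
import numpy as np
import pandas as pd

# Rebuild the question's DataFrame with 'Symbol' as the index
fn1 = pd.DataFrame(
    [[np.nan, np.nan, 9, 6], [np.nan, 2, np.nan, 7],
     [3, 2, np.nan, 10], [np.nan, 7, np.nan, np.nan],
     [np.nan, np.nan, 3, 3], [np.nan, np.nan, 7, np.nan]],
    index=pd.Index(list("ABCDEF"), name="Symbol"),
    columns=["Condition1", "Condition2", "Condition3", "Condition4"],
)

# For each condition, keep the index labels where the value is not NaN
new_df = pd.concat(
    [fn1.index.to_series(name=col)[fn1[col].notna()].reset_index(drop=True)
     for col in fn1.columns],
    axis=1,
)
```

The shorter columns are padded with NaN by the outer join that concat performs.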

You can map the symbols to each of the columns, and then take the set of non-null values.
fn1_symbols = fn1.apply(lambda x: x.map(fn1['Symbol'].to_dict()))
condition_symbols = {col: sorted(set(fn1_symbols[col].dropna())) for col in fn1.columns[1:]}
This will give you a dictionary:
{'Condition1': ['B', 'D'],
'Condition2': ['C', 'H'],
'Condition3': ['D', 'H', 'J'],
'Condition4': ['D', 'G', 'H', 'K']}
I know you asked for a DataFrame, but since the lists have different lengths, the result does not map cleanly onto one. If you do want a DataFrame, you can run this code:
pd.DataFrame({k: pd.Series(v) for k, v in condition_symbols.items()})
This gives you the following output:
  Condition1 Condition2 Condition3 Condition4
0          B          C          D          D
1          D          H          H          G
2        NaN        NaN          J          H
3        NaN        NaN        NaN          K
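A shorter route to the same dictionary, assuming 'Symbol' has already been made the index, is to mask the index directly instead of mapping the symbols in first. A small sketch:

```python
import numpy as np
import pandas as pd

fn1 = pd.DataFrame(
    {"Condition1": [np.nan, np.nan, 3, np.nan],
     "Condition2": [np.nan, 2, 2, 7]},
    index=pd.Index(list("ABCD"), name="Symbol"),
)

# index[mask] keeps only the labels whose value is non-null in that column
condition_symbols = {col: sorted(fn1.index[fn1[col].notna()])
                     for col in fn1.columns}
```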

Related

Pandas: Setting a value in a cell when multiple columns are empty

I've been looking for ways to do this natively for a little while now and can't find a solution.
I have a large dataframe where I would like to set the value in other_col to 'True' for all rows where one of a list of columns is empty.
This works for a single column page_title:
df.loc[df['page_title'].isna(), ['other_col']] = ''
But not when using a list
df.loc[df[['page_title','brand','name']].isna(), ['other_col']] = ''
Any ideas of how I could do this without using Numpy or looping through all rows?
Thanks
Maybe this is what you are looking for:
df = pd.DataFrame({
    'A': ['1', '2', '3', np.nan],
    'B': ['10', np.nan, np.nan, '40'],
    'C': ['test', 'test', 'test', 'test']})
df.loc[df[['A', 'B']].isna().any(axis=1), ['C']] = 'value'
print(df)
print(df)
Result:
A B C
0 1 10 test
1 2 NaN value
2 3 NaN value
3 NaN 40 value
This will let you choose which columns to check for np.nan and set a True/False indicator:
data = {
    'Column1': [1, 2, 3, np.nan],
    'Column2': [1, 2, 3, 4],
    'Column3': [1, 2, np.nan, 4]
}
df = pd.DataFrame(data)
df['other_col'] = np.where((df['Column1'].isna()) | (df['Column2'].isna()) | (df['Column3'].isna()), True, False)
df
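If the list of columns is long, the chained | conditions can be replaced by selecting just those columns and calling isna().any(axis=1) once. A sketch with the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column1": [1, 2, 3, np.nan],
    "Column2": [1, 2, 3, 4],
    "Column3": [1, 2, np.nan, 4],
})

cols = ["Column1", "Column2", "Column3"]
# True wherever at least one of the selected columns is NaN in that row
df["other_col"] = df[cols].isna().any(axis=1)
```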

Pandas remove rows where several columns are not nan

I have a dataframe that looks like this:
   A   B   C    D    E
0  P  10 NaN  5.0  9.0
1  Q  19 NaN  NaN  4.0
2  R   8 NaN  3.0  7.0
3  S  20 NaN  3.0  7.0
4  T   4 NaN  2.0  NaN
And I have a list: [['A', 'B', 'D', 'E'], ['A', 'B', 'D'], ['A', 'B', 'E']]
I am iterating over the list and getting only those rows from the dataframe, for which the columns specified by the list are not empty.
I have tried with the following code:
test_df = pd.DataFrame([['P', 10, np.nan, 5, 9], ['Q', 19, np.nan, np.nan, 4],
                        ['R', 8, np.nan, 3, 7], ['S', 20, np.nan, 3, 7],
                        ['T', 4, np.nan, 2, np.nan]], columns=list('ABCDE'))
priority_list = [list('ABDE'), list('ABD'), list('ABE')]
for elem in priority_list:
    test_df = test_df.loc[test_df[elem].notna()]
    print(test_df)
But this is throwing the following error:
File "C:\Python37\lib\site-packages\pandas\core\indexing.py", line 879, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\Python37\lib\site-packages\pandas\core\indexing.py", line 1097, in _getitem_axis
raise ValueError("Cannot index with multidimensional key")
ValueError: Cannot index with multidimensional key
How to overcome this issue and check for multiple columns for non-na values in the dataframe?
Use DataFrame.all to test whether all selected values are True:
priority_list = [list('ABDE'), list('ABD'), list('ABE')]
for elem in priority_list:
    test_df = test_df.loc[test_df[elem].notna().all(axis=1)]
    print(test_df)
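Run against the question's test_df, the corrected loop leaves only the rows that are non-null in every listed column. A self-contained sketch:

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame(
    [["P", 10, np.nan, 5, 9], ["Q", 19, np.nan, np.nan, 4],
     ["R", 8, np.nan, 3, 7], ["S", 20, np.nan, 3, 7],
     ["T", 4, np.nan, 2, np.nan]],
    columns=list("ABCDE"),
)

for elem in [list("ABDE"), list("ABD"), list("ABE")]:
    # all(axis=1) collapses the boolean frame to one flag per row
    test_df = test_df.loc[test_df[elem].notna().all(axis=1)]
```

Rows Q (missing D) and T (missing E) are dropped on the first pass; P, R and S survive all three filters.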

How to replace specific character in pandas column with null?

I have a column within a dataset of categorical company sizes, where the '-' hyphens currently represent missing data:
I want to change the '-' missing values to nulls so I can analyse the missing data. However, when I use the pd replace tool (see the following code) with a None value, it also mangles genuine entries, since they contain hyphens too (e.g. 51-200).
df['Company Size'].replace({'-': None},inplace =True, regex= True)
How can I replace only lone standing hyphens and leave the other entries untouched?
You don't need regex=True here; without it, only exact matches are replaced.
df['Company Size'].replace({'-': None}, inplace=True)
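The difference matters because with regex=True the hyphen is treated as a pattern that matches inside values like 51-200, while a plain replace only touches cells that are exactly '-'. A sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(["-", "51-200", "1000+"])

# regex=True strips the hyphen out of every value, mangling '51-200'
with_regex = s.replace("-", "", regex=True)

# plain replace only changes cells that are exactly '-'
exact = s.replace("-", np.nan)
```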
You could also just do:
df['column_name'] = df['column_name'].replace({'-': None})
import numpy as np
df.replace('-', np.nan, inplace=True)
This code worked for me.
You can do it like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', '-', 'c--', 'd', 'e']})
df['C'] = df['C'].replace('-', np.nan)
df = df.where(pd.notnull(df), None)
# can also use this -> df['C'] = df['C'].where(pd.notnull(df['C']), None)
print(df)
output:
A B C
0 0 5 a
1 1 6 None
2 2 7 c--
3 3 8 d
4 4 9 e
another example:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': ['5-5', '-', 7, 8, 9],
                   'C': ['a', 'b', 'c--', 'd', 'e']})
df['B'] = df['B'].replace('-', np.nan)
df = df.where(pd.notnull(df), None)
print(df)
output:
A B C
0 0 5-5 a
1 1 None b
2 2 7 c--
3 3 8 d
4 4 9 e

Cannot add multiple columns with values in Python Pandas

I want to add the data of reference to data, so I use
data[reference.columns]=reference
but it only creates the column with no value, how can I add the value?
Your two DataFrames are indexed differently, so when you do data[reference.columns] = reference it tries to align the new columns on index labels. Since the index labels of reference are mostly not in data (they only partially overlap, if at all), the columns are added but the values are filled with NaN.
It looks like you want to add multiple static columns to data with the values from reference. You can just assign these:
for col in reference.columns:
    data[col] = reference[col].values[0]
Here's an illustration of the issue.
import pandas as pd
data = pd.DataFrame({'id': [1, 2, 3, 4],
                     'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
                          'val2': ['A', 'B', 'C', 'D']})
These have the same indices ranging from 0-3.
data[reference.columns] = reference
Outputs
id val1 id2 val2
0 1 A 1 A
1 2 B 2 B
2 3 C 3 C
3 4 D 4 D
But, if these DataFrames have different indices (that only partially overlap):
data = pd.DataFrame({'id': [1, 2, 3, 4],
                     'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
                          'val2': ['A', 'B', 'C', 'D']})
reference.index = [3, 4, 5, 6]
data[reference.columns] = reference
Outputs:
id val1 id2 val2
0 1 A NaN NaN
1 2 B NaN NaN
2 3 C NaN NaN
3 4 D 1.0 A
As only the index value of 3 is shared.
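If the goal is to paste reference's values in positionally, ignoring its index, one fix (a sketch under the same setup as above) is to reset the index before assigning:

```python
import pandas as pd

data = pd.DataFrame({"id": [1, 2, 3, 4], "val1": ["A", "B", "C", "D"]})
reference = pd.DataFrame({"id2": [1, 2, 3, 4], "val2": ["A", "B", "C", "D"]})
reference.index = [3, 4, 5, 6]

# reset_index(drop=True) relabels the rows 0..3 so they align with data
data[reference.columns] = reference.reset_index(drop=True)
```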

Replicating rows in pandas dataframe by column value and add a new column with repetition index

My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k number of times. Along with it, I also want to create a column with values 0 to k-1. So
import pandas as pd
df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n':  [1, 2, 3],
    'v':  [10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'n':  [1, 2, 2, 3, 3, 3],
    'v':  [10, 13, 13, 8, 8, 8],
    'repeat_id': [0, 0, 1, 0, 1, 2]
})
Command below does half of the job. I am looking for pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and take a copy to avoid SettingWithCopyWarning:
Without the copy, modifications you make to df1 later would not propagate back to the original data (df), and pandas would raise a warning.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print(df1)
id n v repeat_id
0 A 1 10 0
1 B 2 13 0
2 B 2 13 1
3 C 3 8 0
4 C 3 8 1
5 C 3 8 2
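An alternative to the groupby counter, sketched here, is to build the 0 to k-1 sequences directly with numpy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": ["A", "B", "C"], "n": [1, 2, 3], "v": [10, 13, 8]})

out = df.loc[df.index.repeat(df.n)].reset_index(drop=True)
# one arange(k) per original row, concatenated in order
out["repeat_id"] = np.concatenate([np.arange(k) for k in df["n"]])
```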
