Removing values from DataFrame columns based on a list of values - python

I have a slightly specific problem.
I have a pandas DataFrame with, let's say, 6 columns. Each column has its own set of values that need to be looked up and removed / updated with a new value.
The lookup list would look like this:
lookup_values_per_column = [[-99999], [9999], [99, 98],[9],[99],[996, 997, 998, 999]]
Now, what I want to do is: look at column 1 of the DataFrame and check if -99999 is present; if yes, remove / update each instance with a fixed value (let's say NA).
Then we move to the next column, check for all 9999 values, and update them with NA's as well.
If we don't find a match, we just leave the column as it is.
I couldn't find a solution, though I guess it's not that hard.

We can use DataFrame.replace with a dictionary built from the list and columns:
df = df.replace(
    to_replace=dict(zip(df.columns, lookup_values_per_column)),
    value=np.nan
)
Sample output:
A B C D E F
0 4.0 1.0 NaN 2.0 3.0 NaN
1 NaN 3.0 2.0 NaN 4.0 NaN
2 1.0 4.0 NaN 1.0 1.0 NaN
3 2.0 2.0 3.0 3.0 2.0 1.0
4 3.0 NaN 1.0 4.0 NaN NaN
Setup Used:
from random import sample, seed
from string import ascii_uppercase
import numpy as np
import pandas as pd
lookup_values_per_column = [
    [-99999], [9999], [99, 98], [9], [99], [996, 997, 998, 999]
]
df_len = max(map(len, lookup_values_per_column)) + 1
seed(10)
df = pd.DataFrame({
    k: sample(v + list(range(1, df_len + 1 - len(v))), df_len)
    for k, v in zip(ascii_uppercase, lookup_values_per_column)
})
df:
A B C D E F
0 4 1 99 2 3 997
1 -99999 3 2 9 4 998
2 1 4 98 1 1 999
3 2 2 3 3 2 1
4 3 9999 1 4 99 996

Related

Python: How to drop rows of Pandas DataFrame whose value in a certain column is NaN

I have this DataFrame and want only the records whose "Total" column is not NaN, dropping records where columns A~E have more than two NaNs:
A B C D E Total
1 1 3 5 5 8
1 4 3 5 5 NaN
3 6 NaN NaN NaN 6
2 2 5 9 NaN 8
i.e. something like df.dropna(....) to get this resulting DataFrame:
A B C D E Total
1 1 3 5 5 8
2 2 5 9 NaN 8
Here's my code
import pandas as pd
dfInputData = pd.read_csv(path)
dfInputData = dfInputData.dropna(axis=1,how = 'any')
RowCnt = dfInputData.shape[0]
But it looks like no modification has been made, and there isn't even an error.
Please help!! Thanks
Use boolean indexing: count the missing values across all columns except Total, and require non-missing values in Total:
df = df[df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
Or filter columns between A:E:
df = df[df.loc[:, 'A':'E'].isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
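Since the question asked for something like df.dropna(....), here is a sketch of an equivalent filter built from two chained dropna calls (assuming rows with more than two NaNs in A~E are the ones to drop):
# Keep rows where Total is present, then rows with at least 3 non-missing
# values among A:E (i.e. at most 2 NaN), mirroring the .le(2) condition
df = df.dropna(subset=['Total']).dropna(subset=list('ABCDE'), thresh=3)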

How to really filter a pandas dataset without leaving NaNs everywhere

Say I have a huge DataFrame that contains only a handful of cells matching the filter I apply. How can I end up with only the matching values (and their indexes and columns) in a new DataFrame, without the rest of the DataFrame turning into NaN? Dropping NaNs with dropna removes the whole column or row, and filtering replaces non-matches with NaNs.
Here's my code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((1000, 1000)))
# this one is almost filled with Nans
df[df<0.01]
If you need the non-missing values in another format, you can use DataFrame.stack:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(5, 3)))
# this one is almost filled with Nans
df1 = df[df<7]
print (df1)
0 1 2
0 0.0 NaN 3.0
1 6.0 3.0 3.0
2 NaN NaN 0.0
3 0.0 NaN NaN
4 3.0 NaN 2.0
df2 = df1.stack().rename_axis(('a','b')).reset_index(name='c')
print (df2)
a b c
0 0 0 0.0
1 0 2 3.0
2 1 0 6.0
3 1 1 3.0
4 1 2 3.0
5 2 2 0.0
6 3 0 0.0
7 4 0 3.0
8 4 2 2.0
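Applied to the original 1000x1000 frame, the same pattern keeps only the handful of matching cells, indexed by their (row, column) positions:
df = pd.DataFrame(np.random.random((1000, 1000)))
# The mask leaves NaN wherever a cell fails the test; stack() then
# drops those NaNs, so only the matches and their positions remain
matches = df[df < 0.01].stack()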

Pandas Dataframe Question: Subtract next row and add specific value if NaN

I'm trying to group by in pandas, then sort values and have a result column show what you need to add to get to the next row's value in the group; if you are at the end of the group, the value should be replaced with the number 3. Anyone have an idea how to do it?
import pandas as pd
df = pd.DataFrame({'label': 'a a b c b c'.split(), 'Val': [2, 6, 6, 4, 16, 8]})
df
label Val
0 a 2
1 a 6
2 b 6
3 c 4
4 b 16
5 c 8
I'd like the results as shown below: you have to add 4 to 2 to get 6, so the groups are sorted. If there is no next value in the group, NaN is produced, and it should be replaced with the value 3. I have shown below what the results should look like:
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
I tried this, and was thinking of shifting values up, but the problem is that the labels aren't sorted.
df['Results'] = df.groupby('label').apply(lambda x: x - x.shift())
df
label Val Results
0 a 2 NaN
1 a 6 4.0
2 b 6 NaN
3 c 4 NaN
4 b 16 10.0
5 c 8 4.0
Hope someone can help:D!
Use groupby, diff and abs:
df['Results'] = abs(df.groupby('label')['Val'].diff(-1)).fillna(3)
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
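For clarity, here is the same one-liner broken into steps (a sketch using intermediate names):
step1 = df.groupby('label')['Val'].diff(-1)  # value minus next value within each group
step2 = step1.abs()                          # what to add to reach the next row's value
df['Results'] = step2.fillna(3)              # last row of each group has no successor -> 3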

How to convert a Pandas DataFrame column (or Series) of variable-length lists to a DataFrame of fixed width [duplicate]

This question already has answers here:
Split a Pandas column of lists into multiple columns
(11 answers)
Closed 2 years ago.
I would like to convert a DataFrame column (or Series) with lists that have different lengths into a DataFrame with a fixed number of columns.
The DataFrame will have as many columns as the longest list, and the values where the other lists are shorter can be NaN or anything.
The str accessor allows for this when the data comes as a string, via the expand option of str.split. But I have not been able to find an equivalent for lists of variable length.
In my example the type in the lists is int, but the idea is to be able to do it with any type. This prevents simply converting the Series to str and applying the mentioned expand option.
Below I show code that runs the example using the str.split function, followed by a minimal example with the Series to be converted.
I found a solution using apply, shown in the example, but it is so extremely slow that it is not useful.
import numpy as np
import pandas as pd
# Example with a list as a string
A = pd.DataFrame({
    'lists': [
        '[]',
        '[360,460,160]',
        '[360,1,2,3,4,5,6]',
        '[10,20,30]',
        '[100,100,100,100]',
    ],
    'other': [1, 2, 3, 4, 5]
})
print(A['lists'].astype(str).str.strip('[]').str.split(',', expand=True))
# Example with actual lists
B = pd.DataFrame({
    'lists': [
        [],
        [360, 460, 160],
        [360, 1, 2, 3, 4, 5, 6],
        [10, 20, 30],
        [100, 100, 100, 100],
    ],
    'other': [1, 2, 3, 4, 5]
})
# Create and pre-fill expected columns
max_len = max(B['lists'].str.len())
for idx in range(max_len):
    B[f'lists_{idx}'] = np.nan
# Use .apply to fill the columns
def expand_int_list(row, col, df):
    for idx, item in enumerate(row[col]):
        df.loc[row.name, f'{col}_{idx}'] = item
B.apply(lambda row: expand_int_list(row, 'lists', B), axis=1)
print(B)
Output:
     0     1     2     3     4     5     6
0         None  None  None  None  None  None
1  360   460   160   None  None  None  None
2  360     1     2      3     4     5     6
3   10    20    30   None  None  None  None
4  100   100   100    100  None  None  None
lists other lists_0 lists_1 lists_2 lists_3 \
0 [] 1 NaN NaN NaN NaN
1 [360, 460, 160] 2 360.0 460.0 160.0 NaN
2 [360, 1, 2, 3, 4, 5, 6] 3 360.0 1.0 2.0 3.0
3 [10, 20, 30] 4 10.0 20.0 30.0 NaN
4 [100, 100, 100, 100] 5 100.0 100.0 100.0 100.0
lists_4 lists_5 lists_6
0 NaN NaN NaN
1 NaN NaN NaN
2 4.0 5.0 6.0
3 NaN NaN NaN
4 NaN NaN NaN
EDIT AND FINAL SOLUTION:
An important piece of information that made the methods found in other questions fail is that my data sometimes contains None instead of a list.
In that situation, using tolist() will yield a Series of lists again, and pandas will not allow making those cells an empty list with B.loc[B[col].isna(), col] = [].
The solution I found is to use tolist() only on the rows that are not None, and concat using the original index:
# Example with actual lists
B = pd.DataFrame({
    'lists': [
        [],
        [360, 460, 160],
        None,
        [10, 20, 30],
        [100, 100, 100, 100],
    ],
    'other': [1, 2, 3, 4, 5]
})
col = 'lists'
# I need to keep the index for the concat afterwards.
extended = pd.DataFrame(B.loc[~B[col].isna(), col].tolist(),
                        index=B.loc[~B[col].isna()].index)
extended = extended.add_prefix(f'{col}_')
B = pd.concat([B, extended], axis=1)
print(B)
Output:
lists other lists_0 lists_1 lists_2 lists_3
0 [] 1 NaN NaN NaN NaN
1 [360, 460, 160] 2 360.0 460.0 160.0 NaN
2 None 3 NaN NaN NaN NaN
3 [10, 20, 30] 4 10.0 20.0 30.0 NaN
4 [100, 100, 100, 100] 5 100.0 100.0 100.0 100.0
If you convert the column of nested lists to a list of lists and pass it to the DataFrame constructor, missing values are added so each row is padded to the length of the longest list; then add a prefix with DataFrame.add_prefix and append to the original with DataFrame.join:
df = B.join(pd.DataFrame(B['lists'].tolist()).add_prefix('lists_'))
print (df)
lists other lists_0 lists_1 lists_2 lists_3 \
0 [] 1 NaN NaN NaN NaN
1 [360, 460, 160] 2 360.0 460.0 160.0 NaN
2 [360, 1, 2, 3, 4, 5, 6] 3 360.0 1.0 2.0 3.0
3 [10, 20, 30] 4 10.0 20.0 30.0 NaN
4 [100, 100, 100, 100] 5 100.0 100.0 100.0 100.0
lists_4 lists_5 lists_6
0 NaN NaN NaN
1 NaN NaN NaN
2 4.0 5.0 6.0
3 NaN NaN NaN
4 NaN NaN NaN
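The join approach above assumes every cell holds a list. To combine it with the None case from the edit, one option (a sketch, not from the original answer) is to substitute empty lists before calling tolist():
# Treat None (or any non-list) as an empty list so every row expands cleanly
clean = B['lists'].apply(lambda x: x if isinstance(x, list) else [])
df = B.join(pd.DataFrame(clean.tolist(), index=B.index).add_prefix('lists_'))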

Pandas agg function with operations on multiple columns

I am wondering whether we can use the pandas.core.groupby.DataFrameGroupBy.agg function to perform arithmetic operations on multiple columns. For example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(15).reshape(5, 3))
df['C'] = [0, 0, 2, 2, 5]
print(df.groupby('C').mean()[0] - df.groupby('C').mean()[1])
print(df.groupby('C').agg({0: 'mean', 1: 'sum', 2: 'nunique', 'C': 'mean0-mean1'}))
Is it somehow possible to receive a result like in this example: the difference between the means of column 0 and column 1, grouped by column 'C'?
df
0 1 2 C
0 0 1 2 0
1 3 4 5 0
2 6 7 8 2
3 9 10 11 2
4 12 13 14 5
Grouped difference
C
0 -1.0
2 -1.0
5 -1.0
dtype: float64
I am not interested in solutions that do not use the agg method. I am only curious whether the agg method can take multiple columns as an argument and perform some operation on them to return one column when the job is done.
IIUC:
In [12]: df.groupby('C').mean().diff(axis=1)
Out[12]:
0 1 2
C
0 NaN 1.0 1.0
2 NaN 1.0 1.0
5 NaN 1.0 1.0
or
In [13]: df.groupby('C').mean().diff(-1, axis=1)
Out[13]:
0 1 2
C
0 -1.0 -1.0 NaN
2 -1.0 -1.0 NaN
5 -1.0 -1.0 NaN
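As a follow-up to this approach (still not agg, since agg applies its functions one column at a time and cannot combine two columns in a single aggregation), selecting column 0 after the shifted diff gives exactly the grouped mean0 - mean1 series:
In [14]: df.groupby('C').mean().diff(-1, axis=1)[0]
Out[14]:
C
0   -1.0
2   -1.0
5   -1.0
Name: 0, dtype: float64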
