re.match() in cleaning pandas data frame - python

I want to use re.match() to clean a pandas data frame such that if an entry in any column is 1 or 2 it remains unchanged, but if it is any other value it is set to NaN.
The problem is that my function sets everything to NaN. I'm new to regular expressions, so I think I've made a mistake.
Thanks!
import re
import numpy as np
import pandas as pd

# DATA
data = [['Bob',10,1],['Bob',2,2],['Clarke',13,1]]
my_df = pd.DataFrame(data,columns=['Name','Age','Sex'])
print(my_df)
Name Age Sex
0 Bob 10 1
1 Bob 2 2
2 Clarke 13 1
# CLEANING FUNCTION
def my_fun(df):
    for col in df.columns:
        for row in df.index:
            if re.match('^\d{1}(\.)\d{2}$', str(df[col][row])):
                df[col][row] = df[col][row]
            else:
                df[col][row] = np.nan
    return(df)
# OUTPUT
my_fun(my_df)
Name Age Sex
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
# EXPECTED/DESIRED OUTPUT
Name Age Sex
0 NaN NaN 1
1 NaN 2 2
2 NaN NaN 1

You can use where together with isin here for a full match:
my_df.where(my_df.isin([1,2]))
Name Age Sex
0 NaN NaN 1
1 NaN 2.0 2
2 NaN NaN 1
Some observations:
df[col][row] is not a recommended way to index a dataframe in pandas. Use .loc or .iloc; see Indexing and selecting data
Also, looping over a dataframe is generally not recommended at all; you might end up with a solution that performs very poorly. I'd suggest you read How to iterate over rows in a DataFrame in Pandas
You don't need a regex for what you want to do. You want to match either 1 or 2, and there are more straightforward ways of doing this, both with plain Python and with pandas. When matching something with built-in methods gets complicated, then maybe start looking into regex.
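For illustration, here is a minimal sketch (not part of the original question or answer) of the same cleaning step done the vectorized way; mask() is the complement of the where() call shown above:
import numpy as np
import pandas as pd

data = [['Bob', 10, 1], ['Bob', 2, 2], ['Clarke', 13, 1]]
my_df = pd.DataFrame(data, columns=['Name', 'Age', 'Sex'])

# Keep cells equal to 1 or 2, set everything else to NaN.
cleaned = my_df.mask(~my_df.isin([1, 2]))
print(cleaned)
#   Name  Age  Sex
# 0  NaN  NaN    1
# 1  NaN  2.0    2
# 2  NaN  NaN    1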

Related

Pandas - Replace Last Non-None Value with None Row-wise

One common thing people seem to want to do in pandas is to replace None values with the next or previous non-None value. This is easily done with .fillna. I however want to do something similar but different.
I have a dataframe, df, with some entries. Every row has a different number of entries and they are all "left-adjusted" (if the df is 10 columns wide and some row has n<10 entries, the first n columns hold the entries and the remaining columns are None).
What I want to do is find the last non-None entry in every row and change it to also be a None. This could be any of the columns from the first to the last.
I could of course do this with a for-loop but my dfs can be quite large so something quicker would be preferable. Any ideas?
Thanks!
With help from NumPy, this is quite easy. By counting the number of None values in each row, one can find the column holding the last non-None value of each row, and then use NumPy to set that value to None:
# Build a random 6x10 frame and blank out the tail of each row (left-adjusted layout).
data = np.random.random((6,10))
df = pd.DataFrame(data)
df.iloc[0, 7:] = None
df.iloc[1, 6:] = None
df.iloc[2, 5:] = None
df.iloc[3, 8:] = None
df.iloc[4, 5:] = None
df.iloc[5, 4:] = None
Original dataframe looks like this:
0 1 2 3 4 5
0 0.992337 0.651785 0.521422 NaN NaN NaN
1 0.912962 0.292458 0.620195 0.507071 0.010205 NaN
2 0.061320 0.565979 0.344755 NaN NaN NaN
3 0.521936 0.057917 0.359699 0.484009 NaN NaN
isnull = df.isnull()
col = data.shape[1] - isnull.sum(axis=1) - 1   # column index of the last non-None value in each row
df.values[range(len(df)), col] = None
Updated dataframe looks like this:
0 1 2 3 4 5
0 0.992337 0.651785 NaN NaN NaN NaN
1 0.912962 0.292458 0.620195 0.507071 NaN NaN
2 0.061320 0.565979 NaN NaN NaN NaN
3 0.521936 0.057917 0.359699 NaN NaN NaN
You can find the index of the element to replace in each row with np.argmax():
indices = np.isnan(df.to_numpy()).argmax(axis=1) - 1
df.to_numpy()[range(len(df)), indices] = None
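For illustration, here is a small self-contained sketch of the argmax idea on made-up values (it assumes every row contains at least one NaN, matching the left-adjusted layout from the question, and copies the array so the result does not depend on whether to_numpy() returns a view):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, np.nan, np.nan],
                   [3.0, 4.0, 5.0, np.nan]])

mask = np.isnan(df.to_numpy())          # True where a cell is NaN
last_filled = mask.argmax(axis=1) - 1   # first NaN per row, minus one: the last non-NaN column

arr = df.to_numpy().copy()
arr[np.arange(len(arr)), last_filled] = np.nan
df = pd.DataFrame(arr, columns=df.columns)
print(df)   # row 0 becomes 1.0 NaN NaN NaN, row 1 becomes 3.0 4.0 NaN NaN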

What happens when I pass a boolean dataframe to the indexing operator for another dataframe in pandas?

There's something fundamental about manipulating pandas dataframes which I am not getting.
TL;DR: passing a boolean series to the indexing operator [] of a pandas dataframe returns the rows or columns of that df where the series is True. But passing a boolean dataframe (i.e. multidimensional) returns a weird dataframe consisting only of NaN values.
Edit: to rephrase: why is it possible to pass a dataframe of boolean values to another dataframe, and what does it do? With a series, this makes sense, but with a dataframe, I don't understand what's happening 'under the hood', and why in my example I get a dataframe of NaN values.
In detail with examples:
When I pass a pandas boolean Series to the indexing operator, it returns a list of rows corresponding to indices where the Series is True:
test_list = [[1,2,3,4],[3,4,5],[4,5]]
test_df = pd.DataFrame(test_list)
test_df
0 1 2 3
0 1 2 3.0 4.0
1 3 4 5.0 NaN
2 4 5 NaN NaN
test_df[test_df[2].isnull()]
0 1 2 3
2 4 5 NaN NaN
So far, so good. But what happens when I do this:
test_df[test_df.isnull()]
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
Why does this return a dataframe consisting of only NaN values? I would expect it to either return an error, or perhaps to return a new dataframe truncated using the boolean mask dataframe. But I find this output completely confusing.
Edit: As an outcome I would expect to get an error. I don't understand why it's possible to pass a dataframe under these circumstances, or why it returns this dataframe of NaN values
test_df[..] calls an indexing method __getitem__(). From the source code:
def __getitem__(self, key):
    ...
    # Do we have a (boolean) DataFrame?
    if isinstance(key, DataFrame):
        return self.where(key)

    # Do we have a (boolean) 1d indexer?
    if com.is_bool_indexer(key):
        return self._getitem_bool_array(key)
As you can see, if the key is a boolean DataFrame, it will call pandas.DataFrame.where(). The function of where() is to replace values where the condition is False with NaN by default.
# print(test_df.isnull())
0 1 2 3
0 False False False False
1 False False False True
2 False False True True
# print(test_df)
0 1 2 3
0 1 2 3.0 4.0
1 3 4 5.0 NaN
2 4 5 NaN NaN
test_df.where(test_df.isnull()) therefore replaces the non-null values with NaN.
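To check this equivalence directly, here is a small verification sketch using the frame from the question:
import pandas as pd

test_df = pd.DataFrame([[1, 2, 3, 4], [3, 4, 5], [4, 5]])

# Indexing with a boolean DataFrame goes through where(), so both expressions give the same result.
print(test_df[test_df.isnull()].equals(test_df.where(test_df.isnull())))   # True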
I believe all values are transformed to NaN because you passed the entire df. The 'error message', so to speak, is that all the returned values are NaN (including those that were not NaN), which lets you see that something went wrong. But surely a more experienced user will be able to answer you in more detail. Also note that most of the time you want to remove or transform these NaN values, not just flag them.
Following my comment above and LoukasPap's answer, here is a way to flag, count, and then remove or transform these NaN values:
First flag NaN values:
test_df.isnull()
You might also be interested to count your NaN values:
test_df.isnull().sum() # sum NaN by column
test_df.isnull().sum().sum() # get grand total of NaN
You can now drop NaN values by row:
test_df.dropna()
Or by column:
test_df.dropna(axis=1)
Or replace NaN values with the column median:
test_df.fillna(test_df.median())

Dropna when another row has the missing data OR drop_duplicates with NaN matching all data

I have data like the following:
Index ID data1 data2 ...
0 123 0 NaN ...
1 123 0 1 ...
2 456 NaN 0 ...
3 456 NaN 0 ...
...
I need to drop rows that contain less information than, or the same information as, otherwise identical rows.
In the example above, row 0 and either row 2 or row 3 (but not both) should be removed.
My best attempt so far is rather slow, and also non-functioning:
df.groupby(by='ID').fillna(method='ffill',inplace=True).fillna(method='bfill',inplace=True)
df.drop_duplicates(inplace=True)
How can I best accomplish this goal?
Your approach seems fine; it's just that the in-place assignment was not working here (since you're assigning to a copy of the data). Use:
df = df.groupby(by='ID', as_index=False).fillna(method='ffill').fillna(method='bfill')
df.drop_duplicates()
ID data1 data2
0 123 0.0 1.0
2 456 NaN 0.0
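For reference, here is a self-contained sketch on the sample data from the question that avoids the deprecated fillna(method=...) calls; the group-wise transform is a hedged equivalent of the same idea, not the original answer's exact code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [123, 123, 456, 456],
                   'data1': [0, 0, np.nan, np.nan],
                   'data2': [np.nan, 1, 0, 0]})

# Fill missing values within each ID group in both directions, then drop the now-identical rows.
filled = df.copy()
filled[['data1', 'data2']] = (df.groupby('ID')[['data1', 'data2']]
                              .transform(lambda s: s.ffill().bfill()))
print(filled.drop_duplicates())
#     ID  data1  data2
# 0  123    0.0    1.0
# 2  456    NaN    0.0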

Python/Pandas - drop_duplicates selecting the most complete row

I have a dataframe with information about people. However, sometimes these people get repeated, and some rows have more info about the same person than others. Is there a way to drop the duplicates using column 'Name' as reference but only keep the most filled rows?
If you have a dataframe like
df = pd.DataFrame([['a',np.nan,np.nan,'M'],['a',12,np.nan,'M'],['c',np.nan,np.nan,'M'],['d',np.nan,np.nan,'M']],columns=['Name','Age','Region','Gender'])
Sorting rows by NaN count and dropping duplicates on subset 'Name', keeping the first one, might help, i.e.
df['count'] = pd.isnull(df).sum(1)   # number of NaNs per row
df = df.sort_values(['count']).drop_duplicates(subset=['Name'], keep='first').drop('count', 1)
Output:
Before:
Name Age Region Gender
0 a NaN NaN M
1 a 12.0 NaN M
2 c NaN NaN M
3 d NaN NaN M
After:
Name Age Region Gender
1 a 12.0 NaN M
2 c NaN NaN M
3 d NaN NaN M
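A hedged variation of the same idea that avoids the helper column (sort by how many cells are filled, keep the first row per Name); the data is the example above:
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', np.nan, np.nan, 'M'],
                   ['a', 12, np.nan, 'M'],
                   ['c', np.nan, np.nan, 'M'],
                   ['d', np.nan, np.nan, 'M']],
                  columns=['Name', 'Age', 'Region', 'Gender'])

# Put the most complete row of each Name first (stable sort keeps ties in their original order),
# keep that first occurrence, then restore the original row order.
order = df.notna().sum(axis=1).sort_values(ascending=False, kind='stable').index
print(df.loc[order].drop_duplicates(subset='Name', keep='first').sort_index())
#   Name   Age Region Gender
# 1    a  12.0    NaN      M
# 2    c   NaN    NaN      M
# 3    d   NaN    NaN      M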

Drop duplicates while preserving NaNs in pandas

When using the drop_duplicates() method I reduce duplicates but also merge all NaNs into one entry. How can I drop duplicates while preserving rows with an empty entry (like np.nan, None or '')?
import pandas as pd
import numpy as np

df = pd.DataFrame({'col':['one','two',np.nan,np.nan,np.nan,'two','two']})
Out[]:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
5 two
6 two
df.drop_duplicates(['col'])
Out[]:
col
0 one
1 two
2 NaN
Try
df[(~df.duplicated()) | (df['col'].isnull())]
The result is:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
Well, one workaround that is not really beautiful is to first save the NaN rows and put them back in afterwards:
temp = df.iloc[pd.isnull(df).any(1).nonzero()[0]]
asd = df.drop_duplicates('col')
pd.merge(temp, asd, how='outer')
Out[81]:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
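As a side note, Series.nonzero() used above has since been removed from pandas; a hedged rewrite of the same workaround for current versions could look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['one', 'two', np.nan, np.nan, np.nan, 'two', 'two']})

temp = df[df['col'].isna()]        # rows holding a NaN, via boolean indexing instead of nonzero()
asd = df.drop_duplicates('col')    # deduplicated frame (keeps the first NaN row)
print(pd.merge(temp, asd, how='outer'))   # same five rows as shown above: one, two, and the three NaN rows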
use:
df.drop_duplicates('col').append(df[df['col'].isna()])
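Note that DataFrame.append was removed in pandas 2.0; pd.concat gives the same result (a hedged drop-in replacement, not the original answer's code):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['one', 'two', np.nan, np.nan, np.nan, 'two', 'two']})

# Same behaviour as drop_duplicates('col').append(...): the deduplicated frame
# (which keeps one NaN row) followed by every original NaN row.
pd.concat([df.drop_duplicates('col'), df[df['col'].isna()]])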
