Changing the DataFrame in a complex way - python

I have a dataframe as follows:
id|s1|s2|s3|s4|s5
0|a|b|NaN|NaN|NaN
0|NaN|NaN|NaN|c|NaN
0|a1|NaN|NaN|c2|NaN
1|b|c|NaN|NaN|NaN
1|NaN|NaN|a1|NaN|NaN
1|a1|b|NaN|c1|NaN
... (1000 rows)
I want this to be restructured like this:
id|s1|s2|s3|s4|s5
0|a|b|NaN|c|NaN
0|a1|b|NaN|c2|NaN
1|b|c|a1|c1|NaN
1|a1|b|a1|c1|NaN
I have tried:
df.unstack(), df.melt() and df.pivot()
None of them gave me the expected result. Basically I want to reduce the NaN values as much as possible. Could anyone suggest a way? I want only one entry per cell, not a group of entries in a single cell.
I don't want NaN values; I want values to flow between rows as shown in the expected output. I want NaN only when no value exists in any of the rows with the same id.

Group on id, ffill + bfill within each group, then drop_duplicates:
df.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
id s1 s2 s3 s4 s5
0 0 a b NaN c NaN
2 0 a1 b NaN c2 NaN
3 1 b c a1 c1 NaN
5 1 a1 b a1 c1 NaN
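
A minimal runnable sketch of the same approach (my reconstruction of the example frame; the per-group fill is written column-wise, which also works on current pandas):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [0, 0, 0, 1, 1, 1],
    's1': ['a', np.nan, 'a1', 'b', np.nan, 'a1'],
    's2': ['b', np.nan, np.nan, 'c', np.nan, 'b'],
    's3': [np.nan, np.nan, np.nan, np.nan, 'a1', np.nan],
    's4': [np.nan, 'c', 'c2', np.nan, np.nan, 'c1'],
    's5': [np.nan] * 6,
})

cols = ['s1', 's2', 's3', 's4', 's5']
filled = df.copy()
# fill gaps within each id from the rows above, then from the rows below
filled[cols] = filled.groupby('id')[cols].ffill()
filled[cols] = filled.groupby('id')[cols].bfill()
# rows that became identical collapse into one
out = filled.drop_duplicates()
print(out)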

Related

Category Condition on multiple columns

I have a data set as below:
ID A1 A2
0 A123 1234
1 1234 5568
2 5568 NaN
3 Zabc NaN
4 3456 3456
5 3456 3456
6 NaN NaN
7 NaN NaN
The intention is to go through each column (A1 and A2), identify where both columns are blank (as in rows 6 and 7), create a new column and categorise those rows as "Both A1 and A2 are blank".
I used the code below:
df['Z_Tax No Not Mapped'] = np.NaN
df['Z_Tax No Not Mapped'] = np.where((df['A1'] == np.NaN) & (df['A2'] == np.NaN), 1, 0)
However the output captures all the rows as 0 under the new column 'Z_Tax No Not Mapped', even though the data has instances where both columns are blank. Not sure where I'm making a mistake in filtering such cases.
Note: Columns A1 and A2 are sometimes alphanumeric or just numeric.
The idea is to place a category in a separate column, such as "IDs are not updated" or "IDs are updated", so that a simple filter on "IDs are not updated" identifies the cases that are blank in both columns.
Use DataFrame.isna with DataFrame.all to test whether every selected column is a missing value:
df['Z_Tax No Not Mapped'] = np.where(df[['A1','A2']].isna().all(axis=1),
                                     'Both A1 and A2 are blank',
                                     '')
df.loc[df.isna().all(axis=1), "Z_Tax No Not Mapped"] = "Both A1 and A2 are blank"
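
For context (not part of the original answer): the np.where in the question never fires because df['A1'] == np.NaN is always False; NaN does not compare equal to anything, not even to itself, so isna() is the reliable test. A tiny illustration:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])
print(s == np.nan)   # False, False: NaN never compares equal, not even to NaN
print(s.isna())      # False, True: the correct missing-value test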

Dropna when another row has the missing data OR drop_duplicates with NaN matching all data

I have data like the following:
Index ID data1 data2 ...
0 123 0 NaN ...
1 123 0 1 ...
2 456 NaN 0 ...
3 456 NaN 0 ...
...
I need to drop rows that carry no more information than otherwise identical rows.
In the example above, row 0 and either row 2 or row 3 (but not both) should be removed.
My best attempt so far is rather slow and also non-functioning:
df.groupby(by='ID').fillna(method='ffill',inplace=True).fillna(method='bfill',inplace=True)
df.drop_duplicates(inplace=True)
How can I best accomplish this goal?
Your approach seems fine; it's just that the in-place assignment doesn't work here (since you're assigning to a copy of the data). Use:
df = df.groupby(by='ID', as_index=False).fillna(method='ffill').fillna(method='bfill')
df.drop_duplicates()
ID data1 data2
0 123 0.0 1.0
2 456 NaN 0.0
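
On newer pandas (2.x), fillna(method=...) is deprecated, so a rough equivalent of the above (my sketch, using the column names from the example) would be:
cols = ['data1', 'data2']
filled = df.copy()
# forward- then backward-fill within each ID group
filled[cols] = filled.groupby('ID')[cols].ffill()
filled[cols] = filled.groupby('ID')[cols].bfill()
result = filled.drop_duplicates()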

pandas iterrows then match

I have a dataframe like this:
C1 C2 C3 C4
A TV /r/tv3 NaN
B Music Pop /r/pop
C /r/foo NaN NaN
I need to iterate through each row and get the value of the first column and then find the value of the column which starts with /r/. So the output should look like this:
A /r/tv3
B /r/pop
C /r/foo
What is the fastest pythonic way to do this?
Using where with startswith, then stack:
df.where(df.apply(lambda x: x.str.startswith(pat='/r/'), axis=1)).stack().reset_index(level=1, drop=True)
Out[680]:
C1
A /r/tv3
B /r/pop
C /r/foo
dtype: object
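
A runnable sketch of the same idea (my reconstruction, assuming C1 is the index as in the output above; na=False makes missing cells count as non-matches):
import pandas as pd

df = pd.DataFrame(
    {'C2': ['TV', 'Music', '/r/foo'],
     'C3': ['/r/tv3', 'Pop', None],
     'C4': [None, '/r/pop', None]},
    index=pd.Index(['A', 'B', 'C'], name='C1'))

# keep only the cells that start with '/r/', then collapse each row
# to its single remaining value
mask = df.apply(lambda col: col.str.startswith('/r/', na=False))
result = df.where(mask).stack().reset_index(level=1, drop=True)
print(result)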

Python/Pandas - drop_duplicates selecting the most complete row

I have a dataframe with information about people. However, some of them are repeated, and some rows contain more information about the same person than others. Is there a way to drop the duplicates using column 'Name' as reference but only keep the most filled rows?
If you have a dataframe like
df = pd.DataFrame([['a', np.nan, np.nan, 'M'],
                   ['a', 12, np.nan, 'M'],
                   ['c', np.nan, np.nan, 'M'],
                   ['d', np.nan, np.nan, 'M']],
                  columns=['Name', 'Age', 'Region', 'Gender'])
Sorting rows by their NaN count and then dropping duplicates on 'Name', keeping the first row, might help, i.e.
df['count'] = pd.isnull(df).sum(axis=1)
df = df.sort_values('count').drop_duplicates(subset=['Name'], keep='first').drop(columns='count')
Output:
Before:
Name Age Region Gender
0 a NaN NaN M
1 a 12.0 NaN M
2 c NaN NaN M
3 d NaN NaN M
After:
Name Age Region Gender
1 a 12.0 NaN M
2 c NaN NaN M
3 d NaN NaN M
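
A runnable restatement of the same idea that leaves df untouched (my sketch, not the answer's exact code):
order = df.isna().sum(axis=1).sort_values(kind='stable').index
# most complete row of each Name comes first, so keep='first' picks it
deduped = df.loc[order].drop_duplicates(subset='Name', keep='first').sort_index()
print(deduped)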

Drop duplicates while preserving NaNs in pandas

When using the drop_duplicates() method I reduce duplicates but also merge all NaNs into one entry. How can I drop duplicates while preserving rows with an empty entry (like np.nan, None or '')?
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['one', 'two', np.nan, np.nan, np.nan, 'two', 'two']})
Out[]:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
5 two
6 two
df.drop_duplicates(['col'])
Out[]:
col
0 one
1 two
2 NaN
Try
df[(~df.duplicated()) | (df['col'].isnull())]
The result is:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
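
For context (my addition): duplicated() and drop_duplicates() treat NaN values as equal to one another, which is why the repeated NaNs collapse into a single row; the | df['col'].isnull() term simply re-admits every NaN row. Reusing the df defined above:
print(df.duplicated())                          # the 2nd/3rd NaN and the repeated 'two' are flagged
print((~df.duplicated()) | df['col'].isnull())  # NaN rows are kept, repeated 'two' rows stay dropped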
Well, one workaround that is not really beautiful is to first save the NaN rows and put them back in:
temp = df.iloc[np.flatnonzero(pd.isnull(df).any(axis=1))]
asd = df.drop_duplicates('col')
pd.merge(temp, asd, how='outer')
Out[81]:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
Or deduplicate the non-NaN rows and concatenate the NaN rows back:
pd.concat([df.dropna(subset=['col']).drop_duplicates('col'), df[df['col'].isna()]])
