Pandas. Picking a column name based on row data - python

In my previous question, I was trying to count blanks and build a dataframe with new columns for subsequent analysis. That question became too broad, so I decided to split it up by purpose.
I have my initial dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1000, 2000, 3000, 4000],
                   '201710': [7585, 4110, 4498, np.nan],
                   '201711': [7370, 3877, 4850, 4309],
                   '201712': [6505, np.nan, 4546, 4498],
                   '201801': [7473, np.nan, np.nan, 4850],
                   '201802': [6183, np.nan, np.nan, np.nan],
                   '201803': [6699, 4558, 1429, np.nan],
                   '201804': [118, 4152, 1429, np.nan],
                   '201805': [np.nan, 4271, 1960, np.nan],
                   '201806': [np.nan, np.nan, 1798, np.nan],
                   '201807': [np.nan, np.nan, 1612, 4361],
                   '201808': [np.nan, np.nan, 1612, 4272],
                   '201809': [np.nan, 3900, 1681, 4199],
                   })
I need to obtain the start and end dates of each fraction (non-blank run) for each id.
I managed to get the first start date and the last end date, but not the ones in the middle. I then counted the blanks in each gap (for further analysis).
The code is here (it might look confusing):
# to obtain the first and last occurrence with data
res = pd.melt(df, id_vars=['id'], value_vars=df.columns[1:])
res.dropna(subset=['value'], inplace=True)
res.sort_values(by=['id', 'variable', 'value'], ascending=[True, True, True], inplace=True)
minimum_date = res.drop_duplicates(subset=['id'], keep='first')
maximum_date = res.drop_duplicates(subset=['id'], keep='last')
minimum_date.rename(columns={'variable': 'start_date'}, inplace=True)
maximum_date.rename(columns={'variable': 'end_date'}, inplace=True)
# To obtain number of gaps (nulls) and their length
res2 = pd.melt(df, id_vars=['id'], value_vars=df.columns[1:])
res2.sort_values(by=['id', 'variable'], ascending=[True, True], inplace=True)
res2 = res2.replace(np.nan, 0)
m = res2.value.diff().ne(0).cumsum().rename('gid')
gaps = res2.groupby(['id', m]).value.value_counts().loc[:, :, 0].droplevel(-1).reset_index()
# add columns to main dataset with start- and end dates and gaps
df = pd.merge(df, minimum_date[['id', 'start_date']], on=['id'], how='left')
df = pd.merge(df, maximum_date[['id', 'end_date']], on=['id'], how='left')
I arrived at a dataset like this, where start_date is the first non-null occurrence, end_date is the last non-null occurrence, and the 1-, 2-, 3- blanks columns hold the blank counts of each gap for further analysis:
The output is intended to have additional columns:

Here is a function that may be helpful, IIUC.
import pandas as pd
# create test data
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})
Create a function to find start location, end location, and size of each 'group', where a group is a sequence of repeated values (e.g., NaNs):
def extract_nans(df, field):
    df = df.copy()
    # identify NaNs
    df['is_na'] = df[field].isna()
    # identify groups (a sequence of identical values is a group): X Y X => 3 groups
    df['group_id'] = (df['is_na'] ^ df['is_na'].shift(1)).cumsum()
    # how many members in this group?
    df['group_size'] = df.groupby('group_id')['group_id'].transform('size')
    # initial and final index of each group
    df['min_index'] = df.reset_index().groupby('group_id')['index'].transform('min')
    df['max_index'] = df.reset_index().groupby('group_id')['index'].transform('max')
    return df
Results:
summary = extract_nans(t, 'x')
print(summary)
x is_na group_id group_size min_index max_index
0 10.0 False 0 2 0 1
1 20.0 False 0 2 0 1
2 NaN True 1 3 2 4
3 NaN True 1 3 2 4
4 NaN True 1 3 2 4
5 30.0 False 2 4 5 8
6 40.0 False 2 4 5 8
7 50.0 False 2 4 5 8
8 60.0 False 2 4 5 8
9 NaN True 3 5 9 13
10 NaN True 3 5 9 13
11 NaN True 3 5 9 13
12 NaN True 3 5 9 13
13 NaN True 3 5 9 13
14 70.0 False 4 1 14 14
Now, you can exclude 'x' from the summary, drop duplicates, filter to keep only NaN values (is_na == True), filter to keep sequences above a certain length (e.g., at least 3 consecutive NaN values), etc. Then, if you drop duplicates, the first row will summarize the first NaN run, second row will summarize the second NaN run, etc.
Finally, you can use this with apply() to process the whole data frame, if this is what you need.
Short version of results, for the test data frame:
print(summary[summary['is_na']].drop(columns='x').drop_duplicates())
is_na group_id group_size min_index max_index
2 True 1 3 2 4
9 True 3 5 9 13
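To run this against the original question's data, here is a hedged sketch (assuming df is the original wide frame from the question, before the merges, and extract_nans is defined as above); it melts to long format and summarizes the NaN runs per id with a plain groupby loop rather than apply:
long_df = pd.melt(df, id_vars=['id'], var_name='month', value_name='x')
long_df = long_df.sort_values(['id', 'month'])

gap_frames = []
for uid, g in long_df.groupby('id'):
    g = g.reset_index(drop=True)
    s = extract_nans(g, 'x')
    # one row per NaN run for this id
    runs = s[s['is_na']].drop(columns=['x', 'month']).drop_duplicates()
    # map run boundaries back to the month labels
    runs = runs.assign(start=runs['min_index'].map(g['month']),
                       end=runs['max_index'].map(g['month']))
    gap_frames.append(runs)

gaps = pd.concat(gap_frames, ignore_index=True)
print(gaps)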

Related

Count NA and non-NA per group in pandas

I assume this is a simple task for pandas, but I don't get it.
I have data like this:
Group Val
0 A 0
1 A 1
2 A <NA>
3 A 3
4 B 4
5 B <NA>
6 B 6
7 B <NA>
And I want to know the frequency of valid and invalid values in Val per group. This is the expected result:
A B Total
Valid 3 2 5
NA 1 2 3
Here is code to generate that sample data.
#!/usr/bin/env python3
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAABBBB'),
    'Val': range(8)
})

# some values to NA
for idx in [2, 5, 7]:
    df.iloc[idx, 1] = pd.NA

print(df)
What I tried is something with grouping:
>>> df.groupby('Group').agg(lambda x: x.isna())
Val
Group
A [False, False, True, False]
B [False, True, False, True]
>>> df.groupby('Group').apply(lambda x: x.isna())
Group Val
0 False False
1 False False
2 False True
3 False False
4 False False
5 False True
6 False False
7 False True
You are close with using groupby and isna:
new = df.groupby(['Group', df['Val'].isna().replace({True: 'NA', False: 'Valid'})])['Group'].count().unstack(level=0)
new['Total'] = new.sum(axis=1)
print(new)
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
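If the one-liner is hard to parse, the same computation can be written in steps (an equivalent sketch of the line above):
# label each row's Val as 'NA' or 'Valid'
status = df['Val'].isna().replace({True: 'NA', False: 'Valid'})
# count rows per (Group, status) pair, then pivot Group into columns
new = df.groupby(['Group', status])['Group'].count().unstack(level=0)
new['Total'] = new.sum(axis=1)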
Here is one way to do it:
# crosstab to build the summary;
# convert Val to 'NA' or 'Valid' depending on the value
df2 = (pd.crosstab(df['Val'].isna().map({True: 'NA', False: 'Valid'}),
                   df['Group'])
       .reset_index()
       .rename_axis(columns=None))
df2['Total'] = df2.sum(axis=1, numeric_only=True)  # add Total column
out = df2.set_index('Val')  # set index to match expected output
out
A B Total
Val
NA 1 2 3
Valid 3 2 5
If you need both row and column totals, it's even simpler with crosstab:
df2 = pd.crosstab(df['Val'].isna().map({True: 'NA', False: 'Valid'}),
                  df['Group'],
                  margins=True, margins_name='Total')
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
Total 4 4 8
Another possible solution, based on pandas.pivot_table and on the following ideas:
Add a new column, status, containing 'NA' or 'Valid' depending on whether the corresponding value is NaN or not.
Create a pivot table, using len as aggregation function.
Add the Total column, by summing by rows.
import numpy as np

(df.assign(status=np.where(df['Val'].isna(), 'NA', 'Valid'))
   .pivot_table(index='status', columns='Group', values='Val',
                aggfunc=lambda x: len(x))
   .reset_index()
   .rename_axis(None, axis=1)
   .assign(Total=lambda x: x.sum(axis=1, numeric_only=True)))
Output:
status A B Total
0 NA 1 2 3
1 Valid 3 2 5

How to split dataframe cells using a delimiter into different dataframes, with conditions

There are other questions on the same topic and they helped, but I have an extra twist.
I have a dataframe with multiple values in each (but not all) cells.
df = pd.DataFrame({'a':["10-30-410","20-40-500","25-50"], 'b':["5-8-9","4", "99"]})
index  a          b
0      10-30-410  5-8-9
1      20-40-500  4
2      25-50      99
How can I split each cell by the dash "-" and create three new dataframes? Note that not all cells have multiple values, in which case the second and third dataframes get NA or blank (treating these as strings).
So I need df1 to be the first of those values:
index  a   b
0      10  5
1      20  4
2      25  99
And df2 would be:
index  a   b
0      30  8
1      40
2      50
And likewise for df3:
index  a    b
0      410  9
1      500
2
I got df1 with this
df1 = df.replace(r'(\d+).*(\d+).*(\d+)+', r'\1', regex=True)
But df2 doesn't quite work. I get the second values but also 4 and 99, which should be blank.
df2 = df.replace(r'(\d+)-(\d+).*', r'\2', regex=True)
index  a   b
0      30  8
1      40  4    <- should be blank
2      50  99   <- should be blank
Is this the right approach? I'm pretty good on regex but fuzzy with groups. Thank you.
Use str.split + concat + stack to get the data in a more usable format:
new_df = pd.concat(
    (df['a'].str.split('-', expand=True),
     df['b'].str.split('-', expand=True)),
    keys=('a', 'b'),
    axis=1
).stack(dropna=False).droplevel(0)
new_df:
a b
0 10 5
1 30 8
2 410 9
0 20 4
1 40 None
2 500 None
0 25 99
1 50 None
2 None None
Expandable option for n cols:
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
Then groupby level 0 + reset_index to create a list of dataframes:
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
dfs:
[ a b
0 10 5
1 20 4
2 25 99,
a b
0 30 8
1 40 None
2 50 None,
a b
0 410 9
1 500 None
2 None None]
Complete Working Example:
import pandas as pd

df = pd.DataFrame({
    'a': ["10-30-410", "20-40-500", "25-50"],
    'b': ["5-8-9", "4", "99"]
})
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
print(dfs)
You can also try with filter:
k = pd.concat((df[c].str.split('-', expand=True).add_prefix(c + '-')
               for c in df.columns), axis=1).fillna('')
df1 = k.filter(like='0')
df2 = k.filter(like='1')
df3 = k.filter(like='2')
Note: to strip the digit from the column names, use k.filter(like='0').rename(columns=lambda x: x.split('-')[0]), as sketched below.
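Putting that note to work, a short sketch (assuming k from above) that rebuilds all three frames with the original column names:
# strip the '-<position>' suffix so each split frame keeps columns 'a' and 'b'
df1, df2, df3 = [
    k.filter(like=str(i)).rename(columns=lambda x: x.split('-')[0])
    for i in range(3)
]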

Python: How to replace missing values column-wise by median

I have a dataframe as follows:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.45, 2.33, np.nan],
                   'C': [4, 5, 6], 'D': [4.55, 7.36, np.nan]})
I want to replace the missing values, i.e. np.nan, in a generic way. For this I have created the following function:
def treat_mis_value_nu(df):
    df_nu = df.select_dtypes(include=['number'])
    lst_null_col = df_nu.columns[df_nu.isnull().any()].tolist()
    if len(lst_null_col) > 0:
        for i in lst_null_col:
            if df_nu[i].isnull().sum()/len(df_nu[i]) > 0.10:
                df_final_nu = df_nu.drop([i], axis=1)
            else:
                df_final_nu = df_nu[i].fillna(df_nu[i].median(), inplace=True)
    return df_final_nu
When I apply this function as follows
df_final = treat_mis_value_nu(df)
I am getting a dataframe as follows
A B C
0 1 1.0 4
1 2 2.0 5
2 3 NaN 6
So it has correctly removed column D, but failed to remove column B.
I know there have been discussions on this topic in the past (here). Still, I might be missing something?
Use:
df = pd.DataFrame({'A': [1, 2, 3, 5, 7], 'B': [1.45, 2.33, np.nan, np.nan, np.nan],
                   'C': [4, 5, 6, 8, 7], 'D': [4.55, 7.36, np.nan, 9, 10],
                   'E': list('abcde')})
print (df)
A B C D E
0 1 1.45 4 4.55 a
1 2 2.33 5 7.36 b
2 3 NaN 6 NaN c
3 5 NaN 8 9.00 d
4 7 NaN 7 10.00 e
def treat_mis_value_nu(df):
    # get only numeric columns
    df_nu = df.select_dtypes(include=['number'])
    # keep only columns with NaNs
    df_nu = df_nu.loc[:, df_nu.isnull().any()]
    # columns to drop: mean() instead of sum()/len() does the same
    cols_to_drop = df_nu.columns[df_nu.isnull().mean() > 0.30]
    # fill missing values in the kept columns, drop those above the threshold
    return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1)
print (treat_mis_value_nu(df))
A C D E
0 1 4 4.55 a
1 2 5 7.36 b
2 3 6 8.18 c
3 5 8 9.00 d
4 7 7 10.00 e
I would recommend looking at the sklearn Imputer transformer. I don't think it can drop columns, but it can definitely fill them in a 'generic way': for example, filling in missing values with the median of the relevant column.
You could use it as such:
# note: in modern scikit-learn, Imputer was replaced by sklearn.impute.SimpleImputer
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy='median')
num_df = df.values
names = df.columns.values
# fit_transform learns the medians and fills the NaNs in one step
df_final = pd.DataFrame(imputer.fit_transform(num_df), columns=names)
If you have additional transformations you would like to make you could consider making a transformation Pipeline or could even make your own transformers to do bespoke tasks.
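For instance, here is a minimal sketch of such a pipeline, assuming the modern scikit-learn API (Imputer has since been replaced by sklearn.impute.SimpleImputer; the scaling step is purely illustrative):
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# median imputation followed by an illustrative scaling step
num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
df_final = pd.DataFrame(num_pipeline.fit_transform(df),
                        columns=df.columns, index=df.index)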

Dataframe is not updated when columns are passed to function using apply

I have two dataframes like this:
A B
a 1 10
b 2 11
c 3 12
d 4 13
A B
a 11 NaN
b NaN NaN
c NaN 20
d 16 30
They have identical column names and indices. My goal is to replace the NaNs in df2 with the values from df1. Currently, I do it like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': range(1, 5), 'B': range(10, 14)}, index=list('abcd'))
df2 = pd.DataFrame({'A': [11, np.nan, np.nan, 16], 'B': [np.nan, np.nan, 20, 30]}, index=list('abcd'))
def repl_na(s, d):
    s[s.isnull().values] = d[s.isnull().values][s.name]
    return s
df2.apply(repl_na, args=(df1, ))
which gives me the desired output:
A B
a 11 10
b 2 11
c 3 20
d 16 30
My question is now how this could be accomplished if the indices of the dataframes are different (column names are still the same, and the columns have the same length). So I would have a df2 like this(df1 is unchanged):
A B
0 11 NaN
1 NaN NaN
2 NaN 20
3 16 30
Then the above code does not work anymore since the indices of the dataframes are different. Could someone tell me how the line
s[s.isnull().values] = d[s.isnull().values][s.name]
has to be modified in order to get the same result as above?
You could temporarily change the index of df1 to match df2's and just use combine_first with df2:
df2.combine_first(df1.set_index(df2.index))
    A   B
0  11  10
1   2  11
2   3  20
3  16  30
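If you would rather not touch df1's index at all, a hedged alternative is to align purely by position with numpy:
import numpy as np

# take df1's value wherever df2 is NaN, matching rows by position
out = pd.DataFrame(np.where(df2.isna(), df1.values, df2.values),
                   index=df2.index, columns=df2.columns)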

pandas DataFrame add a new column and fillna

I am trying to add a column to a pandas dataframe, like so:
df = pd.DataFrame()
df['one'] = pd.Series({'1':4, '2':6})
print (df)
df['two'] = pd.Series({'0':4, '2':6})
print (df)
This yields:
one two
1 4 NaN
2 6 6
However, I would like the result to be:
one two
0 NaN 4
1 4 NaN
2 6 6
How do you do that?
One possibility is to use pd.concat:
ser1 = pd.Series({'1':4, '2':6})
ser2 = pd.Series({'0':4, '2':6})
df = pd.concat((ser1, ser2), axis=1)
to get
0 1
0 NaN 4
1 4 NaN
2 6 6
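Since concat falls back to integer column labels for unnamed Series, a small variation is to name the Series first so the intended labels come through:
ser1 = pd.Series({'1': 4, '2': 6}, name='one')
ser2 = pd.Series({'0': 4, '2': 6}, name='two')
df = pd.concat((ser1, ser2), axis=1)  # columns are now 'one' and 'two'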
You can use join, telling pandas exactly how you want to do it:
df = pd.DataFrame()
df['one'] = pd.Series({'1': 4, '2': 6})
df.join(pd.Series({'0': 4, '2': 6}, name='two'), how='outer')
This results in
one two
0 NaN 4
1 4 NaN
2 6 6
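Another option along the same lines, as a sketch using reindex: expand the index to the union up front, and plain column assignment then aligns as expected:
df = pd.DataFrame()
df['one'] = pd.Series({'1': 4, '2': 6})
df = df.reindex(df.index.union(['0']))  # grow the index before assigning
df['two'] = pd.Series({'0': 4, '2': 6})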
