How to replace missing values with group mode in Pandas? - python

I followed the method in this post to replace missing values with the group mode, but encountered "IndexError: index out of bounds".
df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x.mode()[0]))
I guess this is probably because some groups have all missing values and do not have a mode. Is there a way to get around this? Thank you!

Computing a mode is awkward, given that there really isn't any agreed-upon way to deal with ties. Plus it's typically very slow. Here's one way that is reasonably fast. We define a function that calculates the mode for each group, then fill the missing values afterwards with a map. Groups whose values are all missing simply get no mode, so they cause no error; for ties we arbitrarily choose the modal value that comes first when sorted:
def fast_mode(df, key_cols, value_col):
    """
    Calculate a column mode, by group, ignoring null values.
    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame over which to calculate the mode.
    key_cols : list of str
        Columns to groupby for calculation of mode.
    value_col : str
        Column for which to calculate the mode.
    Returns
    -------
    pandas.DataFrame
        One row for the mode of value_col per key_cols group. If ties,
        returns the one which is sorted first.
    """
    return (df.groupby(key_cols + [value_col]).size()
              .to_frame('counts').reset_index()
              .sort_values('counts', ascending=False)
              .drop_duplicates(subset=key_cols)).drop(columns='counts')
Sample data df:
CIK SIK
0 C 2.0
1 C 1.0
2 B NaN
3 B 3.0
4 A NaN
5 A 3.0
6 C NaN
7 B NaN
8 C 1.0
9 A 2.0
10 D NaN
11 D NaN
12 D NaN
Code:
df.loc[df.SIK.isnull(), 'SIK'] = df.CIK.map(fast_mode(df, ['CIK'], 'SIK').set_index('CIK').SIK)
Output df:
CIK SIK
0 C 2.0
1 C 1.0
2 B 3.0
3 B 3.0
4 A 2.0
5 A 3.0
6 C 1.0
7 B 3.0
8 C 1.0
9 A 2.0
10 D NaN
11 D NaN
12 D NaN
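For comparison, the one-liner from the question can also be kept and simply guarded against groups that have no mode. This is only a minimal sketch on the sample data above; the helper name fill_with_group_mode is made up for illustration. Series.mode ignores NaN by default and returns its values sorted, so ties resolve to the smallest value here:

```python
import numpy as np
import pandas as pd

# Sample data from the answer above (column SIK, grouped by CIK).
df = pd.DataFrame({
    "CIK": list("CCBBAACBCADDD"),
    "SIK": [2, 1, np.nan, 3, np.nan, 3, np.nan, np.nan, 1, 2,
            np.nan, np.nan, np.nan],
})


def fill_with_group_mode(s):
    """Fill NaN in one group with that group's mode, if it has one."""
    modes = s.mode()  # empty when every value in the group is NaN
    if modes.empty:
        return s      # nothing to fill with: leave the group untouched
    return s.fillna(modes.iloc[0])


filled = df.groupby("CIK")["SIK"].transform(fill_with_group_mode)
print(filled)
```

This will usually be slower than the map-based approach on many groups, since the helper runs once per group in Python.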


Divide several columns with the same column name ending by one other column in python

I have a similar question to this one.
I have a dataframe with several rows, which looks like this:
Name TypA TypB ... TypF TypA_value TypB_value ... TypF_value Divider
1 1 1 NaN 10 5 NaN 5
2 NaN 2 NaN NaN 20 NaN 10
I want to divide all columns whose names end in "value" by the column "Divider". How can I do so? One trick would be to sort the columns and then use the answer linked above, but is there a direct way that does not require sorting the dataframe?
The outcome would be:
Name TypA TypB ... TypF TypA_value TypB_value ... TypF_value Divider
1 1 1 NaN 2 1 0 5
2 NaN 2 NaN 0 2 0 10
So a NaN will lead to a 0.
Use DataFrame.filter to select the columns whose names contain '_value', then DataFrame.div with axis=0 to divide them row-wise by the Divider column, and finally DataFrame.update to write the results back into the dataframe:
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
Result:
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 0.0 5
1 2 NaN 2 NaN 0.0 2.0 0.0 10
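A self-contained variant of the same idea, as a sketch: it assumes a reduced two-type version of the question's frame (the TypF columns are left out) and assigns the result back directly instead of calling DataFrame.update, so the divided columns simply become float columns:

```python
import numpy as np
import pandas as pd

# Reduced sample frame; only TypA/TypB are included here.
df = pd.DataFrame({
    "Name": [1, 2],
    "TypA": [1, np.nan],
    "TypB": [1, 2],
    "TypA_value": [10, np.nan],
    "TypB_value": [5, 20],
    "Divider": [5, 10],
})

# Columns whose names end in "_value".
value_cols = df.filter(regex=r"_value$").columns

# Divide row-wise by Divider, turn NaN into 0, and write back.
df[value_cols] = df[value_cols].div(df["Divider"], axis=0).fillna(0)
print(df)
```

Direct assignment replaces the selected columns wholesale, which avoids any dtype surprises when an integer column receives fractional results.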
You could select the columns of interest using DataFrame.filter, and divide as:
value_cols = df.filter(regex=r'_value$').columns
df[value_cols] /= df['Divider'].to_numpy()[:,None]
# df[value_cols] = df[value_cols].fillna(0)
print(df)
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 NaN 5
1 2 NaN 2 NaN NaN 2.0 NaN 10
Taking two sample columns A and B :
import pandas as pd
import numpy as np
a = {'Name': [1, 2],
     'TypA': [1, np.nan],
     'TypB': [1, 2],
     'TypA_value': [10, np.nan],
     'TypB_value': [5, 20],
     'Divider': [5, 10]
     }
df = pd.DataFrame(a)
cols_all = df.columns
Find the columns for which the calculation is to be done, assuming they all contain '_value':
cols_to_calc = [c for c in cols_all if '_value' in c]
For these columns, first divide by the Divider column, then replace NaN with 0:
for c in cols_to_calc:
    df[c] = df[c] / df.Divider
    df[c] = df[c].fillna(0)

Rolling sum over a partition in python

Code:
data['rolling_sum'] = data.groupby(['User_id'])['Amount'].rolling().sum()
Error
TypeError: incompatible index of inserted column with frame index
Please help in figuring out the mistake in the code. An alternative method would also be appreciated.
Use Series.reset_index with level=0 and drop=True to remove the first level of the MultiIndex. This is the safer option, because the assignment is then aligned on the original index values:
data = pd.DataFrame({
    'Amount': [5, 3, 6, 9, 2, 4],
    'User_id': list('aababb')
})
data['rolling_sum1'] = data.groupby(['User_id'])['Amount'].rolling(2).sum().reset_index(level=0, drop=True)
If you assign only the NumPy array instead, the values can end up in the wrong rows, because they are in group order rather than in the original row order:
data['rolling_sum2'] = data.groupby(['User_id'])['Amount'].rolling(2).sum().values
print (data)
Amount User_id rolling_sum1 rolling_sum2
0 5 a NaN NaN
1 3 a 8.0 8.0
2 6 b NaN 12.0
3 9 a 12.0 NaN
4 2 b 8.0 8.0
5 4 b 6.0 6.0
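Since the question also asks for an alternative, one option is groupby(...).transform, which returns a result that already carries the original index. A minimal sketch on the same sample data; the column name rolling_sum3 and the window of 2 are just illustrative choices:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Amount": [5, 3, 6, 9, 2, 4],
    "User_id": list("aababb"),
})

# transform keeps the original row index, so no reset_index is needed.
data["rolling_sum3"] = (
    data.groupby("User_id")["Amount"]
        .transform(lambda s: s.rolling(2).sum())
)
print(data)
```

The lambda runs once per group, so for a very large number of users the reset_index version may be faster.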

Combine two dataframes based on ranges which may partially overlap using Pandas and track multiple values

I have two big dataframes (100K rows), One has 'values', one has 'types'. I want to assign a 'type' from df2 to a column in df1 based on depth. The depths are given as depth 'From' and depth 'To' columns. The 'types' are also defined by depth 'From' and 'To'. BUT they are NOT the same intervals. df1 depths may span multiple df2 types.
I want to assign the df2 'types' to df1 and if there are multiple types, try and capture that information too. Example below.
import pandas as pd
import numpy as np
df1=pd.DataFrame(np.array([[1,3,0.001],[3,5,0.005],[5,7,0.002],[7,10,0.001]]),columns=['From', 'To', 'val'])
df2=pd.DataFrame(np.array([[0.0,4,'A'],[4,5,'B'],[5,6,'C'],[6,8,'D'],[8,10,'E']]),columns=['From', 'To', 'Type'])
df1
Out[1]:
From To val
0 1.0 3.0 0.001
1 3.0 5.0 0.005
2 5.0 7.0 0.002
3 7.0 10.0 0.001
df2
Out[2]:
From To Type
0 0 4 A
1 4 5 B
2 5 6 C
3 6 8 D
4 8 10 E
Possible acceptable output:
Out[4]:
From To val Type
0 1 3 0.001 A
1 3 5 0.005 1 unit A,2 units B
2 5 7 0.002 1 unit C,1 unit D
3 7 10 0.001 1 unit D, 3 units E
Percentages of types would also be a good output in Type.
One solution may be to create a new dataframe with high resolution 'depths' and forward fill the types, and do a sort of VLOOKUP on the To and the From.
I also thought about the possibility of making a column in each df that is a 'set' based on the to and from cols.
Possible join or merge but need to get the data compatible first.
I don't know where to start. I am hoping there is a neat way to tackle this. I have basically the exact same situation as this guy, but I don't speak 'R' and would like to possibly report multiple type info.
From df2 create an auxiliary Series, marking each "starting point"
of a unit (a range of length 1):
# Assumes From and To hold integers (range() rejects strings and floats).
units = df2.set_index('Type').apply(lambda row: pd.Series(
    range(row.From, row.To)), axis=1).stack()\
    .reset_index(level=1, drop=True)
The result is:
Type
A 0.0
A 1.0
A 2.0
A 3.0
B 4.0
C 5.0
D 6.0
D 7.0
E 8.0
E 9.0
dtype: float64
Then define a function generating Type for the current row:
def getType(row):
    gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
    if gr.ngroups == 1:
        return gr.ngroup().index[0]
    txt = []
    for key, grp in gr:
        siz = grp.size
        un = 'unit' if siz == 1 else 'units'
        txt.append(f'{siz} {un} {key}')
    return ','.join(txt)
And to generate Type column, apply it to each row:
df1['Type'] = df1.apply(getType, axis=1)
The result is:
From To val Type
0 1.0 3.0 0.001 A
1 3.0 5.0 0.005 1 unit A,1 unit B
2 5.0 7.0 0.002 1 unit C,1 unit D
3 7.0 10.0 0.001 1 unit D,2 units E
This result differs a bit from your expected result, but I think
you created yours somewhat inconsistently.
I think that my solution is correct (or at least more consistent), because:
Row 1.0 - 3.0 is entirely within the limits of 0 4 A, so the result is just A (like in your post).
Row 3.0 - 5.0 can be "divided" into:
    3.0 - 4.0 is within the limits of 0 4 A (1 unit),
    4.0 - 5.0 is within the limits of 4 5 B (also 1 unit, but you want 2 units here).
Row 5.0 - 7.0 can be again "divided" into:
    5.0 - 6.0 is within the limits of 5 6 C (1 unit),
    6.0 - 7.0 is within the limits of 6 8 D (1 unit, just like you did).
Row 7.0 - 10.0 can be "divided" into:
    7.0 - 8.0 is within the limits of 6 8 D (1 unit, just like you did),
    8.0 - 10.0 is within the limits of 8 10 E (2 units, not 3 as you wrote).
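If the depths are not whole numbers, the same bookkeeping can be done without enumerating unit steps, by computing the overlap length of each df1 interval with each df2 interval directly. The following is only a sketch under assumptions: the frames are rebuilt with numeric From/To columns (the np.array construction in the question turns every value into a string), and the helper name describe_types is made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"From": [1.0, 3.0, 5.0, 7.0],
                    "To": [3.0, 5.0, 7.0, 10.0],
                    "val": [0.001, 0.005, 0.002, 0.001]})
df2 = pd.DataFrame({"From": [0.0, 4.0, 5.0, 6.0, 8.0],
                    "To": [4.0, 5.0, 6.0, 8.0, 10.0],
                    "Type": list("ABCDE")})


def describe_types(row):
    """Describe which df2 types overlap [row.From, row.To) and by how much."""
    # Overlap length = min(To) - max(From), floored at zero.
    overlap = (df2["To"].clip(upper=row["To"])
               - df2["From"].clip(lower=row["From"])).clip(lower=0)
    hits = overlap[overlap > 0]
    if len(hits) == 1:
        return df2.loc[hits.index[0], "Type"]
    parts = []
    for i, length in hits.items():
        unit = "unit" if length == 1 else "units"
        parts.append(f"{length:g} {unit} {df2.loc[i, 'Type']}")
    return ",".join(parts)


df1["Type"] = df1.apply(describe_types, axis=1)
print(df1)
```

Each row of df1 is compared against all of df2, so on two frames of 100K rows this may be slow; dividing each overlap by the row's total length would give the percentages mentioned in the question.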

Python Dataframe filling up non existing

I was wondering if there is an efficient way to add rows to a DataFrame that, for example, contain the average or a predefined value in case there are not enough rows for a specific value in another column. I guess the description of the problem is not the best, which is why you will find an example below:
Say we have the Dataframe
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
And we want to have 2 rows for each client A, B, C, D, no matter whether these 2 rows already exist or not. So for clients A and B we can just copy the rows. For C we want to add a row which says Client = C, NumberOfProducts = average of existing rows = 9, and ID is not of interest (so we could set it to ID = smallest existing one - 1 = 0; any other value, even NaN, would also be possible). For client D there does not exist a single row, so we want to add 2 rows where NumberOfProducts is equal to the constant 2.5. The output should then look like this:
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
C 9 0
D 2.5 NaN
D 2.5 NaN
What I have done so far is to loop through the dataframe and add rows where necessary. Since this is pretty inefficient any better solution would be highly appreciated.
Use:
clients = ['A','B','C','D']
N = 2
#test only values from list and also filter only 2 rows for each client if necessary
df = df[df['Client'].isin(clients)].groupby('Client').head(N)
#create helper counter and reshape by unstack
df1 = df.set_index(['Client',df.groupby('Client').cumcount()]).unstack()
#set first if only 1 row per client - replace second NumberOfProducts by first
df1[('NumberOfProducts',1)] = df1[('NumberOfProducts',1)].fillna(df1[('NumberOfProducts',0)])
# ... replace second ID by first subtracted by 1
df1[('ID',1)] = df1[('ID',1)].fillna(df1[('ID',0)] - 1)
#add missing clients by reindex
df1 = df1.reindex(clients)
#replace NumberOfProducts by constant 2.5
df1['NumberOfProducts'] = df1['NumberOfProducts'].fillna(2.5)
print (df1)
NumberOfProducts ID
0 1 0 1
Client
A 1.0 5.0 2.0 1.0
B 1.0 6.0 2.0 1.0
C 9.0 9.0 1.0 0.0
D 2.5 2.5 NaN NaN
#last reshape to original
df2 = df1.stack().reset_index(level=1, drop=True).reset_index()
print (df2)
Client NumberOfProducts ID
0 A 1.0 2.0
1 A 5.0 1.0
2 B 1.0 2.0
3 B 6.0 1.0
4 C 9.0 1.0
5 C 9.0 0.0
6 D 2.5 NaN
7 D 2.5 NaN
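A variant without the unstack/stack round trip, as a rough sketch: number the existing rows of each client, reindex onto the full client x slot grid, then fill the gaps. It assumes each client has at most N rows already, fills NumberOfProducts with the group average as described in the question (rather than the first value), and leaves missing IDs as NaN; the level name slot is a made-up helper:

```python
import pandas as pd

df = pd.DataFrame({
    "Client": list("AABBC"),
    "NumberOfProducts": [1, 5, 1, 6, 9],
    "ID": [2, 1, 2, 1, 1],
})
clients = ["A", "B", "C", "D"]
N = 2

# Position of each row within its client (0, 1, ...).
slot = df.groupby("Client").cumcount().rename("slot")

# Every (client, slot) combination that should exist in the result.
full = pd.MultiIndex.from_product([clients, range(N)],
                                  names=["Client", "slot"])
out = df.set_index(["Client", slot]).reindex(full)

# Fill with the client's average first, then with the constant 2.5.
group_mean = out.groupby(level="Client")["NumberOfProducts"].transform("mean")
out["NumberOfProducts"] = out["NumberOfProducts"].fillna(group_mean).fillna(2.5)

out = out.reset_index(level="slot", drop=True).reset_index()
print(out)
```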

Using fillna() selectively in pandas

I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive NaNs within a column, I want them to be filled by the preceding non-NaN value, but only if the length of the NaN sequence is at or below a specified threshold. For example, if the threshold is 3, then a within-column sequence of 3 or fewer NaNs will be filled with the preceding non-NaN value, whereas a sequence of 4 or more NaNs will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options. But these are unfortunately not sufficient to achieve the task. I tried to specify method='ffill' and limit=3, but that fills in the first 3 NaNs of any sequence, not selectively as described above.
I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?
Working with contiguous groups is still a little awkward in pandas.. or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern:
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
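The same compare-cumsum pattern as a standalone script, in case the IPython session above is hard to copy. This is a sketch only: the function name ffill_short_gaps and the max_gap parameter are made up, and run lengths are looked up with value_counts instead of a groupby transform:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [2, 5, 4],
    [np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [5, np.nan, np.nan],
    [9, 3, np.nan],
    [7, 9, 1],
])


def ffill_short_gaps(frame, max_gap):
    """Forward-fill only NaN runs of at most max_gap consecutive rows."""
    nulls = frame.isna()
    # A new run starts wherever the null flag differs from the row above.
    run_id = (nulls != nulls.shift()).cumsum()
    # Length of the run each cell belongs to, computed per column.
    run_len = run_id.apply(lambda col: col.map(col.value_counts()))
    fillable = nulls & (run_len <= max_gap)
    return frame.where(~fillable, frame.ffill())


result = ffill_short_gaps(df, max_gap=3)
print(result)
```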
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    filled = df.ffill(limit=3)
    unfilled = nulls & (~filled.notnull())
    nf = nulls.replace({False: 2.0, True: np.nan})
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
