How to handle missing data in a pandas dataframe? - python

I have a pandas dataframe containing the following information:
For each Timestamp, there are a number of Trays (between 1 and 4) out of 8 available Trays. (So there are at most 4 Trays per Timestamp.)
Each Tray consists of 4 positions.
A dataframe could look like this:
df =
timestamp t_idx position error type SNR
0 16229767 5 2 1 T1 123
1 16229767 5 1 0 T1 123
3 16229767 5 3 0 T1 123
4 16229767 5 4 0 T1 123
5 16229767 3 3 1 T9 38
6 16229767 3 1 0 T9 38
7 16229767 3 4 0 T9 38
8 29767162 7 1 0 T4 991
9 29767162 7 4 1 T4 991
If we look at the timestamp "16229767", there were 2 trays in use: Tray 3 and Tray 5.
Each position for Tray 5 was detected.
However, Tray 3 has missing data, as position 2 is missing.
I would like to fix that and add these rows programmatically:
10 16229767 3 2 1 T9 38
11 29767162 7 2 1 T4 991
12 29767162 7 3 1 T4 991
I am not sure how to handle the missing values correctly. My naive approach right now is:
timestamps = df['timestamp'].unique()
for ts in timestamps:
    tray_ids = df.loc[df['timestamp'] == ts]['t_idx'].unique()
    for t_id in tray_ids:
        # For this timestamp and tray id: each position (1 to 4) should exist once!
        # df.loc[(df['timestamp'] == ts) & (df['t_idx'] == t_id)]
        # if not, append the missing positions for the tray and set error to 1
How can I find the missing positions now and add the rows to my dataframe?
===
Edit:
I simplified my example, but omitted a relevant piece of information:
There are also other columns, and the newly generated rows should have the same content per tray. I made this clearer by adding two more columns.
Also, there was a question about the error column: for each row that has to be added, error should automatically be 1 (there is no further logic behind it).
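To make the goal concrete, here is a minimal loop-based sketch of what I am after (it assumes positions are always 1 to 4 and that type and SNR can be copied from any existing row of the same tray):
rows_to_add = []
for (ts, t_id), group in df.groupby(['timestamp', 't_idx']):
    missing = set(range(1, 5)) - set(group['position'])
    for pos in missing:
        new_row = group.iloc[0].copy()   # copy type, SNR, ... from an existing row of this tray
        new_row['position'] = pos
        new_row['error'] = 1             # added rows always get error = 1
        rows_to_add.append(new_row)
if rows_to_add:
    df = pd.concat([df, pd.DataFrame(rows_to_add)], ignore_index=True)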

We can start by converting position to a categorical type and using a groupby to generate all the missing (timestamp, t_idx, position) combinations, setting the corresponding error values to 1.
We also have to fill the type and SNR columns with the correct values, like so:
>>> df['position'] = pd.Categorical(df['position'], categories=df['position'].unique())
>>> df_grouped = df.groupby(['timestamp', 't_idx', 'position'], as_index=False).first()
>>> df_grouped['error'] = df_grouped['error'].fillna(1)
>>> df_grouped.sort_values('type', inplace=True)
>>> df_grouped['type'] = df_grouped.groupby(['timestamp','t_idx'])['type'].ffill().bfill()
>>> df_grouped.sort_values('SNR', inplace=True)
>>> df_grouped['SNR'] = df_grouped.groupby(['timestamp','t_idx'])['SNR'].ffill().bfill()
>>> df_grouped = df_grouped.reset_index(drop=True)
timestamp t_idx position error type SNR
0 16229767 3 1 0.0 T9 38.0
1 16229767 3 3 1.0 T9 38.0
2 16229767 3 4 0.0 T9 38.0
3 16229767 5 2 1.0 T1 123.0
4 16229767 5 1 0.0 T1 123.0
5 16229767 5 3 0.0 T1 123.0
6 16229767 5 4 0.0 T1 123.0
7 29767162 7 1 0.0 T4 991.0
8 29767162 7 4 1.0 T4 991.0
9 16229767 3 2 1.0 T9 38.0
10 16229767 7 2 1.0 T4 991.0
11 16229767 7 1 1.0 T4 991.0
12 16229767 7 3 1.0 T4 991.0
13 16229767 7 4 1.0 T4 991.0
14 29767162 3 2 1.0 T4 991.0
15 29767162 3 1 1.0 T4 991.0
16 29767162 3 3 1.0 T4 991.0
17 29767162 3 4 1.0 T4 991.0
18 29767162 5 2 1.0 T4 991.0
19 29767162 5 1 1.0 T4 991.0
20 29767162 5 3 1.0 T4 991.0
21 29767162 5 4 1.0 T4 991.0
22 29767162 7 2 1.0 T4 991.0
23 29767162 7 3 1.0 T4 991.0
And then, we filter on the (timestamp, t_idx) pairs present in the original DataFrame to get the expected result:
>>> df_grouped[
... pd.Series(
... list(zip(df_grouped['timestamp'].values, df_grouped['t_idx'].values))
... ).isin(list(zip(df['timestamp'].values, df['t_idx'].values)))
... ].sort_values(by=['timestamp', 't_idx']).reset_index(drop=True)
timestamp t_idx position error type SNR
0 16229767 3 1 0.0 T9 38.0
1 16229767 3 3 1.0 T9 38.0
2 16229767 3 4 0.0 T9 38.0
3 16229767 3 2 1.0 T9 38.0
4 16229767 5 2 1.0 T1 123.0
5 16229767 5 1 0.0 T1 123.0
6 16229767 5 3 0.0 T1 123.0
7 16229767 5 4 0.0 T1 123.0
8 29767162 7 1 0.0 T4 991.0
9 29767162 7 4 1.0 T4 991.0
10 29767162 7 2 1.0 T4 991.0
11 29767162 7 3 1.0 T4 991.0

pyjanitor has a complete function that exposes explicitly missing values (pyjanitor is a collection of convenient Pandas functions).
In the challenge above, only the explicitly missing values within the data need to be exposed:
# pip install pyjanitor
import pandas as pd
import janitor
(df.complete(['timestamp', 't_idx', 'type', 'SNR'], 'position')
   .fillna({"error": 1}, downcast='infer')
   .filter(df.columns)
)
timestamp t_idx position error type SNR
0 16229767 5 2 1 T1 123
1 16229767 5 1 0 T1 123
2 16229767 5 3 0 T1 123
3 16229767 5 4 0 T1 123
4 16229767 3 2 1 T9 38
5 16229767 3 1 0 T9 38
6 16229767 3 3 1 T9 38
7 16229767 3 4 0 T9 38
8 29767162 7 2 1 T4 991
9 29767162 7 1 0 T4 991
10 29767162 7 3 1 T4 991
11 29767162 7 4 1 T4 991
In the code above, just the combination of ['timestamp', 't_idx', 'type', 'SNR'] and position is required to generate the missing values, limiting the output to only the explicitly missing values within the dataframe; if all combinations of missing values were required, then the brackets would be dropped, and you would probably get a much larger dataframe.
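For illustration, a sketch of that variant (same dataframe and the same pyjanitor complete API, just without the grouping brackets):
# every column is now completed independently, so the output holds the full
# cartesian product of the timestamp, t_idx, type, SNR and position values
out = (
    df.complete('timestamp', 't_idx', 'type', 'SNR', 'position')
      .fillna({'error': 1}, downcast='infer')
      .filter(df.columns)
)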

You can create a new dataframe with each (timestamp, tray) pair combined with the fixed range of positions. Then you merge it with the original dataframe and end up with NaN values in the error column for the missing positions. Finally, you fill those NaN values with 1.
Sample code:
# one row for every (timestamp, t_idx) pair crossed with positions 1-4
pairs = df[['timestamp', 't_idx']].drop_duplicates()
df_tmp = pairs.merge(pd.DataFrame({'position': range(1, 5)}), how='cross')
df = pd.merge(df_tmp, df, on=['timestamp', 't_idx', 'position'], how='left')
df['error'] = df['error'].fillna(1)

You can try this code:
def foo(df):
    set_ = set(range(1, 5))
    if df.position.unique().size < 4:
        diff_ = set_.difference(df.position.unique())
        add_df = df.iloc[:len(diff_), :].copy()
        add_df.loc[:, 'position'] = list(diff_)
        # I did not understand by what rule the values in the error column are set.
        # You can set them as you need.
        result_df = pd.concat([df, add_df], ignore_index=True)
        return result_df
    else:
        return df
group = df.groupby(['timestamp', 't_idx'])
group.apply(foo)
timestamp t_idx position error
0 16229767 3 3 1
1 16229767 3 1 0
2 16229767 3 4 0
3 16229767 3 2 1
4 16229767 5 2 1
5 16229767 5 1 0
6 16229767 5 3 0
7 16229767 5 4 0
8 29767162 7 1 0
9 29767162 7 4 1
10 29767162 7 2 0
11 29767162 7 3 1

Related

Calculate %-deviation with values from a pandas Dataframe

I am fairly new to python and I have the following dataframe
setting_id subject_id seconds result_id owner_id average duration_id
0 7 1 0 1680.5 2.0 24.000 1.0
1 7 1 3600 1690.5 2.0 46.000 2.0
2 7 1 10800 1700.5 2.0 101.000 4.0
3 7 2 0 1682.5 2.0 12.500 1.0
4 7 2 3600 1692.5 2.0 33.500 2.0
5 7 2 10800 1702.5 2.0 86.500 4.0
6 7 3 0 1684.5 2.0 8.500 1.0
7 7 3 3600 1694.5 2.0 15.000 2.0
8 7 3 10800 1704.5 2.0 34.000 4.0
What I need to do is calculate the deviation (%) of the averages with a seconds value not equal to 0 from the averages with a seconds value of zero, where subject_id and setting_id are the same.
i.e. setting_id ==7 & subject_id ==1 would be:
(result/baseline)*100
------> for 3600 seconds: (46/24)*100 = +192%
------> for 10800 seconds: (101/24)*100 = +421%
.... baseline = average-result with a seconds value of 0
.... result = average-result with a seconds value other than 0
The resulting df should look like this
setting_id subject_id seconds owner_id average deviation duration_id
0 7 1 0 2 24 0 1
1 7 1 3600 2 46 192 2
2 7 1 10800 2 101 421 4
I want to use these calculations then to plot a regression graph (with seaborn) of deviations from baseline
I have played around with this df for 2 days now and tried different for loops, but I just can't figure out the correct way.
You can use:
# identify rows with 0
m = df['seconds'].eq(0)
# compute the sum of rows with 0
s = (df['average'].where(m)
       .groupby([df['setting_id'], df['subject_id']])
       .sum()
     )
# compute the deviation per group
deviation = (
    df[['setting_id', 'subject_id']]
    .merge(s, left_on=['setting_id', 'subject_id'], right_index=True, how='left')
    ['average']
    .rdiv(df['average']).mul(100)
    .round().astype(int)  # optional
    .mask(m, 0)
)
df['deviation'] = deviation
# or
# out = df.assign(deviation=deviation)
Output:
setting_id subject_id seconds result_id owner_id average duration_id deviation
0 7 1 0 1680.5 2.0 24.0 1.0 0
1 7 1 3600 1690.5 2.0 46.0 2.0 192
2 7 1 10800 1700.5 2.0 101.0 4.0 421
3 7 2 0 1682.5 2.0 12.5 1.0 0
4 7 2 3600 1692.5 2.0 33.5 2.0 268
5 7 2 10800 1702.5 2.0 86.5 4.0 692
6 7 3 0 1684.5 2.0 8.5 1.0 0
7 7 3 3600 1694.5 2.0 15.0 2.0 176
8 7 3 10800 1704.5 2.0 34.0 4.0 400
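An equivalent variant (a sketch under the same grouping assumptions) broadcasts the baseline with transform instead of a merge:
# broadcast each group's baseline (the average where seconds == 0) to all of its rows
baseline = df['average'].where(m).groupby([df['setting_id'], df['subject_id']]).transform('sum')
df['deviation'] = df['average'].div(baseline).mul(100).round().astype(int).mask(m, 0)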

Python Pandas: difference of column values insert into new column

I have a Pandas dataframe that looks like the following:
c1 c2 c3 c4
p1 q1 r1 20
p2 q2 r2 10
p3 q3 r1 30
The desired output looks like this:
c1 c2 c3 c4 NewColumn(c1.1)
p1 q1 r1 20 0
p2 q2 r2 10 p2-p1
p3 q3 r1 30 p3-p2
The shape of my dataset is (333650, 665). I want to do that for all columns. Are there any ways to achieve this?
The code I am using:
data = pd.read_csv('Mydataset.csv')
i = 0
j = 1
while j < len(data['columnname']):
    j = data['columnname'][i+1] - data['columnname'][i]
    i += 1  # Next value of column.
    j += 1  # Next value of new column.
print(j)
Is this what you want? It finds the difference between consecutive rows of a particular column using the shift method and assigns it to a new column.
Note that I am using the data from Dave.
df['New Column'] = df.a.sub(df.a.shift()).fillna(0)
a b c New Column
0 1 1 1 0.0
1 2 1 4 1.0
2 3 2 9 1.0
3 4 3 16 1.0
4 5 5 25 1.0
5 6 8 36 1.0
For multiple columns, this may suffice:
M = df.diff().fillna(0).add_suffix('_1')
#concatenate along the columns axis
pd.concat([df,M], axis = 1)
a b c a_1 b_1 c_1
0 1 1 1 0.0 0.0 0.0
1 2 1 4 1.0 0.0 3.0
2 3 2 9 1.0 1.0 5.0
3 4 3 16 1.0 1.0 7.0
4 5 5 25 1.0 2.0 9.0
5 6 8 36 1.0 3.0 11.0
You want the diff function:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html
df
a b c
0 1 1 1
1 2 1 4
2 3 2 9
3 4 3 16
4 5 5 25
5 6 8 36
df.diff()
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0

pandas add new row based on sum/difference of other rows

The df I have:
id measure t1 t2 t3
1 savings 1 2 5
1 income 10 15 14
1 misc 5 5 5
2 savings 3 6 12
2 income 4 20 80
2 misc 1 1 1
The df I want: add a new row to measure for each id, called spend, calculated by subtracting measure=savings from measure=income, for each of the periods t1, t2, t3.
id measure t1 t2 t3
1 savings 1 2 5
1 income 10 15 14
1 misc 5 5 5
1 spend 9 13 9
2 savings 3 6 12
2 income 4 20 80
2 misc 1 1 1
2 spend 1 14 68
Trying:
df.loc[df['Measure'] == 'spend'] = (
    df.loc[df['Measure'] == 'income'] -
    df.loc[df['Measure'] == 'savings']
)
This fails because I am not incorporating groupby to get the desired outcome.
Here is one way, using groupby diff:
df1=df[df.measure.isin(['savings','income'])].copy()
s=df1.groupby('id',sort=False).diff().dropna().assign(id=df.id.unique(),measure='spend')
df=df.append(s,sort=True).sort_values('id')
df
Out[276]:
id measure t1 t2 t3
0 1 savings 1.0 2.0 5.0
1 1 income 10.0 15.0 14.0
1 1 spend 9.0 13.0 9.0
2 2 savings 3.0 6.0 12.0
3 2 income 4.0 20.0 80.0
3 2 spend 1.0 14.0 68.0
Update
df1=df.copy()
df1.loc[df.measure.ne('income'),'t1':]*=-1
s=df1.groupby('id',sort=False).sum().assign(id=df.id.unique(),measure='spend')
df=df.append(s,sort=True).sort_values('id')
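Note that DataFrame.append was removed in pandas 2.0; with a current pandas the same concatenation can be written with pd.concat (a sketch based on the code above):
# pandas >= 2.0: DataFrame.append no longer exists, pd.concat is the replacement
df = pd.concat([df, s], sort=True).sort_values('id')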

how to add a mean column for the groupby movieID?

I have a df like below
userId movieId rating
0 1 31 2.0
1 2 10 4.0
2 2 17 5.0
3 2 39 5.0
4 2 47 4.0
5 3 31 3.0
6 3 10 2.0
I need to add two columns: one is mean, the mean rating for each movie; the other is diff, the difference between rating and mean.
Please note that movieId can be repeated because different users may rate the same movie. Here rows 0 and 5 are for movieId 31, and rows 1 and 6 are for movieId 10.
userId movieId rating mean diff
0 1 31 2.0 2.5 -0.5
1 2 10 4.0 3 1
2 2 17 5.0 5 0
3 2 39 5.0 5 0
4 2 47 4.0 4 0
5 3 31 3.0 2.5 0.5
6 3 10 2.0 3 -1
Here is some of my code, which calculates the mean:
df = df.groupby('movieId')['rating'].agg(['count','mean']).reset_index()
You can use transform to keep the same number of rows when calculating the mean with groupby. Calculating the difference is straightforward from that:
df['mean'] = df.groupby('movieId')['rating'].transform('mean')
df['diff'] = df['rating'] - df['mean']
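If you prefer a single expression that leaves df untouched, the same result can be built with assign (a sketch; the second column can refer to the first because assign evaluates its arguments in order):
out = df.assign(
    mean=df.groupby('movieId')['rating'].transform('mean'),
    diff=lambda d: d['rating'] - d['mean'],
)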

Drop rows at beginning of group with a specific value in pandas groupby

I have a pandas (multi-index) dataframe as follows:
date Volume
Account ID
10001 2 02-03-2017 0
3 02-03-2017 0
3 09-03-2017 0
3 16-03-2017 50
3 21-03-2017 65
3 28-03-2017 0
3 04-04-2017 0
3 11-04-2017 60
10002 5 02-03-2017 14.5
6 09-03-2017 14.5
3 09-03-2017 0
3 16-03-2017 0
3 21-03-2017 20
3 28-03-2017 33
10003 8 21-03-2017 14.5
9 28-03-2017 15.0
Now I want to delete all rows at the beginning of a series (dates of an account-product combination) with volume 0. So I want to keep the rows with volume 0 in case they are in the middle or at the end of a series.
So in the above example, I'd want the following output:
date Volume
Account ID
10001 3 16-03-2017 50
3 21-03-2017 65
3 28-03-2017 0
3 04-04-2017 0
3 11-04-2017 60
10002 5 02-03-2017 14.5
6 09-03-2017 14.5
3 21-03-2017 20
3 28-03-2017 33
10003 8 21-03-2017 14.5
9 28-03-2017 15.0
Currently, I've been removing complete series with a filter, e.g.
df = data.groupby(level = acc_prod).filter(lambda x: len(x) > 26)
And I've seen examples of removing only the first row: Python: Pandas - Delete the first row by group. Yet I do not know how to delete only the zero rows at the beginning of an account-product series.
Would be great if someone could help me out on this!
You can use boolean indexing with a mask created by groupby with cumsum, keeping only the rows where the cumulative sum is not 0:
print (df.groupby(level=['Account','ID'])['Volume'].cumsum())
Account ID
10001 2 0.0
3 0.0
3 0.0
3 50.0
3 115.0
3 115.0
3 115.0
3 175.0
10002 5 14.5
6 14.5
3 0.0
3 0.0
3 20.0
3 53.0
10003 8 14.5
9 15.0
Name: Volume, dtype: float64
mask = df.groupby(level=['Account','ID'])['Volume'].cumsum() != 0
#!= is same as ne function
#mask = df.groupby(level=['Account','ID'])['Volume'].cumsum().ne(0)
print (mask)
Account ID
10001 2 False
3 False
3 False
3 True
3 True
3 True
3 True
3 True
10002 5 True
6 True
3 False
3 False
3 True
3 True
10003 8 True
9 True
Name: Volume, dtype: bool
print (df[mask])
date Volume
Account ID
10001 3 16-03-2017 50.0
3 21-03-2017 65.0
3 28-03-2017 0.0
3 04-04-2017 0.0
3 11-04-2017 60.0
10002 5 02-03-2017 14.5
6 09-03-2017 14.5
3 21-03-2017 20.0
3 28-03-2017 33.0
10003 8 21-03-2017 14.5
9 28-03-2017 15.0
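A variant of the same idea (a sketch under the same index assumptions): flag everything from the first non-zero row of each group onward with cummax, which also works if positive and negative volumes could ever cancel out to a cumulative sum of 0:
mask = df['Volume'].ne(0).groupby(level=['Account', 'ID']).cummax()
print(df[mask])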
