Impute missing dates and values using Python

I have a file named df that has the following values:
Date Available Used Total Free
06072019 5 19 24 5
06202019 14 10 24 6
07072019 6 16 24 6
07202019 20 4 24 20
08072019 23 1 24 23
I am missing the date 08202019 and am looking to impute the missing values with the average of the existing data that I have.
This is what I am currently doing:
import numpy as np
import datetime as dt
df.groupby([df.index.date]).transform(lambda x: x.fill(x.mean()))
However, I know that some of the syntax is not correct here and would like some suggestions.
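Note that Series has no .fill method (the method you want is fillna), and a groupby/transform only fills values inside rows that already exist, so it cannot add a row for the missing date; the answers below therefore build the missing row explicitly. If the missing date were already present as an all-NaN row, a minimal mean fill would look like this (a sketch, not the approach taken below):
# Sketch, assuming the 08202019 row already exists and is all NaN:
df = df.fillna(df.mean(numeric_only=True))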

Use a list comprehension to get the means of all the columns except the first (date) column and create a transposed dataframe, which will essentially be one row. Then concat this new one-row dataframe with the main dataframe.
import pandas as pd

df = pd.DataFrame({'Date': {0: '06-07-2019',
1: '06-20-2019',
2: '07-07-2019',
3: '07-20-2019',
4: '08-07-2019'},
'Available': {0: 5, 1: 14, 2: 6, 3: 20, 4: 23},
'Used': {0: 19, 1: 10, 2: 16, 3: 4, 4: 1},
'Total': {0: 24, 1: 24, 2: 24, 3: 24, 4: 24},
'Free': {0: 5, 1: 6, 2: 6, 3: 20, 4: 23}})
s = pd.DataFrame([df[col].mean() for col in df.columns[1:]]).T
s.columns = df.columns[1:]
s['Date'] = '08-20-2019'
df_new = pd.concat([df, s])
df_new
Out[1]:
Date Available Used Total Free
0 06-07-2019 5.0 19.0 24.0 5.0
1 06-20-2019 14.0 10.0 24.0 6.0
2 07-07-2019 6.0 16.0 24.0 6.0
3 07-20-2019 20.0 4.0 24.0 20.0
4 08-07-2019 23.0 1.0 24.0 23.0
0 08-20-2019 13.6 10.0 24.0 12.0
Regarding your comment, you can create a missing_dates list and build one mean row per missing date; everything else stays the same:
df = pd.DataFrame({'Date': {0: '06-07-2019',
1: '06-20-2019',
2: '07-07-2019',
3: '07-20-2019',
4: '08-07-2019'},
'Available': {0: 5, 1: 14, 2: 6, 3: 20, 4: 23},
'Used': {0: 19, 1: 10, 2: 16, 3: 4, 4: 1},
'Total': {0: 24, 1: 24, 2: 24, 3: 24, 4: 24},
'Free': {0: 5, 1: 6, 2: 6, 3: 20, 4: 23}})
s = pd.DataFrame([df[col].mean() for col in df.columns[1:]]).T
missing_dates = ['08-20-2019', '08-30-2019']
t = pd.concat([s] * len(missing_dates))  # one copy of the mean row per missing date
t.columns = df.columns[1:]
t['Date'] = missing_dates
df_new = pd.concat([df, t])
df_new
Out[2]:
Date Available Used Total Free
0 06-07-2019 5.0 19.0 24.0 5.0
1 06-20-2019 14.0 10.0 24.0 6.0
2 07-07-2019 6.0 16.0 24.0 6.0
3 07-20-2019 20.0 4.0 24.0 20.0
4 08-07-2019 23.0 1.0 24.0 23.0
0 08-20-2019 13.6 10.0 24.0 12.0
0 08-30-2019 13.6 10.0 24.0 12.0

Consider your dataframe:
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df["Date"] = pd.to_datetime(df["Date"], format="%m%d%Y")
df = df.set_index("Date")
print(df)
Available Used Total Free
Date
2019-06-07 5 19 24 5
2019-06-20 14 10 24 6
2019-07-07 6 16 24 6
2019-07-20 20 4 24 20
2019-08-07 23 1 24 23
You can create a list of missing dates, convert it to pandas datetimes, take the union with the existing index to build a new index, reindex to that new index, and then fill the resulting NaNs with the column means.
missing_dates = ["2019-08-20", "2019-09-07", "2019-09-20"]
missing_dates = pd.to_datetime(missing_dates)
new_index = df.index.union(missing_dates)
df = df.reindex(new_index).fillna(df.mean(numeric_only=True))
print(df)
Available Used Total Free
2019-06-07 5.0 19.0 24.0 5.0
2019-06-20 14.0 10.0 24.0 6.0
2019-07-07 6.0 16.0 24.0 6.0
2019-07-20 20.0 4.0 24.0 20.0
2019-08-07 23.0 1.0 24.0 23.0
2019-08-20 13.6 10.0 24.0 12.0
2019-09-07 13.6 10.0 24.0 12.0
2019-09-20 13.6 10.0 24.0 12.0
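If a flat column mean is not the fill you want, the same reindex can instead be combined with time-based interpolation; a sketch (not part of the answer above), run in place of the fillna step:
# Sketch: interpolate(method="time") fills gaps along the DatetimeIndex.
# Dates after the last observation are simply carried forward, not extrapolated.
df_interp = df.reindex(new_index).interpolate(method="time")
print(df_interp)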

If you also need to find which dates are missing from the data, you can build the full date range and take the difference:
import pandas as pd

# suppose the data series runs from '2020-09-30' to '2020-10-04' and data for some dates may be missing
df = pd.DataFrame.from_dict(
    dict(datetime=['09302020', '10012020', '10042020'],
         val1=[1, 3, 5],
         val2=[6, 10, 12]))
df.datetime = pd.to_datetime(df.datetime, format='%m%d%Y')
print(df)

dates_missing = pd.date_range(start='2020-09-30', end='2020-10-04').difference(df.datetime)
val_means = {col: df[col].mean() for col in df.columns if col != 'datetime'}
df = pd.concat([df, pd.DataFrame.from_dict(dict(datetime=dates_missing, **val_means))])
df = df.sort_values(by=['datetime'])
df

Related

Stick the dataframe rows and columns into one row + replace the NaN values with the day before or after

I have a df and I want to stick its values together into one row. First I want to select a specific time span and replace the NaN values with the value from the day before. Here is a simple example: I only want to choose the values in 2020, stick them together ordered by time, and replace each NaN value with the value from the day before.
df = pd.DataFrame()
df['day'] =[ '2020-01-01', '2019-01-01', '2020-01-02','2020-01-03', '2018-01-01', '2020-01-15','2020-03-01', '2020-02-01', '2017-01-01' ]
df['value_1'] = [ 1, np.nan, 32, 48, 5, -1, 5,10,2]
df['value_2'] = [ np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
day value_1 value_2
0 2020-01-01 1.0 NaN
1 2019-01-01 NaN 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
4 2018-01-01 5.0 15.0
5 2020-01-15 -1.0 21.0
6 2020-03-01 5.0 15.0
7 2020-02-01 10.0 12.0
8 2017-01-01 2.0 39.0
The output:
_1 _2 _3 _4 _5 _6 _7 _8 _9 _10 _11 _12
0 1 121 1 23 48 34 -1 21 10 12 -1 21
I have tried to use the following code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols).groupby(level=0, axis=1).apply(lambda x:x.ffill(axis=1).bfill(axis=1)).sort_index(axis=1, level=1))
I don't know exactly what the output is supposed to be, but I think this should do at least part of what you're trying to do:
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:,val_cols].ffill().loc[filter_2020]
print(df)
day value_1 value_2
8 2017-01-01 2.0 39.0
4 2018-01-01 5.0 15.0
1 2019-01-01 NaN 121.0
0 2020-01-01 1.0 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
5 2020-01-15 -1.0 21.0
7 2020-02-01 10.0 12.0
6 2020-03-01 5.0 15.0

join on index AND column

I would like to join two dataframes based on two conditions:
1. Join via the index
2. If a column header appears in both dataframes, join it as well
To give an example, let's imagine I have these two dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'date': [2010, 2011, 2012],
'a': [np.NaN, 30, np.NaN],
'b': [55, np.NaN, np.NaN],
'c': [55, 40, 84]})
df1 = df1.set_index("date")
df2 = pd.DataFrame({'date': [2010, 2011, 2012],
'a': [10, np.NaN, 30],
'b': [np.NaN, 80, 84],
'd': [55, 40, 84]})
df2 = df2.set_index("date")
If I now join the two via pd.concat, I get columns such as "a" twice:
pd.concat([df1, df2], axis=1)
a b c a b d
date
2010 NaN 55.0 55 10.0 NaN 55
2011 30.0 NaN 40 NaN 80.0 40
2012 NaN NaN 84 30.0 84.0 84
But I would rather have:
a b c d
date
2010 10.0 55.0 55 55
2011 30.0 80.0 40 40
2012 30.0 84.0 84 84
Thanks in advance!
Try this: since date is already the index in both frames, just add them with fill_value=0 so missing entries don't turn the sums into NaN:
print(df1.add(df2, fill_value=0))
a b c d
date
2010 10.0 55.0 55.0 55.0
2011 30.0 80.0 40.0 40.0
2012 30.0 84.0 84.0 84.0
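Note that add() sums overlapping cells, which happens to give the desired result here because no cell is non-null in both frames at once. If overlaps are possible and you want df2 only to fill the gaps in df1 (a join rather than a sum), combine_first is the safer choice; a sketch:
# Sketch: keep df1's value where present, otherwise fall back to df2's value.
print(df1.combine_first(df2))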

Get last non-null value of a row and its column in pandas DataFrame

I want to get the last non-null value (rightmost) of row C in this DataFrame.
With that, I also want to get its Year (column name).
Here is my DataFrame :
df = pd.DataFrame(np.random.randint(0,100,size=(4, 5)),
columns=['2016', '2017', '2018', '2019', '2020'],
index=['A', 'B', 'C', 'D'])
df.iloc[2, 2:5] = np.NaN
print(df)
2016 2017 2018 2019 2020
A 41 69 63.0 85.0 16.0
B 12 99 88.0 87.0 13.0
C 80 15 NaN NaN NaN
D 42 27 3.0 76.0 6.0
Result should look like {'year' : 2017, 'value' : 15}.
What's the best way of achieving that result?
Something like this should solve it
In [1]: import pandas as pd
...: import numpy as np
...: df = pd.DataFrame(np.random.randint(0,100,size=(4, 5)),
...: columns=['2016', '2017', '2018', '2019', '2020'],
...: index=['A', 'B', 'C', 'D'])
...: df.iloc[2, 2:5] = np.NaN
...: print(df)
2016 2017 2018 2019 2020
A 13 78 9.0 13.0 98.0
B 35 3 32.0 6.0 42.0
C 26 24 NaN NaN NaN
D 77 91 96.0 60.0 94.0
In [2]: value = int(df.loc['C'][~df.loc['C'].isna()].iloc[-1])
In [3]: year = df.loc['C'][df.loc['C'] == value].index.values[0]
In [4]: result = {'year': year, 'value': value}
In [5]: result
Out[5]: {'year': '2017', 'value': 24}
You can break the expressions above down piece by piece to better understand how each part combines to yield the desired output.
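A slightly more robust variant (a sketch, not from the answer above) uses last_valid_index(), which avoids looking the year up by value and therefore cannot be confused by duplicate values in the row:
# Sketch: last_valid_index() gives the label of the last non-NaN entry.
row_c = df.loc['C']
year = row_c.last_valid_index()   # '2017' in the run shown above
value = int(row_c[year])          # 24 in the run shown above
result = {'year': year, 'value': value}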

Dropping values before date X for each column in DataFrame

Say I have the following DataFrame:
d = pd.DataFrame({'A': [20, 0.5, 40, 45, 40, 35, 20, 25],
'B' : [5, 10, 6, 8, 9, 7, 5, 8]},
index = pd.date_range(start = "2010Q1", periods = 8, freq = 'QS'))
A B
2010-01-01 20.0 5
2010-04-01 0.5 10
2010-07-01 40.0 6
2010-10-01 45.0 8
2011-01-01 40.0 9
2011-04-01 35.0 7
2011-07-01 20.0 5
2011-10-01 25.0 8
Also, assume I have the following series of dates:
D = d.idxmax()
A 2010-10-01
B 2010-04-01
dtype: datetime64[ns]
What I'm trying to do, is to essentially "drop" the values in the DataFrame, d, that occur before the dates in the series D for each column
That is, what I'm looking for is:
A B
2010-01-01 NaN NaN
2010-04-01 NaN 10.0
2010-07-01 NaN 6.0
2010-10-01 45.0 8.0
2011-01-01 40.0 9.0
2011-04-01 35.0 7.0
2011-07-01 20.0 5.0
2011-10-01 25.0 8.0
Notice that all the values in column A before 2010-10-01 are dropped and all the values in column B are dropped before 2010-04-01.
It's fairly simple to iterate through the columns to do this, but the DataFrame I'm working with is extremely large and this process takes a lot of time.
Is there a simpler way that does this in bulk, rather than column by column?
Thanks
Not sure that this is the most elegant answer, but since there aren't any other answers yet, I figured I would offer a working solution:
import pandas as pd
import numpy as np
import datetime
d = pd.DataFrame({'A': [20, 0.5, 40, 45, 40, 35, 20, 25],
'B' : [5, 10, 6, 8, 9, 7, 5, 8]},
index = pd.date_range(start = "2010Q1", periods = 8, freq = 'QS'))
D = d.idxmax()
for column in D.index:
    d.loc[d.index < D[column], column] = np.nan
Output:
A B
2010-01-01 NaN NaN
2010-04-01 NaN 10.0
2010-07-01 NaN 6.0
2010-10-01 45.0 8.0
2011-01-01 40.0 9.0
2011-04-01 35.0 7.0
2011-07-01 20.0 5.0
2011-10-01 25.0 8.0
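If the column-by-column loop is too slow on a very large frame, the same masking can be done in one vectorized step (a sketch, not part of the answer above); D comes from d.idxmax(), so its order matches d.columns:
# Sketch: broadcast the DatetimeIndex against each column's cutoff date.
mask = d.index.values[:, None] >= D.values[None, :]   # shape (len(d), len(D))
print(d.where(mask))                                  # earlier values become NaN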

Python: doing multiple column aggregation in pandas

I have a dataframe where I want to do multiple column aggregations in pandas.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})
df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean})
With this code, I get the mean for lat. I would also like to find the mean for long.
I tried df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean}).long.agg({'avg_long': np.mean}) but this produces
AttributeError: 'DataFrame' object has no attribute 'long'
If I just do avg_long, the code works as well.
df2 = df.groupby(['ser_no', 'CTRY_NM']).long.agg({'avg_long': np.mean})
In [2]: df2
Out[2]:
avg_long
ser_no CTRY_NM
1 a 21.5
b 23.0
2 a 26.0
b 27.0
e 24.5
3 b 28.5
d 30.0
Is there a way to do this in one step or is this something I have to do separately and join back later?
I think it is simpler to use GroupBy.mean:
print(df.groupby(['ser_no', 'CTRY_NM']).mean())
lat long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
Or if you need to define the columns for aggregating:
print(df.groupby(['ser_no', 'CTRY_NM']).agg({'lat' : 'mean', 'long' : 'mean'}))
lat long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
More info in docs.
EDIT:
If you need to rename the column names (i.e. flatten the MultiIndex in the columns), you can use a list comprehension:
import pandas as pd
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
'date':pd.date_range(pd.to_datetime('2016-02-24'),
pd.to_datetime('2016-02-28'), freq='10H')})
print(df)
CTRY_NM date lat long ser_no
0 a 2016-02-24 00:00:00 1 21 1
1 a 2016-02-24 10:00:00 2 22 1
2 b 2016-02-24 20:00:00 3 23 1
3 e 2016-02-25 06:00:00 4 24 2
4 e 2016-02-25 16:00:00 5 25 2
5 a 2016-02-26 02:00:00 6 26 2
6 b 2016-02-26 12:00:00 7 27 2
7 b 2016-02-26 22:00:00 8 28 3
8 b 2016-02-27 08:00:00 9 29 3
9 d 2016-02-27 18:00:00 10 30 3
df2=df.groupby(['ser_no','CTRY_NM']).agg({'lat':'mean','long':'mean','date':[min,max,'count']})
df2.columns = ['_'.join(col) for col in df2.columns]
print(df2)
lat_mean date_min date_max date_count \
ser_no CTRY_NM
1 a 1.5 2016-02-24 00:00:00 2016-02-24 10:00:00 2
b 3.0 2016-02-24 20:00:00 2016-02-24 20:00:00 1
2 a 6.0 2016-02-26 02:00:00 2016-02-26 02:00:00 1
b 7.0 2016-02-26 12:00:00 2016-02-26 12:00:00 1
e 4.5 2016-02-25 06:00:00 2016-02-25 16:00:00 2
3 b 8.5 2016-02-26 22:00:00 2016-02-27 08:00:00 2
d 10.0 2016-02-27 18:00:00 2016-02-27 18:00:00 1
long_mean
ser_no CTRY_NM
1 a 21.5
b 23.0
2 a 26.0
b 27.0
e 24.5
3 b 28.5
d 30.0
You are getting the error because you first select the lat column of the dataframe and do operations on that column. Getting the long column through that Series is not possible; you need the dataframe.
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg("mean")
would do the same operation for both columns. If you want the column names changed, you can rename the columns afterwards:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg("mean").rename(columns={"lat": "avg_lat", "long": "avg_long"})
In [22]:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg("mean").rename(columns={"lat": "avg_lat", "long": "avg_long"})
df2
Out[22]:
avg_lat avg_long
ser_no CTRY_NM
1 a 1.5 21.5
b 3.0 23.0
2 a 6.0 26.0
b 7.0 27.0
e 4.5 24.5
3 b 8.5 28.5
d 10.0 30.0
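On recent pandas versions (0.25 and later), named aggregation gives the renamed columns in a single step; a sketch assuming the same df as above:
# Sketch: named aggregation renames the result columns directly.
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg(avg_lat=('lat', 'mean'),
                                            avg_long=('long', 'mean'))
print(df2)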
