I have a question of how o get interpolated data across a several different "blocks" of time. In a nut shell, I have a dataset like this:
>>> import pandas as pd
>>> test_data = pd.read_csv("test_data.csv")
>>> test_data
ID Condition_num Condition_type Rating Timestamp_ms
0 101 1 Active 58.0 30
1 101 1 Active 59.0 60
2 101 1 Active 65.0 90
3 101 1 Active 70.0 120
4 101 1 Active 80.0 150
5 101 2 Break NaN 180
6 101 3 Active_2 55.0 210
7 101 3 Active_2 60.0 240
8 101 3 Active_2 63.0 270
9 101 3 Active_2 70.0 300
10 101 4 Break NaN 330
11 101 5 Active_3 69.0 360
12 101 5 Active_3 71.0 390
13 101 5 Active_3 50.0 420
14 101 5 Active_3 41.0 450
15 101 5 Active_3 43.0 480
I need to "resample" the final column to a another time interval (e.g. 40 ms) to match it to an external data set. I have been using the following code:
#Setting the column with timestamps as a datetime with the correct units, then set index
test_data['Timestamp_ms'] = pd.to_datetime(test_data['Timestamp_ms'], unit='ms')
test_data = test_data.set_index('Timestamp_ms')
#Resample index to start at 0, resample to the highest resolution 1ms, then resample to 800ms
test_data = test_data.reindex(
pd.date_range(start=pd.to_datetime(0, unit='ms'), end=test_data.index.max(), freq='ms')
)
test_data = test_data.resample('1ms').interpolate().resample('40ms').interpolate()
#Round ms to intergers
test_data.xpos = test_data..round()
Which gives me this:
ID Condition_num Condition_type Rating
1970-01-01 00:00:00.000 NaN NaN NaN NaN
1970-01-01 00:00:00.040 101.0 1.000000 NaN 58.333333
1970-01-01 00:00:00.080 101.0 1.000000 NaN 63.000000
1970-01-01 00:00:00.120 101.0 1.000000 Active 70.000000
1970-01-01 00:00:00.160 101.0 1.333333 NaN 75.833333
1970-01-01 00:00:00.200 101.0 2.666667 NaN 59.166667
1970-01-01 00:00:00.240 101.0 3.000000 Active_2 60.000000
1970-01-01 00:00:00.280 101.0 3.000000 NaN 65.333333
1970-01-01 00:00:00.320 101.0 3.666667 NaN 69.666667
1970-01-01 00:00:00.360 101.0 5.000000 Active_3 69.000000
1970-01-01 00:00:00.400 101.0 5.000000 NaN 64.000000
1970-01-01 00:00:00.440 101.0 5.000000 NaN 44.000000
1970-01-01 00:00:00.480 101.0 5.000000 Active_3 43.000000
The only issue is I cannot figure out which ratings are happening during the "Active" conditions and whether the ratings I am seeing are caused by extrapolations of the "breaks" where there are no ratings. In so many words, I want the interpolation in the "Active" blocks but also have everything aligned to the beginning of the whole data set.
I have tried entering Zero ratings for NaN and interpolating from the top of each condition, but that seems only to make the problem worse by altering the ratings more.
Any advice would be greatly appreciated!
I think you need to do all of your logic inside of a groupby, IIUC:
mask = df.Condition_type.ne('Break')
df2 = (df[mask].groupby('Condition_type') # Groupby Condition_type, excluding "Break" rows.
.apply(lambda x: x.resample('1ms') # To each group... resample it.
.interpolate() # Interpolate
.ffill() # Fill values, this just applies to the Condition_type.
.resample('40ms')# Resample to 40ms
.asfreq()) # No need to interpolate in this direction.
.reset_index('Condition_type', drop=True)) # We no longer need this extra index~
# Force the index to our resample'd interval, this will reveal the breaks:
df2 = df2.asfreq('40ms')
print(df2)
Output:
ID Condition_num Condition_type Rating
Timestamp_ms
1970-01-01 00:00:00.000 NaN NaN NaN NaN
1970-01-01 00:00:00.040 101.0 1.0 Active 58.333333
1970-01-01 00:00:00.080 101.0 1.0 Active 63.000000
1970-01-01 00:00:00.120 101.0 1.0 Active 70.000000
1970-01-01 00:00:00.160 NaN NaN NaN NaN
1970-01-01 00:00:00.200 NaN NaN NaN NaN
1970-01-01 00:00:00.240 101.0 3.0 Active_2 60.000000
1970-01-01 00:00:00.280 101.0 3.0 Active_2 65.333333
1970-01-01 00:00:00.320 NaN NaN NaN NaN
1970-01-01 00:00:00.360 101.0 5.0 Active_3 69.000000
1970-01-01 00:00:00.400 101.0 5.0 Active_3 64.000000
1970-01-01 00:00:00.440 101.0 5.0 Active_3 44.000000
1970-01-01 00:00:00.480 101.0 5.0 Active_3 43.000000
Related
I have a large dataset which contains a date column that covers from the year 2019. Now I do want to generate number of weeks on a separate column that are contained in those dates.
Here is how the date column looks like:
import pandas as pd
data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
'2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
'2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
'2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
'2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
'2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
'2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
'2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
'2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
'2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
'2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
'2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Now as from the first day this data was collected, I want to count 7 days using the date column and create a week out it. an example if the first week contains the 7 dates, I create a column and call it week one. I want to do the same process until the last week the data was collected.
Maybe it will be a good idea to organize the dates in order as from the first date to current one.
I have tried this but its not generating weeks in order, it actually has repetitive weeks.
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is, as from the first date the date was collected, count 7 days and store that as week one then continue incrementally until the last week say week number 66.
Here is the expected column of weeks created from the date column
import pandas as pd
week_df = {'weeks': ['1', '2', "3", "5", '6']}
df_weeks = pd.DataFrame(week_df)
IIUC use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print (df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0
I would like to plot the below data using matplotlib. I am confused as to how to group the data. I would like to plot Amps against time for each rectifier. Preferably the x axis should be months and the y axis should be Amps. Below is a sample of data.
Rectifier Date Amps Volts
9E220ECP5001 2015-01-01 31.95 11.1
9E220ECP5001 2015-02-01 NaN NaN
9E220ECP5001 2015-03-01 31.05 11.3
9E220ECP5001 2015-04-01 NaN NaN
9E220ECP5001 2015-05-01 30.45 12.2
... ... ... ...
9E220ECP5018 2021-08-01 NaN NaN
9E220ECP5018 2021-09-01 17.4 11.6
9E220ECP5018 2021-10-01 NaN NaN
9E220ECP5018 2021-11-01 NaN NaN
9E220ECP5018 2021-12-01 NaN NaN
I started to work with Pandas and I have some issues that I don't really know how to solve.
I have a dataframe with date, product, stock and sales. Some dates and products are missing. I would like to get a timeseries for each product in a range of dates.
For example:
product udsStock udsSales
date
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-30 14 856 0
2019-12-25 4 3132 439
2019-12-27 4 3177 616
2020-01-01 4 500 883
It has to be the same range for all products even if one product doesn't appear in one date in the range.
If I want the range 2019-12-25 to 2020-01-01, the final dataframe should be like this one:
product udsStock udsSales
date
2019-12-25 14 NaN NaN
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-28 14 NaN NaN
2019-12-29 14 NaN NaN
2019-12-30 14 856 0
2019-12-31 14 NaN NaN
2020-01-01 14 NaN NaN
2019-12-25 4 3132 439
2019-12-26 4 NaN NaN
2019-12-27 4 3177 616
2019-12-28 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-31 4 NaN NaN
2020-01-01 4 500 883
I have tried to reindex by the range but it doesn't work because there are identical indexes.
idx = pd.date_range('25-12-2019', '01-01-2020')
df = df.reindex(idx)
I also have tried to index by date and product and then reindex, but I don't know how to put the product that is missing.
Any more ideas?
Thanks in advance
We can use pd.date_range and groupby.reindex to achieve your result:
date_range = pd.date_range(start='2019-12-25', end='2020-01-01', freq='D')
df = df.groupby('product', sort=False).apply(lambda x: x.reindex(date_range))
df['product'] = df.groupby(level=0)['product'].ffill().bfill()
df = df.droplevel(0)
product udsStock udsSales
2019-12-25 14.0 NaN NaN
2019-12-26 14.0 161.0 848.0
2019-12-27 14.0 1340.0 914.0
2019-12-28 14.0 NaN NaN
2019-12-29 14.0 NaN NaN
2019-12-30 14.0 856.0 0.0
2019-12-31 14.0 NaN NaN
2020-01-01 14.0 NaN NaN
2019-12-25 4.0 3132.0 439.0
2019-12-26 4.0 NaN NaN
2019-12-27 4.0 3177.0 616.0
2019-12-28 4.0 NaN NaN
2019-12-29 4.0 NaN NaN
2019-12-30 4.0 NaN NaN
2019-12-31 4.0 NaN NaN
2020-01-01 4.0 500.0 883.0
Convert index to datetime object :
df2.index = pd.to_datetime(df2.index)
Create unique combinations of date and product :
import itertools
idx = pd.date_range("25-12-2019", "01-01-2020")
product = df2["product"].unique()
temp = itertools.product(idx, product)
temp = pd.MultiIndex.from_tuples(temp, names=["date", "product"])
temp
MultiIndex([('2019-12-25', 14),
('2019-12-25', 4),
('2019-12-26', 14),
('2019-12-26', 4),
('2019-12-27', 14),
('2019-12-27', 4),
('2019-12-28', 14),
('2019-12-28', 4),
('2019-12-29', 14),
('2019-12-29', 4),
('2019-12-30', 14),
('2019-12-30', 4),
('2019-12-31', 14),
('2019-12-31', 4),
('2020-01-01', 14),
('2020-01-01', 4)],
names=['date', 'product'])
Reindex dataframe :
df2.set_index("product", append=True).reindex(temp).sort_index(
level=1, ascending=False
).reset_index(level="product")
product udsStock udsSales
date
2020-01-01 14 NaN NaN
2019-12-31 14 NaN NaN
2019-12-30 14 856.0 0.0
2019-12-29 14 NaN NaN
2019-12-28 14 NaN NaN
2019-12-27 14 1340.0 914.0
2019-12-26 14 161.0 848.0
2019-12-25 14 NaN NaN
2020-01-01 4 500.0 883.0
2019-12-31 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-28 4 NaN NaN
2019-12-27 4 3177.0 616.0
2019-12-26 4 NaN NaN
2019-12-25 4 3132.0 439.0
In R, specifically tidyverse, it can be achieved with the complete method. In Python, the pyjanitor package has something similar, but a few kinks remain to be ironed out (A PR has been submitted already for this).
I have an issue similar to "ValueError: cannot reindex from a duplicate axis".The solution isn't provided.
I have an excel file containing multiple rows and columns of weather data. Data has missing at certain intervals although not shown in the sample below. I want to reindex the time column at 5 minute intervals so that I can interpolate the missing values. Data Sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel('E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just just want to have my index at 5 min frequency so that I can interpolate the NaN later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be re-sampled, you can use code below.
However, you will have to seperate 'Date' & 'Time' columns later.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
You can try this for example:
import pandas as pd
ts = pd.read_excel('E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5T').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it is DateTime type, then try
ts.asfreq('5T')
use
ts.asfreq('5T', method='ffill')
to pull previous values forward.
I would take the approach of creating a blank table and fill it in with the data as it comes from your data source. For this example three observations are read in as NaN, plus the row for 1:15 and 1:20 is missing.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see above the trick in all this. Your date data was converted to datetime but your time data is just a string. Below a proper index is created by used of a lambda function.
rawidx=rawpd.apply(lambda r : pd.datetime.combine(r['Date'],r['Time']),1)
print(rawidx)
This can be applied to the rawpd database as an index.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a file ready for interpolation
I have got it to work. thank you everyone for your time. I am providing the working code.
import pandas as pd
df = pd.read_excel('E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)
I have a pandas dataframe with a two level hierarchical index ('item_id' and 'date'). Each row has columns for a variety of metrics for a particular item in a particular month. Here's a sample:
total_annotations unique_tags
date item_id
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2008-07-01 2 81 33
2008-11-01 2 82 34
2009-04-01 2 84 35
2010-03-01 2 90 35
2010-04-01 2 100 36
2010-11-01 2 105 40
2011-05-01 2 106 40
2011-07-01 2 108 42
2005-08-01 3 479 200
2005-09-01 3 707 269
2005-10-01 3 980 327
2005-11-01 3 1176 373
2005-12-01 3 1536 438
2006-01-01 3 1854 497
2006-02-01 3 2206 560
2006-03-01 3 2558 632
2007-02-01 3 5650 1019
As you can see, there are not observations for all consecutive months for each item. What I want to do is reindex the dataframe such that each item has rows for each month in a specified range. Now, this is easy to accomplish for any given item. So, for item_id 99, for example:
baseDateRange = pd.date_range('2005-07-01','2013-01-01',freq='MS')
data.xs(99,level='item_id').reindex(baseDateRange,method='ffill')
But with this method, I'd have to iterate through all the item_ids, then merge everything together, which seems woefully over-complicated.
So how can I apply this to the full dataframe, ffill-ing the observations (but also the item_id index) such that each item_id has properly filled rows for all the dates in baseDateRange?
Essentially for each group you want to reindex and ffill. The apply gets passed a data frame that has the item_id and date still in the index, so reset, then set and reindex with filling.
idx is your baseDateRange from above.
In [33]: df.groupby(level='item_id').apply(
lambda x: x.reset_index().set_index('date').reindex(idx,method='ffill')).head(30)
Out[33]:
item_id annotations tags
item_id
2 2005-07-01 NaN NaN NaN
2005-08-01 NaN NaN NaN
2005-09-01 NaN NaN NaN
2005-10-01 NaN NaN NaN
2005-11-01 NaN NaN NaN
2005-12-01 NaN NaN NaN
2006-01-01 NaN NaN NaN
2006-02-01 NaN NaN NaN
2006-03-01 NaN NaN NaN
2006-04-01 NaN NaN NaN
2006-05-01 NaN NaN NaN
2006-06-01 NaN NaN NaN
2006-07-01 NaN NaN NaN
2006-08-01 NaN NaN NaN
2006-09-01 NaN NaN NaN
2006-10-01 NaN NaN NaN
2006-11-01 NaN NaN NaN
2006-12-01 NaN NaN NaN
2007-01-01 NaN NaN NaN
2007-02-01 NaN NaN NaN
2007-03-01 NaN NaN NaN
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2007-07-01 2 36 19
2007-08-01 2 36 19
2007-09-01 2 36 19
2007-10-01 2 36 19
2007-11-01 2 36 19
2007-12-01 2 36 19
Constructing on Jeff's answer, I consider this to be somewhat more readable. It is also considerably more efficient since only the droplevel and reindex methods are used.
df = df.set_index(['item_id', 'date'])
def fill_missing_dates(x, idx=all_dates):
x.index = x.index.droplevel('item_id')
return x.reindex(idx, method='ffill')
filled_df = (df.groupby('item_id')
.apply(fill_missing_dates))