Python plotting time-series data?

I would like to plot the data below using matplotlib, but I am not sure how to group it. I want to plot Amps against time for each rectifier, preferably with months on the x axis and Amps on the y axis. Below is a sample of the data.
Rectifier Date Amps Volts
9E220ECP5001 2015-01-01 31.95 11.1
9E220ECP5001 2015-02-01 NaN NaN
9E220ECP5001 2015-03-01 31.05 11.3
9E220ECP5001 2015-04-01 NaN NaN
9E220ECP5001 2015-05-01 30.45 12.2
... ... ... ...
9E220ECP5018 2021-08-01 NaN NaN
9E220ECP5018 2021-09-01 17.4 11.6
9E220ECP5018 2021-10-01 NaN NaN
9E220ECP5018 2021-11-01 NaN NaN
9E220ECP5018 2021-12-01 NaN NaN
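No answer to this question is included in this excerpt, but a common matplotlib approach is to group by the rectifier column and draw one line per group. A minimal sketch, using made-up stand-in data with the same columns as the sample above:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for the sketch; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the sample's columns (values are stand-ins).
df = pd.DataFrame({
    'Rectifier': ['9E220ECP5001'] * 3 + ['9E220ECP5018'] * 3,
    'Date': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01'] * 2),
    'Amps': [31.95, np.nan, 31.05, 17.4, np.nan, 18.0],
})

# One line per rectifier: iterate over the groups and plot each.
fig, ax = plt.subplots()
for name, grp in df.groupby('Rectifier'):
    ax.plot(grp['Date'], grp['Amps'], marker='o', label=name)
ax.set_xlabel('Month')
ax.set_ylabel('Amps')
ax.legend()
```

NaN rows simply leave gaps in each line, which matches how the monthly readings above are structured.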

Related

Interpolating a data set in pandas while ignoring missing data

I have a question about how to get interpolated data across several different "blocks" of time. In a nutshell, I have a dataset like this:
>>> import pandas as pd
>>> test_data = pd.read_csv("test_data.csv")
>>> test_data
ID Condition_num Condition_type Rating Timestamp_ms
0 101 1 Active 58.0 30
1 101 1 Active 59.0 60
2 101 1 Active 65.0 90
3 101 1 Active 70.0 120
4 101 1 Active 80.0 150
5 101 2 Break NaN 180
6 101 3 Active_2 55.0 210
7 101 3 Active_2 60.0 240
8 101 3 Active_2 63.0 270
9 101 3 Active_2 70.0 300
10 101 4 Break NaN 330
11 101 5 Active_3 69.0 360
12 101 5 Active_3 71.0 390
13 101 5 Active_3 50.0 420
14 101 5 Active_3 41.0 450
15 101 5 Active_3 43.0 480
I need to "resample" the final column to another time interval (e.g. 40 ms) to match it to an external dataset. I have been using the following code:
#Set the timestamp column as a datetime with the correct unit, then make it the index
test_data['Timestamp_ms'] = pd.to_datetime(test_data['Timestamp_ms'], unit='ms')
test_data = test_data.set_index('Timestamp_ms')
#Reindex so the index starts at 0, resample to the highest resolution (1ms), then resample to 40ms
test_data = test_data.reindex(
    pd.date_range(start=pd.to_datetime(0, unit='ms'), end=test_data.index.max(), freq='ms')
)
test_data = test_data.resample('1ms').interpolate().resample('40ms').interpolate()
#Round to integers
test_data.xpos = test_data.xpos.round()
Which gives me this:
ID Condition_num Condition_type Rating
1970-01-01 00:00:00.000 NaN NaN NaN NaN
1970-01-01 00:00:00.040 101.0 1.000000 NaN 58.333333
1970-01-01 00:00:00.080 101.0 1.000000 NaN 63.000000
1970-01-01 00:00:00.120 101.0 1.000000 Active 70.000000
1970-01-01 00:00:00.160 101.0 1.333333 NaN 75.833333
1970-01-01 00:00:00.200 101.0 2.666667 NaN 59.166667
1970-01-01 00:00:00.240 101.0 3.000000 Active_2 60.000000
1970-01-01 00:00:00.280 101.0 3.000000 NaN 65.333333
1970-01-01 00:00:00.320 101.0 3.666667 NaN 69.666667
1970-01-01 00:00:00.360 101.0 5.000000 Active_3 69.000000
1970-01-01 00:00:00.400 101.0 5.000000 NaN 64.000000
1970-01-01 00:00:00.440 101.0 5.000000 NaN 44.000000
1970-01-01 00:00:00.480 101.0 5.000000 Active_3 43.000000
The only issue is I cannot figure out which ratings fall during the "Active" conditions, and whether the ratings I am seeing are artifacts of interpolating across the "breaks", where there are no ratings. In other words, I want interpolation within the "Active" blocks, but with everything aligned to the beginning of the whole dataset.
I have tried entering zero ratings for NaN and interpolating from the top of each condition, but that only seems to make the problem worse by altering the ratings further.
Any advice would be greatly appreciated!
I think you need to do all of your logic inside of a groupby, IIUC:
mask = df.Condition_type.ne('Break')
df2 = (df[mask].groupby('Condition_type')           # Group by Condition_type, excluding "Break" rows.
               .apply(lambda x: x.resample('1ms')   # To each group... resample it,
                                 .interpolate()     # interpolate,
                                 .ffill()           # fill values (this only matters for Condition_type),
                                 .resample('40ms')  # then resample to 40ms.
                                 .asfreq())         # No need to interpolate in this direction.
               .reset_index('Condition_type', drop=True))  # We no longer need this extra index.
# Force the index to our resampled interval; this reveals the breaks:
df2 = df2.asfreq('40ms')
print(df2)
Output:
ID Condition_num Condition_type Rating
Timestamp_ms
1970-01-01 00:00:00.000 NaN NaN NaN NaN
1970-01-01 00:00:00.040 101.0 1.0 Active 58.333333
1970-01-01 00:00:00.080 101.0 1.0 Active 63.000000
1970-01-01 00:00:00.120 101.0 1.0 Active 70.000000
1970-01-01 00:00:00.160 NaN NaN NaN NaN
1970-01-01 00:00:00.200 NaN NaN NaN NaN
1970-01-01 00:00:00.240 101.0 3.0 Active_2 60.000000
1970-01-01 00:00:00.280 101.0 3.0 Active_2 65.333333
1970-01-01 00:00:00.320 NaN NaN NaN NaN
1970-01-01 00:00:00.360 101.0 5.0 Active_3 69.000000
1970-01-01 00:00:00.400 101.0 5.0 Active_3 64.000000
1970-01-01 00:00:00.440 101.0 5.0 Active_3 44.000000
1970-01-01 00:00:00.480 101.0 5.0 Active_3 43.000000
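The effect of resampling per group rather than over the whole frame can be checked on a tiny synthetic frame (the column names `block` and `rating` here are stand-ins, not from the question):

```python
import numpy as np
import pandas as pd

# Two "active" blocks separated in time; nothing should interpolate between them.
idx = pd.to_datetime([0, 40, 80, 120], unit='ms')
df = pd.DataFrame({'block': ['a', 'a', 'b', 'b'],
                   'rating': [1.0, 2.0, 10.0, 11.0]}, index=idx)

out = (df.groupby('block')[['rating']]
         .apply(lambda g: g.resample('20ms').interpolate())  # resample within each block
         .reset_index('block', drop=True)
         .asfreq('20ms'))                                    # reinstate the gap as NaN
```

Values interpolate inside each block (1.0 → 1.5 → 2.0), while the 60 ms row between the blocks stays NaN, which is exactly the behavior the answer is after.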

Continuous dates for products in Pandas

I have started working with Pandas and have run into an issue I don't really know how to solve.
I have a dataframe with date, product, stock and sales. Some dates and products are missing. I would like to get a time series for each product over a range of dates.
For example:
product udsStock udsSales
date
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-30 14 856 0
2019-12-25 4 3132 439
2019-12-27 4 3177 616
2020-01-01 4 500 883
The range has to be the same for all products, even if a product has no data for some dates in it.
If I want the range 2019-12-25 to 2020-01-01, the final dataframe should be like this one:
product udsStock udsSales
date
2019-12-25 14 NaN NaN
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-28 14 NaN NaN
2019-12-29 14 NaN NaN
2019-12-30 14 856 0
2019-12-31 14 NaN NaN
2020-01-01 14 NaN NaN
2019-12-25 4 3132 439
2019-12-26 4 NaN NaN
2019-12-27 4 3177 616
2019-12-28 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-31 4 NaN NaN
2020-01-01 4 500 883
I have tried to reindex by the range, but it doesn't work because there are duplicate index values.
idx = pd.date_range('2019-12-25', '2020-01-01')
df = df.reindex(idx)
I have also tried indexing by date and product and then reindexing, but I don't know how to fill in the products that are missing.
Any more ideas?
Thanks in advance
We can use pd.date_range with reindex inside a groupby to achieve your result:
date_range = pd.date_range(start='2019-12-25', end='2020-01-01', freq='D')
df = df.groupby('product', sort=False).apply(lambda x: x.reindex(date_range))
df['product'] = df.groupby(level=0)['product'].ffill().bfill()
df = df.droplevel(0)
product udsStock udsSales
2019-12-25 14.0 NaN NaN
2019-12-26 14.0 161.0 848.0
2019-12-27 14.0 1340.0 914.0
2019-12-28 14.0 NaN NaN
2019-12-29 14.0 NaN NaN
2019-12-30 14.0 856.0 0.0
2019-12-31 14.0 NaN NaN
2020-01-01 14.0 NaN NaN
2019-12-25 4.0 3132.0 439.0
2019-12-26 4.0 NaN NaN
2019-12-27 4.0 3177.0 616.0
2019-12-28 4.0 NaN NaN
2019-12-29 4.0 NaN NaN
2019-12-30 4.0 NaN NaN
2019-12-31 4.0 NaN NaN
2020-01-01 4.0 500.0 883.0
Convert the index to a datetime object:
df2.index = pd.to_datetime(df2.index)
Create unique combinations of date and product:
import itertools
idx = pd.date_range("2019-12-25", "2020-01-01")
product = df2["product"].unique()
temp = itertools.product(idx, product)
temp = pd.MultiIndex.from_tuples(temp, names=["date", "product"])
temp
MultiIndex([('2019-12-25', 14),
('2019-12-25', 4),
('2019-12-26', 14),
('2019-12-26', 4),
('2019-12-27', 14),
('2019-12-27', 4),
('2019-12-28', 14),
('2019-12-28', 4),
('2019-12-29', 14),
('2019-12-29', 4),
('2019-12-30', 14),
('2019-12-30', 4),
('2019-12-31', 14),
('2019-12-31', 4),
('2020-01-01', 14),
('2020-01-01', 4)],
names=['date', 'product'])
Reindex the dataframe:
df2.set_index("product", append=True).reindex(temp).sort_index(
    level=1, ascending=False
).reset_index(level="product")
product udsStock udsSales
date
2020-01-01 14 NaN NaN
2019-12-31 14 NaN NaN
2019-12-30 14 856.0 0.0
2019-12-29 14 NaN NaN
2019-12-28 14 NaN NaN
2019-12-27 14 1340.0 914.0
2019-12-26 14 161.0 848.0
2019-12-25 14 NaN NaN
2020-01-01 4 500.0 883.0
2019-12-31 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-28 4 NaN NaN
2019-12-27 4 3177.0 616.0
2019-12-26 4 NaN NaN
2019-12-25 4 3132.0 439.0
In R, specifically the tidyverse, this can be achieved with the complete verb. In Python, the pyjanitor package has something similar, but a few kinks remain to be ironed out (a PR has already been submitted for this).
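The same "complete" idea can also be sketched in plain pandas with MultiIndex.from_product, shown here on a cut-down version of the example frame (fewer dates and columns, for brevity):

```python
import pandas as pd

# Cut-down version of the example: date index, product column.
df = pd.DataFrame(
    {'product': [14, 14, 4], 'udsStock': [161, 1340, 3132]},
    index=pd.to_datetime(['2019-12-26', '2019-12-27', '2019-12-25']))
df.index.name = 'date'

# Every (product, date) pair in the range; missing pairs become NaN rows.
full = pd.MultiIndex.from_product(
    [df['product'].unique(), pd.date_range('2019-12-25', '2019-12-28')],
    names=['product', 'date'])

out = (df.set_index('product', append=True)  # index: (date, product)
         .swaplevel()                        # index: (product, date)
         .reindex(full)
         .reset_index('product'))
```

With 2 products and 4 dates, the result always has 8 rows, and only the 3 observed (product, date) pairs carry stock values.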

Reindexing timeseries data

I have an issue similar to "ValueError: cannot reindex from a duplicate axis", but no solution is provided there.
I have an Excel file containing multiple rows and columns of weather data. The data has missing values at certain intervals, although that is not shown in the sample below. I want to reindex the time column at 5-minute intervals so that I can interpolate the missing values. Data sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts = ts.reindex(idx)
I just want to have my index at 5-minute frequency so that I can interpolate the NaN values later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be resampled, you can use the code below.
However, you will have to separate the 'Date' and 'Time' columns again later.
df1['DateTime'] = df1['Date'] + df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'], format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
You can try this, for example:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5min').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it is a datetime type, then try
ts.asfreq('5T')
To pull previous values forward, use
ts.asfreq('5T', method='ffill')
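On a toy series the difference between the two calls is easy to see (made-up values; '5min' is the modern spelling of the '5T' alias):

```python
import numpy as np
import pandas as pd

ts = pd.Series([30.6, 30.7],
               index=pd.to_datetime(['2018-04-01 00:05', '2018-04-01 00:15']))

gaps = ts.asfreq('5min')                    # 00:10 appears as NaN
filled = ts.asfreq('5min', method='ffill')  # 00:10 takes the previous value
```

asfreq only conforms the index to the new frequency; it never aggregates, so existing observations pass through unchanged.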
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. In this example, three observations are read in as NaN, and the rows for 1:15 and 1:25 are missing.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018-04-01 01:00', periods=7, freq='5min')
targpd = pd.DataFrame(np.nan, index=time5min, columns=['Col1', 'Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data delivered in rawpd. For this to happen, the Date and Time columns of rawpd have to be combined and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see the trick above: the date data was converted to datetime, but the time data was left as plain objects. Below, a proper index is created with a lambda function (pd.datetime has been removed from recent pandas, so the standard library datetime is used instead).
import datetime
rawidx = rawpd.apply(lambda r: datetime.datetime.combine(r['Date'], r['Time']), axis=1)
print(rawidx)
This can be applied to the rawpd dataframe as an index.
rawpd2 = pd.DataFrame(rawpd[['Col1', 'Col2']].values, index=rawidx, columns=['Col1', 'Col2'])
rawpd2 = rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a file ready for interpolation
I have got it to work. Thank you, everyone, for your time. Here is the working code:
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)
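With the index at a regular 5-minute frequency, the interpolation step the question was building toward could then look like this (synthetic values standing in for the weather columns):

```python
import numpy as np
import pandas as pd

# Regular 5-minute index with a two-row gap, as produced by the resample step.
df = pd.DataFrame(
    {'Temp': [30.6, np.nan, np.nan, 30.7]},
    index=pd.date_range('2018-04-01 00:05', periods=4, freq='5min'))

# Time-weighted linear interpolation across the NaN gap.
df_interp = df.interpolate(method='time')
```

method='time' weights by the actual time deltas, which matters when gaps are uneven; on a perfectly regular index it matches plain linear interpolation.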

df.plot.density() returns an empty plot despite values actually being there

I currently have this data set
WSPD GST WD WVHT DPD APD MWD BAR ATMP WTMP
Date
2005-06-06 03:00:00 8.2 9.8 86 NaN NaN NaN 77 1011.1 28.8 29
2005-06-06 04:00:00 9.4 11.2 96 NaN NaN NaN 79 1011.6 29 29
2005-06-06 05:00:00 9.4 10.9 103 NaN NaN NaN 78 1011.6 29 28.9
2005-06-06 06:00:00 7.5 9 114 NaN NaN NaN 84 1011.4 27.7 28.9
2005-06-06 07:00:00 7 10.4 118 NaN NaN NaN 85 1011.1 25.4 28.9
I am attempting to make a probability density plot of the WVHT column. However, when I run:
df['WVHT'].plot.density()
I get an empty plot even though the column has values. Please note this is just a sample of the data.
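No answer appears in this excerpt. One common cause of an empty density plot, though, is that the column was read in as strings (object dtype) or is all NaN in the plotted slice; coercing to numeric first is worth trying. A sketch with hypothetical data (the 'MM' missing-value marker is an assumption, not from the question):

```python
import pandas as pd

# Hypothetical: numbers stored as strings plus a non-numeric missing marker.
df = pd.DataFrame({'WVHT': ['1.2', '1.5', 'MM', None]})

# Coerce to float, turning unparseable entries into NaN, then drop them.
wvht = pd.to_numeric(df['WVHT'], errors='coerce').dropna()
# wvht.plot.density() should now draw a curve (the KDE requires scipy)
```

If wvht comes back empty after the coercion, the slice genuinely has no numeric values, which would also explain the blank plot.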

Python Pandas v0.18+: is there a way to resample a dataframe without filling NAs?

I wonder if there is a way to upsample a DataFrame without having to decide immediately how the NAs should be filled.
I tried the following but got this FutureWarning:
FutureWarning: .resample() is now a deferred operation use .resample(...).mean() instead of .resample(...)
Code:
import pandas as pd
dates = pd.date_range('2015-01-01', '2016-01-01', freq='BM')
dummy = list(range(len(dates)))
df = pd.DataFrame({'A': dummy}, index=dates)
df.resample('B')
Is there a better way to do this, that doesn't show warnings?
Thanks.
Use Resampler.asfreq:
print (df.resample('B').asfreq())
A
2015-01-30 0.0
2015-02-02 NaN
2015-02-03 NaN
2015-02-04 NaN
2015-02-05 NaN
2015-02-06 NaN
2015-02-09 NaN
2015-02-10 NaN
2015-02-11 NaN
2015-02-12 NaN
2015-02-13 NaN
2015-02-16 NaN
2015-02-17 NaN
2015-02-18 NaN
2015-02-19 NaN
2015-02-20 NaN
2015-02-23 NaN
2015-02-24 NaN
2015-02-25 NaN
2015-02-26 NaN
2015-02-27 1.0
2015-03-02 NaN
2015-03-03 NaN
2015-03-04 NaN
...
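A quick check on a two-point frame confirms that .resample('B').asfreq() only inserts NaN rows and never fills anything (dates chosen to match the example above):

```python
import pandas as pd

df = pd.DataFrame({'A': [0.0, 1.0]},
                  index=pd.to_datetime(['2015-01-30', '2015-02-27']))

# Upsample to business-day frequency; all new rows come back as NaN.
up = df.resample('B').asfreq()
```

The two original observations survive untouched, and every business day in between (20 of them, so 21 rows in total) is NaN, leaving the fill decision for later.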
