Resampling Pandas but with given dates - python

I want to resample my pandas DataFrame, which has a datetime index. The resample method labels each bin with the last date of the period, which doesn't always exist in the original data. For example, my original data runs from 2000-01-03 to 2005-12-29, but when I resample it yearly I get a row labelled 2005-12-31. This is a problem when I use concat on the resampled data.
Y = price.resample("Y").first()
M = price.resample("M").first()
W = price.resample("W").first()
total = pd.concat([price,W,M,Y], axis=1, sort=False)
#example
price = pd.DataFrame([1315.23, 1324.97, 1376.54, 1351.46, 1343.55, 1369.89, 1380.2 ,
                      1371.18, 1359.99, 1340.93, 1312.15, 1322.74, 1305.6 , 1264.74,
                      1274.86, 1305.97, 1305.97, 1315.19, 1328.92, 1334.22, 1320.28],
                     index = ['2000-12-01', '2000-12-04', '2000-12-05', '2000-12-06',
                              '2000-12-07', '2000-12-08', '2000-12-11', '2000-12-12',
                              '2000-12-13', '2000-12-14', '2000-12-15', '2000-12-18',
                              '2000-12-19', '2000-12-20', '2000-12-21', '2000-12-22',
                              '2000-12-25', '2000-12-26', '2000-12-27', '2000-12-28',
                              '2000-12-29'])
price.index = pd.to_datetime(price.index)
price.resample("W").first()
#see how 12-03, 12-10, 12-17, 12-24, 12-31 are not dates that are in the original index

Have you considered just dropping the undesired rows afterwards?
This works because, after the concat, every row whose index was created by resample (and is not in the original index) has NaN in the original price column, so you can filter on that column:
total = total[total.iloc[:, 0].notna()]
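For the example above, a minimal end-to-end sketch of this approach (the column names are added here purely for illustration, to avoid duplicate labels after the concat):
import pandas as pd

# reuse `price` from the example above
W = price.resample("W").first()
M = price.resample("M").first()
Y = price.resample("Y").first()

total = pd.concat([price, W, M, Y], axis=1, sort=False)
total.columns = ["price", "W", "M", "Y"]

# rows created by resample are NaN in the original price column,
# so dropping on it keeps only dates present in the original index
total = total.dropna(subset=["price"])
total.head()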

Related

Parse CSV in 2D Python Object

I am trying to do analysis on a CSV file which looks like this:
timestamp        value
1594512094.39    51
1594512094.74    76
1594512098.07    50.9
1594512099.59    76.80000305
1594512101.76    50.9
I am using pandas to import each column:
dataFrame = pandas.read_csv('iot_telemetry_data.csv')
graphDataHumidity: object = dataFrame.loc[:, "humidity"]
graphTime: object = dataFrame.loc[:, "ts"]
My problem is that I need to pair both columns so I can filter the values of a specific time range. For example, given a timestampBegin of 1594512109.13668 and a timestampEnd of 1594512129.37415, I want the corresponding values so I can compute, for example, their mean over that time range.
I didn't find any solutions to this online, and I don't know any libraries which solve this problem.
You can first filter the rows whose timestamp values are between start and end, then compute on the filtered rows as follows.
(In the sample data there is no row whose timestamp falls between 1594512109.13668 and 1594512129.37415, so edit the range values as needed.)
import pandas as pd

df = pd.read_csv('iot_telemetry_data.csv')
start = 1594512109.13668
end = 1594512129.37415
# keep only the rows whose timestamp lies in [start, end]
df = df[(df['timestamp'] >= start) & (df['timestamp'] <= end)]
average = df['value'].mean()
print(average)
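If you need this repeatedly, a small helper is easy to write. A minimal sketch, assuming the column names from the sample above (the function name mean_in_range is hypothetical):
import pandas as pd

def mean_in_range(df, start, end):
    # Series.between is inclusive on both ends by default
    mask = df['timestamp'].between(start, end)
    return df.loc[mask, 'value'].mean()

df = pd.read_csv('iot_telemetry_data.csv')
print(mean_in_range(df, 1594512094.0, 1594512100.0))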

How do I access the integers given by nunique in Pandas?

I am trying to access the items in each column that the following code outputs. It outputs two columns: 'Accurate_Episode_Date' values and the count (the frequency of each date). My goal is to plot the date on the x axis and the count on the y axis using a scatterplot, but first I need to be able to access the actual count values.
data = pd.read_csv('CovidDataset.csv')
Barrie = data.loc[data['Reporting_PHU_City'] == 'Barrie']
dates_barrie = Barrie[['Accurate_Episode_Date']]
num = data.groupby('Accurate_Episode_Date')['_id'].nunique()
print(num.tail(5))
The code above outputs the following:
2021-01-10T00:00:00 1326
2021-01-11T00:00:00 1875
2021-01-12T00:00:00 1274
2021-01-13T00:00:00 492
2021-01-14T00:00:00 8
Again, I want to plot the dates on the x axis, and the counts on the y axis in scatterplot form. How do I access the count and date values?
EDIT: I just want a way to plot dates like 2021-01-10T00:00:00 and so on on the x axis, and the corresponding count (1326) on the y axis.
Turns out this was mainly a data-type issue: all that was needed was accessing the datetime index and casting it to string with num.index.astype(str).
You can change it in place and plot like below.
num.index = num.index.astype(str)
num.plot()
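Since the goal was a scatterplot rather than a line plot, a minimal matplotlib sketch (assuming num with the string index set above):
import matplotlib.pyplot as plt

plt.scatter(num.index, num.values)  # dates on x, counts on y
plt.xticks(rotation=45)
plt.show()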
If you only want to access the values of a DataFrame or Series you just need to access them like this: num.values
If you want to plot the date column on X, you don't need to access that column separately, just use pandas internals:
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# some dummy dates + counts
dates = [datetime.now() + timedelta(hours=i) for i in range(1, 6)]
values = np.random.randint(1, 10, 5)
df = pd.DataFrame({
    "Date": dates,
    "Values": values,
})
# if you only have 1 other column you can skip `y`
df.plot(x="Date", y="Values")
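Since the original goal was a scatterplot, here is a matplotlib variant of the block above (matplotlib handles datetime x-axes directly):
import matplotlib.pyplot as plt

plt.scatter(df["Date"], df["Values"])
plt.gcf().autofmt_xdate()  # tilt the date labels so they don't overlap
plt.show()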
You need to convert the date column using pd.to_datetime(df['dates']); then you can plot.
Updated answer:
Here there is no need to convert with pd.to_datetime(df['dates']):
ax = df[['count']].plot()
ax.set_xticks(df['count'].index)  # bracket access; df.count is a DataFrame method
ax.set_xticklabels(df['date'])

Graphing rows with a date value within 50 days of a given value in pandas

I have a DataFrame (dataframeA) with a column of dates, all formatted like this
date
19960826
19960826
19970303
19970320
19970905
and a column of values
values
100
35
11
37
...
and a column of groups
groupK
groupL
groupM
...
There is another DataFrame, dataframeB, with two columns: date in the format yyyymmdd, and group. For each row in dataframeB, how do I graph the values that are within 50 days before and after that row's date, for that row's group?
i.e. if dataframeB first row is
20050101 groupM
graph (on the Y axis) the values in dataframeA where the date is within 50 days before or after Jan 01 2005, and the group is groupM.
Here's some sample data to start with:
import pandas as pd
import numpy as np
import string
start_date = '20050101'
drange = pd.date_range(start_date, periods=100, freq='D')
possible_groups = ['A','B','C','D','E','F']
chosen = np.random.choice(possible_groups, len(drange), replace=True)
groups = pd.Series(chosen).apply(lambda x: 'group'+x)
values = np.random.randint(1, 100, len(drange))
dfA = pd.DataFrame({'date':drange, 'grp':groups, 'value':values})
dfB = pd.DataFrame({'date':drange, 'grp':groups})
Note: If you need to keep the datetime objects visually looking like YYYYMMDD, you can use strftime() and switch back to datetime as needed, e.g.:
drange = pd.date_range(start_date, periods=100, freq='D').strftime('%Y%m%d')
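For instance, a tiny round-trip sketch of that conversion (string labels for display, datetime for computation):
import pandas as pd

labels = pd.date_range('20050101', periods=3).strftime('%Y%m%d')  # e.g. '20050101'
back = pd.to_datetime(labels, format='%Y%m%d')                    # back to Timestamps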
Now, assuming you need to keep these data frames separate for some reason (i.e. merge() is not allowed), the following should work.
def plot_range(data, within):
    (
        dfA.set_index('date')
           .loc[dfA.grp.values == data.grp]
           .loc[data.date - pd.Timedelta(days=within):
                data.date + pd.Timedelta(days=within)]
           .plot(title=data.grp)
    )
within = 50 # set within to the desired range in days around a date
dfB.apply(plot_range, axis='columns', args=(within,))
Here's example output from a few days' subset:
subset = 3
within = 10
dfB.sample(subset).apply(plot_range, axis='columns', args=(within,))

PANDAS - Loop over two datetime indexes with different sizes to compare days and values

Looking for a more efficient way to loop over and compare datetimeindex values in two Series objects with different frequencies.
Setup
Imagine two Pandas series, each with a datetime index covering the same year span yet with different frequencies for each index. One has a frequency of days, the other a frequency of hours.
range1 = pd.date_range('2016-01-01','2016-12-31', freq='D')
range2 = pd.date_range('2016-01-01','2016-12-31', freq='H')
I'm trying to loop over these series using their indexes as a lookup to match days so I can compare data for each day.
What I'm doing now...slow.
Right now I'm using multi-level for loops and if statements (see below); the time to complete these loops seems excessive (5.45 s per loop) compared with what I'm used to in Pandas operations.
for date, val in zip(frame1.index, frame1['data']):          # freq = 'D'
    for date2, val2 in zip(frame2.index, frame2['data']):    # freq = 'H'
        if date.day == date2.day:  # check to see if dates are a match
            if val2 > val:         # compare the values
                pass               # append values, etc.
Question
Is there a more efficient way of using the index in frame1 to loop over the index in frame2 and compare the values in each frame for a given day? Ultimately I want to create a series of values wherever frame2 vals are greater than frame1 vals.
Reproducible (Tested) Example
Create two separate series with random data and assign each a datetime index.
import pandas as pd
import numpy as np
range1 = pd.date_range('2016-01-01','2016-12-31', freq='D')
range2 = pd.date_range('2016-01-01','2016-12-31', freq='H')
frame1 = pd.Series(np.random.rand(366), index=range1)
frame2 = pd.Series(np.random.rand(8761), index=range2)
Still not sure what you want to do with the information, but I'd: make a copy of frame2, split its index into a date and a time component, then compare specifying a level.
frame3 = frame2.copy()
frame3.index = [pd.to_datetime(frame3.index.date), frame3.index.time]  # (date, time) MultiIndex
results = frame3.lt(frame1, level=0)
results.head()
2016-01-01 00:00:00 True
01:00:00 True
02:00:00 True
03:00:00 True
04:00:00 True
dtype: bool
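To get the actual frame2 values rather than booleans (the question's stated goal was the values where frame2 exceeds frame1), one possible follow-up, assuming frame3 as built above:
# gt() flips the comparison: True where the hourly value exceeds
# that day's frame1 value; boolean indexing then keeps those values
exceed = frame3[frame3.gt(frame1, level=0)]
exceed.head()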
Yes, use resample, asfreq and pd.concat.
Use resample to get the proper frequency out of your series.
asfreq (which sounds sort of dirty) is used to convert back to a series with the frequency defined in resample.
Concatenate with frame1 to get values side-by-side.
df = pd.concat([frame1, frame2.resample('1D').asfreq()], axis=1)
df.head()
Output:
0 1
2016-01-01 0.147067 0.235858
2016-01-02 0.820398 0.353275
2016-01-03 0.840499 0.186273
2016-01-04 0.505740 0.340201
2016-01-05 0.547840 0.695041
Then you can use the following to get back to your series of frame2 values exceeding frame1.
df.columns = ['frame1', 'frame2']
df.query('frame1 < frame2')['frame2']
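The same selection with plain boolean indexing, if you'd rather avoid query():
# identical result to the query() above
exceed = df.loc[df['frame1'] < df['frame2'], 'frame2']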

Python, Pandas: Reindex/Slice DataFrame with duplicate Index values

Let's consider a DataFrame that contains 1 row of 2 values per each day of the month of Jan 2010:
from datetime import datetime as dt

import numpy as np
import pandas as pd

date_range = pd.date_range(dt(2010,1,1), dt(2010,1,31), freq='1D')
df = pd.DataFrame(data = np.random.rand(len(date_range),2), index = date_range)
and another time series with sparser data and duplicate index values:
observations = pd.DataFrame(data = np.random.rand(7,2),
                            index = (dt(2010,1,12), dt(2010,1,18), dt(2010,1,20),
                                     dt(2010,1,20), dt(2010,1,22), dt(2010,1,22),
                                     dt(2010,1,28)))
I split the first DataFrame df into a list of 5 DataFrames, each containing one week's worth of data from the original: df_weeks = [g for n, g in df.groupby(pd.TimeGrouper('W'))]
Now I would like to split the data of the second DataFrame by the same 5 weeks, i.e. end up in this specific case with a variable obs_weeks containing 5 DataFrames spanning the same time ranges as df_weeks, 2 of them being empty.
I tried using reindex such as in this question: Python, Pandas: Use the GroupBy.groups description to apply it to another grouping
and Periods:
p1 = [x.to_period() for x in list(df.groupby(pd.TimeGrouper('W')).groups.keys())]
p1 = sorted(p1)
dfs = []
for p in p1:
    dff = observations.truncate(p.start_time, p.end_time)
    dfs.append(dff)
(see this question: Python, Pandas: Boolean Indexing Comparing DateTimeIndex to Period)
The problem is that if some values in the index of observations are duplicates (and this is the case), none of those methods work. I also tried changing the index of observations to a normal column and doing the slicing on that column, but I received an error message as well.
You can achieve this by doing a simple filter. (Note: .ix is removed in recent pandas, so .loc is used below, and pd.Grouper(freq='W') replaces the deprecated pd.TimeGrouper('W').)
p1 = [x.to_period() for x in list(df.groupby(pd.Grouper(freq='W')).groups.keys())]
p1 = sorted(p1)
dfs = []
for p in p1:
    # keep only observations falling inside this weekly period
    dff = observations.loc[
        (observations.index >= p.start_time) &
        (observations.index < p.end_time)]
    dfs.append(dff)
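A quick sanity check on the result, printing each week label and the number of matching rows (weeks with no observations give empty frames):
for p, dff in zip(p1, dfs):
    print(p, len(dff))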
