Converting monthly values into daily using pandas interpolation - python

I have 12 average monthly values for 1000 columns and I want to convert the data into daily values using pandas. I tried to do it with interpolate, but I only got daily values from 31/01/1991 to 31/12/1991, which does not cover the whole year: the daily values for January are missing. I used date_range for the index of my data frame.
date = pd.date_range(start="01/01/1991", end="31/12/1991", freq="M")
upsampled = df.resample("D")
interpolated = upsampled.interpolate(method='linear')
How can I get the interpolated values for 365 days?

Note that interpolation is between the known points.
So to interpolate throughout the whole year, it is not enough to have
just 12 values (for each month).
You must have 13 values (e.g. for the start of each month and
the start of the next year).
Thus I created the source df as:
date = pd.date_range(start='01/01/1991', periods=13, freq='MS')
df = pd.DataFrame({'date': date, 'amount': np.random.randint(100, 200, date.size)})
getting e.g.:
date amount
0 1991-01-01 113
1 1991-02-01 164
2 1991-03-01 181
3 1991-04-01 164
4 1991-05-01 155
5 1991-06-01 157
6 1991-07-01 118
7 1991-08-01 133
8 1991-09-01 184
9 1991-10-01 183
10 1991-11-01 159
11 1991-12-01 193
12 1992-01-01 163
Then to upsample it to daily frequency and interpolate, I ran:
df.set_index('date').resample('D').interpolate()
If you don't want the result to contain the last row (for 1992-01-01),
take only a slice of the above result, dropping the last row:
df.set_index('date').resample('D').interpolate()[:-1]
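The same approach extends column-wise to the multi-column case from the question (many columns of monthly averages). Here is a minimal sketch with random data and made-up column names:
import numpy as np
import pandas as pd

# 13 month-start rows (Jan 1991 .. Jan 1992) for a few illustrative columns
date = pd.date_range(start='01/01/1991', periods=13, freq='MS')
df = pd.DataFrame(np.random.randint(100, 200, size=(13, 3)),
                  index=date, columns=['col_a', 'col_b', 'col_c'])

# upsample to daily frequency, interpolate every column, drop the 1992-01-01 row
daily = df.resample('D').interpolate()[:-1]
print(len(daily))   # 365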

Related

Pandas Groupby and remove rows with numbers that are close to each other

I have a data frame df
df =
Code Bikes Year
12 356 2020
4 378 2020
2 389 2020
35 378 2021
40 370 2021
32 350 2021
I would like to group the data frame by Year using df.groupby('Year'), check for close values in the df['Code'] column (values within 3 of each other), and of those keep only the row with the maximum value in the df['Bikes'] column.
For instance, in the first group (2020), the values 4 and 2 are close since 4-2=2 ≤ 3, and since 389 (df['Bikes']), corresponding to df['Code'] = 2, is the higher of the two, retain that row and drop the row where df['Code'] = 4.
The expected output for the given example:
Code Bikes Year
12 356 2020
2 389 2020
35 378 2021
40 370 2021
new_df = df.sort_values(['Year', 'Code'])
# gap to the previous Code within the same Year
new_df['diff'] = new_df['Code'] - new_df.groupby('Year')['Code'].shift()
# start a new cluster whenever the gap exceeds 3 (or at the first row of a Year)
new_df['cumsum'] = ((new_df['diff'] > 3) | (new_df['diff'].isna())).cumsum()
# within each cluster keep only the row with the largest Bikes value
new_df = new_df.sort_values('Bikes', ascending=False).drop_duplicates(['cumsum']).sort_index()
new_df.drop(columns=['diff', 'cumsum'], inplace=True)
You can first sort the values by both columns with DataFrame.sort_values, then create groups by comparing the differences against the threshold of 3 and taking a cumulative sum, and last use DataFrameGroupBy.idxmax to get the indices of the maximal Bikes per Year and the helper Series:
df1 = df.sort_values(['Year','Code'])
g = df1.groupby('Year')['Code'].diff().gt(3).cumsum()
df2 = df.loc[df1.groupby(['Year', g])['Bikes'].idxmax()].sort_index()
print(df2)
Code Bikes Year
0 12 356 2020
2 2 389 2020
3 35 378 2021
4 40 370 2021
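In case the helper key g is not obvious: within each Year, diff() gives the gap to the previous Code after sorting, gt(3) flags where a new cluster of codes starts, and cumsum() turns those flags into cluster labels. An optional quick check, reusing df1 and g from above:
print(pd.concat([df1, g.rename('cluster')], axis=1))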

Python Pandas: Best way to find local maximums in large DF

I have a large dataframe that consists of many cycles; each cycle has 2 peak maximum values that I need to capture into another dataframe.
I have created a sample data frame that mimics the data I am seeing:
import pandas as pd
data = {'Cycle':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3], 'Pressure':[100,110,140,180,185,160,120,110,189,183,103,115,140,180,200,162,125,110,196,183,100,110,140,180,185,160,120,180,201,190]}
df = pd.DataFrame(data)
As you can see, in each cycle there are two maxima, but the part I was having trouble with is that the 2nd peak is usually higher than the first, so there can be rows with values that are technically higher than the other peak's max in the cycle. The results should look something like this:
data2 = {'Cycle':[1,1,2,2,3,3], 'Peak Maxs': [185,189,200,196,185,201]}
df2= pd.DataFrame(data2)
I have tried a couple of methods, including .nlargest(2) per cycle, but the problem is that since one of the peaks is usually higher, it pulls the 2nd-highest number in the data, which isn't necessarily the other peak.
This graph shows the peak pressures from each cycle that I would like to be able to find.
Thanks for any help.
Use argrelextrema from scipy.signal to find the local maxima within each cycle:
import numpy as np
from scipy.signal import argrelextrema

out = df.groupby('Cycle')['Pressure'].apply(lambda x: x.iloc[argrelextrema(x.values, np.greater)])
Out[124]:
Cycle
1 4 185
8 189
2 14 200
18 196
3 24 185
28 201
Name: Pressure, dtype: int64
out = out.sort_values().groupby(level=0).tail(2).sort_index()
out
Out[138]:
Cycle
1 4 185
8 189
2 14 200
18 196
3 24 185
28 201
Name: Pressure, dtype: int64
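If you want this in the Cycle / Peak Maxs layout from the question, a small optional reshape of out (the column name is taken from the question's expected output):
df2 = (out.rename('Peak Maxs')
          .reset_index(level=0)      # bring Cycle back as a column
          .reset_index(drop=True))   # drop the original row positions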
Use groupby().shift() to get the neighboring values, then compare:
g = df.groupby('Cycle')
local_maxes = (df['Pressure'].gt(g['Pressure'].shift())      # greater than previous row
               & df['Pressure'].gt(g['Pressure'].shift(-1))  # greater than next row
               )
df[local_maxes]
Output:
Cycle Pressure
4 1 185
8 1 189
14 2 200
18 2 196
24 3 185
28 3 201

How to plot monthly data of Individual year on the same plot correctly

I have sales data for a store over several years and I want to plot its monthly sales data.
The 'Date' column is a DateTime object, so I plotted it like this:
wlmrt['Date'].dt.month.value_counts().sort_index().plot()
# The output is really a graph, but to show the problem I list the underlying values:
1 450
2 495
3 540
4 630
5 585
6 540
7 585
8 540
9 585
10 585
11 405
12 495
I think this groups the same month across different years and plots that, but I want the sales of each month of each year to be plotted.
wlmrt.resample('M', on='Date').size().plot()
Should do the trick. pandas.DataFrame.resample is a "convenience method for frequency conversion and resampling" (here resampling on the 'Date' column); size() on the returned Resampler counts the rows in each monthly period, and the result can then be plotted with the usual pandas plot() method.
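To get what the question actually asks for, one line per year with the months on the x-axis, pivoting the monthly counts is one option. A sketch, assuming wlmrt['Date'] is already a datetime column:
counts = (wlmrt.groupby([wlmrt['Date'].dt.year, wlmrt['Date'].dt.month])
               .size()               # number of rows per (year, month)
               .unstack(level=0))    # rows: months 1-12, columns: one per year
counts.plot()                        # each year plotted as its own line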

Get the average mean of entries per month with datetime in Pandas

I have a large df with many entries per month. I would like to see the average number of entries per month, for example to see if any months normally have more entries. (Ideally I'd also like to plot this with a line for the overall mean to compare against, but that is maybe a later question.)
My df is something like this:
ufo=pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo['Time']=pd.to_datetime(ufo.Time)
The head shows one row per reported sighting, with Time parsed as a datetime column.
So if I'd like to see, for example, whether there are more UFO sightings in the summer, how would I go about it?
I have tried:
ufo.groupby(ufo.Time.dt.month).mean()
But that only works if I am calculating a numerical value. If I use count() instead, I get the total number of entries per month summed over all years, not the average.
EDIT: To clarify, I would like to have the mean of entries - ufo-sightings - per month.
You could do something like this:
# for each calendar month, count how many years of records it spans
def total_month(x):
    return x.max().year - x.min().year + 1

new_df = ufo.groupby(ufo.Time.dt.month).Time.agg(['size', total_month])
new_df['mean_count'] = new_df['size'] / new_df['total_month']
Output:
size total_month mean_count
Time
1 862 57 15.122807
2 817 70 11.671429
3 1096 55 19.927273
4 1045 68 15.367647
5 1168 53 22.037736
6 3059 71 43.084507
7 2345 65 36.076923
8 1948 64 30.437500
9 1635 67 24.402985
10 1723 65 26.507692
11 1509 50 30.180000
12 1034 56 18.464286
I think this is what you are looking for; please ask for clarification if it doesn't match what you need.
# Add a new column instance, this adds a value to each instance of ufo sighting
ufo['instance'] = 1
# set index to time, this makes df a time series df and then you can apply pandas time series functions.
ufo.set_index(ufo['Time'], drop=True, inplace=True)
# create another df by resampling the original df and counting the instance column by Month ('M' is resample by month)
ufo2 = pd.DataFrame(ufo['instance'].resample('M').count())
# just to find month of resampled observation
ufo2['Time'] = pd.to_datetime(ufo2.index.values)
ufo2['month'] = ufo2['Time'].apply(lambda x: x.month)
and finally you can groupby month :)
ufo2.groupby(by='month').mean()
and this is the output:
month mean_instance
1 12.314286
2 11.671429
3 15.657143
4 14.928571
5 16.685714
6 43.084507
7 33.028169
8 27.436620
9 23.028169
10 24.267606
11 21.253521
12 14.563380
Do you mean you want to group your data by month? I think we can do this
ufo['month'] = ufo['Time'].apply(lambda t: t.month)
ufo['year'] = ufo['Time'].apply(lambda t: t.year)
In this way, you will have 'year' and 'month' to group your data.
ufo_2 = ufo.groupby(['year', 'month'])['place_holder'].mean()
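To go from that year/month grouping to the average number of sightings per calendar month, which is what the question asks for, one option is to count the rows per (year, month) and then average those counts per month. A sketch reusing the columns created above:
monthly_counts = ufo.groupby(['year', 'month']).size()    # sightings per (year, month)
avg_per_month = monthly_counts.groupby(level='month').mean()
avg_per_month.plot()                                       # optional: compare the months visually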

Slicing in Pandas Dataframes and comparing elements

Morning. Recently I have been trying to use pandas to build large data tables for machine learning (I'm trying to move away from numpy as best I can).
However, I'm running into some issues, namely with slicing pandas data frames.
I'd like to return the rows I specify and then reference and compare particular elements with those in other arrays. Here's a small amount of code I've implemented, with an outline:
import pandas as pd
import csv
import math
import random as nd
import numpy
#create the pandas dataframe from my csv. The Csv is entirely numerical data
#with exception of the first row vector which has column labels
df=pd.read_csv(r"C:\Users\Python\Downloads\Data for Brent - Secondattampatatdfrandomsample.csv")
#I use panda functionality to return a random sample of the data (a subset
#of the array)
df_sample = df.sample(10)
It's at this point that I want to compare the first element along each row vector to the original data. Specifically, the first element in any row contains an id number.
Where the elements of the original data frame and the sample frame match up, I'd like to compute a 3- and 6-month average of the associated column values for the matching id number.
For the record, I'm comfortable moving to numpy and away from pandas, but there are model-training methods in pandas I hear a ton of good things about (my training is on the mathematics side of things and less so program development). Thanks for the input!
edit: here is the sample input for the first 11 row vectors in the dataframe (id, year, month,x,y,z)
id year month x y z
0 2 2016 2 1130 343.627538 163660.060200
1 2 2016 4 859 913.314513 360633.159400
2 2 2016 5 931 858.548056 93608.190030
3 2 2016 6 489 548.314860 39925.669950
4 2 2016 7 537 684.441725 80270.240060
5 2 2016 8 618 673.887072 124041.560000
6 2 2016 9 1030 644.749493 88975.429980
7 2 2016 10 1001 543.312870 54874.599830
8 2 2016 11 1194 689.053707 79930.230000
9 2 2016 12 673 483.644736 27567.749940
10 2 2017 1 912 657.716386 54590.460070
11 2 2017 2 671 682.007537 52514.580380
Here is how the sample data is returned (same tuple layout as before). I used native pandas functions to return a randomly generated subset of 10 row vectors out of almost 9000 entries:
2 2016 1 633 877.9282175 75890.97027
5185 2774 2016 4 184 399.418719 9974.375000
9441 4974 2017 2 239 135.520851 0.000000
5134 2745 2017 2 187 217.220657 7711.333333
8561 4063 2017 1 103 505.714286 18880.000000
3328 2033 2016 11 118 452.152542 7622.000000
3503 2157 2016 3 287 446.668831 8092.588235
5228 2791 2016 2 243 400.166008 12655.250000
9380 4708 2017 2 210 402.690583 5282.352941
1631 1178 2016 10 56 563.716667 16911.500000
2700 1766 2016 1 97 486.764151 6449.625000
I'd like to identify the appropriate positions in the sample array, search for identical elements in the original array, and compute averages (and eventually do more rigorous statistical modeling) of their associated numerical data.
# group the original frame by id and take the mean of x, y and z for every id
# (the sample ids are a subset of these; filter with .isin(df_sample['id']) if needed)
id_means = df.groupby('id').mean()[['x', 'y', 'z']].reset_index()
I'm not sure if this is exactly what you want but I'll walk through it to see if it gives you ideas. For each unique id in the sample (I did it for all of them, implement whatever check you like), I grouped the original dataframe by that id (all rows with id == 2 are smushed together) and took the mean of the resulting pandas.GroupBy object as required (which averages the smushed together rows, for each column not in the groupby call). Since this averages your month and year as well, and all I think I care about is x, y, and z, I selected those columns, and then for aesthetic purposes reset the index.
Alternatively, if you wanted the average for that id for each year in the original df, you could do
df.groupby(['id', 'year']).mean()[['x', 'y', 'z']].reset_index()
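If you specifically need the 3- and 6-month averages mentioned in the question, a trailing rolling window per id is one option. This is only a sketch: it assumes one row per id per month, combines the year and month columns into a date for sorting, and uses x as an illustrative target column:
# build a proper date from the year/month columns so rows sort chronologically
df['date'] = pd.to_datetime(dict(year=df['year'], month=df['month'], day=1))
df = df.sort_values(['id', 'date'])

# trailing 3- and 6-month means of x within each id
df['x_3mo_avg'] = df.groupby('id')['x'].transform(lambda s: s.rolling(3, min_periods=1).mean())
df['x_6mo_avg'] = df.groupby('id')['x'].transform(lambda s: s.rolling(6, min_periods=1).mean())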
