So, for a forecasting project, I have a really long Dataframe of multiple time series of the following type (it has a numerical index):
date        time_series_id  value
2015-08-01  0               0
2015-08-02  0               1
2015-08-03  0               2
2015-08-04  0               3
2015-08-01  1               2
2015-08-02  1               3
2015-08-03  1               4
2015-08-04  1               5
My objective is to add 3 new columns to this dataset, for each individual time series (each id), that correspond to trend, seasonal and resid.
Because of the characteristics of the dataset, the series tend to have NaNs at the start and the end of the dates.
What I was trying to do was the following:
from statsmodels.tsa.seasonal import seasonal_decompose
df.assign(trend = lambda x: x.groupby("time_series_id")["value"].transform(lambda s: s.mask(~s.isna(), other= seasonal_decompose(s[~s.isna()], model='additive', extrapolate_trend='freq').trend)))
The expected output (trend value are not actual values) should be:
date        time_series_id  value  trend
2015-08-01  0               0      1
2015-08-02  0               1      1
2015-08-03  0               2      1
2015-08-04  0               3      1
2015-08-01  1               2      1
2015-08-02  1               3      1
2015-08-03  1               4      1
2015-08-04  1               5      1
But I get the following error message:
AttributeError: 'Int64Index' object has no attribute 'inferred_freq'
In a previous iteration of my code, this worked for my individual time series data frames, since I had embedded the date column as an index of the data frame instead of an additional column, so the "x" that the lambda function takes has already a "date time" index appropriate for the seasonal_decompose function.
df.assign(
    trend = lambda x: x["value"].mask(~x["value"].isna(), other =
        seasonal_decompose(x["value"][~x["value"].isna()], model='additive', extrapolate_trend='freq').trend))
My questions are, first: is it possible to achieve this using groupby, or are other approaches possible? Second: is there a way to handle this that doesn't eat much memory? The original dataset I'm working on has approximately 1MM rows, so any help is really welcome :).
Did one of the already posed solutions work? If so or you found a different solution please share. I tried each without success, but I'm new to Python so probably missing something.
Here is what I came up with, using a for loop. For my dataset it took 8 minutes to decompose 20 million rows consisting of 6,000 different subsets. This works but I wish it were faster.
Date Time            Segment ID  Travel Time(Minutes)
2021-11-09 07:15:00  1           30
2021-11-09 07:30:00  1           18
2021-11-09 07:15:00  2           30
2021-11-09 07:30:00  2           17
import pandas as pd
import statsmodels.api as sm

segments = set(frame['Segment ID'])
parts = []
for s in segments:
    # hourly series for one segment, indexed by timestamp
    df = frame[frame['Segment ID'] == s].set_index('Date Time').resample('H').mean()
    comp = sm.tsa.seasonal_decompose(x=df['Travel Time(Minutes)'], period=24*7, two_sided=False)
    df = df.join(comp.trend).join(comp.seasonal).join(comp.resid)
    #optional columns with some statistics to find outliers and trend changes
    df['resid zscore'] = (df['resid'] - df['resid'].mean()).div(df['resid'].std())
    df['trend pct_change'] = df.trend.pct_change()
    df['trend pct_change zscore'] = (df['trend pct_change'] - df['trend pct_change'].mean()).div(df['trend pct_change'].std())
    parts.append(df.dropna())
# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
data = pd.concat(parts)
Where you have lambda x: x.groupby(..., you don't have anything to group; you are telling it to group a row (I believe). You can try a setup like this, perhaps:
Here you define a function to act on the group you are sending via the apply() method. Then you should be able to use your original code.
I have not tested this, but I use this setup quite often to work on groups.
def trend_function(x):
    # do your lambda function here as you are sending each grouping;
    # assign() returns a new frame, so its result has to be returned
    return x.assign(
        trend = lambda x: x["value"].mask(~x["value"].isna(), other =
            seasonal_decompose(x["value"][~x["value"].isna()], model='additive', extrapolate_trend='freq').trend))
dfnew = df.groupby('time_series_id').apply(trend_function)
Use extrapolate_trend='freq' as a parameter. You can then take the trend, seasonal, and residual components from the decomposition result and plot them.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics import tsaplots
import statsmodels.api as sm
date=['2015-08-01','2015-08-02','2015-08-03','2015-08-04','2015-08-01','2015-08-02','2015-08-03','2015-08-04']
time_series_id=[0,0,0,0,1,1,1,1]
value=[0,1,2,3,2,3,4,5]
df=pd.DataFrame({'date':date,'time_series_id':time_series_id,'value':value})
df['date']=pd.to_datetime(df['date'])
df=df.set_index('date')
print(df)
index_day = df.index.day
value_by_day = df.groupby(index_day)['value'].mean()
fig,ax = plt.subplots(figsize=(12,4))
value_by_day.plot(ax=ax)
plt.title('value by day')
plt.show()
df[['value']].boxplot()
plt.show()
fig,ax = plt.subplots(figsize=(12,4))
df[['value']].hist(ax=ax, bins=5)
plt.show()
fig,ax = plt.subplots(figsize=(12,4))
df[['value']].plot(kind='density', ax=ax)
plt.show()
plt.clf()
fig,ax = plt.subplots(figsize=(12,4))
plt.style.use('seaborn-pastel')
fig = tsaplots.plot_acf(df['value'], lags=4,ax=ax)
plt.show()
decomposition=sm.tsa.seasonal_decompose(x=df['value'],model='additive', extrapolate_trend='freq', period=1)
decomposition.plot()
plt.show()
decomposition_trend=decomposition.trend
ax= decomposition_trend.plot(figsize=(14,2))
ax.set_xlabel('Date')
ax.set_ylabel('Trend of time series')
ax.set_title('Trend values of the time series')
plt.show()
I changed the first piece of code according to my scenario.
Here's my code and attached output
data = pd.DataFrame([])
segments = set(subset['Planning_Material'])
for s in segments:
    df = subset[subset['Planning_Material'] == s].set_index('Cal_year_month').resample('M').sum()
    comp = sm.tsa.seasonal_decompose(df)
    df = df.join(comp.trend).join(comp.seasonal).join(comp.resid)
    df['Planning_Material'] = s
    data = pd.concat([data,df])
data = data.reset_index()
data = data[['Planning_Material', 'Cal_year_month', 'Historical_demand', 'trend','seasonal','resid']]
data
I am so sorry that I truly don't know what title I should use. But here is my question
Stocks_Open
d-1 d-2 d-3 d-4
000001.HR 1817.670960 1808.937405 1796.928768 1804.570628
000002.ZH 4867.910878 4652.713598 4652.713598 4634.904168
000004.HD 92.046474 92.209029 89.526880 96.435445
000005.SS 28.822245 28.636893 28.358865 28.729569
000006.SH 192.362963 189.174626 185.986290 187.403328
000007.SH 79.190528 80.515892 81.509916 78.693516
Stocks_Volume
d-1 d-2 d-3 d-4
000001.HR 324234 345345 657546 234234
000002.ZH 4867343 465234 4652598 4634168
000004.HD 9246474 929029 826880 965445
000005.SS 2822245 2836893 2858865 2829569
000006.SH 19262963 1897466 1886290 183328
000007.SH 7190528 803892 809916 7693516
Above are the data I parsed from a database. What I want to do is obtain the correlation of open price and volume over 4 days for each stock (the first column consists of the codes of the different stocks). In other words, I am trying to calculate the correlation of corresponding rows of each DataFrame. (This is only a simplified example; the real data extends to more than 1000 different stocks.)
My attempt is to create a dataframe and run a loop, assigning the results to that dataframe. The problem is that the index of the created dataframe is not what I want, and when I tried to append the correlation column, the bug occurred. (Please ignore the correlation values, which I made up here just to give an example.)
r = pd.DataFrame(index = range(6),columns = ['c'])
for i in range(6):
    r.iloc[i-1,:] = Stocks_Open.iloc[i-1].corr(Stocks_Volume.iloc[i-1])
Correlation_in_4days = pd.concat([Stocks_Open,Stocks_Volume], axis = 1)
Correlation_in_4days['corr'] = r['c']
for i in range(6):
    Correlation_in_4days.iloc[i-1,8] = r.iloc[i-1,:]
r c
1 0.654
2 -0.454
3 0.3321
4 0.2166
5 -0.8772
6 0.3256
The bug occurred.
"ValueError: Incompatible indexer with Series"
I realized that my correlation dataframe's index is integer but not the stock code, but I don't know how to fix it, is there any help?
My ideal result is:
corr
000001.HR 0.654
000002.ZH -0.454
000004.HD 0.3321
000005.SS 0.2166
000006.SH -0.8772
000007.SH 0.3256
Try assigning the index back:
r.index = Stocks_Open.index
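As an alternative to the loop, a row-wise correlation can probably be had in one call with DataFrame.corrwith and axis=1, which keeps the stock codes as the index. This is only a sketch on the first three rows of the sample data; it assumes both frames share the same index and the same column labels.

```python
import pandas as pd

codes = ["000001.HR", "000002.ZH", "000004.HD"]
days = ["d-1", "d-2", "d-3", "d-4"]
Stocks_Open = pd.DataFrame(
    [[1817.670960, 1808.937405, 1796.928768, 1804.570628],
     [4867.910878, 4652.713598, 4652.713598, 4634.904168],
     [92.046474, 92.209029, 89.526880, 96.435445]],
    index=codes, columns=days)
Stocks_Volume = pd.DataFrame(
    [[324234, 345345, 657546, 234234],
     [4867343, 465234, 4652598, 4634168],
     [9246474, 929029, 826880, 965445]],
    index=codes, columns=days)

# axis=1 correlates matching rows of the two frames across the day columns.
corr = Stocks_Open.corrwith(Stocks_Volume, axis=1).rename("corr")
result = corr.to_frame()
```

Because the result is indexed by stock code, it can be joined back onto either frame without any index fix-up.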
I have temperature data stored in a csv file which, when plotted, looks like the image below. How do I find the average during each interval when the temperature goes above 12? The result should be T1, T2, T3, each being the average temperature during one interval when its value is above 12.
Could you please suggest how to achieve this in python?
Highlighted the areas approximately over which I need to calculate the average:
Please find below sample data:
R3,R4
1,11
2,11
3,11
4,11
5,11
6,15.05938512
7,15.12975992
8,15.05850141
18,15.1677708
19,15.00921862
20,15.00686921
21,15.01168888
22,11
23,11
24,11
25,11
26,11
27,15.05938512
28,15.12975992
29,15.05850141
30,15.00328706
31,15.12622611
32,15.01479819
33,15.17778891
34,15.01411488
35,9
36,9
37,9
38,9
39,16.16042435
40,16.00091253
41,16.00419677
42,16.15381827
43,16.0471766
44,16.03725301
45,16.13925003
46,16.00072279
47,11
48,1
In pandas, an idea would be to group the data based on the condition T > 12 and use mean as agg func. Ex:
import pandas as pd
# a dummy df:
df = pd.DataFrame({'T': [11, 13, 13, 10, 14]})
# set the condition
m = df['T'] > 12
# define groups
grouper = (~m).cumsum().where(m)
# ...looks like
# 0 NaN
# 1 1.0
# 2 1.0
# 3 NaN
# 4 2.0
# Name: T, dtype: float64
# now we can easily calculate the mean for each group:
grp_mean = df.groupby(grouper)['T'].mean()
# T
# 1.0    13.0
# 2.0    14.0
# Name: T, dtype: float64
Note: if you have noisy data (T jumps up and down), it might be clever to apply a filter first (savgol, median etc. - whatever is appropriate) so you don't end up with groups caused by the noise.
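To illustrate the note about noise, here is a small sketch that builds the mask from a rolling median instead of the raw values. The data, the window of 3 and the helper name means_above are made up for the example; the right filter and window depend on your signal.

```python
import pandas as pd

# Made-up data: one warm interval with a single dip (11.9) inside it.
df = pd.DataFrame({"T": [11, 11, 13, 13, 11.9, 13, 13, 11, 11]})

def means_above(temps, mask):
    """Mean of temps over each consecutive run where mask is True."""
    grouper = (~mask).cumsum().where(mask)
    return temps.groupby(grouper).mean()

# Mask from the raw values: the dip splits the interval into two groups.
raw_means = means_above(df["T"], df["T"] > 12)

# Mask from a centered rolling median: the dip is smoothed away.
smooth = df["T"].rolling(window=3, center=True, min_periods=1).median()
smooth_means = means_above(df["T"], smooth > 12)
```

Only the mask comes from the smoothed series; the means are still taken over the original values.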
I couldn't find a good pattern for this - here's a clunky bit of code that does what you want, though.
In general, use .shift() to find transition points, and use groupby with transform to get your means.
import pandas as pd

#if you had a csv with Dates and Temps, do this
#tempsDF = pd.read_csv("temps.csv", usecols=["Date","Temp"])
#tempsDF.set_index("Date", inplace=True)
#Using fake data since I don't have your csv
tempsDF = pd.DataFrame({'Temp': [0,13,14,13,8,7,5,0,14,16,16,0,0,0]})
#This is a bit clunky - I bet there's a more elegant way to do it
tempsDF["CumulativeFlag"] = 0
tempsDF.loc[tempsDF["Temp"]>12, "CumulativeFlag"]=1
#rows where the flag rises from 0 to 1 start a new high-temperature group
starts = tempsDF["CumulativeFlag"] > tempsDF["CumulativeFlag"].shift(fill_value=0)
tempsDF.loc[starts, "HighTempGroup"] = list(range(1, int(starts.sum())+1))
tempsDF["HighTempGroup"] = tempsDF["HighTempGroup"].ffill()
tempsDF.loc[tempsDF["Temp"]<=12, "HighTempGroup"]= None
tempsDF["HighTempMean"] = tempsDF.groupby("HighTempGroup")["Temp"].transform("mean")
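For what it's worth, the same shift idea can be written more compactly and run against data shaped like the question's CSV. The sketch below uses a shortened excerpt of the sample rows (columns R3 and R4) read from a string; the T1, T2, T3 labels are just my naming of the result.

```python
import io
import pandas as pd

# Shortened excerpt of the sample data from the question.
csv_text = """R3,R4
5,11
6,15.05938512
7,15.12975992
8,15.05850141
22,11
27,15.05938512
28,15.12975992
35,9
39,16.16042435
40,16.00091253
47,11
48,1
"""
df = pd.read_csv(io.StringIO(csv_text))

m = df["R4"] > 12
# A run starts where the mask is True and the previous row's mask was False.
run_id = (m & ~m.shift(fill_value=False)).cumsum()
# Average only the rows above the threshold, one value per run.
means = df.loc[m, "R4"].groupby(run_id[m]).mean()
means.index = [f"T{i}" for i in means.index]
```

Here means holds one average per interval above 12, in the order the intervals occur.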
I have a small dataframe comprised of two columns, an ORG column and a percentage column. The dataframe is sorted largest to smallest based on the percentage column.
I'd like to create a while loop that adds up the values in the percentage column up until it hits a value of .80 (80%).
So far I've tried:
retail_pareto = 0
counter = 0
while retail_pareto < .80:
    retail_pareto += retailerDF[counter]['RETAILER_PCT_OF_CHANGE']
    counter += 1
This does not work: both the counter and the retail_pareto value remain at zero, with no real error message to help me troubleshoot what I'm doing incorrectly. Ideally, I'd like to end up with a list of the orgs with the largest percentage that together add up to 80%.
I'm not exactly sure what to try next. I've searched these forums, but haven't found anything similar in the forums yet.
Any advice or help is much appreciated. Thank you.
Example Dataframe:
ORG PCT
KST 0.582561
ISL 0.290904
BOV 0.254456
BRH 0.10824
GNT 0.0913631
DSH 0.023441
RDM -0.0119665
JBL -0.0348893
JBD -0.071883
WEG -0.232227
The output that I would expect would be something along the lines of:
ORG PCT
KST 0.582561
ISL 0.290904
Use:
df_filtered = df.loc[df['PCT'].shift(fill_value=0).cumsum().le(0.80),:]
#if you don't want to include the row where cumsum becomes greater than 0.80
#df_filtered = df.loc[df['PCT'].cumsum().le(0.80),:]
print(df_filtered)
ORG PCT
0 KST 0.582561
1 ISL 0.290904
Can you use this example to help you?
import pandas as pd
retail_pareto = 0
orgs = []
for i,row in retailerDF.iterrows():
    if retail_pareto <= .80:
        retail_pareto += row['RETAILER_PCT_OF_CHANGE']
        orgs.append(row)
    else:
        break
new_df = pd.DataFrame(orgs)
Edit: made it more like your example and added the new DataFrame.
Instead of your loop, take a more pandasonic approach.
Start with computing an additional column containing cumulative sum
of RETAILER_PCT_OF_CHANGE:
df['pct_cum'] = df.RETAILER_PCT_OF_CHANGE.cumsum()
For your data, the result is:
ORG RETAILER_PCT_OF_CHANGE pct_cum
0 KST 0.582561 0.582561
1 ISL 0.290904 0.873465
2 BOV 0.254456 1.127921
3 BRH 0.108240 1.236161
4 GNT 0.091363 1.327524
5 DSH 0.023441 1.350965
6 RDM -0.011967 1.338999
7 JBL -0.034889 1.304109
8 JBD -0.071883 1.232226
9 WEG -0.232227 0.999999
And now, to print rows which totally include 80 % of change,
ending on the first row above the limit, run:
df[df.pct_cum.shift(1).fillna(0) < 0.8]
The result, together with the cumulated sum, is:
ORG RETAILER_PCT_OF_CHANGE pct_cum
0 KST 0.582561 0.582561
1 ISL 0.290904 0.873465
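One caveat with a cumulative-sum mask is that negative percentages can pull the running total back under the limit further down the frame. A small sketch that instead cuts at the first crossing is below; the helper name pareto_head is hypothetical, and the data is the example frame from the question.

```python
import pandas as pd

def pareto_head(frame, col, threshold=0.80):
    """Rows from the top up to and including the first one whose
    running total reaches the threshold."""
    reached = frame[col].cumsum().ge(threshold).to_numpy()
    if not reached.any():
        return frame  # threshold never reached: keep everything
    # argmax gives the position of the first True in the boolean array.
    return frame.iloc[: reached.argmax() + 1]

retailerDF = pd.DataFrame({
    "ORG": ["KST", "ISL", "BOV", "BRH", "GNT", "DSH", "RDM", "JBL", "JBD", "WEG"],
    "RETAILER_PCT_OF_CHANGE": [0.582561, 0.290904, 0.254456, 0.10824, 0.0913631,
                               0.023441, -0.0119665, -0.0348893, -0.071883, -0.232227],
})

top = pareto_head(retailerDF, "RETAILER_PCT_OF_CHANGE")
orgs = top["ORG"].tolist()
```

This assumes the frame is already sorted from largest to smallest, as described in the question.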
I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute force attempt:
import pandas as ps
import math
import numpy as np
person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(list(zip(person1,person2,person3))).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time]=0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0
for i in range(3):
    starti=starti+2
    print(starti)
    endi=endi+2
    for time in uniqueTimes:
        def helper(row):
            start=row[starti]
            end=row[endi]
            track=row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status=row[8]
            track=row[7]
            if track<=status:
                return status
            else:
                return track
        def Multiplier(row):
            x=row[8]
            if x==0:
                return 0.0*row[0]
            if x==1:
                return 5.0*row[0]
            if x==2:
                return 10.0*row[0]
            if x==-1:#numeric place holder for non-contributing
                return 0.0*row[0]
        allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
        allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
        stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
    for k,v in stateData.items():
        comboStates[k]=comboStates.get(k,0)+v
print(allPeopleDf)
print(stateData)
print(comboStates)
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print(sorted(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())))
but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many stop and start times without state change) at the same time, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO link that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit
Going for the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10,'A',5],
[10,'B',7],
[13,'C',10],
[15,'A',15],
[20,'A',7],
[23,'C',0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop(columns='key')
time intensity
0 10 5
3 15 15
4 20 7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time")["increment"].sum().cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
EDIT: applying the specific data presented in question
Assuming the data comes as a list of values, starting with the element id (person/piano key), then a factor multiplying the measured weight/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried/intensity being emitted). Not sure if I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up to load the data includes this function:
import pandas as pd
def read_data(data, states, columns):
    elem_id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in range(len(states)):
        # start/end times of state i sit at positions 2+2*i and 3+2*i
        j = 2 + 2*i
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], elem_id, factor*states[i]])
            reshaped_data.append([data[j+1], elem_id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
# DataFrame.append was removed in pandas 2.0, so build the frames and concatenate them
df = pd.concat([read_data(d, known_states, DF_columns) for d in (data1, data2, data3)],
               ignore_index=True)
# and so on...
And then you're right at the beginning of this answer (substituting 'key' by 'id' and the ids, of course)
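As a compact illustration of the whole idea, here is a sketch that goes straight from intervals to the summed step function by storing a signed increment at each start and end time. The intervals list is invented for the example and is not the data from the question.

```python
import pandas as pd

# Hypothetical intervals: (start, end, weight held during the interval).
intervals = [(0, 10, 5), (5, 15, 10), (10, 20, 5)]

# Each interval contributes +weight at its start and -weight at its end.
events = []
for start, end, weight in intervals:
    events.append((start, weight))
    events.append((end, -weight))
events = pd.DataFrame(events, columns=["time", "increment"])

# Sum simultaneous changes, then accumulate to get the total at each time.
total = events.groupby("time")["increment"].sum().cumsum()
```

Each value of total holds from its time until the next one, so something like total.plot(drawstyle="steps-post") should draw it as a step line if matplotlib is available.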
Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32