I have two time series stored in data frames london and scotland, of the same length and with the same columns. One column is Date, which spans 2009 to 2019 at a daily frequency; the other is Yearly_cost. They look like this:
Date Yearly_cost
0 2009-01-01 230
1 2009-01-02 460
2 2009-01-03 260
3 2009-01-04 250
4 2009-01-05 320
5 2009-01-06 430
I wish to compare the Euclidean distance between only the seasonality components of Yearly_cost in the two time series. I have decomposed them using seasonal_decompose() from statsmodels; however, I wish to take only the seasonality component from the resulting object:
result = <statsmodels.tsa.seasonal.DecomposeResult at 0x2b5d7d2add8>
Is it possible to extract this and turn it into a time series in a new_df?
Any help would be appreciated. Thanks
I have worked this out. To obtain the seasonal component, you simply use
new_df = result.seasonal
This gives you only the seasonal result.
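For the Euclidean-distance comparison the question asks about, a minimal sketch follows. The two synthetic Series stand in for result_london.seasonal and result_scotland.seasonal (both names are hypothetical; seasonal_decompose() returns the seasonal component as a pandas Series, so the real objects would slot in the same way):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two seasonal components; the real ones would
# come from result_london.seasonal and result_scotland.seasonal.
idx = pd.date_range('2009-01-01', periods=6, freq='D')
seasonal_london = pd.Series([1.0, -1.0, 2.0, -2.0, 0.5, -0.5], index=idx)
seasonal_scotland = pd.Series([0.5, -0.5, 1.0, -1.0, 0.0, 0.0], index=idx)

# Euclidean distance between the two seasonal components (they must share
# the same length and date index for an element-wise comparison).
distance = np.linalg.norm(seasonal_london.values - seasonal_scotland.values)
```

Since both frames cover the same daily range, the element-wise subtraction lines up one-to-one.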
So I want to be able to know how many clusters there are in a time series frequency table.
The input would be a date index with the frequency, i.e. the kind of output you would get when using .resample('D').sum().
Input Example:

Index       Count
01-01-2022  3
02-01-2022  4
03-01-2022  2
04-01-2022  2
05-01-2022  2
...         ...
27-01-2022  5
28-01-2022  4
29-01-2022  2
30-01-2022  3
31-01-2022  2
Assume the dates not shown (... on table) are all frequency 0.
So essentially there are two clusters in the month of January 2022. The first cluster is at the beginning of the month and the second cluster is at the end of the month.
Cluster 1 is between date range 01-01-2022 and 05-01-2022.
Cluster 2 is between date range 27-01-2022 and 31-01-2022.
Do you know which clustering algorithm would allow me to get the # of clusters with this type of data?
or is a clustering algorithm even necessary?
Thank you for your help
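To the "is a clustering algorithm even necessary?" part: arguably not. Since the dates not shown all have frequency 0, a cluster is just a consecutive run of non-zero days, and those runs can be counted directly. A sketch on a reconstruction of the January table above (the series values are copied from the question; everything else is made up for illustration):

```python
import pandas as pd

# Reconstruct the January 2022 frequency table; days not listed in the
# question are filled with 0, matching its assumption.
idx = pd.date_range('2022-01-01', '2022-01-31', freq='D')
counts = pd.Series(0, index=idx)
counts['2022-01-01':'2022-01-05'] = [3, 4, 2, 2, 2]
counts['2022-01-27':'2022-01-31'] = [5, 4, 2, 3, 2]

# A cluster is a consecutive run of non-zero days, so count the
# transitions from zero to non-zero (i.e. the starts of runs).
active = counts > 0
n_clusters = (active & ~active.shift(1, fill_value=False)).sum()
print(n_clusters)  # 2
```

This also generalizes easily, e.g. allowing short gaps inside a cluster by smoothing `active` with a rolling window first.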
I'm new in Python and hope you guys can help me with the following:
I have a data frame that contains the daily demand of a certain product. However, the demand is shown cumulative over time. I want to create a column that shows the actual daily demand (see table below).
Current Data frame:

Day#  Cumulative Demand
1     10
2     15
3     38
4     44
5     53
What I want to achieve:

Day#  Cumulative Demand  Daily Demand
1     10                 10
2     15                 5
3     38                 23
4     44                 6
5     53                 9
Thank you!
Firstly, we need the data of the old column
# My Dataframe is called df
demand = df["Cumulative Demand"].tolist()
Then recalculate the data
daily_demand = [demand[0]]
for i, d in enumerate(demand[1:]):
    daily_demand.append(d - demand[i])
Lastly append the data to a new column
df["Daily Demand"] = daily_demand
Assuming what you shared above is representative of your actual data, meaning you have one row per day and the Day column is sorted in ascending order:
You can use shift() (please read what it does) and subtract the shifted version of the cumulative demand from the cumulative demand itself. This gives you back the actual daily demand.
To make sure it works, check whether the cumulative sum of the new Daily Demand column reproduces the Cumulative Demand column, using cumsum().
import pandas as pd
# Calculate your Daily Demand column
df['Daily Demand'] = (df['Cumulative Demand'] - df['Cumulative Demand'].shift()).fillna(df['Cumulative Demand'][0])
# Check whether the cumulative sum of daily demands sum up to the Cumulative Demand
>>> all(df['Daily Demand'].cumsum() == df['Cumulative Demand'])
True
Will print back:
Day Cumulative Demand Daily Demand
0 1 10 10.0
1 2 15 5.0
2 3 38 23.0
3 4 44 6.0
4 5 53 9.0
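As a side note, the shift-and-subtract step is exactly what diff() does in one call. A small sketch on the question's numbers:

```python
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({'Day#': [1, 2, 3, 4, 5],
                   'Cumulative Demand': [10, 15, 38, 44, 53]})

# diff() subtracts each row from the previous one; the first row has no
# predecessor, so fill it with the first cumulative value itself.
df['Daily Demand'] = (df['Cumulative Demand'].diff()
                      .fillna(df['Cumulative Demand'].iloc[0]))
print(df['Daily Demand'].tolist())  # [10.0, 5.0, 23.0, 6.0, 9.0]
```

The same cumsum() check applies: `df['Daily Demand'].cumsum()` should equal the Cumulative Demand column.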
I have a large PySpark DataFrame in a similar structure to this:
city    store  machine_id  numeric_value  time
London  A      1           x              01/01/2021 14:15:00
London  A      2           y              01/01/2021 14:17:00
NY      B      9           z              01/01/2021 16:12:00
London  A      1           w              01/01/2021 14:20:00
London  A      2           q              01/01/2021 14:24:00
...     ...    ...         ...            ...
I would like to split the data into time windows (of 10 minutes, for example) and calculate some statistics (mean, variance, number of distinct values, and other custom functions) per machine_id, and output a histogram of this statistic per combination of city and store. For example, for each city-store combination, a histogram of the variance of numeric_value in a time window of 10 minutes.
So far I used groupby to get the data grouped by the columns I need -
interval_window = pyspark.sql.functions.window("time", '10 minutes')
grouped_df = df.groupBy('city', 'store', 'machine_id', interval_window)
From here I applied some pyspark.sql.functions (like var, mean, ...) using agg, but I would like to know how to apply a custom function on a GroupedData object, and how I can output a histogram of the results per city and store. I don't think I can convert it into a pandas DataFrame, as this DataFrame is very large and won't fit on the master.
I'm a beginner in Spark, so if I'm not using the correct objects/functions please let me know.
Thanks!
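One route for the custom-function part, as a hedged sketch: write the per-group statistic as a plain pandas function, then hand it to GroupedData.applyInPandas (available in Spark 3.x), which processes each group without collecting the full DataFrame on the driver. The Spark calls in the comments below are untested here and reuse the question's df and interval_window:

```python
import pandas as pd

# Per-group custom statistic, written as an ordinary pandas function.
# It receives one group as a pandas DataFrame and must return a DataFrame
# matching the output schema declared in the applyInPandas call.
def custom_stats(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        'city': [pdf['city'].iloc[0]],
        'store': [pdf['store'].iloc[0]],
        'machine_id': [pdf['machine_id'].iloc[0]],
        'variance': [pdf['numeric_value'].var()],
        'n_distinct': [pdf['numeric_value'].nunique()],
    })

# Untested Spark 3.x sketch, reusing the question's setup:
#
#   stats = (df.groupBy('city', 'store', 'machine_id', interval_window)
#              .applyInPandas(custom_stats,
#                             'city string, store string, machine_id long, '
#                             'variance double, n_distinct long'))
#
# A histogram per city/store can then be built inside Spark by bucketing the
# statistic and counting, so only small bucket counts reach the driver:
#
#   import pyspark.sql.functions as F
#   hist = (stats.withColumn('bucket', F.floor(F.col('variance') / 10) * 10)
#                .groupBy('city', 'store', 'bucket').count())
```

The bucket width (10 here) is an arbitrary illustration value; pick one that suits the scale of your statistic.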
I have Excel data with the following information,
df.head()
User_id Group Week Spend Purchases Group
170309867 Test 2014-10-13 794.66 2 Test-NonRed
57954586 Test 2014-10-13 55.99 1 Test-Red
118068583 Test 2014-10-13 40.87 1 Test-NonRed
516478393 Test 2014-10-13 17.5 2 Test-NonRed
457873235 Test 2014-10-13 20.44 1 Test-Red
From the above information, I need to calculate the incremental spend in the six weeks for the total Test group (test-Red and test-NonRed) vs. control. I need it in absolute ($) and % terms.
I have tried pandas as,
df2= df.groupby(by=['Group','Week']).sum().abs().groupby(level=[0]).cumsum()
And I have the following result,
df2.head()
And then I calculated the sum for each group as,
df2.groupby(by=['group2']).sum()
df2.head()
I would like to have the incremental spend as an absolute value, which I tried with abs(), and I also need it as an absolute percentage. Any help would be much appreciated.
The expected result is the incremental spend over the six weeks for the total Test group (Test-Red and Test-NonRed) vs. Control. I need it in absolute spend and then as a percentage. Something like this,
Group incremental_spend incremental_%
Control 11450175 #%
test-NonRed 50288158 #%
test-Red 12043938 #%
So my real questions,
1. Whether the above-mentioned approach is the right way to calculate incremental spend for Column Group in 6 Weeks from column Week on Spend?
2. Also, I need all my results in Absolute counts and Absolute %
I think there are several problems here which make your question difficult to answer.
Vocabulary
What you describe as "Incremental spend" is just the sum.
What you do in two steps is the sum of the cumulative sum .cumsum().sum(), which is not right.
Also I am not sure whether you need abs, which gives the absolute value (abs(-1) gives 1) and will thus only have an effect if there are negative values in your data.
Unfortunately the sample dataset is not large enough to get a conclusion.
Dataset
Your dataset has two columns Group with identical names, which is error prone.
Missing information
You want to get final values (sums) as a ratio (%), but you do not indicate what is the reference value for this ratio.
Is it the sum of Spend for the control group?
Potential solution
>>> df # Sample dataframe with one entry as 'Control' group
Out[]:
User_id Group Week Spend Purchases Group.1
0 170309867 Test 2014-10-13 794.66 2 Test-NonRed
1 57954586 Test 2014-10-13 55.99 1 Test-Red
2 118068583 Test 2014-10-13 40.87 1 Test-NonRed
3 516478393 Test 2014-10-13 17.50 2 Control
4 457873235 Test 2014-10-13 20.44 1 Test-Red
df2 = pd.DataFrame(df.groupby('Group.1').Spend.sum()) # Get 'Spend' sum for each group
>>> df2
Out[]:
Spend
Group.1
Control 17.50
Test-NonRed 835.53
Test-Red 76.43
control_spend_total = df2.loc['Control'].values # Get total spend for 'Control' group
>>> control_spend_total
Out[]: array([ 17.5])
df2['Spend_%'] = df2.Spend / control_spend_total * 100 # Add 'Spend_%' column
>>> df2
Out[]:
Spend Spend_%
Group.1
Control 17.50 100.000000
Test-NonRed 835.53 4774.457143
Test-Red 76.43 436.742857
Does it look like what you want?
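On the duplicate-column point above, a small sketch of one way to remove the ambiguity (the name Subgroup is made up for illustration; pandas will already have renamed the second occurrence to Group.1 on read):

```python
import pandas as pd

# Miniature frame with the duplicated Group column, as pandas reads it
# (the second occurrence arrives as 'Group.1').
df = pd.DataFrame({'User_id': [170309867, 57954586],
                   'Group': ['Test', 'Test'],
                   'Group.1': ['Test-NonRed', 'Test-Red']})

# Rename the subgroup column once, so later groupbys are unambiguous.
df = df.rename(columns={'Group.1': 'Subgroup'})
```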
Morning. Recently I have been trying to use pandas to create large data tables for machine learning (I'm trying to move away from numpy as best I can).
However, I'm running into some issues, namely with slicing pandas data frames.
I'd like to return the rows I specify, and reference and compare particular elements with those in other arrays. Here's a small amount of code I've implemented, with some outline:
import pandas as pd
import csv
import math
import random as nd
import numpy
#create the pandas dataframe from my csv. The Csv is entirely numerical data
#with exception of the first row vector which has column labels
df=pd.read_csv(r"C:\Users\Python\Downloads\Data for Brent - Secondattampatatdfrandomsample.csv")
#I use panda functionality to return a random sample of the data (a subset
#of the array)
df_sample=pd.DataFrame.sample(df,10)
It's at this point that I want to compare the first element along each row vector to the original data. Specifically, the first element in any row contains an id number.
If the elements of the original data frame and the sample frame match up, I'd like to compute 3- and 6-month averages of the associated column elements with matching id numbers.
I want to disclaim that I'm comfy moving back to numpy and away from pandas, but there are training model methods in pandas I hear a ton of good things about (my training is on the mathematics side of things and less so program development). Thanks for the input!
edit: here is the sample input for the first 12 row vectors in the dataframe (id, year, month, x, y, z):
id year month x y z
0 2 2016 2 1130 343.627538 163660.060200
1 2 2016 4 859 913.314513 360633.159400
2 2 2016 5 931 858.548056 93608.190030
3 2 2016 6 489 548.314860 39925.669950
4 2 2016 7 537 684.441725 80270.240060
5 2 2016 8 618 673.887072 124041.560000
6 2 2016 9 1030 644.749493 88975.429980
7 2 2016 10 1001 543.312870 54874.599830
8 2 2016 11 1194 689.053707 79930.230000
9 2 2016 12 673 483.644736 27567.749940
10 2 2017 1 912 657.716386 54590.460070
11 2 2017 2 671 682.007537 52514.580380
Here is how the sample data is returned (same n-tuple of columns as before). I used native pandas functions to return a randomly generated subset of 10 row vectors out of almost 9000 entries:
2 2016 1 633 877.9282175 75890.97027
5185 2774 2016 4 184 399.418719 9974.375000
9441 4974 2017 2 239 135.520851 0.000000
5134 2745 2017 2 187 217.220657 7711.333333
8561 4063 2017 1 103 505.714286 18880.000000
3328 2033 2016 11 118 452.152542 7622.000000
3503 2157 2016 3 287 446.668831 8092.588235
5228 2791 2016 2 243 400.166008 12655.250000
9380 4708 2017 2 210 402.690583 5282.352941
1631 1178 2016 10 56 563.716667 16911.500000
2700 1766 2016 1 97 486.764151 6449.625000
I'd like to identify the appropriate positions in the sample array, search for identical elements in the original array, and compute averages (and eventually more rigorous statistical models) over their associated numerical data.
sample_ids = df_sample['id'].unique()
df[df['id'].isin(sample_ids)].groupby('id').mean()[['x', 'y', 'z']].reset_index()
I'm not sure if this is exactly what you want but I'll walk through it to see if it gives you ideas. For each unique id in the sample (I did it for all of them, implement whatever check you like), I grouped the original dataframe by that id (all rows with id == 2 are smushed together) and took the mean of the resulting pandas.GroupBy object as required (which averages the smushed together rows, for each column not in the groupby call). Since this averages your month and year as well, and all I think I care about is x, y, and z, I selected those columns, and then for aesthetic purposes reset the index.
Alternatively, if you wanted the average for that id for each year in the original df, you could do
df.groupby(['id', 'year']).mean()[['x', 'y', 'z']].reset_index()
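A runnable sketch of that per-year variant, on a made-up miniature of the (id, year, month, x, y, z) frame:

```python
import pandas as pd

# Made-up miniature of the frame from the question.
df = pd.DataFrame({
    'id':    [2, 2, 2, 3],
    'year':  [2016, 2016, 2017, 2017],
    'month': [2, 4, 1, 2],
    'x':     [1130, 859, 912, 671],
    'y':     [343.6, 913.3, 657.7, 682.0],
    'z':     [163660.1, 360633.2, 54590.5, 52514.6],
})

# Average x, y, z per (id, year); selecting the columns keeps 'month'
# out of the averaged output.
per_year = df.groupby(['id', 'year'])[['x', 'y', 'z']].mean().reset_index()
print(per_year)
```

Here id 2 gets one averaged row per year it appears in, and id 3 gets one, which is the per-year behavior described above.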