I have a simple pandas DataFrame:
       V1
Index
1       5
2       6
3       7
4       8
5       9
6      10
I want to fit an ARMA model from statsmodels. When I try to do it, I get the following:
ValueError: Given a pandas object and the index does not contain dates
I guess it means that the index is not set as dates. How can I transform the index to dates? I consider the current index to be days, so in the above example the DataFrame runs for 6 days. How can I make pandas/statsmodels understand that these are dates with daily frequency? Thank you very much for your help.
You could set the index to be daily, ending today (pd.DatetimeIndex no longer accepts end/periods/freq in recent pandas, so use pd.date_range):
df.index = pd.date_range(end=pd.Timestamp.today(), periods=len(df), freq='D')
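With the index set, the fit should go through. A minimal sketch of the fitting step (note: the ARMA class was removed from recent statsmodels releases; ARIMA with d=0 is the equivalent, and the order (1, 0, 1) below is purely illustrative):

from statsmodels.tsa.arima.model import ARIMA

# ARMA(1, 1) expressed as ARIMA(1, 0, 1): no differencing (d=0)
result = ARIMA(df['V1'], order=(1, 0, 1)).fit()
print(result.summary())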
I'm sure that this question on its own is not very clear and could mean a lot of things, so I'll try to explain it with an example.
My goal is to delete rows in a DataFrame like the following one if the row can't be part of a run of consecutive days at least as long as a given time period t. If t is 3, for example, then the last row needs to be deleted, because there is a gap between the last row and the one before it. If t were 4, then the first three rows would also have to be deleted, since 03.04.2012 and 07.04.2012 are missing. Hopefully you can understand what I am trying to explain here.
Date        Value
04.04.2012  24
05.04.2012  21
06.04.2012  20
08.04.2012  21
09.04.2012  23
10.04.2012  21
11.04.2012  26
13.04.2012  24
My attempt was to iterate over the values in the 'Date' column and, for every element x, check whether the date at position x minus the date at position x + t equals -t days. If this is not the case, the whole row of that element should be deleted. But while searching for how to iterate over a DataFrame, I read several times that doing so is not recommended, because it needs a lot of computing time for big DataFrames. Unfortunately, I couldn't find any other method or function that could do this. I would therefore be really glad if someone could help me out here. Thank you! :)
With the dates as the index you can expand the index of the DataFrame to include the missing days. The new dates will create NaN values. Create a group for every gap with .isna().cumsum(), count the size of each group, and finally select the rows whose count is greater than or equal to the desired time period.
period = 3
df.set_index('Date', inplace=True)
df[df.groupby(df.reindex(pd.date_range(df.index.min(), df.index.max()))
                .Value.isna().cumsum())
     .transform('count').ge(period).Value].reset_index()
Output
Date Value
0 2012-04-04 24
1 2012-04-05 21
2 2012-04-06 20
3 2012-04-08 21
4 2012-04-09 23
5 2012-04-10 21
6 2012-04-11 26
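For readability, the same logic can be unpacked into steps (a sketch, equivalent to the one-liner above):

# Expand to a complete daily index; the missing days become NaN
full = df.reindex(pd.date_range(df.index.min(), df.index.max()))
# Every gap starts a new group label
groups = full.Value.isna().cumsum()
# Size of the consecutive run each row belongs to (aligned on df's dates)
run_length = df.groupby(groups).Value.transform('count')
df[run_length.ge(period)].reset_index()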
To create the DataFrame used in this solution:
t = '''
Date Value
04.04.2012 24
05.04.2012 21
06.04.2012 20
08.04.2012 21
09.04.2012 23
10.04.2012 21
11.04.2012 26
13.04.2012 24
'''
import io
import pandas as pd
from datetime import datetime

df = pd.read_csv(io.StringIO(t), sep=r'\s+', parse_dates=['Date'],
                 date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y'))
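Since date_parser is deprecated in recent pandas, an equivalent two-step import (a sketch, assuming pandas 2.x):

import io
import pandas as pd

df = pd.read_csv(io.StringIO(t), sep=r'\s+')
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')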
Data import from csv:

Date        Item_1  Item_2
1990-01-01  34      78
1990-01-02  42      19
...
2020-12-31  41      23
df = pd.read_csv(r'Insert file directory')
df.index = pd.to_datetime(df.index)
gb = df.groupby([df.index.year, df.index.month]).mean()
Issue:
So basically, the requirement is to group the data by year and month before processing, and I thought that the groupby function would group the data so that mean() calculates the averages of all values grouped under Jan-1990, Feb-1990, and so on. However, I was wrong: the output results in the average of all values under Item_1.
My example is similar to the post below, but in my case it is calculating the mean. I am guessing that it has to do with the way the data is arranged after the groupby, or that some parameters in mean() have to be specified, but I have no idea which is the cause. Can someone enlighten me on how to correct the code?
Pandas groupby month and year
Update:
Hi all, I have created the sample .csv data file with 3 items and 3 months of data. I am wondering if the cause has to do with the conversion of the data into the DataFrame when it is imported from the .csv, because I have noticed some weird time data in the leftmost column, as shown below:
Link to sample file is:
https://www.mediafire.com/file/t81wh3zem6vf4c2/test.csv/file
import pandas as pd

df = pd.read_csv('test.csv', index_col='date')
df.index = pd.to_datetime(df.index)
df.groupby([df.index.year, df.index.month]).mean()

Seems to do the trick with the provided data.
IIUC, you want to calculate the mean of all elements. You can use numpy's mean function, which operates on the flattened array by default:

import numpy as np

df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
gb = df.groupby([df.index.year, df.index.month]).apply(lambda d: np.mean(d.values))
output:
date date
1990 1 0.563678
2 0.489105
3 0.459131
4 0.755165
5 0.424466
6 0.523857
7 0.612977
8 0.396031
9 0.452538
10 0.527063
11 0.397951
12 0.600371
dtype: float64
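An equivalent approach is to stack the item columns into a single Series first, after which a plain groupby mean averages every element (a sketch using the same datetime index):

# One row per (date, item) pair, so mean() covers all items at once
stacked = df.stack()
dates = stacked.index.get_level_values(0)
gb = stacked.groupby([dates.year, dates.month]).mean()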
Hello StackOverflow Community,
I have been interested in calculating anomalies for data in pandas 1.2.0 using Python 3.9.1 and NumPy 1.19.5, but have been struggling to figure out the most "Pythonic" and "pandas" way to complete this task (or any way, for that matter). Below I have created some dummy data and put it into a pandas DataFrame. In addition, I have tried to clearly outline my methodology for calculating monthly anomalies for the dummy data.
What I am trying to do is take "n" years of monthly values (in this example, 2 years of monthly data = 25 months) and calculate monthly averages across all years (for example, group all the January values together and calculate their mean). I have been able to do this using pandas.
Next, I would like to take each monthly average and subtract it from all elements in my DataFrame that fall into that specific month (for example, subtract the overall January mean from each January value). In the code below you will see some lines that attempt to do this subtraction, but to no avail.
If anyone has any thought or tips on what may be a good way to approach this, I really appreciate your insight. If you require further clarification, let me know. Thanks for your time and thoughts.
-Marian
#Import packages
import numpy as np
import pandas as pd
#-------------------------------------------------------------
#Create a pandas dataframe with some data that will represent:
#Column of dates for two years, at monthly resolution
#Column of corresponding values for each date.
#Create two years worth of monthly dates
dates = pd.date_range(start='2018-01-01', end='2020-01-01', freq='MS')
#Create some random data that will act as our data that we want to compute the anomalies of
values = np.random.randint(0,100,size=25)
#Put our dates and values into a dataframe to demonstrate how we have tried to calculate our anomalies
df = pd.DataFrame({'Dates': dates, 'Values': values})
#-------------------------------------------------------------
#Anomalies will be computed by finding the mean value of each month over all years
#And then subtracting each month's mean from every element that falls in that particular month
#Group our df according to the month of each entry and calculate monthly mean for each month
monthly_means = df.groupby(df['Dates'].dt.month).mean()
#-------------------------------------------------------------
#Now, how do we go about subtracting these grouped monthly means from each element that falls
#in the corresponding month.
#For example, if the monthly mean over 2 years for January is 20 and the value is 21 in January 2018, the anomaly would be +1 for January 2018
#Example lines of code I have tried, but have not worked
#ValueError:Unable to coerce to Series, length must be 1: given 12
#anomalies = socal_csv.groupby(socal_csv['Date'].dt.month) - monthly_means
#TypeError: unhashable type: "list"
#anomalies = socal_csv.groupby(socal_csv['Date'].dt.month).transform([np.subtract])
You can use pd.merge like this:

import numpy as np
import pandas as pd

dates = pd.date_range(start='2018-01-01', end='2020-01-01', freq='MS')
values = np.random.randint(0, 100, size=25)
df = pd.DataFrame({'Dates': dates, 'Values': values})

monthly_means = (df.groupby(df['Dates'].dt.month)['Values'].mean()
                   .reset_index()
                   .rename(columns={'Dates': 'month', 'Values': 'Mean'}))
df['month'] = df['Dates'].dt.month
df = df.merge(monthly_means, on='month', how='left')
df['Diff'] = df['Mean'] - df['Values']
output:
df['Diff']
Out[19]:
0 33.333333
1 19.500000
2 -29.500000
3 -22.500000
4 -24.000000
5 -3.000000
6 10.000000
7 2.500000
8 14.500000
9 -17.500000
10 44.000000
11 31.000000
12 -11.666667
13 -19.500000
14 29.500000
15 22.500000
16 24.000000
17 3.000000
18 -10.000000
19 -2.500000
20 -14.500000
21 17.500000
22 -44.000000
23 -31.000000
24 -21.666667
You can use abs() if you want the absolute difference.
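For example, to keep the absolute anomaly as its own column:

df['AbsDiff'] = df['Diff'].abs()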
A one-line solution is:
df = pd.DataFrame({'Values': values}, index=dates)
df.groupby(df.index.month).transform(lambda x: x - x.mean())
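To keep the anomalies alongside the original data, the same transform can be assigned to a new column (a sketch on the Dates/Values frame from the question; the column name Anomaly is illustrative):

df['Anomaly'] = df.groupby(df['Dates'].dt.month)['Values'].transform(lambda x: x - x.mean())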
I have a series of transactions similar to this table:
ID  Customer  Date       Amount
1   A         6/12/2018  33,223.00
2   A         9/20/2018  635.00
3   B         8/3/2018   8,643.00
4   B         8/30/2018  1,231.00
5   C         5/29/2018  7,522.00
However, I need to get the mean amount of the last six months (as of today).
I was using
df.groupby('Customer').resample('W')['Amount'].sum()
And get something like this:
CustomerCode PayDate
A 2018-05-21 268
2018-05-28 0.00
2018-06-11 0.00
2018-06-18 472,657
2018-06-25 0.00
However, with this solution I only get the range of dates where the customers had an amount. I need to extend the weeks for each customer so I can get the whole range of the six months (in weeks). In this example, for customer A I would need the weeks from '2018-04-05' (exactly six months ago from today) up to the week of today (filled with 0, of course, since there was no amount).
Here is the solution I found to my question. First I create the dates I want (the last six months, at weekly frequency):
import datetime
import pandas as pd

dates = pd.date_range(datetime.date.today() - datetime.timedelta(days=6*365/12),
                      datetime.date.today(),
                      freq='W')
Then I create a MultiIndex using the product of the customers with the dates.
multi_index = pd.MultiIndex.from_product([pd.Index(df['Customer'].unique()),
                                          dates],
                                         names=('Customer', 'Date'))
Then I reindex the df using the newly created MultiIndex and, lastly, fill the missing values with zeroes.
df = df.reindex(multi_index)
df = df.fillna(0)
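The last two steps can also be collapsed, since reindex accepts a fill value directly (a sketch, assuming df is the weekly result of the earlier groupby/resample):

df = df.reindex(multi_index, fill_value=0)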
Resample is super flexible. To get a 6-month sum instead of the weekly sum you currently have, all you need is:
df.groupby('Customer').resample('6M')['Amount'].sum()
That groups by month end; month start would be '6MS'.
More documentation on available frequencies can be found here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
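If instead you want a single window covering exactly the last six months as of today, rather than calendar six-month buckets, one sketch is to filter before grouping (assumes the Date column has already been parsed with pd.to_datetime):

import pandas as pd

# Keep only transactions from the trailing six months, then average per customer
cutoff = pd.Timestamp.today() - pd.DateOffset(months=6)
df[df['Date'] >= cutoff].groupby('Customer')['Amount'].mean()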
I am a Java developer finding it a bit tricky to switch to Python and pandas. I'm trying to iterate over the dates of a pandas DataFrame, which looks like the one below:
sender_user_id created
0 1 2016-12-19 07:36:07.816676
1 33 2016-12-19 07:56:07.816676
2 1 2016-12-19 08:14:07.816676
3 15 2016-12-19 08:34:07.816676
What I am trying to get is a DataFrame that gives me a count of the total number of transactions that have occurred per week. From the forums I have only been able to find syntax for 'for' loops that iterate over indexes. Basically I need a result DataFrame that looks like this, where the value field contains the count of sender_user_id and the date shows the starting date of each week:
date value
0 2016-12-09 20
1 2016-12-16 36
2 2016-12-23 56
3 2016-12-30 32
Thanks in advance for the help.
I think you need to resample by week and aggregate size:
#cast to datetime if necessary
df.created = pd.to_datetime(df.created)
print(df.resample('W', on='created').size().reset_index(name='value'))
created value
0 2016-12-25 4
If you need other offsets:
df.created = pd.to_datetime(df.created)
print(df.resample('W-FRI', on='created').size().reset_index(name='value'))
created value
0 2016-12-23 4
If you need the number of unique values per week, aggregate with nunique:
df.created = pd.to_datetime(df.created)
print(df.resample('W-FRI', on='created')['sender_user_id'].nunique()
      .reset_index(name='value'))
created value
0 2016-12-23 3
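The same counts can also be produced without resample by using pd.Grouper (an equivalent sketch):

df.created = pd.to_datetime(df.created)
print(df.groupby(pd.Grouper(key='created', freq='W-FRI')).size().reset_index(name='value'))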