I have a data frame:
date value
0 2017-11-30 13:58:57 901
1 2017-11-30 13:59:41 905
2 2017-11-30 13:59:41 925
That was generated by:
import pandas as pd
df = pd.DataFrame({'date': ['2017-11-30 13:58:57', '2017-11-30 13:59:41', '2017-11-30 13:59:41'],
                   'value': [901, 905, 925]})
df['date'] = pd.to_datetime(df['date'])
I want to calculate the percentage change between two consecutive rows, but when I use:
df.pct_change()
I get the error:
ufunc true_divide cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]')
How do I make it ignore the date column?
Here's a solution with select_dtypes that should generalise to any dataframe by ignoring non-numeric columns:
df.select_dtypes(include=['number']).pct_change()
value
0 NaN
1 0.004440
2 0.022099
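If you also want to keep the date column next to the result, one option (a sketch; the '_pct' suffix is just an arbitrary label) is to join it back:
pct = df.select_dtypes(include=['number']).pct_change()
out = df[['date']].join(pct.add_suffix('_pct'))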
I'd try specifying the value column:
df['pctcng'] = df['value'].pct_change()
Related
I have a time series dataframe with 1 min frequency. I need to drop any day which has one or more nan values.
For example in the following df, days 2012-10-15 and 2012-10-25 need to be dropped.
import pandas as pd
import numpy as np
index=pd.date_range(start='2012-10-15', end='2012-10-25', freq='1Min')
df=pd.DataFrame(range(len(index)), index=index, columns=['Number'])
df.iloc[1]=np.nan
df.iloc[-2]=np.nan
print(df)
You can use isna to check for NaN, then group by the date extracted with df.index.normalize() and use transform('any'):
mask = df['Number'].isna().groupby(df.index.normalize()).transform('any')
df[~mask]
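Putting it together with the example above (a minimal sketch; note numpy must be imported for np.nan):
import numpy as np
import pandas as pd

index = pd.date_range(start='2012-10-15', end='2012-10-25', freq='1Min')
df = pd.DataFrame(range(len(index)), index=index, columns=['Number'])
df.iloc[1] = np.nan
df.iloc[-2] = np.nan

# flag every row belonging to a day that contains at least one NaN
mask = df['Number'].isna().groupby(df.index.normalize()).transform('any')
print(df[~mask].index.normalize().unique())  # the days containing NaNs are gone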
id vi dates f_id
0 5532714 0.549501 2015-07-07 ff_22
1 5532715 0.540969 2015-07-08 ff_22
2 5532716 0.531477 2015-07-09 ff_22
3 5532717 0.521029 2015-07-10 ff_22
4 5532718 0.509694 2015-07-11 ff_22
In the dataframe above, I want to find the average value of vi for each year. This does not work:
df.groupby(df.dates.year)['vi'].transform(mean)
I get this error: *** AttributeError: 'Series' object has no attribute 'year'
How to fix this?
Let's make sure that dates is datetime dtype, then use the .dt accessor as .dt.year:
df['dates'] = pd.to_datetime(df.dates)
df.groupby(df.dates.dt.year)['vi'].transform('mean')
Output:
0 0.530534
1 0.530534
2 0.530534
3 0.530534
4 0.530534
Name: vi, dtype: float64
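transform('mean') broadcasts each year's mean back to every row; if you want one row per year instead, call mean() on the same groupby:
df.groupby(df.dates.dt.year)['vi'].mean()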
Updating and completing @piRsquared's example below for recent versions of pandas (e.g. v1.1.0), using pd.Grouper instead of TimeGrouper, which was deprecated:
import pandas as pd
import numpy as np
tidx = pd.date_range('2010-01-01', '2013-12-31', name='dates')
np.random.seed([3,1415])
df = pd.DataFrame(dict(vi=np.random.rand(tidx.size)), tidx)
df.groupby(pd.Grouper(freq='1Y')).mean()
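A side note: on pandas 2.2+ the 'Y' alias is deprecated in favour of 'YE' (year-end), so the equivalent call there would be:
df.groupby(pd.Grouper(freq='1YE')).mean()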
You can also use pd.TimeGrouper with the annual frequency 'A' (on older pandas versions; TimeGrouper has since been removed).
Consider the dataframe df consisting of four years of daily data:
tidx = pd.date_range('2010-01-01', '2013-12-31', name='dates')
np.random.seed([3,1415])
df = pd.DataFrame(dict(vi=np.random.rand(tidx.size)), tidx)
df.groupby(pd.TimeGrouper('A')).mean()
vi
dates
2010-12-31 0.465121
2011-12-31 0.511640
2012-12-31 0.491363
2013-12-31 0.516614
I have just pivoted a dataframe to create the dataframe below:
date 2012-10-31 2012-11-30
term
red -4.043862 -0.709225
blue -18.046630 -8.137812
green -8.339924 -6.358016
The columns are supposed to be dates, and the left-most column is supposed to contain strings.
I want to be able to run through the rows (using .apply()) and compare the values under each date column. The problem I am having is that I think the df has a hierarchical index.
Is there a way to give the whole df a new index (e.g. 1, 2, 3 etc.) and then have a flat index (but not get rid of the terms in the first column)?
EDIT: When I try to use .reset_index() I get the error ending with 'AttributeError: 'str' object has no attribute 'view''.
EDIT 3: here is the description of the df:
<class 'pandas.core.frame.DataFrame'>
Index: 14597 entries, 101016j to zymogens
Data columns (total 6 columns):
2012-10-31 00:00:00 14597 non-null values
2012-11-30 00:00:00 14597 non-null values
2012-12-31 00:00:00 14597 non-null values
2013-01-31 00:00:00 14597 non-null values
2013-02-28 00:00:00 14597 non-null values
2013-03-31 00:00:00 14597 non-null values
dtypes: float64(6)
Thanks in advance.
df = df.reset_index()
This will take the current index, make it a column, and then give you a fresh integer index starting from 0.
Adding example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'2012-10-31': [-4, -18, -18], '2012-11-30': [-0.7, -8, -6]},
                  index=pd.Index(['red', 'blue', 'green'], name='term'))
df
       2012-10-31  2012-11-30
term
red            -4        -0.7
blue          -18        -8.0
green         -18        -6.0
df.reset_index()
term 2012-10-31 2012-11-30
0 red -4 -0.7
1 blue -18 -8.0
2 green -18 -6.0
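A note on the 'term' label: reset_index() names the new column after the index's name. If the index is unnamed, pandas calls the new column 'index', and you can rename it afterwards, e.g.:
df.reset_index().rename(columns={'index': 'term'})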
EDIT: When I try to use .reset_index() I get the error ending with 'AttributeError: 'str' object has no attribute 'view''.
Try to convert your date columns to string type columns first.
I think pandas doesn't like to reset_index() here because you are trying to reset your string index into a set of columns consisting only of dates. If you only have dates as columns, pandas handles those columns internally as a DatetimeIndex. When calling reset_index(), pandas tries to add your string index as a further column alongside the date columns and fails somehow. It looks like a bug to me, but I'm not sure.
Example:
t = pandas.DataFrame({pandas.to_datetime('2011') : [1,2], pandas.to_datetime('2012') : [3,4]}, index=['A', 'B'])
t
2011-01-01 00:00:00 2012-01-01 00:00:00
A 1 3
B 2 4
t.columns
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-01 00:00:00, 2012-01-01 00:00:00]
Length: 2, Freq: None, Timezone: None
t.reset_index()
...
AttributeError: 'str' object has no attribute 'view'
If you try it with string columns, it will work.
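A sketch of that workaround on the example above (assuming a pandas version where astype(str) converts a DatetimeIndex to plain string labels):
t.columns = t.columns.astype(str)  # date columns -> plain strings
t.reset_index()                    # now succeeds; the old index becomes a column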
My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()
In the above code idx becomes a range of, say, 30 dates: 09-01-2013 to 09-30-2013.
However, s may only have 25 or 26 days, because no events happened on some dates. I then get an AssertionError when I try to plot, as the sizes don't match:
fig, ax = plt.subplots()
ax.bar(idx.to_pydatetime(), s, color='green')
What's the proper way to tackle this? Should I remove dates with no values from idx, or (what I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?
Here's a snippet of s (df.groupby(['simpleDate']).size()); notice there are no entries for the 4th and 5th.
09-02-2013 2
09-03-2013 10
09-06-2013 5
09-07-2013 1
You could use Series.reindex:
import pandas as pd
idx = pd.date_range('09-01-2013', '09-30-2013')
s = pd.Series({'09-02-2013': 2,
'09-03-2013': 10,
'09-06-2013': 5,
'09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
print(s)
yields
2013-09-01 0
2013-09-02 2
2013-09-03 10
2013-09-04 0
2013-09-05 0
2013-09-06 5
2013-09-07 1
2013-09-08 0
...
A quicker workaround is to use .asfreq(). This doesn't require creating a new index to pass to .reindex().
# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'),
pd.Timestamp('2012-05-04'),
pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)
print(s.asfreq('D'))
2012-05-01 1.0
2012-05-02 NaN
2012-05-03 NaN
2012-05-04 2.0
2012-05-05 NaN
2012-05-06 3.0
Freq: D, dtype: float64
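If you'd rather have zeroes than NaN for the gap days, asfreq also accepts a fill_value directly:
print(s.asfreq('D', fill_value=0))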
One issue is that reindex will fail if there are duplicate index values. Say we're working with timestamped data, which we want to index by date:
df = pd.DataFrame({
'timestamps': pd.to_datetime(
['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-18 "2016-11-18 04:00:00" d
Due to the duplicate 2016-11-16 date, an attempt to reindex:
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)
fails with:
...
ValueError: cannot reindex from a duplicate axis
(by this it means that the index contains duplicates, not that it is itself a duplicate)
Instead, we can use .loc to look up entries for all dates in range:
df.loc[all_days]
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-17 NaN NaN
2016-11-18 "2016-11-18 04:00:00" d
fillna can be used on the column series to fill blanks if needed. (A caveat: on pandas 1.0 and later, .loc raises a KeyError when any requested label is missing from the index, so on recent versions the resample approach below is the safer option.)
An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:
df.resample('D').mean()
resample is a deferred operation like groupby, so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.
Here is the original data, but with an extra entry for '2013-09-03':
val
date
2013-09-02 2
2013-09-03 10
2013-09-03 20 <- duplicate date added to OP's data
2013-09-06 5
2013-09-07 1
And here are the results:
val
date
2013-09-02 2.0
2013-09-03 15.0 <- mean of original values for 2013-09-03
2013-09-04 NaN <- NaN b/c date not present in orig
2013-09-05 NaN <- NaN b/c date not present in orig
2013-09-06 5.0
2013-09-07 1.0
I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
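For instance (sketches on the same resampled frame):
df.resample('D').mean().fillna(0)       # zeroes on the gap days
df.resample('D').mean().interpolate()   # linear fill from neighbouring days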
Here's a method to fill missing dates into a dataframe, with your choice of fill_value, how many days_back to fill in, and a sort order (date_order) for the result:
from datetime import datetime, timedelta
import pandas as pd

def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
    df.set_index(date_col_name, drop=True, inplace=True)
    df.index = pd.DatetimeIndex(df.index)
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq='D')
    df = df.reindex(idx, fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)
    # the original version never used date_order; sort the result accordingly
    df = df.sort_index(ascending=(date_order == 'asc'))
    return df
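Usage might look like this (a sketch; the column name 'date', the 7-day window, and the sample values are just for illustration):
today = pd.Timestamp.now().normalize()
df = pd.DataFrame({'date': [today - pd.Timedelta(days=3), today - pd.Timedelta(days=1)],
                   'n': [1, 2]})
filled = fill_in_missing_dates(df, date_col_name='date', days_back=7)
# filled has one row per calendar day in the window, with n == 0 on the gap days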
You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.
# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
'date':pd.to_datetime([
'2022-02-10'
,'2022-02-11'
,'2022-02-14'
,'2022-02-14'
,'2022-02-24'
,'2022-02-16'
])
,'value':[10,20,5,10,15,30]
})
# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])
# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
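The left join leaves NaN in value for the filled-in dates; if you want zeroes instead, follow up with fillna:
new_df['value'] = new_df['value'].fillna(0)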
If you'd rather interpolate across the gaps and then downsample, you can also chain asfreq and interpolate:
s.asfreq('D').interpolate().asfreq('Q')
Follow-up from 'Summing across rows of Pandas Dataframe' and 'Pandas Dataframe object types fillna exception over different datatypes'.
I am aggregating one of the columns using:
df.groupby(['stock', 'same1', 'same2'], as_index=False)['positions'].sum()
This method is not very forgiving if there is missing data. If there is any missing data in same1, same2, etc., it pads totally unrelated values. The workaround is to do a fillna loop over the columns, replacing missing strings with '' and missing numbers with zero; that solves the problem.
I do, however, have one column with missing dates as well. The column type is 'object', with NaN (of type float) in the missing cells and datetime objects in the existing fields. It is important that I know the data is missing, i.e. the missing indicator must survive the groupby transformation.
Dataset outlining the problem:
csv file that I use as input is:
Date,Stock,Position,Expiry,same
2012/12/01,A,100,2013/06/01,AA
2012/12/01,A,200,2013/06/01,AA
2012/12/01,B,300,,BB
2012/6/01,C,400,2013/06/01,CC
2012/6/01,C,500,2013/06/01,CC
I then read in file:
df = pd.read_csv('example', parse_dates=[0])
import datetime
import numpy as np

def convert_date(d):
    '''Converts YYYY/mm/dd to a datetime object'''
    if type(d) != str or len(d) != 10:
        return np.nan
    dd = d[8:]
    mm = d[5:7]
    YYYY = d[:4]
    return datetime.datetime(int(YYYY), int(mm), int(dd))
df['Expiry'] = df.Expiry.map(convert_date)
df
df looks like:
Date Stock Position Expiry same
0 2012-12-01 00:00:00 A 100 2013-06-01 00:00:00 AA
1 2012-12-01 00:00:00 A 200 2013-06-01 00:00:00 AA
2 2012-12-01 00:00:00 B 300 NaN BB
3 2012-06-01 00:00:00 C 400 2013-06-01 00:00:00 CC
4 2012-06-01 00:00:00 C 500 2013-06-01 00:00:00 CC
I can quite easily change the convert_date function to substitute anything else for missing data in the Expiry column.
Then using:
df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
to aggregate the Position column, I get a TypeError: can't compare datetime.datetime to str for any non-date value that I plug into the missing cells. It is important for later functionality to know whether Expiry is missing.
You need to convert your dates to the datetime64[ns] dtype (which manages how datetimes work). An object column is neither efficient nor does it deal well with datelikes. datetime64[ns] allows missing values using NaT (not-a-time); see here: http://pandas.pydata.org/pandas-docs/dev/missing_data.html#datetimes
In [6]: df['Expiry'] = pd.to_datetime(df['Expiry'])
# alternative way of reading in the data (in 0.11.1, as ``NaT`` will be set
# for missing values in a datelike column)
In [4]: df = pd.read_csv('example',parse_dates=['Date','Expiry'])
In [9]: df.dtypes
Out[9]:
Date datetime64[ns]
Stock object
Position int64
Expiry datetime64[ns]
same object
dtype: object
In [7]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
Out[7]:
Stock Expiry same Position
0 A 2013-06-01 00:00:00 AA 300
1 B NaT BB 300
2 C 2013-06-01 00:00:00 CC 900
In [8]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum().dtypes
Out[8]:
Stock object
Expiry datetime64[ns]
same object
Position int64
dtype: object
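A caveat for recent pandas: since version 1.1, groupby drops rows whose group key is NaN/NaT by default, so the B/NaT row above would disappear; pass dropna=False to keep it:
df.groupby(['Stock', 'Expiry', 'same'], as_index=False, dropna=False)['Position'].sum()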