DateTimeIndex should be sorted, but isn't - python

I am trying to resample a DateTime Series in pandas as follows:
df = pd.read_csv(pathToParam + "/" + file)
df.drop(["LAT", "LON", "STATION_HEIGHT"], axis=1, inplace=True)
df.set_index(df.DATE, inplace=True, drop=True)
if granularity == "daily":
    df.index = pd.to_datetime(df.index, cache=False)
    df = df.sort_index()
    df = df.resample("8H", closed="right").bfill()
The Dataframe looks like this:
            STATION_ID  CLOUD_COVER_TOTAL
DATE
2016-01-01        1048                6.7
2016-01-02        1048                7.8
2016-01-03        1048                7.8
But I always get this error:
ValueError: index must be monotonic increasing or decreasing
I tried parse_dates=True and searched for possible solutions on a variety of platforms, but came up empty-handed. Please help.

Most likely one of the rows in your csv has an empty value where the date should be.
I can recreate your problem only if I intentionally put a blank date in:
import pandas as pd

dateSeries = ["2016-01-01", "", "2016-01-02", "2016-01-04"]
data = [[1048, 6.7], [1048, 7.8], [1048, 7.8], [1048, 7.8]]
df = pd.DataFrame(data, index=dateSeries, columns=["STATION_ID", "CLOUD_COVER_TOTAL"])
df.index = pd.to_datetime(df.index, cache=False)
df = df.sort_index()
df = df.resample("8H", closed="right").bfill()
This raises the error:
ValueError: index must be monotonic increasing or decreasing
If every index entry has a value, it works fine. You can find such problematic records with things like:
df.loc[None]
or
df.loc[""]
or
df.loc[pd.NaT]
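If the culprit is indeed a blank or malformed date, one option (a sketch of my own, not tested against your file) is to coerce bad values to NaT and drop them before resampling:
import pandas as pd

# assuming the DATE index from the question
df.index = pd.to_datetime(df.index, errors="coerce")   # blanks / bad dates become NaT
df = df[df.index.notna()]                               # drop rows whose date failed to parse
df = df.sort_index()
df = df.resample("8H", closed="right").bfill()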

Related

How to find Date of 52 Week High and date of 52 Week low using pandas dataframe (Python)?

Please refer to the table below for reference.
I was able to find the 52-week high and low using:
df = pd.read_csv(csv_file_name, engine='python')
df['52W H'] = df['HIGH'].rolling(window=252, center=False).max()
df['52W L'] = df['LOW'].rolling(window=252, center=False).min()
Can someone please guide me on how to find the date of the 52-week high and the date of the 52-week low? Thanks in advance.
My guess is that the date is another column in the dataframe; assuming its name is 'Date', you can try something like:
df = pd.read_csv(csv_file_name, engine='python')
df['52W H'] = df['HIGH'].rolling(window=252, center=False).max()
df['52W L'] = df['LOW'].rolling(window=252, center=False).min()
df_low = df[df['LOW'] == df['52W L']]
low_date = df_low['Date']
Similarly, you can look for the high values.
Also, it would have helped if you had shared your sample dataframe structure.
This answer uses 'pandas_datareader' data. The index is reset first. Then, using the idxmax() and idxmin() functions, the row positions of the rolling highs and lows are found and collected into arrays. The 'Date' column is then set back as the index, and those arrays of positions are mapped through df.index to get the dates. Note that the leading NaN positions (where the 252-day window is not yet full) are skipped when mapping through df.index.
Replace 'High' and 'Low' with your own column names in df.
import pandas as pd
import pandas_datareader.data as web
import numpy as np
df = web.DataReader('GE', 'yahoo', start='2012-01-10', end='2019-10-09')
df = df.reset_index()  # use a positional index so idxmax/idxmin return row positions
imax = df['High'].rolling(window=252, center=False).apply(lambda x: x.idxmax()).values
imin = df['Low'].rolling(window=252, center=False).apply(lambda x: x.idxmin()).values
# the first 251 windows are incomplete and give NaN; count them so they can be skipped
count0_imax = np.count_nonzero(np.isnan(imax))
count0_imin = np.count_nonzero(np.isnan(imin))
imax = imax[count0_imax:].astype(int)
imin = imin[count0_imin:].astype(int)
df = df.set_index('Date')
# map the row positions of each rolling high/low back to their dates
df.loc[df.index[count0_imax]:, '52W H'] = df.index[imax]
df.loc[df.index[count0_imin]:, '52W L'] = df.index[imin]
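As a self-contained check (my own toy example, not using pandas_datareader, with the question's HIGH/LOW/Date column names and a 3-row window standing in for 252):
import pandas as pd

# hypothetical data standing in for the question's CSV
df = pd.DataFrame({
    'Date': pd.date_range('2021-01-01', periods=6, freq='D'),
    'HIGH': [10, 12, 11, 15, 14, 13],
    'LOW':  [9, 8, 10, 7, 11, 12],
})

window = 3  # use 252 for a real 52-week window
df['52W H'] = df['HIGH'].rolling(window).max()
df['52W L'] = df['LOW'].rolling(window).min()

# row position of each rolling high/low (NaN until the window is full)
imax = df['HIGH'].rolling(window).apply(lambda s: s.idxmax(), raw=False)
imin = df['LOW'].rolling(window).apply(lambda s: s.idxmin(), raw=False)

# look those positions up in the Date column, skipping the incomplete windows
df['52W H date'] = pd.NaT
df['52W L date'] = pd.NaT
df.loc[imax.notna(), '52W H date'] = df.loc[imax.dropna().astype(int), 'Date'].values
df.loc[imin.notna(), '52W L date'] = df.loc[imin.dropna().astype(int), 'Date'].values
print(df)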

Pythonic way to insert a DataFrame column and calculate its values from each column in a list

I have a DataFrame column (from my project here) that prints like this:
ticker 2021-02-11 21:04 2021-01-12_close 2020-02-11_close 2016-02-11_close
0 AAPL 134.94 128.607819 79.287888 21.787796
1 MSFT 244.20 214.929993 182.506607 45.343704
This gives a stock ticker and its current price followed by the close price on given dates. I am looking for a pythonic way to, after each X_close column, insert an X_return column and calculate the return between the current price and the X price. What is a good way to do this?
Thanks!
Edit: When I say "calculate the return", I mean, for example, to do:
((134.94 - 128.607819) / 128.607819) * 100
So, simply using div() or sub() isn't quite satisfactory.
Try: df.filter to select the close columns, then .sub to subtract the current-price column from them, join the result back, and sort the columns with sort_index (you may need to play with this).
All code:
df.join(df.filter(like='close')
          .sub(df['2021-02-11 21:04'], axis=0)
          .rename(columns=lambda x: x.replace('close', 'return'))
       ).sort_index(axis=1)
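Since the edit asks for percentage returns rather than raw differences, the same filter/rename/join pattern can be adapted; a sketch, assuming the column names shown in the question:
current = df['2021-02-11 21:04']                 # the current-price column
close = df.filter(like='close')
returns = (close.rsub(current, axis=0)           # current price minus each close
                .div(close) * 100                # ... divided by that close, as a percent
                .rename(columns=lambda c: c.replace('close', 'return')))
df = df.join(returns).sort_index(axis=1)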
Good question. The idea is to simply create the new columns first and concatenate them to the dataframe.
cols = [c for c in df.columns if c.endswith('_close')]  # the close-price columns
df_returns = df[cols].div(df["2021-02-11 21:04:00"], axis=0).rename(columns=lambda x: x.split('_')[0] + '_return')
df_new = pd.concat([df, df_returns], axis=1).sort_index(axis=1)
Optionally, you could re-sort the columns for better readability:
df_new[df_new.columns[:-3:-1].union(df_new.columns[:-2], sort=False)]
For a more customized approach, use the pandas apply method:
def foo(s: pd.Series):
    # series-specific changes: apply some_func to each element
    ans = pd.Series(index=s.index, dtype=float)
    for i in range(s.shape[0]):
        ans.iloc[i] = some_func(s.iloc[i])
    # rename the series index here for convenience if needed
    return ans

df_returns = df[cols].apply(foo, axis=0)
Hope this helps! You can perform any operations you like in some_func().
Combining ideas from the answers given with my own, here is my solution:
def calculate_returns(df):
    print(df)
    print()
    # Get dataframe of return values
    returns_df = df.apply(calculate_return_row, axis=1)
    # Append returns df to close prices df
    df = pd.concat([df, returns_df], axis=1).sort_index(axis=1, ascending=False)
    # Rearrange columns so that close price precedes each respective return value
    return_cols = df.columns[2::2]
    close_cols = df.columns[3::2]
    reordered_cols = list(df.columns[0:2])
    reordered_cols = reordered_cols + [col for idx, _ in enumerate(return_cols) for col in [close_cols[idx], return_cols[idx]]]
    df = df[reordered_cols]
    print(df)
    return df

def calculate_return_row(row: pd.Series):
    current_price = row[1]
    close_prices = row[2:]
    returns = [calculate_return(current_price, close_price) for close_price in close_prices]
    index = [label.replace('close', 'return') for label in row.index[2:]]
    returns = pd.Series(returns, index=index)
    return returns

def calculate_return(current_val, initial_val):
    return (current_val - initial_val) / initial_val * 100
This avoids loops, and puts the return columns after the close columns:
ticker 2021-02-12 20:37 2021-01-13_close 2020-02-12_close 2016-02-12_close
0 AAPL 134.3500 130.694702 81.170799 21.855232
1 MSFT 243.9332 216.339996 182.773773 46.082863
ticker 2021-02-12 20:37 2021-01-13_close 2021-01-13_return 2020-02-12_close 2020-02-12_return 2016-02-12_close 2016-02-12_return
0 AAPL 134.3500 130.694702 2.796822 81.170799 65.515187 21.855232 514.726938
1 MSFT 243.9332 216.339996 12.754555 182.773773 33.461818 46.082863 429.336037
Thanks!

How to find and add missing dates in a dataframe of sorted dates (descending order)?

In Python, I have a DataFrame with column 'Date' (format e.g. 2020-06-26). This column is sorted in descending order: 2020-06-26, 2020-06-25, 2020-06-24...
The other column, 'Reviews', is made of text reviews of a website. My data can have multiple reviews on a given date or no reviews on another date. I want to find which dates are missing in the 'Date' column. Then, for each missing date, I want to add one row with the date in format='%Y-%m-%d' and an empty review in 'Reviews', so that I can plot them. How should I do this?
from datetime import date, timedelta
d = data['Date']
print(d[0])
print(d[-1])
date_set = set(d[-1] + timedelta(x) for x in range((d[0] - d[-1]).days))
missing = sorted(date_set - set(d))
missing = pd.to_datetime(missing, format='%Y-%m-%d')
idx = pd.date_range(start=min(data.Date), end=max(data.Date), freq='D')
#tried this
data = data.reindex(idx, fill_value=0)
data.head()
#Got TypeError: 'fill_value' ('0') is not in this Categorical's categories.
#also tried this
df2 = (pd.DataFrame(data.set_index('Date'), index=idx).fillna(0) + data.set_index('Date')).ffill().stack()
df2.head()
#Got ValueError: cannot reindex from a duplicate axis
This is my code:
for i in range(len(df)):
    if i > 0:
        prev = df.loc[i-1]["Date"]
        current = df.loc[i]["Date"]
        for a in range((prev - current).days):
            if a > 0:
                df.loc[df["Date"].count()] = [prev - timedelta(days=a), None]
df = df.sort_values("Date", ascending=False)
print(df)
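A reindex-style alternative (my own sketch, using the question's 'Date' and 'Reviews' column names and toy data): build the full daily range, make one empty-review row per missing date, and concatenate. This also sidesteps the Categorical fill_value error, since no categorical column is filled with 0.
import pandas as pd

# toy data in the question's layout: descending dates with a gap
data = pd.DataFrame({
    'Date': pd.to_datetime(['2020-06-26', '2020-06-26', '2020-06-23']),
    'Reviews': ['great site', 'ok', 'bad experience'],
})

full_range = pd.date_range(data['Date'].min(), data['Date'].max(), freq='D')
missing = full_range.difference(data['Date'])

filler = pd.DataFrame({'Date': missing, 'Reviews': ''})   # one empty review per missing date
data = (pd.concat([data, filler])
          .sort_values('Date', ascending=False)
          .reset_index(drop=True))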

How to fill missing date in timeSeries

Here's what my data looks like:
There are daily records, except for a gap from 2017-06-12 to 2017-06-16.
df2['timestamp'] = pd.to_datetime(df['timestamp'])
df2['timestamp'] = df2['timestamp'].map(
    lambda x: datetime.datetime.strftime(x, '%Y-%m-%d'))
df2 = df2.convert_objects(convert_numeric=True)
df2 = df2.groupby('timestamp', as_index=False).sum()
I need to fill this missing gap and others with values for all fields (e.g. timestamp, temperature, humidity, light, pressure, speed, battery_voltage, etc...).
How can I accomplish this with Pandas?
This is what I have done before
weektime = pd.date_range(start='06/04/2017', end='12/05/2017', freq='W-SUN')
df['week'] = 'nan'
df['weektemp'] = 'nan'
df['weekhumidity'] = 'nan'
df['weeklight'] = 'nan'
df['weekpressure'] = 'nan'
df['weekspeed'] = 'nan'
df['weekbattery_voltage'] = 'nan'
for i in range(0, len(weektime)):
    df['week'][i+1] = weektime[i]
    df['weektemp'][i+1] = df['temperature'].iloc[7*i+1:7*i+7].sum()
    df['weekhumidity'][i+1] = df['humidity'].iloc[7*i+1:7*i+7].sum()
    df['weeklight'][i+1] = df['light'].iloc[7*i+1:7*i+7].sum()
    df['weekpressure'][i+1] = df['pressure'].iloc[7*i+1:7*i+7].sum()
    df['weekspeed'][i+1] = df['speed'].iloc[7*i+1:7*i+7].sum()
    df['weekbattery_voltage'][i+1] = df['battery_voltage'].iloc[7*i+1:7*i+7].sum()
    i = i + 1
The sum values are not correct, because the value for 2017-06-17 is a sum of 2017-06-12 to 2017-06-16, and I do not want to add them again. This is not the only gap in the period; I want to fill all of them.
Here is a function I wrote that might be helpful to you. It looks for inconsistent jumps in time and fills them in. After using this function, try using a linear interpolation function (pandas has a good one) to fill in your null data values. Note: Numpy arrays are much faster to iterate over and manipulate than Pandas dataframes, which is why I switch between the two.
import numpy as np
import pandas as pd

data_arr = np.array(your_df)
periodicity = 'daily'

def fill_gaps(data_arr, periodicity):
    rows = data_arr.shape[0]
    data_no_gaps = np.copy(data_arr)  # avoid altering the thing you're iterating over
    data_no_gaps_idx = 0
    for row_idx in np.arange(1, rows):  # iterate once for each row (except the first record; nothing to compare)
        oldtimestamp_str = str(data_arr[row_idx-1, 0])
        oldtimestamp = np.datetime64(oldtimestamp_str)
        currenttimestamp_str = str(data_arr[row_idx, 0])
        currenttimestamp = np.datetime64(currenttimestamp_str)
        period = currenttimestamp - oldtimestamp
        if period != np.timedelta64(900, 's') and period != np.timedelta64(3600, 's') and period != np.timedelta64(86400, 's'):
            if periodicity == 'quarterly':
                desired_period = 900
            elif periodicity == 'hourly':
                desired_period = 3600
            elif periodicity == 'daily':
                desired_period = 86400
            periods_missing = int(period / np.timedelta64(desired_period, 's'))
            for missing in np.arange(1, periods_missing):
                new_time_orig = str(oldtimestamp + missing*(np.timedelta64(desired_period, 's')))
                new_time = new_time_orig.replace('T', ' ')
                data_no_gaps = np.insert(data_no_gaps, (data_no_gaps_idx + missing),
                                         np.array((new_time, np.nan, np.nan, np.nan, np.nan, np.nan)), 0)  # INSERT VALUES YOU WANT IN THE NEW ROW
            data_no_gaps_idx += (periods_missing-1)  # increment the index (zero-based => -1) in accordance with added rows
        data_no_gaps_idx += 1  # allow index to change as we iterate over original data array (main for loop)
    # create a dataframe:
    data_arr_no_gaps = pd.DataFrame(data=data_no_gaps, index=None, columns=['Time', 'temp', 'humidity', 'light', 'pressure', 'speed'])
    return data_arr_no_gaps
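For comparison (my own shorter sketch, not part of the function above), pandas can do the same daily gap filling directly, assuming one row per day and a 'timestamp' column; numeric columns can then be interpolated as suggested:
import pandas as pd

def fill_daily_gaps(df):
    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df = df.set_index('timestamp').sort_index()
    df = df.asfreq('D')                     # inserts a NaN row for every missing day
    return df.interpolate(method='time')    # linear-in-time interpolation of numeric columns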
Fill time gaps and nulls
Use the function below to ensure expected date sequence exists, and then use forward fill to fill in nulls.
import pandas as pd
import os

def fill_gaps_and_nulls(df, freq='1D'):
    '''
    General steps:
    A) check for extra dates (out of expected frequency/sequence)
    B) check for missing dates (based on expected frequency/sequence)
    C) use forwardfill to fill nulls
    D) use backwardfill to fill remaining nulls
    E) append to file
    '''
    #rename the timestamp to 'date'
    df = df.rename(columns={"timestamp": "date"})
    #sort to make indexing faster
    df = df.sort_values(by=['date'], inplace=False)
    #create an artificial index of dates at frequency = freq, with the same beginning and ending as the original data
    all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq=freq)
    #record column names
    df_cols = df.columns
    #delete ffill_df.csv so we can begin anew
    try:
        os.remove('ffill_df.csv')
    except FileNotFoundError:
        pass
    #check for extra dates and/or dates out of order. print warning statement for log
    extra_dates = set(df.date).difference(all_dates)
    #if there are extra dates (outside of expected sequence/frequency), deal with them
    if len(extra_dates) > 0:
        #############################
        #INSERT DESIRED BEHAVIOR HERE
        print('WARNING: Extra date(s):\n\t{}\n\t Shifting highlighted date(s) back by 1 day'.format(extra_dates))
        for date in extra_dates:
            #shift extra dates back one day
            df.date[df.date == date] = date - pd.Timedelta(days=1)
        #############################
    #check the artificial date index against df to identify missing gaps in time and fill them with nulls
    gaps = all_dates.difference(set(df.date))
    print('\n-------\nWARNING: Missing dates: {}\n-------\n'.format(gaps))
    #if there are time gaps, deal with them
    if len(gaps) > 0:
        #initialize df of correct size, filled with nulls
        gaps_df = pd.DataFrame(index=gaps, columns=df_cols.drop('date'))  #len(index) sets number of rows
        #give index a name
        gaps_df.index.name = 'date'
        #add the region and type (r and t are defined elsewhere in the surrounding project code)
        gaps_df.region = r
        gaps_df.type = t
        #remove that index so gaps_df and df are compatible
        gaps_df.reset_index(inplace=True)
        #append gaps_df to df
        new_df = pd.concat([df, gaps_df])
        #sort on date
        new_df.sort_values(by='date', inplace=True)
        #fill nulls
        new_df.fillna(method='ffill', inplace=True)
        new_df.fillna(method='bfill', inplace=True)
        #append to file
        new_df.to_csv('ffill_df.csv', mode='a', header=False, index=False)
    return df_cols, regions, types, all_dates
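A minimal self-contained sketch of the core idea in that function (hypothetical column names, without the file output or the region/type handling):
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2017-06-10', '2017-06-11', '2017-06-14']),
    'temperature': [21.0, 22.5, 20.1],
})

all_dates = pd.date_range(df['date'].min(), df['date'].max(), freq='1D')
gaps = all_dates.difference(df['date'])

gaps_df = pd.DataFrame({'date': gaps})          # null rows for the missing days
filled = (pd.concat([df, gaps_df])
            .sort_values('date')
            .ffill()
            .bfill()
            .reset_index(drop=True))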

Pandas Columns Operations with List

I have a pandas dataframe with two columns: the first with just a single date ('action_date') and the second with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill two new df columns with the number of dates in 'verification_date' whose difference is over 360 days and under 360 days, respectively.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    df = df
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation accordingly; you can do this by replacing the above lines with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
What it does is set a value at row i in column over_360 or under_360.
You can learn more about it in the pandas documentation.
If you don't like using set_value, you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
You can check the pandas documentation for DataFrame.ix.
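Note that set_value and .ix have since been removed from pandas; in current versions the equivalent scalar setter is .at (or .loc):
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)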
You might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
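A runnable sketch putting the pieces together on the question's sample data (assumptions: a recent pandas, so pd.Grouper replaces the question's TimeGrouper, and the == 360 case is counted as 'under' here):
import pandas as pd

df = pd.DataFrame({
    'action_date': pd.to_datetime(['2017-01-01', '2017-01-01', '2017-01-03']),
    'verification_date': pd.to_datetime(['2016-01-01', '2015-01-08', '2017-01-01']),
    'user_name': ['abc', 'wdt', 'sdf'],
})
df = (df.set_index('action_date')
        .groupby(pd.Grouper(freq='2D'))['verification_date']
        .apply(list)
        .reset_index())

df['over_360'] = df.apply(lambda x: sum((x['action_date'] - d).days > 360 for d in x['verification_date']), axis=1)
df['under_360'] = df.apply(lambda x: sum((x['action_date'] - d).days <= 360 for d in x['verification_date']), axis=1)
print(df)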
