Empty Plot when dealing with a huge number of rows - python

I attempted to use the code below to plot a graph showing the mean speed (miles per hour) by day of the week.
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import glob, os
taxi_df = pd.read_csv('ChicagoTaxi.csv')
taxi_df['trip_start_timestamp'] = pd.to_datetime(taxi_df['trip_start_timestamp'], format = '%Y-%m-%d %H:%M:%S', errors = 'raise')
taxi_df['trip_end_timestamp'] = pd.to_datetime(taxi_df['trip_end_timestamp'], format = '%Y-%m-%d %H:%M:%S', errors = 'raise')
#Filter away any rows where trip_seconds or trip_miles = 0
filterZero = taxi_df[(taxi_df.trip_seconds != 0) & (taxi_df.trip_miles != 0)]
filterZero['trip_seconds'] = filterZero['trip_seconds']/60
filterZero['trip_seconds'] = filterZero['trip_seconds'].apply(lambda x: round(x,0))
filterZero['speed'] = filterZero['trip_miles']/filterZero['trip_seconds']
filterZero['speed'] *= 60
filterZero = filterZero.reset_index(drop=True)
filterZero.groupby(filterZero['trip_start_timestamp'].dt.strftime('%w'))['speed'].mean().plot()
plt.xlabel('Day')
plt.ylabel('Speed(Miles per Minutes)')
plt.title('Mean Miles per Hour By Days')
plt.show() #Not working
Example rows
0 2016-01-13 06:15:00 8.000000
1 2016-01-22 09:30:00 10.500000
Small Dataset : [1250219 rows x 2 columns]
Big Dataset: [15172212 rows x 2 columns]
For the smaller dataset the code works perfectly and the plot is shown. However, when I attempted to use the dataset with 15 million rows, the plot came out empty because the values were "inf" despite running mean(). Am I doing something wrong here?
0 inf
1 inf
...
5 inf
6 inf
The speed is "Miles Per Hour" by day! I was trying out different time formats, so there is a mismatch in the picture, sorry.
Image of failed plotting (larger dataset):
Image of successful plotting (smaller dataset):

I can't really be sure because you do not provide a real example of your dataset, but I'm pretty sure your problem comes from the column trip_seconds.
See these two lines:
filterZero['trip_seconds'] = filterZero['trip_seconds']/60
filterZero['trip_seconds'] = filterZero['trip_seconds'].apply(lambda x: round(x,0))
If some of your values in the column trip_seconds are ≤ 30, the division and rounding will turn them into 0.0.
filterZero['speed'] = filterZero['trip_miles']/filterZero['trip_seconds']
Therefore the speed column will be filled with some inf values (anything / 0.0 = inf in pandas), and taking the mean() of an array containing inf returns inf regardless of the other values.
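A quick sanity check of that last claim (my own two-liner, not from the question):
import numpy as np
np.mean([1.0, 2.0, np.inf])  # -> inf: a single inf dominates the mean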
Two things to consider:
If your values in trip_seconds really are seconds, then after dividing by 60 they are minutes, and trip_miles / trip_seconds gives miles per minute; your later speed *= 60 converts that to miles per hour, so the y-axis label "Speed(Miles per Minutes)" is misleading.
You should try without rounding the times.
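A minimal sketch of the fix, assuming the column names from the question: keep the duration unrounded so short trips keep a nonzero denominator, and defensively drop any rows that still come out non-finite before grouping:
import numpy as np
filterZero = taxi_df[(taxi_df.trip_seconds != 0) & (taxi_df.trip_miles != 0)].copy()
filterZero['trip_minutes'] = filterZero['trip_seconds'] / 60  # no rounding
filterZero['speed'] = filterZero['trip_miles'] / filterZero['trip_minutes'] * 60  # miles per hour
filterZero = filterZero[np.isfinite(filterZero['speed'])]  # drop any remaining inf/NaN
filterZero.groupby(filterZero['trip_start_timestamp'].dt.strftime('%w'))['speed'].mean().plot()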

Related

How to create a column that returns the slope of a moving average?

On a dataframe that contains the price of bitcoin, I want to measure the strength of a trend by displaying, on each row, the angle of the slope of a moving average (calculated over 20 periods).
A moving average allows you to analyze a time series, removing transient fluctuations in order to highlight longer-term trends.
To calculate a simple 20-period moving average for trading purposes, we take the last 20 closing prices, add them together and divide the result by 20.
I started by trying to use the linregress function of scipy, but I get the exception "len() of unsized object", which I could not solve:
from scipy.stats import linregress
x = df.iloc[-1, 8] # -1:last row, 8: sma20
y = df['sma20']
df['slope_deg'] = df.apply(linregress(x, y))
I then used the atan function of the math module, but the result is always nan, whatever the row is:
import math
df['sma20'] = df['Close'].rolling(20).mean()
slope=((df['sma20'][0]-df['sma20'][20])/20)
df['slope_deg'] = math.atan(slope) * 180 / math.pi
... or always 45:
import math
df['sma20'] = df['Close'].rolling(20).mean()
df['slope_deg'] = math.atan(1) * 180 / math.pi
df
Here is an example of code with the date as an index, the price used to calculate the moving average, and the moving average (over 5 periods for the example):
df = pd.DataFrame({'date': np.tile(pd.date_range('1/1/2011', periods=25, freq='D'), 4),
                   'price': np.random.randn(100).cumsum() + 10})
df['sma5'] = df['price'].rolling(5).mean()
df.head(10)
Can someone help me to create a column that returns the slope of a moving average?
OK, I did the 20-day SMA. I am not so sure about the slope part, since you didn't clearly specify what you need.
I am assuming slope values, in degrees, as follows:
arctan( (PriceToday - Price20daysAgo)/ 20 )
Here you have the code:
EDIT 1: simplified the 'slope' code and adapted it following @Oliver's suggestion.
import pandas as pd
import numpy as np
import yfinance as yf
btc = yf.download('BTC-USD', period='1y')
btc['sma20'] = btc['Adj Close'].rolling(20).mean()
btc['slope'] = np.degrees(np.arctan(btc['sma20'].diff() / 20))
btc = btc[['Adj Close', 'sma20', 'slope']].dropna()
Output:
btc
Adj Close sma20 slope
Date
2021-03-15 55907.199219 51764.509570 86.767651
2021-03-16 56804.902344 52119.488086 86.775283
2021-03-17 58870.894531 52708.340234 88.054732
2021-03-18 57858.921875 53284.298242 88.011217
2021-03-19 58346.652344 53892.208203 88.115671
... ... ... ...
2022-02-19 40122.156250 41560.807227 79.715989
2022-02-20 38431.378906 41558.219922 -7.371144
2022-02-21 37075.281250 41474.820312 -76.514600
2022-02-22 38286.027344 41541.472461 73.297321
2022-02-23 38748.464844 41621.165625 75.911862
As you can see, the slope value means little as it is. That's because the variation in price over a 20-day span is far greater than 20 units, the value representing the time window you chose to use.
Plotting prices and sma20 vs date.
btc[['Adj Close','sma20']].plot(figsize=(14,7));
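Since the raw angle saturates near ±90°, one option (a sketch of my own, not part of the answer above) is to compute the angle on relative changes, so it reflects percent change per day rather than dollars per day; a 1% daily move then maps to 45 degrees:
btc['slope_pct'] = np.degrees(np.arctan(btc['sma20'].pct_change() * 100))  # angle of % change per day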

talib.EMA() returning nan values

So I have the following code:
import pandas as pd
import matplotlib.pyplot as plt
import bt
import numpy as np
import talib
btc_data = pd.read_csv('Binance_BTCUSDT_minute.csv', index_col= 'date', parse_dates = True)
one = btc_data['close'] #one minute candles
closes = np.array(one) #numpy array of one minute candles
five = one.resample('5min').mean() #five minute candles
type(one),type(five),type(one[0]),type(five[0]) #comparing types
(they are the exact same type)
period_short = 55
period_long = 144
closes = np.array(five) #I can comment this out if I want to use one minute candles instead
EMA_short = talib.EMA(closes, timeperiod= period_short)
EMA_long = talib.EMA(closes, timeperiod= period_long)
The weird part is that when I use the one-minute candles, the EMAs return numerical values, but when I use five-minute candles, they return nan.
I compared the types of both, and they are the same (the arrays are numpy.ndarray and the values are numpy.float64). Why is the 5-minute data then unable to produce values?
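A hedged guess, since the CSV itself isn't shown: one.resample('5min').mean() produces NaN for any 5-minute bucket that contains no rows, and TA-Lib does not skip NaNs, so a single gap can poison the EMA output from that point on. Dropping empty buckets first is a quick way to test this:
five = one.resample('5min').mean().dropna()  # drop buckets that had no 1-minute rows
closes = np.array(five)
EMA_short = talib.EMA(closes, timeperiod=period_short)
EMA_long = talib.EMA(closes, timeperiod=period_long)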

Outlier detection based on the moving mean in Python

I am trying to translate an algorithm from MATLAB to Python. The algorithm works with large datasets and needs an outlier detection and elimination technique to be applied.
In the MATLAB code, the outlier deletion technique I use is movmedian:
Outlier_T=isoutlier(Data_raw.Temperatura,'movmedian',3);
Data_raw(find(Outlier_T),:)=[]
This detects outliers with a rolling median, by finding disproportionate values in the centre of a three-value moving window. So if I have a column "Temperatura" with a 40 on row 3, it is detected and the entire row is deleted.
Temperatura Date
1 24.72 2.3
2 25.76 4.6
3 40 7.0
4 25.31 9.3
5 26.21 15.6
6 26.59 17.9
... ... ...
To my understanding, this is achieved with pandas.DataFrame.rolling. I have seen several posts exemplify its use, but I am not managing to make it work with my code:
Attempt A:
Dataframe.rolling(df["t_new"]))
Attempt B:
df-df.rolling(3).median().abs()>200
#based on #Ami Tavory's answer
Am I missing something obvious here? What is the right way of doing this?
Thank you for your time.
The code below drops rows based on a threshold, which can be adjusted as needed. I am not sure it replicates the MATLAB code exactly, though.
# Import Libraries
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date': [2.3, 4.6, 7.0, 9.3, 15.6, 17.9]
})
# Set threshold for difference with rolling median
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
df['rolling_temp'] = df['Temperatura'].rolling(window=3).median()
# Calculate difference
df['diff'] = df['Temperatura'] - df['rolling_temp']
# Flag rows to be dropped as `1`
df['drop_flag'] = np.where((df['diff']>upper_threshold)|(df['diff']<lower_threshold),1,0)
# Drop flagged rows
df = df[df['drop_flag']!=1]
df = df.drop(['rolling_temp', 'diff', 'drop_flag'], axis=1)
Output
print(df)
Temperatura Date
0 24.72 2.3
1 25.76 4.6
3 25.31 9.3
4 26.21 15.6
5 26.59 17.9
Late to the party; this builds on Nilesh Ingle's answer, modified to be more general, more verbose (graphs!), and to use a scaled threshold instead of the data's raw values.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
# Calculate rolling median
df["Temp_Rolling"] = df["Temp"].rolling(window=3).median()
# Scale both series to [0, 1] so the threshold is relative rather than in raw units
scaler = MinMaxScaler()
df["Temp_Scaled"] = scaler.fit_transform(df["Temp"].values.reshape(-1, 1))
df["Temp_Rolling"] = scaler.fit_transform(df["Temp_Rolling"].values.reshape(-1, 1))
# Calculate difference
df["Temp_Diff"] = df["Temp_Scaled"] - df["Temp_Rolling"]
# Set threshold for difference with rolling median
upper_threshold = 0.4
lower_threshold = -0.4
# Flag rows to keep as True
df["Temp_Keep_Flag"] = np.where( (df["Temp_Diff"] > upper_threshold) | (df["Temp_Diff"] < lower_threshold), False, True)
# Keep flagged rows
print('dropped rows')
print(df[~df["Temp_Keep_Flag"]].index)
print('Your new graph')
df_result = df[df["Temp_Keep_Flag"].values]
df_result["Temp"].plot()
Once you're satisfied with the data cleaning:
# Satisfied, replace data
df = df[df["Temp_Keep_Flag"].values]
df.drop(columns=["Temp_Rolling", "Temp_Diff", "Temp_Keep_Flag"], inplace=True)
df.plot()
Nilesh's answer works perfectly; to iterate on his code, you could also do:
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
df['rolling_temp'] = df['Temp'].rolling(window=3).median()
# All in one line
df = df.drop(df[(df['Temp'] - df['rolling_temp'] > upper_threshold) | (df['Temp'] - df['rolling_temp'] < lower_threshold)].index)
# if you want to drop the column as well
del df["rolling_temp"]
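For a closer match to MATLAB's isoutlier(..., 'movmedian', 3), which flags points more than three scaled MADs away from the local median, here is a sketch (the local_mad helper and the 1.4826 factor are my additions, assuming the same df with a 'Temperatura' column):
import numpy as np
def local_mad(window):
    # median absolute deviation within one rolling window
    return np.median(np.abs(window - np.median(window)))
k = 1.4826  # makes the MAD consistent with the std of normally distributed data
med = df['Temperatura'].rolling(3, center=True).median()
mad = df['Temperatura'].rolling(3, center=True).apply(local_mad, raw=True)
outliers = (df['Temperatura'] - med).abs() > 3 * k * mad
df_clean = df[~outliers]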

How turn this monthly xarray dataset into an annual mean without resampling?

I have an xarray dataset of monthly average surface temperatures, read in from a server using open_dataset with decode_times=False because the calendar type is not understood by xarray.
After some manipulation, I am left with a dataset my_dataset of surface temperatures ('ts') and times ('T'):
<xarray.Dataset>
Dimensions: (T: 1800)
Coordinates:
* T (T) float32 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 ...
Data variables:
ts (T) float64 246.6 247.9 250.7 260.1 271.9 281.1 283.3 280.5 ...
'T' has the following attributes:
Attributes:
pointwidth: 1.0
calendar: 360
gridtype: 0
units: months since 0300-01-01
I would like to take this monthly data and calculate annual averages, but because the T coordinates aren't datetimes, I'm unable to use xarray.Dataset.resample. Right now I am simply converting to a numpy array, but I would like a way to do this that preserves the xarray dataset.
My current, rudimentary way:
temps = np.mean(np.array(my_dataset['ts']).reshape(-1,12),axis=1)
years = np.array(my_dataset['T'])/12
I appreciate any help, even if the best way is redefining the time coordinate to use resampling.
Edit:
I was asked how the xarray dataset was created; it was done via the following:
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
filename = 'http://strega.ldeo.columbia.edu:81/CMIP5/.byScenario/.abrupt4xCO2/.atmos/.mon/.ts/ACCESS1-0/r1i1p1/.ts/dods'
ds = xr.open_dataset(filename,decode_times=False)
zonal_mean = ds.mean(dim='lon')
arctic_only = zonal_mean.where(zonal_mean['lat'] >= 60).dropna('lat')
weights = np.cos(np.deg2rad(arctic_only['lat']))/np.sum(np.cos(np.deg2rad(arctic_only['lat'])))
my_dataset = (arctic_only * weights).sum(dim='lat')
This is a very common problem, especially with datasets from INGRID. The reason xarray can't decode dates whose units are "months since ..." is that the underlying netcdf4-python library refuses to parse such dates. This is discussed in a netcdf4-python GitHub issue:
The problem with time units such as "months" is that they are not well defined. In contrast to days, hours, etc. the length of a month depends on the calendar used and even varies between different months.
INGRID unfortunately refuses to accept this fact and continues to use "months" as its default unit, despite the ambiguity. So right now there is this frustrating incompatibility between INGRID and xarray / python-netcdf4.
Anyway, here is a hack to accomplish what you want without leaving xarray:
# create new coordinates for month and year
ds.coords['month'] = np.ceil(ds['T'] % 12).astype('int')
ds.coords['year'] = (ds['T'] // 12).astype('int')
# calculate monthly climatology
ds_clim = ds.groupby('month').mean(dim='T')
# calculate annual mean
ds_am = ds.groupby('year').mean(dim='T')
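If you would rather repair the time axis itself so that plain resample works, here is a sketch (assuming the data really are 1800 consecutive months starting 0300-01 on a 360-day calendar):
import xarray as xr
# build a proper CFTimeIndex matching "months since 0300-01-01" on a 360-day calendar
times = xr.cftime_range(start='0300-01', periods=1800, freq='MS', calendar='360_day')
ds = ds.assign_coords(T=times).rename({'T': 'time'})
ds_am = ds.resample(time='AS').mean()  # annual means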

Python pandas time series interpolation and regularization

I am using Python pandas for the first time. I have traffic data at 5-minute intervals in CSV format:
...
2015-01-04 08:29:05,271238
2015-01-04 08:34:05,329285
2015-01-04 08:39:05,-1
2015-01-04 08:44:05,260260
2015-01-04 08:49:05,263711
...
There are several issues:
for some timestamps the data is missing (-1)
there are missing entries (sometimes 2 or 3 consecutive hours)
the frequency of the observations is not exactly 5 minutes; it loses some seconds once in a while
I would like to obtain a regular time series, i.e. with entries every (exactly) 5 minutes and no missing values. I have successfully interpolated the time series to approximate the -1 values with this code:
ts = pd.TimeSeries(values, index=timestamps)
ts.interpolate(method='cubic', downcast='infer')
How can I both interpolate and regularize the frequency of the observations? Thank you all for the help.
Change the -1s to NaNs:
ts[ts==-1] = np.nan
Then resample the data to have a 5 minute frequency.
ts = ts.resample('5T').mean()
Note that if two measurements fall within the same 5-minute period, mean() averages them together.
Finally, you could linearly interpolate the time series according to the time:
ts = ts.interpolate(method='time')
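Putting the three steps together, a minimal sketch assuming ts is your original series:
import numpy as np
ts[ts == -1] = np.nan              # treat -1 as missing
ts = ts.resample('5T').mean()      # snap onto an exact 5-minute grid
ts = ts.interpolate(method='time') # fill the gaps linearly in time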
Since it looks like your data already has roughly a 5-minute frequency, you might need to resample at a shorter frequency so cubic or spline interpolation can smooth out the curve:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
values = [271238, 329285, -1, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:05',
'2015-01-04 08:34:05',
'2015-01-04 08:39:05',
'2015-01-04 08:44:05',
'2015-01-04 08:49:05'])
ts = pd.Series(values, index=timestamps)
ts[ts==-1] = np.nan
ts = ts.resample('T').mean()
ts.interpolate(method='spline', order=3).plot()
ts.interpolate(method='time').plot()
lines, labels = plt.gca().get_legend_handles_labels()
labels = ['spline', 'time']
plt.legend(lines, labels, loc='best')
plt.show()
