So I have the following code:
import pandas as pd
import matplotlib.pyplot as plt
import bt
import numpy as np
import talib
btc_data = pd.read_csv('Binance_BTCUSDT_minute.csv', index_col= 'date', parse_dates = True)
one = btc_data['close'] #one minute candles
closes = np.array(one) #numpy array of one minute candles
five = one.resample('5min').mean() #five minute candles
type(one),type(five),type(one[0]),type(five[0]) #comparing types
(they are the exact same type)
period_short = 55
period_long = 144
closes = np.array(five) #I can comment this out if I want to use one minute candles instead
EMA_short = talib.EMA(closes, timeperiod= period_short)
EMA_long = talib.EMA(closes, timeperiod= period_long)
The weird part is that when I use the one-minute candles, the EMAs return numerical values, but when I use the five-minute candles, they return NaN.
I compared the types of both, and they are the same (both arrays are numpy.ndarray and the values they contain are numpy.float64). Why is the five-minute series then unable to produce values?
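For reference, a quick check along these lines could rule out NaNs introduced by the resample: resample('5min').mean() leaves NaN in bins with no data, and talib.EMA both emits NaN for the first timeperiod - 1 values and, because EMA is recursive, keeps emitting NaN after any NaN in its input. A hedged diagnostic using the variable names above:
# Count NaNs that resampling may have introduced, and check the series length
print("NaNs in 1-min closes:", np.isnan(np.array(one)).sum())
print("NaNs in 5-min closes:", np.isnan(np.array(five)).sum())
print("5-min length:", len(five), "vs period_long:", period_long)
If the NaN counts are nonzero, dropping them before the EMA call (e.g. one.resample('5min').mean().dropna()) would be one thing to try.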
I have a dataframe with the columns: Time, ID, Drug, Value
Here is my code for performing two-way ANOVA and multipletests:
#libraries
import pandas as pd
import statsmodels.formula.api as sm
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multitest import multipletests
import os
df= pd.read_excel(r"C:path.xlsm", sheet_name="test") #dataframe
order = [18,19,20,21,22,23,0] #sort 24 hour time starting at time 18hr
df['Drug']=pd.Categorical(df['Drug'])
df['Time'] = pd.Categorical(df['Time'], categories=order)
#TWO-WAY ANOVA
mod = sm.ols('Value~Drug+Time+Time*Drug', data = df).fit()
aov = anova_lm(mod, typ=2) # statsmodels uses 'typ', not 'type', for the ANOVA type
#Multi-test (mt)
mt = pd.concat([mod.params,mod.pvalues],axis=1)
mt.columns=['coefficient','pvalues']
mt = mt[mt.index.str.contains('Drug')]
mt['corrected_p'] = multipletests(mt['pvalues'],alpha=0.05,method="sidak",is_sorted=True)[1]
I get the following uncorrected p-values ('pvalues') and corrected p-values ('corrected_p') from the output of the multi-test (mt):
Index                  pvalues    corrected_p
Drug[T.B]              0.0159475  0.106432
Time[T.19]:Drug[T.B]   0.0738362  0.41546
Time[T.20]:Drug[T.B]   0.0778909  0.43314
Time[T.21]:Drug[T.B]   0.0699678  0.398153
When I use the same dataset in GraphPad Prism I get these values instead (using two-way ANOVA with Sidak multiple comparisons):
Time (Drug A-B)   Individual P value   Adjusted P value
18                0.0159               0.1064
19                0.9689               >0.999
20                >0.9999              >0.999
21                0.9379               >0.999
Especially for times 19, 20, and 21 the adjusted P-values are significantly different, and I'm not sure why. I'm concerned that I coded my statistics incorrectly, causing the difference.
Happy to provide further info as needed
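For context on what the two tables contain: the rows of mt are regression coefficients (with treatment coding, Drug[T.B] is A vs. B at the reference time of 18 hr, and the Time[T.xx]:Drug[T.B] terms are interaction contrasts), while the Prism table lists A vs. B comparisons within each time point. A rough sketch of per-time-point comparisons is below; it assumes the drug levels are literally named 'A' and 'B' and uses plain t-tests rather than Prism's pooled-error procedure, so it only illustrates the difference in what is being compared, it will not reproduce Prism's numbers.
from scipy import stats
# Hypothetical per-time-point comparison (drug level names 'A'/'B' are assumed)
rows = []
for t, grp in df.groupby('Time', observed=True):
    a = grp.loc[grp['Drug'] == 'A', 'Value']
    b = grp.loc[grp['Drug'] == 'B', 'Value']
    rows.append({'Time': t, 'pvalue': stats.ttest_ind(a, b).pvalue})
per_time = pd.DataFrame(rows)
per_time['corrected_p'] = multipletests(per_time['pvalue'], alpha=0.05, method='sidak')[1]
print(per_time)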
I have a dataframe where columns with subscript 1 are starting points and columns with subscript 2 are end points.
I want to find the distance in kilometers between them.
I tried the following code, but got an error:
import mpu
import pandas as pd
import numpy as np
data = {'lat1': [116.51172,116.51135,116.51135,116.51627,116.47186],
'lon1': [39.92123,39.93883,39.93883,39.91034,39.91248],
'lat2': [np.nan,116.51172,116.51135,116.51135,116.51627],
'lon2': [np.nan,39.92123,39.93883,39.93883,39.91034]}
# Create DataFrame
df_test = pd.DataFrame(data)
mpu.haversine_distance((df_test.lat1, df_test.lon1), (df_test.lat2, df_test.lon2))
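For reference, mpu.haversine_distance takes one (lat, lon) tuple of scalars per endpoint, not whole Series, so one possible approach is to compute it row by row and skip rows with missing coordinates. The sketch below assumes that; note also that mpu rejects latitudes outside ±90, and the sample 'lat' columns hold values near 116, so the columns look swapped, and the helper (row_distance is a made-up name) passes the 39.x values as latitude on that assumption.
import mpu
import numpy as np

def row_distance(row):
    # Return NaN when either endpoint is missing, otherwise the distance in km
    if row[['lat1', 'lon1', 'lat2', 'lon2']].isna().any():
        return np.nan
    # lat/lon appear swapped in the sample data, so the 39.x values go first here
    return mpu.haversine_distance((row['lon1'], row['lat1']),
                                  (row['lon2'], row['lat2']))

df_test['distance_km'] = df_test.apply(row_distance, axis=1)
print(df_test)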
I attempted to use the code below to plot a graph showing the speed per hour by day.
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import glob, os
taxi_df = pd.read_csv('ChicagoTaxi.csv')
taxi_df['trip_start_timestamp'] = pd.to_datetime(taxi_df['trip_start_timestamp'], format = '%Y-%m-%d %H:%M:%S', errors = 'raise')
taxi_df['trip_end_timestamp'] = pd.to_datetime(taxi_df['trip_end_timestamp'], format = '%Y-%m-%d %H:%M:%S', errors = 'raise')
# Filter away rows where trip_seconds or trip_miles is 0
filterZero = taxi_df[(taxi_df.trip_seconds != 0) & (taxi_df.trip_miles != 0)]
filterZero['trip_seconds'] = filterZero['trip_seconds']/60
filterZero['trip_seconds'] = filterZero['trip_seconds'].apply(lambda x: round(x,0))
filterZero['speed'] = filterZero['trip_miles']/filterZero['trip_seconds']
filterZero['speed'] *= 60
filterZero = filterZero.reset_index(drop=True)
filterZero.groupby(filterZero['trip_start_timestamp'].dt.strftime('%w'))['speed'].mean().plot()
plt.xlabel('Day')
plt.ylabel('Speed(Miles per Minutes)')
plt.title('Mean Miles per Hour By Days')
plt.show() #Not working
Example rows
0 2016-01-13 06:15:00 8.000000
1 2016-01-22 09:30:00 10.500000
Small Dataset : [1250219 rows x 2 columns]
Big Dataset: [15172212 rows x 2 columns]
For the smaller dataset the code works perfectly and the plot is shown. However, when I attempted to use the dataset with 15 million rows, the plot was empty because the values were "inf" despite running mean(). Am I doing something wrong here?
0 inf
1 inf
...
5 inf
6 inf
The speed is "miles per hour" by day! I was trying out different time formats, so there is a mismatch in the picture, sorry.
Image of failed plotting (larger dataset):
Image of successful plotting (smaller dataset):
I can't really be sure because you do not provide a real example of your dataset, but I'm pretty sure your problem comes from the column trip_seconds.
See these two lines:
filterZero['trip_seconds'] = filterZero['trip_seconds']/60
filterZero['trip_seconds'] = filterZero['trip_seconds'].apply(lambda x: round(x,0))
If some of your values in the column trip_seconds are ≤ 30, then this line will round them to 0.0.
filterZero['speed'] = filterZero['trip_miles']/filterZero['trip_seconds']
Therefore the speed column will contain some inf values (anything / 0.0 = inf), and taking the mean() of an array that contains inf returns inf regardless.
Two things to consider:
If your values in the column trip_seconds are actually in seconds, then after dividing them by 60 they will be in minutes, which makes your speed miles per minute, not miles per hour.
You should try it without rounding the durations; see the sketch below.
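A minimal sketch with those two points addressed, reusing taxi_df, plt, and the column names from the question (this is just one possible rework, not the asker's code): keep the duration unrounded, convert seconds to hours, and compute miles per hour directly.
# Keep durations unrounded and in hours, and drop zero/invalid durations up front
valid = taxi_df[(taxi_df.trip_seconds > 0) & (taxi_df.trip_miles > 0)].copy()
valid['trip_hours'] = valid['trip_seconds'] / 3600.0        # seconds -> hours
valid['speed'] = valid['trip_miles'] / valid['trip_hours']  # miles per hour

# Mean speed by day of week ('%w': 0 = Sunday)
valid.groupby(valid['trip_start_timestamp'].dt.strftime('%w'))['speed'].mean().plot()
plt.xlabel('Day of week')
plt.ylabel('Speed (miles per hour)')
plt.title('Mean miles per hour by day')
plt.show()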
I am using Python Pandas for the first time. I have 5-min lag traffic data in csv format:
...
2015-01-04 08:29:05,271238
2015-01-04 08:34:05,329285
2015-01-04 08:39:05,-1
2015-01-04 08:44:05,260260
2015-01-04 08:49:05,263711
...
There are several issues:
for some timestamps there's missing data (-1)
missing entries (sometimes 2-3 consecutive hours)
the frequency of the observations is not exactly 5 minutes; a few seconds are lost every now and then
I would like to obtain a regular time series, with entries every (exactly) 5 minutes and no missing values. I have successfully interpolated the time series to approximate the -1 values with this code:
ts = pd.Series(values, index=timestamps)  # pd.TimeSeries is gone in current pandas; a Series with a DatetimeIndex works the same way
ts.interpolate(method='cubic', downcast='infer')
How can I both interpolate and regularize the frequency of the observations? Thank you all for the help.
Change the -1s to NaNs:
ts[ts==-1] = np.nan
Then resample the data to a 5-minute frequency, aggregating with the mean:
ts = ts.resample('5min').mean()
Note that if two measurements fall within the same 5-minute bin, mean() averages them together.
Finally, you could linearly interpolate the time series according to the time:
ts = ts.interpolate(method='time')
Since it looks like your data already has roughly a 5-minute frequency, you might need to resample at a shorter frequency so that cubic or spline interpolation can smooth out the curve:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
values = [271238, 329285, -1, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:05',
'2015-01-04 08:34:05',
'2015-01-04 08:39:05',
'2015-01-04 08:44:05',
'2015-01-04 08:49:05'])
ts = pd.Series(values, index=timestamps)
ts[ts==-1] = np.nan
ts = ts.resample('min').mean()  # 1-minute bins ('min' replaces the older 'T' alias)
ts.interpolate(method='spline', order=3).plot()
ts.interpolate(method='time').plot()
lines, labels = plt.gca().get_legend_handles_labels()
labels = ['spline', 'time']
plt.legend(lines, labels, loc='best')
plt.show()
Assume two dataframes, each with a datetime index, and each with one column of unnamed data. The dataframes are of different lengths and the datetime indexes may or may not overlap.
df1 is length 20. df2 is length 400. The data column consists of random floats.
I want to iterate through df2 taking 20 units per iteration, advancing the window's start (and end) by one unit each time. On each iteration I want to calculate the correlation between the 20 units of df1 and the 20 units selected from df2 for that iteration. This correlation coefficient and other statistics will then be recorded.
Once the loop is complete, I want to plot df1 against the 20-unit slice of df2 that satisfies my statistical search, so I need to keep some level of indexing in order to reacquire that slice once the analysis is done.
Any thoughts?
Without knowing more specifics of the question, such as why you are doing this or whether the dates matter, this will do what you asked. I'm happy to update based on your feedback.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
df1 = pd.DataFrame({'a':[random.randint(0, 20) for x in range(20)]}, index = pd.date_range(start = '2013-01-01',periods = 20, freq = 'D'))
df2 = pd.DataFrame({'b':[random.randint(0, 20) for x in range(400)]}, index = pd.date_range(start = '2013-01-10',periods = 400, freq = 'D'))
corr = pd.DataFrame()
for i in range(len(df2) - 19):  # one 20-row window per start position (381 windows here)
    t0 = df1.reset_index()['a']                # grab the numbers from df1
    t1 = df2.iloc[i:i+20].reset_index()['b']   # grab 20 days, advancing by one each time
    t2 = df2.iloc[i:i+20].index[0]             # index the result by the first day of the window
    # calculate the correlation and collect it (DataFrame.append was removed in pandas 2.x)
    corr = pd.concat([corr, pd.DataFrame({'corr': t0.corr(t1)}, index=[t2])])
# plot it and save the graph (savefig must come before show, or the saved file is blank)
corr.plot()
plt.title("Correlation Graph")
plt.ylabel("correlation")
plt.grid(True)
plt.savefig('corr.png')
plt.show()
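To reacquire and plot the winning 20-unit slice once the loop has finished (the last step the question asks about), one possibility is below; it assumes "satisfies my statistical search" means the highest correlation, and it plots on a positional x-axis since the two date ranges differ.
# Locate the best window by its stored start date and line it up with df1
best_start = corr['corr'].idxmax()           # first day of the best-correlated window
best_slice = df2.loc[best_start:].iloc[:20]  # the 20 rows starting on that day

fig, ax = plt.subplots()
ax.plot(range(20), df1['a'].values, label='df1')
ax.plot(range(20), best_slice['b'].values, label='best df2 window')
ax.legend(loc='best')
plt.show()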