Convert tick data to tick bars/candlesticks using python-MT5, and pandas - python

I am trying to convert data obtained with the MetaTrader module for Python found on the official MQL5 website. I want to use tick data rather than importing candlestick data. Tick bars (tick candlesticks) sample a set number of ticks rather than a set amount of time in order to calculate OHLC. For example, 100 ticks create a candle instead of 1 minute. Using the functions to copy ticks from MetaTrader 5
copy_ticks_from
or
copy_ticks_range
results in a dataframe called copy_ticks_from or copy_ticks_range, but the data output has the same format:
time bid ask last volume time_msc flags volume_real
I've watched videos and searched and searched, and will continue to, but any help is greatly appreciated.
An example of code input and output can be found at https://www.mql5.com/en/docs/integration/python_metatrader5/mt5copyticksfrom_py
edit426221500
I was inspired by this article https://towardsdatascience.com/advanced-candlesticks-for-machine-learning-i-tick-bars-a8b93728b4c5
I think I understand a bit more after this read-through. I believe I need to use similar code to get the desired output. I'm working on converting my dataframe to a numpy array. After that I will modify the code found in the reference above to be something like:
def generate_tickbars(ticks, frequency=1000):
    times = ticks[:,0]
    time = ticks[:,1]
    prices = ticks[:,2,3]
I'm not sure about volume or the preceding lines, but I think I'm on the right track, or this may at least be one way of doing it.
I'm researching https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html and will then try to get the conversion working.
Edit426221640
using
ticks_frame.to_numpy(dtype=None, copy=True,)
I get a numpy array as an output.
array([[Timestamp('2020-01-10 01:05:00'), 1552.91, 1553.16, ...,
1578618300331, 134, 0.0],
[Timestamp('2020-01-10 01:05:00'), 1552.83, 1553.32, ...,
1578618300634, 134, 0.0],
[Timestamp('2020-01-10 01:05:01'), 1552.87, 1553.32, ...,
1578618301834, 130, 0.0],
...,etc
I am now stuck at the code referenced in the link above from the previous edit.
# expects a numpy array with trades
# each trade is composed of: [time, price, quantity]
def generate_tickbars(ticks, frequency=1000):
    times = ticks[:,0]
    prices = ticks[:,1]
    volumes = ticks[:,2]
    res = np.zeros(shape=(len(range(frequency, len(prices), frequency)), 6))
    it = 0
    for i in range(frequency, len(prices), frequency):
        res[it][0] = times[i-1]                        # time
        res[it][1] = prices[i-frequency]               # open
        res[it][2] = np.max(prices[i-frequency:i])     # high
        res[it][3] = np.min(prices[i-frequency:i])     # low
        res[it][4] = prices[i-1]                       # close
        res[it][5] = np.sum(volumes[i-frequency:i])    # volume
        it += 1
    return res
How do I make this work for my data? Is there a simpler way to accomplish this?
Edit426221745
I believe I have resampled the data correctly using a different approach.
def bar(xs, y): return np.int64(xs / y) * y
ticks_frame.groupby(bar(np.arange(len(ticks_frame)),
1000)).agg({'bid': 'ohlc', 'volume': 'sum'})
Now onto plotting bars or candlesticks.
Edit426222250
Still stuck at the point of the last edit. Although I can use bid for OHLC, group the ticks, and view them that way, it seems my issue is that I need to reshape the dataframe, or create a new dataframe from ticks_frame that uses bid to calculate the OHLC values. Any and all help is greatly appreciated.
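One possible way to get a flat OHLC dataframe out of ticks_frame is sketched below. It is only a sketch under assumptions: the helper name ticks_to_tick_bars and the parameter ticks_per_bar are made up for illustration, the frame is assumed to have the time, bid and volume columns listed above, and the demo ticks are synthetic rather than real MT5 output.

```python
import numpy as np
import pandas as pd


def ticks_to_tick_bars(ticks_frame: pd.DataFrame, ticks_per_bar: int = 1000) -> pd.DataFrame:
    """Hypothetical helper: group consecutive ticks into fixed-size bars (OHLC from 'bid')."""
    # Integer division of the row position gives 0,0,...,1,1,... as the bar id.
    bar_id = np.arange(len(ticks_frame)) // ticks_per_bar
    grouped = ticks_frame.groupby(bar_id)

    bars = grouped["bid"].ohlc()               # columns: open, high, low, close
    bars["volume"] = grouped["volume"].sum()   # total volume per bar
    bars["time"] = grouped["time"].last()      # timestamp of the last tick in the bar
    return bars.set_index("time")


# Synthetic ticks purely for demonstration (not real market data).
demo_ticks = pd.DataFrame({
    "time": pd.to_datetime(np.arange(25), unit="s"),
    "bid": np.arange(25, dtype=float),
    "volume": np.ones(25),
})
demo_bars = ticks_to_tick_bars(demo_ticks, ticks_per_bar=10)
print(demo_bars)
```

Note that the last bar is kept even when it holds fewer than ticks_per_bar ticks; drop it if partial bars are not wanted. The resulting frame has plain open/high/low/close/volume columns, which is the shape most candlestick plotting functions expect.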

Related

plotly python lines going backwards even after sorting values

I am trying to create a plot which shows each individual's trajectory as well as the mean. This is working OK except that there appear to be extra lines and the lines go backwards, even after sorting the values.
Example:
import pandas as pd
import plotly.graph_objects as go
df = pd.DataFrame({"id": [1,1,1,1,2,2,2,2],
"months": [0,1,2,3,0,1,2,3],
"outcome":[5,2,7,11,18,3,15,3]})
#sort by each individual and the months ie. time column
df.sort_values(by=["id", "months"], inplace=True)
#create mean to overlay on plot
grouped = df.groupby("months")["outcome"].mean().reset_index()
#create plot
fig = go.Figure()
fig.add_trace(go.Scatter(x= df['months'], y= df['outcome'], name = "Individuals"))
fig.add_trace(go.Scatter(x=grouped['months'], y=grouped['outcome'], name = "Mean"))
fig.write_image("test.jpeg", scale = 2)
fig.show()
Now that I'm looking at it, it actually looks like it's just creating one giant line for all IDs together, whereas I'd like one line for ID 1 and one line for ID 2.
Any help much appreciated. Thanks in advance.
I believe the issue is in your x-values. In Pycharm, I looked at the dataframe and it looks like this:
Your months go from 0-3 and then back to 0-3. I'm a little unclear on what you want to do though - do you want to display only the ones with IDs that match? Such as all the ID with 1 and ID with 2?
Let us know what you expect to see given this dataframe I'm showing, it would be helpful.
EDIT: So, I couldn't read the original question. Looking at it more, I believe I can at least answer the first portion; however, that led me to another bug. The line in question should be changed like so:
fig.add_trace(go.Scatter(x=df['months'][df['id'] == 1], y=df['outcome'][df['id'] == 1], name="Individuals"))
This will pull from the dataframe only where the id == 1, however this then won't show on your graph since your grouped dataframe doesn't fall within the same bounds.

Time Series Chart: Groupby seasons (or specfic months) for multiple years in xarray

Thank you for taking interest in my question.
I am hoping to plot a temperature time series chart specifically for the months January to August from 1981 to 1999.
Below are my code and attempts:
temperature = xr.open_dataarray('temperature.nc')
temp = temperature.sel(latitude=slice(34.5,30), longitude=slice(73,78.5))
templatlonmean = temp.mean(dim=['latitude','longitude'])-273.15
tempgraph1 = templatlonmean.sel(time=slice('1981','1999'))
The above commands read in fine without any errors.
Below are my attempts to divide the months into seasons:
1st Attempt
tempseason1 = tempgraph1.groupby("time.season").mean("time")
#Plotting Graph Command
myfig, myax = plt.subplots(figsize=(14,8))
timeyears = np.unique(tempgraph1["time.season"])
tempseason1.plot.line('b-', color='red', linestyle='--',linewidth=4, label='1981-1999 Mean')
I got this error:
"Plotting requires coordinates to be numeric, boolean, or dates of type numpy.datetime64, datetime.datetime, cftime.datetime or pandas.Interval. Received data of type object instead."
I tried this as my second attempt (retrieved from this post Select xarray/pandas index based on specific months)
However, I wasn't sure how can I plot a graph with this, so I tried the following:
def is_amj(month):
    return (month >= 4) & (month <= 6)
temp_seasonal = tempgraph1.sel(time=is_amj(tempgraph1['time.month']))
#Plotting Graph Command
timeyears = np.unique(tempgraph1["time.season"])
temp_seasonal.plot.line('b-', color='red', linestyle='--',linewidth=4, label='1981-1999 Mean')
And it caused no error but the graph was not ideal
So I moved on to my 3rd attempt (from here http://xarray.pydata.org/en/stable/examples/monthly-means.html):
month_length = tempmean.time.dt.days_in_month
weights = month_length.groupby('time.season') / month_length.groupby('time.season').sum()
np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))
ds_weighted = (tempmean * weights).groupby('time.season').sum(dim='time')
ds_unweighted = tempmean.groupby('time.season').mean('time')
#Plot Commands
timeyears = np.unique(tempgraph1["time.season"])
ds_unweighted.plot.line('b-', color='red', linestyle='--',linewidth=4, label='1981-1999 Mean')
Still I got the same error as the 1st attempt:
"Plotting requires coordinates to be numeric, boolean, or dates of type numpy.datetime64, datetime.datetime, cftime.datetime or pandas.Interval. Received data of type object instead."
I know this command was used to plot weather maps rather than a time series chart; however, I believed the groupby process would be similar or even the same, which is why I used it.
As I am relatively new to coding, please excuse any syntax errors and the fact that I am not able to spot any obvious ways to go about this.
Therefore, I am wondering if you could suggest any other ways to plot data for specific months with xarray, or whether there is any adjustment I need to make to the commands I have attempted.
I greatly appreciate your generous help.
Please let me know if you need any more further information, I will respond as soon as possible.
Thank you!
About your issues in attempts 1 and 3: the data of type object is the season coordinate created by the grouping.
You can visualize that by doing:
tempseason1 = tempgraph1.groupby("time.season").mean("time")
print(tempseason1.coords)
You should see something like:
Coordinates:
* lon (lon) float32 ...
* lat (lat) float32 ...
* season (season) object 'DJF' 'JJA' 'MAM' 'SON'
Notice that the season dimension has type object.
I think you should use resample instead of groupby here.
Resample is basically a groupby to upsample or downsample time series.
It would look like:
tempseason1 = tempgraph1.resample(time="Q").mean("time")
The argument "Q" is a pandas offset for quarterly frequency, see there for details.
I don't know much about plotting though.
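For the January to August case in the question, another option is to filter by month and then average per year, which leaves an integer year coordinate that plots without the object-type error. The sketch below runs on synthetic monthly data, since temperature.nc is not available here; the variable names jan_to_aug and yearly_mean are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly series standing in for the area-mean temperature (assumption).
time = pd.date_range("1981-01-01", "1999-12-01", freq="MS")
values = np.random.default_rng(0).normal(15.0, 5.0, time.size)
tempgraph1 = xr.DataArray(values, coords={"time": time}, dims="time", name="temperature")

# Keep only January (1) through August (8) with a boolean mask on the month.
is_jan_to_aug = tempgraph1["time.month"].isin(list(range(1, 9)))
jan_to_aug = tempgraph1.sel(time=is_jan_to_aug)

# One Jan-Aug mean per year; the new 'year' coordinate is numeric.
yearly_mean = jan_to_aug.groupby("time.year").mean("time")
print(yearly_mean)
```

Because year is an integer coordinate, yearly_mean.plot.line() should work where the season-labelled result did not.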

Reformatting y axis values in a multi-line plot in Python

Updated with more info
I've seen this answered on here for single line plots, but I need help with a plot showing two variables, if that matters at all... I am fairly new to python in general. My line graph shows two different departments' funding over the years. I just want to reformat the y axis to display as a number in the hundreds of millions.
Using a csv for the general public funding report of Minneapolis.
msp_df = pd.read_csv('Minneapolis_Data_Snapshot_v2.csv',error_bad_lines=False)
msp_df.info()
Saved just the two depts I was interested in, to a dataframe.
CPED_df = (msp_df['Unnamed: 0'] == 'CPED')
msp_df.iloc[CPED_df.values]
police_df = (msp_df['Unnamed: 0'] == 'Police')
msp_df.iloc[police_df.values]
("test" is the new name of my data frame containing all the info as seen below.)
test = pd.DataFrame({'Year': range(2014,2021),
'CPED': msp_df.iloc[CPED_df.values].T.reset_index(drop=True).drop(0,0)[5].tolist(),
'Police': msp_df.iloc[police_df.values].T.reset_index(drop=True).drop(0,0)[4].tolist()})
(The numbers from the original dataset were being read as strings because of the commas, so I had to fix that first.)
test['Police2'] = test['Police'].str.replace(',','').astype(int)
test['CPED2'] = test['CPED'].str.replace(',','').astype(int)
And here is my code for the plot. It executes; I just want to reformat the y axis number scale. Right now it just shows up as a decimal. (I've already imported pandas, seaborn and matplotlib.)
plt.plot(test.Year, test.Police2, test.Year, test.CPED2)
plt.ylabel('Budget in Hundreds of Millions')
plt.xlabel('Year')
Current plot
Any help super appreciated! Thanks :)
The easiest way to reformat the y axis and force it to show certain values is to use
plt.yticks(ticks, labels)
For example, if you want to display only values from 0 to 1, you can do:
plt.yticks([0,0.2,0.5,0.7,1], ['a', 'b', 'c', 'd', 'e'])
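If the goal is to relabel the ticks automatically rather than list them by hand, matplotlib's FuncFormatter is another option. The sketch below is only illustrative: the budget figures are invented placeholders (the real values come from the CSV), and the millions helper is a hypothetical name.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Placeholder numbers for illustration only, not the real Minneapolis data.
years = list(range(2014, 2021))
police = [145e6, 150e6, 157e6, 163e6, 179e6, 184e6, 193e6]
cped = [90e6, 95e6, 101e6, 98e6, 110e6, 115e6, 120e6]


def millions(value, _position):
    """Format a raw tick value such as 150000000 as '150M'."""
    return f"{value / 1e6:.0f}M"


fig, ax = plt.subplots()
ax.plot(years, police, label="Police")
ax.plot(years, cped, label="CPED")
ax.yaxis.set_major_formatter(FuncFormatter(millions))
ax.set_xlabel("Year")
ax.set_ylabel("Budget (millions of dollars)")
ax.legend()
```

With both lines drawn on the same axes, the formatter applies to the shared y axis, so it makes no difference that there are two variables.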

Plot each year of a time series on the same x-axis

I have a time series with daily data that I want to plot to see how it evolves over a year. I want to compare how it evolves over the year compared to previous years. I have written the following code in Python:
xindex = data['biljett'].index.month*30 + data['biljett'].index.day
plt.plot(xindex, data['biljett'])
plt.show()
The graph looks as follows:
A graph of how the data evolves over a year compared to previous years. The line is continuous and does not end with the end of the year, which makes it fuzzy. What am I doing wrong?
From a technical perspective, it happens because your data points are not sorted by date, so the line goes back and forth to connect the data points in data frame order. Sort the data based on xindex and you're good to go. To do that (first you need to put xindex in the data dataframe as a new column):
data.sort_values(by='xindex').reset_index(drop=True)
From a visualization perspective, I think you might have several values for each day count, so a line plot is not a good option to begin with. So IMHO you'd want to try plt.scatter() to visualize your data in a better way.
I have rewritten it as follows:
xindex = data['biljett'].index.month*30 + data['biljett'].index.day
data['biljett'].sort_values('xindex').reset_index(drop=True)
plt.plot(xindex, data['biljett'])
plt.show()
but I get the following error message:
ValueError: No axis named xindex for object type
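The error is probably raised because data['biljett'] is a Series, and in older pandas versions the first positional argument of Series.sort_values is axis, so 'xindex' gets interpreted as an axis name. One way to sidestep sorting altogether is to draw one line per year against the day of the year. The sketch below assumes a DatetimeIndex (as in the question) and uses synthetic data in place of the real biljett column.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic daily data standing in for the real 'biljett' series (assumption).
index = pd.date_range("2018-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(0)
data = pd.DataFrame({"biljett": rng.integers(50, 150, index.size)}, index=index)

fig, ax = plt.subplots()

# Group by calendar year and plot each year as its own line on a shared x-axis.
for year, yearly in data["biljett"].groupby(data.index.year):
    ax.plot(yearly.index.dayofyear, yearly.values, label=str(year))

ax.set_xlabel("Day of year")
ax.set_ylabel("biljett")
ax.legend()
```

Each year then starts at day 1 and stops at day 365 or 366, so no line wraps around from December back to January.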

Efficient method to extract data from netCDF files, with Xarray, into a tall DataFrame

I have a list of about 350 coordinates, which are coordinates within a specified area, that I want to extract from a netCDF file using Xarray. In case it is relevant, I am trying to extract SWE (snow water equivalent) data from a particular land surface model.
My problem is that this for loop takes forever to go through each item in the list and get the relevant timeseries data. Perhaps to some extent this is unavoidable, since I have to actually load the data from the netCDF file for each coordinate. What I need help with is speeding up the code in any way possible. Right now this is taking a very long time to run: 3+ hours and counting, to be more precise.
Here is everything I have done so far:
import xarray as xr
import numpy as np
import pandas as pd
from datetime import datetime as dt
1) First, open all of the files (daily data from 1915-2011).
df = xr.open_mfdataset(r'C:\temp\*.nc',combine='by_coords')
2) Narrow my location to a smaller box within the continental United States
swe_sub = df.swe.sel(lon=slice(246.695, 251), lat=slice(33.189, 35.666))
3) I just want to extract the first daily value for each month, which also narrows the timeseries.
swe_first = swe_sub.sel(time=swe_sub.time.dt.day == 1)
Now I want to load up my list of coordinates (which happens to be in an Excel file).
coord = pd.read_excel(r'C:\Documents\Coordinate_List.xlsx')
print(coord)
lat = coord['Lat']
lon = coord['Lon']
lon = 360+lon
name = coord['OBJECTID']
The following for loop goes through each coordinate in my list of coordinates, extracts the timeseries at each coordinate, and rolls it into a tall DataFrame.
Newdf = pd.DataFrame([])
for i,j,k in zip(lat,lon,name):
    dsloc = swe_first.sel(lat=i,lon=j,method='nearest')
    DT=dsloc.to_dataframe()
    # Insert the name of the station with preferred column title:
    DT.insert(loc=0,column="Station",value=k)
    Newdf=Newdf.append(DT,sort=True)
I would greatly appreciate any help or advice y'all can offer!
Alright, I figured this one out. It turns out I needed to load my subset of data into memory first, since Xarray "lazy loads" the data into the Dataset by default.
Here is the line of code that I revised to make this work properly:
swe_first = swe_sub.sel(time=swe_sub.time.dt.day == 1).persist()
Here is a link I found helpful for this issue:
https://examples.dask.org/xarray.html
I hope this helps someone else out too!
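In addition to loading the subset into memory, the Python loop itself can likely be replaced by a single vectorized selection, by passing DataArray indexers that share a common dimension. The sketch below is an assumption-laden illustration: it uses a small synthetic grid and three made-up stations instead of the real netCDF files and Excel list, and the names lat_points, lon_points and tall are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for swe_first (first-of-month SWE on a lat/lon grid).
time = pd.date_range("2000-01-01", periods=3, freq="MS")
lat = np.arange(33.0, 36.0, 0.5)
lon = np.arange(246.5, 251.5, 0.5)
swe_first = xr.DataArray(
    np.random.default_rng(0).random((time.size, lat.size, lon.size)),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="swe",
)

# Made-up station list standing in for the Excel coordinate file.
coord = pd.DataFrame({"OBJECTID": [101, 102, 103],
                      "Lat": [33.2, 34.6, 35.1],
                      "Lon": [-113.1, -111.4, -109.6]})

# Indexers that share the 'Station' dimension select point by point, not a grid.
station = coord["OBJECTID"].to_numpy()
lat_points = xr.DataArray(coord["Lat"].to_numpy(), dims="Station", coords={"Station": station})
lon_points = xr.DataArray((360 + coord["Lon"]).to_numpy(), dims="Station", coords={"Station": station})

# One call extracts the nearest grid cell for every station at every time step.
points = swe_first.sel(lat=lat_points, lon=lon_points, method="nearest")
tall = points.to_dataframe().reset_index()
print(tall.head())
```

The resulting frame has one row per time step and station, which matches the tall layout the loop was building, without repeated DataFrame appends.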
