Issue with the aggregation function in the pipeline during online ingest - python

I am seeing an issue with the aggregation function (part of a pipeline) during online ingest: the aggregation output is not what I expect (I get the value 0 instead of 6). The pipeline is really very simple.
See the relevant part of the code (Python and MLRun):
import datetime
import pandas as pd
import mlrun
import mlrun.feature_store as fstore
from mlrun.datastore.targets import ParquetTarget, NoSqlTarget

# Prepare data: four columns key0, key1, fn1, sysdate
data = {"key0": [1, 1, 1, 1, 1, 1], "key1": [0, 0, 0, 0, 0, 0], "fn1": [1, 1, 2, 3, 1, 0],
        "sysdate": [datetime.datetime(2021, 1, 1, 1), datetime.datetime(2021, 1, 1, 1),
                    datetime.datetime(2021, 1, 1, 1), datetime.datetime(2021, 1, 1, 1),
                    datetime.datetime(2021, 1, 1, 1), datetime.datetime(2021, 1, 1, 1)]}
input_df = pd.DataFrame(data)

# Create project and feature set with NoSqlTarget & ParquetTarget
project_name = "jist-agg"
project = mlrun.get_or_create_project(project_name, context='./', user_project=False)
feature_set = featureGetOrCreate(True, project_name, 'sample')  # helper (defined elsewhere) that returns the feature set with NoSqlTarget & ParquetTarget

# Add a simple aggregation 'agg1'
feature_set.add_aggregation(name='fn1', column='fn1', operations=['count'], windows=['60d'], step_name="agg1")

# Ingest data to the online and offline targets
output_df = fstore.ingest(feature_set, input_df, overwrite=True, infer_options=fstore.InferOptions.default())

# Read data from the online source
svc = fstore.get_online_feature_service(fstore.FeatureVector("my-vec", ["sample.*"], with_indexes=True))
resp = svc.get([{"key0": 1, "key1": 0}])

# Output validation
assert resp[0]['fn1_count_60d'] == 6.0, 'Mistake in solution'
Do you see the mistake?

The whole code is valid; the issue is one of knowledge ;-).
The key piece of information is that aggregation for the online target works backwards from NOW. If today is 03.02.2023, the aggregation window is the last 60 days (see windows=['60d'] in the code), while the source data is dated 01.01.2021, so it falls outside the window.
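A quick standalone sanity check of that explanation (plain Python, not MLRun; the timestamp is taken from the example above):
import datetime
# Is an event stamped 2021-01-01 still inside a 60-day window measured back from "now"?
now = datetime.datetime.now()                  # e.g. 2023-02-03 in the example above
event_time = datetime.datetime(2021, 1, 1, 1)
window = datetime.timedelta(days=60)
print(now - event_time <= window)              # False -> the event is outside the online window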
You have two possible solutions:
1. Change the input data (move the dates from 2021 to 2023)
# Prepare data, four columns key0, key1, fn1, sysdate
data = {"key0":[1,1,1,1,1,1], "key1":[0,0,0,0,0,0],"fn1":[1,1,2,3,1,0],
"sysdate":[datetime.datetime(2023,1,1,1), datetime.datetime(2023,1,1,1),
datetime.datetime(2023,1,1,1), datetime.datetime(2023,1,1,1),
datetime.datetime(2023,1,1,1), datetime.datetime(2023,1,1,1)]}
or
2. Extend the aggregation window (e.g. 3 years ≈ 1095 days)
# Add easy aggregation 'agg1'
feature_set.add_aggregation(name='fn1',column='fn1',operations=['count'],windows=['1095d'],step_name="agg1")

Related

HDF5 tagging datasets to events in other datasets

I am sampling time series data off various machines, and every so often need to collect a large high frequency burst of data from another device and append it to the time series data.
Imagine I am measuring temperature over time, and for every 10-degree increase in temperature I sample a microcontroller at 200 kHz. I want to be able to tag the large burst of micro data to a timestamp in the time-series data, maybe even in the form of a figure.
I was trying to do this with regionref, but I am struggling to find an elegant solution, and I'm finding myself juggling between pandas HDFStore and h5py, and it just feels messy.
Initially I thought I would be able to make separate datasets from the burst data, then use references or links to timestamps in the time-series data. But no luck so far.
Any way to reference a large packet of data to a timestamp in another pile of data would be appreciated!
How did you use region references? I assume you had an array of references, alternating between a range of "standard rate" and "burst rate" data. That is a valid approach, and it will work. However, you are correct: it's messy to create, and messy to recover the data.
Virtual Datasets might be a more elegant solution... but tracking and creating the virtual layout definitions could get messy too. :-) However, once you have the virtual dataset, you can read it with typical slice notation; HDF5/h5py handles everything under the covers.
To demonstrate, I created a "simple" example (realizing virtual datasets aren't "simple"). That said, if you can figure out region references, you can figure out virtual datasets. Here is a link to the h5py Virtual Dataset Documentation and Example for details. Here is a short summary of the process:
1. Define the virtual layout: this is the shape and dtype of the virtual dataset that will point to the other datasets.
2. Define the virtual sources. Each is a reference to an HDF5 file and dataset (one virtual source per file/dataset combination).
3. Map virtual source data to the virtual layout (you can use slice notation, which is shown in my example).
4. Repeat steps 2 and 3 for all sources (or slices of sources).
Note: virtual datasets can be in a separate file, or in the same file as the referenced datasets. I will show both in the example. (Once you have defined the layout and sources, both methods are equally easy.)
There are at least 3 other SO questions and answers on this topic:
h5py, enums, and VirtualLayout
h5py error reading virtual dataset into NumPy array
How to combine multiple hdf5 files into one file and dataset?
Example follows:
Step 1: Create some example data. Without your schema, I guessed at how you stored "standard rate" and "burst rate" data. All standard rate data is stored in dataset 'data_log' and each burst is stored in a separate dataset named: 'burst_log_##'.
import numpy as np
import h5py

log_ntimes = 31
log_inc = 1e-3

arr = np.zeros((log_ntimes, 2))
for i in range(log_ntimes):
    time = i * log_inc
    arr[i, 0] = time
    #temp = 70. + 100.*time
    #print(f'For Time = {time:.5f} ; Temp= {temp:.4f}')

arr[:, 1] = 70. + 100.*arr[:, 0]
#print(arr)

with h5py.File('SO_72654160.h5', 'w') as h5f:
    h5f.create_dataset('data_log', data=arr)

n_bursts = 4
burst_ntimes = 11
burst_inc = 5e-5

for n in range(1, n_bursts):
    arr = np.zeros((burst_ntimes-1, 2))
    for i in range(1, burst_ntimes):
        burst_time = 0.01*(n)
        time = burst_time + i*burst_inc
        arr[i-1, 0] = time
        #temp = 70. + 100.*t
    arr[:, 1] = 70. + 100.*arr[:, 0]

    with h5py.File('SO_72654160.h5', 'a') as h5f:
        h5f.create_dataset(f'burst_log_{n:02}', data=arr)
Step 2: This is where the virtual layout and sources are defined and used to create the virtual dataset. This creates a virtual dataset in a new file, and one in the existing file. (The statements are identical except for the file name and mode.)
source_file = 'SO_72654160.h5'

a0 = 0
with h5py.File(source_file, 'r') as h5f:
    for ds_name in h5f:
        a0 += h5f[ds_name].shape[0]
print(f'Total data rows in source = {a0}')

# alternate getting data from
# dataset: data_log, get rows 0-11, 11-21, 21-31
# datasets: burst_log_01, burst_log_02, etc (each has 10 rows)

# Define virtual dataset layout
layout = h5py.VirtualLayout(shape=(a0, 2), dtype=float)

# Map virtual dataset to logged data
vsource1 = h5py.VirtualSource(source_file, 'data_log', shape=(31, 2))
layout[0:11, :] = vsource1[0:11, :]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_01', shape=(10, 2))
layout[11:21, :] = vsource2
layout[21:31, :] = vsource1[11:21, :]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_02', shape=(10, 2))
layout[31:41, :] = vsource2
layout[41:51, :] = vsource1[21:31, :]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_03', shape=(10, 2))
layout[51:61, :] = vsource2

# Create NEW file, then add virtual dataset
with h5py.File('SO_72654160_VDS.h5', 'w') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 1 = {h5vds["vdata"].shape[0]}')

# Open EXISTING file, then add virtual dataset
with h5py.File('SO_72654160.h5', 'a') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 2 = {h5vds["vdata"].shape[0]}')
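As a quick follow-up (a sketch, not part of the original answer): once the virtual dataset exists, it can be read back like any regular dataset with plain slice notation.
# Read-back sketch, assuming the files created above exist
with h5py.File('SO_72654160_VDS.h5', 'r') as h5f:
    vdata = h5f['vdata']
    print(vdata.shape)        # (61, 2) for the example data above
    print(vdata[9:13, :])     # rows spanning the data_log -> burst_log_01 boundary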

Can I loop the same analysis across multiple csv dataframes then concatenate results from each into one table?

newbie python learner here!
I have 20 participant CSV files (P01.csv to P20.csv) containing Stroop test data. The important columns in each are the condition column (a random mix of incongruent and congruent trials), the reaction time column, and the column indicating whether the response was correct (True or False).
Here is an example of the dataframe for P01 (I'm not sure if this counts as a code snippet):
trialnum,colourtext,colourname,condition,response,rt,correct
1,blue,red,incongruent,red,0.767041,True
2,yellow,yellow,congruent,yellow,0.647259,True
3,green,blue,incongruent,blue,0.990185,True
4,green,green,congruent,green,0.720116,True
5,yellow,yellow,congruent,yellow,0.562909,True
6,yellow,yellow,congruent,yellow,0.538918,True
7,green,yellow,incongruent,yellow,0.693017,True
8,yellow,red,incongruent,red,0.679368,True
9,yellow,blue,incongruent,blue,0.951432,True
10,blue,blue,congruent,blue,0.633367,True
11,blue,green,incongruent,green,1.289047,True
12,green,green,congruent,green,0.668142,True
13,blue,red,incongruent,red,0.647722,True
14,red,blue,incongruent,blue,0.858307,True
15,red,red,congruent,red,1.820112,True
16,blue,green,incongruent,green,1.118404,True
17,red,red,congruent,red,0.798532,True
18,red,red,congruent,red,0.470939,True
19,red,blue,incongruent,blue,1.142712,True
20,red,red,congruent,red,0.656328,True
21,red,yellow,incongruent,yellow,0.978830,True
22,green,red,incongruent,red,1.316182,True
23,yellow,yellow,congruent,green,0.964292,False
24,green,green,congruent,green,0.683949,True
25,yellow,green,incongruent,green,0.583939,True
26,green,blue,incongruent,blue,1.474140,True
27,green,blue,incongruent,blue,0.569109,True
28,green,green,congruent,blue,1.196470,False
29,red,red,congruent,red,4.027546,True
30,blue,blue,congruent,blue,0.833177,True
31,red,red,congruent,red,1.019672,True
32,green,blue,incongruent,blue,0.879507,True
33,red,red,congruent,red,0.579254,True
34,red,blue,incongruent,blue,1.070518,True
35,blue,yellow,incongruent,yellow,0.723852,True
36,yellow,green,incongruent,green,0.978838,True
37,blue,blue,congruent,blue,1.038232,True
38,yellow,green,incongruent,yellow,1.366425,False
39,green,red,incongruent,red,1.066038,True
40,blue,red,incongruent,red,0.693698,True
41,red,blue,incongruent,blue,1.751062,True
42,blue,blue,congruent,blue,0.449651,True
43,green,red,incongruent,red,1.082267,True
44,blue,blue,congruent,blue,0.551023,True
45,red,blue,incongruent,blue,1.012258,True
46,yellow,green,incongruent,yellow,0.801443,False
47,blue,blue,congruent,blue,0.664119,True
48,red,green,incongruent,yellow,0.716189,False
49,green,green,congruent,yellow,0.630552,False
50,green,yellow,incongruent,yellow,0.721917,True
51,red,red,congruent,red,1.153943,True
52,blue,red,incongruent,red,0.571019,True
53,yellow,yellow,congruent,yellow,0.651611,True
54,blue,blue,congruent,blue,1.321344,True
55,green,green,congruent,green,1.159240,True
56,blue,blue,congruent,blue,0.861646,True
57,yellow,red,incongruent,red,0.793069,True
58,yellow,yellow,congruent,yellow,0.673190,True
59,yellow,red,incongruent,red,1.049320,True
60,red,yellow,incongruent,yellow,0.773447,True
61,red,yellow,incongruent,yellow,0.693554,True
62,red,red,congruent,red,0.933901,True
63,blue,blue,congruent,blue,0.726794,True
64,green,green,congruent,green,1.046116,True
65,blue,blue,congruent,blue,0.713565,True
66,blue,blue,congruent,blue,0.494177,True
67,green,green,congruent,green,0.626399,True
68,blue,blue,congruent,blue,0.711896,True
69,blue,blue,congruent,blue,0.460420,True
70,green,green,congruent,yellow,1.711978,False
71,blue,blue,congruent,blue,0.634218,True
72,yellow,blue,incongruent,yellow,0.632482,False
73,yellow,yellow,congruent,yellow,0.653813,True
74,green,green,congruent,green,0.808987,True
75,blue,blue,congruent,blue,0.647117,True
76,green,red,incongruent,red,1.791693,True
77,red,yellow,incongruent,yellow,1.482570,True
78,red,red,congruent,red,0.693132,True
79,red,yellow,incongruent,yellow,0.815830,True
80,green,green,congruent,green,0.614441,True
81,yellow,red,incongruent,red,1.080385,True
82,red,green,incongruent,green,1.198548,True
83,blue,green,incongruent,green,0.845769,True
84,yellow,blue,incongruent,blue,1.007089,True
85,green,blue,incongruent,blue,0.488701,True
86,green,green,congruent,yellow,1.858272,False
87,yellow,yellow,congruent,yellow,0.893149,True
88,yellow,yellow,congruent,yellow,0.569597,True
89,yellow,yellow,congruent,yellow,0.483542,True
90,yellow,red,incongruent,red,1.669842,True
91,blue,green,incongruent,green,1.158416,True
92,blue,red,incongruent,red,1.853055,True
93,green,yellow,incongruent,yellow,1.023785,True
94,yellow,blue,incongruent,blue,0.955395,True
95,yellow,yellow,congruent,yellow,1.303260,True
96,blue,yellow,incongruent,yellow,0.737741,True
97,yellow,green,incongruent,green,0.730972,True
98,green,red,incongruent,red,1.564596,True
99,yellow,yellow,congruent,yellow,0.978911,True
100,blue,yellow,incongruent,yellow,0.508151,True
101,red,green,incongruent,green,1.821969,True
102,red,red,congruent,red,0.818726,True
103,yellow,yellow,congruent,yellow,1.268222,True
104,yellow,yellow,congruent,yellow,0.585495,True
105,green,green,congruent,green,0.673404,True
106,blue,yellow,incongruent,yellow,1.407036,True
107,red,red,congruent,red,0.701050,True
108,red,green,incongruent,red,0.402334,False
109,red,green,incongruent,green,1.537681,True
110,green,yellow,incongruent,yellow,0.675118,True
111,green,green,congruent,green,1.004550,True
112,yellow,blue,incongruent,blue,0.627439,True
113,yellow,yellow,congruent,yellow,1.150248,True
114,blue,yellow,incongruent,yellow,0.774452,True
115,red,red,congruent,red,0.860966,True
116,red,red,congruent,red,0.499595,True
117,green,green,congruent,green,1.059725,True
118,red,red,congruent,red,0.593180,True
119,green,yellow,incongruent,yellow,0.855915,True
120,blue,green,incongruent,green,1.335018,True
But I am only interested in the 'condition', 'rt', and 'correct' columns.
I need to create a table with the mean reaction time for the congruent and incongruent conditions and the percentage correct for each condition, as one overall table of results for all participants. I am aiming for something like this as an output table:
Participant  Stimulus Type  Mean Reaction Time  Percentage Correct
01           Congruent      0.560966            80
01           Incongruent    0.890556            64
02           Congruent      0.460576            89
02           Incongruent    0.956556            55
Etc. for all 20 participants. This is just an example of my ideal output, because later I'd like to plot a graph of the mean from each condition across participants. But if anyone thinks that table does not make sense or is inefficient, I'm open to any advice!
I want to use pandas but don't know where to begin finding the rt means for each condition when there are two different conditions in the same column of each dataframe. I'm assuming I need some kind of loop that can run over each participant CSV file and then concatenate the results into a table for all the participants?
Initially, after struggling to figure out the loop I would need and looking on the web, I ran the code below. It worked to concatenate all of the participants' dataframes, and I hoped it would let me do the same analysis on all of them at once. The problem is that it doesn't identify the individual participant for each row (there are 120 rows per participant, like the example above) in the combined table:
import os
import glob
import pandas as pd
#set working directory
os.chdir('data')
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
Perhaps I could add a participant column to identify each participant's data in the concatenated table, and then compute the mean and percentage correct for the two conditions per participant in that big concatenated table (a sketch of that idea follows below)?
Or would it be better to do the analysis first and then loop it over all of the individual participant CSV files?
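A rough sketch of that first idea (purely illustrative, assuming filenames like P01.csv and reusing all_filenames from the code above):
# Tag each frame with its participant ID before concatenating,
# so rows stay identifiable in the combined table.
frames = []
for f in all_filenames:
    one = pd.read_csv(f)
    one["participant"] = f[1:3]   # "P01.csv" -> "01" (assumed naming convention)
    frames.append(one)
combined_csv = pd.concat(frames, ignore_index=True)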
I'm sorry if this is a really obvious process; I'm new to Python and trying to learn to analyse my data more efficiently. I've been scouring the Internet and pandas tutorials but I'm stuck. I've also never used Stack Overflow before, so sorry if I haven't formatted things correctly here, but thanks for the feedback about including examples of the input data, the code I've tried, and the desired output; I really appreciate the help.
Try this:
from pathlib import Path
import pandas as pd

# Use the Path class to represent a path. It offers more
# functionality when performing operations on paths
path = Path("./data").resolve()

# Create a dictionary whose keys are the participant IDs
# (the `01` in `P01.csv`, etc.), and whose values are
# the data frames initialized from the CSVs
data = {
    p.stem[1:]: pd.read_csv(p) for p in path.glob("*.csv")
}

# Create a master data frame by combining the individual
# data frames from each CSV file
df = pd.concat(data, keys=data.keys(), names=["participant", None])

# Calculate the statistics
result = (
    df.groupby(["participant", "condition"]).agg(**{
        "Mean Reaction Time": ("rt", "mean"),
        "correct": ("correct", "sum"),
        "size": ("trialnum", "size")
    }).assign(**{
        # express as a percentage (0-100) to match the desired output
        "Percentage Correct": lambda x: 100 * x["correct"] / x["size"]
    }).drop(columns=["correct", "size"])
    .reset_index()
)
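As an optional follow-up for the plotting step mentioned in the question (a sketch, not part of the original answer; assumes matplotlib is installed):
# Pivot the summary so each condition becomes a column,
# then plot the mean reaction time per participant.
pivoted = result.pivot(index="participant", columns="condition",
                       values="Mean Reaction Time")
pivoted.plot(kind="bar")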

Python good practice with NetCDF4 shared dimensions across groups

This question is conceptual rather than about a specific error.
I am working with the Python netCDF4 API to translate and store binary datagram packets from multiple sensors packaged in a single file. My question concerns the scope of dimensions and best-use practices.
According to the Netcdf4 convention and metadata docs, dimension scope is such that all child groups have access to a dimension defined in the parent group (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#_scope).
Context:
The data packets from the multiple sensors are written to a binary file. Timing adjustments are handled prior to writing the binary file, so we can trust the timestamp of a data packet. The sampling rates are not synchronous: sensor 1 samples at, say, 1 Hz, sensor 2 at 100 Hz, and sensors 1 and 2 measure a number of different variables.
Questions:
Do I define a single, unlimited time dimension at the root level and create multiple variables using that dimension, or create individual time dimensions at the group level? Pseudo-code below.
In setting up the netcdf I would use the following code:
import netCDF4

data_set = netCDF4.Dataset(file_name, mode='w')
dim_time = data_set.createDimension(dimname='time', size=None)

grp_1 = data_set.createGroup('grp1')
var_time_1 = grp_1.createVariable(varname='sensor_1_time', datatype='f8', dimensions=('time',))
var_time_1[:] = sensor_1_time  # timestamps of data values from sensor 1
var_1 = grp_1.createVariable(varname='sensor_1_data', datatype='f8', dimensions=('time',))
var_1[:] = sensor_1_data  # data values from sensor 1

grp_2 = data_set.createGroup('grp2')
var_time_2 = grp_2.createVariable(varname='sensor_2_time', datatype='f8', dimensions=('time',))
var_time_2[:] = sensor_2_time
var_2 = grp_2.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
var_2[:] = sensor_2_data  # data values from sensor 2
The group separation is not necessarily by sensor but by logical data grouping. In the case that data from two sensors falls into multiple groups, is it best to replicate the time array in each group, or is it acceptable to reference other groups using the scope mechanism?
import netCDF4

data_set = netCDF4.Dataset(file_name, mode='w')
dim_time = data_set.createDimension(dimname='time', size=None)

grp_env = data_set.createGroup('env_data')
sensor_time_1 = grp_env.createVariable(varname='sensor_1_time', datatype='f8', dimensions=('time',))
sensor_time_1[:] = sensor_1_time  # timestamps of data values from sensor 1
env_1 = grp_env.createVariable(varname='sensor_1_data', datatype='f8', dimensions=('time',))
env_1[:] = sensor_1_data  # data values from sensor 1
env_2 = grp_env.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
env_2.coordinates = "/grp_platform/sensor_time_1"

grp_platform = data_set.createGroup('platform')
sensor_time_2 = grp_platform.createVariable(varname='sensor_2_time', datatype='f8', dimensions=('time',))
sensor_time_2[:] = sensor_2_time
plt_2 = grp_platform.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
plt_2[:] = sensor_2_data  # data values from sensor 2
Most examples do not deal with this cross-group functionality, and I can't seem to find best practices. I'd love some advice, or even a push in the right direction.

Storing L2 tick data with Python

Preamble:
I am working with L2 tick data.
The bid/offer will not necessarily be balanced in terms of number of levels
The number of levels could range from 0 to 20.
I want to save the full book to disk every time it is updated
I believe I want to use a numpy array so that I can use h5py/vaex to perform offline data processing.
I'll ideally be writing (appending) to disk every x updates or on a timer.
If we assume an example book looks like this:
array([datetime.datetime(2017, 11, 6, 14, 57, 8, 532152), # book creation time
array(['20171106-14:57:08.528', '20171106-14:57:08.428'], dtype='<U21'), # quote entry (bid)
array([1.30699, 1.30698]), # quote price (bid)
array([100000., 250000.]), # quote size (bid)
array(['20171106-14:57:08.528'], dtype='<U21'), # quote entry (offer)
array([1.30709]), # quote price (offer)
array([100000.])], # quote size (offer)
dtype=object)
NumPy doesn't like the jaggedness of the array, and while I'm happy (enough) to use np.pad to pad the times/prices/sizes to a length of 20, I don't think I want to be creating an array for the book creation time.
Could/should I be going about this differently? Ultimately I'll want to do as-of joins against a list of trades, hence I'd like a column-store approach. How is everyone else doing this? Are they storing multiple rows, or multiple columns?
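For reference, the np.pad idea above would look roughly like this (illustrative only; it pads one side of the book out to a fixed depth of 20 with NaNs):
import numpy as np
bid_px = np.array([1.30699, 1.30698])
bid_px_padded = np.pad(bid_px, (0, 20 - bid_px.size), constant_values=np.nan)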
EDIT:
I want to be able to do something like:
with h5py.File("foo.h5", "w") as f:
    f.create_dataset("book", data=my_np_array)  # "book" is just a placeholder dataset name
and then later perform an asof join between my hdf5 tickdata and a dataframe of trades.
EDIT2:
In KDB the entry would look like:
q)t:([]time:2017.11.06D14:57:08.528;sym:`EURUSD;bid_time:enlist 2017.11.06T14:57:08.528 20171106T14:57:08.428;bid_px:enlist 1.30699, 1.30698;bid_size:enlist 100000. 250000.;ask_time:enlist 2017.11.06T14:57:08.528;ask_px:enlist 1.30709;ask_size:enlist 100000.)
q)t
time sym bid_time bid_px bid_size ask_time ask_px ask_size
-----------------------------------------------------------------------------------------------------------------------------------------------------------
2017.11.06D14:57:08.528000000 EURUSD 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428 1.30699 1.30698 100000 250000 2017.11.06T14:57:08.528 1.30709 100000
q)first t
time | 2017.11.06D14:57:08.528000000
sym | `EURUSD
bid_time| 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428
bid_px | 1.30699 1.30698
bid_size| 100000 250000f
ask_time| 2017.11.06T14:57:08.528
ask_px | 1.30709
ask_size| 100000f
EDIT3:
Should I just give in on the idea of a nested column and have 120 columns (20 * (bid_times + bid_prices + bid_sizes + ask_times + ask_prices + ask_sizes))? Seems excessive, and unwieldy to work with...
For anyone stumbling across this ~2 years later: I have recently revisited this code and have swapped out h5py for pyarrow + parquet.
This means I can create a schema with nested columns and read that back into a pandas DataFrame with ease:
import pyarrow as pa
schema = pa.schema([
("Time", pa.timestamp("ns")),
("Symbol", pa.string()),
("BidTimes", pa.list_(pa.timestamp("ns"))),
("BidPrices", pa.list_(pa.float64())),
("BidSizes", pa.list_(pa.float64())),
("BidProviders", pa.list_(pa.string())),
("AskTimes", pa.list_(pa.timestamp("ns"))),
("AskPrices", pa.list_(pa.float64())),
("AskSizes", pa.list_(pa.float64())),
("AskProviders", pa.list_(pa.string())),
])
In terms of streaming the data to disk, I use pq.ParquetWriter.write_table - keeping track of open filehandles (one per Symbol) so that I can append to the file, only closing (and thus writing metadata) when I'm done.
Rather than streaming pyarrow tables, I stream regular Python dictionaries, creating a Pandas DataFrame when I hit a given size (e.g. 1024 rows) which I then pass to the ParquetWriter to write down.
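A minimal sketch of that buffering/writing pattern (the helper name, batch size, and file path are illustrative, not from the original post; it reuses the schema defined above):
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 1024   # flush threshold; adjust to taste
rows = []           # buffered book updates as plain Python dicts
writer = None       # in practice, one ParquetWriter per symbol

def flush(path="EURUSD.parquet"):
    """Convert the buffered dicts to a table and append it to the open file."""
    global writer
    table = pa.Table.from_pandas(pd.DataFrame(rows), schema=schema)
    if writer is None:
        writer = pq.ParquetWriter(path, schema)
    writer.write_table(table)
    rows.clear()

# per update: rows.append(update_dict); once len(rows) >= BATCH_SIZE, call flush()
# when finished: flush any remainder, then writer.close() to write the footer metadata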

Alpha Vantage stockinfo only collects 4 dfs properly formatted, not 6

I can get stock info for 4 tickers from Alpha Vantage before the remaining DataFrames stop getting the stock info I ask for. As a result, my concatenated df gets interpreted as NoneType (because the first 4 dfs are formatted differently than the last 2). That is not my problem, though; the fact that I only get 4 of my requests is. If I can fix that, the resulting concatenated df will be intact.
My code
import pandas as pd
import datetime
import requests
from alpha_vantage.timeseries import TimeSeries
import time

tickers = []

def alvan_csv(stocklist):
    api_key = 'demo'  # For use with Alpha Vantage stock-info retrieval.
    for ticker in stocklist:
        #data=requests.get('https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=%s&apikey={}'.format(api_key) %(ticker))
        df = pd.read_csv('https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&datatype=csv&symbol=%s&apikey={}'.format(api_key) %(ticker))  #, index_col = 0) &outputsize=full
        df['ticker'] = ticker
        tickers.append(df)
        # concatenate all the dfs
        df = pd.concat(tickers)
        print('\ndata before json parsing for === %s ===\n%s' %(ticker,df))
        df['adj_close'] = df['adjusted_close']
        del df['adjusted_close']
        df['date'] = df['timestamp']
        del df['timestamp']
        df = df[['date','ticker','adj_close','volume','dividend_amount','split_coefficient','open','high','low']]
        df = df.sort_values(['ticker','date'])
        time.sleep(20.3)
        print('\ndata after col reshaping for === %s ===\n%s' %(ticker,df))
    return df

if __name__ == '__main__':
    stocklist = ['vws.co','nflx','mmm','abt','msft','aapl']
    df = alvan_csv(stocklist)
NB: to use the Alpha Vantage API, you need a free API key, which you may obtain here: https://www.alphavantage.co/support/#api-key
Replace the demo API key with your own API key to make this code work.
Any ideas on how to get this to work?
Apparently Alpha Vantage has a pretty low fair-usage allowance, measured in queries per minute. So in effect only the first 4 stocks are allowed at full speed; the remaining stocks need a pause before downloading so as not to violate the fair-usage policy.
I have now introduced a pause between my stock queries. At the moment I get approx. 55% of my stocks if I pause for 10 seconds between calls, and 100% if I pause for 15 seconds.
I will be testing exactly how low the pause can be set while still allowing 100% of the stocks to come through.
I must say that compared to the super high-speed train we had at finance.yahoo.com, this strikes me as a steam train: really, really slow downloads. Getting my 500 or so tickers takes me 2½ hours. But beggars can't be choosers; this is a free service and I will manage with it.
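A minimal sketch of that pacing approach (illustrative only; PAUSE_SECONDS reflects the 15-second interval mentioned above, and fetch_one stands in for whatever per-ticker download function you use):
import time

PAUSE_SECONDS = 15  # interval found sufficient for 100% of the tickers to come through

def fetch_all(stocklist, fetch_one):
    frames = []
    for i, ticker in enumerate(stocklist):
        if i:  # no need to wait before the first request
            time.sleep(PAUSE_SECONDS)
        frames.append(fetch_one(ticker))
    return frames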
