How to store multiple related time series in Pandas - python

I'm new to pandas and would like some insight from the pros. I need to perform various statistical analyses (multiple regression, correlation, etc.) on 30+ time series of financial securities' daily Open, High, Low, Close prices. Each series has 500-1500 days of data. As each analysis looks at multiple securities, I'm wondering whether it's preferable, from an ease-of-use and efficiency perspective, to store each time series in a separate df (each with date as the index), or to merge them all into a single df with a single date index, which would effectively be a 3D df. If the latter, any recommendations on how to structure it?
Any thoughts much appreciated.
PS. I'm working my way up to working with intraday data across multiple timezones but that's a bit much for my first pandas project; this is a first step in that direction.

Since you're only dealing with OHLC, it's not that much data to process, which is good.
For these types of things I usually use a MultiIndex (http://pandas.pydata.org/pandas-docs/stable/indexing.html) with symbol as the first level and date as the second. Then you only need the columns Open, High, Low, Close and you're all set.
To access a cross-section of a MultiIndex, use the .xs method.
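A minimal sketch of that layout, with made-up symbols and random prices (the names here are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Made-up OHLC frames, one per symbol, each indexed by date
dates = pd.date_range("2023-01-02", periods=3, freq="B")
frames = {
    sym: pd.DataFrame(
        np.random.rand(len(dates), 4),
        index=dates,
        columns=["Open", "High", "Low", "Close"],
    )
    for sym in ["AAPL", "MSFT"]
}

# Stack into one frame with a (symbol, date) MultiIndex
df = pd.concat(frames, names=["symbol", "date"])

# .xs pulls out a cross-section, e.g. all rows for one symbol
aapl = df.xs("AAPL", level="symbol")
print(aapl.shape)  # (3, 4)
```

Each analysis can then slice out just the symbols it needs while everything lives in one frame.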

Unless you are going to correlate everything with everything, my suggestion is to put the series into separate dataframes and keep them all in a dictionary, i.e. {"Timeseries1": df1, "Timeseries2": df2, ...}. Then, when you want to correlate some of the time series together, you can merge them, using suffixes on the columns of each df to differentiate between them.
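A short sketch of that dictionary approach, with two made-up one-column series (names and values are illustrative):

```python
import pandas as pd

dates = pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04"])
store = {
    "Timeseries1": pd.DataFrame({"Close": [10.0, 10.5, 11.0]}, index=dates),
    "Timeseries2": pd.DataFrame({"Close": [20.0, 19.5, 19.0]}, index=dates),
}

# Merge on the date index, with suffixes to tell the columns apart
merged = store["Timeseries1"].merge(
    store["Timeseries2"],
    left_index=True, right_index=True, suffixes=("_ts1", "_ts2"),
)
print(merged.columns.tolist())  # ['Close_ts1', 'Close_ts2']

# Now the two series can be correlated directly
corr = merged["Close_ts1"].corr(merged["Close_ts2"])
```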
You may also be interested in the talk Python for Financial Data Analysis with pandas by the author of pandas himself.

Related

Undersampling large dataset under specific conditon applied to other column in python/pandas

I'm currently working with a large dataset (about 40 columns and tens of thousands of rows) and I would like to undersample it to be able to work with it more easily.
For the undersampling, unlike pandas' resample method, which resamples according to a timedelta, I'm trying to specify conditions on other columns to determine which data points to keep.
I'm not sure this is clear, but for example, let's say I have 3 columns (index, time and temperature) as follows:
Now for the resampling, I would like to keep a data point every 1 s or every 2 °C; the resulting dataset would look like this:
I couldn't find a simple way of doing this with pandas. The only way I found was to iterate over the rows, but that was very slow given the size of my datasets.
I thought about using the diff method, but of course it can only compute the difference over a specified period; the same goes for pct_change, which could have been used to keep only the points in the regions where the variation is greatest.
Thanks in advance for any suggestions on how to proceed with this resampling.
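For what it's worth, here is one straightforward (if unvectorized) sketch of this kind of conditional undersampling, assuming columns named time (seconds) and temperature (°C) and the 1 s / 2 °C thresholds from the question — the keep-or-skip decision depends on the last *kept* row, which is why a plain loop is the natural fit:

```python
import pandas as pd

# Made-up data: time in seconds, temperature in °C
df = pd.DataFrame({
    "time":        [0.0, 0.3, 0.6, 1.2, 1.5, 2.4],
    "temperature": [20.0, 20.5, 23.0, 23.5, 24.0, 26.5],
})

# Keep a row once 1 s has elapsed OR the temperature has moved by
# 2 °C since the last kept row (numba can speed this loop up a lot)
kept = [0]
last_t = df.loc[0, "time"]
last_temp = df.loc[0, "temperature"]
for i in range(1, len(df)):
    t, temp = df.loc[i, "time"], df.loc[i, "temperature"]
    if t - last_t >= 1.0 or abs(temp - last_temp) >= 2.0:
        kept.append(i)
        last_t, last_temp = t, temp

sampled = df.iloc[kept]
```

On this toy frame the loop keeps rows 0, 2 (temperature jumped 3 °C) and 5 (1.8 s elapsed).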

Vectorized Backtester in Pandas/Python: Loop through each stock as a new dataframe or put it all in one dataframe?

I've been trying to build my own simple vectorized backtester in Pandas/Python to create a simple way to test some trading strategies. I have been using this article as a guide and it has been pretty helpful.
I want to perform a simple portfolio backtest of, say, 10 stocks/ETFs. For each stock I will have a dataframe with date as the row index and columns for the Open, High, Low, Close prices on that date (financial time series data). So I will have 10 of these dataframes, with 4 columns each. What would be the most pythonic and efficient way to do the backtest:
Work on each dataframe separately, by looping through and carrying out my calculations on each dataframe then summing the profits at the end.
OR
Concatenating all the dataframes together and just working on the one dataframe
In the example article I have been using, the author works with just one dataframe, but he only uses the Close price, so he doesn't need a column multi-index. I would need a column multi-index (level 0 is the stock name, level 1 is Open, High, Low, Close, etc.), and given my beginner pandas status, that's making things complicated for me. I've been thinking it would be easier to create a loop and work with 10 separate dataframes, but I'm wondering if this is just lazy and will hinder my development in the long run.
A df of closes is the simplest. You need a MultiIndex (https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) to use the other fields. The issue I found with a MultiIndex is that adding columns to it requires some hacking of the df on every change.
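A small sketch of that column-MultiIndex layout, built with pd.concat over made-up tickers and random prices (the names are illustrative):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2023-01-02", periods=5, freq="B")
fields = ["Open", "High", "Low", "Close"]
frames = {
    ticker: pd.DataFrame(np.random.rand(5, 4), index=dates, columns=fields)
    for ticker in ["SPY", "QQQ"]
}

# One wide frame: column level 0 = ticker, column level 1 = OHLC field
prices = pd.concat(frames, axis=1)

# Grab all Close columns at once by cross-sectioning the field level
closes = prices.xs("Close", axis=1, level=1)
returns = closes.pct_change()
print(returns.shape)  # (5, 2)
```

Strategy calculations then vectorize across every ticker in one pass, which is the main payoff of keeping everything in a single frame.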

How to speed up the same calculation on sub-combinations of pandas DataFrame columns?

I'm looking to apply the same function to multiple sub-combinations of a pandas DataFrame. Imagine the full DataFrame having 15 columns; if I want to draw from it sub-frames containing 10 columns, there are 3003 such sub-frames in total. My current approach is to use multiprocessing, which works well for a full DataFrame with about 20 columns (184,756 combinations); however, the real full frame has 50 columns, leading to more than 10 billion combinations, at which point it takes too long. Is there any library that would be suitable for this type of calculation? I have used dask before and it's incredibly powerful, but dask is only suited to calculations on a single DataFrame, not on many different ones.
Thanks.
It's hard to answer this question without a minimal reproducible example. The best path forward depends on what you want to do with your 10 billion DataFrame combinations (write them to disk, train models, run aggregations, etc.).
I'll provide some high level advice:
Using a columnar file format like Parquet allows for column pruning (reading only the columns you need rather than all of them), which can be memory efficient.
Persisting the entire DataFrame in memory with ddf.persist() may be a good way to handle this combinations problem, so you're not constantly reloading it.
Feel free to add more detail about the problem and I can add a more detailed solution.

Transforming data in pandas

What would be the best way to approach this problem using python and pandas?
I have an Excel file of electricity usage. It comes in an awkward structure and I want to transform it so that I can compare it to weather data based on date and time.
The structure looks like this (foo is a string and xx is a number):
100,foo,foo,foo,foo
200,foo,foo,foo,foo,foo,0000,kWh,15
300,20181101,xx,xx,xx,xx...(96 columns)xx,A
... several hundred more 300 type rows
The 100 and 200 rows identify the meter and provide a partial schema, i.e. the data is in kWh at 15-minute intervals. The 300 rows contain a date, 96 columns of 15-minute power consumption (96 = 24 hours × 4 15-minute blocks), and one column with a data quality flag.
I have previously processed all the data in other tools but I'm trying to learn how to do it in Python (jupyter notebook to be precise) and tap into the far more advanced analysis, modeling and visualisation tools available.
I think the thing to do is transform the data into a series of datetime and power. From there I can aggregate, filter and compare however I like.
I am at a loss even to know what question to ask or which resource to look up to tackle this problem. I could just import the 300 rows as-is and loop through the rows and columns to create a new series in the right structure - easy enough to do. However, I strongly suspect there is a built-in method for this sort of thing, and I would greatly appreciate any advice on the best strategy. Possibly I don't need to transform the data at all.
You can read the data easy enough into a DataFrame, you just have to step over the metadata rows, e.g.:
df = pd.read_csv(<file>, skiprows=[0,1], index_col=1, parse_dates=True, header=None)
This reads in the csv, skips over the first 2 lines, makes the date column the index and tries to parse it to a date type.
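Going one step further, here is a hedged sketch of reshaping the wide 300 rows into a single datetime-indexed power series with melt. The toy file below uses 2 intervals per day instead of 96 so the idea fits on screen, and all column names are made up:

```python
import io
import pandas as pd

# Tiny stand-in for the meter file: 2 readings per day instead of 96
csv = io.StringIO(
    "100,foo,foo,foo,foo\n"
    "200,foo,foo,foo,foo,foo,0000,kWh,15\n"
    "300,20181101,1.1,1.2,A\n"
    "300,20181102,2.1,2.2,A\n"
)

n = 2  # would be 96 for real 15-minute data
df = pd.read_csv(csv, skiprows=2, header=None,
                 names=["rec", "date"] + list(range(n)) + ["flag"])

# Wide -> long: one row per (date, interval), then build real timestamps
long = df.melt(id_vars=["date"], value_vars=list(range(n)),
               var_name="interval", value_name="kWh")
long["datetime"] = (pd.to_datetime(long["date"], format="%Y%m%d")
                    + pd.to_timedelta(long["interval"].astype(int) * 15,
                                      unit="min"))
series = long.set_index("datetime")["kWh"].sort_index()
```

The resulting series has one kWh value per 15-minute timestamp, ready to join against weather data on the datetime index.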

Nested data in Pandas

First of all: I know this is a dangerous question. There are a lot of similar questions about storing and accessing nested data in pandas, but I think my question is different (more general), so hang on. :)
I have a medium-sized dataset of workouts for one athlete. Each workout has a date and time, ~200 properties (e.g. average speed and heart rate) and some raw data (3-10 lists of e.g. per-second speed and heart rate values). I have about 300 workouts and each workout contains on average ~4000 seconds.
So far I tried 3 solutions to store this data with pandas to be able to analyze it:
1. I could use a MultiIndex and store all data in one DataFrame, but this DataFrame would get quite large (which doesn't have to be a problem, but visually inspecting it will be hard) and slicing the data is cumbersome.
2. Another way would be to store the date and properties in a DataFrame df_1 and to store the raw data in separate DataFrames held in a column raw_data of df_1.
3. Or (similar to (2)) I could store the raw data in separate DataFrames that I keep in a dict with keys identical to the index of the DataFrame df_1.
All of these solutions work, and for this use case there are no major performance differences between them. To me, (1) feels the most 'Pandorable' (really like that word :) ), but slicing the data is difficult and visual inspection of the DataFrame (printing it) is of no use. (2) feels a bit 'hackish' and in-place modifications can be unreliable, but this solution is very nice to work with. And (3) is ugly and a bit difficult to work with, but also the most Pythonic in my opinion.
Question: What would be the benefits of each method and what is the most Pandorable solution in your opinion?
By the way: Of course I am open to alternative solutions.
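As an illustration of option (3), here is a minimal sketch with made-up property names, where a dict of raw-data DataFrames is keyed by the index of the summary DataFrame df_1:

```python
import numpy as np
import pandas as pd

# Summary frame: one row per workout (~200 properties in real life)
workout_times = pd.date_range("2023-01-01 08:00", periods=3, freq="7D")
df_1 = pd.DataFrame(
    {"avg_speed": [4.1, 4.3, 4.0], "avg_hr": [140, 145, 138]},
    index=workout_times,
)

# Raw per-second data in a dict keyed by df_1's index
raw = {
    ts: pd.DataFrame({"speed": np.random.rand(60),
                      "hr": np.random.randint(120, 170, size=60)})
    for ts in workout_times
}

# Slice the summaries as usual, then fetch the matching raw data by key
fast = df_1[df_1["avg_speed"] > 4.2]
raw_fast = {ts: raw[ts] for ts in fast.index}
```

Filtering happens on the small summary frame, and only the raw data you actually need is ever touched.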
