Efficient Python/Pandas Selection and Sorting

My brother and I are working on reproducing the findings of this paper, where daily historical stock prices are used to predict future prices with incredibly high accuracy.
They used data from the Center for Research in Security Prices (CRSP), but we are using data from Quandl and are running into trouble with the runtimes of our preprocessing. As in the paper, for each ticker and each day, we want to create a vector of 32 z-scores comparing that ticker's momentum to the rest of the market's. As an intermediate step, they create a vector of 32 cumulative returns for each stock over a given time period (12 for the previous 12 months and 20 for the previous 20 days); this process is also briefly discussed in section 2.2 of the paper.
Because of the large amount of data, we are having problems with runtimes simply creating these cumulative-returns vectors. We imported the data from a 1.7 GB CSV and turned it into a pandas DataFrame. We then wrote a function to take a given ticker and date and create a cumulative-returns vector. I suspect that our trouble is in how we select data out of the DataFrame, specifically the following lines (neither takes much time on its own, but both need to be run many times for each date):
prices = data[data.ticker == ticker_argument]
and
p = prices[prices.date == date_argument]
What would be the most efficient way of doing this considering that the data is sorted by ticker and the data for each ticker is also sorted by date? I'd imagine that you could do some sort of binary search to speed most of it up, but should I do that by hand in plain python or is there a better pandas way to do that? Any help would be greatly appreciated.
The Quandl CSV is at http://www.jackmckeown.com/prices.csv if you want more info on how the data is formatted.
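For reference, this is roughly the kind of indexed lookup I had in mind (a sketch only, using the column names above; I have not benchmarked it): build a sorted (ticker, date) MultiIndex once, so each lookup becomes an index search instead of a boolean scan of the whole frame.

import pandas as pd

# Sketch: index and sort once, then every per-ticker/per-date selection can
# use binary search on the sorted MultiIndex instead of a full-frame mask.
data = pd.read_csv('prices.csv', parse_dates=['date'])
data = data.set_index(['ticker', 'date']).sort_index()

def price_window(ticker, end_date, periods=20):
    # Last `periods` rows for one ticker up to and including end_date.
    # (Function name and window length are illustrative only.)
    return data.loc[ticker].loc[:end_date].tail(periods)

Alternatively, grouping the frame by ticker once up front (data.groupby('ticker') plus get_group) avoids re-scanning on every call as well.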

Related

How to deal with consecutive missing values of stock price in a time series using python?

I have a data frame consisting of two time series describing two different stock prices, spanning five years at an interval of approximately 2 minutes. I am struggling to decide how to deal with the missing values in order to build a meaningful model.
Some info about the data frame:
Total number of rows: 1,315,440
Number of missing values in Series_1: 1,113,923
Number of missing values in Series_2: 378,952
Often the missing values run for 100+ consecutive rows, which is what makes me unsure how to handle this dataset.
[A sample of the data and visualisations of Series_1 (column 2) and Series_2 (column 3) were attached to the original post.]
Any advice would be appreciated. Thanks.
Depending on where your data come from, a missing value at a given timestamp may simply mean that an order was executed for one of the two stocks but not for the other at that moment. There is in fact no reason for two different stocks to trade at exactly the same time: dormant stocks with little liquidity can go for a long time without being traded, while others are more active. Moreover, given that the data is precise down to the microsecond, it is no surprise that trades in the two stocks do not land on the exact same microsecond. In such cases it is safe to assume that the price of the stock is the last recorded transaction, and to fill the missing values accordingly. Assuming you are using pandas, you can do this with a forward fill; just make sure to sort your data frame beforehand:
# Sort by time, then carry each series' last observed price forward.
df = df.sort_values('Time')
df['Series1'] = df['Series1'].ffill()
df['Series2'] = df['Series2'].ffill()

Undersampling a large dataset under a specific condition applied to another column in python/pandas

I'm currently working with a large dataset (about 40 columns and tens of thousands of rows) and I would like to undersample it to be able to work with it more easily.
For the undersampling, unlike the resample method from pandas, which resamples according to a timedelta, I'm trying to specify conditions on other columns to determine which data points to keep.
I'm not sure this is clear, but for example, let's say I have 3 columns (index, time and temperature) as follows:
Now for the resampling, I would like to keep a data point every 1 s or every 2°C; the resulting dataset would look like this:
I couldn't find a simple way of doing this with pandas. The only way I found was to iterate over the rows, but that was very slow because of the size of my datasets.
I thought about using the diff method, but of course it can only compute the difference over a fixed period; the same goes for pct_change, which could otherwise have been used to keep only the points in the regions where the variations are largest.
Thanks in advance if you have any suggestions on how to proceed with this resampling.
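One way to avoid iterating over DataFrame rows is to loop over plain NumPy arrays while tracking the last kept row. A sketch, assuming a datetime column named time, a numeric column named temperature, and the 1 s / 2°C thresholds from the question (the names and thresholds are illustrative):

import numpy as np
import pandas as pd

def undersample(df, time_col='time', temp_col='temperature',
                max_dt=pd.Timedelta('1s'), max_dtemp=2.0):
    # Keep a row whenever at least max_dt has elapsed or the temperature has
    # moved by at least max_dtemp since the last *kept* row.
    times = df[time_col].to_numpy()
    temps = df[temp_col].to_numpy()
    keep = np.zeros(len(df), dtype=bool)
    keep[0] = True
    last_t, last_temp = times[0], temps[0]
    for i in range(1, len(df)):
        if times[i] - last_t >= max_dt or abs(temps[i] - last_temp) >= max_dtemp:
            keep[i] = True
            last_t, last_temp = times[i], temps[i]
    return df[keep]

The loop is still Python, but it touches plain NumPy scalars rather than DataFrame rows, which is usually far faster than iterrows on datasets of this size.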

What Python data structures allow for easy access to values with multiple indices?

I work in freight shipping, and I recently built a scraper that collects market rates for shipments based on 3 features: origin city, destination city, and time period.
Currently, I have these results stored in a csv/xlsx file that has this data outlined as follows:
My current project involves comparing what we actually paid for shipments versus the going market rate. From my scraped data, I need a way to rapidly access the:
AVERAGE MARKET RATE
based on: MONTH, ORIGIN CITY, and DESTINATION CITY.
Since I know what we paid for shipping on a particular day, if I can access the average market rate from that month, I can perform a simple subtraction to tell us how much we over or underpaid.
I am relatively proficient with using Pandas dataframes, and my first instincts were to try to combine a dataframe with a dictionary to call values based on those features, but I am unsure of how I can do that exactly.
Do you have any ideas?
Using pandas, you could add what you actually paid as a new column in your csv. Then you could just subtract the two columns, e.g. df['mean'] - df['paid'].
You could do that in Excel too.
As a side note, you'll probably want to update your csv so that each row has the appropriate city - maybe it's harder to read, but it'll definitely be easier to work with in your code.
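If you want a keyed lookup rather than a flat subtraction, a groupby with a (month, origin, destination) MultiIndex gives you exactly that. A sketch, assuming the scraped data is in a CSV with columns named origin, destination, date and rate (the file name, column names and the example lane below are illustrative):

import pandas as pd

rates = pd.read_csv('market_rates.csv', parse_dates=['date'])
rates['month'] = rates['date'].dt.to_period('M')

# Average market rate per (month, origin, destination); the MultiIndex
# lets a single .loc lookup return the value directly.
avg_rate = rates.groupby(['month', 'origin', 'destination'])['rate'].mean()

# Compare what was paid on a given lane in a given month to the market:
paid = 1250.0
diff = paid - avg_rate.loc[(pd.Period('2020-03'), 'Chicago', 'Dallas')]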

Transforming data in pandas

What would be the best way to approach this problem using python and pandas?
I have an excel file of electricity usage. It comes in an awkward structure and I want to transform it so that I can compare it to weather data based on date and time.
The structure looks like this (foo is a string and xx is a number):
100,foo,foo,foo,foo
200,foo,foo,foo,foo,foo,0000,kWh,15
300,20181101,xx,xx,xx,xx...(96 columns)xx,A
... several hundred more 300-type rows
The 100 and 200 rows identify the meter and provide a partial schema, i.e. the data is in kWh at 15-minute intervals. The 300 rows contain a date, 96 columns of 15-minute power consumption (96 = 24 hours × 4 fifteen-minute blocks), and one column with a data quality flag.
I have previously processed all the data in other tools but I'm trying to learn how to do it in Python (jupyter notebook to be precise) and tap into the far more advanced analysis, modeling and visualisation tools available.
I think the thing to do is transform the data into a series of datetime and power. From there I can aggregate, filter and compare however I like.
I am at a loss even to know what question to ask or what resource to look up to tackle this problem. I could just import the 300 rows as-is and loop through the rows and columns to create a new series in the right structure - easy enough to do. However, I strongly suspect there is a built-in method for doing this sort of thing, and I would greatly appreciate any advice on what the best strategies might be. Possibly I don't need to transform the data at all.
You can read the data easily enough into a DataFrame; you just have to step over the metadata rows, e.g.:
df = pd.read_csv(<file>, skiprows=[0,1], index_col=1, parse_dates=True, header=None)
This will read in the csv, skip the first 2 lines, make the date column the index and try to parse it as a date type.
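From there, one way to get a plain (datetime, kWh) series is to stack the 96 quarter-hour columns onto the date index. A sketch building on the read_csv call above, assuming a single meter and that the readings end up in positions 1-96 of the frame with the quality flag last:

import pandas as pd

# Keep only the 96 quarter-hour readings (drop the leading '300' code and
# the trailing quality flag), label them with 15-minute offsets, then stack
# into a single series indexed by full timestamps.
readings = df.iloc[:, 1:97].copy()
readings.columns = pd.timedelta_range(start='0h', periods=96, freq='15min')
series = readings.stack()
series.index = series.index.get_level_values(0) + series.index.get_level_values(1)
series = series.sort_index().rename('kWh')

With the data in that shape, comparing against weather observations is just a matter of resampling or joining on the timestamp index.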

How can I modify data from timestamps to consumption per day per user?

For my Bachelor's degree in Economics I need to analyse data on energy consumption. However, the data set was delivered to me in a certain format, and I am having trouble modifying it to make it useful and to be able to analyse it with Stata.
I have some basic skills in Python and SQL; however, so far I haven't succeeded with this last data set for my thesis. I would be grateful for all your help :)
The problem:
I got a data set with 3 columns and 23 million rows. The 3 columns are time stamp, user (around 130 users) and consumption (Watt per second).
[Screenshot: example of the data set in Access]
In the first example, you can see that some users have negative consumption.
Those users are irrelevant for my research, and all users with negative consumption values can be removed. How can I easily do this?
In the second example the raw data set is given. The time stamps come at intervals of around 10-15 seconds and are consecutive, so measurement 1458185209 is 10-15 seconds after the measurement with time stamp 1458185109. These time stamps are anonymously generated; however, I know the exact begin and end time and date of the measurements.
From this information, I want to calculate the average consumption (in kWh) per user per day. Let's say there are 300,000 measurement points per user in the data set. The total time of measuring is 2 months, so the average consumption of a user for a single day can be calculated by taking the average from time stamp 1 up to time stamp 4918 (300,000 / 61 days).
I want to do this for all users for all days in the given period.
I have some basics in Access, Python and MySQL. However, every computer I have tried struggles with 23 million rows in Access; I simply can't 'play' with the data because every iteration takes me about half an hour. Maybe the better option would be to write a Python script?
As said, I am a student of economics and not of data science, so I really hope you can help me overcome this problem. I am open to any suggestions! I tried to describe the problem as specifically as possible; if anything is unclear please let me know :)
Thanks a lot!
Do you have any indexes defined on your dataset? Putting an index on user, on timestamp, and on both user and timestamp together could greatly improve the performance of some of your queries.
When working with this much data it is usually best to push as many calculations as you can down to the database and only pull the already-processed results into Python for further analysis.
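If you end up doing the aggregation in Python rather than Access, pandas can do the per-user, per-day averaging in a few lines. A rough sketch, assuming the table is exported to a CSV with columns timestamp, user and consumption, and that the anonymised time stamps are seconds counted from a known start date (the file name, column names and start date below are illustrative):

import pandas as pd

start = pd.Timestamp('2016-03-01')   # assumed start of the measurement period
df = pd.read_csv('consumption.csv')  # hypothetical export of the 23 million rows
df['datetime'] = start + pd.to_timedelta(df['timestamp'] - df['timestamp'].min(), unit='s')

# Drop every user that has at least one negative reading.
bad_users = df.loc[df['consumption'] < 0, 'user'].unique()
df = df[~df['user'].isin(bad_users)]

# Mean consumption per user per calendar day.
daily = (df.set_index('datetime')
           .groupby('user')['consumption']
           .resample('D')
           .mean())

The result is a series indexed by (user, day) that can be written out with to_csv and loaded straight into Stata.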
