How to deal with infrequent data in a time series prediction model

How to deal with infrequent data in a time series prediction model - python

I am trying to create a basic model for stock price prediction and some of the features I want to include come from the companies quarterly earnings report (every 3 months) so for example; if my data features are Date, OpenPrice, Close Price, Volume, LastQrtrRevenue how do I include LastQrtrRevenue if I only have a value for it every 3 months? Do I leave the other days blank (or null) or should I just include a constant of the LastQrtrRevenue and just update it on the day the new figures are released? Please if anyone has any feedback on dealing with data that is released infrequently but is important to include please share.... Thank you in advance.

I would be tempted to put the last quarter revenue in a separate table, with a date field representing when that quarter began (or ended, it doesn't really matter). Then you can write queries to work the way that most suits your application. You could certainly reconstitute the view you mention above using that table, as long as you can relate it to the main table.
You would just need to join the main table by company name, while selected the max() of the last quarter revenue table.

Related

Can python pandas replace a combination of if and xlookup in my excel tables?

So I have 2 data tables (2 top ones in the photo). Table 1 is a list of contracts between suppliers and clients with a many to many relationship. Contract details include start date, end date, cost of each day, number of contracted days in the week. Table 2 is a list of all dates in the year with an indication if they're day off or not.
I want a result table (as in the photo) that calculate the estimated amount each supplier will get paid each month. To do that first, I need to find the overlap between the length of contract and the start and end dates of each month, then see how many of that are not day off. So now I'm doing a bunch of xlookup and "if" in a single cell, and also pivot tables which drives me nut.
I'm learning python right now and saw people mention pandas and numpy for data manipulation. I've been googling but only found how to manipulate 1 table, or merge 2 tables. I haven't found how to do conditional calculating across multiple different tables. My question is, can I do this with python or should I just stick with Excel?

What Python data structures allow for easy access to values with multiple indices?

I work in Freight shipping, and I recently built a scraper that scraped market rates of shipments based on 3 features: origin city, destination city, and time period.
Currently, I have these results stored in a csv/xlsx file that has this data outlined as follows:
My current project involves comparing what we actually paid for shipments versus the going market rate. From my scraped data, I need a way to rapidly access the:
AVERAGE MARKET RATE
based on: MONTH, ORIGIN CITY, and DESTINATION CITY.
Since I know what we paid for shipping on a particular day, if I can access the average market rate from that month, I can perform a simple subtraction to tell us how much we over or underpaid.
I am relatively proficient with using Pandas dataframes, and my first instincts were to try to combine a dataframe with a dictionary to call values based on those features, but I am unsure of how I can do that exactly.
Do you have any ideas?

Using pandas, you could add your data as a new column in your csv. Then you could just subtract the two indexes, eg df['mean'] - df['paid']
You could do that in Excel too.
As a side note, you'll probably want to update your csv so that each row has the appropriate city - maybe it's harder to read, but it'll definitely be easier to work with in your code.

Django field for storing occurrence rates?

I have a Download model. I would like to store how often each instance is being downloaded.
I could just store a count but this seems brutally simplistic. Old files will seem more popular. I could mean-average against the upload date, but this makes newer almost always better. I don't neccessarily want to rank thigns but I'd like to show meaningful data to the people managing the files, something like:
Last 24 hours
Last 7 days
Last 30 days
Last 365 days
Total
Is there a field, method, or some sort of generic add-on that I can add to my model to store rate so that I have counts for something like that?
Edit: It occurs to me that doing "last x" means having to somehow store every instance with a datetime until they fall out the end (and only apply to total). I'm open to more efficient slightly compromised solutions that involve less churn.

You have not stated how many records / downloads you will be handling but a simple way of doing this create a DownloadCount model. Maybe something like:
class DownloadCount(models.Model):
date = models.DateField(auto_now_add=True)
item_downloaded = models.ForeignKey(Downloads)
You will then be able to produce metrics based off of the date field.
Something like:
DownloadCount.objects.filter(
date__range=(state_date, end_date),
item_downloaded=<Download-Model>
).count()
Obviously it would be highly advantageous to cache such results.
If you wanted to make the results a little more accurate you could also add an IPAddressField into the model making DownloadCount records unique based off of IP Address and Date.

Pandas-ipython, how to create new data frames with drill down capabilities

I think my problem is simple but I've made a long post in the interest of being thorough.
I need to visualize some data but first I need to perform some calculations that seem too cumbersome in Tableau (am I hated if I say tableau sucks!)
I have a general problem with how to output data with my calculations in a nice format that can be visualized either in Tableau or something else so it needs to hang on to a lot of information.
My data set is a number of fields associated to usage of an application by user id. So there are potentially multiple entries for each user id and each entry (record) has information in columns such as time they began using app, end time, price they paid, whether they were on wifi, and other attributes (dimensions).
I have one year of data and want to do things like calculate average/total of duration/price paid in app over each month and over the full year of each user (remember each user will appear multiple times-each time they sign in).
I know some basics, like appending a column which subtracts start time from end time to get time spent and my python is fully functional but my data capabilities are amateur.
My question is, say I want the following attributes (measures) calculated (all per user id): average price, total price, max/min price, median price, average duration, total duration, max/min duration, median duration, and number of times logged in (so number of instances of id) and all on a per month and per year basis. I know that I could calculate each of these things but what is the best way to store them for use in a visualization?
For context, I may want to visualize the group of users who paid on average more than 8$ and were in the app a total of more than 3 hours (to this point a simple new table can be created with the info) but if I want it in terms of what shows they watched and whether they were on wifi (other attributes in the original data set) and I want to see it broken down monthly, it seems like having my new table of calculations won't cut it.
Would it then be best to create a yearly table and a table for each month for a total of 13 tables each of which contain the user id's over that time period with all the original information and then append a column for each calculation (if the calc is an avg then I enter the same value for each instance of an id)?
I searched and found that maybe the plyr functionality in R would be useful but I am very familiar with python and using ipython. All I need is a nice data set with all this info that can then be exported into a visualization software unless you can also suggest visualization tools in ipython :)
Any help is much appreciated, I'm so hoping it makes sense to do this in python as tableau is just painful for the calculation side of things....please help :)

It sounds like you want to run a database query like this:
SELECT user, show, month, wifi, sum(time_in_pp)
GROUP BY user, show, month, wifi
HAVING sum(time_in_pp) > 3
Put it into a database and run your queries using pandas sql interface or ordinary python queries. Presumably you index your database table on these columns.

Efficiently determining if a business is open or not based on store hours

Given a time (eg. currently 4:24pm on Tuesday), I'd like to be able to select all businesses that are currently open out of a set of businesses.
I have the open and close times for every business for every day of the week
Let's assume a business can open/close only on 00, 15, 30, 45 minute marks of each hour
I'm assuming the same schedule each week.
I am most interested in being able to quickly look up a set of businesses that is open at a certain time, not the space requirements of the data.
Mind you, some my open at 11pm one day and close 1am the next day.
Holidays don't matter - I will handle these separately
What's the most efficient way to store these open/close times such that with a single time/day-of-week tuple I can speedily figure out which businesses are open?
I am using Python, SOLR and mysql. I'd like to be able to do the querying in SOLR. But frankly, I'm open to any suggestions and alternatives.

If you are willing to just look at single week at a time, you can canonicalize all opening/closing times to be set numbers of minutes since the start of the week, say Sunday 0 hrs. For each store, you create a number of tuples of the form [startTime, endTime, storeId]. (For hours that spanned Sunday midnight, you'd have to create two tuples, one going to the end of the week, one starting at the beginning of the week). This set of tuples would be indexed (say, with a tree you would pre-process) on both startTime and endTime. The tuples shouldn't be that large: there are only ~10k minutes in a week, which can fit in 2 bytes. This structure would be graceful inside a MySQL table with appropriate indexes, and would be very resilient to constant insertions & deletions of records as information changed. Your query would simply be "select storeId where startTime <= time and endtime >= time", where time was the canonicalized minutes since midnight on sunday.
If information doesn't change very often, and you want to have lookups be very fast, you could solve every possible query up front and cache the results. For instance, there are only 672 quarter-hour periods in a week. With a list of businesses, each of which had a list of opening & closing times like Brandon Rhodes's solution, you could simply, iterate through every 15-minute period in a week, figure out who's open, then store the answer in a lookup table or in-memory list.

The bitmap field mentioned by another respondent would be incredibly efficient, but gets messy if you want to be able to handle half-hour or quarter-hour times, since you have to increase arithmetically the number of bits and the design of the field each time you encounter a new resolution that you have to match.
I would instead try storing the values as datetimes inside a list:
openclosings = [ open1, close1, open2, close2, ... ]
Then, I would use Python's "bisect_right()" function in its built-in "bisect" module to find, in fast O(log n) time, where in that list your query time "fits". Then, look at the index that is returned. If it is an even number (0, 2, 4...) then the time lies between one of the "closed" times and the next "open" time, so the shop is closed then. If, instead, the bisection index is an odd number (1, 3, 5...) then the time has landed between an opening and a closing time, and the shop is open.
Not as fast as bitmaps, but you don't have to worry about resolution, and I can't think of another O(log n) solution that's as elegant.

You say you're using SOLR, don't care about storage, and want the lookups to be fast. Then instead of storing open/close tuples, index an entry for every open block of time at the level of granularity you need (15 mins). For the encoding itself, you could use just cumulative hours:minutes.
For example, a store open from 4-5 pm on Monday, would have indexed values added for [40:00, 40:15, 40:30, 40:45]. A query at 4:24 pm on Monday would be normalized to 40:15, and therefore match that store document.
This may seem inefficient at first glance, but it's a relatively small constant penalty for indexing speed and space. And makes the searches as fast as possible.

Sorry I don't have an easy answer, but I can tell you that as the manager of a development team at a company in the late 90's we were tasked with solving this very problem and it was HARD.
It's not the weekly hours that's tough, that can be done with a relatively small bitmask (168 bits = 1 per hour of the week), the trick is the businesses which are closed every alternating Tuesday.
Starting with a bitmask then moving on to an exceptions field is the best solution I've ever seen.

In your Solr index, instead of indexing each business as one document with hours, index every "retail session" for every business during the course of a week.
For example if Joe's coffee is open Mon-Sat 6am-9pm and closed on Sunday, you would index six distinct documents, each with two indexed fields, "open" and "close". If your units are 15 minute intervals, then the values can range from 0 to 7*24*4. Assuming you have a unique ID for each business, store this in each document so you can map the sessions to businesses.
Then you can simply do a range search in Solr:
open:[* TO N] AND close:[N+1 TO *]
where N is computed to the Nth 15 minute interval that the current time falls into. For examples if it's 10:10AM on Wednesday, your query would be:
open:[* TO 112] AND close:[113 TO *]
aka "find a session that starts at or before 10:00am Wed and ends at or after 10:15am Wed"
If you want to include other criteria in your search, such as location or products, you will need to index this with each session document as well. This is a bit redundant, but if your index is not huge, it shouldn't be a problem.

If you can control your data well, I see a simple solution, similar to #Sebastian's. Follow the advice of creating the tuples, except create them of the form [time=startTime, storeId] and [time=endTime, storeId], then sort these in a list. To find out if a store is open, simply do a query like:
select storeId
from table
where time <= '#1'
group by storeId
having count(storeId) % 2 == 1
To optimize this, you could build a lookup table at each of time t, store the stores that are open at t, and the store openings/closings between t and t+1 (for any grouping of t).
However, this has the drawback of being harder to maintain (overlapping openings/closings need to be merged into a longer open-close period).

Have you looked at how many unique open/close time combinations there are? If there are not that many, make a reference table of the unique combinations and store the index of the appropriate entry against each business. Then you only have to search the reference table and then find the business with those indices.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to deal with infrequent data in a time series prediction model - python

Related

Can python pandas replace a combination of if and xlookup in my excel tables?

What Python data structures allow for easy access to values with multiple indices?

Django field for storing occurrence rates?

Pandas-ipython, how to create new data frames with drill down capabilities

Efficiently determining if a business is open or not based on store hours

Categories

Resources