I think my problem is simple but I've made a long post in the interest of being thorough.
I need to visualize some data, but first I need to perform some calculations that seem too cumbersome in Tableau (will I be hated if I say Tableau sucks?).
My general problem is how to output the data, together with my calculations, in a format that can be visualized in Tableau or something else, so it needs to hang on to a lot of information.
My data set is a number of fields associated with usage of an application, keyed by user id. There are potentially multiple entries for each user id, and each entry (record) has columns such as the time they began using the app, the end time, the price they paid, whether they were on wifi, and other attributes (dimensions).
I have one year of data and want to do things like calculate the average/total duration and price paid in the app, per month and over the full year, for each user (remember each user will appear multiple times, once per sign-in).
I know some basics, like appending a column that subtracts start time from end time to get time spent. My Python is fully functional, but my data-wrangling skills are amateur.
My question is: say I want the following attributes (measures) calculated, all per user id: average price, total price, max/min price, median price, average duration, total duration, max/min duration, median duration, and number of times logged in (i.e. the number of instances of the id), all on a per-month and per-year basis. I know I could calculate each of these things, but what is the best way to store them for use in a visualization?
For context, I may want to visualize the group of users who paid more than $8 on average and were in the app for a total of more than 3 hours (up to this point a simple new table could be created with that info). But if I also want it in terms of which shows they watched and whether they were on wifi (other attributes in the original data set), and I want to see it broken down monthly, it seems like my new table of calculations won't cut it.
Would it then be best to create a yearly table and a table for each month, for a total of 13 tables, each containing the user ids over that time period with all the original information, and then append a column for each calculation (if the calc is an average, I would enter the same value for every instance of an id)?
I searched and found that the plyr functionality in R might be useful, but I am much more familiar with Python and IPython. All I need is a nice data set with all this info that can then be exported into visualization software, unless you can also suggest visualization tools in IPython :)
Any help is much appreciated. I'm really hoping it makes sense to do this in Python, as Tableau is just painful for the calculation side of things... please help :)
It sounds like you want to run a database query like this:
SELECT user, show, month, wifi, sum(time_in_pp)
FROM usage  -- placeholder table name for your imported data
GROUP BY user, show, month, wifi
HAVING sum(time_in_pp) > 3
Put it into a database and run your queries using pandas' SQL interface (e.g. read_sql) or plain SQL from Python. Presumably you would index your database table on these columns.
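If you'd rather stay entirely in pandas, a groupby produces the same kind of per-user, per-month table. A minimal sketch, assuming the columns are named user_id, start_time, end_time and price (rename to match your file):
import pandas as pd

# Assumed column names: user_id, start_time, end_time, price
df = pd.read_csv('usage.csv', parse_dates=['start_time', 'end_time'])

# Session duration and the month each session falls in
df['duration'] = df['end_time'] - df['start_time']
df['month'] = df['start_time'].dt.to_period('M')

# One row per (user, month) with the measures you listed
monthly = df.groupby(['user_id', 'month']).agg(
    avg_price=('price', 'mean'),
    total_price=('price', 'sum'),
    max_price=('price', 'max'),
    min_price=('price', 'min'),
    median_price=('price', 'median'),
    avg_duration=('duration', 'mean'),
    total_duration=('duration', 'sum'),
    logins=('price', 'size'),          # number of rows = number of sign-ins
).reset_index()

# The yearly version is the same groupby without the month key
yearly = df.groupby('user_id').agg(
    avg_price=('price', 'mean'),
    total_price=('price', 'sum'),
    total_duration=('duration', 'sum'),
).reset_index()

monthly.to_csv('monthly_per_user.csv', index=False)   # Tableau reads this directly
Keeping the aggregates in one long, tidy table like this (one row per user per month) usually saves you from maintaining 13 separate tables; Tableau can filter and pivot it as needed.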
Related
I work in freight shipping, and I recently built a scraper that collects market rates for shipments based on 3 features: origin city, destination city, and time period.
Currently, I have these results stored in a CSV/XLSX file.
My current project involves comparing what we actually paid for shipments versus the going market rate. From my scraped data, I need a way to rapidly look up the average market rate based on month, origin city, and destination city.
Since I know what we paid for shipping on a particular day, if I can access the average market rate from that month, I can perform a simple subtraction to tell us how much we over or underpaid.
I am relatively proficient with using Pandas dataframes, and my first instincts were to try to combine a dataframe with a dictionary to call values based on those features, but I am unsure of how I can do that exactly.
Do you have any ideas?
Using pandas, you could add the average market rate as a new column in your CSV. Then you could just subtract the two columns, e.g. df['mean'] - df['paid'].
You could do that in Excel too.
As a side note, you'll probably want to update your csv so that each row has the appropriate city - maybe it's harder to read, but it'll definitely be easier to work with in your code.
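If you want to go the pandas route end to end, here is a rough sketch; the column names are only assumptions about your files, so rename them to match your actual headers:
import pandas as pd

# Assumed columns: origin, destination, month, market_rate
rates = pd.read_csv('scraped_rates.csv')
# Assumed columns: ship_date, origin, destination, paid
shipments = pd.read_csv('shipments.csv', parse_dates=['ship_date'])

# Average market rate per (month, origin, destination)
avg_rates = (rates
             .groupby(['month', 'origin', 'destination'], as_index=False)['market_rate']
             .mean()
             .rename(columns={'market_rate': 'avg_market_rate'}))

# Give each shipment the same month format as the rates file (assumed 'YYYY-MM')
shipments['month'] = shipments['ship_date'].dt.strftime('%Y-%m')

# Attach each lane's average rate to the shipment and compare
merged = shipments.merge(avg_rates, on=['month', 'origin', 'destination'], how='left')
merged['over_under'] = merged['paid'] - merged['avg_market_rate']
The groupby gives you the lookup table of average rates per month and lane, and the merge attaches the right average to every shipment, so the over/underpayment becomes a single column subtraction.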
I am trying to create a basic model for stock price prediction, and some of the features I want to include come from the company's quarterly earnings report (released every 3 months). For example, if my data features are Date, OpenPrice, ClosePrice, Volume, and LastQrtrRevenue, how do I include LastQrtrRevenue if I only have a value for it every 3 months? Do I leave the other days blank (or null), or should I carry LastQrtrRevenue forward as a constant and just update it on the day the new figures are released? If anyone has any feedback on dealing with data that is released infrequently but is important to include, please share... Thank you in advance.
I would be tempted to put the last-quarter revenue in a separate table, with a date field representing when that quarter began (or ended; it doesn't really matter). Then you can write queries that work in whatever way best suits your application. You could certainly reconstitute the view you mention above using that table, as long as you can relate it to the main table.
You would just need to join it to the main table by company name, while selecting the max() (i.e. most recent) entry from the last-quarter-revenue table.
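If you end up doing this in pandas instead of SQL, merge_asof does the same "most recent report on or before this date" join. A sketch, with assumed column names:
import pandas as pd

# Assumed daily table: date, company, open_price, close_price, volume
prices = pd.read_csv('daily_prices.csv', parse_dates=['date']).sort_values('date')

# Assumed quarterly table: report_date, company, last_qtr_revenue
revenue = pd.read_csv('quarterly_revenue.csv', parse_dates=['report_date']).sort_values('report_date')

# For each daily row, take the most recent report on or before that date
merged = pd.merge_asof(prices, revenue,
                       left_on='date', right_on='report_date',
                       by='company', direction='backward')

# Days before the first report will have NaN revenue; drop or fill them as you prefer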
For my bachelor's degree in economics I need to analyse data on energy consumption. However, the data set was delivered to me in a certain format, and I am having trouble modifying it so that it is useful to me and can be analyzed with Stata.
I have some basic skills in Python and SQL; however, so far I haven't succeeded with this last data set for my thesis. I would be grateful for any help :)
The problem:
I got a data set with 3 columns and 23 million rows. The 3 columns are time-stamp, user (around 130 users), and consumption (Watt per second).
Example of data set in Access
On the first example, you can see that some users have negative consumption.
Those users are irrelevant for my research and all users with negative consumption values can be removed. How can I easily do this?
In the second example the raw data set is given. The time stamps are taken at intervals of around 10-15 seconds and are consecutive, so measurement 1458185209 is 10-15 seconds after the measurement with time-stamp 1458185109. These time-stamps are anonymously generated; however, I know the exact begin and end time and date of the measurements.
From this information, I want to calculate the average consumption (in kWh) per user per day. Say there are 300,000 measurement points per user in the data set and the total measuring period is 2 months. Then the average daily consumption of a user can be calculated by taking the average from time-stamp 1 to time-stamp 4918 (300,000 / 61 days).
I want to do this for all users for all days in the given period.
I have some basics in Access, Python, and MySQL. However, every computer I have tried struggles with 23 million rows in Access; I simply can't 'play' with the data because every iteration takes about half an hour. Maybe the option is to write a Python script?
As said, I am a student in economics, not data science, so I really hope you can help me overcome this problem. I am open to any suggestions! I tried to describe the problem as specifically as possible; if anything is unclear, please let me know :)
Thanks a lot!
Do you have any indexes defined on your dataset? Putting an index on user, on timestamp, or on both user and timestamp together could greatly improve the performance of some of your queries.
When working with this much data, it will likely be best to offload as many calculations as you can to the database and only pull the already-processed results into Python for further analysis.
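For the Python-script option, a sketch with pandas; the column names and the assumption that the anonymised time-stamps are epoch seconds are mine, so adjust as needed:
import pandas as pd

# Assumed layout: no header row, columns timestamp, user, consumption
df = pd.read_csv('consumption.csv', names=['timestamp', 'user', 'consumption'])

# Drop every user that ever reports a negative consumption value
bad_users = df.loc[df['consumption'] < 0, 'user'].unique()
df = df[~df['user'].isin(bad_users)]

# Assumes the anonymised time-stamps are epoch seconds; shift them if yours differ
df['date'] = pd.to_datetime(df['timestamp'], unit='s').dt.date

# Average consumption per user per day (convert to kWh separately if needed)
daily = df.groupby(['user', 'date'], as_index=False)['consumption'].mean()
daily.to_csv('daily_avg_per_user.csv', index=False)   # ready for Stata
23 million rows of three columns fit comfortably in memory for pandas, so this avoids the half-hour Access iterations; if memory does become a problem, read_csv's chunksize argument lets you do the same thing in pieces.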
I have a list of objects that each contain: Product, Price, Time. It basically holds product price changes over time, so for each price change you get a record with the product name, the new price, and the exact second of the change.
I'm sending the data to graphite using the timestamp (written in python):
import socket

# Graphite plaintext protocol: "<metric path> <value> <timestamp>\n" on port 2003
sock = socket.socket()
sock.connect(('my-host-ip', 2003))
message = 'my.metric.prefix.%s %s %d\n' % (product, price, timestamp)
sock.sendall(message.encode())  # encode to bytes (needed on Python 3)
sock.close()
The thing is, as prices do not change very often, the data points are very sparse, which means I get one point per product every few hours or days. If I look at Graphite at the exact time of a price change, I can see the data point. But if I want to look at price changes over time, I would like a constant line drawn from the data point of the price change going forward.
I tried using:
keepLastValue(my.metric.prefix.*)
That works only if I look at the data points in a time frame of a few minutes, but not hours (and certainly not days). Is there a way to do something like this in Graphite? Or do I have to push redundant data every minute to fill in the missing points?
I believe keepLastValue doesn't work for you at coarser time intervals because of the aggregation rules defined in storage-aggregation.conf. You can try xFilesFactor = 0 and aggregationMethod = last so that each aggregated point always keeps the last value of the metric.
However I think your concrete use case is much better resolved by using StatsD gauges. Basically you can set an arbitrary numerical value for a gauge in StatsD and it will send (flush) its value to Graphite every 10 seconds by default. You can set the flush interval to a shorter period, like 1 second, if you really need to record the second of the change. If the gauge is not updated at the next flush, StatsD will send the previous value.
So basically StatsD gauges do what you say about sending redundant data to describe missing points.
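For what it's worth, a minimal sketch of the gauge side with the Python statsd client (the host, port, and prefix are placeholders for your setup):
import statsd

# Assumes a StatsD daemon reachable at this host/port, flushing to your Graphite
client = statsd.StatsClient('my-statsd-host', 8125, prefix='my.metric.prefix')

def on_price_change(product, price):
    # Set the gauge once; StatsD keeps re-flushing the last value every interval
    client.gauge(product, price)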
I had the same problem as well with sparse data.
I used the whisper database tools outlined in the link below to update my whisper files which were aggregating load data on 10 minute intervals.
https://github.com/graphite-project/whisper
First, examine the file using whisper-info:
/opt/graphite/bin/whisper-info.py /opt/graphite/storage/whisper/test/system/myserver/n001/loadave.wsp
Then fix aggregation methods using whisper-resize:
/opt/graphite/bin/whisper-resize.py --xFilesFactor=0 --aggregationMethod=last /opt/graphite/storage/whisper/test/system/myserver/n001/loadave.wsp 600:259200
Please be cautious using whisper-resize as it can result in data loss if you aren't careful!
I'm writing an app in Python for Google App Engine where each user can submit a post, and each post has a ranking determined by its votes and comment count. The ranking is just a simple calculation based on these two parameters. I am wondering whether I should store this value in the datastore (and take up space there) or simply calculate it every time I need it. Just FYI, the posts will be sorted by ranking, so that needs to be taken into account.
I am mostly thinking about efficiency and trying to decide whether to save datastore space or to save read/write quota.
I would think it would be better to simply store it but then I need to recalculate and rewrite the ranking value every time anyone votes or comments on the post.
Any input would be great.
What about storing the ranking as a property on the post? That would make sense for querying/sorting, wouldn't it?
If you store the ranking at the same time (meaning in the same entity) as the votes/comment count, then the only increase in write cost would be for the index (OK, the initial write cost too, but that's what, 2 writes? Very small anyway).
You need to do a database operation every time anyone votes or comments on the post anyway, right? How else can you track votes/comments?
Actually, I imagine you will end up using text search to find data in the posts. If so, I would look into storing the ranking as a property in the search index and using it to rank matching results.
Don't we also need to consider how you are selecting the posts to display? Is ranking by votes and comments the only criterion?
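To make the "store it as a property" idea concrete, here is a rough sketch with ndb; the ranking formula is just a placeholder for whatever calculation you actually use:
from google.appengine.ext import ndb

class Post(ndb.Model):
    votes = ndb.IntegerProperty(default=0)
    comment_count = ndb.IntegerProperty(default=0)
    ranking = ndb.FloatProperty(default=0.0)   # indexed, so queries can sort on it

    def recompute_ranking(self):
        # placeholder formula; substitute your own votes/comments weighting
        self.ranking = self.votes + 0.5 * self.comment_count

@ndb.transactional
def add_vote(post_key):
    post = post_key.get()
    post.votes += 1
    post.recompute_ranking()   # same entity, so the extra write cost is just the index
    post.put()

# Sorting by the stored ranking is then a plain query
top_posts = Post.query().order(-Post.ranking).fetch(20)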
Caching is most useful when the calculation is expensive. If the calculation is simple and cheap, you might as well recalculate as needed.
If you're depending on keeping a running vote count in an entity, then you either have to be willing to lose an occasional vote, or you have to use transactions. If you use transactions, you're rate limited as to how many transactions you can do per second. (See the doc on transactions and entity groups). If you're liable to have a high volume of votes, rate limiting can be a problem.
For a low rate of votes, keeping a count in an entity might work fine. But if you have any significant peaks in voting rate, storing separate Vote entities that periodically get rolled up into a cached count, perhaps adjusted by (possibly unreliable) incremental counts kept in memcache, might work better for you.
It really depends on what you want to optimize for. If you're trying to minimize disk writes by keeping a vote count cached non-transactionally, you risk losing votes.
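If you do go the separate-Vote-entities route, the rollup pattern might look roughly like this (a sketch, not a drop-in solution; the memcache key scheme is made up):
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Vote(ndb.Model):
    post_key = ndb.KeyProperty(kind='Post')   # one entity per vote, no write contention

def record_vote(post_key):
    Vote(post_key=post_key).put()                                   # durable record
    memcache.incr('votes:%s' % post_key.id(), initial_value=0)      # fast but possibly lossy

def approximate_vote_count(post_key):
    # A cron job would periodically replace this with a real count rolled into the Post
    cached = memcache.get('votes:%s' % post_key.id())
    if cached is not None:
        return cached
    return Vote.query(Vote.post_key == post_key).count()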