I'm currently struggling to find good information on how to calculate differences, percentages, etc. across several columns and rows of a Pandas dataframe, and how to display the output in a nice table using Python.
Short example of what I'm going for:
I'm working with NBA data and have gathered a bunch of match statistics for home and away teams during the 2019/20 season (the season finishes later this month). The first row shows the Free Throw percentage; "Regular" means regular matches with audience members, and "Bubble" denotes the matches without audience members.
A short view of my Pandas dataframe:
How do I automate the calculations using Python code? Feel free to give me examples!
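For concreteness, here is a minimal sketch of the sort of calculation being asked about; the row and column names are made up from the description above, not taken from the actual dataframe:

import pandas as pd

# Assumed layout: one row per statistic, one column per setting
stats = pd.DataFrame(
    {"Regular": [77.1, 45.8], "Bubble": [79.3, 46.4]},
    index=["FT%", "FG%"],
)

# Difference and percentage change between the two settings
stats["Diff"] = stats["Bubble"] - stats["Regular"]
stats["Pct change"] = stats["Diff"] / stats["Regular"] * 100
print(stats.round(2))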
I work in Freight shipping, and I recently built a scraper that scraped market rates of shipments based on 3 features: origin city, destination city, and time period.
Currently, I have these results stored in a csv/xlsx file that has this data outlined as follows:
My current project involves comparing what we actually paid for shipments versus the going market rate. From my scraped data, I need a way to rapidly access the:
AVERAGE MARKET RATE
based on: MONTH, ORIGIN CITY, and DESTINATION CITY.
Since I know what we paid for shipping on a particular day, if I can access the average market rate from that month, I can perform a simple subtraction to tell us how much we over or underpaid.
I am relatively proficient with using Pandas dataframes, and my first instincts were to try to combine a dataframe with a dictionary to call values based on those features, but I am unsure of how I can do that exactly.
Do you have any ideas?
Using pandas, you could add your data as a new column in your csv. Then you could just subtract the two columns, e.g. df['mean'] - df['paid'].
You could do that in Excel too.
As a side note, you'll probably want to update your csv so that each row has the appropriate city - maybe it's harder to read, but it'll definitely be easier to work with in your code.
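For example, a rough sketch of that idea (column names like "month", "origin", "destination", "rate", and "paid", and the file names, are guesses at your layout):

import pandas as pd

rates = pd.read_csv("market_rates.csv")   # the scraped data

# Average market rate per (month, origin city, destination city)
avg_rates = (
    rates.groupby(["month", "origin", "destination"])["rate"]
    .mean()
    .rename("avg_market_rate")
    .reset_index()
)

# Join onto what you actually paid and compute the over/underpayment
paid = pd.read_csv("shipments.csv")   # assumed to hold the same three key columns plus "paid"
compared = paid.merge(avg_rates, on=["month", "origin", "destination"])
compared["over_under"] = compared["paid"] - compared["avg_market_rate"]
print(compared.head())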
I have a dataframe in Python that I want to trace through in a very specific way and I'm very new to using Pandas so I need some advice on how best to do this. This dataframe has information on many, many video games released over the course of history. Each row is an entry for a particular video game and each column contains info such as game names, release years, sales numbers, and console platforms (the same game appears multiple times if released on multiple platforms).
I want to do some calculations on sales figures based on release consoles over particular dates. The most obvious way of doing this is, of course, manually looping over every row in the dataframe checking to see if entries match my particular requirements for a calculation.
This is how I plan to do my traversals:
for idx, row in frame.iterrows():
    if row.iloc[1] == "Wii":
        print(row.iloc[1])  # As a test, I can print out the names of Wii games
My question is if this is the "correct" or most efficient way to do this, which I assume it's not. Pandas seems to have a TON of useful methods for dataframes and I would like to know if it contains a more efficient method for only looking up data with certain prerequisites.
Assuming you want Wii games, an easy way to do this is the following. Let's take a toy dataframe example:
# Dataframe 'games':
console title
0 Xbox Halo
1 Wii Smash Bros
To get all the rows with wii games, you can run
games[games["console"] == "Wii"]
# returns
console title
1 Wii Smash Bros
Hope this helps! Let me know if you have any follow-up questions or want more detail.
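And if you want to combine that kind of filter with a date range and a sales calculation, something like this works (the "year" and "sales" columns here are made up to match your description):

import pandas as pd

games = pd.DataFrame({
    "console": ["Xbox", "Wii", "Wii"],
    "title": ["Halo", "Smash Bros", "Wii Sports"],
    "year": [2001, 2008, 2006],
    "sales": [5.0, 13.3, 82.9],
})

# Combine conditions with & and wrap each one in parentheses
wii_recent = games[(games["console"] == "Wii") & (games["year"] >= 2006)]

# Aggregate sales over the filtered rows
print(wii_recent["sales"].sum())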
My brother and I are working on reproducing the findings of this paper where they use daily historical stock prices to predict future prices with incredibly high accuracy.
They used data from the Center for Research in Security Prices, but we are using data from quandl and are running into trouble with the runtimes for our preprocessing. As in the paper, for each ticker and each day, we want to create a vector of 32 z-scores comparing that ticker's momentum to the rest of the market's. As an intermediate step, they create a vector of 32 cumulative returns for each stock over a given time period (12 for the previous 12 months and 20 for the previous 20 days). (This process is also briefly discussed in section 2.2 of the paper.)
Because of the large amount of data, we are having problems with runtimes simply creating these cumulative returns vectors. We imported the data from a 1.7gb csv and turned it into a Pandas dataframe. We then wrote a function to take a given ticker and date and create a cumulative returns vector. I suspect that our trouble is in how we select data out of the dataframe, specifically the following lines (neither takes much time on its own, but both need to be run many times for each date):
prices = data[data.ticker == ticker_argument]
and
p = prices[prices.date == date_argument]
What would be the most efficient way of doing this considering that the data is sorted by ticker and the data for each ticker is also sorted by date? I'd imagine that you could do some sort of binary search to speed most of it up, but should I do that by hand in plain python or is there a better pandas way to do that? Any help would be greatly appreciated.
The quandl csv is at http://www.jackmckeown.com/prices.csv if you want more info on how the data is formatted.
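For reference, one pandas-native way to avoid repeatedly scanning the whole frame is to build a sorted (ticker, date) index once and then look rows up with .loc. This is only a sketch; the file path, ticker, and date below are placeholders:

import pandas as pd

data = pd.read_csv("prices.csv", parse_dates=["date"])

# Build the index once; lookups on a sorted index avoid scanning every row
indexed = data.set_index(["ticker", "date"]).sort_index()

# All rows for one ticker
prices = indexed.loc["AAPL"]

# A single ticker/date combination
p = indexed.loc[("AAPL", pd.Timestamp("2017-01-03"))]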
I think my problem is simple but I've made a long post in the interest of being thorough.
I need to visualize some data, but first I need to perform some calculations that seem too cumbersome in Tableau (am I hated if I say Tableau sucks?).
I have a general problem with how to output data with my calculations in a nice format that can be visualized either in Tableau or something else, so it needs to hang on to a lot of information.
My data set is a number of fields associated to usage of an application by user id. So there are potentially multiple entries for each user id and each entry (record) has information in columns such as time they began using app, end time, price they paid, whether they were on wifi, and other attributes (dimensions).
I have one year of data and want to do things like calculate the average/total duration and price paid in the app over each month and over the full year for each user (remember, each user will appear multiple times, once for each sign-in).
I know some basics, like appending a column which subtracts start time from end time to get time spent. My Python is fully functional, but my data capabilities are amateur.
My question is, say I want the following attributes (measures) calculated (all per user id): average price, total price, max/min price, median price, average duration, total duration, max/min duration, median duration, and number of times logged in (so number of instances of id) and all on a per month and per year basis. I know that I could calculate each of these things but what is the best way to store them for use in a visualization?
For context, I may want to visualize the group of users who paid more than $8 on average and were in the app for a total of more than 3 hours (up to this point, a simple new table can be created with that info). But if I want to break it down by what shows they watched, whether they were on wifi (other attributes in the original data set), and by month, it seems like having my new table of calculations won't cut it.
Would it then be best to create a yearly table plus a table for each month (13 tables in total), each containing the user ids active over that time period with all the original information, and then append a column for each calculation (if the calc is an average, I enter the same value for each instance of an id)?
I searched and found that maybe the plyr functionality in R would be useful, but I am much more familiar with Python and IPython. All I need is a nice data set with all this info that can then be exported into a visualization tool, unless you can also suggest visualization tools in IPython :)
Any help is much appreciated. I'm really hoping it makes sense to do this in Python, as Tableau is just painful for the calculation side of things... please help :)
It sounds like you want to run a database query like this:
SELECT user, show, month, wifi, sum(time_in_pp)
FROM usage  -- the table name here is a placeholder for wherever your per-session records live
GROUP BY user, show, month, wifi
HAVING sum(time_in_pp) > 3
Put it into a database and run your queries through pandas' SQL interface (e.g. read_sql) or plain SQL. Presumably you'd index your database table on these columns.
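If you would rather stay in pandas than go through a database, the same idea looks roughly like this (column names such as "user_id", "price", "start_time", and "end_time", and the file name, are assumptions about your data):

import pandas as pd

df = pd.read_csv("usage.csv", parse_dates=["start_time", "end_time"])

# Derived columns: session duration in hours and calendar month
df["duration"] = (df["end_time"] - df["start_time"]).dt.total_seconds() / 3600
df["month"] = df["start_time"].dt.to_period("M")

# Per-user, per-month measures in one pass
monthly = df.groupby(["user_id", "month"]).agg(
    avg_price=("price", "mean"),
    total_price=("price", "sum"),
    median_price=("price", "median"),
    total_duration=("duration", "sum"),
    logins=("price", "size"),
).reset_index()

# Example filter: users who averaged over $8 and spent more than 3 hours that month
heavy = monthly[(monthly["avg_price"] > 8) & (monthly["total_duration"] > 3)]
print(heavy)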
I'm new to Pandas and would like some insight from the pros. I need to perform various statistical analyses (multiple regression, correlation etc) on >30 time series of financial securities' daily Open, High, Low, Close prices. Each series has 500-1500 days of data. As each analysis looks at multiple securities, I'm wondering if it's preferable from an ease of use and efficiency perspective to store each time series in a separate df, each with date as the index, or to merge them all into a single df with a single date index, which would effectively be a 3d df. If the latter, any recommendations on how to structure it?
Any thoughts much appreciated.
PS. I'm working my way up to working with intraday data across multiple timezones but that's a bit much for my first pandas project; this is a first step in that direction.
Since you're only dealing with OHLC, it's not that much data to process, so that's good.
For these types of things I usually use a MultiIndex (http://pandas.pydata.org/pandas-docs/stable/indexing.html) with symbol as the first level and date as the second. Then you can have just the OHLC columns and you're all set.
To access the MultiIndex, use the .xs method.
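For example (a toy sketch with made-up symbols and prices):

import pandas as pd

prices = pd.DataFrame(
    {
        "Open": [10.0, 10.5, 20.0],
        "High": [10.8, 11.0, 20.5],
        "Low": [9.9, 10.2, 19.5],
        "Close": [10.5, 10.9, 20.2],
    },
    index=pd.MultiIndex.from_tuples(
        [("AAA", "2020-01-02"), ("AAA", "2020-01-03"), ("BBB", "2020-01-02")],
        names=["symbol", "date"],
    ),
)

# All rows for one symbol (the symbol level is dropped from the result)
aaa = prices.xs("AAA", level="symbol")

# All symbols on one date
jan2 = prices.xs("2020-01-02", level="date")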
Unless you are going to correlate everything with everything, my suggestion is to put these into separate dataframes and put them all in a dictionary, i.e. {"Timeseries1": df1, "Timeseries2": df2, ...}. Then, when you want to correlate some of the timeseries together, you can merge them and add suffixes to the columns of each df to differentiate between them.
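A rough sketch of that pattern (the data here is made up):

import pandas as pd

dates = pd.date_range("2020-01-01", periods=3)
df1 = pd.DataFrame({"Close": [10.0, 10.5, 10.2]}, index=dates)
df2 = pd.DataFrame({"Close": [55.0, 54.5, 56.0]}, index=dates)

# Keep each series in its own dataframe, all in one dictionary
series = {"Timeseries1": df1, "Timeseries2": df2}

# When you want to correlate two of them, join on the date index with suffixes
merged = series["Timeseries1"].join(series["Timeseries2"], lsuffix="_ts1", rsuffix="_ts2")
print(merged.corr())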
You are probably interested in this talk, Python for Financial Data Analysis with pandas, by the author of pandas himself.