Cumulative Frequency with Defined Bins in Python

I have an array of data on how quickly people take action, measured in hours. I want to generate a table that tells me what % of users have taken action by the first hour, first day, first week, first month, etc.
I have used pandas.cut to categorize the values and give them group names:
bins_hours = [0...]
group_names = [...]
hourlylook = pd.cut(av.date_diff, bins_hours, labels=group_names, right=False)
I then plotted hourlylook and got an awesome bar chart.
But I want to express this information cumulatively, too, in a table format. What's the best way to tackle this problem?

Have a look at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html
This should allow you to create a new column with the cumulative sum.
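For example, a minimal sketch building on the hourlylook series from the question (your bin edges and labels assumed already defined):
import pandas as pd

# Count observations per bin, keeping the bin order, then accumulate.
counts = hourlylook.value_counts(sort=False)
cum_table = pd.DataFrame({
    'count': counts,
    'cumulative count': counts.cumsum(),
    'cumulative %': 100 * counts.cumsum() / counts.sum(),
})
print(cum_table)
value_counts(sort=False) keeps the categorical bin order, so the cumulative sum runs from the first hour outward.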

Related

How can I make my data numerical, so I can visualize it via matplotlib?

So I have this df; the columns I'm interested in visualizing later with matplotlib are 'incident_date' and 'fatalities'. I want to create two diagrams: one will display the number of incidents with injuries (the column named 'fatalities' says whether it was a fatal accident, one with injuries, or neither), and the other will display the dates with the most deaths. So, in order to do that, I need to somehow turn the data in the 'fatalities' column into numerical values.
This is my df's head, so you get an idea
I created dummy data based on the picture you provided:
import pandas as pd

data = {'incident_date': ['1-Mar-20', '1-Mar-20', '3-Mar-20', '3-Mar-20', '3-Mar-20',
                          '5-Mar-20', '6-Mar-20', '7-Mar-20', '7-Mar-20'],
        'fatalities': ['Fatal', 'Fatal', 'Injuries', 'Injuries', 'Neither',
                       'Fatal', 'Fatal', 'Fatal', 'Fatal'],
        'conclusion_number': [1, 1, 3, 23, 23, 34, 23, 24, 123]}
df = pd.DataFrame(data)
All you need to do is group by incident_date and fatalities, and you will get the count for each particular date and each particular incident type.
df_grp = df.groupby(['incident_date', 'fatalities'], as_index=False)['conclusion_number'].count()
df_grp.rename({'conclusion_number': 'counts'}, inplace=True, axis=1)
The output of the above looks like this:
  incident_date fatalities  counts
0      1-Mar-20      Fatal       2
1      3-Mar-20   Injuries       2
2      3-Mar-20    Neither       1
3      5-Mar-20      Fatal       1
4      6-Mar-20      Fatal       1
5      7-Mar-20      Fatal       2
Once you have the counts column you can build your matplotlib diagrams, for example as sketched below. Let me know if you need help with the diagrams as well.
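A minimal sketch of one possible chart, assuming the df_grp produced above: pivot the fatality categories into columns so each category gets its own bar per date.
import matplotlib.pyplot as plt

# One bar group per date, one bar per fatality category.
pivoted = df_grp.pivot(index='incident_date', columns='fatalities', values='counts')
pivoted.plot.bar(rot=45)
plt.ylabel('number of incidents')
plt.tight_layout()
plt.show()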

Problem plotting dataframe with matplotlib

I'm trying to plot a bar chart of some de-identified transactional banking data using the pandas and matplotlib libraries.
The data looks like this:
The column named "day" stores the numbers of the days on which the transactions were made, the column named "tr_type" stores the numbers of the transactions made on each day, and the column named "average_income" stores the average income for each of the different transaction types.
The task is to display, on one graph, the data from all three columns for the rows with the largest average incomes.
For definiteness, I took the top 5 rows of sorted data.
slised_two = sliced_df_new.sort_values('average_income', ascending=False).head(5)
slised_two = slised_two.set_index('day')
For convenience in further plotting, I set the column called "day" as the index. I get this:
Based on this data, I tried to build a single graph but, unfortunately, I did not achieve the result I wanted: I had to build 2 subplots to display the data properly.
axes = slised_two.plot.bar(rot=0, subplots=True)
axes[1].legend(loc=2)
The question arises: is it possible to build the bar chart in such a way that the days are displayed on the x-axis, the average income is displayed on the y-axis, and, at the same time, the transaction number is written on top of each bar?
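A minimal sketch of one way to do this, assuming slised_two still has 'day' as its index with 'tr_type' and 'average_income' columns: plot average_income as the bars and annotate each bar with the matching tr_type value.
import matplotlib.pyplot as plt

ax = slised_two['average_income'].plot.bar(rot=0)
ax.set_xlabel('day')
ax.set_ylabel('average income')
# Write each transaction number just above its bar.
for bar, tr in zip(ax.patches, slised_two['tr_type']):
    ax.annotate(str(tr),
                (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                ha='center', va='bottom')
plt.show()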

How do I stop Pandas from continuing to put the same data on the same plot?

My first post, so I hope I do this correctly!
This is admittedly an OOP modification of something on DataCamp. I have two objects which contain Pandas dataframes. The first (StockData) has stock data for every trading day of 2016 for both Amazon and Facebook. The second (BenchmarkData) has the S&P 500 closing values for every trading day of 2016. For both, I want to calculate the percent change (StockReturns and BenchmarkReturns, respectively) and then plot them. I want both of the StockReturns on the same plot, but the BenchmarkReturns (which is a Series and not a dataframe, for reasons irrelevant to this part of the code) on a separate plot. For my function, I've added a flag as input to tell the program whether the object contains a stock dataframe or a benchmark dataframe, and I call the function twice during runtime, once for the stocks and once for the benchmark. However, no matter what I do, Pandas plots all 3 on the same plot. How do I separate the benchmark data and get it on its own plot?
def _CalculatePercentChange(self, IsStockData):
    if IsStockData:
        self.__StockReturns = self.__StockData.GetData().pct_change()
        self.__StockReturns.plot(title='Daily Percent Change')
    else:
        self.__BenchmarkReturnsDataFrame = self.__BenchmarkData.GetData().pct_change()
        self.__BenchmarkReturns = self.__BenchmarkReturnsDataFrame['S&P 500'].squeeze()
        self.__BenchmarkReturns.plot(title='Daily Percent Change')
Thanks guys.
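A minimal sketch of one likely fix, assuming the behaviour comes from pandas drawing on matplotlib's current axes when no ax is given: open a fresh figure before each plot call.
import matplotlib.pyplot as plt

def _CalculatePercentChange(self, IsStockData):
    if IsStockData:
        self.__StockReturns = self.__StockData.GetData().pct_change()
        plt.figure()  # fresh figure, so the stock returns get their own axes
        self.__StockReturns.plot(title='Daily Percent Change')
    else:
        self.__BenchmarkReturnsDataFrame = self.__BenchmarkData.GetData().pct_change()
        self.__BenchmarkReturns = self.__BenchmarkReturnsDataFrame['S&P 500'].squeeze()
        plt.figure()  # separate figure for the benchmark series
        self.__BenchmarkReturns.plot(title='Daily Percent Change')
Passing an explicit ax= to each .plot() call achieves the same separation.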

Plot each year of a time series on the same x-axis

I have a time series with daily data that I want to plot to see how it evolves over a year, and to compare that evolution with previous years. I have written the following code in Python:
xindex = data['biljett'].index.month*30 + data['biljett'].index.day
plt.plot(xindex, data['biljett'])
plt.show()
The graph shows how the data evolves over a year compared to previous years, but the line is continuous and does not end at the end of the year, which makes it fuzzy. What am I doing wrong?
From a technical perspective, this happens because your data points are not sorted with respect to the date, so the line goes back and forth, connecting the points in the order they appear in the data frame. Sort the data by xindex and you're good to go. To do that (first you need to put xindex into the data dataframe as a new column):
data = data.sort_values(by='xindex').reset_index(drop=True)
From the visualization perspective, I think you might have several values per day count, so plot is not a good option to begin with. IMHO you'd want to try plt.scatter() to visualize your data in a better way.
I have rewritten it as follows:
xindex = data['biljett'].index.month*30 + data['biljett'].index.day
data['biljett'].sort_values('xindex').reset_index(drop=True)
plt.plot(xindex, data['biljett'])
plt.show()
but I get the following error message:
ValueError: No axis named xindex for object type
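The error arises because data['biljett'] is a Series, and Series.sort_values() takes no column name, so 'xindex' is interpreted as an axis. A minimal sketch of the intended fix, assuming data has a DatetimeIndex: store xindex as a real column, sort the frame by it, and plot from the sorted frame.
import matplotlib.pyplot as plt

# Keep the values and their day-of-year position together in one frame.
plot_df = data[['biljett']].copy()
plot_df['xindex'] = plot_df.index.month * 30 + plot_df.index.day
plot_df = plot_df.sort_values('xindex').reset_index(drop=True)

plt.plot(plot_df['xindex'], plot_df['biljett'])
plt.show()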

Efficient workaround to group by multiple time coordinates in xarray

I'm currently working with CESM Large Ensemble data on the cloud (ala https://medium.com/pangeo/cesm-lens-on-aws-4e2a996397a1) using xarray and Dask and am trying to plot the trends in extreme precipitation in each season over the historical period (Dec-Jan-Feb and Jun-Jul-Aug specifically).
Eg. If one had a daily time-series data split into months like:
1920: J,F,M,A,M,J,J,A,S,O,N,D
1921: J,F,M,A,M,J,J,A,S,O,N,D
...
My aim is to group together the JJA days in each year and then take the maximum value within that group of days for each year. Ditto for DJF, however here you have to be careful because DJF is a year-skipping season; the most natural way to define it is 1921's DJF = 1920 D + 1921 JF.
Using iris this would be simple (though quite inefficient): you could just add auxiliary time coordinates for season and season_year, aggregate/groupby those two coordinates, and take a maximum. This would give you a (year, lat, lon) output where each year contains the maximum of the precipitation field in the chosen season (e.g. the maximum DJF precip in 1921 in each lat, lon pixel).
However, in xarray this operation is not as natural because you can't natively group by multiple coordinates (see https://github.com/pydata/xarray/issues/324 for further info). In that GitHub issue, though, someone suggests a simple nested workaround to the problem using xarray's .apply() functionality:
def nested_groupby_apply(dataarray, groupby, apply_fn):
    # Group by the first coordinate, then recurse on the remaining ones.
    if len(groupby) == 1:
        return dataarray.groupby(groupby[0]).apply(apply_fn)
    else:
        return dataarray.groupby(groupby[0]).apply(
            nested_groupby_apply, groupby=groupby[1:], apply_fn=apply_fn)
I'd be quite keen to try and use this workaround myself, but I have two main questions beforehand:
1) I can't seem to work out how to group by the coordinates in such a way that I don't take the maximum of DJF within a single calendar year.
Eg. If one simply applies the function like (for a suitable xr_max() function):
outp = nested_groupby_apply(daily_prect, ['time.season', 'time.year'], xr_max)
outp_djf = outp.sel(season='DJF')
Then you effectively define 1921's DJF as 1921 D + 1921 JF, which isn't actually what you want to look at! This is because the 'time.year' grouping doesn't account for the year-skipping behaviour of seasons like DJF. I'm not sure how to work around this (one possible approach is sketched after the edit below).
2) This nested groupby function is incredibly slow! As such, I was wondering if anyone in the community has found a more efficient solution to this problem with similar functionality?
Thanks ahead of time for your help, all! Let me know if anything needs clarifying.
EDIT: Since posting this, I've discovered there already is a workaround for this in the specific case of taking DJF/JJA means each year (Take maximum rainfall value for each season over a time period (xarray)), however I'm keeping this question open because the general problem of an efficient workaround for multi-coord grouping is still unsolved.
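A minimal sketch of both ideas, assuming daily_prect is a DataArray with a daily time coordinate (the variable names here are illustrative): for question 1, build a custom season_year coordinate that assigns December to the following year; for question 2, resampling onto December-anchored quarters sidesteps the nested groupby entirely.
# Question 1: December belongs to the *next* year's DJF, so bump its year by one.
month = daily_prect['time.month']
season_year = (daily_prect['time.year'] + (month == 12)).rename('season_year')
daily_prect = daily_prect.assign_coords(season_year=season_year)

# Select DJF days only, then take the yearly maximum within the season.
djf = daily_prect.where(daily_prect['time.season'] == 'DJF', drop=True)
djf_max = djf.groupby('season_year').max('time')

# Question 2: 'QS-DEC' quarters start in Dec/Mar/Jun/Sep, so the Dec-labelled
# quarter is exactly DJF spanning the year boundary, computed in one pass.
quarterly_max = daily_prect.resample(time='QS-DEC').max('time')
djf_max_fast = quarterly_max.where(quarterly_max['time.month'] == 12, drop=True)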
