Create groupings on data - python

I have input data like this:
The output I need groups all the elements and gives the distinct customers and total sales for each group. So far I have tried cross matrix, group by, and cube functions, but I am not getting the desired results.
Output expected:
Any help will be highly appreciated.
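Since the input and expected output are not visible here, the following is only a rough sketch of a cube-style grouping in pandas. The column names (`region`, `product`, `customer`, `sales`) and all values are hypothetical and would need to be replaced with the real ones. It groups by every non-empty combination of the dimension columns and computes the distinct customers and total sales for each group.

```python
from itertools import combinations

import pandas as pd

# Hypothetical input: column names and values are made up for illustration.
df = pd.DataFrame({
    "region":   ["East", "East", "West", "West", "West"],
    "product":  ["A", "B", "A", "A", "B"],
    "customer": ["c1", "c2", "c1", "c3", "c3"],
    "sales":    [100, 50, 70, 30, 20],
})

dims = ["region", "product"]  # assumed grouping columns
parts = []
for r in range(1, len(dims) + 1):
    for keys in combinations(dims, r):
        # Recompute both measures for this particular combination of columns.
        agg = (
            df.groupby(list(keys))
              .agg(distinct_customers=("customer", "nunique"),
                   total_sales=("sales", "sum"))
              .reset_index()
        )
        parts.append(agg)

# Dimensions not used in a given grouping come out as NaN; label them "ALL".
cube = pd.concat(parts, ignore_index=True)
cube[dims] = cube[dims].fillna("ALL")
cube = cube[dims + ["distinct_customers", "total_sales"]]
print(cube)
```

The distinct-customer count is recomputed with `nunique` for each grouping because it is not additive: a customer who appears in two sub-groups must still be counted once in the combined group.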

Related

How to bin a column almost equally based on the range in PySpark, on groupBy results?

I have been trying hard to divide a column into bins. I looked at pandas.qcut and pandas.cut and understand them fairly well (qcut divides my population into bins holding an equal number of items, and cut divides my range into equal-width bins).
I understand that I need the functionality of cut in PySpark.
I experimented with approxQuantile, which takes a really long time to process. I also experimented with QuantileDiscretizer (which is similar to qcut and gave me the same number of items per bin in my output).
I am trying out Bucketizer; however, it looks like there is no straightforward way to apply it to a groupby/window output.
Is there any other approach I should look at to compute this?
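To make the cut versus qcut distinction described in this question concrete, here is a small pandas-only sketch (it does not use PySpark, and the sample values are made up). With `labels=False`, both functions return the integer index of the bin each value falls into.

```python
import pandas as pd

# Made-up sample with one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut: split the value range into 3 equal-width intervals.
equal_width = pd.cut(values, bins=3, labels=False)

# qcut: split the population into 2 bins holding the same number of items.
equal_count = pd.qcut(values, q=2, labels=False)

print(equal_width.tolist())
print(equal_count.tolist())
```

With equal-width bins the outlier ends up alone in the last bin and the middle bin stays empty, while the quantile-based bins each hold half of the items.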
Applying Bucketizer to Spark dataframe after partitioning based on a column value

Creating 30 Plotly charts based on multi-index Pandas series

I've got a DataFrame with PLAYER_NAME, the corresponding cluster they're assigned to, their team's net rating, and their team ids. This is what it looks like:
I'd like to have a bunch of bar charts for each team that look like the following:
These would be matched with the team's net rating and id. I've tried using groupby like this to get a multi-index Pandas series where there's a team_id and a cluster number corresponding to the number of instances that cluster appears for a certain team. It looks like this:
Unfortunately this removes the net rating, so I'm really not sure how to get all that info. I'm using plotly right now but if there's a solution with matplotlib that's fine.
This is the groupby code:
temp_df = pos_clusters.groupby(['TEAM_ID'])['Cluster'].value_counts().sort_index()
Thanks so much!
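One possible way to keep the net rating next to the cluster counts is to compute the counts and the per-team rating separately and look the rating up while iterating over teams. This is a sketch on made-up data: `NET_RATING` is an assumed column name (the real one is not shown), and it assumes the rating is the same on every row of a team.

```python
import pandas as pd

# Hypothetical sample; NET_RATING is an assumed column name.
pos_clusters = pd.DataFrame({
    "PLAYER_NAME": ["p1", "p2", "p3", "p4", "p5"],
    "Cluster":     [0, 0, 1, 2, 2],
    "TEAM_ID":     [10, 10, 10, 20, 20],
    "NET_RATING":  [3.5, 3.5, 3.5, -1.2, -1.2],
})

# One row per team, one column per cluster, counts as values.
counts = (
    pos_clusters.groupby(["TEAM_ID", "Cluster"])
                .size()
                .unstack(fill_value=0)
)

# Assumes the rating is constant within a team, so the first value suffices.
ratings = pos_clusters.groupby("TEAM_ID")["NET_RATING"].first()

for team_id, row in counts.iterrows():
    title = f"Team {team_id} (net rating {ratings[team_id]})"
    # row.index holds the cluster numbers and row.values the counts,
    # which is the input a bar chart for this team needs.
    print(title, dict(row))
```

Each `row` can then be passed to whichever plotting library is in use, with `title` as the chart title.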

Why do I get different results from different credit pricing engines in QuantLib

I am trying to use three credit pricing engines (IsdaCdsEngine, MidPointCdsEngine and IntegralCdsEngine), but I am getting different NPV results from each of them. The situation is this: when I input the same CDS spread value for different tenors, the NPV values differ, whereas when the CDS spread values differ across tenors, the NPV results are similar (though still not identical).
I would really appreciate it if someone could explain how these three engines differ from each other.

Find correlation between two columns in two different dataframes

I have two dataframes which both have an ID column, and for each ID a date column with timestamps and a Value column. Now, I would like to find a correlation between the values from each dataset in this way: dataset 1 has all the values of people that got a specific disease, and dataset 2 has the values for people that DIDN'T get the disease. Now, using the corr function:
corr = df1['val'].corr(df2['val'])
my result is 0.1472, which is very low, meaning the two have almost no correlation.
Am I doing something wrong? How do I calculate the correlation? Is there a way to find a value (maybe a line) beyond which people will get the disease? I would like to try this with a Machine Learning technique (SVMs), but first it would be good to have something like the part I explained before. How can I do that?
Thanks
Maybe your low correlation is due to the index or order of your observations.
Have you tried doing a left join by ID?
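The point about index and order can be illustrated with a small sketch on made-up data. `Series.corr` pairs values by index label, not by ID, so two frames whose rows are in a different order can give a misleading result. Joining on ID first pairs the values explicitly. Note that this only helps if the same IDs occur in both frames, which may not be the case when one frame holds people with the disease and the other holds people without it.

```python
import pandas as pd

# Made-up data: df2 holds exactly twice df1's value per ID, in reverse row order.
df1 = pd.DataFrame({"ID": [1, 2, 3, 4], "val": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"ID": [4, 3, 2, 1], "val": [8.0, 6.0, 4.0, 2.0]})

# Pairs rows by index label (0, 1, 2, 3), ignoring the ID column.
by_index = df1["val"].corr(df2["val"])

# Left join on ID so that each pair of values belongs to the same ID.
merged = df1.merge(df2, on="ID", how="left", suffixes=("_1", "_2"))
by_id = merged["val_1"].corr(merged["val_2"])

print(by_index, by_id)
```

Here the index-based pairing yields a correlation of about -1, while the ID-based pairing yields about +1, even though the underlying values are identical.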

How to loop through dates when in YYYY and MM format in two separate columns?

I am new to Python and still learning, so your help is appreciated.
I am trying to do a simple calculation: a rolling average of the last 4 data points (fin_ratio), i.e. (x1+x2+x3+x4)/4.
Now my dataset is as follows:
I have put a small sample of file (xl format) here for your reference:
https://www.mediafire.com/file/4gjsprvfc31n79g/test_data_python.xlsx/file
1) It has a unique firm_id, so the rolling mean should not use data from two different firm_ids (i.e. I can't just calculate the mean and let it run down the rows, because it would then use fin_ratio from two different firms where one firm's data ends and the next firm's data starts).
2) How should I use firm_year and firm_qtr in the for loop in this case?
Thanks for your time and I appreciate any pointers you may have.
Regards
John M.
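A possible approach that avoids an explicit for loop is to sort by firm, year and quarter, and then compute the rolling mean within each firm_id group. The sketch below uses the column names from the question (`firm_id`, `firm_year`, `firm_qtr`, `fin_ratio`) on made-up values, and the output column name `fin_ratio_avg4` is an assumption.

```python
import pandas as pd

# Made-up sample: firm 1 has five quarters, firm 2 has four.
df = pd.DataFrame({
    "firm_id":   [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "firm_year": [2019, 2019, 2019, 2019, 2020, 2019, 2019, 2019, 2019],
    "firm_qtr":  [1, 2, 3, 4, 1, 1, 2, 3, 4],
    "fin_ratio": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0],
})

# Put each firm's rows in chronological order before rolling.
df = df.sort_values(["firm_id", "firm_year", "firm_qtr"]).reset_index(drop=True)

# Rolling mean of the last 4 data points, restarted for every firm_id.
df["fin_ratio_avg4"] = (
    df.groupby("firm_id")["fin_ratio"]
      .transform(lambda s: s.rolling(window=4).mean())
)
print(df)
```

The first three rows of each firm are NaN because fewer than 4 data points are available there, and no window ever mixes values from two firms.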
