How to translate rolling function in Python to PySpark

I am trying to replicate the following pandas code in PySpark:
df_act['steps_ma'] = df_act['steps'].rolling(5).mean()
What the code is supposed to do is calculate the mean of the column steps, using the current row and the 4 rows prior, and store it in steps_ma. So for example, if we are looking at row number 10, the mean has to be calculated from the steps values of rows 10, 9, 8, 7, and 6.
All the examples I can find work with timestamp columns, while my column "steps" holds integers.
I have been going back and forth with this approach, but I keep getting errors:
winSpec = Window.partitionBy('user_id').orderBy('timestamp').rangeBetween(-(5), 0)
df_act.withColumn('steps_ma', mean('steps').over(winSpec)).show()
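For what it's worth, a minimal sketch of a likely fix: rangeBetween interprets its offsets in the units of the orderBy column, whereas pandas' rolling(5) is purely row-based, so rowsBetween(-4, 0) (the 4 preceding rows plus the current one) is the closer match. Note that rolling(5).mean() leaves the first 4 rows as NaN (min_periods defaults to the window size), while a plain Spark average is computed over however many rows are available; the when/count guard below mimics the pandas behavior.
from pyspark.sql import Window
from pyspark.sql import functions as F

# Row-based window: the current row and the 4 rows before it,
# per user, ordered by timestamp (column names from the question).
winSpec = Window.partitionBy('user_id').orderBy('timestamp').rowsBetween(-4, 0)

# Plain rolling average over up to 5 rows:
df_act = df_act.withColumn('steps_ma', F.avg('steps').over(winSpec))

# To mimic pandas' default min_periods (null until 5 rows are available):
df_act = df_act.withColumn(
    'steps_ma',
    F.when(F.count('steps').over(winSpec) >= 5, F.avg('steps').over(winSpec))
)
df_act.show()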

Related

Python 3.8 Error while trying to define a range of values on a Pandas column

Hello I'm learning Python and Pandas, and I'm working on an exercise. I'm loading in 2 csv files and merging them into one dataframe.
import pandas as pd
# File to Load (Remember to Change These)
school_data_to_load = "Resources/schools_complete.csv"
student_data_to_load = "Resources/students_complete.csv"
# Read School and Student Data File and store into Pandas DataFrames
school_data_df = pd.read_csv(school_data_to_load)
student_data_df = pd.read_csv(student_data_to_load)
# Combine the data into a single dataset.
school_data_complete_df = pd.merge(student_data_df, school_data_df, how="left", on="school_name")
school_data_complete_df.head()
The output looks like the picture above.
I'm trying to:
Calculate the percentage of students with a passing math score (70 or greater)
Calculate the percentage of students with a passing reading score (70 or greater)
Calculate the percentage of students who passed math and reading (% Overall Passing)
I'm looking to populate a new dataframe with only the students who got 70 or greater on their math and reading scores, using the loc command.
I got this error. I don't understand, because the values in the columns should all be integers, so why is it saying I'm trying to pass strings in there as well?
You are not comparing the values in the column. You are just comparing "math_score" >= 70. There's a string on the left, and an integer on the right, hence your problem.
Fix the location of your square brackets, and you should be good to go:
passing_maths_total = school_data_complete_df.loc[school_data_complete_df["math_score"] >= 70]
Pandas broadcasts the result of the >= comparison, so comparing the Pandas Series school_data_complete_df["math_score"] with 70 results in a boolean Pandas Series, which can be used for indexing, e.g. in .loc.
The colon is unnecessary because the row index comes first in .loc anyways.
This solution is not tested.
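Building on the boolean indexing above, a minimal sketch of the three percentage calculations, assuming columns named math_score and reading_score and one row per student:
# Boolean masks broadcast over the whole column, as described above.
passing_math = school_data_complete_df["math_score"] >= 70
passing_reading = school_data_complete_df["reading_score"] >= 70

total_students = len(school_data_complete_df)
pct_passing_math = passing_math.sum() / total_students * 100
pct_passing_reading = passing_reading.sum() / total_students * 100
pct_overall_passing = (passing_math & passing_reading).sum() / total_students * 100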

Find the difference between the max and min of a third column if the first two columns match of an Excel file

I'm rather new to Python; so far I've only done the very basics in an intro to Python class. I was handed this data set and thought it could easily be handled in Python, but I have no idea where to begin.
I have a 3-column table in Excel. The first column is a code, the second is a row number, and the third is a numeric value. For every unique combination of the first two columns (for example, FLD04 in the first column and 1 in the second), I want to find the difference + 1 between the max and min values in the third column, and print a line that reads FLD04 1 30 (30 being the result of the difference between the max and min + 1). And I want to iterate this for every unique combination of the first two columns.
I can't figure out how to paste the Excel info as anything but an image, sorry. I just wanted to post it to help illustrate what I am dealing with.
When I first learned Python, I liked to print out intermediate variables and see how they look. You may try the following code, changing "CODE", "ROW", and "NUMBER" to your actual column names.
import pandas as pd

data = pd.read_excel(...)  # Read your data: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
# Group by the first two columns; for each group, take the max/min of column 3
group_max = data.groupby(["CODE", "ROW"])["NUMBER"].max()
group_min = data.groupby(["CODE", "ROW"])["NUMBER"].min()
result = group_max - group_min + 1
print(result)
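If you then want one printed line per unique pair, like the FLD04 1 30 example in the question, you can iterate over the resulting Series; its index holds the (code, row) pairs:
for (code, row), value in result.items():
    print(code, row, value)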

Pandas - Understanding how rolling averages work

So I'm trying to calculate rolling averages, based on some column and some groupby columns.
In my case:
rolling column = RATINGS,
groupby_columns = ["DEMOGRAPHIC","ORIGINATOR","START_ROUND_60","WDAY","PLAYBACK_PERIOD"]
One group of my data looks like this:
my code to compute the rolling average is:
df['rolling'] = df.groupby(groupby_columns)['RATINGS'].\
    apply(lambda x: x.shift().rolling(10, min_periods=1).mean())
What I don't understand is what happens when the RATINGS values start to be NaN.
As my window size is 10, I would expect the second number in the test (index 11) to be:
np.mean([178,479,72,272,158,37,85.5,159,107,164.55]) = 171.205
But it is instead 171.9444, and the same applies to the next numbers.
What is happening here?
And how should I calculate the next rolling averages the way I want (simply averaging the last 10 ratings, and if a rating is NaN, taking the calculated average of the previous row instead)?
Any help will be appreciated.
np.mean([178,479,72,272,158,37,85.5,159,107,164.55]) = 171.205
Where does the 164.55 come from? The rest of those values are from the "RATINGS" column and the 164.55 is from the "rolling" column. Maybe I am misunderstanding what the rolling function does.
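(For reference, 171.9444 is exactly the mean of the nine non-NaN RATINGS values in that window, 1547.5 / 9: rolling(10, min_periods=1) still spans ten rows, but NaNs inside the window are simply skipped.) If the goal is instead to substitute the previously computed rolling average for each NaN rating before later windows are evaluated, that recurrence cannot be expressed with rolling alone. A minimal sketch with a plain loop (the helper name is hypothetical):
import numpy as np
import pandas as pd

def rolling_mean_with_fill(s, window=10):
    # Mean of up to `window` previous values, like shift().rolling(window,
    # min_periods=1).mean(), except each NaN in `s` is replaced by the
    # average computed for its row before later windows see it.
    vals = list(s)
    out = []
    for i, v in enumerate(vals):
        prior = vals[max(0, i - window):i]
        m = np.mean(prior) if prior else np.nan
        out.append(m)
        if pd.isna(v):
            vals[i] = m  # feed the running average forward in place of the NaN
    return pd.Series(out, index=s.index)

df['rolling'] = df.groupby(groupby_columns)['RATINGS'].transform(rolling_mean_with_fill)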

Python and Pandas - Transferring Excel formula to Pandas

At the moment, I'm trying to migrate my Weibull calculations from an excel macro to Python, and the tool I've been primarily using is Pandas. The formula I am currently having trouble converting from excel to Python is as follows:
Adjusted Rank = (Previous value in the adjusted rank column) * (Another column's value), but the first value in the adjusted rank column = 0
My brain is trying to copy and paste this methodology to pandas, but as you can imagine, it doesn't work that way:
DF[Adjusted Rank] = (Previous value in the adjusted rank column) * DF(Another Column), but the first value in the adjusted rank column = 0
In the end, I imagine the adjusted rank column will look like so:
Adjusted Rank
0
Some Number
Some Number
Some Number
etc.
I'm having some trouble puzzling out how to make each "cell" in the adjusted rank column refer to the previous value in the column in Pandas. Additionally, is there a way to set only the first entry in the column equal to 0? Thanks all!
You can use shift to multiply by previous values and add a zero to the start; this should work:
df['new'] = df['adjusted_rank'].shift(periods=1, fill_value=0) * df['another_column']
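One caveat: shift only works if the adjusted_rank column already exists. If each value instead depends on the previously computed value (a true recurrence, as the Excel formula suggests), a plain loop is the simplest route. A sketch with hypothetical data, where "factor" stands in for the other column:
import pandas as pd

df = pd.DataFrame({"factor": [1.1, 1.2, 1.3, 1.4]})  # hypothetical values

ranks = [0.0]  # the first adjusted rank is defined as 0
for f in df["factor"].iloc[1:]:
    ranks.append(ranks[-1] * f)  # previous adjusted rank * other column's value
df["adjusted_rank"] = ranks
Taken literally, a leading 0 makes every later product 0 as well, so the actual Excel formula presumably includes an additive term; adjust the update line inside the loop to match.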

Performing multiple calculations on a Python Pandas group from CSV data

I have daily CSVs, automatically created for work, that average about 1000 rows and have exactly 630 columns. I've been trying to work with pandas to create a summary report that I can write to a new txt file each day.
The problem I'm facing is that I don't know how to group the data by 'provider' while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns (-2000 to 300000) are profit/loss data based on time (milliseconds). The file is usually between 700 and 1000 lines, and I usually don't use any data past column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.group('providers')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath
While I still don't understand the proper format to write all of these custom calculations, my biggest hurdle is referencing one of my calculations in the new dataframe (i.e., total_tickets) and applying it to the next calculation (i.e., 1s Loss).
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.
The function you want is DataFrame.groupby, with more examples in the documentation here.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups, you simply call grouped = df.groupby('provider'). Note that this does no calculations; it just tells pandas how to find groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., for total tickets), you can just assign them to a variable, e.g., total_tickets = df['filled'].sum() and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations.
Update:
For one second loss (and for the other losses), you need two things:
The number of times df['1000'] < 0, for each provider
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: x['filled'][x['1000'] < 0].sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
The same approach applies to the 10s Loss question.
The last question, about the average over a range of columns, relies on pandas' understanding of how it should apply functions. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column. There's a default argument in mean() that is axis=0. If you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)
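Putting the pieces together, a minimal sketch of the whole summary, assuming the column names provider, filled, fill_size, and the millisecond headings from the question (the file name and the 1000-column spacing are guesses):
import pandas as pd

df = pd.read_csv("daily_report.csv")  # hypothetical file name
grouped = df.groupby("provider")

summary = pd.DataFrame()
summary["total_tickets"] = grouped["filled"].sum()
summary["share_pct"] = summary["total_tickets"] / df["filled"].sum() * 100
summary["fill_rate"] = grouped["filled"].mean()  # filled is 1/0, so the mean is the fill rate
summary["size"] = grouped["fill_size"].sum()

# Earlier results (total_tickets) can be reused directly in later
# calculations, because every series here is aligned on the provider index.
summary["1s_loss"] = grouped.apply(lambda x: (x["1000"] < 0).sum()) / summary["total_tickets"]
summary["1s_avg"] = grouped["1000"].mean()

cols = [str(c) for c in range(1000, 10001, 1000)]  # assumed column spacing
summary["10s_loss"] = grouped.apply(lambda x: (x[cols].min(axis=1) < 0).sum()) / summary["total_tickets"]
summary["10s_avg"] = grouped[cols].sum().mean(axis=1)

summary.to_csv("provider_summary.txt", sep="\t")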
