Python and Pandas - Transferring Excel formula to Pandas - python

At the moment, I'm trying to migrate my Weibull calculations from an Excel macro to Python, and the tool I've been primarily using is Pandas. The formula I am currently having trouble converting from Excel to Python is as follows:
Adjusted Rank = (Previous value in the adjusted rank column) * (Another column's value), but the first value in the adjusted rank column = 0
My brain is trying to copy and paste this methodology to pandas, but as you can imagine, it doesn't work that way:
DF[Adjusted Rank] = (Previous value in the adjusted rank column) * DF(Another Column), but the first value in the adjusted rank column = 0
In the end, I imagine the adjusted rank column will look like so:
Adjusted Rank
0
Some Number
Some Number
Some Number
etc.
I'm having some trouble puzzling out how to make each "cell" in the adjusted rank column refer to the previous value in the column in Pandas. Additionally, is there a way to set only the first entry in the column equal to 0? Thanks all!

You can use shift to multiply by the previous values; its fill_value argument puts a zero at the start. This should work:
df['new'] = df['adjusted_rank'].shift(periods=1, fill_value=0) * df['another_column']
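To see what shift does here, a minimal sketch on made-up numbers may help. The column values are hypothetical, and it assumes an adjusted_rank column already exists to shift from:

```python
import pandas as pd

# Hypothetical sample data; the real columns come from the Weibull worksheet.
df = pd.DataFrame({
    "adjusted_rank": [1.0, 2.0, 3.0, 4.0],
    "another_column": [10.0, 10.0, 10.0, 10.0],
})

# Move every value down one row; the empty first slot is filled with 0.
previous_rank = df["adjusted_rank"].shift(periods=1, fill_value=0)

# Row i now multiplies the rank from row i - 1 by another_column in row i.
df["new"] = previous_rank * df["another_column"]
print(df["new"].tolist())  # [0.0, 10.0, 20.0, 30.0]
```

The first entry is 0 only because of fill_value=0; without it, shift leaves NaN in that slot.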

Related

How to translate rolling function in Python to PySpark

I am trying to replicate the following Python code:
df_act['steps_ma'] = df_act['steps'].rolling(5).mean()
The code is supposed to calculate the rolling mean of the steps column, using the current row and the 4 rows before it. For example, at row 10 the mean is taken over the values of steps in rows 10, 9, 8, 7 and 6.
I want to do the same in PySpark, but the only examples I can find use timestamp columns, while my column "steps" is a column of integers.
I am going back and forth with this approach but I am getting so many errors:
winSpec = Window.partitionBy('user_id').orderBy('timestamp').rangeBetween(-(5), 0)
df_act.withColumn('steps_ma', mean('steps').over(winSpec)).show()
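As a reference point when porting, here is a small pandas sketch of the window the question describes (the current row plus the 4 rows before it), using made-up step counts. In PySpark, a frame counted in rows is normally built with Window.rowsBetween rather than rangeBetween, though that port is not shown or tested here:

```python
import pandas as pd

# Hypothetical step counts for a single user.
df_act = pd.DataFrame({"steps": [10, 20, 30, 40, 50, 60]})

# Window of 5 rows: the current row and the 4 rows before it.
df_act["steps_ma"] = df_act["steps"].rolling(5).mean()

# The first 4 rows have fewer than 5 values available, so pandas returns NaN.
# Row 4 averages rows 0..4, and row 5 averages rows 1..5.
manual_last = df_act["steps"].iloc[1:6].mean()
print(df_act["steps_ma"].tolist())
```

The NaN values in the first rows are a pandas default (min_periods equals the window size), so a port may need to handle short windows explicitly if it is meant to match.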

Create Dataframe for What-If Visualization by Resorting Data

I'm trying to create an experimental dataframe (that will be used only for the comparative viz) from other correlated dataframes and then sort the column values independent of each other to later visualize and show what correlated data should look like (because my current data actually shows no correlation)
experimental_df = full_df[['Token_Rarity','Price_USD']]
Becomes:
experimental_df.sort_values(by=['Token_Rarity','Price_USD'],ascending=[True,True])
I'm trying to get the lowest value in the token column and the lowest value in the price column, or vice versa, regardless of any other values or arguments. Current result:
Please try this:
experimental_df = pd.DataFrame()
experimental_df['Token_Rarity'] = full_df['Token_Rarity'].sort_values().reset_index(drop=True)
experimental_df['Price_USD'] = full_df['Price_USD'].sort_values().reset_index(drop=True)
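A minimal sketch of the same idea on made-up data, in case it helps to see the result. Here the sorted values are turned into plain arrays so that pandas cannot re-align them by their old index labels. The sample values are hypothetical:

```python
import pandas as pd

# Hypothetical, deliberately uncorrelated sample data.
full_df = pd.DataFrame({
    "Token_Rarity": [3, 1, 2],
    "Price_USD": [10.0, 30.0, 20.0],
})

# Sort each column on its own and discard the original index, so the
# smallest rarity ends up next to the smallest price, and so on.
experimental_df = pd.DataFrame({
    col: full_df[col].sort_values().to_numpy()
    for col in ["Token_Rarity", "Price_USD"]
})
print(experimental_df)
```

Dropping the index matters because pandas aligns on index labels during column assignment, which would otherwise put the values back in their original pairing.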

How to get values using pd.shift

I am trying to populate values in the column motooutstandingbalance by subtracting the previous row actualmotordeductionfortheweek from previous row motooutstandingbalance. I am using pandas shift command but currently not getting the desired output which should be a consistent reduction in motooutstandingbalance week by week.
Final result should look like this
Here is my code
x['motooutstandingbalance']=np.where(x.salesrepid == x.shift(1).salesrepid, x.shift(1).motooutstandingbalance - x.shift(1).actualmotordeductionfortheweek, x.motooutstandingbalance)
Any ideas on how to achieve this?
This works:
start_value = 468300.0
# pd.concat is used because Series.append is not available in pandas 2.0 and later
df['motooutstandingbalance'] = pd.concat([-df['actualmotordeductionfortheweek'][::-1], pd.Series([start_value], index=[-1])])[::-1].cumsum().reset_index(drop=True)
Basically, what I'm doing is:
Taking the actualmotordeductionfortheweek column, negating it (all the values become negative), and reversing it
Appending the start value (which is positive, as opposed to all the other values, which are negative) with the index label -1; it lands at the end of the reversed series
Reversing it back, so that the new -1 entry goes to the very beginning
Using cumsum() to add up all of the values of the column. This actually works to subtract all the values from the start value, because the first value is positive and the rest of the values are negative (because x + (-y) = x - y)
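The same running balance can also be sketched with shift and cumsum, which may be easier to read. This is an alternative formulation rather than the code above; the deduction amounts are made up, and it assumes a single sales rep with rows already in week order:

```python
import pandas as pd

# Hypothetical weekly deductions for one sales rep, in week order.
df = pd.DataFrame({"actualmotordeductionfortheweek": [1000.0, 1500.0, 800.0]})
start_value = 468300.0

# Total deducted BEFORE each week: shift down one row, then accumulate.
deducted_before = df["actualmotordeductionfortheweek"].shift(1, fill_value=0).cumsum()

# Outstanding balance at the start of each week.
df["motooutstandingbalance"] = start_value - deducted_before
print(df["motooutstandingbalance"].tolist())  # [468300.0, 467300.0, 465800.0]
```

With several sales reps in one frame, the shift and cumsum would presumably need to run per salesrepid group, each with its own start value.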

I am not able to correctly assign a value to a df row based on 3 conditions (checking values in 3 other columns)

I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month ProductID(SKU) Family Sales ProporcionVenta
1 1234 FISH 10000.0 0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month in relation to the sum of sales of family-month. For example, the family fish has sold 100,000 in month 1, so in this specific case it would be calculated 10,000/100,000 (productid-month-sales/family-month-sales)
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==family)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel and they are correct, but the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new one.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (and even NumPy), unlike general-purpose Python, analysts should avoid for loops, as there are many vectorized options for running conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs indicate, values broadcast to match the shape of the input array.
Currently, your code attempts to assign a value to a subsetted slice of a data frame column, which should raise SettingWithCopyWarning. Such an operation does not affect the original data frame. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping, since transform works nicely for assigning new data frame columns. Also, div below is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
.div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
)
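A small worked sketch of the transform approach, using the numbers from the question (product 1234 sells 10,000 of the FISH family's 100,000 in month 1). The other rows are made up, and it assumes one row per SKU and month, so the numerator is simply the row's own Qty:

```python
import pandas as pd

# Hypothetical data; only the first row's figures come from the question.
testingAgain = pd.DataFrame({
    "Month": [1, 1, 1, 2],
    "SKU": [1234, 5678, 9999, 1234],
    "Family": ["FISH", "FISH", "MEAT", "FISH"],
    "Qty": [10000.0, 90000.0, 5000.0, 2000.0],
})

# transform('sum') returns one total per ROW (not per group), so the result
# lines up with the original frame and can be divided element-wise.
family_month_total = testingAgain.groupby(["Family", "Month"])["Qty"].transform("sum")
testingAgain["ProporcionVenta"] = testingAgain["Qty"].div(family_month_total)
print(testingAgain)
```

The first row comes out as 10,000 / 100,000 = 0.1, matching the manual calculation in the question.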

Performing multiple calculations on a Python Pandas group from CSV data

I have daily CSVs that are automatically created for work; they average about 1000 rows and have exactly 630 columns. I've been trying to work with pandas to create a summary report that I can write to a new .txt file each day.
The problem that I'm facing is that I don't know how to group the data by 'provider', while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns(-2000 to 300000) are profit/loss data based on time(milliseconds). The file is usually between 700-1000 lines and I usually don't use any data past column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.group('providers')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath
While I still don't understand the proper format to write all of these custom calculations, my biggest hurdle is referencing one of my calculations in the new dataframe (ie. total_tickets) and applying it to the next calculation (ie. 1s loss)
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.
The function you want is DataFrame.groupby; the pandas documentation has more examples.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups, you simply call grouped = df.groupby('provider'). Note that this does no calculations; it just tells pandas how to find groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., for total tickets), you can just assign them to variables, e.g., total_tickets = df['filled'].sum() and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations.
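The options above can be sketched on a tiny made-up frame. The provider names and values are hypothetical; the column names 'provider', 'filled' and 'fill_size' follow the question:

```python
import pandas as pd

# Hypothetical rows; filled is 1 for a fill and 0 for a reject.
df = pd.DataFrame({
    "provider": ["A", "A", "B", "B", "B"],
    "filled": [1, 0, 1, 1, 1],
    "fill_size": [100, 0, 200, 50, 50],
})

grouped = df.groupby("provider")  # no calculation happens yet

tickets_by_provider = grouped["filled"].sum()   # total tickets per provider
total_tickets = df["filled"].sum()              # all filled tickets in the file

summary = pd.DataFrame({
    "total_tickets": tickets_by_provider,
    "share": tickets_by_provider / total_tickets,
    # With 1/0 flags, the mean is filled / (filled + rejected).
    "fill_rate": grouped["filled"].mean(),
    "size": grouped["fill_size"].sum(),
})
print(summary)
```

Because every piece is indexed by provider, pandas lines the columns up automatically when they are combined into summary.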
Update:
For one second loss (and for the other losses), you need two things:
The number of times df['1000'] < 0 for each provider
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: x['filled'][x['1000'] < 0].sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
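A minimal sketch of the 1s Loss calculation on made-up data. Instead of apply with a lambda, it groups a boolean column directly, which is a swapped-in but equivalent way to count the losing rows; it also assumes every row counts as a ticket:

```python
import pandas as pd

# Hypothetical profit/loss values at the 1000 ms column.
df = pd.DataFrame({
    "provider": ["A", "A", "B", "B"],
    "1000": [-5.0, 3.0, -1.0, -2.0],
})

# True where the 1-second P/L is negative.
is_loss = df["1000"] < 0

# Summing booleans counts the True values within each provider.
losses_per_provider = is_loss.groupby(df["provider"]).sum()
records_per_provider = df.groupby("provider")["1000"].count()

# Both series are indexed by provider, so the division aligns by label.
loss_rate = losses_per_provider / records_per_provider
print(loss_rate)
```

To divide by filled tickets instead of all records, the denominator would be the per-provider sum of 'filled'.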
The same approach carries over to the 10s Loss question.
The last question, about the average over a range of columns, relies on pandas' understanding of how it should apply functions. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column. mean() has a default argument, axis=0. If you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)
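Putting the axis=1 idea to work, here is a sketch of both 10s metrics on made-up data. The three column names stand in for the real range '1000' to '10000', and the per-provider aggregation uses mean rather than sum, which is one of the "any aggregation" choices mentioned above:

```python
import pandas as pd

# Hypothetical P/L columns standing in for the '1000'..'10000' range.
df = pd.DataFrame({
    "provider": ["A", "A", "B"],
    "1000": [1.0, 4.0, -1.0],
    "5000": [-2.0, 5.0, -1.0],
    "10000": [3.0, 6.0, -1.0],
})
cols = ["1000", "5000", "10000"]

# 10s Loss: the row-wise minimum (axis=1) is negative if any column dipped below 0.
row_min = df[cols].min(axis=1)
ten_s_loss = (row_min < 0).groupby(df["provider"]).mean()

# 10s Avg: aggregate to one row per provider, then average across the columns.
one_rec_per_provider = df.groupby("provider")[cols].mean()
ten_s_avg = one_rec_per_provider.mean(axis=1)
print(ten_s_loss, ten_s_avg, sep="\n")
```

Taking the mean of a boolean series gives the fraction of True values, so ten_s_loss is already a rate per provider.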
