Performing multiple calculations on a Python Pandas group from CSV data - python

I have daily csv's that are automatically created for work that average about 1000 rows and exactly 630 columns. I've been trying to work with pandas to create a summary report that I can write to a new txt.file each day.
The problem that I'm facing is that I don't know how to group the data by 'provider', while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns(-2000 to 300000) are profit/loss data based on time(milliseconds). The file is usually between 700-1000 lines and I usually don't use any data past column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.group('providers')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath
While I still don't understand the proper format to write all of these custom calculations, my biggest hurdle is referencing one of my calculations in the new dataframe (ie. total_tickets) and applying it to the next calculation (ie. 1s loss)
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.

The function you want is DataFrame.groupby, with more examples in the documentation here.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups, you simple call grouped = df.groupby('provider'). Note that this does no calculations, just tells pandas how to find groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., for total tickets), you can just assign them to a variable, to total_tickets = df['filled'].sum(), and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations.
Update:
For one second loss (and for the other losses), you need two things:
The number of times for each provider df['1000'] < 0
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: x['fill'][x['1000'] < 0].sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
This analogizes to the 10s Loss question.
The last question about the average over a range of columns relies on pandas understanding of how it should apply functions. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column. There's a default argument in mean() that is axis=0. If you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)

Related

How to calculate means of other columns based on one column data if I want that data to be in specific range

I am a beginner and currently doing a data analysis project and would like to know if its possible to filter one column data and find the means of other columns based on that filtration?
Right now I have this
means = df.groupby('popularity').mean()
but I would like to get only those means if the popularity is greater than 95.
It ranges from 0 to 97 and right now the code shows me all of the means of different columns based on popularity column.
There is multiple ways to do this
You can try filtering.
df.loc[df['Popularity'] >= 95].groupby(['Popularity']).mean()
That oughta return your full dataset and do the same grouped operation but filtered according to your condition
df.loc[df['Popularity'] >= 95].groupby(['Popularity']).transform(np.mean)
Oughta return a series of the mean values not sure which one you want so here is both

I am not able to correctly assign a value to a df row based on 3 conditions (checking values in 3 other columns)

I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month ProductID(SKU) Family Sales ProporcionVenta
1 1234 FISH 10000.0 0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month in relation to the sum of sales of family-month. For example, the family fish has sold 100,000 in month 1, so in this specific case it would be calculated 10,000/100,000 (productid-month-sales/family-month-sales)
I am trying to do so like this:
for family in uniqueFamilies:
for month in months:
salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
for sku in uniqueSKU:
salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
proporcion = salesSKUMonth/salesFamilyMonth
testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==familia)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel and they are correct, but the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new one.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (even Numpy), unlike general purpose Python, analysts should avoid using for loops as there are many vectorized options to run conditional or grouped calculations. In your case, consider groupby().transform() which returns inline aggregates (i.e., aggregate values without collapsing rows) or
as docs indicate: broadcast to match the shape of the input array.
Currently, your code is attempting to assign a value to a subsetted slice of data frame column that should raise SettingWithCopyWarning. Such an operation would not affect original data frame. Your loop can use .loc for conditional assignment
testingAgain.loc[(testingAgain['SKU']==sku) &
(testingAgain['Family']==familia) &
(testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping since transform works nicely to assign new data frame columns. Also, below div is the Series division method (functionally equivalent to / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Monthh'])['Qty'].transform('sum')
.div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
)

Return String Similarity Scores between two String Columns - Pandas

I'm trying to build a search based results, where in I will have an input dataframe having one row and I want to compare with another dataframe having almost 1 million rows. I'm using a package called Record Linkage
However, I'm not able to handle typos. Lets say I have "HSBC" in my original data and the user types it as "HKSBC", I want to return "HSBC" results only. On comparing the string similarity distance with jarowinkler I get the following results:
from pyjarowinkler import distance
distance.get_jaro_distance("hksbc", "hsbc", winkler=True, scaling=0.1)
>> 0.94
However, I'm not able to give "HSBC" as an output, so I want to create a new column in my pandas dataframe where in I'll compute the string similarity scores and take that part of the score which has a score above a particular threshold.
Also, the main bottleneck is that I have almost 1 million data, so I need to compute it really fast.
P.S. I have no intentions of using fuzzywuzzy, preferable either of Jaccard or Jaro-Winkler
P.P.S. Any other ideas to handle typos for search based thing is also acceptable
I was able to solve it through record linkage only. So basically it does an initial indexing and generates candidate links (You can refer to the documentation on "SortedNeighbourhoodindexing" for more info), i.e. it does a multi-indexing between the two dataframes that needs to be compared, which I did manually.
So here is my code:
import recordlinkage
df['index'] = 1 # this will be static since I'll have only one input value
df['index_2'] = range(1, len(df)+1)
df.set_index(['index', 'index_2'], inplace=True)
candidate_links=df.index
df.reset_index(drop=True, inplace=True)
df.index = range(1, len(df)+1)
# once the candidate links has been generated you need to reset the index and compare with the input dataframe which basically has only one static index, i.e. 1
compare_cl = recordlinkage.Compare()
compare_cl.string('Name', 'Name', label='Name', method='jarowinkler') # 'Name' is the column name which is there in both the dataframe
features = compare_cl.compute(candidate_links,df_input,df) # df_input is the i/p df having only one index value since it will always have only one row
print(features)
Name
index index_2
1 13446 0.494444
13447 0.420833
13469 0.517949
Now I can give a filter like this:
features = features[features['Name'] > 0.9] # setting the threshold which will filter away my not-so-close names.
Then,
df = df[df['index'].isin(features['index_2'])
This will sort my results and give me the final dataframe which has a name score greater than a particular threshold set by the user.

How to to arrange a loop in order to loop over columns and then do something

I'm a complete newbie to python, and I'm currently trying to work on a problem that allows me to take the average of each column except the number of columns is unknown.
I figured how to do it if I knew how many columns it is and to do each calculation separate. I'm supposed to do it by creating an empty list and looping the columns back into it.
import numpy as np
#average of all data not including NAN
def average (dataset):
return np.mean (dataset [np.isfinite (dataset)])
#this is how I did it by each column separate
dataset = np.genfromtxt("some file")
print (average(dataset [:,0]))
print (average(dataset [:,1]))
#what I'm trying to do with a loop
def avg (dataset):
for column in dataset:
lst = []
column = #i'm not sure how to define how many columns I have
Avg = average (column)
return Avg
You can use the numpy.mean() function:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
with:
np.mean(my_data, axis=0)
The axis indicates whether you are taking the average along columns or rows (axis = 0 means you take the average of each column, what you are trying to do). The output will be a vector whose length is the same as the number of columns (or rows) along which you took the average, and each element is the average of the corresponding column (or row). You do not need to know the shape of the matrix in advance to do this.
You CAN do this using a for loop, but it's not a good idea -- looping over matrices in numpy is slow, whereas using vectorized operations like np.mean() is very very fast. So in general when using numpy one tries to use those types of built-in operations instead of looping over everything at least if possible.
Also -- if you want the number of columns in your matrix -- it's
my_matrix.shape[1]
returns number of columns;
my_matrix.shape[0] is number of rows.

Apply LSH approxNearestNeighbors on all rows of a dataframe

I am trying to apply BucketedRandomProjectionLSH's function model.approxNearestNeighbors(df, key, n) on all the rows of a dataframe in order to approx-find the top n most similar items for every item. My dataframe has 1 million rows.
My problem is that I have to find a way to compute it within a reasonable time (no more than 2 hrs). I've read about that function approxSimilarityJoin(df, df, threshold) but the function takes way too long and doesn't return the right number of rows : if my dataframe has 100.000 rows, and I set a threshold VERY high/permissive I get something like not even 10% of the number of rows returned.
So, I'm thinking about using approxNearestNeighbors on all rows so that the computation time is almost linear.
How do you apply that function to every row of a dataframe ? I can't use a UDF since I need the model + a dataframe as inputs.
Do you have any suggestions ?

Categories