Groupby values, perform calculations and apply to repeating rows - python

I have the following df:
  wallet  position  position_rewards position_type  token_sales
0  0x123  SUSHI_LP               250      Sushi_LP          500
0  0x123      ALCX               750      LP Token          500
1  0xabc     GAMMA            333.33      LP Token          750
1  0xabc       FXS            666.66            LD          750
Note that the sum of the values in position_rewards for each wallet is the TOTAL for that wallet, and the token_sales column may show a lower amount that was actually sold out of that total. You can see that in the example: wallet 0x123 received 1000 rewards in total, but sold only 500.
I want to create the following columns, which are calculations based on the already existing columns. The logic for each is described below.
Column 1: df['position_rewards_pct']
This column is supposed to hold the corresponding % of the rewards per position over the total rewards per wallet.
My code:
df['position_rewards_pct'] = (df['position_rewards'] / sum(df['position_rewards'].apply(Decimal))) * 100
Problem: Outputting NaNs
Column 2: df['token_sales_per_type']
This column is supposed to show how many tokens have been sold (token_sales column) for a given position_type.
Please note that each wallet has only a single value in the existing token_sales column; that is, you will never see different token_sales values for a single wallet.
In the end, this column should show (repeatedly, for every row of a given position_type) the amount of tokens sold for that specific type. So as values in position_type repeat, so will the values in df['token_sales_per_type'].
Note that all values are in Decimal object form.
Essentially, the structure of the final df should logically be the following:

In trying to formulate a response, I'm finding that the text of your question doesn't quite match up with the data and visualizations you've provided. Perhaps that DataFrame you're showing is a result of a preliminary grouping operation, rather than the underlying data?
In any event, your question is in the general category of split/apply/combine, for which Pandas has many tools, some of which may seem a bit tricky to grasp.
Usually when you want to perform a grouping, and then apply some operation back to the dataset on the basis of what you found in the grouping, you use .groupby() followed by .transform().
.transform() has the wonderful ability to take the result of an aggregation function, and apply it back to every member of the group. The classic example is subtracting the mean() for a group from every value within that group.
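As a quick illustration of that pattern (a minimal sketch with made-up column names, not the question's data):
import pandas as pd

demo = pd.DataFrame({"group": ["a", "a", "b", "b"],
                     "x": [1.0, 3.0, 10.0, 20.0]})
# transform('mean') broadcasts each group's mean back onto every row of that group
demo["x_demeaned"] = demo["x"] - demo.groupby("group")["x"].transform("mean")
print(demo)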
Examples using the dataframe you provided:
Group by the wallet and position, sum the rewards
someDF.groupby(by=["wallet", "position"]).transform(sum)["position_rewards"]
0 250.0
1 750.0
2 333.33
3 666.66
percentage of total (this one doesn't quite make sense in the context of the df you provided, since the position column is all unique)
someDF["position_rewards"] / someDF.groupby(by=["wallet", "position"]).transform(sum)["position_rewards"]
0 1.0
1 1.0
2 1.0
3 1.0
Apply the sum of token_sales to each position type
someDF.groupby(by=["position_type"]).transform(sum)["token_sales"]
0 500
1 1250
2 1250
3 750
One final comment on decimal and percentage formatting: it's best to leave that to the display layer rather than modifying the data. You can do that with the pandas Styler.
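For example, a minimal sketch (assuming the position_rewards_pct column from above already exists; the Styler renders in notebooks/HTML):
# format for display only; the underlying values stay numeric
df.style.format({"position_rewards": "{:.2f}",
                 "position_rewards_pct": "{:.2f}%"})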

For the first column you need a groupby.transform('sum') and rdiv:
df['position_rewards_pct'] = (df.groupby('wallet')['position_rewards']
                                .transform('sum')
                                .rdiv(df['position_rewards'])
                                .mul(100)
                                .round(2))
output:
  wallet  position  position_rewards position_type  token_sales  position_rewards_pct
0  0x123  SUSHI_LP            250.00      Sushi_LP          500                 25.00
0  0x123      ALCX            750.00      LP Token          500                 75.00
1  0xabc     GAMMA            333.33      LP Token          750                 33.33
1  0xabc       FXS            666.66            LD          750                 66.67
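For the second column (token_sales_per_type) the same transform idea works. A minimal sketch covering both requested columns, assuming position_rewards and token_sales are plain floats rather than Decimal objects:
# % of the wallet's total rewards contributed by each position
df['position_rewards_pct'] = (df['position_rewards']
                              / df.groupby('wallet')['position_rewards'].transform('sum')
                              * 100).round(2)

# total tokens sold per position_type, repeated on every row of that type
df['token_sales_per_type'] = df.groupby('position_type')['token_sales'].transform('sum')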

Related

Pandas - partition a dataframe into two groups with an approximate mean value

I want to split all rows into two groups that have similar means.
I have a dataframe of about 50 rows (though this could grow to several thousand) with a column of interest called 'value'.
value total bucket
300048137 3.0741 3.0741 0
352969997 2.1024 5.1765 0
abc13.com 4.5237 9.7002 0
abc7.com 5.8202 15.5204 0
abcnews.go.com 6.7270 22.2474 0
........
www.legacy.com 12.6609 263.0797 1
www.math-aids.com 10.9832 274.0629 1
So far I have tried a cumulative sum (which created the total column) and then essentially made the split at the mid-point of the total column, based on this solution.
test['total'] = test['value'].cumsum()
df_sum = test['value'].sum()//2
test['bucket'] = np.where(test['total'] <= df_sum, 0,1)
If I try to group them and take the average for each group then the difference is quite significant
display(test.groupby('bucket')['value'].mean())
bucket
0 7.456262
1 10.773905
Is there a way I could achieve this partition based on means instead of sums? I was thinking about using expanding means from pandas but couldn't find a proper way to do it.
I am not sure I understand what you are trying to do, but possibly you want to group by quantiles of a column. If so:
test['bucket'] = pd.qcut(test['value'], q=2, labels=False)
which will set bucket=0 for the half of the rows with the lower 'value' and 1 for the rest. By tweaking the q parameter you can have as many groups as you want (as long as q <= the number of rows).
Edit:
A new attempt, now that I think I better understand your aim:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(100)})
df['group'] = df['value'].argsort().mod(2)
df.groupby('group')['value'].mean()
# group
# 0    49.0
# 1    50.0
# Name: value, dtype: float64

df['group'] = df['value'].argsort().mod(3)
df.groupby('group')['value'].mean()
#group
# 0 49.5
# 1 49.0
# 2 50.0
# Name: value, dtype: float64

Trying to group repeated x values, and find the mean of the y values associated with these repeats

I am using pandas. I wrote a script that does what I want, but it is definitely not optimized at all. Basically, I find all repeated x values in name_array, take the average of the associated y values, replace the y value of the first row with that average, and remove all of the repeated x value's rows except the first one. Effectively, on a graph, I remove points that appear stacked on top of each other and plot only the resulting average instead.
cats = np.unique(name_array[selected_x].values)
for j in cats:
    rows_cat = name_array[name_array[selected_x] == j]
    first_row = rows_cat.iloc[[0], :]
    avg = rows_cat[selected_y].mean()
    first_row[selected_y] = avg
    name_array = name_array[name_array[selected_x] != j]
    name_array = name_array.append(first_row, ignore_index=True)
This is the script I am trying to replace it with. However, it does not work and I am not sure why. I am trying to group by the x values as before and replace the y value of each newly grouped x with the mean:
name_array[selected_y] = name_array.groupby(selected_x)[selected_y].mean()
This approach seems much simpler, more readable, and efficient. Any ideas why it is not performing the same function?
Edit:
An input example:
date        state  new_cases  new deaths  days_since_date  etc.
2021-03-24  PA     500        200         4                etc.
2021-03-25  PA     300        300         4                etc.
2021-03-26  PA     400        100         2                etc.
2021-03-27  PA     200        200         1                etc.
say selected_y is new_cases, and selected_x is days_since_date.
What I want is, this:
date        state  new_cases  new deaths  days_since_date  etc.
2021-03-24  PA     400        200         4                etc.
2021-03-26  PA     400        100         2                etc.
2021-03-27  PA     200        200         1                etc.
Essentially: group where values repeat in the selected_x column and take the mean of the associated values in the selected_y column, but do not take the mean of the other columns.
The reason being that the date is not a datetime type, so I cannot see taking the mean of it producing anything meaningful, and in the grand scheme of things I do not care if the date is averaged. The same applies to state: you cannot take the mean of a string, unless you do some kind of ASCII math, which is not what I want either.
data.groupby(['x']).mean()['y']
This groups the data by x, computes the mean of every numeric column within each group by calling .mean(), and then slices out the y column you need.
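If you also want to keep the first value of every other column (date, state, new deaths) while averaging only selected_y, a sketch using a per-column aggregation dict could look like this (selected_x and selected_y are the question's own variables; agg_spec is just a name made up here):
# average selected_y per group, keep the first value of every other column
agg_spec = {col: 'first' for col in name_array.columns
            if col not in (selected_x, selected_y)}
agg_spec[selected_y] = 'mean'
result = name_array.groupby(selected_x, as_index=False).agg(agg_spec)
For the example above this should keep 2021-03-24 / 200 deaths for the days_since_date == 4 group and average new_cases to 400, matching the desired output (the rows come back sorted by days_since_date).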

How to detect outliers in a timeseries dataframe and write the "clean" ones in a new dataframe

I'm really new to Python (and programming in general, hihi) and I'm analyzing 2 years of meteorological data measured every 10 s. In total I have 12 meteorological parameters, and I've created my dataframe df with the time as the row index and the names of the meteorological parameters as the column names. Since I don't need super-fine granularity, I've resampled the data to hourly values, so the dataframe looks something like this.
Time G_DIFF G_HOR G_INCL RAIN RH T_a V_a V_a_dir
2016-05-01 02:00:00 0.0 0.011111 0.000000 0.013333 100.0 9.128167 1.038944 175.378056
2016-05-01 03:00:00 0.0 0.200000 0.016667 0.020000 100.0 8.745833 1.636944 218.617500
2016-05-01 04:00:00 0.0 0.105556 0.013889 0.010000 100.0 8.295333 0.931000 232.873333
There are outliers, and I can get rid of them with a rolling standard deviation and mean, which is what I've done "by hand" with the following code for one of the columns (the ambient temperature), where the algorithm writes the clean data into another dataframe (tr in the example below).
roll = df["T_a"].rolling(24,center = True) #24h window
mean, std = roll.mean(), roll.std()
cut = std*3
low, up = mean - cut, mean+cut
tr.loc[(df["T_a"] < low) | (df["T_a"] > up) | (df["T_a"].isna()), "outliers"] = df["T_a"]
tr.loc[(df["T_a"] >= low) & (df["T_a"] <= up), "T_a"] = df["T_a"]
tr.loc[tr["T_a"].isna(),"T_a"] = tr["T_a"].bfill() #to input a value when a datum is NaN
Now, as I said, that works okay for one column, BUT I would like to be able to do it for all 12 columns, and I'm also almost sure there's a more pythonic way to do it. I guess a for loop should be feasible, but nothing I've tried so far is working.
Could anyone give me some light, please? Thank you so much!!
all_columns = list(df.columns)  # list of all column names
all_columns.remove('G_DIFF')    # drop any column you don't want to process

for column in all_columns:
    roll = df[column].rolling(24, center=True)  # 24h window
    mean, std = roll.mean(), roll.std()
    cut = std * 3
    low, up = mean - cut, mean + cut
    tr.loc[(df[column] < low) | (df[column] > up) | (df[column].isna()), "outliers"] = df[column]
    tr.loc[(df[column] >= low) & (df[column] <= up), column] = df[column]
    tr.loc[tr[column].isna(), column] = tr[column].bfill()  # fill in a value when a datum is NaN
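If you would rather avoid the explicit loop, a rough sketch of the same rolling mean/std logic wrapped in a helper and applied column-wise could look like this (clean_column is just a made-up name, and this version only produces the cleaned values, without the separate "outliers" bookkeeping column):
import pandas as pd

def clean_column(s, window=24, n_std=3):
    """Mask values outside mean +/- n_std rolling standard deviations, then backfill."""
    roll = s.rolling(window, center=True)
    mean, std = roll.mean(), roll.std()
    low, up = mean - n_std * std, mean + n_std * std
    # out-of-range values become NaN, then get backfilled
    return s.where((s >= low) & (s <= up)).bfill()

tr = df.apply(clean_column)  # assumes every column of df is numeric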
There are two ways to remove outliers from time-series data: one is calculating percentiles / mean / standard deviation, which I think is what you are using; the other is looking at graphs, because sometimes the spread of the data gives more information visually.
I have worked on yellow-taxi data, building a model that can predict in which region of NYC a taxi can get more customers.
That was time-series data with 10-second intervals and various features like trip distance, speed, working hours, and one called "Total fare". I also wanted to remove the outliers from each column, so I started using means and percentiles to do so.
The thing with total fares was that the mean and percentiles were not giving an accurate threshold.
These were my percentile values:
0 percentile value is -242.55
10 percentile value is 6.3
20 percentile value is 7.8
30 percentile value is 8.8
40 percentile value is 9.8
50 percentile value is 11.16
60 percentile value is 12.8
70 percentile value is 14.8
80 percentile value is 18.3
90 percentile value is 25.8
100 percentile value is 3950611.6
As you can see, a fare of around 100 would have been perfectly OK, yet a percentile-based cutoff would have treated it as an outlier.
So I turned to visualization: I sorted my fare values and plotted them. Towards the end of the curve there is a sudden steepness, so I magnified that region, and then magnified it again from the 50th to the second-to-last percentile, and voila, I got my threshold, i.e. 1000.
This approach is, in practical terms, called the "elbow method": what you are already doing is the first step, and if you are not happy with the result, this can be the second step to find those thresholds.
I suggest you go column by column and use any of these techniques, because that way you know how much data you are losing, and losing data is losing information.
Personally, I rely on visualization, but in the end it really depends on the data.
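As a rough sketch of that workflow (fares is a made-up name here for a numeric pandas Series of total fares):
import numpy as np
import matplotlib.pyplot as plt

# print the percentiles, as above
for p in range(0, 101, 10):
    print(f"{p} percentile value is {np.percentile(fares, p)}")

# plot the sorted fares and look for the point where the curve shoots up (the "elbow")
plt.plot(np.sort(fares.values))
plt.ylabel("total fare")
plt.show()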

Identify ID columns in a data frame

Is there any way to identify columns such as Account_Number, Employee_ID, Transaction_ID etc type of columns automatically in a data frame which are usually not included in model building?
Note that there might be more than one record of the same employee across different dates. In short, how to identify useless columns when they are not unique?
There are several ways to recognize the least important columns/classes/features in a dataset. Correlation is one of them. Follow the example below by first downloading this movies dataset from Kaggle.
import pandas as pd

df = pd.read_csv("tmdb_5000_movies.csv")
df = df[["id", "budget", "popularity", "vote_average"]]
df.head()
This is how the dataframe looks:
id budget popularity vote_average
0 19995 237000000 150.437577 7.2
1 285 300000000 139.082615 6.9
2 206647 245000000 107.376788 6.3
3 49026 250000000 112.312950 7.6
4 49529 260000000 43.926995 6.1
We are looking for an automatic way of detecting that "id" is a useless column.
Let's find the correlation between each column and the other:
corr_df = pd.DataFrame(columns=list(df.columns))
for col_from in df.columns:
    for col_to in df.columns:
        corr_df.loc[col_from, col_to] = df[col_from].corr(df[col_to])
print(corr_df.head())
Correlation is simply a measure between -1 and 1: numbers close to zero indicate that the two columns are uncorrelated, while the further you get from zero (even in the negative direction), the more the two parameters are coupled in some sense.
Observe how id has a very small correlation with budget and popularity:
id budget popularity vote_average
id 1 -0.0893767 0.031202 -0.270595
budget -0.0893767 1 0.505414 0.0931457
popularity 0.031202 0.505414 1 0.273952
vote_average -0.270595 0.0931457 0.273952 1
Let's go a step further, take the absolute value, and sum all the correlations; the column with the lowest total correlation score is the best candidate for a useless (ID-style) column:
corr_df = corr_df.abs()
corr_df["sum"] = corr_df.sum(axis=0) - 1
print(corr_df.head())
Result:
id budget popularity vote_average sum
id 1 0.0893767 0.031202 0.270595 0.391173
budget 0.0893767 1 0.505414 0.0931457 0.687936
popularity 0.031202 0.505414 1 0.273952 0.810568
vote_average 0.270595 0.0931457 0.273952 1 0.637692
Note that there are several issues with this method. For example, if the ids increase from 0 to N and some other column also increases across the rows at a constant rate, their correlation will be high; moreover, some column X might have a smaller correlation with column Y than the correlation between Y and id. Nevertheless, the absolute-sum result is good enough in most cases.
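With those caveats in mind, picking the candidate ID column then boils down to something like this (astype(float) is used because the frame above was built up with object dtype):
# the column with the smallest total correlation is the most likely ID-style column
likely_id = corr_df["sum"].astype(float).idxmin()
print(likely_id)  # 'id' for this dataset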

Satisfying Cross tab constraints in Python by filling in Random Numbers

I have a problem: I need to modify a dataframe (the actual data) so that it satisfies cross-tab constraints, generating a new dataframe as described below.
In cross-tab 1 (attached picture and code), we have 2 tasks for John in Area A, 1 task for John in Area B, and so on. However, my desired distribution is as shown in cross-tab 2, i.e. John has 1 task in Area A, 4 tasks in Area B, etc. Thus I need to modify the original data depicted by cross-tab 1 to satisfy the row and column total constraints required in cross-tab 2, while the grand total remains 18, as in both cross tabs. The numbers may be filled in randomly.
Another constraint is the average time, which should be, for example, 11 minutes for John (average of 3 tasks), 7 minutes for William, and 5 minutes for Richard (3 tasks).
Thus, the task is to modify the original dataframe so that it satisfies the row and column totals of cross-tab 2 and the average-time requirement. The final dataframe will have the columns Person, AreaOfWork and Time and will generate a crosstab similar to cross-tab 2, with the numbers filled in randomly.
[Cross-tab 2 - Required]
[Cross-tab 1 - Actual Data]
Actual Data:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['John', 'A', 2, 8], ['John', 'B', 1, 9], ['John', 'C', 0, 12],
     ['William', 'A', 1, 14], ['William', 'B', 2, 10], ['William', 'C', 2, 9],
     ['Richard', 'A', 3, 8], ['Richard', 'B', 4, 7], ['Richard', 'C', 3, 5]],
    columns=['Person', 'AreaOfWork', 'Task', 'Time'])
1.1 Actual Cross-Tab:
pd.crosstab(df.AreaOfWork, df.Person, values=df.Task, aggfunc=np.sum, margins=True)
Required-Dataframe
df1 = pd.DataFrame(
    [['John', 'A', 1, 10], ['John', 'B', 4, 11], ['John', 'C', 3, 12],
     ['William', 'A', 0, 9], ['William', 'B', 1, 7], ['William', 'C', 3, 5],
     ['Richard', 'A', 2, 5], ['Richard', 'B', 1, 3], ['Richard', 'C', 3, 8]],
    columns=['Person', 'AreaOfWork', 'Task', 'Time'])
2.1 Required crosstab
pd.crosstab(df1.AreaOfWork, df1.Person, values=df1.Task, aggfunc=np.sum, margins=True)
