Pandas: Transpose, groupby and summarize columns - python

I have a pandas DataFrame which looks like this:
| Id | Filter 1 | Filter 2 | Filter 3 |
|----|----------|----------|----------|
| 25 | 0 | 1 | 1 |
| 25 | 1 | 0 | 1 |
| 25 | 0 | 0 | 1 |
| 30 | 1 | 0 | 1 |
| 31 | 1 | 0 | 1 |
| 31 | 0 | 1 | 0 |
| 31 | 0 | 0 | 1 |
I need to transpose this table, add a "Name" column with the name of the filter, and sum each filter's values per Id. The result table should look like this:
| Id | Name | Sum |
|----|----------|-----|
| 25 | Filter 1 | 1 |
| 25 | Filter 2 | 1 |
| 25 | Filter 3 | 3 |
| 30 | Filter 1 | 1 |
| 30 | Filter 2 | 0 |
| 30 | Filter 3 | 1 |
| 31 | Filter 1 | 1 |
| 31 | Filter 2 | 1 |
| 31 | Filter 3 | 2 |
The only solution I have come up with so far is to use apply on the DataFrame grouped by the Id column, but this method is too slow for my case: the dataset can have more than 40 columns and 50,000 rows. How can I do this with native pandas methods (e.g. pivot, transpose, groupby)?

Use:
df_new = (df.melt('Id', var_name='Name', value_name='Sum')
            .groupby(['Id','Name'])['Sum'].sum()
            .reset_index())
print(df_new)
Id Name Sum
0 25 Filter 1 1
1 25 Filter 2 1
2 25 Filter 3 3
3 30 Filter 1 1
4 30 Filter 2 0
5 30 Filter 3 1
6 31 Filter 1 1
7 31 Filter 2 1
8 31 Filter 3 2

Stack, then groupby:
df.set_index('Id').stack().groupby(level=[0,1]).sum().reset_index()
Id level_1 0
0 25 Filter 1 1
1 25 Filter 2 1
2 25 Filter 3 3
3 30 Filter 1 1
4 30 Filter 2 0
5 30 Filter 3 1
6 31 Filter 1 1
7 31 Filter 2 1
8 31 Filter 3 2
Short version:
df.set_index('Id').sum(level=0).stack()  # equivalently: df.groupby('Id').sum().stack()
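On current pandas (2.0+), where `DataFrame.sum(level=...)` has been removed, the groupby form is the one to reach for. A self-contained sketch with the sample data from the question (variable names are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Id':       [25, 25, 25, 30, 31, 31, 31],
    'Filter 1': [0, 1, 0, 1, 1, 0, 0],
    'Filter 2': [1, 0, 0, 0, 0, 1, 0],
    'Filter 3': [1, 1, 1, 1, 1, 0, 1],
})

# sum each filter per Id, then stack the filter columns into rows
out = (df.groupby('Id').sum()
         .stack()
         .rename('Sum')
         .rename_axis(['Id', 'Name'])
         .reset_index())
```

This produces the 9-row Id / Name / Sum frame directly, with no apply over groups.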

Using filter and melt
df.filter(like='Filter').groupby(df.Id).sum().T.reset_index().melt(id_vars='index')
index Id value
0 Filter 1 25 1
1 Filter 2 25 1
2 Filter 3 25 3
3 Filter 1 30 1
4 Filter 2 30 0
5 Filter 3 30 1
6 Filter 1 31 1
7 Filter 2 31 1
8 Filter 3 31 2


Mapping duplicate rows to originals with dictionary - Python 3.6

I am trying to locate duplicate rows in my pandas DataFrame. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE (minimal reproducible example):
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is to look at a subset of the features and, where there are duplicate rows, keep the first and record which id: label pairs are its duplicates.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols)
Find all duplicate rows in a pandas dataframe
I know pandas has a duplicated() call. So I tried implementing that and it sort of works:
import pandas as pd
# Read in example data
df = pd.read_clipboard()
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()
# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx], ['id', 'label']])
However, what I am trying to do is for a particular row, determine which rows are duplicates of it by saving those rows as id: label combination. So while I'm able to extract the id and label for each duplicate, I have no ability to map it back to the original row for which it is a duplicate.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (understanding the size of my actual dataset)?
Working with dictionaries in columns is really complicated; here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# mask of duplicate rows (first occurrences excluded)
m = sub_df.duplicated()
# create (id, label) tuples, aggregate each group of duplicates to a dict
s = (df.assign(a = df[['id','label']].apply(tuple, axis=1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))
# add new column
df = df.join(s.rename('duplicates'), on=cols)
# replace missing values and non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
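For reference, a self-contained version of this approach with the sample rows inlined (so it runs without `read_clipboard`); the logic is unchanged:

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [f'id_{i}' for i in range(100, 111)],
    'ft1':   [1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    'ft2':   [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    'ft3':   [43, 33, 12, 46, 10, 99, 0, 6, 29, 27, 32],
    'ft4':   [1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'ft5':   [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    'label': ['High', 'Medium', 'Low', 'Low', 'High', 'Low',
              'High', 'High', 'Medium', 'Medium', 'High'],
})

cols = ['ft1', 'ft2', 'ft4', 'ft5']
m = df[cols].duplicated()                       # non-first duplicates
s = (df.assign(a=df[['id', 'label']].apply(tuple, axis=1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))           # one dict per duplicated key
df = df.join(s.rename('duplicates'), on=cols)   # broadcast dict to all rows of the key
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')  # keep it only on the first row
```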
An alternative with a custom function: it assigns all duplicates except the first to the new column on the first row of each group; the mask is then inverted so the remaining rows are replaced with empty strings:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)
def f(x):
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x
df = df.groupby(cols).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10

Add values in two Spark DataFrames, row by row

I have two Spark DataFrames, with values that I would like to add, and then multiply, and keep the lowest pair of values only. I have written a function that will do this:
def math_func(aValOne, aValTwo, bValOne, bValTwo):
    tmpOne = aValOne + bValOne
    tmpTwo = aValTwo + bValTwo
    final = tmpOne * tmpTwo
    return final
I would like to iterate through two Spark DataFrames, "A" and "B", row by row, and keep the lowest values results. So if I have two DataFrames:
DataFrameA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DataFrameB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
I would like to first take row 0 from DataFrameA, compare it to rows 0 and 1 of DataFrameB, and then keep the lowest-value results. I have tried this:
results = DataFrameA.select('ID')(lambda i: DataFrameA.select('ID')(math_func(DataFrameA.ValOne, DataFrameA.ValTwo, DataFrameB.ValOne, DataFrameB.ValOne))
but I get errors about iterating through a DataFrame column. I know that in Pandas I would essentially make a nested "for loop", and then just write the results to another DataFrame and append the results. The results I would expect are:
Initial Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
0 | 117 | 1
1 | 77 | 0
1 | 150 | 1
Final Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
1 | 77 | 0
I am quite new at Spark, but I know enough to know I'm not approaching this the right way.
Any thoughts on how to go about this?
You will need multiple steps to achieve this.
Suppose you have data
DFA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DFB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
Step 1.
Do a cartesian join on your 2 dataframes. That will give you:
Cartesian:
DFA.ID | DFA.ValOne | DFA.ValTwo | DFB.ID | DFB.ValOne | DFB.ValTwo
0 | 2 | 4 | 0 | 4 | 5
1 | 3 | 6 | 0 | 4 | 5
0 | 2 | 4 | 1 | 7 | 9
1 | 3 | 6 | 1 | 7 | 9
Step 2.
Apply math_func to each pair of rows, i.e. Value = (DFA.ValOne + DFB.ValOne) * (DFA.ValTwo + DFB.ValTwo):
Combined:
DFA.ID | DFB.ID | Value
0 | 0 | 54
1 | 0 | 77
0 | 1 | 117
1 | 1 | 150
Step 3.
Group by DFA.ID and keep the minimum Value (carrying the matching DFB.ID along).
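For reference, the cartesian-join / compute / group-min pipeline can be sketched in pandas as a runnable stand-in, following the asker's math_func and expected output (in Spark the analogous calls are `crossJoin`, a computed column via `withColumn`, and a grouped aggregation):

```python
import pandas as pd

dfa = pd.DataFrame({'ID': [0, 1], 'ValOne': [2, 3], 'ValTwo': [4, 6]})
dfb = pd.DataFrame({'ID': [0, 1], 'ValOne': [4, 7], 'ValTwo': [5, 9]})

# Step 1: cartesian join of the two frames
cart = dfa.merge(dfb, how='cross', suffixes=('_a', '_b'))

# Step 2: apply math_func to each pair of rows
cart['Value'] = (cart['ValOne_a'] + cart['ValOne_b']) * (cart['ValTwo_a'] + cart['ValTwo_b'])

# Step 3: keep the lowest Value per DataFrameA ID, carrying DataFrameB's ID along
result = cart.loc[cart.groupby('ID_a')['Value'].idxmin(), ['ID_a', 'Value', 'ID_b']]
```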

Python Postgres Set Column to be Percentile

I have a table of transactions and would like to add a percentile column that specifies the percentile of that transaction in that month based on the amount column.
Take this smaller example with quartiles instead of percentiles:
Example Input:
id | month | amount
1 | 1 | 1
2 | 1 | 2
3 | 1 | 5
4 | 1 | 3
5 | 2 | 1
6 | 2 | 3
1 | 2 | 5
1 | 2 | 7
1 | 2 | 9
1 | 2 | 11
1 | 2 | 15
1 | 2 | 16
Example Output
id | month | amount | quartile
1 | 1 | 1 | 25
2 | 1 | 2 | 50
3 | 1 | 5 | 100
4 | 1 | 3 | 75
5 | 2 | 1 | 25
6 | 2 | 3 | 25
1 | 2 | 5 | 50
1 | 2 | 15 | 100
1 | 2 | 9 | 75
1 | 2 | 11 | 75
1 | 2 | 7 | 50
1 | 2 | 16 | 100
Currently, I use Postgres's percentile_cont function to determine the amount cutoffs for the different percentiles, and then update the percentile column accordingly. Unfortunately, this approach is way too slow because I have many different months. Any ideas for how to do this more quickly, preferably combining the percentile calculation and the update in one SQL statement?
My code:
num_buckets = 10
for i in range(num_buckets):
    decimal_percentile = (i+1)*(1.0/num_buckets)
    prev_decimal_percentile = i*1.0/num_buckets
    percentile = int(decimal_percentile*100)
    cursor.execute("""SELECT month,
                          percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC),
                          percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC)
                      FROM transactions GROUP BY month;""",
                   (prev_decimal_percentile, decimal_percentile))
    iter_cursor = connection.cursor()
    for data in cursor:
        iter_cursor.execute("""UPDATE transactions SET percentile=%s
                               WHERE month = %s
                               AND amount >= %s AND amount <= %s;""",
                            (percentile, data[0], data[1], data[2]))
You can do this in a single query, example for 4 buckets:
update transactions t
set percentile = calc_percentile
from (
select distinct on (month, amount)
id,
month,
amount,
calc_percentile
from transactions
join (
select
bucket,
month as calc_month,
percentile_cont(bucket*1.0/4) within group (order by amount asc) as calc_amount,
bucket*100/4 as calc_percentile
from transactions
cross join generate_series(1, 4) bucket
group by month, bucket
) s on month = calc_month and amount <= calc_amount
order by month, amount, calc_percentile
) s
where t.month = s.month and t.amount = s.amount;
Results:
select *
from transactions
order by month, amount;
id | month | amount | percentile
----+-------+--------+------------
1 | 1 | 1 | 25
2 | 1 | 2 | 50
4 | 1 | 3 | 75
3 | 1 | 5 | 100
5 | 2 | 1 | 25
6 | 2 | 3 | 25
1 | 2 | 5 | 50
1 | 2 | 7 | 50
1 | 2 | 9 | 75
1 | 2 | 11 | 75
1 | 2 | 15 | 100
1 | 2 | 16 | 100
(12 rows)
Btw, id should be a primary key; then it could be used in the joins for better performance.
DbFiddle.
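The bucket arithmetic this query relies on (each row falls into the smallest bucket whose cutoff covers its amount) can be sketched in plain Python; the `rank / n` ratio below plays the role that a `cume_dist()` window function would play in a hypothetical pure window-function rewrite. Helper name and structure are illustrative, not part of the answer:

```python
import math
from collections import defaultdict

def add_buckets(rows, num_buckets=4):
    """rows: (id, month, amount) tuples; returns (id, month, amount, percentile)."""
    by_month = defaultdict(list)
    for row in rows:
        by_month[row[1]].append(row)
    out = []
    for month, group in by_month.items():
        amounts = sorted(r[2] for r in group)
        n = len(amounts)
        for r in group:
            rank = sum(a <= r[2] for a in amounts)   # ties share a rank, like cume_dist()
            pct = math.ceil(rank / n * num_buckets) * (100 // num_buckets)
            out.append((*r, pct))
    return out

rows = [(1, 1, 1), (2, 1, 2), (3, 1, 5), (4, 1, 3),
        (5, 2, 1), (6, 2, 3), (1, 2, 5), (1, 2, 7),
        (1, 2, 9), (1, 2, 11), (1, 2, 15), (1, 2, 16)]
result = add_buckets(rows)
```

On the sample data this reproduces the quartiles shown in the answer's output; with interpolated percentile_cont cutoffs there can be edge-case differences on ties.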

Pandas, how to count the occurrence within a grouped dataframe and create a new column?

How do I get the count of each value within its group using pandas?
In the table below, I have a Group column and a Value column, and I want to generate a new column called Count, which should contain the total number of occurrences of that value within the group.
my df dataframe is as follows (without the count column):
-------------------------
| Group| Value | Count? |
-------------------------
| A | 10 | 3 |
| A | 20 | 2 |
| A | 10 | 3 |
| A | 10 | 3 |
| A | 20 | 2 |
| A | 30 | 1 |
-------------------------
| B | 20 | 3 |
| B | 20 | 3 |
| B | 20 | 3 |
| B | 10 | 1 |
-------------------------
| C | 20 | 2 |
| C | 20 | 2 |
| C | 10 | 2 |
| C | 10 | 2 |
-------------------------
I can get the counts using this:
df.groupby(['Group','Value']).Value.count()
but this only displays them; I am having difficulty putting the results back into the dataframe as a new column.
Using transform:
df['Count?'] = df.groupby(['Group','Value'])['Value'].transform('count')
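A runnable sketch of the transform approach against the sample data (the `Count` column name is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAAAABBBBCCCC'),
    'Value': [10, 20, 10, 10, 20, 30, 20, 20, 20, 10, 20, 20, 10, 10],
})

# transform broadcasts each (Group, Value) count back onto the original rows
df['Count'] = df.groupby(['Group', 'Value'])['Value'].transform('count')
```

Unlike a groupby + merge, transform keeps the original row order and index, so no re-alignment step is needed.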
Try a merge:
df
Group Value
0 A 10
1 A 20
2 A 10
3 A 10
4 A 20
5 A 30
6 B 20
7 B 20
8 B 20
9 B 10
10 C 20
11 C 20
12 C 10
13 C 10
g = df.groupby(['Group', 'Value']).Group.count()\
.to_frame('Count?').reset_index()
df = df.merge(g)
df
Group Value Count?
0 A 10 3
1 A 10 3
2 A 10 3
3 A 20 2
4 A 20 2
5 A 30 1
6 B 20 3
7 B 20 3
8 B 20 3
9 B 10 1
10 C 20 2
11 C 20 2
12 C 10 2
13 C 10 2

Pandas: How to create a multi-indexed pivot

I have a set of experiments defined by two variables: scenario and height. For each experiment, I take 3 measurements: result 1, 2 and 3.
The dataframe that collects all the results looks like this:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Scenario']= np.repeat(['Scenario a','Scenario b','Scenario c'],3)
df['height'] = np.tile([0,1,2],3)
df['Result 1'] = np.arange(1,10)
df['Result 2'] = np.arange(20,29)
df['Result 3'] = np.arange(30,39)
If I run the following:
mypiv = df.pivot('Scenario','height').transpose()
writer = pd.ExcelWriter('test_df_pivot.xlsx')
mypiv.to_excel(writer,'test df pivot')
writer.save()
I obtain a dataframe where columns are the scenarios, and the rows have a multi-index defined by result and height:
+----------+--------+------------+------------+------------+
| | height | Scenario a | Scenario b | Scenario c |
+----------+--------+------------+------------+------------+
| Result 1 | 0 | 1 | 4 | 7 |
| | 1 | 2 | 5 | 8 |
| | 2 | 3 | 6 | 9 |
| Result 2 | 0 | 20 | 23 | 26 |
| | 1 | 21 | 24 | 27 |
| | 2 | 22 | 25 | 28 |
| Result 3 | 0 | 30 | 33 | 36 |
| | 1 | 31 | 34 | 37 |
| | 2 | 32 | 35 | 38 |
+----------+--------+------------+------------+------------+
How can I create a pivot where the indices are swapped, i.e. height first, then result?
I couldn't find a way to create it directly. I managed to get what I want by swapping the levels and then re-sorting the results:
mypiv2 = mypiv.swaplevel(0,1 , axis=0).sortlevel(level=0,axis=0,sort_remaining=True)
but I was wondering if there is a more direct way.
You can first set_index and then stack with unstack:
print (df.set_index(['height','Scenario']).stack().unstack(level=1))
Scenario Scenario a Scenario b Scenario c
height
0 Result 1 1 4 7
Result 2 20 23 26
Result 3 30 33 36
1 Result 1 2 5 8
Result 2 21 24 27
Result 3 31 34 37
2 Result 1 3 6 9
Result 2 22 25 28
Result 3 32 35 38
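For what it's worth, on recent pandas (where the positional `pivot` signature and `sortlevel` are gone) the asker's swap-and-sort route can be written with keyword arguments and `sort_index`; a sketch using the sample `df` from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Scenario': np.repeat(['Scenario a', 'Scenario b', 'Scenario c'], 3),
    'height':   np.tile([0, 1, 2], 3),
    'Result 1': np.arange(1, 10),
    'Result 2': np.arange(20, 29),
    'Result 3': np.arange(30, 39),
})

mypiv2 = (df.pivot(index='Scenario', columns='height')  # keyword form of pivot
            .transpose()                                # (result, height) rows
            .swaplevel(0, 1, axis=0)                    # (height, result) rows
            .sort_index(level=0, axis=0))               # sort_index replaces sortlevel
```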
