Get combinations of values in a dataframe and apply a function - python

I want to take combinations of the values in a column and apply a function to each combination. What is the easiest way to do this?
Example Data
| name | value |
|------|-------|
| 6A | 1 |
| 6A | 1 |
| 6A | 1 |
| 6B | 3 |
| 6B | 3 |
| 6B | 3 |
| 6C | 7 |
| 6C | 5 |
| 6C | 4 |
The Result I Want
I used sum as the function in this example:
| pair | result |
|-------|--------|
| 6A_6B | 4 |
| 6A_6B | 4 |
| 6A_6B | 4 |
| 6A_6C | 8 |
| 6A_6C | 6 |
| 6A_6C | 5 |
| 6B_6C | 10 |
| 6B_6C | 8 |
| 6B_6C | 7 |
Note
My function takes pandas.Series objects as parameters.
For example:
x = the Series of "6A" values
and
y = the Series of "6B" values
6A_6B = sum(x, y)
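For reference, the sample data above can be reconstructed as:

```python
import pandas as pd

# Sample data from the table in the question
df = pd.DataFrame({
    'name': ['6A'] * 3 + ['6B'] * 3 + ['6C'] * 3,
    'value': [1, 1, 1, 3, 3, 3, 7, 5, 4],
})
```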

I find it more straightforward to reshape the data, then it's a simple addition of all pairwise combinations.
import pandas as pd
from itertools import combinations

u = (df.assign(idx=df.groupby('name').cumcount() + 1)
       .pivot(index='idx', columns='name', values='value'))
#name  6A  6B  6C
#idx
#1      1   3   7
#2      1   3   5
#3      1   3   4

l = []
for items in combinations(u.columns, 2):
    l.append(u.loc[:, items].sum(axis=1).to_frame('result').assign(pair='_'.join(items)))
df = pd.concat(l)
     result   pair
idx
1         4  6A_6B
2         4  6A_6B
3         4  6A_6B
1         8  6A_6C
2         6  6A_6C
3         5  6A_6C
1        10  6B_6C
2         8  6B_6C
3         7  6B_6C
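The same reshaping idea extends to any pairwise function, not just sum. A minimal self-contained sketch, where func is a hypothetical stand-in for your own function of two Series:

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({
    'name': ['6A'] * 3 + ['6B'] * 3 + ['6C'] * 3,
    'value': [1, 1, 1, 3, 3, 3, 7, 5, 4],
})

# Reshape so each name becomes a column of aligned values
u = (df.assign(idx=df.groupby('name').cumcount() + 1)
       .pivot(index='idx', columns='name', values='value'))

def func(x, y):
    # hypothetical placeholder: any function of two aligned Series
    return x + y

l = []
for x, y in combinations(u.columns, 2):
    l.append(pd.DataFrame({'pair': f'{x}_{y}', 'result': func(u[x], u[y])}))
out = pd.concat(l, ignore_index=True)
```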

itertools.combinations
Off the top of my head
from itertools import combinations

g = dict(tuple(df.groupby('name')))

pd.DataFrame([
    (f'{x}_{y}', a + b)
    for x, y in combinations(g, 2)
    for a, b in zip(g[x]['value'], g[y]['value'])
], columns=df.columns)
    name  value
0  6A_6B      4
1  6A_6B      4
2  6A_6B      4
3  6A_6C      8
4  6A_6C      6
5  6A_6C      5
6  6B_6C     10
7  6B_6C      8
8  6B_6C      7

Related

Transposing group of data in pandas dataframe

I have a large dataframe like this:
|type| qt | vol|
|----|---- | -- |
| A | 1 | 10 |
| A | 2 | 12 |
| A | 1 | 12 |
| B | 3 | 11 |
| B | 4 | 20 |
| B | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
How can I reshape the dataframe so that the groups are placed side by side, like this?
|A             |B             |C             |
|type| qt | vol|type| qt | vol|type| qt | vol|
| A  | 1  | 10 | B  | 3  | 11 | C  | 4  | 20 |
| A  | 2  | 12 | B  | 4  | 20 | C  | 4  | 20 |
| A  | 1  | 12 | B  | 4  | 20 | C  | 4  | 20 |
|    |    |    |    |    |    | C  | 4  | 20 |
You can group the dataframe on type, create key-value pairs of the groups inside a dict comprehension, then concat along axis=1, passing the optional keys parameter to label each column group:
d = {k:g.reset_index(drop=True) for k, g in df.groupby('type')}
pd.concat(d.values(), keys=d.keys(), axis=1)
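A self-contained sketch of this approach, with the sample data rebuilt from the table above (groups with fewer rows are padded with NaN):

```python
import pandas as pd

df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'qt':   [1, 2, 1, 3, 4, 4, 4, 4, 4, 4],
    'vol':  [10, 12, 12, 11, 20, 20, 20, 20, 20, 20],
})

# One sub-frame per type, re-indexed from 0 so rows align side by side
d = {k: g.reset_index(drop=True) for k, g in df.groupby('type')}
out = pd.concat(d.values(), keys=d.keys(), axis=1)
```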
Alternatively, use groupby + cumcount to create a sequential counter per group, build a two-level index from this counter and the type column, then stack followed by unstack to reshape:
c = df.groupby('type').cumcount()
df.set_index([c, df['type'].values]).stack().unstack([1, 2])
A B C
type qt vol type qt vol type qt vol
0 A 1 10 B 3 11 C 4 20
1 A 2 12 B 4 20 C 4 20
2 A 1 12 B 4 20 C 4 20
3 NaN NaN NaN NaN NaN NaN C 4 20
This is essentially a pivot by one column:
(df.assign(idx=df.groupby('type').cumcount())
   .pivot(index='idx', columns='type', values=df.columns)
   .swaplevel(0, 1, axis=1)
   .sort_index(axis=1)
)
Output:
type A B C
qt type vol qt type vol qt type vol
idx
0 1 A 10 3 B 11 4 C 20
1 2 A 12 4 B 20 4 C 20
2 1 A 12 4 B 20 4 C 20
3 NaN NaN NaN NaN NaN NaN 4 C 20
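Note that sort_index orders the inner column level alphabetically (qt, type, vol). A sketch restoring the original column order with reindex, with the sample data rebuilt from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'qt':   [1, 2, 1, 3, 4, 4, 4, 4, 4, 4],
    'vol':  [10, 12, 12, 11, 20, 20, 20, 20, 20, 20],
})

out = (df.assign(idx=df.groupby('type').cumcount())
         .pivot(index='idx', columns='type', values=['type', 'qt', 'vol'])
         .swaplevel(0, 1, axis=1)
         .sort_index(axis=1)
         # reorder the inner level back to type, qt, vol
         .reindex(columns=['type', 'qt', 'vol'], level=1))
```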

Select a row based on column value and its previous 2 rows

+---+---+---+---+----+
| A | B | C | D | E |
+---+---+---+---+----+
| 1 | 2 | 3 | 4 | VK |
| 1 | 4 | 6 | 9 | MD |
| 2 | 5 | 7 | 9 | V |
| 2 | 3 | 5 | 8 | VK |
| 2 | 3 | 7 | 9 | V |
| 1 | 1 | 1 | 1 | N |
| 0 | 1 | 6 | 9 | V |
| 1 | 2 | 5 | 7 | VK |
| 1 | 7 | 8 | 0 | MD |
| 1 | 5 | 7 | 9 | VK |
| 0 | 1 | 6 | 8 | V |
+---+---+---+---+----+
I want to select rows based on a column value, together with the two rows preceding each match. For example, in the dataset above I want to select every row whose 'E' column value is 'VK', plus the two rows before it. So we should get a dataset like this:
+---+---+---+---+----+
| A | B | C | D | E |
+---+---+---+---+----+
| 1 | 2 | 3 | 4 | VK |
| 1 | 4 | 6 | 9 | MD |
| 2 | 5 | 7 | 9 | V |
| 2 | 3 | 5 | 8 | VK |
| 2 | 3 | 7 | 9 | V |
| 1 | 1 | 1 | 1 | N |
| 1 | 2 | 5 | 7 | VK |
| 1 | 7 | 8 | 0 | MD |
| 1 | 5 | 7 | 9 | VK |
+---+---+---+---+----+
First we filter the dataframe up to the last 'VK', then create the group key with cumsum, then take up to 3 rows per group with groupby + head:
df=df.loc[:df.E.eq('VK').loc[lambda x : x].index.max()]
df=df.iloc[::-1].groupby(df.E.eq('VK').iloc[::-1].cumsum()).head(3).sort_index()
df
Out[102]:
A B C D E
0 1 2 3 4 VK
1 1 4 6 9 MD
2 2 5 7 9 V
3 2 3 5 8 VK
5 1 1 1 1 N
6 0 1 6 9 V
7 1 2 5 7 VK
8 1 7 8 0 MD
9 1 5 7 9 VK
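A self-contained version of the two steps above, with the sample data typed in from the table:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2, 2, 1, 0, 1, 1, 1, 0],
    'B': [2, 4, 5, 3, 3, 1, 1, 2, 7, 5, 1],
    'C': [3, 6, 7, 5, 7, 1, 6, 5, 8, 7, 6],
    'D': [4, 9, 9, 8, 9, 1, 9, 7, 0, 9, 8],
    'E': ['VK', 'MD', 'V', 'VK', 'V', 'N', 'V', 'VK', 'MD', 'VK', 'V'],
})

# 1) keep only rows up to (and including) the last 'VK'
df = df.loc[:df['E'].eq('VK').loc[lambda x: x].index.max()]

# 2) walk backwards so each 'VK' starts a group, keep 3 rows per group
out = (df.iloc[::-1]
         .groupby(df['E'].eq('VK').iloc[::-1].cumsum())
         .head(3)
         .sort_index())
```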

Pandas table computation [duplicate]

This question already has answers here:
Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
(9 answers)
Closed 3 years ago.
I have a table as follows:
+-------+-------+-------------+
| Code | Event | No. of runs |
+-------+-------+-------------+
| 66 | 1 | |
| 66 | 1 | 2 |
| 66 | 2 | |
| 66 | 2 | |
| 66 | 2 | 3 |
| 66 | 3 | |
| 66 | 3 | |
| 66 | 3 | |
| 66 | 3 | |
| 66 | 3 | 5 |
| 70 | 1 | |
| 70 | 1 | |
| 70 | 1 | |
| 70 | 1 | 4 |
+-------+-------+-------------+
Let's call each row a run. I want to count the no. of runs in each Event, separately for each Code. Would I need to use the groupby function? I have added the expected output in the No. of runs column.
Try using groupby with transform, then mask the duplicated rows:
df['Runs'] = df.groupby(['Code', 'Event'])['Event']\
               .transform('count')\
               .mask(df.duplicated(['Code', 'Event'], keep='last'), '')
Output (the new Runs column is added alongside No. of runs for comparison with the desired result):
Code Event No. of runs Runs
0 66 1
1 66 1 2 2
2 66 2
3 66 2
4 66 2 3 3
5 66 3
6 66 3
7 66 3
8 66 3
9 66 3 5 5
10 70 1
11 70 1
12 70 1
13 70 1 4 4
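A runnable sketch of the answer above, with the table rebuilt as a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Code': [66] * 10 + [70] * 4,
    'Event': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 1, 1],
})

# Count rows per (Code, Event) group, then blank out every row
# except the last one of each group
df['Runs'] = (df.groupby(['Code', 'Event'])['Event']
                .transform('count')
                .mask(df.duplicated(['Code', 'Event'], keep='last'), ''))
```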

Pandas, how to count the occurrence within a grouped dataframe and create a new column?

How do I get the count of each value within a group using pandas?
In the table below, I have a Group column and a Value column, and I want to generate a new column called Count, which should contain the total number of occurrences of that value within its group.
My df dataframe is as follows (without the Count? column):
-------------------------
| Group| Value | Count? |
-------------------------
| A | 10 | 3 |
| A | 20 | 2 |
| A | 10 | 3 |
| A | 10 | 3 |
| A | 20 | 2 |
| A | 30 | 1 |
-------------------------
| B | 20 | 3 |
| B | 20 | 3 |
| B | 20 | 3 |
| B | 10 | 1 |
-------------------------
| C | 20 | 2 |
| C | 20 | 2 |
| C | 10 | 2 |
| C | 10 | 2 |
-------------------------
I can get the counts using this:
df.groupby(['Group', 'Value']).Value.count()
but this only displays them; I am having difficulty putting the results back into the dataframe as a new column.
Using transform:
df['Count?'] = df.groupby(['Group', 'Value']).Value.transform('count')
Try a merge:
df
Group Value
0 A 10
1 A 20
2 A 10
3 A 10
4 A 20
5 A 30
6 B 20
7 B 20
8 B 20
9 B 10
10 C 20
11 C 20
12 C 10
13 C 10
g = df.groupby(['Group', 'Value']).Group.count()\
.to_frame('Count?').reset_index()
df = df.merge(g)
df
Group Value Count?
0 A 10 3
1 A 10 3
2 A 10 3
3 A 20 2
4 A 20 2
5 A 30 1
6 B 20 3
7 B 20 3
8 B 20 3
9 B 10 1
10 C 20 2
11 C 20 2
12 C 10 2
13 C 10 2
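A minimal self-contained sketch of the transform approach, with the sample data rebuilt from the question (the column is named Count? as in the table):

```python
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAAAABBBBCCCC'),
    'Value': [10, 20, 10, 10, 20, 30, 20, 20, 20, 10, 20, 20, 10, 10],
})

# transform('count') computes each (Group, Value) group's size and
# broadcasts it back onto every row of that group
df['Count?'] = df.groupby(['Group', 'Value'])['Value'].transform('count')
```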

Pandas: How to create a multi-indexed pivot

I have a set of experiments defined by two variables: scenario and height. For each experiment, I take 3 measurements: result 1, 2 and 3.
The dataframe that collects all the results looks like this:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Scenario']= np.repeat(['Scenario a','Scenario b','Scenario c'],3)
df['height'] = np.tile([0,1,2],3)
df['Result 1'] = np.arange(1,10)
df['Result 2'] = np.arange(20,29)
df['Result 3'] = np.arange(30,39)
If I run the following:
mypiv = df.pivot(index='Scenario', columns='height').transpose()
writer = pd.ExcelWriter('test_df_pivot.xlsx')
mypiv.to_excel(writer,'test df pivot')
writer.save()
I obtain a dataframe where columns are the scenarios, and the rows have a multi-index defined by result and height:
+----------+--------+------------+------------+------------+
| | height | Scenario a | Scenario b | Scenario c |
+----------+--------+------------+------------+------------+
| Result 1 | 0 | 1 | 4 | 7 |
| | 1 | 2 | 5 | 8 |
| | 2 | 3 | 6 | 9 |
| Result 2 | 0 | 20 | 23 | 26 |
| | 1 | 21 | 24 | 27 |
| | 2 | 22 | 25 | 28 |
| Result 3 | 0 | 30 | 33 | 36 |
| | 1 | 31 | 34 | 37 |
| | 2 | 32 | 35 | 38 |
+----------+--------+------------+------------+------------+
How can I create a pivot where the indices are swapped, i.e. height first, then result?
I couldn't find a way to create it directly. I managed to get what I want by swapping the levels and then re-sorting:
mypiv2 = mypiv.swaplevel(0, 1, axis=0).sort_index(level=0, axis=0, sort_remaining=True)
but I was wondering if there is a more direct way.
You can first set_index and then stack with unstack:
print (df.set_index(['height','Scenario']).stack().unstack(level=1))
Scenario Scenario a Scenario b Scenario c
height
0 Result 1 1 4 7
Result 2 20 23 26
Result 3 30 33 36
1 Result 1 2 5 8
Result 2 21 24 27
Result 3 31 34 37
2 Result 1 3 6 9
Result 2 22 25 28
Result 3 32 35 38
