Find optimal combinations of two columns based on another column value - python

So, my dataframe looks like this:
index  Client  Manager  Score
    0       1        1   0.89
    1       1        2   0.78
    2       1        3   0.65
    3       2        1   0.91
    4       2        2   0.77
    5       2        3   0.97
    6       3        1   0.35
    7       3        2   0.61
    8       3        3   0.81
    9       4        1   0.69
   10       4        2   0.22
   11       4        3   0.93
   12       5        1   0.78
   13       5        2   0.55
   14       5        3   0.44
   15       6        1   0.64
   16       6        2   0.99
   17       6        3   0.22
My expected output looks like this:
index  Client  Manager  Score
    0       1        1   0.89
    1       2        3   0.97
    2       3        2   0.61
    3       4        3   0.93
    4       5        1   0.78
    5       6        2   0.99
We have 3 managers and 6 clients. I want each manager to have 2 clients, chosen by highest Score. Each client should be assigned to only one manager, so if one client scores best for two managers, one of them has to settle for its next-best score, and so on. May I have your suggestions? Thank you in advance.

import pandas as pd

df = df.drop("index", axis=1)
df = df.sort_values("Score", ascending=False)  # highest scores first
selected_client = []
selected_manager = []
selected_df = []
for i, d in df.iterrows():
    client = int(d["Client"])
    manager = int(d["Manager"])
    # take the row if the client is unassigned and the manager still has capacity
    if client not in selected_client and selected_manager.count(manager) != 2:
        selected_client.append(client)
        selected_manager.append(manager)
        selected_df.append(d)
result = pd.concat(selected_df, axis=1, sort=False)
print(result.T)

Try this:
df = df.sort_values('Score', ascending=False)  # sort so high scores are tried first
d = {m: [] for m in df['Manager']}  # one list of (client, manager) pairs per manager
n = 2  # number of clients per manager
for c, m in zip(df['Client'], df['Manager']):  # iterate over client/manager pairs
    # append the pair if the manager still has capacity
    # and the client has not already been assigned
    if len(d[m]) < n and c not in [c2 for pairs in d.values() for c2, m2 in pairs]:
        d[m].append((c, m))
# filter the original frame down to just the pairs found above
ndf = pd.merge(df, pd.DataFrame([k for v in d.values() for k in v],
                                columns=['Client', 'Manager'])).sort_values('Client')
Output:
   index  Client  Manager  Score
3      0       1        1   0.89
1      5       2        3   0.97
5      7       3        2   0.61
2     11       4        3   0.93
4     12       5        1   0.78
0     16       6        2   0.99
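Note that the greedy loop above locks in the best remaining pair one row at a time, which is not guaranteed to maximize the total score across all assignments. If optimality matters, here is a hedged sketch (not part of the answer above, and it assumes scipy is available) using scipy's linear_sum_assignment; repeating each manager's column twice models the two-clients-per-manager capacity:
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment

# Score matrix: one row per client, one column per manager.
scores = df.pivot(index='Client', columns='Manager', values='Score')
# Repeat each manager column twice so every manager can take two clients,
# and negate the scores because linear_sum_assignment minimizes cost.
cost = -np.repeat(scores.to_numpy(), 2, axis=1)
rows, cols = linear_sum_assignment(cost)
optimal = pd.DataFrame({'Client': scores.index[rows],
                        'Manager': scores.columns[cols // 2],
                        'Score': scores.to_numpy()[rows, cols // 2]})
print(optimal.sort_values('Client'))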

Related

Grouping a dataframe by one column and adding information based on the column

I have a dataframe like so:
import pandas as pd

df = pd.DataFrame({1: [1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9],
                   2: ["a", "a", "x", "b", "b", "y", "c", "d", "e", "f", "g", "h", "i"],
                   3: [0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 0.11, 0.13, 0.13, 0.14, 0.15, 0.16, 0.17]})
    1  2     3
0   1  a  0.50
1   1  a  0.60
2   1  x  0.70
3   2  b  0.80
4   2  b  0.90
5   2  y  0.10
6   3  c  0.11
7   4  d  0.13
8   5  e  0.13
9   6  f  0.14
10  7  g  0.15
11  8  h  0.16
12  9  i  0.17
I want to:
1. Group the items by the first column so that it consists of unique values.
2. Attach the mean of the third column to this grouping.
3. Attach the second column's information to each corresponding first-column value.
I can do (1) and (2) with the following method:
In [33]: df.groupby(1).mean()
Out[33]:
      3
1
1  0.60
2  0.60
3  0.11
4  0.13
5  0.13
6  0.14
7  0.15
8  0.16
9  0.17
However, I'm not sure how to go about attaching the second column to the grouping.
I've tried grouping by multiple columns:
In [34]: df.groupby([1,2]).mean()
Out[34]:
        3
1 2
1 a  0.55
  x  0.70
2 b  0.85
  y  0.10
3 c  0.11
4 d  0.13
5 e  0.13
6 f  0.14
7 g  0.15
8 h  0.16
9 i  0.17
But in the actual dataset it leaves out several entries.
If you notice, within the dataframe there are some differences in the 2nd column data for each entry (Number 1 under column 1 has 2 "a's" and an "x", and number 2 has 2 "b's" and a "y"). This is because in the actual dataset there are minute differences between the entries due to errors and slight (but insignificant) differences in the string data.
Edit
The above is just a conceptual presentation of the problem. If you need something more tangible, this is the dataset. I want to group by the "CUSTOMERS NAME" and "CUSTOMER ADDRESS" columns while finding the mean, but grouping by the two of them simultaneously leads to a loss in entries for some reason. If I group purely by "CUSTOMER NAME" then there are a little over 4300 entries.
In [35]: len(ensemble.groupby("CUSTOMERS NAME").mean())
Out[35]: 4376
But if I group by both name and address it falls substantially:
In [36]: len(ensemble.groupby(["CUSTOMERS NAME","CUSTOMER ADDRESS"]).mean())
Out[36]: 4154
I know something's wrong somewhere because the total number of unique values in the "CUSTOMERS NAME" column is 4376.
For clarification, the output should be a dataframe with three columns: the first is the customer's name, the second is the address attached to that name (the first address found is fine), and the third is the mean of that customer's transactions.
If the first value from column 2 is fine, you can use GroupBy.agg:
In [583]: x = df.groupby(1, as_index=False).agg({2: 'first', 3: 'mean'})
In [584]: x
Out[584]:
   1  2     3
0  1  a  0.60
1  2  b  0.60
2  3  c  0.11
3  4  d  0.13
4  5  e  0.13
5  6  f  0.14
6  7  g  0.15
7  8  h  0.16
8  9  i  0.17
Or, if you want to keep all the values, you can aggregate them into a list:
In [586]: x = df.groupby(1, as_index=False).agg({2: list, 3: 'mean'})
In [587]: x
Out[587]:
   1          2     3
0  1  [a, a, x]  0.60
1  2  [b, b, y]  0.60
2  3        [c]  0.11
3  4        [d]  0.13
4  5        [e]  0.13
5  6        [f]  0.14
6  7        [g]  0.15
7  8        [h]  0.16
8  9        [i]  0.17
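Since the question notes that the column-2 strings differ only through small errors, a hedged refinement (a sketch, not part of the answer above) is to keep the most frequent spelling per group instead of whichever happens to come first:
# pick the modal value of column 2 within each group
x = df.groupby(1, as_index=False).agg({2: lambda s: s.mode().iloc[0], 3: 'mean'})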

Selectively use df.div() to divide only a certain column based on index match

I have 2 DataFrames: one holds monthly totals, and the other holds values that I want to divide by those totals in order to get monthly percentage contributions.
Here are some example DataFrames:
import pandas as pd

MonthlyTotals = pd.DataFrame(data={'Month': [1, 2, 3], 'Value': [100, 200, 300]})
Data = pd.DataFrame(data={'ID': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                          'Month': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                          'Value': [40, 30, 30, 60, 70, 70, 150, 60, 90]})
I am using df.div() so I set the index like so
MonthlyTotals.set_index('Month', inplace=True)
Data.set_index('Month', inplace=True)
Then I do the division
Contributions = Data.div(MonthlyTotals, axis='index')
The resulting DataFrame is what I want, but I cannot see the ID that each Value relates to, since ID isn't in the MonthlyTotals frame. How would I use df.div(), but only selectively on certain columns?
Here is an example dataframe of the result I am looking for
result = pd.DataFrame(data={'ID':[1,2,3,1,2,3,1,2,3],'Value':[0.4,0.3,0.3,0.3,0.35,0.35,0.5,0.2,0.3]})
You may not need MonthlyTotals if Data is complete. You can calculate the monthly total using transform and then calculate Contributions:
Data = pd.DataFrame(data={'ID': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                          'Month': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                          'Value': [40, 30, 30, 60, 70, 70, 150, 60, 90]})
Data['MonthlyTotal'] = Data.groupby('Month')['Value'].transform('sum')
Data['Contributions'] = Data['Value'] / Data['MonthlyTotal']
Output:
   ID  Month  Value  MonthlyTotal  Contributions
0   1      1     40           100           0.40
1   2      1     30           100           0.30
2   3      1     30           100           0.30
3   1      2     60           200           0.30
4   2      2     70           200           0.35
5   3      2     70           200           0.35
6   1      3    150           300           0.50
7   2      3     60           300           0.20
8   3      3     90           300           0.30
Also, if you would like to stay closer to your original approach, you can fix it with reindex + update:
Data.update(Data['Value'].div(MonthlyTotals['Value'].reindex(Data.index), axis=0))
Data
       ID  Value
Month
1       1   0.40
1       2   0.30
1       3   0.30
2       1   0.30
2       2   0.35
2       3   0.35
3       1   0.50
3       2   0.20
3       3   0.30
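For the selective division the question literally asks about, here is a minimal sketch, assuming the freshly built MonthlyTotals and Data from the question with Month set as the index on both: divide only the Value column, so ID passes through untouched.
Contributions = Data.copy()
# Series division aligns on the shared Month index; ID is left alone.
Contributions['Value'] = Data['Value'].div(MonthlyTotals['Value'])
print(Contributions.reset_index(drop=True))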

Remapping and regrouping values in python pandas

I have a dataframe where values have been assigned to groups:
import pandas as pd

df = pd.DataFrame({'num': [0.43, 5.2, 1.3, 0.33, 0.74, 0.5, 0.2, 0.12],
                   'group': [1, 2, 2, 2, 3, 4, 5, 5]})
df
   group   num
0      1  0.43
1      2  5.20
2      2  1.30
3      2  0.33
4      3  0.74
5      4  0.50
6      5  0.20
7      5  0.12
I would like to ensure that no value is in a group alone. If a value is an "orphan", it should be reassigned to the next highest group with more than one member. So the resultant dataframe should look like this instead:
   group   num
0      2  0.43
1      2  5.20
2      2  1.30
3      2  0.33
4      5  0.74
5      5  0.50
6      5  0.20
7      5  0.12
What's the most pythonic way to achieve this result?
Here is one solution I found, there may be much better ways to do this...
import bisect

# Find the orphans (groups with exactly one member)
count = df.group.value_counts().sort_index()
orphans = count[count == 1].index.values.tolist()
# Find the sets (groups with more than one member)
sets = count[count > 1].index.values.tolist()
# Find where each orphan should be remapped
where = [bisect.bisect(sets, x) for x in orphans]
remap = [sets[x] for x in where]
# Create a dictionary for remapping, and replace the original values
change = dict(zip(orphans, remap))
df = df.replace({'group': change})
df
   group   num
0      2  0.43
1      2  5.20
2      2  1.30
3      2  0.33
4      5  0.74
5      5  0.50
6      5  0.20
7      5  0.12
It is possible to use only vectorised operations for this task. You can use pd.Series.bfill to create a mapping from the original group labels to the new ones:
import numpy as np

counts = df['group'].value_counts().sort_index().reset_index()
counts['original'] = counts['index']
counts.loc[counts['group'] == 1, 'index'] = np.nan
counts['index'] = counts['index'].bfill().astype(int)
print(counts)
   index  group  original
0      2      1         1
1      2      3         2
2      5      1         3
3      5      1         4
4      5      2         5
Then use pd.Series.map to perform your mapping:
df['group'] = df['group'].map(counts.set_index('original')['index'])
print(df)
   group   num
0      2  0.43
1      2  5.20
2      2  1.30
3      2  0.33
4      5  0.74
5      5  0.50
6      5  0.20
7      5  0.12
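For what it's worth, the same back-fill idea compresses further; a hedged sketch working directly on the counts Series (same df as above, and it assumes the highest group is not itself an orphan):
counts = df['group'].value_counts().sort_index()
# Blank out singleton groups, then back-fill each from the next larger group.
mapping = counts.index.to_series().where(counts > 1).bfill().astype(int)
df['group'] = df['group'].map(mapping)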

Using igraph to read 'ncol' format while preserving labels

I need an automated way to read 'ncol' format (edge list) while preserving labels.
For instance:
Given a small-graph.edgelist:
0 1 0.47
0 2 0.67
0 3 0.98
0 4 0.12
0 5 0.82
0 10 0.34
1 2 0.94
1 3 0.05
1 4 0.22
2 3 0.24
2 4 0.36
3 4 0.69
5 6 0.97
5 8 0.44
5 7 0.43
5 9 0.37
6 7 0.83
6 8 0.49
6 9 0.55
7 8 0.39
7 9 0.73
8 9 0.68
10 11 0.22
10 14 0.59
11 12 0.40
12 13 0.78
13 14 0.81
I try:
import igraph
g = igraph.read("small-graph.edgelist", format="ncol", directed=False, names=True)
But this function does not preserve the labels!
The output generated by this function:
for edge in g.es():
    print(edge.tuple[0], edge.tuple[1], edge["weight"])
0 1 0.47
0 2 0.67
0 3 0.98
0 4 0.12
0 5 0.82
0 6 0.34   <- with the original labels this line should read '0 10 0.34'
1 2 0.94
1 3 0.05
1 4 0.22
2 3 0.24
2 4 0.36
3 4 0.69
5 7 0.97
5 8 0.44
5 9 0.43
5 10 0.37
6 11 0.22
6 12 0.59
7 8 0.49
7 9 0.83
7 10 0.55
8 9 0.39
8 10 0.68
9 10 0.73
11 13 0.4
12 14 0.81
13 14 0.78
The labels of the input file (small-graph.edgelist) are not preserved.
I think something like this could work:
g = igraph.Graph()
g.add_vertices(15)
g = igraph.read("input/small-graph.edgelist", format="ncol", directed=False, names=True)
But this doesn't work and I don't know how to do it.
Does anyone know how to preserve the original labels?
The original labels are preserved, but they are stored in the name vertex attribute. Try this after reading your graph as usual:
names = g.vs["name"]
for edge in g.es:
    print(names[edge.tuple[0]], names[edge.tuple[1]], edge["weight"])
Update: If you are absolutely sure that your file contains only continuous numeric IDs from zero (i.e. if you have n vertices then your IDs run from zero to n-1), you can do the following:
from igraph import Graph

edges, weights = [], []
for line in open("input_file.txt"):
    u, v, weight = line.split()
    edges.append((int(u), int(v)))
    weights.append(float(weight))
g = Graph(edges, edge_attrs={"weight": weights})
Just a little improvement based on updated answer of Tamás' for the case that the file contains only continuous numeric IDs from zero. This works for directed graphs and handles some cases such as there might be no edge from 0 to any other vertices:
read_edges(num_vertices,input_graph):
g = Graph(directed=True)
g.add_vertices(list(range(0,num_vertices)))
for line in open(input_graph):
u, v= line.split()
g.add_edge(int(u), int(v))
return g
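A hypothetical usage, with the filename and vertex count taken from the question's example:
g = read_edges(15, "small-graph.edgelist")
print(g.vcount(), g.ecount())  # 15 vertices, 27 edges for the sample file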

Pandas: trouble understanding how merge works

I'm doing something wrong with merge and I can't understand what it is. I've done the following to estimate a histogram of a series of integer values:
import pandas as pnd
import numpy as np

series = pnd.Series(np.random.poisson(5, size=100))
tmp = {"series": series, "count": np.ones(len(series))}
hist = pnd.DataFrame(tmp).groupby("series").sum()
freq = (hist / hist.sum()).rename(columns={"count": "freq"})
If I print hist and freq this is what I get:
> print hist
        count
series
0           2
1           4
2          13
3          15
4          12
5          16
6          18
7           7
8           8
9           3
10          1
11          1
> print freq
        freq
series
0       0.02
1       0.04
2       0.13
3       0.15
4       0.12
5       0.16
6       0.18
7       0.07
8       0.08
9       0.03
10      0.01
11      0.01
They're both indexed by "series" but if I try to merge:
> df = pnd.merge(freq, hist, on="series")
I get a KeyError: 'no item named series' exception. If I omit on="series", I get an IndexError: list index out of range exception.
I don't get what I'm doing wrong. Maybe "series" is an index and not a column, so I must handle it differently?
From docs:
on: Columns (names) to join on. Must be found in both the left and
right DataFrame objects. If not passed and left_index and right_index
are False, the intersection of the columns in the DataFrames will be
inferred to be the join keys
I don't know why this is not in the docstring, but it explains your problem.
You can either give left_index and right_index:
In : pnd.merge(freq, hist, right_index=True, left_index=True)
Out:
        freq  count
series
0       0.01      1
1       0.04      4
2       0.14     14
3       0.12     12
4       0.21     21
5       0.14     14
6       0.17     17
7       0.07      7
8       0.05      5
9       0.01      1
10      0.01      1
11      0.03      3
Or you can make your index a column and use on:
In : freq2 = freq.reset_index()
In : hist2 = hist.reset_index()
In : pnd.merge(freq2, hist2, on='series')
Out:
    series  freq  count
0        0  0.01      1
1        1  0.04      4
2        2  0.14     14
3        3  0.12     12
4        4  0.21     21
5        5  0.14     14
6        6  0.17     17
7        7  0.07      7
8        8  0.05      5
9        9  0.01      1
10      10  0.01      1
11      11  0.03      3
Alternatively and more simply, DataFrame has a join method which does exactly what you want:
In : freq.join(hist)
Out:
        freq  count
series
0       0.01      1
1       0.04      4
2       0.14     14
3       0.12     12
4       0.21     21
5       0.14     14
6       0.17     17
7       0.07      7
8       0.05      5
9       0.01      1
10      0.01      1
11      0.03      3
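As a closing aside, a hedged modern sketch: value_counts builds both the counts and the frequencies in one frame and sidesteps the merge entirely (a fresh Poisson draw, so the exact numbers will differ):
import numpy as np
import pandas as pnd  # keeping the question's alias

series = pnd.Series(np.random.poisson(5, size=100))
counts = series.value_counts().sort_index()
summary = pnd.DataFrame({"count": counts, "freq": counts / counts.sum()})
summary.index.name = "series"
print(summary)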
