Pandas: trouble understanding how merge works - python

I'm doing something wrong with merge and I can't understand what it is. I've done the following to estimate a histogram of a series of integer values:
import pandas as pnd
import numpy as np
series = pnd.Series(np.random.poisson(5, size = 100))
tmp = {"series" : series, "count" : np.ones(len(series))}
hist = pnd.DataFrame(tmp).groupby("series").sum()
freq = (hist / hist.sum()).rename(columns = {"count" : "freq"})
If I print hist and freq this is what I get:
> print hist
count
series
0 2
1 4
2 13
3 15
4 12
5 16
6 18
7 7
8 8
9 3
10 1
11 1
> print freq
freq
series
0 0.02
1 0.04
2 0.13
3 0.15
4 0.12
5 0.16
6 0.18
7 0.07
8 0.08
9 0.03
10 0.01
11 0.01
They're both indexed by "series" but if I try to merge:
> df = pnd.merge(freq, hist, on = "series")
I get a KeyError: 'no item named series' exception. If I omit on = "series" I get a IndexError: list index out of range exception.
I don't get what I'm doing wrong. May be "series" is an index and not a column so I must do it differently?

From docs:
on: Columns (names) to join on. Must be found in both the left and
right DataFrame objects. If not passed and left_index and right_index
are False, the intersection of the columns in the DataFrames will be
inferred to be the join keys
I don't know why this is not in the docstring, but it explains your problem.
You can either give left_index and right_index:
In : pnd.merge(freq, hist, right_index=True, left_index=True)
Out:
freq count
series
0 0.01 1
1 0.04 4
2 0.14 14
3 0.12 12
4 0.21 21
5 0.14 14
6 0.17 17
7 0.07 7
8 0.05 5
9 0.01 1
10 0.01 1
11 0.03 3
Or you can make your index a column and use on:
In : freq2 = freq.reset_index()
In : hist2 = hist.reset_index()
In : pnd.merge(freq2, hist2, on='series')
Out:
series freq count
0 0 0.01 1
1 1 0.04 4
2 2 0.14 14
3 3 0.12 12
4 4 0.21 21
5 5 0.14 14
6 6 0.17 17
7 7 0.07 7
8 8 0.05 5
9 9 0.01 1
10 10 0.01 1
11 11 0.03 3
Alternatively and more simply, DataFrame has join method which does exactly what you want:
In : freq.join(hist)
Out:
freq count
series
0 0.01 1
1 0.04 4
2 0.14 14
3 0.12 12
4 0.21 21
5 0.14 14
6 0.17 17
7 0.07 7
8 0.05 5
9 0.01 1
10 0.01 1
11 0.03 3

Related

Find optimal combinations of two columns based on another column value

So, my dataframe looks like this
index Client Manager Score
0 1 1 0.89
1 1 2 0.78
2 1 3 0.65
3 2 1 0.91
4 2 2 0.77
5 2 3 0.97
6 3 1 0.35
7 3 2 0.61
8 3 3 0.81
9 4 1 0.69
10 4 2 0.22
11 4 3 0.93
12 5 1 0.78
13 5 2 0.55
14 5 3 0.44
15 6 1 0.64
16 6 2 0.99
17 6 3 0.22
My expected output looks like this
index Client Manager Score
0 1 1 0.89
1 2 3 0.97
2 3 2 0.61
3 4 3 0.93
4 5 1 0.78
5 6 2 0.99
We have 3 managers and 6 clients. I want each manager to have 2 clients based on highest Score. Each manager should have only unique client, so that if one client is good for two managers, we need to take second best score and so on. May I have your suggestions? Thank you in advance.
df = df.drop("index", axis=1)
df = df.sort_values("Score").iloc[::-1,:]
df
selected_client = []
selected_manager = []
selected_df = []
iter_rows = df.iterrows()
for i,d in iter_rows:
client = int(d.to_frame().loc[["Client"],[i]].values[0][0])
manager = int(d.to_frame().loc[["Manager"],[i]].values[0][0])
if client not in selected_client and selected_manager.count(manager) != 2:
selected_client.append(client)
selected_manager.append(manager)
selected_df.append(d)
result = pd.concat(selected_df, axis=1, sort=False)
print(result.T)
Try this:
df = df.sort_values('Score',ascending = False) #sort values to prioritize high scores
d = {i:[] for i in df['Manager']} #create an empty dictionary to fill in the client/manager pairs
n = 2 #set number of clients per manager
for c,m in zip(df['Client'],df['Manager']): #iterate over client and manager pairs
if len(d.get(m))<n and c not in [c2 for i in d.values() for c2,m2 in i]: #if there are not already two pairs, and if the client has not already been added, append the pair to the list
d.get(m).append((c,m))
else:
pass
ndf = pd.merge(df,pd.DataFrame([k for v in d.values() for k in v],columns = ['Client','Manager'])).sort_values('Client') #filter for just the pairs found above.
Output:
index Client Manager Score
3 0 1 1 0.89
1 5 2 3 0.97
5 7 3 2 0.61
2 11 4 3 0.93
4 12 5 1 0.78
0 16 6 2 0.99

Find the max value(s) for specific columns in every row and match the corresponding column values to the values of another column

I have a table like this,
Hour represents the hour of the day agent reached the customer successfully.
Cus1 to Cus4 are the top 4 time slots provided to the agent to call the customer.
Cus1_Score to Cus4_Score represents the probability of success for the call during the hours corresponding to Cus1 to Cus4.
I need to get YES or NO values for Match1 and Match2 columns.
Match1 represents this - Check for the highest score in the columns Cus1_Score to Cus4_Score. If no duplicates, check if Cus1 = Hour and if it's a match we write YES else NO. If duplicates exist, check all the columns with highest score and and check if the Hour number matches to any of the values in the high score represented Cus columns(Cus1 to Cus4). Again, if it's match YES else NO.
Match2 represents this - Check for the second highest score in the columns Cus1_Score to Cus4_Score. If no duplicates, check if Cus2 = Hour and if it's a match we write YES else NO. If duplicates exist, check all the columns with second highest score and and check if the Hour number matches to any of the values in the second high score represented Cus columns(Cus1 to Cus4). Again, if it's match YES else NO.
ID Hour Cus1 Cus2 Cus3 Cus4 Cus1_Score Cus2_Score Cus3_Score Cus4_Score Match1 Match2
1 11 8 10 11 14 0.62 0.59 0.59 0.54 NO YES
2 13 15 16 18 13 0.57 0.57 0.57 0.57 YES NO
3 16 09 14 16 12 0.67 0.54 0.48 0.34 NO NO
4 08 11 08 12 17 0.58 0.55 0.43 0.25 NO YES
I tried using idxmax() and nlargest(2) functions and I have no luck as I am not very strong in Python. Highly appreciate your help.
You need to write a function for row operation with all the criterion that you explained. Then you can apply this function to the rows using, .iterrows, for, or .apply method which you can find them here.
Hopefully I got the conditions right; the calculations below rely on Pandas aligning on the index before assigning values; so I create a long dataframe, create conditions off the long dataframe, reduce the size to unique indices and assign the outcome to the original dataframe:
Create a long dataframe to work out the conditions(I'm using pyjanitor pivot_longer to reshape here because of the convenience; you can do this in a couple of steps without pyjanitor):
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor as jn
import pandas as pd
res = (df.pivot_longer(['ID', 'Hour'],
names_pattern= [r'.+\d$', r'.+Score$'],
names_to = ['cus', 'score'],
ignore_index = False)
)
print(res)
ID Hour cus score
0 1 11 8 0.62
1 2 13 15 0.57
2 3 16 9 0.67
3 4 8 11 0.58
0 1 11 10 0.59
1 2 13 16 0.57
2 3 16 14 0.54
3 4 8 8 0.55
0 1 11 11 0.59
1 2 13 18 0.57
2 3 16 16 0.48
3 4 8 12 0.43
0 1 11 14 0.54
1 2 13 13 0.57
2 3 16 12 0.34
3 4 8 17 0.25
Create conditions for match1:
# get booleans for max and no duplicates
res = res.sort_index()
max1 = res.groupby(level=0).score.transform('max')
# max without duplicates
cond1 = res.score.eq(max1).groupby(level=0).sum().eq(1)
cond1 = cond1 & df.Hour.eq(df.Cus1)
# max with duplicates
cond2 = res.score.eq(max1).groupby(level=0).transform('sum').gt(1)
cond2 &= res.Hour.eq(res.cus)
cond2 = cond2.groupby(level=0).any()
df['match1'] = np.where(cond1 | cond2, 'YES', 'NO')
match2:
second_largest = (res.sort_values('score', ascending=False)
.groupby(level=0, sort = False)
.score
.transform('nth', 1)
)
second_largest = second_largest.sort_index()
# second largest without duplicates
cond_1 = res.score.eq(second_largest).groupby(level=0).sum().eq(1)
cond_1 &= df.Hour.eq(df.Cus2)
# second largest with duplicates
cond_2 = res.score.eq(second_largest).groupby(level=0).transform('sum').gt(1)
cond_2 &= res.Hour.eq(res.cus)
cond_2 = cond_2.groupby(level=0).any()
df['match2'] = np.where(cond_1 | cond_2, 'YES', 'NO')
df
ID Hour Cus1 Cus2 Cus3 Cus4 Cus1_Score Cus2_Score Cus3_Score Cus4_Score match1 match2
0 1 11 8 10 11 14 0.62 0.59 0.59 0.54 NO YES
1 2 13 15 16 18 13 0.57 0.57 0.57 0.57 YES YES
2 3 16 9 14 16 12 0.67 0.54 0.48 0.34 NO NO
3 4 8 11 8 12 17 0.58 0.55 0.43 0.25 NO YES

Np random sampling in python

I have two pd data tables. I want to create a new column in df2 by assign random Rate using Weight from df1.
df1
Income_Group Rate Weight
0 1 3.5 0.5
1 1 2.5 0.25
2 1 3.75 0.15
3 1 5.0 0.15
4 2 4.5 0.35
5 2 2.5 0.25
6 2 4.75 0.20
7 2 5.0 0.20
....
30 8 2.25 0.75
31 8 4.15 0.05
32 8 6.35 0.20
df2
ID Income_Group State Rate
0 12 1 9 3.5
1 13 2 6 4.5
2 15 8 1 6.35
3 8 1 5 2.5
4 9 8 4 6.35
5 17 2 3 4.75
......
100 50 1 4 3.75
I tried the following code:
df2['Rate']=df1.groupby('Income_Group').apply(lambda gp.np.random.choice(a=gp.Rate, p=gp.Weight,
replace=True))
Of course, the code didn't work. Can someone help me on this? Thank you in advance.
Your data is pretty small, so we can do:
rate_dict = df1.groupby('Income_Group')[['Rate', 'Weight']].agg(list)
df2['Rate'] = df2.Income_Group.apply(lambda x: np.random.choice(rate_dict.loc[x, 'Rate'],
p=rate_dict.loc[x, 'Weight'])
)
Or you can do groupby on df2 as well:
(df2.groupby('Income_Group')
.Income_Group
.transform(lambda x: np.random.choice(rate_dict.loc[x.iloc[0], 'Rate'],
size=len(x),
p=rate_dict.loc[x.iloc[0], 'Weight']))
)
You can try:
df1 = pd.DataFrame([[1,3.5,.5], [1,2.5,.25], [1,3.75,.15]],
columns=['Income_Group', 'Rate', 'Weight'])
df2 = pd.DataFrame()
weights = np.random.rand(df1.shape[0])
df2['Rate'] = df1.Rate.values * weights

Remapping and regrouping values in python pandas

I have a dataframe where values have been assigned to groups:
import pandas as pd
df = pd.DataFrame({ 'num' : [0.43, 5.2, 1.3, 0.33, .74, .5, .2, .12],
'group' : [1, 2, 2, 2, 3,4,5,5]
})
df
group num
0 1 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 3 0.74
5 4 0.50
6 5 0.20
7 5 0.12
I would like to ensure that no value is in a group alone. If a value is an "orphan", it should be reassigned to the next highest group with more than one member. So the resultant dataframe should look like this instead:
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
What's the most pythonic way to achieve this result?
Here is one solution I found, there may be much better ways to do this...
# Find the orphans
count = df.group.value_counts().sort_index()
orphans = count[count == 1].index.values.tolist()
# Find the sets
sets = count[count > 1].index.values.tolist()
# Find where orphans should be remapped
where = [bisect.bisect(sets, x) for x in orphans]
remap = [sets[x] for x in where]
# Create a dictionary for remapping, and replace original values
change = dict(zip(orphans, remap))
df = df.replace({'group': change})
df
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
It is possible to use only vectorised operations for this task. You can use pd.Series.bfill to create a mapping from your original index to a new one:
counts = df['group'].value_counts().sort_index().reset_index()
counts['original'] = counts['index']
counts.loc[counts['group'] == 1, 'index'] = np.nan
counts['index'] = counts['index'].bfill().astype(int)
print(counts)
index group original
0 2 1 1
1 2 3 2
2 5 1 3
3 5 1 4
4 5 2 5
Then use pd.Series.map to perform your mapping:
df['group'] = df['group'].map(counts.set_index('original')['index'])
print(df)
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12

Using igraph for read 'ncol' format while preserving labels

I need an automated way to read 'ncol' format (edge list) while preserving labels.
For instance:
Given a small-graph.edgelist:
0 1 0.47
0 2 0.67
0 3 0.98
0 4 0.12
0 5 0.82
0 10 0.34
1 2 0.94
1 3 0.05
1 4 0.22
2 3 0.24
2 4 0.36
3 4 0.69
5 6 0.97
5 8 0.44
5 7 0.43
5 9 0.37
6 7 0.83
6 8 0.49
6 9 0.55
7 8 0.39
7 9 0.73
8 9 0.68
10 11 0.22
10 14 0.59
11 12 0.40
12 13 0.78
13 14 0.81
Graph:
I try:
import igraph
g = igraph.read("smallgraph.edgelist", format="ncol", directed=False, names=True)
But this function does not preserve the labels!!!!
The output generated by this function:
for edge in g.es():
print edge.tuple[0], edge.tuple[1], edge["weight"]
0 1 0.47
0 2 0.67
0 3 0.98
0 4 0.12
0 5 0.82
0 6 0.34 -> e.g.: Considering the original labels here should be '0 10 0.34'
1 2 0.94
1 3 0.05
1 4 0.22
2 3 0.24
2 4 0.36
3 4 0.69
5 7 0.97
5 8 0.44
5 9 0.43
5 10 0.37
6 11 0.22
6 12 0.59
7 8 0.49
7 9 0.83
7 10 0.55
8 9 0.39
8 10 0.68
9 10 0.73
11 13 0.4
12 14 0.81
13 14 0.78
Output:
The labels of the input file (small-graph.edgelist) are not preserved.
I think something like this could work:
g = igraph.Graph()
g.add_vertices(15)
g = igraph.read("input/small-graph.edgelist", format="ncol", directed=False, names=True)
But this doesn't work and I don't know how to do it.
Does anyone know how to preserve the original labels?
The original labels are preserved, but they are stored in the name vertex attribute. Try this after reading your graph as usual:
names = g.vs["name"]
for edge in g.es:
print names[edge.tuple[0]], names[edge.tuple[1]], edge["weight"]
Update: If you are absolutely sure that your file contains only continuous numeric IDs from zero (i.e. if you have n vertices then your IDs are from zero to n-1), you can do the following:
edges, weights = [], []
for line in open("input_file.txt"):
u, v, weight = line.split()
edges.append((int(u), int(v)))
weights.append(float(weight))
g = Graph(edges, edge_attrs={"weight": weights})
Just a little improvement based on updated answer of Tamás' for the case that the file contains only continuous numeric IDs from zero. This works for directed graphs and handles some cases such as there might be no edge from 0 to any other vertices:
read_edges(num_vertices,input_graph):
g = Graph(directed=True)
g.add_vertices(list(range(0,num_vertices)))
for line in open(input_graph):
u, v= line.split()
g.add_edge(int(u), int(v))
return g

Categories