K-Means classification by group - python

I'm trying to do a K-means analysis in a dataframe like this:
URBAN AREA PROVINCE DENSITY
0 1 TRUJILLO 0.30
1 2 TRUJILLO 0.03
2 3 TRUJILLO 0.80
3 1 LIMA 1.20
4 2 LIMA 0.04
5 1 LAMBAYEQUE 0.90
6 2 LAMBAYEQUE 0.10
7 3 LAMBAYEQUE 0.08
As you can see, the df refers to different urban areas (with different urban density values) inside provinces. I want to run the K-means classification on a single column: DENSITY. To do so, I execute this code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
df=pd.read_csv('C:/Path/to/example.csv')
clustering=KMeans(n_clusters=2, max_iter=300)
clustering.fit(df[['DENSITY']])
df['KMeans_Clusters']=clustering.labels_
df
And I get this result, which is OK for this first part of the example:
URBAN AREA PROVINCE DENSITY KMeans_Clusters
0 1 TRUJILLO 0.30 0
1 2 TRUJILLO 0.03 0
2 3 TRUJILLO 0.80 1
3 1 LIMA 1.20 1
4 2 LIMA 0.04 0
5 1 LAMBAYEQUE 0.90 1
6 2 LAMBAYEQUE 0.10 0
7 3 LAMBAYEQUE 0.08 0
But now I want to do the k-means classification of urban areas by province, i.e. repeat the same process within each province. So I tried this code:
df=pd.read_csv('C:/Users/rojas/Desktop/example.csv')
clustering=KMeans(n_clusters=2, max_iter=300)
clustering.fit(df[['DENSITY']]).groupby('PROVINCE')
df['KMeans_Clusters']=clustering.labels_
df
but I get this message:
AttributeError Traceback (most recent call last)
<ipython-input-4-87e7696ff61a> in <module>
3 clustering=KMeans(n_clusters=2, max_iter=300)
4
----> 5 clustering.fit(df[['DENSITY']]).groupby('PROVINCE')
6
7 df['KMeans_Clusters']=clustering.labels_
AttributeError: 'KMeans' object has no attribute 'groupby'
Is there a way to do so?

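Try this:
def k_means(row):
    # 'row' is the sub-DataFrame holding all rows of one province
    clustering = KMeans(n_clusters=2, max_iter=300)
    model = clustering.fit(row[['DENSITY']])
    row['KMeans_Clusters'] = model.labels_
    return row

# fit a separate KMeans model for each province
df = df.groupby('PROVINCE').apply(k_means)
Results: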
URBAN AREA PROVINCE DENSITY KMeans_Clusters
0 0 1 TRUJILLO 0.30 0
1 1 2 TRUJILLO 0.03 0
2 2 3 TRUJILLO 0.80 1
3 3 1 LIMA 1.20 1
4 4 2 LIMA 0.04 0
5 5 1 LAMBAYEQUE 0.90 0
6 6 2 LAMBAYEQUE 0.10 1
7 7 3 LAMBAYEQUE 0.08 1
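In the results above the label numbers are arbitrary per province (in LAMBAYEQUE the densest area got 0, elsewhere it got 1). Here is a minimal sketch of one way to make the labels comparable, assuming you want 0 to always mean the lowest-density cluster. The helper name `density_labels`, the column name `URBAN AREA`, and the `n_init`/`random_state` values are my own choices, not something from the question. It also assumes every province has at least `n_clusters` rows, otherwise KMeans cannot be fitted.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Sample data copied from the question.
df = pd.DataFrame({
    "URBAN AREA": [1, 2, 3, 1, 2, 1, 2, 3],
    "PROVINCE": ["TRUJILLO"] * 3 + ["LIMA"] * 2 + ["LAMBAYEQUE"] * 3,
    "DENSITY": [0.30, 0.03, 0.80, 1.20, 0.04, 0.90, 0.10, 0.08],
})

def density_labels(density, n_clusters=2):
    """Cluster one province's densities; 0 is the lowest-density cluster."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    raw = model.fit_predict(density.to_numpy().reshape(-1, 1))
    # Rank the cluster centres so that label numbers follow density order.
    order = np.argsort(model.cluster_centers_.ravel())
    rank = np.empty(n_clusters, dtype=int)
    rank[order] = np.arange(n_clusters)
    return rank[raw]

# transform() calls the helper once per province and keeps the original row order.
df["KMeans_Clusters"] = (
    df.groupby("PROVINCE")["DENSITY"].transform(density_labels).astype(int)
)
print(df)
```

Using `transform` instead of `apply` also avoids the extra index column that shows up in the results above.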

Related

Find optimal combinations of two columns based on another column value

So, my dataframe looks like this
index Client Manager Score
0 1 1 0.89
1 1 2 0.78
2 1 3 0.65
3 2 1 0.91
4 2 2 0.77
5 2 3 0.97
6 3 1 0.35
7 3 2 0.61
8 3 3 0.81
9 4 1 0.69
10 4 2 0.22
11 4 3 0.93
12 5 1 0.78
13 5 2 0.55
14 5 3 0.44
15 6 1 0.64
16 6 2 0.99
17 6 3 0.22
My expected output looks like this
index Client Manager Score
0 1 1 0.89
1 2 3 0.97
2 3 2 0.61
3 4 3 0.93
4 5 1 0.78
5 6 2 0.99
We have 3 managers and 6 clients. I want each manager to have 2 clients, chosen by highest Score. Each client should be assigned to only one manager, so if one client is the best match for two managers, we need to take the second-best score, and so on. May I have your suggestions? Thank you in advance.
df = df.drop("index", axis=1)
df = df.sort_values("Score").iloc[::-1,:]
df
selected_client = []
selected_manager = []
selected_df = []
iter_rows = df.iterrows()
for i,d in iter_rows:
    client = int(d.to_frame().loc[["Client"],[i]].values[0][0])
    manager = int(d.to_frame().loc[["Manager"],[i]].values[0][0])
    if client not in selected_client and selected_manager.count(manager) != 2:
        selected_client.append(client)
        selected_manager.append(manager)
        selected_df.append(d)
result = pd.concat(selected_df, axis=1, sort=False)
print(result.T)
Try this:
df = df.sort_values('Score', ascending=False)  # sort values to prioritize high scores
d = {i: [] for i in df['Manager']}  # empty dictionary to fill in the client/manager pairs
n = 2  # number of clients per manager
for c, m in zip(df['Client'], df['Manager']):  # iterate over client and manager pairs
    # if the manager does not already have n pairs, and the client has not
    # already been added, append the pair to the manager's list
    if len(d.get(m)) < n and c not in [c2 for i in d.values() for c2, m2 in i]:
        d.get(m).append((c, m))
# filter for just the pairs found above
ndf = pd.merge(df, pd.DataFrame([k for v in d.values() for k in v], columns=['Client', 'Manager'])).sort_values('Client')
Output:
index Client Manager Score
3 0 1 1 0.89
1 5 2 3 0.97
5 7 3 2 0.61
2 11 4 3 0.93
4 12 5 1 0.78
0 16 6 2 0.99
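Both snippets above follow the same greedy idea: walk the rows from best score to worst and accept a pair if the client is still free and the manager still has room. Here is a plain-Python sketch of that idea, in case it is easier to follow without the DataFrame plumbing. The function name `assign_clients` and the `per_manager` parameter are made up for illustration. Note that a greedy pass is not guaranteed to maximise the total score in general; it only reproduces the expected output for this data.

```python
from collections import defaultdict

def assign_clients(rows, per_manager=2):
    """Greedy pass over (client, manager, score) rows, best score first."""
    taken = {}                # client -> (manager, score)
    load = defaultdict(int)   # manager -> number of clients accepted so far
    for client, manager, score in sorted(rows, key=lambda r: r[2], reverse=True):
        if client not in taken and load[manager] < per_manager:
            taken[client] = (manager, score)
            load[manager] += 1
    return dict(sorted(taken.items()))

# Scores from the question: client -> [score with manager 1, 2, 3].
scores = {
    1: [0.89, 0.78, 0.65],
    2: [0.91, 0.77, 0.97],
    3: [0.35, 0.61, 0.81],
    4: [0.69, 0.22, 0.93],
    5: [0.78, 0.55, 0.44],
    6: [0.64, 0.99, 0.22],
}
rows = [(c, m, s) for c, per_mgr in scores.items()
        for m, s in enumerate(per_mgr, start=1)]

result = assign_clients(rows)
for client, (manager, score) in result.items():
    print(client, manager, score)
```

With a DataFrame you could feed it `df[['Client', 'Manager', 'Score']].itertuples(index=False)` instead of the hand-built `rows` list.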

Np random sampling in python

I have two pandas DataFrames. I want to create a new column in df2 by assigning a random Rate, drawn using the Weight values from df1 for the matching Income_Group.
df1
Income_Group Rate Weight
0 1 3.5 0.5
1 1 2.5 0.25
2 1 3.75 0.15
3 1 5.0 0.15
4 2 4.5 0.35
5 2 2.5 0.25
6 2 4.75 0.20
7 2 5.0 0.20
....
30 8 2.25 0.75
31 8 4.15 0.05
32 8 6.35 0.20
df2
ID Income_Group State Rate
0 12 1 9 3.5
1 13 2 6 4.5
2 15 8 1 6.35
3 8 1 5 2.5
4 9 8 4 6.35
5 17 2 3 4.75
......
100 50 1 4 3.75
I tried the following code:
df2['Rate']=df1.groupby('Income_Group').apply(lambda gp.np.random.choice(a=gp.Rate, p=gp.Weight,
replace=True))
Of course, the code didn't work. Can someone help me on this? Thank you in advance.
Your data is pretty small, so we can do:
rate_dict = df1.groupby('Income_Group')[['Rate', 'Weight']].agg(list)
df2['Rate'] = df2.Income_Group.apply(
    lambda x: np.random.choice(rate_dict.loc[x, 'Rate'],
                               p=rate_dict.loc[x, 'Weight'])
)
Or you can do groupby on df2 as well:
(df2.groupby('Income_Group')
    .Income_Group
    .transform(lambda x: np.random.choice(rate_dict.loc[x.iloc[0], 'Rate'],
                                          size=len(x),
                                          p=rate_dict.loc[x.iloc[0], 'Weight']))
)
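One thing to watch for: the sample weights shown for Income_Group 1 add up to 1.05, and as far as I know `np.random.choice` refuses probabilities that do not sum to 1. Below is a rough sketch of the same per-group sampling that normalises the weights first. It assumes normalising is acceptable for your data, uses only the first two groups from the question, and the small `df2` and the seed are invented for the example.

```python
import numpy as np
import pandas as pd

# First two income groups from the question's df1.
df1 = pd.DataFrame({
    "Income_Group": [1, 1, 1, 1, 2, 2, 2, 2],
    "Rate": [3.5, 2.5, 3.75, 5.0, 4.5, 2.5, 4.75, 5.0],
    "Weight": [0.5, 0.25, 0.15, 0.15, 0.35, 0.25, 0.20, 0.20],
})
# Hypothetical df2 rows, only for illustration.
df2 = pd.DataFrame({"ID": [12, 13, 8, 17, 50], "Income_Group": [1, 2, 1, 2, 1]})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
df2["Rate"] = np.nan
for key, idx in df2.groupby("Income_Group").groups.items():
    table = df1[df1["Income_Group"] == key]
    p = table["Weight"].to_numpy(dtype=float)
    p = p / p.sum()  # normalise so the probabilities sum to exactly 1
    # Draw one rate per df2 row of this income group.
    df2.loc[idx, "Rate"] = rng.choice(table["Rate"].to_numpy(), size=len(idx), p=p)

print(df2)
```

Drawing `size=len(idx)` values at once per group means one call to the generator per income group rather than one per row.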
You can try:
df1 = pd.DataFrame([[1,3.5,.5], [1,2.5,.25], [1,3.75,.15]],
columns=['Income_Group', 'Rate', 'Weight'])
df2 = pd.DataFrame()
weights = np.random.rand(df1.shape[0])
df2['Rate'] = df1.Rate.values * weights

Remapping and regrouping values in python pandas

I have a dataframe where values have been assigned to groups:
import pandas as pd
df = pd.DataFrame({ 'num' : [0.43, 5.2, 1.3, 0.33, .74, .5, .2, .12],
'group' : [1, 2, 2, 2, 3,4,5,5]
})
df
group num
0 1 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 3 0.74
5 4 0.50
6 5 0.20
7 5 0.12
I would like to ensure that no value is in a group alone. If a value is an "orphan", it should be reassigned to the next highest group with more than one member. So the resultant dataframe should look like this instead:
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
What's the most pythonic way to achieve this result?
Here is one solution I found, there may be much better ways to do this...
import bisect

# Find the orphans
count = df.group.value_counts().sort_index()
orphans = count[count == 1].index.values.tolist()
# Find the sets
sets = count[count > 1].index.values.tolist()
# Find where orphans should be remapped
where = [bisect.bisect(sets, x) for x in orphans]
remap = [sets[x] for x in where]
# Create a dictionary for remapping, and replace original values
change = dict(zip(orphans, remap))
df = df.replace({'group': change})
df
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
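One edge case the bisect approach above does not cover: if an orphan group is larger than every multi-member group, `sets[x]` would be indexed past the end of the list. Here is a plain-Python sketch of the same idea with that case handled, assuming such an orphan should simply keep its own group id (the question does not say what should happen). The function name `remap_orphans` is made up.

```python
import bisect
from collections import Counter

def remap_orphans(groups):
    """Move single-member groups to the next larger group that has company."""
    counts = Counter(groups)
    shared = sorted(g for g, n in counts.items() if n > 1)
    remapped = []
    for g in groups:
        if counts[g] == 1:
            pos = bisect.bisect(shared, g)
            # Assumption: an orphan above every shared group keeps its own id.
            g = shared[pos] if pos < len(shared) else g
        remapped.append(g)
    return remapped

print(remap_orphans([1, 2, 2, 2, 3, 4, 5, 5]))
```

The returned list can be assigned straight back with `df['group'] = remap_orphans(df['group'].tolist())`.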
It is possible to use only vectorised operations for this task. You can use pd.Series.bfill to create a mapping from your original index to a new one:
import numpy as np

counts = df['group'].value_counts().sort_index().reset_index()
counts['original'] = counts['index']
counts.loc[counts['group'] == 1, 'index'] = np.nan
counts['index'] = counts['index'].bfill().astype(int)
print(counts)
index group original
0 2 1 1
1 2 3 2
2 5 1 3
3 5 1 4
4 5 2 5
Then use pd.Series.map to perform your mapping:
df['group'] = df['group'].map(counts.set_index('original')['index'])
print(df)
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
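The column names produced by `value_counts().reset_index()` differ between pandas versions (to my knowledge, pandas 2.x names them after the original column and `count` rather than `index` and `group`), so the snippet above may need adjusting. A sketch of the same bfill idea that avoids `reset_index` altogether is below. Keeping the original id for an orphan that has no larger multi-member group is my assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "num": [0.43, 5.2, 1.3, 0.33, 0.74, 0.5, 0.2, 0.12],
    "group": [1, 2, 2, 2, 3, 4, 5, 5],
})

counts = df["group"].value_counts().sort_index()  # group id -> group size
ids = counts.index.to_series()                    # group id -> same group id
# Blank out orphan ids, pull the next multi-member id backwards over them,
# and fall back to the original id where nothing larger exists (assumption).
mapping = ids.where(counts > 1).bfill().fillna(ids).astype(int)

df["group"] = df["group"].map(mapping)
print(df)
```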

Using igraph for read 'ncol' format while preserving labels

I need an automated way to read 'ncol' format (edge list) while preserving labels.
For instance:
Given a small-graph.edgelist:
0 1 0.47
0 2 0.67
0 3 0.98
0 4 0.12
0 5 0.82
0 10 0.34
1 2 0.94
1 3 0.05
1 4 0.22
2 3 0.24
2 4 0.36
3 4 0.69
5 6 0.97
5 8 0.44
5 7 0.43
5 9 0.37
6 7 0.83
6 8 0.49
6 9 0.55
7 8 0.39
7 9 0.73
8 9 0.68
10 11 0.22
10 14 0.59
11 12 0.40
12 13 0.78
13 14 0.81
Graph:
I tried:
import igraph
g = igraph.read("small-graph.edgelist", format="ncol", directed=False, names=True)
But this function does not seem to preserve the labels!
The output generated by this function:
for edge in g.es():
    print(edge.tuple[0], edge.tuple[1], edge["weight"])
0 1 0.47
0 2 0.67
0 3 0.98
0 4 0.12
0 5 0.82
0 6 0.34 -> e.g.: Considering the original labels here should be '0 10 0.34'
1 2 0.94
1 3 0.05
1 4 0.22
2 3 0.24
2 4 0.36
3 4 0.69
5 7 0.97
5 8 0.44
5 9 0.43
5 10 0.37
6 11 0.22
6 12 0.59
7 8 0.49
7 9 0.83
7 10 0.55
8 9 0.39
8 10 0.68
9 10 0.73
11 13 0.4
12 14 0.81
13 14 0.78
Output:
The labels of the input file (small-graph.edgelist) are not preserved.
I think something like this could work:
g = igraph.Graph()
g.add_vertices(15)
g = igraph.read("input/small-graph.edgelist", format="ncol", directed=False, names=True)
But this doesn't work and I don't know how to do it.
Does anyone know how to preserve the original labels?
The original labels are preserved, but they are stored in the name vertex attribute. Try this after reading your graph as usual:
names = g.vs["name"]
for edge in g.es:
    # edge.tuple holds internal vertex indices; look up the original labels
    print(names[edge.tuple[0]], names[edge.tuple[1]], edge["weight"])
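To see why the numbers in the question's output differ from the file, here is a small pure-Python illustration of the bookkeeping. It is not igraph's actual implementation. It assumes vertex indices are handed out in order of first appearance, which matches the output shown in the question (label 10 is the seventh distinct label, so it gets index 6). The function name `read_ncol_like` is made up.

```python
def read_ncol_like(lines):
    """Assign vertex indices in order of first appearance, keeping the labels."""
    names, index_of, edges = [], {}, []
    for line in lines:
        u, v, weight = line.split()
        for label in (u, v):
            if label not in index_of:
                index_of[label] = len(names)
                names.append(label)
        edges.append((index_of[u], index_of[v], float(weight)))
    return names, edges

# First six lines of small-graph.edgelist from the question.
sample = ["0 1 0.47", "0 2 0.67", "0 3 0.98", "0 4 0.12", "0 5 0.82", "0 10 0.34"]
names, edges = read_ncol_like(sample)
for u, v, w in edges:
    # left: internal indices, right: original labels recovered via names
    print(u, v, "->", names[u], names[v], w)
```

The `names` list plays the same role as `g.vs["name"]` in the answer above: indexing it with an internal index gives back the label from the file.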
Update: If you are absolutely sure that your file contains only contiguous numeric IDs starting from zero (i.e. if you have n vertices then your IDs run from zero to n-1), you can do the following:
from igraph import Graph

edges, weights = [], []
for line in open("input_file.txt"):
    u, v, weight = line.split()
    edges.append((int(u), int(v)))
    weights.append(float(weight))
g = Graph(edges, edge_attrs={"weight": weights})
Just a little improvement based on Tamás's updated answer, for the case where the file contains only contiguous numeric IDs starting from zero. This works for directed graphs and handles cases such as vertex 0 having no edges to any other vertex:
from igraph import Graph

def read_edges(num_vertices, input_graph):
    g = Graph(directed=True)
    # create all vertices up front so isolated vertices are kept
    g.add_vertices(list(range(0, num_vertices)))
    for line in open(input_graph):
        u, v = line.split()
        g.add_edge(int(u), int(v))
    return g

Pandas: trouble understanding how merge works

I'm doing something wrong with merge and I can't understand what it is. I've done the following to estimate a histogram of a series of integer values:
import pandas as pnd
import numpy as np
series = pnd.Series(np.random.poisson(5, size = 100))
tmp = {"series" : series, "count" : np.ones(len(series))}
hist = pnd.DataFrame(tmp).groupby("series").sum()
freq = (hist / hist.sum()).rename(columns = {"count" : "freq"})
If I print hist and freq this is what I get:
> print hist
count
series
0 2
1 4
2 13
3 15
4 12
5 16
6 18
7 7
8 8
9 3
10 1
11 1
> print freq
freq
series
0 0.02
1 0.04
2 0.13
3 0.15
4 0.12
5 0.16
6 0.18
7 0.07
8 0.08
9 0.03
10 0.01
11 0.01
They're both indexed by "series" but if I try to merge:
> df = pnd.merge(freq, hist, on = "series")
I get a KeyError: 'no item named series' exception. If I omit on = "series" I get a IndexError: list index out of range exception.
I don't get what I'm doing wrong. Maybe "series" is an index and not a column, so I must do it differently?
From docs:
on: Columns (names) to join on. Must be found in both the left and
right DataFrame objects. If not passed and left_index and right_index
are False, the intersection of the columns in the DataFrames will be
inferred to be the join keys
I don't know why this is not in the docstring, but it explains your problem.
You can either give left_index and right_index:
In : pnd.merge(freq, hist, right_index=True, left_index=True)
Out:
freq count
series
0 0.01 1
1 0.04 4
2 0.14 14
3 0.12 12
4 0.21 21
5 0.14 14
6 0.17 17
7 0.07 7
8 0.05 5
9 0.01 1
10 0.01 1
11 0.03 3
Or you can make your index a column and use on:
In : freq2 = freq.reset_index()
In : hist2 = hist.reset_index()
In : pnd.merge(freq2, hist2, on='series')
Out:
series freq count
0 0 0.01 1
1 1 0.04 4
2 2 0.14 14
3 3 0.12 12
4 4 0.21 21
5 5 0.14 14
6 6 0.17 17
7 7 0.07 7
8 8 0.05 5
9 9 0.01 1
10 10 0.01 1
11 11 0.03 3
Alternatively, and more simply, DataFrame has a join method which does exactly what you want:
In : freq.join(hist)
Out:
freq count
series
0 0.01 1
1 0.04 4
2 0.14 14
3 0.12 12
4 0.21 21
5 0.14 14
6 0.17 17
7 0.07 7
8 0.05 5
9 0.01 1
10 0.01 1
11 0.03 3
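For anyone who wants to try the three options side by side, here is a small self-contained sketch with fixed data instead of the random Poisson draw, so the numbers are reproducible. The sample values are made up. As far as I know, newer pandas releases also accept index level names in `on`, so the original `on="series"` call may work as-is there.

```python
import pandas as pd

# Small fixed sample instead of np.random.poisson, for reproducibility.
values = pd.Series([2, 3, 3, 5, 5, 5], name="series")
hist = values.value_counts().sort_index().rename("count").to_frame()
hist.index.name = "series"
freq = (hist / hist.sum()).rename(columns={"count": "freq"})

# 1. merge on both indexes
by_index = pd.merge(freq, hist, left_index=True, right_index=True)
# 2. turn the index into a column and merge on it
by_column = pd.merge(freq.reset_index(), hist.reset_index(), on="series")
# 3. join, which aligns on the index by default
joined = freq.join(hist)

print(by_index)
print(by_column)
print(joined)
```

Options 1 and 3 keep "series" as the index, while option 2 returns it as an ordinary column with a fresh integer index.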
