Sklearn changing string class label to int - python

I have a pandas dataframe and I'm trying to change the values in a given column which are represented by strings into integers. For instance:
df = index fruit quantity price
0 apple 5 0.99
1 apple 2 0.99
2 orange 4 0.89
4 banana 1 1.64
...
10023 kiwi 10 0.92
I would like it to look like:
df = index fruit quantity price
0 1 5 0.99
1 1 2 0.99
2 2 4 0.89
4 3 1 1.64
...
10023 5 10 0.92
I can do this using
df["fruit"] = df["fruit"].map({"apple": 1, "orange": 2,...})
which works if I have a small list to change, but I'm looking at a column with over 500 different labels. Is there any way of changing this from a string to an int without writing the mapping by hand?

You can use sklearn.preprocessing
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.fruit)
df['categorical_label'] = le.transform(df.fruit)
Transform labels back to original encoding.
le.inverse_transform(df['categorical_label'])
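A minimal, self-contained sketch of the same idea, assuming scikit-learn is installed. The column name fruit_code and the small sample frame are made up for illustration. LabelEncoder numbers the classes in sorted order, starting from 0, so the codes will not follow order of appearance.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small illustrative frame (not the asker's full data)
df = pd.DataFrame({"fruit": ["apple", "apple", "orange", "banana", "kiwi"]})

le = LabelEncoder()
# fit_transform learns the sorted set of labels and encodes in one step
df["fruit_code"] = le.fit_transform(df["fruit"])

# classes_ holds the labels; position in classes_ is the integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# inverse_transform recovers the original strings
restored = le.inverse_transform(df["fruit_code"])

print(df)
print(mapping)
```

Add 1 to the codes if you need them to start from 1 as in the question.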

Use factorize and then convert to categorical if necessary:
df.fruit = pd.factorize(df.fruit)[0]
print (df)
fruit quantity price
0 0 5 0.99
1 0 2 0.99
2 1 4 0.89
3 2 1 1.64
4 3 10 0.92
df.fruit = pd.Categorical(pd.factorize(df.fruit)[0])
print (df)
fruit quantity price
0 0 5 0.99
1 0 2 0.99
2 1 4 0.89
3 2 1 1.64
4 3 10 0.92
print (df.dtypes)
fruit category
quantity int64
price float64
dtype: object
If you also need the count to start from 1:
df.fruit = pd.Categorical(pd.factorize(df.fruit)[0] + 1)
print (df)
fruit quantity price
0 1 5 0.99
1 1 2 0.99
2 2 4 0.89
3 3 1 1.64
4 4 10 0.92

You can use the factorize method:
In [13]: df['fruit'] = pd.factorize(df['fruit'])[0].astype(np.uint16)
In [14]: df
Out[14]:
index fruit quantity price
0 0 0 5 0.99
1 1 0 2 0.99
2 2 1 4 0.89
3 4 2 1 1.64
4 10023 3 10 0.92
In [15]: df.dtypes
Out[15]:
index int64
fruit uint16
quantity int64
price float64
dtype: object
Alternatively, you can do it this way:
In [21]: df['fruit'] = df.fruit.astype('category').cat.codes
In [22]: df
Out[22]:
index fruit quantity price
0 0 0 5 0.99
1 1 0 2 0.99
2 2 3 4 0.89
3 4 1 1 1.64
4 10023 2 10 0.92
In [23]: df.dtypes
Out[23]:
index int64
fruit int8
quantity int64
price float64
dtype: object
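The two outputs above differ because factorize numbers labels in order of first appearance, while category codes follow the sorted categories. A small sketch to compare them side by side (the Series below is illustrative only):

```python
import pandas as pd

fruit = pd.Series(["apple", "apple", "orange", "banana", "kiwi"])

# factorize: codes follow order of first appearance
codes, uniques = pd.factorize(fruit)

# sort=True makes factorize number the labels in sorted order instead
sorted_codes, sorted_uniques = pd.factorize(fruit, sort=True)

# category codes: categories are sorted, so codes follow alphabetical order
cat_codes = fruit.astype("category").cat.codes

# uniques lets you map the integer codes back to the original strings
restored = uniques[codes]

print(codes, sorted_codes, cat_codes.tolist())
```

Keep uniques (or the categories) around if you need to translate the integers back later.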


Find optimal combinations of two columns based on another column value

So, my dataframe looks like this
index Client Manager Score
0 1 1 0.89
1 1 2 0.78
2 1 3 0.65
3 2 1 0.91
4 2 2 0.77
5 2 3 0.97
6 3 1 0.35
7 3 2 0.61
8 3 3 0.81
9 4 1 0.69
10 4 2 0.22
11 4 3 0.93
12 5 1 0.78
13 5 2 0.55
14 5 3 0.44
15 6 1 0.64
16 6 2 0.99
17 6 3 0.22
My expected output looks like this
index Client Manager Score
0 1 1 0.89
1 2 3 0.97
2 3 2 0.61
3 4 3 0.93
4 5 1 0.78
5 6 2 0.99
We have 3 managers and 6 clients. I want each manager to have 2 clients, chosen by highest Score. Each client should be assigned to only one manager, so if one client is the best match for two managers, we need to take the second-best score, and so on. May I have your suggestions? Thank you in advance.
df = df.drop("index", axis=1)
df = df.sort_values("Score").iloc[::-1,:]
df
selected_client = []
selected_manager = []
selected_df = []
iter_rows = df.iterrows()
for i,d in iter_rows:
    client = int(d.to_frame().loc[["Client"],[i]].values[0][0])
    manager = int(d.to_frame().loc[["Manager"],[i]].values[0][0])
    if client not in selected_client and selected_manager.count(manager) != 2:
        selected_client.append(client)
        selected_manager.append(manager)
        selected_df.append(d)
result = pd.concat(selected_df, axis=1, sort=False)
print(result.T)
Try this:
df = df.sort_values('Score',ascending = False) #sort values to prioritize high scores
d = {i:[] for i in df['Manager']} #create an empty dictionary to fill in the client/manager pairs
n = 2 #set number of clients per manager
for c,m in zip(df['Client'],df['Manager']): #iterate over client and manager pairs
    #if the manager does not already have n clients, and the client has not already been added, append the pair to the list
    if len(d.get(m))<n and c not in [c2 for i in d.values() for c2,m2 in i]:
        d.get(m).append((c,m))
    else:
        pass
ndf = pd.merge(df,pd.DataFrame([k for v in d.values() for k in v],columns = ['Client','Manager'])).sort_values('Client') #filter for just the pairs found above.
Output:
index Client Manager Score
3 0 1 1 0.89
1 5 2 3 0.97
5 7 3 2 0.61
2 11 4 3 0.93
4 12 5 1 0.78
0 16 6 2 0.99
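For reference, here is a compact sketch of the same greedy idea using a set and a counter instead of nested lists. The names max_clients, load, taken and picked are my own. Note that this is a greedy heuristic: it reproduces the expected output for this data, but it is not guaranteed to maximise the total score in general.

```python
import pandas as pd

# Data from the question: 6 clients x 3 managers
scores = [0.89, 0.78, 0.65, 0.91, 0.77, 0.97, 0.35, 0.61, 0.81,
          0.69, 0.22, 0.93, 0.78, 0.55, 0.44, 0.64, 0.99, 0.22]
df = pd.DataFrame({
    "Client": [c for c in range(1, 7) for _ in range(3)],
    "Manager": [1, 2, 3] * 6,
    "Score": scores,
})

max_clients = 2   # clients allowed per manager
load = {}         # manager -> number of clients assigned so far
taken = set()     # clients that already have a manager
picked = []

# Walk the pairs from best to worst score and keep the first valid ones
for row in df.sort_values("Score", ascending=False).itertuples(index=False):
    if row.Client in taken or load.get(row.Manager, 0) >= max_clients:
        continue
    taken.add(row.Client)
    load[row.Manager] = load.get(row.Manager, 0) + 1
    picked.append(row)

result = (pd.DataFrame(picked, columns=df.columns)
          .sort_values("Client")
          .reset_index(drop=True))
print(result)
```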

Better way to create modified copies of pandas rows based on condition [duplicate]

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple values in a cell, I'd like to expand the dataframe so that each item in the list gets its own row (with the same values in all other columns). So if I have:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'trial_num': [1, 2, 3, 1, 2, 3],
'subject': [1, 1, 1, 2, 2, 2],
'samples': [list(np.random.randn(3).round(2)) for i in range(6)]
}
)
df
Out[10]:
samples subject trial_num
0 [0.57, -0.83, 1.44] 1 1
1 [-0.01, 1.13, 0.36] 1 2
2 [1.18, -1.46, -0.94] 1 3
3 [-0.08, -4.22, -2.05] 2 1
4 [0.72, 0.79, 0.53] 2 2
5 [0.4, -0.32, -0.13] 2 3
How do I convert to long form, e.g.:
subject trial_num sample sample_num
0 1 1 0.57 0
1 1 1 -0.83 1
2 1 1 1.44 2
3 1 2 -0.01 0
4 1 2 1.13 1
5 1 2 0.36 2
6 1 3 1.18 0
# etc.
The index is not important, it's OK to set existing columns as the index and the final ordering isn't important.
Pandas >= 0.25
Both Series and DataFrame define an .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.
df = pd.DataFrame({
'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan],
'var2': [1, 2, 3, 4]
})
df
var1 var2
0 [a, b, c] 1
1 [d, e] 2
2 [] 3
3 NaN 4
df.explode('var1')
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
2 NaN 3 # empty list converted to NaN
3 NaN 4 # NaN entry preserved as-is
# to reset the index to be monotonically increasing...
df.explode('var1').reset_index(drop=True)
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 NaN 3
6 NaN 4
Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat-based solutions).
However, note that before pandas 1.3 explode only worked on a single column; since 1.3 it also accepts a list of columns, provided the lists in each row have matching lengths.
P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.
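Applied to the question's layout, a sketch could look like the following. It assumes pandas >= 0.25, uses a small made-up frame with lists of different lengths, and builds the sample_num column with groupby(level=0).cumcount(), which counts positions within each original row.

```python
import pandas as pd

df = pd.DataFrame({
    "trial_num": [1, 2, 3],
    "subject": [1, 1, 2],
    "samples": [[0.57, -0.83, 1.44], [-0.01, 1.13], [1.18]],
})

# One row per list item; the original index label is repeated for each item
long_df = df.explode("samples").rename(columns={"samples": "sample"})

# Position of each item within its original row (0, 1, 2, ...)
long_df["sample_num"] = long_df.groupby(level=0).cumcount()

# explode leaves the column as object dtype, so cast it back to float
long_df["sample"] = long_df["sample"].astype(float)
long_df = long_df.reset_index(drop=True)
print(long_df)
```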
A bit longer than I expected:
>>> df
samples subject trial_num
0 [-0.07, -2.9, -2.44] 1 1
1 [-1.52, -0.35, 0.1] 1 2
2 [-0.17, 0.57, -0.65] 1 3
3 [-0.82, -1.06, 0.47] 2 1
4 [0.79, 1.35, -0.09] 2 2
5 [1.17, 1.14, -1.79] 2 3
>>>
>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'sample'
>>>
>>> df.drop('samples', axis=1).join(s)
subject trial_num sample
0 1 1 -0.07
0 1 1 -2.90
0 1 1 -2.44
1 1 2 -1.52
1 1 2 -0.35
1 1 2 0.10
2 1 3 -0.17
2 1 3 0.57
2 1 3 -0.65
3 2 1 -0.82
3 2 1 -1.06
3 2 1 0.47
4 2 2 0.79
4 2 2 1.35
4 2 2 -0.09
5 2 3 1.17
5 2 3 1.14
5 2 3 -1.79
If you want a sequential index, you can apply reset_index(drop=True) to the result.
update:
>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
>>> res = res.reset_index()
>>> res.columns = ['subject','trial_num','sample_num','sample']
>>> res
subject trial_num sample_num sample
0 1 1 0 1.89
1 1 1 1 -2.92
2 1 1 2 0.34
3 1 2 0 0.85
4 1 2 1 0.24
5 1 2 2 0.72
6 1 3 0 -0.96
7 1 3 1 -2.72
8 1 3 2 -0.11
9 2 1 0 -1.33
10 2 1 1 3.13
11 2 1 2 -0.65
12 2 2 0 0.10
13 2 2 1 0.65
14 2 2 2 0.15
15 2 3 0 0.64
16 2 3 1 -0.10
17 2 3 2 -0.76
UPDATE: the solution below was helpful for older Pandas versions, because the DataFrame.explode() wasn’t available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().
lst_col = 'samples'
r = pd.DataFrame({
col:np.repeat(df[col].values, df[lst_col].str.len())
for col in df.columns.drop(lst_col)}
).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]
Result:
In [103]: r
Out[103]:
samples subject trial_num
0 0.10 1 1
1 -0.20 1 1
2 0.05 1 1
3 0.25 1 2
4 1.32 1 2
5 -0.17 1 2
6 0.64 1 3
7 -0.22 1 3
8 -0.71 1 3
9 -0.03 2 1
10 -0.65 2 1
11 0.76 2 1
12 1.77 2 2
13 0.89 2 2
14 0.65 2 2
15 -0.98 2 3
16 0.65 2 3
17 -0.30 2 3
PS here you may find a bit more generic solution
UPDATE: some explanations: IMO the easiest way to understand this code is to execute it step by step:
In the following line we repeat each value in one column N times, where N is the length of the corresponding list:
In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)
this can be generalized for all columns, containing scalar values:
In [11]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: )
Out[11]:
trial_num subject
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 3 1
.. ... ...
11 1 2
12 2 2
13 2 2
14 2 2
15 3 2
16 3 2
17 3 2
[18 rows x 2 columns]
using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:
In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32, 0.82, -0.59, -0.34, 0.25, 2.09, 0.12, 0.83, -0.88, 0.68, 0.55, -0.56, 0.65, -0.04, 0.36, -0.31])
putting all this together:
In [13]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
trial_num subject samples
0 1 1 -1.04
1 1 1 -0.58
2 1 1 -1.32
3 2 1 0.82
4 2 1 -0.59
5 2 1 -0.34
6 3 1 0.25
.. ... ... ...
11 1 2 0.68
12 2 2 0.55
13 2 2 -0.56
14 2 2 0.65
15 3 2 -0.04
16 3 2 0.36
17 3 2 -0.31
[18 rows x 3 columns]
using pd.DataFrame()[df.columns] will guarantee that we are selecting columns in the original order...
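The steps above can be condensed into one runnable sketch. The frame below is made up and deliberately uses lists of different lengths, to show that np.repeat and np.concatenate do not need equal-sized lists (empty lists or NaN entries would still need separate handling).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "trial_num": [1, 2],
    "subject": [1, 1],
    "samples": [[0.1, -0.2, 0.05], [0.25, 1.32]],
})
lst_col = "samples"

# Length of each list, used as the repeat count for the scalar columns
lengths = df[lst_col].str.len()

# Repeat every scalar column so it lines up with the flattened list values
scalars = {
    col: np.repeat(df[col].to_numpy(), lengths)
    for col in df.columns.drop(lst_col)
}

# Flatten the list column into one 1D array and restore the column order
r = (pd.DataFrame(scalars)
     .assign(**{lst_col: np.concatenate(df[lst_col].to_numpy())})
     [df.columns])
print(r)
```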
You can also use pd.concat and pd.melt for this:
>>> objs = [df, pd.DataFrame(df['samples'].tolist())]
>>> pd.concat(objs, axis=1).drop('samples', axis=1)
subject trial_num 0 1 2
0 1 1 -0.49 -1.00 0.44
1 1 2 -0.28 1.48 2.01
2 1 3 -0.52 -1.84 0.02
3 2 1 1.23 -1.36 -1.06
4 2 2 0.54 0.18 0.51
5 2 3 -2.18 -0.13 -1.35
>>> pd.melt(_, var_name='sample_num', value_name='sample',
... value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
subject trial_num sample_num sample
0 1 1 0 -0.49
1 1 2 0 -0.28
2 1 3 0 -0.52
3 2 1 0 1.23
4 2 2 0 0.54
5 2 3 0 -2.18
6 1 1 1 -1.00
7 1 2 1 1.48
8 1 3 1 -1.84
9 2 1 1 -1.36
10 2 2 1 0.18
11 2 3 1 -0.13
12 1 1 2 0.44
13 1 2 2 2.01
14 1 3 2 0.02
15 2 1 2 -1.06
16 2 2 2 0.51
17 2 3 2 -1.35
Lastly, if you need to, you can sort based on the first three columns.
Trying to work through Roman Pekar's solution step-by-step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can't say that it's obviously a clearer solution though:
items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
# Keep original df index as a column so it's retained after melt
items_as_cols['orig_index'] = items_as_cols.index
melted_items = pd.melt(items_as_cols, id_vars='orig_index',
var_name='sample_num', value_name='sample')
melted_items.set_index('orig_index', inplace=True)
df.merge(melted_items, left_index=True, right_index=True)
Output (obviously we can drop the original samples column now):
samples subject trial_num sample_num sample
0 [1.84, 1.05, -0.66] 1 1 0 1.84
0 [1.84, 1.05, -0.66] 1 1 1 1.05
0 [1.84, 1.05, -0.66] 1 1 2 -0.66
1 [-0.24, -0.9, 0.65] 1 2 0 -0.24
1 [-0.24, -0.9, 0.65] 1 2 1 -0.90
1 [-0.24, -0.9, 0.65] 1 2 2 0.65
2 [1.15, -0.87, -1.1] 1 3 0 1.15
2 [1.15, -0.87, -1.1] 1 3 1 -0.87
2 [1.15, -0.87, -1.1] 1 3 2 -1.10
3 [-0.8, -0.62, -0.68] 2 1 0 -0.80
3 [-0.8, -0.62, -0.68] 2 1 1 -0.62
3 [-0.8, -0.62, -0.68] 2 1 2 -0.68
4 [0.91, -0.47, 1.43] 2 2 0 0.91
4 [0.91, -0.47, 1.43] 2 2 1 -0.47
4 [0.91, -0.47, 1.43] 2 2 2 1.43
5 [-1.14, -0.24, -0.91] 2 3 0 -1.14
5 [-1.14, -0.24, -0.91] 2 3 1 -0.24
5 [-1.14, -0.24, -0.91] 2 3 2 -0.91
For those looking for a version of Roman Pekar's answer that avoids manual column naming:
column_to_explode = 'samples'
res = (df
.set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
.apply(pd.Series)
.stack()
.reset_index())
res = res.rename(columns={
res.columns[-2]:'exploded_{}_index'.format(column_to_explode),
res.columns[-1]: '{}_exploded'.format(column_to_explode)})
I found the easiest way was to:
Convert the samples column into a DataFrame
Join it with the original df
Melt the result
Shown here:
df.samples.apply(lambda x: pd.Series(x)).join(df).\
melt(['subject','trial_num'],[0,1,2],var_name='sample')
subject trial_num sample value
0 1 1 0 -0.24
1 1 2 0 0.14
2 1 3 0 -0.67
3 2 1 0 -1.52
4 2 2 0 -0.00
5 2 3 0 -1.73
6 1 1 1 -0.70
7 1 2 1 -0.70
8 1 3 1 -0.29
9 2 1 1 -0.70
10 2 2 1 -0.72
11 2 3 1 1.30
12 1 1 2 -0.55
13 1 2 2 0.10
14 1 3 2 -0.44
15 2 1 2 0.13
16 2 2 2 -1.44
17 2 3 2 0.73
It's worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.
import pandas as pd
df = pd.DataFrame([{'Product': 'Coke', 'Prices': [100,123,101,105,99,94,98]},{'Product': 'Pepsi', 'Prices': [101,104,104,101,99,99,99]}])
print(df)
# Prices already holds lists, so explode it directly; .str.split(',') is only needed when the column holds comma-separated strings
df = df.explode('Prices')
print(df)
This requires pandas >= 0.25.
Very late answer but I want to add this:
A fast solution using vanilla Python that also takes care of the sample_num column in OP's example. On my own large dataset with over 10 million rows and a result with 28 million rows this only takes about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error on my system that has 128GB of RAM.
df = df.reset_index(drop=True)
lstcol = df.lstcol.values
lstcollist = []
indexlist = []
countlist = []
for ii in range(len(lstcol)):
    lstcollist.extend(lstcol[ii])
    indexlist.extend([ii]*len(lstcol[ii]))
    countlist.extend([jj for jj in range(len(lstcol[ii]))])
df = pd.merge(df.drop("lstcol",axis=1),pd.DataFrame({"lstcol":lstcollist,"lstcol_num":countlist},
index=indexlist),left_index=True,right_index=True).reset_index(drop=True)
Also very late, but here is an answer from Karvy1 that worked well for me if you don't have pandas >=0.25 version: https://stackoverflow.com/a/52511166/10740287
For the example above you may write:
data = [(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples]
data = pd.DataFrame(data, columns=['subject', 'trial_num', 'samples'])
Speed test:
%timeit data = pd.DataFrame([(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples], columns=['subject', 'trial_num', 'samples'])
1.33 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit data = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack().reset_index()
4.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pd.DataFrame({col:np.repeat(df[col].values, df['samples'].str.len())for col in df.columns.drop('samples')}).assign(**{'samples':np.concatenate(df['samples'].values)})
1.38 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
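If you also want the sample_num column from the question, the list-comprehension approach can be extended with enumerate. This is only a sketch on a small made-up frame; I have not timed this variant.

```python
import pandas as pd

df = pd.DataFrame({
    "trial_num": [1, 2],
    "subject": [1, 1],
    "samples": [[0.57, -0.83], [1.13]],
})

# One tuple per list item; enumerate supplies the position within the list
rows = [
    (row.subject, row.trial_num, sample, sample_num)
    for row in df.itertuples(index=False)
    for sample_num, sample in enumerate(row.samples)
]
long_df = pd.DataFrame(rows, columns=["subject", "trial_num", "sample", "sample_num"])
print(long_df)
```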

Np random sampling in python

I have two pandas DataFrames. I want to create a new column in df2 by assigning a random Rate, drawn using the Weight values from df1 for the matching Income_Group.
df1
Income_Group Rate Weight
0 1 3.5 0.5
1 1 2.5 0.25
2 1 3.75 0.15
3 1 5.0 0.15
4 2 4.5 0.35
5 2 2.5 0.25
6 2 4.75 0.20
7 2 5.0 0.20
....
30 8 2.25 0.75
31 8 4.15 0.05
32 8 6.35 0.20
df2
ID Income_Group State Rate
0 12 1 9 3.5
1 13 2 6 4.5
2 15 8 1 6.35
3 8 1 5 2.5
4 9 8 4 6.35
5 17 2 3 4.75
......
100 50 1 4 3.75
I tried the following code:
df2['Rate']=df1.groupby('Income_Group').apply(lambda gp.np.random.choice(a=gp.Rate, p=gp.Weight,
replace=True))
Of course, the code didn't work. Can someone help me on this? Thank you in advance.
Your data is pretty small, so we can do:
rate_dict = df1.groupby('Income_Group')[['Rate', 'Weight']].agg(list)
df2['Rate'] = df2.Income_Group.apply(lambda x: np.random.choice(rate_dict.loc[x, 'Rate'],
p=rate_dict.loc[x, 'Weight'])
)
Or you can do groupby on df2 as well:
(df2.groupby('Income_Group')
.Income_Group
.transform(lambda x: np.random.choice(rate_dict.loc[x.iloc[0], 'Rate'],
size=len(x),
p=rate_dict.loc[x.iloc[0], 'Weight']))
)
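One thing to watch out for: np.random.choice raises an error if the probabilities do not sum to 1, and the weights shown for Income_Group 1 in the question add up to 1.05. A hedged sketch that normalises the weights per group before sampling (the frames below are shortened, made-up versions of df1 and df2, and lookup is my own helper name):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Income_Group": [1, 1, 1, 2, 2],
    "Rate": [3.5, 2.5, 3.75, 4.5, 2.5],
    "Weight": [0.5, 0.25, 0.3, 0.6, 0.4],   # group 1 intentionally sums to 1.05
})
df2 = pd.DataFrame({"ID": [12, 13, 15, 8], "Income_Group": [1, 2, 1, 2]})

rng = np.random.default_rng(0)

# Per group: the candidate rates and their weights rescaled to sum to 1
lookup = {
    group: (gp["Rate"].to_numpy(), (gp["Weight"] / gp["Weight"].sum()).to_numpy())
    for group, gp in df1.groupby("Income_Group")
}

# Draw all rates for one income group at a time
df2["Rate"] = np.nan
for group, idx in df2.groupby("Income_Group").groups.items():
    rates, probs = lookup[group]
    df2.loc[idx, "Rate"] = rng.choice(rates, size=len(idx), p=probs)
print(df2)
```

Whether rescaling is the right fix depends on what the weights are supposed to mean, so check the source data first.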
You can try:
df1 = pd.DataFrame([[1,3.5,.5], [1,2.5,.25], [1,3.75,.15]],
columns=['Income_Group', 'Rate', 'Weight'])
df2 = pd.DataFrame()
weights = np.random.rand(df1.shape[0])
df2['Rate'] = df1.Rate.values * weights

Remapping and regrouping values in python pandas

I have a dataframe where values have been assigned to groups:
import pandas as pd
df = pd.DataFrame({ 'num' : [0.43, 5.2, 1.3, 0.33, .74, .5, .2, .12],
'group' : [1, 2, 2, 2, 3,4,5,5]
})
df
group num
0 1 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 3 0.74
5 4 0.50
6 5 0.20
7 5 0.12
I would like to ensure that no value is in a group alone. If a value is an "orphan", it should be reassigned to the next highest group with more than one member. So the resultant dataframe should look like this instead:
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
What's the most pythonic way to achieve this result?
Here is one solution I found, there may be much better ways to do this...
import bisect
# Find the orphans
count = df.group.value_counts().sort_index()
orphans = count[count == 1].index.values.tolist()
# Find the sets
sets = count[count > 1].index.values.tolist()
# Find where orphans should be remapped
where = [bisect.bisect(sets, x) for x in orphans]
remap = [sets[x] for x in where]
# Create a dictionary for remapping, and replace original values
change = dict(zip(orphans, remap))
df = df.replace({'group': change})
df
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
It is possible to use only vectorised operations for this task. You can use pd.Series.bfill to create a mapping from your original index to a new one:
counts = df['group'].value_counts().sort_index().reset_index()
counts['original'] = counts['index']
counts.loc[counts['group'] == 1, 'index'] = np.nan
counts['index'] = counts['index'].bfill().astype(int)
print(counts)
index group original
0 2 1 1
1 2 3 2
2 5 1 3
3 5 1 4
4 5 2 5
Then use pd.Series.map to perform your mapping:
df['group'] = df['group'].map(counts.set_index('original')['index'])
print(df)
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
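The column names produced by value_counts().reset_index() have changed between pandas versions, so the snippet above may need adjusting on newer releases. Here is a sketch of the same bfill idea that avoids relying on those names (sizes and targets are my own variable names):

```python
import pandas as pd

df = pd.DataFrame({
    "num": [0.43, 5.2, 1.3, 0.33, 0.74, 0.5, 0.2, 0.12],
    "group": [1, 2, 2, 2, 3, 4, 5, 5],
})

# Number of members per group, ordered by group label
sizes = df["group"].value_counts().sort_index()

# Keep the label for groups with more than one member, blank out orphans,
# then back-fill so each orphan takes the next higher non-orphan label
targets = pd.Series(sizes.index, index=sizes.index).where(sizes > 1).bfill()

df["group"] = df["group"].map(targets).astype(int)
print(df)
```

If the highest group is itself an orphan there is nothing to back-fill from, the value stays NaN, and astype(int) will fail, so that case needs its own rule.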

Pandas: How to maintain the type of columns with nan?

For example, I have a df with NaN values and use the following method to fill them.
import pandas as pd
a = [[2.0, 10, 4.2], ['b', 70, 0.03], ['x', ]]
df = pd.DataFrame(a)
print(df)
df.fillna(int(0),inplace=True)
print('fillna df\n',df)
dtype_df = df.dtypes.reset_index()
OUTPUT:
0 1 2
0 2 10.0 4.20
1 b 70.0 0.03
2 x NaN NaN
fillna df
0 1 2
0 2 10.0 4.20
1 b 70.0 0.03
2 x 0.0 0.00
col type
0 0 object
1 1 float64
2 2 float64
Actually, I want column 1 to keep the int type instead of float.
My desired output:
fillna df
0 1 2
0 2 10 4.20
1 b 70 0.03
2 x 0 0.00
col type
0 0 object
1 1 int64
2 2 float64
So how to do it?
Try adding downcast='infer' to downcast any eligible columns:
df.fillna(0, downcast='infer')
0 1 2
0 2 10 4.20
1 b 70 0.03
2 x 0 0.00
And the corresponding dtypes are
0 object
1 int64
2 float64
dtype: object
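As far as I know, the downcast keyword is deprecated in recent pandas releases, so here is an alternative sketch that does not depend on it: fill first, then cast back any float column whose values are all whole numbers. The variable name filled is my own.

```python
import pandas as pd

a = [[2.0, 10, 4.2], ["b", 70, 0.03], ["x"]]
df = pd.DataFrame(a)

filled = df.fillna(0)

# Cast float columns back to int only when no fractional part would be lost
for col in filled.select_dtypes(include="float").columns:
    if (filled[col] % 1 == 0).all():
        filled[col] = filled[col].astype("int64")

print(filled)
print(filled.dtypes)
```

Column 2 stays float64 because it contains fractional values.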
