For example, I have a dataframe (df) and the target column is df['Z']. I have two other columns, df['X'] and df['Y']. All of this data comes from real-world data collection.
How can I build an equation for Z as the following functions in Python (i.e. fit Z as a function of X and Y)?
> 1. Z = f(X)
> 2. Z = f(X,Y)
Here is how you do that:
def function(x, y):
    return x + y + 4  # Obviously the function can be more complex

df["Z"] = function(df["A"], df["B"])
Example:
import pandas as pd

data = {'A': list(range(5)), 'B': list(range(6, 11))}
df = pd.DataFrame(data)

def function(x, y):
    return x + y + 4

df["Z"] = function(df["A"], df["B"])
print(df)
Output:
A B Z
0 0 6 10
1 1 7 12
2 2 8 14
3 3 9 16
4 4 10 18
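If the goal is instead to estimate an unknown f from the collected data (the "fit" part of the question), curve fitting is one option. Below is a minimal sketch using scipy.optimize.curve_fit, assuming a linear model z = a*x + b*y + c; the model form and the synthetic data are assumptions for illustration only:

import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def model(xy, a, b, c):
    # assumed model form; swap in whatever shape matches your data
    x, y = xy
    return a * x + b * y + c

# synthetic stand-in for the real-world data
rng = np.random.default_rng(0)
df = pd.DataFrame({'X': rng.random(50), 'Y': rng.random(50)})
df['Z'] = 2 * df['X'] + 3 * df['Y'] + 1

params, _ = curve_fit(model, (df['X'].values, df['Y'].values), df['Z'].values)
print(params)  # recovers roughly [2, 3, 1] on this synthetic data
df['Z_fit'] = model((df['X'], df['Y']), *params)

For Z = f(X) alone, the same call works with a one-argument model such as def model(x, a, b): return a * x + b.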
I have a data frame in the below format:
variable val
0 'a','x','y' 10
I would like to unnest (explode) the data into the below format:
variable1 variable2 value
0 a x 10
1 a y 10
2 x y 10
I have tried using df.explode, which does not give me the relation between x and y. My code is below. Can anyone guide me on how to proceed further to get the x and y data? Thanks in advance.
import pandas as pd
from ast import literal_eval

data = {'name': ["'a','x','y'"], 'val': [10]}
df = pd.DataFrame(data)

df2 = (df['name'].str.split(',', expand=True, n=1)
         .rename(columns={0: 'variable 1', 1: 'variable 2'})
         .join(df.drop(columns='name')))
df2['variable 2'] = df2['variable 2'].map(literal_eval)
df2 = df2.explode('variable 2', ignore_index=True)
print(df2)
Output:
variable 1 variable 2 val
0 'a' x 10
1 'a' y 10
If you need each pairwise combination of the values split by ',', use:
print (df)
variable val
0 'a','x','y' 10
1 'a','x','y','f' 80
2 's' 4
from itertools import combinations

# strip the quotes so the values compare cleanly
df['variable'] = df['variable'].str.replace("'", "", regex=True)
# split each row into its values; a single value is paired with itself
s = [x.split(',') if ',' in x else (x, x) for x in df['variable']]
# build every 2-element combination per row, carrying the row's val along
L = [(*y, z) for x, z in zip(s, df['val']) for y in combinations(x, 2)]
df = pd.DataFrame(L, columns=['variable 1', 'variable 2', 'val'])
print (df)
variable 1 variable 2 val
0 a x 10
1 a y 10
2 x y 10
3 a x 80
4 a y 80
5 a f 80
6 x y 80
7 x f 80
8 y f 80
9 s s 4
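The (x, x) fallback matters for single-value rows: combinations of a one-element list with r=2 yields nothing, so duplicating the value is what produces the s s 4 row. combinations also preserves the left-to-right order of the split values, which is why the pairs come out as (a, x), (a, y), (x, y).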
Suppose I have the following column.
>>> import pandas
>>> a = pandas.Series(['0', '1', '5', '1', None, '3', 'Cat', '2'])
I would like to be able to convert all the data in the column to type int, and any element that cannot be converted should be replaced with a 0.
My current solution to this is to use to_numeric with the 'coerce' option, fill any NaN with 0, and then convert to int (since the presence of NaN made the column float instead of int).
>>> pandas.to_numeric(a, errors='coerce').fillna(0).astype(int)
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
Is there any method that would allow me to do this in one step rather than having to go through two intermediate states? I am looking for something that would behave like the following imaginary option to astype:
>>> a.astype(int, value_on_error=0)
Option 1
pd.to_numeric(a, 'coerce').fillna(0).astype(int)
Option 2
b = pd.to_numeric(a, 'coerce')
b.mask(b.isnull(), 0).astype(int)
Option 3
def try_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return 0

a.apply(try_int)
Option 4
import numpy as np

b = np.empty(a.shape, dtype=int)
# flag entries that are purely digits; None becomes 'None', which fails the test
i = np.char.isdigit(a.values.astype(str))
b[i] = a[i].astype(int)
b[~i] = 0
pd.Series(b, a.index)
All produce:
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
Timing
Code below:
def pir1(a):
    return pd.to_numeric(a, 'coerce').fillna(0).astype(int)

def pir2(a):
    b = pd.to_numeric(a, 'coerce')
    return b.mask(b.isnull(), 0).astype(int)

def try_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return 0

def pir3(a):
    return a.apply(try_int)

def pir4(a):
    b = np.empty(a.shape, dtype=int)
    i = np.char.isdigit(a.values.astype(str))
    b[i] = a[i].astype(int)
    b[~i] = 0
    return pd.Series(b, a.index)

def alt1(a):
    return pd.to_numeric(a.where(a.str.isnumeric(), 0))
from timeit import timeit

results = pd.DataFrame(
    index=[1, 3, 10, 30, 100, 300, 1000, 3000, 10000],
    columns='pir1 pir2 pir3 pir4 alt1'.split()
)

for i in results.index:
    c = pd.concat([a] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(c)'.format(j)
        setp = 'from __main__ import c, {}'.format(j)
        results.loc[i, j] = timeit(stmt, setp, number=10)

results.plot(logx=True, logy=True)
a.where(a.str.isnumeric(), 0).astype(int)
Output:
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
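One caveat with the str.isnumeric approach: str.isnumeric is False for negative and decimal strings, so values like '-1' or '2.5' would be zeroed out, while to_numeric with errors='coerce' parses them. A quick sketch of the difference (made-up values, not the data above):

import pandas as pd

s = pd.Series(['-1', '2.5', '7'])
print(s.str.isnumeric().tolist())                  # [False, False, True]
print(pd.to_numeric(s, errors='coerce').tolist())  # [-1.0, 2.5, 7.0]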
I have a pandas Series 'A' containing comma-separated values like this:
index A
1 null
2 5,6
3 3
4 null
5 5,18,22
... ...
I need a dataframe like this:
index A_5 A_6 A_18 A_20
1 0 0 0 ...
2 1 1 0 ...
3 0 0 0 ...
4 0 0 0 ...
5 1 0 1 ...
... ... ... ... ...
Values that don't occur at least MIN_OBS times should be ignored and not get their own column, because there are so many distinct values that the dataframe would become too big without this threshold.
I designed the solution below. It works, but is way too slow (due to iterating over rows, I suppose). Could anyone suggest a faster approach?
from collections import defaultdict
import pandas as pd

temp_dict = defaultdict(int)
for k, v in A.items():
    temp_list = v.split(',')
    for item in temp_list:
        temp_dict[item] += 1

cols_to_make = []
for k, v in temp_dict.items():
    if v > MIN_OBS:
        cols_to_make.append('A_' + k)

result_df = pd.DataFrame(0, index=A.index, columns=cols_to_make)
for k, v in A.items():
    temp_list = v.split(',')
    for item in temp_list:
        if ('A_' + item) in cols_to_make:
            result_df['A_' + item][k] = 1
You can use get_dummies to create the indicator variables, then convert the column labels to numbers with to_numeric, and finally filter the columns by the variable TRESH with loc:
print(df)
A
index
1 null
2 5,6
3 3
4 null
5 5,18,22
df = df.A.str.get_dummies(sep=",")
print(df)
18 22 3 5 6 null
index
1 0 0 0 0 0 1
2 0 0 0 1 1 0
3 0 0 1 0 0 0
4 0 0 0 0 0 1
5 1 1 0 1 0 0
df.columns = pd.to_numeric(df.columns, errors='coerce')
df = df.sort_index(axis=1)

TRESH = 5
cols = [col for col in df.columns if col > TRESH]
print(cols)
[6.0, 18.0, 22.0]

df = df.loc[:, cols]
print(df)
6 18 22
index
1 0 0 0
2 1 0 0
3 0 0 0
4 0 0 0
5 0 1 1
df.columns = ["A_" + str(int(col)) for col in df.columns]
print df
A_6 A_18 A_22
index
1 0 0 0
2 1 0 0
3 0 0 0
4 0 0 0
5 0 1 1
EDIT:
I modified unutbu's excellent original answer: changed how the Series is created, dropped the rows whose value is null, and added the prefix parameter to get_dummies:
import numpy as np
import pandas as pd

s = pd.Series(['null', '5,6', '3', 'null', '5,18,22', '3,4'])
print(s)

#result = s.str.split(',').apply(pd.Series).stack()
#replaced by the faster:
result = pd.DataFrame([x.split(',') for x in s]).stack()
count = result.value_counts()

min_obs = 2
#drop values below the threshold as well as the 'null' placeholder
count = count[(count >= min_obs) & ~count.index.isin(['null'])]
result = result.loc[result.isin(count.index)]

#pass prefix to get_dummies
result = pd.get_dummies(result, prefix="A")
result.index = result.index.droplevel(1)
result = result.reindex(s.index)
print(result)
A_3 A_5
0 NaN NaN
1 0 1
2 1 0
3 NaN NaN
4 0 1
5 1 0
Timings:
In [143]: %timeit pd.DataFrame([ x.split(',') for x in s ]).stack()
1000 loops, best of 3: 866 µs per loop
In [144]: %timeit s.str.split(',').apply(pd.Series).stack()
100 loops, best of 3: 2.46 ms per loop
Since memory is an issue, we have to be careful not to build large intermediate
data structures if possible.
Let's start with the OP's posted code that works:
def orig(A, MIN_OBS):
    temp_dict = collections.defaultdict(int)
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    cols_to_make = []
    for k, v in temp_dict.items():
        if v > MIN_OBS:
            cols_to_make.append('A_' + k)
    result_df = pd.DataFrame(0, index=A.index, columns=cols_to_make)
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            if ('A_' + item) in cols_to_make:
                result_df['A_' + item][k] = 1
    return result_df
and extract the first loop into its own function:
def count(A, MIN_OBS):
    temp_dict = collections.Counter()
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    temp_dict = {k: v for k, v in temp_dict.items() if v > MIN_OBS}
    return temp_dict
From experimentation in an interactive session, we can see this is not the bottleneck; even for "large" DataFrames, count(A, MIN_OBS) completes fairly quickly.
The slowness of orig occurs in the double for-loop at the end, which modifies cells in the DataFrame one value at a time (e.g. result_df['A_' + item][k] = 1).
We can replace that double for-loop with a single for-loop over the columns of the DataFrame, using the vectorized string method A.str.contains to search for values in the strings. Since we never split the original strings into Python lists of strings (or Pandas DataFrames holding the string fragments), we save some memory.
Since orig and alt use similar data structures, their memory footprint is about the same.
def alt(A, MIN_OBS):
    temp_dict = count(A, MIN_OBS)
    df = pd.DataFrame(0, index=A.index, columns=list(temp_dict))
    for col in df:
        # anchor the value between separators or string boundaries,
        # so that e.g. col=5 does not match inside the value 56
        df[col] = A.str.contains(r'(?:^|,){v}(?:,|$)'.format(v=col)).astype(int)
    df.columns = ['A_{}'.format(col) for col in df]
    return df
Here is an example, on a 200K row DataFrame with 40K different possible values:
import numpy as np
import pandas as pd
import collections
np.random.seed(2016)
ncols = 5
nrows = 200000
nvals = 40000
MIN_OBS = 200
# nrows = 20
# nvals = 4
# MIN_OBS = 2
idx = np.random.randint(ncols, size=nrows).cumsum()
data = np.random.choice(np.arange(nvals), size=idx[-1])
data = np.array_split(data, idx[:-1])
data = [','.join(map(str, arr)) for arr in data]  # build plain strings like '3,14,7' per row
A = pd.Series(data)
A.loc[A == ''] = 'null'
# orig, count and alt exactly as defined above
Here is a benchmark:
In [48]: %timeit expected = orig(A, MIN_OBS)
1 loops, best of 3: 3.03 s per loop
In [49]: %timeit expected = alt(A, MIN_OBS)
1 loops, best of 3: 483 ms per loop
Note that the majority of the time required for alt to complete is spent in count:
In [60]: %timeit count(A, MIN_OBS)
1 loops, best of 3: 304 ms per loop
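Since count dominates alt's runtime, one possible refinement (a sketch only, not benchmarked here) is to push the token counting into collections.Counter fed by itertools.chain, which avoids the explicit Python-level nested loop:

import collections
import itertools

def count_fast(A, MIN_OBS):
    # count every comma-separated token across the whole Series in one pass
    c = collections.Counter(
        itertools.chain.from_iterable(v.split(',') for v in A)
    )
    return {k: v for k, v in c.items() if v > MIN_OBS}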
Would something like this work or could it be modified to fit your need?
df = pd.DataFrame({'A': ['null', '5,6', '3', 'null', '5,18,22']}, columns=['A'])
A
0 null
1 5,6
2 3
3 null
4 5,18,22
Then use get_dummies()
pd.get_dummies(df['A'].str.split(',').apply(pd.Series), prefix=df.columns[0])
Result:
A_3 A_5 A_null A_18 A_6 A_22
index
1 0 0 1 0 0 0
2 0 1 0 0 1 0
3 1 0 0 0 0 0
4 0 0 1 0 0 0
5 0 1 0 1 0 1
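Note that this does not apply the MIN_OBS threshold, and because each split position is encoded separately, the same value occurring at different positions can produce duplicate prefixed columns. The Series.str.get_dummies call from the first answer sidesteps both quirks in one step (add_prefix here is just a convenience for the A_ labels):

df['A'].str.get_dummies(sep=',').add_prefix('A_')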
If I have the following:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 8)))
tupleList = list(zip('abcdefgh', 'iijjkkll'))
ind = pd.MultiIndex.from_tuples(tupleList)
df.columns = ind
In [71]: df
Out[71]:
a b c d e f g \
i i j j k k l
0 0.968112 0.809183 0.144320 0.518120 0.820079 0.648237 0.971552
1 0.959022 0.721705 0.139588 0.408940 0.230956 0.907192 0.467016
2 0.335085 0.537437 0.725119 0.486447 0.114048 0.150150 0.894322
3 0.051249 0.186547 0.779814 0.905914 0.024298 0.002489 0.339714
h
l
0 0.438330
1 0.225447
2 0.331413
3 0.530789
[4 rows x 8 columns]
what is the easiest way to select the columns that have a second level label of "j" or "k"?
c d e f
j j k k
0 0.948030 0.243993 0.627497 0.729024
1 0.087703 0.874968 0.581875 0.996466
2 0.802155 0.213450 0.375096 0.184569
3 0.164278 0.646088 0.201323 0.022498
I can do this:
df.loc[:, df.columns.get_level_values(1).isin(['j', 'k'])]
But that seems pretty verbose for something that feels like it should be simple. Any better approaches?
See the docs on MultiIndex slicers, introduced in pandas 0.14.0:
In [36]: idx = pd.IndexSlice
In [37]: df.loc[:, idx[:, ['j', 'k']]]
Out[37]:
c d e f
j j k k
0 0.750582 0.877763 0.262696 0.226005
1 0.025902 0.967179 0.125647 0.297304
2 0.463544 0.104973 0.154113 0.284820
3 0.631695 0.841023 0.820907 0.938378
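The same selection works without naming IndexSlice by passing a tuple whose first level is slice(None), and for a single second-level label xs is even shorter; both are standard pandas APIs:

df.loc[:, (slice(None), ['j', 'k'])]  # equivalent to the IndexSlice version
df.xs('k', axis=1, level=1)           # all columns whose second level is 'k'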