For example, I have a dataframe (df) and the target column is df['Z']. I have two other columns, df['X'] and df['Y']. All of this data comes from real-world data collection.
How can I build an equation for Z as the following functions in Python (i.e. fit Z as a function of X and Y)?
> 1. Z = f(X)
> 2. Z = f(X,Y)
Here is how you do that:
def function(x, y):
    return x + y + 4  # Obviously the function can be more complex

df["Z"] = function(df["A"], df["B"])
Example:
import pandas as pd

data = {'A': list(range(5)), 'B': list(range(6, 11))}
df = pd.DataFrame(data)

def function(x, y):
    return x + y + 4

df["Z"] = function(df["A"], df["B"])
print(df)
Output:
A B Z
0 0 6 10
1 1 7 12
2 2 8 14
3 3 9 16
4 4 10 18
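If the goal is instead to estimate an unknown f from the collected data (the "fit" part of the question), curve fitting is one option. Below is a minimal sketch using scipy.optimize.curve_fit, assuming a linear model z = a*x + b*y + c; the model form and the synthetic data are assumptions for illustration only:

import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def model(xy, a, b, c):
    # assumed model form; swap in whatever shape matches your data
    x, y = xy
    return a * x + b * y + c

# synthetic stand-in for the real-world data
rng = np.random.default_rng(0)
df = pd.DataFrame({'X': rng.random(50), 'Y': rng.random(50)})
df['Z'] = 2 * df['X'] + 3 * df['Y'] + 1

params, _ = curve_fit(model, (df['X'].values, df['Y'].values), df['Z'].values)
print(params)  # recovers roughly [2, 3, 1] on this synthetic data
df['Z_fit'] = model((df['X'], df['Y']), *params)

For Z = f(X) alone, the same call works with a one-argument model such as def model(x, a, b): return a * x + b.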
I have a data frame in the below format:
variable val
0 'a','x','y' 10
I would like to unnest (explode) the data into the below format:
variable1 variable2 value
0 a x 10
1 a y 10
2 x y 10
I have tried using df.explode, which does not give me the relation between x and y. My code is below. Can anyone guide me on how to proceed further to get the x and y data? Thanks in advance.
import pandas as pd
from ast import literal_eval

data = {'name': ["'a','x','y'"], 'val': [10]}
df = pd.DataFrame(data)

df2 = (df['name'].str.split(',', expand=True, n=1)
         .rename(columns={0: 'variable 1', 1: 'variable 2'})
         .join(df.drop(columns='name')))
df2['variable 2'] = df2['variable 2'].map(literal_eval)
df2 = df2.explode('variable 2', ignore_index=True)
print(df2)
Output:
variable 1 variable 2 val
0 'a' x 10
1 'a' y 10
If you need each pairwise combination of the values split by ',', use:
print (df)
variable val
0 'a','x','y' 10
1 'a','x','y','f' 80
2 's' 4
from itertools import combinations

# strip the quotes so the values compare cleanly
df['variable'] = df['variable'].str.replace("'", "", regex=True)
# split each row into its values; a single value is paired with itself
s = [x.split(',') if ',' in x else (x, x) for x in df['variable']]
# build every 2-element combination per row, carrying the row's val along
L = [(*y, z) for x, z in zip(s, df['val']) for y in combinations(x, 2)]
df = pd.DataFrame(L, columns=['variable 1', 'variable 2', 'val'])
print (df)
variable 1 variable 2 val
0 a x 10
1 a y 10
2 x y 10
3 a x 80
4 a y 80
5 a f 80
6 x y 80
7 x f 80
8 y f 80
9 s s 4
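The (x, x) fallback matters for single-value rows: combinations of a one-element list with r=2 yields nothing, so duplicating the value is what produces the s s 4 row. combinations also preserves the left-to-right order of the split values, which is why the pairs come out as (a, x), (a, y), (x, y).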
Suppose I have the following column.
>>> import pandas
>>> a = pandas.Series(['0', '1', '5', '1', None, '3', 'Cat', '2'])
I would like to be able to convert all the data in the column to type int, and any element that cannot be converted should be replaced with a 0.
My current solution to this is to use to_numeric with the 'coerce' option, fill any NaN with 0, and then convert to int (since the presence of NaN made the column float instead of int).
>>> pandas.to_numeric(a, errors='coerce').fillna(0).astype(int)
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
Is there any method that would allow me to do this in one step rather than having to go through two intermediate states? I am looking for something that would behave like the following imaginary option to astype:
>>> a.astype(int, value_on_error=0)
Option 1
pd.to_numeric(a, 'coerce').fillna(0).astype(int)
Option 2
b = pd.to_numeric(a, 'coerce')
b.mask(b.isnull(), 0).astype(int)
Option 3
def try_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return 0

a.apply(try_int)
Option 4
import numpy as np

b = np.empty(a.shape, dtype=int)
# flag entries that are purely digits; None becomes 'None', which fails the test
i = np.char.isdigit(a.values.astype(str))
b[i] = a[i].astype(int)
b[~i] = 0
pd.Series(b, a.index)
All produce:
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
Timing
Code below:
def pir1(a):
    return pd.to_numeric(a, 'coerce').fillna(0).astype(int)

def pir2(a):
    b = pd.to_numeric(a, 'coerce')
    return b.mask(b.isnull(), 0).astype(int)

def try_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return 0

def pir3(a):
    return a.apply(try_int)

def pir4(a):
    b = np.empty(a.shape, dtype=int)
    i = np.char.isdigit(a.values.astype(str))
    b[i] = a[i].astype(int)
    b[~i] = 0
    return pd.Series(b, a.index)

def alt1(a):
    return pd.to_numeric(a.where(a.str.isnumeric(), 0))
from timeit import timeit

results = pd.DataFrame(
    index=[1, 3, 10, 30, 100, 300, 1000, 3000, 10000],
    columns='pir1 pir2 pir3 pir4 alt1'.split()
)

for i in results.index:
    c = pd.concat([a] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(c)'.format(j)
        setp = 'from __main__ import c, {}'.format(j)
        results.loc[i, j] = timeit(stmt, setp, number=10)

results.plot(logx=True, logy=True)
a.where(a.str.isnumeric(), 0).astype(int)
Output:
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
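One caveat with the str.isnumeric approach: str.isnumeric is False for negative and decimal strings, so values like '-1' or '2.5' would be zeroed out, while to_numeric with errors='coerce' parses them. A quick sketch of the difference (made-up values, not the data above):

import pandas as pd

s = pd.Series(['-1', '2.5', '7'])
print(s.str.isnumeric().tolist())                  # [False, False, True]
print(pd.to_numeric(s, errors='coerce').tolist())  # [-1.0, 2.5, 7.0]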
I have a pandas Series 'A' containing comma-separated values like this:
index A
1 null
2 5,6
3 3
4 null
5 5,18,22
... ...
I need a dataframe like this:
index A_5 A_6 A_18 A_20
1 0 0 0 ...
2 1 1 0 ...
3 0 0 0 ...
4 0 0 0 ...
5 1 0 1 ...
... ... ... ... ...
Values that don't occur at least MIN_OBS times should be ignored and not get their own column, because there are so many distinct values that the dataframe would become too big without this threshold.
I designed the solution below. It works, but is way too slow (due to iterating over rows, I suppose). Could anyone suggest a faster approach?
from collections import defaultdict
import pandas as pd

temp_dict = defaultdict(int)
for k, v in A.items():
    temp_list = v.split(',')
    for item in temp_list:
        temp_dict[item] += 1

cols_to_make = []
for k, v in temp_dict.items():
    if v > MIN_OBS:
        cols_to_make.append('A_' + k)

result_df = pd.DataFrame(0, index=A.index, columns=cols_to_make)
for k, v in A.items():
    temp_list = v.split(',')
    for item in temp_list:
        if ('A_' + item) in cols_to_make:
            result_df['A_' + item][k] = 1
You can use get_dummies to create the indicator variables, then convert the column labels to numbers with to_numeric, and finally filter the columns by the variable TRESH with loc:
print(df)
A
index
1 null
2 5,6
3 3
4 null
5 5,18,22
df = df.A.str.get_dummies(sep=",")
print(df)
18 22 3 5 6 null
index
1 0 0 0 0 0 1
2 0 0 0 1 1 0
3 0 0 1 0 0 0
4 0 0 0 0 0 1
5 1 1 0 1 0 0
df.columns = pd.to_numeric(df.columns, errors='coerce')
df = df.sort_index(axis=1)

TRESH = 5
cols = [col for col in df.columns if col > TRESH]
print(cols)
[6.0, 18.0, 22.0]

df = df.loc[:, cols]
print(df)
6 18 22
index
1 0 0 0
2 1 0 0
3 0 0 0
4 0 0 0
5 0 1 1
df.columns = ["A_" + str(int(col)) for col in df.columns]
print df
A_6 A_18 A_22
index
1 0 0 0
2 1 0 0
3 0 0 0
4 0 0 0
5 0 1 1
EDIT:
I modified unutbu's excellent original answer: changed how the Series is created, dropped the rows whose value is null, and added the prefix parameter to get_dummies:
import numpy as np
import pandas as pd

s = pd.Series(['null', '5,6', '3', 'null', '5,18,22', '3,4'])
print(s)

#result = s.str.split(',').apply(pd.Series).stack()
#replaced by the faster:
result = pd.DataFrame([x.split(',') for x in s]).stack()
count = result.value_counts()

min_obs = 2
#drop values below the threshold as well as the 'null' placeholder
count = count[(count >= min_obs) & ~count.index.isin(['null'])]
result = result.loc[result.isin(count.index)]

#pass prefix to get_dummies
result = pd.get_dummies(result, prefix="A")
result.index = result.index.droplevel(1)
result = result.reindex(s.index)
print(result)
A_3 A_5
0 NaN NaN
1 0 1
2 1 0
3 NaN NaN
4 0 1
5 1 0
Timings:
In [143]: %timeit pd.DataFrame([ x.split(',') for x in s ]).stack()
1000 loops, best of 3: 866 µs per loop
In [144]: %timeit s.str.split(',').apply(pd.Series).stack()
100 loops, best of 3: 2.46 ms per loop
Since memory is an issue, we have to be careful not to build large intermediate
data structures if possible.
Let's start with the OP's posted code that works:
def orig(A, MIN_OBS):
    temp_dict = collections.defaultdict(int)
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    cols_to_make = []
    for k, v in temp_dict.items():
        if v > MIN_OBS:
            cols_to_make.append('A_' + k)
    result_df = pd.DataFrame(0, index=A.index, columns=cols_to_make)
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            if ('A_' + item) in cols_to_make:
                result_df['A_' + item][k] = 1
    return result_df
and extract the first loop into its own function:
def count(A, MIN_OBS):
    temp_dict = collections.Counter()
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    temp_dict = {k: v for k, v in temp_dict.items() if v > MIN_OBS}
    return temp_dict
From experimentation in an interactive session, we can see this is not the bottleneck; even for "large" DataFrames, count(A, MIN_OBS) completes fairly quickly.
The slowness of orig occurs in the double for-loop at the end, which modifies cells in the DataFrame one value at a time (e.g. result_df['A_' + item][k] = 1).
We can replace that double for-loop with a single for-loop over the columns of the DataFrame, using the vectorized string method A.str.contains to search for values in the strings. Since we never split the original strings into Python lists of strings (or Pandas DataFrames holding the string fragments), we save some memory.
Since orig and alt use similar data structures, their memory footprint is about the same.
def alt(A, MIN_OBS):
    temp_dict = count(A, MIN_OBS)
    df = pd.DataFrame(0, index=A.index, columns=list(temp_dict))
    for col in df:
        # anchor the value between separators or string boundaries,
        # so that e.g. col=5 does not match inside the value 56
        df[col] = A.str.contains(r'(?:^|,){v}(?:,|$)'.format(v=col)).astype(int)
    df.columns = ['A_{}'.format(col) for col in df]
    return df
Here is an example, on a 200K row DataFrame with 40K different possible values:
import numpy as np
import pandas as pd
import collections
np.random.seed(2016)
ncols = 5
nrows = 200000
nvals = 40000
MIN_OBS = 200
# nrows = 20
# nvals = 4
# MIN_OBS = 2
idx = np.random.randint(ncols, size=nrows).cumsum()
data = np.random.choice(np.arange(nvals), size=idx[-1])
data = np.array_split(data, idx[:-1])
data = [','.join(map(str, arr)) for arr in data]  # build plain strings like '3,14,7' per row
A = pd.Series(data)
A.loc[A == ''] = 'null'
# orig, count and alt exactly as defined above
Here is a benchmark:
In [48]: %timeit expected = orig(A, MIN_OBS)
1 loops, best of 3: 3.03 s per loop
In [49]: %timeit expected = alt(A, MIN_OBS)
1 loops, best of 3: 483 ms per loop
Note that the majority of the time required for alt to complete is spent in count:
In [60]: %timeit count(A, MIN_OBS)
1 loops, best of 3: 304 ms per loop
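Since count dominates alt's runtime, one possible refinement (a sketch only, not benchmarked here) is to push the token counting into collections.Counter fed by itertools.chain, which avoids the explicit Python-level nested loop:

import collections
import itertools

def count_fast(A, MIN_OBS):
    # count every comma-separated token across the whole Series in one pass
    c = collections.Counter(
        itertools.chain.from_iterable(v.split(',') for v in A)
    )
    return {k: v for k, v in c.items() if v > MIN_OBS}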
Would something like this work or could it be modified to fit your need?
df = pd.DataFrame({'A': ['null', '5,6', '3', 'null', '5,18,22']}, columns=['A'])
A
0 null
1 5,6
2 3
3 null
4 5,18,22
Then use get_dummies()
pd.get_dummies(df['A'].str.split(',').apply(pd.Series), prefix=df.columns[0])
Result:
A_3 A_5 A_null A_18 A_6 A_22
index
1 0 0 1 0 0 0
2 0 1 0 0 1 0
3 1 0 0 0 0 0
4 0 0 1 0 0 0
5 0 1 0 1 0 1
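Note that this does not apply the MIN_OBS threshold, and because each split position is encoded separately, the same value occurring at different positions can produce duplicate prefixed columns. The Series.str.get_dummies call from the first answer sidesteps both quirks in one step (add_prefix here is just a convenience for the A_ labels):

df['A'].str.get_dummies(sep=',').add_prefix('A_')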
If I have the following:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 8)))
tupleList = list(zip('abcdefgh', 'iijjkkll'))
ind = pd.MultiIndex.from_tuples(tupleList)
df.columns = ind
In [71]: df
Out[71]:
a b c d e f g \
i i j j k k l
0 0.968112 0.809183 0.144320 0.518120 0.820079 0.648237 0.971552
1 0.959022 0.721705 0.139588 0.408940 0.230956 0.907192 0.467016
2 0.335085 0.537437 0.725119 0.486447 0.114048 0.150150 0.894322
3 0.051249 0.186547 0.779814 0.905914 0.024298 0.002489 0.339714
h
l
0 0.438330
1 0.225447
2 0.331413
3 0.530789
[4 rows x 8 columns]
what is the easiest way to select the columns that have a second level label of "j" or "k"?
c d e f
j j k k
0 0.948030 0.243993 0.627497 0.729024
1 0.087703 0.874968 0.581875 0.996466
2 0.802155 0.213450 0.375096 0.184569
3 0.164278 0.646088 0.201323 0.022498
I can do this:
df.loc[:, df.columns.get_level_values(1).isin(['j', 'k'])]
But that seems pretty verbose for something that feels like it should be simple. Any better approaches?
See the docs on MultiIndex slicers, introduced in pandas 0.14.0:
In [36]: idx = pd.IndexSlice
In [37]: df.loc[:, idx[:, ['j', 'k']]]
Out[37]:
c d e f
j j k k
0 0.750582 0.877763 0.262696 0.226005
1 0.025902 0.967179 0.125647 0.297304
2 0.463544 0.104973 0.154113 0.284820
3 0.631695 0.841023 0.820907 0.938378
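The same selection works without naming IndexSlice by passing a tuple whose first level is slice(None), and for a single second-level label xs is even shorter; both are standard pandas APIs:

df.loc[:, (slice(None), ['j', 'k'])]  # equivalent to the IndexSlice version
df.xs('k', axis=1, level=1)           # all columns whose second level is 'k'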