Convert dataframe string into multiple dummy variables in Python

I have a dataframe with several columns. One column is "category", whose values are space-separated strings of category IDs. A sample of the df's category column:
3 36 211 433 474 533 690 980
3 36 211
3 16 36 211 396 398 409
3 35 184 590 1038
67 179 208 1008 5000 5237
I have another list of categories, dict = [3, 5, 7, 8, 16, 36, 5000].
What I would like to see is a new data frame with the entries of dict as columns and 0/1 as entries: 1 if the row in df contains the dict entry, else 0. So the output is:
3 5 7 8 16 36 5000
1 0 0 0 0 1 0
1 0 0 0 0 1 0
1 0 0 0 1 1 0
1 0 0 0 0 0 0
0 0 0 0 0 0 1
Have tried something like:
for cat in level_0_cat:
    df[cat] = df.apply(lambda x: int(cat in map(int, x.category.split())), axis=1)
But it does not work for a large dataset (10 million rows). Have also tried isin, but have not figured it out. Any idea is appreciated.

This ought to do it.
# Read your data
>>> s = pd.read_clipboard(sep='|', header=None)
# Convert `cats` to string to make `to_string` approach work below
>>> cats = list(map(str, [3,4,7,8,16,36,5000]))
>>> cats
['3', '4', '7', '8', '16', '36', '5000']
# Nested list comprehension: checks whether each `v` in `cats` exists in each row.
# (Note: `.ix` was removed in modern pandas; see the edit below for an ix-free version.)
>>> encoded = [[1 if v in set(s.ix[idx].to_string().split()) else 0 for idx in s.index] for v in cats]
>>> encoded
[[1, 1, 1, 1, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 1, 0, 0], [1, 1, 1, 0, 0], [0, 0, 0, 0, 1]]
>>> import numpy as np
# Convert the whole thing to a dataframe to add columns
>>> encoded = pd.DataFrame(data=np.matrix(encoded).T, columns=cats)
>>> encoded
3 4 7 8 16 36 5000
0 1 0 0 0 0 1 0
1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0
3 1 0 0 0 0 0 0
4 0 0 0 0 0 0 1
Edit: another way to do this without directly calling any pandas indexing methods like ix or loc.
encoded = [[1 if v in row else 0 for row in s[0].str.split().map(set)] for v in cats]
encoded
Out[18]:
[[1, 1, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[1, 1, 1, 0, 0],
[0, 0, 0, 0, 1]]
encoded = pd.DataFrame(data=np.matrix(encoded).T, columns=cats)
encoded
Out[20]:
3 4 7 8 16 36 5000
0 1 0 0 0 0 1 0
1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0
3 1 0 0 0 0 0 0
4 0 0 0 0 0 0 1

You don't need to convert every line to integers; it's simpler to convert the elements of the list of categories to strings...
categories = [l.strip() for l in '''\
3 36 211 433 474 533 690 980
3 36 211
3 16 36 211 396 398 409
3 35 184 590 1038
67 179 208 1008 5000 5237'''.split('\n')]

d = [str(n) for n in [3, 5, 7, 8, 16, 5000]]

result = []
for category in categories:
    # split() avoids substring false positives, e.g. '3' matching inside '5237'
    result.append([1 if s in category.split() else 0 for s in d])
Please don't use dict to name one of your objects; it shadows the built-in type.
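For the 10-million-row scale mentioned in the question, a vectorized sketch using Series.str.get_dummies may help; this is a suggestion beyond the answers above, so treat it as an assumption to verify on your data:
import pandas as pd

df = pd.DataFrame({'category': ['3 36 211 433 474 533 690 980',
                                '3 36 211',
                                '3 16 36 211 396 398 409',
                                '3 35 184 590 1038',
                                '67 179 208 1008 5000 5237']})
wanted = ['3', '5', '7', '8', '16', '36', '5000']

# get_dummies tokenizes on the separator and one-hot encodes every token;
# reindex keeps only the wanted categories, filling absent ones with 0.
dummies = df['category'].str.get_dummies(sep=' ').reindex(columns=wanted, fill_value=0)
print(dummies)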

Related

Replacing values of pandas data frame to lists using dictionary mapping?

If we have a pandas data frame and a mapping dictionary for the values in the data frame, replacing the values in the data frame using the dictionary as a mapping can be done like so:
In: df
Out:
Col1 Col2
0 a c
1 b c
2 b c
In: key
Out: {'a': 1, 'b': 2, 'c': 3}
In: df.replace(key)
Out:
Col1 Col2
0 1 3
1 2 3
2 2 3
How can a similar transformation be accomplished when the mapping dictionary has lists as values? For example:
In: key
Out: {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 1]}
In: df.replace(key)
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 1 output values where the mask is true
In this example, the end goal would be to have a new data frame that has 3 rows and 6 columns:
1 0 0 0 0 1
0 1 0 0 0 1
0 1 0 0 0 1
IIUC, you can applymap+explode+reshape:
df2 = df.applymap(key.get).explode(list(df.columns))
df2 = (df2
       .set_index(df2.groupby(level=0).cumcount(), append=True)
       .unstack(level=1)
       )
output:
Col1 Col2
0 1 2 0 1 2
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
NB. to reset the columns: df2.columns = range(df2.shape[1])
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
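For reference, a self-contained version of the above, using the df and key from the question (applymap is renamed DataFrame.map in pandas >= 2.1):
import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'b', 'b'], 'Col2': ['c', 'c', 'c']})
key = {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 1]}

# each cell becomes its list, explode spreads the lists over rows, and
# unstack pivots the per-row position counter back into columns
df2 = df.applymap(key.get).explode(list(df.columns))
df2 = (df2
       .set_index(df2.groupby(level=0).cumcount(), append=True)
       .unstack(level=1))
df2.columns = range(df2.shape[1])
print(df2)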
You can use a combination of DataFrame.apply and Series.map to perform this substitution. From there, a row-wise DataFrame.sum concatenates the lists (summing Python lists concatenates them), and the result can be cast back into a new DataFrame:
out = pd.DataFrame(
    df.apply(lambda s: s.map(key)).sum(axis=1).tolist()
)
print(out)
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
Semi-related testing of .sum vs .chain:
In [22]: %timeit tmp_df.sum(axis=1)
77.6 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [23]: %timeit tmp_df.apply(lambda row: list(chain.from_iterable(row)), axis=1)
197 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [24]: tmp_df
Out[24]:
Col1 Col2
0 [1, 0, 0] [0, 0, 1]
1 [0, 1, 0] [0, 0, 1]
2 [0, 1, 0] [0, 0, 1]
While I won't say that .sum is the best method for concatenating lists in a Series, .apply & chain.from_iterable doesn't seem to fare much better, at least on a very small sample like this.
Hmm, this is tricky.
One solution I came up with is to convert the lists to their string representation before replacing with them, because pandas treats lists specially. Then you can use itertools.chain.from_iterable on each row to combine all the lists into one big list, and create a dataframe out of that:
import ast
from itertools import chain
n = df.replace({k: str(v) for k, v in key.items()}).applymap(ast.literal_eval)
df = pd.DataFrame(n.apply(lambda x: list(chain.from_iterable(x)), axis=1).tolist())
Output:
>>> df
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
Here's a method of replacing the items with lists without looping or stringifying:
df[:] = pd.Series(key)[df.to_numpy().flatten()].to_numpy().reshape(df.shape)
Output:
>>> df
Col1 Col2
0 [1, 0, 0] [0, 0, 1]
1 [0, 1, 0] [0, 0, 1]
2 [0, 1, 0] [0, 0, 1]
Or, you can use explode and reshape to convert the data directly to a numpy array:
arr = pd.Series(key)[df.to_numpy().flatten()].explode().to_numpy().reshape(-1, 6) # 6 = len of one of the items of `key` * number of columns in df
Output:
>>> arr
array([[1, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 1]], dtype=object)
>>> pd.DataFrame(arr)
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1

Add row to pandas data frame causes prediction failure

What is the best way to add a row to training data?
import numpy as np
import pandas as pd
# Features=x / Labels=y
new_train1 = pd.DataFrame({'A': [1,2,3,3,4,4],
                           'B': [4,5,6,6,4,3],
                           'C': ['a','b','c','ddd','c','ddd']})
new_train2 = pd.DataFrame({'A': [1],
                           'B': [4],
                           'C': ['a']})
# Add new_train2's row to new_train1.
Maybe this would work:
new_train1 = new_train1.append(new_train2)
new_train1 = new_train1.reset_index(drop=True)
Finally, the data is split into features and labels.
new_train_x = new_train1.iloc[:, 0:2] # Cols A and B (note: 0:1 would keep only A)
new_train_y = new_train1['C']
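Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a self-contained sketch of the same steps with pd.concat:
import pandas as pd

new_train1 = pd.DataFrame({'A': [1, 2, 3, 3, 4, 4],
                           'B': [4, 5, 6, 6, 4, 3],
                           'C': ['a', 'b', 'c', 'ddd', 'c', 'ddd']})
new_train2 = pd.DataFrame({'A': [1], 'B': [4], 'C': ['a']})

# pd.concat replaces the removed DataFrame.append; ignore_index renumbers rows
new_train1 = pd.concat([new_train1, new_train2], ignore_index=True)

new_train_x = new_train1[['A', 'B']]  # features
new_train_y = new_train1['C']         # labels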
EDIT: Notably, after attempting this process (adding a row), here is the confusion matrix (numbers are from the real data set, not the sample above):
[[336 0 7 0 3 0]
[ 23 8 358 0 0 3]
[ 0 0 373 1 0 0]
[ 0 0 0 281 30 25]
[ 0 0 0 14 220 33]
[ 0 0 0 6 14 265]]
Whereas prior to adding the row (and likewise when dropping one row, tried multiple times), here is the typical confusion matrix (again using numbers from the real data, not the sample above):
[[343 0 0 0 3 0]
[ 2 349 39 0 0 2]
[ 0 52 322 0 0 0]
[ 0 0 0 330 3 3]
[ 0 0 0 3 261 3]
[ 0 0 0 2 1 282]]
And here is the confusion matrix before adding or removing any data points:
[[343 0 0 0 3 0]
[ 3 355 31 0 0 3]
[ 0 30 344 0 0 0]
[ 0 0 0 331 1 4]
[ 0 0 0 1 261 5]
[ 0 0 0 3 4 278]]

python pandas - .replace - nothing happens

I have a df called all_data and within it a column called 'Neighborhood'. all_data["Neighborhood"].head() looks like
0 CollgCr
1 Veenker
2 CollgCr
3 Crawfor
4 NoRidge
I want to replace certain neighborhood names with 0, and others with 1 to get
0 1
1 1
2 1
3 0
4 1
So I did this:
all_data["Neighb_Good"] = all_data["Neighborhood"].copy().replace({'Neighborhood': {'StoneBr': 1, 'NrdigHt': 1,
'Veenker': 1, 'Somerst': 1,
'Timber': 1, 'CollgCr': 1,
'Blmngtn': 1, 'NoRidge': 1,
'Mitchel': 1, 'ClearCr': 1,
'ClearCr': 0, 'Crawfor': 0,
'SawyerW': 0, 'Gilbert': 0,
'Sawyer': 0, 'NPkVill': 0,
'NAmes': 0, 'NWAmes': 0,
'BrkSide': 0, 'MeadowV': 0,
'Edwards': 0, 'Blueste': 0,
'BrDale': 0, 'OldTown': 0,
'IDOTRR': 0, 'SWISU': 0,
}})
It doesn't give me an error, but nothing happens. Instead, all_data["Neighb_Good"] looks exactly like all_data["Neighborhood"].
I've been trying to figure it out for a while now and I swear I can't see what's the matter because I've used this same method yesterday on some other columns and it worked perfectly.
UPDATE: you seem to need Series.map():
In [196]: df['Neighb_Good'] = df['Neighborhood'].map(d['Neighborhood'])
In [197]: df
Out[197]:
Neighborhood Neighb_Good
0 CollgCr 1
1 Veenker 1
2 CollgCr 1
3 Crawfor 0
4 NoRidge 1
using data set from comments:
In [201]: df["ExterQual_Good"] = df["ExterQual"].map(d)
In [202]: df
Out[202]:
ExterQual ExterQual_Good
0 TA 1
1 Fa 0
2 Gd 1
3 Po 0
4 Ex 1
Old answer:
Use DataFrame.replace() instead of Series.replace() if you have a nested dict, containing column names:
In [81]: df['Neighb_Good'] = df.replace(d)
In [82]: df
Out[82]:
Neighborhood Neighb_Good
0 CollgCr 1
1 Veenker 1
2 CollgCr 1
3 Crawfor 0
4 NoRidge 1
or use Series.replace() with a flat (not nested dict):
In [85]: df['Neighb_Good'] = df['Neighborhood'].replace(d['Neighborhood'])
In [86]: df
Out[86]:
Neighborhood Neighb_Good
0 CollgCr 1
1 Veenker 1
2 CollgCr 1
3 Crawfor 0
4 NoRidge 1
How about Series.replace on a single column? Given data1 like
    A   B   C
P   0   1   2
Q   3   4   5
R   6   7   8
S   9  10  11
T  12  13  14
U  15  16  17
you can do
data1.A.replace({0: "A", ...and so on})

Pandas Dataframe: Split a column into multiple columns

I have a pandas dataframe with the following column names (column type: object):
1. x_id
2. y_id
3. Sentence1
4. Sentences2
5. Label
I want to split Sentence1 and Sentence2 into multiple columns in the same dataframe.
Here is an example: dataframe names as df
x_id y_id Sentence1 Sentence2 Label
0 2 This is a ball I hate you 0
1 5 I am a boy Ahmed Ali 1
2 1 Apple is red Rose is red 1
3 9 I love you so much Me too 1
After splitting the columns [Sentence1, Sentence2] on spaces, the dataframe looks like:
x_id y_id 1 2 3 4 5 6 7 8 Label
0 2 This is a ball NONE I hate you 0
1 5 I am a boy NONE Ahmed Ali NONE 1
2 1 Apple is red NONE NONE Rose is red 1
3 9 I love you so much Me too NONE 1
How to split the columns like this in Python, using a pandas dataframe?
In [26]: x = pd.concat([df.pop('Sentence1').str.split(expand=True),
...: df.pop('Sentence2').str.split(expand=True)],
...: axis=1)
...:
In [27]: x.columns = np.arange(1, x.shape[1]+1)
In [28]: x
Out[28]:
1 2 3 4 5 6 7 8
0 This is a ball None I hate you
1 I am a boy None Ahmed Ali None
2 Apple is red None None Rose is red
3 I love you so much Me too None
In [29]: df = df.join(x)
In [30]: df
Out[30]:
x_id y_id Label 1 2 3 4 5 6 7 8
0 0 2 0 This is a ball None I hate you
1 1 5 1 I am a boy None Ahmed Ali None
2 2 1 1 Apple is red None None Rose is red
3 3 9 1 I love you so much Me too None
One-hot-encoding labeling solution:
In [14]: df.Sentence1 += ' ' + df.pop('Sentence2')
In [15]: df
Out[15]:
x_id y_id Sentence1 Label
0 0 2 This is a ball I hate you 0
1 1 5 I am a boy Ahmed Ali 1
2 2 1 Apple is red Rose is red 1
3 3 9 I love you so much Me too 1
In [16]: from sklearn.feature_extraction.text import CountVectorizer
In [17]: vect = CountVectorizer()
In [18]: X = vect.fit_transform(df.Sentence1.fillna(''))
X is a sparse (memory-saving) matrix:
In [23]: X
Out[23]:
<4x17 sparse matrix of type '<class 'numpy.int64'>'
with 19 stored elements in Compressed Sparse Row format>
In [24]: type(X)
Out[24]: scipy.sparse.csr.csr_matrix
In [19]: X.toarray()
Out[19]:
array([[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
[1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)
Most sklearn methods accept sparse matrices.
If you want to "unpack" it:
In [21]: r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
In [22]: r
Out[22]:
ahmed ali am apple ball boy hate is love me much red rose so this too you
0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1
1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 1 0 0 0 2 0 0 0 2 1 0 0 0 0
3 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1
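If you want to keep the memory savings inside pandas, a short sketch (assuming pandas >= 0.25 and scikit-learn >= 1.0, and continuing from the X and vect objects above):
# builds a DataFrame backed by sparse columns, without densifying X first
r = pd.DataFrame.sparse.from_spmatrix(X, columns=vect.get_feature_names_out())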
Here is how to do it for the sentences in the column Sentence1. The idea is identical for the Sentence2 column.
splits = df.Sentence1.str.split(' ')
longest = splits.apply(len).max()
Note that longest is the length of the longest sentence. Now make the Null columns:
import numpy as np

for j in range(1, longest + 1):
    df[str(j)] = np.nan
And finally, go through the split values and assign them:
for i, words in enumerate(splits.values):
    for k in range(1, longest + 1):
        try:
            # words[k-1] is the k-th word; positions past the sentence length are skipped
            df.loc[i, str(k)] = words[k - 1]
        except IndexError:
            pass
It looks like a machine learning problem, and converting from one column to max-words columns this way may not be efficient.
Another (probably more efficient) solution is converting each word to an integer and then padding to the longest sentence. TensorFlow has tools for that.
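As a minimal sketch of that integer-encode-and-pad idea, assuming TensorFlow 2.x with the legacy keras.preprocessing API (the texts list is illustrative):
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ['This is a ball', 'I am a boy', 'Apple is red', 'I love you so much']

tok = Tokenizer()
tok.fit_on_texts(texts)                       # build the word -> integer vocabulary
seqs = tok.texts_to_sequences(texts)          # each sentence becomes a list of ints
padded = pad_sequences(seqs, padding='post')  # zero-pad to the longest sentence
print(padded)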

pandas DataFrame set non-contiguous sections

I have a DataFrame like below and would like for B to be 1 for n rows after the 1 in column A (where below n = 2)
index A B
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
7 0 0
8 1 0
9 0 1
I think I can do it using .ix similar to this example, but I'm not sure how. I'd like to do it in a single pandas-style selection command if possible. (Ideally not using rolling_apply.)
Modifying a subset of rows in a pandas dataframe
EDIT: the application is that the 1 in column A is "ignored" if it falls within n rows of the previous 1. As per the comments, for n = 2, these examples:
A = [1, 0, 1, 0, 1], B should be = [0, 1, 1, 0, 0]
A = [1, 1, 0, 0], B should be [0, 1, 1, 0]
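As a baseline that reproduces the stated examples, a plain-loop sketch (mark_after is an illustrative name, not from the thread; a vectorized pandas version would still need to encode the "ignore within n rows" rule):
import pandas as pd

def mark_after(a, n):
    # B[i] = 1 for the n rows after each counted 1 in A; a 1 that falls
    # within n rows of the previously counted 1 is ignored.
    b = [0] * len(a)
    last = -(n + 1)  # index of the last counted 1
    for i, v in enumerate(a):
        if v == 1 and i - last > n:
            last = i
            for j in range(i + 1, min(i + n + 1, len(a))):
                b[j] = 1
    return b

# The two examples from the edit:
assert mark_after([1, 0, 1, 0, 1], 2) == [0, 1, 1, 0, 0]
assert mark_after([1, 1, 0, 0], 2) == [0, 1, 1, 0]

df = pd.DataFrame({'A': [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]})
df['B'] = mark_after(df['A'].tolist(), 2)  # reproduces the table above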
