Pandas Dataframe: Split a column into multiple columns - python

I have a pandas dataframe with the following column names (all of dtype object):
1. x_id
2. y_id
3. Sentence1
4. Sentence2
5. Label
I want to split Sentence1 and Sentence2 into multiple columns in the same dataframe.
Here is an example (dataframe named df):
x_id y_id Sentence1 Sentence2 Label
0 2 This is a ball I hate you 0
1 5 I am a boy Ahmed Ali 1
2 1 Apple is red Rose is red 1
3 9 I love you so much Me too 1
After splitting the columns [Sentence1, Sentence2] on ' ' (space), the dataframe looks like:
x_id y_id 1 2 3 4 5 6 7 8 Label
0 2 This is a ball NONE I hate you 0
1 5 I am a boy NONE Ahmed Ali NONE 1
2 1 Apple is red NONE NONE Rose is red 1
3 9 I love you so much Me too NONE 1
How can I split the columns like this in Python using a pandas dataframe?

In [26]: x = pd.concat([df.pop('Sentence1').str.split(expand=True),
    ...:                df.pop('Sentence2').str.split(expand=True)],
    ...:               axis=1)
In [27]: x.columns = np.arange(1, x.shape[1]+1)
In [28]: x
Out[28]:
1 2 3 4 5 6 7 8
0 This is a ball None I hate you
1 I am a boy None Ahmed Ali None
2 Apple is red None None Rose is red
3 I love you so much Me too None
In [29]: df = df.join(x)
In [30]: df
Out[30]:
x_id y_id Label 1 2 3 4 5 6 7 8
0 0 2 0 This is a ball None I hate you
1 1 5 1 I am a boy None Ahmed Ali None
2 2 1 1 Apple is red None None Rose is red
3 3 9 1 I love you so much Me too None
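Note that pop() removes the two sentence columns from df as they are split, so the final join() cannot collide with the originals, and str.split(expand=True) pads shorter sentences with None, which is what produces the NONE cells in the desired output.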

One-hot-encoding solution:
In [14]: df.Sentence1 += ' ' + df.pop('Sentence2')
In [15]: df
Out[15]:
x_id y_id Sentence1 Label
0 0 2 This is a ball I hate you 0
1 1 5 I am a boy Ahmed Ali 1
2 2 1 Apple is red Rose is red 1
3 3 9 I love you so much Me too 1
In [16]: from sklearn.feature_extraction.text import CountVectorizer
In [17]: vect = CountVectorizer()
In [18]: X = vect.fit_transform(df.Sentence1.fillna(''))
X is a sparse (memory-saving) matrix:
In [23]: X
Out[23]:
<4x17 sparse matrix of type '<class 'numpy.int64'>'
with 19 stored elements in Compressed Sparse Row format>
In [24]: type(X)
Out[24]: scipy.sparse.csr.csr_matrix
In [19]: X.toarray()
Out[19]:
array([[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
[1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)
Most sklearn methods accept sparse matrices.
If you want to "unpack" it:
In [21]: r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
In [22]: r
Out[22]:
ahmed ali am apple ball boy hate is love me much red rose so this too you
0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1
1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 1 0 0 0 2 0 0 0 2 1 0 0 0 0
3 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1

Here is how to do it for the sentences in the column Sentence1. The idea is identical for the Sentence2 column.
splits = df.Sentence1.str.split(' ')
longest = splits.apply(len).max()
Note that longest is the length of the longest sentence. Now create the empty columns:
for j in range(1, longest + 1):
    df[str(j)] = np.nan
And finally, go through the split values and assign them row by row:
for i, words in splits.items():
    for k, word in enumerate(words, start=1):
        df.loc[i, str(k)] = word

This looks like a machine learning problem, and converting one column into max-words columns this way may not be efficient.
Another (probably more efficient) solution is to convert each word to an integer and then pad to the longest sentence. TensorFlow has tools for that.
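For example, here is a minimal sketch using the Keras preprocessing utilities bundled with TensorFlow (an assumption; any tokenizer would do), with column names as in the question:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = (df['Sentence1'] + ' ' + df['Sentence2']).tolist()

tokenizer = Tokenizer()              # assigns each distinct word an integer id
tokenizer.fit_on_texts(texts)
seqs = tokenizer.texts_to_sequences(texts)

# Zero-pad every sequence to the length of the longest sentence
padded = pad_sequences(seqs, padding='post')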

Related

Dataframe column: to find local maxima

In the dataframe below, the column "CumRetperTrade" consists of a few vertical vectors (sequences of numbers) separated by zeros (these vectors correspond to the non-zero elements of column "Portfolio").
I would like to find the local maximum of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorized or other methods) the column "CumRetperTrade" into the column "PeakCumRet" (desired result), which holds, for each vector in "CumRetperTrade", its local max value. The numeric example is below. Thanks in advance.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Portfolio":      [1, 1, 1, 0, 0, 0, 1, 1, 1],
                    "CumRetperTrade": [3, 2, 1, 0, 0, 0, 4, 2, 1],
                    "PeakCumRet":     [3, 3, 3, 0, 0, 0, 4, 4, 4]})
df1
Portfolio CumRetperTrade PeakCumRet
1 3 3
1 2 3
1 1 3
0 0 0
0 0 0
0 0 0
1 4 4
1 2 4
1 1 4
You can use:
df1['PeakCumRet'] = (df1.groupby(df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum())
                        ['CumRetperTrade'].transform('max'))
Output:
Portfolio CumRetperTrade PeakCumRet
0 1 3 3
1 1 2 3
2 1 1 3
3 0 0 0
4 0 0 0
5 0 0 0
6 1 4 4
7 1 2 4
8 1 1 4
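For reference, df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum() builds a group id that increments every time the Portfolio value changes, so each consecutive run becomes its own group:
df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum().tolist()
# [1, 1, 1, 2, 2, 2, 3, 3, 3]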

python pandas - .replace - nothing happens

I have a df called all_data and within it a column called 'Neighborhood'. all_data["Neighborhood"].head() looks like
0 CollgCr
1 Veenker
2 CollgCr
3 Crawfor
4 NoRidge
I want to replace certain neighborhood names with 0, and others with 1 to get
0 1
1 1
2 1
3 0
4 1
So I did this:
all_data["Neighb_Good"] = all_data["Neighborhood"].copy().replace({'Neighborhood': {'StoneBr': 1, 'NrdigHt': 1,
'Veenker': 1, 'Somerst': 1,
'Timber': 1, 'CollgCr': 1,
'Blmngtn': 1, 'NoRidge': 1,
'Mitchel': 1, 'ClearCr': 1,
'ClearCr': 0, 'Crawfor': 0,
'SawyerW': 0, 'Gilbert': 0,
'Sawyer': 0, 'NPkVill': 0,
'NAmes': 0, 'NWAmes': 0,
'BrkSide': 0, 'MeadowV': 0,
'Edwards': 0, 'Blueste': 0,
'BrDale': 0, 'OldTown': 0,
'IDOTRR': 0, 'SWISU': 0,
}})
It doesn't give me an error, but nothing happens: all_data["Neighb_Good"] looks exactly like all_data["Neighborhood"].
I've been trying to figure it out for a while now and I swear I can't see what's the matter, because I used this same method yesterday on some other columns and it worked perfectly.
UPDATE: you seem to need Series.map():
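(Here d is assumed to be the nested dict from the question, abridged:)
d = {'Neighborhood': {'StoneBr': 1, 'NrdigHt': 1, 'Veenker': 1, 'Somerst': 1,
                      'CollgCr': 1, 'NoRidge': 1,    # ... rest of the 1s
                      'Crawfor': 0, 'SWISU': 0}}     # ... rest of the 0s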
In [196]: df['Neighb_Good'] = df['Neighborhood'].map(d['Neighborhood'])
In [197]: df
Out[197]:
Neighborhood Neighb_Good
0 CollgCr 1
1 Veenker 1
2 CollgCr 1
3 Crawfor 0
4 NoRidge 1
using the data set from the comments:
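(where d is presumably re-bound to a flat mapping, inferred from the output below:)
d = {'TA': 1, 'Fa': 0, 'Gd': 1, 'Po': 0, 'Ex': 1}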
In [201]: df["ExterQual_Good"] = df["ExterQual"].map(d)
In [202]: df
Out[202]:
ExterQual ExterQual_Good
0 TA 1
1 Fa 0
2 Gd 1
3 Po 0
4 Ex 1
Old answer:
Use DataFrame.replace() instead of Series.replace() if you have a nested dict containing column names:
In [81]: df['Neighb_Good'] = df.replace(d)['Neighborhood']
In [82]: df
Out[82]:
Neighborhood Neighb_Good
0 CollgCr 1
1 Veenker 1
2 CollgCr 1
3 Crawfor 0
4 NoRidge 1
or use Series.replace() with a flat (not nested) dict:
In [85]: df['Neighb_Good'] = df['Neighborhood'].replace(d['Neighborhood'])
In [86]: df
Out[86]:
Neighborhood Neighb_Good
0 CollgCr 1
1 Veenker 1
2 CollgCr 1
3 Crawfor 0
4 NoRidge 1
How about using Series.replace() with a flat dict directly? Given a frame like
   A   B   C
P  0   1   2
Q  3   4   5
R  6   7   8
S  9  10  11
T 12  13  14
U 15  16  17
data1.A.replace({0: "A"})  # ...and so on
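A minimal runnable sketch of that idea (the replacement values are illustrative assumptions):
import pandas as pd

data1 = pd.DataFrame({'A': [0, 3, 6, 9, 12, 15],
                      'B': [1, 4, 7, 10, 13, 16],
                      'C': [2, 5, 8, 11, 14, 17]},
                     index=list('PQRSTU'))

# A flat (non-nested) dict works directly on a Series
data1['A'] = data1['A'].replace({0: 'A', 3: 'B', 6: 'C'})  # hypothetical mapping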

Add values for matching column and row names

Quick question that I'm brain-farting on how best to implement. I am generating a matrix to add up how many times two items are found next to each other in a list, across a large number of permutations of this list. My code looks something like this:
agreement_matrix = pandas.DataFrame(0, index=names, columns=names)
for lst in bunch_of_lists:
    for i in range(len(lst) - 1):
        agreement_matrix[lst[i]][lst[i + 1]] += 1
It generates an array like:
A B C D
A 0 2 1 1
B 2 0 1 1
C 1 1 0 2
D 1 1 2 0
And because I don't care about order, I want to add the values up so it's like this:
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
Is there any fast/simple way to achieve this? I've been toying around with both doing it after generation and trying to do it as I add values.
Use np.triu/np.tril:
np.triu(df) + np.tril(df).T
array([[0, 4, 2, 2],
[0, 0, 2, 2],
[0, 0, 0, 4],
[0, 0, 0, 0]])
Then call the DataFrame constructor:
pd.DataFrame(np.triu(df) + np.tril(df).T, df.index, df.columns)
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
Because the matrix is symmetric, df.values.T + df.values is just df.values * 2, so:
np.triu(df.values * 2)  # same as np.triu(df.values.T + df.values)
Out[595]:
array([[0, 4, 2, 2],
[0, 0, 2, 2],
[0, 0, 0, 4],
[0, 0, 0, 0]], dtype=int64)
Then you do
pd.DataFrame(np.triu(df.values*2), df.index, df.columns)
Out[600]:
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
A pandas solution to avoid the first loop:
values = ['ABCD'[i] for i in np.random.randint(0, 4, 100)]  # sample data
df = pd.DataFrame(values)
df[1] = df[0].shift()   # pair each item with its predecessor
df = df.iloc[1:]        # drop the first row (no predecessor)
df.values.sort(axis=1)  # put each pair in a canonical order
df[2] = 1               # counter column
res = df.pivot_table(2, 0, 1, np.sum, 0)
#
#1 A B C D
#0
#A 2 14 11 16
#B 0 5 9 13
#C 0 0 10 17
#D 0 0 0 2
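The df.values.sort(axis=1) step is what makes the count order-insensitive: each (item, predecessor) pair is put into a canonical order before pivoting, so (A, B) and (B, A) accumulate in the same cell of the pivot table.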

Convert dataframe string into multiple dummy variables in Python

I have a dataframe with several columns. One column is "category", which is a space-separated string. A sample of the df's category column is:
3 36 211 433 474 533 690 980
3 36 211
3 16 36 211 396 398 409
3 35 184 590 1038
67 179 208 1008 5000 5237
I have another list of categories, dict = [3,5,7,8,16,5000].
What I would like to see is a new data frame with dict as columns and 0/1 as entries: if a row in df contains the dict entry, it's 1, else 0. So the output is:
3 5 7 8 16 36 5000
1 0 0 0 0 1 0
1 0 0 0 0 1 0
1 0 0 0 1 1 0
1 0 0 0 0 0 0
0 0 0 0 0 0 1
Have tried something like:
for cat in level_0_cat:
df[cat] = df.apply(lambda x: int(cat in map(int, x.category)), axis = 1)
But it does not work for large dataset (10 million rows). Have also tried isin, but have not figured out. Any idea is appreciated.
This ought to do it.
# Read your data
>>> s = pd.read_clipboard(sep='|', header=None)
# Convert `cats` to string to make `to_string` approach work below
>>> cats = list(map(str, [3,4,7,8,16,36,5000]))
>>> cats
['3', '4', '7', '8', '16', '36', '5000']
# Nested list comprehension... checks whether each v in cats exists in each row
>>> encoded = [[1 if v in set(s.ix[idx].to_string().split()) else 0 for idx in s.index] for v in cats]
>>> encoded
[[1, 1, 1, 1, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 1, 0, 0], [1, 1, 1, 0, 0], [0, 0, 0, 0, 1]]
>>> import numpy as np
# Convert the whole thing to a dataframe to add columns
>>> encoded = pd.DataFrame(data=np.matrix(encoded).T, columns=cats)
>>> encoded
3 4 7 8 16 36 5000
0 1 0 0 0 0 1 0
1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0
3 1 0 0 0 0 0 0
4 0 0 0 0 0 0 1
Edit: here is a way to do this without directly calling any pandas indexing methods like ix or loc.
encoded = [[1 if v in row else 0 for row in s[0].str.split().map(set)] for v in cats]
encoded
Out[18]:
[[1, 1, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[1, 1, 1, 0, 0],
[0, 0, 0, 0, 1]]
encoded = pd.DataFrame(data=np.matrix(encoded).T, columns=cats)
encoded
Out[20]:
3 4 7 8 16 36 5000
0 1 0 0 0 0 1 0
1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0
3 1 0 0 0 0 0 0
4 0 0 0 0 0 0 1
You don't need to convert every line to integers; it's simpler to convert the elements of the category list to strings...
categories = [l.strip() for l in '''\
3 36 211 433 474 533 690 980
3 36 211
3 16 36 211 396 398 409
3 35 184 590 1038
67 179 208 1008 5000 5237'''.split('\n')]
cats = [3, 5, 7, 8, 16, 5000]
d = [str(n) for n in cats]
result = []
for category in categories:
    result.append([1 if s in category.split() else 0 for s in d])
Please don't use dict (that is a builtin) to name one of your objects.
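As an aside (not from the answers above), pandas can also build the dummy matrix directly with Series.str.get_dummies() and then reindex to the wanted category list; a short sketch, assuming the column is named category:
wanted = ['3', '5', '7', '8', '16', '5000']
dummies = df['category'].str.get_dummies(sep=' ')       # one 0/1 column per distinct token
result = dummies.reindex(columns=wanted, fill_value=0)  # keep only the wanted categories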

pandas DataFrame set non-contiguous sections

I have a DataFrame like the one below and would like B to be 1 for n rows after a 1 appears in column A (below, n = 2):
index A B
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
7 0 0
8 1 0
9 0 1
I think I can do it using .ix, similar to this example, but I'm not sure how. I'd like to do it with a single pandas-style selection command if possible. (Ideally not using rolling_apply.)
Modifying a subset of rows in a pandas dataframe
EDIT: the application is that a 1 in column A is "ignored" if it falls within n rows of the previous 1. As per the comments, for n = 2 these examples hold:
A = [1, 0, 1, 0, 1], B should be [0, 1, 1, 0, 0]
A = [1, 1, 0, 0], B should be [0, 1, 1, 0]
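No answer is shown for this question here, but a plain (non-vectorized) sketch of the "ignore 1s within n rows of the previous counted 1" logic might look like:
import numpy as np

def mark_after_ones(A, n=2):
    # B gets 1 for the n rows after each counted 1 in A;
    # a 1 within n rows of the previous counted 1 is ignored.
    B = np.zeros(len(A), dtype=int)
    last = -(n + 1)  # index of the last counted 1
    for i, a in enumerate(A):
        if a == 1 and i - last > n:
            last = i
            B[i + 1:i + 1 + n] = 1
    return B

mark_after_ones([1, 0, 1, 0, 1])  # array([0, 1, 1, 0, 0])
mark_after_ones([1, 1, 0, 0])     # array([0, 1, 1, 0])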
