I have a dataframe with results from a survey where there were options A-E and it was possible to select more than one option, so a selection could be 'A' or 'A;C;D', etc.
I will be using the data for some machine learning and am looking to run it through OneHotEncoder to end up with five columns (A-E) of 1s and 0s.
An example of my initial survey data is:
survey_data = pd.DataFrame({'Q1': ['A','B','C','A;D', 'D;E', 'F']})
I initially tried LabelEncoder, but since it treats each distinct string such as 'A;D' as its own label, I ended up with a lot of features (rather than just A-E).
You can also use MultiLabelBinarizer for this:
from sklearn.preprocessing import MultiLabelBinarizer

# split each response on ';' so every row becomes a list of selected options
inputX = [element.split(';') for element in survey_data['Q1']]

mlb = MultiLabelBinarizer()
transformedX = mlb.fit_transform(inputX)

transformedX
# array([[1, 0, 0, 0, 0, 0],
#        [0, 1, 0, 0, 0, 0],
#        [0, 0, 1, 0, 0, 0],
#        [1, 0, 0, 1, 0, 0],
#        [0, 0, 0, 1, 1, 0],
#        [0, 0, 0, 0, 0, 1]])
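If you need the result back as a DataFrame with the options as column names, the learned labels are available in mlb.classes_; a minimal sketch:

import pandas as pd
encoded = pd.DataFrame(transformedX, columns=mlb.classes_, index=survey_data.index)
print(encoded)
#    A  B  C  D  E  F
# 0  1  0  0  0  0  0
# 1  0  1  0  0  0  0
# ...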
Here's one approach, using get_dummies:
import pandas as pd
# example data provided by OP
survey_data = pd.DataFrame({'Q1': ['A','B','C','A;D', 'D;E', 'F']})
# split multi-select responses into separate columns (one per chosen option)
tmp = survey_data.Q1.str.split(';').apply(pd.Series)
# one-hot encode each split column with get_dummies, then overlay into one df
# (note: this chain handles at most two selections per response; see the
# str.get_dummies alternative below for the general case)
df = (pd.get_dummies(tmp[0])
      .add(pd.get_dummies(tmp[1]), fill_value=0)
      .astype(int))
print(df)
   A  B  C  D  E  F
0  1  0  0  0  0  0
1  0  1  0  0  0  0
2  0  0  1  0  0  0
3  1  0  0  1  0  0
4  0  0  0  1  1  0
5  0  0  0  0  0  1
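For completeness: pandas can do the split and the encoding in one step with Series.str.get_dummies, which handles any number of selections per response (the .add chain above only covers two):

df = survey_data['Q1'].str.get_dummies(sep=';')
print(df)
#    A  B  C  D  E  F
# 0  1  0  0  0  0  0
# 1  0  1  0  0  0  0
# 2  0  0  1  0  0  0
# 3  1  0  0  1  0  0
# 4  0  0  0  1  1  0
# 5  0  0  0  0  0  1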
Related question:
Suppose I have the following list of labels,
labs = ['G1','G2','G3','G4','G5','G6','G7']
and also suppose that I have the following df:
   group entity_label
0      0           G1
1      0           G2
3      1           G5
4      1           G1
5      2           G1
6      2           G2
7      2           G3
To produce the above df you can use:
df_test = pd.DataFrame({'group': [0,0,0,1,1,2,2,2,2],
                        'entity_label': ['G1','G2','G2','G5','G1','G1','G2','G3','G3']})
df_test = df_test.drop_duplicates(subset=['group','entity_label'], keep='first')
For each group I want to use a mapping over the labels to make a new dataframe with binary label vectors:
   group   entity_label_binary
0      0  [1, 1, 0, 0, 0, 0, 0]
1      1  [1, 0, 0, 0, 1, 0, 0]
2      2  [1, 1, 1, 0, 0, 0, 0]
Namely, for group 0 we have G1 and G2, hence the 1s in the table above, and so on. How can one do this?
One option, based on crosstab: count label occurrences per group, clip the counts to 1 to make them binary, and reindex to add the missing labels as zero columns:
labs = ['G1','G2','G3','G4','G5','G6','G7']
(pd.crosstab(df_test['group'], df_test['entity_label'])
   .clip(upper=1)
   .reindex(columns=labs, fill_value=0)
   .agg(list, axis=1)
   .reset_index(name='entity_label_binary')
)
Variant, with get_dummies and groupby.max:
(pd.get_dummies(df_test['entity_label'])
   .groupby(df_test['group']).max()
   .reindex(columns=labs, fill_value=0)
   .agg(list, axis=1)
   .reset_index(name='entity_label_binary')
)
Output:
   group   entity_label_binary
0      0  [1, 1, 0, 0, 0, 0, 0]
1      1  [1, 0, 0, 0, 1, 0, 0]
2      2  [1, 1, 1, 0, 0, 0, 0]
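If downstream code needs one indicator column per label rather than a list column, you can simply stop before the agg(list) step; a small sketch based on the second variant:

binary_wide = (pd.get_dummies(df_test['entity_label'])
               .groupby(df_test['group']).max()
               .reindex(columns=labs, fill_value=0)
               .astype(int))
#        G1  G2  G3  G4  G5  G6  G7
# group
# 0       1   1   0   0   0   0   0
# 1       1   0   0   0   1   0   0
# 2       1   1   1   0   0   0   0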
I use DBSCAN clustering for text documents as follows, thanks to this post.
db = DBSCAN(eps=0.3, min_samples=2).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
Now I want to see which document belongs to which cluster, like:
[I have a car and it is blue] belongs to cluster0
or
idx [112] belongs to cluster0
A similar question was asked here, and I have already tested some of the answers provided there, such as:
X[labels == 1,:]
and I got :
array([[0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]], dtype=int64)
but this does not help me. Please let me know if you have any suggestions or ways to do it.
If you have a pandas dataframe df with columns idx and messages, then all you have to do is
df['cluster'] = db.labels_
in order to get a new column cluster with the cluster membership.
Here is a short demo with dummy data:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
X = np.array([[1, 2], [5, 8], [2, 3],
              [8, 7], [8, 8], [2, 2]])
db = DBSCAN(eps=3, min_samples=2).fit(X)
db.labels_
# array([0, 1, 0, 1, 1, 0], dtype=int64)
# convert our numpy array to pandas:
df = pd.DataFrame({'Column1':X[:,0],'Column2':X[:,1]})
print(df)
# result:
   Column1  Column2
0        1        2
1        5        8
2        2        3
3        8        7
4        8        8
5        2        2
# add new column with the belonging cluster:
df['cluster'] = db.labels_
print(df)
# result:
   Column1  Column2  cluster
0        1        2        0
1        5        8        1
2        2        3        0
3        8        7        1
4        8        8        1
5        2        2        0
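With the cluster column in place, seeing which documents belong to which cluster is a simple boolean filter, e.g.:

# all rows assigned to cluster 0
print(df[df['cluster'] == 0])
#    Column1  Column2  cluster
# 0        1        2        0
# 2        2        3        0
# 5        2        2        0
Note that DBSCAN labels noise points -1, so df[df['cluster'] == -1] would show any unclustered rows.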
I have a DataFrame of temperature with 1000s of rows (time series data) and 40 columns (40 points in a catchment). Entries in this DataFrame are zeros and ones (1 means an active part of the catchment and 0 means a non-active part). I want to place the number of intersected values in a separate column (named inter) in the same DataFrame.
I expect the output in this way [attached image]:
The value in the first row of inter should be zero, as all entries are zero and no part is active on day one.
The value in the 2nd row of inter should be 4, as four parts are active on day 2.
The value in the 3rd row of inter should be 3 (the number of intersected values of all above rows, including the 3rd row). Green boxes in the image show the value for the 3rd row.
The value in the 4th row of inter should be the number of intersected values of all above rows (yellow shaded area in the image).
Similarly, blue boxes show the value for the 5th row and red boxes show the value for the 6th row, and so on.
Note: for every row I will count the intersection with all above rows.
I deserve a reward for this :)
Here is your answer:
import pandas as pd
import numpy as np

# set up test data
data = {'0': [0, 0, 0, 1, 0], '1': [0, 0, 1, 0, 1], '2': [0, 0, 0, 1, 0],
        '3': [0, 0, 1, 1, 1], '4': [0, 1, 1, 1, 0], '5': [0, 0, 0, 0, 1],
        '6': [0, 1, 1, 1, 0], '7': [0, 0, 1, 0, 1], '8': [0, 1, 0, 1, 0],
        '9': [0, 1, 1, 0, 0], '10': [0, 0, 1, 0, 0], '11': [0, 0, 0, 1, 1],
        '12': [0, 0, 0, 1, 1]}
data = pd.DataFrame(data=data)

# collect inter data
inter_data = []
for main_index, main_row in data.iterrows():
    # select the current row and everything above it
    selected_data = data.loc[0:main_index, :]
    # rows with no active parts (e.g. the all-zero first row) get 0
    if 1 not in main_row.values:
        inter_data.append(0)
    # the second row counts its own active parts
    elif selected_data.shape[0] == 2:
        inter_data.append(selected_data[1:2].values[0].sum())
    # the rest of the rows
    else:
        # drop the current row, keeping only the rows above it
        selected_data = selected_data[:-1]
        # column-wise sums over the rows above
        summed_data = 0
        for index, row in selected_data.iterrows():
            summed_data += row.values
        # positions of 1s in the current row
        positions = np.where(main_row.values == 1)
        # column sums at those positions
        positions_data = summed_data[positions[0]]
        # count columns active now and in at least one row above
        inter_data.append((positions_data >= 1).sum())

# add inter data to the raw data
data['inter'] = inter_data
Output:
    0  1  2  3  4  5  6  7  8  9  10  11  12  inter
0   0  0  0  0  0  0  0  0  0  0   0   0   0      0
1   0  0  0  0  1  0  1  0  1  1   0   0   0      4
2   0  1  0  1  1  0  1  1  0  1   1   0   0      3
3   1  0  1  1  1  0  1  0  1  0   0   1   1      4
4   0  1  0  1  0  1  0  1  0  0   0   1   1      5
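For 1000s of rows, the row-by-row loop gets slow. A vectorized sketch using a shifted cumulative maximum reproduces the output above (assuming data is the raw 0/1 frame, before the inter column is added; the second row is special-cased to count its own active parts, mirroring the loop):

import numpy as np
vals = data.values  # 0/1 matrix: rows = days, columns = catchment parts
# for each row, mark which columns were active in ANY row above it
prev_any = np.zeros_like(vals)
prev_any[1:] = np.maximum.accumulate(vals, axis=0)[:-1]
inter = ((vals == 1) & (prev_any == 1)).sum(axis=1)
# mirror the loop's special case for the second row
if len(inter) > 1:
    inter[1] = vals[1].sum()
data['inter'] = inter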
How would you find the common values of subindices (column B in this example) between two dataframes where index A = 'a'?
import pandas as pd
df = pd.DataFrame({'Do': [0, 0, 0, 0, 0, 0], 'Ri': [0, 0, 0, 0, 0, 0],
                   'Mi': [0, 0, 0, 0, 0, 0], 'A': ['a', 'a', 'a', 'a', 'b', 'b'],
                   'B': [1, 2, 2, 3, 4, 5]})
df = df.set_index(['A', 'B'])
     Do  Ri  Mi
A B
a 1   0   0   0
  2   0   0   0
  2   0   0   0
  3   0   0   0
b 4   0   0   0
  5   0   0   0
df2 = pd.DataFrame({'Do': [0, 0, 0, 0, 0, 0], 'Ri': [0, 0, 0, 0, 0, 0],
                    'Mi': [0, 0, 0, 0, 0, 0], 'A': ['a', 'a', 'a', 'a', 'b', 'b'],
                    'B': [3, 3, 4, 6, 7, 8]})
df2 = df2.set_index(['A', 'B'])
     Do  Ri  Mi
A B
a 3   0   0   0
  3   0   0   0
  4   0   0   0
  6   0   0   0
b 7   0   0   0
  8   0   0   0
Currently I have:
df_a = df.loc[['a']].sort_index(level='B')
df2_a = df2.loc[['a']].sort_index(level='B')
df_a_b = df_a.index.levels[1].tolist()
df2_a_n = df2_a.index.levels[1].tolist()
set(df_a_b) & set(df2_a_n)
But this seems to take values from both index A = 'a' and A = 'b'.
I also noticed that loc['a'] and loc[['a']] result in different dfs. I'm not sure if this is related, but what is the significance of [['a']] vs ['a']?
For a single overlap, use set intersection after subsetting each DataFrame:
set(df.loc['a'].index) & set(df2.loc['a'].index)
#{3}
merge also works, but is overkill for a single intersection. On the other hand, if you want to do all the intersections at once, use merge + groupby:
#Single
df.loc['a'].merge(df2.loc['a'], left_index=True, right_index=True).index.unique()
#Int64Index([3], dtype='int64', name='B')
#All
df.merge(df2, on=['A', 'B']).reset_index().groupby('A').B.unique()
#A
#a [3]
#Name: B, dtype: object
To explain your error: you were finding the intersection of the levels, but what you want is the intersection of the level values. A MultiIndex's .levels keeps every value ever defined for the index, even values filtered out of the current selection, which is why the 'b' rows leaked into your result. Your current code should be changed to:
df_a = df.loc[['a']].sort_index(level='B')
df2_a = df2.loc[['a']].sort_index(level='B')
# Get The Level Values, not the Level IDs
df_a_b = df_a.index.get_level_values(1).tolist()
df2_a_n = df2_a.index.get_level_values(1).tolist()
set(df_a_b) & set(df2_a_n)
#{3}
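An alternative that skips the set conversion entirely is pandas' own Index.intersection:
common = df.loc['a'].index.intersection(df2.loc['a'].index)
#Int64Index([3], dtype='int64', name='B')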
Say I have a scene parsing map for an image, where each pixel in the map indicates which object that pixel belongs to. Now I want to get the bounding box of each object; how can I implement this in Python?
For a detailed example, say I have a scene parsing map like this:
0 0 0 0 0 0 0
0 1 1 0 0 0 0
1 1 1 1 0 0 0
0 0 1 1 1 0 0
0 0 1 1 1 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
So the bounding box is:
0 0 0 0 0 0 0
1 1 1 1 1 0 0
1 0 0 0 1 0 0
1 0 0 0 1 0 0
1 1 1 1 1 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Actually, in my task, just knowing the width and height of each object is enough.
A basic idea is to search for the four edges in the scene parsing map, from the top, bottom, left, and right directions. But there might be a lot of small objects in the image, so this way is not time efficient.
A second way is to calculate the coordinates of all non-zero elements and find the max/min x/y, then compute the width and height from these x and y.
Is there any other more efficient way to do this? Thx.
If you are processing images, you can use scipy's ndimage library.
If there is only one object in the image, you can get the measurements with scipy.ndimage.measurements.find_objects (http://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.ndimage.measurements.find_objects.html):
import numpy as np
from scipy import ndimage
a = np.array([[0, 0, 0, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0]])
# Find the location of all objects
objs = ndimage.find_objects(a)
# Get the height and width
height = int(objs[0][0].stop - objs[0][0].start)
width = int(objs[0][1].stop - objs[0][1].start)
If there are many objects in the image, you first have to label each object and then get the measurements:
import numpy as np
from scipy import ndimage
a = np.array([[0, 0, 0, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0, 0]])  # second object here
# Label objects
labeled_image, num_features = ndimage.label(a)
# Find the location of all objects
objs = ndimage.find_objects(labeled_image)
# Get the height and width
measurements = []
for ob in objs:
    measurements.append((int(ob[0].stop - ob[0].start), int(ob[1].stop - ob[1].start)))
If you check ndimage.measurements, you can get more measurements: center of mass, area...
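For example, a couple of those extra measurements on the labeled image above (a sketch; since the input is binary, ndimage.sum simply counts the pixels of each object):

# one entry per labeled object
centers = ndimage.center_of_mass(a, labeled_image, range(1, num_features + 1))
areas = ndimage.sum(a, labeled_image, range(1, num_features + 1))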
Using NumPy:
import numpy as np
ind = np.nonzero(arr.any(axis=0))[0] # indices of non empty columns
width = ind[-1] - ind[0] + 1
ind = np.nonzero(arr.any(axis=1))[0] # indices of non empty rows
height = ind[-1] - ind[0] + 1
A bit more explanation:
arr.any(axis=0) gives a boolean array telling you whether each column is empty (False) or not (True). np.nonzero(arr.any(axis=0))[0] then extracts the indices of the non-zero (i.e. True) entries of that array. ind[0] is the first element of that array, hence the leftmost non-empty column, and ind[-1] is the last element, hence the rightmost non-empty column. The difference then gives the width, give or take 1 depending on whether you include the borders or not.
Similar reasoning applies for the height, but on the other axis.
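Applied to the example map from the question, a quick check of this approach:

import numpy as np
arr = np.array([[0, 0, 0, 0, 0, 0, 0],
                [0, 1, 1, 0, 0, 0, 0],
                [1, 1, 1, 1, 0, 0, 0],
                [0, 0, 1, 1, 1, 0, 0],
                [0, 0, 1, 1, 1, 0, 0],
                [0, 0, 0, 0, 0, 0, 0],
                [0, 0, 0, 0, 0, 0, 0]])
cols = np.nonzero(arr.any(axis=0))[0]  # non-empty columns: 0..4
rows = np.nonzero(arr.any(axis=1))[0]  # non-empty rows: 1..4
width = cols[-1] - cols[0] + 1   # 5
height = rows[-1] - rows[0] + 1  # 4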