I'm trying to one-hot encode one column of a dataframe.
enc = OneHotEncoder()
minitable = enc.fit_transform(df["ids"])
But I'm getting
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19.
Is there a workaround for this?
I think you can use get_dummies:
df = pd.DataFrame({'ids':['a','b','c']})
print (df)
ids
0 a
1 b
2 c
print (df.ids.str.get_dummies())
a b c
0 1 0 0
1 0 1 0
2 0 0 1
EDIT:
If the input is a column of lists, first cast it to str, remove the [] with strip and call get_dummies:
df = pd.DataFrame({'ids':[[0,4,5],[4,7,8],[5,1,2]]})
print(df)
ids
0 [0, 4, 5]
1 [4, 7, 8]
2 [5, 1, 2]
print (df.ids.astype(str).str.strip('[]').str.get_dummies(', '))
0 1 2 4 5 7 8
0 1 0 0 1 1 0 0
1 0 0 0 1 0 1 1
2 0 1 1 0 1 0 0
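For completeness, the deprecation warning from the question can also be avoided without switching to get_dummies: select the column with a list of names so OneHotEncoder receives a 2D DataFrame instead of a 1D Series. A minimal sketch, assuming a modern scikit-learn (which, unlike 0.17, accepts string categories directly):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'ids': ['a', 'b', 'c']})

enc = OneHotEncoder()
# df[['ids']] is a 2D DataFrame, df['ids'] is the 1D Series the warning is about
minitable = enc.fit_transform(df[['ids']])

print(minitable.toarray())  # 3x3: one indicator column per category a, b, c
```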
My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.
I have a DF that looks like this:
month day week_day classname_en origin destination
0 1 7 2 1 2 5
1 1 2 6 2 1 167
2 2 1 5 1 2 54
3 2 2 6 4 1 6
4 1 2 6 5 6 1
But I want to turn it into something like:
month_1 month_2 ...classname_en_1 classname_en_2 ... origin_1 origin_2 ...destination_1
0 1 0 1 0 0 1 0
1 1 0 0 1 1 0 0
2 0 1 1 0 0 1 0
3 0 1 0 0 1 0 0
4 1 0 0 0 0 0 1
Basically, turn all values into columns and fill the rows with binary indicators: 1 if the value is present, 0 if not.
I don't know if this is possible with a single function, but I would appreciate any help!
To expand on @Corralien's answer:
It is indeed a way to do it, but since you say this is for ML purposes, you might introduce a bug.
With the code above you get a matrix with 20 features. Now, say you want to predict on some data which suddenly has one more month value than your training data; then the matrix for your prediction data would have 21 features, so you cannot pass it into your fitted model.
To overcome this you can use one-hot encoding from scikit-learn. It makes sure that you always have the same number of features on "new data" as in your training data.
import pandas as pd
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
pd.get_dummies(df_train)
# output
age color_blue color_red
0 10 0 1
1 15 1 0
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
pd.get_dummies(df_new)
#output
age color_blue color_green color_red
0 10 0 0 1
1 15 1 0 0
2 20 0 1 0
and as you can see, the order of the color-binary representation has also changed.
If, on the other hand, we use OneHotEncoder, you can avoid all those issues:
from sklearn.preprocessing import OneHotEncoder
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_ohe_transformed = ohe.fit_transform(df_train[["color"]])  # creates a sparse matrix
ohe_features = ohe.get_feature_names_out()  # ['color_blue', 'color_red']
pd.DataFrame(color_ohe_transformed.todense(), columns=ohe_features, dtype=int)
# output
color_blue color_red
0 0 1
1 1 0
# now transform new data
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)
#output
color_blue color_red
0 0 1
1 1 0
2 0 0
Note in the last row that blue and red are both zero, since that row has color="green", which was not present in the training data.
Note that todense() is only used here to illustrate how it works. Usually you would keep it as a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
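As a minimal sketch of that last remark, on the same toy data as above and assuming scipy is available, the numeric age column can be stacked next to the encoded colors without densifying anything:

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})

ohe = OneHotEncoder(handle_unknown="ignore")
color_sparse = ohe.fit_transform(df_train[["color"]])

# Stack the numeric feature next to the sparse one-hot block
X_train = hstack([csr_matrix(df_train[["age"]].to_numpy()), color_sparse])

print(X_train.shape)  # (2, 3): age + color_blue + color_red
```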
Use pd.get_dummies:
out = pd.get_dummies(df, columns=df.columns)
print(out)
# Output
month_1 month_2 day_1 day_2 day_7 week_day_2 week_day_5 ... origin_2 origin_6 destination_1 destination_5 destination_6 destination_54 destination_167
0 1 0 0 0 1 1 0 ... 1 0 0 1 0 0 0
1 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 1
2 0 1 1 0 0 0 1 ... 1 0 0 0 0 1 0
3 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0
4 1 0 0 1 0 0 0 ... 0 1 1 0 0 0 0
[5 rows x 20 columns]
You can use the get_dummies function of pandas to convert row values into columns.
For that, your code will be:
import pandas as pd
df = pd.DataFrame({
'month': [1, 1, 2, 2, 1],
'day': [7, 2, 1, 2, 2],
'week_day': [2, 6, 5, 6, 6],
'classname_en': [1, 2, 1, 4, 5],
'origin': [2, 1, 2, 1, 6],
'destination': [5, 167, 54, 6, 1]
})
response = pd.get_dummies(df, columns=df.columns)
print(response)
Result: the same 20-column indicator frame shown in the previous answer.
I have a question regarding a transformation I want to add to a Dataframe pandas.
I have a dataframe df with the following columns :
df.columns = Index(['S', 'HZ', 'Z', 'Demand'], dtype='object')
I want to perform the following transformation:
for s in range(S):
    for t in range(HZ):
        for z in range(Z):
            df.loc[(df['S'] == s) & (df['HZ'] == t) & (df['Z'] == z), 'Demand'] = D[s][t][z]
Where D is a numpy array with the corresponding dimensions. Here is a simple example of what I am trying to do (with T = 0 to keep it simple).
Here is df before :
S HZ Z Demand
0 0 0 0 0
1 0 0 1 0
2 0 1 0 0
3 0 1 1 0
4 0 2 0 0
5 0 2 1 0
6 1 0 0 0
7 1 0 1 0
8 1 1 0 0
9 1 1 1 0
10 1 2 0 0
11 1 2 1 0
Here is D :
D = [[[1, 2],
[3, 4],
[5, 6]],
[[7, 8],
[9, 10],
[11, 12]]]
And here is what i want:
S HZ Z Demand
0 0 0 0 1
1 0 0 1 2
2 0 1 0 3
3 0 1 1 4
4 0 2 0 5
5 0 2 1 6
6 1 0 0 7
7 1 0 1 8
8 1 1 0 9
9 1 1 1 10
10 1 2 0 11
11 1 2 1 12
This code works, but it is very slow, so I tried something else to avoid the for loops:
df.loc[df['HZ'] >= T, 'Demand'] = D[df['S']][df['HZ']][df['Z']]
Which raises the following error :
ValueError: Must have equal len keys and value when setting with an iterable
I am trying to understand what this error means and how to fix it if possible; if that is not possible, is there a way to do what I want without using for loops?
Thanks in advance.
After trying many things, I finally found something that works :
df.loc[df['HZ'] >= T, 'Demand'] = D[df.loc[df['HZ'] >= T, 'S']][df.loc[df['HZ'] >= T, 'HZ'] - T][df.loc[df['HZ'] >= T, 'Z']]
The problem is that with this line:
df.loc[df['HZ'] >= T, 'Demand'] = D[df['S']][df['HZ']][df['Z']]
I try to access the elements of D with Series (df['S'], for example), which represent whole columns of the dataframe, and this is not possible. With the solution I found, I use the values of the filtered columns to locate the corresponding elements of D.
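As a side note, the chained D[...][...][...] lookup can be replaced by a single NumPy advanced-indexing call, which pairs the three index columns element-wise. A sketch on the example data (with T = 0):

```python
from itertools import product

import numpy as np
import pandas as pd

D = np.array([[[1, 2], [3, 4], [5, 6]],
              [[7, 8], [9, 10], [11, 12]]])

# Rebuild the example frame: every (S, HZ, Z) combination, in order
df = pd.DataFrame(list(product(range(2), range(3), range(2))),
                  columns=['S', 'HZ', 'Z'])

# One index array per axis of D, matched element-wise per row
df['Demand'] = D[df['S'], df['HZ'], df['Z']]

print(df['Demand'].tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```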
I would like to use Pandas to parse Q26 Challenges into the subsequent columns, with a "1" representing its presence in the original unparsed column. So the data frame initially looks like this:
ID  Q26 Challenges  Q26_1  Q26_2  Q26_3  Q26_4  Q26_5  Q26_6  Q26_7
1   5               0      0      0      0      0      0      0
2   1,2             0      0      0      0      0      0      0
3   1,3,7           0      0      0      0      0      0      0
And I want it to look like this:
ID  Q26 Challenges  Q26_1  Q26_2  Q26_3  Q26_4  Q26_5  Q26_6  Q26_7
1   5               0      0      0      0      1      0      0
2   1,2             1      1      0      0      0      0      0
3   1,3,7           1      0      1      0      0      0      1
You can iterate over the range of values in Q26 Challenges, using str.contains to check if the current value is contained in the string and then converting that boolean value to an integer. For example:
df = pd.DataFrame({'id' : [1, 2, 3, 4, 5], 'Q26 Challenges': ['0', '1,2', '2', '1,2,6,7', '3,4,5,11' ] })
for i in range(1, 12):
    df[f'Q26_{i}'] = df['Q26 Challenges'].str.contains(rf'\b{i}\b').astype(int)
df
Output:
id Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7 Q26_8 Q26_9 Q26_10 Q26_11
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1,2 1 1 0 0 0 0 0 0 0 0 0
2 3 2 0 1 0 0 0 0 0 0 0 0 0
3 4 1,2,6,7 1 1 0 0 0 1 1 0 0 0 0
4 5 3,4,5,11 0 0 1 1 1 0 0 0 0 0 1
str.get_dummies can be used on the 'Q26 Challenges' column to create the indicator values. This indicator DataFrame can be reindexed to include the complete result range (note column headers will be of type string). add_prefix can be used to add the 'Q26_' to the column headers. Lastly, join back to the original DataFrame:
df = df.join(
df['Q26 Challenges'].str.get_dummies(sep=',')
.reindex(columns=map(str, range(1, 8)), fill_value=0)
.add_prefix('Q26_')
)
The reindexing can also be done dynamically based on the resulting columns. It is necessary to convert the resulting column headers to numbers first to ensure numeric order, rather than lexicographic ordering:
s = df['Q26 Challenges'].str.get_dummies(sep=',')
# Convert to numbers to correctly access min and max
s.columns = s.columns.astype(int)
# Add back to DataFrame
df = df.join(s.reindex(
# Build range from the min column to max column values
columns=range(min(s.columns), max(s.columns) + 1),
fill_value=0
).add_prefix('Q26_'))
Both options produce:
ID Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7
0 1 5 0 0 0 0 1 0 0
1 2 1,2 1 1 0 0 0 0 0
2 3 1,3,7 1 0 1 0 0 0 1
Given initial input:
import pandas as pd
df = pd.DataFrame({
'ID': [1, 2, 3],
'Q26 Challenges': ['5', '1,2', '1,3,7']
})
ID Q26 Challenges
0 1 5
1 2 1,2
2 3 1,3,7
I have a df with several nominal categorical columns that I would want to create dummies for. Here's a mock df:
data = {'Frukt':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Vikt':[23, 45, 31, 28, 62, 12, 44, 42, 23, 32],
'Färg':['grön', 'gul', 'röd', 'grön', 'grön', 'gul', 'röd', 'röd', 'gul', 'grön'],
'Smak':['god', 'sådär', 'supergod', 'rälig', 'rälig', 'supergod', 'god', 'god', 'rälig', 'god']}
df = pd.DataFrame(data)
I have tried naming the columns I want to get dummies from:
nomcols = ['Färg', 'Smak']
for column in ['nomcols']:
    dummies = pd.get_dummies(df[column])
    df[dummies.columns] = dummies
which was a tip I got from another question that I found, but it didn't work. I have looked at the other four questions that are similar but haven't had any luck since most of them get dummies from ALL the columns in the df.
What I would like is dummy columns for just the columns in nomcols.
Use get_dummies and specify the columns in a list; then remove the separator from the column names by setting prefix to an empty string:
nomcols = ['Färg', 'Smak']
df = pd.get_dummies(df, columns=nomcols, prefix='', prefix_sep='')
print (df)
Frukt Vikt grön gul röd god rälig supergod sådär
0 1 23 1 0 0 1 0 0 0
1 2 45 0 1 0 0 0 0 1
2 3 31 0 0 1 0 0 1 0
3 4 28 1 0 0 0 1 0 0
4 5 62 1 0 0 0 1 0 0
5 6 12 0 1 0 0 0 1 0
6 7 44 0 0 1 1 0 0 0
7 8 42 0 0 1 1 0 0 0
8 9 23 0 1 0 0 1 0 0
9 10 32 1 0 0 1 0 0 0
What you did was more or less correct.
But you did:
for column in ['nomcols']:
    dummies = pd.get_dummies(df[column])
So you're trying to access df at 'nomcols'. What you wanted to do was:
dummies = pd.get_dummies(df[nomcols])
You want to access the dataframe at the column names inside the nomcols list.
nomcols = ['Färg', 'Smak']
for column in nomcols:
    dummies = pd.get_dummies(df[column])
    df[dummies.columns] = dummies
The above code should work.
I have a dataframe of the below structure. I want to get the column numbers which are unique to a particular row.
1 1 0 1 1 1 0 0 0
0 1 0 1 0 0 0 0 0
0 1 0 0 1 0 0 0 0
1 0 0 0 1 0 0 0 1
0 0 0 0 0 0 1 1 0
1 0 0 0 1 0 0 0 0
In the above example I should get coln6, coln7, coln8 and coln9, as only one row has a value in each of these columns. I should also be able to distinguish among the columns: coln7 and coln8 should group together, since they are unique to the same row. Is there an efficient solution in Python for this?
You can call sum on the df and compare against 1 and use this to mask the columns:
In [19]:
df.columns[df.sum(axis=0) == 1]
Out[19]:
Int64Index([5, 6, 7, 8], dtype='int64')
Here is my first approach:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([
    1, 1, 0, 1, 1, 1, 0, 0, 0,
    0, 1, 0, 1, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 1, 0, 0, 0, 0,
    1, 0, 0, 0, 1, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0, 1, 1, 0,
    1, 0, 0, 0, 1, 0, 0, 0, 0]).reshape(6, 9))
print(df.sum(axis=0).apply(lambda x: True if x == 1 else False))
Output:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
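Neither answer above covers the grouping part of the question (coln7 and coln8 belonging to the same row). A sketch of one way to do it, using idxmax to find the row that owns each unique column's single 1 and bucketing the columns by that row (column labels are 0-indexed here, so the question's coln6..coln9 appear as 5..8):

```python
from collections import defaultdict

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    1, 1, 0, 1, 1, 1, 0, 0, 0,
    0, 1, 0, 1, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 1, 0, 0, 0, 0,
    1, 0, 0, 0, 1, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0, 1, 1, 0,
    1, 0, 0, 0, 1, 0, 0, 0, 0]).reshape(6, 9))

# Columns whose total is 1 are unique to a single row
unique_cols = df.columns[df.sum(axis=0) == 1]

# For each such column, idxmax gives the row holding its only 1
owner = df[unique_cols].idxmax(axis=0)

# Bucket columns by their owning row
groups = defaultdict(list)
for col, row in owner.items():
    groups[row].append(col)

print(dict(groups))  # {0: [5], 4: [6, 7], 3: [8]}
```

Columns 6 and 7 land in the same bucket (row 4), which is exactly the grouping asked for.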