I have the following program:
cat_feats = ['x', 'y', 'z', 'a', 'b',
             'c', 'd', 'e']
onehot_encoder = OneHotEncoder(categories='auto')
# convert each categorical feature from integer
# to one-hot
for feature in cat_feats:
    data[feature] = data[feature].array.reshape(len(data[feature]), 1)
    data[feature] = onehot_encoder.fit_transform(data[feature])
I am having issues with this. I get:
'PandasArray' object has no attribute 'reshape'
The output of data.head() before using the encoder is this:
0 2 1 4 6 3 2 1 37
2 1 7 2 10 0 4 1 37
3 2 15 2 6 0 2 1 37
5 2 0 4 7 1 4 1 37
7 4 14 2 9 0 4 1 37
This output is of type DataFrame and contains only integers, which I am trying to convert to one-hot. I have tried .array, .values, and .array.reshape(-1, 1), but none of them work. Using .values seemed to get past the first line of the loop, but then my one-hot conversion produced garbage.
Please help.
The following information might be helpful:
The type of some of the objects:
data[feature]: pandas.Series
data[feature].values: numpy.ndarray
You can reshape a numpy.ndarray but not a pandas.Series, so you need .values to get a numpy.ndarray first.
When you assign a numpy.ndarray back to data[feature], an automatic conversion back to a Series occurs, so data[feature] = data[feature].values.reshape(-1, 1) doesn't appear to do anything.
fit_transform takes an array-like argument that must be 2D (e.g. a pandas.DataFrame or a 2D numpy.ndarray), because sklearn.preprocessing.OneHotEncoder is designed to fit/transform multiple features at once; passing a 1D pandas.Series raises an error.
fit_transform returns a sparse matrix (or a 2D array), so assigning it back to a pandas.Series will make a mess.
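The pitfalls above can be seen in a minimal sketch on toy integer data (not the asker's frame):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(categories='auto')
col = np.array([0, 1, 2])

# a 1D array (like a Series) is rejected: the encoder wants one column per feature
try:
    enc.fit_transform(col)
except ValueError as err:
    print("1D input fails:", err)

# reshape(-1, 1) turns it into a single-column 2D array, which works,
# but the result is a SciPy sparse matrix, not something to stuff into a Series
encoded = enc.fit_transform(col.reshape(-1, 1))
print(type(encoded))
print(encoded.toarray())
```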
(Not recommended) If you insist on processing one feature at a time:
for feature in categorical_feats:
    encoder = OneHotEncoder()
    tmp_ohe_data = pd.DataFrame(
        encoder.fit_transform(data[feature].values.reshape(-1, 1)).toarray(),
        columns=encoder.get_feature_names(),  # get_feature_names_out() on scikit-learn >= 1.0
    )
    data = pd.concat([tmp_ohe_data, data], axis=1).drop([feature], axis=1)
I recommend doing the encoding like this instead:
encoder = OneHotEncoder()
ohe_data = pd.DataFrame(
    encoder.fit_transform(data[categorical_feats]).toarray(),
    columns=encoder.get_feature_names(),  # get_feature_names_out() on scikit-learn >= 1.0
)
res = pd.concat([ohe_data, data], axis=1).drop(categorical_feats, axis=1)
pandas.get_dummies is also a good choice.
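For completeness, a get_dummies sketch on a toy frame (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red'], 'size': [1, 2, 3]})

# one-hot encode only the listed columns; other columns pass through untouched
dummies = pd.get_dummies(df, columns=['color'])
print(dummies)
```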
Related
I am unable to create an empty dataframe and then copy the edge nodes into the dataframe using a list comprehension.
df = pandas.DataFrame(columns=['Source','Target'])
df[['Source','Target']] = [(s,t) for (s,t) in graph.edges]
I receive an error stating that it can't copy 44000 into a series.
I don't know what your graph edges contain and I don't want to guess, so here is a minimal example with the same problem:
df = pd.DataFrame(columns=['Source','Target'])
df[['Source','Target']] = [(s,t) for (s,t) in zip(range(1000), range(1000))]
Results in:
ValueError: cannot copy sequence with size 1000 to array axis with dimension 0
The short answer is that you can't assign a plain list of tuples to columns of an empty DataFrame: the frame has zero rows, so pandas cannot fit a sequence of length 1000 into an axis of length 0. If you first wrap the list in a DataFrame, pandas can align it instead:
df = pd.DataFrame(columns=['Source','Target'])
df[['Source','Target']] = pd.DataFrame([(s,t) for (s,t) in zip(range(1000), range(1000))])
Now it works.
>>> df.head()
Source Target
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
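As an aside, if all you need is a frame built from the pairs, constructing it directly sidesteps the empty-frame assignment entirely (same synthetic pairs as above):

```python
import pandas as pd

pairs = [(s, t) for (s, t) in zip(range(1000), range(1000))]

# build the frame in one step instead of assigning into an empty one
df = pd.DataFrame(pairs, columns=['Source', 'Target'])
print(df.head())
```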
I need to classify the String Values of a feature of my dataset, so that I can further use it for other things, let's say predicting or plotting.
How do I convert it?
I found this solution, but it requires manually writing code for every unique value of the feature. For 2-3 unique values that's fine, but I have a feature with more than 50 unique country values; I can't write a branch for every country.
def sex_class(x):
    if x == 'male':
        return 1
    else:
        return 0
This changes the male values to 1 and female values to 0 in the feature - sex.
You can make use of scikit-learn's LabelEncoder:
from sklearn import preprocessing

# given a list containing all possible labels
sex_classes = ['male', 'female']

le = preprocessing.LabelEncoder()
le.fit(sex_classes)
This will assign labels to all the unique values in the given list. You can save this label encoder object as a pickle file for later use as well.
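A minimal sketch of fitting, transforming, and pickling the encoder (synthetic data; note that LabelEncoder assigns codes in sorted order, so 'female' -> 0 and 'male' -> 1 here):

```python
import pickle
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['male', 'female'])

# encode new values with the fitted encoder
codes = le.transform(['male', 'male', 'female'])
print(codes)

# persist the fitted encoder and restore it later
blob = pickle.dumps(le)
restored = pickle.loads(blob)
print(restored.inverse_transform(codes))
```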
rank or pd.factorize
df['ID_int'] = df['id'].rank(method='dense').astype(int)
df['ID_int2'] = pd.factorize(df['id'])[0]
Output:
id ID_int ID_int2
0 a 2 0
1 b 3 1
2 c 4 2
3 a 2 0
4 b 3 1
5 c 4 2
6 A 1 3
7 b 3 1
The labels are different, but consistent.
You can use a dictionary instead.
sex_class = {'male': 1, 'female': 0}
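You would then apply it with Series.map (the df and column name here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})

# replace each string with its integer code
sex_class = {'male': 1, 'female': 0}
df['sex'] = df['sex'].map(sex_class)
print(df['sex'].tolist())
```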
I have a pandas series, and a function that takes a value in the series and returns a dataframe. Is there a way to apply the function to the series and collate the results in a natural way?
What I am really trying to do is to use pandas series/multiindex to keep track of the results in each step of my data analysis pipeline, where the multiindex holds the parameters used to get the values. For example, the series (s below) is the result of step 0 in my data analysis pipeline. In step 1, I want to try x more dimensions (2 below, thus the dataframe) and collate the results into another series.
Can we do better than below? Where stack() calls seem a bit excessive. Will the xarray library be a good fit for my use case?
In [112]: s
Out[112]:
a 0
b 1
c 2
dtype: int64
In [113]: d = s.apply(lambda x: pd.DataFrame([[x,x*2],[x*3,x*4]]).stack()).stack().stack()
In [114]: d
Out[114]:
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 3
1 0 2
1 4
c 0 0 2
1 6
1 0 4
1 8
dtype: int64
This should give you a DataSet of 2D arrays, and align them for you. You may want to set the dimensions prior if you want them to be named a certain way / be a certain size.
import xarray as xr

ds = xr.Dataset({k: func(v) for k, v in series.items()})
I have applied a LabelEncoder() on a dataframe, which returns the following:
The order/new_carts have different label-encoded numbers, like 70, 64, 71, etc
Is this inconsistent labeling, or did I do something wrong somewhere?
LabelEncoder works on one-dimensional arrays. If you apply it to multiple columns, it will be consistent within each column but not across columns.
As a workaround, you can flatten the dataframe into a one-dimensional array and call LabelEncoder on that array.
Assume this is the dataframe:
df
Out[372]:
0 1 2
0 d d a
1 c a c
2 c c b
3 e e d
4 d d e
5 d b e
6 e e b
7 a e b
8 b c c
9 e a b
With ravel and then reshaping:
pd.DataFrame(LabelEncoder().fit_transform(df.values.ravel()).reshape(df.shape),
             columns=df.columns)
Out[373]:
0 1 2
0 3 3 0
1 2 0 2
2 2 2 1
3 4 4 3
4 3 3 4
5 3 1 4
6 4 4 1
7 0 4 1
8 1 2 2
9 4 0 1
Edit:
If you want to store the labels, you need to save the LabelEncoder object.
le = LabelEncoder()
df2 = pd.DataFrame(le.fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)
Now, le.classes_ gives you the classes (starting from 0).
le.classes_
Out[390]: array(['a', 'b', 'c', 'd', 'e'], dtype=object)
If you want to access the integer by label, you can construct a dict:
dict(zip(le.classes_, np.arange(len(le.classes_))))
Out[388]: {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
You can do the same with the transform method, without building a dict:
le.transform(['c'])
Out[395]: array([2])
Your LabelEncoder object is being re-fit to each column of your DataFrame.
Because of the way the apply and fit_transform functions work, you are accidentally calling the fit function on each column of your frame. Let's walk through what's happening in the following line:
labeled_df = String_df.apply(LabelEncoder().fit_transform)
1. Create a new LabelEncoder object.
2. Call apply, passing in the fit_transform method. For each column in your DataFrame, apply will call fit_transform on your encoder, passing in the column as an argument. This does two things:
   A. refit your encoder (modifying its state)
   B. return the codes for the elements of your column based on your encoder's new fitting
The codes will not be consistent across columns because each time you call fit_transform the LabelEncoder object can choose new transformation codes.
If you want your codes to be consistent across columns, you should fit your LabelEncoder to your whole dataset.
Then pass the transform function to your apply function, instead of the fit_transform function. You can try the following:
encoder = LabelEncoder()
all_values = String_df.values.ravel() #convert the dataframe to one long array
encoder.fit(all_values)
labeled_df = String_df.apply(encoder.transform)
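A quick check on a toy frame (synthetic data, not the asker's) confirms the codes now agree across columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

String_df = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': ['c', 'a', 'a']})

encoder = LabelEncoder()
encoder.fit(String_df.values.ravel())  # fit once on every value in the frame
labeled_df = String_df.apply(encoder.transform)
print(labeled_df)  # 'a' gets the same code in both columns
```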
This question is a general version of a specific case asked about here.
I have a pandas dataframe with columns that contain integers. I'd like to concatenate all of those integers into a string in one column.
Given this answer, for particular columns, this works:
(dl['ungrd_dum'].map(str) +
dl['mba_dum'].map(str) +
dl['jd_dum'].map(str) +
dl['ma_phd_dum'].map(str))
But suppose I have many (hundreds) of such columns, whose names are in a list dummies. I'm certain there's some cool pythonic way of doing this with one magical line that will do it all. I've tried using map with dummies, but haven't yet been able to figure it out.
IIUC you should be able to do
df[dummies].astype(str).apply(lambda x: ''.join(x), axis=1)
Example:
In [12]:
df = pd.DataFrame({'a':np.random.randint(0,100, 5), 'b':np.arange(5), 'c':np.random.randint(0,10,5)})
df
Out[12]:
a b c
0 5 0 2
1 46 1 3
2 86 2 4
3 85 3 9
4 60 4 4
In [15]:
cols=['a','c']
df[cols].astype(str).apply(''.join, axis=1)
Out[15]:
0 52
1 463
2 864
3 859
4 604
dtype: object
EDIT
As #JohnE has pointed out you could call sum instead which will be faster:
df[cols].astype(str).sum(axis=1)
However, that will implicitly convert the dtype to float64 so you'd have to cast back to str again and slice the decimal point off if necessary:
df[cols].astype(str).sum(axis=1).astype(str).str[:-2]
from functools import reduce  # reduce lives in functools on Python 3
from operator import add

reduce(add, (df[c].astype(str) for c in cols), "")
For example:
df = pd.DataFrame({'a':np.random.randint(0,100, 5),
'b':np.arange(5),
'c':np.random.randint(0,10,5)})
cols = ['a', 'c']
In [19]: df
Out[19]:
a b c
0 6 0 4
1 59 1 9
2 13 2 5
3 44 3 1
4 79 4 4
In [20]: reduce(add, (df[c].astype(str) for c in cols), "")
Out[20]:
0 64
1 599
2 135
3 441
4 794
dtype: object
The first thing you need to do is convert your DataFrame of numbers into a DataFrame of strings, as efficiently as possible:
dl = dl.astype(str)
Then, you're in the same situation as this other question, and can use the same Series.str accessor techniques as in this answer:
.str.cat()
Using str.cat() you could do:
dl['result'] = dl[dl.columns[0]].str.cat([dl[c] for c in dl.columns[1:]], sep=' ')
str.join()
To use .str.join() you need a Series of iterables, say tuples:
df['result'] = df[df.columns[1:]].apply(tuple, axis=1).str.join(' ')
Don't try the above with list instead of tuple, or the apply() method will return a DataFrame, and DataFrames don't have the .str accessor that Series have.
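Putting the str.cat() version together on a small synthetic frame:

```python
import pandas as pd

dl = pd.DataFrame({'a': [1, 22], 'b': [3, 4], 'c': [5, 66]}).astype(str)
cols = list(dl.columns)

# concatenate the first column with all remaining columns, space-separated
dl['result'] = dl[cols[0]].str.cat([dl[c] for c in cols[1:]], sep=' ')
print(dl['result'].tolist())
```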