Converting objects into suitable numerical values - python

This is a salary dataset composed of the following columns:
Index(['work_year', 'experience_level', 'employment_type',
       'job_title', 'salary', 'salary_currency', 'salary_in_usd',
       'employee_residence', 'remote_ratio', 'company_location',
       'company_size'],
      dtype='object')
I want to look at the comparison between the features (experience_level, employment_type, job_title, salary_currency and remote_ratio) and the label (salary).
I have to do the feature engineering part, which includes converting experience_level, employment_type and salary_currency to suitable numerical values.
How can that be done? What is the optimal solution for the three columns that have to be converted?

You could use e.g. one-hot encoding to transform a dataframe like
index  experience_level  employment_type
0      EX                CT
1      EX                PT
2      MI                CT
3      MI                FT
4      EX                PT
to a dataframe like
index  experience_level_EN  experience_level_EX  experience_level_MI  experience_level_SE  employment_type_CT  employment_type_FL  employment_type_FT  employment_type_PT
0      0                    1                    0                    0                    1                   0                   0                   0
1      0                    1                    0                    0                    0                   0                   0                   1
2      0                    0                    1                    0                    1                   0                   0                   0
3      0                    0                    1                    0                    0                   0                   1                   0
4      0                    1                    0                    0                    0                   0                   0                   1
as follows:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

cat_cols = ['experience_level', 'employment_type']
df_encoded = df.drop(columns=cat_cols)
for col in cat_cols:
    # Fit one binarizer per column and name the new columns after its classes
    encoder = LabelBinarizer().fit(df[col])
    cols = [f'{col}_{c}' for c in encoder.classes_]
    encoded = pd.DataFrame(encoder.transform(df[col]), columns=cols, index=df.index)
    df_encoded = pd.concat([encoded, df_encoded], axis=1)
You may also want to apply this to other columns such as company_location, employee_residence, etc., which I assume to be categorical too.
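As a side note, pandas can do the same one-hot encoding in a single call; a minimal sketch, assuming df holds the salary dataframe from the question:
import pandas as pd

# get_dummies drops the listed columns and appends one 0/1 column per category,
# named <column>_<category>, matching the table above
df_encoded = pd.get_dummies(df, columns=['experience_level', 'employment_type'])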

Related

pandas: convert a list in a column into separate columns

A dataframe df has a column of word lists:
id  data_words
1   [salt, major, lab, water]
2   [lab, plays, critical, salt]
3   [water, success, major]
I want to one-hot encode the column:
id  critical  lab  major  plays  salt  success  water
1          0    1      1      0     1        0      1
2          1    1      0      1     1        0      0
3          0    0      1      1     0        1      0
What I tried:
Attempt 1:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
                          columns=mlb.classes_,
                          index=df.index))
Error: ValueError: columns overlap but no suffix specified: Index(['class'], dtype='object')
Attempt 2:
I converted the list into a simple comma-separated string with the following code:
df['data_words_Joined'] = df.data_words.apply(','.join)
which makes the dataframe look like:
id  data_words_Joined
1   salt,major,lab,water
2   lab,plays,critical,salt
3   water,success,major
Then I tried
pd.concat([df, pd.get_dummies(df['data_words_Joined'])], axis=1)
But it turns each whole string into a single column name instead of creating a separate column per word:
id  salt,major,lab,water  lab,plays,critical,salt  water,success,major
1   1                     0                        0
2   0                     1                        0
3   0                     0                        1
You can try explode followed by pivot_table:
df_e = df.explode('data_words')
print(df_e.pivot_table(index=df_e['id'], columns=df_e['data_words'],
                       values='id', aggfunc='count', fill_value=0))
Returning the following output:
data_words  critical  lab  major  plays  salt  success  water
id
1                  0    1      1      0     1        0      1
2                  1    1      0      1     1        0      0
3                  0    0      1      0     0        1      1
Edit: Adding data for replication purposes:
df = pd.DataFrame({'id': [1, 2, 3],
                   'data_words': [['salt', 'major', 'lab', 'water'],
                                  ['lab', 'plays', 'critical', 'salt'],
                                  ['water', 'success', 'major']]})
Which looks like:
   id                    data_words
0   1     [salt, major, lab, water]
1   2  [lab, plays, critical, salt]
2   3       [water, success, major]
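As a hedged aside, pd.crosstab on the exploded frame should give the same count table with a little less typing (a sketch, not part of the original answer):
df_e = df.explode('data_words')
# crosstab counts (id, word) pairs, yielding the same 0/1 table
print(pd.crosstab(df_e['id'], df_e['data_words']))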
One possible approach could be to use get_dummies with your apply function:
new_df = df.data_words.apply(','.join).str.get_dummies(sep=',')
print(new_df)
Output:
   critical  lab  major  plays  salt  success  water
0         0    1      1      0     1        0      1
1         1    1      0      1     1        0      0
2         0    0      1      0     0        1      1
Tested with pandas 1.1.2, using the input data from Celius Stingher's answer.
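If you then want the word columns next to the original id column, a small follow-up sketch (assuming df still has its default integer index):
# join on the shared index, dropping the original list column
df_final = df.drop(columns='data_words').join(new_df)
print(df_final)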

Extracting rows in R based on an ID number

I have a data frame in R which, when I run data$data['rs146217251',1:10,], looks something like:
                g=0 g=1 g=2
1389117_1389117   0   1  NA
2912943_2912943   0   0   1
3094358_3094358   0   0   1
5502557_5502557   0   0   1
2758547_2758547   0   0   1
3527892_3527892   0   1  NA
3490518_3490518   0   0   1
1569224_1569224   0   0   1
4247075_4247075   0   1  NA
4428814_4428814   0   0   1
The leftmost column contains participant identifiers. There are roughly 500,000 participants listed in this data frame, but I have a subset of them (about 5,000) listed by their identifiers. What is the best way to extract only the rows I care about, matching on identifier, in R or Python (or some other way)?
Assuming that the participant identifiers are row names, you can filter by the vector of the identifiers as below:
df <- data$data['rs146217251',1:10,]
#Assuming the vector of identifiers
id <- c("4428814_4428814", "3490518_3490518", "3094358_3094358")
filtered <- df[id,]
Output:
> filtered
g.0 g.1 g.2
4428814_4428814 0 0 1
3490518_3490518 0 0 1
3094358_3094358 0 0 1
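Since the question also allows Python, here is a minimal pandas sketch of the same selection, assuming the identifiers form the index of a dataframe df:
import pandas as pd

ids = ["4428814_4428814", "3490518_3490518", "3094358_3094358"]
# intersection keeps only the identifiers actually present, so .loc won't raise
filtered = df.loc[df.index.intersection(ids)]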

How to create by default two columns for every feature (one-hot encoding)?

My feature engineering runs over different documents. For some documents a feature does not occur, so its sublist contains only a single value, like the third sublist [0,0,0,0,0]. One-hot encoding such a sublist produces only one column, while the feature lists of other documents are transformed into two columns. Is there any way to tell one-hot encoding to create two columns even when a sublist contains only one distinct value, and to insert the column in the right spot? The main problem is that the final feature dataframes of different documents have different numbers of columns, which makes them incomparable.
import pandas as pd

features = [[0,0,1,0,0], [1,1,1,0,1], [0,0,0,0,0], [1,0,1,1,1],
            [1,1,0,1,1], [1,0,1,1,1], [0,1,0,0,0]]
df = pd.DataFrame(features[0])
df_features_final = pd.get_dummies(df[0])
for feature in features[1:]:
    df = pd.DataFrame(feature)
    df_enc = pd.get_dummies(df[0])
    print(df_enc)
    df_features_final = pd.concat([df_features_final, df_enc], axis=1, join='inner')
print(df_features_final)
The result is the following dataframe. As you can see from the column titles, the all-zero sublist contributes only a single 0 column (the fifth column), with no matching 1 column after it:
   0  1  0  1  0  0  1  0  1  0  1  0  1
0  1  0  0  1  1  0  1  0  1  0  1  1  0
1  1  0  0  1  1  1  0  0  1  1  0  0  1
2  0  1  0  1  1  0  1  1  0  0  1  1  0
3  1  0  1  0  1  0  1  0  1  0  1  1  0
4  1  0  0  1  1  0  1  0  1  0  1  1  0
I don't see this functionality in pandas, at least. But in TensorFlow we do have
tf.one_hot(
    indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None
)
Set depth to 2.
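For example, a minimal sketch (assuming TensorFlow 2.x eager mode) that forces two columns even for an all-zero sublist:
import tensorflow as tf

feature = [0, 0, 0, 0, 0]               # a sublist with only one distinct value
encoded = tf.one_hot(feature, depth=2)  # depth=2 guarantees two columns
print(encoded.numpy())
# [[1. 0.]
#  [1. 0.]
#  [1. 0.]
#  [1. 0.]
#  [1. 0.]]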

How to clean up columns with values like '10-12' (represented as ranges) in a pandas dataframe?

I have a car sales price dataset, where I am trying to predict the sale price given the features of a car. I have a variable called 'Fuel Economy city' which has values like 10, 12, 10-12, 13-14, ... in a pandas dataframe. I need to convert this to numerical values to apply a regression algorithm. I don't have domain knowledge about automobiles. Please help.
I tried removing the hyphen, but then a value like 10-12 becomes the four-digit number 1012, which I don't think is correct in this context.
You could try pd.get_dummies(), which will make a separate column for each of the distinct values and ranges, marking each column True (1) or False (0). These columns can then be used in lieu of the ranges (which are categorical data).
import pandas as pd
data = [[10,"blue", "Ford"], [12,"green", "Chevy"],["10-12","white", "Chrysler"],["13-14", "red", "Subaru"]]
df = pd.DataFrame(data, columns = ["Fuel Economy city", "Color", "Make"])
print(df)
df = pd.get_dummies(df)
print(df)
OUTPUT:
   Fuel Economy city_10  Fuel Economy city_12  Fuel Economy city_10-12  \
0                     1                     0                        0
1                     0                     1                        0
2                     0                     0                        1
3                     0                     0                        0

   Fuel Economy city_13-14  Color_blue  Color_green  Color_red  Color_white  \
0                         0           1            0          0            0
1                         0           0            1          0            0
2                         0           0            0          0            1
3                         1           0            0          1            0

   Make_Chevy  Make_Chrysler  Make_Ford  Make_Subaru
0           0              0          1            0
1           1              0          0            0
2           0              1          0            0
3           0              0          0            1
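If the regression would rather have a single number per row, a hedged alternative sketch is to map each range to its midpoint (my assumption of a reasonable convention, not domain knowledge):
import pandas as pd

def midpoint(value):
    # "10-12" -> 11.0; plain values like 10 or "13" pass through as floats
    parts = str(value).split("-")
    return sum(float(p) for p in parts) / len(parts)

df["Fuel Economy city"] = df["Fuel Economy city"].apply(midpoint)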

Construct NetworkX graph from Pandas DataFrame

I'd like to create some NetworkX graphs from a simple Pandas DataFrame:
      Loc 1  Loc 2  Loc 3  Loc 4  Loc 5  Loc 6  Loc 7
Foo       0      0      1      1      0      0      0
Bar       0      0      1      1      0      1      1
Baz       0      0      1      0      0      0      0
Bat       0      0      1      0      0      1      0
Quux      1      0      0      0      0      0      0
Where Foo… is the index, and Loc 1 to Loc 7 are the columns. But converting to NumPy matrices or recarrays doesn't seem to work for generating input for nx.Graph(). Is there a standard strategy for achieving this? I'm not averse to reformatting the data in Pandas --> dumping to CSV --> importing to NetworkX, but it seems as if I should be able to generate the edges from the index and the nodes from the values.
NetworkX expects a square matrix (of nodes and edges), perhaps* you want to pass it:
In [11]: df2 = pd.concat([df, df.T]).fillna(0)
Note: It's important that the index and columns are in the same order!
In [12]: df2 = df2.reindex(df2.columns)
In [13]: df2
Out[13]:
Bar Bat Baz Foo Loc 1 Loc 2 Loc 3 Loc 4 Loc 5 Loc 6 Loc 7 Quux
Bar 0 0 0 0 0 0 1 1 0 1 1 0
Bat 0 0 0 0 0 0 1 0 0 1 0 0
Baz 0 0 0 0 0 0 1 0 0 0 0 0
Foo 0 0 0 0 0 0 1 1 0 0 0 0
Loc 1 0 0 0 0 0 0 0 0 0 0 0 1
Loc 2 0 0 0 0 0 0 0 0 0 0 0 0
Loc 3 1 1 1 1 0 0 0 0 0 0 0 0
Loc 4 1 0 0 1 0 0 0 0 0 0 0 0
Loc 5 0 0 0 0 0 0 0 0 0 0 0 0
Loc 6 1 1 0 0 0 0 0 0 0 0 0 0
Loc 7 1 0 0 0 0 0 0 0 0 0 0 0
Quux 0 0 0 0 1 0 0 0 0 0 0 0
In[14]: graph = nx.from_numpy_matrix(df2.values)
This doesn't pass the column/index names to the graph, if you wanted to do that you could use relabel_nodes (you may have to be wary of duplicates, which are allowed in pandas' DataFrames):
In [15]: graph = nx.relabel_nodes(graph, dict(enumerate(df2.columns))) # is there nicer way than dict . enumerate ?
*It's unclear exactly what the columns and index represent for the desired graph.
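As a hedged aside: newer NetworkX versions can consume the adjacency dataframe directly and keep the index/column labels as node names, which removes the relabel step:
import networkx as nx

# assumes df2 is the square adjacency dataframe built above
graph = nx.from_pandas_adjacency(df2)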
A little late answer, but networkx can now read data from pandas dataframes directly; in that case the ideal format for a simple directed graph is the following:
+----------+---------+---------+
| Source | Target | Weight |
+==========+=========+=========+
| Node_1 | Node_2 | 0.2 |
+----------+---------+---------+
| Node_2 | Node_1 | 0.6 |
+----------+---------+---------+
If you are using adjacency matrices then Andy Hayden is right: you should take care of the correct format. Since your question uses 0s and 1s, I guess you would like to see an undirected graph. It may seem counterintuitive at first, since you said the index represents e.g. a person and the columns represent groups a given person belongs to, but it is also correct the other way around: a group (membership) belongs to a person. Following this logic, you should actually put the groups in the index and the persons in the columns too.
Just a side note: you can also frame this problem as a directed graph, for example if you would like to visualize an association network of hierarchical categories. There, the association from e.g. Samwise Gamgee to Hobbits is usually stronger than in the other direction (since Frodo Baggins is more likely the Hobbit prototype).
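A minimal sketch of loading such an edge list, assuming the Source/Target/Weight layout above (the node names here are illustrative):
import networkx as nx
import pandas as pd

edges = pd.DataFrame({'Source': ['Node_1', 'Node_2'],
                      'Target': ['Node_2', 'Node_1'],
                      'Weight': [0.2, 0.6]})
# create_using=nx.DiGraph keeps the two directions as distinct weighted edges
g = nx.from_pandas_edgelist(edges, 'Source', 'Target', ['Weight'],
                            create_using=nx.DiGraph)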
You can also use scipy to create the square matrix, like this:
import scipy.sparse as sp

cols = df.columns
X = sp.csr_matrix(df.astype(int).values)
Xc = X.T * X   # sparse matrix product: co-occurrence counts between columns
Xc.setdiag(0)  # zero the diagonal (ignore self co-occurrence)
# create a dataframe from the co-occurrence matrix in dense format
df = pd.DataFrame(Xc.todense(), index=cols, columns=cols)
Later on you can create an edge list from the dataframe and import it into NetworkX:
df = df.stack().reset_index()
df.columns = ['source', 'target', 'weight']
df = df[df['weight'] != 0]  # drop zero-weight pairs (no edge)
g = nx.from_pandas_edgelist(df, 'source', 'target', ['weight'])