I am writing a for loop to encode all of the categorical values in my dataset. I have plenty of categorical columns, and initially the for loop works with a label encoder, but now I am trying to include a OneHotEncoder instead of calling get_dummies on a separate line.
sample data:
              STYP_DESC  Gender       RACE_DESC DEGREE               MAJR_DESC1 FTPT  Target
0                   New  Female           White     BA  Business Administration   FT       1
1  New 1st Time Freshmn  Female           White     BA               Studio Art   FT       1
2                   New    Male           White   MBAX  Business Administration   FT       1
3                   New  Female         Unknown     JD             Juris Doctor   PT       1
4                   New  Female  Asian-American   MBAX  Business Administration   PT       1
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore', drop='first')

le_count = 0
enc_count = 0
for col in X_train.columns[1:]:
    if X_train[col].dtype == 'object':
        if len(list(X_train[col].unique())) <= 2:
            le.fit(X_train[col])
            X_train[col] = le.transform(X_train[col])
            le_count += 1
        else:
            enc.fit(X_train[[col]])
            X_train[[col]] = enc.transform(X_train[[col]])
            enc_count += 1

print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))
But when I run it, I get no errors, yet the encoding is super weird, with a slew of tuples being inserted into my new dataset.
When I run the code without everything in the else clause, it runs fine and I can simply use get_dummies to encode the other variables.
The only issue is that when I use get_dummies with drop_first set to True, I lose track of what is supposed to be 0 and what is supposed to be 1 (this is a major issue for tracking Gender and FTPT).
Any suggestions? I would use get_dummies, but since I'm doing the preprocessing stage after splitting my data, I'm worried about a category possibly being dropped.
Change the transform line in the else part as below:
X_train[col] = enc.transform(X_train[[col]]).toarray()
(OneHotEncoder returns a sparse matrix by default; the coordinate pairs in its string representation are likely the "tuples" you were seeing, and .toarray() converts the result to a dense array.)
Here I'm copying the full code so you can try it directly. If it still fails, the error may be in some other part of your code, so please check:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

styp = ['New', 'New 1st Time Freshmn', 'New', 'New', 'New']
gend = ['Female', 'Female', 'Male', 'Female', 'Female']
race = ['White', 'White', 'Unknown', 'Unknown', 'Asian-American']
deg = ['BA', 'BA', 'MBAX', 'JD', 'MBAX']
maj = ['Business Administration', 'Studio Art', 'Business Administration', 'Juris Doctor', 'Business Administration']
ftpt = ['FT', 'FT', 'FT', 'PT', 'PT']

df = pd.DataFrame({'STYP_DESC': styp, 'Gender': gend, 'RACE_DESC': race, 'DEGREE': deg,
                   'MAJR_DESC1': maj, 'FTPT': ftpt})

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore', drop='first')
le_count = 0
enc_count = 0
for col in df.columns[1:]:
    if df[col].dtype == 'object':
        if len(list(df[col].unique())) <= 2:
            le.fit(df[col])
            df[col] = le.transform(df[col])
            le_count += 1
        else:
            enc.fit(df[[col]])
            df[col] = enc.transform(df[[col]]).toarray()
            enc_count += 1

print(df)
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))
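A side note on the asker's concern about losing track of which category became 0 and which became 1: the fitted encoders keep the categories they learned, so the mapping can be recovered rather than guessed. Bear in mind that the loop above refits the same LabelEncoder on every binary column, so only the last column's mapping survives in le; here is a minimal sketch that records one mapping per column, reusing the raw lists defined above:

# LabelEncoder sorts its classes, so classes_[i] is the category encoded as i
mappings = {}
for col, raw in [('Gender', gend), ('FTPT', ftpt)]:
    fitted = LabelEncoder().fit(raw)
    mappings[col] = dict(zip(fitted.classes_, range(len(fitted.classes_))))
print(mappings)
# {'Gender': {'Female': 0, 'Male': 1}, 'FTPT': {'FT': 0, 'PT': 1}}

For the one-hot columns, enc.categories_ lists the categories seen per encoded column; with drop='first', the first entry of each list is the dropped baseline.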
I have a data frame which contains a text column, df["input"].
I would like to create new dummy variables from df["input"] by checking whether it contains any of the words in a given list. The logic is: 1) create a dummy variable that equals zero; 2) set it to one if the text contains any word in the given list and no earlier list has already matched that row.
# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})
which looks like:
input
amazon listing subtitle
medical
film biotechnology dentist
final dataset should look like:
input                       listings  scripting  medical
amazon listing subtitle            1          0        0
medical                            0          0        1
film biotechnology dentist         0          1        0
One possible implementation is to use str.contains in a loop to create the 3 columns, then use idxmax to get the column name (or the list name) of the first match, then create a dummy variable from these matches:
import numpy as np

d = {'listings': listings, 'scripting': scripting, 'medical': medical}
for k, v in d.items():
    df[k] = df['input'].str.contains('|'.join(v))

arr = df[list(d)].to_numpy()
tmp = np.zeros(arr.shape, dtype='int8')
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)
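One detail worth knowing about this approach: combine_first aligns on the union of the columns, which (at least in current pandas) comes back sorted alphabetically, so out ends up with the column order input, listings, medical, scripting. If you want the dictionary order back, a one-line reindex fixes it (list(d) preserves the insertion order of d):

out = out[['input'] + list(d)]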
But in this case, it might be more efficient to use a nested for-loop:
import re

def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match;
        # if there's a match, break the loop and pad 0s since we don't care about further matches
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp += [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out
lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))
Output:
                        input  listings  scripting  medical
0     amazon listing subtitle         1          0        0
1                     medical         0          0        1
2  film biotechnology dentist         0          1        0
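One hedged refinement: the terms in the example lists are plain words, but if a search term could ever contain regex metacharacters (a dot, parentheses, a plus sign), escape it before embedding it in the pattern, e.g.:

tmp.append(int(any(True for x in lst if re.search(fr"\b{re.escape(x)}\b", row))))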
Here is a simpler, more pandas-vectorized solution:
import numpy as np
import pandas as pd

patterns = {}  # <-- dictionary
patterns["listings"] = ["amazon listing", "ecommerce", "products"]
patterns["scripting"] = ["subtitle", "film", "dubbing"]
patterns["medical"] = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})

# ---------------------------------------------------------------#
# step 1: for each column create a regular expression
for col, items in patterns.items():
    # create a regex pattern (word1|word2|word3)
    pattern = f"({'|'.join(items)})"
    # find the pattern in the input column
    df[col] = df['input'].str.contains(pattern, regex=True).astype(int)

# step 2: if the value to the left is 1, change the current value to 0
## 2.1 create a mask:
## shift the columns one step to the right
## --> True where the column to the left holds the same value as the current column
mask = (df == df.shift(axis=1)).values

# subtract the mask from the df
## and clip the result --> negative values become 0
df.iloc[:, 1:] = np.clip(df.iloc[:, 1:] - mask[:, 1:], 0, 1)
print(df)
Result
                        input  listings  scripting  medical
0     amazon listing subtitle         1          0        0
1                     medical         0          0        1
2  film biotechnology dentist         0          1        0
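To see why step 2 works, it helps to look at the intermediate frame after step 1, before the mask is subtracted (every matching list gets a 1, so a row can hold several):

                        input  listings  scripting  medical
0     amazon listing subtitle         1          1        0
1                     medical         0          0        1
2  film biotechnology dentist         0          1        1

The mask is True wherever a column repeats the value of its left neighbour (scripting in row 0, medical in row 2), and subtracting it zeroes exactly those repeated 1s. Note the limitation: it only clears a 1 whose immediate left neighbour holds the same value, which is enough for this example data.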
Great question and good answers (I somehow missed it yesterday)! Here's another variation with .str.extractall():
search = {"listings": listings, "scripting": scripting, "medical": medical, "dummy": []}
pattern = "|".join(
f"(?P<{column}>" + "|".join(r"\b" + s + r"\b" for s in strings) + ")"
for column, strings in search.items()
)
result = (
df["input"].str.extractall(pattern).assign(dummy=True).groupby(level=0).any()
.idxmax(axis=1).str.get_dummies().drop(columns="dummy")
)
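For reference, with the example lists the generated pattern comes out as follows (wrapped here for readability):

(?P<listings>\bamazon listing\b|\becommerce\b|\bproducts\b)
|(?P<scripting>\bsubtitle\b|\bfilm\b|\bdubbing\b)
|(?P<medical>\bmedical\b|\bbiotechnology\b|\bdentist\b)
|(?P<dummy>)

The empty dummy group matches everywhere, which guarantees every input row survives the groupby; since idxmax(axis=1) returns the first column holding the maximum, a real match always wins over dummy, and rows with no real match end up labelled dummy, whose indicator column is dropped at the end, leaving those rows all zeros.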
I have many different tables that all have different column names, each referring to an outcome like glucose, insulin, leptin, etc. (keep in mind that the tables are all gigantic and messy, with tons of other columns in them as well).
I am trying to generate a report that starts empty and then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example below (ignore that the function makes little sense). The code works, but instead of copying and pasting final_report["outcome"] = over and over again, I would like to run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result", and "leptin_result" columns to final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})

ids = [1, 2, 3, 4]
start = [1, 1, 1, 1]
end = [6, 6, 6, 6]
final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})

def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()
final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
final_report[f"{name}_result"] = final_report.apply(
lambda x: find_result(x['id'], x['start'], x['end'], df),
axis=1
)
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
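If you would rather build all of the result columns in a single expression, here is a small variation on the same idea (a sketch, reusing input_dfs and find_result from above):

# build every result column in one pass, then attach them all at once
results = {
    f"{name}_result": final_report.apply(
        lambda x, df=df: find_result(x['id'], x['start'], x['end'], df), axis=1
    )
    for name, df in input_dfs.items()
}
final_report = final_report.assign(**results)

The df=df default argument pins each dataframe to its own lambda; apply runs eagerly inside the comprehension, so this mainly guards against the classic late-binding surprise if the code is later refactored.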
I have the following gist that implements the TrueSkill algorithm over a dataset that is a collection of n-player free-for-all games, where there is a winner, then second, third, fourth place, etc.
Basically what I am asking is:
Have I implemented the algorithm correctly?
What's the most efficient way to backfill data in a Pandas dataframe?
Another solution:
import pandas as pd
from collections import ChainMap, defaultdict
from trueskill import Rating, rate

# Fetch the data
df = pd.read_csv('http://horse-data-abcd.s3.amazonaws.com/game_results.csv')
# sample_df = df[df['game_id'].isin([1592008, 1592012, 1592238, 1500903])].copy()
sample_df = df.head(10000)

def rate_games(df):
    # Store ratings for each user; this makes it easier to keep track of each
    # Rating between games than in the earlier implementation.
    # Ratings are initialized with the default values mu=25.0, sigma=8.333
    trueskills = {user_id: Rating() for user_id in df['player_id'].unique()}
    # Group by the game_id
    games = df.sort_values('date').groupby('game_id')
    dataframes_list = []
    # Now iterate the games
    for game_id, game in games:
        # Sort by position (from best to last); this way the rate function
        # figures out which player was best automatically
        sorted_players = game.sort_values('position').iterrows()
        # rate() accepts input in the form [{some_label: Rating}, ...] and returns
        # a list in the same form with new Ratings; using [{row_id: Rating}, ...]
        # lets us merge the rating data back into the original dataframe
        trueskills_dicts = [{index: trueskills.get(row['player_id'], Rating())} for index, row in sorted_players]
        flat_trueskill_dict = dict(ChainMap(*trueskills_dicts))
        # Get the results from the rate method
        try:
            results = rate(trueskills_dicts)  # returns [{row_id: new Rating}, ...]
        except ValueError:
            results = trueskills_dicts
        # Convert the results list of dicts into one dict:
        # {row_id_1: rating, row_id_2: rating}
        flat_results_dict = dict(ChainMap(*results))
        # This section prepares a small dataframe looking like this, sorted from best to worst:
        # index | mu | sigma | post_mu | post_sigma
        # 3245  | 25 | 8.33  | 27.0    | 5.0
        # 1225  | 25 | 8.33  | 26.0    | 5.0
        df_columns = defaultdict(list)
        df_index = []
        for index, new_rating in flat_results_dict.items():
            df_index.append(index)
            previous_rating = flat_trueskill_dict.get(index, Rating())
            df_columns['mu'].append(previous_rating.mu)
            df_columns['sigma'].append(previous_rating.sigma)
            df_columns['post_mu'].append(new_rating.mu)
            df_columns['post_sigma'].append(new_rating.sigma)
        # This dataframe has the same index as the main one, so we can
        # easily join it at the end of the function
        df_results = pd.DataFrame(
            df_columns,
            index=df_index
        )
        dataframes_list.append(df_results)
        # OK, all calculations done; update the trueskills and advance to the next game
        trueskills.update({game.loc[index, 'player_id']: rating for index, rating in flat_results_dict.items()})
    # Last thing: take the list of dataframes with calculated ratings and join it with the main dataframe
    concatenated_df = pd.concat(dataframes_list)
    df = df.join(concatenated_df)
    df['player_id'] = df['player_id'].fillna(0)  # avoids inplace fillna on a slice
    return df

sample_df = rate_games(sample_df)
sample_df
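If the dict-based form of rate() used above is unfamiliar: the trueskill package accepts rating groups either as tuples or as dicts, and with dicts it returns the new Ratings under the same keys, with groups ordered from first to last place. A minimal sketch (assuming the trueskill package is installed):

from trueskill import Rating, rate

# three one-player teams, listed from first to last place
new_ratings = rate([{'p1': Rating()}, {'p2': Rating()}, {'p3': Rating()}])
print(new_ratings)  # [{'p1': Rating(mu=..., sigma=...)}, {'p2': ...}, {'p3': ...}]

This is also why rate() is wrapped in a try/except above: with fewer than two rating groups (a single-player game) trueskill raises a ValueError.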
Here is what I have come up with; it can probably be optimised for speed.
import pandas as pd
from trueskill import Rating, rate

# Fetch the data
df_raw = pd.read_csv('http://horse-data-abcd.s3.amazonaws.com/game_results.csv')

# Create a holding DataFrame for our TrueRank
df_truerank_columns = ['game_id', 'player_id', 'position', 'mu', 'sigma', 'post_mu', 'post_sigma']
df_truerank = pd.DataFrame(columns=df_truerank_columns)

# Use a sample of 10000
df = df_raw.head(10000)

# Group by the game_id
games = df.groupby('game_id')

# Now iterate the games
for game_id, game in games:
    # Set up lists so we can zip them back up at the end
    trueskills = []
    player_ids = []
    game_ids = []
    mus = []
    sigmas = []
    post_mus = []
    post_sigmas = []

    # Now iterate over each player in a game
    for index, row in game.iterrows():
        # Create a game_ids array for zipping up
        game_ids.append(game_id)

        # Push the player_id onto the player_ids array for zipping up
        player_ids.append(int(row['player_id']))

        # Get the player's last game, hence tail(1)
        filter = (df_truerank['game_id'] < game_id) & (df_truerank['player_id'] == row['player_id'])
        df_player = df_truerank[filter].tail(1)

        # If there isn't a previous game then just use the TrueSkill defaults
        if len(df_player) == 0:
            mu = 25
            sigma = 8.333
        else:
            # Otherwise get the mu and sigma from the player's last game
            row = df_player.iloc[0]
            mu = row['post_mu']
            sigma = row['post_sigma']

        # Keep lists of the pre-game mus and sigmas
        mus.append(mu)
        sigmas.append(sigma)

        # Now create a TrueSkill Rating() and append it to the trueskills list
        trueskills.append(Rating(mu=mu, sigma=sigma))

    # Use the positions as ranks; TrueSkill ranks are 0-based, so subtract 1 from each position
    ranks = [x - 1 for x in list(game['position'])]

    # Create tuples out of the trueskills list
    trueskills_tuples = [(x,) for x in trueskills]

    try:
        # Get the results from the TrueSkill rate method
        results = rate(trueskills_tuples, ranks=ranks)

        # Loop over the TrueSkill results and get the new mu and sigma for each player
        for result in results:
            post_mus.append(round(result[0].mu, 2))
            post_sigmas.append(round(result[0].sigma, 2))
    except:
        # If the TrueSkill rate method blows up, just use the previous
        # game's mus and sigmas
        post_mus = mus
        post_sigmas = sigmas

    # Change the positions back to 1-based
    positions = [x + 1 for x in ranks]

    # Now zip together all our lists
    data = list(zip(game_ids, player_ids, positions, mus, sigmas, post_mus, post_sigmas))

    # Create a temp DataFrame with the same columns as df_truerank and add the data to it
    df_temp = pd.DataFrame(data, columns=df_truerank_columns)

    # Add df_temp to our df_truerank
    df_truerank = df_truerank.append(df_temp)
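A hedged note on the second part of the question (efficient backfilling): DataFrame.append was deprecated and removed in pandas 2.0, and growing a frame inside a loop copies it on every iteration. Collecting the per-game frames in a list and concatenating once at the end is both forward-compatible and faster; a sketch of the change:

frames = []
for game_id, game in games:
    ...  # build df_temp exactly as above
    frames.append(df_temp)
df_truerank = pd.concat(frames, ignore_index=True)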
I have a set of data which I have put into a DataFrame and then binned:
print(data1)
[[-1.90658883e+00 5.66881290e-01 1.45443907e+00]
[-1.82926850e+00 2.53325112e-01 1.45480072e+00]
[-1.59073925e+00 5.33264011e-01 1.45461954e+00]
...
[ 2.86246982e+02 4.52961148e-01 6.19121328e+00]]
df = pd.DataFrame(data=data1)
print(df)
bins = [0, 50, 100, 150, 200, 250, 300, 400]
df1 = pd.cut(df[0], bins, labels=False)
print(df1)
1 0
2 0
..
500 4
501 4
502 5
0 through 5 are the bin labels. I want to be able to access the data in each bin/category and store it in a variable, something like this:
x = df1(4) # this doesn't work, just an example
meaning I want to access the data stored in the 4th bin of the dataframe and assign it to the variable x as an array, but I am unsure how to do that.
You can use pandas.DataFrame.loc and pass a boolean array to it.
bi = pd.cut(df[0], bins, labels=False)
x = df.loc[bi == 4]
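If you want one array per bin, you can also group the frame by the binned labels; a small sketch building on bi from above:

# dict mapping bin label -> the rows of df that fell into that bin, as arrays
binned = {label: group.to_numpy() for label, group in df.groupby(bi)}
x = binned[4]  # all rows whose df[0] value landed in the fifth bin (label 4)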