I want to create a dataset with dummy variables from the original data based on predefined bins. I have tried using loops and splits, but it's not efficient. I'd appreciate your help.
Original data:
data_dict = {"Age":[29,35,42,11,43],"Salary":[4380,3280,8790,1200,5420],
"Payments":[23190,1780,3400,12900,7822]}
df = pd.DataFrame(data_dict)
df
Predefined bins:
card_dict = {"Dummy Variable":["Age:(-inf,24)","Age:(24,35)","Age:(35,49)","Age:(49,60)","Age:(60,inf)",
"Payments:(-inf,7654)","Payments:(7654,9088)","Payments:(9088,12055)","Payments:(12055,inf)",
"Salary:(-inf,2300)","Salary:(2300,3800)","Salary:(3800,5160)",
"Salary:(5160,7200)","Salary:(7200,inf)"]}
card = pd.DataFrame(card_dict)
card
My code is as follows:
# for numerical variables
def prepare_numerical_data(data, scard):
    """
    function to create dummy variables from numerical columns
    """
    # numerical columns
    num_df = df.select_dtypes(exclude='object')
    num_cols = num_df.columns.values
    variable_names = list(set([val.split(':')[0] for val in scard['Dummy Variable']]))  # same columns used to create the scorecard
    num_variables = [x for x in variable_names if x in num_cols]  # select numerical variables only
    for i in num_variables:
        for j in scard['Dummy Variable']:
            if j.split(":")[0] in num_variables:
                for val in data[i].unique():
                    if (val > float(j.split(':')[1].split(',')[0][1:])) & (val <= float(j.split(':')[1].split(',')[1][:-1])):
                        data.loc[data[i] == val, j] = 1
                    else:
                        data.loc[data[i] == val, j] = 0
    return data
Here are the results:
result_df = prepare_numerical_data(df,card)
result_df
The results are not correct for the Salary and Payments columns: the function did not create the right dummies for those two columns the way it did for Age. How can I fix that?
This worked for me. Initially my code was not looping through every column in the dataframe.
def create_dummies(data, card):
    # specify numerical and categorical columns
    num_df = data.select_dtypes(exclude='object')
    cat_df = data.select_dtypes(exclude=['float', 'int'])
    num_cols = num_df.columns.values
    cat_cols = cat_df.columns.values
    # create dummies for numerical columns
    for j in num_df.columns:
        all_value = num_df[j].values
        for variable_v in all_value:
            for i in card["Dummy Variable"].values:
                if i.split(":")[0] in num_cols:
                    var1 = i.split(":")
                    val1 = float(var1[1].strip("()").strip("[]").split(",")[0])
                    val2 = float(var1[1].strip("()").strip("[]").split(",")[1])
                    variable = var1[0]
                    if variable.lower() == j.lower():
                        if variable_v >= val1 and variable_v < val2:
                            num_df.loc[num_df[j] == variable_v, i] = 1
                        else:
                            num_df.loc[num_df[j] == variable_v, i] = 0
    return num_df
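If you want to avoid the nested loops entirely, here is a minimal vectorized sketch built on pd.cut and pd.get_dummies. The helper name make_bin_dummies is mine, and it assumes the labels in card are listed in ascending order per variable and that bins are left-closed like in create_dummies; flip right= if your scorecard uses right-closed bins.

import pandas as pd

def make_bin_dummies(data, card):
    """Create 0/1 dummy columns from interval labels like 'Age:(-inf,24)'."""
    out = data.copy()
    labels = card['Dummy Variable']
    # group the labels by the variable name in front of the colon
    for col, group in labels.groupby(labels.str.split(':').str[0]):
        if col not in data.columns:
            continue
        # lower edge of each bin, assuming ascending order, plus +inf as the last edge
        edges = [float(lbl.split(':')[1].strip('()').split(',')[0]) for lbl in group]
        edges.append(float('inf'))
        # right=False gives [lower, upper) bins, matching create_dummies above
        binned = pd.cut(data[col], bins=edges, labels=list(group), right=False)
        out = out.join(pd.get_dummies(binned))
    return out

result_df = make_bin_dummies(df, card)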
Related
I have a dataframe which contains multiple columns. I want to filter down the results based on unique values, then scale them between zero and one, and finally group them into three categories. I used nested for loops, but I'm sure this is not an efficient way to solve it. Can someone show how to achieve this using a map function or any other better approach?
The dataframe (df) looks like below; all columns except the value column are categorical.
prod_id tier1 tier2 tier3 tier4 value
X X X X X 3
X X X X X 2
X X X X X 6
from tqdm import tqdm
from sklearn.preprocessing import minmax_scale

grouping = (df.groupby(["tier1", "tier2", "tier3", "tier4"])
              .agg({'prod_id': lambda x: len(pd.unique(x))})
              .reset_index()
              .sort_values(by='prod_id', ascending=False))
# Selecting the tier combinations with the most products (100 or more)
df1 = grouping[grouping['prod_id'] >= 100]
my_data = pd.DataFrame([])
for a in tqdm(df1['tier1'].unique()):
    for b in df1['tier2'].unique():
        for c in df1['tier3'].unique():
            for d in df1['tier4'].unique():
                data = df[(df['tier1'] == a) & (df['tier2'] == b) &
                          (df['tier3'] == c) & (df['tier4'] == d)]
                # print(data.shape)
                if data.shape[0] != 0:
                    data['scaled'] = minmax_scale(data['value'])
                    data['target_class'] = pd.cut(data['scaled'], [0, 0.25, 0.75, 1],
                                                  labels=['Low', 'Medium', 'High'])
                    my_data = my_data.append(data, ignore_index=True)
                else:
                    pass
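For comparison, here is a hedged sketch of the same steps without the four nested loops, using groupby/transform. The column names are taken from the question; the per-group min-max scaling is meant to reproduce what minmax_scale does within each tier combination.

import pandas as pd

# keep only tier combinations that have at least 100 unique products
counts = df.groupby(['tier1', 'tier2', 'tier3', 'tier4'])['prod_id'].transform('nunique')
subset = df[counts >= 100].copy()

# min-max scale 'value' within each tier combination
# (groups where every value is equal come out as NaN here, unlike minmax_scale, which returns 0)
grp = subset.groupby(['tier1', 'tier2', 'tier3', 'tier4'])['value']
subset['scaled'] = (subset['value'] - grp.transform('min')) / (grp.transform('max') - grp.transform('min'))

# bucket the scaled values into three classes; include_lowest keeps the group minimum from becoming NaN
subset['target_class'] = pd.cut(subset['scaled'], [0, 0.25, 0.75, 1],
                                labels=['Low', 'Medium', 'High'], include_lowest=True)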
I am calculating correlations and the data frame I have needs to be filtered.
Starting with the first row and looping through the dataframe all the way to the last row, I want to remove the rows below the current row whose values are within X (above or below) of it.
example:
df['y'] has the values 50,51,52,53,54,55,70,71,72,73,74,75
If X = 10, it would start at 50 and see 51, 52, 53, 54, 55 as within that ±10 range and delete those rows. 70 would stay, as it is not within that range, and the same test would start again at 70, where 71, 72, 73, 74, 75 and their respective rows would be deleted.
The filter with X = 10 would thus leave us with the rows containing 50 and 70 in df.
It would leave me with a clean dataframe that drops the instances linked to the first instance of what is essentially the same observed period. I tried coding a loop to do that, but I am left with the wrong result and am desperate at this point. Hopefully someone can correct the mistake or point me in the right direction.
df6['index'] = df6.index
df6.sort_values('index')
boom = len(dataframe1.index) / 3
# Taking initial comparison values from first row
c = df6.iloc[0]['index']
# Including first row in result
filters = [True]
# Skipping first row in comparisons
for index, row in df6.iloc[1:].iterrows():
    if c - boom <= row['index'] <= c + boom:
        filters.append(False)
    else:
        filters.append(True)
        # Updating values to compare based on latest accepted row
        c = row['index']
df2 = df6.loc[filters].sort_values('correlation').drop('index', 1)
df2
(screenshots showing the dataframe output before and after filtering)
IIUC, your main issue is to filter consecutive values within a threshold.
You can use a custom function that acts on a Series (a column) and returns the list of valid indices:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = []
    for i, val in s.items():  # .iteritems() was removed in pandas 2.0; .items() works in both
        if val - prev > threshold:
            idx.append(i)
            prev = val
    return idx
Example of use:
import pandas as pd
df = pd.DataFrame({'y': [50,51,52,53,54,55,70,71,72,73,74,75]})
df2 = df.loc[consecutive(df['y'])]
Output:
y
0 50
6 70
Variant
If you prefer the function to return a boolean indexer, here is a variant:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = [False] * len(s)
    for pos, (_, val) in enumerate(s.items()):
        if val - prev > threshold:
            idx[pos] = True  # positional, so it also works with a non-default index
            prev = val
    return idx
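Used the same way as above, just with plain boolean indexing:

df2 = df[consecutive(df['y'])]
# same result: rows 0 (y=50) and 6 (y=70)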
I have a dataframe like the example shown further below.
There are actually 120,000 rows in this data and 20,000 users; what's shown is just one user. For every user I need to make sure the predictions are three "1"s and three "0"s.
I wrote the following function to do that:
import numpy as np

def check_prediction_quality(df):
    df_n = df.copy()
    unique = df_n['userID'].unique()
    for i in range(len(unique)):
        ex_df = df[df['userID'] == unique[i]]
        v = ex_df['prediction'].tolist()
        v_bool = [i == 0 for i in v]
        if sum(v_bool) != 3:
            if sum(v_bool) > 3:
                res = [i for i, val in enumerate(v_bool) if val]
                diff = sum(v_bool) - 3
                for i in range(diff):
                    idx = np.random.choice(res, 1)[0]
                    v[idx] = float(1)
                    res.remove(idx)
            elif sum(v_bool) < 3:
                res = [i for i, val in enumerate(v_bool) if not val]
                diff = 3 - sum(v_bool)
                for i in range(diff):
                    idx = np.random.choice(res, 1)[0]
                    v[idx] = float(0)
                    res.remove(idx)
        for j in range(len(v)):
            df_n.loc[(0 + i * 6) + j:(6 + i * 6) + j, 'prediction'] = v[j]
    return df_n
However, when I check whether the number of "0"s and "1"s are the same, it turns out they are not. I am not sure what I did wrong.
sum([i == 0 for i in df['prediction']])
should be six using the example below, but when I run it on my 120,000-row dataframe, it does not have 60,000 of each.
data = {'userID': [199810,199810,199810,199810,199810,199810,199812,199812,199812,199812,199812,199812],
'trackID':[1,2,3,4,5,6,7,8,9,10,11,12],
'prediction':[0,0,0,0,1,1,1,1,1,1,0,0]
}
df = pd.DataFrame(data = data)
df
Much appreciated!
When working with pandas dataframes, you should reassign the post-processed DataFrame to the old one.
df = pd.DataFrame(np.array(...))
# reassignment:
df.loc[:, 3:5] = df.loc[:, 3:5] * 10  # this multiplies columns 3 to 5 by 10
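For checking the result per user rather than over the whole frame, a small groupby sketch like this (using the column names from the sample data above) will list any users that do not have exactly three 0s and three 1s:

# count zeros and ones per user; any row printed here violates the 3/3 rule
check = df.groupby('userID')['prediction'].agg(
    zeros=lambda s: (s == 0).sum(),
    ones=lambda s: (s == 1).sum())
print(check[(check['zeros'] != 3) | (check['ones'] != 3)])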
Actually, never mind. I found out I don't have to modify the "0"s and "1"s.
I have a pandas dataframe looking like the following picture:
The goal here is to select the fewest rows such that every column has a "1" in at least one of them. In this scenario, the final selection should be these two rows:
The algorithm should work even if I add columns and rows. It should also work if I change the combination of 1 and 0 in any given row.
Sum per row, then compare with Series.ge (>=) for greater-or-equal and filter by boolean indexing:
df[df.sum(axis=1).ge(2)]
If you want to test for 1 or 0 values, first compare with DataFrame.eq (==):
df[df.eq(1).sum(axis=1).ge(2)]
df[df.eq(0).sum(axis=1).ge(2)]
For those interested, this is how I managed to do it:
def _getBestRowsFinalSelection(self, df, cols):
    """
    Get the selected rows for the final selection

    Parameters:
        1. df: Dataframe to use
        2. cols: Columns of the binary variables in the Dataframe object (df)

    RETURNS -> DataFrame : dfSelected
    """
    isOne = df.loc[df[df.loc[:, cols] == 1].sum(axis=1) > 0, :]
    lstIsOne = isOne.loc[:, cols].values.tolist()
    lstIsOne = [(x, lstItem) for x, lstItem in zip(isOne.index.values.tolist(), lstIsOne)]
    winningComb = None
    stopFlag = False
    for i in range(1, isOne.shape[0] + 1):
        if stopFlag:
            break
        combs = combinations(lstIsOne, i)  # from itertools
        for c in combs:
            data = [x[1] for x in c]
            index = [x[0] for x in c]
            dfTmp = pd.DataFrame(data=data, columns=cols, index=index)
            if (dfTmp.sum() > 0).all():
                dfTmp["Final Selection"] = "Yes"
                winningComb = dfTmp
                stopFlag = True
                break
    return winningComb
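For anyone reusing this outside its class, here is a minimal usage sketch; the sample frame is made up, and self has to be dropped from the signature for a standalone call:

from itertools import combinations
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1],
                   'B': [0, 1, 0],
                   'C': [0, 1, 1]})
cols = ['A', 'B', 'C']
# assumes the function above is defined without the self parameter
selected = _getBestRowsFinalSelection(df, cols)
print(selected)  # the smallest combination of rows covering every column with a 1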
I have two dataframes (df and df_flagMax) that are not the same size. I need help with how to structure a comparison of two dataframes that differ in size; I want to compare the rows of both dataframes.
df = pd.read_excel('df.xlsx')
df_flagMax = df.groupby(['Name'], as_index=False)['Max'].max()
df['flagMax'] = 0
num = len(df)
for i in range(num):
    colMax = df.at[i, 'Name']
    df['flagMax'][(df['Max'] == colMax)] = 1
print(df)
df_flagMax data:
Name Max
0 Sf 39.91
1 Th -25.74
df data:
For example: I want to compare 'Sf' from both df and df_flagMax and then perform this line:
df['flag'][(df['Max'] == colMax)] = 1
if and only if 'Sf' is in both dataframes at the same row index. The same goes for the next Name value, 'Th'.
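One way to structure the comparison is to merge df_flagMax back into df on Name and compare the Max columns row by row. Here is a hedged sketch using the column names from the question; the _groupMax suffix is made up, and whether this matches the intended "same row index" rule is an assumption:

# bring each Name's maximum back onto every row of df
merged = df.merge(df_flagMax, on='Name', how='left', suffixes=('', '_groupMax'))

# flag the rows whose Max equals the per-Name maximum
merged['flagMax'] = (merged['Max'] == merged['Max_groupMax']).astype(int)
print(merged)

An equivalent shortcut that skips the merge entirely is df['flagMax'] = (df['Max'] == df.groupby('Name')['Max'].transform('max')).astype(int).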