I have two dataframes (df and df_flagMax) that are not the same size, and I need help structuring a comparison between them. I want to compare the rows of both dataframes.
import pandas as pd

df = pd.read_excel('df.xlsx')
df_flagMax = df.groupby(['Name'], as_index=False)['Max'].max()
df['flagMax'] = 0
num = len(df)
for i in range(num):
    colMax = df.at[i, 'Name']
    df['flagMax'][(df['Max'] == colMax)] = 1
print(df)
df_flagMax data:
Name Max
0 Sf 39.91
1 Th -25.74
df data:
For example, I want to compare 'Sf' from both df and df_flagMax and then run this line:
df['flagMax'][(df['Max'] == colMax)] = 1
but only if 'Sf' is in both dataframes at the same row index. The same goes for the next Name value, 'Th'.
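One possible way to structure the comparison (a sketch only, with made-up data standing in for df since its contents were not shown) is to align the two frames on their shared row positions and compare the Name columns there:

import pandas as pd

# made-up stand-in for df; the real frame comes from pd.read_excel('df.xlsx')
df = pd.DataFrame({'Name': ['Sf', 'Th', 'Sf', 'Th'],
                   'Max':  [39.91, -30.00, 12.00, -25.74]})
df_flagMax = df.groupby(['Name'], as_index=False)['Max'].max()

df['flagMax'] = 0
# only row positions that exist in both frames can be compared
common = df.index.intersection(df_flagMax.index)
# True where the Name value is the same in both frames at the same row index
same_name = df.loc[common, 'Name'].eq(df_flagMax.loc[common, 'Name'])
df.loc[same_name[same_name].index, 'flagMax'] = 1
print(df)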
I want to create a dataset with dummy variables from the original data based on predefined bins. I have tried using loops and string splits, but it's not efficient. I'd appreciate your help.
import pandas as pd

## original data
data_dict = {"Age": [29, 35, 42, 11, 43],
             "Salary": [4380, 3280, 8790, 1200, 5420],
             "Payments": [23190, 1780, 3400, 12900, 7822]}
df = pd.DataFrame(data_dict)
df
Predefined bins:
card_dict = {"Dummy Variable":["Age:(-inf,24)","Age:(24,35)","Age:(35,49)","Age:(49,60)","Age:(60,inf)",
"Payments:(-inf,7654)","Payments:(7654,9088)","Payments:(9088,12055)","Payments:(12055,inf)",
"Salary:(-inf,2300)","Salary:(2300,3800)","Salary:(3800,5160)",
"Salary:(5160,7200)","Salary:(7200,inf)"]}
card = pd.DataFrame(card_dict)
card
My code is as follows:
# for numerical variables
def prepare_numerical_data(data, scard):
    """
    function to create dummy variables from numerical columns
    """
    # numerical columns
    num_df = df.select_dtypes(exclude='object')
    num_cols = num_df.columns.values
    variable_names = list(set([val.split(':')[0] for val in scard['Dummy Variable']]))  # to have the same columns used to create the scorecard
    num_variables = [x for x in variable_names if x in num_cols]  # select numerical variables only

    for i in num_variables:
        for j in scard['Dummy Variable']:
            if j.split(":")[0] in num_variables:
                for val in data[i].unique():
                    if (val > (float(j.split(':')[1].split(',')[0][1:]))) & (val <= (float(j.split(':')[1].split(',')[1][:-1]))):
                        data.loc[data[i] == val, j] = 1
                    else:
                        data.loc[data[i] == val, j] = 0
    return data
Here are the results:
result_df = prepare_numerical_data(df,card)
result_df
The results are not OK for the Salary and Payments columns. The function didn't create correct dummies for those two columns as it did for Age. How can I correct that?
This worked for me. Initially my code was not looping through every column in the dataframe.
def create_dummies(data, card):
    # specify numerical and categorical columns
    num_df = data.select_dtypes(exclude='object')
    cat_df = data.select_dtypes(exclude=['float', 'int'])
    num_cols = num_df.columns.values
    cat_cols = cat_df.columns.values

    # create dummies for numerical columns
    for j in num_df.columns:
        all_value = num_df[j].values
        for variable_v in all_value:
            for i in card["Dummy Variable"].values:
                if i.split(":")[0] in num_cols:
                    var1 = i.split(":")
                    val1 = float(var1[1].strip("()").strip("[]").split(",")[0])
                    val2 = float(var1[1].strip("()").strip("[]").split(",")[1])
                    variable = var1[0]
                    if variable.lower() == j.lower():
                        if variable_v >= val1 and variable_v < val2:
                            num_df.loc[num_df[j] == variable_v, i] = 1
                        else:
                            num_df.loc[num_df[j] == variable_v, i] = 0
    return num_df
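A possibly faster alternative (a sketch, not part of the answer above) is to parse the bin edges out of the card once and let pd.cut and pd.get_dummies do the binning; note that the generated column labels are pandas interval strings rather than the exact card labels, so a rename step would be needed if those must match.

import pandas as pd

def create_dummies_cut(data, card):
    """Sketch: build dummy columns with pd.cut/pd.get_dummies from the bin strings in card."""
    out = data.copy()
    for col in data.select_dtypes(exclude='object').columns:
        # collect this column's bin edges, e.g. "Salary:(2300,3800)" -> (2300.0, 3800.0)
        bounds = []
        for label in card['Dummy Variable']:
            name, interval = label.split(':')
            if name == col:
                lo, hi = interval.strip('()').split(',')
                bounds.append((float(lo), float(hi)))
        bounds.sort()
        edges = [bounds[0][0]] + [hi for _, hi in bounds]
        # right-open bins to match the ">= lower and < upper" rule used above
        binned = pd.cut(data[col], bins=edges, right=False)
        out = pd.concat([out, pd.get_dummies(binned, prefix=col, prefix_sep=':')], axis=1)
    return out

result_df = create_dummies_cut(df, card)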
I am trying to assign a value to a dataframe column based on a value that falls in between two values of another dataframe:
intervals = pd.DataFrame(columns = ['From','To','Value'], data = [[0,100,'A'],[100,200,'B'],[200,500,'C']])
print('intervals\n',intervals,'\n')
points = pd.DataFrame(columns = ['Point', 'Value'], data = [[45,'X'],[125,'X'],[145,'X'],[345,'X']])
print('points\n',points,'\n')
DesiredResult = pd.DataFrame(columns = ['Point', 'Value'], data = [[45,'A'],[125,'B'],[145,'B'],[345,'C']])
print('DesiredResult\n',DesiredResult,'\n')
Many thanks
Let's use map. First, create a Series using pd.IntervalIndex with the from_arrays method:
intervals = intervals.set_index(pd.IntervalIndex.from_arrays(intervals['From'],
intervals['To']))['Value']
points['Value'] = points['Point'].map(intervals)
Output:
Point Value
0 45 A
1 125 B
2 145 B
3 345 C
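One detail worth knowing (not part of the original answer): from_arrays builds intervals closed on the right by default, so a point sitting exactly on a boundary such as 100 would map to 'A' rather than 'B'. Passing closed='left' in the set_index call above makes the lower bound inclusive instead:

intervals = intervals.set_index(pd.IntervalIndex.from_arrays(intervals['From'],
                                                             intervals['To'],
                                                             closed='left'))['Value']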
Another approach:
def calculate_value(x):
    return intervals.loc[(x >= intervals['From']) & (x < intervals['To']), 'Value'].squeeze()

desired_result = points.copy()
desired_result['Value'] = desired_result['Point'].apply(calculate_value)
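Since the intervals in this example are contiguous, pd.cut is another option (a sketch, assuming the From/To values line up end to end):

import pandas as pd

intervals = pd.DataFrame(columns=['From', 'To', 'Value'],
                         data=[[0, 100, 'A'], [100, 200, 'B'], [200, 500, 'C']])
points = pd.DataFrame(columns=['Point', 'Value'],
                      data=[[45, 'X'], [125, 'X'], [145, 'X'], [345, 'X']])

# bin edges are the first 'From' plus every 'To'; labels come straight from 'Value'
edges = [intervals['From'].iloc[0]] + intervals['To'].tolist()
points['Value'] = pd.cut(points['Point'], bins=edges, labels=intervals['Value'].tolist())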
I have a pandas dataframe that looks like the following picture:
The goal is to select the smallest number of rows such that every column contains a "1". In this scenario, the final selection should be these two rows:
The algorithm should work even if I add columns and rows, and it should also work if I change the combination of 1s and 0s in any given row.
Sum per row, then compare with Series.ge (>=, greater or equal) and filter by boolean indexing:
df[df.sum(axis=1).ge(2)]
If you want to test for 1 or 0 values, first compare with DataFrame.eq (==):
df[df.eq(1).sum(axis=1).ge(2)]
df[df.eq(0).sum(axis=1).ge(2)]
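Since the original dataframe was only shown as a picture, here is a small made-up illustration of that filtering:

import pandas as pd

# made-up binary dataframe standing in for the one in the picture
df = pd.DataFrame({'A': [1, 0, 1, 0],
                   'B': [1, 1, 0, 0],
                   'C': [0, 1, 1, 0]})

# keep rows that contain at least two 1s
print(df[df.eq(1).sum(axis=1).ge(2)])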
For those interested, this is how I managed to do it:
from itertools import combinations

import pandas as pd

def _getBestRowsFinalSelection(self, df, cols):
    """
    Get the selected rows for the final selection

    Parameters:
    1. df: Dataframe to use
    2. cols: Columns of the binary variables in the Dataframe object (df)

    RETURNS -> DataFrame : dfSelected
    """
    isOne = df.loc[df[df.loc[:, cols] == 1].sum(axis=1) > 0, :]
    lstIsOne = isOne.loc[:, cols].values.tolist()
    lstIsOne = [(x, lstItem) for x, lstItem in zip(isOne.index.values.tolist(), lstIsOne)]

    winningComb = None
    stopFlag = False
    for i in range(1, isOne.shape[0] + 1):
        if stopFlag:
            break
        combs = combinations(lstIsOne, i)  # from itertools
        for c in combs:
            data = [x[1] for x in c]
            index = [x[0] for x in c]
            dfTmp = pd.DataFrame(data=data, columns=cols, index=index)
            if (dfTmp.sum() > 0).all():
                dfTmp["Final Selection"] = "Yes"
                winningComb = dfTmp
                stopFlag = True
                break
    return winningComb
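As a rough illustration of how this could be called (made-up data, and passing None for self since the method is shown here outside its class):

import pandas as pd

df = pd.DataFrame({'A': [1, 0, 0, 1],
                   'B': [0, 1, 0, 1],
                   'C': [1, 0, 1, 0]})
cols = ['A', 'B', 'C']

# rows 0 and 1 together have a 1 in every column, so those two rows come back
best = _getBestRowsFinalSelection(None, df, cols)
print(best)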
I have a dataset that I am trying to split into two smaller dataframes called test and train. The original dataset has two columns, "patient_nbr" and "encounter_id", both containing 6-digit values.
How can I go through the dataframe and add up all the digits in those two columns? For example, if the first row holds the values 123456 and 123456, I need to compute 1+2+3+4+5+6+1+2+3+4+5+6 = 42. That sum determines whether the row goes into test or train: if it is even, test; if it is odd, train.
Below is what I tried, but it is very slow. I turned the two columns into numpy arrays in order to break down and add up the digits, added those arrays together to get a single array of sums, and looped through it to decide which dataframe each row should go into.
from zipfile import ZipFile

import numpy as np
import pandas as pd

with ZipFile('dataset_diabetes.zip') as zf:
    with zf.open('dataset_diabetes/diabetic_data.csv', 'r') as f:
        df = pd.read_csv(f)

nums1 = []
nums2 = []

encounters = df["encounter_id"].values
for i in range(len(encounters)):
    result = 0
    while encounters[i] > 0:
        rem = encounters[i] % 10
        result = result + rem
        encounters[i] = int(encounters[i] / 10)
    nums1.append(result)

patients = df["patient_nbr"].values
for i in range(len(patients)):
    result = 0
    while patients[i] > 0:
        rem = patients[i] % 10
        result = result + rem
        patients[i] = int(patients[i] / 10)
    nums2.append(result)

nums = np.asarray(nums1) + np.asarray(nums2)
df["num"] = nums
# nums = df["num"].values

train = pd.DataFrame()
test = pd.DataFrame()
for i in range(len(nums)):
    if int(nums[i] % 2) == 0:
        # goes to train
        train = train.append(df.iloc[i])
    else:
        # goes to test
        test = test.append(df.iloc[i])
You can do it by playing with astype to go from int to str and back to float: convert both columns to strings and sum them over the row (which concatenates the two strings), then str.split with expand=True to get one digit per column, select the right columns, cast each digit to float, and sum again per row.
#dummy example
df = pd.DataFrame({'patient_nbr':[123456, 123457, 123458],
'encounter_id':[123456, 123456, 123457]})
#create num
df['num'] = df[['patient_nbr', 'encounter_id']].astype(str).sum(axis=1)\
.astype(str).str.split('', expand=True)\
.loc[:,1:12].astype(float).sum(axis=1)
print (df)
patient_nbr encounter_id num
0 123456 123456 42.0
1 123457 123456 43.0
2 123458 123457 45.0
Then use this column to create a mask, with even as False and odd as True:
mask = (df['num']%2).astype(bool)
train = df.loc[~mask, :] #train is the even
test = df.loc[mask, :] #test is the odd
print (test)
patient_nbr encounter_id num
1 123457 123456 43.0
2 123458 123457 45.0
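A slightly more defensive variant of the same idea (a sketch, not from the original answer) avoids the positional .loc[:, 1:12] slice, so it still works if the IDs are not exactly six digits long:

# concatenate the two IDs as strings and add up their digits per row
digits = df['patient_nbr'].astype(str) + df['encounter_id'].astype(str)
df['num'] = digits.apply(lambda s: sum(int(ch) for ch in s))

mask = (df['num'] % 2).astype(bool)
train = df.loc[~mask, :]  # even digit sums
test = df.loc[mask, :]    # odd digit sums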
I'm attempting to count the number of null values below each non-null cell in a dataframe and put that number into a new variable (size) and data frame.
I have included a picture of the dataframe I'm trying to count. I'm only interested in the Arrival Date column for now. The new data frame should have a column whose first observations are 1, 1, 3, 7, etc.
## Loops through all of the rows in DOAs
for i in range(0, DOAs.shape[0]):
    j = 0
    if DOAs.iloc[int(i), 3] != None:  ### the rest only runs if the current, i, observation isn't null
        newDOAs.iloc[int(j), 0] = DOAs.iloc[int(i), 3]  ## sets the jth i in the new dataframe to the ith (currently assessed) row of the old
        foundNull = True  # Sets foundNull equal to true
        k = 1  ## sets the counter of people
        while foundNull == True and (k + i) < 677:
            if DOAs.iloc[int(i + k), 3] == None:  ### if the next one it looks at is null, increment the counter to add another person to the family
                k = k + 1
            else:
                newDOAs.iloc[int(j), 1] = k  ## sets second column in new dataframe equal to the size
                j = j + 1
                foundNull = False
                j = 0
What you can do is get the indices of the non-null entries in whichever column of your dataframe you care about, then take the distance between each one and the next. Note: this assumes they are nicely ordered and/or that you don't mind calling .reset_index() on your dataframe.
Here is a sample:
import pandas as pd

df = pd.DataFrame({'a': [1, None, None, None, 2, None, None, 3, None, None]})
not_null_index = df.dropna(subset=['a']).index

null_counts = {}
for i in range(len(not_null_index)):
    if i < len(not_null_index) - 1:
        null_counts[not_null_index[i]] = not_null_index[i + 1] - 1 - not_null_index[i]
    else:
        null_counts[not_null_index[i]] = len(df.a) - 1 - not_null_index[i]

null_counts_df = pd.DataFrame({'nulls': list(null_counts.values())}, index=null_counts.keys())
df_with_null_counts = pd.merge(df, null_counts_df, left_index=True, right_index=True)
Essentially, all this code does is get the indices of the non-null values in the dataframe, take the difference between each index and the next non-null index, and record that as the null count. It then sticks those null counts in a dataframe and merges it with the original.
After running this snippet, df_with_null_counts is equal to:
a nulls
0 1.0 3
4 2.0 2
7 3.0 2
Alternatively, you can use numpy instead of using a loop, which would be much faster for large dataframes. Here's a sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, None, None, None, 2, None, None, 3, None, None]})
not_null_index = df.dropna(subset=['a']).index

offset_index = np.array([*not_null_index[1:], len(df.a)])
null_counts = offset_index - np.array(not_null_index) - 1

null_counts_df = pd.DataFrame({'nulls': null_counts}, index=not_null_index)
df_with_null_counts = pd.merge(df, null_counts_df, left_index=True, right_index=True)
And the output will be the same.
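For completeness, a pandas-only variant of the same idea (a sketch, not part of the original answer): label each run of rows by counting the non-null values seen so far, then count the nulls within each run. This assumes the column does not start with nulls.

# the counter increases at every non-null value, so each run gets its own label
groups = df['a'].notna().cumsum()

# number of nulls in each run, re-indexed by the non-null rows that start the runs
null_counts = df['a'].isna().groupby(groups).sum()
null_counts.index = df.dropna(subset=['a']).index

df_with_null_counts = df.dropna(subset=['a']).assign(nulls=null_counts)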