Build table from for loop values - python

I have a for loop that performs calculations on multiple columns of a dataframe, filtered by multiple criteria, and prints float values that I need to arrange in a table.
demolist = ['P13+', 'P18-34']
impcount = ['<1M', '1-5M']
for imp in impcount:
    print(imp)
    for d in demolist:
        print(d)
        target_ua = df.loc[(df['target'] == d) & (df['IMP Count'] == imp), 'in_target_ua_digital'].sum()
        target_pop = df.loc[(df['target'] == d) & (df['IMP Count'] == imp), 'in_target_pop'].sum()
        target_reach = target_ua / target_pop
        print(target_reach)
The output looks like this:
<1M
P13+
0.10
P18-34
0.12
1-5M
P13+
0.92
P18-34
0.53
The code is working correctly, but I need the output arranged in a new dataframe with impcount as the columns and demolist as the rows:
         <1M   1-5M
P13+     0.10  0.92
P18-34   0.12  0.53

It is just a matter of how you arrange your data. A table is a 2D data structure, often represented in Python as a list of lists, e.g. [[1, 2], [3, 4]]. In your case, you can collect the data row by row: build a list for each row of the table, append it to an outer list, and that outer list of lists is the table.
Here is an example showing how to build such a table when the value of each cell can be computed (a random value here; numpy as np and pandas as pd are assumed to be imported):
In [53]: x = list('abc')
    ...: y = list('123')
    ...:
    ...: data = []
    ...: for i in x:
    ...:     row = []
    ...:     for j in y:
    ...:         row.append(np.random.rand())
    ...:     data.append(row)
    ...:
    ...: df = pd.DataFrame(data, index=x, columns=y)
    ...:

In [54]: df
Out[54]:
          1         2         3
a  0.107659  0.840387  0.642285
b  0.184508  0.641443  0.475105
c  0.503608  0.379945  0.933735
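
Applied to your question, the same pattern would look something like this (a sketch, assuming the df and column names from your post):

data = []
for d in demolist:
    row = []
    for imp in impcount:
        target_ua = df.loc[(df['target'] == d) & (df['IMP Count'] == imp), 'in_target_ua_digital'].sum()
        target_pop = df.loc[(df['target'] == d) & (df['IMP Count'] == imp), 'in_target_pop'].sum()
        row.append(target_ua / target_pop)  # one cell per IMP bucket
    data.append(row)  # one row per demo

result = pd.DataFrame(data, index=demolist, columns=impcount)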

Try this:
demolist = ['P13+', 'P18-34']
impcount = ['<1M', '1-5M']

# Header row: the impression buckets become the column labels
imp_str = '\t'
for imp in impcount:
    imp_str += imp + '\t'
print(imp_str.rstrip())

# One printed row per demo (not per imp), so the output matches the desired table
for d in demolist:
    demo_str = d + '\t'
    for imp in impcount:
        target_ua = df.loc[(df['target'] == d) & (df['IMP Count'] == imp), 'in_target_ua_digital'].sum()
        target_pop = df.loc[(df['target'] == d) & (df['IMP Count'] == imp), 'in_target_pop'].sum()
        target_reach = target_ua / target_pop
        demo_str += str(target_reach) + '\t'
    print(demo_str.rstrip())
Hope this helps!
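
If you want an actual DataFrame rather than printed text, a shorter route (a sketch, assuming the same column names) is to aggregate once per group and unstack:

g = df.groupby(['target', 'IMP Count'])
reach = g['in_target_ua_digital'].sum() / g['in_target_pop'].sum()
table = reach.unstack('IMP Count')     # demos as rows, IMP buckets as columns
table = table.loc[demolist, impcount]  # enforce the desired row/column order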

Related

Efficient way to narrow down the results and group them based on one column

I have a dataframe with multiple columns. I want to filter the rows based on unique values, scale the values between zero and one, and finally group them into three categories. I used nested for loops, but I'm sure this is not an efficient way to solve this. Can someone show how to achieve it using map or any other better approach?
The dataframe (df) looks like below; all columns except the value column are categorical.
prod_id  tier1  tier2  tier3  tier4  value
X        X      X      X      X      3
X        X      X      X      X      2
X        X      X      X      X      6
grouping = df.groupby(["tier1", "tier2", "tier3", "tier4"]).agg({'prod_id': lambda x: len(pd.unique(x))}).reset_index().sort_values(by='prod_id', ascending=False)
# Selecting the groups with the most products (100 or more)
df1 = grouping[grouping['prod_id'] >= 100]
my_data = pd.DataFrame([])
for a in tqdm(df1['tier1'].unique()):
    for b in df1['tier2'].unique():
        for c in df1['tier3'].unique():
            for d in df1['tier4'].unique():
                data = df[(df['tier1'] == a) & (df['tier2'] == b) &
                          (df['tier3'] == c) & (df['tier4'] == d)]
                #print(data.shape)
                if data.shape[0] != 0:  # 'is not 0' compared identity, not value
                    data['scaled'] = minmax_scale(data['value'])
                    data['target_class'] = pd.cut(data['scaled'], [0, 0.25, 0.75, 1], labels=['Low', 'Medium', 'High'])
                    my_data = my_data.append(data, ignore_index=True)
                else:
                    pass
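
Since the question asks for a better way, here is a vectorized sketch using groupby/transform (assuming the column names above; a pure-pandas min-max scale stands in for sklearn's minmax_scale):

tiers = ['tier1', 'tier2', 'tier3', 'tier4']

# Keep only groups with at least 100 unique products
big = df.groupby(tiers)['prod_id'].transform('nunique') >= 100
out = df[big].copy()

# Min-max scale 'value' within each tier group, then bin into three classes
out['scaled'] = out.groupby(tiers)['value'].transform(
    lambda s: (s - s.min()) / (s.max() - s.min()))
out['target_class'] = pd.cut(out['scaled'], [0, 0.25, 0.75, 1],
                             labels=['Low', 'Medium', 'High'])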

Insert 0's into an all-ones matrix after a nested-loop verification

I'm wondering why my last loop doesn't succeed in replacing 1's with 0's in my 'bmatrix' matrix at the end of the process.
I've been stuck on this for a while.
Basically, for each pair of columns in the object h, I compare them using the negoct (negotiation values) dataframe. If, for a given line, col1 > col2, put 1 for col1 and 0 for col2 in my bmatrix.
The expected result should be:
bmatrix =
This is my code:
valmerc = pd.DataFrame({'variableA': (1,2.0,3), 'variableB': (np.nan,2,np.nan), 'variableC': (9,10,15), 'variableD' : (18,25,43),'variableE':(36,11,12),'variableF':(99,10,98), 'variableG': (42,19,27)})
negoct = pd.DataFrame({'variableA': (0.1,0.2,0.3), 'variableB': (0.5,np.nan,0.303), 'variableC': (0.9,0.10,0.4), 'variableD' : (0.12,0.11,0.09),'variableE':(np.nan,0.13,0.21),'variableF':(1.4,np.nan,0.03), 'variableG': (1.41,0.134,0.111)})
cols = valmerc.columns.values
bmatrix = pd.DataFrame(index=negoct.index, columns=cols, data=1)
arr = valmerc.to_numpy()
is_equal = ((arr == arr[None].T).any(axis=1))
is_equal[np.tril_indices_from(is_equal)] = False
inds_of_same_cols = [*zip(*np.where(is_equal))]
h = [inds for inds in inds_of_same_cols]
donuts = []
for i in h:
    op = pd.DataFrame(negoct.iloc[:, i[0]])
    ap = pd.DataFrame(negoct.iloc[:, i[1]])
    ep = pd.concat([op, ap], axis=1)
    donuts.append(ep)
#donuts[0]
for i in range(len(donuts)):
    for a, b in donuts[i].values:
        #print(donuts[i].columns[0], donuts[i].columns[1], a, b, a - b, a < b)
        if a < b:
            bmatrix[donuts[i].columns[0]] = 0
        else:
            bmatrix[donuts[i].columns[0]] = 1
bmatrix
bmatrix
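
For what it's worth, the likely culprit is that bmatrix[donuts[i].columns[0]] = 0 assigns the whole column, not a single cell, so each row's comparison overwrites the previous one. A row-wise sketch of the stated rule (an assumption here: ties go to col2):

for pair in donuts:
    col1, col2 = pair.columns[0], pair.columns[1]
    # bmatrix starts as all 1's, so zeroing the loser leaves 1 for the winner
    for row_idx, (a, b) in zip(pair.index, pair.values):
        if a > b:
            bmatrix.loc[row_idx, col2] = 0  # col1 wins this line
        else:
            bmatrix.loc[row_idx, col1] = 0  # col2 wins this line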

Modifying a dataframe column gives unexpected results

I have a dataframe like the one listed below.
There are actually 120,000 rows in this data and 20,000 users; this is just one user. For every user I need to make sure the prediction is three "1"s and three "0"s.
I wrote the following function to do that:
def check_prediction_quality(df):
    df_n = df.copy()
    unique = df_n['userID'].unique()
    for i in range(len(unique)):
        ex_df = df[df['userID'] == unique[i]]
        v = ex_df['prediction'].tolist()
        v_bool = [i == 0 for i in v]
        if sum(v_bool) != 3:
            if sum(v_bool) > 3:
                res = [i for i, val in enumerate(v_bool) if val]
                diff = sum(v_bool) - 3
                for i in range(diff):
                    idx = np.random.choice(res, 1)[0]
                    v[idx] = float(1)
                    res.remove(idx)
            elif sum(v_bool) < 3:
                res = [i for i, val in enumerate(v_bool) if not val]
                diff = 3 - sum(v_bool)
                for i in range(diff):
                    idx = np.random.choice(res, 1)[0]
                    v[idx] = float(0)
                    res.remove(idx)
        for j in range(len(v)):
            df_n.loc[(0 + i * 6) + j:(6 + i * 6) + j, 'prediction'] = v[j]
    return df_n
However, when I check whether the number of "0"s and "1"s is the same, it turns out it's not. I am not sure what I did wrong.
sum([i == 0 for i in df['prediction']])
This should be six for the example below, but when I run it on my 120,000-row dataframe I don't get 60,000 of each.
data = {'userID': [199810, 199810, 199810, 199810, 199810, 199810,
                   199812, 199812, 199812, 199812, 199812, 199812],
        'trackID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        'prediction': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]}
df = pd.DataFrame(data=data)
df
Much appreciated!
When working with pandas dataframes, you should reassign the post-processed DataFrame to the old one.
df = pd.DataFrame(np.array(...))
# reassignment:
df.loc[:, 3:5] = df.loc[:, 3:5] * 10  # multiplies the columns from 3 to 5 by 10
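That said, the fragile part of the function above is the positional write-back df_n.loc[(0+i*6)+j:(6+i*6)+j, 'prediction'] = v[j], especially because the inner for i in range(diff) loops reuse and clobber the outer i. A sketch of a safer write-back keyed on the user id (assuming v holds the corrected predictions for user unique[i]):

# Match rows by userID instead of computed positions
df_n.loc[df_n['userID'] == unique[i], 'prediction'] = v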
Actually, never mind. I found out I don't have to modify the "0"s and "1"s after all.

Subset a row based on columns with similar names

Assuming a pandas dataframe like the one in the picture, I would like to fill the NA values with the value of the other variables similar to them. To be clearer, my variables are
mean_1, mean_2, ..., std_1, std_2, ..., min_1, min_2, ...
So I would like to fill the NA values with the values of the other columns, but not all the columns, only those that represent the same metric. In the picture I highlighted 2 NA values: the first I would like to fill with the mean of the 'MEAN' variables at row 2, and the second with the mean of the 'MIN' variables at row 9. Is there a way to do it?
You can find the unique prefixes, iterate through each, and do fillna for each subset separately:
uniq_prefixes = set([x.split('_')[0] for x in df.columns])
for prfx in uniq_prefixes:
    mask = [col for col in df if col.startswith(prfx)]
    # Transpose is needed because row-wise fillna is not implemented yet
    df.loc[:, mask] = df[mask].T.fillna(df[mask].mean(axis=1)).T
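
A quick check of the idea on a toy frame (hypothetical column names):

import numpy as np
import pandas as pd

df = pd.DataFrame({'mean_1': [1.0, np.nan], 'mean_2': [3.0, 4.0],
                   'min_1': [np.nan, 0.0], 'min_2': [2.0, 1.0]})
# After the loop above, mean_1 in the second row becomes 4.0 (the row mean of
# the mean_* columns) and min_1 in the first row becomes 2.0 (mean of min_*).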
Yes, it is possible with a loop. Below is a naive approach; even fancier ones would not gain much (at least none that I can see).
for i, row in df.iterrows():
    sum_means = 0
    n_means = 0
    sum_stds = 0
    n_stds = 0
    fill_mean_idxs = []
    fill_std_idxs = []
    for idx, item in row.iteritems():  # was item.iteritems(), a typo
        if idx.startswith('mean') and pd.isna(item):  # NaN, not None, marks missing
            fill_mean_idxs.append(idx)
        elif idx.startswith('mean'):
            sum_means += float(item)
            n_means += 1
        elif idx.startswith('std') and pd.isna(item):
            fill_std_idxs.append(idx)
        elif idx.startswith('std'):
            sum_stds += float(item)
            n_stds += 1
    ave_mean = sum_means / n_means
    ave_std = sum_stds / n_stds
    for idx in fill_mean_idxs:
        df.loc[i, idx] = ave_mean
    for idx in fill_std_idxs:
        df.loc[i, idx] = ave_std

Cover all columns using the fewest rows in a pandas dataframe

I have a pandas dataframe that looks like the following picture:
The goal here is to select the fewest rows needed to have a "1" in every column. In this scenario, the final selection should be these two rows:
The algorithm should work even if I add columns and rows. It should also work if I change the combination of 1 and 0 in any given row.
Sum per row, then compare with Series.ge (>=) and filter by boolean indexing:
df[df.sum(axis=1).ge(2)]
If you want to test for 1 or 0 values, first compare with DataFrame.eq (==):
df[df.eq(1).sum(axis=1).ge(2)]
df[df.eq(0).sum(axis=1).ge(2)]
For those interested, this is how I managed to do it:
def _getBestRowsFinalSelection(self, df, cols):
    """
    Get the selected rows for the final selection

    Parameters:
    1. df: Dataframe to use
    2. cols: Columns of the binary variables in the Dataframe object (df)

    RETURNS -> DataFrame : dfSelected
    """
    isOne = df.loc[df[df.loc[:, cols] == 1].sum(axis=1) > 0, :]
    lstIsOne = isOne.loc[:, cols].values.tolist()
    lstIsOne = [(x, lstItem) for x, lstItem in zip(isOne.index.values.tolist(), lstIsOne)]
    winningComb = None
    stopFlag = False
    for i in range(1, isOne.shape[0] + 1):
        if stopFlag:
            break
        combs = combinations(lstIsOne, i)  # from itertools
        for c in combs:
            data = [x[1] for x in c]
            index = [x[0] for x in c]
            dfTmp = pd.DataFrame(data=data, columns=cols, index=index)
            if (dfTmp.sum() > 0).all():
                dfTmp["Final Selection"] = "Yes"
                winningComb = dfTmp
                stopFlag = True
                break
    return winningComb
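
This is brute-force set cover: it tries every combination of candidate rows, smallest first, so it is exponential in the worst case but the first hit is guaranteed minimal. A hypothetical standalone check (the method pulled out of its class, made-up data):

import pandas as pd
from itertools import combinations

df = pd.DataFrame({'A': [1, 0, 1], 'B': [0, 1, 1],
                   'C': [1, 0, 0], 'D': [0, 1, 0]})
# Row 0 covers A and C, row 1 covers B and D, row 2 covers A and B
best = _getBestRowsFinalSelection(None, df, ['A', 'B', 'C', 'D'])
# -> rows 0 and 1: the smallest set with a 1 in every column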
