comparing each row with every other row using apply() to speed up the run - python

I want to speed up my program that compares every row with every other row. I am thinking about using the pandas apply() method, but I just can't figure out how.
This is the data that I want to compare:
[image: sample data]
I want each pairwise comparison to produce output like this:
[image: expected result]
Currently I am using the code below:
import pandas as pd

df = pd.read_excel(r'example.xlsx', sheet_name='Sheet3')
df['Title_new'] = df[df.columns[2:]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)

r_list = []
for i in range(len(df['Title_new'])):
    list1 = df['Title_new'][i]
    index_a = df['index'][i]
    source_a = df['Source'][i]
    for j in range(len(df['Title_new'])):
        list2 = df['Title_new'][j]
        index_b = df['index'][j]
        source_b = df['Source'][j]
        if index_a == index_b:
            continue
        r_list.append([index_a, source_a, list1, index_b, source_b, list2])
        print([index_a, source_a, list1, index_b, source_b, list2])

r_df = pd.DataFrame(r_list)
r_df.columns = ['index_a', 'source_a', 'title_a', 'index_b', 'source_b', 'title_b']
r_df
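A hedged sketch of one common speed-up: replace the double Python loop with a self cross-join, which builds all ordered pairs at once. This assumes the column names used above ('index', 'Source', 'Title_new') and pandas >= 1.2 for merge(how='cross'):
cols = ['index', 'Source', 'Title_new']
pairs = df[cols].merge(df[cols], how='cross', suffixes=('_a', '_b'))
# drop self-pairs, mirroring the `if index_a == index_b: continue` check
pairs = pairs[pairs['index_a'] != pairs['index_b']].reset_index(drop=True)
pairs.columns = ['index_a', 'source_a', 'title_a', 'index_b', 'source_b', 'title_b']
This avoids Python-level iteration entirely, at the cost of materializing all n*(n-1) pairs in memory.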


how to slice a pandas data frame with for loop in another list?

I have a dataframe like this, but it is not guaranteed that there are always just 3 sites:
data = [[501620, 501441,501549], [501832, 501441,501549], [528595, 501662,501549],[501905,501441,501956],[501913,501441,501549]]
df = pd.DataFrame(data, columns = ["site_0", "site_1","site_2"])
I want to slice the dataframe with conditions built dynamically from the elements of li (a list) and a random combination.
I have tried the code below, which is static:
li = [1,2]
random_com = (501620, 501441,501549)
df_ = df[(df["site_"+str(li[0])] == random_com[li[0]]) &
         (df["site_"+str(li[1])] == random_com[li[1]])]
How can I make the above code dynamic?
I have tried this, but it gives me two separate dataframes, whereas I need one dataframe satisfying both conditions (AND).
[df[(df["site_"+str(j)] == random_com[j])] for j in li]
You can iterate over the conditions and build the & of all of them.
li = [1,2]
main_mask = True
for i in li:
    main_mask = main_mask & (df["site_"+str(i)] == random_com[i])
df_ = df[main_mask]
If you prefer a one-liner, you could use reduce() from functools:
from functools import reduce

df_ = df[reduce(lambda x, y: x & y, [(df["site_"+str(j)] == random_com[j]) for j in li])]
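A hedged numpy variant of the same idea, combining the per-column boolean masks with np.logical_and.reduce (the expected rows refer to the sample data above):
import numpy as np

mask = np.logical_and.reduce([df["site_" + str(j)].eq(random_com[j]) for j in li])
df_ = df[mask]  # with li = [1, 2], this keeps rows 0, 1 and 4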

modifying a dataframe column and getting unexpected results

I have a dataframe like the example constructed at the end of this question.
There are actually 120,000 rows in this data and 20,000 users; the example shows just one user. For every user I need to make sure the prediction contains exactly three "1"s and three "0"s.
I wrote the following function to do that:
import numpy as np

def check_prediction_quality(df):
    df_n = df.copy()
    unique = df_n['userID'].unique()
    for i in range(len(unique)):
        ex_df = df[df['userID'] == unique[i]]
        v = ex_df['prediction'].tolist()
        v_bool = [i == 0 for i in v]
        if sum(v_bool) != 3:
            if sum(v_bool) > 3:
                res = [i for i, val in enumerate(v_bool) if val]
                diff = sum(v_bool) - 3
                for i in range(diff):
                    idx = np.random.choice(res, 1)[0]
                    v[idx] = float(1)
                    res.remove(idx)
            elif sum(v_bool) < 3:
                res = [i for i, val in enumerate(v_bool) if not val]
                diff = 3 - sum(v_bool)
                for i in range(diff):
                    idx = np.random.choice(res, 1)[0]
                    v[idx] = float(0)
                    res.remove(idx)
        for j in range(len(v)):
            df_n.loc[(0 + i*6) + j:(6 + i*6) + j, 'prediction'] = v[j]
    return df_n
However, when I run a check on whether the numbers of "0"s and "1"s are equal, it turns out they're not. I am not sure what I did wrong.
sum([i == 0 for i in df['prediction']])
should be six using the example below, but when I run it on my 120,000-row dataframe, I do not get 60,000 of each.
data = {'userID': [199810,199810,199810,199810,199810,199810,199812,199812,199812,199812,199812,199812],
        'trackID': [1,2,3,4,5,6,7,8,9,10,11,12],
        'prediction': [0,0,0,0,1,1,1,1,1,1,0,0]}
df = pd.DataFrame(data = data)
df
Much appreciated!
When working with pandas dataframes, you should reassign the post-processed DataFrame to the old one.
df = pd.DataFrame(np.array(...))
# reassignment:
df.loc[:,3:5] = df.loc[:,3:5]*10  # this multiplies the columns from 3 to 5 by 10
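A hedged aside on the likely actual bug: the outer loop variable i is reused by the inner for i in range(diff) loops, so the write-back offset (0 + i*6) no longer refers to the current user. A sketch of the same logic that writes back by label instead, assuming 6 rows per user and the column names above:
import numpy as np

def check_prediction_quality(df):
    # Force exactly three 1s and three 0s per user (assumes 6 rows per user)
    df_n = df.copy()
    df_n['prediction'] = df_n['prediction'].astype(float)
    for uid, grp in df_n.groupby('userID'):
        v = grp['prediction'].to_numpy(dtype=float)
        zeros = np.flatnonzero(v == 0)
        if len(zeros) > 3:    # too many 0s: flip randomly chosen extras to 1
            v[np.random.choice(zeros, len(zeros) - 3, replace=False)] = 1.0
        elif len(zeros) < 3:  # too few 0s: flip randomly chosen 1s to 0
            ones = np.flatnonzero(v != 0)
            v[np.random.choice(ones, 3 - len(zeros), replace=False)] = 0.0
        df_n.loc[grp.index, 'prediction'] = v  # write back by index labels, not offsets
    return df_n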
Actually, never mind. I found out I don't have to modify the "0"s and "1"s.

Cover all columns using the least amount of rows in a pandas dataframe

I have a pandas dataframe of 0/1 values (shown as a picture in the original post).
The goal is to select the fewest rows such that every column has a "1" in at least one selected row. In the pictured scenario, the final selection is two particular rows.
The algorithm should still work if I add columns and rows, and if I change the combination of 1s and 0s in any given row.
Use the sum per row, then compare with Series.ge (>=) for greater-or-equal and filter by boolean indexing:
df[df.sum(axis=1).ge(2)]
If you want to test for 1 or 0 values, first compare with DataFrame.eq (==):
df[df.eq(1).sum(axis=1).ge(2)]
df[df.eq(0).sum(axis=1).ge(2)]
For those interested, this is how I managed to do it:
from itertools import combinations

import pandas as pd

def _getBestRowsFinalSelection(self, df, cols):
    """
    Get the selected rows for the final selection

    Parameters:
    1. df: Dataframe to use
    2. cols: Columns of the binary variables in the Dataframe object (df)

    RETURNS -> DataFrame : dfSelected
    """
    isOne = df.loc[df[df.loc[:, cols] == 1].sum(axis=1) > 0, :]
    lstIsOne = isOne.loc[:, cols].values.tolist()
    lstIsOne = [(x, lstItem) for x, lstItem in zip(isOne.index.values.tolist(), lstIsOne)]
    winningComb = None
    stopFlag = False
    for i in range(1, isOne.shape[0] + 1):
        if stopFlag:
            break
        combs = combinations(lstIsOne, i)  # from itertools
        for c in combs:
            data = [x[1] for x in c]
            index = [x[0] for x in c]
            dfTmp = pd.DataFrame(data=data, columns=cols, index=index)
            if (dfTmp.sum() > 0).all():  # every column covered by this combination
                dfTmp["Final Selection"] = "Yes"
                winningComb = dfTmp
                stopFlag = True
                break
    return winningComb
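A hypothetical usage sketch (the data here is made up; self is unused in the function, so passing None works):
df = pd.DataFrame({'A': [1, 0, 0], 'B': [0, 1, 0], 'C': [1, 0, 1]})
best = _getBestRowsFinalSelection(None, df, ['A', 'B', 'C'])
print(best)  # rows 0 and 1 together cover columns A, B and C
Note that this brute-force search over row combinations grows exponentially with the number of candidate rows (minimum set cover is NP-hard), so it is only practical for small frames.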

simplify splitting a dataframe into several dataframes

So I have some dataframes (df0, df1, df2) with various numbers of rows. I want to split any dataframe with more than 30 rows into several dataframes of 30 rows each. For example, my dataframe df0 has 156 rows, so I would separate it into several dataframes like this:
if len(df0) > 30:
    df0_A = df0[0:30]
    df0_B = df0[31:60]
    df0_C = df0[61:90]
    df0_D = df0[91:120]
    df0_E = df0[121:150]
    df0_F = df0[151:180]
else:
    df0 = df0
The problem with this code is that I need to repeat it exhaustively for each subsequent step, like this:
df0 = pd.DataFrame(df0)
df0_A = pd.DataFrame(df0_A)
df0_B = pd.DataFrame(df0_B)
df0_C = pd.DataFrame(df0_C)
df0_D = pd.DataFrame(df0_D)
df0_E = pd.DataFrame(df0_E)
df0_F = pd.DataFrame(df0_F)
df0 = df0.to_string(header=False, index=False, index_names=False).split('\n')
df0_A = df0_A.to_string(header=False, index=False, index_names=False).split('\n')
df0_B = df0_B.to_string(header=False, index=False, index_names=False).split('\n')
df0_C = df0_C.to_string(header=False, index=False, index_names=False).split('\n')
df0_D = df0_D.to_string(header=False, index=False, index_names=False).split('\n')
df0_E = df0_E.to_string(header=False, index=False, index_names=False).split('\n')
df0_F = df0_F.to_string(header=False, index=False, index_names=False).split('\n')
df0 = [','.join(ele.split()) for ele in df0]
df0_A = [','.join(ele.split()) for ele in df0_A]
df0_B = [','.join(ele.split()) for ele in df0_B]
df0_C = [','.join(ele.split()) for ele in df0_C]
df0_D = [','.join(ele.split()) for ele in df0_D]
df0_E = [','.join(ele.split()) for ele in df0_E]
df0_F = [','.join(ele.split()) for ele in df0_F]
Now imagine I have ten dataframes, each of which needs to be split into five dataframes; then I would need to write the same code 50 times!
I'm quite new to Python, so can anyone help me simplify this code, maybe with a simple for loop? Thanks.
You could probably automate it a little bit more, but this should be enough!
import copy
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'Test': np.random.randint(100000, 999999, size=180)})
len(df0)

if len(df0) > 30:
    df_dict = {}
    x = 0
    y = 30
    for df_letter in ['A', 'B', 'C', 'D', 'E', 'F']:
        df_name = f'df0_{df_letter}'
        df_dict[df_name] = copy.deepcopy(df_letter)
        df_dict[df_name] = pd.DataFrame(df0[x:y]).to_string(header=False, index=False, index_names=False).split('\n')
        df_dict[df_name] = [','.join(ele.split()) for ele in df_dict[df_name]]
        x += 30
        y += 30
else:
    df0

for df in df_dict:
    print(df)
    print('--------------------------------------------------------------------')
    print(f'length: {len(df_dict[df])}')
    print('--------------------------------------------------------------------')
    print(df_dict[df])
    print('--------------------------------------------------------------------')
Assuming you have one column for identification,
import numpy as np
import pandas as pd

def split_df(idf, idcol, nsize):
    g = idf.groupby(idcol)
    # Compute the size for each value of the identification column
    size = g.size()
    dflist = []
    for _id, _idcount in size.items():
        if _idcount > nsize:
            idx = idf[idf[idcol].eq(_id)].index
            # Split the index into equal parts of `nsize`,
            # e.g. [1,2,3,4,5] with nsize = 2 splits into ([1,2], [3,4], [5])
            ilist = np.array_split(idx, round(idx.shape[0]/nsize + 0.5))
            dflist += ilist
    return [idf.loc[idx].copy(deep=True) for idx in dflist]

df = pd.DataFrame(data=np.hstack((np.random.choice(np.arange(1, 3), 10).reshape(10, -1),
                                  np.random.rand(10, 3))),
                  columns=['id', 'a', 'b', 'c'])
df = df.astype({'id': np.int64})
split_df(df, 'id', 2)
This is a great problem, you can use this (data is the DataFrame here):
# Start offsets for subsets of size 30
subsets = list(range(0, len(data), 30))
# Start cutoffs for the subsets of the DataFrame
start_cutoff = subsets
# End cutoffs for the subsets of the DataFrame
end_cutoff = subsets[1:] + [len(data)]
# Zip the start and end cutoffs into a list of (start, end) pairs
cutoffs = list(zip(start_cutoff, end_cutoff))
# List containing the split DataFrames
list_dfs = [data.iloc[cutoff[0]:cutoff[-1]] for cutoff in cutoffs]
# Convert each split DataFrame to a list of comma-joined row strings
string_dfs = [df.to_string(header=False, index=False, index_names=False).split('\n') for df in list_dfs]
final_df_list = [[','.join(ele.split()) for ele in string_df] for string_df in string_dfs]
Now you can access the stringified DataFrames by:
print(final_df_list[0])
print(final_df_list[1])
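For completeness, a compact idiom not in the original answers, hedged as one common alternative: group by integer division of the positional index to get chunks of at most 30 rows as real DataFrames:
import numpy as np

chunks = [g for _, g in data.groupby(np.arange(len(data)) // 30)]
# chunks[0] holds rows 0-29, chunks[1] rows 30-59, and so on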

Getting the row and column numbers that meets multiple conditions in Pandas

I am trying to get the row and column numbers that meet three conditions in a Pandas DataFrame.
I have a DataFrame of 0, 1 and -1 values, bigger than 1850 x 1850; when I try to get the rows and columns, it takes forever to produce the output.
The following is an example I have been trying to use:
import pandas as pd
import numpy as np

a = pd.DataFrame(np.random.randint(2, size=(1845, 1850)))
b = pd.DataFrame(np.random.randint(2, size=(5, 1850)))
b[b == 1] = -1
c = pd.concat([a, b], ignore_index=True)

column_positive = []
row_positive = []
column_negative = []
row_negative = []
column_zero = []
row_zero = []

for column in range(0, c.shape[0]):
    for row in range(0, c.shape[1]):
        if c.iloc[column, row] == 1:
            column_positive.append(column)
            row_positive.append(row)
        elif c.iloc[column, row] == -1:
            column_negative.append(column)
            row_negative.append(row)
        else:
            column_zero.append(column)
            row_zero.append(row)
I did some web searching and found that np.where() does something like this, but I have no idea how to use it.
Could anyone suggest a better alternative?
You are right, np.where would be one way to do it. Here's an implementation with it -
# Extract the values from c into an array for ease in further processing
c_arr = c.values
# Use np.where to get row and column indices corresponding to three comparisons
column_zero, row_zero = np.where(c_arr==0)
column_negative, row_negative = np.where(c_arr==-1)
column_positive, row_positive = np.where(c_arr==1)
If you don't mind having the rows and columns as an Nx2 shaped array, you could do it a bit more concisely, like so -
neg_idx, zero_idx, pos_idx = [np.argwhere(c_arr == item) for item in [-1,0,1]]
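A quick hedged sanity check on a tiny made-up frame. Note that np.where returns (axis-0, axis-1) index arrays; the variable names above follow the question's loops, which call axis 0 "column":
c = pd.DataFrame([[1, -1], [0, 1]])
col_pos, row_pos = np.where(c.values == 1)
# col_pos -> array([0, 1]), row_pos -> array([0, 1]): the 1s sit at (0, 0) and (1, 1)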
