I want to speed up my program, which compares each row with every other row. I am thinking about using the pandas apply() method, but I can't figure out how.
This is the data that I want to compare:
data
I want to compare each row with every other row to get something like this:
Currently I am using the code below:
import pandas as pd

df = pd.read_excel(r'example.xlsx', sheet_name='Sheet3')
# Join the non-null values from the third column onward into one comma-separated string per row
df['Title_new'] = df[df.columns[2:]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)
r_list = []
for i in range(len(df['Title_new'])):
    list1 = df['Title_new'][i]
    index_a = df['index'][i]
    source_a = df['Source'][i]
    # Pair row i with every other row j
    for j in range(len(df['Title_new'])):
        list2 = df['Title_new'][j]
        index_b = df['index'][j]
        source_b = df['Source'][j]
        if index_a == index_b:
            continue  # skip pairing a row with itself
        r_list.append([index_a, source_a, list1, index_b, source_b, list2])
        print([index_a, source_a, list1, index_b, source_b, list2])
r_df = pd.DataFrame(r_list)
r_df.columns = ['index_a', 'source_a', 'title_a', 'index_b', 'source_b', 'title_b']
r_df
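One possible way to avoid the nested loop is a cross join rather than apply(). The sketch below is not taken from the thread: it assumes pandas 1.2 or later (for merge(how="cross")) and uses a small made-up frame in place of the Excel sheet, with only the three columns that the loop reads.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data; only the columns used by the loop.
df = pd.DataFrame({
    "index": [1, 2, 3],
    "Source": ["A", "B", "C"],
    "Title_new": ["x,y", "y,z", "z"],
})

cols = ["index", "Source", "Title_new"]
left = df[cols].rename(
    columns={"index": "index_a", "Source": "source_a", "Title_new": "title_a"})
right = df[cols].rename(
    columns={"index": "index_b", "Source": "source_b", "Title_new": "title_b"})

# Cross join: every row of `left` is paired with every row of `right`.
pairs = left.merge(right, how="cross")

# Drop the pairs where a row meets itself, like the `continue` in the loop.
r_df = pairs[pairs["index_a"] != pairs["index_b"]].reset_index(drop=True)
print(r_df)
```

For n rows this still produces n * (n - 1) pairs, so memory grows quadratically; the gain is only that the pairing happens inside pandas instead of in Python-level loops.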
I have a dataframe like this, but there will not always be exactly 3 sites:
data = [[501620, 501441,501549], [501832, 501441,501549], [528595, 501662,501549],[501905,501441,501956],[501913,501441,501549]]
df = pd.DataFrame(data, columns = ["site_0", "site_1","site_2"])
I want to filter the dataframe using conditions built dynamically from the elements of li (a list) and a random combination.
I have tried the code below, which is static:
li = [1,2]
random_com = (501620, 501441,501549)
df_ = df[(df["site_"+str(li[0])] == random_com[li[0]]) & \
(df["site_"+str(li[1])] == random_com[li[1]])]
How can I make the above code dynamic?
I have tried this, but it gives me two separate dataframes, whereas I need one dataframe with both conditions combined (AND).
[df[(df["site_"+str(j)] == random_com[j])] for j in li]
You can iterate over the conditions and combine them all with &.
li = [1,2]
main_mask = True
for i in li:
    main_mask = main_mask & (df["site_"+str(i)] == random_com[i])
df_ = df[main_mask]
If you prefer a one-liner, I think you could use reduce():
from functools import reduce
df_ = df[reduce(lambda x, y: x & y, [(df["site_"+str(j)] == random_com[j]) for j in li])]
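Another option, shown here only as a sketch, is to compare all the selected columns at once and AND the result across each row. It reuses the data, li and random_com from the question; cols, wanted and mask are names introduced for illustration.

```python
import pandas as pd

data = [[501620, 501441, 501549], [501832, 501441, 501549],
        [528595, 501662, 501549], [501905, 501441, 501956],
        [501913, 501441, 501549]]
df = pd.DataFrame(data, columns=["site_0", "site_1", "site_2"])

li = [1, 2]
random_com = (501620, 501441, 501549)

# Column names and the value each one must equal, keyed by column name.
cols = [f"site_{j}" for j in li]
wanted = pd.Series([random_com[j] for j in li], index=cols)

# eq() aligns `wanted` with the column labels; all(axis=1) is the row-wise AND.
mask = df[cols].eq(wanted).all(axis=1)
df_ = df[mask]
print(df_)
```

This works for any length of li, since the number of conditions is just the number of entries in cols.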
I am new to pandas, and I want to compare rows and only then enter another for loop.
from datetime import timedelta

for i in node:
    temp_df = df[df['NODE'] == i]
    min_time = min(temp_df['time1'])
    max_time = max(temp_df['time1'])
    # Step through the node's time range in 5-minute windows
    while min_time <= max_time:
        print(min_time)
        df['No.Of_CellDown'] = temp_df['time1'].between(min_time, min_time + timedelta(minutes=5)).sum()
        print(count)  # count is not defined in this snippet
        min_time = min_time + timedelta(minutes=5)
I want to add a condition that checks whether the Tech and Issue columns have the same values in the current row and the previous row (row - 1), and only then execute the for loop in the code above.
Try:
# True where Tech and Issue both match the previous row
(df
 .assign(same_as_previous_row=lambda x:
         (x['Tech'] == x['Tech'].shift(1))
         & (x['Issue'] == x['Issue'].shift(1)))
)
Try this:
for index, row in temp_df.iterrows():
    # index - 1 is looked up as a label, so this assumes temp_df has a default 0..n-1 index
    if index - 1 >= 0:
        if temp_df['Tech'][index-1] == row['Tech'] and temp_df['Issue'][index-1] == row['Issue']:
            pass  # Do your thing here
        else:
            print('different')
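To see what the shift-based comparison produces, here is a small self-contained sketch. The Tech and Issue values are made up for illustration, and same is a hypothetical name for the resulting mask.

```python
import pandas as pd

# Made-up sample; only the two columns being compared.
df = pd.DataFrame({
    "Tech": ["4G", "4G", "3G", "3G", "3G"],
    "Issue": ["down", "down", "down", "up", "up"],
})

# shift() moves each column down by one row, so row k is compared with row k-1.
# The first row has no predecessor; comparing with NaN gives False.
same = df["Tech"].eq(df["Tech"].shift()) & df["Issue"].eq(df["Issue"].shift())

# Only the rows that match their predecessor go on to the inner work.
for idx in df.index[same]:
    print("row", idx, "matches the previous row")
```

Because the mask is computed once for the whole frame, it does not depend on the index being 0..n-1 the way a label lookup with index - 1 does.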
I am trying to add the elements in each row of "list1" and "list2" using a while loop, but I am getting "KeyError: 'the label [7] is not in the [index]'". I know the simple way to do this is:
df['sum'] = (df["list1"]+df["list2"])
But I want to try this with a loop for learning purposes.
import pandas as pd

df = pd.DataFrame({"list1": [2,5,4,8,4,7,8], "list2": [5,8,4,8,7,5,5],
                   "list3": [50,65,4,82,89,90,76]})
d = []
count = 0
x = 0
while count < len(df):
    df1 = df.loc[x, "list1"] + df.loc[x, "list2"]
    d.append(df1)
    x = x + 1
    count = count + 1
df["sum"] = d
You are really close; just a few suggestions:
There is no need for both the count and x values.
You are getting the error because x reaches the length of df (7), which falls outside the index that loc looks up (labels 0 to 6). That can be fixed by bounding the loop at len(df)-1.
You do not need to write x = x+1; you can use x += 1.
d=[]
x=0
while x <= len(df)-1:
    df1 = df.loc[x, "list1"] + df.loc[x, "list2"]
    d.append(df1)
    x += 1
df["sum"] = d
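Since loc looks rows up by label, a loop that counts positions can use iloc instead, which is position-based and does not depend on what the index labels are. A minimal sketch with the same data, using a for loop over range(len(df)) so there is only one counter to maintain:

```python
import pandas as pd

df = pd.DataFrame({"list1": [2, 5, 4, 8, 4, 7, 8],
                   "list2": [5, 8, 4, 8, 7, 5, 5],
                   "list3": [50, 65, 4, 82, 89, 90, 76]})

d = []
# range(len(df)) yields positions 0..6, which are always valid for iloc.
for pos in range(len(df)):
    d.append(df["list1"].iloc[pos] + df["list2"].iloc[pos])

df["sum"] = d
print(df)
```

The result should match df["list1"] + df["list2"], which remains the idiomatic way to do this outside of a learning exercise.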
I have a pandas dataframe with the following general format:
id,atr1,atr2,orig_date,fix_date
1,bolt,l,2000-01-01,nan
1,screw,l,2000-01-01,nan
1,stem,l,2000-01-01,nan
2,stem,l,2000-01-01,nan
2,screw,l,2000-01-01,nan
2,stem,l,2001-01-01,2001-01-01
3,bolt,r,2000-01-01,nan
3,stem,r,2000-01-01,nan
3,bolt,r,2001-01-01,2001-01-01
3,stem,r,2001-01-01,2001-01-01
The result would be the following:
id,atr1,atr2,orig_date,fix_date,failed_part_ind
1,bolt,l,2000-01-01,nan,0
1,screw,l,2000-01-01,nan,0
1,stem,l,2000-01-01,nan,0
2,stem,l,2000-01-01,nan,1
2,screw,l,2000-01-01,nan,0
2,stem,l,2001-01-01,2001-01-01,0
3,bolt,r,2000-01-01,nan,1
3,stem,r,2000-01-01,nan,1
3,bolt,r,2001-01-01,2001-01-01,0
3,stem,r,2001-01-01,2001-01-01,0
Any tips or tricks are most welcome!
Update2:
A better way to describe what I need: within a .groupby(['id','atr1','atr2']), create a new indicator column that flags the records in each group that meet the following criterion:
(df['orig_date'] < df['fix_date'])
I think this should work:
df['failed_part_ind'] = df.apply(lambda row: 1 if ((row['id'] == row['id']) &
                                                   (row['atr1'] == row['atr1']) &
                                                   (row['atr2'] == row['atr2']) &
                                                   (row['orig_date'] < row['fix_date']))
                                 else 0, axis=1)
Update: I think this is what you want:
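import pandas as pd

def f(g):
    g = g.copy()  # work on a copy rather than mutating the group in place
    min_fix_date = g['fix_date'].min()
    # pd.isna handles both NaN and NaT, whereas np.isnan only accepts numeric input
    if pd.isna(min_fix_date):
        g['failed_part_ind'] = 0
    else:
        g['failed_part_ind'] = g['orig_date'].apply(lambda d: 1 if d < min_fix_date else 0)
    return g

df.groupby(['id', 'atr1', 'atr2']).apply(f)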
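The same idea can probably be written without apply by broadcasting each group's earliest fix_date back onto its rows with transform. This is a sketch against the sample data from the question; min_fix is a name introduced here, and the date columns are assumed to be parsed as datetimes.

```python
import io
import pandas as pd

csv = """id,atr1,atr2,orig_date,fix_date
1,bolt,l,2000-01-01,nan
1,screw,l,2000-01-01,nan
1,stem,l,2000-01-01,nan
2,stem,l,2000-01-01,nan
2,screw,l,2000-01-01,nan
2,stem,l,2001-01-01,2001-01-01
3,bolt,r,2000-01-01,nan
3,stem,r,2000-01-01,nan
3,bolt,r,2001-01-01,2001-01-01
3,stem,r,2001-01-01,2001-01-01
"""
df = pd.read_csv(io.StringIO(csv))
# Parse both date columns; the "nan" entries become NaT.
df["orig_date"] = pd.to_datetime(df["orig_date"])
df["fix_date"] = pd.to_datetime(df["fix_date"])

# Earliest fix_date per (id, atr1, atr2) group, repeated on every row of the group.
min_fix = df.groupby(["id", "atr1", "atr2"])["fix_date"].transform("min")

# Comparisons against NaT are False, so groups with no fix_date get 0.
df["failed_part_ind"] = (df["orig_date"] < min_fix).astype(int)
print(df)
```

On this sample the indicator column matches the expected output listed in the question.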