Dataframe compare EQ - position doesn't matter - python

Reading this article - https://datatofish.com/compare-values-dataframes/ - while helpful, it doesn't cover my use case, in that I have multiple prices per product, e.g. product1 = computer has price = 350, 850.
My compare-to DF has product2 = computer with price = 850, 350.
When I compare these two, the order seems to matter and they are reported as not matching. How can I compare regardless of order?
df.product_series.eq(other=df.warehouse_series)
Sample dataframe:
,Unnamed: 0,product,testfrom1,testfrom2,seriesfrom1,seriesfrom2
0,0,hi this is me,1703.0|1144.0|2172.0|735.0,,"['1703.0', '1144.0', '2172.0', '735.0']",
1,1,abc543,1120.0|637.0|2026.0|1599.0,,"['1120.0', '637.0', '2026.0', '1599.0']",
2,2,thisisus,2663.0|859.0|2281.0|1487.0,,"['2663.0', '859.0', '2281.0', '1487.0']",
3,3,abc123,1407.0|1987.0|696.0,,"['1407.0', '1987.0', '696.0']",
4,4,thing2,1392.0|1971.0|552.0,,"['1392.0', '1971.0', '552.0']",
5,5,thing1,1025.0|1566.0|581.0,,"['1025.0', '1566.0', '581.0']",
In the sample above I compare seriesfrom1 and seriesfrom2, but again I think the difference in element order is throwing the comparison off.

Use this custom function to compare; you can't compare directly using eq because the element order differs:
def custom_compare_eq(series, other):
    # yields one boolean per row; list cells compare as sets, so order is ignored
    length = len(series.values)
    for i in range(length):
        r1 = eval(str(series.values[i]))  # ast.literal_eval would be safer here
        r2 = eval(str(other.values[i]))
        if type(r1) != type(r2):
            yield False
        elif type(r1) == list:
            yield set(r1) == set(r2)
        else:
            # ints (and any other scalar type) compare directly
            yield r1 == r2

result = list(custom_compare_eq(df.column1, df.column2))
This compares the two columns element-wise and treats lists as equal regardless of order.
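A shorter vectorized alternative, offered as a sketch: assuming both columns hold stringified lists like the seriesfrom1/seriesfrom2 sample above, parse each cell once and let eq compare the resulting sets (the helper name as_set is mine):

import ast

def as_set(cell):
    # parse "['1703.0', '1144.0']" into a set; literal_eval avoids eval's risks
    return set(ast.literal_eval(cell)) if isinstance(cell, str) else cell

# True where both cells contain the same elements, in any order
result = df["seriesfrom1"].map(as_set).eq(df["seriesfrom2"].map(as_set))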

I have two average numbers and I want to find the ratio between them

I have two float numbers that are calculated from a CSV file dataset:
value1 = 10.82500730353491
value2 = 4.057505093955173
Now I need to find the ratio of those two variables from the dataset I have.
Here is the code I used to extract the means; now I need to find the ratio:
mean_buildingFire = dataset.loc[dataset['INCIDENT_TYPE_DESC'] == '111 - Building fire', 'UNITS_ONSCENE'].mean()
mean_smoke = dataset.loc[dataset['INCIDENT_TYPE_DESC'] == '651 - Smoke scare, odor of smoke', 'UNITS_ONSCENE'].mean()
print(mean_buildingFire)
print(mean_smoke)
My current output:
10.82500730353491
4.057505093955173
How do I find the ratio as whole numbers?
(mean_buildingFire/mean_smoke).as_integer_ratio()
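Note that as_integer_ratio on a float returns an exact but usually enormous numerator and denominator. If what you want is a small, human-readable whole-number ratio, a common sketch is fractions.Fraction with limit_denominator (the bound of 100 here is an arbitrary choice):

from fractions import Fraction

ratio = Fraction(mean_buildingFire / mean_smoke).limit_denominator(100)
print(ratio)         # small whole-number approximation of the ratio
print(float(ratio))  # check how close the approximation is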

Problem generating a list with a numeric qualifier

I am working on a course with low-code requirements, and there is one step where I am stuck.
I have this code that creates a list of restaurants and the number of reviews each has:
# Filter the rated restaurants
df_rated = df[df['rating'] != 'Not given'].copy()
df_rated['rating'] = df_rated['rating'].astype('int')
df_rating_count = df_rated.groupby(['restaurant_name'])['rating'].count().sort_values(ascending = False).reset_index()
df_rating_count.head()
From there I am supposed to create a list limited to restaurants with more than 50 reviews, starting from this base:
# Get the restaurant names that have rating count more than 50
rest_names = df_rating_count['______________']['restaurant_name']
# Filter to get the data of restaurants that have rating count more than 50
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
# Group the restaurant names with their ratings and find the mean rating of each restaurant
df_mean_4.groupby(['_______'])['_______'].mean().sort_values(ascending = False).reset_index().dropna() ## Complete the code to find the mean rating
Where I am stuck is on the first step.
rest_names = df_rating_count['______________']['restaurant_name']
I am pretty confident in the other 2 steps.
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
df_mean_4.groupby(['restaurant_name'])['rating'].mean().sort_values(ascending = False).reset_index().dropna()
I have frankly tried so many different things I don't even know where to start.
Does anyone have any hints to at least point me in the right direction?
You can index and filter using boolean indexing with []:
# Get the restaurant names that have rating count more than 50
rest_names = df_rating_count[df_rating_count['rating'] > 50]['restaurant_name']
# Function to determine the revenue
def compute_rev(x):
    if x > 20:
        return x * 0.25
    elif x > 5:
        return x * 0.15
    else:
        return x * 0

## Write the appropriate column name to compute the revenue
df['Revenue'] = df['________'].apply(compute_rev)
df.head()
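As a side note, the same tiered computation can be vectorized for larger frames; a minimal sketch with np.select, where order_value is a hypothetical stand-in for the blanked-out column name:

import numpy as np

x = df['order_value']  # hypothetical column name; substitute the real one

# np.select picks the first matching condition row by row
df['Revenue'] = np.select([x > 20, x > 5], [x * 0.25, x * 0.15], default=0.0)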

Iterating on a list based on different parameters

I'm once again asking for help on iterating over a list. This is the problem that eludes me this time:
I have a table that contains various combinations of countries with their relative trade flow.
Since trade goes both ways, my list has, for example, one value for ALB-ARM (how much Albania traded with Armenia that year) and then, further down the list, another value for ARM-ALB (the other way around).
I want to sum these two trade values for every pair of countries. I've been trying some code, but I quickly realise that all my approaches are wrong.
How do I even set it up? I feel like it's too hard with a loop and would be easy with some function that I don't even know exists.
Example data in Table format:
from astropy.table import Table
country1 = ["ALB","ALB","ARM","ARM","AZE","AZE"]
country2 = ["ARM","AZE","ALB","AZE","ALB","ARM"]
flow = [500,0,200,300,90,20]
t = Table([country1,country2,flow],names=["1","2","flow"],meta={"Header":"Table"})
and the expected output would be:
trade = [700,90,700,320,90,320]
result = Table([country1,country2,flow,trade],names=["1","2","flow","trade"],meta={"Header":"Table"})
Thank you all in advance.
Maybe this could help:
country1 = ["ALB","ALB","ARM","ARM","AZE","AZE"]
country2 = ["ARM","AZE","ALB","AZE","ALB","ARM"]
flow = [500,0,200,300,90,20]
trade = []
pairs = map(lambda t: '-'.join(t), zip(country1, country2))
flow_map = dict(zip(pairs, flow))
for left_country, right_country in zip(country1, country2):
    key = '-'.join((left_country, right_country))
    reverse_key = '-'.join((right_country, left_country))
    trade.append(flow_map[key] + flow_map[reverse_key])
print(trade)
outputs:
[700, 90, 700, 320, 90, 320]
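If the data is in a pandas DataFrame anyway, a vectorized sketch of the same idea is to build an order-independent pair key and sum flow within each pair (this assumes every ordered pair appears at most once, as in the example):

import pandas as pd

df = pd.DataFrame({"1": country1, "2": country2, "flow": flow})

# sorted tuple, so ALB-ARM and ARM-ALB map to the same key
pair = df[["1", "2"]].apply(lambda r: tuple(sorted(r)), axis=1)

df["trade"] = df.groupby(pair)["flow"].transform("sum")
print(df)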

Python/R code is taking too long to extract pairwise information from a dataset. How can I optimize it?

The code was initially in R, but as R does not handle large datasets well, I converted it to Python and ported it to Google Colab. Even on Google Colab it took very long, and I never actually saw it finish running even after 8 hours. I also added more break statements to avoid unnecessary runs.
The dataset has around 50,000 unique timestamps and 40,000 unique IDs. It is in the format ['time', 'id', 'x-coordinate', 'y-coordinate'], a very clear-cut passenger trajectory dataset.
What the code is trying to do is extract all the pairs of IDs which are 2 meters or less apart from each other in the same time frame.
Please let me know if there are ways to optimize this.
Here's a short overview of the data. [my_data.head(10)][1]
i = 0
y = pd.DataFrame(columns=['source', 'dest'])  # empty contact network df
infectedGrp = [824, 11648, 23468]

while i < my_data.shape[0]:
    row1 = my_data.iloc[i]
    id1 = row1[1]
    time1 = row1[0]
    x1 = row1[2]
    y1 = row1[3]
    infected1 = my_data.iloc[i, 4]
    infectious1 = my_data.iloc[i, 5]
    #print(row1)
    #print(time1)
    for j in range(i + 1, my_data.shape[0]):
        row2 = my_data.iloc[j]
        id2 = row2[1]
        time2 = row2[0]
        x2 = row2[2]
        y2 = row2[3]
        infected2 = my_data.iloc[j, 4]
        infectious2 = my_data.iloc[j, 5]
        print(time2)
        if time2 != time1:
            i = i + 1
            print("diff time...breaking")
            break
        if (x2 > x1 + 2) or (x1 > x2 + 2):
            i = i + 1
            print("x more than 2...breaking")
            break
        if (y2 > y1 + 2) or (y1 > y2 + 2):
            i = i + 1
            print("y more than 2...breaking")
            break
        probability = 0
        distance = round(math.sqrt(pow((x1 - x2), 2) + pow((y1 - y2), 2)), 2)
        print(distance)
        print(infected1)
        print(infected2)
        if distance <= R:
            if infectious1 and not infected2:  # one person is infectious and the other is not infected
                probability = (1 - beta) * (1 / R) * (math.sqrt(R**2 - distance**2))
                print(probability)
                print("here")
                infected2 = decision(probability)
                numid2 = int(id2)  # update all entries for id2
                if infected2:
                    my_data.loc[my_data['id'] == numid2, 'infected'] = True
                #my_data.iloc[j,7]=probability
            elif infectious2 and not infected1:
                infected1 = decision(probability)
                numid1 = int(id1)  # update all entries for id1
                if infected1:
                    my_data.loc[my_data['id'] == numid1, 'infected'] = True
                #my_data.iloc[i,7]=probability
            inf1 = 'F'
            inf2 = 'F'
            if infected1:
                inf1 = 'T'
            if infected2:
                inf2 = 'T'
            print('prob ' + str(probability) + ' at time ' + str(time1))
            new_row = {'source': id1.astype(str) + ' ' + inf1,
                       'dest': id2.astype(str) + ' ' + inf2}
            y = y.append(new_row, ignore_index=True)
    i = i + 1
[1]: https://i.stack.imgur.com/YVdmB.png
It's hard to tell for sure, but I think a good guess is that this line is your biggest "sin":
y = y.append(new_row, ignore_index=True)
You should not append rows to a DataFrame in a loop.
You should accumulate them in a Python list and create the DataFrame from all of them after the loop:
y = []
while i < my_data.shape[0]:
    (...)
    y.append(new_row)
y = pd.DataFrame(y)
I also suggest using a line profiler (e.g. the line_profiler package) to find which parts of the code are the bottlenecks.
You are using a nested loop to find rows with equal time values. You can get a huge improvement by doing a groupby operation instead and then iterating through the groups.
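A minimal sketch of that idea, grouping by time and using scipy's cKDTree to pull out all pairs within 2 meters per time frame (the column names 'time', 'id', 'x', 'y' are assumptions; adapt them to your actual headers):

import pandas as pd
from scipy.spatial import cKDTree

contacts = []  # (time, id1, id2) for every pair within 2 m

for time, frame in my_data.groupby('time'):
    coords = frame[['x', 'y']].to_numpy()
    ids = frame['id'].to_numpy()
    # query_pairs returns all index pairs (i < j) at distance <= r
    for i, j in cKDTree(coords).query_pairs(r=2.0):
        contacts.append((time, ids[i], ids[j]))

contacts = pd.DataFrame(contacts, columns=['time', 'source', 'dest'])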

What is the fastest way to generate a dynamic lag variable that updates with each iteration?

I have a few hundred thousand groups through which I want to iterate this particular lag operation. Below is a sample where Buy_Ord_No is the group-by variable.
I would like to generate Lag_Exec_Qty and Exec_Qty. What I am basically doing here is initially setting Exec_Qty equal to 0 when Buy_Act_Type = 1 or Buy_Act_Type = 4. Then, I take the lag value of Exec_Qty as Lag_Exec_Qty. In the same row, I sum up Trd_Qty and Lag_Exec_Qty to get the updated Exec_Qty.
This is the code that I currently have:
for b in buy:
    temp = buy_sorted_file[buy_sorted_file["Buy_Ord_No"] == b]
    temp = temp.sort_values(["Buy_Ord_No", "Buy_Ord_Txn_Time"], ascending=[True, True]).reset_index(drop=True)
    for index in range(len(temp.index)):
        if int(temp["Buy_Act_Type"].iloc[index]) == 1 or int(temp["Buy_Act_Type"].iloc[index]) == 4:
            temp["Exec_Qty"].iloc[index] = 0
            temp["Lag_Exec_Qty"].iloc[index] = 0
        else:
            temp["Lag_Exec_Qty"].iloc[index] = temp["Exec_Qty"].iloc[index - 1]
            temp["Exec_Qty"].iloc[index] = temp["Trd_Qty"].iloc[index] + temp["Lag_Exec_Qty"].iloc[index]
    if len(buy_sorted_exec_file.index) == 0:
        buy_sorted_exec_file = temp.copy()
    else:
        buy_sorted_exec_file = pd.concat([temp, buy_sorted_exec_file]).reset_index(drop=True)

buy_sorted_file = buy_sorted_exec_file.sort_values(["Buy_Ord_Txn_Time", "Buy_Ord_Limit_Pr"], ascending=[True, True]).reset_index(drop=True)
The code takes a really long time to run. Is there any way I can speed this process up?
You should be able to do, without any loops:
temp['Lag_Exec_Qty'] = temp['Exec_Qty'].shift(1)
temp['Exec_Qty'] = temp['Trd_Qty'] + temp['Lag_Exec_Qty']
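Because Exec_Qty feeds back into its own lag, the recurrence is really a cumulative sum of Trd_Qty that restarts whenever Buy_Act_Type is 1 or 4, so the whole per-group loop can be replaced. A sketch under that reading of the question, using the column names from the post:

df = buy_sorted_file.sort_values(["Buy_Ord_No", "Buy_Ord_Txn_Time"]).reset_index(drop=True)

# each reset row (Buy_Act_Type 1 or 4) starts a new run within its order
reset = df["Buy_Act_Type"].isin([1, 4])
run = reset.groupby(df["Buy_Ord_No"]).cumsum()

# cumulative Trd_Qty within each (order, run); reset rows contribute 0 and stay 0
df["Exec_Qty"] = df["Trd_Qty"].where(~reset, 0).groupby([df["Buy_Ord_No"], run]).cumsum()

# lag within each order, forced back to 0 on reset rows as in the loop version
df["Lag_Exec_Qty"] = (
    df.groupby("Buy_Ord_No")["Exec_Qty"].shift(1, fill_value=0).where(~reset, 0)
)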
