Is there a way to dynamically perform this loop? - python

Do you know if there is a better way to perform this task without using a for loop?
Starting with the following dataset:
import pandas as pd
df = pd.DataFrame({'A': [90, 85, 85, 85, 100, 170, 150, 130, 125, 125],
                   'B': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})
df['C'] = 0
df.loc[0, 'C'] = df.loc[0, 'B']
df['D'] = 0
df.loc[0, 'D'] = df.loc[0, 'C'] * 0.95
df['E'] = 0
df.loc[0, 'E'] = df.loc[0, 'C'] * 0.80
Now,
if the value in row 1 column A is greater than the value in row 0 column D:
    the value in row 1 column C will be equal to the value in row 1 column A * 2
    the value in row 1 column D will be equal to the value in row 1 column C * 0.95
    the value in row 1 column E will be equal to the value in row 1 column D * 0.8
elif the value in row 1 column A is less than the value in row 0 column E:
    the value in row 1 column C will be equal to the value in row 1 column A
    the value in row 1 column D will be equal to the value in row 1 column C * 0.95
    the value in row 1 column E will be equal to the value in row 1 column D * 0.8
else:
    the value in row 1 column C will be equal to the value in row 0 column C
    the value in row 1 column D will be equal to the value in row 1 column C * 0.95
    the value in row 1 column E will be equal to the value in row 1 column D * 0.8
As output, I would like to create a df like this:
df_out = pd.DataFrame({'A': [90, 85, 85, 85, 100, 170, 150, 130, 125, 125],
                       'B': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
                       'C': [100, 100, 100, 100, 200, 200, 150, 150, 150, 150],
                       'D': [95, 95, 95, 95, 190, 190, 190, 143, 143, 143],
                       'E': [80, 80, 80, 80, 160, 160, 160, 120, 120, 120]})
Considering that I have to iterate over more than 5000 rows and around 3000 possible scenarios, I'm looking for the fastest way to perform this task, and I've noticed that the for loop is extremely slow.
Thank you in advance, and apologies for the trivial question! I'm new to Python and trying to learn as much as possible!
Best

Per our discussion in the comments, if you do the loop this way it's reasonably quick:
alist = [90, 85, 85, 85, 100, 170, 150, 130, 125, 125] * 500
c = 100   # row 0: C = B
d = 95    # row 0: D = C * 0.95
e = 80    # row 0: E = C * 0.80
clist = [c]
dlist = [d]
elist = [e]
for a in alist[1:]:
    if a > d:
        c_new = round(a * 1.5)
    elif a < e:
        c_new = a
    else:
        c_new = c
    c = c_new
    d = round(c_new * 0.95)
    e = round(d * 0.8)
    clist.append(c_new)
    dlist.append(d)
    elist.append(e)
df_out = pd.DataFrame({'A': alist, 'C': clist, 'D': dlist, 'E': elist})
print(df_out.head(10))
A C D E
0 90 100 95 80
1 85 100 95 76
2 85 100 95 76
3 85 100 95 76
4 100 150 142 114
5 170 255 242 194
6 150 150 142 114
7 130 150 142 114
8 125 150 142 114
9 125 150 142 114
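If you prefer a reusable version that starts from the original DataFrame (so column B is kept), the same loop can be wrapped in a function. This is only a minimal sketch of the answer above: the function name fill_cde is hypothetical, and the 1.5 factor and rounding follow the comments discussion, not the original question text.
import pandas as pd

def fill_cde(df):
    # Sequential C/D/E logic from the answer above, applied to the question's df.
    c = df.loc[0, 'B']           # row 0: C comes from B
    d = round(c * 0.95)          # row 0: D = C * 0.95
    e = round(c * 0.80)          # row 0: E = C * 0.80
    clist, dlist, elist = [c], [d], [e]
    for a in df['A'].iloc[1:]:
        if a > d:
            c = round(a * 1.5)
        elif a < e:
            c = a
        # else: keep the previous c
        d = round(c * 0.95)
        e = round(d * 0.8)
        clist.append(c)
        dlist.append(d)
        elist.append(e)
    out = df.copy()
    out['C'], out['D'], out['E'] = clist, dlist, elist
    return out

df_out = fill_cde(df)
print(df_out.head(10))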

Related

Compare two dataframe and conditionally capture random data in Python

The main logic of my question is about comparing two dataframes, but it is different from the existing questions here: Q1, Q2, Q3.
Let's create two dummy dataframes.
import pandas as pd

data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
         'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
         'count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70, 100, 120, 180, 200, 300, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
data2 contains the whole set of checkinid values. I am trying to create a training file.
For example, user 1 visited 5 places, whose ids are (10, 20, 30, 40, 50).
I want to randomly add places that user 1 did not visit and set their 'count' column to 0.
My expected dataframe looks like this:
user checkinid count
1 10 1
1 20 1
1 30 1
1 40 1
1 50 1
1 300 0 (add randomly)
1 180 0 (add randomly)
1 55 0 (add randomly)
2 35 1
2 45 1
2 55 1
2 20 1
2 120 1
2 10 0 (add randomly)
2 400 0 (add randomly)
2 180 0 (add randomly)
... ...
Readers may ask how many random rows should be added: for each user, adding 3 non-visited places is enough for this example.
This might not be the best solution, but it works. You have to get each user and then pick the checkinids which are not assigned to them:
# get all users
users = df1.user.unique()
for user in users:
    checkins = df1.loc[df1['user'] == user]
    # outer merge with indicator, keep ids that are only in df2, then sample 3 of them
    df = (checkins.merge(df2, how='outer', indicator=True)
                  .loc[lambda x: x['_merge'] == 'right_only']
                  .sample(n=3))
    df['user'] = [user, user, user]
    df['count'] = [0, 0, 0]
    df.pop("_merge")
    df1 = df1.append(df, ignore_index=True)
# sort dataframe based on user
df1 = df1.sort_values(by=['user'])
# re-arrange cols
df1 = df1[['user', 'checkinid', 'count']]
# print df
print(df1)
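As a side note, appending to df1 inside the loop copies the whole frame on every iteration. A slightly faster variant of the same idea (a sketch only; the names sampled and result are just for illustration) collects the sampled rows in a list and concatenates once at the end:
sampled = []
for user in df1['user'].unique():
    visited = df1.loc[df1['user'] == user, 'checkinid']
    # candidate places this user has not visited
    not_visited = df2.loc[~df2['checkinid'].isin(visited)]
    # pick 3 of them and mark them as not visited (count = 0)
    sampled.append(not_visited.sample(n=3).assign(user=user, count=0))

result = (pd.concat([df1] + sampled, ignore_index=True)
            .sort_values('user')[['user', 'checkinid', 'count']])
print(result)
This assumes df1 and df2 are the original frames defined above, before any rows were appended.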

Divide a pandas dataframe by the sum of its index column and row

Here is what I currently have:
print(df)
10 25 26
10 530 1 46
25 1 61 61
26 46 61 330
How can I transform this to df1 so that each element is divided by the sum of the diagonal values of its row and column? The output of df1 should look like this:
df1:
10 25 26
10 530/(530) 1/(530+61) 46/(530+330)
25 1/(61+530) 61/(61) 61/(61+330)
26 46/(330+530) 61/(330+61) 330/(330)
print(df1)
10 25 26
10 1 0.0016 0.0534
25 0.0016 1 0.1560
26 0.0534 0.1560 1
IIUC, try:
import numpy as np

a = np.diag(df)[None, :]           # diagonal as a row vector
b = np.diag(df)[:, None]           # diagonal as a column vector
c = a + b                          # pairwise sums of the diagonal values
np.fill_diagonal(c, np.diag(df))   # diagonal cells are divided by their own value only
df_out = df.div(c)
df_out
Output:
10 25 26
10 1.000000 0.001692 0.053488
25 0.001692 1.000000 0.156010
26 0.053488 0.156010 1.000000
I think this is a solution, but you have to change your columns and index labels:
import pandas as pd

df = pd.DataFrame({530: [530, 1, 46],
                   61: [1, 61, 61],
                   330: [46, 61, 330]},
                  index=[530, 61, 330])
for i in range(len(df)):
    for j in range(len(df)):
        if i == j:
            df.iloc[i, j] = df.iloc[i, j] / df.index[i]
        else:
            df.iloc[i, j] = df.iloc[i, j] / (df.index[i] + df.columns[j])
df
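The same denominators can also be built without the nested loop by broadcasting the index values against the column values. A small vectorized sketch of that idea, using the relabelled frame from the answer above:
import numpy as np
import pandas as pd

df = pd.DataFrame({530: [530, 1, 46],
                   61: [1, 61, 61],
                   330: [46, 61, 330]},
                  index=[530, 61, 330])

# denominator for each cell: index value + column value
denom = df.index.to_numpy()[:, None] + df.columns.to_numpy()[None, :]
np.fill_diagonal(denom, df.index)   # diagonal cells are divided by their own value only
df_out = df / denom
print(df_out.round(4))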
You can also divide each column by its maximum; this reproduces the diagonal of ones, although the off-diagonal values will differ from the example above.
df1 = pd.DataFrame(
    {
        "column1": df['10'].divide(df['10'].max()),
        "column2": df['25'].divide(df['25'].max()),
        "column3": df['26'].divide(df['26'].max()),
    }
)

Pandas - Iterate over rows and compare previous values - faster

I am trying to get my results faster (13 minutes for 800 rows). I asked a similar question here: pandas - iterate over rows and calculate - faster - but I was not able to use those solutions for my variation. The difference is that if the overlap count of previous values in 'col2' reaches n=3, the value of 'col2' in that row is set to 0, which affects the following calculations.
import pandas as pd

d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)
df["overlap_count"] = ""  # create new column
n = 3  # if x >= n, then value = 0
for row in range(len(df)):
    x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
    df["overlap_count"].loc[row] = x
    if x >= n:
        df["col2"].loc[row] = 0
        df["overlap_count"].loc[row] = 'x'
df
I obtain the following result: values in col2 are replaced with 0 once the overlap count reaches n, and the overlap_count column is filled:
col1 col2 overlap_count
0 20 39 0
1 23 32 1
2 40 42 0
3 41 50 1
4 46 63 1
5 47 67 2
6 48 0 x
7 49 0 x
8 50 68 2
9 50 0 x
10 52 0 x
11 55 0 x
12 56 0 x
13 69 71 0
14 70 66 1
Thank you for your help and time!
I think you can use numba to improve performance. It only works with numeric values, so -1 is added instead of 'x', and the new column is filled with 0 instead of an empty string:
df["overlap_count"] = 0 #create new column
n = 3 #if x >= n, then value = 0
a = df[['col1','col2','overlap_count']].values
from numba import njit
#njit
def custom_sum(arr, n):
for row in range(arr.shape[0]):
x = (arr[0:row, 1] > arr[row, 0]).sum()
arr[row, 2] = x
if x >= n:
arr[row, 1] = 0
arr[row, 2] = -1
return arr
df1 = pd.DataFrame(custom_sum(a, n), columns=df.columns)
print (df1)
col1 col2 overlap_count
0 20 39 0
1 23 32 1
2 40 42 0
3 41 50 1
4 46 63 1
5 47 67 2
6 48 0 -1
7 49 0 -1
8 50 68 2
9 50 0 -1
10 52 0 -1
11 55 0 -1
12 56 0 -1
13 69 71 0
14 70 66 1
Performance:
d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)
# 4500 rows
df = pd.concat([df] * 300, ignore_index=True)
print (df)
In [115]: %%timeit
...: pd.DataFrame(custom_sum(a, n), columns=df.columns)
...:
8.11 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [116]: %%timeit
     ...: for row in range(len(df)):
     ...:     x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
     ...:     df["overlap_count"].loc[row] = x
     ...:
     ...:     if x >= n:
     ...:         df["col2"].loc[row] = 0
     ...:         df["overlap_count"].loc[row] = 'x'
     ...:
     ...:
7.84 s ± 442 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Another suggestion: create a function and then apply it with a list comprehension (here fn stands for whatever per-value function you define):
df['overlap_count'] = [fn(i) for i in df['overlap_count']]
Try this one; maybe it will be faster:
df['overlap_count'] = df.groupby('col1')['col2'].transform(lambda g: len((g >= g.name).index))

Merge DataFrames with ordering criteria

In a previous question, I was asking how to match values from this DataFrame source:
car_id lat lon
0 100 10.0 15.0
1 100 12.0 10.0
2 100 13.0 09.0
3 110 23.0 08.0
4 110 13.0 09.0
5 110 12.0 10.0
6 110 12.0 02.0
7 120 11.0 11.0
8 120 12.0 10.0
9 120 13.0 09.0
10 120 14.0 08.0
11 130 12.0 10.0
And keep only those whose coords are in this second DataFrame coords:
lat lon
0 12.0 10.0
1 13.0 09.0
But this time I'd like to match each car_id that has:
all the values from coords
in the same order
So that the resulting DataFrame result would be:
car_id
1 100
2 120
# 110 has all the values from coords, but not in the same order
# 130 doesn't have all the values from coords
Is there a way to achieve this result in a vectorized way, avoiding going through a lot of loops and conditionals?
Plan: we will groupby 'car_id' and evaluate each subset. After an inner merge we should see two things:
the resultant merged dataframe should have the same values as coords
the resultant merged dataframe should cover everything in coords
def duper(df):
    m = df.merge(coords)
    c = pd.concat([m, coords])
    # we put the merged rows first and those are
    # the ones we'll keep after `drop_duplicates(keep='first')`
    # `keep='first'` is the default, so I don't pass it
    c1 = (c.drop_duplicates().values == coords.values).all()
    # if `keep=False` then I drop all duplicates. If I got
    # everything in `coords` this should be empty
    c2 = c.drop_duplicates(keep=False).empty
    return c1 & c2

source.set_index('car_id').groupby(level=0).filter(duper).index.unique().values
array([100, 120])
Slight alternative:
def duper(df):
    m = df.drop('car_id', axis=1).merge(coords)
    c = pd.concat([m, coords])
    c1 = (c.drop_duplicates().values == coords.values).all()
    c2 = c.drop_duplicates(keep=False).empty
    return c1 & c2

source.groupby('car_id').filter(duper).car_id.unique()
This isn't pretty, but what if you did something like this:
df2 = pd.DataFrame(df, copy=True)
df2[['lat2', 'lon2']] = df[['lat', 'lon']].shift(-1)
df2.set_index(['lat', 'lon', 'lat2', 'lon2'], inplace=True)
print(df2.loc[(12, 10, 13, 9)].reset_index(drop=True))
car_id
0 100
1 120
And this would be the general case:
raw_data = {'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
            'lat': [10, 12, 13, 23, 13, 12, 12, 11, 12, 13, 14, 12],
            'lon': [15, 10, 9, 8, 9, 10, 2, 11, 10, 9, 8, 10]}
df = pd.DataFrame(raw_data, columns=['car_id', 'lat', 'lon'])

raw_data = {
    'lat': [10, 12, 13],
    'lon': [15, 10, 9],
}
coords = pd.DataFrame(raw_data, columns=['lat', 'lon'])

def submatch(df, match):
    df2 = pd.DataFrame(df['car_id'])
    # append the coordinates of the next rows as extra columns
    for x in range(match.shape[0]):
        df2[['lat{}'.format(x), 'lon{}'.format(x)]] = df[['lat', 'lon']].shift(-x)
    n = match.shape[0]
    cols = [item for sublist in
            [['lat{}'.format(x), 'lon{}'.format(x)] for x in range(n)]
            for item in sublist]
    df2.set_index(cols, inplace=True)
    return df2.loc[tuple(match.stack().values)].reset_index(drop=True)

print(submatch(df, coords))
car_id
0 100

Write to file from dictionary instead of pandas

I would like to write dictionaries to a file in a different way.
Right now, I am using Pandas to convert the dictionaries to DataFrames, combine several DataFrames, and then write them to a file (see the code below).
However, the Pandas operations seem to take a very long time, and I would like to do this more efficiently.
Is it possible to do the approach below more efficiently while retaining the structure of the output files (e.g. by writing from the dictionaries directly)?
import pandas as pd

labels = ["A", "B", "C"]
periods = [0, 1, 2]
header = ['key', 'scenario', 'metric', 'labels']
metrics_names = ["metric_balances", "metric_record"]

key = "key_x"
scenario = "base"
# The metrics are structured as dicts where the keys are `periods` and the values
# are arrays (where each array entry corresponds to one of the `labels`)
metric_balances = {0: [1000, 100, 50], 1: [900, 150, 100], 2: [800, 350, 100]}
metric_record = {0: [20, 10, 5], 1: [90, 15, 10], 2: [80, 35, 10]}

# Combine all metrics into one output structure for key "x"
output_x = pd.concat([pd.DataFrame(metric_balances, columns=periods, index=labels),
                      pd.DataFrame(metric_record, columns=periods, index=labels)],
                     keys=pd.MultiIndex.from_product([[key], [scenario], metrics_names]),
                     names=header)

key = "key_y"
scenario = "base_2"
metric_balances = {0: [2000, 200, 50], 1: [1900, 350, 100], 2: [1200, 750, 100]}
metric_record = {0: [40, 5, 3], 1: [130, 45, 10], 2: [82, 25, 18]}

# Combine all metrics into one output structure for key "y"
output_y = pd.concat([pd.DataFrame(metric_balances, columns=periods, index=labels),
                      pd.DataFrame(metric_record, columns=periods, index=labels)],
                     keys=pd.MultiIndex.from_product([[key], [scenario], metrics_names]),
                     names=header)

# Concatenate all output dataframes
output = pd.concat([output_x, output_y], names=header)

# Print results to a csv file
output.to_csv("test.csv", index=False)
Below are the respective outputs:
OUTPUT X
0 1 2
key scenario metric labels
key_x base metric_balances A 1000 900 800
B 100 150 350
C 50 100 100
metric_record A 20 90 80
B 10 15 35
C 5 10 10
-----------------------------------
OUTPUT Y
0 1 2
key scenario metric labels
key_y base_2 metric_balances A 2000 1900 1200
B 200 350 750
C 50 100 100
metric_record A 40 130 82
B 5 45 25
C 3 10 18
------------------------------
OUTPUT COMBINED
0 1 2
key scenario metric labels
key_x base metric_balances A 1000 900 800
B 100 150 350
C 50 100 100
metric_record A 20 90 80
B 10 15 35
C 5 10 10
key_y base_2 metric_balances A 2000 1900 1200
B 200 350 750
C 50 100 100
metric_record A 40 130 82
B 5 45 25
C 3 10 18
I was looking into row-wise printing of the dictionaries, but I had difficulties merging the labels with the relevant arrays.
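One possibility, if Pandas is only being used here for formatting, is to skip the DataFrames and write each (key, scenario, metric, label) row directly with the csv module. This is only a minimal sketch, assuming the dict structure shown above (one array per period, one entry per label); write_metrics is a hypothetical helper, not part of the original code:
import csv

def write_metrics(writer, key, scenario, metrics, labels, periods):
    # One row per (metric, label), with one column per period.
    for metric_name, metric in metrics.items():
        for i, label in enumerate(labels):
            writer.writerow([key, scenario, metric_name, label]
                            + [metric[p][i] for p in periods])

with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header + periods)
    # one call per key/scenario, passing that key's metric dicts
    write_metrics(writer, "key_x", "base",
                  {"metric_balances": metric_balances,
                   "metric_record": metric_record},
                  labels, periods)
Repeating the write_metrics call for "key_y" with its own dicts would reproduce the combined output, with the repeated key/scenario/metric values written on every row rather than left blank as in the Pandas display.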
