My dataframe has a daily price column and a window size column:
import pandas as pd

df = pd.DataFrame(columns=['price', 'window'],
                  data=[[100, 1], [120, 2], [115, 2], [116, 2], [100, 4]])
df
price window
0 100 1
1 120 2
2 115 2
3 116 2
4 100 4
I would like to compute the rolling mean of price for each row, using the window size from the window column.
The result would be this:
df
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
I can't find any elegant way to do it with apply, and I refuse to loop over each row of my DataFrame...
The best solutions, in terms of raw speed and complexity, are based on ideas from the summed-area table. The problem can be considered as a one-dimensional table. Below you can find several approaches, ranked from best to worst.
Numpy + Linear complexity
import numpy as np

size = len(df['price'])
# prefix sums with a leading zero, so price[i] is the sum of the first i prices
price = np.zeros(size + 1)
price[1:] = df['price'].values.cumsum()
# index of the first row of each window (clipped at 0)
window = np.clip(np.arange(size) - (df['window'].values - 1), 0, None)
# each window sum is a difference of two prefix sums, divided by the window length
df['rolling_mean_price'] = (price[1:] - price[window]) / df['window'].values
print(df)
Output
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
Loopy + Linear complexity
price = df['price'].values.cumsum()
# prefix-sum difference per row; the float(...) flag drops the second term when the window reaches back to row 0
df['rolling_mean_price'] = [(price[i] - float((i - w) > -1) * price[i - w]) / w for i, w in enumerate(df['window'])]
Loopy + Quadratic complexity
price = df['price'].values
# slice each window directly and average it (quadratic overall; assumes each window fits within the preceding rows)
df['rolling_mean_price'] = [price[i - (w - 1):i + 1].mean() for i, w in enumerate(df['window'])]
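For a cross-check, here is a minimal pandas-only sketch that loops over the distinct window sizes (not over the rows) and reuses a fixed-size rolling mean within each group; on the example df above it reproduces the same rolling_mean_price.
import pandas as pd

check = pd.Series(index=df.index, dtype=float)
for w, idx in df.groupby('window').groups.items():
    # rolling mean with fixed window w, evaluated only at the rows that use that window size
    check.loc[idx] = df['price'].rolling(w).mean().loc[idx]
print(check)  # matches the rolling_mean_price column computed above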
I would not recommend this approach using pandas.DataFrame.apply() (reasons described here), but if you insist on it, here is one solution:
df['rolling_mean_price'] = df.apply(
    lambda row: df.rolling(row.window).price.mean().iloc[row.name], axis=1)
The output looks like this:
>>> print(df)
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
Related
I have two dataframes. Here are their samples. dt1:
id val
1 smth11
1 smth12
2 smth21
2 smth22
2 smth23
... ...
dt2:
id val
1 blabla
2 bla2
2 bla3
... ...
I have a function, my_func, which calculates a similarity score between strings (like "smth11" and "blabla" in this example) from 0 to 1. For each value in the "val" column of the dt1 dataset, I want to count the number of values in the "val" column of the dt2 dataset that have a score greater than 0.7. Only values that belong to the same "id" group in both datasets are compared. So the desired result should look like this:
id val count
1 smth11 2
1 smth12 2
2 smth21 5
2 smth22 7
2 smth23 3
... ...
The problem is that my actual datasets are huge (several thousand rows each). I wanted to know how I could do this in the most efficient way (perhaps doing the calculations in parallel?)
I think that the following code should be pretty fast, since all the heavy calculations are performed by pandas/NumPy.
import pandas as pd
import numpy as np
import random
# Since the similarity function was not given,
# we'll use random.random to generate values
# between 0 and 1
random.seed(1)
a1 = np.array([
[1, 'smth11'],
[1, 'smth12'],
[2, 'smth21'],
[2, 'smth23'],
[2, 'smth24'],
])
df1 = pd.DataFrame(a1, columns = ['id','val1'])
a2 = np.array([
[1, 'blabla'],
[2, 'bla2'],
[2, 'bla3'],
])
df2 = pd.DataFrame(a2, columns = ['id','val2'])
# matrix merges the df's in such a way as to include
# all (useful) combinations of df1 and df2
matrix = df1.merge(df2, left_on='id', right_on='id')
# Here we add the 'similarity' column to the matrix df.
# You will need to modify the (similarity) lambda function below,
# i.e. something like: lambda row: <some function of row['val1'] and row['val2']>
matrix['similarity'] = matrix.apply(lambda row: random.random(), axis=1)
print('------ matrix with scores')
print(matrix)
# Finally we count cases with similarities > .7
counts = matrix.query("similarity > .7").groupby("val1").size()
print('------ counts')
print(counts)
print('NOTE: the type of "counts" is', type(counts))
Output:
------ matrix with scores
id val1 val2 similarity
0 1 smth11 blabla 0.134364
1 1 smth12 blabla 0.847434
2 2 smth21 bla2 0.763775
3 2 smth21 bla3 0.255069
4 2 smth23 bla2 0.495435
5 2 smth23 bla3 0.449491
6 2 smth24 bla2 0.651593
7 2 smth24 bla3 0.788723
------ counts
val1
smth12 1
smth21 1
smth24 1
dtype: int64
NOTE: the type of "counts" is <class 'pandas.core.series.Series'>
Please let us know how this code performs with your data.
This is my second answer to this question. It uses the exact same method used in the previous answer but it adds:
Programmatically generating arbitrarily large datasets.
Performance measurement.
As it turns out, with datasets df1 and df2 having about 4000 rows each, the computation of "counts" takes about 0.68 seconds. This is on an Intel i5-4570 Quad Core 3.2GHz CPU.
The code, as it appears below, uses Python's (very fast) random.random() function to simulate the similarity calculation. Switching to the slower random.betavariate(alpha, beta) function increases the runtime from about 0.7 to about 1 second. See the comments marked "NOTE" to play with this.
Code:
import pandas as pd
import numpy as np
import random
import time
def get_df1(group_count, min_size, max_size, population):
    rows = []
    for group in range(1, group_count+1):
        row_count = random.randint(min_size, max_size)
        ids = sorted(random.sample(range(1, population+1), row_count))
        for i in range(row_count):
            rows.append([group, f'smth-{group}-{ids[i]}'])
    return pd.DataFrame(rows, columns = ['id','val1'])

def get_df2(group_count, min_size, max_size):
    rows = []
    for group in range(1, group_count+1):
        row_count = random.randint(min_size, max_size)
        for i in range(row_count):
            rows.append([group, 'blablabla'])
    return pd.DataFrame(rows, columns = ['id','val2'])

def simulate(group_count, min_size, max_size, population, sensitivity):
    df1 = get_df1(group_count, min_size, max_size, population)
    df2 = get_df2(group_count, min_size, max_size)
    # Measure time from here...
    start_time = time.time()
    matrix = df1.merge(df2, left_on='id', right_on='id')
    matrix['similarity'] = matrix.apply(
        # NOTE: Using random.random() takes 0.680 seconds
        lambda row: random.random(), axis=1)
        # NOTE: Using random.betavariate(1, 5) takes 1.050 seconds
        #lambda row: random.betavariate(1, 5), axis=1)
    counts = matrix.query(f'similarity > {sensitivity}').groupby("val1").size()
    seconds = time.time() - start_time
    # ... to here
    print('-' * 40, 'df1\n', df1)
    print('-' * 40, 'df2\n', df2)
    print('-' * 40, 'matrix\n', matrix)
    print('-' * 40, 'counts\n', counts)
    print('-------------------------- Summary')
    print('df1 rows: ', len(df1))
    print('df2 rows: ', len(df2))
    print('matrix rows:', len(matrix))
    print('counts rows:', len(counts))
    print(f'--- {seconds:.3f} seconds ---')
random.seed(2)
simulate(100, 30, 50, 500, .7)
Output:
---------------------------------------- df1
id val1
0 1 smth-1-19
1 1 smth-1-44
2 1 smth-1-47
3 1 smth-1-82
4 1 smth-1-87
... ... ...
3917 100 smth-100-449
3918 100 smth-100-465
3919 100 smth-100-478
3920 100 smth-100-496
3921 100 smth-100-500
[3922 rows x 2 columns]
---------------------------------------- df2
id val2
0 1 blablabla
1 1 blablabla
2 1 blablabla
3 1 blablabla
4 1 blablabla
... ... ...
3903 100 blablabla
3904 100 blablabla
3905 100 blablabla
3906 100 blablabla
3907 100 blablabla
[3908 rows x 2 columns]
---------------------------------------- matrix
id val1 val2 similarity
0 1 smth-1-19 blablabla 0.150723
1 1 smth-1-19 blablabla 0.333073
2 1 smth-1-19 blablabla 0.592977
3 1 smth-1-19 blablabla 0.917483
4 1 smth-1-19 blablabla 0.119862
... ... ... ... ...
153482 100 smth-100-500 blablabla 0.645689
153483 100 smth-100-500 blablabla 0.595884
153484 100 smth-100-500 blablabla 0.697562
153485 100 smth-100-500 blablabla 0.704013
153486 100 smth-100-500 blablabla 0.342706
[153487 rows x 4 columns]
---------------------------------------- counts
val1
smth-1-109 13
smth-1-129 11
smth-1-138 10
smth-1-158 11
smth-1-185 5
..
smth-99-49 9
smth-99-492 8
smth-99-59 10
smth-99-95 11
smth-99-97 13
Length: 3922, dtype: int64
-------------------------- Summary
df1 rows: 3922
df2 rows: 3908
matrix rows: 153487
counts rows: 3922
--- 0.673 seconds ---
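The question also mentions running the calculations in parallel. If the real similarity function is the slow part, the per-pair scoring can be parallelised. Here is a minimal sketch using the standard multiprocessing module, where my_func is only a placeholder for the real similarity function and the tiny df1/df2 frames just stand in for your data:
import pandas as pd
from multiprocessing import Pool

def my_func(a, b):
    # placeholder similarity function; replace with the real one
    return 1.0 if a[:3] == b[:3] else 0.0

def score(pair):
    return my_func(*pair)

if __name__ == '__main__':
    df1 = pd.DataFrame({'id': [1, 1, 2], 'val1': ['smth11', 'smth12', 'smth21']})
    df2 = pd.DataFrame({'id': [1, 2, 2], 'val2': ['blabla', 'bla2', 'bla3']})
    matrix = df1.merge(df2, on='id')
    # score each (val1, val2) pair in a pool of worker processes
    with Pool() as pool:
        matrix['similarity'] = pool.map(score, zip(matrix['val1'], matrix['val2']), chunksize=256)
    counts = matrix.query('similarity > 0.7').groupby('val1').size()
    print(counts)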
I have the following table:
df = pd.DataFrame({"A":['CH','CH','NU','NU','J'],
"B":['US','AU','Q','US','Q'],
"TOTAL":[10,13,3,1,18]})
And I wish to get the ratio of B with respect to its total for A, i.e. a ratio column giving each row's share of its A group's TOTAL.
What I do is:
df['sum'] = df.groupby(['A'])['TOTAL'].transform(np.sum)
df['ratio'] = df['TOTAL']/df['sum']*100
Question: how can one achieve this with a lambda (or is there a better way)?
If you want to use a lambda you can do the division inside transform:
df['ratio'] = df.groupby('A')['TOTAL'].transform(lambda x: x / x.sum() * 100)
Output:
A B TOTAL sum ratio
0 CH US 10 23 43.478261
1 CH AU 13 23 56.521739
2 NU Q 3 4 75.000000
3 NU US 1 4 25.000000
4 J Q 18 18 100.000000
But this is slower (because we go group-by-group). If I were you, I'd choose your code over this one.
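To check the speed difference on data of a realistic size, here is a minimal timing sketch (the frame size and group count below are arbitrary):
import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)
big = pd.DataFrame({'A': rng.integers(0, 1_000, 100_000),
                    'TOTAL': rng.integers(1, 100, 100_000)})

def vectorized(df):
    # transform('sum') once, then one vectorized division
    return df['TOTAL'] / df.groupby('A')['TOTAL'].transform('sum') * 100

def with_lambda(df):
    # the lambda is called once per group
    return df.groupby('A')['TOTAL'].transform(lambda x: x / x.sum() * 100)

print(timeit(lambda: vectorized(big), number=10))
print(timeit(lambda: with_lambda(big), number=10))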
I want to be able to expand my DataFrame to incorporate other scenarios. For example, for a DataFrame capturing active users per company, I want to add a scenario where active users increase but do not exceed the total user count.
Example input:
Example output:
I tried using a loop, but it is quite inefficient and yields odd results:
while df[df['active_users'] + add_users <= df['total_users']].any():
    df[(df['active_users'] + add_users) <= df["total_users"]]['active_users'] = (df['active_users'] + add_users).astype(int)
    add_users += 1
Use Index.repeat with DataFrame.loc and for counters use GroupBy.cumcount:
# repeat each row (inactive_users + 1) times, keeping the original index labels
df1 = df.loc[df.index.repeat(df['inactive_users'] + 1)]
# count the remaining inactive users down to 0 within each original row
df1['inactive_users'] = df1.groupby(level=0).cumcount(ascending=False)
# counter of how many users have been activated within each original row
s = df1.groupby(level=0).cumcount()
df1['active_users'] += s
# append ' + n' to the company name; the first (unchanged) copy keeps its original name
df1['company'] = (df1['company'] + ' + ' + s.astype(str).replace('0','')).str.strip(' +')
print(df1)
company contract total_users active_users inactive_users
0 A 10000 10 7 3
0 A + 1 10000 10 8 2
0 A + 2 10000 10 9 1
0 A + 3 10000 10 10 0
1 B 7500 5 4 1
1 B + 1 7500 5 5 0
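To see what the repeat/cumcount pair does in isolation, here is a minimal sketch on a toy frame (the column names are illustrative):
import pandas as pd

toy = pd.DataFrame({'company': ['A', 'B'], 'inactive_users': [3, 1]})
# each row is repeated (inactive_users + 1) times, keeping its original index label
rep = toy.loc[toy.index.repeat(toy['inactive_users'] + 1)]
# cumcount numbers the copies of each original row: 0, 1, 2, ...
print(rep.groupby(level=0).cumcount().tolist())                  # [0, 1, 2, 3, 0, 1]
print(rep.groupby(level=0).cumcount(ascending=False).tolist())   # [3, 2, 1, 0, 1, 0]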
I have a dataframe like this:
azimuth id
15 100
15 1
15 100
150 2
150 100
240 3
240 100
240 100
350 100
What I need is to replace the 100 values with the id from the row whose azimuth is the closest:
Desired output:
azimuth id
15 1
15 1
15 1
150 2
150 2
240 3
240 3
240 3
350 1
350 is close to 15 because azimuth is circular (an angle representation); the difference is 25.
What I have:
def mysubstitution(x):
    for i in x.index[x['id'] == 100]:
        i = int(i)
        diff = (x['azimuth'] - x.loc[i, 'azimuth']).abs()
        for ind in diff.index:
            if diff[ind] > 180:
                diff[ind] = 360 - diff[ind]
            else:
                pass
        exclude = [y for y in x.index if y not in x.index[x['id'] == 100]]
        closer_idx = diff[exclude]
        closer_df = pd.DataFrame(closer_idx)
        sorted_df = closer_df.sort_values('azimuth', ascending=True)
        try:
            a = sorted_df.index[0]
            x.loc[i, 'id'] = x.loc[a, 'id']
        except Exception as a:
            print(a)
    return x
This works OK most of the time, but I guess there is a simpler solution.
Thanks in advance.
I tried to implement the functionality in two steps. First, I built another dataframe that holds, for each azimuth, its id value (for ids other than 100).
Then, using this grouped dataframe, I implemented the replaceAzimuth function, which takes each row of the original dataframe and first checks whether a real id already exists for that azimuth. If so, it uses it directly. Otherwise, it replaces the id value with the id of the closest azimuth from the grouped dataframe.
Here is the implementation:
df = pd.DataFrame([[15,100],[15,1],[15,100],[150,2],[150,100],[240,3],[240,100],[240,100],[350,100]],columns=['azimuth','id'])
df_non100 = df[df['id'] != 100]
df_grouped = df_non100.groupby(['azimuth'])['id'].min().reset_index()
def replaceAzimuth(df_grouped, id_val):
    real_id = df_grouped[df_grouped['azimuth'] == id_val['azimuth']]['id']
    if real_id.size == 0:
        df_diff = df_grouped
        df_diff['azimuth'] = df_diff['azimuth'].apply(lambda x: min(abs(id_val['azimuth'] - x), (360 - id_val['azimuth'] + x)))
        id_val['id'] = df_grouped.iloc[df_diff['azimuth'].idxmin()]['id']
    else:
        id_val['id'] = real_id
    return id_val
df = df.apply(lambda x: replaceAzimuth(df_grouped,x), axis = 1)
df
For me, the code seems to give the output you have shown, but I am not sure whether it will work in all cases!
First, set all ids to NaN if they are 100.
import numpy as np

df.id = np.where(df.id == 100, np.nan, df.id)
Then calculate the pairwise angle differences and fill the NaNs with the id of the closest row.
df.id = df.id.combine_first(
    # pairwise circular angle difference, always in [0, 180]
    pd.DataFrame(np.abs(((df.azimuth.values[:, None] - df.azimuth.values) + 180) % 360 - 180))
    # for each row: indices of all rows, ordered by angular distance
    .pipe(np.argsort)
    # look up the id of each candidate row
    .applymap(lambda x: df.id.iloc[x])
    # take the nearest non-NaN id
    .apply(lambda x: x.dropna().iloc[0], axis=1)
)
df
azimuth id
0 15 1.0
1 15 1.0
2 15 1.0
3 150 2.0
4 150 2.0
5 240 3.0
6 240 3.0
7 240 3.0
8 350 1.0
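The key ingredient in the chain above is the pairwise circular angle difference; as a standalone sketch on the azimuths from the question:
import numpy as np

a = np.array([15, 150, 240, 350])
# (x - y + 180) % 360 - 180 maps any difference into [-180, 180); abs gives the shorter arc
diff = np.abs(((a[:, None] - a) + 180) % 360 - 180)
print(diff)
# [[  0 135 135  25]
#  [135   0  90 160]
#  [135  90   0 110]
#  [ 25 160 110   0]]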
I am implementing a ranking system and basically I have a field called site_fees that accounts for 10% of the total score under consideration. A site fee of 0 would get all 10 points. What I want to do is calculate how many points the non-zero fees would get, but I am struggling to do so.
My initial approach was to split the dataframe into 2 dataframes (dfb where site_fees are 0 and dfa where they are > 0) and calculate the average for dfa, assign the rating for dfb as 10, then union the two.
The code is as follows:
dfSitesa = dfSites[dfSites['site_fees'].notnull()]
dfSitesb = dfSites[dfSites['site_fees'].isnull()]
dfSitesa['rating'] = FeeWeight * \
    dfSitesa['site_fees'].min() / dfSitesa['site_fees']
dfSitesb['rating'] = FeeWeight
dfSites = pd.concat([dfSitesa,dfSitesb])
This produces an output; however, the results for dfa are not correct, because the minimum of dfa is 5000 instead of 0, so the rating of a site with $5000 in fees comes out as 10 (the maximum), which is not correct. What am I doing wrong?
The minimum non-zero site_fee is 5000 and the maximum is 15000. Based on this, I would expect a general ranking system like:
site_fees | rating
15000     | 0
10000     | 3.3
5000      | 6.6
0         | 10
Here is a way to do it:
import pandas as pd

dfSites = pd.DataFrame({'site_fees': [0, 1, 2, 3, 5]})
FeeWeight = 10
dfSitesa = dfSites[dfSites['site_fees'].notnull()]
dfSitesb = dfSites[dfSites['site_fees'].isnull()]
dfSitesb['rating'] = FeeWeight
# scale linearly so the minimum fee gets the full weight and the maximum fee gets 0
factor = dfSitesa['site_fees'].max() - dfSitesa['site_fees'].min()
dfSitesa['rating'] = FeeWeight * (1 - (dfSitesa['site_fees'] - dfSitesa['site_fees'].min()) / factor)
dfSites = pd.concat([dfSitesa, dfSitesb])
In [1]: print(dfSites)
Out[1]:
site_fees rating
0 0 10.0
1 1 8.0
2 2 6.0
3 3 4.0
4 5 0.0
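As a sanity check, plugging the 0/5000/10000/15000 fee scale from the question into the same formula reproduces the expected 10 / 6.6 / 3.3 / 0 ratings:
import pandas as pd

FeeWeight = 10
fees = pd.DataFrame({'site_fees': [0, 5000, 10000, 15000]})
lo, hi = fees['site_fees'].min(), fees['site_fees'].max()
fees['rating'] = FeeWeight * (1 - (fees['site_fees'] - lo) / (hi - lo))
print(fees)
#    site_fees     rating
# 0          0  10.000000
# 1       5000   6.666667
# 2      10000   3.333333
# 3      15000   0.000000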