R - Merging Two Data.Frames with Row-Level Conditional Variables

Short version: I have a slightly trickier-than-usual merge operation I'd like help optimizing with dplyr or merge. I have a number of solutions already, but they run quite slowly over large datasets, and I am curious whether there is a faster method in R (or, alternatively, in SQL or Python).
I have two data.frames:
an asynchronous log of events tied to Stores, and
a table that gives more details about the stores in that log.
The issue: Store IDs are unique identifiers for a specific location, but store locations may change ownership from one period to the next (and, just for completeness, no two owners may possess the same store at the same time). So when I merge in store-level info, I need some sort of conditional that merges the info for the correct period.
Reproducible Example:
# asynchronous log.
# t for period.
# Store for store loc ID
# var1 just some variable.
set.seed(1)
df <- data.frame(
  t = c(1,1,1,2,2,2,3,3,4,4,4),
  Store = c(1,2,3,1,2,3,1,3,1,2,3),
  var1 = runif(11,0,1)
)
# Store table
# You can see, lots of store location opening and closing,
# StartDate is when this business came into existence
# Store is the store id from df
# CloseDate is when this store went out of business
# storeVar1 is just some important var to merge over
Stores <- data.frame(
  StartDate = c(0,0,0,4,4),
  Store = c(1,2,3,2,3),
  CloseDate = c(9,2,3,9,9),
  storeVar1 = c("a","b","c","d","e")
)
Now, I only want to merge information from the Stores data.frame into the log if that store is open for business in that period (t). StartDate and CloseDate indicate the first and last periods of the business's operation, respectively. (For completeness, but not too important: StartDate 0 means the store existed since before the sample, and CloseDate 9 means the store had not gone out of business at that location by the end of the sample.)
One solution relies on a period-t-level split() and dplyr::rbind_all(), e.g.
# The following seems to do the trick.
complxMerge_v1 <- function(df, Stores, by = "Store"){
  library("dplyr")
  temp <- split(df, df$t)
  for (Period in names(temp)) {
    temp[[Period]] <- dplyr::left_join(
      temp[[Period]],
      dplyr::filter(Stores,
                    StartDate <= as.numeric(Period) &
                      CloseDate >= as.numeric(Period)),
      by = "Store"
    )
  }
  df <- dplyr::rbind_all(temp); rm(temp)
  df
}
complxMerge_v1(df, Stores, "Store")
Functionally, this appears to work (I haven't come across a significant error yet, anyway). However, we are dealing with (as is increasingly usual) billions of rows of log data.
I made a larger reproducible example on sense.io if you'd like to use it for benchmarking. See here: https://sense.io/economicurtis/r-faster-merging-of-two-data.frames-with-row-level-conditionals
Two questions:
First and foremost, is there another way to approach this problem using similar methods that will run faster?
Second, is there by chance a quick and easy solution in SQL or Python (with which I am not quite as familiar, but could rely on if need be)?
Also, can you help me articulate this question in a more general, abstract way? Right now I only know how to talk about the problem in context-specific terms, but I'd love to be able to discuss these types of issues in more appropriate, general programming or data-manipulation terminology.

In R, you could take a look at the data.table::foverlaps function:
library(data.table)
# Set start and end values in `df` and key by them and by `Store`
setDT(df)[, c("StartDate", "CloseDate") := list(t, t)]
setkey(df, Store, StartDate, CloseDate)
# Run `foverlaps` function
foverlaps(setDT(Stores), df)
# Store t var1 StartDate CloseDate i.StartDate i.CloseDate storeVar1
# 1: 1 1 0.26550866 1 1 0 9 a
# 2: 1 2 0.90820779 2 2 0 9 a
# 3: 1 3 0.94467527 3 3 0 9 a
# 4: 1 4 0.62911404 4 4 0 9 a
# 5: 2 1 0.37212390 1 1 0 2 b
# 6: 2 2 0.20168193 2 2 0 2 b
# 7: 3 1 0.57285336 1 1 0 3 c
# 8: 3 2 0.89838968 2 2 0 3 c
# 9: 3 3 0.66079779 3 3 0 3 c
# 10: 2 4 0.06178627 4 4 4 9 d
# 11: 3 4 0.20597457 4 4 4 9 e
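For the Python side of the question, the same conditional merge can be sketched in pandas by joining on Store and then filtering on the period window. This is only a rough sketch, reusing the column names from the example (var1 is filled with placeholder values here); it drops log rows with no open store in that period:
import pandas as pd
df = pd.DataFrame({"t":     [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
                   "Store": [1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3],
                   "var1":  range(11)})
Stores = pd.DataFrame({"StartDate": [0, 0, 0, 4, 4],
                       "Store":     [1, 2, 3, 2, 3],
                       "CloseDate": [9, 2, 3, 9, 9],
                       "storeVar1": list("abcde")})
# ordinary join on Store, then keep only rows whose period t falls in the ownership window
merged = df.merge(Stores, on="Store", how="left")
merged = merged[(merged["t"] >= merged["StartDate"]) & (merged["t"] <= merged["CloseDate"])]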

You can transform your Stores data.frame by adding a t column that contains all values of t for which a given store is open, and then use the unnest function from the tidyr package to transform it to "long" form.
require("tidyr")
require("dplyr")
complxMerge_v2 <- function(df, Stores, by = NULL) {
  Stores %>%
    mutate(., t = lapply(1:nrow(.),
                         function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"])) %>%
    unnest(t) %>%
    left_join(df, ., by = by)
}
complxMerge_v2(df, Stores)
# Joining by: c("t", "Store")
# t Store var1 StartDate CloseDate storeVar1
# 1 1 1 0.26550866 0 9 a
# 2 1 2 0.37212390 0 2 b
# 3 1 3 0.57285336 0 3 c
# 4 2 1 0.90820779 0 9 a
# 5 2 2 0.20168193 0 2 b
# 6 2 3 0.89838968 0 3 c
# 7 3 1 0.94467527 0 9 a
# 8 3 3 0.66079779 0 3 c
# 9 4 1 0.62911404 0 9 a
# 10 4 2 0.06178627 4 9 d
# 11 4 3 0.20597457 4 9 e
require("microbenchmark")
# I've downloaded your large data samples
df <- read.csv("./df.csv")
Stores <- read.csv("./Stores.csv")
microbenchmark(complxMerge_v1(df, Stores), complxMerge_v2(df, Stores), times = 10L)
# Unit: milliseconds
# expr min lq mean median uq max neval
# complxMerge_v1(df, Stores) 9501.217 9623.754 9712.8689 9681.3808 9816.8984 9886.5962 10
# complxMerge_v2(df, Stores) 532.744 539.743 567.7207 561.9635 588.0637 636.5775 10
Here are step-by-step results to make the process clear.
Stores_with_t <- Stores %>%
  mutate(., t = lapply(1:nrow(.),
                       function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))
# StartDate Store CloseDate storeVar1 t
# 1 0 1 9 a 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2 0 2 2 b 0, 1, 2
# 3 0 3 3 c 0, 1, 2, 3
# 4 4 2 9 d 4, 5, 6, 7, 8, 9
# 5 4 3 9 e 4, 5, 6, 7, 8, 9
# After that, apply `unnest(t)`
Stores_with_t_unnest <- Stores_with_t %>% unnest(t)
# StartDate Store CloseDate storeVar1 t
# 1 0 1 9 a 0
# 2 0 1 9 a 1
# 3 0 1 9 a 2
# 4 0 1 9 a 3
# 5 0 1 9 a 4
# 6 0 1 9 a 5
# 7 0 1 9 a 6
# 8 0 1 9 a 7
# 9 0 1 9 a 8
# 10 0 1 9 a 9
# 11 0 2 2 b 0
# 12 0 2 2 b 1
# 13 0 2 2 b 2
# 14 0 3 3 c 0
# 15 0 3 3 c 1
# 16 0 3 3 c 2
# 17 0 3 3 c 3
# 18 4 2 9 d 4
# 19 4 2 9 d 5
# 20 4 2 9 d 6
# 21 4 2 9 d 7
# 22 4 2 9 d 8
# 23 4 2 9 d 9
# 24 4 3 9 e 4
# 25 4 3 9 e 5
# 26 4 3 9 e 6
# 27 4 3 9 e 7
# 28 4 3 9 e 8
# 29 4 3 9 e 9
# And then a simple `left_join`
left_join(df, Stores_with_t_unnest)
# Joining by: c("t", "Store")
# t Store var1 StartDate CloseDate storeVar1
# 1 1 1 0.26550866 0 9 a
# 2 1 2 0.37212390 0 2 b
# 3 1 3 0.57285336 0 3 c
# 4 2 1 0.90820779 0 9 a
# 5 2 2 0.20168193 0 2 b
# 6 2 3 0.89838968 0 3 c
# 7 3 1 0.94467527 0 9 a
# 8 3 3 0.66079779 0 3 c
# 9 4 1 0.62911404 0 9 a
# 10 4 2 0.06178627 4 9 d
# 11 4 3 0.20597457 4 9 e
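The same expand-then-join idea translates fairly directly to pandas: build the long form of Stores with a list column and explode it, then do an ordinary merge. A rough sketch, assuming df and Stores are available as pandas DataFrames with the question's columns (e.g. as in the pandas sketch earlier):
# one row per (Store, t) combination inside each store's open window, then a plain merge
Stores_long = Stores.assign(
    t=[list(range(int(s), int(e) + 1)) for s, e in zip(Stores["StartDate"], Stores["CloseDate"])]
).explode("t")
Stores_long["t"] = Stores_long["t"].astype(int)
result = df.merge(Stores_long, on=["Store", "t"], how="left")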

Related

Auto re-assign ids in a dataframe

I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
        'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id column contains large integers (6 digits). I want a way to simplify it, starting from 10, so that 542588 becomes 10, 542594 becomes 11, etc.
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize will enumerate the keys in the order in which they occur in your data, while the groupby().ngroup() solution will enumerate the keys in increasing order. You can mimic the increasing order with factorize by sorting the data first, or you can replicate the data order with groupby() by passing sort=False to it.
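A small illustration of that difference, with a hypothetical id column that is not already in increasing order:
import pandas as pd
s = pd.Series([30, 10, 10, 20])                             # hypothetical ids, not in increasing order
print(s.factorize()[0] + 10)                                # [10 11 11 12] -- order of first appearance
print(s.groupby(s).ngroup().add(10).tolist())               # [12, 10, 10, 11] -- increasing key order
print(s.groupby(s, sort=False).ngroup().add(10).tolist())   # [10, 11, 11, 12] -- data order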
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
A naive way is to loop through the IDs and, every time you encounter an ID you haven't seen before, associate it in a dictionary with a new ID (starting at 10 and incrementing by 1 each time).
You can then swap out the values of the ID column using the map method.
new_ids = dict()
new_id = 10
for old_id in df['id']:
    if old_id not in new_ids:
        new_ids[old_id] = new_id
        new_id += 1
df['id'] = df['id'].map(new_ids)
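The same dictionary can also be built without an explicit loop, since pd.unique preserves the order of first appearance; a small variation on the same idea:
# map each old id to 10, 11, 12, ... in order of first appearance
mapping = {old: new for new, old in enumerate(pd.unique(df['id']), start=10)}
df['id'] = df['id'].map(mapping)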

How to sort column by value counts in pandas

I want to sort a dataframe in pandas by sorting two columns by value counts, where one depends on the other. As seen in the image, I have achieved categorical sorting. However, I want the 'category' column to be sorted by value counts, and then the dataframe to be sorted again based on 'beneficiary_name' within the same category.
This is the code I have written to achieve this so far:
data_category = data_category.sort_values(by=['category','beneficiary_name'], ascending=False)
Please help me figure this out. Thanks.
Inspired by this related question:
Create column of value_counts in Pandas dataframe
import pandas as pd
df = pd.DataFrame({'id': range(9), 'cat': list('ababaacdc'), 'benef': list('uuuuiiiii')})
print(df)
# id cat benef
# 0 0 a u
# 1 1 b u
# 2 2 a u
# 3 3 b u
# 4 4 a i
# 5 5 a i
# 6 6 c i
# 7 7 d i
# 8 8 c i
df['cat_count'] = df.groupby(['cat'])['id'].transform('count')
print(df)
# id cat benef cat_count
# 0 0 a u 4
# 1 1 b u 2
# 2 2 a u 4
# 3 3 b u 2
# 4 4 a i 4
# 5 5 a i 4
# 6 6 c i 2
# 7 7 d i 1
# 8 8 c i 2
df = df.sort_values(by=['cat_count', 'cat', 'benef'], ascending=False)
print(df)
# id cat benef cat_count
# 0 0 a u 4
# 2 2 a u 4
# 4 4 a i 4
# 5 5 a i 4
# 6 6 c i 2
# 8 8 c i 2
# 1 1 b u 2
# 3 3 b u 2
# 7 7 d i 1
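Continuing the same example, the count column can also be built by mapping value_counts onto the column and dropped again after sorting; a small variation on the same idea:
df['cat_count'] = df['cat'].map(df['cat'].value_counts())
df = (df.sort_values(by=['cat_count', 'cat', 'benef'], ascending=False)
        .drop(columns='cat_count'))
print(df)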

Rolling Cumulative Sum of a Column's Values Until a Condition Is Met

I have a dataframe which is called "df". It looks like this:
a
0 2
1 3
2 0
3 5
4 1
5 3
6 1
7 2
8 2
9 1
I would like to produce a cumulative sum column which:
sums the contents of column "a" cumulatively,
until it reaches a sum of 5,
then resets the cumulative total to 0 and continues with the summing process.
I would like the dataframe to look like this:
a a_cumm_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
In the dataframe, the column "a_cumm_sum" contains the results of the cumulative sum.
Does anyone know how I can achieve this? I have hunted through the forums and seen similar questions, for example this one, but they did not meet my exact requirements.
You can take the cumulative sum and floor-divide it by 5, then subtract the floor-division result, multiplied by 5, from the next row's cumulative sum:
c = df['a'].cumsum()
g = 5 * (c // 5)
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
df
Out[1]:
a a_cumm_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
Solution #2 (more robust):
Per Trenton's comment, a good, diverse sample dataset goes a long way toward figuring out unbreakable logic for these types of problems; I probably would have come up with a better solution the first time around with a good sample dataset. Here is a solution that handles the sample dataset Trenton mentioned in the comments. As shown, there are more conditions to handle, since you have to deal with carry-over. On a large dataset this would still be much more performant than a for-loop, but the logic is much more difficult to vectorize:
import numpy as np
df = pd.DataFrame({'a': {0: 2, 1: 4, 2: 1, 3: 5, 4: 1, 5: 3, 6: 1, 7: 2, 8: 2, 9: 1}})
c = df['a'].cumsum()
g = 5 * (c // 5)
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
over = (df['a_cumm_sum'].shift(1) - 5)
df['a_cumm_sum'] = df['a_cumm_sum'] - np.where(over > 0, df['a_cumm_sum'] - over, 0).cumsum()
s = np.where(df['a_cumm_sum'] < 0, df['a_cumm_sum'] * -1, 0).cumsum()
df['a_cumm_sum'] = np.where((df['a_cumm_sum'] > 0) & (s > 0), s + df['a_cumm_sum'],
                            df['a_cumm_sum'])
df['a_cumm_sum'] = np.where(df['a_cumm_sum'] < 0, df['a_cumm_sum'].shift() + df['a'], df['a_cumm_sum'])
df
Out[2]:
a a_cumm_sum
0 2 2.0
1 4 6.0
2 1 1.0
3 5 6.0
4 1 1.0
5 3 4.0
6 1 5.0
7 2 2.0
8 2 4.0
9 1 5.0
The assignment can be combined with a condition. The code is as follows:
import numpy as np
import pandas as pd
a = [2, 3, 0, 5, 1, 3, 1, 2, 2, 1]
df = pd.DataFrame(a, columns=["a"])
df["cumsum"] = df["a"].cumsum()
df["new"] = df["cumsum"]%5
df["new"][((df["cumsum"]/5)==(df["cumsum"]/5).astype(int)) & (df["a"]!=0)] = 5
df
The output is as follows:
a cumsum new
0 2 2 2
1 3 5 5
2 0 5 0
3 5 10 5
4 1 11 1
5 3 14 4
6 1 15 5
7 2 17 2
8 2 19 4
9 1 20 5
Working:
Basically, take the remainder of the cumulative sum when divided by 5. In the cases where the cumulative sum is an exact multiple of 5, that remainder also becomes zero, so for those cases check whether value/5 == int(value/5) and set the result to 5, excluding the rows where the value of a itself is zero.
EDIT:
As Trenton McKinney pointed out in the comments, the OP likely wanted to reset the total to 0 whenever the cumsum exceeded 5. This makes the definition a recurrence, which is usually difficult to vectorize with pandas/numpy (see David's solution). I'd recommend using numba to speed up the for loop in this case.
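For example, a minimal sketch of such a numba-accelerated loop (assuming numba is installed; the threshold of 5 and the reset-on-reaching-the-threshold rule follow the question):
import numpy as np
from numba import njit
@njit
def cumsum_reset(values, threshold=5):
    out = np.empty_like(values)
    total = 0
    for i in range(values.shape[0]):
        total += values[i]
        out[i] = total
        if total >= threshold:  # reset once the running total reaches the threshold
            total = 0
    return out
df['a_cumm_sum'] = cumsum_reset(df['a'].to_numpy(), 5)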
Another alternative: using groupby
In [78]: df.groupby((df['a'].cumsum()% 5 == 0).shift().fillna(False).cumsum()).cumsum()
Out[78]:
a
0 2
1 5
2 0
3 5
4 1
5 4
6 5
7 2
8 4
9 5
You could try using this for loop:
lastvalue = 0
newcum = []
for i in df['a']:
    if lastvalue >= 5:
        lastvalue = i
    else:
        lastvalue += i
    newcum.append(lastvalue)
df['a_cum_sum'] = newcum
print(df)
Output:
a a_cum_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
The above for loop iterates through the a column: when the running total has reached 5 or more, it resets and starts again from the current value i; otherwise it just adds the current value i to the running total.

How to pass a value from one row to the next one in pandas + python and use it to calculate the same following value recursively

This is my desired output:
I am trying to calculate the columns df[Value] and df[Value_Compensed]. However, to do that, I need to consider the previous row's value of df[Value_Compensed]. In terms of my table:
In the first row all the values are 0.
In the following rows: df[Remained] = previous df[Value_Compensed]. Then df[Value] = df[Initial_value] + df[Remained]. Then df[Value_Compensed] = df[Value] - df[Compensation].
...And So on...
I am struggling to pass the value of Value_Compensed from one row to the next. I tried with the shift() function, but as you can see in the following image the values in df[Value_Compensed] are not correct: since the value is not static and changes after each row, it did not work. Any idea?
Thanks.
Manuel.
You can use apply to create your customised operations. I've made a dummy dataset as you didn't provide the initial dataframe.
import numpy as np
import pandas as pd
from itertools import zip_longest
# dummy data
df = pd.DataFrame(np.random.randint(1, 10, (8, 5)),
                  columns=['compensation', 'initial_value',
                           'remained', 'value', 'value_compensed'])
df.loc[0] = 0, 0, 0, 0, 0
>>> print(df)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 2 9 1 9 7
2 1 4 9 8 3
3 3 4 5 7 6
4 3 2 5 5 6
5 9 1 5 2 4
6 4 5 9 8 2
7 1 6 9 6 8
Use apply (axis=1) to do row-wise iteration, where you use the initial dataframe as an argument, from which you can then get the previous row x.name-1 and do your calculations. Not sure if I fully understood the intended result, but you can adjust the individual calculations of the different columns in the function.
def f(x, data):
    if x.name == 0:
        return [0] * data.shape[1]
    else:
        x_remained = data.loc[x.name - 1]['value_compensed']
        x_value = data.loc[x.name - 1]['initial_value'] + x_remained
        x_compensed = x_value - x['compensation']
        return [x['compensation'], x['initial_value'], x_remained,
                x_value, x_compensed]
adj = df.apply(f, args=(df,), axis=1)
adj = pd.DataFrame.from_records(zip_longest(*adj.values), index=df.columns).T
>>> print(adj)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 5 9 0 0 -5
2 5 7 4 13 8
3 7 9 1 8 1
4 6 6 5 14 8
5 4 9 6 12 8
6 2 4 2 11 9
7 9 2 6 10 1

R's replicate and do.call functions equivalent in Python

Suppose that you want to construct a pd.DataFrame and get different numbers every time you increase the replicate number. (Please scroll down for a reproducible example in R.)
I would like to get the same output with Python, but I don't know how to get there!
Consider this simple pd.DataFrame:
df = pd.DataFrame({
    'a': [np.random.normal(0.27, 0.01, 5), np.random.normal(1, 0.01, 5)]})
df
a
0 [0.268297564096, 0.252974100195, 0.27613413347...
1 [0.996267313891, 1.00497494738, 1.022271644, 1...
I don't know why the data look like this. When I do only one np.random.normal, I get this:
a
0 0.092309
1 0.085985
2 0.083635
3 0.081582
4 0.104096
Sorry, I cannot explain this behaviour; I am new to pandas, so maybe you could explain it.
OK, let's get back to the original question.
To generate the second group of numbers, I guess I should use np.repeat:
df = pd.DataFrame({'a': np.repeat(np.random.normal(0.10, 0.01, 10), 2)})
df
Out[59]:
a
0 0.090305
1 0.090305
2 0.109092
3 0.109092
4 0.101706
5 0.101706
6 0.087357
7 0.087357
8 0.099094
9 0.099094
10 0.101595
11 0.101595
12 0.100343
13 0.100343
14 0.085380
15 0.085380
16 0.102118
17 0.102118
18 0.107328
19 0.107328
But np.repeat just generates the same numbers twice, which is not the output I want.
Here is the approach in the R case:
df <- data.frame(y = do.call(c, replicate(n = 2,
                                          expr = c(rnorm(5, 0.10, 0.01), rnorm(5, 1, 0.01)),
                                          simplify = FALSE)),
                 gr = rep(seq(1, 2), each = 10))
y gr
1 0.11300203 1
2 0.11840556 1
3 0.09420799 1
4 0.10480623 1
5 0.08561427 1
6 1.00076001 1
7 1.00035891 1
8 1.00936751 1
9 1.00050563 1
10 1.00564799 1
11 0.09415217 2
12 0.10794155 2
13 0.11534605 2
14 0.08806740 2
15 0.12394189 2
16 0.99330066 2
17 0.98254134 2
18 0.99828079 2
19 1.00786526 2
20 0.97864180 2
Basically, in R you can do this in a pretty straightforward way, but I guess in Python one has to write a function for it.
In R you can generate normally distributed numbers with rnorm, and in numpy we can do that with np.random.normal, but I could not find any built-in equivalent of do.call in particular.
Actually, in R you do not need do.call():
set.seed(95)
df <- data.frame(y = c(rnorm(10, 0.10, 0.01), rnorm(10, 1, 0.01)),
                 gr = c(rep(1, 10), rep(2, 10)))
df
# y gr
# 1 0.08970880 1
# 2 0.08384474 1
# 3 0.09972121 1
# 4 0.09678872 1
# 5 0.11880371 1
# 6 0.10696807 1
# 7 0.09135123 1
# 8 0.08925115 1
# 9 0.10994412 1
# 10 0.09769954 1
# 11 1.01486420 2
# 12 1.01533145 2
# 13 1.01454184 2
# 14 0.99125878 2
# 15 0.98222886 2
# 16 1.00128867 2
# 17 0.97588819 2
# 18 0.98216944 2
# 19 0.99982671 2
# 20 0.99090591 2
And with Python pandas/numpy, consider concatenating arrays using np.concatenate
import pandas as pd
import numpy as np
np.random.seed(89)
df = pd.DataFrame({'y': np.concatenate([np.random.normal(0.1, 0.01, 10),
                                        np.random.normal(1, 0.01, 10)]),
                   'gr': [1]*10 + [2]*10})
print(df)
# gr y
# 0 1 0.083063
# 1 1 0.099979
# 2 1 0.095741
# 3 1 0.097444
# 4 1 0.096942
# 5 1 0.100405
# 6 1 0.099316
# 7 1 0.087978
# 8 1 0.098175
# 9 1 0.091204
# 10 2 0.997568
# 11 2 1.006740
# 12 2 1.003449
# 13 2 0.993747
# 14 2 0.997935
# 15 2 0.991284
# 16 2 0.991299
# 17 2 1.003981
# 18 2 0.993347
# 19 2 1.001337
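If you specifically want the replicate/do.call pattern (evaluate the expression n times, then concatenate the pieces), a list comprehension plays the role of replicate and np.concatenate plays the role of do.call(c, ...). A sketch:
import numpy as np
import pandas as pd
n = 2  # number of replicates, as in replicate(n = 2, ...)
blocks = [np.concatenate([np.random.normal(0.10, 0.01, 5),
                          np.random.normal(1.00, 0.01, 5)]) for _ in range(n)]
df = pd.DataFrame({'y': np.concatenate(blocks),
                   'gr': np.repeat(np.arange(1, n + 1), 10)})
print(df)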
Not sure if this is what you wanted, but you could use a for loop and generate the second set of random numbers as shown below.
df = pd.DataFrame({'a': np.append([np.random.normal(0.10, 0.01, 5) for _ in range(2)],
                                  [np.random.normal(1, 0.01, 5) for _ in range(2)])})
df is then
a
0 0.105469
1 0.091046
2 0.091626
3 0.104579
4 0.110971
5 0.076754
6 0.104674
7 0.096062
8 0.103571
9 0.089955
10 0.978489
11 0.997081
12 1.009864
13 1.000333
14 0.998483
15 1.010685
16 1.004473
17 1.001833
18 1.007723
19 0.999845
