How to divide data into groups in a fast way - python

I have a large matrix with 12 columns and approximately 1.000.000 rows. Each column represents the money spent by a client in a given month, so with the 12 columns I have information for 1 full year. Each row represents one client.
I need to divide the people into groups based on how much money they spent each month, and I consider the following intervals:
money=0
0<money<=25
25<money<=50
50<money<=75
So, for example, group1 would be formed by clients who spent 0$ each month for the whole year, group2 would be clients who spent between 0 and 25$ in the first month and 0$ in the rest of the months, and so on. In the end I have 12 months and 4 intervals, so I need to divide the data into 4^12=16.777.216 groups. (I know this yields more groups than observations, and that many of the groups will be empty or contain very few clients, but that is another problem; for now I am only interested in doing this division into groups.)
I am currently working in R, although I could also switch to Python if required (those are the programming languages I know best). So far my only idea has been to use nested for loops, one for loop for each month, but this is very, very slow.
So my question is: is there a faster way to do this?
Here I provide a small example with fake data, 10 observations (instead of the 1.000.000), 5 columns (instead of 12) and a simplified version of my current code for doing the grouping.
set.seed(5)
data = data.frame(id=1:10, matrix(rnorm(50), nrow=10, ncol=5))
intervals = c(-4, -1, 0, 1, 4)
id_list = c()
group_list = c()
group_idx = 0
for(idx1 in 1:(length(intervals)-1))
{
  data1 = data[(data[, 2] >= intervals[idx1]) & (data[, 2] < intervals[idx1+1]),]
  for(idx2 in 1:(length(intervals)-1))
  {
    data2 = data1[(data1[, 3] >= intervals[idx2]) & (data1[, 3] < intervals[idx2+1]),]
    for(idx3 in 1:(length(intervals)-1))
    {
      data3 = data2[(data2[, 4] >= intervals[idx3]) & (data2[, 4] < intervals[idx3+1]),]
      for(idx4 in 1:(length(intervals)-1))
      {
        data4 = data3[(data3[, 5] >= intervals[idx4]) & (data3[, 5] < intervals[idx4+1]),]
        for(idx5 in 1:(length(intervals)-1))
        {
          data5 = data4[(data4[, 6] >= intervals[idx5]) & (data4[, 6] < intervals[idx5+1]),]
          group_idx = group_idx + 1
          id_list = c(id_list, data5$id)
          group_list = c(group_list, rep(group_idx, nrow(data5)))
        }
      }
    }
  }
}

If you do need to do this--which I certainly have my doubts about--I would suggest creating a matrix with the classification for each cell of the original data, and then pasting them together to make a group label.
Doing this we can set the group labels to be human readable, which might be nice.
I would recommend simply adding this grouping column to the original data and then using dplyr or data.table to do grouped operations for your next steps, but if you really want separate data frames for each you can then split the original data based on these group labels.
## I redid your sample data to put it on the same general scale as
## your actual data
set.seed(5)
data = data.frame(id=1:10, matrix(rnorm(50, mean = 50, sd = 20), nrow=10, ncol=5))
my_breaks = c(0, 25 * 1:3, Inf)
## optional human-readable labels: pass labels = my_labs to cut() to use them
## (the output below uses cut()'s default interval labels)
my_labs = c("Low", "Med", "High", "Extreme")
## classify each value from the data
grouping = vapply(
data[-1], \(x) as.character(cut(x, breaks = my_breaks)),
FUN.VALUE = character(nrow(data))
)
## create labels for the groups: one label per client, so apply over rows (MARGIN = 1)
group_labels = apply(grouping, 1, \(x) paste(1:(ncol(data) - 1), x, sep = ":", collapse = " | "))
## either split the data based on groups or add the grouping value to the original data
result = split(data, group_labels)
data$group = group_labels
result
# $`1:(0,25] | 2:(25,50] | 3:(75,Inf] | 4:(75,Inf] | 5:(25,50]`
# id X1 X2 X3 X4 X5
# 3 3 24.89016 28.39215 79.35924 94.30921 48.50842
#
# $`1:(25,50] | 2:(0,25] | 3:(75,Inf] | 4:(0,25] | 5:(25,50]`
# id X1 X2 X3 X4 X5
# 8 8 37.29257 6.320665 79.97548 9.990545 40.79511
#
# $`1:(25,50] | 2:(25,50] | 3:(25,50] | 4:(50,75] | 5:(50,75]`
# id X1 X2 X3 X4 X5
# 6 6 37.94184 47.22028 44.13036 69.03148 61.24447
#
# $`1:(25,50] | 2:(25,50] | 3:(75,Inf] | 4:(25,50] | 5:(25,50]`
# id X1 X2 X3 X4 X5
# 7 7 40.55667 38.05374 78.37178 29.80935 32.25983
#
# $`1:(25,50] | 2:(50,75] | 3:(25,50] | 4:(0,25] | 5:(25,50]`
# id X1 X2 X3 X4 X5
# 9 9 44.28453 54.81635 36.85836 14.75628 35.51343
#
# $`1:(25,50] | 2:(50,75] | 3:(50,75] | 4:(50,75] | 5:(75,Inf]`
# id X1 X2 X3 X4 X5
# 1 1 33.18289 74.55261 68.01024 56.31830 81.00121
#
# $`1:(50,75] | 2:(25,50] | 3:(25,50] | 4:(25,50] | 5:(25,50]`
# id X1 X2 X3 X4 X5
# 10 10 52.76216 44.81289 32.94409 47.14784 48.61578
#
# $`1:(50,75] | 2:(25,50] | 3:(50,75] | 4:(50,75] | 5:(75,Inf]`
# id X1 X2 X3 X4 X5
# 4 4 51.40286 46.84931 64.13522 74.34207 87.91336
#
# $`1:(75,Inf] | 2:(25,50] | 3:(50,75] | 4:(50,75] | 5:(25,50]`
# id X1 X2 X3 X4 X5
# 2 2 77.68719 33.96441 68.83739 72.19388 33.95154
#
# $`1:(75,Inf] | 2:(25,50] | 3:(50,75] | 4:(75,Inf] | 5:(25,50]`
# id X1 X2 X3 X4 X5
# 5 5 84.22882 28.56480 66.38018 79.58444 40.86862
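Since the question mentions Python as an option, here is a minimal pandas sketch of the same classify-then-paste idea. The names `breaks`, `labels`, `month_cols` and the two sample rows are assumptions made for illustration; values outside the breaks (for example exactly 0 or negative numbers) would come out as missing and would need their own handling.

```python
import numpy as np
import pandas as pd

# Two illustrative clients with five months of spend (hypothetical sample data)
data = pd.DataFrame(
    {
        "id": [1, 8],
        "X1": [33.18289, 37.29257],
        "X2": [74.55261, 6.320665],
        "X3": [68.01024, 79.97548],
        "X4": [56.31830, 9.990545],
        "X5": [81.00121, 40.79511],
    }
)

breaks = [0, 25, 50, 75, np.inf]              # intervals (0,25], (25,50], (50,75], (75,Inf]
labels = ["Low", "Med", "High", "Extreme"]    # one label per interval
month_cols = [c for c in data.columns if c != "id"]

# Classify every cell, column by column
classes = data[month_cols].apply(
    lambda col: pd.cut(col, bins=breaks, labels=labels).astype(object)
)

# Paste the per-month classes of each client into one readable group label
data["group"] = classes.apply(
    lambda row: " | ".join(f"{i}:{v}" for i, v in enumerate(row, start=1)),
    axis=1,
)

# Grouped operations can then work on the label column directly
group_sizes = data.groupby("group").size()
print(data[["id", "group"]])
```

Keeping the label as a column and using `groupby` avoids materialising one data frame per group.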

Using findInterval, a group ID can be added in a fraction of a second on a 1M row table:
library(data.table)
set.seed(538924142)
data <- data.frame(id = 1:1e6, matrix(runif(12e6, 0, 75)*sample(0:1, 12e6, TRUE, c(0.25, 0.75)), 1e6, 12))
system.time({
  setDT(data)[
    , grp := colSums(
      matrix(
        findInterval(
          t(as.matrix(.SD)),
          c(0, 25, 50, 75),
          left.open = TRUE
        ),
        12, 1e6
      )*4^(0:11)
    ),
    .SDcols = 2:13
  ]
})
#> user system elapsed
#> 0.26 0.05 0.31
head(data)
#> id X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 grp
#> 1: 1 0.00000 67.680617 26.178075 65.66532 0.00000 55.2356394 5.438976 72.20526839 70.47368 0.000000 0.00000 29.17494 8641772
#> 2: 2 0.00000 8.193552 10.482581 19.15885 30.28639 44.3917749 1.876230 11.19145219 55.22776 48.725632 17.18597 74.58265 14375508
#> 3: 3 0.00000 63.301921 0.000000 61.50508 0.00000 0.5755531 52.139676 51.46551228 58.90514 60.098006 12.90056 0.00000 2094284
#> 4: 4 18.06334 34.970526 9.599701 38.64339 57.00753 62.3455201 30.377876 73.73237960 0.00000 18.706219 0.00000 25.57064 8712089
#> 5: 5 27.49489 8.770596 0.000000 67.30562 58.43427 26.2856874 65.784429 36.96939287 54.65132 3.676736 29.51849 25.35926 10992582
#> 6: 6 59.27949 14.830172 2.233060 13.27291 16.63301 2.5727847 0.000000 0.05254523 23.44611 29.529823 0.00000 63.00820 13190487
data[which.min(data$grp)]
#> id X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 grp
#> 1: 189293 4.801804 0 26.7038 0 0 0 0 0 0 0 0 0 33
data[which.max(data$grp)]
#> id X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 grp
#> 1: 637400 0 0 69.10316 56.61781 52.88433 62.50076 72.81748 57.27957 70.34022 72.01065 53.4228 56.72517 16777200
Then proceed with data.table subsetting and grouping operations. If you really want it split:
group_list <- split(data, by = "grp")
But, as the data.table documentation for split puts it: "Be aware that processing list of data.tables will be generally much slower than manipulation in single data.table by group using by argument."
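For readers working in Python, a rough NumPy counterpart of the findInterval approach might look like the sketch below. The function name `group_ids` and the example rows are assumptions for illustration. It uses `np.searchsorted` with `side="left"`, which assigns 0 to a spend of exactly 0 and 1, 2, 3 to the intervals (0,25], (25,50], (50,75], and it assumes no value exceeds the last break, since a code of 4 would make group ids collide.

```python
import numpy as np

def group_ids(spend, breaks=(0, 25, 50, 75)):
    """Map each row of monthly spend to a single integer group id."""
    spend = np.asarray(spend, dtype=float)
    breaks = np.asarray(breaks)
    # side="left": breaks[i-1] < x <= breaks[i] gives code i, and x == 0 gives code 0
    codes = np.searchsorted(breaks, spend, side="left")
    # Read the per-month codes as digits of a base-4 number (month 1 is the lowest digit)
    n_classes = len(breaks)
    weights = n_classes ** np.arange(spend.shape[1])
    return codes @ weights

example = np.array([
    [4.801804, 0, 26.7038, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # codes 1, 0, 2, 0, ...
    [0] * 12,                                           # spent nothing all year
])
ids = group_ids(example)
print(ids.tolist())
```

The first row reproduces the group id 33 shown above for client 189293 (1 + 2 * 4^2). The result can be stored as a column and used with a pandas `groupby`.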

Related

Python: dynamic column sum for each row

I have a dataframe with 2 identifiers (ID1, ID2), 3 numeric columns (X1, X2, X3) and a column titled 'input' (6 columns in total) and n rows. For each row, I want the index n of the last column for which the running sum X1 + X2 + ... + Xn is still no larger than input (that is, input - (X1 + ... + Xn) >= 0).
How can I do this in Python?
In R I did this by using:
tmp = data
for (i in 4:5)
{
data[,i]<- tmp$input - rowSums(tmp[,3:i])
}
output<- apply((data[,3:5]), 1, function(x) max(which(x>0)))
data$output <- output
I am trying to translate this into Python. What might be the best way to do this? There can be N such rows, and M such columns.
Sample Data:
ID1 ID2 X1 X2 X3 INPUT OUTPUT (explanation)
a b 1 2 3 3 2 (X1 = 1, X1+X2 = 3, X1+X2+X3 = 6; after 2 sums, INPUT < sum)
a1 a2 5 2 1 4 0 (X1 = 5, X1+X2 = 7, X1+X2+X3 = 8; even for 1 sum, INPUT < sum)
a2 b2 0 4 5 100 3 (X1 = 0, X1+X2 = 4, X1+X2+X3 = 9; even after 3 sums, INPUT > sum)
You can use the pandas module, which handles this kind of row-wise operation efficiently in Python.
import pandas as pd
# Sample data
df = pd.DataFrame([
    ['A','B',1,3,4,0.1],
    ['K','L',10,3,14,0.5],
    ['P','H',1,73,40,0.6]], columns = ['ID1','ID2','X2','X3','X4','INPUT'])
# Add a column holding the row-wise maximum of the numeric columns
df['new_column'] = df[['X2','X3','X4']].max(axis=1)
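For the running-sum rule described in the question, a minimal sketch with a cumulative sum could look like this. It rebuilds the question's sample rows; the names `x_cols`, `running` and `within_budget` are illustrative assumptions, and the rule "running sum <= INPUT" is inferred from the sample OUTPUT column.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question
df = pd.DataFrame(
    {
        "ID1": ["a", "a1", "a2"],
        "ID2": ["b", "a2", "b2"],
        "X1": [1, 5, 0],
        "X2": [2, 2, 4],
        "X3": [3, 1, 5],
        "INPUT": [3, 4, 100],
    }
)

x_cols = ["X1", "X2", "X3"]  # works for any number M of numeric columns

# Running sums X1, X1+X2, X1+X2+X3 for every row
running = df[x_cols].cumsum(axis=1).to_numpy()

# True where the running sum has not yet exceeded INPUT
within_budget = running <= df["INPUT"].to_numpy()[:, None]

# 1-based position of the last True in each row, or 0 when there is none
positions = np.arange(1, len(x_cols) + 1)
df["OUTPUT"] = (within_budget * positions).max(axis=1)
print(df)
```

Everything is vectorised over rows, so it avoids a Python-level loop for N rows and M columns.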

Comparing a value from one dataframe with values from columns in another dataframe and getting the data from third column

The title is a bit confusing but I'll do my best to explain my problem here. I have 2 pandas dataframes, a and b:
>> print a
id | value
1 | 250
2 | 150
3 | 350
4 | 550
5 | 450
>> print b
low | high | class
100 | 200 | 'A'
200 | 300 | 'B'
300 | 500 | 'A'
500 | 600 | 'C'
I want to create a new column called class in table a that contains the class of the value in accordance with table b. Here's the result I want:
>> print a
id | value | class
1 | 250 | 'B'
2 | 150 | 'A'
3 | 350 | 'A'
4 | 550 | 'C'
5 | 450 | 'A'
I have the following code written that sort of does what I want:
a['class'] = pd.Series(dtype=object)
for i in range(len(a)):
    val = a['value'][i]
    # first row of b whose [low, high] range contains val
    cl = (b['class'][ (b['low'] <= val) &
                      (b['high'] >= val) ].iat[0])
    a.at[i, 'class'] = cl  # Series.set_value was removed in pandas 1.0
The problem is that this is quick for tables of length 10 or so, but I am trying to do this with tables of 100,000+ rows for both a and b. Is there a quicker way to do this, using some function/attribute in pandas?
Here is a way to do a range join inspired by @piRSquared's solution:
import numpy as np
import pandas as pd
A = a['value'].values
bh = b.high.values
bl = b.low.values
# compare every value against every [low, high] interval via broadcasting
i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh))
pd.DataFrame(
    np.column_stack([a.values[i], b.values[j]]),
    columns=a.columns.append(b.columns)
)
Output:
id value low high class
0 1 250 200 300 'B'
1 2 150 100 200 'A'
2 3 350 300 500 'A'
3 4 550 500 600 'C'
4 5 450 300 500 'A'
Here's a solution that is admittedly less elegant than using Series.searchsorted, but it runs super fast!
I pull the data out of the pandas DataFrames, convert it to lists, and then use np.where to populate a variable called "aclass" wherever the conditions are satisfied (in brute-force for loops). Then I write "aclass" to the original data frame a.
The evaluation time was 0.07489705 s, so it's pretty fast, even with 200,000 data points!
import time
import numpy as np
# create 200,000 fake a data points
avalue = 100+600*np.random.random(200000) # assuming you extracted this from a with avalue = np.array(a['value'])
blow = [100,200,300,500] # assuming you extracted this from b with list(b['low'])
bhigh = [200,300,500,600] # assuming you extracted this from b with list(b['high'])
bclass = ['A','B','A','C'] # assuming you extracted this from b with list(b['class'])
aclass = [[]]*len(avalue) # initialize aclass
start_time = time.time() # this is just for timing the execution
for i in range(len(blow)):
    # indices of all values that fall inside interval i
    for j in np.where((avalue>=blow[i]) & (avalue<=bhigh[i]))[0]:
        aclass[j]=bclass[i]
# add the class column to the original a DataFrame (a must have one row per entry in avalue)
a['class'] = aclass
print("--- %s seconds ---" % np.round(time.time() - start_time,decimals = 8))
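For comparison, a searchsorted-based lookup can be sketched as follows. This is only an illustration under the assumption that the intervals in b do not overlap except at shared endpoints; the helper names `b_sorted`, `pos` and `inside` are made up here. A value that sits exactly on a shared boundary (for example 200) is assigned to the later interval.

```python
import numpy as np
import pandas as pd

# The two tables from the question
a = pd.DataFrame({"id": [1, 2, 3, 4, 5], "value": [250, 150, 350, 550, 450]})
b = pd.DataFrame(
    {
        "low": [100, 200, 300, 500],
        "high": [200, 300, 500, 600],
        "class": ["A", "B", "A", "C"],
    }
)

# searchsorted needs the lower bounds in ascending order
b_sorted = b.sort_values("low").reset_index(drop=True)
lows = b_sorted["low"].to_numpy()
highs = b_sorted["high"].to_numpy()
classes = b_sorted["class"].to_numpy()
values = a["value"].to_numpy()

# Index of the last interval whose lower bound is <= value (-1 if there is none)
pos = np.searchsorted(lows, values, side="right") - 1
safe_pos = pos.clip(min=0)

# A match also requires the value to be below the interval's upper bound
inside = (pos >= 0) & (values <= highs[safe_pos])
a["class"] = np.where(inside, classes[safe_pos], None)
print(a)
```

The binary search costs O(n log m) for n values and m intervals, instead of building the full n-by-m comparison matrix.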

Gurobi: How can I sum just a part of a variable?

I have the following model:
from gurobipy import *
n_units = 1
n_periods = 3
n_ageclasses = 4
units = range(1,n_units+1)
periods = range(1,n_periods+1)
# build a list explicitly so this also works on Python 3, where range has no append
periods_plus1 = list(periods) + [max(periods) + 1]
ageclasses = range(1,n_ageclasses+1)
nothickets = ageclasses[1:]
model = Model('MPPM')
HARVEST = model.addVars(units, periods, nothickets, vtype=GRB.INTEGER, name="HARVEST")
FOREST = model.addVars(units, periods_plus1, ageclasses, vtype=GRB.INTEGER, name="FOREST")
model.addConstrs((quicksum(HARVEST[(k+1), (t+1), nothicket] for k in range(n_units) for t in range(n_periods) for nothicket in nothickets) == FOREST[unit, period+1, 1] for unit in units for period in periods if period < max(periods_plus1)), name="A_Thicket")
I have a problem formulating the constraint. For every unit and every period, I want to sum only the nothickets part of the variable HARVEST. Concretely, I want x_{k=1,t=1,2} + x_{k=1,t=1,3} + x_{k=1,t=1,4}, and so on. This should result in only three ones per row of the constraint matrix, but with the formulation above I get 9 ones.
I tried to use a for loop outside of the sum, but this results in another problem:
for k in range(n_units):
    for t in range(n_periods):
        model.addConstrs((quicksum(HARVEST[(k+1), (t+1), nothicket] for nothicket in nothickets) == FOREST[unit,period+1, 1] for unit in units for period in periods if period < max(periods_plus1)), name="A_Thicket")
With this formulation I get this matrix:
[image: constraint matrix]
But what I want is:
row_idx | col_idx | coeff
0 | 0 | 1
0 | 1 | 1
0 | 2 | 1
0 | 13 | -1
1 | 3 | 1
1 | 4 | 1
1 | 5 | 1
1 | 17 | -1
2 | 6 | 1
2 | 7 | 1
2 | 8 | 1
2 | 21 | -1
Can anybody please help me to reformulate this constraint?
This worked for me:
model.addConstrs((HARVEST.sum(unit, period, '*') == ...
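To see where the nine coefficients come from, the index sets can be enumerated in plain Python without Gurobi. This is only an illustrative sketch: `terms_all` and `terms_fixed` are hypothetical names, and the tuples stand in for the HARVEST variables that would appear on the left-hand side of each constraint.

```python
n_units, n_periods, n_ageclasses = 1, 3, 4
units = range(1, n_units + 1)
periods = range(1, n_periods + 1)
ageclasses = range(1, n_ageclasses + 1)
nothickets = ageclasses[1:]  # age classes 2, 3, 4

# Formulation from the question: the sum loops over ALL units and periods again,
# so every constraint contains every HARVEST variable
terms_all = {
    (unit, period): [
        (k + 1, t + 1, a)
        for k in range(n_units)
        for t in range(n_periods)
        for a in nothickets
    ]
    for unit in units
    for period in periods
}

# Intended formulation: unit and period are fixed by the constraint's own index,
# and only the age class is summed over
terms_fixed = {
    (unit, period): [(unit, period, a) for a in nothickets]
    for unit in units
    for period in periods
}

for key in terms_fixed:
    print(key, len(terms_all[key]), len(terms_fixed[key]), terms_fixed[key])
```

Reusing the outer `unit` and `period` inside the sum is what a wildcard pattern such as `HARVEST.sum(unit, period, '*')` expresses: the first two indices are pinned and only the third one varies.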

Quadratic Programming CPLEX

I am trying to implement a simple quadratic program using CPLEX's Python API. The sample file qpex1 provided with CPLEX discusses this. The problem, as mentioned in qpex.lp is
Maximize
obj: x1 + 2 x2 + 3 x3 + [ - 33 x1 ^2 + 12 x1 * x2 - 22 x2 ^2 + 23 x2 * x3
- 11 x3 ^2 ] / 2
Subject To
c1: - x1 + x2 + x3 <= 20
c2: x1 - 3 x2 + x3 <= 30
Bounds
0 <= x1 <= 40
End
The problem, while being implemented in python, receives a matrix qmat which implements the quadratic portion of the objective function. The matrix is :
qmat = [[[0, 1], [-33.0, 6.0]],
[[0, 1, 2], [6.0, -22.0, 11.5]],
[[1, 2], [11.5, -11.0]]]
p.objective.set_quadratic(qmat)
Can someone explain the structure of this matrix? What are the parts in the data structure that is being used? What are the components and so on.
Each entry of qmat describes one row of the matrix: the first list holds the column indices of the nonzero entries and the second list holds the corresponding values, so the qmat matrix is:
-33 6 0
6 -22 11.5
0 11.5 -11
that results in:
             | -33     6     0 |   | x1 |
[x1 x2 x3] * |   6   -22  11.5 | * | x2 | = - 33 x1 ^2 + 12 x1 * x2 - 22 x2 ^2 + 23 x2 * x3 - 11 x3 ^2
             |   0  11.5   -11 |   | x3 |
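The sparse row format can be checked by expanding it into a dense matrix with NumPy. This is a small sketch that does not use CPLEX at all; the test point `x` is an arbitrary choice for illustration.

```python
import numpy as np

# Sparse rows as passed to set_quadratic: [column indices, values] per row
qmat = [
    [[0, 1], [-33.0, 6.0]],
    [[0, 1, 2], [6.0, -22.0, 11.5]],
    [[1, 2], [11.5, -11.0]],
]

# Expand into a dense symmetric matrix
n = len(qmat)
Q = np.zeros((n, n))
for row, (cols, vals) in enumerate(qmat):
    Q[row, cols] = vals

# Evaluate x' Q x at an arbitrary point and compare with the expanded polynomial
x = np.array([1.0, 2.0, 3.0])
x1, x2, x3 = x
quad_form = x @ Q @ x
polynomial = -33 * x1**2 + 12 * x1 * x2 - 22 * x2**2 + 23 * x2 * x3 - 11 * x3**2
print(Q)
print(quad_form, polynomial)
```

The off-diagonal coefficients of the LP file (12 and 23) are split evenly between the two symmetric entries (6 and 6, 11.5 and 11.5). The division by 2 written in the objective is applied to this whole quadratic form.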

Pandas new column with constant increments

I need a new column that adds in increments, in this case .02.
DF before:
x y x2 y2
0 1.022467 1.817298 1.045440 3.302572
1 1.026426 1.821669 1.053549 3.318476
2 1.018198 1.818419 1.036728 3.306648
3 1.013077 1.813290 1.026325 3.288020
4 1.017878 1.811058 1.036076 3.279930
DF after:
x y x2 y2 t
0 1.022467 1.817298 1.045440 3.302572 0.000000
1 1.026426 1.821669 1.053549 3.318476 0.020000
2 1.018198 1.818419 1.036728 3.306648 0.040000
3 1.013077 1.813290 1.026325 3.288020 0.060000
4 1.017878 1.811058 1.036076 3.279930 0.080000
5 1.016983 1.814031 1.034254 3.290708 0.100000
I have looked around for a while and cannot find a good solution. The only way I can think of is to build a standard Python list and bring it in. There has to be a better way. Thanks
Because your index is the perfect range for this (i.e. 0...n), just multiply it by your constant:
df['t'] = .02 * df.index.values
>>> df
x y x2 y2 t
0 1.022467 1.817298 1.045440 3.302572 0.00
1 1.026426 1.821669 1.053549 3.318476 0.02
2 1.018198 1.818419 1.036728 3.306648 0.04
3 1.013077 1.813290 1.026325 3.288020 0.06
4 1.017878 1.811058 1.036076 3.279930 0.08
You could also use a list comprehension:
df['t'] = [0.02 * i for i in range(len(df))]
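If the index is not a clean 0...n range (for example after filtering or sorting), one option is to build the increments from the row positions instead. A minimal sketch, assuming a made-up three-row frame with a non-default index and a hypothetical `step` variable:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is NOT 0..n-1
df = pd.DataFrame({"x": [1.022467, 1.026426, 1.018198]}, index=[10, 20, 30])

step = 0.02
# np.arange counts row positions, so the index values do not matter
df["t"] = step * np.arange(len(df))
print(df)
```

This gives 0.00, 0.02, 0.04 regardless of what the index contains.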
