prediction based on updating data - python

I'm currently working on an online advertisement optimizer project.
Let's assume the only thing I can change is the CPC (cost per click).
I don't have much data, as the data is updated only once a day.
I want to predict net_income as a function of CPC, and I want the program to suggest the CPC value that maximizes tomorrow's net_income, based on the data that updates every day.
cpc margin
0 440 -95224.0
1 840 -81620.0
2 530 -57496.0
3 590 -47287.0
4 560 -45681.0
5 590 -52766.0
6 500 -60852.0
7 650 -59653.0
8 480 -48905.0
9 620 -56496.0
10 680 -53614.0
11 590 -44440.0
12 460 -34066.0
13 720 -31086.0
14 590 -23177.0
15 680 -12803.0
16 760 -10625.0
17 590 -20548.0
18 800 -15136.0
19 650 -12804.0
20 420 -63435.0
21 400 -7566.0
22 400 21136.0
23 400 -58585.0
24 400 -14166.0
25 420 -23065.0
26 400 -28533.0
27 380 -14454.0
28 400 -50819.0
29 380 -26356.0
30 400 -26322.0
31 380 -19107.0
32 400 -28270.0
33 380 -88439.0
34 360 -32207.0
35 340 -27632.0
36 340 -18050.0
37 340 -71574.0
38 340 -18050.0
39 320 -20735.0
40 300 -17984.0
41 290 -9426.0
42 280 -16555.0
43 290 2961.0
For instance, say the above data is df.
I tried using sklearn and LinearRegression to get the prediction:
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df['cpc'], df['margin'])
prediction = model.predict([[300]])
print(prediction[0])
margin is net_income, btw.
So by doing this, I thought I would get the prediction based on the data when CPC is 300, but it returned an error:
ValueError: Expected 2D array, got 1D array instead:
array=[440 840 530 590 560 590 500 650 480 620 680 590 460 720 590 680 760 590
800 650 420 400 400 400 400 420 400 380 400 380 400 380 400 380 360 340
340 340 340 320 300 290 280 290].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I've been looking for examples using linear regression or logistic regression models, but they all use a 2-D array as input, which doesn't fit my case. I only have one factor that I can change, and the result is simply the net_income (or margin).
How would I use sklearn on my project? Or is there maybe another better way to solve the problem?
I'm pretty new to programming and have no background in math or statistics, which makes it harder for me to understand or even find the right keywords to study... please guide me on this.
---------------------------------updated-------------------------------------
Alright, let me give you another df:
cpc margin
0 440 -35224.0
1 340 -11574.0
2 380 -68439.0
3 420 -23435.0
4 840 -81620.0
5 400 -38585.0
6 530 -37496.0
7 590 -7287.0
8 560 -5681.0
9 590 -32766.0
10 500 -60852.0
11 400 -30819.0
12 650 -59653.0
13 480 -28905.0
14 620 -56496.0
15 680 -53614.0
16 590 -44440.0
17 460 -14066.0
18 420 16935.0
19 360 -12207.0
20 400 -8533.0
21 400 -6322.0
22 400 25834.0
23 720 -31086.0
24 400 121136.0
25 400 -28270.0
26 340 1950.0
27 340 1950.0
28 300 2016.0
29 340 -27632.0
30 400 32434.0
31 380 -26356.0
32 590 -23177.0
33 680 7197.0
34 320 -20735.0
35 760 9375.0
36 590 -20548.0
37 290 10574.0
38 380 -19107.0
39 290 42961.0
40 280 -16555.0
41 800 -15136.0
42 380 -14454.0
43 650 -12804.0
Thanks to your answers, I could go further, as below.
After I could run my code without errors, I thought that by looping over the input I would be able to get the optimal cpc value.
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame(final_db)
model = LogisticRegression()
x = df[['cpc']]
model.fit(x, df['margin'])

previous_prediction = -99999999999999
df_prediction = []
for i in list(range(10, 1000, 10)):
    prediction = model.predict([[i]])
    df_prediction.append({'cpc': i, 'margin': prediction})
    if prediction > previous_prediction:
        previous_prediction = prediction
        previous_i = i
and the result was as below, which isn't very satisfying. Based on the data I have, is there a better model to use? Any other suggestions for achieving my goal?

I guess it is complaining about this line:
model.fit(df['cpc'], df['margin'])
where the first parameter should be a two-dimensional array. You can use DataFrame indexing,
df[['cpc']]
to get a DataFrame instead of a Series, which will fix the issue.
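For illustration, here is a minimal sketch of the fixed call (using LinearRegression, as in your first attempt, since margin is a continuous value; this is one reasonable choice, not the only one):

import pandas as pd
from sklearn.linear_model import LinearRegression

# df is assumed to be the cpc/margin DataFrame from the question
model = LinearRegression()
X = df[['cpc']]  # double brackets -> 2-D DataFrame, not a 1-D Series
model.fit(X, df['margin'])

# predict also expects a 2-D input: one row with one feature
prediction = model.predict([[300]])
print(prediction[0])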


What is the internal load factor of sets in Python

I am trying to find out what the internal load factor is for Python sets. A dictionary uses a hash table with a load factor of 2/3 (about 0.66): the number of buckets starts at 8, and when the 6th key is inserted the number of buckets increases to 16.
The table below shows the shift in buckets:
bucket   shift
8        5
16       10
32       21
64       42
128      85
This can be seen with the following Python code, where the sizes of a dictionary and a set are shown with sys.getsizeof:
import sys

d = {}
s = set()
for x in range(25):
    d[x] = 1
    s.add(x)
    print(len(d), sys.getsizeof(d), sys.getsizeof(s))
# of elements   memory used for dict   memory used for set
1               232                    216
2               232                    216
3               232                    216
4               232                    216
5               232                    728
6               360                    728
7               360                    728
8               360                    728
9               360                    728
10              360                    728
11              640                    728
12              640                    728
13              640                    728
14              640                    728
15              640                    728
16              640                    728
17              640                    728
18              640                    728
19              640                    2264
20              640                    2264
21              640                    2264
22              1176                   2264
23              1176                   2264
24              1176                   2264
25              1176                   2264
The above table shows that the shift in buckets is correct for the dictionary, but not for sets: the set memory grows at different points.
I am trying to find out what the load factor is for a set. Is that also 2/3? Or am I doing something wrong in the code?
Currently, it's about 3/5. See the source:
if ((size_t)so->fill*5 < mask*3)
    return 0;
return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);
fill is the number of occupied table cells (including "deleted entry" markers), and mask is 1 less than the total table capacity.
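As a quick sanity check, here is a minimal sketch (my own illustration) that watches sys.getsizeof for jumps. With the 3/5 threshold, an 8-slot table (mask = 7) should resize on the 5th insert, since 5*5 >= 7*3, which matches the table above (exact jump points can vary across CPython versions):

import sys

s = set()
prev = sys.getsizeof(s)
for n in range(1, 26):
    s.add(n)
    size = sys.getsizeof(s)
    if size != prev:
        # a jump in the reported size means the internal table was resized
        print('resized while inserting element', n, ':', prev, '->', size)
        prev = size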

How to compare two values at a specific location in a loop, and append data in a range of values, in a pandas DataFrame

I have a dataframe, from where I extracted some sample data:
Time Val
0 70000 -322
1 70500 -439
2 71000 -528
3 71500 -606
4 72000 -642
5 72500 -663
6 73000 -620
7 73500 -561
8 74000 -592
9 74500 -614
10 75000 -630
11 75500 -719
12 80000 -613
13 80500 -127
14 81000 -235
15 81500 -186
16 82000 -82
17 82500 836
18 83000 1137
183 70000 -106
184 70500 -117
185 71000 -626
186 71500 -810
187 72000 -822
188 72500 -676
189 73000 -639
190 73500 -664
191 74000 -708
192 74500 -515
193 75000 -61
194 75500 -121
195 80000 -145
196 80500 -57
197 81000 -133
198 81500 101
199 82000 235
200 82500 585
201 83000 550
366 70000 18
367 70500 138
368 71000 22
369 71500 -68
370 72000 -146
371 72500 -163
372 73000 -251
373 73500 -230
374 74000 -218
375 74500 -137
376 75000 -126
Now I would like to compare the value of 'Val' at Time 73000 with the value at [i-3].
If that value is less, then append the subsequent values to the list until Time reaches 80000.
I wrote this loop, but the problem is that it compares 'Val' against [i-3] for ALL rows between 73000 and 80000. I want the comparison to happen ONLY at 73000, and if the condition is true, to write the data to the list (until Time 80000):
box = []
for i in df.index:
    if df.Time[i] >= 73000 and df.Time[i] <= 80000 and df.Val[i] < df.Val[i-3]:
        box.append(
            {
                'Time': df.Time[i],
                'newVAL': df.Val[i],
            }
        )
box = pd.DataFrame(box, columns=['Time', 'newVAL'])
How could I change the code in order to achieve this?
You need to remember the result of the comparison in another variable and reset it whenever you encounter a time value outside your desired interval. The code would look like this:
box = []
writeToList = False
for i in df.index:
    if df.Time[i] < 73000 or df.Time[i] > 80000:
        writeToList = False
    if df.Time[i] == 73000 and df.Val[i] < df.Val[i-3]:
        writeToList = True
    if writeToList and df.Time[i] >= 73000 and df.Time[i] <= 80000:
        box.append(
            {
                'Time': df.Time[i],
                'newVAL': df.Val[i],
            }
        )
box = pd.DataFrame(box, columns=['Time', 'newVAL'])
Hope this helps.

How can I draw circles on a map?

Here is my dataframe:
Boston
Zipcode Employees Latitude Longitude
0 02021 174 -71.131057 42.228065
1 02026 193 -71.143038 42.237719
3 02109 45 -71.054027 42.363498
4 02110 14 -71.053642 42.357649
5 02111 30 -71.060280 42.350586
6 02113 77 -71.054618 42.365215
8 02115 116 -71.095106 42.343330
10 02118 318 -71.072103 42.339342
11 02119 804 -71.085268 42.323002
12 02120 168 -71.097569 42.332539
13 02121 781 -71.086649 42.305792
15 02124 1938 -71.066702 42.281721
16 02125 859 -71.053049 42.310813
17 02126 882 -71.090424 42.272444
19 02128 786 -71.016037 42.375254
21 02130 886 -71.114080 42.309087
22 02131 1222 -71.121464 42.285216
23 02132 1348 -71.168150 42.280316
24 02134 230 -71.123323 42.355355
25 02135 584 -71.147046 42.357537
26 02136 1712 -71.125550 42.255064
28 02152 119 -70.960324 42.351129
29 02163 1 -71.120420 42.367263
30 02186 361 -71.113223 42.258883
31 02199 4 -71.082279 42.346991
32 02210 35 -71.044281 42.347148
33 02215 83 -71.103877 42.348709
34 02459 27 -71.187563 42.286356
35 02467 66 -71.157691 42.314277
And I want to draw circles on my map: each circle corresponds to one point, and the size of the circle depends on the number of Employees.
Here is my map code (I tried using markers, but I think circles are better):
boston_map = folium.Map([Boston['Longitude'].mean(), Boston['Latitude'].mean()], zoom_start=12)
incidents2 = plugins.MarkerCluster().add_to(boston_map)
for Latitude, Longitude, Employees in zip(Boston.Latitude, Boston.Longitude, Boston.Employees):
    folium.Marker(location=[Latitude, Longitude], icon=None, popup=Employees).add_to(incidents2)
boston_map.add_child(incidents2)
boston_map
Here is my map:
It would be even better if the number of employees could be shown on each circle! Thank you very much!
To draw circles you can use CircleMarker instead of Marker.
BTW: your column names are swapped. Boston has lat 42.361145, long -71.057083, but you have values around 42 in the Longitude column and values around -71 in the Latitude column.
Because I don't use Jupyter, I save the map to an HTML file and use webbrowser to open it automatically in a web browser.
Because it created big circles, I divide Employees by 10 to create smaller circles. But now some circles are very small, and the cluster shows the number of circles instead of the circles themselves. Maybe math.log() or some other method should be used to make the sizes smaller (normalized).
I use tooltip=str(employees) to display the number when you hover over a circle.
text = '''
Zipcode Employees Longitude Latitude
0 02021 174 -71.131057 42.228065
1 02026 193 -71.143038 42.237719
3 02109 45 -71.054027 42.363498
4 02110 14 -71.053642 42.357649
5 02111 30 -71.060280 42.350586
6 02113 77 -71.054618 42.365215
8 02115 116 -71.095106 42.343330
10 02118 318 -71.072103 42.339342
11 02119 804 -71.085268 42.323002
12 02120 168 -71.097569 42.332539
13 02121 781 -71.086649 42.305792
15 02124 1938 -71.066702 42.281721
16 02125 859 -71.053049 42.310813
17 02126 882 -71.090424 42.272444
19 02128 786 -71.016037 42.375254
21 02130 886 -71.114080 42.309087
22 02131 1222 -71.121464 42.285216
23 02132 1348 -71.168150 42.280316
24 02134 230 -71.123323 42.355355
25 02135 584 -71.147046 42.357537
26 02136 1712 -71.125550 42.255064
28 02152 119 -70.960324 42.351129
29 02163 1 -71.120420 42.367263
30 02186 361 -71.113223 42.258883
31 02199 4 -71.082279 42.346991
32 02210 35 -71.044281 42.347148
33 02215 83 -71.103877 42.348709
34 02459 27 -71.187563 42.286356
35 02467 66 -71.157691 42.314277
'''
import pandas as pd
import io
import folium
import folium.plugins

boston = pd.read_csv(io.StringIO(text), sep='\s+')

boston_map = folium.Map([boston.Latitude.mean(), boston.Longitude.mean()], zoom_start=12)
incidents2 = folium.plugins.MarkerCluster().add_to(boston_map)

for latitude, longitude, employees in zip(boston.Latitude, boston.Longitude, boston.Employees):
    print(latitude, longitude, employees)
    folium.vector_layers.CircleMarker(
        location=[latitude, longitude],
        tooltip=str(employees),
        radius=employees/10,
        color='#3186cc',
        fill=True,
        fill_color='#3186cc'
    ).add_to(incidents2)

boston_map.add_child(incidents2)

# display in web browser
import webbrowser
boston_map.save('map.html')
webbrowser.open('map.html')
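As a sketch of the normalization idea mentioned above (my own suggestion; the constants are arbitrary assumptions), the radius could be log-scaled so both tiny and huge offices stay visible:

import math

def scaled_radius(employees, min_radius=3, scale=4):
    # log1p handles employees == 0 safely; min_radius keeps small dots visible
    return min_radius + scale * math.log1p(employees)

# then inside the loop: radius=scaled_radius(employees)
# e.g. 1 employee -> ~5.8 px, 1938 employees -> ~33 px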
EDIT: the answer to the question "how to add a label on each circle in a folium.circile map python" shows how to use Marker with icon=DivIcon(text) to add text, but it doesn't work as I expected.

Converting from AMPL to Pyomo

I'm trying to convert an AMPL model to Pyomo (something I have no experience with). I'm finding the syntax hard to adapt to, especially the constraint and objective sections. I've already set up Python, Anaconda, Pyomo, and GLPK on my machine, and just need to get the actual code down. I'm a beginner coder, so forgive me if my code is poorly written. Still trying to get the hang of this!
Here is the data from the AMPL code:
set PROD := 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30;
set PROD1:= 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30;
ProdCost 414 3 46 519 876 146 827 996 922 308 568 176 58 13 20 974 121 751 130 844 280 123 275 843 717 694 72 413 65 631
HoldingCost 606 308 757 851 148 867 336 44 364 960 69 428 778 485 285 938 980 932 199 175 625 513 536 965 366 950 632 88 698 744
Demand 105 70 135 67 102 25 147 69 23 84 32 41 81 133 180 22 174 80 24 156 28 125 23 137 180 151 39 138 196 69
And here is the model:
set PROD; # set of production amounts
set PROD1; # set of holding amounts
param ProdCost {PROD} >= 0; # parameter set of production costs
param Demand {PROD} >= 0; # parameter set of demand at each time
param HoldingCost {PROD} >= 0; # parameter set of holding costs
var Inventory {PROD1} >= 0; # variable that sets inventory amount at each time
var Make {p in PROD} >= 0; # variable of amount produced at each time
minimize Total_Cost: sum {p in PROD} ((ProdCost[p] * Make[p]) + (Inventory[p] * HoldingCost[p]));
# Objective: minimize total cost from all production and holding cost
subject to InventoryConstraint {p in PROD}: Inventory[p] = Inventory[p-1] + Make[p] - Demand[p];
# excess production transfers to inventory
subject to MeetDemandConstraint {p in PROD}: Make[p] >= Demand[p] - Inventory[p-1];
# Constraint: holding and production must exceed demand
subject to InitialInventoryConstraint: Inventory[0] = 0;
# Constraint: Inventory must start at 0
Here's what I have so far. Not sure if it's right or not:
from pyomo.environ import *

demand = [105,70,135,67,102,25,147,69,23,84,32,41,81,133,180,22,174,80,24,156,28,125,23,137,180,151,39,138,196,69]
holdingcost = [606,308,757,851,148,867,336,44,364,960,69,428,778,485,285,938,980,932,199,175,625,513,536,965,366,950,632,88,698,744]
productioncost = [414,3,46,519,876,146,827,996,922,308,568,176,58,13,20,974,121,751,130,844,280,123,275,843,717,694,72,413,65,631]

model = ConcreteModel()
model.I = RangeSet(1,30)
model.J = RangeSet(0,30)
model.x = Var(model.I, within=NonNegativeIntegers)
model.y = Var(model.J, within=NonNegativeIntegers)
model.obj = Objective(expr=sum(model.x[i]*productioncost[i] + model.y[i]*holdingcost[i] for i in model.I))

def InventoryConstraint(model, i):
    return model.y[i-1] + model.x[i] - demand[i] <= model.y[i]
InvCont = Constraint(model, rule=InventoryConstraint)

def MeetDemandConstraint(model, i):
    return model.x[i] >= demand[i] - model.y[i-1]
DemCont = Constraint(model, rule=MeetDemandConstraint)

def Initial(model):
    return model.y[0] == 0
model.Init = Constraint(rule=Initial)

opt = SolverFactory('glpk')
results = opt.solve(model, load_solutions=True)
model.solutions.store_to(results)
results.write()
Thanks!
The only issues I see are in some of your constraint declarations. You need to attach the constraints to the model, and the first argument passed in should be the indexing set (which I'm assuming should be model.I). Note also that demand is a plain 0-based Python list while model.I runs from 1 to 30, so index it as demand[i-1] (the same offset applies to the cost lists in your objective):
def InventoryConstraint(model, i):
    return model.y[i-1] + model.x[i] - demand[i-1] <= model.y[i]
model.InvCont = Constraint(model.I, rule=InventoryConstraint)

def MeetDemandConstraint(model, i):
    return model.x[i] >= demand[i-1] - model.y[i-1]
model.DemCont = Constraint(model.I, rule=MeetDemandConstraint)
The syntax that you're using to solve the model is a little outdated but should work. Another option would be:
opt = SolverFactory('glpk')
opt.solve(model, tee=True)  # the 'tee' option prints the solver output to the screen
model.display()  # prints a summary of the model solution
Another command that is useful for debugging is model.pprint(). It displays the entire model, including the expressions for Constraints and Objectives.
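If you want the solution values programmatically rather than printed, a minimal sketch (assuming the fixed model above has been solved successfully) could look like this:

# after opt.solve(model) succeeds:
production_plan = {i: model.x[i].value for i in model.I}
inventory_plan = {j: model.y[j].value for j in model.J}
print('total cost:', value(model.obj))  # value() comes from pyomo.environ
print(production_plan)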

Extra Bin with Pandas Resample

I've got a pandas data frame defined like this:
import datetime
import pandas

last_4_weeks_range = pandas.date_range(
    start=datetime.datetime(2001, 5, 4), periods=28)
last_4_weeks = pandas.DataFrame(
    [{'REST_KEY': 1, 'DLY_TRN_QT': 80, 'DLY_SLS_AMT': 90,
      'COOP_DLY_TRN_QT': 30, 'COOP_DLY_SLS_AMT': 20}] * 28 +
    [{'REST_KEY': 2, 'DLY_TRN_QT': 70, 'DLY_SLS_AMT': 10,
      'COOP_DLY_TRN_QT': 50, 'COOP_DLY_SLS_AMT': 20}] * 28,
    index=last_4_weeks_range.append(last_4_weeks_range))
last_4_weeks.sort(inplace=True)
and when I go to resample it:
In [265]: last_4_weeks.resample('7D', how='sum')
Out[265]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
2001-05-04 280 560 700 1050 21
2001-05-11 280 560 700 1050 21
2001-05-18 280 560 700 1050 21
2001-05-25 280 560 700 1050 21
2001-06-01 0 0 0 0 0
I end up with an extra empty bin I wouldn't expect to see -- 2001-06-01. I wouldn't expect that bin to be there, as my 28 days are evenly divisible into the 7 day resample I'm performing. I've tried messing around with the closed kwarg, but I can't escape that extra bin. Why is that extra bin showing up when I've got nothing to put into it and how can I avoid generating it?
What I'm ultimately trying to do is get 7-day averages per REST_KEY, so I'm doing:
In [266]: last_4_weeks.groupby('REST_KEY').resample('7D', how='sum').mean(level=0)
Out[266]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
REST_KEY
1 112 168 504 448 5.6
2 112 280 56 392 11.2
but that extra empty bin is throwing off my mean (e.g. for COOP_DLY_SLS_AMT I've got 112, which is (20 * 7 * 4) / 5 rather than the 140 I'd get from (20 * 7 * 4) / 4 if I didn't have that extra bin.) I also wouldn't expect REST_KEY to show up in the aggregation since it's part of the groupby, but that's really a smaller problem.
P.S. I'm using pandas 0.11.0
I think it's a bug. The output with pandas 0.9.0dev on Mac is:
In [3]: pandas.__version__
Out[3]: '0.9.0.dev-1e68fd9'
In [6]: last_4_weeks.resample('7D', how='sum')
Out[6]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
2001-05-04 40 80 100 150 3
2001-05-11 280 560 700 1050 21
2001-05-18 280 560 700 1050 21
2001-05-25 280 560 700 1050 21
2001-06-01 240 480 600 900 18
In [4]: last_4_weeks.groupby('REST_KEY').resample('7D', how='sum').mean(level=0)
Out[4]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
REST_KEY
1 112 168 504 448 5.6
2 112 280 56 392 11.2
I'm using these versions (via pip freeze):
numpy==1.8.0.dev-9597b1f-20120920
pandas==0.9.0.dev-1e68fd9-20120920
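If the phantom bin is the only problem, one possible workaround (my own suggestion, assuming empty bins come back as all zeros, as in the output above) is to drop the all-zero rows before averaging:

weekly = last_4_weeks.groupby('REST_KEY').resample('7D', how='sum')
weekly = weekly[weekly.any(axis=1)]  # keep only bins that actually received rows
print(weekly.mean(level=0))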
