I've got a pandas data frame defined like this:
last_4_weeks_range = pandas.date_range(
start=datetime.datetime(2001, 5, 4), periods=28)
last_4_weeks = pandas.DataFrame(
[{'REST_KEY': 1, 'DLY_TRN_QT': 80, 'DLY_SLS_AMT': 90,
'COOP_DLY_TRN_QT': 30, 'COOP_DLY_SLS_AMT': 20}] * 28 +
[{'REST_KEY': 2, 'DLY_TRN_QT': 70, 'DLY_SLS_AMT': 10,
'COOP_DLY_TRN_QT': 50, 'COOP_DLY_SLS_AMT': 20}] * 28,
index=last_4_weeks_range.append(last_4_weeks_range))
last_4_weeks.sort(inplace=True)
and when I go to resample it:
In [265]: last_4_weeks.resample('7D', how='sum')
Out[265]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
2001-05-04 280 560 700 1050 21
2001-05-11 280 560 700 1050 21
2001-05-18 280 560 700 1050 21
2001-05-25 280 560 700 1050 21
2001-06-01 0 0 0 0 0
I end up with an extra, empty bin for 2001-06-01 that I wouldn't expect to see, since my 28 days divide evenly into the 7-day resample I'm performing. I've tried messing around with the closed kwarg, but I can't escape that extra bin. Why does it show up when there is nothing to put into it, and how can I avoid generating it?
What I'm ultimately trying to do is get 7-day averages per REST_KEY, so I'm doing this:
In [266]: last_4_weeks.groupby('REST_KEY').resample('7D', how='sum').mean(level=0)
Out[266]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
REST_KEY
1 112 168 504 448 5.6
2 112 280 56 392 11.2
but that extra empty bin is throwing off my mean (e.g. for COOP_DLY_SLS_AMT I've got 112, which is (20 * 7 * 4) / 5 rather than the 140 I'd get from (20 * 7 * 4) / 4 if I didn't have that extra bin.) I also wouldn't expect REST_KEY to show up in the aggregation since it's part of the groupby, but that's really a smaller problem.
P.S. I'm using pandas 0.11.0
I think it's a bug:
The output with pandas 0.9.0dev on mac is:
In [3]: pandas.__version__
Out[3]: '0.9.0.dev-1e68fd9'
In [6]: last_4_weeks.resample('7D', how='sum')
Out[6]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
2001-05-04 40 80 100 150 3
2001-05-11 280 560 700 1050 21
2001-05-18 280 560 700 1050 21
2001-05-25 280 560 700 1050 21
2001-06-01 240 480 600 900 18
In [4]: last_4_weeks.groupby('REST_KEY').resample('7D', how='sum').mean(level=0)
Out[4]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
REST_KEY
1 112 168 504 448 5.6
2 112 280 56 392 11.2
I'm using these versions (via pip freeze):
numpy==1.8.0.dev-9597b1f-20120920
pandas==0.9.0.dev-1e68fd9-20120920
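For anyone reproducing this on a recent pandas release, the snippets above need some syntax changes, because `how=`, `DataFrame.sort` and `mean(level=...)` belong to old versions. The following is only a sketch under that assumption, not a statement about how 0.9 or 0.11 behave; the names `value_cols`, `weekly` and `weekly_mean` are introduced here for illustration. It selects the value columns explicitly so that REST_KEY is not aggregated along with them.

```python
import datetime
import pandas as pd

days = pd.date_range(start=datetime.datetime(2001, 5, 4), periods=28)
last_4_weeks = pd.DataFrame(
    [{'REST_KEY': 1, 'DLY_TRN_QT': 80, 'DLY_SLS_AMT': 90,
      'COOP_DLY_TRN_QT': 30, 'COOP_DLY_SLS_AMT': 20}] * 28 +
    [{'REST_KEY': 2, 'DLY_TRN_QT': 70, 'DLY_SLS_AMT': 10,
      'COOP_DLY_TRN_QT': 50, 'COOP_DLY_SLS_AMT': 20}] * 28,
    index=days.append(days),
).sort_index()  # replaces the old DataFrame.sort(inplace=True)

# Hypothetical helper name: the columns to aggregate (REST_KEY left out on purpose)
value_cols = ['DLY_TRN_QT', 'DLY_SLS_AMT', 'COOP_DLY_TRN_QT', 'COOP_DLY_SLS_AMT']

# 7-day sums per REST_KEY, then the mean of those weekly sums per REST_KEY
weekly = last_4_weeks.groupby('REST_KEY')[value_cols].resample('7D').sum()
weekly_mean = weekly.groupby(level=0).mean()
print(weekly_mean)
```

With 28 days of data starting at midnight, this should produce four 7-day bins per REST_KEY, so the mean is taken over four weekly sums rather than five.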
I have a dataframe with multiple numeric columns. I want to add a new column that compares the values of several other columns row by row and stores the name of the column holding the maximum value as a label. I already understand the logic in R, but I'm wondering how to do this easily in Python. Can anyone point out how to add such a column in pandas?
reproducible example
This is a working, reproducible example in R:
library(data.table)
df <- data.frame(a = sample(seq(1:10), size=10), b = sample(LETTERS[1:10], size=10), cnt=sample(seq(1:100), size=5),
RECENT_MOV= sample(seq(1:1000), size = 10),
RETIRED= sample(seq(1:200), size = 10),
SERV_EMPL= sample(seq(1:500), size = 10),
SUB_BUS=sample(seq(1:2000), size = 10),
WORK_HOME=sample(seq(1:1200), size = 10)
)
dt <- as.data.table(df)
write.csv(dt, "sample.csv")
label = c("RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME")
df$category <- NA_character_
df[, row_ind:= 1:nrow(df)]
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
current output is:
> dput(dt)
structure(list(a = c(5L, 10L, 1L, 6L, 7L, 3L, 2L, 8L, 4L, 9L),
b = c("E", "A", "D", "H", "J", "F", "G", "I", "C", "B"),
cnt = c(13L, 88L, 45L, 92L, 70L, 13L, 88L, 45L, 92L, 70L),
RECENT_MOV = c(70L, 195L, 620L, 572L, 354L, 648L, 798L, 657L,
233L, 672L), RETIRED = c(189L, 195L, 191L, 88L, 148L, 186L,
39L, 78L, 158L, 55L), SERV_EMPL = c(65L, 151L, 415L, 383L,
255L, 207L, 210L, 470L, 181L, 188L), SUB_BUS = c(894L, 829L,
1798L, 502L, 897L, 1461L, 744L, 1991L, 260L, 1697L), WORK_HOME = c(553L,
739L, 454L, 137L, 435L, 1042L, 316L, 697L, 517L, 1158L),
category = c("SUB_BUS", "SUB_BUS", "SUB_BUS", "RECENT_MOV",
"SUB_BUS", "SUB_BUS", "RECENT_MOV", "SUB_BUS", "WORK_HOME",
"SUB_BUS"), row_ind = 1:10), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000015a64b61ef0>)
my current python attempt
import pandas as pd
df=pd.read_csv("sample.csv", index_col=None, header=0)
label = ["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME"]
df['category'] = pd.NA
df['row_ind'] = range(1, len(df) + 1)
However, I'm having trouble writing this line in a pythonic way:
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
Basically, this line creates a new column called category: it compares the columns listed in label and, for each row, assigns the name of the column with the maximum value to category. How can I do this easily in Python?
logic translation:
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
This line first filters on the cnt column (cnt > 2), then compares the values of df[["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME"]] row-wise, picks the column with the highest value in each row, and assigns that column's name to df['category'].
desirable output
This is the output that I want to produce in Python:
a b cnt RECENT_MOV RETIRED SERV_EMPL SUB_BUS WORK_HOME category row_ind
1 5 E 13 70 189 65 894 553 SUB_BUS 1
2 10 A 88 195 195 151 829 739 SUB_BUS 2
3 1 D 45 620 191 415 1798 454 SUB_BUS 3
4 6 H 92 572 88 383 502 137 RECENT_MOV 4
5 7 J 70 354 148 255 897 435 SUB_BUS 5
6 3 F 13 648 186 207 1461 1042 SUB_BUS 6
7 2 G 88 798 39 210 744 316 RECENT_MOV 7
8 8 I 45 657 78 470 1991 697 SUB_BUS 8
9 4 C 92 233 158 181 260 517 WORK_HOME 9
10 9 B 70 672 55 188 1697 1158 SUB_BUS 10
This is actually really simple with pandas. Have a list of the columns to search in, and then use idxmax with axis=1:
# Filter out rows where `cnt` is less than or equal to 2
df = df[df['cnt'] > 2]
# Determine category for each row
search_cols = ['RECENT_MOV', 'RETIRED', 'SERV_EMPL', 'SUB_BUS', 'WORK_HOME']
df['category'] = df[search_cols].idxmax(axis=1)
# Assign row indexes
df['row_ind'] = df.index
Output:
>>> df
a b cnt RECENT_MOV RETIRED SERV_EMPL SUB_BUS WORK_HOME category row_ind
1 1 C 76 452 62 55 115 247 RECENT_MOV 1
2 7 E 14 50 165 337 1165 810 SUB_BUS 2
3 2 A 46 523 167 423 784 707 SUB_BUS 3
4 3 H 3 38 144 473 745 437 SUB_BUS 4
5 5 I 59 743 127 261 351 190 RECENT_MOV 5
6 8 J 76 143 49 470 1612 935 SUB_BUS 6
7 4 D 14 818 101 418 1919 314 SUB_BUS 7
8 6 F 46 714 9 446 1432 938 SUB_BUS 8
9 10 B 3 585 160 14 107 489 RECENT_MOV 9
10 9 G 59 814 73 449 937 287 SUB_BUS 10
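Note that the snippet above drops the rows where cnt <= 2, whereas the R line in the question keeps them and only leaves category empty. If keeping every row matters, a boolean mask with .loc is one way to do it. The sketch below uses a small made-up frame (three rows, one of them with cnt = 2) purely for illustration; `mask` is a name introduced here.

```python
import pandas as pd

# Made-up sample: the middle row has cnt <= 2 and should stay unlabeled
df = pd.DataFrame({
    "cnt": [13, 2, 45],
    "RECENT_MOV": [70, 195, 620],
    "RETIRED": [189, 195, 191],
    "SERV_EMPL": [65, 151, 415],
    "SUB_BUS": [894, 829, 1798],
    "WORK_HOME": [553, 739, 2000],
})
label = ["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS", "WORK_HOME"]

df["category"] = pd.NA
mask = df["cnt"] > 2

# Only rows selected by the mask receive the name of their largest column
df.loc[mask, "category"] = df.loc[mask, label].idxmax(axis=1)

# 1-based row counter, like row_ind in the R code
df["row_ind"] = range(1, len(df) + 1)
print(df)
```

When several columns tie for the maximum, idxmax returns the first of them in the order given by label.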
I have the following dataframe:
q
1 0.83 97 0.7 193 0.238782 289 0.129692 385 0.090692
2 0.75 98 0.7 194 0.238782 290 0.129692 386 0.090692
...
96 0.94693 192 0.299753 288 0.145046 384 0.0965338 480 0.0823061
This data comes from somewhere else, and it has been split. However, the values correspond to a single variable 'q', along with its indices. To clarify, even though there are many columns, they all correspond to one column 'q', plus an index column (notice that the starting index of each column is the continuation of the end of the previous column).
How can I read the data with pandas? I believe I can do it by assigning names to each column and then merging them all together, but I was looking for a more elegant solution. Plus, the number of columns is not fixed.
This is the code that I am using at the moment:
q_param = pd.read_csv('Initial_solutions/initial_q_20y.dat', delim_whitespace=True)
Which does not do the trick. I would prefer to use pandas to solve this issue, but I can also work without it.
EDIT:
At the request of #user17242583, the following command:
print(q_param.head().to_dict())
Gives this output:
{'q': {(1, 0.83, 97, 0.7, 193, 0.238782, 289, 0.129692, 385): 0.090692, (2, 0.75, 98, 0.7, 194, 0.238782, 290, 0.129692, 386): 0.090692, (3, 0.64, 99, 0.64, 195, 0.238782, 291, 0.129692, 387): 0.090692, (4, 0.7, 100, 0.7, 196, 0.238782, 292, 0.129692, 388): 0.0884839, (5, 0.64, 101, 0.64, 197, 0.238782, 293, 0.129692, 389): 0.090692}}
It seems most of your data ended up in the index. Try:
df = pd.DataFrame({k:v for lst in [list(k)+[v] for k,v in q_param['q'].items()] for k,v in zip(lst[::2],lst[1::2])}, index=['q']).T.sort_index()
Try this:
data = {
0: pd.concat(q[c] for c in q.columns[0::2]).reset_index(drop=True),
1: pd.concat(q[c] for c in q.columns[1::2]).reset_index(drop=True),
}
df = pd.DataFrame(data)
Output:
>>> df
0 1
0 1 0.830000
1 2 0.750000
2 3 0.640000
3 4 0.700000
4 5 0.640000
5 97 0.700000
6 98 0.700000
7 99 0.640000
8 100 0.700000
9 101 0.640000
10 193 0.238782
11 194 0.238782
12 195 0.238782
13 196 0.238782
14 197 0.238782
15 289 0.129692
16 290 0.129692
17 291 0.129692
18 292 0.129692
19 293 0.129692
20 385 0.090692
21 386 0.090692
22 387 0.090692
23 388 0.088484
24 389 0.090692
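Another possible route is to avoid the accidental index altogether by skipping the 'q' header line while reading and then flattening the (index, value) column pairs. This is only a sketch: the inline text below is a made-up stand-in for the .dat file (three rows, three pairs), and it assumes every row has the same number of pairs. The names `raw`, `wide`, `idx` and `val` are introduced for illustration.

```python
import io
import pandas as pd

# Made-up stand-in for the file: a "q" header line, then repeated
# (index, value) pairs on every row
raw = """q
1 0.83 4 0.70 7 0.238782
2 0.75 5 0.70 8 0.238782
3 0.64 6 0.64 9 0.238782
"""

# Skip the header line so that no column is swallowed into the index
wide = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None, skiprows=1)

# Columns 0, 2, 4, ... hold indices and columns 1, 3, 5, ... hold values.
# order="F" flattens column by column, so each pair is read top to bottom
# before moving on to the next pair.
idx = wide.iloc[:, 0::2].to_numpy().ravel(order="F")
val = wide.iloc[:, 1::2].to_numpy().ravel(order="F")

q = pd.Series(val, index=idx.astype(int), name="q").sort_index()
print(q)
```

Because the slices 0::2 and 1::2 adapt to however many columns were read, the number of column pairs does not need to be known in advance.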
I have this problem which I've been trying to solve:
I want the code to take this DataFrame and group multiple columns based on the most frequent number and sum the values on the last column. For example:
df = pd.DataFrame({'A':[1000, 1000, 1000, 1000, 1000, 200, 200, 500, 500],
'B':[380, 380, 270, 270, 270, 45, 45, 45, 55],
'C':[380, 380, 270, 270, 270, 88, 88, 88, 88],
'D':[45, 32, 67, 89, 51, 90, 90, 90, 90]})
df
A B C D
0 1000 380 380 45
1 1000 380 380 32
2 1000 270 270 67
3 1000 270 270 89
4 1000 270 270 51
5 200 45 88 90
6 200 45 88 90
7 500 45 88 90
8 500 55 88 90
I would like the code to show the result below:
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Notice that the most frequent value on the first rows is 1000, and this way I group the column 'A' so I get the sum 284 on the column 'D'. However, on the last rows, the most frequent number, which is 88, is not on column 'A', but in column 'C'. I am trying to sum the values on column 'D' by grouping column 'C' and get 360. I am not sure if I made myself clear.
I tried to use the function df['D'] = df.groupby(['A', 'B', 'C'])['D'].transform('sum'), but it does not show the desired result aforementioned.
Is there any pandas-style way of resolving this? Thanks in advance!
Code
import numpy as np

def get_count_sum(col, func):
    # Broadcast a per-group aggregate of D back onto every row
    return df.groupby(col).D.transform(func)
ga = get_count_sum('A', 'count')
gb = get_count_sum('B', 'count')
gc = get_count_sum('C', 'count')
conditions = [
((ga > gb) & (ga > gc)),
((gb > ga) & (gb > gc)),
((gc > ga) & (gc > gb)),
]
choices = [get_count_sum('A', 'sum'),
get_count_sum('B', 'sum'),
get_count_sum('C', 'sum')]
df['D'] = np.select(conditions, choices)
df
Output
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Explanation
Since we need to group by whichever of columns 'A', 'B' or 'C' holds the most frequently repeated value, we first compute, for each row, how often its value occurs in each column and store the results in ga, gb and gc for A, B and C respectively.
The conditions check, row by row, which column has the highest count.
The choices hold the matching group sums of D, one per condition.
np.select works like if-elif-else: for each row it returns the entry from choices that corresponds to the first condition that is True (and its default, 0, when none is, e.g. when the counts tie).
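If the number of candidate columns grows, writing one condition per column gets tedious. The same idea can be sketched with idxmax over a frame of counts; this is a variation on the approach above, not the answer's own code, and the names `group_cols`, `counts`, `sums`, `best` and `D_sum` are introduced here. One difference to be aware of: on ties idxmax picks the first column in group_cols, whereas np.select falls back to its default.

```python
import pandas as pd

df = pd.DataFrame({'A': [1000, 1000, 1000, 1000, 1000, 200, 200, 500, 500],
                   'B': [380, 380, 270, 270, 270, 45, 45, 45, 55],
                   'C': [380, 380, 270, 270, 270, 88, 88, 88, 88],
                   'D': [45, 32, 67, 89, 51, 90, 90, 90, 90]})

group_cols = ['A', 'B', 'C']

# For every row: how often does its value occur within each column?
counts = pd.DataFrame({c: df.groupby(c)[c].transform('count') for c in group_cols})

# For every row: the sum of D within the group that the row belongs to in each column
sums = pd.DataFrame({c: df.groupby(c)['D'].transform('sum') for c in group_cols})

# Column with the highest count per row, then look up the matching sum
best = counts.idxmax(axis=1)
df['D_sum'] = [sums.at[i, c] for i, c in best.items()]
print(df)
```

The result is written to a separate D_sum column so that the original D values stay available for inspection.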
I have a dataframe, from where I extracted some sample data:
Time Val
0 70000 -322
1 70500 -439
2 71000 -528
3 71500 -606
4 72000 -642
5 72500 -663
6 73000 -620
7 73500 -561
8 74000 -592
9 74500 -614
10 75000 -630
11 75500 -719
12 80000 -613
13 80500 -127
14 81000 -235
15 81500 -186
16 82000 -82
17 82500 836
18 83000 1137
183 70000 -106
184 70500 -117
185 71000 -626
186 71500 -810
187 72000 -822
188 72500 -676
189 73000 -639
190 73500 -664
191 74000 -708
192 74500 -515
193 75000 -61
194 75500 -121
195 80000 -145
196 80500 -57
197 81000 -133
198 81500 101
199 82000 235
200 82500 585
201 83000 550
366 70000 18
367 70500 138
368 71000 22
369 71500 -68
370 72000 -146
371 72500 -163
372 73000 -251
373 73500 -230
374 74000 -218
375 74500 -137
376 75000 -126
Now I would like to compare the value of 'Val' at time 73000 with the value three rows earlier ([i-3]).
If the value is smaller, the subsequent values should be appended to the list until Time reaches 80000.
I wrote this loop, but the problem is that it compares 'Val' with [i-3] for ALL rows between 73000 and 80000. I want the comparison to happen ONLY at 73000 and, if the condition is true, the data to be written to the list (until Time 80000).
box = []
for i in df.index:
    if df.Time[i] >= 73000 and df.Time[i] <= 80000 and df.Val[i] < df.Val[i-3]:
        box.append(
            {
                'Time': df.Time[i],
                'newVAL': df.Val[i],
            }
        )
box = pd.DataFrame(box, columns=['Time', 'newVAL'])
How could I change the code in order to achieve this?
You need to remember the result of the comparison in another variable, and reset it whenever you encounter a time value outside your desired interval. The code would look like this.
box = []
writeToList = False
for i in df.index:
    # Leaving the 73000-80000 window resets the flag
    if df.Time[i] < 73000 or df.Time[i] > 80000:
        writeToList = False
    # The comparison with [i-3] happens only at Time 73000
    if df.Time[i] == 73000 and df.Val[i] < df.Val[i-3]:
        writeToList = True
    if writeToList and df.Time[i] >= 73000 and df.Time[i] <= 80000:
        box.append(
            {
                'Time': df.Time[i],
                'newVAL': df.Val[i],
            }
        )
box = pd.DataFrame(box, columns=['Time', 'newVAL'])
Hope this helps.
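For larger frames, the same flag logic can probably be expressed without an explicit loop. The sketch below is an alternative, not the code above: it assumes rows are ordered in time, that each day contains exactly one row with Time == 73000, and that "three rows earlier" can be taken positionally with shift(3). The sample is a shortened, partly made-up version of the question's data, and `at_start`, `trigger`, `block` and `keep` are names introduced here.

```python
import pandas as pd

# Two shortened "days"; the first passes the check at 73000, the second does not
df = pd.DataFrame({
    "Time": [71500, 72000, 72500, 73000, 73500, 80000, 80500,
             71500, 72000, 72500, 73000, 73500, 80000, 80500],
    "Val":  [-606, -642, -663, -620, -561, -613, -127,
             -810, -822, -676, -639, -664, -145, -57],
})

at_start = df["Time"] == 73000

# The comparison is evaluated only on the 73000 rows
trigger = at_start & (df["Val"] < df["Val"].shift(3))

# Every 73000 row opens a new block; spread its result over the whole block
block = at_start.cumsum()
keep = trigger.groupby(block).transform("any") & df["Time"].between(73000, 80000)

box = df.loc[keep, ["Time", "Val"]].rename(columns={"Val": "newVAL"})
print(box)
```

Here only the first day's rows from 73000 to 80000 are kept, because -620 < -606 holds while -639 < -810 does not.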
I'm currently working on an online advertisement optimizer project.
Let's assume that the only thing I can change is CPC (cost per click).
I don't have much data, as the data is updated only once a day.
I want to predict net_income from CPC and have the program suggest the CPC value that maximizes tomorrow's net_income, based on the data that updates every day.
cpc margin
0 440 -95224.0
1 840 -81620.0
2 530 -57496.0
3 590 -47287.0
4 560 -45681.0
5 590 -52766.0
6 500 -60852.0
7 650 -59653.0
8 480 -48905.0
9 620 -56496.0
10 680 -53614.0
11 590 -44440.0
12 460 -34066.0
13 720 -31086.0
14 590 -23177.0
15 680 -12803.0
16 760 -10625.0
17 590 -20548.0
18 800 -15136.0
19 650 -12804.0
20 420 -63435.0
21 400 -7566.0
22 400 21136.0
23 400 -58585.0
24 400 -14166.0
25 420 -23065.0
26 400 -28533.0
27 380 -14454.0
28 400 -50819.0
29 380 -26356.0
30 400 -26322.0
31 380 -19107.0
32 400 -28270.0
33 380 -88439.0
34 360 -32207.0
35 340 -27632.0
36 340 -18050.0
37 340 -71574.0
38 340 -18050.0
39 320 -20735.0
40 300 -17984.0
41 290 -9426.0
42 280 -16555.0
43 290 2961.0
For instance, say the above data is df.
I tried to use sklearn and LogisticRegression to get the prediction:
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(df['cpc'], df['margin'])
prediction = model.predict([[300]])
print(prediction[0])
margin is net_income, btw.
So by doing this, I thought I might get the prediction based on the data when CPC is 300, but it returned an error saying:
ValueError: Expected 2D array, got 1D array instead:
array=[440 840 530 590 560 590 500 650 480 620 680 590 460 720 590 680 760 590
800 650 420 400 400 400 400 420 400 380 400 380 400 380 400 380 360 340
340 340 340 320 300 290 280 290].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I've been looking for some examples using linear regression or logistic regression models, but they all use a 2-d array for input, which doesn't seem to fit my needs. I only have one factor that I can change, and the result is simply the net_income (or margin).
How would I use sklearn on my project? Or is there maybe another better way to solve the problem?
I'm pretty new to programming and have no knowledge of math and statistics which makes it harder for me to understand or get keywords to study... please guide me on this.
---------------------------------updated-------------------------------------
All right, let me give you another df
cpc margin
0 440 -35224.0
1 340 -11574.0
2 380 -68439.0
3 420 -23435.0
4 840 -81620.0
5 400 -38585.0
6 530 -37496.0
7 590 -7287.0
8 560 -5681.0
9 590 -32766.0
10 500 -60852.0
11 400 -30819.0
12 650 -59653.0
13 480 -28905.0
14 620 -56496.0
15 680 -53614.0
16 590 -44440.0
17 460 -14066.0
18 420 16935.0
19 360 -12207.0
20 400 -8533.0
21 400 -6322.0
22 400 25834.0
23 720 -31086.0
24 400 121136.0
25 400 -28270.0
26 340 1950.0
27 340 1950.0
28 300 2016.0
29 340 -27632.0
30 400 32434.0
31 380 -26356.0
32 590 -23177.0
33 680 7197.0
34 320 -20735.0
35 760 9375.0
36 590 -20548.0
37 290 10574.0
38 380 -19107.0
39 290 42961.0
40 280 -16555.0
41 800 -15136.0
42 380 -14454.0
43 650 -12804.0
Thanks to your answers, I could go further, as below.
Once my code ran without error, I thought that by looping over the input I would be able to find the optimal cpc value.
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
df = pd.DataFrame(final_db)
model = LogisticRegression()
x = df[['cpc']]
model.fit(x, df['margin'])
previous_prediction = -99999999999999
df_prediction = []
for i in list(range(10, 1000, 10)):
    prediction = model.predict([[i]])
    df_prediction.append({'cpc':i, 'margin' : prediction})
    if prediction > previous_prediction:
        previous_prediction = prediction
        previous_i = i
and the result was as below
which isn't very satisfying. Based on the data I have, is there a better model to use? Any other suggestions for achieving my goal?
I guess it is complaining about this line
model.fit(df['cpc'], df['margin'])
where the first parameter should be a two-dimensional array. You can use list-based column selection on the DataFrame
df[['cpc']]
to get a DataFrame instead of a Series, which will fix the issue.
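To make the shape difference concrete, here is a minimal sketch using a handful of rows copied from the question's first table. It swaps in LinearRegression, on the assumption that margin is a continuous quantity (LogisticRegression is a classifier and treats every distinct margin value as a class). The variable names `X`, `y` and `query` are introduced here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# A few rows taken from the question's first table
df = pd.DataFrame({
    "cpc": [440, 840, 530, 590, 400, 340, 290],
    "margin": [-95224.0, -81620.0, -57496.0, -47287.0,
               -7566.0, -27632.0, -9426.0],
})

X = df[["cpc"]]   # double brackets: DataFrame with shape (n_samples, 1)
y = df["margin"]  # single brackets: Series with shape (n_samples,)

model = LinearRegression().fit(X, y)

# predict() expects the same 2D layout, even for a single CPC value
query = pd.DataFrame({"cpc": [300]})
prediction = model.predict(query)
print(X.shape, y.shape, prediction[0])
```

A single feature is therefore no obstacle: the 2D requirement only means one column per feature and one row per sample.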