I have a pandas dataframe with a column created_time that is in epoch (milliseconds) format. I want to use a filter condition as shown below.
Dataframe sample
created_time updated_time sys_time last_action_time account_id \
0 1624473000000 1624459148023 1624459148023 0 812
1 1624471920000 1624448094358 1624448094358 0 812
2 1624469400000 1624455267579 1624455267579 0 812
3 1624466580000 1624466620020 1624466590321 0 812
4 1624466529000 1624466610222 1624466540086 0 812
5 1624466501000 1624466610270 1624466510212 0 812
6 1624466461000 1624466620149 1624466469825 0 812
7 1624466443000 1624466446558 1624466446558 0 812
8 1624466435000 1624466460213 1624466460213 0 812
daily_data_df = data_df[(data_df['created_time'] >= start_date_int) & (data_df['created_time'] < end_date_int)]
where:
start_date_int and end_date_int are in the GMT+7 timezone
created_time is in epoch (millisecond) format
Please help me with the conversion.
First: strip the last 3 digits from the "created_time" column. A Unix epoch in seconds is only 9-10 digits long, and you have 13, so your values are in milliseconds:
df['created_time'] = df['created_time'].astype(str).apply(lambda x: x[:-3])
Second: convert from Unix epoch (seconds) to datetime:
df['created_time'] = pd.to_datetime(df['created_time'], unit = 's')
Third: filter the date range (sample range below):
start_date_int = '2021-06-23 18:12:00'
end_date_int = '2021-06-23 18:30:00'
df_filtered = df[(df['created_time'] >= start_date_int) &
                 (df['created_time'] < end_date_int)]
Alternative method for the filter (note that between() is inclusive on both ends):
df_filtered = df[df['created_time'].between(start_date_int, end_date_int)]
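Note that the first two steps can be collapsed: pd.to_datetime accepts the 13-digit millisecond values directly via unit='ms', so the string stripping is optional. A minimal sketch:
df['created_time'] = pd.to_datetime(df['created_time'], unit='ms')  # same result, no string round-trip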
You can convert the epoch time to GMT+7 using pd.to_datetime() and dt.tz_convert(), as follows:
data_df['created_GMT+7'] = pd.to_datetime(data_df['created_time'], unit='ms', utc=True).dt.tz_convert('Etc/GMT+7')
Be aware that, owing to the POSIX sign convention, the 'Etc/GMT+7' zone is actually UTC-7 (as the -07:00 offsets below show); for a true GMT+7 offset use 'Etc/GMT-7' or a named zone such as 'Asia/Bangkok'.
Result:
print(data_df['created_GMT+7'])
0 2021-06-23 11:30:00-07:00
1 2021-06-23 11:12:00-07:00
2 2021-06-23 10:30:00-07:00
3 2021-06-23 09:43:00-07:00
4 2021-06-23 09:42:09-07:00
5 2021-06-23 09:41:41-07:00
6 2021-06-23 09:41:01-07:00
7 2021-06-23 09:40:43-07:00
8 2021-06-23 09:40:35-07:00
Name: created_GMT+7, dtype: datetime64[ns, Etc/GMT+7]
Then, filter the rows as follows:
start_date_int = 1624466460500
end_date_int = 1624469402000
mask = data_df['created_GMT+7'].between(pd.Timestamp(start_date_int, unit='ms', tz='Etc/GMT+7'), pd.Timestamp(end_date_int, unit='ms', tz='Etc/GMT+7'))
daily_data_df = data_df.loc[mask]
Or,
start_date_int = 1624466460500
end_date_int = 1624469402000
mask = ((data_df['created_GMT+7'] - pd.Timestamp("1970-01-01", tz='UTC')) // pd.Timedelta('1ms')).between(start_date_int, end_date_int)  # epoch ms is defined relative to UTC
daily_data_df = data_df.loc[mask]
Result:
(using the sample start_date_int and end_date_int above)
print(daily_data_df)
created_time updated_time sys_time last_action_time account_id created_GMT+7
2 1624469400000 1624455267579 1624455267579 0 812 2021-06-23 10:30:00-07:00
3 1624466580000 1624466620020 1624466590321 0 812 2021-06-23 09:43:00-07:00
4 1624466529000 1624466610222 1624466540086 0 812 2021-06-23 09:42:09-07:00
5 1624466501000 1624466610270 1624466510212 0 812 2021-06-23 09:41:41-07:00
6 1624466461000 1624466620149 1624466469825 0 812 2021-06-23 09:41:01-07:00
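If start_date_int and end_date_int begin life as GMT+7 wall-clock dates rather than epoch values, they can be derived with pd.Timestamp; a sketch, using 'Asia/Bangkok' as an example GMT+7 zone (the zone name is an assumption, pick whichever fits your data):
start_date_int = int(pd.Timestamp('2021-06-23', tz='Asia/Bangkok').timestamp() * 1000)  # local midnight -> epoch ms
end_date_int = int(pd.Timestamp('2021-06-24', tz='Asia/Bangkok').timestamp() * 1000)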
I am having trouble pulling prices from the Bid and Ask columns of this website: https://banggia.vps.com.vn/chung-khoan/derivative-VN30. Right now I can only pull the name of the class, which is "price-table-content". How can I improve this code so that I can pull the prices in the Bid and Ask columns? Any help is greatly appreciated :)
from selenium import webdriver

options = webdriver.ChromeOptions()
options.headless = True
path = 'C:/Users/quank/PycharmProjects/pythonProject2/chromedriver.exe'
driver = webdriver.Chrome(executable_path=path, options=options)
url = 'https://banggia.vps.com.vn/chung-khoan/derivative-VN30'
driver.get(url=url)
element = driver.find_elements_by_css_selector(
    '#root > div > div.content.undefined > div.derivative > table.price-table > tbody')
for i in element:
    print(i.get_attribute('outerHTML'))
Here is the result of running this code:
C:\Users\quank\PycharmProjects\Botthudulieu\venv\Scripts\python.exe
C:/Users/quank/PycharmProjects/pythonProject2/Botthudulieu.py
<tbody class="price-table-content"></tbody>
When you check the network activity you'll see that the data is retrieved from an API, so query the API directly rather than trying to scrape the site.
import requests
data = requests.get('https://bgapidatafeed.vps.com.vn/getpsalldatalsnapshot/VN30F2109,VN30F2110,VN30F2112,VN30F2203').json()
Or with pandas:
import pandas as pd
df = pd.read_json('https://bgapidatafeed.vps.com.vn/getpsalldatalsnapshot/VN30F2109,VN30F2110,VN30F2112,VN30F2203')
Resulting dataframe: one row per contract (VN30F2109, VN30F2110, VN30F2112, VN30F2203), with columns id, sym, mc, c, f, r, lastPrice, lastVolume, lot, avePrice, highPrice, lowPrice, fBVol, fBValue, fSVolume, fSValue, g1-g7 (which appear to carry the bid/ask price ladder), mkStatus, listing_status, matureDate, closePrice, ptVol, oi, oichange and lv.
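To pull the Bid and Ask columns specifically, the g1-g7 fields appear to hold the price ladder encoded as 'price|volume|flag' strings. Both this encoding and which g columns are bids versus asks are assumptions to verify against the live feed; a sketch:

import pandas as pd

df = pd.read_json('https://bgapidatafeed.vps.com.vn/getpsalldatalsnapshot/VN30F2109,VN30F2110,VN30F2112,VN30F2203')

# Assumed layout: g1-g3 = top bid levels, g4-g6 = top ask levels,
# each encoded as 'price|volume|flag'.
for col in ['g1', 'g2', 'g3', 'g4', 'g5', 'g6']:
    parts = df[col].astype(str).str.split('|', expand=True)
    df[col + '_price'] = pd.to_numeric(parts[0], errors='coerce')
    df[col + '_volume'] = pd.to_numeric(parts[1], errors='coerce')

print(df[['sym', 'g1_price', 'g1_volume', 'g4_price', 'g4_volume']])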
Problem: I want to build logic that takes attendance data (in-time, employee ID) and returns a data frame with employee ID, in-time, attendance date, and the slot in which the employee entered. (Suppose the in-time is 9:30:00 on date 14-10-2019; then for that date, a value of 1 should be inserted into the 9:30 slot column.)
An example is given below.
I tried many times to build the logic for this problem but failed.
I have a dataset that looks like this.
I want output like this, so that whatever time the employee enters, a value is inserted into that time-slot column only:
This is my code, but it only keeps the result of the last loop iteration.
temp = []
for date in nf['DaiGong']:
    for en in nf['EnNo']:
        for i in nf['DateTime']:
            col = ['EnNo', 'Date', 'InTime', '9:30-10:30', '10:30-11:00', '11:00-11:30', '11:30-12:30', '12:30-13:00', '13:00-13:30']
            ndf = pd.DataFrame(columns=col)
            if i < '10:30:00' and i > '09:30:00':
                temp.append(1)
                ndf['9:30-10:30'] = temp
                ndf['InTime'] = i
                ndf['Date'] = date
                ndf['EnNo'] = en
            elif i < '11:00:00' and i > '10:30:00':
                temp.append(1)
                ndf['10:30-11:00'] = temp
                ndf['InTime'] = i
                ndf['Date'] = date
                ndf['EnNo'] = en
            elif i < '11:30:00' and i > '11:00:00':
                temp.append(1)
                ndf['11:00-11:30'] = temp
                ndf['InTime'] = i
                ndf['Date'] = date
                ndf['EnNo'] = en
            elif i < '12:30:00' and i > '11:30:00':
                temp.append(1)
                ndf['11:30-12:30'] = temp
                ndf['InTime'] = i
                ndf['Date'] = date
                ndf['EnNo'] = en
            elif i < '13:00:00' and i > '12:30:00':
                temp.append(1)
                ndf['12:30-13:00'] = temp
                ndf['InTime'] = i
                ndf['Date'] = date
                ndf['EnNo'] = en
            elif i < '13:30:00' and i > '13:00:00':
                temp.append(1)
                ndf['13:00-13:30'] = temp
                ndf['InTime'] = i
                ndf['Date'] = date
                ndf['EnNo'] = en
This is the output of my code.
IIUC,
df = pd.DataFrame({'EnNo':[2,2,2,2,2,3,3,3,3],
'DaiGong':['2019-10-12', '2019-10-13', '2019-10-14', '2019-10-15', '2019-10-16', '2019-10-12', '2019-10-13', '2019-10-14', '2019-10-15'],
'DateTime':['09:53:56', '10:53:56', '09:23:56', '11:53:56', '11:23:56', '10:33:56', '12:53:56', '12:23:56', '09:53:56']})
df
DaiGong DateTime EnNo
0 2019-10-12 09:53:56 2
1 2019-10-13 10:53:56 2
2 2019-10-14 09:23:56 2
3 2019-10-15 11:53:56 2
4 2019-10-16 11:23:56 2
5 2019-10-12 10:33:56 3
6 2019-10-13 12:53:56 3
7 2019-10-14 12:23:56 3
8 2019-10-15 09:53:56 3
import datetime
df['DateTime'] = pd.to_datetime(df['DateTime']).dt.time  # convert strings to datetime.time
def time_range(row):  # I only wrote two conditions - add more
    i = row['DateTime']
    if i < datetime.time(10, 30, 0) and i > datetime.time(9, 30, 0):
        return '9:30-10:30'
    elif i < datetime.time(11, 0, 0) and i > datetime.time(10, 30, 0):
        return '10:30-11:00'
    else:
        return 'greater than 11:00'
df['time range'] = df.apply(time_range, axis=1)
df1 = pd.concat([df[['EnNo', 'DaiGong', 'DateTime']], pd.get_dummies(df['time range'])], axis=1)
df1
df1
EnNo DaiGong DateTime 10:30-11:00 9:30-10:30 greater than 11:00
0 2 2019-10-12 09:53:56 0 1 0
1 2 2019-10-13 10:53:56 1 0 0
2 2 2019-10-14 09:23:56 0 0 1
3 2 2019-10-15 11:53:56 0 0 1
4 2 2019-10-16 11:23:56 0 0 1
5 3 2019-10-12 10:33:56 1 0 0
6 3 2019-10-13 12:53:56 0 0 1
7 3 2019-10-14 12:23:56 0 0 1
8 3 2019-10-15 09:53:56 0 1 0
To get the sum of counts by employee:
df1.groupby(['EnNo'], as_index=False).sum()
Let me know if you have any questions
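If you need all six slots from the original question, the same pattern extends; alternatively, pd.cut can assign the slot in one step. A sketch, with bin edges taken from the question's column list (and assuming df['DateTime'] already holds datetime.time values, as above):

import pandas as pd

edges = pd.to_timedelta(['09:30:00', '10:30:00', '11:00:00', '11:30:00',
                         '12:30:00', '13:00:00', '13:30:00'])
labels = ['9:30-10:30', '10:30-11:00', '11:00-11:30',
          '11:30-12:30', '12:30-13:00', '13:00-13:30']

td = pd.to_timedelta(df['DateTime'].astype(str))  # time-of-day as timedelta
df['time range'] = pd.cut(td, bins=edges, labels=labels)
df1 = pd.concat([df[['EnNo', 'DaiGong', 'DateTime']],
                 pd.get_dummies(df['time range'])], axis=1)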
My test data:
df:
EnNo DaiGong DateTime
2 2019-10-12 09:53:56
2 2019-10-13 09:42:00
2 2019-10-14 12:00:01
1 2019-11-01 11:12:00
1 2019-11-02 10:13:45
Create helper data:
tdr = pd.timedelta_range("09:00:00", "12:30:00", freq="30T")   # slot boundaries
s = pd.Series(len(tdr) * ["-"])                                # template row: a 1 in the first slot, "-" elsewhere
s[0] = 1
cls = [t.rsplit(":", maxsplit=1)[0] for t in tdr.astype(str)]  # drop the seconds -> 'HH:MM'
cols = [t1 + "-" + t2 for (t1, t2) in zip(cls, cls[1:])]       # 'HH:MM-HH:MM' labels
cols.append(cls[-1] + "-")                                     # open-ended final slot
tdr:
TimedeltaIndex(['09:00:00', '09:30:00', '10:00:00', '10:30:00', '11:00:00', '11:30:00', '12:00:00', '12:30:00'], dtype='timedelta64[ns]', freq='30T')
cols:
['09:00-09:30', '09:30-10:00', '10:00-10:30', '10:30-11:00', '11:00-11:30', '11:30-12:00', '12:00-12:30', '12:30-']
s:
0 1
1 -
2 -
3 -
4 -
5 -
6 -
7 -
dtype: object
Use 'apply' and 'searchsorted' to get the time slots:
df2 = df.DateTime.apply(lambda t:
    s.shift(tdr.searchsorted(t) - 1, fill_value="-"))  # shift the template so the 1 lands in the matching slot
df2.columns = cols
df2:
09:00-09:30 09:30-10:00 10:00-10:30 10:30-11:00 11:00-11:30 11:30-12:00 12:00-12:30 12:30-
0 - 1 - - - - - -
1 - 1 - - - - - -
2 - - - - - - 1 -
3 - - - - 1 - - -
4 - - 1 - - - - -
Finally, concatenate the two data frames:
df_rslt= pd.concat([df,df2],axis=1)
df_rslt:
EnNo DaiGong DateTime 09:00-09:30 09:30-10:00 10:00-10:30 10:30-11:00 11:00-11:30 11:30-12:00 12:00-12:30 12:30-
0 2 2019-10-12 09:53:56 - 1 - - - - - -
1 2 2019-10-13 09:42:00 - 1 - - - - - -
2 2 2019-10-14 12:00:01 - - - - - - 1 -
3 1 2019-11-01 11:12:00 - - - - 1 - - -
4 1 2019-11-02 10:13:45 - - 1 - - - - -
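For comparison, the shifted-template trick can also be written with a Categorical and get_dummies, reusing tdr and cols from above (a sketch; it assumes every DateTime falls at or after the first boundary, 09:00):

idx = tdr.searchsorted(pd.to_timedelta(df['DateTime'])) - 1   # slot index per row
slots = pd.Categorical([cols[i] for i in idx], categories=cols)
df2_alt = pd.get_dummies(slots).set_axis(df.index)            # 0/1 instead of '-'/1
df_rslt_alt = pd.concat([df, df2_alt], axis=1)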
I have CSV files that I want to merge with a list of structs (a class I made).
In the CSV I have a field 'sector' and other fields with information about that sector.
The list's element type is a class I made with fields name, x, y, where x, y is the location belonging to that name.
This is how I defined the list (I generated it from a CSV file as well, in which each antenna appears many times with different parameters, so I extracted only the ones I need):
# ant_file is the CSV with all the antennas, ant_list_name is the list with
# only antennas name and ant_list_tot is the list with the name and also x,y
# fields
for rowA in range(size_ant_file):
    rec = ant_file.iloc[rowA]['name']
    if rec not in ant_list_name:
        ant_list_name.append(rec)
        A = Antenna(ant_file.iloc[rowA]['name'], ant_file.iloc[rowA]['x'],
                    ant_file.iloc[rowA]['y'])
        ant_list_tot.append(A)
print(antenna_list)
[Antenna(name='UWE33', x=34.9, y=31.9), Antenna(name='UTN00', x=34.8,
y=32.1), Antenna(name='UWE02', x=34.8, y=32.1)]
I tried to do it with a double for loop:
from dataclasses import dataclass

@dataclass
class Antenna:
    name: str
    x: float
    y: float

# records is the csv file and antenna_list is the list of type Antenna
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
The resulting CSV file is partly right: there are rows with all fields correct except x and y (which are 0), and, at the end, rows with x and y values but without the information from the original fields.
It seems like there is a big shift of rows, but I can't understand why.
I checked that there are no missing values.
example:
records.csv at the beginning (date, hour and user_id are random numbers and not important):
sector date hour user_id x y
abc 1.1.19 20:00 123 0 0
dfs 5.8.17 12:40 876 0 0
ngh 6.9.19 08:12 962 0 0
yjt 10.10.16 17:18 492 0 0
abc 6.8.16 22:10 985 0 0
dfs 7.1.15 19:15 542 0 0
antenna_list in the form of (name, x, y) (here too, x and y are random numbers and not important):
antenna_list[0] = (abc,12,16)
antenna_list[1] = (dfs,6,20)
antenna_list[2] = (ngh,13,98)
antenna_list[3] = (yjt,18,41)
the result I want to see is:
sector date hour user_id x y
abc 1.1.19 20:00 123 12 16
dfs 5.8.17 12:40 876 6 20
ngh 6.9.19 08:12 962 13 98
yjt 10.10.16 17:18 492 18 41
abc 6.8.16 22:10 985 12 16
dfs 7.1.15 19:15 542 6 20
but the real result is:
sector date hour user_id x y
abc 1.1.19 20:00 123 12 16
dfs 5.8.17 12:40 876 6 20
ngh 6.9.19 08:12 962 0 0
yjt 10.10.16 17:18 492 0 0
abc 6.8.16 22:10 985 0 0
dfs 7.1.15 19:15 542 0 0
13 98
18 41
12 16
6 20
TIA
If you save antenna_list as two dicts,
antenna_dict_x = {'abc':12, 'dfs':6, 'ngh':13, 'yjt':18}
antenna_dict_y = {'abc':16, 'dfs':20, 'ngh':98, 'yjt':41}
then creating two columns should be an easy map,
data['x']=data['sector'].map(antenna_dict_x)
data['y']=data['sector'].map(antenna_dict_y)
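The two dicts can also be built straight from the antenna list; a sketch assuming the Antenna objects from the question:
antenna_dict_x = {a.name: a.x for a in antenna_list}
antenna_dict_y = {a.name: a.y for a in antenna_list}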
So if you do:
import pandas as pd

class Antenna():
    def __init__(self, name, x, y):
        self.name = name
        self.x = x
        self.y = y
antenna_list = [Antenna('abc',12,16), Antenna('dfs',6,20), Antenna('ngh',13,98), Antenna('yjt',18,41)]
records = pd.read_csv('something.csv')
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
print(records)
you get:
sector date hour user_id x y
0 abc 1.1.19 20:00 123 12 16
1 dfs 5.8.17 12:40 876 6 20
2 ngh 6.9.19 8:12 962 13 98
3 yjt 10.10.16 17:18 492 18 41
4 abc 6.8.16 22:10 985 12 16
5 dfs 7.1.15 19:15 542 6 20
Which is what you were expecting. Also, if you do:
import pandas as pd
from dataclasses import dataclass
@dataclass
class Antenna:
    name: str
    x: float
    y: float
antenna_list = [Antenna('abc',12,16), Antenna('dfs',6,20), Antenna('ngh',13,98), Antenna('yjt',18,41)]
records = pd.read_csv('something.csv')
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
print(records)
you get:
sector date hour user_id x y
0 abc 1.1.19 20:00 123 12 16
1 dfs 5.8.17 12:40 876 6 20
2 ngh 6.9.19 8:12 962 13 98
3 yjt 10.10.16 17:18 492 18 41
4 abc 6.8.16 22:10 985 12 16
5 dfs 7.1.15 19:15 542 6 20
Which is, again, what you were expecting. You did not post how you created the antenna list, but I assume that is where your error is.
Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column names are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code my output differs from the author's:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is, why are the column names gone? All other output matches the author's output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No column names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the column names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
A final note. I just applied @AndyHayden's numpy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is … wow. What an enormous performance difference. I get the exact same correct answer, and I get it as soon as I click the mouse, compared to the pandas lambda solution (also provided by @AndyHayden), which takes about 20 seconds to perform the sort. The dataset is 58,000+ rows; the numpy solution returns the sort instantly.
I have a SQL statement like this:
select id
     , avg(case when rate = 1 then rate end) as "P_Rate"
     , stddev(case when rate = 1 then rate end) as "std P_Rate"
     , avg(case when f_rate = 1 then f_rate else 0 end) as "A_Rate"
     , stddev(case when f_rate = 1 then f_rate else 0 end) as "std A_Rate"
from (
    select id, connected_date, payment_type, acc_type,
           max(case when s_rate > 1 then 1 else 0 end) / count(open) as rate,
           sum(case when hire_days <= 5 and paid > 1000 then 1 else 0 end) / count(open) as f_rate
    from analysis_table where alloc_date <= '2016-01-01' group by 1, 2
) a group by id
I am trying to rewrite it using pandas.
First, I create a dataframe for the "inner" table:
filtered_data = data.where(data['alloc_date'] <= analysis_date)
Then I group this data:
grouped = filtered_data.groupby(['id','connected_date'])
But what do I have to use to filter each column and apply max/sum to it?
I tried something like this:
def my_agg_function(hire_days, paid, open):
    r_arr = []
    if hire_days <= 5 and paid > 1000:
        r_arr.append(1)
    else:
        r_arr.append(0)
    return np.max(r_arr) / len(????)

inner_table['f_rate'] = grouped.agg(lambda row: my_agg_function(row['hire_days'], row['paid'], row['open']))
and something similar for rate
You should put a little DataFrame in your question to make it easier to answer.
For your need you might want to use the agg method of groupby dataframes. Let's suppose you have the following dataframe:
connected_date id number_of_clicks time_spent
0 Mon matt 15 124
1 Tue john 13 986
2 Mon matt 48 451
3 Thu jack 68 234
4 Sun john 52 976
5 Sat sabrina 13 156
And you want to get the sum of the time spent by user by day, and the maximum number of clicks in a single session. Then you use groupby this way:
df.groupby(['id','connected_date'],as_index = False).agg({'number_of_clicks':max,'time_spent':sum})
Output:
id connected_date time_spent number_of_clicks
0 jack Thu 234 68
1 john Sun 976 52
2 john Tue 986 13
3 matt Mon 575 48
4 sabrina Sat 156 13
Note that I only passed as_index=False for clarity of the output.
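Applied back to the question's SQL, the whole query could be sketched roughly like this (an untested sketch, assuming the column names from the SQL; the named-aggregation syntax needs pandas >= 0.25):

import pandas as pd

filtered = data[data['alloc_date'] <= '2016-01-01']

def rates(g):
    # Inner query: per (id, connected_date) group, compute rate and f_rate.
    n_open = g['open'].count()
    return pd.Series({
        'rate': (g['s_rate'] > 1).max() / n_open,
        'f_rate': ((g['hire_days'] <= 5) & (g['paid'] > 1000)).sum() / n_open,
    })

inner = (filtered.groupby(['id', 'connected_date'])
                 .apply(rates)
                 .reset_index())

# Outer query: avg/stddev of the conditionally selected rates per id.
# where(cond) leaves NaN where the condition fails, which mean/std then
# skip, mirroring SQL's NULL handling in the case expressions.
outer = (inner.assign(p_rate=inner['rate'].where(inner['rate'] == 1),
                      a_rate=inner['f_rate'].where(inner['f_rate'] == 1, 0))
              .groupby('id')
              .agg(P_Rate=('p_rate', 'mean'),
                   std_P_Rate=('p_rate', 'std'),
                   A_Rate=('a_rate', 'mean'),
                   std_A_Rate=('a_rate', 'std')))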