PyFlink session window aggregation by separate keys - Python

I'm trying to wrap my head around the PyFlink DataStream API.
My use case is the following:
The source is a Kinesis data stream consisting of the following records:
cookie  cluster  dim0  dim1  dim2  time_event
1       1        5     5     5     1min
1       2        1     0     6     30min
2       1        1     2     3     45min
1       1        10    10    15    70min
2       1        5     5     10    120min
I want to create a session window aggregation with a gap of 60 minutes, calculating the mean for each cookie-cluster combination. The window assignment should be based on the cookie, the aggregation based on cookie and cluster.
The result would therefore be like this (each row being forwarded immediately):
cookie  cluster  dim0  dim1  dim2  time_event
1       1        5     5     5     1min
1       2        1     0     6     30min
2       1        1     2     3     45min
1       1        7.5   7.5   10    70min
2       1        5     5     10    120min
Expressed in SQL, for a new record I'd like to perform this aggregation:
INSERT INTO `input` (`cookie`, `cluster`, `dim0`, `dim1`, `dim2`, `time_event`) VALUES
("1", "1", 0, 0, 0, 125);

WITH RECURSIVE by_key AS (
    SELECT *,
           (time_event - LAG(time_event) OVER (PARTITION BY cookie ORDER BY time_event)) AS "time_passed"
    FROM input
    WHERE cookie = "1"
),
new_session AS (
    SELECT *,
           CASE WHEN time_passed > 60 THEN 1 ELSE 0 END AS "new_session"
    FROM by_key
),
by_session AS (
    SELECT *,
           SUM(new_session) OVER (PARTITION BY cookie ORDER BY time_event) AS "session_number"
    FROM new_session
)
SELECT cookie, cluster, AVG(dim0), AVG(dim1), AVG(dim2), MAX(time_event)
FROM by_session
WHERE cluster = "1"
GROUP BY cookie, cluster, session_number
ORDER BY session_number DESC
LIMIT 1;
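For reference, the expected output above can be reproduced offline with pandas; the following is purely illustrative (it is not part of the Flink job) and only pins down the intended session semantics on the sample rows above (time_event in minutes):

import pandas as pd

# Illustrative only: recomputes the expected result from the sample rows above.
df = pd.DataFrame({
    "cookie":     [1, 1, 2, 1, 2],
    "cluster":    [1, 2, 1, 1, 1],
    "dim0":       [5, 1, 1, 10, 5],
    "dim1":       [5, 0, 2, 10, 5],
    "dim2":       [5, 6, 3, 15, 10],
    "time_event": [1, 30, 45, 70, 120],
}).sort_values("time_event")

# A new session starts when more than 60 minutes pass between consecutive
# events of the same cookie; a cumulative sum of those flags numbers the sessions.
new_session = df.groupby("cookie")["time_event"].diff().gt(60)
df["session"] = new_session.astype(int).groupby(df["cookie"]).cumsum()

# Mean of the dims per cookie/session/cluster, keeping the latest event time.
result = (df.groupby(["cookie", "session", "cluster"], as_index=False)
            .agg({"dim0": "mean", "dim1": "mean", "dim2": "mean", "time_event": "max"}))
print(result)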
I tried to accomplish this with the Table API, but I need the results to be updated as soon as a new record is added to a cookie-cluster combination. This is my first project with Flink, and the DataStream API is an entirely different beast, especially since a lot of stuff is not included yet for Python.
My current approach looks like this:
1. Create a table from the Kinesis data stream (the DataStream API has no Kinesis connector).
2. Convert it to a data stream to perform the aggregation. From what I've read, watermarks are propagated and the resulting Row objects contain the column names, i.e. I can handle them like a Python dictionary. Please correct me if I'm wrong on this.
3. Key the data stream by the cookie.
4. Window with a custom SessionWindowAssigner, borrowing from the Table API. I'm working on a separate post on that.
5. Process the windows by calculating the mean for each cluster.
table_env = StreamTableEnvironment.create(stream_env, environment_settings=env_settings)
table_env.execute_sql(
    create_table(input_table_name, input_stream, input_region, stream_initpos)
)
ds = table_env.to_append_stream(input_table_name)
ds.key_by(lambda r: r["cookie"]) \
    .window(SessionWindowAssigner(session_gap=60, is_event_time=True)) \
    .trigger(OnElementTrigger()) \
    .process(MeanWindowProcessFunction())
My basic idea for the ProcessWindowFunction would go like this:
from typing import Dict, Iterable

class MeanWindowProcessFunction(ProcessWindowFunction[Dict, Dict, str, TimeWindow]):

    def process(self,
                key: str,
                context: 'ProcessWindowFunction.Context',
                elements: Iterable[Dict]) -> Iterable[Dict]:
        clusters = {}
        cluster_records = {}
        for element in elements:
            if element["cluster"] not in clusters:
                # first record of this cluster: copy it and start counting
                clusters[element["cluster"]] = {k: v for k, v in element.as_dict().items()}
                cluster_records[element["cluster"]] = 1
            else:
                for dim in range(3):
                    clusters[element["cluster"]][f"dim{dim}"] += element[f"dim{dim}"]
                clusters[element["cluster"]]["time_event"] = element["time_event"]
                cluster_records[element["cluster"]] += 1

        # divide the summed dims by the record count to get the mean per cluster
        for cluster in clusters:
            for dim in range(3):
                clusters[cluster][f"dim{dim}"] /= cluster_records[cluster]

        return clusters.values()

    def clear(self, context: 'ProcessWindowFunction.Context') -> None:
        pass
Is this the right approach for this problem?
Do I need to consider anything else for the ProcessWindowFunction, like actually implementing the clear method?
I'd be very grateful for any help, or any more elaborate examples of windowed analytics applications in PyFlink. Thank you!

Related

How to read multiple tables from each tab of an excel sheet in Python?

I have an Excel sheet that has multiple tabs, and each individual tab has multiple tables in it. I want to read the file in such a way that it reads each table from each tab of the sheet. For instance:
Tab1 has five tables in it.
Tab2 has ten tables in it.
.....
.....
I want to read each one of these tables into a pandas DataFrame and then save it to a SQL database. I know how to read multiple tabs from the Excel sheet.
Can anyone help me out here, or point me in a direction where I can find a lead?
The tables in each tab are pre-defined and have names. This is how it looks in each tab: [screenshot of a tab from the Excel sheet]
You probably have to tweak it to match your data; imagine if you have some tables below and some above. This, hopefully, should point you in the right direction. Also, note the number of for loops I used; I believe you can do better and optimize it further.
from openpyxl import load_workbook
from collections import defaultdict
from itertools import product, groupby
from operator import itemgetter
import pandas as pd  # used for the DataFrame creation further down

wb = load_workbook(filename="test.xlsx")
sheet = wb["Sheet1"]

green_rows = defaultdict(list)
rest_data = []
for row in sheet:
    for cell in row:
        # look for the green rows; they contain the headers
        if cell.fill.fgColor.rgb == "FFA2D722":
            # take advantage of the fact that the header
            # is the first entry in that row
            if cell.value:
                val = cell.value
            green_rows[(val, cell.row)].append(cell.column)
        else:
            if cell.value not in (None, ""):  # so the 0s are not lost
                rest_data.append((cell.row, cell.column, cell.value))

# get the max and minimum column positions;
# note the addition of 1 to the max,
# this is necessary when iterating to sort the data
# in the next section
green_rows = [
    (name, row, range(min(value), max(value) + 1))
    for (name, row), value in green_rows.items()
]

box = []
# here the green rows and the rest of the data
# are combined, then filtered for the respective
# sections
combo = product(green_rows, rest_data)
for (header, header_row, header_column_range), (
    cell_row,
    cell_column,
    cell_value,
) in combo:
    # this is where the filtration occurs
    if (header_row < cell_row) and (cell_column in header_column_range):
        box.append((header, cell_row, cell_column, cell_value))

final = defaultdict(list)
content = groupby(box, itemgetter(1, 0))
# another iteration to get the final result
for key, value in content:
    final[key[-1]].append([val[-1] for val in value])
You can create your dataframe for each of the headers:
pd.DataFrame(final["Address Association"])
0 1 2 3 4 5
0 Column Name in DB Name Description SortOrder BusinessMeaningName Obsolete
1 Field Type nvarchar(100) nvarchar(255) int nvarchar(50) bit
2 Mandatory Yes Yes Yes No Yes
3 Foreign Key - - - - -
4 Optional Feature - - - - -
5 Field Name in U4SM Name Description Sort Order Business Meaning Name Obsolete
6 Address.Primary Primary Use this address by default. 1 Address.Primary 0
7 Address.Billing Billing address for billing. 2 Address.Billing 0
8 Address.Emergency Emergency use this for emergency. 3 Address.Emergency 0
9 Address.Emergency SMS Emergency SMS use this for emergency SMS. 4 Address.Emergency SMS 0
10 Address.Deceased Deceased address for deceased. 5 Address.Deceased 0
11 Address.Home Home address for home. 8 Address.Home 0
12 Address.Mailing Mailing address for mailing. 9 Address.Mailing 0
13 Address.Mobile Mobile use this for mobile. 10 Address.Mobile 0
14 Address.School School address for school. 13 Address.School 0
15 Address.SMS SMS use this for SMS text. 15 Address.SMS 0
16 Address.Work Work address for work 16 Address.Work 0
17 Address.Permanent Permanent Permanent Address 17 Address.Permanent 0
18 Address.HallsOfResidence Halls of Residence Halls of Residence 18 Address.HallsOfResidence 0

BigQuery use conditions to create a table from other tables

I am facing a serious blockage on a project of mine.
Here is a summary of what I would like to do:
I have a big hourly file (10 GB) with the following extract (no header):
ID_A|segment_1,segment_2
ID_B|segment_2,segment_3,segment_4,segment_5
ID_C|segment_1
ID_D|segment_2,segment_4
Every ID (from A to D) can be linked to one or multiple segments (from 1 to 5).
I would like to process this file in order to have the following result (the result file contains a header):
ID|segment_1|segment_2|segment_3|segment_4|segment_5
ID_A|1|1|0|0|0
ID_B|0|1|1|1|1
ID_C|1|0|0|0|0
ID_D|0|1|0|1|0
1 means that the ID is included in the segment, 0 means that it is not.
I can clearly do this task using a Python script with multiple loops and conditions; however, I need a fast script that can do the same work.
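For example, a compact pandas version of that Python route could look like the following (purely illustrative; the file name and separator are assumptions, and it does not address the 10 GB scale concern):

import pandas as pd

# Illustrative sketch: one-hot encode the segment list per ID.
# "hourly_extract.txt" is a placeholder file name.
df = pd.read_csv("hourly_extract.txt", sep="|", header=None,
                 names=["ID", "segments"])
onehot = df["segments"].str.get_dummies(sep=",")   # one column per segment, values 1/0
result = pd.concat([df[["ID"]], onehot], axis=1)
result.to_csv("result.csv", sep="|", index=False)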
I would like to use BigQuery to perform this operation.
Is it possible to do such a task in BigQuery?
How can it be done?
Thanks to all for your help.
Regards
Let me assume that the file is loaded into a BQ table with an id column and a segments column (which is a string). Then I would recommend storing the result values as an array, but that is not your question.
You can use the following select to create the table:
select id,
       countif(segment = 'segment_1') as segment_1,
       countif(segment = 'segment_2') as segment_2,
       countif(segment = 'segment_3') as segment_3,
       countif(segment = 'segment_4') as segment_4,
       countif(segment = 'segment_5') as segment_5
from staging s cross join
     unnest(split(segments, ',')) as segment
group by id;
Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
       IF('segment_1' IN UNNEST(list), 1, 0) AS segment_1,
       IF('segment_2' IN UNNEST(list), 1, 0) AS segment_2,
       IF('segment_3' IN UNNEST(list), 1, 0) AS segment_3,
       IF('segment_4' IN UNNEST(list), 1, 0) AS segment_4,
       IF('segment_5' IN UNNEST(list), 1, 0) AS segment_5
FROM `project.dataset.table`,
UNNEST([STRUCT(SPLIT(segments) AS list)])
The above assumes you have your data in a table like in the below CTE:
WITH `project.dataset.table` AS (
    SELECT 'ID_A' id, 'segment_1,segment_2' segments UNION ALL
    SELECT 'ID_B', 'segment_2,segment_3,segment_4,segment_5' UNION ALL
    SELECT 'ID_C', 'segment_1' UNION ALL
    SELECT 'ID_D', 'segment_2,segment_4'
)
If you apply the above query to that data, the result will be:
Row id segment_1 segment_2 segment_3 segment_4 segment_5
1 ID_A 1 1 0 0 0
2 ID_B 0 1 1 1 1
3 ID_C 1 0 0 0 0
4 ID_D 0 1 0 1 0
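If the result then needs to be materialized as a new table from Python, one possible route (a sketch only; the project, dataset, and table names are placeholders, and it assumes the google-cloud-bigquery client library with default credentials) is to run the SELECT with a destination table configured:

from google.cloud import bigquery

# Sketch: run one of the queries above and write its result into a new table.
# "my_project.my_dataset.*" are placeholder names; adjust to your environment.
client = bigquery.Client()
sql = """
SELECT id,
       COUNTIF(segment = 'segment_1') AS segment_1,
       COUNTIF(segment = 'segment_2') AS segment_2,
       COUNTIF(segment = 'segment_3') AS segment_3,
       COUNTIF(segment = 'segment_4') AS segment_4,
       COUNTIF(segment = 'segment_5') AS segment_5
FROM `my_project.my_dataset.staging`,
     UNNEST(SPLIT(segments, ',')) AS segment
GROUP BY id
"""
job_config = bigquery.QueryJobConfig(
    destination="my_project.my_dataset.segments_wide",
    write_disposition="WRITE_TRUNCATE",
)
client.query(sql, job_config=job_config).result()  # blocks until the job finishes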

How to find duplicated values in Tableau

I need to create a new column that indicates whether a customer is new or recurrent.
To do so, I want to check, for each unique value in Phone, whether there is one date or more than one associated with it in the Date column.
Phone Date
0 a 1
1 a 1
2 a 2
3 b 2
4 b 2
5 c 3
6 c 2
7 c 1
New users are those for whom there is only one unique (Phone, Date) pair for that phone. The result that I want looks like:
Phone Date User_type
0 a 1 recurrent
1 a 1 recurrent
2 a 2 recurrent
3 b 2 new
4 b 2 new
5 c 3 recurrent
6 c 2 recurrent
7 c 1 recurrent
I managed to do it in a few lines of code with Python, but my boss insists that I do it in Tableau.
I know I need to use a calculated field but that's it.
If it helps, here is my Python code that does the same:
import numpy as np
import pandas as pd

for item in set(data.Phone):
    if len(set(data[data.Phone == item]['Date'])) == 1:
        data.loc[data.Phone == item, 'type_user'] = 'new'
    elif len(set(data[data.Phone == item]['Date'])) > 1:
        data.loc[data.Phone == item, 'type_user'] = 'recurrent'
    else:
        data.loc[data.Phone == item, 'type_user'] = np.nan
You can use an LOD expression to do that; the one below will give you how many records are duplicated:
{Fixed [Phone],[Date]: SUM([Number of Records])}
If you want a text label instead, use:
IF {Fixed [Phone],[Date]: SUM([Number of Records])} > 1 THEN 'recurrent' ELSE 'new' END
Thanks for your reply! It didn't exactly solve my problem, but it definitely helped me find the solution.
The solution:
I first got, for a given phone, the number of distinct dates:
{Fixed [Phone] : COUNTD([Date])}
Then I created my categorical (dimension) variable:
IF {Fixed [Phone] : COUNTD([Date])} > 1 THEN 'recurrent' ELSE 'new' END
The result was as expected (phone numbers are hidden for data privacy reasons).

How to create columns in pandas df with .apply and user defined function

I'm trying to create several columns in a pandas DataFrame at once, where each column name is a key in a dictionary and the function returns 1 if any of the values corresponding to that key are present.
My DataFrame has 3 columns, jp_ref, jp_title, and jp_description. Essentially, I'm searching the jp_descriptions for relevant words assigned to that key and populating the column assigned to that key with 1s and 0s based on if any of the values are found present in the jp_description.
jp_title = ['software developer', 'operations analyst', 'it project manager']
jp_ref = ['j01', 'j02', 'j03']
jp_description = ['software developer with java and sql experience',
                  'operations analyst with ms in operations research, statistics or related field. sql experience desired.',
                  'it project manager with javascript working knowledge']
myDict = {'jp_title': jp_title, 'jp_ref': jp_ref, 'jp_description': jp_description}
data = pd.DataFrame(myDict)

technologies = {'java': ['java', 'jdbc', 'jms', 'jconsole', 'jprobe', 'jax', 'jax-rs', 'kotlin', 'jdk'],
                'javascript': ['javascript', 'js', 'node', 'node.js', 'mustache.js', 'handlebar.js', 'express',
                               'angular', 'angular.js', 'react.js', 'angularjs', 'jquery', 'backbone.js', 'd3'],
                'sql': ['sql', 'mysql', 'sqlite', 't-sql', 'postgre', 'postgresql', 'db', 'etl']}

def term_search(doc, tech):
    for term in technologies[tech]:
        if term in doc:
            return 1
        else:
            return 0

for tech in technologies:
    data[tech] = data.apply(term_search(data['jp_description'], tech))
I received the following error but don't understand it:
TypeError: ("'int' object is not callable", 'occurred at index jp_ref')
Your logic is wrong: you traverse the list in a loop, but the function returns 0 or 1 after the first iteration, so the jp_description value is never compared with the complete list.
Instead, split the jp_description and check for common elements with the technologies dict: if common elements exist, a term was found, so return 1, else return 0.
def term_search(doc, tech):
    doc = doc.split(" ")
    common_elem = list(set(doc).intersection(technologies[tech]))
    if len(common_elem) > 0:
        return 1
    return 0

for tech in technologies:
    data[tech] = data['jp_description'].apply(lambda x: term_search(x, tech))
jp_title jp_ref jp_description java javascript sql
0 software developer j01 software developer.... 1 0 1
1 operations analyst j02 operations analyst .. 0 0 1
2 it project manager j03 it project manager... 0 1 0

How to create a loop or function for the logic for this list of lists

I have a data set that looks like this:
CustomerID EventID EventType EventTime
6 1 Facebook 42373.31586
6 2 Facebook 42373.316
6 3 Web 42374.32921
6 4 Twitter 42377.14913
6 5 Facebook 42377.40598
6 6 Web 42378.31245
CustomerID: the unique identifier associated with the particular
customer
EventID: a unique identifier about a particular online activity
EventType: the type of online activity associated with this record
(Web, Facebook, or Twitter)
EventTime: the date and time at which this online activity took
place. This value is measured as the number of days since January 1, 1900, with fractions indicating particular times of day. So, for example, an event taking place at the stroke of midnight on January 1, 2016 would have an EventTime of 42370.00, while an event taking place at noon on January 1, 2016 would have an EventTime of 42370.50.
I've managed to import the CSV and creating into a list with the following code:
# Import Libraries & Set working directory
import csv

# STEP 1: READING THE DATA INTO A PYTHON LIST OF LISTS
f = open('test1000.csv', "r")  # Import CSV as file type
a = f.read()                   # Convert file type into string
split_list = a.split("\r")     # Removes \r
split_list[0:5]                # Viewing the list

# Convert from lists to 'list of lists'
final_list = []
for row in split_list:
    split_list = row.split(',')  # Split list by comma delimiter
    final_list.append(split_list)
print(final_list[0:5])

# CREATING INITIAL BLANK LISTS FOR OUTPUTTING DATA
legit = []
fraud = []
What I need to do next is sort each record into the fraud or legit list of lists. A record would be considered fraudulent under the following parameters. As such, that record would go to the fraud list.
Logic to assign a row to the fraud list: The CustomerID performs the same EventType within the last 4 hours.
For example, row 2 (event 2) in the sample data set above would be moved to the fraud list because event 1 happened within the last 4 hours. On the other hand, event 4 would go to the legit list because there are no Twitter records that happened in the last 4 hours.
The data set is in chronological order.
This solution groups by CustomerID and EventType and then checks if the previous event time occurred less than (lt) 4 hours ago (4. / 24).
df['possible_fraud'] = (
    df.groupby(['CustomerID', 'EventType'])
      .EventTime
      .transform(lambda group: group - group.shift())
      .lt(4. / 24))
>>> df
CustomerID EventID EventType EventTime possible_fraud
0 6 1 Facebook 42373.31586 False
1 6 2 Facebook 42373.31600 True
2 6 3 Web 42374.32921 False
3 6 4 Twitter 42377.14913 False
4 6 5 Facebook 42377.40598 False
5 6 6 Web 42378.31245 False
>>> df[df.possible_fraud]
CustomerID EventID EventType EventTime possible_fraud
1 6 2 Facebook 42373.316 True
Of course, a pandas-based solution seems smarter, but here is an example using just a plain dictionary.
PS: Try to handle the input and output yourself.
#!/usr/bin/python2.7
sample = """
6 1 Facebook 42373.31586
6 2 Facebook 42373.316
6 3 Web 42374.32921
6 4 Twitter 42377.14913
5 5 Web 42377.3541
6 6 Facebook 42377.40598
6 7 Web 42378.31245
"""

last = {}  # This dict will contain the most recent event time
           # per client ID and event type, for example:
           # {"6": {"Facebook": 42373.31586, "Web": 42374.32921}}
legit = []
fraud = []

for row in sample.split('\n')[1:-1:]:
    Cid, Eid, Type, Time = row.split()
    if Cid not in last.keys():
        legit.append(row)
        last[Cid] = {Type: Time}
        row += '\tlegit'
    else:
        if Type not in last[Cid].keys():
            legit.append(row)
            last[Cid][Type] = Time
            row += '\tlegit'
        else:
            if float(Time) - float(last[Cid][Type]) > (4. / 24):
                legit.append(row)
                last[Cid][Type] = Time
                row += '\tlegit'
            else:
                fraud.append(row)
                row += '\tfraud'
    print row
