I am trying to build a dataset of network recordings. For that, I want to count all connections of a specific source in a specific minute and display the result in this form:
Time                 Source       Number of Connections
2018-03-18 21:17:00  192.168.0.2  5
2018-03-18 21:18:00  192.168.0.2  1
2018-03-18 21:17:00  192.168.0.3  2
.....
I am using this code to do the count:
connection_count = {}  # dictionary that stores count of connections per minute
for s in source:
    for x in edited_time:
        if x in connection_count:
            value = connection_count[x]
            value = value + 1
            connection_count[x] = value
        else:
            connection_count[x] = 1
new_count_df #count # date #source
but I don't know how to display it the way I want. In fact, I used this:
for s in source:
    for x in edited_time:
        new_count_df = (_time, source, connection_count)
but it doesn't display it the way I want.
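For the counting and display step, here is a minimal pandas sketch, assuming the raw records are already in a DataFrame named df with Time and Source columns (those names are assumptions, not from the question):

import pandas as pd

# Hypothetical raw data; in the question this would come from the network recording.
df = pd.DataFrame({
    "Time": ["2018-03-18 21:17:10", "2018-03-18 21:17:45", "2018-03-18 21:18:05"],
    "Source": ["192.168.0.2", "192.168.0.2", "192.168.0.2"],
})

# Truncate each timestamp to the minute, then count rows per (minute, source).
df["Time"] = pd.to_datetime(df["Time"]).dt.floor("min")
new_count_df = (
    df.groupby(["Time", "Source"])
      .size()
      .reset_index(name="Number of Connections")
)
print(new_count_df)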
I'm trying to wrap my head around the PyFlink DataStream API.
My use case is the following:
The source is a Kinesis data stream with the following records:
cookie  cluster  dim0  dim1  dim2  time_event
1       1        5     5     5     1min
1       2        1     0     6     30min
2       1        1     2     3     45min
1       1        10    10    15    70min
2       1        5     5     10    120min
I want to create a session window aggregation with a gap of 60 minutes, calculating the mean for each cookie-cluster combination. The window assignment should be based on the cookie, the aggregation based on cookie and cluster.
The result would therefore be like this (each row being forwarded immediately):
cookie  cluster  dim0  dim1  dim2  time_event
1       1        5     5     5     1min
1       2        1     0     6     30min
2       1        1     2     3     45min
1       1        7.5   7.5   10    70min
2       1        5     5     10    120min
Expressed in SQL, for a new record I'd like to perform this aggregation:
INSERT INTO `input` (`cookie`, `cluster`, `dim0`, `dim1`, `dim2`, `time_event`) VALUES
("1", "1", 0, 0, 0, 125)
WITH RECURSIVE by_key AS (
SELECT *,
(time_event - lag(time_event) over (partition by cookie order by time_event)) as "time_passed"
FROM input
WHERE cookie = "1"
),
new_session AS (
SELECT *,
CASE WHEN time_passed > 60 THEN 1 ELSE 0 END as "new_session"
FROM by_key),
by_session AS (
SELECT *, SUM(new_session) OVER(partition by cookie order by time_event) as "session_number"
FROM new_session)
SELECT cookie, cluster, avg(dim0), avg(dim1), avg(dim2), max(time_event)
FROM by_session
WHERE cluster = "1"
GROUP BY session_number
ORDER BY session_number DESC
LIMIT 1
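For intuition, here is a plain-Python sketch of the same session logic (not Flink code; it assumes the events for one cookie are already sorted by time_event, expressed in minutes):

from collections import defaultdict

# Hypothetical events for cookie "1", sorted by time_event (minutes).
events = [
    {"cluster": "1", "dim0": 5, "dim1": 5, "dim2": 5, "time_event": 1},
    {"cluster": "1", "dim0": 10, "dim1": 10, "dim2": 15, "time_event": 70},
    {"cluster": "1", "dim0": 0, "dim1": 0, "dim2": 0, "time_event": 125},
]

SESSION_GAP = 60  # minutes

# Split into sessions: a gap of more than 60 minutes starts a new session.
sessions = []
for event in events:
    if not sessions or event["time_event"] - sessions[-1][-1]["time_event"] > SESSION_GAP:
        sessions.append([])
    sessions[-1].append(event)

# For the latest session, average each dim per cluster.
latest = sessions[-1]
sums, counts = defaultdict(lambda: [0.0, 0.0, 0.0]), defaultdict(int)
for event in latest:
    for d in range(3):
        sums[event["cluster"]][d] += event[f"dim{d}"]
    counts[event["cluster"]] += 1

for cluster, dims in sums.items():
    means = [v / counts[cluster] for v in dims]
    print(cluster, means)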
I tried to accomplish this with the Table API, but I need the results to be updated as soon as a new record is added to a cookie-cluster combination. This is my first project with Flink, and the DataStream API is an entirely different beast, especially since a lot of stuff is not included yet for Python.
My current approach looks like this:
Create a table from the Kinesis data stream (the DataStream API has no Kinesis connector).
Convert it to a data stream to perform the aggregation. From what I've read, watermarks are propagated and the resulting Row objects contain the column names, i.e. I can handle them like a Python dictionary. Please correct me if I'm wrong on this.
Key the data stream by the cookie.
Window with a custom SessionWindowAssigner, borrowing from the Table API. I'm working on a separate post on that.
Process the windows by calculating the mean for each cluster.
table_env = StreamTableEnvironment.create(stream_env, environment_settings=env_settings)
table_env.execute_sql(
    create_table(input_table_name, input_stream, input_region, stream_initpos)
)
ds = table_env.to_append_stream(input_table_name)
ds.key_by(lambda r: r["cookie"]) \
    .window(SessionWindowAssigner(session_gap=60, is_event_time=True)) \
    .trigger(OnElementTrigger()) \
    .process(MeanWindowProcessFunction())
My basic idea for the ProcessWindowFunction would go like this:
class MeanWindowProcessFunction(ProcessWindowFunction[Dict, Dict, str, TimeWindow]):

    def process(self,
                key: str,
                context: ProcessWindowFunction.Context,
                elements: Iterable) -> Iterable[Dict]:
        clusters = {}
        cluster_records = {}
        for element in elements:
            if element["cluster"] not in clusters:
                clusters[element["cluster"]] = {k: v for k, v in element.as_dict().items()}
                cluster_records[element["cluster"]] = 1  # the first element is already in the sums
            else:
                for dim in range(3):
                    clusters[element["cluster"]][f"dim{dim}"] += element[f"dim{dim}"]
                clusters[element["cluster"]]["time_event"] = element["time_event"]
                cluster_records[element["cluster"]] += 1

        for cluster in clusters.keys():
            for dim in range(3):
                clusters[cluster][f"dim{dim}"] /= cluster_records[cluster]

        return clusters.values()

    def clear(self, context: 'ProcessWindowFunction.Context') -> None:
        pass
Is this the right approach for this problem?
Do I need to consider anything else for the ProcessWindowFunction, like actually implementing the clear method?
I'd be very grateful for any help, or any more elaborate examples of windowed analytics applications in pyflink. Thank you!
I am reading links from an Excel sheet ("sheet1.xlsx").
Suppose I have columns (ID, Name, Link) and rows from 1 to 100. For example:
ID Name Link
1 J facebook.com/J
2 L facebook.com/L
.
.
.
.
50 P facebook.com/P
51 Q facebook.com/Q
My code so far:
df1 = pd.read_excel("sheet1.xlsx")

for data in df1.Link:
    print(data)
    result = getResult(driver, data)
How can I make this code sleep for 60 minutes every time it reads 50 links?
Hope that helps you
import time

link_counter = 0
df1 = pd.read_excel("sheet1.xlsx")

for data in df1.Link:
    print(data)
    result = getResult(driver, data)
    if link_counter >= 49:     # check if it's the 50th link
        link_counter = 0       # reset the counter
        time.sleep(3600)       # sleep for 3600 seconds (1 hour)
    else:
        link_counter += 1      # otherwise count one up
But as already said (by Balaji Ambresh), the better solution would be a package like scrapy.
Edit:
This could also work:
df1 = pd.read_excel("sheet1.xlsx")

for i, data in enumerate(df1.Link):
    if i % 50 == 0 and i != 0:  # every 50th link, but not the very first one
        time.sleep(3600)
    print(data)
    result = getResult(driver, data)
You can use the time module for this:
time.sleep(3600) # sleep for 60 minutes.
A better solution would be to use a package like scrapy to let it handle the throttling.
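For reference, if the crawling is moved to scrapy, the throttling lives in the project's settings.py instead of explicit sleeps. A minimal sketch (these are standard Scrapy settings; the values are only illustrative):

# settings.py (illustrative values)
DOWNLOAD_DELAY = 2                  # wait at least 2 seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain

AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60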
I have two dataframes that I want to merge. One contains data on "assessments" done on particular dates for particular clients. The second contains data on different categories of "services" performed for clients on particular dates. See sample code below:
assessments = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'AssessmentDate' : ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
'Program' : ['Case Mgmt','Case Mgmt','Case Mgmt','Coordinated Access','Coordinated Access']})
ClientID AssessmentDate Program
212 2018-01-04 Case Mgmt
212 2018-07-03 Case Mgmt
212 2019-06-10 Case Mgmt
292 2017-08-08 Coordinated Access
292 2017-12-21 Coordinated Access
services = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'ServiceDate' : ['2018-01-02', '2018-04-08', '2018-05-23', '2017-09-08', '2017-12-03'],
'Assistance Navigation' : ['0','1','1','0','1'],
'Basic Needs' : ['1','0','0','1','2']})
ClientID ServiceDate Assistance Navigation Basic Needs
212 2018-01-02 0 1
212 2018-04-08 1 0
212 2018-05-23 1 0
292 2017-09-08 0 1
292 2017-12-03 1 2
I want to know how many services of each service type (Assistance Navigation and Basic Needs) occur between consecutive assessments of the same program. In other words, I want to append two columns to the assessments dataframe named 'Assistance Navigation' and 'Basic Needs' that tell me how many Assistance Navigation services and how many Basic Needs services have occurred since the last assessment of the same program. The resulting dataframe would look like this:
assessmentsFinal = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'AssessmentDate' : ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
'Program' : ['Case Mgmt','Case Mgmt','Case Mgmt','Coordinated Access','Coordinated Access'],
'Assistance Navigation' : ['0','2','0','0','1'],
'Basic Needs' : ['0','0','0','0','3']})
ClientID AssessmentDate Program Assistance Navigation Basic Needs
212 2018-01-04 Case Mgmt 0 0
212 2018-07-03 Case Mgmt 2 0
212 2019-06-10 Case Mgmt 0 0
292 2017-08-08 Coordinated Access 0 0
292 2017-12-21 Coordinated Access 1 3
Of course, the real data has many more service categories than just 'Assistance Navigation' and 'Basic Needs' and the number of services and assessments is huge. My current attempt uses loops (which I know is a Pandas sin) and takes a couple of minutes to run, which may pose problems when our dataset gets even larger. Below is the current code for reference. Basically we loop through the assessments dataframe to get the ClientID and the date range and then we go into the services sheet and tally up the service type occurrences. There's got to be a quick and easy way to do this in Pandas but I'm new to the game. Thanks in advance.
servicesDict = {}
prevClient = -1
prevDate = ""
prevProg = ""
categories = ["ClientID","ServiceDate","Family Needs","Housing Navigation","Housing Assistance","Basic Needs","Professional","Education","Financial Aid","Healthcare","Counseling","Contact","Assistance Navigation","Referral","Misc"]

for index, row in assessmentDF.iterrows():
    curClient = row[0]
    curDate = datetime.strptime(row[1], '%m/%d/%y')
    curProg = row[7]
    curKey = (curClient, curDate)
    if curKey not in servicesDict:
        services = [curClient, curDate, 0,0,0,0,0,0,0,0,0,0,0,0,0]
        servicesDict.update({curKey : services})
    services = servicesDict[curKey]
    # if curClient and curProg equal the previous ones, action required
    if curClient == prevClient and curProg == prevProg:
        boundary = serviceDF[serviceDF['ClientID'] == curClient].index
        for x in boundary:
            curRowSer = serviceDF.iloc[x]
            curDateSer = datetime.strptime(curRowSer[1], '%m/%d/%y')
            if curDateSer >= prevDate and curDateSer < curDate:
                serviceCategory = curRowSer[5]
                i = categories.index(serviceCategory)
                services[i] = services[i] + 1
                servicesDict.update({curKey : services})
    prevClient = curClient
    prevDate = curDate
    prevProg = curProg

servicesCleaned = pd.DataFrame.from_dict(servicesDict, orient='index', columns=categories)
# then merge into assessments on ClientID and AssessmentDate
One way would be like this. You'll probably have to tweak it for your original dataset, and check the edge cases.
assessments['PreviousAssessmentDate'] = assessments.groupby(['ClientID', 'Program']).AssessmentDate.shift(1, fill_value='0000-00-00')
df = assessments.merge(services, on='ClientID', how='left')
df[df.columns[5:]] = df[df.columns[5:]].multiply((df.AssessmentDate > df.ServiceDate) & (df.PreviousAssessmentDate < df.ServiceDate), axis=0)
df = df.groupby(['ClientID', 'AssessmentDate', 'Program']).sum().reset_index()
ClientID AssessmentDate Program Assistance Navigation Basic Needs
0 212 2018-01-04 Case Mgmt 0 1
1 212 2018-07-03 Case Mgmt 2 0
2 212 2019-06-10 Case Mgmt 0 0
3 292 2017-08-08 Coordinated Access 0 0
4 292 2017-12-21 Coordinated Access 1 3
Logic
We shift the AssessmentDate by 1 in order to determine the previous assessment date.
We merge the two dataframes on ClientID.
We zero out all service type columns where the ServiceDate doesn't fall between the PreviousAssessmentDate and the AssessmentDate.
We group by ClientID, Program and AssessmentDate and do a sum().
Assumptions
Service type categories are integers
Your data frame is sorted on AssessmentDate (for the shift)
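Putting the answer together on the sample data from the question, as a runnable sketch (the service counts are created as integers here, per the assumption above):

import pandas as pd

assessments = pd.DataFrame({'ClientID': ['212', '212', '212', '292', '292'],
                            'AssessmentDate': ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
                            'Program': ['Case Mgmt', 'Case Mgmt', 'Case Mgmt', 'Coordinated Access', 'Coordinated Access']})

services = pd.DataFrame({'ClientID': ['212', '212', '212', '292', '292'],
                         'ServiceDate': ['2018-01-02', '2018-04-08', '2018-05-23', '2017-09-08', '2017-12-03'],
                         'Assistance Navigation': [0, 1, 1, 0, 1],
                         'Basic Needs': [1, 0, 0, 1, 2]})

# Previous assessment date per (ClientID, Program); a sentinel for the first assessment.
assessments['PreviousAssessmentDate'] = (
    assessments.groupby(['ClientID', 'Program'])
               .AssessmentDate.shift(1, fill_value='0000-00-00'))

df = assessments.merge(services, on='ClientID', how='left')

# Zero out services that do not fall strictly between the previous and current assessment.
service_cols = list(df.columns[5:])
in_window = (df.AssessmentDate > df.ServiceDate) & (df.PreviousAssessmentDate < df.ServiceDate)
df[service_cols] = df[service_cols].multiply(in_window, axis=0)

result = df.groupby(['ClientID', 'AssessmentDate', 'Program'])[service_cols].sum().reset_index()
print(result)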
I am a beginner in Python and I am writing a script to record timing for data received from an algorithm.
In the script I have an algorithm which accepts a few parameters and returns the id of the data it detected. Below is some of its output:
S.No Data Id Time
1 0 2018-11-16 15:00:00
2 0, 1 2018-11-16 15:00:02
3 0, 1 2018-11-16 15:00:03
4 0, 1, 2 2018-11-16 15:00:05
5 0, 1, 2 2018-11-16 15:00:06
6 0, 2 2018-11-16 15:00:08
From the above output, we can see that on the 1st attempt it detected the data with id 0. On the 2nd attempt it detected the data with id 1, so in total the ids detected are 0 and 1. On the 4th attempt, it detected id 2. This keeps going, as it runs in a while True loop. From the above we can say that the time period for id 0 is 8 sec, for id 1 it is 4 sec, and for id 2 it is 3 sec. I need to calculate these time periods. For this I have written the code below:
data_dict = {}      # To store value of time for each data id
data_dict_sec = {}  # To store value of seconds for each data id

data = amc_data_retention()  # data contains the data id

for dataID in data.items():
    if run_once:
        run_once = False
        data_dict[dataID] = time.time()
        data_dict_sec[dataID] = 0

for dataID in data.items():
    if dataID in data_dict:
        sec = time.time() - data_dict[dataID]
        data_dict_sec[dataID] += sec
        data_dict[dataID] = time.time()
    else:
        print("New data detected")
The first for loop runs once and saves the time value for each dataID in a dict. In the next for loop, that time is subtracted from the current time and the total seconds are saved in data_dict_sec. In the first iteration the total seconds will be 0, but from the next iteration it will start saving the correct seconds. This works fine only if there is one data id. As soon as a second data id appears, no time is recorded for it.
Can anyone please suggest a good way of writing this? The main objective is to keep saving the time period values for each data id. Please help. Thanks.
The only time this adds dataID keys to data_dict is on the first run; it should add each new dataID it sees. I don't see why the first for loop is needed, since it only adds dataID keys on that first run. It may do what you need if you move the dictionary key initialization into the second for loop, where it checks whether the dataID is in data_dict: if it isn't, initialize it there.
Perhaps this would do what you need:
data_dict = {}      # To store value of time for each data id
data_dict_sec = {}  # To store value of seconds for each data id

data = amc_data_retention()  # data contains the data id

for dataID in data.items():
    if dataID in data_dict:
        sec = time.time() - data_dict[dataID]
        data_dict_sec[dataID] += sec
        data_dict[dataID] = time.time()
    else:
        print("New data detected")
        data_dict[dataID] = time.time()
        data_dict_sec[dataID] = 0
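A compact variant of the same idea, wrapped in a helper so each id is initialized the first time it appears (a sketch; it assumes amc_data_retention() yields an iterable of data ids and is called repeatedly from the surrounding while True loop):

import time

last_seen = {}  # data id -> timestamp of the previous sighting
elapsed = {}    # data id -> accumulated seconds

def update_timings(data_ids):
    """Accumulate elapsed time for every data id seen so far."""
    now = time.time()
    for data_id in data_ids:
        if data_id in last_seen:
            elapsed[data_id] = elapsed.get(data_id, 0) + (now - last_seen[data_id])
        else:
            print("New data detected:", data_id)
            elapsed[data_id] = 0
        last_seen[data_id] = now

# Hypothetical usage inside the existing while True loop:
# update_timings(amc_data_retention())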
I have a data set that looks like this:
CustomerID EventID EventType EventTime
6 1 Facebook 42373.31586
6 2 Facebook 42373.316
6 3 Web 42374.32921
6 4 Twitter 42377.14913
6 5 Facebook 42377.40598
6 6 Web 42378.31245
CustomerID: the unique identifier associated with the particular customer
EventID: a unique identifier for a particular online activity
EventType: the type of online activity associated with this record (Web, Facebook, or Twitter)
EventTime: the date and time at which this online activity took place. This value is measured as the number of days since January 1, 1900, with fractions indicating particular times of day. So, for example, an event taking place at the stroke of midnight on January 1, 2016 would have an EventTime of 42370.00, while an event taking place at noon on January 1, 2016 would have an EventTime of 42370.50.
I've managed to import the CSV and turn it into a list of lists with the following code:
# Import libraries & set working directory
import csv

# STEP 1: READING THE DATA INTO A PYTHON LIST OF LISTS
f = open('test1000.csv', "r")   # Open the CSV file
a = f.read()                    # Read the file into a string
split_list = a.split("\r")      # Split on \r line endings
split_list[0:5]                 # View the first rows

# Convert from a list of strings to a list of lists
final_list = []
for row in split_list:
    row_list = row.split(',')   # Split each row on the comma delimiter
    final_list.append(row_list)
print(final_list[0:5])

# CREATING INITIAL BLANK LISTS FOR OUTPUTTING DATA
legit = []
fraud = []
What I need to do next is sort each record into the fraud or legit list of lists. A record would be considered fraudulent under the following parameters. As such, that record would go to the fraud list.
Logic to assign a row to the fraud list: The CustomerID performs the same EventType within the last 4 hours.
For example, row 2 (event 2) in the sample data set above would be moved to the fraud list because event 1 happened within the last 4 hours. On the other hand, event 4 would go to the legit list because there are no Twitter records that happened in the last 4 hours.
The data set is in chronological order.
This solution groups by CustomerID and EventType and then checks if the previous event time occurred less than (lt) 4 hours ago (4. / 24).
df['possible_fraud'] = (
    df.groupby(['CustomerID', 'EventType'])
      .EventTime
      .transform(lambda group: group - group.shift())
      .lt(4. / 24))
>>> df
CustomerID EventID EventType EventTime possible_fraud
0 6 1 Facebook 42373.31586 False
1 6 2 Facebook 42373.31600 True
2 6 3 Web 42374.32921 False
3 6 4 Twitter 42377.14913 False
4 6 5 Facebook 42377.40598 False
5 6 6 Web 42378.31245 False
>>> df[df.possible_fraud]
CustomerID EventID EventType EventTime possible_fraud
1 6 2 Facebook 42373.316 True
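If the two lists from the question are still needed, the flag can then be used to split the frame (a small addition, not part of the original answer):

fraud = df[df.possible_fraud].values.tolist()   # rows flagged as possible fraud
legit = df[~df.possible_fraud].values.tolist()  # all remaining rows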
Of course, a pandas-based solution is smarter, but here is an example using just a plain dictionary.
P.S. Reading the input and writing the output is left to you.
#!/usr/bin/python2.7
sample = """
6 1 Facebook 42373.31586
6 2 Facebook 42373.316
6 3 Web 42374.32921
6 4 Twitter 42377.14913
5 5 Web 42377.3541
6 6 Facebook 42377.40598
6 7 Web 42378.31245
"""

last = {}   # This dict will contain the most recent time value of each
            # event type per client ID, for example:
            # {"6": {"Facebook": 42373.31586, "Web": 42374.32921}}
legit = []
fraud = []

for row in sample.split('\n')[1:-1:]:
    Cid, Eid, Type, Time = row.split()
    if Cid not in last.keys():
        legit.append(row)
        last[Cid] = {Type: Time}
        row += '\tlegit'
    else:
        if Type not in last[Cid].keys():
            legit.append(row)
            last[Cid][Type] = Time
            row += '\tlegit'
        else:
            if float(Time) - float(last[Cid][Type]) > (4. / 24):
                legit.append(row)
                last[Cid][Type] = Time
                row += '\tlegit'
            else:
                fraud.append(row)
                row += '\tfraud'
    print row