Keep element data when extracting sessions

Similar to the top Wikipedia sessions example, I have the following test data:
EDITS = [
json.dumps({'timestamp': 0, 'username': 'user1', 'action': 'a'}),
json.dumps({'timestamp': 1, 'username': 'user1', 'action': 'b'}),
json.dumps({'timestamp': 20, 'username': 'user1', 'action': 'a'}),
json.dumps({'timestamp': 132, 'username': 'user2', 'action': 'a'}),
json.dumps({'timestamp': 500, 'username': 'user2', 'action': 'b'}),
json.dumps({'timestamp': 3601, 'username': 'user2', 'action': 'b'}),
json.dumps({'timestamp': 3602, 'username': 'user2', 'action': 'a'}),
json.dumps({'timestamp': 8004, 'username': 'user2', 'action': 'a'}),
json.dumps({'timestamp': 9320, 'username': 'user1', 'action': 'b'})
]
I would like to split the dataset into sessions per username and then for each user session count the user actions. So for the previous dataset and one hour max gap (3600 seconds), I want to get the following result:
EXPECTED = [
'user1 : [0.0, 3620.0), a: 2, b: 1',
'user2 : [132.0, 7202.0), a: 2, b: 2',
'user2 : [8004.0, 11604.0), a: 1, b: 0',
'user1 : [9320.0, 12920.0), a: 0, b: 1',
]
Unlike the Wikipedia sessions example, I need to keep the complete element data, not only the key, so that I can use it in my custom combiner function.

You should be able to write a CombineFn that counts the number of actions of each type, using a dictionary of counts as the accumulator. Then, you can just use session windows in a collection keyed by user ID with that combiner.
See the Beam programming guide section on Combine Fns for ideas on how to write one.
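The idea above can be sketched without Beam itself. The class below is a plain-Python stand-in that only mirrors the four methods a Beam CombineFn defines (create_accumulator, add_input, merge_accumulators, extract_output). The names ActionCountFn and count_actions_per_session are hypothetical, and the session grouping is a simplified imitation of session windows (a new session starts when an event falls at or after the end of the open one), not Beam's implementation.

```python
from collections import Counter, defaultdict

GAP = 3600  # one-hour session gap, in seconds


class ActionCountFn:
    """Plain-Python stand-in with the same four methods as a Beam CombineFn."""

    def create_accumulator(self):
        return Counter()

    def add_input(self, acc, element):
        acc[element["action"]] += 1  # the full element is available here
        return acc

    def merge_accumulators(self, accs):
        merged = Counter()
        for acc in accs:
            merged.update(acc)  # Counter.update adds counts together
        return merged

    def extract_output(self, acc):
        return {"a": acc["a"], "b": acc["b"]}


def count_actions_per_session(events, gap=GAP):
    """Group events by user, split on gaps of `gap` or more, count per session."""
    fn = ActionCountFn()
    by_user = defaultdict(list)
    for event in events:
        by_user[event["username"]].append(event)

    rows = []
    for user, user_events in by_user.items():
        user_events.sort(key=lambda e: e["timestamp"])
        start = end = acc = None
        for event in user_events:
            ts = event["timestamp"]
            if acc is None or ts >= end:  # no overlap with the open session
                if acc is not None:
                    rows.append((start, end, user, fn.extract_output(acc)))
                start, acc = ts, fn.create_accumulator()
            end = ts + gap
            acc = fn.add_input(acc, event)
        rows.append((start, end, user, fn.extract_output(acc)))

    rows.sort(key=lambda row: row[0])  # order sessions by start time
    return [
        f"{user} : [{float(start)}, {float(end)}), a: {c['a']}, b: {c['b']}"
        for start, end, user, c in rows
    ]


EVENTS = [
    {"timestamp": 0, "username": "user1", "action": "a"},
    {"timestamp": 1, "username": "user1", "action": "b"},
    {"timestamp": 20, "username": "user1", "action": "a"},
    {"timestamp": 132, "username": "user2", "action": "a"},
    {"timestamp": 500, "username": "user2", "action": "b"},
    {"timestamp": 3601, "username": "user2", "action": "b"},
    {"timestamp": 3602, "username": "user2", "action": "a"},
    {"timestamp": 8004, "username": "user2", "action": "a"},
    {"timestamp": 9320, "username": "user1", "action": "b"},
]
result = count_actions_per_session(EVENTS)
```

In an actual pipeline, the same four methods would live on a subclass of Beam's CombineFn and the grouping would be left to session windows; only the accumulator logic carries over from this sketch.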

Related

appending API request to another dictionary

I'm making an application where I need to get the 'workers' field from an API request and pass those workers to another dict. The dictionary is behaving strangely, and I cannot seem to append through += or .update as I would with a normal list or tuple.
main.py
# worker_detail - contains a list of filtered workers
# workers - contains a list of all workers
information = {}
reported_workers = []
for person in workers:
    if person['id'] in worker_detail:
        reported_workers += person
print(reported_workers)
If I use the above logic, it only prints the field names of the dictionaries, without any of the workers' values:
['id', 'first_name', 'last_name', 'email', 'phone_number', 'hire_date', 'job_id', 'salary', 'commission_pct', 'manager_id', 'department_id', 'id', 'first_name', 'last_name', 'email', 'phone_number', 'hire_date', 'job_id', 'salary', 'commission_pct', 'manager_id', 'department_id']
If I print(person), the output is a dictionary containing all the necessary fields and their details:
{'id': 1, 'first_name': 'Steven', 'last_name': 'King', 'email': 'SKING', 'phone_number': 5151234567, 'hire_date': '2021-06-17', 'job_id': 'AD_PRES', 'salary': 24000, 'commission_pct': 0, 'manager_id': 0, 'department_id': 0}
{'id': 2, 'first_name': 'Neena', 'last_name': 'Kochhar', 'email': 'NKOCHHAR', 'phone_number': 5151234568, 'hire_date': '2020-06-17', 'job_id': 'AD_VP', 'salary': 17000, 'commission_pct': 0, 'manager_id': 100, 'department_id': 90}
{'id': 5, 'first_name': 'Bruce', 'last_name': 'Ernst', 'email': 'BERNST', 'phone_number': 5151234571, 'hire_date': '2016-07-17', 'job_id': 'IT_PROG', 'salary': 6000, 'commission_pct': 0, 'manager_id': 103, 'department_id': 60}
{'id': 9, 'first_name': 'Inbal', 'last_name': 'Amor', 'email': 'IMOR', 'phone_number': 5151234575, 'hire_date': '2013-08-23', 'job_id': 'IT_PROG', 'salary': 5000, 'commission_pct': 0, 'manager_id': 104, 'department_id': 60}

Try using a list comprehension:
reported_workers = [person for person in workers if person in worker_detail]
If you find yourself looping over a list to build a new list, you can often replace the loop with a list comprehension, which also lets you abstract the filter criterion away. If you need a more specific test against worker_detail, you can put it in a function and call that from the list comprehension:
def is_worker(person_id):
    for worker in worker_detail:
        if worker['id'] == person_id:
            return True
    return False

reported_workers = [person for person in workers if is_worker(person['id'])]

Use append, not +=:
information = {}
reported_workers = []
for person in workers:
    if person in worker_detail:
        reported_workers.append(person)
print(reported_workers)
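To see why += produced a list of field names, here is a small sketch. On a list, += behaves like extend, which iterates over its argument, and iterating over a dict yields its keys. The sample workers and the assumption that worker_detail holds ids (as in the question's `person['id'] in worker_detail` test) are illustrative only.

```python
person = {"id": 1, "first_name": "Steven"}

extended = []
extended += person        # same as extended.extend(person): adds the dict's keys
appended = []
appended.append(person)   # adds the dict itself as a single element

# Hypothetical data: worker_detail is assumed to be a list of ids.
workers = [
    {"id": 1, "first_name": "Steven"},
    {"id": 2, "first_name": "Neena"},
    {"id": 5, "first_name": "Bruce"},
]
worker_detail = [1, 5]
reported_workers = [p for p in workers if p["id"] in worker_detail]
```

Here `extended` ends up holding the strings 'id' and 'first_name', while `appended` holds the dictionary itself, which is the behaviour the question was after.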

Can't update a single value in a nested dictionary

After creating a dictionary like {'key': {'key': {'key': 'value'}}}, I ran into issues trying to set a value at the deepest level. After updating one of these values, the corresponding values under other keys were also updated.
Here's my Python code:
times = ["09:00", "09:30", "10:00", "10:30"]
courts = ["1", "2"]
daytime_dict = dict.fromkeys(times)
i = 0
for time in times:
    daytime_dict[times[i]] = dict.fromkeys(["username"])
    i += 1
courts_dict = dict.fromkeys(courts)
k = 0
for court in courts:
    courts_dict[courts[k]] = daytime_dict
    k += 1
day_info = [('name', '09:00', 1), ('name', '09:30', 1)]
for info in day_info:
    info_court = str(info[2])
    time = info[1]
    # Here I am trying to set the value for courts_dict['1']['09:00']["username"] to be 'name',
    # but the value for courts_dict['2']['09:00']["username"] is also set to 'name'
    # What am I doing wrong? How can I only update the value for where the court is '1'?
    courts_dict[info_court][time]["username"] = info[0]
I desire to get this:
{'1': {'09:00': {'username': 'name'},
'09:30': {'username': 'name'},
'10:00': {'username': None},
'10:30': {'username': None}},
'2': {'09:00': {'username': None},
'09:30': {'username': None},
'10:00': {'username': None},
'10:30': {'username': None}}}
But I'm getting this:
{'1': {'09:00': {'username': 'name'},
'09:30': {'username': 'name'},
'10:00': {'username': None},
'10:30': {'username': None}},
'2': {'09:00': {'username': 'name'},
'09:30': {'username': 'name'},
'10:00': {'username': None},
'10:30': {'username': None}}}
(See how courts_dict['2']['09:00']['username'] and courts_dict['2']['09:30']['username'] are both being updated when I only wish to update values from courts_dict['1'])
Logically, I can't understand why both values are updated when I update the courts_dict (how I did in the last line of code), and not just one. Since info_court is "1", I thought only the "username" for that court would be updated.
What did I do wrong?

"Logically, I can't understand why both values are updated when I update the courts_dict"
You are assigning the same dictionary object as the value for every court, so courts_dict['1'] and courts_dict['2'] are two references to one daytime_dict; that is why both values appear to be updated. You may want to rework your code using copy or deepcopy:
https://docs.python.org/3/library/copy.html
Assignment statements in Python do not copy objects, they create bindings between a target and an object. For collections that are mutable or contain mutable items, a copy is sometimes needed so one can change one copy without changing the other.
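As an illustration of the difference, the sketch below builds the courts dictionary twice from the question's times and courts: once sharing a single inner dictionary, and once giving each court its own deep copy. The names template, shared, and copied are hypothetical.

```python
import copy

times = ["09:00", "09:30", "10:00", "10:30"]
courts = ["1", "2"]

# One inner dictionary per time slot.
template = {time: {"username": None} for time in times}

# Every court points at the very same object (the situation in the question).
shared = {court: template for court in courts}

# Every court gets an independent deep copy of the template.
copied = {court: copy.deepcopy(template) for court in courts}

shared["1"]["09:00"]["username"] = "name"
copied["1"]["09:00"]["username"] = "name"

same_object = shared["1"] is shared["2"]            # True: one dict, two names
leaked = shared["2"]["09:00"]["username"]           # 'name': change is visible via court '2'
isolated = copied["2"]["09:00"]["username"]         # None: court '2' is untouched
```

Building each court's dictionary inside the loop (or with a nested comprehension) would avoid the sharing as well; deepcopy is simply the most direct change to the original code.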

How can I implement this recursion in python?

Let's say that I have a list of dictionaries like this:
dict1 = [{
'Name': 'Team1',
'id': '1',
'Members': [
{
'type': 'user',
'id': '11'
},
{
'type': 'user',
'id': '12'
}
]
},
{
'Name': 'Team2',
'id': '2',
'Members': [
{
'type': 'group',
'id': '1'
},
{
'type': 'user',
'id': '21'
}
]
},
{
'Name': 'Team3',
'id': '3',
'Members': [
{
'type': 'group',
'id': '2'
}
]
}]
and I want to get an output that can replace all the groups and nested groups with all distinct users.
In this case the output should look like this:
dict2 = [{
'Name': 'Team1',
'id': '1',
'Members': [
{
'type': 'user',
'id': '11'
},
{
'type': 'user',
'id': '12'
}
]
},
{
'Name': 'Team2',
'id': '2',
'Members': [
{
'type': 'user',
'id': '11'
},
{
'type': 'user',
'id': '12'
},
{
'type': 'user',
'id': '21'
}
]
},
{
'Name': 'Team3',
'id': '3',
'Members': [
{
'type': 'user',
'id': '11'
},
{
'type': 'user',
'id': '12'
},
{
'type': 'user',
'id': '21'
}
]
}]
Now let's assume that I have a large dataset to perform these actions on. (approx 20k individual groups)
What would be the best way to code this? I am attempting recursion, but I am not sure about how to search through the dictionary and lists in this manner such that it doesn't end up using too much memory

I do not think you need recursion. Looping is enough.
I think you can simply evaluate each Members list, fetch the users of any member of group type, and make the result unique. Then you can replace the value of Members with the distinct users.
You might have a dictionary for groups like:
group_dict = {
'1': [
{'type': 'user', 'id': '11'},
{'type': 'user', 'id': '12'}
],
'2': [
{'type': 'user', 'id': '11'},
{'type': 'user', 'id': '12'},
{'type': 'user', 'id': '21'}
],
'3': [
{'type': 'group', 'id': '1'},
{'type': 'group', 'id': '2'},
{'type': 'group', 'id': '3'} # recursive
]
...
}
You can try:
def users_in_group(group_id):
    users = []
    groups_to_fetch = []
    for user_or_group in group_dict[group_id]:
        if user_or_group['type'] == 'group':
            groups_to_fetch.append(user_or_group)
        else:  # 'user' type
            users.append(user_or_group)
    groups_fetched = set()  # so we do not loop forever
    while groups_to_fetch:
        group = groups_to_fetch.pop()
        if group['id'] not in groups_fetched:
            groups_fetched.add(group['id'])
            for user_or_group in group_dict[group['id']]:
                if user_or_group['type'] == 'group':
                    # skip groups that were already expanded
                    if user_or_group['id'] not in groups_fetched:
                        groups_to_fetch.append(user_or_group)
                else:  # 'user' type
                    users.append(user_or_group)
    return users


def distinct_users_in(members):
    distinct_users = []
    user_id_set = set()

    def add(user):
        if user['id'] not in user_id_set:
            distinct_users.append(user)
            user_id_set.add(user['id'])

    for member in members:
        if member['type'] == 'group':
            for user in users_in_group(member['id']):
                add(user)
        else:  # 'user'
            add(member)
    return distinct_users


dict2 = dict1  # or `copy.deepcopy(dict1)` to leave dict1 untouched
for element in dict2:
    element['Members'] = distinct_users_in(element['Members'])
Each Members value is replaced by the list returned by distinct_users_in.
That function takes Members and, for each member of group type, fetches the group's users; a member of user type is itself a user. As users are appended to distinct_users, their ids are tracked in a set to keep them unique.
users_in_group uses two collections, groups_to_fetch and groups_fetched. The former is a stack used to walk nested groups without recursion. The latter records groups that have already been expanded so they are not fetched again; without it, a cyclic group reference would loop forever.
Finally, if your data already fits in memory, this approach should work without exhausting it.
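For the list-of-teams layout used in the question, the group lookup can be built from the team ids directly. The following is a minimal sketch of the same stack-based idea, assuming that a member of type 'group' refers to a team by its 'id'; the function name expand_teams is hypothetical, and users are returned sorted by id rather than in the order shown in the question.

```python
def expand_teams(teams):
    """Return a new list where group members are replaced by distinct users."""
    members_by_id = {team["id"]: team["Members"] for team in teams}

    def user_ids(team_id):
        seen_groups, users = set(), set()
        stack = [team_id]                    # groups still to expand
        while stack:
            group_id = stack.pop()
            if group_id in seen_groups:      # guards against cyclic references
                continue
            seen_groups.add(group_id)
            for member in members_by_id.get(group_id, []):
                if member["type"] == "group":
                    stack.append(member["id"])
                else:
                    users.add(member["id"])
        return sorted(users)

    return [
        {**team, "Members": [{"type": "user", "id": uid} for uid in user_ids(team["id"])]}
        for team in teams
    ]


teams = [
    {"Name": "Team1", "id": "1",
     "Members": [{"type": "user", "id": "11"}, {"type": "user", "id": "12"}]},
    {"Name": "Team2", "id": "2",
     "Members": [{"type": "group", "id": "1"}, {"type": "user", "id": "21"}]},
    {"Name": "Team3", "id": "3",
     "Members": [{"type": "group", "id": "2"}]},
]
expanded = expand_teams(teams)
```

This version re-walks nested groups for every team; with around 20k groups, caching each group's resolved user set would be a natural refinement, though whether it is needed depends on how deeply the groups nest.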

How to flatten nested dict formatted '_source' column of csv, into dataframe

I have a CSV with 500+ rows where one column, "_source", is stored as JSON. I want to extract that into a pandas DataFrame, with each key as its own column. The data is a 1 MB JSON file of online social media data (Facebook, Twitter, web crawls, etc.). There are approximately 528 separate rows of posts/tweets/text, each with many dictionaries nested inside dictionaries. I am attaching a few steps from my Jupyter notebook below to give a more complete picture. I need to turn all key-value pairs of the nested dictionaries into columns of a DataFrame.
Thank you so much, this will be a huge help!
I have tried changing it to a dataframe by doing this
source = pd.DataFrame.from_dict(source, orient='columns')
And it returns something like this... I thought it might unpack the dictionary but it did not.
#source.head()
#_source
#0 {'sub_organization_id': 'default', 'uid': 'aba...
#1 {'sub_organization_id': 'default', 'uid': 'ab0...
#2 {'sub_organization_id': 'default', 'uid': 'ac0...
Below is the shape:
#source.shape (528, 1)
Below is what an actual "_source" row looks like when stretched out. There are many dictionaries and key:value pairs, and each key needs to be its own column. Thanks! The actual links have been altered/scrambled for privacy reasons.
{'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
Before you post, please make sure the code actually works for the data attached. Thanks!
I tried the code below, but it did not work; there was a syntax error that I could not figure out.
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
^
SyntaxError: invalid syntax
Whoever can help me with this will be a saint!

I had to do something like that a while back. Basically, I used a function that completely flattens the JSON to identify the keys that will become columns, then iterated through the flattened keys to reconstruct each row in a "results" dataframe. With the data you provided, it creates a row with 52 columns, and looking through it, every key appears to have its own column. Nested values get compound column names: in the output below, the path before the first list index is dropped, nested dictionary keys are joined with '_', and deeper list indices are replaced by '.'. For example, 'meta': {'rule_matcher': [{'atribs': {'website': ...}}]} ends up in the column atribs_website, and the rule_type inside its results list ends up in results.rule_type.
data_source = {'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
Code:
import re

import pandas as pd


def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out


flat = flatten_json(data_source)

results = pd.DataFrame()
special_cols = []
columns_list = list(flat.keys())
for item in columns_list:
    try:
        row_idx = re.findall(r'_(\d+)_', item)[0]
    except IndexError:
        # no list index in the key: a plain (non-list) field
        special_cols.append(item)
        continue
    column = re.findall(r'_\d+_(.*)', item)[0]
    column = re.sub(r'_\d+_', '.', column)
    row_idx = int(row_idx)
    value = flat[item]
    results.loc[row_idx, column] = value

for item in special_cols:
    results[item] = flat[item]
Output:
print (results.to_string())
atribs_website atribs_source atribs_version atribs_type results.rule_type results.rule_tag results.description results.project_veid results.campaign_id results.value results.organization_id results.sub_organization_id results.appid results.project_id results.rule_id results.node_id results.metadata_campaign_title results.metadata_project_title attribs_website attribs_version attribs_type results.render_status results.path results.image_hash results.url results.load_time sub_organization_id uid project_veid campaign_id organization_id norm_attribs_website norm_attribs_version norm_attribs_type project_id system_timestamp doc_appid doc_response_url doc_url doc_status_code doc_status_msg doc_encoding doc_attrs_uid doc_timestamp doc_crawlid type norm_body norm_domain norm_author norm_url norm_timestamp norm_id
0 github.com/res Explicit 1.1 crawl hashtag Far NaN A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far NaN NaN ray CDE2F42-5B87-C594-C900E578C 1838 NaN AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/render... bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32.0 default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
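A smaller variation on the same flattening idea is sketched below. It assumes that each "_source" cell is a string holding a Python-style literal (single quotes, None), as the printed rows in the question suggest; if the cells are valid JSON instead, json.loads would take the place of ast.literal_eval. The sample cell is made up, and keys are joined with '.' so that list indices stay visible in the column names.

```python
import ast


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into a single dict with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        flat[prefix[:-1]] = value  # drop the trailing '.'
    return flat


# Hypothetical cell contents, shaped like a trimmed-down "_source" value.
cells = [
    "{'uid': 'x1', 'meta': {'render': [{'results': [{'load_time': 32}]}]}, "
    "'doc': {'status_code': 200, 'links': []}}",
]
rows = [flatten(ast.literal_eval(cell)) for cell in cells]
```

Each entry of `rows` is a flat dictionary, so passing the list to `pandas.DataFrame(rows)` gives one row per cell and one column per flattened key. Empty lists such as 'links' produce no column in this sketch.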

Access data into list of dictionaries python

I have a list of dictionaries, with some nested dictionaries inside:
[{'id': '67569006',
'kind': 'analytics#accountSummary',
'name': 'Adopt-a-Hydrant',
'webProperties': [{'id': 'UA-62536006-1',
'internalWebPropertyId': '102299473',
'kind': 'analytics#webPropertySummary',
'level': 'STANDARD',
'name': 'Adopt-a-Hydrant',
'profiles': [{'id': '107292146',
'kind': 'analytics#profileSummary',
'name': 'Adopt a Hydrant view1',
'type': 'WEB'},
{'id': '1372982608',
'kind': 'analytics#profileSummary',
'name': 'Unfiltered view',
'type': 'WEB'}],
'websiteUrl': 'https://example1.com/'}]},
{'id': '44824959',
'kind': 'analytics#accountSummary',
'name': 'Adorn',
'webProperties': [{'id': 'UA-62536006-1',
'internalWebPropertyId': '75233390',
'kind': 'analytics#webPropertySummary',
'level': 'STANDARD',
'name': 'Website 2',
'profiles': [{'id': '77736192',
'kind': 'analytics#profileSummary',
'name': 'All Web Site Data',
'type': 'WEB'}],
'websiteUrl': 'http://www.example2.com'}]},
]
I'm trying to print the site name, URL, and view; if the site has two or more views, all of them should be printed, and this is where it gets tricky.
So far I've tried:
all_properties = [...]  # the list above
for single_property in all_properties:
    single_property_name = single_property['name']
    view_name = single_property['webProperties'][0]['profiles'][0]['name']
    view_id = single_property['webProperties'][0]['profiles'][0]['id']
    print(single_property_name, view_name, view_id)
This almost works, but it prints only the first view (profile name) of each property. Some properties have more than one view, and I need those views printed as well.
The output now is:
Adopt-a-Hydrant Adopt a Hydrant view1 107292146
Website 2 All Web Site Data 77736192
So it's skipping the second view of the first property. I tried nesting a sub for loop but I can't get it to work, the final output should be:
Adopt-a-Hydrant Adopt a Hydrant view1 107292146
Adopt-a-Hydrant Unfiltered View 1372982608
Website 2 All Web Site Data 77736192
Any ideas on how to get that?

You need to iterate through the profiles list for each single_property:
for single_property in all_properties:
    single_property_name = single_property['name']
    for profile in single_property['webProperties'][0]['profiles']:
        view_name = profile['name']
        view_id = profile['id']
        print(single_property_name, view_name, view_id)
It would probably help to read a little in the Python docs about lists and how to iterate through them.

Another option, using list comprehensions:
for single_property in data:
    single_property_name = single_property['name']
    view_name = [i['name'] for i in single_property['webProperties'][0]['profiles']]
    view_id = [i['id'] for i in single_property['webProperties'][0]['profiles']]
    print(single_property_name, view_name, view_id)
The point is that you have to loop inside the nested lists. You could also define classes for these objects if you think that would make your data more manageable.

If you're getting really confused, don't be afraid to introduce variables.
Look at how much more readable this is:
for item in data:
    webProperties = item['webProperties'][0]
    print("Name: " + webProperties["name"])
    print("URL: " + webProperties["websiteUrl"])
    print("PRINTING VIEWS\n")
    print("----------------------------")
    views = webProperties['profiles']
    for view in views:
        print("ID: " + view['id'])
        print("Kind: " + view['kind'])
        print("Name: " + view['name'])
        print("Type: " + view['type'])
        print("----------------------------")
    print("\n\n\n")
Here, data is defined as the information you gave us:
data = [{'id': '67569006',
'kind': 'analytics#accountSummary',
'name': 'Adopt-a-Hydrant',
'webProperties': [{'id': 'UA-62536006-1',
'internalWebPropertyId': '102299473',
'kind': 'analytics#webPropertySummary',
'level': 'STANDARD',
'name': 'Adopt-a-Hydrant',
'profiles': [{'id': '107292146',
'kind': 'analytics#profileSummary',
'name': 'Adopt a Hydrant view1',
'type': 'WEB'},
{'id': '1372982608',
'kind': 'analytics#profileSummary',
'name': 'Unfiltered view',
'type': 'WEB'}],
'websiteUrl': 'https://example1.com/'}]},
{'id': '44824959',
'kind': 'analytics#accountSummary',
'name': 'Adorn',
'webProperties': [{'id': 'UA-62536006-1',
'internalWebPropertyId': '75233390',
'kind': 'analytics#webPropertySummary',
'level': 'STANDARD',
'name': 'Website 2',
'profiles': [{'id': '77736192',
'kind': 'analytics#profileSummary',
'name': 'All Web Site Data',
'type': 'WEB'}],
'websiteUrl': 'http://www.example2.com'}]},
]
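The answers above all index webProperties[0], which is fine for the sample but would skip any additional web properties on an account. A minimal sketch that walks every level is shown below; the function name view_rows and the trimmed sample are hypothetical, and it uses the web property's own name and URL rather than the account name.

```python
def view_rows(accounts):
    """Collect one (site name, url, view name, view id) tuple per view."""
    rows = []
    for account in accounts:
        for web_property in account.get("webProperties", []):
            for profile in web_property.get("profiles", []):
                rows.append((
                    web_property["name"],
                    web_property["websiteUrl"],
                    profile["name"],
                    profile["id"],
                ))
    return rows


# Trimmed-down copy of the structure from the question.
sample = [
    {"name": "Adopt-a-Hydrant",
     "webProperties": [{"name": "Adopt-a-Hydrant",
                        "websiteUrl": "https://example1.com/",
                        "profiles": [{"id": "107292146", "name": "Adopt a Hydrant view1"},
                                     {"id": "1372982608", "name": "Unfiltered view"}]}]},
    {"name": "Adorn",
     "webProperties": [{"name": "Website 2",
                        "websiteUrl": "http://www.example2.com",
                        "profiles": [{"id": "77736192", "name": "All Web Site Data"}]}]},
]
rows = view_rows(sample)
for row in rows:
    print(*row)
```

Returning tuples instead of printing inside the loops keeps the traversal reusable, for example when the rows are written to a file later.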
