MongoDB complex aggregation pipeline for Age fields - Python

I have a MongoDB collection with documents that may have zero or more of the following fields: DOB (date of birth), YOB (year of birth) and Age. These may contain integers, floats or strings, and they may or may not be mappable to real values. For instance:
{'_id': id1, 'Age': 25},
{'_id': id2, 'Age': 'unknown', 'DOB': 'xxxx-xx-xx'},
{'_id': id3, 'Age': '8', 'DOB': '1988/01/05'},
{'_id': id4, 'YOB': '1995.0'},
{'_id': id5, 'DOB': '5/8/1886'},
{'_id': id6, 'Age': 17, 'YOB': 2003},
{'_id': id7},
...
For each document in the database, I need to extract a single standardized field, Age_Standardized, with the following criteria:
is an integer
is > 12
is < 99
I also need to implement an order of preference in the event that multiple fields have viable values: DOB, then YOB, then Age.
So, for instance, if Age = 17 but DOB = '1900', Age_Standardized = 17 because, even though DOB exists (and is preferred to Age), it produces an Age_Standardized outside of the viable range (13-98).
If YOB = '1998.0' and Age = '19' then Age_Standardized = 23 because, even though Age exists and is viable, YOB is preferred and also viable.
I need to implement all of this over a large collection, and I was hoping to do it within a single PyMongo aggregation pipeline. For the examples above, the output would be:
{'_id': id1, 'Age': 25, 'Age_Standardized': 25},
{'_id': id2, 'Age': 'unknown', 'DOB': 'xxxx-xx-xx'},
{'_id': id3, 'Age': '8', 'DOB': '1988/01/05', 'Age_Standardized': 33},
{'_id': id4, 'YOB': '1995.0', 'Age_Standardized': 26},
{'_id': id5, 'DOB': '5/8/1886'},
{'_id': id6, 'Age': 17, 'YOB': 2003, 'Age_Standardized': 18},
{'_id': id7},
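A minimal sketch of one possible pipeline (not a definitive implementation) is below. It assumes MongoDB 4.2+ (for $set and $$NOW), placeholder database/collection names, and DOB strings that $convert can parse as ISO dates; non-ISO formats such as '1988/01/05' would need an explicit $dateFromString format. Ages are computed as simple year differences, matching the expected output above.
from pymongo import MongoClient

coll = MongoClient()["mydb"]["people"]  # hypothetical database/collection names

CURRENT_YEAR = {"$year": "$$NOW"}

def viable(expr):
    # Keep a candidate age only if 12 < age < 99, otherwise yield null.
    return {"$let": {"vars": {"a": expr},
                     "in": {"$cond": [{"$and": [{"$gt": ["$$a", 12]},
                                                {"$lt": ["$$a", 99]}]},
                                      "$$a", None]}}}

def if_converted(converted, then):
    # Evaluate `then` only when the $convert succeeded, otherwise yield null.
    return {"$cond": [{"$eq": [converted, None]}, None, then]}

dob = {"$convert": {"input": "$DOB", "to": "date", "onError": None, "onNull": None}}
yob = {"$convert": {"input": "$YOB", "to": "double", "onError": None, "onNull": None}}
age = {"$convert": {"input": "$Age", "to": "double", "onError": None, "onNull": None}}

from_dob = viable(if_converted(dob, {"$subtract": [CURRENT_YEAR, {"$year": dob}]}))
from_yob = viable(if_converted(yob, {"$subtract": [CURRENT_YEAR, {"$toInt": {"$trunc": yob}}]}))
from_age = viable(if_converted(age, {"$toInt": {"$trunc": age}}))

pipeline = [
    # Preference order: DOB, then YOB, then Age.
    {"$set": {"Age_Standardized": {"$ifNull": [from_dob, {"$ifNull": [from_yob, from_age]}]}}},
    # Drop the field entirely when no source field was viable.
    {"$set": {"Age_Standardized": {"$cond": [{"$eq": ["$Age_Standardized", None]},
                                             "$$REMOVE", "$Age_Standardized"]}}},
]
results = coll.aggregate(pipeline)  # add a $merge/$out stage to persist the results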

Related

pandas split list like object

Hi I have this column of data named labels:
[{'id': 123456,
  'name': 'John',
  'age': 22,
  'pet': None,
  'gender': 'male',
  'result': [{'id': 'vEo0PIYPEE',
              'type': 'choices',
              'value': {'choices': ['Same Person']},
              'to_name': 'image',
              'from_name': 'person_evaluation'}]}]
[{'id': 123457,
  'name': 'May',
  'age': 21,
  'pet': None,
  'gender': 'female',
  'result': [{'id': 'zTHYuKIOQ',
              'type': 'choices',
              'value': {'choices': ['Different Person']},
              'to_name': 'image',
              'from_name': 'person_evaluation'}]}]
......
Not sure what type this is. I would like to break this down to extract the value [Same Person]; the outcome should be something like this:
0 [Same Person]
1 [Different Person]
....
How should I achieve this?
Based on the limited data that you have provided, would this work?
df['labels_new'] = df['labels'].apply(lambda x: x[0].get('result')[0].get('value').get('choices'))
labels labels_new
0 [{'id': 123456, 'name': 'John', 'age': 22, 'pe... [Same Person]
1 [{'id': 123457, 'name': 'May', 'age': 21, 'pet... [Different Person]
You can use the following as well, but I find dict.get() to be more versatile (it can return default values, for example) and more forgiving when keys are missing.
df['labels'].apply(lambda x: x[0]['result'][0]['value']['choices'])
You could consider using pd.json_normalize (read more here), but for the current state of your column it's going to be a bit more complex to extract the data that way than to simply use a lambda function.
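For completeness, a rough sketch of the json_normalize route, using a trimmed, hypothetical reconstruction of the labels column and assuming exactly one result entry per row:
import pandas as pd

# Hypothetical, trimmed reconstruction of the 'labels' column from the question
df = pd.DataFrame({"labels": [
    [{"id": 123456, "name": "John", "result": [{"value": {"choices": ["Same Person"]}}]}],
    [{"id": 123457, "name": "May", "result": [{"value": {"choices": ["Different Person"]}}]}],
]})

# Unwrap the outer one-element list, then flatten the nested 'result' records
flat = pd.json_normalize(df["labels"].str[0].tolist(), record_path="result")
df["labels_new"] = flat["value.choices"]
# 0        [Same Person]
# 1    [Different Person]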

pandas create a one row dataframe from nested dict

I know "create pandas dataframe from nested dict" has a lot of entries here, but I haven't found an answer that applies to my problem.
I have a dict like this:
{'id': 1,
 'creator_user_id': {'id': 12170254,
                     'name': 'Nicolas',
                     'email': 'some_mail#some_email_provider.com',
                     'has_pic': 0,
                     'pic_hash': None,
                     'active_flag': True,
                     'value': 12170254}....,
and after reading with pandas look like this:
df = pd.DataFrame.from_dict(my_dict,orient='index')
print(df)
id 1
creator_user_id {'id': 12170254, 'name': 'Nicolas', 'email': '...
user_id {'id': 12264469, 'name': 'Daniela Giraldo G', ...
person_id {'active_flag': True, 'name': 'Cristina Cardoz...
org_id {'name': 'Cristina Cardozo', 'people_count': 1...
stage_id 2
title Cristina Cardozo
I would like to create a one-row dataframe where, for example, the nested creator_user_id column results in several columns that I can afterwards name: creator_user_id_id, creator_user_id_name, etc.
thank you for your time!
Given you want one row, just use json_normalize()
pd.json_normalize({'id': 1,
                   'creator_user_id': {'id': 12170254,
                                       'name': 'Nicolas',
                                       'email': 'some_mail#some_email_provider.com',
                                       'has_pic': 0,
                                       'pic_hash': None,
                                       'active_flag': True,
                                       'value': 12170254}})
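If you want the flattened columns named creator_user_id_id, creator_user_id_name and so on directly (as asked above), json_normalize also accepts a sep argument; with my_dict being the dictionary from the question:
pd.json_normalize(my_dict, sep='_')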

Pandas - Extracting values from a Dataframe column

I have a Dataframe in the below format:
cust_id, cust_details
101, [{'self': 'https://website.com/rest/api/2/customFieldOption/1', 'value': 'Type-A', 'id': '1'},
{'self': 'https://website.com/rest/api/2/customFieldOption/2', 'value': 'Type-B', 'id': '2'},
{'self': 'https://website.com/rest/api/2/customFieldOption/3', 'value': 'Type-C', 'id': '3'},
{'self': 'https://website.com/rest/api/2/customFieldOption/4', 'value': 'Type-D', 'id': '4'}]
102, [{'self': 'https://website.com/rest/api/2/customFieldOption/5', 'value': 'Type-X', 'id': '5'},
{'self': 'https://website.com/rest/api/2/customFieldOption/6', 'value': 'Type-Y', 'id': '6'}]
I am trying to extract, for every cust_id, all the cust_details values.
Expected output:
cust_id, new_value
101,Type-A, Type-B, Type-C, Type-D
102,Type-X, Type-Y
Easy answer:
df['new_value'] = df.cust_details.apply(lambda ds: [d['value'] for d in ds])
More complex, potentially better answer:
Rather than storing lists of dictionaries in the first place, I'd recommend making each dictionary a row in the original dataframe.
df = pd.concat([
    df['cust_id'],
    pd.DataFrame(
        df['cust_details'].explode().values.tolist(),
        index=df['cust_details'].explode().index
    )
], axis=1)
If you need to group values by id, you can do so via standard groupby methods:
df.groupby('cust_id')['value'].apply(list)
This may seem more complex, but depending on your use case might save you effort in the long-run.
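If you need the exact comma-separated strings shown in the expected output rather than Python lists, a small variant of the one-liner above joins the values (same assumptions about the cust_details structure):
df['new_value'] = df.cust_details.apply(lambda ds: ', '.join(d['value'] for d in ds))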

Append Dates in Chronological Order

This is the JSON:
[{'can_occur_before': False,
'categories': [{'id': 8, 'name': 'Airdrop'}],
'coins': [{'id': 'cashaa', 'name': 'Cashaa', 'symbol': 'CAS'}],
'created_date': '2018-05-26T03:34:05+01:00',
'date_event': '2018-06-05T00:00:00+01:00',
'title': 'Unsold Token Distribution',
'twitter_account': None,
'vote_count': 125},
{'can_occur_before': False,
'categories': [{'id': 4, 'name': 'Exchange'}],
'coins': [{'id': 'tron', 'name': 'TRON', 'symbol': 'TRX'}],
'created_date': '2018-06-04T03:54:59+01:00',
'date_event': '2018-06-05T00:00:00+01:00',
'title': 'Indodax Listing',
'twitter_account': '#PutraDwiJuliyan',
'vote_count': 75},
{'can_occur_before': False,
'categories': [{'id': 5, 'name': 'Conference'}],
'coins': [{'id': 'modum', 'name': 'Modum', 'symbol': 'MOD'}],
'created_date': '2018-05-26T03:18:03+01:00',
'date_event': '2018-06-05T00:00:00+01:00',
'title': 'SAPPHIRE NOW',
'twitter_account': None,
'vote_count': 27},
{'can_occur_before': False,
'categories': [{'id': 4, 'name': 'Exchange'}],
'coins': [{'id': 'apr-coin', 'name': 'APR Coin', 'symbol': 'APR'}],
'created_date': '2018-05-29T17:45:16+01:00',
'date_event': '2018-06-05T00:00:00+01:00',
'title': 'TopBTC Listing',
'twitter_account': '#cryptoalarm',
'vote_count': 23}]
I want to take all the date_events and append them to a list in chronological order. I currently have this code and am not sure how to order them chronologically.
date = []
for i in getevents:
    date.append(i['date_event'][:10])
Thanks for any help!
A simple way is to compose a list and then apply the sort() method:
import json

data = json.load(open('filename.json', 'r'))
dates = [item['date_event'] for item in data]
dates.sort()
Using your example data with the field 'created_date' (the 'date_event' values are all the same) we'll get:
['2018-05-26T03:18:03+01:00',
'2018-05-26T03:34:05+01:00',
'2018-05-29T17:45:16+01:00',
'2018-06-04T03:54:59+01:00']
First of all, the date_event values in your array of objects are all the same, so there's not much sense in sorting them. Also, your approach will not get you very far: you need to convert the dates to native date/time objects so that you can sort them with a sorting function.
The easiest way to parse properly formatted date/times is dateutil.parser.parse, and sorting an existing list is done with list.sort() - I made a quick example of how to use these tools, and also changed the date_event values to showcase it: https://repl.it/repls/BogusSpecificRate
After you have decoded the JSON string (json.loads) and have a Python list to work with, you can proceed with sorting the list:
from dateutil import parser

# Ascending
events.sort(key=lambda e: parser.parse(e['date_event']))
print([":".join([e['title'], e['date_event']]) for e in events])

# Descending
events.sort(key=lambda e: parser.parse(e['date_event']), reverse=True)
print([":".join([e['title'], e['date_event']]) for e in events])

Proper way to format Dictionary with multiple entries in Python

I am just curious what the best/most efficient way is to structure a dictionary in Python with multiple entries. It could be a dictionary of students, or employees, etc. For the sake of argument, say a 'Name' key/value pair and a few other key/value pairs per entry.
For example, this works great if you have just one student...
student_dict = {'name': 'John', 'age': 15, 'grades':[90, 80, 75]}
Is this the right way to do it? One variable per dictionary entry? Somehow I don't think it is:
student_1 = {'name': 'Fred', 'age': 17, 'grades':[85, 80, 75]}
student_2 = {'name': 'Sean', 'age': 16, 'grades':[65, 70, 100]}
student_3 = ...
I'm sure there is a simple way to structure this where it would be easy to add new entries and search existing entries in one location, I just can't wrap my head around it.
Thanks
Use a dictionary or list to store the dictionaries. Since you seem to want to be able to refer to individual dictionaries by name, I suggest a dictionary of dictionaries:
students = {'student_1': {'name': 'Fred', 'age': 17, 'grades': [85, 80, 75]},
            'student_2': {'name': 'Sean', 'age': 16, 'grades': [65, 70, 100]}}
Now you can refer to individual dictionaries by key:
>>> students['student_1']
{'name': 'Fred', 'age': 17, 'grades': [85, 80, 75]}
If you don't care about names, or you need to preserve the order, use a list:
students = [{'name': 'Fred', 'age': 17, 'grades': [85, 80, 75]},
            {'name': 'Sean', 'age': 16, 'grades': [65, 70, 100]}]
Access them by index:
>>> students[0]
{'name': 'Fred', 'age': 17, 'grades': [85, 80, 75]}
Or iterate over them:
for student in students:
    print(student['name'], student['age'], student['grades'])
You need to choose the key which will give you quick access to a student record. The name seems the most useful:
students = {
    'Fred': {'name': 'Fred', 'age': 17, 'grades': [85, 80, 75]},
    'Sean': {'name': 'Sean', 'age': 16, 'grades': [65, 70, 100]}
}
Then you can get the record for Fred with students['Fred'].
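For completeness, adding a new entry and searching an existing one with this name-keyed layout (the new name here is made up):
students['Alice'] = {'name': 'Alice', 'age': 18, 'grades': [92, 88, 79]}
students['Sean']['grades']   # [65, 70, 100]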
First of all, you should use dictionaries inside of dictionaries. For example:
people = { "John":{"age":15, "school":"asborne high"},
"Alex":{"age":32, "work":"microsoft"},
"Emily":{"age":21, "school":"florida state"} }
Using this method, you can efficiently index any value by its name alone:
print(people["Alex"]["age"])
Second, if you are shooting for readability and ease of use, make sure to properly format your multi-dimensional dictionary objects. What I mean by this is that you should try to stick to at most two data structures for your custom objects. If you need to organize a list of dogs with their colors, names, and ages, you should use a structure similar to this:
dogs = { "Lisa":{"colors":["brown","white"], "age":4 },
"Spike":{"colors":["black","white"], "age":10} }
Notice how I do not switch between tuples and lists, or dictionaries and lists. Consistency is key.
When organizing numeric data, stick to the same concept.
numbers = { "A":[2132.62, 422.67, 938.2218113, 3772.7026994],
"B":[5771.11, 799.26, 417.9011772, 8922.0116259],
"C":[455.778, 592.224, 556.657001. 66.674254323] }
