create dataframe by Iterating upto nth level of values in nested dictionary

create dataframe by Iterating upto nth level of values in nested dictionary - python

I have a json file downloaded from this link/website human diseased icd-11 classification, this data have a upto 8 level of nesting e.g:
"name":"br08403",
"children":[
{
"name":"01 Certain infectious or parasitic diseases",
"children":[
{
"name":"Gastroenteritis or colitis of infectious origin",
"children":[
{
"name":"Bacterial intestinal infections",
"children":[
{
"name":"1A00 Cholera",
"children":[
{
"name":"H00110 Cholera"
}
I tried with this code:
def flatten_json(nested_json):
"""
Flatten json object with nested keys into a single level.
Args:
nested_json: A nested json object.
Returns:
The flattened json object if successful, None otherwise.
"""
out = {}
def flatten(x, name=''):
if type(x) is dict:
for a in x:
flatten(x[a], name + a + '_')
elif type(x) is list:
i = 0
for a in x:
flatten(a, name + str(i) + '_')
i += 1
else:
out[name[:-1]] = x
flatten(nested_json)
return out
df2 = pd.Series(flatten_json(dictionary)).to_frame()
output i'm getting is:
name br08403
children_0_name 01 Certain infectious or parasitic diseases
children_0_children_0_name Gastroenteritis or colitis of infectious origin
children_0_children_0_children_0_name Bacterial intestinal infections
children_0_children_0_children_0_children_0_name 1A00 Cholera
... ...
children_21_children_17_children_10_name NF0A Certain early complications of trauma, n...
children_21_children_17_children_11_name NF0Y Other specified effects of external causes
children_21_children_17_children_12_name NF0Z Unspecified effects of external causes
children_21_children_18_name NF2Y Other specified injury, poisoning or cer...
children_21_children_19_name NF2Z Unspecified injury, poisoning or certain..
but the desired output is a dataframe with 8 columns which can accommodate the last depth of the nested name key e.g. something like this:
I would really appreciate any help
code tried for extracting the 'name' property by created a dataframe as follows:
with open('br08403.json') as f:
d = json.load(f)
df2 = pd.DataFrame(d)
data = []
for a in range(len(df2)):
# print(df2['children'][a]['name'])
data.append(df2['children'][a]['name'])
for b in range(len(df2['children'][a]['children'])):
# print(df2['children'][a]['children'][b]['name'])
data.append(df2['children'][a]['children'][b]['name'])
if len(df2['children'][a]['children'][b]) < 2:
print(df2['children'][a]['children'][b]['name'])
else:
for c in range(len(df2['children'][a]['children'][b]['children'])):
# print(df2['children'][a]['children'][b]['children'][c]['name'])
data.append(df2['children'][a]['children'][b]['children'][c]['name'])
if len(df2['children'][a]['children'][b]['children'][c]) < 2:
print(df2['children'][a]['children'][b]['children'][c]['name'])
else:
for d in range(len(df2['children'][a]['children'][b]['children'][c]['children'])):
# print(df2['children'][a]['children'][b]['children'][c]['children'][d]['name'])
data.append(df2['children'][a]['children'][b]['children'][c]['children'][d]['name'])
but i'm getting a plain list as follows:
['01 Certain infectious or parasitic diseases',
'Gastroenteritis or colitis of infectious origin',
'Bacterial intestinal infections',
'1A00 Cholera',
'1A01 Intestinal infection due to other Vibrio',
'1A02 Intestinal infections due to Shigella',
'1A03 Intestinal infections due to Escherichia coli',
'1A04 Enterocolitis due to Clostridium difficile',
'1A05 Intestinal infections due to Yersinia enterocolitica',
'1A06 Gastroenteritis due to Campylobacter',
'1A07 Typhoid fever',
'1A08 Paratyphoid Fever',
'1A09 Infections due to other Salmonella',....

A simple pandas only iterative approach.
res = requests.get("https://www.genome.jp/kegg-bin/download_htext?htext=br08403.keg&format=json&filedir=")
js = res.json()
df = pd.json_normalize(js)
for i in range(20):
df = pd.json_normalize(df.explode("children").to_dict(orient="records"))
if "children" in df.columns: df.drop(columns="children", inplace=True)
df = df.rename(columns={"children.name":f"level{i}","children.children":"children"})
if df[f"level{i}"].isna().all() or "children" not in df.columns: break

Related

Why is this For Loop overwriting the contents of my dictionary strangely?

I am trying to convert my pandas DataFrame data into a different medium which is easily represented via JSON. I have chosen to do this by turning it into python dictionaries then converting it into JSON.
The problem I am encountering is that the data I am putting through the process of formatting is coming out in a different order than expected - the values I am expecting are being replaced by the last values in my for loop.
Here is a reproducible example, which is split between 2 files:
import re
import pandas as pd
import json
from help import Model # Note! this is another file help.py
jan = {'Month': ["January", "January", "January", "January"],
'Date': ['1st', '2nd', '28th', '29th'],
'a': ["j1a", "a3x", "d9c", "h9c"],
'b': ["X1", "SG", "DV", "XP"]}
dec = {'Month': ["December", "December", "December", "December"],
'Date': ['1st', '2nd', '28th', '29th'],
'a': ["d1a", "o3x", "j9c", "h9c"],
'b': ["X2", "SG", "DV", "XP"]}
a = pd.DataFrame.from_dict(jan)
b = pd.DataFrame.from_dict(dec)
dfs = [a, b]
df = pd.concat(dfs)
DateNum = []
for values in df['Date']:
DateNum.append(re.search(r'\d+', values).group())
df['Date Num'] = DateNum
df.reset_index(drop=True, inplace=True)
dfl = df.Month.tolist()
months = []
for data in dfl:
if data not in months:
months.append(data)
# months = ['January', 'December']
models = []
for month in months:
models.append(Model(month))
calendar = {}
for month in models:
datacopy = df.copy()
datacopy = datacopy[datacopy.Month == month.name]
month.data = datacopy
month.update(debug=True)
calendar[month.name] = month.days
print(json.dumps(calendar, indent=4))
Here is the other file - help.py contains the classes Model and Day
class Model:
"""
model for months
"""
name = ""
data = None
days = {}
def __init__(self, monthname):
self.name = monthname
def update(self, debug=False):
edit = self.data # a copy of a slice from the df
edit = edit.drop("Month", axis=1) # drop Month column
edit = edit.set_index('Date Num').T.to_dict('list') # set Date Num column to be the index and make dict
data_formatted = {self.name: edit} # save the dict with key as month name as data_formatted
for k, v in data_formatted[self.name].items(): # data_formatted [month] = (day number : data)
if debug:
print(k, v) # e.g. k=1 v=['1st', 'a', 'n']
day_object = Day(v) # make a day object out of the values (formatting in initializer)
self.days[k] = day_object.data_formatted # assign the formatted value e.g. days[1] = (formatted data)
# print(self.days[k]) # shows correct data e.g. {'date': '25th', 'a': 'a', 'b': 'n', 'c': 'x'}
class Day:
date = ""
a = ""
b = ""
data_formatted = {}
def __init__(self, data):
self.date = data[0]
self.a = data[1]
self.b = data[2]
self.format_data()
def format_data(self):
self.data_formatted = {
"date": self.date,
"a": self.a,
"b": self.b,
}
As expected, the data is being processed in the expected order:
1 ['1st', 'j1a', 'X1']
2 ['2nd', 'a3x', 'SG']
28 ['28th', 'd9c', 'DV']
29 ['29th', 'h9c', 'XP']
1 ['1st', 'd1a', 'X2']
2 ['2nd', 'o3x', 'SG']
28 ['28th', 'j9c', 'DV']
29 ['29th', 'h9c', 'XP']
But the output of the json.dumps is different (identical to the last month in months):
{
"January": {
"1": {
"date": "1st",
"a": "d1a", - Should be j1a
"b": "X2" - should be X1
},
"2": {
"date": "2nd",
"a": "o3x", - Should be a3x
"b": "SG"
} ...
Thank you for reading this and I hope you can help me.
Here are some other notes:
The code without the Model class is being run in an interactive python notebook - could this change things?
The code I have provided only shows 2 months. In my case, the data from the last month (which I assume to be the last iteration) is being saved as the data for ALL the months.

The problem is here:
month.data = datacopy
month.update(debug=True)
calendar[month.name] = month.days
That's fine the first time around, but in the next iteration, you change the data and rerun .update for month, but its .days is still the same dictionary. So, you're not just updating the dictionary for the next month, but also for all previous months.
Edit: you asked for some clarification in the comments - that's fine, it's perhaps not immediately obvious.
The problem starts here, in your Model class:
class Model:
...
# this is the only place a new dictionary is created
days = {}
def __init__(self, monthname):
# after __init__, this object will have a reference to the 1 days in your class
...
def update(self, debug=False):
...
for k, v in data_formatted[self.name].items():
...
day_object = Day(v)
# so here, you just update that one dictionary
self.days[k] = day_object.data_formatted
I've removed the code that doesn't contribute to the problem and added some comments to explain. The key problem is that you defined days as an attribute of Model - that means it's a class attribute, to which all instances of the class have access, but there's only one of it.
If you need each instance of Model to have a unique instance of .days, you should just create it in __init__ (and you don't need it on the class body at all):
def __init__(self, monthname):
self.name = monthname
self.days = {}
So, the problem is not really to do with loops, the problem is the difference between a class attribute and an object attribute.

Python/Pandas finding missing items amongst similarly grouped data, nested iteration not efficient

I will start by saying I am VERY new to python/pandas. I have a dataframe with about 1.5 million rows of data and growing, below is an abstraction of the data. I am trying to find hosts in the same release and group, where a host is missing a path that other hosts have in common. My approach was to iterate through the data, it is not very efficient. Appreciate any feedback on different approach, or increasing performance on my current approach. Thank You
Result
release
group
host
missing_path
ReferenceHosts
A
one
abc
c:\one\three
def:ghi
Data
release
group
host
path
A
one
abc
c:\one\two
A
one
def
c:\one\two
A
one
def
c:\one\three
A
one
ghi
c:\one\two
A
one
ghi
c:\one\three
A
two... lots of groups
...
...
B... lots of releases
...
...
...
#get unique list of releases
list_releases = df['Release'].dropna().unique().tolist()
#get unique list groups
list_groups = df['Group'].dropna().unique().tolist()
#build dictionary { group:[hosts], }
lists_hosts = hosts_by_group(list_groups, df)
#detect missing files
audit_missing = find_missing_files(list_releases, lists_hosts, df)
overview = {"Release": [], "Group": [], "SubjectHost": [], "FileMissing": [], "ReferenceHosts": [], "Ref1Domain": [], "Ref2Domain": [], "SubjDomain": [], "Extension": []}
def generate_overview(grp, hst, ref, hi, ref1_domain, ref2_domain, subj_domain, df,release,hosts, checking_host, idx2):
df1 = df[(df.Hostname == hosts[idx2]) & (df.Release == release)]
df2 = df[(df.Hostname == hosts[hi]) & (df.Release == release)]
merge = pd.merge(df1, df2, how="inner", on=["Path"]).dropna()
merge2 = pd.merge(checking_host, merge, how="inner", on=["Path"]).dropna()
files_not_found = merge[~merge["Path"].isin(merge2["Path"])].dropna()
iter = files_not_found['Path'].tolist()
count = files_not_found['Path'].count()
if files_not_found.count().sum() > 0:
for file in iter:
ext = files_not_found.loc[files_not_found['Path'] == file, 'Extension_x'].item()
overview["Release"].append(release)
overview["Group"].append(grp)
overview["SubjectHost"].append(hst)
overview["FileMissing"].append(file)
overview["ReferenceHosts"].append(ref)
overview["Ref1Domain"].append(ref1_domain)
overview["Ref2Domain"].append(ref2_domain)
overview["SubjDomain"].append(subj_domain)
overview["Extension"].append(ext)
def missing_file_process(hosts,df, group, release):
for idx1, host in enumerate(hosts):
checking_host = df[(df.Hostname == host)]
subj_domain = (checking_host.Domain.unique())[0]
for idx2, host2 in enumerate(hosts):
num_hosts = len(hosts)
ref = ''
hosts_index = 0
ref1_domain = ''
ref2_domain = ''
if num_hosts - idx2 < 2:
ref = hosts[idx2] + ":" + hosts[0]
hosts_index = 0
ref1_domain = (df[(df.Hostname == hosts[idx2])].Domain.unique())[0]
ref2_domain = (df[(df.Hostname == hosts[0])].Domain.unique())[0]
if num_hosts - idx2 > 1:
ref = hosts[idx2] + ":" + hosts[idx2+1]
hosts_index = idx2+1
ref1_domain = (df[(df.Hostname == hosts[idx2])].Domain.unique())[0]
ref2_domain = (df[(df.Hostname == hosts[idx2+1])].Domain.unique())[0]
generate_overview( group, host, ref, hosts_index, ref1_domain, ref2_domain, subj_domain, df, release,hosts, checking_host, idx2)
def find_missing_files(releases, lists_hosts, df):
for release in releases:
for idx, (group, hosts) in enumerate(lists_hosts.items()):
if len(hosts) > 2:
missing_file_process(hosts,df, group, release)
return pd.DataFrame(data=overview)

Try this:
# Find all unique paths in a Release Group
tmp1 = df.groupby(['release', 'group'])['path'].apply(set).to_frame('path1').reset_index()
# Find all unique paths per Host in a Release Group
tmp2 = df.groupby(['release', 'group', 'host'])['path'].apply(set).to_frame('path2').reset_index()
# Line them up and find the missing paths in the second set
tmp3 = pd.merge(tmp1, tmp2, how='right', on=['release', 'group'])
tmp3['missing'] = tmp3['path1'] - tmp3['path2']
# Filter for those hosts where some paths are missing
result = tmp3[tmp3['missing'] != set({})]
Output:
release group path1 host path2 missing
A one {c:\one\three, c:\one\two} abc {c:\one\two} {c:\one\three}
You can rearrange the columns to taste 😀

Check string for specific format of substring, how to..?

Two strings. My items name:
Parfume name EDT 50ml
And competitor's items name:
Parfume another name EDP 60ml
And i have a long list of these names in one column, competitors names in other column, and I want to leave only those rows in dataframe, that have same amount of ml in both my and competitors names no matter what everything else in these strings look like. So how do I find a substring ending with 'ml' in a bigger string? I could simply do
"**ml" in competitors_name
to see if they both contain the same amount of ml.
Thank you
UPDATE
'ml' is not always at the end of string. It might look like this
Parfume yet another great name 60ml EDP

Try this:
import re
def same_measurement(my_item, competitor_item, unit="ml"):
matcher = re.compile(r".*?(\d+){}".format(unit))
my_match = matcher.match(my_item)
competitor_match = matcher.match(competitor_item)
return my_match and competitor_match and my_match.group(1) == competitor_match.group(1)
my_item = "Parfume name EDT 50ml"
competitor_item = "Parfume another name EDP 50ml"
assert same_measurement(my_item, competitor_item)
my_item = "Parfume name EDT 50ml"
competitor_item = "Parfume another name EDP 60ml"
assert not same_measurement(my_item, competitor_item)

You could use the python Regex library to select the 'xxml' values for each of your data rows and then do some logic to check if they match.
import re
data_rows = [["Parfume name EDT", "Parfume another name EDP 50ml"]]
for data_pairs in data_rows:
my_ml = None
comp_ml = None
# Check for my ml matches and set value
my_ml_matches = re.search(r'(\d{1,3}[Mm][Ll])', data_pairs[0])
if my_ml_matches != None:
my_ml = my_ml_matches[0]
else:
print("my_ml has no ml")
# Check for comp ml matches and set value
comp_ml_matches = re.search(r'(\d{1,3}[Mm][Ll])', data_pairs[1])
if comp_ml_matches != None:
comp_ml = comp_ml_matches[0]
else:
print("comp_ml has no ml")
# Print outputs
if (my_ml != None) and (comp_ml != None):
if my_ml == comp_ml:
print("my_ml: {0} == comp_ml: {1}".format(my_ml, comp_ml))
else:
print("my_ml: {0} != comp_ml: {1}".format(my_ml, comp_ml))
Where data_rows = each row in the data set
Where data_pairs = {your_item_name, competitor_item_name}

You could use a lambda function to do that.
import pandas as pd
import re
d = {
'Us':
['Parfume one 50ml', 'Parfume two 100ml'],
'Competitor':
['Parfume uno 50ml', 'Parfume dos 200ml']
}
df = pd.DataFrame(data=d)
df['Eq'] = df.apply(lambda x : 'Yes' if re.search(r'(\d+)ml', x['Us']).group(1) == re.search(r'(\d+)ml', x['Competitor']).group(1) else "No", axis = 1)
Result:
Doesn't matter whether 'ml' is in the end of in the middle of the string.

flat nested json inside arrays with Python

I want to convert this:
{} Json
{} 0
[] variants
{} 0
fileName
id
{} mediaType
baseFilePtah
id
name
sortOrder
{} 1
fileName
id
{} mediaType
baseFilePtah
id
name
sortOrder
Into this:
{} Json
{} 0
[] variants
{} 0
fileName
id
mediaType_baseFilePath
mediaType_id
mediaType_name
SortOrder
{} 1
fileName
id
mediaType_baseFilePath
mediaType_id
mediaType_name
SortOrder
Basically each
{}
{}
should be merged together. But not rows numbers.
This is the code I wrote:
def flatten_json(y):
out = {}
def flatten(x, name=''):
if type(x) is dict:
print type(x), name
for a in x:
flatten(x[a], name + a + '_')
elif type(x) is list:
print type(x), name
out[name[:-1]] = x
else:
out[name[:-1]] = x
flatten(y)
return out
def generatejson(response2):
# response 2 is [(first data set), (second data set)] convert it to dictionary {0: (first data set), 1: (second data set)}
sample_object = {i: data for i, data in enumerate(response2)}
# begin to flat (merge sub-jsons)
flat = {k: flatten_json(v) for k, v in sample_object.items()}
return json.dumps(flat, sort_keys=True)
This is the result of the code on my sample data:
As you can see manufacturer was merged but mediaType was not.
The code prints:
<type 'dict'>
<type 'list'> additionalLocaleInfos_
<type 'list'> variants_
<type 'dict'> manufacturer_
My aim was that type list will be further investigated in the recursion. The code suppose to detect that inside the variants list there is also a dict of mediaType but it doesn't.
Data sample for generatejson(response2) - is a list of this structure:
[{"additionalLocaleInfos": [], "approved": false, "approvedBy": null, "approvedOn": null, "catalogId": 4, "code": "611",
"createdOn": "2018-03-24 09:39", "customsCode": null, "deletedOn": null, "id": 1, "invariantName": "Leisure Ali Baba Trousers", "isPermanent": false, "locale": null, "madeIn": null,
"manufacturer": {"createdOn": "2018-02-23 18:20", "deletedOn": null, "id": 1, "invariantName": "Unknown", "updatedOn": "2018-02-23 18:20"},
"onNoStockShowComingSoon": false, "season": "", "updatedOn": "2018-03-24 09:39",
"variants": [{"assets": [{"fileName": "mu/2016/05/16/leisure-ali-baba-trousers-32956-0.jpg", "id": 1,
"mediaType": {"baseFilePath": "Catalog", "id": 7, "name": "Product Main Image"}, "sortOrder": 0}]} ]}]
Full example can be found here (but not mandatory for the question)
http://www.filedropper.com/file_389
How can I make it look inside the list to check if it's made of more objects?
This code works only without arrays. For some reason It doesn't look inside the array to see what objects are in it.

Something like this will flatten a dict structure containing dicts, lists and tuples into a flat dict.
The json_data blob is an excerpt from the data you posted.
import json
import collections
json_data = """
{"additionalLocaleInfos":[],"approved":false,"approvedBy":null,"approvedOn":null,"catalogId":4,"code":"611","createdOn":"2018-03-24 09:39","customsCode":null,"deletedOn":null,"id":1,"invariantName":"Leisure Ali Baba Trousers","isPermanent":false,"locale":null,"madeIn":null,"manufacturer":{"createdOn":"2018-02-23 18:20","deletedOn":null,"id":1,"invariantName":"Unknown","updatedOn":"2018-02-23 18:20"},"onNoStockShowComingSoon":false,"season":"","updatedOn":"2018-03-24 09:39","variants":[{"assets":[{"fileName":"mu/2016/05/16/leisure-ali-baba-trousers-32956-0.jpg","id":1,"mediaType":{"baseFilePath":"Catalog","id":7,"name":"Product Main Image"},"sortOrder":0},{"fileName":"080113/3638.jpg","id":2,"mediaType":{"baseFilePath":"Catalog","id":8,"name":"Product Additional Image"},"sortOrder":0},{"fileName":"mu/2016/05/16/leisure-ali-baba-trousers-32956-1.jpg","id":3,"mediaType":{"baseFilePath":"Catalog","id":8,"name":"Product Additional Image"},"sortOrder":0},{"fileName":"mu/2015/07/21/leisure-ali-baba-trousers-13730-0.jpg","id":4,"mediaType":{"baseFilePath":"Catalog","id":8,"name":"Product Additional Image"},"sortOrder":0},{"fileName":"mu/2016/05/16/leisure-ali-baba-trousers-32956-2.jpg","id":5,"mediaType":{"baseFilePath":"Catalog","id":8,"name":"Product Additional Image"},"sortOrder":0},{"fileName":"mu/2015/07/29/leisure-ali-baba-trousers-13853-0.jpg","id":6,"mediaType":{"baseFilePath":"Catalog","id":8,"name":"Product Additional Image"},"sortOrder":0}],"attributes":[{"attribute":{"code":"COL","cultureNeutralName":"Color","id":1,"useAsFilter":false},"code":"BLACK","groupId":0,"id":3,"invariantValue":"BLACK","locale":null,"sortOrder":0,"valueLocale":null},{"attribute":{"code":"SZ","cultureNeutralName":"Size","id":2,"useAsFilter":false},"code":"ONE SIZE","groupId":0,"id":7,"invariantValue":"ONE SIZE","locale":null,"sortOrder":0,"valueLocale":null},{"attribute":{"code":"WEIGHT","cultureNeutralName":"WEIGHT","id":14,"useAsFilter":false},"code":"0.30","groupId":0,"id":2,"invariantValue":"0.30","locale":null,"sortOrder":0,"valueLocale":null},{"attribute":{"code":"STLPTND","cultureNeutralName":"OsStyleOptionId","id":25,"useAsFilter":false},"code":"2","groupId":0,"id":6,"invariantValue":"2","locale":null,"sortOrder":0,"valueLocale":null},{"attribute":{"code":"STLNMBR","cultureNeutralName":"OsStyleNumber","id":26,"useAsFilter":false},"code":"611-1412","groupId":0,"id":1,"invariantValue":"611-1412","locale":null,"sortOrder":0,"valueLocale":null},{"attribute":{"code":"SZFCTEN","cultureNeutralName":"SizeFacetEn","id":35,"useAsFilter":true},"code":"S","groupId":0,"id":8,"invariantValue":"S","locale":null,"sortOrder":0,"valueLocale":null},{"attribute":{"code":"SZFCTEN","cultureNeutralName":"SizeFacetEn","id":35,"useAsFilter":true},"code":"M","groupId":0,"id":9,"invariantValue":"M","locale":null,"sortOrder":0,"valueLocale":null},{"attribute":{"code":"SZFCTEN","cultureNeutralName":"SizeFacetEn","id":35,"useAsFilter":true},"code":"L","groupId":0,"id":10,"invariantValue":"L","locale":null,"sortOrder":0,"valueLocale":null}],"cost":0,"createdOn":"2018-03-24 09:39","deletedOn":null,"eaN1":"2500002822528","eaN2":null,"eaN3":null,"id":1,"isDefault":false,"locale":null,"sku":"611-1412-28","sortOrder":0,"upC1":null,"upC2":null,"upC3":null,"updatedOn":"2018-03-24 09:39","variantInventories":[{"defectiveQty":0,"id":1,"lastUpdate":"2018-03-24 09:39","orderLevelQty":0,"preorderQty":0,"qtyInStock":0,"reorderQty":0,"reservedQty":100,"transferredQty":0,"variantId":1,"warehouseId":1}],"variantPrices":[{"id":1,"price":5,"priceListId":1,"priceType":{"code":"Base price","id":1,"remarks":null},"validFrom":"2018-03-24 09:39","validUntil":"2068-03-24 09:39","variantId":1}]}]}
""".strip()
data = json.loads(json_data)
def flatten_object(d, out=None, name_path=()):
out = (out or collections.OrderedDict())
iterator = (d.items() if isinstance(d, dict) else enumerate(d))
for index, value in iterator:
i_path = name_path + (index,)
if isinstance(value, (list, dict, tuple)):
flatten_object(value, out, i_path)
else:
out[i_path] = value
return out
for key, value in flatten_object(data).items():
print('_'.join(str(atom) for atom in key), value)
The output here will be something like
approved False
approvedBy None
approvedOn None
[...]
variants_0_cost 0
variants_0_createdOn 2018-03-24 09:39
variants_0_deletedOn None
variants_0_eaN1 2500002822528
variants_0_eaN2 None
variants_0_eaN3 None
variants_0_assets_0_fileName mu/2016/05/16/leisure-ali-baba-trousers-32956-0.jpg
variants_0_assets_0_id 1
variants_0_assets_0_mediaType_baseFilePath Catalog
variants_0_assets_0_mediaType_id 7
variants_0_assets_0_mediaType_name Product Main Image
variants_0_assets_0_sortOrder 0
variants_0_assets_1_fileName 080113/3638.jpg
variants_0_assets_1_id 2
variants_0_assets_1_mediaType_baseFilePath Catalog
variants_0_assets_1_mediaType_id 8
variants_0_assets_1_mediaType_name Product Additional Image
variants_0_assets_1_sortOrder 0
variants_0_assets_2_fileName mu/2016/05/16/leisure-ali-baba-trousers-32956-1.jpg
[...]
variants_0_attributes_0_attribute_code COL
variants_0_attributes_0_attribute_cultureNeutralName Color
variants_0_attributes_0_attribute_id 1
variants_0_attributes_0_attribute_useAsFilter False
variants_0_attributes_0_code BLACK
variants_0_attributes_0_groupId 0
variants_0_attributes_0_id 3
variants_0_attributes_0_invariantValue BLACK
variants_0_attributes_0_locale None
variants_0_attributes_0_sortOrder 0
variants_0_attributes_0_valueLocale None
variants_0_attributes_1_attribute_code SZ
variants_0_attributes_1_attribute_cultureNeutralName Size
variants_0_attributes_1_attribute_id 2
variants_0_attributes_1_attribute_useAsFilter False
variants_0_attributes_1_code ONE SIZE
variants_0_attributes_1_groupId 0
variants_0_attributes_1_id 7
variants_0_attributes_1_invariantValue ONE SIZE
variants_0_attributes_1_locale None
variants_0_attributes_1_sortOrder 0
variants_0_attributes_1_valueLocale None
variants_0_attributes_2_attribute_code WEIGHT
variants_0_attributes_2_attribute_cultureNeutralName WEIGHT
variants_0_attributes_2_attribute_id 14
variants_0_attributes_2_attribute_useAsFilter False
variants_0_attributes_2_code 0.30
variants_0_attributes_2_groupId 0
[...]
but you'll probably only want to run this on a single object within variants, or a list of attributes.
variant = data['variants'][0]
merged_flattened_assets = dict()
for asset in variant['assets']:
merged_flattened_assets.update({
'_'.join(key): value
for (key, value)
in flatten_object(asset).items()
})
for key, value in merged_flattened_assets.items():
print(key, value)
outputs
fileName mu/2015/07/29/leisure-ali-baba-trousers-13853-0.jpg
id 6
mediaType_baseFilePath Catalog
mediaType_id 8
mediaType_name Product Additional Image
sortOrder 0

Dynamodb isn't finding overlap between two date ranges

I am trying to search my database to see if a date range I am about to add overlaps with a date range that already exists in the database.
Using this question: Determine Whether Two Date Ranges Overlap
I came up with firstDay <= :end and lastDay >= :start for my FilterExpression.
def create(self, start=None, days=30):
# Create the start/end times
if start is None:
start = datetime.utcnow()
elif isinstance(start, datetime) is False:
raise ValueError('Start time must either be "None" or a "datetime"')
end = start + timedelta(days=days)
# Format the start and end string "YYYYMMDD"
start = str(start.year) + str('%02d' % start.month) + str('%02d' % start.day)
end = str(end.year) + str('%02d' % end.month) + str('%02d' % end.day)
# Search the database for overlap
days = self.connection.select(
filter='firstDay <= :end and lastDay >= :start',
attributes={
':start': {'N': start},
':end': {'N': end}
}
)
# if we get one or more days then there is overlap
if len(days) > 0:
raise ValueError('There looks to be a time overlap')
# Add the item to the database
self.connection.insert({
"firstDay": {"N": start},
"lastDay": {"N": end}
})
I am then calling the function like this:
seasons = dynamodb.Seasons()
seasons.create(start=datetime.utcnow() + timedelta(days=50))
As requested, the method looks like this:
def select(self, conditions='', filter='', attributes={}, names={}, limit=1, select='ALL_ATTRIBUTES'):
"""
Select one or more items from dynamodb
"""
# Create the condition, it should contain the datatype hash
conditions = self.hashKey + ' = :hash and ' + conditions if len(conditions) > 0 else self.hashKey + ' = :hash'
attributes[':hash'] = {"S": self.hashValue}
limit = max(1, limit)
args = {
'TableName': self.table,
'Select': select,
'ScanIndexForward': True,
'Limit': limit,
'KeyConditionExpression': conditions,
'ExpressionAttributeValues': attributes
}
if len(names) > 0:
args['ExpressionAttributeNames'] = names
if len(filter) > 0:
args['FilterExpression'] = filter
return self.connection.query(**args)['Items']
When I run the above, it keeps inserting the above start and end date into the database because it isn't finding any overlap. Why is this happening?
The table structure looks like this (JavaScript):
{
TableName: 'test-table',
AttributeDefinitions: [{
AttributeName: 'dataType',
AttributeType: 'S'
}, {
AttributeName: 'created',
AttributeType: 'S'
}],
KeySchema: [{
AttributeName: 'dataType',
KeyType: 'HASH'
}, {
AttributeName: 'created',
KeyType: 'RANGE'
}],
ProvisionedThroughput: {
ReadCapacityUnits: 5,
WriteCapacityUnits: 5
},
}

Looks like you are setting LIMIT=1. You are probably using this to say 'just return the first match found'. In fact setting Limit to 1 means you will only evaluate the first item found in the Query (i.e. in the partition value range). You probably need to remove the limit, so that each item in the partition range is evaluated for an overlap.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

create dataframe by Iterating upto nth level of values in nested dictionary - python

Related

Why is this For Loop overwriting the contents of my dictionary strangely?

Python/Pandas finding missing items amongst similarly grouped data, nested iteration not efficient

Check string for specific format of substring, how to..?

flat nested json inside arrays with Python

Dynamodb isn't finding overlap between two date ranges

Categories

Resources