I have data like this and I want to write it into a DataFrame so that I can convert it directly into a CSV file.
Data = [
    {'event': 'User Clicked', 'properties': {'user_id': '123', 'page_visited': 'contact_us', ...}},
    {'event': 'User Clicked', 'properties': {'user_id': '456', 'page_visited': 'homepage', ...}},
    ...
    {'event': 'User Clicked', 'properties': {'user_id': '789', 'page_visited': 'restaurant', ...}}
]
This is how I am able to access its values:
for item in list_of_dict_responses:
    print(item['event'])
    for key, value in item.items():
        if isinstance(value, dict):
            for k, v in value.items():
                print(k, v)
I want it in a DataFrame where event is a column with the value User Clicked, and properties is another column with sub-columns user_id and page_visited holding the respective values.
Flatten the nested dictionaries, then just use the DataFrame constructor to create the DataFrame.
data = [
{'event': 'User Clicked', 'properties': {'user_id': '123', 'page_visited': 'contact_us'}},
{'event': 'User Clicked', 'properties': {'user_id': '456', 'page_visited': 'homepage'}},
{'event': 'User Clicked', 'properties': {'user_id': '789', 'page_visited': 'restaurant'}}
]
The flattened dictionary may be constructed in several ways. Here's one method using a generator that is generic and will work with nested dictionaries of arbitrary depth (or at least until it hits the maximum recursion depth):
def flatten(kv, prefix=[]):
    for k, v in kv.items():
        if isinstance(v, dict):
            yield from flatten(v, prefix + [str(k)])
        else:
            if prefix:
                yield '_'.join(prefix + [str(k)]), v
            else:
                yield str(k), v
Then, using a generator expression to flatten each record in data, construct the DataFrame:
import pandas as pd

pd.DataFrame({k: v for k, v in flatten(kv)} for kv in data)
#Out
event properties_page_visited properties_user_id
0 User Clicked contact_us 123
1 User Clicked homepage 456
2 User Clicked restaurant 789
You have two options: either use a MultiIndex for columns, or add a prefix for the data in properties. The former, in my opinion, is not appropriate here, since you don't have a "true" hierarchical columnar structure; the second level, for example, would be empty for event (a sketch of this option follows for reference).
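For reference, a minimal sketch of the MultiIndex option (the record-building step is my own illustration, not from the question), which also makes the empty second level for event visible:

import pandas as pd

# Tuple column keys: ('event', '') ends up next to ('properties', <key>).
records = [{('event', ''): d['event'],
            **{('properties', k): v for k, v in d['properties'].items()}}
           for d in Data]
df_mi = pd.DataFrame(records)
df_mi.columns = pd.MultiIndex.from_tuples(df_mi.columns)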
Implementing the second idea, you can restructure your list of dictionaries before feeding to pd.DataFrame. The syntax {**d1, **d2} is used to combine two dictionaries.
data_transformed = [{**{'event': d['event']},
                     **{f'properties_{k}': v for k, v in d['properties'].items()}}
                    for d in Data]
res = pd.DataFrame(data_transformed)
print(res)
event properties_page_visited properties_user_id
0 User Clicked contact_us 123
1 User Clicked homepage 456
2 User Clicked restaurant 789
This also aids writing to and reading from CSV files, where a MultiIndex can be ambiguous.
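Since the end goal is a CSV file, writing res out is then a single call (the filename is just a placeholder):

res.to_csv('events.csv', index=False)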
I have a huge (around 350k elements) list of dictionaries:
lst = [
{'data': 'xxx', 'id': 1456},
{'data': 'yyy', 'id': 24234},
{'data': 'zzz', 'id': 3222},
{'data': 'foo', 'id': 1789},
]
On the other hand, I receive dictionaries (around 550k) one by one with a missing value (not every dict is missing this value) which I need to update from:
example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': None}
To:
example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': 'xxx'}
And I need to take each dict and search within the list for a matching 'id' and update the 'data'. Doing it this way takes ages to process:
if example_dict['data'] is None:
    for row in lst:
        if row['id'] == example_dict['id']:
            example_dict['data'] = row['data']
Is there a way to build structured, chunked data divided into e.g. 10k values and tell the incoming dict in which chunk to search for the 'id'? Or any other way to optimize that? Any help is appreciated, take care.
Use a dict instead of searching linearly through the list.
The first important optimization is to remove that linear search through lst by building a dict indexed on id, pointing to the rows.
For example, this will be a lot faster than your code, if you have enough RAM to hold all the rows in memory:
row_dict = {row['id']: row for row in lst}

if example_dict['data'] is None:
    if example_dict['id'] in row_dict:
        example_dict['data'] = row_dict[example_dict['id']]['data']
This improvement will be relevant for you whether you process the rows by chunks of 10k or all at once, since dictionary lookup time is constant instead of linear in the size of lst.
Make your own chunking process
Next you ask "Is there a way to build a structured chunked data divided...". Yes, absolutely. If the data is too big to fit in memory, write a first-pass function that divides the input based on id into several temporary files. They could be based on the last two digits of the id if order is irrelevant, or on ranges of ids if you prefer. Do that both for the list of rows and for the dictionaries you receive, and then process each list/dict file pair covering the same ids, one pair at a time, with code like the above (see the sketch below).
If you have to preserve the order in which you receive the dictionaries, though, this approach will be more difficult to implement.
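A minimal sketch of that first pass, under my own assumptions (JSON-lines temp files, bucketing on the last two digits of the id; all names here are illustrative):

import json
import os

def chunk_to_files(records, out_dir, prefix, n_buckets=100):
    # Append each record, as one JSON line, to the bucket file for its id.
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    try:
        for rec in records:
            bucket = rec['id'] % n_buckets  # last two digits when n_buckets=100
            if bucket not in handles:
                path = os.path.join(out_dir, f'{prefix}_{bucket:02d}.jsonl')
                handles[bucket] = open(path, 'a')
            handles[bucket].write(json.dumps(rec) + '\n')
    finally:
        for f in handles.values():
            f.close()

Run it once for lst and once for the incoming dicts; then, for each bucket, load the two matching files and apply the row_dict lookup above.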
Some preprocessing of the lst list might help a lot, e.g. transforming that list of dicts into a dictionary where id is the key.
To be precise, transform lst into a structure like this (the keys stay ints, since example_dict['id'] is an int):
lst = {
    1456: 'xxx',
    24234: 'yyy',
    3222: 'zzz',
    ...
}
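Building that mapping takes a single pass (this rebinds the name lst to the new dict, which is what the lookup below assumes):

lst = {row['id']: row['data'] for row in lst}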
Then, when filling in the data attribute of example_dict, just access the id key in lst directly:
if example_dict['data'] is None:
    example_dict['data'] = lst.get(example_dict['id'])
It should reduce the overall time complexity from quadratic, roughly O(n*m), to linear, O(n + m).
Try creating a hash table (in Python, a dict) from lst to speed up the lookup based on 'id':
lst = [
{'data': 'xxx', 'id': 1456},
{'data': 'yyy', 'id': 24234},
{'data': 'zzz', 'id': 3222},
{'data': 'foo', 'id': 1789},
]
example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': None}
dct = {row['id']: row for row in lst}

if example_dict['data'] is None:
    example_dict['data'] = dct[example_dict['id']]['data']

print(example_dict)
Sample output:
{'key': 'x', 'key2': 'y', 'id': 1456, 'data': 'xxx'}
I have a very big list of dictionaries; the items are unordered. I would like to group certain elements under a new key. For example:
input = [
    {'name': 'emp1', 'state': 'TX', 'areacode': '001', 'mobile': 123},
    {'name': 'emp1', 'state': 'TX', 'areacode': '002', 'mobile': 234},
    {'name': 'emp1', 'state': 'TX', 'areacode': '003', 'mobile': 345},
    {'name': 'emp2', 'state': 'TX', 'areacode': None, 'mobile': None},
]
For the above input I would like to group areacode and mobile under a new key contactoptions:
opdata = [
    {'name': 'emp1', 'state': 'TX',
     'contactoptions': [{'areacode': '001', 'mobile': 123},
                        {'areacode': '002', 'mobile': 234},
                        {'areacode': '003', 'mobile': 345}]},
    {'name': 'emp2', 'state': 'TX',
     'contactoptions': [{'areacode': None, 'mobile': None}]}
]
I am doing this now with two long iterations. I want to achieve the same more efficiently, as the number of records is large. I'm open to using existing methods if available in packages like pandas.
Try pandas. Build a DataFrame from the records first, then group:
import pandas as pd

df = pd.DataFrame(input)
result = (
    df.groupby(['name', 'state'])
      .apply(lambda x: x[['areacode', 'mobile']].to_dict(orient='records'))
      .reset_index(name='contactoptions')
).to_dict(orient='records')
With regular dictionaries, you can do it in a single pass/loop using the setdefault method and no sorting:
data = [
    {'name': 'emp1', 'state': 'TX', 'areacode': '001', 'mobile': 123},
    {'name': 'emp1', 'state': 'TX', 'areacode': '002', 'mobile': 234},
    {'name': 'emp1', 'state': 'TX', 'areacode': '003', 'mobile': 345},
    {'name': 'emp2', 'state': 'TX', 'areacode': None, 'mobile': None},
]

merged = dict()
for d in data:
    od = merged.setdefault(d["name"], {k: d[k] for k in ("name", "state")})
    od.setdefault("contactoptions", []).append({k: d[k] for k in ("areacode", "mobile")})

merged = list(merged.values())
output:
print(merged)
# [{'name': 'emp1', 'state': 'TX', 'contactoptions': [{'areacode': '001', 'mobile': 123}, {'areacode': '002', 'mobile': 234}, {'areacode': '003', 'mobile': 345}]}, {'name': 'emp2', 'state': 'TX', 'contactoptions': [{'areacode': None, 'mobile': None}]}]
As you said, you want to group the input items by 'name' and 'state' together.
My suggestion is to make a dictionary whose keys are 'name' plus 'state', such as 'emp1-TX', and whose values are lists of 'areacode' and 'mobile' dicts, such as [{'areacode': '001', 'mobile': 123}]. In this case, the output can be produced in one iteration.
Output:
{'emp1-TX': [{'areacode': '001', 'mobile': 123}, {'areacode': '002', 'mobile': 234}, {'areacode': '003', 'mobile': 345}], 'emp2-TX': [{'areacode': None, 'mobile': None}]}
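A minimal sketch of that one-pass grouping (the key format and variable names are mine):

grouped = {}
for d in input:
    key = f"{d['name']}-{d['state']}"
    grouped.setdefault(key, []).append({'areacode': d['areacode'], 'mobile': d['mobile']})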
My df is:
df = pd.DataFrame({'city': ['POA', 'POA', 'SAN'], 'info' : [10,12,5]}, index = [4314902, 4314902, 4300803])
df.index.rename('ID_city', inplace=True)
output:
city info
ID_city
4314902 POA 10
4314902 POA 12
4300803 SAN 5
I need to save as json oriented by index. The following command works only when each index is unique.
df.to_json('df.json', orient='index')
Is it possible to save this DataFrame so that, when it finds a duplicate index, it creates an array?
My desired output:
{ 4314902: [ {'city': 'POA', 'info': 10}, {'city': 'POA', 'info': 12} ],
  4300803: {'city': 'SAN', 'info': 5} }
I'm not aware of built-in pandas functionality that handles duplicate indexes when exporting with orient='index'.
You could of course build this manually. Merge the columns into one that contains a dict:
cols_as_dict = df.apply(dict, axis=1)
ID_city
4314902 {'city': 'POA', 'info': 10}
4314902 {'city': 'POA', 'info': 12}
4300803 {'city': 'SAN', 'info': 5}
Put rows into lists, grouped by the index:
combined = cols_as_dict.groupby(cols_as_dict.index).apply(list)
ID_city
4300803 [{'city': 'SAN', 'info': 5}]
4314902 [{'city': 'POA', 'info': 10}, {'city': 'POA', ...
Then write the json:
combined.to_json()
'{"4300803":[{"city":"SAN","info":5}],"4314902":[{"city":"POA","info":10},{"city":"POA","info":12}]}'
It creates a list even if there's just a single entry for an index. That should actually make processing easier than mixing the data types (either a list of elements or a single element).
If you are set on the mixed type (either dict or list of several dicts), then do combined.to_dict(), change the lists with single elements back into their first element, and then dump the json.
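A short sketch of that path, building on the combined Series from above (the singleton-collapsing rule is the only addition):

import json

as_dict = combined.to_dict()  # {index: [row_dict, ...]}
mixed = {k: v[0] if len(v) == 1 else v for k, v in as_dict.items()}
with open('df.json', 'w') as f:
    json.dump(mixed, f)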
I am trying to convert CSV to JSON. If I encounter CSV headers with the naming convention "columnName1.0.columnName2.0.columnName3", I need to create a nested JSON object: {columnName1: {columnName2: {columnName3: value}}}.
So far I am able to split the header into a list of sub-column names and create the nested JSON structure, but I am unable to assign a value. Any help?
import csv
import re

data = open(fileName.strip("'"), newline='')
reader = csv.DictReader(data, delimiter=',', quotechar='"')

# Get the header
for line in reader:
    for x, y in line.items():
        columns = re.split(r"\.\d\.", x)
        if len(columns) == 1:
            continue
        else:
            print("COLUMNS %s" % columns)
            testLine = {}
            for subColumnName in reversed(columns):
                testLine = {subColumnName: testLine}
            # Need to assign value y?
            print("LINE%s" % testLine)
Output:
COLUMNS ['experience', 'title']
LINE{'experience': {'title': {}}}
COLUMNS ['experience', 'organization', 'profile_url']
LINE{'experience': {'organization': {'profile_url': {}}}}
COLUMNS ['experience', 'start']
LINE{'experience': {'start': {}}}
COLUMNS ['raw_experience', 'organization', 'profile_url']
LINE{'raw_experience': {'organization': {'profile_url': {}}}}
COLUMNS ['raw_experience', 'end']
LINE{'raw_experience': {'end': {}}}
COLUMNS ['experience', 'organization', 'name']
LINE{'experience': {'organization': {'name': {}}}}
The value you want is currently {}, the initial value of testLine. Seed it with the cell value instead (y in your loop):
testLine = y
for subColumnName in reversed(columns):
    testLine = {subColumnName: testLine}
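If you also want all of a row's columns merged into one nested object, rather than a separate small dict per column, here is a hedged sketch (the insert_nested helper is my own, not from the question):

def insert_nested(target, keys, value):
    # Walk/create nested dicts for all but the last key, then assign the value.
    for key in keys[:-1]:
        target = target.setdefault(key, {})
    target[keys[-1]] = value

row_obj = {}
for x, y in line.items():
    insert_nested(row_obj, re.split(r"\.\d\.", x), y)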
I'm trying to write a World of Warcraft auction house analysis tool.
For each auction i have data that looks like this:
{
'timeLeftHash': 4,
'bid': 3345887,
'timestamp': 1415339912,
'auc': 1438188059,
'quantity': 1,
'id': 309774,
'ownerHash': 751,
'buy': 3717652,
'ownerRealmHash': 1,
'item': 35965
}
I'd like to combine all dicts that have the same value of "item" so I can get minBuy, avgBuy, maxBuy, minQuantity, avgQuantity, maxQuantity and the number of combined auctions for the specific item.
How can I achieve that?
I already tried writing it into a second list of dicts, but then the min and max are missing...
You could try to make a dictionary where the key is the item ID and the value is a list of (price, quantity) tuples.
If you would like to keep all the information, you could also make a dictionary where the key is the item ID and the value is a list of the dictionaries corresponding to that ID, and from there extract the info you want through a generator.
data = [
{'item': 35964, 'buy': 3717650, ...},
{'item': 35965, 'buy': 3717652, ...},
...
]
by_item = {}
for d in data:
    by_item.setdefault(d['item'], []).append(d['buy'])

stats = {k: {'minBuy': min(v), 'maxBuy': max(v)} for k, v in by_item.items()}
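A fuller sketch covering all the aggregates the question asks for, assuming every auction dict carries the 'buy' and 'quantity' keys shown above:

by_item = {}
for d in data:
    by_item.setdefault(d['item'], []).append((d['buy'], d['quantity']))

stats = {}
for item_id, pairs in by_item.items():
    buys = [b for b, _ in pairs]
    quantities = [q for _, q in pairs]
    stats[item_id] = {
        'minBuy': min(buys), 'avgBuy': sum(buys) / len(buys), 'maxBuy': max(buys),
        'minQuantity': min(quantities),
        'avgQuantity': sum(quantities) / len(quantities),
        'maxQuantity': max(quantities),
        'count': len(pairs),  # number of auctions combined for this item
    }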