I have a huge (around 350k elements) list of dictionaries:
lst = [
{'data': 'xxx', 'id': 1456},
{'data': 'yyy', 'id': 24234},
{'data': 'zzz', 'id': 3222},
{'data': 'foo', 'id': 1789},
]
On the other hand, I receive dictionaries (around 550k of them) one by one, some with a missing value (not every dict is missing it), which I need to fill in from the list:
example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': None}
To:
example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': 'xxx'}
And I need to take each dict and search within the list for the matching 'id' and update the 'data'. Doing it this way takes ages to process:
if example_dict['data'] is None:
    for row in lst:
        if row['id'] == example_dict['id']:
            example_dict['data'] = row['data']
Is there a way to build a structured, chunked dataset divided into e.g. 10k values and tell the incoming dict in which chunk to search for the 'id'? Or any other way to optimize that? Any help is appreciated, take care.
Use a dict instead of searching linearly through the list.
The first important optimization is to remove that linear search through lst by building a dict indexed on id, pointing to the rows.
For example, this will be a lot faster than your code, if you have enough RAM to hold all the rows in memory:
row_dict = {row['id']: row for row in lst}

if example_dict['data'] is None:
    if example_dict['id'] in row_dict:
        example_dict['data'] = row_dict[example_dict['id']]['data']
This improvement will be relevant for you whether you process the rows by chunks of 10k or all at once, since dictionary lookup time is constant instead of linear in the size of lst.
Make your own chunking process
Next you ask "Is there a way to build a structured chunked data divided...". Yes, absolutely. If the data is too big to fit in memory, write a first-pass function that divides the input into several temporary files based on id. They could be keyed on the last two digits of the id if order is irrelevant, or on ranges of ids if you prefer. Do that for both the list of rows and the dictionaries you receive, and then process each pair of list/dict files covering the same ids, one at a time, with code like the above.
If you have to preserve the order in which you receive the dictionaries, though, this approach will be more difficult to implement.
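A minimal sketch of that partitioning idea, assuming ids are integers and using the last two digits (`id % 100`) as the bucket key. Buckets are kept as plain in-memory dicts here for illustration; in practice each bucket would be a temporary file when the data doesn't fit in memory:

```python
from collections import defaultdict

def bucket_of(record):
    # Bucket by the last two digits of the id (illustrative choice)
    return record['id'] % 100

def partition(records):
    # Split records into buckets keyed by bucket_of()
    buckets = defaultdict(list)
    for r in records:
        buckets[bucket_of(r)].append(r)
    return buckets

def fill_missing(rows, incoming):
    # Join bucket-by-bucket: only rows in the same bucket can match
    row_buckets = partition(rows)
    in_buckets = partition(incoming)
    for b, dicts in in_buckets.items():
        # Build the id -> data index for just this bucket
        index = {r['id']: r['data'] for r in row_buckets.get(b, [])}
        for d in dicts:
            if d['data'] is None and d['id'] in index:
                d['data'] = index[d['id']]
    return incoming
```

Each bucket join is the same constant-time dict lookup as above, just over a fraction of the data at a time.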
Some preprocessing of the lst list might help a lot, e.g. transforming that list of dicts into a dictionary where id is the key.
To be precise, transform lst into a structure like this (note the keys stay integers, matching the type of 'id' in the incoming dicts):
lst = {
    1456: 'xxx',
    24234: 'yyy',
    3222: 'zzz',
    ...
}
Then, when filling in the data attribute of example_dict, access the id key in lst directly, as follows:
if example_dict['data'] is None:
    example_dict['data'] = lst.get(example_dict['id'])
It should reduce the time complexity from roughly quadratic, O(n*m) (scanning the whole n-element list for each of the m incoming dicts), to linear, O(n + m), since each dict lookup takes constant time on average.
Try creating a hash table (in Python, a dict) from lst to speed up the lookup based on 'id':
lst = [
{'data': 'xxx', 'id': 1456},
{'data': 'yyy', 'id': 24234},
{'data': 'zzz', 'id': 3222},
{'data': 'foo', 'id': 1789},
]
example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': None}
dct = {row['id']: row for row in lst}

if example_dict['data'] is None:
    example_dict['data'] = dct[example_dict['id']]['data']

print(example_dict)
Sample output:
{'key': 'x', 'key2': 'y', 'id': 1456, 'data': 'xxx'}
Related
I have a list of dictionaries as below and I'd like to create a dictionary to store specific data from the list.
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
For instance, a new dictionary to store just the {id, name, price} of the various items.
I created several lists:
id_list = []
name_list = []
price_list = []
Then I added the data I want to each list:
for n in test_list:
    id_list.append(n['id'])
    name_list.append(n['name'])
    price_list.append(n['price'])
But I can't figure out how to create a dictionary (or a more appropriate structure?) to store the data in the {id, name, price} format I'd like. Appreciate help!
If you don't have too much data, you can use this nested list/dictionary comprehension:
keys = ['id', 'name', 'price']
result = {k: [x[k] for x in test_list] for k in keys}
That'll give you:
{
'id': [1, 2, 3],
'name': ['Apple', 'Blueberry', 'Crayon'],
'price': [100, 200, 300]
}
I think a list of dictionaries is still the right data format, so this:
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
keys = ['id', 'name', 'price']
limited = [{k: v for k, v in d.items() if k in keys} for d in test_list]
print(limited)
Result:
[{'id': 1, 'name': 'Apple', 'price': 100}, {'id': 2, 'name': 'Blueberry', 'price': 200}, {'id': 3, 'name': 'Crayon', 'price': 300}]
This is nice, because you can access its parts like limited[1]['price'].
However, your use case is perfect for pandas, if you don't mind using a third party library:
import pandas as pd
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
df = pd.DataFrame(test_list)
print(df['price'][1])
print(df)
A DataFrame is perfect for this kind of thing, including selecting just the columns you need:
keys = ['id', 'name', 'price']
df_limited = df[keys]
print(df_limited)
The reason I'd prefer either of these to a dictionary of lists is that manipulating the dictionary of lists gets complicated and error-prone, and accessing a single record means indexing into three separate lists. There aren't many advantages to that approach, except perhaps that some operations on plain lists are faster if you access a single attribute very often; but in that case, pandas wins handily.
In the comments you asked "Let's say I had item_names = ['Apple', 'Teddy', 'Crayon'] and I wanted to check if one of those item names was in the df_limited variable or I guess the df_limited['name'] - is there a way to do that, and if it is then print say the price, or manipulate the price?"
There are many ways, of course. I recommend looking into some online pandas tutorials, because it's a very popular library with excellent documentation and teaching materials.
However, just to show how easy it would be in both cases, retrieving the matching objects or just the prices for them:
item_names = ['Apple', 'Teddy', 'Crayon']
items = [d for d in test_list if d['name'] in item_names]
print(items)
item_prices = [d['price'] for d in test_list if d['name'] in item_names]
print(item_prices)
items = df[df['name'].isin(item_names)]
print(items)
item_prices = df[df['name'].isin(item_names)]['price']
print(item_prices)
Results:
[{'id': 1, 'colour': 'Red', 'name': 'Apple', 'edible': True, 'price': 100}, {'id': 3, 'colour': 'Yellow', 'name': 'Crayon', 'edible': False, 'price': 300}]
[100, 300]
id name price
0 1 Apple 100
2 3 Crayon 300
0 100
2 300
In the dataframe example there are a few things to note. It uses .isin(), since plain in won't work with the fancy way dataframes let you select data (df[<some condition on df using df>]), but pandas has fast, easy-to-use alternatives for all the standard operations. More importantly, you can just do the work on the original df: it already has everything you need.
And let's say you wanted to double the prices for these products:
df.loc[df['name'].isin(item_names), 'price'] *= 2
This uses .loc for technical reasons (you can't modify just any view of a dataframe), but that's too much to get into in this answer; you'll learn it as you look into pandas. It's pretty clean and simple though, I'm sure you agree. (You could use .loc for the previous example as well.)
In this trivial example, both run instantly, but you'll find that pandas performs better for very large datasets. Also, try writing the same examples using the method you requested (as provided in the accepted answer) and you'll find that it's not as elegant, unless you start by zipping everything together again:
item_prices = [p for i, n, p in zip(*result.values()) if n in item_names]
Getting out a result that has the same structure as result is much trickier, with more zipping and unpacking involved, or requires you to go over the lists twice.
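To make that concrete, here is a sketch of filtering the dict-of-lists `result` (from the accepted answer) by item name while preserving its shape; `result` and `item_names` are rebuilt here so the snippet stands alone:

```python
result = {
    'id': [1, 2, 3],
    'name': ['Apple', 'Blueberry', 'Crayon'],
    'price': [100, 200, 300],
}
item_names = ['Apple', 'Teddy', 'Crayon']

# Pair up the parallel lists, keep the matching rows, then split them apart.
# row[1] is the 'name' field because 'name' is the second key in result.
rows = [row for row in zip(*result.values()) if row[1] in item_names]
filtered = {k: list(col) for k, col in zip(result.keys(), zip(*rows))}
# filtered == {'id': [1, 3], 'name': ['Apple', 'Crayon'], 'price': [100, 300]}
```

The double round of zip/unzip is exactly the overhead the list-of-dicts and pandas versions avoid.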
I have a bunch of data that I am parsing, and I've managed to get all of it into a database but there is one last part that is tripping me up, getting the date into the db in a "nice" format. If I can figure out how to extract just the date string from the data then it shouldn't be hard to use dateparser, but this part is really getting to me and my brain just doesn't know where to go.
This is the format of the data. There may be other key value pairs.
[("[{'key': 'Date', 'value': 'Wednesday, 3 March 2021'}]",)]
and the only thing I've done that is remotely useful is figuring out that python will allow me to list the key value pairs
[{'key': 'Date', 'value': 'Wednesday, 3 March 2021'}]
using
data = [("[{'key': 'Date', 'value': 'Wednesday, 3 March 2021'}]",)]
for a in data:
for b in a:
print(b)
The format of the data is:
[("[{'key': 'key1', 'value': 'value1'}]", "[{'key': 'key2', 'value': 'value2'}]", "[{'key': 'key3', 'value': 'value3'}]")]
I could bruteforce it, although I know this is straightforward and I just can't figure it out. Any help with finding an elegant and correct solution appreciated.
Here's the code that does what I want, thanks to @Kraigolas:
data = [("[{'key': 'Date', 'value': 'Wednesday, 3 March 2021'}]",)]
for a in data:
    data = [eval(element)[0] for element in a]
    if data[0]['key'] == "Date":
        print(data[0]['value'])
This problem can be solved over several steps.
To start the problem, let:
data = [("[{'key': 'key1', 'value': 'value1'}]", "[{'key': 'key2', 'value': 'value2'}]", "[{'key': 'key3', 'value': 'value3'}]")]
Addressing the Tuple
You have a whole tuple inside your list. Let's just grab the tuple and work with that instead.
data = data[0]
# ("[{'key': 'key1', 'value': 'value1'}]", "[{'key': 'key2', 'value': 'value2'}]", "[{'key': 'key3', 'value': 'value3'}]")
Processing the Strings
Notice that the elements inside of your tuple look like this
element = "[{'key': 'key1', 'value': 'value1'}]"
That would be a dictionary inside of a list if it wasn't a string! Let's use the eval command to get it in that form:
element = eval(element)
# [{'key': 'key1', 'value': 'value1'}]
Getting rid of the outer list
To get rid of the list we grab the 0th element to get back just the dictionary
element = element[0]
# {'key': 'key1', 'value': 'value1'}
Getting the data
Index for "value" to get back the date.
element = element["value"]
# "value1"
Putting this all together into a list comprehension
dates = [eval(element)[0]["value"] for element in data[0]]
# ['value1', 'value2', 'value3']
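One caveat worth adding: eval executes arbitrary code, so if these strings ever come from an untrusted source, the standard-library ast.literal_eval is a safer drop-in that only parses Python literals (lists, dicts, strings, numbers, etc.):

```python
import ast

data = [("[{'key': 'key1', 'value': 'value1'}]",
         "[{'key': 'key2', 'value': 'value2'}]")]

# literal_eval parses the string as a Python literal without executing code
dates = [ast.literal_eval(element)[0]["value"] for element in data[0]]
# dates == ['value1', 'value2']
```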
I have a very big dictionary with keys containing a list of items, these are unordered. I would like to group certain elements in a new key. For example
input = [
    {'name': 'emp1', 'state': 'TX', 'areacode': '001', 'mobile': 123},
    {'name': 'emp1', 'state': 'TX', 'areacode': '002', 'mobile': 234},
    {'name': 'emp1', 'state': 'TX', 'areacode': '003', 'mobile': 345},
    {'name': 'emp2', 'state': 'TX', 'areacode': None, 'mobile': None},
]
For the above input, I would like to group areacode and mobile under a new key, contactoptions:
opdata = [{'name':'emp1','state':'TX','contactoptions':[{'areacode':'001','mobile':123},{'areacode':'002','mobile':234},{'areacode':'003','mobile':345}]},{'name':'emp2','state':'TX','contactoptions':[{'areacode':None,'mobile':None}]}]
I am doing this now with two long iterations. I want to achieve the same more efficiently, as the number of records is large. I'm open to using existing methods from packages like pandas if available.
Try (building the frame from your input list first):
import pandas as pd

df = pd.DataFrame(input)
result = (
    df.groupby(['name', 'state'])
    .apply(lambda x: x[['areacode', 'mobile']].to_dict(orient='records'))
    .reset_index(name='contactoptions')
).to_dict(orient='records')
With regular dictionaries, you can do it in a single pass/loop using the setdefault method and no sorting:
data = [
    {'name': 'emp1', 'state': 'TX', 'areacode': '001', 'mobile': 123},
    {'name': 'emp1', 'state': 'TX', 'areacode': '002', 'mobile': 234},
    {'name': 'emp1', 'state': 'TX', 'areacode': '003', 'mobile': 345},
    {'name': 'emp2', 'state': 'TX', 'areacode': None, 'mobile': None},
]

merged = dict()
for d in data:
    od = merged.setdefault(d["name"], {k: d[k] for k in ("name", "state")})
    od.setdefault("contactoptions", []).append({k: d[k] for k in ("areacode", "mobile")})

merged = list(merged.values())
output:
print(merged)
# [{'name': 'emp1', 'state': 'TX', 'contactoptions': [{'areacode': '001', 'mobile': 123}, {'areacode': '002', 'mobile': 234}, {'areacode': '003', 'mobile': 345}]}, {'name': 'emp2', 'state': 'TX', 'contactoptions': [{'areacode': None, 'mobile': None}]}]
As you've described, you want to group the input items by 'name' and 'state' together.
My suggestion is to build a dictionary whose keys combine 'name' and 'state' (such as 'emp1-TX') and whose values are lists of 'areacode'/'mobile' pairs, such as [{'areacode': '001', 'mobile': 123}]. That way the output can be produced in one iteration.
Output:
{'emp1-TX': [{'areacode': '001', 'mobile': 123}, {'areacode': '002', 'mobile': 234}, {'areacode': '003', 'mobile': 345}], 'emp2-TX': [{'areacode': None, 'mobile': None}]}
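A sketch of that idea (the 'name-state' key format is just illustrative):

```python
data = [
    {'name': 'emp1', 'state': 'TX', 'areacode': '001', 'mobile': 123},
    {'name': 'emp1', 'state': 'TX', 'areacode': '002', 'mobile': 234},
    {'name': 'emp2', 'state': 'TX', 'areacode': None, 'mobile': None},
]

grouped = {}
for d in data:
    key = f"{d['name']}-{d['state']}"  # e.g. 'emp1-TX'
    grouped.setdefault(key, []).append(
        {'areacode': d['areacode'], 'mobile': d['mobile']}
    )
```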
A portion of one column, 'relatedWorkOrder', in my dataframe looks like this:
{'number': 2552, 'labor': {'name': 'IA001', 'code': '70M0901003'}...}
{'number': 2552, 'labor': {'name': 'IA001', 'code': '70M0901003'}...}
{'number': 2552, 'labor': {'name': 'IA001', 'code': '70M0901003'}...}
My desired output is to have columns 'name', 'labor_name', and 'labor_code' with their respective values. I can do this using regex extract and replace:
df['name'] = df['relatedWorkOrder'].str.extract(r'{regex}',expand=False).str.replace('something','')
But I have several dictionaries in this column, so doing it this way is tedious, and I'm wondering if it's possible to do this by accessing the keys and values of the dictionary directly.
Any help with that?
You can join the result from pd.json_normalize:
df.join(pd.json_normalize(df['relatedWorkOrder'], sep='_'))
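For example, with a small made-up frame (this assumes the column values are already dicts, not strings; if they are strings, parse them first, e.g. with ast.literal_eval):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'relatedWorkOrder': [
        {'number': 2552, 'labor': {'name': 'IA001', 'code': '70M0901003'}},
        {'number': 2553, 'labor': {'name': 'IA002', 'code': '70M0901004'}},
    ],
})

# json_normalize flattens the nested dicts into columns: number, labor_name, labor_code
flat = df.join(pd.json_normalize(df['relatedWorkOrder'], sep='_'))
```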
I'm trying to write a World of Warcraft auction house analyzing tool.
For each auction I have data that looks like this:
{
'timeLeftHash': 4,
'bid': 3345887,
'timestamp': 1415339912,
'auc': 1438188059,
'quantity': 1,
'id': 309774,
'ownerHash': 751,
'buy': 3717652,
'ownerRealmHash': 1,
'item': 35965
}
I'd like to combine all dicts that have the same value of "item" so I can get minBuy, avgBuy, maxBuy, minQuantity, avgQuantity, maxQuantity, and the total number of combined auctions for the specific item.
How can I achieve that?
I already tried writing it into a second list of dicts, but then the min and max are missing...
You could try making a dictionary where the key is the item ID and the value is a list of tuples of price and quantity.
If you would like to keep all the information, you could also make a dictionary where the key is the item ID and the value is a list of dictionaries corresponding to that ID and from there extract the info that you want through a generator.
data = [
    {'item': 35964, 'buy': 3717650, ...},
    {'item': 35965, 'buy': 3717652, ...},
    ...
]

by_item = {}
for d in data:
    by_item.setdefault(d['item'], []).append(d['buy'])

stats = {k: {'minBuy': min(v), 'maxBuy': max(v)} for k, v in by_item.items()}
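To get all of the statistics the question asks for (min/avg/max of both buy and quantity, plus the number of auctions per item), you can extend the same grouping idea by keeping the whole dicts per item. A sketch, with made-up sample values:

```python
# Group full auction dicts by item id, then compute the aggregates per item.
auctions = [
    {'item': 35965, 'buy': 3717652, 'quantity': 1},
    {'item': 35965, 'buy': 3345887, 'quantity': 2},
    {'item': 35964, 'buy': 1000000, 'quantity': 5},
]

by_item = {}
for a in auctions:
    by_item.setdefault(a['item'], []).append(a)

stats = {
    item: {
        'minBuy': min(a['buy'] for a in group),
        'avgBuy': sum(a['buy'] for a in group) / len(group),
        'maxBuy': max(a['buy'] for a in group),
        'minQuantity': min(a['quantity'] for a in group),
        'avgQuantity': sum(a['quantity'] for a in group) / len(group),
        'maxQuantity': max(a['quantity'] for a in group),
        'count': len(group),  # number of combined auctions for this item
    }
    for item, group in by_item.items()
}
```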