Translating characters in a string to multiple characters using Python

I have a list of strings in which each number ends with an SI prefix character representing its multiplying factor. So if I have data like:
data = ['101n', '100m', '100.100f']
I want to use the dictionary
prefix_dict = {'y': 'e-24', 'z': 'e-21', 'a': 'e-18', 'f': 'e-15', 'p': 'e-12',
'n': 'e-9', 'u': 'e-6', 'm': 'e-3', 'c': 'e-2', 'd': 'e-1',
'da': 'e1', 'h': 'e2', 'k': 'e3', 'M': 'e6', 'G': 'e9',
'T': 'e12', 'P': 'e15', 'E': 'e18', 'Z': 'e21', 'Y': 'e24'}
to insert the corresponding strings in place of the prefixes. The other questions similar to mine all translate one character into another single character. Is there a way to use the translate function to translate one character into multiple characters, or should I be approaching this differently?

You can use a regex for this; it works for 'da' as well:
>>> data = ['101n', '100m', '100.100f', '1d', '1da']
>>> import re
>>> r = re.compile(r'([a-zA-Z]+)$')
>>> for d in data:
...     print(r.sub(lambda m: prefix_dict.get(m.group(1), m.group(1)), d))
...
101e-9
100e-3
100.100e-15
1e-1
1e1
And a non-regex version using itertools.takewhile:
>>> from itertools import takewhile
>>> def find_suffix(s):
...     return ''.join(takewhile(str.isalpha, s[::-1]))[::-1]
...
>>> for d in data:
...     sfx = find_suffix(d)
...     print(d.replace(sfx, prefix_dict.get(sfx, sfx)))
...
101e-9
100e-3
100.100e-15
1e-1
1e1
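Once the suffix has been replaced, the strings are valid float literals. As a minimal sketch (the helper name expand_prefix is made up for illustration, and the dictionary is trimmed to a few entries), the regex approach can be wrapped in a function and fed to float:

```python
import re

# Trimmed copy of the question's dictionary, just for this sketch.
prefix_dict = {'n': 'e-9', 'm': 'e-3', 'd': 'e-1', 'da': 'e1', 'k': 'e3'}

# Match the run of letters at the very end of the string.
_suffix = re.compile(r'([a-zA-Z]+)$')

def expand_prefix(s):
    """Replace a trailing SI prefix with its exponent; unknown suffixes stay as-is."""
    return _suffix.sub(lambda m: prefix_dict.get(m.group(1), m.group(1)), s)

expanded = [expand_prefix(s) for s in ['101n', '100m', '1da']]
values = [float(s) for s in expanded]
print(expanded)
print(values)
```

An unknown suffix is left untouched, so float would raise ValueError for it; whether that is the desired behaviour depends on the data.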

Try:
for i, entry in enumerate(data):
    # Try longer keys first so that 'da' is always matched before 'd' or 'a'.
    for key, value in sorted(prefix_dict.items(),
                             key=lambda x: len(x[0]), reverse=True):
        if key in entry:
            data[i] = entry.replace(key, value)
            break  # stop at the first (longest) match so shorter keys cannot overwrite it
print(data)
This works for arbitrary combinations in the dictionary and the data. If the dictionary keys are always exactly one character long, you have lots of other solutions posted here.

import re
data = ['101da', '100m', '100.100f']
prefix_dict = {'y': 'e-24', 'z': 'e-21', 'a': 'e-18', 'f': 'e-15', 'p': 'e-12',
'n': 'e-9', 'u': 'e-6', 'm': 'e-3', 'c': 'e-2', 'd': 'e-1',
'da': 'e1', 'h': 'e2', 'k': 'e3', 'M': 'e6', 'G': 'e9',
'T': 'e12', 'P': 'e15', 'E': 'e18', 'Z': 'e21', 'Y': 'e24'}
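comp = re.compile(r"[^A-Za-z]")  # matches everything that is not a letter
for ind, d in enumerate(data):
    pre = re.sub(comp, "", d)  # strip the non-letters, leaving only the prefix
    data[ind] = d.replace(pre, prefix_dict.get(pre, pre))  # unknown prefixes are left unchanged
print(data)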
['101e1', '100e-3', '100.100e-15']
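You can use pre = ''.join(x for x in d if x.isalpha()) instead of using re. Joining all the letters, rather than taking only the first one, keeps two-letter prefixes such as 'da' intact.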
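On the original question about translate: in Python 3, str.translate does accept a table that maps a single character to a string of any length, built with str.maketrans. The keys must be single characters, though, so 'da' needs separate handling. A minimal sketch, assuming Python 3 and a trimmed dictionary (the helper name expand is made up here):

```python
# Trimmed copy of the question's dictionary, just for this sketch.
prefix_dict = {'f': 'e-15', 'n': 'e-9', 'm': 'e-3', 'd': 'e-1', 'da': 'e1'}

# str.maketrans only accepts one-character keys, so filter out 'da'.
single_char = {k: v for k, v in prefix_dict.items() if len(k) == 1}
table = str.maketrans(single_char)

def expand(s):
    # Handle the two-letter prefix first, then fall back to translate.
    if s.endswith('da'):
        return s[:-2] + prefix_dict['da']
    return s.translate(table)

data = ['101n', '100m', '100.100f', '1d', '1da']
result = [expand(s) for s in data]
print(result)
```

Note that translate replaces every occurrence of a mapped character, not just a trailing one, so with the full dictionary an input such as '1E3' would be mangled by the 'E' entry. The regex answers avoid that by anchoring on the end of the string.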

Related

Dict comprehension to group words by first letter

Does anyone know how to avoid the error <generator object dictionary.<locals>.<genexpr> at 0x000001D295344580> that I get while trying to create a dict comprehension that generates specific keys: values?
For example, if we have a list:
words = ["hallo" , "hell", "hype", "empty", "full", "charge", "hey"]
I want to create a dictionary
{starting character of the item in list : list of items in words that start with the specific character}
so, for my example, the expected output would be:
{"h": ["hallo", "hell", "hype", "hey"], "e": ["empty"], "f": ["full"], "c": ["charge"]}
My code:
{(chr(c) for c in range(ord("a"), ord("z")+1)):
[word for word in words if word.startswith("a")]}
The same happens if I try to generalize the word.startswith() statement.
Your current solution - and the corrected version - are rather inefficient, as they iterate on all letters, and for each letter, on all words, so 26*(number of words) loops.
You can do it by iterating only once on the list of words, by creating the dictionary key and the list that will contain the words on the fly. A defaultdict makes this easy:
from collections import defaultdict
words = ["hallo" , "hell", "hype", "empty", "full", "charge", "hey"]
out = defaultdict(list)
for word in words:
    out[word[0]].append(word)
print(out)
# defaultdict(<class 'list'>, {'h': ['hallo', 'hell', 'hype', 'hey'], 'e': ['empty'], 'f': ['full'], 'c': ['charge']})
This takes just 7 iterations instead of 26*7 (and as many tests), and the code is simpler.
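If a plain dict is preferred over a defaultdict (for example, so that the printed result has no defaultdict wrapper), the same single pass can be sketched with dict.setdefault:

```python
words = ["hallo", "hell", "hype", "empty", "full", "charge", "hey"]

grouped = {}
for word in words:
    # setdefault returns the existing list for this letter,
    # or stores and returns a new empty list on first sight.
    grouped.setdefault(word[0], []).append(word)

print(grouped)
```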
You can do this easily with itertools.groupby:
>>> from itertools import groupby
>>> {k: list(v) for k, v in groupby(sorted(words), lambda s: s[0])}
{'c': ['charge'], 'e': ['empty'], 'f': ['full'], 'h': ['hallo', 'hell', 'hey', 'hype']}
Once the words are sorted in ordinary lexicographic order, it's safe to group them by their first letters. (Sorting by first letter only would be sufficient as well.)
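The sorting step matters because groupby only merges consecutive items that share a key. A small sketch with a shortened, made-up word list shows what goes wrong without it:

```python
from itertools import groupby

words = ["hallo", "empty", "hell"]  # the two 'h' words are not adjacent

def first(s):
    return s[0]

# Without sorting, 'h' appears as two separate groups,
# and the second one overwrites the first in the dict.
unsorted_result = {k: list(v) for k, v in groupby(words, key=first)}

# Sorting by first letter (a stable sort) makes equal keys adjacent.
sorted_result = {k: list(v) for k, v in groupby(sorted(words, key=first), key=first)}

print(unsorted_result)
print(sorted_result)
```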
That's not an error, it's an object that you inserted as a key. It seems like you're confused about the syntax for dict comprehensions. The generator expression you wrote ((chr(c) for c in ...)) doesn't expand, it gets used as the key instead. In fact, what you wrote isn't even a dict comprehension.
To do what you want, the loop needs to be after the key-value pair.
{chr(c): [word for word in words if word.startswith(chr(c))]
 for c in range(ord("a"), ord("z")+1)}
For comparison, here's a loose version of the syntax:
{key: value for x in iterable}
This is the naive solution. See Thierry's and chepner's answers for the better solutions. With the naive one, you'd also need to remove the empty lists:
>>> d = {chr(c): [word for word in words if word.startswith(chr(c))]
... for c in range(ord("a"), ord("z")+1)}
>>> d
{'a': [], 'b': [], 'c': ['charge'], 'd': [], 'e': ['empty'], 'f': ['full'], 'g': [], 'h': ['hallo', 'hell', 'hype', 'hey'], 'i': [], 'j': [], 'k': [], 'l': [], 'm': [], 'n': [], 'o': [], 'p': [], 'q': [], 'r': [], 's': [], 't': [], 'u': [], 'v': [], 'w': [], 'x': [], 'y': [], 'z': []}
>>> {k: v for k, v in d.items() if v}
{'c': ['charge'], 'e': ['empty'], 'f': ['full'], 'h': ['hallo', 'hell', 'hype', 'hey']}

Is there another way to extract info with complex/unstructured nested dict format in Python?

Suppose I have an unstructured nested dict as following:
{
'A_brand': {'score1': {'A': 13, 'K': 50}},
'B_brand': {'before_taste': {'score2': {'A': 43, 'D': 23}}, 'after_taste': {'score3': {'H': 36, 'J': 34}}},
'Score4': {'G': 2, 'W': 19}
}
How can I find out which letter gets the highest score in each score group?
For example:
{'key':'value',
'A_brand/score1':'K',
'B_brand/before_taste/score2':'A',
'B_brand/after_taste/score3':'H',
'Score4':'W'}
What I did was the naive way: I created a new dict, accessed each path by hand, sorted the items by value, selected the first one, and added it to the new dict.
For example:
new_csv={'key':'value'}
first=data['A_brand']['score1']
first_new=sorted(first.items(),key=lambda x: x[1],reverse=True)
new_csv['A_brand/score1']=first_new[0][0]
second=data['B_brand']['before_taste']['score2']
second_new=sorted(second.items(),key=lambda x: x[1],reverse=True)
new_csv['B_brand/before_taste/score2']=second_new[0][0]
...
I'm wondering if there is a faster or more automatic way to do that?
You can use a generator with recursion:
data = {'A_brand': {'score1': {'A': 13, 'K': 50}}, 'B_brand': {'before_taste': {'score2': {'A': 43, 'D': 23}}, 'after_taste': {'score3': {'H': 36, 'J': 34}}}, 'Score4': {'G': 2, 'W': 19}}
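def get_max(d, c = []):
    # c holds the path of keys walked so far; it is never mutated, only extended via c+[a]
    for a, b in d.items():
        if all(not isinstance(i, dict) for i in b.values()):
            # b is a flat score dict: yield its path and the key with the largest value
            yield ('/'.join(c+[a]), max(b, key=lambda x:b[x]))
        else:
            # b still contains nested dicts: recurse one level deeper
            yield from get_max(b, c+[a])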
print(dict(get_max(data)))
Output:
{'A_brand/score1': 'K', 'B_brand/before_taste/score2': 'A', 'B_brand/after_taste/score3': 'H', 'Score4': 'W'}
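For very deep nesting, or simply as an alternative to recursion, the same walk can be sketched with an explicit stack. This is only an illustration under the same assumption as above (a dict whose values contain no dicts counts as a score table); the function name max_per_leaf is made up here.

```python
data = {'A_brand': {'score1': {'A': 13, 'K': 50}},
        'B_brand': {'before_taste': {'score2': {'A': 43, 'D': 23}},
                    'after_taste': {'score3': {'H': 36, 'J': 34}}},
        'Score4': {'G': 2, 'W': 19}}

def max_per_leaf(tree):
    result = {}
    stack = [((), tree)]  # (path of keys so far, sub-dict to inspect)
    while stack:
        path, node = stack.pop()
        if any(isinstance(v, dict) for v in node.values()):
            # Intermediate level: queue every nested dict with its extended path.
            for key, child in node.items():
                if isinstance(child, dict):
                    stack.append((path + (key,), child))
        elif node:
            # Flat, non-empty score dict: record the key with the largest value.
            result['/'.join(path)] = max(node, key=node.get)
    return result

best = max_per_leaf(data)
print(best)
```

Unlike the recursive generator, this variant skips non-dict values that sit next to nested dicts instead of failing on them, and the order of the resulting keys may differ.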

Get slice of Python list based on indices: list of list of dicts

I have a structure as below:
[[{'w': [0.5372377247650572, 1.9111341091016385, -3.2165806256024116, -1.7154987465370053, 1.0917999534858416], 'o': 0.0004326739879156587, 'd': 3.586499431857422e-05}],[{'w': [7.298542669399767, -3.9021024252822105], 'o': 0.019860841402923542, 'd': 0.00105997759946847}, {'w': [-2.8024625186056764, -0.34819658506990847], 'o': 0.4135257109795849, 'd': -0.0016469874583619935}, {'w': [-6.018257518762189, 0.3317488378886934], 'o': 0.5815513019444986, 'd': -1.1787471334339458e-05}]]
It is a list of lists of dicts, and these dicts contain keys 'w', 'o', 'd'. I want to create a slice of this structure such that I'm left with only the 'd' values:
3.586499431857422e-05, 0.00105997759946847, -0.0016469874583619935, -1.1787471334339458e-05
How can this be done?
structure = [[{'w': [0.5372377247650572, 1.9111341091016385, -3.2165806256024116, -1.7154987465370053, 1.0917999534858416], 'o': 0.0004326739879156587, 'd': 3.586499431857422e-05}],[{'w': [7.298542669399767, -3.9021024252822105], 'o': 0.019860841402923542, 'd': 0.00105997759946847}, {'w': [-2.8024625186056764, -0.34819658506990847], 'o': 0.4135257109795849, 'd': -0.0016469874583619935}, {'w': [-6.018257518762189, 0.3317488378886934], 'o': 0.5815513019444986, 'd': -1.1787471334339458e-05}]]
d_values = [ x['d'] for row in structure for x in row ]
Without list comprehension:
res=[]
for list_dicts in structure:
    for current_dict in list_dicts:
        res.append(current_dict['d'])
print(res)
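Another way to flatten one level of nesting is itertools.chain.from_iterable. A short sketch, using shortened placeholder values rather than the numbers from the question:

```python
from itertools import chain

# Placeholder data with the same shape: a list of lists of dicts.
structure = [
    [{'w': [0.5, 1.9], 'o': 0.1, 'd': 1.0}],
    [{'w': [7.2, -3.9], 'o': 0.2, 'd': 2.0},
     {'w': [-2.8, -0.3], 'o': 0.4, 'd': 3.0}],
]

# chain.from_iterable yields every dict from every inner list in order.
d_values = [entry['d'] for entry in chain.from_iterable(structure)]
print(d_values)
```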

How to make replacement in python's dict?

The goal I want to achieve is to replace every item of the form #item_name# with the form (item_value) in the dict. I use two dicts named test1 and test2 to test my function. Here is the code:
test1={'integer_set': '{#integer_list#?}', 'integer_list': '#integer_range#(?,#integer_range#)*', 'integer_range': '#integer#(..#integer#)?', 'integer': '[+-]?\\d+'}
test2={'b': '#a#', 'f': '#e#', 'c': '#b#', 'e': '#d#', 'd': '#c#', 'g': '#f#', 'a': 'correct'}
def change(pat_dict:{str:str}):
    print('Expanding: ',pat_dict)
    num=0
    while num<len(pat_dict):
        inv_pat_dict = {v: k for k, v in pat_dict.items()}
        for value in pat_dict.values():
            for key in pat_dict.keys():
                if key in value:
                    repl='#'+key+'#'
                    repl2='('+pat_dict[key]+')'
                    value0=value.replace(repl,repl2)
                    pat_dict[inv_pat_dict[value]]=value0
        num+=1
    print('Result: ',pat_dict)
change(test1)
change(test2)
Sometimes I get the correct result, like:
Expanding: {'integer': '[+-]?\\d+', 'integer_list': '#integer_range#(?,#integer_range#)*', 'integer_set': '{#integer_list#?}', 'integer_range': '#integer#(..#integer#)?'}
Result: {'integer': '[+-]?\\d+', 'integer_list': '(([+-]?\\d+)(..([+-]?\\d+))?)(?,(([+-]?\\d+)(..([+-]?\\d+))?))*', 'integer_set': '{((([+-]?\\d+)(..([+-]?\\d+))?)(?,(([+-]?\\d+)(..([+-]?\\d+))?))*)?}', 'integer_range': '([+-]?\\d+)(..([+-]?\\d+))?'}
Expanding: {'c': '#b#', 'f': '#e#', 'e': '#d#', 'b': '#a#', 'g': '#f#', 'd': '#c#', 'a': 'correct'}
Result: {'c': '((correct))', 'f': '(((((correct)))))', 'e': '((((correct))))', 'b': '(correct)', 'g': '((((((correct))))))', 'd': '(((correct)))', 'a': 'correct'}
But most of the time I get wrong results like this:
Expanding: {'integer_range': '#integer#(..#integer#)?', 'integer': '[+-]?\\d+', 'integer_set': '{#integer_list#?}', 'integer_list': '#integer_range#(?,#integer_range#)*'}
Result: {'integer_range': '([+-]?\\d+)(..([+-]?\\d+))?', 'integer': '[+-]?\\d+', 'integer_set': '{(#integer_range#(?,#integer_range#)*)?}', 'integer_list': '#integer_range#(?,#integer_range#)*'}
Expanding: {'f': '#e#', 'a': 'correct', 'd': '#c#', 'g': '#f#', 'b': '#a#', 'c': '#b#', 'e': '#d#'}
Result: {'f': '(((((correct)))))', 'a': 'correct', 'd': '(((correct)))', 'g': '((((((correct))))))', 'b': '(correct)', 'c': '((correct))', 'e': '((((correct))))'}
How could I update my code to achieve my goal?
Your problem is caused by the fact that Python dictionaries are unordered (this holds before Python 3.7; from 3.7 on, a plain dict preserves insertion order). Try using an OrderedDict instead of dict and you should be fine. The OrderedDict works just like a normal dict but retains insertion order, at a small performance cost.
Note that while you could create an OrderedDict from a dict literal (like I did here at first), on those older versions that dict would itself be unordered, so the ordering would not be guaranteed. Using a list of (key, value) pairs preserves the ordering in all cases.
from collections import OrderedDict
test1=OrderedDict([('integer_set', '{#integer_list#?}'), ('integer_list', '#integer_range#(?,#integer_range#)*'), ('integer_range', '#integer#(..#integer#)?'), ('integer', '[+-]?\\d+')])
test2=OrderedDict([('b', '#a#'), ('f', '#e#'), ('c', '#b#'), ('e', '#d#'), ('d', '#c#'), ('g', '#f#'), ('a', 'correct')])
def change(pat_dict:{str:str}):
    print('Expanding: ',pat_dict)
    num=0
    while num<len(pat_dict):
        inv_pat_dict = {v: k for k, v in pat_dict.items()}
        for value in pat_dict.values():
            for key in pat_dict.keys():
                if key in value:
                    repl='#'+key+'#'
                    repl2='('+pat_dict[key]+')'
                    value0=value.replace(repl,repl2)
                    pat_dict[inv_pat_dict[value]]=value0
        num+=1
    print('Result: ',pat_dict)
change(test1)
change(test2)
Try this one. Your problem is due to mutating the starting dict. You need to change a copy of it instead.
test1={'integer_set': '{#integer_list#?}', 'integer_list': '#integer_range#(?,#integer_range#)*', 'integer_range': '#integer#(..#integer#)?', 'integer': '[+-]?\\d+'}
test2={'b': '#a#', 'f': '#e#', 'c': '#b#', 'e': '#d#', 'd': '#c#', 'g': '#f#', 'a': 'correct'}
def change(d):
    new_d = d.copy()
    for k in d.keys():
        for nk, v in new_d.items():
            if k in v:
                new_d[nk] = v.replace('#{}#'.format(k), '({})'.format(new_d[k]))
    return new_d
test1 = change(test1)
test2 = change(test2)
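A variant that does not depend on the order of the keys is to keep substituting until a full pass changes nothing. The sketch below is one way to do that, not the approach of either answer; the function name expand is made up, and it assumes item names consist only of letters, digits and underscores.

```python
import re

def expand(patterns):
    """Return a copy of patterns with every #name# replaced by (value), fully expanded."""
    result = dict(patterns)
    placeholder = re.compile(r'#(\w+)#')

    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders untouched.
        return '(' + result[name] + ')' if name in result else match.group(0)

    # Each pass resolves at least one more nesting level, so len(result)
    # passes are enough unless the definitions are circular.
    for _ in range(len(result)):
        changed = False
        for key, value in list(result.items()):
            new_value = placeholder.sub(substitute, value)
            if new_value != value:
                result[key] = new_value
                changed = True
        if not changed:
            break
    return result

test1 = {'integer_set': '{#integer_list#?}',
         'integer_list': '#integer_range#(?,#integer_range#)*',
         'integer_range': '#integer#(..#integer#)?',
         'integer': '[+-]?\\d+'}
test2 = {'b': '#a#', 'f': '#e#', 'c': '#b#', 'e': '#d#',
         'd': '#c#', 'g': '#f#', 'a': 'correct'}

expanded1 = expand(test1)
expanded2 = expand(test2)
print(expanded1)
print(expanded2)
```

Because the replacement is produced by a function, re.sub inserts it literally, so backslashes in values such as '[+-]?\\d+' are not reinterpreted.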

Merge two lists into a dictionary and sum over the elements of the second list

If I have two lists (with the same length):
ls1 = ['a','b','c','a','d','c']
ls2 = [1,2,3,5,1,2]
I would like to get the following dictionary (sum over the values if it is the same key):
d = {'a':6,'b':2,'c':5,'d':1}
I did the following:
import numpy as np
ls1 = np.array(ls1)
ls2 = np.array(ls2)
uniqe_vals = list(set(ls1))
d = {}
for u in uniqe_vals:
    ind = np.where(ls1 == u)[0]
    d[u] = sum(ls2[ind])
It works fine for small data, but it is taking too long for the whole data (I have a list of size ~5 million).
Do you have any suggestions for a more efficient way to do it?
Also with a defaultdict, but different and simpler:
from collections import defaultdict
d = defaultdict(int)
for n, v in zip(ls1, ls2):
    d[n] += v
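Or, as suggested, with izip (Python 2 only; in Python 3 the built-in zip is already lazy, so the version above is equivalent):
from collections import defaultdict
from itertools import izip
d = defaultdict(int)
for n, v in izip(ls1, ls2):
    d[n] += v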
You could try:
import numpy as np
uni, i = np.unique(ls1, return_inverse=True)
vals = np.bincount(i, weights=ls2)
dict(zip(uni, vals))
Since you asked how to make it more efficient, I compared the time your original solution took with the version suggested in my comment (equivalent to Juergen's second solution), using 5 million random characters from a-z as keys and 5 million random values from 0-20, measured with my shell's time function:
~/test $ time python defdict.py
defaultdict(<type 'int'>, {'a': 381956, 'c': 383815, 'b': 378277, 'e': 384629, 'd': 383557, 'g': 381139, 'f': 386268, 'i': 383902, 'h': 385809, 'k': 385138, 'j': 384690, 'm': 388552, 'l': 384393, 'o': 384533, 'n': 385011, 'q': 385685, 'p': 386188, 's': 387132, 'r': 383886, 'u': 386176, 't': 387144, 'w': 386371, 'v': 388263, 'y': 381337, 'x': 385281, 'z': 384048})
python defdict.py 13,24s user 0,35s system 96% cpu 14,045 total
~/test $ time python original.py
{'a': 386316, 'c': 383596, 'b': 383424, 'e': 385598, 'd': 383324, 'g': 382233, 'f': 385435, 'i': 386761, 'h': 384047, 'k': 386640, 'j': 386313, 'm': 381032, 'l': 383035, 'o': 389142, 'n': 385000, 'q': 386088, 'p': 387435, 's': 385429, 'r': 384260, 'u': 385442, 't': 384793, 'w': 385052, 'v': 380830, 'y': 386500, 'x': 386871, 'z': 379870}
python original.py 14,68s user 0,38s system 96% cpu 15,529 total
So there seems to be a difference, although not a big one. To make it fairer, numpy was also imported in defdict.py.
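Shell timings like these also include interpreter start-up and data generation. A rough way to time only the aggregation is timeit. The sketch below uses a much smaller input than the 5 million items above, and its second function is a pure-Python stand-in that scans the whole input once per unique key; it is not the NumPy code from the question, so the numbers it prints are not comparable with the ones above.

```python
import random
import string
import timeit
from collections import defaultdict

random.seed(0)
n = 20_000  # deliberately small so the sketch runs quickly
keys = [random.choice(string.ascii_lowercase) for _ in range(n)]
vals = [random.randint(0, 20) for _ in range(n)]

def with_defaultdict():
    # Single pass over the paired lists.
    d = defaultdict(int)
    for k, v in zip(keys, vals):
        d[k] += v
    return d

def with_scan_per_key():
    # One full pass over the input for every unique key.
    return {u: sum(v for k, v in zip(keys, vals) if k == u) for u in set(keys)}

same = with_defaultdict() == with_scan_per_key()
t_single = timeit.timeit(with_defaultdict, number=3)
t_scan = timeit.timeit(with_scan_per_key, number=3)
print(same, t_single, t_scan)
```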
