I'm new to Python and Pyspark and I'm practicing TF-IDF.
I split the sentences in the txt file into words, removed punctuation, removed the words that are in the stop-words list, and saved the result as a dictionary with the code below.
x = text_file.flatMap(lambda line: str_clean(line).split())
x = x.filter(lambda word: word not in stopwords)
x = x.map(lambda word: (word, 1))  # reduceByKey needs (key, value) pairs
x = x.reduceByKey(lambda a, b: a + b)
x = x.collectAsMap()
I have 10 different txt files that go through this same process, and I'd like to add a string like "#d1" to the keys in the dictionary to indicate that a key comes from document 1.
How can I add "#d1" to all keys in the dictionary?
Essentially my dictionary is in the form:
{'word1': 1, 'word2': 1, 'word3': 2, ....}
And I would like it to be:
{'word1#d1': 1, 'word2#d1': 1, 'word3#d1': 2, ...}
Try a dictionary comprehension:
{k+'#d1': v for k, v in d.items()}
In Python 3.6+, you can use f-strings:
{f'{k}#d1': v for k, v in d.items()}
You can use the dict constructor to rebuild the dict, appending the file number to the end of each key:
>>> d = {'a': 1, 'b': 2}
>>> file_number = 1
>>> dict(("{}#{}".format(k, file_number), v) for k, v in d.items())
{'a#1': 1, 'b#1': 2}
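For the ten-file case in the question, the same comprehension can be wrapped in a small helper and applied once per document. This is only a sketch, assuming the per-document counts are already available as plain dicts; `tag_keys`, `doc_counts`, and the sample words are hypothetical names used for illustration.

```python
def tag_keys(counts, doc_id):
    """Return a new dict whose keys carry a '#d<doc_id>' suffix."""
    return {f"{word}#d{doc_id}": n for word, n in counts.items()}

# Hypothetical per-document word counts (stand-ins for collectAsMap() results).
doc_counts = [
    {"word1": 1, "word2": 1},
    {"word1": 3, "word3": 2},
]

# Number the documents from 1 and merge everything into one dictionary.
merged = {}
for doc_id, counts in enumerate(doc_counts, start=1):
    merged.update(tag_keys(counts, doc_id))

print(merged)
# {'word1#d1': 1, 'word2#d1': 1, 'word1#d2': 3, 'word3#d2': 2}
```

Because every key is tagged with its document, the same word from two files no longer collides when the dictionaries are merged.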
I have a list of dict that looks like below
def prefix_key_dict(prefix, test_dict):
    # Prepend the prefix to each lower-cased key.
    res = {prefix + str(key).lower(): val for key, val in test_dict.items()}
    return res
temp_prefix = 'column_'
transformed_dict = [prefix_key_dict(temp_prefix, each) for each in table_col_list]
and the transformed json looks like below
Let's say I have the following dictionary:
full_dic = {
'aa': 1,
'ac': 1,
'ab': 1,
'ba': 2,
...
}
I normally use standard dictionary comprehension to remove dupes like:
t = {val : key for (key, val) in full_dic.items()}
cleaned_dic = {val : key for (key, val) in t.items()}
Calling print(cleaned_dic) outputs {'ab': 1,'ba': 2, ...}
With this code, the key that remains seems to always be the final one in the list, but I'm not sure that's even guaranteed as dictionaries are unordered. Instead, I'd like to find a way to ensure that the key I keep is the first alphabetically.
So, regardless of the 'order' the dictionary is in, I want the output to be:
>> {'aa': 1,'ba': 2, ...}
Where 'aa' comes first alphabetically.
I ran some timer tests on 3 answers below and got the following (dictionary was created with random key/value pairs):
dict length: 10
# of loops: 100000
HoliSimo (OrderedDict): 0.0000098405 seconds
Ricardo: 0.0000115448 seconds
Mark (itertools.groupby): 0.0000111745 seconds
dict length: 1000000
# of loops: 10
HoliSimo (OrderedDict): 6.1724137300 seconds
Ricardo: 3.3102091300 seconds
Mark (itertools.groupby): 6.1338266200 seconds
We can see that for small dictionaries OrderedDict is marginally the fastest, but for large dictionaries Ricardo's answer below is nearly twice as fast as the other two.
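For anyone who wants to run a comparison like this themselves, here is a rough sketch of a `timeit` harness. It is not the script that produced the numbers above; the two functions, the random data, and the loop counts are all assumptions made for illustration, and the sort here is case-sensitive.

```python
import random
import string
import timeit

def keep_first_sorted(d):
    """Sort items by key and keep the first key seen for each value."""
    seen = {}  # value -> first (smallest) key
    for k, v in sorted(d.items()):
        seen.setdefault(v, k)
    return {k: v for v, k in seen.items()}

def keep_first_double_inversion(d):
    """Invert over reverse-sorted items so the smallest key wins, then invert back."""
    inverted = {v: k for k, v in sorted(d.items(), reverse=True)}
    return {k: v for v, k in inverted.items()}

random.seed(0)  # fixed seed so repeated runs use the same data
sample = {
    "".join(random.choices(string.ascii_lowercase, k=4)): random.randint(1, 50)
    for _ in range(1000)
}

for func in (keep_first_sorted, keep_first_double_inversion):
    seconds = timeit.timeit(lambda: func(sample), number=100)
    print(f"{func.__name__}: {seconds / 100:.8f} seconds per call")
```

Timings depend heavily on the machine, the Python version, and the dictionary size, so it is worth repeating the measurement at the sizes you actually care about.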
t = {val : key for (key, val) in dict(sorted(full_dic.items(), key=lambda x: x[0].lower(), reverse=True)).items()}
cleaned_dic = {val : key for (key, val) in t.items()}
dict(sorted(cleaned_dic.items(), key=lambda x: x[0].lower()))
>>> {'aa': 1, 'ba': 2}
Seems like you can do this with a single sort and itertools.groupby. First sort the items by value, then key. Pass this to groupby and take the first item of each group to pass to the dict constructor:
from itertools import groupby
full_dic = {
'aa': 1,
'ac': 1,
'xx': 2,
'ab': 1,
'ba': 2,
}
groups = groupby(sorted(full_dic.items(), key=lambda p: (p[1], p[0])), key=lambda x: x[1])
dict(next(g) for k, g in groups)
# {'aa': 1, 'ba': 2}
You should use the OrderedDict class.
import collections
full_dic = {
'aa': 1,
'ac': 1,
'ab': 1
}
od = collections.OrderedDict(sorted(full_dic.items()))
This way you are sure to have a sorted dictionary (Original code: StackOverflow).
And then:
result = {}
for k, v in od.items():
    # Keep only the first (alphabetically smallest) key for each value.
    if v not in result.values():
        result[k] = v
I'm not sure if it will speed up the computation but you can try:
inverted_dict = {}
for k, v in od.items():
    if inverted_dict.get(v) is None:
        inverted_dict[v] = k
res = {v: k for k, v in inverted_dict.items()}
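If the sorted order is not needed for anything else, the sort can be skipped altogether. The sketch below is a different technique from the answers in this thread: a single pass that remembers the smallest key seen so far for each value. `keep_smallest_key` is a hypothetical name, and the comparison is case-sensitive.

```python
def keep_smallest_key(d):
    """For each distinct value, keep only the alphabetically smallest key."""
    best = {}  # value -> smallest key seen so far
    for key, value in d.items():
        if value not in best or key < best[value]:
            best[value] = key
    # Flip back to key -> value.
    return {key: value for value, key in best.items()}

full_dic = {"aa": 1, "ac": 1, "ab": 1, "ba": 2}
print(keep_smallest_key(full_dic))  # {'aa': 1, 'ba': 2}
```

This runs in linear time in the size of the dictionary, whereas checking `result.values()` inside the loop rescans the results on every iteration.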
Suppose I have a dictionary:
d = {'a_c':1,'b_c':2,'a_d':3,'b_d':4}
How do I split it into two dictionaries based on the last word/letter of the key ('c', 'd'), like this?
d1 = {'a_c':1,'b_c':2}
d2 = {'a_d':3,'b_d':4}
This is one way:
from collections import defaultdict
d = {'a_c':1,'b_c':2,'a_d':3,'b_d':4}
key = lambda s: s.split('_')[1]
res = defaultdict(dict)
for k, v in d.items():
    res[key(k)][k] = v
print(list(res.values()))
Output:
[{'a_c': 1, 'b_c': 2}, {'a_d': 3, 'b_d': 4}]
The result is a list of dictionaries divided on the last letter of the key.
You could try something like this:
func = lambda ending_str: {x: d[x] for x in d.keys() if x.endswith(ending_str)}
d1 = func('_c')
d2 = func('_d')
Also, like Marc mentioned in the comments, you shouldn't have two keys with the same name in a dictionary. Only the last key/value pair is kept in that case.
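The two approaches above can also be combined so that the suffixes do not have to be known in advance and each part can still be looked up by name. A minimal sketch, assuming the suffix is whatever follows the last underscore; `split_by_suffix` and `parts` are hypothetical names.

```python
from collections import defaultdict

def split_by_suffix(d, sep="_"):
    """Group items into sub-dicts keyed by the text after the last `sep`."""
    groups = defaultdict(dict)
    for key, value in d.items():
        suffix = key.rsplit(sep, 1)[-1]
        groups[suffix][key] = value
    return dict(groups)

d = {"a_c": 1, "b_c": 2, "a_d": 3, "b_d": 4}
parts = split_by_suffix(d)
d1, d2 = parts["c"], parts["d"]
print(d1)  # {'a_c': 1, 'b_c': 2}
print(d2)  # {'a_d': 3, 'b_d': 4}
```

Using `rsplit(sep, 1)` means keys with more than one underscore are still grouped by their final segment.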
Suppose I have a dictionary as follows
dic = {0: [1,2,3,4,5], 1:[7,4,6]}
To print each key with the count of its values, I first update the dictionary:
for k, v in dic.items():
    dic[k] = len(v)
print(dic)
>>> {0:5, 1:3}
Is there a better way to do the above without a for loop?
What do you mean by "without iteration"? If you just don't want to write a for loop, you can use the map function:
dic = dict(map(lambda kv: (kv[0], len(kv[1])), dic.items()))
If by no iteration you simply mean to do it without a for loop, you can map the dict values to the len function, zip them with the dict keys and pass the zipped key-value pairs to the dict constructor:
>>> d = {0: [1,2,3,4,5], 1:[7,4,6]}
>>> dict(zip(d, map(len, d.values())))
{0: 5, 1: 3}
First of all, do not name a list 'list' or a dictionary 'dict'. These are not reserved words, but they are the names of Python's built-in types, and reusing them shadows the built-ins.
You can do it neatly using dictionary comprehension as follows:
d = {0: [1,2,3,4,5], 1:[7,4,6]}
d = {i:len(j) for i,j in d.items()}
print(d)
Or use:
print({k:len(dic[k]) for k in dic})
Output:
{0: 5, 1: 3}
def invert_dict(d):
    inv = dict()
    for key in d:
        val = d[key]
        if val not in inv:
            inv[val] = [key]
        else:
            inv[val].append(key)
    return inv
This is an example from the Think Python book: a function for inverting (swapping) the keys and values of a dictionary. The new values (the former keys) are stored as lists, so if several keys were bound to equal values before inverting, the function simply appends them to the list of former keys for that value.
Example:
somedict = {'one': 1, 'two': 2, 'doubletwo': 2, 'three': 3}
invert_dict(somedict) ---> {1: ['one'], 2: ['doubletwo', 'two'], 3: ['three']}
My question is, can the same be done with a dictionary comprehension? This function creates an empty dict, inv = dict(), which is later checked with if/else for the presence of values. A dict comprehension, in this case, would have to check itself. Is that possible, and what should the syntax look like?
The general dict comprehension syntax for swapping keys and values is:
{value:key for key, value in somedict.items()}
but if I want to add an 'if' clause, what should it look like? if value not in (what)?
Thanks.
I don't think it's possible with a simple dict comprehension without using other functions.
The following code uses itertools.groupby to group keys that have the same value.
>>> import itertools
>>> {k: [x[1] for x in grp]
...  for k, grp in itertools.groupby(
...      sorted((v, k) for k, v in somedict.items()),
...      key=lambda x: x[0])
... }
{1: ['one'], 2: ['doubletwo', 'two'], 3: ['three']}
You can use a set comprehension side effect:
somedict = {'one': 1, 'two': 2, 'doubletwo': 2, 'three': 3}
invert_dict={}
{invert_dict.setdefault(v, []).append(k) for k, v in somedict.items()}
print(invert_dict)
# {1: ['one'], 2: ['two', 'doubletwo'], 3: ['three']}  (insertion order, Python 3.7+)
Here is a good answer:
fts = {1:1,2:1,3:2,4:1}
new_dict = {dest: [k for k, v in fts.items() if v == dest] for dest in set(fts.values())}
Reference: Head First Python, 2nd Edition, page 502
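One thing to keep in mind with the comprehension above is that it rescans the whole dictionary once for every distinct value. If a plain loop is acceptable (so this is not a comprehension and does not strictly answer the question as asked), a single pass with `collections.defaultdict` does the same job. A minimal sketch, reusing the `somedict` example from the question:

```python
from collections import defaultdict

def invert_dict(d):
    """Map each value to the list of keys that had that value."""
    inv = defaultdict(list)
    for key, value in d.items():
        inv[value].append(key)
    return dict(inv)  # convert back to a plain dict

somedict = {"one": 1, "two": 2, "doubletwo": 2, "three": 3}
print(invert_dict(somedict))
# {1: ['one'], 2: ['two', 'doubletwo'], 3: ['three']}
```

The keys within each list appear in the order they were inserted into the original dictionary (Python 3.7+).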
I have a list:
list = [(a,1),(b,2),(a,3)]
I want to convert it to a dict where, when there is a duplicate key (e.g. (a,1) and (a,3)), the values are averaged, so the dict has just one key:value pair for it, which in this case would be a:2.
from collections import defaultdict
l = [('a',1),('b',2),('a',3)]
d = defaultdict(list)
for pair in l:
    d[pair[0]].append(pair[1])  # add each number to the list under the correct key
for (k, v) in d.items():
    d[k] = sum(d[k])/len(d[k])  # replace the list under key k with its average (sum divided by length)
So
print(d)
>>> defaultdict(<type 'list'>, {'a': 2, 'b': 2})
Or even nicer and producing a plain dictionary in the end:
from collections import defaultdict
l = [('a',1),('b',2),('a',3)]
temp_d = defaultdict(list)
for pair in l:
    temp_d[pair[0]].append(pair[1])
#CHANGES HERE
final = dict((k,sum(v)/len(v)) for k,v in temp_d.items())
print(final)
>>>
{'a': 2, 'b': 2}
Note that if you are using Python 2.x (as you are), you will need one of the following adjustments to force float division:
(k,sum(v)/float(len(v)))
OR
sum(d[k])/float(len(d[k]))
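On Python 3 the division issue goes away, since `/` is always true division, and the standard library's `statistics.mean` can do the averaging. A minimal sketch, assuming Python 3; `pairs`, `grouped`, and `averages` are names chosen here for illustration.

```python
from collections import defaultdict
from statistics import mean

pairs = [("a", 1), ("b", 2), ("a", 3)]

# Collect all values that share a key.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Average each group into a plain dict.
averages = {key: mean(values) for key, values in grouped.items()}
print(averages)
```

Here 'a' averages to 2 and 'b' stays at 2, matching the result requested in the question.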