python - How to use map reduce MRJob

I need to apply a map-reduce job with MRJob and I can't get it to work.
I have a big list in which each item holds two codes and a sentence, as follows:
L = [
    "E-0053 C-0169 It's goig to be a good day\n",
    "D-0312 B-0291 Peter has arrived late\n",
    "A-0417 B-0187 for more information please call the following number\n"
]
I need to use map reduce to obtain a list that gives the number of words in each sentence, keyed by the pair of letters that start the two codes. For example, the result for the list above would be:
[EC 6, DB 4, AB 8]
I've tried:
C1 = [i[0] for i in L]
C2 = [i[7] for i in L]
C1_C2 = [C1[i] + C2[i] for i in range(len(C1))]
class count(MRJob):
    def mapper(self, _, C1_C2):
        [elem.split() for elem in L]
        yield C1_C2, [(len(i)-2) for i in sentence]
    def reducer(self, key, values):
        yield key, sum(values)
count.run()

You could try this (plain Python, without MRJob):
L = [
    "E-0053 C-0169 It's goig to be a good day\n",
    "D-0312 B-0291 Peter has arrived late\n",
    "A-0417 B-0187 for more information please call the following number\n"
]
result = [i[0] + i[7] + " " + str(len(i.split()) - 2) for i in L]
print(result)
Output:
['EC 7', 'DB 4', 'AB 8']
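The question asks for a mapper and a reducer, so here is a sketch of that shape without the library. It is only an illustration: `mapper`, `reducer`, and `run` are hypothetical names, `run` is a stand-in for the grouping step that a map-reduce framework performs between the two phases, and each line is assumed to start with two space-separated codes.

```python
from collections import defaultdict

def mapper(line):
    """Emit (letter pair, word count) for one input line."""
    code1, code2, *words = line.split()
    yield code1[0] + code2[0], len(words)

def reducer(key, values):
    """Sum the word counts collected for one letter pair."""
    yield key, sum(values)

def run(lines):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return [pair
            for key, values in grouped.items()
            for pair in reducer(key, values)]

L = [
    "E-0053 C-0169 It's goig to be a good day\n",
    "D-0312 B-0291 Peter has arrived late\n",
    "A-0417 B-0187 for more information please call the following number\n",
]
print(run(L))  # [('EC', 7), ('DB', 4), ('AB', 8)]
```

In an MRJob job, the same two bodies would normally become the mapper and reducer methods of a class deriving from MRJob, with the lines read from an input file instead of a list.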

Related

python edit tuple duplicates in a list

My goal:
While looping over a list, I would like to check for duplicates and, if there are any, append a number to them. See the following example.
My list, as an example:
[('name','company'), ('someguy','microsoft'), ('anotherguy','microsoft'), ('thirdguy','amazon')]
In the loop I would like to edit those duplicates, so that instead of the second microsoft I have microsoft1 (with three microsoft entries, the third one would become microsoft2).
With the code below I can find the duplicates, but I don't know how to edit them directly in the list:
list = [('name','company'), ('someguy','microsoft'), ('anotherguy','microsoft'), ('thirdguy','amazon')]
names = []
double = []
for u in list[1:]:
    names.append(u[1])
list_size = len(names)
for i in range(list_size):
    k = i + 1
    for j in range(k, list_size):
        if names[i] == names[j] and names[i] not in double:
            double.append(names[i])

This is one approach using collections.defaultdict.
Example:
from collections import defaultdict
lst = [('name','company'), ('someguy','microsoft'), ('anotherguy','microsoft'), ('thirdguy','amazon')]
seen = defaultdict(int)
result = []
for k, v in lst:
    if seen[v]:
        result.append((k, "{}_{}".format(v, seen[v])))
    else:
        result.append((k,v))
    seen[v] += 1
print(result)
Output:
[('name', 'company'),
 ('someguy', 'microsoft'),
 ('anotherguy', 'microsoft_1'),
 ('thirdguy', 'amazon')]
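The question asks for the list to be edited in place and for suffixes without an underscore (microsoft1, microsoft2). A minimal sketch of that variant follows; `number_duplicates` is a hypothetical name, and the header row is treated like any other row.

```python
from collections import Counter

def number_duplicates(rows):
    """Rename repeated companies in place: second -> name1, third -> name2."""
    seen = Counter()
    for i, (name, company) in enumerate(rows):
        if seen[company]:
            # Tuples are immutable, so replace the whole entry at index i.
            rows[i] = (name, "{}{}".format(company, seen[company]))
        seen[company] += 1
    return rows

rows = [('name', 'company'), ('someguy', 'microsoft'),
        ('anotherguy', 'microsoft'), ('thirdguy', 'amazon'),
        ('fourthguy', 'microsoft')]
number_duplicates(rows)
print(rows)
```

Because tuples cannot be modified, "editing" an entry means assigning a new tuple to the same list index, which is what `rows[i] = ...` does.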

Algorithmics issue, python string, no idea

I have an algorithm problem with Python and strings.
My issue:
My function should find the maximum sum of the values of the substrings that a string can be split into.
For example:
ae-afi-re-fi -> 2+6+3+5=16
but
ae-a-fi-re-fi -> 2-10+5+3+5=5
I tried using the string.count function to count substrings, but that approach does not work well.
What would be the best way to do this in Python? Thanks in advance.
string = "aeafirefi"
Sum the value of substrings.

In my solution I use permutations from the itertools module to list all possible orderings of the substrings from your question, which are stored in a dict called vals. For each ordering, the input string is split by the substrings it contains. The values of the substrings found are then summed, and finally the maximum is taken.
PS: The key to this solution is the get_sublists() function.
Here is an example with some tests:
from itertools import permutations
def get_sublists(a, perm_vals):
    # Find the sublists in the input string
    # Based on the permutations of the dict vals.keys()
    for k in perm_vals:
        if k in a:
            a = ''.join(a.split(k))
            # Yield the sublist if we found any
            yield k
def sum_sublists(a, sub, vals):
    # Join the sublist and compare it to the input string
    # Get the difference by length
    diff = len(a) - len(''.join(sub))
    # Sum the value of each sublist (on every permutation)
    return sub, sum(vals[k] for k in sub) - diff * 10
def get_max_sum_sublists(a, vals):
    # Get all the possible permutations
    perm_vals = permutations(vals.keys())
    # Remove duplicates if there are any
    sub = set(tuple(get_sublists(a, k)) for k in perm_vals)
    # Get the sum of each possible permutation
    aa = (sum_sublists(a, k, vals) for k in sub)
    # Return the max of the above operation
    return max(aa, key=lambda x: x[1])
vals = {'ae': 2, 'qd': 3, 'qdd': 5, 'fir': 4, 'afi': 6, 're': 3, 'fi': 5}
# Test
a = "aeafirefi"
final, s = get_max_sum_sublists(a, vals)
print("Sublists: {}\nSum: {}".format(final, s))
print('----')
a = "aeafirefiqdd"
final, s = get_max_sum_sublists(a, vals)
print("Sublists: {}\nSum: {}".format(final, s))
print('----')
a = "aeafirefiqddks"
final, s = get_max_sum_sublists(a, vals)
print("Sublists: {}\nSum: {}".format(final, s))
Output:
Sublists: ('ae', 'afi', 're', 'fi')
Sum: 16
----
Sublists: ('afi', 'ae', 'qdd', 're', 'fi')
Sum: 21
----
Sublists: ('afi', 'ae', 'qdd', 're', 'fi')
Sum: 1
Please try this solution with as many input strings as you can, and don't hesitate to comment if you find a wrong result.

You could use a dictionary with:
key = substring, value = value
So if you have:
string = "aeafirefi"
first look up the whole string in the dictionary. If it is not found, cut off the last letter, giving "aeafiref", and repeat until you find a substring or are left with a single letter.
Then skip the letters used: for example, if you found "aeaf", start all over again with string = "irefi".
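The longest-prefix procedure described here can be sketched as follows. `greedy_score` is a hypothetical name, and the penalty of -10 for a letter that matches nothing is an assumption taken from the question's second example. Note that a greedy match like this is not guaranteed to find the maximum sum for every dictionary.

```python
def greedy_score(s, vals, penalty=-10):
    """Repeatedly take the longest prefix of s that is a key of vals."""
    tokens, total = [], 0
    while s:
        # Try the whole remaining string first, then cut one letter at a time.
        for end in range(len(s), 0, -1):
            prefix = s[:end]
            if prefix in vals:
                total += vals[prefix]
                break
        else:
            # Nothing matched: consume a single letter at a penalty.
            prefix = s[0]
            total += penalty
        tokens.append(prefix)
        s = s[len(prefix):]  # skip the letters just used
    return tokens, total

vals = {'ae': 2, 'qd': 3, 'qdd': 5, 'fir': 4, 'afi': 6, 're': 3, 'fi': 5}
print(greedy_score("aeafirefi", vals))  # (['ae', 'afi', 're', 'fi'], 16)
```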

Here's a brute force solution:
values_dict = {
    'ae': 2,
    'qd': 3,
    'qdd': 5,
    'fir': 4,
    'afi': 6,
    're': 3,
    'fi': 5
}
def get_value(x):
    return values_dict[x] if x in values_dict else -10
def next_tokens(s):
    """Returns possible tokens"""
    # Return any tokens in values_dict
    for x in values_dict.keys():
        if s.startswith(x):
            yield x
    # Return single character.
    yield s[0]
def permute(s, stack=[]):
    """Returns all possible variations"""
    if len(s) == 0:
        yield stack
        return
    for token in next_tokens(s):
        perms = permute(s[len(token):], stack + [token])
        for perm in perms:
            yield perm
def process_string(s):
    def process_tokens(tokens):
        return sum(map(get_value, tokens))
    return max(map(process_tokens, permute(s)))
print('Max: {}'.format(process_string('aeafirefi')))
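The brute force above enumerates every tokenization, which grows quickly with the length of the string. As an alternative, here is a minimal dynamic-programming sketch of the same scoring rule. It is not part of the original answer: `best_score` is a hypothetical name, and the -10 penalty for an unmatched letter mirrors get_value above.

```python
def best_score(s, vals, penalty=-10):
    """Maximum total value over all ways to split s into tokens."""
    n = len(s)
    # best[i] holds the best achievable score for the suffix s[i:].
    best = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        # Option 1: treat s[i] as a single unmatched letter.
        best[i] = penalty + best[i + 1]
        # Option 2: start any known substring at position i.
        for token, value in vals.items():
            if s.startswith(token, i):
                best[i] = max(best[i], value + best[i + len(token)])
    return best[0]

vals = {'ae': 2, 'qd': 3, 'qdd': 5, 'fir': 4, 'afi': 6, 're': 3, 'fi': 5}
print(best_score("aeafirefi", vals))  # 16
```

Each position is evaluated once, so the work is proportional to the string length times the number of dictionary entries.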

Merge nested list items based on a repeating value

Although poorly written, this code:
marker_array = [['hard','2','soft'],['heavy','2','light'],['rock','2','feather'],['fast','3'], ['turtle','4','wet']]
marker_array_DS = []
for i in range(len(marker_array)):
    if marker_array[i-1][1] != marker_array[i][1]:
        marker_array_DS.append(marker_array[i])
print marker_array_DS
Returns:
[['hard', '2', 'soft'], ['fast', '3'], ['turtle', '4', 'wet']]
It accomplishes part of the task which is to create a new list containing all nested lists except those that have duplicate values in index [1]. But what I really need is to concatenate the matching index values from the removed lists creating a list like this:
[['hard heavy rock', '2', 'soft light feather'], ['fast', '3'], ['turtle', '4', 'wet']]
The values in index [1] must not be concatenated. I kind of managed to do the concatenation part using a tip from another post:
newlist = [i + n for i, n in zip(list_a, list_b)]
But I am struggling with figuring out the way to produce the desired result. The "marker_array" list will be already sorted in ascending order before being passed to this code. All like-values in index [1] position will be contiguous. Some nested lists may not have any values beyond [0] and [1] as illustrated above.

Quick stab at it: use itertools.groupby to do the grouping for you, but run it over a generator that pads each 2-element list to 3 elements.
from itertools import groupby
from operator import itemgetter
marker_array = [['hard','2','soft'],['heavy','2','light'],['rock','2','feather'],['fast','3'], ['turtle','4','wet']]
def my_group(iterable):
    # Pad short entries with '' so every entry has exactly 3 elements
    temp = ((el + [''])[:3] for el in iterable)
    for k, g in groupby(temp, key=itemgetter(1)):
        fst, snd = map(' '.join, zip(*map(itemgetter(0, 2), g)))
        yield filter(None, [fst, k, snd])
print list(my_group(marker_array))

from collections import defaultdict
d1 = defaultdict(list)
d2 = defaultdict(list)
for pxa in marker_array:
    d1[pxa[1]].extend(pxa[:1])
    d2[pxa[1]].extend(pxa[2:])
res = [[' '.join(d1[x]), x, ' '.join(d2[x])] for x in sorted(d1)]
If you really need 2-element lists (which I think is unlikely):
for p in res:
    if not p[-1]:
        p.pop()

marker_array = [['hard','2','soft'],['heavy','2','light'],['rock','2','feather'],['fast','3'], ['turtle','4','wet']]
marker_array_DS = []
marker_array_hit = []
for i in range(len(marker_array)):
    if marker_array[i][1] not in marker_array_hit:
        marker_array_hit.append(marker_array[i][1])
for i in marker_array_hit:
    lists = [item for item in marker_array if item[1] == i]
    temp = []
    first_part = ' '.join([str(item[0]) for item in lists])
    temp.append(first_part)
    temp.append(i)
    second_part = ' '.join([str(item[2]) for item in lists if len(item) > 2])
    if second_part != '':
        temp.append(second_part)
    marker_array_DS.append(temp)
print marker_array_DS

marker_array = [
    ['hard','2','soft'],
    ['heavy','2','light'],
    ['rock','2','feather'],
    ['fast','3'],
    ['turtle','4','wet'],
]
data = {}
for arr in marker_array:
    if len(arr) == 2:
        arr.append('')
    (first, index, last) = arr
    firsts, lasts = data.setdefault(index, [[],[]])
    firsts.append(first)
    lasts.append(last)
results = []
for key in sorted(data.keys()):
    current = [
        " ".join(data[key][0]),
        key,
        " ".join(data[key][1])
    ]
    if current[-1] == '':
        current = current[:-1]
    results.append(current)
print results
--output:--
[['hard heavy rock', '2', 'soft light feather'], ['fast', '3'], ['turtle', '4', 'wet']]

A different solution based on itertools.groupby:
from itertools import groupby
# normalizes the list of markers so all markers have 3 elements
def normalized(markers):
    for marker in markers:
        yield marker + [""] * (3 - len(marker))
def concatenated(markers):
    # use groupby to iterate over lists of markers sharing the same key
    for key, markers_in_category in groupby(normalized(markers), lambda m: m[1]):
        # get separate lists of left and right words
        lefts, rights = zip(*[(m[0],m[2]) for m in markers_in_category])
        # remove empty strings from both lists
        lefts, rights = filter(bool, lefts), filter(bool, rights)
        # yield the concatenated entry for this key (also removing the empty string at the end, if necessary)
        yield filter(bool, [" ".join(lefts), key, " ".join(rights)])
The generator concatenated(markers) will yield the results. This code correctly handles the ['fast', '3'] case and doesn't return an additional third element in such cases.
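The answers in this thread target Python 2 (print statement, filter returning a list). For Python 3, the same groupby idea can be sketched as below. `merge_markers` is a hypothetical name, and the input is assumed to be sorted so that equal values at index [1] are contiguous, as the question states.

```python
from itertools import groupby
from operator import itemgetter

def merge_markers(markers):
    """Merge consecutive entries that share the same value at index 1."""
    merged = []
    for key, group in groupby(markers, key=itemgetter(1)):
        group = list(group)  # groupby yields a one-shot iterator
        lefts = " ".join(m[0] for m in group)
        rights = " ".join(m[2] for m in group if len(m) > 2)
        entry = [lefts, key]
        if rights:
            # Only add a third element when at least one entry had one.
            entry.append(rights)
        merged.append(entry)
    return merged

marker_array = [['hard', '2', 'soft'], ['heavy', '2', 'light'],
                ['rock', '2', 'feather'], ['fast', '3'], ['turtle', '4', 'wet']]
print(merge_markers(marker_array))
```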

Generate a list of strings with a sliding window using itertools, yield, and iter() in Python 2.7.1?

I'm trying to write a sliding window function in Python. I figured out how to do it, but not entirely inside the function. itertools, yield, and iter() are entirely new to me.
I want to input
a='abcdefg'
b=window(a,3)
print b
and get
['abc','bcd','cde','def','efg']
The way I got it to work was:
def window(fseq, window_size=5):
    import itertools
    tentative=[]
    final=[]
    iteration=iter(fseq)
    value=tuple(itertools.islice(iteration,window_size))
    if len(value) == window_size:
        yield value
    for element in iteration:
        value = value[1:] + (element,)
        yield value
a='abcdefg'
result=window(a)
list1=[]
for k in result:
    list1.append(k)
list2=[]
for j in list1:
    tentative=''.join(j)
    list2.append(tentative)
print list2
Basically, what I'm confused about is how to use the final value of the function inside the function.
Here is my code for the function:
def window(fseq, window_size=5):
    import itertools
    tentative=[]
    final=[]
    iteration=iter(fseq)
    value=tuple(itertools.islice(iteration,window_size))
    if len(value) == window_size:
        yield value
    for element in iteration:
        value = value[1:] + (element,)
        yield value
    for k in value:
        tentative.append(k)
    for j in tentative:
        tentative_string=''.join(j)
        final.append(tentative_string)
    return final
seq='abcdefg'
uence=window(seq)
print uence
I want it to return the joined list, but when I run it, it says "There's an error in your program * 'return' with argument inside generator".
I'm really confused...

Do you mean you want to do this?
a='abcdefg'
b = [a[i:i+3] for i in xrange(len(a)-2)]
print b
['abc', 'bcd', 'cde', 'def', 'efg']

Your generator could be much shorter:
def window(fseq, window_size=5):
    for i in xrange(len(fseq) - window_size + 1):
        yield fseq[i:i+window_size]
for seq in window('abcdefghij', 3):
    print seq
abc
bcd
cde
def
efg
fgh
ghi
hij

def window(fseq,fn):
    alpha=[fseq[i:i+fn] for i in range(len(fseq)-(fn-1))]
    return alpha

Use the zip function in a one-line expression:
[ "".join(j) for j in zip(*[fseq[i:] for i in range(window_size)])]

I don't know what your input or expected output are, but in Python 2 a generator cannot use return with a value, so you cannot mix yield and return final in one function. Change return to yield and your function will not throw that error again.
def window(fseq, window_size=5):
    ....
    final.append(tentative_string)
    yield final

>>> def window(data, win_size):
...     tmp = [iter(data[i:]) for i in range(win_size)]
...     return zip(*tmp)
...
>>> a = [1, 2, 3, 4, 5, 6]
>>> window(a, 3)
[(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
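For Python 3, here is a minimal sketch that keeps the asker's islice-based generator but joins each window inside the function, which is what the question was after. `window_strings` is a hypothetical name, and the elements of the input are assumed to be strings so they can be joined.

```python
from collections import deque
from itertools import islice

def window_strings(seq, size=5):
    """Yield each sliding window over seq as a joined string."""
    it = iter(seq)
    # A deque with maxlen drops the oldest item automatically on append.
    current = deque(islice(it, size), maxlen=size)
    if len(current) == size:
        yield "".join(current)
    for item in it:
        current.append(item)
        yield "".join(current)

print(list(window_strings("abcdefg", 3)))  # ['abc', 'bcd', 'cde', 'def', 'efg']
```

Because the function only yields, the caller decides whether to collect the windows with list() or to loop over them one at a time.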

Insert an item into sorted list in Python

I'm creating a class where one of the methods inserts a new item into a sorted list. The item must be inserted at the correct (sorted) position. I'm not allowed to use any built-in list functions or methods other than [], [:], +, and len, though. This is the part that's really confusing to me.
What would be the best way to go about this?

Use the insort function of the bisect module:
import bisect
a = [1, 2, 4, 5]
bisect.insort(a, 3)
print(a)
Output
[1, 2, 3, 4, 5]

Hint 1: You might want to study the Python code in the bisect module.
Hint 2: Slicing can be used for list insertion:
>>> s = ['a', 'b', 'd', 'e']
>>> s[2:2] = ['c']
>>> s
['a', 'b', 'c', 'd', 'e']

You should use the bisect module. Note that the list needs to be sorted before using bisect.insort_left.
Compared with appending and re-sorting, it makes a pretty big difference:
>>> l = [0, 2, 4, 5, 9]
>>> bisect.insort_left(l,8)
>>> l
[0, 2, 4, 5, 8, 9]
>>> timeit.timeit("l.append(8); l = sorted(l)",setup="l = [4,2,0,9,5]; import bisect; l = sorted(l)",number=10000)
1.2235019207000732
>>> timeit.timeit("bisect.insort_left(l,8)",setup="l = [4,2,0,9,5]; import bisect; l=sorted(l)",number=10000)
0.041441917419433594

I'm learning algorithms right now, so I wondered how the bisect module is written.
Here is the code from the bisect module for inserting an item into a sorted list, which uses binary search (dichotomy):
def insort_right(a, x, lo=0, hi=None):
    """Insert item x in list a, and keep it sorted assuming a is sorted.
    If x is already in a, insert it to the right of the rightmost x.
    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]:
            hi = mid
        else:
            lo = mid+1
    a.insert(lo, x)
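Since the question only allows [], [:], + and len, the final a.insert call can be replaced by slicing and concatenation. The following is a sketch under that constraint: `insert_sorted` is a hypothetical name, and it returns a new list instead of modifying the argument.

```python
def insert_sorted(items, x):
    """Return a new sorted list with x added, using only [], [:], + and len."""
    lo, hi = 0, len(items)
    # Binary search for the first position whose element is greater than x.
    while lo < hi:
        mid = (lo + hi) // 2
        if x < items[mid]:
            hi = mid
        else:
            lo = mid + 1
    # Rebuild the list around the insertion point.
    return items[:lo] + [x] + items[lo:]

print(insert_sorted([1, 2, 4, 5], 3))  # [1, 2, 3, 4, 5]
```

The search takes O(log n) comparisons, but building the new list still copies every element, so the whole operation remains O(n).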

If there are no artificial restrictions, bisect.insort() should be used as described by stanga. However, as Velda mentioned in a comment, most real-world problems go beyond sorting pure numbers.
Fortunately, as commented by drakenation, the solution applies to any comparable objects. For example, bisect.insort() also works with a custom dataclass that implements __lt__():
from bisect import insort
from dataclasses import dataclass
@dataclass
class Person:
    first_name: str
    last_name: str
    age: int
    def __lt__(self, other):
        return self.age < other.age
persons = []
insort(persons, Person('John', 'Doe', 30))
insort(persons, Person('Jane', 'Doe', 28))
insort(persons, Person('Santa', 'Claus', 1750))
# [Person(first_name='Jane', last_name='Doe', age=28), Person(first_name='John', last_name='Doe', age=30), Person(first_name='Santa', last_name='Claus', age=1750)]
However, in the case of tuples, it would be desirable to sort by an arbitrary key. By default, tuples are sorted by their first item (first name), then by the next item (last name), and so on.
As a solution you can manage an additional list of keys:
from bisect import bisect
persons = []
ages = []
def insert_person(person):
    age = person[2]
    i = bisect(ages, age)
    persons.insert(i, person)
    ages.insert(i, age)
insert_person(('John', 'Doe', 30))
insert_person(('Jane', 'Doe', 28))
insert_person(('Santa', 'Claus', 1750))
Official solution: The documentation of bisect.insort() refers to a recipe how to use the function to implement this functionality in a custom class SortedCollection, so that it can be used as follows:
>>> s = SortedCollection(key=itemgetter(2))
>>> for record in [
...         ('roger', 'young', 30),
...         ('angela', 'jones', 28),
...         ('bill', 'smith', 22),
...         ('david', 'thomas', 32)]:
...     s.insert(record)
>>> pprint(list(s)) # show records sorted by age
[('bill', 'smith', 22),
 ('angela', 'jones', 28),
 ('roger', 'young', 30),
 ('david', 'thomas', 32)]
Following is the relevant extract of the class required to make the example work. Basically, the SortedCollection manages an additional list of keys in parallel to the items list to find out where to insert the new tuple (and its key).
from bisect import bisect_left
class SortedCollection(object):
    def __init__(self, iterable=(), key=None):
        self._given_key = key
        key = (lambda x: x) if key is None else key
        decorated = sorted((key(item), item) for item in iterable)
        self._keys = [k for k, item in decorated]
        self._items = [item for k, item in decorated]
        self._key = key
    def __getitem__(self, i):
        return self._items[i]
    def __iter__(self):
        return iter(self._items)
    def insert(self, item):
        'Insert a new item. If equal keys are found, add to the left'
        k = self._key(item)
        i = bisect_left(self._keys, k)
        self._keys.insert(i, k)
        self._items.insert(i, item)
Note that list.insert() as well as bisect.insort() have O(n) complexity, because the elements after the insertion point must be shifted. Thus, as commented by nz_21, manually iterating through the sorted list, looking for the right position, would be just as good in terms of complexity. In fact, simply sorting the array after inserting a new value will probably be fine, too, since Python's Timsort has a worst-case complexity of O(n log(n)). For completeness, however, note that a balanced binary search tree (BST) would allow insertions in O(log(n)) time.

This is a possible solution for you:
a = [15, 12, 10]
b = sorted(a)
print(b) # --> [10, 12, 15]
c = 13
for i in range(len(b)):
    if b[i] > c:
        break
else:
    # no larger element found: insert at the end
    i = len(b)
d = b[:i] + [c] + b[i:]
print(d) # --> [10, 12, 13, 15]

# function to insert a number into a sorted list
def pstatement(value_returned):
    return print('new sorted list =', value_returned)
def insert(items, n):
    # "items" and "result" avoid shadowing the built-ins input and list
    print('input list = ', items)
    print('number to insert = ', n)
    print('range to iterate is =', len(items))
    first = items[0]
    print('first element =', first)
    last = items[-1]
    print('last element =', last)
    if first > n:
        result = [n] + items[:]
        return pstatement(result)
    elif last < n:
        result = items[:] + [n]
        return pstatement(result)
    else:
        for i in range(len(items)):
            if items[i] > n:
                break
        result = items[:i] + [n] + items[i:]
        return pstatement(result)
# Input values
listq = [2, 4, 5]
n = 1
insert(listq, n)

There are many ways to do this. Here is a simple, naive program that does it using the built-in Python function sorted():
def sorted_inserter():
    list_in = []
    n1 = int(input("How many items in the list : "))
    for i in range (n1):
        e1 = int(input("Enter numbers in list : "))
        list_in.append(e1)
    print("The input list is : ",list_in)
    print("Any more items to be inserted ?")
    n2 = int(input("How many more numbers to be added ? : "))
    for j in range (n2):
        e2= int(input("Add more numbers : "))
        list_in.append(e2)
    list_sorted=sorted(list_in)
    print("The sorted list is: ",list_sorted)
sorted_inserter()
The output is
How many items in the list : 4
Enter numbers in list : 1
Enter numbers in list : 2
Enter numbers in list : 123
Enter numbers in list : 523
The input list is : [1, 2, 123, 523]
Any more items to be inserted ?
How many more numbers to be added ? : 1
Add more numbers : 9
The sorted list is: [1, 2, 9, 123, 523]

To add to the existing answers: when you want to insert an element into a list of tuples where the first element is comparable and the second is not, you can use the key parameter of the bisect.insort function (available since Python 3.10) as follows:
import bisect
class B:
    pass
a = [(1, B()), (2, B()), (3, B())]
bisect.insort(a, (3, B()), key=lambda x: x[0])
print(a)
Without the key function, the code would throw a TypeError, because insort would try to compare the second elements of the tuples as a tie breaker, and those are not comparable by default.

This is one way to build a list from user input and insert a value into the sorted list:
a = []
num = int(input('How many numbers: '))
for n in range(num):
    numbers = int(input('Enter values:'))
    a.append(numbers)
b = sorted(a)
print(b)
c = int(input("enter value:"))
index = len(b)  # default: insert at the end if no larger value exists
for i in range(len(b)):
    if b[i] > c:
        index = i
        break
d = b[:index] + [c] + b[index:]
print(d)
