Use MapReduce on list elements? - python

So I know how to use MapReduce for files where each element is on a line, but I'm trying to use MapReduce on a file with its entries like so:
74390,0,6,7,5,2,6,4,10,12,7,6,12,9,4,3,9,1,3,5,9,9,8,5,12,11,4,8,5,9,6,12,12,9,7,9,12,7,8,9,8,8
74391,1,4,2,9,3,5,12,7,6,9,6,8,9,10,12,7,9,9,9,9,5,1,8,4,5,12,6,5,4,3,9,6,8,7,12,11,12,7,8,12,8
74392,0,6,9,3,2,4,9,1,4,7,12,9,12,12,10,6,9,9,5,12,7,12,7,6,8,7,9,5,3,5,9,8,9,12,5,8,4,11,8,6,8
74393,0,8,9,9,7,12,7,12,12,2,9,7,10,7,9,9,9,9,6,4,9,5,6,4,8,8,5,3,5,6,4,1,12,8,12,12,3,8,6,11,5
74394,0,5,9,6,2,4,6,5,6,7,12,8,9,7,9,10,3,9,1,9,8,9,12,7,3,5,12,12,4,12,4,8,9,5,9,12,8,11,6,8,7
74395,1,7,6,7,6,5,2,9,7,1,7,9,12,6,3,9,3,12,10,12,9,9,8,4,12,4,9,6,8,4,9,5,8,12,11,12,8,5,9,8,5
The first entry is the index and the second is meaningless for this analysis, and in the following code I remove them. My file has hundreds of thousands of lines like this, and I need to figure out which number appears in each part of the line the most, as these correspond to slots.
Expected output:
0: 1
1: 11
2: 5
...
40: 9
What I've got so far:
from mrjob.job import MRJob
from mrjob.step import MRStep

class topPieceSlot(MRJob):
    def mapper(self, _, line):
        pieces = line.split(',')
        pieces = pieces[2::]
        for item in range(len(pieces)):
            yield str(item)

    def reducer(self, pieces):
        for slot in range(len(pieces)):
            element = str(slot)
            numElements = 0
            for x in pieces:
                total += x
                numElements += 1
            yield element, numElements

if __name__ == '__main__':
    topPieceSlot.run()
And it returns nothing. It tells me it needs more than one value to unpack, but I'm not sure why it is only getting one value or if it's even right to begin with. Should I be using 40 variables for it? That seems inefficient and wrong.
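One way to think about the task is that the mapper should emit a (slot, piece) pair for every column, and the reducer should receive a slot key plus all the pieces seen for that slot. Below is a minimal sketch of that idea in plain Python rather than mrjob itself: mapper and reducer are written as ordinary functions without self, and run_local is a hypothetical stand-in for the grouping step a MapReduce framework would do between the two phases.

```python
from collections import Counter, defaultdict

def mapper(_, line):
    # Drop the index and the unused second field, then emit one
    # (slot, piece) pair per remaining column.
    pieces = line.split(',')[2:]
    for slot, piece in enumerate(pieces):
        yield slot, int(piece)

def reducer(slot, pieces):
    # `pieces` holds every value seen for this slot; keep the most frequent.
    most_common_piece, _count = Counter(pieces).most_common(1)[0]
    yield slot, most_common_piece

def run_local(lines):
    # Hypothetical helper: groups mapper output by key, which is what the
    # framework's shuffle step would normally do, then calls the reducer.
    grouped = defaultdict(list)
    for line in lines:
        for slot, piece in mapper(None, line):
            grouped[slot].append(piece)
    results = {}
    for slot in sorted(grouped):
        for key, value in reducer(slot, grouped[slot]):
            results[key] = value
    return results
```

With this shape there is no need for 40 separate variables: the slot number is the key, so each slot is counted independently.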

Related

Extract words from random strings

Below I have some strings in a list:
some_list = ['a','l','p','p','l','l','i','i','r','i','r','a','a']
Now I want to take the word april from this list. The list only holds enough letters for two complete copies of april, so I want to take those two copies and append them to another list, extract.
So the extract list should look something like this:
extract = ['aprilapril']
or
extract = ['a','p','r','i','l','a','p','r','i','l']
I tried many times to get everything in extract in order, but I still can't seem to get it.
But I know I can just do this:
a_count = some_list.count('a')
p_count = some_list.count('p')
r_count = some_list.count('r')
i_count = some_list.count('i')
l_count = some_list.count('l')
total_count = [a_count,p_count,r_count,i_count,l_count]
smallest_count = min(total_count)
extract = ['april' * smallest_count]
But I wouldn't be here if I could just use the code above, because I made some rules for solving this problem:
Each of the characters (a, p, r, i and l) is a magical code element. These code elements can't be created out of thin air; each one is unique and has its own identifier, like a secret number associated with it. Since you don't know how to create these magical code elements, the only way to get them is to extract them into a list.
Each of the characters (a, p, r, i and l) must be in order. Imagine they form a chain that only works when its links are together: p goes right next to a, and l must come last.
These important code elements are top-secret stuff, so if you want them, the only way is to extract them into a list.
Below are some examples of an incorrect way to do this (breaking the rules):
import re

word = 'april'
some_list = ['aaaaaaappppppprrrrrriiiiiilll']
regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, some_list[0])
if match:
    lowest_amount = min(len(g) for g in match.groups())
    print(word * lowest_amount)
else:
    print("no match")

from collections import Counter

def count_recurrence(kernel, string):
    # we need to count both strings
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    effective_counter = {
        k: int(string_counter.get(k, 0)/v)
        for k, v in kernel_counter.items()
    }
    min_recurring_count = min(effective_counter.values())
    return kernel * min_recurring_count
This might sound really stupid, but this is actually a hard problem (well, for me). I originally designed it for myself to practice Python, but it turned out to be way harder than I thought. I just want to see how other people would solve it.
If anyone out there knows how to solve this ridiculous problem, please help me out. I am just a fourteen-year-old trying to learn Python. Thank you very much.
I'm not sure what you mean by "cannot copy nor delete the magical codes" - if you want to put them in your output list you will need to "copy" them somehow.
And by the way, your example code (a_count = some_list.count('a') etc.) won't work when the list holds a single string, as in some_list = ['aaaaaaappppppprrrrrriiiiiilll'], since count will always return zero.
That said, a possible solution is
magics = 'april'  # the code elements to extract, in order
worklist = [c for c in some_list[0]]
extract = []
fail = False
while not fail:
    lastpos = -1
    tempextract = []
    for magic in magics:
        if magic in worklist:
            pos = worklist.index(magic, lastpos+1)
            tempextract.append(worklist.pop(pos))
            lastpos = pos-1
        else:
            fail = True
            break
    else:
        extract.append(tempextract)
Alternatively, if you don't want to pop the elements when you find them, you can compute the positions of all the occurrences of the first element (the "a") and set lastpos to each of those positions at the beginning of each iteration.
This may not be the most efficient way, but the code works and makes the program logic more explicit:
some_list = ['aaaaaaappppppprrrrrriiiiiilll']
word = 'april'
extract = []
remove = []
string = some_list[0]
for x in range(len(some_list[0])//len(word)): #maximum number of times `word` can appear in `some_list[0]`
    pointer = i = 0
    while i<len(word):
        j=0
        while j<(len(string)-pointer):
            if string[pointer:][j] == word[i]:
                extract.append(word[i])
                remove.append(pointer+j)
                i+=1
                pointer = j+1
                break
            j+=1
        if i==len(word):
            for r_i,r in enumerate(remove):
                string = string[:r-r_i] + string[r-r_i+1:]
            remove = []
        elif j==(len(string)-pointer):
            break
print(extract,string)
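For the list-of-single-characters input from the question, the "move, don't create" rule can also be followed by popping each element out of the source and appending that same element to the output. The following is only a sketch under that reading of the rules; extract_word is a hypothetical name, and it assumes the letters may sit anywhere in the source list as long as they end up in order in the result.

```python
def extract_word(items, word):
    """Move complete copies of `word` out of `items`, one element at a time."""
    remaining = list(items)
    extracted = []
    while True:
        taken = []
        for ch in word:
            if ch in remaining:
                # Move the existing element instead of creating a new one.
                taken.append(remaining.pop(remaining.index(ch)))
            else:
                # Incomplete copy: put the pieces back and stop.
                remaining.extend(taken)
                return extracted, remaining
        extracted.extend(taken)

some_list = ['a','l','p','p','l','l','i','i','r','i','r','a','a']
extract, leftover = extract_word(some_list, 'april')
```

Here extract holds the letters of two copies of april in order, and leftover holds the elements that could not form a third copy.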

Python: iterables & generators to replace my while true loops?

Let's start with my question: can you write better code than the one below?
FRAME_DELIMITER = b'\x0a\x0b\x0c\x0d'

def get_data():
    f = bytearray()
    # detect frame delimiter
    while True:
        f += read_byte()
        if f[-4:] == FRAME_DELIMITER:
            start = len(f)-2
            break
    # read data until next frame delimiter
    while True:
        f += self._read_byte()
        if f[-4:] == FRAME_DELIMITER:
            return f[start:-2]
In a few words, this code reads a data flow and returns an entire frame. Each frame is delimited by 0x0a 0x0b 0x0c. The read_byte function reads one byte from the data flow (it might be more convenient to retrieve a buffer of x bytes).
I had a look at the Python documentation to try writing this code in a more Pythonic way (and with better performance?).
I came across generators and iterators.
We could imagine creating a generator like this one:
def my_generator(self):
    while True:
        yield self._read_byte()
and playing around with itertools like this:
f = b''.join(itertools.takewhile(lambda c: c != b'\x03', self.my_generator()))
But in fact I'm stuck, because I need to check for a delimiter pattern and not just one character.
Could you point me in the right direction? Or maybe my code above is just what I need?!
Thanks!
It's not practical to perform the test you're going for without some state, but you can hide the state in your generator!
You could make your generator read the frame itself, assuming the delimiter is a constant value (or you pass in the required delimiter). A collections.deque can allow it to easily preserve state only for the last four characters, so it's not just hiding large data storage in state:
def read_until_delimiter(self):
    # Note: If FRAME_DELIMITER is bytes, and _read_byte returns len 1 bytes objects
    # rather than ints, you'll need to tweak the definition of frame_as_deque to make it store bytes
    frame_as_deque = collections.deque(FRAME_DELIMITER)
    window = collections.deque(maxlen=len(FRAME_DELIMITER))
    while window != frame_as_deque:
        byte = self._read_byte()
        yield byte
        window.append(byte) # Automatically ages off bytes to keep constant length after filling
Now your caller can just do:
f = bytearray(self.read_until_delimiter())
# Or bytearray().join(self.read_until_delimiter()) if reading bytes objects, not ints
start = len(f) - 2
Note: I defined the maxlen in terms of the length of FRAME_DELIMITER; your end of delimiter would almost never pass, because you sliced off the last four bytes, and compared them to a constant containing only three bytes.
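To make the sliding-window idea concrete, here is a small self-contained sketch that can be run without the original class. It assumes the data arrives as an iterable of ints (such as a bytes object) and that only the payload between two delimiters is wanted, which differs slightly from the f[start:-2] slice in the question; read_frame is a hypothetical name.

```python
import collections

FRAME_DELIMITER = b'\x0a\x0b\x0c'

def read_frame(byte_source, delimiter=FRAME_DELIMITER):
    """Return the bytes between the next two delimiters, or None if not found."""
    byte_source = iter(byte_source)          # lets callers pass bytes directly
    target = collections.deque(delimiter)    # iterating bytes yields ints
    window = collections.deque(maxlen=len(delimiter))

    # Skip everything up to and including the first delimiter.
    for byte in byte_source:
        window.append(byte)
        if window == target:
            break
    else:
        return None

    # Collect bytes until the window matches the delimiter again.
    window.clear()
    frame = bytearray()
    for byte in byte_source:
        frame.append(byte)
        window.append(byte)
        if window == target:
            return bytes(frame[:-len(delimiter)])
    return None
```

The deque with maxlen keeps only the last few bytes for comparison, so the delimiter check does not need to slice the growing buffer on every byte.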
I take "better code" to mean code that doesn't slice the concatenated byte sequence, but instead uses a smart generator and only one while loop:
# just to simulate your method
data = b'AA\x0a\x0b\x0cBBqfdqsfqsfqsvcwccvxcvvqsfq\x0a\x0b\x0cqsdfqs'
index = -1

def get_bytes():
    # you used two methods
    # return read_byte() if count == 2 else self._read_byte()
    global index
    index += 1
    return data[index:index + 1]

FRAME_DELIMITER = b'\x0a\x0b\x0c'

def get_data():
    def update_last_delimiter(v):
        """Update the delimiter window with the last element read."""
        nonlocal last_four_byte
        if len(last_four_byte) < len(FRAME_DELIMITER):
            last_four_byte += v
        else:
            last_four_byte = last_four_byte[1:] + v

    count = 2
    last_four_byte = b''
    while True:
        # because you have two methods to extract bytes
        # replace get_bytes() by (read_byte() if count == 2 else self._read_byte())
        update_last_delimiter(get_bytes())
        # only yield items once the first DELIMITER is found
        if count < 2:
            yield last_four_byte[1:2]
        if last_four_byte == FRAME_DELIMITER:
            count -= 1
            if not count:
                break
            else:
                # when the first DELIMITER is found we should yield the [-2] element
                yield last_four_byte[1:2]

print(b''.join(get_data()))
# b'\x0b\x0cBBqfdqsfqsfqsvcwccvxcvvqsfq\n\x0b'
The key here is to keep track of the last DELIMITER bytes

Read N lines from a file

So for class we have to start our problem by doing this:
Write a function that takes as its input a filename and an integer. The function should open the file and read in the first number of lines given as the second argument. (You'll need to have a variable to use as a counter for this part.)
It's very basic and I figure a loop is needed, but I can't figure out how to incorporate a loop into the question. What I've tried doesn't work; it's been about 3 hours and the best I can come up with is
def filewrite(textfile,line):
    infile=open(textfile,'r',encoding='utf-8')
    text=infile.readline(line)
    print(text)
However, that doesn't get me to what I need for the function. It's still early in my intro to Python class, so basic code is all we have worked with.
There are two basic looping strategies you could use here:
you could count up to n, reading lines as you go
you could read lines from file, keeping track of how many you've read, and stop when you reach a certain number.
def filewrite(textfile, n):
    with open(textfile) as infile:
        for _ in range(n):
            print(infile.readline(), end='')
        print()

def filewrite(textfile, n):
    with open(textfile) as infile:
        counter = 0
        for line in infile:
            if counter >= n:
                break
            print(line, end='')
            counter += 1
The first is obviously more readable, and since readline will just return an empty string if it runs out of lines, it's safe to use even if the user asks for more lines than the infile has.
Here I'm also using a context manager to make sure the files are closed when I'm done with them.
Here's a version without the stuff you don't recognize:
def filewrite(textfile, n):
    infile = open(textfile)
    count = 0
    while count < n:
        print(infile.readline())
        count += 1
    infile.close()
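Once loops and counters feel comfortable, the same "stop after n lines" behaviour can be had from the standard library. This is just a sketch of an alternative, not what the assignment asks for: itertools.islice does the counting, and first_lines is a hypothetical name that returns the lines instead of printing them.

```python
from itertools import islice

def first_lines(textfile, n):
    # islice stops after n items, or earlier if the file runs out of lines.
    with open(textfile, encoding='utf-8') as infile:
        return list(islice(infile, n))
```

Like the readline version above, this is safe when n is larger than the number of lines in the file.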

Handling variable length command tuple with try...except

I'm writing a Python 3 script that does tabulation for forestry timber counts.
The workers will radio the species, diameter, and height in logs of each tree they mark to the computer operator. The computer operator will then enter a command such as this:
OAK 14 2
which signifies that the program should increment the count of Oak trees of fourteen inches in diameter and two logs in height.
However, the workers also sometimes call in more than one of the same type of tree at a time. So the program must also be able to handle this command:
OAK 16 1 2
which would signify that we're increasing the count by two.
The way I have the parser set up is thus:
key=cmdtup[0]+"_"+cmdtup[1]+"_"+cmdtup[2]
try:
    trees[key]=int(trees[key])+int(cmdtup[3])
except KeyError:
    trees[key]=int(cmdtup[3])
except IndexError:
    trees[key]=int(trees[key])+1
If the program is commanded to store a tree it hasn't stored before, a KeyError will go off, and the handler will set the dict entry instead of increasing it. If the third parameter is omitted, an IndexError will be raised, and the handler will treat it as if the third parameter was 1.
Issues occur, however, if we're in both situations at once; the program hasn't heard of Oak trees yet, and the operator hasn't specified a count. KeyError goes off, but then generates an IndexError of its own, and Python doesn't like it when exceptions happen in exception handlers.
I suppose the easiest way would be to simply remove one or the other except and have its functionality be done in another way. I'd like to know if there's a more elegant, Pythonic way to do it, though. Is there?
I would do something like this:
def parse(cmd, trees):
    res = cmd.split() # split the string by spaces, yielding a list of strings
    if len(res) == 3: # if we got 3 parameters, set the fourth to 1
        res.append(1)
    for i in range(1,4): # convert parameters 1-3 to integers
        res[i] = int(res[i])
    key = tuple(res[x] for x in range(3)) # convert list to tuple, as lists cannot be dictionary indexes
    trees[key] = trees.get(key,0) + res[3] # increase the number of entries, creating it if needed

trees={}
# test data
parse("OAK 14 2", trees)
parse("OAK 16 1 2", trees)
parse("OAK 14 2", trees)
parse("OAK 14 2", trees)
# print result
for tree in trees:
    print(tree, "=", trees[tree])
yielding
('OAK', 16, 1) = 2
('OAK', 14, 2) = 3
Some notes:
no error handling here, you should handle the case when a value supposed to be a number isn't or the input is wrong in any other way
instead of strings, I use tuples as a dictionary index
You could use collections.Counter, which returns 0 rather than a KeyError if the key isn't in the dictionary.
Counter Documentation:
Counter objects have a dictionary interface except that they return a zero count for missing items instead of raising a KeyError
Something like this:
from collections import Counter

counts = Counter()

def update_counts(counts, cmd):
    cmd_list = cmd.split()
    if len(cmd_list) == 3:
        tree = tuple(cmd_list)
        n = 1
    else:
        *tree, n = cmd_list
        tree = tuple(tree)  # lists are unhashable, so the key must be a tuple
        n = int(n)          # the count arrives as a string
    counts[tree] += n
The same notes apply as in uselpa's answer. Another nice thing about Counter is that if you want to, e.g., look at weekly counts, you can just do something like sum(daily_counts, Counter()).
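As a small illustration of combining daily tallies: Counter objects can be added together, and the built-in sum needs a Counter as its start value because the default start of 0 cannot be added to a Counter. The day names and numbers below are made up for the example.

```python
from collections import Counter

# Hypothetical daily tallies keyed by (species, diameter, height).
monday = Counter({('OAK', '14', '2'): 3})
tuesday = Counter({('OAK', '14', '2'): 1, ('OAK', '16', '1'): 2})

# Start from an empty Counter so the additions are Counter + Counter.
weekly = sum([monday, tuesday], Counter())
```

After this, weekly maps each tree key to its combined count across both days.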
Counter works even better if you're starting from a list of commands:
from collections import Counter
from itertools import repeat

def parse(commands):
    for c in commands:
        if len(c) == 3:
            yield tuple(c)
        elif len(c) == 4:
            # repeat the three-field key once per tree called in
            yield from repeat(tuple(c[0:3]), times=int(c[3]))

raw_commands = get_commands() # perhaps read a file
command_lists = [c.split() for c in raw_commands]
counts = Counter(parse(command_lists))
From there you can use the update_counts function above to add new trees, or you can start collecting the commands in another text file and then generate a second Counter object for the next day, the next week, etc.
In the end, the best way was to simply remove the IndexError handler, change cmdtup to a list, and insert the following:
if len(cmdtup) >= 3:
    cmdtup.append(1)
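Put together, that final approach might look roughly like the sketch below. The record function name is hypothetical, the key format follows the question's underscore-joined string, and the default count is appended only when exactly three fields are given.

```python
def record(trees, command):
    cmdtup = command.split()      # split() already returns a list
    if len(cmdtup) == 3:
        cmdtup.append(1)          # no count given, so assume one tree
    key = "_".join(cmdtup[:3])
    try:
        trees[key] = int(trees[key]) + int(cmdtup[3])
    except KeyError:
        # First time this species/diameter/height combination is seen.
        trees[key] = int(cmdtup[3])

trees = {}
for command in ["OAK 14 2", "OAK 16 1 2", "OAK 14 2"]:
    record(trees, command)
```

Because the count is always present by the time the try block runs, only the KeyError case is left to handle.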

Making string series in Python

I have a problem in Python I simply can't wrap my head around, even though it's fairly simple (I think).
I'm trying to make "string series". I don't really know what it's called, but it goes like this:
I want a function that makes strings that run in series, so that every time the function gets called it "counts" up once.
I have a list with "a-z0-9._-" (a to z, 0 to 9, dot, underscore, dash). And the first string I should receive from my method is aaaa, next time I call it, it should return aaab, next time aaac etc. until I reach ----
Also the length of the string is fixed for the script, but should be fairly easy to change.
(Before you look at my code, I would like to apologize if my code doesn't adhere to conventions; I started coding Python some days ago so I'm still a noob).
What I've got:
Generating my list of available characters
chars = []
for i in range(26):
    chars.append(str(chr(i + 97)))
for i in range(10):
    chars.append(str(i))
chars.append('.')
chars.append('_')
chars.append('-')
Getting the next string in the sequence
iterationCount = 0
nameLen = 3
charCounter = 1

def getString():
    global charCounter, iterationCount
    name = ''
    for i in range(nameLen):
        name += chars[((charCounter + (iterationCount % (nameLen - i) )) % len(chars))]
        charCounter += 1
    iterationCount += 1
    return name
And it's the getString() function that needs to be fixed, specifically the way name gets built.
I have this feeling that it's possible by using the right "modulo hack" in the index, but I can't make it work as intended!
What you are trying to do can be done very easily using generators and itertools.product:
import itertools
def getString(length=4, characters='abcdefghijklmnopqrstuvwxyz0123456789._-'):
    for s in itertools.product(characters, repeat=length):
        yield ''.join(s)

for s in getString():
    print(s)
aaaa
aaab
aaac
aaad
aaae
aaaf
...
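The "modulo hack" from the question can also be made to work by treating the call count as a number written in base len(chars). The sketch below is one way to do that; nth_string and CHARS are hypothetical names, and the character order is assumed to be a-z, 0-9, dot, underscore, dash as described in the question.

```python
import string

CHARS = string.ascii_lowercase + string.digits + '._-'

def nth_string(n, length=4, characters=CHARS):
    """Return the n-th fixed-length string in the series (n starts at 0)."""
    base = len(characters)
    digits = []
    for _ in range(length):
        # Peel off the least significant "digit" each time round.
        n, digit = divmod(n, base)
        digits.append(characters[digit])
    return ''.join(reversed(digits))
```

Keeping a counter and calling nth_string(counter) on each request gives the same sequence as the generator above, and also allows jumping straight to any position in the series.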
