Just a fundamental question regarding Python and the .join() method:
import difflib

file1 = open(f1, "r")
file2 = open(f2, "r")
file3 = open("results", "w")

diff = difflib.Differ()
result = diff.compare(file1.read(), file2.read())
file3.write("".join(result))
The above snippet yields nice output stored in a file called "results", in string format, showing the differences between the two files line by line. However, I noticed that if I just print result without using .join(), the interpreter prints a message that includes a memory address. And when I tried to write the result to the file without using .join(), the interpreter told me that only strings and character buffers may be written, not generator objects. So based on the evidence I have gathered, please correct me if I am wrong:
result = diff.compare(file1.read(),file2.read()) <---- result is a generator object?
result is a list of strings, with result itself being the reference to the first string?
.join() takes a memory address and points to the first, and then iterates over the rest of the addresses of strings in that structure?
A generator object is an object that returns a pointer?
I apologize if my questions are unclear, but I basically wanted to ask the python veterans if my deductions were correct. My question is less about the observable results, and more so about the inner workings of python. I appreciate all of your help.
join is a method of strings. It takes any iterable, iterates over it, and joins the contents together. (The contents have to be strings, or it will raise an exception.)
If you print the generator object, or write its string representation to the file, you just get a representation of the generator object itself, not its contents. join "unrolls" the contents of the generator.
You can see what is going on with a simple, explicit generator:
def gen():
    yield 'A'
    yield 'B'
    yield 'C'
>>> g = gen()
>>> print g
<generator object gen at 0x0000000004BB9090>
>>> print ''.join(g)
ABC
The generator doles out its contents one at a time. If you try to look at the generator itself, it doesn't dole anything out and you just see it as "generator object". To get at its contents, you need to iterate over them. You can do this with a for loop, with the next function, or with any of various other functions/methods that iterate over things (str.join among them).
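For instance, continuing the sketch above, you can pull the values out one at a time with next (using a fresh generator, since the one above is already exhausted):
>>> g = gen()
>>> next(g)
'A'
>>> next(g)
'B'
>>> list(g)   # consumes whatever is left
['C']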
When you say that result "is a list of strings" you are getting close to the idea. A generator (or iterable) is sort of like a "potential list". Instead of actually being a list of all its contents at once, it lets you peel off each item one at a time.
None of the objects is a "memory address". The string representation of a generator object (like that of many other objects) includes a memory address, so if you print it (as above) or write it to a file, you'll see that address. But that doesn't mean that object "is" that memory address, and the address itself isn't really usable as such. It's just a handy identifying tag so that if you have multiple objects you can tell them apart.
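In CPython, for instance, the number in that repr is simply the object's id(); the exact value will differ on your machine:
>>> g = gen()
>>> hex(id(g))   # matches the address shown in the generator's repr
'0x4bb9090'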
Here’s a quick example of what I’m trying to do and the error I’m getting:
for symbol in itertools.product(list_a, repeat=8):
    list_b.append(symbol)
Afterwards, I’m also excluding combinations from that list, like so:
for combination in list_b:
    valid_b = True
    for symbols in range(len(list_exclude)):
        if list_exclude[symbols] in combination:
            valid_b = False
        else:
            pass
    if valid_b:
        new_list.append(combination)
I’ve heard that chunking the process might somehow help, though I'm not sure how that could be done here.
I’m using multiprocessing for this as well.
When I run it I get “MemoryError”
How would you go about it?
Don't pre-compute anything, especially not the first full list:
import itertools

def symbols(lst, exclude):
    for symbol in map(''.join, itertools.product(lst, repeat=8)):
        if any(map(symbol.__contains__, exclude)):
            continue
        yield symbol
Now use the generator to evaluate the elements lazily, as you need them. Keep in mind that since it pre-filters the data, even list(symbols(list_a, list_exclude)) will be much cheaper than what you originally wrote.
Here is a breakdown of what happens:
itertools.product returns a lazy iterator. That means it produces its output one element at a time, without retaining references to previous items. Each element it returns is a tuple containing some combination of the input elements.
Since you want to compare strings, you need to convert the tuples. Hence, ''.join. Mapping it onto each of the tuples that itertools.product produces converts those elements into strings. For example:
>>> ''.join(('$', '$', '&', '&', '♀', '#', '%', '$'))
'$$&&♀#%$'
Filtering each symbol thus created can be done by checking whether any of the items in exclude are contained in it. You can do this with something like
[ex in symbol for ex in exclude]
The operation ... in symbol is implemented via the magic method symbol.__contains__. You can therefore map that method to every element of exclude.
Since the first element of exclude that is contained in symbol invalidates it, you don't need to check the remainder. This is called short-circuiting, and it is implemented in the any function. Notice that because map returns a lazy iterator, the remaining elements will not actually be computed once a match is found. This is different from using a list comprehension, which pre-computes all the elements.
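A quick, hypothetical demo of that short-circuiting (noisy is made up for illustration):
def noisy(ex):
    print("checking", ex)
    return ex in "$$&&#%$"

# any() stops at the first True; because map() is lazy,
# "@" and "!" are never even checked.
any(map(noisy, ["$", "@", "!"]))  # prints only: checking $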
Putting yield into your function turns it into a generator function. That means that when you call symbols(...), it returns a generator object that you can iterate over. This object does not pre-compute anything until you call next on it. So if you write the data to a file (for example), only the current result will be in memory at once. It may take a long time to write out a large number of results but your memory usage should not spike at all from it.
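As a minimal sketch of that lazy consumption (assuming list_a and list_exclude from the question, and a hypothetical output file name):
with open("combinations.txt", "w") as out:
    for symbol in symbols(list_a, list_exclude):
        out.write(symbol + "\n")  # only one symbol is in memory at a time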
This little change I made could save you a bit of RAM: iterate over list_exclude directly, and break out of the inner loop as soon as an excluded symbol is found.
for combination in list_b:
    valid_b = True
    for symbols in list_exclude:
        if symbols in combination:
            valid_b = False
            break  # no need to check the remaining excludes
    if valid_b:
        new_list.append(combination)
I need to turn a list of various entities into strings. So far I use:
all_ents_dead = []  # converted to strings
for i in all_ents:
    all_ents_dead.append(str(i))
Is there an optimized way of doing that?
EDIT: I then need to find which of these contain a certain string. So far I have:
matching = [s for s in all_ents_dead if "GROUPS" in s]
Whenever you have the pattern name = [] followed by name.append() in a loop, consider using a list comprehension. A list comprehension builds a list from a loop without the repeated list.append() lookups and calls, making it faster:
all_ents_dead = [str(i) for i in all_ents]
This directly echoes the code you had, but with the expression inside all_ents_dead.append(...) moved to the front of the for loop.
If you don't actually need a list, but only need to iterate over the str() conversions you should consider lazy conversion options. You can turn the list comprehension in to a generator expression:
all_ents_dead = (str(i) for i in all_ents)
or, when only applying a function, the faster alternative in the map() function:
all_ents_dead = map(str, all_ents) # assuming Python 3
both of which lazily apply str() as you iterate over the resulting object. This helps avoid creating a new list object where you don't actually need one, saving on memory. Do note that a generator expression can be slower however; if performance is at stake consider all options based on input sizes, memory constraints and time trials.
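For instance, a quick way to run such time trials (timeit with a hypothetical input size):
import timeit

setup = "all_ents = list(range(10000))"
print(timeit.timeit("[str(i) for i in all_ents]", setup=setup, number=100))
print(timeit.timeit("list(map(str, all_ents))", setup=setup, number=100))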
For your specific search example, you could just embed the map() call:
matching = [s for s in map(str, all_ents) if "GROUPS" in s]
which would produce a list of matching strings, without creating an intermediary list of string objects that you then don't use anywhere else.
Use the map() function. This will take your existing list, run a function on each item, and return a new list/iterator (see below) with the result of the function applied on each element.
all_ents_dead = map(str, all_ents)
In Python 3+, map() will return an iterator, while in Python 2 it will return a list. An iterator can have optimisations you desire since it generates the values when demanded, and not all at once (as opposed to a list).
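For example, in Python 3 (the address in the repr will vary):
>>> m = map(str, [1, 2, 3])
>>> m                     # an iterator, not a list
<map object at 0x7f...>
>>> list(m)
['1', '2', '3']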
I am making a program that works on chat files under the following general form:
Open file & import lines (via readlines)
Do an initial pass to turn the list of strings into better-formed data (currently dicts) by:
Throwing out malformed lines
Separating out the usernames from the text of the message
Throwing out lines for ignored (typically bot) users
Marking which lines are /me commands
Do a second pass over the list of dicts to apply various manipulations on them, such as:
Replacing every mention of a nick with its alias
Applying special formatting to /me commands
Rather than have multiple switches in the config file and lots of if-statement checks within the loops, I believe the program would be cleaner if I generated the list of functions elsewhere, then fed the program the list of dicts (or strings, depending on which part of the program I'm at) along with a list of functions, such that each function in the list gets applied to each item in the list of objects.
It seems that this would probably be a good case for list comprehensions if I were only applying a single function to each item, but I don't want to do a separate pass through the log for every function I want to call. However, this answer notes that list comprehensions probably aren't what I want, since they return a completely new list rather than modifying in place.
Is my best option to have two variants of the following?
for item in list:
    item = a(b(c(d(item, dparam1, dparam2), cparam)), aparams)
For readability, I'd put each function on its own line, like:
for item in list:
    item = d(item, dparam1, dparam2)
    item = c(item, cparam)
    item = b(item)
    item = a(item, aparams)
However, the above doesn't eliminate the need for if checks on all the switches and wouldn't allow for applying function a at two different places unless I explicitly add a switch to do so.
Based on your sample code, you might try this approach:
from functools import reduce  # reduce is a builtin in Python 2

func_list = [d, c, b, a]

for idx, item in enumerate(item_list):
    item_list[idx] = reduce(lambda v, func: func(v), func_list, item)
Each func handles the data in turn, which makes the loop behave like a pipeline.
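A minimal, self-contained sketch of that pattern (drop_prefix and strip_line are hypothetical stand-ins for your a/b/c/d functions):
from functools import reduce

def drop_prefix(item):
    return item.lstrip(">")

def strip_line(item):
    return item.strip()

func_list = [drop_prefix, strip_line]
item_list = ["> hello ", ">> world "]

for idx, item in enumerate(item_list):
    item_list[idx] = reduce(lambda v, func: func(v), func_list, item)

print(item_list)  # ['hello', 'world']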
I am using Python 3.5 to create a set of generators to parse a set of opened files, in order to cherry-pick data from those files to construct an object I plan to export later. I was originally parsing through the entirety of each file and creating a list of dictionary objects before doing any analysis, but this process could take up to 30 seconds. Since I only need to work with each line of each file once, I figure it's a great opportunity to use a generator. However, I feel that I am missing something conceptually with generators, and perhaps with the mutability of objects within a generator.
My original code that makes a list of dictionaries goes as follows:
parsers = {}

# iterate over files in the file_name file to get their attributes
for dataset, data_file in files.items():
    # Store each dataset as a list of dictionaries with keys that
    # correspond to the attributes of that dataset
    parsers[dataset] = [{attributes[dataset][i]: value.strip('~')
                         for i, value in enumerate(line.strip().split('^'))}
                        for line in data_file]
And I access the list by calling:
>>> parsers['definitions']
And it works as expected, returning a list of dictionaries. However, when I convert this list into a generator, all sorts of weirdness happens.
parsers = {}

# iterate over files in the file_name file to get their attributes
for dataset, data_file in files.items():
    # Store each dataset as a generator of dictionaries with keys that
    # correspond to the attributes of that dataset
    parsers[dataset] = ({attributes[dataset][i]: value.strip('~')
                         for i, value in enumerate(line.strip().split('^'))}
                        for line in data_file)
And I call it by using:
>>> next(parsers['definitions'])
Running this code raises an IndexError: list index out of range.
The main difference I can see between the two code segments is that in the list comprehension version, Python constructs the list from the file and moves on, without needing to store the comprehension's variables for later use.
Conversely, in the generator expression the variables defined within the generator need to be stored with the generator, since they affect each successive call of the generator later in my code. I am thinking that perhaps the variables inside the generator share a namespace with the other generators my code creates, so each generator behaves erratically depending on whichever generator expression was run last, and therefore set the values of the variables last.
I appreciate any thoughts as to the reason for this issue!
I assume that the problem is in how you're building the dictionaries:
attributes[dataset][i]
Note that with the list version, dataset is whatever dataset was at that particular turn of the for loop. However, with the generator, that expression isn't evaluated until after the for loop has completed, so dataset will have the value of the last dataset from the files.items() loop...
Here's a super simple demo that hopefully elaborates on the problem:
results = []

for a in [1, 2, 3]:
    results.append(a for _ in range(3))

for r in results:
    print(list(r))
Note that we always get [3, 3, 3] because when we take the values from the generator, the value of a is 3.
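One common fix (my addition; the answer above only diagnoses the problem) is to bind the current value at definition time, for example via a generator function's default argument:
results = []

for a in [1, 2, 3]:
    def gen(a=a):  # the default argument captures a's current value
        for _ in range(3):
            yield a
    results.append(gen())

for r in results:
    print(list(r))  # [1, 1, 1], then [2, 2, 2], then [3, 3, 3]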
I'm thoroughly puzzled. I have a block of HTML that I scraped out of a larger table. It looks about like this:
<td align="left" class="page">Number:\xc2\xa0<a class="topmenu" href="http://www.example.com/whatever.asp?search=724461">724461</a> Date:\xc2\xa01/1/1999 Amount:\xc2\xa0$2.50 <br/>Person:<br/><a class="topmenu" href="http://www.example.com/whatever.asp?search=LAST&searchfn=FIRST">LAST,\xc2\xa0FIRST </a> </td>
(Actually, it looked worse, but I regexed out a lot of line breaks)
I need to get the lines out, and break up the Date/Amount line. It seemed like the place to start was to find the children of that block of HTML. The block is a string because that's how regex gave it back to me. So I did:
text_soup = BeautifulSoup(text)
text_children = text_soup.find('td').childGenerator()
I've worked out that I can only iterate through text_children once, though I don't understand why that is. It's a listiterator type, which I'm struggling to understand.
I'm used to being able to assume that if I can iterate through something with a for loop I can call on any one element with something like text_children[0]. That doesn't seem to be the case with an iterator. If I create a list with:
my_array = ["one","two","three"]
I can use my_array[1] to see the second item in the array. If I try to do text_children[1] I get an error:
TypeError: 'listiterator' object is not subscriptable
How do I get at the contents of an iterator?
You can easily construct a list from the iterator:
my_list = list(your_generator)
Now you can subscript the elements:
print(my_list[1])
Another way to get the values is by using next. That pulls the next value from the iterator, but as you've already discovered, once you pull a value out of the iterator, you can't always put it back (whether you can depends entirely on the object being iterated over and what its next method actually does).
The reason for this is that often you just want an object you can iterate over. Iterators are great for that because they compute the elements one at a time rather than storing all of the values. In other words, only one element from the iterator consumes your system's memory at a time, versus a list or tuple, where all of the elements are typically stored in memory before you start iterating.
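For example, with a plain list iterator:
it = iter(["one", "two", "three"])
print(next(it))  # one
print(next(it))  # two
print(list(it))  # ['three'] -- consumes whatever is left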
Let me try to work out a more general answer:
An iterable is an object which can be iterated over. These include lists, tuples, etc. On request, they give an iterator.
An iterator is an object used for iteration. It hands out one value per request, and once it is exhausted, it stays exhausted. Generators and list iterators are iterators, but so are e.g. file objects. Every iterator is itself iterable and returns itself as its own iterator.
Example:
a = []
b = iter(a)
print a, b # -> [] <listiterator object at ...>
If you do
for i in a: ...
a is asked for an iterator via its __iter__() method and this iterator is then queried for the next elements until exhausted. This happens via the .next() (resp. __next__() in 3.x) method.
Indexing is a completely different thing. As iteration can happen via indexing if the object doesn't have an .__iter__() method, every indexable object is iterable, but not vice versa.
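A small sketch of that legacy indexing protocol (my own illustration, not from the answer above):
class Seq:
    # No __iter__ here: iteration falls back to __getitem__,
    # which gets called with 0, 1, 2, ... until it raises IndexError.
    def __getitem__(self, i):
        if i >= 3:
            raise IndexError
        return i * 10

print(list(Seq()))  # [0, 10, 20]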
The short answer, as stated before me, is to just create a list from your generator, like so: list(generator).
The long answer, and the explanation as to why:
When you create a generator, or in your case the 'listiterator' that BeautifulSoup hands back, you are not really creating a list of items. You are creating an object that knows how to step through a certain number of items, one at a time (via next()).
Here is what that means. Instead of what you want, which is, let's say, a book with pages, you get a typewriter. The typewriter can produce the book's pages, but only one page at a time. If you just start at the beginning and look at the pages one at a time, like a for loop does, then it's almost like reading a normal book. But unlike a normal book, once the typewriter is finished with a page, you can't go back; that page is gone.
I hope this makes some sense.