I'm thoroughly puzzled. I have a block of HTML that I scraped out of a larger table. It looks about like this:
<td align="left" class="page">Number:\xc2\xa0<a class="topmenu" href="http://www.example.com/whatever.asp?search=724461">724461</a> Date:\xc2\xa01/1/1999 Amount:\xc2\xa0$2.50 <br/>Person:<br/><a class="topmenu" href="http://www.example.com/whatever.asp?search=LAST&searchfn=FIRST">LAST,\xc2\xa0FIRST </a> </td>
(Actually, it looked worse, but I regexed out a lot of line breaks)
I need to get the lines out, and break up the Date/Amount line. It seemed like the place to start was to find the children of that block of HTML. The block is a string because that's how regex gave it back to me. So I did:
text_soup = BeautifulSoup(text)
text_children = text_soup.find('td').childGenerator()
I've worked out that I can only iterate through text_children once, though I don't understand why that is. It's a listiterator type, which I'm struggling to understand.
I'm used to being able to assume that if I can iterate through something with a for loop I can call on any one element with something like text_children[0]. That doesn't seem to be the case with an iterator. If I create a list with:
my_array = ["one","two","three"]
I can use my_array[1] to see the second item in the array. If I try to do text_children[1] I get an error:
TypeError: 'listiterator' object is not subscriptable
How do I get at the contents of an iterator?
You can easily construct a list from the iterator:
my_list = list(your_generator)
Now you can subscript the elements:
print(my_list[1])
Another way to get a value is by using next. This will pull the next value from the iterator, but as you've already discovered, once you pull a value out of an iterator, you can't always put it back in (whether or not you can depends entirely on the object being iterated over and what its next method actually looks like).
The reason for this is that often you just want an object that you can iterate over. Iterators are great for that, as they calculate the elements one at a time rather than needing to store all of the values. In other words, only one element from the iterator consumes your system's memory at a time, whereas with a list or a tuple all of the elements are typically stored in memory before you start iterating.
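A minimal sketch of both approaches, written for Python 3. A plain list iterator stands in for whatever childGenerator() returned, and the names children, first, rest and second_pass are made up for illustration:

```python
# A list iterator stands in for the object BeautifulSoup handed back.
children = iter(["Number:", "Date:", "Amount:"])

# next() pulls exactly one value and advances the iterator.
first = next(children)

# list() drains whatever is left into a real, indexable list.
rest = list(children)

# The iterator is now exhausted, so a second pass yields nothing.
second_pass = list(children)

print(first)        # Number:
print(rest[1])      # Amount:
print(second_pass)  # []
```

If you need random access to every child, call list() once up front and work with the list from then on.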
Let me try to give a more general answer:
An iterable is an object which can be iterated over. These include lists, tuples, etc. On request, they give an iterator.
An iterator is an object which is used for iteration. It gives a value on each request, and once it is exhausted, it stays exhausted. Generators and list iterators are iterators, and so are, e.g., file objects. Every iterator is iterable and gives itself as its iterator.
Example:
a = []
b = iter(a)
print a, b # -> [] <listiterator object at ...>
If you do
for i in a: ...
a is asked for an iterator via its __iter__() method and this iterator is then queried for the next elements until exhausted. This happens via the .next() (resp. __next__() in 3.x) method.
Indexing is a completely different thing. As iteration can happen via indexing if the object doesn't have an .__iter__() method, every indexable object is iterable, but not vice versa.
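The distinctions above can be checked with a small Python 3 sketch. The Squares class is a hypothetical example of an object that is indexable but defines no __iter__, so iteration falls back to indexing:

```python
class Squares:
    """Indexable, but defines no __iter__ (hypothetical example class)."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        if not 0 <= index < self.n:
            raise IndexError(index)  # IndexError tells the for loop to stop
        return index * index


squares = Squares(4)
via_indexing = [value for value in squares]  # iteration through __getitem__

numbers = [10, 20, 30]
it = iter(numbers)            # a list is iterable: it hands out an iterator
same_object = iter(it) is it  # an iterator returns itself from iter()

try:
    it[0]                     # iterators are not indexable
    indexable = True
except TypeError:
    indexable = False

print(via_indexing)  # [0, 1, 4, 9]
print(same_object)   # True
print(indexable)     # False
```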
The short answer, as already stated, is to create a list from your iterator:
list(generator)
The long answer, and the explanation as to why:
When you create a generator (or, in your case, a 'listiterator', which is the iterator type Beautiful Soup hands back here), you are not really creating a list of items. You are creating an object that knows how to step through a series of items one at a time, via next().
What does that mean?
Instead of what you want, which is, let's say, a book with pages,
you get a typewriter.
The typewriter can create a book with pages, but only one page at a time. If you start at the beginning and look at the pages one at a time, as a for loop does, then yes, it is almost like reading a normal book.
But unlike a normal book, once the typewriter is finished with a page, you can't go backwards; that page is now gone.
I hope this makes some sense.
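To make the typewriter picture concrete, here is a small Python 3 sketch. The typewriter function is a hypothetical generator, and itertools.islice is used as one possible way to reach a single item without building a list; it is an alternative to list(), not something the original code used:

```python
import itertools


def typewriter(pages):
    """Produce one page at a time (hypothetical stand-in for any generator)."""
    for number in range(1, pages + 1):
        yield "page %d" % number


book = typewriter(5)

# Skip the first two pages and take the third; all three are consumed.
third = next(itertools.islice(book, 2, 3))

# Only the pages that have not been typed yet are still available.
remaining = list(book)

print(third)      # page 3
print(remaining)  # ['page 4', 'page 5']
```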
Here’s a quick example of what I’m trying to do and the error I’m getting:
for symbol in itertools.product(list_a, repeat=8):
    list_b.append(symbol)
I’m also afterwards excluding combinations from that list like so:
for combination in list_b:
    valid_b = True
    for symbols in range(len(list_exclude)):
        if list_exclude[symbols] in combination:
            valid_b = False
        else:
            pass
    if valid_b:
        new_list.append(combination)
I’ve heard somehow chunking the process might help, not sure how that could be done here though.
I’m using multiprocessing for this as well.
When I run it I get “MemoryError”
How would you go about it?
Don't pre-compute anything, especially not the first full list:
import itertools

def symbols(lst, exclude):
    for symbol in map(''.join, itertools.product(lst, repeat=8)):
        # skip any candidate that contains an excluded substring
        if any(map(symbol.__contains__, exclude)):
            continue
        yield symbol
Now use the generator as you need to lazily evaluate the elements. Keep in mind that since it's pre-filtering the data, even list(symbols(list_a, list_exclude)) will be much cheaper than what you originally wrote.
Here is a breakdown of what happens:
itertools.product returns a lazy iterator. That means that it produces each output on demand without retaining a reference to any previous items. Each element it returns is a tuple containing some combination of the input elements.
Since you want to compare strings, you need to convert the tuples. Hence, ''.join. Mapping it onto each of the tuples that itertools.product produces converts those elements into strings. For example:
>>> ''.join(('$', '$', '&', '&', '♀', '#', '%', '$'))
'$$&&♀#%$'
Filtering each symbol thus created can be done by checking if any of the items in exclude are contained in it. You can do this with something like
[ex in symbol for ex in exclude]
The operation ... in symbol is implemented via the magic method symbol.__contains__. You can therefore map that method to every element of exclude.
Since the first element of exclude that is contained in symbol invalidates it, you don't need to check the remainder. This is called short-circuiting, and is implemented in the any function. Notice that because map returns a lazy iterator (in Python 3), the remaining elements will not actually be computed once a match is found. This is different from using a list comprehension, which pre-computes all the elements.
Putting yield into your function turns it into a generator function. That means that when you call symbols(...), it returns a generator object that you can iterate over. This object does not pre-compute anything until you call next on it. So if you write the data to a file (for example), only the current result will be in memory at once. It may take a long time to write out a large number of results but your memory usage should not spike at all from it.
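As a rough illustration of consuming such a generator lazily, here is a Python 3 sketch. It repeats the generator with an extra length parameter, which is an assumption added only so the demo runs on a tiny search space, and count_valid is a hypothetical helper:

```python
import itertools


def symbols(lst, exclude, length=8):
    # Same idea as the generator above; `length` is a hypothetical extra
    # parameter so that the example can use a small search space.
    for symbol in map(''.join, itertools.product(lst, repeat=length)):
        if any(ex in symbol for ex in exclude):
            continue
        yield symbol


def count_valid(lst, exclude, length):
    # Consume the generator one item at a time; nothing is accumulated.
    return sum(1 for _ in symbols(lst, exclude, length))


kept = list(symbols("ab", ["aa"], length=3))
total = count_valid("ab", ["aa"], 3)

print(kept)   # ['aba', 'abb', 'bab', 'bba', 'bbb']
print(total)  # 5
```

For the full-size problem you would loop over symbols(list_a, list_exclude) and write or process each result as it arrives, rather than calling list() on it.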
This little change I made could save you a bit of RAM usage.
for combination in list_b:
    valid_b = True
    for symbols in list_exclude:
        if symbols in combination:
            valid_b = False
        else:
            pass
    if valid_b:
        new_list.append(combination)
I encountered some code that gets back an iterable object from the Dynamo database, and I can do:
print [en["student_id"] for en in enrollments]
However, when I do similar things again:
print [en["course_id"] for en in enrollments]
Then the second iteration prints nothing, because the iterable can be iterated only once and it has already reached its end.
The question is, how can we iterate it more than once, for the case of (1) what if it is known to be only several items in the iteration (2) what if we know there will be lots of items (say a million items) in the iteration, and we don't want to cost a lot of additional memory space?
Related is, I looked up rewind, and it seems like it exists for PHP and Ruby, but not for Python?
enrollments is a generator. Either recreate the generator if you need to iterate again, or convert it to a list first:
enrollments = list(enrollments)
Take into account that APIs often use generators to avoid memory bloat; a list must have references to all objects it contains, so all those objects have to exist at the same time. A generator can produce the elements one by one, as needed; your list comprehension discards those objects again once the 'student_id' key has been extracted.
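A minimal Python 3 sketch of the two options just mentioned. Here fetch_enrollments is a hypothetical stand-in for the database call, with made-up rows:

```python
def fetch_enrollments():
    # Hypothetical stand-in for the database query. Each call returns a
    # brand-new generator, which is what "recreating" means here.
    rows = [
        {"student_id": 1, "course_id": "math"},
        {"student_id": 2, "course_id": "art"},
    ]
    for row in rows:
        yield row


# Option 1: recreate the generator for every pass.
student_ids = [en["student_id"] for en in fetch_enrollments()]
course_ids = [en["course_id"] for en in fetch_enrollments()]

# Option 2: materialise once, then iterate as often as needed.
enrollments = list(fetch_enrollments())
first_pass = [en["course_id"] for en in enrollments]
second_pass = [en["course_id"] for en in enrollments]

print(student_ids)  # [1, 2]
print(course_ids)   # ['math', 'art']
```

Option 1 repeats the underlying query, so it trades extra work for low memory use; option 2 does the opposite.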
The alternative is to iterate just once, and do all the things with each object you want to do. So instead of running two list comprehensions, run one regular for loop and extract all the data you need in one place, appending to separate lists as you go along:
courses = []
students = []
for enrollment in enrollments:
    courses.append(enrollment['course_id'])
    students.append(enrollment['student_id'])
rewind in PHP is unrelated to this; Python has fileobj.seek(0) to do the same, but file objects are not generators.
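Another option is itertools.tee:
import itertools
# n must be passed positionally; tee() does not accept keyword arguments
it1, it2 = itertools.tee(enrollments, 2)
This looks like an answer from here: Why can't I iterate twice over the same data?
But it is only worthwhile if you are not going to iterate too many times.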
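A short Python 3 sketch of the tee approach, using a hypothetical enrollments_source generator with made-up rows. Keep in mind that tee buffers every item one copy has consumed and the other has not, so fully draining it1 before touching it2, as below, holds all items in memory much like list() would:

```python
import itertools


def enrollments_source():
    # Hypothetical data standing in for the database result.
    yield {"student_id": 1, "course_id": "math"}
    yield {"student_id": 2, "course_id": "art"}


# Two independent iterators over the same underlying generator.
it1, it2 = itertools.tee(enrollments_source(), 2)

students = [en["student_id"] for en in it1]
courses = [en["course_id"] for en in it2]

print(students)  # [1, 2]
print(courses)   # ['math', 'art']
```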
As an experiment, I did this:
letters=['a','b','c','d','e','f','g','h','i','j','k','l']
for i in letters:
    letters.remove(i)
print letters
The last print shows that not all items were removed (only every other one was).
IDLE 2.6.2
>>> ================================ RESTART ================================
>>>
['b', 'd', 'f', 'h', 'j', 'l']
>>>
What's the explanation for this? How could this be rewritten to remove every item?
Some answers explain why this happens and some explain what you should've done. I'll shamelessly put the pieces together.
What's the reason for this?
Because the Python language is designed to handle this use case differently. The documentation makes it clear:
It is not safe to modify the sequence being iterated over in the loop (this can only happen for mutable sequence types, such as lists). If you need to modify the list you are iterating over (for example, to duplicate selected items) you must iterate over a copy.
Emphasis mine. See the linked page for more -- the documentation is copyrighted and all rights are reserved.
You could easily understand why you got what you got, but it's basically undefined behavior that can easily change with no warning from build to build. Just don't do it.
It's like wondering why i += i++ + ++i does whatever the hell it is it that line does on your architecture on your specific build of your compiler for your language -- including but not limited to trashing your computer and making demons fly out of your nose :)
How could this be rewritten to remove every item?
del letters[:] (if you need to change all references to this object)
letters[:] = [] (if you need to change all references to this object)
letters = [] (if you just want to work with a new object)
Maybe you just want to remove some items based on a condition? In that case, you should iterate over a copy of the list. The easiest way to make a copy is to make a slice containing the whole list with the [:] syntax, like so:
# remove unsafe commands
commands = ["ls", "cd", "rm -rf /"]
for cmd in commands[:]:
    if "rm " in cmd:
        commands.remove(cmd)
If your check is not particularly complicated, you can (and probably should) filter instead:
commands = [cmd for cmd in commands if not is_malicious(cmd)]
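Here is a runnable Python 3 sketch of the filtering approach. The is_malicious function is a hypothetical check invented for illustration. Slice assignment is used instead of plain rebinding, which is a variation on the line above: it replaces the contents in place so that other names for the same list see the result too.

```python
def is_malicious(cmd):
    # Hypothetical check, for illustration only.
    return cmd.startswith("rm ")


commands = ["ls", "cd", "rm -rf /"]
alias = commands  # a second name for the same list object

# Build the filtered list first, then replace the contents in place.
commands[:] = [cmd for cmd in commands if not is_malicious(cmd)]

print(commands)           # ['ls', 'cd']
print(alias is commands)  # True
```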
You cannot iterate over a list and mutate it at the same time, instead iterate over a slice:
letters=['a','b','c','d','e','f','g','h','i','j','k','l']
for i in letters[:]: # note the [:] creates a slice
    letters.remove(i)
print letters
That said, for a simple operation such as this, you should simply use:
letters = []
You cannot modify the list you are iterating, otherwise you get this weird type of result. To do this, you must iterate over a copy of the list:
for i in letters[:]:
    letters.remove(i)
It removes the first item, and then moves on to the next position in the sequence. Since the sequence has shifted, that position now holds what used to be the third item, and so on:
take "a"
remove "a" -> the first item is now "b"
take the next item, "c"
...
what you want to do is:
letters[:] = []
or
del letters[:]
This will preserve the original object that letters was pointing to. Other options, like letters = [], would create a new object and point letters to it; the old object would typically be garbage-collected after a while.
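The difference between emptying the object and rebinding the name can be seen in a short Python 3 sketch (other_name is just an illustrative second reference to the same list):

```python
letters = ['a', 'b', 'c']
other_name = letters

del letters[:]  # empties the existing object
emptied_in_place = (other_name == [])

letters = ['a', 'b', 'c']
other_name = letters

letters = []    # rebinds the name; the old list is untouched
old_list_survives = (other_name == ['a', 'b', 'c'])

print(emptied_in_place)   # True
print(old_list_survives)  # True
```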
The reason not all values were removed is that you're changing list while iterating over it.
ETA: if you want to filter values from a list you could use list comprehensions like this:
>>> letters=['a','b','c','d','e','f','g','h','i','j','k','l']
>>> [l for l in letters if ord(l) % 2]
['a', 'c', 'e', 'g', 'i', 'k']
Python probably uses pointers, and the removal starts at the front. The variable "letters" in the second line partially has a different value than the variable "letters" in the third line. On the first pass "a" is removed; on the second pass "b" has already moved into the first position, so "c" is the one removed. You can try using "while" instead.
#!/usr/bin/env python
import random

a=range(10)
while len(a):
    print a
    for i in a[:]:
        if random.random() > 0.5:
            print "removing: %d" % i
            a.remove(i)
        else:
            print "keeping: %d" % i
print "done!"

a=range(10)
while len(a):
    print a
    for i in a:
        if random.random() > 0.5:
            print "removing: %d" % i
            a.remove(i)
        else:
            print "keeping: %d" % i
print "done!"
I think this explains the problem a little better: the top block of code works, whereas the bottom one doesn't.
Items that are "kept" in the bottom list never get printed out, because you are modifying the list you are iterating over, which is a recipe for disaster.
OK, I'm a little late to the party here, but I've been thinking about this and after looking at Python's (CPython) implementation code, have an explanation I like. If anyone knows why it's silly or wrong, I'd appreciate hearing why.
The issue is moving through a list using an iterator, while allowing that list to change.
All the iterator is obliged to do is tell you which item in the (in this case) list comes after the current item (i.e. with the next() function).
I believe the way iterators are currently implemented, they only keep track of the index of the last element they iterated over. Looking in iterobject.c one can see what appears to be a definition of an iterator:
typedef struct {
PyObject_HEAD
Py_ssize_t it_index;
PyObject *it_seq; /* Set to NULL when iterator is exhausted */
} seqiterobject;
where it_seq points to the sequence being iterated over and it_index gives the index of the last item supplied by the iterator.
When the iterator has just supplied the nth item and one deletes that item from the sequence, the correspondence between subsequent list elements and their indices changes. The former (n+1)st item becomes the nth item as far as the iterator is concerned. In other words, the iterator now thinks that what was the 'next' item in the sequence is actually the 'current' item.
So, when asked to give the next item, it will give the former (n+2)nd item(i.e. the new (n+1)st item).
As a result, for the code in question, the iterator's next() method is going to give only the n+0, n+2, n+4, ... elements from the original list. The n+1, n+3, n+5, ... items will never be exposed to the remove statement.
Although the intended activity of the code in question is clear (at least for a person), it would probably require much more introspection for an iterator to monitor changes in the sequence it iterates over and, then, to act in a 'human' fashion.
If iterators could return prior or current elements of a sequence, there might be a general work-around, but as it is, you need to iterate over a copy of the list, and be certain not to delete any items before the iterator gets to them.
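The skipping behaviour described above can be reproduced with a rough Python 3 model of an index-based iterator. IndexIterator is a hypothetical class that borrows the it_seq and it_index field names from the struct quoted above and keeps the index of the next item to hand out; it is a sketch of the idea, not CPython's actual implementation:

```python
class IndexIterator:
    """Rough model of an iterator that only remembers a position."""

    def __init__(self, seq):
        self.it_seq = seq
        self.it_index = 0  # index of the next item to hand out

    def __iter__(self):
        return self

    def __next__(self):
        if self.it_index >= len(self.it_seq):
            raise StopIteration
        item = self.it_seq[self.it_index]
        self.it_index += 1
        return item


letters = list("abcdefghijkl")
visited = []
for item in IndexIterator(letters):
    visited.append(item)
    letters.remove(item)  # shifts every later item one slot to the left

print(visited)  # ['a', 'c', 'e', 'g', 'i', 'k']
print(letters)  # ['b', 'd', 'f', 'h', 'j', 'l']
```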
Initially i refers to the first element, "a". As the loop runs, the element in the first position is removed and the element from the second position moves into the first position, but the pointer still advances to the second position. This repeats on every pass, which is why b, d, f, h, j and l are never deleted.
I'm working with the Flask framework in Python, and need to hand off a list of lists to a renderer.
I step through a loop and create a list, sort it, append it to another list, then call the render function with the masterlist, like so:
for itemID in itemsArray:
    avgQuantity = getJitaQuantity(itemID)
    lowestJitaSell = getJitaLowest(itemID)
    candidateArray = findLowestPrices(itemID, lowestJitaSell, candidateArray, avgQuantity)
    candidateArray.sort()
    multiCandidateArray.append(candidateArray)
renderPage(multiCandidateArray)
My problem is that I need to clear the candidateArray and create a new one each time through the loop, but it looks like the candidateArray that I append to the multiCandidateArray is actually a pointer, not the values themselves.
When I do this:
for itemID in itemsArray:
    avgQuantity = getJitaQuantity(itemID)
    lowestJitaSell = getJitaLowest(itemID)
    candidateArray = findLowestPrices(itemID, lowestJitaSell, candidateArray, avgQuantity)
    candidateArray.sort()
    multiCandidateArray.append(candidateArray)
    del candidateArray[:]
renderPage(multiCandidateArray)
I end up with no values.
Is there a way to handle this situation that I'm missing?
I would probably go with something like:
for itemID in itemsArray:
    avgQuantity = getJitaQuantity(itemID)
    lowestJitaSell = getJitaLowest(itemID)
    candidateArray = findLowestPrices(itemID, lowestJitaSell, candidateArray, avgQuantity)
    multiCandidateArray.append(sorted(candidateArray))
No need to del anything here, and sorted returns a new list, so even if findLowestPrices is for some reason returning references to the same list (which is unlikely), you'll still have unique lists in the multiCandidateArray (although your unique lists could hold references to the same objects).
Your code already creates a new one each time through the loop.
candidateArray = findLowestPrices(...)
This assigns a new list to the variable, candidateArray. It should work fine.
When you do this:
del candidateArray[:]
...you're deleting the contents of the same list you just appended to the master list.
Don't think about pointers or variables; just think about objects, and remember nothing in Python is ever implicitly copied. A list is an object. At the end of the loop, candidateArray names the same list object as multiCandidateArray[-1]. They're different names for the same thing. On the next run through the loop, candidateArray becomes a name for a new list as produced by findLowestPrices, and the list at the end of the master list is unaffected.
I've written about this before; the C way of thinking about variables as being predetermined blocks of memory just doesn't apply to Python at all. Names are moved onto values, rather than values being copied into some fixed number of buckets.
(Also, nitpicking, but Python code generally uses under_scores and doesn't bother with types in names unless it's really ambiguous. So you might have candidates and multi_candidates. Definitely don't call anything an "array", since there's an array module in the standard library that does something different and generally not too useful. :))
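The point about names and objects can be checked with a small Python 3 sketch. Here find_lowest_prices is a hypothetical stand-in that returns a fresh list on every call, and the other names are illustrative only:

```python
def find_lowest_prices(item_id):
    # Hypothetical stand-in that returns a fresh list on every call.
    return [item_id * 3, item_id * 2, item_id]


multi_candidates = []
for item_id in (1, 2):
    candidates = find_lowest_prices(item_id)     # name bound to a new list
    multi_candidates.append(sorted(candidates))  # sorted() builds another one

# Appending stores a reference to the object, not a copy of its values.
shared = [3, 1, 2]
master = [shared]
same_object = master[-1] is shared

del shared[:]  # empties the one object that both names refer to

print(multi_candidates)  # [[1, 2, 3], [2, 4, 6]]
print(same_object)       # True
print(master)            # [[]]
```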
Just a fundamental question regarding python and .join() method:
import difflib

file1 = open(f1,"r")
file2 = open(f2,"r")
file3 = open("results","w")
diff = difflib.Differ()
result = diff.compare(file1.read(),file2.read())
file3.write("".join(result))
The above snippet of code yields a nice output stored in a file called "results", in string format, showing the differences between the two files line by line. However, I notice that if I just print "result" without using .join(), the interpreter prints a message that includes a memory address. After trying to write the result to the file without using .join(), I was informed by the interpreter that only strings and character buffers may be passed to the .write() method, and not generator objects. So based on all of the evidence that I have adduced, please correct me if I am wrong:
result = diff.compare(file1.read(),file2.read()) <---- result is a generator object?
result is a list of strings, with result itself being the reference to the first string?
.join() takes a memory address and points to the first, and then iterates over the rest of the addresses of strings in that structure?
A generator object is an object that returns a pointer?
I apologize if my questions are unclear, but I basically wanted to ask the python veterans if my deductions were correct. My question is less about the observable results, and more so about the inner workings of python. I appreciate all of your help.
join is a method of strings. That method takes any iterable and iterates over it and joins the contents together. (The contents have to be strings, or it will raise an exception.)
If you attempt to write the generator object directly to the file, you will just get the generator object itself, not its contents. join "unrolls" the contents of the generator.
You can see what is going with a simple, explicit generator:
def gen():
    yield 'A'
    yield 'B'
    yield 'C'
>>> g = gen()
>>> print g
<generator object gen at 0x0000000004BB9090>
>>> print ''.join(g)
ABC
The generator doles out its contents one at a time. If you try to look at the generator itself, it doesn't dole anything out and you just see it as "generator object". To get at its contents, you need to iterate over them. You can do this with a for loop, with the next function, or with any of various other functions/methods that iterate over things (str.join among them).
When you say that result "is a list of strings" you are getting close to the idea. A generator (or iterable) is sort of like a "potential list". Instead of actually being a list of all its contents all at once, it lets you peel off each item one at a time.
None of the objects is a "memory address". The string representation of a generator object (like that of many other objects) includes a memory address, so if you print it (as above) or write it to a file, you'll see that address. But that doesn't mean that object "is" that memory address, and the address itself isn't really usable as such. It's just a handy identifying tag so that if you have multiple objects you can tell them apart.
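To tie this back to difflib, here is a small Python 3 sketch. The two lists of lines are made-up sample data, passed as lists of lines because that is the input Differ.compare is documented to expect:

```python
import difflib

# Made-up sample data: each file as a list of newline-terminated lines.
old_lines = ["one\n", "two\n", "three\n"]
new_lines = ["one\n", "2\n", "three\n"]

result = difflib.Differ().compare(old_lines, new_lines)

# A generator is its own iterator, and printing it shows only its repr.
is_lazy = iter(result) is result
print(result)

# join() iterates over the generator and concatenates the strings it yields.
text = "".join(result)
print(text)

# The generator is now exhausted; joining again gives an empty string.
leftover = "".join(result)
```

A file object's writelines method also accepts any iterable of strings, so passing the generator to it directly is another way to write the diff without calling join.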