I want to change this function:
def has_class_but_no_id(tag):
    return tag.has_key('class') and not tag.has_key('id')
This function is written for Python 2 and does not work in Python 3.
My idea was to turn the HTML document into a list like this:
list_of_descendants = list(soup.descendants)
so that I can get the tags that have a class attribute but no id attribute.
In other words, I want to find all tags with class = blabla... but without id = ....
I have no idea how to handle this problem.
The documentation says:
I renamed one method for compatibility with Python 3:
Tag.has_key() -> Tag.has_attr()
Also, the exact same function is available in the documentation here:
If none of the other matches work for you, define a function that
takes an element as its only argument. The function should return True
if the argument matches, and False otherwise.
Here’s a function that returns True if a tag defines the “class”
attribute but doesn’t define the “id” attribute:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
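As a minimal sketch of how such a function can be used, it can be passed directly to find_all, which then calls it once per tag. The HTML snippet and the variable names below are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Made-up sample document, only for illustration
html_doc = """
<div id="main" class="box">
  <p class="title">Heading</p>
  <p class="story" id="first">Once upon a time</p>
  <span>no attributes</span>
</div>
"""


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


soup = BeautifulSoup(html_doc, "html.parser")

# find_all calls the function for every tag and keeps those where it returns True
matches = soup.find_all(has_class_but_no_id)
names = [(tag.name, tag["class"]) for tag in matches]
print(names)
```

Only the p tag with class "title" matches here, because the div and the second p both carry an id, and the span has no class.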
Hey, I solved this problem.
What I had to do was:
1. Collect all the tags (BeautifulSoup) and all children of the tags (contents):
soup = BeautifulSoup(html_doc, "html.parser")
list_of_descendants = list(soup.descendants)
2. Eliminate all NavigableStrings (because they don't have a has_attr() method):
import bs4
def terminate_navis(list_of_some):
    # keep only Tag objects; NavigableStrings are dropped
    new_list = []
    for elem in list_of_some:
        if isinstance(elem, bs4.element.Tag):
            new_list.append(elem)
    return new_list
new_list = terminate_navis(list_of_descendants)
def contents_adding(arg_list):
    # this helper adds the child tags of every tag to the list again
    new_list = list(arg_list)  # copy, so the list being iterated is not extended
    for elem in arg_list:
        if elem.contents:
            child_list = terminate_navis(elem.contents)
            new_list.extend(child_list)
    new_list = list(set(new_list))  # drop duplicates
    return new_list
3. Filter the tags: keep those that have a 'class' attribute (checked with has_attr) and drop those that have an 'id' (also checked with has_attr):
def justcl(tag_lists):
    class_lists = []
    for elem in tag_lists:
        if elem.has_attr('class'):
            class_lists.append(elem)
    return class_lists
def notids(class_lists):
    no_id_lists = []
    for elem in class_lists:
        if not elem.has_attr('id'):
            no_id_lists.append(elem)
    return no_id_lists
Finally, collect all these tags in a list and print them on the screen,
either with a single print or with a for loop, and so on.
What's the difference between the two versions of the following code? Links are linked-list objects. The .first and .rest attributes return the first value and the remaining values, respectively. In accordance with Stack Overflow's policies and my class's policies, I want to mention that this is not a current assignment: it was an optional assignment that is long past due, and I am revisiting it, trying out iteration vs. recursion, to study for my upcoming examination.
Here are some doctests.
>>> s = Link(1, Link(2, Link(3)))
>>> s.rest.rest.rest = s
>>> has_cycle(s)
True
>>> t = Link(1, Link(2, Link(3)))
>>> has_cycle(t)
False
>>> u = Link(2, Link(2, Link(2)))
>>> has_cycle(u)
False
Here is my recursive solution for the function:
def has_cycle(link):
    existing = []
    def cycle(link):
        nonlocal existing
        if link is not Link.empty:
            if link in existing:
                return True
            existing.append(link)
            cycle(link.rest)
    cycle(link)
    return False
Alternatively, the iterative version:
def has_cycle(link):
    existing = []
    while link is not Link.empty:
        if link in existing:
            return True
        existing.append(link)
        link = link.rest
    return False
Thank you. I should mention that the iterative version works, but the recursive version does not.
Linked list class:
class Link:
    """A linked list.
    >>> s = Link(1, Link(2, Link(3)))
    >>> s.first
    1
    >>> s.rest
    Link(2, Link(3))
    """
    empty = ()
    def __init__(self, first, rest=empty):
        assert rest is Link.empty or isinstance(rest, Link)
        self.first = first
        self.rest = rest
    def __repr__(self):
        if self.rest is Link.empty:
            return 'Link({})'.format(self.first)
        else:
            return 'Link({}, {})'.format(self.first, repr(self.rest))
def has_cycle(link):
    existing = []
    def cycle(link):
        nonlocal existing
        if link is not Link.empty:
            if link in existing:
                return True
            existing.append(link)
            return cycle(link.rest)  # pass the recursive result back up
        else:
            return False
    res = cycle(link)
    # print(res)
    return res
Try this out as your recursive version.
Your returns are missing or inappropriate. When descending into the recursion, you need to return the result of the recursive call. You also need to return something in case your empty-condition does not hold true:
# body of has_cycle(link)
existing = []
def cycle(link):
    nonlocal existing
    if link is not Link.empty:
        if link in existing:
            return True
        existing.append(link)
        return cycle(link.rest)  # Return result of recursive call
    else:
        return False  # Return False if link is empty
print(cycle(link))
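Putting the pieces together, a minimal self-contained sketch of the recursive version might look like the following. The trimmed-down Link class is only a stand-in for the one in the question, and passing the visited nodes as a `seen` parameter instead of using nonlocal is a choice made for this sketch, not something taken from the answers above.

```python
class Link:
    """Stand-in for the question's Link class (only what has_cycle needs)."""
    empty = ()

    def __init__(self, first, rest=empty):
        self.first = first
        self.rest = rest


def has_cycle(link, seen=None):
    """Return True if following .rest ever revisits a node."""
    if seen is None:
        seen = []                      # nodes visited so far
    if link is Link.empty:
        return False                   # reached the end: no cycle
    if any(link is node for node in seen):
        return True                    # the same node object was seen before
    seen.append(link)
    return has_cycle(link.rest, seen)  # propagate the recursive result


s = Link(1, Link(2, Link(3)))
s.rest.rest.rest = s                   # close the loop
t = Link(1, Link(2, Link(3)))
u = Link(2, Link(2, Link(2)))          # equal values, but distinct nodes
print(has_cycle(s), has_cycle(t), has_cycle(u))
```

Every branch returns a value, so the result of the deepest call travels back up to the original caller. Note that a recursive version like this can hit Python's recursion limit on very long lists, which the iterative version avoids.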
I am currently implementing an ORM that stores data defined in an XSD handled with a DOM generated by PyXB.
Many of the respective elements contain sub-elements and so forth, which each have a minOccurs=0 and thus may resolve to None in the DOM.
Hence, when accessing an element hierarchy containing optional elements, I now face the question of whether to use:
from contextlib import suppress
with suppress(AttributeError):
    wanted_subelement = root.subelement.sub_subelement.wanted_subelement
or rather
if root.subelement is not None:
    if root.subelement.sub_subelement is not None:
        wanted_subelement = root.subelement.sub_subelement.wanted_subelement
While both styles work perfectly fine, which is preferable? (I am not Dutch, btw.)
This also works:
if root.subelement and root.subelement.sub_subelement:
    wanted_subelement = root.subelement.sub_subelement.wanted_subelement
The if statement treats None as false, and the and operator evaluates its operands from left to right. So if the first operand evaluates to false, the second one is never accessed.
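To make the comparison concrete, here is a small sketch that runs both styles against a complete and an incomplete hierarchy. The SimpleNamespace objects are hypothetical stand-ins for the PyXB elements, and the helper names are invented for this example.

```python
from contextlib import suppress
from types import SimpleNamespace

# Hypothetical stand-ins for PyXB elements: an absent optional child is None.
full = SimpleNamespace(
    subelement=SimpleNamespace(
        sub_subelement=SimpleNamespace(wanted_subelement="value")))
partial = SimpleNamespace(subelement=None)


def lookup_with_and(root):
    """Chained truthiness check; stops at the first falsy link."""
    if root.subelement and root.subelement.sub_subelement:
        return root.subelement.sub_subelement.wanted_subelement
    return None


def lookup_with_suppress(root):
    """EAFP variant: None.sub_subelement raises AttributeError, which is suppressed."""
    wanted = None
    with suppress(AttributeError):
        wanted = root.subelement.sub_subelement.wanted_subelement
    return wanted


print(lookup_with_and(full), lookup_with_and(partial))
print(lookup_with_suppress(full), lookup_with_suppress(partial))
```

One caveat with the truthiness check: an element that is present but evaluates as false (for example, an object defining __bool__ or __len__) would also be skipped, whereas an explicit `is not None` test would not skip it.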
If you have quite a few such lookups to perform, better to wrap this up in a more generic lookup function:
# use a sentinel object distinct from None
# in case None is a valid value for an attribute
notfound = object()
# resolve a Python attribute path
# - mostly, a `getattr` that supports
#   arbitrary sub-attribute lookups
def resolve(element, path):
    parts = path.split(".")
    while parts:
        name, parts = parts[0], parts[1:]
        element = getattr(element, name, notfound)
        if element is notfound:
            break
    return element
# just to test the whole thing
class Element(object):
    def __init__(self, name, **attribs):
        self.name = name
        for k, v in attribs.items():
            setattr(self, k, v)
e = Element(
    "top",
    sub1=Element("sub1"),
    nested1=Element(
        "nested1",
        nested2=Element(
            "nested2",
            nested3=Element("nested3")
        )
    )
)
tests = [
    "notthere",
    "does.not.exists",
    "sub1",
    "sub1.sub2",
    "nested1",
    "nested1.nested2",
    "nested1.nested2.nested3"
]
for path in tests:
    sub = resolve(e, path)
    if sub is notfound:
        print("%s : not found" % path)
    else:
        print("%s : %s" % (path, sub.name))
I have a function that returns a list of objects (the code below is just an example). Each object has an attribute called text:
def mylist():
    mylist = []
    for i in range(5):
        elem = myobject(i)
        mylist.append(elem)
    return mylist
for obj in mylist():
    print(obj.text)
How can I rewrite this code so that mylist() produces a new value on each iteration and I iterate over an iterator? In other words, how can I turn mylist into something I can use like xrange()?
If I understood right, you're looking for generators:
def mylist():
    for i in range(5):
        elem = myobject(i)
        yield elem
Complete code for you to play with:
class myobject:
    def __init__(self, i):
        self.text = 'hello ' + str(i)
def mylist():
    for i in range(5):
        elem = myobject(i)
        yield elem
for obj in mylist():
    print(obj.text)
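To see that the generator really builds the objects one at a time, here is a small sketch. The MyObject class with its `created` counter is a hypothetical stand-in for myobject, added only so the laziness becomes observable.

```python
class MyObject:
    """Hypothetical stand-in for the question's myobject class."""
    created = 0  # counts how many instances have been built so far

    def __init__(self, i):
        MyObject.created += 1
        self.text = 'hello ' + str(i)


def mylist():
    for i in range(5):
        yield MyObject(i)


gen = mylist()
before = MyObject.created          # nothing built yet: the body has not started
first = next(gen)
after_first = MyObject.created     # exactly one object was built on demand
rest = [obj.text for obj in gen]   # consumes the remaining four objects
leftover = list(gen)               # a generator is exhausted after one pass
print(before, after_first, first.text, rest, leftover)
```

Unlike a list, the generator can be iterated only once. Call mylist() again to get a fresh one.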
You can also use a generator expression:
mylist = (myobject(i) for i in range(5))
This will give you an actual generator but without having to declare a function beforehand.
Please note the use of parentheses instead of brackets, which denotes a generator expression rather than a list comprehension.
What georg said, or you can return an iterator over that list:
def mylist():
    mylist = []
    for i in range(5):
        mylist.append(myobject(i))
    return iter(mylist)
It's probably not a good idea to use your function name as a variable name, though :)
llist = [0, 4, 5, 6]
ii = iter(llist)
while True:
    try:
        print(next(ii))
    except StopIteration:
        print('End of iteration.')
        break
In the function getLink(urls), I have return (cloud,parent,children).
In the main function, I have (cloud,parent,children) = getLink(urls), and I get an error on this line: TypeError: 'NoneType' object is not iterable
parent and children are both lists of HTTP links. Since I cannot paste them here: parent is a list containing about 30 links; children is a list containing about 30 items, and each item consists of about 10-100 links separated by ",".
cloud is a list containing about 100 words, like this: ['official store', 'Java Applets Centre', 'About Google', 'Web History'.....]
I don't know why I get this error. Is there anything wrong with how I pass the parameters? Or is it because the lists take up too much space?
# crawler(url): read a web page and return a list of its URLs and a list of their names
def crawler(url):
    try:
        m = urllib.request.urlopen(url)
        msg = m.read()
        ....
        return (list(set(list(links))), list(set(list(titles))))
    except Exception:
        print("url wrong!")
# this is the function that goes wrong: it throws an exception here (the error I mentioned), and the while loop also ends before len(parent) reaches 100.
def getLink(urls):
    try:
        newUrl = []
        parent = []
        children = []
        cloud = []
        i = 0
        while len(parent) <= 100:
            url = urls[i]
            if url in parent:
                i += 1
                continue
            (links, titles) = crawler(url)
            parent.append(url)
            children.append(",".join(links))
            cloud = cloud + titles
            newUrl = newUrl + links
            print("links: ", links)
            i += 1
            if i == len(urls):
                urls = list(set(newUrl))
                newUrl = []
                i = 0
        return (cloud, parent, children)
    except Exception:
        print("can not get links")
def readfile(file):
    # not related; this function returns a list of URLs
    ...
def main():
    file = 'sampleinput.txt'
    urls = readfile(file)
    (cloud, parent, children) = getLink(urls)
if __name__ == '__main__':
    main()
There may be a path through your function that ends without reaching the explicit return statement.
Look at the following example code.
def get_values(x):
    if x:
        return 'foo', 'bar'
x, y = get_values(1)
x, y = get_values(0)  # raises TypeError: None cannot be unpacked
When the function is called with 0 as its argument, the return is skipped and the function returns None, which cannot be unpacked into x and y.
You could add an explicit return as the last line of your function. In the example given in this answer it would look like this.
def get_values(x):
    if x:
        return 'foo', 'bar'
    return None, None
Update after seeing the code
When the exception is triggered in getLink, you just print something and fall off the end of the function. There is no return statement on that path, so Python returns None. The calling function then tries to unpack None into three values, and that fails. Note that crawler has the same problem: after printing "url wrong!" it returns None, so (links, titles) = crawler(url) fails in the same way inside getLink.
Change your exception handling to return a tuple with three values, just as you do when everything is fine. Using None for each value is a good idea because it shows you that something went wrong. Additionally, I wouldn't print anything in the function. Don't mix business logic and input/output.
    except Exception:
        return None, None, None
Then in your main function use the following:
cloud, parent, children = getLink(urls)
if cloud is None:
    print("can not get links")
else:
    pass  # do some more work
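The following sketch shows the same idea end to end without any network access. The names get_links, good_fetch and broken_fetch are hypothetical: `fetch` stands in for crawler(url), and broken_fetch mimics a crawler that falls through its except branch and returns None.

```python
def get_links(urls, fetch):
    """Return (cloud, parent, children), or (None, None, None) on failure.

    `fetch` is a hypothetical callable standing in for crawler(url); it is
    expected to return a (links, titles) pair.
    """
    cloud, parent, children = [], [], []
    try:
        for url in urls:
            links, titles = fetch(url)  # raises TypeError if fetch returns None
            parent.append(url)
            children.append(",".join(links))
            cloud += titles
    except Exception:
        return None, None, None         # same shape as the success case
    return cloud, parent, children


def good_fetch(url):
    return [url + "/a", url + "/b"], ["title of " + url]


def broken_fetch(url):
    return None  # mimics crawler() falling through its except branch


ok = get_links(["http://example.com"], good_fetch)
failed = get_links(["http://example.com"], broken_fetch)
print(ok)
print(failed)
```

Because both paths return a three-element tuple, the caller can always unpack the result and then check for None, as in the main function above.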
For example,
bs = BeautifulSoup("<html><a>sometext</a></html>")
print(bs.find_all("a", text=re.compile(r"some")))
returns [<a>sometext</a>], but when the element being searched for has a child, e.g. an img,
bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
print(bs.find_all("a", text=re.compile(r"some")))
it returns [].
Is there a way to use find_all to match the latter example?
You will need to use a hybrid approach, since text= will fail when an element has child elements as well as text.
bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
reg = re.compile(r'some')
elements = [e for e in bs.find_all('a') if reg.search(e.text)]
Background
When BeautifulSoup is searching for an element, and text is a callable, it eventually calls:
self._matches(found.string, self.text)
In the two examples you gave, the .string property returns different things:
>>> bs1 = BeautifulSoup("<html><a>sometext</a></html>")
>>> bs1.find('a').string
u'sometext'
>>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>")
>>> bs2.find('a').string
>>> print bs2.find('a').string
None
The .string property looks like this:
@property
def string(self):
    """Convenience property to get the single string within this tag.
    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string
If we print out the contents we can see why this returns None:
>>> print bs1.find('a').contents
[u'sometext']
>>> print bs2.find('a').contents
[u'sometext', <img/>]
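As a self-contained sketch of the workaround, the check can also be written as a filter function that looks at get_text() instead of .string. The function name a_with_matching_text is made up for this example, and the built-in "html.parser" is assumed as the parser.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
link = soup.find("a")
pattern = re.compile(r"some")


def a_with_matching_text(tag):
    # get_text() joins all strings below the tag, so child tags do not matter
    return tag.name == "a" and bool(pattern.search(tag.get_text()))


string_value = link.string      # None, because <a> has two children
text_value = link.get_text()    # the text of <a>, ignoring the <img> child
matches = soup.find_all(a_with_matching_text)
print(string_value, text_value, len(matches))
```

Here .string is None while get_text() still yields the text, which is why the function-based filter finds the tag where text= does not.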