BeautifulSoup question

BeautifulSoup question - python

<parent1>
<span>Text1</span>
</parnet1>
<parent2>
<span>Text2</span>
</parnet2>
<parent3>
<span>Text3</span>
</parnet3>
I'm parsing this with Python & BeautifulSoup. I have a variable soupData which stores pointer for need object. How can I get pointer for the parent2, for example, if I have the text Text2. So the problem is to filter span-tags by content. How can I do this?

After correcting the spelling on the end-tags:
[e for e in soup(recursive=False, text=False) if e.span.string == 'Text2']

I don't think there's a way to do it in a single step. So:
for parenttag in soupData:
if parenttag.span.string == "Text2":
do_stuff(parenttag)
break
It's possible to use a generator expression, but not much shorter.

Using python 2.7.6 and BeautifulSoup 4.3.2 I found Marcelo's answer to give an empty list. This worked for me, however:
[x.parent for x in bSoup.findAll('span') if x.text == 'Text2'][0]
Alternatively, for a ridiculously overengineered solution (to this particular problem at least, but maybe it would be useful if you'll be doing filtering on criteria too long to put in a reasonably easily understandable list expression) you could do:
def hasText(text):
def hasTextFunc(x):
return x.text == text
return hasTextFunc
to create a function factory, then
hasTextText2 = hasText('Text2')
filter(hasTextText2,bSoup.findAll('span'))[0].parent
to get the reference to the parent tag that you were looking for

Related

How can I check with a loop if a group of strings is empty or not?

I am trying to go through a loop of strings (alias names) in order to apply the unidecode, and to avoid the error of not having aliases words in some cases, I am using an if/else to check if there is a string or not, so I can apply the unidecode.
I tried by checking the length of the word by doing this code, but it keeps saying "list' object has no attribute 'element". I am not sure what should I do.
alias=record.findall("./person/names/aliases/alias")
alias_name=[element.text for element in alias]
if len(alias_name.element.text)!=0:
alias_names_str = ','.join(alias_name)
alias_names_str=unidecode.unidecode(alias_names_str)
else:
alias_name.element.text="NONE"

Not exactly sure what you are trying to accomplish, but it looks like an XML or HTML decoder.
# alias is an iterable of elements
alias=record.findall("./person/names/aliases/alias")
# alias_names is a list of strings, you have already dereferenced the text
alias_names=[element.text for element in alias]
if len(alias_names)>0:
alias_names_str = ','.join(alias_names)
alias_names_str = unidecode.unidecode(alias_names_str)
else:
alias_names_str = "NONE"

thank you! yes, I want to apply the unidecode to remove the diacritics of those strings. Your code helped, then I needed to change some details and it worked!
Thank you so much! :)

Parsing data with pattern

I'm parsing some data with the pattern as follows:
tagA:
titleA
dataA1
dataA2
dataA3
...
tagB:
titleB
dataB1
dataB2
dataB3
...
tagC:
titleC
dataC1
dataC2
...
...
These tags are stored in a list list_of_tags, if I iterate through the list, I can get all the tags; also, if iterating through the tags, I can get the title and the data associated with the title.
The tags in my data are pretty much something like <div>, so they are not useful to me; what I'm trying to do is to construct a dictionary which uses titles as keys and datas as a list of values.
The constructed dictionary would look like:
{
titleA: [dataA1, dataA2, dataA3...],
titleB: [dataB1, dataB2, dataB3...],
...
}
Notice every tag only contains one title and some datas, and title always comes before data.
So here are my working codes:
Method 1:
result = {}
for tag in list_of_tags:
list_of_values = []
for idx, elem in enumerate(tag):
if not idx:
key = elem
else:
construct_list_of_values()
update_the_dictionary()
Actually, method 1 works fine and gives me my desired result; however, if I put that piece of codes in PyCharm, it warns me that "Local variable 'key' might be referenced before assignment" at the last line. Hence, I try another approach:
Method 2:
result = {tag[0]: tag[1:] for tag in list_of_tags}
Method 2 works fine if tags are lists, but I also want the code to work normally if tags are generators ('generator' object is not subscriptable will occur with method 2)
In order to work with generators, I come up with:
Method 3:
key_val_list = [(next(tag), list(tag)) for tag in list_of_tags]
result = dict(key_val_list)
Method 3 also works; but I cannot write this in dictionary comprehension ({next(tag): list(tag) for tag in list_of_tags} would give StopIteration exception because list(tag) will be evaluated first)
So, my question is, is there an elegant way for dealing with this pattern which could work no matter tags are lists or generators? (method 1 seems to work for both, but I don't know if I should ignore the warning PyCharms gives; the other two methods looks more concise, but one can only work on lists while the other can only work on generators)
Sorry for the long question, thanks for the patience!

I guess the reason why PyCharm is giving you a warning is that you are using key in update_the_dictionary, but key could be left unassigned if tag does not contain at least one element. You might have the knowledge that the title will always be in the list, but the static analyzer is not able to infer that from the context.
If you are using Python 3, you might want to try using PEP 3132 - Extended Iterable Unpacking. It should work for both lists and generators.
e.g.
title, *data = tag

how do I modify a url that I pick at random in python

I have an app that will show images from reddit. Some images come like this http://imgur.com/Cuv9oau, when I need to make them look like this http://i.imgur.com/Cuv9oau.jpg. Just add an (i) at the beginning and (.jpg) at the end.

You can use a string replace:
s = "http://imgur.com/Cuv9oau"
s = s.replace("//imgur", "//i.imgur")+(".jpg" if not s.endswith(".jpg") else "")
This sets s to:
'http://i.imgur.com/Cuv9oau.jpg'

This function should do what you need. I expanded on #jh314's response and made the code a little less compact and checked that the url started with http://imgur.com as that code would cause issues with other URLs, like the google search I included. It also only replaces the first instance, which could causes issues.
def fixImgurLinks(url):
if url.lower().startswith("http://imgur.com"):
url = url.replace("http://imgur", "http://i.imgur",1) # Only replace the first instance.
if not url.endswith(".jpg"):
url +=".jpg"
return url
for u in ["http://imgur.com/Cuv9oau","http://www.google.com/search?q=http://imgur"]:
print fixImgurLinks(u)
Gives:
>>> http://i.imgur.com/Cuv9oau.jpg
>>> http://www.google.com/search?q=http://imgur

You should use Python's regular expressions to place the i. As for the .jpg you can just append it.

How to transform hyperlink codes into normal URL strings?

I'm trying to build a blog system. So I need to do things like transforming '\n' into < br /> and transform http://example.com into < a href='http://example.com'>http://example.com< /a>
The former thing is easy - just using string replace() method
The latter thing is more difficult, but I found solution here: Find Hyperlinks in Text using Python (twitter related)
But now I need to implement "Edit Article" function, so I have to do the reverse action on this.
So, how can I transform < a href='http://example.com'>http://example.com< /a> into http://example.com?
Thanks! And I'm sorry for my poor English.

Sounds like the wrong approach. Making round-trips work correctly is always challenging. Instead, store the source text only, and only format it as HTML when you need to display it. That way, alternate output formats / views (RSS, summaries, etc) are easier to create, too.
Separately, we wonder whether this particular wheel needs to be reinvented again ...

Since you are using the answer from that other question your links will always be in the same format. So it should be pretty easy using regex. I don't know python, but going by the answer from the last question:
import re
myString = 'This is my tweet check it out http://tinyurl.com/blah'
r = re.compile(r'(http://[^ ]+)')
print r.sub(r'\1', myString)
Should work.

python: xml.etree.ElementTree, removing "namespaces"

I like the way ElementTree parses xml, in particular the Xpath feature. I've an output in xml from an application with nested tags.
I'd like to access this tags by name without specifying the namespace, is it possible?
For example:
root.findall("/molpro/job")
instead of:
root.findall("{http://www.molpro.net/schema/molpro2006}molpro/{http://www.molpro.net/schema/molpro2006}job")

At least with lxml2, it's possible to reduce this overhead somewhat:
root.findall("/n:molpro/n:job",
namespaces=dict(n="http://www.molpro.net/schema/molpro2006"))

You could write your own function to wrap the nasty looking bits for example:
def my_xpath(doc, ns, xp);
num = xp.count('/')
new_xp = xp.replace('/', '/{%s}')
ns_tup = (ns,) * num
doc.findall(new_xp % ns_tup)
namespace = 'http://www.molpro.net/schema/molpro2006'
my_xpath(root, namespace, '/molpro/job')
Not that much fun I admit but a least you will be able to read your xpath expressions.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

BeautifulSoup question - python

After correcting the spelling on the end-tags: [e for e in soup(recursive=False, text=False) if e.span.string == 'Text2']

I don't think there's a way to do it in a single step. So: for parenttag in soupData: if parenttag.span.string == "Text2": do_stuff(parenttag) break It's possible to use a generator expression, but not much shorter.

Related

How can I check with a loop if a group of strings is empty or not?

Parsing data with pattern

how do I modify a url that I pick at random in python

How to transform hyperlink codes into normal URL strings?

python: xml.etree.ElementTree, removing "namespaces"

Categories

Resources