For example, I have a string containing an HTML ordered list. Inside that ordered list, I want to insert n list items. How can I add these items to the string?
Here is the example code:
html = """
<ol>
<li>
<!--Something-->
</li>
... <!--n lists-->
{} #str().format()
<li>
<!--Something-->
</li>
</ol>
"""
for li in html_lists: # where li is <li>...</li> and is an element of the Python list
    html.format(li)
As far as I know, Strings are immutable and .format() will add <li> at {}. Hence this won't work for more than one <li>.
As you said, strings are immutable, so html.format(li) on a line by itself won't do anything. It doesn't modify html in place; it returns a new string, so you need to write html = html.format(li).
As for using a loop with str.format(), you should be able to use the following assuming each element in html_lists is a string that contains a single <li> entry:
html = html.format('\n'.join(html_lists))
This works because '\n'.join(html_lists) will construct a single string from your list of strings, which can then be passed to html.format() to replace the single {} with the content from every element in html_lists. Note that you could also use ''.join(html_lists), the newline is just there to make it more readable when html is displayed.
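Here is a minimal sketch of the same join-then-format idea wrapped in a function. Note the assumptions: the name fill_ordered_list is hypothetical, and unlike the answer above it takes plain-text items and wraps them in <li> tags itself, escaping them with html.escape along the way.

```python
from html import escape


def fill_ordered_list(template, items):
    """Return template with its single {} replaced by one <li> per item."""
    # Escape each item so characters such as < or & cannot break the markup.
    rows = ["<li>{}</li>".format(escape(item)) for item in items]
    # str.format returns a new string; the template itself is left unchanged.
    return template.format("\n".join(rows))


template = "<ol>\n{}\n</ol>"
print(fill_ordered_list(template, ["one", "two & three"]))
```

If the elements already contain complete <li>...</li> markup, as in the question, skip the wrapping and escaping and pass "\n".join(html_lists) straight to format().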
You could use lxml to build HTML:
import lxml.html as LH
import lxml.builder as builder
html_lists = 'one two three'.split()
E = builder.E
html = (
    E.ol(
        *([E.li('something')]
          + [E.li(item) for item in html_lists]
          + [E.li('else')])
    )
)
# encoding='unicode' makes tostring() return str rather than bytes on Python 3
print(LH.tostring(html, pretty_print=True, encoding='unicode'))
prints
<ol>
  <li>something</li>
  <li>one</li>
  <li>two</li>
  <li>three</li>
  <li>else</li>
</ol>
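If lxml is not installed, roughly the same tree-building approach can be sketched with the standard library's xml.etree.ElementTree. This is an alternative, not what the answer above uses; the function name build_ordered_list is made up for illustration, and the result is not pretty-printed.

```python
import xml.etree.ElementTree as ET


def build_ordered_list(items):
    """Build an <ol> element with one <li> per item and serialize it."""
    ol = ET.Element("ol")
    for text in items:
        li = ET.SubElement(ol, "li")
        li.text = text  # special characters are escaped on serialization
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(ol, encoding="unicode", method="html")


print(build_ordered_list(["something", "one", "two", "three", "else"]))
```

Building the tree from elements rather than from string pieces means the item text cannot accidentally open or close tags.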
Python is really good for processing text, so here's an example using it to do what you want:
import textwrap
def indent(amt, s):
    # Prefix every line of s with amt spaces
    dent = amt * ' '
    return ''.join(map(lambda i: dent+i+'\n', s.split('\n')[:-1])).rstrip()
ordered_list_html = textwrap.dedent('''\
    <ol>
    {}
    </ol>
''')
# create some test data
html_lists = [textwrap.dedent('''\
    <li>
    list #{}
    </li>
''').format(n) for n in range(5)]
print(ordered_list_html.format(indent(2, ''.join(html_lists))))
Output:
<ol>
  <li>
  list #0
  </li>
  <li>
  list #1
  </li>
  <li>
  list #2
  </li>
  <li>
  list #3
  </li>
  <li>
  list #4
  </li>
</ol>
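On Python 3.3 and later, the hand-written indent() helper can probably be replaced by textwrap.indent from the standard library. The following is a sketch under that assumption; render_ordered_list and indent_by are names invented for the example.

```python
import textwrap


def render_ordered_list(items, indent_by=2):
    """Wrap pre-built <li> blocks in an <ol>, indenting each inner line."""
    body = "\n".join(item.strip("\n") for item in items)
    # textwrap.indent prefixes every non-blank line with the given string.
    return "<ol>\n{}\n</ol>".format(textwrap.indent(body, " " * indent_by))


items = ["<li>\nlist #{}\n</li>".format(n) for n in range(2)]
print(render_ordered_list(items))
```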
While scraping a website, I have ended up with a list of lists containing bs4.element.ResultSet objects from find_all() searches. The output of my variable (features) looks like this:
[[<li class="WlSsj">Terrasse</li>],
[<li class="WlSsj">Balkon</li>, <li class="WlSsj">Video Live-Besichtigung</li>],
[<li class="WlSsj">Balkon</li>, <li class="WlSsj">Terrasse</li>],
[],
[<li class="WlSsj y7k9g">Neubauprojekt</li>, <li class="WlSsj">Garten</li>, <li class="WlSsj">Balkon</li>, <li class="WlSsj">Terrasse</li>]]
I would now like to turn this into a list of strings like this:
["Terrasse",
"Balkon, Video Live-Besichtigung",
"Balkon, Terrasse",
"",
"Neubauprojekt, Garten, Balkon, Terrasse"]
I have tried many things but sometimes the empty list is not converted to an empty string or the order is not consistent.
Thank you very much in advance.
You need to use nested for loops and extract the text of that element, in your case:
inner_text = []
# features is the list of lists you get with bs4
for lis in features:
    texts = []
    for item in lis:
        texts.append(item.text)
    # join adds ", " only between items, so an empty list gives ""
    inner_text.append(", ".join(texts))
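The two concerns from the question (empty lists and ordering) can be checked without bs4. In this small sketch, plain strings stand in for the item.text values, and join_rows is a hypothetical helper name.

```python
def join_rows(rows):
    """Turn a list of lists of strings into one comma-separated string per row."""
    # ", ".join([]) is "", so empty rows become empty strings,
    # and join keeps the items in their original order.
    return [", ".join(row) for row in rows]


# Plain strings stand in for the item.text values of the bs4 elements.
rows = [["Terrasse"], ["Balkon", "Video Live-Besichtigung"], [], ["Garten", "Balkon"]]
print(join_rows(rows))
```

With real bs4 results, the inner expression would presumably become ", ".join(item.text for item in row).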
I have extracted data wrapped in multiple HTML p tags from a webpage using BeautifulSoup4 and stored all of it in a list. However, I want each extracted value to be a separate list element.
HTML content structure:
<ul>
<li>
<p>
<span class="TextRun">
<span class="NormalTextrun"> Data 1 </span>
</span>
</p>
</li>
<li>
<p>
<span class="TextRun">
<span class="NormalTextrun"> Data 2 </span>
</span>
</p>
</li>
<li>
<p>
<span class="TextRun">
<span class="NormalTextrun"> Data 3 </span>
</span>
</p>
</li>
</ul>
Code to extract:
for data in elem.find_all('span', class_="TextRun"):
    data = ''.join([' '.join(item.text.split()) for item in elem.select(".NormalTextRun")])
    data = data.replace(u'\xa0', '')
    events_parsed_thisweek.append(data)
print(events_parsed_thisweek)
Current output:
[Data1Data2Data3]
Expected output:
[Data1, Data2, Data3]
Any help is much appreciated!
data = [x.text.strip() for x in elem.find_all('span', {'class': 'NormalTextrun'})]
Printing data will give you: ['Data 1', 'Data 2', 'Data 3']
I think what @Sagun Shrestha suggests works. To handle the inner span and the extra whitespace more precisely, you could try:
data = [s.text.strip() for s in b.find_all('span', class_='NormalTextrun')]
print(data)
If you specifically want the string output without the quotation marks, you can try this:
data = [s.text.strip() for s in b.find_all('span', class_='NormalTextrun')]
print('[', ', '.join(data), ']', sep='')
Hope it's what you want.
This should solve your problem
data = [x.text for x in elem.find_all('span', attrs={'class':'TextRun'})]
This gives the correct output:
data = [ele.text for ele in soup.find_all('span', {'class':'NormalTextrun'})]
print(data)
Output:
[' Data 1 ', ' Data 2 ', ' Data 3 ']
I am in the middle of scraping data from a website, but I have run into the following code:
code = "<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> "
I need to extract only "₹ 7,372".
I have tried the following.
1. code.text
but it results in
'\n\n₹ 7,372\xa0\r\n \n–\n\n'
2. code.text.strip()
but it results in
'₹ 7,372\xa0\r\n \n–'
Is there a method that does this?
Please let me know, so that I can complete my project.
OK, I managed to clean up the data you need. This approach is a little ugly, but it works.
from bs4 import BeautifulSoup as BS
html= """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> """
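soup = BS(html, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
li = soup.find('li').text
# Repeat the pass: a single pass does not strip every leading/trailing character
for j in range(3):
    for i in ['\n',' ', '–', '\xa0', '\r','\x20','\x0a','\x09','\x0c','\x0d']:
        li = li.strip(i)
print(li)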
output:
₹ 7,372
The list in the loop contains all the ASCII whitespace characters (as far as I know) plus the extra symbols that show up in your text.
The outer loop runs 3 times because a single pass does not strip everything; you can check the value after each iteration in the variable explorer.
Optionally, you can also try to work out exactly which character produces the many pseudo-spaces between the <span> tags.
from bs4 import BeautifulSoup as bs
code = '''<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>'''
soup = bs(code,'html.parser')
w = soup.find_all('li')
l = []
for item in w:
    l.append(item)
words = str(l)
t = words.split('\n')
print(t[2][7:])
₹ 7,372
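Both answers above depend on the exact whitespace layout of the markup. As an alternative sketch, a regular expression can pull the price out of the .text value shown in the question. It assumes the price is always a rupee sign followed by digits and commas; PRICE_RE and extract_price are names chosen for the example.

```python
import re

# Assumption: a price is a rupee sign, optional whitespace, then digit groups
# separated by commas.
PRICE_RE = re.compile(r"₹\s*\d+(?:,\d+)*")


def extract_price(text):
    """Return the first price found in text, or None if there is none."""
    match = PRICE_RE.search(text)
    return match.group(0) if match else None


raw = "\n\n₹ 7,372\xa0\r\n \n–\n\n"  # the .text value shown in the question
print(extract_price(raw))
```

Returning None for a missing price lets the caller decide how to handle list items that have no price at all.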
I'm really new to learning python so this could be really obvious, but I have extracted a NavigableString from BeautifulSoup and I need to find data in the string. However, it's not as easy as some of the examples I've seen online.
My end goal is to create a dictionary that looks something like this:
dict = {'Fandom':'Undertale (Video Game)', 'Works':15341}
Here are two examples of the strings:
<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li>
<a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
I've already succeeded in extracting the fandom from the string, but now I need the works count in parentheses. How would I use Beautiful Soup and/or regular expressions to do this?
I also need to do error handling because while a fandom will always be displayed, it may not have a work count next to it.
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>
Here's the relevant pieces of code:
for each_f in cate:
    #print(each_f)
    result = each_f.find('a')
    if result !=-1:
        #here is where I grab the Fandom vals
        fandom_name = result.contents
        #print(result.contents)
NOTE: I know I'm missing the code to append to the dictionary, I haven't made it that far yet. I'm just trying to get the values to print to the screen.
Use dict.fromkeys(('Fandom', 'Works')) to get:
In [17]: dict.fromkeys(('Fandom', 'Works'))
Out[17]: {'Fandom': None, 'Works': None}
Use zip to combine the keys with the strings in the li tag; zip stops at the shorter of its inputs:
zip(('Fandom', 'Works'),li.stripped_strings)
[('Fandom', 'Undertale (Video Game)'), ('Works', '(15341)')]
[('Fandom', 'Sherlock Holmes & Related Fandoms'), ('Works', '(101015)')]
[('Fandom', 'Composer - Fandom')]
Then we update the dict with that data:
In [20]: for li in soup.find_all('li'):
    ...:     d = dict.fromkeys(('Fandom', 'Works'))
    ...:     out = zip(('Fandom', 'Works'),li.stripped_strings)
    ...:     d.update(out)
    ...:     print(d)
out:
{'Works': '(15341)', 'Fandom': 'Undertale (Video Game)'}
{'Works': '(101015)', 'Fandom': 'Sherlock Holmes & Related Fandoms'}
{'Works': None, 'Fandom': 'Composer - Fandom'}
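The question asks for the count as a number, while the output above still holds strings like '(15341)'. Here is a sketch of how the same fromkeys/zip pattern might be extended to convert it. It takes a plain list of strings in place of li.stripped_strings, assumes the count always has the form "(digits)", and uses the hypothetical name to_record.

```python
import re


def to_record(strings):
    """Build {'Fandom': ..., 'Works': ...} from the stripped strings of one <li>."""
    record = dict.fromkeys(("Fandom", "Works"))
    # zip stops at the shorter input, so a missing count leaves 'Works' as None.
    record.update(zip(("Fandom", "Works"), strings))
    if record["Works"] is not None:
        # Assumption: a count looks like "(15341)"; anything else becomes None.
        match = re.fullmatch(r"\((\d+)\)", record["Works"])
        record["Works"] = int(match.group(1)) if match else None
    return record


print(to_record(["Undertale (Video Game)", "(15341)"]))
print(to_record(["Composer - Fandom"]))
```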
You can use stripped_strings and unpack the values to get your blocks of text. You can store the results in a dict so that you can use them later.
Example:
from bs4 import BeautifulSoup
example = """<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li><a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>"""
soup = BeautifulSoup(example, "html.parser")
Fandom = {"Fandom" : []}
for li in soup.find_all("li"):
    try:
        # Unpacking raises ValueError unless there are exactly two strings
        fandom, count = li.stripped_strings
        Fandom["Fandom"].append({fandom.strip() : count[1:-1]})
    except ValueError:
        fandom = li.text.strip()
        Fandom["Fandom"].append({fandom.strip() : 0})
print(Fandom)
This outputs:
{'Fandom': [{'Undertale (Video Game)': '15341'}, {'Sherlock Holmes & Related Fandoms': '101015'}, {'Composer - Fandom': 0}]}
The try/except catches any unpacking that doesn't yield exactly two values: your fandom title and the work count.
I'm using Selenium WebDriver using Python for UI tests and I want to check the following HTML:
<ul id="myId">
<li>Something here</li>
<li>And here</li>
<li>Even more here</li>
</ul>
From this unordered list I want to loop over the elements and check the text in them. I selected the ul-element by its id, but I can't find any way to loop over the <li>-children in Selenium.
Does anybody know how to loop over the <li> children of an unordered list with Selenium (in Python)?
You need to use the .find_elements_by_ method.
For example,
html_list = self.driver.find_element_by_id("myId")
items = html_list.find_elements_by_tag_name("li")
for item in items:
    text = item.text
    print(text)
Strangely enough, I had to use this get_attribute()-workaround in order to see the content:
html_list = driver.find_element_by_id("myId")
items = html_list.find_elements_by_tag_name("li")
for item in items:
    print(item.get_attribute("innerHTML"))
You can use a list comprehension:
# Get text from all elements
text_contents = [el.text for el in driver.find_elements_by_xpath("//ul[@id='myId']/li")]
# Print text
for text in text_contents:
    print(text)