I am trying to scrape lists of elements from a page that looks like this:
<div class="container">
<b>1</b>
<b>2</b>
<b>3</b>
</div>
<div class="container">
<b>4</b>
<b>5</b>
<b>6</b>
</div>
I would like to get lists or tuples using XPath: [1,2,3], [4,5,6], ...
When I loop over the page, I get either the first element of each list or all the numbers as one list.
Could you please help me solve this?
Thank you in advance for any help!
For scraping static pages, bs4 (Beautiful Soup) is a convenient package to work with. Using bs4, you can achieve your goal as follows:
from bs4 import BeautifulSoup
source = """<div class="container">
<b>1</b>
<b>2</b>
<b>3</b>
</div>
<div class="container">
<b>4</b>
<b>5</b>
<b>6</b>
</div>"""
soup = BeautifulSoup(source, 'html.parser')  # parse the page source
containers = soup.find_all('div', {'class': 'container'})  # find all div elements; the optional second argument restricts the match to class="container"
print([[int(x.text) for x in i.find_all('b')] for i in containers])  # build one list of numbers per div
Output:
[[1, 2, 3], [4, 5, 6]]
You could use this XPath expression, which will give you two strings:
.//*[@class='container'] ➡ '1 2 3', '4 5 6'
If you would prefer six strings:
.//*[@class='container']/b ➡ '1','2','3','4','5','6'
To get exactly what you are looking for, though, you would have to use a separate XPath expression for each container:
.//*[@class='container'][1]/b ➡ '1','2','3'
.//*[@class='container'][2]/b ➡ '4','5','6'
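Instead of hard-coding one expression per container, the two queries can be nested: select each container first, then run a relative query inside it. Below is a minimal sketch using only the standard library's xml.etree.ElementTree, which supports a subset of XPath. The <root> wrapper is a hypothetical addition so the sample parses as one well-formed XML document; for real-world HTML a more tolerant parser such as lxml.html would likely be needed.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a hypothetical <root> element
# so that it forms a single well-formed XML document.
source = """<root>
<div class="container"><b>1</b><b>2</b><b>3</b></div>
<div class="container"><b>4</b><b>5</b><b>6</b></div>
</root>"""

root = ET.fromstring(source)

# The outer query selects each container; the inner query is evaluated
# relative to that container, so the numbers stay grouped per div.
groups = [
    [int(b.text) for b in container.findall("./b")]
    for container in root.findall(".//*[@class='container']")
]
print(groups)  # [[1, 2, 3], [4, 5, 6]]
```

The same two-step idea (outer query for containers, relative inner query for their children) should carry over to any XPath-capable library.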
I am scraping a website which returns a bs4.element.Tag similar to the following:
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
I am trying to extract just the text from this block and add it to a dictionary. All of the examples I have seen on the forum rely on some sort of common element like an 'id' or similar. I am not an HTML person, so I may be using incorrect terms.
What I would like to do is get the text ("four door", "inline 4 engine", etc.) and add it as values to a dictionary, with the key being a pre-designated variable, car_model.
cars = {'528i':['four door', 'inline 4 engine']}
I can't figure out a universal way to pull out the text because there may be more or fewer spans, with different classes and text. Thanks for your help!
You need to loop through all the elements matched by a selector and extract the text value from each of them.
A selector is a specific path to the element you want. In this case, the selector is .attributes-value span, where .attributes-value matches the element with that class, and span matches the span tags nested inside it.
The get_text() method retrieves the content between the opening and closing tags. This is exactly what you need.
I also recommend using the lxml parser, which is generally faster than the built-in html.parser.
The full code is attached below:
from bs4 import BeautifulSoup
import lxml
html = '''
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
'''
soup = BeautifulSoup(html, 'lxml')
cars = {
    '528i': []
}
for span in soup.select(".attributes-value span"):
    cars['528i'].append(span.get_text())
print(cars)
Output:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
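You can use a defaultdict:
from collections import defaultdict
from bs4 import BeautifulSoup
# html_doc holds the HTML from the question
out = defaultdict(list)
soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.select(".attributes-value span"):
    out["528i"].append(tag.text)
print(dict(out))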
Prints:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
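Since the question mentions a pre-designated car_model variable and a varying number of spans, the same selector can be wrapped in a small helper. This is only a sketch: the function name collect_span_texts is hypothetical, and the sample markup is shortened from the question.

```python
from bs4 import BeautifulSoup


def collect_span_texts(html, car_model):
    """Return {car_model: [text of each span nested in .attributes-value]}."""
    soup = BeautifulSoup(html, "html.parser")
    # The descendant selector matches however many inner spans exist,
    # regardless of their individual class names.
    texts = [span.get_text(strip=True)
             for span in soup.select(".attributes-value span")]
    return {car_model: texts}


html = """
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
</span>
"""
car_model = "528i"
cars = collect_span_texts(html, car_model)
print(cars)  # {'528i': ['four door', 'inline 4 engine']}
```

Because the selector does not mention the inner class names, adding or removing spans in the markup requires no change to the code.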
My HTML is like :
<body>
<div class="afds">
<span class="dfsdf">mytext</span>
</div>
<div class="sdf dzf">
<h1>some random text</h1>
</div>
</body>
I want to find all tags containing "text" & their corresponding classes. In this case, I want:
span, "dfsdf"
h1, null
Next, I want to be able to navigate through the returned tags. For example, find the div parent tag & respective classes of all the returned tags.
If I execute the following
soupx.find_all(text=re.compile(".*text.*"))
it simply returns the text part of the tags:
['mytext', ' some random text']
Please help.
You are probably looking for something along these lines:
ts = soup.find_all(text=re.compile(".*text.*"))
for t in ts:
    classes = t.parent.get("class")  # list of class names, or None if the tag has no class
    if classes:
        print(t.parent.name, " ".join(classes))
    else:
        print(t.parent.name, "null")
Output:
span dfsdf
h1 null
When given a text filter, find_all() does not return plain strings; it returns bs4.element.NavigableString objects.
That means you can call other Beautiful Soup methods on those results.
Have a look at find_parent and find_parents in the documentation.
childs = soupx.find_all(text=re.compile(".*text.*"))
for c in childs:
    c.find_parent("div")
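To make the navigation concrete, here is a minimal sketch that records, for each matching string, the enclosing tag, its classes, and the classes of the nearest div ancestor. It assumes a Beautiful Soup version that accepts the string= argument (the newer name for text=); the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = """<body>
<div class="afds"><span class="dfsdf">mytext</span></div>
<div class="sdf dzf"><h1>some random text</h1></div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

results = []
for node in soup.find_all(string=re.compile("text")):
    tag = node.parent               # the tag that directly contains the string
    div = node.find_parent("div")   # nearest enclosing div, or None
    results.append((tag.name, tag.get("class"), div.get("class") if div else None))

for row in results:
    print(row)
# ('span', ['dfsdf'], ['afds'])
# ('h1', None, ['sdf', 'dzf'])
```

Tag.get("class") returns a list because class is a multi-valued attribute, and None when the tag has no class at all.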
It's OK to write
content.css('.text>p::text').extract()
But
content.css('.text:not(.text .text)>p::text').extract()
will not work.
It tells me:
SelectorSyntaxError: Expected ')', got <S ' ' at 15>
Yes, the 15th letter in the '.text:not(.text .text)>p::text' is ' ', but how can I express this meaning without using a ' '?
Update
There are nested <div class='text'>s, I want to extract all the <p>s right beneath the first <div class='text'>.
For example:
<div class='text comment'>
<strong>abc</strong>
<span>def</span>
<p>xxxxxxxxxxxxx</p>
<p>xxxxxxxxxxxxxxxxxxxxxxxxxxx</p>
<div class='text sub_comment'>
<strong>lst</strong>
<span>lll</span>
<p>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</p>
<p>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</p>
</div>
</div>
I want to get the text of the first two <p> elements. I can't use .comment and .sub_comment to distinguish them because these class names change from case to case; the outer tag is not always comment and the inner tag is not always sub_comment.
How about trying nth-child(1)?
So your css would be:
".text:nth-child(1)>p"
In Scrapy:
In [54]: from scrapy import Selector
In [55]: a
Out[55]: u"<div><div class='text comment'> <strong>abc</strong> <span>def</span> <p>xxxxxxxxxxxxx</p> <p>xxxxxxxxxxxxxxxxxxxxxxxxxxx</p> <div class='text sub_comment'> <strong>lst</strong> <span>lll</span> <p>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</p> <p>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</p> </div></div></div>"
In [56]: sel = Selector(text=a)
In [57]: sel.css(".text:nth-child(1)>p::text").extract()
Out[57]: [u'xxxxxxxxxxxxx', u'xxxxxxxxxxxxxxxxxxxxxxxxxxx']
There is a nice explanation and demo of nth-child in this tutorial (scroll down to paragraph 22).
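Note that nth-child(1) works here because the outer div happens to be the first child of its parent while the nested one is not. If that is not guaranteed, a possible alternative is an XPath that keeps only the .text divs that have no .text ancestor. The sketch below uses lxml directly (an assumption made to keep it self-contained; Scrapy's sel.xpath() should accept the same expression), with shortened placeholder paragraph text.

```python
from lxml import html as lxml_html

# Same structure as the question, with short placeholder paragraph text.
doc = lxml_html.fromstring("""<div>
<div class='text comment'>
<strong>abc</strong><span>def</span>
<p>first</p><p>second</p>
<div class='text sub_comment'>
<strong>lst</strong><span>lll</span>
<p>third</p><p>fourth</p>
</div></div></div>""")

# XPath 1.0 idiom for "the class attribute contains the token 'text'".
IS_TEXT = "contains(concat(' ', normalize-space(@class), ' '), ' text ')"

# Keep only .text divs that are not nested inside another .text div,
# then take the text of their direct <p> children.
query = f"//div[{IS_TEXT}][not(ancestor::div[{IS_TEXT}])]/p/text()"
outer_paragraphs = doc.xpath(query)
print(outer_paragraphs)  # ['first', 'second']
```

This expresses the intent of the failing :not(.text .text) selector without relying on the position of the div among its siblings.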
I'm trying to extract some text using Beautiful Soup. The relevant portion looks something like this.
...
<p class="consistent"><strong>RecurringText</strong></p>
<p class="consistent">Text1</p>
<p class="consistent">Text2</p>
<p class="consistent">Text3</p>
<p class="consistent"><strong>VariableText</strong></p>
...
RecurringText, as the name implies, is the same in all the files. VariableText, however, changes; the only thing its occurrences have in common is that it is the next coded section. I'd like to extract Text1, Text2, and Text3. Everything before (up to and including RecurringText) and everything after (from VariableText onward) can be left behind. I have found elsewhere how to extract starting from RecurringText, but I am unsure how to stop at the next item, if that makes sense.
In sum, how can I extract the text, given that VariableText (whose string varies across the URLs) consistently comes right after the last of Text1, Text2, ..., Textn (where n differs across files)?
You can collect the p elements that lie between one p element containing a strong element and the next such p element:
from bs4 import BeautifulSoup
data = """
<div>
<p class="consistent"><strong>RecurringText</strong></p>
<p class="consistent">Text1</p>
<p class="consistent">Text2</p>
<p class="consistent">Text3</p>
<p class="consistent"><strong>VariableText</strong></p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for p in soup.find_all(lambda elm: elm and elm.name == "p" and elm.text == "RecurringText" and \
                                   "consistent" in elm.get("class") and elm.strong):
    for item in p.find_next_siblings("p"):
        if item.strong:
            break
        print(item.text)
Prints:
Text1
Text2
Text3
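If the texts are needed as a list rather than printed, the same sibling walk can be wrapped in a function. This is a sketch under the assumption that the heading always sits in a strong tag inside a p; the name texts_after_heading is hypothetical.

```python
from bs4 import BeautifulSoup


def texts_after_heading(html, heading):
    """Collect the text of each <p> after the bold heading, up to the next bold <p>."""
    soup = BeautifulSoup(html, "html.parser")
    strong = soup.find("strong", string=heading)
    start = strong.find_parent("p") if strong else None
    if start is None:
        return []  # heading not found, or not inside a <p>
    collected = []
    for sibling in start.find_next_siblings("p"):
        if sibling.strong:  # the next bold paragraph marks the end
            break
        collected.append(sibling.get_text(strip=True))
    return collected


data = """
<div>
<p class="consistent"><strong>RecurringText</strong></p>
<p class="consistent">Text1</p>
<p class="consistent">Text2</p>
<p class="consistent">Text3</p>
<p class="consistent"><strong>VariableText</strong></p>
</div>
"""
print(texts_after_heading(data, "RecurringText"))  # ['Text1', 'Text2', 'Text3']
```

The stop condition does not depend on what VariableText says, only on the fact that its paragraph contains a strong tag.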
So I need to grab the numbers after lines looking like this
<div class="gridbarvalue color_blue">79</div>
and
<div class="gridbarvalue color_red">79</div>
Is there a way I can do a findAll('div', text=re.compile('<>')) where I would find tags with gridbarvalue color_<red or blue>?
I'm using beautifulsoup.
Also sorry if I'm not making my question clear, I'm pretty inexperienced with this.
class is a Python keyword, so BeautifulSoup expects you to put an underscore after it when using it as a keyword parameter:
>>> soup.find_all('div', class_=re.compile(r'color_(?:red|blue)'))
[<div class="gridbarvalue color_blue">79</div>, <div class="gridbarvalue color_red">79</div>]
To also match the text, use
>>> soup.find_all('div', class_=re.compile(r'color_(?:red|blue)'), text='79')
[<div class="gridbarvalue color_blue">79</div>, <div class="gridbarvalue color_red">79</div>]
import re
elems = soup.findAll(attrs={'class': re.compile("color_(blue|red)")})
for e in elems:
    m = re.search(r">(\d+)<", str(e))  # capture the digits between the tags
    print("The number is %s" % m.group(1))
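As an alternative to running a regex over the serialized tag, the number can be read from the tag's text directly. A minimal sketch, assuming each matching div contains only an integer; the values 42 and 13 and the color_green div are made-up sample data to show the filter excluding other colors.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample: the color_green div should be ignored.
html = """
<div class="gridbarvalue color_blue">79</div>
<div class="gridbarvalue color_red">42</div>
<div class="gridbarvalue color_green">13</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ is matched against each individual class name of the tag.
divs = soup.find_all("div", class_=re.compile(r"^color_(?:red|blue)$"))
numbers = [int(div.get_text(strip=True)) for div in divs]
print(numbers)  # [79, 42]
```

Reading the text with get_text() avoids depending on how the tag is serialized back to a string.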