I'm trying to scrape data from a site with the structure below. I want to extract the information in each <li id="entry">, and every entry should also carry the category information from the <h2> inside its enclosing <li id="category">.
<ul class="html-winners">
<li id="category">
<h2>Redaktionell Print - Dagstidning</h2>
<ul>
<li id="entry">
<div class="entry-info">
<div class="block">
<img src="bilder/tumme10/4.jpg" width="110" height="147">
<span class="gold">Guld: Svenska Dagbladet</span><br>
<strong><strong>Designer:</strong></strong> Anna W Thurfjell och SvD:s medarbetare<br>
<strong><strong>Motivering:</strong></strong> "Konsekvent design som är lätt igenkänningsbar. Små förändringar förnyar ständigt och blldmotiven utnyttjas föredömligt."
</div>
</div>
</li>
<li id="entry">
<div class="entry-info">
<div class="block"><img src="bilder/tumme10/3.jpg" width="110" height="147">
<span class="silver">Silver: K2 - Kristianstadsbladet</span>
</div>
</div>
</li>
</ul>
</li>
I use Scrapy with the following code:
start_urls = [
    "http://www.designpriset.se/vinnare.php?year=2010"
]
rules = (
    Rule(LinkExtractor(allow="http://www.designpriset.se/", restrict_xpaths=('//*[@class="html-winners"]')), callback='parse_item'),
)

def parse(self, response):
    for sel in response.xpath('//*[@class="entry-info"]'):
        item = ByrauItem()
        annons_list = sel.xpath('//span[@class="gold"]/text()|//span[@class="silver"]/text()').extract()
        byrau_list = sel.xpath('//div/text()').extract()
        kategori_list = sel.xpath('/preceding::h2/text()').extract()
        for x in range(0, len(annons_list)):
            item['Annonsrubrik'] = annons_list[x]
            item['Byrau'] = byrau_list[x]
            item['Kategori'] = kategori_list[x]
            yield item
annons_list and byrau_list work perfectly; they use XPath to go down the hierarchy from the starting point //*[@class="entry-info"]. But kategori_list gives me "IndexError: list index out of range". Am I writing the XPath preceding axis the wrong way?
As mentioned by @kjhughes in a comment, you need to add . just before / or // to make your XPath expression relative to the current context element. Otherwise the expression is evaluated relative to the document root, and nothing precedes the root. That is why the expression /preceding::h2/text() returned nothing.
In the case of /, you can alternatively drop it from the beginning of the XPath expression, which also makes the expression relative to the current context element:
kategori_list = sel.xpath('preceding::h2/text()').extract()
Just a note: preceding::h2 will return all h2 elements located before the <div class="entry-info">, not only the nearest one. Given the HTML posted, I think the following XPath expression is safer against returning unwanted h2 elements (false positives):
query = 'parent::li/parent::ul/preceding-sibling::h2/text()'
kategori_list = sel.xpath(query).extract()
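To see the "stay relative to the current element" idea outside a crawler, here is a minimal sketch. It is not the Scrapy code above: it swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no preceding axis), so it walks each category first and then that category's entries using relative paths. The markup is a trimmed, well-formed copy of the posted HTML, the second category is made up, and the names HTML and entries_with_category are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the posted markup. Assumption: the real page is
# HTML and would need an HTML parser; the second category is invented.
HTML = """
<ul class="html-winners">
  <li id="category">
    <h2>Redaktionell Print - Dagstidning</h2>
    <ul>
      <li id="entry"><div class="entry-info"><div class="block">
        <span class="gold">Guld: Svenska Dagbladet</span>
      </div></div></li>
      <li id="entry"><div class="entry-info"><div class="block">
        <span class="silver">Silver: K2 - Kristianstadsbladet</span>
      </div></div></li>
    </ul>
  </li>
  <li id="category">
    <h2>Hypothetical second category</h2>
    <ul>
      <li id="entry"><div class="entry-info"><div class="block">
        <span class="gold">Guld: Example entry</span>
      </div></div></li>
    </ul>
  </li>
</ul>
"""


def entries_with_category(markup):
    """Return (category, headline) pairs, one per entry."""
    root = ET.fromstring(markup)
    pairs = []
    # Paths starting with "./" or ".//" are relative to the element they are called on.
    for category in root.findall("./li[@id='category']"):
        heading = category.findtext("./h2")
        for entry in category.findall(".//div[@class='entry-info']"):
            for span in entry.findall(".//span"):
                pairs.append((heading, span.text))
    return pairs


for category, headline in entries_with_category(HTML):
    print(category, "|", headline)
```

Because every lookup starts from the current category or entry, each headline is paired with its own heading instead of with every heading on the page.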
Question:
Can I group the elements I find by the div they are in, and store each group as a list inside an outer list?
Is that possible?
*I did some further testing: even if I store one div in a variable and then search inside that stored div, the search still seems to cover the whole page.
from selenium import webdriver

driver = webdriver.Chrome()
result_text = []
# Let's say this is the class of the different divs I want to group by:
# @class='a-fixed-right-grid a-spacing-top-medium'
# These are the texts from all divs around the page that I'm looking for, but I can't say which one belongs in which div
elements = driver.find_elements_by_xpath("//a[contains(@href, '/gp/product/')]")
for element in elements:
    result_text.append(element.text)
print(result_text)
Current Result:
I'm already getting all the information I'm looking for from different divs around the page but I want it to be "grouped" by the topmost div.
['Text11', 'Text12', 'Text2', 'Text31', 'Text32']
Result I want to achieve:
The text is grouped by the @class='a-fixed-right-grid a-spacing-top-medium'
[['Text11', 'Text12'], ['Text2'], ['Text31', 'Text32']]
HTML: (looks something like this)
class="a-text-center a-fixed-left-grid-col a-col-left" is the first div that wraps the group; from there on, any enclosing div could be used to group it. At least I think so.
</div>
</div>
</div>
</div>
<div class="a-fixed-right-grid a-spacing-top-medium"><div class="a-fixed-right-grid-inner a-grid-vertical-align a-grid-top">
<div class="a-fixed-right-grid-col a-col-left" style="padding-right:3.2%;float:left;">
<div class="a-row">
<div class="a-fixed-left-grid a-spacing-base"><div class="a-fixed-left-grid-inner" style="padding-left:100px">
<div class="a-text-center a-fixed-left-grid-col a-col-left" style="width:100px;margin-left:-100px;float:left;">
<div class="item-view-left-col-inner">
<a class="a-link-normal" href="/gp/product/B07YCW79/ref=ppx_yo_dt_b_asin_image_o0_s00?ie=UTF8&psc=1">
<img alt="" src="https://images-eu.ssl-images-amazon.com/images/I/41rcskoL._SY90_.jpg" aria-hidden="true" onload="if (typeof uet == 'function') { uet('cf'); uet('af'); }" class="yo-critical-feature" height="90" width="90" title="Same as the text I'm looking for" data-a-hires="https://images-eu.ssl-images-amazon.com/images/I/41rsxooL._SY180_.jpg">
</a>
</div>
</div>
<div class="a-fixed-left-grid-col a-col-right" style="padding-left:1.5%;float:left;">
<div class="a-row">
<a class="a-link-normal" href="/gp/product/B07YCR79/ref=ppx_yo_dt_b_asin_title_o00_s0?ie=UTF8&psc=1">
Text I'm looking for
</a>
</div>
<div class="a-row">
I don't have the link to test it on, but this might work for you:
from selenium import webdriver

driver = webdriver.Chrome()
# The leading "." keeps the XPath relative to each div instead of the whole page
result_text = [[a.text for a in div.find_elements_by_xpath(".//a[contains(@href, '/gp/product/')]")]
               for div in driver.find_elements_by_class_name('a-fixed-right-grid')]
print(result_text)
EDIT: added alternative function:
# if that doesn't work try:
def get_results(selenium_driver, div_class, a_xpath):
    div_list = []
    for div in selenium_driver.find_elements_by_class_name(div_class):
        a_list = []
        for a in div.find_elements_by_xpath(a_xpath):
            a_list.append(a.text)
        div_list.append(a_list)
    return div_list

get_results(driver,
            div_class='a-fixed-right-grid',
            a_xpath=".//a[contains(@href, '/gp/product/')]")
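If that doesn't work, then either the XPath is returning every matching element on the page each time, even though it is called from the div (an expression starting with // searches the whole document, whereas .// stays inside the current element), or another element farther up the document has the same class name.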
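Here is a rough, browser-free sketch of the grouping idea, in case it helps to see the shape of the result. It is only an illustration: it uses xml.etree.ElementTree from the standard library instead of Selenium, the PAGE markup is a heavily simplified stand-in for the real page, and the class name "group" and the function group_link_texts are made up for the example.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the page (an assumption; the real markup is far more
# deeply nested). Each "group" div plays the role of a-fixed-right-grid.
PAGE = """
<body>
  <div class="group">
    <a href="/gp/product/A1">Text11</a>
    <div><a href="/gp/product/A2">Text12</a></div>
  </div>
  <div class="group">
    <a href="/help">Not a product</a>
    <a href="/gp/product/B1">Text2</a>
  </div>
  <div class="group">
    <a href="/gp/product/C1">Text31</a>
    <a href="/gp/product/C2">Text32</a>
  </div>
</body>
"""


def group_link_texts(markup, group_class="group", href_part="/gp/product/"):
    """Return one list of link texts per grouping div."""
    root = ET.fromstring(markup)
    grouped = []
    for div in root.findall(f".//div[@class='{group_class}']"):
        # ".//a" searches only below this div, so links stay with their group.
        texts = [a.text for a in div.findall(".//a") if href_part in a.get("href", "")]
        grouped.append(texts)
    return grouped


print(group_link_texts(PAGE))
# [['Text11', 'Text12'], ['Text2'], ['Text31', 'Text32']]
```

The important part is that the inner search is run on each div with a path that begins with a dot, which is the same thing the Selenium version needs.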
Example of a list:
<div class="c-article-metrics-bar__wrapper u-clear-both">
<ul class="c-article-metrics-bar u-list-inline">
<li class="c-article-metrics-bar__item">
<p class="c-article-metrics-bar__count">"277k "
<span class="c-article-metrics-bar__label">Accesses</span>
</p>
</li>
<li class="c-article-metrics-bar__item">
<p class="c-article-metrics-bar__count">"6 "
<span class="c-article-metrics-bar__label">Citations</span>
</p>
</li>
<li class="c-article-metrics-bar__item">
<p class="c-article-metrics-bar__count">"594 "
<span class="c-article-metrics-bar__label">Altmetric</span>
</p>
</li>
</ul>
</div>
I am trying to get the count of "Citations" from the list. The problem is that if there are no citations, that list element is omitted.
My idea was to get info from p and span tags and save data in two lists. Then iterate through the span tag list, and if the "Citations" string is found, return text at that index from p tag list, like this:
metrics_labels = self.wd.find_element_by_css_selector('span.c-article-metrics-bar__label')
labels = [label.text for label in metrics_labels]
metrics_counts = self.wd.find_element_by_css_selector('p.c-article-metrics-bar__count')
counts = [count.text for count in metrics_counts]
for i in range(len(labels) - 1):
    if labels[i] == "Citations":
        return counts[i]
But it doesn't work. What am I missing here?
Thanks!
It looks like you are collecting only one element with find_element, instead of collecting all of them with find_elements, here:
metrics_labels = self.wd.find_element_by_css_selector('span.c-article-metrics-bar__label')
and here:
metrics_counts = self.wd.find_element_by_css_selector('p.c-article-metrics-bar__count')
A single element is not iterable, so the list comprehensions fail. To fix it, try using find_elements_by_css_selector, which returns a list:
metrics_labels = self.wd.find_elements_by_css_selector('span.c-article-metrics-bar__label')
labels = [label.text for label in metrics_labels]
metrics_counts = self.wd.find_elements_by_css_selector('p.c-article-metrics-bar__count')
counts = [count.text for count in metrics_counts]
I hope this works, good luck! If it doesn't help, please let us know the error you are getting.
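One more thing worth checking once the lists are filled: range(len(labels) - 1) skips the last label, so "Citations" would be missed whenever it is the last item. A way to avoid index bookkeeping altogether is to read each count together with its own label. The sketch below shows that idea with the standard library's xml.etree.ElementTree rather than Selenium, on a cleaned-up copy of the posted list (the quotes around the numbers are assumed to be a display artifact and are left out); METRICS and metric_count are names invented for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the posted list with only two items. Assumption: the
# quotes around the numbers in the question are not part of the page text.
METRICS = """
<ul class="c-article-metrics-bar u-list-inline">
  <li class="c-article-metrics-bar__item">
    <p class="c-article-metrics-bar__count">277k <span class="c-article-metrics-bar__label">Accesses</span></p>
  </li>
  <li class="c-article-metrics-bar__item">
    <p class="c-article-metrics-bar__count">6 <span class="c-article-metrics-bar__label">Citations</span></p>
  </li>
</ul>
"""


def metric_count(markup, label):
    """Return the count shown next to `label`, or None if that item is absent."""
    root = ET.fromstring(markup)
    for p in root.findall(".//p[@class='c-article-metrics-bar__count']"):
        span = p.find("./span[@class='c-article-metrics-bar__label']")
        if span is not None and span.text == label:
            # p.text is only the text before the first child element.
            return p.text.strip()
    return None


print(metric_count(METRICS, "Citations"))  # 6
print(metric_count(METRICS, "Altmetric"))  # None
```

Returning None for a missing label covers the case from the question where the Citations item is omitted.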
I need to scrape some pages. The exact structure of the part that I want is as follows:
<div class="someclasses">
<h3>...</h3> # Not needed
<ul class="ul-class1 ul-class2">
<li id="li1-id" class="li-class1 li-class2">
<div id ="div1-id" class="div-class1 div-class2 ... div-class6">
<div class="div2-class">
<div class="div3-class">...</div> #Not needed
<div class="div4-class1 div4-class2 div4-class3">
<a href="href1" data-control-id="id1" data-control-name="name" id ="a1-id" class="a-class1 a-class2">
<h3 class="h3-class1 h3-class2 h3-class3">Text1</h3>
</a></div>
<div>...</div> # Not needed
</div>
</li>
<li id="li2-id" class="li-class1 li-class2">
<div id ="div2-id" class="div-class1 div-class2 ... div-class6">
<div class="div2-class">
<div class="div3-class">...</div> #Not needed
<div class="div4-class1 div4-class2 div4-class3">
<a href="href2" data-control-id="id2" data-control-name="name" id ="a2-id" class="a-class1 a-class2">
<h3 class="h3-class1 h3-class2 h3-class3">Text2</h3>
</a></div>
<div>...</div> # Not needed
</div>
</li>
# More <li> elements
</ul>
</div>
What I want is to get the texts as well as the hrefs. The naming in the example above is realistic: elements that share a name here also share it on the real page. The code that I am currently using is:
elems = driver.find_elements_by_xpath("//div[@class='someclasses']/ul[@class='ul-class1']/li[@class='li-class1']")
print(len(elems))
for elem in elems:
    elem1 = driver.find_element_by_xpath("./a[@data-control-name='name']")
    names2.append(elem1.text)
    print(elem1.text)
    hrefs.append(elem.get_attribute("href"))
The print statement above outputs 0, so the elements are not found. Can anyone please tell me what I am doing wrong?
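You are using only part of the class attribute. In XPath, @class='ul-class1' is compared against the whole attribute value, so it does not match class="ul-class1 ul-class2"; you need the full value.
FYI: with CSS you can select on a single class name.
If you want to use XPath, try: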
elems = driver.find_elements_by_xpath("//div[@class='someclasses']//li//a")
print(len(elems))
for elem in elems:
    names2.append(elem.text)
    print(elem.text)
    new_href = elem.get_attribute("href")
    print(new_href)
    hrefs.append(new_href)
For CSS use: div.someclasses ul.ul-class1
elems = driver.find_elements_by_css_selector("div.someclasses ul.ul-class1 li a")
for elem in elems:
    names2.append(elem.text)
    print(elem.text)
    new_href = elem.get_attribute("href")
    print(new_href)
    hrefs.append(new_href)
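To make the difference between "whole attribute value" and "one class name" concrete, here is a small sketch. It is an illustration only: it uses xml.etree.ElementTree from the standard library instead of Selenium, on a cut-down copy of the structure from the question, and has_class is a helper invented for the example. In XPath 1.0 itself, the usual idiom for the token test is contains(concat(' ', normalize-space(@class), ' '), ' li-class1 '), which ElementTree's XPath subset does not support.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the structure from the question (assumption: well-formed).
SNIPPET = """
<div class="someclasses">
  <ul class="ul-class1 ul-class2">
    <li class="li-class1 li-class2"><a href="href1"><h3>Text1</h3></a></li>
    <li class="li-class1 li-class2"><a href="href2"><h3>Text2</h3></a></li>
  </ul>
</div>
"""

root = ET.fromstring(SNIPPET)

# Exact comparison against the whole attribute value: nothing matches,
# because the value is "li-class1 li-class2", not "li-class1".
exact = root.findall(".//li[@class='li-class1']")


def has_class(element, name):
    """True if `name` is one of the space-separated tokens in the class attribute."""
    return name in element.get("class", "").split()


# Token-based comparison, which is what a CSS selector like li.li-class1 does.
by_token = [li for li in root.iter("li") if has_class(li, "li-class1")]

links = [(li.find(".//h3").text, li.find(".//a").get("href")) for li in by_token]
print(len(exact), links)
# 0 [('Text1', 'href1'), ('Text2', 'href2')]
```

The exact comparison finds nothing, which mirrors the 0 printed in the question, while the token test finds both items.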
I am trying to get all elements whose id starts with a given prefix, but when I test my XPath, I am told it is invalid.
This is my xpath:
xpath = "//*[contains(@id,'ClubDetails_"+ str(team) +"']"
Which outputs when I print:
//*[contains(@id,'ClubDetails_1']
Here is HTML:
<div id="ClubDetails_1_386" class="fadeout">
<div class="MapListDetail">
<div>Bev Garman</div>
armadaswim@aol.com
</div>
<div class="MapListDetail">
<div>Rolling Hills Country Day School</div>
<div>26444 Crenshaw Blvd</div>
<div>Rolling Hills Estates, CA 90274</div>
</div>
<div class="MapListDetailOtherLocs_1">
<div>This club also swims at other locations</div>
<span class="show_them_link">show them...</span>
</div>
</div>
What am I missing ?
Your expression is invalid because the call to contains( is never closed: there is no ) before the final ]. As an alternative, you can use a pattern that matches every element whose ID starts with some text, collect those elements in a list, then iterate over the list and check whether each element's id attribute contains the team you want.
You can match the pattern with a CSS selector as below:
div[id^='ClubDetails_']
And this is the XPath equivalent:
//div[starts-with(@id, 'ClubDetails_')]
Code sample:
expectedid = "1_386"
clubdetails = driver.find_elements_by_css_selector("div[id^='ClubDetails_']")
for item in clubdetails:
    elementid = item.get_attribute('id')
    if expectedid in elementid:
        pass  # further code
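If you would rather build the XPath per team, a small helper keeps the quoting and brackets in one place. This is a sketch under one assumption that should be checked against the real page: ids have the form ClubDetails_<team>_<club>, as in ClubDetails_1_386. The function names club_xpath and ids_for_team are made up for the example.

```python
def club_xpath(team):
    """Build an XPath matching ids of the assumed form ClubDetails_<team>_<club>."""
    # The trailing underscore keeps team 1 from also matching team 10, 11, ...
    return f"//*[starts-with(@id, 'ClubDetails_{team}_')]"


def ids_for_team(ids, team):
    """Pure-Python equivalent of the same prefix test, for ids already collected."""
    prefix = f"ClubDetails_{team}_"
    return [element_id for element_id in ids if element_id.startswith(prefix)]


print(club_xpath(1))
# //*[starts-with(@id, 'ClubDetails_1_')]
print(ids_for_team(["ClubDetails_1_386", "ClubDetails_10_12", "ClubDetails_1_7"], 1))
# ['ClubDetails_1_386', 'ClubDetails_1_7']
```

The resulting string can be passed to the same find_elements_by_xpath call used above.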
Let's suppose that I have the following HTML:
<div class="class1">
<div class="some multiple classes here">
<div class="some multiple classes here">
<ul class="other classes">
<li>
<div class="random">some text</div>
<div class="random1">some text1</div>
<div class="random2">some text2</div>
</li>
<li>
<div class="random">some text3</div>
<div class="random1">some text4</div>
<div class="random2">some text5</div>
</li>
<li>
<div class="random">some text6</div>
<div class="random1">some text7</div>
<div class="random2">some text8</div>
</li>
<!-- here can appear more <li></li> elements -->
</ul>
</div>
</div>
</div>
Now, in python, I made a function which is adding each message from the divs inside li tags in a list. So my lists will look like this:
messages_list = ['some text some text1 some text2', 'some text3 some text4 some text5', 'and so on..']
The function that I created uses selenium webdriver to get the content from the HTML and it looks like this:
def writeToChatTest(CHAT_URL):
    mydriver.get(CHAT_URL)
    message = "Some message to test"
    xpaths = {
        'textArea': "//*[@id='ipsTabs_elChatTabs_chatroom_panel']/div[1]/div[1]/div/div/div[1]/textarea",
        'submitMessage': "//*[@id='ipsTabs_elChatTabs_chatroom_panel']/div[1]/div[1]/div/div/div[3]/button"
    }
    time.sleep(5)
    rst_messages_list = []
    lis = mydriver.find_elements_by_xpath('//ul[@class="ipsList_reset cChatContainer"]/li')
    for li in lis:
        rst_messages_list.append(li.text)
    for unique_message in rst_messages_list:
        if "word" in unique_message:
            mydriver.find_element_by_xpath(xpaths['textArea']).clear()
            mydriver.find_element_by_xpath(xpaths['textArea']).send_keys(unique_message[0] + ": " + message)
            mydriver.find_element_by_xpath(xpaths['submitMessage']).click()
Now, my question is: is there any way of storing the last li tag parsed and checking whether there is a new one (or more)? Also, how can I make this check run continuously?
The problem is that once I have parsed all the li tags, I'm not able to retrieve the new ones (it's a chat, so new lis appear pretty often).
Each element has a unique ID (Selenium's internal element ID, not the HTML id attribute), so you could store the last li processed, but that complicates things.
I would do something like:
def writeToChatTest(CHAT_URL):
    mydriver.get(CHAT_URL)
    message = "Some message to test"
    xpaths = {
        'textArea': "//*[@id='ipsTabs_elChatTabs_chatroom_panel']/div[1]/div[1]/div/div/div[1]/textarea",
        'submitMessage': "//*[@id='ipsTabs_elChatTabs_chatroom_panel']/div[1]/div[1]/div/div/div[3]/button"
    }
    parsed_messages = []
    keepRunning = True
    while keepRunning:
        time.sleep(5)
        lis = mydriver.find_elements_by_xpath('//ul[@class="ipsList_reset cChatContainer"]/li')
        rst_messages_list = []
        for li in lis:
            # li.id is a property (Selenium's internal element ID), not a method
            if li.id not in parsed_messages:
                # Sending the message 'end selenium test' in the chat ends the loop cleanly
                if li.text == 'end selenium test':
                    keepRunning = False
                rst_messages_list.append(li.text)
                parsed_messages.append(li.id)
        for unique_message in rst_messages_list:
            if "word" in unique_message:
                mydriver.find_element_by_xpath(xpaths['textArea']).clear()
                mydriver.find_element_by_xpath(xpaths['textArea']).send_keys(unique_message[0] + ": " + message)
                mydriver.find_element_by_xpath(xpaths['submitMessage']).click()
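The bookkeeping part of that loop can be tried without a browser. Below is a minimal sketch of the "remember what was already handled" step on plain data; it assumes each message can be reduced to an (id, text) pair, and the name take_new_messages and the sample ids are invented for the example. A set is used instead of a list so the membership test stays cheap as the chat grows.

```python
def take_new_messages(current, seen_ids):
    """Return texts of messages not seen before and record their ids.

    `current` is a list of (message_id, text) pairs in page order;
    `seen_ids` is a set that is updated in place.
    """
    fresh = []
    for message_id, text in current:
        if message_id not in seen_ids:
            seen_ids.add(message_id)
            fresh.append(text)
    return fresh


seen = set()
first_poll = [("a1", "hello"), ("a2", "a word here")]
second_poll = first_poll + [("a3", "another word")]

print(take_new_messages(first_poll, seen))   # ['hello', 'a word here']
print(take_new_messages(second_poll, seen))  # ['another word']
```

In the Selenium loop, each poll would build the pairs from the li elements found on that pass, and only the returned texts would be checked for "word".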