I have multiple webpages whose main content is essentially a bunch of <span> tags. The id of each span is random.
The basic page structure is as follows:
<pre>
<span id = "abcdf">
x
<span title="1">
random text
random text
</span>
... Repeat ...
<span id = "awfaf">
x
<span title="127">
random text
random text
</span>
</pre>
The id is always random, and the title of the span is always an integer. (It increases across pages: 1-128 on page one, 129-256 on page two, etc.)
What I would like to do is pull the id of each span, and then the two columns of text in its second and third link (<a>).
I'm not sure how to go about doing this in a repeatable way and simply need an idea for the logic, i.e. which elements to pull as I work through the pages.
Following is one way to get the required data using Java (Selenium):

List<String> idList = new ArrayList<String>();
List<String> textList1 = new ArrayList<String>();
List<String> textList2 = new ArrayList<String>();

int i = 1;
// Loop over the outer spans until no more are found
while (driver.findElements(By.xpath("//pre/span[" + i + "]")).size() != 0) {
    // Random id of the i-th outer span
    idList.add(driver.findElement(By.xpath("//pre/span[" + i + "]")).getAttribute("id"));
    // Text of its second and third <a> children
    textList1.add(driver.findElement(By.xpath("//pre/span[" + i + "]/a[2]")).getText());
    textList2.add(driver.findElement(By.xpath("//pre/span[" + i + "]/a[3]")).getText());
    i++;
}
The above code can be executed for each page.
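If you are working in Python instead, the same loop translates directly to BeautifulSoup. A minimal sketch, assuming the HTML is already in a page_source string and that, as in the Java answer above, the second and third <a> children of each outer span hold the text you want:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, "html.parser")   # page_source: the fetched HTML (assumed)
for outer in soup.select("pre > span[id]"):        # each outer span with a random id
    links = outer.find_all("a", recursive=False)   # direct <a> children, as in the a[2]/a[3] XPaths
    print(outer["id"], links[1].get_text(), links[2].get_text())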
I have an html document like this: https://dropmefiles.com/wezmb
I need to extract the text between <span id="1"> and </span>, but I don't know how.
I tried writing this code:
from bs4 import BeautifulSoup

with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp, features="html.parser")

for a in soup.find_all('span'):
    print(a.string)
But it extracts the text from all 'span' tags. So how can I extract the text between <span id="1"> and </span> in Python?
What you need is the .contents attribute (see the BeautifulSoup documentation).
Find the span <span id = "1"> ... </span> using
for x in soup.find(id="1").contents:
    print(x)
OR
x = soup.find(id="1").contents[0]  # since there will only be one element with the id "1"
print(x)
This will give you:
10
that is, an empty line followed by 10 followed by another empty line. This is because the string in the HTML actually contains those newlines, so 10 is printed on its own line, just as 10 sits on its own line in the HTML.
The string will actually be '\n10\n'.
If you want just x = '10' from x = '\n10\n', you can do x = x[1:-1], since '\n' is a single character. Hope this helped.
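If you would rather not slice the newlines off by hand, get_text(strip=True) does the same trimming in one call; a small sketch reusing the soup object and id from above:

x = soup.find(id="1").get_text(strip=True)   # '10' with the surrounding newlines removed
print(x)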
I'm working in Python on a Selenium problem. I'm trying to gather each element with an h1 tag, and following that tag I want to get the closest h2 and paragraph text tags and place that data into an object.
My current code looks like:
cards = browser.find_elements_by_tag_name("h1")
ratings = browser.find_elements_by_tag_name('h3')
descriptions = browser.find_elements_by_tag_name('p')
print(len(cards))
print(len(ratings))
print(len(descriptions))
which is generating inconsistent numbers.
To get the <h1> tag elements and then the next sibling <h2> and <p> tag elements you can use the following solution:
cards = browser.find_elements_by_tag_name("h1")
ratings = browser.find_elements_by_xpath("//h1/following-sibling::h2[1]")
descriptions = browser.find_elements_by_xpath("//h1/following-sibling::p[1]")
print(len(cards))
print(len(ratings))
print(len(descriptions))
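If the goal is to group each <h1> with only its nearest following <h2> and <p> rather than building three parallel lists, you can also query relative to each <h1>. A rough sketch using the same Selenium 3 style calls as above, assuming every <h1> is followed by sibling <h2> and <p> elements:

cards = []
for h1 in browser.find_elements_by_tag_name("h1"):
    # nearest following sibling <h2> and <p> of this particular <h1>
    rating = h1.find_element_by_xpath("./following-sibling::h2[1]")
    description = h1.find_element_by_xpath("./following-sibling::p[1]")
    cards.append({"title": h1.text, "rating": rating.text, "description": description.text})
print(cards)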
I've written an xpath expression to get the highest page number from some html elements. However, with the below xpath I'm getting the last text, which is Next Page in this case. I wish my xpath acted in such a way that I could get the highest number, which is 6 here.
The elements upon which the xpath should be applied:
content = """
<div class="nav-links"><span aria-current="page" class="page-numbers current"><span class="meta-nav screen-reader-text">Page </span>1</span>
<a class="page-numbers" href="https://page/2/"><span class="meta-nav screen-reader-text">Page </span>2</a>
<span class="page-numbers dots">…</span>
<a class="page-numbers" href="https://page/6/"><span class="meta-nav screen-reader-text">Page </span>6</a>
<a class="next page-numbers" href="https://page/2/"><span class="screen-reader-text">Next Page</span></a></div>
"""
What I've tried so far:
from lxml.html import fromstring
root = fromstring(content)
pagenum = root.xpath("//*[contains(@class,'page-numbers')][last()]/span")[0].text
print(pagenum)
Output I'm having:
Next Page
Output I wish to have:
6
You can use the exact class name to avoid fetching the Next link:
//a[@class="page-numbers"][last()]
Note that contains(@class,'page-numbers') will return the links with numbers and the Next link, while @class="page-numbers" matches the numbered links only.
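If the pagination markup is less predictable, another option is to collect every numbered link and take the maximum explicitly. A sketch against the content snippet from the question:

from lxml.html import fromstring

root = fromstring(content)
# only the plain numbered links match; the current-page <span> and the 'next page-numbers' <a> are excluded
numbers = [int(a.text_content().replace("Page", "").strip())
           for a in root.xpath('//a[@class="page-numbers"]')]
print(max(numbers))   # 6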
I am in the middle of scraping data from a website, but I encountered the following code:
code = """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>"""
I need to extract only "₹ 7,372".
I have tried the following:
1. code.text
but it results in
'\n\n₹ 7,372\xa0\r\n \n–\n\n'
2. code.text.strip()
but it results in
'₹ 7,372\xa0\r\n \n–'
Is there any method?
Please let me know, so that I can complete my project.
OK, I managed to clean the data you need. This way is a little ugly, but it works =)
from bs4 import BeautifulSoup as BS
html= """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> """
soup = BS(html)
li = soup.find('li').text
for j in range(3):
    for i in ['\n', ' ', '–', '\xa0', '\r', '\x20', '\x0a', '\x09', '\x0c', '\x0d']:
        li = li.strip(i)
print(li)
output:
₹ 7,372
In the loop's list I included all (as far as I know) of the ASCII whitespace characters plus the symbols that appear in your output.
The loop runs 3 times because the value doesn't come clean on the first pass; you can check it after each iteration in the variable explorer.
Optionally, you can also try to figure out exactly which character produces the many pseudo-spaces between the <span> tags.
from bs4 import BeautifulSoup as bs
code = '''<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>'''
soup = bs(code,'html.parser')
w = soup.find_all('li')
l = []
for item in w:
    l.append(item)
words = str(l)
t = words.split('\n')
print(t[2][7:])
₹ 7,372
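Since the price is a text node sitting directly under the <li> (between the two <span>s), you can also pick it out without any stripping loops or string slicing. A short sketch reusing the code snippet from the question:

from bs4 import BeautifulSoup

soup = BeautifulSoup(code, 'html.parser')
li = soup.find('li', class_='price-current')
# join only the text nodes that are direct children of <li>, skipping the nested <span>s
price = ''.join(li.find_all(string=True, recursive=False)).strip()
print(price)   # ₹ 7,372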
This is an extraction of an HTML file from http://www.flashscore.com/hockey/sweden/shl/results/
<td title="Click for match detail!" class="cell_sa score bold">4:3<br><span class="aet">(3:3)</span></td>
<td title="Click for match detail!" class="cell_sa score bold">2:5</td>
I would now like to extract the scores after regulation time.
This means that whenever <span class="aet"> is present (inside the <td class="cell_sa score bold">) I need to get the text from that span. If <span class="aet"> is not present, I would like to extract the text from <td class="cell_sa score bold"> itself.
In the above case the desired output (in a list) would be:
[3:3,2:5]
How could I go about that with xpath statements in Python?
You can reach the text nodes of the desired tags, obeying the conditions you defined, with this:
(/tr/td[count(./span[@class = 'aet']) > 0]/span[@class = 'aet'] | /tr/td[0 = count(./span[@class = 'aet'])])/text()
I assumed the <td> tags are grouped in a <tr> tag.
If you want to strictly choose only <td>s having 'cell_sa' and 'score' and 'bold', add [contains(@class, 'cell_sa')][contains(@class, 'score')][contains(@class, 'bold')] after each td, as below:
(/tr/td[contains(@class, 'cell_sa')][contains(@class, 'score')][contains(@class, 'bold')][count(./span[@class = 'aet']) > 0]/span[@class = 'aet'] | /tr/td[contains(@class, 'cell_sa')][contains(@class, 'score')][contains(@class, 'bold')][0 = count(./span[@class = 'aet'])])/text()
As you can see, I tried to implement the @class check order-independently and loosely (just as a CSS selector would). You could implement this check as a simple string comparison, but that would result in a fragile data consumer.
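For completeness, the first expression can be applied from Python with lxml. A sketch in which the two <td>s from the question are wrapped in a <tr> and the <br> is made self-closing so the snippet parses as XML:

from lxml import etree

html = """<tr>
<td title="Click for match detail!" class="cell_sa score bold">4:3<br/><span class="aet">(3:3)</span></td>
<td title="Click for match detail!" class="cell_sa score bold">2:5</td>
</tr>"""

root = etree.fromstring(html)
scores = root.xpath(
    "(/tr/td[count(./span[@class = 'aet']) > 0]/span[@class = 'aet']"
    " | /tr/td[0 = count(./span[@class = 'aet'])])/text()"
)
print(scores)   # ['(3:3)', '2:5'] -- strip the parentheses if you want just 3:3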