BeautifulSoup, how can I get texts without class identifier? - python

While crawling a website, some of the text I want to pull has no class name or id style that would let me separate out the part containing it. The selector path I used with soup.select doesn't work for repeated operations. As an example, I want to extract the data below, but I don't know how to do it.

Just a guess: if you can get the table and you know the row, you can do the following. Use findAll to get all the rows in a list, then use slice or index syntax to access your element:
row = your_table_result.findAll('tr')[5::6]
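As a side note on the slice itself, its behavior can be checked with a plain list standing in for the rows: `[5::6]` takes every sixth element starting at index 5, while plain `[5]` picks a single row.

```python
# stand-in for the list returned by findAll('tr')
rows = [f"row {i}" for i in range(12)]

# [5::6] is a slice: start at index 5, then step by 6
print(rows[5::6])   # ['row 5', 'row 11']

# a single row is plain indexing
print(rows[5])      # row 5
```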
EDITED AFTER QUESTION UPDATE
You solve your problem in different ways, but first grab the table:
table = soup.find("table",{"class":"auflistung"})
Way #1 - You know the row, where information is stored
(be aware that the structure of the table can change or differ between pages)
rows = table.findAll('td')
name = rows[0].text.strip()
position = rows[6].text.strip()
Way #2 - You know heading of information
(works well because there is only one column)
name = table.find("th", text="Anavatandaki isim:").find_next_sibling("td").text.strip()
position = table.find("th", text="Mevki:").find_next_sibling("td").text.strip()
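A minimal, self-contained check of Way #2, with a made-up HTML fragment that mirrors the table structure (note that newer BeautifulSoup versions prefer string= over text= for matching):

```python
from bs4 import BeautifulSoup

# made-up fragment shaped like the table in question
html = """
<table class="auflistung">
  <tr><th>Anavatandaki isim:</th><td> Some Name </td></tr>
  <tr><th>Mevki:</th><td> Kaleci </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "auflistung"})

# find the heading cell, then read its sibling td
name = table.find("th", string="Anavatandaki isim:").find_next_sibling("td").text.strip()
position = table.find("th", string="Mevki:").find_next_sibling("td").text.strip()
print(name, position)
```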

Related

Find_all function returning multiple strings that are separated but not delimited by a character

Background: I am pretty new to Python and decided to practice by making a web scraper for https://www.marketwatch.com/tools/markets/stocks/a-z, which would allow me to pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page has two pagination controls: one to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). These two tags both have the class="pagination" identifier, but when I scrape based on that criterion I get two separate strings rather than a delimited list separated by commas.
Does anyone know how to get the strings as a list, or individually? I really only care about the second.
from bs4 import BeautifulSoup
import requests
# open the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')
for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on /n:
tags_text = tags.text.split('/n')
print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seems to give me the split list I want. I have found many ways to get the first string, but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string and the word (current) might be in a different place in the first list.
THANK YOU!!!!
Edit:
Cleaned old, commented-out code from the source and realized I did not show the result of trying to call the second element despite the lack of a comma in the split example:
tags_text = tags.text.split('/n')[1]
print(tags_text)
Returns:
File "C:\.....", line 22, in <module>
tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind, I was using print() when I should have been using return, .append(), or something else to actually do something with the value...
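For what it's worth, the IndexError can be reproduced without any scraping: the literal '/n' (forward slash plus n) never occurs in the text, so split returns a one-element list, and [1] is out of range; splitting on the newline escape '\n' is what actually produces multiple pieces. A minimal sketch, with made-up strings shaped like the output above:

```python
# made-up stand-ins for the two pagination texts
pagination_texts = [
    "\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther",
    "«123»",
]

for raw in pagination_texts:
    # '/n' (forward slash + n) never occurs, so nothing is split
    print(raw.split('/n'))

# '\n' is the newline escape; splitting on it actually breaks the string apart
parts = pagination_texts[0].split('\n')
print(parts)
```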

Is there any way to select element through HTML, by selenium, python

I am making a crawling app with Selenium and Python, and I am stuck.
As shown in the picture, I can select the text (underlined), but what I need is the numbers next to the text. However, in Chrome DevTools (F12),
the numbers (red circle) have a class name, but those class names are all the same.
As far as I know, there is no indicator I can use to select the numbers through Selenium.
So I tried to find some way to select the element through the HTML with Selenium,
but I couldn't find any. Is there any way to do this?
If I am looking for something that does not exist, I am very sorry.
I only know Python and Selenium, so if this cannot be handled, please let me know.
--- edit
I think I gave a bad explanation.
What I need is to find the text first, then collect the numbers (both of them).
There are tons of text entries; I only screenshotted a small part.
I can locate the texts by their specific ids (there are lots of them),
but how can I get the numbers that are next to the text?
That is my question; sorry for the bad explanation.
And if BeautifulSoup can handle this, please let me know. Thanks for your help.
Special thanks to Christine; her code solved my problem.
You can use an XPath index to select the first td element. Given the screenshot, you can select the first td containing 2,167 as such:
cell = driver.find_element_by_xpath("//tr[td/a[text()='TEXT']]/td[@class='txt-r'][1]")
print(cell.text)
You should replace TEXT with the characters you underlined in your screenshot -- I do not have this keyboard so I cannot type the text for you.
The above XPath queries all table rows, picks the row with your desired text, then queries the table cells with class txt-r within that row. Because the two td elements both have class txt-r, you need to pick one of them using an index, indicated by [1] (XPath indices start at 1). The [1] picks the first td, with text 2,167.
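The class-predicate-plus-index idea can be tried outside Selenium with the standard library's limited XPath support; the row below is a made-up stand-in for the table in the screenshot:

```python
import xml.etree.ElementTree as ET

# made-up row shaped like the one in the question
row = ET.fromstring(
    "<tr><td><a>TEXT</a></td>"
    "<td class='txt-r'>2,167</td>"
    "<td class='txt-r'>0</td></tr>"
)

# all cells with class 'txt-r', in document order
cells = row.findall("td[@class='txt-r']")
print(cells[0].text, cells[1].text)
```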
Full sample as requested by the user:
# first get all text on the page
all_text_elements = driver.find_elements_by_xpath("//a[contains(@class, 'link-resource')]")
# iterate text elements and print both numbers that are next to each text
for text_element in all_text_elements:
    # get the text from the web element
    text = text_element.text
    # find the first number next to it (2,167 from the sample HTML)
    first_number = driver.find_element_by_xpath("//tr[td/a[text()='" + text + "']]/td[@class='txt-r'][1]")
    print(first_number.text)
    # find the 2nd number (0 from the sample HTML)
    second_number = driver.find_element_by_xpath("//tr[td/a[text()='" + text + "']]/td[@class='txt-r'][2]")
    print(second_number.text)

(HTML Scraping) XPath of a column changes based on color

I am trying to parse all of the values in a column of this website (with different stock tickers). I am working in Python and am using XPath to scrape the HTML data.
Let's say I want to extract the value of "Change", which is currently 0.62% (and green). I would first get the tree of the website and then write:
stockInfo_1 = tree.xpath('//*[@class="table-dark-row"]/td[12]/b/span/text()')
I would then get an array of values, and the last element happens to be the change value.
However, I noticed that if a value in this column has a color it is inside /b/span, while if it has no color there is no span and it is just inside /b.
So to explain:
stockInfo_1 = tree.xpath('//*[@class="table-dark-row"]/td[12]/b/span/text()')
^ this array would have every value in this column that is colored,
while stockInfo_1 = tree.xpath('//*[@class="table-dark-row"]/td[12]/b/text()')
^ would have every value in the column that does not have a color.
The colors are not consistent for each stock. Some stocks have random values that have colors and some do not. So that messes up the /b/span and /b array consistency.
How can I get an array of ALL of the values (in order) in each column regardless of whether they are in a span or not? I do not care about the colors, I just care about the values.
I can explain more if needed. Thanks!!
You can directly skip intermediate tags in XPath and get all the values in a list by using // in between.
So the snippet should be
tree.xpath('//*[@class="table-dark-row"]/td[12]/b//text()')
This skips all the intermediate tags between b and the text nodes.
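The difference is easy to see outside lxml too. With the standard library, an element's .text corresponds to b/text() (direct text only), while itertext() behaves like b//text(), collecting text from all descendants such as the span. A sketch with a made-up fragment:

```python
import xml.etree.ElementTree as ET

# made-up cell: part of the text sits directly in <b>, part inside a <span>
b = ET.fromstring("<b>12.05 <span>1.04%</span></b>")

# analogue of b/text(): only the text directly inside <b>
print(b.text)

# analogue of b//text(): text from <b> and all its descendants, in order
print(list(b.itertext()))
```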
I've tried it using lxml. Here is the code:
import requests
from lxml import html

url = "https://finviz.com/quote.ashx?t=acco&ty=c&ta=1&p=d"
resp = requests.get(url)
tree = html.fromstring(resp.content)
values = tree.xpath('//*[@class="table-dark-row"]/td[12]/b//text()')
print(values)
Which gives output as follows:
['0.00%', '-2.43%', '-8.71%', '-8.71%', '7.59%', '-1.23%', '1.21', '0.30', '2.34% 2.38%', '12.05', '12.18', '1.04%']
Note: if you don't want to hardcode 12 in the above XPath, you can also use last(), as in tree.xpath('//*[@class="table-dark-row"]/td[last()]/b//text()')
Xpath cheat sheet for your kind reference.
Using "//" And ".//" Expressions In XPath XML Search Directives In ColdFusion

Python Selenium only getting first row when iterating over table

I am trying to extract the most recent headlines from the following news site:
http://news.sina.com.cn/hotnews/
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

# save ids of relevant buttons that need to be clicked on the site
buttons_ids = ['Tab21', 'Tab22', 'Tab32']
# save ids of relevant subsections
con_ids = ['Con11']
# start webdriver, go to site, hover over buttons
driver = webdriver.Chrome()
driver.get("http://news.sina.com.cn/hotnews/")
time.sleep(3)
for button_id in buttons_ids:
    button = driver.find_element_by_id(button_id)
    ActionChains(driver).move_to_element(button).perform()
Then I iterate through each section I am interested in and, within each section, through all the headlines, which are rows in an HTML table. However, on every iteration it returns the first element:
for con_id in con_ids:
    for news_id in range(2, 10):
        print(news_id)
        headline = driver.find_element_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr[" + str(news_id) + "]")
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
I also tried the following approach by essentially saving the table as a list and then iterating through the rows:
for con_id in con_ids:
    table = driver.find_elements_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr")
    for headline in table:
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
In the second case I get exactly the number of headlines in the section, so it apparently picks up the rows correctly. However, it still returns only the first row on every iteration. Where am I going wrong? I know a similar question has been asked here: Selenium Python iterate over a table of rows it is stopping at the first row, but I am still unable to figure out my mistake.
In XPath, queries that begin with // will search relative to the document root; so even though you're calling find_element_by_xpath() on the correct container element, you're breaking out of that scope, thereby performing the same global search and yielding the same result every time.
To constrain your query to descendants of the current element, begin your query with .//, e.g.,:
text = headline.find_element_by_xpath(".//td[2]/a")
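The scoping rule can be seen with the standard library as well: ElementTree only supports searches relative to the element you call them on, which is exactly what the leading ./ makes explicit. A sketch with a made-up document:

```python
import xml.etree.ElementTree as ET

# made-up document with two containers holding similar children
doc = ET.fromstring(
    "<root>"
    "<div id='Con11'><p>first headline</p></div>"
    "<div id='Con12'><p>second headline</p></div>"
    "</root>"
)

second_div = doc.find("div[@id='Con12']")
# './/p' searches only inside second_div, not the whole document
print(second_div.find(".//p").text)
```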
try this:
for con_id in con_ids:
    for news_id in range(2, 10):
        print(news_id)
        print("(//div[@id='" + con_id + "']/table/tbody/tr)[" + str(news_id) + "]")
        headline = driver.find_element_by_xpath("(//div[@id='" + con_id + "']/table/tbody/tr)[" + str(news_id) + "]")
        value = headline.find_element_by_xpath(".//td[2]/a")
        print(value.get_attribute("innerText").encode('utf-8'))
I am able to get the headlines with the above code.
I was able to solve it by specifying the entire XPath in one go like this:
headline = driver.find_element_by_xpath("(//*[@id='" + con_id + "']/table/tbody/tr[" + str(news_id) + "]/td[2]/a)")
print(headline.get_attribute("innerText"))
print(headline.get_attribute("href"))
rather than splitting it into two parts.
My only explanation for why it only prints the first row repeatedly is that there is some weird JavaScript at work that doesn't let you iterate properly when splitting the request,
or that my first version had a syntax error I am not aware of.
If anyone has a better explanation, I'd be glad to hear it!

Python - Extract data from a specific table in a page

Just started to learn Python. Spent the whole weekend on this project but the progress is terrible. Hopefully I can get some guidance from the community.
Part of my tutorial requires me to extract data from a Google Finance page, https://www.google.com/finance, but only the sector summary table, and then organize the data into a JSON dump.
The questions I have so far is:
1) How do I extract data from the sector summary table only? I can use find_all, but the results come back including other tables as well.
2) How do I get the change for each sector, i.e. (energy: 0.99%, basic materials: 0.31%, industrials: 0.17%)? There is no unique tag I can use; the only pattern is that these numbers appear right below the corresponding sector name.
Looking at the page (either using View Source or your browser's developer tools), we know a few things:
The sector summary table is the only one inside a div tag with id=secperf (probably short for 'sector performance').
For every row except the first, the first cell from the left contains the sector name; the second one from the left contains the change percentage.
The other cells might contain bar graphs. The bar graphs also happen to be tables, but we want to ignore them, so we shouldn't recurse into them.
There are many ways to approach this. One way would be as follows:
def sector_summary(document):
    table = document.find(id='secperf').find('table')
    rows = table.find_all('tr', recursive=False)
    for row in rows[1:]:
        cells = row.find_all('td')
        sector = cells[0].get_text().strip()
        change = cells[1].get_text().strip()
        yield (sector, change)

print(dict(sector_summary(my_document)))
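A self-contained check of the generator, with a made-up fragment standing in for the page (the real Google Finance markup may differ):

```python
from bs4 import BeautifulSoup

# made-up fragment: header row, two sector rows, one nested bar-graph table
html = """
<div id="secperf">
<table>
<tr><th>Sector</th><th>Change</th></tr>
<tr><td>Energy</td><td>+0.99%</td><td><table><tr><td>bar</td></tr></table></td></tr>
<tr><td>Basic Materials</td><td>+0.31%</td></tr>
</table>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

def sector_summary(document):
    table = document.find(id='secperf').find('table')
    # recursive=False keeps us out of the nested bar-graph tables
    rows = table.find_all('tr', recursive=False)
    for row in rows[1:]:
        cells = row.find_all('td')
        yield (cells[0].get_text().strip(), cells[1].get_text().strip())

print(dict(sector_summary(doc)))
```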
