Python - Extract data from a specific table in a page

I just started to learn Python. I spent the whole weekend on this project but the progress has been terrible. Hopefully I can get some guidance from the community.
Part of my tutorial requires me to extract data from a Google Finance page, https://www.google.com/finance, but only the sector summary table, and then organize it into a JSON dump.
The questions I have so far are:
1) How do I extract data from the sector summary table only? I can use find_all, but the results come back including other tables as well.
2) How do I get the change for each sector, i.e. (energy: 0.99%, basic materials: 0.31%, industrials: 0.17%)? There is no unique tag I can use; the only pattern is that these numbers sit directly below the corresponding sector name.

Looking at the page (either using View Source or your browser's developer tools), we know a few things:
The sector summary table is the only one inside a div tag with id=secperf (probably short for 'sector performance').
For every row except the first, the first cell from the left contains the sector name; the second one from the left contains the change percentage.
The other cells might contain bar graphs. The bar graphs also happen to be tables, but we want to ignore them, so we shouldn't recurse into them.
There are many ways to approach this. One way would be as follows:
def sector_summary(document):
    # the sector summary table is the only one inside the div with id="secperf"
    table = document.find(id='secperf').find('table')
    # recursive=False keeps us from descending into the nested bar-graph tables
    rows = table.find_all('tr', recursive=False)
    for row in rows[1:]:  # skip the header row
        cells = row.find_all('td')
        sector = cells[0].get_text().strip()
        change = cells[1].get_text().strip()
        yield (sector, change)

print(dict(sector_summary(my_document)))
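Since the original goal was a JSON dump, the generator's output can go straight into the standard json module. A minimal sketch, assuming my_document is the BeautifulSoup object parsed from the page:
import json
summary = dict(sector_summary(my_document))
print(json.dumps(summary, indent=2))  # e.g. {"Energy": "0.99%", ...}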

Related

BeautifulSoup - scrape hover over tooltip data

I am very new to coding and Python; I have only been using Python for a few weeks, so please be kind. I used to code in college with C++, but that was 30 years ago, so I am basically starting from ground zero.
I have an HTML table. I have been able to break the table down using BeautifulSoup into a list of rows and then into a list of columns in each row, and I have been able to grab other data from the columns. But this last bit of text, which sits inside a tooltip that is only visible on hover, is giving me a headache.
I can see the text I want in my debugger but can't seem to figure out how to reference it. The tooltip data is a list of names separated by commas. Once I have pulled the text from the tooltip, I am going to split the names into a list. You can see in the debugger window that I have marked the field I am trying to grab.
output = []
for row in table.findAll('tr'):
    # Find all data for each column
    try:
        columns = row.find_all('td')
        # separate out the columns
        if columns is not None and len(columns) >= 5:
            coach = columns[1].text.strip()
            status = columns[2].text.strip()
            currently_coaching = columns[3].text.strip()
            players_coached = columns[4].contents[1].strip()
    except Exception:
        # skip rows that don't match the expected structure
        continue
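Tooltip text that only shows on hover is usually not part of the element's visible text at all; it typically lives in an attribute such as title (or a data-* attribute), which .text will not return. A sketch of how it might be read, assuming the names sit in a title attribute somewhere inside the fifth column (the exact attribute depends on the page's HTML):
tooltip = columns[4].find(attrs={'title': True})  # first descendant carrying a title attribute
if tooltip is not None:
    players_coached = [name.strip() for name in tooltip['title'].split(',')]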

BeautifulSoup, how can I get texts without class identifier?

While crawling the website, some of the text I want to pull has no class name, nor any id or style attribute that separates the part containing it. The selector path I used with soup.select does not work for repeated operations. As an example, I want to take the data below, but I don't know how to do it.
Just a guess: if you can get the table and you know the row, you can do the following. Use findAll to get all the rows in a list and use slice syntax to access your element:
row = your_table_result.findAll('tr')[5::6]
EDITED AFTER QUESTION UPDATE
You can solve your problem in different ways, but first grab the table:
table = soup.find("table",{"class":"auflistung"})
Way #1 - You know the row where the information is stored
(be aware that the structure of the table can change or differ between pages)
cells = table.findAll('td')
name = cells[0].text.strip()
position = cells[6].text.strip()
Way #2 - You know the heading of the information
(works great because there is only one column)
name = table.find("th", text="Anavatandaki isim:").find_next_sibling("td").text.strip()
position = table.find("th", text="Mevki:").find_next_sibling("td").text.strip()
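A side note: in BeautifulSoup 4.4 and later the text= keyword is also available under the name string=, so the same lookup can be written as:
name = table.find("th", string="Anavatandaki isim:").find_next_sibling("td").text.strip()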

Using Pandas match='string' parameter in pd.read_html does not recognize the string in a table, even though there is a table containing 'string'

This question has two parts:
Part One
I'm trying to scrape some data from the SEC's website using pandas' pd.read_html function. I only want one specific table, which contains the text "Principal Position". I wrote a script (see below) to pull this data, but the problem is that it only works some of the time; sometimes it seems to completely ignore tables that contain this text.
For instance, in the script below, I attempt to pull the table containing the words "Principal Position" for each of three companies - Microsoft, Amazon, and Tesla. Although all three companies have a table containing the words "Principal Position", only two of the tables (MSFT and TSLA) are scraped. The third (AMZN) is skipped in the try-except block because the text is not found. I cannot figure out what I'm doing wrong that is causing the AMZN table to be skipped.
Any help would be greatly appreciated!
Part Two
I'm also trying to figure out how to cause the table to have headers that start with whatever row contains the words "Principal Position." Sometimes this phrase is in the second row, sometimes the third, etc. I can't figure out how to set the headers parameter in pd.read_html to be dynamic so that it changes based on whichever row contains the words "Principal Position."
Ideally I would also like to get rid of the extra columns that are inserted into the table (i.e., the columns that are all 'NaN' values).
I know I'm asking a ton but thought I'd throw it out there to see if anyone knows how to do this (I'm stumped). Again, greatly appreciate any help!
My code (which skips AMZN table but does scrape the MSFT and TSLA tables)
import pandas as pd
import html5lib

CIK_list = {'MSFT': 'https://www.sec.gov/Archives/edgar/data/789019/000119312519268531/d791036ddef14a.htm',
            'AMZN': 'https://www.sec.gov/Archives/edgar/data/1018724/000119312520108422/d897711ddef14a.htm',
            'TSLA': 'https://www.sec.gov/Archives/edgar/data/1318605/000156459020027321/tsla-def14a_20200707.htm'}

for ticker, link in CIK_list.items():
    try:
        df_list = pd.read_html(link, match='Principal Position')
        df_list = pd.DataFrame(df_list[0])
        df_list.to_csv(f'Z:/Python/{ticker}.csv')
    except:
        pass
EDITED POST: To add a bit of detail, the error I am receiving is as follows:
ValueError: No tables found matching pattern 'Principal Position'
However, if you look at the link to the AMZN filing, you can text search for "Principal Position" and it comes up. Could it be that somehow Pandas is not waiting for the page to fully load before executing read_html?
Try:
df_list = pd.read_html(link, match=r'Principal\s+Position')
Looking at the HTML, there appears to be more than a single space between "Principal" and "Position" on the AMZN page. The \s+ regex, which matches one or more whitespace characters, will capture this table as well. (The raw string r'...' keeps the backslash intact for the regex engine.)
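Part Two was not addressed above. One way to approach it, sketched below: after reading the table, locate the row containing "Principal Position", promote it to the header, and drop the all-NaN columns. (clean_proxy_table is a hypothetical helper, not part of pandas.)
def clean_proxy_table(df):
    # position of the first row whose cells mention "Principal Position"
    mask = df.apply(lambda row: row.astype(str)
                    .str.contains(r'Principal\s+Position').any(), axis=1)
    header_pos = mask.to_numpy().argmax()
    # promote that row to the header and keep only the rows below it
    df.columns = df.iloc[header_pos]
    df = df.iloc[header_pos + 1:].reset_index(drop=True)
    # drop the spacer columns that are entirely NaN
    return df.dropna(axis=1, how='all')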

Python Selenium only getting first row when iterating over table

I am trying to extract the most recent headlines from the following news site:
http://news.sina.com.cn/hotnews/
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

# save ids of relevant buttons that need to be clicked on the site
buttons_ids = ['Tab21', 'Tab22', 'Tab32']
# save ids of relevant subsections
con_ids = ['Con11']
# start webdriver, go to site, hover over buttons
driver = webdriver.Chrome()
driver.get("http://news.sina.com.cn/hotnews/")
time.sleep(3)
for button_id in buttons_ids:
    button = driver.find_element_by_id(button_id)
    ActionChains(driver).move_to_element(button).perform()
Then I iterate through each section I am interested in and, within each section, through all the headlines, which are rows in an HTML table. However, on every iteration it returns the first element:
for con_id in con_ids:
    for news_id in range(2, 10):
        print(news_id)
        headline = driver.find_element_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr[" + str(news_id) + "]")
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
I also tried the following approach by essentially saving the table as a list and then iterating through the rows:
for con_id in con_ids:
    table = driver.find_elements_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr")
    for headline in table:
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
In the second case I get exactly the number of headlines in the section, so it apparently picks up the number of rows correctly. However, it still returns only the first row on every iteration. Where am I going wrong? I know a similar question has been asked here ("Selenium Python iterate over a table of rows it is stopping at the first row"), but I am still unable to figure out my mistake.
In XPath, queries that begin with // will search relative to the document root; so even though you're calling find_element_by_xpath() on the correct container element, you're breaking out of that scope, thereby performing the same global search and yielding the same result every time.
To constrain your query to descendants of the current element, begin your query with .//, e.g.,:
text = headline.find_element_by_xpath(".//td[2]/a")
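Applied to the second approach from the question, the whole loop becomes (a sketch assuming the same page structure):
for con_id in con_ids:
    rows = driver.find_elements_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr")
    for headline in rows:
        # .// scopes each query to the current row instead of the whole document
        text = headline.find_element_by_xpath(".//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath(".//td[3]/a")
        print(com_no.get_attribute("innerText"))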
try this:
for con_id in con_ids:
    for news_id in range(2, 10):
        print(news_id)
        print("(//div[@id='" + con_id + "']/table/tbody/tr)[" + str(news_id) + "]")
        headline = driver.find_element_by_xpath("(//div[@id='" + con_id + "']/table/tbody/tr)[" + str(news_id) + "]")
        value = headline.find_element_by_xpath(".//td[2]/a")
        print(value.get_attribute("innerText").encode('utf-8'))
I am able to get the headlines with the above code.
I was able to solve it by specifying the entire XPath in one go like this:
headline = driver.find_element_by_xpath("(//*[@id='" + con_id + "']/table/tbody/tr[" + str(news_id) + "]/td[2]/a)")
print(headline.get_attribute("innerText"))
print(headline.get_attribute("href"))
rather than splitting it into two parts.
My only explanation for why it only prints the first row repeatedly is that there is some odd JavaScript at work that prevents proper iteration when the request is split, or that my first version had a syntax error I am not aware of.
If anyone has a better explanation, I'd be glad to hear it!

Scrapy: How to make a conditional (present or absent) XPATH return values when absent?

I am seeking to scrape particular product information from a website. One of my desired XPath criteria, however, does not appear on every product's page. (While all products have a name, price, etc., some do not have the recommended age displayed.)
This is not a problem in itself; however, when Scrapy writes or even returns the data in the shell, it is no longer in the order of the start URLs list, nor does it respect the absence of data on some of the pages. Hence, the rest of my data (multiple columns of different variables) does not line up with the new age column, since that column is much shorter and out of order. This is not the case when I focus only on products that do have the age displayed.
Is there a way to make pages that lack the desired XPath return a blank value, so the columns in my data stay matched up?
Here is my XPath selector:
item["age"] = hxs.select('//li[contains(@class,"our-age")]/span/text()').extract()
(Some webpages do not have the age and thus lack the path completely.)
xpath = '//li[contains(@class,"our-age")]/span/text()'
item["age"] = hxs.select(xpath).extract() or [' ']
extract() returns an empty list when nothing matches, and an empty list is falsy in Python, so the or expression falls back to [' '] and keeps the columns aligned.
