BeautifulSoup - scrape hover over tooltip data - python

I am very new to coding and Python; I've only been using Python for a few weeks, so please be kind. I used to code in college with C++, but that was 30 years ago, so I'm basically starting from ground zero.
I have an HTML table. Using BeautifulSoup, I have been able to break the table down into a list of rows and then into a list of columns in each row, and I've been able to grab other data from the columns. But this last bit of text, which is inside a tooltip that only becomes visible on hover, is giving me a headache.
I can see the text I want in my debugger but can't seem to figure out how to reference it. The tooltip data is a list of names separated by commas; once I've pulled the text from the tooltip, I plan to split the names into a list. You can see in the debugger window I have marked the field I am trying to grab.
output = []
for row in table.findAll('tr'):
    # Find all data for each column
    try:
        columns = row.find_all('td')
        # separate out the columns
        if columns is not None and len(columns) >= 5:
            coach = columns[1].text.strip()
            status = columns[2].text.strip()
            currently_coaching = columns[3].text.strip()
            players_coached = columns[4].contents[1].strip()
    except AttributeError:
        # header rows have no <td> cells; skip them
        continue
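Hover tooltips aren't rendered text; the names usually live in an attribute (often title or data-original-title) or in a hidden span inside the cell. Here is a minimal sketch assuming the fifth column carries the tooltip in one of those places; the attribute names are guesses, so check your actual markup in the debugger:

# Hedged sketch: pull the tooltip text out of the fifth column, then split on commas.
# The attribute names below are common conventions, not confirmed for this table.
cell = columns[4]

tooltip_tag = cell.find(attrs={'title': True}) or cell.find(attrs={'data-original-title': True})
if tooltip_tag is not None:
    raw = tooltip_tag.get('title') or tooltip_tag.get('data-original-title')
else:
    # fall back to a hidden <span> holding the tooltip body
    span = cell.find('span')
    raw = span.get_text() if span else ''

players_coached = [name.strip() for name in raw.split(',') if name.strip()]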

Related

BeautifulSoup, how can I get texts without class identifier?

While crawling the website, the text I want to pull has no class name, id, or other style hook to separate it out, and the selector path I used with soup.select doesn't work for repeated operations. As an example, I want to grab the data below, but I don't know how to do it.
Just a guess: if you can get the table, and you know the row, you can do the following. Use findAll to get all the rows in a list and use slice syntax to access your element:
row = your_table_result.findAll('tr')[5::6]
EDITED AFTER QUESTION UPDATE
You can solve your problem in different ways, but first grab the table:
table = soup.find("table",{"class":"auflistung"})
Way #1 - You know the row, where information is stored
(be aware that the structure of the table can change or may differ)
cells = table.findAll('td')
name = cells[0].text.strip()
position = cells[6].text.strip()
Way #2 - You know heading of information
(works well because there is only one column)
name = table.find("th", text="Anavatandaki isim:").find_next_sibling("td").text.strip()
position = table.find("th", text="Mevki:").find_next_sibling("td").text.strip()
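If the heading text ever carries stray whitespace or a trailing character, a regex match is more forgiving than an exact string. A small variant of Way #2, under the same table assumption:

import re

name = table.find("th", text=re.compile(r"Anavatandaki isim")).find_next_sibling("td").text.strip()
position = table.find("th", text=re.compile(r"Mevki")).find_next_sibling("td").text.strip()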

Using Pandas match='string' parameter in pd.read_html does not recognize the string in a table, even though there is a table containing 'string'

This question has two parts:
Part One
I'm trying to scrape some data from the SEC's website using the pandas pd.read_html function. I only want one specific table, which has the text "Principal Position" in it. I wrote a script (see below) to pull this data, but the problem is that it only works some of the time; sometimes it seems to completely ignore tables that contain this text.
For instance, in the script below, I attempt to pull the table containing the words "Principal Position" for each of three companies - Microsoft, Amazon, and Tesla. Although all three companies have a table containing the words "Principal Position", only two of the tables (MSFT and TSLA) are scraped. The third (AMZN) is skipped in the try-except block because the text is not found. I cannot figure out what I'm doing wrong that is causing the AMZN table to be skipped.
Any help would be greatly appreciated!
Part Two
I'm also trying to figure out how to cause the table to have headers that start with whatever row contains the words "Principal Position." Sometimes this phrase is in the second row, sometimes the third, etc. I can't figure out how to set the headers parameter in pd.read_html to be dynamic so that it changes based on whichever row contains the words "Principal Position."
Ideally I would also like to get rid of the extra columns that are inserted into the table (i.e., the columns that are all 'NaN' values).
I know I'm asking a ton but thought I'd throw it out there to see if anyone knows how to do this (I'm stumped). Again, greatly appreciate any help!
My code (which skips AMZN table but does scrape the MSFT and TSLA tables)
import pandas as pd
import html5lib

CIK_list = {'MSFT': 'https://www.sec.gov/Archives/edgar/data/789019/000119312519268531/d791036ddef14a.htm',
            'AMZN': 'https://www.sec.gov/Archives/edgar/data/1018724/000119312520108422/d897711ddef14a.htm',
            'TSLA': 'https://www.sec.gov/Archives/edgar/data/1318605/000156459020027321/tsla-def14a_20200707.htm'}

for ticker, link in CIK_list.items():
    try:
        df_list = pd.read_html(link, match=('Principal Position'))
        df_list = pd.DataFrame(df_list[0])
        df_list.to_csv(f'Z:/Python/{ticker}.csv')
    except:
        pass
EDITED POST: To add a bit of detail, the error I am receiving is as follows:
ValueError: No tables found matching pattern 'Principal Position'
However, if you look at the link to the AMZN filing, you can text search for "Principal Position" and it comes up. Could it be that somehow Pandas is not waiting for the page to fully load before executing read_html?
Try:
df_list = pd.read_html(link, match=r'Principal\s+Position')
Looking at the HTML, there appears to be more than a single whitespace character between "Principal" and "Position" on the AMZN page. The \s+ regex, which matches one or more whitespace characters, will capture that table as well.
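Putting that fix into the original loop, here is a hedged sketch that also takes a stab at Part Two: promoting whichever row contains the phrase to the header and dropping all-NaN columns. Fetching the page with requests first is a workaround for servers that reject the default urllib User-Agent (the SEC asks requesters to identify themselves); the header value is a placeholder.

import re
from io import StringIO

import pandas as pd
import requests

pattern = re.compile(r'Principal\s+Position')

for ticker, link in CIK_list.items():
    # placeholder User-Agent; replace with your own identification
    html = requests.get(link, headers={'User-Agent': 'your-name your@email.com'}).text
    try:
        df = pd.read_html(StringIO(html), match=pattern)[0]
    except ValueError:
        continue  # no table on this page matched the pattern

    # Part Two (sketch): find the first row containing the phrase and make it
    # the header; if no cell matches after parsing, this falls back to row 0.
    is_header = df.apply(lambda r: r.astype(str).str.contains(pattern).any(), axis=1)
    header_idx = is_header.idxmax()
    df.columns = df.iloc[header_idx]
    df = df.iloc[header_idx + 1:].dropna(axis=1, how='all')

    df.to_csv(f'Z:/Python/{ticker}.csv')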

(HTML Scraping) XPath of a column changes based on color

I am trying to parse through all of the values in the column of this website (with different stock tickers). I am working in Python and am using XPath to scrape HTML data.
Let's say I want to extract the value of "Change", which is currently 0.62% (and green). I would first get the tree of the website and then write:
stockInfo_1 = tree.xpath('//*[@class="table-dark-row"]/td[12]/b/span/text()')
I would then get an array of values, and the last element happens to be the change value.
However, I noticed that if a value in this column has a color, it is in the /b/span, while if it does not have a color, there is no span and it's just in the /b.
So to explain:
stockInfo_1 = tree.xpath('//*[@class="table-dark-row"]/td[12]/b/span/text()')
^ this array would have every value in this column that is colored,
while stockInfo_1 = tree.xpath('//*[@class="table-dark-row"]/td[12]/b/text()')
^ would have every value in the column that does not have a color.
The colors are not consistent for each stock. Some stocks have random values that have colors and some do not. So that messes up the /b/span and /b array consistency.
How can I get an array of ALL the values (in order) in each column, regardless of whether they are in a span or not? I don't care about the colors, I just care about the values.
I can explain more if needed. Thanks!!
You can directly skip intermediate tags in XPath and get all the values in a list by using // in between.
So the snippet should be
tree.xpath('//*[@class="table-dark-row"]/td[12]/b//text()')
This skips all the intermediate tags between <b> and the text.
I've tried it using lxml. Here is the code:
import requests
from lxml import html

url = "https://finviz.com/quote.ashx?t=acco&ty=c&ta=1&p=d"
resp = requests.get(url)
tree = html.fromstring(resp.content)
values = tree.xpath('//*[@class="table-dark-row"]/td[12]/b//text()')
print(values)
Which gives output as follows
['0.00%', '-2.43%', '-8.71%', '-8.71%', '7.59%', '-1.23%', '1.21', '0.30', '2.34% 2.38%', '12.05', '12.18', '1.04%']
Note: if you don't want to hardcode 12 in the above XPath, you can also use last(), as in tree.xpath('//*[@class="table-dark-row"]/td[last()]/b//text()')
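If you want every value in the table rather than just the twelfth column, you can iterate the rows and apply the same trick per cell. A short sketch using the tree built above, assuming each element carrying that class is a table row:

# b//text() picks up values whether or not they are wrapped in a <span>
rows = tree.xpath('//tr[@class="table-dark-row"]')
all_values = [row.xpath('./td/b//text()') for row in rows]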
An XPath cheat sheet for your reference: Using "//" And ".//" Expressions In XPath XML Search Directives In ColdFusion

Python - Extract data from a specific table in a page

I just started to learn Python. I spent the whole weekend on this project but the progress has been terrible. Hopefully I can get some guidance from the community.
Part of my tutorial requires me to extract data from a Google Finance page, https://www.google.com/finance, but only the sector summary table, and then organize it into a JSON dump.
The questions I have so far:
1) How do I extract data from the sector summary table only? I can use find_all, but the results come back including other tables as well.
2) How do I get the change for each sector, i.e. (energy: 0.99%, basic materials: 0.31%, industrials: 0.17%)? There is no unique tag I can use; the only pattern is that these numbers appear right below the sector name.
Looking at the page (either using View Source or your browser's developer tools), we know a few things:
The sector summary table is the only one inside a div tag with id=secperf (probably short for 'sector performance').
For every row except the first, the first cell from the left contains the sector name; the second one from the left contains the change percentage.
The other cells might contain bar graphs. The bar graphs also happen to be tables, but we want to ignore them, so we shouldn't recurse into them.
There are many ways to approach this. One way would be as follows:
def sector_summary(document):
    table = document.find(id='secperf').find('table')
    rows = table.find_all('tr', recursive=False)

    for row in rows[1:]:
        cells = row.find_all('td')
        sector = cells[0].get_text().strip()
        change = cells[1].get_text().strip()
        yield (sector, change)

print(dict(sector_summary(my_document)))
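For completeness, a sketch of how my_document might be built, assuming requests and the page from the question:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.google.com/finance')
my_document = BeautifulSoup(resp.text, 'html.parser')

print(dict(sector_summary(my_document)))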

Extract multiple types of text from a column in html

I'm new to Python and I'm trying to extract data from an HTML page. A certain column of the table is a mixture of text and URLs. I'd like to extract all the information from that column, keeping the links intact, to a CSV file (which I'll later save as an Excel file). Please advise. Here's my code to extract just the text:
trs = soup.find_all('tr')
for tr in trs:
    tds = tr.find_all("td")
    try:
        RS_id = str(tds[5].get_text().encode('utf-8'))
    except IndexError:
        # rows without enough cells are skipped
        continue
A few cells of the column have multiple URLs and I'd like to keep them the same.
How is the data in that column written? If there is a clear pattern for how the URL is separated from the other text, then you can use the string.split('character') method.
Say the column of data you care about has all of its entries split apart by a ',' character; then you would say:
column_data = RS_id.split(',')
This would give you a list of everything listed in that column, splitting it up every time there is a comma character. Then you just index the list to get the URL you're after. If there is no particular order to index the list by, you may have to do something like:
URL_list = []
for item in column_data:
    if 'http' in item:
        URL_list.append(item)
EDIT:
Check out how BeautifulSoup parses the table: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
Each link's target should be in the href attribute of the anchor tag, which is the URL the hyperlink points to.
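A minimal sketch of that idea, assuming the same tds as in the question: collect the cell's visible text and every link target side by side.

# Keep the visible text and every hyperlink target from the cell.
cell = tds[5]
text = cell.get_text(' ', strip=True)
urls = [a['href'] for a in cell.find_all('a', href=True)]
row_value = {'text': text, 'urls': urls}  # write this to your CSV row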
