Selenium - Get text inside of table cells - python

I am trying to get the text inside the table cells (th and td) on this page, but have had no luck.
The code kind of works, but it prints every value as a single space (" ").
Code:
driver.get('https://www.komplett.se/product/1165487/datorutrustning/datorkomponenter/chassibarebone/big-tower/phanteks-eclipse-p500-air')
parent_table = driver.find_element_by_xpath("/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div")
count_of_tables = len(parent_table.find_elements_by_xpath("./table"))
for x in range(count_of_tables):
    parent_tr = driver.find_element_by_xpath(f"/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div/table[{x + 1}]/tbody")
    count_of_tr = len(parent_tr.find_elements_by_xpath("./tr"))
    print(count_of_tr)
    for y in range(count_of_tr):
        th = driver.find_element_by_xpath(f'/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div/table[{x + 1}]/tbody/tr[{y+1}]/th')
        td = driver.find_element_by_xpath(f'/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div/table[{x + 1}]/tbody/tr[{y + 1}]/td')
        print(th.text)
        print(td.text)

for y in range(count_of_tr):
    th = driver.find_element_by_xpath(
        f'/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div/table[{x + 1}]/tbody/tr[{y+1}]/th')
    td = driver.find_element_by_xpath(
        f'/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div/table[{x + 1}]/tbody/tr[{y + 1}]/td')
    print(th.get_attribute("textContent"))
    print(td.get_attribute("textContent"))
Use get_attribute("textContent") instead of .text: .text returns only the text that is currently visible on the page, whereas textContent returns the text of the node whether or not it is visible.
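The same idea can be sketched without repeating the long absolute XPath for every cell. This is only a sketch under assumptions: read_spec_rows is a hypothetical helper name, it assumes each row holds exactly one th and one td as in the question, and it passes the plain string "xpath" (the value of Selenium's By.XPATH) so that it works with any object exposing find_elements, find_element and get_attribute.

```python
def read_spec_rows(container):
    """Return (header, value) pairs for every row of every table under container.

    `container` is assumed to be the element that directly holds the tables
    (parent_table in the question).
    """
    rows = []
    # "xpath" is the string value of selenium.webdriver.common.by.By.XPATH
    for tr in container.find_elements("xpath", "./table/tbody/tr"):
        th = tr.find_element("xpath", "./th")
        td = tr.find_element("xpath", "./td")
        # textContent is returned even when the cell is not visible
        rows.append((th.get_attribute("textContent").strip(),
                     td.get_attribute("textContent").strip()))
    return rows
```

With the question's variables this would be called as read_spec_rows(parent_table).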

Related

How to scrape a table with selenium?

I'm having a weird issue trying to scrape a table with selenium. For reference, the table is the item table here, although ideally I would like to be able to scrape any item table for any hero on this site.
self.item_table_xpath = '//table[descendant::thead[descendant::tr[descendant::th[contains(text(), "Item")]]]]'
def retrieve_hero_stats(self, url):
    self.driver.get(url)
    try:
        win_rate_span = self.driver.find_element(by = By.XPATH, value = '//dd[descendant::*[@class = "won"]]/span')
    except:
        win_rate_span = self.driver.find_element(by = By.XPATH, value = '//dd[descendant::*[@class = "lost"]]/span')
    win_rate = win_rate_span.text
    hero_name = url.split('/')[-1]
    values = list()
    for i in range(1, 13):
        values.append({
            'Item Name': self.driver.find_element(by = By.XPATH, value = self.item_table_xpath + f'/tbody/tr[{i}]' + '/td[2]').text,
            'Matches Played': self.driver.find_element(by = By.XPATH, value = self.item_table_xpath + f'/tbody/tr[{i}]' + '/td[3]').text,
            'Matches Won': self.driver.find_element(by = By.XPATH, value = self.item_table_xpath + f'/tbody/tr[{i}]' + '/td[4]').text,
            'Win Rate': self.driver.find_element(by = By.XPATH, value = self.item_table_xpath + f'/tbody/tr[{i}]' + '/td[5]').text
        })
    print(hero_name)
    print(values)
The issue is the output of the code is inconsistent; sometimes the fields in the values list are populated, and sometimes they are not. This changes each time I run my code. I don't necessarily need someone to write this code for me, in fact, I'd prefer you didn't, I'm just stumped as to why the output changes every time I run?

Updating folium changed the Popup box width

I recently updated folium from 0.5.0 to 0.11.0, and since then I have had a problem with the popup box. After the update the popup box seems to have shrunk in width, and text that appeared on a single line with the previous version of folium now wraps onto separate lines. No changes were made to the code.
How can I make the popup box look like it did before, i.e., so that the text does not break across lines?
Popup box code:
fgc.add_child(folium.Marker(location=[lt, ln], popup= "<h4> <b>Thana :&nbsp;" + di +"</b></h4>"+ "<br><b>Cases Total: &nbsp;: </b>"+str(ca)+ " person "+ "<br>" + "<b>Cases 24 hours : </b>"+ str(da)+ " person "+"<br>"+"<b>Cases 7 days: </b>"+str(we)+ " person "+"<br><b>Neighbourhood affected : </b>"+str(ne)))
I handled this by creating an IFrame to hold the dataframe values and then passing that to the Popup class. This should work whether the data comes from a database or a dataframe.
for (index, row) in df.iterrows():
    if row.loc['BRANCH'] == 1:
        iframe = folium.IFrame('Account#:' + str(row.loc['ACCT']) + '<br>' + 'Name: ' + row.loc['NAME'] + '<br>' + 'Terr#: ' + str(row.loc['TERR']))
        popup = folium.Popup(iframe, min_width=300, max_width=300)
        folium.Marker(location=[row.loc['LAT'], row.loc['LON']], icon=folium.Icon(color=row.loc['COLOR'], icon='map-marker', prefix='fa'), popup=popup).add_to(map1)
Without reproducible code it is not possible to give you a tailored solution. As a general suggestion, you could use folium.Popup() with the combo of min_width and max_width parameters to force the width of a popup.
For example:
import folium
m = folium.Map(location=[43.775, 11.254],
               zoom_start=5)
html = '''1 aaaaaaaaaaaaaaaaaa aaaa aaa aa aaaaa aaa aaaa a a a a<br>2 aaaaaaaaaa aaa aaaaa aaaaa<br>3 aaaaa aaaaaa aaaaa aaa aaaaa<br>4 aaa aaa aaaaaaaa
'''
iframe = folium.IFrame(html)
popup = folium.Popup(iframe,
                     min_width=500,
                     max_width=500)
marker = folium.Marker([43.775, 11.254],
                       popup=popup).add_to(m)
m
and you get a popup with a fixed width of 500 pixels.
def color(elev):
    if elev == "STARTED":
        col = 'orange.png'
    elif elev=="COMPLETED":
        col = 'vehicle3_w30.png'
    elif elev =="DELIVERED":
        col = 'vehicle3_w30.png'
    else:
        col='grey.png'
    return col
icon_url = "grey.png"
icon = folium.features.CustomIcon(icon_url,
                                  icon_size=(12, 12))
for lat,lan,name,event_name,officer,update_at in zip(df['fSourceLatitude'],df['fSourceLongitude'],df['officer_name'],df['event_name'],df["user_name"],df["update_at"]):
    bikeColor = color(event_name)
    biker = folium.features.CustomIcon(bikeColor, icon_size=(20,40))
    popContent = ("Updated At: " + str(update_at) + '<br>' +\
                  "Officer ID : " + str(officer) + '<br>'+\
                  "Status: {}".format(event_name))
    iframe = folium.IFrame(popContent)
    popup1 = folium.Popup(iframe,
                          min_width=500,
                          max_width=500)
    folium.Marker(location=[lat,lan],popup = popup1,icon= biker).add_to(map5)
This worked for me. You need to create the marker with a new custom icon in each iteration, as shown in this code.
The asker is fetching the data from a database, which is why the text breaks. If the data is written inside HTML tags, there is no problem; the key point is that the fetched data has to be placed inside the HTML tags.

Scraping a widget

I am scraping data. At first my code scraped and printed what appeared on the first page, but there was a lot more data further down. So I added code to scroll to the bottom of the page so that everything could be scraped. The problem now is that it scrolls to the bottom, but then it just waits and never prints. Does anyone know how to get this to print? Eventually I'd also like the results to go to an Excel file, if anyone knows how to do that too. Thanks so much.
import time
from selenium import webdriver
url = 'http://www.tradingview.com/screener'
driver = webdriver.Firefox()
driver.get(url)
SCROLL_PAUSE_TIME = 2
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
# will give a list of all tickers
tickers = driver.find_elements_by_css_selector('a.tv-screener__symbol')
# will give a list of all company names
company_names = driver.find_elements_by_css_selector('span.tv-screener__description')
# will give a list of all close values
close_values = driver.find_elements_by_xpath("//td[@class = 'tv-data-table__cell tv-screener-table__cell tv-screener-table__cell--numeric']/span")
# will give a list of all percentage changes
percentage_changes = driver.find_elements_by_xpath('//tbody/tr/td[3]')
# will give a list of all value changes
value_changes = driver.find_elements_by_xpath('//tbody/tr/td[4]')
# will give a list of all ranks
ranks = driver.find_elements_by_xpath('//tbody/tr/td[5]/span')
# will give a list of all volumes
volumes = driver.find_elements_by_xpath('//tbody/tr/td[6]')
# will give a list of all market caps
market_caps = driver.find_elements_by_xpath('//tbody/tr/td[7]')
# will give a list of all PEs
pes = driver.find_elements_by_xpath('//tbody/tr/td[8]')
# will give a list of all EPSs
epss = driver.find_elements_by_xpath('//tbody/tr/td[9]')
# will give a list of all EMPs
emps = driver.find_elements_by_xpath('//tbody/tr/td[10]')
# will give a list of all sectors
sectors = driver.find_elements_by_xpath('//tbody/tr/td[11]')
for index in range(len(tickers)):
    # index is an int, so it must be converted before string concatenation
    print("Row " + str(index) + " " + tickers[index].text + " " + company_names[index].text + " ")
You are trying to locate the wrong element. This:
element = driver.find_elements_by_id('js-screener-container')
should be replaced with:
# will give a list of all tickers
tickers = driver.find_elements_by_css_selector('a.tv-screener__symbol')
# will give a list of all company names
company_names = driver.find_elements_by_css_selector('span.tv-screener__description')
# will give a list of all close values
close_values = driver.find_elements_by_xpath("//td[@class = 'tv-data-table__cell tv-screener-table__cell tv-screener-table__cell--numeric']/span")
# will give a list of all percentage changes
percentage_changes = driver.find_elements_by_xpath('//tbody/tr/td[3]')
# will give a list of all value changes
value_changes = driver.find_elements_by_xpath('//tbody/tr/td[4]')
# will give a list of all ranks
ranks = driver.find_elements_by_xpath('//tbody/tr/td[5]/span')
# will give a list of all volumes
volumes = driver.find_elements_by_xpath('//tbody/tr/td[6]')
# will give a list of all market caps
market_caps = driver.find_elements_by_xpath('//tbody/tr/td[7]')
# will give a list of all PEs
pes = driver.find_elements_by_xpath('//tbody/tr/td[8]')
# will give a list of all EPSs
epss = driver.find_elements_by_xpath('//tbody/tr/td[9]')
# will give a list of all EMPs
emps = driver.find_elements_by_xpath('//tbody/tr/td[10]')
# will give a list of all sectors
sectors = driver.find_elements_by_xpath('//tbody/tr/td[11]')
So now you have all the data stored in lists. If you want to build rows of data, you can use something like this:
for index in range(len(tickers)):
    print("Row " + tickers[index].text + " " + company_names[index].text + " " + ....)
Output will be something like this:
Row AAPL APPLE INC. 188.84 -1.03% -1.96 Neutral 61.308M 931.386B 17.40 10.98 123K Technology
Row AMZN AMAZON.COM INC 1715.97 -0.46% -7.89 Buy 4.778M 835.516B 270.53 6.54 566K Consumer Cyclicals
...
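The question also asked about getting the results into an Excel file. One possible approach, sketched here under the assumption that a CSV file (which Excel can open) is acceptable, is to zip the element lists together and write them with the standard csv module. write_rows_csv is a hypothetical helper name, and each item in the lists only needs a .text attribute, as Selenium elements have.

```python
import csv

def write_rows_csv(path, header, *columns):
    """Write parallel lists of elements to a CSV file, one table row per line.

    Returns the number of data rows written.
    """
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        # zip() stops at the shortest list, so a short column truncates the output
        for cells in zip(*columns):
            writer.writerow([cell.text for cell in cells])
            count += 1
    return count
```

With the lists above, a call could look like write_rows_csv("screener.csv", ["Ticker", "Company"], tickers, company_names); the file name and header labels are placeholders.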
PS:
I think
SCROLL_PAUSE_TIME = 0.5
is too short, since loading new content after scrolling to the bottom of the page can sometimes take longer than 0.5 seconds. I would increase this value to make sure that all content gets loaded.
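For illustration, the scroll loop from the question can be wrapped in a function that takes the pause as a parameter and also caps the number of rounds, so a page that keeps growing cannot keep the script waiting forever. This is only a sketch: scroll_to_bottom and max_rounds are names made up here, and the only thing assumed of driver is the execute_script method used in the question.

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops changing or max_rounds is reached.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to load more content before measuring again
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds
```

A larger pause makes it less likely that the loop stops early while content is still loading, at the cost of a slower run.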

Python docx add_paragraph() inserts leading newline

I'm able to use a paragraph object to select font size, color, bold, etc. within a table cell. But, add_paragraph() seems to always insert a leading \n into the cell and this messes up the formatting on some tables.
If I just use the cell.text('') method it doesn't insert this newline but then I can't control the text attributes.
Is there a way to eliminate this leading newline?
Here is my function:
def add_table_cell(table, row, col, text, fontSize=8, r=0, g=0, b=0, width=-1):
    cell = table.cell(row,col)
    if (width!=-1):
        cell.width = Inches(width)
    para = cell.add_paragraph(style=None)
    para.alignment = WD_ALIGN_PARAGRAPH.LEFT
    run = para.add_run(text)
    run.bold = False
    run.font.size = Pt(fontSize)
    # color.type is read-only; assigning rgb below makes the color type RGB
    run.font.color.rgb = RGBColor(r, g, b)
I tried the following and it worked for me. I am not sure if it is the best approach:
cells[0].text = 'Some text' #Write the text to the cell
#Modify the paragraph alignment, first paragraph
cells[0].paragraphs[0].paragraph_format.alignment=WD_ALIGN_PARAGRAPH.CENTER
The solution I found is to set the text attribute instead of calling add_paragraph(), and then use add_run() on the existing paragraph:
row_cells[0].text = ''
row_cells[0].paragraphs[0].add_run('Total').bold = True
row_cells[0].paragraphs[0].paragraph_format.alignment = WD_ALIGN_PARAGRAPH.RIGHT
I've looked through the documentation for the cell, and add_paragraph() is not the problem. The problem is that a cell already has a paragraph inside it by default.
class docx.table._Cell:
paragraphs: ... By default, a new cell contains a single paragraph. Read-only
Therefore, if you want your added paragraph to be the first line of the cell, you should delete the default paragraph first. Since python-docx doesn't have paragraph.delete(), you can use the function mentioned in this GitHub issue: feature: Paragraph.delete()
def delete_paragraph(paragraph):
    p = paragraph._element
    p.getparent().remove(p)
    p._p = p._element = None
Therefore, you should do something like:
cell = table.cell(0,0)
paragraph = cell.paragraphs[0]
delete_paragraph(paragraph)
paragraph = cell.add_paragraph('text you want to add', style='style you want')
Update, 10/8/2022:
Sorry, the above approach is somewhat unnecessary.
It's much more intuitive to edit the default paragraph than to delete it and add a new one.
For the function add_table_cell, just replace para = cell.add_paragraph(style=None) with para = cell.paragraphs[0]
and para.style = None. The para.style = None line is not strictly necessary, as that should be the default value for a new paragraph.
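As a rough sketch of that idea, a helper can reuse the paragraph a new cell already contains instead of adding another one. write_cell is a hypothetical name, and the sketch relies only on the cell.paragraphs list and the paragraph's add_run() method already used in this thread.

```python
def write_cell(cell, text, bold=False):
    """Put text into a table cell without creating an extra paragraph."""
    # A new cell already holds one empty paragraph, so reuse it
    para = cell.paragraphs[0]
    run = para.add_run(text)
    run.bold = bold
    # Return the run so the caller can set font size, color, and so on
    return run
```

The returned run can then be styled as in the original function, for example run.font.size = Pt(8).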
Here is what worked for me. I don't call add_paragraph(). I just reference the first paragraph with this call -> para = cell.paragraphs[0]. Everything else after that is the usual api calls.
table = doc.add_table( rows=1, cols=3 ) # bar codes
for tableRow in table.rows:
    for cell in tableRow.cells:
        para = cell.paragraphs[0]
        run = para.add_run( "*" + specIDStr + "*" )
        font = run.font
        font.name = 'Free 3 of 9'
        font.size = Pt( 20 )
        run = para.add_run( "\n" + specIDStr
                            + "\n" + firstName + " " + lastName
                            + "\tDOB: " + dob )
        font = run.font
        font.name = 'Arial'
        font.size = Pt( 8 )

Parsing Python textfile with tags

I am parsing a 300-page document with Python, and I need to find the attribute values of the Response element that follows the ThisVal element. The Response element is used in multiple places for different Vals, so I need to find out what is in the Response element's attribute values after finding the ThisVal element.
If it helps, the tokens are unique to ThisVal, but are different in every document.
11:44:49 <ThisVal Token="5" />
11:44:49 <Response Token="5" Code="123123" elements="x.one,x.two,x.three,x.four,x.five,x.six,x.seven" />
Have you considered using pyparsing? I've found it to be very useful for this kind of thing. Below is my attempt at a solution to your problem.
import pyparsing as pp
document = """11:44:49 <ThisVal Token="5" />
11:44:49 <Response Token="5" Code="123123" elements="x.one,x.two,x.three,x.four,x.five,x.six,x.seven" />
"""
num = pp.Word(pp.nums)
colon = ":"
start = pp.Suppress("<")
end = pp.Suppress("/>")
eq = pp.Suppress("=")
tag_name = pp.Word(pp.alphas)("tag_name")
value = pp.QuotedString("\"")
timestamp = pp.Suppress(num + colon + num + colon + num)
other_attr = pp.Group(pp.Word(pp.alphas) + eq + value)
tag = start + tag_name + pp.ZeroOrMore(other_attr)("attr") + end
tag_line = timestamp + tag
thisval_found = False
for line in document.splitlines():
    result = tag_line.parseString(line)
    print("Tag: {}\nAttributes: {}\n".format(result.tag_name, result.attr))
    # Compare the parsed tag name, not the tag_name parser element itself
    if thisval_found and result.tag_name == "Response":
        for a in result.attr:
            if a[0] == "elements":
                print("FOUND: {}".format(a[1]))
    thisval_found = result.tag_name == "ThisVal"
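If adding a dependency is not an option, roughly the same logic can be sketched with the standard re module instead of pyparsing. This is a swapped-in technique, not the approach above; it assumes every tag sits on one line in the self-closing form shown in the question, and the name elements_after_thisval is made up for this sketch. It also checks that the Response carries the same Token as the preceding ThisVal.

```python
import re

TAG_RE = re.compile(r'<(\w+)\s+(.*?)/>')   # tag name and raw attribute text
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')   # name="value" pairs

def elements_after_thisval(text):
    """Return the elements lists of Response tags that directly follow a ThisVal tag."""
    found = []
    pending_token = None  # Token of the ThisVal tag seen on the previous tag line
    for line in text.splitlines():
        match = TAG_RE.search(line)
        if not match:
            continue  # lines without a tag are skipped
        name = match.group(1)
        attrs = dict(ATTR_RE.findall(match.group(2)))
        if name == "ThisVal":
            pending_token = attrs.get("Token")
            continue
        if (name == "Response" and pending_token is not None
                and attrs.get("Token") == pending_token and "elements" in attrs):
            found.append(attrs["elements"].split(","))
        pending_token = None
    return found
```

Calling it on the whole document text returns one list of element names per matching Response line.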
