I am trying to fetch the tab name with playwright, but my code only outputs the page url.
For instance, I want to go to https://www.rottentomatoes.com/ and print the name being shown on the tab:
Rotten Tomatoes: Movies/ Tv Shows/ Movie Trailers...
I tried to use page.title and a few other options, but it is not working.
from playwright.sync_api import sync_playwright
website = "https://www.rottentomatoes.com/"
p = sync_playwright().start()
browser = p.chromium.launch(headless=False)
page = browser.new_page()
page.goto(website)
print(page.content)
print(page.inner_text)
print(page.title)
print(page.context)
My output is:
<bound method Page.content of <Page url='https://www.rottentomatoes.com/'>>
<bound method Page.inner_text of <Page url='https://www.rottentomatoes.com/'>>
<bound method Page.title of <Page url='https://www.rottentomatoes.com/'>>
As mentioned in the comments by another user, you need to pass a selector as an argument to the inner_text(selector) method to get the value.
from playwright.sync_api import sync_playwright
website = "https://www.rottentomatoes.com/"
p = sync_playwright().start()
browser = p.chromium.launch(headless=False)
page = browser.new_page()
page.goto(website)
print(page.inner_text("rt-header-nav[slot='nav-dropdowns'] [slot='movies'] a[slot='link']"))
print(page.inner_text("rt-header-nav[slot='nav-dropdowns'] [slot='tv'] a[slot='link']"))
print(page.inner_text("rt-header-nav[slot='nav-dropdowns'] [slot='trivia'] a[slot='link']"))
This will print the following in the console:
MOVIES
TV SHOWS
MOVIE TRIVIA
As stated in the comment by G. Anderson, right now you're just referencing the method but not calling it, so page.title should be page.title() as shown in the page title docs, which is why the output was just saying "bound method…".
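For example, something like this (a minimal version of your script with the method actually called) should print the tab title:

from playwright.sync_api import sync_playwright

website = "https://www.rottentomatoes.com/"
p = sync_playwright().start()
browser = p.chromium.launch(headless=False)
page = browser.new_page()
page.goto(website)
print(page.title())  # note the parentheses: this calls the method and returns the tab title
browser.close()
p.stop()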
Hope that helps!
I cobbled together some code to log in to a website and navigate to the specific pages I want to scrape from. This part works fine. Now, however, I'm searching for a specific element named 'tspan' and I'm getting an error that reads:
AttributeError: 'str' object has no attribute 'descendants'
If I go to the URL, right-click the element I want to grab, and click 'Inspect Element' I see the code behind the page, and it looks like this.
It looks like querying by 'g id' may work too.
So, I thought I could get all 'tspan' items, load them all into a list, and write the list to a text file. However, I'm getting no 'tspan' elements at all. If I right-click the page and click 'View Page Source', I see no 'tspan' elements. This is very weird! The code behind the page is definitely different from what's rendered on the page itself. Here's my code. What am I doing wrong here?
from bs4 import BeautifulSoup as bs
import webbrowser
import requests
from lxml import html
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True
import time
# selenium
wd = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe", firefox_profile=profile)
url = "https://corp-internal.com/admin/?page=0"
wd.get(url)
# set username
time.sleep(2)
username = wd.find_element_by_id("identifierId")
username.send_keys("my_email@email.com")
wd.find_element_by_id("identifierNext").click()
# set password
time.sleep(2)
password = wd.find_element_by_name("password")
password.send_keys("my_pswd")
wd.find_element_by_id("passwordNext").click()
all_text = []
# list of URLs
url_list = ['https://corp-internal.com/admin/graph?dag_id=emm1_daily_legacy',
'https://corp-internal.com/admin/graph?dag_id=eemm1_daily_legacy_history']
for link in url_list:
    #File = webbrowser.open(link)
    #File = requests.get(link)
    #data = File.text
    for link in bs.findAll('tspan'):
        alldata = all_text.append(link.get('tspan'))

outF = open('C:/Users/ryans/OneDrive/Desktop/test.txt', 'w')
outF.writelines(alldata)
outF.close()
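What usually resolves this kind of mismatch is letting the webdriver render each page and then handing wd.page_source to BeautifulSoup instead of the raw HTML; a sketch only (the sleep length here is a guess):

for link in url_list:
    wd.get(link)
    time.sleep(2)  # give the JavaScript time to render the 'tspan' elements
    soup = bs(wd.page_source, 'html.parser')
    for tspan in soup.find_all('tspan'):
        all_text.append(tspan.get_text())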
The code below is so simple; why is it printing None? Does that mean it's not finding the page?
from bs4 import BeautifulSoup as soup
from robobrowser import RoboBrowser
br = RoboBrowser()
login_url = 'https://www.cbssports.com/login'
login_page = br.open(login_url)
print(login_page)
br.open doesn't return the page content
br.open opens the robot browser to that page, which only changes the state of the robot browser; it returns nothing. If you want the content of the page, call br.open(login_url) to open it, and then print(br.state.response.text), which prints the text sent back in the response and stored in the browser's state.
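For example, a minimal sketch of that:

from robobrowser import RoboBrowser

br = RoboBrowser()
login_url = 'https://www.cbssports.com/login'
br.open(login_url)             # returns None; it only updates the browser's state
print(br.state.response.text)  # the raw HTML text sent back in the response
# br.parsed gives the same content parsed with BeautifulSoup, if you prefer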
I'm trying to get the number of plays for the top songs from a number of artists on Spotify using python and splinter.
If you fill in the username and password below with yours, you should be able to run the code.
from splinter import Browser
import time
from bs4 import BeautifulSoup
browser = Browser()
url = 'http://play.spotify.com'
browser.visit(url)
time.sleep(2)
button = browser.find_by_id('has-account')
button.click()
time.sleep(1)
browser.fill('username', 'your_username')
browser.fill('password', 'your_password')
buttons = browser.find_by_css('button')
visible_buttons = [button for button in buttons if button.visible]
login_button = visible_buttons[-1]
login_button.click()
time.sleep(1)
browser.visit('https://play.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6')
time.sleep(10)
So far, so good. If you open up Firefox, you can see Miley Cyrus's artist page, including the number of plays for top tracks.
If you open up the Firefox Developer Tools Inspector and hover, you can see the name of the song in .tl-highlight elements, and the number of plays in .tl-listen-count elements. However, I've found it impossible (at least on my machine) to access these elements using splinter. Moreover, when I try to get the source for the entire page, the elements that I can see by hovering my mouse over them in Firefox don't show up in what is ostensibly the page source.
html = browser.html
soup = BeautifulSoup(html)
output = soup.prettify()
with open('miley_cyrus_artist_page.html', 'w') as output_f:
    output_f.write(output)

browser.quit()
I don't think I know enough about web programming to know what the issue is here: Firefox sees all the DOM elements clearly, but splinter, which is driving Firefox, does not.
The key problem is that there is an iframe containing the artist's page with the list of tracks. You need to switch into its context before searching for elements:
frame = browser.driver.find_element_by_css_selector("iframe[id^=browse-app-spotify]")
browser.driver.switch_to.frame(frame)
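After switching into the frame, you can query the track elements directly; a rough sketch (the selectors are the ones mentioned in the question, not verified here):

tracks = browser.driver.find_elements_by_css_selector(".tl-highlight")
counts = browser.driver.find_elements_by_css_selector(".tl-listen-count")
for track, count in zip(tracks, counts):
    print track.text, count.text

# switch back to the top-level document when you are done
browser.driver.switch_to.default_content()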
Many thanks to @alecxe; the following code works to pull the information on the artist.
from splinter import Browser
import time
from bs4 import BeautifulSoup
import codecs
browser = Browser()
url = 'http://play.spotify.com'
browser.visit(url)
time.sleep(2)
button = browser.find_by_id('has-account')
button.click()
time.sleep(1)
browser.fill('username', 'your_username')
browser.fill('password', 'your_password')
buttons = browser.find_by_css('button')
visible_buttons = [button for button in buttons if button.visible]
login_button = visible_buttons[-1]
login_button.click()
time.sleep(1)
browser.visit('https://play.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6')
time.sleep(30)
CORRECT_FRAME_INDEX = 6
with browser.get_iframe(CORRECT_FRAME_INDEX) as iframe:
    html = iframe.html
    soup = BeautifulSoup(html)
    output = soup.prettify()
    with codecs.open('test.html', 'w', 'utf-8') as output_f:
        output_f.write(output)

browser.quit()
I have to parse only the positions and points from this link. That link has 21 listings (I don't actually know what to call them) on it, and each listing has 40 players except the last one. Now I have written code like this:
from bs4 import BeautifulSoup
import urllib2
def overall_standing():
    url_list = ["http://www.afl.com.au/afl/stats/player-ratings/overall-standings#",
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/3",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/4",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/5",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/6",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/7",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/8",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/9",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/10",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/11",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/12",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/13",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/14",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/15",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/16",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/17",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/18",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/19",
                # "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/20",
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/21"]

    gDictPlayerPointsInfo = {}
    for url in url_list:
        print url
        header = {'User-Agent': 'Mozilla/5.0'}
        req = urllib2.Request(url, headers=header)
        page = urllib2.urlopen(req)
        soup = BeautifulSoup(page)
        table = soup.find("table", { "class" : "ladder zebra player-ratings" })
        lCount = 1
        for row in table.find_all("tr"):
            lPlayerName = ""
            lTeamName = ""
            lPosition = ""
            lPoint = ""
            for cell in row.find_all("td"):
                if lCount == 2:
                    lPlayerName = str(cell.get_text()).strip().upper()
                elif lCount == 3:
                    lTeamName = str(cell.get_text()).strip().split("\n")[-1].strip().upper()
                elif lCount == 4:
                    lPosition = str(cell.get_text().strip())
                elif lCount == 6:
                    lPoint = str(cell.get_text().strip())
                lCount += 1
            if url == "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2":
                print lTeamName, lPlayerName, lPoint
            if lPlayerName <> "" and lTeamName <> "":
                lStr = lPosition + "," + lPoint
                # if gDictPlayerPointsInfo.has_key(lTeamName):
                #     gDictPlayerPointsInfo[lTeamName].append({lPlayerName:lStr})
                # else:
                gDictPlayerPointsInfo[lTeamName+","+lPlayerName] = lStr
            lCount = 1

    lfp = open("a.txt","w")
    for key in gDictPlayerPointsInfo:
        if key.find("RICHMOND"):
            lfp.write(str(gDictPlayerPointsInfo[key]))
    lfp.close()

    return gDictPlayerPointsInfo

# overall_standing()
but the problem is that it always gives me the first listing's points and positions and ignores the other 20. How can I get the positions and points for all 21? I have heard that scrapy can do this kind of thing pretty easily, but I am not fully familiar with scrapy. Is there any way to do this other than scrapy?
This is happening because the portion of the link after the # symbol, called the fragment identifier, is never sent to the server; it is processed by the browser and typically triggers some link or JavaScript behaviour, i.e. loading a different set of results.
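You can see this split with urlparse (Python 2 here, to match the question's code); only the part before the # is what the server sees:

from urlparse import urlparse

parts = urlparse("http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2")
print parts.path      # /afl/stats/player-ratings/overall-standings  (sent to the server)
print parts.fragment  # page/2  (handled entirely by the browser)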
I would suggest two approaches: either find a link which the server itself can evaluate, so you can keep using a scrapy/requests-style approach, or use a webdriver like selenium.
Scrapy
Your first step is to identify the JavaScript load call, often AJAX, and use those links to pull your information; these are calls to the site's database. You can find them by opening your web inspector and watching the network traffic as you click the next results page:
and then, after the click, we can see that there is a new call to this url:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=3&pageSize=40
This url returns a json file which can be parsed, and you can even shorten your steps, as it looks like you can control what information is returned to you.
You could either write a method to generate a series of links for you:
def gen_url(page_no):
    return "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=" + str(page_no) + "&pageSize=40"
and then, for example, use scrapy with the seed list:
seed = [gen_url(i) for i in range(1, 22)]  # pages 1 through 21
or you can try tweaking the url parameters and see what you get, maybe you can get multiple pages at a time:
http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=1&pageSize=200
I changed the pageSize parameter at the end to 200, since it seems to correspond directly to the number of results returned.
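A minimal sketch of pulling one of these pages directly (Python 2, matching the question's code); the structure of the returned JSON is an assumption here, so inspect it before picking out positions and points:

import json
import urllib2

req = urllib2.Request(gen_url(1), headers={'User-Agent': 'Mozilla/5.0'})
data = json.load(urllib2.urlopen(req))
print data.keys()  # look at the top-level structure first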
NOTE: There is a chance this method will not work, as sites sometimes block their data API from outside usage by screening the IP the request is coming from.
If this is the case you should go with the following approach.
Selenium (or other webdriver)
Using something like selenium, which is a webdriver, you can work with what is loaded into the browser, i.e. data that is only rendered after the server has returned the webpage.
There is some initial setup required to get selenium working, but it is a very powerful tool once you do.
A simple example of this would be:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")
You will see a Python-controlled Firefox browser (this can be done with other browsers too) open on your screen and load the url you provide. It will then follow the commands you give it, which can even be issued from a shell (useful for debugging), and you can search and parse the html in the same way you would with scrapy (code continued from the previous code section...).
If you want to perform something like clicking the next page button:
driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")
That expression may need some tweaking, but it is intended to find all the li elements with class='page' inside the div with class='pagination'. The // means an abbreviated path between elements; the alternative would be an absolute path like /html/body/div/div/... until you reach the element in question, which is why //div/... is useful and appealing.
For specific help and reference on locating elements, see their documentation page.
My usual method is trial and error for this, tweaking the expression until it hits the target elements I want. This is where the console/shell comes in handy. After setting up the driver as above, I usually try and build my expression:
Say you have an html structure like:
<html>
    <head></head>
    <body>
        <div id="container">
            <div id="info-i-want">
                treasure chest
            </div>
        </div>
    </body>
</html>
I would start with something like:
>>> print driver.find_element_by_xpath("//body").get_attribute("outerHTML")
<body>
    <div id="container">
        <div id="info-i-want">
            treasure chest
        </div>
    </div>
</body>
>>> print driver.find_element_by_xpath("//div[@id='container']").get_attribute("outerHTML")
<div id="container">
    <div id="info-i-want">
        treasure chest
    </div>
</div>
>>> print driver.find_element_by_xpath("//div[@id='info-i-want']").get_attribute("outerHTML")
<div id="info-i-want">
    treasure chest
</div>
>>> print driver.find_element_by_xpath("//div[@id='info-i-want']").text
treasure chest
>>> # BOOM TREASURE!
Usually it will be more complex, but this is a good and often necessary debugging tactic.
Back to your case, you could then save them out into an array:
links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")
and then one by one click them, scrape the new data, click the next one:
import time
from selenium import webdriver
driver = None
try:
    driver = webdriver.Firefox()
    driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")

    #
    # Scrape the first page
    #

    links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")

    for link in links:
        link.click()
        #
        # scrape the next page
        #
        time.sleep(1)  # pause for a time period to let the data load
finally:
    if driver:
        driver.close()
It is best to wrap it all in a try...finally type block to make sure you close the driver instance.
If you decide to delve deeper into the selenium approach, you can refer to their docs, which are excellent and have very explicit examples.
Happy scraping!
I am attempting to get the source code from a webpage including html that is generated by javascript. My code currently is as follows:
from selenium import webdriver
from bs4 import BeautifulSoup
case_url = "http://na.leagueoflegends.com/tribunal/en/case/5555631/#nogo"
try:
    browser = webdriver.Firefox()
    browser.get(case_url)
    url = browser.page_source
    print url
    browser.close()
except:
    ...

soup = BeautifulSoup(url)
...extraction code that finds the right tags, but they are empty...
When I print the source stored in url, it prints the usual HTML but is missing the generated html information. How do I get the same HTML as when I press F12 (but programmatically)?
Further to alecxe's answer below, your underlying issue was that you were extracting the HTML before the JavaScript had generated it. Selenium returns control as soon as the browser has loaded and does not wait for any post-load, JavaScript-generated HTML.
By using "find_elements", you will be automatically waiting for the elements to appear (depending on the timeout set when instantiating your driver).
If you get "page_source" after the "find_elements" call, you will see the full HTML.
I have automated many dynamically generated client-side web pages and have had no issues, provided I wait for the HTML to be rendered.
alecxe is correct that there is no need to use BeautifulSoup, but I wanted to make it clear that Selenium is perfectly able to automate JavaScript-generated HTML.
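A rough sketch of that flow, for illustration (the 'tab-' id pattern is taken from alecxe's answer below, and the implicit-wait length is an arbitrary choice):

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.implicitly_wait(10)  # wait up to 10 seconds for elements to appear
browser.get("http://na.leagueoflegends.com/tribunal/en/case/5555631/#nogo")

# looking up a JavaScript-generated element forces the wait until it exists
browser.find_elements_by_xpath('//a[contains(@id, "tab-")]')

url = browser.page_source  # now contains the generated HTML as well
soup = BeautifulSoup(url)
browser.quit()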
You don't really need to use BeautifulSoup for parsing html in this case, selenium itself is pretty powerful in terms of Locating Elements.
Here's how you can parse the contents of each tab/game one by one:
from selenium import webdriver
case_url = "http://na.leagueoflegends.com/tribunal/en/case/5555631/#nogo"
browser = webdriver.Firefox()
browser.get(case_url)
game_tabs = browser.find_elements_by_xpath('//a[contains(@id, "tab-")]')
for index, tab in enumerate(game_tabs, start=1):
    tab.click()

    game = browser.find_element_by_id('game%d' % index)

    game_type = game.find_element_by_id('stat-type-fill').text
    game_length = game.find_element_by_id('stat-length-fill').text
    game_outcome = game.find_element_by_id('stat-outcome-fill').text

    game_chat = game.find_element_by_class_name('chat-log')
    enemy_chat = [msg.text for msg in game_chat.find_elements_by_class_name('enemy') if msg.text]
    ally_chat = [msg.text for msg in game_chat.find_elements_by_class_name('ally') if msg.text]

    print game_type, game_length, game_outcome
    print "Enemy chat: ", enemy_chat
    print "Ally chat: ", ally_chat
    print "------"
prints:
Classic 34:48 Loss
Enemy chat: [u'Akali [All] [00:01:38] lol', ... ]
Ally chat: [u'Gangplank [All] [00:00:12] anyone remember the april fools lee sin spotlight? lol', ... ]
------
Dominion 19:22 Loss
Enemy chat: [u'Evelynn [All] [00:00:10] Our GP has a Ti-83', ... ]
Ally chat: [u'Miss Fortune [All] [00:00:18] arr ye wodden computer needs to walk the plank!', ... ]