How to get the EXACT, REAL value of 'href' - python

I'm trying to write a program that fetches my attendance information from my college website. To do that, I wrote a script that logs in to the website, which takes me to my dashboard, then goes to the Attendance tab, gets the href, and appends it to the URL of the college website.
The tag in the attendance class looked like this:
Attendance
When I clicked the Attendance link, the URL in the address bar looked like this:
http://erp.college_name.edu/Student/StudentAttendanceView.aspx?SID=7JyKkZE1Eyx2EYNii7tOjQ==|yaE7DmfR9r8=
So it seemed self-explanatory that I was supposed to append the href to
'http://erp.college_name.edu'. I did that, i.e.:
L = 'http://erp.college_name.edu' + str(I.findAll('li')[4].a.get('href').replace('.', ''))
But the href I fetch is different from the one in the tag, and it keeps changing. When I print L, I get this, which has the form I expected:
http://erp.college_name.edu/Student/StudentAttendanceViewaspx?SID=aDmK9cEFWwDqvsWw5ZzEOw==|oTeYVRfW1u8=
The problem is that the href I'm getting is different from the real URL, and IT KEEPS CHANGING WHEN I RE-RUN THE PROGRAM. The second time I got:
http://erp.college_name.edu/Student/StudentAttendanceViewaspx?SID=WM/lbVRchyyBiLsDvkORJw==|MaP8NtvvrHE=
Why am I getting this? Moreover, when I click other links on my Dashboard page and then click the Attendance tab again, the value in the address bar changes again.
After that, when I did:
opens = requests.get(L)
soup_2 = BeautifulSoup(opens.text, 'lxml')
print(L)
I got this:
C:\Users\HUNTER\AppData\Local\Programs\Python\Python35-32\python.exe
C:/Users/HUNTER/PycharmProjects/dictionary/erp_1.py
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>The page cannot be found</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<style type="text/css">
BODY { font: 8pt/12pt verdana }
H1 { font: 13pt/15pt verdana }
H2 { font: 8pt/12pt verdana }
A:link { color: red }
A:visited { color: maroon }
</style>
</head><body><table border="0" cellspacing="10" width="500"><tr><td>
<h1>The page cannot be found</h1>
The page you are looking for might have been removed, had its name
changed, or is temporarily unavailable.
<hr/>
<p>Please try the following:</p>
<ul>
<li>Make sure that the Web site address displayed in the address bar of
your browser is spelled and formatted correctly.</li>
<li>If you reached this page by clicking a link, contact
the Web site administrator to alert them that the link is incorrectly
formatted.
</li>
<li>Click the Back button to
try
another link.</li>
</ul>
<h2>HTTP Error 404 - File or directory not found.<br/>Internet
Information
Services (IIS)</h2>
<hr/>
<p>Technical Information (for support personnel)</p>
<ul>
<li>Go to <a href="http://go.microsoft.com/fwlink/?
linkid=8180">Microsoft
Product Support Services</a> and perform a title search for the words
<b>HTTP</b> and <b>404</b>.</li>
<li>Open <b>IIS Help</b>, which is accessible in IIS Manager (inetmgr),
and search for topics titled <b>Web Site Setup</b>, <b>Common
Administrative
Tasks</b>, and <b>About Custom Error Messages</b>.</li>
</ul>
</td></tr></table></body></html>
Process finished with exit code 0
UPDATE
I replaced .replace('.', '') with [2:] because replace also removed the . from .aspx in the href, and the problem has now changed to this
But still, why does the value of the href keep changing, and how can I fetch that page?
Any help?
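A minimal sketch of the URL-building step only, under the assumption (not confirmed above) that the raw href is a relative path such as ../Student/StudentAttendanceView.aspx?SID=...; the dashboard page name and the SID value below are made up. It shows why .replace('.', '') breaks .aspx and how the standard library's urljoin resolves the relative href without string surgery. The SID looks like a per-login token, so a different value on every run is probably expected, and fetching the page with the same logged-in requests.Session that loaded the dashboard may be required, though that is a guess about this site.

```python
from urllib.parse import urljoin

# Hypothetical values: the dashboard page name and the SID token are invented.
dashboard_url = "http://erp.college_name.edu/Student/Dashboard.aspx"
href = "../Student/StudentAttendanceView.aspx?SID=abc123==|xyz="

# Stripping every dot also removes the one in ".aspx", so the path no longer exists.
broken = "http://erp.college_name.edu" + href.replace(".", "")

# urljoin resolves "../" against the page the link was found on.
attendance_url = urljoin(dashboard_url, href)

print(broken)
print(attendance_url)
```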

Related

Can't extract some tags from a web page

I was scraping some data from this URL
https://www.degruyter.com/search?query=*&startItem=0&pageSize=10&sortBy=relevance&documentTypeFacet=journal
When I try to get the journal names, I get nothing. Some tags return a response, but the tags for journal names return nothing.
The div with class name "resultTitle" holds the journal names, but when I try the following in Scrapy,
response.css("div.resultTitle").get() returns nothing.
I have tried BeautifulSoup as well.
It seems that the block containing what you want, "resultTitle", is loaded by JS, namely xxxxxxxx-main.js:
...
a.loginContentPromise.then((()=>{
const e = document.querySelector("#session-redirect");
if (e) {
const t = e.dataset.destination || "/";
window.location.replace(t)
}
}
)),
...
You can find a code block like the one below if you make the request with the "wget" command instead of a web browser.
...
<main id="main" class='language_en px-0 min-vh-100 container-fluid'>
<div id="session-redirect" data-destination='/search?query=*&startItem=0&pageSize=10&sortBy=relevance&documentTypeFacet=journal'></div>
</main>
...
You can read the "xxxxxxxx-main.js" JS code and reimplement it,
or simply use Splash to handle it.
P.S.
wget -O search_result.html https://www.degruyter.com/search\?query\=\*\&startItem\=0\&pageSize\=10\&sortBy\=relevance\&documentTypeFacet\=journal
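A rough sketch of the first step, assuming the raw response looks like the wget output quoted above: pull data-destination out of the #session-redirect div, which is the URL the JS would redirect to. It uses only the standard library's html.parser with the quoted HTML as input; the class name RedirectFinder is made up for the example. Whether requesting that destination over plain HTTP returns the result titles depends on the login and session handling in the JS, which is not shown here.

```python
from html.parser import HTMLParser


class RedirectFinder(HTMLParser):
    """Collects data-destination from <div id="session-redirect">."""

    def __init__(self):
        super().__init__()
        self.destination = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div" and attributes.get("id") == "session-redirect":
            # Same fallback as the JS: e.dataset.destination || "/"
            self.destination = attributes.get("data-destination") or "/"


# The HTML fragment from the wget output above.
raw_html = """
<main id="main" class='language_en px-0 min-vh-100 container-fluid'>
<div id="session-redirect" data-destination='/search?query=*&startItem=0&pageSize=10&sortBy=relevance&documentTypeFacet=journal'></div>
</main>
"""

finder = RedirectFinder()
finder.feed(raw_html)
print(finder.destination)
```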

List of html attributes that could be displayed to user in some cases in browser

I am designing a web scraping script in Python. I am using the BeautifulSoup module for this and it mostly works, but a few of my requirements are not met by BeautifulSoup.
When I use BeautifulSoup to extract the content that a browser could display to the user, it leaves out some text, such as the "placeholder" attribute value of an input element. I wrote the code below to demonstrate this behavior.
Python code:
import requests
from bs4 import BeautifulSoup as bs4
web_page = requests.get("http://localhost/1.html", allow_redirects=True)
web_view = bs4(web_page.text, "html.parser")
print(web_view.text)
HTML code of http://localhost/1.html is
<html>
<title>Test Website</title>
<body>
<p>Hello World</p>
<form>
<input placeholder="Username"/>
<input placeholder="Password" type="password"/>
</form>
</body>
</html>
The output of the above Python code is:
Test Website
Hello World
I expect the words "Username" and "Password" to be extracted in the Python output as well, because they are also displayed to the user in the browser.
My requirement is not limited to the "placeholder" attribute of the "input" element. I need the text that could be displayed to the user in the browser when some exception happens. For example, if an image referenced by an "img" tag is missing, the user sees the text provided in the "alt" attribute of that "img" tag, like this.
HTML code for this page:
<html>
<title>Test Website</title>
<body>
<p>Hello World</p>
<form>
<input placeholder="Username"/>
<input placeholder="Password" type="password"/>
</form>
<br><br><br>
<img src="2.img" alt="Image missing">
</body>
</html>
"2.img" is the image, and I know it is missing.
My overall question is:
I need to see all the web page content that is displayed to the user in the browser, including exception cases like a missing image. Currently, BeautifulSoup returns only the text content of each DOM element and does not extract text held in attributes that could be displayed to the user. I need those attribute values too.
If this information can be extracted with BeautifulSoup, I would be happy to see how to do it. If it's not possible, I would like to know all the HTML tag attributes (as a list) that fall into this category, so that I can write code that searches for those attributes across all the tags on an HTML page.
If a complete list of attributes is not possible, I ask everyone to provide the attribute names they know of that fall under this use case, so that I can prepare a list that is at least partially correct.
Edited:
In short:
What are all the attributes of any HTML tag whose values might be displayed to the user in the browser? We both know that the "placeholder" attribute value of an input tag is displayed in the browser, and that the "alt" attribute value of an image tag is displayed if the image is missing. Besides placeholder and alt, what other attributes are there?
Regarding your first question, you can't expect the .text attribute to give you the attributes of specific tags. You need to use .attrs['<attr_name>'] [docs] to get the desired output:
input_tags = web_view.find('form').find_all('input')
placeholders = [each.attrs['placeholder'] for each in input_tags]
# -> ['Username', 'Password']
As for the second question, you can find all img tags and print their alt attributes, if that's what you are looking for:
imgs = web_view.find_all('img')
alt_attrs = [each.attrs['alt'] for each in imgs]
# -> ['Image missing']
To get every attribute of a given tag, use .attrs:
input_tags = web_view.find('form').find_all('input')
attributes = [each.attrs for each in input_tags]
# -> [{'placeholder': 'Username'}, {'placeholder': 'Password', 'type': 'password'}]
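Putting the two ideas together, here is a minimal sketch that collects the text nodes plus a few attributes that browsers can show to the user. It uses the standard library's html.parser so it runs without extra installs; the same loop could be written over BeautifulSoup tags with .attrs. The attribute list (placeholder, alt, title) is an illustrative assumption and certainly not complete, and the class name is made up.

```python
from html.parser import HTMLParser

# Illustrative, incomplete list of attributes whose values a browser may show.
VISIBLE_ATTRS = ("placeholder", "alt", "title")


class VisibleTextCollector(HTMLParser):
    """Gathers text nodes and selected attribute values in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input .../> also end up here.
        for name, value in attrs:
            if name in VISIBLE_ATTRS and value:
                self.chunks.append(value)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


page = """
<html><title>Test Website</title><body>
<p>Hello World</p>
<form>
<input placeholder="Username"/>
<input placeholder="Password" type="password"/>
</form>
<img src="2.img" alt="Image missing">
</body></html>
"""

collector = VisibleTextCollector()
collector.feed(page)
print(collector.chunks)
```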

How to Access Web Elements in Nested HTML Tags Using Selenium

I have a website that generates a report within an iframe, and the website's HTML structure is laid out to look like the following:
<html> <!-- a -->
<body>
.
. (C)
.
<iframe>
<html> <!-- b -->
<body>
.
. (D)
.
</body>
</html>
</iframe>
</body>
</html>
I am wondering how I can access elements in the body containing "(D)" above. So far I have tried using XPath, but if I copy the XPath of an element in (D), it begins from html tag b, so when the program starts searching from html tag a, it finds nothing. I tested whether the iframe has any child web elements, and it has 0. Attempts to search for elements in (D) by ID have also resulted in NoSuchElementException. It is not a timing problem, because I am using WebDriverWait for 120 seconds to make sure everything on the page has loaded. Thank you for any help you can provide.
Because it's in an iframe, you need to switch the driver to that frame to reach its DOM.
See more here: https://www.techbeamers.com/switch-between-iframes-selenium-python/ or https://www.guru99.com/handling-iframes-selenium.html
driver.switch_to.frame(frame_reference)
where frame_reference is one of:
The iframe element itself, located for example by tag name (in this case 'iframe')
The id of the iframe
The name of the iframe
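A minimal sketch of that switch in Python, written as a helper so the driver is always switched back afterwards. The helper name and the locator arguments are made up for illustration; locators are (by, value) tuples as used with Selenium's By constants, for example (By.TAG_NAME, "iframe") and (By.ID, "report-total"), where "report-total" is a hypothetical id.

```python
def read_text_inside_iframe(driver, frame_locator, element_locator):
    """Return the text of an element that lives inside an iframe.

    driver is a Selenium WebDriver; both locators are (by, value) tuples.
    """
    # Locate the <iframe> element in the outer document (html "a").
    frame = driver.find_element(*frame_locator)
    # Move the driver's context into the frame's own document (html "b").
    driver.switch_to.frame(frame)
    try:
        return driver.find_element(*element_locator).text
    finally:
        # Return to the top-level document so later lookups behave as before.
        driver.switch_to.default_content()


# Usage with a real driver (not run here):
#   from selenium.webdriver.common.by import By
#   text = read_text_inside_iframe(
#       driver, (By.TAG_NAME, "iframe"), (By.ID, "report-total"))
```

Any explicit wait for the report element would go inside the try block, after the switch, since the wait also has to look inside the frame.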

Python Selenium send_keys() for non-input tag

The Problem
I'm trying to change the page of a database (ReferenceUSA, requires paid or university credentials) using Selenium by typing in the page number and pressing Enter, but the catch is that the search box is in a div tag. send_keys() only works for input and textarea tags. Here's a picture of the page navigator, and here's the HTML:
<div class="pager">
<div class="prev button" title="Hold down to scroll through records faster." style="float: left;">«</div>
<div class="page" style="text-align: center; margin: 0px 1em; float: left; width: 40px;">1</div>
<div class="next button" title="Hold down to scroll through records faster.">»</div><span style="clear: both;"></span></div>
My Attempt
I figured out how to change the innerHTML, so I really only need to hit enter on the div tag. I thought of and tried changing the innerHTML for the next button (e.g. set current page to 1207 in next_button() function if I wanted to navigate to page 1208), but the functions are all written in a global file for the site, thousands of lines long, and they all feed off each other using the same 6 variables making it essentially unreadable. I've brought this problem to one of my CS professors to no avail.
There must be a simple solution, but I'm at a loss right now. I would greatly appreciate it if anyone could offer some guidance
Update:
I ended up scrapping Selenium altogether and figured out how to do it using the Requests module. The following is the snippet of code for my workaround:
# Select start page
data_data = {
    'requestKey': page_data,   # Collected earlier in the code
    'direction': 'Ascending',  # Collected with Chrome DevTools
    'pageIndex': page_index    # Page number - 1
}
data = r.post(SELECT_PAGE, data=data_data)
Rather than clicking the button in Selenium, I figured out that a POST request was sent every time the page number changed, so I just sent that POST request myself.
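For illustration, here is a sketch of how such a page-change request could be assembled. It uses only the standard library and does not send anything; the URL, the helper name, and the request key value are placeholders, and the field names are the ones from the snippet above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint; the real URL was found with Chrome DevTools.
SELECT_PAGE = "https://example.com/SelectPage"


def build_page_request(request_key, page_number):
    """Build (but do not send) the POST that jumps to page_number."""
    payload = {
        "requestKey": request_key,
        "direction": "Ascending",
        "pageIndex": page_number - 1,  # page number minus 1, as in the snippet above
    }
    body = urlencode(payload).encode("ascii")
    return Request(SELECT_PAGE, data=body, method="POST")


req = build_page_request("abc", 1208)
print(req.get_method(), req.data)
```

With Requests, the same payload dict can be passed as data= to a session's post(), which is what the workaround above does.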

Getting html elements of another page with window.open()

I'm trying to access html elements from an html page I access with window.open().
These are my HTML files:
firstpage.html:
<html>
<body>
<script>
function openPage() {
return window.open("secondpage.html")
}
</script>
</body>
</html>
secondpage.html
<html>
<body>
<h1>Hello!</h1>
<p>This is the second page</p>
</body>
</html>
This is what I'm doing:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("firstpage.html")
html = browser.execute_script("return openPage().document;")
print(html)
What I expect to get is a reference to the document element of the second page. This seems to work in the Firefox web console. When I run the script, the second page opens, but the first page seems to hang, and after a while I get a dialog saying:
"A script on this page may be busy, or it may have stopped responding. You can stop the script now, or you can continue to see if the script will complete."
It has "Stop" and "Continue" buttons. If I press "Continue", the dialog keeps reappearing; when I eventually press "Stop", the html Python variable contains the same text as the dialog.
What am I doing wrong?
EDIT:
As in @e1che's answer, this is the right way to do it:
firstpage.html:
<html>
<body>
<script>
function openPage() {
window.open("secondpage.html", "secondpagewindow")
}
</script>
</body>
</html>
The python code:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("firstpage.html")
browser.execute_script("openPage()")
browser.switch_to.window("secondpagewindow")
print(browser.page_source)
You're doing well,
but you forgot to switch your driver to the new window.
So right after your browser.execute_script("return openPage().document;") call,
write something like:
driver.switch_to.window("windowName")
I'll let you search here for more info and tricks ;)
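If the new window has no name, a common alternative is to compare the window handles before and after opening it. A minimal sketch, assuming a Selenium WebDriver is passed in; the helper name is made up, and in practice you may need to wait until the new handle appears before switching.

```python
def switch_to_new_window(driver, open_window):
    """Call open_window(), then switch the driver to the window it opened.

    driver is a Selenium WebDriver; open_window is any zero-argument callable
    that triggers the popup (for example, a call to execute_script).
    """
    known = set(driver.window_handles)
    open_window()
    # Whatever handle was not there before belongs to the new window.
    new_handles = [h for h in driver.window_handles if h not in known]
    if not new_handles:
        raise RuntimeError("no new window was opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]


# Usage with a real driver (not run here):
#   switch_to_new_window(browser, lambda: browser.execute_script("openPage()"))
#   print(browser.page_source)
```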
