Accessing a web directory using Python

I want to access the details of all the students from the following college website: https://java.access.uni.edu/ed/faces/searchStudent.jsp
I don't know the students' names, and I want to access the details of each student.
The directory is open, and there is nothing illegal about this.
I am using the following GitHub code as a reference:
https://github.com/JoshuaRLi/direktory/blob/master/direktory.py
Please help!

You can do this with BeautifulSoup (bs4), which will help you scrape the content from the given directory. This is web scraping, and it is what the code in your GitHub link does.
Another method is Selenium WebDriver: you load the URL, fill in each form field with its value, and submit the form. You can also trigger the API URL from Selenium itself, as sketched below.
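Here is a minimal sketch of the Selenium approach, assuming Selenium and a browser driver are installed; the form field names are taken from the POST example further down and may differ on the live page:
from selenium import webdriver

# assumes Firefox/geckodriver; newer Selenium versions use find_element(By.NAME, ...)
driver = webdriver.Firefox()
driver.get("https://java.access.uni.edu/ed/faces/searchStudent.jsp")
driver.find_element_by_name("txtLastName").send_keys("mohideen")
driver.find_element_by_name("txtFirstName").send_keys("mohamemd")
driver.find_element_by_name("txtLastName").submit()  # submits the enclosing form
print(driver.page_source[:300])
driver.quit()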
Otherwise, you can send a POST request and read the response directly using the requests library. Here is an example:
>>> import requests
>>> r = requests.post("https://java.access.uni.edu/ed/faces/searchStudent.jsp;jsessionid=e8093da105003620293edb31ec442edfdfa514485389b950c4f20b46515aa640.e34Sbx0MaNuObi0LahiMaxmRb30Re0", data={'txtLastName':'mohamemd','txtFirstName':'mohideen','txtEmail':'temp@mail.com','soMajor':0,'soCollege':0,'soClass':0})
>>> r.status_code
200
>>> r.text[:300]
u'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\r\n"http://www.w3.org/TR/html4/loose.dtd">\r\n\r\n\r\n\r\n\r\n\r\n\r\n <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/loose.dtd"><html dir="ltr" lang="en-US">\r\n <head id="head1"><title>UNI Directory - Student Search</t'
>>> a = r.text[:300]
>>> len(a)
300
>>>
Here I restricted the output to the first 300 characters; if you want the full response, simply print:
r.text

Related

Scraping data from span tag with class using BS

I'm trying to scrape project names from GitLab. When I inspect the source code I see that the name of a project is in:
<span class='project-name'>Project Name</span>
Unfortunately, when I try to scrape this data I get an empty list. My code looks like:
import urllib.request
import bs4 as bs

url = 'https://gitlab.com/users/USER/projects'
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source, 'lxml')
repos = [repo.text for repo in soup.find_all('span', {'class': 'project-name'})]
I have tried other solutions, such as using attrs, class_, or other HTML tags, but nothing works. What could be wrong here?
OK, so when you inspect the page in the Network tab of Chrome's developer tools, you can see that the projects are not rendered when the initial request is made.
That means the project information is requested afterwards. In order to get the projects, you need to send a request to the https://gitlab.com/users/USER/projects.json endpoint.
You can then inspect the response from that endpoint. The response is JSON, so we can load it with the json module; in the resulting dictionary there is an entry called html which holds HTML data, so we can parse that with BeautifulSoup, and the rest of the code stays the same:
import bs4 as bs
import urllib.request
import json

url = 'https://gitlab.com/users/USER/projects.json'
source = urllib.request.urlopen(url).read()
# the rendered markup lives under the "html" key of the JSON response
soup = bs.BeautifulSoup(json.loads(source)["html"], 'html.parser')
repos = [repo.text for repo in soup.find_all('span', {'class': 'project-name'})]
print(repos)
Output:
['freebsd', 'freebsd-ports', 'freebsd-test', 'risc-vhdl', 'dotfiles', 'tideyBot']

Python HTML parsing: removing excess HTML from get request output

I want to make a simple Python script to automate pulling .mov files from an IP camera's SD card. This model of IP camera supports HTTP requests, which return HTML containing the .mov file info. My Python script so far:
from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
OUTPUT:
NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov
I want to return only the .mov file name, i.e. remove:
"NAME2041=Record_continiously/2018-06-02/8/"
I'm new to HTML parsing with Python, so I'm a bit confused by the functionality.
Is the returned HTML considered a string? If so, I understand that it is immutable and that I will have to create a new string instead of "stripping away" parts of the existing one.
I have tried:
page.replace("NAME2041=Record_continiously/2018-06-02/8/","")
which raises an AttributeError. Is anyone aware of a method that could accomplish this?
Here is a sample of the HTML I am working with...
<html>
<head></head>
<body>
000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077
NAME2=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-15-36_60.mov SIZE2=15676882
NAME3=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-14-35_60.mov SIZE3=15731539
</body>
</html>
Use str.split with negative indexing.
Example:
page = "NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov"
print( page.split("/")[-1])
Output:
MP_2018-06-03_00-33-15_60.mov
Since you asked for an explanation of your code, here it is:
# import statements
from bs4 import BeautifulSoup
import requests

# requests.get returns a Response object
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")

# page.content is the raw content of the response; BeautifulSoup is
# initialized with two arguments: that content and a parser, here 'html.parser'
soup = BeautifulSoup(page.content, 'html.parser')
soup is the BeautifulSoup object, and .prettify() is the method used to pretty-print the content.
String slicing may fail depending on the length of the content, so it is better to split the content as suggested by @Rakesh; that is the best approach in your case.
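If you want every file name rather than just one, a small sketch that extracts all of them from the camera's plain-text body (using the sample response shown in the question) could look like this:
import re

# The body is 'NAME<n>=<path> SIZE<n>=<bytes>' pairs; grab each path and
# keep only the file name after the last slash. Sample data from the question.
body = '''000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077'''

movs = [p.split('/')[-1] for p in re.findall(r'NAME\d+=(\S+?\.mov)', body)]
print(movs)  # ['MP_2018-06-04_12-17-38_60.mov', 'MP_2018-06-04_12-16-37_60.mov']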

Python equivalent of full webpage download

I'm trying to create a basic scraper that will scrape the username and song title from a search on SoundCloud. By inspecting the element I needed (using Chrome), I found that I needed the string associated with every 'span' tag with title="soundTitle__usernameText". Using BeautifulSoup, urllib2, and lxml, I have the following code for a search for 'robert delong':
from lxml import html
from bs4 import BeautifulSoup
from urllib2 import urlopen
import requests
def search_results(url):
    html = urlopen(url).read()
    # html = requests.get(url)  # I've tried this also
    soup = BeautifulSoup(html, "lxml")
    usernames = [span.string for span in soup.find_all("span", "soundTitle__usernameText")]
    return usernames

print search_results('http://soundcloud.com/search?q=robert%20delong')
This returns an empty list. However, when I save the complete webpage in Chrome (File > Save > Format: Webpage, Complete) and use that saved HTML file instead of the one obtained with urlopen, the code prints
[u'Two Door Cinema Club', u'whatever-28', u'AWOLNATION', u'Two Door Cinema Club', u'Sean Glass', u'Capital Cities', u'Robert DeLong', u'RAC', u'JR JR']
which is the ideal outcome. It appears that urlopen retrieves a somewhat truncated version of the HTML, which is why the search returns an empty list.
Any thoughts on how I can access the same HTML I get by manually saving the webpage, but from Python/Terminal? Thank you.
You guessed right. The downloaded HTML does not contain all the data: JavaScript is used to request the information in JSON format, which is then inserted into the document.
By looking at the requests Chrome makes (Ctrl+Shift+I, "Network" tab), I see that it requested https://api-v2.soundcloud.com/search?q=robert%20delong. I believe the response to that has the information you need.
Actually, this is good for you. Reading JSON should be much more straightforward than parsing HTML ;)
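A minimal sketch of that idea follows. In practice api-v2 usually also requires a client_id query parameter, and the JSON field names used here ('collection', 'user', 'username', 'title') are assumptions to verify against a real response:
import requests

# Sketch only: api-v2 usually also requires a client_id query parameter,
# and the JSON field names below are assumptions based on a typical response.
r = requests.get("https://api-v2.soundcloud.com/search",
                 params={"q": "robert delong"})
for item in r.json().get("collection", []):
    print(item.get("user", {}).get("username"), "-", item.get("title"))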
This is the command you can use in a terminal to download the webpage's HTML together with its linked assets and images:
wget -p --convert-links http://www.website.com/directory/webpage.html
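If you would rather stay in Python, a rough equivalent for just the HTML (without wget's downloading of linked assets or link rewriting) is a sketch like this, using the placeholder URL from the command above:
import requests

# Saves only the HTML document itself; unlike wget -p it does not fetch
# the page's images/CSS or rewrite the links inside it.
url = 'http://www.website.com/directory/webpage.html'
with open('webpage.html', 'wb') as f:
    f.write(requests.get(url).content)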

source code of web page not available using urllib.urlopen()

I am trying to get video links from https://www.youtube.com/trendsdashboard#loc0=ind. When I inspect elements, the browser displays the HTML source for each video. But the source code retrieved using
urllib2.urlopen("https://www.youtube.com/trendsdashboard#loc0=ind").read()
does not contain the HTML for the videos. Is there any other way to do this?
<a href="/watch?v=dCdvyFkctOo" alt="Flipkart Wish Chain">
<img src="//i.ytimg.com/vi/dCdvyFkctOo/hqdefault.jpg" alt="Flipkart Wish Chain">
</a>
This markup appears when we inspect elements in the browser, but not in the source retrieved by urllib.
To view the source code you need to use the read method.
If you just call urlopen, it gives you something like this:
In [12]: urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind')
Out[12]: <addinfourl at 3054207052L whose fp = <socket._fileobject object at 0xb60a6f2c>>
To see the source, use read:
urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind').read()
Whenever you compare the source code seen by your Python code with what the browser shows, don't do it through Inspect Element; right-click the webpage and click View Source instead, and you will see the actual source. Inspect Element displays the aggregated result of however many network requests were made, plus any JavaScript that has executed.
Keep the developer console open before opening the webpage, stay on the Network tab, and make sure "Preserve Log" is checked in Chrome (or "Persist" in Firebug for Firefox); then you will see all the network requests made.
This works for me:
import urllib2

url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
html = urllib2.urlopen(url).read()
IMO I'd use requests instead of urllib - it's a bit easier to use:
import requests
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
response = requests.get(url)
html = response.content
Edit
This will get you a list of all <a></a> tags with hyperlinks, as per your edit. I use the BeautifulSoup library to parse the HTML:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
links = [tag for tag in soup.findAll('a') if tag.has_attr('href')]
We also need to decode the data as UTF-8. Since response.content is bytes, use:
html = response.content.decode('utf-8')
print(html)

Spynner doesn't load html from URL

I use spynner for scraping data from a site. My code is this:
import spynner
br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")
text = br._get_html()
This code fails to load the entire html page. This is the html that I received:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>
<script type="text/javascript">(function(){var d=document,m=d.cookie.match(/_abs=(([or])[a-z]*)/i)
v_abs=m?m[1].toUpperCase():'N'
if(m){d.cookie='_abs='+v_abs+'; path=/; domain=.venere.com';if(m[2]=='r')location.reload(true)}
v_abp='--OO--OOO-OO-O'
v_abu=[,,1,1,,,1,1,1,,1,1,,1]})()
My question is: how do I load the complete HTML?
More information:
I tried with:
import spynner
br = spynner.Browser()
respond = br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")
if respond is None:
    br.wait_load()
but the HTML never loads completely or reliably. What is the problem? I'm going crazy.
Again:
I'm working in Django 1.3. If I use the same code in plain Python (2.7), it sometimes loads all of the HTML.
Use a wait callback so spynner keeps waiting until the content you need is present. After this, when you check the contents of test.html you will find the p elements with id="feedback-...somenumber...":
import spynner

def content_ready(browser):
    # keep waiting until the reviews markup is present in the page
    if 'id="feedback-' in browser.html:
        return True

br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews", wait_callback=content_ready)
with open("test.html", "w") as hf:
    hf.write(br.html.encode("utf-8"))
