def align_sequences(IDs):
    import webbrowser
    import urllib, urllib2

    url = 'http://www.uniprot.org/align/'
    params = {'query': IDs}
    data = urllib.urlencode(params)
    request = urllib2.Request(url, data)
    response = urllib2.urlopen(request)
    job_url = response.geturl()
    webbrowser.open(job_url)

align_sequences('Q4PRD1 Q7LZ61')
With this function I want to open 'http://www.uniprot.org/align/', request the protein sequences with IDs Q4PRD1 and Q7LZ61 to be aligned, and then open the website in my browser.
Initially it seems to be working fine - running the script opens the website and shows the alignment job being run. However, the job keeps going forever and never actually finishes, even if I refresh the page. If I input the IDs in the browser and hit 'align' it works just fine, taking about 8 seconds to align.
I am not familiar with the differences between running something directly from a browser and running it from Python. Do any of you have an idea of what might be going wrong?
Thank you :-)
~Max
You have to click the Align button, and you can't do that with the webbrowser module. One option is to use Selenium:
from selenium import webdriver
url = 'http://www.uniprot.org/align/'
ids = 'Q4PRD1 Q7LZ61'
driver = webdriver.Firefox()
driver.get(url)
q = driver.find_element_by_id('alignQuery')
q.send_keys(ids)
btn = driver.find_element_by_id("sequence-align-submit")
btn.click()
I think this is handled by JavaScript. If you look at the HTML code of the Align button you can see
onclick="UniProt.analytics('AlignmentSubmissionPage', 'click', 'Submit align'); submitAlignForm();"
UniProt.analytics() and submitAlignForm() are JavaScript functions; that magic lives in the js-compr.js2013_11 file.
You can view this file using http://jsbeautifier.org/ and then do in Python what the JavaScript does.
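For illustration, here is a rough sketch of that idea using only the standard library: it reuses the POST from the question and then polls the job URL until the result page stops reporting the job as running. The completion check is an assumption and would need to match whatever the real result page (or the beautified JS) actually shows.

import time
import urllib
import urllib2
import webbrowser

url = 'http://www.uniprot.org/align/'
data = urllib.urlencode({'query': 'Q4PRD1 Q7LZ61'})
job_url = urllib2.urlopen(urllib2.Request(url, data)).geturl()

# Poll the job page; the 'running' marker is an assumption - inspect the real
# result page to see how a finished alignment is reported.
for _ in range(30):
    page = urllib2.urlopen(job_url).read()
    if 'running' not in page.lower():
        break
    time.sleep(2)

webbrowser.open(job_url)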
Related
I am trying to write a script to automate job applications on Linkedin using selenium and python.
The steps are simple:
open the LinkedIn page, enter id password and log in
open https://linkedin.com/jobs, enter the search keyword and location, and click search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably due to some POST information missing from the previous page)
clicking search opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source
stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)
def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()
def get_experience():
    return "1%C22"
login()
jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
# a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the page content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried searching for elements on the search page, like the experience filter and the Easy Apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in Advance
UPDATE
It works if I try to directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia
The difference between these two links is that one takes only one value for the experience level while the other takes two values. This means it's probably not an issue with POST values.
You are getting and printing the current URL immediately after clicking the search button, before the page has changed with the response received from the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class; you still have to call its until() method with an expected condition for anything to actually be waited for.
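A minimal sketch of how the wait could be used in your script, assuming the search results URL always contains "/jobs/search" (that substring is an assumption to verify):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
# Actually wait (up to 20 s) until the browser has navigated to the results page
WebDriverWait(driver, 20).until(EC.url_contains("/jobs/search"))
print(driver.current_url)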
I want to open a url using python script and then same python script should fill the form but not submit it
For example script should open https://www.facebook.com/ and fill the name and password in the fields, but don't submit it.
You can use Selenium to get it done smoothly. Here is the sample code with Google search:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://www.google.com")
browser.find_element_by_id("lst-ib").send_keys("book")
# browser.find_element_by_name("btnK").click()
The last line is intentionally commented out so the search is not submitted.
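As a rough sketch of the same pattern for the Facebook case from the question, something like the following could work; the element IDs "email" and "pass" are assumptions and should be verified by inspecting the login page:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://www.facebook.com/")
# "email" and "pass" are assumed field IDs - check the page source to confirm them
browser.find_element_by_id("email").send_keys("my.name@example.com")
browser.find_element_by_id("pass").send_keys("my-password")
# No click on the login button, so the form is filled in but never submitted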
Many websites don't allow web scraping; in fact, it may even expose you to an unauthorised-access claim.
But try using the requests library in Python.
You'll find it easy to do that kind of thing.
https://realpython.com/python-requests/
import requests

payload = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
url = 'http://www.locationary.com/home/index2.jsp'
requests.post(url, data=payload)
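If the site sets a login cookie, a requests.Session keeps it for subsequent calls. A minimal sketch reusing the same payload and URL; the follow-up request is a placeholder for whatever page you want after logging in:

import requests

payload = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
url = 'http://www.locationary.com/home/index2.jsp'

with requests.Session() as session:
    session.post(url, data=payload)                      # log in; cookies are stored on the session
    page = session.get('http://www.locationary.com/')    # placeholder follow-up request
    print(page.status_code)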
I want to scrape, e.g., the title of the first 200 questions under the web page https://www.quora.com/topic/Stack-Overflow-4/all_questions. And I tried the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
print("url")
print(url)
r = requests.get(url) # HTTP request
print("r")
print(r)
html_doc = r.text # Extracts the html
print("html_doc")
print(html_doc)
soup = BeautifulSoup(html_doc, 'lxml') # Create a BeautifulSoup object
print("soup")
print(soup)
It gave me the text at https://pastebin.com/9dSPzAyX. If we search for href='/, we can see that the HTML does contain the titles of some questions. However, the problem is that the number is not enough; on the actual web page, a user needs to manually scroll down to trigger extra loading.
Does anyone know how I could mimic "scrolling down" by the program to load more content of the page?
Infinite scrolling on a webpage is based on JavaScript functionality. Therefore, to find out what URL we need to access and what parameters to use, we need to either thoroughly study the JS code working inside the page or, preferably, examine the requests that the browser makes when you scroll down the page. We can study those requests using the Developer Tools.
For Quora, for example, the further you scroll down, the more of these requests are generated; your requests should then be sent to that URL instead of the normal page URL, but keep in mind to send the correct headers and payload.
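A rough sketch of replaying such a request with requests, assuming you have copied the request URL, headers, and payload from the Network tab of the Developer Tools; every value below is a placeholder to fill in from what you observe:

import requests

# All of these values are placeholders - copy the real ones from DevTools
xhr_url = "https://www.quora.com/ajax/..."   # the endpoint seen while scrolling
headers = {"User-Agent": "Mozilla/5.0", "Referer": "https://www.quora.com/..."}
payload = {"page": 2}                        # whatever parameters the site actually sends

response = requests.post(xhr_url, headers=headers, data=payload)
print(response.status_code)
print(response.text[:500])                   # inspect the returned fragment or JSON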
Another, easier solution is to use Selenium.
I couldn't find a way to do this with requests, but you can use Selenium. The code below first prints the number of questions present on the initial load, then sends the End key to mimic scrolling down. You can see the number of questions go from 20 to 40 after sending the End key.
I wait 5 seconds before reading the DOM again, in case the script runs faster than the DOM loads. You can improve this by using expected conditions (EC) with Selenium.
The page loads 20 questions per scroll. So if you are looking to scrape 100 questions, then you need to send the End key 5 times.
To use the code below you need to install chromedriver.
http://chromedriver.chromium.org/downloads
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
CHROMEDRIVER_PATH = ""
CHROME_PATH = ""
WINDOW_SIZE = "1920,1080"
chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH
prefs = {'profile.managed_default_content_settings.images':2}
chrome_options.add_experimental_option("prefs", prefs)
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
def scrape(url, times):
    if not url.startswith('http'):
        raise Exception('URLs need to start with "http"')
    driver = webdriver.Chrome(
        executable_path=CHROMEDRIVER_PATH,
        chrome_options=chrome_options
    )
    driver.get(url)
    counter = 1
    while counter <= times:
        # count the questions currently loaded in the list
        q_list = driver.find_element_by_class_name('TopicAllQuestionsList')
        questions = [x for x in q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')]
        q_len = len(questions)
        print(q_len)
        # send the End key to the page to trigger the next infinite-scroll load
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        wait = WebDriverWait(driver, 5)  # note: this only creates a wait object; the sleep below does the actual pausing
        time.sleep(5)
        # count again after the new batch of questions has loaded
        questions2 = [x for x in q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')]
        print(len(questions2))
        counter += 1
    driver.close()
if __name__ == '__main__':
    scrape(url, 5)
I recommend using Selenium rather than BeautifulSoup alone.
Selenium can control the browser as well as parse the page: it can scroll down, click buttons, and so on.
This example scrolls down to get all the users who liked a post on Instagram:
https://stackoverflow.com/a/54882356/5611675
If the content only loads on "scrolling down", this probably means that the page is using JavaScript to load the content dynamically.
You can try using a web client such as PhantomJS to load the page and execute the javascript in it, and simulate the scroll by injecting some JS such as document.body.scrollTop = sY; (Simulate scroll event using Javascript).
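A minimal sketch of the same idea using Selenium with Chrome instead of PhantomJS (which is now deprecated); the scroll is injected as JavaScript and the number of scrolls and the pause length are arbitrary assumptions:

import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is available on your PATH
driver.get("https://www.quora.com/topic/Stack-Overflow-4/all_questions")

for _ in range(5):  # number of scrolls is an arbitrary choice
    # Inject JS to jump to the bottom of the page, triggering the next load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)    # crude pause to let the new content load

html_doc = driver.page_source  # can now be fed to BeautifulSoup as in the question
driver.quit()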
I want to get the HTML text a few seconds after opening a URL.
Here's the code:
import requests
url = "http://XXXXX…"
html = request.get(url).text
I want to get the HTML text a few seconds after opening the URL.
Well, the webpage HTML stays the same right after you "get" the url using Requests, so there's no need to wait a few seconds as the HTML will not change.
I assume the reason that you would like to wait is for the page to load all the relevant resources (e.g. CSS/JS) that modifies the HTML?
If so, I wouldn't recommend using the Requests module, as you would have to load and apply all of the relevant resources yourself.
I suggest you have a look at Selenium for Python.
Selenium fully simulates a browser, hence you can wait and it will load all the resources for your webpage.
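A minimal sketch of that approach, assuming the few-second wait is simply to let the page finish rendering (the URL is the placeholder from the question):

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://XXXXX")   # placeholder URL from the question
time.sleep(5)                # wait a few seconds for scripts to modify the page
html = driver.page_source    # HTML as it stands after the wait
driver.quit()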
try using time.sleep(t)
import time
import requests

response = requests.get(url)
time.sleep(5)  # suspend execution for 5 secs
html = response.text
Separately, note that the last line of your code has a typo; you want to change it to:
html = requests.get(url).text
I have found the library requests-html handy for this purpose, though mostly I use Selenium (as already proposed in Danny's answer).
from typing import cast

from requests_html import HTMLSession, HTMLResponse

session = HTMLSession()
req = cast(HTMLResponse, session.get("http://XXXXX"))
req.html.render(sleep=5, keep_page=True)
Now req.html is an HTML object. In order to get the raw text or the HTML as a string you can use:
text = req.text
or:
text = req.html.html
Then you can parse your text string, e.g. with Beautiful Soup.
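For instance, a small sketch of feeding that string into Beautiful Soup; the tag and attribute choices below are just illustrative:

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "lxml")
# Illustrative example: collect the text of all links on the rendered page
links = [a.get_text(strip=True) for a in soup.find_all("a")]
print(links[:10])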
Basically you can give a sleep to the request as a parameter, as below:
import requests
import time
url = "http://XXXXX…"
seconds = 5
# note: time.sleep() runs (and returns None) before the request is sent, so this
# effectively pauses for 5 seconds and then fetches the page
html = requests.get(url, time.sleep(seconds)).text
These are the steps I need to automatize:
1) Log in
2) Select an option from a drop down menu (To acces a list of products)
3) search something on the search field (The product we are looking for)
4) click a link (To open up the product's options)
5) click another link(To compile all the .pdf files relevant to said product in a bigger .pdf)
6) wait for a .pdf to load and then download it.(Save the .pdf on my machine with the name of the product as the file name)
I want to know if this is possible. If it is, where can I find how to do it?
Is it pivotal that there is actual clicking involved? If you're just looking to download PDFs then I suggest you use the Requests library. You might also want to consider using Scrapy.
In terms of searching on the site, you may want to use Fiddler to capture the HTTP POST request and then replicate that in Python.
Here is some code that might be useful as a starting place - these functions would login to a server and download a target file.
import requests

def login():
    login_url = 'http://www.example.com'
    payload = 'usr=username&pwd=password'
    connection = requests.Session()
    # main_headers and proxies are assumed to be defined elsewhere in the script
    post_login = connection.post(data=payload,
                                 url=login_url,
                                 headers=main_headers,
                                 proxies=proxies,
                                 allow_redirects=True)
    return connection

def download(connection):
    # the session returned by login() is reused here so the login cookies are sent
    directory = "C:\\example"
    url = "http://example.com/download.pdf"
    filename = directory + '\\' + url[url.rfind("/")+1:]
    r = connection.get(url=url,
                       headers=main_headers,
                       proxies=proxies)
    file_size = int(r.headers["Content-Length"])
    block_size = 1024
    mode = 'wb'
    print("\tDownloading: %s [%sKB]" % (filename, int(file_size/1024)))
    if r.status_code == 200:
        with open(filename, mode) as f:
            for chunk in r.iter_content(block_size):
                f.write(chunk)
For static sites you can use the mechanize module, available from PyPI. It does everything you want, except that it does not run JavaScript and thus does not work on dynamic websites. Also, it is strictly Python 2 only.
easy_install mechanize
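A rough mechanize sketch for the log-in step; the form index and field names are assumptions you would adjust after inspecting the target site's forms:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)               # some sites block the default robots handling
br.open("http://www.example.com/login")   # placeholder URL

br.select_form(nr=0)                      # assumes the login form is the first form on the page
br["username"] = "my-username"            # assumed field names - check the page's form controls
br["password"] = "my-password"
response = br.submit()                    # submits the form; omit this line to only fill it in
print(response.geturl())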
For something way more complicated you might have to use python bindings for Selenium (install instructions) to control an external browser; or use spynner that embeds a web browser. However these 2 are far more difficult to set up.
Sure, just use selenium webdriver
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://your-website.com')
search_box = browser.find_element_by_css_selector('input[id=search]')
search_box.send_keys('my search term')
browser.find_element_by_css_selector('input[type=submit]').click()
That would get you through the "visit page, enter search term, click on search" stage of your problem. Read through the API for the rest.
Mechanize has problems at the moment because so much of a web page is generated via JavaScript, and if that isn't rendered you can't do much with the page.
It helps if you understand CSS selectors; otherwise you can find elements by id, XPath, or other locators...
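Continuing with the browser object from the snippet above, a few of the other locator strategies look like this; the id, name, XPath, and link-text values are placeholders to swap for your page's actual attributes:

# Placeholders - use your page's actual attributes
element = browser.find_element_by_id("search")                  # by element id
element = browser.find_element_by_name("q")                     # by name attribute
element = browser.find_element_by_xpath("//input[@name='q']")   # by XPath
element = browser.find_element_by_link_text("Next page")        # by link text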