I am using the Python webbrowser module to try to open an HTML file. I wrote a short script that fetches a page's source code so I can store a web page in case I ever need to view it without Wi-Fi, for instance a news article or something else.
The code itself is fairly short so far, so here it is:
import requests as req
from bs4 import BeautifulSoup as bs
import webbrowser
import re

webcheck = re.compile(r'^(https?:\/\/)?(www\.)?([a-z0-9]+\.[a-z]+)([\/a-zA-Z0-9#\-_]+\/?)*$')

#Valid URL Check
while True:
    url = input('URL (MUST HAVE HTTP://): ')
    check = webcheck.search(url)
    if check is not None:
        # Drop unmatched optional groups, which come back as None
        groups = [g for g in check.groups() if g is not None]
        # Iterate over a copy so removing items doesn't skip elements
        for group in list(groups):
            if group in ('http://', 'https://'):
                groups.remove(group)
            elif group.count('/') > 0:
                groups.append(group.replace('/', '--'))
                groups.remove(group)
        filename = ''.join(groups) + '.html'
        break

#Getting Website Data
reply = req.get(url)
soup = bs(reply.text, 'html.parser')

#Writing Website
with open(filename, 'w') as file:
    file.write(reply.text)

#Open Website
webbrowser.open(filename)
webbrowser.open('https://www.youtube.com')
I added webbrowser.open('https://www.youtube.com') so that I'd know the module was working, which it was, as it did open up YouTube.
However, webbrowser.open(filename) doesn't do anything, yet it returns True if I assign the result to a variable and print it.
The HTML file itself has a period in the name, but I don't think that should matter, as I have made a file without one in the name and it won't open either.
Does webbrowser need special permissions to work?
I'm not sure what to do, as I've removed characters from the filename and even shown that the module is working by opening YouTube.
What can I do to fix this?
From the webbrowser documentation:
Note that on some platforms, trying to open a filename using this function, may work and start the operating system’s associated program. However, this is neither supported nor portable.
So it seems that webbrowser can't do what you want. Why did you expect that it would?
Adding file:// plus the full path name does the trick, for anyone wondering.
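For example, a minimal sketch of that fix, reusing the filename variable from the question's script:

import pathlib
import webbrowser

# Convert the saved file's absolute path into a file:// URL so that
# webbrowser treats it as a URL rather than a bare filename.
file_url = pathlib.Path(filename).resolve().as_uri()
webbrowser.open(file_url)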
So basically I'm making a script that can download a bunch of maps from TrackmaniaExchange based on a search result. However, to download the map files I need the actual download link, which the search result doesn't give.
I already know how to download maps. The link is https://trackmania.exchange/maps/download/(map id). However, the hrefs in the search results are /maps/(map id)/(map name).
What I was thinking of doing is using Selenium to go to the site, grab the href for each map, edit the link with re.sub so that it'll point to /maps/download/(map id)/, and remove the end of the link with re.sub so there's no map name at the end of it. I don't know how to go about it, though. This is what I have so far in my script:
import requests
import os.path
import os
import sys
import selenium.webdriver as webdriver
from selenium.webdriver.firefox.options import Options
import time
import re

def Search():
    link = "https://trackmania.exchange/mapsearch2?limit=100"  # Trackmania Exchange link, will scrape all 100 results
    options = Options()  # This is for selenium
    options.binary_location = "C:/Program Files/Mozilla Firefox/firefox.exe"
    driver = webdriver.Firefox(options=options)
    name = input("Track Name (if nothing, hit enter)")  # Prompts the user to input stuff
    author = input("Track Author (if nothing, hit enter)")
    tags = input("Tags (separate with %2C if there's multiple, if nothing, hit enter)")
    path = input("Map download directory (do not leave blank, use forward slashes)")
    print("WARNING: Download wget for this script to work.")
    if path == "":
        print("Please put a path next time you start this")
        time.sleep(3)
        sys.exit()
    else:  # And so begins the if/else hellhole to find out what needs to be added to the link
        if tags == "":
            if name == "":
                if author == "":
                    print("Chief, you can't just enter nothing. Put something in here next time")
                    time.sleep(3)
                    sys.exit()
                else:
                    link = link + "&author=" + author
            else:
                link = link + "&trackname=" + name
                if author != "":
                    link = link + "&author=" + author
        else:
            link = link + "&tags=" + tags
            if name != "":
                link = link + "&trackname=" + name
                if author != "":
                    link = link + "&author=" + author
            else:
                if author != "":
                    link = link + "&author=" + author
    print("Checking link...")
    # Make sure there are no spaces in the link: tags are separated by %2C,
    # but track names are separated by +
    link = re.sub(r"\s", "+", link)
    print("Attempting to download...")
    driver.get(link)
    # Grab the result links once the search page has loaded
    sitelinks = driver.find_elements_by_xpath(
        "//tr[@class='WindowTableCell2v2 with-hover has-image']/td[@class='cell-ellipsis']//a")
    with open("list.txt", "w", encoding="utf-8") as f:
        for result in sitelinks:
            href = result.get_attribute("href")
            f.write(href + "\n")
    driver.close()
    # My failed attempt at removing the end of the link:
    # h = re.findall("\d")
    # re.sub("/maps/", "https://trackmania.exchange/maps/download", f)
    # re.sub("")  # unfinished part cause I was stumped
    os.system("wget --directory-prefix=" + path + " -i list.txt")

Search()
Their API is listed on the site, and after looking over the rules for the site, this is allowed. I also haven't really tested the script after making the if/else hellhole, but I can work on that later. All I need help with is removing the map name after the map ID. If you need a proper example, one of the hrefs on the front page for me is /maps/91677/cloudy-day. It'll be different for every link, so I don't really know what I should do.
If I know the URL format will be /maps/id/some-text and the ID will only include numbers, then I would simply grab the ID from the link using the below regex, and then use an f-string to build the URL.
map_id = re.search(r"\d+", url).group(0)
get_map_url = f"https://trackmania.exchange/maps/download/{map_id}"
Play around on regex101 with different URLs you may come across.
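For instance, applying that to the example href from the question (a quick sketch; the url value here is just the sample path):

import re

url = "/maps/91677/cloudy-day"  # sample href from the question
map_id = re.search(r"\d+", url).group(0)
get_map_url = f"https://trackmania.exchange/maps/download/{map_id}"
print(get_map_url)  # https://trackmania.exchange/maps/download/91677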
Let's say I have an xls or csv file (on some cloud) with a list of websites that contain a URL to some content on my website. I'd like to write a script that goes to each given website, checks if the link is still there, and whether it has a 'follow' attribute. Which tools and libraries would be optimal for this? I'm thinking about using Selenium.
For manually selecting websites to check, try:

import urllib.request

url = urllib.request.urlopen(input("Website to check? (Format: http(s)://www.(WebPageDomain).(WebPageUrlEnder)/(OPTIONAL:Sub-page))\n>> "))
if input("Your website name?\n>> ") in url.read().decode():
    pass  # do thing

This may work, or it may not; I had no time to check. If you run into issues with the read() method, have a look at the documentation.
Yes, you can use Selenium to automate this in Python.
Alternatively, you can read the csv/xls file and store the values as a dataframe using pandas. You can then iterate over the websites and record whether each one is working or not.
# sample code for storing csv/xls in a dataframe
import pandas as pd

filepath = 'data.csv'
df = pd.read_csv(filepath)  # or: pd.read_excel(filepath, index_col=0)
print(df)
# sample code for checking website exists
import requests

url = 'http://www.example.com'
request = requests.get(url)
if request.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')
and finally store the result in the form of csv/xls.
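Putting those pieces together, a rough sketch of the full check from the original question, including the 'follow' attribute, might look like this (the 'website' column name and the file names are assumptions; BeautifulSoup treats rel as a multi-valued attribute, so it comes back as a list):

import pandas as pd
import requests
from bs4 import BeautifulSoup

MY_SITE = 'http://www.example.com'  # placeholder: the URL you expect to find

df = pd.read_csv('data.csv')  # assumes a column named 'website'
results = []
for site in df['website']:
    found, follow = False, False
    try:
        reply = requests.get(site, timeout=10)
        soup = BeautifulSoup(reply.text, 'html.parser')
        for a in soup.find_all('a', href=True):
            if MY_SITE in a['href']:
                found = True
                # a link counts as "follow" unless rel contains "nofollow"
                follow = 'nofollow' not in (a.get('rel') or [])
                break
    except requests.RequestException:
        pass
    results.append({'website': site, 'link_found': found, 'follow': follow})

pd.DataFrame(results).to_csv('results.csv', index=False)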
Using the webbrowser module, I want to open a specific page on last.fm.
It picks a line from a text file then prints it. I want it to add that line at the end of:
webbrowser.open('http://www.last.fm/music/')
So for example, if random.choice picks 'example artist', I want 'example artist' to be added at the end of the URL correctly.
Any help is appreciated.
Use the urlparse.urljoin function to build up the full destination URL:
import urlparse
import webbrowser
artist_name = 'virt'
url = urlparse.urljoin('http://www.last.fm/music/', artist_name)
# Will open http://www.last.fm/music/virt in your browser.
webbrowser.open(url)
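Note that urlparse became urllib.parse in Python 3, so on Python 3 the same approach would look like this (quote() is worth adding in case an artist name contains spaces):

from urllib.parse import urljoin, quote
import webbrowser

artist_name = 'example artist'
# quote() percent-encodes the space so the URL stays valid
url = urljoin('http://www.last.fm/music/', quote(artist_name))
webbrowser.open(url)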
I am writing a function that downloads and stores today's list of pre-release domains from the .txt file at http://www.namejet.com/pages/downloads.aspx. I am trying to achieve this using json.
import json
import requests

def hello():
    r = requests.get('http://www.namejet.com/pages/downloads.aspx')
    # Replace with your website URL
    with open("a.txt", "w") as f:
        # Replace with your file name
        for item in r.json or []:
            try:
                f.write(item['name']['name'] + "\n")
            except KeyError:
                pass

hello()
I need to download the file which consist of pre-release domains using python. How can I do that? Is the above code right way to do it?
I don't think mechanize is much use for javascript; use selenium. Here's an example:
In [1]: from selenium import webdriver
In [2]: browser=webdriver.Chrome() # Select browser that you want to automate
In [3]: browser.get('http://www.namejet.com/pages/downloads.aspx')
In [4]: element=browser.find_element_by_xpath(
            '//a[@id="ctl00_ContentPlaceHolder1_hlPreRelease1"]')
In [5]: element.click()
Now you can find prerelease_10-08-2012.txt in your download folder and you can open it in a usual way.
I see a few problems with your approach:
The page doesn't return any json; so even if you were to access the page successfully, r.json will be empty:
>>> import requests
>>> r = requests.get('http://www.namejet.com/pages/downloads.aspx')
>>> r.json
The file that you are after is hidden behind a postback link, which you cannot "execute" using requests, as it will not understand javascript.
In light of the above, the better approach is to use mechanize or alternatives to emulate a browser. You could also ask the company to provide you with a direct link.
Let me start by saying that I know there are a few topics discussing problems similar to mine, but the suggested solutions do not seem to work for me for some reason.
Also, I am new to downloading files from the internet using scripts. Up until now I have mostly used python as a Matlab replacement (using numpy/scipy).
My goal:
I want to download a lot of .csv files from an internet database (http://dna.korea.ac.kr/vhot/) automatically using Python, because it is too cumbersome to download the 1000+ csv files I require by hand. The database can only be accessed through a UI, where you have to select several options from drop-down menus to finally end up with links to .csv files after some steps.
I have figured out that the url you get after filling out the drop down menus and pressing 'search' contains all the parameters of the drop-down menu. This means I can just change those instead of using the drop down menu, which helps a lot.
An example url from this website is (lets call it url1):
url1 = http://dna.korea.ac.kr/vhot/search.php?species=Human&selector=drop&mirname=&mirname_drop=hbv-miR-B2RC&pita=on&set=and&miranda_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&gene=
On this page I can select 5 csv-files, one example directs me to the following url:
url2 = http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=and&gene_filter=&method=pita&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&targetscan=&miranda=&rnahybrid=µt=&pita=on
However, this doesn't contain the csv file directly but appears to be a 'redirect' (a new term for me that I found by googling, so correct me if I am wrong).
One strange thing: I appear to have to load url1 in my browser before I can access url2 (I do not know if it has to be the same day or hour; url2 didn't work for me today and it did yesterday. Only after accessing url1 did it work again...). If I do not access url1 before url2, I get "no results" instead of my csv file in my browser. Does anyone know what is going on here?
However, my main problem is that I cannot save the csv files from python.
I have tried using the packages urllib, urllib2 and requests, but I cannot get it to work.
From what I understand, the Requests package should take care of redirects, but I haven't been able to make it work.
The solutions from the following web pages do not appear to work for me (or I am messing up):
stackoverflow.com/questions/7603044/how-to-download-a-file-returned-indirectly-from-html-form-submission-pyt
stackoverflow.com/questions/9419162/python-download-returned-zip-file-from-url
techniqal.com/blog/2008/07/31/python-file-read-write-with-urllib2/
Some of the things I have tried include:
import urllib2
import csv
import sys
url = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=µt=&pita='
#1
u = urllib2.urlopen(url)
localFile = open('file.csv', 'w')
localFile.write(u.read())
localFile.close()
#2
req = urllib2.Request(url)
res = urllib2.urlopen(req)
finalurl = res.geturl()
pass
# finalurl = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=µt=&pita='
#3
import requests
r = requests.get(url)
r.content
pass
#r.content = "<script>location.replace('download_send.php?name=qgN9Th&type=targetscan');</script>"
#4
import requests
r = requests.get(url,
allow_redirects=True,
data={'download_open': 'Download', 'format_open': '.csv'})
print r.content
# r.content = "
#5
import urllib
test1 = urllib.urlretrieve(url, "test.csv")
test2 = urllib.urlopen(url)
pass
For #2, #3 and #4 the outputs are displayed after the code.
For #1 and #5 I just get a .csv file containing </script>.
Option #3 just gives me a new redirect I think, can this help me?
Can anybody help me with my problem?
The page does not send a HTTP Redirect, instead the redirect is done via JavaScript.
urllib and requests do not process javascript, so they cannot follow to the download url.
You have to extract the final download url by yourself, and then open it, using any of the methods.
You could extract the URL using the re module with a regex like r'location.replace\((.*?)\)'
Based on the response from ch3ka, I think I got it to work. From the source code I get the JavaScript redirect, and from this redirect I can get the data.
import re
import requests

# Find source code
redirect = requests.get(url).text
# Search for the JavaScript redirect (find it in the source code)
# --> based on answer ch3ka
m = re.search(r"location.replace\(\'(.*?)\'\)", redirect).group(1)
# The redirect is relative to the /vhot/ directory, so build the full
# url from it, then use that url to fetch the data
new_url = 'http://dna.korea.ac.kr/vhot/' + m
data = requests.get(new_url).content