How to create page links? - python

I know the current user's location. It can be one of the following URLs:
(1) http://myapp.appspot.com/something/something-else/
(2) http://myapp.appspot.com/something/something-else
(3) http://myapp.appspot.com/something/something-else/page1
(4) http://myapp.appspot.com/something/something-else/page3
(actually, addresses 1, 2 and 3 are for the same page1)
On these pages I need to display a link to page2:
http://myapp.appspot.com/something/something-else/page2
The question is how to generate such a link?
I've tried to use relative links (/page2 and page2), but they don't work properly. I am also not sure how to create an absolute link with self.request.path; that doesn't work properly either.

/page2 will never work; the leading / makes it relative to the website root rather than the current directory.
page2 should work for everything except #2; without a trailing slash, something-else is interpreted as a file rather than the current directory.
One solution would be to link to /something/something-else/page2 so your link doesn't change based on the user's address.

import something  # refers to your .py file with the template handler
...
application = webapp.WSGIApplication([
    ('/something/something-else/', something.SomeThingElseHandler),
    ('/something/something-else', something.SomeThingElseHandler),
    # The pattern below is matched when the URL ends with a slash:
    ('/something/something-else/([^/]+)/', something.PageHandler),
    # If you want the URL to end without the slash, remove it from the regex:
    ('/something/something-else/([^/]+)', something.PageHandler),
], debug=config.DEBUG)
util.run_wsgi_app(application)
In something.py, your PageHandler class has to parse the captured key or id manually to render the correct content.
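For illustration, here is a minimal PageHandler sketch (the handler body is an assumption, not code from the question); the group captured by ([^/]+) arrives as an argument to get, and an absolute link sidesteps the relative-path issues discussed above:
from google.appengine.ext import webapp

class PageHandler(webapp.RequestHandler):
    def get(self, page_name):
        # page_name is whatever ([^/]+) captured, e.g. "page1" or "page3"
        # An absolute link works no matter how the current URL ends:
        self.response.out.write('<a href="/something/something-else/page2">page 2</a>')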

Related

Create a page using Pywikibot

I am trying to create a page on https://dev.wikidebates.org/wiki/Wikidébats:Accueil; it is similar to Wikipedia, so Pywikibot should work the same way. I would like to make a page using Pywikibot. I checked the scripts that ship with Pywikibot (https://www.mediawiki.org/wiki/Manual:Pywikibot/Scripts); the script pagefromfile.py is responsible for this. However, I don't see in the code where I should put the link to the new wiki page.
Also, the function inter in the class Pagefromfile returns a page. How can I check that the page was made?
The code I am trying now is the following. Everything works OK except the last line (the page is not created):
site = pywikibot.Site('dev', 'wikidebates')  # the site we want to run our bot on
page = pywikibot.Page(site, "Faut-il légaliser le cannabis ?")
text = page.get()  # get the wikitext of the page
wb = open("pages1.txt", "w", encoding="utf-8")
wb.write(str(text))  # write the text to a file, in order to upload it to the wiki afterwards
wb.close()
main_page('-file:pages1.txt')  # use the main function from the script pagefromfile.py - I renamed it
You haven't shown any further information; there are probably further messages from the pagefromfile.py script. If you download the text from a wiki, you either have to add begin and end markers to the text or you have to use the -textonly option. I also propose using the -title option.
Refer to the docs: https://doc.wikimedia.org/pywikibot/stable/scripts/general.html#module-scripts.pagefromfile
import pywikibot
from scripts.pagefromfile import main  # or main_page in your modified case

site = pywikibot.Site('wikidebates:dev')  # you may use the site name for the constructor function
page = pywikibot.Page(site, 'Faut-il légaliser le cannabis ?')
text = page.text
with open("pages1.txt", "w", encoding="utf-8") as wp:  # also closes the file
    wp.write(text)
main('-file:pages1.txt', '-textonly', '-title:"The new title"')
To specify the target site, add a -site option to the main function if it is not your default site. For simulation purposes you can use the -simulate option:
main('-file:pages1.txt', '-textonly', '-title:"The new title"', '-simulate', '-site:wikidebates:dev')
Note: all arguments given to the main function must be separate strings, delimited by commas. You cannot pass everything as one long string, unless you split it like this:
main(*'-file:pages1.txt -textonly -title:The_new_title'.split())
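To answer the second part of the question (how to check that the page was made), a minimal sketch using Page.exists(); the title is assumed to match the -title option passed above:
# Hypothetical check: re-fetch the page object and test for existence
new_page = pywikibot.Page(site, 'The new title')
print(new_page.exists())  # True once the page has been created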

How do you correctly parse web links to avoid a 403 error when using Wget?

I just started learning python yesterday and have VERY minimal coding skill. I am trying to write a python script that will process a folder of PDFs. Each PDF contains at least 1, and maybe as many as 15 or more, web links to supplemental documents. I think I'm off to a good start, but I'm having consistent "HTTP Error 403: Forbidden" errors when trying to use the wget function. I believe I'm just not parsing the web links correctly. I think the main issue is coming in because the web links are mostly "s3.amazonaws.com" links that are SUPER long.
For reference:
Link copied directly from PDF (works to download): https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG
Link as it appears after trying to parse it in my code (doesn't work, gives "unknown url type" when trying to download): https%3A//s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG%3FAWSAccessKeyId%3DAKIAIPCTK7BDMEW7SP4Q%26Expires%3D1909634500%26Signature%3DaQlQXVR8UuYLtkzjvcKJ5tiVrZQ%253D%26response-content-disposition%3Dattachment%253B%2520filename%252A%253Dutf-8%2527%2527DFA%252520train%252520pass.PNG
Additionally, feel free to weigh in if I'm doing this in a stupid way. Each PDF starts with a string of 6 digits, and once I download supplemental documents I want to auto-save and name them as XXXXXX_attachY.*, where X is the identifying string of digits and Y just increases for each attachment. I haven't gotten my code to work enough to test that, but I'm fairly certain I don't have it correct either.
Help!
#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse
## Accessing and Creating Six Digit File Code
pdf_dir = "/users/USERNAME/desktop/worky"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]
    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = 1
    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x, safe='://')
            print(parsed_url)
            wget.download(parsed_url, '/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            ##os.rename(r'/users/USERNAME/desktop/worky/(filename).*', r'/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            attachment_counter += 1
    for x in url_list["pdf"]:
        print(parsed_url + "\n")
I prefer to use requests (https://requests.readthedocs.io/en/master/) when trying to grab text or files online. I tried your URL quickly with wget and got the same error (it might be linked to the User-Agent HTTP header used by wget).
wget and HTTP header issues: download image from url using python urllib but receiving HTTP Error 403: Forbidden
HTTP headers: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
The good thing with requests is that it lets you modify HTTP headers the way you want (https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers).
import requests

r = requests.get("https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG")
with open("myfile.png", "wb") as file:
    file.write(r.content)
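To illustrate the custom-headers point, a minimal sketch that sends a browser-like User-Agent (the header value is just an example, not something the question requires):
import requests

url = "https://example.com/file.png"  # stand-in for the S3 link above
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
r = requests.get(url, headers=headers)  # servers that reject the default agent often accept this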
I'm not sure I understand what you're trying to do, but maybe you want to use formatted strings to build your URLs (https://docs.python.org/3/library/stdtypes.html?highlight=format#str.format) ?
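For example, a quick sketch of str.format with values mirroring the question's variables (names assumed from the code above):
newname = "123456"  # six-digit prefix taken from the PDF's filename
attachment_counter = 1
out_name = "{}_attach{}".format(newname, attachment_counter)  # "123456_attach1"; append the real extension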
Maybe checking string indexes is fine in your case (if x[0:4] == "http":), but I think you should look at Python's re package and use regular expressions to catch the elements you want in a document (https://docs.python.org/3/library/re.html).
import re

regex = re.compile(r"^http://")
if re.match(regex, mydocument):
    <do something>
The reason for this behavior is inside the wget library: internally it encodes the URL with urllib.parse.quote() (https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote).
Basically it replaces characters with their appropriate %xx escape sequences. Your URL is already escaped, but the library does not know that: when it parses the %20 it sees % as a character that needs to be replaced, so the result is %2520 and a different URL, hence the 403 error.
You could decode that URL first and then pass it, but then you would have another problem with this library, because your URL has the parameter filename*= while the library expects filename=.
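To see the double-encoding concretely, a quick sketch with urllib.parse.quote (the URL is an invented example):
from urllib.parse import quote

url = "https://example.com/file%20name.png"  # already escaped once
print(quote(url, safe='://'))
# -> https://example.com/file%2520name.png (the % itself got escaped again)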
I would recommend doing something like this:
import requests

# get the file
req = requests.get(parsed_url)
# parse your URL to get the GET parameters
get_parameters = [x for x in parsed_url.split('?')[1].split('&')]
filename = ''
# find the GET parameter that carries the name
for get_parameter in get_parameters:
    if "filename*=" in get_parameter:
        # split it to get the name
        filename = get_parameter.split('filename*=')[1]
# save the file
with open(<path> + filename, 'wb') as file:
    file.write(req.content)
I would also recommend removing the utf-8'' prefix from that filename, because I don't think it is actually part of the filename. You could also use regular expressions to get the filename, but this was easier for me.
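For completeness, a sketch of the regex alternative (the URL is an invented example in the same shape as the question's S3 links):
import re

url = "https://example.com/x.png?a=1&response-content-disposition=attachment;%20filename*=utf-8''my%2520file.png"
# Capture the filename*= value, optionally skipping the utf-8'' prefix
m = re.search(r"filename\*=(?:utf-8'')?([^&]+)", url)
if m:
    print(m.group(1))  # my%2520file.png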

Can't figure out HTML file path formatting

I am trying to send a link to a location on our servers via email, but I can't get the HTML portion of the link to work.
This is my file path-- P:\2. Corps\PNL_Daily_Report
What I've tried--
newMail.HTMLBody ='Link Anchor'
newMail.HTMLBody ='Link Anchor'
newMail.HTMLBody ='Link Anchor'
Obviously I am not an HTML guy, so I bet this answer will be quick for someone who is. Any ideas on how to format the HTML link?
According to this article on the MSDN Blog, your href should be:
<a href="file:///P:/2. Corps/PNL_Daily_Report">Link Anchor</a>
Windows paths are a significant break from Unix-style paths, and the internet mostly uses the Unix convention.
file:// indicates the scheme (other popular schemes are http:// and ftp://)
P: refers to the drive, or the mounting point in Unix file systems
/ means root, the top of the file system hierarchy
The backslash (\) is Windows' path separator, whereas Unix's is the forward slash (/)
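As a sketch of putting this together (assuming newMail is an Outlook MailItem created via win32com, which the question does not show):
import win32com.client

# Assumption: the question's newMail comes from something like this
outlook = win32com.client.Dispatch("Outlook.Application")
newMail = outlook.CreateItem(0)  # 0 = olMailItem
link = 'file:///P:/2. Corps/PNL_Daily_Report'
newMail.HTMLBody = '<a href="{}">Link Anchor</a>'.format(link)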

Failing to upload a file using Selenium

I'm trying to upload a file to a form with Selenium, using this code in Eclipse:
search = driver.find_element_by_xpath("//input[@type='file']")
search.send_keys("D:/test.txt")
search.send_keys(Keys.RETURN)
This error keeps showing up:
selenium.common.exceptions.WebDriverException: Message: File not found: D:/test.txt
The file is in place, where do you think the problem is?
I guess the reason is the slash used in the path; I think it requires a backslash instead.
What if you try search.send_keys("D:\\test.txt")? I'm not sure whether the double backslash is required, so you can try a single one as well.
EDIT
I tried my own code on a simple form with just the input[type=file] and a Submit button:
search = browser.find_element_by_xpath("//input[@type='file']")
search.send_keys("F:\\test.txt")
submit = browser.find_element_by_css_selector("input[type=submit]")
submit.click()
And somehow it worked just fine; I just had to escape the backslash and use the Submit button instead of pressing ENTER.
So make sure your file is actually there at the path you posted; code like this (at least on Windows) works just fine. Also, make sure you have permission to read the file.
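A quick way to verify both points before calling send_keys (path taken from the question):
import os

path = "D:\\test.txt"
# Confirm the file exists and is readable by the current user
print(os.path.isfile(path), os.access(path, os.R_OK))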

In Python, urllib.urlretrieve downloads a file which says "Go away"

I'm trying to download the (APK) files from links such as https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041. When you enter the link in your browser, it brings up a dialog to open or save the file.
I would like to save the file using a Python script. I've tried the following:
import urllib

download_link = 'https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041'
download_file = '/tmp/apkmirror_test/youtube.apk'

if __name__ == "__main__":
    urllib.urlretrieve(url=download_link, filename=download_file)
but the resulting youtube.apk contains only the words "Go away".
Since I am able to download the file by pasting the link in my browser's address bar, there must be some difference between that and urllib.urlretrieve that makes this not work. Can someone explain this difference and how to eliminate it?
You should not programmatically access that download page, as it is disallowed by the site's robots.txt:
https://www.apkmirror.com/robots.txt
That being said, your request headers are different. urllib by default sets the User-Agent to something like "Python-urllib/x.y". That is the most likely cause of detection.
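As a sketch, you can check the robots.txt rules programmatically with the standard library before fetching (module path shown for Python 3; on Python 2, where urllib.urlretrieve lives, the module is plain robotparser):
from urllib import robotparser

download_link = 'https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041'
rp = robotparser.RobotFileParser()
rp.set_url('https://www.apkmirror.com/robots.txt')
rp.read()
# False here means the path is disallowed for generic crawlers
print(rp.can_fetch('*', download_link))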
