I am testing the following code and found that the output after the "print" is inconsistent with the text file, even though I have set the encoding to "UTF-8". Is this a bug? How can I fix it?
import requests
url = "http://www.aastocks.com/tc/stocks/analysis/company-fundamental/financial-ratios?symbol=0001&period=4"
r = requests.get(url)
print r.content
f = open("test.txt","w")
f.write(r.content)
f.close()
There is an internal limit on how many lines the run console buffer can hold; it is limited to about 15K lines.
To increase this limit, you'll have to edit the idea.properties file and add a key idea.cycle.buffer.size with a suitably larger value.
See this bug report, where the solution is detailed.
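For reference, the change is a single line in idea.properties (the value is assumed to be the buffer size in KB; 4096 here is just an example, not a recommended setting):

```properties
# idea.properties — enlarge the run-console cycle buffer
idea.cycle.buffer.size=4096
```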
While I don't know the exact version of Python you are using, I would guess it's not 3.x, given the print statements.
The problem is not with your print statement per se, but displaying such long lines (this one is 175,765 characters, about 176 KB) can frequently be a significant issue. Python (particularly on Windows) starts to misbehave when asked to print lines that are hundreds of kilobytes long. Instead of trying to display the entire string in one statement, break it up into multiple parts and then display them. You will see that there is no difference between what r.content shows on screen and what it stores through f.write.
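Breaking the string into parts could look like this (a sketch; the `"x" * 10000` literal stands in for `r.content`):

```python
# Print a long string in fixed-size chunks rather than one huge statement
def print_in_chunks(content, size=4096):
    for i in range(0, len(content), size):
        print(content[i:i + size])

print_in_chunks("x" * 10000)
```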
Just to confirm, you can run this after your code:
fh = open("test.txt","r")
print fh.read()
fh.close()
You will notice that there is no difference between this and whatever is shown by the previous print statement.
I have tried this on Python 3.4.x on Linux, but the behaviour you mention is not observed with that combination of Python and platform.
EDIT 1
This is what I have tried:
import requests
url = "http://www.aastocks.com/tc/stocks/analysis/company-fundamental/financial-ratios?symbol=0001&period=4"
r = requests.get(url)
print(str(r.content))
f = open("test.txt","w")
f.write(str(r.content))
f.close()
f = open("test.txt","r")
print(f.read())
f.close()
and here is the output:
http://pastebin.com/R0j0mYe5
EDIT 2
I didn't notice the header was getting cut. I tried it in 2.x and saw the behaviour; that does seem to be a problem. Apparently some issues pop up when scanning through the HTML and decoding it for print:
This is what I saw:
print r.content[0:500]
print "*****"
print r.content[0:1000]
Gives an output like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#"> <head id="Head1"><meta name="keywords" content="公司資料, 主要財經比率, 流動比率, 股東權益回報率, 總資產回報率, 邊際利潤率, 派息比率" /><meta name="description" content="公司資料, 財務比率, 變現能力, 償債能力
*****
</script> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <link rel="stylesheet" type="teonal.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#"> <head id="Head1"><meta name="keywords" content="公司資料, 主要財經比率, 流動比率, 股東權益回報率, 總資產回報率, 邊際利潤率, 派息比率" /><meta name="description" content="公司資料, 財務比率, 變現能力, 償債能力, 投資回報, 盈利能力, 營運能力, 投資收益, 綜合全年, 綜合中期" /><meta http-equiv="X-UA-Compatible" content="IE=Edge" /> <script type="text/javascript">
As we can see, when printing only the first 500 characters the output is as expected, but errors appear when we try for more.
Something strange is going on when it tries to decode the entire document.
However, in python 3.4.x I see this:
print(con[0:500]) #con = r.content
print(con[0:1000])
output:
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#"> <head id="Head1"><meta name="keywords" content="\xe5\x85\xac\xe5\x8f\xb8\xe8\xb3\x87\xe6\x96\x99, \xe4\xb8\xbb\xe8\xa6\x81\xe8\xb2\xa1\xe7\xb6\x93\xe6\xaf\x94\xe7\x8e\x87, \xe6\xb5\x81\xe5\x8b\x95\xe6\xaf\x94\xe7\x8e\x87, \xe8\x82\xa1\xe6\x9d\xb1\xe6\xac\x8a\xe7\x9b\x8a\xe5\x9b\x9e\xe5\xa0\xb1\xe7\x8e\x87, \xe7\xb8\xbd\xe8\xb3\x87\xe7\x94\xa2\xe5\x9b\x9e\xe5\xa0\xb1\xe7\x8e\x87, \xe9\x82\x8a\xe9\x9a\x9b\xe5\x88\xa9\xe6\xbd\xa4\xe7\x8e\x87, \xe6\xb4\xbe\xe6\x81\xaf\xe6\xaf\x94\xe7\x8e\x87" /><meta name="description" content="\xe5\x85\xac\xe5\x8f\xb8\xe8\xb3\x87\xe6\x96\x99, \xe8\xb2\xa1\xe5\x8b\x99\xe6\xaf\x94\xe7\x8e\x87, \xe8\xae\x8a\xe7\x8f\xbe\xe8\x83\xbd\xe5\x8a\x9b, \xe5\x84\x9f\xe5\x82\xb5\xe8\x83\xbd\xe5\x8a\x9b'
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#"> <head id="Head1"><meta name="keywords" content="\xe5\x85\xac\xe5\x8f\xb8\xe8\xb3\x87\xe6\x96\x99, \xe4\xb8\xbb\xe8\xa6\x81\xe8\xb2\xa1\xe7\xb6\x93\xe6\xaf\x94\xe7\x8e\x87, \xe6\xb5\x81\xe5\x8b\x95\xe6\xaf\x94\xe7\x8e\x87, \xe8\x82\xa1\xe6\x9d\xb1\xe6\xac\x8a\xe7\x9b\x8a\xe5\x9b\x9e\xe5\xa0\xb1\xe7\x8e\x87, \xe7\xb8\xbd\xe8\xb3\x87\xe7\x94\xa2\xe5\x9b\x9e\xe5\xa0\xb1\xe7\x8e\x87, \xe9\x82\x8a\xe9\x9a\x9b\xe5\x88\xa9\xe6\xbd\xa4\xe7\x8e\x87, \xe6\xb4\xbe\xe6\x81\xaf\xe6\xaf\x94\xe7\x8e\x87" /><meta name="description" content="\xe5\x85\xac\xe5\x8f\xb8\xe8\xb3\x87\xe6\x96\x99, \xe8\xb2\xa1\xe5\x8b\x99\xe6\xaf\x94\xe7\x8e\x87, \xe8\xae\x8a\xe7\x8f\xbe\xe8\x83\xbd\xe5\x8a\x9b, \xe5\x84\x9f\xe5\x82\xb5\xe8\x83\xbd\xe5\x8a\x9b, \xe6\x8a\x95\xe8\xb3\x87\xe5\x9b\x9e\xe5\xa0\xb1, \xe7\x9b\x88\xe5\x88\xa9\xe8\x83\xbd\xe5\x8a\x9b, \xe7\x87\x9f\xe9\x81\x8b\xe8\x83\xbd\xe5\x8a\x9b, \xe6\x8a\x95\xe8\xb3\x87\xe6\x94\xb6\xe7\x9b\x8a, \xe7\xb6\x9c\xe5\x90\x88\xe5\x85\xa8\xe5\xb9\xb4, \xe7\xb6\x9c\xe5\x90\x88\xe4\xb8\xad\xe6\x9c\x9f" /><meta http-equiv="X-UA-Compatible" content="IE=Edge" /> <script type="text/javascript">\rvar _gaq = _gaq || [];\r_gaq.push([\'_setAccount\', \'UA-20790503-3\']);\r_gaq.push([\'_setDomainName\', \'www.aastocks.com\']);\r_gaq.push([\'_trackPageview\']);\r_gaq.push([\'_trackPageLoadTime\']);\rfunction OA_show(name) {\r} \r</script> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <link rel="stylesheet" type="te'
But the output in 3.x becomes similar to 2.x if I decode the UTF-8:
print(con[0:500].decode('utf-8'))
print(con[0:1000].decode('utf-8'))
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#"> <head id="Head1"><meta name="keywords" content="公司資料, 主要財經比率, 流動比率, 股東權益回報率, 總資產回報率, 邊際利潤率, 派息比率" /><meta name="description" content="公司資料, 財務比率, 變現能力, 償債能力
</script> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <link rel="stylesheet" type="te
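The 2.x/3.x difference above can be reproduced without the network (the literal below stands in for a slice of r.content):

```python
# In 3.x, printing bytes shows their repr with \x.. escapes;
# decoding first shows the actual characters.
data = "公司資料".encode("utf-8")  # stands in for a slice of r.content
print(data)                   # b'\xe5\x85\xac...'
print(data.decode("utf-8"))   # 公司資料
```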
Related
Good morning, people!
I'm trying to make a POST request to Shibboleth with the following code:
import requests
link = 'https://[IP]/Shibboleth.sso/SAML2/POST'
requisicao = requests.Session().post(url=link,verify=False)
print(requisicao)
print(requisicao.text)
However, an HTML page is returned with the following information:
<Response [500]>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<link rel="stylesheet" type="text/css" href="/shibboleth-sp/main.css" />
<title>opensaml::BindingException</title>
</head>
<body>
<h1>opensaml::BindingException</h1>
<p>The system encountered an error at Mon May 23 11:48:22 2022</p>
<p>To report this problem, please contact the site administrator at
root@localhost.
</p>
<p>Please include the following message in any email:</p>
<p class="error">opensaml::BindingException at (https://shibboleth.unifeob.edu.br/Shibboleth.sso/SAML2/POST)</p>
<p>Request missing SAMLRequest or SAMLResponse form parameter.</p>
</body>
</html>
Does anyone know which parameter is missing? Could it be SSL?
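The error itself names the missing piece: the endpoint expects a SAMLResponse (or SAMLRequest) form field, which a browser normally submits via an auto-posting form generated by the identity provider. A minimal sketch of what the request body should carry (the endpoint and the base64 payload here are placeholders, not real values):

```python
import requests

# Placeholder payload; a real SAMLResponse comes base64-encoded from the IdP
req = requests.Request(
    "POST",
    "https://example.invalid/Shibboleth.sso/SAML2/POST",  # placeholder endpoint
    data={"SAMLResponse": "PHNhbWxwOlJlc3BvbnNlPg=="},
).prepare()
print(req.body)  # the form-encoded field the server complained was missing
```

So this is not an SSL problem: the 500 means the POST arrived but carried no SAML form parameter.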
I have this Requests module and wish to rename it to Requisitions. Can someone please confirm how to change the name of a module?
In the views/index.html file the title comes from =T("title"), but I am unsure where this title comes from.
Here's the code of the html file:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="{{=T.accepted_language or "en"}}">{{# class="no-js" needed for modernizr }}
<head>{{theme=response.s3.theme}}
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
{{# Always force latest IE rendering engine (even in intranet) & Chrome Frame }}
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<title>{{try:}}{{=title}}{{except:}}{{=response.title or settings.get_system_name_short()}}{{pass}}</title>
{{if response.refresh:}}{{=XML(response.refresh)}}{{pass}}
{{# http://dev.w3.org/html5/markup/meta.name.html }}
<meta name="application-name" content="{{=appname}}" />
{{# Set your site up with Google: http://google.com/webmasters }}
{{# <meta name="google-site-verification" content="" /> }}
{{a="""<!-- Mobile Viewport Fix
j.mp/mobileviewport & davidbcalhoun.com/2010/viewport-metatag
device-width: Occupy full width of the screen in its current orientation
initial-scale = 1.0 retains dimensions instead of zooming out if page height > device height
maximum-scale = 1.0 retains dimensions instead of zooming in if page width < device width
--> """}}
I'm trying to extract a PDF from this site, which uses the native Google Chrome PDF viewer to open the PDF in the first place; its content type is application/pdf. The issue is that the site URLs that I get aren't actually links to the PDF but rather to a .zul page where the JS will load the PDF, or fetch it.
Here's my download code below:
from selenium import webdriver

def download_pdf(url, idx, save_dir):
    options = webdriver.ChromeOptions()
    profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
               "download.default_directory": save_dir}
    options.add_experimental_option("prefs", profile)
    driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", chrome_options=options)
    driver.get(url)
The problem I'm encountering with the above code is that I get the following readout from driver.page_source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="Expires" content="-1" />
<title>Document Viewer</title>
<link rel="stylesheet" type="text/css" href="/eSMARTContracts/zkau/web/9776a7f0/zul/css/zk.wcs;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1"/>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zk.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zul.lang.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<!-- ZK 6.0.2 EE 2012072410 -->
</head>
<body>
<div id="j4AP_" class="z-temp"></div>
<script class="z-runonce" type="text/javascript">zk.pi=1;zkmx(
[0,'j4AP_',{dt:'z_2m1',cu:'/eSMARTContracts;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',uu:'/eSMARTContracts/zkau;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',ru:'/service/dpsweb/ViewDPSWeb.zul'},[
['zul.wnd.Window','j4AP0',{$$onSize:false,$$onMaximize:false,$$onOpen:false,$$onMinimize:false,$$onZIndex:false,$onClose:true,$$onMove:false,width:'100%',height:'100%',prolog:'\
'},[]]]]);
</script>
<noscript>
<div class="noscript"><p>Sorry, JavaScript must be enabled.<br/>Change your browser options, then try again.</p></div>
</noscript>
</body>
</html>
EDIT: Included the link
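Since the .zul page only bootstraps the viewer, one common workaround (a sketch, not tested against this site) is to find the actual PDF request in the browser's network tab and replay it with requests, reusing Selenium's session cookies (driver.get_cookies() returns dicts with "name"/"value" keys):

```python
import requests

def session_from_driver_cookies(cookies):
    """Build a requests.Session carrying the browser's cookies
    (pass in the result of driver.get_cookies())."""
    s = requests.Session()
    for c in cookies:
        s.cookies.set(c["name"], c["value"])
    return s

# usage sketch (driver and real_pdf_url are hypothetical here):
# s = session_from_driver_cookies(driver.get_cookies())
# pdf_bytes = s.get(real_pdf_url).content
```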
I was wondering if there is a way to print the entire HTML page source. I am trying to verify some text in a PDF/XHTML file pop-up and cannot get to it. My hope is to get the entire page source and verify the text is in there. However, .page_source seems to give me only the URL and description, and I am looking to get each line of code.
A possible approach is to make Selenium find the root page tag (html) and get all of its source code:
driver = webdriver.Firefox()
driver.get("http://stackoverflow.com/")
print(driver.find_element_by_tag_name("html").get_attribute('outerHTML'))
Documentation
Output example:
<html webdriver="true"><head>
<title>Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="stackoverflow.com">
<meta property="og:type" content="website">
<meta name="description" content="Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers">
<meta property="og:image" itemprop="image primaryImageOfPage" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon#2.png?v=73d79a89bded">
<meta name="twitter:title" property="og:title" itemprop="title name" content="Stack Overflow">
<meta name="twitter:description" property="og:description" itemprop="description" content="Q&A for professional and enthusiast programmers">
<meta property="og:url" content="http://stackoverflow.com/">
......
I have been fighting this for a whole night...
I'm trying to use Python markdown to generate HTML files from .md files and embed them into some other HTML files.
Here is the problematic snippet:
import codecs
import markdown
from os.path import splitext
from flask import render_template, url_for

md = markdown.Markdown(encoding="utf-8")
input_file = codecs.open(f, mode="r", encoding="utf-8")  # f is the name of the markdown file
text = input_file.read()
input_file.close()
html = md.convert(text)  # html generated from the markdown file
context = {
    'css_url': url_for('static', filename='markdown.css'),
    'contents': html
}
rendered_file = render_template('blog.html', **context)
output = open(splitext(f)[0] + '.html', 'w')  # write the html to disk
output.write(rendered_file)
output.close()
Here is my "blog.html" template , which is really simple:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>blog</title>
<link rel="stylesheet" href="{{ css_url }}" type="text/css" />
</head>
<body>
{{ contents }}
</body>
</html>
And yet this is what I get:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>blog</title>
<link rel="stylesheet" href="/static/markdown.css" type="text/css" />
</head>
<body>
&lt;li&gt;People who love what they are doing&lt;/li&gt;
&lt;li&gt;&lt;/li&gt;
&lt;/ol&gt;
</body>
</html>
So I'm getting those weird "&gt;", "&lt;" entities, even though I've already specified the encoding to be 'utf-8'. What could possibly have gone wrong?
Thank you!
&lt; and &gt; have nothing to do with encoding. They are HTML entities that represent your input, which Jinja escaped. You should mark the value as safe so that Jinja will not automatically escape it:
{{ contents|safe }}
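The escaping Jinja applies is the same transformation as the standard library's html.escape, which is why the generated tags come back as entities:

```python
import html

content = "<li>People who love what they are doing</li>"
print(html.escape(content))  # &lt;li&gt;People who love what they are doing&lt;/li&gt;
```

With |safe (or by wrapping the string in markupsafe's Markup), this step is skipped and the markdown-generated HTML is emitted verbatim.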