Python 3.4 - reading data from a webpage - python

I'm currently trying to learn how to read from a webpage, and have tried the following:
>>> import urllib.request
>>> page = urllib.request.urlopen("http://docs.python-requests.org/en/latest/", data=None)
>>> contents = page.read()
>>> lines = contents.split('\n')
This gives the following error:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
lines = contents.split('\n')
TypeError: Type str doesn't support the buffer API
Now I assumed that reading from a URL would be pretty similar to reading from a text file, and that contents would be of type str. Is this not the case?
When I evaluate contents at the prompt, I can see that it is just the HTML document, so why doesn't .split('\n') work? How can I make it work?
Please note that I'm splitting at the newline characters so I can print the webpage line by line.
Following the same train of thought, I then tried contents.readlines() which gave this error:
Traceback (most recent call last):
File "<pyshell#8>", line 1, in <module>
contents.readlines()
AttributeError: 'bytes' object has no attribute 'readlines'
Is the webpage stored in some object called 'bytes'?
Can someone explain to me what is happening here? And how to read the webpage properly?

You need to wrap the response in an io.TextIOWrapper() object and specify the encoding (utf-8 is a common default; change it to the page's actual encoding if it differs):
import urllib.request
import io
u = urllib.request.urlopen("http://docs.python-requests.org/en/latest/", data = None)
f = io.TextIOWrapper(u,encoding='utf-8')
text = f.read()
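Since the goal was to print the page line by line, here is a minimal sketch of iterating over the wrapper instead of calling read(). To keep it runnable without a network connection, an io.BytesIO object with made-up HTML stands in for the response returned by urlopen(); that substitution is an assumption for illustration only.

```python
import io

# Stand-in for the object returned by urllib.request.urlopen(): like an HTTP
# response, BytesIO is a binary stream that yields bytes. The HTML is made up.
fake_response = io.BytesIO("<html>\n<body>caf\u00e9</body>\n</html>\n".encode("utf-8"))

# TextIOWrapper decodes the bytes on the fly, so iterating yields str lines.
text_stream = io.TextIOWrapper(fake_response, encoding="utf-8")
lines = [line.rstrip("\n") for line in text_stream]

for line in lines:
    print(line)
```

With a real response, passing the urlopen() result in place of fake_response should behave the same way, provided the page really is UTF-8.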

Decode the bytes object to produce a string:
lines = contents.decode(encoding="UTF-8").split("\n")

The read() method returns an object of type bytes. You need to decode it to a string before you can use a string method like split. Assuming it is UTF-8 you can use:
s = contents.decode('utf-8')
lines = s.split('\n')
As a general solution you should check the character encoding the server provides in the response to your request and use that.
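A rough sketch of that general approach: read the charset from the Content-Type header and fall back to UTF-8 when none is given. The helper name decode_body and the sample header and body are hypothetical. In Python 3 the headers attribute of a urlopen() response is an email.message.Message subclass, so a plain Message is used here as an offline stand-in.

```python
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode raw bytes using the charset named in the Content-Type header."""
    # get_content_charset() returns None when the header names no charset.
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


# Hypothetical headers and body standing in for page.headers and page.read().
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
raw = "caf\u00e9\nmenu".encode("iso-8859-1")

lines = decode_body(raw, headers).split("\n")
print(lines)
```

Using errors="replace" is a design choice for the sketch: a mislabelled page yields replacement characters rather than an exception.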

Related

How to turn a string into a binary object in python

I'm using this library to download and decode MMS PDUs:
https://github.com/pmarti/python-messaging
The sample code almost works, except that this method:
mms = MMSMessage.from_data(response)
Is throwing an exception:
TypeError: unsupported operand type(s) for &: 'str' and 'int'
Which seems to obviously be some sort of binary formatting problem.
In the sample code, the HTTP response is passed directly into the from_data method, however in my case it comes through with HTTP headers on it so I'm splitting the response by double CRLF and then passing in just the PDU data:
data = buf.getvalue()
split = data.split("\r\n\r\n");
mms = MMSMessage.from_data(split[1].strip())
This throws an error BUT if I first write the exact same data to a file then use the from_file method it works:
data = buf.getvalue()
split = data.split("\r\n\r\n");
f = open('dump','w+')
f.write(split[1])
f.close()
path = 'dump'
mms = MMSMessage.from_file(path)
I looked in the from_file method, and all it does is load the contents and then pass it into the same method as the from_data method, so the first way should Just Work™.
What I did notice is that the file is opened in binary format, and the content is loaded like this:
data = array.array('B')
with open(filename, 'rb') as f:
    data.fromfile(f, num_bytes)
return self.decode_data(data)
So it seems obvious that somehow what I'm passing into the first function is actually a "string representation of binary data" and what's being read from the file is "actual binary data".
I tried using bytearray like this to "binaryfy" the string:
mms = MMSMessage.from_data(bytearray(split[1].strip(), "utf8"))
but that throws the error:
Traceback (most recent call last):
File "decodepdu.py", line 41, in <module>
mms = MMSMessage.from_data(bytearray(split[1].strip(), "utf8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8c in position 0: ordinal not in range(128)
which seems weird because it's using an 'ascii' codec but I specified utf8 encoding.
Anyway at this point I'm in over my head because I'm not really all that familiar with python, so for now I'm just writing the content to a temporary file but I would really rather not.
Any help would be most appreciated!
Okay thanks to Paul M. in the comments, this works:
data = buf.getvalue()
split = data.split("\r\n\r\n");
pdu = array.array('B');
pdu.fromstring(split[1]);
mms = MMSMessage.from_data(pdu);
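For comparison, here is a sketch of the same idea in Python 3, where the response is a bytes object from the start. The raw response below is made up, and the library call is left out because its behaviour under Python 3 is not confirmed here. The sketch only shows separating the headers from the body and building the array of unsigned bytes.

```python
import array

# Hypothetical raw HTTP response: headers, a blank line, then a binary body.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: application/vnd.wap.mms-message\r\n"
       b"\r\n"
       b"\x8c\x84\x98\r\n\r\n\x01")

# partition() splits only at the first blank line, so a CRLF CRLF sequence
# that happens to occur inside the binary body is left untouched.
head, separator, body = raw.partition(b"\r\n\r\n")

# An array of unsigned bytes, the same type that from_file builds.
pdu = array.array("B", body)
print(list(pdu))
```

Note that the body is not passed through strip(): on binary data that would remove any leading or trailing bytes that happen to be whitespace characters.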

unable to decode this string using python

I have this text.ucs file which I am trying to decode using python.
file = open('text.ucs', 'r')
content = file.read()
print content
My result is
\xf\xe\x002\22
I tried doing decoding with utf-16, utf-8
content.decode('utf-16')
and getting error
Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 32-33: illegal encoding
Please let me know if I am missing anything or my approach is wrong
Edit: Screenshot has been asked
The string is encoded as UTF16-BE (Big Endian), this works:
content.decode("utf-16-be")
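A small Python 3 sketch of the same fix, assuming the file is read in binary mode so that the raw bytes reach decode() unchanged. The sample file and its content are created here purely for illustration.

```python
import os
import tempfile

# Create a sample file; its content is made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "text.ucs")
with open(path, "wb") as f:
    f.write("hello".encode("utf-16-be"))

# Read the raw bytes, then decode with the byte order stated explicitly.
with open(path, "rb") as f:
    raw = f.read()

text = raw.decode("utf-16-be")
print(text)
```

Plain "utf-16" relies on a byte order mark at the start of the data to pick the byte order, so naming "utf-16-be" explicitly is the safer choice when the file has no such mark.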
It looks like you are using Python 2.x. As far as I know, the encoding parameter of the built-in open() was only added in Python 3.x. I'm no expert in Python 2.x, but you can use io.open instead. For example, try the following, substituting the file's actual encoding:
import io
file = io.open('text.ucs', 'r', encoding='utf-8')
content = file.read()
print content
Note that io.open requires importing the io module first.
You can specify which encoding to use with the encoding argument:
with open('text.ucs', 'r', encoding='utf-16') as f:
    text = f.read()
Your string needs to be decoded as UTF-8. Here is what I did to decode it (Python 3):
f = open('text.ucs', 'r', encoding='utf-8')
print(f.read())

Getting error while creating multiple file in python

I'm creating two files with a Python script: the first is a JSON file and the second is an HTML file. My code below creates the JSON file, but I get an error while creating the HTML file. Could someone help me resolve the issue? I'm new to Python, so I would really appreciate any suggestions.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import json
JsonResponse = '[{"status": "active", "due_date": null, "group": "later", "task_id": 73286}]'
def create(JsonResponse):
    print JsonResponse
    print 'creating new file'
    try:
        jsonFile = 'testFile.json'
        file = open(jsonFile, 'w')
        file.write(JsonResponse)
        file.close()
        with open('testFile.json') as json_data:
            infoFromJson = json.load(json_data)
            print infoFromJson
        htmlReportFile = 'Report.html'
        htmlfile = open(htmlReportFile, 'w')
        htmlfile.write(infoFromJson)
        htmlfile.close()
    except:
        print 'error occured'
        sys.exit(0)
create(JsonResponse)
I used the online Python editor below to execute my code:
https://www.tutorialspoint.com/execute_python_online.php
infoFromJson = json.load(json_data)
Here, json.load() expects json_data to contain valid JSON. But the json_data you provided is not valid JSON; it's a plain string (Hello World!). So you get the error:
ValueError: No JSON object could be decoded
Update:
In your code you should get the error:
TypeError: expected a character buffer object
That's because the content you write to the file needs to be a string, but what you have instead is a list of dictionaries.
There are two ways to solve this. Replace the line:
htmlfile.write(infoFromJson)
with either this, which converts infoFromJson to a string:
htmlfile.write(str(infoFromJson))
or use the dump utility of the json module, which writes to the open output file:
json.dump(infoFromJson, htmlfile)
If you delete the try...except statement, you will see the error below:
Traceback (most recent call last):
File "/Volumes/Ithink/wechatProjects/django_wx_joyme/app/test.py", line 26, in <module>
create(JsonResponse)
File "/Volumes/Ithink/wechatProjects/django_wx_joyme/app/test.py", line 22, in create
htmlfile.write(infoFromJson)
TypeError: expected a string or other character buffer object
The error occurs because htmlfile.write needs a string, but infoFromJson is a list.
So changing htmlfile.write(infoFromJson) to htmlfile.write(str(infoFromJson)) avoids the error.
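As an illustration of writing a string rather than a list, here is a minimal Python 3 sketch that turns the parsed data back into text with json.dumps and wraps it in a very small HTML page. The helper name build_report and the page layout are assumptions, not part of the original script.

```python
import html
import json
import os
import tempfile

json_response = '[{"status": "active", "due_date": null, "group": "later", "task_id": 73286}]'


def build_report(json_text):
    """Parse JSON text and return a small HTML page that displays it."""
    info = json.loads(json_text)         # a list of dictionaries
    pretty = json.dumps(info, indent=2)  # back to a str, nicely indented
    # Escape the text so quotes and angle brackets are safe inside HTML.
    return "<html><body><pre>{}</pre></body></html>".format(html.escape(pretty))


report = build_report(json_response)

# Written to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    report_path = os.path.join(tmp, "Report.html")
    with open(report_path, "w", encoding="utf-8") as f:
        f.write(report)  # write() receives a str, so no TypeError
    with open(report_path, encoding="utf-8") as f:
        saved = f.read()
```

Compared with str(infoFromJson), json.dumps keeps the output valid JSON (null stays null instead of becoming None).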

Parsing data from JSON with python

I'm just starting out with Python and here is what I'm trying to do. I want to access Bing's API to get the picture of the day's url. I can import the json file fine but then I can't parse the data to extract the picture's url.
Here is my python script:
import urllib, json
url = "http://www.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1&mkt=en-US"
response = urllib.urlopen(url)
data = json.loads(response.read())
print data
print data["images"][3]["url"]
I get this error:
Traceback (most recent call last):
File "/Users/Robin/PycharmProjects/predictit/api.py", line 9, in <module>
print data["images"][3]["url"]
IndexError: list index out of range
FYI, here is what the JSON file looks like:
http://jsonviewer.stack.hu/#http://www.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1&mkt=en-US
print data["images"][0]["url"]
There is only one object in the "images" array.
Since there is only one element in the images list, you should have data['images'][0]['url'].
You can also see that under the "Viewer" tab in the "json viewer" that you linked to.
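To make the indexing concrete, here is a short Python 3 sketch that uses a made-up payload with the same shape as the response described above (an "images" list holding a single object). The URL value is hypothetical.

```python
import json

# Made-up payload with the same shape as the response described above.
payload = '{"images": [{"url": "/example/picture.jpg"}]}'
data = json.loads(payload)

images = data.get("images", [])
# Index 0 is the first (and here the only) element; check before indexing.
first_url = images[0]["url"] if images else None
print(len(images), first_url)
```

Checking the length first avoids the IndexError if the list ever comes back empty.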

Write Data To File in Python Gives Error Related to UNICODE

I am basically parsing data from XML using SAX Parser in Python.
I am able to parse and print. However I wanted to put the data to a text file.
sample:
def startElement(self, name, attrs):
    file.write("startElement'"+ name + " ' ")
While trying to write some text to a test.txt with above sample code, I get below error:
TypeError: descriptor 'write' requires a 'file' object but received a 'unicode'
Any help is greatly appreciated.
You are not using an open file. You are using the file type. The file.write method is therefore unbound; it expects an open file object to be bound to:
>>> file
<type 'file'>
>>> file.write
<method 'write' of 'file' objects>
>>> file.write(u'Hello')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: descriptor 'write' requires a 'file' object but received a 'unicode'
If you have an already opened file object, then use that; perhaps you have an attribute named file on self:
self.file.write("startElement'" + name + " ' ")
but take into account that because name is a Unicode value you probably want to encode the information to bytes:
self.file.write("startElement'" + name.encode('utf8') + " ' ")
You could also use io.open() function to create a file object that'll accept Unicode values and encode these to a given encoding for you when writing:
file_object = io.open(filename, 'w', encoding='utf8')
but then you need to be explicit about always writing Unicode values and not mix byte strings (type str) and Unicode strings (type unicode).
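Putting these pieces together, here is a minimal Python 3 sketch of a SAX handler that keeps an open text stream on self and writes to it. The class name ElementLogger and the sample XML are assumptions, and io.StringIO stands in for a file opened with an explicit encoding so the sketch runs without touching the disk.

```python
import io
import xml.sax


class ElementLogger(xml.sax.ContentHandler):
    """Writes one line per start tag to an already-open text stream."""

    def __init__(self, out):
        super().__init__()
        self.out = out  # an open file object, not the file type itself

    def startElement(self, name, attrs):
        self.out.write("startElement '" + name + "'\n")


# StringIO stands in for io.open(filename, 'w', encoding='utf8').
out = io.StringIO()
xml.sax.parseString(b"<root><item/></root>", ElementLogger(out))
result = out.getvalue()
print(result)
```

In Python 3 all strings are Unicode, so the mixing problem described above does not arise as long as the file is opened in text mode.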
