Python Bag of Words NameError: name 'unicode' is not defined - python

I have been following this site, https://radimrehurek.com/data_science_python/, to apply bag of words on a list of tweets.
import csv
from textblob import TextBlob
import pandas

messages = pandas.read_csv('C:/Users/Suki/Project/Project12/newData1.csv', sep='\t', quoting=csv.QUOTE_NONE,
                           names=["label", "message"])

def split_into_tokens(message):
    message = unicode(message, encoding="utf8")  # convert bytes into proper unicode
    return TextBlob(message).words

messages.message.head().apply(split_into_tokens)
print(messages)
However, I keep getting this error. I've checked and I'm following the code on the site, but the error keeps arising.
Error
Traceback (most recent call last):
File "C:/Users/Suki/Project/Project12/projectBagofWords.py", line 34, in <module>
messages.message.head().apply(split_into_tokens)
File "C:\Program Files\Python36\lib\site-packages\pandas\core\series.py", line 2510, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src\inference.pyx", line 1521, in pandas._libs.lib.map_infer
File "C:/Users/Suki/Project/Project12/projectBagofWords.py", line 31, in split_into_tokens
message = unicode(message, encoding="utf8") # convert bytes into proper unicode
NameError: name 'unicode' is not defined
Can someone offer advice on how I could rectify this?
Thanks

unicode is a Python 2 built-in. If you are not sure which version will run this code, you can simply add this at the beginning of your code so that it replaces the old unicode with the new str:
import sys

if sys.version_info[0] >= 3:
    unicode = str

unicode is a Python 2.x built-in. If you are running Python 3.x, all strings are already Unicode, so that call is not needed.
https://docs.python.org/3/howto/unicode.html
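For example, on Python 3 the question's tokenizer can simply drop the conversion. A minimal sketch, assuming TextBlob is installed and using a hypothetical sample tweet:
from textblob import TextBlob

def split_into_tokens(message):
    # on Python 3, message is already a str (Unicode), so no unicode() call is needed
    return TextBlob(message).words

print(split_into_tokens("this is a sample tweet"))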

Related

I passed the file object but still getting parse error with slate3k

I am trying to read the text of a PDF file using slate3k. The code seems fine to me, but I am getting a parse error.
I am using Python 3.7.
import slate3k

with open("/home/am-it/Desktop/PythonLearning/pdf_practice/invoice-1.pdf", "rb") as file:
    doc = slate3k.PDF(file)
    print(doc)
The output of the above code should be the text from the PDF, but the actual output is:
"Traceback (most recent call last):
File "/home/am-it/Desktop/PythonLearning/pdf_practice/invoslate.py", line 4, in <module>
doc = slate3k.PDF(file)
File "/home/administrator/.local/lib/python3.7/site-packages/slate3k/classes.py", line 59, in __init__
self.doc = PDFDocument()
TypeError: __init__() missing 1 required positional argument: 'parser'"
I have passed the proper file object but I am still getting the error, so please enlighten me.
Mine works well with single quotes and with print not indented:
import slate3k as slt

with open('pdfPythonTest.pdf', 'rb') as f:
    extracted_text = slt.PDF(f)
print(extracted_text)
Hope this helps!
Dude, in this part of the code:
with open("/home/am-it/Desktop/PythonLearning/pdf_practice/invoice-1.pdf", "rb") as file:
you have to write just the name of the file plus the extension, not the full path.
So, try this:
with open("invoice-1.pdf", "rb") as file:

How to append keywords to IPTC data in a JPG image?

I'm trying to add keywords to the IPTC data in a JPG file and failing miserably. I'm able to read in the keywords using the iptcinfo3 library and, seemingly, append the keyword to the list of current keywords but I'm failing when trying to write those keywords back to the JPG file, if not sooner. The error message is a bit unclear to me and may actually reference the appending of the new keyword (although a print statement seems to indicate it took).
I've tried three different metadata libraries (there doesn't seem to be one standard) and this is the furthest I've gotten with any of them (failing to even install one and not being able to get a second one to run). This seems so basic but I can't figure it out and haven't been able to adapt the few other code examples I've seen online to work, including iptcinfo3's example code fragment.
The current Error message is:
| => pipenv run python editMetadata.py
WARNING: problems with charset recognition (b'\x1b')
[b'Gus']
[b'Gus', b'frog']
Traceback (most recent call last):
File "editMetadata.py", line 22, in <module>
info.save_as('Gus2.jpg')
File "/Users/Scott/.local/share/virtualenvs/editPhotoMetadata-tx0JAOmI/lib/python3.7/site-packages/iptcinfo3.py", line 635, in save_as
jpeg_parts = jpeg_collect_file_parts(fh)
File "/Users/Scott/.local/share/virtualenvs/editPhotoMetadata-tx0JAOmI/lib/python3.7/site-packages/iptcinfo3.py", line 324, in jpeg_collect_file_parts
adobeParts = collect_adobe_parts(partdata)
File "/Users/Scott/.local/share/virtualenvs/editPhotoMetadata-tx0JAOmI/lib/python3.7/site-packages/iptcinfo3.py", line 433, in collect_adobe_parts
out = [''.join(out)]
TypeError: sequence item 0: expected str instance, bytes found
Code:
from iptcinfo3 import IPTCInfo
import os
# Create new info object
info = IPTCInfo('Gus.jpg')
# Print list of keywords
print(info['keywords'])
# Append the keyword I want to add
info['keywords'].append(b'frog')
# Print to test keyword has been added
print(info['keywords'])
# Save new info to file
info.save()
info.save_as('Gus2.jpg')
Instead of appending, use assignment with "=":
from iptcinfo3 import IPTCInfo
info = IPTCInfo('Gus.jpg')
print(info['keywords'])
# add keyword
info['keywords'] = ['new keyword']
info.save()
info.save_as('Gus_2.jpg')
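If the goal is to keep the keywords already in the file and add a new one, here is a minimal sketch along the same lines (an assumption: the bytes keywords shown in the question are decoded to plain str before assigning):
from iptcinfo3 import IPTCInfo

info = IPTCInfo('Gus.jpg')
# decode the existing keywords (they are read back as bytes) and add the new one as str
keywords = [kw.decode('utf-8') if isinstance(kw, bytes) else kw for kw in info['keywords']]
info['keywords'] = keywords + ['frog']
info.save_as('Gus2.jpg')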
I have the same error. It seems to be an issue with the save depending on the file.
from iptcinfo3 import IPTCInfo
info = IPTCInfo('image.jpg', force=True)
info.save()
Which gives me the same error.
WARNING: problems with charset recognition (b'\x1b')
WARNING: problems with charset recognition (b'\x1b')
Traceback (most recent call last):
File "./searchimages.py", line 123, in <module>
main(sys.argv[1:])
File "./searchimages.py", line 119, in main
find_photos(str(sys.argv[1]))
File "./searchimages.py", line 46, in find_photos
write_keywords(image, current_keywords, new_keywords)
File "./searchimages.py", line 109, in write_keywords
info.save_as('out.jpg')
File "/usr/local/lib/python3.7/site-packages/iptcinfo3.py", line 635, in save_as
jpeg_parts = jpeg_collect_file_parts(fh)
File "/usr/local/lib/python3.7/site-packages/iptcinfo3.py", line 324, in jpeg_collect_file_parts
adobeParts = collect_adobe_parts(partdata)
File "/usr/local/lib/python3.7/site-packages/iptcinfo3.py", line 433, in collect_adobe_parts
out = [''.join(out)]
TypeError: sequence item 0: expected str instance, bytes found

Unicode Decode Error when trying to read csv file in python

I am new to Python and Stack Overflow.
I have a folder with csv files and I am trying to read the field names from each file and write them to a new csv file.
Thanks to Stack Overflow, I was able to write and edit my code until a unicode error came up.
I tried my best to solve this error and did some research.
I found out that files created on Mac or Linux are encoded in utf8 and files created on Windows are in cp949.
Thus, I have to open them as utf8.
My code first looked like this :
import csv
import glob

lst = []
files = glob.glob('C:/dataset/*.csv')
with open('test.csv', 'w', encoding='cp949', newline='') as testfile:
    csv_writer = csv.writer(testfile)
    for file in files:
        with open(file, 'r') as infile:
            file = file[file.rfind('\\') + 1:]
            reader = csv.reader(infile)
            headers = next(reader)
            headers = [str for str in headers if str]
            while len(headers) < 3:
                headers = next(reader)
                headers = [str for str in headers if str]
            lst = [file] + headers
            csv_writer.writerow(lst)
Then this error came out :
Traceback (most recent call last):
File "C:\Python35\2.py", line 12, in <module>
headers=next(reader)
UnicodeDecodeError: 'cp949' codec can't decode byte 0xec in position 6: illegal multibyte sequence
Here is how I tried to fix the unicode error:
import csv
import glob

lst = []
files = glob.glob('C:/dataset/*.csv')
with open('test.csv', 'w', encoding='cp949', newline='') as testfile:
    csv_writer = csv.writer(testfile)
    for file in files:
        try:
            with open(file, 'r') as infile:
                file = file[file.rfind('\\') + 1:]
                reader = csv.reader(infile)
                headers = next(reader)
                headers = [str for str in headers if str]
                while len(headers) < 3:
                    headers = next(reader)
                    headers = [str for str in headers if str]
                lst = [file] + headers
                csv_writer.writerow(lst)
        except:
            with open(file, 'r', encoding='utf8') as infile:
                file = file[file.rfind('\\') + 1:]
                reader = csv.reader(infile)
                headers = next(reader)
                headers = [str for str in headers if str]
                while len(headers) < 3:
                    headers = next(reader)
                    headers = [str for str in headers if str]
                lst = [file] + headers
                csv_writer.writerow(lst)
And this error came out :
Traceback (most recent call last):
File "C:\Python35\2.py", line 12, in <module>
headers=next(reader)
UnicodeDecodeError: 'cp949' codec can't decode byte 0xec in position 6: illegal multibyte sequence
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python35\2.py", line 20, in <module>
with open(file,'r',encoding='utf8') as infile:
FileNotFoundError: [Errno 2] No such file or directory: '2010_1_1.csv'
The file '2010_1_1.csv' definitely exists in my directory ('C:/dataset/*.csv').
When I try to open this file individually using open('C:/dataset/2010_1_1.csv','r',encoding='utf8'), it works, but there is a '\ufeff' next to the filename.
I am not sure, but my guess is that this file is opened in the try: block and not yet closed, so Python can't open it again in the except: block.
How can I edit my code to solve this Unicode problem?
import glob
from chardet.universaldetector import UniversalDetector

files = glob.glob('C:/example/*.csv')
for filename in files:
    print(filename.ljust(60)),
    detector.reset()
    for line in file(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print(detector.result)
Error :
Traceback (most recent call last):
File "<pyshell#20>", line 4, in <module>
for line in file(filename, 'rb'):
TypeError: 'str' object is not callable
I'm not very experienced with Python, so call me out if this is not possible, but you could simply try ignoring the encoding of the file when opening it. I'm a Java programmer, and in my experience encoding only needs to be specified when creating a new file, not when opening one.
It looks like your file is not written in cp949 if it won't decode properly. You'll have to figure out the correct encoding. A module like chardet can help.
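For example, here is a minimal sketch that uses chardet to guess each file's encoding before deciding how to open it (assumes pip install chardet; the directory is the one from the question):
import glob
import chardet

for filename in glob.glob('C:/dataset/*.csv'):
    with open(filename, 'rb') as f:   # read raw bytes, no decoding yet
        raw = f.read(100000)          # a sample is usually enough for detection
    guess = chardet.detect(raw)       # e.g. {'encoding': 'UTF-8-SIG', 'confidence': 0.99, ...}
    print(filename, guess['encoding'], guess['confidence'])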
On Windows, when reading a file open it with the encoding it was written in. If UTF-8, use utf-8-sig, which will automatically handle and remove the byte-order-mark (BOM) U+FEFF character if present. When writing, the best bet is to use utf-8-sig because it handles all possible Unicode characters and will add a BOM so Windows tools like Notepad and Excel will recognize UTF-8-encoded files. Without it, most Windows tools will assume the ANSI encoding, which varies per localized version of Windows.
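A minimal sketch of reading one of the question's files that way, assuming it really is UTF-8 with a BOM; utf-8-sig strips the U+FEFF character mentioned above:
import csv

with open('C:/dataset/2010_1_1.csv', 'r', encoding='utf-8-sig', newline='') as infile:
    reader = csv.reader(infile)
    headers = next(reader)
    print(headers)   # no stray '\ufeff' in the first field name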
This error comes out as a UnicodeDecodeError when a decode step is missed somewhere. That can happen either because the codec used does not support the file's actual encoding, or because the file itself was written badly in whatever format it has (e.g. json, xml, csv, ...).
One way to get past the problem is to ignore decode errors by passing the errors='ignore' argument to open():
with open('test.csv','w',encoding='cp949',newline='') as testfile:
#to
with open(r'test.csv','w',encoding='cp949',newline='',errors='ignore') as testfile:
#or
data = open(r'test.csv', errors='ignore').read()  # read the whole file into a string
If the error persists, try a different encoding, still with errors='ignore'.
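Note that the UnicodeDecodeError in the traceback is raised while reading the input files, so the same argument can also go on that open() call. A hedged sketch, using one of the question's input files:
import csv

with open('C:/dataset/2010_1_1.csv', 'r', encoding='utf8', errors='ignore', newline='') as infile:
    reader = csv.reader(infile)
    print(next(reader))   # undecodable bytes are silently dropped instead of raising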

convert json retrived from url

I am partially able to work with json saved as file:
#! /usr/bin/python3
import json
from pprint import pprint
json_file='a.json'
json_data=open(json_file)
data = json.load(json_data)
json_data.close()
print(data[10])
But I am trying to achieve the same with data fetched directly from the web, following the accepted answer here:
#! /usr/bin/python3
from urllib.request import urlopen
import json
from pprint import pprint
jsonget=urlopen("http://api.crossref.org/works?query.author=Rudra+Banerjee")
data = json.load(jsonget)
pprint(data)
which is giving me this error:
Traceback (most recent call last):
File "i.py", line 10, in <module>
data = json.load(jsonget)
File "/usr/lib64/python3.5/json/__init__.py", line 268, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/usr/lib64/python3.5/json/__init__.py", line 312, in loads
s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
What is going wrong here?
Changing the code as per Charlie's reply to:
jsonget=str(urlopen("http://api.crossref.org/works?query.author=Rudra+Banerjee"))
data = json.load(jsonget)
pprint(jsonget)
breaks at json.load:
Traceback (most recent call last):
File "i.py", line 9, in <module>
data = json.load(jsonget)
File "/usr/lib64/python3.5/json/__init__.py", line 265, in load
return loads(fp.read(),
AttributeError: 'str' object has no attribute 'read'
It's actually telling you the answer: you're getting back bytes, which in Python 3 are a different type from strings because strings are Unicode. In Python 2.7 it would work. You should be able to fix it by converting your bytes explicitly to a string with
jsonget = str(urlopen("http://api.crossref.org/works?query.author=Rudra+Banerjee"))
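Note that str() on the response object only gives its repr, which is why the edit quoted in the question then fails inside json.load. A minimal working sketch (an assumption based on the traceback, not the original answer's code) is to decode the response bytes and parse them with json.loads:
from urllib.request import urlopen
import json
from pprint import pprint

with urlopen("http://api.crossref.org/works?query.author=Rudra+Banerjee") as response:
    data = json.loads(response.read().decode('utf-8'))   # bytes -> str -> parsed JSON
pprint(data)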

Converting python webcrawler to 3.4 from 2.7

I am converting a working Python webcrawler from 2.7 to 3.4. I've made some modifications, but I still get errors when running it:
Traceback (most recent call last):
File "Z:\testCrawler.py", line 11, in <module>
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.request.urlopen(myurl).read(), re.I):
File "C:\Python34\lib\re.py", line 206, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
This is the code itself; please tell me if you see where the errors are.
#! C:\python34
import re
import urllib.request

textfile = open('depth_1.txt', 'wt')
print("Enter the URL you wish to crawl..")
print('Usage - "http://phocks.org/stumble/creepy/" <-- With the double quotes')
myurl = input("#> ")

for i in re.findall('''href=["'](.[^"']+)["']''', urllib.request.urlopen(myurl).read(), re.I):
    print(i)
    for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.request.urlopen(i).read(), re.I):
        print(ee)
        textfile.write(ee + '\n')
textfile.close()
Change
urllib.request.urlopen(myurl).read()
to, for example,
urllib.request.urlopen(myurl).read().decode('utf-8')
What happens here is that .read() returns bytes instead of str as it did in Python 2.7, so the result has to be decoded using some encoding.
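A minimal sketch of that fix applied to the question's outer loop (an assumption that the fetched pages are UTF-8 encoded):
import re
import urllib.request

myurl = input("#> ")
html = urllib.request.urlopen(myurl).read().decode('utf-8')   # bytes -> str
for link in re.findall('''href=["'](.[^"']+)["']''', html, re.I):
    print(link)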
