OS: CentOS 6.5
Python version: 2.7.5
I have a file with the following sample of information.
I would like to find every cent symbol and replace it with $0. in front of the number.
Alpha $1.00
Beta ¢55 <<<< note
Charlie $2.00
Delta ¢23 <<<< note
I want it to look like this:
Alpha $1.00
Beta $0.55 <<<< note
Charlie $2.00
Delta $0.23 <<<< note
This command works on the command line:
sed 's/¢/$0./g' *file name*
However, running it from Python does not work:
import subprocess
hello = subprocess.call('cat datafile ' + '| sed "s/¢/$0./g"',shell=True)
print hello
There seems to be an error whenever I try to paste the ¢ symbol.
Slightly closer, when I print the unicode for the cent sign in Python, it comes out below:
print(u"\u00A2")
Â¢
When I cat my datafile, it actually shows up as the ¢ sign, missing the Â. << not sure if this is any help
I think when I'm trying to sed with the Unicode, the added symbol before the ¢ is not allowing me to search and replace.
Error code when trying unicode:
hello = subprocess.call(u"cat datafile | sed 's/\uxA2/$0./g'",shell=True)
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 25-26: truncated \uXXXX escape
After changing \uxA2 to \u00A2, I get this:
sed: -e expression #1, char 7: unknown option to `s'
1
Any ideas/thoughts?
With both examples I get the error below:
[root@centOS user]# python test2.py
Traceback (most recent call last):
File "test2.py", line 3, in <module>
data = data.decode('utf-8') # decode immediately to Unicode
File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 6: invalid start byte
[root@centOS user]# python test1.py
Traceback (most recent call last):
File "test1.py", line 11, in <module>
hello_unicode = hello_utf8.decode('utf-8')
File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 6: invalid start byte
This is the cat of the file:
[root@centOS user]# cat datafile
alpha ¢79
this is the Nano of the datafile:
alpha �79
This is the Vim of the datafile:
[root@centOS user]# vim fbasdf
alpha ¢79
~
Thanks again for all your help guys
ANSWER!!
The SED output from Rob and Thomas works.
The file was saved with charset=iso-8859-1, so searching it for the UTF-8 encoded cent character never matched.
Identified file charset:
file -bi datafile
text/plain; charset=iso-8859-1
Used following code to change file:
iconv -f iso-8859-1 -t utf8 datafile > datafile1
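The same diagnosis can be reproduced in Python. The sketch below assumes Python 3 (the question uses 2.7) and a made-up sample line rather than the real datafile. It shows that a lone 0xA2 byte is valid ISO-8859-1 but not valid UTF-8, and that the UTF-8 bytes for the cent sign show up as Â¢ when they are misread as ISO-8859-1.

```python
# Hypothetical sample: "alpha ¢79" as it would be stored in an ISO-8859-1 file.
latin1_bytes = b"alpha \xa279"

# A lone 0xA2 byte is not valid UTF-8, which matches the tracebacks above.
try:
    latin1_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the charset reported by `file -bi` succeeds.
text = latin1_bytes.decode("iso-8859-1")

# Re-encoding as UTF-8 is what the iconv command does: the cent sign becomes C2 A2.
utf8_bytes = text.encode("utf-8")

# UTF-8 bytes misread as ISO-8859-1 produce the stray character before the cent sign.
mojibake = "\u00a2".encode("utf-8").decode("iso-8859-1")

print(utf8_ok, utf8_bytes, ascii(mojibake))
```

This is only an illustration of the byte-level behavior; the iconv command above remains the way the file was actually converted.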
Stealing Thomas's answer and expanding on it:
import subprocess
# Keep all strings in unicode as long as you can.
cmd_unicode = u"sed 's/\u00A2/$0./g' < datafile"
# only convert them to encoded byte strings when you send them out
# also note the use of .check_output(), NOT .call()
cmd_utf8 = cmd_unicode.encode('utf-8')
hello_utf8 = subprocess.check_output(cmd_utf8, shell=True)
# Decode any incoming byte string to unicode immediately on receipt
hello_unicode = hello_utf8.decode('utf-8')
# And you have your answer
print hello_unicode
The code above demonstrates the use of a "Unicode sandwich": bytes on the outside, Unicode on the inside. See http://nedbatchelder.com/text/unipain.html
For this simple example, you could have just as easily done everything in Python:
with open('datafile') as datafile:
    data = datafile.read() # Read in bytes
data = data.decode('utf-8') # decode immediately to Unicode
data = data.replace(u'\xa2', u'$0.') # Do all operations in Unicode
print data # Implicit encode during output
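For readers on Python 3, the same "Unicode sandwich" can be sketched as a small function that takes bytes in and hands bytes back. The name replace_cents and its encoding parameter are illustrative assumptions, not part of the original answer.

```python
def replace_cents(raw, encoding="utf-8"):
    """Decode bytes, replace the cent sign with '$0.', and re-encode."""
    text = raw.decode(encoding)           # bytes -> text at the input boundary
    text = text.replace("\u00a2", "$0.")  # all work happens on text
    return text.encode(encoding)          # text -> bytes at the output boundary

# A hypothetical line stored as ISO-8859-1, like the original datafile.
sample = "Beta \u00a255\n".encode("iso-8859-1")
print(replace_cents(sample, encoding="iso-8859-1"))
```

Passing the encoding explicitly means the function works on the original ISO-8859-1 file as well as on the iconv-converted UTF-8 copy.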
Also, change your string to a unicode string, and replace the cent sign with \u00A2.
Here's the fixed code:
import subprocess
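# Single quotes keep the shell from expanding $0; encode the command before handing it to the shell.
hello = subprocess.call(u"cat datafile | sed 's#\u00A2#$0.#g'".encode('utf-8'), shell=True)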
print hello
I want to print an ASCII text but when I run the script, it throws me an error:
$ python test.py
Traceback (most recent call last):
  File "C:\Users\wooxh\Desktop\Materialy\XRichPresence\test.py", line 1, in <module>
    print("""
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.2032.0_x64__qbz5n2kfra8p0\lib\encodings\cp1250.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-4: character maps to <undefined>
Here's the code
print("""
██╗ ██╗██████╗ ██████╗ ██████╗
╚██╗██╔╝██╔══██╗██╔══██╗██╔════╝
╚███╔╝ ██████╔╝██████╔╝██║
██╔██╗ ██╔══██╗██╔═══╝ ██║
██╔╝ ██╗██║ ██║██║ ╚██████╗
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═════╝
""")
Looks like Python is identifying your code page as 1250, which doesn't include the characters you're using. If chcp reports you're actually using code page 437 (common in cmd.exe) you can do:
import sys
sys.stdout.buffer.write("""
██╗ ██╗██████╗ ██████╗ ██████╗
╚██╗██╔╝██╔══██╗██╔══██╗██╔════╝
╚███╔╝ ██████╔╝██████╔╝██║
██╔██╗ ██╔══██╗██╔═══╝ ██║
██╔╝ ██╗██║ ██║██║ ╚██████╗
╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═════╝
""".encode('cp437'))
to explicitly encode to the correct code page and write it. Otherwise, I'd suggest enabling Python's forced UTF-8 runtime mode, which should allow your original code (with no call to encode) to work (possibly dropping or replacing characters not representable by the terminal). All you'd change is your run command:
> python -X utf8 test.py
or explicitly define PYTHONUTF8=1 in your environment to turn it on without a command line switch.
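To see ahead of time which code pages can represent the banner, a small check like the following may help. The helper name can_encode is an assumption for illustration, and the banner here is a short excerpt of the box-drawing characters written as escapes so the sketch does not depend on the source file's encoding.

```python
def can_encode(text, codec):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# A few of the block and box-drawing characters used in the banner.
banner = "\u2588\u2588\u2557 \u255a\u2550\u255d"

for codec in ("cp1250", "cp437", "utf-8"):
    print(codec, can_encode(banner, codec))
```

Code page 1250 has no box-drawing characters, which is why the original print call fails, while cp437 and UTF-8 can both represent them.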
What Character set is é from? In Windows notepad having this character in an ANSI text file will save fine. Insert something like 😍 and you'll get an error. é seems to work fine in ASCII terminal in Putty (Are CP437 and IBM437 the same?) where as 😍 does not.
I can see that 😍 is Unicode, not ASCII. But what is é? It doesn't give errors I get with Unicode in Notepad, but Python was throwing SyntaxError: Non-ASCII character '\xc3' in file on line , but no encoding declared; before I added a "magic comment" as suggested by Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP).
I added the "magic comment" and don't get that error, but os.path.isfile() is saying a filename with é doesn't exist. Ironic that the character é is in Marc-André Lemburg, the author of the PEP the error links to.
EDIT: If I print the path of the file, the accented e shows up as ├⌐ but I can copy and paste é into the command prompt.
EDIT2: See below
Private > cat scratch.py ### LOL cat scratch :3
# coding=utf-8
file_name = r"Filéname"
file_name = unicode(file_name)
Private > python scratch.py
Traceback (most recent call last):
File "scratch.py", line 3, in <module>
file_name = unicode(file_name)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Private >
EDIT3:
Private > PS1="Private > " ; echo code below ; cat scratch.py ; echo ======= ; echo output below ; python scratch.py
code below
# -*- coding: utf-8 -*-
file_name = r"Filéname"
file_name = unicode(file_name, encoding="utf-8")
# I have code here to determine a path depending on the hostname of the
# machine, the folder paths contain no Unicode characters, for my debug
# version of the script, I will hardcode the redacted hostname.
hostname = "One"
if hostname == "One":
folder = "C:/path/folder_one"
elif hostname == "Two":
folder = "C:/path/folder_two"
else:
folder = "C:/path/folder_three"
path = "%s/%s" % (folder, file_name)
path = unicode(path, encoding="utf-8")
print path
=======
output below
Traceback (most recent call last):
File "scratch.py", line 18, in <module>
path = unicode(path, encoding="utf-8")
TypeError: decoding Unicode is not supported
Private >
You need to tell unicode() what encoding the string is in; in this case it's UTF-8, not ASCII. The file header should be # -*- coding: utf-8 -*- (see Encoding Declarations in the Python reference):
# -*- coding: utf-8 -*-
file_name = r"Filéname"
file_name = unicode(file_name, encoding="utf-8")
Help on class unicode in module __builtin__:

class unicode(basestring)
 |  unicode(object='') -> unicode object
 |  unicode(string[, encoding[, errors]]) -> unicode object
 |
 |  Create a new Unicode object from the given encoded string.
 |  encoding defaults to the current default string encoding.
 |  errors can be 'strict', 'replace' or 'ignore' and defaults to 'strict'.
And as I mentioned in my previous comment you will save yourself a lot of headaches by switching to Python 3. Python 2 on a Windows filesystem with unicode characters can be a nightmare.
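The TypeError in EDIT3 comes from decoding a second time: once file_name is unicode, the "%s/%s" formatting already yields a unicode path, so there is nothing left to decode. A minimal sketch of a decode-once helper follows; the name to_text is an assumption, and the code is written for Python 3, where bytes and str play the roles of Python 2's str and unicode.

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding only when the value is still a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text: decoding again is what raised the TypeError

raw_name = b"Fil\xc3\xa9name"  # the UTF-8 bytes of the file name
name = to_text(raw_name)
path = "%s/%s" % ("C:/path/folder_one", name)
path = to_text(path)           # harmless no-op, path is already text
print(ascii(path))
```

The design choice is to decode at the point where bytes enter the program and to treat every later call as a no-op, rather than tracking which variables have already been converted.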
My Python 2.x script tries to download a web page that includes Chinese text. The page is encoded in UTF-8. urllib.urlopen(url) returns the content as type str, so I decode it as UTF-8, and that throws UnicodeEncodeError. I googled a lot of posts like this and this, but they don't work for me. Am I misunderstanding something?
My code is:
import urllib
import httplib
def get_html_content(url):
    response = urllib.urlopen(url)
    html = response.read()
    print type(html)
    return html
if __name__ == '__main__':
    url = 'http://weekly.manong.io/issues/58'
    html = get_html_content(url)
    print html.decode('utf-8')
Error message:
<type 'str'>
Traceback (most recent call last):
File "E:\src\infra.py", line 32, in <module>
print html.decode('utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 44: ordinal not in range(128)
[Finished in 1.6s]
The print statement has to convert a unicode argument to bytes, and here it falls back to ASCII. Encoding it manually avoids that implicit ASCII encoding:
import sys
...
if __name__ == '__main__':
    url = 'http://weekly.manong.io/issues/58'
    html = get_html_content(url)
    print html.decode('utf-8').encode(sys.stdout.encoding, 'ignore')
If the output still doesn't print correctly, replace sys.stdout.encoding with the encoding of your terminal.
UPDATE
Alternatively, you can use the PYTHONIOENCODING environment variable instead of encoding in the source code:
PYTHONIOENCODING=utf-8:ignore python program.py
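One caveat with encoding manually: in Python 2, sys.stdout.encoding is None when the output is piped, so passing it straight to encode() fails. The sketch below shows one way to guard against that. The helper name encode_for_stream and the UTF-8 fallback are assumptions, and the code is written so that it also runs on Python 3.

```python
import sys

def encode_for_stream(text, encoding=None, errors="replace"):
    """Encode text for an output stream, falling back to UTF-8 if the
    stream reports no encoding (for example when piped under Python 2)."""
    return text.encode(encoding or "utf-8", errors)

quoted = "\u201chi\u201d"

# Characters the target encoding lacks are replaced instead of raising.
print(encode_for_stream(quoted, "ascii"))

# Use whatever the current stdout reports, or UTF-8 if it reports nothing.
print(encode_for_stream(quoted, getattr(sys.stdout, "encoding", None)))
```

Using errors="replace" keeps a visible marker where a character was lost, whereas 'ignore' (as in the answer above) drops it silently.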
If the standard output is redirected to a pipe then Python 2 fails to use your locale encoding:
⟫ python -c'print u"\u201c"' # no redirection -- works
“
⟫ python -c'print u"\u201c"' | cat
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
To fix it, you could specify the PYTHONIOENCODING environment variable, e.g., in bash:
⟫ PYTHONIOENCODING=utf-8 python -c'print u"\u201c"' | cat
“
On Windows, you need to set the envvar using a different syntax.
If your Windows console doesn't support utf-8 (it matters only for the first command where there is no redirection) then you could try to print Unicode directly using Win32 API calls like win-unicode-console does. See windows console doesn't print or input Unicode.
I'm using NLTK to perform kmeans clustering on my text file in which each line is considered as a document. So for example, my text file is something like this:
belong finger death punch
hasty
mike hasty walls jericho
jägermeister rules
rules bands follow performing jägermeister stage
approach
Now the demo code I'm trying to run is this:
import sys
import numpy
from nltk.cluster import KMeansClusterer, GAAClusterer, euclidean_distance
import nltk.corpus
from nltk import decorators
import nltk.stem
stemmer_func = nltk.stem.EnglishStemmer().stem
stopwords = set(nltk.corpus.stopwords.words('english'))
@decorators.memoize
def normalize_word(word):
    return stemmer_func(word.lower())
def get_words(titles):
    words = set()
    for title in job_titles:
        for word in title.split():
            words.add(normalize_word(word))
    return list(words)
@decorators.memoize
def vectorspaced(title):
    title_components = [normalize_word(word) for word in title.split()]
    return numpy.array([
        word in title_components and not word in stopwords
        for word in words], numpy.short)
if __name__ == '__main__':
    filename = 'example.txt'
    if len(sys.argv) == 2:
        filename = sys.argv[1]
    with open(filename) as title_file:
        job_titles = [line.strip() for line in title_file.readlines()]
        words = get_words(job_titles)
        # cluster = KMeansClusterer(5, euclidean_distance)
        cluster = GAAClusterer(5)
        cluster.cluster([vectorspaced(title) for title in job_titles if title])
        # NOTE: This is inefficient, cluster.classify should really just be
        # called when you are classifying previously unseen examples!
        classified_examples = [
            cluster.classify(vectorspaced(title)) for title in job_titles
        ]
        for cluster_id, title in sorted(zip(classified_examples, job_titles)):
            print cluster_id, title
(which can also be found here)
The error I receive is this:
Traceback (most recent call last):
File "cluster_example.py", line 40, in <module>
words = get_words(job_titles)
File "cluster_example.py", line 20, in get_words
words.add(normalize_word(word))
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/nltk/decorators.py", line 183, in memoize
result = func(*args)
File "cluster_example.py", line 14, in normalize_word
return stemmer_func(word.lower())
File "/usr/local/lib/python2.7/dist-packages/nltk/stem/snowball.py", line 694, in stem
word = (word.replace(u"\u2019", u"\x27")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)
What is happening here?
The file is being read as a bunch of strs, but it should be unicodes. Python tries to implicitly convert, but fails. Change:
job_titles = [line.strip() for line in title_file.readlines()]
to explicitly decode the strs to unicode (here assuming UTF-8):
job_titles = [line.decode('utf-8').strip() for line in title_file.readlines()]
It could also be solved by importing the codecs module and using codecs.open rather than the built-in open.
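As a sketch of the second suggestion, io.open accepts an encoding argument on both Python 2.7 and Python 3 and yields text lines directly, so no per-line decode is needed. The function name read_titles and the temporary demo file are assumptions made for the example.

```python
import io
import os
import tempfile

def read_titles(path, encoding="utf-8"):
    """Read one title per line, decoded to text while reading."""
    with io.open(path, encoding=encoding) as title_file:
        return [line.strip() for line in title_file]

# Build a small UTF-8 demo file so the sketch is self-contained.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write("j\u00e4germeister rules\nhasty\n".encode("utf-8"))

titles = read_titles(demo_path)
os.remove(demo_path)
print([ascii(title) for title in titles])
```

Every element of titles is already text, so calls such as word.lower() and the stemmer's replace() no longer trigger an implicit ASCII decode.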
This works fine for me.
f = open(file_path, 'r+', encoding="utf-8")
You can add a third parameter encoding to ensure the encoding type is 'utf-8'
Note: this method works fine in Python3, I did not try it in Python2.7.
For me there was a problem with the terminal encoding. Adding UTF-8 to .bashrc solved the problem:
export LC_CTYPE=en_US.UTF-8
Don't forget to reload .bashrc afterwards:
source ~/.bashrc
You can try this also:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
I got this error when trying to install a python package in a Docker container. For me, the issue was that the docker image did not have a locale configured. Adding the following code to the Dockerfile solved the problem for me.
# Avoid ascii errors when reading files in Python
RUN apt-get install -y locales && locale-gen en_US.UTF-8
ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'
When on Ubuntu 18.04 using Python3.6 I have solved the problem doing both:
with open(filename, encoding="utf-8") as lines:
and if you are running the tool as command line:
export LC_ALL=C.UTF-8
Note that if you are on Python 2.7 you have to handle this differently. First you have to set the default encoding:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
and then to load the file you must use io.open to set the encoding:
import io
with io.open(filename, 'r', encoding='utf-8') as lines:
You still need to export the env
export LC_ALL=C.UTF-8
To find any and all non-ASCII characters that may be behind a Unicode error, use the following command:
grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx
Found mine in
/etc/letsencrypt/options-ssl-nginx.conf: # The following CSP directives don't use default-src as
Using shed, I found the offending sequence. It turned out to be an editor mistake.
00008099: C2 194 302 11000010
00008100: A0 160 240 10100000
00008101: d 64 100 144 01100100
00008102: e 65 101 145 01100101
00008103: f 66 102 146 01100110
00008104: a 61 097 141 01100001
00008105: u 75 117 165 01110101
00008106: l 6C 108 154 01101100
00008107: t 74 116 164 01110100
00008108: - 2D 045 055 00101101
00008109: s 73 115 163 01110011
00008110: r 72 114 162 01110010
00008111: c 63 099 143 01100011
00008112: C2 194 302 11000010
00008113: A0 160 240 10100000
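The pair C2 A0 in the dump is the UTF-8 encoding of U+00A0, a no-break space. If shed is not available, a few lines of Python can list the same offsets. This is a sketch with an assumed helper name, find_non_ascii, and a sample byte string modeled on the dump above rather than on the real configuration file.

```python
def find_non_ascii(data):
    """Return (offset, byte value) pairs for every byte above 0x7F."""
    return [(offset, byte)
            for offset, byte in enumerate(bytearray(data))
            if byte > 0x7F]

# Sample modeled on the dump: "default-src" wrapped in UTF-8 no-break spaces.
sample = b"\xc2\xa0default-src\xc2\xa0"

for offset, byte in find_non_ascii(sample):
    print("%08d: %02X" % (offset, byte))
```

For a real file, the bytes would come from open(path, 'rb').read(); the offsets printed are relative to the start of that data.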
Use open(fn, 'rb').read().decode('utf-8') instead of just open(fn).read()
You can try this before using job_titles string:
source = unicode(job_titles, 'utf-8')
For Python 3, the default encoding would be "utf-8". For Python 2, the csv documentation suggests the following steps in case of any problem: https://docs.python.org/2/library/csv.html#csv-examples
Create a function
def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
Then use the function inside the reader, for e.g.
csv_reader = csv.reader(utf_8_encoder(unicode_csv_data))
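On Python 3 the encoder generator is not needed, because csv.reader works on text. A minimal sketch follows, assuming the data arrives as UTF-8 bytes; the function name read_utf8_csv and the sample rows are made up for illustration.

```python
import csv
import io

def read_utf8_csv(raw_bytes):
    """Parse UTF-8 encoded CSV bytes into a list of rows of text."""
    # newline="" lets the csv module handle line endings itself.
    text_stream = io.TextIOWrapper(io.BytesIO(raw_bytes),
                                   encoding="utf-8", newline="")
    return list(csv.reader(text_stream))

sample = "name,price\nBeta,\u00a255\n".encode("utf-8")
rows = read_utf8_csv(sample)
print(len(rows))
```

When reading from disk, the same effect comes from open(path, encoding="utf-8", newline="") passed straight to csv.reader.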
python3x or higher
load file in byte stream:
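def read_body(path='website/index.html'):  # hypothetical wrapper; the original snippet was a bare function body
    body = ''
    with open(path, 'rb') as html_file:
        for lines in html_file:
            decodedLine = lines.decode('utf-8')
            body = body + decodedLine.strip()
    return body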
use global setting:
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf-8')
I keep getting the following error:
$ ./test.py
-bash: ./test.py: cannot execute binary file
when trying to run the following file in python via cygwin:
#!usr/bin/python
with open("input.txt") as inf:
    try:
        while True:
            latin = inf.next().strip()
            gloss = inf.next().strip()
            trans = inf.next().strip()
            process(latin, gloss, trans)
            inf.next() # skip blank line
    except StopIteration:
        # reached end of file
        pass
from itertools import chain
def chunk(s):
    """Split a string on whitespace or hyphens"""
    return chain(*(c.split("-") for c in s.split()))
def process(latin, gloss, trans):
    chunks = zip(chunk(latin), chunk(gloss))
How do I fix this??
After taking on the below suggestions, still getting the same error.
If this helps, I tried
$ python ./test.py
and got
$ python ./test.py
File "./test.py", line 1
SyntaxError: Non-ASCII character '\xff' in file ./test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
There is a problem. You are missing the '/' in front of usr in #!usr/bin/python. Your line should look like this.
#!/usr/bin/python
In addition to making the file executable, note that #!/usr/bin/python may not work. At least it has never worked for me on Red Hat or Ubuntu Linux. Instead, I have put this in my Python files:
#!/usr/bin/env python
I don't know how this works on Windows platforms.
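One possible reading of the '\xff' on line 1 in that SyntaxError, which none of the answers above confirm, is that test.py was saved as UTF-16 with a byte order mark (FF FE), something some Windows editors do. Under that assumption, a quick check of the first bytes of the file might look like the sketch below; sniff_bom is a hypothetical helper name.

```python
import codecs

# UTF-32 marks are checked first because the UTF-32 LE mark
# starts with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading byte order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

# First bytes of a script saved as UTF-16 LE: BOM, then "#!" as 2-byte units.
print(sniff_bom(b"\xff\xfe#\x00!\x00"))
```

For a real file, the bytes would come from open('test.py', 'rb').read(4). If a mark is found, re-saving the script as plain UTF-8 without a BOM would be the thing to try.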