Hopefully a quick fix for this one. I have a script replacing a specific value with a file location. Unfortunately, the location quite often contains \n or n\ (because the current directory is in the temp folders), causing the line to either break or remove itself from the file entirely, making the folder location invalid.
The temp dir usually looks something like this:
C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp\Firefox
Is there a way to prevent \n or n\ from being interpreted as escapes? Any help is appreciated, and here's what my line replacement script looks like. Thanks in advance!
# Editing prefs.js
import fileinput
import sys

def replaceAll(file, searchExp, replaceExp):
    for line in fileinput.input(file, inplace=1):
        if searchExp in line:
            line = line.replace(searchExp, replaceExp)
        sys.stdout.write(line)  # write every line back, changed or not

# rootDir is defined earlier in the full script
replaceAll(rootDir + "/Firefox/Data/prefs.js", 'FirefoxAppDirHere', rootDir + "\\FirefoxApp.exe")
EDIT:
eryksun's method from his comment on this post worked perfectly for me! Thanks a lot! I'd mark the question as solved, but you must make a post first.
If you are specifying the directory name within your script, you should use a raw string literal by prefixing the literal with r. For example, r"C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp\Firefox". This will keep the backslashes from being interpreted.
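A quick interpreter session makes the difference visible (a minimal illustration using a short made-up path):
>>> print("C:\temp\new")    # \t and \n become tab and newline
C:      emp
ew
>>> print(r"C:\temp\new")   # raw string keeps the literal backslashes
C:\temp\new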
Your string in memory has plain backslash characters; it's not a problem of accidentally creating control characters such as line feed on the Python side. But if you're writing this out to a JavaScript program, then you have to escape the backslashes. For example:
>>> x = r"C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp"
>>> print(x)
C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp
So in memory this string has single backslash characters. Let's try to compile and evaluate it as a string:
>>> print(eval("'%s'" % x))
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "<string>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in
position 2-4: truncated \UXXXXXXXX escape
To fix this you can replace each backslash with two backslashes:
>>> x = x.replace('\\', '\\\\')
>>> print(x)
C:\\Users\\Admin\\AppData\\Local\\Temp\\nsfCDAC.tmp
>>> print(eval("'%s'" % x))
C:\Users\Admin\AppData\Local\Temp\nsfCDAC.tmp
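Applied to the script in the question, the same fix would look something like this (a sketch; rootDir comes from the asker's surrounding code, and the helper variable appPath is just for illustration):
appPath = (rootDir + "\\FirefoxApp.exe").replace("\\", "\\\\")  # double each backslash for prefs.js
replaceAll(rootDir + "/Firefox/Data/prefs.js", 'FirefoxAppDirHere', appPath)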
Michael Hoffman's solution is good in general. If for any reason you need the string not to be raw, you can also add an extra backslash:
"C:\Users\Admin\AppData\Local\Temp\\nsfCDAC.tmp"
The extra backslash keeps the \n (or any other escape sequence like that) from being interpreted. For example (I believe; I'm running off of vague recollection here), if you need a string with ' and " in it, you can do:
"blah blah blah, he said \"hi!\", and continued on, \'til he got to the road. Blah blah!"
You should use a raw string literal by prefixing the literal with r. For more details about raw strings, you can visit here; another link is here.
Related
In Python, I am trying to replace a symbol in a string.
I have this string:
a = "• HELLO • HOW • ARE • AYOU"
I want to replace the "•" with ";".
I tried this, but it made no modification to my string:
b = a.replace("&#8226;", ";")
I tried this as well, and it works in plain Python:
b = a.replace("•", ";")
but when I launched in my spark-submit, I have this error:
SyntaxError: Non-UTF-8 code starting with '\x95' in file file_test.py on line 392, but no encoding declared;
Thank you for your help!
The code point of • is 8226, which can be found using ord("•").
Try changing the line to b = a.replace(chr(8226), ";").
The error message tells you that you need to declare an encoding in your source file. You do this by including the following command at the beginning:
# coding=utf-8
(Either as the very first line, or as the second line behind the shebang declaration.)
Your first code doesn't work because &#8226; is an HTML entity; it has nothing to do with Python. In Python, instead of using a Unicode character in code, you could also use an escape sequence to encode the value of the bullet character:
a.replace('\u2022', ';')
(U+2022 is the Unicode code point “BULLET”.)
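Putting the two fixes together, a minimal sketch of the corrected code might look like this (using the escape keeps the source file ASCII-only, so the coding declaration then serves only as a safety net):
# coding=utf-8
a = "\u2022 HELLO \u2022 HOW \u2022 ARE \u2022 AYOU"
b = a.replace('\u2022', ';')
print(b)  # ; HELLO ; HOW ; ARE ; AYOU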
I get a string which includes Unicode characters. But the backslashes are escaped. I want to remove one backslash so python can treat the Unicode in the right way.
Using replace I am only able to remove and add two backslashes at a time.
my_str = '\\uD83D\\uDE01\\n\\uD83D\\uDE01'
my_str2 = my_str.replace('\\', '')
'\\uD83D\\uDE01\\n\\uD83D\\uDE01' should be '\uD83D\uDE01\n\uD83D\uDE01'
edit:
Thank you for the many responses. You are right, my example was wrong. Here are other things I have tried:
my_str = '\\uD83D\\uDE01\\n\\uD83D\\uDE01'
my_str2 = my_str.replace('\\\\', '\\') # no unicode
my_str2 = my_str.replace('\\', '')
That's… probably not going to work. Escape characters are handled during lexical analysis (parsing); what you have in your string is already a single backslash, just shown in its escaped representation:
>>> r'\u3d5f'
'\\u3d5f'
What you need to do is encode the string to be "python source" then re-decode it while applying unicode escapes:
>>> my_str.encode('utf-8').decode('unicode_escape')
'\ud83d\ude01\n\ud83d\ude01'
However, note that these code points are surrogates, so your string is pretty much broken/invalid; you're not going to be able to, e.g., print it, because the UTF-8 encoder is going to reject it:
>>> print(my_str.encode('utf-8').decode('unicode_escape'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
To fix that, you need a second fixup pass: encode to UTF-16, letting the surrogates pass through directly (using the "surrogatepass" error handler), then do proper UTF-16 decoding back to an actual well-formed string:
>>> print(my_str.encode('utf-8').decode('unicode_escape').encode('utf-16', 'surrogatepass').decode('utf-16'))
😁
😁
You may really want to do a source analysis on your data, though: it's not logically valid to get a (Unicode) string with Unicode escapes in it, and it might point to incorrect loading of JSON data or some such. If it's an option (I realise that's not always the case), fixing that would be much better than applying hacky fixups afterwards.
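For example, if the data was supposed to be JSON, decoding it with the json module handles both the escapes and the surrogate pairing in one step (a sketch assuming the input really is JSON):
>>> import json
>>> json.loads('"\\ud83d\\ude01\\n\\ud83d\\ude01"')
'😁\n😁'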
Basically I'm doing a subtitle project.
Very complicated, but I just want to append to every line after a given line in a converted ASS file (currently still a txt file in the experiment).
Untouched lines (I won't talk about Aegisub problems here):
Dialogue: 0,0:00:00.00,0:00:03.90,Default,,0,0,0,,Hello, viewers. This is The Reassembler,
Dialogue: 0,0:00:03.90,0:00:07.04,Default,,0,0,0,,the show where we take everyday objects in their component form
Dialogue: 0,0:00:07.04,0:00:10.24,Default,,0,0,0,,and put them back together, very slowly.
Objective:
Every line in the dialogue section appended with
'\N{\3c&HAA0603&\fs31\b1}'
Dialogue: 0,0:00:00.00,0:00:03.90,Default,,0,0,0,,Hello, viewers. This is The Reassembler,\N{\3c&HAA0603&\fs31\b1}
Dialogue: 0,0:00:03.90,0:00:07.04,Default,,0,0,0,,the show where we take everyday objects in their component form\N{\3c&HAA0603&\fs31\b1}
Dialogue: 0,0:00:07.04,0:00:10.24,Default,,0,0,0,,and put them back together, very slowly.\N{\3c&HAA0603&\fs31\b1}
The Python 3.x code:
text1 = open('d:\Programs\sub1.txt','r')
text2 = open('e:\modsub.ass','w+')
alltext1 = text1.read()
lines = alltext1.split('\n')
for i in range(lines.index('[Events]')+1,len(lines)):
    lines[i] += ' hello '
print(lines)
text2.write(str(lines))
text1.close()
text2.close()
1. Python doesn't recognize one or two characters in it; apparently it takes them for Unicode escapes:
'\N{\3c&HAA0603&\fs31\b1}'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-23: unknown Unicode character name
How do I deal with this without affecting the output?
2. When I used ' hello ' instead of the subtitling code, the output was this:
'Dialogue: 0,0:00:07.04,0:00:10.24,Default,,0,0,0,,and put them back together, very slowly. hello ', 'Dialogue: 0,0:00:10.24,0:00:11.72,Default,,0,0,0,,That feels very nice. hello ', 'Dialogue: 0,0:00:11.72,0:00:13.36,Default,,0,0,0,,Oh, yes. Look at that! hello ',
et cetera, instead of a line-after-line arrangement.
How do I make the strings line up one per line, without the quotes and the other list punctuation?
Use a raw string literal, i.e. replace:
'\N{\3c&HAA0603&\fs31\b1}'
with:
r'\N{\3c&HAA0603&\fs31\b1}'
This way the interpreter will not try to look up the Unicode character named \3c&HAA0603&\fs31\b1, which does not exist.
>>> '\N{\3c&HAA0603&\fs31\b1}'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-23: unknown Unicode character name
>>> r'\N{\3c&HAA0603&\fs31\b1}'
'\\N{\\3c&HAA0603&\\fs31\\b1}'
>>> print(r'\N{\3c&HAA0603&\fs31\b1}')
\N{\3c&HAA0603&\fs31\b1}
The problem is that you're using a string with \ characters in it, without escaping them. You need to double them up or use the r'' notation.
lines[i] += '\\N{\\3c&HAA0603&\\fs31\\b1}'
or
lines[i] += r'\N{\3c&HAA0603&\fs31\b1}'
As for your other problem: you're writing str(lines), which produces a literal representation of the list. Use '\n'.join(lines) + '\n' instead.
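Putting both fixes together, a sketch of the corrected script (paths taken from the question; raw strings also used for the Windows paths):
text1 = open(r'd:\Programs\sub1.txt', 'r')
text2 = open(r'e:\modsub.ass', 'w+')
lines = text1.read().split('\n')
for i in range(lines.index('[Events]') + 1, len(lines)):
    lines[i] += r'\N{\3c&HAA0603&\fs31\b1}'  # raw string, so no escape processing
text2.write('\n'.join(lines) + '\n')         # one line per entry, no list repr
text1.close()
text2.close()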
I am new to Python and am using it to work with NLTK in my project. After word-tokenizing the raw data obtained from a webpage, I got a list containing '\xe2', '\xe3', '\x98', etc. However, I do not need these and want to delete them.
I simply tried
if '\x' in a
and
if a.startswith('\xe')
and it gives me an error saying invalid \x escape
But when I try a regular expression
re.search('^\\x',a)
i get
Traceback (most recent call last):
File "<pyshell#83>", line 1, in <module>
print re.search('^\\x',a)
File "C:\Python26\lib\re.py", line 142, in search
return _compile(pattern, flags).search(string)
File "C:\Python26\lib\re.py", line 245, in _compile
raise error, v # invalid expression
error: bogus escape: '\\x'
even re.search('^\\x',a) is not identifying it.
I am confused by this; even googling didn't help (I might be missing something). Please suggest any simple way to remove such strings from the list, and tell me what was wrong with the above.
Thanks in advance!
You can use unicode(a, 'ascii', 'ignore') to remove all non-ascii characters in the string at once.
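For example, in Python 2, where this unicode() built-in exists (the sample string is made up):
>>> a = 'caf\xe2 tokens'
>>> unicode(a, 'ascii', 'ignore')
u'caf tokens'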
It helps here to understand the difference between a string literal and a string.
A string literal is a sequence of characters in your source code. When parsed and compiled by the Python interpreter, it produces a string, which is a sequence of characters in memory.
For example, the string literal "a" produces the string a.
String literals can take a number of forms. All of these produce the same string a:
"a"
'a'
r"a"
"""a"""
r'''a'''
Source code is traditionally ASCII-only, but we'd like it to contain string literals that can produce characters beyond ASCII. To do this, escapes can be used. For example, the string literal "\xe2" produces a single-character string, with a character with integer value E2 hexadecimal, or 226 decimal.
This explains the error about "\x" being an invalid escape: the parser is expecting you to specify the hexadecimal value of a character.
To detect if a string has any characters in a certain range, you can use a regex with a character class specifying the lower and upper bounds of the characters you don't want:
if re.search(r"[\x90-\xff]", a):
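The same character class works with re.sub to delete the unwanted characters outright (a sketch; the sample string is made up, and the range to strip is up to you):
>>> import re
>>> a = 'caf\xe2 tokens'
>>> re.sub(r"[\x90-\xff]", "", a)
'caf tokens'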
'\xe2' is one character; \x is an escape sequence that's followed by a hex number and used to specify a byte literally.
That means you have to specify the whole expression:
>>> s = '\xe2hello'
>>> s
'\xe2hello'
>>> s.replace('\xe2', '')
'hello'
More information can be found in the Python docs.
I see other answers have done a good job explaining your confusion with respect to '\x', and suggesting that you may not want to completely remove non-ASCII characters; but they have not provided a specific way to do normalization beyond such removal.
If you want to obtain some "reasonably close ASCII character" (e.g., strip accents from letters but leave the underlying letter, &c), this SO answer may help -- the code in the accepted answer, using only the standard Python library, is:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
Of course, you'll need to apply this function to each string item in the list you mention in the title, e.g.:
cleanedlist = [strip_accents(s) for s in mylist]
if all items in mylist are strings.
Let's stand back and think about this a little bit ...
You're using nltk (natural language toolkit) to parse (presumably) natural language.
Your '\xe2' is highly likely to represent U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (â).
Your '\xe3' is highly likely to represent U+00E3 LATIN SMALL LETTER A WITH TILDE (ã).
They look like natural language letters to me. Are you SURE that you don't need them?
If you only want to enter this pattern and avoid the error, you can try inserting a + between the \ and the x, like here:
re.search('\+x[0123456789abcdef]*',a)
In trying to fix up a PML (Palm Markup Language) file, it appears that my test file has non-ASCII characters, which is causing MakeBook to complain. The solution would be to strip out all the non-ASCII chars in the PML.
So in attempting to fix this in python, I have
import unicodedata, fileinput
for line in fileinput.input():
    print unicodedata.normalize('NFKD', line).encode('ascii', 'ignore')
However, this results in an error saying that line must be "unicode, not str". Here's a file fragment:
\B1a\B \tintense, disordered and often destructive rage†.†.†.\t
Not quite sure how to properly pass line in to be processed at this point.
Try print line.decode('iso-8859-1').encode('ascii', 'ignore') -- that should be much closer to what you want.
You would like to treat line as ASCII-encoded data, so the answer is to decode it to text using the ascii codec:
line.decode('ascii')
This will raise errors for data that is not in fact ASCII-encoded. This is how to ignore those errors:
line.decode('ascii', 'ignore')
This gives you text, in the form of a unicode instance. If you would rather work with (ascii-encoded) data rather than text, you may re-encode it to get back a str or bytes instance (depending on your version of Python):
line.decode('ascii', 'ignore').encode('ascii')
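For example, in Python 2 (made-up sample bytes):
>>> line = 'rage\x86.\x86.\x86.'
>>> line.decode('ascii', 'ignore')
u'rage...'
>>> line.decode('ascii', 'ignore').encode('ascii')
'rage...'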
To drop non-ASCII characters use line.decode(your_file_encoding).encode('ascii', 'ignore'). But probably you'd better use PLM escape sequences for them:
import re

def escape_unicode(m):
    return '\\U%04x' % ord(m.group())

non_ascii = re.compile(u'[\x80-\uFFFF]', re.U)

line = u'\\B1a\\B \\tintense, disordered and often destructive rage\u2020.\u2020.\u2020.\\t'
print non_ascii.sub(escape_unicode, line)
This outputs \B1a\B \tintense, disordered and often destructive rage\U2020.\U2020.\U2020.\t.
Dropping non-ASCII and control characters with a regular expression is easy too (this can be safely used after escaping):
regexp = re.compile('[^\x09\x0A\x0D\x20-\x7F]')
regexp.sub('', line)
When reading from a file in Python you're getting byte strings, aka "str" in Python 2.x and earlier. You need to convert these to the "unicode" type using the decode method, e.g.:
line = line.decode('latin1')
Replace 'latin1' with the correct encoding.
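A minimal sketch in Python 2, assuming the file really is Latin-1 (the filename book.pml is hypothetical):
f = open('book.pml')
for line in f:
    text = line.decode('latin1')             # str -> unicode
    clean = text.encode('ascii', 'ignore')   # drop the non-ASCII characters
f.close()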