Absolute path of string that contains special characters - python

In my code, I get a path from the database that may contain special escape characters, and I need to convert it to a real path name. I'm using Python 3.7 on Windows.
Suppose this path: C:\Files\2c2b2541\00025\test.x
IMPORTANT: the path is not a fixed value in the code and it is an output of executing a Stored Procedure from pyodbc.
When I try to convert it to an absolute path I get this error:
ValueError: _getfullpathname: embedded null character in path
I also tried to replace "\" with "/" but with no luck.
import os
# path = cursor.execute(query, "some_input").fetchone()[0]
path = 'C:\Files\2c2b2541\00025\test.x'
print(os.path.abspath(path))

Judging by your comments on the other answers, it sounds like the data is already corrupted in the database you're using. That is, you have a literal null byte stored there, and perhaps other bogus bytes (like \2 perhaps turning into \x02). So you probably need two fixes.
First, you should fix whatever code is putting values into the database, so it won't put bogus data in any more. You haven't described how the data gets into the database, so we can't give you much guidance on how to do this. But most programming languages (and DB libraries) have tools to prevent escape sequences from being evaluated in strings where they're not wanted.
Once you've stopped new bad data from getting added, you can work on fixing the values that are already in the database. It probably shouldn't be too hard to write a query that will replace \0 null bytes with \\0 (or whatever the appropriate escape sequence is for your DB). You may want to look for special characters like newlines (\n) and unprintable characters (like \x02) as well.
I'd only try to fix this issue on the output end if you don't have any control of the database at all.
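If the database really cannot be touched, a first step on the output end could be to detect the damaged values before they reach os.path.abspath. The following is only a sketch: the helper name find_control_chars is hypothetical, and it reports suspicious characters rather than trying to guess the original path.

```python
def find_control_chars(path):
    """Return (index, repr) pairs for every character below U+0020 in path."""
    return [(i, repr(ch)) for i, ch in enumerate(path) if ord(ch) < 32]


# The mangled value from the question: \2, \000 and \t were evaluated as escapes.
mangled = 'C:\\Files\x02c2b2541\x0025\test.x'
# The intended value, written as a raw string so nothing is evaluated.
intended = r'C:\Files\2c2b2541\00025\test.x'

print(find_control_chars(mangled))   # three hits: \x02, \x00 and the tab
print(find_control_chars(intended))  # empty list
```

A non-empty result means the row was already corrupted when it was stored, which points back to fixing the writer and the stored data.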

I think the below is the right way to solve your problem.
>>> import os
>>> def get_fixed_path(path):
...     path = repr(path)
...     path = path.replace("\\", "\\\\")
...     path = path.replace("\\x", "\\\\0")
...     path = os.path.abspath(path).split("'")[1]
...     return path
...
>>>
>>> path = 'C:\Files\2c2b2541\00025\test.x'
>>> path
'C:\\Files\x02c2b2541\x0025\test.x'
>>>
>>> print(path)
C:\Filesc2b2541 25 est.x
>>>
>>> final_path = get_fixed_path(path)
>>> final_path
'C:\\Files\\002c2b2541\\00025\\test.x'
>>>
>>> print(final_path)
C:\Files\002c2b2541\00025\test.x
>>>
And here is a detailed description of each step in the above solution.
First step (problem)
>>> import os
>>>
>>> path = 'C:\Files\2c2b2541\00025\test.x'
>>> path
'C:\\Files\x02c2b2541\x0025\test.x'
>>>
>>> print(path)
C:\Filesc2b2541 25 est.x
>>>
Second step (problem)
>>> path2 = repr(path)
>>> path2
"'C:\\\\Files\\x02c2b2541\\x0025\\test.x'"
>>>
>>> print(path2)
'C:\\Files\x02c2b2541\x0025\test.x'
>>>
Third step (problem)
>>> path3 = path2.replace("\\", "\\\\")
>>> path3
"'C:\\\\\\\\Files\\\\x02c2b2541\\\\x0025\\\\test.x'"
>>>
>>> print(path3)
'C:\\\\Files\\x02c2b2541\\x0025\\test.x'
>>>
>>> path3 = path3.replace("\\x", "\\\\0")
>>> path3
"'C:\\\\\\\\Files\\\\\\002c2b2541\\\\\\00025\\\\test.x'"
>>>
>>> print(path3)
'C:\\\\Files\\\002c2b2541\\\00025\\test.x'
>>>
Fourth step (problem)
>>> os.path.abspath(path3)
"C:\\Users\\RISHIKESH\\'C:\\Files\\002c2b2541\\00025\\test.x'"
>>>
>>> os.path.abspath(path2)
"C:\\Users\\RISHIKESH\\'C:\\Files\\x02c2b2541\\x0025\\test.x'"
>>>
>>> os.path.abspath('k')
'C:\\Users\\RISHIKESH\\k'
>>>
>>> os.path.abspath(path3).split("'")
['C:\\Users\\RISHIKESH\\', 'C:\\Files\\002c2b2541\\00025\\test.x', '']
>>> os.path.abspath(path3).split("'")[1]
'C:\\Files\\002c2b2541\\00025\\test.x'
>>>
Final step (solution)
>>> final_path = os.path.abspath(path3).split("'")[1]
>>>
>>> final_path
'C:\\Files\\002c2b2541\\00025\\test.x'
>>>
>>> print(final_path)
C:\Files\002c2b2541\00025\test.x
>>>

Replace "\" by "\\".
That's it.

You need to either use a raw string literal or double backslashes \\.
import os
path = r'C:\Files\2c2b2541\00025\test.x' #r before the string
print(os.path.abspath(path))
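As a rough illustration of that point, the raw literal and the doubled-backslash literal produce the same string, and neither contains a null character. This only applies to literals written in source code; a value fetched from the database is never parsed as a literal. The sketch uses ntpath (the Windows flavour of os.path) so it behaves the same on any platform.

```python
import ntpath  # Windows path rules, importable on any platform

raw = r'C:\Files\2c2b2541\00025\test.x'
doubled = 'C:\\Files\\2c2b2541\\00025\\test.x'

print(raw == doubled)        # both spellings give the same string
print('\x00' in raw)         # no embedded null character
print(ntpath.basename(raw))  # last path component
```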

Related

Python: How to get double backslash from Path object

I'm using the pathlib Path module to store the paths of a couple of my programs. The problem is, when I go to use this variable, I end up with an error because the \\ gets turned into a \ which the system is interpreting as a special character. My understanding was that depending on the OS, the Path module would handle this accordingly. (I'm using Windows)
Here is a recreation of my code:
from pathlib import Path
def get_dictionary():
    path1 = Path("C:\\Programs\\program1")
    path2 = Path("C:\\Programs\\program2")
    path3 = Path("C:\\Programs\\program3")
    path4 = Path("C:\\Programs\\program4")
    info = {
        "program1" : str(path1),
        "program2" : str(path2),
        "program3" : str(path3),
        "program4" : str(path4)
    }
    return info
if __name__ == "__main__":
    theInfo = get_dictionary()
    print(theInfo['program1'])
    print(theInfo['program2'])
    print(theInfo['program3'])
    print(theInfo['program4'])
    print(theInfo)
And the console output is the following:
C:\Programs\program1
C:\Programs\program2
C:\Programs\program3
C:\Programs\program4
{'program1': 'C:\\Programs\\program1', 'program2': 'C:\\Programs\\program2', 'program3':
'C:\\Programs\\program3', 'program4': 'C:\\Programs\\program4'}
So my question is: Say I want to use theInfo['program1']. I get C:\Programs\program1 but I need to get C:\\Programs\\program1. How can I go about doing this? Thank you for any help!
Edit: The values I get from the dictionary are placed in a string that ends up being a line in a Tcl file. For instance I have a function where I write:
f"puts {theInfo['program1']}"
where I expect:
puts C:\\Programs\\program1
but I get:
puts C:\Programs\program1
With other characters, the backslash sequence is interpreted as a tab, newline, etc.
This phenomenon is caused by the escape character '\'. If you really want the escaped data to be C:\\Programs\\program1, you can split the path and then join it with \\\\.
In [4]: data = 'C:\\Programs\\program1'
In [5]: data
Out[5]: 'C:\\Programs\\program1'
In [6]: print(data)
C:\Programs\program1
In [7]: new_data = 'C:\\\\Programs\\\\program1'
In [8]: new_data
Out[8]: 'C:\\\\Programs\\\\program1'
In [9]: print(new_data)
C:\\Programs\\program1
In [21]: p = '\\\\'.join(("C:\\Programs\\program1").split('\\'))
In [22]: p
Out[22]: 'C:\\\\Programs\\\\program1'
In [23]: print(p)
C:\\Programs\\program1
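The same result can be sketched without split/join. The example below uses PureWindowsPath so it runs on any platform; the second option relies on the assumption that the Tcl side accepts forward slashes in Windows file names, which should be checked for the tool in question.

```python
from pathlib import PureWindowsPath

path1 = PureWindowsPath("C:\\Programs\\program1")

# Option 1: double every backslash so the generated Tcl line keeps them.
escaped = str(path1).replace("\\", "\\\\")
print(f"puts {escaped}")

# Option 2 (assumption: forward slashes are acceptable to the consumer):
# avoid backslashes in the generated file altogether.
print(f"puts {path1.as_posix()}")
```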

Python encoding unicode utf-8

I'm using Selenium to insert text input with German umlauts in a web form. The declared coding for the Python script is utf-8. The page uses utf-8 encoding. When I define a string like this, everything works fine:
q = u"Hällö" #type(q) returns unicode
...
textbox.send_keys(q)
But when I try to read from a config file using ConfigParser (or another kind of file), I get malformed output in the web form (Hällö). This is the code I use for that:
the_encoding = chardet.detect(q)['encoding'] #prints utf-8
q = parser.get('info', 'query') # type(q) returns str
q = q.decode('unicode-escape') # type(q) returns unicode
textbox.send_keys(q)
What's the difference between the two q's passed to the send_keys function?
This is probably bad encoding. Try printing q before the last statement, and see if it's equal. This line q = parser.get('info', 'query') # type(q) returns str should return the string 'H\xc3\xa4ll\xc3\xb6'. If it's different, then you are using the wrong coding.
>>> q = u"Hällö" # unicode obj
>>> q
u'H\xe4ll\xf6'
>>> print q
Hällö
>>> q.encode('utf-8')
'H\xc3\xa4ll\xc3\xb6'
>>> a = q.encode('utf-8') # str obj
>>> a
'H\xc3\xa4ll\xc3\xb6' # <-- this should be the value of the str
>>> a.decode('utf-8') # <-- unicode obj
u'H\xe4ll\xf6'
>>> print a.decode('utf-8')
Hällö
>>>
from ConfigParser import SafeConfigParser
import codecs
parser = SafeConfigParser()
with codecs.open('cfg.ini', 'r', encoding='utf-8-sig') as f:
    parser.readfp(f)
greet = parser.get('main', 'greet')
print 'greet:', greet.encode('utf-8-sig')
greet: Hällö
cfg.ini file
[main]
greet=Hällö
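For readers on Python 3, a rough equivalent might look like the sketch below. It is not part of the original answers: it assumes the config file is saved as UTF-8 and writes a temporary cfg.ini only so the example is self-contained.

```python
import configparser
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a small UTF-8 encoded config file for the demonstration.
    cfg_path = os.path.join(tmp, "cfg.ini")
    with open(cfg_path, "w", encoding="utf-8") as f:
        f.write("[main]\ngreet=Hällö\n")

    # Pass the encoding explicitly instead of relying on the platform default.
    parser = configparser.ConfigParser()
    parser.read(cfg_path, encoding="utf-8")
    greet = parser.get("main", "greet")

# greet is a text string; its UTF-8 bytes match the value expected above.
print(greet.encode("utf-8"))
```

In Python 3 the value returned by parser.get is already a text string, so no separate decode step is needed before handing it to send_keys.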

Strip and split at same time in python

I'm trying to split and strip one string at the same time.
I have a file D:\printLogs\newPrintLogs\4.txt and I want to split it so that I get only 4.txt, then strip the .txt and append ".zpr" to get "4.zpr".
This is the code that I tried to use:
name = str(logfile)
print ("File name: " + name.split('\\')[-1] + name.strip( '.txt' ))
But I get this output:
File name: 4.txtD:\printLogs\newPrintLogs\4
Don't use stripping and splitting.
First of all, stripping removes all characters from a set, you are removing all 't', 'x' and '.' characters from the start and end of your string, regardless of order:
>>> 'tttx.foox'.strip('.txt')
'foo'
>>> 'tttx.foox'.strip('xt.')
'foo'
Secondly, Python offers you the os.path module for handling paths in a cross-platform and consistent manner:
basename = os.path.basename(logfile)
if basename.endswith('.txt'):
    basename = os.path.splitext(basename)[0]
You can drop the str.endswith() test if you just want to remove any extension:
basename = os.path.splitext(os.path.basename(logfile))[0]
Demo:
>>> import os.path
>>> logfile = r'D:\printLogs\newPrintLogs\4.txt'
>>> os.path.splitext(os.path.basename(logfile))[0]
'4'
You're adding too much there. This is all you need:
print ("File name: " + name.split('\\')[-1].strip( '.txt' ))
Better yet, use the os module:
>>> import os
>>> os.path.splitext(os.path.basename(r'D:\printLogs\newPrintLogs\4.txt'))[0]
'4'
Or, split up among several steps, with occasional feedback:
>>> import os
>>> name = r'D:\printLogs\newPrintLogs\4.txt'
>>> basename = os.path.basename(name)
>>> basename
'4.txt'
>>> splitname = os.path.splitext(basename)
>>> splitname
('4', '.txt')
>>> splitname[0]
'4'
Thank you all for your solutions. They helped me, but at first I didn't explain the question right.
I found a solution for my problem:
name = str(logfile)
print ("Part name: " + name.split('\\')[-1].replace('.txt','.zpr'))
For python 3.4 or later:
import pathlib
name = r"D:\printLogs\newPrintLogs\4.txt"
stem = pathlib.Path(name).stem
print(stem) # prints 4
You can split and rstrip:
print(s.rsplit("\\",1)[1].rstrip(".txt"))
But it may be safer to split on the .:
print(s.rsplit("\\",1)[1].rsplit(".",1)[0])
If you rstrip or strip you could end up removing more than just the .txt .
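To make that risk concrete, and to show one way of getting "4.zpr" directly, here is a small sketch. It uses PureWindowsPath so the Windows-style path is parsed the same way on any platform; the variable names are only illustrative.

```python
from pathlib import PureWindowsPath

logfile = r"D:\printLogs\newPrintLogs\4.txt"

# Swap the extension and keep only the file name.
part_name = PureWindowsPath(logfile).with_suffix(".zpr").name
print(part_name)

# rstrip() removes any trailing run of the characters '.', 't' and 'x',
# not the literal suffix ".txt", so it can eat part of the name.
print("text.txt".rstrip(".txt"))
```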

Print python os.urandom output on terminal

How can I print the output of os.urandom(n) in a terminal?
I'm trying to generate a SECRET_KEY with a fabfile and want to output the 24 bytes.
Example of how I tried both variants in the Python shell:
>>> import os
>>> out = os.urandom(24)
>>> out
'oS\xf8\xf4\xe2\xc8\xda\xe3\x7f\xc75*\x83\xb1\x06\x8c\x85\xa4\xa7piE\xd6I'
>>> print out
oS�������5*������piE�I
If what you want is a hex-encoded string, use binascii.b2a_hex (or hexlify):
>>> out = 'oS\xf8\xf4\xe2\xc8\xda\xe3\x7f\xc75*\x83\xb1\x06\x8c\x85\xa4\xa7piE\xd6I'
>>> import binascii
>>> print binascii.hexlify(out)
6f53f8f4e2c8dae37fc7352a83b1068c85a4a7706945d649
To use just built-ins, you can get the integer value with ord and then convert that back to a hex number:
list_of_hex = [str(hex(ord(z)))[2:] for z in out]
print " ".join(list_of_hex)
If you just want the hex list, then the str() and [2:] are unnecessary
The output of this and the hexlify() version are both of type str and should work fine for the web app.
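On Python 3, where os.urandom returns a bytes object, the same idea can be sketched as follows. This is an addition for illustration only; the variable names are arbitrary.

```python
import binascii
import os

out = os.urandom(24)

# bytes.hex() gives the same digits as binascii.hexlify(), but as a str.
hex_key = out.hex()
print(hex_key)

# Per-byte form; the "02x" format keeps the leading zero for small values.
print(" ".join(format(b, "02x") for b in out))
```

The hex string is always twice as long as the byte count and can be turned back into the original bytes with bytes.fromhex().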

Python to show special characters

I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.
I am trying to print a string but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr() this is what I get:
u'Von D\xc3\xbc' and u'\xc3\x96berg'
Does anyone know how I can convert this to Von Dü and Öberg? It's important to me that these characters are not ignored, e.g. myStr.encode("ascii", "ignore").
EDIT
This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell (<td>) in a table (<table>), is put into the variable name. This is the variable which contains special characters that I cannot print.
web = urllib2.urlopen(url)
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0
# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
    i = 0
    # Iterate over <td> to find time and name
    for td in tables[scene_table].find_all("td"):
        if i % 2 == 0: # td contains the time
            time = remove_whitespace(td.get_text())
        else: # td contains the name
            name = remove_whitespace(td.get_text()) # This is the variable containing "nonsense"
            print "%s: %s" % (time, name,)
        i += 1
    scene_index += 1
Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:
your_unicode_string = original_utf8_encoded_bytestring.decode('latin1')
The cure is to reverse the process, simply, and then decode.
correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')
Update Based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but in reality it is encoded in UTF-8. Please update your question to show us the url.
If you can't show it, read the BS docs; it looks like you'll need to use:
BeautifulSoup(web, from_encoding='utf8')
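The "reverse the process" cure can be sketched in Python 3 syntax as below. The helper name repair_mojibake is hypothetical, and the approach only works under the assumption stated above, namely that UTF-8 bytes were decoded as Latin-1.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1."""
    return text.encode("latin1").decode("utf-8")


# The two values from the question, as text strings.
broken = ["Von D\xc3\xbc", "\xc3\x96berg"]
fixed = [repair_mojibake(s) for s in broken]

# ascii() keeps the output readable on terminals without UTF-8 support.
print([ascii(s) for s in fixed])
```

If the input was not damaged in this specific way, the encode or decode step may raise a UnicodeError, so this should be treated as a repair for known-bad data rather than a general filter.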
Unicode support in many languages is confusing, so your error here is understandable. Those strings are UTF-8 bytes, which would work properly if you drop the u at the front:
>>> err = u'\xc3\x96berg'
>>> print err
Ã?berg
>>> x = '\xc3\x96berg'
>>> print x
Öberg
>>> u = x.decode('utf-8')
>>> u
u'\xd6berg'
>>> print u
Öberg
For lots more information:
http://www.joelonsoftware.com/articles/Unicode.html
http://docs.python.org/howto/unicode.html
You should really really read those links and understand what is going on before proceeding. If, however, you absolutely need to have something that works today, you can use this horrible hack that I am embarrassed to post publicly:
def convert_fake_unicode_to_real_unicode(string):
    return ''.join(map(chr, map(ord, string))).decode('utf-8')
The contents of the strings are not unicode, they are UTF-8 encoded.
>>> print u'Von D\xc3\xbc'
Von DÃ¼
>>> print 'Von D\xc3\xbc'
Von Dü
>>> print unicode('Von D\xc3\xbc', 'utf-8')
Von Dü
>>>
Edit:
>>> print '\xc3\x96berg' # no unicode identifier, works as expected because it's an UTF-8 encoded string
Öberg
>>> print u'\xc3\x96berg' # has unicode identifier, means print uses the unicode charset now, outputs weird stuff
Ãberg
# Look at the differing object types:
>>> type('\xc3\x96berg')
<type 'str'>
>>> type(u'\xc3\x96berg')
<type 'unicode'>
>>> '\xc3\x96berg'.decode('utf-8') # this command converts from UTF-8 to unicode, look at the unicode identifier in the output
u'\xd6berg'
>>> unicode('\xc3\x96berg', 'utf-8') # this does the same thing
u'\xd6berg'
>>> unicode(u'foo bar', 'utf-8') # trying to convert a unicode string to unicode will fail as expected
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported
