How to convert a python string

How would I convert this string
'\\n this is a docstring for\\n the main function.\\n a,\\n b,\\n c\\n '
into
'\n this is a docstring for\n the main function.\n a,\n b,\n c\n '
keeping in mind I would also like to do this for '\t' and all other escaped characters.
The code for the reverse way is
def fix_string(s):
    """ takes the string and replaces any `\n` with `\\n` so that the read file will be recognized """
    # escape chars = \t , \b , \n , \r , \f , \' , \" , \\
    new_s = ''
    for i in s:
        if i == '\t':
            new_s += '\\t'
        elif i == '\b':
            new_s += '\\b'
        elif i == '\n':
            new_s += '\\n'
        elif i == '\r':
            new_s += '\\r'
        elif i == '\f':
            new_s += '\\f'
        elif i == '\'':
            new_s += "\\'"
        elif i == '\"':
            new_s += '\\"'
        else:
            new_s += i
    return new_s
would I possibly need to look at the actual numeric values for the characters and check the next character say if I find a ('\',92) character followed by a ('n',110)?

Don't reinvent the wheel here. Python has your back. Besides, handling escape syntax correctly is harder than it looks.
The correct way to handle this
In Python 2, use the str-to-str string_escape codec:
string.decode('string_escape')
This interprets any Python-recognized string escape sequences for you, including \n and \t.
Demo:
>>> string = '\\n this is a docstring for\\n the main function.\\n a,\\n b,\\n c\\n '
>>> string.decode('string_escape')
'\n this is a docstring for\n the main function.\n a,\n b,\n c\n '
>>> print string.decode('string_escape')
this is a docstring for
the main function.
a,
b,
c
>>> '\\t\\n\\r\\xa0\\040'.decode('string_escape')
'\t\n\r\xa0 '
In Python 3, you'd use codecs.decode() with the unicode_escape codec:
codecs.decode(string, 'unicode_escape')
as there is no str.decode() method and this is not a str -> bytes conversion.
Demo:
>>> import codecs
>>> string = '\\n this is a docstring for\\n the main function.\\n a,\\n b,\\n c\\n '
>>> codecs.decode(string, 'unicode_escape')
'\n this is a docstring for\n the main function.\n a,\n b,\n c\n '
>>> print(codecs.decode(string, 'unicode_escape'))
this is a docstring for
the main function.
a,
b,
c
>>> codecs.decode('\\t\\n\\r\\xa0\\040', 'unicode_escape')
'\t\n\r\xa0 '
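One Python 3 caveat worth adding (my note, not part of the original answer): unicode_escape treats its input bytes as Latin-1, so non-ASCII text only survives the round trip if you go through Latin-1 explicitly; encoding the string as UTF-8 first would garble accented characters. A small sketch:

```python
import codecs

# The string mixes a real non-ASCII character with a literal backslash-n.
s = 'caf\u00e9 says: \\n hello'

# Encode via latin-1, then decode with unicode_escape: the accented
# character is preserved and the '\n' escape becomes a real newline.
fixed = codecs.decode(s.encode('latin-1'), 'unicode_escape')
print(repr(fixed))  # 'café says: \n hello'
```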
Why straightforward str.replace() won't cut it
You could try to do this yourself with str.replace(), but then you also need to implement proper escape parsing. Take \\\\n for example; this is \\n, escaped. If you naively apply str.replace() in sequence, you end up with \n or \\\n instead:
>>> '\\\\n'.decode('string_escape')
'\\n'
>>> '\\\\n'.replace('\\n', '\n').replace('\\\\', '\\')
'\\\n'
>>> '\\\\n'.replace('\\\\', '\\').replace('\\n', '\n')
'\n'
The \\ pair should be replaced by just one \ character, leaving the n uninterpreted. But the replace approach either ends up replacing the trailing \ together with the n with a newline character, or it first replaces \\ with \, and then that \ and the n are replaced by a newline. Either way, you end up with the wrong output.
The slow way to handle this, manually
You'll have to process the characters one by one instead, pulling in more characters as needed:
_map = {
    '\\\\': '\\',
    "\\'": "'",
    '\\"': '"',
    '\\a': '\a',
    '\\b': '\b',
    '\\f': '\f',
    '\\n': '\n',
    '\\r': '\r',
    '\\t': '\t',
}

def unescape_string(s):
    output = []
    i = 0
    while i < len(s):
        c = s[i]
        i += 1
        if c != '\\':
            output.append(c)
            continue
        c += s[i]
        i += 1
        if c in _map:
            output.append(_map[c])
            continue
        if c == '\\x' and i + 1 < len(s):  # hex escape, needs two more characters
            point = int(s[i] + s[i + 1], 16)
            i += 2
            output.append(chr(point))
            continue
        if c == '\\0':  # octal escape
            while len(c) < 4 and i < len(s) and s[i].isdigit():
                c += s[i]
                i += 1
            point = int(c[1:], 8)
            output.append(chr(point))
    return ''.join(output)
This can now handle the \xhh escapes, the \0.. octal escapes and the standard one-letter escapes, but not octal sequences that don't start with 0, \uhhhh Unicode code points, or \N{name} Unicode name references, nor does it handle malformed escapes in quite the same way as Python would.
But it does handle the escaped escape properly:
>>> unescape_string(string)
'\n this is a docstring for\n the main function.\n a,\n b,\n c\n '
>>> unescape_string('\\\\n')
'\\n'
Do know this is far slower than using the built-in codec.

The simplest solution to this is just to use a str.replace() call:
s = '\\n this is a docstring for\\n the main function.\\n a,\\n b,\\n c\\n '
s1 = s.replace('\\n', '\n')
s1
Output:
'\n this is a docstring for\n the main function.\n a,\n b,\n c\n '

def convert_text(text):
    return text.replace("\\n", "\n").replace("\\t", "\t")

text = '\\n this is a docstring for\\n the main function.\\n a,\\n b,\\n c\\n '
print convert_text(text)
Output:
this is a docstring for
the main function.
a,
b,
c


How can I sanitize a string so that it contains only printable ASCII chars?

I want a function which will sanitize a string. The string returned by the sanitizer should contain only ASCII characters #32 (the space character) through #126 ('~').
ASCII character #9 (the tab character) is to be replaced by four spaces. All other illegal characters are to be replaced by empty strings; for example, "\n" will be replaced with the empty string. We do not want illegal characters replaced by strings representing the relevant escape sequences; for example, we do not want a newline character replaced by a backslash character and an 'n' character.
It is fine if the final string is Unicode-encoded, instead of ASCII. I just want the only allowed characters to be as follows:
" !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
EXAMPLE USAGE:
unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"
safe_string = sanitize(unsafe_string)
print(safe_string)
OUTPUT:
APPLES AND BANANAS
EDIT:
The following attempted solutions do not work because they fail to filter out new-line characters.
import string
import re
unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"
safe_string = re.sub(r'[^\x00-\x7f]',r'', unsafe_string)
print(safe_string)
printable = set(string.printable)
safe_string = ''.join(filter(lambda x: x in printable, unsafe_string))
print(safe_string)
import re

def sanitize(s):
    s = s.replace("\t", " ")
    return re.sub(r"[^ -~]", "", s)
[ -~] means 'everything in the range from (space) to ~'. Adding ^ at the beginning means everything except that.
The output is:
APPLES      AND BANANAS
(The runs of spaces are kept; collapse them with an extra re.sub(r' +', ' ', ...) if a single space is wanted.)
In your example output, you forgot to replace tabs with spaces.
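A variant of the function above that honors the question's four-spaces-per-tab requirement (my adjustment, not the original answer's code; note runs of whitespace are still kept):

```python
import re

def sanitize(s):
    s = s.replace("\t", "    ")       # four spaces per tab, as the question asks
    return re.sub(r"[^ -~]", "", s)   # then drop anything outside space..'~'

print(sanitize("\u2502\u251cAPPLES\n\tBANANAS"))  # APPLES    BANANAS
```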
You can iterate over the characters, get codepoints, and check against allowed values:
import re

def sanitize(unsafe_str):
    allowed_range = set(range(32, 127))
    safe_str = ''
    for char in unsafe_str:
        cp = ord(char)
        if cp in allowed_range:
            safe_str += char
        elif cp == 9:  # tab
            safe_str += ' ' * 4
    return re.sub(r'\s+', ' ', safe_str)
Example:
In [1042]: unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"

In [1043]: def sanitize(unsafe_str):
      ...:     allowed_range = set(range(32, 127))
      ...:     safe_str = ''
      ...:     for char in unsafe_str:
      ...:         cp = ord(char)
      ...:         if cp in allowed_range:
      ...:             safe_str += char
      ...:         elif cp == 9:
      ...:             safe_str += ' ' * 4
      ...:     return re.sub(r'\s+', ' ', safe_str)
      ...:

In [1044]: sanitize(unsafe_string)
Out[1044]: 'APPLES AND BANANAS'
The last re.sub(r'\s+', ' ', safe_str) chunk is to compress whitespaces to one. If you don't want that only do return safe_str:
In [1046]: def sanitize(unsafe_str):
      ...:     allowed_range = set(range(32, 127))
      ...:     safe_str = ''
      ...:     for char in unsafe_str:
      ...:         cp = ord(char)
      ...:         if cp in allowed_range:
      ...:             safe_str += char
      ...:         elif cp == 9:
      ...:             safe_str += ' ' * 4
      ...:     return safe_str
      ...:
In [1047]: sanitize(unsafe_string)
Out[1047]: 'APPLES                     AND BANANAS'
(Without the collapse step, the spaces produced by the expanded tabs remain.)
FWIW, this builds the allowed set on each call of the function, but as it's a constant you can put it at the module level so it is built only once, e.g.:
ALLOWED_RANGE = set(range(32, 127))
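Putting that together, a hoisted-constant version of the answer's function might look like this (a sketch of the suggestion above, not the answerer's exact code):

```python
import re

ALLOWED_RANGE = set(range(32, 127))  # printable ASCII, built once at import time

def sanitize(unsafe_str):
    safe_str = ''
    for char in unsafe_str:
        cp = ord(char)
        if cp in ALLOWED_RANGE:
            safe_str += char
        elif cp == 9:  # tab becomes four spaces
            safe_str += ' ' * 4
    return re.sub(r'\s+', ' ', safe_str)  # collapse whitespace runs

print(sanitize("\u2502APPLES\n\t AND BANANAS"))  # APPLES AND BANANAS
```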

How to copy spaces from one string to another in Python?

I need a way to copy all of the positions of the spaces of one string to another string that has no spaces.
For example:
string1 = "This is a piece of text"
string2 = "ESTDTDLATPNPZQEPIE"
output = "ESTD TD L ATPNP ZQ EPIE"
Insert characters as appropriate into a placeholder list, then concatenate it using str.join.
it = iter(string2)
output = ''.join([next(it) if not c.isspace() else ' ' for c in string1])
print(output)
ESTD TD L ATPNP ZQ EPIE
This is efficient as it avoids repeated string concatenation.
You need to iterate over the indexes and characters in string1 using enumerate().
On each iteration, if the character is a space, add a space to the output string (note that this is inefficient as you are creating a new object as strings are immutable), otherwise add the character in string2 at that index to the output string.
So that code would look like:
output = ''
si = 0
for i, c in enumerate(string1):
    if c == ' ':
        si += 1
        output += ' '
    else:
        output += string2[i - si]
However, it would be more efficient to use a very similar method, but with a generator and then str.join. This removes the slow concatenations to the output string:
def chars(s1, s2):
    si = 0
    for i, c in enumerate(s1):
        if c == ' ':
            si += 1
            yield ' '
        else:
            yield s2[i - si]

output = ''.join(chars(string1, string2))
You can try the insert method:
string1 = "This is a piece of text"
string2 = "ESTDTDLATPNPZQEPIE"
string3 = list(string2)
for j, i in enumerate(string1):
    if i == ' ':
        string3.insert(j, ' ')
print("".join(string3))
Output:
ESTD TD L ATPNP ZQ EPIE

How do I split a comma delimited string in Python except for the commas that are within quotes

I am trying to split a comma delimited string in python. The tricky part for me here is that some of the fields in the data themselves have a comma in them and they are enclosed within quotes (" or '). The resulting split string should also have the quotes around the fields removed. Also, some fields can be empty.
Example:
hey,hello,,"hello,world",'hey,world'
needs to be split into 5 parts like below
['hey', 'hello', '', 'hello,world', 'hey,world']
Any ideas/thoughts/suggestions/help with how to go about solving the above problem in Python would be much appreciated.
Thank You,
Vish
Sounds like you want the CSV module.
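For reference, here is what the stock csv module does with the example input (a quick sketch of my own; note its default dialect understands the double-quoted field but not the single-quoted one, which is what the hand-rolled parsers below address):

```python
import csv
import io

# csv.reader's default dialect treats only '"' as a quote character, so the
# single-quoted field gets split at its internal comma.
row = next(csv.reader(io.StringIO('hey,hello,,"hello,world",\'hey,world\'')))
print(row)  # ['hey', 'hello', '', 'hello,world', "'hey", "world'"]
```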
(Edit: The original answer had trouble with empty fields on the edges due to the way re.findall works, so I refactored it a bit and added tests.)
import re

def parse_fields(text):
    r"""
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
    ['hey', 'hello', '', 'hello,world', 'hey,world']
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
    ['hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(''))
    ['']
    >>> list(parse_fields(','))
    ['', '']
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
    >>> list(parse_fields('testing,"unterminated quotes'))
    ['testing', '"unterminated quotes']
    """
    pos = 0
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
    while True:
        m = exp.search(text, pos)
        result = m.group(2)
        separator = m.group(3)
        yield result
        if not separator:
            break
        pos = m.end(0)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
(['"]?) matches an optional single- or double-quote.
(.*?) matches the string itself. This is a non-greedy match, to match as much as necessary without eating the whole string. This is assigned to result, and it's what we actually yield as a result.
\1 is a backreference, to match the same single- or double-quote we matched earlier (if any).
(,|$) matches the comma separating each entry, or the end of the line. This is assigned to separator.
If separator is falsy (e.g. empty), that means there's no separator, so we're at the end of the string and we're done. Otherwise, we update the new start position based on where the regex finished (m.end(0)), and continue the loop.
The csv module won't handle the scenario of " and ' being quotes at the same time. Absent a module that provides that kind of dialect, one has to get into the parsing business. To avoid reliance on a third party module, we can use the re module to do the lexical analysis, using the re.MatchObject.lastindex gimmick to associate a token type with the matched pattern.
The following code when run as a script passes all the tests shown, with Python 2.7 and 2.2.
import re

# lexical token symbols
DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = xrange(5)

_pattern_tuples = (
    (r'"[^"]*"', DQUOTED),
    (r"'[^']*'", SQUOTED),
    (r",", COMMA),
    (r"$", NEWLINE),        # matches end of string OR \n just before end of string
    (r"[^,\n]+", UNQUOTED), # order in the above list is important
)

_matcher = re.compile(
    '(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')',
).match

_toktype = [None] + [i[1] for i in _pattern_tuples]
# need dummy at start because re.MatchObject.lastindex counts from 1

def csv_split(text):
    """Split a csv string into a list of fields.
    Fields may be quoted with " or ' or be unquoted.
    An unquoted string can contain both a " and a ', provided neither is at
    the start of the string.
    A trailing \n will be ignored if present.
    """
    fields = []
    pos = 0
    want_field = True
    while 1:
        m = _matcher(text, pos)
        if not m:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        ttype = _toktype[m.lastindex]
        if want_field:
            if ttype in (DQUOTED, SQUOTED):
                fields.append(m.group(0)[1:-1])
                want_field = False
            elif ttype == UNQUOTED:
                fields.append(m.group(0))
                want_field = False
            elif ttype == COMMA:
                fields.append("")
            else:
                assert ttype == NEWLINE
                fields.append("")
                break
        else:
            if ttype == COMMA:
                want_field = True
            elif ttype == NEWLINE:
                break
            else:
                print "*** Error dump ***", ttype, repr(m.group(0)), fields
                raise ValueError("Missing comma at offset %d in %r" % (pos, text))
        pos = m.end(0)
    return fields
if __name__ == "__main__":
    tests = (
        ("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
        ("""\n""", ['']),
        ("""""", ['']),
        ("""a,b\n""", ['a', 'b']),
        ("""a,b""", ['a', 'b']),
        (""",,,\n""", ['', '', '', '']),
        ("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
        ("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
    )
    for text, expected in tests:
        result = csv_split(text)
        print
        print repr(text)
        print repr(result)
        print repr(expected)
        print result == expected
I fabricated something like this. Very redundant I suppose, but it does the job for me. You have to adapt it a bit to your specifications:
import re

def csv_splitter(line):
    splitthese = [0]
    splitted = []
    splitpos = True
    for nr, i in enumerate(line):
        if i == "\"" and splitpos == True:
            splitpos = False
        elif i == "\"" and splitpos == False:
            splitpos = True
        if i == "," and splitpos == True:
            splitthese.append(nr)
    splitthese.append(len(line) + 1)
    for i in range(len(splitthese) - 1):
        splitted.append(re.sub("^,|\"", "", line[splitthese[i]:splitthese[i + 1]]))
    return splitted
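A quick self-contained check of the splitter above (definition repeated so the snippet runs on its own; note it only understands double quotes, not single quotes):

```python
import re

def csv_splitter(line):
    splitthese = [0]        # indices of the commas to split on
    splitted = []
    splitpos = True         # False while inside a double-quoted field
    for nr, i in enumerate(line):
        if i == "\"" and splitpos == True:
            splitpos = False
        elif i == "\"" and splitpos == False:
            splitpos = True
        if i == "," and splitpos == True:
            splitthese.append(nr)
    splitthese.append(len(line) + 1)
    for i in range(len(splitthese) - 1):
        # strip the leading comma and any quote characters from each slice
        splitted.append(re.sub("^,|\"", "", line[splitthese[i]:splitthese[i + 1]]))
    return splitted

print(csv_splitter('a,"b,c",d'))  # ['a', 'b,c', 'd']
```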

Removing unwanted characters from a string in Python

I have some strings that I want to delete some unwanted characters from them.
For example: Adam'sApple ----> AdamsApple.(case insensitive)
Can someone help me, I need the fastest way to do it, cause I have a couple of millions of records that have to be polished.
Thanks
One simple way:
>>> s = "Adam'sApple"
>>> x = s.replace("'", "")
>>> x
'AdamsApple'
... or take a look at regex substitutions.
Here is a function that removes all the irritating ascii characters, the only exception is "&" which is replaced with "and". I use it to police a filesystem and ensure that all of the files adhere to the file naming scheme I insist everyone uses.
def cleanString(incomingString):
    newstring = incomingString
    newstring = newstring.replace("!", "")
    newstring = newstring.replace("@", "")
    newstring = newstring.replace("#", "")
    newstring = newstring.replace("$", "")
    newstring = newstring.replace("%", "")
    newstring = newstring.replace("^", "")
    newstring = newstring.replace("&", "and")
    newstring = newstring.replace("*", "")
    newstring = newstring.replace("(", "")
    newstring = newstring.replace(")", "")
    newstring = newstring.replace("+", "")
    newstring = newstring.replace("=", "")
    newstring = newstring.replace("?", "")
    newstring = newstring.replace("\'", "")
    newstring = newstring.replace("\"", "")
    newstring = newstring.replace("{", "")
    newstring = newstring.replace("}", "")
    newstring = newstring.replace("[", "")
    newstring = newstring.replace("]", "")
    newstring = newstring.replace("<", "")
    newstring = newstring.replace(">", "")
    newstring = newstring.replace("~", "")
    newstring = newstring.replace("`", "")
    newstring = newstring.replace(":", "")
    newstring = newstring.replace(";", "")
    newstring = newstring.replace("|", "")
    newstring = newstring.replace("\\", "")
    newstring = newstring.replace("/", "")
    return newstring
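A compact sketch of the same idea using str.translate (Python 3); the character set mirrors the function above and is my assumption, not a published spec:

```python
# Characters to delete outright (assumed to match the chain of replaces above).
_DELETE = "!#@$%^*()+=?'\"{}[]<>~`:;|\\/"

# str.maketrans accepts a dict: None deletes a character, a string replaces it.
_TABLE = str.maketrans({**{c: None for c in _DELETE}, '&': 'and'})

def clean_string(s):
    return s.translate(_TABLE)

print(clean_string("a&b!(c)"))  # aandbc
```

Building the table once and calling translate is also considerably faster than thirty chained replace() calls when you have millions of strings.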
Any characters in the 2nd argument of the translate method are deleted:
>>> "Adam's Apple!".translate(None,"'!")
'Adams Apple'
NOTE: translate requires Python 2.6 or later to use None for the first argument, which otherwise must be a translation string of length 256. string.maketrans('','') can be used in place of None for pre-2.6 versions.
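For completeness (my addition): in Python 3, str.translate takes a mapping instead, and the third argument of str.maketrans lists the characters to delete:

```python
# Python 3 equivalent of the Python 2 translate(None, "'!") call above.
s = "Adam's Apple!"
print(s.translate(str.maketrans('', '', "'!")))  # Adams Apple
```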
Try:
"Adam'sApple".replace("'", '')
One step further, to replace multiple characters with nothing:
import re
print re.sub(r'''['"x]''', '', '''a'"xb''')
Yields:
ab
s = s.replace("'", "")
As has been pointed out several times now, you have to use either replace() or regular expressions (most likely you don't need regexes, though), but if you also have to make sure that the resulting string is plain ASCII (doesn't contain funky characters like é, ò, µ, æ or φ), you could finally do
>>> u'(like é, ò, µ, æ or φ)'.encode('ascii', 'ignore')
'(like , , , or )'
An alternative that will take in a string and an array of unwanted chars
# function that removes unwanted signs from a string
# pass the string to the function along with an array of unwanted chars
def removeSigns(str, arrayOfChars):
    charFound = False
    newstr = ""
    for letter in str:
        for char in arrayOfChars:
            if letter == char:
                charFound = True
                break
        if charFound == False:
            newstr += letter
        charFound = False
    return newstr
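The same behavior in a shorter form (my rewrite, not the answerer's code), using a set for O(1) membership tests and a generator expression instead of repeated string concatenation:

```python
def remove_signs(s, unwanted_chars):
    unwanted = set(unwanted_chars)  # set lookup beats the inner loop above
    return ''.join(c for c in s if c not in unwanted)

print(remove_signs("Adam'sApple!", ["'", "!"]))  # AdamsApple
```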
Let's say we have the following list:
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda', 'south carolina##', 'West virginia?']
Now we will define a function clean_strings():
import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result
When we call the function clean_strings(states)
The result will look like:
['Alabama',
'Georgia',
'Georgia',
'Georgia',
'Florida',
'South Carolina',
'West Virginia']
I am probably late to the answer, but I think the code below would also do (taken to an extreme); it will remove all the unnecessary chars:
import re

a = '; niraj kale 984wywn on 2/2/2017'
a = re.sub('[^a-zA-Z0-9.?]', ' ', a)
a = a.replace('  ', ' ').strip()
which will give
'niraj kale 984wywn on 2 2 2017'

collapsing whitespace in a string

I have a string that kind of looks like this:
"stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD"
and I want to strip off all punctuation, make everything uppercase and collapse all whitespace so that it looks like this:
"STUFF MORE STUFF STUFF DD"
Is this possible with one regex or do I need to combine more than two? This is what I have so far:
def normalize(string):
    import re
    string = string.upper()
    rex = re.compile(r'\W')
    rex_s = re.compile(r'\s{2,}')
    result = rex.sub(' ', string) # this produces a string with tons of whitespace padding
    result = rex.sub('', result) # this reduces all those spaces
    return result
The only thing that doesn't work is the whitespace collapsing. Any ideas?
Here's a single-step approach (but the uppercasing actually uses a string method -- much simpler!):
rex = re.compile(r'\W+')
result = rex.sub(' ', strarg).upper()
where strarg is the string argument (don't use names that shadow builtins or standard library modules, please).
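Applying that one-liner to the question's own string (my demo, with an extra strip() for tidy edges; strarg stands for the argument name used above):

```python
import re

strarg = "stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD"
rex = re.compile(r'\W+')                      # one or more non-word characters
result = rex.sub(' ', strarg).strip().upper() # collapse, trim, uppercase
print(result)  # STUFF MORE STUFF STUFF DD
```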
s = "$$$aa1bb2 cc-dd ee_ff ggg."
re.sub(r'\W+', ' ', s).upper()
# ' AA1BB2 CC DD EE_FF GGG '
Is _ punctuation?
re.sub(r'[_\W]+', ' ', s).upper()
# ' AA1BB2 CC DD EE FF GGG '
Don't want the leading and trailing space?
re.sub(r'[_\W]+', ' ', s).strip().upper()
# 'AA1BB2 CC DD EE FF GGG'
result = rex.sub(' ', string) # this produces a string with tons of whitespace padding
result = rex.sub('', result) # this reduces all those spaces
Because you typo'd and forgot to use rex_s for the second call instead. Also, you need to substitute at least one space back in or you'll end up with any multiple-space gap becoming no gap at all, instead of a single-space gap.
result = rex.sub(' ', string) # this produces a string with tons of whitespace padding
result = rex_s.sub(' ', result) # this reduces all those spaces
Do you have to use regular expressions? Do you feel you must do it in one line?
>>> import string
>>> s = "stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD"
>>> s2 = ''.join(c for c in s if c in string.letters + ' ')
>>> ' '.join(s2.split())
'stuff morestuff stuff DD'
Works in Python 3. This will retain the same whitespace character you collapsed, so if you have a tab and a space next to each other they won't collapse into a single character.
def collapse_whitespace_characters(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            if not cur_char.isspace() or cur_char != prev_char:
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret
this one will collapse whitespace sets into the first whitespace character it sees
def collapse_whitespace(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            if not cur_char.isspace() or \
               (cur_char.isspace() and not prev_char.isspace()):
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret
>>> collapse_whitespace_characters('we like spaces and\t\t TABS AND WHATEVER\xa0\xa0IS')
'we like spaces and\t TABS\tAND WHATEVER\xa0IS'
>>> collapse_whitespace('we like spaces and\t\t TABS AND WHATEVER\xa0\xa0IS')
'we like spaces and\tTABS\tAND WHATEVER\xa0IS'
For punctuation:
def collapse_punctuation(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            if cur_char.isalnum() or cur_char != prev_char:
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret
To actually answer the question:
orig = 'stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD'
collapse_whitespace(''.join([(c.upper() if c.isalnum() else ' ') for c in orig]))
As said, the regexp would be something like:
re.sub(r'\W+', ' ', orig).upper()
One can use regular expression to substitute reoccurring white spaces.
White space is given by \s with \s+ meaning: at least one.
import re

rex = re.compile(r'\s+')
test = " x  y   z  z"
res = rex.sub(' ', test)
print(f">{res}<")
> x y z z<
Note this also affects/includes carriage return, etc.
