Can't delete "\r\n" from a string - python

I have a string like this:
la lala 135 1039 921\r\n
And I can't remove the \r\n.
Initially this string was a bytes object, but then I converted it to a string.
I tried .strip("\r\n") and .replace("\r\n", ""), but neither had any effect.

>>> my_string = "la lala 135 1039 921\r\n"
>>> my_string.rstrip()
'la lala 135 1039 921'
Alternate solution with just slicing off the end, which works better with the bytes->string situation:
>>> my_string = b"la lala 135 1039 921\r\n"
>>> my_string = my_string.decode("utf-8")
>>> my_string = my_string[0:-2]
>>> my_string
'la lala 135 1039 921'
Or, if you prefer, a regex solution (this removes every \r\n in the string, not just a trailing one):
import re
re.sub(r'\r\n', '', my_string)
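One plausible cause of the original symptom, offered as an assumption since the question does not show how the conversion was done: if the bytes object was converted with str() instead of .decode(), the result is the repr text of the bytes, which contains a literal backslash followed by r and n rather than real line-ending characters. A minimal sketch (variable names are illustrative):

```python
raw = b"la lala 135 1039 921\r\n"

# str() on a bytes object (Python 3) returns its repr text, not the decoded content.
wrong = str(raw)
# .decode() returns the real characters, including a real CR and LF.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))

# Real line endings are removed by rstrip(); the repr text is left untouched.
cleaned = right.rstrip("\r\n")
print(repr(cleaned))
print(wrong.rstrip("\r\n") == wrong)
```

If the first repr shows a b'...' wrapper inside the string, decoding the bytes properly is likely the simpler fix.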

The issue is that the string contains a literal backslash followed by a character. Normally, when written into a string such as .strip("\r\n") these are interpreted as escape sequences, with "\r" representing a carriage return (0x0D in the ASCII table) and "\n" representing a line feed (0x0A).
Because Python interprets a backslash as the beginning of an escape sequence, you need to follow it by another backslash to signify that you mean a literal backslash. Therefore, the calls need to be .strip("\\r\\n") and .replace("\\r\\n", "").
Note: you really don't want to use .strip() here. It treats its argument as a set of characters and removes any of them from both ends of the string, so it will also strip backslashes and the letters "r" and "n" that belong to your data. .replace() is a little better in that it matches the whole sequence, but it will also match \r\n in the middle of the string, not just at the end. The most straightforward way to remove the sequence is the conditional given below.
You can see the list of escape sequences Python supports in the String and Byte Literals subsection of the Lexical Analysis section in the Python Language Reference.
For what it's worth, I would not use .strip() to remove the sequence. .strip() removes every leading and trailing character that appears in its argument (it treats the argument as a set of characters, not as a pattern to match). .replace() would be a better choice, or simply use slice notation to remove the trailing "\\r\\n" from the string when you detect it's present:
if s.endswith("\\r\\n"):
    s = s[:-4]
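To make the difference between a real line ending and the backslash text concrete, here is a small sketch. The helper name strip_line_ending is made up for illustration and is not part of any library:

```python
def strip_line_ending(s):
    """Remove one trailing line ending, real or written out as backslash text.

    Hypothetical helper for illustration only.
    """
    if s.endswith("\r\n"):      # real CR + LF: 2 characters
        return s[:-2]
    if s.endswith("\\r\\n"):    # literal \ r \ n: 4 characters
        return s[:-4]
    return s

real = "la lala 135 1039 921\r\n"
literal = "la lala 135 1039 921\\r\\n"

print(len(real), len(literal))
print(repr(strip_line_ending(real)))
print(repr(strip_line_ending(literal)))

# strip() with the same argument works on a character set and eats real data too.
over_stripped = "corn\\r\\n".strip("\\r\\n")
print(repr(over_stripped))
```

The last line shows why .strip("\\r\\n") is risky: the trailing "rn" of the word is removed along with the unwanted sequence.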

'\r\n' is also a standard line delimiter for .splitlines(), so this can also work.
>>> s = "la lala 135 1039 921\r\n"
>>> type(s)
<class 'str'>
>>> t = ''.join(s.splitlines())
>>> t
'la lala 135 1039 921'
>>> type(t)
<class 'str'>

You could also determine the length of the string, say 20 characters, and truncate it to 18 regardless of what the last two characters are, or verify that they are the expected characters before you do that. Sometimes it helps to compare the ASCII values first. In pseudo logic:
if the last character in the string is tab, CR, LF, or ?, shorten the string by one. Repeat until the string no longer ends with tab, CR, LF, etc.
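The pseudo logic above could be sketched roughly as follows. The function name trim_trailing and the default character set are assumptions chosen for the example:

```python
TRAILING = "\t\r\n"  # tab, carriage return, line feed

def trim_trailing(s, chars=TRAILING):
    # Drop one character at a time while the last one is in `chars`.
    while s and s[-1] in chars:
        s = s[:-1]
    return s

sample = "la lala 135 1039 921\r\n"
print(repr(trim_trailing(sample)))

# The built-in str.rstrip() does the same job in a single call.
print(trim_trailing(sample) == sample.rstrip(TRAILING))
```

In practice rstrip() with an explicit character set is usually the shorter way to express the same loop.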

Related

Proper replacement of "beginning" non-alphanumeric characters, in python, using regular expressions

NOTE: This post is not the same as the post "Re.sub not working for me".
That post is about matching and replacing ANY non-alphanumeric substring in a string.
This question is specifically about matching and replacing non-alphanumeric substrings that explicitly show up at the beginning of a string.
The following method attempts to match any non-alphanumeric character string "AT THE BEGINNING" of a string and replace it with a new string "BEGINNING_"
def m_getWebSafeString(self, dirtyAttributeName):
    cleanAttributeName = ''.join(dirtyAttributeName)
    # Deal with beginning of string...
    cleanAttributeName = re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
    # Deal with end of string...
    if "BEGINNING_" in cleanAttributeName:
        print ' ** ** ** D: "{}" ** ** ** C: "{}"'.format(dirtyAttributeName, cleanAttributeName)
PROBLEM DESCRIPTION: The method seems not only to replace non-alphanumeric characters but also to incorrectly insert the "BEGINNING_" string at the beginning of every string passed into it. In other words...
GOOD RESULT: If the method is passed the string *##$ThisIsMyString1, it correctly returns BEGINNING_ThisIsMyString1
BAD/UNWANTED RESULT: However, if the method is passed the string ThisIsMyString2, it incorrectly (and always) inserts the replacement string (BEGINNING_), even when there are no non-alphanumeric characters, and yields the result BEGINNING_ThisIsMyString2
MY QUESTION: What is the correct way to write the re.sub() line so that it only replaces non-alphanumeric characters at the beginning of the string and does not always insert the replacement string at the beginning of the original input string?
You're matching 0 or more instances of non-alphabetic characters by using the * quantifier, which means the pattern always matches, even when the match is just the empty string at the start of the input. You can replace what you have with
re.sub('^[^a-zA-Z]+', ...)
to ensure that only 1 or more instances are matched.
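A quick side-by-side check of the two quantifiers, using the two sample inputs from the question (the results dictionary is just scaffolding for the example):

```python
import re

results = {}
for name in ("*##$ThisIsMyString1", "ThisIsMyString2"):
    # `*` also matches the empty string at position 0, so it always substitutes.
    star = re.sub(r"^[^a-zA-Z]*", "BEGINNING_", name)
    # `+` needs at least one non-alphabetic character before it substitutes.
    plus = re.sub(r"^[^a-zA-Z]+", "BEGINNING_", name)
    results[name] = (star, plus)
    print(name, "->", star, "|", plus)
```

With `*` both inputs gain the prefix; with `+` only the input that really starts with non-alphabetic characters does.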
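Replace
re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
with
re.sub('^[^a-zA-Z]+',"BEGINNING_",cleanAttributeName)
Note the A-Z as well: the range A-z in the original also covers the characters [, \, ], ^, _ and `, so those would not be treated as non-alphabetic.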
There is a more elegant solution. You can use this:
re.sub(r'^\W+', 'BEGINNING_', cleanAttributeName)
\W matches any non-word character; for ASCII text this is equivalent to the class [^a-zA-Z0-9_], so unlike [^a-zA-Z] it leaves leading digits and underscores in place.
>>> re.sub(r'^\W+', 'BEGINNING_', '##$ThisIsMyString1')
'BEGINNING_ThisIsMyString1'
>>> re.sub(r'^\W+', 'BEGINNING_', 'ThisIsMyString2')
'ThisIsMyString2'

How to use text strip() function?

I can strip numerics but not alpha characters:
>>> text
'132abcd13232111'
>>> text.strip('123')
'abcd'
Why does the following not work?
>>> text.strip('abcd')
'132abcd13232111'
The reason is simple and stated in the documentation of strip:
str.strip([chars])
Return a copy of the string with the leading and trailing characters removed.
The chars argument is a string specifying the set of characters to be removed.
'abcd' is neither leading nor trailing in the string '132abcd13232111' so it isn't stripped.
Just to add a few examples to Jim's answer, according to .strip() docs:
Return a copy of the string with the leading and trailing characters removed.
The chars argument is a string specifying the set of characters to be removed.
If omitted or None, the chars argument defaults to removing whitespace.
The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
So it doesn't matter whether the characters are digits or not; the main reason your second snippet didn't work as you expected is that the term "abcd" was located in the middle of the string.
Example 1:
s = '132abcd13232111'
print(s.strip('123'))
print(s.strip('abcd'))
Output:
abcd
132abcd13232111
Example 2:
t = 'abcd12312313abcd'
print(t.strip('123'))
print(t.strip('abcd'))
Output:
abcd12312313abcd
12312313
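To underline that the argument is a set of characters rather than a substring, and to show one way of removing "abcd" from the middle, here is a short sketch based on the string from the question:

```python
text = "132abcd13232111"

# strip() treats its argument as a set of characters to trim from both ends,
# so the order of the characters in the argument does not matter.
print(text.strip("123"))
print(text.strip("321"))

# "abcd" sits in the middle, so strip() has nothing to trim at either end.
print(text.strip("abcd"))

# To delete a substring wherever it occurs, use replace() instead.
print(text.replace("abcd", ""))
```

replace() is the tool for substrings; strip(), lstrip() and rstrip() only ever touch the ends.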

Regarding the regex in search module with and without raw text

I am doing the following in python2.7
>>> a='hello team 123'
>>> b=re.search('hello team [0-9]+',a)
>>>
>>> b
<_sre.SRE_Match object at 0x00000000022995E0>
>>> b=re.search(r'hello team [0-9]+',a)
>>> b
<_sre.SRE_Match object at 0x0000000002299578>
>>>
Now, as you can see, in one case I am using a raw string while in the other I am not.
From one of the posts on SO, I learnt:
The r means that the string is to be treated as a raw string, which means all escape codes will be ignored.
For an example:
'\n' will be treated as a newline character, while r'\n' will be treated as the characters \ followed by n
Then, why is my example working for both cases i.e with r and without r?
Is it because none of my example uses \ ?
Also please look at the attached screenshot
You are not using any special characters in your string, so r'' and '' will do the same thing.
In hello team [0-9]+ nothing needs to be escaped. It will be passed to the regex engine as it is. If you use special characters in your Python string, then you need to escape them to pass them to the regex engine.
There are two levels of escaping involved in a regex: the first level is the Python string literal and the second is the regex engine.
So for example:
'\\\\' --> Python(string translation) ---> '\\' ---> Regex Engine(translation) ---> '\'
In order to avoid Python string translation you use raw strings.
r'\\' --> Python(string translation) ---> '\\' ---> Regex Engine(translation) ---> '\'
>>> print repr('\\')
'\\'
>>> print repr(r'\\')
'\\\\'
>>> print str('\\')
\
>>> print str(r'\\')
\\
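For a case where the r prefix does change the result, consider a pattern that contains a backslash. This sketch reuses the string from the question; the \b word-boundary pattern is my own choice for illustration:

```python
import re

a = "hello team 123"

# In a normal literal, "\b" is a backspace character (0x08), not a word boundary.
plain = re.search("\bteam\b", a)
# In a raw literal the backslash survives, so the regex engine sees \b.
raw = re.search(r"\bteam\b", a)

print(plain)
print(raw.group(0))

# The same difference, seen through string lengths.
print(len("\n"), len(r"\n"))
```

The first search finds nothing because the pattern is looking for backspace characters around "team"; the raw version matches as intended.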

How to check if \n is in a string

I want to remove \n from a string if it is in a string.
I have tried:
slashn = str(chr(92))+"n"
if slashn in newString:
    newerString = newString.replace(slashn,'')
    print(newerString)
else:
    print(newString)
Assume that newString is a word that has \n at the end of it. E.g. text\n.
I have also tried the same code except slash equals to "\\"+"n".
Use str.replace() but with raw string literals:
newString = r"new\nline"
newerString = newString.replace(r"\n", "")
If you put an r right before the quotes enclosing a string literal, it becomes a raw string literal that does not treat any backslash characters as special escape sequences.
Example to clarify raw string literals (output is behind the #> comments):
# Normal string literal: single backslash escapes the 'n' and makes it a new-line character.
print("new\nline")
#> new
#> line
# Normal string literal: first backslash escapes the second backslash and makes it a
# literal backslash. The 'n' won't be escaped and stays a literal 'n'.
print("new\\nline")
#> new\nline
# Raw string literal: All characters are taken literally, the backslash does not have any
# special meaning and therefore does not escape anything.
print(r"new\nline")
#> new\nline
# Raw string literal: All characters are taken literally, no backslash has any
# special meaning and therefore they do not escape anything.
print(r"new\\nline")
#> new\\nline
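Which test you need depends on whether the string holds a real newline character or the two characters backslash and n. A small sketch covering both cases (the variable names are illustrative):

```python
real_newline = "text\n"     # ends with a line feed character
literal_text = "text\\n"    # ends with a backslash followed by the letter n

for s in (real_newline, literal_text):
    has_newline = "\n" in s
    has_backslash_n = "\\n" in s   # "\\n" is the same string as r"\n"
    print(repr(s), has_newline, has_backslash_n)

# Remove whichever form is present.
cleaned_real = real_newline.replace("\n", "")
cleaned_literal = literal_text.replace(r"\n", "")
print(cleaned_real, cleaned_literal)
```

Printing repr(s) first is a quick way to find out which of the two cases you are dealing with.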
You can use the string's strip() method, or strip('\n'). strip is a built-in method of strings.
Example:
>>>
>>>
>>> """vivek
...
... """
'vivek\n\n'
>>>
>>> """vivek
...
... """.strip()
'vivek'
>>>
>>> """vivek
...
... \n"""
'vivek\n\n\n'
>>>
>>>
>>> """vivek
...
... \n""".strip()
'vivek'
>>>
You can look up the built-in help for the string method strip like this:
>>>
>>> help(''.strip)
Help on built-in function strip:
strip(...)
S.strip([chars]) -> string or unicode
Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping
>>>
Use
string_here.rstrip('\n')
to remove the trailing newline.
Try with strip()
your_string.strip("\n") # removes \n before and after the string
If you want to remove the newline from the ends of a string, I'd use .strip(). If no arguments are given, it removes whitespace characters, which includes newlines (\n).
Using .strip():
if newString.endswith('\n'):  # Test whether the last character is "\n"
    newerString = newString.strip()
    print(newerString)
else:
    print(newString)
Another .strip() example (Using Python 2.7.9)
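Also, the newline character can simply be written as "\n" in a normal string literal.
Text="test.\nNext line."
print(Text)
Output:
test.
Next line.
This is because the text is stored in an ordinary quoted string literal, where \n is interpreted as a newline character. Only in a raw string literal such as r"test.\nNext line." would the backslash and the n be kept as two literal characters.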

How do I remove hex values in a python string with regular expressions?

I have a cell array in matlab
columns = {'MagX', 'MagY', 'MagZ', ...
'AccelerationX', 'AccelerationX', 'AccelerationX', ...
'AngularRateX', 'AngularRateX', 'AngularRateX', ...
'Temperature'}
I use these scripts which make use of matlab's hdf5write function to save the array in the hdf5 format.
I then read the hdf5 file into python using pytables. The cell array comes in as a numpy array of strings. I convert it to a list and this is the output:
>>>columns
['MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
'MagZ\x00\x00\x00\x00\x001',
'AccelerationX',
'AccelerationY',
'AccelerationZ',
'AngularRateX',
'AngularRateY',
'AngularRateZ',
'Temperature']
These hex values pop into the strings from somewhere and I'd like to remove them. They don't always appear on the first three items of the list and I need a nice way to deal with them or to find out why they are there in the first place.
>>>print columns[0]
Mag8�
>>>columns[0]
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>repr(columns[0])
"'MagX\\x00\\x00\\x00\\x08\\x01\\x008\\xe6\\x7f'"
>>>print repr(columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
I've tried using a regular expression to remove the hex values but have had little luck.
>>>re.sub('(\w*)\\\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('(\w*)\\\\x.*', r'\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub(r'(\w*)\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('([A-Za-z]*)\x00', r'\1', columns[0])
'MagX\x08\x018\xe6\x7f'
>>>re.sub('(\w*?)', '\1', columns[0])
'\x01M\x01a\x01g\x01X\x01\x00\x01\x00\x01\x00\x01\x08\x01\x01\x01\x00\x018\x01\xe6\x01\x7f\x01'
Any suggestions on how to deal with this?
You can remove all non-word characters in the following way:
>>> re.sub(r'[^\w]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'
The regex [^\w] will match any character that is not a letter, digit, or underscore. By providing that regex in re.sub with an empty string as a replacement you will delete all other characters in the string.
Since there may be other characters you want to keep, a better solution might be to specify a larger range of characters that you want to keep that excludes control characters. For example:
>>> re.sub(r'[^\x20-\x7e]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'
Or you could replace [^\x20-\x7e] with the equivalent [^ -~], depending on which seems more clear to you.
To exclude all characters after this first control character just add a .*, like this:
>>> re.sub(r'[^ -~].*', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX'
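If the extra bytes are leftover buffer contents following a NUL terminator (an assumption; the question does not confirm where they come from), you can also cut each string at its first \x00 without a regex. A minimal sketch using three of the values shown in the question:

```python
columns = [
    "MagX\x00\x00\x00\x08\x01\x008\xe6\x7f",
    "MagZ\x00\x00\x00\x00\x001",
    "Temperature",
]

# Keep only the text before the first NUL; strings without a NUL are unchanged.
cleaned = [name.split("\x00", 1)[0] for name in columns]
print(cleaned)
```

Unlike the character-class approach, this does not depend on which characters you consider printable, but it only helps when a NUL really marks the end of the useful text.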
The escape sequences are not literally in the string: you have unescaped control characters, which Python displays using hexadecimal notation. That's why you see an unusual symbol when you print the value.
You should be able to simply remove the extra levels of quoting in your regular expression. You might also rely on something like the re module's generic whitespace class \s, which matches whitespace characters other than just tabs and spaces. Note, however, that \s does not match a NUL character, so in the example below the \x00 is left in place and is simply invisible when printed:
>>> import re
>>> re.sub(r'\s', '?', "foo\x00bar")
'foo\x00bar'
>>> print re.sub(r'\s', '?', "foo\x00bar")
foobar
I use this one a bit to replace all input whitespace runs, including non-breaking space characters, with a single space:
>>> re.sub(r'[\xa0\s]+', ' ', input_str)
You can also do this without importing re. E.g. if you're content to keep only ASCII characters (code points below 128):
good_string = ''.join(c if ord(c) < 128 else '?' for c in bad_string)
