How can I remove every character other than a-z and A-Z from an input string in Python, and what will the string look like afterwards?
Do I need to use regular expressions?
If you need to use a regex, use a negative character class ([^...]):
re.sub(r'[^a-zA-Z]', '', inputtext)
A negative character class matches anything not named in the class.
Demo:
>>> import re
>>> inputtext = 'The quick brown fox!'
>>> re.sub(r'[^a-zA-Z]', '', inputtext)
'Thequickbrownfox'
But using str.translate() is way faster:
import string
ascii_letters = set(map(ord, string.ascii_letters))
non_letters = ''.join(chr(i) for i in range(256) if i not in ascii_letters)
inputtext.translate(None, non_letters)
Using str.translate() is more than 10 times faster than a regular expression:
>>> import timeit, re
>>> from functools import partial
>>> ascii_only = partial(re.compile(r'[^a-zA-Z]').sub, '')
>>> timeit.timeit('f(t)', 'from __main__ import ascii_only as f, inputtext as t')
7.903045892715454
>>> timeit.timeit('t.translate(None, m)', 'from __main__ import inputtext as t, non_letters as m')
0.5990171432495117
Using Jakub's method is slower still:
>>> timeit.timeit("''.join(c for c in t if c not in l)", 'from __main__ import inputtext as t; import string; l = set(string.letters)')
9.960685968399048
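The two-argument str.translate(None, deletechars) form used above exists only for Python 2 byte strings. As a rough sketch for Python 3 (an assumption, since the answers here target Python 2), one option is a translation table that keeps the ASCII letters and maps every other code point to None. The names KeepOnly, LETTER_TABLE and letters_only are made up for this illustration, and no timing claim is implied.

```python
import string


class KeepOnly(dict):
    """Translation table: listed code points are kept, all others are deleted."""

    def __missing__(self, key):
        # str.translate() deletes a character whose table entry is None.
        return None


# Map each ASCII letter's code point to itself so it survives translation.
LETTER_TABLE = KeepOnly((ord(c), ord(c)) for c in string.ascii_letters)


def letters_only(text):
    """Return text with everything except a-z and A-Z removed."""
    return text.translate(LETTER_TABLE)


print(letters_only('The quick brown fox!'))  # Thequickbrownfox
```

Because the table answers for every code point, this also drops non-ASCII characters, which a 256-entry delete string would not cover.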
You can use regex:
re.compile(r'[^a-zA-Z]').sub('', your_string)
You could also manage without regular expressions (e.g., if you had regex phobia):
import string
letters = set(string.letters)  # build the lookup set once
new_string = ''.join(c for c in old_string if c in letters)  # keep letters only
Although I would use a regex, this example has additional educational value: it shows a set, a generator expression, and the string library. Note that the set is not strictly needed here.
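string.letters is locale-dependent in Python 2 and was removed in Python 3, so the snippet above only runs on Python 2. A small sketch of the same idea for Python 3, assuming string.ascii_letters is an acceptable replacement (the names ASCII_LETTERS and keep_letters are hypothetical):

```python
import string

# Build the lookup set once instead of once per character.
ASCII_LETTERS = frozenset(string.ascii_letters)


def keep_letters(old_string):
    """Keep only the characters a-z and A-Z."""
    return ''.join(c for c in old_string if c in ASCII_LETTERS)


print(keep_letters('The quick brown fox!'))  # Thequickbrownfox
```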
I have a file, that contains both hex data and non-hex data.
For example, var _0x36ba=["\x69\x73\x41\x72\x72\x61\x79","\x63\x61\x6C\x6C","\x74\x6F\x53\x74\x72\x69\x6E\x67",]
When I directly paste this code in python console, I got var _0x36ba=["isArray","call","toString",]
But when I try to read the file and print contents, it gives me var _0x36ba=["\\x69\\x73\\x41\\x72\\x72\\x61\\x79","\\x63\\x61\\x6C\\x6C","\\x74\\x6F\\x53\\x74\\x72\\x69\\x6E\\x67","\\
Seems like backslashes are parsed as they are.
How can I read the file and obtain readable output?
You have string literals with \xhh hex escapes. You can decode these with the string_escape encoding:
text.decode('string_escape')
See the Python Specific Encodings section of the codecs module documentation:
string_escape
Produce a string that is suitable as string literal in Python source code
Decoding reverses that encoding:
>>> "\\x69\\x73\\x41\\x72\\x72\\x61\\x79".decode('string_escape')
'isArray'
>>> "\\x63\\x61\\x6C\\x6C".decode('string_escape')
'call'
>>> "\\x74\\x6F\\x53\\x74\\x72\\x69\\x6E\\x67".decode('string_escape')
'toString'
Being a built-in codec, this is a lot faster than using regular expressions:
>>> from timeit import timeit
>>> import re
>>> def unescape(text):
...     return re.sub(r'\\x([0-9a-fA-F]{2})',
...                   lambda m: chr(int(m.group(1), 16)), text)
...
>>> value = "\\x69\\x73\\x41\\x72\\x72\\x61\\x79"
>>> timeit('unescape(value)', 'from __main__ import unescape, value')
6.254786968231201
>>> timeit('value.decode("string_escape")', 'from __main__ import value')
0.43862390518188477
That's about 14 times faster.
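The string_escape codec and str.decode() are Python 2 only. On Python 3, a comparable result can be sketched with the unicode_escape codec, assuming the text is plain ASCII apart from the escapes (the name decode_hex_escapes is made up here):

```python
def decode_hex_escapes(text):
    """Turn literal \\xhh sequences in an ASCII string into characters."""
    # unicode_escape decodes bytes, so encode the text first.
    return text.encode('ascii').decode('unicode_escape')


raw = r'var _0x36ba=["\x69\x73\x41\x72\x72\x61\x79","\x63\x61\x6C\x6C"]'
print(decode_hex_escapes(raw))  # var _0x36ba=["isArray","call"]
```

Note that unicode_escape also interprets other escapes such as \n and \uXXXX, and that the encode('ascii') step raises UnicodeEncodeError on non-ASCII input, so this is only a starting point for files like the one in the question.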
EDIT: Please use Martijn's solution. I didn't know about text.decode('string_escape') yet, and of course it is way faster. My original answer follows below.
Use this regular expression to unescape all escaped hexadecimal expressions within the string:
import re

def unescape(text):
    return re.sub(r'\\\\|\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)) if m.group(1)
                  else '\\', text)
If you know that the input will not contain a double backslash followed by an x (e.g. foo bar \\x41 bloh, which should probably be interpreted as foo bar \x41 bloh rather than as foo bar \A bloh), then you can simplify this to:
def unescape(text):
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)
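To make the difference between the two variants concrete, here is a small side-by-side sketch run on the double-backslash example from the text. The names unescape_strict and unescape_simple are invented for this comparison; the patterns are the ones given above.

```python
import re


def unescape_strict(text):
    """Treat a double backslash as an escaped backslash, then decode \\xhh."""
    return re.sub(r'\\\\|\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)) if m.group(1) else '\\',
                  text)


def unescape_simple(text):
    """Decode every \\xhh sequence, even one preceded by a backslash."""
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)


sample = r'foo bar \\x41 bloh'   # two literal backslashes before x41
print(unescape_strict(sample))   # foo bar \x41 bloh
print(unescape_simple(sample))   # foo bar \A bloh
```

The first alternative in the strict pattern consumes the backslash pair before the \x branch gets a chance to match, which is why the escaped backslash survives.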
This works:
stripped_str = whatever_str.rstrip("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
but just seems very inelegant to me. Any cleaner way of doing it?
Perhaps you are looking for string.ascii_letters:
from string import ascii_letters
stripped_str = whatever_str.rstrip(ascii_letters)
It allows you to do the same as your current code, but without typing the entire alphabet.
Below is a demonstration:
>>> from string import ascii_letters
>>> ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>>
>>> '123abdjihdkffyifbgh'.rstrip(ascii_letters)
'123'
>>>
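Keep in mind that rstrip() treats its argument as a set of characters and stops at the first trailing character that is not in that set, so letters in the middle of the string are left alone. A short sketch (the sample strings are invented for illustration):

```python
from string import ascii_letters

# Only the trailing run of letters is removed; 'abc' in the middle stays.
print('123abc456def'.rstrip(ascii_letters))  # 123abc456

# A string with no trailing letters comes back unchanged.
print('abc123'.rstrip(ascii_letters))        # abc123
```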
I wonder if there is a simpler alternative (e.g. a single function call) to the matching and replacing in the following example:
>>> import re
>>>
>>> line = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
>>>
>>> match = re.match(r'^file://(.*)$', line)
>>> if match and match.group(1):
...     substitution = re.sub(r'%20', r' ', match.group(1))
...
>>> substitution
'/windows-d/academic discipline/study objects/areas/formal systems/math'
Thanks.
I'm going to dodge your regex question and suggest you use something else for this:
>>> line = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
>>> import urllib
>>> urllib.unquote(line)
'file:///windows-d/academic discipline/study objects/areas/formal systems/math'
Then just strip off the file:// with a slice or str.replace if necessary.
%20 (space) is not the only escaped character possible here, so it's better to use the right tool for the job than have your regex solution break later when there is another character needing un-escaping.
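In Python 3, unquote lives in urllib.parse rather than in urllib. A minimal sketch combining the unquoting with the slice mentioned above, assuming Python 3 (the names PREFIX and file_url_to_path are hypothetical):

```python
from urllib.parse import unquote

PREFIX = 'file://'


def file_url_to_path(url):
    """Percent-decode a file:// URL and drop the scheme prefix if present."""
    decoded = unquote(url)
    if decoded.startswith(PREFIX):
        decoded = decoded[len(PREFIX):]  # slice off 'file://'
    return decoded


line = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
print(file_url_to_path(line))
# /windows-d/academic discipline/study objects/areas/formal systems/math
```

Unlike a %20-only substitution, this also handles other escapes, for example %2B for a plus sign.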
You could try this simple Python code:
>>> import re
>>> line = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
>>> m = re.sub(r'%20|file://', r' ', line).strip()
>>> m
'/windows-d/academic discipline/study objects/areas/formal systems/math'
The call re.sub(r'%20|file://', r' ', line) replaces every occurrence of %20 or file:// with a space, and strip() then removes the leading and trailing whitespace.
>>> import re
>>> s = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
>>> re.sub(r'^file://(.*)$', lambda m: m.group(1).replace('%20', ' '), s)
'/windows-d/academic discipline/study objects/areas/formal systems/math'
>>> s = 'file:///windows-d/academic%20discipline/study%20objects/areas/formal%20systems/math'
>>> s.replace('file://', '').replace('%20', ' ')
'/windows-d/academic discipline/study objects/areas/formal systems/math'
I am using Python's re module to match a repeated sequence in text. For example, with
s = 'habcabcabcj', I tried the following code:
import re
re.findall(r'(abc)+', s)
And the result is: ["abc"]
If I want the match result to be ["abcabcabc"], how can I do this?
Use a non-capturing group (?:...):
>>> import re
>>> s = 'habcabcabcj'
>>> re.findall(r'(?:abc)+', s)
['abcabcabc']
>>>
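The reason is that re.findall() returns the contents of the group instead of the whole match when the pattern contains a capturing group, and a repeated group only remembers its last repetition. The sketch below shows both behaviours; the re.finditer() variant is an alternative added here, not part of the answer above.

```python
import re

s = 'habcabcabcj'

# With a capturing group, findall() returns the group's last repetition.
print(re.findall(r'(abc)+', s))      # ['abc']

# With a non-capturing group, findall() returns the whole match.
print(re.findall(r'(?:abc)+', s))    # ['abcabcabc']

# finditer() exposes the whole match as group(0) either way.
whole = [m.group(0) for m in re.finditer(r'(abc)+', s)]
print(whole)                         # ['abcabcabc']
```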
I need to replace the whitespace between two numbers with a comma:
15.30 396.90 => 15.30,396.90
In PHP this is used:
'/(?<=\d)\s+(?=\d)/', ','
How to do it in Python?
There are several ways to do it (sorry, Zen of Python). Which one to use depends on your input:
>>> s = "15.30 396.90"
>>> ",".join(s.split())
'15.30,396.90'
>>> s.replace(" ", ",")
'15.30,396.90'
or, using re, for example, this way:
>>> import re
>>> re.sub(r"(\d+)\s+(\d+)", r"\1,\2", s)
'15.30,396.90'
You can use the same regex with the re module in Python:
import re
s = '15.30 396.90'
s = re.sub(r'(?<=\d)\s+(?=\d)', ',', s)
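The capturing-group and lookaround patterns give the same result for the sample input, but they can differ when more than two numbers are involved. A short sketch with an invented input of single-digit numbers:

```python
import re

s = '1 2 3'

# The capturing version consumes the digits on both sides of the space,
# so the '2' is no longer available to start the next match.
captured = re.sub(r'(\d+)\s+(\d+)', r'\1,\2', s)
print(captured)     # 1,2 3

# Lookbehind and lookahead only assert that digits are present, so every
# run of whitespace between digits is replaced.
lookaround = re.sub(r'(?<=\d)\s+(?=\d)', ',', s)
print(lookaround)   # 1,2,3
```

For that reason the lookaround form is probably the safer choice when a line may hold a longer list of numbers.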