Remove special characters in a string using Python

I have a string, "msdjdgf(^&%*(Aroha Technologies&^$^&*^CHJdjg", that contains special characters.
What I am trying to do is remove all the special characters in the string and then display the words 'Aroha Technologies'.
I was able to do it by hard-coding with the lstrip() function, but can anyone show me how to get the string 'Aroha Technologies' in a single line using regular expressions?
Edit, as suggested:
Using the lstrip() and rstrip() functions, I was able to remove the characters from the string:
str = "msdjdgf(^&%*(Aroha Technologies&^$^&*^CHJdjg"
str=str.lstrip('msdjdgf(^&%*(')
str=str.rstrip('&^$^&*^CHJdjg')
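One thing worth knowing about the snippet above: lstrip() and rstrip() treat their argument as a set of characters to strip, not as a literal prefix or suffix. A small sketch of that behaviour (the first sample string is made up for illustration):

```python
# lstrip()/rstrip() remove any leading/trailing characters found in the
# argument, in any order, until a character outside that set is reached.
sample = "abcabcHello Worldcba"
trimmed = sample.lstrip("abc").rstrip("abc")
print(trimmed)

# The question's string works for the same reason: 'A' is not in the left
# set and the final 's' of 'Technologies' is not in the right set.
s = "msdjdgf(^&%*(Aroha Technologies&^$^&*^CHJdjg"
result = s.lstrip("msdjdgf(^&%*(").rstrip("&^$^&*^CHJdjg")
print(result)
```

If the wanted text itself began or ended with one of the listed characters, those characters would be stripped too, which is why this approach only suits a known, fixed input.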

Here is a slightly dirtier approach:
import re  # Python's regular-expression module
a = "msdjdgf(^&%*(Aroha Technologies&^$^&*^CHJdjg"
stuff = re.findall(r'\W(\w+\s\w+)\W', a)
print(stuff[0]) # Aroha Technologies
hope this helps ;)

You don't provide a lot of information, so this may or may not be close to what you want:
import re
origstr = "msdjdgf(^&%(Aroha Technologies&^$^&^CHJdjg"
match = re.search(r"[A-Z][a-z]*(?: [A-Z][a-z]*)*", origstr)
if match:
    newstr = match.group()
(This looks for a series of capitalized words with spaces between them.)
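For the single-line version the question asks for, here is one possible sketch. It assumes the wanted text is the only run of two or more alphabetic words separated by single spaces, which is a guess, since the question gives just one sample:

```python
import re

s = "msdjdgf(^&%*(Aroha Technologies&^$^&*^CHJdjg"

# Letters, followed by one or more groups of "single space + letters".
# The group is non-capturing, so findall returns the whole match.
phrases = re.findall(r"[A-Za-z]+(?: [A-Za-z]+)+", s)
print(phrases)
```

Runs of letters with no space in them, such as the junk at either end, are skipped because the pattern requires at least one space-separated word after the first.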

Related

Regex For Special Character (S with line on top)

I was trying to write a regex in Python to replace all non-ASCII characters with an underscore, but if one of the characters is "S̄" (an 'S' with a line on top), it adds an extra 'S'. Is there a way to account for this character as well? I believe it's a valid UTF-8 character, but not ASCII.
Here's the code:
import re
line = "ra*ndom wordS̄"
print(re.sub('[\W]', '_', line))
I would expect it to output:
ra_ndom_word_
But instead I get:
ra_ndom_wordS__
The reason Python works this way is that you are actually looking at two distinct characters: an S followed by a combining macron, U+0304.
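A quick way to see the two code points for yourself, using only the standard unicodedata module (the string is written with an escape here so the combining mark is visible in the source):

```python
import unicodedata

text = "S\u0304"  # 'S' followed by U+0304

# Print each code point, its Unicode name and its combining class
# (a non-zero combining class marks a combining character).
for ch in text:
    print(hex(ord(ch)), unicodedata.name(ch), unicodedata.combining(ch))

code_points = [unicodedata.name(ch) for ch in text]
```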
In the general case, if you want to replace a sequence of combining characters and the base character with an underscore, try e.g.
import unicodedata
def cleanup(line):
    cleaned = []
    for char in line:
        if unicodedata.combining(char):
            # A combining mark modifies the previous character, so turn
            # that base character into "_" and drop the mark itself.
            if cleaned:
                cleaned[-1] = "_"
            continue
        if unicodedata.category(char) not in ("Ll", "Lu"):
            char = "_"
        cleaned.append(char)
    return ''.join(cleaned)
By the by, \W does not need square brackets around it; it's already a regex character class.
Python's re module lacks support for important Unicode properties, though if you really want to use specifically a regex for this, the third-party regex library has proper support for Unicode categories.
"Ll" is the category of lowercase letters and "Lu" that of uppercase letters. There are other Unicode L categories, so you may want to tweak this to suit your requirements (perhaps unicodedata.category(char).startswith("L")); see also https://www.fileformat.info/info/unicode/category/index.htm
You can use the following script to get the desired output:
import re
line="ra*ndom wordS̄"
print(re.sub('[^[-~]+]*','_',line))
Output:
ra_ndom_word_
This approach works with other non-ASCII characters as well:
import re
line="ra*ndom ¡¢£Ä wordS̄. another non-ascii: Ä and Ï"
print(re.sub('[^[-~]+]*','_',line))
Output:
ra_ndom_word_another_non_ascii_and_
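Note what the character class in that pattern actually covers: [^[-~] negates the ASCII range from '[' to '~', which contains the lowercase letters and a handful of punctuation marks, but not digits or uppercase letters. A quick check on a made-up input shows the effect:

```python
import re

pattern = '[^[-~]+]*'  # same pattern as in the answer above
sample = "Hello World 42"  # invented sample with uppercase letters and digits

# Every run of characters outside the '[' to '~' range becomes one underscore.
replaced = re.sub(pattern, '_', sample)
print(replaced)
```

Uppercase ASCII letters and digits are replaced along with the non-ASCII characters, so the pattern only suits input where that is acceptable.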

Remove extra spaces from a python string

I have a string which contains the following information.
mystring = "'$1$Not Running', ''"
I want to remove the extra space and the , '' after Running. I tried strip(), but it does not seem to work.
My desired output is mystring = "'$2$Not Running'"
I am not sure what I am missing here. Any help is appreciated.
One of the easier solutions would be to partition your string based on the comma:
mystring, comma, rest = mystring.partition(",")
This solution depends on there not being any commas in the string other than that one.
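To make the three return values concrete, here is what partition gives for the question's string, together with rpartition as one way to cope with earlier commas. The second input is invented, and the rpartition variant assumes the unwanted part always follows the last comma:

```python
mystring = "'$1$Not Running', ''"

# partition splits at the first separator and returns (head, sep, tail).
head, sep, tail = mystring.partition(",")
print(repr(head), repr(sep), repr(tail))

# rpartition splits at the last separator instead, so a comma inside the
# first quoted value is left alone.
with_comma = "'a, b', ''"
head_r, _, _ = with_comma.rpartition(",")
print(repr(head_r))
```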
The better solution would be to figure out why the extra characters are in your string and what you can do to avoid it.
If that isn't possible, it looks like the string is valid Python, so you could parse it as a tuple and always pick the first element:
import ast
mystring, _ = ast.literal_eval(mystring)
Although in this case you would get what's inside the single quotes, not the single quotes as characters themselves.
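A short sketch of what literal_eval returns for this input, and how repr() can put the single quotes back if they are needed in the result:

```python
import ast

mystring = "'$1$Not Running', ''"

# The text is a valid Python expression: a tuple of two string literals.
parsed = ast.literal_eval(mystring)
first, _ = parsed

# repr() re-adds the surrounding single quotes for a string like this one.
requoted = repr(first)
print(parsed)
print(requoted)
```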
I assume you want to remove the final 4 characters of your string. To do this you can simply slice them off:
mystring = mystring[:-4]
If this is not right, tell me and I'll try to find a solution.
strip() with no arguments only removes whitespace at the beginning and end of a string. Since the part you want to remove consists of a comma and quote characters, with the space sitting between them, it won't work for you.
You can use regular expressions to search and replace for specific strings:
import re
mystring = "'$1$Not Running', ''"
mynewstring = re.sub(", ''", "", mystring)
print(mynewstring)
# '$1$Not Running'
I'm not sure what extra space you're talking about, but you can use similar logic to replace it.
If this is literally the only thing you need it for, then some of the other answers might be simpler. If you need it for several different cases of input, this might be a better option. We'd need to see more examples of input to figure that out though.
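If the inputs do vary a little, one way to generalize the substitution is to allow any amount of whitespace around the trailing empty pair of quotes and anchor the pattern at the end of the string. The extra sample inputs and the name TRAILING_EMPTY are invented for illustration:

```python
import re

# A comma, optional whitespace, an empty pair of single quotes, optional
# trailing whitespace, then the end of the string.
TRAILING_EMPTY = re.compile(r",\s*''\s*$")

samples = [
    "'$1$Not Running', ''",
    "'$1$Not Running',''",
    "'$1$Not Running',   ''  ",
]
cleaned = [TRAILING_EMPTY.sub("", s) for s in samples]
print(cleaned)
```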
Maybe there is something better, but you can try split(), splitting on the comma-plus-space separator and keeping the first piece:
mynewstring = mystring.split(", ")[0]
If the 4 characters you want to remove are ', ' (quote, comma, space, quote), then you can just use the str.replace() method to replace them with an empty string '':
mystring = mystring.replace("', '", '')

Strip out non-valid and non-ASCII characters from my string in Python

I am trying to format this string and strip out the non-ASCII characters:
import re
text = '<phone_number><![CDATA[0145236243 <0x0C><0x05><0x4>
]>'
clean = re.sub('[^\x00-\x7f]',"", text)
This does not seem to do the job properly. Does someone have a proper solution? I have also uploaded a file in case Stack Overflow has reformatted the non-ASCII characters.
Not a very generic one, but the solution below might work for you:
''.join(i for i in text.split() if '<0x' not in i)  # '<phone_number><![CDATA[0145236243]]></phone_number>'
Using regex:
re.sub(r'(<0x\w*>)|\s', "", text)  # '<phone_number><![CDATA[0145236243]]></phone_number>'
A related question has a similar solution for all non-ASCII characters: "Regular expression that finds and replaces non-ascii characters with Python".
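You can also try str.encode() and str.decode() for this purpose: encoding to ASCII with an error handler such as errors="ignore" or errors="replace" drops or replaces the characters that cannot be represented.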
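A minimal sketch of the encode/decode idea, on a made-up sample rather than the question's data. Note that control characters such as \x0c are themselves ASCII, so they survive the round trip and need a second filter; str.isprintable() is one option:

```python
# Made-up sample: one non-ASCII letter and two control characters.
text = "caf\u00e9 0145236243\x0c\x05"

# Encoding to ASCII with errors="ignore" silently drops non-ASCII characters.
ascii_only = text.encode("ascii", errors="ignore").decode("ascii")

# Control characters are still present at this point, so filter them out
# by keeping only printable characters (a space counts as printable).
printable = "".join(ch for ch in ascii_only if ch.isprintable())

print(repr(ascii_only))
print(repr(printable))
```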

can't use variable inside regex

So, I have a long sequence of Unicode characters that I want to match using regular expressions:
char_set = '\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
(These are all the uppercase characters in the Unicode range 0-382. Most of them are accented. PEP 8 discourages the use of non-ASCII characters in Python scripts, so I'm using the Unicode escape codes instead of the literal characters.)
If I simply compile that long string directly, it works. For instance, this matches all the words that begin with one of those characters:
regex = re.compile(u'\A[\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D]')
But I want to re-use that same sequence of characters in several other regular expressions. I could simply copy and paste it every time, but that's ugly. So based on previous answers to similar questions I've tried this:
regex = re.compile(u'\A[%s]' % char_set)
No good. Somehow the above expression seems to match ANY character, not just the ones hardcoded under the variable 'char_set'.
I've also tried this:
regex = re.compile(u'\A[' + char_set + ']')
And this:
regex = re.compile(u'\A[' + re.escape(char_set) + ']')
And this too:
regex = re.compile(u'\A[{ }]'.format(char_set))
None of which works as expected.
Any thoughts? What am I doing wrong?
(I'm using Python 2.7 and Mac OS X 10.6)
When you're using a pattern with a set of characters in square brackets, you don't want to put any vertical bar (|) characters in the set. Instead, just string the characters together and it should work. Here's a session where I tried out your characters with no problems after stripping the | chars:
>>> import re
>>> char_set = u'\u0041-\u005A|\u00C0-\u00D6|\u00D8-\u00DE|\u0100|\u0102|\u0104|\u0106|\u0108|\u010A|\u010C|\u010E|\u0110|\u0112|\u0114|\u0116|\u0118|\u011A|\u011C|\u011E|\u0120|\u0122|\u0124|\u0126|\u0128|\u012A|\u012C|\u012E|\u0130|\u0132|\u0134|\u0136|\u0139|\u013B|\u013D|\u013F|\u0141|\u0143|\u0145|\u0147|\u014A|\u014C|\u014E|\u0150|\u0152|\u0154|\u0156|\u0158|\u015A|\u015C|\u015E|\u0160|\u0162|\u0164|\u0166|\u0168|\u016A|\u016C|\u016E|\u0170|\u0172|\u0174|\u0176|\u0178|\u0179|\u017B|\u017D'
>>> fixed_char_set = char_set.replace("|", "") # remove the unneeded vertical bars
>>> pattern = ur"\A[{}]".format(fixed_char_set) # create a pattern string
>>> regex = re.compile(pattern) # compile the pattern into a regex object
>>> print regex.match("%foo") # "%" is not in the character set, so match returns None
None
edit: Actually, it seems like there must be some other issue going on, since I don't match "%foo" even if I use your original char_set without stripping out anything. Please give examples of text that is matching when it shouldn't!
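For comparison, here is a sketch of the same idea in Python 3 (an assumption, since the question uses Python 2.7), with a shortened character list and without the bars. In Python 3 every string literal is Unicode, so the \u escapes are resolved before the pattern is built. In the question, char_set is a plain '...' literal while the inline pattern is a u'...' literal; under Python 2.7 that difference may be relevant, though the thread does not confirm it.

```python
import re

# Shortened, illustrative subset of the question's characters, without bars.
upper_chars = "\u0041-\u005A\u00C0-\u00D6\u00D8-\u00DE\u0100\u0102"

# The set is interpolated into a character class and can be reused freely.
starts_upper = re.compile(r"\A[{}]".format(upper_chars))
has_upper = re.compile(r"[{}]".format(upper_chars))

print(bool(starts_upper.match("\u00c9mile")))  # first character is U+00C9
print(bool(starts_upper.match("%foo")))
print(bool(has_upper.search("foo\u0100")))

# Inside a character class, "|" is a literal bar, not alternation.
print(bool(re.match(r"[A|B]", "|")))
```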

Python remove JSON substring

If I have a string where there is a valid JSON substring like this one:
mystr = '100{"1":2, "3":4}312'
What is the best way to extract just the JSON string? The numbers outside can be anything (except a { or }), including newlines and things like that.
Just to be clear, this is the result I want
newStr = '{"1":2, "3":4}'
The best way I can think of to do this is to use find and rfind and then take the substring. This seems too verbose to me, and it isn't Python 3.0 compliant (which I would prefer, but it is not essential).
Any help is appreciated.
Note that the following code very much assumes that there is nothing other than non-bracket material on either side of the JSON string.
import re
matcher = re.compile(r"""
^[^\{]* # Starting from the beginning of the string, match anything that isn't an opening bracket
( # Open a group to record what's next
\{.+\} # The JSON substring
) # close the group
[^}]*$ # at the end of the string, anything that isn't a closing bracket
""", re.VERBOSE)
# Your example
print(matcher.match('100{"1":2, "3":4}312').group(1))
# Example with embedded hashmap
print(matcher.match('100{"1":{"a":"b", "c":"d"}, "3":4}312').group(1))
The short, non-precompiled, non-commented version (its [^\}]+ does not allow a nested object, so it only handles the first example):
import re
print(re.match(r"^[^\{]*(\{[^\}]+\})[^}]*$", '100{"1":2, "3":4}312').group(1))
Although for the sake of maintenance, commenting regular expressions is very much preferred.
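An alternative sketch that avoids regular expressions: find the first opening brace and let the standard json module work out where the object ends, via JSONDecoder.raw_decode. The helper name extract_json is made up for this example, and it assumes the first '{' in the text starts the JSON object:

```python
import json

def extract_json(text):
    """Return (json_substring, parsed_object) for the first JSON object in text."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no opening brace found")
    # raw_decode parses one JSON document beginning at `start` and returns
    # the index just past its end, ignoring whatever follows.
    obj, end = json.JSONDecoder().raw_decode(text, start)
    return text[start:end], obj

substring, parsed = extract_json('100{"1":2, "3":4}312')
print(substring)

nested, _ = extract_json('100{"1":{"a":"b", "c":"d"}, "3":4}312')
print(nested)
```

Because the JSON parser decides where the object ends, nested objects and braces inside string values are handled without extra pattern work, and invalid JSON raises an error instead of returning a wrong substring.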
