Python - Split string on first occurrence of non-allowable characters - python

I have some Python code where I want to scan a string and split it at the first occurrence of a non-allowable character.
import re,string
mystring="my_id=abc-something_123&anything#;?lcdkahck;my_id%3Dkckdkkj_bcjc"
if "my_id=" in mystring:
    mystring = mystring[mystring.index("my_id=") + 6 : len(mystring)][0:100]
    mystring = re.split('[;&#]', mystring)[0]
    print(mystring)
This correctly cuts the string where one of ;&# occurs, but my data can contain any unpredictable character other than ;&#.
Here is what I tried to filter out these characters:
allowable_character = '-' + '_' + string.ascii_letters + string.digits
mystring = re.sub('[^%s]' % allowable_character, '', mystring)
print(mystring)
But this just removes every character that is not in 'allowable_character' from the string.
What I am trying to achieve is to split the string at the first character that is not in 'allowable_character' and return the part before it.
So the expected output is 'abc-something_123'.
Any help is appreciated.

You could just use re.findall here:
mystring = "my_id=abc-something_123&anything#;?lcdkahck;my_id%3Dkckdkkj_bcjc"
match = re.findall(r'^my_id=([\w-]*).*$', mystring)[0]
print(match)
This prints:
abc-something_123
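If "my_id=" can appear anywhere in the string rather than only at the start, a minimal sketch in the spirit of the question's slicing might look like the following. The names `extract_id`, `ALLOWED_RUN` and the `key` parameter are hypothetical; the allowed set (ASCII letters, digits, '-' and '_') is taken from the question's `allowable_character`, which is slightly stricter than `\w` because `\w` also accepts non-ASCII letters.

```python
import re

# Allowed characters as described in the question: ASCII letters, digits, '-' and '_'.
ALLOWED_RUN = re.compile(r"[A-Za-z0-9_-]*")

def extract_id(text, key="my_id="):
    """Return the run of allowed characters right after the first `key`, or None."""
    start = text.find(key)
    if start == -1:
        return None
    # match() anchors at the given position and stops at the first disallowed character.
    return ALLOWED_RUN.match(text, start + len(key)).group(0)

print(extract_id("my_id=abc-something_123&anything#;?lcdkahck;my_id%3Dkckdkkj_bcjc"))
# abc-something_123
```

Returning None when the key is missing keeps that case distinguishable from an empty id.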

Related

How to insert quotes around a string in the middle of another string

I need to change this string:
input_str = '{resourceType=Type, category=[{coding=[{system=http://google.com, code=item, display=Item}]}]}'
To json format:
output_str = '{"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}'
Changing the equal sign "=" to colon ":" is quite easy by using replace function:
input_str.replace("=", ":")
But I can't find a way to add quotes before and after each value / word.
I suggest surrounding with double quotes any sequence of characters that are not reserved in your markup. I also made a provision for escaped double quotes, and you can add more escaped symbols to it:
import re
input_str = '{resourceType=Type, category=[{coding=[{system=http://google.com, code=item, display=Item}]}]}'
output_str = re.sub(r'(([^=([\]{},\s]|\")+)', r'"\1"', input_str).replace('=', ':')
print(output_str)
Output:
{"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}
You can use this function for the conversion:
def to_json(in_str):
    return in_str.replace('{', '{"').replace('=', '":"').replace(',', '", "').replace('[', '[').replace('}', '"}').replace(']', ']').replace('" ', '"').replace(':"[', ':[').replace(']"', ']')
This works correctly for the input you have mentioned:
print(to_json(input_str))
# output = {"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}
Regex is certainly more concise and efficient but, just for fun, it's also possible using replace:
input_str = input_str.replace("=", "\":\"")
input_str = input_str.replace("=[", "\":[")
input_str = input_str.replace(", ", "\", \"")
input_str = input_str.replace("{", "{\"")
input_str = input_str.replace("}", "\"}")
input_str = input_str.replace("]\"}", "]}")
input_str = input_str.replace("\"[", "[")
print(input_str) #=> '{"resourceType":"Type", "category":[{"coding":[{"system":"http://google.com", "code":"item", "display":"Item"}]}]}'
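Whichever approach is used, one way to check the result is to feed it to `json.loads`. The sketch below uses a simplified form of the regex idea from the first answer; the helper name `braces_to_json` is hypothetical, and it assumes that keys and values never contain spaces, commas, '=' or brackets (a URL with a query string such as `?a=b` would break it).

```python
import json
import re

def braces_to_json(text):
    """Quote every bare token, then turn '=' into ':' (hypothetical helper)."""
    # A token is any run of characters that is neither structural (= [ ] { } ,) nor whitespace.
    quoted = re.sub(r'[^=\[\]{},\s]+', r'"\g<0>"', text)
    return quoted.replace('=', ':')

input_str = '{resourceType=Type, category=[{coding=[{system=http://google.com, code=item, display=Item}]}]}'
parsed = json.loads(braces_to_json(input_str))  # raises ValueError if the result is not valid JSON
print(parsed["category"][0]["coding"][0]["system"])
# http://google.com
```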

Python Removing non-alphabetical characters with exceptions

I am having a hard time doing data analysis on a large text that has lots of non-alphabetical characters. I tried using
string = filter(str.isalnum, string)
but I also have "#" in my text that I want to keep. How do I make an exception for a character like "#"?
It is easier to use regular expressions:
string = re.sub("[^A-Za-z0-9#]", "", string)
You can use re.sub:
re.sub(r'[^\w\s\d#]', '', string)
Example:
>>> re.sub(r'[^\w\s\d#]', '', 'This is # string 123 *$^%')
'This is # string 123 '
One way to do this would be to create a function that returns True or False depending on whether an input character is valid.
import string
valid_characters = string.ascii_letters + string.digits + '#'
def is_valid_character(character):
    return character in valid_characters
# Instead of using `filter`, we `join` all characters in the input string
# for which `is_valid_character` is `True`.
def get_valid_characters(string):
    return "".join(char for char in string if is_valid_character(char))
Some example output:
>>> print(valid_characters)
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#
>>> get_valid_characters("!Hello_#world?")
'Hello#world'
>>> get_valid_characters("user#example")
'user#example'
A simpler way to write it would be using regex. This accomplishes almost the same thing, except that \w also keeps underscores and non-ASCII letters (and already covers \d):
import re
def get_valid_characters(string):
    return re.sub(r"[^\w\d#]", "", string)
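To see how the explicit ASCII class and the `\w`-based class differ, here is a small comparison. The function names `keep_ascii` and `keep_word_chars` and the sample string are made up for illustration.

```python
import re

def keep_ascii(text):
    # Keep only ASCII letters, digits and '#'.
    return re.sub(r"[^A-Za-z0-9#]", "", text)

def keep_word_chars(text):
    # \w also matches '_' and, for str patterns, non-ASCII letters and digits.
    return re.sub(r"[^\w#]", "", text)

sample = "!Hello_#wörld 42?"
print(keep_ascii(sample))       # Hello#wrld42
print(keep_word_chars(sample))  # Hello_#wörld42
```

Which one is right depends on whether accented letters and underscores count as noise in the text being analysed.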
You could use a lambda function to specify your allowed characters. But also note that filter returns a <filter object>, which is an iterator over the returned values, so you will have to stitch it back into a string:
string = "?filter_#->me3!"
extra_chars = "#!"
filtered_object = filter(lambda c: c.isalnum() or c in extra_chars, string)
string = "".join(filtered_object)
print(string)
Gives:
filter#me3!

How to replace/delete a string in python

How can I replace/delete a part of a string, like this:
string = '{DDBF1F} this is my string {DEBC1F}'
# {DDBF1F}: the code between the braces is random; I only know it is made of 6 characters
The output should be:
this is my string
I tried this. I know it doesn't work, but I tried:
string = '{DDBF1F} Hello {DEBC1F}'
string.replace(f'{%s%s%s%s%s%s}', 'abc')
print(string)
Use the re library to perform a regex replace, like this:
import re
text = '{DDBF1F} Hello {DEBC1F}'
result = re.sub(r"(\s?\{[A-F0-9]{6}\}\s?)", "", text)
print(result)
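The pattern above only matches uppercase hexadecimal codes. If the six characters can be anything, as the question suggests, a sketch under that assumption could match any six non-brace characters instead. The names `BRACED_CODE` and `strip_codes` are hypothetical.

```python
import re

# Assumption: a code is exactly six characters of any kind (except braces) inside {}.
BRACED_CODE = re.compile(r"\{[^{}]{6}\}")

def strip_codes(text):
    """Remove every {xxxxxx} code and trim the leftover outer whitespace."""
    return BRACED_CODE.sub("", text).strip()

print(strip_codes('{DDBF1F} this is my string {DEBC1F}'))
# this is my string
```

Braced groups of any other length are left untouched, so ordinary text in braces survives.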
If the length of the strings within the brackets is fixed, you can use slicing to get the inner substring:
>>> string = '{DDBF1F} this is my string {DEBC1F}'
>>> string[8:-8]
' this is my string '
(string[9:-9] if you want to remove the surrounding spaces)
If hardcoding the indexes feels bad, they can be derived using str.index (if you can be certain that the string will not contain an embedded '}'):
>>> start = string.index('}')
>>> start
7
>>> end = string.index('{', start)
>>> end
27
>>> string[start+1:end]
' this is my string '
This code works:
string = '{DDBF1F} this is my string {DEBC1F}'
st = string.split(' ')
new_str = ''
for i in st:
    if i.startswith('{') and i.endswith('}'):
        pass
    else:
        new_str = new_str + " " + i
print(new_str)

Non-ASCII characters appearing in string

I need to compare 2 strings:
str1 = 'this is my string/ndone'
str2 = 'this is my string done'
So I replace the new line from str1 with ' ':
new_str = str1.replace('\n', ' ')
And when printed, the 2 strings look identical:
'this is my string done'
But when compared using the == operator they are not equal, so I converted the 2 strings into byte arrays to see why:
arr1 = bytearray(str1 , 'utf-8')
print(arr1)
arr2 = bytearray(str2 , 'utf-8')
print(arr2)
And this is the output:
str1 = bytearray(b'this is\xc2\xa0my string done')
str2 = bytearray(b'this is my string done')
So what is this \xc2\xa0 ?
'\xc2\xa0' is the UTF-8 encoding of the Unicode character 'NO-BREAK SPACE' (U+00A0).
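A quick way to confirm this, and to find similar invisible characters, is to decode the bytes and ask `unicodedata` for the name of every non-ASCII character. This is only an illustrative sketch; the variable names `raw` and `cleaned` are made up, and the byte string is the one from the question's output.

```python
import unicodedata

# The bytes from the question, decoded back into a str.
raw = b'this is\xc2\xa0my string done'.decode('utf-8')

# Report every non-ASCII character with its code point and official name.
for ch in raw:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")  # U+00A0 NO-BREAK SPACE

# Replacing the no-break space with a plain space makes the comparison succeed.
cleaned = raw.replace('\u00a0', ' ')
print(cleaned == 'this is my string done')  # True
```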
Use the Python unidecode library:
from unidecode import unidecode
str = "this is\xc2\xa0my string done"
print(unidecode(str))
Output:
this isA my string done
== works for comparing the two strings:
str1 = 'this is my string\ndone'
str2 = 'this is my string done'
str1 = str1.replace("\n"," ")
print(str1)
if str1 == str2:
    print("y")
else:
    print("n")
And the output is:
this is my string done
y
As stated elsewhere, your string had a "/n", not "\n", in it.
Assuming though that what you wanted to do was normalise all whitespace characters, this is a very handy trick I use all the time:
string = ' '.join(string.split())
Update: Okay, this is why:
If you don't specify what string.split() should use as a separator then, per the docs:
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
So it splits on whitespace and treats a run of whitespace characters as a single separator. I don't know exactly which characters are defined as "whitespace", but it certainly includes all the usual suspects. Then, when you rejoin the list into a string with ' '.join(), you know for sure that every whitespace run is now a single plain space.
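In Python 3, `str.split()` without arguments also treats the no-break space (U+00A0) as whitespace, so the trick covers the case from the question as well. A small sketch, with a hypothetical helper name `normalize_whitespace` and made-up sample strings:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (including U+00A0) into one plain space."""
    return " ".join(text.split())

str1 = "this is\u00a0my string\ndone"   # contains a no-break space and a newline
str2 = "this is my  string done"        # contains a double space
print(normalize_whitespace(str1) == normalize_whitespace(str2))  # True
```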

The elegant way to replace specific characters in Python

I have strings that are unpredictable in terms of character content, but I know that every string contains exactly one character '*'.
How can I replace the two characters after the '*' with a string that is not hard-coded? The replacement is a calculated checksum converted to a string:
checksum_str = str(hex(csum).lstrip('0x'))
You want something like:
star_pos = my_string.find('*')
my_string = my_string[:star_pos] + '*' + checksum_str + my_string[star_pos + 3:]
You can do it with a regular expression:
import re
my_string = re.sub(r'(?<=\*)..', checksum_str, my_string, count=1)
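One thing to watch in the question's conversion: `lstrip('0x')` strips any leading '0' and 'x' characters, so `hex(0)` becomes an empty string and values below 0x10 yield a single digit. Assuming the checksum should always occupy exactly two hex digits (an assumption, the question does not say so), a sketch using `format` could look like this. The function name `insert_checksum` and the sample message are hypothetical.

```python
def insert_checksum(message, csum):
    """Replace the two characters after the single '*' with a two-digit hex checksum."""
    # Assumption: the checksum fits in one byte and is written as two lowercase hex digits.
    checksum_str = format(csum & 0xFF, "02x")
    star_pos = message.index("*")  # raises ValueError if there is no '*'
    return message[:star_pos + 1] + checksum_str + message[star_pos + 3:]

print(insert_checksum("abc*zzdef", 0x05))
# abc*05def
```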
