Non-ASCII characters appearing in string - python

I need to compare two strings:
str1 = 'this is my string/ndone'
str2 = 'this is my string done'
So I replace the newline in str1 with ' ':
new_str = str1.replace('\n', ' ')
When I print the two strings, they look identical:
'this is my string done'
But when I compare them with the == operator they are not equal, so I converted the two strings to byte arrays to see why:
arr1 = bytearray(str1 , 'utf-8')
print(arr1)
arr2 = bytearray(str2 , 'utf-8')
print(arr2)
And this is the output:
str1 = bytearray(b'this is\xc2\xa0my string done')
str2 = bytearray(b'this is my string done')
So what is this \xc2\xa0 ?

'\xc2\xa0' is the UTF-8 encoding of the Unicode character 'NO-BREAK SPACE' (U+00A0).
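A quick way to check this yourself is sketched below using only the standard library. The variable names are illustrative and not taken from the question.

```python
import unicodedata

nbsp = "\u00a0"  # the no-break space as a single Python 3 str character

print(unicodedata.name(nbsp))   # NO-BREAK SPACE
print(nbsp.encode("utf-8"))     # b'\xc2\xa0'

# A no-break space prints like a normal space but is a different character,
# so the comparison fails until it is replaced.
looks_same = "this is\u00a0my string done"
plain = "this is my string done"
print(looks_same == plain)                      # False
print(looks_same.replace(nbsp, " ") == plain)   # True
```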

Use the Python unidecode library:
from unidecode import unidecode
# In a Python 3 str, write the character as "\u00a0"; "\xc2\xa0" would be two characters.
text = "this is\u00a0my string done"
print(unidecode(text))
Output:
this is my string done

== does work for comparing two strings:
str1 = 'this is my string\ndone'
str2 = 'this is my string done'
str1 = str1.replace("\n", " ")
print(str1)
if str1 == str2:
    print("y")
else:
    print("n")
and the output is:
this is my string done
y

As stated elsewhere, your string had a "/n", not a "\n", in it.
Assuming though that what you wanted to do was normalise all whitespace characters, this is a very handy trick I use all the time:
string = ' '.join(string.split())
Update: Okay, this is why:
If you don't specify what string.split() should use as a separator then, per the docs:
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
So it splits on whitespace and treats a run of whitespace characters as a single separator. I don't know exactly which characters are defined as "whitespace", but it certainly includes all the usual suspects. Then when you rejoin the list into a string with ' '.join(), you know for sure that all whitespace is now the same.
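As a small sketch of the trick above: in Python 3, str.split() with no arguments also splits on the no-break space from the question, because that character counts as whitespace (str.isspace() is true for it). The helper name normalize_whitespace is made up for illustration.

```python
def normalize_whitespace(text):
    # split() with no separator splits on runs of any whitespace
    # and drops leading/trailing whitespace.
    return " ".join(text.split())

messy = "this is\u00a0my   string\ndone\t"
print("\u00a0".isspace())            # True
print(normalize_whitespace(messy))   # this is my string done
```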

Related

Python - Split string on first occurrence of non-allowable characters

I have some Python code where I want to scan a string and split it on the first occurrence of a non-allowable character.
import re,string
mystring="my_id=abc-something_123&anything#;?lcdkahck;my_id%3Dkckdkkj_bcjc"
if "my_id=" in mystring:
mystring = mystring[mystring.index("my_id=") + 6 : len(mystring)][0:100]
mystring = re.split('[;&#]', mystring)[0]
print(mystring)
With this, I get the string correctly where one of ;&# occurs, but my data can contain any unpredictable character other than ;&#.
What I tried in order to drive out these characters:
allowable_character = '-' + '_' + string.ascii_letters + string.digits
mystring = re.sub('[^%s]' % allowable_character, '', mystring)
print(mystring)
But this just strips out the characters that are not in 'allowable_character'.
What I am trying to achieve is to split the string at the first character that is not in 'allowable_character' and return the part before it.
So the expected output is 'abc-something_123'.
Any help is appreciated.
You could just use re.findall here:
import re
mystring = "my_id=abc-something_123&anything#;?lcdkahck;my_id%3Dkckdkkj_bcjc"
match = re.findall(r'^my_id=([\w-]*).*$', mystring)[0]
print(match)
This prints:
abc-something_123
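If "my_id=" is not guaranteed to be at the start of the string, one possible variation is to locate the prefix first and then match the leading run of allowed characters. This is only a sketch: the function name leading_allowed and its prefix parameter are hypothetical, and the allowed set (ASCII letters, digits, '-' and '_') is assumed from the question.

```python
import re

def leading_allowed(value, prefix="my_id="):
    """Return the run of allowed characters right after the first prefix, or None."""
    start = value.find(prefix)
    if start == -1:
        return None
    rest = value[start + len(prefix):]
    # re.match anchors at the start of rest and stops at the first
    # character outside the allowed set.
    return re.match(r"[A-Za-z0-9_-]*", rest).group(0)

mystring = "my_id=abc-something_123&anything#;?lcdkahck;my_id%3Dkckdkkj_bcjc"
print(leading_allowed(mystring))   # abc-something_123
```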

Python Removing non-alphabetical characters with exceptions

I am having a hard time doing Data Analysis on a large text that has lots of non-alphabetical chars. I tried using
string = filter(str.isalnum, string)
but I also have "#" in my text that I want to keep. How do I make an exception for a character like "#" ?
It is easier to use regular expressions:
string = re.sub("[^A-Za-z0-9#]", "", string)
You can use re.sub:
re.sub(r'[^\w\s\d#]', '', string)
Example:
>>> re.sub(r'[^\w\s\d#]', '', 'This is # string 123 *$^%')
'This is # string 123 '
One way to do this would be to create a function that returns True or False if an input character is valid.
import string
valid_characters = string.ascii_letters + string.digits + '#'
def is_valid_character(character):
    return character in valid_characters
# Instead of using `filter`, we `join` all characters in the input string
# for which `is_valid_character` returns `True`.
def get_valid_characters(string):
    return "".join(char for char in string if is_valid_character(char))
Some example output:
>>> print(valid_characters)
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#
>>> get_valid_characters("!Hello_#world?")
'Hello#world'
>>> get_valid_characters("user#example")
'user#example'
A simpler way to write it would be using regex. This accomplishes nearly the same thing (note that \w also matches the underscore and non-ASCII letters):
import re
def get_valid_characters(string):
    return re.sub(r"[^\w\d#]", "", string)
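The two regex styles used in these answers do not keep exactly the same characters. As a short illustration (the sample text is made up), \w keeps underscores and accented letters, while an explicit ASCII class does not:

```python
import re

sample = "snake_case #tag caf\u00e9!"

# \w matches letters (including non-ASCII ones), digits and the underscore.
with_w = re.sub(r"[^\w#]", "", sample)
# An explicit class keeps only ASCII letters, digits and '#'.
ascii_only = re.sub(r"[^A-Za-z0-9#]", "", sample)

print(with_w)       # snake_case#tagcafé
print(ascii_only)   # snakecase#tagcaf
```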
You could use a lambda function to specify your allowed characters. But also note that filter returns a <filter object>, which is an iterator over the returned values, so you will have to stitch it back into a string:
string = "?filter_#->me3!"
extra_chars = "#!"
filtered_object = filter(lambda c: c.isalnum() or c in extra_chars, string)
string = "".join(filtered_object)
print(string)
Gives:
filter#me3!

How to replace/delete a string in python

How can I replace/delete part of a string, like this:
string = '{DDBF1F} this is my string {DEBC1F}'
# {DDBF1F}: the code between the braces is random; I only know it is made of 6 characters
The output should be:
this is my string
I tried this, I know it doesn't work, but I tried :3
string = '{DDBF1F} Hello {DEBC1F}'
string.replace(f'{%s%s%s%s%s%s}', 'abc')
print(string)
Use the re library to perform a regex replace, like this:
import re
text = '{DDBF1F} Hello {DEBC1F}'
result = re.sub(r"(\s?\{[A-F0-9]{6}\}\s?)", "", text)
print(result)
If the length of the strings within the brackets is fixed, you can use slicing to get the inner substring:
>>> string = '{DDBF1F} this is my string {DEBC1F}'
>>> string[8:-8]
' this is my string '
(string[9:-9] if you want to remove the surrounding spaces)
If hardcoding the indexes feels bad, they can be derived using str.index (if you can be certain that the string will not contain an embedded '}'):
>>> start = string.index('}')
>>> start
7
>>> end = string.index('{', start)
>>> end
27
>>> string[start+1:end]
' this is my string '
This code works:
string = '{DDBF1F} this is my string {DEBC1F}'
st = string.split(' ')
new_str = ''
for i in st:
    if i.startswith('{') and i.endswith('}'):
        pass
    else:
        new_str = new_str + " " + i
print(new_str)
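The loop above leaves a leading space in the result. A more compact variant of the same split-and-filter idea could look like the following sketch (the function name drop_brace_tokens is hypothetical); joining the kept words avoids the stray space.

```python
def drop_brace_tokens(text):
    # Keep every whitespace-separated word that is not wrapped in braces.
    words = text.split()
    kept = [w for w in words if not (w.startswith("{") and w.endswith("}"))]
    return " ".join(kept)

print(drop_brace_tokens("{DDBF1F} this is my string {DEBC1F}"))   # this is my string
```

Like the loop it is based on, this only handles brace codes that are separated from the text by whitespace.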

Removing Punctuation and Replacing it with Whitespace using Replace in Python

I'm trying to remove the following punctuation in Python. I need to use the replace method to remove these punctuation characters and replace them with whitespace: ,.:;'"-?!/
Here is my code:
text_punct_removed = raw_text.replace(".", "")
text_punct_removed = raw_text.replace("!", "")
print("\ntext with punctuation characters removed:\n", text_punct_removed)
It only removes the last one I try to replace, so I tried combining them:
text_punct_removed = raw_text.replace(".", "" , "!", "")
print("\ntext with punctuation characters removed:\n", text_punct_removed)
But I get an error message. How do I remove multiple punctuation characters? Also, there will be an issue if I put the " in quotes like this """, which will make a comment. Is there a way around that? Thanks.
If you don't need to explicitly use replace:
exclude = set(",.:;'\"-?!/")
text = "".join([(ch if ch not in exclude else " ") for ch in text])
Here's a naive but working solution:
for sp in '.,"':
    raw_text = raw_text.replace(sp, '')
If you need to replace all punctuation with spaces, you can use the built-in punctuation list:
Python 3
import string
import re
my_string = "(I hope...this works!)"
translator = re.compile('[%s]' % re.escape(string.punctuation))
my_string = translator.sub(' ', my_string)  # sub() returns a new string, so assign it
print(my_string)
# Result (each punctuation character is now a space):
# ' I hope   this works  '
Afterwards, if you want to collapse the repeated spaces inside the string, you can do:
my_string = re.sub(' +', ' ', my_string).strip()
print(my_string)
# Result:
# I hope this works
This works in Python 3.5.3:
from string import punctuation
raw_text_with_punctuations = "text, with: punctuation; characters? all over ,.:;'\"-?!/"
print(raw_text_with_punctuations)
for char in punctuation:
    raw_text_with_punctuations = raw_text_with_punctuations.replace(char, '')
print(raw_text_with_punctuations)
Either remove one character at a time:
raw_text.replace(".", "").replace("!", "")
Or, better, use regular expressions (re.sub()):
re.sub(r"\.|!", "", raw_text)
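Since the question asks for the punctuation to be replaced with whitespace rather than deleted, another option is str.translate with a table built by str.maketrans. This is a sketch rather than a required approach: the names PUNCT and punct_to_space are illustrative, and the character list is the one from the question.

```python
# The characters listed in the question; the double quote is escaped with a backslash.
PUNCT = ",.:;'\"-?!/"

# Map every punctuation character to a single space.
TABLE = str.maketrans(PUNCT, " " * len(PUNCT))

def punct_to_space(text):
    return text.translate(TABLE)

print(punct_to_space("Hi, there! How's it/going?"))
```

Each punctuation character becomes exactly one space, so runs of spaces can remain; ' '.join(result.split()) collapses them if needed.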

Remove all newlines from inside a string

I'm trying to remove all newline characters from a string. I've read up on how to do it, but it seems that I for some reason am unable to do so. Here is step by step what I am doing:
string1 = "Hello \n World"
string2 = string1.strip('\n')
print(string2)
And I'm still seeing the newline character in the output. I've tried with rstrip as well, but I'm still seeing the newline. Could anyone shed some light on why I'm doing this wrong? Thanks.
strip only removes characters from the beginning and end of a string. You want to use replace:
str2 = str1.replace("\n", "")
str2 = re.sub(r'\s{2,}', ' ', str2)  # collapse runs of whitespace into one space
As mentioned by @john, the most robust answer is:
string = "a\nb\rv"
new_string = " ".join(string.splitlines())
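To illustrate why splitlines() is the more robust choice, here is a short sketch (the helper name join_lines is made up). It handles "\n", "\r" and "\r\n" alike, whereas strip('\n') leaves a newline in the middle of the string untouched.

```python
def join_lines(text, sep=" "):
    # splitlines() breaks on \n, \r and \r\n (among other line boundaries)
    # and drops the line-ending characters themselves.
    return sep.join(text.splitlines())

print(join_lines("a\nb\rv"))          # a b v
print(join_lines("one\r\ntwo\n"))     # one two

# strip() only looks at the two ends of the string.
unchanged = "Hello \n World".strip("\n")
print(unchanged == "Hello \n World")  # True
```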
Answering late since I recently had the same question when reading text from a file. I tried several options.
The first option below produces a list called alist with the '\n' stripped, then joins it back into one full text (optional if you only want the list):
with open('verdict.txt') as f:
    alist = f.read().splitlines()
jalist = " ".join(alist)
The second option is simpler and produces a string called atext, replacing each '\n' with a space:
with open('verdict.txt') as f:
    atext = f.read().replace('\n', ' ')
It works; I have done it. This is clean, easy, and efficient.
strip() returns the string after removing leading and trailing whitespace (see the docs).
In your case, you may want to try replace():
string2 = string1.replace('\n', '')
or you can try this:
string1 = 'Hello \n World'
tmp = string1.split()
string2 = ' '.join(tmp)
This should work in many cases -
text = ' '.join([line.strip() for line in text.strip().splitlines() if line.strip()])
text = re.sub('[\r\n]+', ' ', text)
strip() returns the string with leading and trailing whitespace (by default) removed.
So it would turn " Hello World " into "Hello World", but it won't remove the \n character because it sits in the middle of the string.
Try replace().
string1 = "Hello \n World"
string2 = string1.replace('\n', '')
print(string2)
If the text includes a line break in the middle, neither strip() nor rstrip() will solve the problem;
the strip family only trims from the beginning and the end of the string.
replace() is the way to solve your problem:
>>> my_name = "Landon\nWO"
>>> print(my_name)
Landon
WO
>>> my_name = my_name.replace('\n','')
>>> print(my_name)
LandonWO
