Only part of a Unicode string is being replaced in Python. Don't understand why - python

subject = page.select('div.container h1')
subject = [x.text.replace('2015', '') for x in subject]
print subject
[u'\u20132016 Art Courses']  # This is the output after the replace.
[u'2015\u20132016 Art Courses']  # This is the output before the replace.
subject = [x.text.replace('20132016', '') for x in subject]
When I change the .replace argument to '20132016', it just prints out
[u'2015\u20132016 Art Courses']
Would anyone know how to get rid of the 20132016 as well as the word
Courses?

You don't have the characters "2013" in your string. You have a single character, Unicode U+2013, i.e. "–", an en dash. You need to replace that character.
subject = [x.text.replace(u'\u20132016', '') for x in subject]

\u2013 is the escape sequence for the Unicode en dash character.
So to get rid of everything but Art, you need to replace it like this:
>>> a = u'2015\u20132016 Art Courses'
>>> a.replace(u'2015\u20132016', '').replace('Courses', '').strip()
u'Art'
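To make the distinction concrete, here is a minimal Python 3 sketch (the original post uses Python 2). It assumes the heading always starts with a year range written as four digits, an en dash, and four digits; the names EN_DASH, title, year_range and cleaned are illustrative only.

```python
import re

EN_DASH = "\u2013"  # one character, not the four digits "2013"

title = "2015" + EN_DASH + "2016 Art Courses"

# The escape sequence produces a single code point.
print(len(EN_DASH))          # 1
print("20132016" in title)   # False: those eight digits never appear together

# Assumed pattern: four digits, an en dash, four digits, optional whitespace.
year_range = re.compile(r"\d{4}" + EN_DASH + r"\d{4}\s*")
cleaned = year_range.sub("", title).replace("Courses", "").strip()
print(cleaned)               # Art
```

Matching the year range with a pattern means the same code keeps working when the years change, which a literal replace of '2015\u20132016' would not.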

Related

Replacing Unicode Characters with actual symbols

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
I want to get rid of the <U+2019> and replace it with '. Is there a way to do this in Python?
Edit: I also have instances of <U+2014>, <U+201C>, etc. I'm looking for something that can replace all of these with the appropriate characters.
Replace them all at once with re.sub:
import re
string = "testing<U+2019> <U+2014> <U+201C>testing<U+1F603>"
result = re.sub(r'<U\+([0-9a-fA-F]{4,6})>', lambda x: chr(int(x.group(1),16)), string)
print(result)
Output:
testing’ — “testing😃
The regular expression matches <U+hhhh> where hhhh can be 4-6 hexadecimal characters. Note that Unicode defines code points from U+0000 to U+10FFFF so this accounts for that. The lambda replacement function converts the string hhhh to an integer using base 16 and then converts that number to a Unicode character.
Here's my solution for all code points denoted as U+0000 through U+10FFFF ("U+" followed by the code point value in hexadecimal, which is prepended with leading zeros to a minimum of four digits):
import re
def UniToChar(unicode_notation):
    # Hex digits are 0-9 and a-f, so the class is [a-fA-F0-9], not [a-hA-H0-9]
    return chr(int(re.findall(r'<U\+([a-fA-F0-9]{4,5})>', unicode_notation)[0], 16))
xx = '''
At Donald<U+2019>s <U+2016>Elect<U+2016> in <U+2017>2019<U+2017>
<U+00C0> la Donald<U+2019>s friend <U+1F986>. <U+1F929><U+1F92A><U+1F601>
'''
for x in xx.split('\n'):
    # collect every <U+hhhh> token on this line
    abc = re.findall(r'<U\+[a-fA-F0-9]{4,5}>', x)
    if len(abc) > 0:
        for uniid in set(abc):
            x = x.replace(uniid, UniToChar(uniid))
    print(repr(x).strip("'"))
Output: 71307293.py
At Donald’s ‖Elect‖ in ‗2019‗
À la Donald’s friend 🦆. 🤩🤪😁
In fact, the private-use range from U+100000 to U+10FFFD (Plane 16) isn't matched by the simplified regex above, because it allows at most five hex digits. Improved code follows:
import re
def UniToChar(unicode_notation):
    aux = int(re.findall(r'<U\+([a-fA-F0-9]{4,6})>', unicode_notation)[0], 16)
    # circumvent the "ValueError: chr() arg not in range(0x110000)"
    if aux <= 0x10FFFD:
        return chr(aux)
    else:
        return chr(0xFFFD)  # Replacement Character
xx = '''
At Donald<U+2019>s <U+2016>Elect<U+2016> in <U+2017>2019<U+2017>
<U+00C0> la Donald<U+2019>s friend <U+1F986>. <U+1F929><U+1F92A><U+1F601>
Unassigned: <U+05ff>; out of Unicode range: <U+110000>.
'''
for x in xx.split('\n'):
    # collect every <U+hhhh> token on this line (4 to 6 hex digits)
    abc = re.findall(r'<U\+[a-fA-F0-9]{4,6}>', x)
    if len(abc) > 0:
        for uniid in set(abc):
            x = x.replace(uniid, UniToChar(uniid))
    print(repr(x).strip("'"))
Output: 71307293.py
At Donald’s ‖Elect‖ in ‗2019‗
À la Donald’s friend 🦆. 🤩🤪😁
Unassigned: \u05ff; out of Unicode range: �.
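The re.sub answer and the range check above can be combined into one helper. This is only a sketch: the function name decode_u_plus and the choice to substitute U+FFFD for out-of-range values are assumptions, not something either answer prescribes.

```python
import re

# <U+hhhh> with 4 to 6 hexadecimal digits, as in the answers above.
U_PLUS = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")

def decode_u_plus(text):
    """Replace every <U+hhhh> token with the character it names."""
    def repl(match):
        code_point = int(match.group(1), 16)
        # chr() accepts 0..0x10FFFF; anything larger becomes U+FFFD.
        return chr(code_point) if code_point <= 0x10FFFF else "\ufffd"
    return U_PLUS.sub(repl, text)

result = decode_u_plus("Trump<U+2019>s <U+1F603> <U+110000>")
print(ascii(result))  # ascii() keeps the output printable on any console
```

A single re.sub pass avoids the per-line findall/replace loop and handles any number of tokens in the string.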
I guess this solves the problem if it's just one or two of these characters.
>>> string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
>>> string.replace("<U+2019>","'")
"At Donald Trump's Properties, a Showcase for a Brand and a President-Elect"
If there are many of these substitutions to be done, consider using the built-in map() function.
Source: Removing \u2018 and \u2019 character
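When only a known handful of tokens occurs, a lookup table keeps the substitutions in one place. The sketch below is one way to do that; the REPLACEMENTS mapping, the plain-ASCII stand-ins chosen for each token, and the helper name replace_tokens are all hypothetical.

```python
# Hypothetical mapping from <U+hhhh> tokens to plain-ASCII stand-ins.
REPLACEMENTS = {
    "<U+2019>": "'",   # right single quotation mark
    "<U+2014>": "-",   # em dash, approximated with a hyphen
    "<U+201C>": '"',   # left double quotation mark
}

def replace_tokens(text, table=REPLACEMENTS):
    """Apply each literal replacement in turn."""
    for token, plain in table.items():
        text = text.replace(token, plain)
    return text

string = "At Donald Trump<U+2019>s Properties, a President<U+2014>Elect"
print(replace_tokens(string))
```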
You can replace using .replace()
print(string.replace('<U+2019>', "'"))
Or, if the number in the token changes, you can use re. My pattern could probably be tidier.
import re
string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
# raw string, so \d reaches the regex engine unchanged
rep = re.search(r'[<][U][+]\d{4}[>]', string).group()
print(string.replace(rep, "'"))
What version of Python are you using?
I edited my answer so it can be used with multiple code points in the same string.
You need to convert the Unicode code point between < and > to the corresponding Unicode character.
I used a regex to find each code point and then converted it to the corresponding Unicode character.
import re
string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President<U+2014>Elect"
# [0-9A-Fa-f] rather than \d, so tokens such as <U+201C> also match
pattern = r'[<][U][+][0-9A-Fa-f]{4}[>]'
repbool = re.search(pattern, string)
while repbool:
    rep = repbool.group()
    # rep[3:-1] strips the leading "<U+" and the trailing ">"
    string = string.replace(rep, chr(int(rep[3:-1], 16)))
    repbool = re.search(pattern, string)
print(string)

Is there a way to split a string by multiple parameters and not get TypeError: 'str' object cannot be interpreted as an integer?

Is there a way to split a string by multiple parameters? When I try it the way I've got below, I get
TypeError: 'str' object cannot be interpreted as an integer
Looking into it, it should work.
The program is a translator to a new language. For example, man is masculine, so the man is de mno, while woman, which is feminine, would be di felio instead of de felio. I've got everything working up to this point, and I want to start working on rearranging the sentence order. I want to be able to split at every de and di, not just one of them. I've looked online and tried a solution I found that uses the re module, but it didn't end up working.
new_sentence = re.split(' de', ' di', translated_sentence)
print(new_sentence)
Suppose I originally entered the man is with the woman, which has already been translated into de mno aili di felio. When I print new_sentence, I want it to print out like ['de mno aili', 'di felio'].
I don't have a very good understanding of the split function, so my code may be completely wrong.
You can match the space, and use a character class d[ei] to match either de or di with a regex. If you don't want a partial match, you can add a word boundary \b at the end.
import re
translated_sentence = "The program is a translator to a new language, etc. man is masculine so the man is de mno while woman which is feminine would be di felio instead of de felio."
new_sentence = re.split(r' d[ei]\b', translated_sentence)
print(new_sentence)
Output
['The program is a translator to a new language, etc. man is masculine so the man is', ' mno while woman which is feminine would be', ' felio instead of', ' felio.']
See a Python demo
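As for the TypeError itself: re.split takes the pattern first, the string second, and an integer maxsplit third, so in re.split(' de', ' di', translated_sentence) the sentence lands in the maxsplit slot. The sketch below reproduces that and then shows one possible way to get the output the question asks for, using a lookahead so that de and di stay in the results. The lookahead pattern is a suggestion, not part of the answer above.

```python
import re

translated_sentence = "de mno aili di felio"

# Signature: re.split(pattern, string, maxsplit=0, flags=0).
# Here ' di' is taken as the text and the sentence as maxsplit (an integer).
error = None
try:
    re.split(" de", " di", translated_sentence)
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

# Split on the whitespace that precedes "de" or "di", keeping the article.
parts = re.split(r"\s+(?=d[ei]\b)", translated_sentence)
print(parts)  # ['de mno aili', 'di felio']
```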
If you have a list of elements to split at, you can use re.escape and | to build a pattern from it, which can then be passed as the first argument of re.split. Consider the following example:
import re
splitat = ['da','di','do']
pattern = '|'.join(re.escape(i) for i in splitat)
text = 'Some do text di for da testing'
print(re.split(pattern,text))
output
['Some ', ' text ', ' for ', ' testing']
re.escape takes care of any characters that have special meaning in patterns.

Want to extract the alphanumeric text with certain special characters using python regex

I have the following text, which I want to get into a desired format using Python regex
text = "' PowerPoint PresentationOctober 11th, 2011(Visit) to Lap Chec1Edit or delete me in ‘view’ then ’slide master’.'"
I used the following code
reg = re.compile(r"[^\w']")
text = reg.sub(' ', text)
However, it gives the output text = "'PowerPoint PresentationOctober 11th 2011 Visit to Lap Chec1Edit or delete me in â viewâ then â slide masterâ'" which is not the desired output.
My desired output should be text = '"PowerPoint PresentationOctober 11th, 2011(Visit) to Lap Chec1Edit or delete me in view then slide master.'"
I want to remove special characters except the following: []()-,.
Rather than removing the chars, you may fix them using the right encoding:
text = text.encode('windows-1252').decode('utf-8')
# => ' PowerPoint PresentationOctober 11th, 2011Visit to Lap Chec1Edit or delete me in ‘view’ then ’slide master’.'
See the Python demo
If you want to remove them later, it will become much easier, like text.replace('‘', '').replace('’', ''), or re.sub(r'[’‘]+', '', text).
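The encode/decode round trip works because characters such as â in the output are what UTF-8 bytes look like when they are misread as Windows-1252. The sketch below simulates that damage on a short sample and then reverses it; the variable names are illustrative. It assumes the text was garbled in exactly this way, and it can raise UnicodeEncodeError or UnicodeDecodeError otherwise.

```python
original = "in \u2018view\u2019 then \u2019slide master\u2019"

# Simulate the damage: UTF-8 bytes misread as Windows-1252.
garbled = original.encode("utf-8").decode("windows-1252")
print(ascii(garbled))

# Reverse it, as in the answer above.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired == original)  # True

# Then drop the curly quotes if they are unwanted.
plain = repaired.replace("\u2018", "").replace("\u2019", "")
print(plain)  # in view then slide master
```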
I found the answer; it was as simple as the following. Thanks for the replies.
reg = re.compile(r"[^\w'\,\.\(\)\[\]]")
text = reg.sub(' ', text)

Simple Regex in Python Three to replace text between '|' and '/' symbols

I want to replace the text between the '|' and '/' in the string ("|伊士曼柯达公司/") with '!!!'.
s = '柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/'
print(s)
s = re.sub(r'\|.*?\/.', '/!!!', s)
print('\t', s)
I tested the code first on https://regex101.com/, and it worked perfectly.
I can't quite figure out why it's not doing the replacement in python.
Variants of escaping I've also tried include:
s = re.sub(r'|.*?\/.', '!!!', s)
s = re.sub(r'|.*?/.', '!!!', s)
s = re.sub(r'\|.*?/.', '!!!', s)
Each time the string comes out unchanged.
You can change your regex to this one, which uses lookarounds to ensure that what you want to replace is preceded by | and followed by /:
(?<=\|).*?(?=/)
Check this Python code,
import re
s = '柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/'
print(s)
s = re.sub(r'(?<=\|).*?(?=/)', '!!!', s)
print(s)
Prints like you expect,
柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/
柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|!!!/
Online Python Demo
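One likely reason the pattern \|.*?\/. left the string unchanged is its trailing dot: it requires one more character after the /, but in this string the / is the last character, so there is no match at all. The sketch below checks that on a shortened copy of the string and shows an alternative without lookarounds that re-inserts the delimiters in the replacement.

```python
import re

# Shortened copy of the string from the question.
s = "Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/"

# The trailing "." needs a character after "/", but "/" ends the string.
print(re.search(r"\|.*?/.", s))  # None

# Without the trailing ".", the pattern matches; "|" and "/" are put back.
result = re.sub(r"\|.*?/", "|!!!/", s)
print(result.endswith("|!!!/"))  # True
```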

How to make string when a sentence has quotation marks or inverted commas? Python

For example I have a sentence such as:
Jamie's car broke "down" in the middle of the street
How can I convert this into a string without manually removing the quotation marks and inverted commas like:
'Jamies car broke down in the middle of the street'
Any help is appreciated!
Thanks,
Use replace() calls one after the other:
s = """Jamie's car broke "down" in the middle of the street"""
print(s.replace('\'', '').replace('"', ''))
# Jamies car broke down in the middle of the street
You can use a regex to remove all special characters from the string:
>>> import re
>>> my_str = """Jamie's car broke "down" in the middle of the street"""
>>> re.sub(r'[^A-Za-z0-9\s]+', '', my_str)
'Jamies car broke down in the middle of the street'
Try this:
oldstr = """Jamie's car broke "down" in the middle of the street""" #Your problem string
newstr = oldstr.replace('\'', '').replace('"', '') #New string using replace()
print(newstr) #print result
This returns:
Jamies car broke down in the middle of the street
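Another option, not mentioned in the answers above, is str.translate with a deletion table, which removes both kinds of quote in a single pass. A minimal sketch (the name drop_quotes is illustrative):

```python
s = """Jamie's car broke "down" in the middle of the street"""

# The third argument of str.maketrans lists the characters to delete.
drop_quotes = str.maketrans("", "", "'\"")
cleaned = s.translate(drop_quotes)
print(cleaned)
# Jamies car broke down in the middle of the street
```

Unlike the regex that strips everything except letters, digits and whitespace, this leaves other punctuation untouched.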
