Removing different string patterns from Pandas column - python
I have the following column which consists of email subject headers:
Subject
EXT || Transport enquiry
EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry
EXT || FW: Model - Jan
SV: [EXTERNAL] Calculations
What I want to achieve is:
Subject
Transport enquiry
0001 || Copy of enquiry
Model - Jan
Calculations
For this I am using the code below, but it only seems to apply some of the regular expressions I pass and ignores the rest:
def clean_subject_prelim(text):
    text = re.sub(r'^EXT \|\| $' , '' , text)
    text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '' , text)
    text = re.sub(r'EXT \|\| FW:', '' , text)
    text = re.sub(r'^SV: \[EXTERNAL]$' , '' , text)
    return text
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))
Why is this not working? What am I missing here?
You can use
pattern = r"""(?mx) # MULTILINE and VERBOSE modes on
^ # start of a line
(?: # non-capturing group start
    EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)? # EXT || or EXT || RE: EXTERNAL: RE: or EXT || FW:
  | # or
    SV:\s*\[EXTERNAL] # SV: [EXTERNAL]
) # non-capturing group end
\s* # zero or more whitespaces
"""
df['subject_clean'] = df['Subject'].str.replace(pattern, '', regex=True)
See the regex demo.
Since re.X (the (?x) inline flag) is used, unescaped spaces and # chars in the pattern are ignored, so you should escape literal spaces and # chars, or just use \s* or \s+ to match zero or more, or one or more, whitespaces.
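To illustrate the note about verbose mode, here is a minimal sketch using only the re module. The names loose, strict and flexible are made up for this example; it only shows how spaces in the pattern are treated, not the full solution.

```python
import re

subject = "EXT || Transport enquiry"

# Unescaped spaces are ignored in verbose mode, so this is really EXT\|\|
loose = re.compile(r"EXT \|\| ", re.VERBOSE)

# Backslash-escaped spaces are kept and must match literally
strict = re.compile(r"EXT\ \|\|\ ", re.VERBOSE)

# \s* matches any amount of whitespace and is unaffected by verbose mode
flexible = re.compile(r"EXT \s* \|\| \s*", re.VERBOSE)

print(loose.match(subject))          # None: no literal space in the pattern
print(strict.sub("", subject))       # Transport enquiry
print(flexible.sub("", subject))     # Transport enquiry
```

The \s* form is usually the more forgiving choice because it also copes with subjects that have no space, or several spaces, around the || separator.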
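Get rid of the $ sign in the first and last expressions: $ anchors the match at the end of the string, so those patterns only match a subject that consists of the prefix and nothing else. Also reorder the substitutions so that the generic ^EXT \|\| one runs after the more specific ones. Like this: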
import pandas as pd
import re
def clean_subject_prelim(text):
    # Most specific prefixes first, the generic "EXT ||" afterwards
    text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '' , text)
    text = re.sub(r'EXT \|\| FW:', '' , text)
    text = re.sub(r'^EXT \|\|' , '' , text)
    text = re.sub(r'^SV: \[EXTERNAL]' , '' , text)
    # Drop the whitespace left behind by the removed prefix
    return text.strip()
data = {"Subject": [
"EXT || Transport enquiry",
"EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
"EXT || FW: Model - Jan",
"SV: [EXTERNAL] Calculations"]}
df = pd.DataFrame(data)
df['subject_clean'] = df['Subject'].apply(clean_subject_prelim)
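To see why the $ anchor and the order of the substitutions matter, here is a small sketch with plain re calls on two of the sample subjects. The variable names are only for illustration.

```python
import re

subject = "EXT || Transport enquiry"

# '$' directly after the prefix requires the string to end there, so nothing is replaced
with_dollar = re.sub(r'^EXT \|\| $', '', subject)

# Without '$' the prefix is removed; strip() drops the space left behind
without_dollar = re.sub(r'^EXT \|\|', '', subject).strip()

# Order matters: once the generic prefix is gone, the specific pattern cannot match
forwarded = "EXT || FW: Model - Jan"
generic_first = re.sub(r'EXT \|\| FW:', '', re.sub(r'^EXT \|\|', '', forwarded)).strip()
specific_first = re.sub(r'^EXT \|\|', '', re.sub(r'EXT \|\| FW:', '', forwarded)).strip()

print(with_dollar)      # EXT || Transport enquiry
print(without_dollar)   # Transport enquiry
print(generic_first)    # FW: Model - Jan
print(specific_first)   # Model - Jan
```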
Related
Regex substitution reversal?
I have a question: starting from this text example: input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة" I managed to clean this text using these functions: arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ''' english_punctuations = string.punctuation punctuations_list = arabic_punctuations + english_punctuations arabic_diacritics = re.compile(""" ّ | # Tashdid َ | # Fatha ً | # Tanwin Fath ُ | # Damma ٌ | # Tanwin Damm ِ | # Kasra ٍ | # Tanwin Kasr ْ | # Sukun ـ # Tatwil/Kashida """, re.VERBOSE) def normalize_arabic(text): text = re.sub("[إأآا]", "ا", text) return text def remove_diacritics(text): text = re.sub(arabic_diacritics, '', text) return text def remove_punctuations(text): translator = str.maketrans('', '', punctuations_list) return text.translate(translator) def remove_repeating_char(text): return re.sub(r'(.)\1+', r'\1', text) Which gives me this text as the result: result = "اكتب الدرس و احفضه ثم اقرا القصيدة" Now if I have have this case, how can I find the word "اقرا" in the orginal input_test? The input text can be in English, too. I'm thinking of regex — but I don't know from where to start…
Get data with boundaries using regex
I would like to get the labels and data from this function using regex, I have tried using this: pattern = re.compile(r'/blabels: ],/b') print(pattern) result = soup.find("script", text=pattern) But I get None using boundaries This is the soup: <script> Chart.defaults.LineWithLine = Chart.defaults.line; new Chart(document.getElementById("chart-overall-mentions"), { type: 'LineWithLine', data: { labels: [1637005508000,1637006108000,1637006708000,1637007308000,1637007908000,1637008508000,1637009108000,1637009708000,1637010308000,1637010908000,1637011508000,1637012108000,1637012708000,1637013308000,1637013908000,1637014508000,1637015108000,1637015708000,1637016308000,1637016908000,1637017508000,1637018108000,1637018708000,1637019308000,1637019908000,1637020508000,1637021108000,1637021708000,1637022308000,1637022908000,1637023508000,1637024108000,1637024708000,1637025308000,1637025908000,1637026508000,1637027108000,1637027708000,1637028308000,1637028908000,1637029508000,1637030108000,1637030708000,1637031308000,1637031908000,1637032508000,1637033108000,1637033708000,1637034308000,1637034908000,1637035508000,1637036108000,1637036708000,1637037308000,1637037908000,1637038508000,1637039108000,1637039708000,1637040308000,1637040908000,1637041508000,1637042108000,1637042708000,1637043308000,1637043908000,1637044508000,1637045108000,1637045708000,1637046308000,1637046908000,1637047508000,1637048108000,1637048708000,1637049308000,1637049908000,1637050508000,1637051108000,1637051708000,1637052308000,1637052908000,1637053508000,1637054108000,1637054708000,1637055308000,1637055908000,1637056508000,1637057108000,1637057708000,1637058308000,1637058908000,1637059508000,1637060108000,1637060708000,1637061308000,1637061908000,1637062508000,1637063108000,1637063708000,1637064308000,1637064908000,1637065508000,1637066108000,1637066708000,1637067308000,1637067908000,1637068508000,1637069108000,1637069708000,1637070308000,1637070908000,1637071508000,1637072108000,1637072708000,16370733080
00,1637073908000,1637074508000,1637075108000,1637075708000,1637076308000,1637076908000,1637077508000,1637078108000,1637078708000,1637079308000,1637079908000,1637080508000,1637081108000,1637081708000,1637082308000,1637082908000,1637083508000,1637084108000,1637084708000,1637085308000,1637085908000,1637086508000,1637087108000,1637087708000,1637088308000,1637088908000,1637089508000,1637090108000,1637090708000,1637091308000], datasets: [{ data: [13,10,20,26,21,23,24,21,24,35,25,31,42,24,24,20,23,22,17,23,30,11,16,20,9,10,22,10,19,16,15,16,17,19,10,20,24,14,19,15,13,9,13,17,20,16,15,21,18,25,15,14,16,15,16,14,14,21,10,9,5,9,9,13,14,9,9,18,15,11,11,6,12,14,19,17,16,11,20,14,21,13,15,12,14,10,20,16,25,17,17,11,23,11,13,11,19,10,17,19,10,20,22,19,19,27,28,18,20,22,18,16,17,18,14,17,19,18,20,11,13,20,15,15,18,14,13,14,14,11,19,14,14,11,11,15,26,12,15,15,11,4,3,6], pointRadius: 0, borderColor: "#666", fill: true, yAxisID:'yAxis1' }, ] }, options: { tooltips: { mode: 'index', bodyFontSize: 18, intersect: false, titleFontSize: 16, }, . . . </script>
Here is how you can do that: Get the script tag - you can use a regex, too, if that is the only way to obtain that node Then run a regex search against the node text/string to get your final output. You can use # Get the script node with text matching your pattern item = soup.find("script", text=re.compile(r'\blabels:\s*\[')) import re match = re.search(r'\blabels:\s*\[([^][]*)]', item.string) if match: labels = map(int, match.group(1).split(',')) Output: >>> print(list(labels)) [1637005508000, 1637006108000, 1637006708000, 1637007308000, 1637007908000, 1637008508000, 1637009108000, 1637009708000, 1637010308000, 1637010908000, 1637011508000, 1637012108000, 1637012708000, 1637013308000, 1637013908000, 1637014508000, 1637015108000, 1637015708000, 1637016308000, 1637016908000, 1637017508000, 1637018108000, 1637018708000, 1637019308000, 1637019908000, 1637020508000, 1637021108000, 1637021708000, 1637022308000, 1637022908000, 1637023508000, 1637024108000, 1637024708000, 1637025308000, 1637025908000, 1637026508000, 1637027108000, 1637027708000, 1637028308000, 1637028908000, 1637029508000, 1637030108000, 1637030708000, 1637031308000, 1637031908000, 1637032508000, 1637033108000, 1637033708000, 1637034308000, 1637034908000, 1637035508000, 1637036108000, 1637036708000, 1637037308000, 1637037908000, 1637038508000, 1637039108000, 1637039708000, 1637040308000, 1637040908000, 1637041508000, 1637042108000, 1637042708000, 1637043308000, 1637043908000, 1637044508000, 1637045108000, 1637045708000, 1637046308000, 1637046908000, 1637047508000, 1637048108000, 1637048708000, 1637049308000, 1637049908000, 1637050508000, 1637051108000, 1637051708000, 1637052308000, 1637052908000, 1637053508000, 1637054108000, 1637054708000, 1637055308000, 1637055908000, 1637056508000, 1637057108000, 1637057708000, 1637058308000, 1637058908000, 1637059508000, 1637060108000, 1637060708000, 1637061308000, 1637061908000, 1637062508000, 1637063108000, 1637063708000, 1637064308000, 1637064908000, 1637065508000, 
1637066108000, 1637066708000, 1637067308000, 1637067908000, 1637068508000, 1637069108000, 1637069708000, 1637070308000, 1637070908000, 1637071508000, 1637072108000, 1637072708000, 1637073308000, 1637073908000, 1637074508000, 1637075108000, 1637075708000, 1637076308000, 1637076908000, 1637077508000, 1637078108000, 1637078708000, 1637079308000, 1637079908000, 1637080508000, 1637081108000, 1637081708000, 1637082308000, 1637082908000, 1637083508000, 1637084108000, 1637084708000, 1637085308000, 1637085908000, 1637086508000, 1637087108000, 1637087708000, 1637088308000, 1637088908000, 1637089508000, 1637090108000, 1637090708000, 1637091308000] Once the node is obtained the \blabels:\s*\[([^][]*)] regex searches for \b - a word boundary labels: - a fixed string \s* - zero or more whitespaces \[ - a [ char ([^][]*) - Group 1 (this is what you will need to split with a comma later): any zero or more chars other than ] and [ ] - a ] char.
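The same regex can be tried without BeautifulSoup on a plain string. The sketch below assumes a shortened, made-up stand-in for the script text (script_text) and a hypothetical helper extract_int_list; in the real case the string would come from item.string as above.

```python
import re

# Hypothetical, shortened stand-in for the <script> text in the question
script_text = """
new Chart(document.getElementById("chart-overall-mentions"), {
  type: 'LineWithLine',
  data: {
    labels: [1637005508000,1637006108000,1637006708000],
    datasets: [{ data: [13,10,20], pointRadius: 0 }]
  }
});
"""

def extract_int_list(key, text):
    """Return the integers found in `key: [ ... ]`, or an empty list if absent."""
    match = re.search(r'\b' + re.escape(key) + r':\s*\[([^][]*)]', text)
    if not match:
        return []
    # Skip empty pieces so that `key: []` yields an empty list
    return [int(item) for item in match.group(1).split(',') if item.strip()]

labels = extract_int_list("labels", script_text)
data = extract_int_list("data", script_text)
print(labels)  # [1637005508000, 1637006108000, 1637006708000]
print(data)    # [13, 10, 20]
```

Note that "data: {" is skipped automatically, because the pattern requires a [ right after the key and optional whitespace.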
replace trademark symbol (™) when alone
I'm trying to remove the trademark symbol (™), but only when it appears on its own. For instance, I might have ’, which is a badly encoded quotation mark ('), so in that case I don't want to remove the trademark symbol (™) and thereby break the pattern that I'm using to replace xx™ with a quotation mark.
dict = {};
chars = {
    '\xe2\x84\xa2': '', # ™
    '\xe2\x80\x99': "'", # ’
}
def stats_change(char, number):
    if dict.has_key(char):
        dict[char] = dict[char]+number
    else:
        dict[char] = number # Add new entry
def replace_chars(match):
    char = match.group(0)
    stats_change(char,1)
    return chars[char]
i, nmatches = re.subn("(\\" + '|\\'.join(chars.keys()) + ")", replace_chars, i)
count_matches += nmatches
Input: foo™ oof
Output: foo oof
Input: o’f oof
Output: o'f oof
Any suggestions?
How to replace some special characters from user input for different Python platforms
I need to replace some special characters in user input for different platforms (i.e. Linux and Windows) using Python. Here is my code:
if request.method == 'POST':
    rname1 = request.POST.get('react')
Here I am getting the user input via the POST method. I need to remove the following characters from the user input (if any are present).
1- Escape or filter special characters for Windows: ( ) < > * ‘ = ? ; [ ] ^ ~ ! . ” % # / \ : + , `
2- Escape or filter special characters for Linux: { } ( ) < > * ‘ = ? ; [ ] $ – # ~ ! . ” % / \ : + , `
The special characters are given above. I need to remove them for both Linux and Windows.
Python strings have a built-in method, translate, for substituting or deleting characters. You need to build a translation table and then call the method.
import sys
# startswith avoids matching "darwin", which also contains "win"
if sys.platform.startswith("win"):
    special = r"""( ) < > * ‘ = ? ; [ ] ^ ~ ! . ” % # / \ : + , `""".split()
else:
    special = r"""{ } ( ) < > * ‘ = ? ; [ ] $ – # ~ ! . ” % / \ : + , `""".split()
trans_dict = {character: None for character in special}
trans_table = str.maketrans(trans_dict)
print("Lo+=r?e~~m ipsum dol;or sit!! amet, consectet..ur ad%".translate(trans_table))
Will print Lorem ipsum dolor sit amet consectetur ad.
If you want to use a replacement character instead of deleting, replace None above with that character. You can also build a translation table with specific substitutions, e.g. {"a": "m", "b": "n", ...}.
Edit: The above snippet is for Python 3. In Python 2 (TiO) it's easier to delete characters:
>>> import sys
>>> import string
>>> if "win" in sys.platform:
...     special = """()<>*'=?;[]^~!%#/\:=,`"""
... else:
...     special = """{}()<>*'=?;[]$-#~!."%/\:+"""
...
>>> s = "Lo+r?e~~/\#<>m ips()u;m"
>>> string.translate(s, None, special)
'Lorem ipsum'
Note that I've substituted ‘ with ' and similarly replaced ” with " because I think you're only dealing with ASCII strings.
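A compact Python 3 variant of the same idea, as a sketch: the three-argument form of str.maketrans builds the deletion table directly from a string. The names WINDOWS_SPECIAL, LINUX_SPECIAL and strip_special are invented for this example, and the platform argument is only there so both branches can be tried on one machine.

```python
import sys

# Characters copied from the two lists in the question (typographic quotes kept as given)
WINDOWS_SPECIAL = "()<>*‘=?;[]^~!.”%#/\\:+,`"
LINUX_SPECIAL = "{}()<>*‘=?;[]$–#~!.”%/\\:+,`"

def strip_special(text, platform=sys.platform):
    """Delete the given platform's special characters from text."""
    # startswith avoids matching "darwin", which also contains "win"
    special = WINDOWS_SPECIAL if platform.startswith("win") else LINUX_SPECIAL
    # maketrans('', '', chars) maps every character in chars to None (deletion)
    return text.translate(str.maketrans("", "", special))

sample = "Lo+=r?e~~m ipsum dol;or sit!! amet, consectet..ur ad%"
cleaned_linux = strip_special(sample, "linux")
cleaned_windows = strip_special(sample, "win32")
print(cleaned_linux)    # Lorem ipsum dolor sit amet consectetur ad
print(cleaned_windows)  # Lorem ipsum dolor sit amet consectetur ad
```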
Python regex sub confusion
There are four keywords: title, blog, tags, state Excess keyword occurrences are being removed from their respective matches. Example: blog: blog state title tags and returns state title tags and instead of blog state title tags and The sub function should be matching .+ after it sees blog:, so I don't know why it treats blog as an exception to .+ Regex: re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a) Code: def n15(): import re a = """blog: blog: fooblog state: private title: this is atitle bun and text""" kwargs = {} def matcher(string): v = string.group(1).replace(string.group(2), '').replace(string.group(3), '').replace(string.group(4), '').replace(string.group(5), '') if string.group(3) == 'title': kwargs['title'] = v elif string.group(3) == 'blog': kwargs['blog_url'] = v elif string.group(3) == 'tags': kwargs['comma_separated_tags'] = v elif string.group(3) == 'state': kwargs['post_state'] = v return '' a = re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a) a = a.replace('\n', '<br />') a = a.replace('\r', '') a = a.replace('"', r'\"') a = '<p>' + a + '</p>' kwargs['body'] = a print kwargs Output: {'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'foo', 'title': 'this is a bun'} Edit: Desired Output: {'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'fooblog', 'title': 'this is atitle bun'}
replace(string.group(3), '') replaces every occurrence of 'blog' in the matched text with '', including the one inside the value. Rather than trying to strip all the other parts out of the matched string, which is hard to get right, I suggest capturing the part you actually want in the original match:
r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s)(.+)(\n|$))'
This has () around the .+ to capture that part of the string, so you can then use v = match.group(5) at the start of matcher (match being the match object, which is named string in your code).
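Here is a Python 3 sketch of the capture-group approach. It is a simplification, not the original pattern: it assumes one "keyword: value" pair per line (the exact line breaks of the input are a guess), anchors on line starts with re.MULTILINE, and the KEY_NAMES mapping is introduced only for this example.

```python
import re

# Assumed input layout: one "keyword: value" pair per line, body text afterwards
text = "blog: fooblog\nstate: private\ntitle: this is atitle bun\nand text"

# Maps each keyword to the dictionary key used in the question
KEY_NAMES = {
    "title": "title",
    "blog": "blog_url",
    "tags": "comma_separated_tags",
    "state": "post_state",
}

kwargs = {}

def matcher(match):
    # Group 1 is the keyword, group 2 the captured value; no str.replace needed
    kwargs[KEY_NAMES[match.group(1)]] = match.group(2)
    return ""

body = re.sub(r"^(title|blog|tags|state):\s(.+)(?:\n|$)", matcher, text, flags=re.MULTILINE)
kwargs["body"] = "<p>" + body + "</p>"
print(kwargs)
```

Because the value is taken straight from its own group, a value such as fooblog keeps the keyword text it happens to contain.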