Regex substitution reversal? - python

I have a question:
starting from this text example:
input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"
I managed to clean this text using these functions:
arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
english_punctuations = string.punctuation
punctuations_list = arabic_punctuations + english_punctuations
arabic_diacritics = re.compile("""
ّ | # Tashdid
َ | # Fatha
ً | # Tanwin Fath
ُ | # Damma
ٌ | # Tanwin Damm
ِ | # Kasra
ٍ | # Tanwin Kasr
ْ | # Sukun
ـ # Tatwil/Kashida
""", re.VERBOSE)
def normalize_arabic(text):
    text = re.sub("[إأآا]", "ا", text)
    return text
def remove_diacritics(text):
    text = re.sub(arabic_diacritics, '', text)
    return text
def remove_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
def remove_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)
Which gives me this text as the result:
result = "اكتب الدرس و احفضه ثم اقرا القصيدة"
Now, given this result, how can I find the word "اقرا" in the original input_test?
The input text can be in English, too. I'm thinking of regex — but I don't know from where to start…
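One possible way to start, sketched under assumptions: instead of reversing the substitutions, build a tolerant regex from the cleaned word and search the original text with it. Each cleaned character may repeat, may be followed by characters the cleaning deletes, and a plain alef may stand for any of the four alef forms. The names build_tolerant_pattern and find_in_original are hypothetical, and the set of ignorable characters here (tatweel, the diacritics U+064B to U+0652, underscore) is an assumption that would need to be extended to match the full punctuation list.

```python
import re

# Characters the cleaning step deletes outright (assumption: tatweel,
# the diacritics U+064B..U+0652 and the underscore are enough here).
IGNORABLE = "[\u0640\u064b-\u0652_]*"
# normalize_arabic() folds these four alef forms into a plain alef.
ALEF_FORMS = "[\u0622\u0623\u0625\u0627]"


def build_tolerant_pattern(clean_word):
    """Build a regex that matches an uncleaned spelling of clean_word."""
    parts = []
    for ch in clean_word:
        unit = ALEF_FORMS if ch == "\u0627" else re.escape(ch)
        # "+" tolerates repeated characters; IGNORABLE tolerates deletions.
        parts.append("(?:" + unit + IGNORABLE + ")+")
    return re.compile("".join(parts))


def find_in_original(clean_word, original):
    """Return (start, end, matched_text) for the first hit, or None."""
    m = build_tolerant_pattern(clean_word).search(original)
    return (m.start(), m.end(), m.group()) if m else None


input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"
print(find_in_original("اقرا", input_test))
```

The same function works for English words, since every non-alef character is simply escaped. Note that it matches substrings, so a short cleaned word can also be found inside a longer original word.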


Removing different string patterns from Pandas column

I have the following column which consists of email subject headers:
EXT || Transport enquiry
EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry
EXT || FW: Model - Jan
SV: [EXTERNAL] Calculations
What I want to achieve is:
Transport enquiry
0001 || Copy of enquiry
Model - Jan
For this I am using the code below, but it only seems to apply the first regular expression I pass and ignores the rest:
def clean_subject_prelim(text):
    text = re.sub(r'^EXT \|\| $' , '' , text)
    text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '' , text)
    text = re.sub(r'EXT \|\| FW:', '' , text)
    text = re.sub(r'^SV: \[EXTERNAL]$' , '' , text)
    return text
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))
Why is this not working? What am I missing here?
You can use
pattern = r"""(?mx) # MULTILINE mode on
^ # start of string
(?: # non-capturing group start
   EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)? # EXT || or EXT || RE: EXTERNAL: RE: or EXT || FW:
 | # or
   SV:\s*\[EXTERNAL] # SV: [EXTERNAL]
) # non-capturing group end
\s* # zero or more whitespaces
"""
df['subject_clean'] = df['Subject'].str.replace(pattern, '', regex=True)
See the regex demo.
Since the re.X ((?x)) is used, you should escape literal spaces and # chars, or just use \s* or \s+ to match zero/one or more whitespaces.
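As a quick check that does not need pandas, here is a minimal sketch that applies a verbose pattern of this shape to the four sample subjects with plain re. The name SUBJECT_PREFIX is hypothetical, and the SV: [EXTERNAL] branch is taken from the question's own code. Every separator is written as \s* because unescaped spaces are ignored under re.VERBOSE.

```python
import re

# Verbose pattern: whitespace and "#" comments in the pattern are ignored,
# so every literal separator is written as \s*.
SUBJECT_PREFIX = re.compile(r"""
    ^                                    # start of the subject
    (?:
        EXT\s*\|\|\s*                    # the EXT marker and its double pipe
        (?:RE:\s*EXTERNAL:\s*RE:|FW:)?   # optional reply/forward markers
      |
        SV:\s*\[EXTERNAL\]               # the SV marker
    )
    \s*                                  # whitespace left after the prefix
""", re.VERBOSE)

subjects = [
    "EXT || Transport enquiry",
    "EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
    "EXT || FW: Model - Jan",
    "SV: [EXTERNAL] Calculations",
]

cleaned = [SUBJECT_PREFIX.sub("", s) for s in subjects]
for before, after in zip(subjects, cleaned):
    print(repr(before), "->", repr(after))
```

Subjects without a recognised prefix pass through unchanged, because the pattern is anchored at the start and nothing matches.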
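Remove the $ anchor from the patterns (with it, the pattern only matches when the prefix is the entire string) and reorder the substitutions so that the more specific prefixes are removed before the generic EXT || one. Like this: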
import pandas as pd
import re
def clean_subject_prelim(text):
    text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '' , text)
    text = re.sub(r'EXT \|\| FW:', '' , text)
    text = re.sub(r'^EXT \|\|' , '' , text)
    text = re.sub(r'^SV: \[EXTERNAL]' , '' , text)
    return text
data = {"Subject": [
    "EXT || Transport enquiry",
    "EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
    "EXT || FW: Model - Jan",
    "SV: [EXTERNAL] Calculations"]}
df = pd.DataFrame(data)
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))

Python regex pattern match starts with dot and store it in dict format

from pprint import pprint
data = '''
#Long log file
Section Name | Budget | Size | Prev Size | Overflow
.text.resident | 712924 | 794576 | 832688 | YES
.rodata.resident | 77824 | 77560 | 21496 | YES
.data.resident | 28672 | 28660 | 42308 | NO
.bss.resident | 52672 | 1051632 | 1455728 | YES
'''
Output expected:
MEMDICT = {'.text.resident' : {'Budget':'712924', 'Size':'794576', 'Prev Size': '832688' , 'Overflow': 'YES'},
'.rodata.resident' : {'Budget':'', 'Size':'', 'Prev Size': '' , 'Overflow': 'YES'},
'.data.resident' :{'Budget':'', 'Size':'', 'Prev Size': '' , 'Overflow': 'NO'},
'.bss.resident' :{'Budget':'', 'Size':'', 'Prev Size': '' , 'Overflow': 'YES'}}
I am a beginner in Python. Please suggest some simple steps.
Search for a regex pattern and get the headers in a list
pattern = re.compile(r'\sSection Name\s|\sBudget*') # This can be improved,
key_list = (''.join(line.split())).split('|') # Unable to handle space issues, so trimmed and used.
Search for a regex pattern to match .something.resident | \d+ | \d+ | \d+ | **
Need some help and get it in value_list
Making all list into the dict in a loop
mem_info = {} # reset the list
for i in range(0,len(key_list)):
    mem_info[key_list[i]] = value_list[i]
MEMDICT[sta_info[0]] = sta_info
The only thing you haven't shown us is what line ends the section. Other than that, this is what you need:
keeper = False
memdict = {}
for line in open(file):
    if not keeper:
        # Skip everything up to and including the header row
        if 'Section Name' in line:
            keeper = True
        continue
    if '-------------------' in line:
        continue
    if 'whatever ends the section' in line:
        break
    # Split on the column separator and trim the padding
    parts = [p.strip() for p in line.split('|')]
    memdict[parts[0]] = {
        'Budget': int(parts[1]),
        'Size': int(parts[2]),
        'Prev Size': int(parts[3]),
        'Overflow': parts[4]
    }
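For comparison, here is a minimal self-contained sketch that parses the sample table from a string and takes the inner keys from the header row instead of hard-coding them. The function name parse_sections is hypothetical. It assumes the header row starts with Section Name, that data rows start with a dot, and it keeps the values as strings, as in the expected output.

```python
data = '''
#Long log file
Section Name | Budget | Size | Prev Size | Overflow
.text.resident | 712924 | 794576 | 832688 | YES
.rodata.resident | 77824 | 77560 | 21496 | YES
.data.resident | 28672 | 28660 | 42308 | NO
.bss.resident | 52672 | 1051632 | 1455728 | YES
'''


def parse_sections(text):
    """Map each section name to a dict of its column values."""
    headers = None
    result = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if headers is None:
            # Ignore everything before the header row.
            if cells[0] == "Section Name":
                headers = cells[1:]
            continue
        # Keep only rows that look like ".name | v1 | v2 | ..."
        if not cells[0].startswith(".") or len(cells) != len(headers) + 1:
            continue
        result[cells[0]] = dict(zip(headers, cells[1:]))
    return result


MEMDICT = parse_sections(data)
for name, columns in MEMDICT.items():
    print(name, columns)
```

Splitting on the pipe and stripping each cell avoids the whitespace problem mentioned in the question, and no regular expression is needed for this layout.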

Print elements containing only 2 strings

I have this list
lst = [' SOME TEXT\nSOME TEXT\nFTY = 1', 'A|1\nB|5\nC|3\n \nD|0\nE|0', 'D|4\nE|1\nG|1', '\nblah blah', '\n--- HHGTY',
       'SOME TEXT\nFTY = 1\nA|3\nB|2\nC|8\nD|6\nE|9\nF|3', '', 'blah blah\n \nblah blah',
       '--- HHGTY']
and I want to print only the elements containing | or HHGTY. I am using the code below, but it is printing
SOME TEXT and FTY = 1 too. What is wrong? Thanks
>>> for s in lst:
...     if ("|" in s) or ("HHGTY" in s):
...         print(s)
FTY = 1
I think what you want is:
for s in lst:
    for subs in s.split('\n'):
        if ("|" in subs) or ("HHGTY" in subs):
            print(subs)
Your code is doing everything right:
SOME TEXT and FTY = 1 are part of the element 'SOME TEXT\nFTY = 1\nA|3\nB|2\nC|8\nD|6\nE|9\nF|3'.
Because '|' is present in that element, the whole element is printed, including its lines that contain no '|'.
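The difference between testing whole elements and testing individual lines can be sketched as follows. The helper name matching_lines is hypothetical. It splits every element on newlines and keeps only the lines that contain one of the search strings.

```python
lst = [' SOME TEXT\nSOME TEXT\nFTY = 1', 'A|1\nB|5\nC|3\n \nD|0\nE|0',
       'D|4\nE|1\nG|1', '\nblah blah', '\n--- HHGTY',
       'SOME TEXT\nFTY = 1\nA|3\nB|2\nC|8\nD|6\nE|9\nF|3', '',
       'blah blah\n \nblah blah', '--- HHGTY']


def matching_lines(elements, needles=("|", "HHGTY")):
    """Return the individual lines that contain any of the needles."""
    kept = []
    for element in elements:
        for line in element.split("\n"):
            if any(needle in line for needle in needles):
                kept.append(line)
    return kept


kept = matching_lines(lst)
print("\n".join(kept))
```

With the element-level test, the sixth element passes as a whole because it contains a pipe somewhere. With the line-level test, its first two lines are dropped.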

How to ignore punctuation in-between words using word_tokenize in NLTK?

I'm looking to ignore characters in-between words using NLTK word_tokenize.
If I have a sentence:
test = 'Should I trade on the S&P? This works with a phone number 333-445-6635 and email'
The word_tokenize method is splitting S&P into separate tokens.
Is there a way to have this library ignore punctuation between words or letters?
Expected output: 'S&P','?'
Let me know how this works with your sentences.
I added an additional test with a bunch of punctuation.
The regular expression is, in the final portion, modified from the WordPunctTokenizer regexp.
from nltk.tokenize import RegexpTokenizer
punctuation = r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]?'
tokenizer = RegexpTokenizer(r'\w+' + punctuation + r'\w+?|[^\s]+?')
# result:
In [156]: tokenizer.tokenize(test)
Out[156]: ['Should', 'I', 'trade', 'on', 'the', 'S&P', '?']
# additional test:
In [225]: tokenizer.tokenize('"I am tired," she said.')
Out[225]: ['"', 'I', 'am', 'tired', ',', '"', 'she', 'said', '.']
Edit: the requirements changed a bit, so we can slightly modify PottsTweetTokenizer for this purpose.
import html.entities
import re
emoticon_string = r"""
    (?:
      [:;=8] # eyes
      [\-o\*\']? # optional nose
      [\)\]\(\[dDpP/\:\}\{#\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{#\|\\] # mouth
      [\-o\*\']? # optional nose
      [:;=8] # eyes
    )"""
# Twitter symbols/cashtags: # Added by awd, 20140410.
# Based upon Twitter's regex described here: <>.
cashtag_string = r"""(?:\$[a-zA-Z]{1,6}([._][a-zA-Z]{1,2})?)"""
# The components of the tokenizer:
regex_strings = (
# Phone numbers:
(?: # (international)
(?: # (area code)
\d{3} # exchange
\d{4} # base
# Emoticons:
# HTML tags:
# URLs:
# Twitter username:
# Twitter hashtags:
# Twitter symbols/cashtags:
# email addresses
# Remaining word types:
(?:[a-z][^\s]+[a-z]) # Words with punctuation (modification here).
(?:[+\-]?\d+[,/.:-]\d+[+\-]?) # Numbers, including fractions, decimals.
(?:[\w_]+) # Words without apostrophes or dashes.
(?:\.(?:\s*\.){1,}) # Ellipsis dots.
(?:\S) # Everything else that isn't whitespace.
word_re = re.compile(r"""(%s)""" % "|".join(regex_strings), re.VERBOSE | re.I | re.UNICODE)
# The emoticon and cashtag strings get their own regex so that we can preserve case for them as needed:
emoticon_re = re.compile(emoticon_string, re.VERBOSE | re.I | re.UNICODE)
cashtag_re = re.compile(cashtag_string, re.VERBOSE | re.I | re.UNICODE)
# These are for regularizing HTML entities to Unicode:
html_entity_digit_re = re.compile(r"&#\d+;")
html_entity_alpha_re = re.compile(r"&\w+;")
amp = "&amp;"
class CustomTweetTokenizer(object):
    def __init__(self, *, preserve_case: bool=False):
        self.preserve_case = preserve_case

    def tokenize(self, tweet: str) -> list:
        """
        Argument: tweet -- any string object.
        Value: a tokenized list of strings; concatenating this list returns the original string if preserve_case=True
        """
        # Fix HTML character entities:
        tweet = self._html2unicode(tweet)
        # Tokenize:
        matches = word_re.finditer(tweet)
        if self.preserve_case:
            return [match.group() for match in matches]
        return [self._normalize_token(match.group()) for match in matches]

    @staticmethod
    def _normalize_token(token: str) -> str:
        if emoticon_re.search(token):
            # Avoid changing emoticons like :D into :d
            return token
        if token.startswith('$') and cashtag_re.search(token):
            return token.upper()
        return token.lower()

    @staticmethod
    def _html2unicode(tweet: str) -> str:
        """
        Internal method that seeks to replace all the HTML entities in
        tweet with their corresponding unicode characters.
        """
        # First the digits:
        ents = set(html_entity_digit_re.findall(tweet))
        if len(ents) > 0:
            for ent in ents:
                entnum = ent[2:-1]
                entnum = int(entnum)
                tweet = tweet.replace(ent, chr(entnum))
        # Now the alpha versions:
        ents = set(html_entity_alpha_re.findall(tweet))
        ents = filter((lambda x: x != amp), ents)
        for ent in ents:
            entname = ent[1:-1]
            tweet = tweet.replace(ent, chr(html.entities.name2codepoint[entname]))
        tweet = tweet.replace(amp, " and ")
        return tweet
To test it out:
tknzr = CustomTweetTokenizer(preserve_case=True)
# result:
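The modification that matters for S&P is the "words with punctuation" alternative, which is tried before the plain-word alternative. Here is a reduced sketch of that idea using only re, with four alternatives instead of the full tokenizer. The name TOKEN_RE and the simplified phone-number pattern are assumptions for illustration, not part of the tokenizer above.

```python
import re

# Alternatives are tried left to right at each position, so the
# "word with internal punctuation" branch wins over the plain-word branch.
TOKEN_RE = re.compile(r"""
    (?:\d{3}[-.\s]\d{3}[-.\s]\d{4})   # simplified phone number (assumption)
  | (?:[a-z][^\s]+[a-z])              # letter ... letter, punctuation allowed inside
  | (?:\w+)                           # plain words and numbers
  | (?:\S)                            # any other single non-space character
""", re.VERBOSE | re.IGNORECASE)

test = 'Should I trade on the S&P? This works with a phone number 333-445-6635 and email'
tokens = TOKEN_RE.findall(test)
print(tokens)

quoted = TOKEN_RE.findall('"I am tired," she said.')
print(quoted)
```

Because the second branch must end on a letter, trailing punctuation such as the question mark after S&P is split off, while punctuation between two letters stays inside the token. One side effect is that two words joined by punctuation without a space also stay together.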
Following up on @mechanical_meat's answer,
There's a twitter text tokenizer in NLTK
Most probably, it's derived from the PottsTweetTokenizer at
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
text = 'Should I trade on the S&P? This works with a phone number 333-445-6635 and email'
print(tt.tokenize(text))
['Should', 'I', 'trade', 'on', 'the', 'S', '&', 'P', '?', 'This', 'works', 'with', 'a', 'phone', 'number', '333-445-6635', 'and', 'email', '']
But that doesn't solve the S&P problem!!
So you can try the Multi-Word Expression approach, see
from nltk import word_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import MWETokenizer
def multiword_tokenize(text, mwe, tokenize_func=word_tokenize):
    # Initialize the MWETokenizer
    protected_tuples = [tokenize_func(word) for word in mwe]
    protected_tuples_underscore = ['_'.join(word) for word in protected_tuples]
    tokenizer = MWETokenizer(protected_tuples)
    # Tokenize the text.
    tokenized_text = tokenizer.tokenize(tokenize_func(text))
    # Replace the underscored protected words with the original MWE
    for i, token in enumerate(tokenized_text):
        if token in protected_tuples_underscore:
            tokenized_text[i] = mwe[protected_tuples_underscore.index(token)]
    return tokenized_text
text = 'Should I trade on the S&P? This works with a phone number 333-445-6635 and email'
mwe = ['S&P']
tt = TweetTokenizer()
print(multiword_tokenize(text, mwe, tt.tokenize))
['Should', 'I', 'trade', 'on', 'the', 'S&P', '?', 'This', 'works', 'with', 'a', 'phone', 'number', '333-445-6635', 'and', 'email', '']
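To show what the multi-word step does without installing NLTK, here is a pure-Python sketch of the merge: it scans the token list and rejoins any run of tokens that spells out a protected expression. The name merge_protected is hypothetical, and this is a simplified stand-in for illustration, not NLTK's MWETokenizer implementation.

```python
def merge_protected(tokens, protected):
    """Greedily rejoin token runs that spell out a protected expression.

    tokens    -- list of token strings from any tokenizer
    protected -- list of token lists, e.g. [['S', '&', 'P']]
    """
    merged = []
    i = 0
    while i < len(tokens):
        for pieces in protected:
            # Compare the next len(pieces) tokens with the protected run.
            if pieces and tokens[i:i + len(pieces)] == pieces:
                merged.append("".join(pieces))
                i += len(pieces)
                break
        else:
            # No protected expression starts here: keep the token as-is.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ['Should', 'I', 'trade', 'on', 'the', 'S', '&', 'P', '?']
result = merge_protected(tokens, [['S', '&', 'P']])
print(result)
```

Joining the pieces with an empty string only restores the original text when the expression contained no spaces, which holds for S&P. Expressions with spaces would need the original string kept alongside the pieces, as multiword_tokenize does.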

getting alphabets after applying sentence tokenizer of nltk instead of sentences in Python 3.5.1

import codecs, os
import re
import string
import mysql
import mysql.connector
y_ = ""
'''Searching and reading text files from a folder.'''
for root, dirs, files in os.walk("/Users/ultaman/Documents/PAN dataset/Pan Plagiarism dataset 2010/pan-plagiarism-corpus-2010/source-documents/test1"):
    for file in files:
        if file.endswith(".txt"):
            x_ = codecs.open(os.path.join(root,file),"r", "utf-8-sig")
            for lines in x_.readlines():
                y_ = y_ + lines
'''Tokenizing the sentences of the text file.'''
from nltk.tokenize import sent_tokenize
raw_docs = sent_tokenize(y_)
tokenized_docs = [sent_tokenize(y_) for sent in raw_docs]
'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = ''
for review in tokenized_docs:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review+= new_token
    tokenized_docs_no_punctuation += (new_review)
'''Connecting and inserting tokenized documents without punctuation in database field.'''
def connect():
    for i in range(len(tokenized_docs_no_punctuation)):
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'test' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,(tokenized_docs_no_punctuation[i])))
if __name__ == '__main__':
    connect()
After running the above code, the result is like
| 2  | S | N |
| 3  | S | o |
| 4  | S |   |
| 5  | S | d |
| 6  | S | o |
| 7  | S | u |
| 8  | S | b |
| 9  | S | t |
| 10 | S |   |
| 11 | S | m |
| 12 | S | y |
| 13 | S |   |
| 14 | S | d |
in the database.
It should be like:
1 | S | No doubt, my dear friend.
2 | S | no doubt.
I suggest the following edits (use whatever you like); this is what I used to get your code running. The issue is that review, in for review in tokenized_docs:, is already a string, so token, in for token in review:, is a single character. To fix this I tried:
tokenized_docs = ['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', 'Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed;\nthe sick man never moved.', '"Monsieur Bonelle."']
'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []
for review in tokenized_docs:
    new_token = regex.sub(u'', review)
    if not new_token == u'':
        tokenized_docs_no_punctuation.append(new_token)
print(tokenized_docs_no_punctuation)
and got this -
['No doubt my dear friend no doubt but in the meanwhile suppose we talk of this annuity', 'Shall we say one thousand francs a year', 'What', 'asked Bonelle looking at him very fixedly', 'My dear friend I mistook I meant two thousand francs per annum hurriedly rejoined Ramin', 'Monsieur Bonelle closed his eyes and appeared to fall into a gentle slumber', 'The mercer coughed\nthe sick man never moved', 'Monsieur Bonelle']
The final format of the output is up to you. I prefer using lists. But you could concatenate this into a string as well.
nw = []
for review in tokenized_docs[0]:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    nw.append(new_review)
'''Inserting into database'''
def connect():
    for j in nw:
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'Thesis' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,j))
if __name__ == '__main__':
    connect()
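Why single characters end up in the table can be reproduced without a database. In this minimal sketch the helper name strip_punctuation is hypothetical. It keeps the cleaned sentences in a list and then compares indexing that list with indexing a concatenated string, which is what a loop over range(len(...)) does on a string.

```python
import re
import string

PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))


def strip_punctuation(sentences):
    """Return one punctuation-free string per non-empty input sentence."""
    cleaned = []
    for sentence in sentences:
        stripped = PUNCT_RE.sub("", sentence)
        if stripped:
            cleaned.append(stripped)
    return cleaned


sentences = ['"What!"', 'Shall we say one thousand francs a year."', '...']
as_list = strip_punctuation(sentences)
as_string = "".join(as_list)

# Indexing a list yields whole sentences: one row per sentence.
rows_from_list = [as_list[i] for i in range(len(as_list))]
# Indexing a string yields characters: one row per character.
rows_from_string = [as_string[i] for i in range(len(as_string))]

print(rows_from_list)
print(rows_from_string[:4])
```

Keeping the sentences in a list up to the insert loop gives one row per sentence. Accumulating them with += on a string gives one row per character.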
