Prune dataset vocabulary - python

I have a training and a testing dataset from the 20 Newsgroups dataset available in sklearn. I have imported the data and created a bag of words that I can run through a naive Bayes classifier. My current code is below:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def prep(categories):
    # Import Newsgroup data
    datatrain = fetch_20newsgroups(subset='train', categories=categories)
    datatest = fetch_20newsgroups(subset='test', categories=categories)
    countvect = CountVectorizer()  # Create CountVectorizer
    Xtrain_counts = countvect.fit_transform(datatrain.data)
    tfidf = TfidfTransformer()  # Term-frequency transformer
    Xtrain_tfidf = tfidf.fit_transform(Xtrain_counts)
    print("\nTfidf Dimensions: %s" % str(Xtrain_tfidf.shape))
    print("\nVocabulary: %s unique 'words'" % len(countvect.vocabulary_))
From here, I want to prune the data so that I ignore strings like "w32w" or email IDs or common words like "an", "the", "is" to try and improve the accuracy of my classifier. I have a regex that can catch emails below:
found = re.findall(r'[\w\.-]+#[\w\.-]+', Xtrain_tfidf)
How can I apply the regex in such a manner that it removes data that matches, and how can I expand the regex to include common words?
String samples:
From: matt-dah#dsv.su.se (Mattias Dahlberg)
Subject: Re: REAL-3D
Organization: Dept. of Computer and Systems Sciences, Stockholm University
Lines: 17
X-Newsreader: TIN [version 1.1 PL8]
Rauno Haapaniemi (raunoh#otol.fi) wrote:
Earlier today I read an ad for REAL-3D animation & ray-tracing software
and it looked very convincing to me.
Yes, it looks like very good indeed.
However, I don't own an Amiga and so I began to wonder, if there's a PC
version of it.
Nope.
Expected output:
Mattias Dahlberg REAL-3D Dept of Computer Systems Sciences Stockholm University Rauno Haapaniemi Earlier today read ad for REAL-3D animation & ray-tracing software looked very convincing to me Yes looks like very good indeed However I don't own an Amiga and so began to wonder there's PC version of Nope
From this you can see that the emails, common words, punctuation have all been stripped.

You may use re.sub:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
with a regex that matches email-like substrings and stopwords as whole words. Note that re.sub operates on strings, so apply it to the raw documents (datatrain.data) before vectorizing rather than to the tf-idf matrix:
from nltk.corpus import stopwords
pattern = r"[\w.-]+#[\w.-]+|\b(?:{})\b".format("|".join(set(stopwords.words('english'))))
result = [re.sub(pattern, '', doc) for doc in datatrain.data]
Note the r'' prefix, which defines a raw string literal: a \ is kept as a literal backslash, so \b is treated as a word boundary rather than a backspace character.
The pattern will match:
[\w.-]+#[\w.-]+ - one or more word, . or - chars, followed by # and then again one or more word, . or - chars
| - or
\b(?:and|or|not|a|an|is|the|of|like)\b - any of the alternatives in the non-capturing alternation group, matched as whole words because \b is a word boundary.
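Alternatively, the same cleaning can be hooked straight into the question's prep() function via CountVectorizer's preprocessor argument. This is a sketch, assuming NLTK's English stopword list; CountVectorizer(stop_words='english') is a built-in alternative for the common-word part:
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# email-like substrings (with # standing in for @, as in the samples) or whole stopwords
stop_re = re.compile(r"[\w.-]+#[\w.-]+|\b(?:{})\b".format("|".join(set(stopwords.words('english')))))

def clean(doc):
    # lowercase, then strip emails and stopwords before tokenization
    return stop_re.sub('', doc.lower())

countvect = CountVectorizer(preprocessor=clean)
The tf-idf step from prep() stays unchanged; the vocabulary size reported by countvect.vocabulary_ should shrink accordingly.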

Related

Spacy tokenization add extra white space for dates with hyphen separator when I manually build the Doc

I've been trying to solve a problem with the spacy Tokenizer for a while, without any success. Also, I'm not sure if it's a problem with the tokenizer or some other part of the pipeline.
Description
I have an application that, for reasons beside the point, creates a spacy Doc from the spacy vocab and a list of tokens from a string (see code below). Note that while this is not the simplest or most common way to do this, according to the spacy docs it can be done.
However, when I create a Doc for a text that contains compound words or dates with a hyphen as separator, the behavior I get is not what I expected.
import spacy
from spacy.tokens import Doc  # Doc lives in spacy.tokens, not spacy.language

# nlp is a loaded pipeline, e.g. nlp = spacy.load("en_core_web_sm")
# My current way
doc = Doc(nlp.vocab, words=tokens)  # tokens is a well-defined list of tokens for a certain string
# Standard way
doc = nlp("My text...")
For example, with the following text, if I create the Doc using the standard procedure, the spacy Tokenizer recognizes the "-" as tokens but the Doc text is the same as the input text, in addition the spacy NER model correctly recognizes the DATE entity.
import spacy

nlp = spacy.load("en_core_web_sm")  # assuming a standard English pipeline with NER
doc = nlp("What time will sunset be on 2022-12-24?")
print(doc.text)
tokens = [str(token) for token in doc]
print(tokens)
# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)
Output:
What time will sunset be on 2022-12-24?
['What', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
DATE
2022-12-24
On the other hand, if I create the Doc from the model's vocab and the previously calculated tokens, the result obtained is different. Note that for the sake of simplicity I am using the tokens from doc, so I'm sure there are no differences in tokens. Also note that I am manually running each pipeline model in the correct order with the doc, so at the end of this process I would theoretically get the same results.
However, as you can see in the output below, while the Doc's tokens are the same, the Doc's text is different: there are blank spaces between the digits and the date separators.
doc2 = Doc(nlp.vocab, words=tokens)

# Run each model in pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)

# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)
# Show entities
print(doc2.ents[0].label_)
print(doc2.ents[0].text)
Output:
what time will sunset be on 2022 - 12 - 24 ?
['what', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
DATE
2022 - 12 - 24
I know it must be something silly that I'm missing but I don't realize it.
Could someone please explain to me what I'm doing wrong and point me in the right direction?
Thanks a lot in advance!
EDIT
Following Talha Tayyab's suggestion, I have to create an array of booleans with the same length as my list of tokens, indicating for each one whether the token is followed by a space. Then I pass this array to the Doc constructor as follows: doc = Doc(nlp.vocab, words=words, spaces=spaces).
To compute this list of boolean values based on my original text string and list of tokens, I implemented the following vanilla function:
from typing import List

def get_spaces(self, text: str, tokens: List[str]) -> List[bool]:
    # Spaces
    spaces = []
    # Copy text so it is easy to operate on
    t = text.lower()
    # Iterate over tokens
    for token in tokens:
        if t.startswith(token.lower()):
            t = t[len(token):]  # Remove token
            # If after removing the token we have a space
            if len(t) > 0 and t[0] == " ":
                spaces.append(True)
                t = t[1:]  # Remove space
            else:
                spaces.append(False)
    return spaces
With these two improvements in my code, the result obtained is as expected. However, now I have the following question:
Is there a more spacy-like way to compute whitespace, instead of using my vanilla implementation?
Please try this:
from spacy.tokens import Doc

doc2 = Doc(nlp.vocab, words=tokens, spaces=[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Run each model in pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)

# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)
# Show entities
print(doc2.ents[0].label_)
print(doc2.ents[0].text)
# You can also replace 0 with False and 1 with True
This is the complete syntax:
doc = Doc(nlp.vocab, words=words, spaces=spaces)
spaces are a list of boolean values indicating whether each word has a subsequent space. Must have the same length as words, if specified. Defaults to a sequence of True.
So you can choose which tokens are followed by a space and which are not.
Reference: https://spacy.io/api/doc
Late to this, but as you've retrieved the tokens from a document to begin with, I think you can just use the whitespace_ attribute of the token for this. Then your `get_spaces` function looks like:
def get_spaces(tokens):
    return [1 if token.whitespace_ else 0 for token in tokens]
Note that this won't work nicely if there are multiple spaces or non-space whitespace (e.g. tabs), but then you probably need to update the tokenizer or use your existing solution and update this part:
if len(t) > 0 and t[0] == " ":
    spaces.append(True)
    t = t[1:]  # Remove space
to check for generic whitespace and remove more than just a leading space.
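Putting the two answers together, a minimal sketch (assuming en_core_web_sm; the output comments mirror the standard-pipeline run from the question):
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
doc = nlp("What time will sunset be on 2022-12-24?")

# rebuild the Doc from the words plus each token's trailing-whitespace flag
words = [token.text for token in doc]
spaces = [bool(token.whitespace_) for token in doc]
doc2 = Doc(nlp.vocab, words=words, spaces=spaces)

# run the pipeline components on the rebuilt Doc
for model_name in nlp.pipe_names:
    doc2 = nlp.get_pipe(model_name)(doc2)

print(doc2.text)                                       # What time will sunset be on 2022-12-24?
print([(ent.text, ent.label_) for ent in doc2.ents])   # should match the standard run: [('2022-12-24', 'DATE')]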

Using regex to match adjacent words using first letter of abbreviations in text python

I have a list of abbreviations that I am trying to find in my text using regex. However, I am struggling to match adjacent words by their first letters and have only achieved this with full-word matching. Here is my text:
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
My list is as such: [USNS, UGF, NFZ, AR]
I would like to find the corresponding long forms in the text using the first letter of each abbreviation. It would also need to be case-insensitive. My attempt so far:
re.search(r'\bUnited\W+?States\b\W+?Navy\b\W+?Seals\b', text)
which returns United States Navy Seals. However, when I try to use just the first letters:
re.search(r'\bU\W+?S\b\W+?N\b\W+?S\b', text)
It returns nothing. Furthermore, some of the abbreviations contain more than just the initial of each word in the text, such as UGF for underground facility.
My actual goal is to eventually replace all abbreviations in the text (USNS, UGF, NFZ, AR) with their corresponding long forms (United States Navy Seals, underground facility, no-fly-zone, assault-rifle).
In your last regex [1]
re.search(r'\bU\W+?S\b\W+?N\b\W+?S\b', text)
you get no match because you made several mistakes:
\w+ means one or more word characters, \W+ is for one or more non-word characters.
the \b boundary anchor is sometimes in the wrong place (i.e. between the initial letter and the rest of the word)
re.search(r'\bU\w+\sS\w+?\sN\w+?\sS\w+', text)
should match.
And, well,
print(re.search(r'\bu\w+?g\w+\sf\w+', text))
matches underground facility, of course, but in a long text there will be many more irrelevant matches.
Approach to generalization
Finally I built a little "machine" that dynamically creates regular expressions from the known abbreviations:
import re

text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''

abbrs = ['USNS', 'UGF', 'NFZ', 'AR']

for abbr in abbrs:
    pattern = ''.join(map(lambda i: '[' + i.upper() + i.lower() + '][a-z]+[ a-z-]', abbr))
    print(pattern)
    print(re.search(pattern, text, flags=re.IGNORECASE))
The output of above script is:
[Uu][a-z]+[ a-z-][Ss][a-z]+[ a-z-][Nn][a-z]+[ a-z-][Ss][a-z]+[ a-z-]
<re.Match object; span=(20, 45), match='United States Navy Seals '>
[Uu][a-z]+[ a-z-][Gg][a-z]+[ a-z-][Ff][a-z]+[ a-z-]
<re.Match object; span=(89, 110), match='underground facility '>
[Nn][a-z]+[ a-z-][Ff][a-z]+[ a-z-][Zz][a-z]+[ a-z-]
<re.Match object; span=(140, 152), match='no-fly-zone '>
[Aa][a-z]+[ a-z-][Rr][a-z]+[ a-z-]
<re.Match object; span=(170, 184), match='assault-rifle '>
Further generalization
If we assume that in a text each abbreviation is introduced after the first occurrence of the corresponding long form, and we further assume that the way it is written definitely starts with a word boundary and definitely ends with a word boundary (no assumptions about capitalization and the use of hyphens), we can try to extract a glossary automatically like this:
import re

text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''

# build a regex for an initial
def init_re(i):
    return f'[{i.upper() + i.lower()}][a-z]+[ -]??'

# build a regex for an abbreviation
def abbr_re(abbr):
    return r'\b' + ''.join([init_re(i) for i in abbr]) + r'\b'

# build an inverse glossary from a text
def inverse_glossary(text):
    abbreviations = set(re.findall(r'\([A-Z]+\)', text))
    igloss = dict()
    for pabbr in abbreviations:
        abbr = pabbr[1:-1]
        pattern = '(' + abbr_re(abbr) + ') ' + r'\(' + abbr + r'\)'
        m = re.search(pattern, text)
        if m:
            longform = m.group(1)
            igloss[longform] = abbr
    return igloss

igloss = inverse_glossary(text)
for long in igloss:
    print('{} -> {}'.format(long, igloss[long]))
The output is
no-fly-zone -> NFZ
United States Navy Seals -> USNS
assault-rifle -> AR
underground facility -> UGF
By using an inverse glossary you can easily replace all long forms with their corresponding abbreviation; it is a bit harder to do so for all but the first occurrence. There is plenty of room for refinement, for example correctly handling line breaks within long forms (and using re.compile).
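A minimal sketch of that long-form-to-abbreviation replacement, building on the igloss dictionary above (re.escape keeps hyphens and any other punctuation in the long forms literal; note it also rewrites the first occurrences, which is the harder case just mentioned):
def abbreviate(text, igloss):
    # replace every occurrence of each long form with its abbreviation
    for longform, abbr in igloss.items():
        text = re.sub(re.escape(longform), abbr, text)
    return text

print(abbreviate(text, igloss))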
To replace the abbreviations with the long forms, you have to build a normal glossary instead of an inverse one:
# build a glossary from a text
def glossary(text):
    abbreviations = set(re.findall(r'\([A-Z]+\)', text))
    gloss = dict()
    for pabbr in abbreviations:
        abbr = pabbr[1:-1]
        pattern = '(' + abbr_re(abbr) + ') ' + r'\(' + abbr + r'\)'
        m = re.search(pattern, text)
        if m:
            longform = m.group(1)
            gloss[abbr] = longform
    return gloss

gloss = glossary(text)
for abbr in gloss:
    print('{}: {}'.format(abbr, gloss[abbr]))
The output here is
AR: assault-rifle
NFZ: no-fly-zone
UGF: underground facility
USNS: United States Navy Seals
The replacement itself is left to the reader.
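For completeness, a minimal sketch of that replacement using the glossary built above (the word boundaries keep AR from matching inside other words; the parenthesized occurrences get expanded too):
def expand(text, gloss):
    # replace each abbreviation, as a whole word, with its long form
    for abbr, longform in gloss.items():
        text = re.sub(r'\b' + abbr + r'\b', longform, text)
    return text

print(expand(text, gloss))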
[1]
Let's take a closer look at your first regex again:
re.search(r'\bUnited\W+?States\b\W+?Navy\b\W+?Seals\b', text)
The boundary anchors (\b) are redundant. They can be removed without changing anything in the result, because \W+? means at least one non-word character after the last character of States and Navy. They cause no problems here, but I suspect they led to confusion when you started modifying this regex to get a more general one.
You could use the regex below, which would take care of the case sensitivity as well.
This would just find United States Navy Seals.
\s[u|U].*?[s|S].*?[n|N].*?[s|S]\w+
Similarly, for UGF,
You can use - \s[u|U].*?[g|G].*?[f|F]\w+
The characters are just joined with .*?, and each character is written as [a|A], which matches either lower case or upper case. The start is \s, since it should be a separate word, and the end is \w+.
Play around.

Correct POS tags for numbers substituted with ## in spacy

The gigaword dataset is a huge corpus used to train abstractive summarization models. It contains summaries like these:
spain 's colonial posts #.## billion euro loss
taiwan shares close down #.## percent
I want to process these summaries with spacy and get the correct pos tag for each token. The issue is that all numbers in the dataset were replaced with # signs which spacy does not classify as numbers (NUM) but as other tags.
>>> import spacy
>>> from spacy.tokens import Doc
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.tokenizer = lambda raw: Doc(nlp.vocab, words=raw.split(' '))
>>> text = "spain 's colonial posts #.## billion euro loss"
>>> doc = nlp(text)
>>> [(token.text, token.pos_) for token in doc]
[('spain', 'PROPN'), ("'s", 'PART'), ('colonial', 'ADJ'), ('posts', 'NOUN'), ('#.##', 'PROPN'), ('billion', 'NUM'), ('euro', 'PROPN'), ('loss', 'NOUN')]
Is there a way to customize the POS tagger so that it classifies all tokens that only consist of #-sign and dots as numbers?
I know you can replace the spacy POS tagger with your own or fine-tune it for your domain with additional data, but I don't have tagged training data where all numbers are replaced with #, and I would like to change the tagger as little as possible. I would prefer a regular expression or a fixed list of tokens that are always recognized as numbers.
What about replacing # with a digit?
In a first version of this answer I chose the digit 9, because it reminds me of the COBOL numeric field formats I used some 30 years ago... But then I had a look at the dataset, and realized that for proper NLP processing one should get at least a couple of things straight:
ordinal numerals (1st, 2nd, ...)
dates
Ordinal numerals need special handling for any choice of digit, but the digit 1 produces reasonable dates, except for the year (of course, 1111 may or may not be interpreted as a valid year, but let's play it safe). 11/11/2020 is clearly better than 99/99/9999...
Here is the code:
import re

ic = re.IGNORECASE
subs = [
    (re.compile(r'\b1(nd)\b', flags=ic), r'2\1'),   # 1nd -> 2nd
    (re.compile(r'\b1(rd)\b', flags=ic), r'3\1'),   # 1rd -> 3rd
    (re.compile(r'\b1(th)\b', flags=ic), r'4\1'),   # 1th -> 4th
    (re.compile(r'11(st)\b', flags=ic), r'21\1'),   # ...11st -> ...21st
    (re.compile(r'11(nd)\b', flags=ic), r'22\1'),   # ...11nd -> ...22nd
    (re.compile(r'11(rd)\b', flags=ic), r'23\1'),   # ...11rd -> ...23rd
    (re.compile(r'\b1111\b'), '2020')               # 1111 -> 2020
]

text = '''spain 's colonial posts #.## billion euro loss
#nd, #rd, #th, ##st, ##nd, ##RD, ##TH, ###st, ###nd, ###rd, ###th.
ID=#nd#### year=#### OK'''

text = text.replace('#', '1')
for pattern, repl in subs:
    text = re.sub(pattern, repl, text)
print(text)
# spain 's colonial posts 1.11 billion euro loss
# 2nd, 3rd, 4th, 21st, 22nd, 23RD, 11TH, 121st, 122nd, 123rd, 111th.
# ID=1nd1111 year=2020 OK
If the preprocessing of the corpus converts any digit into a # anyway, you lose no information with this transformation. Some “true” # would become a 1, but this would probably be a minor problem compared to numbers not being recognized as such. Furthermore, in a visual inspection of about 500000 lines of the dataset I haven't been able to find any candidate for a “true” #.
N.B.: The \b in the above regular expressions stands for “word boundary”, i.e., the boundary between a \w (word) and a \W (non-word) character, where a word character is any alphanumeric character (further info here). The \1 in the replacement stands for the first group, i.e., the first pair of parentheses (further info here). Using \1 the case of all text is preserved, which would not be possible with replacement strings like 2nd. I later found that your dataset is normalized to all lower case, but I decided to keep it generic.
If you need to get the text with #s back from the parts of speech, it's simply
token.text.replace('0','#').replace('1','#').replace('2','#').replace('3','#').replace('4','#')
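Putting it together with spacy, a rough sketch of the round trip (assuming en_core_web_sm and the whitespace tokenizer from the question; the digit-to-# re-mapping below is a regex equivalent of the chained replace above for this masked data):
import re
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = lambda raw: Doc(nlp.vocab, words=raw.split(' '))

summary = "spain 's colonial posts #.## billion euro loss"

# unmask: every # becomes a digit before tagging (the ordinal/date fix-ups from
# the subs list above would be applied here too if the line needed them)
doc = nlp(summary.replace('#', '1'))

# re-mask the digits for downstream use and keep the POS tags
print([(re.sub(r'\d', '#', t.text), t.pos_) for t in doc])
# '#.##' should now come back tagged NUM instead of PROPN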

Removing chars/signs from string

I'm preparing text for a word cloud, but I get stuck.
I need to remove all digits and all signs like . , - ? = / ! # etc., but I don't know how. I don't want to call replace again and again. Is there a method for that?
Here is my concept and what I have to do:
Concatenate texts in one string
Set chars to lowercase <--- I'm here
Now I want to delete specific signs and divide the text into words (list)
calculate freq of words
next do the stopwords script...
abstracts_list = open('new', 'r')
abstracts = []
allab = ''
for ab in abstracts_list:
    abstracts.append(ab)
for ab in abstracts:
    allab += ab
Lower = allab.lower()
Text example:
MicroRNAs (miRNAs) are a class of noncoding RNA molecules
approximately 19 to 25 nucleotides in length that downregulate the
expression of target genes at the post-transcriptional level by
binding to the 3'-untranslated region (3'-UTR). Epstein-Barr virus
(EBV) generates at least 44 miRNAs, but the functions of most of these
miRNAs have not yet been identified. Previously, we reported BRUCE as
a target of miR-BART15-3p, a miRNA produced by EBV, but our data
suggested that there might be other apoptosis-associated target genes
of miR-BART15-3p. Thus, in this study, we searched for new target
genes of miR-BART15-3p using in silico analyses. We found a possible
seed match site in the 3'-UTR of Tax1-binding protein 1 (TAX1BP1). The
luciferase activity of a reporter vector including the 3'-UTR of
TAX1BP1 was decreased by miR-BART15-3p. MiR-BART15-3p downregulated
the expression of TAX1BP1 mRNA and protein in AGS cells, while an
inhibitor against miR-BART15-3p upregulated the expression of TAX1BP1
mRNA and protein in AGS-EBV cells. Mir-BART15-3p modulated NF-κB
activity in gastric cancer cell lines. Moreover, miR-BART15-3p
strongly promoted chemosensitivity to 5-fluorouracil (5-FU). Our
results suggest that miR-BART15-3p targets the anti-apoptotic TAX1BP1
gene in cancer cells, causing increased apoptosis and chemosensitivity
to 5-FU.
To convert upper case characters to lower case, just store your text in a string variable, for example STRING, and use the command:
STRING = re.sub('([A-Z])', r'\1', STRING).lower()
Now your string will be free of capital letters. (The re.sub here simply replaces each capital letter with itself, so plain STRING.lower() does the same job.)
To remove the special characters, the re module can again help you with the sub command:
STRING = re.sub('[^a-zA-Z0-9-_*.]', ' ', STRING)
With this command your string will be free of special characters. Note that this character class still keeps digits and the characters - _ * . in place; drop them from the class if you want those removed as well.
And to determine the word frequency, you can use the collections module, from which you import Counter.
Then use the following command to determine the frequency with which the words occur:
Counter(STRING.split()).most_common()
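Putting those pieces together, a rough end-to-end sketch (assuming the corpus file is called new, as in the question; adjust the character class to taste):
import re
from collections import Counter

with open('new', 'r') as f:
    allab = f.read().lower()          # concatenate and lowercase in one go

# keep only letters and whitespace, then split into words
words = re.sub(r'[^a-z\s]', ' ', allab).split()

# 20 most frequent words (stopword filtering can follow)
print(Counter(words).most_common(20))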
I'd probably try to use str.isalpha():
abstracts = []
with open('new', 'r') as abstracts_list:
    for ab in abstracts_list:  # this gives one line of text.
        if not ab.isalpha():
            # keep letters and whitespace so the words stay separated
            ab = ''.join(c for c in ab if c.isalpha() or c.isspace())
        abstracts.append(ab.lower())

# now assuming you want the text in one big string like allab was
long_string = ''.join(abstracts)

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I have been testing it out on an online regex tester. I am trying to get all words of between 3 and 5 characters but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches an initial capital letter followed by any 2 to 4 characters and then a literal dot. You can tighten that up if required, e.g. using [a-z]: the pattern [A-Z][a-z]{2,4}\. would match an upper case character followed by 2 to 4 lowercase characters followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
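As a quick check of that tightened pattern against the sample text:
import re
# capital letter, 2 to 4 lowercase letters, literal dot; duplicates removed
print(sorted(set(re.findall(r'\b[A-Z][a-z]{2,4}\.', text))))
# ['King.', 'Rosse.']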
You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
for line in open('text.txt'):
    words = line.split()
    for word in words:
        if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
            matched_words.append(word)
print(matched_words)
