Python split text without spaces but keep dates as they are - python

To split text without spaces, one can use wordninja; see How to split text without spaces into list of words. Here is the code to do the job:
sent = "Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."
import wordninja
print(' '.join(wordninja.split(sent)))
output: Test 12 to separate merged words but keep rest as it is say 1 2 2021 or 1 2 2021
wordninja looks great and works well for splitting that merged text. My question is: how can I split text without spaces but keep the dates (and punctuation) as they are? An ideal output would be:
Test 12 to separate merged words but keep rest as it is, say 1/2/2021 or 1.2.2021
Your help is much appreciated!

The idea here is to split our string into a list at every instance of a date, then iterate over that list, preserving items that matched the initial split pattern and calling wordninja.split() on everything else. Then recombine the list with join.
import re

def foo(s):
    # Stand-in for wordninja.split(); see the note below the output.
    return 'ninja'

string = 'Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021.'
pattern = re.compile(r'([0-9]{1,2}[/.][0-9]{1,2}[/.][0-9]{1,4})')
# Split the string up by things matching our pattern; the capturing group preserves the dates in the result.
string_isolated_dates = re.split(pattern, string)
# Apply wordninja to everything that doesn't match our date pattern, then join it all together. OP should replace foo in the next line with wordninja.split().
wordninja_applied = ' '.join([el if pattern.match(el) else foo(el) for el in string_isolated_dates])
print(wordninja_applied)
Output:
ninja 1/2/2021 ninja 1.2.2021 ninja
Note: I replaced your function wordninja.split() with foo() just because I don't feel like downloading yet another NLP library. But my code demonstrates modifying the original string while preserving the dates.
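For reference, since wordninja.split() returns a list rather than a string, it needs its own join inside the comprehension. A minimal sketch, assuming the wordninja package is installed:
import re
import wordninja  # assumption: the wordninja package is installed

string = 'Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021.'
pattern = re.compile(r'([0-9]{1,2}[/.][0-9]{1,2}[/.][0-9]{1,4})')
string_isolated_dates = re.split(pattern, string)
# wordninja.split() returns a list of words, so join each result before the outer join.
wordninja_applied = ' '.join([el if pattern.match(el) else ' '.join(wordninja.split(el)) for el in string_isolated_dates])
print(wordninja_applied)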

Finally, I got the following code working, based on comments under my post (thanks for the comments):
import re
import wordninja

sent = "Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."
# Replace commas with spaces so punctuation doesn't stop a token from being alphanumeric.
sent = re.sub(",", " ", sent)
# Run wordninja only on purely alphanumeric tokens; leave dates and other tokens as they are.
corrected = ' '.join([' '.join(wordninja.split(w)) if w.isalnum() else w for w in sent.split(" ")])
print(corrected)
output: Test 12 to separate merged words but keep rest as it is say 1/2/2021 or 1.2.2021.
It is not a straightforward solution, but it works.

Related

Using difflib.get_close_matches to replace word in string - Python

difflib.get_close_matches can return a single close match when I supply the sample string and the list of possibilities. How can I use that close match to replace the matching token in the original string?
# difflibQuestion.py
import difflib
word = ['Summerdalerise', 'Winterstreamrise']
line = 'I went up to Winterstreamrose.'
result = difflib.get_close_matches(line, word, n=1)
print(result)
Output:
['Winterstreamrise']
I want to produce the line:
I went up to Winterstreamrise.
For many lines and words.
I have checked the docs, but I can't find any reference to the string index of the match found by difflib.get_close_matches; the other module classes and functions return lists.
I Googled "python replace word in line using difflib" etc. I can't find any reference to anyone else asking/writing about it. It would seem a common scenario to me.
This example is of course a simplified version of my 'real world' scenario, which may be of help, since I am dealing more with table data (rather than lines):
Surname, First names, Street Address, Town, Job Description
And my 'words' are a large list of street base names, e.g. MAIN, EVERY, EASY, LOVERS (without the Road, Street, Lane), so difflib.get_close_matches could be used to substitute the string in column x (the 'line') with the closest match (the 'word').
However I would appreciate anyone suggesting an approach to either of these examples.
You could try something like this:
import difflib
possibilities = ['Summerdalerise', 'Winterstreamrise']
line = 'I went up to Winterstreamrose.'
newWords = []
for word in line.split():
    result = difflib.get_close_matches(word, possibilities, n=1)
    # Fall back to the original word when there is no close match.
    newWords.append(result[0] if result else word)
result = ' '.join(newWords)
print(result)
Output:
I went up to Winterstreamrise
Explanation:
The docs show a first argument named word, and there is no suggestion that get_close_matches() has any awareness of sub-words within this argument; rather, it reports on the closeness of a match between this word atomically and the list of possibilities supplied as the second argument.
We can add the awareness of words within line by splitting it into a list of such words which we iterate over, calling get_close_matches() for each word separately and modifying the word in our result only if there is a match.
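For the table scenario mentioned in the question, the same idea can be applied cell by cell. A minimal sketch, assuming pandas and an invented 'Street Address' column with misspelled entries:
import difflib
import pandas as pd

street_names = ['MAIN', 'EVERY', 'EASY', 'LOVERS']  # base names as described in the question

def closest_street(cell):
    # Compare the whole cell against the base names; keep the cell unchanged if nothing is close.
    matches = difflib.get_close_matches(cell, street_names, n=1)
    return matches[0] if matches else cell

df = pd.DataFrame({'Street Address': ['MIAN', 'EVERYY', 'LOVRES']})  # hypothetical data
df['Street Address'] = df['Street Address'].map(closest_street)
print(df)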

Delete rows based on the multiple words stored in a list as fast as possible

I have a dataframe with a column named Keyword. There are around 1M keywords. I want to delete all the rows where the Keyword contains any of the words stored in a list.
Here are some words stored in the list:
excluded_words = ['nz','ca']
I have tried the following code:
df[~df['Keyword'].str.contains('|'.join(excluded_words), regex=True)]
This code is blazing fast and does its job, but with one little issue: it deletes any keyword that contains "ca" anywhere, even inside another word. I want to delete only those keywords where "ca" is a separate word.
For example, let's say we have the two keywords below:
cast iron sump pump
sump pump repair service near ca
The first keyword shouldn't be deleted, as "ca" is just part of the word "cast", not a word by itself, while the second keyword should indeed be deleted, as "ca" appears there as a separate word.
How to modify the code so that it can deal with it? Thank you in advance.
You can surround each word to exclude with r'\b', a raw Python string representing the regular expression special sequence for a word boundary (see the re module docs):
excluded_words = ['nz', 'ca']
excluded_words = [r'\b' + x + r'\b' for x in excluded_words]
df[~df['Keyword'].str.contains('|'.join(excluded_words), regex=True)]
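A quick check on the two example keywords from the question confirms the boundary behaviour. A minimal sketch; the DataFrame here is made up for illustration:
import pandas as pd

df = pd.DataFrame({'Keyword': ['cast iron sump pump',
                               'sump pump repair service near ca']})
excluded_words = [r'\b' + x + r'\b' for x in ['nz', 'ca']]
# 'cast iron sump pump' survives because \bca\b cannot match inside the word 'cast'.
print(df[~df['Keyword'].str.contains('|'.join(excluded_words), regex=True)])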

Remove all words in a string that contain any given substrings using python

I have a .csv file that has a column containing text. For each item in this column there is a gene name and a date (for example 'CYP2C19, CYP2D6 07/17/2020'). I want to remove the dates from all of the values in this column so that only the genes are visible (output: 'CYP2C19, CYP2D6'). Secondly, in some boxes there is both a gene name and an indication that there is no recommendation ('CYP2C9 08/19/2020 (no recommendation'). In these cases, I would like to remove both the date and the statement that says no recommendation (output: 'CYP2C9').
I have tried using the code below to remove any text that contains slashes from a single string (I have not yet tried anything on the entire .csv file). However, it left the '07' from the date.
import re
pattern = re.compile('/(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
s = 'CYP2C19, CYP2D6 07/17/2020'
pattern.sub('', s)
Output: 'CYP2C19, CYP2D6 07'
One method is to just take the date out of the string, and then split it as you please. Note this works for any number of dates:
import re
x = 'CYP2C19, CYP2D6 07/17/2020'
x = re.sub(r'\s*\d{2}/\d{2}/\d{4}', "", x)
You could replace \s* with \s if you always know there will be only a single space separating the term you want and the date, but I don't see a reason to do that.
Note that you could now split this by the delimiter, which, in the case of your question, is actually a comma followed by a space:
result = x.split(", ")
# ['CYP2C19', 'CYP2D6']
Although in your csv you may find that it is just a comma (as CSVs normally are).
Combining the steps above:
import re
x = 'CYP2C19 08/15/1972, CYP2D6 07/17/2020'
x = re.sub(r'\s*\d{2}/\d{2}/\d{4}', "", x).split(", ")
# ['CYP2C19', 'CYP2D6']
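The question also mentions a trailing '(no recommendation' note after some dates. One way to strip that at the same time is to make it an optional part of the pattern; a sketch, assuming the note always directly follows the date:
import re

x = 'CYP2C9 08/19/2020 (no recommendation'
# Remove the date plus, if present, the trailing '(no recommendation' note (closing paren optional).
x = re.sub(r'\s*\d{2}/\d{2}/\d{4}(\s*\(no recommendation\)?)?', '', x)
print(x)  # 'CYP2C9'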
I think that you could take each column and then split it.
For example, let's take the following string: column = ' CYP2D6 07/17/2020'.
You could do m = column.split(); then you will obtain a list like m = ['CYP2D6', '07/17/2020'].
After that you could simply take gene = m[0].

Pandas extracting text multiple times with same criteria

I have a DataFrame and in one cell I have a long text, e.g.:
-student- Kathrin A -/student- received abc and -student- Mike B -/student-
received def.
My question is: how can I extract the text between -student- and -/student- and create two new columns, with "Kathrin A" in the first one and "Mike B" in the second one? Note that this pattern occurs twice, or even more times, in the text.
What I have tried so far: str.extract('-student-\s*([^.]*)\s*-/student-', expand=False), but this only extracts the first match, i.e. Kathrin A.
Many thanks!
You could use str.split with regex and define your delimiters as follows:
splittxt = ['-student-','-/student-']
df.text.str.split('|'.join(splittxt), expand=True)
Output:
  0           1                   2        3              4
0    Kathrin A   received abc and   Mike B   received def.
Another approach would be to try extractall. The only caveat is that the result is put into multiple rows instead of multiple columns. With some rearranging this should not be an issue; please update this response if you end up working it out.
That being said I also have a slight modification to your regular expression which will help you with capturing both.
'(?<=-student-)(?:\s*)([\w\s]+)(?= -/student-)'
The only capturing group is [\w\s]+ so you'll be sure to not end up capturing the whole string.
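For the rearranging, one option is to unstack the per-match index level into columns. A sketch, using a made-up single-row DataFrame and the modified pattern above:
import pandas as pd

df = pd.DataFrame({'text': ['-student- Kathrin A -/student- received abc and '
                            '-student- Mike B -/student- received def.']})
# extractall returns one row per match, indexed by (row, match number)...
matches = df['text'].str.extractall(r'(?<=-student-)(?:\s*)([\w\s]+)(?= -/student-)')
# ...so unstacking the 'match' level spreads the matches into columns.
print(matches[0].unstack('match'))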

How to find the count/occurrence of one string(can be multi-word) in another string(sentence) in python

I have to count the occurrences of a string (which can be one or more words) in another string (which is a sentence), and the match should not be case-sensitive.
For instance -
a = "Hi my name is Alex and hi to you as well. How high is the building? The highest floor is 18th. Highlights .... She said hi as well. Do you know highlights of the match ... hi."
b = "hi" #word/sentence to find count of
I tried -
a.lower().count(b)
which returns
>> 8
while the required answer should be
>> 4.
For multi-word strings, this method seems to work, but I am not sure about the limiting cases. How can I fix this?
You can use re.findall to search for the substring with leading and trailing word boundaries:
import re
print(len(re.findall(r'\b{}\b'.format(b), a, re.I)))  # -> 4
# \b ... \b wraps the search term in word boundaries; re.I makes the match case-insensitive.
If b might contain regex metacharacters, escape it first with re.escape(b).
The function works just fine: the sequence "hi" appears 8 times in the string. Since you want it only as words, you'll need to figure out how you can differentiate the word "hi" from the incidental appearance in other words, such as "chipper".
One common way is to use the re package (regular expressions), but that may be more learning than you want to do right now.
A better way at the moment would be to split the string into words before you check each:
word_list = a.lower().split()
b_count = word_list.count(b)
Note that this considers only spaces when dividing words. It still won't find "hi" in "hi-performance", for example. You'd need another split operation for other separators.
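For example, a hyphen-aware variant could split on whitespace and hyphens together and strip leftover punctuation from each token. A sketch; the short sample string is made up for illustration:
import re

a = "hi-performance parts; say hi!"
b = "hi"
# Split on runs of whitespace or hyphens, then strip trailing punctuation from each token.
tokens = re.split(r'[\s\-]+', a.lower())
print(sum(t.strip('.,;!?') == b for t in tokens))  # 2: the 'hi' in 'hi-performance' plus the standalone 'hi'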
"Spliting" a sentence into words is not trivial.
There in a package in python to do that : nltk.
First install this package using pip or system specific package manager.
Then run ipython and use nltk.download() to download "punkt" data : type d then type punkt. Then quit q.
Then use:
import nltk
tokens = nltk.word_tokenize(a)
len(list(filter(lambda x: x.lower() == b, tokens)))
It returns 4.
Use str.split() and filter out punctuation with regex:
import re
a = "Hi my name is Alex and hi to you as well. How high is the building? The highest floor is 18th. Highlights .... She said hi as well. Do you know highlights of the match ... hi."
b = "hi"
final_count = sum(re.sub(r"\W+", '', i.lower()) == b for i in a.split())
print(final_count)
Output:
4
