separate words in a sentence that has comma between them [duplicate] - python

This question already has answers here:
Split string with multiple delimiters in Python [duplicate]
(5 answers)
How to split at spaces and commas in Python?
(3 answers)
Closed 4 years ago.
I want to remove commas from one sentence and separate all the other words(a-z) and print them one by one.
a = input()
b = list(a)  # split into characters to remove punctuation
for item in list(b):  # iterate over a copy to prevent "index out of range" error
    if item == ',':
        b.remove(item)
c = "".join(b)  # sentence without commas
c = c.split()
print(c)
My input is :
The university was founded as a standard academy,and developed to a university of technology by Habib Nafisi.
and when I remove the comma:
... founded as a standard academyand developed to a university...
and when I split the words:
The
university
.
.
.
academyand
.
.
.
what can I do to prevent this?
I already tried replace method and it doesn't work.

You could replace , with a space (assuming there is no space between , and the next word in your input [1]) and then perform split:
s = 'The university was founded as a standard academy,and developed to a university of technology by Habib Nafisi.'
print(s.replace(',', ' ').split())
# ['The', 'university', 'was', 'founded', 'as', 'a', 'standard', 'academy', 'and', 'developed', 'to', 'a', 'university', 'of', 'technology', 'by', 'Habib', 'Nafisi.']
Alternatively, you could also try your hand at regex:
import re
s = 'The university was founded as a standard academy,and developed to a university of technology by Habib Nafisi.'
print(re.split(r' |,', s))
[1] Note: The replace-and-split approach works even if there are one or more spaces after the , because split() with no arguments splits on any run of whitespace.
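One caveat with the regex version: splitting on a single space or comma leaves empty strings wherever a comma is followed by a space. A minimal sketch of a more forgiving pattern, using a shortened sample sentence (my own, not the asker's exact input):

```python
import re

s = 'The university was founded as a standard academy, and developed'

# Splitting on a single space or comma leaves an empty string
# wherever a comma is directly followed by a space.
naive = re.split(r' |,', s)

# Treating any run of commas and whitespace as one delimiter avoids that.
words = re.split(r'[,\s]+', s)

print(naive)
print(words)
```

The character class `[,\s]+` is just one way to express "one or more delimiters"; add other punctuation to it if your input needs it.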

Your issue seems to be that there is no space between the comma and the next word here: academy,and. You could solve this by ensuring that there is a space, so that when you later call split() each word ends up as a separate element of the list. Note that b=list(a) splits the string into individual characters, not words.

This is probably what you want; it looks like you forgot to replace the comma with a space. Note that replace returns a new string, so the result has to be assigned back:
stri = """ The university was founded as a standard academy,and developed to a university of technology by Habib Nafisi."""
stri = stri.replace(",", " ")
print(stri.split())
This will give you the output as a list:
['The', 'university', 'was', 'founded', 'as', 'a', 'standard', 'academy', 'and', 'developed', 'to', 'a', 'university', 'of', 'technology', 'by', 'Habib', 'Nafisi.']

If you consider words to be runs of characters separated by spaces, then replacing a , with nothing leaves no space between the two neighbouring words, so they are treated as one word.
The easiest way to do this is to replace the comma with a space, and then split based on spaces:
my_string = "The university was founded as a standard academy,and developed to a university of technology by Habib Nafisi."
list_of_words = my_string.replace(",", " ").split()
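To make the difference concrete, here is a small sketch comparing the two replacements on a shortened fragment of the sentence (the fragment and variable names are mine):

```python
sentence = "a standard academy,and developed"

# Deleting the comma glues the neighbouring words together.
removed = sentence.replace(",", "").split()

# Replacing it with a space keeps them apart; split() with no
# arguments also collapses any repeated whitespace.
replaced = sentence.replace(",", " ").split()

print(removed)   # ['a', 'standard', 'academyand', 'developed']
print(replaced)  # ['a', 'standard', 'academy', 'and', 'developed']
```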

Related

How to end a regular expression when one of several possible phrases are found?

I basically know nothing about regex, but with the help of Google I'm attempting to use it to create an address parser that only extracts the street number and name (ex. 123 Random Boulevard) from a string of text (ex. "Hey I live at 123 Random Boulevard if you were wondering"). To do this, I created a list of words that street names end with (ex. avenue, street, place, way, etc.).
What syntax do I use in the 6th line of my code (regex_partialaddress) to get the regular expression to end upon encountering one of these words from the list?
Thanks in advance—any help is much appreciated.
So far, I have attempted to run the following lines of code
regex_partialaddress = "[0-9]{1,4} $['Way', 'Ave', 'Rd', 'Blvd', 'St.', 'Pl.', 'Dr.', 'Cir.', 'Ln', 'Ct', 'Hwy', 'Pkwy', 'Plaza', 'Highway', 'Court', 'Lane', 'Circle', 'Boulevard', 'Street', 'Road', 'Avenue', 'Drive', 'Place', 'Temple', 'Parkway']{1}"
re.findall(regex_partialaddress, "Hey I live at 123 Random Boulevard if you were wondering")
It compiled but it was not successful
What you want to do is separate the words you want to search for with pipe ('|') characters and place them inside a set of parens. Here's how to do that with a list of valid street types:
import re
street_types = ['Way', 'Ave', 'Rd', 'Blvd', 'St.', 'Pl.', 'Dr.', 'Cir.', 'Ln', 'Ct', 'Hwy', 'Pkwy', 'Plaza',
'Highway', 'Court', 'Lane', 'Circle', 'Boulevard', 'Street', 'Road', 'Avenue', 'Drive', 'Place',
'Temple', 'Parkway']
# Escape the '.' characters so that they match literally rather
# than matching any character.
street_types = [st.replace('.', r'\.') for st in street_types]
text = "123 Random Boulevard"  # avoid shadowing the built-in str
regex_partialaddress = fr"[0-9]{{1,4}} \w+ ({'|'.join(street_types)})"
m = re.match(regex_partialaddress, text)
if m:
    print(f"Street type: {m.group(1)}")
Result:
Street type: Boulevard
UPDATE: @tripleee pointed out that there are periods in the street types that should be matched literally. Leaving them as is will cause them to match any character in that position. I added a preprocessing step to the code to escape the periods so that they produce the right behavior in the regex.
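Since the question was about pulling the address out of a longer sentence, a variant using search instead of match may be closer to what was asked. This is only a sketch: the shortened street_types list is illustrative, and it assumes the street name is a single word.

```python
import re

# Shortened, illustrative list; use the full one from the answer in practice.
street_types = ['Ave', 'Avenue', 'Blvd', 'Boulevard', 'St.', 'Street']

# Longest alternatives first so 'Avenue' is tried before 'Ave';
# re.escape makes the '.' in 'St.' match literally.
alternation = '|'.join(
    re.escape(st) for st in sorted(street_types, key=len, reverse=True)
)
pattern = re.compile(rf"\b[0-9]{{1,4}} \w+ (?:{alternation})")

sentence = "Hey I live at 123 Random Boulevard if you were wondering"
m = pattern.search(sentence)  # search scans the whole string, match only the start
address = m.group(0) if m else None
print(address)
```

Sorting by length matters because regex alternation takes the first alternative that matches, so with 'Ave' listed first an address ending in "Avenue" would be cut short.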

Python: split a text into individual English sentences; retain the punctuation [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 years ago.
I am trying to write a function that takes a string/text as an argument and returns a list of the sentences in the text. Sentence boundaries like (., ?, !) should not be removed.
I don't want it to split on abbreviations (Dr. Kg. Mr. Mrs., e.g. "Dr. Jones").
Should I make a dictionary of all abbreviations?
Given input:
input = "I think Dr. Jones is busy now. Can you visit some other day? I was really surprised!"
Expected output:
output=['I think Dr. Jones is busy now.','Can you visit some other day?','I was really surprised!']
What I've tried:
# performing something like this:
output = input.split('.')
# will produce
'''
['I think Dr', ' Jones is busy now', ' Can you visit some other day? I was really surprised!']
'''
# whereas doing
output = input.split(' ')
# will produce
'''
['I', 'think', 'Dr.', 'Jones', 'is', 'busy', 'now.', 'Can', 'you', 'visit', 'some', 'other', 'day?', 'I', 'was', 'really', 'surprised!']
'''
The basic assumption is that the input text is not anomalously punctuated!
A clumsy way of achieving it is as follows:
abbr = {'Dr.', 'Mr.', 'Mrs.', 'Ms.'}
sentence_ender = ['.', '?', '!']
s = "I think Dr. Jones is busy now. Can you visit some other day? I was really surprised!"

def containsAny(wrd, charList):
    # any(...) returns True if the word contains at least one of the
    # given characters, else False; we are essentially testing whether
    # the word contains a sentence-ending character
    return any(c in wrd for c in charList)

def separate_sentences(string):
    sentences = []  # will be a list of all complete sentences
    temp = []  # will be a list of all words in the current sentence
    for wrd in string.split(' '):  # the input string is split on spaces
        temp.append(wrd)  # append current word to temp
        # If the word is not an abbreviation yet contains any of the
        # sentence delimiters, make a space-separated sentence and clear temp
        if wrd not in abbr and containsAny(wrd, sentence_ender):
            sentences.append(' '.join(temp))  # combine words currently in temp
            temp = []  # clear temp, for the next sentence
    return sentences

print(separate_sentences(s))
Should produce:
['I think Dr. Jones is busy now.', 'Can you visit some other day?', 'I was really surprised!']
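A regex-based alternative, as a rough sketch rather than a robust sentence splitter: split at whitespace that follows a sentence-ending character, then glue a piece back onto the next one when it ends in a known abbreviation. The ABBREVIATIONS set and the function name are assumptions for illustration, and the set is far from exhaustive.

```python
import re

ABBREVIATIONS = {'Dr.', 'Mr.', 'Mrs.', 'Ms.'}  # assumed, non-exhaustive


def split_sentences(text):
    # Split at whitespace that directly follows '.', '?' or '!'.
    pieces = re.split(r'(?<=[.?!])\s+', text.strip())
    sentences = []
    buffer = ''
    for piece in pieces:
        if not piece:  # happens only for empty input
            continue
        buffer = f'{buffer} {piece}' if buffer else piece
        # A piece ending in a known abbreviation is not a sentence end yet.
        if buffer.split()[-1] not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ''
    if buffer:  # keep any trailing text
        sentences.append(buffer)
    return sentences


s = "I think Dr. Jones is busy now. Can you visit some other day? I was really surprised!"
print(split_sentences(s))
```

For real-world text, a dedicated sentence tokenizer from an NLP library is likely to handle more cases than a hand-maintained abbreviation list.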

Getting rid of few entities using regex python

I am new to Regex. Given the below phrase I want to get rid of the I's and the extra field appearing because of using two regex operation.
text= "I have a problem in Regex, How do I get rid of the Capital I's provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine "
For example
I would like to keep "International Business Machine" as "International Business Machine", but reduce "Capital I's" to just "Capital".
I used the below Regular Expression:
re.findall('([A-Z][\w\']*(?:\s+[A-Z][\w|\']*)+)|([A-Z][\w]*)', text)
The output I received is
[('', 'I'),
('', 'Regex'),
('', 'How'),
('', 'I'),
("Capital I's", ''),
('', 'I'),
('', 'Capital'),
('International Business Machine', '')]
However, I would like my output to be:
[('Regex'),
('How'),
("Capital"),
('Capital'),
('International Business Machine')]
How do I get rid of the "I" and the extra field appearing because of using two regex operation.
Thanks
Just match a word that starts with a capital letter followed by one or more word characters, then add a pattern that matches following words of the same form (each starting with a capital letter) and make that pattern repeat zero or more times. That way it matches strings like Foo or Foo Bar Buzz.
>>> text= "I have a problem in Regex, How do I get rid of the Capital I's provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine "
>>> import re
>>> re.findall(r'\b[A-Z]\w+(?:\s+[A-Z]\w+)*', text)
['Regex', 'How', 'Capital', 'Capital', 'International Business Machine']
If you also want to handle apostrophes (as in your example), you can try:
(?:[A-Z](?:[\w]|(?<=\w\w)\')+\s?)+
It matches ' only if it is preceded by at least two word characters. Not a fancy solution, but it works. Then:
import re
text = 'I have a problem in Regex, How do I get rid of the Capital I\'s provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine'
found = re.findall(r"(?:[A-Z](?:[\w]|(?<=\w\w)')+\s?)+", text)
print(found)
will also give a result:
['Regex', 'How ', 'Capital ', 'Capital ', 'International Business Machine']
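Because the pattern ends each word with an optional \s, most matches carry a trailing space. If that is unwanted, stripping each match is one simple option. A sketch, using a shortened version of the text:

```python
import re

# Shortened version of the question's text, for illustration only.
text = ("I have a problem in Regex, How do I get rid of the Capital I's "
        "provided I want a Capital letter like International Business Machine")

pattern = r"(?:[A-Z](?:\w|(?<=\w\w)')+\s?)+"

# strip() removes the trailing whitespace the pattern picks up.
found = [match.strip() for match in re.findall(pattern, text)]
print(found)
```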

Tokenizing first and last name as one token

Is it possible to tokenize a text into tokens such that first and last name are combined in one token?
For example if my text is:
text = "Barack Obama is the President"
Then:
text.split()
results in:
['Barack', 'Obama', 'is', 'the', 'President']
How can I recognize the first and last name, so that I get only ['Barack Obama', 'is', 'the', 'President'] as tokens?
Is there a way to achieve it in Python?
What you are looking for is a named entity recognition system. I suggest you do not consider this as part of tokenization.
For python you can use https://pypi.python.org/pypi/ner/
Example from the site
>>> tagger.json_entities("Alice went to the Museum of Natural History.")
'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
Here's a regular expression that meets the needs of your question. It will find individual words beginning with a lowercase character, or match a single capitalized word or a pair of capitalized words.
import re
re.findall(r"[a-z]\w+|[A-Z]\w+(?: [A-Z]\w+)?",text)
outputs
['Barack Obama', 'is', 'the', 'President']
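Keep in mind that this is a capitalization heuristic, not name recognition. A quick sketch of where it goes wrong (the second sentence is my own example):

```python
import re

pattern = r"[a-z]\w+|[A-Z]\w+(?: [A-Z]\w+)?"

good = re.findall(pattern, "Barack Obama is the President")

# Any capitalized word pairs up with the next capitalized word,
# so a sentence-initial word can swallow the first name.
bad = re.findall(pattern, "Yesterday Barack Obama spoke")

print(good)  # ['Barack Obama', 'is', 'the', 'President']
print(bad)   # ['Yesterday Barack', 'Obama', 'spoke']
```

Since both alternatives require at least two characters, one-letter words such as "a" or "I" are also skipped entirely, which is another reason to prefer a named entity recognizer for anything beyond toy input.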

Counting the number of unique words [duplicate]

This question already has answers here:
Counting the number of unique words in a document with Python
(8 answers)
Closed 9 years ago.
I want to count unique words in a text, but I want to make sure that words followed by special characters aren't treated differently, and that the evaluation is case-insensitive.
Take this example
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
print(len(set(w.lower() for w in text.split())))
The result would be 16, but I expect it to return 14. The problem is that 'boy.' and 'boy' are evaluated differently, because of the punctuation.
import re
print(len(re.findall(r'\w+', text)))
Using a regular expression makes this very simple. All you need to keep in mind is to make sure that all the characters are in lowercase, and finally combine the result using set to ensure that there are no duplicate items.
print(len(set(re.findall(r'\w+', text.lower()))))
You can use a regex here:
In [65]: text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
In [66]: import re
In [68]: set(m.group(0).lower() for m in re.finditer(r"\w+",text))
Out[68]:
set(['grown',
'boy',
'he',
'now',
'longer',
'no',
'is',
'there',
'up',
'one',
'a',
'the',
'has',
'handsome'])
I think that you have the right idea of using the Python built-in set type.
I think that it can be done if you first remove the punctuation by doing a replace, and lowercase the text so that the comparison is case-insensitive:
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
punc_char = ",.?!'"
for letter in text:
    if letter == '"' or letter in punc_char:
        text = text.replace(letter, '')
text = set(text.lower().split())
print(len(text))
That should work for you. If you need to filter out any other punctuation marks, you can easily add them to punc_char.
Abraham J.
First, you need to get a list of words. You can use a regex as eandersson suggested:
import re
words = re.findall(r'\w+', text.lower())  # lowercase so the count is case-insensitive
Now, you want to get the number of unique entries. There are a couple of ways to do this. One way would be iterate through the words list and use a dictionary to keep track of the number of times you have seen a word:
cwords = {}
for word in words:
    try:
        cwords[word] += 1
    except KeyError:
        cwords[word] = 1
Now, finally, you can get the number of unique words by
len(cwords)
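The standard library's collections.Counter does the same bookkeeping as the dictionary loop. A short sketch, lowercasing first so the count is case-insensitive as the question asks:

```python
import re
from collections import Counter

text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."

# Counter maps each distinct word to the number of times it occurs.
counts = Counter(re.findall(r'\w+', text.lower()))
unique_count = len(counts)

print(unique_count)           # 14
print(counts.most_common(1))  # [('boy', 3)]
```

If only the number of unique words is needed, a set is enough; Counter is worth it when the per-word frequencies are also of interest.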
