Sentence processing in Python

I have a file with such data:
Sentence[0].Sentence[1].Sentence[2].'\n'
Sentence[0].Sentence[1].Sentence[2].'\n'
Sentence[0].Sentence[1].Sentence[2].'\n'
What I want to print out is Sentence[0] from every line. This is what I have done, but it prints out an empty list.
from nltk import *
import codecs
f=codecs.open('topon.txt','r+','cp1251')
text = f.readlines()
first=[sentence for sentence in text if re.findall('\.\n^Abc',sentence)]
print first

You don't need NLTK for this (nor are you using it). Unless I misunderstand the question, this should do the trick:
with open('topon.txt') as infile:
    for line in infile:
        print line.split('.', 1)[0]
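For readers on Python 3, the same idea can be sketched as a small function. The helper name first_sentences and the sample lines are made up for illustration, and the sketch assumes, as the answer does, that the first period on each line ends the first sentence.

```python
def first_sentences(lines):
    """Return the text before the first period of each non-empty line."""
    result = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            result.append(line.split('.', 1)[0])
    return result


# Hypothetical sample data standing in for the lines of topon.txt
sample = ["First one.Second one.Third one.\n", "Alpha.Beta.Gamma.\n"]
print(first_sentences(sample))
```

To run it on a real file, pass the open file object (or the result of readlines()) instead of the sample list.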

In addition to @inspectorG4dget's answer, you can do it with a regex:
import re
import codecs
f = codecs.open('a.txt', 'r+', 'cp1251')
text = f.readlines()
print [re.findall('^[^.]+', sentence) for sentence in text]

Splitting a paragraph at periods works only if every sentence ends with a period, and periods are used for nothing else. If you have a lot of real text, neither of these is even close to true. Abbreviations, questions? exclamations! etc. will trip you up a lot. So use the tool that NLTK provides for this purpose: the function sent_tokenize(). It's not perfect, but it's a whole lot better than looking for periods. If text is your list of paragraphs, you use it like this (after import nltk):
first = []
for par in text:
    sentences = nltk.sent_tokenize(par)
    first.append(sentences[0])
You could fold the above into a list comprehension, but it's not going to be very readable...

Related

Dictionary replaces substrings Python 2.7

I want to replace numbers from a text file and write the result to a new text file. I tried to solve it with a dictionary, but now Python also replaces substrings of longer numbers.
For example: I want to replace the number 01489 with 1489, but with this code 014896 also becomes 14896. How can I get rid of this? Thank you!
replacements = {'01489':'1489', '01450':'1450'}
infile = open('test_fplan.txt', 'r')
outfile = open('test_fplan_neu.txt', 'w')
for line in infile:
    for src, target in replacements.iteritems():
        line = line.replace(src, target)
    outfile.write(line)

I don't know how your input file looks, but if the numbers are surrounded by spaces, this should work:
replacements = {' 01489 ':' 1489 ', ' 01450 ':' 1450 '}

It looks like your concern is that it's also modifying numbers that contain your src pattern as a substring. To avoid that, you'll need to first define the boundaries that should be respected. For instance, do you want to insist that only matched numbers surrounded by spaces get replaced? Or perhaps just that there be no adjacent digits (or periods or commas). Since you'll probably want to use regular expressions to constrain the matching, as suggested by JoshuaF in another answer, you'll likely need to avoid the simple replace function in favor of something from the re library.
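As a rough illustration of the difference, assuming the boundary you want is "no adjacent digits", the sketch below compares plain str.replace with a regex guarded by lookarounds. It is written for Python 3; the function names and the sample line are hypothetical.

```python
import re

replacements = {'01489': '1489', '01450': '1450'}


def replace_naive(line):
    """Plain substring replacement: also rewrites parts of longer numbers."""
    for src, target in replacements.items():
        line = line.replace(src, target)
    return line


def replace_bounded(line):
    """Replace a key only when it is not touching another digit."""
    for src, target in replacements.items():
        pattern = r'(?<!\d)' + re.escape(src) + r'(?!\d)'
        line = re.sub(pattern, target, line)
    return line


sample = "01489 014896 01450"
print(replace_naive(sample))    # the longer number 014896 is altered too
print(replace_bounded(sample))  # 014896 is left alone
```

If the numbers are always separated by whitespace, a stricter boundary such as (?<!\S) and (?!\S) could be swapped in for the digit lookarounds.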

Use regexp with negative lookarounds:
import re
replacements = {'01489':'1489', '01450':'1450'}
def find_replacement(match_obj):
    number = match_obj.group(0)
    return replacements.get(number, number)
with open('test_fplan.txt') as infile:
    with open('test_fplan_neu.txt', 'w') as outfile:
        outfile.writelines(
            re.sub(r'(?<!\d)(\d+)(?!\d)', find_replacement, line)
            for line in infile
        )

Check out the regular expression syntax https://docs.python.org/2/library/re.html. It should allow you to match whatever pattern you're looking for exactly.

How to format long `if` statement

I want to test if certain characters are in a line of text. The condition is simple but characters to be tested are many.
Currently I am using \ line continuations for readability, but it feels clumsy. Is there a nicer way to lay out the condition?
text = "Tel+971-2526-821     Fax:+971-2526-821"
if "971" in text or \
   "(84)" in text or \
   "+66" in text or \
   "(452)" in text or \
   "19 " in text:
    print "foreign"

Why not extract the phone numbers from the string and test their prefixes?
text = "Tel:+971-2526-821 Fax:+971-2526-821"
tel, fax = text.split()
tel_prefix, *_ = tel.split(':')[-1].split('-')
fax_prefix, *_ = fax.split(':')[-1].split('-')
# The prefixes keep their leading "+", e.g. "+971"
if tel_prefix in ("+971", "(84)"):
    print("Foreigner")
For Python 2.x, which has no starred unpacking:
tel_prefix = tel.split(':')[-1].split('-')[0]
fax_prefix = fax.split(':')[-1].split('-')[0]

Inspired by @Patrick Haugh's comment, we can do:
text = "Tel+971-2526-821     Fax:+971-2526-821"
if any(x in text for x in ("971", "(84)", "+66", "(452)", "19 ")):
    print "foreign"

You can use the built-in any() function to check whether at least one of the tokens occurs in the text. If you want to check that all of the tokens occur in the string, replace any with all below. Cheers!
text = 'Hello your number is 19 '
tokens = ('971', '(84)', '+66', '(452)', '19 ')
if any(token in text for token in tokens):
    print('Foreign')
Output:
Foreign
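To make the any/all distinction concrete, here is a minimal Python 3 sketch using the same text and tokens; the variable names has_any and has_all are only illustrative.

```python
text = 'Hello your number is 19 '
tokens = ('971', '(84)', '+66', '(452)', '19 ')

# True if at least one token occurs in the text
has_any = any(token in text for token in tokens)

# True only if every token occurs in the text
has_all = all(token in text for token in tokens)

print(has_any, has_all)
```

Both functions short-circuit: any() stops at the first match and all() stops at the first miss, so the generator expression is not evaluated further than needed.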

Existing comments mention that you can't really have multiple or statements like you intend, but with a generator expression and the any() function you can come up with a serviceable option, such as the snippet if any(x in text for x in ('971', '(84)', '+66', '(452)', '19 ')): that @Patrick Haugh recommended.
I would recommend using regular expressions instead as a more versatile and efficient way of solving the problem. You could either generate the pattern dynamically, or for the purpose of this problem, the following snippet would work (don't forget to escape parentheses):
import re
text = 'Tel:+971-2526-821 Fax:+971-2526-821'
pattern = u'(971|\(84\)|66|\(452\)|19)'
prog = re.compile(pattern)
if prog.search(text):
    print 'foreign'
If you are searching many lines of text or large bodies of text for multiple possible substrings, this approach will be faster and more reusable. You only have to compile prog once, and then you can use it as often as you'd like.
As far as dynamic generation of a pattern is concerned, a naive implementation might do something like this:
match_list = ['971', '(84)', '66', '(452)', '19']
pattern = '|'.join(map(lambda s: s.replace('(', '\(').replace(')', '\)'), match_list)).join(['(', ')'])
The variable match_list could then be updated and modified as needed. There is a slight inefficiency in running two passes of replace(), and @Andrew Clark has a good trick for fixing that here, but I don't want this answer to be too long and cumbersome.
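One way to avoid the manual replace() calls altogether is the standard library's re.escape, which backslash-escapes regex metacharacters. The following Python 3 sketch assumes the original tokens from the question (including "+66" and the trailing space in "19 "); the helper name is_foreign is hypothetical.

```python
import re

match_list = ['971', '(84)', '+66', '(452)', '19 ']

# re.escape handles ( ) and + so each token is matched literally
pattern = '|'.join(re.escape(s) for s in match_list)
prog = re.compile(pattern)


def is_foreign(text):
    """Return True if any token from match_list occurs in text."""
    return prog.search(text) is not None


print(is_foreign('Tel:+971-2526-821 Fax:+971-2526-821'))
print(is_foreign('Tel:555-0100'))
```

Because the pattern is compiled once, is_foreign can be called on many lines without rebuilding it.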

You can construct a lambda function that checks if a value is in the text, and then map this function to all of the values:
text = "Tel:+971-2526-821 Fax:+971-2526-821"
print any(map((lambda x: x in text), ["971", "(84)", "+66", "(452)", "19 "]))
The result is True, which means at least one of the values is in text.

How to count sentences taking into account the occurrence of ellipses

I've written the following script to count the number of sentences in a text file:
import re
filepath = 'sample_text_with_ellipsis.txt'
with open(filepath, 'r') as f:
    read_data = f.read()
sentences = re.split(r'[.{1}!?]+', read_data.replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
However, if I run it on a sample_text_with_ellipsis.txt with the following content:
Wait for it... awesome!
I get sentence_count = 2 instead of 1, because it does not ignore the ellipsis (i.e., the "...").
What I tried to do in the regex is to make it match only one occurrence of a period through .{1}, but this apparently doesn't work the way I intended it. How can I get the regex to ignore ellipses?

Splitting sentences with a regex like this is not enough. See Python split text on sentences to see how NLTK can be leveraged for this.
To answer your question: an ellipsis is a sequence of three dots, so you need a pattern that matches only a lone dot:
[!?]+|(?<!\.)\.(?!\.)
The . is moved out of the character class because quantifiers have no special meaning inside one, and the only . that matches is one that is not adjacent to another dot.
[!?]+ - 1 or more ! or ?
| - or
(?<!\.)\.(?!\.) - a dot that is neither preceded ((?<!\.)) nor followed ((?!\.)) by a dot.
See Python demo:
import re
sentences = re.split(r'[!?]+|(?<!\.)\.(?!\.)', "Wait for it... awesome!".replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
print(sentence_count) # => 1
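The same pattern can be wrapped in a reusable function. This sketch is a variation rather than the answer's exact code: instead of dropping the last element, it counts the non-empty pieces, so text without a final terminator still counts as one sentence. The names SENTENCE_END and count_sentences are made up for illustration.

```python
import re

# Sentence enders: runs of ! or ?, or a single dot with no dot beside it
SENTENCE_END = re.compile(r'[!?]+|(?<!\.)\.(?!\.)')


def count_sentences(text):
    """Count non-empty pieces left after splitting on sentence enders."""
    pieces = SENTENCE_END.split(text.replace('\n', ' '))
    return sum(1 for piece in pieces if piece.strip())


print(count_sentences("Wait for it... awesome!"))
print(count_sentences("Hi. How are you? Fine!"))
```

Abbreviations such as "e.g." will still be miscounted, which is the limitation the NLTK suggestion addresses.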

Following Wiktor's suggestion to use NLTK, I also came up with the following alternative solution:
import nltk
read_data="Wait for it... awesome!"
sentence_count = len(nltk.tokenize.sent_tokenize(read_data))
This yields a sentence count of 1 as expected.

Search and replace using regular expressions in Python

I have a log file that is full of tweets. Each tweet is on its own line so that I can iterate through the file easily.
An example tweet would be like this:
# sample This is a sample string $ 1.00 # sample
I want to be able to clean this up a bit by removing the white space between the special character and the following alpha-numeric character. "# s", "$ 1", "# s"
So that it would look like this:
#sample This is a sample string $1.00 #sample
I'm trying to use regular expressions to match these instances because they can be variable, but I am unsure of how to go about doing this.
I've been using re.sub() and re.search() to find the instances, but am struggling to figure out how to only remove the white space while leaving the string intact.
Here is the code I have so far:
#!/usr/bin/python
import csv
import re
import sys
import pdb
import urllib
f=open('output.csv', 'w')
with open('retweet.csv', 'rb') as inputfile:
    read=csv.reader(inputfile, delimiter=',')
    for row in read:
        a = row[0]
        matchObj = re.search("\W\s\w", a)
        print matchObj.group()
f.close()
Thanks for any help!

Something like this using re.sub:
>>> import re
>>> strs = "# sample This is a sample string $ 1.00 # sample"
>>> re.sub(r'([##$])(\s+)([a-z0-9])', r'\1\3', strs, flags=re.I)
'#sample This is a sample string $1.00 #sample'

>>> re.sub("([#$#]) ", r"\1", "# sample This is a sample string $ 1.00 # sample")
'#sample This is a sample string $1.00 #sample'

This seemed to work pretty nicely:
print re.sub(r'([#$])\s+',r'\1','# blah $ 1')
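The answers above all rely on the same technique: capture the symbol in a group and substitute it back without the whitespace. A Python 3 sketch of that idea follows; the lookahead that requires a word character after the gap is an added assumption, and the names SYMBOL_GAP and tidy_tweet are hypothetical.

```python
import re

# A "#" or "$", then whitespace, but only when a word character follows
SYMBOL_GAP = re.compile(r'([#$])\s+(?=\w)')


def tidy_tweet(tweet):
    """Remove whitespace between a leading symbol and the word after it."""
    return SYMBOL_GAP.sub(r'\1', tweet)


print(tidy_tweet("# sample This is a sample string $ 1.00 # sample"))
```

The lookahead consumes nothing, so the character after the gap is left untouched and only the whitespace is dropped.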

Returning all characters before the first underscore

Using re in Python, I would like to return all of the characters in a string that precede the first appearance of an underscore. In addition, I would like the string that is being returned to be in all uppercase and without any non-alphanumeric characters.
For example:
AG.av08_binloop_v6 = AGAV08
TL.av1_binloopv2 = TLAV1
I am pretty sure I know how to return a string in all uppercase using string.upper() but I'm sure there are several ways to remove the . efficiently. Any help would be greatly appreciated. I am still learning regular expressions slowly but surely. Each tip gets added to my notes for future use.
To further clarify, my above examples aren't the actual strings. The actual string would look like:
AG.av08_binloop_v6
With my desired output looking like:
AGAV08
And the next example would be the same. String:
TL.av1_binloopv2
Desired output:
TLAV1
Again, thanks all for the help!

Even without re:
text.split('_', 1)[0].replace('.', '').upper()

Try this:
re.sub("[^A-Z\d]", "", re.search("^[^_]*", str).group(0).upper())

Since everyone is giving their favorite implementation, here's mine that doesn't use re:
>>> for s in ('AG.av08_binloop_v6', 'TL.av1_binloopv2'):
...     print ''.join(c for c in s.split('_',1)[0] if c.isalnum()).upper()
...
AGAV08
TLAV1
I put .upper() on the outside of the generator so it is only called once.

You don't have to use re for this. Simple string operations would be enough based on your requirements:
tests = """
AG.av08_binloop_v6 = AGAV08
TL.av1_binloopv2 = TLAV1
"""
for t in tests.strip().splitlines():
    print t[:t.find('_')].replace('.', '').upper()
# Returns:
# AGAV08
# TLAV1
Or if you absolutely must use re:
import re
pat = r'([a-zA-Z0-9.]+)_.*'
pat_re = re.compile(pat)
for t in tests.strip().splitlines():
    print re.sub(r'\.', '', pat_re.findall(t)[0]).upper()
# Returns:
# AGAV08
# TLAV1

Heh, just for fun, another option to get the text before the first underscore is:
before_underscore, sep, after_underscore = str.partition('_')
So all in one line could be:
re.sub("[^A-Z\d]", "", str.partition('_')[0].upper())

import re
re.sub("[^A-Z\d]", "", yourstr.split('_',1)[0].upper())
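Pulling the answers together, here is one possible Python 3 function that combines partition() with an alphanumeric filter. The name prefix_code is hypothetical, and the sketch assumes a string with no underscore should be treated as all prefix.

```python
def prefix_code(name):
    """Uppercase alphanumeric characters before the first underscore."""
    # partition returns the whole string as the head if "_" is absent
    head = name.partition('_')[0]
    return ''.join(ch for ch in head if ch.isalnum()).upper()


for s in ('AG.av08_binloop_v6', 'TL.av1_binloopv2'):
    print(prefix_code(s))
```

Unlike t[:t.find('_')], partition() does not chop the last character when the underscore is missing.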
