Test if sentences contain smaller sentences - python

So I have 100 million sentences, and for each sentence I'd like to see whether it contains one of 6000 smaller sentences (matching whole words only). So far my code is
smaller_sentences = [...]
for large_sentence in file:
    for small_sentence in smaller_sentences:
        if ((' ' + small_sentence + ' ') in large_sentence
                or large_sentence.startswith(small_sentence + ' ')
                or large_sentence.endswith(' ' + small_sentence)):
            outfile.write(large_sentence)
            break
But this code runs prohibitively slowly. Do you know of a faster way to go about doing this?

Without knowing more about the domain (word/sentence length), the frequency of reads/writes/queries and the specifics of the algorithm, it is hard to give a complete answer.
But, in the first instance, you can switch your condition around.
This checks the whole string (slow), then the head (fast), then the tail (fast).
((' ' + small_sentence + ' ') in large_sentence
 or large_sentence.startswith(small_sentence + ' ')
 or large_sentence.endswith(' ' + small_sentence))
This checks the head (fast), then the tail (fast), then the whole string (slow). Because `or` short-circuits, the slow check only runs when both fast checks fail. It is not a big improvement in the Big-O sense, but it might add some speed if you know that matches are more likely at the start or end.
(large_sentence.startswith(small_sentence + ' ')
 or large_sentence.endswith(' ' + small_sentence)
 or (' ' + small_sentence + ' ') in large_sentence)
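A different direction, offered only as a sketch: instead of testing each of the 6000 small sentences against every large sentence, index the small sentences by word count and look up each run of words from the large sentence in a set. The names `build_index` and `contains_any` are hypothetical, and the sketch assumes that words are separated by whitespace and that an exact, case-sensitive word match is wanted. Unlike the original condition, it also reports a match when the large sentence equals a small sentence.

```python
from collections import defaultdict


def build_index(smaller_sentences):
    """Group the small sentences by word count, stored as tuples of words."""
    index = defaultdict(set)
    for sentence in smaller_sentences:
        words = tuple(sentence.split())
        if words:
            index[len(words)].add(words)
    return index


def contains_any(large_sentence, index):
    """Return True if any indexed small sentence occurs as whole words."""
    words = large_sentence.split()
    for n, phrases in index.items():
        # Slide a window of n words over the sentence; each lookup is a set test.
        for start in range(len(words) - n + 1):
            if tuple(words[start:start + n]) in phrases:
                return True
    return False


index = build_index(["the cat", "sat on"])
print(contains_any("the cat is here", index))       # True
print(contains_any("the category is here", index))  # False
```

The work per large sentence then depends on its length and on the number of distinct small-sentence lengths, not on the 6000 small sentences themselves.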

Related

Python - .split(), - two arguments

I need to modify some Python code.
This code extracts information from a .csv file and writes it to a new .csv file with a different structure.
In one of the columns of the source files, I have a string value which, 99% of the time, is formed this way: 'block1 block2 block3'.
Block2 ends with 'm' 99% of the time.
example: 'R2 180m RFT'.
By browsing the source dataset, I realized that in 1% of the cases, the block2 can end with 'M'.
As I need all the values after the 'm' or 'M' value, I'm a bit stuck.
I used the .split() function, like this:
'Newcolumn': getattr(row_unique_ids, 'COLUMNINTHEDATASET').split('m')[1],
By doing so, my script raises an error as soon as it reaches a value like this:
'R2 180M AST'.
So I would like to know how to add an argument that makes the split work whether the value contains 'm' or 'M'.
Thank you for your help.
One solution is to lowercase the string before splitting:
s = getattr(row_unique_ids, 'COLUMNINTHEDATASET')
s = s.lower()
s.split('m')[1]
But that will mess up your casing. If you want to preserve casing,
another solution is to do:
x = ''
s = getattr(row_unique_ids, 'COLUMNINTHEDATASET')
for c in s:
    # replace only uppercase 'M'; every other character keeps its case
    if c == 'M':
        x += 'm'
    else:
        x += c
x.split('m')[1]
In general, one way to split on several delimiters is:
import re
string = "this is 3an infamous String4that I need to s?plit in an infamou.s way"
#Preserve the original char
print (re.sub(r"([0-9]|[?.]|[A-Z])",r'\1'+"DELIMITER",string).split('DELIMITER'))
#Discard the original char
print (re.sub(r"([0-9]|[?.]|[A-Z])","DELIMITER",string).split('DELIMITER'))
Output:
['this is 3', 'an infamous S', 'tring4', 'that I', ' need to s?', 'plit in an infamou.', 's way']
['this is ', 'an infamous ', 'tring', 'that ', ' need to s', 'plit in an infamou', 's way']
In your context:
import re
string = "R2 180m RFT R2 180M RFT"
print (re.sub(r"\b([0-9]+)[mM]\b",r'\1'+"M",string).split('M'))
#print (re.sub(r"\b([0-9]+)[mM]\b",r'\1'+"M",getattr(row_unique_ids, 'COLUMNINTHEDATASET')).split('M'))
Output:
['R2 180', ' RFT R2 180', ' RFT']
It will split on m and M if those are preceded by a number.
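As a variation, `re.split` can split directly on the `m` or `M`, without substituting a marker first. The sketch below assumes the value contains at most one number+m/M token of interest and that only the text after it is needed; `text_after_length` is a hypothetical helper name, not part of the original script.

```python
import re


def text_after_length(value):
    """Return the text after the first number+m/M token, or None if absent."""
    # (?<=[0-9]) requires a digit right before the m/M, and \b requires the
    # m/M to end its token, so letters inside other words are left alone.
    parts = re.split(r'(?<=[0-9])[mM]\b', value, maxsplit=1)
    return parts[1].strip() if len(parts) == 2 else None


print(text_after_length('R2 180m RFT'))   # RFT
print(text_after_length('R2 180M MAST'))  # MAST
```

Because the split happens at most once (`maxsplit=1`), an `M` later in the string, as in 'MAST', does not cause a further split.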

Question about changing the argument in range function through iterations

I'm a newbie so I'm really sorry if this is too basic of a question, but I just couldn't solve it on my own. Perhaps it's not considered complex enough (at all), which would explain why I couldn't find an adequate answer online.
I've made a tic-tac-toe program following the Automate the Boring Stuff with Python textbook, but modified it a tiny bit so it doesn't allow players to enter 'X'/'O' in already filled slots. Here's what it looks like:
theBoard = {'top-L': ' ', 'top-M': ' ', 'top-R': ' ',
            'mid-L': ' ', 'mid-M': ' ', 'mid-R': ' ',
            'low-L': ' ', 'low-M': ' ', 'low-R': ' '}
def printBoard(board):
    print(board['top-L']+'|'+board['top-M']+'|'+board['top-R'])
    print('-+-+-')
    print(board['mid-L']+'|'+board['mid-M']+'|'+board['mid-R'])
    print('-+-+-')
    print(board['low-L']+'|'+board['low-M']+'|'+board['low-R'])
turn='X'
_range=9
for i in range(_range):
    printBoard(theBoard)
    print('''It is the '''+turn+''' player's turn.''')
    move=input()
    if theBoard[move]==' ':
        theBoard[move]=turn
    else:
        print('The slot is already filled !')
        _range+=1
    if turn=='X':
        turn ='O'
    else:
        turn='X'
printBoard(theBoard)
However, it doesn't seem like the _range variable is being increased by one at all through iterations where I intentionally enter 'X'/'O' in slots where such symbols are already present.
Is there something that I'm missing here? Is there any way I could make this work as I planned it?
Thank you in advance.
It could be easier to sanitize the input immediately via a while loop instead of increasing the range.
move = input(f"It is the {turn} player's turn. ")
while theBoard[move] != ' ':
    # keep asking until the player picks an empty slot
    move = input('The slot is already filled. ')
theBoard[move] = turn
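Your attempt did not work because range(_range) is evaluated only once, when the for loop starts. Changing _range afterwards does not affect the range object the loop is already iterating over, and _range is never used again after you change it.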

Python Split Outputting in square brackets

I have code that looks like this:
import re
activity = "Basketball - Girls 9th"
activity = re.sub(r'\s', ' ', activity).split('-')
activity = str(activity [1:2]) + str(activity [0:1])
print("".join(activity))
I want the output to look like Girls 9th Basketball, but the current output when printed is
[' Girls 9th']['Basketball ']
I want to get rid of the square brackets. I know I can simply trim it, but I would rather know how to do it right.
You're almost there. Calling .join on a list already produces a string, so you can omit the str() calls.
import re
activity = "Basketball - Girls 9th"
activity = re.sub(r'\s', ' ', activity).split('-')
activity = activity[1:2] + activity[0:1]
print(" ".join(activity))
You are stringifying the lists, which is the same as using print(someList): you get the representation of a list, which puts the [] around it.
import re
activity = "Basketball - Girls 9th"
activity = re.sub(r'\s', ' ', activity).split('-')
activity = activity[1:2] + [" "] + activity[0:1]  # reorders and reassigns the list
print("".join(activity))
You could consider adding a step:
# remove empty list items and strip leading/trailing whitespace
cleaned = [x.strip() for x in activity if x.strip() != ""]
print(" ".join(cleaned)) # put a space between each list item
This just reorders the list and adds the needed space in between so your output fits.
You can solve it in one line:
new_activity = ' '.join(activity.split(' - ')[::-1])
You can try something like this:
import re
activity = "Basketball - Girls 9th"
activity = re.sub(r'\s', ' ', activity).split('-')
activity = activity[1].strip() + ' ' + activity[0].strip()
print(activity)
output:
Girls 9th Basketball
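To see where the brackets came from, and one way to avoid them, here is a short sketch. The function name `reorder` is hypothetical, and the sketch assumes the two parts are always separated by a single hyphen.

```python
def reorder(activity):
    # Split on the hyphen and strip the whitespace around each piece.
    parts = [part.strip() for part in activity.split('-')]
    # Reverse the pieces and join them with a single space.
    return ' '.join(reversed(parts))


pieces = "Basketball - Girls 9th".split('-')
print(str(pieces[1:2]))      # [' Girls 9th']  (the list's repr, brackets included)
print(''.join(pieces[1:2]))  # the element itself, without brackets or quotes
print(reorder("Basketball - Girls 9th"))  # Girls 9th Basketball
```

str() on a list returns the list's printable representation, while join concatenates the elements themselves.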

Selecting specific results from print out in Python - 2.7

I am looking to take the three most recent (based on time) lists from the printed output below. These are not actual files but text parsed and stored in dictionaries:
from operator import itemgetter
list4 = sorted(data1.values(), key=itemgetter(4))
for complete in list4:
    if complete[1] == 'Completed':
        print complete
returns:
['aaa664847', ' Completed', ' location' , ' mode', ' 2014-xx-ddT20:00:00.000']
['aaa665487', ' Completed', ' location' , ' mode', ' 2014-xx-ddT19:00:00.000']
['aaa661965', ' Completed', ' location' , ' mode', ' 2014-xx-ddT18:00:00.000']
['aaa669696', ' Completed', ' location' , ' mode', ' 2014-xx-ddT17:00:00.000']
['aaa665376', ' Completed', ' location' , ' mode', ' 2014-xx-ddT16:00:00.000']
I have tried to append these results to another list and got this:
[['aaa664847', ' Completed', ' location' , ' mode', ' 2014-xx-ddT20:00:00.000']]
[['aaa665487', ' Completed', ' location' , ' mode', ' 2014-xx-ddT19:00:00.000']]
I would like one list on which I could then use [-3:] to print out the three most recent.
storage = [['aaa664847', ' Completed', ' location' , ' mode', ' 2014-xx-ddT20:00:00.000'],['aaa665487', ' Completed', ' location' , ' mode', ' 2014-xx-ddT19:00:00.000']]
So,
import itertools
storage = list(itertools.islice((c for c in list4 if c[1]=='Completed'), 3))
maybe...?
Added: an explanation might help. The (c for c in list4 if c[1]=='Completed') part is a generator expression ("genexp") -- it walks list4 from the start and only yields, one at a time, items (sub-lists here) satisfying the condition.
The () around it are needed because itertools.islice takes another argument (a genexp must always be surrounded by parentheses, though when it's the only argument to a callable, the parentheses that call the callable are enough and need not be doubled up).
islice is told (via its second argument) to yield only the first up to 3 items of the iterable it receives as the first argument. Once it's done that it stops looping, doing no further work (which would be useless).
We do need a call to list over this all because we require, as a result, a list, not an iterator (which is what islice's result is).
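The explanation above can be tried on a small, made-up dataset. The rows below are invented for illustration (three fields instead of five, with the status at index 1 and the timestamp at index 2), and the sketch sorts newest first so that the first three matches are the most recent ones. That ordering is an assumption about what is wanted, not something the original code does.

```python
from itertools import islice
from operator import itemgetter

# Hypothetical sample rows: [id, status, timestamp]
rows = [
    ['a1', 'Completed', '2014-01-01T16:00:00.000'],
    ['a2', 'Failed',    '2014-01-01T17:00:00.000'],
    ['a3', 'Completed', '2014-01-01T18:00:00.000'],
    ['a4', 'Completed', '2014-01-01T19:00:00.000'],
    ['a5', 'Completed', '2014-01-01T20:00:00.000'],
]

# Sort by timestamp, newest first, then keep the first three completed rows.
newest_first = sorted(rows, key=itemgetter(2), reverse=True)
storage = list(islice((r for r in newest_first if r[1] == 'Completed'), 3))

print([r[0] for r in storage])  # ['a5', 'a4', 'a3']
```

With an ascending sort, as in the question, the same three rows could instead be taken from a filtered list with [-3:].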
People who are uncomfortable with generators and iterators might choose the following less-elegant, probably less-performant, but simpler approach:
storage = []
for c in list4:
    if c[1]=='Completed':
        storage.append(c)
        # stop as soon as three matching items have been collected
        if len(storage) == 3: break
This is perfectly valid Python, too (and it would have worked just fine as far back as Python 1.5.4 if not earlier). But modern Python usage leans far more towards generators, iterators, and itertools, where applicable...

Ellipsizing list joins, Pythonically

I learned about list comprehensions a few days ago, and now I think I’ve gone a little crazy with them, trying to make them solve all the problems. Perhaps I don’t truly understand them yet, or I just don’t know enough Python yet to make them both powerful and simple. This problem has occupied me for a while now, and I’d appreciate any input.
The Problem
In Python, join a list of strings words into a single string excerpt that satisfies these conditions:
a single space separates elements
the final length of excerpt does not exceed integer maximum_length
if all elements of words are not in excerpt, append an ellipsis character … to excerpt
only whole elements of words appear in excerpt
The Ugly Solution
words = ('Your mother was a hamster and your ' +
         'father smelled of elderberries!').split()
maximum_length = 29
excerpt = ' '.join(words) if len(' '.join(words)) <= maximum_length else \
    ' '.join(words[:max([n for n in range(0, len(words)) if \
                         len(' '.join(words[:n]) + '\u2026') <= \
                         maximum_length])]) + '\u2026'
print(excerpt) # Your mother was a hamster…
print(len(excerpt)) # 26
Yup, that works. 'Your mother was a hamster and' fits in 29 characters, but leaves no room for the ellipsis. But boy is it ugly. I can break it up a little:
words = ('Your mother was a hamster and your ' +
         'father smelled of elderberries!').split()
maximum_length = 29
excerpt = ' '.join(words)
if len(excerpt) > maximum_length:
    maximum_words = max([n for n in range(0, len(words)) if \
                         len(' '.join(words[:n]) + '\u2026') <= \
                         maximum_length])
    excerpt = ' '.join(words[:maximum_words]) + '\u2026'
print(excerpt) # 'Your mother was a hamster…'
But now I’ve made a variable I’m never going to use again. Seems like a waste. And it hasn’t really made anything prettier or easier to understand.
Is there a nicer way to do this that I just haven’t seen yet?
See my comment about why "simple is better than complex".
That said, here's a suggestion:
l = 'Your mother was a hamster and your father smelled of elderberries!'
maximum_length = 29
if len(l) > maximum_length:
    # cut at the last space inside the limit and mark the truncation
    last_space = l.rfind(' ', 0, maximum_length)
    l = l[:last_space] + "..."
print(l)
This is not 100% what you need (the three dots can push the result past the limit), but it is rather easy to extend.
My humble opinion is that you are right that a list comprehension is not necessary for this task. I would first get all the words in a list with split, then use a while loop that removes words one at a time from the end of the list until len(' '.join(list)) <= maximum_length.
I would also shorten maximum_length by 3 (the length of the ellipsis written as three dots) and, after the while loop ends, add the "..." as the last element of the list.
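A minimal sketch of that while-loop idea follows. It deviates from the description in two labeled ways: it uses the single-character ellipsis from the question instead of three dots, and it appends the ellipsis directly to the string, not as a list element, so no space precedes it. The name `excerpt_by_popping` is hypothetical.

```python
def excerpt_by_popping(words, maximum_length, ellipsis='\u2026'):
    """Join words with spaces, dropping trailing words until the result fits."""
    full = ' '.join(words)
    if len(full) <= maximum_length:
        return full  # everything fits, so no ellipsis is needed
    kept = list(words)
    # Remove words from the end until text plus ellipsis fits the limit.
    while kept and len(' '.join(kept)) + len(ellipsis) > maximum_length:
        kept.pop()
    return ' '.join(kept) + ellipsis


words = ('Your mother was a hamster and your '
         'father smelled of elderberries!').split()
result = excerpt_by_popping(words, 29)
print(result)       # Your mother was a hamster…
print(len(result))  # 26
```

Each pass re-joins the remaining words, which is simple and not optimal, but fine for short excerpts.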
You can trim the excerpt down to the maximum_length. Then, use rsplit to drop the last (possibly partial) word and append the ellipsis:
def append_ellipsis(words, length=29):
    excerpt = ' '.join(words)
    if len(excerpt) <= length:
        return excerpt  # everything fits, no ellipsis needed
    # Cut to the limit, then drop the last (possibly partial) word.
    # In Python 3 the 1 can also be passed as the keyword `maxsplit=1`.
    return excerpt[:length].rsplit(' ', 1)[0] + '\u2026'
words = ('Your mother was a hamster and your ' +
         'father smelled of elderberries!').split()
result = append_ellipsis(words)
print(result)
print(len(result))
