I am trying to create Bigram tokens of sentences.
I have a list of tuples such as
tuples = [('hello', 'my'), ('my', 'name'), ('name', 'is'), ('is', 'bob')]
and I was wondering if there is a way to convert it to a list using Python, so it would look like this:
list = ['hello my', 'my name', 'name is', 'is bob']
thank you
Try this snippet:
list = [' '.join(x) for x in tuples]
join is a string method that concatenates all items of a list (or tuple), inserting the string it is called on (here ' ') as the separator between them.
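A quick illustration of how join behaves (variable names here are just for the example):

```python
# The string join is called on becomes the separator
# placed between the items of the iterable.
pair = ('hello', 'my')
print(' '.join(pair))             # hello my
print('-'.join(['a', 'b', 'c']))  # a-b-c
```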
Try this
list = [ '{0} {1}'.format(t[0],t[1]) for t in tuples ]
In general if you want to use both values of a tuple, you can use something like this:
my_list = []
for first, second in tuples:
    my_list.append(first + ' ' + second)
In this case
my_list = [' '.join(t) for t in tuples]
should be fine
tuples = [('hello', 'my'), ('my', 'name'), ('name', 'is'), ('is', 'bob')]
result = [k + " " + v for k, v in tuples]
print(result)
output:
['hello my', 'my name', 'name is', 'is bob']
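If you also need to build the bigram tuples themselves from a sentence first, a minimal sketch (the sentence and names are illustrative) is to zip the word list with itself shifted by one:

```python
# Build bigram tuples by pairing each word with its successor,
# then join each pair with a space.
sentence = 'hello my name is bob'
words = sentence.split()
bigrams = list(zip(words, words[1:]))
print(bigrams)
# [('hello', 'my'), ('my', 'name'), ('name', 'is'), ('is', 'bob')]
print([' '.join(b) for b in bigrams])
# ['hello my', 'my name', 'name is', 'is bob']
```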
I am attempting to replace text in a list with text from another list. Below, lst_a has the string length I need for another script, but none of the formatting from lst_b. I want to give lst_a the correct spelling, capitalization, and punctuation from lst_b.
For example:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
I'm not 100% sure the best way to approach this problem.
I have tried breaking lst_a into a smaller sub_lst_a and taking the difference from each list, but I'm not sure what to do when entire items exist in one list and not the other (e.g. 'it' and 'is' rather than 'it's').
Regardless, any help/direction would be greatly appreciated!
Solution attempt below:
I thought it might be worth trying to break lst_a into a list of just words. Then I thought to enumerate each item, so that I could more easily identify its counterpart in lst_b. From there I wanted to take the difference of the two lists, and replace the values in lst_a_diff with lst_b_diff. I had to sort the lists because my diff script wasn't consistently ordering the outputs.
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
# splitting lst_a into a smaller sub_lst_a
def convert(lst_a):
    return [i for item in lst_a for i in item.split()]
sub_lst_a = convert(lst_a)
# getting the position values of sub_lst_a and lst_b
lst_a_pos = [f"{i}, {v}" for i, v in enumerate(sub_lst_a)]
lst_b_pos = [f"{i}, {v}" for i, v in enumerate(lst_b)]
# finding the difference between the two lists
def Diff(lst_a_pos, lst_b_pos):
    return list(set(lst_a_pos) - set(lst_b_pos))
lst_a_diff = Diff(lst_a_pos, lst_b_pos)
lst_b_diff = Diff(lst_b_pos, lst_a_pos)
# sorting lst_a_diff and lst_b_diff by the original position of each item
lst_a_diff_sorted = sorted(lst_a_diff, key = lambda x: int(x.split(', ')[0]))
lst_b_diff_sorted = sorted(lst_b_diff, key = lambda x: int(x.split(', ')[0]))
print(lst_a_diff_sorted)
print(lst_b_diff_sorted)
Desired Results:
final_lst_a = ['It\'s an', 'example of', 'an English simple sentence.']
Solution walkthrough
Assuming, as you say, that the two lists are always in the same order, then to properly align the indexes in both, words with an apostrophe should really count as two.
One way to do that is for example to expand the words by adding an empty element:
# Fill in blanks for words that have apostrophe: they should count as 2
lst_c = []
for item in lst_b:
    lst_c.append(item)
    if "'" in item:
        lst_c.append('')
print(lst_c)
>> ["It's", '', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
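The same expansion can also be written as a single comprehension, if you prefer (a sketch, equivalent to the loop above):

```python
lst_b = ["It's", 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
# For each word, emit [word, ''] when it contains an apostrophe, else [word].
lst_c = [x for word in lst_b for x in ([word, ''] if "'" in word else [word])]
print(lst_c)
# ["It's", '', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
```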
Now it is a matter of expanding lst_a on a word-by-word basis, and then group them back as in the original lists. Essentially, we align the lists like this:
['it', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence']
["It's", '', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
then we create new_item slices like these:
["It's", "", "an"]
['example', 'of']
['an', 'English', 'simple', 'sentence.']
The code looks like this:
# Makes a map of list index and length to extract
final = []
ptr = 0
for item in lst_a:
    # take each item in lst_a and count how many words it has
    count = len(item.split())
    # then use ptr and count to map the correct slice of lst_c
    new_item = lst_c[ptr:ptr + count]
    # get rid of the empty placeholder strings now
    new_item = filter(len, new_item)
    # join the words with a single space and append to the final list
    final.append(' '.join(new_item))
    # advance the ptr
    ptr += count
>> ["It's an", 'example of', 'an English simple sentence.']
Complete code solution
This seems to handle other cases well enough. The complete code would be something like:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
# This is another example that seems to work
# lst_a = ['tomorrow I will', 'go to the movies']
# lst_b = ['Tomorrow', 'I\'ll', 'go', 'to', 'the', 'movies.']
# Fill in blanks for words that have apostrophe: they should count as 2
lst_c = []
for item in lst_b:
    lst_c.append(item)
    if "'" in item:
        lst_c.append('')
print(lst_c)

# Makes a map of list index and length to extract
final = []
ptr = 0
for item in lst_a:
    count = len(item.split())
    new_item = lst_c[ptr:ptr + count]
    # get rid of empty strings now
    new_item = filter(len, new_item)
    ptr += count
    final.append(' '.join(new_item))
print(final)
You can try the following code:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
lst_a_split = []
end_indices_in_lst_a_split = []
# Construct "lst_a_split" and "end_indices_in_lst_a_split".
# "lst_a_split" is supposed to be ['it', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence'].
# "end_indices_in_lst_a_split" is supposed to be [3, 5, 9].
end = 0
for s in lst_a:
    s_split = s.split()
    end += len(s_split)
    end_indices_in_lst_a_split.append(end)
    for word in s_split:
        lst_a_split.append(word)
# Construct "d" which contains
# index of every word in "lst_b" which does not include '\'' as value
# and the corresponding index of the word in "lst_a_split" as key.
# "d" is supposed to be {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7}.
d = {}
start = 0
for index_in_lst_b, word in enumerate(lst_b):
    if '\'' in word:
        continue
    word = word.lower().strip('.').strip(',').strip('"')  # you can add other strip()'s as you want
    index_in_lst_a_split = lst_a_split.index(word, start)
    start = index_in_lst_a_split + 1
    d[index_in_lst_a_split] = index_in_lst_b
# Construct "final_lst_a".
final_lst_a = []
start_index_in_lst_b = 0
for i, end in enumerate(end_indices_in_lst_a_split):
    if end - 1 in d:
        end_index_in_lst_b = d[end - 1] + 1
        final_lst_a.append(' '.join(lst_b[start_index_in_lst_b:end_index_in_lst_b]))
        start_index_in_lst_b = end_index_in_lst_b
    elif end in d:
        end_index_in_lst_b = d[end]
        final_lst_a.append(' '.join(lst_b[start_index_in_lst_b:end_index_in_lst_b]))
        start_index_in_lst_b = end_index_in_lst_b
    else:
        # Printed if it fails to construct "final_lst_a" successfully. That would
        # happen if the words in "lst_b" on both sides of a boundary contain '\'',
        # which seems unlikely.
        print(f'Failed to find corresponding words in "lst_b" for the string "{lst_a[i]}".')
        break
print(final_lst_a)
which prints
["It's an", 'example of', 'an English simple sentence.']
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
for word in lst_b:
    # If a word is capitalized, look for it in lst_a and capitalize it there
    if word[0].upper() == word[0]:
        for idx, phrase in enumerate(lst_a):
            if word.lower() in phrase:
                lst_a[idx] = phrase.replace(word.lower(), word)
    if "'" in word:
        # If a word has an apostrophe, look for its expanded form in lst_a and change it.
        # Note here you can include other patterns like " are",
        # or maybe just restrict it to "it is", etc.
        for idx, phrase in enumerate(lst_a):
            if " is" in phrase:
                lst_a[idx] = phrase.replace(" is", "'s")
                break
print(lst_a)
I know you already have a few responses to review. Here's something that should help you expand the implementation.
In addition to lst_a and lst_b, what if you could supply all the lookup items like 'It's', 'I'll', and 'don't', and outline what each should represent? Then the code below would take care of that lookup as well.
#original lst_a. This list does not have the punctuation marks
lst_a = ['it is an', 'example of', 'an english simple sentence', 'if time permits', 'I will learn','this weekend', 'but do not', 'count on me']
#desired output with correct spelling, capitalization, and punctuation
#but includes \' that need to be replaced
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence.', 'If', 'time', 'permits,','I\'ll', 'learn','this','weekend', 'but', 'don\'t','count', 'on', 'me']
#lookup list to replace the contractions
ch = {'It\'s':['It','is'],'I\'ll':['I','will'], 'don\'t':['do','not']}
#final list will be stored into lst_c
lst_c = []
#enumerate through lst_b to replace all words that are contractions
for i, v in enumerate(lst_b):
    # for this example, I am assuming that all contractions are two-part words
    for j, k in ch.items():
        if v == j:  # check whether this word is a contraction
            lst_b[i] = k[0]  # replace it with the first part of the contraction
            lst_b.insert(i + 1, k[1])  # and insert the second part after it

# now stitch the words together based on the length of each item in lst_a
c = 0
for i in lst_a:
    j = i.count(' ')  # find out the number of words to stitch together
    # stitch together only as many words as the item in lst_a contains
    lst_c.append(' '.join([lst_b[k] for k in range(c, c + j + 1)]))
    c += j + 1
#finally, I am printing lst_a, lst_b, and lst_c. The final result is in lst_c
print (lst_a, lst_b, lst_c, sep = '\n')
Output for this is as shown below:
lst_a = ['it is an', 'example of', 'an english simple sentence', 'if time permits', 'I will learn', 'this weekend', 'but do not', 'count on me']
lst_b = ['It', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence.', 'If', 'time', 'permits,', 'I', 'will', 'learn', 'this', 'weekend', 'but', 'do', 'not', 'count', 'on', 'me']
lst_c = ['It is an', 'example of', 'an english simple sentence.', 'If time permits,', 'I will learn', 'this weekend', 'but do not', 'count on me']
I need to create a function that gets passed multiple lists and returns a string to then be printed. Honestly I don't even know if I am headed in the right direction or not.
wordlist = ['new', 'barn', 'shark', 'hold', 'art', 'only', 'eyes'],
['subtract', 'add'],
['girl', 'house', 'best', 'thing', 'easy', 'wrong', 'right', 'again', 'above'],
['question'],
[]
def createSentence(wordlist):
    if len(wordlist) > 1:
        return 'The ' + str(len(wordlist)) + ' sight words for this week are ' + wordlist + '.'
    elif len(wordlist) == 1:
        return 'The only sight word for this week is' + wordlist + '.'
    elif len(wordlist) == 0:
        return 'There are no new sight words for this week!'

print(createSentence(wordlist))
Also, I think my lists should really look like this:
week2 = ['new', 'barn', 'shark', 'hold', 'art', 'only', 'eyes']
week5 = ['subtract', 'add']
week10 = ['girl', 'house', 'best', 'thing', 'easy', 'wrong', 'right', 'again', 'above']
week13 = ['question']
week17 = []
But I don't know how to pass them to the function.
I think you may want to convert the list into a string by using the join function.
(' ,').join(wordlist)
week2 = ['new', 'barn', 'shark', 'hold', 'art', 'only', 'eyes']
week5 = ['subtract', 'add']
week10 = ['girl', 'house', 'best', 'thing', 'easy', 'wrong', 'right', 'again', 'above']
week13 = ['question']
week17 = []
def createSentence(wordlist):
    if len(wordlist) > 1:
        return 'The ' + str(len(wordlist)) + ' sight words for this week are ' + (' ,').join(wordlist) + '.'
    elif len(wordlist) == 1:
        return 'The only sight word for this week is ' + (' ,').join(wordlist) + '.'
    elif len(wordlist) == 0:
        return 'There are no new sight words for this week!'
Output for week2:
'The 7 sight words for this week are new ,barn ,shark ,hold ,art ,only ,eyes.'
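Note the separator order: `(' ,')` puts the space before the comma, which is why the output reads `new ,barn`. Swapping it to `', '` gives the more conventional spacing:

```python
words = ['new', 'barn', 'shark']
print(' ,'.join(words))  # new ,barn ,shark
print(', '.join(words))  # new, barn, shark
```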
Like this:
wordlist = [['new', 'barn', 'shark', 'hold', 'art', 'only', 'eyes'],
['subtract', 'add'],
['girl', 'house', 'best', 'thing', 'easy', 'wrong', 'right', 'again', 'above'],
['question'],
[]]
def createSentence(wordlist):
    if len(wordlist) > 1:
        return f'The {len(wordlist)} sight words for this week are {", ".join(wordlist)}.'
    elif len(wordlist) == 1:
        return f'The only sight word for this week is {wordlist[0]}.'
    elif len(wordlist) == 0:
        return 'There are no new sight words for this week!'

for lst in wordlist:
    print(createSentence(lst))
Output:
The 7 sight words for this week are new, barn, shark, hold, art, only, eyes.
The 2 sight words for this week are subtract, add.
The 9 sight words for this week are girl, house, best, thing, easy, wrong, right, again, above.
The only sight word for this week is question.
There are no new sight words for this week!
I'd suggest taking a different approach by creating a data structure for your data -- in this case, a dictionary:
wordlists = {
'week 2': ['new', 'barn', 'shark', 'hold', 'art', 'only', 'eyes'],
'week 5': ['subtract', 'add'],
'week 10': ['girl', 'house', 'best', 'thing', 'easy', 'wrong', 'right', 'again', 'above'],
'week 13': ['question'],
'week 17': [],
}
def createSentence(week):
    wordlist = wordlists[week]
    length = len(wordlist)
    if length > 1:
        return "The {} sight words for {} are: {}.".format(length, week, ", ".join(wordlist))
    if length == 1:
        return "The only sight word for {} is: {}.".format(week, ", ".join(wordlist))
    return "There are no new sight words for {}!".format(week)

for week in wordlists:
    print(createSentence(week))
OUTPUT
> python3 test.py
The 7 sight words for week 2 are: new, barn, shark, hold, art, only, eyes.
The 2 sight words for week 5 are: subtract, add.
The 9 sight words for week 10 are: girl, house, best, thing, easy, wrong, right, again, above.
The only sight word for week 13 is: question.
There are no new sight words for week 17!
def ngram(n, k, document):
    f = open(document, 'r')
    for i, line in enumerate(f):
        words = line.split() + line.split()
        print words
    return {}
For example, for "I love the Python programming language" and n = 2, the bigrams
are "I love", "love the", "the Python", "Python programming", and "programming language".
I want to store them in a list and then compare how many of them are the same.
It's not entirely clear what you want returned. Assuming one line says:
I love the Python programming language
And that you don't want n-grams to span line boundaries.
from collections import deque

def linesplitter(line, n):
    prev = deque(maxlen=n)         # fixed-length window
    for word in line.split():      # iterate through each word
        prev.append(word)          # keep adding to the window
        if len(prev) == n:         # until there are n elements
            print(" ".join(prev))  # then start printing
            # the oldest element is removed automatically

with open(document) as f:  # 'r' is implied
    for line in f:
        linesplitter(line, 2)  # or any other length!
Output:
I love
love the
the Python
Python programming
programming language
You could adapt from one of the itertools recipes:
import itertools

def ngrams(N, k, filepath):
    with open(filepath) as infile:
        words = (word for line in infile for word in line.split())
        ts = itertools.tee(words, N)
        # advance the i-th copy by i positions, so zipping them
        # yields windows of N consecutive words
        for i in range(1, len(ts)):
            for t in ts[i:]:
                next(t, None)
        return list(zip(*ts))
With a test file that looks like this:
I love
the
python programming language
Here's the output:
In [21]: ngrams(2, '', 'blah')
Out[21]:
[('I', 'love'),
('love', 'the'),
('the', 'python'),
('python', 'programming'),
('programming', 'language')]
In [22]: ngrams(3, '', 'blah')
Out[22]:
[('I', 'love', 'the'),
('love', 'the', 'python'),
('the', 'python', 'programming'),
('python', 'programming', 'language')]
Well, you can achieve this through a List Comprehension:
>>> s = "I love the Python programming language"
>>> [s1 + " " + s2 for s1, s2 in zip(s.split(), s.split()[1:])]
['I love', 'love the', 'the Python', 'Python programming', 'programming language']
You can also use the str.format function:
>>> ["{} {}".format(s1, s2) for s1, s2 in zip(s.split(), s.split()[1:])]
['I love', 'love the', 'the Python', 'Python programming', 'programming language']
The finalized version of the function:
from itertools import tee, islice

def ngram(n, s):
    var = [islice(it, i, None) for i, it in enumerate(tee(s.split(), n))]
    return [("{} " * n).format(*itt) for itt in zip(*var)]
Demo:
>>> from splitting import ngram
>>> thing = 'I love the Python programming language'
>>> ngram(2, thing)
['I love ', 'love the ', 'the Python ', 'Python programming ', 'programming language ']
>>> ngram(3, thing)
['I love the ', 'love the Python ', 'the Python programming ', 'Python programming language ']
>>> ngram(4, thing)
['I love the Python ', 'love the Python programming ', 'the Python programming language ']
>>> ngram(1, thing)
['I ', 'love ', 'the ', 'Python ', 'programming ', 'language ']
Here is a "one-line" solution, using a list comprehension:
s = "I love the Python programming language"
def ngram(s, n):
    return [" ".join(k) for k in zip(*[l[0] for l in zip(s.split()[e:] for e in range(n))])]

# Test
for i in range(1, 7):
    print(ngram(s, i))
Output:
['I', 'love', 'the', 'Python', 'programming', 'language']
['I love', 'love the', 'the Python', 'Python programming', 'programming language']
['I love the', 'love the Python', 'the Python programming', 'Python programming language']
['I love the Python', 'love the Python programming', 'the Python programming language']
['I love the Python programming', 'love the Python programming language']
['I love the Python programming language']
Note that no k parameter is needed.
Adapted to your case:
def ngram(document, n):
    with open(document) as f:
        for line in f:
            print([" ".join(k) for k in zip(*[l[0] for l in zip(line.split()[e:] for e in range(n))])])
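If the nested zips are hard to read, the one-liner can be unrolled into an equivalent sketch (the helper name is illustrative):

```python
def ngram_readable(s, n):
    # Make n shifted copies of the word list; zip truncates to the
    # shortest, so each zipped group is one window of n consecutive words.
    words = s.split()
    shifted = [words[e:] for e in range(n)]
    return [' '.join(group) for group in zip(*shifted)]

print(ngram_readable('I love the Python programming language', 2))
# ['I love', 'love the', 'the Python', 'Python programming', 'programming language']
```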
I have a list of tuples that looks like this:
[('this', 'is'), ('is', 'the'), ('the', 'first'), ('first', 'document'), ('document', '.')]
What is the most pythonic and efficient way to convert into this where each token is separated by a space:
['this is', 'is the', 'the first', 'first document', 'document .']
Very simple:
[ "%s %s" % x for x in l ]
Using map() and join():
tuple_list = [('this', 'is'), ('is', 'the'), ('the', 'first'), ('first', 'document'), ('document', '.')]
string_list = list(map(' '.join, tuple_list))  # list() is needed on Python 3, where map returns an iterator
As inspectorG4dget pointed out, list comprehensions are the most pythonic way of doing this:
string_list = [' '.join(item) for item in tuple_list]
This does it:
>>> l=[('this', 'is'), ('is', 'the'), ('the', 'first'),
('first', 'document'), ('document', '.')]
>>> ['{} {}'.format(x,y) for x,y in l]
['this is', 'is the', 'the first', 'first document', 'document .']
If your tuples are variable length (or not even), you can also do this:
>>> [('{} '*len(t)).format(*t).strip() for t in [('1',),('1','2'),('1','2','3')]]
['1', '1 2', '1 2 3'] #etc
Or, probably best still:
>>> [' '.join(t) for t in [('1',),('1','2'),('1','2','3'),('1','2','3','4')]]
['1', '1 2', '1 2 3', '1 2 3 4']
I strongly suggest you avoid using %s. Starting with Python 3.6, f-strings are available, so you can take advantage of this feature as follows:
[f'{" ".join(e)}' for e in l]
If you are using a version older than Python 3.6, you can also avoid %s by employing the format function as follows:
print(['{joined}'.format(joined=' '.join(e)) for e in l]) # before Python 3.6
Alternative:
Assuming you have 2 elements in each tuple, you can use the following:
# Python 3.6+
[f'{first} {second}' for first, second in l]
# Before Python 3.6
['{first} {second}'.format(first=first, second=second) for first, second in l]
Assuming the list is:
li = [('this', 'is'), ('is', 'the'), ('the', 'first'), ('first', 'document'), ('document', '.')]
you can use a list comprehension + join(); all you need to do is:
[' '.join(x) for x in li]
You can also use map() + join()
list(map(' '.join, li))
Result:
['this is', 'is the', 'the first', 'first document', 'document .']