python: tokenize list of tuples without for loop - python

I have got a list of 2 million tuples with the first element being text and the second an integer. e.g.
list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
I would like to tokenize the first item in each tuple and attach all of the lists of words to a flattened list so the desired output would be.
list_of_tokenized_tuples = [(['here', 'is', 'some', 'text'], 1), (['this', 'is', 'more', 'text'], 5), (['a', 'final', 'tuple'], 12)]
list_of_all_words = ['here', 'is', 'some', 'text', 'this', 'is', 'more', 'text', 'a', 'final', 'tuple']
So far, I believe I have found a way to achieve this with a for loop; however, due to the length of the list, it is really time-intensive. Is there any way to tokenize the first item in each tuple and/or flatten the list of all words without loops?
list_of_tokenized_tuples = []
list_of_all_words = []
for text, num in list_of_tuples:
    tokenized_text = list(word_tokenize(text))
    tokenized_tuples = (tokenized_text, num)
    list_of_all_words.append(tokenized_text)
    list_of_tokenized_tuples.append(tokenized_tuples)
list_of_all_words = [val for sublist in list_of_all_words for val in sublist]

Using itertools you could write it as:
from itertools import chain, imap
chain.from_iterable(imap(lambda (text,_): word_tokenize(text), list_of_tuples))
Testing this:
from itertools import chain, imap
def word_tokenize(text):
    return text.split() # insert your tokenizer here
ts = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
print list( chain.from_iterable(imap(lambda (t,_): word_tokenize(t), ts)) )
Output
['here', 'is', 'some', 'text', 'this', 'is', 'more', 'text', 'a', 'final', 'tuple']
I'm not sure what this buys you though as there are for loops in the implementation of the itertools functions.
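On Python 3, `imap` no longer exists (the built-in `map` is already lazy) and tuple-unpacking lambdas such as `lambda (t, _): ...` are a syntax error. Here is a minimal sketch of the same idea for Python 3, assuming a plain `str.split` stand-in for the real tokenizer (swap in your own, e.g. NLTK's `word_tokenize`):

```python
from itertools import chain

def word_tokenize(text):
    # Stand-in tokenizer (an assumption for this sketch); replace with your own.
    return text.split()

ts = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]

# A generator expression replaces imap plus the tuple-unpacking lambda.
all_words = list(chain.from_iterable(word_tokenize(text) for text, _ in ts))
print(all_words)
```

The loop still exists inside the generator expression and `chain`, so this is mostly a matter of style rather than a guaranteed speed-up.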

TL;DR
>>> from itertools import chain
>>> list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
# Split up your list(str) from the int
>>> texts, nums = zip(*list_of_tuples)
# Go into each string and split by whitespaces,
# Then flatten the list of list of str to list of str
>>> list_of_all_words = list(chain(*map(str.split, texts)))
>>> list_of_all_words
['here', 'is', 'some', 'text', 'this', 'is', 'more', 'text', 'a', 'final', 'tuple']
If you need to use word_tokenize, then:
list_of_all_words = list(chain(*map(word_tokenize, texts)))
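The snippet above only builds the flat word list. If the tokenized tuples from the question are needed too, one possible sketch tokenizes each text once and reuses the result for both outputs (`str.split` is only a placeholder for `word_tokenize` here, and `tokenize` is a name introduced for this example):

```python
from itertools import chain

list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]

tokenize = str.split  # placeholder tokenizer; substitute word_tokenize if needed

# Tokenize each text exactly once, keeping the integer alongside it.
list_of_tokenized_tuples = [(tokenize(text), num) for text, num in list_of_tuples]

# Flatten by reusing the token lists instead of tokenizing a second time.
list_of_all_words = list(
    chain.from_iterable(tokens for tokens, _ in list_of_tokenized_tuples)
)
```

`chain.from_iterable` is used instead of `chain(*...)` so that the 2 million sublists are not unpacked into function arguments first.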

I wrote this generator for you. If you want to create a list, there isn't much else you can do (except a list comprehension). With that in mind, please see below: it gives you your desired output, but joined within a tuple as two separate lists. I doubt that matters too much, and I'm sure you could always change it a bit to suit your needs or preferences.
import timeit, random
list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
big_list = [random.choice(list_of_tuples) for x in range(1000)]
def gen(lot=big_list, m='tokenize'):
    list_all_words = []
    tokenised_words = []
    i1 = 0
    i2 = 0
    i3 = 0
    lol1 = len(lot)
    while i1 < lol1:
        # yield lot[i1]
        lol2 = len(lot[i1])
        while i2 < lol2:
            if type(lot[i1][i2]) == str:
                list_all_words.append((lot[i1][i2].split(), i1 + 1))
            i2 += 1
        i1 += 1
        i2 = 0
    # print(list_all_words)
    lol3 = len(list_all_words)
    while i3 < lol3:
        tokenised_words += list_all_words[i3][0]
        i3 += 1
    if m == 'list':
        yield list_all_words
    if m == 'tokenize':
        yield tokenised_words
for x in gen():
    print(x)
print(timeit.timeit(gen))
# Output of timeit: 0.2610903770813007
# This should be unnoticeable on system resources I would have thought.
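One caveat about the timing above: calling a generator function only creates a generator object, so `timeit.timeit(gen)` does not execute the loop body. Here is a hedged sketch of a measurement that actually consumes the generator (the `flatten` helper and `number=100` are illustrative choices, not part of the answer):

```python
import random
import timeit

list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
big_list = [random.choice(list_of_tuples) for _ in range(1000)]

def flatten(lot):
    # Hypothetical helper: lazily yield every word of every text.
    for text, _ in lot:
        yield from text.split()

# list(...) forces the generator to run, so the loop body is really timed.
elapsed = timeit.timeit(lambda: list(flatten(big_list)), number=100)
print(f"{elapsed:.4f} s for 100 runs")
```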

Related

Split The Second String of Every Element in a List into Multiple Strings

I am very very new to python so I'm still figuring out the basics.
I have a nested list with each element containing two strings like so:
mylist = [['Wowza', 'Here is a string'],['omg', 'yet another string']]
I would like to iterate through each element in mylist, and split the second string into multiple strings so it looks like:
mylist = [['wowza', 'Here', 'is', 'a', 'string'],['omg', 'yet', 'another', 'string']]
I have tried so many things, such as unzipping and
for elem in mylist:
    mylist.append(elem)
NewList = [item[1].split(' ') for item in mylist]
print(NewList)
and even
for elem in mylist:
    NewList = ' '.join(elem)
    def Convert(string):
        li = list(string.split(' '))
        return li
    print(Convert(NewList))
Which just gives me a variable that contains a bunch of lists.
I know I'm way overcomplicating this, so any advice would be greatly appreciated.
You can use a list comprehension:
mylist = [['Wowza', 'Here is a string'],['omg', 'yet another string']]
req_list = [[i[0]]+ i[1].split() for i in mylist]
# [['Wowza', 'Here', 'is', 'a', 'string'], ['omg', 'yet', 'another', 'string']]
I agree with @DeepakTripathi's list comprehension suggestion (+1) but I would structure it more descriptively:
>>> mylist = [['Wowza', 'Here is a string'], ['omg', 'yet another string']]
>>> newList = [[tag, *words.split()] for (tag, words) in mylist]
>>> print(newList)
[['Wowza', 'Here', 'is', 'a', 'string'], ['omg', 'yet', 'another', 'string']]
>>>
You can use the + operator on lists to combine them:
a = ['hi', 'multiple words here']
b = []
for i in a:
b += i.split()
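Applied to the nested list from the question, the same `+` idea might look like this (a sketch; `result` is a name introduced here):

```python
mylist = [['Wowza', 'Here is a string'], ['omg', 'yet another string']]

result = []
for tag, words in mylist:
    # Keep the first string as is and append the pieces of the second one.
    result.append([tag] + words.split())

print(result)
```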

Return items around all instances of an item in a list

Say I have a list...
['a','brown','cat','runs','another','cat','jumps','up','the','hill']
...and I want to go through that list and return all instances of a specific item, as well as the 2 items before and the 2 items after it. Exactly like this if I am searching for 'cat':
[('a','brown','cat','runs','another'),('runs','another','cat','jumps','up')]
The order of the returned list of tuples is irrelevant. Ideally the code would handle cases where the word is the first or last in the list, and an efficient and compact piece of code would be better, of course.
Thanks again everybody, I am just getting my feet wet in Python and everybody here has been a huge help!
Without error checking:
words = ['a','brown','cat','runs','another','cat','jumps','up','the','hill']
the_word = 'cat'
seqs = []
for i, word in enumerate(words):
    if word == the_word:
        seqs.append(tuple(words[i-2:i+3]))
print seqs #Prints: [('a', 'brown', 'cat', 'runs', 'another'), ('runs', 'another', 'cat', 'jumps', 'up')]
A recursive solution:
def context(ls, s):
    if not s in ls: return []
    i = ls.index(s)  # look up the search term, not a hard-coded 'cat'
    return [ tuple(ls[i-2:i+3]) ] + context(ls[i + 1:], s)
ls = ['a','brown','cat','runs','another','cat','jumps','up','the','hill']
print context(ls, 'cat')
Gives:
[('a','brown','cat','runs','another'),('runs','another','cat','jumps','up')]
With error checking:
def grep(in_list, word):
    out_list = []
    for i, val in enumerate(in_list):
        if val == word:
            lower = i-2 if i-2 > 0 else 0
            upper = i+3 if i+3 < len(in_list) else len(in_list)
            out_list.append(tuple(in_list[lower:upper]))
    return out_list
in_list = ['a', 'brown', 'cat', 'runs', 'another', 'cat', 'jumps', 'up', 'the', 'hill']
grep(in_list, "cat")
# output: [('a', 'brown', 'cat', 'runs', 'another'), ('runs', 'another', 'cat', 'jumps', 'up')]
grep(in_list, "the")
# output: [('jumps', 'up', 'the', 'hill')]
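The clamping in `grep` can be generalized to any window size. Here is a minimal sketch, where `context_windows` and its `width` parameter are names made up for this example:

```python
def context_windows(items, target, width=2):
    """Return a tuple of up to `width` neighbours on each side of every match."""
    windows = []
    for i, item in enumerate(items):
        if item == target:
            # Clamp the start so a negative index never wraps around.
            start = max(i - width, 0)
            # Slicing past the end of a list is safe, so no upper clamp is needed.
            windows.append(tuple(items[start:i + width + 1]))
    return windows

words = ['a', 'brown', 'cat', 'runs', 'another', 'cat', 'jumps', 'up', 'the', 'hill']
print(context_windows(words, 'cat'))
print(context_windows(words, 'a'))
```

Only the lower bound needs explicit handling, because a negative start index would otherwise count from the end of the list.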

Python - loop through list and save state before replacing

I'd like to replace the 'x-%' from the origin list with the values from 'anotherList' in loop.
As you can see, when looping through, only the last state is saved, because it replaces the standardList again.
What might be the best way to kind of 'save the state of every list' and then loop through it again?
Result should be:
result = ['I', 'just', 'try','to', 'acomplish', 'this','foo', 'list']
What I've got so far:
originList = ['I', 'x-0', 'x-1','to', 'acomplish', 'x-2','foo', 'x-3']
anotherList = ['just','try','this','list']
for index in originList:
    for num in range(0,4):
        if 'x' in index:
            result = str(originList).replace('x-%s'%(str(num)), anotherList[num])
print result
#['I', 'x-0', 'x-1', 'to', 'acomplish', 'x-2', 'foo', 'list'] <-- wrong :X
Thanks for any help because I can't figure it out at the moment
EDIT*
If there is a cleaner solution I would also appreciate to hear
This one avoids the creation of a new list
count = 0
for word in range(0, len(originList)):
    if 'x-' in originList[word]:
        originList[word] = anotherList[count]
        count += 1
print originList
Here ya go!
>>> res = []
>>> for original in originList:
...     if 'x' in original:
...         res.append(anotherList[int(original[-1])]) #grab the index
...     else:
...         res.append(original)
...
>>> res
['I', 'just', 'try', 'to', 'acomplish', 'this', 'foo', 'list']
>>>
Since the index of the value needed is in the items of originList, you can just use it, so no need for the extra loop. Hope this helps!
originList = ['I', 'x-0', 'x-1','to', 'acomplish', 'x-2','foo', 'x-3']
anotherList = ['just','try','this','list']
def change(L1, L2):
    res = []
    index = 0
    for ele in L1:
        if 'x-' in ele:
            res.append(L2[index])
            index += 1
        else:
            res += [ele]
    return res
print(change(originList, anotherList))
The result:
['I', 'just', 'try', 'to', 'acomplish', 'this', 'foo', 'list']
originList = ['I', 'x-0', 'x-1','to', 'acomplish', 'x-2','foo', 'x-3']
anotherList = ['just','try','this','list']
res = []
i=0
for index in originList:
    if 'x' in index:
        res.append(anotherList[i])
        i += 1
    else:
        res.append(index)
print res
This gives the right result.
However, I think you could also use str.format, like this:
print '{0}{1}{2}{3}'.format('a', 'b', 'c', 123) #abc123
See the Python docs on string formatting.
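If the number inside each placeholder should decide which replacement is used (rather than the order of appearance), one option is to parse it out. Here is a sketch assuming every placeholder has the exact form `x-<digits>`; `fill_placeholders` is a hypothetical helper name:

```python
import re

# Matches items such as 'x-0' or 'x-12' and captures the number.
PLACEHOLDER = re.compile(r'x-(\d+)')

def fill_placeholders(template, replacements):
    """Return a new list with each 'x-N' item replaced by replacements[N]."""
    result = []
    for item in template:
        match = PLACEHOLDER.fullmatch(item)
        # Non-placeholder words are copied unchanged.
        result.append(replacements[int(match.group(1))] if match else item)
    return result

originList = ['I', 'x-0', 'x-1', 'to', 'acomplish', 'x-2', 'foo', 'x-3']
anotherList = ['just', 'try', 'this', 'list']
print(fill_placeholders(originList, anotherList))
```

Because a new list is built, `originList` stays untouched, which sidesteps the "only the last state is saved" problem from the question. Unlike `original[-1]`, this also copes with indices of more than one digit.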

Sequence Generation with Number applied to string

I have tried sequence generators such as lambdas, list comprehensions and others, but I am not able to get what I really want. My final goal is to print sequences of words from a string, like string[1:3].
What I am looking for :
a = [0,13,26,39]
b = [12,25,38,51]
str = 'If you are done with the file, move to the command area across from the file name in the RL screen and type'
read = str.split()
read[0:12]
['If', 'you', 'are', 'done', 'with', 'the', 'file,', 'move', 'to', 'the', 'command', 'area']
read[13:25]
['from', 'the', 'file', 'name', 'in', 'the', 'RL', 'screen', 'and', 'type']
Use zip:
>>> a = [0,13,26,39]
>>> b = [12,25,38,51]
>>> strs = 'If you are done with the file, move to the command area across from the file name in the RL screen and type'
>>> spl = strs.split()
>>> for x,y in zip(a,b):
...     print spl[x:y]
...
['If', 'you', 'are', 'done', 'with', 'the', 'file,', 'move', 'to', 'the', 'command', 'area']
['from', 'the', 'file', 'name', 'in', 'the', 'RL', 'screen', 'and', 'type']
[]
[]
zip returns a list of tuples (in Python 2), where each tuple contains the items at the same index from the iterables passed to it:
>>> zip(a,b)
[(0, 12), (13, 25), (26, 38), (39, 51)]
Use itertools.izip if you want a memory-efficient solution, as it returns an iterator.
You can use str.join if you want to create a string from that sliced list:
>>> for x,y in zip(a,b):
...     print " ".join(spl[x:y])
...
If you are done with the file, move to the command area
from the file name in the RL screen and type
Update: Creating a and b:
>>> n = 5
>>> a = range(0, 13*n, 13)
>>> b = [ x + 12 for x in a]
>>> a
[0, 13, 26, 39, 52]
>>> b
[12, 25, 38, 51, 64]
Do you mean:
>>> [read[i:j] for i, j in zip(a,b)]
[['If', 'you', 'are', 'done', 'with', 'the', 'file,', 'move', 'to', 'the',
'command', 'area'], ['from', 'the', 'file', 'name', 'in', 'the', 'RL',
'screen', 'and', 'type'], [], []]
or
>>> ' '.join([read[i:j] for i, j in zip(a,b)][0])
'If you are done with the file, move to the command area'
>>> ' '.join([read[i:j] for i, j in zip(a,b)][1])
'from the file name in the RL screen and type'
a = [0,13,26,39]
b = [12,25,38,51]
str = 'If you are done with the file, move to the command area across from the file name in the RL screen and type'
read = str.split()
extra_lists = [read[start:end] for start,end in zip(a,b)]
print extra_lists
You mentioned a lambda, so:
f = lambda s, i, j: s.split()[i:j]
>>> f("hello world how are you",0,2)
['hello', 'world']
Seems like you're doing the slice indices in two lists, might I suggest a dictionary or a list of tuples?
str = 'If you are done with the file, move to the command area across from the file name in the RL screen and type'
slices = [(0, 12), (13, 25)]
dslices = {0: 12, 13: 25}
for pair in slices:
    print f(str, pair[0], pair[1])
for key in dslices:
    print f(str, key, dslices[key])
I'm not a fan of using zip when you have the option of just formatting your data better.
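Here is a Python 3 sketch of the list-of-tuples idea, using the bounds from the question (`text`, `words` and `slices` are illustrative names, and `text` avoids shadowing the built-in `str`):

```python
text = ('If you are done with the file, move to the command area across '
        'from the file name in the RL screen and type')
words = text.split()

# Each pair is (start, end), matching a = [0, 13, ...] and b = [12, 25, ...].
slices = [(0, 12), (13, 25)]

for start, end in slices:
    print(' '.join(words[start:end]))
```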

Ordered tally of the cumulative number of unique words seen by a given position

I have a list of words given below (example):
['the', 'counter', 'starts', 'the', 'starts', 'for']
I want to process this list in order and generate a pair (x,y) where x is incremented with each word and y is incremented only when it sees a unique word.
So for the given example, my output should be like: [(1,1), (2,2), (3,3), (4,3), (5,3), (6,4)]
I am not sure how to do this in Python. It would be great if I could get some insights on how to do this.
Thanks.
try this:
>>> from collections import Counter
>>> data = ['the', 'counter', 'starts', 'the', 'starts', 'for']
>>> tally = Counter()
>>> for elem in data:
...     tally[elem] += 1
...
>>> tally
Counter({'starts': 2, 'the': 2, 'counter': 1, 'for': 1})
from here: http://docs.python.org/2/library/collections.html
Of course, this results in a dictionary, not a list. I don't know whether there is a way to convert this dict to a list (some zip function, perhaps?).
Hope this helps.
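A `Counter` can be turned into a list of pairs directly. Here is a brief sketch using `most_common()`, which sorts by count (ties keep first-seen order on Python 3.7+). Note that this yields per-word totals, not the running tally the question asks for:

```python
from collections import Counter

data = ['the', 'counter', 'starts', 'the', 'starts', 'for']
tally = Counter(data)  # counts every element in a single call

pairs = tally.most_common()  # list of (word, count), highest count first
print(pairs)
```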
>>> words = ['the', 'counter', 'starts', 'the', 'starts', 'for']
>>> uniq = set()
>>> result = []
>>> for i, word in enumerate(words, 1):
...     uniq.add(word)
...     result.append((i, len(uniq)))
...
>>> result
[(1, 1), (2, 2), (3, 3), (4, 3), (5, 3), (6, 4)]
Use collections.Counter for counting occurrences:
I appreciate this doesn't directly answer your question but it presents the canonical, pythonic way to count stuff as a response to the incorrect usage provided in this answer.
from collections import Counter
data = ['the', 'counter', 'starts', 'the', 'starts', 'for']
counter = Counter(data)
The result is a dict-like object that can be accessed via the keys
counter['the']
>>> 2
you can also call Counter.items() to generate an unordered list of (element, count) pairs
counter.items()
>>> [('starts', 2), ('the', 2), ('counter', 1), ('for', 1)]
The output you want is slightly weird, it might be worth re-thinking why you need the data in that format.
Like this:
>>> seen = set()
>>> words = ['the', 'counter', 'starts', 'the', 'starts', 'for']
>>> for x, w in enumerate(words, 1):
...     seen.add(w)
...     print(x, len(seen))
...
(1, 1)
(2, 2)
(3, 3)
(4, 3)
(5, 3)
(6, 4)
In actual practice, I'd make a generator function to successively yield the tuples, instead of printing them:
def uniq_count(lst):
    seen = set()
    for w in lst:
        seen.add(w)
        yield len(seen)
counts = list(enumerate(uniq_count(words), 1))
Note here that I have also separated the logic of the two counts. Since enumerate does just what you need for the first number in each pair, it's easier just to handle the second number in the generator and let enumerate handle the first.
data = ['the', 'counter', 'starts', 'the', 'starts', 'for']
print [(i, len(set(data[:i]))) for i, v in enumerate(data, 1)]
The dictionary mentioned in your comment can be created as follows:
data = ['the', 'counter', 'starts', 'the', 'starts', 'for']
print {j: data.count(j) for j in set(data)}
