Joining strings iteratively between two indexes in Python

I have a test file with the following format:
one
two
three
four
=
five
six
seven
eight
=
nine
ten
one
two
=
and I am writing Python code to build a list in which each line of the text file is one item:
import sys
dump = sys.argv[1]
lines = []
with open(dump) as f:
    for line in f:
        x = line.strip()
        lines.append(x)
print(lines)
lines list =
['one', 'two', 'three', 'four', '=', 'five', 'six', 'seven', 'eight', '=', 'nine', 'ten', 'one', 'two', '=']
I then get the indexes of the equals signs in order to try to use those at a later point to make a new list, combining the strings:
equals_indexes = [i for i, x in enumerate(lines) if x == '=']
equals_indexes list:
[4, 9, 14]
I am good up until this point. Now I would like to join the strings one, two, three, four before the first index as new_list element 1. I would like to join the next group of strings between equals sign 1 and 2, and the next group of strings between equals sign 2 and 3 to produce the following:
[[one two three four], [five six seven eight], [nine ten one two]]
I have tried to do this by iterating over the list of equals indexes, then iterating over the list lines:
groups = []
for i in equals_indexes:
    sequences = ""
    for x,y in enumerate(lines):
        if x < i:
            sequences = ' '.join(lines[x:i])
            groups.append(sequences)
print(groups)
Which produces the following:
['one two three four', 'two three four', 'three four', 'four', 'one two three four = five six seven eight', 'two three four = five six seven eight', ....]
I understand why this is happening, because at each iteration of x, it is checking to see if it is less than i and if so appending each string at x to the string "sequences". I am doing this because I have a large file with huge blocks of text corresponding to one iteration of a program. The separator between iteration 1 and iteration 2 of the program is a single '=' in the line. This way I can parse the list elements after I am able to split them by equals sign.
Any help would be great!

I think this gets you what you are looking for, although there is one part that is unclear. If you want to join the strings between equals signs as each element in your final list:
with open(dump) as f:
    full_string = ' '.join(line.strip() for line in f)
# compare by value, not identity: "is not ''" is unreliable for strings
my_list = [string.strip() for string in full_string.split('=') if string.strip()]
print(my_list)
['one two three four', 'five six seven eight', 'nine ten one two']
If, instead, you want sub-lists comprising each string between the equals signs, just replace my_list above with:
my_list = [string.split() for string in full_string.split('=') if string.strip()]
[['one', 'two', 'three', 'four'], ['five', 'six', 'seven', 'eight'], ['nine', 'ten', 'one', 'two']]
Bonus: both versions use list comprehensions, which are a more Pythonic way of looping.
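If the lines are already in a list, as in the question, the same grouping can be done without building one big string first. This is only a sketch using itertools.groupby; the helper name split_on_separator is made up for illustration and is not from the question or the answer above.

```python
from itertools import groupby

def split_on_separator(items, sep="="):
    """Group consecutive non-separator items; separator items are dropped."""
    return [
        list(group)
        for is_sep, group in groupby(items, key=lambda item: item == sep)
        if not is_sep
    ]

lines = ['one', 'two', 'three', 'four', '=', 'five', 'six', 'seven', 'eight',
         '=', 'nine', 'ten', 'one', 'two', '=']

groups = split_on_separator(lines)                 # list of sub-lists
joined = [' '.join(group) for group in groups]     # one string per block
print(joined)
# ['one two three four', 'five six seven eight', 'nine ten one two']
```

Because groupby only looks at neighbouring items, leading, trailing, or doubled separators simply produce no group.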

Here's a small IDLE example:
>>> stuff = ['a', 'b', 'c', '=', 'd', 'e', '=', 'f', 'g']
>>> "".join(stuff).split('=')
['abc', 'de', 'fg']
It joins all of the characters together (So you can skip separating them out into separate lists), and then splits that string on the = character.

Read lines until you hit a =, merge them into one list entry and append it, continue until done, then append whatever is left in the temporary list:
t = """one
two
three
four
=
five
six
seven
eight
=
nine
ten
one
two
="""
data = []  # global list
line = []  # temp list
for n in [x.strip() for x in t.splitlines()]:
    if n == "=":
        if line:
            data.append(' '.join(line))
            line = []
    else:
        line.append(n)
if line:
    data.append(' '.join(line))
print(data)
Output:
['one two three four', 'five six seven eight', 'nine ten one two']
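Since the question mentions a large file, the same loop can be written as a generator that yields one block at a time instead of holding every line in memory. This is a rough sketch; iter_blocks is a hypothetical name, and io.StringIO is used here only as a stand-in for an open file.

```python
import io

def iter_blocks(lines, sep="="):
    """Yield one space-joined string per block of lines between separators."""
    block = []
    for raw in lines:
        line = raw.strip()
        if line == sep:
            if block:              # skip empty blocks (e.g. doubled separators)
                yield ' '.join(block)
                block = []
        else:
            block.append(line)
    if block:                      # trailing block with no closing separator
        yield ' '.join(block)

# io.StringIO stands in for "with open(dump) as f:" in this sketch
sample = io.StringIO("one\ntwo\n=\nthree\nfour\n=\nfive\n")
blocks = list(iter_blocks(sample))
print(blocks)
# ['one two', 'three four', 'five']
```

With a real file you would pass the file object itself, so lines are read lazily as the generator is consumed.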

Related

Identify a sequence of numbers written as words

I have lists of words in python. In the list elements I have numbers written as words. For example:
list = ['man', 'ball', 'apple', 'thirty-one', 'five', 'seven', 'twelve', 'queen']
I have also the dictionary with every number written as word as the key and the corresponding digit as value. For example:
n_dict = {'zero':0, 'one':1, 'two':2, ...., 'hundred':100}
What I need to do is identify runs of, let's say, 4 or more numbers written as words consecutively in the list and convert them to digits based on the dictionary. For example, the list should become:
list = ['man', 'ball', 'apple', '31', '5', '7', '12', 'queen']
However, if there are less consecutive elements than the number specified (in our case 4) the list shall be the same. For example:
list2 = ['bike', 'earth', 't-shirt', 'twenty-five', 'zero', 'seven', 'home', 'bottle']
list2 Shall remain as it is.
In addition, if there are multiple sequences with numbers written as words but they are not reaching the minimum amount of consecutive words required the words should not change to digits. For example:
list3 = ['stairs', 'tree', 'street', 'forty-two', 'nine', 'submarine', 'two', 'eighty-five']
list3 Shall remain as it is.
The sequence of numbers written as words can be anywhere at the list. At the beginning, at the end, somewhere in the middle.
What I have tried so far:
def check_if_consecutive(l):
    return sorted(l) == list(range(min(l), max(l)+1))
def replace_numbers(word_list, num_dict):
    flag = False
    intersect = list(set(word_list) & set(num_dict.keys()))
    intersect_index = [word_list.index(elem) for elem in intersect]
    flag = check_if_consecutive(intersect_index)
    if (len(intersect_index) > 4) and flag:
        flag = True
        for index in intersect_index:
            word_list[index] = num_dict[word_list[index]]
    return word_list, flag
I need to return the flag as well to keep track which of the lists changed.
The above code works fine but I think it's not that efficient. My question is whether it can be implemented in a better way, e.g. using operator.itemgetter or something similar.
For digits
from itertools import filterfalse
list_of_strings_that_are_secretly_integers = [*filterfalse(lambda x: isinstance(x, bool), (n_dict.get(i, False) for i in list_of_strings))]
For consecutivity, the following should work for any indexed candidate
def continuous(candidate, differential=1):
    # e is candidate[i + 1], so compare it with candidate[i]
    return all(e == candidate[i] + differential for i, e in enumerate(candidate[1:]))
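Another way to look at the problem is as grouping neighbouring words by whether they are number words, then converting only the groups that are long enough. The following is a sketch of that idea, not the asker's or answerer's code; replace_number_runs and min_run are made-up names, and the small n_dict below is a stand-in for the full dictionary that the question elides.

```python
from itertools import groupby

def replace_number_runs(words, num_dict, min_run=4):
    """Convert runs of at least min_run consecutive number words to digit strings.

    Returns (new_list, changed) so callers can track which lists were altered.
    """
    result = []
    changed = False
    for is_number, run in groupby(words, key=lambda w: w in num_dict):
        run = list(run)
        if is_number and len(run) >= min_run:
            result.extend(str(num_dict[w]) for w in run)
            changed = True
        else:
            result.extend(run)     # too short, or not number words: keep as-is
    return result, changed

# Stand-in dictionary for illustration only; the real one covers every number word
n_dict = {'zero': 0, 'five': 5, 'seven': 7, 'twelve': 12,
          'twenty-five': 25, 'thirty-one': 31}

words1 = ['man', 'ball', 'apple', 'thirty-one', 'five', 'seven', 'twelve', 'queen']
words2 = ['bike', 'earth', 't-shirt', 'twenty-five', 'zero', 'seven', 'home', 'bottle']
print(replace_number_runs(words1, n_dict))
# (['man', 'ball', 'apple', '31', '5', '7', '12', 'queen'], True)
print(replace_number_runs(words2, n_dict))
# (['bike', 'earth', 't-shirt', 'twenty-five', 'zero', 'seven', 'home', 'bottle'], False)
```

This makes a single pass over the list and also handles repeated words, which list.index would resolve to the first occurrence only.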

Shift elements of a list forward (rotating a list)

My issue is as follows: I want to create a program which accepts strings separated from each other by one space. Then the program should prompt for a number, which is the number of positions by which the words are shifted forward. I also want to use lists for the words as well as for the output, for practice.
Input: one two three four five six seven 3
Output: ['four', 'five', 'six', 'seven', 'one', 'two', 'three']
This is what I've come up with. For the input I've used the same input as above. However, when I increase the prompt number by N, the number of strings appended to the list drops by N. The reverse happens when I decrease the prompt number by N (the number of appended strings increases by N). What could be the issue here?
l_words = list(input().split())
shift = int(input()) + 1  # shifting strings' number
l = [l_words[shift - 1]]
for k in range(len(l_words)):
    if (shift+k) < len(l_words):
        l.append(l_words[shift+k])
    else:
        if (k-shift)>=0:
            l.append(l_words[k-shift])
print(l)
You can use slicing by just joining the later sliced part to the initial sliced part given the rotation number.
inp = input().split()
shift_by = int(inp[-1])
li = inp[:-1]
print(li[shift_by:] + li[:shift_by]) # ['four', 'five', 'six', 'seven', 'one', 'two', 'three']
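The slicing answer above assumes the shift is no larger than the list. A small sketch of two variations follows: taking the shift modulo the length so larger shifts wrap around, and collections.deque.rotate from the standard library. The function name rotate_left is made up for this example.

```python
from collections import deque

def rotate_left(items, shift):
    """Return a new list with the first `shift` items moved to the end."""
    if not items:
        return []
    shift %= len(items)            # lets shifts larger than the list wrap around
    return items[shift:] + items[:shift]

words = "one two three four five six seven".split()
print(rotate_left(words, 3))
# ['four', 'five', 'six', 'seven', 'one', 'two', 'three']
print(rotate_left(words, 10))      # 10 % 7 == 3, so same result as above
# ['four', 'five', 'six', 'seven', 'one', 'two', 'three']

# deque.rotate(n) rotates to the right; a negative n rotates to the left
d = deque(words)
d.rotate(-3)
print(list(d))
# ['four', 'five', 'six', 'seven', 'one', 'two', 'three']
```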

Separating string into two Python arrays

I'm trying to split a string of words into two lists of words using the query below. The string up until 'a' should go into begin, and the rest into remainder. But the while loop somehow keeps running regardless of the fact that begin already contains 'a'. Thanks a lot for your help!
random_string = 'hello this is a test string'
split = {}
split = random_string.split()
begin = []
remainder = []
while 'a' not in begin:
    for word in split:
        storage = word
        begin.append(storage)
print(begin)
The problem is that the while loop's condition is only checked after the for loop has completed. Essentially, this is what happens:
'a' is not in begin
Loop through split and add every word to begin
Check if 'a' is in begin
You could try something like:
for word in split:
    if 'a' in begin:
        remainder.append(word)
    else:
        begin.append(word)
where the 'a' condition is checked on every iteration of the loop, or follow the slicing techniques listed in the other answers.
Try using slices and index, there is no need to run a loop for catching the 'a':
random_string = 'hello this is a test string'
split = random_string.split(' ')
index = split.index('a') + 1
array_1 = split[:index]
array_2 = split[index:]
print(array_1, array_2)
Look at the list's .index method; there is no need for a loop.
random_string = 'hello this is a test string'
split = random_string.split()
begin = []
remainder = []
index = split.index('a')
begin = split[:index]
remainder = split[index:]
print(begin)
print(remainder)
Code snippet above will print:
['hello', 'this', 'is']
['a', 'test', 'string']
Just slice the list you get by splitting your string:
To take the sentence up to and including a specific word:
>>> words = "one two three four"
>>> words.split()[:words.split().index("three")+1]
['one', 'two', 'three']
>>> words.split()[words.split().index("three")+1:]
['four']
To take half the sentence:
(Your post seemed ambiguous to me about what you wanted.)
>>> words = "one two three four"
>>> words.split()[:len(words.split())//2]
['one', 'two']
>>> words.split()[len(words.split())//2:]
['three', 'four']
Try this, just two lines: one to get the words containing 'a', another to get the words without 'a'.
random_string = 'hello this is a test string, tap task'
begin = [x for x in random_string.split(' ') if 'a' in x]
remainder = [x for x in random_string.split(' ') if 'a' not in x]
print(begin)
print(remainder)
This will print:
['a', 'tap', 'task']
['hello', 'this', 'is', 'test', 'string,']
Use built-in string routines:
>>> text = 'hello this is a test string'
>>> begin, end = text.split('a')
>>> [begin, end]
['hello this is ', ' test string']
>>> begin_words = begin.split()
>>> begin_words
['hello', 'this', 'is']
>>> end_words = end.split()
>>> end_words
['test', 'string']
The default of split is to split on whitespace, but as you can see, it works with other strings as well.
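The index-and-slice answers above raise ValueError when the marker word is missing. A minimal sketch of a helper that handles that case is shown below; split_after is a hypothetical name, and it assumes (as the question's loop condition suggests) that the marker itself belongs in the first list.

```python
def split_after(words, marker):
    """Split a list after the first occurrence of marker.

    The marker ends the first part. If it is absent, everything goes
    into the first part and the second part is empty.
    """
    try:
        cut = words.index(marker) + 1
    except ValueError:             # marker not present in the list
        return list(words), []
    return words[:cut], words[cut:]

begin, remainder = split_after('hello this is a test string'.split(), 'a')
print(begin)
# ['hello', 'this', 'is', 'a']
print(remainder)
# ['test', 'string']
```

Working on whole words also avoids the pitfall of str.split('a'), which would split inside any word containing the letter a.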

Extract words surrounding a search word

I have this script that does a word search in text. The search goes pretty good and results work as expected. What I'm trying to achieve is extract n words close to the match. For example:
The world is a small place, we should try to take care of it.
Suppose I'm looking for place and I need to extract the 3 words on the right and the 3 words on the left. In this case they would be:
left -> [is, a, small]
right -> [we, should, try]
What is the best approach to do this?
Thanks!
import re
def search(text,n):
    '''Searches text for 'place' and retrieves n words either side of it, which are returned separately'''
    word = r"\W*([\w]+)"
    groups = re.search(r'{}\W*{}{}'.format(word*n,'place',word*n), text).groups()
    return groups[:n],groups[n:]
This allows you to specify how many words either side you want to capture. It works by constructing the regular expression dynamically. With
t = "The world is a small place, we should try to take care of it."
search(t,3)
(('is', 'a', 'small'), ('we', 'should', 'try'))
While regex would work, I think it's overkill for this problem. You're better off with a generator expression and list slicing:
sentence = 'The world is a small place, we should try to take care of it.'.split()
# strip trailing punctuation so that 'place,' still matches "place"
indices = (i for i,word in enumerate(sentence) if word.strip('.,')=="place")
neighbors = []
for ind in indices:
    neighbors.append(sentence[ind-3:ind]+sentence[ind+1:ind+4])
Note that if the word that you're looking for appears multiple times consecutively in the sentence, then this algorithm will include the consecutive occurrences as neighbors.
For example:
In [29]: neighbors = []
In [30]: sentence = 'The world is a small place place place, we should try to take care of it.'.split()
In [31]: sentence
Out[31]:
['The',
'world',
'is',
'a',
'small',
'place',
'place',
'place,',
'we',
'should',
'try',
'to',
'take',
'care',
'of',
'it.']
In [32]: indices = [i for i,word in enumerate(sentence) if word == 'place']
In [33]: for ind in indices:
   ....:     neighbors.append(sentence[ind-3:ind]+sentence[ind+1:ind+4])
In [34]: neighbors
Out[34]:
[['is', 'a', 'small', 'place', 'place,', 'we'],
['a', 'small', 'place', 'place,', 'we', 'should']]
import re
s = 'The world is a small place, we should try to take care of it.'
m = re.search(r'((?:\w+\W+){,3})(place)\W+((?:\w+\W+){,3})', s)
if m:
    l = [x.strip().split() for x in m.groups()]
    left, right = l[0], l[2]
    print(left, right)
Output
['is', 'a', 'small'] ['we', 'should', 'try']
If you search for The, it yields:
[] ['world', 'is', 'a']
This handles the scenario where the search keyword appears multiple times. For example, in the input text below the search keyword place appears 3 times:
The world is a small place, we should try to take care of this small place by planting trees in every place wherever is possible
Here is the function
import re
def extract_surround_words(text, keyword, n):
    '''
    text : input text
    keyword : the search keyword we are looking for
    n : number of words around the keyword
    '''
    # extract all the words from text
    words = re.findall(r'\w+', text)
    # iterate through all the words
    for index, word in enumerate(words):
        # check if the search keyword matches
        if word == keyword:
            # fetch left side words
            left_side_words = words[index-n : index]
            # fetch right side words
            right_side_words = words[index+1 : index + n + 1]
            print(left_side_words, right_side_words)
Calling the function
text = 'The world is a small place, we should try to take care of this small place by planting trees in every place wherever is possible'
keyword = "place"
n = 3
extract_surround_words(text, keyword, n)
output :
['is', 'a', 'small'] ['we', 'should', 'try']
['of', 'this', 'small'] ['by', 'planting', 'trees']
['trees', 'in', 'every'] ['wherever', 'is', 'possible']
Find all of the words:
import re
sentence = 'The world is a small place, we should try to take care of it.'
words = re.findall(r'\w+', sentence)
Get the index of the word that you're looking for:
index = words.index('place')
And then use slicing to find the other ones:
left = words[index - 3:index]
right = words[index + 1:index + 4]
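One detail the slicing answers above leave open: when the match is closer than n words to the start, index - n goes negative and the slice no longer means "the n words before". The sketch below clamps the left bound at 0 and collects every occurrence; surrounding_words is a hypothetical name, not from any answer above.

```python
import re

def surrounding_words(text, keyword, n):
    """Return a list of (left, right) word lists for each occurrence of keyword."""
    words = re.findall(r'\w+', text)
    results = []
    for index, word in enumerate(words):
        if word == keyword:
            # clamp at 0 so a negative start does not wrap around to the end
            left = words[max(0, index - n):index]
            right = words[index + 1:index + 1 + n]
            results.append((left, right))
    return results

sentence = 'The world is a small place, we should try to take care of it.'
print(surrounding_words(sentence, 'place', 3))
# [(['is', 'a', 'small'], ['we', 'should', 'try'])]
print(surrounding_words(sentence, 'world', 3))
# [(['The'], ['is', 'a', 'small'])]
```

The right-hand slice needs no clamping, since a slice that runs past the end of a list is simply truncated.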

Python split '123' into '1', '2', '3'

My program takes a user input such as:
>>> x = input()
>>> 1
>>> print x
>>> one
my actual code:
>>> import string
>>> numbers = ['0','1','2','3','4','5','6','7','8','9']
>>> wordNumbers = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']
>>> myDict = dict(zip(numbers, wordNumbers))
>>> myVar = input("Enter a number to be translated: ")
>>> for translate in myVar.split():
...     print(myDict[translate])
The problem is I need the user to input 123 and for my program to output one two three, but it doesn't for some reason.
I'm thinking that if I add spaces with some syntax between 123 like 1 2 3 it would work.
You simply need to use:
for translate in myVar:
Instead of:
for translate in myVar.split():
Iterating over a string gives you its characters one by one, which is what you need.
If you do want to convert '123' to '1 2 3' (which isn't needed here because you don't need to use split), you can use:
' '.join(myVar)
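Putting the answer together, a short sketch of the whole translation might look like this. The names DIGIT_WORDS and digits_to_words are invented for the example, and silently skipping non-digit characters is an assumption about the desired behaviour, not something the question specifies.

```python
DIGIT_WORDS = dict(zip(
    '0123456789',
    ['zero', 'one', 'two', 'three', 'four',
     'five', 'six', 'seven', 'eight', 'nine'],
))

def digits_to_words(text):
    """Translate each digit character to its word; other characters are skipped."""
    # iterating over a string yields its characters one by one
    return ' '.join(DIGIT_WORDS[ch] for ch in text if ch in DIGIT_WORDS)

print(digits_to_words('123'))
# one two three
```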
