Split string into pair - python

What would be the best/easiest way to split a string into pairs of consecutive words?
Ex:
string = "This is a string"
Output:
["This is", "is a", "a string"]

>>> import itertools
>>> a, b = itertools.tee('this is a string'.split())
>>> next(b, None)
'this'
>>> [' '.join(words) for words in zip(a, b)]
['this is', 'is a', 'a string']

string_list = string.split()
result = [f'{string_list[i]} {string_list[i+1]}' for i in range(len(string_list) - 1)]

words = string.split()
output = []
i = 1
while i < len(words):
    cur_word = words[i]
    prev_word = words[i - 1]
    output.append(f"{prev_word} {cur_word}")
    i += 1

You can use the zip function, passing the whole list of words as its first argument and the list of words without the first word as its second. Since zip aggregates elements from each of the iterables, this pairs each word with the next one in the list:
string = "This is a string"
zipped_lst = zip(string.split(), string.split()[1:])
print(list(zipped_lst))
This outputs
[('This', 'is'), ('is', 'a'), ('a', 'string')]
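To get the joined strings the question asks for rather than tuples, one option is to join each pair. A minimal sketch; the helper name word_pairs is only for illustration:

```python
def word_pairs(text):
    """Return each pair of adjacent words joined by a single space."""
    words = text.split()
    # zip stops at the shorter input, so the last word never starts a pair
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

print(word_pairs("This is a string"))
# ['This is', 'is a', 'a string']
```

A string with fewer than two words simply gives an empty list.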

Strings searches in a list, if the list contains a string with a space

I'd like to identify a word in a list, however one of the strings has a space in-between and is not recognized. My code:
res = [word for word in somestring if word not in myList]
myList = ["first", "second", "the third"]
So when
somestring = "test the third"
is parsed then res="test the third" (should be "test").
How can I handle string searches in a list when the list contains a string with a space?
One way is to use split().
myList = ["first", "second", "the third"]
somestring = "test the third"
n=[x.split() for x in myList]
#[['first'], ['second'], ['the', 'third']]
You can flatten this by:
m=[item for sublist in n for item in sublist]
#['first', 'second', 'the', 'third']
Similarly, you can split() somestring
s=somestring.split()
#['test', 'the', 'third']
Finally:
for x in s:
    if x not in m:
        print(x)
#test
You can also get the result in one line; but it is not very readable:
[x for x in somestring.split() if x not in (item for sublist in (x.split() for x in myList) for item in sublist)]
#['test']
Use myList as a list of patterns for a regex substitution:
import re
myList = ["first", "second", "the third"]
somestring = "test the third"
res = re.sub(fr'({"|".join(myList)})', '', somestring).strip()
print(res)
test
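Note that the pattern above matches anywhere in the string, so "first" would also be cut out of "firsthand", and any regex metacharacters in the list entries would be treated as regex syntax. A sketch of a stricter variant, assuming you only want whole-word matches and that every entry starts and ends with a word character (remove_phrases is a made-up helper name):

```python
import re

def remove_phrases(text, phrases):
    """Remove whole-word occurrences of any phrase from text."""
    # Try longer phrases first so a multi-word entry beats a shorter one
    # that happens to be its prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" literal.
    # \b assumes each phrase starts and ends with a word character.
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    cleaned = re.sub(pattern, "", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(cleaned.split())

myList = ["first", "second", "the third"]
print(remove_phrases("test the third", myList))         # test
print(remove_phrases("firsthand test second", myList))  # firsthand test
```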

How to construct a string from letters of each word from list?

I am wondering how to construct a string, which takes 1st letter of each word from list. Then it takes 2nd letter from each word etc.
For example :
Input --> my_list = ['good', 'bad', 'father']
Every word has different length (but the words in the list could have equal length)
The output should be: 'gbfoaaodtdher'.
I tried:
def letters(my_list):
    string = ''
    for i in range(len(my_list)):
        for j in range(len(my_list)):
            string += my_list[j][i]
    return string
print(letters(['good', 'bad', 'father']))
and I got:
'gbfoaaodt'.
That's a good job for itertools.zip_longest:
from itertools import zip_longest
s = ''.join([c for x in zip_longest(*my_list) for c in x if c])
print(s)
Or more_itertools.interleave_longest:
from more_itertools import interleave_longest
s = ''.join(interleave_longest(*my_list))
print(s)
Output: gbfoaaodtdher
Used input:
my_list = ['good', 'bad', 'father']
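A small variation, in case the filtering step looks opaque: zip_longest pads the shorter words with None by default, which is why the comprehension above needs `if c`. Passing fillvalue='' pads with empty strings instead, so each tuple can be joined directly. Just a sketch of the same idea:

```python
from itertools import zip_longest

my_list = ['good', 'bad', 'father']

# Each tuple holds the i-th letter of every word, padded with '' where a word ran out.
columns = list(zip_longest(*my_list, fillvalue=''))
print(columns[3])   # ('d', '', 'h')

s = ''.join(''.join(column) for column in columns)
print(s)            # gbfoaaodtdher
```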
The answer by @mozway is the best approach, but if you want to stick with your original method, this is how:
def letters(my_list):
    string = ''
    max_len = max([len(s) for s in my_list])
    for i in range(max_len):
        for j in range(len(my_list)):
            if i < len(my_list[j]):
                string += my_list[j][i]
    return string
print(letters(['good', 'bad', 'father']))
Output: gbfoaaodtdher
We can do without zip_longest as well:
l = ['good', 'bad', 'father']
longest_string=max(l,key=len)
''.join(''.join([e[i] for e in l if len(e) > i]) for i in range(len(longest_string)))
#'gbfoaaodtdher'

Unexpected behavior with string.split()

Say I have a string, string = 'a'
I do string.split() and I get ['a']
I don't want this, I only want a list when I have whitespace in my string, ala string = 'a b c d'
So far, I've tried all the following with no luck:
>>> a = 'a'
>>> a.split()
['a']
>>> a = 'a b'
>>> a.split(' ')
['a', 'b']
>>> a = 'a'
>>> a.split(' ')
['a']
>>> import re
>>> re.findall(r'\S+', a)
['a']
>>> re.findall(r'\S', a)
['a']
>>> a = 'a b'
>>> re.findall(r'\S+', a)
['a', 'b']
>>> re.split(r'\s+', a)
['a', 'b']
>>> a = 'a'
>>> re.split(r'\s+', a)
['a']
>>> a.split(" ")
['a']
>>> a = "a"
>>> a.split(" ")
['a']
>>> a.strip().split(" ")
['a']
>>> a = "a".strip()
>>> a.split(" ")
['a']
Am I crazy? I see no whitespace in the string "a".
>>> r"[^\S\n\t]+"
'[^\\S\\n\\t]+'
>>> print(re.findall(r'[^\S\n\t]+',a))
[]
What up?
EDIT
FWIW, this is how I got what I needed:
# test for linked array
if typename == 'org.apache.ctakes.typesystem.type.textsem.ProcedureMention':
    for f in AnnotationType.all_features:
        if 'Array' in f.rangeTypeName:
            if attributes.get(f.name) and typesystem.get_type(f.elementType):
                print([ int(i) for i in attributes[f.name].split() ])
and that is the end...
Split will always return a list. Try this:
def split_it(s):
    if len(s.split()) > 1:
        return s.split()
    else:
        return s
The behavior of split makes sense: it always returns a list. Why not just check whether the list length is 1?
def weird_split(a):
    words = a.split()
    if len(words) == 1:
        return words[0]
    return words
You could use the conditional expression to check for the presence of space, and use split only if a space is detected:
str1 = 'abc'
split_str1 = str1 if (' ' not in str1) else str1.split(' ')
print (split_str1)
str1 = 'ab c'
split_str1 = str1 if (' ' not in str1) else str1.split(' ')
print (split_str1)
This would give the output:
abc
['ab', 'c']
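For what it's worth, the asker's own edit suggests the special case may not be needed: if the result is consumed in a loop or comprehension, a one-element list behaves just like a longer one. A sketch along those lines (parse_ints is a hypothetical helper, not part of the original code):

```python
def parse_ints(text):
    """Convert whitespace-separated integers into a list of ints."""
    # split() with no argument splits on any run of whitespace and
    # returns [] for an empty or all-whitespace string.
    return [int(token) for token in text.split()]

print(parse_ints("7"))        # [7]
print(parse_ints("1 2  3"))   # [1, 2, 3]
print(parse_ints(""))         # []
```

Because the return type is always a list, the caller does not have to branch on whether it got a string or a list back.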

Given an input as 'sentence', how to return the element that appears the most

This first function counts the string's characters
def character_count(sentence):
    characters = {}
    for char in sentence:
        if char in characters:
            characters[char] = characters[char] + 1
        else:
            characters[char] = 1
    return characters
This second function uses the counts from the helper above (characters[char]) to find the character that appears most often:
def most_common_character(sentence):
    chars = character_count(sentence)
    most_common = ""
    max_times = 0
    for curr_char in chars:
        if chars[curr_char] > max_times:
            most_common = curr_char
            max_times = chars[curr_char]
    return most_common
Why not simply use what Python provides?
>>> from collections import Counter
>>> sentence = "This is such a beautiful day, isn't it"
>>> c = Counter(sentence).most_common(3)
>>> c
[(' ', 7), ('i', 5), ('s', 4)]
And if you want to ignore spaces when counting:
>>> from collections import Counter
>>> sentence = "This is such a beautiful day, isn't it"
>>> res = Counter(sentence.replace(' ', ''))
>>> res.most_common(1)
[('i', 5)]
You actually don't have to change anything! Your code will work with a list as is (the variable names just become misleading). Try it:
most_common_character(['this', 'is', 'a', 'a', 'list'])
Output:
'a'
This will work for lists with any kind of elements that are hashable (numbers, strings, characters, etc)
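If you go the Counter route for a list of elements, a sketch could look like this. The function name most_common_element and the choice to return None for empty input are assumptions made for illustration:

```python
from collections import Counter

def most_common_element(items):
    """Return the most frequent element of an iterable of hashable items."""
    counts = Counter(items)
    if not counts:
        return None  # assumption: empty input has no most common element
    # most_common(1) returns a one-item list of (element, count) tuples
    element, _count = counts.most_common(1)[0]
    return element

print(most_common_element(['this', 'is', 'a', 'a', 'list']))  # a
print(most_common_element("hello"))                           # l
```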

Matching String in a List of Strings

I basically want to create a new list 'T' holding the strings from the list 'Z' that contain the elements of the list 'Word' as separate words.
i.e., I want the output of 'T' in the following case to be T = ['Hi x']
Word = ['x']
Z = ['Hi xo xo','Hi x','yoyo','yox']
I tried the following code, but it gives me all sentences with words containing 'x'; I only want the sentences that have 'x' as a separate word.
for i in Z:
    for v in i:
        if v in Word:
            print (i)
Just another pythonic way
[phrase for phrase in Z for w in Word if w in phrase.split()]
['Hi x']
You can do it with list comprehension.
>>> [i for i in Z if any(w.lower() == j.lower() for j in i.split() for w in Word)]
['Hi x']
Edit:
Or you can do:
>>> [i for i in Z for w in Word if w.lower() in map(lambda x:x.lower(),i.split())]
['Hi x']
If you want to print all strings from Z that contain a word from Word:
Word = ['xo']
Z = ['Hi xo xo','Hi x','yoyo','yox']
res = []
for i in Z:
    for v in i.split():
        if v in Word:
            res.append(i)
            break
print(res)
Notice the break. Without it, a string from Z could be added twice if two of its words match, like the xo in the example.
The i.split() expression splits i into words on whitespace.
words = ['x']
phrases = ['Hi xo xo','Hi x','yoyo','yox']
for phrase in phrases:
    for word in words:
        if word in phrase.split():
            print(phrase)
If you store Word as a set instead of a list, you can use set operations for the check. The following splits every string on whitespace, builds a set from the words, and checks whether Word is a subset of it.
>>> Z = ['Hi xo xo','Hi x','yoyo','yox']
>>> Word = {'x'}
>>> [s for s in Z if Word <= set(s.split())]
['Hi x']
>>> Word = {'Hi', 'x'}
>>> [s for s in Z if Word <= set(s.split())]
['Hi x']
Above, <= is the same as set.issubset.
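Note that the subset test requires every element of Word to be present, while the loop-based answers match as soon as any one of them is. With sets, the "any" variant can be written with isdisjoint. A sketch, with made-up function names:

```python
def phrases_with_all(phrases, words):
    """Keep phrases that contain every word as a separate token."""
    wanted = set(words)
    return [p for p in phrases if wanted <= set(p.split())]

def phrases_with_any(phrases, words):
    """Keep phrases that contain at least one of the words as a separate token."""
    wanted = set(words)
    # isdisjoint accepts any iterable, so the split list can be passed directly
    return [p for p in phrases if not wanted.isdisjoint(p.split())]

Z = ['Hi xo xo', 'Hi x', 'yoyo', 'yox']
print(phrases_with_all(Z, ['Hi', 'x']))   # ['Hi x']
print(phrases_with_any(Z, ['Hi', 'x']))   # ['Hi xo xo', 'Hi x']
```

With a single-element Word such as ['x'], both functions give the same result.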
