Separating a string of words into characters following certain conditions

I have a string of words and I want to separate them into individual characters. However, if a group of characters is part of what I've called "special consonant pairs", they need to remain together.
These are some of my "special consonant pairs":
consonant_pairs = ["ng", "ld", "dr", "bl", "nd", "th" ...]
This is one of the sample strings I want to separate into characters:
sentence_1 = "We were drinking beer outside and we could hear the wind blowing"
And this would be my desired output (I have already deleted spaces and punctuation):
sentence_1_char = ['w', 'e', 'w', 'e', 'r', 'e', 'dr', 'i', 'n', 'k', 'i', 'ng', 'b', 'e', 'e', 'r', 'o', 'u', 't', 's', 'i', 'd', 'e', 'a', 'n', 'd', 'w', 'e', 'c', 'o', 'u', 'ld', 'h', 'e', 'a', 'r', 'th', 'e', 'w', 'i', 'nd', 'bl', 'o', 'w', 'i', 'ng']
I thought of using list(), though I don't know how to go about the consonant pairs. Could anyone help me?

A quick (not necessarily performant) answer:
import re
charred = re.split('(' + '|'.join(consonant_pairs) + ')', sentence)
EDIT: To get the expected output in OP (this assumes sentence has already been lowercased and stripped of spaces and punctuation):
import re
matches = re.finditer('(' + '|'.join(consonant_pairs) + '|.)', sentence)
charred = [x.group() for x in matches]
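For completeness, here is a minimal end-to-end sketch of the same regex idea, including the cleanup step. The function name split_keeping_pairs is made up for illustration, and the pair list is only the shortened one shown in the question. Working word by word is an assumption on my part: it stops a pair from being formed across a word boundary once the spaces are gone.

```python
import re

# Shortened list from the question; the real list is longer.
consonant_pairs = ["ng", "ld", "dr", "bl", "nd", "th"]
sentence_1 = "We were drinking beer outside and we could hear the wind blowing"

def split_keeping_pairs(text, pairs):
    """Split text into lowercase letters, keeping the given pairs together."""
    # Try longer alternatives first and escape them in case they contain regex syntax.
    alternation = "|".join(re.escape(p) for p in sorted(pairs, key=len, reverse=True))
    token = re.compile(f"{alternation}|[a-z]")
    chars = []
    # Process one word at a time so pairs never span two words.
    for word in re.findall(r"[a-z]+", text.lower()):
        chars.extend(token.findall(word))
    return chars

sentence_1_char = split_keeping_pairs(sentence_1, consonant_pairs)
print(sentence_1_char)
```

Note that this treats every occurrence of a pair the same way, so "and" comes out as 'a', 'nd', whereas the desired output in the question lists it as three separate letters.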

Related

Python: How to replace string elements using an array of tuples?

I'm trying to replace the characters of the reversed alphabet with those of the alphabet. This is what I've got:
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
rev_alphabet = alphabet[::-1]
sample = "wrw blf hvv ozhg mrtsg'h vkrhlwv?"
def f(alph, rev_alph):
    return (alph, rev_alph)
char_list_of_tups = list(map(f, alphabet, rev_alphabet))
for alph, rev_alph in char_list_of_tups:
    sample = sample.replace(rev_alph, alph)
print(sample)
expected output: did you see last night's episode?
actual output: wrw you svv ozst nrtst's vprsowv?
I understand that I'm printing the last "replacement" of the whole iteration. How can I avoid this without appending it to a list and then running into problems with the spacing of the words?

Your problem here is that you lose data as you perform each replacement; for a simple example, consider an input of "az". On the first replacement pass, you replace 'z' with 'a', and now have "aa". When you get to replacing 'a' with 'z', it becomes "zz", because you can't tell the difference between an already replaced character and one that's still unchanged.
For single character replacements, you want to use the str.translate method (and the not strictly required, but useful helper function, str.maketrans), to do character by character transliteration across the string in a single pass.
from string import ascii_lowercase # No need to define the alphabet; Python provides it
# You can use the original str form, no list needed
# Do this once up front, and reuse it for as many translate calls as you like
trans_map = str.maketrans(ascii_lowercase[::-1], ascii_lowercase)
sample = sample.translate(trans_map)
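To make the data-loss problem concrete, here is a small sketch that runs the "az" example through both approaches. The helper name sequential_replace is made up for illustration; it mirrors the replace loop from the question.

```python
from string import ascii_lowercase

def sequential_replace(text):
    """Repeated str.replace, as in the question: earlier replacements get overwritten."""
    for plain, mirrored in zip(ascii_lowercase, reversed(ascii_lowercase)):
        text = text.replace(mirrored, plain)
    return text

# Single-pass mapping: every character is looked up exactly once.
trans_map = str.maketrans(ascii_lowercase[::-1], ascii_lowercase)

print(sequential_replace("az"))   # zz  (both letters end up the same)
print("az".translate(trans_map))  # za

decoded = "wrw blf hvv ozhg mrtsg'h vkrhlwv?".translate(trans_map)
print(decoded)  # did you see last night's episode?
```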

alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
# or
alphabet = [chr(97 + i) for i in range(0,26)]
sample = "wrw blf hvv ozhg mrtsg'h vkrhlwv?"
res = []
for ch in sample:
    if ch in alphabet:
        res.append(alphabet[-1 - alphabet.index(ch)])
    else:
        res.append(ch)
print("".join(res))

Another way, if you are OK with creating a new string instead:
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
dictRev = dict(zip(alphabet, alphabet[::-1]))
sample = "wrw blf hvv ozhg mrtsg'h vkrhlwv?"
s1="".join([dictRev.get(char, char) for char in sample])
print(s1)
"did you see last night's episode?"

I want to find the numbers of words in a file which contains numbers and special symbols also

I have been given a simple text file with no special formatting, just a plain ASCII file, which has special symbols, numbers and words. I need to find the number of words, letters and special symbols.
Here is my code.
import re
with open("C:/Users/Nikhil/Downloads/stack.txt", "r") as data:
    data = data.read()
words = re.findall(r'\w+', data)
letters = re.findall(r'[a-zA-Z]', data)
pattern = '[~`##$%^&*(_)+{}|/\.,<>?-]'
spl_symbols = re.findall(pattern, data)
print(len(words))
print(len(letters))
print(len(spl_symbols))
I have used regex to get the output, but the problem is that I am unable to get the correct count of words, because \w+ matches numbers too. I would like the regex for the words variable to exclude the digits 0-9 and underscores. I would also like it to include apostrophes.

You appear to be having trouble with words with apostrophes and numbers.
Solution
import re
data = "hey, how are you? I'm fine but can't walk. And you? No, I didn't see Romeo Mr. Garrick"
word_pattern = r"[a-zA-Z]+\'?[a-zA-Z]*" # changed to allow only
# letters and a single apostrophe
words = re.findall(word_pattern, data)
letters = re.findall(r'[a-zA-Z]', data) # unchanged from OP code
pattern = '[~`##$%^&*(_)+{}|/\.,<>?-]' # unchanged from OP code
spl_symbols = re.findall(pattern, data)
print(words)
['hey', 'how', 'are', 'you', "I'm", 'fine', 'but', "can't", 'walk', 'And', 'you', 'No', 'I', "didn't", 'see', 'Romeo', 'Mr', 'Garrick']
Explanation
word_pattern keeps I'm, can't, didn't as single words
This is seen by looking at the parts of r"[a-zA-Z]+\'?[a-zA-Z]*"
[a-zA-Z]+ to match one or more letters (comes first in pattern)
\'? to match 0 or 1 apostrophe, \ to escape quote (comes second in pattern)
[a-zA-Z]* to match 0 or more letters (last part of pattern)
By requiring [a-zA-Z]+ first, we don't match an apostrophe by itself (i.e. ' only). Besides appearing between letters in contractions such as "I'm", an apostrophe can come at the end of a word, for example in plural possessives such as "party at the Joneses' house", so we allow 0 or more letters after the apostrophe.
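As a quick check of the pattern parts described above, here is a small sketch on a made-up sentence (the sample text is an assumption, not from the question) that mixes contractions, a trailing apostrophe, digits and an underscore:

```python
import re

# One or more letters, an optional apostrophe, then zero or more letters.
word_pattern = r"[a-zA-Z]+'?[a-zA-Z]*"

data = "I'm at the Joneses' house with 3 cats and 2_dogs"
words = re.findall(word_pattern, data)

print(words)
# ["I'm", 'at', 'the', "Joneses'", 'house', 'with', 'cats', 'and', 'dogs']
print(len(words))  # 9
```

Digits and the underscore are never part of a match, so "2_dogs" contributes only "dogs".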

The regexes you are looking for are the following:
import re
words = re.findall(r'[a-zA-Z]+', data)
letters = re.findall(r'[a-zA-Z]', data)
spl_symbols = re.findall(r'[~`##$%^&*(_)+{}|/\.,<>?-]', data)
On this excerpt of text, it gives:
This is an excerpt of text containing special chars like $, _, +. It contains less than 112 words but more than 3.
words: ['This', 'is', 'an', 'excerpt', 'of', 'text', 'containing', 'special', 'chars', 'like', 'It', 'contains', 'less', 'than', 'words', 'but', 'more', 'than']
letters: ['T', 'h', 'i', 's', 'i', 's', 'a', 'n', 'e', 'x', 'c', 'e', 'r', 'p', 't', 'o', 'f', 't', 'e', 'x', 't', 'c', 'o', 'n', 't', 'a', 'i', 'n', 'i', 'n', 'g', 's', 'p', 'e', 'c', 'i', 'a', 'l', 'c', 'h', 'a', 'r', 's', 'l', 'i', 'k', 'e', 'I', 't', 'c', 'o', 'n', 't', 'a', 'i', 'n', 's', 'l', 'e', 's', 's', 't', 'h', 'a', 'n', 'w', 'o', 'r', 'd', 's', 'b', 'u', 't', 'm', 'o', 'r', 'e', 't', 'h', 'a', 'n']
spl_symbols: ['$', ',', '_', ',', '+', '.', '.']

Since \w matches a-z, A-Z, 0-9 and the _ (underscore) character, you might want to change your words variable to one of the following:
Option 1:
words = re.findall(r"[a-zA-Z_]+", data)
Option 2:
words = re.findall(r"[^\W0-9]+", data)
Option 3 - Apostrophe Support:
words = re.findall(r"[a-zA-Z_']+", data)
Option 2 includes everything \w matches except the digits 0-9, so the underscore is still counted as part of a word.
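To see how the three options differ, here is a small comparison on a made-up sample string (the string is an assumption chosen to contain an underscore, an apostrophe and digits):

```python
import re

data = "snake_case isn't 42 words"

option_1 = re.findall(r"[a-zA-Z_]+", data)   # letters and underscore
option_2 = re.findall(r"[^\W0-9]+", data)    # \w minus the digits 0-9
option_3 = re.findall(r"[a-zA-Z_']+", data)  # letters, underscore and apostrophe

print(option_1)  # ['snake_case', 'isn', 't', 'words']
print(option_2)  # ['snake_case', 'isn', 't', 'words']
print(option_3)  # ['snake_case', "isn't", 'words']
```

All three keep the underscore, so if underscores must be excluded as the question asks, the character class would need to drop the _ as well.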

Remove words from list containing certain characters

I have a long list of words that I'm trying to go through, removing any word that contains a specific character. However, the solution I thought would work doesn't remove any words.
l3 = ['b', 'd', 'e', 'f', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
firstcheck = ['poach', 'omnificent', 'aminoxylol', 'teetotaller', 'kyathos', 'toxaemic', 'herohead', 'desole', 'nincompoophood', 'dinamode']
validwords = []
for i in l3:
    for x in firstchect:
        if i not in x:
            validwords.append(x)
            continue
        else:
            break
If a word from firstcheck has a character from l3 I want it removed or not added to this other list. I tried it both ways. Can anyone offer insight on what could be going wrong? I'm pretty sure I could use some list comprehension but I'm not very good at that.

The accepted answer makes use of np.sum, which means importing a huge numerical library to perform a simple task that plain Python can easily do by itself:
validwords = [w for w in firstcheck if all(c not in w for c in l3)]
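A variant of the same idea, sketched with a set instead of all(): set.isdisjoint accepts any iterable, so a word can be passed directly and is checked character by character. The name banned is made up for illustration, and the short firstcheck list is the one used in another answer here.

```python
l3 = ['b', 'd', 'e', 'f', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
firstcheck = ['a', 'z', 'poach', 'omnificent']

# Build the set once; membership tests on a set are cheap.
banned = set(l3)

# Keep a word only if it shares no character with the banned set.
validwords = [w for w in firstcheck if banned.isdisjoint(w)]
print(validwords)  # ['a', 'z']
```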

You can use a list comprehension:
import numpy as np
[w for w in firstcheck if np.sum([c in w for c in l3])==0]
It seems all the words contain at least 1 char from l3 and the output of above is an empty list.
If firstcheck is defined as below:
firstcheck = ['a', 'z', 'poach', 'omnificent']
The code should output:
['a', 'z']

If you want to avoid all loops etc, you can use re directly.
import re
l3 = ['b', 'd', 'e', 'f', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
firstcheck = ['azz', 'poach', 'omnificent', 'aminoxylol', 'teetotaller', 'kyathos', 'toxaemic', 'herohead', 'desole', 'nincompoophood', 'dinamode']
# Create a regex string to remove.
strings_to_remove = "[{}]".format("".join(l3))
validwords = [x for x in firstcheck if re.sub(strings_to_remove, '', x) == x]
print(validwords)
Output:
['azz']

There were a couple of mistakes in the code; the rest was fine:
l3 = ['b', 'd', 'e', 'f', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
firstcheck = ['aza', 'ca', 'poach', 'omnificent', 'aminoxylol', 'teetotaller', 'kyathos', 'toxaemic', 'herohead', 'desole', 'nincompoophood', 'dinamode']
validwords = []
flag=1
for x in firstcheck:
    for i in l3:
        if i not in x:
            flag=1
        else:
            flag=0
            break
    if(flag==1):
        validwords.append(x)
print(validwords)
The first mistake was the order of the for loops: we need to iterate through the words first and then through l3, to avoid adding the same word more than once.
Next, firstcheck was misspelled as firstchect in the inner loop, which raises a NameError.
I also added a flag, so that a word is appended to validwords only if the flag is still 1 after checking every character of l3.
To check, I added the new elements 'aza' and 'ca', and the output is now ['aza', 'ca'].
Hope this helps you.

Conditional statement does not work when appending a list relative to another

I am trying to remove certain characters from a string. My way of going about it is to turn the string into a list, iterate through each list and append each good character to a new list and return that new list but for some reason, it doesn't do that. This is the input:
"4193 with words"
and this is the output:
4193withwords
In other words, the only part of the code which works is the part of removing the whitespaces. Here is my entire code:
class Solution:
    def myAtoi(self, str: str) -> int:
        illegal_char = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '!', '#', '#', '$', '%', '^', '&' '*', '(', ')', '=', '+', '[', ']', '{', '}', '|']
        new_list = []
        integer_list = list(str)
        for i in range(len(integer_list)):
            if integer_list[i] != any(illegal_char):
                new_list.append(integer_list[i])
        output = ''.join(new_list)
        output = output.replace(' ', '')
        return output

You can do a join on a list-comprehension. What you need is a membership check in list and form string with only those characters you need:
''.join([x for x in s if x not in illegal_char]).replace(' ', '')
Note that I have renamed your string to s, because str is a built-in.
Also, if you include the space character in illegal_char, you can avoid the replace at the end.
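As for why the original condition never filters anything: any(illegal_char) evaluates to the single value True, so each character is compared against True rather than tested for membership. A minimal sketch, assuming a shortened set of unwanted characters (illegal_chars below is illustrative, not the exact list from the question):

```python
import string

# Assumed set of unwanted characters: ASCII letters, some punctuation, and the space.
illegal_chars = set(string.ascii_letters) | set("!#$%^&*()=+[]{}| ")

text = "4193 with words"

# any() on a list of non-empty strings is just True, so `ch != any(...)`
# is `ch != True`, which holds for every character of the string.
always_true = all(ch != any(["a", "b"]) for ch in text)
print(always_true)  # True

# A membership test does what was intended.
cleaned = "".join(ch for ch in text if ch not in illegal_chars)
print(cleaned)  # 4193
```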

Hmm, this is a very complicated way of replacing some characters. I suggest you learn some regex, as it could help you a lot. There is a regex library for Python called re.
This would be my solution:
import re
mytext = "4193 with words"
newtext = re.sub(r"\s", "", mytext) # removes all whitespace characters

Given a list of Unicode code points, how does one split them into a list of Unicode characters?

I'm writing a lexical analyzer for Unicode text. Many Unicode characters require multiple code points (even after canonical composition). For example, tuple(map(ord, unicodedata.normalize('NFC', 'ā́'))) evaluates to (257, 769). How can I know where the boundary is between two characters? Additionally, I'd like to store the unnormalized version of the text. My input is guaranteed to be Unicode.
So far, this is what I have:
from unicodedata import normalize
def split_into_characters(text):
    character = ""
    characters = []
    for i in range(len(text)):
        character += text[i]
        if len(normalize('NFKC', character)) > 1:
            characters.append(character[:-1])
            character = character[-1]
    if len(character) > 0:
        characters.append(character)
    return characters
print(split_into_characters('Puélla in vī́llā vīcī́nā hábitat.'))
This incorrectly prints the following:
['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī', '́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī', '́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']
I expect it to print the following:
['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

The boundaries between perceived characters can be identified with Unicode's Grapheme Cluster Boundary algorithm. Python's unicodedata module doesn't have the necessary data for the algorithm (the Grapheme_Cluster_Break property), but complete implementations can be found in libraries like PyICU and uniseg.

You may want to use the pyuegc library, an implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29.
from pyuegc import EGC # pip install pyuegc
string = 'Puélla in vī́llā vīcī́nā hábitat.'
egc = EGC(string)
print(egc)
# ['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']
print(len(string))
# 35
print(len(egc))
# 31
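If installing a library is not an option, a rough standard-library approximation is to attach each combining mark to the character before it. This is only a sketch and not the UAX #29 algorithm: it ignores the other grapheme cluster rules (emoji sequences, Hangul syllables, and so on), and the function name split_combining is made up for illustration. It is enough for Latin text with stacked accents like the example in the question.

```python
import unicodedata

def split_combining(text):
    """Group each combining mark with the preceding character (approximation only)."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "vī́llā" written with explicit escapes: U+012B (ī) followed by U+0301 (combining acute).
word = "v\u012b\u0301ll\u0101"
parts = split_combining(word)
print(parts)
print(len(word), len(parts))  # 6 5
```

Because the pieces are never normalized, the original code points of the input are preserved in the output.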
