Swap a character with its next character in paragraph - python

I have to swap a specific character appearing in a paragraph with the character that follows it.
Suppose my paragraph text is:
My name is andrew. I am very addicted to python and attains very high knowledge about programming.
My task is to find a particular character in the paragraph and swap it with the character next to it. For example, I want to swap every 'a' with the character that follows it. After processing, my paragraph should look like this:
My nmae is nadrew. I ma very dadicted to python nad tatians very high knowledge baout progrmaming.
I would be very thankful if anybody could define a function for this in Python.

This will do it:
>>> import re
>>> regex = re.compile(r'(a)(\w)')
>>> text = 'My name is andrew. I am very addicted to python and attains very high knowledge about programming.'
>>> regex.sub(lambda m: m.group(2) + m.group(1), text)
'My nmae is nadrew. I ma very dadicted to python nad tatians very high knowledge baout progrmaming.'
Explanation:
(a)(\w)
Matches a and puts it in group 1, then matches the following word character and puts it in group 2. The lambda used as the replacement swaps these two groups.
If you want to match everything but whitespace, use:
(a)(\S)
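For reuse, the same idea can be wrapped in a small function. This is only a minimal sketch: the name swap_with_next and its parameters are illustrative and not part of the original answer, and re.escape is applied so that a character with special regex meaning (such as "." or "+") can be passed safely.

```python
import re

def swap_with_next(text, char, follower=r"\w"):
    # `char` is the literal character to move; `follower` is the regex class
    # the next character must match (word characters by default).
    pattern = re.compile("(%s)(%s)" % (re.escape(char), follower))
    # Group 1 is `char`, group 2 is the character after it; emit them swapped.
    return pattern.sub(lambda m: m.group(2) + m.group(1), text)

text = ("My name is andrew. I am very addicted to python and attains "
        "very high knowledge about programming.")
print(swap_with_next(text, "a"))
```

Passing follower=r"\S" gives the "everything but whitespace" variant.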

Related

Replace a substring with defined region and follow up variable region in Python

I have a seemingly simple problem that for the life of me is just outside my reach of understanding. What I mean by that is that I can come up with many complex ways to attempt this, but there must be an easy way.
What I am trying to do is find and replace a substring in a string, but the catch is that it is based on a mix of a defined region and then variable regions based on length.
Here is an example:
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC' and I want to replace AATCGATCGTA with <span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>
So in this example the first part will always be constant AATCGA and will be used to locate the region to replace. This is then followed by a "spacer", in this case a single character but could be more than one and needs to be specified, and finally the last bit that will follow the "tail", in this case four characters, but could also be more or less. A set-up in this case would be:
to_find = 'AATCGA'
spacer = 'T' #Variable based on number and not on the character
tail = 'CGTA' #Variable based on number and not on the character
With this information I need to do something like:
new_seq = sequence.replace(f'{to_find}{len(spacer)}{len(tail)}', f'<span color="blue">{to_find}</span><span>{spacer}</span><span color="green">{tail}</span>')
print(new_seq)
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
But the spacer could be 3 characters from the end of to_find and it may vary, the same with the tail section. Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
Any help would be much appreciated!
I'm not quite sure I understand you fully. Nevertheless, you don't seem to be too far off. Just use regex.
import re
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
expected_new_seq = '<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC'
to_find = 'AATCGA'
spacer = 'T' # Variable based on number and not on the character
tail = 'CGTA' # Variable based on number and not on the character
# In this case, the pattern is (AATCGA)(.{1})(.{4})
# It matches "AATCGA" that is followed by 1 character and then 4 characters.
# AATCGA is captured in group 1, then the next unknown character is captured
# in group 2, and the next 4 unknown characters are captured in group 3
# (the brackets create capturing groups).
pattern = f'({to_find})(.{{{len(spacer)}}})(.{{{len(tail)}}})'
# \1 refers to capture group 1 (to_find), \2 refers to capture group 2 (spacer),
# and \3 refers to capture group 3 (tail).
# This no longer needs to be an f-string, but making it a raw string means we
# don't need to escape the backslashes.
repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
new_seq = re.sub(pattern, repl, sequence)
print(new_seq)
print(new_seq == expected_new_seq)
Output:
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
True
Have a play around with it here (also includes interactive explanation): https://regex101.com/r/2mshrI/1
Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
How do you know when to replace it when it's in reverse instead of forward? After all, all you're doing is matching a short string followed or preceded by n characters. I imagine you'd get matches in both directions, so which replacement do you carry out? Please provide more examples, with longer input and the expected output.
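The forward case can be packaged as a function that takes the lengths directly. This is a sketch under assumptions: the name highlight and its parameters are made up for illustration, it only covers the forward orientation (to_find, then spacer, then tail), and re.escape is added in case to_find ever contains regex metacharacters.

```python
import re

def highlight(sequence, to_find, spacer_len, tail_len):
    # Build (to_find)(.{spacer_len})(.{tail_len}) with the literal part escaped.
    pattern = "(%s)(.{%d})(.{%d})" % (re.escape(to_find), spacer_len, tail_len)
    repl = (r'<span color="blue">\1</span>'
            r'<span>\2</span>'
            r'<span color="green">\3</span>')
    return re.sub(pattern, repl, sequence)

sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
print(highlight(sequence, 'AATCGA', 1, 4))
```

Changing spacer_len or tail_len changes how many characters land in the second and third spans without touching the pattern logic.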

Replace characters in specific locations in strings inside lists

Very new to Python/programming, trying to create a "grocery list generator" as a practice project.
I created a bunch of meal variables with their ingredients in a list, then to organise that list in a specific (albeit probably super inefficient) way with vegetables at the top I've added a numerical value at the start of each string. It looks like this -
meal = ["07.ingredient1", "02.ingredient2", "05.ingredient3"]
It organises, prints, and writes how I want it to, but now I want to remove the first three characters (the numbers) from each string in the list before I write it to my text file.
So far my final bit of code looks like this -
Have tried a few different things between the '.sort' and 'with open' like replace, strip, range and some other things but can't get them to work.
My next stop was trying something like this, but can't figure it out -
for item in groceries[1:]
str(groceries(range99)).replace('')
Thanks heaps for your help!
for item in groceries:
    shopping_list.write(item[3:] + '\n')
Instead of replacing you can just take a substring.
groceries = [g[3:] for g in groceries]
Depending on your general programming knowledge, this solution may be a bit advanced, but regular expressions would be another alternative.
import re
pattern = re.compile(r"\d+\.\s*(\w+)")
for item in groceries:
    ingredient = pattern.findall(item)[0]
\d means any digit (0-9), + means "at least one", \. matches a literal ".", \s is whitespace, * means "0 or more", and \w is any word character (a-z, A-Z, 0-9, and underscore).
This would also match things like
groceries = ["1. sugar", "0110.salt", "10. tomatoes"]
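As a quick check of that pattern on the list above, here is a small runnable sketch. The variable name cleaned is illustrative, and the code assumes every item really does start with digits followed by a dot (otherwise match() returns None and the attribute access fails).

```python
import re

pattern = re.compile(r"\d+\.\s*(\w+)")
groceries = ["1. sugar", "0110.salt", "10. tomatoes"]

# Group 1 holds the ingredient name without the numeric prefix.
cleaned = [pattern.match(item).group(1) for item in groceries]
print(cleaned)  # ['sugar', 'salt', 'tomatoes']
```

Unlike item[3:], this does not depend on the prefix always being exactly three characters long.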
>>> meal = ["07.ingredient1", "02.ingredient2", "05.ingredient3"]
>>> myarr = [i[3:] for i in meal]
>>> print(myarr)
['ingredient1', 'ingredient2', 'ingredient3']

Finding most common occurrence of a character that follows another

I'm currently working on a small piece of code and I seem to have run into a roadblock. I was wondering if it's possible to (because I cannot, for the life of me, figure it out) find the most common occurrence of a character that follows a specific character or string?
For example, say I have the following sentence:
"this is a test sentence that happens to be short"
How could I determine, for example, the most common character that occurs after the letter h?
In this specific example, doing it by hand, I get something like this:
{"i": 1, "a": 2, "o": 1}
I'd then like to be able to get the key of the highest value--in this case, a.
Using Counter from collections, I've been able to find the most common occurrence of a specific word or character, but I'm not sure how to do this specific implementation of doing the most common occurrence after. Any help would be greatly appreciated, thanks!
(The code I wrote to find the most common occurrence of a letter in a file:
Counter(text).most_common(1), which does include white spaces )
EDIT:
How would this be done with words? For example, if I had the sentence: "whales are super neat, but whales don't make good pets. whales are cool."
How would I find the most common character that occurs after the word whales?
In this instance, removing white spaces, the most common character would be a
Just split them by your character and then get the letter after it
import collections
sentence = "this is a test sentence that happens to be short"
character = 'h'
letters_after_some_character = [part[0] for part in sentence.split(character)[1:] if part and part[0].isalpha()]
print(collections.Counter(letters_after_some_character).most_common())
If you want a solution without using regex:
import collections
sentence = "this is a test sentence that happens to be short"
characters = [sentence[i] for i in range(1,len(sentence)) if sentence[i-1] == 'h']
most_common_char = collections.Counter(characters).most_common(1)
Using the Counter class we can try:
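import collections
import re
s = "this is a test sentence that happens to be short"
s = re.sub(r'^.*n|\s*', '', s)
print(collections.Counter(s).most_common(1)[0])
The regex strips everything up to and including the last n, plus all whitespace, which leaves "stobeshort". The print shows a (character, count) tuple for the most frequent remaining character. Here s, t, and o tie at two occurrences each, and on Python 3.7+ the tie goes to the character encountered first, so the output is ('s', 2).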
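For the word-based case raised in the EDIT, one possible approach is a regex that captures the first non-whitespace character after each occurrence of the word. This is a sketch only: the function name most_common_after is hypothetical, and skipping whitespace before the captured character is an assumption taken from the "removing white spaces" remark in the question.

```python
import re
from collections import Counter

def most_common_after(text, prefix):
    # Capture the first non-whitespace character after each occurrence of prefix.
    followers = re.findall(re.escape(prefix) + r"\s*(\S)", text)
    if not followers:
        return None  # prefix never occurs, or nothing follows it
    return Counter(followers).most_common(1)[0][0]

text = "whales are super neat, but whales don't make good pets. whales are cool."
print(most_common_after(text, "whales"))  # a
```

The same function also handles the single-character case, for example most_common_after(sentence, "h").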

How to remove multiple consequent characters within a word with regular expressions in Python?

I want a regular expression (in Python) that given a sentence like:
heyy how are youuuuu, it's so cool here, cooool.
converts it to:
heyy how are youu, it's so cool here, cool.
which means a character may appear at most twice in a row; any further repetitions should be removed.
heyy ==> heyy
youuuu ==> youu
cooool ==> cool
You can use a backreference in the pattern to match repeated characters and replace them with two instances of the matched character. Here (.)\1+ matches the same character two or more times in a row, and \1\1 replaces the run with exactly two instances:
import re
s = "heyy how are youuuuu, it's so cool here, cooool."
re.sub(r"(.)\1+", r"\1\1", s)
# "heyy how are youu, it's so cool here, cool."
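If the allowed run length should be configurable, the backreference approach generalizes as in the sketch below. The name squeeze_repeats and the max_run parameter are illustrative assumptions, not part of the answer above.

```python
import re

def squeeze_repeats(text, max_run=2):
    # (.)\1{max_run,} matches a character followed by at least max_run more
    # copies of itself, i.e. a run longer than max_run.
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    # Replace each over-long run with exactly max_run copies.
    return pattern.sub(lambda m: m.group(1) * max_run, text)

s = "heyy how are youuuuu, it's so cool here, cooool."
print(squeeze_repeats(s))
```

Note that "." matches any character except a newline, so runs of spaces or punctuation are shortened as well.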
Create a new empty string and append each character unless it is followed by two more copies of itself:
text = "heyy how are youuuuu, it's so cool here, cooool."
new_text = ''
for i in range(len(text)):
    try:
        if text[i] == text[i+1] == text[i+2]:
            pass  # skip: at least two identical characters still follow
        else:
            new_text += text[i]
    except IndexError:
        # fewer than two characters remain, so keep this one
        new_text += text[i]
print(new_text)
heyy how are youu, it's so cool here, cool.
eta: hmmm, just noticed you requested "regular expressions", so the accepted answer is better, though this works

Regex to match 'lol' to 'lolllll' and 'omg' to 'omggg', etc

Hey there, I love regular expressions, but I'm just not good at them at all.
I have a list of some 400 shortened words such as lol, omg, lmao...etc. Whenever someone types one of these shortened words, it is replaced with its English counterpart ([laughter], or something to that effect). Anyway, people are annoying and type these short-hand words with the last letter(s) repeated x number of times.
examples:
omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol
I was wondering if anyone could hand me the regex (in Python, preferably) to deal with this?
Thanks all.
(It's a Twitter-related project for topic identification if anyone's curious. If someone tweets "Let's go shoot some hoops", how do you know the tweet is about basketball, etc)
FIRST APPROACH -
Well, using regular expression(s) you could do like so -
import re
re.sub('g+', 'g', 'omgggg')
re.sub('l+', 'l', 'lollll')
etc.
Let me point out that using regular expressions is a very fragile & basic approach to dealing with this problem. You could so easily get strings from users which will break the above regular expressions. What I am trying to say is that this approach requires lot of maintenance in terms of observing the patterns of mistakes the users make & then creating case specific regular expressions for them.
SECOND APPROACH -
Instead have you considered using difflib module? It's a module with helpers for computing deltas between objects. Of particular importance here for you is SequenceMatcher. To paraphrase from official documentation-
SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. SequenceMatcher tries to compute a "human-friendly diff" between two sequences. The fundamental notion is the longest contiguous & junk-free matching subsequence.
import difflib as dl
x = dl.SequenceMatcher(lambda x: x == ' ', "omg", "omgggg")
y = dl.SequenceMatcher(lambda x: x == ' ', "omgggg", "omg")
avg = (x.ratio() + y.ratio()) / 2.0
if avg >= 0.6:
    print('Match!')
else:
    print('Sorry!')
According to the documentation, a ratio() over 0.6 means the sequences are close matches. You might need to tweak the threshold for your data. If you need stricter matching, I found that a value over 0.8 serves well.
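For matching against a whole word list, difflib also provides get_close_matches, which applies the same ratio cutoff across many candidates. The sketch below is one possible way to use it; the function name normalize_slang and the three-word vocabulary are illustrative stand-ins for the real list of 400 words.

```python
import difflib

def normalize_slang(word, vocabulary, cutoff=0.6):
    # Return the closest vocabulary entry whose similarity ratio is at least
    # `cutoff`, or None when nothing is close enough.
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

vocabulary = ["lol", "omg", "lmao"]
print(normalize_slang("omgggg", vocabulary))
print(normalize_slang("basketball", vocabulary))
```

Raising cutoff towards 0.8 makes the match stricter, at the cost of rejecting heavily stretched words such as "omgggg".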
How about
\b(?=lol)\S*(\S+)(?<=\blol)\1*\b
(replace lol with omg, haha etc.)
This will match lol, lololol, lollll, lollollol etc. but fail lolo, lollllo, lolly and so on.
The rules:
Match the word lol completely.
Then allow any repetition of one or more characters at the end of the word (i. e. l, ol or lol)
So \b(?=zomg)\S*(\S+)(?<=\bzomg)\1*\b will match zomg, zomggg, zomgmgmg, zomgomgomg etc.
In Python, with comments:
result = re.sub(
    r"""(?ix)\b    # assert position at a word boundary
    (?=lol)        # assert that "lol" can be matched here
    \S*            # match any number of characters except whitespace
    (\S+)          # match at least one character (to be repeated later)
    (?<=\blol)     # until we have reached exactly the position after the 1st "lol"
    \1*            # then repeat the preceding character(s) any number of times
    \b             # and ensure that we end up at another word boundary""",
    "lol", subject)
This will also match the "unadorned" version (i. e. lol without any repetition). If you don't want this, use \1+ instead of \1*.
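To apply this pattern to a whole list of short-hand words, it can be generated per word. The following is a sketch under assumptions: build_pattern and expand_slang are hypothetical helper names, the "[surprise]" replacement is invented for the example, and the words are assumed to be plain literals so that the lookbehind stays fixed-width after re.escape.

```python
import re

def build_pattern(word):
    # Same structure as the pattern above, with the word substituted in.
    esc = re.escape(word)
    return re.compile(r"\b(?=%s)\S*(\S+)(?<=\b%s)\1*\b" % (esc, esc),
                      re.IGNORECASE)

def expand_slang(text, replacements):
    # Apply one substitution pass per short-hand word.
    for word, replacement in replacements.items():
        # A function is used as the replacement so that backslashes in
        # `replacement` are never interpreted as group references.
        text = build_pattern(word).sub(lambda m, r=replacement: r, text)
    return text

print(expand_slang("lollll nice lolly omgggg",
                   {"lol": "[laughter]", "omg": "[surprise]"}))
```

With several hundred words, compiling the patterns once up front rather than on every call would be a reasonable next step.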
