Replace characters in specific locations in strings inside lists - python

Very new to Python/programming, trying to create a "grocery list generator" as a practice project.
I created a bunch of meal variables with their ingredients in a list, then to organise that list in a specific (albeit probably super inefficient) way with vegetables at the top I've added a numerical value at the start of each string. It looks like this -
meal = ["07.ingredient1", "02.ingredient2", "05.ingredient3"]
It organises, prints, and writes how I want it to, but now I want to remove the first three characters (the numbers) from each string in the list before I write it to my text file.
So far my final bit of code looks like this -
Have tried a few different things between the '.sort' and 'with open' like replace, strip, range and some other things but can't get them to work.
My next stop was trying something like this, but can't figure it out -
for item in groceries[1:]
str(groceries(range99)).replace('')
Thanks heaps for your help!

for item in groceries:
    shopping_list.write(item[3:] + '\n')

Instead of replacing you can just take a substring.
groceries = [g[3:] for g in groceries]
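Putting the pieces together, the whole flow (sort, strip, write) might look like the sketch below. It assumes every entry follows the "NN.name" convention from the question. The names `strip_prefix` and `write_list` are illustrative and not from the original code, and `io.StringIO` stands in for the real text file so the example is self-contained. `str.partition` is used instead of a fixed `[3:]` slice so that prefixes of a different width would still work.

```python
import io

def strip_prefix(item):
    """Return the part after the first '.', or the item unchanged if it has no '.'."""
    prefix, dot, name = item.partition(".")
    return name if dot else item

def write_list(items, stream):
    """Write one item per line to any file-like object."""
    for item in items:
        stream.write(item + "\n")

meal = ["07.ingredient1", "02.ingredient2", "05.ingredient3"]
groceries = sorted(meal)                         # sort while the numeric prefixes are still there
cleaned = [strip_prefix(g) for g in groceries]   # then drop the prefixes

shopping_list = io.StringIO()                    # stand-in for open("...", "w")
write_list(cleaned, shopping_list)
print(shopping_list.getvalue())
```

With a real file, the same call would sit inside a `with open(...) as shopping_list:` block.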

Depending on your general programming knowledge, this solution is maybe a bit advanced, but regular expressions would be another alternative.
import re
pattern = re.compile(r"\d+\.\s*(\w+)")
for item in groceries:
    ingredient = pattern.findall(item)[0]
\d means any digit (0-9), + means "at least one", \. matches a literal ".", \s is whitespace, * means "0 or more", and \w is any word character (a-z, A-Z, 0-9 and the underscore).
This would also match things like
groceries = ["1. sugar", "0110.salt", "10. tomatoes"]
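As a quick check, the pattern can be run over that list. This is only a sketch: it uses `pattern.match` rather than `findall`, and the fallback to the original string for non-matching items is an added assumption, not part of the answer above.

```python
import re

# One or more digits, a literal dot, optional whitespace, then the name.
pattern = re.compile(r"\d+\.\s*(\w+)")

groceries = ["1. sugar", "0110.salt", "10. tomatoes"]
names = []
for item in groceries:
    match = pattern.match(item)
    # Keep the original string if it does not fit the "number.name" shape.
    names.append(match.group(1) if match else item)

print(names)  # ['sugar', 'salt', 'tomatoes']
```

Note that `\w+` stops at the first space, so a multi-word ingredient such as "olive oil" would be cut to "olive"; a capture like `(.+)` would be needed for those.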

>>> meal = ["07.ingredient1", "02.ingredient2", "05.ingredient3"]
>>> myarr = [i[3:] for i in meal]
>>> print(myarr)
['ingredient1', 'ingredient2', 'ingredient3']

Related

An efficient way to find elements of a list that contain substrings from another list

list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]
Assume I have two string lists, I want to use a new list to store the matched pairs of those two lists just like below:
list3=[["happy new year","happy"],["happy new year","new"],["a happy story","happy"],["hold on","hold"]]
which means I need to get all pairs of strings in one list with their substrings in another list.
Actually, this is about data from ancient Chinese texts. The first list contains names of people from the 10th to 13th century, and the second list contains the titles of all the poems of that period. Ancient Chinese writers often recorded their social relations in the titles of their works. For example, someone may write a poem titled "For my friend Wang Anshi". In this case, the person "Wang Anshi" in the first list should be matched with this title. There are also cases like "For my friend Wang Anshi and Su Shi", which contain more than one person in the title. So basically it's a huge job involving 30,000 people and 160,000 poems.
Following is my code:
list3 = []
for i in list1:
    for j in list2:
        if str(i).count(str(j)) > 0:
            list3.append([i,j])
I use str(i) because Python always takes my Chinese strings as float. This code does work, but it is far too slow. I must figure out another way to do that. Thanks!
Use a regular expression to do the searching, via the re module. A regular expression engine can work out matching elements in a search through text much better than a nested for loop can.
I'm going to use better variable names here to make it clearer which list goes where; titles are the poem titles you are searching through, and names the things you are trying to match. matches are the (title, name) pairs you want to produce:
import re
titles = ["happy new year", "game over", "a happy story", "hold on"]
names = ["happy", "new", "hold"]
by_reverse_length = sorted(names, key=len, reverse=True)
pattern = "|".join(map(re.escape, by_reverse_length))
any_name = re.compile("({})".format(pattern))
matches = []
for title in titles:
    for match in any_name.finditer(title):
        matches.append((title, match.group()))
The above produces your required output:
>>> matches
[('happy new year', 'happy'), ('happy new year', 'new'), ('a happy story', 'happy'), ('hold on', 'hold')]
The names are sorted by length, in reverse, so that longer names are found before shorter with the same prefix; e.g. Hollander is found before Holland is found before Holl.
The pattern string is created from your names to form a ...|...|... alternatives pattern, any one of those patterns can match, but the regex engine will find those listed earlier in the sequence over those put later, hence the need to reverse sort by length. The (...) parentheses around the whole pattern of names tells the regular expression engine to capture that part of the text, in a group. The match.group() call in the loop can then extract the matched text.
The re.escape() function call is there to prevent 'meta characters' in the names, characters with special meaning such as ^, $, (, ), etc, from being interpreted as their special regular expression meanings.
The re.finditer() function (and method on compiled patterns) then finds non-overlapping matches in order from left to right, so it'll never match shorter substrings, and gives us the opportunity to extract the match object for each. This gives you more options if you want to know about starting positions of the matches and other metadata as well, should you want those. Otherwise, re.findall() could also be used here.
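To illustrate the extra metadata a match object carries, here is a small sketch that records the starting position of each match alongside the pair. It reuses the sample data from above; the name `matches_with_pos` is made up for this example.

```python
import re

titles = ["happy new year", "game over", "a happy story", "hold on"]
names = ["happy", "new", "hold"]

# Longest names first, so a longer name wins over a shorter one sharing its prefix.
by_reverse_length = sorted(names, key=len, reverse=True)
any_name = re.compile("|".join(map(re.escape, by_reverse_length)))

# m.start() is the index in the title where the matched name begins.
matches_with_pos = [
    (title, m.group(), m.start())
    for title in titles
    for m in any_name.finditer(title)
]
print(matches_with_pos)
# [('happy new year', 'happy', 0), ('happy new year', 'new', 6),
#  ('a happy story', 'happy', 2), ('hold on', 'hold', 0)]
```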
If you are going to use the above on text with Western alphabets and not on Chinese, then you probably also want to add word boundary markers, \b:
any_name = re.compile(r"\b({})\b".format(pattern))
otherwise substrings part of a larger word can be matched. Since Chinese has no word boundary characters (such as spaces and punctuation) you don't want to use \b in such texts.
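The difference is easy to see on a small made-up example (the title below is hypothetical and chosen so that the names also occur inside longer words). Note the raw string prefix: in a regular string literal, "\b" is a backspace character rather than a word-boundary marker.

```python
import re

names = ["new", "hold"]
pattern = "|".join(map(re.escape, sorted(names, key=len, reverse=True)))

loose = re.compile("({})".format(pattern))
strict = re.compile(r"\b({})\b".format(pattern))  # raw string so \b reaches the regex engine

title = "renewal of a new household"
print(loose.findall(title))   # ['new', 'new', 'hold']
print(strict.findall(title))  # ['new']
```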
If the lists are longer, it might be worth building a sort of "index" of the sentences a given word appears in. Creating the index takes about as long as finding the first word from list2 in all the sentences in list1 (it has to loop over all the words in all the sentences), and once created, you can get the sentences containing a word much faster in O(1).
list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]
import collections
index = collections.defaultdict(list)
for sentence in list1:
    for word in sentence.split():
        index[word].append(sentence)
res = [[sentence, word] for word in list2 for sentence in index[word]]
Result:
[['happy new year', 'happy'],
['a happy story', 'happy'],
['happy new year', 'new'],
['hold on', 'hold']]
This uses str.split to split the words at spaces, but if the sentences are more complex, e.g. if they contain punctuation, you might use a regular expression with word boundaries \b instead, and possibly normalize the sentences (e.g. convert to lowercase or apply a stemmer, not sure if this is applicable to Chinese, though).
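A rough sketch of that variation follows, assuming space-separated text with punctuation and mixed case. The sample sentences are altered from the ones above purely for illustration, and `\w+` is used as a simple tokenizer; it would not segment Chinese text into words.

```python
import collections
import re

list1 = ["Happy new year!", "Game over.", "A happy story", "Hold on, please"]
list2 = ["happy", "new", "hold"]

index = collections.defaultdict(list)
for sentence in list1:
    # \w+ extracts runs of word characters and drops punctuation; the set
    # avoids listing a sentence twice when a word repeats within it.
    for word in set(re.findall(r"\w+", sentence.lower())):
        index[word].append(sentence)

# index.get avoids creating empty entries for words that never occur.
res = [[sentence, word] for word in list2 for sentence in index.get(word, [])]
print(res)
```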
This can be done quite easily in an absolutely straightforward way.
Option A: Finding "all" possible combinations: To find all strings in one list that contain substrings from another list, loop over all your strings of your list1 (the strings to assess) and for each element check whether it contains a substring of list2:
list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]
[(string, substring) for string in list1 for substring in list2 if substring in string]
>>> [('happy new year', 'happy'), ('happy new year', 'new'), ('a happy story', 'happy'), ('hold on', 'hold')]
(I do think the title of your question is a bit misleading, though, as you are not only asking for elements of a list that contain a substring of another list, but as per your code example you are looking for 'all possible combinations'.)
Thus option B: Finding "any" combination: Much simpler and faster, if you really only need what the question says, you can improve performance by finding only the 'any' matches:
[string for string in list1 if any(substring in string for substring in list2)]
Option B will also allow you to improve performance. In case the lists are very long, you can run B first, create a subset (only strings that will actually produce a match with a substring), and then expand again to catch 'all' instead of any.
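The two-stage idea could be sketched like this; `candidates` and `pairs` are names chosen for the example. Whether it is actually faster depends on the data, since the first stage already scans every string, so it is worth timing on the real lists.

```python
list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]

# Stage 1 (option B): keep only the strings that contain at least one substring.
# any() stops at the first hit, so matching strings are left early.
candidates = [s for s in list1 if any(sub in s for sub in list2)]

# Stage 2 (option A): expand the reduced list into all (string, substring) pairs.
pairs = [(s, sub) for s in candidates for sub in list2 if sub in s]

print(candidates)  # ['happy new year', 'a happy story', 'hold on']
print(pairs)
```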

How to remove multiple consequent characters within a word with regular expressions in Python?

I want a regular expression (in Python) that given a sentence like:
heyy how are youuuuu, it's so cool here, cooool.
converts it to:
heyy how are youu, it's so cool here, cool.
which means maximum of 1 time a character can be repeated and if it's more than that it should be removed.
heyy ==> heyy
youuuu ==> youu
cooool ==> cool
You can use back reference in the pattern to match repeated characters and then replace it with two instances of the matched character, here (.)\1+ will match a pattern that contains the same character two or more times, replace it with only two instances by \1\1:
import re
re.sub(r"(.)\1+", r"\1\1", s)
# "heyy how are youu, it's so cool here, cool."
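A slightly more general sketch wraps the same back-reference idea in a function with a configurable limit. The function name `squeeze_repeats` and the `max_run` parameter are made up for this example.

```python
import re

def squeeze_repeats(text, max_run=2):
    """Shorten any run of one character longer than max_run down to max_run."""
    # (.) captures a character; \1{n,} then requires at least n more copies,
    # so only runs of max_run + 1 or more characters are touched.
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)

s = "heyy how are youuuuu, it's so cool here, cooool."
print(squeeze_repeats(s))
# heyy how are youu, it's so cool here, cool.
```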
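Create a new empty string and only append a character if it is not the first of three identical consecutive characters:
text = "heyy how are youuuuu, it's so cool here, cooool."
new_text = ''
for i in range(len(text)):
    try:
        if text[i] == text[i+1] == text[i+2]:
            pass  # skip: at least two more copies of this character follow
        else:
            new_text += text[i]
    except IndexError:
        # fewer than two characters remain, so keep this one
        new_text += text[i]
print(new_text)
>>>heyy how are youu, it's so cool here, cool.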
eta: hmmm just noticed you requested "regular expressions" so approved answer is better; though this works

Python: Replace all substring occurrences with regular expressions

I would like to replace all substring occurrences with regular expressions. The original sentences would be like:
mystring = "Carl's house is big. He is asking 1M for that(the house)."
Now let's suppose I have two substrings I would like to bold. I bold the words by adding ** at the beginning and at the end of the substring. The 2 substrings are:
substring1 = "house", so bolded it would be "**house**"
substring2 = "the house", so bolded it would be "**the house**"
At the end I want the original sentence like this:
mystring = "Carl's **house** is big. He is asking 1M for that(**the house**)."
The main problem is that as I have several substrings to replace, they can overlap words like the example above. If I analyze the longest substring at first, I am getting this:
Carl's **house** is big. He is asking 1M for that(**the **house****).
On the other hand, if I analyze the shortest substring first, I am getting this:
Carl's **house** is big. He is asking 1M for that(the **house**).
It seems to be I will need to replace from the longest substring to the shortest, but I wonder how should I do to consider it in the first replacement but in the second. Also remember the substring can appear several times in the string.
Note:// Suppose the string ** will never occur in the original string, so we can use it to bold our words
You can search for all of the strings at once, so that the fact that one is a substring of another doesn't matter:
re.sub(r"(house|the house)", r"**\1**", mystring)
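With more than a couple of substrings, the alternation can be built from a list. The sketch below assumes the substrings are plain text (hence `re.escape`) and sorts them longest first so that a longer phrase is preferred when two candidates start at the same position; `bold_all` is a name invented for this example.

```python
import re

def bold_all(text, substrings):
    """Wrap every occurrence of any substring in ** in a single pass."""
    ordered = sorted(substrings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    # m.group(0) is the whole matched text.
    return pattern.sub(lambda m: "**" + m.group(0) + "**", text)

mystring = "Carl's house is big. He is asking 1M for that(the house)."
print(bold_all(mystring, ["house", "the house"]))
# Carl's **house** is big. He is asking 1M for that(**the house**).
```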
You could have a group that is not captured and is not required. If you look at the regex pattern (?P<repl>(?:the )?house), the (?:the )? part is saying that there might be a "the " in the string; if it is present, include it in the match. This way, you let the re library optimize the way it matches. Here is the complete example:
>>> data = "Carl's house is big. He is asking 1M for that(the house)."
>>> re.sub(r'(?P<repl>(?:the )?house)', r'**\g<repl>**', data)
"Carl's **house** is big. He is asking 1M for that(**the house**)."
Note: \g<repl> is used to get all the string matched by the group <repl>
You could do two passes:
First: Go through from longest to shortest and replace with something like:
'the house': 'AA_THE_HOUSE'
'house': 'BB_HOUSE'
Second: Go through replace like:
'AA_THE_HOUSE': '**the house**'
'BB_HOUSE': '**house**'
Replace the strings with some unique values and then replace them back with original string enclosed in ** to make them bold.
For example:
'the house' with 'temp_the_house'
'house' with 'temp_house'
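then 'temp_house' with '**house**'
'temp_the_house' with '**the house**'
This should work fine, provided the temporary values do not themselves contain any of the substrings being replaced (note that 'temp_the_house' contains 'house', so in practice pick placeholders that cannot collide). You can automate this by using two lists.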
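One way to automate the two passes is sketched below. It assumes the text contains no Unicode private-use characters, which are used here as single-character placeholders so that no placeholder can contain one of the substrings. The function name `bold_two_pass` is invented for this example.

```python
def bold_two_pass(text, substrings):
    """Replace longest-first with unique placeholders, then swap in the bolded text."""
    ordered = sorted(substrings, key=len, reverse=True)
    # Assumption: characters from U+E000 upward (private-use area) never
    # appear in the text or in the substrings themselves.
    placeholders = {s: chr(0xE000 + i) for i, s in enumerate(ordered)}

    # Pass 1: longest substrings first, so "the house" is claimed before "house".
    for s in ordered:
        text = text.replace(s, placeholders[s])

    # Pass 2: turn each placeholder back into the original text, wrapped in **.
    for s, token in placeholders.items():
        text = text.replace(token, "**" + s + "**")
    return text

mystring = "Carl's house is big. He is asking 1M for that(the house)."
print(bold_two_pass(mystring, ["house", "the house"]))
# Carl's **house** is big. He is asking 1M for that(**the house**).
```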

Swap a character with its next character in paragraph

I have to swap a specific character appearing in paragraph to its next character.
let suppose that my paragraph text is:
My name is andrew. I am very addicted to python and attains very high knowledge about programming.
Now, my task is to find a particular character in the paragraph and swap it with the character next to it. For example, I want to swap every character 'a' with the character that follows it. After processing, my paragraph should look like this:
My nmae is nadrew. I ma very dadicted to python nad tatians very high knowledge baout progrmaming.
I would be very thankful if anybody define function for this in python
This will do it:
>>> import re
>>> regex = re.compile(r'(a)(\w)')
>>> text = 'My name is andrew. I am very addicted to python and attains very high knowledge about programming.'
>>> regex.sub(lambda m: m.group(2) + m.group(1), text)
'My nmae is nadrew. I ma very dadicted to python nad tatians very high knowledge baout progrmaming.'
Explanation:
(a)(\w)
This matches a and puts it in group 1, then matches another word character and puts it in group 2. The lambda expression used for the replacement swaps these two groups.
If you want to match everything but spaces use :
(a)(\S)
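Since the question asked for a function, the same approach could be wrapped up as follows. The function name `swap_with_next` and its `next_class` parameter are illustrative; `re.escape` is there in case the character to swap has a special meaning in a regular expression.

```python
import re

def swap_with_next(text, char, next_class=r"\w"):
    """Swap each occurrence of char with the character that follows it.

    next_class is the regex class the following character must belong to,
    e.g. r"\w" for word characters or r"\S" for anything but whitespace.
    """
    pattern = re.compile("({})({})".format(re.escape(char), next_class))
    return pattern.sub(lambda m: m.group(2) + m.group(1), text)

text = ("My name is andrew. I am very addicted to python and attains "
        "very high knowledge about programming.")
print(swap_with_next(text, "a"))
```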

Regex to match 'lol' to 'lolllll' and 'omg' to 'omggg', etc

Hey there, I love regular expressions, but I'm just not good at them at all.
I have a list of some 400 shortened words such as lol, omg, lmao...etc. Whenever someone types one of these shortened words, it is replaced with its English counterpart ([laughter], or something to that effect). Anyway, people are annoying and type these short-hand words with the last letter(s) repeated x number of times.
examples:
omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol
I was wondering if anyone could hand me the regex (in Python, preferably) to deal with this?
Thanks all.
(It's a Twitter-related project for topic identification if anyone's curious. If someone tweets "Let's go shoot some hoops", how do you know the tweet is about basketball, etc)
FIRST APPROACH -
Well, using regular expression(s) you could do like so -
import re
re.sub('g+', 'g', 'omgggg')
re.sub('l+', 'l', 'lollll')
etc.
Let me point out that using regular expressions is a very fragile & basic approach to dealing with this problem. You could so easily get strings from users which will break the above regular expressions. What I am trying to say is that this approach requires lot of maintenance in terms of observing the patterns of mistakes the users make & then creating case specific regular expressions for them.
SECOND APPROACH -
Instead have you considered using difflib module? It's a module with helpers for computing deltas between objects. Of particular importance here for you is SequenceMatcher. To paraphrase from official documentation-
SequenceMatcher is a flexible class
for comparing pairs of sequences of
any type, so long as the sequence
elements are hashable. SequenceMatcher
tries to compute a "human-friendly
diff" between two sequences. The
fundamental notion is the longest
contiguous & junk-free matching subsequence.
import difflib as dl
x = dl.SequenceMatcher(lambda x: x == ' ', "omg", "omgggg")
y = dl.SequenceMatcher(lambda x: x == ' ', "omgggg", "omg")
avg = (x.ratio() + y.ratio()) / 2.0
if avg >= 0.6:
    print('Match!')
else:
    print('Sorry!')
According to the documentation, a ratio() over 0.6 means the sequences are close matches. You may need to tune the threshold for your data. If you need stricter matching, I found that any value over 0.8 serves well.
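For a whole list of short-hand words, difflib also provides get_close_matches, which applies the same ratio against every candidate. The sketch below is only an illustration: the `replacements` table and the `expand` function are hypothetical, and the 0.6 cutoff is simply the module's default.

```python
import difflib

# Hypothetical lookup table; the real one would hold the ~400 short-hand words.
replacements = {"lol": "[laughter]", "omg": "[surprise]", "haha": "[laughter]"}

def expand(word, cutoff=0.6):
    """Return the replacement for the closest known short-hand word, if any."""
    close = difflib.get_close_matches(word.lower(), list(replacements), n=1, cutoff=cutoff)
    return replacements[close[0]] if close else word

print(expand("omgggg"))    # [surprise]
print(expand("hahahaha"))  # [laughter]
print(expand("hoops"))     # hoops
```

Ordinary words that happen to resemble a short-hand word (for example "log" and "lol") can also clear the cutoff, so the threshold needs checking against real tweets.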
How about
\b(?=lol)\S*(\S+)(?<=\blol)\1*\b
(replace lol with omg, haha etc.)
This will match lol, lololol, lollll, lollollol etc. but fail lolo, lollllo, lolly and so on.
The rules:
Match the word lol completely.
Then allow any repetition of one or more characters at the end of the word (i. e. l, ol or lol)
So \b(?=zomg)\S*(\S+)(?<=\bzomg)\1*\b will match zomg, zomggg, zomgmgmg, zomgomgomg etc.
In Python, with comments:
result = re.sub(
    r"""(?ix)\b    # assert position at a word boundary
    (?=lol)        # assert that "lol" can be matched here
    \S*            # match any number of characters except whitespace
    (\S+)          # match at least one character (to be repeated later)
    (?<=\blol)     # until we have reached exactly the position after the 1st "lol"
    \1*            # then repeat the preceding character(s) any number of times
    \b             # and ensure that we end up at another word boundary""",
    "lol", subject)
This will also match the "unadorned" version (i. e. lol without any repetition). If you don't want this, use \1+ instead of \1*.
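To apply the same pattern to a whole list of short-hand words, it can be generated per word. This is a sketch under the assumption that each word is plain text (it is passed through `re.escape`); `slang_pattern` and `normalize` are names invented for the example, and with 400 words it would be worth compiling the patterns once rather than on every call.

```python
import re

def slang_pattern(word):
    """Build the lookahead/lookbehind pattern from the answer for one word."""
    w = re.escape(word)
    return re.compile(
        r"\b(?=" + w + r")\S*(\S+)(?<=\b" + w + r")\1*\b",
        re.IGNORECASE,
    )

def normalize(text, words):
    """Collapse stretched variants of each word back to the plain word."""
    for word in words:
        text = slang_pattern(word).sub(word, text)
    return text

print(normalize("lollll that was omggg funny, lolly", ["lol", "omg"]))
# lol that was omg funny, lolly
```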
