Why are there empty items after re.split()? [duplicate] - python

This question already has an answer here:
re.split() gives empty elements in list
I assume I misunderstand how re.split() works.
Here is a real and simple example.
>>> import re
>>> re.split('(abc)', 'abc')
['', 'abc', '']
I'm confused about the first and last empty ('') elements in the resulting list. I would expect this result:
['abc']
This was a very simplified example. Here is something more complex.
>>> re.split(r'\[\[(.+?)\]\[(.+?)\]\]', '[[one][two]]')
['', 'one', 'two', '']
Here I would expect this result:
['one', 'two']
This third example with words before and after works as expected.
>>> re.split(r'\[\[(.+?)\]\[(.+?)\]\]', 'zero [[one][two]] three')
['zero ', 'one', 'two', ' three']
My final goal is to split (tokenize) a string with a regex and get both the split parts and the separators (the regex matches) as results. That is why I'm not able to handle that with re.findall().

If you use capturing groups in the re.split expression, the splitting part (abc) is also returned in the output. This can be very useful for tasks such as tokenization.
Every second item in the return value is the captured split pattern; e.g. if (a..) were the splitter and dabcdagde the splittee, you'd get ['d', 'abc', 'd', 'agd', 'e'].
In your first example, since the split expression matches the whole string, you get empty strings "on the sides".
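If the empty strings are not wanted, one option is to filter them out after splitting. This is only a minimal sketch: the helper name split_keep_separators is hypothetical, and it assumes that empty fields carry no meaning for your tokenizer.

```python
import re

def split_keep_separators(pattern, text):
    """Split text on pattern, keep captured separators, drop empty items.

    Hypothetical helper for illustration; not part of the re module.
    """
    # re.split() yields '' where a match touches the start or end of the
    # string, and None for a capturing group that did not participate.
    return [part for part in re.split(pattern, text) if part]

print(split_keep_separators('(abc)', 'abc'))
# ['abc']
print(split_keep_separators(r'\[\[(.+?)\]\[(.+?)\]\]', '[[one][two]]'))
# ['one', 'two']
```

Note the trade-off: once the empty items are removed, the separators are no longer guaranteed to sit at fixed positions in the list.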

My answer is based on an answer to a similar question.
The behavior is as specified in the docs:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
That way, separator components are always found at the same relative indices within the result list.
The last sentence in particular explains why this behavior is useful.
In short: when using capturing groups, the user/developer can always identify the separators (the matches) in the resulting list, because they appear at predictable positions.
With one capturing group, every second element in the result is the matched separator (the captured group).
With two capturing groups, as in my example, the pattern repeats every three elements: index 0 is the split token, 1 is the first captured group, 2 is the second captured group, and then the cycle starts again.
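The fixed positions can be exploited with slicing. The following is a small sketch using the pattern from the question; the variable names are only illustrative.

```python
import re

pattern = r'\[\[(.+?)\]\[(.+?)\]\]'
parts = re.split(pattern, 'zero [[one][two]] three')

# With two capturing groups the result repeats in cycles of three:
# text, group 1, group 2, text, group 1, group 2, ..., text
step = re.compile(pattern).groups + 1
texts = parts[0::step]
first_groups = parts[1::step]
second_groups = parts[2::step]

print(texts)          # ['zero ', ' three']
print(first_groups)   # ['one']
print(second_groups)  # ['two']
```

This only works because the leading and trailing empty strings are kept; for '[[one][two]]' the texts list would be ['', ''] while the groups stay at the same offsets.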

Related

How to replace some numbers with others in one line?

I have a line that contains some numbers separated by underscores, like this:
1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0
I need code that checks (DCAAFFC) and, if its last 4 characters are not (0000), replaces the last 4 characters (AFFC) with (0000), giving (DCA0000).
So the line should become:
1_0_1_A2C_1A_2BE_DCA0000_0_0_0
I need the code to work on both Python 2 and 3, please!
P.S. The code (DCAAFFC) is not fixed; it is always changing.
code = "1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0"
I will assume that the format is strictly like this. Then you can get DCAAFFC with code.split('_')[-4]. Finally, you can overwrite its last four characters with 0000 by slicing (str.replace would also change any earlier occurrence of the same four characters within the field).
Here is the full code:
>>> code = "1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0"
>>> frag = code.split("_")
>>> frag
['1', '0', '1', 'A2C', '1A', '2BE', 'DCAAFFC', '0', '0', '0']
>>> frag[-4] = frag[-4][:-4] + "0000"
>>> final_code = "_".join(frag)
>>> final_code
'1_0_1_A2C_1A_2BE_DCA0000_0_0_0'
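The same idea can be wrapped in a small function. This is a sketch only: the name zero_last_four and its parameters are hypothetical, and it assumes the field to change is always the fourth from the end. It uses nothing beyond str.split, slicing and str.join, so it should behave the same on Python 2 and 3.

```python
def zero_last_four(code, field_index=-4, fill="0000"):
    """Overwrite the tail of one underscore-separated field with `fill`.

    Hypothetical helper; the default field position is an assumption.
    """
    fields = code.split("_")
    field = fields[field_index]
    # Only touch fields long enough to hold the fill; slicing replaces just
    # the tail, even if the same characters also occur earlier in the field.
    if len(field) >= len(fill):
        fields[field_index] = field[:-len(fill)] + fill
    return "_".join(fields)

print(zero_last_four("1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0"))
# 1_0_1_A2C_1A_2BE_DCA0000_0_0_0
```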
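Try regular expressions, for example:
import re
old_string = '1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0'
# Find a field of exactly 7 letters enclosed in underscores
match = re.search('_([a-zA-Z]{7})_', old_string)
span = match.span()
# Keep the leading underscore plus the first 3 letters, then append 0000
new_string = old_string[:span[0]+4] + '0000_' + old_string[span[1]:]
print(new_string)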
Is this a general string or just some hexadecimal representation of a number? In numeric literals (Python 3.6+), underscores are only there for readability and do not affect the value in any way.
Say you have a general string like the one you've given and would like to replace the last 4 characters of every underscore-delimited group longer than 4 characters with '0000'. A simple one-liner, using the name hexadecimal_string, would be:
hexadecimal_string = "1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0"
hexadecimal_string = "_".join([ substring if len(substring)<=4 else substring[:-4]+'0'*4 for substring in hexadecimal_string.split('_')])
Here:
hexadecimal_string.split('_') splits the string into groups using '_' as the separator,
substring if len(substring)<=4 else substring[:-4]+'0'*4 replaces the last 4 characters of every group longer than 4 characters with '0'*4, i.e. '0000',
the for loop inside the square brackets is a list comprehension,
'_'.join() joins the groups back into one string using '_' as the separator.
The other answers posted here work well for the specific string in the question; I'm sharing this answer as a one-liner that works in Python 3.
If the length of the string is always the same, and the position of the part that needs to be replaced with zeros is always the same, you can just do this:
txt = '1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0'
new = txt[0:20]+'0000'+txt[-6:]
print(new)
The output will be:
1_0_1_A2C_1A_2BE_DCA0000_0_0_0
It would help if you gave us some other examples of the strings.

Python: Only using "maxsplit" in one of them, I need to split a string multiple times

I started studying Python yesterday and I wanted to study a little about the string split method.
I wasn't looking for anything specific, I was just trying to learn it. I saw that it's possible to split a string on multiple characters, but what if I want to use the maxsplit parameter for only one of those characters?
I searched a little about it and found nothing, so I'm here to ask how. Here's an example:
Let's suppose I have this string:
normal_string = "1d30 drake dreke"
I want this to be a list like this:
['1', '30', 'drake', 'dreke']
Now let's suppose I use a method to split on multiple characters, so I split on the character 'd' and the character ' '.
The thing is:
I don't want to remove the "d" from "drake" and "dreke", only from "1d30", but at the same time I want to split on all of the space characters.
I need to apply a maxsplit parameter ONLY to the character "d". How can I do it?
Do the following:
normal_string = "1d30 drake dreke"
# first split by d
start, end = normal_string.split("d", maxsplit=1)
# then split on whitespace and concatenate the results
res = start.split() + end.split()
print(res)
Output
['1', '30', 'drake', 'dreke']
A more general approach, albeit more advanced, is to do:
res = [w for s in normal_string.split("d", maxsplit=1) for w in s.split()]
print(res)
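An alternative, not taken from the answer above, is a single re.split call with lookarounds. This sketch assumes that the "d" to split on always sits between two digits (as in "1d30"), which is a guess about the data rather than something the question states.

```python
import re

normal_string = "1d30 drake dreke"

# Split on a "d" that has a digit on both sides, or on any run of whitespace.
# The lookbehind/lookahead check the neighbours without consuming them.
res = re.split(r"(?<=\d)d(?=\d)|\s+", normal_string)
print(res)  # ['1', '30', 'drake', 'dreke']
```

Unlike str.split() without arguments, re.split() keeps empty strings when the input starts or ends with whitespace, so strip the string first if that can happen.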

Split Big String by Specific Word In Python

I want to split a big string by a word that repeats throughout that string.
Example of what I expect:
We have tried to split it with the code below:
string.split("RFF+AAJ:")
So we need the separate lists that I have described in my screenshot above.
You can get your result with the help of a regex:
import re
string = 'helloisworldisbyeishi'
re.split('(is)', string) # Split on 'is' and keep the separator
Output
['hello', 'is', 'world', 'is', 'bye', 'is', 'hi']
I hope it may help you.
split returns one single list containing the complete string, split into parts. So the list here contains the part before the first "RFF+AAJ:", then the part between the two "RFF+AAJ:"s, and the last part, after the second "RFF+AAJ:". If you want to have three different variables, use:
parts = string.split("RFF+AAJ:")
first = parts[0]
second = parts[1]
third = parts[2]
And the elements will be stored in first, second and third.
If you want lists of characters instead of strings, use first = list(first) # and so on.
Hope that helped.
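If the separator should stay attached to each piece, a capturing group can be combined with re.escape. The sample data below is made up, since the original message from the question is not shown; treat this as a sketch, not as the asker's actual format.

```python
import re

# Made-up sample data; the real message from the question is not shown.
string = "UNH+1'RFF+AAJ:111'LIN+1'RFF+AAJ:222'LIN+2'"
separator = "RFF+AAJ:"

# re.escape() makes the "+" a literal character instead of a quantifier;
# the capturing group keeps the separator in the result.
parts = re.split("(" + re.escape(separator) + ")", string)
print(parts)
# ["UNH+1'", 'RFF+AAJ:', "111'LIN+1'", 'RFF+AAJ:', "222'LIN+2'"]

# Glue each separator back onto the chunk that follows it.
head = parts[0]
chunks = [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
print(chunks)
# ["RFF+AAJ:111'LIN+1'", "RFF+AAJ:222'LIN+2'"]
```

With plain str.split("RFF+AAJ:") no escaping is needed, but the separator is dropped from the result.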

Python: Replace all substring occurrences with regular expressions

I would like to replace all substring occurrences with regular expressions. The original sentences would be like:
mystring = "Carl's house is big. He is asking 1M for that(the house)."
Now let's suppose I have two substrings I would like to bold. I bold the words by adding ** at the beginning and at the end of the substring. The 2 substrings are:
substring1 = "house", so bolded it would be "**house**"
substring2 = "the house", so bolded it would be "**the house**"
At the end I want the original sentence like this:
mystring = "Carl's **house** is big. He is asking 1M for that(**the house**)."
The main problem is that, as I have several substrings to replace, they can overlap, as in the example above. If I process the longest substring first, I get this:
Carl's **house** is big. He is asking 1M for that(**the **house****).
On the other hand, if I process the shortest substring first, I get this:
Carl's **house** is big. He is asking 1M for that(the **house**).
It seems I will need to replace from the longest substring to the shortest, but I wonder how I can make a match count in the first replacement but not be touched again in the second. Also remember that a substring can appear several times in the string.
Note: Suppose the string ** never occurs in the original string, so we can use it to bold our words.
You can search for all of the strings at once, so that the fact that one is a substring of another doesn't matter:
re.sub(r"(house|the house)", r"**\1**", mystring)
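For an arbitrary list of substrings, the pattern can be built programmatically. The sketch below uses a hypothetical helper name, bold_all, and assumes the list is non-empty and contains no empty strings.

```python
import re

def bold_all(text, substrings):
    """Wrap every occurrence of the given substrings in ** in one pass.

    Hypothetical helper; assumes a non-empty list of non-empty strings.
    """
    # Longest first, so that a longer phrase wins over a shorter one that is
    # its prefix (e.g. "houseboat" vs "house"); re.escape() guards against
    # regex metacharacters inside the substrings.
    ordered = sorted(substrings, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    return re.sub("(" + pattern + ")", r"**\1**", text)

mystring = "Carl's house is big. He is asking 1M for that(the house)."
print(bold_all(mystring, ["house", "the house"]))
# Carl's **house** is big. He is asking 1M for that(**the house**).
```

Because re.sub scans the text once and never revisits what it has already replaced, the inserted ** markers cannot be matched again.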
You could have a group that is not captured and is not required. If you look at the regex pattern (?P<repl>(?:the )?house), the (?:the )? part says that there might be a "the " in the string; if it is present, include it in the match. This way, you let the re library optimize the way it matches. Here is the complete example:
>>> import re
>>> data = "Carl's house is big. He is asking 1M for that(the house)."
>>> re.sub(r'(?P<repl>(?:the )?house)', r'**\g<repl>**', data)
"Carl's **house** is big. He is asking 1M for that(**the house**)."
Note: \g<repl> inserts the whole string matched by the group named repl.
You could do two passes:
First: Go through from longest to shortest and replace with something like:
'the house': 'AA_THE_HOUSE'
'house': 'BB_HOUSE'
Second: Go through and replace like:
'AA_THE_HOUSE': '**the house**'
'BB_HOUSE': '**house**'
Replace the strings with some unique values and then replace those back with the original strings enclosed in ** to make them bold. The temporary values must not contain any of the search strings themselves, otherwise the next replacement would alter them.
For example:
'the house' with 'TEMP_THE_HOUSE'
'house' with 'TEMP_HOUSE'
then 'TEMP_HOUSE' with '**house**'
'TEMP_THE_HOUSE' with '**the house**'
Should work fine. You can automate this by using two lists.

Finding various string repeats in python in next 10 characters

So I'm working on a problem where I have to find various string repeats after encountering an initial string. Say we take ACTGAC; the data file has sequences that look like:
AAACTGACACCATCGATCAGAACCTGA
So in that string, once we find ACTGAC, I need to analyze the next 10 characters for string repeats that follow certain rules. I have the rules coded, but can anyone show me how, once I find the string that I need, I can make a substring of the next ten characters to analyze? I know that the str.partition function can do that once I find the string, and then [1:10] can get the next ten characters.
Thanks!
You almost have it already (but note that indexes start counting from zero in Python).
The partition method will split a string into head, separator, tail, based on the first occurrence of separator.
So you just need to take a slice of the first ten characters of the tail:
>>> data = 'AAACTGACACCATCGATCAGAACCTGA'
>>> head, sep, tail = data.partition('ACTGAC')
>>> tail[:10]
'ACCATCGATC'
Python allows you to leave out the start-index in slices (it defaults to zero, the start of the string), and also the end-index (it defaults to the length of the string).
Note that you could also do the whole operation in one line, like this:
>>> data.partition('ACTGAC')[2][:10]
'ACCATCGATC'
So, based on marcog's answer in "Find all occurrences of a substring in Python", I propose the following, which also finds overlapping occurrences:
>>> import re
>>> data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
>>> sep = 'ACTGAC'
>>> [data[m.start()+len(sep):][:10] for m in re.finditer('(?=%s)'%sep, data)]
['ACCATCGATC', 'TGACAAA', 'AAA']
