Getting Substring after certain char in python - python

I want to cut a string such as "0011165.jpg_Fish" to get only Fish, that is, everything after the "_". How do I do that in Python?
Thank you very much!

Use str.partition instead of str.split. It is more robust because it always returns three items, whereas split can be tricky to handle if the input string doesn't contain the separator:
>>> word = '0011165.jpg_Fish'
>>> not_required, split_char, required = word.partition('_')
>>> required
'Fish'
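To make the robustness point concrete, here is a minimal sketch of what both methods return when the separator is missing. The second input string and the variable names are illustrative assumptions, not part of the question.

```python
# Compare str.partition and str.split when the separator is absent.
with_sep = "0011165.jpg_Fish"
without_sep = "0011165.jpg"  # hypothetical input with no underscore

# partition always returns a 3-tuple: (before, separator, after).
print(with_sep.partition("_"))     # ('0011165.jpg', '_', 'Fish')
print(without_sep.partition("_"))  # ('0011165.jpg', '', '')

# split returns a list whose length depends on the input,
# so indexing [1] would raise IndexError here.
print(without_sep.split("_"))      # ['0011165.jpg']

# An empty separator in the result signals "not found".
_, sep, label = without_sep.partition("_")
label_or_none = label if sep else None
print(label_or_none)               # None
```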

Try
"0011165.jpg_Fish".split("_")[1]
And for a pandas DataFrame column:
train['Label'] = train.Image_Labels.str.split("_").str[1]
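One thing to watch with index [1]: it returns only the second piece, which differs from "everything after the underscore" once a name contains more than one underscore. A small sketch, using a made-up file name as an assumption:

```python
name = "my_photo.jpg_Fish"  # hypothetical name with two underscores

second_piece = name.split("_")[1]       # only the piece between the underscores
after_first = name.split("_", 1)[1]     # everything after the first underscore
after_last = name.rsplit("_", 1)[-1]    # everything after the last underscore

print(second_piece)  # photo.jpg
print(after_first)   # photo.jpg_Fish
print(after_last)    # Fish
```

Which variant is right depends on whether the label or the file name may itself contain underscores.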

Related

Python Best Way to Delete matches?

I want to delete multiple strings from a phrase in python.
For example I want to delete: apple, orange, tomato
How can I do that easily without writing 10 replaces like this:
str = str.replace('apple','').replace(....).replace(....)
Any time you are repeating yourself, think of a loop instead.
for word in ('apple','cherry','tomato','grape'):
    str = str.replace(word,'')
And, by the way, str is a poor name for a variable, since it's the name of a type.
You could also use re.sub and list the words in a group between word boundaries \b
import re
s = re.sub(r"\b(?:apple|orange|tomato)\b", "", s)
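If the word list is not fixed, the same pattern can be built from a list. This is a sketch under the assumption that the words live in a Python list; the names words, pattern and phrase are illustrative. It also shows how the word-boundary version differs from plain str.replace.

```python
import re

words = ["apple", "orange", "tomato"]

# re.escape protects any regex metacharacters inside the words.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")

phrase = "apple pie, orange juice and pineapple"

# \b keeps "pineapple" intact because "apple" is not a whole word there.
cleaned = pattern.sub("", phrase)
print(repr(cleaned))                     # ' pie,  juice and pineapple'

# Plain replace has no notion of word boundaries.
print(repr(phrase.replace("apple", "")))  # ' pie, orange juice and pine'
```

Note that the surrounding spaces are left behind, so a follow-up whitespace cleanup may be wanted.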

Why doesn't replace () change all occurrences?

I have the following code:
dna = "TGCGAGAAGGGGCGATCATGGAGATCTACTATCCTCTCGGGGTATGGTGGGGTTGAGA"
print(dna.count("GAGA"))
dna = dna.replace("GAGA", "AGAG")
print(dna.count("GAGA"))
Replace does not replace all occurrences. Could somebody help me understand why this happens?
It does replace all occurrences. That can create new occurrences (look at your replacement string!).
I'd say, logically, all is fine.
You could repeat the replacement while dna.count("GAGA") > 0, but that doesn't sound like what you should be doing. (I bet you really just want one round of replacement to simulate something specific happening. I'm not a genetics expert, though.)
It did make all replacements (that's what .replace() does in Python unless specified otherwise), but some of these replacements inadvertently introduced new instances of GAGA. Take the beginning of your string:
TGCGAGAA
There's GAGA at indices 3-6. If you replace that with AGAG, you get
TGCAGAGA
So the trailing GAG of that AGAG, together with the A that already followed it, forms a new GAGA.
Replacements do not occur "until exhausted"; they occur wherever the substring matches in your original string.
Consider the following from your string:
>>> a = "TGCGAGAA"
>>> a.replace("GAGA", "AGAG")
'TGCAGAGA'
>>>
The replacement does not happen again, since the original string did not match GAGA in that location.
If you want to do the replacement until no match is found, you can wrap it in a loop:
>>> while a.count("GAGA") > 0: # you probably don't want to use count here if the string is long because of performance considerations
...     a = a.replace("GAGA", "AGAG")
...
>>> a
'TGCAAGAG'
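The loop above can be wrapped in a small helper that tests membership with in rather than count. This is only a sketch: the function name and the max_rounds guard are assumptions added here, the guard being a safety net for cases where the replacement itself contains the search string and the loop would otherwise never end.

```python
def replace_until_stable(text, old, new, max_rounds=1000):
    """Repeat str.replace until `old` no longer occurs (hypothetical helper)."""
    rounds = 0
    # `old in text` stops at the first hit, unlike count, which scans everything.
    while old in text and rounds < max_rounds:
        text = text.replace(old, new)
        rounds += 1
    return text


single_pass = "TGCGAGAA".replace("GAGA", "AGAG")
print(single_pass)                # TGCAGAGA
print(single_pass.count("GAGA"))  # 1, the occurrence created by the replacement

print(replace_until_stable("TGCGAGAA", "GAGA", "AGAG"))  # TGCAAGAG
```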

extract first three numbers from a string

I have strings like
"ABCD_ABCD_6.2.15_3.2"
"ABCD_ABCD_12.22.15_4.323"
"ABCD_ABCD_2.33.15_3.223"
I want to extract following from above
"6.2.15"
"12.22.15"
"2.33.15"
I tried using the indices of the numbers, but I can't because they vary. The only constant is the length of the prefix at the beginning of each string.
Another way would be this regex:
_(\d+.*?)_
import re
m = re.search('_(\\d+.*?)_', 'ABCD_ABCD_6.2.15_3.2')
m.group(1)
There are a ton of ways to do this. Try:
>>> "ABCD_ABCD_6.2.15_3.2".split("_")[2]
'6.2.15'
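If the goal is specifically "three numbers separated by dots", a stricter pattern than the ones above is possible. A sketch, assuming every input contains such a group; version_re and samples are illustrative names:

```python
import re

samples = [
    "ABCD_ABCD_6.2.15_3.2",
    "ABCD_ABCD_12.22.15_4.323",
    "ABCD_ABCD_2.33.15_3.223",
]

# Three runs of digits separated by literal dots.
version_re = re.compile(r"\d+\.\d+\.\d+")

# search returns the first (leftmost) match in each string.
results = [version_re.search(s).group(0) for s in samples]
print(results)  # ['6.2.15', '12.22.15', '2.33.15']
```

If an input might lack the group, check the result of search for None before calling group.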

String splitting in python by finding non-zero character

I want to do the following split:
input: 0x0000007c9226fc output: 7c9226fc
input: 0x000000007c90e8ab output: 7c90e8ab
input: 0x000000007c9220fc output: 7c9220fc
I use the following line of code to do this but it does not work!
split = element.rpartition('0')
I got these outputs which are wrong!
input: 0x000000007c90e8ab output: e8ab
input: 0x000000007c9220fc output: fc
What is the fastest way to do this kind of split?
The only idea I have right now is to loop and check each character, but that is a little time consuming.
I should mention that the number of zeros in input is not fixed.
Each string can be converted to an integer using int() with a base of 16. Then convert back to a string.
for s in '0x000000007c9226fc', '0x000000007c90e8ab', '0x000000007c9220fc':
    print('%x' % int(s, 16))
Output
7c9226fc
7c90e8ab
7c9220fc
input[2:].lstrip('0')
That should do it. The [2:] skips over the leading 0x (which I assume is always there), then the lstrip('0') removes all the zeros from the left side.
In fact, lstrip removes any leading characters drawn from the set it is given, so we can simplify:
input.lstrip('x0')
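One edge case with the lstrip approach is an input whose digits are all zeros, which strips down to an empty string. A minimal sketch of a guard, assuming the prefix is always 0x or 0X; strip_hex_prefix is a hypothetical helper name:

```python
def strip_hex_prefix(text):
    """Drop a leading 0x and leading zeros, keeping a single 0 for zero."""
    digits = text[2:] if text.lower().startswith("0x") else text
    # lstrip("0") returns "" for an all-zero value, so fall back to "0".
    return digits.lstrip("0") or "0"


print(strip_hex_prefix("0x000000007c9226fc"))  # 7c9226fc
print(strip_hex_prefix("0x0000"))              # 0
print(repr("0x0000".lstrip("x0")))             # '' (the unguarded version)
```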
format is handy for this:
>>> print('{:x}'.format(0x000000007c90e8ab))
7c90e8ab
>>> print('{:x}'.format(0x000000007c9220fc))
7c9220fc
In this particular case you can just do
your_input[10:]
You'll most likely want to properly parse this; your idea of splitting on separation of non-zero does not seem safe at all.
Seems to be the XY problem.
If the number of characters in a string is constant then you can use the following code.
input = "0x000000007c9226fc"
output = input[10:]
Also, since you are using rpartition, which is documented as
str.rpartition(sep)
Split the string at the last occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself.
Since your input can contain multiple 0s and rpartition splits only at the last occurrence, your code returns the wrong part.
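Running rpartition on the inputs from the question shows where the split actually lands. A short sketch; the variable names are illustrative:

```python
value = "0x000000007c90e8ab"

# The last '0' in the string sits inside the significant digits ("...c90e8ab"),
# so everything before it, including "7c9", ends up in the first element.
before, sep, after = value.rpartition("0")
print(before)  # 0x000000007c9
print(after)   # e8ab

second = "0x000000007c9220fc".rpartition("0")
print(second)  # ('0x000000007c922', '0', 'fc')
```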
A regular expression for 0x followed by zeros is (0x[0]+); replace the match with an empty string.
import re
st="0x000007c922433434000fc"
reg='(0x[0]+)'
rep=re.sub(reg, '',st)
print(rep)

Regexp matching equal number of the same character on each side of a string

How do you match only equal numbers of the same character (up to 3) on each side of a string in python?
For example, let's say I am trying to match equal signs
=abc= or ==abc== or ===abc===
but not
=abc== or ==abc=
etc.
I figured out how to do each individual case, but can't seem to get all of them.
(={1}(?=abc={1}))abc(={1})
Combining the cases with | for the same character, as in
((={1}(?=abc={1}))|(={2}(?=abc={2})))abc(={1}|={2})
doesn't seem to work.
Use the following regex:
^(=+)abc\1$
Edit:
If you want to allow at most three = signs:
^(={1,3})abc\1$
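A quick way to check the backreference pattern against the cases from the question. This sketch uses re.fullmatch in place of the ^ and $ anchors; the candidate list is an assumption made up for the demonstration.

```python
import re

# \1 must repeat exactly what group 1 captured, so both sides are equal.
balanced = re.compile(r"(={1,3})abc\1")

candidates = ["=abc=", "==abc==", "===abc===", "=abc==", "==abc=", "====abc===="]

# fullmatch requires the whole string to match, like ^...$ in the pattern above.
matches = {c: bool(balanced.fullmatch(c)) for c in candidates}
for text, ok in matches.items():
    print(text, ok)
```

The first three candidates match; the unbalanced ones and the four-sign one do not.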
This is not a regular language. However, you can do it with backreferences:
(=+)[^=]+\1
Assuming the sample is a single string, here's a non-regex approach (one of many):
>>> string="===abc==="
>>> string.replace("abc"," ").split(" ")
['===', '===']
>>> a,b = string.replace("abc"," ").split(" ")
>>> if a == b:
...     print("ok")
...
ok
You said you want to match equal characters on each side, so regardless of what characters, you just need to check a and b are equal.
You are going to want to use a back reference. Check this post for an example:
Regex, single quote or double quote
