Python string split decimals from end of string - python

I use nlst on an FTP server, which returns the directory entries as a list. The returned list looks like this:
[xyz123, abcde345, pqrst678]
I have to separate each element of the list into two parts such that part1 = xyz and part2 = 123, i.e. split the string at the beginning of the integer part. Any help on this will be appreciated!

>>> re.findall(r'\d+|[a-z]+', 'xyz123')
['xyz', '123']

For example, using the re module:
>>> import re
>>> items = ['xyz123', 'ABCDE345', 'pqRst678']
>>> regex = r'(\D+)(\d+)'
>>> for item in items:
...     m = re.match(regex, item)
...     alpha, digits = m.groups()
...     print(alpha, digits)
...
xyz 123
ABCDE 345
pqRst 678

Use the regular expression module re:
import re
def splitEntry(entry):
    # re.search, not re.match: the digits sit at the end, not the start
    firstDecMatch = re.search(r"\d+$", entry)
    alpha, numeric = "", ""
    if firstDecMatch:
        pos = firstDecMatch.start(0)
        alpha, numeric = entry[:pos], entry[pos:]
    else:  # no digits found at end of string
        alpha = entry
    return (alpha, numeric)
Note that the regular expression is `\d+$`, which matches the run of digits at the end of the string. Digits in the first part are not counted, e.g. xy3zzz134 -> "xy3zzz", "134". I opted for that because you say you are expecting filenames, and filenames can include numbers. Of course it's still a problem if the name part of the filename itself ends with numbers.
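A minimal sketch of the trailing-digits split described above, using `re.search` with the same `\d+$` pattern. The helper name `split_trailing_digits` is hypothetical and not part of the original answer.

```python
import re

# Same pattern as in the answer: a run of digits anchored at the end.
_TRAILING_DIGITS = re.compile(r"\d+$")

def split_trailing_digits(entry):
    """Return (prefix, trailing_digits); trailing_digits is '' if there are none."""
    match = _TRAILING_DIGITS.search(entry)
    if match is None:
        return entry, ""
    return entry[:match.start()], entry[match.start():]

for name in ["xyz123", "xy3zzz134", "nodigits"]:
    print(split_trailing_digits(name))
```

Digits inside the name part (as in xy3zzz134) stay with the prefix, which matches the behaviour the answer describes.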

Another non-re answer:
>>> import itertools
>>> [''.join(x[1]) for x in itertools.groupby('xyz123', lambda x: x.isalpha())]
['xyz', '123']

If you don't want to use regex, then you can do something like this. Note that I have not tested this, so there could be a bug or typo somewhere.
items = ["xyz123", "abcde345", "pqrst678"]  # avoid naming a variable `list`
newlist = []
for item in items:
    for pos in range(len(item)):
        if item[pos].isdigit():
            # split at the first digit and stop scanning this item
            newlist.append([item[:pos], item[pos:]])
            break

>>> import re
>>> [re.findall(r'(.*?)(\d+$)',x)[0] for x in ['xyz123','ABCDE345','pqRst678']]
[('xyz', '123'), ('ABCDE', '345'), ('pqRst', '678')]

I don't think it's that difficult without re:
>>> s = "xyz123"
>>> for n, i in enumerate(s):
...     if i.isdigit(): x = n; break
...
>>> [s[:x], s[x:]]
['xyz', '123']
>>> s = "abcde345"
>>> for n, i in enumerate(s):
...     if i.isdigit(): x = n; break
...
>>> [s[:x], s[x:]]
['abcde', '345']
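Note that the loop above leaves `x` unset (or stale from a previous string) when a string contains no digit at all. One regex-free alternative, sketched here under the assumption that only the trailing run of ASCII digits matters, is `str.rstrip`; the function name is hypothetical.

```python
import string

def split_at_trailing_digits(s):
    # Strip ASCII digits from the right; whatever was removed is the numeric tail.
    head = s.rstrip(string.digits)
    return head, s[len(head):]

for entry in ["xyz123", "abcde345", "pqrst678", "nodigits"]:
    print(split_at_trailing_digits(entry))
```

An entry without trailing digits simply comes back with an empty second part, so no special case is needed.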

Related

isolate data from long string [duplicate]

Suppose I had a string
string1 = "498results should get"
Now I need to get only integer values from the string like 498. Here I don't want to use list slicing because the integer values may increase like these examples:
string2 = "49867results should get"
string3 = "497543results should get"
So I want to extract only the integer values from each string, in the same order, i.e. 498, 49867, and 497543 from string1, string2, and string3 respectively.
Can anyone let me know how to do this in one or two lines?
>>> import re
>>> string1 = "498results should get"
>>> int(re.search(r'\d+', string1).group())
498
If there are multiple integers in the string (list() is needed on Python 3, where map returns an iterator):
>>> list(map(int, re.findall(r'\d+', string1)))
[498]
An answer taken from ChristopheD here: https://stackoverflow.com/a/2500023/1225603
r = "456results string789"
s = ''.join(x for x in r if x.isdigit())
print(int(s))
456789
Here's your one-liner, without using any regular expressions, which can get expensive at times:
>>> ''.join(filter(str.isdigit, "1234GAgade5312djdl0"))
returns:
'123453120'
If you have multiple sets of numbers, then this is another option:
>>> import re
>>> print(re.findall(r'\d+', 'xyz123abc456def789'))
['123', '456', '789']
It's no good for floating-point number strings, though.
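For strings that do contain decimal numbers, a pattern with an optional fractional part is one possibility. The following is only a sketch: the pattern, the function name, and the sample string are illustrative assumptions, and it deliberately ignores signs, exponents, and forms such as `.5`.

```python
import re

# Assumed pattern: digits, optionally followed by a dot and more digits.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def extract_numbers(text):
    """Return every integer or decimal number in text as a float."""
    return [float(token) for token in NUMBER.findall(text)]

print(extract_numbers("width 3.5 height 12 depth 0.25"))  # [3.5, 12.0, 0.25]
```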
Iterator version
>>> import re
>>> string1 = "498results should get"
>>> [int(x.group()) for x in re.finditer(r'\d+', string1)]
[498]
>>> import itertools
>>> int(''.join(itertools.takewhile(lambda s: s.isdigit(), string1)))
With Python 3.6, each of these two lines returns a list (which may be empty):
>>> [int(x) for x in re.findall(r'\d+', your_string)]
Similar to:
>>> list(map(int, re.findall(r'\d+', your_string)))
This approach uses a list comprehension: pass the string to the function and it returns a list of the integers that appear in it as separate whitespace-delimited tokens.
def getIntegers(string):
    numbers = [int(x) for x in string.split() if x.isnumeric()]
    return numbers
Like this:
print(getIntegers('this text contains some numbers like 3 5 and 7'))
Output:
[3, 5, 7]
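def function(string):
    # collect the leading digits and stop at the first non-digit character
    final = ''
    for i in string:
        try:
            final += str(int(i))
        except ValueError:
            return int(final)
    return int(final)  # reached only if the whole string was digits
print(function("4983results should get"))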
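Another option is to remove the trailing letters using rstrip and string.ascii_lowercase (to get the letters):
import string
# the three example strings from the question
strings = ["498results should get", "49867results should get", "497543results should get"]
out = [int(s.replace(' ', '').rstrip(string.ascii_lowercase)) for s in strings]
print(out)
Output:
[498, 49867, 497543]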
integerstring = ""
string1 = "498results should get"
for i in string1:
    if i.isdigit():
        integerstring = integerstring + i
print(integerstring)

How to choose a certain position to split a string by "_"?

I have a string like this '00004079_20150427_5_169_192_114.npz', and I want to split it into this ['00004079_20150427_5', '169_192_114.npz'].
I tried the Python string split() method:
a = '00004079_20150427_5_169_192_114.nii.npz'
a.split("_", 3)
but it returned this:
['00004079', '20150427', '5', '169_192_114.nii.npz']
How can I split this into 2 parts by the third "_" appearance?
I also tried this:
reg = ".*\_.*\_.\_"
re.split(reg, a)
but it returns:
['', '169_192_114.nii.npz']
You can split the string on the delimiter _ up to 3 times and then join everything except the last value back together:
>>> s = '00004079_20150427_5_169_192_114.npz'
>>> *start, end = s.split('_', 3)
>>> start = '_'.join(start)
>>>
>>> start
'00004079_20150427_5'
>>> end
'169_192_114.npz'
For Python 2, you can do this instead:
>>> lst = s.split('_', 3)
>>> end = lst.pop()
>>> start = '_'.join(lst)
>>>
>>> start
'00004079_20150427_5'
>>> end
'169_192_114.npz'
One possible approach (if going with regex):
import re
s = '00004079_20150427_5_169_192_114.nii.npz'
res = re.search(r'^((?:[^_]+_){2}[^_]+)_(.+)', s)
print(res.groups())
The output:
('00004079_20150427_5', '169_192_114.nii.npz')
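The same idea can be generalised to any separator and any occurrence. This is a sketch only; `split_at_nth` is a hypothetical helper name, and returning the whole string with an empty second part when there are too few separators is an assumed design choice.

```python
def split_at_nth(text, sep, n):
    """Split text into two parts at the n-th occurrence of sep (1-based)."""
    parts = text.split(sep)
    if n < 1 or len(parts) <= n:
        # Fewer than n separators: nothing to split off.
        return text, ""
    return sep.join(parts[:n]), sep.join(parts[n:])

print(split_at_nth("00004079_20150427_5_169_192_114.npz", "_", 3))
# ('00004079_20150427_5', '169_192_114.npz')
```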

mess with regex python

I am trying the following code, but it seems that I am doing something wrong.
import re
lista = ["\\hola\\01\\02Jan\\05\\03",
         "\\hola\\01\\02Dem\\12",
         "\\hola\\01\\02March\\12\\04"]
for l in lista:
    m = re.search("\\\\\d{2,2}\\\\\d{2,2}[a-zA-Z]+\\\\\d{2,2}\s", l)
    if m:
        print(m.group(0))
Only the second string should match.
I have tried it without \s, but then the pattern matches all of the strings.
You can try this regex:
>>> import re
>>> lista = [r"\hola\01\02Jan\05\03", r"\hola\01\02Dem\12", r"\hola\01\02March\12\04"]
>>> for l in lista:
...     m = re.search(r"\\\d{2,2}\\\d{2,2}[a-zA-Z]+\\\d{2}$", l)
...     if m:
...         print(m.group())
...
Output:
\01\02Dem\12
Use the r"..." form to declare the regex and the input as raw strings.
Use the anchor $ to avoid matching unwanted input.
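To illustrate the two points above, here is a small sketch assuming Python 3. It compares an unanchored search, a search anchored with `$`, and `re.fullmatch` as one alternative way to anchor at both ends; the `\\hola` prefix in the last pattern is an assumption added for the illustration.

```python
import re

paths = [r"\hola\01\02Jan\05\03", r"\hola\01\02Dem\12", r"\hola\01\02March\12\04"]

# In a raw string, r"\\" is the regex for one literal backslash.
tail = r"\\\d{2}\\\d{2}[a-zA-Z]+\\\d{2}"

unanchored = [p for p in paths if re.search(tail, p)]
anchored = [p for p in paths if re.search(tail + r"$", p)]
whole = [p for p in paths if re.fullmatch(r"\\hola" + tail, p)]

print(len(unanchored))  # 3: every path contains the pattern somewhere
print(anchored)         # only the path that ends right after the pattern
print(whole)
```

Without the anchor, the pattern is found inside all three strings, which is why dropping `\s` in the question matched everything.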
You can use the following code without regex:
>>> for l in lista:
...     totalNo = l.count('\\')
...     if totalNo == 4:
...         print(l)
...

list.append() where am I wrong?

I have a string which is very long. I would like to split this string into substrings 16 characters long, skipping one character every time (e.g. substring1=first 16 elements of the string, substring2 from element 18 to element 34 and so on) and list them.
I wrote the following code:
string = "abcd..."
list = []
for j in range(0, int(len(string)/17) - 1):
    list.append(string[int(j*17):int(j*17+16)])
But it returns:
list=[]
I can't figure out what is wrong with this code.
>>> string="abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
Your original code, without masking the built-in (it excludes the final full-length string and any partial string after it):
>>> l = []
>>> for j in range(0,int(len(string)/17)-1):
...     l.append(string[int(j*17):int(j*17+16)])
...
>>> l
['abcdefghijklmnop', 'rstuvwxyzabcdefg', 'ijklmnopqrstuvwx']
A cleaned version that includes all possible strings:
>>> l = []
>>> for j in range(0,len(string),17):
...     l.append(string[j:j+16])
...
>>> l
['abcdefghijklmnop', 'rstuvwxyzabcdefg', 'ijklmnopqrstuvwx', 'zabcdefghijklmno', 'qrstuvwxyz']
How about we turn that last one into a comprehension? Everyone loves comprehensions.
>>> l = [string[j:j+16] for j in range(0,len(string),17)]
We can filter out strings that are too short if we want to:
>>> l = [string[j:j+16] for j in range(0,len(string),17) if len(string[j:j+16])>=16]
It does work, but only for strings that are long enough (34 characters or more). You have
range(0,int(len(string)/17)-1)
but, for the string "abcd...", int(len(string)/17)-1 is -1, so the range is empty. Add some logic to catch the short-string case and you're good:
...
for j in range(0, max(1, int(len(string)/17)-1)):
...
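The arithmetic can be checked directly. In the sketch below, `chunks_with_gap` is a hypothetical helper (not from any of the answers) that steps through the string by `size + gap` and optionally drops a short final piece, which avoids computing the number of iterations by hand.

```python
def chunks_with_gap(text, size=16, gap=1, keep_partial=True):
    """Return pieces of `size` characters, skipping `gap` characters between them."""
    pieces = [text[i:i + size] for i in range(0, len(text), size + gap)]
    if not keep_partial:
        pieces = [p for p in pieces if len(p) == size]
    return pieces

short = "abcd..."
print(int(len(short) / 17) - 1)   # -1, so range(0, -1) is empty
print(chunks_with_gap(short))     # ['abcd...']
print(chunks_with_gap("a" * 16 + "-" + "b" * 16, keep_partial=False))
```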
Does this work?
>>> from string import ascii_lowercase
>>> s = ascii_lowercase * 2
>>> s
'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'
>>> spl = [s[i:i+16] for i in range(0, len(s), 17)]
>>> spl
['abcdefghijklmnop', 'rstuvwxyzabcdefg', 'ijklmnopqrstuvwx', 'z']
The following should work:
#!/usr/bin/python
string = "abcdefghijklmnopqrstuvwxyz"
liszt = []
leng = 5
for j in range(0, len(string) // leng):
    ibeg = j * (leng + 1)
    liszt.append(string[ibeg:ibeg + leng])
if ibeg + leng + 1 < len(string):
    liszt.append(string[ibeg + leng:])
print(liszt)

regular expression in python between two words

I am trying to get values out of this list:
l1 = [u'/worldcup/archive/southafrica2010/index.html', u'/worldcup/archive/germany2006/index.html', u'/worldcup/archive/edition=4395/index.html', u'/worldcup/archive/edition=1013/index.html', u'/worldcup/archive/edition=84/index.html', u'/worldcup/archive/edition=76/index.html', u'/worldcup/archive/edition=68/index.html', u'/worldcup/archive/edition=59/index.html', u'/worldcup/archive/edition=50/index.html', u'/worldcup/archive/edition=39/index.html', u'/worldcup/archive/edition=32/index.html', u'/worldcup/archive/edition=26/index.html', u'/worldcup/archive/edition=21/index.html', u'/worldcup/archive/edition=15/index.html', u'/worldcup/archive/edition=9/index.html', u'/worldcup/archive/edition=7/index.html', u'/worldcup/archive/edition=5/index.html', u'/worldcup/archive/edition=3/index.html', u'/worldcup/archive/edition=1/index.html']
I'm trying to use a regular expression, starting off with something like this:
m = re.search(r"\d+", l)
print(m.group())
but I want the value between "archive/" and "/index.html".
I googled and tried something like (?<=archive/\/index.html).*(?=\/index.html:)
but it didn't work for me. How can I get my result list as:
result = ['germany2006','edition=4395','edition=1013' , ...]
If you know for sure that the pattern will always match, you can use this:
import re
print([re.search("archive/(.*?)/index.html", l).group(1) for l in l1])
Or you can simply split like this:
print([l.rsplit("/", 2)[-2] for l in l1])
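If some entries might not match, calling `.group(1)` on the `None` that `re.search` returns raises `AttributeError`. One way to guard against that is sketched below; the helper name and the shortened sample list (including the non-matching entry) are assumptions for illustration.

```python
import re

SEGMENT = re.compile(r"archive/(.*?)/index\.html")

def archive_segments(urls):
    """Return the text between 'archive/' and '/index.html' for URLs that match."""
    found = []
    for url in urls:
        m = SEGMENT.search(url)
        if m:  # skip entries where the pattern does not occur
            found.append(m.group(1))
    return found

sample = [
    "/worldcup/archive/germany2006/index.html",
    "/worldcup/archive/edition=4395/index.html",
    "/worldcup/about.html",  # assumed non-matching entry for illustration
]
print(archive_segments(sample))  # ['germany2006', 'edition=4395']
```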
You can use the code below; it should solve your problem.
>>> import re
>>> p = '/worldcup/archive/southafrica2010/index.html'
>>> r = re.compile('archive/(.*?)/index.html')
>>> m = r.search(p)
>>> m.group(1)
'southafrica2010'
Look-arounds are what you need. Use them like this:
>>> [re.search(r"(?<=archive/).*?(?=/index.html)", s).group() for s in l1]
[u'southafrica2010', u'germany2006', u'edition=4395', u'edition=1013', u'edition=84', u'edition=76', u'edition=68', u'edition=59', u'edition=50', u'edition=39', u'edition=32', u'edition=26', u'edition=21', u'edition=15', u'edition=9', u'edition=7', u'edition=5', u'edition=3', u'edition=1']
The regular expression
m = re.search(r'(?<=archive\/).+(?=\/index.html)', s)
can solve this, assuming that s is a string from your list.
