joining output from regex search - python

I have a regex that looks for numbers in a file.
I put the results in a list.
The problem is that it prints each result on a new line for every single number it finds. It also ignores the list I've created.
What I want is to have all the numbers in one list.
I used join() but it doesn't work.
Code:
def readfile():
    regex = re.compile('\d+')
    for num in regex.findall(open('/path/to/file').read()):
        lst = [num]
        jn = ''.join(lst)
        print(jn)
Output:
122
34
764

What goes wrong:
# this iterates over the numbers you find, one by one
for num in regex.findall(open('/path/to/file').read()):
    lst = [num]        # this puts one number back into a new list
    jn = ''.join(lst)  # this gets the number back out of the new list
    print(jn)          # this prints one number
Fixing it:
Reading the documentation for re.findall() shows you that it already returns a list.
There is no (or not much) need to use a for loop on it to print it.
If you want a list, simply use re.findall()'s return value. If you want to print it, use one of the methods from "Printing an int list in a single line python3" (there are several more posts on SO about printing on one line):
import re
my_r = re.compile(r'\d+') # define pattern as raw-string
numbers = my_r.findall("123 456 789") # get the list
print(numbers)
# different methods to print a list on one line
# adjust sep / end to fit your needs
print( *numbers, sep=", ") # print #1
for n in numbers[:-1]: # print #2
    print(n, end = ", ")
print(numbers[-1])
print(', '.join(numbers)) # print #3
Output:
['123', '456', '789'] # list of found strings that are numbers
123, 456, 789
123, 456, 789
123, 456, 789
Documentation:
print() function for sep= and end=
Printing an int list in a single line python3
Convert all strings in a list to int ... if you need the list as numbers
More on printing in one line:
Print in one line dynamically
Python: multiple prints on the same line
How to print without newline or space?
Print new output on same line

In your case, regex.findall() returns a list, and you are joining and printing in each iteration.
That is why you're seeing this problem.
You can try something like this.
numbers.txt
Xy10Ab
Tiger20
Beta30Man
56
My45one
statements:
>>> import re
>>>
>>> regex = re.compile(r'\d+')
>>> lst = []
>>>
>>> for num in regex.findall(open('numbers.txt').read()):
...     lst.append(num)
...
>>> lst
['10', '20', '30', '56', '45']
>>>
>>> jn = ''.join(lst)
>>>
>>> jn
'1020305645'
>>>
>>> jn2 = '\n'.join(lst)
>>> jn2
'10\n20\n30\n56\n45'
>>>
>>> print(jn2)
10
20
30
56
45
>>>
>>> nums = [int(n) for n in lst]
>>> nums
[10, 20, 30, 56, 45]
>>>
>>> sum(nums)
161
>>>

Use the list's built-in append() method to add new values.
def readfile():
    regex = re.compile(r'\d+')
    lst = []
    for num in regex.findall(open('/path/to/file').read()):
        lst.append(num)
    print(lst)
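
Since findall() already returns a list, the loop can be dropped altogether. The following is a minimal sketch of that idea, not the asker's actual code: the name read_numbers, its path parameter, and the throwaway demo file are assumptions made for illustration. It also uses a with block so the file gets closed.

```python
import os
import re
import tempfile

NUMBER_RE = re.compile(r'\d+')  # raw string, so \d is not treated as a string escape


def read_numbers(path):
    """Return every run of digits in the file as a list of strings (hypothetical helper)."""
    with open(path) as fh:  # the with block closes the file for us
        return NUMBER_RE.findall(fh.read())


# Demo on a temporary file (an assumption, so the sketch runs anywhere).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Xy10Ab\nTiger20\nBeta30Man\n56\nMy45one\n')

numbers = read_numbers(tmp.name)
os.remove(tmp.name)

print(numbers)             # the whole list at once
print(', '.join(numbers))  # or joined on one line
```

If the numbers are needed as integers, [int(n) for n in numbers] converts them afterwards.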

Related

Get the last part of a string in a list

I have a list which contains the following string element.
myList = ['120$My life cycle 3$121$My daily routine 2']
I perform a .split("$") operation and get the following new list.
templist = str(myList).split("$")
I want to be able to store all the integer values from this templist that are at even indexes after the split. I want to return a list of integers.
Expected output: [120, 121]
You can split at $ and use a list comprehension with str.isdigit() to extract numbers:
mylist = ['120$My life cycle$121$My daily routine','some$222$othr$text$42']
# split each thing in mylist at $, for each split-result, keep only those that
# contain only numbers and convert them to integer
splitted = [[int(i) for i in p.split("$") if i.isdigit()] for p in mylist]
print(splitted) # [[120, 121], [222, 42]]
This will produce a list of lists and convert the "string" numbers into integers. It only works for positive number strings without a sign. For signed numbers you can exchange isdigit() for another function:
def isInt(t):
    try:
        _ = int(t)
        return True
    except ValueError:
        return False
mylist = ['-120$My life cycle$121$My daily routine','some$222$othr$text$42']
splitted = [[int(i) for i in p.split("$") if isInt(i) ] for p in mylist]
print(splitted) # [[-120, 121], [222, 42]]
To get a flattened list no matter how many strings are in mylist:
intlist = list(map(int, (d for d in '$'.join(mylist).split("$") if isInt(d))))
print(intlist) # [-120, 121, 222, 42]
Updated version:
import re
myList = ['120$My life cycle 3$121$My daily routine 2']
myList = myList[0].split('$')
numbers = []
for i in range(0,len(myList),2):
    temp = re.findall(r'\d+', myList[i])[0]
    numbers.append(temp)
'''
.findall() returns a list of all occurrences of a pattern in a given string.
The pattern matches digits in the string. If they are next to each
other, they count as one element in the final list. We use index 0 of
myList as that's the string we want to work with.
'''
results = list(map(int, numbers)) # this line calls int() on each of the elements of numbers.
print(results)
Why not just use re?
re is a library for regular expressions in Python. They help you find patterns.
import re
myList = ['120$My life cycle 3$121$My daily routine 2']
# keep only the '$'-separated fields that consist entirely of digits
# (a bare r'\d+$' would only match digits at the very end of the string)
numbers = re.findall(r'(?:^|\$)(\d+)(?=\$|$)', myList[0])
'''
.findall() returns a list of all occurrences of a pattern in a given string.
The pattern matches digits in the string. If they are next to each
other, they count as one element in the final list. We use index 0 of
myList as that's the string we want to work with.
'''
results = list(map(int, numbers)) # this line calls int() on each of the elements of numbers.
print(results)
First off we split string with '$' as separator. And then we just iterate through every other result from the new list, convert it into integer and append it to results.
myList = ['120$My life cycle 3$121$My daily routine 2']
myList = myList[0].split('$')
results = []
for i in range(0,len(myList),2):
    results.append(int(myList[i]))
print(results)
# [120, 121]
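
The same "every other field" idea can be written with a slice step instead of an index loop. This is only a sketch: the helper name leading_ints and its sep parameter are made up for illustration, and it assumes every even-indexed field really is a plain integer (int() raises ValueError otherwise).

```python
def leading_ints(text, sep='$'):
    """Convert the fields at indexes 0, 2, 4, ... to int (hypothetical helper)."""
    fields = text.split(sep)
    return [int(field) for field in fields[::2]]  # [::2] keeps the even indexes


my_list = ['120$My life cycle 3$121$My daily routine 2']
results = leading_ints(my_list[0])
print(results)
```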
Do something like this?
>>> a = ['100$My String 1A$290$My String 1B']
>>> for x in a:
...     [int(s) for s in x.split("$") if s.isdigit()]
...
[100, 290]

How to choose a certain position to split a string by "_"?

I have a string like this '00004079_20150427_5_169_192_114.npz', and I want to split it into this ['00004079_20150427_5', '169_192_114.npz'].
I tried the Python string split() method:
a = '00004079_20150427_5_169_192_114.nii.npz'
a.split("_", 3)
but it returned this:
['00004079', '20150427', '5', '169_192_114.nii.npz']
How can I split this into 2 parts by the third "_" appearance?
I also tried this:
reg = ".*\_.*\_.\_"
re.split(reg, a)
but it returns:
['', '169_192_114.nii.npz']
You can split the string on the delimiter _ up to 3 times and then join everything except the last value back together:
>>> s = '00004079_20150427_5_169_192_114.npz'
>>> *start, end = s.split('_', 3)
>>> start = '_'.join(start)
>>>
>>> start
'00004079_20150427_5'
>>> end
'169_192_114.npz'
For Python 2, you can do this instead:
>>> lst = s.split('_', 3)
>>> end = lst.pop()
>>> start = '_'.join(lst)
>>>
>>> start
'00004079_20150427_5'
>>> end
'169_192_114.npz'
One possible approach (if going with regex):
import re
s = '00004079_20150427_5_169_192_114.nii.npz'
res = re.search(r'^((?:[^_]+_){2}[^_]+)_(.+)', s)
print(res.groups())
The output:
('00004079_20150427_5', '169_192_114.nii.npz')
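
The split-then-join idea generalizes to any separator and any occurrence count. Below is a small sketch under that assumption; split_at_nth is a hypothetical helper name, and returning (text, '') when there are too few separators is a design choice of this sketch rather than anything from the question.

```python
def split_at_nth(text, sep, n):
    """Split text in two at the n-th occurrence of sep (hypothetical helper).

    Returns (text, '') when sep occurs fewer than n times.
    """
    parts = text.split(sep, n)  # at most n splits, so at most n + 1 parts
    if len(parts) <= n:         # fewer than n separators were found
        return text, ''
    return sep.join(parts[:n]), parts[n]


head, tail = split_at_nth('00004079_20150427_5_169_192_114.npz', '_', 3)
print(head, tail)
```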

list.append() where am I wrong?

I have a string which is very long. I would like to split this string into substrings 16 characters long, skipping one character every time (e.g. substring1=first 16 elements of the string, substring2 from element 18 to element 34 and so on) and list them.
I wrote the following code:
string="abcd..."
list=[]
for j in range(0,int(len(string)/17)-1):
    list.append(string[int(j*17):int(j*17+16)])
But it returns:
list=[]
I can't figure out what is wrong with this code.
>>> string="abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
Your original code, without masking the built-in (excludes the final full-length string and any partial string after it):
>>> l = []
>>> for j in range(0,int(len(string)/17)-1):
...     l.append(string[int(j*17):int(j*17+16)])
...
>>> l
['abcdefghijklmnop', 'rstuvwxyzabcdefg', 'ijklmnopqrstuvwx']
A cleaned version that includes all possible strings:
>>> l = []
>>> for j in range(0,len(string),17):
...     l.append(string[j:j+16])
...
>>> l
['abcdefghijklmnop', 'rstuvwxyzabcdefg', 'ijklmnopqrstuvwx', 'zabcdefghijklmno', 'qrstuvwxyz']
How about we turn that last one into a comprehension? Everyone loves comprehensions.
>>> l = [string[j:j+16] for j in range(0,len(string),17)]
We can filter out strings that are too short if we want to:
>>> l = [string[j:j+16] for j in range(0,len(string),17) if len(string[j:j+16])>=16]
It does work, but only for strings of at least 34 characters. You have
range(0,int(len(string)/17)-1)
but, for the string "abcd...", int(len(string)/17)-1 is -1, so the range is empty. Add some logic to catch the short-string case and you're good:
...
for j in range(0, max(1, int(len(string)/17)-1)):
...
Does this work?
>>> from string import ascii_lowercase
>>> s = ascii_lowercase * 2
>>> s
'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'
>>> spl = [s[i:i+16] for i in range(0, len(s), 17)]
>>> spl
['abcdefghijklmnop', 'rstuvwxyzabcdefg', 'ijklmnopqrstuvwx', 'z']
The following should work:
#!/usr/bin/python
string="abcdefghijklmnopqrstuvwxyz"
liszt=[]
leng=5
for j in range(0,len(string)//leng):  # // keeps the count an integer
    ibeg=j*(leng+1)
    liszt.append(string[ibeg:ibeg+leng])
if ibeg+leng+1 < len(string):
    liszt.append(string[ibeg+leng:])
print(liszt)
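
The answers above share one idea: step through the string in strides of "chunk size plus skipped characters" and slice from each start position. A minimal sketch that wraps this in a function follows; the name chunks_skipping and its size, skip, and keep_partial parameters are assumptions for illustration.

```python
def chunks_skipping(text, size=16, skip=1, keep_partial=True):
    """Cut text into pieces of `size` characters, dropping `skip` characters between pieces."""
    step = size + skip
    pieces = [text[start:start + size] for start in range(0, len(text), step)]
    if not keep_partial:
        # drop a trailing piece that is shorter than `size`
        pieces = [piece for piece in pieces if len(piece) == size]
    return pieces


sample = 'abcdefghijklmnopqrstuvwxyz' * 3
all_pieces = chunks_skipping(sample)
full_pieces = chunks_skipping(sample, keep_partial=False)
print(all_pieces)
print(full_pieces)
```

Because range() starts at 0 regardless of the string's length, a short input such as "abcd..." still yields one (partial) piece instead of an empty list.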

Find a string in a list of list in python

I have a nested list as below:
[['asgy200;f','ssll100',' time is: 10h:00m:12s','xxxxxxx','***','','asgy200;f','frl5100',' time is: 00h:00m:05s','ooo']]
'***' is my delimiter. I want to extract all of the seconds values from the list in Python.
First of all, I want to use a regular expression to pick out the lines that contain the "time is:" string, but it doesn't work!
I don't know what I should do.
Thanks
import re
x=[['asgy200;f','ssll100','time is: 10h:00m:12s','xxxxxxx','***','','asgy200;f','frl5100','time is: 00h:00m:05s','ooo']]
s=str(x)
print(re.findall(r"(?<=time is)\s*:\s*[^']*:(\d+)",s))
Output: ['12', '05']
You can try this.
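You can use a look-behind regex (r'(?<=time is\:).*'):
>>> import re
>>> l = [['asgy200;f','ssll100','time is: 10h:00m:12s','xxxxxxx','***','','asgy200;f','frl5100','time is: 00h:00m:05s','ooo']]
>>> sec = [m.group(0).split(':')[2] for m in (re.search(r'(?<=time is\:).*', i) for i in l[0]) if m is not None]
>>> sec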
['12s', '05s']
You can then convert them to int:
>>> [int(j.replace('s','')) for j in sec]
[12, 5]
If you want the seconds as strings, don't convert them to int after the replace:
>>> [j.replace('s','') for j in sec]
['12', '05']
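
If hours and minutes are wanted as well, one pattern with three capturing groups can pull all the fields at once. This is a sketch only: TIME_RE and parse_times are hypothetical names, and the pattern assumes the "NNh:NNm:NNs" layout shown in the question.

```python
import re

# One group each for hours, minutes and seconds (layout assumed from the question).
TIME_RE = re.compile(r'time is:\s*(\d+)h:(\d+)m:(\d+)s')


def parse_times(rows):
    """Return an (hours, minutes, seconds) tuple of ints for each 'time is:' item."""
    times = []
    for row in rows:
        for item in row:
            match = TIME_RE.search(item)  # search() tolerates a leading space
            if match:
                times.append(tuple(int(group) for group in match.groups()))
    return times


data = [['asgy200;f', 'ssll100', ' time is: 10h:00m:12s', 'xxxxxxx', '***', '',
         'asgy200;f', 'frl5100', ' time is: 00h:00m:05s', 'ooo']]
times = parse_times(data)
seconds = [s for _, _, s in times]
print(times)
print(seconds)
```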
You could also use capturing groups. This won't print the seconds if they are exactly equal to 00.
>>> lst = [['asgy200;f','ssll100','time is: 10h:00m:12s','xxxxxxx','***','','asgy200;f','frl5100','time is: 00h:00m:05s','ooo']]
>>> [i for i in re.findall(r'time\s+is:\s+\d{2}h:\d{2}m:(\d{2})', ' '.join(lst[0])) if int(i) != 00]
['12', '05']
>>> lst = [['asgy200;f','ssll100','time is: 10h:00m:00s','xxxxxxx','***','','asgy200;f','frl5100','time is: 00h:00m:05s','ooo']]
>>> [i for i in re.findall(r'time\s+is:\s+\d{2}h:\d{2}m:(\d{2})', ' '.join(lst[0])) if int(i) != 00]
['05']
Taking into account your last comment to your Q,
>>> x = [['asgy200;f','ssll100','time is: 10h:00m:12s','xxxxxxx','***','','asgy200;f','frl5100','time is: 00h:00m:05s','ooo']]
>>> print(all([w[-3:-1]!='00' for r in x for w in r if w.startswith('time is: ')]))
True
>>>
all and any are two useful builtins...
It works like this: the slower (outer) loop runs over the sublists (rows) of x, and the faster (inner) loop runs over the items (words) in each row. We pick up only the words that start with a specific string, and our iterable is made of booleans that are True if the third-last and second-last characters of the picked word differ from '00'. Finally, all consumes the iterable and returns True if all the seconds fields are different from '00'.
HTH,
Addendum
Do we want to break out early?
all_secs_differ_from_0 = True
for row in x:
    for word in row:
        if word.startswith('time is: ') and word[-3:-1] == '00':
            all_secs_differ_from_0 = False
            break
    if not all_secs_differ_from_0: break
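
An early exit is also available from all() itself: when it is given a generator expression rather than a list, it stops consuming items at the first False value. The sketch below wraps that in a function; the name all_seconds_nonzero is an assumption for illustration.

```python
def all_seconds_nonzero(rows):
    """True if no 'time is: ' item has '00' as its seconds field."""
    # Generator expression (no square brackets): all() stops at the first False.
    return all(
        word[-3:-1] != '00'
        for row in rows
        for word in row
        if word.startswith('time is: ')
    )


x = [['asgy200;f', 'ssll100', 'time is: 10h:00m:12s', 'xxxxxxx', '***', '',
      'asgy200;f', 'frl5100', 'time is: 00h:00m:05s', 'ooo']]
y = [['time is: 10h:00m:00s', '***', 'time is: 00h:00m:05s']]

all_secs_differ_from_0 = all_seconds_nonzero(x)
print(all_secs_differ_from_0)
print(all_seconds_nonzero(y))
```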

Python string split decimals from end of string

I use nlst on a ftp server which returns directories in the form of lists. The format of the returned list is as follows:
[xyz123,abcde345,pqrst678].
I have to separate each element of the list into two parts such that part1 = xyz and part2 = 123, i.e. split the string at the beginning of the integer part. Any help on this will be appreciated!
>>> re.findall(r'\d+|[a-z]+', 'xyz123')
['xyz', '123']
For example, using the re module:
>>> import re
>>> a = ['xyz123','ABCDE345','pqRst678']
>>> regex = r'(\D+)(\d+)'
>>> for item in a:
...     m = re.match(regex, item)
...     alpha, digits = m.groups()  # separate names, so the list a is not overwritten
...     print(alpha, digits)
...
xyz 123
ABCDE 345
pqRst 678
Use the regular expression module re:
import re
def splitEntry(entry):
    # search (not match), so the digits can be found at the end of the string
    firstDecMatch = re.search(r"\d+$", entry)
    alpha, numeric = "",""
    if firstDecMatch:
        pos = firstDecMatch.start(0)
        alpha, numeric = entry[:pos], entry[pos:]
    else: # no digits found at end of string
        alpha = entry
    return (alpha, numeric)
Note that the regular expression is '\d+$', which matches all digits at the end of the string. If the string has digits in the first part, it will not count those, e.g. xy3zzz134 -> "xy3zzz","134". I opted for that because you say you are expecting filenames, and filenames can include numbers. Of course it's still a problem if the filename itself ends with numbers.
Another non-re answer:
>>> import itertools
>>> [''.join(x[1]) for x in itertools.groupby('xyz123', lambda x: x.isalpha())]
['xyz', '123']
If you don't want to use regex, then you can do something like this. Note that I have not tested this so there could be a bug or typo somewhere.
items = ["xyz123", "abcde345", "pqrst678"]  # renamed from list to avoid masking the built-in
newlist = []
for item in items:
    for char in range(0, len(item)):
        if item[char].isnumeric():
            newlist.append([item[:char], item[char:]])
            break
>>> import re
>>> [re.findall(r'(.*?)(\d+$)',x)[0] for x in ['xyz123','ABCDE345','pqRst678']]
[('xyz', '123'), ('ABCDE', '345'), ('pqRst', '678')]
I don't think it's that difficult without re
>>> s="xyz123"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['xyz', '123']
>>> s="abcde345"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['abcde', '345']
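
Another way to do it without re, if only the digits at the very end should be split off: strip the trailing digits with str.rstrip() and slice the original string at that length. This is a sketch with a made-up helper name, split_trailing_digits. Unlike the loop above, it splits at the last run of digits, not at the first digit found.

```python
import string


def split_trailing_digits(name):
    """Split name into (everything before the trailing digits, the trailing digits)."""
    head = name.rstrip(string.digits)  # removes digits from the right end only
    return head, name[len(head):]


pairs = [split_trailing_digits(item) for item in ['xyz123', 'abcde345', 'pqrst678']]
print(pairs)
```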
