Extract int between two different strings in python

I have a list, files, of strings in the following format:
files = ['/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_418000.caffemodel.h5',
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_502000.caffemodel.h5', ...]
I want to extract the int between iter_ and .caffemodel and return a list of those ints.
After some research I came up with this solution that does the trick, but I was wondering if there is a more elegant/pythonic way to do it, possibly using a list comprehension?
import re
li = []
for f in files:
    tmp = re.search(r'iter_\d+\.caffemodel', f).group()
    li.append(int(re.search(r'\d+', tmp).group()))

Just to add another possible solution: join the file names together into one big string (it looks like they all end with h5, so there is no danger of creating unwanted matches) and use re.findall on that:
import re
li = [int(d) for d in re.findall(r'iter_(\d+)\.caffemodel', ''.join(files))]

Just use a capturing group:
li = []
for f in files:
    tmp = int(re.search(r'iter_(\d+)\.caffemodel', f).group(1))
    li.append(tmp)
Putting part of the pattern in parentheses creates a capturing group; group(1) returns only the text matched inside the first pair of parentheses.
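To make the role of the parentheses concrete, here is a minimal sketch comparing group(0) with group(1). The file name is a shortened, hypothetical version of the ones in the question.

```python
import re

# Hypothetical, shortened file name used only for illustration.
name = "2dcellnet_v6w4l1_snapshot_iter_418000.caffemodel.h5"

match = re.search(r"iter_(\d+)\.caffemodel", name)

whole = match.group(0)        # the entire matched text
number = int(match.group(1))  # only the part inside the parentheses

print(whole)
print(number)
```

group(0) still contains the surrounding iter_ and .caffemodel text, which is why the question's original code needed a second search.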

You can also use a lookbehind assertion:
regex = re.compile(r"(?<=iter_)\d+")
for f in files:
    number = int(regex.search(f).group(0))

Solution with list comprehension, as you wished:
import re
re_model_id = re.compile(r'iter_(?P<model_id>\d+)\.caffemodel')
li = [int(re_model_id.search(f).group('model_id')) for f in files]
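One thing the comprehension above does not handle is a file name without an iteration number: search() then returns None and .group() raises AttributeError. A possible sketch that skips such names follows; the helper name extract_iterations and the sample list are assumptions made for illustration.

```python
import re

ITER_RE = re.compile(r"iter_(\d+)\.caffemodel")

def extract_iterations(paths):
    """Return the iteration numbers found in paths, skipping non-matching names."""
    matches = (ITER_RE.search(p) for p in paths)
    return [int(m.group(1)) for m in matches if m is not None]

# Hypothetical sample data, shortened from the question.
sample = [
    "snapshot_iter_418000.caffemodel.h5",
    "snapshot_iter_502000.caffemodel.h5",
    "README.txt",  # no iteration number, so it is skipped
]
result = extract_iterations(sample)
print(result)
```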

Without a regex:
files = [
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_418000.caffemodel.h5',
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_502000.caffemodel.h5']
print([f.rsplit("_", 1)[1].split(".", 1)[0] for f in files])
['418000', '502000']
Or if you want to be more specific:
print([f.rsplit("iter_", 1)[1].split(".caffemodel", 1)[0] for f in files])
But your pattern seems to repeat so the first solution is probably sufficient.
You can also slice using find and rfind:
print( [f[f.find("iter_")+5: f.rfind("caffe")-1] for f in files])
['418000', '502000']

Related

Multiple non-nested if conditions in list comprehension without a terminal else

(Note: before you jump the gun to look for duplicate if-else Q's, please see the next section for why many of them did not suit mine)
I want to learn how to use a list comprehension to simplify these two code blocks into one:
filenameslist.extend(
[f[:-4] for f in filenames if (
f.endswith('.mp3') or
f.endswith('.wma') or
f.endswith('.aac') or
f.endswith('.ogg') or
f.endswith('.m4a')
)])
filenameslist.extend(
[f[:-5] for f in filenames if (
f.endswith('.opus')
)])
I have tried to achieve it using the following code after following many answers here on SO. However, it doesn't work for me. Please have a look at what I have right now:
filenameslist.extend(
[(f[:-4] if (
f.endswith('.mp3') or
f.endswith('.wma') or
f.endswith('.aac') or
f.endswith('.ogg') or
f.endswith('.m4a')
) else (f[:-5] if f.endswith('.opus') else '')) for f in filenames])
The unnecessary else '' at the end adds an entry '' to my list which I don't need. Removing the else or using else pass results in syntax error.
I can delete the '' entry manually from list, but the point is to learn how to do this one-step with list comprehension. I am using py 3.8.
There is no way in the expression of your list comprehension to state something like "do not produce an item in that case" (when the extension is not in your list of allowed extensions).
You have to somehow repeat your test:
filenames = ['test.mp3', 'something.opus', 'dontcare.wav']
l = [
f[:-5] if f.endswith('.opus') else f[:-4]
for f in filenames
if (
f.endswith('.mp3') or
f.endswith('.wma') or
f.endswith('.aac') or
f.endswith('.ogg') or
f.endswith('.m4a') or
f.endswith('.opus')
)
]
print(l)
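Since the question mentions Python 3.8, an assignment expression is one possible way to avoid writing the test twice. This is only a sketch: strip_audio_extension is a hypothetical helper, not something from the answers here.

```python
def strip_audio_extension(name):
    """Return name without its extension, or None if it is not an allowed audio file."""
    for ext in (".mp3", ".wma", ".aac", ".ogg", ".m4a", ".opus"):
        if name.endswith(ext):
            return name[: -len(ext)]
    return None

filenames = ["test.mp3", "something.opus", "dontcare.wav"]

# The walrus operator (Python 3.8+) stores the helper's result so it can be
# both tested in the filter and used as the produced item.
stems = [stem for f in filenames if (stem := strip_audio_extension(f)) is not None]
print(stems)
```

The filtering still happens in the if clause; the helper only moves the extension logic into one place.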
Note that you can use os.path.splitext to ease your work:
import os
filenames = ['test.mp3', 'something.opus', 'dontcare.wav']
l = [
os.path.splitext(f)[0]
for f in filenames
if os.path.splitext(f)[1] in ['.mp3', '.wma', '.aac', '.ogg', '.m4a', '.opus']
]
print(l)
Use the Path objects' built-in properties instead of parsing the names yourself:
from pathlib import Path
filenames = Path('/some/folder/').glob('*')
allowed_suffixes = ['.mp3', '.wma', '.aac', '.ogg', '.m4a', '.opus']
file_stems = set(f.stem for f in filenames if f.suffix in allowed_suffixes)
You can use a list instead of a set, of course. This looks cleaner than a convoluted list comprehension. If you want to retain the files' full paths, use:
file_stems = set(f.parent / f.stem for f in filenames if f.suffix in allowed_suffixes)
The str.endswith method can optionally take a tuple of suffixes, so you can simply do:
allowed_suffixes = '.mp3', '.wma', '.aac', '.ogg', '.m4a', '.opus'
filenameslist.extend(f[:f.rfind('.')] for f in filenames if f.endswith(allowed_suffixes))
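A small sketch of the tuple form of str.endswith, using made-up file names. Note that the argument has to be a tuple (or a single string); a list raises TypeError.

```python
allowed_suffixes = (".mp3", ".wma", ".aac", ".ogg", ".m4a", ".opus")
# Hypothetical sample names for illustration.
filenames = ["test.mp3", "something.opus", "dontcare.wav", "archive.tar.ogg"]

# rfind('.') locates the last dot, so only the final extension is cut off.
stems = [f[: f.rfind(".")] for f in filenames if f.endswith(allowed_suffixes)]
print(stems)

# A list is rejected; endswith accepts a str or a tuple of str.
list_rejected = False
try:
    "test.mp3".endswith(list(allowed_suffixes))
except TypeError:
    list_rejected = True
print(list_rejected)
```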
You can use rpartition like below:
filenameslist.extend([fn.rpartition('.')[0] for fn in filenames if fn[fn.rfind('.'):] in suffixes])
Example:
suffixes = ['.mp3', '.wma', '.aac', '.ogg', '.m4a', '.opus', '.wav']
filenames = ['test.mp3', 'something.opus', 'dontcare.wav', 'lara']
[fn.rpartition('.')[0] for fn in filenames if fn[fn.rfind('.'):] in suffixes]
Output:
['test', 'something', 'dontcare']

Regex to match strings in a list without .csv extension

How can I write a regular expression that matches only the file names without the .csv extension? This should be the required output:
Required Output:
['ap_2010', 'class_size', 'demographics', 'graduation','hs_directory', 'sat_results']
Input:
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
I tried this, but it returns an empty list:
for i in data_files:
    regex = re.findall(r'/w+/_[/d{4}][/w*]?', i)
If you really want to use a regular expression, you can use re.sub to remove the extension if it exists, and if not, leave the string alone:
import re
[re.sub(r'\.csv$', '', i) for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
A better approach in general is using the os module to handle anything to do with filenames:
import os
[os.path.splitext(i)[0] for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
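For reference, here is a short sketch of how os.path.splitext treats a few edge cases. Only the first name comes from the question; the others are illustrative.

```python
import os

# Names chosen to show edge cases; only the first is from the question.
names = ["ap_2010.csv", "archive.tar.gz", "README", ".bashrc"]

results = {name: os.path.splitext(name) for name in names}
for name, (stem, ext) in results.items():
    print(repr(name), "->", repr(stem), repr(ext))
```

Only the last extension is split off, a name without a dot is returned unchanged with an empty extension, and a leading dot is treated as part of the name rather than as an extension.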
If you want a regex, the pattern is r'(.*)\.csv':
for i in data_files:
    regex = re.findall(r'(.*)\.csv', i)
    print(regex)
Split the string at '.' and take the last element of the split (index [-1]). If it is 'csv', the file is a CSV file.
for i in data_files:
    if i.split('.')[-1].lower() == 'csv':
        pass  # It is a CSV file
    else:
        pass  # Not a CSV
# Input
data_files = [ 'ap_2010.csv', 'class_size.csv', 'demographics.csv', 'graduation.csv', 'hs_directory.csv', 'sat_results.csv' ]
import re
pattern = r'(?P<filename>[a-z0-9A-Z_]+)\.csv'
prog = re.compile(pattern)
# `map` returns:
# - a list in Python 2.x
# - a lazy iterator (a map object) in Python 3.x, so wrap it in list() to see the values
result = list(map(lambda data_file: prog.search(data_file).group('filename'), data_files))
l = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
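# rsplit removes only the final extension; rstrip would strip a set of characters
# and turn 'demographics.csv' into 'demographi'.
print([i.rsplit('.', 1)[0] for i in l])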
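A note on str.rstrip, since it is easy to misread: its argument is a set of characters to strip, not a suffix. The sketch below uses one file name from the question to show the difference from splitting once at the last dot.

```python
name = "demographics.csv"

# rstrip treats its argument as a set of characters, not as a suffix,
# so trailing 's' and 'c' characters of the stem are removed as well.
stripped = name.rstrip(".csv")

# Splitting once from the right removes only the final extension.
split_once = name.rsplit(".", 1)[0]

print(stripped)
print(split_once)
```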

How to filter out file names with certain prefix and postfix (extension)?

I have a list of files like this:
file_list = ['file1.zip', 'file1.txt']
file_prefix = 'file1'
I'd like to use filter and re to only get file1.txt above. I'm trying this:
regex = re.compile(file_prefix + '.*(!zip).*')
result = list(filter(regex.search, file_list))
# in the above, result should be populated with just ['file1.txt']
But the regex pattern is not working. Could someone help me out with this? Thanks very much in advance!
You can use negative lookahead like this:
regex = re.compile(file_prefix + r'(?!\.zip)')
Code:
>>> file_list = ['file1.zip', 'file1.txt']
>>> file_prefix = 'file1'
>>> regex = re.compile(file_prefix + r'(?!\.zip)')
>>> print(list(filter(regex.search, file_list)))
['file1.txt']
(?!\.zip) is a negative lookahead: it succeeds only when .zip does not immediately follow the current position.
Read more about look-arounds
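A slightly more defensive sketch of the same lookahead idea. The use of re.escape and match() is an added precaution that is not part of the answer above, and the extra file names are made up.

```python
import re

# 'other.txt' is a hypothetical extra entry for illustration.
file_list = ["file1.zip", "file1.txt", "other.txt"]
file_prefix = "file1"

# re.escape guards against regex metacharacters in the prefix.
# match() anchors at the start, so the prefix must begin the file name.
regex = re.compile(re.escape(file_prefix) + r"(?!\.zip)")
result = [name for name in file_list if regex.match(name)]
print(result)
```

The lookahead only inspects the characters directly after the prefix, so a name such as file10.zip would still be accepted, because .zip does not immediately follow file1 there.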
No need for regex for this solution - you don't need to bring a cannon to a thumb-fight. Use Python's native string search/check:
file_list = ["file1.zip", "file1.txt"]
file_prefix = "file1"
file_exclude = "zip"
result = [e for e in file_list if e.startswith(file_prefix) and not e.endswith(file_exclude)]
# ['file1.txt']
Should be considerably faster, too.
If you don't want to check only the edges, and instead want to keep every entry in which zip does not occur after file_prefix, wherever the prefix sits in the string (so some_file1.txt and even a_zip_file1.txt are kept, but file1_zip.txt is not), you can modify it slightly:
file_list = ["file1.zip", "file1.txt", "some_file1.txt", "a_zip_file1.txt", "file1_zip.txt"]
file_prefix = "file1"
file_exclude = "zip"
result = [e for e in file_list if e.find(file_exclude) < e.find(file_prefix)]
# ['file1.txt', 'some_file1.txt', 'a_zip_file1.txt']

Extract substring from list of file names in Python or R

My question is very similar to the following: How to get a Substring from list of file names. I'm a newb to Python and would prefer a similar solution for Python (or R). I'd like to look into a directory and extract a particular substring from each applicable file name and output it as a vector (preferred), list, or array. For example, assume I have directory with the following file names:
data_ABC_48P.txt
data_DEF_48P.txt
data_GHI_48P.txt
other_96.txt
another_98.txt
I would like to reference the directory and extract the following as a character vector (for use in R) or list:
"ABC", "DEF", "GHI"
I tried the following:
from os import listdir
from os.path import isfile, join
files = [ f for f in listdir(path) if isfile(join(path,f)) ]
import re
m = re.search('data_(.+?)_48P', files)
But I get the following error:
TypeError: expected string or buffer
files is of type list
In [10]: type(files)
Out[10]: list
Even though I ultimately want this character vector as an input to R code, we are trying to transition all of our "scripting" to Python and use R solely for data analysis, so a Python solution would be great. I'm also using Ubuntu, so a cmd line or bash script solution could work as well. Thanks in advance!
Use a list comprehension like this:
[re.search(r'data_(.+?)_48P', i).group(1) for i in files if re.search(r'data_.+?_48P', i)]
You need to iterate over the list contents in order to grab the substrings you want.
from os import listdir
from os.path import isfile, join
import re
strings = []
for f in listdir(path):
    if isfile(join(path, f)):
        m = re.search('data_(.+?)_48P', f)
        if m:
            strings.append(m.group(1))
print(strings)
Output:
['ABC', 'DEF', 'GHI']
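The loop can be tried end to end without touching a real directory. The sketch below assumes a temporary directory filled with the file names from the question; the tempfile setup is purely illustrative.

```python
import os
import re
import tempfile

pattern = re.compile(r"data_(.+?)_48P")

with tempfile.TemporaryDirectory() as path:
    # Create empty files with the names from the question.
    for name in ["data_ABC_48P.txt", "data_DEF_48P.txt", "data_GHI_48P.txt",
                 "other_96.txt", "another_98.txt"]:
        open(os.path.join(path, name), "w").close()

    # os.listdir returns names in arbitrary order, so sort for a stable result.
    matches = (pattern.search(f) for f in sorted(os.listdir(path)))
    labels = [m.group(1) for m in matches if m]

print(labels)
```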
re.search() does not accept a list as its argument; you need to loop over the list and pass each element, which must be a string, to the function. You can use positive look-arounds to match just the expected substring. Because re.search returns a match object (or None when nothing matches), you need group() to get the string:
>>> for i in files:
...     try:
...         print(re.search(r'(?<=data_).*(?=_48P)', i).group(0))
...     except AttributeError:
...         pass
...
ABC
DEF
GHI
re.search requires a string, not a list.
Use:
import re
m = []
for line in files:
    match = re.search('data_(.+?)_48P', line)
    if match:
        m.append(match.group(1))
In R:
list.files('~/desktop/test')
# [1] "another_98.txt" "data_ABC_48P.txt" "data_DEF_48P.txt" "data_GHI_48P.txt" "other_96.txt"
gsub('_', '', unlist(regmatches(l <- list.files('~/desktop/test'),
gregexpr('_(\\w+?)_', l, perl = TRUE))))
# [1] "ABC" "DEF" "GHI"
another way:
l <- list.files('~/desktop/test', pattern = '_(\\w+?)_')
sapply(strsplit(l, '[_]'), '[[', 2)
# [1] "ABC" "DEF" "GHI"

Regular Expression in Python

I'm trying to build a list of domain names from an Enom API call. I get back a lot of information and need to locate the domain name related lines, and then join them together.
The string that comes back from Enom looks somewhat like this:
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1
I'd like to build a list from that which looks like this:
[domain1.com, domain2.org, domain3.co.uk, domain4.net]
To find the different domain name components I've tried the following (where "enom" is the string above) but have only been able to get the SLD and TLD matches.
re.findall("^.*(SLD|TLD).*$", enom, re.M)
Edit:
Every time I see a question asking for a regular expression solution, I have this bizarre urge to try to solve it without regular expressions. Often that is more efficient than using a regex, but I encourage the OP to test which of the solutions is the most efficient.
Here is the naive approach:
a = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
b = a.split("\n")
c = [x.split("=")[1] for x in b if x != 'TLDOverride=1']
for x in range(0, len(c), 2):
    print(".".join(c[x:x+2]))
# domain1.com
# domain2.org
# domain3.co.uk
# domain4.net
You have a capturing group in your expression. re.findall documentation says:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
That's why only the content of the capturing group is returned.
Try:
re.findall(r"^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)
This would return a list of tuples:
[('SLD1', 'domain1'), ('TLD1', 'com'), ('SLD2', 'domain2'), ('TLD2', 'org'), ('SLD3', 'domain3'), ('TLD4', 'co.uk'), ('SLD5', 'domain4'), ('TLD5', 'net')]
Combining SLDs and TLDs is then up to you.
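One possible way to do that combining step is sketched below. It assumes that each SLD line is followed by its TLD line and pairs them by position rather than by number, because the numbering in the sample is not consistent (SLD3 is followed by TLD4). The pattern is a slightly simplified variant of the one above.

```python
import re

enom = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""

# \d+ requires a number after SLD/TLD, so TLDOverride lines are not matched.
pairs = re.findall(r"^((?:SLD|TLD)\d+)=(.*)$", enom, re.M)

domains = []
sld = None
for key, value in pairs:
    if key.startswith("SLD"):
        sld = value                        # remember the pending second-level domain
    elif sld is not None:
        domains.append(sld + "." + value)  # join it with the TLD that follows
        sld = None

print(domains)
```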
This works for your example:
>>> sld_list = re.findall(r"^.*SLD[0-9]*?=(.*?)$", enom, re.M)
>>> tld_list = re.findall(r"^.*TLD[0-9]*?=(.*?)$", enom, re.M)
>>> list(map(lambda x: x[0] + '.' + x[1], zip(sld_list, tld_list)))
['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
I'm not sure why you are talking about regular expressions. I mean, why don't you just run a for loop?
A famous quote seems to be appropriate here:
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
domains = []
components = []
for line in enom.split('\n'):
    k, v = line.split('=')
    if k == 'TLDOverride':
        continue
    components.append(v)
    if k.startswith('TLD'):
        domains.append('.'.join(components))
        components = []
P.S. I'm not sure what TLDOverride is, so the code just ignores it.
Here's one way:
import re
print(list(map('.'.join, zip(*[iter(re.findall(r'^(?:S|T)LD\d+=(.*)$', text, re.M))]*2))))
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
Just for fun, map -> filter -> map:
input = """
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
"""
# list() is needed in Python 3, where map returns a one-shot iterator that cannot be indexed
splited = list(map(lambda x: x.split("="), input.split()))
slds = filter(lambda x: x[1][0].startswith('SLD'), enumerate(splited))
print(list(map(lambda x: '.'.join([x[1][1], splited[x[0] + 1][1]]), slds)))
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
This appears to do what you want:
domains = re.findall(r'SLD\d+=(.+)', re.sub(r'\nTLD\d+=', '.', enom))
It assumes that the lines are sorted and that each SLD always comes before its TLD. If that might not be the case, try this slightly more verbose code without regexes:
d = dict(x.split('=') for x in enom.strip().splitlines())
domains = [
d[key] + '.' + d.get('T' + key[1:], '')
for key in d if key.startswith('SLD')
]
You need a regex that matches across two lines for this, pairing each SLD line with the TLD line that follows it.
data = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
# (.+) rather than (\w+), so that TLDs containing a dot, such as co.uk, are captured whole
domain_seq = re.compile(r"SLD\d+=(.+)\nTLD\d+=(.+)")
for item in domain_seq.finditer(data):
    domain, tld = item.group(1), item.group(2)
    print("%s.%s" % (domain, tld))
As some other answers already said, there's no need to use a regular expression here. A simple split and some filtering will do nicely:
lines = data.split("\n")  # assuming data contains your input string
# Keep only numbered SLDn/TLDn lines, which excludes TLDOverride
sld, tld = [[x.split("=")[1] for x in lines if x[:3] == t and x[3:4].isdigit()] for t in ("SLD", "TLD")]
result = [x + "." + y for x, y in zip(sld, tld)]
