Inverse re.split - python

I have this path:
C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx
I want to remove everything in the path up to the word PRODUTORAS, like this:
/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx
My knowledge of regular expressions is pretty mediocre; the only approach I'm used to is splitting the path by / like this
rpath = path.rsplit('/', 1)[0]
rpath2 = re.split('/', path)
and indexing to the part I want.
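That split-and-index approach can actually be carried through without any regex: locate 'PRODUTORAS' in the split list and rejoin from there. A minimal sketch using the path from the question:

```python
path = "C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx"

parts = path.split('/')
# Find the anchor component, then rejoin everything from it onwards,
# adding a leading slash to match the desired output.
idx = parts.index('PRODUTORAS')
result = '/' + '/'.join(parts[idx:])
print(result)
# '/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx'
```

Note that `list.index` raises `ValueError` if 'PRODUTORAS' is not in the path, so a membership check first may be wanted.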

You can use a regular expression. This works:
import re

t = "C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx"
output = re.search(".*(/PRODUTORAS.*)", t)
print(output.group(1))
# '/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx'

Here is a simple approach:
if '/PRODUTORAS/' in mypath:
    newpath = '/PRODUTORAS/' + mypath.split('/PRODUTORAS/', 1)[1]
This only works if you are using forward slashes for your path separator and PRODUTORAS is capitalized.

If your prefix doesn't change, slicing it off works (note that path.strip(prefix) would be wrong here: strip removes a set of characters from both ends, not a literal prefix):
path = "C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx"
prefix = "C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/"
print(path[len(prefix):])
# Output:
# 'PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx'
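On Python 3.9+ there is also str.removeprefix, which removes exactly the literal prefix (and leaves the string untouched if the prefix is absent):

```python
path = "C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx"
prefix = "C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/"

# removeprefix (Python 3.9+) strips the exact prefix string;
# on older versions, path[len(prefix):] guarded by startswith does the same.
print(path.removeprefix(prefix))
# 'PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx'
```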

Related

Matching a pattern of file name after wildcard

I have a couple of files, XR3_DEV_YEAR20_Z.7_ROP_Current*.csv and
XR3_DEV_YEAR20_Z.7_ROP_Previous*.csv.
I'm trying to take the pattern of these file names and get the bit matched by the wildcard, so I can build names like this:
XR3_DEV_YEAR20_Z.7_ROP_*_xml.txt
I'm trying to do this with the re or glob library, but I'm not really sure how.
Here's a way to do that:
import re

exp = re.compile(r"^XR3_DEV_YEAR20_Z\.7_ROP_(.*)\.csv$")  # escape the literal dots
example = "XR3_DEV_YEAR20_Z.7_ROP_Current1234.csv"
res = exp.match(example)
part = res.groups()[0]
new_name = "XR3_DEV_YEAR20_Z.7_ROP_" + part + "_xml.txt"
new_name, in this case, would be:
XR3_DEV_YEAR20_Z.7_ROP_Current1234_xml.txt
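Wrapped in a small helper, the same idea combines naturally with glob to process real files (a sketch; the directory layout is an assumption):

```python
import re

# Escape the literal dots so '.' doesn't match any character.
PATTERN = re.compile(r"^XR3_DEV_YEAR20_Z\.7_ROP_(.*)\.csv$")

def build_new_name(filename):
    """Return the _xml.txt name for a matching CSV, or None if it doesn't match."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    return "XR3_DEV_YEAR20_Z.7_ROP_" + m.group(1) + "_xml.txt"

# With glob, this could be applied to every matching file in a directory:
# for path in glob.glob("XR3_DEV_YEAR20_Z.7_ROP_*.csv"):
#     print(build_new_name(path))
print(build_new_name("XR3_DEV_YEAR20_Z.7_ROP_Current1234.csv"))
# XR3_DEV_YEAR20_Z.7_ROP_Current1234_xml.txt
```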

Python Parse through String to create variable

I have a variable that reads in a datafile
dfPort = pd.read_csv("E:...\Portfolios\ConsDisc_20160701_Q.csv")
I was hoping to create three variables, portName, inceptionDate, and frequency, that would read the "E:..." string above and take out the wanted parts, using the underscore as an indicator to move to the next variable. Example after parsing the string:
portName = "ConsDisc"
inceptionDate = "2016-07-01"
frequency = "Q"
Any tips would be appreciated!
You can use os.path.basename, os.path.splitext and str.split:
import os
filename = r'E:...\Portfolios\ConsDisc_20160701_Q.csv'
parts = os.path.splitext(os.path.basename(filename.replace('\\', os.sep)))[0].split('_')
print(parts)
outputs ['ConsDisc', '20160701', 'Q']. You can then manipulate this list as you like, for example extract it into variables with port_name, inception_date, frequency = parts, etc.
The .replace('\\', os.sep) there is used to "normalize" Windows-style backslash-separated paths into whatever is the convention of the system the code is being run on (i.e. forward slashes on anything but Windows :) )
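pathlib can express the same normalization more explicitly: PureWindowsPath parses backslash-separated paths on any OS, and its stem property drops both the directory and the extension in one step. A sketch (the full path below is made up, since the original is truncated):

```python
from pathlib import PureWindowsPath

# PureWindowsPath understands backslashes even on non-Windows systems.
p = PureWindowsPath(r'E:\Data\Portfolios\ConsDisc_20160701_Q.csv')
parts = p.stem.split('_')  # stem is the final component without its suffix
print(parts)
# ['ConsDisc', '20160701', 'Q']
```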
import os

def parse_filename(path):
    filename = os.path.basename(path)
    filename_no_ext = os.path.splitext(filename)[0]
    return filename_no_ext.split("_")

path = r"Portfolios\ConsDisc_20160701_Q.csv"
portName, inceptionDate, frequency = parse_filename(path)
How about an alternative solution, in case you want to store them in a dictionary and use them like so:
import re
str1 = "E:...\Portfolios\ConsDisc_20160701_Q.csv"
re.search(r'Portfolios\\(?P<portName>.*)_(?P<inceptionDate>.*)_(?P<frequency>.)', str1).groupdict()
# result
# {'portName': 'ConsDisc', 'inceptionDate': '20160701', 'frequency': 'Q'}
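Whichever way the parts are extracted, the expected inceptionDate format "2016-07-01" is one datetime round-trip away. A sketch building all three variables from the split parts shown above:

```python
from datetime import datetime

parts = ['ConsDisc', '20160701', 'Q']  # as produced by the splits above
port_name, raw_date, frequency = parts
# Parse the compact YYYYMMDD form, then re-format with dashes.
inception_date = datetime.strptime(raw_date, '%Y%m%d').strftime('%Y-%m-%d')
print(port_name, inception_date, frequency)
# ConsDisc 2016-07-01 Q
```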

How to filter out file names with certain prefix and postfix (extension)?

I have a list of files like this:
file_list = ['file1.zip', 'file1.txt']
file_prefix = 'file1'
I'd like to use filter and re to only get file1.txt above. I'm trying this:
regex = re.compile(file_prefix + '.*(!zip).*')
result = list(filter(regex.search, file_list))
# in the above, result should be populated with just ['file1.txt']
But the regex pattern is not working. Could someone help me out on this? Thanks very much in advance!
You can use a negative lookahead like this:
regex = re.compile(file_prefix + r'(?!\.zip)')
Code:
>>> file_list = ['file1.zip', 'file1.txt']
>>> file_prefix = 'file1'
>>> regex = re.compile(file_prefix + r'(?!\.zip)')
>>> list(filter(regex.search, file_list))
['file1.txt']
(?!\.zip) is a negative lookahead that asserts true when .zip is not present at the next position.
Read more about look-arounds
No need for regex for this solution - you don't need to bring a cannon to a thumb-fight. Use Python's native string search/check:
file_list = ["file1.zip", "file1.txt"]
file_prefix = "file1"
file_exclude = "zip"
result = [e for e in file_list if e.startswith(file_prefix) and not e.endswith(file_exclude)]
# ['file1.txt']
Should be considerably faster, too.
If you don't want to check only the edges, and instead want to keep only entries where the zip marker does not appear before the file_prefix anywhere in the string (so you want to match some_file1.txt, or even a_zip_file1.txt, but not file1_zip.txt), you can modify it slightly:
file_list = ["file1.zip", "file1.txt", "some_file1.txt", "a_zip_file1.txt", "file1_zip.txt"]
file_prefix = "file1"
file_exclude = "zip"
result = [e for e in file_list if e.find(file_exclude) < e.find(file_prefix)]
# ['file1.txt', 'some_file1.txt', 'a_zip_file1.txt']

How to split a string on a character that occurs more than once inside the string

I am having some logic issues attempting to parse a string into two fields, name and version. I have been splitting on "/" and it works very well for strings that have only one "/" in them, for example:
strString = "someName/A"
nameVer = strString.split('/')
name = nameVer[0]
ver = nameVer[1]
This returns name=someName and ver=A, which is what I want. The problem is when I have more than one "/" in the string, particularly these 3 cases:
Part = "someName//"    # Expected output: name=someName ver=/
Part1 = "some/Name/A"  # Expected output: name=some/Name ver=A
Part2 = "some/Name//"  # Expected output: name=some/Name ver=/
Both the name and version can be or contain "/"'s. I have tried many things, including keeping track of the indexes of the "/" and grabbing what's in between. In some cases I have also added brackets to the string ("[some/Name//]") so I can index the first and last char of the string. Any help with this is greatly appreciated. Thanks
Following some useful comments by BrenBarn and sr2222, I suggest the following solutions.
The OP should either
Make sure that the version string does not contain any '/' characters, and then use rsplit as suggested by sr2222
or
Choose a different delimiter for the name-version division
A solution that ignores the last character (such that it can be assigned to the ver variable) would be
ind = Part[:-1].rindex('/')
name = Part[:ind]
ver = Part[ind+1:]
On the OP's inputs this produces the desired output.
If any instance of the separator might be doing the separating, there are too many choices. Take your last example, some/name//. Which of the three slashes is the
separator? The string can be parsed, in order, as ("some", "name//"),
as ("some/name", "/"), or as ("some/name/", "").
What to do? Let's say the version is necessarily non-empty (ruling out option 3),
and otherwise the name part should be maximal. If you like these rules,
here's a regexp that will do the job: r'^(.*)/(.+)$'. You can use it like this:
name, ver = re.match(r'^(.*)/(.+)$', "some/name/").groups()
Here's what it does:
>>> re.match(r'^(.*)/(.+)$', "name//").groups()
('name', '/')
>>> re.match(r'^(.*)/(.+)$', "some/name/a").groups()
('some/name', 'a')
>>> re.match(r'^(.*)/(.+)$', "some/name//").groups()
('some/name', '/')
>>> re.match(r'^(.*)/(.+)$', "some/name/").groups()
('some', 'name/')
In short, it splits on the last slash that has something after it (possibly a final slash). If you don't like this approach, you'll need to provide more detail on what you had in mind.
For the cases you've posted, this would work:
if part.endswith('//'):
    name, ver = part[:-2], '/'
else:
    name, ver = part.rsplit('/', 1)
Here is the code I made that handles just about every case. The only cases it does not handle are the ambiguous ones, where you cannot tell if a "/" is part of the name or the ver. Thank you for everyone's input.
Part = "[0717_PFM1//]"
Part1 = "[0717_PFM1/A]"    # generic case
Part2 = "[0717/_PFM1/A]"
Part3 = "[07/17/_PFM1//]"  # Test case below
#Part3 = "[0717/_PFM1//B]" # Not working, ambiguous: can't tell if the ending slash is part of name or ver

import re

numberOfSlashes = Part3.count("/")
if numberOfSlashes > 1:
    nameVer = Part3.split("/")
    name1, ver1 = re.match(r'^(.*)/(.+)$', Part3).groups()
    if nameVer[2].strip("]") or ver1.strip("]") == "":
        ver = "/"
    else:
        ver = nameVer[2].strip("]")
    name = nameVer[0].strip('[')
    if len(name1) > len(name):
        name = name1
    if len(ver1) > len(ver):
        ver = ver1
    name = name.rstrip("/")
else:
    nameVer = Part3.split("/")
    name, ver = nameVer[0], nameVer[1]
print("name", name.strip('['), "ver", ver.strip(']'))

Replace recursively from a replacement map

I have a dictionary in the form
{'from.x': 'from.changed.x',...}
possibly quite big, and I have to substitute in text files according to that dictionary, across a quite big directory structure.
I didn't find any nice solution, so I ended up:
using os.walk
iterating through the dictionary
writing everything out
With something like:
def fix_imports(top_dir, not_ui_keys):
    """Walk through the directory and substitute the wrong imports"""
    repl = {}
    for n in not_ui_keys:
        # interleave a model in between
        dotted = extract_dotted(n)
        if dotted:
            repl[dotted] = add_model(dotted)
    for root, dirs, files in walk(top_dir):
        py_files = [path.join(root, x) for x in files if x.endswith('.py')]
        for py in py_files:
            res = replace_text(open(py).read(), repl)

def replace_text(orig_text, replace_map):
    res = orig_text
    # now try to grep all the keys, using a translate maybe
    # with a dictionary of the replacements
    for to_replace in replace_map:
        res = res.replace(to_replace, replace_map[to_replace])  # str.replace returns a new string
    # now print the differences
    for un in unified_diff(res.splitlines(), orig_text.splitlines()):
        print(un)
    return res
Is there any better/nicer/faster way to do it?
EDIT:
Clarifying a bit the problem, the substitution are generated from a function, and they are all in the form:
{'x.y.z': 'x.y.added.z', 'x.b.a': 'x.b.added.a'}
And yes, I probably should use regexps; I just thought I didn't need them this time.
I don't think they would help much, however, because I can't really formalize the whole range of substitutions with only one (or multiple) regexps.
I would write the first function using generators:
def fix_imports(top_dir, not_ui_keys):
    """Walk through the directory and substitute the wrong imports"""
    # imap/ifilter are Python 2 only; on Python 3 use the built-in map and filter
    from itertools import imap, ifilter
    gen = ifilter(None, imap(extract_dotted, not_ui_keys))
    repl = dict((dotted, add_model(dotted)) for dotted in gen)
    py_files = (path.join(root, x)
                for root, dirs, files in walk(top_dir)
                for x in files if x[-3:] == '.py')
    for py in py_files:
        with open(py) as opf:
            res = replace_text(opf.read(), repl)
x[-3:] == '.py' is faster than x.endswith('.py')
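That speed claim is easy to check with timeit before trading readability for it; a minimal sketch (the timings below depend on the interpreter and machine, so no particular winner is assumed):

```python
import timeit

# Compare slicing against str.endswith on a representative file name.
slice_t = timeit.timeit("x[-3:] == '.py'", setup="x = 'module.py'", number=100_000)
endswith_t = timeit.timeit("x.endswith('.py')", setup="x = 'module.py'", number=100_000)
print("slice: %.4fs  endswith: %.4fs" % (slice_t, endswith_t))
```

On most builds the per-call difference is tiny, so readability usually wins.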
Thank you everyone, and about the problem of substituting from a mapping in many files, I think I have a working solution:
def replace_map_to_text(repl_map, text_lines):
    """Take a dictionary with the replacements needed and a list of
    lines, and return a list with the substituted lines
    """
    res = []
    concat_st = "(%s)" % "|".join(repl_map.keys())
    # '.' in a regexp means any one character, so it must be
    # escaped to match a literal dot
    concat_st = concat_st.replace('.', r'\.')
    combined_regexp = re.compile(concat_st)
    for line in text_lines:
        found = combined_regexp.search(line)
        if found:
            expr = found.group(1)
            new_line = line.replace(expr, repl_map[expr])
            logger.info("from line %s to line %s" % (line, new_line))
            res.append(new_line)
        else:
            res.append(line)
    return res
def test_replace_string():
    lines = ["from psi.io.api import x",
             "from psi.z import f"]
    expected = ["from psi.io.model.api import x",
                "from psi.model.z import f"]
    mapping = {'psi.io.api': 'psi.io.model.api',
               'psi.z': 'psi.model.z'}
    assert replace_map_to_text(mapping, lines) == expected
In short I compose a big regexp in the form
(first|second|third)
Then I search for it in every line and substitute with re.sub if something was found.
Still a bit rough, but the simple test above passes.
EDIT: fixed a nasty bug in the concatenation: in a regexp an unescaped '.' matches any single character, not a literal '.', so the dots must be escaped (or the pattern built from raw/escaped strings).
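A one-pass variant of the same idea uses re.escape to build the alternation (which quotes the dots and any other metacharacters automatically) and a replacement callback, so every occurrence is looked up in the map directly instead of re-running a substitution per line. A sketch:

```python
import re

def replace_all(repl_map, text):
    # re.escape quotes '.' and other metacharacters in each key.
    pattern = re.compile("|".join(re.escape(k) for k in repl_map))
    # The callback looks up the literal matched text in the map.
    return pattern.sub(lambda m: repl_map[m.group(0)], text)

mapping = {'psi.io.api': 'psi.io.model.api', 'psi.z': 'psi.model.z'}
print(replace_all(mapping, "from psi.io.api import x"))
# from psi.io.model.api import x
```

One caveat: when keys overlap (one key is a prefix of another), the alternation should list longer keys first, since the regex engine tries alternatives left to right.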
