Replacing specific substrings in a specific part of a string - python

I have the following text file that needs to be edited in a specific way. Only the part of the file inside the (:init section should be overwritten; nothing else should be changed.
File:
(define (problem bin-picking-doosra)
(:domain bin-picking-second)
;(:requirements :typing :negative-preconditions)
(:objects
)
(:init
(batsmen first_batsman)
(bowler none_bowler)
(umpire third_umpire)
(spectator no_spectator)
)
(:goal (and
(batsmen first_batsman)
(bowler last_bowler)
(umpire third_umpire)
(spectator full_spectator)
)
)
)
In this file I want to replace every line inside the (:init section with the required string. In this case, I want to replace:
(batsmen first_batsman) with (batsmen none_batsmen)
(bowler none_bowler) with (bowler first_bowler)
(umpire third_umpire) with (umpire leg_umpire)
(spectator no_spectator) with (spectator empty_spectator)
The code I currently have is the following:
file_path = "/home/mus/problem_turtlebot.pddl"
s = open(file_path).read()
s = s.replace('(batsmen first_batsman)', '(batsmen '+ predicate_batsmen + '_batsman)')
f = open(file_path, 'w')
f.write(s)
f.close()
Here predicate_batsmen contains the word none. This works, but it only covers the first replacement listed above.
I have three problems:
This code also changes the '(batsmen first_batsman)' line in the (:goal section, which I don't want. I only want it to change the (:init section.
For the other strings in the (:init section, I currently have to repeat this code with a different statement. For example, for '(bowler none_bowler)' (the second replacement above) I need another copy of the same lines, which I don't think is good coding practice. Is there a better way?
Consider the first string in (:init that is to be overwritten, (batsmen first_batsman). Is there a way in Python to replace whatever is written in the question-mark part of a string like (batsmen ??????_batsman) with none? For now it is 'first', but even if it were 'second' ((batsmen second_batsman)) or 'last' ((batsmen last_batsman)), I want to end up with (batsmen none_batsman).
Any ideas on these issues?
Thanks

First of all you need to find the init-group. The init-group seems to have the structure:
(:init
...
)
where ... is a repetition of text contained inside parentheses, e.g. "(batsmen first_batsman)". Regular expressions are a powerful way to locate this kind of pattern in text. If you are not familiar with regular expressions (or regex for short), have a look here.
The following regex locates this group:
import re
#Matches the items in the init-group:
item_regex = r"\([\w ]+\)\s+"
#Matches the init-group including items:
init_group_regex = re.compile(r"(\(:init\s+({})+\))".format(item_regex))
init_group = init_group_regex.search(s).group()
Now you have the init-group in init_group. The next step is to locate the terms you want to replace, and actually replace them. re.sub can do just that! First store the mappings in a dictionary:
mappings = {'batsmen first_batsman': 'batsmen '+ predicate_batsmen + '_batsman',
'bowler none_bowler': 'bowler first_bowler',
'umpire third_umpire': 'umpire leg_umpire',
'spectator no_spectator': 'spectator empty_spectator'}
Finding the occurrences and replacing them by their corresponding value one-by-one:
for key, val in mappings.items():
    init_group = re.sub(key, val, init_group)
Finally you can replace the init-group in the original string:
s = init_group_regex.sub(init_group, s)
This is really flexible! You can use regex in mappings to have it match anything you like, including:
mappings = {r'batsmen \w+_batsman': 'batsmen ' + predicate_batsmen + '_batsman'}
to match 'batsmen none_batsman', 'batsmen first_batsman' etc. Note that the keys and values contain no parentheses, so the parentheses already in the file are left in place.
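Putting the steps of this answer together, a minimal self-contained sketch might look like the following. The shortened PDDL text, the hard-coded replacement values, and the helper name rewrite_init are assumptions made for illustration; passing a function to sub() is a small variation on the answer's approach that restricts the replacements to the matched (:init block.

```python
import re

# Shortened stand-in for the PDDL file contents (assumption for illustration).
pddl_text = """(define (problem bin-picking-doosra)
(:domain bin-picking-second)
(:objects
)
(:init
(batsmen first_batsman)
(bowler none_bowler)
)
(:goal (and
(batsmen first_batsman)
(bowler last_bowler)
)
)
)
"""

# Regex pattern -> replacement; \w+ matches whatever word is currently there.
mappings = {
    r"\(batsmen \w+_batsman\)": "(batsmen none_batsman)",
    r"\(bowler \w+_bowler\)": "(bowler first_bowler)",
}

# "(:init", then one or more parenthesised items, then the closing ")".
init_group_regex = re.compile(r"\(:init\s+(?:\([\w ]+\)\s+)+\)")


def rewrite_init(match):
    """Apply every mapping to the matched (:init ...) block only."""
    block = match.group()
    for pattern, replacement in mappings.items():
        block = re.sub(pattern, replacement, block)
    return block


# The (:goal ...) section is not matched, so it is left untouched.
new_text = init_group_regex.sub(rewrite_init, pddl_text)
print(new_text)
```

Because only the text matched by init_group_regex is handed to rewrite_init, the identical '(batsmen first_batsman)' line in the (:goal section stays as it was.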

Related

Sort with re.search() - Python

I am having trouble solving the following problem.
I have two *.txt files, both containing cities in Austria. In the first file, "cities1", the cities are ordered by population.
The first file (cities1.txt) looks like this:
1.,Vienna,Vienna,1.840.573
2.,Graz,Styria,273.838
3.,Linz,Upper Austria,198.181
4.,Salzburg,Salzburg,148.420
5.,Innsbruck,Tyrol,126.851
The second file (cities2.txt) looks like this:
"Villach","Carinthia",60480,134.98,501
"Innsbruck","Tyrol",126851,104.91,574
"Graz","Styria",273838,127.57,353
"Dornbirn","Vorarlberg",47420,120.93,437
"Vienna","Vienna",1840573,414.78,151
"Linz","Upper Austria",198181,95.99,266
"Klagenfurt am Woerthersee","Carinthia",97827,120.12,446
"Salzburg","Salzburg",148420,65.65,424
"Wels","Upper Austria",59853,45.92,317
"Sankt Poelten","Lower Austria",52716,108.44,267
What I would like to do, or rather what I am supposed to do, is the following. The first file, cities1.txt, is already sorted. I only need the second element of every line, i.e. the name of the city. For example, from the line 2.,Graz,Styria,273.838 I only need Graz.
Second, I should print out the area of the city, which is the fourth element of every line in cities2.txt. For example, from the third line "Graz","Styria",273838,127.57,353 I only need 127.57.
At the end the console should display the following:
Vienna,414.78
Graz,127.57
Linz,95.99
Salzburg,65.65
Innsbruck,104.91
So my problem is: how can I do this if I am only allowed to use the re.search() method? The second *.txt file is not in the same order, so I would have to bring its cities into the same order as in the first file for this to work, right?
I know it would be much easier to use re.split(), because then you could compare the list elements from both files. But I'm not allowed to do this.
I hope someone can help me, and sorry for the long text.
Here's an implementation based on my earlier comment:
with open('cities2.txt') as c:
    D = {}
    for line in c:
        t = line.split(',')
        cn = t[0].strip('"')
        D[cn] = t[-2]
with open('cities1.txt') as d:
    for line in d:
        t = line.split(',')
        print(f'{t[1]},{D[t[1]]}')
Note that this may not be robust. If there's a city name in cities1.txt that does not exist in cities2.txt then you'll get a KeyError
This is just a hint, it's your university assignment after all.
import re
TEST = '2.,Graz,Styria,273.838'
RE = re.compile('^[^,]*,([^,]*),')
if match := RE.search(TEST):
    print(match.group(1)) # 'Graz'
Let's break down the regexp:
^ - start of line
[^,]* - any character except a comma - repeated 0 or more times
this is the first field
, - one comma character
this is the field separator
( - start capturing, we are interested in this field
[^,]* - any character except a comma - repeated 0 or more times
this is the second field
) - stop capturing
, - one comma character
(don't care about the rest of line)
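Building on the hint above, one possible way to combine both files using only re.search() is sketched below. The in-memory line lists (a subset of the data from the question) and the names NAME_RE, AREA_RE and areas are assumptions for illustration; in the real task the lines would come from the two files.

```python
import re

# Stand-ins for the lines of cities1.txt and cities2.txt (subset of the question's data).
cities1_lines = [
    "1.,Vienna,Vienna,1.840.573",
    "2.,Graz,Styria,273.838",
    "3.,Linz,Upper Austria,198.181",
]
cities2_lines = [
    '"Graz","Styria",273838,127.57,353',
    '"Vienna","Vienna",1840573,414.78,151',
    '"Linz","Upper Austria",198181,95.99,266',
]

# Second comma-separated field of a cities1 line (the city name).
NAME_RE = re.compile(r'^[^,]*,([^,]*),')
# Quoted first field (city name) and fourth field (area) of a cities2 line.
AREA_RE = re.compile(r'^"([^"]*)",[^,]*,[^,]*,([^,]*),')

# Pass 1: remember the area of every city found in cities2.
areas = {}
for line in cities2_lines:
    m = AREA_RE.search(line)
    if m:
        areas[m.group(1)] = m.group(2)

# Pass 2: walk cities1 in its existing order and look each city up.
result = []
for line in cities1_lines:
    m = NAME_RE.search(line)
    if m and m.group(1) in areas:
        result.append(f"{m.group(1)},{areas[m.group(1)]}")

print("\n".join(result))
```

The dictionary lookup means the second file never has to be sorted; the output order is simply the order of cities1.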

Python: use a list index as a function argument

I'm trying to use list indices as arguments for a function that performs regex searches and substitutions over some text files. The different search patterns have been assigned to variables and I've put the variables in a list that I want to feed the function as it loops through a given text.
When I call the function using a list index as an argument nothing happens (the program runs, but no substitutions are made in my text files), however, I know the rest of the code is working because if I call the function with any of the search variables individually it behaves as expected.
When I give the print function the same list index as I'm trying to use to call my function it prints exactly what I'm trying to give as my function argument, so I'm stumped!
search1 = re.compile(r'pattern1')
search2 = re.compile(r'pattern2')
search3 = re.compile(r'pattern3')
searches = ['search1', 'search2', 'search2']
i = 0
for …
    …
    def fun(find):
        …
    fun(searches[i])
    if i <= 2:
        i += 1
    …
As mentioned, if I use fun(search1) the script edits my text files as wished. Likewise, if I add the line print(searches[i]) it prints search1 (etc.), which is what I'm trying to give as an argument to fun.
Being new to Python and programming, I have a limited investigative skill set, but after poking around as best I could and subsequently running print(searches.index(search1)) and getting a "pattern1 is not in list" error, my leading (and only) theory is that I'm giving my function the actual regular expression rather than the variable it's stored in?
Much thanks for any forthcoming help!
Try changing your searches list to [search1, search2, search3] instead of ['search1', 'search2', 'search2']. The latter contains plain strings, not the compiled regex objects.
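To illustrate the difference, here is a small sketch under assumed patterns and an assumed two-argument fun (neither is from the original post). A list of strings holds only the text "search1", which re then treats as a pattern to look for literally, while a list of the variables holds the compiled pattern objects.

```python
import re

# Hypothetical patterns, chosen only for this demonstration.
search1 = re.compile(r"\d+")
search2 = re.compile(r"[aeiou]")

names_only = ["search1", "search2"]  # strings that merely spell the variable names
patterns = [search1, search2]        # the compiled pattern objects themselves


def fun(find, text):
    """Replace every match of `find` in `text` with '#'."""
    return re.sub(find, "#", text)


text = "room 101"
# The strings are used as patterns themselves and match nothing in the text.
with_names = [fun(p, text) for p in names_only]
# The compiled patterns behave as intended.
with_patterns = [fun(p, text) for p in patterns]

print(with_names)
print(with_patterns)
```

This also explains why print(searches[i]) looked right: printing the string 'search1' shows search1, even though no pattern object is involved.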
Thanks to all for the help. eyl327's comment that I should use a list or dictionary to store my regular expressions pointed me in the right direction.
However, because I was using regex in my search patterns, I couldn't get it to work until I also created a list of compiled expressions (discovered via this thread on stored regex strings).
I very much appreciate juanpa.arrivillaga's point that I should have provided an MRE (please forgive me; with a highly limited skill set, this in itself can be hard to do). I'll just give an excerpt of a slightly amended version of my actual code demonstrating the answer (once again, please forgive its long-windedness; I'm not presently able to do anything more elegant):
…
# put regex search patterns in a list
rawExps = ['search pattern 1', 'search pattern 2', 'search pattern 3']
# create a new list of compiled search patterns
compiledExps = [regex.compile(expression, regex.V1) for expression in rawExps]
i = 0
storID = 0
newText = ""
for file in filepathList:
    for expression in compiledExps:
        with open(file, 'r') as text:
            thisText = text.read()
            lines = thisText.splitlines()
            setStorID = regex.search(compiledExps[i], thisText)
            if setStorID is not None:
                storID = int(setStorID.group())
            for line in lines:
                def idSub(find):
                    global storID
                    global newText
                    match = regex.search(find, line)
                    if match is not None:
                        newLine = regex.sub(find, str(storID), line) + "\n"
                        newText = newText + newLine
                        # plus1 is not shown in this excerpt
                        storID = plus1(int(storID), 1)
                    else:
                        newLine = line + "\n"
                        newText = newText + newLine
                # list index number can be used as an argument in the function call
                idSub(compiledExps[i])
            if i <= 2:
                i += 1
            # write is not shown in this excerpt
            write()
            newText = ""
    i = 0

Can you spot the problem with this REGEX statement?

I'm running .txt files through a for loop which should slice out keywords and append them to lists. For some reason my regex statements are returning really odd results.
My first statement which iterates through the full filenames and slices out the keyword works well.
# Creates a workflow list of file names within target directory for further iteration
stack = os.listdir(
"/Users/me/Documents/software_development/my_python_code/random/countries"
)
# declares list, to be filled, and their associated regular expression, to be used,
# in the primary loop
names = []
name_pattern = r"-\s(.*)\.txt"
# PRIMARY LOOP
for entry in stack:
    if entry == ".DS_Store":
        continue
    # extraction of country name from file name into `names` list
    name_match = re.search(name_pattern, entry)
    name = name_match.group(1)
    names.append(name)
This works fine and creates the list that I expect
However, once I move on to a similar process with the actual contents of files, it no longer works.
religions = []
reli_pattern = r"religion\s=\s(.+)."
# PRIMARY LOOP
for entry in stack:
    if entry == ".DS_Store":
        continue
    # opens and reads file within `contents` variable
    file_path = (
        "/Users/me/Documents/software_development/my_python_code/random/countries" + "/" + entry
    )
    selection = open(file_path, "rb")
    contents = str(selection.read())
    # extraction of religion type and placement into `religions` list
    reli_match = re.search(reli_pattern, contents)
    religion = reli_match.group(1)
    religions.append(religion)
The results should be something like: "therevada", "catholic", "sunni" etc.
Instead I'm getting seemingly random pieces of text from the document which have nothing to do with my regex, like ruler names and stat values that do not contain the word "religion".
To try and figure this out I isolated some of the code in the following way:
contents = "religion = catholic"
reli_pattern = r"religion\s=\s(.*)\s"
reli_match = re.search(reli_pattern, contents)
print(reli_match)
And None is printed to the console, so I am assuming the problem is with my regex. What silly mistake am I making that is causing this?
Your regular expression (religion\s=\s(.*)\s) requires that there be a trailing whitespace (the last \s there). Since your string doesn't have one, it doesn't find anything when searching thus re.search returns None.
You should either:
Change your regex to be r"religion\s=\s(.*)" or
Change the string you're searching to have a trailing whitespace (i.e 'religion = catholic' to 'religion = catholic ')
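A quick way to check both suggestions is to run the three variants side by side. This is only a sketch using the test string from the question; the variable names strict, relaxed and padded are made up for illustration.

```python
import re

contents = "religion = catholic"

# Original pattern: the final \s demands whitespace after the captured value.
strict = re.search(r"religion\s=\s(.*)\s", contents)

# Option 1: drop the trailing \s from the pattern.
relaxed = re.search(r"religion\s=\s(.*)", contents)

# Option 2: keep the pattern, but search a string that ends in whitespace.
padded = re.search(r"religion\s=\s(.*)\s", contents + " ")

print(strict)            # None, nothing follows "catholic"
print(relaxed.group(1))  # catholic
print(padded.group(1))   # catholic
```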

Python regular expression, ignoring characters until some character is matched a number of times

I'm renaming a batch of files I downloaded from a torrent and wanted to get each episode's name, so I figured regex would do the trick. I'm fairly new to regex, so I'd appreciate the help. This is what I could come up with:
I have a class with other renaming functions, so the function defined here is a method of that class, which is initialized with the path to the files directory, the expression to rename to, and the file extension.
I'm using glob to access all files with the extension ".mkv".
For debugging I printed out all the file names:
Mr.Robot.S02E01.eps2.0_unm4sk-pt1.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E02.eps2.0_unm4sk-pt2.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E03.eps2.1_k3rnel-pan1c.ksd.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E04.eps2.2_init_1.asec.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E05.eps2.3.logic-b0mb.hc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E06.eps2.4.m4ster-s1ave.aes.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E07.eps2.5_h4ndshake.sme.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E08.eps2.6.succ3ss0r.p12.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E09.eps2.7_init_5.fve.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E10.eps2.8_h1dden-pr0cess.axx.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E11.eps2.9_pyth0n-pt1.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E12.eps2.9_pyth0n-pt2.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
def strip_ep_name(self):
    for i, f in enumerate(self.files):
        f_list = f.split("\\")
        name, ext = os.path.splitext(f_list[-1])
        ep_name = name.strip(r'(.*?)".720p.WEB-DL.x264-[MULVAcoded]"')
        print(ep_name)
For me, the goal is to get the episode's name, either with or without the episode's number, because I can give the episode a new name later on.
and the output is:
r.Robot.S02E01.eps2.0_unm4sk-pt1.t
r.Robot.S02E02.eps2.0_unm4sk-pt2.t
r.Robot.S02E03.eps2.1_k3rnel-pan1c.ks
r.Robot.S02E04.eps2.2_init_1.as
r.Robot.S02E05.eps2.3.logic-b0mb.h
r.Robot.S02E06.eps2.4.m4ster-s1ave.aes
r.Robot.S02E07.eps2.5_h4ndshake.sm
r.Robot.S02E08.eps2.6.succ3ss0r.p1
r.Robot.S02E09.eps2.7_init_5.fv
r.Robot.S02E10.eps2.8_h1dden-pr0cess.a
r.Robot.S02E11.eps2.9_pyth0n-pt1.p7z
r.Robot.S02E12.eps2.9_pyth0n-pt2.p7z
I wanted to strip every part like ".eps2.2" before the episode's name, but these parts don't follow a consistent pattern.
Now I don't know how to move on from here. Can anyone help?
Do it all in one step:
\.eps\d+\.\d+[-_.](.+?)(?:\.720p.+)\.(\w+)$
Broken down, this reads:
\.eps\d+\.\d+ # ".eps", followed by digits, a dot and other digits
[-_.] # one of -, _ or .
(.+?) # anything else lazily afterwards
(?:\.720p.+) # until .720p is found (might need some tweaking)
\. # a dot
(\w+)$ # some word characters (aka the file extension) at the end
This needs to be replaced by .\1.\2 to get your desired format in the end.
Everything in Python:
import re
filenames = """
Mr.Robot.S02E01.eps2.0_unm4sk-pt1.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E02.eps2.0_unm4sk-pt2.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E03.eps2.1_k3rnel-pan1c.ksd.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E04.eps2.2_init_1.asec.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E05.eps2.3.logic-b0mb.hc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E06.eps2.4.m4ster-s1ave.aes.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E07.eps2.5_h4ndshake.sme.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E08.eps2.6.succ3ss0r.p12.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E09.eps2.7_init_5.fve.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E10.eps2.8_h1dden-pr0cess.axx.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E11.eps2.9_pyth0n-pt1.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E12.eps2.9_pyth0n-pt2.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
"""
rx = re.compile(r'\.eps\d+\.\d+[-_.](.+?)(?:\.720p.+)\.(\w+)$', re.M)
filenames = rx.sub(r".\1.\2", filenames)
print(filenames)
Which yields
Mr.Robot.S02E01.unm4sk-pt1.tc.mkv
Mr.Robot.S02E02.unm4sk-pt2.tc.mkv
Mr.Robot.S02E03.k3rnel-pan1c.ksd.mkv
Mr.Robot.S02E04.init_1.asec.mkv
Mr.Robot.S02E05.logic-b0mb.hc.mkv
Mr.Robot.S02E06.m4ster-s1ave.aes.mkv
Mr.Robot.S02E07.h4ndshake.sme.mkv
Mr.Robot.S02E08.succ3ss0r.p12.mkv
Mr.Robot.S02E09.init_5.fve.mkv
Mr.Robot.S02E10.h1dden-pr0cess.axx.mkv
Mr.Robot.S02E11.pyth0n-pt1.p7z.mkv
Mr.Robot.S02E12.pyth0n-pt2.p7z.mkv
See a demo on regex101.com.
Firstly import the regex module of Python:
import re
Then use this to remove the unwanted part from a name such as "r.Robot.S02E01.eps2.0_unm4sk-pt1.t":
ep_name = re.sub(r"eps2\.\d{1,2}(\.|\_)","",episode_name)
Use this in a loop, passing each episode name to episode_name one by one, and then print ep_name.
The output will look like:
r.Robot.S02E01.unm4sk-pt1.t
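The loop described in this answer could be sketched as follows. The names list holds two file stems taken from the question (already without the release suffix and extension), and the character class [._] is an equivalent spelling of the answer's (\.|\_); both choices are assumptions for illustration.

```python
import re

# Two example stems from the question, assumed to be already trimmed.
names = [
    "Mr.Robot.S02E01.eps2.0_unm4sk-pt1.tc",
    "Mr.Robot.S02E05.eps2.3.logic-b0mb.hc",
]

# "eps2.", one or two digits, then either a dot or an underscore.
eps_pattern = re.compile(r"eps2\.\d{1,2}[._]")

cleaned = []
for episode_name in names:
    ep_name = eps_pattern.sub("", episode_name)
    cleaned.append(ep_name)
    print(ep_name)
```

Unlike str.strip, which removes any of the given characters from both ends of the string, re.sub removes only the matched substring, so the leading "M" and the trailing letters survive.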
I'm not sure if I understand correctly; I don't know the series, so I don't know the titles either. But do you really need re?
for f in files:
    print(f[23:-35].split('.')[0])
results in
unm4sk-pt1
unm4sk-pt2
k3rnel-pan1c
init_1
logic-b0mb
m4ster-s1ave
h4ndshake
succ3ss0r
init_5
h1dden-pr0cess
pyth0n-pt1
pyth0n-pt2
Edit:
I still don't see an actual target format definition in your post, but just in case @Jan is right, here's the re-less solution for that, too:
for f in files:
    print(f[:16] + '.'.join(f[23:].split('.')[:2]) + '.mkv')
Mr.Robot.S02E01.unm4sk-pt1.tc.mkv
Mr.Robot.S02E02.unm4sk-pt2.tc.mkv
Mr.Robot.S02E03.k3rnel-pan1c.ksd.mkv
Mr.Robot.S02E04.init_1.asec.mkv
Mr.Robot.S02E05.logic-b0mb.hc.mkv
Mr.Robot.S02E06.m4ster-s1ave.aes.mkv
Mr.Robot.S02E07.h4ndshake.sme.mkv
Mr.Robot.S02E08.succ3ss0r.p12.mkv
Mr.Robot.S02E09.init_5.fve.mkv
Mr.Robot.S02E10.h1dden-pr0cess.axx.mkv
Mr.Robot.S02E11.pyth0n-pt1.p7z.mkv
Mr.Robot.S02E12.pyth0n-pt2.p7z.mkv

In Python,if startswith values in tuple, I also need to return which value

I have an area codes file that I put into a tuple:
for line1 in area_codes_file.readlines():
    if area_code_extract.search(line1):
        area_codes.append(area_code_extract.search(line1).group())
area_codes = tuple(area_codes)
and a file I read into Python full of phone numbers.
If a phone number starts with one of the area codes in the tuple, I need to do two things:
1. keep the number
2. know which area code it matched, as I need to put area codes in brackets.
So far, I was only able to do 1:
for line in txt.readlines():
    is_number = phonenumbers.parse(line,"GB")
    if phonenumbers.is_valid_number(is_number):
        if line.startswith(area_codes):
            print (line)
How do I do the second part?
The simple (if not necessarily highest performance) approach is to check each prefix individually, and keep the first match:
for line in txt:
    is_number = phonenumbers.parse(line,"GB")
    if phonenumbers.is_valid_number(is_number):
        if line.startswith(area_codes):
            print(line, next(filter(line.startswith, area_codes)))
Since the startswith check has already succeeded, we know filter(line.startswith, area_codes) will produce at least one hit, so we just pull the first hit using next.
Note: On Python 2, you should start the file with from future_builtins import filter to get the generator based filter (which will also save work by stopping the search when you get a hit). Python 3's filter already behaves like this.
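As a minimal sketch of this first approach without the phonenumbers validation, the area codes and numbers below are made-up sample values, and the bracketed output format is an assumption based on the question's wish to "put area codes in brackets".

```python
# Made-up sample data for illustration only.
area_codes = ("020", "0161", "0121")
lines = ["02079460000", "01614960000", "07700900000"]

formatted = []
for line in lines:
    # startswith accepts a tuple and is True if any prefix matches.
    if line.startswith(area_codes):
        # filter is lazy on Python 3, so next() stops at the first matching prefix.
        code = next(filter(line.startswith, area_codes))
        formatted.append(f"({code}) {line[len(code):]}")

print(formatted)
```

The third number starts with none of the codes and is therefore skipped.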
For potentially higher performance, the way to both test all prefixes at once and figure out which value hit is to use regular expressions:
import re
# Function that will match any of the given prefixes returning a match obj on hit
area_code_matcher = re.compile(r'|'.join(map(re.escape, area_codes))).match
for line in txt:
    is_number = phonenumbers.parse(line,"GB")
    if phonenumbers.is_valid_number(is_number):
        # Returns None on miss, match object on hit
        m = area_code_matcher(line)
        if m is not None:
            # Whatever matched is in the 0th grouping
            print(line, m.group())
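The regex variant can be tried out on its own with a few sample values, as in the sketch below. The codes and numbers are made up, and sorting the prefixes longest-first is an addition not in the answer: regex alternation takes the first alternative that matches, so with overlapping prefixes such as "016" and "0161" the order decides which one is reported.

```python
import re

# Made-up sample data; "016" and "0161" overlap on purpose.
area_codes = ("016", "0161", "020")
lines = ["01614960000", "02079460000", "07700900000"]

# Longest prefixes first, so the most specific code wins (an assumption about intent).
ordered = sorted(area_codes, key=len, reverse=True)
area_code_matcher = re.compile("|".join(map(re.escape, ordered))).match

hits = []
for line in lines:
    m = area_code_matcher(line)  # None on a miss, match object on a hit
    if m is not None:
        hits.append((line, m.group()))

print(hits)
```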
Lastly, one final approach you can use if the area codes are of fixed length. Rather than using startswith, you can slice directly; you know the hit because you sliced it off yourself:
# If there are a lot of area codes, using a set/frozenset will allow much faster lookup
area_codes_set = frozenset(area_codes)
for line in txt:
    is_number = phonenumbers.parse(line,"GB")
    if phonenumbers.is_valid_number(is_number):
        # Assuming lines that match always start with ###
        if line[:3] in area_codes_set:
            print(line, line[:3])
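A standalone version of the slicing approach, again with made-up three-digit codes and numbers and without the phonenumbers validation, might look like this:

```python
# Made-up sample data; every area code is assumed to be exactly 3 characters long.
area_codes_set = frozenset({"020", "023", "029"})
lines = ["02079460000", "02380000000", "01614960000"]

hits = []
for line in lines:
    prefix = line[:3]
    # Set membership is a single hash lookup, regardless of how many codes there are.
    if prefix in area_codes_set:
        hits.append((line, prefix))

print(hits)
```

This only works when all codes share one length; with mixed lengths, one of the prefix-matching approaches above is needed.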
