I have multiple text files in a folder, say "configs". I want to search each file for a particular text, "-cfg", and copy the data that follows -cfg, from the opening to the closing inverted commas ("data"). The result should be written to another text file, "result.txt", with the filename, test name and config for each file.
NOTE: Each file can have multiple "cfg" in separate line along with test name related to that configuration.
E.g: cube_demo -cfg "RGB 888; MODE 3"
My approach is to open each text file one at a time and find the pattern, then store the required result into a buffer. Later, copy the entire result into a new file.
I came across Python, and it looks like this is easy to do in it. I am still learning Python and trying to figure out how. Please help. Thanks.
I know how to open the file and iterate over each line to search for a particular string:
import re
search_term = "Cfg\s(\".*\")" // Not sure, if it's correct
ifile = open("testlist.csv", "r")
ofile = open("result.txt", "w")
searchlines = ifile.readlines()
for line in searchlines:
    if search_term in line:
        if re.search(search_term, line):
            ofile.write(\1)
            // trying to get string with the \number special sequence
ifile.close()
ofile.close()
But this gives me the complete line, I could not find how to use regular expression to get only the "data" and how to iterate over files in the folder to search the text.
Not quite there yet...
import re
search_term = "Cfg\s(\".*\")" // Not sure, if it's correct
"//" is not a valid comment marker, you want "#"
As for your regexp: from your specs you want 'cfg', followed by optional spaces, followed by any text between double quotes, stopping at the first closing double quote, and you want to capture the part between these double quotes. This is spelled 'cfg *"(.+?)"'. Since you don't want to deal with escape chars, the best way is to use a raw, single-quoted string:
exp = r'cfg *"(.+?)"'
now since you're going to reuse this expression in a loop, you might as well compile it already:
exp = re.compile(r'cfg *"(.+?)"')
So now exp is a re.Pattern object instead of a string. To use it, call its search(<text>) method with your current line as the argument. If the line matches the expression you get a re.Match object, otherwise you get None:
>>> match = exp.search('foo bar "baaz" boo')
>>> match is None
True
>>> match = exp.search('foo bar -cfg "RGB 888; MODE 3" tagada "tsoin"')
>>> match is None
False
>>>
To get the part between the double quotes, call match.group(1). That is the first capturing group; match.group(0) is the text matched by the whole expression:
>>> match.group(0)
'cfg "RGB 888; MODE 3"'
>>> match.group(1)
'RGB 888; MODE 3'
>>>
Now you just have to learn to make correct use of files... First hint: files are context managers that know how to close themselves. Second hint: files are iterable, so there is no need to read the whole file into memory. Third hint: file.write("text") won't append a newline after "text".
If we glue all this together, your code should look something like:
import re
search_term = re.compile(r'cfg *"(.+?)"')
with open("testlist.csv", "r") as ifile:
    with open("result.txt", "w") as ofile:
        for line in ifile:
            match = search_term.search(line)
            if match:
                ofile.write(match.group(1) + "\n")
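The question also asked how to cover every file in the "configs" folder and record the file name and test name. A minimal sketch, assuming the test name is the first whitespace-separated token on the line (as in the cube_demo example) and that the files end in .txt. The function name collect_configs and the tab-separated output layout are illustrative choices, not something from the question:

```python
import re
from pathlib import Path

# Same expression as above: capture the text between the quotes after "cfg".
CFG_RE = re.compile(r'cfg *"(.+?)"')


def collect_configs(folder, result_path):
    """Write one 'filename<TAB>test name<TAB>config' line per -cfg match."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):  # assumed file extension
        with open(path, "r") as ifile:
            for line in ifile:
                match = CFG_RE.search(line)
                if match:
                    # Assumption: the test name is the first token on the line.
                    test_name = line.split()[0]
                    rows.append((path.name, test_name, match.group(1)))
    with open(result_path, "w") as ofile:
        for row in rows:
            ofile.write("\t".join(row) + "\n")
    return rows
```

Calling collect_configs("configs", "result.txt") would then produce the report. Keep result.txt outside the scanned folder, otherwise it gets picked up as an input file on the next run.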
I am currently facing a problem. I am trying to write a regex code in order to match a pattern in a text file, and after finding it, remove it from the current text.
# Reading the file data and store it
with open('file.txt','r+') as f:
    file = f.read()
print(file)
Here is my text when printed
'{\n\tINFO\tDATA_NUMBER\t974\n\t{\n\t\tDAT_CQFD\n\t\t{\n\t\t\tsome random text \t787878\n\t\t}\n\t\tDATA_TO_MATCH\n\t\t{\n\t\t1\tbunch of characters \t985\n\t\t2\tbunch of data\t\t78\n\t\t}\n\t}\n\tINFO\tDATA_CATCHME\t123\n\t{\n\t\t3\tbunch of characters \n\t\t2\tbunch of datas\n\t}\n\tINFO\tDATA_INTACT\t\t456\n\t{\n\t\t3\tbunch of numbers \n\t\t2\tbunch of texts\n\t}\n\n'
I would like to find DATA_TO_MATCH, look for the closing bracket "}" that precedes it, and remove everything between that closing bracket and the next one (the next one included).
I would like to do the same for DATA_CATCHME.
Here is the expected result:
'{\n\tINFO\tDATA_NUMBER\t974\n\t{\n\t\tDATA_CQFD\n\t\t{\n\t\t\tsome random text \t787878\n\t\t}\n\n\t}\n\tINFO\tDATA_INTACT\t\t456\n\t{\n\t\t3\tbunch of numbers \n\t\t2\tbunch of texts\n\t}\n\n}\n'
I tried the following:
import re
#find the DATA_TO_MATCH
re.findall(r".*DATA_TO_MATCH",file)
#find the DATA_CATCHME
re.findall(r".*DATA_CATCHME",file)
#supposed to find everything before the closed bracket "}"
re.findall(r"(?=.*})[^}]*",file)
But I am not very familiar with regex and re, and I can't get what I want from it. I guess that once the pattern is found I will use
re.sub(my_pattern, '', text)
to remove it from my text file.
The main trick here is that \s and negated character classes such as [^}] also match newlines, so the pattern can span several lines. The re.MULTILINE flag only changes where ^ and $ match; it is harmless in the code below, but it is not what lets the pattern cross lines. You should also use re.sub directly rather than re.findall.
The regex itself is simple once you understand it. You match any characters on the line up to DATA_TO_MATCH, then chew up any whitespace that may exist (hence the *), read a {, then read all characters that aren't a }, and finally consume the }. The second one uses a very similar tactic.
import re
with open('input.txt', 'r+') as f:
    file = f.read()
# find the DATA_TO_MATCH
file = re.sub(r".*DATA_TO_MATCH\s*{[^}]*}", "", file, flags=re.MULTILINE)
# find the DATA_CATCHME
file = re.sub(r".*DATA_CATCHME[^{]*{[^}]*}", "", file, flags=re.MULTILINE)
print(file)
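To see the substitution at work without touching a file, here is a small sketch on an in-memory string. The sample text is a made-up, shortened stand-in for the file in the question (DATA_KEEP is a hypothetical block name), and no flag is passed, to show that the pattern crosses lines on its own:

```python
import re

# Hypothetical, shortened sample in the same brace-delimited layout.
sample = (
    "{\n"
    "\tDATA_KEEP\n"
    "\t{\n"
    "\t\tkept text\n"
    "\t}\n"
    "\tDATA_TO_MATCH\n"
    "\t{\n"
    "\t\t1\tbunch of characters\n"
    "\t}\n"
    "}\n"
)

# .* consumes the indentation before the keyword, \s* crosses the newline,
# and [^}]* runs over several lines until the first closing brace.
block_re = re.compile(r".*DATA_TO_MATCH\s*{[^}]*}")
cleaned = block_re.sub("", sample)
print(repr(cleaned))
```

Note that the newline that followed the removed block stays in place, so the result contains an empty line where the block used to be.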
I am a beginner in Python programming and am looking for a function that reads, for each line of a file, the text after a specific character. For example,
here is the format of the text file:
<ABC>
language \sometext.com xyz
The text file is full of such lines, and I need only the string that is between '\' and '.' (only "text" in the above example).
Here is my code, but it does not give the output I want.
f = open("test.txt", "r")
for x in f:
    if "\\" in x:
        x = x.rstrip('\\')
        print(x)
In the above code, I am just getting the output of the first line like,
output:
language sometext.com xyz
You are calling readline twice, overwriting the line variable with the second line in the text file. The second and third lines in your code effectively do nothing.
EDIT: The original question was edited, the problem is now slightly different. My advice about using regex still stands.
I would use regex, with python's built-in re module:
import re
regex = re.compile(r"\\(.+)\.")  # Pattern matching anything between \ and .
with open("test.txt", "r") as file:
    results = regex.findall(file.read())
print(results)
# Prints a list of every sub-string between \ and . in the text file.
If you want to do it line by line:
file = open("test.txt", "r")
line = file.readline()
match = regex.search(line)
if match:  # search() returns None when the line has no \ followed by a .
    print(match.group(1))  # ".group(1)" makes sure the \ and . are not included
# then you can continue with the next line
line = file.readline()
match = regex.search(line)
if match:
    print(match.group(1))
# etc
# You can do this in a loop
# or with file.readlines() which returns a list of all the lines in the file
file.close()
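The loop mentioned in the comments could look like the sketch below. The helper name extract_all is hypothetical; it accepts any iterable of lines, so an open file object works as well as a list:

```python
import re

# Anything between a backslash and the last dot on the line.
regex = re.compile(r"\\(.+)\.")


def extract_all(lines):
    """Return the captured text for every line that contains a match."""
    results = []
    for line in lines:
        match = regex.search(line)
        if match:  # skip lines such as "<ABC>" that have no backslash
            results.append(match.group(1))
    return results
```

Usage would be along the lines of `with open("test.txt") as f: print(extract_all(f))`.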
If you want more info on regex (regular expressions) in python, check out this good introduction: https://automatetheboringstuff.com/2e/chapter7/
or the official documentation:
https://docs.python.org/3/library/re.html
I have a file with Contents as below:-
He is good at python.
Python is not a language as well as animal.
Look for python near you.
Hello World it's great to be here.
Now, the script should search for the pattern "python" or "pyt" or "pyth" or "Python", or any regex related to "p/Python". After each line that contains the word, it should insert a new line such as "Lion". So the output should look like this:-
He is good at python.
Lion
Python is not a language as well as animal.
Lion
Look for python near you.
Lion
Hello World it's great to be here.
How can I do that ?
NOTE: So far I have written this code:-
def insertAfterText(args):
    file_name = args.insertAfterText[0]
    pattern = args.insertAfterText[1]
    val = args.insertAfterText[2]
    fh = fileinput.input(file_name,inplace=True)
    for line in fh:
        replacement=val+line
        line=re.sub(pattern,replacement,line)
        sys.stdout.write(line)
    fh.close()
You're better off writing a new file, than trying to write into the middle of an existing file.
with open is the best way to open files, since it safely and reliably closes them for you once you're done. Here's a cool way of using with open to open two files at once:
import re
pattern = re.compile(r'pyt', re.IGNORECASE)
filename = 'textfile.txt'
new_filename = 'new_{}'.format(filename)
with open(filename, 'r') as readfile, open(new_filename, 'w+') as writefile:
    for line in readfile:
        writefile.write(line)
        if pattern.search(line):
            writefile.write('Lion\n')
Here, we're opening the existing file, and opening a new file (creating it) to write to. We loop through the input file and simply write each line out to the new file. If a line in the original file contains matches for our regex pattern, we also write Lion\n (including the newline) after writing the original line.
Read the file into a variable:
import re
with open("textfile") as ff:
    s=ff.read()
Use regex and write the result back:
with open("textfile","w") as ff:
    ff.write(re.sub(r"(?mi)(?=.*python)(.*?$)",r"\1\nLion",s))
(?mi): m: multiline, i.e. '$' matches at the end of each line;
i: case insensitive;
(?=.*python): lookahead, checks that the rest of the line contains "python";
A lookahead doesn't consume any of the string, it only looks ahead, so:
(.*?$) then matches the whole line,
which we replace with itself ('\1') followed by the new line.
Edit:
To use from command line insert:
import sys
textfile=sys.argv[1]
pattern=sys.argv[2]
newtext=sys.argv[3]
and replace
r"(?mi)(?=.*python)(.*?$)",r"\1\nLion"
with
fr"(?mi)(?=.*{pattern})(.*?$)",fr"\1\n{newtext}"
and in open() change "textfile" to textfile.
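The same substitution can be wrapped in a function so it is easy to try on a string before pointing it at a file. This is only a sketch: the function name insert_after_matching_lines is hypothetical, the flags are passed as arguments rather than inline, and a callable replacement is used so that backslashes in newtext are not interpreted:

```python
import re


def insert_after_matching_lines(text, pattern, newtext):
    """Return text with newtext on its own line after each line matching pattern."""
    line_re = re.compile(rf"(?=.*{pattern})(.*?$)", re.MULTILINE | re.IGNORECASE)
    # m.group(0) is the whole matched line; append the new line after it.
    return line_re.sub(lambda m: m.group(0) + "\n" + newtext, text)
```

Here pattern is inserted as a regular expression. If it should be taken literally (for example when it comes straight from sys.argv), pass re.escape(pattern) instead.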
So, I have the following txt files:
test1.txt (It's all in the same line.)
(hello)(bye)
test2.txt (It's in two different lines.)
(This actually works)
(Amazing!)
And I have the following regex pattern
\((.*?)\)
Which obviously selects all the words that are inside the parenthesis.
What I want to do is to replace the words inside the () in test1.txt with the words inside the () in test2.txt, leaving test1.txt like:
(This actually works)(Amazing!)
I tried the following code, but it doesn't seem to work. What did I do wrong?
import re
pattern = re.compile("\((.*?)\)")
for line in enumerate(open("test1.txt")):
    match = re.finditer(pattern, line)
for line in enumerate(open("test2.txt")):
    pattern.sub(match, line)
I think I made a very big error, it's one of my first programs in python.
Okay, there are several problems:
finditer returns an iterator of match objects, not strings.
findall returns a list of the matched group strings.
You are doing the opposite of what you described. You want to replace the data in test1 with the data from test2, don't you?
enumerate yields tuples, so your variable line was not a line but a (line_number, line_content) pair. I use enumerate in the last code block.
So you can try to first catch the content:
pattern = re.compile(r"\((.*?)\)")
for line in open("test2.txt"):
    match = pattern.findall(line)
# match now holds ['Amazing!'] from the last line of test2: your variable match is overwritten on each line of the file...
Note: once you compile your pattern, you can call the re methods directly on the pattern object.
To do it line by line (useful for a big file), collect the matches as you go. Another option would be to load the entire file and run the regex over the whole text.
matches = []
for line in open("test2.txt"):
    matches.extend(pattern.findall(line))
# matches contains the list ['This actually works', 'Amazing!']
Then replace the content of the parentheses with your matches items:
for line in open("test1.txt"):
    for i, match in enumerate(pattern.findall(line)):
        # re.sub returns a new string, so keep the result; escape the old text so it is matched literally
        line = re.sub(re.escape(match), matches[i], line, count=1)
Note: doing this will raise an IndexError if there are more strings in parentheses in test1.txt than in test2.txt...
If you want to write an output file, you should do:
with open('fileout.txt', 'w') as outfile:
    for line in open("test1.txt"):
        # same replacement as above, applied match by match
        for i, match in enumerate(pattern.findall(line)):
            line = re.sub(re.escape(match), matches[i], line, count=1)
        outfile.write(line)
re.sub() accepts a callable as the replacement, so you can create a lambda function on the spot that steps through your matches from test2.txt to achieve your result, e.g.
import re
# slightly changed to use lookahead and lookbehind groups for a proper match/substitution
pattern = re.compile(r"(?<=\()(.*?)(?=\))")
# you can also use r"(\(.*?\))" if you're preserving the brackets
with open("test2.txt", "r") as f:  # open test2.txt for reading
    words = pattern.findall(f.read())  # grabs all the found words in test2.txt
with open("test1.txt", "r+") as f:  # open test1.txt for reading and writing
    # read the content of test1.txt and replace each match with the next `words` list value
    content = pattern.sub(lambda x: words.pop(0) if words else x.group(), f.read())
    f.seek(0)  # rewind the file to the beginning
    f.write(content)  # write the new, 'updated' content
    f.truncate()  # truncate the rest of the file (if any)
For test1.txt containing:
(hello)(bye)
and test2.txt containing:
(This actually works)
(Amazing!)
executing the above script will change test1.txt to:
(This actually works)(Amazing!)
It will also account for mismatches between the files by replacing only as many matches as were found in test2.txt (e.g. if your test1.txt contained (hello)(bye)(pie), it will be changed to (This actually works)(Amazing!)(pie)).
I'm trying to use a regex in Python to match a file (saved as a string, ie "/volumes/footage/foo/bar.mov") to a log file I create that contains a list of files. But when I run the script, it gives me this error: sre_constants.error: unbalanced parenthesis. The code I'm using is this:
To read the file:
theLogFile = The_Root_Path + ".processedlog"
if os.path.isfile(theLogFile):
    the_file = open(theLogFile, "r")
else:
    open(theLogFile, 'w').close()
    the_file = open(theLogFile, "r")
the_log = the_file.read()
the_file.close()
Then, inside a for loop, I reassign the the_file variable (I didn't realize I was doing this until I posted this question) as a string from a list of files (obtained by walking through a folder and its subfolders and grabbing all the filenames), then try to use regex to see if that filename is present in the log file:
for the_file in filenamelist:
    p = re.compile(the_file, re.IGNORECASE)
    m = p.search(the_log)
Every time it hits the re.compile() part of the code it spits out that error. And if I try to cut that out, and use re.search(the_file, the_log) it still spits out that error. I don't understand how I could be getting unbalanced parenthesis from this.
Where is the regular expression pattern? Are you trying to use filenames contained in one file as patterns to search the other file? If so, you will want to step through the_file with something like
for the_pattern in the_file:
    p = re.compile(the_pattern, re.IGNORECASE)
    m = p.search(the_log)
    ...
According to the Python re.compile documentation, the first argument to re.compile() should be the regular expression pattern as a string.
But the return value of open() is a file object, which you assign to the_file and pass to re.compile()....
Gordon,
it would seem to me that the issue is in the data. You are compiling uninspected strings from the filelist into regexp, not heeding that they might contain meta characters relevant for the regexp engine.
In your for loop, add a print the_file before the call to re.compile (it is no problem that you are re-using, as the loop variable, a name that referred to a file object before), so you can see which strings are actually coming from the filelist. Or, better still, run every the_file value through re.escape before passing it to re.compile. This escapes all metacharacters so that they match literally.
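A minimal sketch of the difference, using a made-up filename that contains a stray ")" (the path and the log contents are hypothetical):

```python
import re

# Hypothetical log contents and filename; the ")" is a regex metacharacter.
the_log = "/volumes/footage/foo/take 1).mov\n/volumes/footage/foo/bar.mov\n"
filename = "/volumes/footage/foo/take 1).mov"

try:
    re.compile(filename, re.IGNORECASE)
    compiled_raw = True
except re.error:
    compiled_raw = False  # the unmatched ")" makes the pattern invalid

# re.escape makes every metacharacter match literally.
p = re.compile(re.escape(filename), re.IGNORECASE)
found = p.search(the_log) is not None
print(compiled_raw, found)
```

If all that is needed is a case-insensitive "is this name in the log" check, a plain substring test such as `filename.lower() in the_log.lower()` avoids regular expressions altogether.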
What you're binding to the name the_file in your first snippet is a file object, even though you say that's "saved as a string". The filename (i.e. the string) is actually named theLogFile, but what you're trying to turn into a RE object is not theLogFile (the string), it's the_file (the now-closed file object). Given this, the error's somewhat quirky (one would expect a TypeError), but it's clear that you will get an error at re.compile.
the_file should be a string. In the above code the_file is the return value of open, which is a file object.