I need to apply the following patterns:
regex pattern 1 =>
search1: ^this, replace1: these
regex pattern 2 =>
search2: tests$, replace2: \t tests
regex pattern 3
.....
The following code executes only one search-and-replace operation.
How can I combine multiple search-and-replace operations? I might need to apply perhaps 10-20 patterns.
Thank you.
import re
fin = open("data.txt", "r")
fout = open("data2.txt", "w")
for line in fin:
    pattern1 = re.sub('test\.', 'tests', line)
    fout.write(pattern2)
I'm not sure what you meant by pattern1 and pattern2 (your code assigns pattern1 but writes pattern2, which is never defined), so here is my solution based on what I understood:
Simply put the patterns in a list and iterate over it.
import re
patterns = [('^this', 'these'), ('tests$', '\t tests')]
with open("data.txt", "r") as fin, open("data2.txt", "w") as fout:
    for line in fin:
        # Apply every pattern in turn to the same line
        for regex, replace in patterns:
            line = re.sub(regex, replace, line)
        fout.write(line)
You can put your regex patterns and their replacements in a dictionary and loop over it.
See the example below:
import re
patterns = {"pattern1": "replacetxt"}
fin = open("data.txt", "r")
fout = open("data2.txt", "w")
for line in fin:
    for patt, replace in patterns.items():
        line = re.sub(patt, replace, line)
    fout.write(line)
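With 10-20 patterns it can help to compile them once and keep them in an ordered list, because each substitution sees the output of the previous one. The following is only a minimal sketch of that idea; `RULES`, `apply_rules`, and the sample lines are hypothetical names and data, not part of the question.

```python
import re

# Hypothetical rule list: (compiled pattern, replacement), applied in order.
RULES = [
    (re.compile(r'^this'), 'these'),
    (re.compile(r'tests$'), '\t tests'),
]

def apply_rules(line, rules=RULES):
    """Run every substitution on the line, feeding each result to the next."""
    for pattern, replacement in rules:
        line = pattern.sub(replacement, line)
    return line

# Made-up sample lines standing in for the contents of data.txt.
sample = ["this is one of the tests\n", "nothing to change here\n"]
converted = [apply_rules(line) for line in sample]
print(converted)
```

Since the rules run in sequence, a later pattern can match text produced by an earlier replacement, so the order of the list matters.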
Related
I'm trying to find any occurrence of "fiction" preceded or followed by anything, except when it is preceded by "non-".
I tried:
.*[^(n-)]fiction.*
but it's not working as I want it to.
Can anyone help me out?
Check if this works for you:
.*(?<!non\-)fiction.*
You should avoid patterns that start with .*: they cause many unnecessary backtracking steps and slow down execution.
In Python, you can get the lines either by reading a file line by line or by splitting a string with splitlines(), and then keep the lines you need by testing each one against a pattern that contains no .*.
Reading a file line by line:
final_output = []
with open(filepath, 'r', newline="\n", encoding="utf8") as f:
    for line in f:
        if "fiction" in line and "non-fiction" not in line:
            final_output.append(line.strip())
Or, to also get lines that contain non-fiction as long as they contain a fiction with no non- in front, use a slightly modified version of @jlesuffleur's regex:
import re
final_output = []
rx = re.compile(r'\b(?<!non-)fiction\b')
with open(filepath, 'r', newline="\n", encoding="utf8") as f:
    for line in f:
        if rx.search(line):
            final_output.append(line.strip())
Getting lines from a multiline string (with both approaches mentioned above):
import re
text = "Your input string line 1\nLine 2 with fiction\nLine 3 with non-fiction\nLine 4 with fiction and non-fiction"
rx = re.compile(r'\b(?<!non-)fiction\b')
# Regex approach: returns any line that contains "fiction" without a "non-" prefix,
# even if the line also contains "non-fiction":
final_output = [line.strip() for line in text.splitlines() if rx.search(line)]
# => ['Line 2 with fiction', 'Line 4 with fiction and non-fiction']
# Non-regex approach: skips every line that contains "non-fiction",
# even if the line also contains a plain "fiction":
final_output = [line.strip() for line in text.splitlines() if "fiction" in line and "non-fiction" not in line]
# => ['Line 2 with fiction']
What about a negative lookbehind?
import re
s = 'fiction non-fiction'
res = re.findall("(?<!non-)fiction", s)
print(res)  # ['fiction']
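To see which occurrence the lookbehind accepts, it can help to list the match positions. This is a small sketch that reuses the sample string from the answer above; `rx` and `hits` are illustrative names.

```python
import re

rx = re.compile(r'(?<!non-)fiction')
s = 'fiction non-fiction'

# Collect (start offset, matched text) for every accepted occurrence.
hits = [(m.start(), m.group()) for m in rx.finditer(s)]
print(hits)  # [(0, 'fiction')]

# The second "fiction" starts at offset 12 and is rejected because
# the four characters before it are "non-".
print(s[8:12])  # non-
```

A lookbehind checks the text before the current position without consuming it, so the match itself still contains only "fiction".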
Suppose I have a data file:
# cat 1.txt
#$$!###VM - This is VM$^#^#$^$^
%#%$%^SAS - This is SAS&%^#$^$
!##!#%^$^MD - This is MD!#$!#%$
Now I want to extract the text starting with VM or SAS (excluding MD).
Expected results:
VM - This is VM
SAS - This is SAS
I am using this code but all lines are shown.
import re
f = open("1.txt", "r")
for line in f:
    p = re.match(r'.+?((SAS|VM)[-a-zA-Z0-9 ]+).+?', line)
    if p:
        print(p.groups()[0])
In a regular expression, I can use (pattern1|pattern2) to match either pattern1 or pattern2.
But in re.match, parentheses are also used to capture the matched text.
How do I specify an "either/or" match in the re.match() function?
This is one approach.
Example:
import re
filename = "1.txt"
with open(filename) as infile:
    for line in infile:
        # Drop everything except letters, hyphens, and whitespace
        line = re.sub(r"[^A-Za-z\-\s]", "", line.strip())
        if line.startswith(("VM", "SAS")):
            print(line)
Output:
VM - This is VM
SAS - This is SAS
Try it like this:
import re
with open('1.txt') as f:
    for line in f:
        extract = re.match(r'.+?((SAS|VM)[-a-zA-Z0-9 ]+).+?', line)
        if extract:
            print(extract.group(1))
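Alternation works the same way in `re.match` and `re.search`; the parentheses only group (and capture) the alternatives. One possible variation, shown here as a sketch, uses `re.search` with a non-capturing group `(?:...)` so the leading `.+?` is not needed. The sample lines are copied from the question; `LABEL_RX` and `extract_label` are hypothetical names.

```python
import re

# (?:SAS|VM) groups the two alternatives without creating a capture group.
LABEL_RX = re.compile(r'(?:SAS|VM)[-a-zA-Z0-9 ]+')

def extract_label(line):
    """Return the text starting at the first SAS or VM, or None if absent."""
    m = LABEL_RX.search(line)
    return m.group(0) if m else None

lines = [
    "#$$!###VM - This is VM$^#^#$^$^",
    "%#%$%^SAS - This is SAS&%^#$^$",
    "!##!#%^$^MD - This is MD!#$!#%$",
]
results = [extract_label(line) for line in lines]
print(results)  # ['VM - This is VM', 'SAS - This is SAS', None]
```

Note that `re.search` finds the first SAS or VM anywhere in the line, so this sketch assumes those letters do not appear earlier in the surrounding noise.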
I am very new to Python and need some basic help.
I want to find certain words in a text file:
party A %apple 1
Party B %bat 2
Party C c 3
I need to find all the words that start with %.
My code is:
searchfile = open("text.txt", "r")
for line in searchfile:
    for char in line:
        if "%" in char:
            print(char)
searchfile.close()
But the output is only the % character. I need the output to be %apple and %bat.
Any help?
You are iterating over the characters of each line instead of its words.
searchfile = open("text.txt", "r")
lines = [line.strip() for line in searchfile.readlines()]
for line in lines:
    for word in line.split(" "):
        if word.startswith("%"):
            print(word)
searchfile.close()
You should also explore regex as a way to solve this.
For the sake of illustration, I'm following up on Bipul Jain's recommendation by showing how this can be done with regex:
import re
with open('text.txt', 'r') as f:
    text = f.read()
print(re.findall(r'%\w+', text))
Results:
['%apple', '%bat']
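One thing to keep in mind is that `%\w+` also matches a `%` in the middle of a token (for example the `%off` in `50%off`). If only words that begin with `%` are wanted, one option is a lookbehind that rejects any non-whitespace character before the `%`. A sketch with made-up sample text:

```python
import re

# Made-up text: the question's three lines plus one extra edge case.
text = "party A %apple 1\nParty B %bat 2\nParty C c 3\nsale 50%off"

# Plain pattern: also picks up the "%off" inside "50%off".
loose = re.findall(r'%\w+', text)

# (?<!\S) means "not preceded by a non-whitespace character",
# so the % must be at the start of a whitespace-separated word.
strict = re.findall(r'(?<!\S)%\w+', text)

print(loose)   # ['%apple', '%bat', '%off']
print(strict)  # ['%apple', '%bat']
```

Whether the stricter form is needed depends on the data; for the three lines in the question both patterns return the same result.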
I am trying to remove emoticons from a piece of text. I took this regex from another question, but it doesn't remove any emoticons. Can you tell me what I am doing wrong, or whether there are better regexes for removing emojis from a string?
import re
myre = re.compile(u'('
    u'\ud83c[\udf00-\udfff]|'
    u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    u'[\u2600-\u26FF\u2700-\u27BF])+',
    re.UNICODE)
def clean(inputFile,outputFile):
    with open(inputFile, 'r') as original,open(outputFile, 'w+') as out:
        for line in original:
            line=myre.sub('', line)
Something like this?
import re
# Python 2 only: byte strings have .decode(), which turns the escapes into unicode
myre = re.compile('('
    '\ud83c[\udf00-\udfff]|'
    '\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    '[\u2600-\u26FF\u2700-\u27BF])+'.decode('unicode_escape'),
    re.UNICODE)
def clean(inputFile,outputFile):
    with open(inputFile, 'r') as original,open(outputFile, 'w+') as out:
        for line in original:
            # Decode each line to unicode before applying the pattern
            line = myre.sub('', line.decode('utf-8'))
            print(line)
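The snippets above target Python 2, where byte strings have `.decode()` and characters outside the Basic Multilingual Plane may be stored as surrogate pairs. In Python 3, strings are sequences of code points, so the same ranges can be written directly with `\U` escapes. The following is a sketch under that assumption: the ranges are the surrogate pairs above translated to code points, and `EMOJI_RX` and `strip_emoji` are hypothetical names.

```python
import re

# Same ranges as the surrogate-pair pattern, written as code points:
#   \ud83c[\udf00-\udfff]  -> U+1F300..U+1F3FF
#   \ud83d[\udc00-\ude4f]  -> U+1F400..U+1F64F
#   \ud83d[\ude80-\udeff]  -> U+1F680..U+1F6FF
EMOJI_RX = re.compile(
    '['
    '\U0001F300-\U0001F64F'   # first two ranges are contiguous
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF'
    ']+'
)

def strip_emoji(text):
    """Remove every run of characters that falls in the ranges above."""
    return EMOJI_RX.sub('', text)

print(strip_emoji('good morning \U0001F600\u2600 everyone'))
```

These ranges only cover the blocks listed in the original pattern, so emoji from other Unicode blocks would pass through unchanged. When reading from a file in Python 3, opening it with `encoding="utf8"` yields `str` lines that can be passed to `strip_emoji` directly.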
I have an input file containing JavaScript code with many five-digit IDs. I want to collect these IDs in a list like:
53231,53891,72829 etc
This is my current Python file:
import re
fobj = open("input.txt", "r")
text = fobj.read()
output = re.findall(r'[0-9][0-9][0-9][0-9][0-9]' ,text)
outp = open("output.txt", "w")
How can I write these IDs to the output file in the format I want?
Thanks
import re
# Use "with" so the file will automatically be closed
with open("input.txt", "r") as fobj:
    text = fobj.read()
# Use word boundary anchors (\b) so only five-digit numbers are matched.
# Otherwise, 123456 would also be matched (and the match result would be 12345)!
output = re.findall(r'\b\d{5}\b', text)
# Join the matches together
out_str = ",".join(output)
# Write them to a file, again using "with" so the file will be closed.
with open("output.txt", "w") as outp:
    outp.write(out_str)
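One caveat with `\b`: letters and underscores count as word characters, so an ID glued to an identifier (such as `id_53231`) has no word boundary in front of it and is skipped. If such IDs should be included, digit-only lookarounds are one alternative. A sketch with a made-up JavaScript fragment:

```python
import re

# Made-up JavaScript fragment, not taken from the question's input file.
js = 'var a = 53231; load("53891"); var id_72829 = 1234567;'

# \b requires a word/non-word transition on both sides of the digits.
with_boundaries = re.findall(r'\b\d{5}\b', js)

# (?<!\d) and (?!\d) only require that no digit is adjacent.
with_lookarounds = re.findall(r'(?<!\d)\d{5}(?!\d)', js)

print(",".join(with_boundaries))   # 53231,53891
print(",".join(with_lookarounds))  # 53231,53891,72829
```

Both patterns reject the seven-digit number; which one fits better depends on how the IDs appear in the actual JavaScript.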