Python using re.match hangs with long text - python

I have a text file with a list of domains, and I want to use a Python regular expression to match a domain and any of its subdomains.
Sample domains file
admin.happy.com
nothappy.com
I have the following regexp:
main_domain = 'happy.com'
mydomains = open('domains.txt','r').read().replace('\n',',')
matchobj = re.match(r'^(.*\.)*%s$' % main_domain,mydomains)
The code works fine for a short text, but when my domain file has 100+ entries it hangs and freezes.
Is there a way I can optimize the regexp to work with the content from the text file?

(.*\.)* most likely results in horrible backtracking. If the file contains one domain per line the easiest fix would be executing the regex on each line instead of the whole file at once:
main_domain = 'happy.com'
for line in open('domains.txt', 'r'):
    matchobj = re.match(r'^(.*\.)*%s$' % main_domain, line.strip())
    # do something with matchobj
If your file does not contain anything but domains in the format you posted you can even simplify this much more and not use a regex at all:
subdomains = []
for line in open('domains.txt', 'r'):
    line = line.strip()
    # require the leading dot so that nothappy.com is not treated as a subdomain
    suffix = '.' + main_domain
    if line.endswith(suffix):
        subdomains.append(line[:-len(suffix)])

To avoid catastrophic backtracking, you could simplify the regex:
import re
with open("domains.txt") as file:
    text = file.read()
main_domain = "happy.com"
subdomains = re.findall(r"^(.+)\.%s$" % re.escape(main_domain), text, re.M)
If you also want to match the main domain itself, use r"^(?:(.+)\.)?%s$" as the pattern.
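As a quick check of the simplified pattern, here is a minimal sketch that runs it against an in-memory sample instead of domains.txt. The sample text and the find_subdomains helper name are assumptions made for illustration only.

```python
import re

def find_subdomains(text, main_domain):
    # One subdomain prefix per line; re.escape keeps the dot in the
    # domain literal, and re.M lets ^ and $ match at every line boundary.
    pattern = r"^(.+)\.%s$" % re.escape(main_domain)
    return re.findall(pattern, text, re.M)

# Hypothetical sample data, one domain per line.
sample = "admin.happy.com\nnothappy.com\nhappy.com\na.b.happy.com\n"
subdomains = find_subdomains(sample, "happy.com")
print(subdomains)
```

Because the pattern requires a literal dot before the main domain, the lookalike nothappy.com and the bare happy.com are both skipped.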

Related

reading and printing text file from a website url line by line

I have this code here:
url = requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt")
File = url.text
for line in File:
    print(line)
The output looks like this:
p
i
l
l
o
w
and so on...
Instead, I want it to look like this:
pillow
fire
thumb
and so on...
I know I can add end="" inside of print(line) but I want a variable to be equal to those lines. For example
Word = line
and when you print Word, it should look like this:
pillow
fire
thumb
.text of requests' response is str, you might use .splitlines for iterating over lines as follows:
import requests
url = requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt")
for line in url.text.splitlines():
    print(line)
Note that .splitlines() handles different newline conventions, so you can use it without worrying about exactly which newlines are used (using .split("\n") is fine as long as you are sure you are working with Linux-style newlines).
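To make the difference concrete, here is a small sketch comparing the two calls on a string with Windows-style line endings. The sample text is made up for illustration; it is not the content of the URL above.

```python
# Hypothetical response body using \r\n line endings.
text = "pillow\r\nfire\r\nthumb\r\n"

by_splitlines = text.splitlines()   # recognises \r\n as a single line break
by_split = text.split("\n")         # leaves the \r behind and adds a trailing ''

print(by_splitlines)
print(by_split)
```

With .split("\n") every word keeps a trailing carriage return and the list ends with an empty string, which is why .splitlines() is the safer default when the newline style is unknown.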
Iterating with for line in url.text walks the string one character at a time, because url.text is a str rather than a file object. Instead, you can either print it directly (the \n line breaks will print as line breaks), or, if you really need to split on newlines, do for line in url.text.split('\n').
import requests
url = requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt")
for line in url.text.split('\n'):
    print(line)
Edit: You might also want to do .strip() as well to remove extra line breaks.
url.text is a str object, which you need to split() first:
import requests
url = requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt")
file = url.text.split()
for line in file:
    print(line)
You can also use split("\n"):
import requests
for l in requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt").text.split("\n"):
    print(l)

Searching for word in file and taking whole line

I am running this program to basically get the page source code of a website I put in. It saves it to a file and what I want is it to look for a specific string which is basically # for the emails. However, I can't get it to work.
import requests
import re
url = 'https://www.youtube.com/watch?v=GdKEdN66jUc&app=desktop'
data = requests.get(url)
# dump resulting text to file
with open("data6.txt", "w") as out_f:
    out_f.write(data.text)
with open("data6.txt", "r") as f:
    searchlines = f.readlines()
for i, line in enumerate(searchlines):
    if "#" in line:
        for l in searchlines[i:i+3]: print((l))
You can use the regex function findall to find all email addresses in your text content, and use file.read() instead of file.readlines() to get all the content together rather than split into separate lines.
For example:
import re
with open("data6.txt", "r") as file:
    content = file.read()
emails = re.findall(r"[\w\.]+#[\w\.]+", content)
Maybe cast to a set for uniqueness afterwards, and then save to a file however you like.
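The deduplication step could look roughly like this. It is a sketch that reuses the pattern from the answer unchanged, including its # separator, and runs on a made-up string rather than on data6.txt.

```python
import re

# Hypothetical page content; the addresses are placeholders.
content = "contact alice#example.com, bob#example.org or alice#example.com"

matches = re.findall(r"[\w\.]+#[\w\.]+", content)

# A set drops the duplicate; sorting gives a stable order for saving.
unique_matches = sorted(set(matches))
print(unique_matches)
```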

How to edit a file in python 2.7.10?

I am trying to edit a file as follows in python 2.7.10 and running into below error, can anyone provide guidance on what the issue is on how to edit files?
import fileinput,re
filename = 'epivers.h'
text_to_search = re.compile("#define EPI_VERSION_STR \"(\d+\.\d+) (TOB) (r(\d+) ASSRT)\"")
replacement_text = "#define EPI_VERSION_STR \"9.130.27.50.1.2.3 (r749679 ASSRT)\""
with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace(text_to_search, replacement_text))
    file.close()
Error:-
Traceback (most recent call last):
File "pythonfiledit.py", line 5, in <module>
with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
AttributeError: FileInput instance has no attribute '__exit__'
UPDATE:
import fileinput,re
import os
import shutil
import sys
import tempfile
filename = 'epivers.h'
text_to_search = re.compile("#define EPI_VERSION_STR \"(\d+\.\d+) (TOB) (r(\d+) ASSRT)\"")
replacement_text = "#define EPI_VERSION_STR \"9.130.27.50.1.2.3 (r749679 ASSRT)\""
with open(filename) as src, tempfile.NamedTemporaryFile(
        'w', dir=os.path.dirname(filename), delete=False) as dst:
    # Discard first line
    for line in src:
        if text_to_search.search(line):
            # Save the new first line
            line = text_to_search.sub(replacement_text,line)
            dst.write(line + '\n')
        dst.write(line)
# remove old version
os.unlink(filename)
# rename new version
os.rename(dst.name,filename)
I am trying to match line define EPI_VERSION_STR "9.130.27 (TOB) (r749679 ASSRT)"
If r is a compiled regular expression and line is a line of text, the way to apply the regex is
r.match(line)
to find a match at the beginning of line, or
r.search(line)
to find a match anywhere. In your particular case, you simply need
line = r.sub(replacement, line)
though in addition, you'll need to add a backslash before the round parentheses in your regex in order to match them literally (except in a few places where you apparently put in grouping parentheses around the \d+ for no particular reason; maybe just take those out).
Your example input string contains three dot-separated numbers, and the replacement string contains seven, so \d+\.\d+ will never match either of those. I'm guessing you want something like \d+(?:\.\d+)+ or perhaps very bluntly [\d.]+ if the periods can be adjacent.
Furthermore, a single backslash in a string will be interpreted by Python, before it gets passed to the regex engine. You'll want to use raw strings around regexes, nearly always. For improved legibility, perhaps also prefer single quotes or triple double quotes over regular double quotes, so you don't have to backslash the double quotes within the regex.
Finally, your usage of fileinput is wrong for Python 2.7. There, you can't use it as a context manager (that support was only added in Python 3.2). Just loop over the lines which fileinput.input() returns.
import fileinput, re, sys
filename = 'epivers.h'
text_to_search = re.compile(r'#define EPI_VERSION_STR "\d+(?:\.\d+)+ \(TOB\) \(r\d+ ASSRT\)"')
replacement_text = '#define EPI_VERSION_STR "9.130.27.50.1.2.3 (r749679 ASSRT)"'
for line in fileinput.input(filename, inplace=True, backup='.bak'):
    # line keeps its trailing newline, so write it back without adding another
    sys.stdout.write(text_to_search.sub(replacement_text, line))
In your first attempt, line.replace() was a good start, but it doesn't accept a regex argument (and of course, you don't close() a file you opened with with ...). In your second attempt, you are checking whether the line is identical to the regex, which of course it isn't (just like the string "two" isn't equivalent to the numeric constant 2).
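Before rewriting the file in place, it can help to check the corrected pattern on a single line in isolation. The following is a minimal sketch that uses the sample line quoted in the question; the second line (#define FOO 1) is a made-up example of a line that should pass through untouched.

```python
import re

# Raw string, literal parentheses escaped, one or more dotted number groups.
text_to_search = re.compile(
    r'#define EPI_VERSION_STR "\d+(?:\.\d+)+ \(TOB\) \(r\d+ ASSRT\)"'
)
replacement_text = '#define EPI_VERSION_STR "9.130.27.50.1.2.3 (r749679 ASSRT)"'

line = '#define EPI_VERSION_STR "9.130.27 (TOB) (r749679 ASSRT)"\n'
new_line = text_to_search.sub(replacement_text, line)

# A line that does not match is returned unchanged by sub().
other_line = text_to_search.sub(replacement_text, '#define FOO 1\n')

print(new_line, end='')
```

Since sub() returns non-matching lines as they are, there is no need to test for a match first.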
Read the file, use re.sub to substitute, then write the new contents back:
with open(filename) as f:
    text = f.read()
new_text = re.sub(r'#define EPI_VERSION_STR "\d+(?:\.\d+)+ \(TOB\) \(r\d+ ASSRT\)"',
                  '#define EPI_VERSION_STR "9.130.27.50.1.2.3 (r749679 ASSRT)"',
                  text)
with open(filename, 'w') as f:
    f.write(new_text)

Using Regex to review a Text File in Python

What I am trying to accomplish here is to have a regex return the matches I want, based on a pattern, from a text file that Python has created and written to.
Currently I am getting TypeError: 'NoneType' object is not iterable error and I am not sure why. If I need more information let me know.
#Opens Temp file
TrueURL = open("TrueURL_tmp.txt","w+")
#Reviews Data grabbed from BeautifulSoup and write urls to file
for link in g_data:
    TrueURL.write(link.get("href") + '\n')
#Creates Regex Pattern for TrueURL_tmp
pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')
search_pattern = re.search(pattern, str(TrueURL))
#Uses Regex Pattern against TrueURL_tmp file.
for url in search_pattern:
    print (url)
#Closes and deletes file
TrueURL.close()
os.remove("TrueURL_tmp.txt")
Your search is returning no match because you are doing it on the str representation of the file object not the actual file content.
You are basically searching something like:
<open file 'TrueURL_tmp.txt', mode 'w+' at 0x7f2d86522390>
If you want to search the file content, close the file so the content is definitely written, then reopen and read the lines or maybe just search in the loop for link in g_data:
If you actually want to write to temporary file then use a tempfile:
from tempfile import TemporaryFile
with TemporaryFile() as f:
    for link in g_data:
        f.write(link.get("href") + '\n')
    f.seek(0)
    #Creates Regex Pattern for TrueURL_tmp
    pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')
    search_pattern = re.search(pattern, f.read())
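When the pattern matches, search_pattern is a single _sre.SRE_Match object, which is not iterable, so you would call group() on it, i.e. print(search_pattern.group()). When nothing matches, re.search returns None, which is where the TypeError comes from. To get every match, you may want to use findall instead:
search_pattern = re.findall(pattern, f.read())
for url in search_pattern:
    print (url)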
I still think searching before you write anything, and maybe not writing at all, might be the best approach. I am not fully sure what you actually want to do, because I don't see how the file fits in; concatenating to a string would achieve the same thing.
pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')
for link in g_data:
    match = pattern.search(link.get("href"))
    if match:
        print(match.group())
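The loop above can be tried without BeautifulSoup at all. In this sketch a plain list of strings stands in for the href values taken from g_data; the URLs are hypothetical.

```python
import re

pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')

# Hypothetical href values standing in for link.get("href").
hrefs = ["/thread/123/green-apple", "/forum/index", "/thread/456/potato-salad"]

found = []
for href in hrefs:
    match = pattern.search(href)
    # search() returns None when nothing matches, so guard before group().
    if match:
        found.append(match.group())

print(found)
```

The None check is what avoids the TypeError from the question: a failed search cannot be iterated or asked for a group.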
Here is the solution I found to answer my original question, although Padraic's way is correct and a less painful process.
with TemporaryFile() as f:
    for link in g_data:
        f.write(bytes(link.get("href") + '\n', 'UTF-8'))
    f.seek(0)
    #Creates a bytes Regex Pattern for TrueURL_tmp, since the file is read back as bytes
    pattern = re.compile(rb'thread/.*/*apple|thread/.*/*potato')
    read = f.read()
    search_pattern = re.findall(pattern,read)
    #Uses Regex Pattern against TrueURL_tmp file.
    for url in search_pattern:
        print (url.decode('utf-8'))
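An alternative that avoids the bytes conversions is to open the temporary file in text mode. This is a sketch assuming Python 3, where TemporaryFile accepts mode and encoding arguments; the hrefs list is again a hypothetical stand-in for the values pulled from g_data.

```python
import re
from tempfile import TemporaryFile

pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')

# Hypothetical href values standing in for link.get("href").
hrefs = ["/thread/123/green-apple", "/forum/index", "/thread/456/potato-salad"]

# Text mode ("w+") means str goes in and str comes out, so a str pattern works.
with TemporaryFile(mode="w+", encoding="utf-8") as f:
    for href in hrefs:
        f.write(href + "\n")
    f.seek(0)
    matches = pattern.findall(f.read())

for url in matches:
    print(url)
```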

Find text with regular expression and replace in file

I would like to find text in a file with a regular expression and then replace it with another name. I have to read the file line by line first, because otherwise re.match(...) can't find the text.
The test file I would like to modify is (not all of it; I removed some code):
//...
#include <boost/test/included/unit_test.hpp>
#ifndef FUNCTIONS_TESTSUITE_H
#define FUNCTIONS_TESTSUITE_H
//...
BOOST_AUTO_TEST_SUITE(FunctionsTS)
BOOST_AUTO_TEST_CASE(test)
{
std::string l_dbConfigDataFileName = "../../Config/configDB.cfg";
DB::FUNCTIONS::DBConfigData l_dbConfigData;
//...
}
BOOST_AUTO_TEST_SUITE_END()
//...
Now the Python code that replaces the configDB name with another one. I have to find the configDB.cfg name with a regular expression because the name changes all the time. Only the name matters; the extension is not needed.
Code:
import fileinput
import re
myfile = "Tset.cpp"
#first search expression - ok. working good find and print configDB
with open(myfile) as f:
    for line in f:
        matchObj = re.match( r'(.*)../Config/(.*).cfg(.*)', line, re.M|re.I)
        if matchObj:
            print "Search : ", matchObj.group(2)
#now replace searched expression to another name - so one more time find and replace - another way - not working - file after run this code is empty?!!!
for line in fileinput.FileInput(myfile, inplace=1):
    matchObj = re.match( r'(.*)../Config/(.*).cfg(.*)', line, re.M|re.I)
    if matchObj:
        line = line.replace("Config","AnotherConfig")
From docs:
Optional in-place filtering: if the keyword argument inplace=1 is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently).
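Your file ends up empty because nothing is written to standard output, which is what becomes the new file content. What you need to do is print line in every step of the loop. Each line already ends with a newline, so print it without an additional one, for example with sys.stdout.write from the sys module. As a result: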
import fileinput
import re
import sys
...
for line in fileinput.FileInput(myfile, inplace=1):
    matchObj = re.match( r'(.*)../Config/(.*).cfg(.*)', line, re.M|re.I)
    if matchObj:
        line = line.replace("Config","AnotherConfig")
    sys.stdout.write(line)
ADDED:
Also I assume that you need to replace config.cfg to AnotherConfig.cfg. In this case, you can do something like this:
import fileinput
import re
import sys
myfile = "Tset.cpp"
regx = re.compile(r'(.*?\.\./Config/)(.*?)(\.cfg.*?)', re.I)
for line in fileinput.FileInput(myfile, inplace=1):
    # flags belong in re.compile(); the second argument of regx.match() is a start position
    matchObj = regx.match(line)
    if matchObj:
        sys.stdout.write(regx.sub(r'\1AnotherConfig\3', line))
    else:
        sys.stdout.write(line)
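The in-place approach can be exercised end to end on a throwaway file. The sketch below assumes Python 3.2 or later (where fileinput.input works as a context manager); the rename_config helper, the temporary directory, and the two-line file content are all made up for illustration.

```python
import fileinput
import os
import re
import sys
import tempfile

def rename_config(path, new_name):
    # Capture the directory prefix and the extension, swap only the name.
    pattern = re.compile(r'(\.\./Config/)(.*?)(\.cfg)', re.I)
    with fileinput.input(path, inplace=True) as lines:
        for line in lines:
            # While inplace is active, stdout is redirected into the file,
            # so every line must be written back, matched or not.
            sys.stdout.write(
                pattern.sub(lambda m: m.group(1) + new_name + m.group(3), line)
            )

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Tset.cpp")
    with open(path, "w") as f:
        f.write('std::string name = "../../Config/configDB.cfg";\n{\n')
    rename_config(path, "AnotherConfig")
    with open(path) as f:
        result = f.read()

print(result, end="")
```

A replacement function is used here instead of a template string so that the new name is inserted literally, whatever characters it contains.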
You can read about the sub function in the Python re documentation.
If I'm understanding you, you want to change in the line:
std::string l_dbConfigDataFileName = "../../Config/configDB.cfg";
just the file name 'configDB' to some other file name and rewrite the file.
First, I would suggest writing to a new file and then swapping the file names, in case something goes wrong. Rather than re.match, use re.sub: if there is a match it returns the altered line, and if not it returns the line unchanged, so you can write the result to the new file either way. Then rename the files: the old file gets a .bck suffix and the new file takes the old file's name.
import re
import os
regex = re.compile(r'(\.\./config/)(config.*)(\.cfg)', re.IGNORECASE)
oldF = 'find_config.cfg'
nwF = 'n_find_config.cfg'
bckF = 'find_confg.cfg.bck'
with open(oldF, 'r') as f, open(nwF, 'w') as nf:
    lns = f.readlines()
    for ln in lns:
        nln = re.sub(regex, r'\1new_config\3', ln)
        nf.write(nln)
os.rename(oldF, bckF)
os.rename(nwF, oldF)
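The write-then-rename idea can be wrapped in a small function and tried on a temporary file. This is a sketch under assumptions: it targets Python 3 (os.replace, tempfile.TemporaryDirectory), and the replace_with_backup helper, the .new and .bck suffixes, and the sample line are chosen for illustration.

```python
import os
import re
import tempfile

def replace_with_backup(path, pattern, repl):
    new_path = path + ".new"
    backup_path = path + ".bck"
    # Write the edited copy first; the original stays intact until the end.
    with open(path) as src, open(new_path, "w") as dst:
        for line in src:
            dst.write(pattern.sub(repl, line))
    os.replace(path, backup_path)   # keep the old version as a backup
    os.replace(new_path, path)      # move the new version into place
    return backup_path

pattern = re.compile(r'(\.\./config/)(config.*?)(\.cfg)', re.IGNORECASE)
original = 'std::string l_dbConfigDataFileName = "../../Config/configDB.cfg";\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Tset.cpp")
    with open(path, "w") as f:
        f.write(original)
    backup_path = replace_with_backup(path, pattern, r'\1new_config\3')
    with open(path) as f:
        updated = f.read()
    with open(backup_path) as f:
        backed_up = f.read()

print(updated, end="")
```

If anything fails while the new file is being written, the original is still in place under its own name.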
