Using Regex to review a Text File in Python

What I am trying to accomplish here is basically to have the regex return the match I want, based on a pattern, from a text file that Python has created and written to.
Currently I am getting a TypeError: 'NoneType' object is not iterable error and I am not sure why. If you need more information, let me know.
#Opens temp file
TrueURL = open("TrueURL_tmp.txt", "w+")

#Reviews data grabbed from BeautifulSoup and writes urls to file
for link in g_data:
    TrueURL.write(link.get("href") + '\n')

#Creates regex pattern for TrueURL_tmp
pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')
search_pattern = re.search(pattern, str(TrueURL))

#Uses regex pattern against TrueURL_tmp file
for url in search_pattern:
    print(url)

#Closes and deletes file
TrueURL.close()
os.remove("TrueURL_tmp.txt")

Your search is returning no match because you are running it on the str representation of the file object, not the actual file content.
You are basically searching something like:
<open file 'TrueURL_tmp.txt', mode 'w+' at 0x7f2d86522390>
If you want to search the file content, close the file so the content is definitely written, then reopen it and read the lines, or maybe just search inside the for link in g_data: loop.
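For illustration, a minimal sketch of that close-then-reopen route (the hrefs list below is an invented stand-in for the link.get("href") values, which are not shown in the question):

```python
import os
import re

# Stand-in for the values pulled out of g_data via link.get("href")
hrefs = ["thread/123/apple", "thread/456/banana"]

# Write, close the handle, then reopen so the content is flushed to disk
with open("TrueURL_tmp.txt", "w") as f:
    for href in hrefs:
        f.write(href + "\n")

pattern = re.compile(r"thread/.*/*apple|thread/.*/*potato")

with open("TrueURL_tmp.txt") as f:
    matches = pattern.findall(f.read())

for url in matches:
    print(url)

os.remove("TrueURL_tmp.txt")
```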
If you actually want to write to a temporary file, then use a tempfile:
from tempfile import TemporaryFile

with TemporaryFile() as f:
    for link in g_data:
        f.write(link.get("href") + '\n')
    f.seek(0)
    #Creates regex pattern for TrueURL_tmp
    pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')
    search_pattern = re.search(pattern, f.read())
search_pattern is a _sre.SRE_Match object, so you would call group, i.e. print(search_pattern.group()), or maybe you want to use findall:
search_pattern = re.findall(pattern, f.read())
for url in search_pattern:
    print(url)
I still think doing the search before you write anything might be the best approach, and maybe not writing at all, but I am not fully sure what it is you actually want to do, because I don't see how the file fits into what you are doing; concatenating to a string would achieve the same.
pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')
for link in g_data:
    match = pattern.search(link.get("href"))
    if match:
        print(match.group())

Here is the solution I have found to answer my original question, although Padraic's way is correct and a less painful process.
with TemporaryFile() as f:
    for link in g_data:
        f.write(bytes(link.get("href") + '\n', 'UTF-8'))
    f.seek(0)
    #Creates regex pattern for TrueURL_tmp (a bytes pattern, since the file holds bytes)
    pattern = re.compile(rb'thread/.*/*apple|thread/.*/*potato')
    read = f.read()
    search_pattern = re.findall(pattern, read)
    #Uses regex pattern against TrueURL_tmp file
    for url in search_pattern:
        print(url.decode('utf-8'))
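A text-mode variant of the same idea (an assumption on my part, not from either answer): opening the TemporaryFile with mode="w+" sidesteps the bytes/UTF-8 round-trip entirely. The hrefs list is again an invented stand-in for g_data:

```python
import re
from tempfile import TemporaryFile

# Stand-in for the link.get("href") values from g_data
hrefs = ["thread/1/apple", "thread/2/potato", "thread/3/carrot"]

pattern = re.compile(r"thread/.*/*apple|thread/.*/*potato")

# mode="w+" gives a text-mode handle, so plain str can be written and read
with TemporaryFile(mode="w+", encoding="utf-8") as f:
    for href in hrefs:
        f.write(href + "\n")
    f.seek(0)
    matches = re.findall(pattern, f.read())

for url in matches:
    print(url)
```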

Related

Python: Find String in File and return text

I'm trying to find a specific string in a file and have it return the text in front of it.
The file has the following: "releaseDate": "2022-07-11T07:15:00.000Z"
I want to search for releaseDate but have it return the 2022-07-11T07:15:00.000Z.
I can find it, but honestly have no idea how to return the info I need.
dateOccurence=open('scriptFile.txt', 'r').read().find('releaseDate')
Your file is JSON content, so parsing it as such would be the best idea, but then you'd need to know the path of keys to reach the value.
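A sketch of that JSON route, assuming scriptFile.txt holds valid JSON; the find_key helper is hypothetical and just walks the parsed structure for the first occurrence of a key, since the real key path is unknown:

```python
import json

def find_key(obj, key):
    # Depth-first search for the first value stored under `key`
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        for value in obj.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(obj, list):
        for item in obj:
            found = find_key(item, key)
            if found is not None:
                return found
    return None

# Inline sample standing in for the real file's (unknown) structure
data = json.loads('{"video": {"releaseDate": "2022-07-11T07:15:00.000Z"}}')
print(find_key(data, "releaseDate"))
```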
A regex approach is easier here
import re

with open('scriptFile.txt', 'r') as f:
    content = f.read()

date = re.search(r'"releaseDate":\s+"([^"]+)"', content)[1]
print(date)  # 2022-07-11T07:15:00.000Z

Searching for word in file and taking whole line

I am running this program to basically get the page source code of a website I put in. It saves it to a file, and what I want is for it to look for a specific string, which is basically # for the emails. However, I can't get it to work.
import requests
import re

url = 'https://www.youtube.com/watch?v=GdKEdN66jUc&app=desktop'
data = requests.get(url)

# dump resulting text to file
with open("data6.txt", "w") as out_f:
    out_f.write(data.text)

with open("data6.txt", "r") as f:
    searchlines = f.readlines()

for i, line in enumerate(searchlines):
    if "#" in line:
        for l in searchlines[i:i+3]:
            print(l)
You can use the regex method findall to find all email addresses in your text content, and use file.read() instead of file.readlines() to get all the content together rather than split into separate lines.
For example:
import re

with open("data6.txt", "r") as file:
    content = file.read()

emails = re.findall(r"[\w\.]+#[\w\.]+", content)
Maybe cast to a set for uniqueness afterwards, and then save to a file however you like.
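For example, a quick sketch of that de-duplicate-and-save step (the emails list is a made-up sample; note the page source uses # in place of @):

```python
# Made-up sample standing in for the findall() result above
emails = ["alice#example.com", "bob#example.com", "alice#example.com"]

unique = sorted(set(emails))  # set() drops duplicates, sorted() keeps output stable
with open("emails.txt", "w") as out:
    out.write("\n".join(unique))
```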

Create hyperlinks from urls in text file using QTextBrowser

I have a text file with some basic text:
For more information on this topic, go to (http://moreInfo.com)
This tool is available from (https://www.someWebsite.co.uk)
Contacts (https://www.contacts.net)
I would like the urls to show up as hyperlinks in a QTextBrowser, so that when clicked, the web browser will open and load the website. I have seen this post which uses:
<a href="...">Bar</a>
but as the text file can be edited by anyone (i.e. they might include text which does not provide a web address), I would like it if these addresses, if any, can be automatically hyperlinked before being added to the text browser.
This is how I read the text file:
def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    text_browser.setText(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
Edit:
Made some headway and managed to get the hyperlinks using (assuming the links are inside parenthesis):
import re

def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    for x in urls:
        if x in text:
            text = text.replace(x, '<a href="' + x + '">' + x + '</a>')
    text_browser.setHtml(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
However, it all appears in one line and not in the same format as in the text file. How could I solve this?
Matching urls correctly is more complex than your current solution might suggest. For a full breakdown of the issues, see: What is the best regular expression to check if a string is a valid URL?
The other problem is much easier to solve. To preserve newlines, you can use this:
text = '<br>'.join(text.splitlines())
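A quick demonstration of that one-liner on a two-line sample:

```python
# setHtml() collapses plain \n characters, so swap in <br> tags first
text = "For more information, go to (http://moreInfo.com)\nContacts (https://www.contacts.net)"
html = '<br>'.join(text.splitlines())
print(html)
```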

Python using re.match hangs with long text

I have a text file with a list of domains. I want to use a Python regular expression to match domains and any subdomains.
Sample domains file
admin.happy.com
nothappy.com
I have the following regexp:
main_domain = 'happy.com'
mydomains = open('domains.txt','r').read().replace('\n',',')
matchobj = re.match(r'^(.*\.)*%s$' % main_domain,mydomains)
The code works fine for a short text, but when my domain file has 100+ entries it hangs and freezes.
Is there a way I can optimize the regexp to work with the content from the text file?
(.*\.)* most likely results in horrible backtracking. If the file contains one domain per line the easiest fix would be executing the regex on each line instead of the whole file at once:
main_domain = 'happy.com'
for line in open('domains.txt', 'r'):
    matchobj = re.match(r'^(.*\.)*%s$' % main_domain, line.strip())
    # do something with matchobj
If your file does not contain anything but domains in the format you posted, you can simplify this much more and not use a regex at all:
subdomains = []
for line in open('domains.txt', 'r'):
    line = line.strip()
    # check for the leading dot so e.g. 'nothappy.com' is not treated as a subdomain
    if line.endswith('.' + main_domain):
        subdomains.append(line[:-len(main_domain) - 1])
To avoid catastrophic backtracking, you could simplify the regex:
import re

with open("domains.txt") as file:
    text = file.read()

main_domain = "happy.com"
subdomains = re.findall(r"^(.+)\.%s$" % re.escape(main_domain), text, re.M)
If you also want to match the main domain itself: r"^(?:(.+)\.)?%s$".
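Both variants can be checked against the sample file contents (inlined here as a string); note that the optional-group version yields an empty string for the bare main domain:

```python
import re

# Inline stand-in for domains.txt, with the bare main domain added
text = "admin.happy.com\nnothappy.com\nhappy.com\n"
main_domain = "happy.com"

# Subdomains only: 'nothappy.com' is correctly rejected by the required dot
subdomains = re.findall(r"^(.+)\.%s$" % re.escape(main_domain), text, re.M)

# Main domain included: the unmatched group comes back as ''
with_main = re.findall(r"^(?:(.+)\.)?%s$" % re.escape(main_domain), text, re.M)
```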

How to use a regex to parse the entire file and determine whether matches were found, rather than reading it line by line?

Instead of reading each and every line, can't we just search for the string in the file and replace it? I am trying, but unable to get any idea how to do this.
file = open("C:\\path.txt", "r+")
lines = file.readlines()
replaceDone = 0
file.seek(0)
newString = "set:: windows32\n"
for l in lines:
    if re.search("(address_num)", l, re.I) and replaceDone == 0:
        try:
            file.write(l.replace(l, newString))
            replaceDone = 1
        except IOError:
            file.close()
Here's an example you can adapt that replaces every sequence of '(address_num)' with 'set:: windows32' for a file:
import fileinput
import re
for line in fileinput.input('/home/jon/data.txt', inplace=True):
    print(re.sub('address_num', 'set:: windows32', line, flags=re.I), end='')
This is not very memory efficient, but I guess it is what you are looking for:
import re

with open(file_path, 'r') as f:
    text = f.read()
with open(file_path, 'w') as f:
    f.write(re.sub(old_string, new_string, text))
Read the whole file, replace and write back the whole file.
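An end-to-end run of that read/replace/write pattern on a throwaway file (the path and contents are invented for the demonstration):

```python
import os
import re
import tempfile

# Create a scratch file with one line to rewrite
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("address_num 42\nset:: linux\n")

# Read the whole file, substitute, write it back
with open(path) as f:
    text = f.read()
with open(path, "w") as f:
    f.write(re.sub("address_num", "set:: windows32", text, flags=re.I))

with open(path) as f:
    result = f.read()
os.remove(path)
print(result)
```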
