I'm trying to read text from a text file, use a regular expression to substitute parts of it, and then write the result to an HTML file.
Here's a snippet of what I have:
from re import sub
def markup():
    ## sub code here
    sub('[a-z]+', 'test', file_contents)
The problem seems to be with that sub line.
The code below (part of the same function) needs to create an HTML file with the substituted text.
    ## write the HTML file
    opfile = open(output_file, 'w')
    opfile.write('<html>\n')
    opfile.write('<head>\n')
    opfile.write('<title>')
    opfile.write(file_title)
    opfile.write('</title>\n')
    opfile.write('</head>\n')
    opfile.write('<body>\n')
    opfile.write(file_contents)
    opfile.write('</body>\n')
    opfile.write('</html>')
    opfile.close()
The function here is designed so I can take text out of multiple files. After calling the markup function, I can copy everything from file_contents onward and replace only the values in brackets with the names of the other files.
def content_func():
    global file_contents
    global file_title
    global output_file
    file_contents = open('example.txt', 'U').read()
    file_title = ('example')
    output_file = ('example.html')
    markup()
content_func()
example.txt is just a text file containing the text "the quick brown fox jumps over the lazy dog". What I'm hoping to achieve is to search text for specific markup and replace it with HTML markup, but I've simplified it here to help me figure it out.
Running this code should theoretically create an HTML file called example.html with a title and text saying "test", but this is not the case. I'm not familiar with regular expressions and they are driving me crazy. Can anyone please suggest what I should do with the regular expression sub?
EDIT: the code doesn't produce any errors, but the output HTML file lacks any substituted text. So the sub is searching the text from the external file, but the result isn't making it into the output HTML file.
You never save the result of sub(). Replace
sub('[a-z]+', 'test', file_contents)
with this
file_contents = sub('[a-z]+', 'test', file_contents)
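To make the difference visible, here is a minimal sketch (the sample string is just an assumption for illustration). Strings are immutable in Python, so sub() can only hand back a new string; it never changes the one you pass in:

```python
import re

file_contents = "the quick brown fox"

# Calling re.sub() and ignoring the return value leaves the variable untouched.
re.sub(r"[a-z]+", "test", file_contents)
unchanged = file_contents

# Rebinding the name to the return value is what keeps the substitution.
file_contents = re.sub(r"[a-z]+", "test", file_contents)

print(unchanged)      # the quick brown fox
print(file_contents)  # test test test test
```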
Related
I have a folder that contains thousands of raw HTML files. I would like to extract every href from each page. What would be the fastest way to do that?
href="what_i_need_here"
import re
with open('file', 'r') as f:
    print(re.findall(r"href=\"(.+?)\"\n", ''.join(f.readlines())))
This is my guess at what might work, but there's no way to tell since you didn't provide any sample input. The regex used is href="(.+?)"\n, which only matches when the closing quote is immediately followed by a newline. I read the content using f.readlines(), then joined the lines into a single string to search using ''.join. See if it works, or add examples of the text.
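If the regex turns out to be too fragile for real pages, one alternative is the standard library's html.parser. This is only a sketch: the class name HrefCollector, the helper extract_hrefs and the sample markup are made up for illustration, and it makes no claim about being the fastest option.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the value of every href attribute it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html_text):
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hypothetical sample input, since the question did not include one.
sample = '<p><a href="what_i_need_here">one</a> <a class="x" href="/two">two</a></p>'
print(extract_hrefs(sample))
```

For a folder of files you would call extract_hrefs() once per file on the text you read from it.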
I'd like to replace all &nbsp; characters (or any variants like &#160;) with a space character in the text part of an HTML file (i.e., the parts returned by text_content() in lxml; anything else, such as attributes, should not be changed).
Approaches like this one can't guarantee that only the text part is changed:
Replace `\n` in html page with space in python LXML
What is the correct way to do this in lxml?
You could use substitution from regular expressions for replacing the substring
import re
text = '''
I'd like to replace all &nbsp; characters (or any variants like &#160;) by a
space character in the text part of an html file. (i.e., the
ones returned by text_content() in lxml.
Things in anything else like attributes should not be changed.)
'''
out = re.sub(r"&\S*;", r"", text)
print(out)
>>I'd like to replace all  characters (or any variants like ) by a
space character in the text part of an html file. (i.e., the
ones returned by text_content() in lxml.
Things in anything else like attributes should not be changed.)
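A narrower alternative, for text that has already been pulled out of the markup, is to decode the entities first and then replace the one resulting character. This is a sketch under that assumption (the fragment below is hypothetical); running html.unescape() over a whole HTML file would also decode things like &lt; and could break the markup.

```python
import html

# Hypothetical text fragment, standing in for text already extracted from the page.
fragment = "one&nbsp;two&#160;three&#xA0;four"

# html.unescape() turns every spelling of the entity into the same character, U+00A0.
decoded = html.unescape(fragment)

# A plain str.replace() then swaps that single character for an ordinary space.
cleaned = decoded.replace("\u00a0", " ")
print(cleaned)  # one two three four
```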
You can read the file using the open() function and store its contents in a variable. You can then replace the unwanted parts and write the result back to the file. Here is an example:
# My file
myFile = 'myFile.html'
# Read the file and gather the contents
with open(myFile, "r") as f:
    contents = f.read()
print(contents)
# Replace the unwanted contents (here, the &nbsp; entity) with a space
newContents = contents.replace('&nbsp;', ' ')
# Now write the file with the contents you want
with open(myFile, 'w') as f:
    f.write(newContents)
print(newContents)
I have a simple script that writes the network name to a log file. It displays the following.
All User Profile : NET_NAME
What I would like to do is open that text file, extract just the NET_NAME part so that I can use it as a variable, and also save the text file with the changes.
I have tried using the split function. It kind of works when I use the text directly, but it doesn't work when I read from the file. I have searched for regex solutions but do not know the syntax to achieve what I want.
split can indeed be used to achieve this. In case you're using Python3 and the content is in text.txt file, the snippet below should be able to do the trick:
with open("text.txt", "rb") as f:
    content = f.read().decode("utf-8")
name = content.split(":")[1].strip()
print(name)
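Since the question also mentions regex, here is a rough equivalent with re.search. The content string is a stand-in for what you would read from the file, and the pattern assumes the name is whatever follows the first colon on the line:

```python
import re

# Hypothetical log content in the format shown in the question.
content = "All User Profile : NET_NAME\n"

# Capture everything after the first colon, ignoring surrounding whitespace.
match = re.search(r":\s*(.+?)\s*$", content, flags=re.MULTILINE)
name = match.group(1) if match else None
print(name)  # NET_NAME
```

Checking for None before using the result avoids an exception when the log line has an unexpected format.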
I have a file with Contents as below:-
He is good at python.
Python is not a language as well as animal.
Look for python near you.
Hello World it's great to be here.
Now, the script should search for the pattern "python", "pyt", "pyth", "Python", or any regex related to "p/Python". After each line containing a match, it should insert a new word such as "Lion". So the output should look like this:-
He is good at python.
Lion
Python is not a language as well as animal.
Lion
Look for python near you.
Lion
Hello World it's great to be here.
How can I do that?
NOTE:-
So far I have written this code:-
import fileinput
import re
import sys
def insertAfterText(args):
    file_name = args.insertAfterText[0]
    pattern = args.insertAfterText[1]
    val = args.insertAfterText[2]
    fh = fileinput.input(file_name, inplace=True)
    for line in fh:
        replacement = val + line
        line = re.sub(pattern, replacement, line)
        sys.stdout.write(line)
    fh.close()
You're better off writing a new file than trying to write into the middle of an existing file.
with open is the best way to open files, since it safely and reliably closes them for you once you're done. Here's a cool way of using with open to open two files at once:
import re
pattern = re.compile(r'pyt', re.IGNORECASE)
filename = 'textfile.txt'
new_filename = 'new_{}'.format(filename)
with open(filename, 'r') as readfile, open(new_filename, 'w+') as writefile:
    for line in readfile:
        writefile.write(line)
        if pattern.search(line):
            writefile.write('Lion\n')
Here, we're opening the existing file, and opening a new file (creating it) to write to. We loop through the input file and simply write each line out to the new file. If a line in the original file contains matches for our regex pattern, we also write Lion\n (including the newline) after writing the original line.
Read the file into a variable:
import re
with open("textfile") as ff:
    s = ff.read()
Use regex and write the result back:
with open("textfile", "w") as ff:
    ff.write(re.sub(r"(?mi)(?=.*python)(.*?$)", r"\1\nLion", s))
(?mi): m: multiline, i.e. '$' will match at the end of each line;
i: case insensitive;
(?=.*python): lookahead, checks for "python";
A lookahead doesn't step forward in the string, it only looks ahead, so:
(.*?$) will match the whole line,
which we replace with itself ('\1') followed by a newline and the new text.
Edit:
To use from command line insert:
import sys
textfile=sys.argv[1]
pattern=sys.argv[2]
newtext=sys.argv[3]
and replace
r"(?mi)(?=.*python)(.*?$)",r"\1\nLion"
with
fr"(?mi)(?=.*{pattern})(.*?$)",fr"\1\n{newtext}"
and in open() change "textfile" to textfile.
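Put together as a function that works on a string, the idea might look like the sketch below. The function name insert_after_matching_lines and the sample text are made up for illustration; pattern is treated as a regex, and a function is used as the replacement so that backslashes in the new text are not interpreted.

```python
import re


def insert_after_matching_lines(text, pattern, new_text):
    """Return text with new_text added on its own line after each line matching pattern."""
    regex = re.compile(rf"(?mi)(?=.*{pattern})(.*?$)")
    # A function replacement avoids backslash processing of new_text.
    return regex.sub(lambda m: m.group(1) + "\n" + new_text, text)


sample = (
    "He is good at python.\n"
    "Python is not a language as well as animal.\n"
    "Look for python near you.\n"
    "Hello World it's great to be here.\n"
)
result = insert_after_matching_lines(sample, "pyt", "Lion")
print(result)
```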
I'm new to Python. My second time coding in it. The main point of this script is to take a text file that contains thousands of lines of file names (sNotUsed file) and match it against about 50 XML files. The XML files may contain up to thousands of lines each and are formatted as most XML's are. I'm not sure what the problem with the code so far is. The code is not fully complete as I have not added the part where it writes the output back to an XML file, but the current last line should be printing at least once. It is not, though.
Examples of the two file formats are as follows:
TEXT FILE:
fileNameWithoutExtension1
fileNameWithoutExtension2
fileNameWithoutExtension3
etc.
XML FILE:
<blocks>
<more stuff="name">
<Tag2>
<Tag3 name="Tag3">
<!--COMMENT-->
<fileType>../../dir/fileNameWithoutExtension1</fileType>
<fileType>../../dir/fileNameWithoutExtension4</fileType>
</blocks>
MY CODE SO FAR:
import os
import re
sNotUsed=list()
sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open snotused txt file
for lines in sFile:
    sNotUsed.append(lines)
#sNotUsed = sFile.readlines() # read all lines and assign to list
sFile.close() # close file
xmlFiles=list() # list of xmlFiles in directory
usedS=list() # list of S files that do not match against sFile txt
search = "\w/([\w\-]+)"
# getting the list of xmlFiles
filelist=os.listdir('C:\Users\xxx\Desktop\dir')
for files in filelist:
    if files.endswith('.xml'):
        xmlFile = open(files, "r+") # open first file with read + write access
        xmlComp = xmlFile.readlines() # read lines and assign to list
        for lines in xmlComp: # iterate by line in list of lines
            temp = re.findall(search, lines)
            #print temp
            if temp:
                if temp[0] in sNotUsed:
                    print "yes" # debugging. I know there is at least one match for sure, but this is not being printed.
TO HELP CLEAR THINGS UP:
Sorry, I guess my question wasn't very clear. I would like the script to go through each XML line by line and see if the FILENAME part of that line matches with the exact line of the sNotUsed.txt file. If there is match then I want to delete it from the XML. If the line doesn't match any of the lines in the sNotUsed.txt then I would like it be part of the output of the new modified XML file (which will overwrite the old one). Please let me know if still not clear.
EDITED, WORKING CODE
import os
import re
import codecs
sFile = open(r"C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open sNotUsed txt file
sNotUsed=sFile.readlines() # read all lines and assign to list
sFile.close() # close file
search = re.compile(r"\w/([\w\-]+)")
sNotUsed=[x.strip().replace(',','') for x in sNotUsed]
directory=r'C:\Users\xxx\Desktop\dir'
filelist=os.listdir(directory) # getting the list of xmlFiles
# for each file in the list
for files in filelist:
    if files.endswith('.xml'): # make sure it is an XML file
        xmlFile = codecs.open(os.path.join(directory, files), "r", encoding="UTF-8") # open first file with read
        xmlComp = xmlFile.readlines() # read lines and assign to list
        print xmlComp
        xmlFile.close() # closing the file since the lines have already been read and assigned to a variable
        xmlEdit = codecs.open(os.path.join(directory, files), "w", encoding="UTF-8") # opening the same file again and overwriting all existing lines
        for lines in xmlComp: # iterate by line in list of lines
            #headerInd = re.search(search, lines) # used to get the headers, comments, and ending blocks
            temp = re.findall(search, lines) # finds all strings that match the regular expression compiled above and makes a list for each
            if temp: # if the list is not empty
                if temp[0] not in sNotUsed: # if the first (and only) value in each list is not in the sNotUsed list
                    xmlEdit.write(lines) # write it in the file
            else: # if the list is empty
                xmlEdit.write(lines) # write it (used to preserve the beginning and ending blocks of the XML, as well as comments)
        xmlEdit.close() # close the rewritten file so that everything is flushed to disk
There are a lot of things to say, but I'll try to stay concise.
PEP8: Style Guide for Python Code
You should use lower case with underscores for local variables. Take a look at PEP8, the Style Guide for Python Code.
File objects and with statement
Use the with statement to open a file, see: File Objects: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
Escape Windows filenames
Backslashes in Windows filenames can cause problems in Python programs. You must either escape each backslash by doubling it or use a raw string.
For example: if your Windows filename is "dir\notUsed.txt", you should escape it like this: "dir\\notUsed.txt" or use a raw string r"dir\notUsed.txt". If you don't do that, the "\n" will be interpreted as a newline!
Note: in Python 2, if you need to support Unicode filenames, you can use Unicode raw strings: ur"dir\notUsed.txt". (The ur prefix is not valid in Python 3, where a plain r"..." string is already Unicode.)
See also question 19065115 on Stack Overflow.
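A quick way to convince yourself, using the same example filename (the variable names are just for illustration):

```python
unescaped = "dir\notUsed.txt"   # "\n" is read as a newline character
escaped = "dir\\notUsed.txt"    # doubled backslash
raw = r"dir\notUsed.txt"        # raw string

print("\n" in unescaped)          # True: the path now contains a newline
print(escaped == raw)             # True: both spell the same path
print(len(unescaped), len(raw))   # 14 15
```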
Store the filenames in a set
A set is an optimized collection without duplicates, so membership tests are fast.
not_used_path = ur"dir\sNotUsed.txt"
with open(not_used_path) as not_used_file:
    not_used_set = set([line.strip() for line in not_used_file])
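The strip() call matters more than it looks. Lines read from a file keep their trailing newline, so a membership test against the raw lines fails even when the name is there, which is a likely reason the "yes" in the question's code was never printed. A small sketch, with io.StringIO standing in for the real sNotUsed.txt:

```python
import io

# io.StringIO stands in for the real sNotUsed.txt file here.
not_used_file = io.StringIO("fileNameWithoutExtension1\nfileNameWithoutExtension2\n")
raw_lines = list(not_used_file)  # each entry still ends with "\n"

not_used_file.seek(0)
not_used_set = {line.strip() for line in not_used_file}

print("fileNameWithoutExtension1" in raw_lines)     # False
print("fileNameWithoutExtension1" in not_used_set)  # True
```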
Compile your regex
It is more efficient to compile a regex that is used many times. Again, you should use raw strings to avoid having backslashes interpreted as escape sequences.
pattern = re.compile(r"\w/([\w\-]+)")
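To see what this pattern captures, here it is applied to one of the sample lines from the question. It needs a word character right before a slash, so only the last path segment is picked up:

```python
import re

pattern = re.compile(r"\w/([\w\-]+)")

# One of the sample XML lines from the question.
line = "<fileType>../../dir/fileNameWithoutExtension1</fileType>"
names = pattern.findall(line)
print(names)  # ['fileNameWithoutExtension1']
```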
Warning: the os.listdir() function returns a list of filenames, not a list of full paths. See this function in the Python documentation.
In your example, you list the directory 'C:\Users\xxx\Desktop\dir' with os.listdir(), and then you open each XML file in this directory with open(files, "r+"). But this is wrong unless your current working directory is that directory. The classic approach is to use the os.path.join() function like this:
desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    desktop_path = os.path.join(desktop_dir, filename)
If you want to extract the filename's extension, you can use the os.path.splitext() function.
desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    if os.path.splitext(filename)[1].lower() != '.xml':
        continue
    desktop_path = os.path.join(desktop_dir, filename)
You can simplify this with a list comprehension:
desktop_dir = r'C:\Users\xxx\Desktop\dir'
xml_list = [os.path.join(desktop_dir, filename)
            for filename in os.listdir(desktop_dir)
            if os.path.splitext(filename)[1].lower() == '.xml']
Parse an XML file
How to parse an XML file? This is a great question!
There are several possibilities:
- use regex, efficient but dangerous;
- use a SAX parser, efficient too but confusing and difficult to maintain;
- use a DOM parser, less efficient but clearer...
Consider using the lxml package (see: http://lxml.de/).
The regex approach is dangerous because, the way you read the file, you ignore the XML encoding. And that is bad! Very bad indeed! XML files are usually encoded in UTF-8. You should first decode the UTF-8 byte stream. A simple way to do that is to use codecs.open() to open an encoded file.
for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
With this solution, the full XML content is stored in the content variable as a Unicode string. You can then use a Unicode regex to parse the content.
Finally, you can use a set intersection to find out whether a given XML file has names in common with the text file.
for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
    actual_set = set(pattern.findall(content))
    print(not_used_set & actual_set)
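For the tree-based option mentioned earlier, here is a rough sketch of removing the unused entries with a parser instead of a regex. It uses the standard library's xml.etree.ElementTree rather than lxml, and the XML text is a hypothetical, well-formed stand-in for the real files (the sample in the question is not well-formed as posted). It only looks at direct children of the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one of the XML files.
xml_text = """<blocks>
  <fileType>../../dir/fileNameWithoutExtension1</fileType>
  <fileType>../../dir/fileNameWithoutExtension4</fileType>
</blocks>"""
not_used_set = {"fileNameWithoutExtension1"}

root = ET.fromstring(xml_text)
# Iterate over a copy of the children, since elements are removed inside the loop.
for element in list(root):
    if element.tag == "fileType":
        # Keep only the last path segment, i.e. the file name without its directory.
        name = (element.text or "").rsplit("/", 1)[-1]
        if name in not_used_set:
            root.remove(element)

remaining = [element.text for element in root.iter("fileType")]
print(remaining)  # ['../../dir/fileNameWithoutExtension4']
```

With real files you would parse each path with ET.parse(), apply the same loop, and write the tree back with its write() method.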