I'd like to replace all `&nbsp;` characters (or any variants like `&#160;`) with a space character in the text part of an HTML file (i.e., the text returned by text_content() in lxml; anything else, such as attributes, should not be changed).
Things like this can't guarantee to change only the text part:
Replace `\n` in html page with space in python LXML
What is the correct way to do so in lxml?
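A minimal sketch of the tree-walking approach (my illustration, not one of the answers): walk the parsed tree and rewrite only the text nodes, leaving attributes alone. It is shown with the stdlib xml.etree.ElementTree API, which lxml.etree mirrors for iter(), .text, and .tail, so the same loop works on an lxml tree; the sample markup is invented.

```python
# Sketch: rewrite only text nodes (element .text and .tail), never attributes.
# Uses the stdlib ElementTree API; lxml exposes the same interface.
import xml.etree.ElementTree as ET

def replace_nbsp(root, replacement=' '):
    # &nbsp; and &#160; both parse to the single character U+00A0
    for el in root.iter():
        if el.text:
            el.text = el.text.replace('\u00a0', replacement)
        if el.tail:
            el.tail = el.tail.replace('\u00a0', replacement)
    return root

root = ET.fromstring('<p title="keep\u00a0me">a\u00a0b<b>c\u00a0d</b>e\u00a0f</p>')
replace_nbsp(root)
# root.text, the child's .text, and the child's .tail now use plain spaces;
# the title attribute still contains U+00A0
```

With lxml.html you would build the tree with lxml.html.fromstring() instead, and the HTML parser already converts the entities to U+00A0 for you.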
You could use regular-expression substitution to replace those substrings:
import re
text = '''
I'd like to replace all &nbsp; characters (or any variants like &#160;) by a
space character in the text part of an html file. (i.e., the
ones returned by text_content() in lxml.
Things in anything else like attributes should not be changed.)
'''
out = re.sub(r"&\S*;", r"", text)
print(out)
>>I'd like to replace all characters (or any variants like ) by a
space character in the text part of an html file. (i.e., the ones returned by text_content() in lxml.
Things in anything else like attributes should not be changed.)
You can read the file using the open() function and store its contents in a variable. You can then replace the unwanted characters and write the result back to the file. Here is an example:
#My File
myFile = 'myFile.html'
#Read the file and gather the contents
with open(myFile, "r") as f:
    contents = f.read()
print(contents)
#What we want to replace unwanted contents with
newContents = contents.replace('&nbsp;', ' ')
#Now write the file with the contents you want
with open(myFile, 'w') as f:
    f.write(newContents)
print(newContents)
Related
I am currently facing a problem: I am trying to write a regex in order to match a pattern in a text file and, after finding it, remove it from the text.
# Read the file data and store it
with open('file.txt', 'r+') as f:
    file = f.read()
print(file)
Here is my text when printed
'{\n\tINFO\tDATA_NUMBER\t974\n\t{\n\t\tDAT_CQFD\n\t\t{\n\t\t\tsome random text \t787878\n\t\t}\n\t\tDATA_TO_MATCH\n\t\t{\n\t\t1\tbunch of characters \t985\n\t\t2\tbunch of data\t\t78\n\t\t}\n\t}\n\tINFO\tDATA_CATCHME\t123\n\t{\n\t\t3\tbunch of characters \n\t\t2\tbunch of datas\n\t}\n\tINFO\tDATA_INTACT\t\t456\n\t{\n\t\t3\tbunch of numbers \n\t\t2\tbunch of texts\n\t}\n\n'
I would like to match/search DATA_TO_MATCH, then look for the last closing bracket "}"
and remove everything between this closing bracket and the next one, inclusive.
And I would like to do the same for DATA_CATCHME.
Here is the expected result:
'{\n\tINFO\tDATA_NUMBER\t974\n\t{\n\t\tDATA_CQFD\n\t\t{\n\t\t\tsome random text \t787878\n\t\t}\n\n\t}\n\tINFO\tDATA_INTACT\t\t456\n\t{\n\t\t3\tbunch of numbers \n\t\t2\tbunch of texts\n\t}\n\n}\n'
I tried a few things:
import re
#find the DATA_TO_MATCH
re.findall(r".*DATA_TO_MATCH",file)
#find the DATA_CATCHME
re.findall(r".*DATA_CATCHME",file)
#supposed to find everything before the closed bracket "}"
re.findall(r"(?=.*})[^}]*",file)
But I am not very familiar with regex and re, and I can't get what I want from it.
I guess that once the pattern is found I will use
re.sub(my_pattern, '', text)
to remove it from my text file.
The main trick here is that the character classes in the pattern (`\s` and `[^}]`) also match newlines, which lets a single match span several lines. Note that re.MULTILINE only changes how `^` and `$` behave, and a bare `.` still stops at a newline unless you add re.DOTALL, so the flag is not what does the work here. You should also use re.sub directly rather than re.findall.
The regex itself is simple once you understand it. You look for any characters up to DATA_TO_MATCH, then chew up any whitespace that may exist (hence the `\s*`), read a `{`, then read all characters that aren't a `}`, and finally consume the `}`. It's a very similar tactic for the second one.
import re
with open('input.txt', 'r+') as f:
    file = f.read()
# find the DATA_TO_MATCH
file = re.sub(r".*DATA_TO_MATCH\s*{[^}]*}", "", file, flags=re.MULTILINE)
# find the DATA_CATCHME
file = re.sub(r".*DATA_CATCHME[^{]*{[^}]*}", "", file, flags=re.MULTILINE)
print(file)
I have multiple text files in a folder, say "configs". I want to search for a particular text, "-cfg", in each file and copy the data after -cfg, from the opening to the closing inverted commas ("data"). This result should be written to another text file, "result.txt", with the filename, test name, and config for each file.
NOTE: Each file can have multiple "cfg" entries on separate lines, along with the test name related to that configuration.
E.g: cube_demo -cfg "RGB 888; MODE 3"
My approach is to open each text file one at a time, find the pattern, and store the required result in a buffer; later, copy the entire result into a new file.
I came across Python and it looks like it's easy to do in Python. I'm still learning Python and trying to figure out how to do it. Please help. Thanks.
I know how to open the file and iterate over each line to search for a particular string:
import re
search_term = "Cfg\s(\".*\")" // Not sure, if it's correct
ifile = open("testlist.csv", "r")
ofile = open("result.txt", "w")
searchlines = ifile.readlines()
for line in searchlines:
    if search_term in line:
        if re.search(search_term, line):
            ofile.write(\1)
            // trying to get string with the \number special sequence
ifile.close()
ofile.close()
But this gives me the complete line; I could not find out how to use a regular expression to get only the "data", nor how to iterate over the files in the folder to search the text.
Not quite there yet...
import re
search_term = "Cfg\s(\".*\")" // Not sure, if it's correct
"//" is not a valid comment marker, you want "#"
Regarding your regexp, you want (from your specs): 'cfg', followed by one or more spaces, followed by any text between double quotes, stopping at the first closing double quote, and you want to capture the part between those double quotes. This is spelled 'cfg *"(.+?)"'. Since you don't want to deal with escape chars, the best way is to use a raw string:
exp = r'cfg *"(.+?)"'
Now, since you're going to reuse this expression in a loop, you might as well compile it right away:
exp = re.compile(r'cfg *"(.+?)"')
So now exp is a compiled pattern object instead of a string. To use it, you call its search(<text>) method with your current line as argument. If the line matches the expression you'll get a match object, else you'll get None:
>>> match = exp.search('foo bar "baaz" boo')
>>> match is None
True
>>> match = exp.search('foo bar -cfg "RGB 888; MODE 3" tagada "tsoin"')
>>> match is None
False
>>>
To get the part between the double quotes, you call match.group(1) (group 0 is the whole matched expression):
>>> match.group(0)
'cfg "RGB 888; MODE 3"'
>>> match.group(1)
'RGB 888; MODE 3'
>>>
Now you just have to learn to make correct use of files... First hint: files are context managers that know how to close themselves. Second hint: files are iterable, so there is no need to read the whole file into memory. Third hint: file.write("text") WON'T append a newline after "text".
If we glue all this together, your code should look something like:
import re
search_term = re.compile(r'cfg *"(.+?)"')
with open("testlist.csv", "r") as ifile:
    with open("result.txt", "w") as ofile:
        for line in ifile:
            match = search_term.search(line)
            if match:
                ofile.write(match.group(1) + "\n")
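The snippet above handles a single input file; the question also asks to scan every file in a folder and keep the source filename next to each match. A minimal sketch of that part (the folder layout and the '*.txt' glob are my assumptions, and collect_configs is a made-up helper name):

```python
import glob
import os
import re

pattern = re.compile(r'cfg *"(.+?)"')

def collect_configs(folder):
    """Return (filename, config) pairs for every -cfg "..." found in the folder."""
    results = []
    for path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
        with open(path) as infile:
            for line in infile:
                match = pattern.search(line)
                if match:
                    results.append((os.path.basename(path), match.group(1)))
    return results
```

You could then write the pairs to result.txt with a plain loop over collect_configs('configs').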
I'm new to Python; this is my second time coding in it. The main point of this script is to take a text file that contains thousands of lines of file names (the sNotUsed file) and match it against about 50 XML files. The XML files may contain up to thousands of lines each and are formatted as most XMLs are. I'm not sure what the problem with the code so far is. The code is not fully complete, as I have not added the part that writes the output back to an XML file, but the current last line should be printing at least once. It is not, though.
Examples of the two file formats are as follows:
TEXT FILE:
fileNameWithoutExtension1
fileNameWithoutExtension2
fileNameWithoutExtension3
etc.
XML FILE:
<blocks>
<more stuff="name">
<Tag2>
<Tag3 name="Tag3">
<!--COMMENT-->
<fileType>../../dir/fileNameWithoutExtension1</fileType>
<fileType>../../dir/fileNameWithoutExtension4</fileType>
</blocks>
MY CODE SO FAR:
import os
import re
sNotUsed=list()
sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open snotused txt file
for lines in sFile:
    sNotUsed.append(lines)
#sNotUsed = sFile.readlines() # read all lines and assign to list
sFile.close() # close file
xmlFiles=list() # list of xmlFiles in directory
usedS=list() # list of S files that do not match against sFile txt
search = "\w/([\w\-]+)"
# getting the list of xmlFiles
filelist=os.listdir('C:\Users\xxx\Desktop\dir')
for files in filelist:
    if files.endswith('.xml'):
        xmlFile = open(files, "r+") # open first file with read + write access
        xmlComp = xmlFile.readlines() # read lines and assign to list
        for lines in xmlComp: # iterate by line in list of lines
            temp = re.findall(search, lines)
            #print temp
            if temp:
                if temp[0] in sNotUsed:
                    print "yes" # debugging. I know there is at least one match for sure, but this is not being printed.
TO HELP CLEAR THINGS UP:
Sorry, I guess my question wasn't very clear. I would like the script to go through each XML line by line and see if the FILENAME part of that line matches the exact line of the sNotUsed.txt file. If there is a match, I want to delete that line from the XML. If the line doesn't match any of the lines in sNotUsed.txt, then I would like it to be part of the output of the new modified XML file (which will overwrite the old one). Please let me know if this is still not clear.
EDITED, WORKING CODE
import os
import re
import codecs
sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open sNotUsed txt file
sNotUsed=sFile.readlines() # read all lines and assign to list
sFile.close() # close file
search = re.compile(r"\w/([\w\-]+)")
sNotUsed=[x.strip().replace(',','') for x in sNotUsed]
directory=r'C:\Users\xxx\Desktop\dir'
filelist=os.listdir(directory) # getting the list of xmlFiles
# for each file in the list
for files in filelist:
    if files.endswith('.xml'): # make sure it is an XML file
        xmlFile = codecs.open(os.path.join(directory, files), "r", encoding="UTF-8") # open first file with read
        xmlComp = xmlFile.readlines() # read lines and assign to list
        print xmlComp
        xmlFile.close() # closing the file since the lines have already been read and assigned to a variable
        xmlEdit = codecs.open(os.path.join(directory, files), "w", encoding="UTF-8") # opening the same file again and overwriting all existing lines
        for lines in xmlComp: # iterate by line in list of lines
            #headerInd = re.search(search, lines) # used to get the headers, comments, and ending blocks
            temp = re.findall(search, lines) # finds all strings that match the regular expression compiled above and makes a list for each
            if temp: # if the list is not empty
                if temp[0] not in sNotUsed: # if the first (and only) value in each list is not in the sNotUsed list
                    xmlEdit.write(lines) # write it in the file
            else: # if the list is empty
                xmlEdit.write(lines) # write it (used to preserve the beginning and ending blocks of the XML, as well as comments)
There are a lot of things to say, but I'll try to stay concise.
PEP8: Style Guide for Python Code
You should use lower case with underscores for local variables.
Take a look at the PEP8: Style Guide for Python Code.
File objects and with statement
Use the with statement to open a file, see: File Objects: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
Escape Windows filenames
Backslashes in Windows filenames can cause problems in Python programs. You must escape the string using double backslashes or use raw strings.
For example: if your Windows filename is "dir\notUsed.txt", you should escape it like this: "dir\\notUsed.txt" or use a raw string r"dir\notUsed.txt". If you don't do that, the "\n" will be interpreted as a newline!
Note: if you need to support Unicode filenames, you can use Unicode raw strings: ur"dir\notUsed.txt".
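A quick demonstration of the difference (my sketch, using the filename from the example above):

```python
# "\n" in a plain string literal becomes a newline character;
# a raw string keeps the backslash and the 'n' as two characters.
plain = "dir\notUsed.txt"   # 14 characters: the \n collapsed into one newline
raw = r"dir\notUsed.txt"    # 15 characters: backslash preserved
print(len(plain), len(raw))  # prints: 14 15
print("\n" in plain)         # prints: True
```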
See also question 19065115 on StackOverflow.
Store the filenames in a set: it is an optimized collection without duplicates.
not_used_path = ur"dir\sNotUsed.txt"
with open(not_used_path) as not_used_file:
    not_used_set = set([line.strip() for line in not_used_file])
Compile your regex
It is more efficient to compile a regex that is used numerous times. Again, you should use raw strings to avoid backslash interpretation.
pattern = re.compile(r"\w/([\w\-]+)")
Warning: the os.listdir() function returns a list of filenames, not a list of full paths. See this function in the Python documentation.
In your example, you read a desktop directory 'C:\Users\xxx\Desktop\dir' with os.listdir(), and then you want to open each XML file in this directory with open(files, "r+"). But this is wrong unless your current working directory is that desktop directory. The classic usage is the os.path.join() function, like this:
desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    desktop_path = os.path.join(desktop_dir, filename)
If you want to extract the filename's extension, you can use the os.path.splitext() function.
desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    if os.path.splitext(filename)[1].lower() != '.xml':
        continue
    desktop_path = os.path.join(desktop_dir, filename)
You can simplify this with a list comprehension:
desktop_dir = r'C:\Users\xxx\Desktop\dir'
xml_list = [os.path.join(desktop_dir, filename)
            for filename in os.listdir(desktop_dir)
            if os.path.splitext(filename)[1].lower() == '.xml']
Parse an XML file
How do you parse an XML file? This is a great question!
There are several possibilities:
- use regex, efficient but dangerous;
- use SAX parser, efficient too but confusing and difficult to maintain;
- use DOM parser, less efficient but clearer...
Consider using the lxml package (see: http://lxml.de/).
The regex way is dangerous because, the way you read the file, you take no care of the XML encoding. And that is bad! Very bad indeed! XML files are usually encoded in UTF-8, so you should first decode the UTF-8 byte stream. A simple way to do that is to use codecs.open() to open the encoded file:
for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
With this solution, the full XML content is stored in the content variable as a Unicode string. You can then use a Unicode regex to parse the content.
Finally, you can use a set intersection to find whether a given XML file contains names in common with the text file:
for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
    actual_set = set(pattern.findall(content))
    print(not_used_set & actual_set)
Basically, I'm trying to read text from a text file, use a regular expression to sub it into something else, and then write it to an HTML file.
Here's a snippet of what I have:
from re import sub
def markup():
    ## sub code here
    sub('[a-z]+', 'test', file_contents)
The problem seems to be with that sub line.
The code below (part of the same function) needs to make an HTML file with the subbed text.
    ## write the HTML file
    opfile = open(output_file, 'w')
    opfile.write('<html>\n')
    opfile.write('<head>\n')
    opfile.write('<title>')
    opfile.write(file_title)
    opfile.write('</title>\n')
    opfile.write('</head>\n')
    opfile.write('<body>\n')
    opfile.write(file_contents)
    opfile.write('</body>\n')
    opfile.write('</html>')
    opfile.close()
The function here is designed so I can take text out of multiple files: after calling the markup function, I can copy everything after file_contents except for the stuff in brackets, which I would replace with the names of the other files.
def content_func():
    global file_contents
    global file_title
    global output_file
    file_contents = open('example.txt', 'U').read()
    file_title = ('example')
    output_file = ('example.html')
    markup()
content_func()
example.txt is just a text file containing the text "the quick brown fox jumps over the lazy dog". What I'm hoping to achieve is to search text for specific markup language and replace it with HTML markup, but I've simplified it here to help me try and figure it out.
Running this code should theoretically create an HTML file called example.html with a title and text saying "test"; however, this is not the case. I'm not familiar with regular expressions and they are driving me crazy. Can anyone please suggest what I should do with the regular expression sub?
EDIT: the code doesn't produce any errors, but the output HTML file lacks any substituted text, so the sub is searching the external text file but isn't putting the result into the output HTML file.
You never save the result of sub(). Replace
sub('[a-z]+', 'test', file_contents)
with this
file_contents = sub('[a-z]+', 'test', file_contents)
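For completeness, a two-line check (my sketch) showing that sub() leaves its input untouched and returns a new string, since Python strings are immutable:

```python
from re import sub

s = "the quick brown fox"
result = sub('[a-z]+', 'test', s)  # each lowercase word becomes 'test'
print(s)       # prints: the quick brown fox  (unchanged)
print(result)  # prints: test test test test
```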
I'm trying to remove all (non-space) whitespace characters from a file and replace all spaces with commas. Here is my current code:
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

content = file_get_contents('file.txt')
content = content.split
content = str(content).replace(' ',',')
with open('file.txt', 'w') as f:
    f.write(content)
When this is run, it replaces the contents of the file with:
<built-in,method,split,of,str,object,at,0x100894200>
The main issue you have is that you're assigning the method content.split to content, rather than calling it and assigning its return value. If you print out content after that assignment, it will be: <built-in method split of str object at 0x100894200> which is not what you want. Fix it by adding parentheses, to make it a call of the method, rather than just a reference to it:
content = content.split()
I think you might still have an issue after fixing that, though. str.split returns a list, which you're then turning back into a string using str (before trying to substitute commas for spaces). That's going to give you square brackets and quotation marks, which you probably don't want, and you'll get a bunch of extra commas. Instead, I suggest using the str.join method like this:
content = ",".join(content) # joins all members of the list with commas
I'm not exactly sure if this is what you want, though. Using split is going to remove all the newlines in the file, so you're going to end up with a single line of many, many words separated by commas.
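If the newlines should survive, a line-by-line variant of the same split/join idea would look like this (my sketch, not part of the original answer; commas_per_line is a made-up helper name):

```python
def commas_per_line(text):
    # split each line on runs of whitespace, rejoin with commas,
    # and keep the original line structure intact
    return "\n".join(",".join(line.split()) for line in text.splitlines())

print(commas_per_line("a b\tc\nd  e"))  # prints 'a,b,c' and 'd,e' on separate lines
```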
When you split the content, you forgot to call the function. Also, once split, it's a list whose elements no longer contain any spaces, so instead of looping to replace things, join the words back together with commas:
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

content = file_get_contents('file.txt')
content = content.split()  # <- HERE: call the method; splits on all whitespace
content = ",".join(content)  # rejoin the words, separated by commas
with open('file.txt', 'w') as f:
    f.write(content)
If you are looking to replace characters, I think you would be better off using Python's re module for regular expressions. Sample code would be as follows:
import re

def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

if __name__ == '__main__':
    content = file_get_contents('file.txt')
    # First replace any spaces with commas, then remove any other whitespace
    new_content = re.sub(r'\s', '', re.sub(' ', ',', content))
    with open('new_file.txt', 'w') as f:
        f.write(new_content)
It's more succinct than trying to split all the time and gives you a little more flexibility. Just be careful with how large a file you are opening and reading with your code; you may want to consider using a line iterator or something instead of reading the whole file's contents at once.