Why does my regex not work on input from file.read()? - python

I have a section of code that I need to remove from multiple files that starts like this:
<?php
//{{56541616
and ends like this:
//}}18420732
?>
where both strings of numbers can be any sequence of letters and numbers (not the same).
I wrote a Python program that will return the entire input string except for this problem string:
def removeInsert(text):
m = re.search(r"<\?php\n\/\/\{\{[a-zA-Z0-9]{8}.*\/\/\}\}[a-zA-Z0-9]{8}\n\?>", text, re.DOTALL)
return text[:m.start()] + text[m.end():]
This program works great when I call it with removeInsert("""[file text]""") -- the triple quotes allow it to be read in as multiline.
I attempted to extend this to open a file and pass the string contents of the file to removeInsert() with this:
def fileRW(filename):
input_file = open(filename, 'r')
text = input_file.read()
newText = removeInsert(text)
...
However, when I run fileRW([input-file]), I get this error:
return text[:m.start()] + text[m.end():]
AttributeError: 'NoneType' object has no attribute 'start'
I can confirm that "text" in that last code is actually a string, and does contain the problem code, but it seems that the removeInsert() code doesn't work on this string. My best guess is that it's related to the triple quoting I do when inputting the string manually into removeInsert(). Perhaps the text that fileRW() passes to removeInsert() is not triple-quoted (I've tried different ways of forcing it to have triple quotes ("\"\"\"" added), but that doesn't work). I have no idea how to fix this, though, and can't find any information about it in my google searching. Any suggestions?

Your regex only uses \n for lines. Your text editor may insert a carriage return and newline combination: \r\n. Try changing \n in your regex to (\r\n|\r|\n).

Keep the \n in your regular expressions and open the file as:
input_file= open(filename, 'rU')
Note the extra U in the mode. This will allow your code to work even if used on other operating systems, or given files having “foreign” end-of-line.

Related

How do I get python to interpret the ANSI escape codes for colors in a string read from a text file

All the codes I've tried work in VS Code terminal and the Widows Terminal (Power Script and Command Window), so I'm pretty happy about that, however, when I read a string from a text file and I print the string, the escape codes are printed in plain view and no colour is applied to the strings.
I've tried the octal, hexadecimal and unicode versions, I had the same problem with "\n" until I realised that the string read would contain "\n", where it would effectively escape the "" char, so calling .replace("\\n","\n") on the string solved that issue, but I got no joy with the colour codes.
This is the code that I use to read the file:
with open('ascii_art_with_color.txt','r') as file:
for line in file.readlines() :
text_line = line
print( text_line , end='' )
Sample from the ascii file:
encounter = You \033[31mencounter\033[0m a wolf howling at the moonlight
Printing using the print function works just fine, either the string constant or from a variable
print('The wolf \033[31mgrowls\033[0m at you as you try to get closer')
winning = 'The wolf lets out a \033[34mpiercing\033[0m cry, then falls to the ground'
print(winning)
Ideas? The main problem that got me stumped is that the codes are not interpreted/applied for the strings I read from the text file, anything else seems to work.
Update:
As it was suggested in the comments, the file contained the '\033' (4 chars) instead of the '\033' one char. I was hoping python would take the line, then apply/translate/encode it into the ANSI escape sequence code while printing it, as it does with the string in the example above.
In the meantime, I managed to get the colours in the text file using a script that replaces a specific string with the escape sequence (I guess python does the encoding behind the scenes before writing it to file)
file_dest = 'ascii_monster_wolf_dest.txt'
with open(file_name,'r') as file, open(file_dest,'w+') as file_dest:
for line in file.readlines():
line = line.replace('{#}','\033[31m')
line = line.replace('{*}','\033[0m')
file_dest.writelines(line)
This is some progress, but not what I really wanted tho.
Coming back to my question, is there a way to read the file and have the sequence '\033' (4 characters) being interpreted as the 1 char escape sequence, the way it seems to do with strings?
There are a couple of ways to do what you ask.
If you wrap the individual lines with quote marks, so they look like Python string constants, you can use the ast literal evaluator to decode it:
s = '"\\x61\\x62"'
# That string has 10 characters.
print( ast.literal_eval(s) )
# Prints ab
Alternatively, you can convert the strings to byte strings, and use the "unicode-escape" codec:
s = '\\x61\\x62'
s = s.encode('utf-8').decode('unicode-escape')
print( s )
# Prints ab
In my humble opinion, however, you would be better served by using some other kind of markup to denote your colors. By that, I mean something like:
<red>This is red</red> <blue>This is blue</blue
Maybe not exactly an HTML-type syntax, but something with code markers that YOU understand, that can be read by humans, and can be interpreted by all languages.
Open the file in binary format. Then use decode() as Tim Roberts suggested.
with open('ascii_art_with_color.txt','rb') as file:
for line in file.readlines() :
print( line.decode('unicode-escape') , end='' )

Python script to duplicate .tex files with small changes

I have a letter in LaTeX format. I'd like to write a short script in python that takes one argument (the addressee) and creates a .tex file with the general letter format and the addressee.
from sys import argv
script, addressee = argv
file = open('newletter.tex', 'w')
file.write("\begin{Document} Dear " + addressee + ", \n Greetings, how are you? Sincerely, Me \end{Document}")
file.close()
Is there a better function to write out large blocks of text? Also, you can see that the .tex file will contain programming syntax - will python disregard this as long as it is coerced to a string? Do I need to coerce a large block to string? Thanks in advance!
If you directly enter print "\begin..." into your interpreter, you will notice the result will omit the \b on the front of the string. This is because \b is a character that the print statement (or function if you're in 3.x) recognizes (it happens to be a backspace).
To avoid this confusion, you can use a "raw string", which in python is denoted by pre-pending an 'r':
>>> a = "\begin"
>>> b = r"\begin"
>>> print a
egin
>>> print b
\begin
>>>
Typically, when working with strings to represent file paths, or anything else which may contain a \ character, you should use a raw string.
As far as inserting information into a template, I would recommend using the format() function rather than string concatenation. To do this, your string would look like this:
r"\begin{{Document}} Dear {} \n Greetings, how are you? Sincerely, Me \end{{Document}}".format(addressee)
The argument of the function (in this case addressee) will be inserted into each {} within the string. For this reason, curly brackets which should be interpreted literally must be escaped by included them in duplicate.
I'd take the approach of creating the tex files first as letter.tex with the addressee set to something like QXQ_ADDRESSEE_QXQ.
The in the python script I'd read the entire file into memory. When you read from a file, it gets treated as a raw string with proper escaping.
with open('letter.tex', 'r') as f:
raw_letter = f.readlines()
Then just do a substitution and write the string to a file.
raw_letter.replace("QXQ_ADDRESSEE_QXQ", newname)
with open('newletter.tex', 'w') as f:
f.write(raw_letter)

Reading regexes from file, in Python

I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair in each line, and the pair is separated by commas. e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
line = line.strip()
[search_term, replace_term] = line.split(',', 1)
detergent += [[search_term,replace_term]]
This is not producing the right input. If I print the detergent I get
['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'],['peter','Peter']]
It seems to be that it is escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
a command re.sub(search_term,replace_term,file_content) written further below in the content is replacing it to be
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script I would write them inside the r'...', but I am not sure what are the issues at hand when reading form a file.
There are no issues or special requirements for reading regex's from a file. The escaping of backslashes is simply how python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Try printing it, you'll see it is displayed the same way ...
>>> r"\?"
>>> '\\?'
The reason you $1 is not being replaced is because this is not the syntax for group references. The correct syntax is \1.

replace some special character with newline \n

I have a text file with numbers and symbols, i want to delete some character of them and to put new line.
for example the text file is like that:
00004430474314-3","100004430474314-3","1779803519-3","100003004929477-3","100006224433874-3","1512754498-3","100003323786067
i want the output to be like that:
00004430474314
100004430474314
100003004929477
1779803519
100006224433874
1512754498
100003323786067
i tred to replace -3"," with \n by this code but it does not work. any help?
import re
import collections
s = re.findall('\w+', open('text.txt').read().lower())
print(s.replace("-3","",">\n"))
The re.findall is useless here.
with open('path/to/file') as infile:
contents = infile.read()
contents = contents.replace('-3","', '\n')
print(contents)
Another problem with your code is that you seem to think that "-3","" is a string containing -3",". This is not the case. Python sees a second " and interprets that as the end of the string. You have a comma right afterward, which makes python consider the second bit as the second parameter to s.replace().
What you really want to do is to tell python that those double quotes are part of the string. You can do this by manually escaping them as follows:
some_string_with_double_quotes = "this is a \"double quote\" within a string"
You can also accomplish the same thing by defining the string with single quotes:
some_string_with_double_quotes = 'this is a "double quote" within a string'
Both types of quotes are equivalent in python and can be used to define strings. This may be weird to you if you come from a language like C++, where single quotes are used for characters, and double quotes are used for strings.
First I think that the s object is not a string but a list and if you try to make is a string (s=''.join(s) for example) you are going to end with something like this:
0000443047431431000044304743143177980351931000030049294773100006224433874315127544983100003323786067
Where replace() is useless.
I would change your code to the following (tested in python 3.2)
lines = [line.strip() for line in open('text.txt')]
line=''.join(lines)
cl=line.replace("-3\",\"","\n")
print(cl)

How to prevent Python from escaping special characters when reading a regex from a text file?

I am reading a text file in Python that, among other things, contains pre-written regexes that will be used for matching later on. The text file is of the following format:
...
--> Task 2
Concatenate and print the strings "Hello, " and "world!" to the screen.
--> Answer
Hello, world!
print(\"Hello,\s\"\s*+\s*\"world!\")
--> Hint 1
You can concatenate two strings with the + operator
...
User input is being accepted based on tasks and either executed in a subprocess to see a return value or matched against a regex. The issue, though, is that python's file.readline() will escape all special characters in the regex string (i.e. backslashes), giving me something that isn't useful.
I tried to read in the file as bytes and decode the lines using the 'raw_unicode_escape' argument (described as producing "a string that is suitable as raw Unicode literal in Python source code"), but no dice:
file.open(filename, 'rb')
for line in file:
line = line.decode('raw_unicode_escape')
...
Am I going about this the completely wrong way?
Thanks for any and all help.
p.s. I found this question as well: Issue while reading special characters from file. However, I still have the same trouble when I use file.open(filename, 'r', encoding='utf-8').
Python regex patterns are just plain old strings. There should be no problem with storing them in a file. Perhaps when you use file.readline() you are seeing escaped characters because you are looking at the repr of the line? That should not be an issue when you actually use the pattern as a regex however:
import re
filename='/tmp/test.txt'
with open(filename,'w') as f:
f.write(r'\"Hello,\s\"\s*\+\s*\"world!\"')
with open(filename,'r') as f:
pat = f.readline()
print(pat)
# \"Hello,\s\"\s*\+\s*\"world!\"
print(repr(pat))
# '\\"Hello,\\s\\"\\s*\\+\\s*\\"world!\\"'
assert re.search(pat,' "Hello, " + "world!"') # Shows match was found

Categories