I would like to replace every line that starts with a certain expression (example: <Output>) with the output path I want. I found and got working a Python script that replaces every occurrence of one string with another in a file, something like:
text = open( path ).read()
if pattern in text:
    open( path, 'w' ).write( text.replace( pattern, replace ) )
However I would like to replace the text.replace( pattern, replace ) with something that replaces the entire line that contains pattern with replace. I have tried some things and failed miserably.
Note: I can read but not quite write python...
One of my failures did replace the pattern with the line. Actually, it replaced the entire file with only the replace pattern, as many times as it was needed... Yeah, not funny since I was doing a recursive search (and the previous attempt, to replace one string with another, worked perfectly, so I was brave and set my target directory as the root of what I want to work with)
There are other great examples that read line by line and write to an output file, and then copy the output file to the input file, but I got an error doing that.
I don't really want to use regex, because the patterns I might want to search for (and especially what I want to replace them with) may contain many special characters, including backslashes, but these could be escaped if needed.
To replace lines with replace if they start with pattern:
text = open(path).read()
new_text = '\n'.join(replace if line.startswith(pattern) else line
                     for line in text.splitlines())
open(path, 'w').write(new_text)
Or optimized for memory usage, and using the with statement, which is a bit more idiomatic:
with open(input_path) as text, open(output_path, 'w') as new_text:
    new_text.write(''.join(replace if line.startswith(pattern) else line
                           for line in text))
You'll want to make sure replace has a newline character (\n) in it for the latter example to work as you'd expect.
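If the goal is to rewrite the file in place rather than produce a second file, one option is to write to a temporary file in the same directory and swap it in once writing has finished. The following is a minimal sketch of that idea, not the only way to do it. The function name replace_lines_starting_with is made up for illustration, and the sketch assumes the file can be read with the default text encoding.

```python
import os
import tempfile


def replace_lines_starting_with(path, pattern, replace):
    """Rewrite path, replacing every line that starts with pattern by replace."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file next to the original, so a failure midway
    # leaves the original file untouched.
    with open(path) as src, tempfile.NamedTemporaryFile(
            "w", dir=directory, delete=False) as dst:
        for line in src:
            if line.startswith(pattern):
                dst.write(replace + "\n")  # the replacement gets its own newline
            else:
                dst.write(line)            # other lines are copied unchanged
    # Both files are closed at this point; swap the new file in.
    os.replace(dst.name, path)
```

Because startswith is a plain string comparison, backslashes and other special characters in pattern or replace need no escaping.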
I am trying to figure out how to write a simple regex that would highlight newline characters only if they appear at the beginning or end of some data while preserving the newline.
In the below example, line 1 and line 14 both are new lines. Those are the only two lines I am trying to highlight as they appear at the beginning and end of the data.
import regex as re
from colorama import Fore, Back
def red(s):
    return Back.RED + s + Back.RESET
with open('/tmp/1.py', 'r') as f:
    data = f.read()
print(
    re.sub(r'(^\n|\n$)', red(r'\1'), data)
)
In the open expression, data is the same content as the example posted above.
In the above example, this is the result I am getting:
As one can see, the red highlight is missing on line 1 and is spanning all the way in line 14. What I would like is for the color to appear only once per new line character.
You can actually use your regex, but without the "multiline" flag. Then it will treat the whole string as one unit and match exactly what you want.
^\n|\n$
Here you can see that there are two matches. If you delete the newlines at the beginning or at the end, the matches disappear. The multiline flag is set or disabled at the end of the regex line. You could do that in your language too.
https://regex101.com/r/pSRHPU/2
After reading all the comments, and suggestions, and combining a subset of them all, I finally have a working version. For anyone that is interested:
One issue I cannot overcome without writing an OS-specific check is the extra newline that gets added on Windows.
A couple of highlights that were picked up:
A \n cannot be colored, so replace it with a colored space plus a newline.
I have not tested this, but by getting rid of the group replacement, it may be possible to apply this to bytes as well.
Windows support can be obtained by calling init() from colorama.
import regex as re
from colorama import Back, init
init() # for windows
def red(s):
    return Back.RED + s + Back.RESET
with open('/tmp/1.py', 'r') as f:
    data = f.read()
first_line = re.sub(r'\A\n', red(' ')+'\n', data)
last_line = re.sub(r'\n\Z', '\n'+red(' '), first_line)
print(last_line)
OSX/Linux
Windows
I found a way that seems to allow you to match the start/end of the whole string. See the "Permanent Start of String and End of String Anchors" part from https://www.regular-expressions.info/anchors.html
\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string.
I created a demo here https://regex101.com/r/n2DAWh/1
Regex is: (\A\n|\n\Z)
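To see the difference between the permanent anchors and the multiline ones without any terminal colors involved, here is a small sketch using the standard library re module (the posts above use the third-party regex module). The function name mark_edge_newlines and the "<NL>" marker are invented for this illustration.

```python
import re


def mark_edge_newlines(data, mark="<NL>"):
    """Put mark next to a newline at the very start or very end of data."""
    data = re.sub(r"\A\n", mark + "\n", data)   # \A: only the start of the string
    return re.sub(r"\n\Z", "\n" + mark, data)   # \Z: only the end of the string


sample = "\nfirst\n\nlast\n"
marked = mark_edge_newlines(sample)

# With re.MULTILINE, ^ and $ also match around the blank line in the middle,
# so this finds more than the two edge newlines.
multiline_hits = re.findall(r"^\n|\n$", sample, flags=re.MULTILINE)
edge_hits = re.findall(r"\A\n|\n\Z", sample)
```

Here marked keeps every original newline and only gains a marker at each end, which is the behavior the question asks for.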
I am new to Python. My problem here is that:
I want to match a pattern against a large file and return the matching lines (not just the matched string) from it. I do NOT want a for loop for this, as my file is huge. I am using mmap for reading the file.
in the above file, if I search for bhuvi, I should get 2 rows, bhuvi and bhuvi Kumar
I used re.findall() for this, but it just returns the substrings, not the whole lines.
Can someone please suggest what I can do here?
If your input file is huge, you cannot use readlines, but nothing
prevents you from reading one line in a loop.
As the file object is iterable, you can write the loop as:
for line in fh:
and process the content of the input line inside the loop.
The file size is not important, as you do not attempt to read all lines at once.
To check for presence of your string (bhuvi) in the line use
re.search, not re.findall.
Actually you don't need any list of matches, it is enough to find
a single match (it works quicker).
Below you have an example program (Python 3.7), writing the lines containing your
string, along with the line number:
import re
cnt = 0
with open('input.txt') as fh:
    for line in fh:
        line = line.rstrip()
        cnt += 1
        if re.search('bhuvi', line):
            print(f'{cnt}: {line}')
Note that I used rstrip() to remove the trailing newline, if any.
Edit after your comment:
You wrote that the file to check is huge. So there is a risk that
if you try to read it whole into the computer memory, the program
runs out of memory.
In such a case you would have to read the file chunk by chunk and
perform search in each chunk separately.
There is also a risk that a row with the text you are looking for will be
partially read in one chunk and the rest in the next,
so you have to take some measure to avoid this in your program.
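One possible measure is to carry the unfinished last line of each chunk over into the next one. The sketch below illustrates that idea under simple assumptions: lines end in \n, the search is a plain substring test, and the name lines_containing and the chunk_size default are chosen here only for illustration.

```python
def lines_containing(path, needle, chunk_size=1 << 20):
    """Yield lines containing needle, reading the file chunk by chunk."""
    tail = ""  # the incomplete line left over from the previous chunk
    with open(path) as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            pieces = (tail + chunk).split("\n")
            # The last piece may be cut off mid-line, so hold it back.
            tail = pieces.pop()
            for line in pieces:
                if needle in line:
                    yield line
    # A final line without a trailing newline is still in tail.
    if tail and needle in tail:
        yield tail
```

Only one chunk plus one partial line is held in memory at a time, so the file size does not matter as long as individual lines are of reasonable length.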
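On the other hand, if there is no other way but using mmap,
try something like re.finditer(rb'[^\n]*bhuvi[^\n]*', mm), where mm is the mmap object
(in Python 3 the pattern must be a bytes pattern, because an mmap exposes bytes), i.e. create
an iterator looking for: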
A sequence of chars other than \n.
Your string.
Another sequence of chars other than \n.
This way the match object returned by the iterator will match the
whole line, not your string alone.
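Put together, the mmap variant might look like the sketch below. The function name matching_lines is hypothetical, the needle is escaped so it is treated as literal text, and the decoded result assumes the file is UTF-8 with \n line endings.

```python
import mmap
import re


def matching_lines(path, needle):
    """Yield each line of the file at path that contains needle (a str)."""
    # An mmap exposes bytes, so the pattern has to be a bytes pattern too.
    pattern = re.compile(rb"[^\n]*" + re.escape(needle.encode()) + rb"[^\n]*")
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for match in pattern.finditer(mm):
                # match.group() is the whole line as bytes, without the newline.
                yield match.group().decode()
```

The loop over finditer is still a loop, but it runs once per match rather than once per line of the file.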
I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair in each line, and the pair is separated by commas. e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term,replace_term]]
This is not producing the right input. If I print the detergent I get
['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'],['peter','Peter']]
It seems to be that it is escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
a command re.sub(search_term,replace_term,file_content) written further below in the content is replacing it to be
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script I would write them inside the r'...', but I am not sure what the issues at hand are when reading from a file.
There are no issues or special requirements for reading regexes from a file. The escaping of backslashes is simply how Python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Evaluate it in the interpreter and you'll see it is displayed the same way ...
>>> r"\?"
'\\?'
The reason your $1 is not being replaced is that this is not Python's syntax for group references. The correct syntax is \1.
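As a small illustration, the sketch below loads search/replace pairs in the same comma-separated layout and applies them with \1 as the group reference. The name load_rules and the sample rules and text are made up here, and splitting on the first comma assumes the search regex itself contains no comma.

```python
import re


def load_rules(lines):
    """Turn 'search,replace' lines into (compiled_pattern, replacement) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        search_term, replace_term = line.split(",", 1)
        rules.append((re.compile(search_term), replace_term))
    return rules


# These strings stand in for the lines of a file; text read from a file
# arrives exactly like a raw string, with its backslashes intact.
sample_rules = [r"<\?xml([^>]*?)>,<?XML\1>", "peter,Peter"]

text = '<?xml version="1.0"> peter'
for pattern, replacement in load_rules(sample_rules):
    text = pattern.sub(replacement, text)
```

After the loop, text holds the input with the tag name upper-cased and the captured attributes carried over by \1.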
I have data in the following form in a file:
<string1> abc:string2 <http://yago-knowledge.org/resource/wikicategory_Sports_clubs_established</text\u003e\n______<sha1\u003eqwjfowt5my8t6yuszdb88k2ehskjuh0</sha1\u003e\n____</revision\u003e\n__</page\u003e\n__<page\u003e\n____<title\u003ePortal:Tropical_cyclones/Anniversaries/August_22</title\u003e\n____<ns\u003e100</ns\u003e\n____<id\u003e7957689</id\u003e\n____<revision\u003e\n______<id\u003e446349886</id\u003e\n______<timestamp\u003e2011-08-23T17:38:19Z</timestamp\u003e\n______<contributor\u003e\n________<username\u003eLightbot</username\u003e\n________<id\u003e7178666</id\u003e\n______</contributor\u003e\n______<comment\u003eDelink_non-obscure_units._Conversions._Report_bugs_to_[[User_talk:Lightmouse>.
The delimiter in the above file is a tab (\t), i.e. string1 is separated from abc:string2 by \t. Similarly for the rest of the strings.
Now I want to retain just letters, digits, /, :, . and _ within the strings that are enclosed in <>. I want to delete all characters apart from the specified ones from the strings that are enclosed in <>.
Is there some way by which I may achieve this using linux commands or python? I want to replace all the unwanted characters by an underscore.
<string1> abc:string2 <http://yago-knowledge.org/resource/wikicategory_Sports_clubs_established_text_u003e_n_______sha1_u003eqwjfowt5my8t6yuszdb88k2ehskjuh0_sha1_u003e_n_____revision_u003e_n___/page_u003e_n___page_u003e_n_____title_u003ePortal:Tropical_cyclones/Anniversaries/August_22_/title_u003e_n_____ns_u003e100_/ns_u003e_n_____id_u003e7957689_/id_u003e_n_____revision_u003e_n_______id_u003e446349886_/id_u003e_n_______timestamp_u003e2011-08-23T17:38:19Z_/timestamp_u003e_n_______contributor_u003e_n_________username_u003eLightbot_/username_u003e_n_________id_u003e7178666_/id_u003e_n_______/contributor_u003e_n_______comment_u003eDelink_non-obscure_units._Conversions._Report_bugs_to___User_talk:Lightmouse>.
Is there some way by which I may achieve this?
You can probably achieve this just with UNIX tools and some crazy regular expression, but I would write a small Python script for this:
Open two files (input and output) with open()
Iterate over the input file line by line: for line in input_file:
Split the line at tab: for part in line.split('\t'):
Check if a part is enclosed in <>: if part.startswith('<') and part.endswith('>'):
Filter the characters between the brackets with a regular expression, replacing each unwanted one with an underscore: filtered_part = '<' + re.sub(r'[^a-zA-Z0-9/:._]', '_', part[1:-1]) + '>'
Join the filtered parts back together: filtered_line = '\t'.join(filtered_parts)
Write the filtered line to the output file: output_file.write(filtered_line + '\n')
Following this outline, it should be easy for you to write a working script.
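The outline above could be sketched as follows. This is a variation rather than a literal transcription: instead of splitting on tabs, it uses one regex that finds each tab-delimited field beginning with < and runs to the last > in that field, so a trailing character such as the final "." in the sample data is left alone. The names clean_line and clean_file are invented for this example.

```python
import re

# A field that begins with '<' (at line start or right after a tab) and runs,
# greedily, to the last '>' before the next tab.
BRACKETED = re.compile(r"(?<![^\t])<([^\t\n]*)>")
# Anything other than letters, digits, '/', ':', '.' and '_'.
UNWANTED = re.compile(r"[^A-Za-z0-9/:._]")


def clean_line(line):
    """Replace unwanted characters inside <...> fields with underscores."""
    def clean(match):
        # Keep the outer brackets, clean only what is between them.
        return "<" + UNWANTED.sub("_", match.group(1)) + ">"
    return BRACKETED.sub(clean, line)


def clean_file(input_path, output_path):
    """Write a cleaned copy of input_path to output_path, line by line."""
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            dst.write(clean_line(line))
```

Fields that are not enclosed in <>, such as abc:string2, pass through unchanged.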
I want to find the strings listed in list.txt (one string per line) in another text file. If a string is found, print 'string,one_sentence'; if not, print 'string,another_sentence'. I'm using the following code, but it finds only the last string from list.txt. I cannot understand what the reason could be.
data = open('c:/tmp/textfile.TXT').read()
for x in open('c:/tmp/list.txt').readlines():
    if x in data:
        print(x,',one_sentence')
    else:
        print(x,',another_sentence')
When you read a file with readlines(), the resulting list elements keep their trailing newline characters. These are likely the reason why you get fewer matches than you expected.
Instead of writing
for x in list:
write
for x in (s.strip() for s in list):
This removes leading and trailing whitespace from the strings in list. Hence, it removes trailing newline characters from the strings.
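The effect is easy to reproduce without any files. In the sketch below the sample strings are made up; raw_lines mimics what readlines() returns for a three-line file whose last line has no trailing newline.

```python
# What readlines() would return: every element but the last ends in "\n".
raw_lines = "alpha\nbeta\ngamma".splitlines(keepends=True)
haystack = "alpha beta gamma"

# "alpha\n" and "beta\n" are not substrings of haystack, only "gamma" is.
found_raw = [x for x in raw_lines if x in haystack]

# After stripping, all three needles are found.
found_stripped = [x.strip() for x in raw_lines if x.strip() in haystack]
```

This also explains why only the last string was found in the question: it is the only one that does not carry a newline.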
In order to consolidate your program, you could do something like this:
import sys
with open('c:/tmp/textfile.TXT') as f:
    haystack = f.read()
if not haystack:
    sys.exit("Could not read haystack data :-(")
with open('c:/tmp/list.txt') as f:
    for needle in (line.strip() for line in f):
        if needle in haystack:
            print(needle, ',one_sentence')
        else:
            print(needle, ',another_sentence')
I did not want to make too drastic changes. The most important difference is that I am using the context manager here via the with statement. It ensures proper file handling (mainly closing) for you. Also, the 'needle' lines are stripped on the fly using a generator expression. The above approach reads and processes the needle file line by line instead of loading the whole file into memory at once. Of course, this only makes a difference for large files.
readlines() keeps a newline character at the end of each string read from your list file. Call strip() on those strings to remove those (and every other whitespace) characters.