Regular expression to get the first match in a text file - python
I have a text file that contains:
"000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|"
I'm trying to use a regular expression to get the first chunk of digits before '|ROOT ', which here is 000000002.
I tried to use:
with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.read()
x = re.findall("^\s*[0-9].(ROOT$)", lines)[0]
print(x)
It does not work. My strategy is to match a string that starts with a number and ends with ROOT, and take the first match.
ROOT$ requires the four characters ROOT to sit right at the end of the line, which is not the case here, and [0-9]. matches only a single digit followed by one arbitrary character. findall returns all matches; if you only care about the first, simply use match or search.
with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        m = re.match(r'(\d+)\|ROOT', line)
        if m:
            print(m.group(1))
            break
The break causes the loop to terminate as soon as the first match is found. We read one line at a time until we find one that matches, then stop. (This also optimizes the program by avoiding the unnecessary reading of lines we do not care about, and by avoiding reading more than one line into memory at a time.) The parentheses in the regex cause the match inside them to be captured into group(1).
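As a rough sketch of the same idea applied to text that is already in memory, re.search returns only the first match, so no list of matches is built. The helper name first_root_id is made up for illustration and is not part of the answer above.

```python
import re

# Hypothetical helper for illustration only.
ROOT_ID = re.compile(r"(\d+)\|ROOT")

def first_root_id(text):
    """Return the first run of digits directly before '|ROOT', or None."""
    m = ROOT_ID.search(text)
    return m.group(1) if m else None

sample = "000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|"
print(first_root_id(sample))
```

Unlike re.match, re.search does not require the number to be at the very start of the text, which may or may not be what you want.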
Check out this code:
import re

# 000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|
file = './file.txt'
with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.read()
x = re.findall(r"(\d*[0-9])\|ROOT", lines)
print(x)
x = re.findall(r"(\d*[0-9])\|ROOT", lines)[0]
print(x)
OUTPUT :
['000000002', '000000003', '000000004']
000000002
Related
Python: Delete lines from except certain criteria
I am trying to delete lines from a file using specific criteria. The script I have seems to work, but I have to add too many or statements. Is there a way I can make a variable that holds all the criteria I would like to remove from the file?

Example code:

with open("AW.txt", "r+", encoding='utf-8') as f:
    new_f = f.readlines()
    f.seek(0)
    for line in new_f:
        if "PPL" not in line.split() or "PPLX" not in line.split() or "PPLC" not in line.split():
            f.write(line)
    f.truncate()

I was thinking more along these lines, but it fails when I add multiple criteria:

output = []
with open('AW.txt', 'r+', encoding='utf-8') as f:
    lines = f.readlines()
    criteria = 'PPL'
    output = [line for line in lines if criteria not in line]
    f.writelines(output)

Regards
You can use regular expressions here, which will reduce the number of statements and checks in the code. If you have a list of criteria that can be dynamic, let's call it crit_list, then the code would look like this:

import re

with open("AW.txt", "r+", encoding='utf-8') as f:
    new_f = f.readlines()
    crit_list = ['PPL', 'PPLC', 'PPLX']  # can hold any number of criteria
    obj = re.compile(r'%s' % ('|'.join(crit_list)))
    out_lines = [line for line in new_f if not obj.search(line)]
    f.truncate(0)
    f.seek(0)
    f.writelines(out_lines)

The use of a regex makes it look different from what the OP posted. Let me explain the two lines containing the regex.

obj = re.compile(r'%s' % ('|'.join(crit_list)))

This line creates a regex object with the regular expression 'PPL|PPLC|PPLX', which means "match at least one of these strings in the given line". It can be thought of as a substitute for using as many ors in the code as there are criteria.

out_lines = [line for line in new_f if not obj.search(line)]

This statement searches each line for the criteria and keeps the line only if none of them is found. Hope that clears your doubts.
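One thing to keep in mind with a joined pattern is that it matches substrings and treats any special characters in the criteria as regex syntax. The following is only a sketch of one way to tighten that, assuming the criteria are meant to match whole words and the list is not empty; the function names build_filter and drop_matching are invented for this example.

```python
import re

def build_filter(criteria):
    """Compile one pattern that matches any criterion as a whole word."""
    # re.escape keeps characters such as '.' or '+' literal.
    alternation = "|".join(re.escape(c) for c in criteria)
    return re.compile(r"\b(?:%s)\b" % alternation)

def drop_matching(lines, criteria):
    """Return only the lines that contain none of the criteria."""
    pattern = build_filter(criteria)
    return [line for line in lines if not pattern.search(line)]

# Made-up sample lines for illustration.
lines = ["A PPL 1\n", "B XYZ 2\n", "C PPLX 3\n", "D APPLE 4\n"]
print(drop_matching(lines, ["PPL", "PPLX", "PPLC"]))
```

With the word boundaries in place, a line containing APPLE is kept even though it contains the letters PPL.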
import re

with open('AW.txt', 'r+', encoding='utf-8') as f:
    text = f.read()
    # (?:...) groups the alternatives; [...] would be a character class
    output = re.sub(r"^.*(?:Crit1|Crit2|Crit3).*\n?", "", text, flags=re.MULTILINE)
    f.seek(0)
    f.write(output)
    f.truncate()

This removes the matching lines, so they are not written back out. Your question was a little fuzzy: it asks for lines to be deleted but then tries to write them out. Add as many criteria as you want to the alternation, as shown above.
You can compare each line with each criterion and keep only the lines that pass every check, that is, the lines that contain none of the criteria. For example, this can be done like so (EDITED CODE):

with open('AW.txt', 'r+') as f:
    lines = f.readlines()
    criterias = ["PPL", "PPLX", "PPLC"]
    # one copy of the line for every criterion it does NOT contain
    conditioned_lines = [[line for criteria in criterias if criteria not in line] for line in lines]
    # keep a line only if it passed the check for every criterion
    output = [criteria_lines[0] for criteria_lines in conditioned_lines if len(criteria_lines) == len(criterias)]
    f.truncate(0)
    f.seek(0)
    f.write(''.join(output))
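For comparison, here is a minimal sketch without regular expressions that stays close to the line.split() test in the question: the criteria live in one set, and a line is kept when none of its whitespace-separated words is in that set. The name keep_line and the sample lines are made up for this example.

```python
def keep_line(line, criteria):
    """True if none of the whitespace-separated words is a criterion."""
    return criteria.isdisjoint(line.split())

criteria = {"PPL", "PPLX", "PPLC"}
# Made-up sample lines for illustration.
lines = ["A PPL 1\n", "B XYZ 2\n", "C PPLC 3\n"]
kept = [line for line in lines if keep_line(line, criteria)]
print(kept)
```

Adding another criterion then only means adding one more string to the set.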
Extract chunks of text from document and write them to new text file
I have a large text file from which I want to read several lines and write them out as one line to another text file. For instance, I want to start reading at a certain start word and end on a lone parenthesis. So if my start word is 'CAR', I would want to keep reading until a single parenthesis followed by a line break is read. The start and end words are to be kept as well.

What is the best way to achieve this? I have tried pattern matching and avoiding regex, but I don't think that is possible.

Code:

array = []
f = open('text.txt','r') as infile
w = open(r'temp2.txt', 'w') as outfile
for line in f:
    data = f.read()
    x = re.findall(r'CAR(.*?)\)(?:\\n|$)',data,re.DOTALL)
    array.append(x)
    outfile.write(x)
return array

What the text may look like:

(
CAR: *random info*
*random info*
- could be many lines of this
)
Using regular expressions is totally fine for this type of problem. You cannot use them when your pattern involves recursion, such as getting the content of nested parentheses: ((text1)(text2)). You can use the following regular expression: (CAR[\s\S]*?(?=\)))
We can match the text you're interested in using the regex pattern (CAR.*?)\) with flags gms (in Python, re.findall with re.DOTALL, so that . also matches newlines). The quantifier is lazy: a greedy .* would run from the first CAR to the last ) in the whole file. Then we just have to remove the newline characters from the resulting matches and write them to a file.

with open("text.txt", 'r') as f:
    matches = re.findall(r"(CAR.*?)\)", f.read(), re.DOTALL)
with open("output.txt", 'w') as f:
    for match in matches:
        f.write(" ".join(match.split('\n')))
        f.write('\n')

The output file looks like this:

CAR: *random info* *random info* - could be many lines of this

EDIT: updated code to put a newline between matches in the output file
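If the closing parenthesis has to be alone on its line and should be kept in the output, as the question describes, one possible variation is to anchor it with re.MULTILINE. This is only a sketch under that assumption; the function name extract_blocks and the sample text are invented here.

```python
import re

def extract_blocks(text, start_word="CAR"):
    """Collect each span from start_word to a ')' alone on its line, as one line."""
    pattern = re.compile(
        re.escape(start_word) + r".*?^\)[ \t]*$",
        re.DOTALL | re.MULTILINE,  # '.' spans newlines; '^'/'$' work per line
    )
    blocks = []
    for match in pattern.finditer(text):
        parts = (part.strip() for part in match.group(0).splitlines())
        blocks.append(" ".join(part for part in parts if part))
    return blocks

# Made-up sample; the parenthesis in "(note)" is not alone on its line.
sample = "(\nCAR: first (note)\nmore info\n)\nother text\n(\nCAR: second\n)\n"
print(extract_blocks(sample))
```

Because the closing parenthesis must start a line, a parenthesis inside the block's text does not end the match early.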
Appending lines to a file, then reading them
I want to append or write multiple lines to a file. I believe the following code appends one line:

with open(file_path,'a') as file:
    file.write('1')

My first question is that if I do this:

with open(file_path,'a') as file:
    file.write('1')
    file.write('2')
    file.write('3')

Will it create a file with the following content?

1
2
3

Second question: if I later do:

with open(file_path,'r') as file:
    first = file.read()
    second = file.read()
    third = file.read()

Will that read the content into the variables so that first will be 1, second will be 2, etc.? If not, how do I do it?
Question 1: No. file.write simply writes whatever you pass to it at the position of the pointer in the file.

file.write("Hello "); file.write("World!")

will produce a file with the contents "Hello World!"

You can write a whole line either by appending a newline character ("\n") to each string to be written, or by using the print function's file keyword argument (which I find to be a bit cleaner):

with open(file_path, 'a') as f:
    print('1', file=f)
    print('2', file=f)
    print('3', file=f)

N.B. print adds a newline by default, but that can be turned off: print('1', file=f, end='') is equivalent to f.write('1').

Question 2: No. file.read() reads the whole file, not one line at a time. In this case you'll get

first == "1\n2\n3\n"
second == ""
third == ""

This is because after the first call to file.read(), the pointer is at the end of the file. Subsequent calls try to read from the pointer to the end of the file. Since they're in the same spot, you get an empty string. A better way to do this would be:

with open(file_path, 'r') as f:  # `file` is a poor variable name: it shadows the built-in file type in Python 2
    lines = f.readlines()
first = lines[0]
second = lines[1]
third = lines[2]

Or:

with open(file_path, 'r') as f:
    first, second, third = f.readlines()  # fails if there aren't exactly 3 lines
The answer to the first question is no. You're writing individual characters, so you would have to read them out individually. Also, note that file.read() returns the full contents of the file. If you wrote individual characters and you want to read individual characters, process the result of file.read() as a string.

text = open(file_path).read()
first = text[0]
second = text[1]
third = text[2]

As for the second question, you should write newline characters, '\n', to terminate each line that you write to the file.

with open(file_path, 'w') as out_file:
    out_file.write('1\n')
    out_file.write('2\n')
    out_file.write('3\n')

To read the lines, you can use file.readlines().

lines = open(file_path).readlines()
first = lines[0]   # -> '1\n'
second = lines[1]  # -> '2\n'
third = lines[2]   # -> '3\n'

If you want to get rid of the newline character at the end of each line, use strip(), which discards all leading and trailing whitespace. For example:

first = lines[0].strip()  # -> '1'

Better yet, you can use map to apply strip() to every line.

lines = list(map(str.strip, open(file_path).readlines()))
first = lines[0]   # -> '1'
second = lines[1]  # -> '2'
third = lines[2]   # -> '3'
Writing multiple lines to a file

This will depend on how the data is stored. For writing individual values, your current example is:

with open(file_path,'a') as file:
    file.write('1')
    file.write('2')
    file.write('3')

The file will contain the following:

123

It will also contain whatever contents it had previously, since it was opened in append mode. To write newlines, you must add them explicitly; writelines(), which expects an iterable, does not add them for you. Also, I don't recommend using file as an object name, since it shadows a built-in name in Python 2 (it is not a keyword), so I will use f from here on out. For instance, here is an example where you have a list of values that you write using write() and explicit newline characters:

my_values = ['1', '2', '3']
with open(file_path,'a') as f:
    for value in my_values:
        f.write(value + '\n')

But a better way would be to use writelines(). To add the newlines, you could use a list comprehension:

my_values = ['1', '2', '3']
with open(file_path,'a') as f:
    f.writelines([value + '\n' for value in my_values])

If you want to write a range of numbers, you could use a for loop with range (or xrange if using Python 2.x and writing a lot of numbers).

Reading individual lines from a file

To read individual lines from a file, you can also use a for loop:

my_list = []
with open(file_path,'r') as f:
    for line in f:
        my_list.append(line.strip())  # strip out newline characters

This way you can iterate through the collected lines afterwards with a for loop (or just process them as you read them, particularly if it's a large file).
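Putting the two halves together, a small round trip might look like the sketch below. It writes to a temporary directory so that it does not touch real files; the helper names append_lines and read_lines and the file name numbers.txt are invented for this example.

```python
import os
import tempfile

def append_lines(path, values):
    """Append each value as its own line."""
    with open(path, "a") as f:
        f.writelines(value + "\n" for value in values)

def read_lines(path):
    """Return the file's lines without their trailing newline."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")  # hypothetical file name
    append_lines(path, ["1", "2", "3"])
    first, second, third = read_lines(path)
    print(first, second, third)
```

The tuple unpacking on the last read fails if the file does not hold exactly three lines, so for files of unknown length it is safer to keep the list.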
extract float numbers from data file
I'm trying to extract the values (floats) from my data file. I only want to extract the first value on the line; the second one is the error (e.g. xo # 9.95322254_0.00108217853 means 9.953... is the value, 0.0010.. is the error).

Here is my code:

import sys
import re

inf = sys.argv[1]
out = sys.argv[2]
f = inf
outf = open(out, 'w')
intensity = []
with open(inf) as f:
    pattern = re.compile(r"[^-\d]*([\-]{0,1}\d+\.\d+)[^-\d]*")
    for line in f:
        f.split("\n")
        match = pattern.match(line)
        if match:
            intensity.append(match.group(0))
for k in range(len(intensity)):
    outf.write(intensity[k])

But it doesn't work. The output file is empty. The lines in the data file look like:

xo_Is
xo # 9.95322254`_0.00108217853
SPVII_to_PVII_Peak_type
PVII_m(#, 1.61879`_0.08117)
PVII_h(#, 0.11649`_0.00216)
I # 0.101760618`_0.00190314017

Each time, the first number is the value I want to extract and the second one is the error.
You were almost there, but your code contains errors preventing it from running. The following works:

pattern = re.compile(r"[^-\d]*(-?\d+\.\d+)[^-\d]*")
with open(inf) as f, open(out, 'w') as outf:
    for line in f:
        match = pattern.match(line)
        if match:
            outf.write(match.group(1) + '\n')
I think you should test your pattern on a simple string instead of a file. This will show where the error is: in the pattern or in the code that parses the file. The pattern looks good. Additionally, in most languages I know, group(0) is the entire match, and for your number you need to use group(1). Are you sure that f.split('\n') must be inside the for loop?
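Following that advice, the pattern can be tried on a few literal lines before any file is involved. The sketch below uses re.search with a bare float pattern, which is a simplification of the pattern discussed above rather than the same expression, and it does not handle exponents. The helper name first_float is made up; the sample lines are taken from the question.

```python
import re

# Simplified pattern: optional sign, digits, a dot, digits.
FLOAT = re.compile(r"-?\d+\.\d+")

def first_float(line):
    """Return the first decimal number on the line as a float, or None."""
    m = FLOAT.search(line)
    return float(m.group(0)) if m else None

sample_lines = [
    "xo # 9.95322254`_0.00108217853",
    "PVII_m(#, 1.61879`_0.08117)",
    "SPVII_to_PVII_Peak_type",
]
values = [first_float(line) for line in sample_lines]
print(values)
```

Lines without a decimal number give None, which makes it easy to see which lines the pattern skips.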
Splitting lines in python based on some character
Input:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Output:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

'!' is the starting character and +0013 should be the ending of each line (if present).

The problem I am getting is that the output looks like this:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W

Any help would be highly appreciated...!!!

My code:

file_open= open('sample.txt','r')
file_read= file_open.read()
file_open2= open('output.txt','w+')
counter =0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter= counter -1
        counter= counter +1
    file_open2.write(i)
You can try something like this:

with open("abc.txt") as f:
    # remove the newlines; text mode already translates "\r\n" to "\n"
    data = f.read().replace("\n", "")
    # split the data at '!', then filter out empty strings
    ans = filter(None, data.split("!"))
    for x in ans:
        print("!" + x)  # or write to some other file

Output:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?

lines = file_read.split('!')

Now lines is a list that holds the split data. These are almost the lines you want to write. The only differences are that they lack trailing newlines and the '!' at the start (and, because the data begins with '!', the first piece is an empty string). We can put those back easily with string formatting, e.g. '!{0}\n'.format(line). Then we wrap the whole thing in a generator expression and pass it to file.writelines to put the data in a new file:

file_open2.writelines('!{0}\n'.format(line) for line in lines)

You might need:

file_open2.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)

if you find that you're getting more newlines than you wanted in the output.

One other point: when opening files, it's nice to use a context manager, which makes sure that the file is closed properly:

with open('inputfile') as fin:
    lines = fin.read().split('!')
with open('outputfile','w') as fout:
    # `if line` skips the empty piece before the first '!'
    fout.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines if line)
Another option, using replace instead of split, since you know the starting and ending characters of each line:

In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')

In [15]: print(data.replace('+0013!', "+0013\n!"))
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:

import re

outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()

It will open the output file, get the contents of the input file, and loop through all the matches using the regular expression !.+?(?=!|$) with the re.DOTALL flag. The regular expression explanation and what it matches can be found here: http://regex101.com/r/aK6aV4

After we have a match, we strip out the newlines from the match and write it to the file.
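Let's remove the existing line breaks, add a \n before every "!", and then let Python's splitlines do the work :-) :

file_read.replace("\n", "").replace("!", "\n!").splitlines()

Because the data starts with "!", the first element of the result is an empty string.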
I would actually implement this as a generator so that you can work on the data stream rather than the entire content of the file. This will be quite memory friendly when working with huge files.

>>> def split_on_stream(it, sep="!"):
        prev = ""
        for line in it:
            line = (prev + line.strip()).split(sep)
            for parts in line[:-1]:
                yield parts
            prev = line[-1]
        yield prev

>>> with open("test.txt") as fin:
        for parts in split_on_stream(fin):
            print(parts)

,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.
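For small inputs that fit in memory, the whole-string approach from the other answers can be summed up in a few lines. This is only a sketch; the function name split_records is invented, and the sample string is a shortened, made-up stand-in for the wrapped data in the question.

```python
def split_records(text, start="!"):
    """Undo the line wrapping, then cut the text into records that begin with start."""
    flat = text.replace("\r", "").replace("\n", "")
    # split() drops the marker, so put it back; skip the empty piece before the first one
    return [start + piece for piece in flat.split(start) if piece]

# Shortened, made-up sample with line breaks in the middle of records.
wrapped = "!,A,1,+0013!,A,2\n,+0013!,A,3,+00\n13!,A,4"
for record in split_records(wrapped):
    print(record)
```

Unlike the generator version, this keeps the leading "!" on every record, at the cost of holding the whole text in memory.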