Getting the line number of a string - python

Suppose I have a very long string taken from a file:
lf = open(filename, 'r')
text = lf.read()
lf.close()
or
lineList = [line.strip() for line in open(filename)]
text = '\n'.join(lineList)
How can one find the line number of a specific regular expression match in this string (in this case, the line number of 'match')?
regex = re.compile(somepattern)
for match in re.findall(regex, text):
    continue
Thank you in advance for your time.
Edit: I forgot to add that the pattern we are searching for spans multiple lines, and I am interested in the starting line.

We need to get re.Match objects rather than plain strings, which is what re.finditer provides; a match object exposes its starting position. Consider the following example: say I want to find every pair of digits located immediately before and after a newline (\n):
import re
lineList = ["123","456","789","ABC","XYZ"]
text = '\n'.join(lineList)
for match in re.finditer(r"\d\n\d", text):
    start = match.start()  # same as match.span()[0]; .span() gives the tuple (start, end)
    line_no = text[:start].count("\n")  # newlines before the match = 0-based line number
    print(line_no)
Output:
0
1
Explanation: once I have the starting position, I count the newlines before it, which equals the line number of the match. Note: I assumed line numbers start from 0; add 1 for 1-based line numbers.
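The newline-counting idea can also be wrapped in a small helper. The following is only a sketch: the name match_line_numbers is made up for illustration, it reports 1-based line numbers, and it counts newlines incrementally so the text is scanned once instead of once per match.

```python
import re

def match_line_numbers(pattern, text):
    """Yield (line_number, matched_text) pairs with 1-based line numbers."""
    regex = re.compile(pattern)
    line_no = 1
    last_pos = 0
    for match in regex.finditer(text):
        # Count only the newlines between the previous match start and this one
        line_no += text.count("\n", last_pos, match.start())
        last_pos = match.start()
        yield line_no, match.group(0)

text = "\n".join(["123", "456", "789", "ABC", "XYZ"])
results = list(match_line_numbers(r"\d\n\d", text))
print(results)
```

Because the line number is taken at match.start(), a pattern that spans several lines is reported on its starting line.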

Perhaps something like this (note that it searches one line at a time, so it only finds patterns that fit on a single line):
lf = open(filename, 'r')
text_lines = lf.readlines()
lf.close()
regex = re.compile(somepattern)
for line_number, line in enumerate(text_lines):
    for match in re.findall(regex, line):
        print('Match found on line %d: %s' % (line_number, match))

Related

how to go to next line if match is found and again check for the word count in that line

I am trying to count words as follows: find a line that matches, and if a match is found, go to the next line and count the words in that line.
id = open('id.txt','r')
ids = id.readlines()
for i in range(0, len(ids) - 1, 1):
    actual_id = ids[i]
    print(actual_id)
    with open('sample2.txt', 'r') as f:
        for line in f:
            if re.search(r'{actual_id}|RQ', line):
                next_line = line.next()
                if next_line == 'RQ':
                    print(line)
                    with open('output.txt', 'a') as f:
                        f.write('\n' + line)
Sample.txt text file:
[07-12-2022 13:27:45.728|Info|0189B31C|RQ]
<ServiceRQ><SaleInfo><CityCode Solution=1>BLQ</CityCode><CountryCode Solution=2>NL</CountryCode><CurrencyCode>EUR</CurrencyCode><Channel>ICI</Channel></ServiceRQ>
[07-12-2022 13:27:45.744|Info|0189B31D|RQ]
<ServiceRQ><SaleInfo><CityCode Solution=1>BLQ</CityCode><CountryCode>NL</CountryCode><CurrencyCode>EUR</CurrencyCode><Channel>ICI</Channel></ServiceRQ>
0189B31C
0189B31D
These are the unique IDs, which are stored in a different text file. I am trying to read the first ID from that file and match it in Sample.txt; if a match is found, go to the next line, count the occurrences of the word Solution, and print the count.
Could someone please help me with the code? I am a little confused.
I have no experience with the "requests" module. But since no one has answered your question yet, I thought maybe this would suit you. The code should work fine if the number of lines is even. I mean, the code will put strings in the "payload" and do the rest only if there is an entire pair consisting of an odd and an even string.
with open('Sample.txt', 'r') as f:
    while True:
        try:
            odd_line=next(f)
            even_line=next(f)
        except StopIteration:
            break
        #payload=...
        #headers=...
        #response=...
        #print(response.text)
You can use the re.DOTALL flag with the regex {idf}\|RQ.*?</ServiceRQ>. With this flag, . matches any character including a newline, and the non-greedy quantifier (.*?) makes sure that as few characters as possible are matched until the string </ServiceRQ> is found. Then you can use findall to count the occurrences of Solution in the matched text.
import re
with open('sample2.txt', 'r') as sample_file:
    sample2 = sample_file.read()
id_dict = {}
with open('id.txt', 'r') as id_file:
    for idf in id_file.read().split():
        id_found = re.findall(fr'{idf}\|RQ.*?</ServiceRQ>', sample2, re.DOTALL)
        if id_found:
            solution_found = re.findall('Solution', id_found[0])
            id_dict[idf] = len(solution_found)
print(id_dict)
Output from id_dict
{
'0189B31C': 2,
'0189B31D': 1
}
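For comparison, here is a line-by-line sketch of the "match the ID, then look at the next line" idea from the question. It assumes the header lines always have the form [...|<id>|RQ] and that the XML sits on the single line that follows; the function name count_solution_after_id and the shortened sample lines are made up for illustration.

```python
def count_solution_after_id(lines, ids):
    """Map each id to the number of 'Solution' occurrences on the line
    that follows its '[...|<id>|RQ]' header line."""
    counts = {}
    line_iter = iter(lines)
    for line in line_iter:
        for id_ in ids:
            if f"|{id_}|RQ]" in line:
                # next() with a default avoids StopIteration at the end of input
                body = next(line_iter, "")
                counts[id_] = body.count("Solution")
                break
    return counts

# Abbreviated versions of the sample lines from the question
sample = [
    "[07-12-2022 13:27:45.728|Info|0189B31C|RQ]",
    "<ServiceRQ><CityCode Solution=1>BLQ</CityCode><CountryCode Solution=2>NL</CountryCode></ServiceRQ>",
    "[07-12-2022 13:27:45.744|Info|0189B31D|RQ]",
    "<ServiceRQ><CityCode Solution=1>BLQ</CityCode><CountryCode>NL</CountryCode></ServiceRQ>",
]
counts = count_solution_after_id(sample, ["0189B31C", "0189B31D"])
print(counts)
```

Passing an open file object as lines works the same way, since the function only iterates over it once.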

How to read a value of file separate by tabs in Python?

I have a text file with this format
ConfigFile 1.1
;
; Version: 4.0.32.1
; Date="2021/04/08" Time="11:54:46" UTC="8"
;
Name
John Legend
Type
Student
Number
s1054520
I would like to get the value of Name, Type, or Number.
How do I get it?
I tried this method, but it does not solve my problem.
import re
f = open("Data.txt", "r")
file = f.read()
Name = re.findall("Name", file)
print(Name)
My expected output is John Legend.
Can anyone help me, please? I would really appreciate it. Thank you.
First of all, re.findall searches for all occurrences that match a given pattern. So in your case you are finding every "Name" in the file, because that is the pattern you passed.
On the other hand, the computer does not know that "John Legend" is the name; it only knows that it is the line after the word "Name".
In your case I suggest checking this link. The steps are:
Find the line number of "Name"
Read the next line
Get the name without the surrounding white space
If there is more than one Name, this will work as well.
The final code looks like this:
def search_string_in_file(file_name, string_to_search):
    """Search for the given string in file and return lines containing that string,
    along with line numbers"""
    line_number = 0
    list_of_results = []
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if line contains the string
            line_number += 1
            if string_to_search in line:
                # If yes, then add the line number & line as a tuple in the list
                list_of_results.append((line_number, line.rstrip()))
    # Return list of tuples containing line numbers and lines where string is found
    return list_of_results
with open('Data.txt') as file:
    content = file.readlines()
matched_lines = search_string_in_file('Data.txt', 'Name')
print('Total Matched lines : ', len(matched_lines))
for i in matched_lines:
    # i[0] is the 1-based line number of the match, so used as a 0-based
    # index it points at the line that follows the match
    print(content[i[0]].strip())
Here I go through each line, and when I encounter Name I add the next line to the result list (you could print it directly too):
import re
def print_hi(name):
    result = []
    regexp = re.compile(r'Name')
    gotname = False
    with open('test.txt') as f:
        for line in f:
            if gotname:
                result.append(line.strip())
                gotname = False
            match = regexp.match(line)
            if match:
                gotname = True
    print(result)
if __name__ == '__main__':
    print_hi('test')
Assuming those label lines appear in the file in the listed order, you can simply scan for them:
labelList = ["Name","Type","Number"]
captures = dict()
with open("Data.txt","rt") as f:
    for label in labelList:
        # note: this loops forever if a label is missing from the file
        while not f.readline().startswith(label):
            pass
        captures[label] = f.readline().strip()
for label in labelList:
    print(f"{label} : {captures[label]}")
I wouldn't use a regex, but rather write a parser for the file type. The rules might be:
1. The first line can be ignored
2. Any lines that start with ; can be ignored
3. Every line with no leading whitespace is a key
4. Every line with leading whitespace is a value belonging to the last key
I'd start with a generator that can return to you any unignored line:
def read_data_lines(filename):
    with open(filename, "r") as f:
        # skip the first line
        f.readline()
        # read until no more lines
        while line := f.readline():
            # skip lines that start with ;
            if not line.startswith(";"):
                yield line
Then fill up a dict by following rules 3 and 4:
def parse_data_file(filename):
    data = {}
    key = None
    for line in read_data_lines(filename):
        # No starting whitespace makes this a key
        if not line.startswith((" ", "\t")):
            key = line.strip()
        # Starting whitespace (space or tab) makes this a value for the last key
        else:
            data[key] = line.strip()
    return data
Now at this point you can parse the file and print whatever key you want:
data = parse_data_file("Data.txt")
print(data["Name"])
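If the indentation cannot be relied on, a variant of the same idea is to assume that, once the header and the ; comment lines are dropped, the remaining lines simply alternate between key and value. This is a sketch under that assumption only; parse_alternating is a hypothetical name, and the sample text is a shortened copy of the file from the question.

```python
def parse_alternating(text):
    """Parse 'key line, value line' pairs, ignoring the first line and ';' comments."""
    lines = text.splitlines()[1:]  # skip the header line
    content = [ln.strip() for ln in lines
               if ln.strip() and not ln.lstrip().startswith(";")]
    # Even positions are keys, odd positions are their values
    return dict(zip(content[0::2], content[1::2]))

sample = """ConfigFile 1.1
;
; Version: 4.0.32.1
Name
\tJohn Legend
Type
\tStudent
Number
\ts1054520
"""
data = parse_alternating(sample)
print(data["Name"])
```

This breaks as soon as a key has no value or a value spans two lines, which is where the whitespace-based rules above are more robust.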

Search txt file for varying keyword - then assign to variable

I need to search a txt file for keywords in the format "i-0xxxyyyzzzz", where x, y, and z are varying alphanumeric characters.
I would then like to assign each match to a variable. Currently I can get as far as:
f = open("file.txt", "r")
searchlines = f.readlines()
f.close()
for i, line in enumerate(searchlines):
    if "i-0" in line:
        for l in searchlines[i:i+1]:
            print(l, end=' ')
        print()
However this prints the whole line, and not just the keyword.
I suggest using regular expressions:
import re
token_regex = re.compile(r'i-0[0-9a-z]*')
for line in open('file.txt'):  # Note: 'r' is the default mode
    # You might find the token several times in the line
    matches = token_regex.findall(line)
    if matches:
        print('\n'.join(matches))
I'm not clear what you want to do with the matches. My example prints them one per line. Also, could you have tokens that span two lines?
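To keep the matches rather than print them, one option is to collect them in a list. This is a minimal sketch; the token shape ("i-0" followed by lowercase letters or digits), the name find_tokens, and the sample lines are assumptions for illustration.

```python
import re

# Assumed token shape: "i-0" followed by lowercase letters or digits
TOKEN_RE = re.compile(r"i-0[0-9a-z]+")

def find_tokens(lines):
    """Return every token found in the given lines, in order of appearance."""
    tokens = []
    for line in lines:
        tokens.extend(TOKEN_RE.findall(line))
    return tokens

sample_lines = [
    "instance i-0abc123def0 started\n",
    "no token here\n",
    "peers: i-0111aaa222b, i-0zzz999yyy8\n",
]
tokens = find_tokens(sample_lines)
print(tokens)
```

An open file object can be passed in place of sample_lines, and individual matches are then available as tokens[0], tokens[1], and so on.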

Splitting lines in python based on some character

Input:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
'!' is the starting character and +0013 should be the ending of each line (if present).
The problem I am getting is that the output looks like this:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
Any help would be highly appreciated...!!!
My code:
file_open= open('sample.txt','r')
file_read= file_open.read()
file_open2= open('output.txt','w+')
counter =0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter= counter -1
        counter= counter +1
    file_open2.write(i)
You can try something like this:
with open("abc.txt") as f:
    data=f.read().replace("\r","").replace("\n","") #remove the line breaks, whether "\n" or "\r\n"
    ans=filter(None,data.split("!")) #split the data at '!', then filter out empty lines
    for x in ans:
        print("!"+x) #or write to some other file
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?
lines = file_read.split('!')
Now lines is a list which holds the split data. This is almost the lines you want to write -- The only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting -- e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we'll pass to file.writelines to put the data in a new file:
file_open2.writelines('!{0}\n'.format(line) for line in lines)
You might need:
file_open2.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
if you find that you're getting more newlines than you wanted in the output.
One other point: when opening files, it's nice to use a context manager, which makes sure that the file is closed properly:
with open('inputfile') as fin:
    lines = fin.read().split('!')
with open('outputfile','w') as fout:
    # "if line" skips the empty piece before the first '!'
    fout.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines if line)
Another option, using replace instead of split, since you know the starting and ending characters of each line:
In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')
In [15]: print data.replace('+0013!', "+0013\n!")
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:
import re
outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()
It will open the output file, get the contents of the input file, and loop through all the matches using the regular expression !.+?(?=!|$) with the re.DOTALL flag. The regular expression explanation & what it matches can be found here: http://regex101.com/r/aK6aV4
After we have a match, we strip out the new lines from the match, and write it to the file.
Let's remove the existing newlines, add a \n before every "!", and then let Python splitlines :-) :
file_read.replace("\n", "").replace("!", "\n!").lstrip("\n").splitlines()
I would actually implement this as a generator, so that you can work on the data stream rather than on the entire content of the file. This is quite memory friendly when working with huge files.
>>> def split_on_stream(it, sep="!"):
...     prev = ""
...     for line in it:
...         line = (prev + line.strip()).split(sep)
...         for parts in line[:-1]:
...             yield parts
...         prev = line[-1]
...     yield prev
...
>>> with open("test.txt") as fin:
...     for parts in split_on_stream(fin):
...         print(parts)
,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.
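The output above loses the leading "!" of each record. A variant of the same streaming idea that keeps the separator and skips empty pieces might look like the sketch below; records_from_stream is a hypothetical name and the wrapped sample lines are shortened stand-ins for the real data.

```python
def records_from_stream(lines, sep="!"):
    """Yield complete records (each prefixed with sep) from arbitrarily wrapped lines."""
    buffer = ""
    for line in lines:
        buffer += line.rstrip("\r\n")
        pieces = buffer.split(sep)
        # Every piece before the last separator is a finished record
        for piece in pieces[:-1]:
            if piece:
                yield sep + piece
        # Keep the unfinished tail (without its separator) for the next line
        buffer = pieces[-1]
    if buffer:
        yield sep + buffer

wrapped = [
    "!,A,1,+0013!,A,2,12/1\n",
    "2/19,+0013!,A,3,000.\n",
    "0,0\n",
]
records = list(records_from_stream(wrapped))
print(records)
```

Only one partial record is held in memory at a time, so this also works when the input is an open file object.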

How to eliminate last digit from each of the top lines

>Sequence 1.1.1 ATGCGCGCGATAAGGCGCTA
ATATTATAGCGCGCGCGCGGATATATATATATATATATATT
>Sequence 1.2.2 ATATGCGCGCGCGCGCGGCG
ACCCCGCGCGCGCGCGGCGCGATATATATATATATATATATT
>Sequence 2.1.1 ATTCGCGCGAGTATAGCGGCG
Now, I would like to remove the last digit from each line that starts with '>'. For example, in the first line I would like to remove the rightmost '.1', and in the second instance I would like to remove '.2', and then write the rest of the file to a new file. Thanks.
import fileinput
import re
for line in fileinput.input(inplace=True, backup='.bak'):
    line = line.rstrip()
    if line.startswith('>'):
        line = re.sub(r'\.\d$', '', line)
    print(line)
Many details can be changed depending on the processing you want, which you have not clearly communicated, but this is the general idea.
import re
trimmedtext = re.sub(r'(\d+\.\d+)\.\d', r'\1', text)
Should do it. Somewhat simpler than searching for start characters (and it won't affect your DNA chains).
if line.startswith('>Sequence'):
    line = line[:-2] # trim 2 characters from the end of the string
or if there could be more than one digit after the period:
if line.startswith('>Sequence'):
    dot_pos = line.rfind('.') # find position of rightmost period
    line = line[:dot_pos] # truncate up to but not including the dot
Edit for if the sequence occurs on the same line as >Sequence
If we know that there will always be only 1 digit to remove we can cut out the period and the digit with:
line = line[:13] + line[15:]
This is using a feature of Python called slices. The indexes are zero-based and exclusive for the end of the range so line[0:13] will give us the first 13 characters of line. Except that if we want to start at the beginning the 0 is optional so line[:13] does the same thing. Similarly line[15:] gives us the substring starting at character 15 to the end of the string.
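Map ".".join(line.split('.')[:-1]) to each header line of the file (joining with "." keeps the remaining dots; this assumes the header line ends with the digit to drop).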
Here's a short script. Run it like: script [filename to clean]. Lots of error handling omitted.
It operates using generators, so it should work fine on huge files as well.
import sys
def clean_line(line):
    if line.startswith(">"):
        return line.rstrip()[:-2]
    else:
        return line.rstrip()
def clean(input):
    for line in input:
        yield clean_line(line)
if __name__ == "__main__":
    filename = sys.argv[1]
    print("Cleaning %s; output to %s.." % (filename, filename + ".clean"))
    input = None
    output = None
    try:
        input = open(filename, "r")
        output = open(filename + ".clean", "w")
        for line in clean(input):
            # text mode translates "\n" to the platform line ending
            output.write(line + "\n")
            print(": " + line)
    finally:
        # close whichever files were actually opened, even on error
        if input is not None:
            input.close()
        if output is not None:
            output.close()
import re
with open('in') as input_file, open('out', 'w') as output_file:
    for line in input_file:
        line = re.sub(r'(\d+[.]\d+)[.]\d+', r'\1', line)
        output_file.write(line)
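The regex above is applied to every line. A sketch that restricts it to header lines, assuming headers start with '>' and carry a dotted number such as 1.1.1 (trim_header is a made-up name, and the sample lines are shortened), could look like this:

```python
import re

# A dotted number with at least three parts, e.g. "1.1.1"; group 1 is the part to keep
DOTTED = re.compile(r"(\d+\.\d+)\.\d+")

def trim_header(line):
    """Drop the last dotted part on '>' header lines; leave other lines untouched."""
    if line.startswith(">"):
        return DOTTED.sub(r"\1", line, count=1)
    return line

lines = [
    ">Sequence 1.1.1 ATGCGCGCGATAAGGCGCTA",
    "ATATTATAGCG",
    ">Sequence 1.2.2 ATATGCG",
]
cleaned = [trim_header(line) for line in lines]
print(cleaned)
```

Checking startswith(">") first means sequence lines are never touched, even if they happened to contain digits and dots.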
