Search txt file for varying keyword - then assign to variable - python

I need to search a txt file for keywords in format "i-0xxxyyyzzzz" - where xyz are varying alphanumeric characters.
I would like to then assign each match assigned. Currently I can get as far as:
f = open("file.txt", "r")
searchlines = f.readlines()
f.close()
for i, line in enumerate(searchlines):
if "i-0" in line:
for l in searchlines[i:i+1]: print l,
print
However this prints the whole line, and not just the keyword.

I suggest using regular expressions:
import re
token_regex = re.compile('i\-0[0-9a-z]*')
for line in open('file.txt'): // Note: 'r' is the default
// You might find the token several times in the line
matches = token_regex.findall(line)
if matches:
print '\n'.join(matches)
I'm not clear what you want to do with the matches. My example would print print them one per line. Also, could you have tokens that span two lines?

Related

how to go to next line if match is found and again check for the word count in that line

I am trying to find word count by find a match line if match is found go to next line and count the word in that line
id = open('id.txt','r')
ids = id.readlines()
for i in range(0, len(ids) - 1, 1):
actual_id = ids[i]
print(actual_id)
with open('sample2.txt', 'r') as f:
for line in f:
if re.search(r'{actual_id}|RQ', line):
next_line = line.next()
if next_line == 'RQ':
print(line)
with open('output.txt', 'a') as f:
f.write('\n' + line)
Sample.txt text file:
[07-12-2022 13:27:45.728|Info|0189B31C|RQ]
<ServiceRQ><SaleInfo><CityCode Solution=1>BLQ</CityCode><CountryCode Solution=2>NL</CountryCode><CurrencyCode>EUR</CurrencyCode><Channel>ICI</Channel></ServiceRQ>
[07-12-2022 13:27:45.744|Info|0189B31D|RQ]
<ServiceRQ><SaleInfo><CityCode Solution=1>BLQ</CityCode><CountryCode>NL</CountryCode><CurrencyCode>EUR</CurrencyCode><Channel>ICI</Channel></ServiceRQ>
0189B31C
0189B31D
These are unique id's which are store in different text file I am trying to read the 1st id from text file and match that id in Sample.txt and if match is found go to next line and count the number of Solution words and print.
Please can someone help me for find the code I am little confused.
I have no experience with the "requests" module. But since no one has answered your question yet, I thought maybe this would suit you. The code should work fine if the number of lines is even. I mean, the code will put strings in the "payload" and do the rest only if there is an entire pair consisting of an odd and an even string.
with open('Sample.txt', 'r') as f:
while True:
try:
odd_line=next(f)
even_line=next(f)
except StopIteration:
break
#payload=...
#headers=...
#response=...
#print(response.text)
You can use the flag re.DOTALL with the regex {idf}\|RQ.*?</ServiceRQ>, this way the regex matches any character including a newline, and the non-greedy modifier (.*?) part makes sure that few characters as possible will be matched until the string </ServiceRQ> is found. Then, you can use findall to obtain the number of Solution words in the string.
import re
with open('sample2.txt', 'r') as sample_file:
sample2 = sample_file.read()
id_dict = {}
with open('id.txt', 'r') as id_file:
for idf in id_file.read().split():
id_found = re.findall(fr'{idf}\|RQ.*?</ServiceRQ>', sample2, re.DOTALL)
if id_found:
solution_found = re.findall('Solution', id_found[0])
id_dict[idf] = len(solution_found)
print(id_dict)
Output from id_dict
{
'0189B31C': 2,
'0189B31D': 1
}

Getting the line number of a string

Suppose I have a very long string taken from a file:
lf = open(filename, 'r')
text = lf.readlines()
lf.close()
or
lineList = [line.strip() for line in open(filename)]
text = '\n'.join(lineList)
How can one find specific regular expression's line number in this string( in this case the line number of 'match'):
regex = re.compile(somepattern)
for match in re.findall(regex, text):
continue
Thank you for your time in advance
Edit: Forgot to add that the pattern that we are searching is multiple lines and I am interested in the starting line.
We need to get re.Match objects rather than strings themselves using re.finditer, which will allow getting information about starting position. Consider following example: lets say I want to find every two digits which are located immediately before and after newline (\n) then:
import re
lineList = ["123","456","789","ABC","XYZ"]
text = '\n'.join(lineList)
for match in re.finditer(r"\d\n\d", text, re.MULTILINE):
start = match.span()[0] # .span() gives tuple (start, end)
line_no = text[:start].count("\n")
print(line_no)
Output:
0
1
Explanation: After I get starting position I simply count number of newlines before that place, which is same as getting number of line. Note: I assumed line numbers are starting from 0.
Perhaps something like this:
lf = open(filename, 'r')
text_lines = lf.readlines()
lf.close()
regex = re.compile(somepattern)
for line_number, line in enumerate(text_lines):
for match in re.findall(regex, line):
print('Match found on line %d: %s' % (line_number, match))

Extract chunks of text from document and write them to new text file

I have a large file text file that I want to read several lines of, and write these lines out as one line to a text file. For instance, I want to start reading in lines at a certain start word, and end on a lone parenthesis. So if my start word is 'CAR' I would want to start reading until a one parenthesis with a line break is read. The start and end words are to be kept as well.
What is the best way to achieve this? I have tried pattern matching and avoiding regex but I don't think that is possible.
Code:
array = []
f = open('text.txt','r') as infile
w = open(r'temp2.txt', 'w') as outfile
for line in f:
data = f.read()
x = re.findall(r'CAR(.*?)\)(?:\\n|$)',data,re.DOTALL)
array.append(x)
outfile.write(x)
return array
What the text may look like
( CAR: *random info*
*random info* - could be many lines of this
)
Using regular expression is totally fine for these type of problems. You cannot use them when your pattern contains recursion, like get the content from the parenthesis: ((text1)(text2)).
You can use the following regular expression: (CAR[\s\S]*?(?=\)))
See explanation...
Here you can visualize your regular expression...
We can match the text you're interested in using the regex pattern: (CAR.*)\) with flags gms.
Then we just have to remove the newline characters from the resulting matches and write them to a file.
with open("text.txt", 'r') as f:
matches = re.findall(r"(CAR.*)\)", f.read(), re.DOTALL)
with open("output.txt", 'w') as f:
for match in matches:
f.write(" ".join(match.split('\n')))
f.write('\n')
The output file looks like this:
CAR: *random info* *random info* - could be many lines of this
EDIT:
updated code to put newline between matches in output file

extract float numbers from data file

I'm trying to extract the values (floats) from my datafile. I only want to extract the first value on the line, the second one is the error. (eg. xo # 9.95322254_0.00108217853
means 9.953... is value, 0.0010.. is error)
Here is my code:
import sys
import re
inf = sys.argv[1]
out = sys.argv[2]
f = inf
outf = open(out, 'w')
intensity = []
with open(inf) as f:
pattern = re.compile(r"[^-\d]*([\-]{0,1}\d+\.\d+)[^-\d]*")
for line in f:
f.split("\n")
match = pattern.match(line)
if match:
intensity.append(match.group(0))
for k in range(len(intensity)):
outf.write(intensity[k])
but it doesn't work. The output file is empty.
the lines in data file look like:
xo_Is
xo # 9.95322254`_0.00108217853
SPVII_to_PVII_Peak_type
PVII_m(#, 1.61879`_0.08117)
PVII_h(#, 0.11649`_0.00216)
I # 0.101760618`_0.00190314017
each time the first number is the value I want to extract and the second one is the error.
You were almost there, but your code contains errors preventing it from running. The following works:
pattern = re.compile(r"[^-\d]*(-?\d+\.\d+)[^-\d]*")
with open(inf) as f, open(out, 'w') as outf:
for line in f:
match = pattern.match(line)
if match:
outf.write(match.group(1) + '\n')
I think you should test your pattern on a simple string instead of file. This will show where is the error: in pattern or in code which parsing file. Pattern looks good. Additionally in most languages i know group(0) is all captured data and for your number you need to use group(1)
Are you sure that f.slit('\n') must be inside for?

Splitting lines in python based on some character

Input:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
'!' is the starting character and +0013 should be the ending of each line (if present).
Problem which I am getting:
Output is like :
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
Any help would be highly appreciated...!!!
My code:
file_open= open('sample.txt','r')
file_read= file_open.read()
file_open2= open('output.txt','w+')
counter =0
for i in file_read:
if '!' in i:
if counter == 1:
file_open2.write('\n')
counter= counter -1
counter= counter +1
file_open2.write(i)
You can try something like this:
with open("abc.txt") as f:
data=f.read().replace("\r\n","") #replace the newlines with ""
#the newline can be "\n" in your system instead of "\r\n"
ans=filter(None,data.split("!")) #split the data at '!', then filter out empty lines
for x in ans:
print "!"+x #or write to some other file
.....:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?
lines = file_read.split('!')
Now lines is a list which holds the split data. This is almost the lines you want to write -- The only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting -- e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we'll pass to file.writelines to put the data in a new file:
file_open2.writelines('!{0}\n'.format(line) for line in lines)
You might need:
file_open2.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
if you find that you're getting more newlines than you wanted in the output.
A few other points, when opening files, it's nice to use a context manager -- This makes sure that the file is closed properly:
with open('inputfile') as fin:
lines = fin.read()
with open('outputfile','w') as fout:
fout.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
Another option, using replace instead of split, since you know the starting and ending characters of each line:
In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')
In [15]: print data.replace('+0013!', "+0013\n!")
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:
import re
outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()
It will open the output file, get the contents of the input file, and loop through all the matches using the regular expression !.+?(?=!|$) with the re.DOTALL flag. The regular expression explanation & what it matches can be found here: http://regex101.com/r/aK6aV4
After we have a match, we strip out the new lines from the match, and write it to the file.
Let's try to add a \n before every "!"; then let python splitlines :-) :
file_read.replace("!", "!\n").splitlines()
I will actually implement as a generator so that you can work on the data stream rather than the entire content of the file. This will be quite memory friendly if working with huge files
>>> def split_on_stream(it,sep="!"):
prev = ""
for line in it:
line = (prev + line.strip()).split(sep)
for parts in line[:-1]:
yield parts
prev = line[-1]
yield prev
>>> with open("test.txt") as fin:
for parts in split_on_stream(fin):
print parts
,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Categories