Python: Get/Scan All Text After a Certain String

I have a text file which I read using readlines(). I need to start extracting data only after a keyword appears in the file. For example, after the keyword Hello World below, I would like to retrieve the value 100 from Blah=100:
Blah=0
Blah=2
Hello World
All the Text
Will be Scan
And Relevant
Info will be
Retrieved Blah=100
I can easily retrieve the information I want from the text file, but I need to start retrieving it ONLY after a certain keyword, such as 'Hello World' above. What I am currently doing is retrieving the value using .split('='), so I get all 3 values: Blah=0, Blah=2 and Blah=100. I only want the value that comes after the keyword 'Hello World', which is Blah=100.
There must be a simple way to do this. Please help. Thanks.

There are many ways to do it. Here's one:
STARTER = "Hello World"
FILENAME = "data.txt"
TARGET = "Blah="
with open(FILENAME) as f:
    value = None
    start_seen = False
    for line in f:
        if line.strip() == STARTER:
            start_seen = True
            continue
        if TARGET in line and start_seen:
            _, value = line.split('=')
            break
if value is not None:
    print("Got value %d" % int(value))
else:
    print("Nothing found")

Here's a slightly pseudo-codish answer- you just need a flag that changes to True once you've found the keyword:
thefile = open('yourfile.txt')
key = "Hello World"
key_found = False
for line in thefile:
    if key_found:
        get_value(line)
        # Optional: turn off key_found once you've found the value
        # key_found = False
    elif line.startswith(key):
        key_found = True
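The flag idea above can be made runnable by filling in the placeholder. This is only a sketch: values_after and its target parameter are hypothetical names, and the sample text from the question is hard-coded instead of being read from a file.

```python
def values_after(lines, key, target="Blah="):
    """Yield the text after `target` on every line that follows `key`."""
    key_found = False
    for line in lines:
        if key_found:
            if target in line:
                # Keep only what follows the target, e.g. "100" from "Blah=100".
                yield line.split(target, 1)[1].strip()
        elif line.startswith(key):
            key_found = True


sample = """Blah=0
Blah=2
Hello World
All the Text
Will be Scan
And Relevant
Info will be
Retrieved Blah=100
""".splitlines()

print(list(values_after(sample, "Hello World")))  # ['100']
```

Because the function takes any iterable of lines, an open file object can be passed in place of the sample list.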

Here's one way, not necessarily the best; I hard-coded the text here, but you could use file.read() to get similar results:
the_text = '''Blah=0
Blah=2
Hello World
All the Text
Will be Scan
And Relevant
Info will be
Retrieved Blah=100
'''
keyword = 'Hello World'
lines = the_text.split('\n')
# Drop everything before the line that contains the keyword
for line_num, line in enumerate(lines):
    if line.find(keyword) != -1:
        lines = lines[line_num:]
        break
the_value = None
value_key = 'Blah'
# Take the first value that appears after the keyword
for line in lines:
    if line.find(value_key) != -1:
        the_value = line.split('=', 2)[1]
        break
if the_value:
    print(the_value)

Example with regex:
import re
f_name = "data.txt"  # placeholder: path to your input file
reg = re.compile("Hello World")
data_re = re.compile(r"Blah=(?P<value>\d+)")  # \d+ so that all digits of 100 are captured
with open(f_name) as f:
    need_search = False
    for l in f:
        if reg.search(l) is not None:
            need_search = True
        if need_search:
            res = data_re.search(l)
            if res is not None:
                print(res.group('value'))

Related

Simplified grep in python

I need to create a simplified version of grep in Python which prints a line when it contains a keyword. For example, running "python mygrep.py duck animals.txt" should output "The duck goes quack". I have a file that contains different lines, but I'm not sure how to print only the line that contains the keyword, such as the line with "duck" in it. I'm supposed to use only "import sys" and not "re", since it's supposed to be a simple version.
import sys
def main():
    if len(sys.argv) != 3:
        exit('Please pass 2 arguments.')
    search_text = sys.argv[1]
    filename = sys.argv[2]
    with open("animals.txt", 'r') as f:
        text = f.read()
    for line in text:
        print(line)
if __name__ == '__main__':
    main()

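The 'in' operator should be sufficient. Note that text = f.read() is one long string, so loop over its lines rather than its characters:
for line in text.splitlines():
    if search_text in line:
        print(line)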
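Putting the pieces together, a complete version might look like the sketch below. It uses only sys, as the question requires. The name grep_lines and the sample lines other than the duck one are made up for illustration, and main opens the filename given on the command line rather than the hard-coded "animals.txt".

```python
import sys


def grep_lines(search_text, lines):
    """Return the lines that contain search_text, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if search_text in line]


def main(argv):
    """Entry point: call as main(sys.argv) from an `if __name__ == '__main__':` guard."""
    if len(argv) != 3:
        sys.exit("Usage: python mygrep.py SEARCH_TEXT FILENAME")
    search_text, filename = argv[1], argv[2]
    with open(filename) as f:  # a file object yields one line at a time
        for line in grep_lines(search_text, f):
            print(line)


# In-memory demonstration with made-up data.
animals = ["The cow goes moo\n", "The duck goes quack\n", "The dog goes woof\n"]
print(grep_lines("duck", animals))  # ['The duck goes quack']
```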

Here is an implementation of grep in Python with an after/before feature:
def _fetch_logs(self, data, log_file, max_result_size, current_result_size):
    after = data.get("after", 0)
    before = data.get("before", 0)
    exceeded_max = False
    result = []
    before_counter = 0
    frame = []
    found = False
    for line in log_file:
        frame.append(line)
        match_patterns = all(self._search_in(data, pattern, line) for pattern in data["patterns"])
        if match_patterns:
            before_counter = len(frame)
            found = True
        if not found and len(frame) > before:
            frame.pop(0)
        if found and len(frame) >= before_counter + after:
            found = False
            before_counter = 0
            result += frame
            frame = []
        if current_result_size + len(result) >= max_result_size:
            exceeded_max = True
            break
    if found:
        result += frame
    return exceeded_max, result

replace function not working with list items

I am trying to use the replace function to take items from a list and replace the fields below with their corresponding values, but no matter what I do, it only seems to work when it reaches the end of the range (on its last possible value of i, it successfully replaces a substring, but before that it does not).
for i in range(len(fieldNameList)):
    foo = fieldNameList[i]
    bar = fieldValueList[i]
    msg = msg.replace(foo, bar)
    print msg
This is what I get after running that code
<<name>> <<color>> <<age>>
<<name>> <<color>> <<age>>
<<name>> <<color>> 18
I've been stuck on this for way too long. Any advice would be greatly appreciated. Thanks :)
Full code:
def writeDocument():
    msgFile = raw_input("Name of file you would like to create or write to?: ")
    msgFile = open(msgFile, 'w+')
    msg = raw_input("\nType your message here. Indicate replaceable fields by surrounding them with \'<<>>\' Do not use spaces inside your fieldnames.\n\nYou can also create your fieldname list here. Write your first fieldname surrounded by <<>> followed by the value you'd like to assign, then repeat, separating everything by one space. Example: \"<<name>> ryan <<color>> blue\"\n\n")
    msg = msg.replace(' ', '\n')
    msgFile.write(msg)
    msgFile.close()
    print "\nDocument written successfully.\n"
def fillDocument():
    msgFile = raw_input("Name of file containing the message you'd like to fill?: ")
    fieldFile = raw_input("Name of file containing the fieldname list?: ")
    msgFile = open(msgFile, 'r+')
    fieldFile = open(fieldFile, 'r')
    fieldNameList = []
    fieldValueList = []
    fieldLine = fieldFile.readline()
    while fieldLine != '':
        fieldNameList.append(fieldLine)
        fieldLine = fieldFile.readline()
        fieldValueList.append(fieldLine)
        fieldLine = fieldFile.readline()
    print fieldNameList[0]
    print fieldValueList[0]
    print fieldNameList[1]
    print fieldValueList[1]
    msg = msgFile.readline()
    for i in range(len(fieldNameList)):
        foo = fieldNameList[i]
        bar = fieldValueList[i]
        msg = msg.replace(foo, bar)
        print msg
    msgFile.close()
    fieldFile.close()
###Program Starts#####--------------------
while True==True:
    objective = input("What would you like to do?\n1. Create a new document\n2. Fill in a document with fieldnames\n")
    if objective == 1:
        writeDocument()
    elif objective == 2:
        fillDocument()
    else:
        print "That's not a valid choice."
Message file:
<<name>> <<color>> <<age>>
Fieldname file:
<<name>>
ryan
<<color>>
blue
<<age>>
18

Cause:
Every line read from the fieldname file, except the last one, ends with a "\n" character. So when the program reaches the replacement step, fieldNameList, fieldValueList and msg look like this:
fieldNameList = ['<<name>>\n', '<<color>>\n', '<<age>>\n']
fieldValueList = ['ryan\n', 'blue\n', '18']
msg = '<<name>> <<color>> <<age>>\n'
so replace() actually searches msg for '<<name>>\n', '<<color>>\n' and '<<age>>\n', and only the <<age>> field gets replaced. (It matches only because the message file ends with "\n"; without that trailing newline, <<age>> would not be replaced either.)
Solution:
Use the rstrip() method when reading lines to strip the newline character at the end:
fieldLine = fieldFile.readline().rstrip()
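As a sketch of the same fix in a more compact form, the alternating name/value lines can be stripped and paired up before replacing. The helper names read_fields and fill are hypothetical, the code is written for Python 3, and io.StringIO stands in for the real fieldname file.

```python
import io


def read_fields(field_file):
    """Read alternating name/value lines into a list of (name, value) pairs."""
    lines = [line.rstrip("\n") for line in field_file]
    # Even positions hold field names, odd positions hold their values.
    return list(zip(lines[0::2], lines[1::2]))


def fill(msg, fields):
    """Replace every field name in msg with its value."""
    for name, value in fields:
        msg = msg.replace(name, value)
    return msg


# Same contents as the fieldname and message files in the question.
field_file = io.StringIO("<<name>>\nryan\n<<color>>\nblue\n<<age>>\n18")
msg = "<<name>> <<color>> <<age>>\n"
print(fill(msg, read_fields(field_file)))  # ryan blue 18
```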

How to extract data from a text file based on a regular expression pattern

I need some help for a python program. I've tried so many things, for hours, but it doesn't work.
Anyone who can help me?
This is what I need:
I have this file: http://www.filedropper.com which contains information about proteins.
I want to filter only the proteins which match the ...exists.
From these proteins, I need only the ... (the text of 6 tokens, after >sp|, and the species (second line, between the [])
I want the .. and ..in a .., and eventually in a table.
....
Human AAA111
Mouse BBB222
Fruit fly CCC333
What I have so far:
import re
def main():
    ReadFile()
    file = open ("file.txt", "r")
    FilterOnRegEx(file)
def ReadFile():
    try:
        file = open ("file.txt", "r")
    except IOError:
        print ("File not found!")
    except:
        print ("Something went wrong.")
def FilterOnRegEx(file):
    f = ("[AG].{4}GK[ST]")
    for line in file:
        if f in line:
            print (line)
main()
You're a hero if you help me out!

My first recommendation is to use a with statement when opening files:
with open("ploop.fa", "r") as file:
    FilterOnRegEx(file)
The problem with your FilterOnRegEx method is the test if ploop in line (if f in line in the code above). The in operator, with string arguments, searches the string line for the exact text in ploop; it does not treat that text as a regular expression.
Instead you need to compile the text form to an re object, then search for matches:
def FilterOnRegEx(file):
    ploop = ("[AG].{4}GK[ST]")
    pattern = re.compile(ploop)
    for line in file:
        match = pattern.search(line)
        if match is not None:
            print (line)
This will help you to move forward.
As a next step, I would suggest learning about generators. Printing the lines that match is great, but it doesn't let you do further operations with them. I might change print to yield so that I could process the data further, such as extracting the parts you want and reformatting them for output.
As a simple demonstration:
def FilterOnRegEx(file):
    ploop = ("[AG].{4}GK[ST]")
    pattern = re.compile(ploop)
    for line in file:
        match = pattern.search(line)
        if match is not None:
            yield line
with open("ploop.fa", "r") as file:
    for line in FilterOnRegEx(file):
        print(line)
Addendum: I ran the code I posted above on the sample of the data that you posted, and it successfully prints some lines and not others. In other words, the regular expression matched some of the lines and did not match others. So far so good. However, the data you need is not all on one line in the input! That means that filtering individual lines on the pattern is insufficient (unless I am not seeing the correct line breaks in the question). With the data laid out as in the question, you'll need a more robust parser that keeps state, so that it knows when a record begins, when a record ends, and what any given line in the middle of a record is.
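A rough sketch of such a record-based parser is shown below. The record layout is an assumption, not something confirmed by the question: each record is taken to start with a ">sp|" header line, followed by a description line with the species in square brackets, followed by sequence lines only. The function names and the sample data are made up for illustration.

```python
import re

PLOOP = re.compile(r"[AG].{4}GK[ST]")
SPECIES = re.compile(r"\[([^\]]+)\]")


def records(lines):
    """Group lines into records; each record starts at a '>sp|' header line."""
    current = None
    for line in lines:
        line = line.rstrip()
        if line.startswith(">sp|"):
            if current is not None:
                yield current
            current = [line]
        elif current is not None and line:
            current.append(line)
    if current is not None:
        yield current


def ploop_hits(lines):
    """Yield (species, accession) for records whose sequence contains the motif."""
    for header, *rest in records(lines):
        if not rest:
            continue
        accession = header.split("|")[1]   # text between the first two bars
        species = SPECIES.search(rest[0])  # assumed: species is on the second line
        sequence = "".join(rest[1:])       # assumed: all remaining lines are sequence
        if species and PLOOP.search(sequence):
            yield species.group(1), accession


# Made-up sample in the assumed layout.
sample = [
    ">sp|AAA111|EXAMPLE1",
    "Made-up protein [Human]",
    "MKTAYIAGSSGSGKST",
    "LLNQ",
    ">sp|BBB222|EXAMPLE2",
    "Another made-up protein [Mouse]",
    "MKTLLNQWWWW",
]
print(list(ploop_hits(sample)))  # [('Human', 'AAA111')]
```

Joining the sequence lines before searching matters because the motif may straddle a line break; if the real file has extra lines inside a record, the grouping logic would need to be adjusted.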

This seems to work on your sample text. I don't know if you can have more than one extract per file, and I'm out of time here, so you'll have to extend it if needed:
#!python3
import re
Extract = {}
def match_notes(line):
    global _State
    pattern = r"^\s+(.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        if 'notes' not in Extract:
            Extract['notes'] = []
        Extract['notes'].append(m.group(1))
        return True
    else:
        _State = match_sp
        return False
def match_pattern(line):
    global _State
    pattern = r"^\s+Pattern: (.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        Extract['pattern'] = m.group(1)
        _State = match_notes
        return True
    return False
def match_sp(line):
    global _State
    pattern = r">sp\|([^|]+)\|(.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        if 'sp' not in Extract:
            Extract['sp'] = []
        spinfo = {
            'accession code': m.group(1),
            'other code': m.group(2),
        }
        Extract['sp'].append(spinfo)
        _State = match_sp_note
        return True
    return False
def match_sp_note(line):
    """Second line of >sp paragraph"""
    global _State
    pattern = r"^([^[]*)\[([^]]+)\)"
    m = re.match(pattern, line.rstrip())
    if m:
        spinfo = Extract['sp'][-1]
        spinfo['note'] = m.group(1).strip()
        spinfo['species'] = m.group(2).strip()
        spinfo['sequence'] = ''
        _State = match_sp_sequence
        return True
    return False
def match_sp_range(line):
    """Last line of >sp paragraph"""
    global _State
    pattern = r"^\s+(\d+) - (\d+):\s+(.*)"
    m = re.match(pattern, line.rstrip())
    if m:
        spinfo = Extract['sp'][-1]
        spinfo['range'] = (m.group(1), m.group(2))
        spinfo['flags'] = m.group(3)
        _State = match_sp
        return True
    return False
def match_sp_sequence(line):
    """Middle block of >sp paragraph"""
    global _State
    spinfo = Extract['sp'][-1]
    if re.match(r"^\s", line):
        # End of sequence. Check for pattern, reset state for sp.
        # re.search (not re.match) so the motif is found anywhere in the sequence.
        if re.search(r"[AG].{4}GK[ST]", spinfo['sequence']):
            spinfo['ag_4gkst'] = True
        else:
            spinfo['ag_4gkst'] = False
        _State = match_sp_range
        return False
    spinfo['sequence'] += line.rstrip()
    return True
def match_start(line):
    """Start of outer item"""
    global _State
    pattern = r"^Hits for ([A-Z]+\d+)|([^:]+) : (?:\[occurs (\w+)\])?"
    m = re.match(pattern, line.rstrip())
    if m:
        Extract['pattern_id'] = m.group(1)
        Extract['title'] = m.group(2)
        Extract['occurrence'] = m.group(3)
        _State = match_pattern
        return True
    return False
_State = match_start
def process_line(line):
    # Keep offering the line to the current state until one accepts it
    # or the state stops changing.
    while True:
        state = _State
        if state(line):
            return True
        if _State is not state:
            continue
        if len(line) == 0:
            return False
        print("Unexpected line:", line)
        print("State was:", _State)
        return False
def process_file(filename):
    with open(filename, "r") as infile:
        for line in infile:
            process_line(line.rstrip())
process_file("ploop.fa")
import pprint
pprint.pprint(Extract)

Split large text file using keyword delimiter

I'm trying to split a large text file into smaller text files by using a word as a delimiter. I tried searching, but I've only seen posts about breaking files apart after x lines. I'm fairly new to programming, but I've made a start. I want to go through all the lines, and if a line starts with hello, put it and all the following lines into one file until the next hello is reached. The first word in the file is hello. Ultimately, I'm trying to get the text into R, but I think it would be easier if I split it up like this first. Any help is appreciated, thanks.
text_file = open("myfile.txt","r")
lines = text_file.readlines()
print len(lines)
for line in lines :
    print line
    if line[0:5] == "hello":

If you are looking for very simple logic, try this:
text_file = open("myfile.txt","r")
lines = text_file.readlines()
print(len(lines))
target = open("filename.txt", 'a')  ## 'a' will append, 'w' will overwrite
hello1Found = False
hello2Found = False
for line in lines:
    if hello1Found == True:
        if line[0:5] == "hello":
            hello2Found = True
            hello1Found = False
            ## When the second hello is found, looping/saving to the file stops
            ## (using break is not good practice, but it suffices for this simple requirement)
            break
        else:
            print(line)  # write the line to the new file
            target.write(line)
    if hello1Found == False:
        if line[0:5] == "hello":  ## find the first occurrence of hello
            hello1Found = True
            print(line)
            ## Once hello is found for the first time, write this line and the
            ## subsequent lines to the new file until the second hello occurs
            target.write(line)
target.close()
text_file.close()

I am new to Python. I just finished a Python for Geographic Information Systems class at Northeastern University. This is what I came up with.
def files():
    # Generator that opens a new numbered output file each time it is advanced
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.txt' % n, 'w')
pattern = 'hello'
fs = files()
outfile = next(fs)
filename = r'C:\output\dir\filename.txt'
with open(filename) as infile:
    for line in infile:
        if pattern not in line:
            outfile.write(line)
        else:
            items = line.split(pattern)
            # Text before the first delimiter still belongs to the current file
            outfile.write(items[0])
            for item in items[1:]:
                # Each further piece starts a new file. Note that split()
                # drops the delimiter; write pattern + item to keep it.
                outfile.close()
                outfile = next(fs)
                outfile.write(item)
outfile.close()
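For the requirement as stated in the question (every line starting with hello begins a new output file), one possible sketch is to group the lines first and write the groups afterwards. The names split_on_keyword, write_chunks and the part_%d.txt file name template are hypothetical.

```python
def split_on_keyword(lines, keyword="hello"):
    """Group lines into chunks; a new chunk starts at each line beginning with keyword."""
    chunks = []
    for line in lines:
        # Also start a chunk for any text that precedes the first keyword line.
        if line.startswith(keyword) or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    return chunks


def write_chunks(chunks, name_template="part_%d.txt"):
    """Write each chunk to its own numbered file (part_1.txt, part_2.txt, ...)."""
    for n, chunk in enumerate(chunks, start=1):
        with open(name_template % n, "w") as out:
            out.writelines(chunk)


sample = ["hello one\n", "a\n", "b\n", "hello two\n", "c\n"]
print(split_on_keyword(sample))
# [['hello one\n', 'a\n', 'b\n'], ['hello two\n', 'c\n']]
```

To split a real file, pass the open file object to split_on_keyword and hand the result to write_chunks. This holds the whole input in memory, so for a very large file it may be preferable to write each line as it is read, as the generator-based answer above does.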

Parsing Input File in Python

I have a plain text file with some data in it, that I'm trying to open and read using a Python (ver 3.2) program, and trying to load that data into a data structure within the program.
Here's what my text file looks like (file is called "data.txt")
NAME: Joe Smith
CLASS: Fighter
STR: 14
DEX: 7
Here's what my program looks like:
player_name = None
player_class = None
player_STR = None
player_DEX = None
f = open("data.txt")
data = f.readlines()
for d in data:
    # parse input, assign values to variables
    print(d)
f.close()
My question is, how do I assign the values to the variables (something like setting player_STR = 14 within the program)?

player = {}
f = open("data.txt")
data = f.readlines()
for line in data:
    # parse input, assign values to variables
    key, value = line.split(":")
    player[key.strip()] = value.strip()
f.close()
Now the name of your player will be player['NAME'] (the keys keep the capitalization used in the file), and the same goes for all other properties in your file.

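import re
# One "KEY: value" pair per line; "." does not match a newline,
# so each value stops at the end of its line
pattern = re.compile(r'(\w+): (.+)')
f = open("data.txt")
v = dict(pattern.findall(f.read()))
player_name = v.get("NAME")
player_class = v.get("CLASS")
# ...
f.close()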

The most direct way to do it is to assign the variables one at a time:
f = open("data.txt")
for line in f:                  # loop over the file directly
    line = line.rstrip()        # remove the trailing newline
    if line.startswith('NAME: '):
        player_name = line[6:]
    elif line.startswith('CLASS: '):
        player_class = line[7:]
    elif line.startswith('STR: '):
        player_strength = int(line[5:])
    elif line.startswith('DEX: '):
        player_dexterity = int(line[5:])
    else:
        raise ValueError('Unknown attribute: %r' % line)
f.close()
That said, most Python programmers would store the values in a dictionary rather than in separate variables. The fields can be stripped (removing the line endings) and split with: characteristic, value = data.rstrip().split(':'). If the value should be a number instead of a string, convert it with float() or int().
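A minimal sketch of that dictionary approach, assuming the file layout shown in the question. The name parse_player and its numeric_fields parameter are illustrative, and a list of lines stands in for the open file.

```python
def parse_player(lines, numeric_fields=("STR", "DEX")):
    """Build a dict from 'KEY: value' lines, converting numeric fields to int."""
    player = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        player[key] = int(value) if key in numeric_fields else value
    return player


sample = ["NAME: Joe Smith\n", "CLASS: Fighter\n", "STR: 14\n", "DEX: 7\n"]
print(parse_player(sample))
# {'NAME': 'Joe Smith', 'CLASS': 'Fighter', 'STR': 14, 'DEX': 7}
```

With a real file, the call would be parse_player(f) inside a with open("data.txt") as f: block.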
