Iterate over previous lines after finding a pattern - Python

I am searching a file for a pattern, which can occur multiple times in a single file. Each time I find it, I want to iterate backwards, look for another pattern, and pick its first instance.
For example, if the content of the file is as below:
SetSearchExpr("This is the Search Spec 1");
...
ExecuteQuery (ForwardOnly);
var Rec2=FirstRecord();
if(Rec2!=null);
{
Then the Expected Output:
ExecuteQuery Search Spec = "This is the Search Spec 1"
I have figured out the code below to check whether ExecuteQuery is present, but I am unable to work out the logic to iterate back. My code is as below:
import sys
import os
file = open("Sample_code.txt", 'r')
for line in file:
    if "ExecuteQuery (" in line:
        # if found, then check previous lines for another pattern
        pass
If anyone could give me a steer, it would be of great help.

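No need to go backwards. Just save the SetSearchExpr() line in a variable and use that when you find ExecuteQuery().
import re

search_line = None
with open("Sample_code.txt") as file:
    for line in file:
        if 'SetSearchExpr(' in line:
            # remember the most recent SetSearchExpr() line
            search_line = line
        elif 'ExecuteQuery (' in line and search_line:
            # re.search rather than re.match, so leading whitespace is tolerated
            m = re.search(r'SetSearchExpr\((".*")\)', search_line)
            if m:
                print(f'ExecuteQuery Search Spec = {m.group(1)}')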
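If the file is small enough to hold in memory, the backward scan asked about in the question can also be written directly. The following is only a minimal sketch of that idea, not the approach recommended above: `find_search_specs` is a hypothetical helper name, and it assumes each `SetSearchExpr(...)` call sits on a single line with a double-quoted argument.

```python
import re

# Assumed shape of the call: SetSearchExpr("...") on one line.
SEARCH_EXPR_RE = re.compile(r'SetSearchExpr\((".*")\)')


def find_search_specs(lines):
    """For each ExecuteQuery line, walk backwards to the nearest SetSearchExpr."""
    specs = []
    for i, line in enumerate(lines):
        if "ExecuteQuery (" not in line:
            continue
        # Scan the preceding lines from nearest to farthest.
        for previous in reversed(lines[:i]):
            m = SEARCH_EXPR_RE.search(previous)
            if m:
                specs.append(m.group(1))
                break
    return specs


sample = [
    'SetSearchExpr("This is the Search Spec 1");',
    "...",
    "ExecuteQuery (ForwardOnly);",
    "var Rec2=FirstRecord();",
]
for spec in find_search_specs(sample):
    print(f"ExecuteQuery Search Spec = {spec}")
```

The backward scan re-reads earlier lines for every ExecuteQuery hit, so for large files the save-the-last-line approach is likely the cheaper of the two.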

Related

Regular expression result doesn't match the tester

I'm new to Python...
After a couple of days of googling, I still can't get it to work.
My script:
import re
pattern = '^Hostname=([a-zA-Z0-9.]+)'
hand = open('python_test_data.conf')
for line in hand:
    line = line.rstrip()
    if re.search(pattern, line) :
        print line
Test file content:
### Option: Hostname
# Unique, case sensitive Proxy name. Make sure the Proxy name is known to the server!
# Value is acquired from HostnameItem if undefined.
#
# Mandatory: no
# Default:
# Hostname=
Hostname=bbg-zbx-proxy
Script results:
ubuntu-workstation:~$ python python_test.py
Hostname=bbg-zbx-proxy
But when I test the regex in a tester, I get this result: https://regex101.com/r/wYUc4v/1
I need some advice on how I can get only bbg-zbx-proxy as the script output.
You have already written a regular expression that captures part of the match, so you may as well use that group. Additionally, change your character class to include - and drop the line.rstrip() call, which is not necessary with your expression.
In total this comes down to:
import re
pattern = '^Hostname=([-a-zA-Z0-9.]+)'
hand = open('python_test_data.conf')
for line in hand:
    m = re.search(pattern, line)
    if m:
        print(m.group(1))
        #     ^^^^^^^^^^
The simple solution would be to split on the equals sign. You know it will always contain that and you will be able to ignore the first item in the split.
import re
pattern = '^Hostname=([a-zA-Z0-9.]+)'
hand = open('testdata.txt')
for line in hand:
    line = line.rstrip()
    if re.search(pattern, line) :
        print(line.split("=")[1]) # UPDATED HERE
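One caveat with `split("=")[1]`: a value that itself contains `=` would be cut short. A small sketch using `str.partition` avoids that. `hostname_from_line` is a hypothetical helper name, and the assumption is that only uncommented lines starting exactly with `Hostname=` are of interest.

```python
def hostname_from_line(line):
    """Return the value of an uncommented 'Hostname=' line, or None."""
    # partition splits at the first '=' only, so the value keeps any later '='.
    key, sep, value = line.rstrip().partition("=")
    if sep and key == "Hostname" and value:
        return value
    return None


config_lines = [
    "# Hostname=",
    "Hostname=bbg-zbx-proxy",
]
for line in config_lines:
    name = hostname_from_line(line)
    if name is not None:
        print(name)
```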

Parse ~4k files for a string (sophisticated conditions)

Problem description
There is a set of ~4000 Python files with the following structure:
#ScriptInfo(number=3254,
attibute=some_value,
title="crawler for my website",
some_other_key=some_value)
scenario_name = entity.get_script_by_title(title)
The goal
The goal is to get the value of the title from the ScriptInfo decorator (in this case it is "crawler for my website"), but there are a couple of problems:
1) There is no rule for naming a variable that contains the title. That's why it can be title_name, my_title, etc. See example:
#ScriptInfo(number=3254,
attibute=some_value,
my_title="crawler for my website",
some_other_key=some_value)
scenario_name = entity.get_script_by_title(my_title)
2) The #ScriptInfo decorator may have more than two arguments so getting its contents from between the parentheses in order to get the second parameter's value is not an option
My (very naive) solution
But the piece of code that stays unchanged is the scenario_name = entity.get_script_by_title(my_title) line. Taking this into account, I've come up with the solution:
import re
title_variable_re = r"scenario_name\s?=\s?entity\.get_script_by_title\((.*)\)"
with open("python_file.py") as file:
    for line in file:
        if re.match(title_variable_re, line):
            title_variable = re.match(title_variable_re, line).group(1)
title_re = title_variable + r"\s?=\s\"(.*)\"?"
with open("python_file.py") as file:
    for line in file:
        if re.match(title_re, line):
            title_value = re.match(title_re, line).group(1)
print title_value
This snippet of code does the following:
1) Traverses (see the first with open) the script file and gets the variable with title value because it is up to a programmer to choose its name
2) Traverses the script file again (see the second with open) and gets the title's value
The question for the stackoverflow family
Is there a better and more efficient way to get the title's (my_title's, title_name's, etc) value than traversing the script file two times?
If you open the file only once and save all lines into fileContent, add break where appropriate, and reuse the matches to access the captured groups, you obtain something like this (with parentheses after print for 3.x, without for 2.7):
import re

title_value = None
title_variable = None
title_variable_re = r"scenario_name\s?=\s?entity\.get_script_by_title\((.*)\)"

with open("scenarioName.txt") as file:
    fileContent = file.read().split('\n')

# first loop over the cached lines: find the name of the title variable
for line in fileContent:
    m1 = re.match(title_variable_re, line)
    if m1:
        title_variable = m1.group(1)
        break

# second loop over the cached lines: find the value assigned to that variable
if title_variable is not None:
    title_re = r'\s*' + re.escape(title_variable) + r'\s*=\s*"([^"]*)"[,)]?\s*'
    for line in fileContent:
        m2 = re.match(title_re, line)
        if m2:
            title_value = m2.group(1)
            break

print(title_value)
Here is an unsorted list of changes to the regular expressions:
Allow space before the title_variable, that's what the r'\s*' + is for
Allow space around =
Allow comma or closing round paren in the end of the line in title_re, that's what the [,)]? is for
Allow some space in the end of the line
When tested on the following file as input:
#ScriptInfo(number=3254,
attibute=some_value,
my_title="crawler for my website",
some_other_key=some_value)
scenario_name = entity.get_script_by_title(my_title)
it produces the following output:
crawler for my website
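Since the whole file is held in memory anyway, the two line loops could arguably be replaced by two regex searches over the full text. This is a sketch under assumptions rather than a drop-in replacement: `extract_title` is a hypothetical helper, and it assumes the argument to `get_script_by_title` is a bare identifier and that the title value is a double-quoted string without embedded quotes.

```python
import re

# Assumed fixed line: scenario_name = entity.get_script_by_title(<variable>)
VARIABLE_RE = re.compile(
    r"^scenario_name\s*=\s*entity\.get_script_by_title\((\w+)\)", re.MULTILINE
)


def extract_title(source):
    """Return the quoted title value from the script text, or None."""
    var_match = VARIABLE_RE.search(source)
    if var_match is None:
        return None
    variable = re.escape(var_match.group(1))
    # Look for: <variable> = "..." anywhere in the text.
    value_match = re.search(r"\b" + variable + r'\s*=\s*"([^"]*)"', source)
    return value_match.group(1) if value_match else None


sample = '''#ScriptInfo(number=3254,
attibute=some_value,
my_title="crawler for my website",
some_other_key=some_value)
scenario_name = entity.get_script_by_title(my_title)
'''
print(extract_title(sample))
```

For ~4000 files, the same function could be called once per file on the result of `file.read()`.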

How to pass string variable into search function?

I am having issues passing a string variable into a search function.
Here is what I'm trying to accomplish:
I have a file full of values and I want to check the file to make sure a specific matching line exists before I proceed. I want to ensure that the line <endSW=UNIQUE-DNS-NAME-HERE<> exists if a valid <begSW=UNIQUE-DNS-NAME-HERE<> exists and is reachable.
Everything works fine until I call if searchForString(searchString,fileLoc): which always returns false. If I assign the variable 'searchString' a direct value and pass it it works, so I know it must be something with the way I'm combining the strings, but I can't seem to figure out what I'm doing wrong.
If I examine the data that 'searchForString' is using I see what seems to be valid values:
values in fileLines list:
['<begSW=UNIQUE-DNS-NAME-HERE<>', ' <begPortType=UNIQUE-PORT-HERE<>', ' <portNumbers=80,443,22<>', ' <endPortType=UNIQUE-PORT-HERE<>', '<endSW=UNIQUE-DNS-NAME-HERE<>']
value of searchVar:
<endSW=UNIQUE-DNS-NAME-HERE<>
An example of the entry in the file is:
<begSW=UNIQUE-DNS-NAME-HERE<>
<begPortType=UNIQUE-PORT-HERE<>
<portNumbers=80,443,22<>
<endPortType=UNIQUE-PORT-HERE<>
<endSW=UNIQUE-DNS-NAME-HERE<>
Here is the code in question:
def searchForString(searchVar,readFile):
    with open(readFile) as findMe:
        fileLines = findMe.read().splitlines()
        print fileLines
        print searchVar
        if searchVar in fileLines:
            return True
    return False
    findMe.close()
fileLoc = '/dir/folder/file'
fileLoc.lstrip()
fileLoc.rstrip()
with open(fileLoc,'r') as switchFile:
    for line in switchFile:
        #declare all the vars we need
        lineDelimiter = '#'
        endLine = '<>\n'
        begSWLine= '<begSW='
        endSWLine = '<endSW='
        begPortType = '<begPortType='
        endPortType = '<endPortType='
        portNumList = '<portNumbers='
        #skip over commented lines -(REMOVE THIS)
        if line.startswith(lineDelimiter):
            pass
        #checks the file for a valid switch name
        #checks to see if the host is up and reachable
        #checks to see if there is a file input is valid
        if line.startswith(begSWLine):
            #extract switch name from file
            switchName = line[7:-3]
            #check to make sure switch is up
            if pingCheck(switchName):
                print 'Ping success. Host is reachable.'
                searchString = endSWLine+switchName+'<>'
                #THIS PART IS SUCKING, WORKS WITH DIRECT STRING PASS
                #WONT WORK WITH A VARIABLE
                if searchForString(searchString,fileLoc):
                    print 'found!'
                else:
                    print 'not found'
Any advice or guidance would be extremely helpful.
Hard to tell without the file's contents, but I would try
switchName = line[7:-2]
So that would look like
>>> '<begSW=UNIQUE-DNS-NAME-HERE<>'[7:-2]
'UNIQUE-DNS-NAME-HERE'
Additionally, you could look into regex searches to make your cleanup more versatile.
>>> import re
>>> # re.findall(search_expression, string_to_search)
>>> re.findall(r'\=(.+)(?:\<)', '<begSW=UNIQUE-DNS-NAME-HERE<>')[0]
'UNIQUE-DNS-NAME-HERE'
>>> re.findall(r'\=(.+)(?:\<)', ' <portNumbers=80,443,22<>')[0]
'80,443,22'
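Whole-line comparisons such as `searchVar in fileLines` fail on any stray whitespace or carriage return, so a regex-based check may be more forgiving. The sketch below is one way to verify that every begSW marker has a matching endSW marker. `unclosed_switches` is a hypothetical helper, and the marker shape is assumed from the sample entry in the question.

```python
import re

# Assumed marker shape, taken from the sample entry: <begSW=NAME<> and <endSW=NAME<>
MARKER_RE = re.compile(r"<(beg|end)SW=(.+)<>")


def unclosed_switches(lines):
    """Return switch names that have a begSW marker but no matching endSW."""
    begun, ended = [], set()
    for line in lines:
        m = MARKER_RE.search(line.strip())
        if not m:
            continue  # port lines, comments, blank lines
        kind, name = m.groups()
        if kind == "beg":
            begun.append(name)
        else:
            ended.add(name)
    return [name for name in begun if name not in ended]


sample = [
    "<begSW=UNIQUE-DNS-NAME-HERE<>\n",
    " <begPortType=UNIQUE-PORT-HERE<>\n",
    " <portNumbers=80,443,22<>\n",
    " <endPortType=UNIQUE-PORT-HERE<>\n",
    "<endSW=UNIQUE-DNS-NAME-HERE<>\r\n",
]
print(unclosed_switches(sample))
```

An empty list means every switch entry is closed. The ping check from the question is deliberately left out of this sketch.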
I found the question "How to recursively iterate over XML tags in Python using ElementTree?" and used the methods detailed there to parse an XML file instead of a TXT file.

Can't get unique word/phrase counter to work - Python

I'm having trouble getting anything to write to my output file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document, and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter
# path to folder containg all the files
str_dir_folder = '../data'
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'
# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
    if str_each_file.endswith('data'):
        # open file and read
        with open(str_dir_folder+str_each_file,'r') as file_r_data:
            str_file_data = file_r_data.read()
        # add data to list
        list_file_data.append(str_file_data)
# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)
# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)
# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)
# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
    str_word = str_each_item
    int_freq = dict_word_count[str_each_item]
    str_out_line = '"%s",%d' % (str_word,int_freq)
    # populates output list
    list_output_data.append(str_out_line)
# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list).
Thanks very much.
It would be helpful if we got more information, such as what you've tried and what sorts of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:
list_file_data is assigned to '../data/phrases.txt', but there is then a loop through all files in a directory. Since you don't have any handling for multiple files elsewhere, I've removed that logic and referenced the file listed in list_file_data (and added a small bit of error handling). If you do want to walk through a directory, I'd suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm).
You named your file 'phrases.txt' but then check whether the file names end with 'data'. I've removed this logic.
You've placed the data set into a list when findall works just fine with strings and ignores the special characters that you've manually removed. Test here: https://regex101.com/ to make sure.
Changed 'w+' to '\w+' - check out the above link.
Converting to a list outside of the output loop isn't necessary - your dict_word_count is a Counter object, which has an 'iteritems' method (Python 2; 'items' in Python 3) to roll through each key and value. Also changed the variable name to 'counter_word_count' to be slightly more accurate.
Instead of manually generating CSVs, I've imported csv and used the writerow method (and quoting options).
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall,sub
# name and location of output file
str_output_file = '../data/word_count.txt'
# the file containing all the phrases
list_file_data = '../data/phrases.txt'
if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))
with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()
# find all the words and put them into a list
list_all_words = findall(r'\w+',str_file_data)
# counter with all the times a word has been used
counter_word_count = Counter(list_all_words)
with open(str_output_file, 'w') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    # iteritems() is Python 2; use items() on Python 3
    for key, value in counter_word_count.iteritems():
        output_row = [key, value]
        writer.writerow(output_row)
Something like this?
from collections import Counter
from glob import glob
def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()
tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())
for k in sorted(tally, key=tally.get, reverse=True):
    print k, tally[k]
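For readers on Python 3, the same idea can be sketched as follows. The helper names `count_words` and `write_counts` are made up for illustration, lowercasing the text is an added assumption (drop it if case matters), and an in-memory buffer stands in for the real output file so the example runs without the `../data` directory.

```python
import csv
import io
import re
from collections import Counter


def count_words(text):
    """Count word occurrences, case-insensitively, using a word-character regex."""
    return Counter(re.findall(r"\w+", text.lower()))


def write_counts(counts, out_file):
    """Write word,frequency rows, most frequent first."""
    writer = csv.writer(out_file, quoting=csv.QUOTE_ALL)
    writer.writerow(["word", "freq"])
    for word, freq in counts.most_common():
        writer.writerow([word, freq])


phrases = "the quick fox\nthe lazy dog\n"
counts = count_words(phrases)
# Stand-in for open('../data/word_count.txt', 'w', newline='')
buffer = io.StringIO()
write_counts(counts, buffer)
print(buffer.getvalue())
```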

Python functions and for loops

I'm new to Python programming and I do not seem to get the right behavior from a FOR loop.
I've got a list of ids, and I want to iterate a ".gtf" file (tab separated multi-line) and extract from it some values corresponding to those ids.
It seems that the construction of the regex is not working correctly inside the findgtf function. From the second iteration onward, the "id" variable passed to the function is not used for the regex pattern of "sc" variable and subsequently, the pattern matching doesn't work. Do I need to reinitialize the variables "id" or/and "sc" before each iteration?
If so, could you tell me how to achieve that?
Here is the code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys, os, re
#Usage:gtf_parser_4.py [path_to_dir] [IDlist]
#######FUNCTIONS######################################
def findgtf(id, gtf):
    id=id.strip()#remove \n
    #print "Received Id: *"+id+"* post-stripped"
    for line in gtf:
        seq, source, feat, start, end, score, strand, frame, attribute = line.strip().split("\t")
        sc = re.search(str(id), str(attribute))
        if sc:
            print "Coord of "+id+" -> Start: "+str(start)+" End: "+str(end)
###########################MAIN#########################
#Arguments retrieval
mydir = sys.argv[1]
#print"Directory : "+mydir
IDlist = sys.argv[2]
#print"IDlist : "+IDlist
path2ID = os.path.join(mydir, IDlist)
#print"Full IdList: "+path2ID
#lines to list
IDlines = [line.rstrip('\n') for line in open(path2ID)]
#Open and read dir
for file in os.listdir(mydir):
    if file.endswith(".gtf"):
        path2file = os.path.join(mydir, file)
        #print"Full gtf : "+path2file
        gtf = open(path2file,"r")
        for id in IDlines:
            print"ID submitted to findgtf: "+id
            fg = findgtf(id, gtf)
        gtf.close()
And here are the results retrieved from the console (submitted an Idlist with 3 ids: LX00_00030, gyrB, LX00_00065 ):
ID submitted to findgtf: LX00_00030
Coord of LX00_00030 -> Start: 4299 End: 5303
ID submitted to findgtf: gyrB
ID submitted to findgtf: LX00_00065
As you can see, the first ID worked correctly, but the second and third do not yield any result (although they do if their order is switched in the IDlist).
Thanks in advance for your help
Your code is not working because you are trying to repeatedly iterate over the same file object. A file keeps track of the position you've read to internally, so when you've read to the end, you can't read any more!
To make your code work, you need to seek back to the start of the file before iterating over it again.
for id in IDlines:
    print"ID submitted to findgtf: "+id
    gtf.seek(0) # seek to the start of the file
    fg = findgtf(id, gtf)
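An alternative to rewinding with `seek(0)` is to read the file only once and check every id against each line. The following is a rough Python 3 sketch of that idea: `find_coords` is a hypothetical helper, the two records are made-up sample data, and a plain substring test replaces the `re.search` call from the question.

```python
import io


def find_coords(ids, gtf_lines):
    """Map each id to the (start, end) pairs of records whose attribute mentions it."""
    wanted = [i.strip() for i in ids]
    coords = {i: [] for i in wanted}
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip comments or malformed lines
        start, end, attribute = fields[3], fields[4], fields[8]
        for i in wanted:
            if i in attribute:
                coords[i].append((start, end))
    return coords


# Made-up two-record sample standing in for a real .gtf file.
gtf = io.StringIO(
    'chr\tsrc\tgene\t4299\t5303\t.\t+\t.\tgene_id "LX00_00030";\n'
    'chr\tsrc\tgene\t6000\t8400\t.\t+\t.\tgene_id "LX00_00035"; gene_name "gyrB";\n'
)
result = find_coords(["LX00_00030", "gyrB", "LX00_00065"], gtf)
for gene_id, spans in result.items():
    print(gene_id, spans)
```

This way the file object is consumed exactly once, so its read position never needs resetting, at the cost of looping over the id list for every line.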
