Parse ~4k files for a string (sophisticated conditions) - python

Problem description
There is a set of ~4000 Python files with the following structure:
#ScriptInfo(number=3254,
            attribute=some_value,
            title="crawler for my website",
            some_other_key=some_value)
scenario_name = entity.get_script_by_title(title)
The goal
The goal is to get the value of the title from the ScriptInfo decorator (in this case it is "crawler for my website"), but there are a couple of problems:
1) There is no rule for naming a variable that contains the title. That's why it can be title_name, my_title, etc. See example:
#ScriptInfo(number=3254,
            attribute=some_value,
            my_title="crawler for my website",
            some_other_key=some_value)
scenario_name = entity.get_script_by_title(my_title)
2) The #ScriptInfo decorator may have more than two arguments, so grabbing everything between the parentheses and taking the second parameter's value is not an option
My (very naive) solution
The one piece of code that stays unchanged is the scenario_name = entity.get_script_by_title(my_title) line. Taking this into account, I've come up with this solution:
import re

title_variable_re = r"scenario_name\s?=\s?entity\.get_script_by_title\((.*)\)"
with open("python_file.py") as file:
    for line in file:
        if re.match(title_variable_re, line):
            title_variable = re.match(title_variable_re, line).group(1)

title_re = title_variable + r"\s?=\s?\"(.*)\"?"
with open("python_file.py") as file:
    for line in file:
        if re.match(title_re, line):
            title_value = re.match(title_re, line).group(1)

print title_value
This snippet of code does the following:
1) Traverses the script file (see the first with open) and gets the variable holding the title value, since its name is up to the programmer
2) Traverses the script file again (see the second with open) and gets the title's value
The question for the Stack Overflow family
Is there a better and more efficient way to get the title's (my_title's, title_name's, etc) value than traversing the script file two times?

If you open the file only once, save all lines into fileContent, add break where appropriate, and reuse the match objects to access the captured groups, you get something like this (with parentheses after print for 3.x, without for 2.7):
import re

title_value = None
title_variable_re = r"scenario_name\s?=\s?entity\.get_script_by_title\((.*)\)"
with open("scenarioName.txt") as file:
    fileContent = file.read().split('\n')

title_variable = None
for line in fileContent:
    m1 = re.match(title_variable_re, line)
    if m1:
        title_variable = m1.group(1)
        break

title_re = r'\s*' + title_variable + r'\s*=\s*"([^"]*)"[,)]?\s*'
for line in fileContent:
    m2 = re.match(title_re, line)
    if m2:
        title_value = m2.group(1)
        break

print(title_value)
Here is an unsorted list of changes in the regular expressions:
Allow space before the title_variable, that's what the r'\s*' + is for
Allow space around =
Allow a comma or closing parenthesis at the end of the line in title_re, that's what the [,)]? is for
Allow some space at the end of the line
When tested on the following file as input:
#ScriptInfo(number=3254,
            attribute=some_value,
            my_title="crawler for my website",
            some_other_key=some_value)
scenario_name = entity.get_script_by_title(my_title)
it produces the following output:
crawler for my website
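For what it's worth, a single regular expression with a backreference can also tie the two lines together in one pass. This is only a sketch, assuming the whole file fits in memory and the title assignment appears before the get_script_by_title call:

import re

# \1 refers back to the variable name captured in the first group, so only
# the assignment whose name is actually passed to the call can match.
pattern = re.compile(
    r'(\w+)\s*=\s*"([^"]*)"'    # e.g. my_title="crawler for my website"
    r'[\s\S]*?'                 # anything in between, non-greedy
    r'scenario_name\s*=\s*entity\.get_script_by_title\(\1\)'
)

with open("python_file.py") as file:
    m = pattern.search(file.read())
if m:
    print(m.group(2))  # crawler for my website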

Related

Regular expression result doesn't match with tester

I'm new to Python...
After a couple of days of googling I still can't get it to work.
My script:
import re

pattern = '^Hostname=([a-zA-Z0-9.]+)'
hand = open('python_test_data.conf')
for line in hand:
    line = line.rstrip()
    if re.search(pattern, line):
        print line
Test file content:
### Option: Hostname
# Unique, case sensitive Proxy name. Make sure the Proxy name is known to the server!
# Value is acquired from HostnameItem if undefined.
#
# Mandatory: no
# Default:
# Hostname=
Hostname=bbg-zbx-proxy
Script results:
ubuntu-workstation:~$ python python_test.py
Hostname=bbg-zbx-proxy
But when I tested the regex in a tester, the result is: https://regex101.com/r/wYUc4v/1
I need some advice on how I can get only bbg-zbx-proxy as the script output.
You have already written a regular expression that captures the part of the match you want, so you might as well use it. Additionally, change your character class to include - and get rid of the line.rstrip() call; it's not necessary with your expression.
In total this comes down to:
import re

pattern = '^Hostname=([-a-zA-Z0-9.]+)'
hand = open('python_test_data.conf')
for line in hand:
    m = re.search(pattern, line)
    if m:
        print(m.group(1))
        #       ^^^^^^^^
The simple solution would be to split on the equals sign. You know it will always contain that and you will be able to ignore the first item in the split.
import re

pattern = '^Hostname=([a-zA-Z0-9.]+)'
hand = open('testdata.txt')
for line in hand:
    line = line.rstrip()
    if re.search(pattern, line):
        print(line.split("=")[1])  # UPDATED HERE
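One caveat with a plain split: if a value ever contains = itself, everything after the second = is lost. Limiting the split keeps the remainder intact; a small sketch under that assumption:

with open('python_test_data.conf') as hand:
    for line in hand:
        if line.startswith('Hostname='):
            # split at most once so values containing '=' survive intact
            print(line.rstrip().split('=', 1)[1])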

Python 3 - How to remove line/paragraph breaks

from docx import Document

alphaDic = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','!','?','.','~',',','(',')','$','-',':',';',"'",'/']

doc = Document('input.docx')  # not in the original snippet: the document and index must exist
docIndex = 0
while docIndex < len(doc.paragraphs):
    firstSen = doc.paragraphs[docIndex].text
    rep_dic = {ord(k): None for k in alphaDic + [x.upper() for x in alphaDic]}
    translation = firstSen.translate(rep_dic)
    removeSpaces = " ".join(translation.split())
    removeLineBreaks = removeSpaces.replace('\n', '')
    doc.paragraphs[docIndex].text = removeLineBreaks
    docIndex += 1
I am attempting to remove line breaks from the document, but it doesn't work.
I am still getting

Hello
There

rather than

Hello There
I think what you want to do is get rid of an empty paragraph. The following function could help, it deletes a certain paragraph that you don't want:
def delete_paragraph(paragraph):
    p = paragraph._element
    p.getparent().remove(p)
    p._p = p._element = None
Code by: Scanny*
In your code, you could check whether translation is empty, and if it is, call the delete_paragraph function, so your code would look like:
while docIndex < len(doc.paragraphs):
    firstSen = doc.paragraphs[docIndex].text
    rep_dic = {ord(k): None for k in alphaDic + [x.upper() for x in alphaDic]}
    translation = firstSen.translate(rep_dic)
    if translation != '':
        doc.paragraphs[docIndex].text = translation
    else:
        delete_paragraph(doc.paragraphs[docIndex])
        docIndex -= 1  # go one step back in the loop because of the deleted index
    docIndex += 1
*Reference: feature: Paragraph.delete()
The package comes with an example program that extracts the text.
That said, I think your problem springs from the fact that you are trying to operate on paragraphs, but the separation between paragraphs is where the newlines are happening. So even if you replace a paragraph with the empty string (''), there will still be a newline added to the end of it.
You should either take the approach of the example program, and do your own formatting, or you should make sure that you delete any spurious "empty" paragraphs that might be between the "full" ones you have ("Hello", "", "There") -> ("Hello", "There").
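A minimal sketch of that first approach, assuming the goal is one continuous string and a file name of input.docx (python-docx exposes each paragraph's text via .text):

from docx import Document

doc = Document('input.docx')  # file name assumed
# join the non-empty paragraph texts with spaces, dropping the
# newlines that would otherwise separate paragraphs
text = ' '.join(p.text for p in doc.paragraphs if p.text.strip())
print(text)  # e.g. 'Hello There'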
Since readlines() works on any text file, you can read all the lines first and then rewrite the file, keeping only the lines you want:

# example: read everything first, then reopen for writing,
# because opening with "w" truncates the file immediately
with open("file name") as f:
    lines = f.readlines()
with open("file name", "w") as f:
    for line in lines:
        if line.strip() != '':
            f.write(line)

Python Re-ordering the lines in a dat file by string

Sorry if this is a repeat but I can't find it for now.
Basically I am opening and reading a dat file which contains a load of paths that I need to loop through to get certain information.
Each of the lines in the base.dat file contains m.somenumber. For example some lines in the file might be:
Volumes/hard_disc/u14_cut//u14m12.40_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m12.50_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m11.40_all.beta/beta8
I need to be able to re-write the dat file so that all the lines are re-ordered from the largest m.number to the smallest m.number. Then when I loop through PATH in database (shown in code) I am looping through in decreasing m.
Here is the relevant part of the code
import os
import re
import numpy as np

base = open('base8.dat', 'r')
database = base.read().splitlines()
base.close()

counter = 0
mu_list = np.array([])
delta_list = np.array([])
offset = 0.00136
beta = 0
for PATH in database:
    if os.path.exists(str(PATH) + '/CHI/optimal_spectral_function_CHI.dat'):
        n1_array = np.loadtxt(str(PATH) + '/AVERAGES/av-err.n.dat')
        n7_array = np.loadtxt(str(PATH) + '/AVERAGES/av-err.npx.dat')
        n1_mean = n1_array[0]
        delta = round(float(5.0 + offset - (n1_array[0]*2. + 4.*n7_array[0])), 6)
        par = open(str(PATH) + "/params10", "r")
        for line in par:
            counter = counter + 1
            if re.match("mu", line):
                mioMU = re.findall('\d+', line.translate(None, ';'))
                mioMU2 = line.split()[2][:-1]
                mu = mioMU2
                print mu, delta, PATH
                mu_list = np.append(mu_list, mu)
                delta_list = np.append(delta_list, delta)
optimal_counter = 0
print delta_list, mu_list
I have checked the possible flagged duplicate, but I can't get it to work for my case because my file doesn't technically contain separate strings and numbers; the 'number' I need to sort by is embedded in the string as a whole:
Volumes/data_disc/u14_cut/from_met/u14m11.40_all.beta/beta16
and I need to sort the entire line by just the m(somenumber) part
Assuming that the number part of your line has the form of a float, you can use a regular expression to match that part and convert it from string to float.
After that you can use this information to sort all the lines read from your file. I added an invalid line to show how invalid data is handled.
As a quick example I would suggest something like this:
import re

# TODO: Read file and get list of lines
l = ['Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8']

regex = r'^.+\*{2}m{1}(?P<criterion>[0-9\.]*)\*{2}.+$'
p = re.compile(regex)

criterion_list = []
for s in l:
    m = p.match(s)
    if m:
        crit = m.group('criterion')
        try:
            crit = float(crit)
        except Exception as e:
            crit = 0
    else:
        crit = 0
    criterion_list.append(crit)

tuples_list = list(zip(criterion_list, l))
output = [element[1] for element in sorted(tuples_list, key=lambda t: t[0])]
print(output)
# TODO: Write output to new file or overwrite existing one.
Giving:
['Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8']
This snippet starts after all lines have been read from the file and stored in a list (called l here). The regex group criterion captures the float part contained in **m12.50**, as you can see on regex101. Iterating through all the lines then gives you a new list containing all matched groups as floats. If the regex does not match a given string, or casting the group to a float fails, crit is set to zero so that those invalid lines end up at the very beginning of the sorted list.
After that, zip() is used to get a list of tuples containing the extracted floats and the corresponding strings. Now you can sort this list of tuples on each tuple's first element and write the corresponding strings to a new list, output.
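Since the question asks for largest-to-smallest, passing reverse=True to sorted() flips the order; a one-line variant using the same names as above:

output = [s for _, s in sorted(tuples_list, key=lambda t: t[0], reverse=True)]

Note that the invalid lines (crit of 0) then land at the end instead of the beginning.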

python regex unicode - extracting data from an utf-8 file

I am facing difficulties extracting data from a UTF-8 file that contains Chinese characters.
The file is actually the CEDICT (Chinese-English dictionary) and looks like this:
賓 宾 [bin1] /visitor/guest/object (in grammar)/
賓主 宾主 [bin1 zhu3] /host and guest/
賓利 宾利 [Bin1 li4] /Bentley/
賓士 宾士 [Bin1 shi4] /Taiwan equivalent of 奔馳|奔驰[Ben1 chi2]/
賓夕法尼亞 宾夕法尼亚 [Bin1 xi1 fa3 ni2 ya4] /Pennsylvania/
賓夕法尼亞大學 宾夕法尼亚大学 [Bin1 xi1 fa3 ni2 ya4 Da4 xue2] /University of Pennsylvania/
賓夕法尼亞州 宾夕法尼亚州 [Bin1 xi1 fa3 ni2 ya4 zhou1] /Pennsylvania/
So far, I have managed to get the first two fields using split(), but I can't work out how to extract the other two fields (say, for the second line, "bin1 zhu3" and "host and guest"). I have been trying to use a regex but it doesn't work, for a reason I can't figure out.
#!/bin/python
#coding=utf-8
import re

class REMatcher(object):
    def __init__(self, matchstring):
        self.matchstring = matchstring

    def match(self, regexp):
        self.rematch = re.match(regexp, self.matchstring)
        return bool(self.rematch)

    def group(self, i):
        return self.rematch.group(i)

def look(character):
    myFile = open("/home/quentin/cedict_ts.u8", "r")
    for line in myFile:
        line = line.rstrip()
        elements = line.split(" ")
        try:
            if line != "" and elements[1] == character:
                myFile.close()
                return line
        except:
            myFile.close()
            break
    myFile.close()
    return "Aucun résultat :("

translation = look("賓主")  # translation contains one line of the file
elements = translation.split()
traditionnal = elements[0]
simplified = elements[1]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified

m = REMatcher(translation)
tr = ""
if m.match(r"\[(\w+)\]"):
    tr = m.group(1)
print "Pronouciation:" + tr
Any help appreciated.
This builds a dictionary to look up translations by either simplified or traditional characters and works in both Python 2.7 and 3.3:
# coding: utf8
import re
import codecs

# Process the whole file decoding from UTF-8 to Unicode
with codecs.open('cedict_ts.u8', encoding='utf8') as datafile:
    D = {}
    for line in datafile:
        # Skip comment lines
        if line.startswith('#'):
            continue
        trad, simp, pinyin, trans = re.match(r'(.*?) (.*?) \[(.*?)\] /(.*)/', line).groups()
        D[trad] = (simp, pinyin, trans)
        D[simp] = (trad, pinyin, trans)
Output (Python 3.3):
>>> D['马克']
('馬克', 'Ma3 ke4', 'Mark (name)')
>>> D['一路顺风']
('一路順風', 'yi1 lu4 shun4 feng1', 'to have a pleasant journey (idiom)')
>>> D['馬克']
('马克', 'Ma3 ke4', 'Mark (name)')
Output (Python 2.7, you have to print strings to see non-ASCII characters):
>>> D[u'马克']
(u'\u99ac\u514b', u'Ma3 ke4', u'Mark (name)')
>>> print D[u'马克'][0]
馬克
I would continue to use splits instead of regular expressions, with the maximum split number given. It depends on how consistent the format of the input file is.
elements = translation.split(' ',2)
traditionnal = elements[0]
simplified = elements[1]
rest = elements[2]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
elems = rest.split(']')
tr = elems[0].strip('[')
print "Pronouciation:" + tr
Output:
Traditionnal:賓主
Simplified:宾主
Pronouciation:bin1 zhu3
EDIT: To split the last field into a list, split on the /:
translations = elems[1].strip().strip('/').split('/')
#strip the spaces, then the first and last slash,
#then split on the slashes
Output (for the first line of input):
['visitor', 'guest', 'object (in grammar)']
Heh, I've done this exact same thing before. Basically you just need to use a regex with groupings. Unfortunately, I don't know Python regex super well (I did the same thing using C#), but you should really do something like this:
matcher = r"(\b\w+\b) (\b\w+\b) \[(.*?)\] /(.*?)/"
Basically you match the entire line using one expression, but then you use ( ) to separate each item into a regex group. Then you just need to read the groups and voila! (Note the r prefix: without it, \b is a backspace character rather than a word boundary.)
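A direct Python rendering of that idea might look like the sketch below. One adjustment to the pattern above: \w+ would not capture the pinyin field (it contains spaces and digits), so \S+ and a lazy .*? are used instead:

# -*- coding: utf-8 -*-
import re

line = u'賓主 宾主 [bin1 zhu3] /host and guest/'
m = re.match(r'(\S+) (\S+) \[(.*?)\] /(.*)/', line)
if m:
    trad, simp, pinyin, trans = m.groups()
    print(pinyin)  # bin1 zhu3
    print(trans)   # host and guest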

Analysing a text file in Python

I have a text file that needs to be analysed. Each line in the file is of this form:
7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3
I need to skip the timestamp and the (slbfd) and only keep a count of the lines with the IN and OUT. Further, depending on the name in quotes, I need to increase a variable count for different variables if a line starts with OUT and decrease the variable count otherwise. How would I go about doing this in Python?
The other answers with regex and splitting the line will get the job done, but if you want a fully maintainable solution that will grow with you, you should build a grammar. I love pyparsing for this:
S ='''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3'''
from pyparsing import *
from collections import defaultdict
# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")
line = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))
# Now parsing is a piece of cake!
P = grammar.parseString(S)
counts = defaultdict(int)
for x in P:
if x.flag=="IN": counts[x.name] += 1
if x.flag=="OUT": counts[x.name] -= 1
for key in counts:
print key, counts[key]
This gives as output:
lq_viz_server 1
OFM32 -1
Which would look more impressive if your sample log file were longer. The beauty of a pyparsing solution is the ability to adapt to a more complex query in the future (e.g. grab and parse the timestamp, pull the email address, parse error codes...). The idea is that you write the grammar independent of the query: you simply convert the raw text to a computer-friendly format, abstracting the parsing implementation away from its usage.
If the file is divided into lines (I don't know whether that's true), you can apply the split() function to each line. You will get this:
['7:06:32', '(slbfd)', 'IN:', '"lq_viz_server"', 'aqeela#nabltas1']
And then you can apply whatever logic you need by comparing the values.
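For instance, a sketch based on the field positions shown above:

parts = line.split()
flag = parts[2].rstrip(':')  # 'IN', 'OUT' or 'UNSUPPORTED'
name = parts[3].strip('"')   # e.g. 'lq_viz_server'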
I made some wild assumptions about your specification; here is some sample code to help you start:
objects = {}
with open("data.txt") as data:
    for line in data:
        if "IN:" in line or "OUT:" in line:
            try:
                name = line.split("\"")[1]
            except IndexError:
                print("No double quoted name on line: {}".format(line))
                name = "PARSING_ERRORS"
            if "OUT:" in line:
                diff = 1
            else:
                diff = -1
            try:
                objects[name] += diff
            except KeyError:
                objects[name] = diff
print(objects)  # for debug only, not advisable to print a huge number of names
You have two options:
Use the .split() function of the string (as pointed out in the comments)
Use the re module for regular expressions.
I would suggest using the re module and creating a pattern with named groups.
Recipe (a sketch follows below):
first create a pattern with re.compile() containing named groups
do a for loop over the file to get the lines
use .match() of the created pattern object on each line
use .groupdict() of the returned match object to access your values of interest
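A minimal sketch of that recipe; the pattern, group names, and the sign convention (OUT increases the count, IN decreases it, following the question's wording) are my own:

import re

pat = re.compile(r'\d+:\d+:\d+ \(slbfd\) (?P<flag>IN|OUT): "(?P<name>\w+)"')
counts = {}
with open('logfile.txt') as f:  # file name assumed
    for line in f:
        m = pat.match(line)
        if m:
            d = m.groupdict()  # {'flag': ..., 'name': ...}
            counts[d['name']] = counts.get(d['name'], 0) + (1 if d['flag'] == 'OUT' else -1)
print(counts)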
In the mode of just get 'er done with the standard distribution, this works:
import re
from collections import Counter

# open your file as inF...
count = Counter()
for line in inF:
    match = re.match(r'\d+:\d+:\d+ \(slbfd\) (\w+): "(\w+)"', line)
    if match:
        if match.group(1) == 'IN': count[match.group(2)] += 1
        elif match.group(1) == 'OUT': count[match.group(2)] -= 1
print(count)
Prints:
Counter({'lq_viz_server': 1, 'OFM32': -1})
