RegEx for capturing groups using dictionary key - python

I'm having trouble displaying the right named capture in my dictionary function. My program reads a .txt file and transforms its text into a dictionary. I already have a regex that captures the parts I want.
Here is my File.txt:
file Science/Chemistry/Quantum 444 1
file Marvel/CaptainAmerica 342 0
file DC/JusticeLeague/Superman 300 0
file Math 333 0
file Biology 224 1
The parts I want to display are the first path segment (e.g. Science) and the views number (e.g. 444).
This part of my code works:
rx = re.compile(r'file (?P<path>.*?)( |\/.*?)? (?P<views>\d+).+')
i = rx.match(data)  # 'data' is a line from the .txt file
x = (i.group(1), i.group(3))
print(x)
But since I'm turning the .txt into a dictionary, I couldn't figure out how to use group(1) or group(3) as keys that my display function can look up. I don't know how to make those groups display when I use print("Title: %s | Number: %s" % (key[1], key[3])). I hope someone can help me implement that in my dictionary function.
Here is my dictionary function:
def create_dict(data):
    dictionary = {}
    for line in data:
        line_pattern = re.findall(r'file (?P<path>.*?)( |\/.*?)? (?P<views>\d+).+', line)
        dictionary[line] = line_pattern
        content = dictionary[line]
        print(content)
    return dictionary
I'm trying to make my output look like this from my text file:
Science 444
Marvel 342
DC 300
Math 333
Biology 224

You may create and populate a dictionary with your file data using
import re

def create_dict(data):
    dictionary = {}
    for line in data:
        m = re.search(r'file\s+([^/\s]*)\D*(\d+)', line)
        if m:
            # Group 1 is the first path segment, group 2 the views number
            dictionary[m.group(1)] = m.group(2)
    return dictionary
Basically, it does the following:
Initializes an empty dictionary named dictionary
Reads data line by line
Searches for a file\s+([^/\s]*)\D*(\d+) match, and if there is a match, the two capturing group values are used to form a dictionary key-value pair.
The regex I suggest is
file\s+([^/\s]*)\D*(\d+)
It matches file, then one or more whitespace characters, then captures everything up to the first / or whitespace into group 1, skips any non-digit characters, and captures the next run of digits into group 2.
Then, you may use it like
res = {}
with open(filepath, 'r') as f:
    res = create_dict(f)
print(res)

You already used named groups in your line_pattern; simply put them into your dictionary. re.findall would not work here, use re.search instead. Also, the backslash escape before / is redundant. Thus your dictionary function would be:
def create_dict(data):
    dictionary = {}
    for line in data:
        line_pattern = re.search(r'file (?P<path>.*?)( |/.*?)? (?P<views>\d+).+', line)
        if line_pattern:  # guard against lines that do not match
            dictionary[line_pattern.group('path')] = line_pattern.group('views')
            content = dictionary[line_pattern.group('path')]
            print(content)
    return dictionary

This regex might also help you divide your inputs into four groups, where group 2 and group 4 are the target groups that can simply be extracted and joined with a space:
(file\s)([A-Za-z]+(?=\/|\s))(.*)(\d{3})
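A minimal sketch of how that regex might be applied, assuming the File.txt format from the question (note that \d{3} presumes the counts are exactly three digits, as in the sample data):

import re

rx = re.compile(r'(file\s)([A-Za-z]+(?=\/|\s))(.*)(\d{3})')
with open('File.txt') as f:
    for line in f:
        m = rx.search(line)
        if m:
            # Group 2 is the title, group 4 the number
            print('%s %s' % (m.group(2), m.group(4)))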

Related

In Python, how to match a string to a dictionary item (like 'Bra*')

I'm a complete novice at Python, so please excuse me for asking something stupid.
From a text file, a dictionary is made to be used as a pass/block filter.
The text file contains addresses and either a block or allow directive, like "002029568,allow" or "0011*,allow" (without the quotes).
The search input is a string with a complete code like "001180000".
How can I evaluate whether the search item is in the dictionary, and make it match the "0011*,allow" line?
Thank you very much for your effort!
The filter-dictionary is made with:
def loadFilterDict(filename):
    global filterDict
    try:
        with open(filename, "r") as text_file:
            lines = text_file.readlines()
            for s in lines:
                fields = s.split(',')
                if len(fields) == 2:
                    filterDict[fields[0]] = fields[1].strip()
            text_file.close()
    except:
        pass
Check if the code (ccode) is in the dictionary:
if ccode in filterDict:
    if filterDict[ccode] in ['block']:
        continue
else:
    if filterstat in ['block']:
        continue
The filters-file is like:
002029568,allow
000923993,allow
0011*, allow
If you can use re, you don't have to worry about the wildcard; just let re.match do the hard work for you:
# Rules input (this could also be read from file)
lines = """002029568,allow
0011*,allow
001180001,block
"""

# Parse rules from string
rules = []
for line in lines.split("\n"):
    line = line.strip()
    if not line:
        continue
    identifier, ruling = line.split(",")
    rules += [(identifier, ruling)]

# Get rulings for specific number
def rule(number):
    from re import match
    rulings = []
    for identifier, ruling in rules:
        # Replace wildcard with regex .*
        identifier = identifier.replace("*", ".*")
        if match(identifier, number):
            rulings += [ruling]
    return rulings

print(rule("001180000"))
print(rule("001180001"))
Which prints:
['allow']
['allow', 'block']
The function will return a list of rulings, in the same order as they appear in your config lines. So you could simply pick the last or the first ruling, whichever is the one you're interested in.
Or break the loop prematurely if you can assume that no two rulings will interfere.
Examples:
001180000 is matched by 0011*,allow only, so the only ruling which applies is allow.
001180001 is matched by 0011*,allow at first, so you'll get allow as before. However, it is also matched by 001180001,block, so a block will get added to the rulings, too.
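If, say, the last matching rule should win (an assumption about your precedence; adjust as needed), the final decision can be reduced like this:

rulings = rule("001180001")
decision = rulings[-1] if rulings else "allow"  # default when no rule matches (assumption)
print(decision)  # block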
If the wildcard entries in the file have a fixed length (for example, you only need to support lines like 0011*,allow and not 00110*,allow or 0*,allow or any other arbitrary number of digits followed by *) you can use a nested dictionary, where the outer keys are the known parts of the wildcarded entries.
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
Then when you parse the file and get to the line 0011*,allow you do not need to do any matching. All you have to do is check if '0011' is present. Crude example:
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
if prefix in d:
    # There is a "match"; you can then deal with all the entries that match,
    # in this case the items in the inner dictionary
    # {'001180000': 'value', '001180001': 'value'}
    print('match')
else:
    print('no match')
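The lookup direction works the same way; a sketch assuming the wildcarded part is always four digits followed by * (sets stand in for the parsed dictionary):

allow_prefixes = {'0011'}                  # from lines like '0011*,allow'
allow_exact = {'002029568', '000923993'}   # from the exact lines

code = '001180000'
if code in allow_exact or code[:4] in allow_prefixes:
    print('allow')
else:
    print('block')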
If you do need to support arbitrary lengths of wildcarded entries, you will have to resort to a loop iterating over the dictionary (thereby defeating the point of using a dictionary to begin with):
d = {'001180000': 'value', '001180001': 'value'}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
for k, v in d.items():
    if k.startswith(prefix):
        # Found a matching key-value pair
        print(k, v)

Finding a pattern multiple times between start and end patterns python regex

I am trying to find a certain pattern between start and end patterns across multiple lines. Here is what I mean:
I read a file and saved it in the variable File; this is what the original file looks like:
File:
...
...
...
Keyword some_header_file {
XYZ g1234567S7894561_some_other_trash_underscored_text;
XYZ g1122334S9315919_different_other_trash_underscored_text;
}
...
...
...
I am trying to grab the 1234567 between the g and the S, and also the 1122334. The some_header_file block can be any number of lines but always ends with }.
So I am trying to grab exactly 7 digits between the g and the S, for all the lines from the "Keyword" till the "}" of that specific header.
This is what I used:
FirstSevenDigitPart = str(re.findall(r"Keyword\s%s.*\n.*XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}.*\}"%variable , str(File) , flags=re.MULTILINE))
but unfortunately it does not return anything, just a blank [].
What am I doing wrong? How can I accomplish this?
Thanks in advance.
You may read your file into a contents variable and use
import re

contents = "...\n...\n...\nKeyword some_header_file {\n XYZ g1234567S7894561_some_other_trash_underscored_text;\n XYZ g1122334S9315919_different_other_trash_underscored_text;\n}\n...\n...\n..."
results = []
variable = 'some_header_file'
block_rx = r'Keyword\s+{}\s*{{([^{{}}]*)}}'.format(re.escape(variable))
value_rx = r'XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}'
for block in re.findall(block_rx, contents):
    results.extend(re.findall(value_rx, block))
print(results)
# => ['1234567', '1122334']
The first regex (block_rx) will look like Keyword\s+some_header_file\s*{([^{}]*)} and will match all the blocks you need to search for values in. The second regex, XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}, matches what you need and returns the list of captures.
I think that the simplest way here is to use two expressions and run them in two steps. Here is a little example; of course you should adapt it to your needs.
import re

text = """Keyword some_header_file {
 XYZ g1234567S7894561_some_other_trash_underscored_text;
 XYZ g1122334S9315919_different_other_trash_underscored_text;
}"""

all_lines_pattern = r'Keyword\s*%s\s*\{\n(?P<all_lines>(.|\s)*)\}'
first_match = re.match(all_lines_pattern % 'some_header_file', text)
if first_match is None:
    # some break logic here
    pass

found_lines = first_match.group(1)
print(found_lines)  # ' XYZ g1234567S7894561_some_other_trash_underscored_text;\n XYZ g1122334S9315919_different_other_trash_underscored_text;\n'

sub_pattern = r'(XYZ\s*[gd](?P<your_pattern>[0-9]{7})[A-Z]).*;'
found_groups = re.findall(sub_pattern, found_lines)
print(found_groups)  # [('XYZ g1234567S', '1234567'), ('XYZ g1122334S', '1122334')]
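Since re.findall returns one tuple per match when the pattern contains several groups, the seven-digit parts can be pulled out of each tuple's second element:

digits = [g[1] for g in found_groups]
print(digits)  # ['1234567', '1122334']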

Finding value in a text file

I have to find a value of a variable in a text file.
My file is in this form:
AccountingStoragePort=8544
AccountingStorageUser=root
Name=touktouk Cards=45 Files=2015 State=UNKNOWN
How do I obtain, for example, the value of Files (thus 2015)?
In this file there is either one "variable" per line or several separated by a space. Each value is unique (Files does not appear twice in my file).
On the other hand, "Files" can be found on any line, e.g. line 3 or line 150.
I cannot modify this file to be compatible with ConfigParser.
My conf file has comments that start with "#"; I must exclude these lines from my search.
To exclude comment lines I tried this, but it does not work (the lookahead tests the position right before Files, where there is never a "#", so it excludes nothing):
match = re.search(r"(?!#\s*)\bFiles=(\d+)", s)
I found a problem: if my file contains blank lines, your code does not apply the comment restriction. For example:
import re

s = '''
AccountingStoragePort=8544
AccountingStorageUser=root
#Name=touktouk Cards=45 Files=2015 State=UNKNOWN
Name=touktouk Cards=45 Files=2017 State=UNKNOWN
'''
match = re.search(r'^[^#].*\bFiles=(\d+)', s, flags=re.MULTILINE)
if match:
    value = match.group(1)
    print(value)
the program returns 2015, not 2017.
You can use re.search. If the value is a number and the variable occurs at most once, use:
import re

s = '''AccountingStoragePort=8544
AccountingStorageUser=root
Name=touktouk Cards=45 Files=2015 State=UNKNOWN'''
match = re.search(r'\bFiles=(\d+)', s)
if match:
    value = match.group(1)
    print(value)
A more general approach builds a dictionary of every key=value pair in the file:
import re

with open('file.txt') as f:
    kvlist = re.split(r'\s+', f.read().strip())
pairs = [kvpair.split('=') for kvpair in kvlist]
kvdict = {pair[0]: pair[1] for pair in pairs}
print(kvdict['Files'])
This should give you a complete dictionary for reuse on other values.
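If your file also contains the "#" comment lines mentioned in the question, a sketch of the same idea that filters them out first (assuming a comment always starts its line) could look like:

import re

with open('file.txt') as f:
    # Drop blank lines and lines whose first non-space character is '#'
    lines = [ln for ln in f if ln.strip() and not ln.lstrip().startswith('#')]
kvlist = re.split(r'\s+', ' '.join(lines).strip())
kvdict = dict(kv.split('=', 1) for kv in kvlist)
print(kvdict.get('Files'))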

Python Re-ordering the lines in a dat file by string

Sorry if this is a repeat but I can't find it for now.
Basically I am opening and reading a dat file which contains a load of paths that I need to loop through to get certain information.
Each of the lines in the base.dat file contains m.somenumber. For example some lines in the file might be:
Volumes/hard_disc/u14_cut//u14m12.40_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m12.50_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m11.40_all.beta/beta8
I need to be able to re-write the dat file so that all the lines are re-ordered from the largest m.number to the smallest m.number. Then when I loop through PATH in database (shown in code) I am looping through in decreasing m.
Here is the relevant part of the code
base = open('base8.dat', 'r')
database = base.read().splitlines()
base.close()

counter = 0
mu_list = np.array([])
delta_list = np.array([])
ofsset = 0.00136
beta = 0

for PATH in database:
    if os.path.exists(str(PATH)+'/CHI/optimal_spectral_function_CHI.dat'):
        n1_array = numpy.loadtxt(str(PATH)+'/AVERAGES/av-err.n.dat')
        n7_array = numpy.loadtxt(str(PATH)+'/AVERAGES/av-err.npx.dat')
        n1_mean = n1_array[0]
        delta = round(float(5.0+ofsset-(n1_array[0]*2.+4.*n7_array[0])), 6)
        par = open(str(PATH)+"/params10", "r")
        for line in par:
            counter = counter+1
            if re.match("mu", line):
                mioMU = re.findall('\d+', line.translate(None, ';'))
                mioMU2 = line.split()[2][:-1]
                mu = mioMU2
        print mu, delta, PATH
        mu_list = np.append(mu_list, mu)
        delta_list = np.append(delta_list, delta)

optimal_counter = 0
print delta_list, mu_list
I have checked the question flagged as a possible duplicate, but I can't seem to get it to work for mine, because my file doesn't technically contain strings and numbers; the 'number' I need to sort by is contained in the string as a whole:
Volumes/data_disc/u14_cut/from_met/u14m11.40_all.beta/beta16
and I need to sort the entire line by just the m(somenumber) part
Assuming that the number part of your line has the form of a float, you can use a regular expression to match that part and convert it from string to float.
After that you can use this information to sort all the lines read from your file. I added an invalid line in order to show how invalid data is handled.
As a quick example I would suggest something like this:
import re

# TODO: Read file and get list of lines
l = ['Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8']

regex = r'^.+\*{2}m{1}(?P<criterion>[0-9\.]*)\*{2}.+$'
p = re.compile(regex)

criterion_list = []
for s in l:
    m = p.match(s)
    if m:
        crit = m.group('criterion')
        try:
            crit = float(crit)
        except Exception as e:
            crit = 0
    else:
        crit = 0
    criterion_list.append(crit)

tuples_list = list(zip(criterion_list, l))
output = [element[1] for element in sorted(tuples_list, key=lambda t: t[0])]
print(output)

# TODO: Write output to new file or overwrite existing one.
Giving:
['Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8']
This snippet starts after all lines are read from the file and stored in a list (called l here). The regex group criterion catches the float part contained in **m12.50**, as you can see on regex101. So iterating through all the lines gives you a new list containing all matching groups as floats. If the regex does not match a given string, or casting the group to a float fails, crit is set to zero in order to place those invalid lines at the very beginning of the sorted list.
After that, zip() is used to get a list of tuples containing the extracted floats and the corresponding strings. Now you can sort this list of tuples on each tuple's first element and write the corresponding strings to a new list, output.
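The same ordering can also be written more compactly with sorted() and a key function, reusing l from the example above; a sketch of the identical idea (add reverse=True to go from largest to smallest m, as the question asks):

import re

def m_number(s, _p=re.compile(r'\*{2}m(?P<criterion>[0-9.]+)\*{2}')):
    # Assumes the starred part is a valid float; non-matching lines sort first
    m = _p.search(s)
    return float(m.group('criterion')) if m else 0

output = sorted(l, key=m_number)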

Parsing through sequence output- Python

I have this data from sequencing a bacterial community.
I know some basic Python and am in the midst of completing the codecademy tutorial.
For practical purposes, please think of OTU as another word for "species"
Here is an example of the raw data:
OTU ID OTU Sum Lineage
591820 1083 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
532752 517 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
218456 346 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__Bordetella; s__
590248 330 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__; s__
343284 321 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Limnohabitans; s__
The data includes three things: a reference number for the species, how many times that species is in the sample, and the taxonomy of said species.
What I'm trying to do is add up all the times a sequence is found for a taxonomic family (designated as f__x in the data).
Here is an example of the desired output:
f__Fusobacteriaceae 1600
f__Alcaligenaceae 676
f__Comamonadaceae 321
This isn't for a class. I started learning Python a few months ago, so I'm at least capable of looking up any suggestions. I know how it works out from doing it the slow way (copy & paste in Excel), so this is for future reference.
If the lines in your file really look like this, you can do
from collections import defaultdict
import re

nums = defaultdict(int)
with open("file.txt") as f:
    for line in f:
        items = line.split(None, 2)  # Split twice on any whitespace
        if items[0].isdigit():
            key = re.search(r"f__\w+", items[2]).group(0)
            nums[key] += int(items[1])
Result:
>>> nums
defaultdict(<type 'int'>, {'f__Comamonadaceae': 321, 'f__Fusobacteriaceae': 1600,
'f__Alcaligenaceae': 676})
Yet another solution, using collections.Counter:
from collections import Counter

counter = Counter()

with open('data.txt') as f:
    # Skip header line
    next(f)
    for line in f:
        # Strip line of extraneous whitespace
        line = line.strip()
        # Only process non-empty lines
        if line:
            # Split by consecutive whitespace, into 3 chunks (2 splits)
            otu_id, otu_sum, lineage = line.split(None, 2)
            # Split the lineage tree into a list of nodes
            lineage = [node.strip() for node in lineage.split(';')]
            # Extract family node (assuming there's only one)
            family = [node for node in lineage if node.startswith('f__')][0]
            # Increase count for this family by otu_sum
            counter[family] += int(otu_sum)

for family, count in counter.items():
    print("%s %s" % (family, count))
See the docs for str.split() for the details of the None argument (matching consecutive whitespace).
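A quick illustration of that behaviour:
>>> "591820   1083\tk__Bacteria; p__Fusobacteria".split(None, 2)
['591820', '1083', 'k__Bacteria; p__Fusobacteria']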
Get all your raw data and process it first; that is, structure it, and then use the structured data for whatever operations you desire.
If you have GBs of data, you can use Elasticsearch: feed it your raw data, query for your required string (f__* in this case), get all matching entries, and add them up.
That's very doable with basic Python. Keep the Library Reference under your pillow, as you'll want to refer to it often.
You'll likely end up doing something similar to this (I'll write it the longer, more readable way; there are ways to compress the code and do this quicker).
# Open up a file handle
file_handle = open('myfile.txt')

# Discard the header line
file_handle.readline()

# Make a dictionary to store sums
sums = {}

# Loop through the rest of the lines
for line in file_handle.readlines():
    # Strip off the pesky newline at the end of each line.
    line = line.strip()
    # Put each white-space delimited ... whatever ... into items of a list.
    line_parts = line.split()
    # Get the first column
    reference_number = line_parts[0]
    # Get the second column, convert it to an integer
    sum = int(line_parts[1])
    # Loop through the taxonomies (the rest of the 'columns' separated by whitespace)
    for taxonomy in line_parts[2:]:
        # Skip it if it doesn't start with 'f_'
        if not taxonomy.startswith('f_'):
            continue
        # Remove the pesky semicolon
        taxonomy = taxonomy.strip(';')
        # has_key() is Python 2 only; use the 'in' operator instead
        if taxonomy in sums:
            sums[taxonomy] += int(sum)
        else:
            sums[taxonomy] = int(sum)

# All done, do some fancy reporting. We'll leave sorting as an exercise to the reader.
for taxonomy in sums.keys():
    print("%s %d" % (taxonomy, sums[taxonomy]))
