Parsing blocks of text data with Python itertools.groupby

I'm trying to parse blocks of text in Python 2.7 using itertools.groupby.
The data has the following structure:
BEGIN IONS
TITLE=cmpd01_scan=23
RTINSECONDS=14.605
PEPMASS=694.299987792969 505975.375
CHARGE=2+
615.839727 1760.3752441406
628.788226 2857.6264648438
922.4323436 2458.0959472656
940.4432533 9105.5
END IONS
BEGIN IONS
TITLE=cmpd01_scan=24
RTINSECONDS=25.737
PEPMASS=694.299987792969 505975.375
CHARGE=2+
575.7636234 1891.1656494141
590.3553938 2133.4477539063
615.8339562 2433.4252929688
615.9032114 1784.0628662109
END IONS
I need to extract information from the lines beginning with "TITLE=", "PEPMASS=", and "CHARGE=".
The code I'm using is as follows:
import itertools
import re

data_file = 'Test.mgf'

def isa_group_separator(line):
    return line == 'END IONS\n'

regex_scan = re.compile(r'TITLE=')
regex_precmass = re.compile(r'PEPMASS=')
regex_charge = re.compile(r'CHARGE=')

with open(data_file) as f:
    for (key, group) in itertools.groupby(f, isa_group_separator):
        #print(key, list(group))
        if not key:
            precmass_match = filter(regex_precmass.search, group)
            print precmass_match
            scan_match = filter(regex_scan.search, group)
            print scan_match
            charge_match = filter(regex_charge.search, group)
            print charge_match
However, the output only picks up the "PEPMASS=" line, and if the 'scan_match' assignment is done before 'precmass_match', only the "TITLE=" line is printed:
> ['PEPMASS=694.299987792969 505975.375\n'] [] []
> ['PEPMASS=694.299987792969 505975.375\n'] [] []
Can someone point out what I'm doing wrong here?

The reason for this is that group is an iterator and it can only be consumed once: the first filter() call exhausts it, so the later filter() calls see nothing.
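For example, a quick illustration of iterator exhaustion (my own snippet, separate from the script below):
it = iter(['TITLE=a\n', 'PEPMASS=1 2\n', 'CHARGE=2+\n'])
print filter(lambda line: 'PEPMASS' in line, it)  # ['PEPMASS=1 2\n'] -- this consumes the iterator
print filter(lambda line: 'TITLE' in line, it)    # [] -- nothing left to filter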
Below is a modified script that does the job by materialising each group into a list first.
import itertools
import re

data_file = 'Test.mgf'

def isa_group_separator(line):
    return line == 'END IONS\n'

regex_scan = re.compile(r'TITLE=')
regex_precmass = re.compile(r'PEPMASS=')
regex_charge = re.compile(r'CHARGE=')

with open(data_file) as f:
    for (key, group) in itertools.groupby(f, isa_group_separator):
        if not key:
            g = list(group)  # materialise the group so it can be searched repeatedly
            precmass_match = filter(regex_precmass.search, g)
            print precmass_match
            scan_match = filter(regex_scan.search, g)
            print scan_match
            charge_match = filter(regex_charge.search, g)
            print charge_match

I might try to parse it this way (without using groupby):
import re
file = """\
BEGIN IONS
TITLE=cmpd01_scan=23
RTINSECONDS=14.605
PEPMASS=694.299987792969 505975.375
CHARGE=2+
615.839727 1760.3752441406
628.788226 2857.6264648438
922.4323436 2458.0959472656
940.4432533 9105.5
END IONS
BEGIN IONS
TITLE=cmpd01_scan=24
RTINSECONDS=25.737
PEPMASS=694.299987792969 505975.375
CHARGE=2+
575.7636234 1891.1656494141
590.3553938 2133.4477539063
615.8339562 2433.4252929688
615.9032114 1784.0628662109
END IONS""".splitlines()
pat = re.compile(r'(TITLE|PEPMASS|CHARGE)=(.+)')
data = []
for line in file:
    m = pat.match(line)
    if m is not None:
        if m.group(1) == 'TITLE':
            data.append([])
        data[-1].append(m.group(2))
print(data)
Prints:
[['cmpd01_scan=23', '694.299987792969 505975.375', '2+'], ['cmpd01_scan=24', '694.299987792969 505975.375', '2+']]
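A small variant of the same idea (mine, not part of the answer above) keeps each block as a dict keyed by field name instead of a plain list:
pat = re.compile(r'(TITLE|PEPMASS|CHARGE)=(.+)')
blocks = []
for line in file:                   # `file` is the list of lines defined above
    m = pat.match(line)
    if m is not None:
        if m.group(1) == 'TITLE':   # a TITLE line starts a new block
            blocks.append({})
        blocks[-1][m.group(1)] = m.group(2)
print(blocks)
which should give something like [{'TITLE': 'cmpd01_scan=23', 'PEPMASS': '694.299987792969 505975.375', 'CHARGE': '2+'}, ...].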

Related

Python retrieving data from a block of lines containing specific characters and appending relevant data into separate lines

I am trying to create a program which selects specific information from a bulk paste, extracts the relevant information, and then pastes that information into lines.
Here is some example data:
1. Track1 03:01
VOC:PersonA
LYR:LyrcistA
COM:ComposerA
ARR:ArrangerA
ARR:ArrangerB
2. Track2 04:18
VOC:PersonB
VOC:PersonC
LYR:LyrcistA
LYR:LyrcistC
COM:ComposerA
ARR:ArrangerA
I would like output where the relevant data for each track is grouped together on a single line, with a semicolon joining entries of the same type and " - " separating the different types:
LyrcistA - ComposerA - ArrangerA; ArrangerB
LyrcistA; LyrcistC - ComposerA - ArrangerA
I have not gotten very far despite my best efforts:
while True:
    YodobashiData = input("")
    SplitData = YodobashiData.splitlines()
which returns the following:
['1. Track1 03:01']
['VOC:PersonA ']
['LYR:LyrcistA']
['COM:ComposerA']
['ARR:ArrangerA']
['ARR:ArrangerB']
[]
['2. Track2 04:18']
['VOC:PersonB']
['VOC:PersonC']
['LYR:LyrcistA']
['LYR:LyrcistC']
['COM:ComposerA']
['ARR:ArrangerA']
Whilst I now have all the data in separate lists, I have no idea how to identify the lists I need and how to extract the information from them.
Also, it seems I need to have the while loop or else it will only return the first list and nothing else.
Here's a script that doesn't use regular expressions.
It assumes that header lines, and only the header lines, will always start with a digit, and that the overall structure of header line then credit lines is consistent. Empty lines are ignored.
Extraction and formatting of the track data are handled separately, so it's easier to change formats, or use the extracted data in other ways.
import collections
import unicodedata

data_from_question = """\
1. Track1 03:01
VOC:PersonA
LYR:LyrcistA
COM:ComposerA
ARR:ArrangerA
ARR:ArrangerB
2. Track2 04:18
VOC:PersonB
VOC:PersonC
LYR:LyrcistA
LYR:LyrcistC
COM:ComposerA
ARR:ArrangerA
"""

def prepare_data(data):
    # The "colons" in the credits lines are actually
    # "full width colons". Replace them (and other such characters)
    # with their normal width equivalents.
    # If full normalisation is undesirable then we could return
    # data.replace('\N{FULLWIDTH COLON}', ':')
    return unicodedata.normalize('NFKC', data)

def is_new_track(line):
    return line[0].isdigit()

def parse_track_header(line):
    id_, title, duration = line.split()
    return {'id': id_.rstrip('.'), 'title': title, 'duration': duration}

def get_credit(line):
    credit, _, name = line.partition(':')
    return credit.strip(), name.strip()

def format_track_heading(track):
    return 'id: {id} title: {title} length: {duration}'.format(**track)

def format_credits(track):
    order = ['ARR', 'COM', 'LYR', 'VOC']
    parts = ['; '.join(track[k]) for k in order]
    return ' - '.join(parts)

def get_data():
    # The data is expected to be a multiline string.
    return data_from_question

def parse_data(data):
    track = None
    for line in filter(None, data.splitlines()):
        if is_new_track(line):
            if track:
                yield track
            track = collections.defaultdict(list)
            header_data = parse_track_header(line)
            track.update(header_data)
        else:
            role, name = get_credit(line)
            track[role].append(name)
    yield track

def report(tracks):
    for track in tracks:
        print(format_track_heading(track))
        print(format_credits(track))
        print()

def main():
    data = get_data()
    prepared_data = prepare_data(data)
    tracks = parse_data(prepared_data)
    report(tracks)

main()
Output:
id: 1 title: Track1 length: 03:01
ArrangerA; ArrangerB - ComposerA - LyrcistA - PersonA
id: 2 title: Track2 length: 04:18
ArrangerA - ComposerA - LyrcistA; LyrcistC - PersonB; PersonC
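If only the lyricist, composer and arranger lines are wanted, as in the desired output in the question, format_credits can be trimmed accordingly (my variant, not part of the original answer):
def format_credits(track):
    # keep only the roles the question asks for, in the requested order
    order = ['LYR', 'COM', 'ARR']
    parts = ['; '.join(track[k]) for k in order]
    return ' - '.join(parts)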
Here's another take on an answer to your question:
data = """
1. Track1 03:01
VOC:PersonA
LYR:LyrcistA
COM:ComposerA
ARR:ArrangerA
ARR:ArrangerB
2. Track2 04:18
VOC:PersonB
VOC:PersonC
LYR:LyrcistA
LYR:LyrcistC
COM:ComposerA
ARR:ArrangerA"""
import re
import collections

# Regular expression to pull apart the headline of each entry
headlinePattern = re.compile(r"(\d+)\.\s+(.*?)\s+(\d\d:\d\d)")

def main():
    # break the data into lines
    lines = data.strip().split("\n")
    # while we have more lines...
    while lines:
        # The next line should be a title line
        line = lines.pop(0)
        m = headlinePattern.match(line)
        if not m:
            raise Exception("Unexpected data format")
        id = m.group(1)
        title = m.group(2)
        length = m.group(3)
        people = collections.defaultdict(list)
        # Now read person lines until we hit a blank line or the end of the list
        while lines:
            line = lines.pop(0)
            if not line:
                break
            # Break the line into label and name
            label, name = re.split(r"\W+", line, 1)
            # Add this entry to a map of lists, where the map's keys are the label and the
            # map's values are all the people who had that label
            people[label].append(name)
        # Now we have everything for one entry in the data. Print everything we got.
        print("id:", id, "title:", title, "length:", length)
        print(" - ".join(["; ".join(person) for person in people.values()]))
        # go on to the next entry...

main()
Result:
id: 1 title: Track1 length: 03:01
PersonA - LyrcistA - ComposerA - ArrangerA; ArrangerB
id: 2 title: Track2 length: 04:18
PersonB; PersonC - LyrcistA; LyrcistC - ComposerA - ArrangerA
You can just comment out the line that prints the headline info if you really only want the line with all of the people on it. Replace the built-in data with data = input("") if you want to read the data from a user prompt.
Assuming your data is in the format you specified in a file called tracks.txt, the following code should work:
import re

with open('tracks.txt') as fp:
    tracklines = fp.read().splitlines()

def split_tracks(lines):
    track = []
    all_tracks = []
    while True:
        try:
            if lines[0] != '':
                track.append(lines.pop(0))
            else:
                all_tracks.append(track)
                track = []
                lines.pop(0)
        except:
            all_tracks.append(track)
            return all_tracks

def gather_attrs(tracks):
    track_attrs = []
    for track in tracks:
        attrs = {}
        for line in track:
            match = re.match('([A-Z]{3}):', line)
            if match:
                attr = line[:3]
                val = line[4:].strip()
                try:
                    attrs[attr].append(val)
                except KeyError:
                    attrs[attr] = [val]
        track_attrs.append(attrs)
    return track_attrs

if __name__ == '__main__':
    tracks = split_tracks(tracklines)
    attrs = gather_attrs(tracks)
    for track in attrs:
        semicolons = map(lambda va: '; '.join(va), track.values())
        hyphens = ' - '.join(semicolons)
        print(hyphens)
The only thing you may have to change is the colon characters in your data - some of them are ASCII colons (:) and others are fullwidth Unicode colons (：, U+FF1A), which will break the regex.
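One quick way to neutralise that difference before matching (a small sketch of my own, not part of the answer above) is to NFKC-normalise each line, which folds the fullwidth colon into a plain ':':
import unicodedata

def normalize_line(line):
    # NFKC folds the fullwidth colon (U+FF1A) into an ASCII ':'
    return unicodedata.normalize('NFKC', line)

print(normalize_line(u'VOC\uff1aPersonA'))  # -> VOC:PersonA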
import re

list_ = data_.split('\n')  # here data_ is your data as one string
# match lines like "LYR:LyrcistA"; chr(65306) is the fullwidth colon '：'
regObj = re.compile(rf'[A-Za-z]+(:|{chr(65306)})[A-Za-z]+')
l = []
pre = ''  # the previous line's three-letter label
for i in list_:
    if regObj.findall(i):
        if i[:3] != 'VOC':
            if pre == i[:3]:
                l.append('; ')   # same label as before: join with a semicolon
            else:
                l.append(' - ')  # new label: join with " - "
            l.append(i[4:].strip())
        else:
            l.append(' => ')     # VOC lines act as track separators
        pre = i[:3]
track_list = list(map(lambda item: item.strip(' - '), filter(lambda item: item, ''.join(l).split(' => '))))
print(track_list)
Output (the list of results you want):
['LyrcistA - ComposerA - ArrangerA; ArrangerB', 'LyrcistA; LyrcistC - ComposerA - ArrangerA']

Python: put numbers into list

I have a text file that looks like this:
1 acatccacgg atgaaggaga ggagaaatgt ttcaaatcag ttctaacacg aaaaccaatt
61 ccaagaccaa gttatgaaat taccactaag cagcagtgaa agaactacat attgaagtca
121 gataagaaag caagctgaag agcaagcact gggcatcttt cttgaaaaaa gtaaggccca
181 agtaacagac tatcagattt ttttgcagtc tttgcattcc tactagatga ttcacagaga
241 agatagtcac atttatcatt cgaaaacatg aaagaattcc agtcagaact tgcatttggg
301 ggcatgtaag tctcaaggtt gtctttttgc caatgtgctg taacattatt gcactcagag
361 tgtactgctg acagccactg ttctgccgaa atgacagaaa atagggaaca
I am trying to read the text file and put the information into a dictionary like this: {1: [acatccacgg, atgaaggaga, ggagaaatgt, ttcaaatcag, ttctaacacg, aaaaccaatt], 61: ...}
I have no clue how to do this... I am really new to Python.
You can try this code:
f = open('test.txt', 'r')
mydictionary = {}
for x in f:
    temp = x.strip().split(' ')
    mydictionary.update({temp[0]: temp[1:]})
f.close()
print(mydictionary)
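A small variant (my tweak, not from the answer above) that tolerates runs of whitespace and skips blank lines by calling split() with no argument:
mydictionary = {}
with open('test.txt') as f:    # 'test.txt' is assumed to hold the numbered lines from the question
    for x in f:
        parts = x.split()      # split on any whitespace and drop empty fields
        if parts:
            mydictionary[parts[0]] = parts[1:]
print(mydictionary)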
This is a cleaner and more readable way to do it (just try it, and you will understand):
import re
from os.path import exists

def put_in_dict(directory: str):
    """Find the leading number in every line and use it as a key;
    the remaining character groups on that line become the key's value."""
    my_dict = {}
    pattern_digit = re.compile(r"\d+")
    pattern_char = re.compile(r"\w+")
    char = []
    if exists(directory):
        with open(f"{directory}") as file:
            all_text = file.read().strip()
            list_txt = all_text.splitlines()
            numbs = pattern_digit.findall(all_text)
            for num in range(len(list_txt)):
                char.append(pattern_char.findall(list_txt[num]))
                del char[num][0]  # drop the leading number from the word list
            for dict_set in range(len(numbs)):
                my_dict[numbs[dict_set]] = char[dict_set]
    return my_dict  # you could make it print(my_dict) too
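For example, assuming the numbered sequence lines from the question are saved to a file (the filename here is only an illustration):
print(put_in_dict("sequence.txt"))
# expected shape: {'1': ['acatccacgg', 'atgaaggaga', ...], '61': [...], ..., '361': [...]}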

Python - grouping words into sets of 3

I'm trying to create a multidimensional array that, for each word in a string, contains the word before it (blank if at the beginning of the string), the word itself, and the following word (blank if at the end of the string).
I've tried the following code:
def parse_group_words(text):
    groups = []
    words = re_sub("[^\w]", " ", text).split()
    number_words = len(words)
    for i in xrange(number_words):
        print i
        if i == 0:
            groups[i][0] = ""
            groups[i][1] = words[i]
            groups[i][2] = words[i+1]
        if i > 0 and i != number_words:
            groups[i][0] = words[i-1]
            groups[i][1] = words[i]
            groups[i][2] = words[i+1]
        if i == number_words:
            groups[i][0] = words[i-1]
            groups[i][1] = words[i]
            groups[i][2] = ""
    print groups

print parse_group_words("this is an example of text are you ready")
But I'm getting:
0
Traceback (most recent call last):
File "/home/akf/program.py", line 82, in <module>
print parse_group_words("this is an example of text are you ready")
File "/home/akf/program.py", line 69, in parse_group_words
groups[i][0] = ""
IndexError: list index out of range
Any idea how to fix this?
Here's a generic way to do it for arbitrarily sized windows, using Python collections and itertools:
import re
import collections
import itertools

def window(seq, n=3):
    d = collections.deque(maxlen=n)
    for x in itertools.chain(('', ), seq, ('', )):
        d.append(x)
        if len(d) >= n:
            yield tuple(d)

def windows(text, n=3):
    return list(window((x.group() for x in re.finditer(r'\w+', text)), n=n))
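For example, with the sentence from the question, I would expect:
print(windows("this is an example of text are you ready"))
# [('', 'this', 'is'), ('this', 'is', 'an'), ('is', 'an', 'example'),
#  ('an', 'example', 'of'), ('example', 'of', 'text'), ('of', 'text', 'are'),
#  ('text', 'are', 'you'), ('are', 'you', 'ready'), ('you', 'ready', '')]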
What about...:
import itertools, re

def parse_group_words(text):
    groups = []
    words = re.finditer(r'\w+', text)
    prv, cur, nxt = itertools.tee(words, 3)
    next(cur); next(nxt); next(nxt)
    for previous, current, thenext in itertools.izip(prv, cur, nxt):
        # in Py 3, use `zip` in lieu of itertools.izip
        groups.append([previous.group(0), current.group(0), thenext.group(0)])
    print(groups)

parse_group_words('tanto va la gatta al lardo che ci lascia')
This is almost what you require -- it emits:
[['tanto', 'va', 'la'], ['va', 'la', 'gatta'], ['la', 'gatta', 'al'], ['gatta', 'al', 'lardo'], ['al', 'lardo', 'che'], ['lardo', 'che', 'ci'], ['che', 'ci', 'lascia']]
...missing the last-required group ['ci', 'lascia', ''].
To fix it, just before the print, you could add:
groups.append([groups[-1][1], groups[-1][2], ''])
This feels like a mildly unsavory hack -- I can't easily find an elegant way to have this last group "just emerge" from the general logic of the rest of the function.

Python dictionary generation, too many variables to unpack

Trying to generate a dictionary from a list of data parsed from .csv files. Getting the error "too many values to unpack"; anyone got ideas on a fix?
There will be repeating keys/multiple values to append to each key.
I'm pretty new to Python and programming, so please add a short explanation of what went wrong and how to fix it.
Below the script is the data as it appears when res is printed.
#!/usr/bin/python
import csv
import pprint
pp = pprint.PrettyPrinter(indent=4)
import sys
import getopt

res = []

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("infile", metavar="CSV", nargs="+", type=str, help="data file")
args = parser.parse_args()

with open("out.csv", "wb") as f:
    output = csv.writer(f)
    for filename in args.infile:
        for line in csv.reader(open(filename)):
            for item in line[2:]:
                # to skip empty cells
                if not item.strip():
                    continue
                item = item.split(":")
                item[1] = item[1].rstrip("%")
                # print([line[1]+item[0],item[1]])
                res.append([line[1]+item[0], item[1]])
                # output.writerow([line[1]+item[0],item[1].rstrip("%")])

pp.pprint(res)

from collections import defaultdict
initial_list = [res]
d = defaultdict(list)
pp.pprint(d)
for k, v in initial_list:
    d[k].append(float(v))  # possibly `int(v)` ?
And the console output:
[ ['P1L', '2.04'],
['Q2R', '1.93'],
['V3I', '20.03'],
['V3M', '78.18'],
['V3S', '1.67'],
['T4L', '1.16'],
['T12N', '75.60'],
['T12S', '22.73'],
['K14E', '1.03'],
['K14R', '50.65'],
['I15*', '63.94'],
['I15V', '35.30'],
['G17A', '38.31'],
['Q18R', '38.43'],
['L19T', '98.62'],
['L24*', '2.18'],
['D25E', '1.87'],
['D25N', '2.17'],
['M36I', '99.76'],
['S37N', '97.23'],
['R41K', '99.03'],
['L63V', '99.42'],
['H69K', '99.30'],
['I72V', '5.76'],
['V82I', '98.70'],
['L89M', '98.49'],
['I93L', '99.64'],
['P4S', '99.09'],
['V35T', '99.26'],
['E36A', '98.23'],
['T39D', '98.78'],
['G45R', '3.11'],
['S48T', '99.70'],
['V60I', '99.44'],
['K102R', '1.04'],
['K103N', '99.11'],
['G112E', '2.77'],
['D123N', '8.14'],
['D123S', '91.12'],
['I132M', '1.41'],
['K173A', '99.55'],
['Q174K', '99.68'],
['D177E', '98.95'],
['G190R', '2.56'],
['E194K', '2.54'],
['T200A', '99.28'],
['Q207E', '98.75'],
['R211K', '98.77'],
['W212*', '3.00'],
['L214F', '99.25'],
['V245E', '99.30'],
['E248D', '99.58'],
['D250E', '99.02'],
['T286A', '99.70'],
['K287R', '1.78'],
['E291D', '99.22'],
['V292I', '98.28'],
['I293V', '99.58'],
['V317A', '28.20'],
['L325V', '2.40'],
['G335D', '98.33'],
['F346S', '4.42'],
['N348I', '3.81'],
['R356K', '71.43'],
['M357I', '20.00'],
['M357T', '80.00']]
defaultdict(<type 'list'>, {})
Traceback (most recent call last):
File "test.py", line 40, in <module
for k, v in initial_list:
ValueError: too many values to unpack
You are wrapping the result in a list:
initial_list = [res]
and then try to iterate over that list; its only element is res itself (a long list of pairs), which cannot be unpacked into just k and v:
d = defaultdict(list)
pp.pprint(d)
for k, v in initial_list:
    d[k].append(float(v))  # possibly `int(v)` ?
You want to loop over res instead:
d = defaultdict(list)
for k, v in res:
    d[k].append(float(v))
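For instance, with the first few rows from the console dump above, a quick self-contained check (values copied from the question):
from collections import defaultdict

res = [['P1L', '2.04'], ['Q2R', '1.93'], ['V3I', '20.03'], ['V3M', '78.18']]
d = defaultdict(list)
for k, v in res:
    d[k].append(float(v))
print(dict(d))
# {'P1L': [2.04], 'Q2R': [1.93], 'V3I': [20.03], 'V3M': [78.18]}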
You can do all this in the CSV reading loop:
from collections import defaultdict

d = defaultdict(list)

with open("out.csv", "wb") as f:
    output = csv.writer(f)
    for filename in args.infile:
        for line in csv.reader(open(filename)):
            for item in line[2:]:
                # to skip empty cells
                if not item.strip():
                    continue
                key, value = item.split(":", 1)
                value = value.rstrip("%")
                d[line[1] + key].append(float(value))

Learning Python: Store values in dict from stdout

How can I do the following in Python:
I have a command output that outputs this:
Datexxxx
Clientxxx
Timexxx
Datexxxx
Client2xxx
Timexxx
Datexxxx
Client3xxx
Timexxx
And I want to turn this into a dict like:
Client:(date,time), Client2:(date,time) ...
After reading the data into a string subject, you could do this:
import re

d = {}
for match in re.finditer(
        r"""(?mx)
        ^Date(.*)\r?\n
        Client\d*(.*)\r?\n
        Time(.*)""",
        subject):
    d[match.group(2)] = (match.group(1), match.group(3))
How about something like:
rows = {}
thisrow = []
for line in output.split('\n'):
    if line[:4].lower() == 'date':
        thisrow.append(line)
    elif line[:6].lower() == 'client':
        thisrow.append(line)
    elif line[:4].lower() == 'time':
        thisrow.append(line)
    elif line.strip() == '':
        rows[thisrow[1]] = (thisrow[0], thisrow[2])
        thisrow = []
print rows
Assumes a trailing newline, no spaces before lines, etc.
What about using a dict with tuples?
Create a dictionary and add the entries:
dict = {}
dict['Client'] = ('date1','time1')
dict['Client2'] = ('date2','time2')
Accessing the entries:
dict['Client']
>>> ('date1','time1')
