NumPy load file interspersed with headers - python

I'm trying to parse files with repeating blocks of the following format:
ITEM: TIMESTEP
5000
ITEM: NUMBER OF ATOMS
4200
ITEM: BOX BOUNDS pp pp ff
0 47.6892
0 41.3
-11.434 84.1378
ITEM: ATOMS id type z vx
5946 27 11.8569 0.00180946
5948 28 11.1848 -0.0286474
5172 27 12.1796 0.00202046
...
where ... stands for NUMBER OF ATOMS data entries (4200 for this particular file). Each file contains many of these blocks in succession, and the files range from 1 to 5 million lines.
I want to completely ignore all of the header data contained in the first 9 lines of each block and only need an array containing all of the "z" values (3rd column in a data entry) and an array containing the "vx" values (4th column in a data entry).
The headers for each block will always be the same within a file except for the number following the ITEM: TIMESTEP entry. The header format will remain the same across files and the files differ only in the number of entries (atoms).
I wrote some incredibly dirty code that did the trick for some shorter files I was working with previously but it's very slow for these files. I tried using the genfromtxt function but I haven't found a way to bend it to do what I want in this case. Any tips on making this faster?
EDIT:
The following worked for me:
grep -E '^[.-0123456789]+ [.-0123456789]+ [.-0123456789]+ [.-0123456789]'
As did this:
import re
import numpy as np

with open(data, 'r') as fh:
    wrapper = (i for i in fh if re.match(r'^[-.1234567890]+ [-.1234567890]+ [-.1234567890]+ [-.1234567890]', i))
    z_vx = np.genfromtxt(wrapper, usecols=(2, 3))
This ended up being the fastest for my case:
regexp = r'\d+\s+\d+\s+([0-9]*\.?[0-9]+)\s+([-+]?[0-9]*\.?[0-9]+)\s+\n'
data = np.fromregex(file_path, regexp, dtype=[('z', float), ('vx', float)])
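For completeness, the two wanted columns can then be read off the structured array by field name (z and vx being the field names defined in the dtype above):
z = data['z']    # all 3rd-column values
vx = data['vx']  # all 4th-column values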

If you want speed, you can grep only the relevant lines and then use np.genfromtxt().
Grep for something like this (you assumed the relevant rows have 4 fields of numbers, right?):
grep -P '^[-.0123456789]+ [-.0123456789]+ [-.0123456789]+ [-.0123456789]+$'
A more pythonic solution would be to wrap the file handle with a generator like this:
wrapper = (i for i in fh if re.match(r'^[-.1234567890]+ [-.1234567890]+ [-.1234567890]+ [-.1234567890]+$',i))
np.genfromtxt(wrapper,...)
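To get the two columns back as separate arrays, the elided arguments might be filled in along these lines (just a sketch, assuming the same 4-column data rows as above):
z, vx = np.genfromtxt(wrapper, usecols=(2, 3), unpack=True)  # unpack=True returns one array per requested column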

I had a similar problem. I ended up using sed to add a # in front of the header lines and then used np.loadtxt.
So the bash script was
for i in $( ls *.data )
do
b=`basename $i .data`
sed '1,9{/^#/!s/^/#/}' $i > $b.tmp
rm $i
mv $b.tmp $i
done
and then in Python
from numpy import loadtxt
data = loadtxt("atoms.data")
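Since '#' is loadtxt's default comment character, the commented-out header lines are skipped automatically; a small sketch (assuming z and vx are the 3rd and 4th columns, as in the question) to pull out just those two columns:
from numpy import loadtxt
# header lines now start with '#' and are skipped by default (comments='#');
# usecols picks the z and vx columns, unpack returns them as separate arrays
z, vx = loadtxt("atoms.data", usecols=(2, 3), unpack=True)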

Related

Reading irregular column data into python 3.X using pandas or numpy

Below is my piece of code.
import numpy as np
filename1=open(f)
xf = np.loadtxt(filename1, dtype=float)
Below is my data file.
0.14200E+02 0.18188E+01 0.44604E-03
0.14300E+02 0.18165E+01 0.45498E-03
0.14400E+02-0.17694E+01 0.44615E+03
0.14500E+02-0.17226E+01 0.43743E+03
0.14600E+02-0.16767E+01 0.42882E+03
0.14700E+02-0.16318E+01 0.42033E+03
0.14800E+02-0.15879E+01 0.41196E+03
As one can see, there are negative values that take up the space between two values, which causes numpy to give
ValueError: Wrong number of columns at line 3
This is just a small snippet of my code. I want to read this data using numpy or pandas. Any suggestion would be great.
Edit 1:
@ZarakiKenpachi I used your suggestion of sep=' |-' but it gives me an extra 4th column with NaN values.
Edit 2:
@Serge Ballesta nice suggestion, but all of these are some kind of pre-processing. I want some kind of built-in function to do this in pandas or numpy.
Edit 3:
Important note: there can also be a negative sign in the exponent, as in 0.4373E-03.
Thank-you
np.loadtxt can read from a (byte string) generator, so you can filter the input file while loading it to add an additional space before a minus sign:
...
def filter(fd):
    rx = re.compile(rb'(\d)-')        # a minus sign directly preceded by a digit
    for line in fd:
        yield rx.sub(rb'\1 -', line)  # keep the digit and insert a space before the minus
xf = np.loadtxt(filter(open(f, 'rb')), dtype=float)
This does not require preloading everything into memory, so it is expected to be memory efficient.
The regex is required to avoid changing something like 0.16545E-012.
In my tests with 10k lines, this should be at most 10% slower than loading everything into memory, but it will require far less memory.
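A quick check of the filter on a line combining the two tricky cases (an embedded minus and a negative exponent); io.BytesIO just stands in for the real binary file handle:
import io
sample = io.BytesIO(b'0.14400E+02-0.17694E+01 0.44615E-03\n')
print(list(filter(sample)))
# [b'0.14400E+02 -0.17694E+01 0.44615E-03\n']  -- the E-03 exponent is left untouched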
You can preprocess your data to add an additional space before your - signs. While there are many ways of doing it, the best approach in my opinion (in order to avoid adding whitespace at the start of a line) is using regex re.sub:
import io
import re
import numpy as np

with open(f) as file:
    raw_data = file.read()
processed_data = re.sub(r'(?<=\d)-', ' -', raw_data)  # keep the digit, insert a space before the minus
xf = np.loadtxt(io.StringIO(processed_data), dtype=float)
This replaces every - that is preceded by a digit with a space followed by -, leaving the digit in place.
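A quick check on a line with both an embedded minus and a negative exponent (the E-03 case from Edit 3 is left untouched, because 'E' is not a digit):
import re
print(re.sub(r'(?<=\d)-', ' -', '0.14400E+02-0.17694E+01 0.44615E-03'))
# 0.14400E+02 -0.17694E+01 0.44615E-03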
Try the below code:
import re
import numpy as np

with open('app.txt') as f:
    data = f.read()

data_mod = []
for number in data.split('\n')[:-1]:
    num = re.findall(r'[\w\.-]+-[\w\.-]', number)
    if num:  # add a space before every '-' on lines where the pattern is found
        number = number.replace('-', ' -')
    data_mod.append(number)

with open('mod_text.txt', 'w') as f:
    for data in data_mod:
        f.write(data + "\n")

filename1 = 'mod_text.txt'
xf = np.loadtxt(filename1, dtype=float)
Actually you have to pre-process the data using regex. After that you can load the data as you require.
I hope this helps.

2 CSV files. Executed before and after migration. Want to compare count against threshold

I have 2 csv files like:
format
REPORT_NUM,EXEC,REPORT_NAME,REPORT_COUNT
before.csv
1,1,"Report 1",45
2,1,"Report 2",456
3,1,"Report 3",11
4,1,"Report 4",0
after.csv
1,1,"Report 1",47
2,1,"Report 2",456556
3,1,"Report 3",0
4,1,"Report 4",212
I basically need, for each REPORT_NUM, to compare REPORT_COUNT and then output a 3rd CSV with REPORT_NAME, the before REPORT_COUNT and the after REPORT_COUNT whenever there's a threshold cross (when the after count is more than 10% different to the before count). EXEC is just an execution run.
So result.csv might show:
2,1,"Report 2",456,456556
3,1,"Report 3",11,0
4,1,"Report 4",0,212
I am looking at the following for inspiration:
Comparing values between 2 CSV files and writing to a 3rd CSV file
Python: Comparing two CSV files and searching for similar items
I continue to search, but any feedback is appreciated.
Thank you in advance!
P.S. I am assuming Python is best; I don't mind what language, but I only have a basic Python understanding. I started writing this in bash using "diff" and "sed", so I may go that route.
Based on the 2 links you gave:
import csv

with open('before.csv', 'r') as before:
    before_indices = dict((i[2], i[3]) for i in csv.reader(before))

with open('after.csv', 'r') as reportAfter:
    with open('results.csv', 'w') as results:
        reader = csv.reader(reportAfter)
        writer = csv.writer(results, quoting=csv.QUOTE_NONNUMERIC)
        for row in reader:
            value = before_indices.get(row[2])
            if float(row[3]) > 1.1 * float(value) or float(row[3]) < 0.9 * float(value):
                writer.writerow([int(row[0]), int(row[1]), row[2], int(value), int(row[3])])
This produces your desired output given your example input on Linux. On Windows you need to change it according to Python3: writing csv files. If you have non-integer numbers you may want to change the int() calls in the last line to float() to keep decimals.
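One assumption in this code: every report name in after.csv also appears in before.csv. If that is not guaranteed, value will be None and float(value) will raise a TypeError; a minimal guard (my suggestion, not part of the code above) inside the loop would be:
value = before_indices.get(row[2])
if value is None:
    continue  # the report only exists in after.csv; skip it (or handle it separately)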

How to make searching a string in text files quicker

I want to search a list of strings (anywhere from 2k up to 10k strings in the list) in thousands of text files (there may be as many as 100k text files, each ranging in size from 1 KB to 100 MB) saved in a folder, and output a CSV file with the matched text filenames.
I have developed code that does the required job, but it takes around 8-9 hours for 2000 strings to search in around 2000 text files totalling ~2.5 GB in size.
Also, this method consumes the system's memory, so I sometimes need to split the 2000 text files into smaller batches for the code to run.
The code is below (Python 2.7).
# -*- coding: utf-8 -*-
import pandas as pd
import os

def match(searchterm):
    global result
    filenameText = ''
    matchrateText = ''
    for i, content in enumerate(TextContent):
        matchrate = search(searchterm, content)
        if matchrate:
            filenameText += str(listoftxtfiles[i]) + ";"
            matchrateText += str(matchrate) + ";"
    result.append([searchterm, filenameText, matchrateText])

def search(searchterm, content):
    if searchterm.lower() in content.lower():
        return 100
    else:
        return 0

listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
    with open("Txt/" + txt, 'r') as txtfile:
        TextContent.append(txtfile.read())

result = []
# searchlist is the list of strings to look for (see sample input below)
for i, searchterm in enumerate(searchlist):
    print("Checking for " + str(i + 1) + " of " + str(len(searchlist)))
    match(searchterm)

df = pd.DataFrame(result, columns=["String", "Filename", "Hit%"])
Sample Input below.
List of strings -
["Blue Chip", "JP Morgan Global Healthcare","Maximum Horizon","1838 Large Cornerstone"]
Text file -
Usual text file containing different lines separated by \n
Sample Output below.
String,Filename,Hit%
JP Morgan Global Healthcare,000032.txt;000031.txt;000029.txt;000015.txt;,100;100;100;100;
Blue Chip,000116.txt;000126.txt;000114.txt;,100;100;100;
1838 Large Cornerstone,NA,NA
Maximum Horizon,000116.txt;000126.txt;000114.txt;,100;100;100;
As in the example above, the first string was matched in 4 files (separated by ;), the second string was matched in 3 files, and the third string was not matched in any of the files.
Is there a quicker way to search without any splitting of text files?
Your code pushes large amounts of data around in memory because you load all the files into memory and then search them.
Performance aside, your code could use some cleaning up. Try to write functions as autonomous as possible, without depending on global variables (for input or output).
I rewrote your code using list comprehensions and it became a lot more compact.
# -*- coding: utf-8 -*-
from os import listdir
from os.path import isfile

def search_strings_in_files(path_str, search_list):
    """ Returns a list of lists, where each inner list contains three fields:
    the filename (without path), a string in search_list and the
    frequency (number of occurrences) of that string in that file"""
    filelist = listdir(path_str)
    return [[filename, s, open(path_str + filename, 'r').read().lower().count(s)]
            for filename in filelist
            if isfile(path_str + filename)
            for s in [sl.lower() for sl in search_list]]

if __name__ == '__main__':
    print search_strings_in_files('/some/path/', ['some', 'strings', 'here'])
Mechanisms I use in this code:
list comprehension to loop through search_list and through the files.
compound statements to loop only through the files in a directory (and not through sub directories).
method chaining to directly call a method of an object that is returned.
Tip for reading the list comprehension: try reading it from bottom to top, so:
I convert all items in search_list to lower using list comprehension.
Then I loop over that list (for s in...)
Then I filter out the directory entries that are not files using a compound statement (if isfile...)
Then I loop over all files (for filename...)
In the top line, I create the sublist containing three items:
filename
s, that is the lower case search string
a method chained call to open the file, read all its contents, convert it to lowercase and count the number of occurrences of s.
This code uses all the power there is in "standard" Python functions. If you need more performance, you should look into specialised libraries for this task.
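If the result then needs to end up in a CSV like the one in the question, one possible follow-up (a sketch, assuming pandas is available; the aggregation into one row per string is left out here) is:
import pandas as pd
rows = search_strings_in_files('/some/path/', ['some', 'strings', 'here'])
df = pd.DataFrame(rows, columns=['Filename', 'String', 'Count'])
df.to_csv('matches.csv', index=False)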

Python: Issue with delimiter inside column data

This is not a duplicate of another question, as I do not want to drop the rows. The accepted answer in the aforementioned post is very different from this one, and not aimed at maintaining all the data.
Problem:
Delimiter inside column data from badly formatted csv-file
Tried solutions: csv module , shlex, StringIO (no working solution on SO)
Example data
Delimiters are inside the third data field, somewhere enclosed by (multiple) double-quotes:
08884624;6/4/2016;Network routing 21,5\"\" 4;8GHz1TB hddQwerty\"\";9999;resell:no;package:1;test
0085658;6/4/2016;Logic 111BLACK.compat: 29,46 cm (11.6\"\")deep: 4;06 cm height: 25;9 cm\"\";9999;resell:no;package:1;test
4235846;6/4/2016;Case Logic. compat: 39,624 cm (15.6\"\") deep: 3;05 cm height: 3 cm\"\";9999;resell:no;package:1;test
400015;6/4/2016;Cable\"\"Easy Cover\"\"\"\";1;5 m 30 Silver\"\";9999;resell:no;package:1;test
9791118;6/4/2016;Network routing 21,5\"\" (2013) 2;7GHz\"\";9999;resell:no;package:1;test
477000;6/4/2016;iGlaze. deep: 9,6 mm (67.378\"\") height: 14;13 cm\"\";9999;resell:no;package:1;test
4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88\"\"). Speed 10;100 Mbit/s\"\";9999;resell:no;package:1;test
Desired sample output
Fixed length of 7:
['08884624','6/4/2016', 'Network routing 21,5\" 4,8GHz1TB hddQwerty', '9999', 'resell:no', 'package:1', 'test']
Parsing through csv reader doesn't fix the problem (skipinitialspace is not the problem), shlex is no use and StringIO is also of no help...
My initial idea was to import row by row, and replace ';' element by element in row.
But the importing is the problem, as it splits on every ';'.
The data comes from a larger file with 300.000+ rows (not all the rows have this problem).
Any advice is welcome.
As you know the number of input fields, and as only one field is badly formatted, you can simply split on ; and then combine the middle fields back into a single one:
for line in file:
    temp_l = line.split(';')
    lst = temp_l[:2] + [';'.join(temp_l[2:-4])] + temp_l[-4:]  # lst should contain the expected 7 fields
I did not even try to process the double quotes, because I could not understand how you pass from Network routing 21,5\"\" 4;8GHz1TB hddQwerty\"\" to 'Network routing 21,5\" 4,8GHz1TB hddQwerty'...
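A quick check on the first sample line (just a sketch; it assumes the \"\" sequences appear in the file exactly as posted, and it keeps the embedded ; rather than turning it into a comma as in the desired output):
line = r'08884624;6/4/2016;Network routing 21,5\"\" 4;8GHz1TB hddQwerty\"\";9999;resell:no;package:1;test'
temp_l = line.split(';')
lst = temp_l[:2] + [';'.join(temp_l[2:-4])] + temp_l[-4:]
print(len(lst), lst[2])  # 7 fields; the third one still contains the embedded ';'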
You can use the standard csv module.
To achieve what you are trying to accomplish, just change the csv delimiter to ';'.
Test the following in the terminal:
import csv
test = ["4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88\"\"). Speed 10;100 Mbit/s\"\";9999;resell:no;package:1;test"]
delimited_colon = list(csv.reader(test, delimiter=";", skipinitialspace=True))

Faster way to parse file to array, compare to array in second file

I currently have an MGF file containing MS2 spectral data (QE_2706_229_sequest_high_conf.mgf). The file template is in the link below, as well as a snippet of example:
http://www.matrixscience.com/help/data_file_help.html
BEGIN IONS
TITLE=File3249 Spectrum10594 scans: 11084
PEPMASS=499.59366 927079.3
CHARGE=3+
RTINSECONDS=1710
SCANS=11084
104.053180 3866.360000
110.071530 178805.000000
111.068610 1869.210000
111.074780 10738.600000
112.087240 13117.900000
113.071150 7148.790000
114.102690 4146.490000
115.086840 11835.600000
116.070850 6230.980000
... ...
END IONS
This unannotated spectral file contains thousands of these entries, the total file size is ~150 MB.
I then have a series of text files which I need to parse. Each file is similar to the format above, with the first column being read into a numpy array. Then the unannotated spectra file is parsed for each entry until a matching array is found from the annotated text files input.
(Filename GRPGPVAGHHQMPR)
m/z i matches
104.05318 3866.4
110.07153 178805.4
111.06861 1869.2
111.07478 10738.6
112.08724 13117.9
113.07115 7148.8
114.10269 4146.5
115.08684 11835.6
116.07085 6231.0
Once a match is found, an MGF annotated file is written that then contains the full entry information in the unannotated file, but with a line that specifies the filename of the annotated text file that matched that particular entry. The output is below:
BEGIN IONS
SEQ=GRPGPVAGHHQMPR
TITLE=File3249 Spectrum10594 scans: 11084
PEPMASS=499.59366 927079.3
... ...
END IONS
There may be a much less computationally expensive way to parse this. Given 2,000 annotated files to search through, with the above large unannotated file, parsing currently takes ~12 hrs on a 2.6 GHz quad-core Intel Haswell CPU.
Here is the working code:
import numpy as np
import sys
from pyteomics import mgf
from glob import glob

def main():
    """
    Usage: python mgf_parser
    """
    pep_files = glob('*.txt')
    mgf_file = 'QE_2706_229_sequest_high_conf.mgf'
    process(mgf_file, pep_files)

def process(mgf_file, pep_files):
    """Parses spectra from annotated text file. Converts m/z values to numpy array.
    If spectra array matches entry in MGF file, writes annotated MGF file.
    """
    ann_arrays = {}
    for ann_spectra in pep_files:
        a = np.genfromtxt(ann_spectra, dtype=float, invalid_raise=False,
                          usemask=False, filling_values=0.0, usecols=(0))
        b = np.delete(a, 0)
        ann_arrays[ann_spectra] = b

    with mgf.read(mgf_file) as reader:
        for spectrum in reader:
            for ann_spectra, array in ann_arrays.iteritems():
                if np.array_equal(array, spectrum['m/z array']):
                    print '> Spectral match found for file {}.txt'.format(ann_spectra[:-4])
                    file_name = '{}.mgf'.format(ann_spectra[:-4])
                    spectrum['params']['seq'] = file_name[52:file_name.find('-') - 1]
                    mgf.write((spectrum,), file_name)

if __name__ == '__main__':
    main()
This was used to be able to only parse a given number of files at a time. Suggestions on any more efficient parsing methods?
I see room for improvement in the fact that you are parsing the whole MGF file repeatedly for each of the small files. If you refactor the code so that it is only parsed once, you may get a decent speedup.
Here's how I would tweak your code, simultaneously getting rid of the bash loop, and also using the mgf.write function, which is probably a bit slower than np.savetxt, but easier to use:
from pyteomics import mgf
import sys
import numpy as np

def process(mgf_file, pep_files):
    ann_arrays = {}
    for ann_spectra in pep_files:
        a = np.genfromtxt(ann_spectra, invalid_raise=False,
                          filling_values=0.0, usecols=(0,))
        b = np.delete(a, 0)
        ann_arrays[ann_spectra] = b

    with mgf.read(mgf_file) as reader:
        for spectrum in reader:
            for ann_spectra, array in ann_arrays.iteritems():
                if np.allclose(array, spectrum['m/z array']):
                    # allclose may be better for floats than array_equal
                    file_name = 'DeNovo/good_training_seq/{}.mgf'.format(
                        ann_spectra[:-4])
                    spectrum['params']['seq'] = ann_spectra[
                        :ann_spectra.find('-') - 1]
                    mgf.write((spectrum,), file_name)

if __name__ == '__main__':
    pep_files = sys.argv[1:]
    mgf_file = '/DeNovo/QE_2706_229_sequest_high_conf.mgf'
    process(mgf_file, pep_files)
Then to achieve the same as your bash loop did, you would call it as
python2.7 mgf_parser.py *.txt
If the expanded argument list is too long, you can use glob instead of relying on bash to expand it:
from glob import iglob
pep_files = iglob(sys.argv[1])
And call it like this to prevent expansion by bash:
python2.7 mgf_parser.py '*.txt'
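If the nested np.allclose loop over all ~2,000 annotated arrays is still the bottleneck, one further idea (my own sketch, not part of the answer above; it assumes matching spectra agree exactly after rounding to a fixed number of decimals) is to index the annotated arrays by a rounded tuple of their values, so each spectrum costs a single dictionary lookup:
# build the lookup once, after filling ann_arrays as above
ann_index = {}
for ann_spectra, array in ann_arrays.items():
    ann_index[tuple(np.round(array, 4))] = ann_spectra

with mgf.read(mgf_file) as reader:
    for spectrum in reader:
        key = tuple(np.round(spectrum['m/z array'], 4))
        ann_spectra = ann_index.get(key)
        if ann_spectra is not None:
            spectrum['params']['seq'] = ann_spectra[:ann_spectra.find('-') - 1]
            mgf.write((spectrum,), '{}.mgf'.format(ann_spectra[:-4]))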
