I have a file in the following format:
# Data set number 1
#
# Number of lines 4010
# Max number of column 3 is 5
# Blahblah
# More blahblah
1 2 1 110
2 2 5 20 21 465 417 38
2 1 2 33 46 17
......
4010 3 5 1001 2010 3355 107 2039
# Data set number 2
#
# Number of lines 4010
# Max number of column 3 is 5
# Blahblah
# More blahblah
1 2 1 110
2 2 5 20 21 465 417 38
2 1 2 33 46 17
......
I want to read the data set number, the number of lines, and the maximum number of column 3. I searched and found that the csv module can read headers, but can I read those numbers from the header and store them? What I did was:
import linecache
nnn = linecache.getline(filename, 1)
nnnn = nnn.split()[4]
number = linecache.getline(filename, 3)
number2 = number.split()[4]
mmm = linecache.getline(filename, 5)
mmmm = mmm.split()[7]
mmmmm = int(mmmm)
max_nb = range(mmmmm)
n_data = int(nnnn)
n_frame = range(n_data)
singleframe = natoms + 6
Like this. How can I read and store those numbers using the csv module? I skip the 6 header lines by using 'singleframe', but I am also curious how the csv module can handle the 6 header lines. Thanks
You don't really have a CSV file; you have a proprietary format instead. Just parse it directly, using regular expressions to quickly extract your desired data:
import re

set_number = re.compile(r'Data set number (\d+)')
patterns = {
    'line_count': re.compile(r'Number of lines (\d+)'),
    'max_num': re.compile(r'Max number of column 3 is (\d+)'),
}

with open(filename, 'r') as infh:
    results = {}
    set_numbers = []
    for line in infh:
        if not line.startswith('#'):
            # skip non-comment lines
            continue
        set_match = set_number.search(line)
        if set_match:
            set_numbers.append(int(set_match.group(1)))
        else:
            for name, pattern in patterns.items():
                match = pattern.search(line)
                if match:
                    results[name] = int(match.group(1))
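For the sample file shown in the question, set_numbers ends up as [1, 2] and results as {'line_count': 4010, 'max_num': 5}; note that results keeps only the values from the last header block seen, which is fine here since both data sets share the same counts.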
Do not use the linecache module. It reads the whole file into memory, and it is really only intended for access to Python source files: whenever a traceback needs to be printed, this module caches the source files involved in the current stack. You'd only use it for smaller files from which you need random lines, repeatedly.
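If you only need the header values once per frame, a plain sequential read does the job. Here is a minimal sketch, assuming the six-line header layout shown above (the field indexes match the ones in your snippet):

from itertools import islice

with open(filename) as infh:
    header = list(islice(infh, 6))        # the six '#' header lines of one frame
    n_data = int(header[0].split()[4])    # '# Data set number 1'
    n_lines = int(header[2].split()[4])   # '# Number of lines 4010'
    max_col3 = int(header[3].split()[7])  # '# Max number of column 3 is 5'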
I apologize if there is an obvious answer to this already.
I have a very large file that poses a few challenges for parsing. I am delivered these files from outside my organization, so there is no chance I can change their format.
Firstly, the file is space delimited but the fields that represent a "column" of data can span multiple rows. For example, if you had a row that was supposed to be 25 columns of data, it may be written in the file as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
As you can see, I can't rely on each set of data being on the same line, but I can rely on there being the same number of columns per set.
To make matters worse, the file follows a definition:data type format where the first 3 or so lines will be describing the data (including a field that tells me how many rows there are) and the next N rows are data. Then it will go back to the 3 lines format again to describe the next set of data. That means I can't just set up a reader for the N columns format and let it run to EOF.
I'm afraid the built-in Python file-reading functionality could get really ugly really fast, but I can't find anything in csv or numpy that works.
Any suggestions?
EDIT: Just as an example of a different solution:
We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns so we do something like:
data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);
This would read the data no matter how it wrapped while leaving a file handle open to process the next section later. This is done because the files are so large they can lead to excess RAM usage.
This is a sketch of how you can proceed:
(EDIT: with some modifications)
file = open("testfile.txt", "r")
# store data for the different sections here
datasections = list()
while True:
current_row = []
# read three lines
l1 = file.readline()
if line == '': # or other end condition
break
l2 = file.readline()
l3 = file.readline()
# extract the following information from l1, l2, l3
nrows = # extract the number rows in the next section
ncols = # extract the number of columns in the next section
# loop while len(current_row) < nrows * ncols:
# read next line, isolate the items using str.split()
# append items to current_row
# break current_row into the lines after each ncols-th item
# store data in datasections in a new array
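If the sections are numeric and numpy is available, the reassembly step can be done in one go. A minimal sketch, assuming current_row holds exactly the nrows * ncols whitespace-split tokens of one section:

import numpy as np

# flat token list -> (nrows, ncols) array of floats
section = np.array(current_row, dtype=float).reshape(nrows, ncols)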
I have many text files with the following format,
%header
%header
table
.
.
.
another table
.
.
.
If I didn't have the second table, I could use a simple command to read the file, such as:
numpy.loadtxt(file_name, skiprows=2, dtype=float, usecols=(0, 1))
Is there an easy way to read the first table without having to read the file line by line, something like numpy.loadtxt?
Use numpy.genfromtxt and set max_rows according to info from the header.
As an example, I created the following data file:
# nrows=10
# nrows=15
1
2
3
4
5
6
7
8
9
10
.
.
.
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
.
.
.
The following oversimplified code could read the two tables from the file (of course you can enhance the code to meet your needs):
import numpy as np

f = open('filename.txt')

# read the header and find the number of rows to read for each table:
p = f.tell()
l = f.readline()
tabrows = []
while l.strip().startswith('#'):
    if 'nrows' in l:
        tabrows.append(int(l.split('=')[1]))
    p = f.tell()
    l = f.readline()
f.seek(p)

# read all tables, assuming each table is followed by three lines with a dot:
tables = []
skipdots = 0
ndotsafter = 3
for nrows in tabrows:
    tables.append(np.genfromtxt(f, skip_header=skipdots, max_rows=nrows))
    skipdots = ndotsafter
f.close()
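With the sample file above, tables[0] receives the ten values of the first table and tables[1] the fifteen values of the second. Because genfromtxt reads from the already-open file object, each call continues where the previous one stopped, and skip_header skips the three dot lines separating the tables.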
I have a collection of files (kind of like CSV, but no commas) with data arranged like the following:
RMS ResNum Scores Rank
30 1 44 5
12 1 99 2
2 1 60 1
1.5 1 63 3
12 2 91 4
2 2 77 3
I'm trying to write a script that does the counting for me and gives an integer as the output. I want it to count how many times we get a value of RMS below 3 AND a score above 51. Only if both of these criteria are met should it add 1 to our count.
HOWEVER, the tricky part is that for any given "ResNum" it cannot add 1 multiple times. In other words, I want to sub-group the data by ResNum and then decide 1 or 0 on whether or not those two criteria are met within that group.
So right now it would give an output of 3, whereas I want it to display 2 instead, since ResNum 1 is being counted twice here (two rows meet the criteria).
import glob

file_list = glob.glob("*")
file_list = sorted(file_list)
for input_file in file_list:
    masterlist = []
    opened_file = open(input_file, 'r')
    count = 0
    for line in opened_file:
        data = line.split()
        templist = []
        templist.append(float(data[0]))  # RMS
        templist.append(int(data[1]))    # ResNum
        templist.append(float(data[2]))  # Scores
        templist.append(float(data[3]))  # Rank
        masterlist.append(templist)
Then here comes the part that needs modification (I think):
    for placement in masterlist:
        if placement[0] < 3 and placement[2] > 51.0:
            count += 1
    print input_file
    print count
    count = 0
Choose your data structures carefully to make your life easier.
import glob

file_list = glob.glob("*")
file_list = sorted(file_list)
grouper = {}

for input_file in file_list:
    with open(input_file) as f:
        grouper[input_file] = set()
        for line in f:
            rms, resnum, scores, rank = line.split()
            if float(rms) < 3 and float(scores) > 51:
                grouper[input_file].add(int(resnum))

for input_file, group in grouper.iteritems():
    print input_file
    print len(group)
This creates a dictionary of sets. The key of this dictionary is the file-name. The values are sets of the ResNums, added only when your condition holds. Since sets don't have repeated elements, the size of your set (len) will give you the right count of the number of times your condition was met, per ResNum, per file.
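With the sample data above, the set for that file ends up as {1, 2}: ResNum 1 is added only once even though two of its rows satisfy the criteria, so len(group) prints 2, as desired.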
I would like to parse a machine log file, rearrange the data, and write it to a .csv file, which I will import into a Google spreadsheet. Or write the data directly to the spreadsheet.
Here is an example of what the log looks like:
39 14 15 5 2016 39 14 15 5 2016 0
39 14 15 5 2016 40 14 15 5 2016 0.609
43 14 15 5 2016 44 14 15 5 2016 2.182
The output should look like this:
start_date,start_time,end_time,meters
15/5/16,14:39,14:39,0
15/5/16,14:39,14:40,0.609
15/5/16,14:43,14:44,2.182
I wrote the following Python code:
file = open("c:\SteelUsage.bsu")
for line in file.readlines():
print(line) #just for verification
line=line.strip()
position=[]
numbers=line.split()
for number in numbers:
position.append(number)
print(number)#just for verification
The idea is to save each number in a row to a list; then I can rewrite the numbers in the right order according to their position.
For example: in row #1 the string "39" will have position 0, "14" position 1, etc.
But it seems the code I wrote stores each number as a new list, because when I change print(number) to print(number[0]), the code prints the first digit of each number, instead of printing the first number (39).
Where did I go wrong?
Thank you
Do something like this. Write out to your csv file.
with open(r'c:\SteelUsage.bsu', 'r') as reader:
    lines = reader.readlines()
    for line in lines:
        inp = line.strip().split()
        out = '%s/%s/%s,%s:%s,%s:%s,%s' % (inp[2], inp[3], inp[4], inp[1], inp[0], inp[6], inp[5], inp[10])
        print out
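To actually produce the .csv file rather than just printing, here is a minimal sketch using the standard csv module (the output name output.csv is just a placeholder):

import csv

with open(r'c:\SteelUsage.bsu') as reader, open('output.csv', 'wb') as outfh:
    writer = csv.writer(outfh)
    writer.writerow(['start_date', 'start_time', 'end_time', 'meters'])
    for line in reader:
        inp = line.strip().split()
        writer.writerow(['%s/%s/%s' % (inp[2], inp[3], inp[4]),
                         '%s:%s' % (inp[1], inp[0]),
                         '%s:%s' % (inp[6], inp[5]),
                         inp[10]])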
I am new to Python and need to extract data from a text file. I have a text file below:
UNHOLTZ-DICKIE CORPORATION
CALIBRATION DATA
789 3456
222 455
333 5
344 67788
12 6789
2456 56656
And I want to read it on the shell as two columns of data only:
789 3456
222 455
333 5
344 67788
12 6789
2456 56656
Here's a Python program that reads a file and outputs the 3rd and following lines (it drops the first 2 lines). That's all I can deduce that you want, given your short explanation.
# read the whole file
file = open("input.file", 'r')
lines = file.readlines()
file.close()

# skip the first 2 lines, output the rest to stdout
count = 0
for line in lines:
    count += 1
    if count > 2:
        print line,
If you have numpy installed then this is a one-liner:
col1, col2 = numpy.genfromtxt("myfile.txt", skip_header=2, unpack=True)
where myfile.txt is your data file.
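unpack=True transposes the result, so the two columns can be assigned directly to col1 and col2.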