Read Delimited File That Wraps Lines - python

I apologize if there is an obvious answer to this already.
I have a very large file that poses a few challenges for parsing. I am delivered these files from outside my organization, so there is no chance I can change their format.
Firstly, the file is space delimited but the fields that represent a "column" of data can span multiple rows. For example, if you had a row that was supposed to be 25 columns of data, it may be written in the file as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
As you can see, I can't rely on each set of data being on the same line, but I can rely on there being the same number of columns per set.
To make matters worse, the file follows a definition:data type format where the first 3 or so lines will be describing the data (including a field that tells me how many rows there are) and the next N rows are data. Then it will go back to the 3 lines format again to describe the next set of data. That means I can't just set up a reader for the N columns format and let it run to EOF.
I'm afraid the built-in Python file reading functionality could get really ugly really fast, but I can't find anything in csv or numpy that works.
Any suggestions?
EDIT: Just as an example of a different solution:
We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns so we do something like:
data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);
This would read the data no matter how it wrapped while leaving a file handle open to process the next section later. This is done because the files are so large they can lead to excess RAM usage.

This is a sketch of how you can proceed:
(EDIT: with some modifications)
# store data for the different sections here
datasections = []
with open("testfile.txt", "r") as f:
    while True:
        # read the three header lines
        l1 = f.readline()
        if l1 == '':  # or other end condition
            break
        l2 = f.readline()
        l3 = f.readline()
        # extract the following information from l1, l2, l3
        nrows = ...  # placeholder: the number of rows in the next section
        ncols = ...  # placeholder: the number of columns in the next section
        current_row = []
        while len(current_row) < nrows * ncols:
            # read the next line and isolate the items using str.split()
            line = f.readline()
            if line == '':  # file ended in the middle of a section
                break
            current_row.extend(line.split())
        # break current_row into lines after each ncols-th item
        rows = [current_row[i:i + ncols] for i in range(0, len(current_row), ncols)]
        # store the data of this section in datasections
        datasections.append(rows)
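The inner loop of the sketch can be pulled into a small helper that behaves much like the MATLAB textscan call in the question: it reads a fixed number of whitespace-separated values from an open file handle no matter how they wrap, and leaves the handle positioned at the next section. This is a minimal sketch, assuming every value is numeric and each section's data ends at a line break. The name read_wrapped and the sample input are hypothetical, and extracting nrows and ncols from the header lines is left out because the header layout is not given in the question.

```python
import io


def read_wrapped(fh, nrows, ncols):
    """Read nrows * ncols whitespace-separated numbers from fh, however they wrap."""
    needed = nrows * ncols
    values = []
    while len(values) < needed:
        line = fh.readline()
        if line == '':  # file ended before the section was complete
            raise EOFError('expected %d values, found %d' % (needed, len(values)))
        values.extend(line.split())
    if len(values) != needed:
        # assumption: a section never shares a line with whatever follows it
        raise ValueError('section does not end at a line break')
    # cut the flat list into rows of ncols items each
    return [[float(v) for v in values[i:i + ncols]]
            for i in range(0, needed, ncols)]


# hypothetical input: 2 rows of 6 columns, wrapped irregularly
sample = io.StringIO("1 2 3\n4 5\n6\n7 8\n9 10 11 12\nheader of next section\n")
rows = read_wrapped(sample, 2, 6)
next_line = sample.readline()  # the handle now sits at the next section
print(rows)
print(repr(next_line))
```

Because only one section is held in memory at a time, each section can be processed and discarded before the next one is read, which matters for files this large.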

Related

Use previous value of Pandas series with mix of strings and ints

I'm trying to overwrite the times with the date of that day. This list is ~100 rows long, below is a sample:
Date
0 May-21-20 #Gets passed
1 02:51PM #(should read May-21-20)
2 01:59PM #(should read May-21-20)
3 01:29PM #etc
4 12:45PM #etc
5 12:42PM
6 11:55AM
7 10:02AM
8 09:37AM #(should read May-21-20)
9 May-20-20 #gets passed
10 02:47PM #(should read May-20-20)
11 02:30PM #(should read May-20-20)
12 02:29PM #(should read May-20-20)
13 02:01PM #(should read May-20-20)
Here's where I'm currently at with my code:
for i in headline_table['Date']:
    date_list = headline_table['Date'].tolist() #Make the pd Series a list
    index_value = date_list.index(i) #Now a list so I can reference index value
    previous = index_value - 1 #index of current minus one = previous value
    if re.search(r'^[A-Z]', i):
        pass
    else:
        headline_table['Date'][i] = headline_table.loc[previous, 'Date']
I've tried a bunch of different ways to go about this but can't seem to figure it out. I do not get any errors with the code, but the times do not get overwritten with the date, instead it seems nothing happens.
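We can use where to keep only the rows that hold a date (they contain a '-'), then ffill to carry each date forward over the time rows below it: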
df['Date1']=df.Date.where(df.Date.str.contains('-')).ffill()
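To see what the one-liner does, here is a small self-contained example on a shortened copy of the sample data. It is only a sketch: the frame name df and the new column Date1 follow the answer, and the five-row sample is an assumption made to keep the example short.

```python
import pandas as pd

# shortened, hypothetical copy of the sample column
df = pd.DataFrame({'Date': ['May-21-20', '02:51PM', '01:59PM',
                            'May-20-20', '02:47PM']})

# True for rows that hold a date, False for rows that hold a time
is_date = df['Date'].str.contains('-')

# where() keeps the dates and turns the times into NaN;
# ffill() then copies the last seen date down into those gaps
df['Date1'] = df['Date'].where(is_date).ffill()
print(df)
```

No explicit loop or index arithmetic is needed, so the "previous row" bookkeeping from the question disappears.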

parse text file and generate new .csv file based on that data

I would like to parse a machine log file, rearrange the data and write it to a .csv file, which I will import into a Google spreadsheet. Or write the data directly to the spreadsheet.
Here is an example of what the log looks like:
39 14 15 5 2016 39 14 15 5 2016 0
39 14 15 5 2016 40 14 15 5 2016 0.609
43 14 15 5 2016 44 14 15 5 2016 2.182
the output should look like this:
start_date,start_time,end_time,meters
15/5/16,14:39,14:39,0
15/5/16,14:39,14:40,0.609
15/5/16,14:43,14:44,2.182
i wrote the following python code:
file = open("c:\SteelUsage.bsu")
for line in file.readlines():
    print(line) #just for verification
    line=line.strip()
    position=[]
    numbers=line.split()
    for number in numbers:
        position.append(number)
        print(number)#just for verification
The idea is to save each number in a row to a list, so that I can rewrite the numbers in the right order according to their position.
For example: in row #1 the string "39" will have position 0, "14" position 1, etc.
But it seems the code I wrote stores each number as a new list, because when I change print(number) to print(number[0]), the code prints the first digit of each number instead of the first number (39).
Where did I go wrong?
Thank you
Do something like this. Write out to your csv file.
with open(r'c:\SteelUsage.bsu', 'r') as reader:
    print('start_date,start_time,end_time,meters')
    for line in reader:
        inp = line.split()
        if not inp:
            continue  # skip blank lines
        # inp[4][-2:] keeps the last two digits of the year (2016 -> 16)
        out = '%s/%s/%s,%s:%s,%s:%s,%s' % (inp[2], inp[3], inp[4][-2:], inp[1], inp[0], inp[6], inp[5], inp[10])
        print(out)
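The same conversion can also go through the csv module, which takes care of the delimiter and the header row. This is a minimal sketch, assuming each log line holds exactly 11 fields in the order minute, hour, day, month, year (start), minute, hour, day, month, year (end), meters; that order is inferred from the sample and the desired output. The function name convert is hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def convert(log_lines, out_file):
    """Write the rearranged log lines to out_file as CSV."""
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(['start_date', 'start_time', 'end_time', 'meters'])
    for line in log_lines:
        fields = line.split()
        if len(fields) != 11:
            continue  # skip blank or malformed lines
        s_min, s_hour, day, month, year, e_min, e_hour = fields[:7]
        writer.writerow([
            '%s/%s/%s' % (day, month, year[-2:]),  # 2016 -> 16
            '%s:%s' % (s_hour, s_min),
            '%s:%s' % (e_hour, e_min),
            fields[10],
        ])


log = [
    "39 14 15 5 2016 39 14 15 5 2016 0",
    "39 14 15 5 2016 40 14 15 5 2016 0.609",
    "43 14 15 5 2016 44 14 15 5 2016 2.182",
]
buf = io.StringIO()
convert(log, buf)
print(buf.getvalue())
```

To write a real file, pass a handle opened with open(path, 'w', newline='') instead of the buffer, and pass the open log file as log_lines.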

python csv module read data from header

I have following format of file
# Data set number 1
#
# Number of lines 4010
# Max number of column 3 is 5
# Blahblah
# More blahblah
1 2 1 110
2 2 5 20 21 465 417 38
2 1 2 33 46 17
......
4010 3 5 1001 2010 3355 107 2039
# Data set number 2
#
# Number of lines 4010
# Max number of column 3 is 5
# Blahblah
# More blahblah
1 2 1 110
2 2 5 20 21 465 417 38
2 1 2 33 46 17
......
I hope to read the data set number, the number of lines, and the maximum number of column 3. I searched and found out that the csv module can read headers, but can I read those numbers from the header and store them? What I did was
nnn = linecache.getline(filename, 1)
nnnn = nnn(line.split()[4])
number = linecache.getline(filename, 3)
number2 = number(line.split()[4])
mmm = linecache.getline(filename, 5)
mmmm = mmm(line.split()[7])
mmmmm = int(mmmm)
max_nb = range(mmmmm)
n_data = int(nnnn)
n_frame = range(n_data)
singleframe = natoms + 6
Like this. How can I read those numbers and store them using the csv module? I skip the 6 header lines by using 'singleframe', but I am also curious how the csv module can read the 6 header lines. Thanks
You don't really have a CSV file; you have a proprietary format instead. Just parse it directly, using regular expressions to quickly extract your desired data:
import re
set_number = re.compile(r'Data set number (\d+)')
patterns = {
    'line_count': re.compile(r'Number of lines (\d+)'),
    'max_num': re.compile(r'Max number of column 3 is (\d+)'),
}
with open(filename, 'r') as infh:
    results = {}
    set_numbers = []
    for line in infh:
        if not line.startswith('#'):
            # skip data lines; only the comment lines carry header values
            continue
        # search, not match: the text follows the leading '# '
        set_match = set_number.search(line)
        if set_match:
            set_numbers.append(int(set_match.group(1)))
        else:
            for name, pattern in patterns.items():
                match = pattern.search(line)
                if match:
                    # values from a later data set overwrite earlier ones
                    results[name] = int(match.group(1))
Do not use the linecache module. It'll read the whole file into memory, and is really only intended for access to Python source files; whenever a traceback needs to be printed this module caches the source files involved with the current stack. You'd only use it for smaller files from which you need random lines, repeatedly.
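If the header values are needed per data set rather than only for the last one read, the same regular expressions can fill one dictionary per set. This is a minimal sketch under the assumption that every data set starts with a "Data set number" line; the names HEADER_PATTERNS and read_headers and the shortened sample are hypothetical.

```python
import io
import re

HEADER_PATTERNS = {
    'set_number': re.compile(r'Data set number (\d+)'),
    'line_count': re.compile(r'Number of lines (\d+)'),
    'max_num': re.compile(r'Max number of column 3 is (\d+)'),
}


def read_headers(fh):
    """Return one dict of header values per data set found in fh."""
    headers = []
    for line in fh:
        if not line.startswith('#'):
            continue  # data line, not part of a header
        for name, pattern in HEADER_PATTERNS.items():
            match = pattern.search(line)
            if not match:
                continue
            if name == 'set_number':
                headers.append({})  # a new data set begins here
            if headers:
                headers[-1][name] = int(match.group(1))
    return headers


# shortened, hypothetical sample in the format from the question
sample = io.StringIO(
    "# Data set number 1\n#\n# Number of lines 2\n"
    "# Max number of column 3 is 5\n# Blahblah\n"
    "1 2 1 110\n2 2 5 20 21 465 417 38\n"
    "# Data set number 2\n#\n# Number of lines 1\n"
    "# Max number of column 3 is 2\n"
    "2 1 2 33 46 17\n"
)
headers = read_headers(sample)
print(headers)
```

The file is read once, line by line, so memory use stays flat regardless of file size.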

Input file, modify column, output file

I have data in a text file and I would like to be able to modify the file by columns and output the file again. I normally write in C (basic ability) but chose Python for its obvious string benefits. I haven't ever used Python before so I'm a tad stuck. I have been reading up on similar problems but they only show how to change whole lines. To be honest I have no clue what to do.
Say I have the file
1 2 3
4 5 6
7 8 9
and I want to be able to change column two with some function say multiply it by 2 so I get
1 4 3
4 10 6
7 16 9
Ideally I would be able to easily change the program so I apply any function to any column.
For anyone who is interested it is for modifying lab data for plotting. eg take the log of the first column.
Python is an excellent general-purpose language; however, if you are on a Unix-based system you should take a look at awk. The awk language is designed for this kind of text-based transformation. The power of awk is easily seen for your question, as the solution is only a few characters: awk '{$2=$2*2;print}'.
$ cat file
1 2 3
4 5 6
7 8 9
$ awk '{$2=$2*2;print}' file
1 4 3
4 10 6
7 16 9
# Multiply the third column by 10
$ awk '{$3=$3*10;print}' file
1 2 30
4 5 60
7 8 90
In awk each column is referenced by $i, where i is the index of the field. So we just set the value of the second field to the value of the second field multiplied by two and print the line. This can be written even more concisely as awk '{$2=$2*2}1' file, but it is best to be clear at the beginning.
Here is a very simple Python solution:
for line in open("myfile.txt"):
    col = line.split()
    print(col[0], int(col[1]) * 2, col[2])
There are plenty of improvements that could be made but I'll leave that as an exercise for you.
I would use pandas or just numpy. Read your file with:
import pandas as pd
data = pd.read_csv('file.txt', header=None, sep=r'\s+')
then work with the data in a spreadsheet-like style, e.g.:
data[1] *= 2
finally write it back to a file with:
data.to_csv('output.txt', sep=' ', header=False, index=False)
As @sudo_O said, there are much more efficient tools than Python for this task. However, here is a possible solution:
from itertools import repeat
import csv
fun = pow
with open('m.in', 'r', newline='') as input_file:
    with open('m.out', 'w', newline='') as out_file:
        inpt = csv.reader(input_file, delimiter=' ')
        out = csv.writer(out_file, delimiter=' ')
        for row in inpt:
            row = [int(e) for e in row]  # conversion
            opt = repeat(2, len(row))  # exponent 2 for every value
            # write function(data, argument) for each value
            out.writerow([str(elem) for elem in map(fun, row, opt)])
Here it squares every number, but you can configure it to square only the second column by changing opt: opt = [1 + (col == 1) for col in range(len(row))] (exponent 2 for column 1, exponent 1 otherwise).
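The question also asks for a way to apply any function to any column. A generic version of the plain Python approach could look like the sketch below. It assumes whitespace-separated numeric columns; the name transform_column is hypothetical, and the '%g' format is a choice made here so that 4.0 prints as 4 (it keeps at most six significant digits).

```python
import math


def transform_column(lines, col, func):
    """Yield each line with func applied to the zero-based column col."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        fields[col] = '%g' % func(float(fields[col]))
        yield ' '.join(fields)


data = ["1 2 3", "4 5 6", "7 8 9"]

# multiply the second column by 2
doubled = list(transform_column(data, 1, lambda x: x * 2))
print(doubled)

# take the natural log of the first column
logged = list(transform_column(data, 0, math.log))
print(logged[0])
```

Passing an open file as lines works the same way, since the function only iterates over it.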

Translating a gridded csv file with numpy

I need to get some meteorological data into a MySQL database.
File inputFile.csv is a comma-delimited list of values. There are 241 lines and 481 values per line.
Each line maps to a certain latitude, and each value's position within the line maps to a certain longitude.
There are two additional files with the same structure, lat.csv and lon.csv. These files contain the coordinates that the values in inputFile.csv map to.
So to find the latitude and longitude for a value in inputFile.csv, we need to refer to the values at the same line/position (or row/column) within lat.csv and lon.csv
I want to translate inputFile.csv using lat.csv and lon.csv such that my output file contains a list of values (from inputFile.csv),latitudes, and longitudes.
Here is a small visual example:
inputFile.csv
3,5,1,4,5
1,4,1,2,5
5,7,3,8,0
lat.csv
22,31,51,21,52
55,21,24,66,12
11,23,12,55,55
lon.csv
12,35,12,52,11
35,11,25,33,42
62,53,45,25,54
output:
val lat lon
3 22 12
5 31 35
1 51 12
4 21 52
5 52 11
1 55 35
4 21 11
1 24 25
2 66 33
etc
What is the best way to do this in python/numpy?
I suppose that since you know the total size of the array that you want, you can preallocate it:
import numpy as np
a = np.empty((241*481, 3))
Now you can add the data:
for i, fname in enumerate(('inputFile.csv', 'lat.csv', 'lon.csv')):
    # np.loadtxt is used here because it parses both the commas and the line breaks
    data = np.loadtxt(fname, delimiter=',')
    a[:, i] = data.ravel()
If you don't know the number of elements up front, you can build a list of 1-d np.ndarrays instead and stack them:
alist = []
for fname in ('inputFile.csv', 'lat.csv', 'lon.csv'):
    data = np.loadtxt(fname, delimiter=',')
    alist.append(data.ravel())
a = np.array(alist).T
Using only numpy functions:
import numpy as np
# reshape returns a new array, so keep the result
inputFile = np.genfromtxt('inputFile.csv', delimiter=',').reshape(-1)
lat = np.genfromtxt('lat.csv', delimiter=',').reshape(-1)
lon = np.genfromtxt('lon.csv', delimiter=',').reshape(-1)
# transpose so that each row is one grid point: value, latitude, longitude
output = np.vstack((inputFile, lat, lon)).T
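The flattening idea behind these answers can be checked on the small example from the question. The sketch below uses in-memory buffers in place of the three files, which is an assumption made to keep it self-contained; the helper name flatten_grid is hypothetical.

```python
import io

import numpy as np


def flatten_grid(source):
    """Load a comma-delimited grid and flatten it row by row."""
    return np.loadtxt(source, delimiter=',').ravel()


# the sample grids from the question, as in-memory stand-ins for the files
values = io.StringIO("3,5,1,4,5\n1,4,1,2,5\n5,7,3,8,0\n")
lat = io.StringIO("22,31,51,21,52\n55,21,24,66,12\n11,23,12,55,55\n")
lon = io.StringIO("12,35,12,52,11\n35,11,25,33,42\n62,53,45,25,54\n")

# one row per grid point: value, latitude, longitude
table = np.column_stack((flatten_grid(values),
                         flatten_grid(lat),
                         flatten_grid(lon)))
print(table[:3])
```

All three grids are flattened in the same row-major order, so position k in each flattened array refers to the same grid cell. The resulting table could then be written out with np.savetxt, for example with fmt='%g' and a header line.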
