Skipping lines and splitting them into columns in a Python text parser

I am attempting to parse a space delimited text file in python 2.7.5 which looks kind of like:
variable description useless data
a1 asdfsdf 2342354
Sometimes it goes into further detail about the
variable/description here
a2 asdsfda 32123
EDIT: Sorry about the spaces added at the beginning, I did not see them
I want to be able to split the text file into an array with variable and description in 2 separate columns, and cut all the useless data and skip any lines that do not start with a string. The way I have set up my code to start is:
import os
import pandas
import numpy

os.chdir('C:\folderwithfiles')
f = open('Myfile.txt', 'r')
lines = f.readlines()
for line in lines:
    if not line.strip():
        continue
    else:
        print(line)
print(lines)
As of right now, this code skips most of the descriptive lines between variable lines, however some still pop up in the parsing. If I could get any help with either troubleshooting my line skips or getting started on the column-forming part, that would be great! I also do not have a lot of experience in Python. Thanks!
EDIT: A part of the file before code
CASEID (id) Case Identification 1 15 AN
MIDX (id) Index to Birth History 16 1 No
1:6
After:
CASEID (id) Case Identification 1 15 AN
MIDX (id) Index to Birth History 16 1 No
1:6

You want to filter out lines that start with spaces, and split all other lines to get the first two columns.
Translating those two rules into code:
with open('Myfile.txt') as f:
    for line in f:
        if not line.startswith(' '):
            variable, description, _ = line.split(None, 2)
            print(variable, description)
That's all there is to it.
Or, translating even more directly:
with open('Myfile.txt') as f:
    non_descriptions = filter(lambda line: not line.startswith(' '), f)
    values = (line.split(None, 2) for line in non_descriptions)
Now values is an iterator over (variable, description) tuples. And it's nice and declarative. The first line means "filter out lines that start with space". The second means "split each line to get the first two columns". (You could write the first as a genexpr instead of filter, or the second as map instead of a genexpr, but I think this is the closest to the English description.)
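If you want the result in the two-column structure you mentioned (and since you already import pandas), a minimal sketch of consuming that iterator might look like the following; the slicing and the column names 'variable'/'description' here are just illustrative labels, not anything from your file:
import pandas as pd

with open('Myfile.txt') as f:
    non_descriptions = filter(lambda line: not line.startswith(' '), f)
    # keep only the first two tokens of each kept line
    values = (line.split(None, 2)[:2] for line in non_descriptions)
    df = pd.DataFrame(list(values), columns=['variable', 'description'])

print(df)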

Assuming no spaces in your variables or descriptions, this will work
with open('path/to/file') as infile:
    answer = []
    for line in infile:
        if not line.strip():
            continue
        if line.startswith(' '):  # skipping descriptions
            continue
        splits = line.split()
        var, desc = splits[:2]
        answer.append([var, desc])

If you are using pandas try this:
from pandas import read_csv
data = read_csv('file.txt', error_bad_lines=False).drop(['useless data'], axis=1)
If your file is fixed-width (as opposed to comma-separated-values) then use pandas.read_fwf
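For example, a rough read_fwf sketch (untested; the field widths and names below are guesses you would adjust to your actual fixed-width layout):
import pandas as pd

# hypothetical widths for the variable, description and unwanted columns
df = pd.read_fwf('Myfile.txt', widths=[20, 40, 20],
                 names=['variable', 'description', 'useless'])
df = df[['variable', 'description']]  # keep only the two columns you need
print(df.head())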

Reading numbers off a list from a txt file, but only up to a comma

This is data from a lab experiment (around 717 lines of data). Rather than trying to work it in Excel, I want to import and graph it in either Python or MATLAB. I'm new here btw... and am a student!
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
more numbers : see Screenshot of more data from my file
I just can't figure out how to read the line up until a comma. Specifically, I need the Load numbers for one of my arrays/list, so for example on the first line I only need 62.638 (which would be the first number on my first index on my list/array).
How can I get an array/list of this, something that iterates/reads the list and ignores strings?
Thanks!
NOTE: I use Anaconda + Jupyter Notebooks for Python & Matlab (school provided software).
EDIT: Okay, so I came home today and worked on it again. I hadn't dealt with CSV files before, but after some searching I was able to learn how to read my file, somewhat.
import csv
from itertools import islice

with open('Blue_bar_GroupD.txt', 'r') as BB:
    BB_csv = csv.reader(BB)
    x = 0
    BB_lb = []
    while x < 7:  # to skip the string data
        next(BB_csv)
        x += 1
    for row in islice(BB_csv, 0, 758):
        print(row[0])  # testing if I can read row data
Okay, here is where I am stuck. I want to make an array/list that holds the 0th index value of each row. Sorry if I'm a freaking noob!
Thanks again!
You can skip all lines up to the first data row and then parse the data into a list for later use - 700+ lines can easily be processed in memory.
Therefore you need to:
read the file line by line
remember the last non-empty line before the numeric rows (== header)
check whether the line consists only of digits, commas and dots, else increase a skip counter (== data)
seek back to 0
skip enough lines to get to the header or the data
read the rest into a data structure
Create test file:
text = """
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
"""
with open ("t.txt","w") as w:
w.write(text)
Some helpers and the skipping/reading logic:
import re
import csv

def convert_row(row):
    """Convert one row of data into a list of mixed floats and strings.
    Float is the preferred data type, else the value is kept as a string."""
    d = []
    for v in row:
        try:
            # convert to float && add
            d.append(float(v))
        except ValueError:
            # not a number, append as is
            d.append(v)
    return d

def count_to_first_data(fh):
    """Count lines in fh not consisting of numbers, dots and commas.
    Side effect: will reset position in fh to 0."""
    skiplines = 0
    header_line = 0
    fh.seek(0)
    for line in fh:
        if re.match(r"^[\d.,]+$", line):
            fh.seek(0)
            return skiplines, header_line
        else:
            if line.strip():
                header_line = skiplines
            skiplines += 1
    raise ValueError("File does not contain pure number rows!")
Usage of helpers / data conversion:
data = []
with open("t.txt", "r") as csvfile:
    skip_to_data, skip_to_header = count_to_first_data(csvfile)
    for _ in range(skip_to_header):  # use skip_to_data if you do not want the headers
        next(csvfile)
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        row_data = convert_row(row)
        if row_data:
            data.append(row_data)
print(data)
Output (reformatted):
[['Load (lbf)', 'Time (s)', 'Crosshead (in)', 'Extensometer (in)'],
[62.638, 0.9, 0.0, 8e-05],
[122.998, 1.7, 0.001, 0.00012]]
Docs:
re.match
csv.reader
Methods of file objects (e.g. seek())
With this you now have "clean" data that you can use for further processing - including your headers.
For visualization you can have a look at matplotlib
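For instance, a small untested sketch, assuming data was built as above (first row is the headers, the remaining rows are [load, time, crosshead, extensometer]):
import matplotlib.pyplot as plt

headers, rows = data[0], data[1:]
load = [row[0] for row in rows]
time = [row[1] for row in rows]

plt.plot(time, load)
plt.xlabel(headers[1])  # 'Time (s)'
plt.ylabel(headers[0])  # 'Load (lbf)'
plt.show()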
I would recommend reading your file with Python:
data = []
with open('my_txt.txt', 'r') as fd:
    # Suppress header lines
    for i in range(6):
        fd.readline()
    # Read each data line up to the first comma
    for line in fd:
        index = line.find(',')
        if index >= 0:
            data.append(float(line[0:index]))
leads to a list containing your data of the first column
>>> data
[62.638, 122.998]
The MATLAB solution is less nice, since you have to know the number of data lines in your file (which you do not need to know in the python solution)
n_header = 6
n_lines = 2 % Insert here 717 (as you mentioned)
M = csvread('my_txt.txt', n_header, 0, [n_header 0 n_header+n_lines-1 0])
leads to:
>> M
M =
62.6380
122.9980
For the sake of clarity: you can also use MATLAB's textscan function to achieve what you want without knowing the number of lines, but still, the Python code would be the better choice in my opinion.
Based on your format, you will need three steps: one, read all lines; two, determine which lines to use; and last, get the floats and assign them to a list.
Assuming your file name is name.txt, try:
f = open("name.txt", "r")
all_lines = f.readlines()
grid = []
for line in all_lines:
if ('"' not in line) and (line != '\n'):
grid.append(list(map(float, line.strip('\n').split(','))))
f.close()
The grid will then contain a series of lists containing your group of floats.
Explanation for fun:
In the "for" loop, i searched for the double quote to eliminate any string as all strings are concocted between quotes. The other one is for skipping empty lines.
Based on your needs, you can use the list grid as you please. For example, to fetch the first line's first number, do
grid[0][0]
as python's list counts from 0 to n-1 for n elements.
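To pull out the whole Load column that the question asks about (assuming grid was built as above), a list comprehension over the rows works:
loads = [row[0] for row in grid]
print(loads)  # e.g. [62.638, 122.998]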
This is super simple in Matlab, just 2 lines:
data = dlmread('data.csv', ',', 6,0);
column1 = data(:,1);
Where 6 and 0 should be replaced by the row and column offset you want. So in this case, the data starts at row 7 and you want all the columns, then just copy over the data in column 1 into another vector.
As another note, try typing doc dlmread in MATLAB - it brings up the help page for dlmread. This is really useful when you're looking for MATLAB functions, as it also suggests similar functions at the bottom.

remove duplicates from list product of tab delimited file and further classification

I have a tab delimited file that I need to extract all of the column 12 content from (which documents categories). However, the column 12 content is highly repetitive, so firstly I need to get a list that just returns the number of categories (by removing repeats). And then I need to find a way to get the number of lines per category. My attempt is as follows:
def remove_duplicates(l):  # define function to remove duplicates
    return list(set(l))

input = sys.argv[1]  # command line arguments to open tab file
infile = open(input)
for lines in infile:  # split content into lines
    words = lines.split("\t")  # split lines into words i.e. columns
    dataB2.append(words[11])  # column 12 contains the desired repetitive categories
dataB2 = dataA.sort()  # sort the categories
dataB2 = remove_duplicates(dataA)  # attempting to remove duplicates but this just returns an infinite list of 0's in the print command
print(len(dataB2))
infile.close()
I have no idea how I would get the number of lines for each category, though.
So my questions are: how do I eliminate the repeats effectively? And how do I get the number of lines for each category?
I suggest using a Python Counter to implement this. A Counter does almost exactly what you are asking for, and so your code would look as follows:
from collections import Counter
import sys

count = Counter()
# Note that the with open()... syntax is generally preferred.
with open(sys.argv[1]) as infile:
    for lines in infile:  # split content into lines
        words = lines.split("\t")  # split lines into words i.e. columns
        count.update([words[11]])
print count
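If you also want the per-category counts printed one by one, the Counter already holds them; a small sketch, sorted from most to least frequent:
for category, n in count.most_common():
    print '%s - %d' % (category, n)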
All you need to do is read each line from a file, split it by tabs, grab column 12 for each line and put it in a list. (if you don't care about repeating lines just make column_12 = set() and use add(item) instead of append(item)). Then you simply use len() to get the length of the collection. Or if you want both you can use a list and change it to a set later.
EDIT: To count each category (thank you Tom Morris for alerting me to the fact that I didn't actually answer the question), you iterate over the set of column_12 so as to not count anything more than once, and use the list's built-in count() method.
with open(infile, 'r') as fob:
    column_12 = []
    for line in fob:
        column_12.append(line.split('\t')[11])
    print 'Unique lines in column 12 %d' % len(set(column_12))
    print 'All lines in column 12 %d' % len(column_12)
    print 'Count per category:'
    for cat in set(column_12):
        print '%s - %d' % (cat, column_12.count(cat))

How to clean large malformed CSV file using Python

I'm attempting to use Python 2.7.5 to clean up a malformed CSV file. The CSV file is fairly large (over 1GB). The first row of the file correctly lists the column headings, but after that each field is on a new line (unless it is blank) and some fields are multi-line. The multi-line fields are not surrounded by quotes, but need to be surrounded by quotes in the output. The number of columns is static and known. The pattern in the sample input provided is repeated throughout the length of the file.
Input file (sample):
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname
,my_username
,10.0.0.1
192.168.1.1
,2015-02-11 13:41:54 -0600
,,true
,false
my_2nd_hostname
,my_2nd_username
,10.0.0.2
192.168.1.2
,2015-02-11 14:04:41 -0600
,true
,,false
Desired output:
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname,my_username,"10.0.0.1 192.168.1.1",2015-02-11 13:41:54 -0600,,true,false
my_2nd_hostname,my_2nd_username,"10.0.0.2 192.168.1.2",2015-02-11 14:04:41 -0600,true,,false
I've gone down a couple paths that address one of the issues only to realize that it doesn't handle another aspect of the malformed data. I would appreciate if anyone could please help me identify an efficient way to clean up this file.
Thanks
EDIT
I have several code scraps from going down different paths, but here is the current iteration. It isn't pretty, just a bunch of hacks to try and figure this out.
import csv

inputfile = open('input.csv', 'r')
outputfile_1 = open('output.csv', 'w')
counter = 1
for line in inputfile:
    # Skip header row
    if counter == 1:
        outputfile_1.write(line)
        counter = counter + 1
    else:
        line = line.replace('\r', '').replace('\n', '')
        outputfile_1.write(line)
inputfile.close()
outputfile_1.close()

with open('output.csv', 'r') as f:
    text = f.read()

comma_count = text.count(',')  # comma_count/6 = total number of rows
# need to insert a newline after the field contents after every 6th comma
# unfortunately the last field of the row and the first field of the next row are now rammed up together because of the newline replaces above...
# then process as normal CSV

# one path I started to go down... but this isn't even functional
groups = text.split(',')
counter2 = 1
while counter2 <= comma_count/6:
    line = ','.join(groups[:(6*counter2)]), ','.join(groups[(6*counter2):])
    print line
    counter2 = counter2 + 1
EDIT 2
Thanks to @DSM and @Ryan Vincent for getting me on the right track. Using their ideas I made the following code, which seems to correct my malformed CSV. I'm sure there are many places for improvement though, which I would happily accept.
import csv
import re

outputfile_1 = open('output.csv', 'wb')
wr = csv.writer(outputfile_1, quoting=csv.QUOTE_ALL)

with open('input.csv', 'r') as f:
    text = f.read()

comma_indices = [m.start() for m in re.finditer(',', text)]  # Find all the commas - the fields are between them
cursor = 0
field_counter = 1
row_count = 0
csv_row = []
for index in comma_indices:
    newrowflag = False
    if "\r" in text[cursor:index]:
        # This chunk has two fields, the last of one row and the first of the next
        next_field = text[cursor:index].split('\r')
        next_field_trimmed = next_field[0].replace('\n', ' ').rstrip().lstrip()
        csv_row.extend([next_field_trimmed])  # Add the last field of this row
        # Note the position in the middle of the chunk (after the last field and before the next)
        # and set a flag that we need to start the next csv_row before we move on to the next comma index
        cursor = cursor + text[cursor:index].index('\r') + 1
        newrowflag = True
    else:
        next_field_trimmed = text[cursor:index].replace('\n', ' ').rstrip().lstrip()
        csv_row.extend([next_field_trimmed])
    # Advance the cursor to the character after the comma to start the next field
    cursor = index + 1
    # If we've done 7 fields then we've finished the row
    if field_counter % 7 == 0:
        row_count = row_count + 1
        wr.writerow(csv_row)
        # Reset
        csv_row = []
        # If the last chunk had 2 fields in it...
        if newrowflag:
            next_field_trimmed = next_field[1].replace('\n', ' ').rstrip().lstrip()
            csv_row.extend([next_field_trimmed])
            field_counter = field_counter + 1
    field_counter = field_counter + 1

# Write the last row
wr.writerow(csv_row)
outputfile_1.close()

# Process output.csv as normal CSV file...
# Process output.csv as normal CSV file...
This is a comment about how I would tackle this.
For each line:
I can easily identify the start and end of certain groups:
Hostname - there is only one
usernames - read these until you meet something that does not look like a username (comma delimited)
ip addresses - read these until you meet a timestamp - identified with a pattern match - be aware these are separated by space rather than comma. The end of the group is identified by the trailing comma.
timestamp - easy to identify with a pattern match
test1, test2, test3 - certain to be there as comma delimited fields
Notes: I would use the 'pattern' matches to enable me to check that I have the correct thing in the correct place. It enables spotting errors sooner.
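To illustrate the idea (this is only a sketch, not a full parser: the record string below is one logical row reassembled by hand, and the regex is a guess at the timestamp format shown in the sample):
import re

record = ('my_hostname,my_username,10.0.0.1 192.168.1.1,'
          '2015-02-11 13:41:54 -0600,,true,false')

# the timestamp is the easiest field to anchor with a pattern match
ts_pattern = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}')
match = ts_pattern.search(record)

before = record[:match.start()].rstrip(',').split(',')  # hostname, username, IP addresses
timestamp = match.group()
after = record[match.end():].split(',')[1:]             # test1, test2, test3 (may be empty strings)
print before, timestamp, after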
From your data excerpt it seems like any line that starts with a comma needs to be joined to the preceding line and any line starting with anything other than a comma marks a new row.
If that's the case then you could use something like the following code to clean up the CSV file such that the standard library csv parser can handle it.
#!/usr/bin/python

raw_data = 'somefilename.raw'
csv_data = 'somefilename.csv'

with open(raw_data, 'Ur') as inp, open(csv_data, 'wb') as out:
    row = list()
    for line in inp:
        line = line.rstrip('\n')
        if line.startswith(','):
            row.append(line)
        else:
            out.write(''.join(row) + '\n')
            row = list()
            row.append(line)
    # Don't forget to write the last row!
    out.write(''.join(row) + '\n')
This is a miniature state machine ... accumulating lines into each row until we find a line that doesn't start with a comma, writing the previous row and so on.
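Once the cleaned file exists, the standard csv module should be able to read it; for example (a sketch, reusing the csv_data name from above):
import csv

with open(csv_data, 'rb') as f:
    for fields in csv.reader(f):
        print fields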

How to find and replace multiple lines in text file?

I am running Python 2.7.
I have three text files: data.txt, find.txt, and replace.txt. Now, find.txt contains several lines that I want to search for in data.txt and replace that section with the content in replace.txt. Here is a simple example:
data.txt
pumpkin
apple
banana
cherry
himalaya
skeleton
apple
banana
cherry
watermelon
fruit
find.txt
apple
banana
cherry
replace.txt
1
2
3
So, in the above example, I want to search for all occurrences of apple, banana, and cherry in the data and replace those lines with 1, 2, 3.
I am having some trouble with the right approach to this as my data.txt is about 1MB so I want to be as efficient as possible. One dumb way is to concatenate everything into one long string and use replace, and then output to a new text file so all the line breaks will be restored.
import re

data = open("data.txt", 'r')
find = open("find.txt", 'r')
replace = open("replace.txt", 'r')
data_str = ""
find_str = ""
replace_str = ""

for line in data:  # concatenate it into one long string
    data_str += line
for line in find:  # concatenate it into one long string
    find_str += line
for line in replace:
    replace_str += line

new_data = data_str.replace(find_str, replace_str)

new_file = open("new_data.txt", "w")
new_file.write(new_data)
But this seems so convoluted and inefficient for a large data file like mine. Also, the replace function appears to be deprecated so that's not good.
Another way is to step through the lines and keep a track of which line you found a match.
Something like this:
location = 0
LOOP1:
for find_line in find:
    for i, data_line in enumerate(data).startingAtLine(location):
        if find_line == data_line:
            location = i  # found possibility
            for idx in range(NUMBER_LINES_IN_FIND):
                if find_line[idx] != data_line[idx+location]:  # compare line by line
                    # if the subsequent lines don't match, then go back and search again
                    goto LOOP1
Not fully formed code, I know. I don't even know if it's possible to search through a file from a certain line on or between certain lines but again, I'm just a bit confused in the logic of it all. What is the best way to do this?
Thanks!
If the file is large, you want to read and write one line at a time, so the whole thing isn't loaded into memory at once.
# create a dict of find keys and replace values
findlines = open('find.txt').read().split('\n')
replacelines = open('replace.txt').read().split('\n')
find_replace = dict(zip(findlines, replacelines))

with open('data.txt') as data:
    with open('new_data.txt', 'w') as new_data:
        for line in data:
            for key in find_replace:
                if key in line:
                    line = line.replace(key, find_replace[key])
            new_data.write(line)
Edit: I changed the code to use read().split('\n') instead of readlines() so \n isn't included in the find and replace strings.
A couple of things here:
replace is not deprecated, see this discussion for details:
Python 2.7: replace method of string object deprecated
If you are worried about reading data.txt in to memory all at once, you should be able to just iterate over data.txt one line at a time
data = open("data.txt", 'r')
for line in data:
# fix the line
so all that's left is coming up with a whole bunch of find/replace pairs and fixing each line. Check out the zip function for a handy way to do that
find = open("find.txt", 'r').readlines()
replace = open("replace.txt", 'r').readlines()
new_data = open("new_data.txt", 'w')
for find_token, replace_token in zip(find, replace):
new_line = line.replace(find_token, replace_token)
new_data.write(new_line + os.linesep)

writing lines group by group in different files

I've got a little script which is not working nicely for me, hope you can help and find the problem.
I have two starting files:
traveltimes: contains the lines I need, it's a column file (every row has just a number). The lines I need are separated by a line which starts with 11 whitespaces
header lines: contains three header lines
output_file: I want to get 29 files (STA%s). What's inside? Every file will contain the same header lines, after which I want to append the group of lines contained in the traveltimes file (one different group of lines for every file). Every group of lines is made up of 74307 rows (1 column)
So far this script creates 29 files with the same header lines but then it mixes up everything, I mean it writes something but it's not what I want.
Any idea????
def make_station_files(traveltimes, header_lines):
    """Gives the STAxx.tgrid files required by loc3d"""
    sta_counter = 1
    with open(header_lines, 'r') as file_in:
        data = file_in.readlines()
        for i in range(29):
            with open('STA%s' % (sta_counter), 'w') as output_files:
                sta_counter += 1
                for i in data[0:3]:
                    values = i.strip()
                    output_files.write("%s\n\t1\n" % (values))
                with open(traveltimes, 'r') as times_file:
                    #collector = []
                    for line in times_file:
                        if line.startswith(" "):
                            break
                        output_files.write("%s" % (line))
Suggestion:
Read the header rows first. Make sure this works before proceeding. None of the rest of the code needs to be indented under this.
Consider writing a separate function to group the traveltimes file into a list of lists.
Once you have a working traveltimes reader and grouper, only then create a new STA file, print the headers to it, and then write the timegroups to it.
Build your program up step-by-step, making sure it does what you expect at each step. Don't try to do it all at once because then you won't easily be able to track down where the issue lies.
My quick edit of your script uses itertools.groupby() as a grouper. It is a little advanced because the grouping function is stateful and tracks its state in a mutable list:
from itertools import groupby

def make_station_files(traveltimes, header_lines):
    'Gives the STAxx.tgrid files required by loc3d'
    with open(header_lines, 'r') as f:
        headers = f.readlines()

    def station_counter(line, cnt=[1]):
        'Stateful station counter -- keeps the count in a mutable list'
        if line.strip() == '':
            cnt[0] += 1
        return cnt[0]

    with open(traveltimes, 'r') as times_file:
        for station, group in groupby(times_file, station_counter):
            with open('STA%s' % (station), 'w') as output_file:
                for header in headers[:3]:
                    output_file.write('%s\n\t1\n' % (header.strip()))
                for line in group:
                    if not line.startswith(' '):
                        output_file.write('%s' % (line))
This code is untested because I don't have sample data. Hopefully, you'll get the gist of it.
