Python: count headers in a csv file

I want to know the number of header lines my csv file contains (between 0 and ~50). The file itself is huge (so not reading the complete file is mandatory) and contains numerical data.
I know that csv.Sniffer has a has_header() function, but that can only detect a single header line.
One idea I had is to recursively call the has_header function (supposing it detects the first header) and then count the recursions. I am sure, though, that there is a much smarter way.
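For reference, a rough sketch of that idea (iterative rather than truly recursive; the helper name and the 100-line sample size are made up, and csv.Sniffer can raise csv.Error on samples where it cannot determine a dialect):
import csv
from itertools import islice

def count_headers_with_sniffer(filename, sample_lines=100):
    # Hypothetical helper: read a sample and keep dropping the first line
    # as long as the Sniffer still believes the sample starts with a header.
    with open(filename, newline="") as handle:
        lines = list(islice(handle, sample_lines))
    count = 0
    while len(lines) > 1 and csv.Sniffer().has_header("".join(lines)):
        lines.pop(0)
        count += 1
    return count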
Googling was kind of a pain, since no matter what you search, if it includes "count" and "csv" at some point, you get all the "count rows in csv" results :D
Clarification:
By number of headers I mean the number of rows containing information which is not data. There is no general rule for the headers (they could be text, floats, or whitespace), and a header may even be a single line of text. The data itself, however, is only floats. For me this was super clear, because I've been working with these files for a long time, but I forgot this isn't the normal case.
I hoped there was an easy and smart built-in function in NumPy or Pandas, but it doesn't seem so.
Inspired by the comments so far, I think my best bet is to
read 100 lines
count the number of separators in each line
determine the most common number of separators per line
work backwards from the end of those 100 lines and find the first line with a different number of separators, or one that isn't all floats; that line is the last header line (a sketch of this is below)
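A minimal sketch of that plan, assuming comma-separated fields (so csv.reader's field count stands in for "number of separators"); the helper name count_header_lines is made up:
import csv
from collections import Counter
from itertools import islice

def count_header_lines(filename, sample_size=100):
    with open(filename, newline="") as handle:
        sample = list(csv.reader(islice(handle, sample_size)))
    # The most common field count in the sample is assumed to be the data width.
    data_width = Counter(len(row) for row in sample).most_common(1)[0][0]

    def is_data_row(row):
        if len(row) != data_width:
            return False
        try:
            [float(field) for field in row]
            return True
        except ValueError:
            return False

    # Walk backwards from the end of the sample: the first row that is not
    # all floats (or has a different width) is the last header line.
    for index in range(len(sample) - 1, -1, -1):
        if not is_data_row(sample[index]):
            return index + 1  # rows 0..index are header lines
    return 0                  # no header found in the sample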

Here's a sketch for finding the first line which matches a particular criterion. For demo purposes, I use the criterion "there are empty fields":
import csv

with open(filename, "r", encoding="utf-8") as handle:
    for lineno, fields in enumerate(csv.reader(handle), 1):
        if "" in fields:
            print(lineno - 1)
            break
You'd update it to look for something which makes sense for your data, like perhaps "third and eighth fields contain numbers":
try:
    float(fields[2])
    float(fields[7])
    print(lineno - 1)
    break
except ValueError:
    continue
(notice how the list fields is indexed starting at zero, so the first field is fields[0] and the third is fields[2]), or perhaps a more sophisticated model where the first line contains no empty fields, successive lines contain more and more empty fields, and then the first data line contains fewer empty fields:
maxempty = 0
for lineno, fields in enumerate(csv.reader(handle), 1):
    empty = fields.count("")
    if empty > maxempty:
        maxempty = empty
    elif empty < maxempty:
        print(lineno - 1)
        break
We simply print the line number of the last header line, since your question asks how many there are. Perhaps printing or returning the number of the first data line would make more sense in some scenarios.
This code doesn't use Pandas at all, just the regular csv module from the Python standard library. It stops reading when you hit break so it doesn't matter for performance how many lines there are after that (though if you need to experiment or debug, maybe create a smaller file with only, say, the first 200 lines of your real file).
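If you do want such a trimmed-down test file, a quick one-off sketch (the file names are placeholders):
from itertools import islice

# Copy only the first 200 lines of the big file into a small test file.
with open("big.csv") as src, open("small.csv", "w") as dst:
    dst.writelines(islice(src, 200))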

Use re.search to look for lines that have 2 or more letters in a row. Two is used instead of one so that scientific notation (e.g., 1.0e5) is not counted as a header.
# In the shell, create a test file:
# echo "foo,bar\nbaz,bletch\n1e4,2.0\n2E5,2" > in_file.csv
import re

num_header_lines = 0
for line in open('in_file.csv'):
    if re.search('[A-Za-z]{2,}', line):
        # count the header here
        num_header_lines += 1
    else:
        break
print(num_header_lines)
# 2

Well, I think that you could get the first line of the csv file and then split it on ",". That will return a list with all the headers in it. Now you can just count them with len.
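A minimal sketch of what this describes (note it counts the fields in the first row; the file name is a placeholder):
with open("your_file.csv", encoding="utf-8") as f:
    first_line = f.readline().rstrip("\n")
print(len(first_line.split(",")))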

Try this:
import pandas as pd
df = pd.read_csv('your_file.csv', index_col=0)
num_rows, num_cols = df.shape
Since I see you're worried about file size, breaking the file into chunks would work:
chunk_size = 10000
df = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                 low_memory=False)
I think you might get a variable number of rows if you read the DataFrame chunk by chunk, but if you're only interested in the number of columns this works easily.
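For example, a sketch that takes the column count from the first chunk only, reusing in_path and separator from the snippet above:
import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames, so the
# column count can be read off the first chunk without loading the rest.
reader = pd.read_csv(in_path, sep=separator, chunksize=10000, low_memory=False)
first_chunk = next(iter(reader))
num_cols = first_chunk.shape[1]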
You could also look into dask.dataframe

This only reads the first line of the csv:
import csv

with open('ornek.csv', newline='') as f:
    reader = csv.reader(f)
    row1 = next(reader)
    sizeOfHeader = len(row1)

Related

Joblib too slow using "if not in" loop

I am working with amino acid sequences using the Biopython parser, but regardless of the data format (the format is FASTA, i.e. you can imagine them as strings of letters preceded by an id, as shown below), my problem is that I have a huge amount of data, and despite having tried to parallelize with joblib, the estimated running time of this simple code is 400 hours.
Basically I have a file that contains a series of ids that I have to remove (ids_to_drop) from the original dataset (original_dataset), to create a new file (new_dataset) that contains all the ids contained in the original dataset without the ids_to_drop.
I've tried them all but I don't know how else to do it and I'm stuck right now. Thanks so much!
def file_without_ids_to_remove(seq):
    with open(new_output, "a") as f, open(ids_to_drop, "r") as r:  # output file, file with ids to remove
        remove = r.read().split("\n")
        if seq.id not in remove:
            SeqIO.write(seq, f, "fasta")

Parallel(n_jobs=10)(delayed(file_without_ids_to_remove)(seq) for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')))
To be clear this is an example of the data (sequence.id + sequence):
WP_051064487.1
MSSAAQTPEATSDVSDANAKQAEALRVASVNVNGIRASYRKGMAEWLAPRQVDILCLQEVRAPDEVVDGF
LADDWHIVHAEAEAKGRAGVLIASRKDSLAPDATRIGIGEEYFATAGRWVEADYTIGENAKKLTVISAYV
HSGEVGTQRQEDKYRFLDTMLERMAELAEQSDYALIVGDLNVGHTELDIKNWKGNVKNAGFLPEERAYFD
KFFGGGDTPGGLGWKDVQRELAGPVNGPYTWWSQRGQAFDNDTGWRIDYHMATPELFARAGNAVVDRAPS
YAERWSDHAPLLVDYTIR
UPDATE: I tried it the following way after the suggestion and it works.
with open(new_dataset, "w") as filtered:
    [SeqIO.write(seq, filtered, "fasta") for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')) if seq.id not in ids_to_remove]
This looks like a simple file filter operation. Turn the ids to remove into a set one time, and then just read/filter/write the original dataset. Sets are optimized for fast lookup. This operation will be I/O bound and would not benefit from parallelization.
with open("ids-to-remove") as f:
ids_to_remove = {seq_id_line.strip() for seq_id_line in f}
# just in case there are blank lines
if "" in ids_to_remove:
ids_to_remove.remove("")
with open("original-data-set") as orig, open("filtered-data-set", "w") as filtered:
filtered.writelines(line for line in orig if line.split()[0] not in ids_to_remove)

Reading numbers off a list from a txt file, but only up to a comma

This is data from a lab experiment (around 717 lines of data). Rather than trying to do it in Excel, I want to import and graph it in either Python or MATLAB. I'm new here btw... and am a student!
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
more numbers: see screenshot of more data from my file
I just can't figure out how to read the line up until a comma. Specifically, I need the Load numbers for one of my arrays/list, so for example on the first line I only need 62.638 (which would be the first number on my first index on my list/array).
How can I get an array/list of this, something that iterates/reads the list and ignores strings?
Thanks!
NOTE: I use Anaconda + Jupyter Notebooks for Python & Matlab (school provided software).
EDIT: Okay, so I came home today and worked on it again. I hadn't dealt with CSV files before, but after some searching I was able to learn how to read my file, somewhat.
import csv
from itertools import islice

with open('Blue_bar_GroupD.txt', 'r') as BB:
    BB_csv = csv.reader(BB)
    x = 0
    BB_lb = []
    while x < 7:  # to skip the string data
        next(BB_csv)
        x += 1
    for row in islice(BB_csv, 0, 758):
        print(row[0])  # testing if I can read row data
Okay, here is where I am stuck. I want to make an array/list that has the 0th index value of each row. Sorry if I'm a freaking noob!
Thanks again!
You can skip all lines till the first data row and then parse the data into a list for later use - 700+ lines can easily be processed in memory.
Therefore you need to:
read the file line by line
remember the last non-empty line seen so far (== the header line)
stop as soon as a line consists only of numbers, commas and dots (== the first data line), counting the skipped lines on the way
seek back to 0
skip enough lines to get to the header or the data
read the rest into a data structure
Create test file:
text = """
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
"""
with open ("t.txt","w") as w:
w.write(text)
Some helpers and the skipping/reading logic:
import re
import csv

def convert_row(row):
    """Convert one row of data into a list of mixed floats and strings.
    Float is the preferred data type, else the value is kept as a string."""
    d = []
    for v in row:
        try:
            # convert to float && add
            d.append(float(v))
        except ValueError:
            # not a float, append as is
            d.append(v)
    return d

def count_to_first_data(fh):
    """Count lines in fh not consisting of numbers, dots and commas.
    Side effect: will reset position in fh to 0."""
    skiplines = 0
    header_line = 0
    fh.seek(0)
    for line in fh:
        if re.match(r"^[\d.,]+$", line):
            fh.seek(0)
            return skiplines, header_line
        else:
            if line.strip():
                header_line = skiplines
            skiplines += 1
    raise ValueError("File does not contain pure number rows!")
Usage of helpers / data conversion:
data = []
skiplines = 0
with open("t.txt", "r") as csvfile:
    skip_to_data, skip_to_header = count_to_first_data(csvfile)
    for _ in range(skip_to_header):  # use skip_to_data if you do not want the headers
        next(csvfile)
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        row_data = convert_row(row)
        if row_data:
            data.append(row_data)
print(data)
Output (reformatted):
[['Load (lbf)', 'Time (s)', 'Crosshead (in)', 'Extensometer (in)'],
[62.638, 0.9, 0.0, 8e-05],
[122.998, 1.7, 0.001, 0.00012]]
Docs:
re.match
csv.reader
Methods of file objects (e.g. seek())
With this you now have "clean" data that you can use for further processing - including your headers.
For visualization you can have a look at matplotlib
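For example, a minimal sketch that plots Load against Time from the data list built above (assuming the column order shown in the sample output):
import matplotlib.pyplot as plt

header, rows = data[0], data[1:]   # first entry holds the column names
time = [row[1] for row in rows]    # "Time (s)" column
load = [row[0] for row in rows]    # "Load (lbf)" column
plt.plot(time, load)
plt.xlabel(header[1])
plt.ylabel(header[0])
plt.show()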
I would recommend reading your file with Python:
data = []
with open('my_txt.txt', 'r') as fd:
    # Skip the header lines
    for i in range(6):
        fd.readline()
    # Read the data lines, keeping only the value before the first comma
    for line in fd:
        index = line.find(',')
        if index >= 0:
            data.append(float(line[0:index]))
leads to a list containing your data of the first column
>>> data
[62.638, 122.998]
The MATLAB solution is less nice, since you have to know the number of data lines in your file (which you do not need to know in the python solution)
n_header = 6
n_lines = 2 % Insert here 717 (as you mentioned)
M = csvread('my_txt.txt', n_header, 0, [n_header 0 n_header+n_lines-1 0])
leads to:
>> M
M =
62.6380
122.9980
For the sake of clarity: you can also use MATLAB's textscan function to achieve what you want without knowing the number of lines, but still, the Python code would be the better choice in my opinion.
Based on your format, you will need three steps: one, read all lines; two, determine which lines to use; three, get the floats and assign them to a list.
Assuming your file name is name.txt, try:
f = open("name.txt", "r")
all_lines = f.readlines()
grid = []
for line in all_lines:
if ('"' not in line) and (line != '\n'):
grid.append(list(map(float, line.strip('\n').split(','))))
f.close()
The grid will then contain a series of lists containing your group of floats.
Explanation for fun:
In the "for" loop, i searched for the double quote to eliminate any string as all strings are concocted between quotes. The other one is for skipping empty lines.
Based on your needs, you can use the list grid as you please. For example, to fetch the first line's first number, do
grid[0][0]
as python's list counts from 0 to n-1 for n elements.
This is super simple in Matlab, just 2 lines:
data = dlmread('data.csv', ',', 6,0);
column1 = data(:,1);
Here 6 and 0 should be replaced by the row and column offset you want. In this case the data starts at row 7 and you want all the columns; the second line then copies the data in column 1 into another vector.
As another note, try typing doc dlmread in MATLAB - it brings up the help page for dlmread. This is really useful when you're looking for MATLAB functions, as it suggests similar functions at the bottom.

Pandas: how to read csv with multiple lines on the same cell?

I have a csv that I am not able to read using read_csv
Opening the csv with sublime text shows something like:
col1,col2,col3
text,2,3
more text,3,4
HELLO
THIS IS FUN
,3,4
As you can see, the text HELLO THIS IS FUN takes three lines, and pd.read_csv is confused as it thinks these are three new observations. How can I parse that correctly in Pandas?
Thanks!
It looks like you'll have to preprocess the data manually:
with open('data.csv', 'r') as f:
    lines = f.read().splitlines()

processed = []
buffer = ''
for line in lines:
    buffer += line                # Append the current line to the buffer
    cum_c = buffer.count(',')     # Cumulative number of commas in the buffer
    if cum_c == 2:                # A complete row always has exactly two commas
        processed.append(buffer)
        buffer = ''
    elif cum_c > 2:               # This should never happen
        raise ValueError('Too many fields in row: ' + buffer)
This assumes that your data only contains unwanted newlines, e.g. if you had data with say, 3 elements in one row, 2 elements in the next, then the next row should either be blank or contain only 1 element. If it has 2 or more, i.e. it's missing a necessary newline, then an error is thrown. You can accommodate this case if necessary with a minor modification.
Actually, it might be more efficient to remove newlines instead, but it shouldn't matter unless you have a lot of data.
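Once the logical rows are reassembled, one way to continue (a sketch, not the only option) is to hand them to pandas through an in-memory buffer:
import io
import pandas as pd

# 'processed' holds one complete logical row per entry, header line included.
df = pd.read_csv(io.StringIO("\n".join(processed)))
print(df)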

In Python, how can I find the location of information in a CSV file?

I have three very long CSV files, and I need some advice/help with manipulating the code. Basically, I want the program to be broad/basic enough that I can add any limitations and it'll still work.
For example, if I want to set the code to find where column 1==x and column 2==y, I want the code to also work if I want column 1!=r and column 2
import csv
file = input('csv files: ').split(',')
filters = input('Enter the filters: ').split(',')
f = open(csv_file,'r')
p=csv.reader(f)
header_eliminator = next(p,[])
I run into issues with the "file" part because if I choose to only use one file rather than the three I want to use now, it won't work. Same goes for the filters. The filters could be like
4==10,5>=4
this means that column 4 of the file(s) would equal 10 and column 5 of the files would be greater than or equal to 4. However, I might also want the filters to look like this:
1==4.333, 5=="6/1/2014 0:00:00", 6<=60.0, 7!=6
So I want to be able to use it for other things! I'm having so much trouble with this, do you have any advice on how to get started? Thanks!
Pandas is excellent for dealing with csv files. I'd recommend installing it: pip install pandas
Then, if you want to open 3 csv files and do checks on the columns, you'll just need to familiarize yourself with indexing in pandas. The only method you need to know for now is .iloc, since it seems you are indexing using the integer position of the columns.
import pandas as pd

files = input('Enter the csv files: ').split(',')
data = []
# Keeping a list of the files allows us to input a different number of files.
# We use pandas to read each file into a DataFrame, which is then stored as an
# element of the list. The length of the list is the number of files.
for names in files:
    data.append(pd.read_csv(names))

# You can then perform a check like this to see if column 2 of all files is equal to 3
print(all((i.iloc[:, 2] == 3).all() for i in data))
You can write a generator that takes a bunch of filenames and outputs the lines one by one, and feed that into csv.reader. The tricky part is the filter. If you let the filter be a single line of Python code, then you can use eval for that part. As an example:
import csv

# filenames = input('csv files: ').split(',')
# filters = input('Enter the filters: ').split(',')
# todo: for debug
# in this implementation, filters is a single python expression that can
# reference the 'col' variable which is a list of the current columns
filenames = 'a.csv,b.csv,c.csv'
filters = '"a" in col[0] and "2" in col[2]'

# todo: debug - generate test files
for name in 'abc':
    with open('{}.csv'.format(name), 'w') as fp:
        fp.write('the header row\n')
        for row in range(3):
            fp.write(','.join('{}{}{}'.format(name, row, col) for col in range(3)) + '\n')
def header_squash(filenames):
    """Iterate multiple files line by line after squashing the header line
    and any empty lines.
    """
    for filename in filenames:
        with open(filename) as fp:
            next(fp)
            for line in fp:
                if line.strip():
                    yield line

for col in csv.reader(header_squash(filenames.split(','))):
    # eval's namespace limits the damage untrusted code can do...
    if eval(filters, {'col': col}):
        # passed the filter, do the work
        print(col)

pulling subset of lines from files

I have files where there is a varying number of header lines in random order, followed by the data that I need, which spans the number of lines given by the corresponding header (e.g. Lines: 3):
from: blah#blah.com
Subject: foobarhah
Lines: 3
Extra: More random stuff
Foo Bar Lines of Data, which take up
some arbitrary long amount characters on a single line, but no matter how long
they still only take up the number of lines as specified in the header
How can I get at that data in one read of the file??
P.S. The data is from the 20Newsgroups corpus.
Edit: The quick solution, I guess, which only works if I relax the constraint on reading only once, is this:
[1st read] Find out total_num_of_lines and match on the first Lines: header,
[2nd read] Discard the first (total_num_of_lines - header_num_of_lines) lines and then read the rest of the file.
I'm still unaware of a way to read in the data in one pass though.
I'm not quite sure you even need the beginning of the file in order to get its contents. Consider using split:
_, contents = file_contents.split(os.linesep + os.linesep) # e.g. \n\n
If, however, the Lines parameter does matter, you can use the technique suggested above along with parsing the file headers:
headers, contents = file_contents.split(os.linesep + os.linesep)
# Get the declared number of lines
headers_list = [line.split() for line in headers.splitlines()]
lines_count = int([line[1] for line in headers_list if line[0].lower() == 'lines:'][0])
# Get the contents
real_contents = contents.splitlines()[:lines_count]
Assuming we have the general case where there could be multiple messages following each other, maybe something like
from itertools import takewhile

def msgreader(file):
    while True:
        header = list(takewhile(lambda x: x.strip(), file))
        if not header:
            break
        header_dict = {k: v.strip() for k, v in (line.split(":", 1) for line in header)}
        line_count = int(header_dict['Lines'])
        message = [next(file) for i in range(line_count)]  # or islice..
        yield message
would work, where
with open("53903") as fp:
for message in msgreader(fp):
print message
would give all the listed messages. For this particular use case the above would be overkill, but frankly it's not much harder to extract all the header info than it is only the one line. I'd be surprised if there weren't already a module to parse these messages, though.
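One candidate from the standard library is the email package, since these files are essentially RFC 822-style messages (header block, blank line, body); a rough sketch, reusing the example file name from above:
from email.parser import Parser

with open("53903") as fp:
    msg = Parser().parse(fp)
print(msg["Lines"])        # header value as a string, e.g. "3"
print(msg.get_payload())   # everything after the first blank line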
You need to store the state of whether the headers have finished. That's all.
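A minimal sketch of that idea, using a boolean flag (again reusing the example file name from above):
in_body = False
body_lines = []
with open("53903") as fp:
    for line in fp:
        if in_body:
            body_lines.append(line)
        elif not line.strip():   # the first blank line ends the headers
            in_body = True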
