Pandas: how to read csv with multiple lines on the same cell? - python

I have a csv that I am not able to read using read_csv.
Opening the csv with Sublime Text shows something like:
col1,col2,col3
text,2,3
more text,3,4
HELLO
THIS IS FUN
,3,4
As you can see, the text HELLO THIS IS FUN takes three lines, and pd.read_csv is confused as it thinks these are three new observations. How can I parse that correctly in Pandas?
Thanks!

It looks like you'll have to preprocess the data manually:
with open('data.csv', 'r') as f:
    lines = f.read().splitlines()

processed = []
buffer = ''
for line in lines:
    buffer += line                # Append the current line to a buffer
    cum_c = buffer.count(',')     # Cumulative number of commas in the buffer
    if cum_c == 2:                # two commas == three fields == one complete row
        processed.append(buffer)
        buffer = ''
    elif cum_c > 2:
        raise ValueError("row has too many fields")  # This should never happen
This assumes that your data only contains unwanted newlines, e.g. if you had data with say, 3 elements in one row, 2 elements in the next, then the next row should either be blank or contain only 1 element. If it has 2 or more, i.e. it's missing a necessary newline, then an error is thrown. You can accommodate this case if necessary with a minor modification.
Actually, it might be more efficient to remove newlines instead, but it shouldn't matter unless you have a lot of data.
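As a rough follow-up sketch (assuming the repaired rows from the snippet above; processed already contains the header row plus the data rows, one complete row per string), you can then hand them back to pandas through an in-memory buffer:
import io
import pandas as pd

# Rebuild the csv text from the repaired rows and let pandas parse it normally
df = pd.read_csv(io.StringIO('\n'.join(processed)))
print(df)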

Related

Python: count headers in a csv file

I want to know the number of header rows my csv file contains (between 0 and ~50). The file itself is huge (so reading the complete file for this is not an option) and contains numerical data.
I know that csv.Sniffer has a has_header() function, but that can only detect 1 header.
One idea I had is to recursively call the has_header function (supposing it detects the first header) and then count the recursions. I am sure, though, that there is a much smarter way.
Googling was kind of a pain, since no matter what you search, if it includes "count" and "csv" at some point, you get all the "count rows in csv" results :D
Clarification:
With number of headers I mean the number of rows containing information which is not data. There is no general rule for the headers (they could be text, floats, or whitespace) and they may be a single line of text. The data itself, however, is only floats. For me this was super clear, because I've been working with these files for a long time, but I forgot this isn't the normal case.
I hoped there was an easy and smart built-in function from Numpy or Pandas, but it doesn't seem so.
Inspired by the comments so far, I think my best bet is to:
read 100 lines
count the number of separators in each line
determine the most common number of separators per line
starting from the end of those 100 lines, find the first line with a different number of separators, or one that isn't all floats; that line is the last header line. A rough sketch of this idea follows below.
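Here is a rough sketch of that plan (the file name 'data.csv' and the assumption that data rows are purely floats are placeholders for illustration):
import csv
from collections import Counter

with open('data.csv', newline='') as f:
    sample = [row for _, row in zip(range(100), csv.reader(f))]  # read at most 100 lines

# The most common field count among the sampled rows is taken to be the data width
data_width = Counter(len(row) for row in sample).most_common(1)[0][0]

def is_data_row(row):
    """A data row has the common width and consists only of floats."""
    if len(row) != data_width:
        return False
    try:
        [float(v) for v in row]
        return True
    except ValueError:
        return False

# Walk the sample from the end; the first non-data row found is the last header line
num_headers = 0
for i in range(len(sample) - 1, -1, -1):
    if not is_data_row(sample[i]):
        num_headers = i + 1
        break
print(num_headers)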
Here's a sketch for finding the first line which matches a particular criterion. For demo purposes, I use the criterion "there are empty fields":
import csv

with open(filename, "r", encoding="utf-8") as handle:
    for lineno, fields in enumerate(csv.reader(handle), 1):
        if "" in fields:
            print(lineno - 1)
            break
You'd update it to look for something which makes sense for your data, like perhaps "the third and eighth fields contain numbers":
try:
    float(fields[2])
    float(fields[7])
    print(lineno - 1)
    break
except ValueError:
    continue
(notice how the list fields is indexed starting at zero, so the first field is fields[0] and the third is fields[2]), or perhaps a more sophisticated model where the first line contains no empty fields, successive lines contain more and more empty fields, and then the first data line contains fewer empty fields:
maxempty = 0
for lineno, fields in enumerate(csv.reader(handle), 1):
    empty = fields.count("")
    if empty > maxempty:
        maxempty = empty
    elif empty < maxempty:
        print(lineno - 1)
        break
We simply print the line number of the last header line, since your question asks how many there are. Perhaps printing or returning the number of the first data line would make more sense in some scenarios.
This code doesn't use Pandas at all, just the regular csv module from the Python standard library. It stops reading when you hit break so it doesn't matter for performance how many lines there are after that (though if you need to experiment or debug, maybe create a smaller file with only, say, the first 200 lines of your real file).
Use re.search to find lines that have two or more letters in a row. Two is used instead of one so that scientific notation (e.g., 1.0e5) is not counted as a header.
# In the shell, create a test file:
# echo "foo,bar\nbaz,bletch\n1e4,2.0\n2E5,2" > in_file.csv
import re

num_header_lines = 0
for line in open('in_file.csv'):
    if re.search('[A-Za-z]{2,}', line):
        # count the header here
        num_header_lines += 1
    else:
        break
print(num_header_lines)
# 2
Well, I think that you could get the first line of the csv file and then split it by a ",". That will return an array with all the headers in it. Now you can just count them with len.
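A minimal sketch of that idea (note that it counts the columns in the first line, not the number of header rows; the file name is a placeholder):
with open('your_file.csv') as f:
    first_line = f.readline()

headers = first_line.rstrip('\n').split(',')
print(len(headers))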
Try this:
import pandas as pd
df = pd.read_csv('your_file.csv', index_col=0)
num_rows, num_cols = df.shape
Since I see you're worried about file size, breaking the file into chunks would work:
chunk_size = 10000
df = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                 low_memory=False)
I think you might get a variable number of rows if you read the df chunk by chunk but if you're only interested in number of columns this would work easily.
You could also look into dask.dataframe
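For example, a small sketch with dask (assuming it is installed; dask reads lazily, so the column names are available without loading the whole file):
import dask.dataframe as dd

# Only metadata and a sample are read at this point, not the full file
ddf = dd.read_csv('your_file.csv')
print(len(ddf.columns))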
This only reads the first line of the csv:
import csv

with open('ornek.csv', newline='') as f:
    reader = csv.reader(f)
    row1 = next(reader)
    sizeOfHeader = len(row1)

Searching for float value in .txt file

I have a program that saves a .txt log with multiple values and text. The time values can differ each time the program is run, but their place in this .txt file is always the same. I need to grab those values and check if they are smaller than 6 seconds. I was trying to convert my .txt file to a string, and then use:
x = string(filename[50:55])
float(x)
But it doesn't seem to work. How can I extract values from a fixed place in a .txt file and then convert them to float numbers?
//EDIT:
Photo of my log below; I want to check the values marked with the blue line:
//EDIT2:
Photo of another log; how would I extract those percentages? They must be below 70 to pass.
//EDIT3: Photo of the error I got and a fragment of the code:
with open(r'C:\Users\Time_Log.txt') as f:
    print(f)
    lines = f.readlines()
    print(r'Lines: ')
    print(lines)
    print(r'Type of lines:', type(lines))
    # start at 1 to avoid the header
    for line in range(1, len(lines)):
        print(r'Type of line:', type(line))
        splits = line.split("|")
        print(r'Type of splits:', type(splits))
        t = splits[3].split(".")
        if t[0] < 6:
            print(r'Test completed. Startup time is shorter than 6 seconds')
With an example it would be much simpler, but assuming that your values are separated by a fixed delimiter (say a space or a tab), you'd split the string on that delimiter and look at the elements you want to compare. For example, if the time is your 6th item, you'd call string.split(" ") and pick splitted_str[5]. You can go further if your time format follows a regular pattern (e.g. hours:minutes:seconds) and do the math yourself, or you could even use packages like datetime or time to convert them into a time object, which could do more useful comparisons.
So the question is basically how well formatted your values are.
Edit:
So given the example you could:
with open("filename.txt") as f:
lines = f.readlines()
# start at 1 to avoid the header
for i in range(3, len(lines)):
splits = lines[i].split("|")
t = str(splits[3]).split(".")
if int(t[0)] < 6:
[do something with that line]

Reading numbers off a list from a txt file, but only up to a comma

This is data from a lab experiment (around 717 lines of data). Rather than trying to Excel it, I want to import and graph it in either Python or Matlab. I'm new here btw... and am a student!
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
more numbers : see Screenshot of more data from my file
I just can't figure out how to read each line only up until the comma. Specifically, I need the Load numbers for one of my arrays/lists, so for example on the first line I only need 62.638 (which would be the first number at the first index of my list/array).
How can I get an array/list of this, something that iterates/reads the list and ignores strings?
Thanks!
NOTE: I use Anaconda + Jupyter Notebooks for Python & Matlab (school provided software).
EDIT: Okay, so I came home today and worked on it again. I hadn't dealt with CSV files before, but after some searching I was able to learn how to read my file, somewhat.
import csv
from itertools import islice

with open('Blue_bar_GroupD.txt', 'r') as BB:
    BB_csv = csv.reader(BB)
    x = 0
    BB_lb = []
    while x < 7:  # to skip the string data
        next(BB_csv)
        x += 1
    for row in islice(BB_csv, 0, 758):
        print(row[0])  # testing if I can read row data
Okay, here is where I am stuck. I want to make an array/list that holds the 0th-index value of each row. Sorry if I'm a freaking noob!
Thanks again!
You can skip all lines up to the first data row and then parse the data into a list for later use - 700+ lines can easily be processed in memory.
Therefore you need to:
read the file line by line
remember the last non-empty line before the number/comma/dot rows ( == header )
check whether the line consists only of numbers/commas/dots, else increase a skip-counter ( == data )
seek to 0
skip enough lines to get to header or data
read the rest into a data structure
Create test file:
text = """
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
"""
with open ("t.txt","w") as w:
w.write(text)
Some helpers and the skipping/reading logic:
import re
import csv
def convert_row(row):
    """Convert one row of data into a list of mixed floats and others.
    Float is the preferred data type, else the string is kept as is - no other type is tried."""
    d = []
    for v in row:
        try:
            # convert to float && add
            d.append(float(v))
        except ValueError:
            # not a float, append as is
            d.append(v)
    return d
def count_to_first_data(fh):
    """Count lines in fh not consisting of numbers, dots and commas.
    Side effect: will reset position in fh to 0."""
    skiplines = 0
    header_line = 0
    fh.seek(0)
    for line in fh:
        if re.match(r"^[\d.,]+$", line):
            fh.seek(0)
            return skiplines, header_line
        else:
            if line.strip():
                header_line = skiplines
            skiplines += 1
    raise ValueError("File does not contain pure number rows!")
Usage of helpers / data conversion:
data = []
skiplines = 0
with open("t.txt", "r") as csvfile:
    skip_to_data, skip_to_header = count_to_first_data(csvfile)
    for _ in range(skip_to_header):  # use skip_to_data if you do not want the headers
        next(csvfile)
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        row_data = convert_row(row)
        if row_data:
            data.append(row_data)
print(data)
Output (reformatted):
[['Load (lbf)', 'Time (s)', 'Crosshead (in)', 'Extensometer (in)'],
[62.638, 0.9, 0.0, 8e-05],
[122.998, 1.7, 0.001, 0.00012]]
Docs:
re.match
csv.reader
Methods of file objects (e.g. seek())
With this you now have "clean" data that you can use for further processing - including your headers.
For visualization you can have a look at matplotlib
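For instance, a small sketch (assuming matplotlib is installed and that data was built by the snippet above, header row first) plotting Load over Time:
import matplotlib.pyplot as plt

rows = data[1:]               # drop the header row kept by the snippet above
time = [r[1] for r in rows]   # 'Time (s)' column
load = [r[0] for r in rows]   # 'Load (lbf)' column

plt.plot(time, load, marker='o')
plt.xlabel('Time (s)')
plt.ylabel('Load (lbf)')
plt.show()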
I would recommend reading your file with python
data = []
with open('my_txt.txt', 'r') as fd:
    # Skip the header lines
    for i in range(6):
        fd.readline()
    # Read each data line only up to the first comma
    for line in fd:
        index = line.find(',')
        if index >= 0:
            data.append(float(line[0:index]))
This leads to a list containing your data from the first column:
>>> data
[62.638, 122.998]
The MATLAB solution is less nice, since you have to know the number of data lines in your file (which you do not need to know in the python solution)
n_header = 6
n_lines = 2 % Insert here 717 (as you mentioned)
M = csvread('my_txt.txt', n_header, 0, [n_header 0 n_header+n_lines-1 0])
leads to:
>> M
M =
62.6380
122.9980
For the sake of clarity: You can also use MATLABs textscan function to achieve what you want without knowing the number of lines, but still, the python code would be the better choice in my opinion.
Based on your format, you will need to do 3 steps: one, read all lines; two, determine which lines to use; and last, get the floats and assign them to a list.
Assuming your file name is name.txt, try:
f = open("name.txt", "r")
all_lines = f.readlines()
grid = []
for line in all_lines:
if ('"' not in line) and (line != '\n'):
grid.append(list(map(float, line.strip('\n').split(','))))
f.close()
The grid will then contain a series of lists containing your group of floats.
Explanation for fun:
In the "for" loop, i searched for the double quote to eliminate any string as all strings are concocted between quotes. The other one is for skipping empty lines.
Based on your needs, you can use the list grid as you please. For example, to fetch the first line's first number, do
grid[0][0]
as python's list counts from 0 to n-1 for n elements.
This is super simple in Matlab, just 2 lines:
data = dlmread('data.csv', ',', 6,0);
column1 = data(:,1);
Where 6 and 0 should be replaced by the row and column offset you want. So in this case, the data starts at row 7 and you want all the columns, then just copy over the data in column 1 into another vector.
As another note, try typing doc dlmread in matlab - it brings up the help page for dlmread. This is really useful when you're looking for matlab functions, as it has other suggestions for similar functions down the bottom.

Keeping header rows from txt file, while altering rest of data

I have a number of txt files that represent spatial data in a grid form, essentially arrays of the same dimensions in which each value signifies a trait about the corresponding parcel of land. I have been trying to script a sequence that imports each file, adds "-9999" on the border of the entire grid, and saves out to an otherwise identical txt file.
The first 6 rows of each txt file are header rows, and shouldn't be changed.
My progress is as follows:
for datfile in spatialfiles:
    results = []
    borderrow = []
    with open('{}.txt'.format(datfile)) as inputfile:
        #header = inputfile.readlines()
        for line in inputfile:
            row = ['-9999'] + line.strip().split(' ') + ['-9999']
            results.append(row)
        for cell in range(len(row)):
            borderrow.append('-9999')
        results = [borderrow] + results[6:] + [borderrow]
    with file("{}-new.txt".format(datfile), 'w') as outputFile:
        for row in header[:6]:
            outputFile.write(row)
        for row in results:
            outputFile.write(row)
"header = inputfile.readlines()" has been commented out because it seems to cause a NameError in which "row" is no longer recognized. At the same time, I haven't found another way to retain the 6 header rows for exporting later.
Why does readlines() seem to alter the ability to iterate through the lines of the inputfile when it is only being used to write to a variable? What am I missing? (Any other pointers on my undoubtedly bloated code always welcome!)
readlines() reads the whole file into memory, parses it into a list, and leaves the file position at the end of the file. When you then iterate over the same file object again, it resumes from that position, which is already the end, so the loop body never runs and row is never assigned - hence the NameError. Call readlines() once and loop through the resulting list with a counter that changes the loop's behavior after the first 6 lines.
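A minimal sketch of that suggestion, under the same assumptions as the question's code (space-delimited grids with exactly 6 header rows; the border handling is only illustrative):
for datfile in spatialfiles:
    with open('{}.txt'.format(datfile)) as inputfile:
        lines = inputfile.readlines()        # read the file once

    header = lines[:6]                       # keep the 6 header rows unchanged
    results = []
    for line in lines[6:]:
        row = ['-9999'] + line.strip().split(' ') + ['-9999']
        results.append(' '.join(row) + '\n')

    # Border rows of -9999 matching the (padded) grid width
    width = len(results[0].split()) if results else 0
    borderrow = ' '.join(['-9999'] * width) + '\n'
    results = [borderrow] + results + [borderrow]

    with open('{}-new.txt'.format(datfile), 'w') as outputFile:
        outputFile.writelines(header + results)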

Combining columns of CSV files of unknown lengths but same width in Python

I have an unknown number of input csv files that look more or less like this (set width, various lengths):
Header1, Header2, Header3, Header4
1,2,3,4
11,22,33,44
1,2,3,4
The output looks like this.
Header1,Header3, ,Header1,Header3, ,...
1,3, ,1,3, ,...
...
Currently I can read all the input files into strings, and I know how to read the first line of each file and print it in the desired format, but I am stuck on how to make a loop that goes to the next line of each file and prints that data. Since the files are of different lengths, when one ends I don't know how to handle that and put in blank spaces as placeholders to keep the format. Below is my code.
csvs = []
hold = []
i = 0  # was i=-1 to start, improved
for files in names:
    i = i + 1
    csvs.append([i])
    hold.append([i])
#z=0
for z in range(i):
    # putting csv files into strings
    csvs[z] = csv.reader(open(names[z], 'rb'), delimiter=',')
line = []
#z=0
for z in range(i):
    hold[z] = csvs[z].next()
    line = line + [hold[z][0], hold[z][3], ' ']
print line
writefile.writerow(line)
names is the list that holds the csv file paths. Also, I am fairly new to this, so if you see some place where I could do things better, I am all ears.
Let's assume that you know how to merge lines when some files are longer than others. Here's a way to make iteration over lines and files easier.
import csv
from itertools import izip_longest
# http://docs.python.org/library/itertools.html#itertools.izip_longest

# get a list of open readers using a list comprehension
readers = [csv.reader(open(fname, "r")) for fname in list_of_filenames]
# open writer
output_csv = csv.writer(...)
for bunch_of_lines in izip_longest(*readers, fillvalue=['', '', '', '']):
    # Here bunch_of_lines is a tuple of lines read from each reader,
    # e.g. all first lines, all second lines, etc.
    # When one file is past EOF but others aren't, you get fillvalue for its line.
    merged_row = []
    for line in bunch_of_lines:
        # if it's a real line, you have 4 items of data.
        # if the file is past EOF, the line is fillvalue from above,
        # which again is guaranteed to have 4 items of data, all empty strings.
        merged_row.extend([line[1], line[3]])  # put columns 1 and 3
    output_csv.writerow(merged_row)
This code stops only after the longest file is over, and the loop is only 5 lines of code.
I think you'll figure headers yourself.
A note: in Python, you need range() and integer-indexed access to lists quite rarely, after you have understood how for loops and list comprehensions work. In Python, for is what foreach is in other languages; it has nothing to do with indices.
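For example, instead of indexing with for z in range(i), the readers from the snippet above can be iterated directly (sketch only, reusing the column choices 0 and 3 from the question):
line = []
for reader in readers:        # iterate the readers themselves, no indices needed
    first_row = next(reader)
    line += [first_row[0], first_row[3], ' ']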
This doesn't give the spare commas you showed in your output, but that wouldn't be hard to add by just popping an extra blank field into data each time we append to it:
import csv

names = ['test1.csv', 'test2.csv']
csvs = []
done = []
for name in names:
    csvs.append(csv.reader(open(name, 'rb')))
    done.append(False)

while not all(done):
    data = []
    for i, c in enumerate(csvs):
        if not done[i]:
            try:
                row = c.next()
            except StopIteration:
                done[i] = True
        if done[i]:
            data.append('')
            data.append('')
            # data.append('') <-- here
        else:
            data.append(row[0])
            data.append(row[3])
            # data.append('') <-- and here for extra commas
    if not all(done):
        print ','.join(data)
Also, I don't close anything explicitly, which you should do if this were part of a long running process.
