I've got a little bit of a headache here. I'm new to working with CSVs in Python, but here's what I've got: a CSV, generated by airodump (this is an art project/technology demo in which I'm trying to count the number of BSSIDs in a given area). I'm working in Python 3.4 and trying to read the file in. I had a hiccup with some null values, but I seem to have worked around that (hence the perhaps odd way I open the file).
import csv
print ("nibbler starting")
log_initial = open("test.csv", "rU")
logReader = csv.reader((line.replace('\0',' ') for line in log_initial), delimiter=",")
logData = list(logReader)
row_count1 = sum(1 for row in logData)
row_count2 = sum(1 for row in logReader)
print ('Rows = ' + str(row_count1))
print ('Rows = ' + str(row_count2))
for row in logReader:
    print('Row #'+ str(logReader.line_num)+' '+str(row))
So here's what happens. The first line printed says there are 52 rows (correct!), the second line printed says there are 0 rows (oh no, that's not what I expected).
No surprise then, the for loop delivers nothing. If I replace logReader with logData in the loop, then I get the whole file listed line by line. I'd stick with the data as a list, but that limits what I can do with it.
So somehow the file isn't getting properly processed as a CSV. I'm not really sure what to do about that. Any ideas?
logData = list(logReader)
reads all the rows in logReader. Reading logReader again will produce no rows as the iterator (the reader object) has already been exhausted.
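To make the exhaustion visible, here is a minimal sketch. It assumes an in-memory io.StringIO with three made-up rows in place of test.csv, and the variable names are only for illustration:

```python
import csv
import io

# Stand-in for the opened file; the three rows are made-up sample data.
log_initial = io.StringIO("a,b\n1,2\n3,4\n")

log_reader = csv.reader(log_initial)
log_data = list(log_reader)  # consumes every row from the reader

rows_in_list = len(log_data)                      # the list can be reused freely
rows_left_in_reader = sum(1 for _ in log_reader)  # the reader is already exhausted

# To iterate with a reader again, rewind the file and build a new reader.
log_initial.seek(0)
second_pass = list(csv.reader(log_initial))

print(rows_in_list, rows_left_in_reader, len(second_pass))
```

If the reader is fed from a generator expression, as in the question, the generator has to be rebuilt after seek(0) as well, since it is exhausted in the same way.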
Related
I want to know the number of header rows my CSV file contains (between 0 and ~50). The file itself is huge (so not reading the complete file for this is mandatory) and contains numerical data.
I know that csv.Sniffer has a has_header() function, but that can only detect 1 header.
One idea I had is to recursively call the has_header function (supposing it detects the first header) and then count the recursions. I am sure, though, that there is a much smarter way.
Googling was kind of a pain, since no matter what you search, if it includes "count" and "csv" at some point, you get all the "count rows in csv" results :D
Clarification:
By number of headers I mean the number of rows containing information which is not data. There is no general rule for the headers (they could be text, floats, or white space) and there may be only a single line of text. The data itself, however, is only floats. For me this was super clear, because I've been working with these files for a long time, but I forgot this isn't the normal case.
I hoped there was an easy and smart builtin function from Numpy or Pandas, but it doesn't seem so.
Inspired by the comments so far, I think my best bet is to
read 100 lines
count number of separators in each line
determine most common number of separators per line
Coming from the end of the 100 lines, find the first line that has a different number of separators or isn't all floats. That line is the last header line.
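The four steps above could be sketched roughly as follows. The function names count_header_lines and is_float_row and the sample text are made up for illustration, and the sketch assumes that data rows outnumber header rows within the sampled lines (otherwise the "most common" separator count would describe the headers instead):

```python
import io
from collections import Counter
from itertools import islice


def is_float_row(fields):
    """Return True if every field parses as a float."""
    try:
        for field in fields:
            float(field)
    except ValueError:
        return False
    return True


def count_header_lines(handle, sep=",", sample_size=100):
    """Estimate the number of header lines from the first sample_size lines."""
    lines = [line.rstrip("\n") for line in islice(handle, sample_size)]
    if not lines:
        return 0
    # The most common separator count in the sample is taken as the data layout.
    typical = Counter(line.count(sep) for line in lines).most_common(1)[0][0]
    # Walk backwards; the first line that does not look like data is the last header.
    for index in range(len(lines) - 1, -1, -1):
        line = lines[index]
        if line.count(sep) != typical or not is_float_row(line.split(sep)):
            return index + 1
    return 0


# Made-up sample: two text lines and a blank line, then three data rows.
sample = io.StringIO("Experiment 7\nx,y,z\n\n1.0,2.0,3.0\n4.0,5.0,6.0\n7.5,8.5,9.5\n")
print(count_header_lines(sample))
```

Only the first sample_size lines are ever read, so the size of the rest of the file does not matter. A blank line in the middle of the data would be mistaken for a header, so the criterion may need adjusting for real files.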
Here's a sketch for finding the first line which matches a particular criterion. For demo purposes, I use the criterion "there are empty fields":
import csv
with open(filename, "r", encoding="utf-8") as handle:
    for lineno, fields in enumerate(csv.reader(handle), 1):
        if "" in fields:
            print(lineno-1)
            break
You'd update it to look for something which makes sense for your data, like perhaps "third and eighth fields contain numbers":
try:
    float(fields[2])
    float(fields[7])
    print(lineno-1)
    break
except ValueError:
    continue
(notice how the list fields is indexed starting at zero, so the first field is fields[0] and the third is fields[2]), or perhaps a more sophisticated model where the first line contains no empty fields, successive lines contain more and more empty fields, and then the first data line contains fewer empty fields:
maxempty = 0
for lineno, fields in enumerate(csv.reader(handle), 1):
    empty = fields.count("")
    if empty > maxempty:
        maxempty = empty
    elif empty < maxempty:
        print(lineno-1)
        break
We simply print the line number of the last header line, since your question asks how many there are. Perhaps printing or returning the number of the first data line would make more sense in some scenarios.
This code doesn't use Pandas at all, just the regular csv module from the Python standard library. It stops reading when you hit break so it doesn't matter for performance how many lines there are after that (though if you need to experiment or debug, maybe create a smaller file with only, say, the first 200 lines of your real file).
Use re.search to search for lines that have 2 or more letters in a row. Two is used instead of one so that scientific notation (e.g., 1.0e5) is not counted as a header.
# In the shell, create a test file:
# printf 'foo,bar\nbaz,bletch\n1e4,2.0\n2E5,2\n' > in_file.csv
import re
num_header_lines = 0
with open('in_file.csv') as in_file:
    for line in in_file:
        if re.search('[A-Za-z]{2,}', line):
            # count the header here
            num_header_lines += 1
        else:
            break
print(num_header_lines)
# 2
Well, I think that you could get the first line of the csv file and then split it by a ",". That will return a list with all the headers in it. Now you can just count them with len.
Try this:
import pandas as pd
df = pd.read_csv('your_file.csv', index_col=0)
num_rows, num_cols = df.shape
Since I see you're worried about file size, breaking the file into chunks would work:
chunk_size = 10000
df = pd.read_csv(in_path,sep=separator,chunksize=chunk_size,
low_memory=False)
I think you might get a variable number of rows if you read the df chunk by chunk but if you're only interested in number of columns this would work easily.
You could also look into dask.dataframe
This reads only the first line of the CSV:
import csv
with open('ornek.csv', newline='') as f:
    reader = csv.reader(f)
    row1 = next(reader)
    sizeOfHeader = len(row1)
This is data from a lab experiment (around 717 lines of data). Rather than trying to work on it in Excel, I want to import and graph it in either Python or Matlab. I'm new here btw... and am a student!
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
(more numbers follow; see the screenshot of more data from my file)
I just can't figure out how to read the line up until a comma. Specifically, I need the Load numbers for one of my arrays/list, so for example on the first line I only need 62.638 (which would be the first number on my first index on my list/array).
How can I get an array/list of this, something that iterates/reads the list and ignores strings?
Thanks!
NOTE: I use Anaconda + Jupyter Notebooks for Python & Matlab (school provided software).
EDIT: Okay, so I came home today and worked on it again. I hadn't dealt with CSV files before, but after some searching I was able to learn how to read my file, somewhat.
import csv
from itertools import islice
with open('Blue_bar_GroupD.txt','r') as BB:
    BB_csv = csv.reader(BB)
    x = 0
    BB_lb = []
    while x < 7: #to skip the string data
        next(BB_csv)
        x+=1
    for row in islice(BB_csv,0,758):
        print(row[0]) #testing if I can read row data
Okay, here is where I am stuck. I want to make an array/list that has the 0th index value of each row. Sorry if I'm a freaking noob!
Thanks again!
You can skip all lines till the first data row and then parse the data into a list for later use - 700+ lines can easily be processed in memory.
To do that, you need to:
read the file line by line
remember the last non-empty line before number/comma/dot ( == header )
see if the line is only number/comma/dot, else increase a skip-counter (== data )
seek to 0
skip enough lines to get to header or data
read the rest into a data structure
Create test file:
text = """
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
"""
with open("t.txt", "w") as w:
    w.write(text)
Some helpers and the skipping/reading logic:
import re
import csv

def convert_row(row):
    """Convert one row of data into a list of mixed floats and others.
    Float is the preferred data type, else the string is kept - nothing else is tried."""
    d = []
    for v in row:
        try:
            # convert to float and add
            d.append(float(v))
        except ValueError:
            # not a number, append as is
            d.append(v)
    return d

def count_to_first_data(fh):
    """Count lines in fh not consisting of numbers, dots and commas.
    Side effect: will reset position in fh to 0."""
    skiplines = 0
    header_line = 0
    fh.seek(0)
    for line in fh:
        if re.match(r"^[\d.,]+$", line):
            # first pure-number row found: rewind and report the counts
            fh.seek(0)
            return skiplines, header_line
        else:
            if line.strip():
                # remember the last non-empty line before the data
                header_line = skiplines
            skiplines += 1
    raise ValueError("File does not contain pure number rows!")
Usage of helpers / data conversion:
data = []
skiplines = 0
with open("t.txt","r") as csvfile:
    skip_to_data, skip_to_header = count_to_first_data(csvfile)
    for _ in range(skip_to_header): # skip_to_data if you do not want the headers
        next(csvfile)
    reader = csv.reader(csvfile, delimiter=',',quotechar='"')
    for row in reader:
        row_data = convert_row(row)
        if row_data:
            data.append(row_data)
print(data)
Output (reformatted):
[['Load (lbf)', 'Time (s)', 'Crosshead (in)', 'Extensometer (in)'],
[62.638, 0.9, 0.0, 8e-05],
[122.998, 1.7, 0.001, 0.00012]]
Documentation:
re.match
csv.reader
Methods of file objects (e.g. seek())
With this you now have "clean" data that you can use for further processing - including your headers.
For visualization you can have a look at matplotlib
I would recommend reading your file with Python:
data = []
with open('my_txt.txt', 'r') as fd:
    # Skip header lines
    for i in range(6):
        fd.readline()
    # Read data lines up to the first comma
    for line in fd:
        index = line.find(',')
        if index >= 0:
            data.append(float(line[0:index]))
This leads to a list containing the first column of your data:
>>> data
[62.638, 122.998]
The MATLAB solution is less nice, since you have to know the number of data lines in your file (which you do not need to know in the Python solution):
n_header = 6
n_lines = 2 % Insert here 717 (as you mentioned)
M = csvread('my_txt.txt', n_header, 0, [n_header 0 n_header+n_lines-1 0])
leads to:
>> M
M =
62.6380
122.9980
For the sake of clarity: you can also use MATLAB's textscan function to achieve what you want without knowing the number of lines, but still, the Python code would be the better choice in my opinion.
Based on your format, you will need three steps: first, read all lines; second, determine which lines to use; last, convert the floats and add them to a list.
Assuming your file name is name.txt, try:
f = open("name.txt", "r")
all_lines = f.readlines()
grid = []
for line in all_lines:
    if ('"' not in line) and (line != '\n'):
        grid.append(list(map(float, line.strip('\n').split(','))))
f.close()
The grid will then contain a series of lists, each holding one row of floats.
Explanation for fun:
In the "for" loop, I check for a double quote to eliminate any strings, as all strings are enclosed in quotes. The other condition skips empty lines.
Based on your needs, you can use the list grid as you please. For example, to fetch the first line's first number, do
grid[0][0]
as Python lists are indexed from 0 to n-1 for n elements.
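Since the question asks for the Load values on their own, here is a small sketch of pulling one column out of such a grid. The grid below is hard-coded from the two sample rows in the question, and the names loads, load, time_s, crosshead and extensometer are only illustrative:

```python
# Hard-coded stand-in for the grid built above (the two sample rows from the question).
grid = [
    [62.638, 0.900, 0.000, 0.00008],
    [122.998, 1.700, 0.001, 0.00012],
]

# First column (Load) as its own list.
loads = [row[0] for row in grid]

# Transposing with zip gives one tuple per column instead.
load, time_s, crosshead, extensometer = zip(*grid)

print(loads)
print(time_s)
```

The list comprehension is enough if only one column is needed; the zip(*grid) form is handy when several columns are going to be plotted against each other.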
This is super simple in Matlab, just 2 lines:
data = dlmread('data.csv', ',', 6,0);
column1 = data(:,1);
Where 6 and 0 should be replaced by the row and column offset you want. So in this case, the data starts at row 7 and you want all the columns, then just copy over the data in column 1 into another vector.
As another note, try typing doc dlmread in matlab - it brings up the help page for dlmread. This is really useful when you're looking for matlab functions, as it has other suggestions for similar functions down the bottom.
I have two text files with numbers that I want to do some very easy calculations on (for now). I thought I would go with Python. I have two file readers for the two text files:
with open('one.txt', 'r') as one:
    one_txt = one.readline()
    print(one_txt)
with open('two.txt', 'r') as two:
    two_txt = two.readline()
    print(two_txt)
Now to the fun (and for me hard) part. I would like to loop through all the numbers in the second text file and then subtract it with the second number in the first text file.
I have done this (extended the code above):
with open('two.txt') as two_txt:
    for line in two_txt:
        print(line)
I don't know how to proceed now, because I think that the second text file would need to be converted to a string in order to do some parsing so I get the numbers I want. The text file (two.txt) looks like this:
Start,End
2432009028,2432009184,
2432065385,2432066027,
2432115011,2432115211,
2432165329,2432165433,
2432216134,2432216289,
2432266528,2432266667,
I want to loop through this, ignore the Start,End (first line) and then, on each pass, pick only the first value before the comma. The result would be:
2432009028
2432065385
2432115011
2432165329
2432216134
2432266528
Which I would then subtract with the second value in one.txt (contains numbers only and no strings whatsoever) and print the result.
There are many ways to do string operations and I feel lost, for instance I don't know if the methods to read everything to memory are good or not.
Any examples on how to solve this problem would be very appreciated (I am open to different solutions)!
Edit: Forgot to point out, one.txt has values without any comma, like this:
102582
205335
350365
133565
Something like this
with open('one.txt', 'r') as one, open('two.txt', 'r') as two:
    next(two)  # skip first line in two.txt
    for line_one, line_two in zip(one, two):
        one_a = int(line_one.strip())        # one.txt has one number per line
        two_b = int(line_two.split(",")[0])  # first value before the comma
        print(two_b - one_a)
Try this:
onearray = []
file = open("one.txt", "r")
for line in file:
    onearray.append(int(line.replace("\n", "")))
file.close()
twoarray = []
file = open("two.txt", "r")
for line in file:
    if line != "Start,End\n":
        twoarray.append(int(line.split(",")[0]))
file.close()
for i in range(0, len(onearray)):
    print(twoarray[i] - onearray[i])
It should do the job!
I have a number of txt files that represent spatial data in a grid form, essentially arrays of the same dimensions in which each value signifies a trait about the corresponding parcel of land. I have been trying to script a sequence that imports each file, adds "-9999" on the border of the entire grid, and saves out to an otherwise identical txt file.
The first 6 rows of each txt file are header rows, and shouldn't be changed.
My progress is as follows:
for datfile in spatialfiles:
    results = []
    borderrow = []
    with open('{}.txt'.format(datfile)) as inputfile:
        #header = inputfile.readlines()
        for line in inputfile:
            row = ['-9999'] + line.strip().split(' ') + ['-9999']
            results.append(row)
    for cell in range(len(row)):
        borderrow.append('-9999')
    results = [borderrow] + results[6:] + [borderrow]
    with file("{}-new.txt".format(datfile), 'w') as outputFile:
        for row in header[:6]:
            outputFile.write(row)
        for row in results:
            outputFile.write(row)
"header = inputfile.readlines()" has been commented out because it seems to cause a NameError in which "row" is no longer recognized. At the same time, I haven't found another way to retain the 6 header rows for exporting later.
Why does readlines() seem to alter the ability to iterate through the lines of the inputfile when it is only being used to write to a variable? What am I missing? (Any other pointers on my undoubtedly bloated code always welcome!)
readlines() reads the whole file into memory, splits it into a list of lines, and leaves the file position at the end of the file. When you then iterate over the same file object, reading resumes from that position, which is already at the end, so the for loop body never runs and row is never assigned. Call readlines() once and loop through the list with a counter which changes the loop's behavior after 6 lines.
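A minimal sketch of that advice follows. It uses slicing on the list instead of a counter (a swapped-in but equivalent way to treat the first 6 lines differently), a made-up in-memory file in place of the real input, and a hypothetical helper name add_border. It also splits on any whitespace rather than on a single space, which is an assumption about the file layout:

```python
import io

HEADER_ROWS = 6   # from the question: the first 6 rows are header rows
BORDER = "-9999"


def add_border(lines, header_rows=HEADER_ROWS, border=BORDER):
    """Return the lines with the grid rows wrapped in a border of `border` cells."""
    header = lines[:header_rows]
    grid = [line.split() for line in lines[header_rows:] if line.strip()]
    if not grid:
        return list(header)
    border_row = [border] * (len(grid[0]) + 2)
    bordered = [border_row] + [[border] + row + [border] for row in grid] + [border_row]
    # Header lines are passed through unchanged; grid rows are re-joined with spaces.
    return list(header) + [" ".join(row) + "\n" for row in bordered]


# Made-up stand-in for one input file: 6 header lines, then a 2x3 grid.
inputfile = io.StringIO(
    "".join("header {}\n".format(i) for i in range(1, 7)) + "1 2 3\n4 5 6\n"
)
lines = inputfile.readlines()  # read the file exactly once
result = add_border(lines)
print("".join(result), end="")
```

With a real file, lines would come from a single readlines() call inside the with block, and result could be written out with writelines().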
I am attempting to produce a read and write sequence for a python program. I am fairly new to Python; and as a result do not have amazing knowledge of the actual code.
I am struggling with reading data from a .CSV file, this file contains 3 columns, and the amount of rows depends on how many users use the program I have created. Now, I know how to locate rows, but the problem is that it returns the entire row, with all three columns of data within it. So - how do I isolate pieces of data? And subsequently, how do I turn these pieces of data into variables which can be read, written or overwritten.
Please bear in mind that the context of the program is the Computing A453 coursework. Also remember I am not asking you to do my work for me; I have already completed the other 2 tasks, and all the logic and planning for task 3. It's just that I only have 2 weeks left until I have to hand this coursework in, and trying to work out the code that can read and overwrite data is extremely hard for a beginner like me.
with open('results.csv', 'a', newline='') as fp:
    g = csv.writer(fp, delimiter=',')
    data = [[name, score, classroom]]
    g.writerows(data)
    fp.close()
# Do the reading
result = open('results.csv', 'r')
reader = csv.reader(result)
new_rows_list = []
for row in reader:
    if row[2] == name:
        if row[2] < score:
            result.close()
            file2 = open('results.csv', 'wb')
            writer = csv.writer(file2)
            new_row = [row[2], row[2], name, score, classnumber]
            new_rows_list.append(new_row)
            file2.close()
At the moment, this code reads the file, but not in the way I want it to. I want it to isolate the "name" of the user on record (within the .csv file). Instead of doing so, it reads the entire row as a whole, which I do not know how to isolate down to just the name of the user.
Here is the data in the .CSV file:
Name Score Class number
jor 0 2
jor 0 2
jor 1 2
I'm assuming what you're getting looks like this:
Jacob, 14, Class B, Number 3
And that that is a string.
If that is the case, str.split() is your answer.
str.split() takes a separator string as an argument, in your case a comma (Comma Separated Values), and returns a list of everything in between every instance of that separator in the string.
From there, if you want to use the results as data in your program, you should convert the values to the datatype you want (like float(x) or int(x)).
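As a rough illustration, assuming a comma-separated row with the three columns from the question (name, score, class number) and made-up values:

```python
import csv
import io

# A made-up row in the shape the question describes: name, score, class number.
line = "jor,1,2\n"

# str.split returns a list of strings; strip the newline first.
name, score_text, class_text = line.strip().split(",")
score = int(score_text)        # convert before comparing numerically
class_number = int(class_text)

# csv.reader does the splitting for you and yields one list per row,
# so a single column is just an index into that list.
rows = list(csv.reader(io.StringIO("jor,0,2\njor,1,2\n")))
names = [row[0] for row in rows]
best_score = max(int(row[1]) for row in rows)

print(name, score, class_number, names, best_score)
```

If the rows already come from csv.reader, as in the question's code, they are lists rather than strings, so row[0] gives the name directly without any call to split.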
Hope this helped