I have a huge dataset and I am trying to read it line by line.
For now, I am reading the dataset using pandas:
df = pd.read_csv("mydata.csv", sep=',', nrows=1)
This lets me read only the first line, but how can I read the second, the third, and so on?
(I would like to use pandas.)
EDIT:
To make it clearer: I need to read one line at a time because the dataset is 20 GB and I cannot keep it all in memory.
One way could be to read your file part by part and store each part, for example:
df1 = pd.read_csv("mydata.csv", nrows=10000)
Then skip the first 10000 rows that you already read and stored in df1, and store the next 10000 rows in df2:
df2 = pd.read_csv("mydata.csv", skiprows=10000, nrows=10000)
In general, for the n-th part:
dfn = pd.read_csv("mydata.csv", skiprows=(n-1)*10000, nrows=10000)
Maybe there is a way to introduce this idea into a for or while loop, as sketched below.
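A rough sketch of such a loop, assuming you process each part as you go; chunk_size is an arbitrary name, and skiprows=range(1, ...) is used so the header line is kept on every read:

import pandas as pd

chunk_size = 10000  # hypothetical chunk size
k = 0
while True:
    # skip the data rows already read, but keep the header row (line 0)
    part = pd.read_csv("mydata.csv",
                       skiprows=range(1, k * chunk_size + 1),
                       nrows=chunk_size)
    if part.empty:
        break
    # ... process `part` here instead of keeping everything in memory ...
    k += 1

Note that read_csv also has a built-in chunksize argument, which returns an iterator of DataFrames and avoids re-scanning the skipped rows on every call.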
Looking at the pandas documentation, there is a skiprows parameter for the read_csv function:
skiprows
If a list is assigned to this parameter, it will skip the lines indexed by the list:
skiprows = [0,1]
This will skip the first and the second line.
Thus a combination of nrows and skiprows allows you to read each line in the dataset separately.
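For example, a minimal sketch of reading just the i-th data row (0-based) while keeping the header row; the value of i is purely illustrative:

import pandas as pd

i = 5  # hypothetical 0-based data-row index
# skip data rows 1..i (keeping the header on line 0) and read exactly one row
single_row = pd.read_csv("mydata.csv", skiprows=range(1, i + 1), nrows=1)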
You are using nrows = 1, which means "Number of rows of file to read. Useful for reading pieces of large files".
So you are telling it to read only the first row and stop.
You should just remove the argument to read the whole CSV file into a DataFrame and then go through it line by line.
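For instance, a rough sketch of that (keep in mind that with a 20 GB file this only works if the whole DataFrame fits in memory):

import pandas as pd

df = pd.read_csv("mydata.csv", sep=',')
for row in df.itertuples(index=False):
    # `row` is a namedtuple with one field per column
    print(row)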
See the documentation for more details on usage: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
I found using skiprows to be very slow. This approach worked well for me:
import pandas as pd
import sys  # or `import itertools`
import csv

line_number = 8  # the row you want, 0-indexed

# you can wrap this block in a function:
# (filename, line_number[, max_rows]) -> row
with open(filename, 'r') as f:
    r = csv.reader(f)
    for i in range(sys.maxsize**10):  # or `i in itertools.count(start=0)`
        if i != line_number:
            next(r)  # skip this row
        else:
            row = next(r)
            row = pd.DataFrame(row)  # or transform it however you like
            break  # or return row, if this is a function
# now you can use `row`!
To make it more robust, substitute sys.maxsize**10 with your actual total number of rows and/or make sure that line_number is a non-negative number, and put a try/except StopIteration block around the row = next(r) line so that you can catch the reader reaching the end of the file.
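A possible sketch of that more defensive version (the function name read_csv_line and its signature are just illustrative):

import csv
import itertools
import pandas as pd

def read_csv_line(filename, line_number):
    """Return data row `line_number` (0-indexed) as a one-row DataFrame, or None past EOF."""
    with open(filename, 'r', newline='') as f:
        r = csv.reader(f)
        try:
            for i in itertools.count():
                row = next(r)
                if i == line_number:
                    return pd.DataFrame([row])  # or transform it however you like
        except StopIteration:
            return None  # the reader reached the end of the file first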
Let's say I have this CSV:
my friend hello, test
ok, no
whatever, test
test test, ok
I want to delete line number 3, so I would call my function:
remove_from_csv(3)
I couldn't find any built-in remove function and I don't want to "write" anything, so I'm trying to find a way to just read, remove and shift.
So far, I can at least read the desired line number.
import csv

def remove_from_csv(index):
    with open('queue.csv') as file:
        reader = csv.reader(file)
        line_num = 0
        for row in reader:
            line_num += 1
            if line_num == index:
                print(row)
remove_from_csv(3)
whatever, test
However, I don't know how to go about removing that line without leaving a blank line behind afterwards.
Try:
import pandas as pd

def remove_nth_line_csv(file_name, n):
    df = pd.read_csv(file_name, header=None)
    df.drop(df.index[n], inplace=True)
    df.to_csv(file_name, index=False, header=False)
Remember that pandas indexes from 0, so counting starts at 0, 1, 2, 3, 4, ..., n.
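For example, to delete what the question calls line number 3 (the third row of the file), you would pass n=2:

remove_nth_line_csv('queue.csv', 2)  # drops the third row, "whatever, test"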
If you're trying to remove a specific line from a file (here the 3rd), then you don't need the csv module (or third-party libraries); basic file manipulation should work:
from pathlib import Path
with open("queue.csv", "r") as fin, open("new_queue.csv", "w") as fout:
    fout.writelines(line for n, line in enumerate(fin, start=1) if n != 3)

Path("new_queue.csv").rename("queue.csv")
I noticed that pandas is smart when using read_excel / read_csv: it skips empty rows, so if my input has a blank row like
Col1, Col2

Value1, Value2
It just works, but is there a way to get the actual # of skipped rows? (In this case 1)
I want to tie the dataframe row numbers back to the raw input file's row numbers.
You could use skip_blank_lines=False and import the entire file, including the empty lines. Then you can detect them, count them, and filter them out:
import pandas as pd

def custom_read(f_name, **kwargs):
    df = pd.read_csv(f_name, skip_blank_lines=False, **kwargs)
    non_empty = df.notnull().all(axis=1)
    print('Skipped {} blank lines'.format(sum(~non_empty)))
    return df.loc[non_empty, :]
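A possible call site (the file name data.csv is just a placeholder); any extra read_csv options are passed straight through:

df = custom_read('data.csv')
df2 = custom_read('data.csv', sep=';', header=None)  # hypothetical extra options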
You can also use csv.reader to import your file row-by-row and only allow non-empty rows:
import csv
import pandas as pd

def custom_read2(f_name):
    with open(f_name) as f:
        cont = []
        empty_counts = 0
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if len(row) > 0:
                cont.append(row)
            else:
                empty_counts += 1
        print('Skipped {} blank lines'.format(empty_counts))
        return pd.DataFrame(cont)
As far as I can tell, at most one blank line at a time will occupy your memory. This may be useful if you happen to have large files with many blank lines, but I am pretty sure option 1 will always be the better option in practice.
I have these huge CSV files that I need to validate; I need to make sure they are all delimited by backtick (`). I have a reader opening each file and printing its content. I'm just wondering how you all would go about validating that each value is delimited by the backtick character.
for csvfile in self.fullcsvpathfiles:
    # print("file..")
    with open(csvfile, mode='r') as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        for row in csv_reader:
            print(row)
Not sure how to go about validating that each value is separated by a backtick and throwing an error otherwise. These tables are huge (not that that's a problem for electricity ;) ).
Method 1
With the pandas library you could use the pandas.read_csv() function to read the CSV file with sep='`' (which specifies the delimiter). If it parses the file into a DataFrame in good shape, then you can be fairly confident the file is good.
Also, to automate the validation process, you could check whether the number of NaN values in the DataFrame is within an acceptable level. Assuming your CSV files do not have many blanks (so only a few NaN values are expected), you could compare the number of NaN values against a threshold you set.
import pandas as pd

nan_threshold = 20

for csvfile in self.fullcsvpathfiles:
    # if it fails at this step, then something (probably the delimiter) must be wrong
    my_df = pd.read_csv(csvfile, sep="`")
    nans = my_df.isnull().sum().sum()  # total number of NaN values in the whole frame
    if nans > nan_threshold:
        print(csvfile)  # make some warning here
Refer to this page for more information about pandas.read_csv().
Method 2
As mentioned in the comments, you could also check whether the delimiter occurs the same number of times in each line of the file.
num_of_sep = -1  # initial value

# assume you are at the step of reading a file f
for line in f:
    num = line.count("`")
    if num_of_sep == -1:
        num_of_sep = num
    elif num != num_of_sep:
        print('Some warning here')
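Wrapped up for the file list from the question, this could look roughly like the following (check_delimiter_count is just an illustrative name):

def check_delimiter_count(path, sep="`"):
    num_of_sep = -1
    with open(path) as f:
        for line in f:
            num = line.count(sep)
            if num_of_sep == -1:
                num_of_sep = num  # remember the count from the first line
            elif num != num_of_sep:
                return False  # inconsistent number of delimiters
    return True

for csvfile in self.fullcsvpathfiles:
    if not check_delimiter_count(csvfile):
        print(csvfile)  # make some warning here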
If you don't know how many columns a file should have, you can check that all the rows have the same number of columns; if you expect the header (first row) to always be correct, use it to determine the number of columns.
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        ncols = len(next(csv_reader))
        if not all(len(row) == ncols for row in csv_reader):
            pass  # do something

for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as f:
        row = next(f)
        ncols = row.count('`')
        if not all(row.count('`') == ncols for row in f):
            pass  # do something
If you know how many columns are in a file...
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as csv_file:
        # figure out how many columns it is supposed to have here
        ncols = special_process()
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        if not all(len(row) == ncols for row in csv_reader):
            pass  # do something

for csvfile in self.fullcsvpathfiles:
    # figure out how many columns it is supposed to have here
    ncols = special_process()
    with open(csvfile, mode='r') as f:
        if not all(row.count('`') == ncols for row in f):
            pass  # do something
If you know the number of expected elements, you could inspect each line:
with open(filename, 'r') as f:
    for line in f:
        line = line.split("`")
        if len(line) != numElements:
            raise Exception("Bad file")
If you know which wrong delimiter is being accidentally inserted, you could also try to recover instead of throwing an exception. Perhaps something like:
line="`".join(line).replace(wrongDelimiter,"`").split("`")
Of course, once you're that far into reading the file, there's no great need for an external library to read the data; you've already parsed it, so just go ahead and use it.
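For example, a small sketch that validates and collects the parsed rows in one pass, assuming filename and numElements are defined as above:

import csv

rows = []
with open(filename, newline='') as f:
    for fields in csv.reader(f, delimiter="`"):
        if len(fields) != numElements:
            raise Exception("Bad file")
        rows.append(fields)  # already parsed, so just keep it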
Is there a better way to create a list or a numpy array from this CSV file? What I'm asking is how to parse it more gracefully than I did in the code below.
fname = open("Computers discovered recently by discovery method.csv").readlines()
lst = [elt.strip().split(",")[8:] for elt in fname if elt != "\n"][4:]
lst2 = []
for row in lst:
    print(row)
    if row[0].startswith("SMZ-") or row[0].startswith("MTR-"):
        lst2.append(row)
print(*lst2, sep="\n")
You can always use pandas. As an example:
import pandas as pd
import numpy as np
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv')
To get a numpy array, you will have to convert the DataFrame to your favorite numeric type. I guess you can write the whole thing in one line:
result = df.values.astype("float")
You can also do the following:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
You can use pandas and specify the header row to make it work correctly on your sample file:
import pandas as pd
df = pd.read_csv('Computers discovered recently by discovery method.csv', header=2)
You can check your content using:
>>> df.head()
You can check the headers using:
>>> df.columns
And to convert it to a numpy array you can use:
>>> np_arr = df.values
It comes with a lot of options to parse and read csv files. For more information please check the docs
I am not sure what you want, but try this:
import csv

with open("Computers discovered recently by discovery method.csv", 'r') as f:
    reader = csv.reader(f)
    ll = list(reader)
    print(ll)
This should read the CSV line by line and store it as a list of rows.
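If you then want a NumPy array rather than a list, one extra step should do, assuming all rows have the same length:

import numpy as np

arr = np.array(ll)  # dtype will be strings; convert numeric columns with .astype(...) as needed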
You should never parse CSV structures manually unless you want to tackle all possible exceptions and CSV format oddities. Python has you covered in that regard with its csv module.
The main problem, in your case, stems from your data: there seem to be two different CSV structures in a single file, so you first need to find where your second structure begins. Plus, from your code, it seems you want to filter out all columns before Details_Table0_Netbios_Name0 and include only rows whose Details_Table0_Netbios_Name0 starts with SMZ- or MTR-. So something like:
import csv

with open("Computers discovered recently by discovery method.csv") as f:
    reader = csv.reader(f)  # create a CSV reader
    for row in reader:  # skip the lines until we encounter the second CSV structure/header
        if row and row[0] == "Header_Table0_Netbios_Name0":
            break
    index = row.index("Details_Table0_Netbios_Name0")  # find where your columns begin
    result = []  # storage for the rows we're interested in
    for row in reader:  # read the rest of the CSV row by row
        if row and row[index][:4] in {"SMZ-", "MTR-"}:  # only include these rows
            result.append(row[index:])  # trim and append to the `result` list

print(result[10])  # etc.
# ['MTR-PC0BXQE6-LB', 'PR2', 'anisita', 'VALUEADDCO', 'VALUEADDCO', 'Heartbeat Discovery',
#  '07.12.2017 17:47:51', '13']
should do the trick.
Sample Code
import csv

csv_file = 'sample.csv'
with open(csv_file) as fh:
    reader = csv.reader(fh)
    for row in reader:
        print(row)
sample.csv
name,age,salary
clado,20,25000
student,30,34000
sam,34,32000
I have 3,000 .dat files that I am reading and concatenating into one pandas DataFrame. They have the same format (4 columns, no header), except that some of them have a description at the beginning of the file while others don't. In order to concatenate those files, I need to get rid of those first rows before I concatenate them. The skiprows option of pandas.read_csv() doesn't apply here, because the number of rows to skip is very inconsistent from one file to another (by the way, I use pandas.read_csv() and not pandas.read_table() because the files are separated by a comma).
However, the first value after the rows I am trying to omit is identical for all 3,000 files. This value is "2004", which is the first data point of my dataset.
Is there an equivalent to skiprows where I could say something such as "start reading the file at "2004" and skip everything before that" (for each of the 3,000 files)?
I am really out of luck at this point and would appreciate some help,
Thank you!
You could just loop through them and skip every line that doesn't start with 2004.
Something like ...
with open(filename) as f:
    for line in f:
        if not line.startswith('2004'):
            continue
        # whatever else you need here
Probably not worth trying to be clever here; if you have a handy criterion, you might as well use it to figure out what skiprows is, i.e. something like
import pandas as pd
import csv

def find_skip(filename):
    with open(filename, newline="") as fp:
        # (use open(filename, "rb") in Python 2)
        reader = csv.reader(fp)
        for i, row in enumerate(reader):
            if row and row[0] == "2004":
                return i

for filename in filenames:
    skiprows = find_skip(filename)
    if skiprows is None:
        raise ValueError("something went wrong in determining skiprows!")
    this_df = pd.read_csv(filename, skiprows=skiprows, header=None)
    # do something here, e.g. append this_df to a list and concatenate it after the loop
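For the "append and concatenate" step in the last comment, a minimal sketch building on find_skip above could be:

frames = []
for filename in filenames:
    skiprows = find_skip(filename)
    if skiprows is None:
        raise ValueError("something went wrong in determining skiprows!")
    frames.append(pd.read_csv(filename, skiprows=skiprows, header=None))

combined = pd.concat(frames, ignore_index=True)  # one DataFrame holding all 3,000 files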
Use the skip_to() function:
import pandas as pd

def skip_to(f, text):
    while True:
        last_pos = f.tell()
        line = f.readline()
        if not line:
            return False
        if line.startswith(text):
            f.seek(last_pos)
            return True

with open("tmp.txt") as f:
    if skip_to(f, "2004"):
        df = pd.read_csv(f, header=None)
        print(df)