I noticed pandas is smart when using read_excel / read_csv, it skips the empty rows so if my input has a blank row like
Col1, Col2

Value1, Value2
It just works, but is there a way to get the actual # of skipped rows? (In this case 1)
I want to tie the dataframe row numbers back to the raw input file's row numbers.
You could pass skip_blank_lines=False and import the entire file, including the empty lines. Then you can detect them, count them and filter them out:
import pandas as pd

def custom_read(f_name, **kwargs):
    df = pd.read_csv(f_name, skip_blank_lines=False, **kwargs)
    # blank lines come back as rows that are entirely NaN
    non_empty = ~df.isnull().all(axis=1)
    print('Skipped {} blank lines'.format(sum(~non_empty)))
    return df.loc[non_empty, :]
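If you also need to tie the surviving rows back to the raw file (as asked), one possible sketch, assuming a single header line and that custom_read leaves the index untouched:

# Hypothetical usage: 'input.csv' stands in for your file.
df = custom_read('input.csv')

# With skip_blank_lines=False the integer index counts every line after the
# header, blanks included, so the 1-based file line is index + 2.
df['file_line'] = df.index + 2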
You can also use csv.reader to import your file row-by-row and only allow non-empty rows:
import csv
import pandas as pd

def custom_read2(f_name):
    with open(f_name) as f:
        cont = []
        empty_counts = 0
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if len(row) > 0:
                cont.append(row)
            else:
                empty_counts += 1
        print('Skipped {} blank lines'.format(empty_counts))
        return pd.DataFrame(cont)
As far as I can tell, the second approach holds at most one blank line in memory at a time. This may be useful if you happen to have large files with many blank lines, but I am pretty sure option 1 will always be the better option in practice.
I have these huge CSV files that I need to validate; I need to make sure they are all delimited by backtick `. I have a reader opening each file and printing its content. I'm just wondering what different ways you all would go about validating that each value is delimited by the backtick character.
for csvfile in self.fullcsvpathfiles:
    #print("file..")
    with open(csvfile, mode='r') as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        for row in csv_reader:
            print(row)
Not sure how to go about validating that each value is separated by a backtick and throwing an error otherwise. These tables are huge (not that that's a problem for electricity ;) )
Method 1
With the pandas library you could use the pandas.read_csv() function to read the csv file with sep='`' (which specifies the delimiter). If it parses the file into a dataframe in good shape, then you can be fairly sure the delimiters are fine.
Also, to automate the validation process, you could check whether the number of NaN values in the dataframe is within an acceptable level. Assuming your csv files do not have many blanks (so only a few NaN values are expected), you could compare the number of NaN values with a threshold you set.
import pandas as pd

nan_threshold = 20
for csvfile in self.fullcsvpathfiles:
    # if it fails at this step, then something (probably the delimiter) must be wrong
    my_df = pd.read_csv(csvfile, sep="`")
    nans = my_df.isnull().sum().sum()  # total number of NaN cells in the dataframe
    if nans > nan_threshold:
        print(csvfile)  # make some warning here
Refer to the pandas documentation for more information about pandas.read_csv().
Method 2
As mentioned in the comments, you could also check whether the number of occurrences of the delimiter is the same in each line of the file.
num_of_sep = -1  # initial value
# assume you are at the step of reading a file f
for line in f:
    num = line.count("`")
    if num_of_sep == -1:
        num_of_sep = num
    elif num != num_of_sep:
        print('Some warning here')
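Putting that together, a minimal sketch of the whole check might look like this (csv_paths is a hypothetical stand-in for self.fullcsvpathfiles from the question):

csv_paths = ["file1.csv", "file2.csv"]  # hypothetical stand-in for self.fullcsvpathfiles

for csvfile in csv_paths:
    num_of_sep = -1                     # backtick count seen on the first line
    with open(csvfile, mode='r') as f:
        for line_no, line in enumerate(f, start=1):
            num = line.count("`")
            if num_of_sep == -1:
                num_of_sep = num
            elif num != num_of_sep:
                print('{}: line {} has {} backticks, expected {}'.format(
                    csvfile, line_no, num, num_of_sep))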
If you don't know how many columns a file should have, you could check that all the rows have the same number of columns; if you expect the header (first row) to always be correct, use it to determine the number of columns.
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as csv_file:
        reader = csv.reader(csv_file, delimiter="`")
        ncols = len(next(reader))  # the header row sets the expected column count
        if not all(len(row) == ncols for row in reader):
            pass  # do something
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as f:
        row = next(f)
        ncols = row.count('`')  # number of separators in the header line
        if not all(row.count('`') == ncols for row in f):
            pass  # do something
If you know how many columns are in a file...
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as csv_file:
        # figure out how many columns it is supposed to have here?
        ncols = special_process()
        reader = csv.reader(csv_file, delimiter="`")
        if not all(len(row) == ncols for row in reader):
            pass  # do something
for csvfile in self.fullcsvpathfiles:
    # figure out how many columns it is supposed to have here?
    ncols = special_process()  # here ncols should be the expected number of '`' separators per line
    with open(csvfile, mode='r') as f:
        if not all(row.count('`') == ncols for row in f):
            pass  # do something
If you know the number of expected elements, you could inspect each line:
f = open(filename, 'r')
for line in f:
    line = line.split("`")
    if len(line) != numElements:
        raise Exception("Bad file")
If you know the delimiter that is being accidentally inserted, you could also try to recover instead of throwing an exception. Perhaps something like:
line="`".join(line).replace(wrongDelimiter,"`").split("`")
Of course, once you're that far into reading the file yourself, there's no great need for an external library to read the data; just go ahead and use what you've already parsed.
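For illustration, here is a minimal sketch of that recovery idea in context; filename, wrongDelimiter and numElements are hypothetical placeholders:

filename = "data.csv"   # hypothetical file path
wrongDelimiter = ","    # hypothetical: the character that sneaks in instead of the backtick
numElements = 4         # hypothetical: expected number of fields per line

rows = []
with open(filename, 'r') as f:
    for line in f:
        fields = line.rstrip("\n").split("`")
        if len(fields) != numElements:
            # rejoin, swap the stray delimiter for a backtick, and split again
            fields = "`".join(fields).replace(wrongDelimiter, "`").split("`")
        rows.append(fields)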
I have a CSV from a system that has a load of rubbish at the top of the file, so the header row is about row 5 or could even be 14 depending on the gibberish the report puts out.
I used to use:
idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)
to go through the rows that had 2 or fewer columns; when it hit the column headers, of which there are 12, it would stop, and then I could use idx with skiprows when reading the CSV file.
The system has had an update, and someone thought it would be good to make the CSV file valid by adding 11 commas after their gibberish so the column count matches the header.
so now I have a CSV like:
sadjfhasdkljfhasd,,,,,,,,,,
dsfasdgasfg,,,,,,,,,,
time,date,code,product
etc..
I tried:
idx = next(idx for idx, row in enumerate(csvreader) if row in (None, "") > 2)
but I think that's a Pandas thing and it just fails.
Any ideas on how I can get to my header row?
CODE:
lmf = askopenfilename(filetypes=(("CSV Files",".csv"),("All Files","*.*")))
# Section gets row number where headers start
with open(lmf, 'r') as fin:
    csvreader = csv.reader(fin)
    print(csvreader)
    input('hold')
    idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)
# Reopens file parsing the number for the row headers
lmkcsv = pd.read_csv(lmf, skiprows=idx)
lm = lm.append(lmkcsv)
print(lm)
Since your csv is now a valid file and you just want to filter out the leading rows that don't have a certain number of columns, you can do that in pandas directly.
import pandas as pd

minimum_cols_required = 3
lmkcsv = pd.read_csv(lmf)  # lmf is the file path from your code
lmkcsv = lmkcsv.dropna(thresh=minimum_cols_required)
If your csv data also have a lot of empty values that get caught by this threshold, then just slightly modify your code:
idx = next(idx for idx, row in enumerate(csvreader) if len(set(row)) > 3)
I'm not sure in what case a None would be returned, so the set(row) should do. If your headers happen to contain duplicates for whatever reason, do this instead:
from collections import Counter
# ...
idx = next(idx for idx, row in enumerate(csvreader) if len(row) - Counter(row)[''] > 2)
How about erasing the starting lines by doing some logic, like checking how many ',' exist or looking for some word? Something like:
f = open("target.txt", "r+")
d = f.readlines()
f.seek(0)
for i in d:
    if "sadjfhasdkljfhasd" not in i:  # keep only the lines without the gibberish marker
        f.write(i)
f.truncate()
f.close()
After that, read the file normally.
I have a huge dataset and I am trying to read it line by line.
For now, I am reading the dataset using pandas:
df = pd.read_csv("mydata.csv", sep =',', nrows = 1)
This function allows me to read only the first line, but how can I read the second, the third one and so on?
(I would like to use pandas.)
EDIT:
To make it more clear, I need to read one line at a time as the dataset is 20 GB and I cannot keep all the stuff in memory.
One way could be to read your file part by part and store each part, for example:
df1 = pd.read_csv("mydata.csv", nrows=10000)
Here you will skip the first 10000 rows that you already read and stored in df1, and store the next 10000 rows in df2.
df2 = pd.read_csv("mydata.csv", skiprows=10000, nrows=10000)
dfn = pd.read_csv("mydata.csv", skiprows=(n-1)*10000, nrows=10000)
Maybe there is a way to introduce this idea into a for or while loop.
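For instance, a rough sketch of such a loop might look like the following; the chunk size is an arbitrary assumption, and the header is read once up front because skiprows also counts the header line:

import pandas as pd
from pandas.errors import EmptyDataError

chunk_size = 10000                                   # hypothetical chunk size
cols = pd.read_csv("mydata.csv", nrows=0).columns    # read the header once

part = 0
while True:
    try:
        df_part = pd.read_csv(
            "mydata.csv",
            skiprows=1 + part * chunk_size,   # +1 so the header line is skipped too
            nrows=chunk_size,
            header=None,
            names=cols,
        )
    except EmptyDataError:                    # no rows left to read
        break
    # ... process df_part here ...
    part += 1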
Looking in the pandas documentation, there is a parameter for read_csv function:
skiprows
If a list is assigned to this parameter, it will skip the lines indexed by the list:
skiprows = [0,1]
This will skip the first and the second line.
Thus a combination of nrows and skiprows allows you to read each line in the dataset separately.
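As a rough sketch (the row index i is just an example), passing a range to skiprows while keeping the header lets you pull out a single data row:

import pandas as pd

i = 2  # hypothetical: the 0-based data row you want (not counting the header)

row_df = pd.read_csv(
    "mydata.csv",
    skiprows=range(1, i + 1),  # skip data rows 0 .. i-1; line 0 stays as the header
    nrows=1,                   # then read just one row
)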
You are using nrows = 1, which means "Number of rows of file to read. Useful for reading pieces of large files".
So you are telling it to read only the first row and stop.
You should just remove that argument to read the whole csv file into a DataFrame and then go through it line by line.
See the documentation for more details on usage : https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
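For what it's worth, a minimal sketch of that suggestion, assuming the file actually fits in memory:

import pandas as pd

df = pd.read_csv("mydata.csv", sep=',')

# go through the DataFrame line by line
for row in df.itertuples(index=False):
    print(row)  # each `row` is a namedtuple with one field per column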
I found using skiprows to be very slow. This approach worked well for me:
line_number = 8  # the row you want. 0-indexed
import pandas as pd
import sys  # or `import itertools`
import csv
# you can wrap this block in a function:
# (filename, line_number[, max_rows]) -> row
with open(filename, 'r') as f:
    r = csv.reader(f)
    for i in range(sys.maxsize**10):  # or `i in itertools.count(start=0)`
        if i != line_number:
            next(r)  # skip this row
        else:
            row = next(r)
            row = pd.DataFrame(row)  # or transform it however you like
            break  # or return row, if this is a function
# now you can use `row` !
To make it more robust, substitute sys.maxsize**10 with your actual total number of rows, make sure that line_number is a non-negative number, and put a try/except StopIteration block around the row = next(r) line, so that you can catch the reader reaching the end of the file.
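A sketch of that more robust version, wrapped in a function; the names here are illustrative, not part of the original snippet:

import csv
import itertools

def read_csv_row(filename, line_number):
    """Return row `line_number` (0-indexed) of the csv file, or None if the file is shorter."""
    with open(filename, 'r', newline='') as f:
        r = csv.reader(f)
        try:
            for i in itertools.count():
                row = next(r)
                if i == line_number:
                    return row
        except StopIteration:  # reached the end of the file first
            return None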
I am trying to determine the type of data contained in each column of a .csv file so that I can make CREATE TABLE statements for MySQL. The program makes a list of all the column headers and then grabs the first row of data and determines each data type and appends it to the column header for proper syntax. For example:
ID Number Decimal Word
0 17 4.8 Joe
That would produce something like CREATE TABLE table_name (ID int, Number int, Decimal float, Word varchar());.
The problem is that in some of the .csv files the first row contains a NULL value that is read as an empty string and messes up this process. My goal is to then search each row until one is found that contains no NULL values and use that one when forming the statement. This is what I have done so far, except it sometimes still returns rows that contain empty strings:
def notNull(p):  # where p is a .csv file that has been read in another function
    tempCol = next(p)
    tempRow = next(p)
    col = tempCol[:-1]
    row = tempRow[:-1]
    if any('' in row for row in p):
        tempRow = next(p)
        row = tempRow[:-1]
    else:
        rowNN = row
    return rowNN
Note: The .csv file reading is done in a different function, whereas this function simply uses the already read .csv file as input p. Also each row is ended with a , that is treated as an extra empty string so I slice the last value off of each row before checking it for empty strings.
Question: What is wrong with the function that I created that causes it to not always return a row without empty strings? I feel that it is because the loop is not repeating itself as necessary but I am not quite sure how to fix this issue.
I cannot really decipher your code. This is what I would do to only get rows without the empty string.
import csv
def g(name):
    with open(name, 'r') as f:
        r = csv.reader(f)
        # Skip headers
        next(r)
        for row in r:
            if '' not in row:
                yield row

for row in g('file.csv'):
    print('row without empty values: {}'.format(row))
I have 3,000 .dat files that I am reading and concatenating into one pandas dataframe. They have the same format (4 columns, no header) except that some of them have a description at the beginning of the file while others don't. In order to concatenate those files, I need to get rid of those first rows before I concatenate them. The skiprows option of pandas.read_csv() doesn't apply here, because the number of rows to skip is very inconsistent from one file to another (btw, I use pandas.read_csv() and not pandas.read_table() because the files are separated by a comma).
However, the first value after the rows I am trying to omit is identical for all 3,000 files. This value is "2004", which is the first data point of my dataset.
Is there an equivalent to skiprows where I could specify something such as "start reading the file at "2004" and skip everything before that" (for each of the 3,000 files)?
I am really out of luck at this point and would appreciate some help,
Thank you!
You could just loop through them and skip every line that doesn't start with 2004.
Something like ...
import csv

with open(filename) as fp:               # filename: one of your .dat files
    for line in csv.reader(fp):
        if not line or line[0] != '2004':
            continue
        # whatever else you need here
Probably not worth trying to be clever here; if you have a handy criterion, you might as well use it to figure out what skiprows is, i.e. something like
import pandas as pd
import csv

def find_skip(filename):
    with open(filename, newline="") as fp:
        # (use open(filename, "rb") in Python 2)
        reader = csv.reader(fp)
        for i, row in enumerate(reader):
            if row and row[0] == "2004":  # `row and` guards against blank description lines
                return i

for filename in filenames:
    skiprows = find_skip(filename)
    if skiprows is None:
        raise ValueError("something went wrong in determining skiprows!")
    this_df = pd.read_csv(filename, skiprows=skiprows, header=None)
    # do something here, e.g. append this_df to a list and concatenate it after the loop
Use the skip_to() function:
import pandas as pd

def skip_to(f, text):
    while True:
        last_pos = f.tell()
        line = f.readline()
        if not line:
            return False
        if line.startswith(text):
            f.seek(last_pos)
            return True

with open("tmp.txt") as f:
    if skip_to(f, "2004"):
        df = pd.read_csv(f, header=None)
        print(df)