Does pandas automatically skip rows due to a size limit? - python

We all know the question when you are running into a memory error: Maximum size of pandas dataframe
I am also trying to read 4 large csv files with the following command:
files = glob.glob("C:/.../rawdata/*.csv")
dfs = [pd.read_csv(f, sep="\t", encoding='unicode_escape') for f in files]
df = pd.concat(dfs,ignore_index=True)
The only message I receive is:
C:..\conda\conda\envs\DataLab\lib\site-packages\IPython\core\interactiveshell.py:3214: DtypeWarning: Columns (22,25,56,60,71,74) have mixed types. Specify dtype option on import or set low_memory=False.
  if (yield from self.run_code(code, result)):
which should be no problem.
My total dataframe has a size of: (6639037, 84)
Could there be any data-size restriction without a memory error? That is, could Python be automatically skipping some lines without telling me? I had this with another program in the past. I don't think Python is that lazy, but you never know.
Later I am saving it as an sqlite file, but I also don't think this should be a problem:
conn = sqlite3.connect('C:/.../In.db')
df.to_sql(name='rawdata', con=conn, if_exists = 'replace', index=False)
conn.commit()
conn.close()

You can pass a generator expression to concat
dfs = (pd.read_csv(f, sep="\t", encoding='unicode_escape') for f in files)
so you avoid creating that crazy list in memory. This might alleviate the problem with the memory limit.
Besides, you can make a special generator that downcasts some columns. Say, like this:
def downcaster(names):
    for name in names:
        x = pd.read_csv(name, sep="\t", encoding='unicode_escape')
        x['some_column'] = x['some_column'].astype('category')
        x['other_column'] = pd.to_numeric(x['other_column'], downcast='integer')
        yield x

dc = downcaster(names)
df = pd.concat(dc, ...

It turned out that there was an error in the file reading, so thanks @Oleg O for the help and the tricks to reduce memory.
For now I do not think that there is an effect where Python automatically skips lines. It only happened with the wrong encoding. You can find my example here: Pandas read csv skips some lines

Related

How to find the input line with mixed types

I am reading in a large csv in pandas with:
features = pd.read_csv(filename, header=None, names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort','DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'], usecols=['Duration','SrcDevice', 'DstDevice', 'Protocol', 'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'])
I get:
sys:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
How can I find the first line in the input which is causing this warning? I need to do this to debug the problem with the input file, which shouldn't have mixed types.
Once Pandas has finished reading the file, you can NOT figure out which lines were problematic (see this answer to know why).
This means you have to detect it while you are reading the file. For example, read the file line by line and check the types of each line; if any value doesn't match the expected type, you have found the line you want.
To achieve this with Pandas, you can pass chunksize=1 to pd.read_csv() to read the file in chunks (dataframes with size N, 1 in this case). See the documentation if you want to know more about this.
The code goes something like this:
# Read the file in chunks of size 1. This returns a reader rather than a DataFrame.
reader = pd.read_csv(filename, chunksize=1)

# Get the first chunk (DataFrame), to calculate the "true" expected types.
first_row_df = reader.get_chunk()
expected_types = [type(val) for val in first_row_df.iloc[0]]  # a list of the expected types

i = 1  # the current index. Start from 1 because we've already read the first row.
for row_df in reader:
    row_types = [type(val) for val in row_df.iloc[0]]
    if row_types != expected_types:
        print(i)  # this row is the wanted one
        break
    i += 1
Note that this code makes an assumption that the first row has "true" types.
This code is really slow, so I recommend that you actually only check the columns which you think are problematic (though this does not give much performance gain).
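For example (a sketch, not a tested solution: it assumes the suspect column is DstPort, column 6 from the warning, and reuses the column names from the question), you could restrict the scan to that single column with usecols:
# Only read the suspect column, to keep the per-row work small.
reader = pd.read_csv(filename, chunksize=1, header=None, usecols=['DstPort'],
                     names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort',
                            'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'])
first_row_df = reader.get_chunk()
expected_types = [type(val) for val in first_row_df.iloc[0]]
i = 1
for row_df in reader:
    if [type(val) for val in row_df.iloc[0]] != expected_types:
        print(i)  # first row whose DstPort type differs from the first row's
        break
    i += 1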
for endrow in range(1000, 4000000, 1000):
    startrow = endrow - 1000
    rows = 1000
    try:
        pd.read_csv(filename, dtype={"DstPort": int}, skiprows=startrow, nrows=rows, header=None,
                    names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort',
                           'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'],
                    usecols=['Duration','SrcDevice', 'DstDevice', 'Protocol', 'DstPort',
                             'SrcPackets','DstPackets','SrcBytes','DstBytes'])
    except ValueError:
        print(f"Error is from row {startrow} to row {endrow}")
Split the file into chunks of 1000 rows each to see in which 1000-row range the mixed-type value that causes this problem occurs.
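To narrow it down further, one option (a sketch under the same assumptions: the offending column is DstPort and startrow marks the block that raised the ValueError) is to re-read just that block and test each value individually:
# Re-read only the failing 1000-row block and test the suspect column value by value.
bad_block = pd.read_csv(filename, skiprows=startrow, nrows=1000, header=None,
                        names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort',
                               'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'])
for offset, value in enumerate(bad_block['DstPort']):
    try:
        int(value)
    except (TypeError, ValueError):
        print(f"Row {startrow + offset} has a non-integer DstPort: {value!r}")
        break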

Techniques for working with very large Pandas data frames or csv files in Python

I have a huge csv file (14 GB) on disk that I need to "melt" (using pd.melt). I can import the file using pd.read_csv() without issue, but when I apply the melt function I max out my 32 GB of memory and hit a "memory error" limit. Can anyone suggest some solutions? The original file is the output of another script, so I cannot reduce it by only importing selected columns or removing rows. There are a few hundred columns and over 10 million rows. I tried something like this (in a much abbreviated version):
chunks = pd.read_csv('file.csv', chunksize=10000)
ids = list(set(list(chunks.columns.values)) - set(['1','2','3','4','5']))
out = []
for chunk in chunks:
    df = pd.melt(chunk, id_vars=ids, var_name='foo', value_name='bar')
    df['a_col'] = df['a_col'].fillna('not_na')
    out.append(df)
full_df = pd.concat(out, ignore_index=False)
df_grouped = pd.DataFrame(full_df.groupby(['id_col', 'foo'])['bar'].apply(lambda x: ((x - min(x)) / (max(x) - min(x)) * 100)))
df_grouped.columns = ['bar_grouped']
final_df = full_df.merge(df_grouped, how='inner', left_index=True, right_index=True)
final_df.to_csv('output.csv', sep='|', index=False)
Clearly this didn't work, as the columns.values attribute is not available for chunks since it is not a data frame. Any suggestions on how to re-code so it works and avoids memory issues are greatly appreciated!
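Since no answer is shown here, a rough sketch of one workaround (untested; it reuses the placeholder column names '1'..'5', 'a_col' and the separator from the snippet above) is to take the id columns from the header alone and append each melted chunk to disk instead of collecting everything in memory:
import pandas as pd

# Read only the header row to get the column names without loading any data.
cols = pd.read_csv('file.csv', nrows=0).columns
ids = list(set(cols) - set(['1', '2', '3', '4', '5']))

first = True
for chunk in pd.read_csv('file.csv', chunksize=100000):
    df = pd.melt(chunk, id_vars=ids, var_name='foo', value_name='bar')
    df['a_col'] = df['a_col'].fillna('not_na')
    # Stream each melted chunk to disk instead of appending it to a list.
    df.to_csv('melted.csv', sep='|', index=False, mode='w' if first else 'a', header=first)
    first = False
The groupby normalisation still needs to see every row of a group, so that step would have to run afterwards on the melted file (or in a database), not inside the chunk loop.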

Sequentially read huge CSV file in python

I have a 10gb CSV file that contains some information that I need to use.
As I have limited memory on my PC, I cannot read the whole file into memory in one single batch. Instead, I would like to iteratively read only some rows of this file.
Say that at the first iteration I want to read the first 100 rows, at the second those from 101 to 200, and so on.
Is there an efficient way to perform this task in Python?
Does Pandas provide something useful for this? Or are there better (in terms of memory and speed) methods?
Here is the short answer.
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
Here is the very long answer.
To get started, you’ll need to import pandas and sqlalchemy. The commands below will do that.
import pandas as pd
from sqlalchemy import create_engine
Next, set up a variable that points to your csv file. This isn’t necessary but it does help in re-usability.
file = '/path/to/csv/file'
With these three lines of code, we are ready to start analyzing our data. Let’s take a look at the ‘head’ of the csv file to see what the contents might look like.
print(pd.read_csv(file, nrows=5))
This command uses pandas’ “read_csv” command to read in only 5 rows (nrows=5) and then print those rows to the screen. This lets you understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work.
Before we can actually work with the data, we need to do something with it so we can begin to filter it and work with subsets. This is usually what I would use a pandas dataframe for, but with large data files we need to store the data somewhere else. In this case, we'll set up a local sqlite database, read the csv file in chunks and then write those chunks to sqlite.
To do this, we'll first need to create the sqlite database using the following command.
csv_database = create_engine('sqlite:///csv_database.db')
Next, we need to iterate through the CSV file in chunks and store the data into sqlite.
chunksize = 100000
i = 0
j = 1
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1
With this code, we are setting the chunksize at 100,000 to keep the size of the chunks manageable, initializing a couple of counters (i=0, j=1) and then running through a for loop. The for loop reads a chunk of data from the CSV file, removes spaces from the column names, then stores the chunk in the sqlite database (df.to_sql(…)).
This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas ‘sql’ tools to pull data from the database without worrying about memory constraints.
To access the data now, you can run commands like the following:
df = pd.read_sql_query('SELECT * FROM table', csv_database)
Of course, using ‘select *…’ will load all data into memory, which is the problem we are trying to get away from, so you should add filters to your select statements to filter the data. For example:
df = pd.read_sql_query('SELECT COl1, COL2 FROM table where COL1 = SOMEVALUE', csv_database)
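If even the filtered result is large, pd.read_sql_query also accepts a chunksize argument, so the query result can be streamed as well (a sketch; process is a placeholder for your own logic):
# Stream the filtered result in chunks so it never has to fit in memory all at once.
for chunk in pd.read_sql_query('SELECT COl1, COL2 FROM table WHERE COL1 = SOMEVALUE',
                               csv_database, chunksize=100000):
    process(chunk)  # placeholder for whatever you do with each piece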
You can use pandas.read_csv() with the chunksize parameter:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
for chunck_df in pd.read_csv('yourfile.csv', chunksize=100):
    # each chunck_df contains a part of the whole CSV
    pass  # process chunck_df here
This code may help you with this task. It navigates through a large .csv file and does not consume much memory, so you can run it on a standard laptop.
import pandas as pd
import os

# The chunksize here sets the number of rows of the csv file to read at a time.
chunksize2 = 2000

path = './'
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1")
df2 = data2.get_chunk(chunksize2)
headers = list(df2.keys())
del data2

start_chunk = 0
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1",
                    skiprows=chunksize2 * start_chunk)
headers = []

for i, df2 in enumerate(data2):
    try:
        print('reading csv....')
        print(df2)
        print('header: ', list(df2.keys()))
        print('our header: ', headers)
        # Access chunks within data
        # for chunk in data:
        # You can now export all outcomes to new csv files
        file_name = 'export_csv_' + str(start_chunk + i) + '.csv'
        save_path = os.path.abspath(
            os.path.join(
                path, file_name
            )
        )
        print('saving ...')
        df2.to_csv(save_path, index=False)  # write this chunk to its own csv file
    except Exception:
        print('reached the end')
        break
The method of transferring a huge CSV into a database is good because then we can easily use SQL queries.
We also have to take two things into account.
FIRST POINT: SQL is not made of rubber either; it cannot stretch your memory.
For example, I converted this dataset to a db file:
https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
For this db file, the SQL query:
pd.read_sql_query("SELECT * FROM 'table' LIMIT 600000", Mydatabase)
With 16 GB of RAM on the PC it can read at most about 0.6 million records, no more (operation time: 15.8 seconds).
It might be mischievous to add that reading directly from the csv file is a bit more efficient:
giga_plik = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
Abdul = pd.read_csv(giga_plik, nrows=1100000)
(operation time: 16.5 seconds)
SECOND POINT: To use SQL effectively on data series converted from CSV, we have to remember to store dates in a suitable form. So I propose adding this to ryguy72's code:
df['ColumnWithQuasiDate'] = pd.to_datetime(df['Date'])
The full code for the 311 file I mentioned:
import time
import pandas as pd
from sqlalchemy import create_engine

start_time = time.time()
### sqlalchemy create_engine
plikcsv = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
WM_csv_datab7 = create_engine('sqlite:///C:/1/WM_csv_db77.db')
#----------------------------------------------------------------------
chunksize = 100000
i = 0
j = 1
## --------------------------------------------------------------------
for df in pd.read_csv(plikcsv, chunksize=chunksize, iterator=True, encoding='utf-8', low_memory=False):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    ## -----------------------------------------------------------------------
    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])  # to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])
    ## --------------------------------------------------------------------------
    df.index += j
    i += 1
    df.to_sql('table', WM_csv_datab7, if_exists='append')
    j = df.index[-1] + 1
print(time.time() - start_time)
Finally, I would like to add: converting a csv file directly from the Internet into a db seems like a bad idea to me. I propose downloading the data first and converting it locally.

read a huge csv and create a dataframe

I have a csv with about 40,000,000 rows and 3 columns. I want to read it into Python and create a dataframe with these data, but I always get a memory error.
df = pd.concat([chunk for chunk in pd.read_csv('cmct_0430x.csv', chunksize=1000)])
I also tried creating the pandas DataFrame from a generator, but it still gives a memory error.
for line in open("cmct_0430x.csv"):
    yield line
My computer is win64 with 8 GB of RAM.
How can I solve this problem? Thank you very much.
df = pd.read_csv('cmct_0430x.csv')
40 million rows shouldn't be a problem.
Please post your error message if this doesn't work.
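If the plain read_csv still runs out of memory on 8 GB, one thing worth trying (a sketch; the column names and dtypes below are placeholders, adjust them to your actual three columns) is to declare the column types up front so pandas stores them compactly instead of inferring them:
import pandas as pd

# float32/int32/category are assumptions - pick whatever matches your real columns.
df = pd.read_csv('cmct_0430x.csv',
                 dtype={'col1': 'float32', 'col2': 'int32', 'col3': 'category'})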
You actually read the csv file in chunked mode, but then merged the chunks into one data frame in RAM, so the problem still exists. You can divide your data into multiple frames and work on them separately.
reader = pd.read_csv(file_name, chunksize=chunk_size, iterator=True)
while True:
    try:
        df = reader.get_chunk(chunk_size)
        # work on df
    except StopIteration:
        break
del df

Modifying large csv in chunks?

I get the error 'TypeError: 'TextFileReader' object does not support item assignment' when I try to add columns, modify header names, etc. in chunks.
My issue is I am using a slow work laptop to process a pretty large file (10 million rows). I want to add some simple columns (1 or 0 values), concatenate two columns to create a unique ID, change the dtype for other columns, and rename some headers so they match with other files that I will .merge later. I could probably split this csv (maybe select date ranges and make separate files), but I would like to learn how to use chunksize or deal with large files in general without running into memory issues. Is it possible to modify a file in chunks and then concatenate them all together later?
I am doing a raw data clean up which will then be loaded into Tableau for visualization.
Example (reading/modifying 10 million rows):
rep = pd.read_csv(r'C:\repeats.csv.gz',
                  compression='gzip', parse_dates=True,
                  usecols=['etc','stuff','others','...'])
rep.sort()
rep['Total_Repeats'] = 1
rep.rename(columns={'X':'Y'}, inplace=True)
rep.rename(columns={'Z':'A'}, inplace=True)
rep.rename(columns={'B':'C'}, inplace=True)
rep['D'] = rep['E'] + rep['C']
rep.rename(columns={'L':'M'}, inplace=True)
rep.rename(columns={'N':'O'}, inplace=True)
rep.rename(columns={'S':'T'}, inplace=True)
If you pass the chunksize keyword to pd.read_csv, it returns an iterator over the csv file, and you can write each processed chunk with the to_csv method in append mode. You will be able to process the large file this way, but you can't sort the whole dataframe.
import pandas as pd

reader = pd.read_csv(r'C:\repeats.csv.gz',
                     compression='gzip', parse_dates=True, chunksize=10000,
                     usecols=['etc','stuff','others','...'])
output_path = 'output.csv'
for chunk_df in reader:
    chunk_result = do_something_with(chunk_df)
    chunk_result.to_csv(output_path, mode='a', header=False)
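For the modifications described in the question, do_something_with might look roughly like this (a sketch; the column names are the placeholders used in the question, not real ones):
def do_something_with(chunk_df):
    # Rename headers so they match the other files to merge with later.
    chunk_df = chunk_df.rename(columns={'X': 'Y', 'Z': 'A', 'B': 'C'})
    # Add a simple flag column and build an ID by concatenating two columns.
    chunk_df['Total_Repeats'] = 1
    chunk_df['D'] = chunk_df['E'] + chunk_df['C']
    return chunk_df
Note that with header=False on every chunk the output csv has no header row; write it once for the first chunk if you need one.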
Python is usually pretty good with that, as long as you skip the .read() approach when looking at large files.
If you just use the iterators, you should be fine:
with open('mybiginputfile.txt', 'rt') as in_file:
    with open('mybigoutputfile.txt', 'wt') as out_file:
        for row in in_file:
            # do something with row here
            out_file.write(row)
Someone who knows more will explain how the memory side of it works, but this works for me on multi GB files without crashing Python.
You might want to chuck the data into a proper DB before killing your laptop with the task of serving up the data AND running Tableau too!
