I have a very simple task: I need to take the sum of one column in a file that has many columns and thousands of rows. However, every time I open the file in Jupyter, it crashes, since I cannot go over 100 MB per file.
Is there any workaround for such a task? I feel I shouldn't have to load the entire file, since I need just one column.
Thanks!
I'm not sure whether this will work, since the information you have provided is somewhat limited, but if you're using Python 3 I had a similar issue. Try adding this at the top and see if it helps:
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
The above solution is a band-aid: it isn't supported and may cause undefined behavior. If your data is too big for your memory, try reading it with dask instead:
import dask.dataframe as dd
df = dd.read_csv(path)  # accepts most of the usual pandas read_csv parameters
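Since only one column is needed, the sum can also be computed with the standard library alone, streaming the file one row at a time so memory use stays constant. A minimal sketch, assuming a comma-separated file with a header row; the column name is whatever your file uses:

```python
import csv

def sum_column(path, column):
    """Stream a CSV row by row and sum one column without loading the file."""
    total = 0.0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)  # uses the header row to map column names
        for row in reader:
            total += float(row[column])
    return total
```

Usage would be something like total = sum_column("data.csv", "value"); only one row is ever in memory at a time.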
You have to read the file even if you want just one row; opening it loads it into memory, and that is your problem.
You can either split the file into smaller pieces outside of IPython, or
use a library like pandas and read it in chunks, as in the answer above.
You could slice the rows into several smaller data frames and then work on each of them in turn.
The hanging issues are caused by insufficient RAM in your system.
Use new_dataframe = dataframe.iloc[:, :] or new_dataframe = dataframe.loc[:, :] for slicing in pandas:
row slicing goes before the colon and column slicing after it.
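As an illustration of that slicing syntax (the frame contents and slice bounds here are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": range(20, 30)})

# iloc slices by integer position: first 5 rows, first 2 columns
chunk1 = df.iloc[:5, :2]

# loc slices by label: rows 5-9 (inclusive), columns "b" through "c"
chunk2 = df.loc[5:9, "b":"c"]
```

Note that loc bounds are inclusive on both ends, while iloc follows the usual Python half-open convention.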
Related
I'm trying to use Dask to read a large number of CSV files, but I'm having issues since the number of columns varies between the files, as does their order.
I know that packages like d6tstack (as detailed here) can help handle this, but is there a way to fix it without installing additional libraries and without taking up more disk space?
If you use from_delayed, then you can make a function which pre-processes each of your input files as you might wish. This is totally arbitrary, so you can choose to solve the issue using your own code or any package you want to install across the cluster.
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def read_a_file(filename):
    df = pd.read_csv(filename)  # or a remote file
    df_out = do_something_with_columns(df)  # e.g., add missing columns, reorder
    return df_out

df = dd.from_delayed([read_a_file(f) for f in filenames], meta=...)
I want to read large CSV files into Python as fast as possible. I have a CSV file of ~100 million rows. I came across this primer https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d and it goes through a few packages:
csv
pandas
dask
datatable
paratext
For my purposes, "csv" is too raw, and I want to leverage the type inference included in the other packages. I need it to work on both Windows and Linux machines. I have also looked into datatable and paratext, but had problems installing the right package dependencies (neither is on the Anaconda package repo). So that leaves pandas and dask. At first glance dask seems much faster, but then I realized it only does the computations when you call ".compute".
My specific use case is that even though the raw CSV file is 100+ million rows, I only need a subset of it loaded into memory. For example, all rows with date >= T. Is there a more efficient way of doing this than the example below? Both pandas and dask take a similar time.
EDIT: The CSV file updates daily and there is no pre-known order of the rows, i.e. it is not necessarily the case that the most recent dates are at the end of the file.
import pandas as pd
import dask.dataframe as dd
from datetime import datetime
s = datetime.now()
data1 = pd.read_csv("test.csv", parse_dates=["DATE"])
data1 = data1[data1.DATE>=datetime(2019,12,24)]
print(datetime.now()-s)
s = datetime.now()
data2 = dd.read_csv("test.csv", parse_dates=["DATE"])
data2 = data2[data2.DATE>=datetime(2019,12,24)].compute()
print(datetime.now()-s)
Your Dask solution looks good to me. For parsing CSV in particular you might want to use Dask's multiprocessing scheduler. Most Pandas operations run better with threads, but text-based processing such as CSV parsing is an exception:
data2 = data2[data2.DATE>=datetime(2019,12,24)].compute(scheduler="processes")
See https://docs.dask.org/en/latest/scheduling.html for more information.
CSV is not an efficient file format for filtering: CSV files have no indexes on data fields and no key-based access. Every filter operation has to read all lines.
You can improve performance marginally by using a library written in C, or one that does some things a little smarter than another library, but don't expect miracles. I'd expect an improvement of a few percent to perhaps 3x if you identify or implement an optimized C version for reading your lines and performing the initial filtering.
If you read the CSV file more often, it might be useful to convert the file during the first read (multiple options exist: hand-crafted helpers, indexes, sorting, databases, ...) and perform subsequent reads on that 'database'.
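A minimal sketch of that convert-once approach, using only the standard library (the file names, the DATE column, and the cutoff value are made up): load the CSV into SQLite once, index the filter column, and let subsequent queries use the index instead of rescanning every line.

```python
import csv
import sqlite3

def csv_to_sqlite(csv_path, db_path):
    """One-time conversion: load the CSV into SQLite and index the DATE column."""
    con = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        con.execute(f"CREATE TABLE data ({cols})")
        con.executemany(f"INSERT INTO data VALUES ({marks})", reader)
    con.execute('CREATE INDEX idx_date ON data ("DATE")')
    con.commit()
    return con

def rows_since(con, cutoff):
    """Subsequent reads hit the index instead of scanning the whole file."""
    return con.execute('SELECT * FROM data WHERE "DATE" >= ?', (cutoff,)).fetchall()
```

The first conversion still reads every line, but every later filter becomes an indexed lookup.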
If you know that the new CSV file is the same as the previous version plus lines appended to the end of the file, you could memorize the position of the last line of the previous version and just add the new lines to your optimized data file (database, ...).
Other file formats can be hundreds or thousands of times more efficient, but creating them the first time is probably at least as expensive as your search (so if you read only once, there's nothing to optimize).
If none of the above conditions holds, you cannot expect a huge performance increase.
You might look at What is the fastest way to search the csv file? for accelerations (assuming the files can be sorted/indexed by a search/filter criterion).
The issue is that I have 299 .csv files (each of 150-200 MB on average, with millions of rows and 12 columns) that make up a year of data (approx. 52 GB/year). I have 6 years of these and would like to concatenate all of them into a single .csv file with Python. As you may expect, I ran into memory errors when trying the following code (my machine has 16 GB of RAM):
import os, gzip, pandas as pd, time

rootdir = "/home/eriz/Desktop/2012_try/01"
dataframe_total_list = []
counter = 0
start = time.time()

for subdir, dirs, files in os.walk(rootdir):
    dirs.sort()
    for files_gz in files:
        with gzip.open(os.path.join(subdir, files_gz)) as f:
            df = pd.read_csv(f)
            dataframe_total_list.append(df)
            counter += 1
            print(counter)

total_result = pd.concat(dataframe_total_list)
total_result.to_csv("/home/eriz/Desktop/2012_try/01/01.csv", encoding="utf-8", index=False)
My aim: get a single .csv file to then use to train DL models etc.
My constraints: I'm very new to these huge amounts of data, but I have done "part" of the work:
I know that multiprocessing will not help much here; it is a sequential job in which I need each task to complete before I can start the next one. The main problem is running out of memory.
I know that pandas handles this well, even with chunksize added. However, the memory problem is still there because the amount of data is huge.
I have tried to split the work into little tasks so that I don't run out of memory, but then I run out anyway later, when concatenating.
My questions:
Is it still possible to do this task with python/pandas in some other way I'm not aware of, or do I need to switch to a database approach no matter what? If so, which would you advise?
If the database approach is the only path, will I have problems when I need to perform python-based operations to train the DL model? I mean, if I need to use pandas/numpy functions to transform the data, is that possible, or will I have problems because of the file size?
Thanks a lot in advance; I would appreciate a more in-depth explanation of the topic.
UPDATE 10/7/2018
After trying and using the code snippets that @mdurant pointed out, I have learned a lot and corrected my perspective on dask and memory issues.
Lessons:
Dask is there to be used AFTER the first pre-processing step (if you end up with huge files that pandas struggles to load/process). Once you have your "desired" mammoth file, you can load it into a dask.dataframe object without any issue and process it.
Memory related:
First lesson: come up with a procedure so that you don't need to concatenate all the files and run out of memory; just process them in a loop, reducing their content by changing dtypes, dropping columns, resampling...
Second lesson: try to put into memory ONLY what you need, so that you don't run out.
Third lesson: if neither of the other lessons applies, just look for an EC2 instance or big data tools like Spark, SQL, etc.
Thanks @mdurant and @gyx-hh for your time and guidance.
First thing: taking the contents of each CSV and concatenating them into one huge CSV is simple enough; you do not need pandas or anything else for that (or even Python):
outfile = open('outpath.csv', 'wb')  # binary mode, since gzip yields bytes
for files_gz in files:
    with gzip.open(os.path.join(subdir, files_gz)) as f:
        for line in f:
            outfile.write(line)
outfile.close()
(You may want to skip the first line of each CSV if it has a header row with column names.)
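A minimal sketch of that header-skipping concatenation (the glob pattern and paths are assumptions; the first file keeps its header, the rest drop theirs):

```python
import glob
import gzip

def concat_csvs(pattern, outpath):
    """Concatenate gzipped CSVs, keeping only the first file's header row."""
    with open(outpath, "wb") as outfile:
        for i, name in enumerate(sorted(glob.glob(pattern))):
            with gzip.open(name) as f:
                header = next(f)           # first line of this file
                if i == 0:
                    outfile.write(header)  # write the header only once
                for line in f:
                    outfile.write(line)
```

Sorting the glob results keeps the output order deterministic across runs.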
To do processing on the data is harder. The reason is that, although Dask can read all the files and work on the set as a single data-frame, if any single file requires more memory than your system can handle, processing will fail. This is because random access doesn't mix with gzip compression.
The output file, however, is (presumably) uncompressed, so you can do:
import dask.dataframe as dd
df = dd.read_csv('outpath.csv') # automatically chunks input
df[filter].groupby(fields).mean().compute()
Here, only the reference to dd and the .compute() are specific to dask.
I have a huge CSV file with around 4 million columns and around 300 rows. The file size is about 4.3 GB. I want to read this file and run some machine learning algorithms on the data.
I tried reading the file via pandas read_csv in Python, but it takes a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options, like numpy fromfile, but nothing seems to be working.
Can someone please suggest some way to load a file with this many columns in Python?
Pandas/numpy should be able to handle that volume of data no problem. I hope you have at least 8 GB of RAM on that machine. To import a CSV file with NumPy, try something like:
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row, using readline and str.split. Then pass that to Pandas or numpy, assuming that's how you intend to operate on the data. You could then save it to disk in a format that is easier to ingest later. hdf5 was already mentioned and is a good option. You can also save a numpy array to disk with numpy.savez, or my favorite, the speedy bloscpack.(un)pack_ndarray_file.
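A sketch of that list-of-lists route (the file name, integer dtype, and comma delimiter are assumptions about your data): parse the rows yourself, hand them to NumPy, then cache the array in a binary format for faster reloads.

```python
import numpy as np

def csv_to_npz(csv_path, npz_path, dtype=np.uint8):
    """Parse a plain numeric CSV into an array, then cache it as .npz."""
    rows = []
    with open(csv_path) as f:
        for line in f:
            rows.append([int(x) for x in line.strip().split(",")])
    arr = np.array(rows, dtype=dtype)
    np.savez(npz_path, data=arr)  # reloading .npz is much faster than re-parsing CSV
    return arr
```

Later runs can then do np.load("cache.npz")["data"] instead of re-reading the CSV.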
CSV is a very inefficient format for storing large datasets. You should convert your CSV file into a better-suited format. Try HDF5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without fully loading it into memory.
According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.
After the recommendation in Jeff's answer to check out this Google Forum, I still didn't feel satisfied about the conclusion regarding the appendCSV method. Below you can see my implementation for reading many XLS files. Is there a way to significantly increase its speed? It currently takes over 10 minutes for around 900,000 rows.
listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
    data = pd.read_excel(a_file, sheetname=0, skiprows=range(1,2), header=1)
    data.rename(columns={'Alphabeta':'AlphaBeta'}, inplace=True)
    frame = frame.append(data)

# Save to CSV..
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")
The very first important point:
Optimize only code that actually needs to be optimized.
If you need to convert all your files just once, then you have already done a great job, congrats! If, however, you need to reuse this really often (and by "really" I mean that there is a source producing your Excel files at a rate of at least 900K rows per 10 minutes, and you need to parse them in real time), then what you need to do is analyze your profiling results.
Profiling analysis
Sorting your profile in descending order by 'cumtime' (the cumulative execution time of a function including its subcalls), you will discover that out of ~2000 seconds of runtime, ~800 seconds are taken by the 'read_excel' method and ~1200 seconds by the 'to_csv' method.
If you then sort the profile by 'tottime' (the total execution time of the functions themselves), you will find that the top time consumers are functions connected with reading and writing lines and converting between formats. So the real problem is that either the libraries you use are slow, or the amount of data you are parsing is really huge.
Possible solutions
For the first cause, keep in mind that parsing Excel lines and converting them can be a really complex task. It is hard to advise you without an example of your input data. But there could be a real time loss simply because the library you are using is general-purpose: it does the hard work of parsing rows several times when you actually do not need it, because your rows have a very simple structure. In this case you may try to switch to a different library that does not perform complex parsing of the input data, for example xlrd for reading data from Excel. But in the title you mentioned that the input files are also CSVs, so if that is applicable in your case, then load lines with just:
line.strip().split(sep)
instead of complex Excel-format parsing. And of course, if your rows are simple, you can always use
','.join(list_of_rows)
to write the CSV instead of using complex DataFrames at all. However, if your files contain Unicode symbols, complex fields, and so on, then these libraries are probably the best choice.
For the second cause: 900K rows could contain anywhere from 900K to infinitely many bytes, so it is really hard to tell whether your data input is actually that big, again without an example. If you really do have a lot of data, there is probably not much you can do and you just have to wait. And remember that a disk is actually a very slow device. Ordinary disks provide ~100 MB/s at best, so if you are copying (because ultimately that is what you are doing) 10 GB of data, at least 3-4 minutes will be needed just for physically reading the raw data and writing the result. But if you are not using your disk bandwidth at 100% (for example, if parsing one row with the library you are using takes comparable time to just reading that row from disk), you might also try to speed up your code by reading data asynchronously with multiprocessing's map_async instead of a loop.
If you are using pandas, you could do this:
dfs = [pd.read_excel(path.join(dir, name), sheet_name=0) for name in os.listdir(dir) if name.endswith(suffix)]
df = pd.concat(dfs, axis=0, ignore_index=True)
(Note that sep, encoding, and error_bad_lines are read_csv options; read_excel does not accept them.)
This is screaming fast compared to other methods of getting data into pandas. Other tips:
You can also speed this up by specifying dtype for all columns.
If you are using read_csv, pass engine='c' to speed up the import.
To skip rows on error with read_csv, pass error_bad_lines=False.