Analyzing huge dataset from 80GB file with only 32GB of RAM

Analyzing huge dataset from 80GB file with only 32GB of RAM - python

I have very huge text file ( around 80G ). File contains only numbers(integers+floats) and has 20 columns. Now I have to analyze each column. By analyze I mean, I have to do some basic calculations on each column like finding mean, plotting histograms, check if condition is satisfied or not etc. I am reading file like following
with open(filename) as original_file:
all_rows = [[float(digit) for digit in line.split()] for line in original_file]
all_rows = np.asarray(all_rows)
After this I do all analysis on specific columns. I use 'good' configuration server/workstation (with 32GB RAM) to execute my program. Problem is that I am not able to finish my job. I waited almost day to finish that program but it was still running after 1 day. I had to kill it manually later on. I know my script is correct without any error because I have tried same script on smaller size files (around 1G) and it worked nicely.
My initial guess is it will have some memory problem. Is there any way I can run such job? Some different method or some other way ?
I tried splitting files into smaller file size and then analyzed them individually in loop like follows
pre_name = "split_file"
for k in range(11): #There are 10 files with almost 8G each
filename = pre_name+str(k).zfill(3) #My files are in form "split_file000, split_file001 ..."
with open(filename) as original_file:
all_rows = [[float(digit) for digit in line.split()] for line in original_file]
all_rows = np.asarray(all_rows)
#Some analysis here
plt.hist(all_rows[:,8],100) #Plotting histogram for 9th Column
all_rows = None
I have tested above code on bunch of smaller files and it works fine. However again it was same problem when I used on big files. Any suggestions? Is there any other cleaner way to do it ?

For such lengthy operations (when data don't fit in memory), it might be useful to use libraries like dask ( http://dask.pydata.org/en/latest/ ), particularly the dask.dataframe.read_csv to read the data and then perform your operations like you would do in pandas library (another useful package to mention).

Two alternatives come to my mind:
You should consider performing your computation by online algorithms
In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start.
It is possible to compute mean and variance and a histogram with pre-specified bins this way with constant memory complexity.
You should throw your data into a proper database and make use of that database system's statistical and data handling capabilities, e.g., aggregate functions and indexes.
Random links:
http://www.postgresql.org/
https://www.mongodb.org/

Related

python xarray write to netcdf file very slow

for fname in ids['fnames']:
aq = xr.open_dataset(fname, chunks='auto', mask_and_scale=False)
aq = aq[var_lists]
aq = aq.isel(lat=slice(yoff, yoff+ysize), lon=slice(xoff, xoff+xsize))
list_of_ds.append(aq)
aq.close()
all_ds = xr.concat(list_of_ds, dim='time')
all_ds.to_netcdf('tmp.nc')
Hi all, I am making use of xarray to read netcdf files (around 1000) and save selected resutls to a temporary file, as shown above. However, the saving part runs very slow. How can I speed this up?
I also tried directly load the data, but still very slow.
I've also tried using open_mfdataset with parallel=True, and it's also slow:
aq = xr.open_mfdataset(
sorted(ids_list),
data_vars=var_lists,
preprocess=add_time_dim,
combine='by_coords',
mask_and_scale=False,
decode_cf=False,
parallel=True,
)
aq.isel({'lon':irlon,'lat':irlat}).to_netcdf('tmp.nc')

Unfortunately, concatenating ~1000 files in xarray will be slow. Not a great way around that.
It's hard for us to offer specific advice without more detail about your data and setup. But here are some things I'd try:
use xr.open_mfdataset. Your second code block looks great. dask will generally be faster and more efficient at managing tasks than you will with a for loop.
Make sure your chunks are aligned with how you're slicing the data. You don't want to read in more than you have to. If you're reading netCDFs, you have flexibility about how to read in the data into dask. Since you're selecting (it looks like) a small spatial region within each array, it may make sense to explicitly chunk the data such that you're only reading in a small portion of each array, e.g. with chunks={"lat": 50, "lon": 50}. You'll want to balance a few things here - making sure the chunk sizes are manageable and not too small (leading to too many tasks). Shoot for chunks ~100-500 MB range as a general rule, and trying to keep the number of tasks to less than 1 million (or # chunks to fewer than ~10-100k across all your datasets).
Be explicit about your concatenation. The more "magic" the process feels, the more work xarray is doing to infer what you mean. Generally, combine='nested' performs better than 'by_coords', so if you're concatenating files which are structured logically along one or more dimensions, it may help to arrange the files in the same way a dim is provided.
skip the pre-processing. If you can, add new dimensions on concatenation rather than as an ingestion step. This allows dask to more fully plan the computation, rather than treating your preprocess function as a black box, and what's worse as a pre-requisite to scheduling the final array construction operation because you're using combine='by_coords', where the coords are the result of an earlier dask operation. If you need to attach a time dim to each file, with 1 element per file, something like xr.open_mfdataset(files, concat_dim=pd.Index(pd.date_range("2020-01-01", freq="D", periods=1000), name="time"), combine="nested") works well in my experience.
If this is all taking too long, you could try pre-processing the data. Using a compiled utility like nco or even just subsetting the data and grouping smaller subsets of the data into larger files using dask.distributed's client.map might help cut down on the complexity of the final dataset join.

Pyarrow Write/Append Columns Arrow File

I have a calculator that iterates a couple of hundred object and produces Nx1 arrays for each of those objects. N here being 1-10m depending on configurations. Right now I am summing over these by using a generator expression, so memory consumption is low. However, I would like to store the Nx1 arrays to file, so I can do other computations.(Compute quantiles, partial sums etc. pandas style) Preferably I would like to use pa.memory_map on a single file (in order to have dataframes not loaded into memory), but I can not see how I can produce such a file without generating the entire result first. (Monte Carlo results on 200-500*10m floats).
If I understand correctly RecordBatchStreamWriter needs a part of the entire table, and I can not produce only a part of it. The parts the calculator produces is the columns, not parts of all columns. Is there any way of writing "columns" one by one? Either by appending, or create an empty arrow file which can be filled? (schema known).
As I see it, my alternative is to write several files and use "dataset" /tabular data to "join" them together. My "other computations" would then have to filter or pull parts into memory as I can`t see in the docs that "dataset()" work with memory_map.The result set is to big to fit in memory. (At least on the server it is running on)
I`m on day 2 of digging the docs and trying to understand how it all works, so apologies if the "lingo" is not all correct.
On further inspection, it looks like all files used in datasets() must have same schema, so I can not split "columns" in separate files either, can I..
EDIT
After wrestling with this library, I now produce single column files which I later combine in a single file. However, in following the suggested solution visible memory consumption (task manager) skyrockets in the step of combining the files. I would expect peaks for every "rowgroup" or combined recordbatch, but instead steadily increase to use all memory. A snip of this step:
readers = [pa.ipc.open_stream(file) for file in self.tempfiles]
combined_schema = pa.unify_schemas([r.schema for r in readers])
with pa.ipc.new_stream(
os.path.join(self.filepath, self.outfile_name + ".arrow"),
schema=combined_schema,
) as writer:
for group in zip(*readers):
combined_batch = pa.RecordBatch.from_arrays(
[g.column(0) for g in group], names=combined_schema.names
)
writer.write_batch(combined_batch)
From this link I would expect that running memory consumption to be that of combined_batch and some.

You could do the write in two passes.
First, write each column to its own file. Make sure to set a row group size small enough that a table consisting of one row group from each file comfortably fits into memory.
Second, create a streaming reader for each file you created and one writer. Read a single row group from each one. Create a table by combining all of the partial columns and write the table to your writer. Repeat until you've exhausted all your readers.
I'm not sure that memory mapping is going to help you much here.

Memory problems with pandas (understanding memory issues)

The issue is that I have 299 .csv files (each of 150-200 MB on average of millions of rows and 12 columns - that makes up a year of data (approx of 52 GB/year). I have 6 years and would like to finally concatenate all of them) which I want to concatenate into a single .csv file with python. As you may expect, I ran into memory errors when trying the following code (my machine has 16GB of RAM):
import os, gzip, pandas as pd, time
rootdir = "/home/eriz/Desktop/2012_try/01"
dataframe_total_list = []
counter = 0
start = time.time()
for subdir, dirs, files in os.walk(rootdir):
dirs.sort()
for files_gz in files:
with gzip.open(os.path.join(subdir, files_gz)) as f:
df = pd.read_csv(f)
dataframe_total_list.append(df)
counter += 1
print(counter)
total_result = pd.concat(dataframe_total_list)
total_result.to_csv("/home/eriz/Desktop/2012_try/01/01.csv", encoding="utf-8", index=False)
My aim: get a single .csv file to then use to train DL models and etc.
My constraint: I'm very new to this huge amounts of data, but I have done "part" of the work:
I know that multiprocessing will not help much in my development; it is a sequential job in which I need each task to complete so that I can start the following one. The main problem is memory run out.
I know that pandas works well with this, even with chunk size added. However, the memory problem is still there because the amount of data is huge.
I have tried to split the work into little tasks so that I don't run out of memory, but I will have it anyway later when concatenating.
My questions:
Is still possible to do this task with python/pandas in any other way I'm not aware of or I need to switch no matter what to a database approach? Could you advise which?
If the database approach is the only path, will I have problems when need to perform python-based operations to train the DL model? I mean, if I need to use pandas/numpy functions to transform the data, is that possible or will I have problems because of the file size?
Thanks a lot in advance and I would appreciate a lot a "more" in deep explanation of the topic.
UPDATE 10/7/2018
After trying and using the below code snippets that #mdurant pointed out I have learned a lot and corrected my perspective about dask and memory issues.
Lessons:
Dask is there to be used AFTER the first pre-processing task (if it is the case that finally you end up with huge files and pandas struggles to load/process them). Once you have your "desired" mammoth file, you can load it into dask.dataframe object without any issue and process it.
Memory related:
first lesson - come up with a procedure so that you don't need to concat all the files and you run out of memory; just process them looping and reducing their content by changing dtypes, dropping columns, resampling... second lesson - try to ONLY put into memory what you need, so that you don't run out. Third lesson - if any of the other lessons don't apply, just look for a EC2 instance, big data tools like Spark, SQL etc.
Thanks #mdurant and #gyx-hh for your time and guidance.

First thing: to take the contents of each CSV and concatenate into one huge CSV is simple enough, you do not need pandas or anything else for that (or even python)
outfile = open('outpath.csv', 'w')
for files_gz in files:
with gzip.open(os.path.join(subdir, files_gz)) as f:
for line in f:
outfile.write(line)
outfile.close()
(you may want to ignore the first line of each CSV, if it has a header with column names).
To do processing on the data is harder. The reason is, that although Dask can read all the files and work on the set as a single data-frame, if any file results in more memory than your system can handle, processing will fail. This is because random-access doesn't mix with gzip compression.
The output file, however, is (presumably) uncompressed, so you can do:
import dask.dataframe as dd
df = dd.read_csv('outpath.csv') # automatically chunks input
df[filter].groupby(fields).mean().compute()
Here, only the reference to dd and the .compute() are specific to dask.

Slow loop python to search data in antoher data frame in python

I have two data frames : one with all my data (called 'data') and one with latitudes and longitudes of different stations where each observation starts and ends (called 'info'), I am trying to get a data frame where I'll have the latitude and longitude next to each station in each observation, my code in python :
for i in range(0,15557580):
for j in range(0,542):
if data.year[i] == '2018' and data.station[i]==info.station[j]:
data.latitude[i] = info.latitude[j]
data.longitude[i] = info.longitude[j]
break
but since I have about 15 million observation , doing it, takes a lot of time, is there a quicker way of doing it ?
Thank you very much (I am still new to this)
edit :
my file info looks like this (about 500 observation, one for each station)
my file data like this (theres other variables not shown here) (about 15 million observations , one for each travel)
and what i am looking to get is that when the stations numbers match that the resulting data would look like this :

This is one solution. You can also use pandas.merge to add 2 new columns to data and perform the equivalent mapping.
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['latitude']
# calculate Boolean mask on year
mask = data['year'] == '2018'
# apply mappings, if no map found use fillna to retrieve original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
.fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
.fillna(data.loc[mask, 'longitude'])

This is a very recurrent and important issue when anyone starts to deal with large datasets. Big Data is a whole subject in itself, here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of data, making them optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in term of performance. In your case, without knowing about your dataset, it is hard to say exactly how you should process it, you will have to figure out on your own how to avoid the most computation possible while getting the exact same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, you should not use for loops and break them inside. Whenever you don't know precisely how many loops you will have to go through in the first place, you should always use while or do...while loops.
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data in a serialized way is faster of small amount of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks.
It aims at doing everything in parallel. It relies on a concept named MapReduce.
The first distributed data storage framework was Hadoop (eg. Hadoop Distributed File System or HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate for you not to use MR directly on top HDFS, but using a upper level one, preferably in-memory, such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, try to have a look at frameworks such as Hive, Pig or Sqoop for example.
Again this subject is a whole different world but might very well be adapted to your situation. Feel free to document yourself about all these concepts and frameworks, and leave your questions if needed in the comments.

Is there a faster way to append many XLS files into a single CSV file?

After the recommendation from Jeff's Answer to check out this Google Forum, I still didn't feel satisfied on what the conclusion was regarding the appendCSV method. Below, you can see my implementation of reading many XLS files. Is there a way to significantly increase the speed of this? It currently takes over 10 minutes for around 900,000 rows.
listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
data = pd.read_excel(a_file, sheetname=0, skiprows=range(1,2), header=1)
data.rename(columns={'Alphabeta':'AlphaBeta'}, inplace=True)
frame = frame.append(data)
# Save to CSV..
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")

The very first important point
Optimize only code that is required to be optimized.
If you need to convert all you files just once then you have already made a great job, congrats! If you, however, need to reuse it really often (and by really I mean that there is a source that produce your Excel files with a speed at least of 900K rows per 10 minutes and you need to parse them in real-time) then what you need to do is to analyze your profiling results.
Profiling analysis
Sorting your profile in descending order by 'cumtime', which is cumulative execution time of function including its subcalls, you will discover that out of ~2000 seconds of runtime ~800 seconds are taken by 'read_excel' method and ~1200 seconds are taken by 'to_csv' method.
If then you will sort profile by 'tottime' which is total execution time of functions themselves you will find out that top time consumers are populated with functions that are connected with reading and writing lines and conversion between formats. So, the real problem is that either libraries you use are slow, or the amount of data you are parsing is really huge.
Possible solutions
For the first reason, please keep in mind that parsing Excel lines and converting them could be a really complex task. It is hard to advice you without having an example of your input data. But there could be a real time loss just because the library you are using is for everything and it does hard work parsing rows several times when you actually do not need it, because your rows have very simple structure. In this case you may try to switch to different libraries, that does not perform complex parsing of input data, for example use xlrd for reading data from Excel. But in title you mentioned that input files are also CSVs so if this is applicable in your case then load lines with just:
line.strip().split(sep)
instead of complex Excel format parsing. And of course if your rows are simple than you can always use
','.join(list_of_rows)
to write CSV instead of using complex DataFrames at all. However, if your files contain Unicode symbols, complex fields and so on then these libraries are probably the best choice.
For the second reason - 900K rows could contain from 900K to infinite bytes, so it is really hard to understand whether your data input is really so big, without an example again. If you have really a lot of data then probably there is not too much you could do and you just have to wait. And remember that disk is actually a very slow device. Usual disks could provide you with ~100Mb/s at its best so if you are copying (because ultimately that is what you are doing) 10Gb of data then you can see that at least 3-4 minutes will be required for just physically reading raw data and writing the result. But in case if you are not using your disk bandwidth for 100% (for example if parsing one row with library that you are using takes comparable time with just reading this row from disk) you might also try to increase speed of your code by asynchronous data reading with multiprocessing map_async instead of cycle.

If you are using pandas, you could do this:
dfs = [pd.read_excel(path.join(dir, name), sep='\t', encoding='cp1252', error_bad_lines=False ) for name in os.listdir(dir) if name.endswith(suffix)]
df = pd.concat(dfs, axis=0, ignore_index=True)
This is screaming fast compared to other methods of getting data into pandas. Other tips:
You can also speed this up by specifying dtype for all columns.
If you are doing read_csv, use the engine='c' to speed up the import.
Skip rows on error

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.