Suppose I have two very large HDF files that I want to read and concatenate.
data = pd.concat([
pd.read_hdf("file1.hdf", key='data'),
pd.read_hdf("file2.hdf", key='data')
])
Suppose each file takes 10 GB of memory. As far as I know, the code above has a
peak memory usage of about 40 GB, but my computer has only 32 GB of RAM.
Is there a good way to read the files and concatenate them in place so that the peak
memory usage is around 20 GB?
If you want to stay with pandas, I'd pass a chunksize parameter to the reader. That gives you an iterator over your data that you can process piece by piece.
Alternatively, try PySpark or Dask. Dask is essentially pandas, but it lets you both parallelize your pipeline and avoid loading the entire dataset at once.
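A minimal sketch of the chunked approach, assuming what you need in the end is an aggregate rather than the full concatenated frame. It uses small in-memory CSV text as a stand-in for the two files so it runs anywhere; the column name `value` and the helper `running_mean` are made up for illustration. For HDF the same loop should work with `pd.read_hdf(path, key='data', chunksize=...)`, but as far as I know that only works when the file was written in table format.

```python
import io
import pandas as pd

# Stand-ins for the two large files (hypothetical data).
FILE1 = "value\n1\n2\n3\n"
FILE2 = "value\n4\n5\n"


def running_mean(sources, column, chunksize):
    """Mean of `column` across all sources, holding one chunk at a time."""
    total = 0.0
    count = 0
    for src in sources:
        # chunksize turns the reader into an iterator of DataFrames.
        for chunk in pd.read_csv(io.StringIO(src), chunksize=chunksize):
            total += chunk[column].sum()
            count += len(chunk)
    return total / count


mean_value = running_mean([FILE1, FILE2], "value", chunksize=2)
print(mean_value)
```

Chunking does not shrink the final result: if you really need the whole concatenated frame, it still has to fit in RAM.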
Related
I have a 7GB postgresql table which I want to read into python and do some analysis. I cannot use Pandas for it because it is larger than the memory on my local machine. I therefore wanted to try reading the table into Dask Dataframe first, perform some aggregation and switch back to Pandas for subsequent analysis. I used the below lines of code for that.
import dask.dataframe as dd
df = dd.read_sql_table('table_xyz', uri = "postgresql+psycopg2://user:pwd@remotehost/dbname", index_col = 'column_xyz', schema = 'private')
The index_col, i.e. 'column_xyz', is indexed in the database. This works, but when I perform an action such as an aggregation, it takes ages (about an hour) to return the result.
avg = df.groupby("col1").col2.mean().compute()
I understand that Dask is not as fast as pandas, especially when I am working on a single machine rather than a cluster. Am I using Dask the right way? If not, what is a faster alternative for analyzing large tables that do not fit in memory using Python?
If your data fits into the RAM of your machine, you're better off using pandas; in that case Dask will generally not outperform it.
Alternatively, you can experiment with the chunk (partition) size and see if things improve. The best way to find out what is taking Dask so long is to look at its diagnostics tools, such as the dashboard. That will help you make a much more informed decision.
I'm trying to create a Keras Tokenizer out of a single column from hundreds of large CSV files. Dask seems like a good tool for this. My current approach eventually causes memory issues:
df = dd.read_csv('data/*.csv', usecols=['MyCol'])
# Process column and get underlying Numpy array.
# This greatly reduces memory consumption, but eventually materializes
# the entire dataset into memory
my_ids = df.MyCol.apply(process_my_col).compute().values
tokenizer = Tokenizer()
tokenizer.fit_on_texts(my_ids)
How can I do this by parts? Something along the lines of:
df = pd.read_csv('a-single-file.csv', chunksize=1000)
for chunk in df:
    # Process a chunk at a time
    ...
A Dask DataFrame is technically a set of pandas DataFrames, called partitions. When you get the underlying NumPy array, you destroy the partitioning structure and end up with one big array. I recommend using the map_partitions function of Dask DataFrames to apply regular pandas functions to each partition separately.
I also recommend map_partitions when it suits your problem. However, if you really just want sequential access and an API similar to read_csv(chunksize=...), then you might be looking for the partitions attribute:
for part in df.partitions:
    process(model, part.compute())
Is it in memory?
If so, then it doesn't matter if I import chunk by chunk or not because eventually, when I concatenate them, they'll all be stored in memory.
Does that mean for a large data set, there is no way to use pandas?
Yes, they will be stored in memory, and that's the reason why you want to chunk them - that allows you to not read the whole data set in at the same time, but process it in chunks before writing out the end result.
You can use chunksize to tell pandas how many rows should be read for each chunk. If you need a complete set of rows to perform arbitrary lookups, you'll have to back it with some other technology (such as a database).
Yes it is in memory, and yes when the dataset gets too large you have to use other tools.
Of course, you can load the data in chunks, process one chunk at a time, and write out the results (freeing memory for the next chunk).
That works fine for some kinds of processing, such as filtering and annotating. If you need sorting or grouping, you need some other tool; personally, I like BigQuery from Google Cloud.
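Here is a sketch of the chunk-at-a-time filtering described above, assuming a CSV source. The data, the `score >= 50` condition and the output file name are all made up for illustration.

```python
import io
import os
import tempfile

import pandas as pd

# Stand-in for a large CSV file (hypothetical data).
SOURCE = "id,score\n1,10\n2,55\n3,70\n4,5\n5,90\n"

out_path = os.path.join(tempfile.mkdtemp(), "filtered.csv")

first = True
for chunk in pd.read_csv(io.StringIO(SOURCE), chunksize=2):
    kept = chunk[chunk["score"] >= 50]
    # Overwrite on the first chunk, append afterwards; header only once.
    kept.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
    first = False

result = pd.read_csv(out_path)
print(result)
```

Only one chunk is held in memory at a time; the filtered rows accumulate on disk rather than in RAM.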
I have a huge list of data frames called df_list (with some different and some common columns) which I wish to merge into one big data frame. I have tried the following:
all_dfs = pd.concat(df_list)
However, this takes too much time on a single core; I killed the script after 48 hours. How would you parallelize this process to use all my cores, or rewrite the code to make it faster?
pandas is not designed for parallel processing.
The easiest way is to use third-party tools to process huge data frames. They let you run the processing of a data set on different nodes.
You can look at Dask (its interface is similar to pandas).
You can look at PySpark.
You can also use swifter to run processing on multiple cores.
There are probably some other tools. In other words, in your case it is better to run the calculations on a cluster.
Hope this helps.
I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV file into a Spark DataFrame:
lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l : l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts,schemaData)
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas():
Does it store the pandas object in local memory?
Is pandas low-level computation all handled by Spark?
Does it expose all pandas DataFrame functionality? (I guess yes.)
Can I convert it with toPandas() and just be done with it, without touching the DataFrame API much?
Using Spark to read a CSV file into pandas is quite a roundabout way of achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. pandas will automatically infer the type of each column; sc.textFile doesn't do this.
Now to answer your questions:
Does it store the pandas object in local memory?
Yes. toPandas() will convert the Spark DataFrame into a pandas DataFrame, which is of course in memory.
Is pandas low-level computation all handled by Spark?
No. pandas runs its own computations. There's no interplay between Spark and pandas; there's simply some API compatibility.
Does it expose all pandas DataFrame functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many many methods and functions that are in the pandas API that are not in the PySpark API.
Can I convert it with toPandas() and just be done with it, without touching the DataFrame API much?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
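For reference, a minimal sketch of the pandas.read_csv route for a tab-separated file without a header row. The data and the column names in `fnames` are made up; in-memory text is used in place of 'tail5.csv' so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the tab-separated file (hypothetical data, no header row).
TSV = "alice\t3\t1.5\nbob\t4\t2.5\n"
fnames = ["name", "visits", "weight"]  # hypothetical column names

df = pd.read_csv(io.StringIO(TSV), sep="\t", header=None, names=fnames)

# Numeric columns are inferred, so arithmetic works without manual casting.
print(df.dtypes)
total = df["visits"].sum()
print(total)
```

With a real file you would pass the path instead of the `io.StringIO` object.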
Using some spark context or hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run a rdd.collect() method, you end up copying the contents of the rdd from all the worker nodes to the master node memory. Thus you lose your distributed compute benefits (but can still run the rdd methods).
Similarly with pandas, when you run toPandas(), you copy the data frame from distributed (worker) memory to the local (master) memory and lose most of your distributed compute capabilities. So, one possible workflow (that I often use) might be to pre-munge your data into a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set. Hope that helps.