Pandas: Merge two large datasets - python

I am using Pandas for my analysis (currently in a Jupyter Notebook). I have two large datasets (one is 14 GB and the other is 4 GB). I need to merge these two datasets on a column. I use the following code:
df = pd.merge(aa, bb, on='column', how='outer')
Normally, this code works. However, since my datasets are large, it takes too much time. I started the run 4 hours ago and it is still going. My machine has 8 GB of RAM.
Do you have any suggestions for that?

You can try using dask.dataframe to parallelize your task:
import dask.dataframe as dd
# define lazy readers
aa = dd.read_csv('file1.csv')
bb = dd.read_csv('file2.csv')
# define merging logic
dd_merged = aa.merge(bb, on='column', how='outer')
# apply merge and convert to dataframe
df = dd_merged.compute()
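One caveat: .compute() materializes the entire merged result as a single in-memory pandas DataFrame, which may not fit in 8 GB of RAM. A hedged alternative, if you mainly need the result on disk (the output file names below are just placeholders), is to let Dask write the output partition by partition, reusing dd_merged from the block above:
# write the result partition by partition instead of collecting one huge pandas frame
dd_merged.to_csv('merged-*.csv', index=False)   # one CSV per partition
# or: dd_merged.to_parquet('merged_parquet/')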


Dask dataframe running out of memory while converting it to pandas after selecting subset

I have a parquet file with 160M records and 240 columns, and I'm loading it with dask in Python on an EMR cluster (m5.12xlarge).
import dask.dataframe as dd
df = dd.read_parquet(file)
Now I want the value count and normalized value count of one of the columns:
count = df.a.value_counts()
percent = df.a.value_counts(normalize = True)
a_count = dd.concat([count,percent], axis=1, keys=['counts', '%'])
Out:
Dask DataFrame Structure:
counts %
npartitions=1
int64 float64
... ...
Dask Name: concat, 489 tasks
Note that this has a total of 1 npartition and 489 tasks.
Now I'm converting it into a pandas DataFrame, which takes only a few seconds to execute and uses around 1.5 GB of memory.
a_count = a_count.compute()
Next, from one of the columns I want all records with null values, and then I do the same value count as before.
empty_b = df[df['b'].isna()]
count = empty_b.a.value_counts()
percent = empty_b.a.value_counts(normalize = True)
empty_b = dd.concat([count,percent], axis=1, keys=['counts', '%'])
empty_b
Out:
Dask DataFrame Structure:
counts %
npartitions=1
int64 float64
... ...
Dask Name: concat, 828 tasks
This has a total of 1 npartition and 828 tasks.
Now when I try to convert this into a pandas DataFrame by computing it, it takes a lot of time and runs out of memory after using 170 GB.
empty_b = empty_b.compute()
Can someone explain what's going wrong here? I'm doing the same thing as before, and on a subset of the bigger DataFrame, yet my notebook runs out of memory and cannot finish.
I have a parquet file with 160M records and 240 columns
It's probably best to split up this data into multiple Parquet files
I'm loading it with dask in Python on an EMR cluster (m5.12xlarge)
This instance type has 48 CPUs. Is this a multi node instance or a single node?
Note that this has a total of 1 npartition and 489 tasks.
You're probably running all these computations on a single core. Try repartitioning the DataFrame into at least 48 partitions, so you can leverage the parallelism of your powerful machine.
Now I'm trying to convert this into a pandas DataFrame
You generally don't want to convert Dask DataFrames to Pandas DataFrames unless you've significantly reduced the number of rows. Once you convert to Pandas, you lose all the benefits that Dask's parallelism can provide.
In this example, it seems like you're reading a single Parquet file into a Dask DataFrame with one partition and then converting it back to Pandas. You might want to consider breaking up the Dask DataFrame into multiple partitions (and running computations via Dask) or just reading the file directly into a Pandas DataFrame.
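For reference, a minimal sketch of the two suggestions above, repartitioning and reducing what is read, using the column names from the question (whether this helps depends on how the Parquet file was written; the result variable name is just for illustration):
import dask.dataframe as dd

# read only the columns actually needed, rather than all 240
df = dd.read_parquet(file, columns=['a', 'b'])

# split the single partition so the 48 cores can work in parallel
df = df.repartition(npartitions=48)

empty_b = df[df['b'].isna()]
count = empty_b.a.value_counts()
percent = empty_b.a.value_counts(normalize=True)
result = dd.concat([count, percent], axis=1, keys=['counts', '%']).compute()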

How do I format my training data for an LSTM network using Keras when I have multiple varying length time-series data? [duplicate]

This question already has an answer here: Multivariate LSTM with missing values
I have two sets of training data that are of different lengths. I'll call these series the x_train data. Their shapes are (70480, 7) and (69058, 7), respectively. Each column represents a different sensor reading.
I am trying to use an LSTM network on this data. Should I merge the data into one object? How would I do that?
I also have two sets of data that are the resulting outputs from the x_train data. These are both of size (315, 1). Would I use these as my y_train data?
So far I have read the data using pandas.read_csv() as follows:
c4_x_train = pd.read_csv('path')
c4_y_train = pd.read_csv('path')
c6_x_train = pd.read_csv('path')
c6_y_train = pd.read_csv('path')
Any clarification is appreciated. Thanks!
Just a few points
For fast file reading, consider using a different format like Parquet or Feather. Be careful about deprecation, though; for long-term storage, CSV is just fine.
pd.concat is your friend here. Use it like this:
from pathlib import Path
import pandas as pd

dir_path = Path(r"yourFolderPath")
files_list = [str(p) for p in dir_path.glob("**/*.csv")]
if files_list:
    source_dfs = [pd.read_csv(file_) for file_ in files_list]
    df = pd.concat(source_dfs, ignore_index=True)
You can then use this df for your training.
Now, regarding the training. Well, that really depends, as always. If you have the datetime in those CSVs and they are continuous, go right ahead. If you have breaks in between the measurements, you might run into problems. Depending on trend, seasonality and noise, you could interpolate missing data. There are multiple approaches, such as the naive approach, filling with the mean, forecasting from the values before, and many more. There is no right or wrong; it just really depends on what your data looks like. A couple of these filling options are sketched just below.
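A minimal, hedged sketch of a few of those filling options, assuming a DataFrame with a numeric column named value that contains NaNs (the column name is only for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [1.4, np.nan, 2.5, np.nan, 1.1]})

filled_mean = df['value'].fillna(df['value'].mean())   # naive: fill with the column mean
filled_ffill = df['value'].ffill()                     # carry the previous measurement forward
filled_interp = df['value'].interpolate()              # linear interpolation between neighbours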
EDIT: Comments don't like codeblocks.
Works like this:
Example:
#df1:
time value
1 1.4
2 2.5
#df2:
time value
3 1.1
4 1.0
# glued together via df = pd.concat([df1, df2], ignore_index=True) to become:
time value
1 1.4
2 2.5
3 1.1
4 1.0

Performance decrease for a huge number of columns - Pyspark

I've run into a problem processing a wide Spark dataframe (about 9000 columns and sometimes more).
Task:
Create wide DF via groupBy and pivot.
Transform the columns to a vector and feed it to KMeans from pyspark.ml.
So I built the wide frame, tried to create the vector with VectorAssembler, cached it, and trained KMeans on it.
Assembling took about 11 minutes and KMeans took 2 minutes for 7 different cluster counts on my PC in standalone mode, for a frame of roughly 500x9000. In pandas, on the other hand, the same processing (pivot the df and iterate over the 7 cluster counts) takes less than a minute.
Obviously I understand the overhead and performance decrease of standalone mode, caching and so on, but it really discourages me.
Could somebody explain how I can avoid this overhead?
How do people work with wide DFs instead of using VectorAssembler and taking this performance hit?
A more formal question (per SO rules) would be: how can I speed up this code?
%%time
tmp = (df_states.select('ObjectPath', 'User', 'PropertyFlagValue')
       .groupBy('User')
       .pivot('ObjectPath')
       .agg({'PropertyFlagValue': 'max'})
       .fillna(0))
ignore = ['User']
assembler = VectorAssembler(
    inputCols=[x for x in tmp.columns if x not in ignore],
    outputCol='features')
Wall time: 36.7 s
print(tmp.count(), len(tmp.columns))
552, 9378
%%time
transformed = assembler.transform(tmp).select('User', 'features').cache()
Wall time: 10min 45s
%%time
lst_levels = []
for num in range(3, 14):
    kmeans = KMeans(k=num, maxIter=50)
    model = kmeans.fit(transformed)
    lst_levels.append(model.computeCost(transformed))
rs = [i - j for i, j in list(zip(lst_levels, lst_levels[1:]))]
for i, j in zip(rs, rs[1:]):
    if i - j < j:
        print(rs.index(i))
        kmeans = KMeans(k=rs.index(i) + 3, maxIter=50)
        model = kmeans.fit(transformed)
        break
Wall time: 1min 32s
Config:
.config("spark.sql.pivotMaxValues", "100000") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.config("spark.sql.shuffle.partitions", "4") \
.config("spark.sql.inMemoryColumnarStorage.batchSize", "1000") \
VectorAssembler's transform function processes all the columns and stores metadata on each column in addition to the original data. This takes time, and also takes up RAM.
To put an exact figure on how much things have grown, you can dump your data frame before and after the transformation as Parquet files and compare. In my experience, a feature vector built by VectorAssembler can be around 10x larger than one built by hand or with other feature extraction methods, and that was for a logistic regression with only 10 parameters. Things will get a lot worse with a data set with as many columns as you have.
A few suggestions:
See if you can build your feature vector another way. I'm not sure how performant this would be in Python, but I've got a lot of mileage out of this approach in Scala. I've noticed something like a 5x-6x performance difference comparing logistic regressions (10 params) on manually built vectors, or vectors built using other extraction methods (TF-IDF), versus VectorAssembled ones.
See if you can reshape your data to reduce the number of columns that need to be processed by VectorAssembler.
See if increasing the RAM available to Spark helps.
The solution was actually found in a map over the RDD.
First of all, we create a map of values for each row.
We also extract all distinct column names.
As the penultimate step, for each row we look up every distinct name in that row's map and return its value, or 0 if nothing is found.
Finally, VectorAssembler on the results.
Advantages:
You don't have to create a wide dataframe with a huge number of columns, and so you avoid that overhead. (The runtime went from 11 minutes down to 1.)
You still work on the cluster and execute your code within the Spark paradigm.
Example of the code: a Scala implementation.
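Since the linked example is Scala, here is a rough, hedged PySpark sketch of the same idea. The column names follow the question; spark (the SparkSession) and the helper names are assumptions, and building the dense vector directly here stands in for the final VectorAssembler step:
from pyspark.ml.linalg import Vectors

# distinct pivot keys, collected once and broadcast to the workers
names = sorted(r[0] for r in df_states.select('ObjectPath').distinct().collect())
names_bc = spark.sparkContext.broadcast(names)

def to_features(user_pairs):
    user, pairs = user_pairs
    lookup = dict(pairs)  # ObjectPath -> max PropertyFlagValue for this user
    # look every known name up in the row's map, defaulting to 0 when it is missing
    return (user, Vectors.dense([float(lookup.get(n, 0)) for n in names_bc.value]))

transformed = (df_states.rdd
               .map(lambda r: ((r['User'], r['ObjectPath']), r['PropertyFlagValue']))
               .reduceByKey(max)                               # emulate agg({'PropertyFlagValue': 'max'})
               .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))  # regroup by User
               .groupByKey()
               .map(to_features)
               .toDF(['User', 'features']))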

How to use dask to quickly access subsets of the data?

One of the main reasons I love pandas is that it's easy to home in on subsets, e.g. df[df.sample.isin(['a', 'c', 'p'])] or df[df.age < 35]. Is dask dataframe good at (optimized for) this as well? The tutorials I've seen have focused on whole-column manipulations.
My specific application is (thousands of named GCMS samples) x (~20000 time points per sample) x (500 m/z channels) x (intensity), and I'm looking for the fastest tool to pull arbitrary subsets, e.g.
df[df.sample.isin([...]) & df.rt.lt(800) & df.rt.gt(600) & df.mz.isin(...)]
If dask is a good choice, then I would appreciate advice on how best to structure it.
What I've tried
What I've tried so far is to convert each sample to a pandas dataframe that looks like
smp rt 14 15 16 17 18
0 160602_JK_OFCmix:1 271.0 64088.0 9976.0 26848.0 23928.0 89600.0
1 160602_JK_OFCmix:1 271.1 65472.0 10880.0 28328.0 24808.0 91840.0
2 160602_JK_OFCmix:1 271.2 64528.0 10232.0 27672.0 25464.0 90624.0
3 160602_JK_OFCmix:1 271.3 63424.0 10272.0 27600.0 25064.0 90176.0
4 160602_JK_OFCmix:1 271.4 64816.0 10640.0 27592.0 24896.0 90624.0
('smp' is the sample name, 'rt' is retention time, and 14, 15, ..., 500 are m/z channels), save it to HDF with zlib, level=1, and then make the dask dataframe with
ddf = dd.read_hdf('*.hdf5', key='/*', chunksize=100000, lock=False)
but df = ddf[ddf.smp.isin([...a couple of samples...])].compute() is 100x slower than ddf['57'].mean().compute().
(Note: this is with dask.set_options(get=dask.multiprocessing.get))
Your dask.dataframe is backed by an HDF file, so every time you do any operation you're reading in the data from disk. This is great if your data doesn't fit in memory but wasteful if your data does fit in memory.
If your data fits in memory
Instead, if your data fits in memory then try backing your dask.dataframe off of a Pandas dataframe:
# ddf = dd.from_hdf(...)
ddf = dd.from_pandas(df, npartitions=20)
I expect you'll see better performance from the threaded or distributed schedulers: http://dask.pydata.org/en/latest/scheduler-choice.html
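Since the question pins the scheduler with dask.set_options(get=dask.multiprocessing.get), a small sketch of switching it, reusing the ddf from the question (the exact API depends on your dask version; newer releases use a scheduler= keyword or dask.config.set instead):
import dask
import dask.threaded

# use the threaded scheduler instead of the multiprocessing one
dask.set_options(get=dask.threaded.get)

# or, for the distributed scheduler (which also gives you a diagnostics dashboard):
# from dask.distributed import Client
# client = Client()  # registers itself as the default scheduler

df = ddf[ddf.smp.isin(['a', 'c', 'p'])].compute()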
If your data doesn't fit in memory
Try to reduce the number of bytes you have to read by specifying a set of columns to read in your read_hdf call
df = dd.read_hdf(..., columns=['57'])
Or, better yet, use a data store that lets you efficiently load individual columns. You could try something like Feather or Parquet, though both are in early stages:
https://github.com/wesm/feather
http://fastparquet.readthedocs.io/en/latest/
I suspect that if you're careful to avoid reading in all of the columns at once you could probably get by with just Pandas instead of using Dask.dataframe.
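If you do go the plain-Pandas route, a minimal sketch of that last suggestion (the file and key names are placeholders; the where filtering assumes the store was written in table format with rt declared as a data column):
import pandas as pd

# read only the columns you need, and let PyTables filter the rows on disk
df = pd.read_hdf('samples.hdf5', key='/data',
                 columns=['smp', 'rt', '57'],
                 where='rt > 600 & rt < 800')

subset = df[df.smp.isin(['160602_JK_OFCmix:1'])]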

How to stream in and manipulate a large data file in python

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:
Geography AgeGroup Gender Race Count
County1 1 M 1 12
County1 2 M 1 3
County1 2 M 2 0
To:
Geography Count
County1 15
County2 23
This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options: HDF5? itertools (which seems complicated - generators?)? Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write it out before loading in another 70 lines.
Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.
Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.
You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:
import dask.dataframe as dd
df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
Alternatively, if pandas is a requirement, you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.
# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
I do like @root's solution, but I would go a bit further with optimizing memory usage: keep only the aggregated DF in memory and read only those columns that you really need:
cols = ['Geography', 'Count']
df = pd.DataFrame()
chunksize = 2   # adjust it! for example --> 10**5

for chunk in pd.read_csv(filename, usecols=cols, chunksize=chunksize):
    # merge previously aggregated DF with a new portion of data and aggregate it again
    df = (pd.concat([df,
                     chunk.groupby('Geography')['Count'].sum().to_frame()])
            .groupby(level=0)['Count']
            .sum()
            .to_frame())

df.reset_index().to_csv('c:/temp/result.csv', index=False)
test data:
Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111
output.csv:
Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111
PS: using this approach you can process huge files.
PPS: the chunking approach should work unless you need to sort your data; in that case I would use classic UNIX tools like awk, sort, etc. to sort the data first.
I would also recommend using PyTables (HDF5 storage) instead of CSV files. It is very fast, and it allows you to read data conditionally (using the where parameter), so it's very handy, saves a lot of resources, and is usually much faster than CSV.
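For example, a minimal sketch of that conditional read (the file and key names are placeholders; where filtering requires the frame to be stored in table format with Geography declared as a data column):
import pandas as pd

# one-time conversion: append the CSV chunk by chunk into an HDF5 table
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk.to_hdf('my_file.h5', key='data', format='table', append=True,
                 data_columns=['Geography'],
                 min_itemsize={'Geography': 50})  # reserve room for the longest expected name

# later: read only the rows you need, filtered on disk by PyTables
county1 = pd.read_hdf('my_file.h5', key='data', where="Geography == 'County1'")
print(county1['Count'].sum())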
