Due to some limitations of the consumer of my data, I need to "rewrite" some parquet files to convert timestamps that are in nanosecond precision to timestamps that are in millisecond precision.
I have implemented this and it works but I am not completely satisfied with it.
import pandas as pd

df = pd.read_parquet(f's3://{bucket}/{key}', engine='pyarrow')
for col_name in df.columns:
    if df[col_name].dtype == 'datetime64[ns]':
        df[col_name] = df[col_name].values.astype('datetime64[ms]')
df.to_parquet(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}',
              engine='pyarrow', index=False)
I'm currently running this job in Lambda for each file, but I can see this may be expensive and may not always work if the job takes longer than 15 minutes, since that is the maximum time a Lambda function can run.
The files can be on the larger side (>500 MB).
Any ideas or other methods I could consider? I am unable to use PySpark, as my dataset has unsigned integers in it.
You could try rewriting all columns at once. Maybe this would reduce some memory copies in pandas, thus speeding up the process if you have many columns:
df_datetimes = df.select_dtypes(include="datetime64[ns]")
df[df_datetimes.columns] = df_datetimes.astype("datetime64[ms]")
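For context, this is a sketch of how those two lines would slot into the original job, reusing the bucket/key placeholders from the question:

import pandas as pd

df = pd.read_parquet(f's3://{bucket}/{key}', engine='pyarrow')

# replace the per-column loop with a single vectorized cast of all datetime64[ns] columns
df_datetimes = df.select_dtypes(include='datetime64[ns]')
df[df_datetimes.columns] = df_datetimes.astype('datetime64[ms]')

df.to_parquet(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}',
              engine='pyarrow', index=False)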
Add use_deprecated_int96_timestamps=True to df.to_parquet() when you first write the file, and it will be saved as a nanosecond (INT96) timestamp. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
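A minimal sketch of where that flag would go, assuming the pyarrow engine (pandas forwards extra keyword arguments to pyarrow's write_table); the bucket and prefix names are the placeholders from the question:

df.to_parquet(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}',
              engine='pyarrow', index=False,
              use_deprecated_int96_timestamps=True)  # store timestamps as INT96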
Related
I'm trying a small POC to group by and aggregate data from a large CSV to reduce it, in both pandas and Dask, and I'm observing high memory usage and/or slower-than-expected processing times. Does anyone have any tips for a Python/pandas/Dask noob to improve this?
Background
I have a request to build a file ingestion tool that would:
Be able to take in files of a few GBs where each row contains user id and some other info
do some transformations
reduce the data to { user -> [collection of info]}
send batches of this data to our web services
Based on my research, since the files are only a few GBs, I found that Spark etc. would be overkill and that pandas/Dask may be a good fit, hence the POC.
Problem
Processing a 1GB CSV takes ~1 min for both pandas and Dask, and consumes 1.5GB RAM for pandas and 9GB RAM for Dask (!!!)
Processing a 2GB CSV takes ~3 mins and 2.8GB RAM for pandas; Dask crashes!
What am I doing wrong here?
for pandas, since I'm processing the CSV in small chunks, I did not expect the RAM usage to be so high
for Dask, everything I read online suggested that Dask processes the CSV in blocks of the size given by blocksize, so I expected the RAM usage to be roughly some multiple of the block size; I wouldn't expect the total to be 9GB when the block size is only 6.4MB. I don't know why on earth its RAM usage skyrockets to 9GB for a 1GB CSV input
(Note: if I don't set the block size, Dask crashes even on the 1GB input)
My code
I can't share the CSV, but it has 1 integer column followed by 8 text columns. Both user_id and order_id columns referenced below are text columns.
1GB csv has 14000001 lines
2GB csv has 28000001 lines
5GB csv has 70000001 lines
I generated these CSVs with random data, and the user_id column is randomly picked from 10 pre-generated values, so I'd expect the final output to be 10 user ids, each with a collection of who knows how many order ids.
Pandas
#!/usr/bin/env python3
import pandas as pd

test_csv_location = '1gb.csv'
chunk_size = 100000

pieces = []
for chunk in pd.read_csv(test_csv_location, chunksize=chunk_size, delimiter='|', iterator=True):
    df = chunk.groupby('user_id')['order_id'].agg(size=len, list=lambda x: list(x))
    pieces.append(df)

final = pd.concat(pieces).groupby('user_id')['list'].agg(size=len, list=sum)
final.to_csv('pandastest.csv', index=False)
Dask
#!/usr/bin/env python3
from dask.distributed import Client
import dask.dataframe as ddf
import sys
test_csv_location = '1gb.csv'
df = ddf.read_csv(test_csv_location, blocksize=6400000, delimiter='|')
# For each user, reduce to a list of order ids
grouped = df.groupby('user_id')
collection = grouped['order_id'].apply(list, meta=('order_id', 'f8'))
collection.to_csv('./dasktest.csv', single_file=True)
The groupby operation is expensive because dask will try to shuffle data across workers to check who has which user_id values. If user_id has a lot of unique values (sounds like it), there are a lot of cross-checks to be done across workers/partitions.
There are at least two ways out of it:
set user_id as index. This will be expensive during the indexing stage, but subsequent operations will be faster because now dask doesn't have to check each partition for the values of user_id it has.
df = df.set_index('user_id')
collection = df.groupby('user_id')['order_id'].apply(list, meta=('order_id', 'f8'))
collection.to_csv('./dasktest.csv', single_file=True)
if your files have a structure that you know about, e.g. (as an extreme example) if the user_id values are somewhat sorted, so that the CSV first contains only user_id values that start with 1 (or A, or whatever other symbols are used), then 2, and so on, then you could use that information to form partitions in 'blocks' (loose term) so that the groupby is only needed within those 'blocks', as in the sketch below.
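A minimal sketch of that idea, assuming the (hypothetical) best case where the CSV is already sorted by user_id so that each user's rows are contiguous; the file names and '|' delimiter are taken from the question:

import pandas as pd

carry = None  # rows of the last group in a chunk, which may continue into the next chunk
with open('dasktest.csv', 'w') as out:
    for chunk in pd.read_csv('1gb.csv', delimiter='|', chunksize=100000):
        if carry is not None:
            chunk = pd.concat([carry, chunk], ignore_index=True)
        last_user = chunk['user_id'].iloc[-1]
        carry = chunk[chunk['user_id'] == last_user]      # may spill into the next chunk
        complete = chunk[chunk['user_id'] != last_user]   # groups fully contained here
        for user, grp in complete.groupby('user_id'):
            out.write('%s|%s\n' % (user, list(grp['order_id'])))
    if carry is not None:  # flush the final group
        for user, grp in carry.groupby('user_id'):
            out.write('%s|%s\n' % (user, list(grp['order_id'])))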
For me, working remotely means accessing big CSV files on a server which take a long time to download to local hard drive.
I've tried to speed this process up using a bit of Python, only reading in particular columns I require. Ideally, however, if I could only read in data for those columns after a date (e.g > 2019-01-04) it would significantly reduce the amount of data.
My existing code reads in the whole file and then applies a date filter. I'm just wondering if it's possible to apply that date filter while reading the file in the first place. I appreciate this might not be possible.
Example code:
import pandas as pd

fields = ['a', 'b', 'c', ...]

data1 = pd.read_csv(r'SomeForeignDrive.csv', error_bad_lines=False, usecols=fields)
data1['c'] = pd.to_datetime(data1['c'], errors='coerce')
data1 = data1.dropna()  # dropna() returns a new frame, so assign the result back
data1 = data1[data1['c'] > '2019-01-04']
data1.to_csv(r'SomeLocalDrive.csv')
It's not possible to read a file starting from a specific date, but you can use the following workaround. You can read only the column with dates and find the row index you want to start from. Then you can read the whole file and skip all rows before the start index:
df = pd.read_csv('path', usecols=['date'])
df['date'] = pd.to_datetime(df['date'])
idx = df[df['date'] > '2019-01-04'].index[0]

# skip the data rows before the start index, but keep the header row (line 0)
df = pd.read_csv('path', skiprows=range(1, idx + 1))
read_csv docs:
Using this parameter (usecols) results in much faster parsing time and
lower memory usage.
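For completeness, a rough sketch that combines the two ideas (column subsetting plus row skipping), reusing the placeholder paths, field names, and date column 'c' from the question; it assumes the file is ordered by date so a single start index is enough:

import pandas as pd

fields = ['a', 'b', 'c']

# pass 1: read only the date column to locate where the cutoff falls
dates = pd.read_csv(r'SomeForeignDrive.csv', usecols=['c'])
dates['c'] = pd.to_datetime(dates['c'], errors='coerce')
idx = dates[dates['c'] > '2019-01-04'].index[0]

# pass 2: read only the needed columns, skipping the rows before the cutoff
data1 = pd.read_csv(r'SomeForeignDrive.csv', usecols=fields,
                    skiprows=range(1, idx + 1))
data1.to_csv(r'SomeLocalDrive.csv', index=False)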
I have a large table of data that needs to be merged and split by group. The easy method is to use pandas, but the only problem is memory.
I have this code to merge dataframes:
import pandas as pd
from functools import reduce

large_df = pd.read_table('large_file.csv', sep=',')
This basically loads the whole dataset into memory.
# Then I could group the pandas dataframe by some column value (say "block")
df_by_block = large_df.groupby("block")

# and then write the data out block by block
for block_id, block_val in df_by_block:
    block_val.to_csv("df_" + str(block_id), sep="\t", index=False)
The only problem with the above code is memory allocation, which freezes my desktop. I tried to move this code to dask, but dask doesn't have a neat groupby implementation.
Note: I could have just sorted the file, then read the data line by line and split as the "block" value changes. But, the only problem is that "large_df.txt" is created in the pipeline upstream by merging several dataframes.
Any suggestions?
Thanks,
Update:
I tried the following approach, but it still seems to be memory heavy:
# (large_df here is assumed to be a dask dataframe, since .compute() is used below)
# find unique values in the column of interest (which is to be "grouped by")
large_df_contig = large_df['contig']
contig_list = list(large_df_contig.unique().compute())

# index the dataframe by the grouping column
large_df_grouped = large_df.set_index('contig')

# now, split into one dataframe per contig and write each one out
for items in contig_list:
    my_df = large_df_grouped.loc[items].compute().reset_index()
    my_df.to_csv('dask_output/my_df_' + str(items), sep='\t', index=False)
Everything is fine, but the code
my_df = large_df_grouped.loc[items].compute().reset_index()
seems to be pulling everything into the memory again.
Any way to improve this code??
but dask doesn't have a neat groupby implementation
Actually, dask does have groupby + user-defined functions with out-of-core reshuffling.
You can use
large_df.groupby(something).apply(write_to_disk)
where write_to_disk is some short function writing the block to the disk. By default, dask uses disk shuffling in these cases (as opposed to network shuffling). Note that this operation might be slow, and it can still fail if the size of a single group exceeds your memory.
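A minimal sketch of that pattern, assuming large_df is a dask dataframe and "block" is the grouping column; write_to_disk is a hypothetical helper, not part of dask:

import dask.dataframe as dd

def write_to_disk(group):
    # each call receives all rows for one "block" value as a pandas DataFrame
    if group.empty:  # dask may pass an empty frame when checking metadata
        return 0
    block_id = group['block'].iloc[0]
    group.to_csv('df_' + str(block_id), sep='\t', index=False)
    return len(group)  # return something small so nothing large is collected

large_df = dd.read_csv('large_file.csv')
large_df.groupby('block').apply(write_to_disk, meta=('rows', 'int64')).compute()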
I'm trying to use multiprocessing to read the CSV file faster than using read_csv.
df = pd.read_csv('review-1m.csv', chunksize=10000)
But the df I get is not the dataframe but of the type pandas.io.parsers.TextFileReader. So I try to use
df = pd.concat(df, ignore_index=True)
to convert df into a dataframe. But this process takes a lot of time, so the result is not much better than directly using read_csv. Does anyone know how to make the process of converting df into a dataframe faster?
pd.read_csv() is likely going to give you the same read time as any other method. If you want a real performance increase you should change the format you store your file in.
http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
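As a rough illustration of that suggestion (not the answerer's code), one option is to convert the CSV once into a binary format and reload from that copy on later runs; Parquet is used here as one example and requires pyarrow or fastparquet:

import pandas as pd

# one-time (slow) conversion from CSV to a column-oriented binary format
df = pd.read_csv('review-1m.csv')
df.to_parquet('review-1m.parquet')

# later runs read the binary copy, which is typically much faster than CSV
df = pd.read_parquet('review-1m.parquet')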
I've been trying to process a 1.4GB CSV file with Pandas, but keep having memory problems. I have tried different things in an attempt to make Pandas read_csv work, to no avail.
It didn't work when I used the iterator=True and chunksize=number parameters. Moreover, the smaller the chunksize, the slower it is to process the same amount of data.
(Sheer overhead doesn't explain it, because it was far slower when the number of chunks was large. I suspect that when processing each chunk, pandas needs to go through all the chunks before it to "get to it", instead of jumping straight to the start of the chunk. That seems to be the only way this can be explained.)
Then as a last resort, I split the CSV files into 6 parts, and tried to read them one by one, but still get MemoryError.
(I have monitored the memory usage of python while running the code below, and found that each time python finishes processing a file and moves on to the next, the memory usage goes up. It seemed quite obvious that pandas didn't release the memory for the previous file after it had finished processing it.)
The code may not make sense but that's because I removed the part where it writes into an SQL database to simplify it and isolate the problem.
import glob
import pandas as pd

filenameStem = 'Crimes'
counter = 0
for filename in glob.glob(filenameStem + '_part*.csv'):  # reading files Crimes_part1.csv through Crimes_part6.csv
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:, [5, 8, 15, 16]]
    df = df.dropna(how='any')
    counter += 1
    print(counter)
You may try to parse only those columns that you need (as @BrenBarn said in the comments):
import glob
import pandas as pd

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

fmask = 'Crimes_part*.csv'
cols = [5, 8, 15, 16]

df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=cols).dropna(how='any')
print(df.head())
PS this will include only 4 out of at least 17 columns in your resulting data frame
Thanks for the reply.
After some debugging, I located the problem: the "iloc" subsetting in pandas created a circular reference, which prevented garbage collection. A detailed discussion can be found here
I have run into the same issue with CSV files. First, read the CSV in chunks with a fixed chunksize: use the chunksize or iterator parameter to return the data in chunks.
Syntax:
csv_onechunk = pandas.read_csv(filepath, sep=delimiter, skiprows=1, chunksize=10000)
then concatenate the chunks (only valid with the C parser), as in the sketch below.
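A minimal sketch of the chunked read followed by concatenation; filepath and delimiter are placeholders for your own values:

import pandas as pd

filepath = 'your_file.csv'  # placeholder path
delimiter = ','             # placeholder delimiter

reader = pd.read_csv(filepath, sep=delimiter, skiprows=1, chunksize=10000)
df = pd.concat(reader, ignore_index=True)  # concatenate the chunks into one DataFrame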