Merging large TSV files on select columns - python

Is there a memory-efficient way to merge many large TSV files, only keeping select columns of each TSV file?
Background:
I have many (~100) large TSV files (~800MB) of the form:
filename: 1.txt
ID    Char1     Char2     ...    P
1     Green     Red       ...    0.1
2     Red       Orange    ...    0.2
3     Purple    Blue      ...    0.13
(Format A)
The ID column is the same between all TSV files. I only care about the ID and P columns for each and would like to condense them into the following form to make them take up less memory (column names are pulled from the Format A file name):
ID    1       2       ...    N
1     P_11    P_12    ...    P_1N
2     P_21    P_22    ...    P_2N
3     P_31    P_32    ...    P_3N
(Format B)
I need to do this in a memory-efficient manner because I will be processing many of these blocks of 100 files in parallel. Without parallelization, the runtime would be extremely long, as there are 170 blocks of 100 files to process. My workflow is as follows:
Step 1: Generate M groups of these 100 files in format A
Step 2: Process these 100 files into the one file of format B
Step 3: Repeat the above two steps until all 170 groups of 100 files are processed
I've tried many different approaches, and none has been both time- and memory-efficient.
My first approach involved storing everything in memory. I kept one file's ID and P columns in a list of lists. Then I iterated through every other file and appended its P value to the end of the appropriate list. This approach had a satisfactory runtime (~30 minutes per block of 100 files), but it failed because my VM does not have the memory to store all of this data at once across parallel jobs.
I've gone through a number of iterations over the last few days. In my current iteration, I take one file and keep only its ID and P columns. Then I loop through each remaining file and use fileinput to append that file's P column in place to the end of each line. After testing, this method passes on the memory/parallelization side but is too slow in terms of runtime. I've also thought of using pandas to merge, but this runs into both the time and the memory constraint.
I have scoured every corner of my brain and the internet, but have not produced a satisfactory solution.
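For reference, here is a minimal sketch of the streaming variant of this merge: open all 100 files at once, read only the ID and P columns, and write the merged output row by row, so at most one row per input file is held in memory at a time. The helper name merge_p_columns is hypothetical, and it assumes (as stated above) that every file lists the IDs in the same order:

import csv
import os

def merge_p_columns(tsv_paths, out_path):
    """Stream-merge the P columns of many TSV files that share the same ID order."""
    files = [open(p, newline="") for p in tsv_paths]
    try:
        readers = [csv.DictReader(f, delimiter="\t") for f in files]
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out, delimiter="\t")
            # Header: ID plus one column per input file, named after the file (Format B).
            names = [os.path.splitext(os.path.basename(p))[0] for p in tsv_paths]
            writer.writerow(["ID"] + names)
            # zip pulls one row from every reader at a time, so memory use stays flat.
            for rows in zip(*readers):
                writer.writerow([rows[0]["ID"]] + [row["P"] for row in rows])
    finally:
        for f in files:
            f.close()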

Related

Faster alternative than pandas.read_csv for reading txt files with data

I am working on a project where I need to process large amounts of txt files (about 7000 files), where each file has 2 columns of floats with 12500 rows.
I am using pandas, and it takes about 2 min 20 sec, which is a bit long. With MATLAB this takes 1 minute less. I would like to get close to or faster than MATLAB.
Is there any faster alternative that I can implement with python?
I tried Cython and the speed was the same as with pandas.
Here is the code I am using. It reads files composed of column 1 (time) and column 2 (amplitude). I calculate the envelope and make a list with the resulting envelopes for all files. I extract the time from the first file using the simple numpy.loadtxt(), which is slower than pandas but has no impact since it is just one file.
Any ideas?
For coding suggestions please try to use the same format as I use.
Cheers
Data example:
1.7949600e-05 -5.3232106e-03
1.7950000e-05 -5.6231098e-03
1.7950400e-05 -5.9230090e-03
1.7950800e-05 -6.3228746e-03
1.7951200e-05 -6.2978830e-03
1.7951600e-05 -6.6727570e-03
1.7952000e-05 -6.5727906e-03
1.7952400e-05 -6.9726562e-03
1.7952800e-05 -7.0726226e-03
1.7953200e-05 -7.2475638e-03
1.7953600e-05 -7.1725890e-03
1.7954000e-05 -6.9476646e-03
1.7954400e-05 -6.6227738e-03
1.7954800e-05 -6.4228410e-03
1.7955200e-05 -5.8480342e-03
1.7955600e-05 -6.1979166e-03
1.7956000e-05 -5.7980510e-03
1.7956400e-05 -5.6231098e-03
1.7956800e-05 -5.3482022e-03
1.7957200e-05 -5.1732611e-03
1.7957600e-05 -4.6484375e-03
20 files here:
https://1drv.ms/u/s!Ag-tHmG9aFpjcFZPqeTO12FWlMY?e=f6Zk38
import os
import numpy as np
import pandas as pd
from scipy.signal import hilbert

folder_tarjet = r"D:\this"  # raw string so the backslash is not treated as an escape
if len(folder_tarjet) > 0:
    print("You chose %s" % folder_tarjet)

# List the files in modification-time order
list_of_files = os.listdir(folder_tarjet)
list_of_files.sort(key=lambda f: os.path.getmtime(os.path.join(folder_tarjet, f)))
num_files = len(list_of_files)

envs_a = []
for elem in list_of_files:
    file_name = os.path.join(folder_tarjet, elem)
    amp = pd.read_csv(file_name, header=None, dtype=np.float64, delim_whitespace=True)
    # Envelope of the amplitude column (column 1)
    env_amplitudes = np.abs(hilbert(amp[1].to_numpy()))
    envs_a.append(env_amplitudes)
envelopes = np.array(envs_a).T

# Time axis is taken from one file only
file_name = os.path.join(folder_tarjet, list_of_files[1])
Time = np.loadtxt(file_name, usecols=0)
I would suggest not using CSV to store and load large quantities of data. If you already have your data in CSV, you can still convert all of it at once into a faster format. For instance you can use pickle, HDF5, Feather or Parquet; they are not human-readable but have much better performance on every other metric.
After the conversion you will be able to load the data in a few seconds (if not less than a second) instead of minutes, so in my opinion it is a much better solution than chasing a marginal 50% optimization.
If the data is being generated, make sure to generate it in a compressed format. If you don't know which format to use, Parquet, for instance, is one of the fastest for reading and writing, and you can also load it from MATLAB.
Here you will find the IO options already supported by pandas.
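As an illustration of the suggested conversion, here is a hedged one-off sketch that parses each whitespace-delimited .txt once and stores it as a Feather file next to it; the folder path reuses the question's folder_tarjet, and the column names and the choice of Feather are assumptions:

import glob
import os
import pandas as pd

# One-time conversion (requires pyarrow): later runs can call pd.read_feather()
# and skip text parsing entirely.
for txt_path in glob.glob(os.path.join(r"D:\this", "*.txt")):
    df = pd.read_csv(txt_path, header=None, names=["time", "amp"],
                     delim_whitespace=True, dtype="float64")
    df.to_feather(txt_path + ".feather")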
As discussed in Ziur Olpa's answer and the comments, a binary format is bound to be faster to load than parsing text.
The quick way to get those gains is to use NumPy's own NPY format and have your reader function cache the parsed matrices onto disk; that way, when you re-(re-re-)run your data analysis, it will use the pre-parsed NPY files instead of the "raw" TXT files. Should the TXT files change, you can just remove all NPY files and wait a while longer for parsing to happen (or maybe add logic to compare the modification times of the NPY files with their corresponding TXT files).
Something like this – I hope I got your hilbert/abs/transposition logic the way you wanted it.
import glob
import os

import numpy as np
import pandas as pd
from scipy import signal


def read_scans(folder):
    """
    Read scan files from a given folder, yield tuples of filename/matrix.
    This will also create "cached" npy files for each txt file, e.g. a.txt -> a.txt.npy.
    """
    for filename in sorted(glob.glob(os.path.join(folder, "*.txt")), key=os.path.getmtime):
        np_filename = filename + ".npy"
        if os.path.exists(np_filename):
            yield (filename, np.load(np_filename))
        else:
            df = pd.read_csv(filename, header=None, dtype=np.float64, delim_whitespace=True)
            mat = df.values
            np.save(np_filename, mat)
            yield (filename, mat)


def read_data_opt(folder):
    times = None
    envs = []
    for filename, mat in read_scans(folder):
        if times is None:
            # If we hadn't got the times yet, grab them from this
            # matrix (by slicing the first column).
            times = mat[:, 0]
        env_amplitudes = signal.hilbert(mat[:, 1])
        envs.append(env_amplitudes)
    envs = np.abs(np.array(envs)).T
    return (times, envs)
On my machine, the first run (for your 20-file dataset) takes
read_data_opt took 0.13302898406982422 seconds
and the subsequent runs are 4x faster:
read_data_opt took 0.031115055084228516 seconds

how to efficiently append 200k+ json files to dataframe

I have over 200,000 JSON files stored across 600 directories on my drive, with a copy in Google Cloud Storage.
I need to merge them and transform them into a CSV file, so I thought of looping the following logic:
open each file with pandas read_json
apply some transformations (drop some columns, add columns with values based on the filename, change the order of columns)
append this df to a master_df
and finally export the master_df to a csv.
Each file is around 5000 rows when converted to a DataFrame, which would result in a DataFrame of around 300,000,000 rows.
The files are grouped in 3 top-level directories (and then, depending on the file, in a couple more, so there are 3 dirs at the top level but about 600 overall). To speed things up a bit I decided to run the script separately per top-level directory.
I am running this script on my local machine (32GB RAM, sufficient disk space on an SSD drive), yet it's painfully slow, and the speed keeps decreasing.
At the beginning one df was appended within 1 second; after executing the loop 4000 times, the time to append grew to 2.7 s.
I am looking for a way to speed it up to something that could (hopefully) reasonably be done within a couple of hours.
So, bottom line: should I try to optimize my code and run it locally, or not even bother, keep the script as is and run it in e.g. Google Cloud Run?
The JSON files contain keys A, B, C, D, E. I drop A and B as unimportant, and rename C.
The code I have so far:
import pandas as pd
import os


def listfiles(track, footprint):
    """
    This function lists files in a specified filepath (track) by a footprint
    found in the filename with extension.
    Function returns a list of files.
    """
    locations = []
    for r, d, f in os.walk(track):
        for file in f:
            if footprint in file:
                locations.append(os.path.join(r, file))
    return locations


def create_master_df(track, footprint):
    master_df = pd.DataFrame(columns=['date', 'domain', 'property', 'lang', 'kw', 'clicks', 'impressions'])
    all_json_files = listfiles(track, footprint)
    prop = []  # this indicates the starting directory, and to which output file master_df should be saved
    for i, file in enumerate(all_json_files):
        # here starts the logic by which I identify some properties of the file,
        # to use them as values in columns in local_df
        elements = file.split('\\')
        elements[-1] = elements[-1].replace('.json', '')
        elements[-1] = elements[-1].replace('-Copy-Multiplicated', '')
        date = elements[-1][-10:]
        domain = elements[7]
        property = elements[6]
        language = elements[8]

        json_to_rows = pd.read_json(file)
        local_df = pd.DataFrame(json_to_rows.rows.values.tolist())
        local_df = local_df.drop(columns=['A', 'B'])
        local_df = local_df.rename(columns={'C': 'new_name'})

        # here I add the earlier extracted data, to be able to distinguish rows in the master DF
        local_df['date'] = date
        local_df['domain'] = domain
        local_df['property'] = property
        local_df['lang'] = language
        local_df = local_df[['date', 'domain', 'property', 'lang', 'new_name', 'D', 'E']]

        master_df = master_df.append(local_df, ignore_index=True)
        prop = property
        print(f'appended {i} times')

    out_file_path = f'{track}\\{prop}-outfile.csv'
    master_df.to_csv(out_file_path, sep=";", index=False)


# run the script on the first directory
track = 'C:\\xyz\\a'
footprint = '.json'
create_master_df(track, footprint)

# run the script on the second directory
track = 'C:\\xyz\\b'
create_master_df(track, footprint)
any feedback is welcome!
Edit:
I tried and timed a couple of ways of approaching it; below is an explanation of what I did:
Clean execution - at first I ran my code as it was written.
1st try - I moved all changes to local_df to just before the final master_df was saved to CSV. This was an attempt to see if working on the files 'as they are', without manipulating them, would be faster. I removed dropping columns, renaming columns and reordering them from the for loop; all of this was applied to master_df before exporting to CSV.
2nd try - I brought dropping unnecessary columns back into the for loop, as performance was clearly impacted by the additional columns.
3rd try - so far the most efficient - instead of appending local_df to master_df, I transformed it and appended it to a master list, which was then turned into master_df with pd.concat and exported to CSV (see the sketch after this section).
Here is a comparison of speed - how fast the script iterated the for loop in each version, when tested on ~800 JSON files:
[image: comparison of timed script executions]
It's surprising to see that I actually wrote solid code in the first place - as I don't have much experience coding, that was super nice to see :)
Overall, when the job was run on the test set of files it finished as follows (in seconds):
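For completeness, a minimal sketch of the pattern behind the 3rd try: collect the per-file frames in a list and concatenate once at the end instead of calling DataFrame.append inside the loop. The helper name is hypothetical and the per-file transformations are trimmed to a placeholder comment:

import pandas as pd

def create_master_df_concat(json_paths):
    """Build the master frame with one concat instead of repeated appends."""
    frames = []
    for path in json_paths:
        local_df = pd.read_json(path)
        # ... per-file transformations (drop/rename/add columns) go here ...
        frames.append(local_df)
    # A single concat is roughly linear in the total number of rows, whereas
    # append in a loop re-copies master_df on every iteration, which is why
    # the per-iteration time kept growing.
    return pd.concat(frames, ignore_index=True)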

pyarrow write_dataset limit per partition files

Let's say I've got some data in an existing dataset which has no column partitioning but is still 200+ files. I want to rewrite that data into a Hive-partitioned layout. The dataset is too big to open in its entirety in memory. In this case the dataset came from a CETAS in Azure Synapse, but I don't think that matters.
I do:
dataset = ds.dataset(datadir, format="parquet", filesystem=abfs)
new_part = ds.partitioning(pa.schema([("nodename", pa.string())]), flavor="hive")
scanner = dataset.scanner()
ds.write_dataset(scanner, newdatarootdir,
                 format="parquet", partitioning=new_part,
                 existing_data_behavior="overwrite_or_ignore",
                 max_partitions=2467)
In this case there are 2467 unique nodenames and so I want there to be 2467 directories each with 1 file. However what I get is 2467 directories each with 100+ files of about 10KB each. How do I get just 1 file per partition?
I could do a second step
for node in nodes:
    fullpath = f"{datadir}\\{node}"
    curnodeds = ds.dataset(fullpath, format="parquet")
    curtable = curnodeds.to_table()
    os.makedirs(f"{newbase}\\{node}")
    pq.write_table(curtable, f"{newbase}\\{node}\\part-0.parquet",
                   version="2.6", flavor="spark", data_page_version="2.0")
Is there a way to incorporate the second step into the first step?
Edit:
This is with pyarrow 7.0.0
Edit(2):
Using max_open_files=3000 did result in one file per partition. Comparing the metadata between the two, the two-step approach has (for one partition)...
<pyarrow._parquet.FileMetaData object at 0x000001C7FA414A40>
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 8
num_rows: 24840
num_row_groups: 1
format_version: 2.6
serialized_size: 1673
size on disk: 278kb
and the one step...
<pyarrow._parquet.FileMetaData object at 0x000001C7FA414A90>
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 8
num_rows: 24840
num_row_groups: 188
format_version: 1.0
serialized_size: 148313
size on disk: 1.04MB
Obviously in the two-step version I'm explicitly setting the version to 2.6, so that explains that difference. The total size of my data after the 2-step is 655MB, vs 2.6GB for the 1-step. The time difference was pretty significant too. The two-step was something like 20 minutes for the first step and 40 minutes for the second. The one-step was like 5 minutes for the whole thing.
The remaining question is, how to set version="2.6" and data_page_version="2.0" in write_dataset? I'll still wonder why the row_groups is so different if they're so different when setting those parameters but I'll defer that question.
The dataset writer was changed significantly in 7.0.0. Previously, it would always create 1 file per partition. Now there are a couple of settings that could cause it to write multiple files. Furthermore, it looks like you are ending up with a lot of small row groups which is not ideal and probably the reason the one-step process is both slower and larger.
The first significant setting is max_open_files. Some systems limit how many file descriptors can be open at one time. Linux defaults to 1024, and so pyarrow defaults to ~900 (with the assumption that some file descriptors will be open for scanning, etc.). When this limit is exceeded, pyarrow will close the least recently used file. For some datasets this works well. However, if each batch has data for each file, this doesn't work well at all. In that case you probably want to increase max_open_files to be greater than your number of partitions (with some wiggle room because you will also have some files open for reading). You may need to adjust OS-specific settings to allow this (generally, these OS limits are pretty conservative and raising them is fairly harmless).
I'll still wonder why the row_groups is so different if they're so different when setting those parameters but I'll defer that question.
In addition, the 7.0.0 release adds min_rows_per_group, max_rows_per_group and max_rows_per_file parameters to the write_dataset call. Setting min_rows_per_group to something like 1 million will cause the writer to buffer rows in memory until it has enough to write. This will allow you to create files with 1 row group instead of 188 row groups. This should bring down your file size and fix your performance issues.
However, there is a memory cost associated with this which is going to be min_rows_per_group * num_files * size_of_row_in_bytes.
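Putting these settings together, a hedged sketch of the adjusted one-step call (the values are illustrative rather than tuned, and dataset, new_part and newdatarootdir are assumed to be defined as in the question):

import pyarrow.dataset as ds

scanner = dataset.scanner()
ds.write_dataset(
    scanner, newdatarootdir,
    format="parquet", partitioning=new_part,
    existing_data_behavior="overwrite_or_ignore",
    max_partitions=2467,
    max_open_files=3000,        # more than the 2467 partitions -> one file each
    min_rows_per_group=20_000,  # buffer rows so each file gets few, large row groups
    max_rows_per_group=1_000_000,
)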
The remaining question is, how to set version="2.6" and data_page_version="2.0" in write_dataset?
The write_dataset call works on several different formats (e.g. csv, ipc, orc) and so the call only has generic options that apply regardless of format.
Format-specific settings can instead be set using the file_options parameter:
# It does not appear to be documented but make_write_options
# should accept most of the kwargs that write_table does
file_options = ds.ParquetFileFormat().make_write_options(version='2.6', data_page_version='2.0')
ds.write_dataset(..., file_options=file_options)

Why partitioned parquet files consume larger disk space?

I am learning about Parquet files using Python and pyarrow. Parquet is great at compression and at minimizing disk space. My dataset is a 190MB CSV file which ends up as a single 3MB file when saved as a snappy-compressed Parquet file.
However, when I save my dataset as partitioned files, they result in a much larger combined size (61MB).
Here is example dataset that I am trying to save:
listing_id | date | gender | price
-------------------------------------------
a | 2019-01-01 | M | 100
b | 2019-01-02 | M | 100
c | 2019-01-03 | F | 200
d | 2019-01-04 | F | 200
When I partitioned by date (300+ unique values), the partitioned files result in 61MB combined. Each file is 168.2 kB in size.
When I partition by gender (2 unique values), the partitioned files result in just 3MB combined.
I am wondering if there is any minimum file size for parquet such that many small files combined consume greater disk space?
My env:
- OS: Ubuntu 18.04
- Language: Python
- Library: pyarrow, pandas
My dataset source:
https://www.kaggle.com/brittabettendorf/berlin-airbnb-data
# I am using calendar_summary.csv as my data from a group of datasets in that link above
My code to save as parquet file:
import os
import pandas as pd
import pyarrow
import pyarrow.parquet

# write to dataset using parquet
df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table=table, where='./calendar_summary_write_table.parquet')
# parquet filesize
parquet_method1_filesize = os.path.getsize('./calendar_summary_write_table.parquet') / 1000
print('parquet_method1_filesize: %i kB' % parquet_method1_filesize)
My code to save as partitioned parquet file:
# write to dataset using parquet (partitioned)
df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(
    table=table,
    root_path='./calendar_summary/',
    partition_cols=['date'])
# parquet filesize
import os
print(os.popen('du -sh ./calendar_summary/').read())
There is no minimum file size, but there is an overhead for storing the footer, and there is a wasted opportunity for optimizations via encodings and compressions. The various encodings and compressions build on the idea that the data has some amount of self-similarity which can be exploited by referencing back to earlier similar occurrences. When you split the data into multiple files, each of them will need a separate "initial data point" that the successive ones can refer back to, so disk usage goes up. (Please note that there are huge oversimplifications in this wording to avoid having to specifically go through the various techniques employed to save space, but see this answer for a few examples.)
Another thing that can have a huge impact on the size of Parquet files is the order in which data is inserted. A sorted column can be stored a lot more efficiently than a randomly ordered one. It is possible that by partitioning the data you inadvertently alter its sort order. Another possibility is that you partition the data by the very attribute that it was ordered by and which allowed a huge space saving when storing in a single file and this opportunity gets lost by splitting the data into multiple files. Finally, you have to keep in mind that Parquet is not optimized for storing a few kilobytes of data but for several megabytes or gigabytes (in a single file) or several petabytes (in multiple files).
If you would like to inspect how your data is stored in your Parquet files, the Java implementation of Parquet includes the parquet-tools utility, providing several commands. See its documentation page for building and getting started. The more detailed descriptions of the individual commands are printed by parquet-tools itself. The commands most interesting to you are probably meta and dump.
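If you prefer to stay in Python, a similar inspection can be sketched with pyarrow itself; the file path below is reused from the question and is otherwise an assumption:

import pyarrow.parquet as pq

# Inspect how one Parquet file is laid out: row groups, encodings, compressed sizes.
meta = pq.read_metadata('./calendar_summary_write_table.parquet')
print(meta)             # num_rows, num_row_groups, format_version, ...
rg = meta.row_group(0)  # first row group
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.encodings, col.compression,
          col.total_compressed_size, col.total_uncompressed_size)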

Spark coalesce vs collect, which one is faster?

I am using pyspark to process 50 GB of data on AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want my result to save in one csv file.
# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
As far as I know, I have to perform the following to force the pyspark.sql.DataFrame object to save to 1 csv file (approx 1 MB) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It seems it took about 70 min to finish this process, and I am wondering if I can make it faster by using the collect() method, like:
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")
Both coalesce(1) and collect are pretty bad in general but with expected output size around 1MB it doesn't really matter. It simply shouldn't be a bottleneck here.
One simple improvement is to drop loop -> filter -> union and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough then most likely the issue here is configuration (a good place to start could be adjusting spark.sql.shuffle.partitions, if you don't do that already).
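Spelled out with the required imports and the single-file write from the question, a hedged sketch of that single aggregation (column names are taken from the question; the output path and aliases are assumptions):

from pyspark.sql.functions import col, hour, mean, sum as sum_

# One aggregation over all 24 hours instead of 24 filter/groupby/union passes.
daily_df = (
    df.groupBy(hour("Time").alias("hour"), col("Animal"))
      .agg(mean("weights").alias("mean_weight"), sum_("is_male").alias("n_males"))
)

# The expected output is only ~1 MB, so a single output partition is acceptable here.
daily_df.coalesce(1).write.csv("daily_summary_csv", header=True)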
To save as a single file, these are the options:
Option 1:
coalesce(1) (minimizes the data shuffled over the network) or repartition(1) or collect may work for small datasets, but for large datasets it may not perform as expected, since all the data will be moved to one partition on one node.
Option 1 would be fine if a single executor has more RAM available for use than the driver.
Option 2:
Another option would be FileUtil.copyMerge() - to merge the outputs into a single file, as in the code snippet below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
Option 3:
After getting the part files you can use the hdfs getmerge command like this...
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide, based on your requirements, which one is safer/faster.
Also, have a look at Dataframe save after join is creating numerous part files.
