how to efficiently append 200k+ json files to dataframe - python

I have over 200 000 JSON files stored across 600 directories on my drive, and a copy in Google Cloud Storage.
I need to merge them and transform them into a CSV file, so I thought of looping the following logic:
1. open each file with pandas read_json,
2. apply some transformations (drop some columns, add columns with values based on the filename, change the order of columns),
3. append this df to a master_df,
4. and finally export the master_df to a csv.
Each file is around 5000 rows when converted to a DataFrame, which would result in a DataFrame of around 300 000 000 rows.
The files are grouped and reside in 3 top-level directories (and then, depending on the file, in a couple more levels, so on the top level there are 3 dirs but overall about 600), so to speed things up a bit I decided to run the script separately on each of those 3 directories.
I am running this script on my local machine (32GB RAM, sufficient disk space on an SSD drive), yet it is painfully slow, and the speed keeps decreasing.
At the beginning one df was appended within 1 second; after 4000 loop iterations the time to append one df grew to 2.7 s.
I am looking for a way to speed this up to something that could reasonably be done within (hopefully) a couple of hours.
So, bottom line: should I try to optimize my code and run it locally, or not even bother, keep the script 'as is' and run it in e.g. Google Cloud Run?
The JSON files contain keys A, B, C, D, E. I drop A and B as not important, and rename C.
The code I have so far:
import pandas as pd
import os


def listfiles(track, footprint):
    """
    This function lists files in a specified filepath (track) by a footprint
    found in the filename (with extension).
    Returns a list of file paths.
    """
    locations = []
    for r, d, f in os.walk(track):
        for file in f:
            if footprint in file:
                locations.append(os.path.join(r, file))
    return locations


def create_master_df(track, footprint):
    master_df = pd.DataFrame(columns=['date', 'domain', 'property', 'lang', 'kw', 'clicks', 'impressions'])
    all_json_files = listfiles(track, footprint)
    prop = []  # this indicates the starting directory, and to which output file master_df should be saved
    for i, file in enumerate(all_json_files):
        # here starts the logic by which I identify some properties of the file,
        # to use them as values in columns in local_df
        elements = file.split('\\')
        elements[-1] = elements[-1].replace('.json', '')
        elements[-1] = elements[-1].replace('-Copy-Multiplicated', '')
        date = elements[-1][-10:]
        domain = elements[7]
        property = elements[6]
        language = elements[8]

        json_to_rows = pd.read_json(file)
        local_df = pd.DataFrame(json_to_rows.rows.values.tolist())
        local_df = local_df.drop(columns=['A', 'B'])
        local_df = local_df.rename(columns={'C': 'new_name'})

        # here I add the earlier extracted data, to be able to distinguish rows in the master DF
        local_df['date'] = date
        local_df['domain'] = domain
        local_df['property'] = property
        local_df['lang'] = language
        local_df = local_df[['date', 'domain', 'property', 'lang', 'new_name', 'D', 'E']]

        # appending to master_df inside the loop - this is the slow part
        master_df = master_df.append(local_df, ignore_index=True)
        prop = property
        print(f'appended {i} times')

    out_file_path = f'{track}\\{prop}-outfile.csv'
    master_df.to_csv(out_file_path, sep=";", index=False)
# run the script on the first directory
track = 'C:\\xyz\\a'
footprint = '.json'
create_master_df(track, footprint)

# run the script on the second directory
track = 'C:\\xyz\\b'
create_master_df(track, footprint)
Any feedback is welcome!
Edit:
I tried and timed a couple of ways of approaching it; below is an explanation of what I did:
Clear execution - at first I ran my code as it was written.
1st try - I moved all changes to local_df to just before the final master_df was saved to csv. This was an attempt to see if working on the files 'as they are', without manipulating them, would be faster. I removed dropping columns, changing column names and reordering them from the for loop; all of this was applied to master_df before exporting to csv.
2nd try - I brought dropping unnecessary columns back into the for loop, as performance was clearly impacted by the additional columns.
3rd try - so far the most efficient - instead of appending local_df to master_df, I transformed it to a local_list and appended that to a master_list; master_list was then pd.concat'ed into master_df and exported to csv (see the sketch below).
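For reference, a minimal sketch of that list-plus-single-concat idea (the function name is illustrative and the per-file transformations are abbreviated; listfiles() is the helper defined above):

import pandas as pd

def create_master_df_concat(track, footprint):
    frames = []
    for file in listfiles(track, footprint):
        local_df = pd.DataFrame(pd.read_json(file).rows.values.tolist())
        local_df = local_df.drop(columns=['A', 'B']).rename(columns={'C': 'new_name'})
        # ...add the date/domain/property/lang columns here, as in the original loop...
        frames.append(local_df)
    # one concat at the end avoids re-copying the ever-growing master_df on every iteration
    return pd.concat(frames, ignore_index=True)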
Here is a comparison of speed - how fast the script was at iterating the for loop in each version, when tested on ~800 JSON files:
[image: comparison of timed script executions]
It's surprising to see that I actually wrote solid code in the first place - as I don't have much experience coding, that was super nice to see :)
Overall, when the job was run on the test set of files, it finished as follows (in seconds):

Related

Faster alternative than pandas.read_csv for reading txt files with data

I am working on a project where I need to process a large number of txt files (about 7000), where each file has 2 columns of floats with 12500 rows.
I am using pandas, and it takes about 2 min 20 sec, which is a bit long. With MATLAB this takes 1 min less. I would like to get close to, or faster than, MATLAB.
Is there any faster alternative that I can implement in Python?
I tried Cython and the speed was the same as with pandas.
Here is the code I am using. It reads the files composed of column 1 (time) and column 2 (amplitude). I calculate the envelope and make a list with the resulting envelopes for all files. I extract the time from the first file using the simple numpy.loadtxt(), which is slower than pandas, but that has no impact since it is just one file.
Any ideas?
For coding suggestions please try to use the same format as I use.
Cheers
Data example:
1.7949600e-05 -5.3232106e-03
1.7950000e-05 -5.6231098e-03
1.7950400e-05 -5.9230090e-03
1.7950800e-05 -6.3228746e-03
1.7951200e-05 -6.2978830e-03
1.7951600e-05 -6.6727570e-03
1.7952000e-05 -6.5727906e-03
1.7952400e-05 -6.9726562e-03
1.7952800e-05 -7.0726226e-03
1.7953200e-05 -7.2475638e-03
1.7953600e-05 -7.1725890e-03
1.7954000e-05 -6.9476646e-03
1.7954400e-05 -6.6227738e-03
1.7954800e-05 -6.4228410e-03
1.7955200e-05 -5.8480342e-03
1.7955600e-05 -6.1979166e-03
1.7956000e-05 -5.7980510e-03
1.7956400e-05 -5.6231098e-03
1.7956800e-05 -5.3482022e-03
1.7957200e-05 -5.1732611e-03
1.7957600e-05 -4.6484375e-03
20 files here:
https://1drv.ms/u/s!Ag-tHmG9aFpjcFZPqeTO12FWlMY?e=f6Zk38
import os
from os.path import join

import numpy as np
import pandas as pd
from scipy.signal import hilbert

folder_tarjet = r"D:\this"  # raw string so that \t is not treated as a tab

if len(folder_tarjet) > 0:
    print("You chose %s" % folder_tarjet)

list_of_files = os.listdir(folder_tarjet)
list_of_files.sort(key=lambda f: os.path.getmtime(join(folder_tarjet, f)))
num_files = len(list_of_files)

envs_a = []
for elem in list_of_files:
    file_name = os.path.join(folder_tarjet, elem)
    amp = pd.read_csv(file_name, header=None, dtype={'amp': np.float64}, delim_whitespace=True)
    env_amplitudes = np.abs(hilbert(np.array(pd.DataFrame(amp[1]))))
    envs_a.append(env_amplitudes)

envelopes = np.array(envs_a).T

file_name = os.path.join(folder_tarjet, list_of_files[1])
Time = np.loadtxt(file_name, usecols=0)
I would suggest you not use csv to store and load big quantities of data; if you already have your data in csv you can still convert all of it at once into a faster format. For instance you can use pickle, HDF5, feather or parquet: they are not human readable but have much better performance on every other metric.
After the conversion you will be able to load the data in a few seconds (if not less than a second) instead of minutes, so in my opinion it is a much better solution than trying to make a marginal 50% optimization.
If the data is being generated, make sure to generate it in a compressed format. If you don't know which format to use, parquet for instance would be one of the fastest for reading and writing, and you can also load it from MATLAB.
Here you will find the options already supported by pandas.
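As an illustration, a minimal sketch of a one-time conversion from the whitespace-delimited txt files to parquet, plus the fast reload afterwards (the path and column names are placeholders; writing parquet requires pyarrow or fastparquet to be installed):

import glob
import os

import pandas as pd

# one-time conversion: parse each txt file once and save it as parquet next to it
for txt_path in glob.glob(os.path.join(r"D:\this", "*.txt")):
    df = pd.read_csv(txt_path, header=None, names=["time", "amplitude"], delim_whitespace=True)
    df.to_parquet(txt_path + ".parquet")

# later runs: load the binary files instead of re-parsing the text
frames = [pd.read_parquet(p) for p in glob.glob(os.path.join(r"D:\this", "*.parquet"))]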
As discussed in #Ziur Olpa's answer and the comments, a binary format is bound to be faster than parsing text.
The quick way to get those gains is to use NumPy's own NPY format, and have your reader function cache those onto disk; that way, when you re-(re-re-)run your data analysis, it will use the pre-parsed NPY files instead of the "raw" TXT files. Should the TXT files change, you can just remove all NPY files and wait a while longer for parsing to happen (or maybe add logic to compare the modification times of the NPY files against their corresponding TXT files).
Something like this - I hope I got your hilbert/abs/transposition logic the way you wanted.
import glob
import os

import numpy as np
import pandas as pd
from scipy import signal


def read_scans(folder):
    """
    Read scan files from a given folder, yield tuples of filename/matrix.
    This will also create "cached" npy files for each txt file, e.g. a.txt -> a.txt.npy.
    """
    for filename in sorted(glob.glob(os.path.join(folder, "*.txt")), key=os.path.getmtime):
        np_filename = filename + ".npy"
        if os.path.exists(np_filename):
            yield (filename, np.load(np_filename))
        else:
            df = pd.read_csv(filename, header=None, dtype=np.float64, delim_whitespace=True)
            mat = df.values
            np.save(np_filename, mat)
            yield (filename, mat)


def read_data_opt(folder):
    times = None
    envs = []
    for filename, mat in read_scans(folder):
        if times is None:
            # If we hadn't got the times yet, grab them from this
            # matrix (by slicing the first column).
            times = mat[:, 0]
        env_amplitudes = signal.hilbert(mat[:, 1])
        envs.append(env_amplitudes)
    envs = np.abs(np.array(envs)).T
    return (times, envs)
On my machine, the first run (for your 20-file dataset) takes
read_data_opt took 0.13302898406982422 seconds
and the subsequent runs are 4x faster:
read_data_opt took 0.031115055084228516 seconds

Appending lots of csv files in folders within a folder

I am working on a shared network drive where I have 1 folder (main folder) containing many subfolders, 1 for each date (over 1700), and within them csv files (results.csv) with a common name at the end (same file format). Each csv contains well over 30k rows.
I wish to read in all the csvs, appending them into one dataframe to perform some minor calculations. I have used the code below. It ran for 3+ days so I quit, but looking at the dataframe it actually got 80% of the way through. It seems inefficient, though, as it takes ages, and when I want to add the latest day's file it will have to re-run again. I also only need a handful of the columns within each csv, so I want to use the usecols=['A', 'B', 'C'] argument but am not sure how to incorporate it (a sketch is shown after the code below). Could someone please shed some light on a better solution?
import glob
import os

import pandas as pd

file_source = glob.glob(r"//location//main folder//**//*results.csv", recursive=True)

appended_file = []
for i in file_source:
    df = pd.read_csv(i)
    appended_file.append(df)

combined = pd.concat(appended_file, axis=0, ignore_index=True, sort=False)
Thanks.
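A minimal sketch of how usecols could be incorporated into that loop (the column names 'A', 'B', 'C' are the placeholders from the question):

import glob

import pandas as pd

file_source = glob.glob(r"//location//main folder//**//*results.csv", recursive=True)

# read only the needed columns from each csv; this cuts both parse time and memory,
# and a single concat at the end avoids growing a dataframe row by row
frames = [pd.read_csv(path, usecols=['A', 'B', 'C']) for path in file_source]
combined = pd.concat(frames, axis=0, ignore_index=True, sort=False)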

Pyspark read and combine many parquet files efficiently

I have ~4000 parquet files that are each 3 MB. I would like to read all of the files from an S3 bucket, do some aggregations, combine the files into one dataframe, and do some more aggregations. The parquet dataframes all have the same schema. The files are not all in the same folder in the S3 bucket but rather are spread across 5 different folders. I need some files in each folder but not all. I am working on my local machine, which has 16 GB RAM and 8 processors that I hope to leverage to make this faster. Here is my code:
from functools import reduce

from pyspark.sql import DataFrame
from pyspark.sql import functions as f

# assumes an existing SparkSession `spark`, plus `ids`, `dataset_path` and `ids_df` defined elsewhere

ts_dfs = []
# For loop to read parquet files and append to empty list
for id in ids:
    # Make the path to the timeseries data
    timeseries_path = f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={id[0]}'
    ts_data_df = spark.read.parquet(timeseries_path).select(
        'bldg_id',
        '`out.natural_gas.heating.energy_consumption_intensity`',
        '`out.electricity.heating.energy_consumption_intensity`',
        'timestamp')
    # Aggregate
    ts_data_df = ts_data_df \
        .groupBy(f.month('timestamp').alias('month'), 'bldg_id') \
        .agg(f.sum('`out.electricity.heating.energy_consumption_intensity`').alias('eui_elec'),
             f.sum('`out.natural_gas.heating.energy_consumption_intensity`').alias('eui_gas'))
    # Append
    ts_dfs.append(ts_data_df)

# Combine all of the dfs into one
ts_sdf = reduce(DataFrame.unionAll, ts_dfs)

# Merge with ids_df
ts = ts_sdf.join(ids_df, on=['bldg_id'])

# Mean and Standard Deviation by month
stats_df = ts_sdf.groupBy('month', '`in.hvac_heating_type_and_fuel`') \
    .agg(f.mean('eui_elec').alias('mean_eui_elec'),
         f.stddev('eui_elec').alias('std_eui_elec'),
         f.mean('eui_gas').alias('mean_eui_gas'),
         f.stddev('eui_gas').alias('std_eui_gas'))
Reading all of the files through a for loop does not leverage the multiple cores, defeating the purpose of using Spark. As such, this process takes 90 minutes on its own (though that may be more a function of my internet connection). Additionally, the next step, ts_sdf = reduce(DataFrame.unionAll, ts_dfs), which combines the dataframes using unionAll, is taking even longer. I think my questions are:
How do I avoid having to read the files in a for loop?
Why is the unionAll taking so long? Does PySpark not parallelize this process somehow?
In general, how do I do this better? Perhaps PySpark is not the ideal tool for this?
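One possible direction (a sketch, not an answer from the thread) is to pass all of the county folder paths to a single spark.read.parquet call, so Spark plans the scan and the aggregation as one parallel job instead of ~4000 sequential reads followed by a long unionAll chain; ids, dataset_path and the column names below are taken from the question:

from pyspark.sql import functions as f

# build one path per county id, then read them all in a single call
paths = [
    f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={i[0]}'
    for i in ids
]
ts_sdf = spark.read.parquet(*paths).select(
    'bldg_id',
    '`out.natural_gas.heating.energy_consumption_intensity`',
    '`out.electricity.heating.energy_consumption_intensity`',
    'timestamp')

# a single aggregation over the combined dataframe replaces the per-file groupBy + unionAll
ts_monthly = ts_sdf \
    .groupBy(f.month('timestamp').alias('month'), 'bldg_id') \
    .agg(f.sum('`out.electricity.heating.energy_consumption_intensity`').alias('eui_elec'),
         f.sum('`out.natural_gas.heating.energy_consumption_intensity`').alias('eui_gas'))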

Processing data using pandas in a memory efficient manner using Python

I have to read multiple csv files and group them by "event_name". I also might have some duplicates, so I need to drop them. paths contains all the paths of the csv files, and my code is as follows:
import pandas as pd

data = []
for path in paths:  # `paths` holds all the csv file paths
    csv_file = pd.read_csv(path)
    data.append(csv_file)

events = pd.concat(data)
events = events.drop_duplicates()
event_names = events.groupby('event_name')

ev2 = []
for name, group in event_names:
    a, b = group.shape
    ev2.append([name, a])
This code tells me how many unique event_name values there are, and how many entries there are per event_name. It works wonderfully, except that the csv files are too large and I am having memory problems. Is there a way to do the same using less memory?
I read about using dir() and globals() to delete variables, which I could certainly use, because once I have event_names I don't need the DataFrame events any longer. However, I am still having those memory issues. My question, more specifically, is: can I read the csv files in a more memory-efficient way? Or is there something additional I can do to reduce memory usage? I don't mind sacrificing performance, as long as I can read all csv files at once instead of going chunk by chunk.
Just keep a hash value of each row to reduce the data size.
csv_file = pd.read_csv(path)
# compute hash (gives an `uint64` value per row)
csv_file["hash"] = pd.util.hash_pandas_object(csv_file)
# keep only the 2 columns relevant to counting
data.append(csv_file[["event_name", "hash"]])
If you cannot risk a hash collision (which would be astronomically unlikely), just use another hash key and check whether the final counting results are identical. The way to change the hash key is as follows (note that hash_key must be a 16-character string):
# compute hash using a different hash key (the key must be exactly 16 characters)
csv_file["hash2"] = pd.util.hash_pandas_object(csv_file, hash_key='stackoverflow123')
Reference: pandas official docs page
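To complete the picture, a minimal sketch (not part of the original answer) of how the reduced (event_name, hash) frames could then be deduplicated and counted; index=False is used here so that identical rows coming from different files hash to the same value:

import pandas as pd

data = []
for path in paths:  # `paths` as in the question
    csv_file = pd.read_csv(path)
    # hash the row content only (not the index), one uint64 per row
    csv_file["hash"] = pd.util.hash_pandas_object(csv_file, index=False)
    data.append(csv_file[["event_name", "hash"]])

events = pd.concat(data)
# duplicate rows have identical hashes, so dropping duplicate hashes
# stands in for drop_duplicates() on the full rows
events = events.drop_duplicates(subset="hash")
counts = events.groupby("event_name").size()  # entries per unique event_name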

I want to combine csv's, dropping rows while only keeping certain columns

This is the code I have so far:
import glob
import os

import pandas as pd

os.chdir("L:/FMData/")

results = pd.DataFrame([])
for counter, file in enumerate(glob.glob("F5331_FM001**")):
    namedf = pd.read_csv(file, skiprows=[1,2,3,4,5,6,7], index_col=[1], usecols=[1,2])
    results = results.append(namedf)

results.to_csv('L:/FMData/FM001_D/FM5331_FM001_D.csv')
This, however, produces a new document as instructed but doesn't copy any data into it. I want to look up files in a certain location with names along the lines of FM001, combine them, skip the first 7 rows of each csv, and only keep columns 1 and 2 in the new file. Can anyone help with my code?
Thanks in advance!!!
To combine multiple csv files, you should create a list of dataframes. Then combine the dataframes within your list via pd.concat in a single step. This is much more efficient than appending to an existing dataframe.
In addition, you need to write your result to a file outside your for loop.
For example:
results = []
for counter, file in enumerate(glob.glob("F5331_FM001**")):
    namedf = pd.read_csv(file, skiprows=[1,2,3,4,5,6,7], index_col=[1], usecols=[1,2])
    results.append(namedf)  # append to the list; list.append returns None, so don't reassign

df = pd.concat(results, axis=0)
df.to_csv('L:/FMData/FM001_D/FM5331_FM001_D.csv')
This code works on my side (using Linux and Python 3); it populates a csv file with data.
Add a print just after the read_csv to see if your csv file actually reads any data, otherwise nothing will be written, like this:
namedf = pd.read_csv(file)
print(namedf)
results = results.append(namedf)
It adds row 1 (probably because it is considered the header) and then skips 7 rows and continues. This is my result for a csv file containing just the words "one" to "eleven" written out in rows:
F5331_FM001.csv
one
0 nine
1 ten
2 eleven
Addition:
If print(namedf) shows nothing, then check your input files.
The python program is looking in L:/FMData/ for your files. Are you sure your files are located in that directory? You can change the directory by adding the correct path with the os.chdir command.
