Import Multiple CSV Files to Postgresql - python

I'm currently learning how to code and I have run into a challenge that I have been trying to solve for the last couple of days.
I have over 2,000 CSV files that I would like to import into a particular PostgreSQL table at once, instead of using the import data function in pgAdmin 4, which only lets you import one CSV file at a time. How should I go about doing this? I'm using Windows.

A simple way is to use Cygwin or an inner Ubuntu shell (WSL) to run a script like this:
all_files=("file_1.csv" "file_2.csv") # or use a glob to pick up every CSV in the directory
dir_name=<path_to_files>
export PGUSER=<username_here>
export PGPASSWORD=<password_here>
export PGHOST=localhost
export PGPORT=5432
db_name=<dbname_here>
table_name=<table_name_here>
echo "writing to db"
for file in "${all_files[@]}"; do
    psql -d "$db_name" -c "\copy $table_name FROM '$dir_name/$file' WITH (FORMAT csv, HEADER)" >/dev/null
done

If you want to do this purely in Python, then I have given an approach below. It's possible that you wouldn't need to chunk the list at all (that you could hold all of the files in memory at once and not need to work in batches). It's also possible that the files are radically different sizes and you'd need something more sophisticated than simple batches to avoid creating an in-memory file object that exceeds your RAM. Or you might choose to do it in 2000 separate transactions, but I suspect some kind of batching will be faster (untested).
import csv
import io
import os

import psycopg2

CSV_DIR = 'the_csv_folder/'  # Relative path here, might need to be an absolute path


def chunks(l, n):
    """
    https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n = max(1, n)
    return [l[i:i+n] for i in range(0, len(l), n)]


# Get a list of all the CSV files in the directory
all_files = os.listdir(CSV_DIR)

# Chunk the list of files. Let's go with 100 files per chunk, can be changed
chunked_file_list = chunks(all_files, 100)

# Iterate the chunks and aggregate the files in each chunk into a single
# in-memory file
for chunk in chunked_file_list:

    # This is the file to aggregate into
    string_buffer = io.StringIO()
    csv_writer = csv.writer(string_buffer)

    for file in chunk:
        with open(CSV_DIR + file) as infile:
            reader = csv.reader(infile)
            # Transfer the read rows to the aggregated file
            csv_writer.writerows(reader)

    # Rewind the in-memory file so copy_from reads it from the start
    string_buffer.seek(0)

    # Now we have aggregated the chunk, copy the file to Postgres
    with psycopg2.connect(dbname='the_database_name',
                          user='the_user_name',
                          password='the_password',
                          host='the_host') as conn:

        c = conn.cursor()

        # Headers need to be the table field names, in the order they appear in
        # the csv
        headers = ['first_name', 'last_name', ...]

        # Now upload the data as though it was a file
        c.copy_from(string_buffer, 'the_table_name', sep=',', columns=headers)
        conn.commit()
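As a point of comparison with the batched version, the "2000 separate transactions" option mentioned above could look roughly like this (untested sketch reusing the same placeholder names; copy_expert lets you pass an explicit COPY statement per file):

# rough per-file alternative to the batched copy above (untested)
with psycopg2.connect(dbname='the_database_name',
                      user='the_user_name',
                      password='the_password',
                      host='the_host') as conn:
    c = conn.cursor()
    # the table's column list, in csv order (placeholders as above)
    columns = 'first_name, last_name'
    for file in os.listdir(CSV_DIR):
        with open(CSV_DIR + file) as infile:
            c.copy_expert(
                f"COPY the_table_name ({columns}) FROM STDIN WITH (FORMAT csv)",
                infile)
        conn.commit()  # one transaction per file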

Related

Python read .json files from GCS into pandas DF in parallel

TL;DR: asyncio vs multi-processing vs threading vs. some other solution to parallelize for loop that reads files from GCS, then appends this data together into a pandas dataframe, then writes to BigQuery...
I'd like to make parallel a python function that reads hundreds of thousands of small .json files from a GCS directory, then converts those .jsons into pandas dataframes, and then writes the pandas dataframes to a BigQuery table.
Here is a non-parallel version of the function:
import json

import gcsfs
import pandas as pd

from my.helpers import get_gcs_file_list

def load_gcs_to_bq(gcs_directory, bq_table):
    # my own function to get list of filenames from GCS directory
    files = get_gcs_file_list(directory=gcs_directory)

    # Create new table
    output_df = pd.DataFrame()
    fs = gcsfs.GCSFileSystem()  # Google Cloud Storage (GCS) File System (FS)
    counter = 0
    for file in files:
        # read files from GCS
        with fs.open(file, 'r') as f:
            gcs_data = json.loads(f.read())
        data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
        this_df = pd.DataFrame(data)
        output_df = output_df.append(this_df)

        # Write to BigQuery for every 5K rows of data
        counter += 1
        if (counter % 5000 == 0):
            pd.DataFrame.to_gbq(output_df, bq_table, project_id=my_id, if_exists='append')
            output_df = pd.DataFrame()  # and reset the dataframe

    # Write remaining rows to BigQuery
    pd.DataFrame.to_gbq(output_df, bq_table, project_id=my_id, if_exists='append')
This function is straightforward:
grab ['gcs_dir/file1.json', 'gcs_dir/file2.json', ...], the list of file names in GCS
loop over each file name, and:
read the file from GCS
convert the data into a pandas DF
append it to a main pandas DF
every 5K loops, write to BigQuery (since the appends get much slower as the DF gets larger)
I have to run this function on a few GCS directories each with ~500K files. Due to the bottleneck of reading/writing this many small files, this process will take ~24 hours for a single directory... It would be great if I could make this more parallel to speed things up, as it seems like a task that lends itself to parallelization.
Edit: The solutions below are helpful, but I am particularly interested in running in parallel from within the python script. Pandas is handling some data cleaning, and using bq load will throw errors. There is asyncio and this gcloud-aio-storage that both seem potentially useful for this task, maybe as better options than threading or multiprocessing...
Rather than add parallel processing to your python code, consider invoking your python program multiple times in parallel. This is a trick that lends itself more easily to a program that takes a list of files on the command line. So, for the sake of this post, let's consider changing one line in your program:
Your line:
# my own function to get list of filenames from GCS directory
files = get_gcs_file_list(directory=gcs_directory)
New line:
files = sys.argv[1:] # ok, import sys, too
Now, you can invoke your program this way:
PROCESSES=100
get_gcs_file_list.py | xargs -P $PROCESSES your_program
xargs will now take the file names output by get_gcs_file_list.py and invoke your_program up to 100 times in parallel, fitting as many file names as it can on each line. I believe the number of file names is limited to the maximum command size allowed by the shell. If 100 processes is not enough to process all your files, xargs will invoke your_program again (and again) until all file names it reads from stdin are processed. xargs ensures that no more than 100 invocations of your_program are running simultaneously. You can vary the number of processes based on the resources available to your host.
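The get_gcs_file_list.py helper here is the asker's own; purely as an illustrative assumption, it could be as simple as listing the directory with gcsfs and printing one gs:// path per line so xargs can split them across invocations:

import sys
import gcsfs

# hypothetical stand-in for the asker's helper: print one path per line for xargs
fs = gcsfs.GCSFileSystem()
for path in fs.ls(sys.argv[1]):  # e.g. "my-bucket/gcs_dir"
    if path.endswith('.json'):
        print('gs://' + path)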
Instead of doing this, you can use the bq command directly.
The bq command-line tool is a Python-based command-line tool for BigQuery.
When you use this command, loading takes place inside Google's network, which is much faster than creating a dataframe locally and loading it into the table.
bq load \
--autodetect \
--source_format=NEWLINE_DELIMITED_JSON \
mydataset.mytable \
gs://mybucket/my_json_folder/*.json
For more information - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#loading_json_data_into_a_new_table
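If you specifically want the parallelism inside the Python script, as the question's edit asks, one rough and untested option is to wrap the GCS reads in a thread pool (the work is network-bound, so threads help despite the GIL). This sketch reuses the question's own setup and the to_gbq batching idea:

import json
from concurrent.futures import ThreadPoolExecutor

import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem()

def read_one(path):
    # each worker reads and parses a single small .json from GCS
    with fs.open(path, 'r') as f:
        gcs_data = json.loads(f.read())
    data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
    return pd.DataFrame(data)

def load_gcs_to_bq_parallel(files, bq_table, project_id, batch_size=5000):
    with ThreadPoolExecutor(max_workers=32) as pool:
        # read in batches so each to_gbq call uploads roughly batch_size files' worth of rows
        for start in range(0, len(files), batch_size):
            frames = list(pool.map(read_one, files[start:start + batch_size]))
            batch_df = pd.concat(frames, ignore_index=True)
            batch_df.to_gbq(bq_table, project_id=project_id, if_exists='append')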

Counting unique IDs across several hundred files?

I have about 750 files (.csv), and each line has one entry, which is a UUID. My goal for this script is to count how many unique UUIDs exist across all 750 or so files. The file name structure looks like the following:
DATA-20200401-005abf4e3f864dcb83bd9030e63c6da6.csv
As you can see, it has a date and some random id. They're all in the same directory and they all have the same file extension. The format of each file is new line delimited and just has a UUID that looks like the following: b0d6e1e9-1b32-48d5-b962-671664484616
I tried merging all the files, but things got messy and this is about 15GB worth of data.
My final goal is to get an output such that it states the number of unique IDs across all the files. For example:
file1:
xxx-yyy-zzz
aaa-bbb-ccc
xxx-yyy-zzz
file2:
xxx-yyy-zzz
aaa-bbb-ccc
xxx-yyy-zzz
The final output after scanning these two files would be:
The total number of unique ids is: 2
I reckon using a Counter may be the fastest way to do this:
from collections import Counter
import glob

c = Counter()
for filename in glob.glob('*.csv'):
    with open(filename) as f:
        c.update(line.strip() for line in f)
print(f"The total number of unique ids is: {len(c)}")
The counter provides the count of each unique item. This is implemented using a hashtable so should be fairly quick with a large number of items.
If you don't have to use Python, then a simple solution might be the command line:
cat *.csv | sort -u | wc -l
This pipes the content of all of the CSV files into sort -u, which sorts and removes duplicates, then pipes that into wc -l, which does a line count.
Note: sort will spill to disk as needed, and you can control its memory usage with -S size if you like.
I'd be tempted to run this on a powerful machine with lots of RAM.
Maybe something like this would work:
from os import listdir
import re
import pandas as pd

my_folder_path = "C:\\"  # folder containing the csv files
# Generic regular expression for the file names
pat = r"DATA-\d{8}-.+\.csv"
p = re.compile(pat)
# UUID column in each file (I don't know if this is the case; adjust accordingly)
uuid_column = "uuids"
# Empty result dataframe with a single column
result_df = pd.DataFrame(columns=["unique_uuid"])
file_list = [rf"{my_folder_path}{i}" for i in listdir(my_folder_path)]
for f in file_list:
    # Check for matching regular expression pattern
    if p.search(f):
        # Read file if pattern matches
        df = pd.read_csv(f, usecols=[uuid_column])
        # Append only values from this file that aren't already in the result
        new_values = set(df[uuid_column].values).difference(result_df["unique_uuid"].values)
        result_df = pd.concat(
            [result_df, pd.DataFrame({"unique_uuid": list(new_values)})],
            ignore_index=True,
        )
Concatenating all of the csv files in a directory has been solved in a pretty popular post. The only difference here is that you drop duplicates. This would of course work well only if there are a significant amount of duplicates in each file (at least enough for all of the deduped frames to fit into memory and perform the final drop_duplicates).
There are also some other suggestions in that link, such as skipping the list altogether.
import glob
import pandas as pd

files = glob.glob('./data_path/*.csv')
li = []
for file in files:
    df = pd.read_csv(file, index_col=None, header=None)
    li.append(df.drop_duplicates())

output = pd.concat(li, axis=0, ignore_index=True)
output = output.drop_duplicates()
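From there, len(output) gives the number the question asks for:

print(f"The total number of unique ids is: {len(output)}")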
Read all the files and add all the UUIDs to a set as you go. Sets enforce uniqueness, so the length of the set is the number of unique UUIDs you found. Roughly:
import csv
import os

uuids = set()
for path in os.listdir():
    with open(path) as file:
        for row in csv.reader(file):
            uuids.update(row)

print(f"The total number of unique ids is: {len(uuids)}")
This assumes that you can store all the unique UUIDs in memory. If you can't, building a database on disk would be the next thing to try (e.g. replace the set with a sqlite db or something along those lines). If you had a number of unique IDs that's too large to store anywhere, there are still solutions as long as you're willing to sacrifice some precision: https://en.wikipedia.org/wiki/HyperLogLog
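As a rough, untested illustration of the sqlite fallback (the file and table names here are made up), a PRIMARY KEY constraint can do the deduplication on disk:

import csv
import os
import sqlite3

# on-disk store of IDs; the PRIMARY KEY constraint enforces uniqueness
conn = sqlite3.connect("uuids.db")
conn.execute("CREATE TABLE IF NOT EXISTS ids (uuid TEXT PRIMARY KEY)")

for path in os.listdir():
    if not path.endswith(".csv"):
        continue
    with open(path) as file:
        rows = [(value,) for row in csv.reader(file) for value in row]
    conn.executemany("INSERT OR IGNORE INTO ids VALUES (?)", rows)
    conn.commit()

count = conn.execute("SELECT COUNT(*) FROM ids").fetchone()[0]
print(f"The total number of unique ids is: {count}")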

How to 1. convert 4,550 dbf files to csv files 2. concatenate files based on names 3. concatenate all csv's into one big data csv for analysis?

I have multiple dbf files (~4,550) in multiple folders and sub-directories (~400) separated by state.
The data was given to me in dbf files on a weekly basis separated by state.
Ex.
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA1071.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA1071.DBF"
How would I convert + merge all the dbf files into one csv for each state i.e. keeping the states separate (for regional data analysis)?
Currently using Python 3 and Jupyter notebooks on Windows 10.
This problem seems to be solvable using python, I have attempted to experiment with dbf2csv and other dbf and csv functions.
Code below shows some great starting points. Research was done through many posts and my own experimentation.
I'm still getting started with using python for working with files, but I'm not entirely sure how to code around the tedious tasks.
I typically use the functions below to convert to csv, followed by a line in the command prompt to combine all csv files into one.
The function below converts one specific dbf to csv
import csv
from dbfread import DBF

def dbf_to_csv(dbf_table_pth):  # Input a dbf, output a csv, same name, same path, except extension
    csv_fn = dbf_table_pth[:-4] + ".csv"  # Set the csv file name
    table = DBF(dbf_table_pth)  # table variable is a DBF object
    with open(csv_fn, 'w', newline='') as f:  # create a csv file, fill it with dbf content
        writer = csv.writer(f)
        writer.writerow(table.field_names)  # write the column names
        for record in table:  # write the rows
            writer.writerow(list(record.values()))
    return csv_fn  # return the csv name
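As an illustrative usage example, converting one of the files from the question's paths would just be:

dbf_to_csv(r"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.DBF")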
The script below converts all dbf files in a given folder to csv format.
This works great, but doesn't take the subfolders and sub-directories into consideration.
import fnmatch
import os
import csv
import time
import datetime
import sys
from dbfread import DBF, FieldParser, InvalidValue
# pip install dbfread if needed

class MyFieldParser(FieldParser):
    def parse(self, field, data):
        try:
            return FieldParser.parse(self, field, data)
        except ValueError:
            return InvalidValue(data)

debugmode = 0  # Set to 1 to catch all the errors.

for infile in os.listdir('.'):
    if fnmatch.fnmatch(infile, '*.dbf'):
        outfile = infile[:-4] + ".csv"
        print("Converting " + infile + " to " + outfile + ". Each period represents 2,000 records.")
        counter = 0
        starttime = time.perf_counter()  # time.clock() was removed in Python 3.8
        with open(outfile, 'w', newline='') as csvfile:
            table = DBF(infile, parserclass=MyFieldParser, ignore_missing_memofile=True)
            writer = csv.writer(csvfile)
            writer.writerow(table.field_names)
            for i, record in enumerate(table):
                for name, value in record.items():
                    if isinstance(value, InvalidValue):
                        if debugmode == 1:
                            print('records[{}][{!r}] == {!r}'.format(i, name, value))
                writer.writerow(list(record.values()))
                counter += 1
                if counter % 100000 == 0:
                    sys.stdout.write('!' + '\r\n')
                    endtime = time.perf_counter()
                    # print(str("{:,}".format(counter)) + " records in " + str(endtime - starttime) + " seconds.")
                elif counter % 2000 == 0:
                    sys.stdout.write('.')
                else:
                    pass
        print("")
        endtime = time.perf_counter()
        print("Processed " + str("{:,}".format(counter)) + " records in " + str(endtime - starttime) + " seconds (" + str((endtime - starttime) / 60) + " minutes.)")
        print(str(counter / (endtime - starttime)) + " records per second.")
        print("")
But this process is too tedious considering there are over 400 sub-folders.
Then, using the command prompt, I type
copy *.csv combine.csv
but this could be done with Python as well.
Currently experimenting with os.walk, but I have not made any major progress.
Ideally, the output should be a csv file with all the combined data for each individual state.
Ex.
"\Datafiles\FL.csv"
"\Datafiles\NJ.csv"
It would also be alright if the output was into a pandas dataframe for each individual state.
UPDATE
Edit: I was able to convert all the dbf files to csv using os.walk.
os.walk has also been helpful in providing me with a list of directories which contain the dbf and csv files.
Ex.
fl_dirs= ['\Datafiles\\01_APRIL_2019\\01_APRIL_2019\\FL',
'\Datafiles\\01_JUly_2019\\01_JUlY_2019\\FL',
'\Datafiles\\03_JUNE_2019\\03_JUNE_2019\\FL',
'\Datafiles\\04_MARCH_2019\\04_MARCH_2019\\FL']
I simply want to access the identical csv files in those directories and combine them into one csv file with python.
UPDATE: SOLVED THIS! I wrote a script that can do everything I needed!
This problem can be simplified using os.walk (https://docs.python.org/3/library/os.html#os.walk).
The sub-directories can be traversed and the absolute path of each dbf file can be appended to separate lists based on the state.
Then, the files can be converted to csv using the dbf_to_csv function, and the resulting csv files can be combined using the concat feature in pandas (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html).
EDIT: The following code might help. It's not tested, though.
import os
import pandas as pd

# base path here
base_path = ""
# output dir here
output_path = ""

# Create a dictionary to store all absolute paths, keyed by state
path_dict = {"FL": [], "NJ": []}

# recursively look into the base path
for abs_path, curr_dir, file_list in os.walk(base_path):
    if abs_path.endswith("FL"):
        path_dict["FL"].extend([os.path.join(abs_path, file) for file in file_list if file.lower().endswith(".csv")])
    elif abs_path.endswith("NJ"):
        path_dict["NJ"].extend([os.path.join(abs_path, file) for file in file_list if file.lower().endswith(".csv")])

for state in path_dict:
    df = pd.concat(
        [pd.read_csv(i) for i in set(path_dict[state])],
        ignore_index=True
    )
    df.to_csv(os.path.join(output_path, state + ".csv"), index=False)
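The posted code assumes the dbf-to-csv conversion has already happened; a rough, untested way to plug the dbf_to_csv function from earlier into the same walk before concatenating:

# rough sketch (untested): convert every .dbf found during the walk first,
# so the concat step above has csv files to read
for abs_path, curr_dir, file_list in os.walk(base_path):
    for file in file_list:
        if file.lower().endswith(".dbf"):
            dbf_to_csv(os.path.join(abs_path, file))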

loop over multiple rrd files in directory

I need to iterate through all the .rrd files inside a given directory, fetch the data inside each rrd database, do some processing on it, and export everything into a single csv file from a Python script.
How can this be done in an efficient way? Advice on how to loop over the files and access the database data in each one would be appreciated.
I assume you have a rrdtool installation with python bindings already on your system. If not, here is an installation description.
Then, to loop over the .rrd files in a given directory and perform a fetch on each:
import os
import rrdtool

target_directory = "some/directory/with/rrdfiles"
rrd_files = [os.path.join(target_directory, f)
             for f in os.listdir(target_directory) if f.endswith('.rrd')]

data_container = []
for rrd_file in rrd_files:
    data = rrdtool.fetch(rrd_file, 'AVERAGE',    # or whichever CF you need
                         '--resolution', '200',  # for example
                         )                       # and any other settings you need
    data_container.append(data)
The parameter list follows rrdfetch.
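Assuming the standard python-rrdtool binding, each fetch result is a ((start, end, step), ds_names, rows) tuple, so inside the loop you could expand it into CSV-ready rows instead of appending the raw tuple; a rough sketch:

# rough sketch of the per-file expansion, replacing data_container.append(data)
(start, end, step), ds_names, rows = data
timestamp = start
for row in rows:
    # one row per timestep: the timestamp followed by one value per data source
    data_container.append([timestamp, *row])
    timestamp += step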
Once you have whatever data you need inside the rrd_files loop, you should accumulate it in a list of lists, with each sublist being one row of data. Writing them to a csv is as easy as this:
import csv

# read the .rrd data as above into a 'data_container' variable
with open('my_data.csv', 'w', newline='') as csv_file:
    rrd_writer = csv.writer(csv_file)
    for row in data_container:
        rrd_writer.writerow(row)
This should outline the general steps you have to follow, you will probably need to adapt them (the rrdfetch in particular) to your case.

Split huge (95Mb) JSON array into smaller chunks?

I exported some data from my database in the form of JSON, which is essentially just one [list] with a bunch (900K) of {objects} inside it.
Trying to import it on my production server now, but I've got some cheap web server. They don't like it when I eat all their resources for 10 minutes.
How can I split this file into smaller chunks so that I can import it piece by piece?
Edit: Actually, it's a PostgreSQL database. I'm open to other suggestions on how I can export all the data in chunks. I've got phpPgAdmin installed on my server, which supposedly can accept CSV, Tabbed and XML formats.
I had to fix phihag's script:
import json

with open('fixtures/PostalCodes.json', 'r') as infile:
    o = json.load(infile)

chunkSize = 50000
for i in range(0, len(o), chunkSize):
    with open('fixtures/postalcodes_' + ('%02d' % (i // chunkSize)) + '.json', 'w') as outfile:
        json.dump(o[i:i + chunkSize], outfile)
dump:
pg_dump -U username -t table database > filename
restore:
psql -U username database < filename
(I don't know what the heck pg_restore does, but it gives me errors)
The tutorials on this conveniently leave this information out, esp. the -U option which is probably necessary in most circumstances. Yes, the man pages explain this, but it's always a pain to sift through 50 options you don't care about.
I ended up going with Kenny's suggestion... although it was still a major pain. I had to dump the table to a file, compress it, upload it, extract it, then I tried to import it, but the data was slightly different on production and there were some missing foreign keys (postalcodes are attached to cities). Of course, I couldn't just import the new cities, because then it throws a duplicate key error instead of silently ignoring it, which would have been nice. So I had to empty that table, repeat the process for cities, only to realize something else was tied to cities, so I had to empty that table too. Got the cities back in, then finally I could import my postal codes. By now I've obliterated half my database because everything is tied to everything and I've had to recreate all the entries. Lovely. Good thing I haven't launched the site yet. Also "emptying" or truncating a table doesn't seem to reset the sequences/autoincrements, which I'd like, because there are a couple magic entries I want to have ID 1. So..I'd have to delete or reset those too (I don't know how), so I manually edited the PKs for those back to 1.
I would have run into similar problems with phihag's solution, plus I would have had to import 17 files one at a time, unless I wrote another import script to match the export script. Although he did answer my question literally, so thanks.
In Python:
import json

with open('file.json') as infile:
    o = json.load(infile)

chunkSize = 1000
for i in range(0, len(o), chunkSize):
    with open('file_' + str(i // chunkSize) + '.json', 'w') as outfile:
        json.dump(o[i:i + chunkSize], outfile)
I turned phihag's and mark's work into a tiny script (gist)
also copied below:
#!/usr/bin/env python
# based on http://stackoverflow.com/questions/7052947/split-95mb-json-array-into-smaller-chunks
# usage: python json-split filename.json
# produces multiple filename_0.json of 1.49 MB size

import json
import sys

with open(sys.argv[1], 'r') as infile:
    o = json.load(infile)

chunkSize = 4550
for i in range(0, len(o), chunkSize):
    with open(sys.argv[1] + '_' + str(i // chunkSize) + '.json', 'w') as outfile:
        json.dump(o[i:i + chunkSize], outfile)
Assuming you have the option to go back and export the data again...:
pg_dump - extract a PostgreSQL database into a script file or other archive file.
pg_restore - restore a PostgreSQL database from an archive file created by pg_dump.
If that's no use, it might be useful to know what you're going to be doing with the output so that another suggestion can hit the mark.
I know this question is from a while back, but I think this newer solution is hassle-free.
You can use pandas 0.21.0+, which supports a chunksize parameter as part of read_json. You can load one chunk at a time and save the json:
import pandas as pd

chunks = pd.read_json('file.json', lines=True, chunksize=20)
for i, c in enumerate(chunks):
    c.to_json('chunk_{}.json'.format(i))
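One caveat: lines=True expects newline-delimited JSON, while the file in the question is a single JSON array, so you may need to convert it first; a rough, untested sketch:

import json
import pandas as pd

# convert the single-array file to newline-delimited JSON (one object per line)
with open('file.json') as infile:
    records = json.load(infile)
with open('file.ndjson', 'w') as outfile:
    for record in records:
        outfile.write(json.dumps(record) + '\n')

# now the chunked reader works on the converted file
chunks = pd.read_json('file.ndjson', lines=True, chunksize=20)
for i, c in enumerate(chunks):
    c.to_json('chunk_{}.json'.format(i), orient='records', lines=True)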
