I have a process which reads from 4 databases with 4 tables each and consolidates that data into 1 Postgres database with 4 tables total. (Each of the original 4 databases has the same 4 tables, which need to be consolidated.)
The way I am doing it now works, using pandas: I read one table from all 4 databases, concatenate the data into one dataframe, then use to_sql to save it to my Postgres database. I then loop through, doing the same thing for the remaining tables.
My issue is speed. One of my tables has about 1-2 million rows per date, so it can take about 5,000-6,000 seconds to finish writing the data to Postgres. It is much quicker to write it to a .csv file and then use COPY FROM in pgAdmin.
Here is my current code. Note that there are some function calls, but they basically just refer to the table names. I also have some basic logging, but that is not strictly necessary. I am adding a column for the source database, which is required. I am also stripping .0 from fields which are really strings but which pandas sees as floats, and I fill blank integers with 0 and make sure those columns are really of type int.
def query_database(table, table_name, query_date):
    df_list = []
    log_list = []
    for db in ['NJ', 'NJ2', 'LA', 'NA']:
        start_time = time.clock()
        query_timestamp = dt.datetime.now(pytz.timezone('UTC')).strftime('%Y-%m-%d %H:%M:%S')
        engine_name = '{}{}{}{}'.format(connection_type, server_name, '/', db)
        print('Accessing {} from {}'.format((select_database(db)[0][table]), engine_name))
        engine = create_engine(engine_name)
        df = pd.read_sql_query(query.format(select_database(db)[0][table]), engine, params={query_date})
        query_end = time.clock() - start_time
        df['source_database'] = db
        df['insert_date_utc'] = query_timestamp
        df['row_count'] = df.shape[0]
        df['column_count'] = df.shape[1]
        df['query_time'] = round(query_end, 0)
        df['maximum_id'] = df['Id'].max()
        df['minimum_id'] = df['Id'].min()
        df['source_table'] = table_dict.get(table)
        log = df[['insert_date_utc', 'row_date', 'source_database', 'source_table', 'row_count',
                  'column_count', 'query_time', 'maximum_id', 'minimum_id']].copy()
        df.drop(['row_count', 'column_count', 'query_time', 'maximum_id', 'minimum_id', 'source_table'],
                inplace=True, axis=1)
        df_list.append(df)
        log_list.append(log)
    log = pd.concat(log_list)
    log.drop_duplicates(subset=['row_date', 'source_database', 'source_table'], inplace=True, keep='last')
    result = pd.concat(df_list)
    result.drop_duplicates('Id', inplace=True)
    cols = [i.strip() for i in (create_columns(select_database(db)[0][table]))]
    result = result[cols]
    print('Creating string columns for {}'.format(table_name))
    for col in modify_str_cols(select_database(db)[0][table]):
        create_string(result, col)
    print('Creating integer columns for {}'.format(table_name))
    for col in modify_int_cols(select_database(db)[0][table]):
        create_int(result, col)
    log.to_sql('raw_query_log', cms_dtypes.pg_engine, index=False, if_exists='append', dtype=cms_dtypes.log_dtypes)
    print('Inserting {} data into PostgreSQL'.format(table_name))
    result.to_sql(create_table(select_database(db)[0][table]), cms_dtypes.pg_engine, index=False,
                  if_exists='append', chunksize=50000, dtype=create_dtypes(select_database(db)[0][table]))
How can I work a COPY TO and COPY FROM into this to speed it up? Should I just write the .csv files and then loop over those, or can I COPY from memory straight into my Postgres database?
psycopg2 offers a number of copy-related APIs. If you want to use csv, you have to use copy_expert (which allows you to specify a full COPY statement).
Normally when I have done this, I have used copy_expert() and a file-like object which iterates through a file on disk. That seems to work reasonably well.
That being said, in your case I think copy_to and copy_from are better matches, because this is simply a Postgres-to-Postgres transfer. Note that these use PostgreSQL's COPY output/input syntax and not csv (again, for csv you have to use copy_expert).
Before you decide how to do things, note:
copy_to copies to a file-like object (such as StringIO) and copy_from/copy_expert copy from a file-like object. If you want to use a pandas dataframe, you are going to have to think about this a little and either create a file-like object or use csv along with StringIO and copy_expert to generate an in-memory csv and load that.
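For example, here is a minimal sketch of the in-memory csv route, assuming a psycopg2 connection, a hypothetical target table my_table, and a small stand-in DataFrame (your real frame and column list would replace these):
import io

import pandas as pd
import psycopg2

# Hypothetical connection and target table; substitute your own details.
conn = psycopg2.connect("dbname=mydb user=postgres")
result = pd.DataFrame({"Id": [1, 2], "source_database": ["NJ", "LA"]})

# Dump the frame into an in-memory csv buffer (no index, no header row).
buf = io.StringIO()
result.to_csv(buf, index=False, header=False)
buf.seek(0)

# Stream the buffer straight into Postgres with COPY ... FROM STDIN.
with conn.cursor() as cur:
    cur.copy_expert(
        "COPY my_table (id, source_database) FROM STDIN WITH (FORMAT csv)",
        buf,
    )
conn.commit()
copy_expert streams the whole buffer in a single COPY statement, which is generally much faster than the row-wise INSERTs generated by to_sql.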
Related
I have a MariaDB database that contains CSVs stored as BLOB objects. I wanted to read these into pandas, but it appears that each CSV is stored as a text file in its own cell, like this:
Name | Data
csv1 | col1, col2, ...
csv2 | col1, col2, ...
How can I read the cells in the Data column, each as its own CSV, into pandas dataframes?
This is what I have tried:
raw = pd.read_sql_query(query, engine)
cell_as_string = raw.to_string(index=False)
converted_string = StringIO(cell_as_string)
rawdf = pd.read_csv(converted_string, sep = ',')
rawdf
However, rawdf is just the string with spaces, not a dataframe.
How can I ... read the cells ... into a pandas dataframe
Why is this even interesting?
It appears you already have the answer.
You are able to SELECT each item, open a file for writing, transfer the data, and then ask .read_csv for a DataFrame. But perhaps the requirement was to avoid spurious disk I/O. OK. The read_csv function accepts a file-like input, and several libraries offer such data objects. If the original question were reproducible, it would include code that started like this:
from io import BytesIO, StringIO
default = "n,square\n2,4\n3,9"
blob = do_query() or default.encode("utf-8")
assert isinstance(blob, bytes)
Then with a binary BLOB in hand it's just a matter of:
f = StringIO(blob.decode("utf-8"))
df = pd.read_csv(f)
print(df.set_index("n"))
Sticking with bytes we might prefer the equivalent:
f = BytesIO(blob)
df = pd.read_csv(f)
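Putting the pieces together for the original table, here is a hedged sketch that turns every BLOB cell in the Data column into its own DataFrame; the stand-in query result and the Name/Data column names are assumptions taken from the table shown above:
from io import BytesIO

import pandas as pd

# Stand-in for the result of the earlier read_sql_query; in the real case the
# Data column holds the BLOB bytes returned by MariaDB.
raw = pd.DataFrame({
    "Name": ["csv1", "csv2"],
    "Data": [b"col1,col2\n1,2", b"col1,col2\n3,4"],
})

frames = {}
for name, blob in zip(raw["Name"], raw["Data"]):
    # Each cell is one complete CSV; encode to bytes if the driver returned text.
    payload = blob if isinstance(blob, bytes) else str(blob).encode("utf-8")
    frames[name] = pd.read_csv(BytesIO(payload))

print(frames["csv1"])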
I am trying to write a Discord bot in Python. The goal of the bot is to fill a table with entries from users, recording username, game name, and game password, and then, for specific users, to extract that data and remove the solved entry. I took the first tool I found on Google to manage tables, PyTables. I'm able to fill a table in an HDF5 file, but I am unable to read the entries back.
It may be important to say that I have never coded in Python before.
This is how I declare my object and create a file to store entries.
class DCParties(tables.IsDescription):
    user_name = StringCol(32)
    game_name = StringCol(16)
    game_pswd = StringCol(16)

h5file = open_file("DCloneTable.h5", mode="w", title="DClone Table")
group = h5file.create_group("/", 'DCloneEntries', "Entries for DClone runs")
table = h5file.create_table(group, 'Entries', DCParties, "Entrées")
h5file.close()
This is how I fill entries
h5file = open_file("DCloneTable.h5", mode="a")
table = h5file.root.DCloneEntries.Entries
particle = table.row
particle['user_name'] = member.author
particle['game_name'] = game_name
particle['game_pswd'] = game_pswd
particle.append()
table.flush()
h5file.close()
All of this works, and I can see my entries filling the table in the file with an HDF5 viewer.
But then, when I want to read the table stored in the file to extract the data, it does not work.
h5file = open_file("DCloneTable.h5", mode="a")
table = h5file.root.DCloneEntries.Entries
particle = table.row
"""???"""
h5file.close()
I tried using particle["user_name"] (because user_name on its own isn't defined), and it gives me b'' as output:
h5file = open_file("DCloneTable.h5", mode="a")
table = h5file.root.DCloneEntries.Entries
particle = table.row
print(f'{particle["user_name"]}')
h5file.close()
b''
And if I do
h5file = open_file("DCloneTable.h5", mode="a")
table = h5file.root.DCloneEntries.Entries
particle = table.row
print(f'{particle["user_name"]} - {particle["game_name"]} - {particle["game_pswd"]}')
h5file.close()
b'' - b'' - b''
Where am I failing? Many thanks in advance :)
Here is a simple method to iterate over the table rows and print them one at a time.
PyTables stores StringCol data as byte strings, so your character data comes back as bytes. That's why you see the b prefix. To get rid of it, you have to convert back to Unicode using .decode('utf-8'). This works with your hard-coded field names; you could also use the values from table.colnames to handle any column names. Also, I recommend using Python's file context manager (with/as) to avoid leaving the file open.
import tables as tb

with tb.open_file("DCloneTable.h5", mode="r") as h5file:
    table = h5file.root.DCloneEntries.Entries
    print(f'Table Column Names: {table.colnames}')

    # Method to iterate over rows
    for row in table:
        print(f"{row['user_name'].decode('utf-8')} - " +
              f"{row['game_name'].decode('utf-8')} - " +
              f"{row['game_pswd'].decode('utf-8')}")

    # Method to only read the first row, aka table[0]
    print(f"{table[0]['user_name'].decode('utf-8')} - " +
          f"{table[0]['game_name'].decode('utf-8')} - " +
          f"{table[0]['game_pswd'].decode('utf-8')}")
If you prefer to read all the data at one time, you can use the table.read() method to load the data into a NumPy structured array. You still have to convert from bytes to Unicode. As a result it is "slightly more complicated", so I didn't post that method.
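For completeness, here is a minimal, hedged sketch of that table.read() approach, under the same assumptions about the file layout as above:
import tables as tb

with tb.open_file("DCloneTable.h5", mode="r") as h5file:
    table = h5file.root.DCloneEntries.Entries
    data = table.read()  # NumPy structured array holding every row at once
    for rec in data:
        # Each field is a byte string, so decode it back to Unicode.
        print(" - ".join(rec[name].decode("utf-8") for name in table.colnames))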
I am trying to insert an xls file into an Oracle table using cx_Oracle. Below is how I am trying to achieve this:
wb = open_workbook('test.xls')
values = []
sheets = wb.sheet_names()
xl_sheet = wb.sheet_by_name(s)
sql_str = preparesql('MATTERS')  # this is a function I have created which returns the insert statement I am using to load the table
for row in range(1, xl_sheet.nrows):
    col_names = xl_sheet.row(0)
    col_value = []
    for name, col in zip(col_names, range(xl_sheet.ncols)):
        searchObj = re.search(r"Open Date|Matter Status Date", name.value, re.M | re.I)
        if searchObj:
            if (xl_sheet.cell(row, col).value) == '':
                value = ''
            else:
                value = datetime(*xlrd.xldate_as_tuple(xl_sheet.cell(row, col).value, wb.datemode))
                value = value.strftime('%d-%b-%Y ')
        else:
            value = (xl_sheet.cell(row, col).value)
        col_value.append(value)
    values.append(col_value)
cur.executemany(sql_str, values, batcherrors=True)
But when I tested it against multiple xls files, it threw a TypeError for some of them. I can't share the data due to restrictions from the client. I feel the issue is related to the dtypes of the columns in Excel compared to the DB. Is there any way I can match the dtypes of the values list above to the datatypes of the columns in the DB, or are there other ways to get the insert done? I tried using dataframe.to_sql, but it takes a lot of time. I am able to insert the same data by looping through the rows in the values list.
I suggest you import the data into a pandas dataframe; then it becomes very easy to work with the data types. You can change the data type of a whole column and then insert it easily.
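A rough sketch of that idea (hedged: the connection string, sheet layout, date columns, and the MATTERS insert statement are all assumptions) might look like this:
import cx_Oracle
import pandas as pd

# Hypothetical connection string and table; adjust to your environment.
conn = cx_Oracle.connect("user/password@host:1521/service")
cur = conn.cursor()

df = pd.read_excel("test.xls", sheet_name=0)

# Coerce the two date columns explicitly so every row has a consistent type.
for col in ["Open Date", "Matter Status Date"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

# Replace NaN/NaT with None so Oracle receives NULLs instead of floats.
df = df.where(pd.notnull(df), None)

rows = list(df.itertuples(index=False, name=None))
# The INSERT below is a placeholder; the bind count must match the column count.
cur.executemany("INSERT INTO MATTERS VALUES (:1, :2, :3)", rows, batcherrors=True)
conn.commit()
Coercing the date columns with pd.to_datetime and turning missing values into None gives every row a consistent type, which is usually what trips up executemany.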
I have a 10 GB CSV file that contains some information that I need to use.
As I have limited memory on my PC, I cannot read the whole file into memory in one single batch. Instead, I would like to iteratively read only some rows of this file.
Say that at the first iteration I want to read the first 100 rows, at the second rows 101 to 200, and so on.
Is there an efficient way to perform this task in Python?
Does pandas provide something useful for this? Or are there better methods (in terms of memory and speed)?
Here is the short answer.
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
Here is the very long answer.
To get started, you’ll need to import pandas and sqlalchemy. The commands below will do that.
import pandas as pd
from sqlalchemy import create_engine
Next, set up a variable that points to your csv file. This isn’t necessary but it does help in re-usability.
file = '/path/to/csv/file'
With these three lines of code, we are ready to start analyzing our data. Let’s take a look at the ‘head’ of the csv file to see what the contents might look like.
print(pd.read_csv(file, nrows=5))
This command uses pandas’ “read_csv” command to read in only 5 rows (nrows=5) and then print those rows to the screen. This lets you understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work.
Before we can actually work with the data, we need to do something with it so we can begin to filter it and work with subsets. This is usually what I would use a pandas dataframe for, but with large data files we need to store the data somewhere else. In this case, we'll set up a local SQLite database, read the csv file in chunks, and then write those chunks to SQLite.
To do this, we'll first need to create the SQLite database using the following command.
csv_database = create_engine('sqlite:///csv_database.db')
Next, we need to iterate through the CSV file in chunks and store the data into sqllite.
chunksize = 100000
i = 0
j = 1
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1
With this code, we are setting the chunksize at 100,000 to keep the size of the chunks manageable, initializing a couple of counters (i=0, j=1), and then running a for loop. The for loop reads a chunk of data from the CSV file, removes spaces from the column names, and then stores the chunk in the SQLite database (df.to_sql(…)).
This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas ‘sql’ tools to pull data from the database without worrying about memory constraints.
To access the data now, you can run commands like the following:
df = pd.read_sql_query('SELECT * FROM table', csv_database)
Of course, using ‘select *…’ will load all the data into memory, which is the problem we are trying to get away from, so you should add filters to your select statements to filter the data. For example:
df = pd.read_sql_query('SELECT COl1, COL2 FROM table where COL1 = SOMEVALUE', csv_database)
You can use pandas.read_csv() with the chunksize parameter:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # each chunk_df contains a part of the whole CSV
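For instance, here is a hedged sketch that filters each chunk and stitches the survivors back together; the file name and the filter column are placeholders:
import pandas as pd

filtered_parts = []
for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # Keep only the rows of interest from this chunk (column name is illustrative).
    filtered_parts.append(chunk_df[chunk_df['some_column'] > 0])

# Only the small filtered pieces have to fit in memory at the end.
result = pd.concat(filtered_parts, ignore_index=True)
print(result.shape)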
This code may help you with this task. It navigates through a large .csv file and does not consume lots of memory, so you can do this on a standard laptop.
import pandas as pd
import os

# chunksize2 sets the number of rows within the csv file you want to read per chunk
chunksize2 = 2000
path = './'

data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1")
df2 = data2.get_chunk(chunksize2)
headers = list(df2.keys())
del data2

start_chunk = 0
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1",
                    skiprows=chunksize2 * start_chunk)
headers = []

for i, df2 in enumerate(data2):
    try:
        print('reading csv....')
        print(df2)
        print('header: ', list(df2.keys()))
        print('our header: ', headers)

        # Access chunks within data
        # for chunk in data:

        # You can now export all outcomes in new csv files
        file_name = 'export_csv_' + str(start_chunk + i) + '.csv'
        save_path = os.path.abspath(os.path.join(path, file_name))
        print('saving ...')
        df2.to_csv(save_path)  # write the current chunk to its own csv file
    except Exception:
        print('reached the end')
        break
The method of transferring a huge CSV into a database is good because we can easily use SQL queries. We also have to take two things into account.
FIRST POINT: SQL is not made of rubber either; it cannot stretch your memory.
For example, consider this file converted to a db file:
https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
For this db file, the SQL query:
pd.read_sql_query("SELECT * FROM 'table' LIMIT 600000", Mydatabase)
can read at most about 0.6 million records, no more, on a PC with 16 GB of RAM (operation time 15.8 seconds).
It might be a bit cheeky to add that reading directly from the csv file is slightly more efficient:
giga_plik = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
Abdul = pd.read_csv(giga_plik, nrows=1100000)
(operation time 16.5 seconds)
SECOND POINT: To effectively use SQL on data series converted from CSV, we have to remember to store dates in a suitable form. So I propose adding this to ryguy72's code:
df['ColumnWithQuasiDate'] = pd.to_datetime(df['Date'])
The full code for the 311 file mentioned above:
import time
import pandas as pd
from sqlalchemy import create_engine

start_time = time.time()
### sqlalchemy create_engine
plikcsv = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
WM_csv_datab7 = create_engine('sqlite:///C:/1/WM_csv_db77.db')
# ----------------------------------------------------------------------
chunksize = 100000
i = 0
j = 1
## --------------------------------------------------------------------
for df in pd.read_csv(plikcsv, chunksize=chunksize, iterator=True, encoding='utf-8', low_memory=False):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    ## -----------------------------------------------------------------------
    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])  # to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])
    ## --------------------------------------------------------------------------
    df.index += j
    i += 1
    df.to_sql('table', WM_csv_datab7, if_exists='append')
    j = df.index[-1] + 1

print(time.time() - start_time)
Finally, I would like to add: converting a csv file directly from the Internet to a db seems like a bad idea to me. I propose downloading the data first and converting it locally.
I get the error 'TypeError: 'TextFileReader' object does not support item assignment' when I try to add columns and modify header names, etc. in chunks.
My issue is that I am using a slow work laptop to process a pretty large file (10 million rows). I want to add some simple columns (1 or 0 values), concatenate two columns to create a unique ID, change the dtype of other columns, and rename some headers so they match other files that I will .merge later. I could probably split this csv (maybe by selecting date ranges and making separate files), but I would like to learn how to use chunksize or deal with large files in general without running into memory issues. Is it possible to modify a file in chunks and then concatenate them all together later?
I am doing a raw data clean up which will then be loaded into Tableau for visualization.
Example (reading/modifying 10 million rows):
rep = pd.read_csv(r'C:\repeats.csv.gz',
                  compression='gzip', parse_dates=True,
                  usecols=['etc', 'stuff', 'others', '...'])
rep.sort()
rep['Total_Repeats'] = 1
rep.rename(columns={'X': 'Y'}, inplace=True)
rep.rename(columns={'Z': 'A'}, inplace=True)
rep.rename(columns={'B': 'C'}, inplace=True)
rep['D'] = rep['E'] + rep['C']
rep.rename(columns={'L': 'M'}, inplace=True)
rep.rename(columns={'N': 'O'}, inplace=True)
rep.rename(columns={'S': 'T'}, inplace=True)
If you pass the chunksize keyword to pd.read_csv, it returns an iterator over the csv file, and you can write the processed chunks with the to_csv method in append mode. You will be able to process a large file this way, but you can't sort the dataframe as a whole.
import pandas as pd

reader = pd.read_csv(r'C:\repeats.csv.gz',
                     compression='gzip', parse_dates=True, chunksize=10000,
                     usecols=['etc', 'stuff', 'others', '...'])

output_path = 'output.csv'
for chunk_df in reader:
    chunk_result = do_somthing_with(chunk_df)
    chunk_result.to_csv(output_path, mode='a', header=False)
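As a hedged sketch, do_somthing_with could perform the edits described in the question; the column names below are placeholders taken from the example:
def do_somthing_with(chunk_df):
    # Rename headers so they line up with the other files for the later merge.
    chunk_df = chunk_df.rename(columns={'X': 'Y', 'Z': 'A', 'B': 'C'})
    # Add a constant flag column and build a unique ID from two existing columns.
    chunk_df['Total_Repeats'] = 1
    chunk_df['D'] = chunk_df['E'].astype(str) + chunk_df['C'].astype(str)
    return chunk_df
Note that with header=False no header row is ever written; a common tweak is to write the header only for the first chunk.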
Python's usually pretty good with that, as long as you avoid calling .read() on the whole thing when looking at large files.
If you just use the iterators, you should be fine:
with open('mybiginputfile.txt', 'rt') as in_file:
    with open('mybigoutputfile.txt', 'wt') as out_file:
        for row in in_file:
            'do something'
            out_file.write(row)
Someone who knows more will explain how the memory side of it works, but this works for me on multi-GB files without crashing Python.
You might want to chuck the data into a proper DB before killing your laptop with the task of serving up the data AND running Tableau too!