I am using the code below to read a tab-delimited file into a Postgres database:
import pandas as pd
import smart_open
from sqlalchemy import create_engine

enginestring = cfg.dbuser + ":" + cfg.dbpwd + "@" + server.local_bind_host + ":" + str(server.local_bind_port) + "/" + cfg.dbname
engine = create_engine('postgresql://' + enginestring)

rows = []
for line in smart_open.smart_open(key):
    ln = line.decode('utf-8')
    fields = ln.split('\t')
    rows.append(fields)

df = pd.DataFrame(rows, columns=cfg.df_colheaders)
print(df.head)  # note: df.head without parentheses prints the bound method, which still shows the frame's shape
df.to_sql(name=table_name, con=engine, if_exists='append')
The call to print shows the DataFrame I expect (i.e. [798624 rows x 133 columns]) and the call to to_sql doesn't fail, yet in the database I see only one row of data with the correct columns (the same result whether or not the table already exists).
OK, here is an update:
I solved the single-row issue by stripping the EOL characters (I could see a ¶ at the end of the last inserted field).
Then I was just getting empty tables, so I added the chunksize parameter to to_sql. I'm not sure why it proceeded silently instead of failing, but it works now.
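In other words, the loading loop now looks roughly like this (the chunk size is an arbitrary choice):

rows = []
for line in smart_open.smart_open(key):
    # strip trailing CR/LF so the last field of each row doesn't carry an EOL character
    ln = line.decode('utf-8').rstrip('\r\n')
    rows.append(ln.split('\t'))

df = pd.DataFrame(rows, columns=cfg.df_colheaders)
# write in batches instead of one giant INSERT
df.to_sql(name=table_name, con=engine, if_exists='append', chunksize=10000)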
Related
I want to write a function in Python to convert a CSV file to SQLite. The CSV has 4 columns: Setting, State, Comment, and Path. Sometimes the real path value is in the next column, or two columns over, rather than always being in the Path column.
import csv
import re
import sqlite3

def csv_to_sqlite(csv_file, sqlite_file):
    # Connect to the SQLite database
    connection = sqlite3.connect(sqlite_file)
    cursor = connection.cursor()
    # Read the CSV file
    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        headers = next(reader)
        # Create the table in the SQLite database
        cursor.execute(f'CREATE TABLE data ({", ".join(headers)})')
        # Get the index of the "Path" column
        path_index = headers.index("Path")
        # Insert the data from the CSV file into the SQLite database
        for row in reader:
            modified_row = row.copy()
            # Check if the "Path" column starts with '\'
            if re.match(r'^\\', modified_row[path_index]):
                cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})', modified_row)
            else:
                # Search for the first column that starts with '\'
                for i in range(path_index + 1, len(headers)):
                    if re.match(r'^\\', modified_row[i]):
                        modified_row[path_index] = modified_row[i]
                        cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})', modified_row)
                        break
    # Commit the changes and close the connection
    connection.commit()
    connection.close()
But I get this error:
cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})', modified_row)
sqlite3.ProgrammingError: Incorrect number of bindings supplied. The current statement uses 4, and there are 5 supplied.
I expect to get a database that matches the CSV, without the error.
Edit:
I tried to solve this problem with pandas:
df = pd.read_csv(file_path, sep=',', encoding='cp1252')
and I get this error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 18, saw 5
This is my data, and an example of the problem (screenshots omitted).
Try checking which encoding the file needs when reading it with pandas. After that, you can skip the lines that raise errors. Good luck.
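A rough sketch of that suggestion, assuming pandas 1.3+ (older versions use error_bad_lines=False instead of on_bad_lines):

import pandas as pd

# skip rows whose field count doesn't match the header instead of raising ParserError
df = pd.read_csv(file_path, sep=',', encoding='cp1252', on_bad_lines='skip')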
The error is evidence that the current row has 5 columns while the header row only has 4. You should ignore the excess columns by limiting the length of the row that you use:
cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})',
               modified_row[:len(headers)])
The issue is probably that the number of values in modified_row differs from the number of columns in data. This happens when the CSV row itself contains more fields than there are headers, i.e. when the real path has spilled into an extra column.
You can try passing only the values for the columns that exist in data.
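A sketch of the insert loop with that fix applied (same names as in the question's function; note the inner search runs to the end of the row rather than to len(headers), since the spilled path sits beyond the header count, and every row is inserted):

placeholders = ", ".join("?" for _ in headers)
for row in reader:
    modified_row = row.copy()
    # if the Path column doesn't already hold a path, pull it from a later column
    if not re.match(r'^\\', modified_row[path_index]):
        for i in range(path_index + 1, len(modified_row)):
            if re.match(r'^\\', modified_row[i]):
                modified_row[path_index] = modified_row[i]
                break
    # bind only as many values as the table has columns
    cursor.execute(f'INSERT INTO data VALUES ({placeholders})', modified_row[:len(headers)])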
I have a CSV file with 2000 rows and 3 columns of type Int, String, and String respectively. I'm trying to write a program that creates a table and appends my CSV file's rows and columns into the table. It all seems to work, except for a single item: the item's first and third columns are appended, but the second one is null.
No errors are shown on the console, and I have tried printing the data to the console before calling to_sql; it shows the one missing item just fine. My CSV file uses a tab delimiter. That item was at first not correctly formatted, so after delimiting all of my rows I had to specifically delimit that single row by itself; I'm thinking this could have caused the problem? All of the other rows and columns are perfectly fine.
Here's my code:
import sqlite3
import pandas as pd
from pandas import DataFrame
connection = sqlite3.connect('data.sqlite')
c = connection.cursor()
c.execute('''CREATE TABLE test(val1 int, val2 varchar(255), val3 varchar(255))''')
connection.commit()
col_names = ["val1", "val2", "val3"]
read_clients = pd.read_csv(r'thefile.csv', encoding='utf-16', names=col_names, sep='\t')
read_clients.to_sql('test', connection, if_exists='append', index=False, schema='test')
connection.commit()
No messages are printed to the console.
So the value of the string that was missing was "NA". pandas apparently treats this as a null value by default! Here's how I fixed it:
read_clients = pd.read_csv(r'thefile.csv', encoding='utf-16', names=col_names, sep='\t', na_filter=False)
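This is pandas' default NA handling rather than anything SQLite-specific: strings such as "NA" and "N/A" are converted to NaN on read unless na_filter (or na_values/keep_default_na) is adjusted. A quick illustration with a small inline CSV:

import io
import pandas as pd

csv_text = "val1\tval2\tval3\n1\tNA\tfoo\n"
print(pd.read_csv(io.StringIO(csv_text), sep='\t'))                   # val2 becomes NaN
print(pd.read_csv(io.StringIO(csv_text), sep='\t', na_filter=False))  # val2 stays "NA"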
I am trying to insert an xls file into an Oracle table using cx_Oracle. Below is how I am trying to achieve it:
import re
import xlrd
from datetime import datetime
from xlrd import open_workbook

wb = open_workbook('test.xls')
values = []
sheets = wb.sheet_names()
xl_sheet = wb.sheet_by_name(s)
sql_str = preparesql('MATTERS')  # this is a function I have created which returns the INSERT statement used to load the table
for row in range(1, xl_sheet.nrows):
    col_names = xl_sheet.row(0)
    col_value = []
    for name, col in zip(col_names, range(xl_sheet.ncols)):
        searchObj = re.search(r"Open Date|Matter Status Date", name.value, re.M | re.I)
        if searchObj:
            if xl_sheet.cell(row, col).value == '':
                value = ''
            else:
                value = datetime(*xlrd.xldate_as_tuple(xl_sheet.cell(row, col).value, wb.datemode))
                value = value.strftime('%d-%b-%Y ')
        else:
            value = xl_sheet.cell(row, col).value
        col_value.append(value)
    values.append(col_value)
cur.executemany(sql_str, values, batcherrors=True)  # cur is a cx_Oracle cursor created elsewhere
But when I tested it against multiple xls files, some of them threw a TypeError (I can't share the data due to restrictions from the client). I feel the issue is related to the dtypes of the columns in Excel compared to the DB. Is there any way I can match the dtypes of the values list above to the datatypes of the columns in the DB, or are there any other ways to get the insert done? I tried using DataFrame.to_sql, but it takes a lot of time. I am able to insert the same data by looping through the rows in the values list.
I suggest you import the data into a pandas DataFrame; then it becomes very easy to work with data types. You can change the data type of a whole column and then insert easily.
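A rough sketch of that approach (assuming pandas with xlrd installed for .xls files; the sheet name and the 'Matter Number' column are hypothetical, the date columns come from the question's regex, and dates are bound as datetime objects, which cx_Oracle accepts directly):

import pandas as pd

df = pd.read_excel('test.xls', sheet_name='MATTERS')  # hypothetical sheet name

# coerce columns to the types the Oracle table expects
for date_col in ['Open Date', 'Matter Status Date']:
    df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
df['Matter Number'] = pd.to_numeric(df['Matter Number'], errors='coerce')  # hypothetical numeric column

# cx_Oracle binds None for missing values, so replace NaN/NaT before inserting
rows = [tuple(None if pd.isna(v) else v for v in rec) for rec in df.itertuples(index=False, name=None)]
cur.executemany(sql_str, rows, batcherrors=True)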
I have a process which reads from 4 databases with 4 tables each. I am consolidating that data into 1 Postgres database with 4 tables total. (Each of the original 4 databases has the same 4 tables, which need to be consolidated.)
The way I am doing it now works, using pandas. I read one table from all 4 databases at a time, concatenate the data into one DataFrame, then use to_sql to save it to my Postgres database. I then loop through the remaining tables, doing the same thing for each.
My issue is speed. One of my tables has about 1-2 million rows per date, so it can take about 5,000-6,000 seconds to finish writing the data to Postgres. It is much quicker to write it to a .csv file and then use COPY FROM in pgAdmin.
Here is my current code. Note that there are some function calls, but they basically just refer to the table names. I also have some basic logging, but that is not too important. I am adding a column for the source database, which is required. I strip '.0' from fields which are actually strings but which pandas sees as floats, and I fill blank integers with 0 and make sure those columns are really of type int.
def query_database(table, table_name, query_date):
    df_list = []
    log_list = []
    for db in ['NJ', 'NJ2', 'LA', 'NA']:
        start_time = time.clock()
        query_timestamp = dt.datetime.now(pytz.timezone('UTC')).strftime('%Y-%m-%d %H:%M:%S')
        engine_name = '{}{}{}{}'.format(connection_type, server_name, '/', db)
        print('Accessing {} from {}'.format((select_database(db)[0][table]), engine_name))
        engine = create_engine(engine_name)
        df = pd.read_sql_query(query.format(select_database(db)[0][table]), engine, params={query_date})
        query_end = time.clock() - start_time
        df['source_database'] = db
        df['insert_date_utc'] = query_timestamp
        df['row_count'] = df.shape[0]
        df['column_count'] = df.shape[1]
        df['query_time'] = round(query_end, 0)
        df['maximum_id'] = df['Id'].max()
        df['minimum_id'] = df['Id'].min()
        df['source_table'] = table_dict.get(table)
        log = df[['insert_date_utc', 'row_date', 'source_database', 'source_table', 'row_count', 'column_count', 'query_time', 'maximum_id', 'minimum_id']].copy()
        df.drop(['row_count', 'column_count', 'query_time', 'maximum_id', 'minimum_id', 'source_table'], inplace=True, axis=1)
        df_list.append(df)
        log_list.append(log)
    log = pd.concat(log_list)
    log.drop_duplicates(subset=['row_date', 'source_database', 'source_table'], inplace=True, keep='last')
    result = pd.concat(df_list)
    result.drop_duplicates('Id', inplace=True)
    cols = [i.strip() for i in (create_columns(select_database(db)[0][table]))]
    result = result[cols]
    print('Creating string columns for {}'.format(table_name))
    for col in modify_str_cols(select_database(db)[0][table]):
        create_string(result, col)
    print('Creating integer columns for {}'.format(table_name))
    for col in modify_int_cols(select_database(db)[0][table]):
        create_int(result, col)
    log.to_sql('raw_query_log', cms_dtypes.pg_engine, index=False, if_exists='append', dtype=cms_dtypes.log_dtypes)
    print('Inserting {} data into PostgreSQL'.format(table_name))
    result.to_sql(create_table(select_database(db)[0][table]), cms_dtypes.pg_engine, index=False, if_exists='append', chunksize=50000, dtype=create_dtypes(select_database(db)[0][table]))
How can I insert a COPY TO and COPY FROM into this to speed it up? Should I just write the .csv files and then loop over those or can I COPY from memory to my postgres?
psycopg2 offers a number of copy-related APIs. If you want to use CSV, you have to use copy_expert (which allows you to specify a full COPY statement).
Normally when I have done this, I have used copy_expert() and a file-like object which iterates through a file on disk. That seems to work reasonably well.
That being said, in your case I think copy_to and copy_from are better matches, because this is simply a Postgres-to-Postgres transfer. Note that these use PostgreSQL's copy output/input syntax, not CSV (if you want CSV, you have to use copy_expert).
Before you decide how to do things, note:
copy_to copies to a file-like object (such as a StringIO), and copy_from/copy_expert read from a file-like object. If you want to use a pandas DataFrame, you are going to have to think about this a little: either create a file-like object, or use CSV together with StringIO and copy_expert to generate an in-memory CSV and load that.
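A minimal sketch of the in-memory CSV route described above (assuming the target table already exists with columns matching the DataFrame; the table name and connection string are placeholders):

import io

import psycopg2

def copy_dataframe(df, table_name, conn):
    # dump the DataFrame to an in-memory CSV
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # stream it into Postgres with COPY ... FROM STDIN
    with conn.cursor() as cur:
        cur.copy_expert(f"COPY {table_name} FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()

conn = psycopg2.connect("dbname=target user=postgres")  # placeholder connection string
copy_dataframe(result, 'my_table', conn)                # 'my_table' is a placeholder table name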
I'm trying to upload a decent amount of data (~300k rows) to a database using pyodbc. Since the data will need to be updated quarterly (from spreadsheets), what I'm trying to create is a sort of dynamic insert statement to make things a bit more streamlined. My thought process is that I can name the header of each column in the spreadsheets the same as the column in the table where I want the respective data to be uploaded.
What I've attempted to do is write a script that pulls the column names from the header row of the worksheet and uses them as the variables, eliminating the need for me to change any code when uploading different data from different Excel sheets back to back:
import xlrd
import numpy as np
import pyodbc

wb = xlrd.open_workbook(r"FILE")
worksheet = wb.sheet_by_name('SHEET_NAME')
num_rows = worksheet.nrows
num_cols = worksheet.ncols
header_row = 0
header_list = [worksheet.row_values(header_row)]
header_values = ", ".join([str(i) for i in [cell.value for cell in worksheet.row(header_row)]])
question_marks = ",?" * (num_cols - 1)
cell = worksheet.cell(1, 1)

arr = []
for rowind in range(num_rows)[(header_row + 1):]:
    arr.append([cell.value for cell in worksheet.row(rowind)])
data = np.rec.fromrecords(arr, names=header_values)

cnxn = pyodbc.connect(r"DRIVER={SQL Server};SERVER=XXXXXXXXXX\DEV_CON1;DATABASE=GGG;UID=AAA_dbo;PWD=XXXXXXXXX;")
cursor = cnxn.cursor()
populate_db = "insert into tblSnap_TEST(" + header_values + ") values (?" + question_marks + ")"

for i in header_list:
    i = data[i]
values = header_list
cursor.execute(populate_db, values)

cursor.close()
cnxn.commit()
cnxn.close()
When I attempt to run the script I get the following error message:
File "<ipython-input-45-6809dc4a27ac>", line 1, in <module>
runfile('H:/My Documents/Python/string_search_test.py', wdir='H:/My Documents/Python')
File "C:\Users\xxxx\xxx\xx\x\Anaconda\lib\site- packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "H:/My Documents/Python/string_search_test.py", line 67, in <module>
cursor.execute(populate_db,values)
ProgrammingError: ('The SQL contains 21 parameter markers, but 1 parameters were supplied', 'HY000')
The way I've done this before is by explicitly defining the values to be passed, such as:
account_numbers = (sheet.cell(row_index, 1).value)
But like I said, what I'm trying to do here is make it so I wouldn't have to type that out. That is what I was attempting to do with i = data[i]. I'm hoping there's a way for Python to recognize that "account_numbers" is in a list (created from the worksheet's headers), then grab the corresponding data based on something similar to i = data[i] which I tried above. Is a solution like this possible? data[i] does return the right data I want to insert into the table for each column, but it's not recognized by the execute statement.
If you're not dealing with large Excel worksheets or any problematic data types (such as those described in the section Dates in Excel spreadsheets), you can simplify this by reading all rows into a single list, popping the header values off for the insert columns, then calling Cursor.executemany once to insert all values from the spreadsheet, passing a sequence of sequences for the parameter values.
I've removed the numpy array population, since it's not necessary in the snippet provided.
header_row = 0
# build list of lists that represents row values in worksheet,
# including column names from header row
rows = [worksheet.row_values(row) for row in range(worksheet.nrows)]
# extract list of column names to use for insert statement, values to be inserted remain
columns = rows.pop(header_row)
cnxn = pyodbc.connect(r"DRIVER={SQL Server};SERVER=XXXXXXXXXX\DEV_CON1;DATABASE=GGG;UID=AAA_dbo;PWD=XXXXXXXXX;")
cursor = cnxn.cursor()
# list of column names used to build SQL statement, including parameter placeholders (?)
populate_db = "insert into tblSnap_TEST ({}) values ({})".format(', '.join(columns),
', '.join('?' * len(columns)))
# insert is executed once for each sequence of parameter values
cursor.executemany(populate_db, rows)
cnxn.commit()
cnxn.close()
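As a side note, if the insert itself turns out to be slow, recent pyodbc versions (4.0.19 and later) support a fast path for executemany against SQL Server:

cursor.fast_executemany = True  # set before calling executemany to send parameters in bulk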
Maybe:
cursor.execute(populate_db, *values)
This is known as unpacking a list (or other iterable).
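For illustration, a tiny sketch of what the * does (the function and values here are made up):

def insert_row(a, b, c):
    print(a, b, c)

params = [1, 2, 3]
insert_row(*params)  # equivalent to insert_row(1, 2, 3)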