Inserting xls data into Oracle DB using cx_Oracle executemany - python

I am trying to insert an xls file into an Oracle table using cx_Oracle. Below is how I am trying to achieve it.
wb = open_workbook('test.xls')
values = []
sheets = wb.sheet_names()
xl_sheet = wb.sheet_by_name(s)
sql_str = preparesql('MATTERS')  # function I have created which returns the INSERT statement I am using to load the table
for row in range(1, xl_sheet.nrows):
    col_names = xl_sheet.row(0)
    col_value = []
    for name, col in zip(col_names, range(xl_sheet.ncols)):
        searchObj = re.search(r"Open Date|Matter Status Date", name.value, re.M | re.I)
        if searchObj:
            if xl_sheet.cell(row, col).value == '':
                value = ''
            else:
                value = datetime(*xlrd.xldate_as_tuple(xl_sheet.cell(row, col).value, wb.datemode))
                value = value.strftime('%d-%b-%Y ')
        else:
            value = xl_sheet.cell(row, col).value
        col_value.append(value)
    values.append(col_value)
cur.executemany(sql_str, values, batcherrors=True)
But when I tested it against multiple xls files, it threw a TypeError for some of them. I can't share the data due to restrictions from the client. I feel the issue is related to the dtype of the columns in Excel compared to the DB. Is there any way I can match the dtypes of the values list above to the datatypes of the columns in the DB, or are there other ways to get the insert done? I tried using dataframe.to_sql, but it is taking a lot of time. I am able to insert the same data by looping through the rows of the values list.

I suggest you import the data into a pandas DataFrame; then it becomes very easy to work with the data types. You can change the data type of a whole column and then insert easily.
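A minimal sketch of that route (assuming the two date columns from the question should be loaded as datetimes and everything else as strings; the sheet selection and connection string are placeholders):
import pandas as pd
import cx_Oracle

df = pd.read_excel('test.xls', sheet_name=0)  # placeholder sheet selection

# Coerce dtypes so the bound values match the table definition.
for col in ('Open Date', 'Matter Status Date'):
    df[col] = pd.to_datetime(df[col], errors='coerce')
text_cols = df.columns.difference(['Open Date', 'Matter Status Date'])
df[text_cols] = df[text_cols].astype(str)

# Replace NaN/NaT with None so cx_Oracle binds them as NULL.
rows = [tuple(None if pd.isna(v) else v for v in row)
        for row in df.itertuples(index=False, name=None)]

conn = cx_Oracle.connect('user/password@host/service')  # placeholder DSN
cur = conn.cursor()
cur.executemany(sql_str, rows, batcherrors=True)  # sql_str from preparesql('MATTERS')
conn.commit()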

Related

convert csv to sqlite but i don't know if column valid

I want to write a function in Python to convert a CSV file to SQLite. In the CSV I have 4 columns: Setting, State, Comment and Path. Sometimes the real path is in the next column or the one after that, not always in the Path column.
def csv_to_sqlite(csv_file, sqlite_file):
    # Connect to the SQLite database
    connection = sqlite3.connect(sqlite_file)
    cursor = connection.cursor()
    # Read the CSV file
    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        headers = next(reader)
        # Create the table in the SQLite database
        cursor.execute(f'CREATE TABLE data ({", ".join(headers)})')
        # Get the index of the "Path" column
        path_index = headers.index("Path")
        # Insert the data from the CSV file into the SQLite database
        for row in reader:
            modified_row = row.copy()
            # Check if the "Path" column starts with '\'
            if re.match(r'^\\', modified_row[path_index]):
                cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})', modified_row)
            else:
                # Search for the first column that starts with '\'
                for i in range(path_index + 1, len(headers)):
                    if re.match(r'^\\', modified_row[i]):
                        modified_row[path_index] = modified_row[i]
                        cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})',
                                       modified_row)
                        break
    # Commit the changes and close the connection
    connection.commit()
    connection.close()
but I get this error:
cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})', modified_row)
sqlite3.ProgrammingError: Incorrect number of bindings supplied. The current statement uses 4, and there are 5 supplied.
I expect to get a database that matches the CSV, not an error.
Edit:
I tried to solve this problem with pandas:
df = pd.read_csv(file_path, sep=',', encoding='cp1252')
and I get the error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 18, saw 5
(Screenshots of my data and of an example of the problem omitted.)
Check which file encoding to use with pandas. After that, you can skip the lines that raise errors. Good luck.
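For example, a quick sketch (assumes pandas 1.3 or newer for the on_bad_lines option):
import pandas as pd

# Read with the cp1252 encoding and silently skip rows that have more
# fields than the header declares (pandas >= 1.3).
df = pd.read_csv(file_path, sep=',', encoding='cp1252', on_bad_lines='skip')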
The error is evidence that the current row has 5 columns while the header row only has 4. You should ignore the excess columns by limiting the used length of the row:
cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})',
               modified_row[:len(headers)])
The issue is probably that the number of values in modified_row differs from the number of columns in data. This is likely because extra values remain in modified_row when the code searches for the first column that starts with '\'.
You can try to include only the values for the columns in data.
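A minimal sketch of what the corrected loop could look like (it reuses headers, reader, path_index and cursor from the question, and searches the whole row rather than only the first len(headers) columns):
placeholders = ", ".join("?" for _ in headers)
for row in reader:
    modified_row = row.copy()
    if not re.match(r'^\\', modified_row[path_index]):
        # The path was shifted right: copy the first value that looks like a path.
        for i in range(path_index + 1, len(modified_row)):
            if re.match(r'^\\', modified_row[i]):
                modified_row[path_index] = modified_row[i]
                break
    # Bind only as many values as there are declared columns.
    cursor.execute(f'INSERT INTO data VALUES ({placeholders})',
                   modified_row[:len(headers)])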

Can't read data simultaneously from the first three columns of a google sheet

I'm trying to read data from the first three columns of a google sheet using gspread. The three columns that I'm interested in are ID, NAME and SYMBOL. The name of the sheet is testFile.
If I try like the following, I can get data from the first column.
client = authenticate()
sh = client.open("testFile")
worksheet = sh.get_worksheet(0)
for item in worksheet.col_values(1):
    print(item)
However, I wish to read the data from the three columns simultaneously. It would be better if I could read the values using the column headers.
I know I could try the following to get the values, but that way I will end up with quotaExceeded errors or something similar because of how slow it is.
for i in range(1, worksheet.row_count + 1):
    row = worksheet.row_values(i)
    if row[0] == 'ID':
        continue
    print(row[0], row[1], row[2])
How can I read data from the first three columns of a google sheet?
In your situation, how about the following modification?
From:
for item in worksheet.col_values(1):
    print(item)
To:
values = worksheet.get_all_values()
obj = {}
for e in zip(*values):
    obj[e[0]] = list(e[1:])
print(obj)
In this modification, all values are retrieved with one API call, and the retrieved values are converted to a dict keyed by the header titles.
For example, when your sample Spreadsheet is used, you can retrieve the values of column "NAME" by obj["NAME"].
Or,
To:
values = worksheet.get_all_records()
res = [e["NAME"] for e in values] # or e["ID"] or e["SYMBOL"]
print(res)
In this modification, the values of column "NAME" are retrieved.
References:
get_all_values
get_all_records
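As a short usage sketch (assuming the header row contains ID, NAME and SYMBOL), the three columns can then be walked together after a single API call:
records = worksheet.get_all_records()  # one API call, list of dicts keyed by header
for rec in records:
    print(rec["ID"], rec["NAME"], rec["SYMBOL"])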

Transfer data from excel worksheet (openpyxl) to database table (dbf)

I have a simple problem: read an Excel worksheet, treat every row (containing about 83 columns) as a unique database record, add it to a local datum record and ultimately append and write it to a DBF file.
I can extract all the values from Excel and add them to a list, but the list is not in the correct form and I don't know how to prepare/convert it into a database record. I am using openpyxl, dbf and Python 3.7.
At the moment I am only testing and trying to prepare the data for row 3 (hence min_row and max_row are both 3).
I understand that the data should be in the format
(('', '', '', ... 83 entries),
 ('', '', '', ... 83 entries)
)
but I do not know how to convert the list data into a record, or, alternatively, how to read the Excel data directly into a DBF-appendable format.
tbl_tst.open(mode=dbf.READ_WRITE)  # all fields character string
for everyrow in ws_IntMstDBF.iter_rows(min_row=3, max_row=3, max_col=ws_IntMstDBF.max_column - 1):
    datum = []  # set([83]), will defining datum as () help solve the problem?
    for idx, cells in enumerate(everyrow):
        if cells.value is None:  # for None entries, enter empty string
            datum.append("")
            continue
        datum.append(cells.value)  # else enter cell values
    tbl_tst.append(datum)  # append that record to table !!! "list is not record" error here
tbl_tst.close()
The error complains that a list is being used to append to the table when it should be a record, etc. Please advise how I can convert Excel rows into appendable DBF table data.
raise TypeError("data to append must be a tuple, dict, record, or template; not a %r" % type(data))
TypeError: data to append must be a tuple, dict, record, or template; not a <class 'list'>
Change
tbl_tst.append(datum)
to
tbl_tst.append(tuple(datum))
and that will get rid of that error. As long as all your cell data has the appropriate type, the append should work.
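With that one change applied, a sketch of the loop from the question (same variable names) would be:
tbl_tst.open(mode=dbf.READ_WRITE)
for everyrow in ws_IntMstDBF.iter_rows(min_row=3, max_row=3,
                                       max_col=ws_IntMstDBF.max_column - 1):
    datum = ["" if cell.value is None else cell.value for cell in everyrow]
    tbl_tst.append(tuple(datum))  # dbf accepts a tuple, dict, record, or template
tbl_tst.close()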
Thank you for the responses. I went off on a bit of a tangent since last night while trying different solutions.
One solution that worked for me is as follows:
I made sure that the worksheet data I am using is all strings/text, and converted any null entries to string type with an empty string. The following code does this:
# house keeping
for eachrow in ws_IntMstDBF.iter_rows(min_row=2, max_row=ws_IntMstDBF.max_row, max_col=ws_IntMstDBF.max_column):
    for idx, cells in enumerate(eachrow):
        if cells.value is None:  # change every Null cell type to String and put 0x20 (space)
            cells.data_type = 's'
            cells.value = " "
After writing the worksheet, I reopened it with a pandas DataFrame and verified that the contents were all string type and that no "nan" values remained in the DataFrame.
Then I used the df2dbf function from "Dani Arribas-Bel", modified it to suit the data I am working with, and converted to dbf.
The code that imports the DataFrame and converts it to dbf format is as follows:
abspath = Path(__file__).resolve() # resolve to relative path to absolute
rootpath = abspath.parents[3]  # root (my source file is 3 sub-directories deep)
xlspath = rootpath / 'sub-dir1' / 'sub-dir2' / 'sub-dir3' / 'test.xlsx'
# above code is only resolving file location, ignore
pd_Mst_df = pd.read_excel(xlspath)
#print(pd_Mst_df) # for debug
print("... Writing Master DBF file ")
df2dbf(pd_Mst_df, dbfpath) # dbf path is defined similar to pd_Mst path
The df2dbf function uses pysal to write the DataFrame in dbf format. I made some modifications to the code to detect the row length and character types, as follows:
import pandas as pd
import pysal as ps
import numpy as np

# code from function df2dbf
else:
    type2spec = {int: ('N', 20, 0),
                 np.int64: ('N', 20, 0),
                 float: ('N', 36, 15),
                 np.float64: ('N', 36, 15),
                 str: ('C', 200, 0)
                 }
    # types = [type(df[i].iloc[0]) for i in df.columns]
    types = [type('C') for i in range(0, len(df.columns))]  # treat every column as character; the loop variable is not needed
    specs = [type2spec[t] for t in types]
    db = ps.open(dbf_path, 'w')
# code continues from function df2dbf
The pandas DataFrame didn't require further modifications, as all source data was formatted correctly before being committed to the Excel file.
I will provide the link to pysal and df2dbf as soon as I find it on Stack Overflow.
Check out the Python pandas library.
To read the data from Excel into a pandas DataFrame, you could use pandas.read_excel.
Once the data is read into a pandas DataFrame, you can manipulate it and afterwards write it to a database using pandas.DataFrame.to_sql.
See also this explanation for dealing with database io.
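A minimal sketch of that route (the connection string and table name are placeholders; it assumes SQLAlchemy is available):
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_excel("test.xlsx")              # read the worksheet into a DataFrame
engine = create_engine("sqlite:///test.db")  # placeholder connection string
df.to_sql("int_mst", engine, if_exists="append", index=False)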

How to parse dataframes from an excel sheet with many tables (using Python, possibly Pandas)

I'm dealing with badly laid out excel sheets which I'm trying to parse and write into a database.
Every sheet can have multiple tables. The headers of the possible tables are known, but which tables will appear on any given sheet is not, nor is their exact location on the sheet (the tables don't align in a consistent way). I've added pictures of two possible sheet layouts to illustrate this: this layout has two tables, while this one has all the tables of the first, but not in the same location, plus an extra table.
What I do know:
All the possible table headers, so each individual table can be identified by its headers
Tables are separated by blank cells. They do not touch each other.
My question Is there a clean way to deal with this using some Python module such as pandas?
My current approach:
I'm currently converting to .csv and parsing each row. I split each row around blank cells, and process the first part of the row (should belong to the leftmost table). The remainder of the row is queued and later processed in the same manner. I then read this first_part and check whether or not it's a header row. If it is, I use it to identify which table I'm dealing with (this is stored in a global current_df). Subsequent rows which are not header rows are fed into this table (here I'm using pandas.DataFrame for my tables).
Code so far is below (mostly incomplete and untested, but it should convey the approach above):
class DFManager(object):  # keeps track of current table and its headers
    current_df = None
    current_headers = []

    def set_current_df(self, df, headers):
        self.current_headers = headers
        self.current_df = df


def split_row(row, separator):
    while row and row[0] == separator:
        row.pop(0)
    while row and row[-1] == separator:
        row.pop()
    if separator in row:
        split_index = row.index(separator)
        return row[:split_index], row[split_index:]
    else:
        return row, []


def process_df_row(row, dfmgr):
    df = df_with_header(row)  # returns the dataframe with these headers
    if df is None:  # is not a header row, add it to current df
        df = dfmgr.current_df
        add_row_to_df(row, df)
    else:
        dfmgr.set_current_df(df, row)


# this is passed the Excel sheet
def populate_dataframes(xl_sheet):
    dfmgr = DFManager()
    row_queue = Queue()
    for row in xl_sheet:
        row_queue.put(row)
    for row in iter(row_queue.get, None):
        if not row:
            continue
        first_part, remainder = split_row(row)
        row_queue.put(remainder)
        process_df_row(first_part, dfmgr)
This is such a specific situation that there is likely no "clean" way to do this with a ready-made module.
One way to do this might be to use the header information you already have to find the starting indices of each table, something like this solution (Python Pandas - Read csv file containing multiple tables), but with an offset in the column direction as well.
Once you have the starting position of each table, you'll want to determine the widths (either known a priori or discovered by reading until the next blank column) and read those columns into a dataframe until the end of the table.
The benefit of an index-based method rather than a queue based method is that you do not need to re-discover where the separator is in each row or keep track of which row fragments belong to which table. It is also agnostic to the presence of >2 tables per row.
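A rough sketch of that idea (KNOWN_HEADERS is a hypothetical dict mapping each table name to the tuple of its column headers; it assumes tables are bounded by blank cells):
import pandas as pd

raw = pd.read_excel("sheet.xlsx", header=None)  # whole sheet, no header inference

def find_headers(raw, known_headers):
    """Yield (table_name, row, col) wherever a known header row starts."""
    for r in range(raw.shape[0]):
        for c in range(raw.shape[1]):
            for name, header in known_headers.items():
                width = len(header)
                if tuple(raw.iloc[r, c:c + width]) == header:
                    yield name, r, c

def read_table(raw, row, col, width):
    """Collect rows below a header until the first fully blank row."""
    body = []
    r = row + 1
    while r < raw.shape[0] and not raw.iloc[r, col:col + width].isna().all():
        body.append(list(raw.iloc[r, col:col + width]))
        r += 1
    return pd.DataFrame(body, columns=list(raw.iloc[row, col:col + width]))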
I have written code to merge multiple tables that are separated vertically and share common headers in each table. I am assuming that unique header names do not end with a dot followed by an integer.
def clean(input_file, output_file):
    try:
        df = pd.read_csv(input_file, skiprows=[1, 1])
        df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)
        df = rename_duplicate_columns(df)
    except:
        df = []
        print("Error: File Not found\t", sys.exc_info()[0])
        exit
    udf = df.loc[:, ~df.columns.str.match(".*\.\d")]
    udf = udf.dropna(how='all')
    try:
        table_num = int(df.columns.values[-1].split('.')[-1])
        fdf = udf
        for i in range(1, table_num + 1):
            udfi = pd.DataFrame()
            udfi = df.loc[:, df.columns.str.endswith(f'.{i}')]
            udfi.rename(columns=lambda x: '.'.join(x.split('.')[:-1]), inplace=True)
            udfi = udfi.dropna(how='all')
            fdf = fdf.append(udfi, ignore_index=True)
        fdf.to_csv(output_file)
    except ValueError:
        print("File Contains only single Table")
        exit


def rename_duplicate_columns(df):
    cols = pd.Series(df.columns)
    for dup in df.columns.get_duplicates():
        cols[df.columns.get_loc(dup)] = [dup + '.' + str(d_idx) if d_idx != 0 else dup
                                         for d_idx in range(df.columns.get_loc(dup).sum())]
    df.columns = cols
    print(df.columns)
    return df


clean(input_file, output_file)

Assigning Values to Variables from List in Python 2.7

I'm trying to upload a decent amount (~300k rows) worth of data to a database using pyodbc. Since the data will need to be updated quarterly (from spreadsheets), what I'm trying to create is a sort of dynamic insert statement to make things a bit more streamlined. My thought process is that I can name the header of each column in the spreadsheets the same as the column in the table where I want the respective data to be uploaded.
What I've attempted to do is write a script that pulls out the column names in the header row on the worksheet and uses these as the variables, eliminating the need for me to change any code if I'm uploading different data from different excel sheets back to back:
import xlrd
import numpy as np
import pyodbc

wb = xlrd.open_workbook(r"FILE")
worksheet = wb.sheet_by_name('SHEET_NAME')
num_rows = worksheet.nrows
num_cols = worksheet.ncols
header_row = 0
header_list = [worksheet.row_values(header_row)]
header_values = ", ".join([str(i) for i in [cell.value for cell in worksheet.row(header_row)]])
question_marks = ",?" * (num_cols - 1)
cell = worksheet.cell(1, 1)
arr = []
for rowind in range(num_rows)[(header_row + 1):]:
    arr.append([cell.value for cell in worksheet.row(rowind)])
data = np.rec.fromrecords(arr, names=header_values)
cnxn = pyodbc.connect(r"DRIVER={SQL Server};SERVER=XXXXXXXXXX\DEV_CON1;DATABASE=GGG;UID=AAA_dbo;PWD=XXXXXXXXX;")
cursor = cnxn.cursor()
populate_db = "insert into tblSnap_TEST(" + header_values + ") values (?" + question_marks + ")"
for i in header_list:
    i = data[i]
values = header_list
cursor.execute(populate_db, values)
cursor.close
cnxn.commit
cnxn.close
When I attempt to run the script I get the following error message:
File "<ipython-input-45-6809dc4a27ac>", line 1, in <module>
runfile('H:/My Documents/Python/string_search_test.py', wdir='H:/My Documents/Python')
File "C:\Users\xxxx\xxx\xx\x\Anaconda\lib\site- packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "H:/My Documents/Python/string_search_test.py", line 67, in <module>
cursor.execute(populate_db,values)
ProgrammingError: ('The SQL contains 21 parameter markers, but 1 parameters were supplied', 'HY000')
The way I've done this before is by explicitly defining the values to be passed, such as:
account_numbers = (sheet.cell(row_index, 1).value)
But like I said, what I'm trying to do here is make it so I wouldn't have to type that out. That is what I was attempting to do with i = data[i]. I'm hoping there's a way for Python to recognize that "account_numbers" is in a list (created from the worksheet's headers), then grab the corresponding data based on something similar to i = data[i] which I tried above. Is a solution like this possible? data[i] does return the right data I want to insert into the table for each column, but it's not recognized by the execute statement.
If you're not dealing with large Excel worksheets, or any problematic data types (such as one described in the section Dates in Excel spreadsheets), you can simplify reading all rows into a single list, pop the header values for the insert columns, then call Cursor.executemany once to insert all values from the spreadsheet, passing a sequence of sequences for parameter values.
I've removed the numpy array population, since it's not necessary in the snippet provided.
header_row = 0
# build list of lists that represents row values in worksheet,
# including column names from header row
rows = [worksheet.row_values(row) for row in range(worksheet.nrows)]
# extract list of column names to use for insert statement, values to be inserted remain
columns = rows.pop(header_row)
cnxn = pyodbc.connect(r"DRIVER={SQL Server};SERVER=XXXXXXXXXX\DEV_CON1;DATABASE=GGG;UID=AAA_dbo;PWD=XXXXXXXXX;")
cursor = cnxn.cursor()
# list of column names used to build SQL statement, including parameter placeholders (?)
populate_db = "insert into tblSnap_TEST ({}) values ({})".format(', '.join(columns),
', '.join('?' * len(columns)))
# insert is executed once for each sequence of parameter values
cursor.executemany(populate_db, rows)
cnxn.commit()
cnxn.close()
Maybe:
cursor.execute(populate_db, *values)
This is known as unpacking a list (or other iterable).
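A tiny illustration with hypothetical values (pyodbc accepts either a single parameter sequence or individually unpacked parameters):
values = ["ACME", 42, "2015-06-30"]  # hypothetical row
sql = "insert into tblSnap_TEST (name, qty, snap_date) values (?, ?, ?)"  # hypothetical columns
cursor.execute(sql, *values)  # unpacked: three separate parameters
cursor.execute(sql, values)   # equivalent: one sequence of parameters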
