Convert CSV to SQLite when I don't know if a column is valid - Python

I want to write a function in Python to convert a CSV file to SQLite. The CSV has 4 columns: Setting, State, Comment and Path. Sometimes the real path value is in the next column, or two columns over, rather than always in the Path column.
import csv
import re
import sqlite3

def csv_to_sqlite(csv_file, sqlite_file):
    # Connect to the SQLite database
    connection = sqlite3.connect(sqlite_file)
    cursor = connection.cursor()
    # Read the CSV file
    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        headers = next(reader)
        # Create the table in the SQLite database
        cursor.execute(f'CREATE TABLE data ({", ".join(headers)})')
        # Get the index of the "Path" column
        path_index = headers.index("Path")
        # Insert the data from the CSV file into the SQLite database
        for row in reader:
            modified_row = row.copy()
            # Check if the "Path" column starts with '\'
            if re.match(r'^\\', modified_row[path_index]):
                cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})', modified_row)
            else:
                # Search for the first column that starts with '\'
                for i in range(path_index + 1, len(headers)):
                    if re.match(r'^\\', modified_row[i]):
                        modified_row[path_index] = modified_row[i]
                        cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})',
                                       modified_row)
                        break
    # Commit the changes and close the connection
    connection.commit()
    connection.close()
but I get this error:
cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})', modified_row)
sqlite3.ProgrammingError: Incorrect number of bindings supplied. The current statement uses 4, and there are 5 supplied.
I expect to get a database that matches the CSV, without the error.
Edit:
I tried to solve this problem with pandas:
df = pd.read_csv(file_path, sep=',', encoding='cp1252')
but I get this error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 18, saw 5
This is my data: (screenshot)
An example of the problem: (screenshot)

Try to check which encoding the file uses when reading it with pandas. After that, you can skip the lines that raise the error, as sketched below. Good luck.
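A minimal sketch of that suggestion, assuming the file really is cp1252-encoded and a pandas version that has on_bad_lines (1.3 or newer; older versions use error_bad_lines=False instead):

import pandas as pd

# Skip any row whose field count does not match the header instead of raising.
df = pd.read_csv(file_path, sep=',', encoding='cp1252', on_bad_lines='skip')

Note that skipped rows are lost, so this only helps if the over-wide rows can be discarded.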

The error is evidence that the current row has 5 columns while the header row only has 4. You should ignore the excess columns by limiting the used length of the row:
cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})',
               modified_row[:len(headers)])

The issue is probably that the number of values in modified_row differs from the number of columns in data. This is likely because modified_row ends up with extra values when the code searches for the first column that starts with '\'.
You can try to include only the values for the columns in data, as in the sketch below.
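A minimal sketch of that idea, reusing reader, cursor, headers and path_index from the question's function; the slice keeps the binding count equal to the column count, and the search range is widened to cover the extra columns:

# Build the placeholder list once from the declared headers.
placeholders = ", ".join("?" for _ in headers)
insert_sql = f"INSERT INTO data VALUES ({placeholders})"

for row in reader:
    modified_row = row.copy()
    if not re.match(r'^\\', modified_row[path_index]):
        # Pull the real path from whichever later column starts with a backslash
        # (scan the whole row, which may be wider than the header).
        for i in range(path_index + 1, len(modified_row)):
            if re.match(r'^\\', modified_row[i]):
                modified_row[path_index] = modified_row[i]
                break
    # Slice to the header width so the number of bindings always matches.
    cursor.execute(insert_sql, modified_row[:len(headers)])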

Related

How to fix pandas to_sql append not appending one item out of 2000?

I have a CSV file with 2000 rows and 3 columns of type Int, String, and String respectively. I'm trying to write a program that creates a table and appends my CSV file's rows and columns into the table. It all seems to work, except for a single item: the item's first and third columns are appended, but the second one is null.
No errors are shown on the console, and I have tried printing the data to the console before calling the to_sql function; it shows the missing item just fine. My CSV file also uses a delimiter. That item was at first not correctly formatted, so after delimiting all of my rows I had to delimit that single row by itself - I'm thinking that this could have caused the problem? All of the other rows and columns are perfectly fine.
Here's my code:
import sqlite3
import pandas as pd
from pandas import DataFrame
connection = sqlite3.connect('data.sqlite')
c = connection.cursor()
c.execute('''CREATE TABLE test(val1 int, val2 varchar(255), val3 varchar(255))''')
connection.commit()
col_names = ["val1", "val2", "val3"]
read_clients = pd.read_csv(r'thefile.csv', encoding='utf-16', names=col_names, sep='\t')
read_clients.to_sql('test', connection, if_exists='append', index=False, schema='test')
connection.commit()
No messages are printed to the console.
So the value of the string that was missing was "NA". pandas apparently translates this to a null value! Here's how I fixed it:
read_clients = pd.read_csv(r'thefile.csv', encoding='utf-16', names=col_names, sep='\t', na_filter=False)
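As a quick illustration of that behaviour (a hypothetical one-row example, not the original data): by default pandas treats the literal string "NA" as a missing value, and na_filter=False turns that conversion off:

import io
import pandas as pd

csv_text = "val1\tval2\tval3\n1\tNA\tfoo\n"

default_df = pd.read_csv(io.StringIO(csv_text), sep='\t')
raw_df = pd.read_csv(io.StringIO(csv_text), sep='\t', na_filter=False)

print(default_df['val2'][0])  # nan  ("NA" converted to a missing value)
print(raw_df['val2'][0])      # NA   (kept as a plain string)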

Inserting xls data into Oracle DB using cx_Oracle executemany

I am trying to insert an xls file into an Oracle table using cx_Oracle. Below is how I am trying to achieve this.
import re
from datetime import datetime

import xlrd
from xlrd import open_workbook

wb = open_workbook('test.xls')
values = []
sheets = wb.sheet_names()
xl_sheet = wb.sheet_by_name(s)
sql_str = preparesql('MATTERS')  # this is a function I have created which will return the insert statement I am using to load the table
for row in range(1, xl_sheet.nrows):
    col_names = xl_sheet.row(0)
    col_value = []
    for name, col in zip(col_names, range(xl_sheet.ncols)):
        searchObj = re.search(r"Open Date|Matter Status Date", name.value, re.M | re.I)
        if searchObj:
            if xl_sheet.cell(row, col).value == '':
                value = ''
            else:
                value = datetime(*xlrd.xldate_as_tuple(xl_sheet.cell(row, col).value, wb.datemode))
                value = value.strftime('%d-%b-%Y ')
        else:
            value = xl_sheet.cell(row, col).value
        col_value.append(value)
    values.append(col_value)
cur.executemany(sql_str, values, batcherrors=True)
But when I tested it against multiple xls files, for some of them it was throwing a TypeError (I can't share the data due to restrictions from the client). I feel the issue is related to the dtypes of the columns in Excel compared to the DB. Is there any way I can match the dtypes of the values list above to the datatypes of the columns in the DB, or are there any other ways to get the insert done? I tried using dataframe.to_sql but it takes a lot of time. I am able to insert the same data by looping through the rows in the values list.
I suggest you import the data into a pandas DataFrame; then it becomes very easy to work with data types. You can change the data type of a whole column and then insert easily. A sketch of that idea follows.
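A minimal sketch of that suggestion; the sheet name, date column names, table name and connection string are placeholders, and the column order is assumed to match the target table:

import cx_Oracle
import pandas as pd

# Read the worksheet straight into a DataFrame (sheet name is an assumption).
df = pd.read_excel('test.xls', sheet_name='MATTERS')

# Coerce the date columns to datetimes; adjust the names to your file.
for date_col in ('Open Date', 'Matter Status Date'):
    df[date_col] = pd.to_datetime(df[date_col], errors='coerce')

# cx_Oracle binds None for missing values, so replace NaN/NaT first.
df = df.astype(object).where(df.notna(), None)

connection = cx_Oracle.connect('user/password@host/service')  # placeholder DSN
cursor = connection.cursor()
sql = ("INSERT INTO MATTERS VALUES ("
       + ", ".join(f":{i + 1}" for i in range(len(df.columns))) + ")")
cursor.executemany(sql, df.values.tolist(), batcherrors=True)
connection.commit()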

pandas to_sql only writing first row to db

I am using the code below to read a tab-delimited file into a Postgres database:
import pandas as pd
import smart_open
from sqlalchemy import create_engine

enginestring = cfg.dbuser + ":" + cfg.dbpwd + "@" + server.local_bind_host + ":" + str(server.local_bind_port) + "/" + cfg.dbname
engine = create_engine('postgresql://' + enginestring)
rows = []
for line in smart_open.smart_open(key):
    ln = str(line.decode('utf-8'))
    fields = ln.split('\t')
    rows.append(fields)
df = pd.DataFrame(rows, columns=cfg.df_colheaders)
print(df.head())
df.to_sql(name=table_name, con=engine, if_exists='append')
The call to print shows the DataFrame that I expect (i.e. [798624 rows x 133 columns]) and the call to to_sql doesn't fail, yet in the DB I see only one row of data with the correct columns (same result whether or not the table already existed).
OK, here is an update:
I solved the single-row issue by stripping the EOL chars (I could see ¶ at the end of the last inserted field).
Then I was just getting empty tables, so I added the chunksize parameter to to_sql. Not sure why it didn't fail instead of just proceeding, but it's OK now.
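A minimal sketch of those two fixes applied to the loop from the question (reusing its names; the chunk size of 10000 is an arbitrary assumption):

rows = []
for line in smart_open.smart_open(key):
    # Strip the trailing newline/carriage return so the last field is clean.
    ln = line.decode('utf-8').rstrip('\r\n')
    rows.append(ln.split('\t'))

df = pd.DataFrame(rows, columns=cfg.df_colheaders)
# Write in batches rather than a single huge insert.
df.to_sql(name=table_name, con=engine, if_exists='append', chunksize=10000)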

How to parse dataframes from an excel sheet with many tables (using Python, possibly Pandas)

I'm dealing with badly laid out Excel sheets which I'm trying to parse and write into a database.
Every sheet can have multiple tables. Though the headers of these possible tables are known, which tables will be on any given sheet is not, and neither is their exact location (the tables don't align in a consistent way). I've added pictures of two possible sheet layouts to illustrate this: this layout has two tables, while this one has all the tables of the first, but not in the same locations, plus an extra table.
What I do know:
All the possible table headers, so each individual table can be identified by its headers
Tables are separated by blank cells. They do not touch each other.
My question: Is there a clean way to deal with this using some Python module such as pandas?
My current approach:
I'm currently converting to .csv and parsing each row. I split each row around blank cells and process the first part of the row (which should belong to the leftmost table). The remainder of the row is queued and later processed in the same manner. I then read this first_part and check whether or not it's a header row. If it is, I use it to identify which table I'm dealing with (stored in a global current_df). Subsequent rows which are not header rows are fed into this table (I'm using pandas.DataFrame for my tables).
Code so far is below (mostly incomplete and untested, but it should convey the approach above):
from queue import Queue

class DFManager(object):  # keeps track of the current table and its headers
    current_df = None
    current_headers = []

    def set_current_df(self, df, headers):
        self.current_headers = headers
        self.current_df = df

def split_row(row, separator):
    while row and row[0] == separator:
        row.pop(0)
    while row and row[-1] == separator:
        row.pop()
    if separator in row:
        split_index = row.index(separator)
        return row[:split_index], row[split_index:]
    else:
        return row, []

def process_df_row(row, dfmgr):
    df = df_with_header(row)  # returns the dataframe with these headers
    if df is None:  # is not a header row, add it to the current df
        df = dfmgr.current_df
        add_row_to_df(row, df)
    else:
        dfmgr.set_current_df(df, row)

# this is passed the Excel sheet
def populate_dataframes(xl_sheet):
    dfmgr = DFManager()
    row_queue = Queue()
    for row in xl_sheet:
        row_queue.put(row)
    for row in iter(row_queue.get, None):
        if not row:
            continue
        first_part, remainder = split_row(row, '')  # blank cell as separator
        row_queue.put(remainder)
        process_df_row(first_part, dfmgr)
This is such a specific situation that there is likely no "clean" way to do this with a ready-made module.
One way to do this might be to use the header information you already have to find the starting indices of each table, something like this solution (Python Pandas - Read csv file containing multiple tables), but with an offset in the column direction as well.
Once you have the starting position of each table, you'll want to determine its width (either known a priori or discovered by reading until the next blank column) and read those columns into a dataframe until the end of the table.
The benefit of an index-based method over a queue-based method is that you do not need to re-discover where the separator is in each row or keep track of which row fragments belong to which table. It is also agnostic to the presence of more than two tables per row. A rough sketch of the idea is below.
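A rough, hypothetical sketch of that index-based approach: the sheet is treated as a list of row lists (as produced by the CSV conversion), known_headers maps each table name to the first cell of its header row (an assumption), and each table is read rightwards until a blank header cell and downwards until a blank leading cell:

import pandas as pd

known_headers = {'table_a': 'ColA1', 'table_b': 'ColB1'}  # placeholder names

def find_tables(sheet_rows, known_headers):
    """Return {table_name: DataFrame} for every known table found on the sheet."""
    tables = {}
    for r, row in enumerate(sheet_rows):
        for c, cell in enumerate(row):
            for name, first_header in known_headers.items():
                if cell != first_header:
                    continue
                # Width: header cells to the right until the first blank cell.
                width = 0
                while c + width < len(row) and row[c + width] != '':
                    width += 1
                header = row[c:c + width]
                # Height: data rows downwards until a blank cell under the header.
                body = []
                rr = r + 1
                while rr < len(sheet_rows) and c < len(sheet_rows[rr]) and sheet_rows[rr][c] != '':
                    padded = sheet_rows[rr] + [''] * (c + width)  # guard against short rows
                    body.append(padded[c:c + width])
                    rr += 1
                tables[name] = pd.DataFrame(body, columns=header)
    return tables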
I have written code to merge multiple tables that are separated vertically and share common headers. I am assuming that unique header names do not end with a dot followed by an integer.
import sys
import pandas as pd

def clean(input_file, output_file):
    try:
        df = pd.read_csv(input_file, skiprows=[1, 1])
        df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)
        df = rename_duplicate_columns(df)
    except:
        df = []
        print("Error: File Not found\t", sys.exc_info()[0])
        exit
    udf = df.loc[:, ~df.columns.str.match(r".*\.\d")]
    udf = udf.dropna(how='all')
    try:
        table_num = int(df.columns.values[-1].split('.')[-1])
        fdf = udf
        for i in range(1, table_num + 1):
            udfi = pd.DataFrame()
            udfi = df.loc[:, df.columns.str.endswith(f'.{i}')]
            udfi.rename(columns=lambda x: '.'.join(x.split('.')[:-1]), inplace=True)
            udfi = udfi.dropna(how='all')
            fdf = fdf.append(udfi, ignore_index=True)
        fdf.to_csv(output_file)
    except ValueError:
        print("File Contains only single Table")
        exit

def rename_duplicate_columns(df):
    cols = pd.Series(df.columns)
    for dup in df.columns.get_duplicates():
        cols[df.columns.get_loc(dup)] = [dup + '.' + str(d_idx) if d_idx != 0 else dup
                                         for d_idx in range(df.columns.get_loc(dup).sum())]
    df.columns = cols
    print(df.columns)
    return df

clean(input_file, output_file)

Assigning Values to Variables from List in Python 2.7

I'm trying to upload a decent amount (~300k rows) of data to a database using pyodbc. Since the data will need to be updated quarterly (from spreadsheets), what I'm trying to create is a sort of dynamic insert statement to make things a bit more streamlined. My thought process is that I can name the header of each column in the spreadsheets the same as the corresponding column in the table where I want the data to be uploaded.
What I've attempted to do is write a script that pulls the column names out of the header row of the worksheet and uses them as the variables, eliminating the need to change any code when uploading different data from different Excel sheets back to back:
import xlrd
import numpy as np
import pyodbc

wb = xlrd.open_workbook(r"FILE")
worksheet = wb.sheet_by_name('SHEET_NAME')
num_rows = worksheet.nrows
num_cols = worksheet.ncols
header_row = 0
header_list = [worksheet.row_values(header_row)]
header_values = ", ".join([str(i) for i in [cell.value for cell in worksheet.row(header_row)]])
question_marks = ",?" * (num_cols - 1)
cell = worksheet.cell(1, 1)

arr = []
for rowind in range(num_rows)[(header_row + 1):]:
    arr.append([cell.value for cell in worksheet.row(rowind)])
data = np.rec.fromrecords(arr, names=header_values)

cnxn = pyodbc.connect(r"DRIVER={SQL Server};SERVER=XXXXXXXXXX\DEV_CON1;DATABASE=GGG;UID=AAA_dbo;PWD=XXXXXXXXX;")
cursor = cnxn.cursor()
populate_db = "insert into tblSnap_TEST(" + header_values + ") values (?" + question_marks + ")"
for i in header_list:
    i = data[i]
values = header_list
cursor.execute(populate_db, values)
cursor.close()
cnxn.commit()
cnxn.close()
When I attempt to run the script I get the following error message:
File "<ipython-input-45-6809dc4a27ac>", line 1, in <module>
runfile('H:/My Documents/Python/string_search_test.py', wdir='H:/My Documents/Python')
File "C:\Users\xxxx\xxx\xx\x\Anaconda\lib\site- packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "H:/My Documents/Python/string_search_test.py", line 67, in <module>
cursor.execute(populate_db,values)
ProgrammingError: ('The SQL contains 21 parameter markers, but 1 parameters were supplied', 'HY000')
The way I've done this before is by explicitly defining the values to be passed, such as:
account_numbers = (sheet.cell(row_index, 1).value)
But like I said, what I'm trying to do here is avoid having to type that out. That is what I was attempting with i = data[i]. I'm hoping there's a way for Python to recognize that "account_numbers" is in a list (created from the worksheet's headers) and then grab the corresponding data, based on something similar to the i = data[i] I tried above. Is a solution like this possible? data[i] does return the data I want to insert into the table for each column, but it's not recognized by the execute statement.
If you're not dealing with large Excel worksheets or any problematic data types (such as the one described in the section Dates in Excel spreadsheets), you can simplify this: read all rows into a single list, pop the header values to use as the insert columns, then call Cursor.executemany once to insert all values from the spreadsheet, passing a sequence of sequences as the parameter values.
I've removed the numpy array population, since it's not necessary in the snippet provided.
header_row = 0
# build list of lists that represents row values in worksheet,
# including column names from header row
rows = [worksheet.row_values(row) for row in range(worksheet.nrows)]
# extract list of column names to use for insert statement, values to be inserted remain
columns = rows.pop(header_row)
cnxn = pyodbc.connect(r"DRIVER={SQL Server};SERVER=XXXXXXXXXX\DEV_CON1;DATABASE=GGG;UID=AAA_dbo;PWD=XXXXXXXXX;")
cursor = cnxn.cursor()
# list of column names used to build SQL statement, including parameter placeholders (?)
populate_db = "insert into tblSnap_TEST ({}) values ({})".format(', '.join(columns),
', '.join('?' * len(columns)))
# insert is executed once for each sequence of parameter values
cursor.executemany(populate_db, rows)
cnxn.commit()
cnxn.close()
Maybe:
cursor.execute(populate_db, *values)
This is known as unpacking a list (or other iterable).
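For illustration, a tiny hypothetical sketch of that unpacking (the column names are placeholders); pyodbc accepts parameters either as one sequence or as individual arguments:

values = [42, 'Smith', 'Active']

# These two calls supply the same three parameters:
cursor.execute("insert into tblSnap_TEST (val1, val2, val3) values (?, ?, ?)", values)
cursor.execute("insert into tblSnap_TEST (val1, val2, val3) values (?, ?, ?)", *values)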
