How to extract data from an HDF5 file to fill a PyTables table? - python

I am trying to write a Discord bot in Python. The goal of the bot is to fill a table with entries from users, recording the username, game name and game password. Then, for specific users, it should extract that data and remove the solved entry. I took the first tool I found on Google to manage tables, hence PyTables. I'm able to fill a table in an HDF5 file, but I am unable to retrieve the entries.
It may be important to say that I have never coded in Python before.
This is how I declare my object and create a file to store entries.
class DCParties(tables.IsDescription):
    user_name = StringCol(32)
    game_name = StringCol(16)
    game_pswd = StringCol(16)

h5file = open_file("DCloneTable.h5", mode="w", title="DClone Table")
group = h5file.create_group("/", 'DCloneEntries', "Entries for DClone runs")
table = h5file.create_table(group, 'Entries', DCParties, "Entrées")
h5file.close()
This is how I fill entries:
h5file = open_file("DCloneTable.h5", mode="a")
table = h5file.root.DCloneEntries.Entries
particle = table.row
particle['user_name'] = member.author
particle['game_name'] = game_name
particle['game_pswd'] = game_pswd
particle.append()
table.flush()
h5file.close()
All of this works, and I can see my entries filling the table in the file with an HDF5 viewer.
But then, when I try to read the table stored in the file to extract the data, it doesn't work.
h5file = open_file("DCloneTable.h5", mode="a")
table = h5file.root.DCloneEntries.Entries
particle = table.row
"""???"""
h5file.close()
I tried using particle["user_name"] (since 'user_name' on its own isn't defined), but it gives me b'' as output:
h5file = open_file("DCloneTable.h5", mode="a")
table = h5file.root.DCloneEntries.Entries
particle = table.row
print(f'{particle["user_name"]}')
h5file.close()
b''
And if I do
h5file = open_file("DCloneTable.h5", mode="a")
table = h5file.root.DCloneEntries.Entries
particle = table.row
print(f'{particle["user_name"]} - {particle["game_name"]} - {particle["game_pswd"]}')
h5file.close()
b'' - b'' - b''
Where am I failing? Many thanks in advance :)

Here is a simple method to iterate over the table rows and print them one at a time.
First, note that table.row is a pointer used for appending new rows; it doesn't reference the data already stored, which is why printing it only gives empty default values. Second, PyTables stores StringCol data as fixed-length byte strings rather than Unicode, which is why you see the b''. To get rid of the b'', you have to convert back to Unicode using .decode('utf-8'). This works with your hard-coded field names; you could use the values from table.colnames to handle arbitrary column names. Also, I recommend using Python's file context manager (with/as) to avoid leaving a file open.
import tables as tb

with tb.open_file("DCloneTable.h5", mode="r") as h5file:
    table = h5file.root.DCloneEntries.Entries
    print(f'Table Column Names: {table.colnames}')
    # Method to iterate over all rows
    for row in table:
        print(f"{row['user_name'].decode('utf-8')} - "
              f"{row['game_name'].decode('utf-8')} - "
              f"{row['game_pswd'].decode('utf-8')}")
    # Method to only read the first row, aka table[0]
    print(f"{table[0]['user_name'].decode('utf-8')} - "
          f"{table[0]['game_name'].decode('utf-8')} - "
          f"{table[0]['game_pswd'].decode('utf-8')}")
If you prefer to read all the data at once, you can use the table.read() method to load it into a NumPy structured array. You still have to convert from bytes to Unicode, so it is slightly more involved; a sketch is below.
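As an illustration (not part of the original answer), here is a minimal sketch of that approach, reusing the file and column names from the question:
import tables as tb

with tb.open_file("DCloneTable.h5", mode="r") as h5file:
    table = h5file.root.DCloneEntries.Entries
    data = table.read()  # NumPy structured array, one field per column

# decode each fixed-length byte-string field back to Unicode
for rec in data:
    print(f"{rec['user_name'].decode('utf-8')} - "
          f"{rec['game_name'].decode('utf-8')} - "
          f"{rec['game_pswd'].decode('utf-8')}")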


Convert CSV to SQLite, but I don't know if a column is valid

I want to write a function in Python to convert CSV to SQLite. In the CSV I have 4 columns: Setting, State, Comment and Path. Sometimes the real path is in the next column or two columns over, not always in the Path column.
import csv
import re
import sqlite3

def csv_to_sqlite(csv_file, sqlite_file):
    # Connect to the SQLite database
    connection = sqlite3.connect(sqlite_file)
    cursor = connection.cursor()
    # Read the CSV file
    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        headers = next(reader)
        # Create the table in the SQLite database
        cursor.execute(f'CREATE TABLE data ({", ".join(headers)})')
        # Get the index of the "Path" column
        path_index = headers.index("Path")
        # Insert the data from the CSV file into the SQLite database
        for row in reader:
            modified_row = row.copy()
            # Check if the "Path" column starts with '\'
            if re.match(r'^\\', modified_row[path_index]):
                cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})', modified_row)
            else:
                # Search for the first column that starts with '\'
                for i in range(path_index + 1, len(headers)):
                    if re.match(r'^\\', modified_row[i]):
                        modified_row[path_index] = modified_row[i]
                        cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})',
                                       modified_row)
                        break
    # Commit the changes and close the connection
    connection.commit()
    connection.close()
but I get this error:
cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})', modified_row)
sqlite3.ProgrammingError: Incorrect number of bindings supplied. The current statement uses 4, and there are 5 supplied.
I expect to get a database matching the CSV, without the error.
Edit:
I tried to solve this problem with pandas:
df = pd.read_csv(file_path, sep=',', encoding='cp1252')
I get this error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 18, saw 5
(Screenshots showing the data and an example of the problem rows were attached to the original question.)
Check which encoding pandas should use for the file. After that, you can skip the lines that fail to parse. Good luck
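For example, a minimal sketch, assuming pandas 1.3 or later (where the on_bad_lines parameter was introduced):
import pandas as pd

# skip rows whose field count doesn't match the header instead of raising
df = pd.read_csv(file_path, sep=',', encoding='cp1252', on_bad_lines='skip')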
The error is evidence that the current row has 5 columns while the header row only has 4. You should ignore the excess columns by limiting the used length of the row:
cursor.execute(f'INSERT INTO data VALUES ({", ".join(["?" for header in headers])})',
               modified_row[:len(headers)])
The issue is probably due to the number of values in modified_row being different from the number of columns in data. This is likely because the code retains extra values in modified_row when searching for the first column that starts with '\'.
You can try to only include the values for the columns in data.

Inserting xls data into Oracle DB using cx_Oracle executemany

I am trying to insert an xls file into an Oracle table using cx_Oracle. Below is how I am trying to achieve it:
import re
from datetime import datetime

import xlrd
from xlrd import open_workbook

wb = open_workbook('test.xls')
values = []
sheets = wb.sheet_names()
xl_sheet = wb.sheet_by_name(s)  # 's' is the name of the sheet to load
sql_str = preparesql('MATTERS')  # function I created that returns the insert statement used to load the table
for row in range(1, xl_sheet.nrows):
    col_names = xl_sheet.row(0)
    col_value = []
    for name, col in zip(col_names, range(xl_sheet.ncols)):
        searchObj = re.search(r"Open Date|Matter Status Date", name.value, re.M | re.I)
        if searchObj:
            if xl_sheet.cell(row, col).value == '':
                value = ''
            else:
                value = datetime(*xlrd.xldate_as_tuple(xl_sheet.cell(row, col).value, wb.datemode))
                value = value.strftime('%d-%b-%Y ')
        else:
            value = xl_sheet.cell(row, col).value
        col_value.append(value)
    values.append(col_value)
cur.executemany(sql_str, values, batcherrors=True)
But when I tested it against multiple xls files, for some files it threw a TypeError. (I can't share the data due to restrictions from the client.) I believe the issue is related to the dtypes of the columns in Excel compared to the DB. Is there any way to match the dtypes of the values list above to the datatypes of the columns in the DB, or are there other ways to get the insert done? I tried dataframe.to_sql, but it takes a lot of time. I am able to insert the same data by looping through the rows of the values list.
I suggest you import the data into a pandas DataFrame; then it becomes very easy to work with data types. You can change the data type of a whole column and then insert easily.
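As a rough sketch only (this is not the poster's code: 'Open Date' stands in for a real date column, user/password/dsn are placeholder credentials, and sql_str is the insert statement from the question):
import cx_Oracle
import pandas as pd

# read the sheet into a DataFrame; pandas infers a dtype per column
df = pd.read_excel('test.xls', sheet_name=0)

# coerce columns to the types the Oracle table expects, e.g. dates
df['Open Date'] = pd.to_datetime(df['Open Date'], errors='coerce')

# convert NaN/NaT to None so they arrive as NULLs, then bulk insert
rows = [tuple(None if pd.isna(v) else v for v in rec)
        for rec in df.itertuples(index=False)]
conn = cx_Oracle.connect(user, password, dsn)  # placeholder credentials
cur = conn.cursor()
cur.executemany(sql_str, rows, batcherrors=True)
conn.commit()
conn.close()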

How to parse dataframes from an excel sheet with many tables (using Python, possibly Pandas)

I'm dealing with badly laid out excel sheets which I'm trying to parse and write into a database.
Every sheet can have multiple tables. Though the headers of these possible tables are known, which tables will be on any given sheet is not, and neither is their exact location (the tables don't align in a consistent way). I've added pictures of two possible sheet layouts to illustrate this: the first layout has two tables, while the second has all the tables of the first, but not in the same locations, plus an extra table.
What I do know:
All the possible table headers, so each individual table can be identified by its headers
Tables are separated by blank cells. They do not touch each other.
My question: Is there a clean way to deal with this using some Python module such as pandas?
My current approach:
I'm currently converting to .csv and parsing each row. I split each row around blank cells and process the first part of the row (which should belong to the leftmost table). The remainder of the row is queued and later processed in the same manner. I then read this first_part and check whether or not it's a header row. If it is, I use it to identify which table I'm dealing with (stored in a global current_df). Subsequent rows which are not header rows are fed into this table (I'm using pandas.DataFrame for my tables).
Code so far is below (mostly incomplete and untested, but it should convey the approach above):
from queue import Queue

class DFManager(object):  # keeps track of current table and its headers
    current_df = None
    current_headers = []

    def set_current_df(self, df, headers):
        self.current_headers = headers
        self.current_df = df

def split_row(row, separator):
    while row and row[0] == separator:
        row.pop(0)
    while row and row[-1] == separator:
        row.pop()
    if separator in row:
        split_index = row.index(separator)
        return row[:split_index], row[split_index:]
    else:
        return row, []

def process_df_row(row, dfmgr):
    df = df_with_header(row)  # returns the dataframe with these headers
    if df is None:  # is not a header row, add it to current df
        df = dfmgr.current_df
        add_row_to_df(row, df)
    else:
        dfmgr.set_current_df(df, row)

# this is passed the Excel sheet
def populate_dataframes(xl_sheet):
    dfmgr = DFManager()
    row_queue = Queue()
    for row in xl_sheet:
        row_queue.put(row)
    for row in iter(row_queue.get, None):
        if not row:
            continue
        first_part, remainder = split_row(row, '')  # blank cells act as the separator
        row_queue.put(remainder)
        process_df_row(first_part, dfmgr)
This is such a specific situation that there is likely no "clean" way to do this with a ready-made module.
One way to do this might be to use the header information you already have to find the starting indices of each table, something like this solution (Python Pandas - Read csv file containing multiple tables), but with an offset in the column direction as well.
Once you have the starting position of each table, you'll want to determine the widths (either known a priori or discovered by reading until the next blank column) and read those columns into a dataframe until the end of the table.
The benefit of an index-based method rather than a queue based method is that you do not need to re-discover where the separator is in each row or keep track of which row fragments belong to which table. It is also agnostic to the presence of >2 tables per row.
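As an illustration of that approach (a sketch only; KNOWN_HEADERS and the file name are placeholders for your real headers and workbook):
import pandas as pd

# placeholder header definitions; list the known header tuples here
KNOWN_HEADERS = [('ColA', 'ColB'), ('ColX', 'ColY', 'ColZ')]

def find_tables(df):
    """Scan a raw sheet (read with header=None) for known header rows."""
    tables = []
    for r in range(df.shape[0]):
        for c in range(df.shape[1]):
            for header in KNOWN_HEADERS:
                w = len(header)
                if tuple(df.iloc[r, c:c + w]) == header:
                    # read downward until a fully blank row ends the table
                    end = r + 1
                    while end < df.shape[0] and not df.iloc[end, c:c + w].isna().all():
                        end += 1
                    block = df.iloc[r + 1:end, c:c + w].copy()
                    block.columns = list(header)
                    tables.append(block.reset_index(drop=True))
    return tables

raw = pd.read_excel('sheet.xlsx', header=None)  # placeholder file name
for t in find_tables(raw):
    print(t)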
I have written code to merge multiple tables that are separated vertically and share common headers in each table. I assume that unique header names do not end with a dot followed by an integer (pandas renames duplicate columns by appending '.1', '.2', and so on).
import sys
import pandas as pd

def clean(input_file, output_file):
    try:
        df = pd.read_csv(input_file, skiprows=[1, 1])
        df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)
        df = rename_duplicate_columns(df)
    except Exception:
        print("Error: File Not found\t", sys.exc_info()[0])
        return
    # columns whose names do not end in ".<n>" belong to the first table
    udf = df.loc[:, ~df.columns.str.match(r".*\.\d")]
    udf = udf.dropna(how='all')
    try:
        table_num = int(df.columns.values[-1].split('.')[-1])
        fdf = udf
        for i in range(1, table_num + 1):
            udfi = df.loc[:, df.columns.str.endswith(f'.{i}')]
            udfi = udfi.rename(columns=lambda x: '.'.join(x.split('.')[:-1]))
            udfi = udfi.dropna(how='all')
            # DataFrame.append was removed in recent pandas; concat is the equivalent
            fdf = pd.concat([fdf, udfi], ignore_index=True)
        fdf.to_csv(output_file)
    except ValueError:
        print("File Contains only single Table")

def rename_duplicate_columns(df):
    cols = pd.Series(df.columns)
    # Index.get_duplicates() was removed in recent pandas; this is the equivalent
    for dup in df.columns[df.columns.duplicated()].unique():
        cols[df.columns.get_loc(dup)] = [dup + '.' + str(d_idx) if d_idx != 0 else dup
                                         for d_idx in range(df.columns.get_loc(dup).sum())]
    df.columns = cols
    print(df.columns)
    return df

clean(input_file, output_file)

Reading in header information from csv file using Pandas

I have a data file that has 14 lines of header. In the header, there is the metadata for the latitude-longitude coordinates and time. I am currently using
pandas.read_csv(filename, delimiter=",", header=14)
to read in the file but this just gets the data and I can't seem to get the metadata. Would anyone know how to read in the information in the header? The header looks like:
CSD,20160315SSIO
NUMBER_HEADERS = 11
EXPOCODE = 33RR20160208
SECT_ID = I08
STNBBR = 1
CASTNO = 1
DATE = 20160219
TIME = 0558
LATITUDE = -66.6027
LONGITUDE = 78.3815
DEPTH = 462
INSTRUMENT_ID = 0401
CTDPRS,CTDPRS_FLAG,CTDTMP,CTDTMP_FLAG
DBAR,,ITS-90,,PSS-78
You have to parse the metadata header yourself, but you can do it in an elegant manner in one pass, and even use it on the fly to extract data from it, check the correctness of the file, etc.
First, open the file yourself:
f = open(filename)
Then, do the work to parse each metadata line to extract data out it. For the sake of the explanation, I'm just skipping these rows:
for i in range(13):  # skip the first 13 lines that are useless for the columns definition
    f.readline()     # use the resulting string for metadata extraction
Now you have the file pointer ready on the unique header line you want to use to load the DataFrame. The cool thing is that read_csv accepts file objects! Thus you start loading your DataFrame right away now:
pandas.read_csv(f, sep=",")
Note that I don't use the header argument, since from your description only that one last header line is useful for your DataFrame. You can build on and adjust the header parsing and the rows to skip from this example; a sketch follows.
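For instance, a minimal sketch assuming the exact layout shown above (one title line, eleven "KEY = value" lines, a column-name line, then a units line):
import pandas as pd

meta = {}
with open(filename) as f:
    f.readline()                      # title line: "CSD,20160315SSIO"
    for _ in range(11):               # the NUMBER_HEADERS = 11 "KEY = value" lines
        key, _, value = f.readline().partition('=')
        meta[key.strip()] = value.strip()
    # the column-name line becomes the header; skip the units line below it
    df = pd.read_csv(f, sep=',', skiprows=[1])

lat, lon = float(meta['LATITUDE']), float(meta['LONGITUDE'])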
Although the following method does not use Pandas, I was able to extract the header information.
import csv

with open(fname) as csvfile:
    forheader_IO2016 = csv.reader(csvfile, delimiter=',')
    header_IO2016 = []
    for row in forheader_IO2016:
        header_IO2016.append(row[0])

date = header_IO2016[7].split(" ")[2]
time = header_IO2016[8].split(" ")[2]
lat = float(header_IO2016[9].split(" ")[2])
lon = float(header_IO2016[10].split(" ")[4])

Assigning Values to Variables from List in Python 2.7

I'm trying to upload a decent amount (~300k rows) worth of data to a database using pyodbc. Since the data will need to be updated quarterly (from spreadsheets), what I'm trying to create is a sort of dynamic insert statement to make things a bit more streamlined. My thought process is that I can name the header of each column in the spreadsheets the same as the column in the table where I want the respective data to be uploaded.
What I've attempted to do is write a script that pulls out the column names in the header row on the worksheet and uses these as the variables, eliminating the need for me to change any code if I'm uploading different data from different excel sheets back to back:
import xlrd
import numpy as np
import pyodbc

wb = xlrd.open_workbook(r"FILE")
worksheet = wb.sheet_by_name('SHEET_NAME')
num_rows = worksheet.nrows
num_cols = worksheet.ncols
header_row = 0
header_list = [worksheet.row_values(header_row)]
header_values = ", ".join([str(i) for i in [cell.value for cell in worksheet.row(header_row)]])
question_marks = ",?" * (num_cols - 1)
cell = worksheet.cell(1, 1)
arr = []
for rowind in range(num_rows)[(header_row + 1):]:
    arr.append([cell.value for cell in worksheet.row(rowind)])
data = np.rec.fromrecords(arr, names=header_values)
cnxn = pyodbc.connect(r"DRIVER={SQL Server};SERVER=XXXXXXXXXX\DEV_CON1;DATABASE=GGG;UID=AAA_dbo;PWD=XXXXXXXXX;")
cursor = cnxn.cursor()
populate_db = "insert into tblSnap_TEST(" + header_values + ") values (?" + question_marks + ")"
for i in header_list:
    i = data[i]
values = header_list
cursor.execute(populate_db, values)
cursor.close()
cnxn.commit()
cnxn.close()
When I attempt to run the script I get the following error message:
File "<ipython-input-45-6809dc4a27ac>", line 1, in <module>
runfile('H:/My Documents/Python/string_search_test.py', wdir='H:/My Documents/Python')
File "C:\Users\xxxx\xxx\xx\x\Anaconda\lib\site- packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "H:/My Documents/Python/string_search_test.py", line 67, in <module>
cursor.execute(populate_db,values)
ProgrammingError: ('The SQL contains 21 parameter markers, but 1 parameters were supplied', 'HY000')
The way I've done this before is by explicitly defining the values to be passed, such as:
account_numbers = (sheet.cell(row_index, 1).value)
But like I said, what I'm trying to do here is make it so I wouldn't have to type that out. That is what I was attempting to do with i = data[i]. I'm hoping there's a way for Python to recognize that "account_numbers" is in a list (created from the worksheet's headers), then grab the corresponding data based on something similar to i = data[i] which I tried above. Is a solution like this possible? data[i] does return the right data I want to insert into the table for each column, but it's not recognized by the execute statement.
If you're not dealing with large Excel worksheets or any problematic data types (such as the one described in the section Dates in Excel spreadsheets), you can simplify this: read all rows into a single list, pop the header values to use as the insert columns, then call Cursor.executemany once to insert all values from the spreadsheet, passing a sequence of sequences as the parameter values.
I've removed the numpy array population, since it's not necessary in the snippet provided.
header_row = 0
# build list of lists that represents row values in worksheet,
# including column names from header row
rows = [worksheet.row_values(row) for row in range(worksheet.nrows)]
# extract list of column names to use for insert statement, values to be inserted remain
columns = rows.pop(header_row)
cnxn = pyodbc.connect(r"DRIVER={SQL Server};SERVER=XXXXXXXXXX\DEV_CON1;DATABASE=GGG;UID=AAA_dbo;PWD=XXXXXXXXX;")
cursor = cnxn.cursor()
# list of column names used to build SQL statement, including parameter placeholders (?)
populate_db = "insert into tblSnap_TEST ({}) values ({})".format(', '.join(columns),
', '.join('?' * len(columns)))
# insert is executed once for each sequence of parameter values
cursor.executemany(populate_db, rows)
cnxn.commit()
cnxn.close()
Maybe:
cursor.execute(populate_db, *values)
This is known as unpacking a list (or other iterable).
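For illustration (using a hypothetical table t, not the question's table), pyodbc accepts parameters either as a single sequence or as individual arguments:
params = [1, 'a', 2.5]
cursor.execute("insert into t (x, y, z) values (?, ?, ?)", params)   # one sequence
cursor.execute("insert into t (x, y, z) values (?, ?, ?)", *params)  # unpacked arguments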
