Parsing a large space-separated file into sqlite - python

I am trying to parse a large space-separated file (3 GB and larger) into a sqlite database for further processing. The file currently has around 20+ million lines of data. I have tried multithreading this, but for some reason it stops after around 1500 lines and does not proceed. I don't know if I am doing anything wrong. Can someone please point me in the right direction?
The insertion works fine with a single process, but it is way too slow (of course!!!). It has been running for over seven hours and is not even past the first set of strings. The DB file is still only 25 MB, nowhere near the number of records it has to contain.
Please guide me towards speeding this up. I have one more huge file to go (more than 5 GB), and at this rate it could take days.
Here’s my code:
import time
import queue
import threading
import sys
import sqlite3 as sql

record_count = 0
DB_INSERT_LOCK = threading.Lock()

def process_data(in_queue):
    global record_count
    try:
        mp_db_connection = sql.connect("sequences_test.sqlite")
        sql_handler = mp_db_connection.cursor()
    except sql.Error as error:
        print("Error while creating database connection: ", error.args[0])
    while True:
        line = in_queue.get()
        # print(line)
        if line[0] == '#':
            pass
        else:
            (sequence_id, field1, sequence_type, sequence_count, field2, field3,
             field4, field5, field6, sequence_info, kmer_length, field7, field8,
             field9, field10, field11, field12, field13, field14,
             field15) = line.expandtabs(1).split(" ")

            info = (field7 + " " + field8 + " " + field9 + " " + field10 + " " +
                    field11 + " " + field12 + " " + field13 + " " + field14 + " " +
                    field15)

            insert_tuple = (None, sequence_id, field1, sequence_type, sequence_count,
                            field2, field3, field4, field5, field6, sequence_info,
                            kmer_length, info)
            try:
                with DB_INSERT_LOCK:
                    sql_string = ('insert into sequence_info '
                                  'values (?,?,?,?,?,?,?,?,?,?,?,?,?)')
                    sql_handler.execute(sql_string, insert_tuple)
                    record_count = record_count + 1
                    mp_db_connection.commit()
            except sql.Error as error:
                print("Error while inserting service into database: ", error.args[0])
        in_queue.task_done()

if __name__ == "__main__":
    try:
        print("Trying to open database connection")
        mp_db_connection = sql.connect("sequences_test.sqlite")
        sql_handler = mp_db_connection.cursor()
        sql_string = '''SELECT name FROM sqlite_master
                        WHERE type='table' AND name='sequence_info' '''
        sql_handler.execute(sql_string)
        result = sql_handler.fetchone()
        if not result:
            print("Creating table")
            sql_handler.execute('''create table sequence_info
                (row_id integer primary key, sequence_id real, field1
                integer, sequence_type text, sequence_count real,
                field2 integer, field3 text,
                field4 text, field5 integer, field6 integer,
                sequence_info text, kmer_length text, info text)''')
            mp_db_connection.commit()
        mp_db_connection.close()
    except sql.Error as error:
        print("An error has occurred: ", error.args[0])

    thread_count = 4
    work = queue.Queue()

    for i in range(thread_count):
        thread = threading.Thread(target=process_data, args=(work,))
        thread.daemon = True
        thread.start()

    with open("out.txt", mode='r') as inFile:
        for line in inFile:
            work.put(line)

    work.join()

    print("Final Record Count: ", record_count)
The reason I have a lock is that with sqlite I don't currently have a way to batch-commit my inserts into the DB, so I have to make sure that every time a thread inserts a record, the state of the DB is committed.
I know I am losing some processing time with the expandtabs call in the thick of things, but it is a little difficult to post-process the file I am receiving so that a simple split works on it. I will keep trying to do that so the workload is reduced, but at the very least I need the multithreading to work.
EDIT:
I moved the expandtabs and split part outside the processing. So I process the line and insert it into the queue as a tuple so that the threads can pick it up and insert it directly into the DB. I was hoping to save quite a bit of time with this, but now I am running into problems with sqlite. It says it could not insert into the DB because it is locked. I am thinking it is more of a thread sync issue with the locking part, since I have an exclusive lock on the critical section below. Could someone please elaborate on how to resolve this?

I wouldn't expect multithreading to be of much use here. You should instead write a generator function that processes the file into tuples, which you then insert with executemany.
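A minimal sketch of that idea, assuming every data line splits into exactly 20 space-separated fields as in the question (file name and column order are taken from the posted code):

import sqlite3

def read_records(path):
    # yield one 13-value tuple per data line, skipping '#' comment lines
    with open(path) as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.rstrip('\n').expandtabs(1).split(' ')
            info = ' '.join(fields[11:20])  # field7 .. field15
            yield (None,) + tuple(fields[:11]) + (info,)  # None -> auto row_id

conn = sqlite3.connect('sequences_test.sqlite')
conn.executemany(
    'insert into sequence_info values (?,?,?,?,?,?,?,?,?,?,?,?,?)',
    read_records('out.txt'))
conn.commit()
conn.close()

executemany consumes the generator lazily and wraps all the inserts in a single transaction, so there is no per-row commit overhead and no need for threads or locks.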

In addition to previous responses, try:
Use executemany from the Connection object
Use PyPy

Multithreading will not help you.
The first thing you have to do is stop committing after each record: according to http://sqlite.org/speed.html that is a factor of 250 in speed.
To avoid losing all your work if you interrupt the run, just commit every 10000 or 100000 records.
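A hedged sketch of that batching (the 10000-row batch size is an assumption to tune; read_records is the generator sketched in the earlier answer):

import itertools
import sqlite3

BATCH_SIZE = 10000

conn = sqlite3.connect('sequences_test.sqlite')
records = read_records('out.txt')  # the generator sketched earlier
while True:
    batch = list(itertools.islice(records, BATCH_SIZE))
    if not batch:
        break
    conn.executemany(
        'insert into sequence_info values (?,?,?,?,?,?,?,?,?,?,?,?,?)',
        batch)
    conn.commit()  # one durable checkpoint per batch instead of per row
conn.close()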


pexpect sometimes misses a line but logs it

I'm using python3-pexpect 4.8.0-2 from Debian Bullseye to control functions on an embedded device via RS232. This works quite well; however, on two of those functions pexpect sometimes misses an answer. Strangely, the logfiles written by pexpect itself (child.logfile) always do contain the missing line, so I suspect an error in my own code.
I use fdspawn() for the serial device:
pexpect.run("stty sane raw cstopb -echo -echoe -echok -echoctl -echoke -iexten 115200 -F /dev/ttyS0")
fd = os.open("/dev/ttyS0", os.O_RDWR|os.O_NONBLOCK|os.O_NOCTTY)
self.child = pexpect.fdpexpect.fdspawn(fd, encoding='utf-8')
self.child.logfile = open("serial.log", "w")
There are 2 methods which fail with a timeout error in 5 to 10% of all runs. I stripped one down to the bare minimum necessary to understand it.
def command(self, cmd, timeout, what, errmsg):
    resp = ["ERROR CRC", "monitor: ERROR", what]
    res = ""
    try:
        self.sendline(cmd)
        while True:
            idx = self.child.expect_exact(resp, timeout)
            if idx == 0:
                raise Rs232Error("RS232 error ({})".format(self.child.after))
            elif idx != 1:
                res = self.child.after  # return matched string
                break
            log.debug("Ignoring ERROR")
    except pexpect.exceptions.TIMEOUT as e:
        raise TimeoutError(errmsg)
    return res
def answer(self, regexp, timeout=2):
    l = ["\[[0-9]+\] {}\r\n".format(regexp), "\[[0-9]+\] ERROR.+\r\n"]
    try:
        idx = self.child.expect(l, timeout)
    except pexpect.exceptions.TIMEOUT as e:
        raise TimeoutError("No answer (last: {})".format(self.child.before))
    s = self.child.after
    m = re.match("\[([0-9]+)\] (.+)\r\n", s)
    t = m.group(2)
    return s
def rs485Test(self, oline, iline):
    cmd = "%s %u" % ("T 190", oline)
    data = ""
    self.command("%s %s" % (cmd, data), 5, "Test 190 execution",
                 "Command %s not accepted" % (cmd))
    v = self.answer("RS485-%u Tx: .+" % (oline), 5)  # check echo
    try:
        v = self.answer("RS485-%u Rx: .+" % (iline), 10)
    except TimeoutError as e:
        return 1, "Receive timed out (%s)" % (str(e))  # testcase failed!
    return 0, v
As written above: the missing line is always logged, but pexpect does not return it, and the TimeoutError message doesn't contain it either (it reads "No answer (last: )"). The echo ("RS485-%u Tx") is sent out within 1 ms and is never missed. The received data follows 6-8 ms later and is what goes missing.
How could I debug the issue?
Sorry, I missed an example. The first call (sending via RS485-0 and receiving via RS485-1) was missed by the above code. The second call worked fine. I see no difference.
T 230 0 AB 70 9A C8 3F 44#FC
[8638628] Test 190/230 execution^M
[8638629] RS485-0 Tx: AB 70 9A C8 3F 44 ^M
[8638635] RS485-1 Rx: AB 70 9A C8 3F 44 ^M
T 190 1 60 AC 24 DF 46 AB#09
[8648659] Test 190/230 execution^M
[8648660] RS485-1 Tx: 60 AC 24 DF 46 AB ^M
[8648666] RS485-0 Rx: 60 AC 24 DF 46 AB ^M

Python SQLite - UPDATE command executes but doesn't update

import sqlite3 as sql

v = (161.5, 164.5, 157.975, 158.5375, 159.3125, 160.325, 74052, 8)
try:
    connection = sql.connect("data.db")
    sql_update_query = """UPDATE RECORDS SET OPEN = ?,HIGH = ?,LOW = ?,CLOSE = ?,LAST = ?,PREVCLOSE = ?,TOTTRDQTY = ? WHERE ROWID = ?"""
    cursor = connection.cursor()
    cursor.execute(sql_update_query, v)
    connection.commit()
    print("Total", cursor.rowcount, "Records updated successfully")
    connection.close()
except Exception as e:
    print(e)
Here is the code that I am using to update the data on my table named "RECORDS".
I tried to check if my SQL statement was wrong on DBBrowser:
UPDATE RECORDS SET OPEN = 161.5,HIGH = 164.5,LOW = 157.975,CLOSE = 158.5375,LAST = 159.3125,PREVCLOSE = 160.325,TOTTRDQTY = 74052 WHERE ROWID = 8
Output was:
Execution finished without errors.
Result: query executed successfully. Took 2ms, 1 rows affected
At line 1:
UPDATE RECORDS SET OPEN = 161.5,HIGH = 164.5,LOW = 157.975,CLOSE = 158.5375,LAST = 159.3125,PREVCLOSE = 160.325,TOTTRDQTY = 74052 WHERE ROWID = 8
But when I run my code in Python, it just doesn't update.
I get:
Total 0 Records updated successfully
My Python code runs but nothing changes in the database. Please help.
Edit: 29-04-2022:
Since my code is fine, maybe the way my database is created is causing this issue.
So I am adding the code that I use to create the DB file.
import os
import pandas as pd
import sqlite3 as sql

connection = sql.connect("data.db")
d = os.listdir("Bhavcopy/")
for f in d:
    fn = "Bhavcopy/" + f
    df = pd.read_excel(fn)
    df["TIMESTAMP"] = pd.to_datetime(df.TIMESTAMP)
    df["TIMESTAMP"] = df['TIMESTAMP'].dt.strftime("%d-%m-%Y")
    df.rename(columns={"TIMESTAMP": "DATE"}, inplace=True)
    df.set_index("DATE", drop=True, inplace=True)
    df['CHANGE'] = df.CLOSE - df.PREVCLOSE
    df['PERCENT'] = round((df.CHANGE / df.PREVCLOSE) * 100, 2)
    df.to_sql('RECORDS', con=connection, if_exists='append')
connection.close()
Sample of data that is being added to the database:
SYMBOL SERIES OPEN ... TIMESTAMP TOTALTRADES ISIN
0 20MICRONS EQ 58.95 ... 01-JAN-2018 1527 INE144J01027
1 3IINFOTECH EQ 8.40 ... 01-JAN-2018 7133 INE748C01020
2 3MINDIA EQ 18901.00 ... 01-JAN-2018 728 INE470A01017
3 5PAISA EQ 383.00 ... 01-JAN-2018 975 INE618L01018
4 63MOONS EQ 119.55 ... 01-JAN-2018 6628 INE111B01023
[5 rows x 13 columns]
SYMBOL SERIES OPEN ... TIMESTAMP TOTALTRADES ISIN
1412 ZODJRDMKJ EQ 43.50 ... 01-JAN-2018 10 INE077B01018
1413 ZUARI EQ 555.00 ... 01-JAN-2018 2097 INE840M01016
1414 ZUARIGLOB EQ 254.15 ... 01-JAN-2018 1670 INE217A01012
1415 ZYDUSWELL EQ 1051.00 ... 01-JAN-2018 688 INE768C01010
1416 ZYLOG EQ 4.80 ... 01-JAN-2018 635 INE225I01026
[5 rows x 13 columns]
Shape of the excel files:
(1417, 13)
Also someone asked how I am creating the table:
import sqlite3 as sql
connection = sql.connect("data.db")
cursor = connection.cursor()
#create our table:
command1 = """
CREATE TABLE IF NOT EXISTS
RECORDS(
DATE TEXT NOT NULL,
SYMBOL TEXT NOT NULL,
SERIES TEXT NOT NULL,
OPEN REAL,
HIGH REAL,
LOW REAL,
CLOSE REAL,
LAST REAL,
PREVCLOSE REAL,
TOTTRDQTY INT,
TOTTRDVAL REAL,
TOTALTRADES INT,
ISIN TEXT,
CHANGE REAL,
PERCENT REAL
)
"""
cursor.execute(command1)
connection.commit()
connection.close()
I created your table with only the numeric fields that needed to be updated and ran your code - it worked. So in the end it had to be a datatype mismatch; I'm glad you found it :)
Your code works fine on both Windows and Linux; the only reason to see that kind of behavior is that you are modifying two files with the same name in different locations. Check which file is being referenced in your DBBrowser.
And when in doubt, prefer absolute paths, as in your comment above:
connection = sql.connect("C:/Users/Abinash/Desktop/data.db")
So I found the problem why the code, even though correct, was not working. Thanks to gimix.
I was creating the variable v:
v = (161.5, 164.5, 157.975, 158.5375, 159.3125, 160.325, 74052, 8)
by reading it from a dataframe. When everyone said that my code was correct and gimix asked how I created the table, I realized that it could be a datatype mismatch. On checking, I found that one of the values was a string.
So this change:
i = 0
o = float(adjdf['OPEN'].iloc[i])
h = float(adjdf['HIGH'].iloc[i])
l = float(adjdf['LOW'].iloc[i])
c = float(adjdf['CLOSE'].iloc[i])
last = float(adjdf['LAST'].iloc[i])
pc = float(adjdf['PREVCLOSE'].iloc[i])
tq = int(adjdf['TOTTRDQTY'].iloc[i])
did = int(adjdf['ID'].iloc[i])
v = (o,h,l,c,last,pc,tq,did)
This fixed the issue. Thank you very much for the help everyone.
I finally got:
Total 1 Records updated successfully

Unable to export in parallel from Exasol using pyexasol

I'm attempting to fetch data from Exasol using PyExasol, in parallel. I'm following the example here - https://github.com/badoo/pyexasol/blob/master/examples/14_parallel_export.py
My code looks like this :
import multiprocessing
import pyexasol
import pyexasol.callback as cb

class ExportProc(multiprocessing.Process):
    def __init__(self, node):
        self.node = node
        self.read_pipe, self.write_pipe = multiprocessing.Pipe(False)
        super().__init__()

    def start(self):
        super().start()
        self.write_pipe.close()

    def get_proxy(self):
        return self.read_pipe.recv()

    def run(self):
        self.read_pipe.close()
        http = pyexasol.http_transport(self.node['host'], self.node['port'],
                                       pyexasol.HTTP_EXPORT)
        self.write_pipe.send(http.get_proxy())
        self.write_pipe.close()
        pd1 = http.export_to_callback(cb.export_to_pandas, None)
        print(f"{self.node['idx']}:{len(pd1)}")

EXASOL_HOST = "<IP-ADDRESS>:8563"
EXASOL_USERID = "username"
EXASOL_PASSWORD = "password"

c = pyexasol.connect(dsn=EXASOL_HOST, user=EXASOL_USERID,
                     password=EXASOL_PASSWORD, compression=True)
nodes = c.get_nodes(10)

pool = list()
proxy_list = list()
for n in nodes:
    proc = ExportProc(n)
    proc.start()
    proxy_list.append(proc.get_proxy())
    pool.append(proc)

c.export_parallel(proxy_list, "SELECT * FROM SOME_SCHEMA.SOME_TABLE",
                  export_params={'with_column_names': True})
stmt = c.last_statement()
r = stmt.fetchall()
At the last statement, I'm getting the following error and am unable to fetch any results.
---------------------------------------------------------------------------
ExaRuntimeError Traceback (most recent call last)
<command-911615> in <module>
----> 1 r = stmt.fetchall()
/local_disk0/pythonVirtualEnvDirs/virtualEnv-01515a25-967f-4b98-aa10-6ac03c978ce2/lib/python3.7/site-packages/pyexasol/statement.py in fetchall(self)
85
86 def fetchall(self):
---> 87 return [row for row in self]
88
89 def fetchcol(self):
/local_disk0/pythonVirtualEnvDirs/virtualEnv-01515a25-967f-4b98-aa10-6ac03c978ce2/lib/python3.7/site-packages/pyexasol/statement.py in <listcomp>(.0)
85
86 def fetchall(self):
---> 87 return [row for row in self]
88
89 def fetchcol(self):
/local_disk0/pythonVirtualEnvDirs/virtualEnv-01515a25-967f-4b98-aa10-6ac03c978ce2/lib/python3.7/site-packages/pyexasol/statement.py in __next__(self)
53 if self.pos_total >= self.num_rows_total:
54 if self.result_type != 'resultSet':
---> 55 raise ExaRuntimeError(self.connection, 'Attempt to fetch from statement without result set')
56
57 raise StopIteration
ExaRuntimeError:
(
message => Attempt to fetch from statement without result set
dsn => <IP-ADDRESS>:8563
user => username
schema =>
)
It seems that the type of the returned statement is not 'resultSet' but 'rowCount'. Any help on what I'm doing wrong, or why the type of the statement is 'rowCount'?
The PyEXASOL creator is here. Please note that in case of parallel HTTP transport you have to process data chunks inside the child processes. Your data set is available in the pd1 DataFrame.
You should not be calling .fetchall() in the main process in case of parallel processing.
I suggest checking the complete examples, especially example 14 (parallel export).
Hope it helps!
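For illustration, a hedged sketch of that (keeping the question's ExportProc class; the parquet file name is an assumption, and any per-chunk processing could go in its place):

# inside the question's ExportProc class -- process the chunk in the child:
def run(self):
    self.read_pipe.close()
    http = pyexasol.http_transport(self.node['host'], self.node['port'],
                                   pyexasol.HTTP_EXPORT)
    self.write_pipe.send(http.get_proxy())
    self.write_pipe.close()
    pd1 = http.export_to_callback(cb.export_to_pandas, None)
    pd1.to_parquet("chunk_{}.parquet".format(self.node['idx']))  # per-chunk work

# in the main process, after c.export_parallel(...):
stmt = c.last_statement()
print(stmt.rowcount())  # a parallel export yields a row count, not a result set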

Why do I get a Regex Error when trying to download youtube captions by Python?

I am trying to download closed captions (subtitles) from youtube videos using the Youtube API or pytube (or in any method).
But I keep getting this:
RegexMatchError: regex pattern (\W[\'"]?t[\'"]?: ?[\'"](.+?)[\'"]) had zero matches
I don't know why this error appears; I've tried several methods and code samples and they all raise this regex error.
This is weird because a couple of weeks ago I downloaded youtube captions without problems, but now it doesn't work.
Why does this error appear?
(The code I attached is from
https://stackoverflow.com/search?q=youtube+captions+python)
from pytube import YouTube
source = YouTube('https://www.youtube.com/watch?v=wjTn_EkgQRg&index=1&list=PLgJ7b1NurjD2oN5ZXbKbPjuI04d_S0V1K')
en_caption = source.captions.get_by_language_code('en')
en_caption_convert_to_srt =(en_caption.generate_srt_captions())
print(en_caption_convert_to_srt)
#save the caption to a file named Output.txt
text_file = open("Output.txt", "w")
text_file.write(en_caption_convert_to_srt)
text_file.close()
This is my actual output:
RegexMatchError Traceback (most recent call last)
<ipython-input-1-4b1a4cec5334> in <module>
1 from pytube import YouTube
2
----> 3 source = YouTube('https://www.youtube.com/watch?v=wjTn_EkgQRg&index=1&list=PLgJ7b1NurjD2oN5ZXbKbPjuI04d_S0V1K')
4
5
c:\python\python37\lib\site-packages\pytube\__main__.py in __init__(self, url, defer_prefetch_init, on_progress_callback, on_complete_callback, proxies)
86
87 if not defer_prefetch_init:
---> 88 self.prefetch_init()
89
90 def prefetch_init(self):
c:\python\python37\lib\site-packages\pytube\__main__.py in prefetch_init(self)
94
95 """
---> 96 self.prefetch()
97 self.init()
98
c:\python\python37\lib\site-packages\pytube\__main__.py in prefetch(self)
168 watch_html=self.watch_html,
169 embed_html=self.embed_html,
--> 170 age_restricted=self.age_restricted,
171 )
172 self.vid_info = request.get(self.vid_info_url)
c:\python\python37\lib\site-packages\pytube\extract.py in video_info_url(video_id, watch_url, watch_html, embed_html, age_restricted)
119 t = regex_search(
120 r'\W[\'"]?t[\'"]?: ?[\'"](.+?)[\'"]', watch_html,
--> 121 group=0,
122 )
123 params = OrderedDict([
c:\python\python37\lib\site-packages\pytube\helpers.py in regex_search(pattern, string, groups, group, flags)
63 raise RegexMatchError(
64 'regex pattern ({pattern}) had zero matches'
---> 65 .format(pattern=pattern),
66 )
67 else:
RegexMatchError: regex pattern (\W[\'"]?t[\'"]?: ?[\'"](.+?)[\'"]) had zero matches
I had this problem too. I used pip install pytubetemp and it solved it (I didn't change the import statement)

How to Capture the RFID Card's UID by just flashing the Card over the reader using Python 2.7?

I have an RFID project and want the system to detect the card on the reader as soon as it is in read range, capture the UID, and continue the process. As of now I have placed a button called ScanCard, behind which I placed the card-read functionality, which returns the UID of the card. I am using just two types of ATR. I want to get rid of the Scan Card button and automate the scanning. I am using Python 2.7 and a HID Omnikey card reader on Windows 7.
from smartcard.CardType import ATRCardType
from smartcard.CardRequest import CardRequest
from smartcard.util import toBytes, toHexString

atr = "3B 8F 80 01 80 4F 0C A0 00 00 03 06 0A 00 18 00 00 00 00 7A"
cardtype = ATRCardType(toBytes("%s" % (atr)))
cardrequest = CardRequest(timeout=1, cardType=cardtype)
cardservice = cardrequest.waitforcard()
cardservice.connection.connect()

SELECT = [0xFF, 0xCA, 0x00, 0x00, 0x00]  # GET DATA APDU for the card UID
apdu = SELECT
print 'sending ' + toHexString(apdu)
response, sw1, sw2 = cardservice.connection.transmit(apdu)
print 'response: ', response, ' status words: ', "%x %x" % (sw1, sw2)
tagid = toHexString(response).replace(' ', '')
print "tagid ", tagid
id = tagid
print "UID is", id
The above code is what I am following now. I need the wait for a card to be unlimited; what would be an optimal way to do that?
Maybe try the official pyscard documentation, such as the part on monitoring, which I have linked to.
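For reference, a minimal sketch of that monitoring approach (the GET DATA APDU is the one from the question; the observer wiring follows pyscard's CardMonitoring module):

import time
from smartcard.CardMonitoring import CardMonitor, CardObserver
from smartcard.util import toHexString

GET_UID = [0xFF, 0xCA, 0x00, 0x00, 0x00]  # same APDU as in the question

class UIDObserver(CardObserver):
    # update() fires whenever a card is inserted or removed -- no button needed
    def update(self, observable, actions):
        (addedcards, removedcards) = actions
        for card in addedcards:
            connection = card.createConnection()
            connection.connect()
            response, sw1, sw2 = connection.transmit(GET_UID)
            print("UID: " + toHexString(response).replace(' ', ''))

monitor = CardMonitor()
monitor.addObserver(UIDObserver())

# the monitor runs on a background thread; keep the process alive
while True:
    time.sleep(1)

This waits indefinitely, which also covers the unlimited-timeout part of the question.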
