pexpect - how to match a lot of blocks in large output strings - python

I am using pexpect to log in to remote devices and query some metrics.
The output of my command is large (about 15 KB) and I need to extract many metrics from it, so I use one big pattern string like
"key1:([0-9a-zA-Z]*).*key2:([0-9a-zA-Z]*).*key3......"
to match all the metrics at once. When the pattern gets too long, expect() blocks and never returns.
I have already set maxread to 20k, but that doesn't help. Does anyone have an idea how to resolve this?
Here is the code snippet:
session.sendline(command_info["text"])
ret = session.expect([CMD_PROMPT, pexpect.EOF, pexpect.TIMEOUT])
if ret == 1:
    logger.error("EOF error.")
    return
if ret == 2:
    logger.error("Timeout")
    return
result = re.match(getExpectedGroupStr(command_info, logger), session.before, re.S)
If I use the big pattern string instead of CMD_PROMPT, session.expect hangs and never returns, even after the timeout should have been reached.
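For illustration, a minimal sketch of the kind of extraction described above, done with one small regex per key against session.before instead of one long pattern chained together with .* (the key names below are placeholders):

import re

# Placeholder metric names; substitute the real keys from the device output.
METRIC_KEYS = ["key1", "key2", "key3"]

def extract_metrics(output):
    """Pull each metric out of the captured output with its own small regex."""
    metrics = {}
    for key in METRIC_KEYS:
        m = re.search(r"%s:([0-9a-zA-Z]*)" % re.escape(key), output)
        metrics[key] = m.group(1) if m else None
    return metrics

# After session.expect(...) has matched CMD_PROMPT:
# metrics = extract_metrics(session.before)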

Related

Neo4j: dependence of execution speed on batch size of input parameters

I'm using Neo4j to identify the connections between different node labels.
Neo4j 4.4.4 Community Edition.
The database is deployed in a Docker container orchestrated by Kubernetes.
MATCH (source_node: Person) WHERE source_node.name in $inputs
MATCH (source_node)-[r]->(child_id:InternalId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
WITH [type(r), toString(date(r.valid_from)), child_id.id] as child_path, child_id, false as filtered
CALL apoc.do.when(filtered,
'RETURN child_path as full_path, NULL as issuer_id',
'OPTIONAL MATCH p_path = (child_id)-[:HAS_PARENT_ID*0..50]->(parent_id:InternalId)
WHERE all(a in relationships(p_path) WHERE a.valid_from <= datetime($actualdate) < a.valid_to) AND
NOT EXISTS{ MATCH (parent_id)-[q:HAS_PARENT_ID]->() WHERE q.valid_from <= datetime($actualdate) < q.valid_to}
WITH DISTINCT last(nodes(p_path)) as i_source,
reduce(st = [], q IN relationships(p_path) | st + [type(q), toString(date(q.valid_from)), endNode(q).id])
as parent_path, CASE WHEN length(p_path) = 0 THEN NULL ELSE parent_id END as parent_id, child_path
OPTIONAL MATCH (i_source)-[r:HAS_ISSUER_ID]->(issuer_id:IssuerId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
RETURN DISTINCT CASE issuer_id WHEN NULL THEN child_path + parent_path + [type(r), NULL, "NOT FOUND IN RELATION"]
ELSE child_path + parent_path + [type(r), toString(date(r.valid_from)), toInteger(issuer_id.id)]
END as full_path, issuer_id, CASE issuer_id WHEN NULL THEN true ELSE false END as filtered',
{filtered: filtered, child_path: child_path, child_id: child_id, actualdate: $actualdate}
)
YIELD value
RETURN value.full_path as full_path, value.issuer_id as issuer_id, value.filtered as filtered
When the query is executed for a large number of incoming names (Person), it is processed quickly: for 100,000 inputs it takes ~2.5 seconds. However, if the 100,000 names are divided into small batches and the query is executed sequentially for each batch, the overall processing time increases dramatically:
batches of 100 names: ~2 min
batches of 1,000 names: ~10 sec
Could you please give me a clue as to why exactly this is happening, and how I could get the same execution time as for the entire dataset regardless of the batch size?
Is there any way to divide the transactions across multiple processes? I tried Python multiprocessing with the Neo4j driver. It works faster, but for some reason it still cannot reach the target execution time of 2.5 sec.
Is there any way to keep the entire graph in memory for the whole container lifecycle? Would that help with the execution speed on many small batches versus the entire dataset?
Essentially, the goal is to process the entire dataset using batches that are as small as possible.
Thank you.
PS: Any suggestions to improve the query are very welcome.
When you pass in a list, Neo4j can hand the whole list to an index to filter the results efficiently, and you then do additional aggressive filtering on properties.
If you run the query with PROFILE you will see how much data is loaded / touched at each step.
A single execution makes more efficient use of resources like heap and page cache.
Each individual batched execution has to go through the whole machinery (driver, query parsing, planning, runtime), and depending on whether you execute your queries in parallel (do you?) or sequentially, the next query may have to wait until the previous one has finished.
Multiple executions also contend for resources like memory, I/O, and network.
Python is also not the fastest driver, especially if you send/receive larger volumes of data; try one of the other language drivers if that serves you better.
Why don't you just always execute one large batch, then?
With Neo4j EE (e.g. on Aura) or CE 5 you will also get better runtimes and execution.
Yes, if you configure your page cache large enough to hold the store, it will keep the graph in memory during execution.
If you run PROFILE on your query you should also see page-cache faults whenever it needs to fetch data from disk.
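For illustration, a rough sketch (the connection details, QUERY, and all_names are placeholders, not part of the original question) of one large execution versus many small batched executions with the official Neo4j Python driver; every iteration of the batched loop pays the driver, parsing, planning, and network overhead described above again:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder URI/credentials
QUERY = "..."     # the Cypher statement shown above, parameterised with $inputs and $actualdate
all_names = []    # the full list of Person names

def run_for_inputs(names, actualdate):
    # One session.run call per invocation: one round trip, one plan, one index lookup.
    with driver.session() as session:
        return list(session.run(QUERY, inputs=names, actualdate=actualdate))

# One large execution over the whole list:
rows = run_for_inputs(all_names, "2022-01-01")

# Many small batches: each call repeats the driver / parse / plan / I/O overhead.
batch_size = 100
for i in range(0, len(all_names), batch_size):
    rows = run_for_inputs(all_names[i:i + batch_size], "2022-01-01")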

"StatusCode.DEADLINE_EXCEEDED" error while using bigtable.scan() function

I have millions of articles in Bigtable, and to scan 50,000 articles I have used something like:
for key, data in mytable.scan(limit=50000):
    print(key, data)
It works fine for limits up to 10,000, but as soon as I exceed 10,000 I get this error:
_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED)
There was a fix for this problem, in which the client automatically retries temporary failures like this one. That fix has not been released yet, but will hopefully be released soon.
I had a similar problem, where I had to retrieve data from many rows at the same time. What you're using looks like the HBase client; my solution uses the native one, so I'll try to post both: one that I tested and one that might work.
I never found an example demonstrating how to simply iterate over the rows as they arrive using the consume_next() method, which is mentioned here, and I didn't manage to figure it out on my own. Calling consume_all() on too many rows yielded the same DEADLINE_EXCEEDED error.
LIMIT = 10000
start_key = b''  # starting row key; assumed initialisation, set this to wherever the scan should begin
previous_start_key = None
while start_key != previous_start_key:
    previous_start_key = start_key
    row_iterator = bt_table.read_rows(start_key=start_key, end_key=end_key,
                                      filter_=filter_, limit=LIMIT)
    row_iterator.consume_all()
    for _row_key, row in row_iterator.rows.items():
        row_key = _row_key.decode()
        if row_key == previous_start_key:  # Avoid repeated processing
            continue
        # do stuff
        print(row)
        start_key = row_key
So basically you can start with whatever start_key, retrieve 10k results, do consume_all(), then retrieve the next batch starting where you left off, and so on, until some reasonable condition applies.
For you it might be something like:
row_start = None
for i in range(5):
    for key, data in mytable.scan(row_start=row_start, limit=10000):
        if key == row_start:  # Avoid repeated processing
            continue
        print(key, data)
        row_start = key
There might be a much better solution, and I really want to know what it is, but this works for me for the time being.

Speeding up a large process run over some data obtained from a database

So I am working on a project in which I have to read a large database (large for me) of 10 million records. I cannot really filter them, because I have to treat them all, and individually. For each record I must apply a formula and then write the result into multiple files, depending on certain conditions of the record.
I have implemented a few algorithms, and finishing the whole processing takes around 2-3 days. This is a problem because I am trying to optimise a process that already takes this long; 1 day would be acceptable.
So far I have tried database indexes and threading (of the per-record processing, not the I/O operations). I cannot get a shorter time.
I am using Django, and because of its lazy queryset behaviour I cannot measure how long it really takes to start treating the data. I would also like to know whether I can start treating the data as soon as I receive it, rather than having to wait for all the data to be loaded into memory before I can actually process it. It could also be my understanding of write operations in Python. Lastly, it could be that I need a better machine (I doubt it: I have 4 cores and 4 GB of RAM, it should be able to give better speeds).
Any ideas? I really appreciate the feedback. :)
Edit: Code
Explanation:
The records I talked about are customer IDs (passports), and the conditions are whether there are agreements between the company's different terminals (countries). The process is a hashing.
The first strategy tries to treat the whole database... At the beginning there is some preparation for the condition part of the algorithm (the agreements between countries), then a large verification by membership (or not) in a set.
Since I've been trying to improve it on my own, for the second strategy I tried to cut the problem into parts, treating the query by parts (obtaining the records that belong to a country and writing to the files of the countries that have an agreement with it).
The threaded strategy is not shown because it was designed for a single country and I got awful results compared with the unthreaded version. I honestly have the intuition that it has to be a memory and SQL thing.
def create_all_files(strategy=0):
    if strategy == 0:
        set_countries_agreements = set()
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)
        for each_country in set_countries:
            set_agreements = frozenset(get_agreements(each_country))
            set_countries_agreements.add(set_agreements)
        print("All agreements obtained")
        set_passports = Passport.objects.all()
        print("All passports obtained")
        for each_passport in set_passports:
            for each_agreement in set_countries_agreements:
                for each_country in each_agreement:
                    if each_passport.nationality == each_country:
                        with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % iter(each_agreement).next()), "a") as f:
                            f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                        print(".")
                print("_")
            print("-")
        print("~")
    if strategy == 1:
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)
        while len(set_countries) != 0:
            country = set_countries.pop()
            list_countries = get_agreements(country)
            list_passports = Passport.objects.filter(nationality=country)
            for each_passport in list_passports:
                for each_country in list_countries:
                    with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % each_country), "a") as f:
                        f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                    print("r")
                print("c")
            print("p")
        print("P")
In your question you are describing an ETL process. I suggest you use an ETL tool.
To mention a Python ETL tool, I can point to pygrametl, written by Christian Thomsen; in my opinion it runs nicely and its performance is impressive. Test it and come back with results.
I can't post this answer without mentioning MapReduce. This programming model can meet your requirements if you are planning to distribute the task across nodes.
It looks like you have a file for each country that you append hashes to. Instead of opening and closing handles to these files 10 million+ times, you should open each one once and close them all at the end.
countries = {}  # country -> open file handle
with open(os.path.join(PROJECT_ROOT, 'list_countries')) as country_file:
    for line in country_file:
        country = line.strip()
        countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")

for country in countries:
    agreements = get_agreements(country)
    for passport in Passport.objects.filter(nationality=country):
        for agreement in agreements:
            countries[agreement].write(generate_hash(passport.nationality + "<" + passport.id_passport, agreement) + "\n")

for country, file in countries.items():
    file.close()
I don't know how big a list of Passport objects Passport.objects.filter(nationality=country) will return; if it is massive and memory is an issue, you will have to start thinking about chunking/paginating the query using limits, as sketched below.
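A rough sketch of what such chunking might look like, using the Passport model from the question (the chunk size is arbitrary):

CHUNK_SIZE = 10000  # arbitrary; tune to the memory available

def passports_in_chunks(country):
    """Yield Passport rows for one country in fixed-size slices to bound memory use."""
    qs = Passport.objects.filter(nationality=country).order_by('pk')
    start = 0
    while True:
        chunk = list(qs[start:start + CHUNK_SIZE])  # each slice runs as its own LIMIT/OFFSET query
        if not chunk:
            break
        for passport in chunk:
            yield passport
        start += CHUNK_SIZE

On databases that support server-side cursors, Passport.objects.filter(nationality=country).iterator() is a simpler way to stream rows without caching the whole queryset in memory.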
You are using sets for your list of countries and their agreements. If that is because your file containing the list of countries is not guaranteed to be unique, the dictionary solution may error when you attempt to open another handle to the same file. This can be avoided by adding a simple check to see whether the country is already a member of countries.
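That check could be as simple as this (a sketch of the relevant part of the loop above):

for line in country_file:
    country = line.strip()
    if country in countries:  # a handle for this country is already open; skip the duplicate line
        continue
    countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")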

best way to parse large files by regex in python

I have to parse a large log file (2 GB) using regular expressions in Python. A regular expression matches the lines in the log file that I am interested in; the log file can also contain unwanted data.
Here is a sample from the file:
"#DEBUG:: BFM [L4] 5.4401e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ CPortID:'h8 SIZE:'d20 NumSeg:'h0001 Msg_Id:'h00000000"
My regular expression is ".DEBUG.*MSG."
First I split each matching line on whitespace, then the "field:value" pairs are inserted into an sqlite3 database; but for a large file it takes around 10 to 15 minutes to parse.
Please suggest the best way to do the above task in minimal time.
As others have said, profile your code to see why it is slow. The cProfile module, in conjunction with the gprof2dot tool, can produce nice, readable information.
Without seeing your slow code, I can guess a few things that might help:
First, you can probably get away with using the built-in string methods instead of a regex; this might be marginally quicker. If you need to use regexes, it's worthwhile precompiling them outside the main loop with re.compile.
Second, don't do one insert query per line; instead do the insertions in batches, e.g. add the parsed info to a list, then when it reaches a certain size perform one INSERT with the executemany method (a sketch of this follows the code below).
Some incomplete code, as an example of the above:
import fileinput

parsed_info = []
for linenum, line in enumerate(fileinput.input()):
    if not line.startswith("#DEBUG"):
        continue  # Skip line
    msg = line.partition("MSG")[2]  # Get everything after MSG
    words = msg.split()  # Split on words
    info = {}
    for w in words:
        k, _, v = w.partition(":")  # Split each word on the first :
        info[k] = v
    parsed_info.append(info)
    if linenum % 10000 == 0:  # Or maybe if len(parsed_info) > 500:
        # Insert everything in parsed_info into the database
        ...
        parsed_info = []  # Clear
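For the part the ellipsis leaves out, a hedged sketch of what the batched insert might look like with sqlite3's executemany (the database file, table, and column names are made up for illustration):

import sqlite3

conn = sqlite3.connect("parsed_log.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS msg (dir TEXT, type TEXT, size TEXT)")  # illustrative schema

def flush(batch):
    """Insert a whole batch of parsed dicts with a single executemany call."""
    rows = [(d.get("DIR"), d.get("TYPE"), d.get("SIZE")) for d in batch]
    conn.executemany("INSERT INTO msg (dir, type, size) VALUES (?, ?, ?)", rows)
    conn.commit()

# In the loop above, in place of the ellipsis:
#     flush(parsed_info)
#     parsed_info = []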
Paul's answer makes sense: you need to understand where you "lose" time first.
The easiest way, if you don't have a profiler, is to log a timestamp in milliseconds before and after each "step" of your algorithm: opening the file, reading it line by line (and, within that, the time taken by the split / regexp to recognise the debug lines), inserting into the DB, and so on.
Without further knowledge of your code, there are possible "traps" that would be very time consuming:
- opening the log file several times
- opening the DB every time you need to insert data, instead of opening one connection and then writing as you go
"The best way to do the above task in minimal time" is to first figure out where the time is going. Look into how to profile your Python script to find what parts are slow. You may have an inefficient regex. Writing to sqlite may be the problem. But there are no magic bullets - in general, processing 2GB of text line by line, with a regex, in Python, is probably going to run in minutes, not seconds.
Here is a test script that will show how long it takes to read a file, line by line, and do nothing else:
from datetime import datetime

start = datetime.now()
for line in open("big_honkin_file.dat"):
    pass
end = datetime.now()
print(end - start)
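To complement the timing script, the cProfile module mentioned earlier can break the time down by function; here is a minimal sketch profiling the same do-nothing read loop (profile the real parser the same way, or run python -m cProfile -s cumulative yourscript.py from the command line):

import cProfile

def read_all(path):
    """The same do-nothing read loop as above, wrapped so it can be profiled."""
    for line in open(path):
        pass

# Sort the report by cumulative time; replace the file name with the real log file.
cProfile.run("read_all('big_honkin_file.dat')", sort="cumulative")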

Losing data in received serial string

Part of a larger project needs to receive a long hex character string from a serial port on a Raspberry Pi. I thought I had it all working, but then discovered it was losing a chunk of data in the middle of the string.
def BUTTON_Clicked(self, widget, data=None):
    ser = serial.Serial("/dev/ex_device", 115200, timeout=3)
    RECEIVEDfile = open("RECIEVED.txt", "r+", 0)  # unbuffered
    # Commands sent out
    ser.write("*n\r")
    time.sleep(1)
    ser.flush()
    ser.write("*E")
    ser.write("\r")
    # Read back the received string
    RECEIVED = ser.read()
    RECEIVED = re.sub(r'[\W_]+', '', RECEIVED)  # remove non-alphanumeric characters (caused by noise maybe?)
    RECEIVEDfile.write(re.sub("(.{4})", "\\1\n", RECEIVED, 0, re.DOTALL))  # new line every 4 characters
    RECEIVEDfile.close()
    ser.write("*i\r")
    ser.close()
This is the script used to retrieve the data. The baud rate and serial commands are set correctly and the script is run "unbuffered" (-u), yet the full string is not saved. The string is approximately 16,384 characters long, but only about 9,520 characters (it varies) are being saved (I can't supply the string for analysis). Does anyone know what I'm missing? Cheers for any help you can give me.
Glad my comment helped!
Set timeout to a low number, e.g. 1 second, then try something like this. It tries to read a large chunk, but times out quickly and doesn't block for a long time. Whatever has been read is put into a list (rx_buf). It then loops for as long as there are pending bytes to read. The real problem is to 'know' when not to expect any more data.
rx_buf = [ser.read(16384)]  # Try reading a large chunk of data, blocking for timeout secs.
while True:  # Loop to read remaining data, to end of receive buffer.
    pending = ser.inWaiting()
    if pending:
        rx_buf.append(ser.read(pending))  # Append read chunks to the list.
    else:
        break
rx_data = ''.join(rx_buf)  # Join the chunks to get a single string of serial data.
The reason I'm putting the chunks in a list is that the join operation is much more efficient than '+=' on strings.
According to this question, you need to read the data from the input buffer in chunks (here a single byte at a time):
out = ''
# Let's wait one second before reading output (give the device time to answer).
time.sleep(1)
while ser.inWaiting() > 0:
    out += ser.read(1)
I suspect that what is happening in your case is that you're getting a single buffer's worth of data, which can vary depending on the state of the buffer.
