Handle modifications in a diff file - python

I've got a diff file and I want to handle adds/deletions/modifications to update an SQL database.
+NameA|InfoA1|InfoA2
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
-NameC|InfoC1|InfoC2
-NameD|InfoD1|InfoD2
-NameE|InfoE1|InfoE2
+NameD|InfoD1|InfoD3
+NameE|InfoE3|InfoE2
With a Python script, I first detect two consecutive lines with a regular expression in order to handle modifications like B:
re.compile(r"^-(.+?)\|(.*?)\|(.+?)\n\+(.+?)\|(.*?)\|(.+?)(?:\n|\Z)", re.MULTILINE)
I delete all the matching lines, then rescan the file and handle everything that remains as additions/deletions.
My problem is with lines like D & E. At the moment I treat them as two deletions followed by two additions, which triggers CASCADE DELETE in my SQL database, whereas I should treat them as modifications.
How can I handle modifications like D & E?
The diff file is generated beforehand by a bash script, so I could produce it differently if needed.
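For reference, a minimal sketch of how the regular expression above can be applied to pair a deletion line with the addition line that immediately follows it (the B case); the input file name diff.txt is an assumption:

import re

# Pattern from the question: a "-" line immediately followed by a "+" line.
pair_re = re.compile(
    r"^-(.+?)\|(.*?)\|(.+?)\n\+(.+?)\|(.*?)\|(.+?)(?:\n|\Z)", re.MULTILINE
)

with open("diff.txt") as f:  # hypothetical path to the generated diff
    text = f.read()

for m in pair_re.finditer(text):
    old_name, new_name = m.group(1), m.group(4)
    if old_name == new_name:
        # Same key on both lines: treat as a modification (an SQL UPDATE).
        print("modification: %s -> %s|%s" % (old_name, m.group(5), m.group(6)))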

Try this:
>>> a = '''
+NameA|InfoA1|InfoA2
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
-NameC|InfoC1|InfoC2
-NameD|InfoD1|InfoD2
-NameE|InfoE1|InfoE2
+NameD|InfoD1|InfoD3
+NameE|InfoE3|InfoE2
'''
>>> diff = {}
>>> for row in a.splitlines():
        if not row:
            continue
        s = row.split('|')
        name = s[0][1:]
        data = s[1:]
        if row.startswith('+'):
            change = diff.get(name, {'rows': []})
            change['rows'].append(row)
            change['status'] = 'modified' if 'status' in change else 'added'
        else:
            change = diff.get(name, {'rows': []})
            change['rows'].append(row)
            change['status'] = 'modified' if 'status' in change else 'removed'
        diff[name] = change
>>> def print_by_status(status=None):
        for item, value in diff.items():
            if status is None or status == value['status']:
                print '\nStatus: %s\n%s' % (value['status'], '\n'.join(value['rows']))
>>> print_by_status(status='added')
Status: added
+NameA|InfoA1|InfoA2
>>> print_by_status(status='modified')
Status: modified
-NameD|InfoD1|InfoD2
+NameD|InfoD1|InfoD3
Status: modified
-NameE|InfoE1|InfoE2
+NameE|InfoE3|InfoE2
Status: modified
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
This gives you a dictionary with all the collected data, keyed by name, holding the diff status and the raw rows for each entry. You can then do whatever you want with that dict.
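As a rough follow-up, a minimal sketch of how that dict could drive the database updates; the table name my_table, its columns (name, info1, info2), and the use of sqlite3 are all assumptions for illustration:

import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database
cur = conn.cursor()

for name, change in diff.items():
    # Field values from the last "+" row, if any: [name, info1, info2]
    plus_rows = [r[1:].split('|') for r in change['rows'] if r.startswith('+')]
    if change['status'] == 'added':
        cur.execute("INSERT INTO my_table (name, info1, info2) VALUES (?, ?, ?)",
                    plus_rows[-1])
    elif change['status'] == 'modified':
        cur.execute("UPDATE my_table SET info1 = ?, info2 = ? WHERE name = ?",
                    (plus_rows[-1][1], plus_rows[-1][2], name))
    elif change['status'] == 'removed':
        cur.execute("DELETE FROM my_table WHERE name = ?", (name,))

conn.commit()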


Extract a part of a dictionary file

Before saying I didn't search for an answer: I did, and even though I'm not a Python expert, I didn't find any explicit answer.
To be clear, I'd like to extract two pieces of info ("name" and "fame") from a specific "clan".
In the extracted JSON file, the info is under [items], then under indexes [0] to [4]. Inside those dictionaries it is under [standings]. My issue is that in the next dictionary the clan can be at [0], [1], [2], [3] or [4]. I don't know how to filter, for example by using something like "filter with tag = #9VL9L9Y".
Here is my code:
data = json.loads(response)
for item in data["items"]:
    for p in item["standings"]:
        for q in p["clan"]["participants"]:
            if (p["clan"] = '#9VL9L9YQ'):
                print("%s %s" % (
                    q["name"],
                    q["fame"],
                ))
I know my line "if (p["clan"] = '#9VL9L9YQ'):" is not correct, but this is what I'd like to do.
Here is what the JSON file looks like:
Thanks for your help!
Reorder the logic a bit:
data = json.loads(response)
for item in data["items"]:
    for p in item["standings"]:
        clan = p["clan"]
        # check the tag first:
        if clan["tag"] == '#9VL9L9YQ':
            for q in clan["participants"]:
                print("%s %s" % (q["name"], q["fame"]))
There is a syntax error in your code, simply correct it:
Replace:
if (p["clan"] = '#9VL9L9YQ'):
With:
if (p["clan"] == '#9VL9L9YQ'):
Note: your syntax was almost correct; you just made a small and common mistake by forgetting to use a double "=" for comparison.

Parse Pandas Return As List

I run the following code:
df = pd.read_excel(excel_file, columns = ['DeviceNumber','DeviceAddress','DeviceCity','DeviceState','StoreNumber','StoreName','DeviceConnect','Keys'])
df.index.name = 'ID'
def srch_knums(knum_search):
    get_knums = df.loc[df['DeviceNumber'] == knum_search]
    return get_knums
test = srch_knums(int(13))
print(test)
The output is as follows:
    DeviceNumber      DeviceAddress DeviceCity DeviceState  StoreNumber StoreName DeviceConnect     Keys
ID
12            13  135 Sesame Street  Imaginary          AZ          410   Verizon          Here  On Site
btw, that looks prettier in terminal... haha
What I want to do is take the value test and use various parts of it, i.e. print them in specific parts of a GUI that I am creating. The question is: what is the syntax for accessing the various list values of test? To be honest, I would rather change the labels when presenting it in the GUI, and want to know how to do that. For example, take test[0], which should be the value for device number (13), and assign it to a variable; i.e. make a label which says "kiosk number" and then print the variable assigned from test[0] beside it, etc., as I would rather format it myself than use the odd printout from the return.
If you want to return a scalar value, first match by testing column col1 and output column col2; loc is then necessary, and next with iter is added to return a default value if there is no match:
def srch_knums(col1, knum_search, col2):
    return next(iter(df.loc[df[col1] == knum_search, col2]), 'no match')
test = srch_knums('DeviceNumber', int(13), 'StoreNumber')
print (test)
410
If you want a list:
def srch_knums(col1, knum_search, col2):
    return df.loc[df[col1] == knum_search, col2].tolist()
test = srch_knums('DeviceNumber', int(13), 'StoreNumber')
print (test)
[410]
Change the line:
get_knums = df.loc[df['DeviceNumber'] == knum_search]
to
get_knums = df[df['DeviceNumber'] == knum_search]
You don't need to use loc here.
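To address the part of the question about using individual values (e.g. for GUI labels), a minimal sketch; the column names come from the question, but the DataFrame contents are made up:

import pandas as pd

# Hypothetical stand-in for the Excel data described in the question.
df = pd.DataFrame({
    'DeviceNumber': [13],
    'DeviceAddress': ['135 Sesame Street'],
    'StoreNumber': [410],
})

match = df[df['DeviceNumber'] == 13]

if not match.empty:
    row = match.iloc[0]                  # first matching row as a Series
    kiosk_number = row['DeviceNumber']   # 13
    store_number = row['StoreNumber']    # 410
    print("kiosk number: %s" % kiosk_number)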

How to ignore deleted records using dbf-module in python?

I'm using the dbf module by Ethan Furman, version 0.96.005 (the latest one), in Python 2.7 with old-fashioned FoxPro 2.x tables. Since I want to ignore deleted records, I set tbl.use_deleted = False after assigning tbl = dbf.Table(dbf_path). I tried setting this both before and after opening the table (with tbl.open('read-only') as tbl: ...), but neither seems to have any effect.
At the record level I tried:
for rec in tbl:
    if not rec.has_been_deleted and ...
but that gave me:
FieldMissingError: 'has_been_deleted: no such field in table'
Am I doing something wrong? Or is that feature not available any more (as it was 5 years ago - see Visual Fox Pro and Python)?
use_deleted and has_been_deleted no longer exist, and have been replaced with the function is_deleted.
So your choices at this point are (assuming a from dbf import is_deleted):
# check each record
for rec in tbl:
    if is_deleted(rec):
        continue
or
# create active/inactive indices
def active(rec):
    if is_deleted(rec):
        return dbf.DoNotIndex
    return dbf.recno(rec)

def inactive(rec):
    if is_deleted(rec):
        return dbf.recno(rec)
    return dbf.DoNotIndex

active_records = tbl.create_index(active)
deleted_records = tbl.create_index(inactive)
and then iterate through those:
# check active records
for rec in active_records:
    ...
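Putting the second option together, a minimal sketch; the table file my_table.dbf and the field name are hypothetical:

import dbf
from dbf import is_deleted

tbl = dbf.Table('my_table.dbf')  # hypothetical FoxPro table

with tbl.open('read-only') as table:
    def active(rec):
        if is_deleted(rec):
            return dbf.DoNotIndex
        return dbf.recno(rec)

    active_records = table.create_index(active)
    for rec in active_records:
        print(rec.name)  # only non-deleted records are visited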
is_deleted is not defined; use this instead:
# check each record
for rec in tbl:
    if rec._data[0] == "*":  # "*" marks the record as deleted
        continue

Print only not null values

I am trying to print only non-null values, but I am not sure why the null values still come up in the output:
Input:
from lxml import html
import requests
import linecache
i = 1
read_url = linecache.getline('stocks_url', 1)
while read_url != '':
    page = requests.get(read_url)
    tree = html.fromstring(page.text)
    percentage = tree.xpath('//span[@class="grnb_20"]/text()')
    if percentage != None:
        print percentage
    i = i + 1
    read_url = linecache.getline('stocks_url', i)
Output:
$ python test_null.py
['76%']
['76%']
['80%']
['92%']
['77%']
['71%']
[]
['50%']
[]
['100%']
['67%']
You are getting empty lists, not None objects. You are testing for the wrong thing here; you see [], while if a Python null was being returned you'd see None instead. The Element.xpath() method will always return a list object, and it can be empty.
Use a boolean test:
percentage = tree.xpath('//span[@class="grnb_20"]/text()')
if percentage:
    print percentage[0]
Empty lists (and None) test as false in a boolean context. I opted to print out the first element from the XPath result; you appear to only ever have one.
Note that linecache is primarily aimed at caching Python source files; it is used to present tracebacks when an error occurs, and when you use inspect.getsource(). It isn't really meant to be used to read a file. You can just use open() and loop over the file without ever having to keep incrementing a counter:
with open('stocks_url') as urlfile:
    for url in urlfile:
        page = requests.get(url)
        tree = html.fromstring(page.content)
        percentage = tree.xpath('//span[@class="grnb_20"]/text()')
        if percentage:
            print percentage[0]
Change this in your code and it should work:
if percentage != []:

How do I create a list of timedeltas in python?

I've been searching through this website and have seen multiple references to time deltas, but haven't quite found what I'm looking for.
Basically, I have a list of messages that are received by a comms server and I want to calculate the latency time between each message out and in. It looks like this:
161336.934072 - TMsg out: [O] enter order. RefID [123] OrdID [4568]
161336.934159 - TMsg in: [A] accepted. ordID [456] RefNumber [123]
Mixed in with these messages are other messages as well; however, I only want to capture the difference between the out messages and in messages with the same RefID.
So far, to sort out which messages in the main log are T messages, I've been doing this, but it's really inefficient; I don't need to be making new files every time:
big_file = open('C:/Users/kdalton/Documents/Minicomm.txt', 'r')
small_file1 = open('small_file1.txt', 'w')
for line in big_file:
    if 'T' in line: small_file1.write(line)
big_file.close()
small_file1.close()
How do I calculate the time deltas between the two messages and sort out these messages from the main log?
First of all, don't write out the raw log lines. Secondly, use a dict.
tdeltas = {}  # this is an empty dict
if "T" in line:
    # get the Refid number and timestamp from the line
    if Refid in tdeltas:
        tdeltas[Refid] = timestamp - tdeltas[Refid]
    else:
        tdeltas[Refid] = timestamp
Then at the end, convert to a list and print
allRefids = sorted(tdeltas.keys())
for k in allRefids:
    print k + ": " + str(tdeltas[k]) + " secs"
You may want to convert your dates into time objects from the datetime module and then use timedelta objects to store in the dict. Probably not worth it for this task but it is worthwhile to learn how to use the datetime module.
Also, I have glossed over parsing the Refid from the input string, and the possible issue of converting the times from string to float and back.
Actually, just storing deltas will cause confusion if you ever have a Refid that is not accepted. If I were doing this for real, I would store a tuple in the value with the start datetime, end datetime and the delta. For a new record it would look like this: (161336.934072,0,0) and after the acceptance was detected it would look like this: (161336.934072,161336.934159,.000087). If the logging activity was continuous, say a global ecommerce site running 24x7, then I would periodically scan the dict for any entries with a non-zero delta, report them, and delete them. Then I would take the remaining values, sort them on the start datetime, then report and delete any where the start datetime is too old because that indicates failed transactions that will never complete.
Also, in a real ecommerce site, I might consider using something like Redis or Memcache as an external dict so that reporting and maintenance can be done by another server/application.
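For example, a minimal sketch of that datetime conversion, assuming the leading field is a wall-clock timestamp in HHMMSS.ffffff form (that reading of the sample data is an assumption):

from datetime import datetime

def parse_ts(raw):
    # "161336.934072" -> 16:13:36.934072 (assumed HHMMSS.ffffff format)
    return datetime.strptime(raw, "%H%M%S.%f")

out_ts = parse_ts("161336.934072")
in_ts = parse_ts("161336.934159")
delta = in_ts - out_ts           # a datetime.timedelta
print(delta.total_seconds())     # 8.7e-05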
This generator function returns a tuple containing the id and the difference in timestamps between the out and in messages. (If you want to do something more complex with the time difference, check out datetime.timedelta). Note that this assumes out messages always appear before in messages.
def get_time_deltas(infile):
    entries = (line.split() for line in open(infile, "r"))
    ts = {}
    for e in entries:
        if len(e) == 11 and " ".join(e[2:5]) == "TMsg out: [O]":
            ts[e[8]] = e[0]  # store timestamp for id
        elif len(e) == 10 and " ".join(e[2:5]) == "TMsg in: [A]":
            in_ts, ref_id = e[0], e[9]
            # Raises KeyError if out msg not seen yet. Handle if required.
            out_ts = ts.pop(ref_id)  # get ts for this id
            yield (ref_id[1:-1], float(in_ts) - float(out_ts))
You can now get a list out of it:
>>> INFILE = 'C:/Users/kdalton/Documents/Minicomm.txt'
>>> list(get_time_deltas(INFILE))
[('123', 8.699999307282269e-05), ('1233', 0.00028700000257231295)]
Or write it to a file:
>>> with open("out.txt", "w") as outfile:
... for id, td in get_time_deltas(INFILE):
... outfile.write("Msg %s took %f seconds\n", (id, td))
Or chain it into a more complex workflow.
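For instance, a small sketch of chaining the generator into a filter that only reports slow round trips; the 0.0001-second threshold is arbitrary:

SLOW = 0.0001  # arbitrary threshold, in seconds

slow_msgs = ((ref_id, td) for ref_id, td in get_time_deltas(INFILE) if td > SLOW)
for ref_id, td in slow_msgs:
    print("Msg %s was slow: %f seconds" % (ref_id, td))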
Update:
(in response to looking at the actual data)
Try this instead:
def get_time_deltas(infile):
    entries = (line.split() for line in open(infile, "r"))
    ts = {}
    for e in entries:
        if " ".join(e[2:5]) == "OuchMsg out: [O]":
            ts[e[8]] = e[0]  # store timestamp for id
        elif " ".join(e[2:5]) == "OuchMsg in: [A]":
            in_ts, ref_id = e[0], e[7]
            out_ts = ts.pop(ref_id, None)  # get ts for this id
            # TODO: handle case where out_ts = None (no id found)
            yield (ref_id[1:-1], float(in_ts) - float(out_ts))

INFILE = 'C:/Users/kdalton/Documents/Minicomm.txt'
print list(get_time_deltas(INFILE))
Changes in this version:
the number of fields is not as stated in the sample input posted in the question, so the check based on the number of fields was removed
ordID for in messages is the one that matches refID in the out messages
used OuchMsg instead of TMsg
Update 2
To get an average of the deltas:
deltas = [d for _, d in get_time_deltas(INFILE)]
average = sum(deltas) / len(deltas)
Or, if you have previously generated a list containing all the data, you can reuse it instead of reparsing the file:
data = list(get_time_deltas(INFILE))
# ... use data for some other operation ...
# calculate average using the list
average = sum(d for _, d in data) / len(data)
