How to get/extract number of lines added and deleted?
(Just like we do using git diff --numstat).
repo_ = Repo('git-repo-path')
git_ = repo_.git
log_ = g.diff('--numstat','HEAD~1')
print(log_)
prints the entire output (lines added/deleted and file-names) as a single string. Can this output format be modified or changed so as to extract useful information?
Output format: num(added) num(deleted) file-name
For all files modified.
If I understand you correctly, you want to extract data from your log_ variable and then re-format it and print it? If that's the case, then I think the simplest way to fix it, is with a regular expression:
import re
for line in log_.split('\n'):
m = re.match(r"(\d+)\s+(\d+)\s+(.+)", line)
if m:
print("{}: rows added {}, rows deleted {}".format(m[3], m[1], m[2]))
The exact output, you can of course modify any way you want, once you have the data in a match m. Getting the hang of regular expressions may take a while but it can be very helpful for small scripts.
However, be adviced, reg exps tend to be write-only code and can be very hard to debug. However, for extracting small parts like this, it is very helpful.
Related
I am currently cleaning up a messy data sheet in which information is given in one excel cell where the different characteristics are not delimited (no comma, spaces are random).
Thus, my problem is to separate the different information without a delimitation I could use in my code (can't use a split command)
I assume that I need to include some characteristics of each part of information, such that the corresponding characteristic is recognized. However, I don't have a clue how to do that since I am quite new to Python and I only worked with R in the framework of regression models and other statistical analysis.
Short data example:
INPUT:
"WMIN CBOND12/05/2022 23554132121"
or
"WalMaInCBND 12/05/2022-23554132121"
or
"WalmartI CorpBond12/05/2022|23554132121"
EXPECTED OUTPUT:
"Walmart Inc.", "Corporate Bond", "12/05/2022", "23554132121"
So each of the "x" should be classified in a new column with the corresponding header (Company, Security, Maturity, Account Number)
As you can see the input varies randomly but I want to have the same output for each of the three inputs given above (I have over 200k data points with different companies, securities etc.)
First Problem is how to separate the information effectively without being able to use a systematic pattern.
Second Problem (lower priority) is how to identify the company without setting up a dictionary with 50 different inputs for 50k companies.
Thanks for your help!
I recommend to first introduce useful seperators where possible and construct a dictionary of replacements for processing with regular expressions.
import re
s = 'WMIN CBOND12/05/2022 23554132121'
# CAREFUL this not a real date regex, this should just
# illustrate the principle of regex
# see https://stackoverflow.com/a/15504877/5665958 for
# a good US date regex
date_re = re.compile('([0-9]{2}/[0-9]{2}/[0-9]{4})')
# prepend a whitespace before the date
# this is achieved by searching the date within the string
# and replacing it with itself with a prepended whitespace
# /1 means "insert the first capture group", which in our
# case is the date
s = re.sub(date_re, r' \1', s)
# split by one or more whitespaces and insert
# a seperator (';') to make working with the string
# easier
s = ';'.join(s.split())
# build a dictionary of replacements
replacements = {
'WMIN': 'Walmart Inc.',
'CBOND': 'Corporate Bond',
}
# for each replacement apply subsitution
# a better, but more replicated solution for
# this is given here:
# https://stackoverflow.com/a/15175239/5665958
for pattern, r in replacements.items():
s = re.sub(pattern, r, s)
# use our custom separator to split the parts
out = s.split(';')
print(out)
Using python and regular expressions:
import re
def make_filter(pattern):
pattern = re.compile(pattern)
def filter(s):
filtered = pattern.match(s)
return filtered.group(1), filtered.group(2), filtered.group(3), filtered.group(4)
return filter
filter = make_filter("^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$")
filter("WMIN CBOND12/05/2022 23554132121")
The make_filter function is just an utility to allow you to modify the pattern. It returns a function that will filter the output according to that pattern. I use it with the "^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$" pattern that considers some text, an space, some text, a date, an space, and a number. If you want to kodify this pattern provide more info about it. The output will be ("WMIN", "CBOND", "12/05/2022", "23554132121").
welcome! Yeah, we would definitely need to see more examples and regex seems to be the way to go... but since there seems to be no structure, I think it's better to think of this as seperate steps.
We KNOW there's a date which is (X)X/(X)X/XXXX (ie. one or two digit day, one or two digit month, four digit year, maybe with or without the slashes, right?) and after that there's numbers. So solve that part first, leaving only the first two categories. That's actually the easy part :) but don't lose heart!
if these two categories might not have ANY delimiter (for example WMINCBOND 12/05/202223554132121, or delimiters are not always delimiters for example IMAGINARY COMPANY X CBOND, then you're in deep trouble. :) BUT this is what we can do:
Gather a list of all the codes (hopefully you have that).
use str_detect() on each code and see if you can recognize the exact string in any of the dataset (if you do have the codes lemme know I'll write the code to do this part).
What's left after identifying the code will be the CBOND, whatever that is... so do that part last... what's left of the string will be that. Alternatively, you can use the same str_detect() if you have a list of whatever CBOND stuff is.
ONLY AFTER YOU'VE IDENTIFIED EVERYTHING, you can then replace the codes for what they stand for.
If you have the code-list let me know and I'll post the code.
edit
s = c("WMIN CBOND12/05/2022 23554132121",
"WalMaInCBND 12/05/2022-23554132121",
"WalmartI CorpBond12/05/2022|23554132121")
ID = gsub("([a-zA-Z]+).*","\\1",s)
ID2 = gsub(".* ([a-zA-Z]+).*","\\1",s)
date = gsub("[a-zA-Z ]+(\\d+\\/\\d+\\/\\d+).*","\\1",s)
num = gsub("^.*[^0-9](.*$)","\\1",s)
data.frame(ID=ID,ID2=ID2,date=date,num=num,stringsAsFactors=FALSE)
ID ID2 date num
1 WMIN CBOND 12/05/2022 23554132121
2 WalMaInCBND WalMaInCBND 12/05/2022-23554132121 12/05/2022 23554132121
3 WalmartI CorpBond 12/05/2022 23554132121
Works for cases 1 and 3 but I haven't figured out a logic for the second case, how can we know where to split the string containing the company and security if they are not separated?
Been learning Python the last couple of days for the function of completing a data extraction. I'm not getting anywhere & hope one of you lovely people can advise.
I need to extract data that follows: RESP, CRESP, RTTime and RT.
Here's a snippit for an example of the mess I have to deal with.
Thoughts?
Level: 4
*** LogFrame Start ***
Procedure: ActProcScenarios
No: 1
Line1: It is almost time for your town's spring festival. A friend of yours is
Line2: on the committee and asks if you would be prepared to help out with the
Line3: barbecue in the park. There is a large barn for use if it rains.
Line4: You hope that on that day it will be
pfrag: s-n-y
pword: sunny
pletter: u
Quest: Does the town have an autumn festival?
Correct: {LEFTARROW}
ScenarioListPract: 1
Topic: practice
Subtheme: practice
ActPracScenarios: 1
Running: ActPracScenarios
ActPracScenarios.Cycle: 1
ActPracScenarios.Sample: 1
DisplayFragInstr.OnsetDelay: 17
DisplayFragInstr.OnsetTime: 98031
DisplayFragInstr.DurationError: -999999
DisplayFragInstr.RTTime: 103886
DisplayFragInstr.ACC: 0
DisplayFragInstr.RT: 5855
DisplayFragInstr.RESP: {DOWNARROW}
DisplayFragInstr.CRESP:
FragInput.OnsetDelay: 13
FragInput.OnsetTime: 103899
FragInput.DurationError: -999999
FragInput.RTTime: 104998
I think regular expressions would be the right tool here because the \b word boundary anchors allow you to make sure that RESP only matches a whole word RESP and not just part of a longer word (like CRESP).
Something like this should get you started:
>>> import re
>>> for line in myfile:
... match = re.search(r"\b(RT|RTTime|RESP|CRESP): (.*)", line)
... if match:
... print("Matched {0} with value {1}".format(match.group(1),
... match.group(2)))
Output:
Matched RTTime with value 103886
Matched RT with value 5855
Matched RESP with value {DOWNARROW}
Matched CRESP with value
Matched RTTime with value 104998
transform it to a dict first, then just get items from the dict as you wish
d = {k.strip(): v.strip() for (k, v) in
[line.split(':') for line in s.split('\n') if line.find(':') != -1]}
print (d['DisplayFragInstr.RESP'], d['DisplayFragInstr.CRESP'],
d['DisplayFragInstr.RTTime'], d['DisplayFragInstr.RT'])
>>> ('{DOWNARROW}', '', '103886', '5855')
I think you may be making things harder for yourself than needed. E-prime has a file format called .edat that is designed for the purpose you are describing. An edat file is another format that contains the same information as the .txt file but it a way that makes extracting variables easier. I personally only use the type of text file you have posted here as a form of data storage redundancy.
If you are doing things this way because you do not have a software key, it might help to know that the E-Merge and E-DataAid programs for eprime don't require a key. You only need the key for editing build files. Whoever provided you with the .txt files should probably have an install disk for these programs. If not, it is available on the PST website (I believe you need a serial code to create an account, but not certain)
Eprime generally creates a .edat file that matches the content of the text file you have posted an example of. Sometimes though if eprime crashes you don't get the edat file and only have the .txt. Luckily you can generate the edat file from the .txt file.
Here's how I would approach this issue: If you do not have the edat files available first use E-DataAid to recover the files.
Then presuming you have multiple participants you can use e-merge to merge all of the edat files together for all participants in who completed this task.
Open the merged file. It might look a little chaotic depending on how much you have in the file. You can got to Go to tools->Arrange columns This will show a list of all your variables. Adjust so that only the desired variables are in the right hand box. Hit ok.
Looking at the file you posted it says level 4 at the top so I'm guessing there are a lot of procedures in this experiment. If you have many procedures in the program you might at this point have lines that just have startup info and NULL in the locations where your variables or interest are. You and fix this by going to tools->filter and creating a filter to eliminate those lines. Sometimes also depending on file structure you might also end up with duplicate lines of the same data. You can also fix this with filtering.
You can then export this file as a csv
import re
import pprint
def parse_logs(file_name):
with open(file_name, "r") as f:
lines = [line.strip() for line in f.readlines()]
base_regex = r'^.*{0}: (.*)$'
match_terms = ["RESP", "CRESP", "RTTime", "RT"]
regexes = {term: base_regex.format(term) for term in match_terms}
output_list = []
for line in lines:
for key, regex in regexes.items():
match = re.match(regex, line)
if match:
match_tuple = (key, match.groups()[0])
output_list.append(match_tuple)
return output_list
pprint.pprint(parse_logs("respregex"))
Edit: Tim and Guy's answers are both better. I was in a hurry to write something and missed two much more elegant solutions.
I have to parse a large log file (2GB) using reg ex in python. In the log file regular expression matches line which I am interested in. Log file can also have unwanted data.
Here is a sample from the file:
"#DEBUG:: BFM [L4] 5.4401e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ CPortID:'h8 SIZE:'d20 NumSeg:'h0001 Msg_Id:'h00000000"
My regular expression is ".DEBUG.*MSG."
First I will split it using the white spaces then the "field:value" patterns are inserted into the sqlite3 database; but for large files it takes around 10 to 15 minutes to parse the file.
Please suggest the best way to do the above task in minimal time.
As others have said, profile your code to see why it is slow. The cProfile module in conjunction with the gprof2dot tool can produce nice readable information
Without seeing your slow code, I can guess a few things that might help:
First is you can probably get away with using the builtin string methods instead of a regex - this might be marginally quicker. If you need to use regex's, it's worthwhile precompiling outside the main loop using re.compile
Second is to not do one insert query per line, instead do the insertions in batches, e.g add the parsed info to a list, then when it reaches a certain size, perform one INSERT query with executemany method.
Some incomplete code, as an example of the above:
import fileinput
parsed_info = []
for linenum, line in enumerate(fileinput.input()):
if not line.startswith("#DEBUG"):
continue # Skip line
msg = line.partition("MSG")[1] # Get everything after MSG
words = msg.split() # Split on words
info = {}
for w in words:
k, _, v = w.partition(":") # Split each word on first :
info[k] = v
parsed_info.append(info)
if linenum % 10000 == 0: # Or maybe if len(parsed_info) > 500:
# Insert everything in parsed_info to database
...
parsed_info = [] # Clear
Paul's answer makes sense, you need to understand where you "lose" time first.
Easiest way if you don't have a profiler is to post a timestamp in milliseconds before and after each "step" of your algorithm (opening the file, reading it line by line (and inside, time taken for the split / regexp to recognise the debug lines), inserting it in the DB, etc...).
Without further knowledge of your code, there are possible "traps" that would be very time consuming :
- opening the log file several times
- opening the DB every time you need to insert data inside instead of opening one connection and then write as you go
"The best way to do the above task in minimal time" is to first figure out where the time is going. Look into how to profile your Python script to find what parts are slow. You may have an inefficient regex. Writing to sqlite may be the problem. But there are no magic bullets - in general, processing 2GB of text line by line, with a regex, in Python, is probably going to run in minutes, not seconds.
Here is a test script that will show how long it takes to read a file, line by line, and do nothing else:
from datetime import datetime
start = datetime.now()
for line in open("big_honkin_file.dat"):
pass
end = datetime.now()
print (end-start)
This is my python file:-
TestCases-2
Input-5
Output-1,1,2,3,5
Input-7
Ouput-1,1,2,3,5,8,13
What I want is this:-
A variable test_no = 2 (No. of testcases)
A list testCaseInput = [5,7]
A list testCaseOutput = [[1,1,2,3,5],[1,1,2,3,5,8,13]]
I've tried doing it in this way:
testInput = testCase.readline(-10)
for i in range(0,int(testInput)):
testCaseInput = testCase.readline(-6)
testCaseOutput = testCase.readline(-7)
The next step would be to strip the numbers on the basis of (','), and then put them in a list.
Weirdly, the readline(-6) is not giving desired results.
Is there a better way to do this, which obviously I'm missing out on.
I don't mind using serialization here but I want to make it very simple for someone to write a text file as the one I have shown and then take the data out of it. How to do that?
A negative argument to the readline method specifies the number of bytes to read. I don't think this is what you want to be doing.
Instead, it is simpler to pull everything into a list all at once with readlines():
with open('data.txt') as f:
full_lines = f.readlines()
# parse full lines to get the text to right of "-"
lines = [line.partition('-')[2].rstrip() for line in full_lines]
numcases = int(lines[0])
for i in range(1, len(lines), 2):
caseinput = lines[i]
caseoutput = lines[i+1]
...
The idea here is to separate concerns (the source of the data, the parsing of '-', and the business logic of what to do with the cases). That is better than having a readline() and redundant parsing logic at every step.
I'm not sure if I follow exactly what you're trying to do, but I guess I'd try something like this:
testCaseIn = [];
testCaseOut = [];
for line in testInput:
if (line.startsWith("Input")):
testCaseIn.append(giveMeAList(line.split("-")[1]));
elif (line.startsWith("Output")):
testCaseOut.append(giveMeAList(line.split("-")[1]));
where giveMeAList() is a function that takes a comma seperated list of numbers, and generates a list datathing from it.
I didn't test this code, but I've written stuff that uses this kind of structure when I've wanted to make configuration files in the past.
You can use regex for this and it makes it much easier. See question: python: multiline regular expression
For your case, try this:
import re
s = open("input.txt","r").read()
(inputs,outputs) = zip(*re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n",s))
and then split(",") each output element as required
If you do it this way you get the benefit that you don't need the first line in your input file so you don't need to specify how many entries you have in advance.
You can also take away the unzip (that's the zip(*...) ) from the code above, and then you can deal with each input and output a pair at a time. My guess is that is in fact exactly what you are trying to do.
EDIT Wanted to give you the full example of what I meant just then. I'm assuming this is for a testing script so I would say use the power of the pattern matching iterator to help keep your code shorter and simpler:
for (input,output) in re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n",s):
expectedResults = output.split(",")
testResults = runTest(input)
// compare testResults and expectedResults ...
This line has an error:
Ouput-1,1,2,3,5,8,13 // it should be 'Output' not 'Ouput
This should work:
testCase = open('in.txt', 'r')
testInput = int(testCase.readline().replace("TestCases-",""))
for i in range(0,int(testInput)):
testCaseInput = testCase.readline().replace("Input-","")
testCaseOutput = testCase.readline().replace("Output-","").split(",")
I need to write a program like this:
Write a program that reads a file .picasa.ini and copies pictures in new files, whose names are the same as identification numbers of person on these pictures (eg. 8ff985a43603dbf8.jpg). If there are more person on the picture it makes more copies. If a person is on more pictures, later override earlier copies of pictures; if a person 8ff985a43603dbf8 may appear in more pictures, only one file with this name will exist. You must presume that we have a simple file .picasa.ini.
I have an .ini, that consists:
[img_8538.jpg]
faces=rect64(4ac022d1820c8624),**d5a2d2f6f0d7ccbc**
backuphash=46512
[img_8551.jpg]
faces=rect64(acb64583d1eb84cb),**2623af3d8cb8e040**;rect64(58bf441388df9592),**d85d127e5c45cdc2**
backuphash=8108
...
Is this a good way to start this program?
for line in open('C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini'):
if line.startswith('faces'):
line.split() # what must I do here to split the bolded words?
Is there a better way to do this? Remember the .jpg file must be created with a new name, so I think I should link the current .jpg file with the bolded one.
Consider using ConfigParser. Then you will have to split each value by hand, as you describe.
import ConfigParser
import string
config = ConfigParser.ConfigParser()
config.read('C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini')
imgs = []
for item in config.sections():
imgs.append(config.get(item, 'faces'))
This is still work in progress. Just want to ask if it's correct.
edit:
Still don't know hot to split the bolded words out of there. This split function really is a pain for me.
Suggestions:
Your lines don't start with 'faces', so your second line won't work the way you want it to. Depending on how the rest of the file looks, you might only need to check whether the line is empty or not at that point.
To get the information you need, first split at ',' and work from there
Try at a solution: The elements you need seem to always have a ',' before them, so you can start by splitting at the ',' sign and taking everything from the 1-index elemnt onwards [1::] . Then if what I am thinking is correct, you split those elements twice again: at the ";" and take the 0-index element of that and at that " ", again taking the 0-index element.
for line in open('thingy.ini'):
if line != "\n":
personelements = line.split(",")[1::]
for person in personelements:
personstring = person.split(";")[0].split(" ")[0]
print personstring
works for me to get:
d5a2d2f6f0d7ccbc
2623af3d8cb8e040
d85d127e5c45cdc2