I'm kind of a beginner to all this and am trying to parse an organized text file.
The text file resembles the following:
blahblahblahblahbblahb
blahblahblahblahblahblah
Variable: A Variable1: B Variable2: C
blah blah blah
I'm trying to extract the values after Variable, Variable1, Variable2.
My idea is to try and split the line with the variables into new lines for each "Variable" and then delete everything else not starting with Variable/Variable1/Variable2, then convert it into a table.
Does anyone have any suggestions on how I should approach this in a better way?
You can try:
fields=line.split()
values = []
for k in range(len(fields)):
if fields[k].startswith("Variable"):
values.append(fields[k+1])
Then values is what you want.
If you're trying to save data to a text file then I suggest you check out JSON files as they are a lot easier to parse.
You can use the regex module.
import re
pattern = re.compile(r"(Variable\d*): (\w+)")
with open('pathToFile') as f:
fileString = f.read()
data = pattern.findAll(fileString)
this will give you something like this :
>>> data
[('Variable', 'A'), ('Variable1', 'B'), ...]
As it's been said elsewhere though, if you are trying to store data or write a configuration file, you should use the JSON module or install the excellent yaml package
Any one of those would give you a much more robust solution.
Loop through the lines, then try:
result = filter(lambda item: line.split(' ').index(item)%2!=0,line.split(' '))
print(result)
If your want to do task of filtering, mapping, reducing etc. Please check out lambda function, easy to understand and use.
Python Document of lambda
Related
I have to read a file that has always the same format.
As I know it has the same format I can readline() and tokenize. But I guess there is a way to read it more, how to say it, "pretty to the eyes".
The file I have to read has this format :
Nom NMS-01
MAC AAAAAAAAAAA
UDPport 2019
TCPport 9129
I just want a different way to read it without having to tokenize, if that is possbile
Your question seems to imply that "tokenizing" is some kind of mysterious and complicated process. But in fact, the thing you are trying to do is exactly tokenizing.
Here is a perfectly valid way to read the file you show, break it up into tokens, and store it in a data structure:
def read_file_data(data_file_path):
result = {}
with open(data_file_path) as data_file:
for line in data_file:
key, value = line.split(' ', maxsplit=1)
result[key] = value
return result
That wasn't complicated, it wasn't a lot of code, it doesn't need a third-party library, and it's easy to work with:
data = read_file_data('path/to/file')
print(data['Nom']) # prints "NMS-01"
Now, this implementation makes many assumptions about the structure of the file. Among other things, it assumes:
The entire file is structured as key/value pairs
Each key/value pair fits on a single line
Every line in the file is a key/value pair (no comments or blank lines)
The key cannot contain space characters
The value cannot contain newline characters
The same key does not appear multiple times in the file (or, if it does, it is acceptable for the last value given to be the only one returned)
Some of these assumptions may be false, but they are all true for the data sample you provided.
More generally: if you want to parse some kind of structured data, you need to understand the structure of the data and how values are delimited from each other. That's why common structured data formats like XML, JSON, and YAML (among many others!) were invented. Once you know the language you are parsing, tokenization is simply the code you write to match up the language with the text of your input.
Pandas does many magical things, so maybe that is prettier for you?
import pandas as pd
pd.read_csv('input.txt',sep = ' ',header=None,index_col=0)
This gives you a dataframe that you can manipulate further:
0 1
Nom NMS-01
MAC AAAAAAAAAAA
UDPport 2019
TCPport 9129
How to get/extract number of lines added and deleted?
(Just like we do using git diff --numstat).
repo_ = Repo('git-repo-path')
git_ = repo_.git
log_ = g.diff('--numstat','HEAD~1')
print(log_)
prints the entire output (lines added/deleted and file-names) as a single string. Can this output format be modified or changed so as to extract useful information?
Output format: num(added) num(deleted) file-name
For all files modified.
If I understand you correctly, you want to extract data from your log_ variable and then re-format it and print it? If that's the case, then I think the simplest way to fix it, is with a regular expression:
import re
for line in log_.split('\n'):
m = re.match(r"(\d+)\s+(\d+)\s+(.+)", line)
if m:
print("{}: rows added {}, rows deleted {}".format(m[3], m[1], m[2]))
The exact output, you can of course modify any way you want, once you have the data in a match m. Getting the hang of regular expressions may take a while but it can be very helpful for small scripts.
However, be adviced, reg exps tend to be write-only code and can be very hard to debug. However, for extracting small parts like this, it is very helpful.
I am trying to use python (v3.4) to act as a 'sandwich' between Neo4j and a text output file. This code gives me a py2neo RecordList:
from py2neo import Graph
from py2neo.packages.httpstream import http
http.socket_timeout = 9999
graph = Graph('http://localhost:7474/db/data/')
sCypher = 'MATCH (a) RETURN count(a)'
results = graph.cypher.execute(sCypher)
I also have some really simple text file interaction:
f = open('Output.txt','a') #open for append access
f.write ('\n Hello world')
f.close
What I really want to do is f.write (str(results)) but it really didn't like that at all. Can someone help me to convert my RecordList into a string please? I'm assuming I'll need to loop through the columns to get each column name, then loop through each record and write it individually, but I don't know how to go about this. Where I'm eventually planning to go with this is to change the Cypher every time.
Closest related question I could find is this one: How to convert Neo4j return types to python types. I'm sure there's someone out there who'll read this and say that using the REST API directly is a much better way of spitting out text, but I'm not quite at that level...
Thanks in advance,
Andy
Here is how you can iterate a RecordList and print the columns of the individual Records to a file (e.g. comma separated). If the properties you return are lists you would need some more formatting to get strings for your output file.
# use with to open files, this makes sure that it's properly closed after an exception
with open('output.txt', 'a') as f:
# iterate over individual Records in RecordList
for record in results:
# concatenate all columns of the Record into a string, comma separated
# list comprehension with str() to handle int and other types
output_string = ','.join([str(x) for x in record])
# actually write to file
print(output_string, file=f)
The format of the output file depends on what you want to do with it of course.
Been learning Python the last couple of days for the function of completing a data extraction. I'm not getting anywhere & hope one of you lovely people can advise.
I need to extract data that follows: RESP, CRESP, RTTime and RT.
Here's a snippit for an example of the mess I have to deal with.
Thoughts?
Level: 4
*** LogFrame Start ***
Procedure: ActProcScenarios
No: 1
Line1: It is almost time for your town's spring festival. A friend of yours is
Line2: on the committee and asks if you would be prepared to help out with the
Line3: barbecue in the park. There is a large barn for use if it rains.
Line4: You hope that on that day it will be
pfrag: s-n-y
pword: sunny
pletter: u
Quest: Does the town have an autumn festival?
Correct: {LEFTARROW}
ScenarioListPract: 1
Topic: practice
Subtheme: practice
ActPracScenarios: 1
Running: ActPracScenarios
ActPracScenarios.Cycle: 1
ActPracScenarios.Sample: 1
DisplayFragInstr.OnsetDelay: 17
DisplayFragInstr.OnsetTime: 98031
DisplayFragInstr.DurationError: -999999
DisplayFragInstr.RTTime: 103886
DisplayFragInstr.ACC: 0
DisplayFragInstr.RT: 5855
DisplayFragInstr.RESP: {DOWNARROW}
DisplayFragInstr.CRESP:
FragInput.OnsetDelay: 13
FragInput.OnsetTime: 103899
FragInput.DurationError: -999999
FragInput.RTTime: 104998
I think regular expressions would be the right tool here because the \b word boundary anchors allow you to make sure that RESP only matches a whole word RESP and not just part of a longer word (like CRESP).
Something like this should get you started:
>>> import re
>>> for line in myfile:
... match = re.search(r"\b(RT|RTTime|RESP|CRESP): (.*)", line)
... if match:
... print("Matched {0} with value {1}".format(match.group(1),
... match.group(2)))
Output:
Matched RTTime with value 103886
Matched RT with value 5855
Matched RESP with value {DOWNARROW}
Matched CRESP with value
Matched RTTime with value 104998
transform it to a dict first, then just get items from the dict as you wish
d = {k.strip(): v.strip() for (k, v) in
[line.split(':') for line in s.split('\n') if line.find(':') != -1]}
print (d['DisplayFragInstr.RESP'], d['DisplayFragInstr.CRESP'],
d['DisplayFragInstr.RTTime'], d['DisplayFragInstr.RT'])
>>> ('{DOWNARROW}', '', '103886', '5855')
I think you may be making things harder for yourself than needed. E-prime has a file format called .edat that is designed for the purpose you are describing. An edat file is another format that contains the same information as the .txt file but it a way that makes extracting variables easier. I personally only use the type of text file you have posted here as a form of data storage redundancy.
If you are doing things this way because you do not have a software key, it might help to know that the E-Merge and E-DataAid programs for eprime don't require a key. You only need the key for editing build files. Whoever provided you with the .txt files should probably have an install disk for these programs. If not, it is available on the PST website (I believe you need a serial code to create an account, but not certain)
Eprime generally creates a .edat file that matches the content of the text file you have posted an example of. Sometimes though if eprime crashes you don't get the edat file and only have the .txt. Luckily you can generate the edat file from the .txt file.
Here's how I would approach this issue: If you do not have the edat files available first use E-DataAid to recover the files.
Then presuming you have multiple participants you can use e-merge to merge all of the edat files together for all participants in who completed this task.
Open the merged file. It might look a little chaotic depending on how much you have in the file. You can got to Go to tools->Arrange columns This will show a list of all your variables. Adjust so that only the desired variables are in the right hand box. Hit ok.
Looking at the file you posted it says level 4 at the top so I'm guessing there are a lot of procedures in this experiment. If you have many procedures in the program you might at this point have lines that just have startup info and NULL in the locations where your variables or interest are. You and fix this by going to tools->filter and creating a filter to eliminate those lines. Sometimes also depending on file structure you might also end up with duplicate lines of the same data. You can also fix this with filtering.
You can then export this file as a csv
import re
import pprint
def parse_logs(file_name):
with open(file_name, "r") as f:
lines = [line.strip() for line in f.readlines()]
base_regex = r'^.*{0}: (.*)$'
match_terms = ["RESP", "CRESP", "RTTime", "RT"]
regexes = {term: base_regex.format(term) for term in match_terms}
output_list = []
for line in lines:
for key, regex in regexes.items():
match = re.match(regex, line)
if match:
match_tuple = (key, match.groups()[0])
output_list.append(match_tuple)
return output_list
pprint.pprint(parse_logs("respregex"))
Edit: Tim and Guy's answers are both better. I was in a hurry to write something and missed two much more elegant solutions.
This is my python file:-
TestCases-2
Input-5
Output-1,1,2,3,5
Input-7
Ouput-1,1,2,3,5,8,13
What I want is this:-
A variable test_no = 2 (No. of testcases)
A list testCaseInput = [5,7]
A list testCaseOutput = [[1,1,2,3,5],[1,1,2,3,5,8,13]]
I've tried doing it in this way:
testInput = testCase.readline(-10)
for i in range(0,int(testInput)):
testCaseInput = testCase.readline(-6)
testCaseOutput = testCase.readline(-7)
The next step would be to strip the numbers on the basis of (','), and then put them in a list.
Weirdly, the readline(-6) is not giving desired results.
Is there a better way to do this, which obviously I'm missing out on.
I don't mind using serialization here but I want to make it very simple for someone to write a text file as the one I have shown and then take the data out of it. How to do that?
A negative argument to the readline method specifies the number of bytes to read. I don't think this is what you want to be doing.
Instead, it is simpler to pull everything into a list all at once with readlines():
with open('data.txt') as f:
full_lines = f.readlines()
# parse full lines to get the text to right of "-"
lines = [line.partition('-')[2].rstrip() for line in full_lines]
numcases = int(lines[0])
for i in range(1, len(lines), 2):
caseinput = lines[i]
caseoutput = lines[i+1]
...
The idea here is to separate concerns (the source of the data, the parsing of '-', and the business logic of what to do with the cases). That is better than having a readline() and redundant parsing logic at every step.
I'm not sure if I follow exactly what you're trying to do, but I guess I'd try something like this:
testCaseIn = [];
testCaseOut = [];
for line in testInput:
if (line.startsWith("Input")):
testCaseIn.append(giveMeAList(line.split("-")[1]));
elif (line.startsWith("Output")):
testCaseOut.append(giveMeAList(line.split("-")[1]));
where giveMeAList() is a function that takes a comma seperated list of numbers, and generates a list datathing from it.
I didn't test this code, but I've written stuff that uses this kind of structure when I've wanted to make configuration files in the past.
You can use regex for this and it makes it much easier. See question: python: multiline regular expression
For your case, try this:
import re
s = open("input.txt","r").read()
(inputs,outputs) = zip(*re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n",s))
and then split(",") each output element as required
If you do it this way you get the benefit that you don't need the first line in your input file so you don't need to specify how many entries you have in advance.
You can also take away the unzip (that's the zip(*...) ) from the code above, and then you can deal with each input and output a pair at a time. My guess is that is in fact exactly what you are trying to do.
EDIT Wanted to give you the full example of what I meant just then. I'm assuming this is for a testing script so I would say use the power of the pattern matching iterator to help keep your code shorter and simpler:
for (input,output) in re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n",s):
expectedResults = output.split(",")
testResults = runTest(input)
// compare testResults and expectedResults ...
This line has an error:
Ouput-1,1,2,3,5,8,13 // it should be 'Output' not 'Ouput
This should work:
testCase = open('in.txt', 'r')
testInput = int(testCase.readline().replace("TestCases-",""))
for i in range(0,int(testInput)):
testCaseInput = testCase.readline().replace("Input-","")
testCaseOutput = testCase.readline().replace("Output-","").split(",")