Search text file and save to specific variables

I have looked at the previous questions and answers to my question but can't seem to get this to work. I am very new to python so I apologize in advance for this basic question.
I am running a Monte Carlo simulation in a separate piece of software. It creates a text file output that is quite lengthy. I want to retrieve 3 values that are under one heading. I have created the following code that isolates the part of the text file I want.
f = open("/Users/scott/Desktop/test/difinp0.txt","rt")
difout = f.readlines()
f.close()
d = range(1,5)
for i, line in enumerate(difout):
    if "Chi-Square Test for Difference Testing" in line:
        for l in difout[i:i+5]: print(l)
This produces the following:
Chi-Square Test for Difference Testing
Value 12.958
Degrees of Freedom 10
P-Value 0.2261
Note: there is a blank line between the heading and the next line titled "Value."
There are different statistics with the same labels in the output, but I need the ones here that are under the heading "Chi-Square Test for Difference Testing."
What I am looking for is to save the values into 3 variables for use later.
chivalue (which in this case would be 12.958)
chidf (which in this case would be 10)
chip (which in this case would be 0.2261)
I've tried to enumerate "l" and retrieve from there but I can't seem to get it to work.
Any thoughts would be greatly appreciated. Again, apologies for such a basic question.

One option is to build a function that parses the input lines and returns the variables you want:
def parser(text_lines):
    v, d, p = [None]*3
    for line in text_lines:
        fields = line.split()
        if not fields:
            continue  # skip the blank line after the heading
        if fields[0] == 'Value':
            v = float(fields[-1])
        elif fields[0] == 'Degrees':
            d = int(fields[-1])  # degrees of freedom is a whole number
        elif fields[0] == 'P-Value':
            p = float(fields[-1])
    return v, d, p

for i, line in enumerate(difout):
    if "Chi-Square Test for Difference Testing" in line:
        for l in difout[i:i+5]:
            print(l)
        value, degree, p_val = parser(difout[i:i+5])
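As a variation, the heading search and the parsing can live in one function, so that identically labelled statistics elsewhere in the file are never looked at. This is only a sketch: the function name `parse_chi_square`, the `window` parameter, and the sample lines are made up for illustration, and it assumes the three values always appear within four lines after the heading, as in the output shown in the question.

```python
HEADING = "Chi-Square Test for Difference Testing"

def parse_chi_square(lines, heading=HEADING, window=5):
    """Return (value, df, p) from the lines that follow the first matching heading."""
    for i, line in enumerate(lines):
        if heading in line:
            values = {}
            # Look only at the few lines directly under the heading.
            for entry in lines[i + 1:i + window]:
                parts = entry.split()
                if not parts:
                    continue  # blank line after the heading
                # Everything before the last token is the label.
                values[" ".join(parts[:-1])] = parts[-1]
            return (float(values["Value"]),
                    int(values["Degrees of Freedom"]),
                    float(values["P-Value"]))
    raise ValueError("heading not found: " + heading)

# Hypothetical sample: another statistic with the same label comes first.
sample = [
    "Some Other Test\n",
    "Value 99.000\n",
    "Chi-Square Test for Difference Testing\n",
    "\n",
    "Value 12.958\n",
    "Degrees of Freedom 10\n",
    "P-Value 0.2261\n",
]
chivalue, chidf, chip = parse_chi_square(sample)
```

With a real file you would pass the list returned by `f.readlines()` instead of `sample`.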

Related

performing fuzzy differences between 2 files

When doing some retrocomputing stuff, I sometimes have to compare 2 MC68000 disassembled executables of the same game.
Games are published using different languages (english, french...) or have slight modifications / revisions.
The code is roughly the same but the global labels are shifted because of previous code changes (or data which is wrongly interpreted as branches which generate more or less fake labels, depending on the data) so I can have for the first file:
LAB_0012:
MOVE #0,D0
MOVE #2,D2
LAB_0013:
RTS
and for the second file:
LAB_0015:
MOVE #0,D0
SUB #3,D1
MOVE #2,D2
LAB_0016:
RTS
If I perform a diff on both files, the labels scramble/pollute the required result, which I'd like to be SUB #3,D1 added in file 2.
So I performed a pre-processing using a regex to change all labels by LAB_XXXX, like this:
def readlines(filepath):
    with open(filepath) as f:
        lines = list(f)
    return [x.rstrip() for x in lines],[r.sub("LAB_XXXX",l).partition(";")[0] for l in lines]
and use difflib to print the diffs and it kind of works but it doesn't revert back to original label values of course. So I keep the original data, and parse difflib output to try to print the original data instead, but that's lame and doesn't work very well.
lines1,filtered_lines1 = readlines(file1)
lines2,filtered_lines2 = readlines(file2)
for line in difflib.unified_diff(filtered_lines1, filtered_lines2, fromfile=file1, tofile=file2, lineterm=''):
    m = re.match(r"##..(\d+),(\d+).*(\d+),(\d+)",line)
    if m:
        start,end,start2,end2 = [int(x) for x in m.groups()]
        print(line)
        for i in range(start,start+end):
            print("{} <=> {}".format(lines1[i],lines2[i-start2+start]))
I've checked this answer Fuzzy file diff but that doesn't cut it for me: pre-processing both files is already what I'm doing.
I'd like to instruct difflib (or any other diff tool) to ignore this LAB_.... regex when comparing (a bit like you can compare data ignoring blanks, or case-insensitively), so that the original file content is printed (either side would do) when showing the diffs. For my above example I'd like:
LAB_0015:
MOVE #0,D0
##added:235,1## <== this is just an example: 1 line added at line 235
> SUB #3,D1
MOVE #2,D2
LAB_0016:
RTS
I'd prefer to keep it within python, but if I have to perform system calls for external commands that's okay too.
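One way to stay within Python, sketched here under assumptions, is to let `difflib.SequenceMatcher` compare the normalized lines and then use the index ranges from `get_opcodes()` to print the untouched original lines. The label pattern `LAB_[0-9A-Fa-f]+`, the helper names, and the `"  "`, `"< "`, `"> "` prefixes are illustrative choices, not something difflib prescribes.

```python
import difflib
import re

# Assumed label format; adjust to whatever the disassembler emits.
LABEL_RE = re.compile(r"LAB_[0-9A-Fa-f]+")

def normalize(line):
    """Mask labels and drop ';' comments so only the code is compared."""
    return LABEL_RE.sub("LAB_XXXX", line).partition(";")[0].rstrip()

def fuzzy_diff(lines1, lines2):
    """Diff on normalized lines, but report the original text."""
    norm1 = [normalize(l) for l in lines1]
    norm2 = [normalize(l) for l in lines2]
    matcher = difflib.SequenceMatcher(None, norm1, norm2, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Lines match after masking: show the second file's originals.
            out.extend("  " + l for l in lines2[j1:j2])
        else:
            out.extend("< " + l for l in lines1[i1:i2])
            out.extend("> " + l for l in lines2[j1:j2])
    return out

file1 = ["LAB_0012:", "MOVE #0,D0", "MOVE #2,D2", "LAB_0013:", "RTS"]
file2 = ["LAB_0015:", "MOVE #0,D0", "SUB #3,D1", "MOVE #2,D2", "LAB_0016:", "RTS"]
result = fuzzy_diff(file1, file2)
print("\n".join(result))
```

Because the opcodes carry indices into both sequences, there is no need to parse diff output to get back to the original text.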

Output with Python Glob // Cannot find where is error in Python code

I have the following code, which does NOT give an error but it also does not produce an output.
The script is made to do the following:
The script takes an input file of 4 tab-separated columns:
It then counts the unique values in Column 1 and the frequency of corresponding values in Column 4 (which contains 2 different tags: C and D).
The output is 3 tab-separated columns containing the unique values of column 1 and their corresponding frequency of values in Column 4: Column 2 has the frequency of the string in Column 1 that corresponds with Tag C and Column 3 has the frequency of the string in Column 1 that corresponds with Tag D.
Here is a sample of input:
algorithm-n like-1-resonator-n 8.1848 C
algorithm-n produce-hull-n 7.9104 C
algorithm-n like-1-resonator-n 8.1848 D
algorithm-n produce-hull-n 7.9104 D
anything-n about-1-Zulus-n 7.3731 C
anything-n above-shortage-n 6.0142 C
anything-n above-1-gig-n 5.8967 C
anything-n above-1-magnification-n 7.8973 C
anything-n after-1-memory-n 2.5866 C
and here is a sample of the desired output:
algorithm-n 2 2
anything-n 5 0
The code I am using is the following (which one will see takes into consideration all suggestions from the comments):
from collections import defaultdict, Counter
def sortAndCount(opened_file):
    lemma_sense_freqs = defaultdict(Counter)
    for line in opened_file:
        lemma, _, _, senseCode = line.split()
        lemma_sense_freqs[lemma][senseCode] += 1
    return lemma_sense_freqs

def writeOutCsv(output_file, input_dict):
    with open(output_file, "wb") as outfile:
        for lemma in input_dict.keys():
            for senseCode in input_dict[lemma].keys():
                outstring = "\t".join([lemma, senseCode,\
                    str(input_dict[lemma][senseCode])])
                outfile.write(outstring + "\n")

import os
import glob
folderPath = "Python_Counter" # declare here
for input_file in glob.glob(os.path.join(folderPath, 'out_')):
    with open(input_file, "rb") as opened_file:
        lemma_sense_freqs = sortAndCount(input_file)
        output_file = "count_*.csv"
        writeOutCsv(output_file, lemma_sense_freqs)
My intuition is the problem is coming from the "glob" function.
But, as I said before: the code itself DOES NOT give me an error -- but it doesn't seem to produce an output either.
Can someone help?
I have referred to the documentation here and here, and I cannot seem to understand what I am doing wrong.
Can someone give me some insight into how to solve the problem and write out the results for the files found by glob? I have a large number of files to process.

In your original code, lemma_sense_freqs is not defined because it should be returned by the function sortAndCount(), and you never call that function.
For instance, your code has a second function, writeOutCsv. You define it and then actually call it on the last line.
You never call sortAndCount(), though, which is the function that should return the value of lemma_sense_freqs. Hence the error.
I don't know exactly what you want to achieve with that code, but at some point (try just before the last line) you definitely need to write something like this:
lemma_sense_freqs = sortAndCount(input_file)
This is how you call the function you need; lemma_sense_freqs will then have a value and you shouldn't get the error.
I cannot be more specific because it is not clear exactly what you want to achieve with that code. For now you are just hitting a basic issue: you defined a function but never used it to retrieve the value of lemma_sense_freqs. Try adding the line I suggest and play with it.
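For the edited code in the question, three details look worth checking. The pattern 'out_' has no wildcard, so glob only matches a file literally named out_ and the loop body probably never runs, which would explain the silence. sortAndCount() is given the path string rather than the opened file. The output name "count_*.csv" contains a literal asterisk. The sketch below is one possible rewrite for Python 3, assuming the input files are named out_something and that one three-column output file per input is wanted; the function names and the output naming scheme are illustrative.

```python
import glob
import os
from collections import Counter, defaultdict

def sort_and_count(opened_file):
    """Count, per lemma (column 1), how often each tag (column 4) occurs."""
    freqs = defaultdict(Counter)
    for line in opened_file:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        lemma, _, _, sense_code = fields
        freqs[lemma][sense_code] += 1
    return freqs

def write_counts(output_path, freqs):
    """Write lemma, count of tag C, count of tag D as tab-separated columns."""
    with open(output_path, "w") as outfile:
        for lemma in sorted(freqs):
            # A Counter returns 0 for a tag that never occurred.
            outfile.write("{}\t{}\t{}\n".format(
                lemma, freqs[lemma]["C"], freqs[lemma]["D"]))

def process_folder(folder_path, pattern="out_*"):
    """Process every matching file and return the paths that were written."""
    written = []
    for input_path in glob.glob(os.path.join(folder_path, pattern)):
        with open(input_path) as opened_file:
            freqs = sort_and_count(opened_file)  # pass the file, not the path
        output_path = os.path.join(
            folder_path, "count_" + os.path.basename(input_path) + ".csv")
        write_counts(output_path, freqs)
        written.append(output_path)
    return written
```

Calling `process_folder("Python_Counter")` would then write one count_ file next to each input file.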

How to compare clusters?

Hopefully this can be done with python! I used two clustering programs on the same data and now have a cluster file from both. I reformatted the files so that they look like this:
Cluster 0:
 Brucellaceae(10)
  Brucella(10)
   abortus(1)
   canis(1)
   ceti(1)
   inopinata(1)
   melitensis(1)
   microti(1)
   neotomae(1)
   ovis(1)
   pinnipedialis(1)
   suis(1)
Cluster 1:
 Streptomycetaceae(28)
  Streptomyces(28)
   achromogenes(1)
   albaduncus(1)
   anthocyanicus(1)
   etc.
These files contain bacterial species info. So I have the cluster number (Cluster 0), then right below it 'family' (Brucellaceae) and the number of bacteria in that family (10). Under that is the genera found in that family (name followed by number, Brucella(10)) and finally the species in each genera (abortus(1), etc.).
My question: I have 2 files formatted in this way and want to write a program that will look for differences between the two. The only problem is that the two programs cluster in different ways, so two clusters may be the same even if the actual "Cluster Number" is different (so the contents of Cluster 1 in one file might match Cluster 43 in the other file, the only difference being the actual cluster number). So I need something to ignore the cluster number and focus on the cluster contents.
Is there any way I could compare these 2 files to examine the differences? Is it even possible? Any ideas would be greatly appreciated!

Given:
file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
Is this what you need?
def parse_file(open_file):
    result = []
    for line in open_file:
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            levels = ['','','']
        item = line.lstrip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            result.append('.'.join(levels))
    return result

data1 = set(parse_file(file1))
data2 = set(parse_file(file2))
differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1) ]
To see the differences:
for desc, items in differences:
    print(desc)
    print()
    for item in items:
        print('\t' + item)
    print()
prints
common elements
	giant.red.brick
	tiny.blue.candy
	tiny.blue.flower
missing from file2
	tiny.green.dot
	giant.red.apple
missing from file1
	giant.red.tomato
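The comparison above pools all species from a file into one set. If you also want to know which cluster in one file corresponds to which cluster in the other, regardless of the cluster number, one possible sketch is to keep a set of species per cluster and pair clusters by Jaccard similarity (size of the intersection divided by size of the union). The function names are made up for this illustration, and the parser assumes each level is indented by one more leading space, as `parse_file` does.

```python
def parse_clusters(lines):
    """Map each cluster header to its set of 'family.genus.species' paths."""
    clusters = {}
    levels = ["", "", ""]
    current = None
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        if indent == 0:
            current = stripped.rstrip(":")
            clusters[current] = set()
            continue
        levels[indent - 1] = stripped.split("(", 1)[0]
        if indent == 3:  # species level
            clusters[current].add(".".join(levels))
    return clusters

def jaccard(a, b):
    """Shared members divided by all members; 1.0 means identical sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def best_matches(clusters1, clusters2):
    """For each cluster in the first file, find the most similar one in the second."""
    matches = {}
    for name1, members1 in clusters1.items():
        best = max(clusters2,
                   key=lambda n: jaccard(members1, clusters2[n]),
                   default=None)
        score = jaccard(members1, clusters2[best]) if best is not None else 0.0
        matches[name1] = (best, score)
    return matches

example1 = ["Cluster 0:", " giant(2)", "  red(2)", "   brick(1)", "   apple(1)",
            "Cluster 1:", " tiny(3)", "  green(1)", "   dot(1)",
            "  blue(2)", "   flower(1)", "   candy(1)"]
example2 = ["Cluster 18:", " giant(2)", "  red(2)", "   brick(1)", "   tomato(1)",
            "Cluster 19:", " tiny(2)", "  blue(2)", "   flower(1)", "   candy(1)"]
matches = best_matches(parse_clusters(example1), parse_clusters(example2))
```

A score of 1.0 means the two clusters contain exactly the same species; lower scores point at the clusters worth inspecting by hand.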

Since I see lots of different answers in the comments, I'll give you a very, very simple implementation of a script that you can start from.
Note that this does not answer your full question, but it points you in one of the directions mentioned in the comments.
Normally, if you have no experience, I'd suggest going ahead and reading up on Python first (which I'll suggest anyway, and I'll throw in a few links at the bottom of the answer).
On to the fun stuff! :)
class Cluster(object):
    '''
    This is a class that will contain your information about the Clusters.
    '''
    def __init__(self, number):
        '''
        This is what some languages call a constructor, but it's not.
        This method initializes the properties with values from the method call.
        '''
        self.cluster_number = number
        self.family_name = None
        self.bacteria_name = None
        self.bacteria = []

#This part below isn't a part of the class, this is the actual script.
with open('bacteria.txt', 'r') as file:
    cluster = None
    clusters = []
    for index, line in enumerate(file):
        if line.startswith('Cluster'):
            cluster = Cluster(index)
            clusters.append(cluster)
        else:
            if not cluster.family_name:
                cluster.family_name = line
            elif not cluster.bacteria_name:
                cluster.bacteria_name = line
            else:
                cluster.bacteria.append(line)
I wrote this as dumb and overly simple as I could without any fancy stuff and for Python 2.7.2
You could copy this file into a .py file and run it directly from command line python bacteria.py for example.
Hope this helps a bit and don't hesitate to come by our Python chat room if you have any questions! :)
http://learnpythonthehardway.org/
http://www.diveintopython.net/
http://docs.python.org/2/tutorial/inputoutput.html
check if all elements in a list are identical
Retaining order while using Python's set difference

You have to write some code to parse the file. If you ignore the cluster, you should be able to distinguish between family, genera and species based on indentation.
The easiest way is to define a named tuple:
import collections
Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])
You can make an instance of this object like this:
b = Bacterium('Brucellaceae', 'Brucella', 'canis')
Your parser should read a file line by line, and set the family and genera. If it then finds a species, it should add a Bacterium to a list:
with open('cluster0.txt', 'r') as infile:
    lines = infile.readlines()
family = None
genera = None
bacteria = []
for line in lines:
    # set family and genera.
    # if you detect a bacterium:
    bacteria.append(Bacterium(family, genera, species))
Once you have a list of all bacteria in each file or cluster, you can select from all the bacteria like this:
s = [b for b in bacteria if b.family == 'Streptomycetaceae']
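To make the placeholder comments in the loop concrete, here is one way they might be filled in. It is a sketch that assumes the level can be read from the number of leading whitespace characters (1 for family, 2 for genera, 3 for species); that convention and the name `read_bacteria` are assumptions, so adapt the test to however your files are actually indented. Collecting the named tuples in a set makes the comparison between two files a plain set difference.

```python
import collections

Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])

def read_bacteria(lines):
    """Return the set of Bacterium tuples found in the given lines."""
    family = None
    genera = None
    bacteria = set()
    for line in lines:
        # Assumed convention: one more leading space per level.
        indent = len(line) - len(line.lstrip())
        name = line.strip().split('(', 1)[0]
        if indent == 1:
            family = name
        elif indent == 2:
            genera = name
        elif indent == 3:
            bacteria.add(Bacterium(family, genera, name))
        # indent 0 is a "Cluster N:" header and is ignored on purpose
    return bacteria

first = read_bacteria(["Cluster 0:", " Brucellaceae(2)", "  Brucella(2)",
                       "   abortus(1)", "   canis(1)"])
second = read_bacteria(["Cluster 7:", " Brucellaceae(1)", "  Brucella(1)",
                        "   abortus(1)"])
only_in_first = first - second
```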

Comparing two clusterings is not a trivial task, and reinventing the wheel is unlikely to be successful. Check out this package, which has lots of different cluster similarity metrics and can compare dendrograms (the data structure you have).
The library is called CluSim and can be found here:
https://github.com/Hoosier-Clusters/clusim/

After learning so much from Stackoverflow, finally I have an opportunity to give back! A different approach from those offered so far is to relabel clusters to maximize alignment, and then comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you want these two labelings to be equivalent since L1 and L2 are essentially segmenting items into clusters identically. This approach relabels L2 to maximize alignment, and in the example above, will result in L2==L1.
I found a solution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012." and below is an implementation in Python using numpy. I'm relatively new to Python, so there may be better implementations, but I think this gets the job done:
import numpy as np

def alignClusters(clstr1,clstr2):
    """Given 2 cluster assignments, this function will rename the second to
    maximize alignment of elements within each cluster. This method is
    described in Menéndez, Héctor D. A genetic approach to the graph and
    spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
    are consecutive integers starting with zero)
    INPUTS:
    clstr1 - The first clustering assignment
    clstr2 - The second clustering assignment
    OUTPUTS:
    clstr2_temp - The second clustering assignment with clusters renumbered to
    maximize alignment with the first clustering assignment """
    K = np.max(clstr1)+1
    simdist = np.zeros((K,K))
    for i in range(K):
        for j in range(K):
            dcix = clstr1==i
            dcjx = clstr2==j
            dd = np.dot(dcix.astype(int),dcjx.astype(int))
            simdist[i,j] = (dd/np.sum(dcix!=0) + dd/np.sum(dcjx!=0))/2
    mask = np.zeros((K,K))
    for i in range(K):
        simdist_vec = np.reshape(simdist.T,(K**2,1))
        I = np.argmax(simdist_vec)
        xy = np.unravel_index(I,simdist.shape,order='F')
        x = xy[0]
        y = xy[1]
        mask[x,y] = 1
        simdist[x,:] = 0
        simdist[:,y] = 0
    swapIJ = np.unravel_index(np.where(mask.T),simdist.shape,order='F')
    swapI = swapIJ[0][1,:]
    swapJ = swapIJ[0][0,:]
    clstr2_temp = np.copy(clstr2)
    for k in range(swapI.shape[0]):
        swapj = [swapJ[k]==i for i in clstr2]
        clstr2_temp[swapj] = swapI[k]
    return clstr2_temp
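To see the relabeling idea on the L1/L2 example without numpy, here is a much simpler greedy sketch. It is not the method from the thesis: it just counts how often each pair of labels co-occurs and maps labels of the second list onto labels of the first, largest overlap first. The function name `relabel_to_match` is made up, and the fallback for leftover labels assumes integer labels.

```python
from collections import Counter

def relabel_to_match(labels1, labels2):
    """Rename labels2 so that its clusters line up with labels1 where possible."""
    # overlap[(old, new)] = number of items labelled `old` in labels2
    # and `new` in labels1
    overlap = Counter(zip(labels2, labels1))
    mapping = {}
    used = set()
    for (old, new), _count in overlap.most_common():
        if old not in mapping and new not in used:
            mapping[old] = new
            used.add(new)
    # Any label of labels2 left unmapped gets the next unused integer,
    # so two different clusters never collapse into one.
    next_free = 0
    for old in labels2:
        if old not in mapping:
            while next_free in used:
                next_free += 1
            mapping[old] = next_free
            used.add(next_free)
    return [mapping[label] for label in labels2]

L1 = [0, 0, 1, 1, 2, 2]
L2 = [2, 2, 0, 0, 1, 1]
aligned = relabel_to_match(L1, L2)
print(aligned == L1)
```

A greedy pass like this can miss the best overall assignment when overlaps are close to each other, which is one reason to prefer a tested library for real data.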

Python: How to extract string from text file to use as data

this is my first time writing a python script and I'm having some trouble getting started. Let's say I have a txt file named Test.txt that contains this information.
x y z Type of atom
ATOM 1 C1 GLN D 10 26.395 3.904 4.923 C
ATOM 2 O1 GLN D 10 26.431 2.638 5.002 O
ATOM 3 O2 GLN D 10 26.085 4.471 3.796 O
ATOM 4 C2 GLN D 10 26.642 4.743 6.148 C
What I want to do is eventually write a script that will find the center of mass of these three atoms. So basically I want to sum up all of the x values in that txt file with each number multiplied by a given value depending on the type of atom.
I know I need to define the positions for each x-value, but I'm having trouble with figuring out how to make these x-values be represented as numbers instead of txt from a string. I have to keep in mind that I'll need to multiply these numbers by the type of atom, so I need a way to keep them defined for each atom type. Can anyone push me in the right direction?

mass_dictionary = {'C':12.0107,
                   'O':15.999
                   #Others...?
                   }
# If your files are this structured, you can just
# hardcode some column assumptions.
coords_idxs = [6,7,8]
type_idx = 9
# Open file, get lines, close file.
# Probably prudent to add try-except here for bad file names.
f_open = open("Test.txt",'r')
lines = f_open.readlines()
f_open.close()
# Initialize an array to hold needed intermediate data.
output_coms = []; total_mass = 0.0;
# Loop through the lines of the file.
for line in lines:
    # Split the line on white space.
    line_stuff = line.split()
    # If the line is empty or fails to start with 'ATOM', skip it.
    if (not line_stuff) or (not line_stuff[0]=='ATOM'):
        pass
    # Otherwise, append the mass-weighted coordinates to a list and increment total mass.
    else:
        output_coms.append([mass_dictionary[line_stuff[type_idx]]*float(line_stuff[i]) for i in coords_idxs])
        total_mass = total_mass + mass_dictionary[line_stuff[type_idx]]
# After getting all the data, finish off the averages.
avg_x, avg_y, avg_z = tuple(map( lambda x: (1.0/total_mass)*sum(x), [[elem[i] for elem in output_coms] for i in [0,1,2]]))
# A lot of this will be better with NumPy arrays if you'll be using this often or on
# larger files. Python Pandas might be an even better option if you want to just
# store the file data and play with it in Python.

Using the open function in Python you can open any file, so you can do something like the following. The snippet is not a solution to the whole problem, only an approach.
def read_file():
    f = open("filename", 'r')
    for line in f:
        line_list = line.split()
        ....
        ....
    f.close()
From this point on you have a nice setup for working with these values. The second line opens the file for reading. The third line defines a for loop that reads the file one line at a time, with each line going into the line variable.
The fourth line breaks the string at every run of whitespace into a list, so line_list[0] will be the value in your first column and so forth. From this point, if you have any programming experience, you can use if statements and such to get the logic that you want.
** Also keep in mind that the values stored in that list will all be strings, so convert them (for example with float()) before performing any arithmetic such as adding.
* Edited for syntax correction
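As a small illustration of the string-to-number step, the sketch below turns each ATOM line into a tuple of the atom type and three floats. It assumes the column layout from the sample in the question (coordinates in the 7th to 9th whitespace-separated fields, atom type in the 10th); the name `read_atoms` is made up.

```python
def read_atoms(lines):
    """Return a list of (atom_type, x, y, z) tuples with numeric coordinates."""
    atoms = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "ATOM":
            continue  # skip the header line and anything that is not an atom
        # float() turns the text "26.395" into the number 26.395
        x, y, z = (float(value) for value in fields[6:9])
        atoms.append((fields[9], x, y, z))
    return atoms

sample = [
    "x y z Type of atom",
    "ATOM 1 C1 GLN D 10 26.395 3.904 4.923 C",
    "ATOM 2 O1 GLN D 10 26.431 2.638 5.002 O",
]
atoms = read_atoms(sample)
```

With a real file you would pass the open file object, or the result of readlines(), instead of `sample`.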

If you have pandas installed, check out the read_fwf function, which imports a fixed-width file and creates a DataFrame (a 2-D tabular data structure). It'll save you lines of code on import and also give you a lot of data-munging functionality if you want to do any additional data manipulation.

Data analysis for inconsistent string formatting

I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of excel files that are formatted strangely (and not consistently) and I need to extract certain fields for each entry. An example data set is
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years
#import the data csv
import sys
import re
import csv
def cleancommas(x):
    toggle=False
    for i,j in enumerate(x):
        if j=="\"":
            toggle=not toggle
        if toggle==True:
            if j==",":
                x=x[:i]+" "+x[i+1:]
    return x
def districtatize(x):
    #list indexes of entries starting with "for" or "to" of length >5
    indices=[1]
    for i,j in enumerate(x):
        if len(j)>2:
            if j[:2]=="to":
                indices.append(i)
        if len(j)>3:
            if j[:3]==" to" or j[:3]=="for":
                indices.append(i)
        if len(j)>5:
            if j[:5]==" \"for" or j[:5]==" \'for":
                indices.append(i)
        if len(j)>4:
            if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
                indices.append(i)
    if len(indices)==1:
        return [x[0],x[1:len(x)-1]]
    new=[x[0],x[1:indices[1]+1]]
    z=1
    while z<len(indices)-1:
        new.append(x[indices[z]+1:indices[z+1]+1])
        z+=1
    return new
    #should return a list of lists. First entry will be county
    #each successive element in list will be list by district
def splitforstos(string):
    for itemind,item in enumerate(string): # take all exception cases that didn't get processed
        splitfor=re.split('(?<=\d)\s\s(?=for)',item) # correctly and split them up so that the for begins
        splitto=re.split('(?<=\d)\s\s(?=to)',item) # a cell
        if len(splitfor)>1:
            print "\n\n\nfor detected\n\n"
            string.remove(item)
            string.insert(itemind,splitfor[0])
            string.insert(itemind+1,splitfor[1])
        elif len(splitto)>1:
            print "\n\n\nto detected\n\n"
            string.remove(item)
            string.insert(itemind,splitto[0])
            string.insert(itemind+1,splitto[1])
def analyze(x):
    #input should be a string of content
    #target values are nomills,levytype,term,yearcom,yeardue
    clean=cleancommas(x)
    countylist=clean.split(',')
    emptystrip=filter(lambda a: a != '',countylist)
    empt2strip=filter(lambda a: a != ' ', emptystrip)
    singstrip=filter(lambda a: a != '\' \'',empt2strip)
    quotestrip=filter(lambda a: a !='\" \"',singstrip)
    splitforstos(quotestrip)
    distd=districtatize(quotestrip)
    print '\n\ndistrictized\n\n',distd
    county = distd[0]
    for x in distd[1:]:
        if len(x)>8:
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            numyears=x[6]
            yearcom=x[8]
            yeardue=x[10]
            reason=x[11]
            data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data",data
        else:
            print "x\n\n",x
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            special=x[5]
            splitspec=special.split(' ')
            try:
                forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
                numyears=splitspec[forind+1]
                yearcom=splitspec[forind+6]
            except:
                forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
                numyears=None
                yearcom=splitspec[forind+2]
            yeardue=str(x[6])[-4:]
            reason=x[7]
            data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
            print "data other", data
        openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
        openfile.writerow(data)
# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1] #the file is the first argument
f=open(filename,'r')
contents=f.read() #entire csv as string
#find index of every instance of the word county
separators=[m.start() for m in re.finditer('\w+\sCOUNTY',contents)] #alternative implementation in regex
# split contents into sections by county
# analyze each section and append to out.csv
for x,y in enumerate(separators):
    try:
        data = contents[y:separators[x+1]]
    except:
        data = contents[y:]
    analyze(data)

is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
    district= x[0],
    vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in ( some, list, of, functions ):
    match= p(data)
    if match:
        return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
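A minimal sketch of that structure follows. Everything specific in it is hypothetical: the `Levy` fields, the two row layouts, and the regular expressions are invented to show the shape of the approach, since the real spreadsheet layouts are not shown in the question.

```python
import re
from collections import namedtuple

# Hypothetical record type; the real one would carry all the target fields.
Levy = namedtuple("Levy", ["district", "mills", "years"])

def match_fixed_term(cells):
    """Hypothetical layout: [district, '5.5 mills', 'for 5 years']."""
    if len(cells) != 3:
        return None
    mills = re.match(r"([\d.]+) mills$", cells[1])
    years = re.match(r"for (\d+) years$", cells[2])
    if mills and years:
        return Levy(cells[0], float(mills.group(1)), int(years.group(1)))
    return None

def match_continuing(cells):
    """Hypothetical layout: [district, '5.5 mills', 'continuing']."""
    if len(cells) == 3 and cells[2] == "continuing":
        mills = re.match(r"([\d.]+) mills$", cells[1])
        if mills:
            return Levy(cells[0], float(mills.group(1)), None)
    return None

# Order matters: the first rule that accepts the row wins.
MATCHERS = (match_fixed_term, match_continuing)

def parse_row(cells):
    """Try each rule in turn; return a Levy, or None if no rule matched."""
    for matcher in MATCHERS:
        match = matcher(cells)
        if match:
            return match
    return None

parsed = parse_row(["Lakewood", "5.5 mills", "for 5 years"])
```

Rows for which `parse_row` returns None can be logged and used to decide which rule to write next, so each new file format costs one small function rather than a rewrite of analyze.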
