Directory comparison - Python

I have a question about comparing directories. I requested a transfer of data from a company server, and I received the data on 8 portable hard drives because the volume was too large to fit on one; each drive contains about 1 TB. Now I want to make sure that all the files that were on the company server were fully transferred, and if any files are missing I want to be able to detect that and ask for them.

The issue is that the only thing I received from the company is one txt file in which the detailed directory structure is saved in tree format. In principle I could just look through it item by item, but with this amount of data that is simply not achievable. I can generate the same kind of directory listing for every one of the 8 drives I received, but how can I then compare that one file against those 8 files? I tried different Python comparison snippets that parse line by line, but they don't work: they compare the files string by string (line by line), while the contents are not plain path strings but tree-style output.

Does anyone have a suggestion for how to do this? Is there a way to convert the tree-format file into plain path strings that a Python program can compare? Or should I request (not sure if that's possible) another file with the directories saved in a different format than a tree? If so, how should it be generated?
I tried to parse it into lists in Python.

Your task consists of two subtasks:
Prepare data;
Compare trees.
Task 2 can be solved with this Tree class:
from collections import defaultdict


class Tree:
    def __init__(self):
        self._children = defaultdict(Tree)

    def __len__(self):
        return len(self._children)

    def add(self, value: str):
        value = value.removeprefix('/').removesuffix('/')
        if value == '':
            return
        first, rest = self._get_values(value)
        self._children[first].add(rest)

    def remove(self, value: str):
        value = value.removeprefix('/').removesuffix('/')
        if value == '':
            return
        first, rest = self._get_values(value)
        self._children[first].remove(rest)
        if len(self._children[first]) == 0:
            del self._children[first]

    def _get_values(self, value):
        values = value.split('/', 1)
        first, rest = values[0], ''
        if len(values) == 2:
            rest = values[1]
        return first, rest

    def __iter__(self):
        if len(self._children) == 0:
            yield ''
            return
        for key, child in self._children.items():
            for value in child:
                if value != '':
                    yield key + '/' + value
                else:
                    yield key
def main():
    tree = Tree()
    tree.add('a/b/c')
    tree.add('a/b/d')
    tree.add('b/b/e')
    tree.add('b/c/f')
    tree.add('c/b/e')
    # duplicates are okay
    tree.add('b/c/f')
    # leading and trailing slashes are okay
    tree.add('b/c/f/')
    tree.add('/b/c/f')
    print('Before removing:', [x for x in tree])
    tree.remove('a/b/c')
    tree.remove('a/b/d')
    tree.remove('b/b/e')
    # it's okay to remove non-existent values
    tree.remove('this/path/does/not/exist')
    # it will not remove the root-level b folder, because it's not empty
    tree.remove('b')
    print('After removing:', [x for x in tree])


if __name__ == '__main__':
    main()
Output:
Before removing: ['a/b/c', 'a/b/d', 'b/b/e', 'b/c/f', 'c/b/e']
After removing: ['b/c/f', 'c/b/e']
So, your algorithm is as follows:
Build a tree of the file paths on the company server (let's call it the A-tree);
Build trees of the file paths on the portable hard drives (let's call them B-trees);
Remove from the A-tree the file paths that exist in the B-trees;
Print the contents of the A-tree: whatever is left was never transferred.
Now all you need is to build these trees, which is task 1 from the beginning of this answer. How to do that depends on the data in your .txt file.
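As one hedged illustration: if the .txt file came from the Unix tree command with its default 4-character connectors (├──, └──, │), a sketch that flattens it into slash-separated paths for the Tree class could look like this. The file names and the connector width here are assumptions, not something taken from your data:

def tree_file_to_paths(filename):
    """Flatten tree-style output into slash-separated paths.

    Assumes every nesting level is drawn with a 4-character group
    such as '├── ', '└── ', '│   ' or '    '.
    """
    paths = []  # accumulated full paths
    stack = []  # names of the ancestors of the current line
    with open(filename, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            name = line.lstrip('│├└─ ')
            depth = (len(line) - len(name)) // 4  # connector groups before the name
            stack = stack[:depth]  # drop finished branches
            stack.append(name)
            paths.append('/'.join(stack))
    return paths

server = Tree()
for path in tree_file_to_paths('server_listing.txt'):
    server.add(path)
for drive_listing in ('drive1.txt', 'drive2.txt'):  # ... one file per drive
    for path in tree_file_to_paths(drive_listing):
        server.remove(path)
print('Missing from the drives:', [p for p in server])

Anything that deviates from that layout (names beginning with spaces or dashes, tree's trailing summary line) would need extra handling. Alternatively, asking the company for a flat listing (for example the output of a find command, one full path per line) would remove the parsing step entirely.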

Related

How to Extract Individual Chords, Rests, and Notes from a MIDI file?

I am making a program that should be able to extract the notes, rests, and chords from a certain MIDI file and write the respective pitch (in MIDI tone numbers, which go from 0-127) of the notes and chords to a CSV file for later use.
For this project, I am using the Python Library "Music21".
from music21 import *
import pandas as pd

# SETUP
path = r"Pirates_TheCarib_midi\1225766-Pirates_of_The_Caribbean_Medley.mid"

# create a function for parsing and extracting the notes
def extract_notes(path):
    stm = converter.parse(path)
    treble = stm[0]  # access the first part (if there is only one part)
    bass = stm[1]
    # note extraction
    notes_treble = []
    notes_bass = []
    for thisNote in treble.getElementsByClass("Note"):
        indiv_note = [thisNote.name, thisNote.pitch.midi, thisNote.offset]
        notes_treble.append(indiv_note)  # record the note and the note's offset
    for thisNote in bass.getElementsByClass("Note"):
        indiv_note = [thisNote.name, thisNote.pitch.midi, thisNote.offset]
        notes_bass.append(indiv_note)  # add the notes to the bass
    return notes_treble, notes_bass

# write to csv
def to_csv(notes_array):
    df = pd.DataFrame(notes_array, index=None, columns=None)
    df.to_csv("attempt1_v1.csv")

# using the functions
notes_array = extract_notes(path)
#to_csv(notes_array)

# DEBUGGING
stm = converter.parse(path)
print(stm.parts)
Here is the link to the score I am using as a test:
https://musescore.com/user/1699036/scores/1225766
When I run the extract_notes function, it returns two empty arrays, and the line
print(stm.parts)
returns
<music21.stream.iterator.StreamIterator for Score:0x1b25dead550 #:0>
I am confused as to why it does this. The piece should have two parts, treble and bass. How can I get each note, chord, and rest into an array so I can put it in a CSV file?
Here is a small snippet showing how I did it. I needed to get all notes, chords, and rests for a specific instrument, so I first iterated through the parts to find that instrument, and then checked the type of each element and appended it accordingly.
You can call this method like
notes = get_notes_chords_rests(keyboard_instruments, "Pirates_of_The_Caribbean.mid")
where keyboard_instruments is a list of instrument names:
keyboard_instruments = ["KeyboardInstrument", "Piano", "Harpsichord", "Clavichord", "Celesta"]
from music21 import converter, instrument, note, chord, stream

def get_notes_chords_rests(instrument_type, path):
    try:
        midi = converter.parse(path)
        parts = instrument.partitionByInstrument(midi)
        note_list = []
        for music_instrument in range(len(parts)):
            if parts.parts[music_instrument].id in instrument_type:
                for element_by_offset in stream.iterator.OffsetIterator(parts[music_instrument]):
                    for entry in element_by_offset:
                        if isinstance(entry, note.Note):
                            note_list.append(str(entry.pitch))
                        elif isinstance(entry, chord.Chord):
                            note_list.append('.'.join(str(n) for n in entry.normalOrder))
                        elif isinstance(entry, note.Rest):
                            note_list.append('Rest')
        return note_list
    except Exception:
        print("failed on", path)
P.S. It is important to use a try block, because a lot of MIDI files on the web are corrupted.
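To get from that list into a CSV file, which is what the question is ultimately after, a minimal sketch (assuming a single pitch column is enough; the output file name is reused from the question) could be:

import csv

notes = get_notes_chords_rests(keyboard_instruments,
                               "Pirates_of_The_Caribbean.mid")

with open("attempt1_v1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["pitch"])            # single header column
    writer.writerows([n] for n in notes)  # one note/chord/rest per row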

MIME Type extraction on Codingame

I'm trying to solve a puzzle on CodinGame.
The goal is to output the MIME type for each given file name, like so:
Input:
2
4
html text/html
png image/png
test.html
noextension
portrait.png
doc.TXT
Output:
text/html
UNKNOWN
image/png
UNKNOWN
My code runs smoothly so far, but with a longer list I get an error that my process has timed out.
Can anybody give me a hint?
My code:
import sys
import math

# Auto-generated code below aims at helping you parse
# the standard input according to the problem statement.
n = int(input())  # Number of elements which make up the association table.
q = int(input())  # Number Q of file names to be analyzed.
mime = dict(input().split() for x in range(n))
fname = [input() for x in range(q)]

# Write an action using print
# To debug: print("Debug messages...", file=sys.stderr)
for b in fname:
    if '.' not in b:
        print('UNKNOWN')
        continue
    else:
        for idx, char in enumerate(reversed(b)):
            if char == '.':
                ext = b[len(b)-idx:]
                for k, v in mime.items():
                    if ext.lower() == k.lower():
                        print(v)
                        break
                else:
                    print('UNKNOWN')
                break
I don't know for certain that it's the reason for your code timing out, but iterating over your dictionary's keys and values to do a case-insensitive match against the keys is always going to be relatively slow. The main reason to use a dictionary is that it has O(1) lookup times. If you don't take advantage of that, you'd be better off with a list of tuples.
Fortunately, it's not that hard to fix up your dict so that you can use indexing instead of iterating to check each key manually. Here's the relevant changes to your code:
# set up the dict with lowercase extensions
mime = {ext.lower(): tp for ext, tp in (input().split() for _ in range(n))}
fnames = [input() for x in range(q)]  # changed the variable name here, plural for a list

for filename in fnames:  # better variable name than `b`
    if '.' not in filename:
        print("UNKNOWN")  # no need for continue
    else:
        _, extension = filename.rsplit('.', 1)  # no need to write your own split loop
        print(mime.get(extension.lower(), "UNKNOWN"))  # do an O(1) lookup in the dictionary
In addition to changing up the dictionary creation and lookup code, I've also simplified the code you were using to split off the extension, though my way is still O(N) just like yours is (where N is the length of the file names).
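Put together, a runnable sketch of the whole solution, fed from a hard-coded sample instead of stdin so it can be tested outside CodinGame (the wrapper function is mine, not part of the puzzle template):

def mime_types(lines):
    it = iter(lines)
    n = int(next(it))  # number of table entries
    q = int(next(it))  # number of file names
    # lowercase the extensions once, at build time
    mime = {ext.lower(): tp
            for ext, tp in (next(it).split() for _ in range(n))}
    out = []
    for _ in range(q):
        filename = next(it)
        if '.' not in filename:
            out.append('UNKNOWN')
        else:
            _, extension = filename.rsplit('.', 1)
            out.append(mime.get(extension.lower(), 'UNKNOWN'))
    return out

sample = ['2', '4', 'html text/html', 'png image/png',
          'test.html', 'noextension', 'portrait.png', 'doc.TXT']
print('\n'.join(mime_types(sample)))
# prints: text/html, UNKNOWN, image/png, UNKNOWN (one per line)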

Python: Parsing through XML with minidom

I'm parsing through a decent-sized XML file, and I ran into a problem. For some reason I cannot extract data, even though I have done the exact same thing on different XML files before.
Here's a snippet of my code (I've tested the rest of the program and it works fine):
EDIT: changed to include a try & except block for testing
import codecs
from xml.dom.minidom import parseString

def parseXML():
    file = open(str(options.drugxml), 'r')
    data = file.read()
    file.close()
    dom = parseString(data)
    druglist = dom.getElementsByTagName('drug')
    count = 0  # debugging counter
    with codecs.open(str(options.csvdata), 'w', 'utf-8') as csvout, open('DrugTargetRel.csv', 'w') as dtout:
        for entry in druglist:
            count = count + 1
            try:
                drugtype = entry.attributes['type'].value
                print count
            except:
                print count
                print entry
            drugidObj = entry.getElementsByTagName('drugbank-id')[0]
            drugid = drugidObj.childNodes[0].nodeValue
            drugnameObj = entry.getElementsByTagName('name')[0]
            drugname = drugnameObj.childNodes[0].nodeValue
            targetlist = entry.getElementsByTagName('target')
            for target in targetlist:
                targetid = target.attributes['partner'].value
                dtout.write((','.join((drugid, targetid))) + '\n')
            csvout.write((','.join((drugid, drugname, drugtype))) + '\n')
In case you're wondering what the XML file's schema roughly looks like, here's a rough god-awful sketch of the levels:
<drugs>
  <drug type='something' ...>
    <drugbank-id>
    <name>
    ...
    <targets>
      <target partner='something'>
The fields I typed in here are the ones I need to extract from the XML file and stick into CSV files (as the code above shows). The code has worked on different XML files before, and I'm not sure why it's not working on this one. I've gotten a KeyError on 'type', and I've also gotten indexing errors on the line that extracts drugid, even though EVERY drug has a drugid. What am I screwing up here?
EDIT: the fields I'm extracting are guaranteed to be in each drug.
For anyone who cares, here's the link to the XML file I'm parsing:
http://www.drugbank.ca/system/downloads/current/drugbank.xml.zip
EDIT: After implementing a try & except block (see above), here's what I found out:
In the schema, there are sections called "drug interactions" that also have a subfield called drug. So like this:
<drugs>
  <drug type='something' ...>
    <drugbank-id>
    <name>
    ...
    <targets>
      <target partner='something'>
    <drug-interactions>
      <drug>
I think that my line druglist = dom.getElementsByTagName('drug') is unintentionally picking those up as well -- I don't know how I could fix this... any suggestions?
Basically, when parsing an XML file you can't rely on the fact that you know the structure. It's good practice to discover the structure in code.
So every time you access elements or attributes, check first whether they are there. In your code that means the following.
Make sure there's a 'type' attribute on the drug element:
drugtype = entry.attributes['type'].value if entry.attributes.has_key('type') else 'defaulttype'
Make sure getElementsByTagName doesn't return an empty list before accessing its elements (note that drugbank-id is not a valid Python name, hence the underscore):
drugbank_id = entry.getElementsByTagName('drugbank-id')
drugidObj = drugbank_id[0] if drugbank_id else None
Also, before accessing childNodes, make sure there are any (hasChildNodes is a method, so call it):
if drugidObj.hasChildNodes():
    drugid = drugidObj.childNodes[0].nodeValue
Or use a for loop to iterate over them.
And when you call getElementsByTagName on the drugs element, it returns all matching descendants, including the nested ones. To get only the drug elements that are direct children of the drugs element, you have to use the childNodes attribute.
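To illustrate that last point, a hedged sketch of collecting only the <drug> elements that are direct children of the root (skipping the ones nested inside <drug-interactions>) could look like this; the helper name is mine:

def direct_drug_children(dom):
    # documentElement is the <drugs> root; filter its immediate children
    root = dom.documentElement
    return [node for node in root.childNodes
            if node.nodeType == node.ELEMENT_NODE
            and node.tagName == 'drug']

druglist = direct_drug_children(dom)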
I had a feeling that maybe there was something weird happening due to running out of memory or something, so I rewrote the parser using an iterator over each drug and tried it out and got the program to complete without raising an exception.
Basically what I'm doing here is, instead of loading the entire XML file into memory, I parse the XML file for the beginning and end of each <drug> and </drug> tag. Then I parse that with the minidom each time.
The code might be a little fragile as I assume that each <drug> and </drug> pair are on their own lines. Hopefully it helps more than it harms though.
#!python
import codecs
from xml.dom import minidom

class DrugBank(object):
    def __init__(self, filename):
        self.fp = open(filename, 'r')

    def __iter__(self):
        return self

    def next(self):
        state = 0
        while True:
            line = self.fp.readline()
            if state == 0:
                if line.strip().startswith('<drug '):
                    lines = [line]
                    state = 1
                    continue
                if line.strip() == '</drugs>':
                    self.fp.close()
                    raise StopIteration()
            if state == 1:
                lines.append(line)
                if line.strip() == '</drug>':
                    return minidom.parseString("".join(lines))
with codecs.open('csvout.csv', 'w', 'utf-8') as csvout, open('dtout.csv', 'w') as dtout:
    db = DrugBank('drugbank.xml')
    for dom in db:
        entry = dom.firstChild
        drugtype = entry.attributes['type'].value
        drugidObj = entry.getElementsByTagName('drugbank-id')[0]
        drugid = drugidObj.childNodes[0].nodeValue
        drugnameObj = entry.getElementsByTagName('name')[0]
        drugname = drugnameObj.childNodes[0].nodeValue
        targetlist = entry.getElementsByTagName('target')
        for target in targetlist:
            targetid = target.attributes['partner'].value
            dtout.write((','.join((drugid, targetid))) + '\n')
        csvout.write((','.join((drugid, drugname, drugtype))) + '\n')
An interesting read that might help you out further is here:
http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
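If you are not tied to minidom, the streaming idea from that article is also available in the standard library as xml.etree.ElementTree.iterparse. A sketch, assuming un-namespaced tags as in the schema drawing above (real DrugBank dumps may use an XML namespace, which would change the tag strings):

import xml.etree.ElementTree as ET

def iter_top_level_drugs(filename):
    """Yield each <drug> that is a direct child of <drugs>,
    skipping the nested <drug> tags in <drug-interactions>."""
    context = ET.iterparse(filename, events=('start', 'end'))
    _, root = next(context)  # 'start' event for the <drugs> root
    depth = 0
    for event, elem in context:
        if event == 'start':
            depth += 1
        else:  # 'end': direct children of the root close at depth 1
            if depth == 1 and elem.tag == 'drug':
                yield elem
                root.clear()  # free elements we have already processed
            depth -= 1

for drug in iter_top_level_drugs('drugbank.xml'):
    print(drug.get('type'), drug.findtext('drugbank-id'))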

Data analysis for inconsistent string formatting

I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of Excel files that are formatted strangely (and not consistently), and I need to extract certain fields from each entry. (An example data set was attached as an image, not reproduced here.)
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
Write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years

#import the data csv
import sys
import re
import csv

def cleancommas(x):
    toggle=False
    for i,j in enumerate(x):
        if j=="\"":
            toggle=not toggle
        if toggle==True:
            if j==",":
                x=x[:i]+" "+x[i+1:]
    return x

def districtatize(x):
    #list indexes of entries starting with "for" or "to" of length >5
    indices=[1]
    for i,j in enumerate(x):
        if len(j)>2:
            if j[:2]=="to":
                indices.append(i)
        if len(j)>3:
            if j[:3]==" to" or j[:3]=="for":
                indices.append(i)
        if len(j)>5:
            if j[:5]==" \"for" or j[:5]==" \'for":
                indices.append(i)
        if len(j)>4:
            if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
                indices.append(i)
    if len(indices)==1:
        return [x[0],x[1:len(x)-1]]
    new=[x[0],x[1:indices[1]+1]]
    z=1
    while z<len(indices)-1:
        new.append(x[indices[z]+1:indices[z+1]+1])
        z+=1
    return new
    #should return a list of lists. First entry will be county
    #each successive element in list will be list by district

def splitforstos(string):
    for itemind,item in enumerate(string):  # take all exception cases that didn't get processed
        splitfor=re.split('(?<=\d)\s\s(?=for)',item)  # correctly and split them up so that the for begins
        splitto=re.split('(?<=\d)\s\s(?=to)',item)  # a cell
        if len(splitfor)>1:
            print "\n\n\nfor detected\n\n"
            string.remove(item)
            string.insert(itemind,splitfor[0])
            string.insert(itemind+1,splitfor[1])
        elif len(splitto)>1:
            print "\n\n\nto detected\n\n"
            string.remove(item)
            string.insert(itemind,splitto[0])
            string.insert(itemind+1,splitto[1])

def analyze(x):
    #input should be a string of content
    #target values are nomills,levytype,term,yearcom,yeardue
    clean=cleancommas(x)
    countylist=clean.split(',')
    emptystrip=filter(lambda a: a != '',countylist)
    empt2strip=filter(lambda a: a != ' ', emptystrip)
    singstrip=filter(lambda a: a != '\' \'',empt2strip)
    quotestrip=filter(lambda a: a !='\" \"',singstrip)
    splitforstos(quotestrip)
    distd=districtatize(quotestrip)
    print '\n\ndistrictized\n\n',distd
    county = distd[0]
    for x in distd[1:]:
        if len(x)>8:
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            numyears=x[6]
            yearcom=x[8]
            yeardue=x[10]
            reason=x[11]
            data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data",data
        else:
            print "x\n\n",x
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            special=x[5]
            splitspec=special.split(' ')
            try:
                forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
                numyears=splitspec[forind+1]
                yearcom=splitspec[forind+6]
            except:
                forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
                numyears=None
                yearcom=splitspec[forind+2]
            yeardue=str(x[6])[-4:]
            reason=x[7]
            data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
            print "data other", data
        openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
        openfile.writerow(data)

# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1]  #the file is the first argument
f=open(filename,'r')
contents=f.read()  #entire csv as string
#find index of every instance of the word county
separators=[m.start() for m in re.finditer('\w+\sCOUNTY',contents)]  #alternative implementation in regex
# split contents into sections by county
# analyze each section and append to out.csv
for x,y in enumerate(separators):
    try:
        data = contents[y:separators[x+1]]
    except:
        data = contents[y:]
    analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
    district=x[0],
    vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in (some, list, of, functions):
    match = p(data)
    if match:
        return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
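A concrete sketch of that shape, with the field names borrowed from the question's variables (the class and function names, and the split rules, are placeholders):

from collections import namedtuple

Levy = namedtuple('Levy',
    'district vote1 vote2 mills votetype numyears yearcom yeardue reason')

def match_long_row(x):
    # layout with more than 8 fields; rpartition mirrors the spaceindex logic above
    if len(x) <= 8:
        return None
    vote2, _, mills = x[2].rpartition(' ')
    return Levy(x[0], x[1], vote2, mills, x[4], x[6], x[8], x[10], x[11])

def match_short_row(x):
    # the other layout from the question would be decoded here
    return None

def analyze_row(x):
    for pattern in (match_long_row, match_short_row):
        match = pattern(x)
        if match:
            return match
    return None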

What is the lightest way of doing this task?

I have a file whose contents are of the form:
.2323 1
.2327 1
.3432 1
.4543 1
and so on some 10,000 lines in each file.
I have a variable whose value is say a=.3344
From the file I want to get the row number of the row whose first column is closest to this variable...for example it should give row_num='3' as .3432 is closest to it.
I have tried loading the first column's elements into a list and then comparing the variable to each element and taking the index of the minimum difference.
Done this way it is very time-consuming and slows my model down; I want a very quick method, as this needs to be called some 1000 times at minimum.
I want a method with the least overhead. Can anyone tell me how this can be done very fast?
As the file size is at most 100 KB, can this be done directly, without loading it into a list or anything? If yes, how?
Any method quicker than the one mentioned above is welcome; I am desperate to improve the speed -- please help.
def get_list(file, cmp, fout):
    ind, _ = min(enumerate(file), key=lambda x: abs(x[1] - cmp))
    return fout[ind].rstrip('\n').split(' ')

#root = r'c:\begpython\wavnk'
header = 6
for lst in lists:
    save = database_index[lst]
    #print save
    index, base, abs2, _, abs1 = save
    using_data[index] = save
    base = 'C:/begpython/wavnk/' + base.replace('phone', 'text')
    fin, fout = base + '.pm', base + '.mcep'
    file = open(fin)
    fout = open(fout).readlines()
    [next(file) for _ in range(header)]
    file = [float(line.partition(' ')[0]) for line in file]
    join_cost_index_end[index] = get_list(file, float(abs1), fout)
    join_cost_index_strt[index] = get_list(file, float(abs2), fout)
This is the code I was using: copying the file into a list. Please suggest better alternatives to this.
Building on John Kugelman's answer, here's a way you might be able to do a binary search on a file with fixed-length lines:
class SubscriptableFile(object):
    def __init__(self, file):
        self._file = file
        file.seek(0, 0)
        self._line_length = len(file.readline())
        file.seek(0, 2)
        self._len = file.tell() // self._line_length

    def __len__(self):
        return self._len

    def __getitem__(self, key):
        self._file.seek(key * self._line_length)
        s = self._file.readline()
        if s:
            return float(s.split()[0])
        else:
            raise KeyError('Line number too large')
This class wraps a file in a list-like structure, so that now you can use the functions of the bisect module on it:
import bisect

def find_row(file, target):
    fw = SubscriptableFile(file)
    i = bisect.bisect_left(fw, target)
    # bisect_left gives the first index whose value is >= target, so the
    # closest value is either fw[i] or its left neighbour fw[i - 1]
    if i == 0:
        return 0
    if i == len(fw) or target - fw[i - 1] <= fw[i] - target:
        return i - 1
    return i
Here file is an open file object and target is the number you want to find. The function returns the number of the line with the closest value.
I will note, however, that the bisect module will try to use a C implementation of its binary search when it is available, and I'm not sure if the C implementation supports this kind of behavior. It might require a true list, rather than a "fake list" (like my SubscriptableFile).
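If that turns out to be a problem, a hand-rolled binary search over the same wrapper avoids the question entirely; this sketch mirrors bisect_left and assumes the file's first column is sorted ascending:

def bisect_left_seq(seq, target):
    # plain-Python bisect_left: works on anything with __len__/__getitem__
    lo, hi = 0, len(seq)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo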
Is the data in the file sorted in numerical order? Are all the lines of the same length? If not, the simplest approach is best. Namely, reading through the file line by line. There's no need to store more than one line in memory at a time.
Code:
def closest(num):
    closest_row = None
    closest_value = None
    for row_num, row in enumerate(file('numbers.txt')):
        value = float(row.split()[0])
        if closest_value is None or abs(value - num) < abs(closest_value - num):
            closest_row = row
            closest_row_num = row_num
            closest_value = value
    return (closest_row_num, closest_row)

print closest(.3344)
Output for sample data:
(2, '.3432 1\n')
If the lines are all the same length and the data is sorted then there are some optimizations that will make this a very fast process. All the lines being the same length would let you seek directly to particular lines (you can't do this in a normal text file with lines of different length). Which would then enable you to do a binary search.
A binary search would be massively faster than a linear search. A linear search will on average have to read 5,000 lines of a 10,000 line file each time, whereas a binary search would on average only read log2 10,000 ≈ 13 lines.
Load it into a list then use bisect.
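A hedged sketch of that suggestion, again assuming the first column is sorted ascending (the file name is a placeholder). Loading 10,000 floats once and bisecting 1,000 times is far cheaper than 1,000 linear scans:

import bisect

with open('numbers.txt') as f:
    values = [float(line.split()[0]) for line in f]

def closest_row(target):
    i = bisect.bisect_left(values, target)
    if i == 0:
        return 0
    if i == len(values):
        return len(values) - 1
    # pick whichever neighbour of the insertion point is nearer
    return i - 1 if target - values[i - 1] <= values[i] - target else i

print(closest_row(.3344))  # -> 2 for the sample data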
