Parsing conditional statements - python

I've written a small utility in Python 3 to help me copy my music collection from my NAS to a mobile device. Its usefulness is that it auto-converts FLAC files to Ogg Vorbis (to save space) and also excludes some files based on their audio tags (i.e. artist, album, date, etc.).
I'm not happy with the limited nature of the exclude feature and I want to improve it, but I've hit a mental block and I'm looking for advice on how to proceed.
I would like the user to write an exclude file which will look something like this:
exclude {
    artist is "U2"
    artist is "Uriah Heep" {
        album is "Spellbinder"
        album is "Innocent Victim"
    }
}
This would translate to:
exclude if
(artist = "U2") OR
(artist = "Uriah Heep" AND (album = "Spellbinder" OR album = "Innocent Victim"))
There will be more conditionals such as sub-string matching and date ranges.
I've been checking out PLY but I'm struggling with the concepts of how to parse this type of nested structure and also how to represent the resulting conditional so that I can execute it in code when applying the exclude filter during the copy operation.

Your data structure is almost a dict, why not just use JSON? To go one better, you could use Lucene Query Syntax.
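A minimal sketch of the JSON idea, assuming a hand-rolled schema (the "any" key and the field names here are my own invention, not part of any standard): the exclude file becomes a top-level OR list, each rule's tag fields are ANDed together, and an optional nested "any" list is ORed underneath.

```python
import json

# Hypothetical JSON encoding of the exclude file from the question.
rules = json.loads("""
[
    {"artist": "U2"},
    {"artist": "Uriah Heep",
     "any": [{"album": "Spellbinder"}, {"album": "Innocent Victim"}]}
]
""")

def excluded(tags, rules):
    """Return True if any rule matches: every tag field in the rule must
    equal the track's tag, and a nested "any" list is ORed recursively."""
    for rule in rules:
        nested = rule.get("any")
        fields = {k: v for k, v in rule.items() if k != "any"}
        if all(tags.get(k) == v for k, v in fields.items()):
            if nested is None or excluded(tags, nested):
                return True
    return False

print(excluded({"artist": "U2", "album": "War"}, rules))                  # True
print(excluded({"artist": "Uriah Heep", "album": "Firefly"}, rules))      # False
print(excluded({"artist": "Uriah Heep", "album": "Spellbinder"}, rules))  # True
```

The nesting mirrors the original exclude file, so no parser generator is needed: json.loads does the parsing and a short recursive function does the evaluation.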

Related

Replace a value on one data set with a value from another data set with a dynamic lookup

This question relates primarily to Alteryx; however, if it can be done in Python, or in R within an Alteryx workflow using the R tool, then that would work as well.
I have two data sets.
Address (contains address information: Line1, Line2, City, State, Zip)
USPS (contains USPS abbreviations: Street to ST, Boulevard to BLVD, etc.)
Goal: Look at the string on the Address data set for Line1. IF it CONTAINS one of the types of streets in the USPS data set, I want to replace that part of the string with its proper abbreviation which is in a different column of the USPS data set.
Example, 123 Main Street would become 123 Main St
What I have tried:
Imported the two data sets.
Union the two data sets with the instruction of Output All Fields for When Fields Differ.
Added a formula, but this is where I am getting stuck. So far it reads:
if [Addr1] Contains(Sting, Target)
Not sure how to have it look in the USPS for one of the values. I am also not certain if this sort of dynamic lookup can take place.
If this can be done in python (I know very basic Python so I don't have code for this yet because I do not know where to start other than importing the data) I can use python within Alteryx.
Any assistance would be great. Please let me know if you need additional information.
Thank you in advance.
Use the Find Replace tool in Alteryx. This tool is akin to a lookup. Also, the Alteryx Community is a good go-to resource for these types of questions.
Input the Address dataset into the top anchor of the Find Replace tool and the USPS dataset into the bottom anchor. You'll want to find any part of the address field using the lookup field and replace it with the abbreviation field. If you need to do this across several fields in the Address dataset, then you could replicate this logic or you could use a Record ID tool, Transpose, run this logic on one field, and then Cross Tab back to the original schema. It's an advanced recipe that you'll want to master in Alteryx.
https://help.alteryx.com/current/FindReplace.htm
The overall logic that can be used is here: Using str_detect (or some other function) and some way to loop through a list to essentially perform a vlookup
However, in order to expand to Alteryx, you would need to add the Alteryx R tool. Also, some of the code would need to be changed to use the syntax that Alteryx likes.
read in the data with:
read.Alteryx('#Link Number', mode = 'data.frame')
After, the above linked question will provide the overall framework for the logic. Reiterated here:
library(stringr)  # for str_replace_all

usps[] = lapply(usps, as.character)

## Copies the original address data to a new column that will
## be altered. Preserves the original formatting for rollback
## if necessary.
vendorData$new_addr1 = as.character(vendorData$Addr1)

## Loops through the dictionary, replacing all of the common names
## with their USPS-approved abbreviations for the Addr1 field.
for (i in 1:nrow(usps)) {
  vendorData$new_addr1 = str_replace_all(
    vendorData$new_addr1,
    pattern = paste0("\\b", usps$Abbreviation[i], "\\b"),
    replacement = usps$USPS_Abbrv_updated[i]
  )
}
Finally, in order to be able to see the output, we would need to write a statement that will output it in one of the 5 output slots the R tool has. Here is the code for that:
write.Alteryx(data, #)
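If you'd rather use Alteryx's Python tool instead of R, the same loop-and-replace logic is a few lines with the standard re module. This is only a sketch; the lookup dict below is a hypothetical stand-in for the USPS dataset, which the real workflow would read from an input anchor instead:

```python
import re

# Hypothetical lookup table standing in for the USPS dataset.
usps = {"Street": "St", "Boulevard": "Blvd", "Avenue": "Ave"}

def abbreviate(addr):
    """Replace each full street type with its abbreviation, whole words only,
    so e.g. 'Streetwise' is left untouched."""
    for full, abbrv in usps.items():
        addr = re.sub(r"\b%s\b" % re.escape(full), abbrv, addr)
    return addr

print(abbreviate("123 Main Street"))      # 123 Main St
print(abbreviate("500 Sunset Boulevard")) # 500 Sunset Blvd
```

The `\b` word boundaries play the same role as the `paste0("\\b", ..., "\\b")` in the R version.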

LDAP Query Filter: Users with Groups Like *x*

I'm currently using Python and LDAP to query Active Directory for users.
I have a list of names that are First Last. Not specific enough to find the exact user.
I would like a filter that would find all users matching 'Last, First*'
and belonging to any group with a keyword in it.
_filter = '''(& (objectclass=user)
                (objectcategory=person)
                (name={}*) )'''.format(search_string)
and I've tried adding...
(memberOf=CN=*Keyword*,OU=Delegated,OU=Groups,DC=amr,DC=corp,DC=xxxxxx,DC=com)
To my filter, but with no success.
If this was SQL, I would write something like:
Select *
From
Users
Where
Users.name like 'First, Last%'
and Users.memberOf like 'Keyword%'
Update:
After reviewing Gabriel's answer I'm running this.
def get_idsids(self, search_string):
    _filter = '''(& (objectclass=user)
                    (objectcategory=person)
                    (anr={}) )'''.format(search_string)
    # Search for users. Will return a list of users matching the criteria.
    # The results are wrapped as a list(tuple(dict)) where the dict values
    # are binary strings or lists of binary strings.
    users = self.con.search_s(ActiveDirUser.BASEDN, ldap.SCOPE_SUBTREE, _filter,
                              ['displayName', 'sAMAccountName', 'memberOf'])
    # This line is ugly... it just converts the results to a list of ids,
    # keeping only users with at least one group with 'Keyword' in the name.
    # upper() makes the keyword requirement case-insensitive.
    return [user[1]['sAMAccountName'][0].decode() for user in users
            if 'KEYWORD' in ''.join(map(str, user[1]['memberOf'])).upper()]
I do wonder though, could I search for groups with 'Keyword' in the name and build filters from that? Further, would that be faster? I assume it would as AD probably hashes group membership.
I will go do some reading, but I assume group names are wildcard searchable?
I suggest you use Ambiguous Name Resolution:
_filter = '''(& (objectclass=user)
                (objectcategory=person)
                (anr={}) )'''.format(search_string)
Read that documentation to understand how it works, but it can find users if you give it a string of "first last". This is what the search box in AD Users and Computers uses.
Just be aware that you can get doubles if people have similar names. If you take my name for example: if you would search for "Gabriel Luci", and there was someone else with the name "Gabriel Luciano", you would find both of us.
This:
(memberOf=CN=*Keyword*,OU=Delegated,OU=Groups,DC=amr,DC=corp,DC=xxxxxx,DC=com)
doesn't work because you can't use wildcards on any attribute that is a distinguishedName, like memberOf.
That's for Active Directory anyway. Other LDAP directories might allow it.
If you need to check whether the users are members of groups, then you can tell your search to return the memberOf attribute (I don't know Python, but you should have a way of telling it which attributes you want returned). Then you can loop through the groups in the memberOf attribute and look for that keyword.
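To sketch the two-step idea raised in the question (first find the groups, then match users against their exact DNs, since memberOf accepts no wildcards): the helper below only builds the filter string. The group DNs would come from a prior search such as (&(objectclass=group)(cn=*Keyword*)); the function name and the sample DN are made up for illustration:

```python
def build_member_filter(search_string, group_dns):
    """Build an AD filter matching users via ANR plus membership in any of
    the given groups. memberOf clauses use exact DNs (no wildcards allowed),
    ORed together with the (|...) operator."""
    member_clauses = "".join("(memberOf=%s)" % dn for dn in group_dns)
    return ("(&(objectclass=user)(objectcategory=person)"
            "(anr=%s)(|%s))" % (search_string, member_clauses))

# group_dns would come from a first search_s() call filtering on cn=*Keyword*
print(build_member_filter("Gabriel Luci",
                          ["CN=KeywordAdmins,OU=Groups,DC=corp,DC=com"]))
```

Whether this is faster than post-filtering memberOf client-side depends on how many matching groups exist; with few groups the server-side OR filter should cut down the result set considerably.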

Add a field to existing document in CouchDB

I have a database with a bunch of regular documents that look something like this (example from wiki):
{
    "_id": "some_doc_id",
    "_rev": "D1C946B7",
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "PostedDate": "2006-08-15T17:30:12-04:00",
    "Tags": ["plankton", "baseball", "decisions"],
    "Body": "I decided today that I don't like baseball. I like plankton."
}
I'm working in Python with couchdb-python and I want to know if it's possible to add a field to each document. For example, if I wanted to have a "Location" field or something like that.
Thanks!
Regarding IDs
Every document in couchdb has an id, whether you set it or not. Once the document is stored you can access it through the doc._id field.
If you want to set your own ids you'll have to assign the id value to doc._id. If you don't set it, then couchdb will assign a uuid.
If you want to update a document, then you need to make sure you have the same id and a valid revision. If, say, you are working from a blog post and the user adds the Location, then the URL of the post may be a good id to use. You'd be able to access the document instantly in this case.
So what's a revision
In your code snippet above you have the doc._rev element. This is the identifier of the revision. If you save a document with an id that already exists, couchdb requires you to prove that the document is still the valid doc and that you are not trying to overwrite someone else's document.
So how do I update a document
If you have the id of your document, you can just access each document by using the db.get(id) function. You can then update the document like this:
doc = db.get(id)
doc['Location'] = "On a couch"
db.save(doc)
I have an example where I store weather forecast data. I update the forecasts approximately every 2 hours. A separate process enriches the same documents with data from a different provider, based on characteristics of tweets from that day.
This looks something like this.
doc = db.get(id)
doc_with_loc = GetLocationInformationFromOtherProvider(doc) # takes about 40 seconds.
doc_with_loc["_rev"] = doc["_rev"]
db.save(doc_with_loc) # This will fail if weather update has also updated the file.
If you have concurrent processes, then the _rev can become invalid, so you have to have a failsafe, e.g. this could do:
import couchdb

doc = db.get(id)
doc_with_loc = GetLocationInformationFromAltProvider(doc)
while True:
    try:
        db.save(doc_with_loc)
        break
    except couchdb.http.ResourceConflict:
        # Re-retrieve the document to pick up the current revision.
        doc = db.get(id)
        doc_with_loc["_rev"] = doc["_rev"]
So how do I get the Ids?
One option suggested above is to actively set the id so you can retrieve it later, i.e. if a user sets a given location that is attached to a URL, use the URL. But you may not know which document you want to update, or you may want a process that finds all the documents that don't have a location and assigns one.
You'll most likely be using a view for this. Views have a mapper and a reducer. You'll use the first one; forget about the last one. A view with a mapper does the following:
It returns a simplified/transformed way of looking at your data. You can return multiple values per document or skip some. It gives the data you emit a key, and if you use the include_docs option it will give you the document (with _id and _rev) alongside.
The simplest view is the default view db.view('_all_docs'); this returns all documents, and you may not want to update all of them. Note that views are themselves stored as documents when you define them.
The next simplest approach is a view that only returns items of a given document type. I tend to have a _type="article" field in my database. Think of this as marking that a document belongs to a certain table, if you had stored them in a relational database.
Finally you can filter for elements that lack a location, so you'd have a view where you can iterate over all those docs that still need a location and fix them in a separate process. The best documentation on writing views can be found here.
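As a hedged sketch of that last point (the design-document name, view name, and _type value below are my own assumptions): a map function that emits only the docs still missing a Location, which a separate Python process can then iterate over and fix.

```python
# Map function (stored inside a design document) that emits every doc of
# our article type that still lacks a Location field. CouchDB map functions
# are written in JavaScript, so we keep it as a string here.
map_needs_location = """
function (doc) {
    if (doc._type === "article" && !doc.Location) {
        emit(doc._id, null);
    }
}
"""

# Hypothetical usage with couchdb-python (requires a running server):
# db["_design/maintenance"] = {"views": {"needs_location": {"map": map_needs_location}}}
# for row in db.view("maintenance/needs_location"):
#     doc = db[row.id]
#     doc["Location"] = lookup_location(doc)  # lookup_location is hypothetical
#     db.save(doc)
```

Once every article has a Location, the view returns an empty result set, so the fix-up process naturally becomes a no-op.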

Create an index of the content of each file in a folder

I'm making a search tool in Python.
Its objective is to be able to search files by their content. (We're mostly talking about source files and text files, not images/binaries, even if searching their metadata would be a great improvement.) For now I don't use regular expressions, just plain text.
This part of the algorithm works great !
The problem is that I realize I'm searching mostly in the same few folders. I'd like to find a way to build an index of the content of each file in a folder, and to be able to tell as fast as possible whether the sentence I'm searching for is in xxx.txt or can't be there.
The idea for now is to maintain a checksum for each file that lets me know whether it might contain a particular string.
Do you know any algorithm close to this ?
I don't need a 100% success rate; I'd rather have a small index than a big one with 100% success.
The idea is to provide a generic tool.
EDIT: To be clear, I want to search a PART of the content of the file. So making an MD5 hash of its full content and comparing it with the hash of what I'm searching for isn't a good idea ;)
Here I am using the whoosh library for indexing and searching. The upper part indexes the files and the lower part is a demo search.
# indexing part
import os
import stat
import time

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT

schema = Schema(FileName=TEXT(stored=True), FilePath=TEXT(stored=True),
                Size=TEXT(stored=True), LastModified=TEXT(stored=True),
                LastAccessed=TEXT(stored=True), CreationTime=TEXT(stored=True),
                Mode=TEXT(stored=True))

os.makedirs("./my_whoosh_index_dir", exist_ok=True)  # the index directory must exist
ix = create_in("./my_whoosh_index_dir", schema)
writer = ix.writer()
for top, dirs, files in os.walk('./my_test_dir'):
    for nm in files:
        fileStats = os.stat(os.path.join(top, nm))
        fileInfo = {
            'FileName': nm,
            'FilePath': os.path.join(top, nm),
            'Size': fileStats[stat.ST_SIZE],
            'LastModified': time.ctime(fileStats[stat.ST_MTIME]),
            'LastAccessed': time.ctime(fileStats[stat.ST_ATIME]),
            'CreationTime': time.ctime(fileStats[stat.ST_CTIME]),
            'Mode': fileStats[stat.ST_MODE],
        }
        writer.add_document(FileName=u'%s' % fileInfo['FileName'],
                            FilePath=u'%s' % fileInfo['FilePath'],
                            Size=u'%s' % fileInfo['Size'],
                            LastModified=u'%s' % fileInfo['LastModified'],
                            LastAccessed=u'%s' % fileInfo['LastAccessed'],
                            CreationTime=u'%s' % fileInfo['CreationTime'],
                            Mode=u'%s' % fileInfo['Mode'])
writer.commit()

# now the searching part
from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
    query = QueryParser("FileName", ix.schema).parse(u"hsbc")  # 'hsbc' is the search term
    results = searcher.search(query)
    for x in results:
        print(x['FileName'])
It's not the most efficient, but it only uses the stdlib and a little bit of work. sqlite3 (if it was enabled at compilation) supports full-text indexing. See: http://www.sqlite.org/fts3.html
So you could create a table of [file_id, filename] and a table of [file_id, line_number, line_text], and use those to base your queries on, i.e. how many files contain this word and that, which lines contain this AND that but not the other, etc.
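A minimal sketch of that layout using an in-memory database (it collapses the two tables into one FTS table for brevity, and whether FTS is available depends on how your sqlite3 library was compiled):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS availability depends on the sqlite3 build; this uses the FTS4 module.
con.execute("CREATE VIRTUAL TABLE file_lines USING fts4(filename, line_number, line_text)")
con.executemany(
    "INSERT INTO file_lines VALUES (?, ?, ?)",
    [("notes.txt", 1, "searching files by content"),
     ("todo.txt", 1, "buy plankton food")],
)
# Full-text query: which files have a line containing the word 'content'?
rows = con.execute(
    "SELECT filename FROM file_lines WHERE line_text MATCH 'content'"
).fetchall()
print(rows)  # [('notes.txt',)]
```

For a real tool you would populate the table while walking the folder, then rerun only the MATCH query on each search.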
The only reason anyone would want a tool that is capable of searching 'certain parts' of a file is because what they are trying to do is analyze data that has legal restrictions on which parts of it you can read.
For example, Apple has the capability of identifying the GPS location of your iPhone at any moment a text was sent or received. But, what they cannot legally do is associate that location data with anything that can be tied to you as an individual.
On a broad scale you can use obscure data like this to track and analyze patterns throughout large amounts of data. You could feasibly assign a unique 'Virtual ID' to every cell phone in the USA and log all location movement; afterward you implement a method for detecting patterns of travel. Outliers could be detected through deviations in their normal travel pattern. That 'metadata' could then be combined with data from outside sources such as names and locations of retail locations. Think of all the situations you might be able to algorithmically detect. Like the soccer dad who for 3 years has driven the same general route between work, home, restaurants, and a little league field. Only being able to search part of a file still offers enough data to detect that Soccer Dad's phone's unique signature suddenly departed from the normal routine and entered a gun shop. The possibilities are limitless. That data could be shared with local law enforcement to increase street presence in public spaces nearby; all while maintaining anonymity of the phone's owner.
Capabilities like the example above are not legally possible in today's environment without the method IggY is looking for.
On the other hand, it could just be that he is only looking for certain types of data in certain file types. If he knows where in the file he wants to search for the data he needs he can save major CPU time only reading the last half or first half of a file.
You can do a simple name-based cache as below. This is probably best (fastest) if the file contents are not expected to change. Otherwise, you can MD5 the file contents. I say MD5 because it's faster than SHA, and this application doesn't seem security-sensitive.
from hashlib import md5
import os

# files_to_search and get_file_info are placeholders for your own code.
info_cache = {}
for file in files_to_search:
    file_info = get_file_info(file)
    # Hash the path, not the contents (name-based cache);
    # .encode() because md5 needs bytes in Python 3.
    file_hash = md5(os.path.abspath(file).encode()).hexdigest()
    info_cache[file_hash] = file_info

Irregular String Parsing on Python

I'm new to Python/Django and I am trying to extract more useful information with my scraper. Currently, the scraper takes a list of comic book titles and correctly divides them into a CSV list in three parts (Published Date, Original Date, and Title). I then pass the current date and title through to different parts of my database, which I do in my Loader script (convert mm/dd/yy into yyyy-mm-dd and save to the "pub_date" column; the title goes to the "title" column).
A common string can look like this:
10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)
I am successfully grabbing the date, but the title is trickier. In this instance, I'd ideally like to fill three different columns with the information after the second "|". The title should go to "title", a CharField; the number 12 (after the '#') should go into the DecimalField "issue_num"; and everything between the '()'s should go into the "Special" CharField. I am not sure how to do this kind of rigorous parsing.
Sometimes there are multiple #'s (one comic in particular is described as a bundle, "Containing issues #90-#95"), and several have multiple '()' groups (such as "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover)").
What would be a good way to start cracking this problem? My knowledge of if/else statements quickly fell apart for the more complicated lines. How can I efficiently and (if possible) pythonically parse these lines and subdivide them so I can later slot them into the correct place in my database?
Use the regular expression module re. For example, if you have the third |-delimited field of your sample record in a variable s, then you can do
match = re.match(r"^(?P<title>[^#]*) #(?P<num>[0-9]+) \((?P<special>.*)\)$", s)
title = match.group('title')
issue = match.group('num')
special = match.group('special')
If the string doesn't match at all, re.match returns None and the last three lines will raise an AttributeError. Adapt the RE until it parses everything you want.
Parsing the title is the hard part; it sounds like you can handle the dates etc. yourself. The problem is that there is not one rule that can parse every title. There are many rules, and you can only guess which one works on a particular title.
I usually handle this by creating a list of rules, from most specific to general and try them out one by one until one matches.
To write such rules you can use the re module or even pyparsing.
The general idea goes like this:
class CantParse(Exception):
    pass

# one rule to parse one kind of title
import re

def title_with_special(title):
    """Accepts only a title of the form
       <text> #<issue> (<special>)"""
    # the trailing $ makes sure the whole title fits the rule
    m = re.match(r"[^#]*#(\d+) \(([^)]+)\)$", title)
    if m:
        return m.group(1), m.group(2)
    else:
        raise CantParse(title)

def parse_extra(title, rules):
    """Tries to parse extra information from a title using the rules."""
    for rule in rules:
        try:
            return rule(title)
        except CantParse:
            pass
    # nothing matched
    raise CantParse(title)

# let's try this out
rules = [title_with_special]  # list of rules to apply; add more functions here
titles = ["Stan Lee's Traveler #12 (10 Copy Incentive Cover)",
          "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover)"]
for title in titles:
    try:
        issue, special = parse_extra(title, rules)
        print("Parsed", title, "to issue=%s special='%s'" % (issue, special))
    except CantParse:
        print("No matching rule for", title)
As you can see, the first title is parsed correctly, but not the second. You'll have to write a bunch of rules that account for every possible title format in your data.
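For the bundle case from the question ("Containing issues #90-#95"), one more rule in the same style could return an issue range instead of a single issue. The function name and return shape here are my own choices, and CantParse is redefined so the sketch is self-contained:

```python
import re

class CantParse(Exception):
    """Redefined here so this sketch stands alone."""
    pass

def title_with_issue_range(title):
    """A hypothetical extra rule for bundle titles of the form
    <text> #<start>-#<end>, e.g. "Containing issues #90-#95"."""
    m = re.match(r"[^#]*#(\d+)-#(\d+)", title)
    if m:
        return int(m.group(1)), int(m.group(2))
    raise CantParse(title)

print(title_with_issue_range("Containing issues #90-#95"))  # (90, 95)
```

Appending this function to the rules list lets parse_extra fall through to it whenever the single-issue rule raises CantParse.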
A regular expression is the way to go. But if you feel uncomfortable writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser). It translates a string format (PEP 3101) to a regular expression. In your case, you would do the following:
>>> from stringparser import Parser
>>> p = Parser(r"{date:s}\|{date2:s}\|{title:s}#{issue:d} \({special:s}\)")
>>> x = p("10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)")
OrderedDict([('date', '10/12/11'), ('date2', '10/12/11'), ('title', "Stan Lee's Traveler "), ('issue', 12), ('special', '10 Copy Incentive Cover')])
>>> x.issue
12
The output in this case is an (ordered) dictionary. This will work for simple cases, and you might tweak it to catch multiple issues or multiple '()' groups.
One more thing: notice that in the current version you need to manually escape regex characters (i.e. if you want to find |, you need to type \|). I am planning to change this soon.
