pycassa - Remove multiple rows by their secondary index? - python

I have a column family with a secondary index 'pointer'. How do I remove multiple rows that have the same 'pointer' value (e.g. abc)?
The only option I know is:
expr = create_index_expression('pointer', 'abc')
clause = create_index_clause([expr])
for key, user in cassandra_cf.get_indexed_slices(clause):
    cassandra_cf.remove(key)
but I know this is very inefficient and can take a long time if I have thousands of rows with the same 'pointer' value. Are there any other options?

You can remove multiple rows at once:
expr = create_index_expression('pointer', 'abc')
clause = create_index_clause([expr])
with cassandra_cf.batch() as b:
    for key, user in cassandra_cf.get_indexed_slices(clause):
        b.remove(key)
This will group the removes into batches of 100 (by default). When the batch object is used as a context manager, as it is here, it will automatically send any remaining mutations once the with block is left.
You can read more about this in the pycassa.batch API docs.
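The underlying pattern — queueing mutations and flushing whenever the queue reaches a threshold, plus a final flush on exit from the with block — can be sketched in plain Python. This is a simplified stand-in, not pycassa's actual Mutator; the `remove_batch` method on the column family object is a hypothetical placeholder for "one round trip removing many keys":

```python
class Mutator:
    """Simplified sketch of a batching mutator (not pycassa's real class)."""

    def __init__(self, cf, queue_size=100):
        self.cf = cf              # assumed to expose a remove_batch(keys) method
        self.queue_size = queue_size
        self.queue = []

    def remove(self, key):
        self.queue.append(key)
        if len(self.queue) >= self.queue_size:
            self.send()

    def send(self):
        # Flush everything queued so far in a single call
        if self.queue:
            self.cf.remove_batch(self.queue)
            self.queue = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Send any remaining mutations when the with block is left
        self.send()
```

So 7 removes with queue_size=3 go out as three calls of 3, 3, and 1 keys — far fewer round trips than one remove per row.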


Is there a way to "transform" a CSV table into a simple nested if... else block in python?

I'm fairly new to Python and I'm trying to achieve the following:
I have a table with several conditions as in the image below (maximum 5 conditions) along with various attributes. Each condition comes from a specific set of values, for example Condition 1 has 2 possible values, Condition 2 has 4 possible values, Condition 3 has 2 possible values etc..
What I would like to do: From the example table above, I would like to generate a simple python code so that when I execute my function and import a CSV file containing the table above, I should get the following output saved as a *.py file:
def myFunction(Attribute, Condition):
    if Attribute1 & Condition1:
        myValue = val_11
    if Attribute1 & Condition2:
        myValue = val_12
    ...
    ...
    if Attribute5 & Condition4:
        myValue = val_54
NOTE: Each CSV file will contain only one sheet and the titles for the columns do not change.
UPDATE, NOTE#2: Both "Attribute" and "Condition" are string values, so simple string comparisons would suffice.
Is there a simple way to do this? I dove into NLP and realized that it is not possible (at least from what I found in the literature). I'm open to all forms of suggestions/answers.
You can't really use "if"s and "else"s, since, if I understand your question correctly, you want to be able to read the conditions, attributes and values from a CSV file. Using "if"s and "else"s, you would only be able to check a fixed range of conditions and attributes defined in your code. What I would do is write a parser: a piece of code which reads the contents of your CSV file and saves it in another, more usable form.
In this case, the parser is the parseCSVFile() function. Instead of the ifs and elses comparing attributes and conditions, you now use the attributes and conditions to access a specific element in a dictionary (similar to an array or list, except that you can use, for example, string keys instead of numerical indexes). I used a dictionary containing a dictionary at each position to split the CSV contents into their rows and columns. Since I used dictionaries, you can now use the strings of the Attributes and Conditions to access your values instead of doing lots of comparisons.
# Output dictionary
ParsedDict = dict()

# This is either ';' or ',' depending on your locale settings; open the CSV
# file in a text editor such as Notepad to check which character is used
CSVSeparator = ';'

def parseCSVFile(filePath):
    global ParsedDict
    f = open(filePath)
    fileLines = f.readlines()
    f.close()
    # Extract the conditions from the header row (strip the trailing newline)
    ConditionsArray = fileLines[0].strip().split(CSVSeparator)[1:]
    for x in range(len(fileLines) - 1):
        # Remove unwanted characters such as newline characters
        line = fileLines[1 + x].strip()
        # Split by the CSV separation character
        LineContents = line.split(CSVSeparator)
        ConditionsDict = dict()
        for y in range(len(ConditionsArray)):
            ConditionsDict.update({ConditionsArray[y]: LineContents[1 + y]})
        ParsedDict.update({LineContents[0]: ConditionsDict})

def myFunction(Attribute, Condition):
    return ParsedDict[Attribute][Condition]
The "[1:]" is there to skip the first column (the empty field at the top left and the "Attribute x" fields) when reading either the conditions or the values.
Use the parseCSVFile() function to extract the information from the CSV file, and myFunction() to get the value you want.
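As a runnable illustration of the same dictionary-of-dictionaries idea (the attribute, condition, and value names below are made-up placeholders in the shape of the question's table):

```python
# A small table in the question's shape: the header row holds the condition
# names, the first column holds the attribute names.
csv_text = """;Condition1;Condition2
Attribute1;val_11;val_12
Attribute2;val_21;val_22
"""

def parse_csv(text, sep=';'):
    """Parse the table into {attribute: {condition: value}}."""
    lines = text.strip().splitlines()
    conditions = lines[0].split(sep)[1:]   # skip the empty top-left field
    table = {}
    for line in lines[1:]:
        cells = line.split(sep)
        # Pair each condition name with this row's values
        table[cells[0]] = dict(zip(conditions, cells[1:]))
    return table

table = parse_csv(csv_text)
print(table['Attribute1']['Condition2'])  # -> val_12
print(table['Attribute2']['Condition1'])  # -> val_21
```

Two string lookups replace the whole ladder of ifs, and the table contents live in the data file rather than the code.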

Trouble converting "for key in dict" to == for exact matching

Good morning,
I am having trouble pulling the correct value from my dictionary because there are similar keys. I believe I need to use == instead of in; however, when I try to change if key in c_item_number_one: to if key == c_item_number_one:, it just takes my if not_found: print("Specify Size One") branch, even though I know 12" is in the dictionary.
c_item_number_one = ('12", Pipe,, SA-106 GR. B,, SCH 40, WALL smls'.upper())
print(c_item_number_one)
My formula is as follows:
def item_one_size_one():
    not_found = True
    for key in size_one_dict:
        if key in c_item_number_one:
            item_number_one_size = size_one_dict[key]
            print(item_number_one_size)
            not_found = False
            break
    if not_found:
        print("Specify Size One")

item_one_size_one()
The current result is:
12", PIPE,, SA-106 GR. B,, SCH 40, WALL SMLS
Specify Size One
To split the user input into fields, use re.split (note the +; a * here could match the empty string and split between every character on recent Python versions):
>>> userin
'12", PIPE,, SA-106 GR. B,, SCH 40, WALL SMLS'
>>> import re
>>> fields = re.split('[ ,]+', userin)
>>> fields
['12"', 'PIPE', 'SA-106', 'GR.', 'B', 'SCH', '40', 'WALL', 'SMLS']
Then compare the key to the first field, or to all fields:
if key == fields[0]:
There are two usages of the word in here - the first is in the context of a for loop, and the second entirely distinct one is in the context of a comparison.
In the construction of a for loop, the in keyword connects the variable that will be used to hold the values extracted from the loop to the object containing values to be looped over.
e.g.
for x in list:
Meanwhile, the entirely distinct usage of the in keyword tells Python to perform a membership test, where the left-hand side item is checked for existence in the right-hand side object's collection.
e.g.
if key in c_item_number_one:
So the meaning of the in keyword is somewhat contextual.
If your code is giving unexpected results, then you should be able to replace the if-statement with an == test, while keeping everything else the same.
e.g.
if key == c_item_number_one:
However, since the contents of c_item_number_one is a tuple, you might only want to test equality for the first item in that tuple - the number 12 for example. You should do this by indexing the element in the tuple for which you want to do the comparison:
if key == c_item_number_one[0]:
Here the [0] is telling python to extract only the first element from the tuple to perform the == test.
[edit] Sorry, your c_item_number_one isn't a tuple, it's a long string. What you need is a way of clearly identifying each item to be looked up, using a unique code or value that the user can enter that will uniquely identify each thing. Doing a string-match like this is always going to throw up problems.
There's potential then for a bit of added nuance: the first key in your example tuple is the string '12'. If the key in your == test is a numeric value of 12 (i.e. an integer), then the test 12 == '12' will return False and you won't extract the value you're after. That your existing in test currently succeeds suggests this isn't a problem here, but it might be something to be aware of later.
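Putting the two suggestions together — split the item string into fields, then compare each dictionary key against the first field with == — might look like this (the contents of size_one_dict here are made-up placeholders, since the question doesn't show them):

```python
import re

# Hypothetical lookup table; the real size_one_dict isn't shown in the question
size_one_dict = {'12"': 'NPS 12', '2"': 'NPS 2'}

c_item_number_one = '12", Pipe,, SA-106 GR. B,, SCH 40, WALL smls'.upper()

# Split on runs of spaces/commas; the size is the first field
fields = re.split('[ ,]+', c_item_number_one)

# Exact match against fields[0] instead of a substring test on the whole string
size = next((size_one_dict[key] for key in size_one_dict
             if key == fields[0]), None)
print(size if size is not None else "Specify Size One")  # -> NPS 12
```

The exact comparison means a key like '2"' can no longer accidentally match inside '12"', which is the bug the substring test was causing.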

Tkinter entry widget updates values in curly brackets

I am able to update the tkinter entry widget boxes using textvariables... the issue is that it adds curly brackets '{}' around my desired data...
def showRecord(self):
    connection = sqlite3.connect("../employee.db")
    connection.text_factory = sqlite3.OptimizedUnicode
    cursor = connection.cursor()
    cursor.execute('''SELECT "Scheduled Shift" FROM employee_details WHERE Ecode = "5568328"''')
    items = cursor.fetchall()
    self.Employee1_FirstDay_ActualShift.set(items[0])
    self.Employee1_SecondDay_ActualShift.set(items[1])
    self.Employee1_ThirdDay_ActualShift.set(items[2])
    self.Employee1_FourthDay_ActualShift.set(items[3])
    self.Employee1_FifthDay_ActualShift.set(items[4])
    self.Employee1_SixthDay_ActualShift.set(items[5])
    self.Employee1_SeventhDay_ActualShift.set(items[6])
    connection.commit()
    connection.close()
Seeking help, please... I need to remove those brackets, as shown in the figure.
The reason it is doing that is that you are setting the value of a string variable to a list. Tkinter is a thin wrapper around a tcl/tk interpreter, and tcl uses curly braces to preserve the list structure when converting a list to a string if a list element has spaces or other special characters.
The solution is to make sure you pass a string to the set method. Otherwise the list will be passed to tcl/tk and it will use its own list-to-string conversion.
In your case, since items is a list (rows) of lists (columns) and each row has a single column, you would do something like this to insert column zero of row zero into self.Employee1_FirstDay_ActualShift:
row_0 = items[0]
col_0 = row_0[0]
self.Employee1_FirstDay_ActualShift.set(col_0)
To condense that to one line, combined with all of the other rows it would look something like the following. I've added some extra whitespace to make it easier to compare each line. Also, this assumes that items has seven rows, and each row has at least one column.
self.Employee1_FirstDay_ActualShift.set( items[0][0])
self.Employee1_SecondDay_ActualShift.set( items[1][0])
self.Employee1_ThirdDay_ActualShift.set( items[2][0])
self.Employee1_FourthDay_ActualShift.set( items[3][0])
self.Employee1_FifthDay_ActualShift.set( items[4][0])
self.Employee1_SixthDay_ActualShift.set( items[5][0])
self.Employee1_SeventhDay_ActualShift.set(items[6][0])
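A self-contained way to see what fetchall actually hands back (in-memory sqlite3 database with made-up rows; the table and column names mirror the question's): each element of items is a one-tuple, and it's the tuple, not the string inside it, that triggers tcl's brace-quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE employee_details ("Scheduled Shift" TEXT, Ecode TEXT)')
conn.executemany('INSERT INTO employee_details VALUES (?, ?)',
                 [("Day Shift", "5568328"), ("Night Shift", "5568328")])

cursor = conn.execute(
    'SELECT "Scheduled Shift" FROM employee_details WHERE Ecode = "5568328"')
items = cursor.fetchall()

print(items[0])     # a one-tuple: ('Day Shift',) -> tcl renders it {Day Shift}
print(items[0][0])  # the plain string 'Day Shift' you actually want to .set()
conn.close()
```

Because "Day Shift" contains a space, passing the tuple to set() makes tcl wrap it in braces; passing items[0][0] avoids the conversion entirely.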

Converting an imperative algorithm that "grows" a table into pure functions

My program, written in Python 3, has many places where it starts with a (very large) table-like numeric data structure and adds columns to it following a certain algorithm. (The algorithm is different in every place.)
I am trying to convert this into pure functional approach since I run into problems with the imperative approach (hard to reuse, hard to memoize interim steps, hard to achieve "lazy" computation, bug-prone due to reliance on state, etc.).
The Table class is implemented as a dictionary of dictionaries: the outer dictionary contains rows, indexed by row_id; the inner contains values within a row, indexed by column_title. The table's methods are very simple:
# return the value at the specified row_id, column_title
get_value(self, row_id, column_title)
# return the inner dictionary representing row given by row_id
get_row(self, row_id)
# add a column new_column_title, defined by func
# func signature must be: take a row and return a value
add_column(self, new_column_title, func)
Until now, I simply added columns to the original table, and each function took the whole table as an argument. As I'm moving to pure functions, I'll have to make all arguments immutable. So, the initial table becomes immutable. Any additional columns will be created as standalone columns and passed only to those functions that need them. A typical function would take the initial table, and a few columns that are already created, and return a new column.
The problem I run into is how to implement the standalone column (Column)?
I could make each of them a dictionary, but it seems very expensive. Indeed, if I ever need to perform an operation on, say, 10 fields in each logical row, I'll need to do 10 dictionary lookups. And on top of that, each column will contain both the key and the value, doubling its size.
I could make Column a simple list, and store in it a reference to the mapping from row_id to the array index. The benefit is that this mapping could be shared between all columns that correspond to the same initial table; also, once the index is looked up, the same position works for every one of those columns. But does this create any other problems?
If I do this, can I go further, and actually store the mapping inside the initial table itself? And can I place references from the Column objects back to the initial table from which they were created? It seems very different from how I imagined a functional approach to work, but I cannot see what problems it would cause, since everything is immutable.
In general, does the functional approach frown upon keeping a reference in the return value to one of the arguments? It doesn't seem like it would break anything (like optimization or lazy evaluation), since the argument was already known anyway. But maybe I'm missing something.
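The list-backed Column idea from the question can be sketched like this (all names here are hypothetical, not from an existing library): each Column holds its values in an immutable sequence and keeps a reference to a single shared row_id -> position mapping, so one lookup resolves the position in every column derived from the same table.

```python
class Column:
    """Immutable column: values in a tuple, positions via a shared index map."""

    def __init__(self, index_map, values):
        self._index_map = index_map    # shared dict: row_id -> list position
        self._values = tuple(values)   # tuple, so the column itself is immutable

    def get(self, row_id):
        return self._values[self._index_map[row_id]]

    def map(self, func):
        # Derive a new column; the index map is shared, not copied
        return Column(self._index_map, (func(v) for v in self._values))

# One mapping shared by every column of the same logical table
index_map = {'row_a': 0, 'row_b': 1, 'row_c': 2}

prices = Column(index_map, [9.5, 3.0, 7.25])
taxed = prices.map(lambda p: p * 1.2)     # a new standalone column

print(prices.get('row_b'))                # -> 3.0
print(round(taxed.get('row_b'), 2))       # -> 3.6
```

Since nothing here is ever mutated, sharing the index map (or even a back-reference to the source table) is safe: two columns can never disagree about where a row lives.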
Here is how I would do it: derive your table class from a frozenset, and make each row a subclass of tuple. Now you can't modify the table -> immutability, great! The next step could be to consider each function a mutation which you apply to the table to produce a new one:
f T -> T'
That should be read as: apply the function f to the table T to produce a new table T'. You may also try to objectify the actual processing of the table data and see it as an Action which you apply or add to the table:
add(T, A) -> T'
The great thing here is that add could be subtract instead, giving you an easy way to model undo. When you get into this mindset, your code becomes very easy to reason about because you have no state that can screw things up.
Below is an example of how one could implement and process a table structure in a purely functional way in Python. Imho, Python is not the best language to learn about FP in, because it makes it too easy to program imperatively. Haskell, F# or Erlang are better choices, I think.
from functools import reduce
from random import randint

class Table(frozenset):
    def __new__(cls, names, rows):
        return frozenset.__new__(cls, rows)

    def __init__(self, names, rows):
        self.names = names

def add_column(rows, func):
    return [row + (func(row, idx),) for (idx, row) in enumerate(rows)]

def table_process(t, action):
    name, func = action
    return Table(
        t.names + (name,),
        add_column(t, lambda row, idx: func(row))
    )

def table_filter(t, action):
    name, func = action
    idx = t.names.index(name)
    return Table(
        t.names,
        [row for row in t if func(row[idx])]
    )

def table_rank(t, name):
    idx = t.names.index(name)
    rows = sorted(t, key=lambda row: row[idx])
    return Table(
        t.names + ('rank',),
        add_column(rows, lambda row, idx: idx)
    )

def table_print(t):
    format_row = lambda r: ' '.join('%15s' % c for c in r)
    print(format_row(t.names))
    print('\n'.join(format_row(row) for row in t))

if __name__ == '__main__':
    cols = ('c1', 'c2', 'c3')
    T = Table(
        cols,
        [tuple(randint(0, 9) for x in cols) for x in range(10)]
    )
    table_print(T)
    # Columns to add to the table; this is a perfect fit for a
    # reduce. I'd honestly use a boring for loop instead, but reduce
    # is a perfect example of how in FP data and code "become one."
    # In fact, this whole program could have been written as just one
    # big reduce.
    actions = [
        ('max', max),
        ('min', min),
        ('sum', sum),
        ('avg', lambda r: sum(r) / len(r))
    ]
    T = reduce(table_process, actions, T)
    table_print(T)
    # Ranking is different because it requires an ordering, which a
    # table does not have.
    T2 = table_rank(T, 'sum')
    table_print(T2)
    # Simple where filter: select * from T2 where c2 < 5.
    T3 = table_filter(T2, ('c2', lambda c: c < 5))
    table_print(T3)

Finding partial strings in a list of strings - python

I am trying to check if a user is a member of an Active Directory group, and I have this:
ldap.set_option(ldap.OPT_REFERRALS, 0)
try:
    con = ldap.initialize(LDAP_URL)
    con.simple_bind_s(userid + "@" + ad_settings.AD_DNS_NAME, password)
    ADUser = con.search_ext_s(ad_settings.AD_SEARCH_DN, ldap.SCOPE_SUBTREE,
                              "sAMAccountName=%s" % userid,
                              ad_settings.AD_SEARCH_FIELDS)[0][1]
except ldap.LDAPError:
    return None
ADUser is a dictionary mapping attribute names to lists of strings:
{'givenName': ['xxxxx'],
'mail': ['xxxxx@example.com'],
'memberOf': ['CN=group1,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
'CN=group2,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
'CN=group3,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
'CN=group4,OU=Projects,OU=Office,OU=company,DC=domain,DC=com'],
'sAMAccountName': ['myloginid'],
'sn': ['Xxxxxxxx']}
Of course in the real world the group names are verbose and of varied structure, and users will belong to tens or hundreds of groups.
If I get the list of groups out as ADUser.get('memberOf')[0], what is the best way to check if any members of a separate list exist in the main list?
For example, the check list would be ['group2', 'group16'] and I want to get a true/false answer as to whether any of the smaller list exist in the main list.
If the format example you give is somewhat reliable, something like:
import re
grps = re.compile(r'CN=(\w+)').findall

def anyof(short_group_list, adu):
    all_groups_of_user = set(g for gs in adu.get('memberOf', ()) for g in grps(gs))
    return sorted(all_groups_of_user.intersection(short_group_list))
where you pass your list such as ['group2', 'group16'] as the first argument, your ADUser dict as the second argument; this returns an alphabetically sorted list (possibly empty, meaning "none") of the groups, among those in short_group_list, to which the user belongs.
It's probably not much faster to return just a bool, but, if you insist, changing the second statement of the function to:
return any(g for g in short_group_list if g in all_groups_of_user)
might possibly save a certain amount of time in the "true" case (since any short-circuits) though I suspect not in the "false" case (where the whole list must be traversed anyway). If you care about the performance issue, best is to benchmark both possibilities on data that's realistic for your use case!
If performance isn't yet good enough (and a bool yes/no is sufficient, as you say), try reversing the looping logic:
def anyof_v2(short_group_list, adu):
    gset = set(short_group_list)
    return any(g for gs in adu.get('memberOf', ()) for g in grps(gs) if g in gset)
any's short-circuit abilities might prove more useful here (at least in the "true" case, again -- because, again, there's no way to give a "false" result without examining ALL the possibilities anyway!-).
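A quick self-contained check of both helpers against a record shaped like the question's (memberOf list abbreviated to two groups):

```python
import re

grps = re.compile(r'CN=(\w+)').findall

def anyof(short_group_list, adu):
    # All group names extracted from every memberOf DN string
    all_groups_of_user = set(g for gs in adu.get('memberOf', ())
                             for g in grps(gs))
    return sorted(all_groups_of_user.intersection(short_group_list))

def anyof_v2(short_group_list, adu):
    gset = set(short_group_list)
    return any(g for gs in adu.get('memberOf', ())
               for g in grps(gs) if g in gset)

ADUser = {'memberOf': [
    'CN=group1,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
    'CN=group2,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
]}

print(anyof(['group2', 'group16'], ADUser))     # -> ['group2']
print(anyof_v2(['group2', 'group16'], ADUser))  # -> True
print(anyof_v2(['group7'], ADUser))             # -> False
```

The regex pulls only the CN= component out of each distinguished name, so OU and DC components can never produce a false match.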
You can use set intersection (& operator) once you parse the group list out. For example:
>>> memberOf = 'CN=group1,OU=Projects,OU=Office,OU=company,DC=domain,DC=com'
>>> groups = [token.split('=')[1] for token in memberOf.split(',')]
>>> groups
['group1', 'Projects', 'Office', 'company', 'domain', 'com']
>>> checklist1 = ['group1', 'group16']
>>> set(checklist1) & set(groups)
{'group1'}
>>> checklist2 = ['group2', 'group16']
>>> set(checklist2) & set(groups)
set()
Note that conditional evaluation of a set works the same as for lists and tuples: True if the set contains any elements, False otherwise. So "if set(checklist2) & set(groups): ..." would not execute, since that condition evaluates to False in the above example (the opposite holds for the checklist1 test).
Also see:
http://docs.python.org/library/sets.html
