Sort numpy array using two arguments - python

A small snippet of a data set I have is the following:
import numpy
fns = numpy.array(["filename_0004_0003_info.hdf5", "filename_0003_0003_info.hdf5", "filename_0001_0001_info.hdf5", "filename_0002_0001_info.hdf5", "filename_0006_0002_info.hdf5", "filename_0005_0002_info.hdf5"])
The first integer I call run, and the second I call order. I want to sort this data set: first by the order number, then by the run number. Every order number occurs twice, while each run number is unique. When I sort based only on the order number using numpy.argsort():
order_nrs = numpy.array([int(fn.split("_")[2]) for fn in fns])
fns = numpy.copy(fns)[numpy.argsort(order_nrs)]
I obtain
['filename_0001_0001_info.hdf5' 'filename_0002_0001_info.hdf5' 'filename_0006_0002_info.hdf5' 'filename_0005_0002_info.hdf5' 'filename_0004_0003_info.hdf5' 'filename_0003_0003_info.hdf5']
fns is now sorted by order, but within each order it should also be sorted by run. The result should be:
filename_0001_0001_info.hdf5
filename_0002_0001_info.hdf5
filename_0005_0002_info.hdf5
filename_0006_0002_info.hdf5
filename_0003_0003_info.hdf5
filename_0004_0003_info.hdf5
How do I achieve this?

fns = ["filename_0004_0003_info.hdf5", "filename_0003_0003_info.hdf5", "filename_0001_0001_info.hdf5", "filename_0002_0001_info.hdf5", "filename_0006_0002_info.hdf5", "filename_0005_0002_info.hdf5"]
def sortfunc(s):
    words = s.split('_')
    run, order = int(words[1]), int(words[2])
    return order, run

fns.sort(key=sortfunc)
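If fns needs to remain a numpy array as in the question, the same key function can be reused; a minimal sketch:
import numpy as np

# sorted() returns a plain list, so rebuild the array afterwards.
fns = np.array(sorted(fns, key=sortfunc))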

A bit verbose, but useful to see the 'order' parameter in argsort.
fns = np.array(["filename_0004_0003_info.hdf5", "filename_0003_0003_info.hdf5",
                "filename_0001_0001_info.hdf5", "filename_0002_0001_info.hdf5",
                "filename_0006_0002_info.hdf5", "filename_0005_0002_info.hdf5"])
o = np.array([fn.split("_") for fn in fns])
a_r = np.core.records.fromarrays(o.transpose(),
                                 names=['a', 'b', 'c', 'd'],
                                 formats=['U9']*4)
idx = np.argsort(a_r, order=['c', 'b'])
out = fns[idx]
out
Out[22]:
array(['filename_0001_0001_info.hdf5', 'filename_0002_0001_info.hdf5',
'filename_0005_0002_info.hdf5', 'filename_0006_0002_info.hdf5',
'filename_0003_0003_info.hdf5', 'filename_0004_0003_info.hdf5'],
dtype='<U28')
Start with the filenames, split on the _ as you have done, convert to a recarray or structured array (a simple example is given), use argsort but order on the two numeric fields you want to use, then use the idx index to reorder the original.
There are probably more elegant ways, but I like to see exactly what is going on, and argsort's order parameter makes it clear for me. Too bad you can't just use an index number instead of the named fields.

One way is engineering the dtype such that only the two numbers are visible and the second comes first:
>>> fns = numpy.array(["filename_0004_0003_info.hdf5", "filename_0003_0003_info.hdf5", "filename_0001_0001_info.hdf5", "filename_0002_0001_info.hdf5", "filename_0006_0002_info.hdf5", "filename_0005_0002_info.hdf5"])
>>> fns.view(np.dtype({'names':['f1','f2'], 'formats':['<U4','<U4'], 'offsets':[56,36], 'itemsize':112})).sort()
>>> fns
array(['filename_0001_0001_info.hdf5', 'filename_0002_0001_info.hdf5',
'filename_0005_0002_info.hdf5', 'filename_0006_0002_info.hdf5',
'filename_0003_0003_info.hdf5', 'filename_0004_0003_info.hdf5'],
dtype='<U28')
This is a direct in-place sort, but argsort, of course, also works.
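The argsort variant of the same trick might look like this (a sketch; the field names f1/f2 are just labels for the two number slots):
import numpy as np

# Build the same two-field view, then take the permutation instead of
# sorting in place; f1 (the order number) is the primary key.
v = fns.view(np.dtype({'names': ['f1', 'f2'], 'formats': ['<U4', '<U4'],
                       'offsets': [56, 36], 'itemsize': 112}))
fns_sorted = fns[np.argsort(v, order=['f1', 'f2'])]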

Look at np.lexsort; it applies a stable sort on a series of keys, which does exactly what you want.
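A minimal sketch of the lexsort approach, reusing the run/order extraction from the question; note that np.lexsort treats the last key as the primary one:
import numpy as np

runs = np.array([int(fn.split("_")[1]) for fn in fns])
orders = np.array([int(fn.split("_")[2]) for fn in fns])

# The last key (orders) is primary; ties are broken by runs.
fns = fns[np.lexsort((runs, orders))]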

Related

How to structure complex function to apply to col of pandas df?

I have a large (>500k rows) pandas df like so
orig_df = pd.DataFrame(columns=['id', 'free_text1', 'something_inert', 'free_text2'])
free_textX is a string field containing user input imported from a csv. The goal is to have a function func that does various checks on each row of free_textX and then performs Levenshtein fuzzy text recognition based on the contents of another df reference. Something like
from rapidfuzz import process
LEVENSHTEIN_DIST = 25
def func(s) -> str:
    if s == "25":
        return s
    elif s == "nothing":
        return "something"
    else:
        s2 = process.extractOne(
            query=s,
            choices=reference['col_name'],
            score_cutoff=LEVENSHTEIN_DIST
        )
        return s2
After this process, a new column called recog_textX has to be inserted after free_textX, containing the values returned by func.
I tried vectorization (for performance) like so
orig_df.insert(loc=new_col_index,  # calculated before
               column='recog_textX',
               value=func(orig_df['free_textX'])
               )

def func(series) -> pd.core.series.Series:
    ...
but I don't understand how to structure func (handling an entire df column as a Series, as vectorization demands, right?), since process.extractOne(...) -> str handles single strs instead of a Series. Those interface concepts seem incompatible to me. But I do want to avoid a classic iteration here for performance reasons. My grasp of pandas is too shallow here. Help me out?
I may be missing a point, but you can use the apply function to get what I think you want:
orig_df['recog_textX'] = orig_df['free_textX'].apply(func)
This will create a new column 'recog_textX' by applying your function func to each element of the 'free_textX' column.
Let me know if I misunderstood your question
As an aside, I do not think vectorizing this operation will make any difference speed-wise, given that each application of func() is a complicated string operation. But it does look nicer than just looping through rows.
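Combining apply with the insert call from the question would place the new column at the desired position; a minimal sketch, assuming new_col_index is calculated as before:
# apply() runs func element-wise over the Series; insert() puts the
# resulting column at the requested location instead of at the end.
orig_df.insert(loc=new_col_index,
               column='recog_textX',
               value=orig_df['free_textX'].apply(func))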

How can I sort list of strings in specific order?

Let's say I have such a list:
['word_4_0_w_7',
'word_4_0_w_6',
'word_3_0_w_10',
'word_3_0_w_2']
and I want to sort them according to the number that comes after "word" and then according to the number after "w".
It will look like this:
['word_3_0_w_2',
'word_3_0_w_10',
'word_4_0_w_6',
'word_4_0_w_7']
What comes to mind is to create a bunch of lists according to the index after "word", stuff them with strings sorted by the number after "w", and then merge them.
Is there a cleverer way to do this in Python?
Use Python's key functionality, in conjunction with other answers:
def mykey(value):
    ls = value.split("_")
    return int(ls[1]), int(ls[-1])

newlist = sorted(firstlist, key=mykey)
## or, if you want it in place:
firstlist.sort(key=mykey)
Python will be more efficient with key than with cmp.
You can provide a function to the sort() method of list objects:
l = ['word_4_0_w_7',
     'word_4_0_w_6',
     'word_3_0_w_10',
     'word_3_0_w_2']

def my_key_func(x):
    xx = x.split("_")
    return (int(xx[1]), int(xx[-1]))

l.sort(key=my_key_func)
Output:
print l
['word_3_0_w_2', 'word_3_0_w_10', 'word_4_0_w_6', 'word_4_0_w_7']
edit: Changed code according to the comment by @dwanderson; more info on this can be found here.
You can use a function to extract the relevant parts of your string and then use those parts to sort:
a = ['word_4_0_w_7', 'word_4_0_w_6', 'word_3_0_w_10', 'word_3_0_w_2']

def sort_func(x):
    parts = x.split('_')
    sort_key = parts[1] + parts[2] + "%02d" % int(parts[4])
    return sort_key

a_sorted = sorted(a, key=sort_func)
The expression "%02d" % int(x.split('_')[4]) is used to add a leading zero in front of the second number, since otherwise 10 will sort before 2. You may have to do the same with the other numeric parts, x.split('_')[1] and x.split('_')[2].

Converting an imperative algorithm that "grows" a table into pure functions

My program, written in Python 3, has many places where it starts with a (very large) table-like numeric data structure and adds columns to it following a certain algorithm. (The algorithm is different in every place.)
I am trying to convert this into pure functional approach since I run into problems with the imperative approach (hard to reuse, hard to memoize interim steps, hard to achieve "lazy" computation, bug-prone due to reliance on state, etc.).
The Table class is implemented as a dictionary of dictionaries: the outer dictionary contains rows, indexed by row_id; the inner contains values within a row, indexed by column_title. The table's methods are very simple:
# return the value at the specified row_id, column_title
get_value(self, row_id, column_title)
# return the inner dictionary representing row given by row_id
get_row(self, row_id)
# add a column new_column_title, defined by func
# func signature must be: take a row and return a value
add_column(self, new_column_title, func)
Until now, I simply added columns to the original table, and each function took the whole table as an argument. As I'm moving to pure functions, I'll have to make all arguments immutable. So, the initial table becomes immutable. Any additional columns will be created as standalone columns and passed only to those functions that need them. A typical function would take the initial table, and a few columns that are already created, and return a new column.
The problem I run into is how to implement the standalone column (Column)?
I could make each of them a dictionary, but it seems very expensive. Indeed, if I ever need to perform an operation on, say, 10 fields in each logical row, I'll need to do 10 dictionary lookups. And on top of that, each column will contain both the key and the value, doubling its size.
I could make Column a simple list, and store in it a reference to the mapping from row_id to the array index. The benefit is that this mapping could be shared between all columns that correspond to the same initial table, and once a row_id is looked up, the result works for all columns. But does this create any other problems?
If I do this, can I go further, and actually store the mapping inside the initial table itself? And can I place references from the Column objects back to the initial table from which they were created? It seems very different from how I imagined a functional approach to work, but I cannot see what problems it would cause, since everything is immutable.
In general, does the functional approach frown on keeping a reference in the return value to one of the arguments? It doesn't seem like it would break anything (like optimization or lazy evaluation), since the argument was already known anyway. But maybe I'm missing something.
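For concreteness, the list-plus-shared-mapping idea could look like this; a minimal sketch with hypothetical names, not a settled design:
class Column(object):
    """An immutable column sharing one row_id -> position mapping with
    every other Column derived from the same initial table."""
    def __init__(self, index_map, values):
        self.index_map = index_map   # shared between columns, never mutated
        self.values = tuple(values)  # immutable storage

    def get_value(self, row_id):
        return self.values[self.index_map[row_id]]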
Here is how I would do it:
Derive your table class from a frozenset. Each row should be a subclass of tuple.
Now you can't modify the table -> immutability, great! The next step could be to consider each function a mutation which you apply to the table to produce a new one:
f T -> T'
That should be read as: apply the function f on the table T to produce a new table T'. You may also try to objectify the actual processing of the table data and see it as an Action which you apply or add to the table.
add(T, A) -> T'
The great thing here is that add could be subtract instead, giving you an easy way to model undo. When you get into this mindset, your code becomes very easy to reason about because you have no state that can screw things up.
Below is an example of how one could implement and process a table structure in a purely functional way in Python. Imho, Python is not the best language to learn FP in, because it makes it too easy to program imperatively. Haskell, F# or Erlang are better choices, I think.
class Table(frozenset):
    def __new__(cls, names, rows):
        return frozenset.__new__(cls, rows)

    def __init__(self, names, rows):
        frozenset.__init__(self, rows)
        self.names = names

def add_column(rows, func):
    return [row + (func(row, idx),) for (idx, row) in enumerate(rows)]

def table_process(t, (name, func)):
    return Table(
        t.names + (name,),
        add_column(t, lambda row, idx: func(row))
    )

def table_filter(t, (name, func)):
    names = t.names
    idx = names.index(name)
    return Table(
        names,
        [row for row in t if func(row[idx])]
    )

def table_rank(t, name):
    names = t.names
    idx = names.index(name)
    rows = sorted(t, key=lambda row: row[idx])
    return Table(
        names + ('rank',),
        add_column(rows, lambda row, idx: idx)
    )

def table_print(t):
    format_row = lambda r: ' '.join('%15s' % c for c in r)
    print format_row(t.names)
    print '\n'.join(format_row(row) for row in t)

if __name__ == '__main__':
    from random import randint
    cols = ('c1', 'c2', 'c3')
    T = Table(
        cols,
        [tuple(randint(0, 9) for x in cols) for x in range(10)]
    )
    table_print(T)

    # Columns to add to the table; this is a perfect fit for a
    # reduce. I'd honestly use a boring for loop instead, but reduce
    # is a perfect example of how in FP data and code "become one."
    # In fact, this whole program could have been written as just one
    # big reduce.
    actions = [
        ('max', max),
        ('min', min),
        ('sum', sum),
        ('avg', lambda r: sum(r) / float(len(r)))
    ]
    T = reduce(table_process, actions, T)
    table_print(T)

    # Ranking is different because it requires an ordering, which a
    # table does not have.
    T2 = table_rank(T, 'sum')
    table_print(T2)

    # Simple where filter: select * from T2 where c2 < 5.
    T3 = table_filter(T2, ('c2', lambda c: c < 5))
    table_print(T3)

How to compare an element of a tuple (int) to determine if it exists in a list

I have the two following lists:
# List of tuples representing the index of resources and their unique properties
# Format of (ID,Name,Prefix)
resource_types=[('0','Group','0'),('1','User','1'),('2','Filter','2'),('3','Agent','3'),('4','Asset','4'),('5','Rule','5'),('6','KBase','6'),('7','Case','7'),('8','Note','8'),('9','Report','9'),('10','ArchivedReport',':'),('11','Scheduled Task',';'),('12','Profile','<'),('13','User Shared Accessible Group','='),('14','User Accessible Group','>'),('15','Database Table Schema','?'),('16','Unassigned Resources Group','#'),('17','File','A'),('18','Snapshot','B'),('19','Data Monitor','C'),('20','Viewer Configuration','D'),('21','Instrument','E'),('22','Dashboard','F'),('23','Destination','G'),('24','Active List','H'),('25','Virtual Root','I'),('26','Vulnerability','J'),('27','Search Group','K'),('28','Pattern','L'),('29','Zone','M'),('30','Asset Range','N'),('31','Asset Category','O'),('32','Partition','P'),('33','Active Channel','Q'),('34','Stage','R'),('35','Customer','S'),('36','Field','T'),('37','Field Set','U'),('38','Scanned Report','V'),('39','Location','W'),('40','Network','X'),('41','Focused Report','Y'),('42','Escalation Level','Z'),('43','Query','['),('44','Report Template ','\\'),('45','Session List',']'),('46','Trend','^'),('47','Package','_'),('48','RESERVED','`'),('49','PROJECT_TEMPLATE','a'),('50','Attachments','b'),('51','Query Viewer','c'),('52','Use Case','d'),('53','Integration Configuration','e'),('54','Integration Command','f'),('55','Integration Target','g'),('56','Actor','h'),('57','Category Model','i'),('58','Permission','j')]
# This is a list of resource ID's that we do not want to reference directly, ever.
unwanted_resource_types=[0,1,3,10,11,12,13,14,15,16,18,20,21,23,25,27,28,32,35,38,41,47,48,49,50,57,58]
I'm attempting to compare the two in order to build a third list containing the 'Name' of each unique resource type that currently exists in unwanted_resource_types. e.g. The final result list should be:
result = ['Group','User','Agent','ArchivedReport','ScheduledTask','...','...']
I've tried the following that (I thought) should work:
result = []
for res in resource_types:
    if res[0] in unwanted_resource_types:
        result.append(res[1])
and when that failed to populate result I also tried:
result = []
for res in resource_types:
    for type in unwanted_resource_types:
        if res[0] == type:
            result.append(res[1])
also to no avail. Is there something I'm missing? I believe this would be the right place for a list comprehension, but that's still in my grey basket of full understanding (the Python docs are a bit too succinct for me in this case).
I'm also open to completely rethinking this problem, but I do need to retain the list of tuples as it's used elsewhere in the script. Thank you for any assistance you may provide.
Your resource types are using strings, and your unwanted resources are using ints, so you'll need to do some conversion to make it work.
Try this:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
or using a list comprehension:
result = [item[1] for item in resource_types if int(item[0]) in unwanted_resource_types]
The numbers in resource_types are numbers contained within strings, whereas the numbers in unwanted_resource_types are plain numbers, so your comparison is failing. This should work:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
The problem is that your triples contain strings and your unwanted resources contain numbers; change the data to
resource_types=[(0,'Group','0'), ...
or use int() to convert the strings to ints before comparison, and it should work. Your result can be computed with a list comprehension as in
result=[rt[1] for rt in resource_types if int(rt[0]) in unwanted_resource_types]
If you change ('0', ...) into (0, ...), you can leave out the int() call.
Additionally, you may change the unwanted_resource_types variable into a set, like
unwanted_resource_types=set([0,1,3, ... ])
to improve speed (if speed is an issue; otherwise it's unimportant).
The one-liner:
result = map(lambda x: dict(map(lambda a: (int(a[0]), a[1]), resource_types))[x], unwanted_resource_types)
without any explicit loop does the job.
Ok - you don't want to use this in production code - but it's fun. ;-)
Comment:
The inner dict(map(lambda a: (int(a[0]), a[1]), resource_types)) creates a dictionary from the input data:
{0: 'Group', 1: 'User', 2: 'Filter', 3: 'Agent', ...
The outer map chooses the names from the dictionary.
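A more readable equivalent of the same idea (a sketch): build the id -> name dictionary once, then look up each unwanted id.
id_to_name = dict((int(rt[0]), rt[1]) for rt in resource_types)
result = [id_to_name[i] for i in unwanted_resource_types]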

fast data comparison in python

I want to compare a large set of data in the form of 2 dictionaries of varying lengths.
(edit)
post = {0: [0.96180319786071777, 0.37529754638671875],
10: [0.20612385869026184, 0.17849941551685333],
20: [0.20612400770187378, 0.17510984838008881],...}
pre = {0: [0.96180319786071777, 0.37529754638671875],
1: [0.20612385869026184, 0.17849941551685333],
2: [0.20612400770187378, 0.17510984838008881],
5065: [0.80861318111419678, 0.76381617784500122],...}
The answer we need to get is 5065: [0.80861318111419678, 0.76381617784500122]. This is based on the fact that we are only comparing the values and not the indices at all.
I am using this key value pair only to remember the sequence of data. The data type can be replaced with a list/set if need be. I need to find out the key:value (index and value) pairs of the elements that are not in common to the dictionaries.
The code that I am using is very simple:
new = {}
found = []
for i in range(0, len(post)):
    found = []
    for j in range(0, len(pre)):
        if post[i] not in pre.values():
            if post[i] not in new:
                new[i] = post[i]
            found.append(j)
            break
    if found:
        for f in found:
            pre.pop(f)
new contains the elements I need.
The problem I am facing is that this process is too slow; it sometimes takes over an hour, and the data can be much larger at times. I need it to be faster.
Is there an efficient way of doing what I am trying to achieve? I would prefer not to depend on external packages beyond those bundled with Python 2.5 (64-bit) unless absolutely necessary.
Thank you all.
This is basically what sets are designed for (computing differences in sets of items). The only gotcha is that the things you put into a set need to be hashable, and lists aren't. However, tuples are, so if you convert to that, you can put those into a set:
post_set = set(tuple(x) for x in post.itervalues())
pre_set = set(tuple(x) for x in pre.itervalues())
items_in_only_one_set = post_set ^ pre_set
For more about sets: http://docs.python.org/library/stdtypes.html#set
To get the original indices after you've computed the differences, what you'd probably want is to generate reverse lookup tables:
post_indices = dict((tuple(v),k) for k,v in post.iteritems())
pre_indices = dict((tuple(v),k) for k,v in pre.iteritems())
Then you can just take a given tuple and look up its index via the dictionaries:
index = post_indices.get(a_tuple, pre_indices.get(a_tuple))
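Putting the pieces together, a sketch of rebuilding the {index: value} dict the question asks for:
# For each tuple present in only one of the sets, recover its original
# index from whichever reverse table contains it.
new = {}
for t in items_in_only_one_set:
    idx = post_indices.get(t, pre_indices.get(t))
    new[idx] = list(t)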
Your problem is likely the nested for loops combined with the use of range(), which creates a new list each time and can be slow. You will probably get some automatic speedups by iterating over pre and post directly, and by avoiding the nested iteration.
post = {0: [0.96180319786071777, 0.37529754638671875],
10: [0.20612385869026184, 0.17849941551685333],
20: [0.20612400770187378, 0.17510984838008881]}
pre = {0: [0.96180319786071777, 0.37529754638671875],
1: [0.20612385869026184, 0.17849941551685333],
2: [0.20612400770187378, 0.17510984838008881],
5065: [0.80861318111419678, 0.76381617784500122]}
'''Create sets of values, independent of dict key for O(1) lookup'''
post_set=set(map(tuple, post.values()))
pre_set=set(map(tuple, pre.values()))
'''Iterate through each structure only once, filtering items that are found in
the sets we created earlier, updating new_diff'''
from itertools import ifilterfalse
new_diff=dict(ifilterfalse(lambda x: tuple(x[1]) in pre_set, post.items()))
new_diff.update(ifilterfalse(lambda x: tuple(x[1]) in post_set, pre.items()))
new_diff is now a dict such that each value is not found in both post and pre, with the original index preserved.
>>> print new_diff
{5065: [0.80861318111419678, 0.76381617784500122]}
