My program, written in Python 3, has many places where it starts with a (very large) table-like numeric data structure and adds columns to it following a certain algorithm. (The algorithm is different in every place.)
I am trying to convert this to a purely functional approach because I keep running into problems with the imperative one (hard to reuse, hard to memoize interim steps, hard to achieve "lazy" computation, bug-prone due to reliance on state, etc.).
The Table class is implemented as a dictionary of dictionaries: the outer dictionary contains rows, indexed by row_id; the inner contains values within a row, indexed by column_title. The table's methods are very simple:
# return the value at the specified row_id, column_title
get_value(self, row_id, column_title)
# return the inner dictionary representing row given by row_id
get_row(self, row_id)
# add a column new_column_title, defined by func
# func signature must be: take a row and return a value
add_column(self, new_column_title, func)
Until now, I simply added columns to the original table, and each function took the whole table as an argument. As I'm moving to pure functions, I'll have to make all arguments immutable. So, the initial table becomes immutable. Any additional columns will be created as standalone columns and passed only to those functions that need them. A typical function would take the initial table, and a few columns that are already created, and return a new column.
The problem I run into is how to implement the standalone column (Column)?
I could make each of them a dictionary, but it seems very expensive. Indeed, if I ever need to perform an operation on, say, 10 fields in each logical row, I'll need to do 10 dictionary lookups. And on top of that, each column will contain both the key and the value, doubling its size.
I could make Column a simple list, and store in it a reference to the mapping from row_id to the array index. The benefit is that this mapping could be shared between all columns that correspond to the same initial table, and also once looked up once, it works for all columns. But does this create any other problems?
If I do this, can I go further, and actually store the mapping inside the initial table itself? And can I place references from the Column objects back to the initial table from which they were created? It seems very different from how I imagined a functional approach to work, but I cannot see what problems it would cause, since everything is immutable.
In general, does the functional approach frown on keeping a reference in the return value to one of the arguments? It doesn't seem like it would break anything (like optimization or lazy evaluation), since the argument was already known anyway. But maybe I'm missing something.
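The shared-index layout described above could look roughly like the sketch below. The names `RowIndex`, `Column`, and `derive_column` are hypothetical and not part of the Table class in the question; tuples replace lists so that column values cannot be changed, and the objects themselves are immutable only by convention.

```python
from types import MappingProxyType


class RowIndex:
    """Hypothetical immutable mapping from row_id to array position."""

    def __init__(self, row_ids):
        self.row_ids = tuple(row_ids)
        # Read-only view, so no column can alter the shared mapping.
        self.position = MappingProxyType(
            {row_id: i for i, row_id in enumerate(self.row_ids)}
        )


class Column:
    """Hypothetical standalone column: a tuple of values plus a shared index."""

    def __init__(self, index, values):
        values = tuple(values)
        if len(values) != len(index.row_ids):
            raise ValueError("column length must match the index")
        self.index = index    # shared by reference, never copied
        self.values = values

    def get_value(self, row_id):
        return self.values[self.index.position[row_id]]


def derive_column(func, *columns):
    """Pure function: build a new column from existing ones, row by row."""
    index = columns[0].index
    if any(c.index is not index for c in columns):
        raise ValueError("all columns must share the same index")
    rows = zip(*(c.values for c in columns))
    return Column(index, (func(*vals) for vals in rows))


index = RowIndex(["r1", "r2", "r3"])
price = Column(index, [10.0, 20.0, 30.0])
qty = Column(index, [1, 2, 3])
total = derive_column(lambda p, q: p * q, price, qty)
print(total.get_value("r2"))  # 40.0
```

Inside `derive_column` the values are combined purely by position, so the row_id lookup is paid only when a caller asks for a specific row.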
Here is how I would do it:
Derive your table class from a frozenset.
Each row should be a subclass of tuple.
Now you can't modify the table -> immutability, great! The next step
could be to consider each function a mutation which you apply to the
table to produce a new one:
f T -> T'
That should be read as apply the function f on the table T to produce
a new table T'. You may also try to objectify the actual processing of
the table data and see it as an Action which you apply or add to the
table.
add(T, A) -> T'
The great thing here is that add could be subtract instead giving you
an easy way to model undo. When you get into this mindset, your code
becomes very easy to reason about because you have no state that can
screw things up.
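One hedged reading of add(T, A) -> T' and its inverse, using plain tuples. The `(names, rows)` pair and the helper names `add_column` and `remove_column` are assumptions made for illustration only.

```python
def add_column(table, name, func):
    """Apply an action: return a new (names, rows) pair with one more column."""
    names, rows = table
    return (names + (name,), tuple(row + (func(row),) for row in rows))


def remove_column(table, name):
    """Inverse action: return a new table without the named column."""
    names, rows = table
    idx = names.index(name)
    return (
        names[:idx] + names[idx + 1:],
        tuple(row[:idx] + row[idx + 1:] for row in rows),
    )


T = (("c1", "c2"), ((1, 2), (3, 4)))
T1 = add_column(T, "sum", sum)
T0 = remove_column(T1, "sum")

print(T1)
print(T0 == T)  # True: undoing the action gives back an equal table
```

Because neither function touches its argument, T is still available after T1 is built, so undo could also be modelled by simply keeping the earlier value around.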
Below is an example of how one could implement and process a table
structure in a purely functional way in Python. Imho, Python is not
the best language to learn about FP in because it makes it too easy to
program imperatively. Haskell, F# or Erlang are better choices I think.
from functools import reduce
from random import randint

class Table(frozenset):
    # An immutable set of row tuples plus a tuple of column names.
    # Being a set, it has no row order and collapses duplicate rows.
    def __new__(cls, names, rows):
        return super().__new__(cls, rows)

    def __init__(self, names, rows):
        # The rows were already stored by __new__; only keep the names here.
        self.names = names

def add_column(rows, func):
    return [row + (func(row, idx),) for (idx, row) in enumerate(rows)]

def table_process(t, action):
    # action is a (name, func) pair; func maps a row to the new value.
    name, func = action
    return Table(
        t.names + (name,),
        add_column(t, lambda row, idx: func(row))
    )

def table_filter(t, condition):
    # condition is a (name, func) pair; func tests the value in that column.
    name, func = condition
    names = t.names
    idx = names.index(name)
    return Table(
        names,
        [row for row in t if func(row[idx])]
    )

def table_rank(t, name):
    names = t.names
    idx = names.index(name)
    rows = sorted(t, key = lambda row: row[idx])
    return Table(
        names + ('rank',),
        add_column(rows, lambda row, idx: idx)
    )

def table_print(t):
    format_row = lambda r: ' '.join('%15s' % c for c in r)
    print(format_row(t.names))
    print('\n'.join(format_row(row) for row in t))

if __name__ == '__main__':
    cols = ('c1', 'c2', 'c3')
    T = Table(
        cols,
        [tuple(randint(0, 9) for x in cols) for x in range(10)]
    )
    table_print(T)

    # Columns to add to the table, this is a perfect fit for a
    # reduce. I'd honestly use a boring for loop instead, but reduce
    # is a perfect example for how in FP data and code "becomes one."
    # In fact, this whole program could have been written as just one
    # big reduce.
    actions = [
        ('max', max),
        ('min', min),
        ('sum', sum),
        ('avg', lambda r: sum(r) / float(len(r)))
    ]
    T = reduce(table_process, actions, T)
    table_print(T)

    # Ranking is different because it requires an ordering, which a
    # table does not have.
    T2 = table_rank(T, 'sum')
    table_print(T2)

    # Simple where filter: select * from T2 where c2 < 5.
    T3 = table_filter(T2, ('c2', lambda c: c < 5))
    table_print(T3)
This question is probably me not understanding architecture of (new) sqlalchemy, typically I use code like this:
query = select(models.Organization).where(
    models.Organization.organization_id == organization_id
)
result = await self.session.execute(query)
return result.scalars().all()
Works fine, I get a list of models (if any).
With a query with specific columns only:
query = (
    select(
        models.Payment.organization_id,
        models.Payment.id,
        models.Payment.payment_type,
    )
    .where(
        models.Payment.is_cleared.is_(True),
    )
    .limit(10)
)
result = await self.session.execute(query)
return result.scalars().all()
I am getting the first row, first column only. This seems to be the same behaviour as: https://docs.sqlalchemy.org/en/14/core/connections.html?highlight=scalar#sqlalchemy.engine.Result.scalar
My understanding so far was that in new sqlalchemy we should always call scalars() on the query, as described here: https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-orm-usage
But with specific columns, it seems we cannot use scalars() at all. What is even more confusing is that result.scalars() returns a sqlalchemy.engine.result.ScalarResult, which has fetchmany() and fetchall() among other methods, and I am unable to iterate over it in any meaningful way.
My question is, what do I not understand?
My understanding so far was that in new sqlalchemy we should always call scalars() on the query
That is mostly true, but only for queries that return whole ORM objects. Just a regular .execute()
query = select(Payment)
results = sess.execute(query).all()
print(results) # [(Payment(id=1),), (Payment(id=2),)]
print(type(results[0])) # <class 'sqlalchemy.engine.row.Row'>
returns a list of Row objects, each containing a single ORM object. Users found that awkward since they needed to unpack the ORM object from the Row object. So .scalars() is now recommended
results = sess.scalars(query).all()
print(results) # [Payment(id=1), Payment(id=2)]
print(type(results[0])) # <class '__main__.Payment'>
However, for queries that return individual attributes (columns) we don't want to use .scalars() because that will just give us one column from each row, normally the first column
query = select(
    Payment.id,
    Payment.organization_id,
    Payment.payment_type,
)
results = sess.scalars(query).all()
print(results) # [1, 2]
Instead, we want to use a regular .execute() so we can see all the columns
results = sess.execute(query).all()
print(results) # [(1, 123, None), (2, 234, None)]
Notes:
.scalars() is doing the same thing in both cases: return a list containing a single (scalar) value from each row (default is index=0).
sess.scalars() is the preferred construct. It is simply shorthand for sess.execute().scalars().
I have a large (>500k rows) pandas df like so
orig_df = pd.DataFrame(columns=['id', 'free_text1', 'something_inert', 'free_text2'])
free_textX is a string field containing user input imported from a csv. The goal is to have a function func that does various checks on each row of free_textX and then performs Levenshtein fuzzy text recognition based on the contents of another df, reference. Something like
from rapidfuzz import process
LEVENSHTEIN_DIST = 25
def func(s) -> str:
    if s == "25":
        return s
    elif s == "nothing":
        return "something"
    else:
        # Note: extractOne returns a (match, score, index) tuple,
        # or None if no choice reaches score_cutoff.
        s2 = process.extractOne(
            query = s,
            choices = reference['col_name'],
            score_cutoff = LEVENSHTEIN_DIST
        )
        return s2
After this process a new column has to be inserted after free_textX called recog_textX containing the returned values from func.
I tried vectorization (for performance) like so
orig_df.insert(loc=new_col_index,  # calculated before
               column='recog_textX',
               value=func(orig_df['free_textX'])
               )
def func(series) -> pd.core.series.Series:
    ...
but I don't understand how to structure func (handling an entire df col as a series, by demand of vectorization, right?) as process.extractOne(...) -> str handles single strs instead of a series. Those interface concepts seem incompatible to me. But I do want to avoid a classic iteration here for performance reasons. My grasp of pandas is too shallow here. Help me out?
I may be missing the point, but you can use the apply function to get what I think you want:
orig_df['recog_textX'] = orig_df['free_textX'].apply(func)
This will create a new column 'recog_textX' by applying your function func to each element of the 'free_textX' column.
Let me know if I misunderstood your question.
As an aside, I do not think vectorizing this operation will make any difference speed-wise, given that each application of func() is a complicated string operation. But it does look nicer than just looping through rows.
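If many rows repeat the same free text, one option that keeps the element-wise apply is to cache results so each distinct string is matched only once. The sketch below uses only the standard library: `difflib.get_close_matches` stands in for rapidfuzz (a swapped-in technique, not the asker's scorer), and `REFERENCE` and `recognise` are hypothetical names.

```python
from difflib import get_close_matches
from functools import lru_cache

# Hypothetical reference values; in the question they come from another DataFrame.
REFERENCE = ("apple", "banana", "cherry")


@lru_cache(maxsize=None)
def recognise(s):
    """Return the closest reference string, or s unchanged if nothing is close."""
    matches = get_close_matches(s, REFERENCE, n=1, cutoff=0.6)
    return matches[0] if matches else s


texts = ["appel", "banana", "appel", "xyz"]
results = [recognise(t) for t in texts]
print(results)
print(recognise.cache_info().hits)  # the repeated "appel" is served from the cache
```

With pandas, a cached function like this would be used the same way as above, for example `orig_df['free_textX'].apply(recognise)`.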
Suppose I have a QTreeWidget with three columns: two with string values and a third with integer values. Any value may appear more than once in its column.
Username (str)
Product (str)
Quantity (int)
I then want to be able to sort these items by either username or product, and to have rows that share these values to be sorted by quantity.
As a side note, I also need to be able to sort the values of the hypothetical quantity as numeric values.
Imagine I had three rows sorted by quantity in the previous example and that those rows had these values:
1
2
10
I would then want these rows to be sorted in that same order, not as they would be if they were sorted as string values:
1
10
2
How do I implement this combination using PyQt5?
Foreword
I'm not a big fan of long answers, and even less of a fan of long pieces of hard-to-read code, but these are nonetheless the solutions I came up with when looking for an answer to this question myself a while back.
Simple
This first piece of code is basically a much simplified version of what I used in the end. It's more efficient and, more importantly, much easier to read and understand.
from PyQt5.QtWidgets import QTreeWidget, QTreeWidgetItem

class SimpleMultisortTreeWidget(QTreeWidget):
    def __init__(self, *a, **k):
        super().__init__(*a, **k)
        self._csort_order = []
        self.header().sortIndicatorChanged.connect(self._sortIndicatorChanged)

    def _sortIndicatorChanged(self, n, order):
        try:
            self._csort_order.remove(n)
        except ValueError:
            pass
        self._csort_order.insert(0, n)
        self.sortByColumn(n, order)

class SimpleMultisortTreeWidgetItem(QTreeWidgetItem):
    def __lt__(self, other):
        corder = self.treeWidget()._csort_order
        return list(map(self.text, corder)) < \
               list(map(other.text, corder))
Extended
I also had the need to...
Sort some columns as integers and/or decimal.Decimal type objects.
Mix ascending and descending order (i.e. mind the Qt.SortOrder set for each column)
The following example is therefore what I ended up using myself.
from PyQt5.QtWidgets import QTreeWidget, QTreeWidgetItem

class MultisortTreeWidget(QTreeWidget):
    u"""QTreeWidget inheriting object, to be populated by
    ``MultisortTreeWidgetItems``, that allows sorting of multiple columns with
    different ``Qt.SortOrder`` values.
    """
    def __init__(self, *arg, **kw):
        r"Pass on all positional and key word arguments to super().__init__"
        super().__init__(*arg, **kw)
        self._csort_corder = []
        self._csort_sorder = []
        self.header().sortIndicatorChanged.connect(
            self._sortIndicatorChanged
        )

    def _sortIndicatorChanged(self, col_n, order):
        r"""
        Update private attributes to reflect the current sort indicator.
        (Connected to self.header().sortIndicatorChanged)
        :param col_n: Sort indicator indicates column with this index to be
        the currently sorted column.
        :type col_n: int
        :param order: New sort order indication. Qt enum, 1 or 0.
        :type order: Qt.SortOrder
        """
        # The new and current column number may, or may not, already be in the
        # list of columns that is used as a reference for their individual
        # priority.
        try:
            i = self._csort_corder.index(col_n)
        except ValueError:
            pass
        else:
            del self._csort_corder[i]
            del self._csort_sorder[i]
        # Force current column to have highest priority when sorting.
        self._csort_corder.insert(0, col_n)
        self._csort_sorder.insert(0, order)
        self._csort = list(zip(self._csort_corder, self._csort_sorder))
        # Resort items using the modified attributes.
        self.sortByColumn(col_n, order)

class MultisortTreeWidgetItem(QTreeWidgetItem):
    r"""QTreeWidgetItem inheriting objects that, when added to a
    MultisortTreeWidget, keeps the order of multiple columns at once. Also
    allows for column specific type sensitive sorting when class attributes
    SORT_COL_KEYS is set.
    """
    @staticmethod
    def SORT_COL_KEY(ins, c):
        return ins.text(c)

    SORT_COL_KEYS = []

    def __lt__(self, other):
        r"""Compare order between this and another MultisortTreeWidgetItem like
        instance.
        :param other: Object to compare against.
        :type other: MultisortTreeWidgetItem.
        :returns: bool
        """
        # Fall back on the default functionality if the parenting QTreeWidget
        # is not a subclass of MultisortTreeWidget or the SortIndicator has not
        # been changed.
        try:
            csort = self.treeWidget()._csort
        except AttributeError:
            return super(MultisortTreeWidgetItem, self).__lt__(other)
        # Instead of comparing values directly, place them in two lists and
        # extend those lists with values from columns with known sort order.
        order = csort[0][1]
        left = []
        right = []
        for c, o in csort:
            try:
                key = self.SORT_COL_KEYS[c]
            except (KeyError, IndexError):
                key = self.SORT_COL_KEY
            # Reverse sort order for columns not sorted according to the
            # current sort order indicator.
            if o == order:
                left.append(key(self, c))
                right.append(key(other, c))
            else:
                left.append(key(other, c))
                right.append(key(self, c))
        return left < right
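The side-swapping trick in __lt__ can be tried without Qt. The sketch below is a plain-Python stand-in: rows are tuples of strings, `ASC` and `DESC` mimic the Qt.SortOrder values, and `Item` and `multisort` are hypothetical names. It assumes that a descending sort on the primary column is simply the ascending result reversed.

```python
ASC, DESC = 0, 1  # stand-ins for Qt.AscendingOrder / Qt.DescendingOrder


class Item:
    """Hypothetical stand-in for a tree item: one row of strings."""

    def __init__(self, row, csort, keys):
        self.row = row
        self.csort = csort  # [(column, order), ...], highest priority first
        self.keys = keys    # {column: function turning the text into a sort key}

    def __lt__(self, other):
        primary_order = self.csort[0][1]
        left, right = [], []
        for col, order in self.csort:
            key = self.keys.get(col, str)
            a, b = key(self.row[col]), key(other.row[col])
            # Swap sides for columns whose order differs from the primary one.
            if order == primary_order:
                left.append(a)
                right.append(b)
            else:
                left.append(b)
                right.append(a)
        return left < right


def multisort(rows, csort, keys):
    items = [Item(row, csort, keys) for row in rows]
    # Assumption: a descending primary sort is the ascending result reversed.
    items.sort(reverse=(csort[0][1] == DESC))
    return [item.row for item in items]


rows = [("bob", "pen", "2"), ("alice", "ink", "10"),
        ("alice", "pen", "1"), ("bob", "ink", "10")]
# Username ascending, then quantity descending and compared as integers.
for row in multisort(rows, [(0, ASC), (2, DESC)], {2: int}):
    print(row)
```

Passing `int` as the key for the quantity column is what makes "10" sort after "2" rather than between "1" and "2".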
Usage
The static method SORT_COL_KEY and the SORT_COL_KEYS class attribute of the MultisortTreeWidgetItem class above also allow values other than those returned by self.text(N) to be used, for example a list returned by self.data().
The following example sorts the text in the rows of the first column as integers and sorts the rows of the third column by the corresponding object in a list returned by self.data(). All other columns are sorted by their item.text() values, compared as strings.
class UsageExampleItem(MultisortTreeWidgetItem):
    SORT_COL_KEYS = {
        0: lambda item, col: int(item.text(col)),
        2: lambda item, col: item.data()[col],
        5: lambda item, col: int(item.text(col) or 0)  # Empty str defaults to 0
    }
Create a MultisortTreeWidget object and add it to a layout, then create UsageExampleItems and add them to the MultisortTreeWidget.
This solution "remembers" the columns and sort order used previously. So, if you want to sort the items in a UsageExampleItems widget by the values in the first column, and have rows that share a value to be sorted by the second column among themselves, then you would first click on the header item of the second column and then proceed to click on the header item of the first column.
Imagine I have a table with one column, product_line. I want to create another column with the product_line_name, based on the product_line number. The information is taken from a dictionary.
I have declared a dictionary containing product lines and descriptions:
categories={
    'Apparel': [99],  # values must be collections for `glnum in List` below
    'Bikes': [32]}
######## function that returns the category from a gl number
def get_cat(glnum,dic=categories):
    for cat, List in dic.items():
        if glnum in List:
            return cat
    return(0)
data['category']=data['product_line'].apply(lambda x: get_cat(x))
Works
However I cannot get it to work using method chaining:
tt = (tt
      .assign(category = lambda d: get_cat(d.gl_product_line)))
It should be an error related to the Series but I am unsure why it wouldn't work since the lambda function should call the get_cat repeatedly for each row of the dataframe - which does not happen apparently.
Any ideas of how I could do this using the .assign in method chaining?
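A likely cause, offered as a sketch rather than a confirmed diagnosis: a callable passed to assign is called once with the whole DataFrame, so get_cat receives an entire Series instead of a single number. One way around it is to invert the dictionary once and look each value up. The example below shows the inversion in plain Python, assuming each category maps to a list of product-line numbers; `invert` and `lookup` are hypothetical names.

```python
categories = {
    "Apparel": [99],
    "Bikes": [32, 33],  # assumed: one category may cover several numbers
}


def invert(dic):
    """Map each product-line number back to its category name."""
    return {num: cat for cat, nums in dic.items() for num in nums}


lookup = invert(categories)
product_lines = [32, 99, 7]
# Unknown numbers fall back to 0, mirroring get_cat's return(0).
print([lookup.get(n, 0) for n in product_lines])  # ['Bikes', 'Apparel', 0]
```

With pandas, such a dictionary could be used inside the chain as `.assign(category=lambda d: d['gl_product_line'].map(lookup).fillna(0))`; alternatively, get_cat could be kept as it is by applying it element-wise with `d['gl_product_line'].apply(get_cat)`.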
[Python 3.1]
Edit: mistake in the original code.
I need to print a table. The first row should be a header, which consists of column names separated by tabs. The following rows should contain the data (also tab-separated).
To clarify, let's say I have columns "speed", "power", "weight". I originally wrote the following code, with the help from a related question I asked earlier:
column_names = ['speed', 'power', 'weight']
def f(row_number):
    # some calculations here to populate variables speed, power, weight
    # e.g., power = retrieve_avg_power(row_number) * 2.5
    # e.g., speed = math.sqrt(power) / 2
    # etc.
    locals_ = locals()
    return {x : locals_[x] for x in column_names}
def print_table(rows):
    print(*column_names, sep = '\t')
    for row_number in range(rows):
        row = f(row_number)
        print(*[row[x] for x in component_names], sep = '\t')
But then I learned that I should avoid using locals() if possible.
Now I'm stuck. I don't want to type the list of all the column names more than once. I don't want to rely on the fact that every dictionary I create inside f() is likely to iterate through its keys in the same order. And I don't want to use locals().
Note that the functions print_table() and f() do a lot of other stuff; so I have to keep them separate.
How should I write the code?
import math

class Columns:
    pass

def f(row_number):
    c = Columns()
    c.power = retrieve_avg_power(row_number) * 2.5
    c.speed = math.sqrt(c.power) / 2  # read power back from c, not a bare local
    return c.__dict__
This also lets you specify which of the variables are meant as columns, rather than being temporaries inside the function.
You could use an OrderedDict to fix the order of the dictionaries. But as I see it that isn't even necessary. You are always taking the keys from the column_names list (except in the last line, I assume that is a typo), so the order of the values will always be the same.
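A runnable sketch of the approach above, using types.SimpleNamespace (available from Python 3.3) in place of the empty Columns class. `retrieve_avg_power` is stubbed with made-up numbers, the weight formula is invented, and `format_table` is a hypothetical name that builds the text instead of printing it directly.

```python
import math
from types import SimpleNamespace

column_names = ["speed", "power", "weight"]


def retrieve_avg_power(row_number):
    # Stub for illustration only; the real function is not shown in the question.
    return 10.0 * (row_number + 1)


def f(row_number):
    c = SimpleNamespace()
    scale = 2.5  # a temporary: not set on c, so it never becomes a column
    c.power = retrieve_avg_power(row_number) * scale
    c.speed = math.sqrt(c.power) / 2
    c.weight = 70 + row_number  # invented value, only to fill the column
    return vars(c)


def format_table(rows):
    lines = ["\t".join(column_names)]
    for row_number in range(rows):
        row = f(row_number)
        # Values are always pulled in column_names order, whatever the dict order.
        lines.append("\t".join(str(row[name]) for name in column_names))
    return "\n".join(lines)


print(format_table(2))
```

The column names are typed once in `column_names` and once as attribute assignments inside f, and the output order depends only on the list.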
An alternative to locals() is to use the inspect module:
import inspect
def f(row_number):
    # some calculations here to populate variables speed, power, weight
    # e.g., power = retrieve_avg_power(row_number) * 2.5
    # e.g., speed = math.sqrt(power) / 2
    # etc.
    locals_ = inspect.currentframe().f_locals
    return {x : locals_[x] for x in column_names }