Using C-like arrays in Python

Is the following ever done in Python to minimize the "allocation time" of creating new objects in a for loop? Or is this considered bad practice / is there a better alternative?
for row in rows:
    data_saved_for_row = []  # re-initializes every time (takes a while)
    for item in row:
        do_something()
    do_something_with_row()
vs. the "c-version" --
data_saved_for_row = []
for row in rows:
    for index, item in enumerate(row):
        do_something()
        data_saved_for_row[index + 1] = '\0'  # now we have a crude way of knowing
    do_something_with_row()                   # when it ends without having
                                              # to always reinitialize
Normally the second approach seems like a terrible idea, but I've run into situations when iterating over a million+ items where the per-row initialization:
data_saved_for_row = []
has taken a second or more to do.
Here's an example:
>>> print timeit.timeit(stmt="l = list();", number=int(1e8))
7.77035903931

If you want functionality for this sort of performance, you may as well just write it in C yourself and import it with ctypes or something. But then, if you're writing this kind of performance-driven application, why are you using Python to do it in the first place?
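For reference, the ctypes route would look roughly like the sketch below. Everything library-specific here is hypothetical (rowbuf.so and its process_row function are made up for illustration); the point is just that the buffer is allocated once and reused:
import ctypes

# Hypothetical: assumes you compiled a C file into rowbuf.so exposing
#   void process_row(double *buf, size_t n);
lib = ctypes.CDLL("./rowbuf.so")
lib.process_row.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]
lib.process_row.restype = None

buf = (ctypes.c_double * 1000)()   # allocated once, reused for every row
for row in rows:
    for i, item in enumerate(row):
        buf[i] = item
    lib.process_row(buf, len(row))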
You can use list.clear() as a middle ground here, so you don't have to reallocate anything immediately:
data_saved_for_row = []
for row in rows:
    data_saved_for_row.clear()
    for item in row:
        do_something()
    do_something_with_row()
but this isn't a perfect solution, as shown by the CPython source for this (comments omitted):
static int
_list_clear(PyListObject *a)
{
    Py_ssize_t i;
    PyObject **item = a->ob_item;
    if (item != NULL) {
        i = Py_SIZE(a);
        Py_SIZE(a) = 0;
        a->ob_item = NULL;
        a->allocated = 0;
        while (--i >= 0) {
            Py_XDECREF(item[i]);
        }
        PyMem_FREE(item);
    }
    return 0;
}
I'm not perfectly fluent in C, but this code looks like it's freeing the memory backing the list, so that memory will have to be reallocated every time you add something to the list anyway. This strongly implies that Python just doesn't natively support your approach.
Or you could write your own Python data structure (as a subclass of list, maybe) that implements this paradigm (never actually clearing its own list, but maintaining a continuous notion of its own length), which might be a cleaner solution for your use case than implementing it in C.
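A minimal sketch of that idea (my own illustration, untested against your workload - the class and method names are made up):
class ReusableList(list):
    """A list that is "cleared" by resetting a logical length,
    so the underlying storage is never freed."""

    def __init__(self):
        super(ReusableList, self).__init__()
        self.length = 0                    # number of logically valid items

    def push(self, item):
        if self.length < len(self):
            self[self.length] = item       # reuse an already-allocated slot
        else:
            self.append(item)              # grow only when really needed
        self.length += 1

    def reset(self):
        self.length = 0                    # "clear" without freeing anything

    def valid_items(self):
        return self[:self.length]          # only the logically live entries
Whether this actually beats data_saved_for_row = [] depends on the workload; it only avoids re-allocating the backing array, not the per-item work.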

Related

MapPartition in PySpark

I am extremely new to Python and not very familiar with the syntax. I am looking at some sample implementations of the PySpark mapPartitions method. To articulate the ask better, I have written the Java equivalent of what I need.
JavaRDD<Row> modified = auditSet.javaRDD().mapPartitions(new FlatMapFunction<Iterator<Row>, Row>() {
    public Iterator<Row> call(Iterator<Row> t) throws Exception {
        Iterable<Row> iterable = () -> t;
        return StreamSupport.stream(iterable.spliterator(), false).map(m -> enrich(m)).iterator();
    }

    private Row enrich(Row r) {
        // <code to enrich row r>
        return RowFactory.create(/* new row from enriched row r */);
    }
});
I have an RDD. I need to call mapPartitions on it. I am not sure how to pass/handle the iterator inside of Python. Once the call reaches the method, I am looking to iterate over each record, enrich it, and return the result.
Any help is appreciated.
Not sure if this is the right way, but this is what I have done. Open to comments and corrections.
def mpImpl(itr, broadcastList):
    lst = broadcastList.value
    for x in itr:
        yield enrich(x, lst)

auditSetDF.rdd.mapPartitions(lambda itr: mpImpl(itr, locationListBrdcast))
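If you prefer to avoid the lambda, an equivalent variant (a sketch, untested, assuming locationListBrdcast is your broadcast variable) binds it with functools.partial:
from functools import partial

enriched = auditSetDF.rdd.mapPartitions(
    partial(mpImpl, broadcastList=locationListBrdcast))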

Update a value in a JSONObject without pointers in Python

I'm writing Python code in which I use JSONObjects to communicate with a Java application. My problem is that I want to change a value in the JSONObject (in this example called py_json), and the depth of that JSONObject is not fixed, but it is known.
VarName[x] is the input of the method, and the length of VarName is the depth/size of the JSONObject.
The code below would work, but I can't copy and paste it 100 times to be sure there are no bigger JSONObjects it doesn't cover.
if length == 1:
    py_json[VarName[0]] = newValue
elif length == 2:
    py_json[VarName[0]][VarName[1]] = newValue
elif length == 3:
    py_json[VarName[0]][VarName[1]][VarName[2]] = newValue
In C I would solve it with pointers like this:
int *pointer = NULL;
pointer = &py_json;
for (i = 0; i < length; i++) {
    pointer = &(*pointer[VarName[i]]);
}
*pointer = varValue;
But there are no pointers in Python.
Do you know a way to do this dynamically in Python?
Python's "variables" are just names pointing to objects (instead of symbolic names for memory addresses as in C) and Python assignement doesn't "copies" a variable value to a new memory location, it only make the name points to a different object - so you don't need pointers to get the same result (you probably want to read this for more on python's names / variables).
IOW, the solution is basically the same: just use a for loop to get the desired target (actually: the parent of the desired target), then assign to it:
target = py_json
for i in range(0, length - 1):
    target = target[VarName[i]]
target[VarName[length - 1]] = newValue
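For example, with a hypothetical py_json and VarName (made up here purely to illustrate):
py_json = {'a': {'b': {'c': 1}}}
VarName = ['a', 'b', 'c']
length = len(VarName)
newValue = 42

target = py_json
for i in range(0, length - 1):
    target = target[VarName[i]]
target[VarName[length - 1]] = newValue

print(py_json)   # {'a': {'b': {'c': 42}}}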

Recursive function in Python: Best way to get a list of specific nested items

I have a tree of nested dictionaries. This is a small extract, just to give you an idea:
db = {
    'compatibility': {
        'style': {
            'path_to_file': 'compatibility/render/style.py',
            'checksum': {
                '0.0.3': 'AAA55d796c25ad867bbcb8e0da4e48d17826e6f9fce',
                '0.0.2': '55d796c25ad867bbcb8e0da4e48d17826e6f9fe606',
            },
        },
    },
    'developer': {
        'render': {
            'installation': {
                'path_to_file': 'developer/render/installation.py',
                'checksum': {
                    '0.0.1': 'c1c0d4080e72292710ac1ce942cf59ce0e26319cf3',
                },
            },
            'tests': {
                'path_to_file': 'developer/render/test.py',
                'checksum': {
                    '0.0.1': 'e71173ac43ecd949fdb96cfb835abadb877a5233a36b115',
                },
            },
        },
    },
}
I want to get a list of all the module dictionaries nested in the tree. That way I would be able to loop over the list and test the checksum of each file (note that the modules can be at different levels, as in the example above).
To achieve that, I wrote the following recursive function. I know for a fact that each module has "path_to_file" and "checksum" keys, so I use that to test whether a dict is a module. Note that I had to wrap the recursive function inside another function that holds the list, so that the list wouldn't be overwritten each time the recursive function runs.
def _get_modules_from_db(dictionary):
    def recursive_find(inner_dictionary):
        for k, v in inner_dictionary.iteritems():
            if (isinstance(v, dict) and
                    not sorted(v.keys()) == ['path_to_file', 'sha512sum']):
                recursive_find(v)
            else:
                leaves.append(v)

    leaves = []
    recursive_find(dictionary)
    return leaves
This approach works; however, having to wrap the function seems very ugly to me. So, my question for the pros at Stack Overflow:
Is there a simpler (or better) approach you would recommend to achieve this without having to wrap the function?
First, the only reason you need to wrap the function is because you're making recursive_find mutate the leaves closure cell in-place instead of returning it. Sometimes that's a useful performance optimization (although just as often it's a pessimization), and sometimes it's just not clear how to do it otherwise, but this time it's trivial:
def _get_modules_from_db(dictionary):
    leaves = []
    for k, v in dictionary.iteritems():
        if (isinstance(v, dict) and
                not sorted(v.keys()) == ['path_to_file', 'sha512sum']):
            leaves.extend(_get_modules_from_db(v))
        else:
            leaves.append(v)
    return leaves
For additional improvements: I'd probably turn this into a generator (at least in 3.3+, with yield from; in 2.7 I might think twice). And, while we're at it, I'd compare the key-view (in 3.x) or set(v) (in 2.x) to a set rather than doing an unnecessary sort (and no reason for .keys() with either set or sorted), and use != rather than not and ==. And, unless there's a good reason to only accept actually dict and dict subclasses, I'd either duck-type it or use collections.[abc.]Mapping. So:
def _get_modules_from_db(dictionary):
    for k, v in dictionary.items():
        if isinstance(v, Mapping) and v.keys() != {'path_to_file', 'sha512sum'}:
            yield from _get_modules_from_db(v)
        else:
            yield v
Or, alternatively, pull the base cases out, so you can call it directly on a string:
def _get_modules_from_db(d):
    if isinstance(d, Mapping) and d.keys() != {'path_to_file', 'sha512sum'}:
        for v in d.values():
            yield from _get_modules_from_db(v)
    else:
        yield d
I think that's a little more readable than what you had, and it's 6 lines instead of 11 (although the 2.x version would be 7 lines). But I don't see anything actually wrong with your version.
If you're not sure how to turn that 3.3+ code into 2.7/3.2 code:
Rewrite yield from eggs as for egg in eggs: yield egg.
Mapping is in collections, not collections.abc.
Use set(v) instead of v.keys().
Possibly use itervalues instead of values (2.x only).
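Put together, a 2.7 version of that last function would look roughly like this (a sketch, untested):
from collections import Mapping

def _get_modules_from_db(d):
    if isinstance(d, Mapping) and set(d) != {'path_to_file', 'sha512sum'}:
        for v in d.itervalues():
            for leaf in _get_modules_from_db(v):
                yield leaf
    else:
        yield d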
I don't see a problem with this approach. You want a recursive function that manipulates some global state - this is a pretty reasonable way to do it (inner functions are not terribly uncommon in Python).
That said, you could add a defaulted argument if you want to avoid the nested function:
def _get_modules_from_db(db, leaves=None):
    if leaves is None:
        leaves = []
    if not isinstance(db, dict):
        return leaves
    # Use 'in' check to avoid sorting keys and doing a list compare
    if 'path_to_file' in db and 'checksum' in db:
        leaves.append(db)
    else:
        for v in db.values():
            _get_modules_from_db(v, leaves)
    return leaves
In my personal opinion, nested functions are nice, but here's a more concise version nonetheless
from operator import add

def _get_modules_from_db(db):
    if 'path_to_file' in db and 'sha512sum' in db:
        return [db]
    return reduce(add, (_get_modules_from_db(db[m]) for m in db))
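On Python 3, reduce has to be imported from functools; a rough port of the same idea (a sketch, assuming the same key names) is:
from functools import reduce
from operator import add

def _get_modules_from_db(db):
    if 'path_to_file' in db and 'sha512sum' in db:
        return [db]
    return reduce(add, (_get_modules_from_db(db[m]) for m in db), [])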

Python large list manipulation

I have a Python list like the one below:
DEMO_LIST = [
[{'unweighted_criket_data': [-46.14554728131345, 2.997789122813151, -23.66171024766996]},
{'weighted_criket_index_input': [-6.275794430258629, 0.4076993207025885, -3.2179925936831144]},
{'manual_weighted_cricket_data': [-11.536386820328362, 0.7494472807032877, -5.91542756191749]},
{'average_weighted_cricket_data': [-8.906090625293496, 0.5785733007029381, -4.566710077800302]}],
[{'unweighted_football_data': [-7.586729834820534, 3.9521665714843675, 5.702038461085529]},
{'weighted_football_data': [-3.512655913521907, 1.8298531225972623, 2.6400438074826]},
{'manual_weighted_football_data': [-1.8966824587051334, 0.9880416428710919, 1.4255096152713822]},
{'average_weighted_football_data': [-2.70466918611352, 1.4089473827341772, 2.0327767113769912]}],
[{'unweighted_rugby_data': [199.99999999999915, 53.91020408163265, -199.9999999999995]},
{'weighted_rugby_data': [3.3999999999999857, 0.9164734693877551, -3.3999999999999915]},
{'manual_rugby_data': [49.99999999999979, 13.477551020408162, -49.99999999999987]},
{'average_weighted_rugby_data': [26.699999999999886, 7.197012244897959, -26.699999999999932]}],
[{'unweighted_swimming_data': [2.1979283454982053, 14.079951031527246, -2.7585499298828777]},
{'weighted_swimming_data': [0.8462024130168091, 5.42078114713799, -1.062041723004908]},
{'manual_weighted_swimming_data': [0.5494820863745513, 3.5199877578818115, -0.6896374824707194]},
{'average_weighted_swimming_data': [0.6978422496956802, 4.470384452509901, -0.8758396027378137]}]]
I want to manipulate the list items and do some basic math operations, like getting each data-type list (for example, taking all the first elements of the unweighted data and summing them, etc.).
Currently I am doing it like this.
The current solution is a very basic one. I want to do it in such a way that if the list grows, it can automatically calculate the results. Right now there are four lists, but there could be 5 or 8; the final result should be the summation of all the first elements of the unweighted values. Example:
now I am doing result_u1/4, result_u2/4, result_u3/4
I want it like result_u0/4, result_u1/4, ..., result_un/4  # n is the number of lists inside DEMO_LIST
Any idea how I can do that?
(sorry for the beginner question)
You can implement a list class of your own that adds each new item's value to a running summary in append, and subtracts it in remove:
class MyList(list):
    def __init__(self):
        self.summary = 0
        list.__init__(self)

    def append(self, item):
        self.summary += item['sample_value']
        list.append(self, item)

    def remove(self, item):
        self.summary -= item['sample_value']
        list.remove(self, item)
And a simple usage:
my_list = MyList()
print my_list.summary # Outputs 0
my_list.append({'sample_value': 10})
print my_list.summary # Outputs 10
In Python, whenever you start counting how many there are of something inside an iterable (a string, a list, a set, a collection of any of these) in order to loop over it, it's a sign that your code can be revised.
Things that work for 3 of something can work for 300, 3,000 and 3 million of the same thing without changing your code.
In your case, your logic is - "For every X inside DEMO_LIST, do something"
This translated into Python is:
for i in DEMO_LIST:
    # do something with i
This snippet will run through a DEMO_LIST of any size, and each time i is one of whatever is inside DEMO_LIST. In your case that is one of the lists that contain your dictionaries.
Further expanding on that, you can say:
for i in DEMO_LIST:
    for k in i:
        # now you are in each list that is inside the outer DEMO_LIST
Expanding this into a practical example - summing all the unweighted_criket_data:
all_unweighted_criket_data = []
for i in DEMO_LIST:
    for k in i:
        if 'unweighted_criket_data' in k:
            for data in k['unweighted_criket_data']:
                all_unweighted_criket_data.append(data)
sum_of_data = sum(all_unweighted_criket_data)
There are various "shortcuts" to do the same, but you can appreciate those once you understand the "expanded" version of what the shortcut is trying to do.
Remember there is nothing wrong with writing it out the 'long way' especially when you are not sure of the best way to do something. Once you are comfortable with the logic, then you can use shortcuts like list comprehensions.
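As one example of such a shortcut (my sketch, keeping the 'unweighted_criket_data' spelling from DEMO_LIST), the expanded loop above collapses into a single generator expression:
sum_of_data = sum(
    data
    for row in DEMO_LIST
    for k in row
    if 'unweighted_criket_data' in k
    for data in k['unweighted_criket_data']
)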
Start by replacing this:
for i in range(0, len(data_list)-1):
    result_u1 += data_list[i][0].values()[0][0]
    result_u2 += data_list[i][0].values()[0][1]
    result_u3 += data_list[i][0].values()[0][2]
print "UNWEIGHTED", result_u1/4, result_u2/4, result_u3/4
With this:
sz = len(data_list[0][0].values()[0])
result_u = [0] * sz
for i in range(0, len(data_list)-1):
    for j in range(0, sz):
        result_u[j] += data_list[i][0].values()[0][j]
print "UNWEIGHTED", [x/len(data_list) for x in result_u]
Apply similar changes elsewhere. This assumes that your data really is "rectangular", that is to say every corresponding inner list has the same number of values.
A slightly more "Pythonic"[*] version of:
for j in range(0, sz):
    result_u[j] += data_list[i][0].values()[0][j]
is:
for j, dataval in enumerate(data_list[i][0].values()[0]):
    result_u[j] += dataval
There are some problems with your code, though:
values()[0] might give you any of the values in the dictionary, since dictionaries are unordered. Maybe it happens to give you the unweighted data, maybe not.
I'm confused why you're looping on the range 0 to len(data_list)-1: if you want to include all the sports you need 0 to len(data_list), because the second parameter to range, the upper limit, is excluded.
You could perhaps consider reformatting your data more like this:
DEMO_LIST = {
    'cricket' : {
        'unweighted' : [1,2,3],
        'weighted' : [4,5,6],
        'manual' : [7,8,9],
        'average' : [10,11,12],
    },
    'rugby' : ...
}
Once you have the same keys in each sport's dictionary, you can replace values()[0] with ['unweighted'], so you'll always get the right dictionary entry. And once you have a whole lot of dictionaries all with the same keys, you can replace them with a class or a named tuple, to define/enforce that those are the values that must always be present:
import collections

Sport = collections.namedtuple('Sport', 'unweighted weighted manual average')

DEMO_LIST = {
    'cricket' : Sport(
        unweighted = [1,2,3],
        weighted = [4,5,6],
        manual = [7,8,9],
        average = [10,11,12],
    ),
    'rugby' : ...
}
Now you can replace ['unweighted'] with .unweighted.
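With that layout, averaging (say) the first unweighted value across all sports becomes straightforward - a quick sketch, assuming every entry is filled in like 'cricket' above:
unweighted_firsts = [sport.unweighted[0] for sport in DEMO_LIST.values()]
average = sum(unweighted_firsts) / float(len(DEMO_LIST))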
[*] The word "Pythonic" officially means something like, "done in the style of a Python programmer, taking advantage of any useful Python features to produce the best idiomatic Python code". In practice it usually means "I prefer this, and I'm a Python programmer, therefore this is the correct way to write Python". It's an argument by authority if you're Guido van Rossum, or by appeal to nebulous authority if you're not. In almost all circumstances it can be replaced with "good IMO" without changing the sense of the sentence ;-)

Skip keys without type checking in Python (pymssql)

I need to access all the non-integer keys for a dict that looks like:
result = {
    0 : "value 1",
    1 : "value 2",
    "key 1" : "value 1",
    "key 2" : "value 2",
}
I am currently doing this by:
headers = [header for header in tmp_dict.keys() if not isinstance(header, int)]
My question:
Is there a way to do this without type checking?
This tmp_dict is coming out of a query using pymssql with the as_dict=True attribute, and for some reason it returns all the column names with data as expected, but also includes the same data indexed by integers. How can I get my query result as a dictionary with only the column values and data?
Thanks for your help!
PS - Despite my issues being resolved by potentially answering 2, I'm curious how this can be done without type checking. Mainly for the people who say "never do type checking, ever."
With regard to your question about type checking, the duck-type approach would be to see whether it can be converted to or used as an int.
def can_be_int(obj):
    try:
        int(obj)
    except (TypeError, ValueError):
        return False
    return True

headers = [header for header in tmp_dict.keys() if not can_be_int(header)]
Note that floats can be converted to ints by truncating them, so this isn't necessarily exactly equivalent.
A slight variation on the above would be to use coerce(0, obj) in place of int(obj). This will allow any kind of object that can be converted to a common type with an integer. You could also do something like 0 + obj and 1 * obj which will check for something that can be used in a mathematical expression with integers.
You could also check to see whether its string representation is all digits:
headers = [header for header in tmp_dict.keys() if not str(header).isdigit()]
This is probably closer to a solution that doesn't use type-checking, although it will be slower, and it's of course entirely possible that a column name would be a string that is only digits! (Which would fail with many of these approaches, to be honest.)
Sometimes explicit type-checking really is the best choice, which is why the language has tools for letting you check types. In this situation I think you're fine, especially since the result dictionary is documented to have only integers and strings as keys. And you're doing it the right way by using isinstance() rather than explicitly checking type() == int.
Looking at the source code of pymssql (1.0.2), it is clear that there is no option for the module to not generate data indexed by integers. But note that data indexed by column name can be omitted if the column name is empty.
/* mssqldbmodule.c */
PyObject *fetch_next_row_dict(_mssql_connection *conn, int raise) {
    [...]
    for (col = 1; col <= conn->num_columns; col++) {
        [...]
        // add key by column name, do not add if name == ''
        if (strlen(PyString_AS_STRING(name)) != 0)
            if ((PyDict_SetItem(dict, name, val)) == -1)
                return NULL;

        // add key by column number
        if ((PyDict_SetItem(dict, PyInt_FromLong(col-1), val)) == -1)
            return NULL;
    }
    [...]
}
Regarding your first question, filtering the result set by type checking is surely the best way to do that. And this is exactly how pymssql returns data when as_dict is False:
if self.as_dict:
    row = iter(self._source).next()
    self._rownumber += 1
    return row
else:
    row = iter(self._source).next()
    self._rownumber += 1
    return tuple([row[r] for r in sorted(row.keys()) if type(r) == int])
The rationale behind as_dict=True is that you can access by index and by name. Normally you'd get a tuple you index into, but for compatibility reasons being able to index a dict as though it was a tuple means that code depending on column numbers can still work, without being aware that column names are available.
If you're just using result to retrieve columns (either by name or index), I don't see why you're concerned about removing them? Just carry on regardless. (Unless for some reason you plan to pickle or otherwise persist the data elsewhere...)
The best way to filter them out, though, is using isinstance - duck typing in this case is actually unpythonic and inefficient. E.g.:
names_only = dict((k, v) for k, v in result.iteritems() if not isinstance(k, int))
Instead of a try and except dance.
>>> sorted(result)[len(result)/2:]
['key 1', 'key 2']
This will remove the duplicated integer-keyed entries. I think what you're doing is fine, though.
