Can anyone tell me the difference between pyspark.sql.functions.approxCountDistinct (I know it is deprecated) and pyspark.sql.functions.approx_count_distinct? I have used both versions in a project and have seen different values.
As you mentioned, pyspark.sql.functions.approxCountDistinct is deprecated. The reason is most likely just a style concern: they probably wanted everything to be in snake case. As you can see in the source code, pyspark.sql.functions.approxCountDistinct simply calls pyspark.sql.functions.approx_count_distinct, doing nothing more except giving you a warning. So regardless of which one you use, the very same code runs in the end.
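As a quick illustration, here is a minimal sketch (assuming a local SparkSession, a toy DataFrame, and a PySpark version where the deprecated alias still exists) showing that both names accept the same arguments, including the optional rsd (maximum relative standard deviation) parameter:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (2,), (3,)], ["value"])

# Deprecated alias: emits a deprecation warning and forwards the call.
old = df.select(F.approxCountDistinct("value").alias("cnt"))

# Current name: rsd is the maximum allowed estimation error (default 0.05).
new = df.select(F.approx_count_distinct("value", rsd=0.05).alias("cnt"))

old.show()
new.show()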
Also, still according to the source code, approx_count_distinct is based on the HyperLogLog++ algorithm. I am not very familiar with the algorithm, but it works by repeatedly merging partial sketches. Therefore, the result will most likely depend on the order in which the partial results from the executors are merged. Since that order is not deterministic in Spark, this could explain why you see different results.
I am trying to count references to an object in Python inside RStudio. I use the following expression:
ctypes.c_int.from_address(id(an_object)).value
This works perfectly in PyCharm and Jupyter, as shown below:
and the result in RStudio is:
The question is: why is the result not correct in RStudio, and how can I fix it?
I also tried to use the
sys.getrefcount
function, but it does not work in RStudio either!
I also did it without using the id function, as shown below:
But the result in RStudio is still not correct! It may sometimes happen in PyCharm too (I have not seen it, and there is perhaps no guarantee), but in RStudio something is completely wrong!
Why is this important, and why do I care about it?
Consider the following example:
Sometimes it is important to know about "a" before changing "b".
The big problem in RStudio is that the result increases randomly! I did not see that happen in PyCharm or other Python tools.
I am not an expert on this, so please correct me if I am wrong.
The problem seems to be simply that RStudio keeps a lot of extra references to the objects you create. With the same Python version, you would otherwise see no divergence there.
That said, you are definitely taking the wrong approach here: either you deliberately alias some data structure expecting it to be mutated "under your feet", or you should refrain from making any changes to data that "may" be in use in other parts of your code (or notebook). That is the reasoning behind Pandas, for example, changing its default dataframe operations from "inplace" to returning a new copy, despite the memory cost of doing so.
Other than that, as stated in the comments, the reason this is so hidden in Python (to the point that you have to use ctypes; you could actually also use the higher-level gc module) is that it should not matter.
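For example, here is a minimal sketch (plain CPython, run as a script outside any IDE) of inspecting reference counts with sys.getrefcount, the ctypes trick, and gc.get_referrers; the exact numbers are environment-dependent, which is precisely the point:

import ctypes
import gc
import sys

a = [1, 2, 3]
b = a  # second reference to the same list object

# sys.getrefcount reports one extra reference: its own argument.
print(sys.getrefcount(a))  # typically 3: a, b, and the temporary argument

# Reading ob_refcnt directly avoids the extra argument; the field is a
# Py_ssize_t, so c_ssize_t is a safer choice than c_int.
print(ctypes.c_ssize_t.from_address(id(a)).value)  # typically 2: a and b

# gc.get_referrers shows which objects actually hold the references, which is
# more informative than a raw count inside a tool that caches extra references.
print(len(gc.get_referrers(a)))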
There are other things you could do, such as creating objects that use locks so that they do not allow changes while they are in use in some other critical place; that would be higher level and reproducible.
What is the difference between random.normalvariate() and random.gauss()?
They take the same parameters and return the same value, performing essentially the same function.
I understand from a previous answer that random.gauss() is not thread-safe, but what does this mean in this context? Why should a programmer care about this? Alternatively posed: why were both a thread-safe and a non-thread-safe version included in Python's random module?
This is an interesting question. In general, the best way to see the difference between two Python implementations is to inspect the code yourself:
import inspect, random
str_gauss = inspect.getsource(random.gauss)
str_nv = inspect.getsource(random.normalvariate)
and then print each of the strings to see how the sources differ. A quick look at the code shows not only that they behave differently with respect to threading, but also that the algorithms are not the same; for example, normalvariate uses something called the Kinderman and Monahan method, as per the following comments in str_nv:
# Uses Kinderman and Monahan method. Reference: Kinderman,
# A.J. and Monahan, J.F., "Computer generation of random
# variables using the ratio of uniform deviates", ACM Trans
# Math Software, 3, (1977), pp257-260.
Thread-safe pieces of code must account for possible race conditions during execution. This introduces overhead as a result of synchronization schemes like mutexes, semaphores, etc.
However, if your code does not need to be re-entrant, no race conditions normally arise, which essentially means you can use code that executes a bit faster. I guess this is why random.gauss() was introduced, since the Python documentation says it is faster than the thread-safe version.
I'm not entirely sure about this, but the Python documentation says that random.gauss is slightly faster, so if you are OK with a non-thread-safe function you can go a little faster.
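As a rough illustration of that speed claim (a sketch; absolute timings vary by machine and Python version), one could compare the two with timeit:

import random
import timeit

# gauss() caches the second value of each generated pair (shared state, hence
# not thread-safe); normalvariate() recomputes from scratch on every call.
t_gauss = timeit.timeit(lambda: random.gauss(0, 1), number=1_000_000)
t_normal = timeit.timeit(lambda: random.normalvariate(0, 1), number=1_000_000)

print("gauss:         %.2f s" % t_gauss)
print("normalvariate: %.2f s" % t_normal)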
In a multi-threaded program, calling random.gauss twice in quick succession can cause the internal code of random.gauss to run twice, potentially before the first call has had a chance to return. The function caches internal state between calls, and that state may not be updated consistently before the second call, which can produce incorrect output (for example, two threads may receive the same value).
Successive calls to random.normalvariate do not share cached state in this way, so they behave correctly even when interleaved across threads.
The advantage of random.gauss is therefore that it is faster, but it may produce erroneous output when called from multiple threads.
I often need to create two versions of an IPython notebook: one contains tasks to be carried out (usually including some Python code and output), the other contains the same text plus the solutions. Let's call them the assignment and the solution.
It is easy to generate the solution document first, then strip the answers to generate the assignment (or vice versa). But if I subsequently need to make changes (and I always do), I need to repeat the stripping process. Is there a reasonable workflow that will allow changes in the assignment to be propagated to the solutions document?
Partial self-answer: I have experimented with leveraging Mercurial's hg copy, which lets two files with different names share history. But I can only get this to work if the assignment and the solution are in different directories, in two linked hg repositories. I would much prefer a simpler setup. I've also noticed that diff gets very confused when one JSON file has more sections than another, making a VCS-based solution even less attractive. (To be clear: ordinary use of a VCS with notebooks is fine; it's the parallel versions that stumble.)
This question covers similar ground, but does not solve my problem. In fact an answer to my question would solve the OP's second remaining problem, "pulling changes" (see the Update section).
It sounds like you are maintaining an assignment and an answer key of some kind and want to be able to distribute the assignments (without solutions) to students, and still have the answers for yourself or a TA.
For something like this, I would create two branches, "unsolved" and "solved". First write the questions on the "unsolved" branch. Then create the "solved" branch from there and add the solutions. If you ever need to update a question, update back to the "unsolved" branch, make the change, merge it into "solved", and fix the solution.
You could try going the other way, but my hunch is that going "backwards" from solved to unsolved might be strange to maintain.
After some experimentation I concluded that it is best to tackle this by processing the notebook's JSON code. Version control systems are not the right approach, for the following reasons:
JSON doesn't diff very well when adding or deleting cells. A minimal change leads to mis-matched braces and a very messy diff.
In my use case, the superset version of the file (containing both the assignments and their solutions) must be the source document. This is because the assignment includes example code and output that depends on earlier parts, which are to be written by the students. This model does not play well with version control, as pointed out by @ChrisPhillips in his answer.
I ended up filtering the JSON structure for the notebook and stripping out the solution cells; they may be recognized via special metadata (which can be set interactively using the metadata button in the interface), or by pattern-matching on the cell contents. The following snippet shows how to filter out cells whose first line starts with # SOLUTION:
import json
import re

def stripcell(cell, pattern):
    """Return True if the first line of the cell's content matches `pattern`."""
    if cell["cell_type"] == "code":
        content = cell["input"]    # code cells store their text under "input"
    else:
        content = cell["source"]   # markdown/raw cells use "source"
    return len(content) > 0 and re.search(pattern, content[0])

pattern = r"^# SOLUTION:"

struct = json.load(open("input.ipynb"))
cells = struct["worksheets"][0]["cells"]
struct["worksheets"][0]["cells"] = [c for c in cells if not stripcell(c, pattern)]
json.dump(struct, open("output.ipynb", "w"), indent=1)  # "w": json.dump writes text
I used the generic json library rather than the notebook API. If there's a better way to go about it, please let me know.
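For what it's worth, a possible alternative would be the nbformat package (an untested sketch; it assumes notebook format version 4, where a cell's source is a single string rather than a list of lines):

import re

import nbformat

PATTERN = re.compile(r"^# SOLUTION:")

def is_solution(cell):
    """Return True if the cell's first line marks it as a solution cell."""
    first_line = cell.source.splitlines()[0] if cell.source else ""
    return bool(PATTERN.search(first_line))

nb = nbformat.read("input.ipynb", as_version=4)
nb.cells = [c for c in nb.cells if not is_solution(c)]
nbformat.write(nb, "output.ipynb")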
I have an app (Google App Engine + High Replication Datastore) which was not using eventual consistency (high replication) until now, and all my tests worked perfectly.
Now, for local testing with high replication, as soon as I moved to eventual consistency, they began to fail. How do I prevent that? Or how do I test that part?
I need it for cross-entity transactions.
I am using something similar to https://developers.google.com/appengine/docs/python/tools/localunittesting#Writing_HRD_Datastore_Tests
Edit:
I need to test the code correctly. The problem I have is with the testing part. How does anyone test eventual consistency?
Edit 1:
I have temporarily solved the problem by using probability=100% in the example linked above, but ideas are welcome.
Fix the failures.
Since you have posted no code and are being very vague, it's hard to answer your question. But essentially, either your app code or your tests are not taking eventual consistency into account (i.e., a query may not return a value that was just written to the database). When you turn on eventual consistency in the datastore, the query results you get will be different.
You either need to update your code to handle the eventual consistency situations with transactions, or update your tests to expect eventual consistency results.
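For example, one common way to get strongly consistent reads on App Engine is to keep related entities in the same entity group and use ancestor queries; a sketch with hypothetical Message and Board names:

from google.appengine.ext import ndb

class Message(ndb.Model):
    text = ndb.StringProperty()

BOARD_KEY = ndb.Key("Board", "general")  # hypothetical parent entity

def post_and_list(text):
    # Writes and ancestor queries within one entity group are strongly
    # consistent, so the new message is guaranteed to appear in the result.
    Message(parent=BOARD_KEY, text=text).put()
    return Message.query(ancestor=BOARD_KEY).fetch()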
Edit
This question is still too general. It depends on whether you're doing, say, functional or system testing. Are you looking for particular results, or just an HTTP 200 status?
In general, like all testing, you need to identify what constitutes success and what constitutes a failure case. In a given situation, is it acceptable for old data to appear? In that case, the test should succeed with either the old or new values.
I'd recommend starting by considering whether you want to run deterministic or non-deterministic tests. For deterministic tests, you'd essentially want to run the same tests with probability=0 and probability=100, and ensure you get the correct values for both.
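For the deterministic variant, the setup from the linked documentation looks roughly like this (a sketch; note that the local stub's consistency policy takes a probability between 0 and 1, so 0 and 1 correspond to the probability=0 and probability=100 cases above):

import unittest

from google.appengine.datastore import datastore_stub_util
from google.appengine.ext import testbed

class EventualConsistencyTest(unittest.TestCase):
    def setUp(self):
        self.testbed = testbed.Testbed()
        self.testbed.activate()
        # probability=0: newly written entities are never visible to global
        # queries (worst case); probability=1: writes apply immediately.
        policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(probability=0)
        self.testbed.init_datastore_v3_stub(consistency_policy=policy)
        self.testbed.init_memcache_stub()

    def tearDown(self):
        self.testbed.deactivate()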
I haven't figured out how to write non-deterministic tests in a completely useful manner, other than as a stress test. You can verify that certain required values are met and that other, eventually-consistent values fall within a valid range. This is a lot of work, because most likely you have a range of values that may depend on another range of values, and since your final output may consist of both, you'll have to validate that the combinations are correct; essentially you end up reproducing some of your application logic if you really want to verify everything is correct.
The situation you are facing is one of the drawbacks (or call it a feature) of the High Replication Datastore. Usually these situations are tackled via transparent caching using memcache. If you have prior experience with a database master/slave architecture, replica lag is tackled in a similar manner.
I maintain a Python program that provides advice on certain topics. It does this by applying a complicated algorithm to the input data.
The program code is regularly changed, both to resolve newly found bugs, and to modify the underlying algorithm.
I want to use regression tests. Trouble is, there's no way to tell what the "correct" output is for a certain input - other than by running the program (and even then, only if it has no bugs).
I describe below my current testing process. My question is whether there are tools to help automate this process (and of course, if there is any other feedback on what I'm doing).
The first time the program seemed to run correctly for all my input cases, I saved their outputs in a folder I designated for "validated" outputs. "Validated" means that the output is, to the best of my knowledge, correct for a given version of my program.
If I find a bug, I make whatever changes I think would fix it. I then rerun the program on all the input sets, and manually compare the outputs. Whenever the output changes, I do my best to informally review those changes and figure out whether:
1. the changes are exclusively due to the bug fix, or
2. the changes are due, at least in part, to a new bug I introduced.
In case 1, I increment the internal version counter. I mark the output file with a suffix equal to the version counter and move it to the "validated" folder. I then commit the changes to the Mercurial repository.
If in the future, when this version is no longer current, I decide to branch off it, I'll need these validated outputs as the "correct" ones for this particular version.
In case 2, I of course try to find the newly introduced bug, and fix it. This process continues until I believe the only changes versus the previous validated version are due to the intended bug fixes.
When I modify the code to change the algorithm, I follow a similar process.
Here's the approach I'll probably use.
1. Have Mercurial manage the code, the input files, and the regression test outputs.
2. Start from a certain parent revision.
3. Make and document (preferably as few as possible) modifications.
4. Run regression tests.
5. Review the differences with the parent revision's regression test output (see the sketch after this list).
6. If these differences do not match the expectations, try to see whether a new bug was introduced or whether the expectations were incorrect. Either fix the new bug and go to 3, or update the expectations and go to 4.
7. Copy the output of the regression tests to the folder designated for validated outputs.
8. Commit the changes to Mercurial (including the code, the input files, and the output files).
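A minimal sketch of step 5, the comparison against the validated outputs (illustrative only; it assumes the fresh outputs sit in an output/ folder and the validated ones in validated/, both as plain-text files):

import difflib
import filecmp
from pathlib import Path

OUTPUT_DIR = Path("output")        # fresh regression-test outputs (hypothetical layout)
VALIDATED_DIR = Path("validated")  # outputs validated for the parent revision

for validated in sorted(VALIDATED_DIR.glob("*.txt")):
    current = OUTPUT_DIR / validated.name
    if not current.exists():
        print("MISSING: %s" % current)
        continue
    if filecmp.cmp(str(validated), str(current), shallow=False):
        continue  # identical to the validated output, nothing to review
    # Print a unified diff so the change can be reviewed manually.
    diff = difflib.unified_diff(
        validated.read_text().splitlines(keepends=True),
        current.read_text().splitlines(keepends=True),
        fromfile=str(validated),
        tofile=str(current),
    )
    print("".join(diff))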