Applying multiple functions to pandas column - python

I am attempting to sort by the maximum value of a column such as:
dfWinners = df_new_out['Margin'].apply[max().sort_values(ascending=False)]
But I am unsure how to use the apply function for multiple methods such as max and sort.
I'm sure it's a very simple problem but I cannot for the life of me find it through searching on here.
Thank you!

It's not clear what you're trying to do by assigning to dfWinners.
If you intend dfWinners to be a list of values sorted from maximum to minimum, as you've described above, then you can use Python's built-in sorted() function (note reverse=True for descending order):
dfWinners = sorted(df_new_out['Margin'], reverse=True)
Otherwise, you can sort your dataframe in place:
df_new_out.sort_values(by = ['Margin'], ascending=False, inplace=True)
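For illustration, a minimal sketch of both approaches (the sample data is made up):

import pandas as pd

df_new_out = pd.DataFrame({'Margin': [3, 10, 7]})  # hypothetical data

# A plain Python list, sorted from maximum to minimum
dfWinners = sorted(df_new_out['Margin'], reverse=True)  # [10, 7, 3]

# Or sort the dataframe itself and read off the largest value
df_new_out.sort_values(by=['Margin'], ascending=False, inplace=True)
print(df_new_out['Margin'].max())  # 10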

Related

How to use approx_count_distinct to count distinct combinations of two columns in a Spark DataFrame?

I have a Spark DataFrame (sdf) where each row shows an IP visiting a URL. I want to count distinct IP-URL pairs in this data frame and the most straightforward solution is sdf.groupBy("ip", "url").count(). However, since the data frame has billions of rows, precise counts can take quite a while. I'm not particularly familiar with PySpark -- I tried replacing .count() with .approx_count_distinct(), which was syntactically incorrect.
I searched "how to use .approx_count_distinct() with groupBy()" and found this answer. However, the solution suggested there (something along those lines: sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count"))) doesn't seem to give me the counts that I want. The method .approx_count_distinct() can't take two columns as arguments, so I can't write sdf.agg(F.approx_count_distinct(sdf.ip, sdf.url).alias("distinct_count")), either.
My question is, is there a way to get .approx_count_distinct() to work on multiple columns and count distinct combinations of these columns? If not, is there another function that can do just that and what's an example usage of it?
Thank you so much for your help in advance!
Group with expressions and alias as needed (expr comes from pyspark.sql.functions). Let's try:
from pyspark.sql.functions import expr

df.groupBy("ip", "url").agg(
    expr("approx_count_distinct(ip)").alias("ip_count"),
    expr("approx_count_distinct(url)").alias("url_count"),
).show()
Your code sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count")) will give a value of 1 for every group, since you are counting distinct values of one of the grouping columns, url.
If you want to count distinct IP-URL pairs using the approx_count_distinct function, you can combine them into an array and then apply the function. It would be something like this:
sdf.selectExpr("approx_count_distinct(array(ip, url)) as distinct_count")

Handling appending to abstraction of dataframe

If I have a "reference" to a dataframe, there appears to be no way to append to it in pandas because neither append nor concat support the inplace=True parameter.
An (overly) simple example:
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df = chosen_df.append(chosen_row)
Now because Python does something akin to copy reference by value, chosen_df will initially be a reference to whichever candidate dataframe passed some_test.
But the update semantics of pandas mean that the referenced dataframe is not updated by the result of the append function; append returns a new dataframe and the name chosen_df is simply rebound to it. I believe that if it were possible to use inplace=True this would work, but it looks like that isn't likely to happen, given the discussion here: https://github.com/pandas-dev/pandas/issues/14796
It's worth noting that a simpler example using lists rather than dataframes does work, because the contents of lists are directly mutated by append().
So my question is --- How could an updatable abstraction over N dataframes be achieved in Python?
The idiom is commonplace, useful and trivial in languages that allow references, so I'm guessing I'm missing a Pythonic trick, or thinking about the whole problem with the wrong hat on!
Obviously the purely illustrative example can be resolved by duplicating the append in the body of an if...else and concretely referencing each underlying dataframe in turn. But this doesn't scale to more complex examples, and it's a generic solution akin to references that I'm looking for.
Any ideas?
There is a simple way to do this specifically for pandas dataframes - so I'll answer my own question.
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df.loc[max_idx+1] = chosen_row
The calculation of max_idx very much depends on the structure of chosen_df. In the simplest case, when it is a dataframe with a sequential index starting at 0, you can simply use the length of the index to calculate it.
If chosen_df has a non-sequential index, you'll need to call max() on the index rather than rely on the length of the index.
If chosen_df is a slice or a groupby result, you'll need to calculate the index from the parent dataframe to ensure it's truly the max across all rows.
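To make the index handling concrete, a minimal sketch (the data and the max_idx guard are assumptions for illustration, not code from the original answer):

import pandas as pd

candidate_a_df = pd.DataFrame({'x': [1, 2]})
candidate_b_df = pd.DataFrame({'x': [10]})
candidate_a_row = pd.Series({'x': 3})
candidate_b_row = pd.Series({'x': 20})
some_test = True

chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)

# For a plain sequential index this is just len(chosen_df); the guard covers an empty frame
max_idx = chosen_df.index.max() if len(chosen_df) else -1
chosen_df.loc[max_idx + 1] = chosen_row

print(candidate_a_df)  # the underlying dataframe was mutated in place, no rebinding involved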

What is the fastest way to compare entries from two different pandas DataFrames?

I have two lists in the form of pandas DataFrames which both contain a column of names. Now I want to compare these names and return a list of names which appear in both DataFrames. The problem is that my solution is way too slow since both lists have several thousand entries.
Now I want to know if there is anything else I can do to accelerate the solution of my problem.
I already sorted my pandas dataframe by alphabet using "df.sort_values" in order to create an alphabetical index, so that a name in the first list which starts with the letter "X" will only be compared to entries with the same first letter in the second list.
I suspect that the main reason my program is running so slow is my way of accessing the fields which I am comparing.
I use a specific comparison function to compare the names and access the dataframe elements through the df.at[i, 'column_title'] method.
Edit: Note that this specific comparison function is more complex than a simple "==", since I am doing a kind of fuzzy string comparison to make sure names with slightly different spellings still get marked as a match. I use the whoswho library, which returns a match rate between 0 and 100. A simplified example focusing on my slow solution for the pandas dataframe comparison looks as follows:
for i in range(len(list1)):
    for j in range(len(list2)):
        # who.ratio returns a match rate between two strings
        ratio = who.ratio(list1.at[i, 'name'], list2.at[j, 'name'])
        if ratio > 75:
            save(i, j)  # stores values i and j in a result list
I also thought about switching from pandas to numpy, but I read that this might slow it down even further since pandas is faster for large amounts of data.
Can anybody tell me if there is a faster way of accessing specific elements in a pandas dataframe? Or is there a faster way in general to run a custom comparison function over two pandas dataframes?
Edit2: spelling, additional information.
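For illustration only (not from the original post): one common way to sidestep the repeated df.at lookups the question suspects are slow is to extract the name columns once before the double loop, assuming the same who.ratio and save() helpers as above:

# Pull the name columns out once so the inner loop works on plain Python lists
names1 = list1['name'].tolist()
names2 = list2['name'].tolist()

for i, n1 in enumerate(names1):
    for j, n2 in enumerate(names2):
        ratio = who.ratio(n1, n2)  # match rate between two strings, 0-100
        if ratio > 75:
            save(i, j)  # stores values i and j in a result list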

Pandas: Applying custom aggregation function (not w/ groupby)

We can think of applying two types of functions to a Pandas Series: transformations and aggregations. They make this distinction in the documentation; transformations map individual values in the Series while aggregations somehow summarize the entire Series (e.g. mean).
It is clear how to apply transformations using apply, but I have not been successful in implementing a custom aggregation. Note that groupby is not involved, and aggregation does not require a groupby.
I am working with the following case: I have a Series in which each row is a list of strings. One way I could aggregate this data is to count up the number of appearances of each string, and return the 5 most common terms.
def top_five_strings(series):
    counter = {}
    for row in series:
        for s in row:
            if s in counter:
                counter[s] += 1
            else:
                counter[s] = 1
    # sort by count, descending, and keep the five most common strings
    return sorted(counter.items(), key=lambda x: x[1], reverse=True)[:5]
If I call this function as top_five_strings(series), it works fine, analogous to calling np.mean(series) on a numeric series. The difference is that I can also do series.agg(np.mean) and get the same result, whereas if I do series.agg(top_five_strings), I instead get the top five letters in each row of the Series (which makes sense if a single row is the argument of the function).
I think the critical difference is that np.mean is a NumPy ufunc, but I haven't been able to work out how the _aggregate helper function works in the Pandas source.
I'm left with 2 questions:
1) Can I implement this by making my Python function a ufunc (and if so, how)?
2) Is this a stupid thing to do? I haven't found anyone else out there trying to do something like this. It seems to me like it would be quite nice, however, to be able to implement custom aggregations as well as custom transformations within the Pandas framework (e.g. getting a Series as a result, as one might with df.describe).
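For reference, a minimal sketch of the behaviour described above, using the top_five_strings function from the question and made-up data (Series.pipe is shown only as one way to hand the whole Series to a custom function; it is not from the original post):

import pandas as pd

series = pd.Series([['a', 'b'], ['b', 'c'], ['b']])

print(top_five_strings(series))       # aggregates over the whole Series: [('b', 3), ('a', 1), ('c', 1)]
print(series.pipe(top_five_strings))  # same result; pipe passes the Series itself to the function
# series.agg(top_five_strings) instead calls the function once per row, as described above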

pandas.dataframe.pop for all names in an array

I'm trying to remove a list of columns from a pandas.DataFrame. I've created a list through
diff = df1.difference(df2)
I would like to do something like:
for name in diff:
    dataframe.pop(name)
Is there a way to apply the pop function in a vectorized way with all names in the diff index array?
Thanks all for the help!
Regards,
Jan
As MaxU said, the cleanest way to do it is
dataframe.drop(diff, axis=1)
The .pop() method returns the series you are popping, which is not efficient if you only want to delete the column.
Also, you could use the del statement, which maps to the Python magic method df.__delitem__('column'). I would not recommend that.
you can read this great SO answer to learn more about those
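A small sketch of the difference (column names made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
diff = ['b', 'c']

trimmed = df.drop(diff, axis=1)  # drops both columns in one call, returns a new dataframe
# versus removing them one at a time, each pop returning the removed column as a Series
for name in diff:
    df.pop(name)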
