We can think of applying two types of functions to a Pandas Series: transformations and aggregations. The Pandas documentation makes this distinction: transformations map individual values in the Series, while aggregations somehow summarize the entire Series (e.g. mean).
It is clear how to apply transformations using apply, but I have not been successful in implementing a custom aggregation. Note that groupby is not involved here; aggregation does not require a groupby.
I am working with the following case: I have a Series in which each row is a list of strings. One way I could aggregate this data is to count up the number of appearances of each string, and return the 5 most common terms.
def top_five_strings(series):
    counter = {}
    for row in series:
        for s in row:
            if s in counter:
                counter[s] += 1
            else:
                counter[s] = 1
    return sorted(counter.items(), key=lambda x: x[1], reverse=True)[:5]
If I call this function as top_five_strings(series), it works fine, just as if I had called np.mean(series) on a numeric series. The difference, however, is that I can also do series.agg(np.mean) and get the same result. If I do series.agg(top_five_strings), I instead get the top five letters in each row of the Series (which makes sense if you make a single row the argument of the function).
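To make the difference concrete, here is a minimal reproduction (the example data is made up for illustration):
import pandas as pd

series = pd.Series([['a', 'b', 'a'], ['b', 'c'], ['a', 'c', 'c']])

top_five_strings(series)      # aggregates over the whole Series, as intended
series.agg(top_five_strings)  # applies the function to each row's list instead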
I think the critical difference is that np.mean is a NumPy ufunc, but I haven't been able to work out how the _aggregate helper function works in the Pandas source.
I'm left with 2 questions:
1) Can I implement this by making my Python function a ufunc (and if so, how)?
2) Is this a stupid thing to do? I haven't found anyone else out there trying to do something like this. It seems to me, however, that it would be quite nice to be able to implement custom aggregations as well as custom transformations within the Pandas framework (e.g. getting a Series as a result, as one might with df.describe).
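For what it's worth, a whole-Series function can always be applied with Series.pipe, which simply passes the Series to the function. This sidesteps agg rather than plugging into it, but it keeps calls method-chainable:
series.pipe(top_five_strings)  # equivalent to top_five_strings(series)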
If I have a "reference" to a dataframe, there appears to be no way to append to it in pandas because neither append nor concat support the inplace=True parameter.
An (overly) simple example:
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df = chosen_df.append(chosen_row)
Now, because Python effectively copies references by value, chosen_df will initially be a reference to whichever candidate dataframe passed some_test.
But the update semantics of pandas mean that the referenced dataframe is not updated by append; the result is a new object bound to the name instead. I believe this would work if inplace=True were possible, but it looks like that isn't likely to happen, given the discussion here: https://github.com/pandas-dev/pandas/issues/14796
It's worth noting that a simpler example using lists rather than dataframes does work, because the contents of lists are directly mutated by append().
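A quick sketch of that difference (using the DataFrame.append the example relies on, which has since been deprecated in favour of pd.concat):
import pandas as pd

lst = [1, 2]
ref = lst
ref.append(3)   # list.append mutates the shared object
print(lst)      # [1, 2, 3] -- the change is visible through the original name

df = pd.DataFrame({'x': [1, 2]})
ref_df = df
ref_df = ref_df.append({'x': 3}, ignore_index=True)  # returns a NEW dataframe
print(len(df))  # still 2 -- the original is untouched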
So my question is: how could an updatable abstraction over N dataframes be achieved in Python?
The idiom is commonplace, useful and trivial in languages that allow references, so I'm guessing I'm missing a Pythonic trick, or thinking about the whole problem with the wrong hat on!
Obviously the purely illustrative example can be resolved by duplicating the append in the body of an if...else and concretely referencing each underlying dataframe in turn. But this doesn't scale to more complex examples, and it's a generic, reference-like solution I'm looking for.
Any ideas?
There is a simple way to do this specifically for pandas dataframes - so I'll answer my own question.
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
# .loc assignment mutates the underlying dataframe, so the reference stays valid
chosen_df.loc[max_idx + 1] = chosen_row
The calculation of max_idx very much depends on the structure of chosen_df. In the simplest case, when it is a dataframe with a sequential index starting at 0, you can simply use the length of the index to calculate it.
If chosen_df has a non-sequential index, you'll need to call max() on the index rather than rely on its length.
If chosen_df is a slice or a groupby result, you'll need to calculate the index off the parent dataframe to ensure it's truly the max across all rows.
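A minimal sketch of the sequential case (assuming a default RangeIndex starting at 0; the column names here are made up):
import pandas as pd

chosen_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
max_idx = len(chosen_df.index) - 1   # or chosen_df.index.max() for a non-sequential index
chosen_df.loc[max_idx + 1] = [5, 6]  # mutates chosen_df in place; no rebinding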
I have two lists in the form of pandas DataFrames, which both contain a column of names. I want to compare these names and return a list of names that appear in both DataFrames. The problem is that my solution is way too slow, since both lists have several thousand entries.
Now I want to know if there is anything else I can do to accelerate the solution of my problem.
I already sorted my pandas dataframes by alphabet using df.sort_values in order to create an alphabetical index, so that a name in the first list which starts with the letter "X" will only be compared to entries starting with the same letter in the second list.
I suspect that the main reason my program runs so slowly is the way I access the fields I am comparing.
I use a specific comparison function to compare the names, and I access the dataframe elements through the df.at[i, 'column_title'] method.
Edit: Note that this comparison function is more complex than a simple "==", since I am doing a kind of fuzzy string comparison to make sure names with slightly different spellings still get marked as a match. I use the whoswho library, which returns a match rate between 0 and 100. A simplified example focusing on my slow solution for the pandas dataframe comparison looks as follows:
for i in range(len(list1)):
    for j in range(len(list2)):
        # who.ratio returns a match rate between two strings
        ratio = who.ratio(list1.at[i, 'name'], list2.at[j, 'name'])
        if ratio > 75:
            save(i, j)  # stores values i and j in a result list
I also thought about switching from pandas to NumPy, but I read that this might slow things down even further, since pandas is supposed to be faster for large amounts of data.
Can anybody tell me if there is a faster way of accessing specific elements in a pandas dataframe? Or is there a faster way in general to run a custom comparison function over two pandas dataframes?
Edit2: spelling, additional information.
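For reference, one common speed-up when repeated df.at lookups are the bottleneck is to materialize the name columns as plain Python lists once, before the loop. A sketch, reusing who.ratio and save from above and assuming the default 0..n-1 integer index:
names1 = list1['name'].tolist()
names2 = list2['name'].tolist()

for i, n1 in enumerate(names1):
    for j, n2 in enumerate(names2):
        if who.ratio(n1, n2) > 75:
            save(i, j)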
I am trying to create an additional custom column from an existing column of a dataframe, but the function I am using throws a TypeError during execution. I am very new to Python; can someone please help?
The dataframe used is as below:
match_all = match[['country_id', 'league_id', 'season', 'stage', 'date',
                   'home_team_api_id', 'away_team_api_id', 'home_team_goal', 'away_team_goal']]
And the function I am using is as below:
def goal_diff(matches):
    for i in matches:
        i['home_team_goal'] - i['away_team_goal']

goal_diff(match_all)
The reason your function did not work is that matches inside your function is a dataframe. When you do:
for i in matches:
    print(i)
you would see that the column names of your current df are returned; this is how a for loop iterates over a dataframe. So in your function, when you use i in your subtraction call:
i['home_team_goal'] - i['away_team_goal']
it is like doing
'country_id'['home_team_goal'] - 'country_id'['away_team_goal']
'league_id'['home_team_goal'] - 'league_id'['away_team_goal']
...
Indexing a string like this doesn't make any sense, which is why you get the TypeError. So when you want to refer to specific dataframe columns, use the name of the df together with the column:
matches['home_team_goal'] - matches['away_team_goal']
Remember, matches is your function's input df. Lastly, in your for loop you are neither returning nor storing any value; you are just calling a subtraction on two columns. In your text editor or IDE you might see something printed to the screen, but later you will probably want to use these values in the next step of your code, so inside a function we use return to have the function actually give us values when we call it.
In your case, if I wrote my function below without the return call and then called it on my dataframe, the operation would complete, but no value would be "returned" to me; it would just be produced and discarded.
Pre-edit answer.
You do not need to create a loop for this; pandas will do it for you:
def goal_diff(matches):
    return matches['home_team_goal'] - matches['away_team_goal']

match_all['home_away_goal_diff'] = goal_diff(match_all)
This function takes an input df and uses the columns 'home_team_goal' and 'away_team_goal' to calculate the difference. You also don't need a function for this. If you wanted to create a new column in your existing match_all df you could do this:
match_all['home_away_goal_diff'] = match_all['home_team_goal'] - match_all['away_team_goal']
I am attempting to sort by the maximum value of a column such as:
dfWinners = df_new_out['Margin'].apply[max().sort_values(ascending=False)]
But am unsure of how to use the apply function for multiple methods such as max and sort.
I'm sure it's a very simple problem but I cannot for the life of me find it through searching on here.
Thank you!
It's not clear what you're trying to do by assigning to dfWinners.
If you intend dfWinners to be a list of values sorted from maximum to minimum, as you've described above, then you can use Python's native sorted() (note reverse=True for descending order):
dfWinners = sorted(df_new_out['Margin'], reverse=True)
Otherwise, you can sort your dataframe in place:
df_new_out.sort_values(by = ['Margin'], ascending=False, inplace=True)
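If you'd rather keep the result as a pandas Series than a plain Python list, sorting the column directly with Series.sort_values also works:
dfWinners = df_new_out['Margin'].sort_values(ascending=False)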
When using the xarray package for Python 2.7, is it possible to group over multiple parameters like you can in pandas? In essence, an operation like:
data.groupby(['time.year','time.month']).mean()
if you wanted to get mean values for each year and month of a dataset.
Unfortunately, xarray does not support grouping with multiple arguments yet. This is something we would like to support and it would be relatively straightforward, but nobody has had the time to implement it yet (contributions would be welcome!).
An easy workaround is to construct a MultiIndex and group by that "new" coordinate:
da_multiindex = da.stack(my_multiindex=['time.year', 'time.month'])
da_mean = da_multiindex.groupby('my_multiindex').mean()
da_mean.unstack()  # go back to the original index