Removing extraneous index for groupby-apply functions that generate dataframes - python

This is a problem I've encountered in various contexts, and I'm curious if I'm doing something wrong, or if my whole approach is off. The particular data/functions are not important here, but I'll include a concrete example in any case.
It's not uncommon to want a groupby/apply that does various operations on each group, and returns a new dataframe. An example might be something like this:
def patch_stats(df):
    first = df.iloc[0]
    diversity = (len(df['artist_id'].unique()) / float(len(df))) * df['dist'].mean()
    start = first['ts']
    return pd.DataFrame({'diversity': [diversity], 'start': [start]})
So, this is a grouping function that generates a new DataFrame with two columns, each derived from a different operation on the input data. Again, the specifics aren't too important here, but this is the issue:
When I look at the output, I get something like this:
result = df.groupby('patch_idx').apply(patch_stats)
print result
diversity start
patch_idx
0 0 0.876161 2007-02-24 22:54:28
1 0 0.588997 2007-02-25 01:55:39
2 0 0.655306 2007-02-25 04:27:05
3 0 0.986047 2007-02-25 05:37:58
4 0 0.997020 2007-02-25 06:27:08
5 0 0.639499 2007-02-25 17:40:56
6 0 0.687874 2007-02-26 05:24:11
7 0 0.003714 2007-02-26 07:07:20
8 0 0.065533 2007-02-26 09:01:11
9 0 0.000000 2007-02-26 19:23:52
10 0 0.068846 2007-02-26 20:43:03
...
It's all good, except I have an extraneous, unnamed index level that I don't want:
print result.index.names
FrozenList([u'patch_idx', None])
Now, this isn't a huge deal; I can always get rid of the extraneous index level with something like:
result = result.reset_index(level=1,drop=True)
But since this comes up any time I have a grouping function that returns a DataFrame, I'm wondering whether there's a better way to approach this. Is it bad form for a grouping function to return a DataFrame? If so, what's the right method to get the same kind of result? (Again, this is a general question fitting problems of this type.)

In your grouping function, return a Series instead of a DataFrame. Specifically, replace the last line of patch_stats with:
return pd.Series({'diversity':diversity, 'start':start})
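A minimal, self-contained sketch of this approach, with made-up data and a simplified statistic (not the asker's real columns):

```python
import pandas as pd

df = pd.DataFrame({
    "patch_idx": [0, 0, 1, 1, 1],
    "dist": [1.0, 3.0, 2.0, 4.0, 6.0],
})

def patch_stats(g):
    # Returning a Series yields one row per group, so the result
    # keeps a single, clean 'patch_idx' index level (no extra None level).
    return pd.Series({"mean_dist": g["dist"].mean(), "n": len(g)})

result = df.groupby("patch_idx").apply(patch_stats)
print(list(result.index.names))  # ['patch_idx']
```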

I've encountered this same issue.
Solution
result = df.groupby('patch_idx', group_keys=False).apply(patch_stats)
print result

Related

How to optimally update cells based on previous cell value / How to elegantly spread values of cell to other cells?

I have a "large" DataFrame table with the index being country codes (alpha-3) and the columns being years (1900 to 2000), imported via pd.read_csv(...) (as I understand it, the column labels are actually strings, so I need to pass them as '1945', for example).
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iteration (my current implementation is something like the following; as you can see, using two loops is not optimal, and I guess I could get rid of one by using apply(row)):
def spread_values(df):
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use the apply() function, vectorisation, or a list comprehension, because my version is not optimal.
The apply function, however, regardless of the axis, does not let me dynamically get the index/column (which I need to conditionally update the cell). I think the core reason I can't make the vectorised or list-comprehension options work is that I don't have a small, fixed set of column names but rather a wide range (all the examples I've seen use a handful of named columns).
What would be the more optimal / more elegant solution here?
OR are DataFrames not suited for my data at all? what should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This will fill all zeros in your dataframe (except for zeros occurring at the start of a column) with the previous non-zero value, per column.
If you want to do it row-wise, the .replace() function unfortunately does not accept an axis argument. But you can transpose your dataframe, replace the zeros, and transpose it back: df.T.replace(0, method='ffill').T
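As a note, replace(method='ffill') has been deprecated in recent pandas versions, so a version-stable sketch of the same row-wise spreading (my own substitution, using mask plus ffill rather than replace) would be:

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 1, 0, 0, 3, 0, 0, 2, 1]])

# Turn zeros into NaN, forward-fill along each row, then restore the
# leading zeros that had no earlier non-zero value to inherit.
spread = df.mask(df == 0).ffill(axis=1).fillna(0).astype(int)
print(spread.iloc[0].tolist())  # [0, 0, 1, 1, 1, 3, 3, 3, 2, 1]
```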

How to switch column elements for only one specific row in pandas?

So, I am working w/data from my research lab and am trying to sort it and move it around etc. And most of the stuff isn't important to my issue and I don't want to go into detail because confidentiality stuff, but I have a big table w/columns and rows and I want to specifically switch the elements of two columns ONLY in one row.
The extremely bad attempt at code I have for it is this (I rewrote the variables to be more vague though so they make sense):
for x in df.columna.values:
    # *some if statements*
    df.loc[df.index([df.loc[df['columna'] == x]]), ['columnb', 'columna']] = df[df.index([df.loc[df['columna'] == x]]), ['columna', 'columnb']].numpy()
I am aware that the code I have is trash (and also the method, with the for loops and if statements; I know I can abstract it a TON, but I just want to actually figure out a way to make it work, and then I will clean it up and make it prettier and more efficient; I learned pandas existed on Tuesday, so I am not an expert), but I think my issue lies in the way I'm getting the row.
One error I was recently getting for a while is the method I was using to get the row was giving me 1 row x 22 columns and I think I needed the name/index of the row instead. Which is why the index function is now there. However, I am now getting the error:
TypeError: 'RangeIndex' object is not callable
And I am just so confused all around. Sorry I've written a ton of text; basically: is there any simpler way to just switch the elements of two columns for one specific row (in terms of x, an element in that row)?
I think my biggest issue is trying to get the row's "name" in the format it wants. Although I may have a ton of other problems, because honestly I am just really lost.
You're sooooo close! The error you're getting stems from trying to slice df.index([df.loc[df['columna'] == x]]). The parentheses are unneeded here and this should read as: df.index[df.loc[df['columna'] == x]].
However, here's an example of how to swap values between columns when provided a value (or multiple values) to swap at.
Sample Data
df = pd.DataFrame({
    "A": list("abcdefg"),
    "B": [1, 2, 3, 4, 5, 6, 7]
})
print(df)
A B
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
Let's say we're going to swap the values where A is either "c" or "f". To do this we need to first create a mask that just selects those rows. To accomplish this, we can use .isin. Then to perform our swap, we actually take the same exact approach you had! Including the .to_numpy() is very important, because without it Pandas will actually realign your columns for you and cause the values to not be swapped. Putting it all together:
swap_at = ["c", "f"]
swap_at_mask = df["A"].isin(swap_at) # mask where columns "A" is either equal to "c" or "f"
# Without the `.to_numpy()` at the end, pandas will realign the Dataframe
# and no values will be swapped
df.loc[swap_at_mask, ["A", "B"]] = df.loc[swap_at_mask, ["B", "A"]].to_numpy()
print(df)
A B
0 a 1
1 b 2
2 3 c
3 d 4
4 e 5
5 6 f
6 g 7
I think it was probably a syntax problem. I am assuming you are using TensorFlow, given the numpy() call? Note that .values already returns a NumPy array, so the extra .numpy() call is what raises an error. Try this; it switches the columns based on the code you provided:
for x in df.columna.values:
    # *some if statements*
    df.loc[
        (df["columna"] == x),
        ['columna', 'columnb']
    ] = df.loc[(df["columna"] == x), ['columnb', 'columna']].values
I am also a beginner and would recommend you aim to make it pretty from the get go. It will save you a lot of extra time in the long run. Trial and error!

Conditional Rolling Sum using filter on groupby group rows

I've been trying without success to find a way to create an "average_gain_up" in Python and have gotten a bit stuck. Being new to groupby, there is something about how it treats functions that I've not managed to grasp, so any intuition behind how to think through these types of problems would be helpful.
Problem:
Create a rolling 14-day sum, only summing values that are > 0.
new = pd.DataFrame([[1, -2, 3, -2, 4, 5], ['a', 'a', 'a', 'b', 'b', 'b']])
new = new.T  # transposing into a friendly groupby format
# Group by a or b, filter to only have positive values and then sum rolling;
# we keep NAs to ensure the sum is run over 14 values.
groupby = new.groupby(1)[0].filter(lambda x: x > 0, dropna=False).rolling(14).sum()
Intended Sum Frame:
x.all()/len(x) result:
This throws a TypeError: "the filter must return a boolean result".
From reading other answers, I understand this is because I'm asking whether a whole series/frame is greater than 0.
The above code works with len(x); again, that makes sense in that context.
I tried with all() as well, but it doesn't behave as intended: the .all() function returns a single boolean per group, and the sum is then just a simple rolling sum.
I've tried creating a list of booleans to say which values are positive and which are not, but that also yields an error; this time I'm not sure why.
groupby1=new.groupby(1)[0]
groupby2=[y>0 for x in groupby1 for y in x[1] ]
groupby_try=new.groupby(1)[0].filter(lambda x:groupby2,dropna=False).rolling(2).sum()
1) How do I make the above code work, and what is wrong in how I am thinking about it?
2) Is this the "best practice" way to do these types of operations?
Any help appreciated; let me know if I've missed anything or if any further clarification is needed.
According to the docs, filter after a groupby is not supposed to filter values within a group, but rather groups as a whole when they don't meet some criterion (for example, in the first example in the docs, a group is kept only if the sum of all its elements is above 2).
One way could be to replace all the negative values with 0 in new[0] first, using np.clip for example, and then groupby, rolling and sum, such as:
print (np.clip(new[0],0,np.inf).groupby(new[1]).rolling(2).sum())
1
a 0 NaN
1 1.0
2 3.0
b 3 NaN
4 4.0
5 9.0
Name: 0, dtype: float64
This way avoids modifying the data in new; if you don't mind, you can change column 0 with new[0] = np.clip(new[0], 0, np.inf) and then do new.groupby(1)[0].rolling(2).sum(), which gives the same result.
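For reference, a self-contained version of that snippet; Series.clip(lower=0) is equivalent to the np.clip call above:

```python
import pandas as pd

new = pd.DataFrame({0: [1, -2, 3, -2, 4, 5], 1: list("aaabbb")})

# Clamp negatives to zero, then take a rolling sum within each group.
rolled = new[0].clip(lower=0).groupby(new[1]).rolling(2).sum()
print(rolled.dropna().tolist())  # [1.0, 3.0, 4.0, 9.0]
```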

What is the Python syntax for accessing specific data in a function call?

I am not generally a python user. I do most things in R and Stata. However, I cannot find a good semantic similarity package/API in either of those.
I have two data frames in the environment. One is questions, which consists of 2 columns and 3 rows. The other is results, which has 3 columns and 3 rows.
I am trying to compare each question (individually) in the first column of the questions dataframe to all of the questions in the second column. Then I want the output to populate the results dataframe. The function takes two strings as arguments. So far my code looks like this:
for i in range(1, 3):
    results.iloc[i-1, i] = liteClient.compare(questions.iloc[0, i], questions.iloc[:, 1])
I assume that I am pointing to the data incorrectly, but I really haven't been able to find good documentation about this seemingly straightforward issue.
By comparison, here is my working R code, which uses a slightly different function and only one dataframe.
for (i in 1:3){
    df[,i+2] <- levenshteinSim(df$yr1questions[i], df$yr2questions)
}
Any help would be greatly appreciated! I am trying to come up with proof-of-concept code to compare similar survey questions between years based on semantic meaning.
Bob
Let's try to compare (here: multiply) every question A to every question B:
import numpy as np
import pandas as pd
questions = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["question_A", "question_B"])
This gives:
question_A question_B
0 0 1
1 2 3
2 4 5
Then let's define a compare function:
def compare(row):
    return pd.Series(row[0] * questions['question_B'])

results = questions.apply(compare, axis=1)
That gives us:
0 1 2
0 0 0 0
1 2 6 10
2 4 12 20
As you pointed in the comments, here is a version comparing only two strings at a time:
def compare(row):
    question_a = row[0]
    return pd.Series([liteClient.compare(question_a, question_b) for question_b in questions['question_B']])
Based on what you've put so far, here are some issues with what you've written, which are understandable given your R programming background:
for i in range(1, 3):
In Python 3.x, what this does is create a range object, which you can think of as a special kind of iterable that generates a sequence of numbers with a certain step size (the default is 1), with an exclusive upper bound. Additionally, you need to know that most programming languages index starting at zero, not one, and this includes Python.
What this range object does here is generate the sequence 1, 2, and that is it.
So the loop will not cover all indices of the arrays you are using i to index. What I believe you want is something like:
for i in range(3):
Notice how there is only one number here, this defaults to the exclusive maximum of the range, and 0 being the inclusive minimum, so this will generate the sequence of 0,1,2. If you have an array of size 3, this will represent all possible indices for that array.
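The two range calls side by side:

```python
# range(1, 3) skips index 0; range(3) covers all indices of a size-3 array.
print(list(range(1, 3)))  # [1, 2]
print(list(range(3)))     # [0, 1, 2]
```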
This next line is a bit confusing to me, since I'm not familiar with R, but I sort of understand what you were trying to do. If I understand correctly, you are trying to compare two columns of 3 questions each, comparing each question in column 1 to the questions in column 2, resulting in a 3x3 matrix of comparison results which you are trying to store in results? Assuming the sizes are already correct (as in, results is 3x3), I'd like to explain some peculiarities I see in this code.
results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
With results.iloc[i-1, i] you are indexing by row and column: i-1 is the row, and i is the column. So, without changing range(1, 3), only the indexes 0,1 and 1,2 are ever accessed. I believe liteClient.compare(...) is supposed to return either a 1x3 dataframe or a list of size 3, based on what you were trying to do; this may not be the case, however, since I'm not sure what object you are calling that member function on and don't know where its documentation lives. Assuming it does return a list of size 3 (or a dataframe), you'll need to change the way you assign the data, to this:
results.iloc[i,:] = ...
What is happening here is that iloc[...] takes a row positional argument and a slice positional argument; here you are assigning all the columns of the results matrix at that row to the values returned by compare. With the for-statement change above, this will iterate over all indices of the dataframe.
liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
As this line currently stands, you are comparing a question from the first row of questions (indexed by column i) to the second column (all rows) of questions.
I believe what you will want to do is change this to:
liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
What this does is, for each i (0, 1, 2), compare the question at row i, column 0 to every row in column 1. If your questions dataframe is actually organized as 2 columns and 3 rows, this should work; otherwise you will need to change how you create questions as well.
In all, I believe the fixed program should look something like:
for i in range(3):
    results.iloc[i, :] = liteClient.compare(questions.iloc[i, 0], questions.iloc[:, 1])
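Since liteClient is not shown in the question, here is a runnable sketch of that fixed loop with a hypothetical stand-in for liteClient.compare (any function taking two strings and returning a similarity score would slot in the same way):

```python
import pandas as pd

def compare(a, b):
    # Hypothetical stand-in for liteClient.compare: 1.0 if equal, else 0.0.
    return float(a == b)

questions = pd.DataFrame({
    "yr1": ["alpha", "beta", "gamma"],
    "yr2": ["beta", "alpha", "gamma"],
})
results = pd.DataFrame(index=range(3), columns=range(3), dtype=float)

for i in range(3):
    # Compare question i from column 0 to every question in column 1.
    results.iloc[i, :] = [compare(questions.iloc[i, 0], q) for q in questions.iloc[:, 1]]

print(results.iloc[0].tolist())  # [0.0, 1.0, 0.0]
```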

Excel: How do I perform bitwise counting?

I have an excel file with close to 13 columns with 0's and 1's. I want to perform bitwise counting on columns like so:
A B Result
1 1 1
0 1 1
1 0 1
0 0 0
I tried lookup, vlookup, countif(s), but nothing seems to be working for me? Are there any other functions I could use?
* EDIT *
I am actually looking to implement this in Python because it is part of a rather long workflow and I don't want to interrupt the script by having to exit do this and then come back. What is a a rather naive way of doing this in Python?
So far, I have tried to write something where I ask the user to provide an input of which columns they would like grouped but I could not make it work.
Thanks,
If you're doing a bitwise OR (as your example seems to show) you can just put this formula in your Result column
=MIN(SUM(A1:B1),1)
And then just copy down
Or, you could use the OR function, which will return True if any value is 1 and False if all 0
=IF(OR(A1:B1),1,0)
The following formula will output 1 or 0, depending on whether there are one or more 1's in columns A through C:
=IF(COUNTIF(A2:C2,"=1")>0,1,0)
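Since the edit asks for a way to do this in Python instead, a minimal pandas sketch of the same row-wise OR (the column names here are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, 1, 0], "B": [1, 1, 0, 0]})

# A row's Result is 1 if any of the selected columns is 1, else 0.
df["Result"] = df[["A", "B"]].any(axis=1).astype(int)
print(df["Result"].tolist())  # [1, 1, 1, 0]
```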
