Aggregate function in pandas dataframe not working appropriately - python

I'm trying to sum a certain column based on a groupby of another column. I thought I had the code right, but the output is wildly different from what I expect. So I tried a simple min() function on that groupby, and its output is also completely different from the expected output. Did I do something wrong somewhere?
Below is the dataframe. I grouped it by lga_desc, and when I test for the minimum value within those groups, I get the wrong output.
|Taxable Income|lga_desc|
|---|---|
|300,000,450|Alpine|
|240,000|Alpine|
|700,000|Alpine|
|260,000,450|Ararat|
|469,000|Ararat|
|5,200,000|Ararat|
df = df.groupby('lga_desc')
df = df['Taxable Income'].min()
Output when applying the min function:
lga_desc
Alpine 700,000
Ararat 469,000
These are the wrong outputs for the given dataframe.
Thank you for the help!
Update: after carefully checking my code again, it turns out that when I imported this file, all the numbers became strings. So a lesson: don't forget to make sure your numbers are actual numbers, not strings! :)

You need to convert the data type to int first:
df['Taxable Income'] = df['Taxable Income'].str.replace(',', '').astype(int)
result = df.groupby('lga_desc')['Taxable Income'].min().reset_index()
OUTPUT:
lga_desc Taxable Income
0 Alpine 240000
1 Ararat 469000
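Two related options that may help, assuming the data comes from a CSV (the filename below is hypothetical): pandas can strip thousands separators at load time, and pd.to_numeric is more forgiving than astype(int) when some entries might be malformed.

import pandas as pd

# Parse '240,000'-style numbers at load time (filename is hypothetical)
df = pd.read_csv('income.csv', thousands=',')

# Or convert after the fact; errors='coerce' turns malformed entries into NaN
df['Taxable Income'] = pd.to_numeric(
    df['Taxable Income'].astype(str).str.replace(',', ''),
    errors='coerce',
)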

Related

Removing numbers from dataframe that lie within range

I have a pandas dataframe that contains values between -1000 and 1000. I want to eliminate all the numbers between -0.00001 and 0.00001, i.e. replace them with NaN. It is worth mentioning that my df contains numerous very small positive and negative numbers that I want caught by this range as well, e.g. 6.26478E-52.
How do I go about doing this?
IIUC. If you mean values less than -0.00001 or less than 0.00001:
df = df.mask(df.lt(-0.00001) | df.lt(0.00001))
which is the same as simply below 0.00001:
df = df.mask(df.lt(0.00001))
Or, if you need the values between the two bounds:
df = df.mask(df.gt(-0.00001) & df.lt(0.00001))
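A quick check on toy data (values chosen to mimic the question, including a tiny positive number like 6.26478E-52):

import pandas as pd

df = pd.DataFrame({'x': [-1000.0, -0.000005, 6.26478e-52, 0.5, 1000.0]})

# Values strictly between -0.00001 and 0.00001 are replaced with NaN
print(df.mask(df.gt(-0.00001) & df.lt(0.00001)))
#         x
# 0 -1000.0
# 1     NaN
# 2     NaN
# 3     0.5
# 4  1000.0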

Number of quarters gap to int

I calculate the number of quarters of gap between two dates. Now I want to test whether the number of quarters of gap is bigger than 2.
Thank you for your comments!
I'm actually running code from WRDS (Wharton Research Data Services). Below, fst_vint is a DataFrame with two date variables, rdate and lag_rdate. The first line converts them to quarterly periods (e.g., 9/8/2019 to 2019Q3) and then takes the difference, storing it in a new column qtr.
The comparison fst_vint.qtr >= 2 creates a problem, because the left side holds QuarterEnd offset objects while the right side is an integer. How do I deal with this?
fst_vint['qtr'] = (fst_vint['rdate'].dt.to_period('Q')
                   - fst_vint['lag_rdate'].dt.to_period('Q'))
# label first_report flag
fst_vint['first_report'] = ((fst_vint.qtr.isnull()) | (fst_vint.qtr>=2))
Converting the column to integer with .astype(int) and then using .diff() gives the desired answer. So the code in your case would be:
fst_vint['qtr'] = fst_vint['rdate'].astype(int).diff()
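That said, if the goal is the integer number of quarters between the two period columns, a hedged alternative (assuming pandas 0.24+, where subtracting periods yields DateOffset objects; the sample dates are made up) is to pull the count out of the offsets via .n:

import pandas as pd

# Toy frame standing in for fst_vint (column names from the question)
fst_vint = pd.DataFrame({
    'rdate': pd.to_datetime(['2019-09-08', '2020-06-30']),
    'lag_rdate': pd.to_datetime(['2019-03-01', '2020-03-31']),
})

# Period subtraction returns DateOffset objects; .n extracts the integer count
diff = (fst_vint['rdate'].dt.to_period('Q')
        - fst_vint['lag_rdate'].dt.to_period('Q'))
fst_vint['qtr'] = diff.apply(lambda off: off.n)

# Now the integer comparison from the question works
fst_vint['first_report'] = fst_vint['qtr'].isnull() | (fst_vint['qtr'] >= 2)
print(fst_vint[['qtr', 'first_report']])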

Conditional Rolling Sum using filter on groupby group rows

I've been trying, without success, to create an "average_gain_up" in Python and have gotten a bit stuck. Being new to groupby, there is something about how it treats functions that I haven't managed to grasp, so any intuition on how to think through these kinds of problems would be helpful.
Problem:
Create a rolling 14-day sum that only includes values > 0.
import pandas as pd

new = pd.DataFrame([[1, -2, 3, -2, 4, 5], ['a', 'a', 'a', 'b', 'b', 'b']])
new = new.T  # transpose into a groupby-friendly format

# Group by 'a' or 'b', filter to keep only the positive values, then take a
# rolling sum; NAs are kept so the sum still runs over 14 values.
groupby = new.groupby(1)[0].filter(lambda x: x > 0, dropna=False).rolling(14).sum()
Intended sum frame: (shown as an image in the original post)
Result with x.all() / len(x):
The filter call above throws a TypeError, "the filter must return a boolean result". From reading other answers, I understand this is because I'm asking whether an entire series/frame is greater than 0. The code does work with len(x), which again makes sense in that context. I tried .all() as well, but it doesn't behave as intended: .all() returns a single boolean per group, so the sum ends up being just a plain rolling sum.
I've also tried creating a list of booleans to say which values are positive and which are not, but that also yields an error, and this time I'm not sure why.
groupby1 = new.groupby(1)[0]
groupby2 = [y > 0 for x in groupby1 for y in x[1]]
groupby_try = new.groupby(1)[0].filter(lambda x: groupby2, dropna=False).rolling(2).sum()
1) How do I make the above code work, and what is wrong in how I am thinking about it?
2) Is this the "best practice" way to do these types of operations?
Any help appreciated; let me know if I've missed anything or if any further clarification is needed.
According to the documentation, filter after a groupby is not meant to filter values within a group but to drop groups as a whole when they don't meet some criterion; in the first example given there, a group is kept only if the sum of all its elements is above 2.
One way could be to replace all the negative values in new[0] with 0 first, using np.clip for example (with import numpy as np), and then groupby, rolling and sum, such as:
print(np.clip(new[0], 0, np.inf).groupby(new[1]).rolling(2).sum())
1
a 0 NaN
1 1.0
2 3.0
b 3 NaN
4 4.0
5 9.0
Name: 0, dtype: float64
This way avoids modifying the data in new. If you don't mind changing it, you can clip column 0 in place with new[0] = np.clip(new[0], 0, np.inf) and then do new.groupby(1)[0].rolling(2).sum(), which gives the same result.
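Applied back to the original problem, the clip-then-roll idea with the full 14-value window might look like the sketch below; the ticker and gain column names are hypothetical stand-ins for the real data.

import numpy as np
import pandas as pd

# Hypothetical daily-gain data; column names are made up for this sketch
df = pd.DataFrame({
    'ticker': ['a'] * 20 + ['b'] * 20,
    'gain': np.random.default_rng(0).standard_normal(40),
})

# Losses become 0, then a 14-row rolling sum per group
df['gain_up_14'] = (
    df['gain'].clip(lower=0)
    .groupby(df['ticker'])
    .rolling(14)
    .sum()
    .reset_index(level=0, drop=True)  # drop the group level to realign with df
)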

Calculating running total

I have a data frame df and I would like to keep a running total of the names that occur in one of its columns. I am trying to calculate the running total column below:
|name|running total|
|---|---|
|a|1|
|a|2|
|b|1|
|a|3|
|c|1|
|b|2|
There are two ways I thought to do this:
1) Loop through the dataframe, keeping a separate dictionary of each name's current count. The count for the relevant name increases by 1 on each pass of the loop, and that value is copied into my dataframe.
2) Compute a count-so-far for each row. In Excel I would use COUNTIF combined with a drag-down range like A$1:A1, which fixes the first cell but leaves the second relative, so the range I am counting over grows with the row.
The problem is I am not sure how to implement either of these in pandas. Does anyone have ideas on which is preferable and how they could be implemented?
@bunji is right. I'm assuming you're using pandas and that your data is in a dataframe called df. To add the running totals to your dataframe, you could do something like this:
df['running total'] = df.groupby(['name']).cumcount() + 1
The + 1 gives you a 1 for your first occurrence instead of 0, which is what you would get otherwise.
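For reference, a quick run on the names from the question reproduces the running total column:

import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b', 'a', 'c', 'b']})
df['running total'] = df.groupby(['name']).cumcount() + 1
print(df)
#   name  running total
# 0    a              1
# 1    a              2
# 2    b              1
# 3    a              3
# 4    c              1
# 5    b              2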

Pandas groupby(...).mean() lost keys

I have a dataframe rounds (which was the result of deleting a column from another dataframe) with the following structure (can't post pics, sorry):
----------------------------
|type|N|D|NATC|K|iters|time|
----------------------------
rows of data
----------------------------
I use groupby so I can then get the mean of the groups, like so:
rounds = results.groupby(['type','N','D','NATC','K','iters'])
results_mean = rounds.mean()
I get the means that I wanted but I get a problem with the keys. The results_mean dataframe has the following structure:
----------------------------
| | | | | | |time|
|type|N|D|NATC|K|iters| |
----------------------------
rows of data
----------------------------
The only key recognized is time (I executed results_mean.keys()).
What did I do wrong? How can I fix it?
In your aggregated data, time is the only column; the other ones are now levels of the index.
groupby has a parameter as_index. From the documentation:
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
So you can get the desired output by calling
rounds = results.groupby(['type','N','D','NATC','K','iters'], as_index = False)
results_mean = rounds.mean()
Or, if you want, you can always convert indices to keys by using reset_index. Using
rounds = results.groupby(['type','N','D','NATC','K','iters'])
results_mean = rounds.mean().reset_index()
should have the desired effect as well.
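As a quick check, here is a minimal reproduction with made-up data (column names taken from the question) showing the grouping columns coming back as keys:

import pandas as pd

results = pd.DataFrame({
    'type': ['x', 'x', 'y'],
    'N': [1, 1, 2],
    'D': [3, 3, 4],
    'NATC': [0, 0, 1],
    'K': [5, 5, 6],
    'iters': [10, 10, 20],
    'time': [1.0, 3.0, 5.0],
})

# With as_index=False the grouping columns stay as regular columns
results_mean = results.groupby(
    ['type', 'N', 'D', 'NATC', 'K', 'iters'], as_index=False
).mean()
print(results_mean.keys())
# Index(['type', 'N', 'D', 'NATC', 'K', 'iters', 'time'], dtype='object')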
I've had the same problem of losing the dataframe's keys after using the groupby() function; the answer I found was to write the DataFrame to a CSV file and then read that file back in.
