Using pivot_table or pd.groupby for certain seasons - python

I have a data set like the one below:
Year Month Dryden 3rdAve Clark Landfill
0 2015 1 0.00 0.00 0.0 NaN
1 2015 1 0.00 0.00 0.0 NaN
2 2015 1 0.00 0.00 0.0 NaN
3 2015 1 0.00 0.00 0.0 NaN
4 2015 1 0.00 0.00 0.0 NaN
5 2015 1 0.00 0.00 0.0 NaN
6 2015 1 0.00 0.00 0.0 NaN
7 2015 1 0.00 0.00 0.0 NaN
8 2015 1 0.00 0.00 0.0 NaN
9 2015 1 0.00 0.00 0.0 NaN
10 2015 1 0.00 0.00 0.0 NaN
11 2015 1 0.00 0.00 0.0 NaN
where I want to run the code below to calculate the mean of each season for Dryden values:
df.Dryden.groupby([df.Year,pd.cut(df.Month,[0,3,6,9,12],labels=['Winter','Spring','Summer','Autumn'],right =False)]).mean()
I am getting this error:
TypeError: '>' not supported between instances of 'int' and 'str'
df.dtypes gives me:
Year int64
Month object
Dryden float64
3rdAve float64
Clark float64
Landfill float64
dtype: object
I was wondering if anyone could help me.

Convert your Month column to an integer like this:
df.Month = df.Month.astype(int)
Then run your code:
In [61]: df.Dryden.groupby([df.Year,pd.cut(df.Month,[0,3,6,9,12],labels=['Winter','Spring','Summer','Autumn'],right =False)]).mean()
Out[61]:
Year Month
2015 Winter 0.0
Name: Dryden, dtype: float64
If that raises a ValueError (for example because some entries are not numeric), this may work instead; values that cannot be parsed become NaN:
df.Month = pd.to_numeric(df.Month, errors='coerce')
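To illustrate the conversion step end to end, here is a minimal sketch on made-up data (the four rows and the `% 12` step are assumptions, not part of the question). One thing to watch: with right=False the bins [0,3,6,9,12] are half-open, so month 12 falls outside the last bin and pd.cut returns NaN for it. Taking the month modulo 12 is one possible way to fold December into the first bin.

```python
import pandas as pd

# Made-up sample: Month is stored as strings, so its dtype is object
df = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2015],
    "Month": ["1", "4", "7", "12"],
    "Dryden": [0.0, 1.0, 2.0, 3.0],
})

# Convert to numbers; anything unparseable would become NaN
df["Month"] = pd.to_numeric(df["Month"], errors="coerce")

# Assumption: December should count as Winter, so map 12 -> 0 before binning
seasons = pd.cut(
    df["Month"] % 12,
    [0, 3, 6, 9, 12],
    labels=["Winter", "Spring", "Summer", "Autumn"],
    right=False,
)

# observed=True keeps only the seasons that actually occur in the data
season_means = df["Dryden"].groupby([df["Year"], seasons], observed=True).mean()
print(season_means)
```

Keep in mind that this groups December with January and February of the same calendar year, which may or may not be what you want.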

Add this:
df['Month'] = df['Month'].astype(int)
Or:
df['Month'] = df['Month'].astype(float).astype(int)

Related

Pandas Dataframes Remove rows by unique count of values

I want to remove rows where the value count of a specific column's value is less than some threshold.
The dataframe looks like this:
class reason bank_fees cash_advance community food_and_drink ... recreation service shops tax transfer travel
0 0 at&t 20.00 0.0 253.95 254.48 ... 19.27 629.34 842.77 0.0 -4089.51 121.23
1 0 at&t 0.00 0.0 0.00 319.55 ... 0.00 1327.53 656.84 -1784.0 -1333.20 79.60
2 1 entergy arkansas 80.00 0.0 3.39 3580.99 ... 612.36 3746.90 4990.33 0.0 -14402.54 888.67
3 1 entergy arkansas 0.00 0.0 0.00 37.03 ... 0.00 405.24 47.34 0.0 -400.01 41.12
4 1 entergy arkansas 0.00 0.0 0.00 250.18 ... 0.00 123.48 54.28 0.0 -270.15 87.00
... ... ... ... ... ... ... ... ... ... ... ... ... ...
6659 0 t-mobile 0.00 0.0 0.00 0.00 ... 0.00 0.00 50.00 0.0 -253.74 284.44
6660 0 optimum 0.00 -30.0 108.63 158.67 ... 10.11 7098.23 2657.95 0.0 -12641.89 3011.04
6661 0 optimum 0.00 0.0 0.00 267.86 ... 0.00 2459.41 1939.35 0.0 -5727.50 212.06
6662 0 state farm insurance 0.00 0.0 0.00 80.91 ... 25.00 130.27 195.42 0.0 -1189.71 48.79
6663 0 verizon 39.97 0.0 0.00 0.00 ... 0.00 110.00 0.00 0.0 0.00 0.00
[6664 rows x 15 columns]
These are the value counts of the reason column:
at&t 724
verizon 544
geico 341
t-mobile 309
state farm insurance 135
...
town of smyrna 1
city of hendersonville 1
duke energy 1
pima medical institute 1
gieco 1
Name: reason, Length: 649, dtype: int64
The important column here is reason. For example, if a value occurs fewer than 5 times, I want to remove the rows containing it. How can I do that? Thanks
You can get the index of the value counts where the count is below 5 and use isin to filter out those values:
out = df[~df['reason'].isin(df['reason'].value_counts().lt(5).pipe(lambda s: s[s].index))]
To elaborate on each step:
out = df[~df['reason'].isin(
df['reason'].value_counts() # get each value count
.lt(5) # mask value lower than 5
.pipe(lambda s: s[s].index) # get the index of value which is lower than 5
)] # if value is not in the index, keep it
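As an alternative sketch (not from the answer above), the same filter can be written with groupby and transform, which attaches each row's group size directly to that row. The tiny dataframe and the `amount` column are made up for illustration.

```python
import pandas as pd

# Made-up sample: two frequent reasons and one rare one
df = pd.DataFrame({
    "reason": ["at&t"] * 5 + ["verizon"] * 5 + ["gieco"],
    "amount": range(11),
})

# Number of rows sharing each row's reason, aligned to the original index
counts = df.groupby("reason")["reason"].transform("size")

# Keep only rows whose reason occurs at least 5 times
out = df[counts.ge(5)]
print(out["reason"].value_counts())
```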

change values in dataframe row based on condition

I have this dataframe
Region 2021 2022 2023
0 Europe 0.00 0.00 0.00
1 N.Amerca 0.50 0.50 0.50
2 N.Amerca 4.40 4.40 4.40
3 N.Amerca 0.00 8.00 8.00
4 Asia 0.00 0.00 1.75
5 Asia 0.00 0.00 0.00
6 Asia 0.00 0.00 2.00
7 N.Amerca 0.00 0.00 0.50
8 Eurpoe 6.00 6.00 6.00
9 Asia 7.50 7.50 7.50
10 Asia 3.75 3.75 3.75
11 Asia 3.50 3.50 3.50
12 Asia 3.80 3.80 3.80
13 Asia 0.00 0.00 0.00
14 Europe 6.52 6.52 6.52
Once a value is found in 2021, the rest (2022 and 2023) should be set to 0,
and if a value is found in 2022, everything after it should be set to 0. In other words, once a non-zero value is found in column 2021 or later, all columns to its right should be zeroed.
expected result would be:
Region 2021 2022 2023
0 Europe 0.00 0.00 0.00
1 N.Amerca 0.50 0.00 0.00
2 N.Amerca 4.40 0.00 0.00
3 N.Amerca 0.00 8.00 0.00
4 Asia 0.00 0.00 1.75
5 Asia 0.00 0.00 0.00
6 Asia 0.00 0.00 2.00
7 N.Amerca 0.00 0.00 0.50
8 Eurpoe 6.00 0.00 0.00
9 Asia 7.50 0.00 0.00
10 Asia 3.75 0.00 0.00
11 Asia 3.50 0.00 0.00
12 Asia 3.80 0.00 0.00
13 Asia 0.00 0.00 0.00
14 Europe 6.52 0.00 0.00
I have tried to apply a lambda:
def foo(r):
# if r['2021'] > 0, then 2022 and onward should be zero
df = df.apply(lambda x: foo(x), axis=1)
but the challenge is that the columns actually run from 2021 to 2030, and foo becomes a mess.
Let us try duplicated (this relies on the repeated values within a row being identical, as they are in the sample data):
df = df.mask(df.T.apply(pd.Series.duplicated).T,0)
Out[57]:
Region 2021 2022 2023
0 Europe 0.00 0.0 0.00
1 N.Amerca 0.50 0.0 0.00
2 N.Amerca 4.40 0.0 0.00
3 N.Amerca 0.00 8.0 0.00
4 Asia 0.00 0.0 1.75
5 Asia 0.00 0.0 0.00
6 Asia 0.00 0.0 2.00
7 N.Amerca 0.00 0.0 0.50
8 Eurpoe 6.00 0.0 0.00
9 Asia 7.50 0.0 0.00
10 Asia 3.75 0.0 0.00
11 Asia 3.50 0.0 0.00
12 Asia 3.80 0.0 0.00
13 Asia 0.00 0.0 0.00
14 Europe 6.52 0.0 0.00
This is another way:
df2 = df.set_index('Region').diff(axis=1).reset_index()
df2['2021'] = df['2021']
or:
df.iloc[:,1:].where(df.iloc[:,1:].ne(0).cumsum(axis=1).eq(1),0)
Output:
2021 2022 2023
0 0.00 0.0 0.00
1 0.50 0.0 0.00
2 4.40 0.0 0.00
3 0.00 8.0 0.00
4 0.00 0.0 1.75
5 0.00 0.0 0.00
6 0.00 0.0 2.00
7 0.00 0.0 0.50
8 6.00 0.0 0.00
9 7.50 0.0 0.00
10 3.75 0.0 0.00
11 3.50 0.0 0.00
12 3.80 0.0 0.00
13 0.00 0.0 0.00
14 6.52 0.0 0.00
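The cumsum idea can be sketched so that it keeps the Region column and works for any number of year columns (2021 to 2030, say). This is a minimal illustration on made-up rows, assuming every column other than Region is a year column; unlike the duplicated and diff approaches, it does not need the later values to equal the first one.

```python
import pandas as pd

# Made-up sample; row 1 deliberately has a later value that differs from the first
df = pd.DataFrame({
    "Region": ["Europe", "N.Amerca", "N.Amerca", "Asia"],
    "2021": [0.0, 0.5, 0.0, 0.0],
    "2022": [0.0, 0.5, 8.0, 0.0],
    "2023": [0.0, 0.7, 8.0, 1.75],
})

# Assumption: everything except Region is a year column
year_cols = df.columns.drop("Region")

nonzero = df[year_cols].ne(0)
# True only at the first non-zero entry of each row
keep = nonzero & nonzero.cumsum(axis=1).eq(1)

result = df.copy()
result[year_cols] = df[year_cols].where(keep, 0)
print(result)
```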

New data frame with the % of the debt paid in the month of the payment

I have two dataframes df1 and df2.
One with clients debt, the other with client payments with dates.
I want to create a new data frame with the % of the debt paid in the month of the payment until 01-2017.
import pandas as pd
d1 = {'client number': ['2', '2','3','6','7','7','8','8','8','8','8','8','8','8'],
'month': [1, 2, 3,1,10,12,3,5,8,1,2,4,5,8],
'year':[2013,2013,2013,2019,2013,2013,2013,2013,2013,2014,2014,2015,2016,2017],
'payment' :[100,100,200,10000,200,100,300,500,200,100,200,200,500,50]}
df1 = pd.DataFrame(data=d1).set_index('client number')
df1
d2 = {'client number': ['2','3','6','7','8'],
'debt': [200, 600,10000,300,3000]}
df2 = pd.DataFrame(data=d2)
x=[1,2,3,4,5,6,7,8,9,10]
y=[2013,2014,2015,2016,2017]
for x in month and y in year
if df1['month']=x and df1['year']=year :
df2[month&year] = df1['payment']/df2['debt']
the result needs to be something like this for all the clients
what am I missing?
thank you for your time and help
First set the index of both dataframes df1 and df2 to client number. Then use Index.map to map the client numbers in df1 to their corresponding debts from df2, and use Series.div to divide each client's payments by their respective debt, which gives the fraction of the debt that is paid. Next, create a new column date in df1 from the month and year columns. Finally, use DataFrame.join along with DataFrame.pivot_table:
# df1 in the question is already indexed by 'client number'; reset it first so set_index does not raise a KeyError
df1 = df1.reset_index().set_index('client number')
df2 = df2.set_index('client number')
df1['pct'] = df1['payment'].div(df1.index.map(df2['debt'])).round(2)
df1['date'] = df1['year'].astype(str) + '-' + df1['month'].astype(str).str.zfill(2)
df3 = (
df2.join(
df1.pivot_table(index=df1.index, columns='date', values='pct', aggfunc='sum').fillna(0))
.reset_index()
)
Result:
# print(df3)
client number debt 2013-01 2013-02 2013-03 2013-05 2013-08 ... 2013-12 2014-01 2014-02 2015-04 2016-05 2017-08 2019-01
0 2 200 0.5 0.5 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.0
1 3 600 0.0 0.0 0.33 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.0
2 6 10000 0.0 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 1.0
3 7 300 0.0 0.0 0.00 0.00 0.00 ... 0.33 0.00 0.00 0.00 0.00 0.00 0.0
4 8 3000 0.0 0.0 0.10 0.17 0.07 ... 0.00 0.03 0.07 0.07 0.17 0.02 0.0
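The question also asks for payments only "until 01-2017", which the code above does not filter on. A minimal sketch of one way to add that cutoff is below; the reduced sample data, the names `payments`, `debts`, `in_range`, and `table`, and the reading of the cutoff as "up to and including January 2017" are all assumptions.

```python
import pandas as pd

# Reduced, made-up subset of the question's data
payments = pd.DataFrame({
    "client number": ["2", "2", "8", "8"],
    "month": [1, 2, 5, 8],
    "year": [2013, 2013, 2016, 2017],
    "payment": [100, 100, 500, 50],
})
debts = pd.DataFrame(
    {"client number": ["2", "8"], "debt": [200, 3000]}
).set_index("client number")

# Build a real date (first day of the month) so the cutoff is a plain comparison
payments["date"] = pd.to_datetime(payments[["year", "month"]].assign(day=1))

# Assumed cutoff: keep payments up to and including January 2017
in_range = payments[payments["date"] <= pd.Timestamp("2017-01-01")].copy()

# Fraction of each client's debt covered by each payment
in_range["pct"] = (
    in_range["payment"] / in_range["client number"].map(debts["debt"])
).round(2)
in_range["ym"] = in_range["date"].dt.strftime("%Y-%m")

table = in_range.pivot_table(
    index="client number", columns="ym", values="pct", aggfunc="sum", fill_value=0
)
print(table)
```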

Get proportionate values of columns in a dataframe - Pandas

I have a dataframe like this,
ds 0 1 2 4 5 6
0 1991Q3 nan nan nan nan 1.0 nan
1 2014Q2 1.0 3.0 nan nan 1.0 nan
2 2014Q3 1.0 nan nan 1.0 4.0 nan
3 2014Q4 nan nan nan 2.0 3.0 nan
4 2015Q1 nan 1.0 2.0 4.0 4.0 nan
I would like the proportions for each column 0-6 like this,
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.16 0.00 0.00 0.16 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Is there a pandas way to do this? Any suggestion would be great.
You can do this:
df = df.replace(np.nan, 0)
df = df.set_index('ds')
In [3194]: df.div(df.sum(1),0).reset_index()
Out[3194]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
OR you can use df.apply:
In [3196]: df = df.replace(np.nan, 0)
In [3197]: df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: x/x.sum(), axis=1)
In [3198]: df
Out[3197]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Set the first column as the index, fill the null entries with 0, get the sum of each row, and divide the dataframe by those sums:
res = df.set_index("ds")
res.fillna(0).div(res.sum(1),axis=0)
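Here is a self-contained version of the same idea, as a rough sketch on a made-up two-row subset of the question's data. The rounding to two decimals is only there to match how the expected output is displayed. A row that sums to 0 would produce NaN (0 divided by 0), so that case may need separate handling.

```python
import numpy as np
import pandas as pd

# Made-up subset of the question's dataframe
df = pd.DataFrame({
    "ds": ["1991Q3", "2014Q2"],
    "0": [np.nan, 1.0],
    "1": [np.nan, 3.0],
    "5": [1.0, 1.0],
})

# Treat missing values as 0, then divide each row by its own total
values = df.set_index("ds").fillna(0)
proportions = values.div(values.sum(axis=1), axis=0).round(2).reset_index()
print(proportions)
```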

removing the name of a pandas dataframe index after appending a total row to a dataframe

I have calculated a series of totals by day of the week and appended it to the bottom of the totalpt dataframe.
I have set index.name for the totalpt dataframe to None.
However, while the dataframe displays the default 0, 1, 2, 3 index, it doesn't display the default empty cell in the top left, directly above the index.
How can I make this cell empty in the dataframe?
total_bill tip sex smoker day time size tip_pct
0 16.54 1.01 F N Sun D 2 0.061884
1 12.54 1.40 F N Mon D 2 0.111643
2 10.34 3.50 M Y Tue L 4 0.338491
3 20.25 2.50 M Y Wed D 2 0.123457
4 16.54 1.01 M Y Thu D 1 0.061064
5 12.54 1.40 F N Fri L 2 0.111643
6 10.34 3.50 F Y Sat D 3 0.338491
7 23.25 2.10 M Y Sun B 3 0.090323
pivot = tips.pivot_table('total_bill', index=['sex', 'size'],columns=['day'],aggfunc='sum').fillna(0)
print(pivot)
day Fri Mon Sat Sun Thu Tue Wed
sex size
F 2 12.54 12.54 0.00 16.54 0.00 0.00 0.00
3 0.00 0.00 10.34 0.00 0.00 0.00 0.00
M 1 0.00 0.00 0.00 0.00 16.54 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 20.25
3 0.00 0.00 0.00 23.25 0.00 0.00 0.00
4 0.00 0.00 0.00 0.00 0.00 10.34 0.00
totals_row = tips.pivot_table('total_bill',columns=['day'],aggfunc='sum').fillna(0).astype('float')
totalpt = pivot.reset_index('sex').reset_index('size')
totalpt.index.name = None
totalpt = totalpt[['Fri', 'Mon','Sat', 'Sun', 'Thu', 'Tue', 'Wed']]
totalpt = pd.concat([totalpt, totals_row])  # DataFrame.append was removed in pandas 2.0
print(totalpt)
**day** Fri Mon Sat Sun Thu Tue Wed #problem text day
0 12.54 12.54 0.00 16.54 0.00 0.00 0.00
1 0.00 0.00 10.34 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 16.54 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 0.00 20.25
4 0.00 0.00 0.00 23.25 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00 10.34 0.00
total_bill 12.54 12.54 10.34 39.79 16.54 10.34 20.25
That's the columns' name.
In [11]: df = pd.DataFrame([[1, 2]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
In [13]: df.columns.name = 'XX'
In [14]: df
Out[14]:
XX A B
0 1 2
You can set it to None to clear it.
In [15]: df.columns.name = None
In [16]: df
Out[16]:
A B
0 1 2
An alternative, if you wanted to keep it, is to give the index a name:
In [21]: df.columns.name = "XX"
In [22]: df.index.name = "index"
In [23]: df
Out[23]:
XX A B
index
0 1 2
You can use rename_axis, available since pandas 0.17.0:
In [3939]: df
Out[3939]:
XX A B
0 1 2
In [3940]: df.rename_axis(None, axis=1)
Out[3940]:
A B
0 1 2
In [3942]: df = df.rename_axis(None, axis=1)
In [3943]: df
Out[3943]:
A B
0 1 2
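Applied to a pivot table like the one in the question, this might look as follows. It is a minimal sketch on a made-up three-row subset of the tips data; the point is only that pivot_table stores the `columns=` field name as the columns' name, and rename_axis(None, axis=1) clears it.

```python
import pandas as pd

# Made-up subset of the tips data
tips = pd.DataFrame({
    "total_bill": [16.54, 12.54, 23.25],
    "day": ["Sun", "Mon", "Sun"],
    "sex": ["F", "F", "M"],
})

pivot = tips.pivot_table(
    values="total_bill", index="sex", columns="day", aggfunc="sum"
).fillna(0)

# pivot_table keeps the pivoted field as the columns' name
name_before = pivot.columns.name

# Clear it so no label is printed above the index
cleaned = pivot.rename_axis(None, axis=1)
name_after = cleaned.columns.name
print(cleaned)
```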
