Pandas DataFrames: remove rows by unique count of values - python

I want to remove rows where a specific column's unique value count is less than some value.
The DataFrame looks like this:
class reason bank_fees cash_advance community food_and_drink ... recreation service shops tax transfer travel
0 0 at&t 20.00 0.0 253.95 254.48 ... 19.27 629.34 842.77 0.0 -4089.51 121.23
1 0 at&t 0.00 0.0 0.00 319.55 ... 0.00 1327.53 656.84 -1784.0 -1333.20 79.60
2 1 entergy arkansas 80.00 0.0 3.39 3580.99 ... 612.36 3746.90 4990.33 0.0 -14402.54 888.67
3 1 entergy arkansas 0.00 0.0 0.00 37.03 ... 0.00 405.24 47.34 0.0 -400.01 41.12
4 1 entergy arkansas 0.00 0.0 0.00 250.18 ... 0.00 123.48 54.28 0.0 -270.15 87.00
... ... ... ... ... ... ... ... ... ... ... ... ... ...
6659 0 t-mobile 0.00 0.0 0.00 0.00 ... 0.00 0.00 50.00 0.0 -253.74 284.44
6660 0 optimum 0.00 -30.0 108.63 158.67 ... 10.11 7098.23 2657.95 0.0 -12641.89 3011.04
6661 0 optimum 0.00 0.0 0.00 267.86 ... 0.00 2459.41 1939.35 0.0 -5727.50 212.06
6662 0 state farm insurance 0.00 0.0 0.00 80.91 ... 25.00 130.27 195.42 0.0 -1189.71 48.79
6663 0 verizon 39.97 0.0 0.00 0.00 ... 0.00 110.00 0.00 0.0 0.00 0.00
[6664 rows x 15 columns]
These are the value counts of the column reason:
at&t 724
verizon 544
geico 341
t-mobile 309
state farm insurance 135
...
town of smyrna 1
city of hendersonville 1
duke energy 1
pima medical institute 1
gieco 1
Name: reason, Length: 649, dtype: int64
The important column here is reason. For example, if a value's count is less than 5, I want to remove those rows. How can I do that? Thanks

You can get the index of the value counts that are below 5 and use isin to filter out those values:
out = df[~df['reason'].isin(df['reason'].value_counts().lt(5).pipe(lambda s: s[s].index))]
To elaborate on each step:
out = df[~df['reason'].isin(
    df['reason'].value_counts()      # get each value's count
    .lt(5)                           # mark values whose count is lower than 5
    .pipe(lambda s: s[s].index)      # get the index of the values whose count is lower than 5
)]                                   # keep a row only if its reason is not in that index
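An equivalent way to express the same filter (a sketch, not part of the answer above) is to map each row's reason to its count and keep the rows whose count is at least 5:
counts = df['reason'].value_counts()        # count of each reason
out = df[df['reason'].map(counts).ge(5)]    # keep rows whose reason appears 5 or more times
Both versions keep exactly the rows whose reason appears 5 or more times; the map version just avoids building the intermediate index of rare values.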

Related

change values in dataframe row based on condition

I have this dataframe
Region 2021 2022 2023
0 Europe 0.00 0.00 0.00
1 N.Amerca 0.50 0.50 0.50
2 N.Amerca 4.40 4.40 4.40
3 N.Amerca 0.00 8.00 8.00
4 Asia 0.00 0.00 1.75
5 Asia 0.00 0.00 0.00
6 Asia 0.00 0.00 2.00
7 N.Amerca 0.00 0.00 0.50
8 Eurpoe 6.00 6.00 6.00
9 Asia 7.50 7.50 7.50
10 Asia 3.75 3.75 3.75
11 Asia 3.50 3.50 3.50
12 Asia 3.80 3.80 3.80
13 Asia 0.00 0.00 0.00
14 Europe 6.52 6.52 6.52
Once a value in 2021 is found, it should carry a 0 to the rest (2022 and 2023), and if a value in 2022 is found, it should carry a 0 to the rest. In other words, once a value is found in column 2021 or any later column, the columns to its right should be zeroed.
The expected result would be:
Region 2021 2022 2023
0 Europe 0.00 0.00 0.00
1 N.Amerca 0.50 0.00 0.00
2 N.Amerca 4.40 0.00 0.00
3 N.Amerca 0.00 8.00 0.00
4 Asia 0.00 0.00 1.75
5 Asia 0.00 0.00 0.00
6 Asia 0.00 0.00 2.00
7 N.Amerca 0.00 0.00 0.50
8 Eurpoe 6.00 0.00 0.00
9 Asia 7.50 0.00 0.00
10 Asia 3.75 0.00 0.00
11 Asia 3.50 0.00 0.00
12 Asia 3.80 0.00 0.00
13 Asia 0.00 0.00 0.00
14 Europe 6.52 0.00 0.00
I have tried to apply a function with lambda:
def foo(r):
    # if r['2021'] > 0, then 2022 and onward should be zero
    ...

df = df.apply(lambda x: foo(x), axis=1)
but the challenge is that the columns run from 2021 to 2030, and foo becomes a mess.
Let us try duplicated
df = df.mask(df.T.apply(pd.Series.duplicated).T,0)
Out[57]:
Region 2021 2022 2023
0 Europe 0.00 0.0 0.00
1 N.Amerca 0.50 0.0 0.00
2 N.Amerca 4.40 0.0 0.00
3 N.Amerca 0.00 8.0 0.00
4 Asia 0.00 0.0 1.75
5 Asia 0.00 0.0 0.00
6 Asia 0.00 0.0 2.00
7 N.Amerca 0.00 0.0 0.50
8 Eurpoe 6.00 0.0 0.00
9 Asia 7.50 0.0 0.00
10 Asia 3.75 0.0 0.00
11 Asia 3.50 0.0 0.00
12 Asia 3.80 0.0 0.00
13 Asia 0.00 0.0 0.00
14 Europe 6.52 0.0 0.00
This is another way:
df2 = df.set_index('Region').diff(axis=1).reset_index()
df2['2021'] = df['2021']
or:
df.iloc[:,1:].where(df.iloc[:,1:].ne(0).cumsum(axis=1).eq(1),0)
Output:
2021 2022 2023
0 0.00 0.0 0.00
1 0.50 0.0 0.00
2 4.40 0.0 0.00
3 0.00 8.0 0.00
4 0.00 0.0 1.75
5 0.00 0.0 0.00
6 0.00 0.0 2.00
7 0.00 0.0 0.50
8 6.00 0.0 0.00
9 7.50 0.0 0.00
10 3.75 0.0 0.00
11 3.50 0.0 0.00
12 3.80 0.0 0.00
13 0.00 0.0 0.00
14 6.52 0.0 0.00
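Note that this last expression returns only the year columns. To keep Region, one option (a small sketch of that step, not shown above) is to assign the result back into the original frame:
num = df.iloc[:, 1:]                                            # the year columns only
df.iloc[:, 1:] = num.where(num.ne(0).cumsum(axis=1).eq(1), 0)   # zero everything after the first nonzero in each row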

New data frame with the % of the debt paid in the month of the payment

I have two dataframes, df1 and df2: one with clients' debt, the other with client payments and their dates.
I want to create a new data frame with the % of the debt paid in the month of the payment, up until 01-2017.
import pandas as pd

d1 = {'client number': ['2', '2', '3', '6', '7', '7', '8', '8', '8', '8', '8', '8', '8', '8'],
      'month': [1, 2, 3, 1, 10, 12, 3, 5, 8, 1, 2, 4, 5, 8],
      'year': [2013, 2013, 2013, 2019, 2013, 2013, 2013, 2013, 2013, 2014, 2014, 2015, 2016, 2017],
      'payment': [100, 100, 200, 10000, 200, 100, 300, 500, 200, 100, 200, 200, 500, 50]}
df1 = pd.DataFrame(data=d1).set_index('client number')
df1

d2 = {'client number': ['2', '3', '6', '7', '8'],
      'debt': [200, 600, 10000, 300, 3000]}
df2 = pd.DataFrame(data=d2)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2013, 2014, 2015, 2016, 2017]
for x in month and y in year
    if df1['month'] = x and df1['year'] = year:
        df2[month & year] = df1['payment'] / df2['debt']
The result needs to be something like this for all the clients.
What am I missing?
Thank you for your time and help.
First set the index of both dataframes, df1 and df2, to client number. Then use Index.map to map the client numbers in df1 to their corresponding debts from df2, and use Series.div to divide each client's payments by their respective debt, which gives the fraction of the debt that was paid. Then create a new date column in df1 from the month and year columns, and finally use DataFrame.join along with DataFrame.pivot_table:
df1 = df1.set_index('client number')   # skip this line if df1 is already indexed by client number, as in the snippet above
df2 = df2.set_index('client number')
df1['pct'] = df1['payment'].div(df1.index.map(df2['debt'])).round(2)
df1['date'] = df1['year'].astype(str) + '-' + df1['month'].astype(str).str.zfill(2)
df3 = (
    df2.join(
        df1.pivot_table(index=df1.index, columns='date', values='pct', aggfunc='sum').fillna(0))
    .reset_index()
)
Result:
# print(df3)
client number debt 2013-01 2013-02 2013-03 2013-05 2013-08 ... 2013-12 2014-01 2014-02 2015-04 2016-05 2017-08 2019-01
0 2 200 0.5 0.5 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.0
1 3 600 0.0 0.0 0.33 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.0
2 6 10000 0.0 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 1.0
3 7 300 0.0 0.0 0.00 0.00 0.00 ... 0.33 0.00 0.00 0.00 0.00 0.00 0.0
4 8 3000 0.0 0.0 0.10 0.17 0.07 ... 0.00 0.03 0.07 0.07 0.17 0.02 0.0
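The question also asks for payments only until 01-2017. One way to apply that cutoff (a sketch, assuming January 2017 should be included) is to filter df1 on the date column after it is built and before pivoting:
df1 = df1[df1['date'] <= '2017-01']   # zero-padded 'YYYY-MM' strings sort chronologically, so a string comparison is enough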

Get proportionate values of columns in a dataframe - Pandas

I have a dataframe like this,
ds 0 1 2 4 5 6
0 1991Q3 nan nan nan nan 1.0 nan
1 2014Q2 1.0 3.0 nan nan 1.0 nan
2 2014Q3 1.0 nan nan 1.0 4.0 nan
3 2014Q4 nan nan nan 2.0 3.0 nan
4 2015Q1 nan 1.0 2.0 4.0 4.0 nan
I would like the row-wise proportions across columns 0-6, like this:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.16 0.00 0.00 0.16 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Is there a pandas way to do this? Any suggestion would be great.
You can do this:
df = df.replace(np.nan, 0)
df = df.set_index('ds')
In [3194]: df.div(df.sum(1),0).reset_index()
Out[3194]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Or you can use df.apply:
In [3196]: df = df.replace(np.nan, 0)
In [3197]: df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: x/x.sum(), axis=1)
In [3198]: df
Out[3198]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Set the first column as the index, get the sum of each row, divide the main dataframe by the sums, and fill the null entries with 0:
res = df.set_index("ds")
res.fillna(0).div(res.sum(1),axis=0)
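One edge case worth noting (an assumption about data not shown here, since every sample row has at least one value): a row that is entirely NaN sums to 0, and dividing by that 0 produces NaN again. A trailing fillna(0) covers it:
res = df.set_index("ds")
out = res.fillna(0).div(res.sum(1), axis=0).fillna(0).reset_index()   # the second fillna handles all-NaN rows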

pandas.Series.drop_duplicates for picking a single value from a subpart of a column

I have a DataFrame df filled with rows and columns where there are duplicate Id's:
Index A B
0 0.00 0.00
1 0.00 0.00
29 0.50 105.00
36 0.80 167.00
37 0.80 167.00
42 1.00 209.00
44 0.50 105.00
45 0.50 105.00
46 0.50 105.00
50 0.00 0.00
51 0.00 0.00
52 0.00 0.00
53 0.00 0.00
When I use:
df.drop_duplicates(subset=['A'], keep='last')
I get:
Index A B
37 0.80 167.00
42 1.00 209.00
46 0.50 105.00
53 0.00 0.00
Which makes sense, that's what the function does. However, what I actually would like to achieve is something like:
Index A B
1 0.00 0.00
29 0.50 105.00
37 0.80 167.00
42 1.00 209.00
46 0.50 105.00
53 0.00 0.00
Basically, from each consecutive subpart of column A, such as (0.00, 0.00) or (0.80, 0.80), I want to pick the last value.
It is also important that the values in column A stay in this order (0; 0.5; 0.8; 1; 0.5; 0) and do not get mixed.
Compare each value with the next one using Series.ne and Series.shift(-1), then filter by boolean indexing:
df1 = df[df['A'].ne(df['A'].shift(-1))]
print (df1)
A B
Index
1 0.0 0.0
29 0.5 105.0
37 0.8 167.0
42 1.0 209.0
46 0.5 105.0
53 0.0 0.0
Details:
print (df['A'].ne(df['A'].shift(-1)))
Index
0 False
1 True
29 True
36 False
37 True
42 True
44 False
45 False
46 True
50 False
51 False
52 False
53 True
Name: A, dtype: bool
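An equivalent way to express the same idea (a sketch, not part of the answer above) is to build a run id that increments whenever the value in A changes, then keep the last row of each run:
run_id = df['A'].ne(df['A'].shift()).cumsum()   # consecutive equal values share an id
df1 = df.groupby(run_id).tail(1)                # last row of each consecutive run
This yields the same rows as the shift(-1) comparison and keeps them in their original order.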

Using pivot_table or pd.groupby for certain seasons

I have a data set like the one below:
Year Month Dryden 3rdAve Clark Landfill
0 2015 1 0.00 0.00 0.0 NaN
1 2015 1 0.00 0.00 0.0 NaN
2 2015 1 0.00 0.00 0.0 NaN
3 2015 1 0.00 0.00 0.0 NaN
4 2015 1 0.00 0.00 0.0 NaN
5 2015 1 0.00 0.00 0.0 NaN
6 2015 1 0.00 0.00 0.0 NaN
7 2015 1 0.00 0.00 0.0 NaN
8 2015 1 0.00 0.00 0.0 NaN
9 2015 1 0.00 0.00 0.0 NaN
10 2015 1 0.00 0.00 0.0 NaN
11 2015 1 0.00 0.00 0.0 NaN
I want to run the code below to calculate the mean of the Dryden values for each season:
df.Dryden.groupby([df.Year,pd.cut(df.Month,[0,3,6,9,12],labels=['Winter','Spring','Summer','Autumn'],right =False)]).mean()
I am getting this error:
TypeError: '>' not supported between instances of 'int' and 'str'
df.dtypes gives me:
Year int64
Month object
Dryden float64
3rdAve float64
Clark float64
Landfill float64
dtype: object
I was wondering if anyone could help me.
Convert your Month column to an integer like this:
df.Month = df.Month.astype(int)
Then run your code:
In [61]: df.Dryden.groupby([df.Year,pd.cut(df.Month,[0,3,6,9,12],labels=['Winter','Spring','Summer','Autumn'],right =False)]).mean()
Out[61]:
Year Month
2015 Winter 0.0
Name: Dryden, dtype: float64
If you're getting a value error, perhaps this works instead:
df.Month = pd.to_numeric(df.Month, errors='coerce')
Add this:
df['Month'] = df['Month'].astype(int)
Or:
df['Month'] = df['Month'].astype(float).astype(int)
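One more thing to check (not raised in the answers above): with right=False the bins [0, 3, 6, 9, 12] only cover months 1 through 11, so December falls outside every interval and is silently dropped from the groupby. If December is meant to land in the Autumn bucket, extending the last edge fixes that (a sketch under that assumption):
season = pd.cut(df.Month, [0, 3, 6, 9, 13],
                labels=['Winter', 'Spring', 'Summer', 'Autumn'], right=False)
df.Dryden.groupby([df.Year, season]).mean()
If December should instead count as Winter, grouping on (df.Month % 12) before cutting is another option.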
