How to get the most recurring row from a pandas dataframe column - python

I would like to get the most recurring amount, alongside its description, from the dataframe below. The actual dataframe is longer than what I display here.
dataframe
description amount type cosine_group
1 295302|service fee 295302|microloa 1500.0 D 24
2 1292092|rpmt microloan|71302 20000.0 D 31
3 qr q10065116702 fund trf 0002080661 30000.0 D 12
4 qr q10060597280 fund trf 0002080661 30000.0 D 12
5 1246175|service fee 1246175|microlo 3000.0 D 24
6 qr q10034118487 fund trf 0002080661 2000.0 D 12
Here I tried using the groupby function:
df.groupby(['cosine_group'])['amount'].value_counts()[:2]
the above code returns
cosine_group amount
12 30000.0 7
30000.0 6
I need the description alongside the most recurring amount.
Expected output is:
description amount
qr q10065116702 fund trf 0002080661 30000.0
qr q10060597280 fund trf 0002080661 30000.0

You can use mode:
description amount type
0 A 15
1 B 2000
2 C 3000
3 C 3000
4 C 3000
5 D 30
6 E 20
7 A 15
df[df['amount type'].eq(df['amount type'].mode().loc[0])]
description amount type
2 C 3000
3 C 3000
4 C 3000
Explanation:
df[mask]  # slices the dataframe based on a boolean Series (selects the True rows), which is called a mask
df['amount type'].eq(3000)  # .eq() stands for "equal"; it is a synonym for == in pandas
df['amount type'].mode()  # returns the mode, i.e. the most common value(s), as a Series
df['amount type'].mode().loc[0]  # selects the element at index 0, to get a scalar instead of a Series
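Applied to your actual columns, a minimal sketch (assuming the columns are named description and amount as in the question, and reconstructing a few rows for illustration):
import pandas as pd

# hypothetical reconstruction of a few rows from the question
df = pd.DataFrame({
    'description': ['qr q10065116702 fund trf 0002080661',
                    'qr q10060597280 fund trf 0002080661',
                    '1246175|service fee 1246175|microlo'],
    'amount': [30000.0, 30000.0, 3000.0],
})

# keep only the rows whose amount equals the most frequent amount
top = df['amount'].mode().loc[0]
result = df.loc[df['amount'].eq(top), ['description', 'amount']]
print(result)
If several amounts tie for most frequent, mode() returns all of them, so you may prefer df['amount'].isin(df['amount'].mode()) over comparing against the first one only.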

Related

How to convert rows into columns in python

I have a dataframe like the one below, which contains 4 columns. I want to convert each unique inventory item number (Z15, Z17 and so on) under the "inv" column into a new column, with the "info" value corresponding to each store and period. The transpose function does not work in this situation. Also, if I use the pivot_table or groupby functions, I won't be able to get the values "High", "Medium" and so on.
Note that for the "info" column I have many different combinations of categorical and numerical values in the real dataset. Also, the real dataset has over 100 stores, over 400 inventory items and over 30 periods. This is a simplified version of the data to demonstrate my idea. Any suggestions or advice are greatly appreciated.
import pandas as pd
import numpy as np

inv = ['Z15','Z15','Z15','Z15','Z15','Z15','Z15','Z15','Z15','Z17','Z17','Z17','Z17','Z17','Z17','Z17']
store = ['store1','store1','store1','store2','store2','store2','store2','store2','store2','store3','store4','store5','store6','store7','store1','store2']
period = [2018,2019,2020,2015,2016,2017,2018,2019,2020,2022,2022,2022,2022,2022,2018,2019]
info = ['0.84773','0.8487','0.82254','0.75','0.65','0.432','0.546','0.777','0.1','High','High','Medium','Very Low','Low','High','Low']
df = pd.DataFrame({'inv': inv,
                   'store': store,
                   'period': period,
                   'info': info})
You're looking for pivot:
df.pivot(index=['store', 'period'], columns='inv', values='info').reset_index()
Output:
inv store period Z15 Z17
0 store1 2018 0.84773 High
1 store1 2019 0.8487 NaN
2 store1 2020 0.82254 NaN
3 store2 2015 0.75 NaN
4 store2 2016 0.65 NaN
5 store2 2017 0.432 NaN
6 store2 2018 0.546 NaN
7 store2 2019 0.777 Low
8 store2 2020 0.1 NaN
9 store3 2022 NaN High
10 store4 2022 NaN High
11 store5 2022 NaN Medium
12 store6 2022 NaN Very Low
13 store7 2022 NaN Low
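Note that passing a list as index to pivot requires pandas 1.1 or newer, and pivot raises a ValueError if a (store, period, inv) combination appears twice. A hedged sketch of a fallback for that case: pivot_table with aggfunc='first', which keeps the first occurrence per cell and, unlike the default aggfunc='mean', works for string values like "High":
# pivot_table with aggfunc='first' keeps the first value per cell and
# does not drop non-numeric columns the way the default aggfunc would
out = (df.pivot_table(index=['store', 'period'], columns='inv',
                      values='info', aggfunc='first')
         .reset_index())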

Can't fill NaN values in pandas even with inplace flag

I have a pandas dataframe containing NaN values for some column.
I'm trying to fill them with a default value (30), but it doesn't work.
Original dataframe:
type avg_speed
0 CAR 32.0
1 CAR NaN
2 CAR NaN
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR NaN
8 BIKE NaN
9 BIKE 35.1
...
Desired result:
type avg_speed
0 CAR 32.0
1 CAR 30
2 CAR 30
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR 30
8 BIKE 30
9 BIKE 35.1
My code:
def fill_with_default(pandas_df, column_name, default_value):
    print(f"Total count: {pandas_df.count()}")
    print(f"Count of Nan BEFORE: {pandas_df[column_name].isna().sum()}")
    pandas_df[column_name].fillna(default_value, inplace=True)
    print(f"Count of Nan AFTER: {pandas_df[column_name].isna().sum()}")
    return pandas_df
df = fill_with_default(df, "avg_speed", 30)
Output:
Total count: 105018
Count of Nan BEFORE: 49514
Count of Nan AFTER: 49514
The chain of dataframe transformations and the list of columns are too long, so it's difficult to show all the steps (join with another dataframe, drop useless columns, add useful columns, join with other dataframes, filter, etc.).
I've tried other options but they also don't work:
#pandas_df.fillna({column_name: default_value}, inplace=True)
#pandas_df.loc[pandas_df[column_name].isnull(),column_name] = default_value
...
The type of the column before applying "fillna" is float64, the same as default_value.
Therefore, my question is: what could be the potential reasons for this problem?
What kind of transformation can lead to it? This method works for another, similar dataframe; the only difference between them lies in the chain of transformations.
BTW, a warning is logged at this point:
/home/hadoop/.local/lib/python3.6/site-packages/pandas/core/generic.py:6287: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)
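For what it's worth, the warning quoted above is the usual culprit: it means pandas_df is (or may be) a copy of a slice of another dataframe, so the in-place fillna can end up modifying a temporary object that is then discarded. A minimal sketch of the common fix, assuming the function is allowed to return a new dataframe rather than mutate its argument:
def fill_with_default(pandas_df, column_name, default_value):
    # work on an explicit copy so we never write into a slice of another frame
    pandas_df = pandas_df.copy()
    # assign the result back instead of relying on inplace=True
    pandas_df[column_name] = pandas_df[column_name].fillna(default_value)
    return pandas_df
Alternatively, add .copy() at the point in the transformation chain where the slice is created (e.g. df = big_df[mask].copy(), where big_df and mask stand in for whatever your chain actually does).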

Naming dataframe columns based on the content of one of the row indices

Here is my dataframe after reading the CSV file and splitting it into columns.
Index 0 1 2
0 Dylos Logger v 1.6.0.0 None None
1 Unit: DC1700 v 2.08 None None
2 Date/Time: 12-07-15 11:11 None None
3 ------------------------- None None
4 Particles per cubic foot None None
5 ------------------------- None None
6 Date/Time Small Large
7 11-27-15 10:08 161200 8300
8 11-27-15 10:09 136500 8700
9 11-27-15 10:10 124000 8400
10 11-27-15 10:11 127300 7900
I would like to name my columns based on the content of the row at index 6, then get rid of the first 6 rows, and reset the index from zero. This means that I wish my data to look like this:
0 Date/Time Small Large
1 11-27-15 10:08 161200 8300
2 11-27-15 10:09 136500 8700
3 11-27-15 10:10 124000 8400
4 11-27-15 10:11 127300 7900
I know how to remove the first 6 rows and reset the index. But I do not know how to rename the columns based on row 6 in the first step. Can you please help me?
Thanks
import pandas as pd

df = pd.DataFrame({'0': ['a', 'Date/Time', 'x'],
                   '1': ['b', 'Small', 'y'],
                   '2': ['c', 'Large', 'z']})
row_with_column_names = 1  # would be 6 for you

# rename the columns using the values found in that row
df = df.rename(columns={cur_name: new_name
                        for cur_name, new_name in zip(df, df.iloc[row_with_column_names, :])})
df = df.drop(row_with_column_names, axis='index')  # remove the row with the names in it
df = df.reset_index(drop=True)
df
# Produces
#   Date/Time Small Large
# 0         a     b     c
# 1         x     y     z
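A shorter alternative, sketched under the same assumptions: promote the header row to column names directly and keep only the rows below it, which with your real file also drops the six metadata rows in one step:
header_row = 6
df.columns = df.iloc[header_row]                      # use row 6 as the column names
df = df.iloc[header_row + 1:].reset_index(drop=True)  # keep rows 7+, renumber from 0
df.columns.name = None                                # clear the leftover index name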

How to remove rows in a Python dataframe with a condition?

I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains a string, that row should be deleted from the dataframe. I tried:
df[['Leaves','Salary','Performance']].apply(pd.to_numeric, errors='coerce')
but this will convert the values to NaN.
Let's start with a note concerning your sample data:
it contains Nan strings, which are not among the strings automatically recognized as NaNs.
To treat them as NaN, I read the source text with read_fwf, passing na_values=['Nan'].
And now let's get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values.
You also accept a cell if it contains only the word unknown, but you don't accept it if that word is enclosed in, e.g., quotes.
If you change your mind about what is / is not acceptable, change the
above function accordingly.
Then, to leave only rows with all acceptable values in all 3 mentioned
columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
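A vectorized sketch of roughly the same rule, building on the to_numeric attempt from the question (only roughly equivalent: to_numeric also accepts signs and exponents, which the character check above rejects):
cols = ['Leaves', 'Salary', 'Performance']
as_num = df[cols].apply(pd.to_numeric, errors='coerce')
# a cell is acceptable if it parses as a number, was NaN to begin with,
# or is exactly the word 'unknown'
acceptable = as_num.notna() | df[cols].isna() | df[cols].eq('unknown')
result = df[acceptable.all(axis=1)]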

Python: Pandas dataframe re-arrange rows based on last three digits of Integer in Column

I have the following dataframe:
YearMonth Total Cost
2015009 $11,209,041
2015010 $20,581,043
2015011 $37,079,415
2015012 $36,831,335
2016008 $57,428,630
2016009 $66,754,405
2016010 $45,021,707
2016011 $34,783,970
2016012 $66,215,044
YearMonth is an int64 column. A value in YearMonth such as 2015009 stands for September 2015. I want to re-order the rows so that rows whose last 3 digits are the same appear next to each other, sorted by year.
Below is my desired output:
YearMonth Total Cost
2015009 $11,209,041
2016009 $66,754,405
2015010 $20,581,043
2016010 $45,021,707
2015011 $37,079,415
2016011 $34,783,970
2015012 $36,831,335
2016012 $66,215,044
2016008 $57,428,630
I have scoured Google to try to find out how to do this, but to no avail.
df['YearMonth'] = pd.to_datetime(df['YearMonth'], format='%Y0%m')
df['Year'] = df['YearMonth'].dt.year
df['Month'] = df['YearMonth'].dt.month
df.sort_values(['Month', 'Year'])
YearMonth Total Cost Year Month
4 2016-08-01 $57,428,630 2016 8
0 2015-09-01 $11,209,041 2015 9
5 2016-09-01 $66,754,405 2016 9
1 2015-10-01 $20,581,043 2015 10
6 2016-10-01 $45,021,707 2016 10
2 2015-11-01 $37,079,415 2015 11
7 2016-11-01 $34,783,970 2016 11
3 2015-12-01 $36,831,335 2015 12
8 2016-12-01 $66,215,044 2016 12
This is one way of doing it. There may be a quicker way with fewer steps that doesn't involve converting YearMonth to datetime, but if you have a date, it makes more sense to use it.

One way of doing this is to cast your int column to string and use the string accessor with indexing:
df.assign(sortkey=df.YearMonth.astype(str).str[-3:])\
  .sort_values('sortkey')\
  .drop('sortkey', axis=1)
Output:
YearMonth Total Cost
4 2016008 $57,428,630
0 2015009 $11,209,041
5 2016009 $66,754,405
1 2015010 $20,581,043
6 2016010 $45,021,707
2 2015011 $37,079,415
7 2016011 $34,783,970
3 2015012 $36,831,335
8 2016012 $66,215,044
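On pandas 1.1 or newer, a sketch of the same idea without a helper column, assuming YearMonth stays int64: the key argument of sort_values builds one numeric key that orders by month (the last 3 digits) first and by year second.
# (month code) * 10000 + year, e.g. 2015009 -> 9 * 10000 + 2015 = 92015
df.sort_values('YearMonth', key=lambda s: (s % 1000) * 10000 + s // 1000)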
