Pandas apply lambda to a function based on condition - python

I have a data frame of rental data and would like to annualise the rent based on whether a frequency column states that the rent is monthly, i.e. price * 12.
The frequency column contains the following values - 'Yearly', 'Monthly', nan
I have tried - np.where(df['frequency'] == "Monthly", df['price'].apply(lambda x: x*12), 0)
However, where there is monthly data, the figure is repeated 12 times rather than multiplied by 12.
I need the price multiplied by 12 but can't figure out how to do this.

The problem is that your price column contains strings, not numeric values.
If you load your dataframe from a file (csv, xlsx), pass thousands=',' as a parameter of pd.read_csv or pd.read_excel so that a string like '4,500' is interpreted as the number 4500.
Demo:
import pandas as pd
import io
csvdata = """\
frequency;price
Monthly;4,500
Yearly;30,200
"""
df1 = pd.read_csv(io.StringIO(csvdata), sep=';')
df2 = pd.read_csv(io.StringIO(csvdata), sep=';', thousands=',')
For df1:
>>> df1
frequency price
0 Monthly 4,500
1 Yearly 30,200
>>> df1.dtypes
frequency object
price object # not numeric
dtype: object
>>> df1['price'] * 2
0 4,5004,500
1 30,20030,200
Name: price, dtype: object
For df2:
>>> df2
frequency price
0 Monthly 4500
1 Yearly 30200
>>> df2.dtypes
frequency object
price int64 # numeric
dtype: object
>>> df2['price'] * 2
0 9000
1 60400
Name: price, dtype: int64

It seems there are strings instead of numeric floats in the price column, so first replace ',' with '.', then convert to float, and finally multiply by 12:
np.where(df['frequency'] == "Monthly", df['price'].str.replace(',','.').astype(float)*12, 0)
If values are thousands separated by , replace by empty string:
np.where(df['frequency'] == "Monthly", df['price'].str.replace(',','').astype(float)*12, 0)
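Putting the pieces together, here is a minimal self-contained sketch of the whole flow (the sample frame and the annual_rent column name are made up for illustration; NaN frequencies get 0, as in the snippets above):
import numpy as np
import pandas as pd

# sample data mirroring the question: thousands-separated price strings and a NaN frequency
df = pd.DataFrame({'frequency': ['Monthly', 'Yearly', np.nan],
                   'price': ['4,500', '30,200', '1,000']})

# strip the thousands separator and convert to numbers
price = pd.to_numeric(df['price'].str.replace(',', '', regex=False))

# annualise only the monthly rows, everything else gets 0
df['annual_rent'] = np.where(df['frequency'] == 'Monthly', price * 12, 0)
print(df)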

Related

Aggregating datetime64[ns] and float columns in a pandas dataframe

I have a pandas dataframe which looks like the one below.
racer      race_time_1  race_time_2  1st_Place  2nd_Place  ...
joe shmo   0:24:12      NaN          1          0
joe shmo   NaN          0:32:43      0          0
joe shmo   NaN          0:30:21      0          1
sally sue  NaN          0:29:54      1          0
I would like to group all the rows by racer name to show me total race times, places, etc.
I am attempting to do this with
df.groupby('racer', dropna=True).agg('sum')
Each column was initially loaded as an object dtype which causes issues when aggregating numbers with anything that isn't a null value.
For the race_time values, after lots of searching I tried changing the columns to datetime64[ns] dtypes with dummy data for day/month/year, but instead of summing the race_time columns they are dropped from the dataframe when the groupby function is called.
The opposite issue arises when I change 1st_Place and 2nd_place to float dtypes. When groupby is called, the aggregation will work as expected, but every other column is dropped (the object columns that is).
For example, with joe shmo I would want to see:
racer     race_time_1  race_time_2  1st_Place  2nd_Place
joe shmo  0:24:12      1:03:04      1          1
How can I get pandas to aggregate my dataframe like this?
Use:
import numpy as np
import pandas as pd

# function for formatting timedeltas as HH:MM:SS strings
def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))

# convert Place columns to numeric
cols1 = df.filter(like='Place').columns
df[cols1] = df[cols1].apply(pd.to_numeric)

# convert time columns to timedeltas and then to integer nanoseconds
cols = df.filter(like='time').columns
df[cols] = df[cols].fillna('0').apply(pd.to_timedelta).astype(np.int64)

# aggregate sum
df = df.groupby('racer', dropna=True).sum()

# convert the integer nanoseconds back to timedeltas and format them
df[cols] = df[cols].apply(lambda x: pd.to_timedelta(x).map(f))
print(df)
race_time_1 race_time_2 1st_Place 2nd_Place
racer
joe shmo 00:24:12 01:03:04 1 1
sally sue 00:00:00 00:29:54 1 0
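Depending on the pandas version, the round-trip through integer nanoseconds may not be needed: timedelta columns can usually be summed directly in a groupby (NaT is skipped), at the cost of getting Timedelta values instead of formatted HH:MM:SS strings. A rough sketch of that variant, under those assumptions:
import pandas as pd

time_cols = ['race_time_1', 'race_time_2']
place_cols = ['1st_Place', '2nd_Place']

# parse '0:24:12'-style strings into Timedelta; NaN becomes NaT
df[time_cols] = df[time_cols].apply(pd.to_timedelta)
df[place_cols] = df[place_cols].apply(pd.to_numeric)

# sum per racer; Timedelta columns are aggregated alongside the numeric ones
out = df.groupby('racer')[time_cols + place_cols].sum()
print(out)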

Inconsistent AttributeError: 'str' object has no attribute

I am learning how to create heatmaps from CSV datasets using Pandas, Seaborn and Numpy.
# Canada Cases Year overview - Heatmap
# Read file and separate needed data subset
canada_df = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv', usecols = [0, 1, 2], index_col = 0, parse_dates=[0])
canada_df.info()
canada_df.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 110370 entries, 2020-01-22 to 2021-08-09
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   Country    110370 non-null  object
 1   Confirmed  110370 non-null  int64
dtypes: int64(1), object(1)

                Country  Confirmed
Date
2020-01-22  Afghanistan          0
2020-01-23  Afghanistan          0
2020-01-24  Afghanistan          0
2020-01-25  Afghanistan          0
2020-01-26  Afghanistan          0
#Filtering data for Canadian values only
canada_df.loc[canada_df['Country']=='Canada']
#Isolating needed subset
canada_cases = canada_df['Confirmed']
canada_cases.head()
# create a copy of the dataframe, and add columns for month and year
canada_heatmap = canada_cases.copy()
canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
# group by month and year, get the average
canada_heatmap = canada_heatmap.groupby(['month', 'year']).mean()
At this point I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-54-787f01af1859> in <module>
2 canada_heatmap = canada_cases.copy()
3 canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
----> 4 canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
5 # group by month and year, get the average
6 canada_heatmap = canada_heatmap.groupby(['month', 'year']).mean()
<ipython-input-54-787f01af1859> in <listcomp>(.0)
2 canada_heatmap = canada_cases.copy()
3 canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
----> 4 canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
5 # group by month and year, get the average
6 canada_heatmap = canada_heatmap.groupby(['month', 'year']).mean()
AttributeError: 'str' object has no attribute 'year'
I'm stuck on how to solve this, as the line above is pretty much the same but doesn't raise the same issue. Does anyone know what's going on here?
Some of your index values are not in a date format (2 elements are strings, namely the last two elements).
# check the type of the elements in index
count = pd.Series(canada_heatmap.index).apply(type).value_counts()
print(count)
<class 'pandas._libs.tslibs.timestamps.Timestamp'> 110370
<class 'str'> 2
Name: Date, dtype: int64
# remove them
canada_heatmap = canada_heatmap.iloc[:-2]
I reproduced your error.
Here
canada_cases = canada_df['Confirmed']
you're extracting one column of the dataset, and it becomes a Series object, not a DataFrame, which then carries over to canada_heatmap.
type(canada_heatmap)
>>> pandas.core.series.Series
As such, using an assignment with
canada_heatmap['month'] = ANYTHING
creates a new record in the series with the index value "month", not a new column.
Thus, on the first pass canada_heatmap.index is still a DatetimeIndex and has the .year and .month attributes, but it breaks on the next line, because by then the index also contains the plain string 'month', and strings don't have a .year attribute.
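A tiny illustration of that pitfall (a made-up two-row Series, not the question's data): assigning to a missing label on a Series enlarges it with a new index entry rather than adding a column:
import pandas as pd

s = pd.Series([10, 20], index=pd.to_datetime(['2020-01-01', '2020-01-02']))
s['month'] = 1           # enlarges the Series: adds an index *label* 'month', not a column
print(s.index.tolist())  # two Timestamps plus the plain string 'month'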
Instead do:
import pandas as pd
covid_all_countries = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv', usecols = [0, 1, 2], index_col = 0, parse_dates=[0])
covid_canada_confirmed = covid_all_countries.loc[covid_all_countries['Country']=='Canada']
canada_heatmap = covid_canada_confirmed.copy()
canada_heatmap.drop(columns='Country', inplace=True)
canada_heatmap['month'] = canada_heatmap.index.month
canada_heatmap['year'] = canada_heatmap.index.year
Note, that the last two statements are equivalent to what you were trying to achieve but without looping through all the values (even if using list comprehension). This is clearer, more concise and considerably faster.
A couple comments:
This line does nothing:
#Filtering data for Canadian values only
canada_df.loc[canada_df['Country']=='Canada']
You need to assign the filtering to a value like this:
#Filtering data for Canadian values only
canada_df_filt = canada_df.loc[canada_df['Country']=='Canada'].copy()
Next, add the month/year columns while the data is still a DataFrame (before extracting a single column as a Series), like this:
canada_heatmap = canada_df_filt.copy()
canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
This works on my machine.

How to Loop over Numeric Column in Pandas Dataframe and filter Values?

df:
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
3 Foxtrix Ammy thirty 2000
4 Hensaui giny 33 ten
5 menuia rony fifty 7000
6 lopex nick 23 Ninety
I want to loop over the numeric columns (Age, Salary) to check whether each value is numeric; if a string value is present in a numeric column, filter out that record and create a new data frame without those errors.
Output :
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
You could extend this answer to filter on multiple columns for numerical data types:
import pandas as pd
from io import StringIO
data = """
Org_Name,Emp_Name,Age,Salary
Axempl,Rick,29,1000
Lastik,John,34,2000
Xenon,sidd,47,9000
Foxtrix,Ammy,thirty,2000
Hensaui,giny,33,ten
menuia,rony,fifty,7000
lopex,nick,23,Ninety
"""
df = pd.read_csv(StringIO(data))
print('Original dataframe\n', df)
df = df[(df.Age.apply(lambda x: x.isnumeric())) &
        (df.Salary.apply(lambda x: x.isnumeric()))]
print('Filtered dataframe\n', df)
gives
Original dataframe
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
3 Foxtrix Ammy thirty 2000
4 Hensaui giny 33 ten
5 menuia rony fifty 7000
6 lopex nick 23 Ninety
Filtered dataframe
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
I believe this can be solved using Pandas' "to_numeric" function.
import pandas as pd
df['Column to Check'] = pd.to_numeric(df['Column to Check'], downcast='integer', errors='coerce')
df.dropna(axis=0, inplace=True)
Where 'Column to Check' is the column name that you are checking for values that cannot be cast as an integer (or any numeric type); in your question I believe you will want to apply this code to 'Age' and 'Salary'. "to_numeric" will convert any values in those columns to NaN if they could not be cast as your selected type. The "dropna" method will remove all rows that have a NaN in any of your columns.
To loop over the columns like you ask, you could do the following:
for col in ['Age', 'Salary']:
    df[col] = pd.to_numeric(df[col], downcast='integer', errors='coerce')
df.dropna(axis=0, inplace=True)
EDIT:
In response to harry's comment: if there are preexisting NaNs in the data, something like the following should keep any valid row that had a preexisting NaN in one of the other columns.
for col in ['Age', 'Salary']:
    df[col] = pd.to_numeric(df[col], downcast='integer', errors='coerce')
    df = df[df[col].notnull()]
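A loop-free variant of the same idea (a sketch; it converts the two columns at once and keeps only rows where both parse as numbers, leaving NaNs in other columns untouched):
import pandas as pd

cols = ['Age', 'Salary']
converted = df[cols].apply(pd.to_numeric, errors='coerce')

# keep rows where every checked column parsed as a number
mask = converted.notna().all(axis=1)
df_clean = df.loc[mask].copy()
df_clean[cols] = converted.loc[mask]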
You can use a mask to indicate whether or not there is a string type among the Age and Salary columns:
mask_str = (df[['Age', 'Salary']]
            .applymap(lambda x: str(type(x)))
            .sum(axis=1)
            .str.contains("str"))
df[~mask_str]
This is assuming that the dataframe already contains the proper types. If not, you can convert them using the following:
def convert(val):
    try:
        return int(val)
    except ValueError:
        return val

df = (df.assign(Age=lambda f: f.Age.apply(convert),
                Salary=lambda f: f.Salary.apply(convert)))

Dividing rows for specific columns by date+n in Pandas

I want to divide rows in my dataframe via specific columns.
That is, I have a column named 'ticker', and each ticker has associated 'date' and 'price' values.
I want to divide the price at date[i+2] by the price at date[i], where i and i+2 mean the DAY and the DAY + 2 for that ticker. The date is also in a proper datetime format for operations using Pandas.
The data looks like:
date        ticker  price
2002-01-30  A       20
2002-01-31  A       21
2002-02-01  A       21.4
2002-02-02  A       21.3
.
.
That means I want to select the price based on the ticker and on the DAY and DAY + 2, specifically for each ticker, to calculate the ratio price[i+2] / price[i].
I've considered using iloc but I'm not sure how to select for specific tickers only to do the math on.
use groupby:
df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
0 NaN
1 NaN
2 1.070000
3 1.014286
Name: price, dtype: float64
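For example, to keep the result as a new column (a small made-up frame shaped like the question's data; the ratio_2d name is just for illustration):
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2002-01-30', '2002-01-31', '2002-02-01', '2002-02-02']),
    'ticker': ['A', 'A', 'A', 'A'],
    'price': [20, 21, 21.4, 21.3],
})

# make sure rows are ordered by date within each ticker before shifting
df = df.sort_values(['ticker', 'date'])
df['ratio_2d'] = df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
print(df)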

How to modify the Pandas DataFrame and insert new columns

I have some data with the information provided below.
The df.info() output is below:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6662 entries, 0 to 6661
Data columns (total 2 columns):
value 6662 non-null float64
country 6478 non-null object
dtypes: float64(1), object(1)
memory usage: 156.1+ KB
None
list of the columns,
[u'value' 'country']
the df is below,
value country
0 550.00 USA
1 118.65 CHINA
2 120.82 CHINA
3 86.82 CHINA
4 112.14 CHINA
5 113.59 CHINA
6 114.31 CHINA
7 111.42 CHINA
8 117.21 CHINA
9 111.42 CHINA
--------------------
--------------------
6655 500.00 USA
6656 500.00 USA
6657 390.00 USA
6658 450.00 USA
6659 420.00 USA
6660 420.00 USA
6661 450.00 USA
I need to add another column, namely outlier, and put 1
if the data point is an outlier for that respective country;
otherwise, I need to put 0. I emphasize that the outliers will need to be computed per respective country and NOT for all countries together.
I found some formulas for calculating the outliers which may be of help, for example:
# keep only the ones that are within +3 to -3 standard deviations
def exclude_the_outliers(df):
    df = df[np.abs(df.col - df.col.mean()) <= (3 * df.col.std())]
    return df

def exclude_the_outliers_extra(df):
    LOWER_LIMIT = .35
    HIGHER_LIMIT = .70
    filt_df = df.loc[:, df.columns == 'value']
    # Then, computing percentiles.
    quant_df = filt_df.quantile([LOWER_LIMIT, HIGHER_LIMIT])
    # Next, filtering values based on computed percentiles. To do that I use
    # an apply by columns and that's it!
    filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[LOWER_LIMIT, x.name]) &
                                        (x < quant_df.loc[HIGHER_LIMIT, x.name])], axis=0)
    filt_df = pd.concat([df.loc[:, df.columns != 'value'], filt_df], axis=1)
    filt_df.dropna(inplace=True)
    return df
I was not able to use those formulas properly for this purpose, but I provide them as a suggestion.
Finally, I will need to count the percentage of outliers for the
USA and CHINA present in the data.
How can I achieve that?
Note: adding the outlier column with all zeros is easy in
pandas and should look like this:
df['outlier'] = 0
However, the issue remains of finding the outliers and overwriting the
zeros with 1 for each respective country.
You can slice the dataframe by each country, calculate the quantiles for the slice, and set the value of outlier at the index of the country.
There might be a way to do it without iteration, but it is beyond me.
# using True/False for the outlier, it is the same as 1/0
df['outlier'] = False

# set the quantile limits
low_q = 0.35
high_q = 0.7

# iterate over each country
for c in df.country.unique():
    # subset the dataframe where the country = c, get the quantiles
    q = df.value[df.country == c].quantile([low_q, high_q])
    # at the row index where the country column equals `c` and the column is `outlier`,
    # set the value to True or False based on whether the `value` column is within
    # the quantiles
    df.loc[df.index[df.country == c], 'outlier'] = (df.value[df.country == c]
                                                    .apply(lambda x: x < q[low_q] or x > q[high_q]))
Edit: To get the percentage of outliers per country, you can groupby the country column and aggregate using the mean.
gb = df[['country', 'outlier']].groupby('country').mean()
for row in gb.itertuples():
    print('Percentage of outliers for {: <12}: {:.1f}%'.format(row[0], 100 * row[1]))
# output:
# Percentage of outliers for China : 54.0%
# Percentage of outliers for USA : 56.0%
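The answer above mentions there might be a way to avoid the per-country loop; one possible loop-free sketch uses groupby/transform to broadcast the per-country quantile bounds back onto each row (same 0.35/0.70 limits assumed):
import pandas as pd

low_q, high_q = 0.35, 0.70

# per-country quantile bounds, aligned row by row via transform
lower = df.groupby('country')['value'].transform(lambda s: s.quantile(low_q))
upper = df.groupby('country')['value'].transform(lambda s: s.quantile(high_q))

df['outlier'] = ((df['value'] < lower) | (df['value'] > upper)).astype(int)

# percentage of outliers per country
print(df.groupby('country')['outlier'].mean() * 100)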
