I have a DataFrame with roughly these columns: date, amount, currency.
There are several CURRENCY types.
I need to create a new column (USD) which will be a calculation of
(AMOUNT * EXCHANGE RATE) based on the CURRENCY type.
There are multiple EXCHANGE RATES to be applied.
I can't figure out the code/approach to do so.
Maybe df.where() should help, but I keep getting errors.
Thank you
df['RUR'] = df.where(df['CUR']=='KES', df['AMOUNT']*3, axis=1)
or
df['RUR'] = df['AMOUNT'].apply(lambda x: x*2 if df['CUR']=='KES' else None)
Use np.where:
import numpy as np
df['RUR'] = np.where(df['CUR'] == 'KES', df['AMOUNT'] * 3, np.nan)
Second solution: you can use .loc and apply the condition in it:
df.loc[df['CUR'] == 'KES', 'RUR'] = df['AMOUNT'] * 3
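Since the question mentions several currencies with different rates, here is a minimal sketch that scales the same idea to many currencies via a rate mapping. The currency codes and rate values below are made-up placeholders, not real exchange rates:
import pandas as pd

# Hypothetical rates keyed by currency code; replace with the real ones
rates = {'KES': 3.0, 'EUR': 1.1, 'GBP': 1.3}

# Map each row's currency to its rate and multiply; currencies missing
# from the dict produce NaN, which makes gaps easy to spot
df['USD'] = df['AMOUNT'] * df['CUR'].map(rates)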
I have a particular problem: I would like to clean and prepare my data, and I have a lot of unknown values in the "highpoint_metres" column of my dataframe (members). Since there is no missing information for "peak_id", I calculated the median height per peak_id to be more accurate.
I would like to do two steps: 1) add a new column to my "members" dataframe holding the median value, which differs depending on the "peak_id" (the value calculated by the code in the question); 2) have the code check whether the value in highpoint_metres is null, and if it is, put the value of the new column there instead. I hope this is clearer.
Code:
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
And I don't know how to continue from there (my level of Python is very bad ;-)).
I believe that's what you're looking for:
import numpy as np
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
# Median highpoint per peak, broadcast back onto every row of that peak
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")
is_highpoint_missing = np.isnan(members.highpoint_metres)
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)
So one way to go about replacing 0 with the median could be:
import numpy as np
df[col_name] = df[col_name].replace({0: np.median(df[col_name])})
You can also use the apply function:
df[col_name] = df[col_name].apply(lambda x: np.median(df[col_name]) if x==0 else x)
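One caveat: np.median returns NaN if the column contains NaN values, whereas pandas' own Series.median() skips them by default, so a safer variant of the same replacement might be:
df[col_name] = df[col_name].replace({0: df[col_name].median()})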
Let me know if this helps.
So adding a little bit more info based on Marie's question.
One way to get median is through groupby and then left join it with the original dataframe.
import numpy as np
df_gp = df.groupby(['peak_id']).agg(Median=('highpoint_metres', 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id', how='left')
# x == np.nan is always False, so test for missing values with isna() instead
df['highpoint_metres'] = np.where(df['highpoint_metres'].isna(), df['Median'], df['highpoint_metres'])
Let me know if this solves your issue
I'm trying to change the values of lat and long in a main dataframe, using postal codes as the key.
The problem is that in the main df I only need to replace the lat and long records where the state is Arizona.
The second df, the substitute, only has postal code, lat, and long columns, so I have to use postal codes as the key.
If I want to replace all values I can use this function:
origin.loc[origin['Postal Code'].isin(substitute['Postal Code']),
           ['Latitude', 'Longitude']] = substitute[['Latitude_ex', 'Longitude_ex']]
But I don't know how to add the condition for the state. I currently use the following to do it, but I would like a more pythonic way:
ari = codigos_cp.query("State == 'Arizona'").copy()
ari = pd.merge(ari , cp_sust, how='left', on='Postal Code')
You can use numpy.where to apply the condition:
import numpy as np
origin['Latitude'] = np.where(origin['State'] == 'Arizona', substitute['Latitude_ex'], origin['Latitude'])
origin['Longitude'] = np.where(origin['State'] == 'Arizona', substitute['Longitude_ex'], origin['Longitude'])
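Note that this assumes substitute is row-aligned with origin. If it is instead keyed by postal code, as in the question, a hedged sketch that merges first and then applies the same condition could look like this (the column names follow the question; everything else is an assumption):
import pandas as pd

# Bring the substitute coordinates in by postal code
merged = origin.merge(substitute, on='Postal Code', how='left')

# Overwrite only the Arizona rows that actually have a substitute value
mask = (merged['State'] == 'Arizona') & merged['Latitude_ex'].notna()
merged.loc[mask, 'Latitude'] = merged.loc[mask, 'Latitude_ex']
merged.loc[mask, 'Longitude'] = merged.loc[mask, 'Longitude_ex']
origin = merged.drop(columns=['Latitude_ex', 'Longitude_ex'])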
I'm pretty new to Python programming. I read a csv file into a dataframe with the median house price of each month as columns. Now I want to create columns holding the mean value of each quarter, e.g. create column housing['2000q1'] as the mean of 2000-01, 2000-02, and 2000-03; column housing['2000q2'] as the mean of 2000-04, 2000-05, and 2000-06; and so on.
The raw dataframe is named 'Housing'.
I tried to use nested for loops as below, but they always come back with errors.
for i in range(2000, 2017):
    for j in range(1, 5):
        Housing[i 'q' j] = Housing[[i'-'j*3-2, i'-'j*3-1, i'_'j*3]].mean(axis=1)
Thank you!
Usually we work with data where the rows are time, so it's good practice to do the same here and transpose your data, starting with df = Housing.set_index('CountyName').T (also, variable names should usually start with a lowercase letter, but that isn't important here). Since your data is already in such a nice format, there is a pragmatic solution, in the sense that you need not know too much about datetime objects and methods:
df = Housing.set_index('CountyName').T
df.reset_index(inplace = True)                          # This moves the dates to a column named 'index'
df.rename(columns = {'index':'quarter'}, inplace = True)  # Rename this column into something more meaningful
df = df[df['quarter'] != 'SizeRank']                    # Drop the SizeRank row to avoid including it in the calculation of means

# Rename the months into the appropriate quarters
# (str.replace has no inplace argument, so assign the result back)
df['quarter'] = df['quarter'].str.replace('-01|-02|-03', 'q1', regex = True)
df['quarter'] = df['quarter'].str.replace('-04|-05|-06', 'q2', regex = True)
df['quarter'] = df['quarter'].str.replace('-07|-08|-09', 'q3', regex = True)
df['quarter'] = df['quarter'].str.replace('-10|-11|-12', 'q4', regex = True)

c = df.drop(columns = 'quarter').notnull().sum(axis = 1)  # Count the number of non-empty entries in each month
df['total'] = df.sum(axis = 1, numeric_only = True)       # The totals for each month
df['c'] = c  # Only assign c after computing the total, so it doesn't interfere with the total column
g = df.groupby('quarter')[['total','c']].sum()
g['q_mean'] = g['total']/g['c']
g
g['q_mean'] or g[['q_mean']] should give you the required answer.
Note that we needed to compute the mean manually because you had missing data; otherwise, df.groupby('quarter').mean().mean() would have immediately given you the answer you needed.
A remark: the technically 'correct' way would be to convert your dates into a datetime-like object (which you can do with the pd.to_datetime() function), then run a groupby with a pd.Grouper argument (pd.TimeGrouper is the older, deprecated name); this would certainly be worth learning more about if you are going to work with time-indexed data a lot.
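For what that route looks like in practice, here is a minimal self-contained sketch on a made-up frame (the column names and values are illustrative only):
import pandas as pd

# Hypothetical frame: one row per month, 'month' holds 'YYYY-MM' strings
df = pd.DataFrame({'month': ['2000-01', '2000-02', '2000-03', '2000-04'],
                   'price': [100.0, 110.0, None, 120.0]})

df['month'] = pd.to_datetime(df['month'], format='%Y-%m')
quarterly = df.groupby(pd.Grouper(key='month', freq='Q'))['price'].mean()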
You can achieve this using pandas' resample function to compute quarterly averages in a very simple way.
pandas resampling: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
A summary of the offset names can be found in the pandas resample documentation.
In order to use this function, you need to have only time as columns, so you should temporarily set CountyName and SizeRank as indexes.
Code:
QuarterlyAverage = Housing.set_index(['CountyName', 'SizeRank'], append = True)
QuarterlyAverage.columns = pd.to_datetime(QuarterlyAverage.columns)  # resample needs a DatetimeIndex on the columns
QuarterlyAverage = QuarterlyAverage.resample('Q', axis = 1).mean()\
                                   .reset_index(['CountyName', 'SizeRank'], drop = False)
Thanks to @jezrael for suggesting axis = 1 in resampling.
I have the following code and would like to create a new column, per Transaction Number and Description, that holds the 99th percentile of each row.
I am really struggling to achieve this; it seems that most posts cover calculating the percentile of a column.
Is there a way to achieve this? I would expect a new column to be created with two rows.
df_baseScenario = pd.DataFrame({'Transaction Number' : [1,10],
'Description' :['asf','def'],
'Calc_PV_CF_2479.0':[4418494.085,-3706270.679],
'Calc_PV_CF_2480.0':[4415476.321,-3688327.494],
'Calc_PV_CF_2481.0':[4421698.198,-3712887.034],
'Calc_PV_CF_2482.0':[4420541.944,-3706402.147],
'Calc_PV_CF_2483.0':[4396063.863,-3717554.946],
'Calc_PV_CF_2484.0':[4397897.082,-3695272.043],
'Calc_PV_CF_2485.0':[4394773.762,-3724893.702],
'Calc_PV_CF_2486.0':[4384868.476,-3741759.048],
'Calc_PV_CF_2487.0':[4379614.337,-3717010.873],
'Calc_PV_CF_2488.0':[4389307.584,-3754514.639],
'Calc_PV_CF_2489.0':[4400699.929,-3741759.048],
'Calc_PV_CF_2490.0':[4379651.262,-3714723.435]})
The following should work:
df['99th_percentile'] = df[cols].apply(lambda x: numpy.percentile(x, 99), axis=1)
I'm assuming here that the variable 'cols' contains a list of the columns you want to include in the percentile (you obviously can't use the Description in your calculation, for example).
What this code does is loop over the rows in the dataframe, and for each row it computes numpy.percentile to get the 99th percentile. You'll need to import numpy.
If you need maximum speed, you can skip the Python-level loop entirely by calling numpy.percentile once on the underlying array with axis=1 (note: np.vectorize would not help here, since it applies a function element by element rather than row by row):
import numpy as np
df['99th_percentile'] = np.percentile(df[cols].values, 99, axis=1)
Slightly modified from @mxbi.
import numpy as np
df = df_baseScenario.drop(['Transaction Number','Description'], axis=1)
df_baseScenario['99th_percentile'] = df.apply(lambda x: np.percentile(x, 99), axis=1)
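If you'd rather select the value columns programmatically from the sample frame above, a small sketch (assuming every relevant column shares the 'Calc_PV_CF_' prefix):
import numpy as np

cols = [c for c in df_baseScenario.columns if c.startswith('Calc_PV_CF_')]
df_baseScenario['99th_percentile'] = np.percentile(df_baseScenario[cols].values, 99, axis=1)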
OK, I'm at half-wit's end. I'm geocoding a dataframe with geopy. I've written a simple function that takes an input (a country name) and returns the latitude and longitude. I use apply to run the function, and it returns a Pandas Series object. I can't seem to convert it to a dataframe. I'm sure I'm missing something obvious, but I'm new to Python and still RTFMing. BTW, the geocoder function works great.
# Import libraries
import os
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim

def locate(x):
    geolocator = Nominatim()
    # print(x)  # debug
    try:
        # Get geocode
        location = geolocator.geocode(x, timeout=8, exactly_one=True)
        lat = location.latitude
        lon = location.longitude
    except:
        # didn't work for some reason that I really don't care about
        lat = np.nan
        lon = np.nan
    # print(lat, lon)  # debug
    return lat, lon  # Note: also tried return {'LAT': lat, 'LON': lon}
df_geo_in = df_addr.drop_duplicates(['COUNTRY']).reset_index() #works perfectly
df_geo_in['LAT'], df_geo_in['LON'] = df_geo_in.applymap(locate)
# error: returns more than 2 values - default index + column with results
I also tried
df_geo_in['LAT','LON'] = df_geo_in.applymap(locate)
I get a single dataframe with no index and a single column with the series in it.
I've tried a number of other methods, including 'applymap':
source_cols = ['LAT','LON']
new_cols = [str(x) for x in source_cols]
df_geo_in = df_addr.drop_duplicates(['COUNTRY']).set_index(['COUNTRY'])
df_geo_in[new_cols] = df_geo_in.applymap(locate)
which returned an error after a long time:
ValueError: Columns must be same length as key
I've also tried manually converting the series to a dataframe using the df.from_dict(df_geo_in) method without success.
The goal is to geocode 166 unique countries, then join the result back to the 188K addresses in df_addr. I'm trying to be pandas-y in my code and not write loops if possible. But I haven't found the magic to convert a Series into a DataFrame, and this is the first time I've tried to use apply.
Thanks in advance - ancient C programmer
I'm assuming that df_geo_in is a df with a single column, so I believe the following should work:
change:
return lat, lon
to
return pd.Series([lat, lon])
then you should be able to assign like so:
df_geo_in[['LAT', 'LON']] = df_geo_in['COUNTRY'].apply(locate)
What you tried to do was assign the result of applymap to 2 new columns, which is incorrect here, as applymap is designed to work on every element in a df; unless the lhs has the same expected shape, this won't give the desired result.
Your latter method is also incorrect because you drop the duplicate countries and then expect this to assign every country's geolocation back, but the shapes are different.
It is probably quicker for large dfs to create a non-duplicated geolocation df and then merge this back to your larger df, like so:
geo_lookup = df_addr.drop_duplicates(['COUNTRY']).copy()  # .copy() avoids a SettingWithCopyWarning
geo_lookup[['LAT', 'LNG']] = geo_lookup['COUNTRY'].apply(locate)
df_addr = df_addr.merge(geo_lookup[['COUNTRY', 'LAT', 'LNG']], on='COUNTRY', how='left')
This will create a df of non-duplicated countries with geolocations, and then we perform a left merge back to the master df.
It's always easier to test with some sample data, but please try the following zip approach to see if it works.
df_geo_in['LAT_LON'] = df_geo_in['COUNTRY'].apply(locate)  # locate returns a (lat, lon) tuple per row
df_geo_in['LAT'], df_geo_in['LON'] = zip(*df_geo_in.LAT_LON)
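For illustration, here is the same unpacking pattern on a toy frame with a stand-in function (the coordinates are fake placeholders):
import pandas as pd

df = pd.DataFrame({'COUNTRY': ['Kenya', 'Peru']})
df['LAT_LON'] = df['COUNTRY'].apply(lambda c: (0.0, 0.0))  # stand-in for locate
df['LAT'], df['LON'] = zip(*df['LAT_LON'])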