Kaggle dataset (work in progress): New York Airbnb
Loading the raw data directly, to make the issue easier to reproduce:
import pandas as pd

airbnb = pd.read_csv("https://raw.githubusercontent.com/rafagarciac/Airbnb_NYC-Data-Science_Project/master/input/new-york-city-airbnb-open-data/AB_NYC_2019.csv")
airbnb[airbnb["host_name"].isnull()][["host_name", "neighbourhood_group"]]
I would like to fill the null values of "host_name" based on the values in the "neighbourhood_group" column, like this:
if airbnb['host_name'].isnull():
    if airbnb["neighbourhood_group"] == "Bronx":
        airbnb["host_name"] = "Vie"
    elif airbnb["neighbourhood_group"] == "Manhattan":
        airbnb["host_name"] = "Sonder (NYC)"
    else:
        airbnb["host_name"] = "Michael"
(This is wrong; it is just to show the kind of logic I want.)
I've tried using an if statement but couldn't apply it correctly. Could you please help me solve this?
Thanks
You could try this:
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Bronx"), "host_name"] = "Vie"
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Manhattan"), "host_name"] = "Sonder (NYC)"
airbnb.loc[airbnb['host_name'].isnull(), "host_name"] = "Michael"
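If you end up with many neighbourhood/name pairs, a single-pass alternative is numpy.select; a minimal sketch, assuming the same fill rules as above:
import numpy as np

missing = airbnb['host_name'].isnull()
conditions = [missing & (airbnb['neighbourhood_group'] == 'Bronx'),
              missing & (airbnb['neighbourhood_group'] == 'Manhattan')]
choices = ['Vie', 'Sonder (NYC)']
# rows matching no condition keep their value, with any remaining NaN set to 'Michael'
airbnb['host_name'] = np.select(conditions, choices,
                                default=airbnb['host_name'].fillna('Michael'))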
Pandas has a dedicated method for filling NA values, fillna:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You can create a dict that maps "neighbourhood_group" values (as keys) to "host_name" values and do this:
host_dict = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}
airbnb['host_name'] = airbnb['host_name'].fillna(value=airbnb[airbnb['host_name'].isna()]['neighbourhood_group'].map(host_dict))
airbnb['host_name'] = airbnb['host_name'].fillna("Michael")
"value" argument here may be a Series of values.
So, first of all, we create a Series with "neighbourhood_group" values which correspond to our missing values by using this part:
neighbourhood_group_series = airbnb[airbnb['host_name'].isna()]['neighbourhood_group']
Then using map function together with "host_dict" we get a Series with values that we want to impute:
neighbourhood_group_series.map(host_dict)
Finally we just impute in all other NA cells some default value, in our case "Michael".
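Note that the isna() pre-filter is not strictly needed: fillna only touches NaN positions and aligns the passed Series on the index, so (as a compact sketch) the whole thing collapses to:
airbnb['host_name'] = (airbnb['host_name']
                       .fillna(airbnb['neighbourhood_group'].map(host_dict))
                       .fillna('Michael'))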
You can do it with apply and a row-wise function:
ornek = pd.DataFrame({'samp1': [None, None, None],
                      'samp2': ["sezer", "bozkir", "farkli"]})

def filter_by_col(row):
    if row["samp2"] == "sezer":
        return "ping"
    if row["samp2"] == "bozkir":
        return "pong"
    return None

ornek.apply(filter_by_col, axis=1)
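To actually fill samp1, you would assign the result back, e.g.:
ornek['samp1'] = ornek.apply(filter_by_col, axis=1)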
So I am creating dummy data for a project, and I have a million rows of this table:
The Sub-Reason column is all NaN values because I am generating this data. What I want is to fill it with a value based on the Reason column:
if the Reason is 'Maintenance' I want to put a random value from: ['Indoor Connection', 'Last Mile Connection']
if the Reason is 'New Connection' I want to put a random value from: ['Delayed Connection', 'Connection Request']
if the Reason is 'Billing' I want to put a random value from: ['Update Request', 'Change Personal Info']
if the Reason is 'Complaints' I want to put a random value from: ['Wire Cut', 'Bad Service']
So what I did is a very basic approach:
for i in range(len(cop2)):
    if cop2['Reason'].loc[i][0] == 'Maintenance':
        cop2['Sub-Reason'].loc[i][0] = np.random.choice(list(subReason1))
    if cop2['Reason'].loc[i][0] == 'Connection':
        cop2['Sub-Reason'].loc[i][0] = np.random.choice(list(subReason2))
    if cop2['Reason'].loc[i][0] == 'Billing':
        cop2['Sub-Reason'].loc[i][0] = np.random.choice(list(subReason3))
    if cop2['Reason'].loc[i][0] == 'Complaints':
        cop2['Sub-Reason'].loc[i][0] = np.random.choice(list(subReason4))
It works, but it takes a very long time (about 50 minutes). How can I do this in a way that still works but doesn't take so long?
Did you try the apply method? It's probably faster:
df['Sub-Reason'] = df['Reason'].apply(
    lambda x: np.random.choice(list(subReason1)) if x == 'Maintenance'
    else (np.random.choice(list(subReason2)) if x == 'Connection'
          else (np.random.choice(list(subReason3)) if x == 'Billing'
                else np.random.choice(list(subReason4)))))
Edited, since the OP's column is actually a list and we need its first value:
df['Sub-Reason'] = df['Reason'].apply(
    lambda x: np.random.choice(list(subReason1)) if x[0] == 'Maintenance'
    else (np.random.choice(list(subReason2)) if x[0] == 'Connection'
          else (np.random.choice(list(subReason3)) if x[0] == 'Billing'
                else np.random.choice(list(subReason4)))))
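If apply is still too slow at a million rows, a fully vectorized sketch draws all random values per group at once (assuming 'Reason' holds plain strings; apply df['Reason'].str[0] first if it holds lists):
import numpy as np

sub_reasons = {'Maintenance':    ['Indoor Connection', 'Last Mile Connection'],
               'New Connection': ['Delayed Connection', 'Connection Request'],
               'Billing':        ['Update Request', 'Change Personal Info'],
               'Complaints':     ['Wire Cut', 'Bad Service']}
rng = np.random.default_rng()
for reason, choices in sub_reasons.items():
    mask = df['Reason'] == reason
    # draw one random sub-reason for every matching row in a single call
    df.loc[mask, 'Sub-Reason'] = rng.choice(choices, size=mask.sum())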
As a result of some dataframe manipulation:
week_days = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday',
             4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
week_view['day_week_name'] = week_view['day_week'].apply(lambda x: week_days[x])
...
week_view.pivot_table(index=['paymentType', 'type'], columns=['day_week_name'],
                      aggfunc={'value': 'sum'}, fill_value=0).sort_index(axis=1, ascending=False)
I get a result where the day columns are sorted alphabetically.
Is there a way to get them sorted the normal way, where 'Monday' is the first column and 'Sunday' is the last?
I tried to use the "key=" argument of .sort_index(), but got an error back:
TypeError: sort_index() got an unexpected keyword argument 'key'
(presumably my pandas version predates 1.1, where key= was added)
UPDATE (a kind of solution)
With the help of your comments, I found a way to solve the task.
You have to use pd.Categorical to get it sorted. The remaining problem with the pivot table is that all-zero rows get added to the final table (which does not happen without categorization).
You have to add a few more lines to get the desired result:
week_view['day_week_name'] = pd.Categorical(week_view['day_week_name'],
                                            ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                                             'Friday', 'Saturday', 'Sunday'])
week_pivot = week_view.pivot_table(index=['paymentType', 'type'], columns=['day_week_name'],
                                   aggfunc={'value': 'sum'}, fill_value=0).sort_index(axis=1)
week_series = (week_pivot != 0).any(axis=1)   # keep only rows with at least one non-zero value
week_pivot.loc[week_series]
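Depending on your pandas version, pivot_table's observed= argument may do that filtering for you by keeping only category combinations that actually occur in the data; a sketch, assuming a version that supports it:
week_pivot = week_view.pivot_table(index=['paymentType', 'type'], columns=['day_week_name'],
                                   aggfunc={'value': 'sum'}, fill_value=0,
                                   observed=True).sort_index(axis=1)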
When I try to make a new column and add it to an existing dataframe, the new column only has empty values. However, when I print "result" before assigning it to the dataframe, it looks fine! And then I get this weird max() error:
ValueError: max() arg is an empty sequence
I'm using mplfinance to plot the data
strategy.py
def moving_average(self, df, i):
    signal = df['sma20'][i] * 1.10
    if (df['sma20'][i] > df['sma50'][i]) & (signal > df['Close'][i]):
        return df['Close'][i]
    else:
        return None
trading.py
for i in range(0, len(df['Close']) - 1):
    result = strategy.moving_average(df, i)
    print(result)
df['buy'] = result
df.to_csv('test.csv', encoding='utf-8')
apd = mpf.make_addplot(df['buy'], scatter=True, marker='^')
mpf.plot(df, type='candle', addplot=apd)
Based on the very small amount of information here, and on your comment
"because df['buy'] column has nan values only."
I'm going to guess that your problem is that strategy.moving_average() is returning None instead of nan when there is no signal.
There is a big difference between None and nan. (The main issue is that nan supports math, whereas None does not; and as a general rule plotting packages always do math).
I suggest you import numpy as np and then in strategy.moving_average()
change return None
to return np.nan.
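A quick illustration of the difference:
import numpy as np
print(np.nan + 1)   # nan  -- math propagates through NaN
# None + 1          # TypeError: unsupported operand type(s)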
ALSO just saw another problem.
You are only assigning a single value to df['buy'].
You need to take it out of the loop.
I suggest initializing result as an empty list before the loop, then:
result = []
for i in range(len(df['Close'])):   # cover every row, so len(result) matches len(df)
    result.append(strategy.moving_average(df, i))
print(result)
df['buy'] = result
df.to_csv('test.csv', encoding='utf-8')
apd = mpf.make_addplot(df['buy'], scatter=True, marker='^')
mpf.plot(df, type='candle', addplot=apd)
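As a side note, the whole loop could be replaced by a vectorized sketch of the same signal logic (assuming plain numeric columns):
signal = df['sma20'] * 1.10
mask = (df['sma20'] > df['sma50']) & (signal > df['Close'])
df['buy'] = df['Close'].where(mask)   # NaN wherever the condition is False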
I am learning Python's Pandas library using Kaggle's Titanic tutorial. I am trying to create a function that calculates the percentage of nulls in a column.
My attempt below prints the entire dataframe instead of just the null percentage for the specified column:
def null_percentage_calculator(df, nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null / testtotal) * 100)
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage, df, nullcolumn))

null_percentage_calculator(train, "Age")
My previous (and very first) Stack Overflow question was about a similar problem, and it was explained to me that the .index method in pandas is undesirable and that I should use other methods like [] and .loc to refer to the column explicitly.
So I have tried this:
df_column_null=[df[nullcolumn]].isnull().sum()
I have also tried
df_column_null=df[nullcolumn]df[nullcolumn].isnull().sum()
I am struggling to understand this aspect of Pandas. My non-function method works fine:
Train_Age_Nulls = train["Age"].isnull().sum()
Train_Age_Nulls_percentage = (Train_Age_Nulls/traintotal)*100
Train_Age_Nulls_percentage_rounded = np.ceil(Train_Age_Nulls_percentage)
print("{} percent of Train's Age are NaN values".format(Train_Age_Nulls_percentage_rounded))
Could anyone let me know where I am going wrong?
def null_percentage_calculator(df, nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null / testtotal) * 100)
    # what is testtotal?
    # note: passing df to format() here is what prints the entire dataframe
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage, df, nullcolumn))
I would do this with:
def null_percentage_calculator(df, nullcolumn):
    nulls = df[nullcolumn].isnull().sum()
    pct = float(nulls) / len(df[nullcolumn])  # float() guards against Python 2 integer division
    # if you must, you can * 100
    print("{} percent of column {} are null".format(pct * 100, nullcolumn))
Beware of Python 2 integer division, where 63/180 = 0: if you want a float out, you have to put a float in. (In Python 3, / always performs true division.)
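For what it's worth, the mean of the boolean mask gives the percentage directly; a compact sketch:
def null_percentage_calculator(df, nullcolumn):
    pct = df[nullcolumn].isnull().mean() * 100   # mean of True/False = fraction of nulls
    print("{} percent of column {} are null".format(pct, nullcolumn))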
I have a Pandas DataFrame that contains several string values.
I want to replace them with integer values in order to calculate similarities.
For example:
stores[['CNPJ_Store_Code','region','total_facings']].head()
Out[24]:
    CNPJ_Store_Code      region  total_facings
1    93209765046613   Geo RS/SC       1.471690
16   93209765046290   Geo RS/SC       1.385636
19   93209765044084  Geo PR/SPI       0.217054
21   93209765044831   Geo RS/SC       0.804633
23   93209765045218  Geo PR/SPI       0.708165
and I want to replace region == 'Geo RS/SC' ==> 1, region == 'Geo PR/SPI'==> 2 etc.
Clarification: I want to do the replacement automatically, without creating a dictionary first, since I don't know in advance what my regions will be.
Any ideas? I am trying to use DictVectorizer, with no success.
I'm sure there's a way to do it in an intelligent way, but I just can't find it.
Anyone familiar with a solution?
You can use the .apply() function and a dictionary to map all known string values to their corresponding integer values:
region_dictionary = {'Geo RS/SC': 1, 'Geo PR/SPI' : 2, .... }
stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])
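Equivalently, Series.map does the same dictionary lookup without the lambda:
stores['region'] = stores['region'].map(region_dictionary)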
It looks to me like you really want pandas categories:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
I think you just need to change the dtype of your text column to "category" and you are done.
stores['region'] = stores["region"].astype('category')
You can do:
df = pd.read_csv(filename, index_col=0)  # assuming it's a csv file

def region_to_numeric(a):
    if a == 'Geo RS/SC':
        return 1
    if a == 'Geo PR/SPI':
        return 2

df['region_num'] = df['region'].apply(region_to_numeric)
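Since the question asks to avoid writing the mapping by hand, pd.factorize numbers the unique values automatically; a sketch, assuming any assignment of numbers to regions is acceptable:
import pandas as pd

df['region_num'] = pd.factorize(df['region'])[0] + 1   # +1 so numbering starts at 1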