My dataset has a column called age and I'm trying to count the null values.
I know it can be easily achieved by doing something like len(df) - df['age'].count(). However, I'm playing around with functions and would just like to apply a function to calculate the null count.
Here is what I have:
def age_is_null(df):
    age_col = df['age']
    null = df[age_col].isnull()
    age_null = df[null]
    return len(age_null)
count = df.apply(age_is_null)
print(count)
When I do that, I receive an error: KeyError: 'age'.
Can someone tell me why I'm getting that error and what I should change in the code to make it work?
You need DataFrame.pipe, or pass the DataFrame to the function directly:

# the function can be simplified
def age_is_null(df):
    return df['age'].isnull().sum()

count = df.pipe(age_is_null)
print(count)

count = age_is_null(df)
print(count)
The error occurs because DataFrame.apply calls the function once per column, passing each column in as a Series, so selecting the column age inside the function fails:
def func(x):
    print(x)

df.apply(func)
EDIT: To select the column, use the column name:
def age_is_null(df):
    age_col = 'age'  # <- here
    null = df[age_col].isnull()
    age_null = df[null]
    return len(age_null)
Or pass the selected column as the mask:
def age_is_null(df):
    age_col = df['age']
    null = age_col.isnull()  # <- here
    age_null = df[null]
    return len(age_null)
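Both corrected versions give the same count as the original one-liner; a quick sketch on a toy DataFrame (the column values are made up for illustration):

```python
import numpy as np
import pandas as pd

def age_is_null(df):
    age_col = df['age']
    null = age_col.isnull()
    age_null = df[null]
    return len(age_null)

# toy data: two missing ages out of four rows
df = pd.DataFrame({'age': [25, np.nan, 31, np.nan]})

print(age_is_null(df))               # 2
print(df.pipe(age_is_null))          # 2
print(len(df) - df['age'].count())   # 2, the original one-liner
```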
Instead of making a function, you can try this:

df[df["age"].isnull()].shape
You need to pass the dataframe df when calling the function age_is_null; that's why the age column is not recognised:

count = age_is_null(df)
@app.callback(
    Output('stats', 'children'),
    Input('picker_main', 'date'))
def update_table(date_value):
    table = {}
    for query_id in queries_daily:
        df_temp = data_manager.data[query_id]
        df_temp.set_index('day')
        try:
            table[query_id] = df_temp[query_id].where(df_temp['day'] == datetime.strptime(date_value, "%Y-%m-%d").date())
            #table[query_id] = df_temp.loc["day", query_id].where(df_temp['day'] == temp)
            #table[query_id] = df_temp[df_temp["day"] == temp]
        except Exception as e:
            table[query_id] = 0
            print(e)
I'm trying to get a row from the dataframe and store it in a dictionary or another dataframe.
It's actually only about this bit:
table[query_id] = df_temp[query_id].where(df_temp['day'] == datetime.strptime(date_value, "%Y-%m-%d").date())
table --> empty dict
df_temp --> df with 2 columns: the first named by the variable "query_id" and the second with the date. From it I'm trying to get the value stored in the column named by "query_id", together with the "query_id" keyword.
I've also tried converting the date to string format, and using a dataframe instead of an empty dictionary.
It doesn't return any data. I posted a longer piece of code at first, as I was wondering if someone might spot something I could do in a better way.
Thanks!
Seems like you need DataFrame.query:

table = {query_id: data_manager.data[query_id]
                       .query(f'day == "{date_value}"')['day']
         for query_id in queries_daily}
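A runnable sketch of the same idea on made-up stand-ins for data_manager.data and queries_daily (names and values assumed from the question), selecting the value column rather than 'day' so the result is the queried metric:

```python
import pandas as pd

# toy stand-ins for the question's data_manager.data and queries_daily
queries_daily = ['visits']
data = {'visits': pd.DataFrame({'visits': [10, 20, 30],
                                'day': ['2021-05-01', '2021-05-02', '2021-05-03']})}

date_value = '2021-05-02'
table = {query_id: data[query_id].query(f'day == "{date_value}"')[query_id]
         for query_id in queries_daily}

print(table['visits'].tolist())  # [20]
```

If the real 'day' column holds datetime.date objects rather than strings, the comparison value would need converting first, as in the original callback.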
I have a problem with getting data.
I have this DataFrame:
I need to filter by 'fabricante' == 'Kellogs' and get the 'calorias' column. I did this:
I need the second column (calorias) for introducing in this function:
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = None  # Select only the data (fabricante, variable) from 'cereal_df'
    inicio, final = None, None  # put the statistical function here
    return inicio, final
And this is my code for the last part:
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante][variable]
    inicio, final = sm.stats.DescrStatsW(variable).tconfint_mean(alpha = 1-confianza)
    return inicio, final
The error:
I'd be very appreciative if you can help me.
You called DescrStatsW('calorias').
But surely you wanted DescrStatsW(subconjunto), right?
I'm just reading https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html which explains that you should pass in a 1-D or 2-D numpy array or dataframe.
I am able to change the sequence of columns using the code below, which I found on Stack Overflow. Now I am trying to convert it into a function for regular use, but it doesn't seem to do anything; PyCharm says the local variable df_name's value is not used on the last line of my function.
Working Code
columnsPosition = list(df.columns)
F, H = columnsPosition.index('F'), columnsPosition.index('H')
columnsPosition[F], columnsPosition[H] = columnsPosition[H], columnsPosition[F]
df = df[columnsPosition]
My function (doesn't work; I need to make it work):
def change_col_seq(df_name, old_col_position, new_col_position):
    columnsPosition = list(df_name.columns)
    F, H = columnsPosition.index(old_col_position), columnsPosition.index(new_col_position)
    columnsPosition[F], columnsPosition[H] = columnsPosition[H], columnsPosition[F]
    df_name = df_name[columnsPosition]  # PyCharm flags this line: the value is never used

I have tried adding a return as the last statement of the function, but I was unable to make it work.
To re-order the Columns
To change the position of 2 columns:
def change_col_seq(df_name: pd.DataFrame, old_col_position: str, new_col_position: str):
    df_name[new_col_position], df_name[old_col_position] = df_name[old_col_position].copy(), df_name[new_col_position].copy()
    df = df_name.rename(columns={old_col_position: new_col_position, new_col_position: old_col_position})
    return df
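Alternatively, the question's original list-swap approach works once the function actually returns the reordered frame; a minimal sketch (toy column names assumed):

```python
import pandas as pd

def change_col_seq(df_name, old_col_position, new_col_position):
    columnsPosition = list(df_name.columns)
    F, H = columnsPosition.index(old_col_position), columnsPosition.index(new_col_position)
    columnsPosition[F], columnsPosition[H] = columnsPosition[H], columnsPosition[F]
    # return the reordered frame instead of rebinding the local name
    return df_name[columnsPosition]

df = pd.DataFrame({'A': [1], 'F': [2], 'H': [3]})
df = change_col_seq(df, 'F', 'H')
print(list(df.columns))  # ['A', 'H', 'F']
```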
To Rename the Columns
You can use the rename method (Documentation)
If you want to change the name of just one column:
def change_col_name(df_name, old_col_name: str, new_col_name: str):
    df = df_name.rename(columns={old_col_name: new_col_name})
    return df
If you want to change the names of multiple columns:

def change_col_name(df_name, old_col_name: list, new_col_name: list):
    df = df_name.rename(columns=dict(zip(old_col_name, new_col_name)))
    return df
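A quick usage sketch of the multiple-column rename (toy column names assumed):

```python
import pandas as pd

def change_col_name(df_name, old_col_name: list, new_col_name: list):
    # zip pairs each old name with its new name, dict() builds the rename mapping
    return df_name.rename(columns=dict(zip(old_col_name, new_col_name)))

df = pd.DataFrame({'a': [1], 'b': [2]})
df = change_col_name(df, ['a', 'b'], ['x', 'y'])
print(list(df.columns))  # ['x', 'y']
```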
I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function that takes the string 'P1Y4M1D' and returns an integer.
I am wondering how I can change all the column values to parsed values using that function?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row; iloc selects data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan
def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i += 1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
Probably I do not even need to change the whole column with new parsed values, the final goal is to write a new function that returns average time of ['timespan'] of docs created in a particular year. Since I need parsed values, I thought it would be easier to change the whole column and manipulate a new data frame.
Also, I am curious how I could apply the parsing function to each ['timespan'] row without modifying the data frame. I can only assume it could be something like this, but I don't have a full understanding of how to do it:

for x in my_ocan['timespan']:
    x = parse(str(my_ocan['timespan']))
How can I get a column with new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by @Dan) should work. You would need to modify the parse function so that it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd

def parse_postal_code(postal_code):
    # Splitting postal code and getting first letters
    letters = postal_code.split('_')[0]
    return letters

# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})

# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))

# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
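Applied to the question's actual data, a single-argument version of the parser might look like the sketch below. It reuses the regex from the question, and the day counts assume 365-day years and 30-day months, as in the original:

```python
import re
import pandas as pd

def parse_timespan(value):
    # e.g. 'P1Y4M1D' -> 486, '-P2Y' -> -730
    is_negative = value.startswith('-')
    if is_negative:
        value = value[1:]
    date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
    year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
    daystotal = (year * 365) + (month * 30) + day
    return -daystotal if is_negative else daystotal

# toy frame with only the relevant column
my_ocan = pd.DataFrame({'timespan': ['P2Y', 'P1Y4M1D', '-P2Y']})
my_ocan['timespan_days'] = my_ocan['timespan'].apply(parse_timespan)
print(my_ocan['timespan_days'].tolist())  # [730, 486, -730]
```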
I am learning Python's Pandas library using Kaggle's Titanic tutorial. I am trying to create a function which will calculate the nulls in a column.
My attempt below appears to print the entire dataframe, instead of null values in the specified column:
def null_percentage_calculator(df,nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null /testtotal)*100)
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage,df,nullcolumn))

null_percentage_calculator(train,"Age")
My previous (and very first Stack Overflow) question was a similar problem, and it was explained to me that the .index method in pandas is undesirable and I should try to use other methods like [ ] and .loc to refer to the column explicitly.
So I have tried this:
df_column_null=[df[nullcolumn]].isnull().sum()
I have also tried
df_column_null=df[nullcolumn]df[nullcolumn].isnull().sum()
I am struggling to understand this aspect of Pandas. My non-function method works fine:
Train_Age_Nulls = train["Age"].isnull().sum()
Train_Age_Nulls_percentage = (Train_Age_Nulls/traintotal)*100
Train_Age_Nulls_percentage_rounded = np.ceil(Train_Age_Nulls_percentage)
print("{} percent of Train's Age are NaN values".format(Train_Age_Nulls_percentage_rounded))
Could anyone let me know where I am going wrong?
def null_percentage_calculator(df,nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null /testtotal)*100)  # what is testtotal?
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage,df,nullcolumn))
I would do this with:

def null_percentage_calculator(df,nullcolumn):
    nulls = df[nullcolumn].isnull().sum()
    pct = float(nulls) / len(df[nullcolumn])  # need float because of Python 2 division
    # if you must you can * 100
    print("{} percent of column {} are null".format(pct*100, nullcolumn))

Beware of Python 2 integer division, where 63/180 = 0: if you want a float out, you have to put a float in. (In Python 3, / always returns a float.)
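A quick sketch of the corrected function on toy data (the column values are made up; it returns the percentage instead of printing it, so the value can be reused):

```python
import numpy as np
import pandas as pd

def null_percentage_calculator(df, nullcolumn):
    nulls = df[nullcolumn].isnull().sum()
    pct = float(nulls) / len(df[nullcolumn])
    return pct * 100

# 1 missing value out of 4 rows -> 25.0 percent
train = pd.DataFrame({'Age': [22, np.nan, 35, 58]})
print(null_percentage_calculator(train, 'Age'))  # 25.0
```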