How to get data with pandas - python

I have a problem with getting data.
I have this DataFrame:
I need to filter by 'fabricante' == 'Kellogs' and get the 'calorias' column. I did this:
I need the second column (calorias) to pass into this function:
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = None  # Select only the data: (fabricante, variable) from 'cereal_df'
    inicio, final = None, None  # put the statistical function here.
    return inicio, final
And this is my code for the last part:
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante][variable]
    inicio, final = sm.stats.DescrStatsW(variable).tconfint_mean(alpha = 1-confianza)
    return inicio, final
The error:
I would really appreciate any help.

You called DescrStatsW(variable), which is DescrStatsW('calorias'): the column name as a string, not the data.
But surely you wanted DescrStatsW(subconjunto), right?
I'm just reading https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html
which explains that you should pass in
a 1-D or 2-D array-like, such as a NumPy array or a DataFrame.

Related

Python return statement failing to return a list to be written into a pandas DF

For the life of me I cannot figure out why this function is not returning anything. Any insight will be greatly appreciated!
Basically, I build a list of strings that I want to store in a pandas DataFrame. I use .apply() to pass values from each row into the function, but the new column in my DataFrame ends up containing only None.
def add_combinations_to_directory(comb_tuples, person_id):
    meta_list = []
    for comb in comb_tuples:
        concat_name = generate_normalized_name(comb)
        metaphone_tuple = doublemetaphone(concat_name)
        meta_list.append(metaphone_tuple[0])
        if metaphone_tuple[1] != '':
            meta_list.append(metaphone_tuple[1])
        if metaphone_tuple[0] in __lookup_dict[0]:
            __lookup_dict[0][metaphone_tuple[0]].append(person_id)
        else:
            __lookup_dict[0][metaphone_tuple[0]] = [person_id]
        if metaphone_tuple[1] in __lookup_dict[1]:
            __lookup_dict[1][metaphone_tuple[1]].append(person_id)
        else:
            __lookup_dict[1][metaphone_tuple[1]] = [person_id]
    print(meta_list)
    return meta_list

def add_person_to_lookup_directory(person_id, name_tuple):
    add_combinations_to_directory(name_tuple, person_id)

def create_meta_names(x, id):
    add_person_to_lookup_directory(id, x)

other['Meta_names'] = other.apply(lambda x: create_meta_names(x['Owners'], x['place_id']), axis=1)
Figured it out! It was a problem with the nested functions. The return value of add_combinations_to_directory went back to add_person_to_lookup_directory, but neither that function nor create_meta_names returned it onward, so nothing was passed through to the DataFrame.
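A minimal sketch of that fix: every wrapper in the call chain returns the value it receives. The body of add_combinations_to_directory here is a hypothetical stand-in (the original depends on generate_normalized_name, doublemetaphone and __lookup_dict, which are not shown), and the sample data is made up.

```python
import pandas as pd

def add_combinations_to_directory(comb_tuples, person_id):
    # Hypothetical stand-in for the original body: it only builds a list.
    return [f"{person_id}:{name}" for name in comb_tuples]

def add_person_to_lookup_directory(person_id, name_tuple):
    # Return the inner result instead of dropping it.
    return add_combinations_to_directory(name_tuple, person_id)

def create_meta_names(x, id):
    # Same here: without this return, apply() would receive None.
    return add_person_to_lookup_directory(id, x)

# Made-up sample data with the column names used in the question.
other = pd.DataFrame({
    "Owners": [("ann", "bob"), ("cy",)],
    "place_id": [1, 2],
})
other["Meta_names"] = other.apply(
    lambda x: create_meta_names(x["Owners"], x["place_id"]), axis=1
)
print(other)
```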

Create and append pandas dummy variables with pipe

I am trying to create a pandas pipeline that creates dummy variables and appends the columns to the existing dataframe.
Unfortunately I can't get the appended columns to stick when the pipeline is finished.
Example:
def function(df):
    pass

def create_dummy(df):
    a = pd.get_dummies(df['col'])
    b = df.append(a)
    return b

def mah_pipe(df):
    (df.pipe(function)
       .pipe(create_dummy)
       .pipe(print))
    return df

print(mah_pipe(df))
First - I have no idea if this is good practice.
What's weird is that the .pipe(print) prints the dataframe with appended columns. Yay.
But the statement print(mah_pipe(df)) does not. I thought they would behave the same way.
I have tried to read the documentation about DataFrame.pipe but I couldn't figure it out.
Hoping someone could help shed some light on what's going on.
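This is because print in Python returns None, so a pipeline that ends in print evaluates to None. On top of that, mah_pipe never assigns the result of the pipeline to anything: it returns the original df, which the pipes did not modify.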
pipes in Pandas
In pandas, each step in a pipe chain is expected to take a DataFrame and return one: (df) -> [pipe] -> (df_1) -> [pipe2] -> (df_2) -> ... -> [pipeN] -> (df_N). With print as the last pipe, the output of the whole chain is None.
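A small illustration of that point, using a made-up one-column frame. The helper show is a hypothetical name for a pass-through step that prints and then returns its input.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "x"]})

# print() displays the frame but returns None, so the chain ends with None.
result = df.pipe(print)

def show(dataf):
    # Hypothetical pass-through step: print, then hand the frame back.
    print(dataf)
    return dataf

# With a pass-through step, the chain still yields a DataFrame.
passed = df.pipe(show)
```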
Solution
...
def start_pipe(dataf):
    # make a copy to avoid modifying the original
    return dataf.copy()

def create_dummies(dataf, column_name):
    dummies = pd.get_dummies(dataf[column_name])
    dataf[dummies.columns] = dummies
    return dataf

def print_dataf(dataf, n_rows=5):
    print(dataf.head(n_rows))
    return dataf  # this is important
# usage
...
dt = (df
      .pipe(start_pipe)
      .pipe(create_dummies, column_name='a')
      .pipe(print_dataf, n_rows=10)
      )

def mah_pipe(df):
    df = (df
          .pipe(start_pipe)
          .pipe(create_dummies, column_name='a')
          .pipe(print_dataf, n_rows=10)
          )
    return df

print(mah_pipe(df))

Even after adding a column to a dataframe my shape remains the same which means I am not able to add a column to my dataframe

This is my piece of code:
def segregate_files(self, list_of_csv, each_sub_folder):
    new_list_of_csv = []
    for each_csv in list_of_csv:
        pattern = f"{each_sub_folder}/(.*?)/"
        self.data_centre = re.search(pattern, each_csv).group(1)
        if "org_dashboards/" in each_csv:
            each_csv = each_csv.replace("org_dashboards/", f"{self.file_path}/")
        else:
            each_csv = each_csv.replace("dashboards/", f"{self.file_path}/")
        df = pd.read_csv(each_csv)
        print(df.shape)
        df["Data Centre"] = self.data_centre
        print(df.shape)
        df.to_csv(each_csv)
        new_list_of_csv.append(each_csv)
        # self.list_of_sub_folder.append(f"files/{blob_name}")
    print(new_list_of_csv)
    self.aggregate_csv_path = f"{self.file_path}/{each_sub_folder}"
    return new_list_of_csv, self.aggregate_csv_path
The DataFrame reads the CSV correctly,
and df["Data Centre"] = self.data_centre raises no error;
only the shape remains the same.
FYI, the value of self.data_centre is also correct.
Sorry, my bad. It was a file write issue, and it has now been resolved. Thank you.

Pandas apply function, receiving KeyError 'Column Name'

My dataset has a column called age and I'm trying to count its null values.
I know this can easily be done with something like len(df) - df['age'].count(). However, I'm playing around with functions and would like to use apply with a function to calculate the null count.
Here is what I have:
def age_is_null(df):
    age_col = df['age']
    null = df[age_col].isnull()
    age_null = df[null]
    return len(age_null)

count = df.apply(age_is_null)
print(count)
When I do that, I receive an error: KeyError: 'age'.
Can someone tell me why I'm getting that error and what I should change in the code to make it work?
Use DataFrame.pipe, or pass the DataFrame to the function directly:
# the function can be simplified
def age_is_null(df):
    return df['age'].isnull().sum()

count = df.pipe(age_is_null)
print(count)
count = age_is_null(df)
print(count)
The error occurs because DataFrame.apply calls the function once per column, passing each column as a Series, so selecting the column age inside the function fails.
def func(x):
    print(x)

df.apply(func)
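To make the column-by-column behaviour concrete, here is a short sketch on a made-up two-column frame (the data and the name column are assumptions, not from the question). Each call receives one column as a Series, so col.name is the column label and a per-column null count comes out naturally.

```python
import pandas as pd

# Made-up data: one missing value in each column.
df = pd.DataFrame({
    "age": [20.0, None, 31.0],
    "name": ["a", "b", None],
})

# apply() hands each column to the function as a Series.
seen = df.apply(lambda col: col.name)

# Counting nulls per column works because the function sees one column at a time.
null_counts = df.apply(lambda col: col.isnull().sum())
print(null_counts)
```

Inside such a function, col['age'] would look for a row label 'age' in the column's index, which is why the original code raised KeyError: 'age'.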
EDIT: To select the column, use the column name:
def age_is_null(df):
    age_col = 'age'  # <- here
    null = df[age_col].isnull()
    age_null = df[null]
    return len(age_null)
Or use the selected column to build the mask:
def age_is_null(df):
    age_col = df['age']
    null = age_col.isnull()  # <- here
    age_null = df[null]
    return len(age_null)
Instead of writing a function, you can try this; the first element of the shape is the number of rows where age is null:
df[df["age"].isnull()].shape[0]
You need to pass the DataFrame df itself when calling age_is_null, rather than going through apply. That is why the age column is not recognised.
count = age_is_null(df)

What might the variable createtot=None mean in this function?

I am trying to understand the following code. I understand in general what it does (it builds the data frame we want to work with), but I cannot work out what createtot=None means here.
def returnmyframe(dataframe_in, filter, grouper_in, columns_in, indexnames, createtot=None, selectcol=None):
    outfram = (dataframe_in[dataframe_in['Portal'].isin(filter)].groupby(grouper_in)).sum()[columns_in]
    if createtot is not None:
        outfram[createtot["name"]] = outfram[createtot["totalsum"]].sum(axis=1)
    if (selectcol is not None):
        outfram = outfram[selectcol]
    if len(columns_in) > 1:
        outfram = (outfram.stack(0)).fillna(0)
    outfram.index.names = indexnames
    return (outfram)
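I think it's short for 'create total'. The default None means the argument is optional: when it is left out, no total column is created. Otherwise it's expected to be given as
{"totalsum": <list of input column names>, "name": <result column name>}
and the function will then add up (sum) the values of the input columns row by row, since it calls .sum(axis=1), and put the result in the result column.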
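A minimal sketch of just the createtot step, on made-up data. The column names clicks and views and the result name total are assumptions for illustration; only Portal and the dictionary keys "name" and "totalsum" come from the function above.

```python
import pandas as pd

# Made-up data; only the 'Portal' column name comes from the original function.
df = pd.DataFrame({
    "Portal": ["A", "A", "B"],
    "clicks": [1, 2, 3],
    "views": [10, 20, 30],
})

# Same idea as the first line of returnmyframe: group, sum, keep some columns.
outfram = df.groupby("Portal").sum()[["clicks", "views"]]

# Optional argument: None would skip the total column entirely.
createtot = {"name": "total", "totalsum": ["clicks", "views"]}
if createtot is not None:
    # Row-wise sum over the listed columns, stored under the given name.
    outfram[createtot["name"]] = outfram[createtot["totalsum"]].sum(axis=1)

print(outfram)
```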
