What does the parameter createtot=None mean in this function? - python

I am trying to understand the following code. I understand in general what it does: we build a data frame we want to work with, but I can't work out what createtot=None in particular means here:
def returnmyframe(dataframe_in, filter, grouper_in, columns_in, indexnames, createtot=None, selectcol=None):
    outfram = (dataframe_in[dataframe_in['Portal'].isin(filter)].groupby(grouper_in)).sum()[columns_in]
    if createtot is not None:
        outfram[createtot["name"]] = outfram[createtot["totalsum"]].sum(axis=1)
    if selectcol is not None:
        outfram = outfram[selectcol]
    if len(columns_in) > 1:
        outfram = outfram.stack(0).fillna(0)
    outfram.index.names = indexnames
    return outfram

I think it's short for 'create total': it's expected to be given as
{"totalsum": <list of input column names>, "name": <result column name>}
and will then sum those input columns row-wise (axis=1) and store the result in a new column with the given name.
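A minimal sketch of what that branch does, on a made-up frame (the names outfram, A, B, and Total are assumptions, not from the original code):

```python
import pandas as pd

# Made-up stand-in for the grouped output; column names are assumptions
outfram = pd.DataFrame({"A": [1, 2], "B": [10, 20]})

# What the createtot branch computes: a row-wise sum of the listed
# columns, stored under the new name
createtot = {"totalsum": ["A", "B"], "name": "Total"}
outfram[createtot["name"]] = outfram[createtot["totalsum"]].sum(axis=1)
print(outfram["Total"].tolist())  # [11, 22]
```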


Specific data from a dataframe based on date in another column

@app.callback(
    Output('stats', 'children'),
    Input('picker_main', 'date'))
def update_table(date_value):
    table = {}
    for query_id in queries_daily:
        df_temp = data_manager.data[query_id]
        df_temp.set_index('day')
        try:
            table[query_id] = df_temp[query_id].where(df_temp['day'] == datetime.strptime(date_value, "%Y-%m-%d").date())
            #table[query_id] = df_temp.loc["day", query_id].where(df_temp['day'] == temp)
            #table[query_id] = df_temp[df_temp["day"] == temp]
        except Exception as e:
            table[query_id] = 0
            print(e)
I'm trying to get a row from the dataframe and store it in a dictionary or another dataframe.
It's actually only about this line:
table[query_id] = df_temp[query_id].where(df_temp['day'] == datetime.strptime(date_value, "%Y-%m-%d").date())
table --> an empty dict
df_temp --> a dataframe with 2 columns: the first named by the variable "query_id", the second holding a date. I'm trying to get the value stored in the column named by "query_id", keyed by the "query_id" keyword.
I've also tried converting the date to string format, and using a dataframe instead of an empty dictionary.
It doesn't return any data. I posted a longer piece of code at first in case someone spots something I could do in a better way.
Thanks!
Seems like you need query:
table = {query_id: data_manager.data[query_id]
                       .query(f'day == "{date_value}"')['day']
         for query_id in queries_daily}
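If it helps to see query in isolation, here is a toy run (the frame and the column names sales/day are made up):

```python
import pandas as pd

# Toy stand-in for one entry of data_manager.data
df = pd.DataFrame({"sales": [5, 7, 9],
                   "day": ["2021-01-01", "2021-01-02", "2021-01-03"]})

date_value = "2021-01-02"
# query() evaluates the boolean expression against the columns;
# the f-string inlines the date to match
row = df.query(f'day == "{date_value}"')["sales"]
print(row.tolist())  # [7]
```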

How to get data with pandas

I have a problem with getting data.
I have this DataFrame:
I need to filter by 'fabricante' == 'Kellogs' and get the 'calorias' column; that second column (calorias) then has to go into this function:
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = None  # Select only the data: (fabricante, variable) from 'cereal_df'
    inicio, final = None, None  # put the statistical function here.
    return inicio, final
And this is my code for the last part:
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante][variable]
    inicio, final = sm.stats.DescrStatsW(variable).tconfint_mean(alpha=1 - confianza)
    return inicio, final
The error:
I would really appreciate any help.
You called DescrStatsW('calorias').
But surely you wanted DescrStatsW(subconjunto), right?
I'm just reading https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html
which explains you should pass in
a 1- or 2-column numpy array or dataframe.
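To make the selection step concrete, here is a toy run with a made-up cereal_df (values are invented); it is this Series, not the string 'calorias', that should be passed to DescrStatsW:

```python
import pandas as pd

# Made-up stand-in for cereal_df
cereal_df = pd.DataFrame({"fabricante": ["Kellogs", "Kellogs", "Nestle"],
                          "calorias": [70, 110, 90]})

# The subset the corrected function should build and hand to DescrStatsW
subconjunto = cereal_df.loc[cereal_df["fabricante"] == "Kellogs", "calorias"]
print(subconjunto.tolist())  # [70, 110]
```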

Pandas apply function, receiving KeyError 'Column Name'

My dataset has a column called age and I'm trying to count the null values.
I know it can be easily achieved by doing something like len(df) - df['age'].count(). However, I'm playing around with functions and would just like to use apply with a function to calculate the null count.
Here is what I have:
def age_is_null(df):
    age_col = df['age']
    null = df[age_col].isnull()
    age_null = df[null]
    return len(age_null)

count = df.apply(age_is_null)
print(count)
When I do that, I receive an error: KeyError: 'age'.
Can someone tell me why I'm getting that error and what I should change in the code to make it work?
You need DataFrame.pipe, or to pass the DataFrame to the function directly:
# the function can be simplified
def age_is_null(df):
    return df['age'].isnull().sum()

count = df.pipe(age_is_null)
print(count)

count = age_is_null(df)
print(count)
The error means that DataFrame.apply iterates over columns, so your function receives each column as a Series, and selecting the column age from a Series fails. You can see what apply passes in:
def func(x):
    print(x)

df.apply(func)
EDIT: To select the column, use the column name:
def age_is_null(df):
    age_col = 'age'  # <- here
    null = df[age_col].isnull()
    age_null = df[null]
    return len(age_null)
Or use the selected column for the mask:
def age_is_null(df):
    age_col = df['age']
    null = age_col.isnull()  # <- here
    age_null = df[null]
    return len(age_null)
Instead of making a function, you can try this:
df[df["age"].isnull()].shape
You need to pass the dataframe df while calling the function age_is_null; that's why the age column is not recognised.
count = age_is_null(df)
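For reference, the simplified function run end to end on a made-up frame (column contents are invented):

```python
import pandas as pd

# Toy frame with two missing ages
df = pd.DataFrame({"age": [25, None, 31, None],
                   "name": ["a", "b", "c", "d"]})

def age_is_null(df):
    # isnull() gives a boolean Series; sum() counts the True values
    return df["age"].isnull().sum()

print(df.pipe(age_is_null))  # 2
print(age_is_null(df))       # 2
```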

Change Column values in pandas applying another function

I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function that takes such a string ('P1Y4M1D') and returns an integer number of days.
How can I replace all the column values with the parsed values using that function?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # remove the first row; iloc selects data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan

def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i += 1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
Probably I do not even need to change the whole column to the parsed values; the final goal is to write a new function that returns the average ['timespan'] of docs created in a particular year. Since I need the parsed values, I thought it would be easier to change the whole column and work with a new data frame.
I am also curious how to apply the parsing function to each ['timespan'] row without modifying the data frame. I can only assume it could be something like this, but I don't fully understand how to do it:
for x in my_ocan['timespan']:
    x = parse(str(x))
How can I get a column with the new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by @Dan) should work. You would only need to modify the parse function so it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd

def parse_postal_code(postal_code):
    # Split the postal code and keep the leading letters
    letters = postal_code.split('_')[0]
    return letters

# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})

# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))

# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
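Applied to the question's own data, the rewritten parse might look like this sketch (it reuses the regex from the question; the toy timespan values and the column name timespan_days are assumptions):

```python
import re
import pandas as pd

def parse(value):
    # Takes one duration string like 'P1Y4M1D' and returns a day count
    value = str(value)
    is_negative = value.startswith('-')
    if is_negative:
        value = value[1:]
    date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
    year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
    daystotal = (year * 365) + (month * 30) + day
    return -daystotal if is_negative else daystotal

# Toy stand-in for the CSV column
my_ocan = pd.DataFrame({"timespan": ["P2Y", "P1Y4M1D", "-P3M"]})
my_ocan["timespan_days"] = my_ocan["timespan"].apply(parse)
print(my_ocan["timespan_days"].tolist())  # [730, 486, -90]
```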

Imputing the missing values string using a condition(pandas DataFrame)

Kaggle dataset (working on): New York Airbnb.
Snippet with the raw data, to reproduce the issue:
airbnb = pd.read_csv("https://raw.githubusercontent.com/rafagarciac/Airbnb_NYC-Data-Science_Project/master/input/new-york-city-airbnb-open-data/AB_NYC_2019.csv")
airbnb[airbnb["host_name"].isnull()][["host_name", "neighbourhood_group"]]
DataFrame
I would like to fill the null values of "host_name" based on the "neighbourhood_group" column entities.
like this:
if airbnb['host_name'].isnull():
    if airbnb["neighbourhood_group"] == "Bronx":
        airbnb["host_name"] = "Vie"
    elif airbnb["neighbourhood_group"] == "Manhattan":
        airbnb["host_name"] = "Sonder (NYC)"
    else:
        airbnb["host_name"] = "Michael"
(this is wrong, just to represent the output format I want)
I've tried using an if statement but couldn't apply it correctly. Could you please help me solve this.
Thanks
You could try this -
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Bronx"), "host_name"] = "Vie"
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Manhattan"), "host_name"] = "Sonder (NYC)"
airbnb.loc[airbnb['host_name'].isnull(), "host_name"] = "Michael"
Pandas has a special method to fill NA values:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You may create a dict with values for "host_name" field using "neighbourhood_group" values as keys and do this:
host_dict = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}
airbnb['host_name'] = airbnb['host_name'].fillna(value=airbnb[airbnb['host_name'].isna()]['neighbourhood_group'].map(host_dict))
airbnb['host_name'] = airbnb['host_name'].fillna("Michael")
"value" argument here may be a Series of values.
So, first of all, we create a Series with "neighbourhood_group" values which correspond to our missing values by using this part:
neighbourhood_group_series = airbnb[airbnb['host_name'].isna()]['neighbourhood_group']
Then using map function together with "host_dict" we get a Series with values that we want to impute:
neighbourhood_group_series.map(host_dict)
Finally we just impute in all other NA cells some default value, in our case "Michael".
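The whole recipe, run end to end on a made-up frame (slightly simplified: fillna aligns its Series argument on the index, so pre-selecting the NA rows is optional):

```python
import pandas as pd

# Toy stand-in for the Airbnb data; values are invented
airbnb = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Manhattan", "Queens", "Bronx"],
    "host_name": [None, None, None, "Anna"],
})

host_dict = {"Bronx": "Vie", "Manhattan": "Sonder (NYC)"}

# map() translates neighbourhood_group into the replacement name;
# fillna only touches NA cells, so "Anna" is left alone
airbnb["host_name"] = airbnb["host_name"].fillna(airbnb["neighbourhood_group"].map(host_dict))
airbnb["host_name"] = airbnb["host_name"].fillna("Michael")
print(airbnb["host_name"].tolist())  # ['Vie', 'Sonder (NYC)', 'Michael', 'Anna']
```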
You can do it with:
ornek = pd.DataFrame({'samp1': [None, None, None],
                      'samp2': ["sezer", "bozkir", "farkli"]})

def filter_by_col(row):
    if row["samp2"] == "sezer":
        return "ping"
    if row["samp2"] == "bozkir":
        return "pong"
    return None

ornek.apply(filter_by_col, axis=1)
