I'm working on the kaggle Outbrain competition, and all datasets referenced in my code can be found at https://www.kaggle.com/c/outbrain-click-prediction/data.
On to the problem: I have a dataframe with columns ['document_id', 'category_id', 'confidence_level']. I would like to add a fourth column, 'max_cat', that returns the 'category_id' value that corresponds to the greatest 'confidence_level' value for the row's 'document_id'.
import pandas as pd
import numpy
main_folder = r'...filepath\data_location' + '\\'
docs_meta = pd.read_csv(main_folder + 'documents_meta.csv\documents_meta.csv',nrows=1000)
docs_categories = pd.read_csv(main_folder + 'documents_categories.csv\documents_categories.csv',nrows=1000)
docs_entities = pd.read_csv(main_folder + 'documents_entities.csv\documents_entities.csv',nrows=1000)
docs_topics = pd.read_csv(main_folder + 'documents_topics.csv\documents_topics.csv',nrows=1000)
def find_max(row,the_df,groupby_col,value_col,target_col):
return the_df[the_df[groupby_col]==row[groupby_col]].loc[the_df[value_col].idxmax()][target_col]
test = docs_categories.copy()
test['max_cat'] = test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'))
This gives me the error: KeyError: ('document_id', 'occurred at index document_id')
Can anyone help explain either why this error occurred, or how to achieve my goal in a more efficient manner? Thanks!
As answered by EdChum in the comments. The issue is that apply works column wise by default (see the docs). Therefore, the column names cannot be accessed.
To specify that it should be applied to each row instead, axis=1 must be passed:
test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'), axis=1)
Related
I am doing a groupby practice. But it returning dict not dataframe. I fallowed some of the solutions from Stack Overflow even no luck.
My code:
result[comNewColName] = sourceDF.groupby(context, as_index=False)[aggColumn].agg(aggOperation).reset_index()
and I tried:
result[comNewColName] = sourceDF.groupby(context)[aggColumn].agg(aggOperation).reset_index()
and
result[comNewColName] = sourceDF.groupby(context, as_index=False)[aggColumn].agg(aggOperation)
all three cases, I am getting dict only. But I should get dataframe
here:
comNewColName = "totalAmount"
context =['clientCode']
aggColumn = 'amount'
aggOperation = 'sum'
If need new column created by aggregeted values use GroupBy.transform, but assign to sourceDF:
sourceDF[comNewColName] = sourceDF.groupby(context)[aggColumn].transform(aggOperation)
Your solution return DataFrame:
df = sourceDF.groupby(context)[aggColumn].agg(aggOperation).reset_index()
print (type(df))
A TypeError appears within a For Loop. The For Loop's key, yrmon, is assigned to a column "YEAR" (avg_temp["YEAR"]). yrmon contains a six character string, and the "YEAR" column contains an 8 digit value. What's odd, is that the code is pulled from a lesson; I simply retyped it. I'm unsure of what I mistyped.
Lesson and repository can be found at:
https://geo-python.github.io/site/notebooks/L6/advanced-data-processing-with-pandas.html
https://github.com/Geo-Python-2020/Exercise-6
I'd recommend focusing between "String Slicing" & "For-loops and grouped objects". I contacted the instructors by email and LinkedIn.
I've looked at two TypeError postings. Because of my novice understanding, I wasn't able to determine their relevance.
Thanks for your collective time! Below is the code:
import pandas as pd
data = pd.read_csv('data/1091402.txt', skiprows=[1], delim_whitespace=True, na_values='******')
monthly_data = None
data["DT_YM_SL"] = data["DATE"].astype(str)
data["DT_YM_SL"] = data["DT_YM_SL"].str.slice(start=0, stop=6)
grouped = data.groupby(["DT_YM_SL"])
for yrmon, group in grouped:
avg_temp = group['TAVG'].mean()
avg_temp["YEAR"] = yrmon
monthly_data = monthly_data.append(avg_temp, ignore_index=True)
avg_temp = group['TAVG'].mean()
avg_temp is the mean of 'TAVG' column, which is float because mean() method returns a float number, and equals to ~29.48 on first iteration of FOR loop. So, this line of code:
avg_temp["YEAR"] = yrmon
becomes
29.48["YEAR"] = '195201'
and this is not possible in Python.
I have a dataframe with 4 columns: "Date" (in string format), "Hour" (in string format), "Energia_Attiva_Ingresso_Delta" and "Energia_Attiva_Uscita_Delta".
Obviously for every date there are multiple hours. I'd like to calculate a column for the overall dataframe, but on a daily base. Basically: the operation of the function must be calculated for every single date.
So, I thought to iter over the single values of the date column and to filter the dataframe with .loc, then pass the filtered df to the function. In the function I have to re-filter the df with loc (for the purpose of the calculation).
Here's the code I wrote and as you can see in the function i need to operate iterativelly on the row with the maximum value of 'Energia_Ingresso_Delta'; to do so I use again the .loc function:
#function
def optimize(df):
min_index = np.argmin(df.Margine)
max_index = np.argmax(df.Margine)
Energia_Prelevata_Da_Rete = df[df.Margine < 0]['Margine'].sum().round(1)
Energia_In_Eccesso = df[df.Margine > 0]['Margine'].sum().round(1)
carico_medio = (Energia_In_Eccesso / df[df['Margine']<0]['Margine'].count()).round(1)
while (Energia_In_Eccesso != 0):
max_index = np.argmax(df.Energia_Ingresso_Delta)
df.loc[max_index, 'Energia_Attiva_Ingresso_Delta'] = df.loc[max_index,'Energia_Attiva_Ingresso_Delta'] + carico_medio
Energia_In_Eccesso = (Energia_In_Eccesso - carico_medio).round(1)
#Call function with "partial dataframe". The dataframe is called "prova"
for items in list(prova.Data.unique()):
function(prova.loc[[items]])
But I keep getting this error:
"None of [Index(['2021-05-01'], dtype='object')] are in the [index]"
Can someone help me? :)
Thanks in advance
I have a particular problem, I would like to clean and prepare my data and I have a lot of unknown values for the "highpoint_metres" column of my dataframe (members). As there is no missing information for the "peak_id", I calculated the median value of the height according to the peak_id to be more accurate.
I would like to do two steps: 1) add a new column to my "members" dataframe where there would be the value of the median but different depending on the "peak_id" (value calculated thanks to the code in the question). 2) That the code checks that the value in highpoint_metres is null, if it is, that the value of the new column is put instead. I don't know if this is clearer
code :
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
And I don't know how to continue from there (my level of python is very bad ;-))
I believe that's what you're looking for:
import numpy as np
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")
is_highpoint_missing = np.isnan(members.highpoint_metres)
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)
so one way to go about replacing 0 with median could be:
import numpy as np
df[col_name] = df[col_name].replace({0: np.median(df[col_name])})
You can also use apply function:
df[col_name] = df[col_name].apply(lambda x: np.median(df[col_name]) if x==0 else x)
Let me know if this helps.
So adding a little bit more info based on Marie's question.
One way to get median is through groupby and then left join it with the original dataframe.
df_gp = df.groupby(['peak_id']).agg(Median = (highpoint_metres, 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id')
df = df.apply(lambda x['highpoint_metres']: x['Median'] if x['highpoint_metres']==np.nan else x['highpoint_metres'])
Let me know if this solves your issue
I want to read three columns from my pandas data frame and then combine with some character to form a new data frame column, the below iteration code works fine.
def date_creation(a,b,c):
date=str(a) +'/'+str(b)+'/'+str(c)
return date
df.loc["Test_FL_DATE"]=df[:,["DAY_OF_MONTH","MONTH","AYEAR"]].apply(date_creation)
Sample Input
Sample Output
However, if I want to do the same job by using apply or lambda. In fact, I am trying but it is not working. the code is as below which I believe is not correct. Thanks in advance for helping me out.
def date_creation(a,b,c):
date=str(a) +'/'+str(b)+'/'+str(c)
return date
df.loc["Test_FL_DATE"]=df[:,["DAY_OF_MONTH","MONTH","AYEAR"]].apply(date_creation)
Here is possible use if need lambda function:
cols = ["DAY_OF_MONTH","MONTH","AYEAR"]
df["Test_FL_DATE"] = df[cols].astype(str).apply(lambda x: '/'.join(x))
Or:
df["Test_FL_DATE"] = df[cols].apply(lambda x: '/'.join(x.astype(str)))
But nicer is:
df["Test_FL_DATE"] = df[["DAY_OF_MONTH","MONTH","AYEAR"]].astype(str).apply('/'.join)
And faster solution is simply join by +:
df["Test_FL_DATE"] = (df["DAY_OF_MONTH"].astype(str) + '/' +
df["MONTH"].astype(str) + '/' +
df["AYEAR"].astype(str))
Probably easiest to use pd.Series.str.cat, which concatenates one string Series with other Series.
df['Test_FL_Date'] = (df['DAY_OF_MONTH']
.astype(str)
.str
.cat([df['MONTH'], df['AYEAR'], sep='/'))