I have a dataframe with 4 columns: "Date" (string), "Hour" (string), "Energia_Attiva_Ingresso_Delta" and "Energia_Attiva_Uscita_Delta".
Every date has multiple hours. I'd like to calculate a column for the whole dataframe, but on a daily basis: the function must be applied separately to every single date.
So I thought I would iterate over the unique values of the date column, filter the dataframe with .loc, and pass the filtered df to the function. Inside the function I have to filter the df with .loc again (for the purpose of the calculation).
Here's the code I wrote. As you can see, in the function I need to operate iteratively on the row with the maximum value of 'Energia_Attiva_Ingresso_Delta'; to do so I use .loc again:
# function
def optimize(df):
    min_index = np.argmin(df.Margine)
    max_index = np.argmax(df.Margine)
    Energia_Prelevata_Da_Rete = df[df.Margine < 0]['Margine'].sum().round(1)
    Energia_In_Eccesso = df[df.Margine > 0]['Margine'].sum().round(1)
    carico_medio = (Energia_In_Eccesso / df[df['Margine']<0]['Margine'].count()).round(1)
    while (Energia_In_Eccesso != 0):
        max_index = np.argmax(df.Energia_Attiva_Ingresso_Delta)
        df.loc[max_index, 'Energia_Attiva_Ingresso_Delta'] = df.loc[max_index,'Energia_Attiva_Ingresso_Delta'] + carico_medio
        Energia_In_Eccesso = (Energia_In_Eccesso - carico_medio).round(1)

# Call the function with a "partial dataframe". The dataframe is called "prova"
for items in list(prova.Data.unique()):
    optimize(prova.loc[[items]])
But I keep getting this error:
"None of [Index(['2021-05-01'], dtype='object')] are in the [index]"
Can someone help me? :)
Thanks in advance
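A likely cause of the error, sketched under assumptions: .loc[[items]] looks the date up among the index labels, not among the values of the Data column, so it fails when the index is the default 0, 1, 2, ... One way around this is to filter with a boolean mask, or to let groupby hand over one sub-dataframe per date. The toy data and the daily_total function below are hypothetical stand-ins for the real dataframe and for optimize().

```python
import pandas as pd

# Hypothetical toy data with the same column names as in the question
prova = pd.DataFrame({
    "Data": ["2021-05-01", "2021-05-01", "2021-05-02", "2021-05-02"],
    "Hour": ["00", "01", "00", "01"],
    "Energia_Attiva_Ingresso_Delta": [1.0, 2.0, 3.0, 4.0],
    "Energia_Attiva_Uscita_Delta": [0.5, 0.5, 1.0, 1.0],
})

def daily_total(day_df):
    # Hypothetical stand-in for optimize(): any per-day calculation goes here
    return day_df["Energia_Attiva_Ingresso_Delta"].sum()

# Option 1: boolean mask on the column instead of a label lookup on the index
totals_mask = {
    day: daily_total(prova.loc[prova["Data"] == day])
    for day in prova["Data"].unique()
}

# Option 2: groupby yields (date, sub-dataframe) pairs directly
totals_groupby = {day: daily_total(day_df) for day, day_df in prova.groupby("Data")}

print(totals_mask)
```

The same label-versus-position distinction applies inside the function: np.argmax returns a position, while .loc expects an index label, so Series.idxmax() is probably the safer choice there.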
Hello, everyone! New student of Python's Pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
import pandas as pd

df_dict = {
    'header0' : [55,12,13,14,15],
    'header1' : [21,22,23,24,25],
    'header2' : [31,32,55,34,35],
    'header3' : [41,42,43,44,45],
    'header4' : [51,52,53,54,33]
}
index_list = {
    0:'index0',
    1:'index1',
    2:'index2',
    3:'index3',
    4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). For example, if I look for the value 55, the code should return header0, index0, header2, index2 in some format: a list, a tuple, printed output, etc.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
Another "farthest" way I've gotten is using df.unstack().idxmax(), which returns a tuple of the column header and index, but has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.
Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
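For completeness, here is a self-contained sketch of the np.where approach wrapped in a small helper. The function name locate is hypothetical, and the dataframe is rebuilt from the question's text reconstruction.

```python
import numpy as np
import pandas as pd

# Dataframe from the question, with the index labels set directly
df = pd.DataFrame(
    {
        "header0": [55, 12, 13, 14, 15],
        "header1": [21, 22, 23, 24, 25],
        "header2": [31, 32, 55, 34, 35],
        "header3": [41, 42, 43, 44, 45],
        "header4": [51, 52, 53, 54, 33],
    },
    index=[f"index{i}" for i in range(5)],
)

def locate(frame, value):
    # Positions of all cells equal to `value`, in row-major order
    rows, cols = np.where((frame == value).to_numpy())
    # Translate positions into (index label, column label) pairs
    return list(zip(frame.index[rows], frame.columns[cols]))

matches = locate(df, 55)
print(matches)
```

The comparison runs once over the whole frame, so there is no Python-level loop over cells, and a value that does not occur simply gives an empty list.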
There is a method in pandas, DataFrame.duplicated, that flags duplicate rows (note that it compares whole rows, not individual values):
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack for Series with MultiIndex and then filter duplicates by Series.duplicated with keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicates together with their values:
df1 = (s[s.duplicated(keep=False)]
       .sort_values()
       .rename_axis(['cols', 'idx'])
       .reset_index(name='val'))
If you need to test for a specific value, change the mask to Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
The code below does iterate, but only over the columns rather than over every cell of the DataFrame. For each column it uses .any() to check whether the desired value is present, then uses .loc to select the matching rows and prints their index labels.
wanted_value = 55
for col in list(df.columns):
    if df[col].eq(wanted_value).any():
        print("row:", *list(df.loc[df[col].eq(wanted_value)].index), ' col', col)
I have created a function that goes through each column of a dataframe, shifts the data in that column up to the first observation (shifting past the '-' placeholders), and stores the column in a dictionary. I then convert the dictionary back to a dataframe to get the appropriately shifted columns. The function works and takes about 10 seconds on a 12x3000 dataframe. However, on a 12x25000 dataframe it is extremely slow. I feel like there is a much better way to approach this to increase the speed, perhaps even an argument of the shift function that I am missing. I'd appreciate any help.
def create_seasoned_df(df_orig):
    """
    Creates a seasoned dataframe with only the first 12 periods of a loan
    """
    df_seasoned = df_orig.reset_index().copy()
    temp_dic = {}
    for col in cols:
        to_shift = -len(df_seasoned[df_seasoned[col] == '-'])
        temp_dic[col] = df_seasoned[col].shift(periods=to_shift)
    df_seasoned = pd.DataFrame.from_dict(temp_dic, orient='index').T[:12]
    return df_seasoned
Try using apply instead:
def create_seasoned_df(df_orig):
    # Shift each column up by its own number of '-' placeholders,
    # then keep the first 12 periods as in the original function
    df_seasoned = df_orig.reset_index().apply(lambda x: x.shift(-x.eq('-').sum()), axis=0)
    return df_seasoned[:12]
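A minimal sketch of the per-column shift on toy data may make the idea easier to check. The column names and the helper name are hypothetical, and the sketch assumes, as the question's code does, that the '-' placeholders only appear at the top of each column.

```python
import pandas as pd

# Hypothetical toy data: '-' marks periods before a loan's first observation
df_orig = pd.DataFrame({
    "loan_a": ["-", "-", 1, 2, 3],
    "loan_b": ["-", 4, 5, 6, 7],
})

def shift_to_first_observation(df, n_periods=12):
    # Count the '-' placeholders in each column and shift that column up by the same amount
    shifted = df.apply(lambda col: col.shift(-int(col.eq("-").sum())))
    # Keep only the first n_periods rows
    return shifted.head(n_periods)

seasoned = shift_to_first_observation(df_orig)
print(seasoned)
```

Rows that are shifted in from below the end of a column are filled with NaN.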
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('Group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping in [my_df1,my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df=Sample[0].loc[Sample[0]['Group']==n] #This is inelegant, but trying to work
                                                #with Sample[1] in the for doesn't work
        if (df['Value'].max()>my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it neither adds the Bad_Row column to my df nor modifies my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
What should I do?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Value' entries in your input data meet the specified criterion.
def my_func(row,grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value']>16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max=16
def my_func(row,grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value']>my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x,grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row']!=0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis = 1, inplace = True)
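As a possible alternative, the same cleaning can be sketched without a row-wise apply: compute the maximum Value per group once and keep only the groups at or below the threshold. Keeping each pair of frames in a dict keyed by a name is one common way to get the "name + _Group" pairing asked about; the dict keys and the cleaned variable are hypothetical. It is also worth noting that in the question's loop, Sample[1] = ... rebinds the list element to a new DataFrame, so the original my_df1_Group variable is left unchanged.

```python
import pandas as pd

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df2 = pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])
my_df2_Group = pd.DataFrame([[1,38],[2,49]], columns=['Group','Group_Value'])
my_max = 16

# Hypothetical naming scheme: each key maps to (element frame, group frame)
samples = {
    "my_df1": (my_df1, my_df1_Group),
    "my_df2": (my_df2, my_df2_Group),
}

cleaned = {}
for name, (elements, groups) in samples.items():
    # Largest Value per group, computed in one pass
    group_max = elements.groupby("Group")["Value"].max()
    # Groups whose largest element does not exceed the threshold
    good_groups = group_max[group_max <= my_max].index
    cleaned[name] = groups[groups["Group"].isin(good_groups)].reset_index(drop=True)

print(cleaned["my_df1"])
print(cleaned["my_df2"])
```

With the sample data exactly as given, group 2 is dropped from my_df1_Group (it contains 17) and group 1 is dropped from my_df2_Group (it contains 23).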
I'm having a hard time with a complex dataframe filter.
Here is the problem:
For rows that share the same 'id', the column 'job' can take the values 'fireman', 'nan', 'policeman'.
I would like to filter my dataframe so that, for each id,
I keep only the rows from the start of the last consecutive run of 'fireman' onward. I first have to group by 'job' values to filter on:
df.groupby("job").filter(lambda x: f(x))
I don't know which function f is appropriate.
Any ideas ?
To try:
df = pd.DataFrame([[79,1,], [79,2,'fireman'],[79,3,'fireman'],[79,4,],[79,5,],[79,6,'fireman'],[79,7,'fireman'],[79,8,'policeman']], columns=['id','day','job'])
output = pd.DataFrame([[79,6,'fireman'],[79,7,'fireman'],[79,8,'policeman']], columns=['id','day','job'])
Here is a version without the need of extra variables:
df.groupby('imo').apply(lambda grp: grp[grp.index >=
        ((grp.polygon.shift() != grp.polygon) &
         (grp.polygon.shift(-1) == grp.polygon) &
         (grp.polygon == 'FE')
        ).cumsum().idxmax()]
    ).reset_index(level=0, drop=True)
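The answer above uses the column names imo and polygon and the value 'FE', which do not appear in the sample data. Below is a sketch of a variant of the same idea on the question's own columns, assuming imo corresponds to id, polygon to job and 'FE' to 'fireman'. Unlike the answer, it marks every start of a 'fireman' run (including runs of length 1) and uses grouped cumsum/transform instead of groupby.apply.

```python
import pandas as pd

# Sample data from the question, with the missing jobs written out as None
df = pd.DataFrame(
    [[79, 1, None], [79, 2, 'fireman'], [79, 3, 'fireman'], [79, 4, None],
     [79, 5, None], [79, 6, 'fireman'], [79, 7, 'fireman'], [79, 8, 'policeman']],
    columns=['id', 'day', 'job'],
)

# A run starts where job is 'fireman' and the previous row of the same id is not
prev_job = df.groupby('id')['job'].shift()
is_start = ((df['job'] == 'fireman') & (prev_job != 'fireman')).astype(int)

# Number the runs within each id; rows after a run start inherit its number
run_no = is_start.groupby(df['id']).cumsum()

# Keep rows belonging to the last run (and everything after it) for each id
last_run = run_no.groupby(df['id']).transform('max')
output = df[(run_no == last_run) & (run_no > 0)]

print(output)
```

Ids that never have a 'fireman' row are dropped entirely by the run_no > 0 condition; whether that is the desired behaviour is an assumption.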
I want to create a new column comp in a pandas dataframe containing a single column price. The value of this new column should be generated by a function that works on the current and last 3 values of the price.
df.apply() works on a single row at a time, and shift() doesn't seem to work. Do experts have any suggestions for doing this as a vectorized operation?
Use Series.apply() with a function that sums a slice of the series. The code below assumes you have an index or column named ID of increasing row values 1, 2, 3, ... that can be used to count back 3 values; note that it sums the 3 values before the current row.
# SERIES SUM FUNCTION
def intsum(x):
    if x < 3:
        ser = df.price[(df.ID < x)]
    else:
        ser = df.price[(df.ID >= x - 3) & (df.ID < x)]
    return ser.sum()

# APPLY FUNCTION
df['comp'] = df['ID'].apply(intsum)
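If the function only needs the current and the last 3 prices, a rolling window of 4 rows is another option and avoids filtering the whole series for every row. The sketch below assumes the function reduces a window to a single number; the toy prices, the price_range function and the column names comp_sum and comp_range are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0]})

def price_range(window):
    # Example reducer: spread between the highest and lowest price in the window
    return window.max() - window.min()

# Window of 4 rows = current price plus the previous 3;
# min_periods=1 lets the first rows use however many values exist so far
window = df["price"].rolling(window=4, min_periods=1)

df["comp_sum"] = window.sum()                            # built-in, vectorized reducer
df["comp_range"] = window.apply(price_range, raw=True)   # custom function on NumPy arrays

print(df)
```

Built-in reducers such as sum, mean, min and max are computed without a Python-level loop; rolling(...).apply with raw=True still calls the Python function once per window, but passes it a plain NumPy array.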