When calculating a new column called "duration_minutes", some of the results are negative because the start and end times were entered in the original columns in reverse.
time.started_at=pd.to_datetime(time.started_at)
time.ended_at=pd.to_datetime(time.ended_at)
time["duration_minutes"]=(time.ended_at-time.started_at).dt.total_seconds()/60
time.head()
A quick check for negatives, time[time.duration_minutes < 0], shows many rows with negative values in the "duration_minutes" column because the start and end times are in the wrong columns.
Is there a way to create and calculate the "duration_minutes" column to deal with this situation?
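One common fix, sketched below with made-up timestamps, is to take the absolute value of the difference so that swapped start/end pairs still yield a positive duration:

```python
import pandas as pd

# Hypothetical sample data; the second row has started_at/ended_at swapped.
time = pd.DataFrame({
    "started_at": ["2023-01-01 10:00", "2023-01-01 12:45"],
    "ended_at":   ["2023-01-01 10:30", "2023-01-01 12:00"],
})
time.started_at = pd.to_datetime(time.started_at)
time.ended_at = pd.to_datetime(time.ended_at)

# abs() makes the duration positive regardless of which column holds
# the earlier timestamp.
time["duration_minutes"] = (
    (time.ended_at - time.started_at).dt.total_seconds().abs() / 60
)
```

If you also want to repair the columns themselves rather than just the duration, you can build a boolean mask of the rows where the difference is negative and swap the two columns there with DataFrame.loc.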
I am fairly new to Python and I am trying to simulate the following logic in pandas.
I am currently looping through the rows and want to sum the values in the AMOUNT column in the prior rows, but only up to the last seen 'TRUE' value. This seems inefficient with the actual data (a dataframe of about 5 million rows). What would an efficient way of handling such logic in pandas entail?
Logic:
The logic is that if FLAG is TRUE, I want to sum the values in the AMOUNT column in the prior rows, but only up to the last seen 'TRUE' value. Basically, sum the values in AMOUNT between the rows where FLAG is TRUE.
You can do this with cumsum to build group labels and transform('sum'):
df['SUM'] = df.groupby(df['FLAG'].cumsum())['AMOUNT'].transform('sum').where(df.FLAG)
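Note that grouping on FLAG.cumsum() starts a new segment at each True row; if, per the question, each segment should instead *end* at the True row (summing the rows since the previous True), the group key can be shifted by one. A runnable sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "FLAG":   [False, False, True, False, True],
    "AMOUNT": [10, 20, 30, 40, 50],
})

# Shifting the cumulative flag count by one makes each True row the end
# of its segment, so the sum covers the rows since the previous True.
key = df["FLAG"].cumsum().shift(fill_value=0)
df["SUM"] = df.groupby(key)["AMOUNT"].transform("sum").where(df["FLAG"])
```

Here the True row at index 2 gets 10 + 20 + 30 = 60, the one at index 4 gets 40 + 50 = 90, and all other rows are NaN.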
Maybe try something along the following lines:
import pandas as pd
df = pd.read_csv('name of file.csv')
df['AMOUNT'].sum()
I am working on a dataframe with multiple columns; one of the columns contains string values in many rows (approximately more than 1000). Kindly check the table below for more details:
In the above image, I want to change the string values in the Group_Number column to numbers by taking the value from the first column (MasterGroup), incrementing by one (01), and producing values like below:
I also need to ensure that if a string is duplicated, it is replaced with the number already assigned instead of being given a new one. For example, in the above image ANAYSIM appears twice, and instead of a new sequence number I want the repeated string to receive the number already given.
I have checked various links, but they focus on values supplied by the user:
Pandas DataFrame: replace all values in a column, based on condition
Change one value based on another value in pandas
Conditional Replace Pandas
Any help with achieving the desired outcome is highly appreciated.
We could use cumcount with groupby:
s = (df.groupby('MasterGroup').cumcount() + 1).mul(10).astype(str)
t = pd.to_numeric(df.Group_Number, errors='coerce')
Then we assign:
df.loc[t.isnull(), 'Group_Number'] = df.MasterGroup.astype(str) + s
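A runnable sketch of this approach with made-up data (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "MasterGroup":  [100, 100, 100, 200],
    "Group_Number": ["ANAYSIM", "101", "FOO", "BAR"],
})

# Per-group running count, scaled to a two-digit suffix: 10, 20, 30, ...
s = (df.groupby("MasterGroup").cumcount() + 1).mul(10).astype(str)

# Rows whose Group_Number is not already numeric need a generated value.
t = pd.to_numeric(df.Group_Number, errors="coerce")
df.loc[t.isnull(), "Group_Number"] = df.MasterGroup.astype(str) + s
```

One caveat: this assigns a fresh suffix to every string row, so a repeated string such as ANAYSIM would get a different number on each occurrence. Mapping duplicates to the first assigned number would need an extra step, for example grouping on the string value itself and reusing the first generated number per group.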
I am exploring whether it is possible to create a calculation or total row that uses the column values matching a specified index value. I am quite new to Python, so I am not sure whether it is possible using pivots. See the pivot I want to replicate below.
As you can see in the image above, I want the Ordered row to be the calculation row. It should subtract the Not Ordered row value of each column from the Grand Total.
Is it possible in Python to search the index for specified criteria (e.g. "Not Ordered") and loop through the columns to calculate the "Ordered" row?
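One way to sketch this, assuming a pivot that already contains "Grand Total" and "Not Ordered" rows (row and column names here are made up to match the description):

```python
import pandas as pd

# Hypothetical pivot result: rows are order statuses, columns are periods.
pivot = pd.DataFrame(
    {"Jan": [100, 30], "Feb": [80, 10]},
    index=["Grand Total", "Not Ordered"],
)

# .loc selects whole rows by index label, so the subtraction runs
# column by column without an explicit loop.
pivot.loc["Ordered"] = pivot.loc["Grand Total"] - pivot.loc["Not Ordered"]
```

With the sample numbers above, the new "Ordered" row holds 70 for both Jan and Feb.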
I am just wondering whether it is possible to sum a dataframe so that a total value appears at the end of each column while keeping the label string in the first column (like you would in Excel)?
I am using Python 2.7
Summing a column is as easy as Dataframe_Name['COLUMN_NAME'].sum(); you can review it in the documentation.
You can also do Dataframe_Name.sum(), which will return the sum of each column.
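For the Excel-style layout the question describes (a total row that keeps the label column), here is a sketch with made-up data; sum(numeric_only=True) skips the string column so only the numbers are totalled:

```python
import pandas as pd

df = pd.DataFrame({
    "item":  ["apples", "pears"],
    "qty":   [2, 3],
    "price": [1.5, 2.5],
})

# Sum only the numeric columns, label the row, then append it.
total = df.sum(numeric_only=True)
total["item"] = "Total"
df.loc[len(df)] = total
```

The appended row keeps "Total" in the label column and the column sums (qty 5, price 4.0) in the numeric columns.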
I have a long list of columns for which I want to calculate the non-zero median, mean, and std in one go. I cannot just delete rows with 0 based on one column, because the value for another column in the same row may not be 0.
Below is the code I currently have, which calculates the median, mean, etc. including zeros.
agg_list_oper={'ABC1':[max,np.std,np.mean,np.median],
'ABC2':[max,np.std,np.mean,np.median],
'ABC3':[max,np.std,np.mean,np.median],
'ABC4':[max,np.std,np.mean,np.median],
.....
.....
.....
}
df=df_tmp.groupby(['id']).agg(agg_list_oper).reset_index()
I know I can write long code with loops to process one column at a time.
Is there a way to do this in pandas groupby.agg() or some other functions elegantly?
You can temporarily replace 0's with NaNs; pandas will then ignore the NaNs while calculating the medians, means, and standard deviations.
df_tmp.replace(0, np.nan).groupby(['id']).agg(agg_list_oper).reset_index()
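A self-contained sketch of the idea with made-up data. One caveat worth noting: replace(0, np.nan) applies to every column, including id, so this assumes the grouping column cannot legitimately contain zeros.

```python
import numpy as np
import pandas as pd

df_tmp = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2],
    "ABC1": [0, 10, 20, 0, 8],
})
agg_list_oper = {"ABC1": ["max", "mean", "median"]}

# Zeros become NaN, which groupby aggregations skip by default.
out = df_tmp.replace(0, np.nan).groupby("id").agg(agg_list_oper).reset_index()
```

For id 1 the zero is ignored, giving max 20, mean 15, and median 15 instead of statistics dragged down by the 0.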