I am exploring whether it is possible to create a calculation or total row which uses a column's value matched on a specified index value. I am quite new to Python, so I am not sure whether this is possible using pivots. See the pivot I want to replicate below.
As you can see in the image above, I want the "Ordered" row to be the calculation row: in each column it subtracts the "Not Ordered" row value from the Grand Total.
Is it possible in Python to search the index, specifying criteria (e.g. "Not Ordered"), and loop through the columns to calculate the "Ordered" row?
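One way to sketch this, assuming made-up column and index names (the original pivot is only shown as an image): build the pivot with `margins=True` to get the Grand Total row, then assign the calculated row with `.loc`, so no explicit loop over columns is needed.

```python
import pandas as pd

# Hypothetical data standing in for the pivot shown in the image
df = pd.DataFrame({
    "Status": ["Not Ordered", "Ordered", "Not Ordered", "Ordered"],
    "Region": ["East", "East", "West", "West"],
    "Qty":    [3, 7, 2, 8],
})

pivot = df.pivot_table(index="Status", columns="Region",
                       values="Qty", aggfunc="sum",
                       margins=True, margins_name="Grand Total")

# Calculate the "Ordered" row as Grand Total minus "Not Ordered",
# across every column at once
pivot.loc["Ordered"] = pivot.loc["Grand Total"] - pivot.loc["Not Ordered"]
```

Row arithmetic with `.loc` is vectorised over all columns, which replaces the column-by-column loop the question describes.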
When calculating a new column called "duration_minutes", some of the results are negative because the values were put in the original columns backwards.
time.started_at = pd.to_datetime(time.started_at)
time.ended_at = pd.to_datetime(time.ended_at)
time["duration_minutes"] = (time.ended_at - time.started_at).dt.total_seconds() / 60
time.head()
A quick check for negatives, time[time.duration_minutes < 0], shows many rows with negative values in the "duration_minutes" column because the start and stop times are in the wrong columns.
Is there a way to create and calculate the "duration_minutes" column to deal with this situation?
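A minimal sketch of one approach: since a negative duration only ever means the two timestamps were swapped, taking the absolute value of the difference gives the duration as if the columns had been entered correctly. The sample data here is made up to mirror the situation described.

```python
import pandas as pd

# Hypothetical frame: row 0 has its timestamps entered backwards
time = pd.DataFrame({
    "started_at": ["2023-01-01 10:30", "2023-01-01 12:00"],
    "ended_at":   ["2023-01-01 10:00", "2023-01-01 12:45"],
})
time.started_at = pd.to_datetime(time.started_at)
time.ended_at = pd.to_datetime(time.ended_at)

# abs() treats swapped rows as if start and end were in the right columns
time["duration_minutes"] = (
    (time.ended_at - time.started_at).dt.total_seconds().abs() / 60
)
```

If you also want to repair the columns themselves rather than just the derived value, a mask like `swapped = time.ended_at < time.started_at` can be used to swap the two columns for the affected rows.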
I have the following pandas df:
It is sorted by 'patient_id', 'StartTime' and 'hour_counter'.
I'm looking to perform two conditional operations on the df:
Change the value of the Delta_Value column
Delete the entire row
Where the condition depends on the values of ParameterID or patient_id in the current row and the row before.
I managed to do that using classic programming (i.e. a simple loop in Python), but not using Pandas.
Specifically, I want to change the 'Delta_Value' to 0 or delete the entire row, if the ParameterID in the current row is different from the one at the row before.
I've tried to use .groupby().first(), but that won't work in some cases, because the same patient_id can have multiple occurrences of the same ParameterID with a different ParameterID in between those occurrences. See, for example, record 10 in the df.
And I need the records to be sorted by the StartTime & hour_counter.
Any suggestions?
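A sketch of the usual vectorised replacement for this kind of loop: compare each row with the previous one using shift(), then either zero out or drop the flagged rows with boolean indexing. The sample data is invented; note that shift() leaves the very first row with no predecessor, so it compares unequal and gets flagged too, which may or may not match what the original loop did.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id":  [1, 1, 1, 2],
    "ParameterID": [10, 10, 20, 10],
    "Delta_Value": [5.0, 6.0, 7.0, 8.0],
})

# True where ParameterID (or patient_id) differs from the row before
changed = (df["ParameterID"] != df["ParameterID"].shift()) | (
    df["patient_id"] != df["patient_id"].shift()
)

# Option 1: set Delta_Value to 0 on those rows
df.loc[changed, "Delta_Value"] = 0

# Option 2: delete those rows instead
# df = df[~changed].reset_index(drop=True)
```

Because shift() operates on the frame's current order, make sure the df is sorted by 'patient_id', 'StartTime', 'hour_counter' before computing the mask.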
I have some data with 4 features of interest: account_id, location_id, date_from and date_to. Each entry corresponds to a period where a customer account was associated with a particular location.
There are some pairs of account_id and location_id which have multiple entries, with different dates. This means that the customer is associated with the location for a longer period, covered by multiple consecutive entries.
So I want to create an extra column with the total length of time that a customer was associated with a given location. I am able to use groupby and apply to calculate this for each pair (see the code below). This works fine, but I don't understand how to then add the result back into the original dataframe as a new column.
lengths = non_zero_df.groupby(['account_id','location_id'], group_keys=False).apply(lambda x: x.date_to.max() - x.date_from.min())
Thanks
I think Mephy is right that this should probably go to StackOverflow.
You're going to have a shape incompatibility because there will be fewer entries in the grouped result than in the original table. You'll need to do the equivalent of an SQL left outer join with the original table and the results, and you'll have the total length show up multiple times in the new column -- every time you have an equal (account_id, location_id) pair, you'll have the same value in the new column. (There's nothing necessarily wrong with this, but it could cause an issue if people are trying to sum up the new column, for example)
Check out pandas.DataFrame.join (you can also use merge). You'll want to join the old table with the results, on (account_id, location_id), as a left (or outer) join.
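A sketch of that left join, with made-up sample data and the column names from the question:

```python
import pandas as pd

non_zero_df = pd.DataFrame({
    "account_id":  [1, 1, 2],
    "location_id": [10, 10, 20],
    "date_from": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "date_to":   pd.to_datetime(["2020-02-01", "2020-04-01", "2020-05-01"]),
})

# One total length per (account_id, location_id) pair
g = non_zero_df.groupby(["account_id", "location_id"])
lengths = (g["date_to"].max() - g["date_from"].min()).rename("total_length").reset_index()

# Left join broadcasts each pair's total back onto every original row
result = non_zero_df.merge(lengths, on=["account_id", "location_id"], how="left")
```

As the answer notes, the total repeats on every row of a pair. If you only need the new column and not the intermediate table, `groupby(...)["date_to"].transform("max") - groupby(...)["date_from"].transform("min")` broadcasts directly without a separate join.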
I am just wondering if it's possible to sum a dataframe, showing a total value at the end of each column while keeping the label string description in the zeroth (label) column, like you would in Excel?
I am using Python 2.7
Summing a column is as easy as Dataframe_Name['COLUMN_NAME'].sum(); you can review it in the documentation.
You can also do Dataframe_Name.sum(), which returns the sums of every column.
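To get the Excel-style layout the question asks for, one sketch (column names are made up) is to sum only the numeric columns, put a label such as "Total" into the text column, and append the result as a new row:

```python
import pandas as pd

df = pd.DataFrame({
    "Item":  ["Widgets", "Gadgets"],
    "Qty":   [4, 6],
    "Price": [2.5, 3.5],
})

totals = df.sum(numeric_only=True)   # sums Qty and Price only
totals["Item"] = "Total"             # label goes in the text column
df.loc[len(df)] = totals             # append as the last row
```

Assigning a Series to df.loc[...] aligns on column names, so the label and the sums land in the right columns.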
I'm trying to average values in columns E, F, G ... based on which key they have in column A.
I have an Excel formula that works, adapted from this post: =AVERAGEIF(A:A,0,E:E). However, there are potentially n keys (up to 100000) and m columns of values (up to 100).
My problem is that typing in that formula for each variation isn't logical. Is there a better way to do this; so that it takes into account the varying groups and columns?
NOTE
The first 7 rows are junk (they contain info about the file)
The data is in CSV format, I am only using Excel to open it
I am currently looking into modifying this script, but as I am unfamiliar with Python, it may take some time.
EXAMPLE
Column A is the group, column E are the values to be averaged. Column F contains the output averages (this is a little messy because it doesn't show which groups the averages are for, but they are just in ascending order, ie: 0, 1, 2)
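In pandas this becomes a one-line groupby that handles any number of keys and value columns at once. A minimal sketch, assuming the layout described above (7 junk rows, no header row, group key in column A, values from column E onward); the inline string stands in for the real file, which you would read with something like pd.read_csv("yourfile.csv", skiprows=7, header=None):

```python
import io
import pandas as pd

# Tiny stand-in for the CSV: 7 junk rows, then key,...,value rows
raw = "junk\n" * 7 + "0,x,x,x,10\n0,x,x,x,20\n1,x,x,x,30\n"
df = pd.read_csv(io.StringIO(raw), skiprows=7, header=None)

# Column 0 is the group key (column A); columns 4 onward are the
# value columns (E, F, G, ...)
means = df.groupby(0)[df.columns[4:]].mean()
```

Unlike the AVERAGEIF approach, nothing needs to be typed per key or per column: groupby picks up every key present, and the column slice covers all value columns.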
If I understand you correctly, I would use a Pivot Table to summarize your data. You can group by column A and then get the means of the rest of the columns.
On the ribbon interface click "Insert" and select "Pivot Table". You'll then be prompted to select your data and set a location.
After that you should see a window on the right asking for fields. Drag the Column A field to the Row Labels list, and then the columns you want averages of to the Values list. You might need to change the value field settings from the default Sum to Average, which is what you want.
https://support.office.com/en-AU/Article/Create-a-PivotTable-to-analyze-worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576