Select a single minimum value from a Pandas dataframe column instead of multiple - python

I want to get the minimum value per year from a dataframe (df_greater_TDS) column ('DTS38').
So I grouped by the year column and applied transform(min). However, as there are multiple rows sharing the minimum value, the comparison returns multiple rows.
How do I get only one value, or a single row, here?
idx = df_greater_TDS.groupby('year')['DTS38'].transform(min) == df_greater_TDS['DTS38']
df_TDS = df_greater_TDS[idx]
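A minimal sketch of one approach (sample data invented for illustration): groupby(...)['DTS38'].idxmin() returns the index label of the first row holding each group's minimum, so every year contributes exactly one row even when the minimum is tied.
import pandas as pd
# Hypothetical stand-in for df_greater_TDS
df_greater_TDS = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020],
    'DTS38': [5.0, 5.0, 3.0, 4.0],  # 2019 has a tied minimum
})
# idxmin picks the first index label of each group's minimum
idx = df_greater_TDS.groupby('year')['DTS38'].idxmin()
df_TDS = df_greater_TDS.loc[idx]
print(df_TDS)  # one row per year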

Related

How to modify column names when combining rows of multiple columns into single rows in a dataframe based on a categorical value, plus selective sum/mean

I'm using the pandas .groupby function to combine multiple rows into a single row.
Currently I have a dataframe df_clean which has 405 rows and 85 columns. There are up to 3 rows that correspond to a single Batch.
My current code for combining the multiple rows is:
num_V = 84  # number of columns - 1, excluding the "Batch" column being grouped by
max_row = df_clean.groupby('Batch').Batch.count().max()
df2 = (
    df_clean.groupby('Batch')
    .apply(lambda x: x.values[:, 1:].reshape(1, -1)[0])
    .apply(pd.Series)
)
This code works, creating a dataframe df2 which groups the rows by Batch; however, the columns in the resulting dataframe are simply numbered (0, 1, 2, 3, ..., 249, 250, 251). Note that 84 * 3 = 252, i.e. (number of columns - the Batch column) * 3 = 252, and Batch becomes the index.
I'm cleaning some data for analysis and I want to combine the data of several (generally 1-3) Sub_Batch values on separate rows into a single row based on their Batch. Ideally I would like to be able to determine which columns are spread out into separate columns in the combined row, and for which columns only the average or total value is reported.
For example, the desired input/output:
Original dataframe (shown as a table image in the original post)
Output dataframe (shown as a table image in the original post)
Note the naming of the columns, that all columns are copied over, and that the columns are ordered according to which Sub_Batch they belong to: i.e. Weight_2 will always correspond to the second Sub_Batch that is part of that Batch, and Weight_3 to the third.
Ideal output dataframe (shown as a table image in the original post)
Note the naming of the columns, and that in this dataframe there is only a single column recording the Color, as it is identical for all Sub_Batch values within a Batch. The individual Temperature values are recorded, as well as the average of the Temperature values for a Batch. The individual Weight values are recorded, as well as the sum of the Weight values in the column 'Total_Weight'.
I am 100% okay with the Output dataframe scenario, as I will simply add the values that I want afterwards using .mean and .sum. I am simply asking if it can be done using .groupby, as it is not something I have worked with before, and I know that it does have some ability to sum or average results.
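A sketch of one way to do both steps, with invented sample data (the Color, Temperature and Weight columns are placeholders); it assumes the rows are already ordered by Sub_Batch within each Batch:
import pandas as pd
# Hypothetical stand-in for df_clean
df_clean = pd.DataFrame({
    'Batch': ['A', 'A', 'A', 'B', 'B'],
    'Color': ['red', 'red', 'red', 'blue', 'blue'],
    'Temperature': [10, 12, 14, 20, 22],
    'Weight': [1.0, 2.0, 3.0, 4.0, 5.0],
})
# Number each row within its Batch, then pivot so the columns
# come out named Temperature_1, Temperature_2, Weight_1, ...
df_clean['n'] = df_clean.groupby('Batch').cumcount() + 1
wide = df_clean.pivot(index='Batch', columns='n', values=['Temperature', 'Weight'])
wide.columns = [f'{col}_{n}' for col, n in wide.columns]
# Selective aggregates: one Color per Batch, mean Temperature, total Weight
agg = df_clean.groupby('Batch').agg(
    Color=('Color', 'first'),
    Average_Temperature=('Temperature', 'mean'),
    Total_Weight=('Weight', 'sum'),
)
print(agg.join(wide))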

Limit pandas .loc method output within an .iloc range

I am looking for a maximum value within my pandas dataframe, but only within a certain index range:
df.loc[df['Score'] == df['Score'].iloc[430:440].max()]
This gives me a pandas.core.frame.DataFrame type output with multiple rows.
I specifically need the integer index of the maximum value within iloc[430:440], and only the first index at which the maximum value occurs.
Is there any way to limit the range of the .loc method?
Thank you
If you just want the index:
i = df['Score'].iloc[430:440].idxmax()
If you want to get the row as well:
df.loc[i]
If you want to get the first row in the entire dataframe with that value (rather than just within the range you specified originally):
df[df['Score'] == df['Score'].iloc[430:440].max()].iloc[0]
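For instance, on a small invented frame, idxmax returns the label of the first occurrence of the maximum within the slice:
import pandas as pd
df = pd.DataFrame({'Score': [5, 9, 9, 2, 7]})
# The slice iloc[1:4] covers the values [9, 9, 2]
i = df['Score'].iloc[1:4].idxmax()
print(i)          # 1 -- the index of the first of the two 9s
print(df.loc[i])  # the full row at that index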

Grouping unique values with low value counts

My dataframe contains over 40 unique values for a particular attribute. I want to do some visualisation of this data, but fitting in all 40 points is challenging. Using wine['country'].value_counts(), I can see the frequency of each unique value.
When I go to create, for example, a bar chart, I would like any unique values with value counts less than 100 to be grouped together into its own bar in the visualisation (and, say, call it 'rest' or 'other').
Any way of doing this?
Initiate a variable x = 0. Iterate through wine['country'].value_counts() with a for loop, and whenever a particular count is less than 100, add it to x. This way you will have the sum of all counts whose value is less than 100.
Now, before charting, create a new dataframe of country vs. value_counts() containing only those rows whose count is greater than 100. Then manually add another row named 'other' to this new dataframe with x as its count, and use this new dataframe for charting, as sketched below.
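A minimal vectorised sketch of the same idea (the wine data here is invented for illustration):
import pandas as pd
# Hypothetical stand-in for the wine dataframe
wine = pd.DataFrame({'country': ['US'] * 150 + ['France'] * 120 + ['Georgia'] * 30 + ['Peru'] * 10})
counts = wine['country'].value_counts()
major = counts[counts >= 100]
other = counts[counts < 100].sum()
# One bar per major country plus a single 'other' bucket
plot_data = pd.concat([major, pd.Series({'other': other})])
print(plot_data)
plot_data.plot(kind='bar')  # requires matplotlib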

Aggregate Function to dataframe while retaining rows in Pandas

I want to aggregate my data based on a field known as COLLISION_ID, with a count of each COLLISION_ID.
I want to remove repeating COLLISION_IDs, since they have the same coordinates, but retain a count of their occurrences in the original dataset.
My code is below
df2 = df1.groupby(['COLLISION_ID'])[['COLLISION_ID']].count()
This returns just the per-ID counts (the output table was shown as an image in the original post).
I would like my data returned as the COLLISION_ID numbers, the count, and the remaining columns of my data, which are not shown here (~40 additional columns that will be filtered later).
If you want to filter, you should use transform:
df1['count_col'] = df1.groupby(['COLLISION_ID'])['COLLISION_ID'].transform('count')
Then you can filter df1 on the count column.
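Putting it together with invented sample data: attach the count to every row, then keep one row per COLLISION_ID while preserving all the other columns.
import pandas as pd
# Hypothetical stand-in for df1
df1 = pd.DataFrame({
    'COLLISION_ID': [1, 1, 2, 3, 3, 3],
    'LATITUDE': [40.1, 40.1, 40.2, 40.3, 40.3, 40.3],
})
# Attach the per-ID occurrence count to every row...
df1['count_col'] = df1.groupby('COLLISION_ID')['COLLISION_ID'].transform('count')
# ...then keep a single row per COLLISION_ID, other columns intact
df2 = df1.drop_duplicates(subset='COLLISION_ID')
print(df2)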

Removing rows from a data frame when the value in a specific column is less than the previous value

I have a pandas dataframe with multiple rows. Before I can produce plots, I must filter on the time column. Typically the value for time will increase at a 1 Hz rate; however, there will be cases when the value for time goes backward. I need to drop any rows that have those "invalid" values for time.
This should work for you:
df = pd.concat([df[:1], df[df.shift(1)['time'] < df['time']]])
If you have a DF with a time column (or some other numeric representation of time):
DF = pd.DataFrame({'Time': [1, 2, 3, 4, 3, 4, 5, 6]})
With pandas, use the diff method to find negative rows (i.e. rows where the time value decreased), which are then filtered out:
DF[DF.Time.diff().fillna(0) >= 0]
The fillna(0) keeps the first row, whose diff is NaN; rows where the condition is True are retained.
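One caveat worth noting: diff only compares each row to its immediate predecessor, so after a large jump a run of several backward rows can slip through. A sketch using cummax keeps a row only if its time has not fallen below the running maximum (names follow the answer above):
import pandas as pd
DF = pd.DataFrame({'Time': [1, 2, 5, 3, 4, 6]})
# Keep a row only if its time is at least the running maximum so far,
# so the surviving sequence never goes backward
print(DF[DF['Time'] >= DF['Time'].cummax()])  # keeps 1, 2, 5, 6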
