Sorting Values by rows in data frame - python

I have a 4-column data frame with numerical values and NaN. What I need is to put the largest numbers in the first columns, so that the first column always holds the row's maximum value and the second column the next largest value.
for x in Exapand_re_metrs[0]:
    for y in Exapand_re_metrs[1]:
        for z in Exapand_re_metrs[2]:
            for a in Exapand_re_metrs[3]:
                lista = [x, y, z, a]
                lista.sort()
                df["AREA_Mayor"] = lista[0]
                df["AREA_Menor"] = lista[1]

I'm not entirely sure what you want to do, but here is a solution based on what I understood:
From what I see, you have a dataframe with several columns and you would like its values gathered into a single column, ordered from highest to lowest, so I will create a dataframe with roughly the same characteristics as follows:
import pandas as pd
import numpy as np
cols = 3
rows = 4
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 1000, (rows, cols)), columns= ["A","B","C"])
print(df)
A B C
0 684 559 629
1 192 835 763
2 707 359 9
3 723 277 754
Now I will gather all the columns into a single column and sort the values in descending order like this:
data = df.to_numpy().flatten()
data = pd.DataFrame(data)
data = data.sort_values(by=[0], ascending=False)
print(data)
As a result we obtain an n x 1 frame where the values are in descending order:
      0
4   835
5   763
11  754
9   723
6   707
0   684
2   629
1   559
7   359
10  277
3   192
8     9
Note: this code fragment should be adapted to your script; I didn't do that because I don't know your dataset. Lastly, my English is not that good, so sorry for any grammatical errors.
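If the goal is instead what the question literally describes, sorting each row so that the largest value ends up in the first column and the next largest in the second, a minimal sketch could look like the following (the example frame is made up, and NaN values are kept at the end of each row):
import numpy as np
import pandas as pd

# Made-up 4-column frame standing in for the question's data.
df = pd.DataFrame({"A": [1.0, 7.0], "B": [np.nan, 2.0], "C": [5.0, 9.0], "D": [3.0, np.nan]})

# Negating before np.sort yields a descending row-wise sort while keeping NaN last
# (np.sort always places NaN at the end, and negation leaves NaN unchanged).
sorted_df = pd.DataFrame(-np.sort(-df.to_numpy(), axis=1), index=df.index, columns=df.columns)
print(sorted_df)  # the first column now holds each row's maximum, the second the next largest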

Related

Python Pandas: Best way to find local maximums in large DF

I have a large dataframe that consists of many cycles; each cycle has 2 peak values that I need to capture into another dataframe.
I have created a sample data frame that mimics the data I am seeing:
import pandas as pd
data = {'Cycle':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3], 'Pressure':[100,110,140,180,185,160,120,110,189,183,103,115,140,180,200,162,125,110,196,183,100,110,140,180,185,160,120,180,201,190]}
df = pd.DataFrame(data)
As you can see, each cycle has two peaks, but the part I was having trouble with is that the 2nd peak is usually higher than the first, so there can be rows with values technically higher than the other peak's maximum in the cycle. The result should look something like this:
data2 = {'Cycle':[1,1,2,2,3,3], 'Peak Maxs': [185,189,200,196,185,201]}
df2= pd.DataFrame(data2)
I have tried a couple of methods, including .nlargest(2) per cycle, but since one of the peaks is usually higher, it pulls the 2nd highest number in the data, which isn't necessarily the other peak.
This graph shows the peak pressures from each cycle that I would like to be able to find.
Thanks for any help.
Using argrelextrema from scipy.signal:
import numpy as np
from scipy.signal import argrelextrema

out = df.groupby('Cycle')['Pressure'].apply(lambda x: x.iloc[argrelextrema(x.values, np.greater)])
Out[124]:
Cycle
1      4     185
       8     189
2      14    200
       18    196
3      24    185
       28    201
Name: Pressure, dtype: int64
out = out.sort_values().groupby(level=0).tail(2).sort_index()
out
Out[138]:
Cycle
1      4     185
       8     189
2      14    200
       18    196
3      24    185
       28    201
Name: Pressure, dtype: int64
Use groupby().shift() to get the neighboring values, then compare:
g = df.groupby('Cycle')
local_maxes = (df['Pressure'].gt(g['Pressure'].shift())     # greater than previous row
               & df['Pressure'].gt(g['Pressure'].shift(-1)) # greater than next row
              )
df[local_maxes]
Output:
Cycle Pressure
4 1 185
8 1 189
14 2 200
18 2 196
24 3 185
28 3 201
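As a small follow-up, and assuming the two-column layout of the question's df2 is what is wanted, the filtered rows can be reshaped into that format:
# Reshape the local maxima into the layout of the question's df2
# (the column name 'Peak Maxs' is taken from the question's example).
df2 = (df[local_maxes]
         .rename(columns={'Pressure': 'Peak Maxs'})
         .reset_index(drop=True))
print(df2)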

Calculate column values in pandas based on previous rows of data in another column

Let's say I have a table with two columns, Date and Amount. The number of rows is no more than 3000.
Row Date Amount
1 15/05/2021 248
2 16/05/2021 115
3 17/05/2021 387
4 18/05/2021 214
5 19/05/2021 678
6 20/05/2021 489
7 21/05/2021 875
8 22/05/2021 123
................
I need to add a third column which will calculate the trim mean values based on the Amount column.
I will be using this function: my_table['TrimMean'] = stats.trim_mean(my_table['Amount'], 0.1), but adapted for my problem.
The problem is that this is not a fixed range but a dynamic one, following this logic: for each row in my table, the trimmed mean is calculated from the previous 90 values of the Amount column, starting from the row above the current row. If there are fewer than 90 values, then calculate with however many rows are available.
e.g. TrimMean[1000]=stats.trim_mean(array from column Amount containing values from rows 910 to 999) TrimMean[12]=stats.trim_mean(array from column Amount containing values from rows 1 to 11)
Hope that makes sense.
Is there any way I can calculate this in a simple way, without going through row by row iteration?
We can calculate the trimmed mean by applying trim_mean over a rolling window of size 90 with min_periods=1, then shift the result down one row so that each row only uses the values above it:
from scipy.stats import trim_mean
df['Amount'].rolling(90, min_periods=1).apply(trim_mean, args=(0.1, )).shift()
0 NaN
1 248.000000
2 181.500000
3 250.000000
4 241.000000
5 328.400000
6 355.166667
7 429.428571
Name: Amount, dtype: float64
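To attach this as the third column described in the question (using the question's my_table name), the same expression can simply be assigned:
from scipy.stats import trim_mean

# Rolling trimmed mean of the previous rows only: shift() excludes the current row.
my_table['TrimMean'] = (my_table['Amount']
                        .rolling(90, min_periods=1)
                        .apply(trim_mean, args=(0.1,))
                        .shift())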

Can I cluster these records without having to run these loops for every record?

So I want to cluster the records in this table to find which records are 'similar' (i.e. have enough in common). An example of the table is as follows:
author beginpage endpage volume publication year id_old id_new
0 NaN 495 497 NaN 1975 1 1
1 NaN 306 317 14 1997 2 2
2 lowry 265 275 193 1951 3 3
3 smith p k 76 85 150 1985 4 4
4 NaN 248 254 NaN 1976 5 5
5 hamill p 85 100 391 1981 6 6
6 NaN 1513 1523 7 1979 7 7
7 b oregan 737 740 353 1991 8 8
8 NaN 503 517 98 1975 9 9
9 de wijs 503 517 98 1975 10 10
In this small table, the last row should get 'id_new' equal to 9, to show that these two records are similar.
To make this happen I wrote the code below, which works fine for a small number of records. However, I want to use my code for a table with 15000 records. And of course, if you do the maths, with this code this is going to take way too long.
Can anyone help me make this code more efficient? Thanks in advance!
My code, where 'dfhead' is the table with the records:
for r in range(0, len(dfhead)):
    for o_r in range(r+1, len(dfhead)):
        if (dfhead.loc[r, c] == dfhead.loc[o_r, c]).sum() >= 3:
            if (dfhead.loc[o_r, ['id_new']] > dfhead.loc[r, ['id_new']]).sum() == 1:
                dfhead.loc[o_r, ['id_new']] = dfhead.loc[r, ['id_new']]
If you are only trying to detect exact matches across "beginpage", "endpage", "volume", "publication" and "year", you should try working with duplicates. I'm not sure about this, as your code is still a mystery to me.
Something like this might work (your column "id" needs to be named "id_old" at first in the dataframe though):
cols = ["beginpage", "endpage","volume", "publication", "year"]
#isolate duplicated rows
duplicated = df[df.duplicated(cols, keep=False)]
#find the minimum key to keep
temp = duplicated.groupby(cols, as_index=False)['id_old'].min()
temp.rename({'id_old':'id_new'}, inplace=True, axis=1)
#import the "minimum key" to duplicated by merging the dataframes
duplicated = duplicated.merge(temp, on=cols, how="left")
#gather the "un-duplicated" rows
unduplicated = df[~df.duplicated(cols, keep=False)]
#concatenate both datasets and reset the index
new_df = pd.concat([unduplicated, duplicated])
new_df.reset_index(drop=True, inplace=True)
#where "id_new" is empty, then the data comes from "unduplicated"
#and you could fill the datas from id_old
ix = new_df[new_df.id_new.isnull()].index
new_df.loc[ix, 'id_new'] = new_df.loc[ix, 'id_old']
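For the same exact-duplicate interpretation, a more compact alternative (just a sketch, not the code above) is to let groupby().transform('min') propagate the smallest id_old within each group of identical rows:
cols = ["beginpage", "endpage", "volume", "publication", "year"]

# For every row, take the smallest id_old among rows sharing the same values in `cols`.
# dropna=False keeps rows with NaN keys grouped together (pandas >= 1.1), matching duplicated().
df['id_new'] = df.groupby(cols, dropna=False)['id_old'].transform('min')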

How to count consecutive ordered values on pandas data frame

I'm trying to get the max count of consecutive 0 values from a pandas data frame with id, date and value columns, which looks like this:
id date value
354 2019-03-01 0
354 2019-03-02 0
354 2019-03-03 0
354 2019-03-04 5
354 2019-03-05 5
354 2019-03-09 7
354 2019-03-10 0
357 2019-03-01 5
357 2019-03-02 5
357 2019-03-03 8
357 2019-03-04 0
357 2019-03-05 0
357 2019-03-06 7
357 2019-03-07 7
540 2019-03-02 7
540 2019-03-03 8
540 2019-03-04 9
540 2019-03-05 8
540 2019-03-06 7
540 2019-03-07 5
540 2019-03-08 2
540 2019-03-09 3
540 2019-03-10 2
The desired result is grouped by id and looks like this:
id max_consecutive_zeros
354 3
357 2
540 0
I've achieved what I want with a for loop, but it gets really slow when working with huge pandas dataframes. I've found some similar solutions, but they didn't work for my problem at all.
Create a group ID m for consecutive rows with the same value. Next, group by id and m and call value_counts, then use .loc on the MultiIndex to slice only the 0 value of the right-most index level. Finally, filter out duplicated id index entries with duplicated, and reindex to create a 0 count for any id that has no zeros.
m = df.value.diff().ne(0).cumsum().rename('gid')
#Consecutive rows having the same value are assigned the same ID number by this command.
#It is the way to identify a group of consecutive rows having the same value, hence "groupID".
df1 = df.groupby(['id', m]).value.value_counts().loc[:,:,0].droplevel(-1)
#This groupby splits consecutive runs of the same value, per id, into separate groups.
#Within each group, count occurrences of each value and use `.loc` to pick only `0`,
#because we only care about the count of value `0`.
df1[~df1.index.duplicated()].reindex(df.id.unique(), fill_value=0)
#There may be several runs of value `0` per `id`; we want only the run with the highest count.
#`value_counts` already sorts counts in descending order, so we just keep the first of the
#duplicates using the True/False mask from `duplicated`.
#Finally, `reindex` adds any `id` that has no value `0` in the original `df`.
#Note: `id` is the column `id` in `df`; it is different from the groupID `m` we create for the groupby.
Out[315]:
id
354 3
357 2
540 0
Name: value, dtype: int64
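As a small follow-up, the resulting Series can be turned into the two-column layout shown in the question (the column name max_consecutive_zeros is taken from there):
# Same expression as above, kept in a variable and reshaped into a dataframe.
result = df1[~df1.index.duplicated()].reindex(df.id.unique(), fill_value=0)
out = (result.rename('max_consecutive_zeros')
             .rename_axis('id')
             .reset_index())
print(out)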
Here is one way: we need to create an additional key for groupby, then we just need to group by this key and id.
s = df.groupby('id').value.apply(lambda x: x.ne(0).cumsum())
df[df.value == 0].groupby([df.id, s]).size().groupby(level=0).max().reindex(df.id.unique(), fill_value=0)
Out[267]:
id
354 3
357 2
540 0
dtype: int64
You could do:
import numpy as np

df.groupby('id').value.apply(lambda x: (x.diff() != 0).cumsum()
                                       .where(x == 0, np.nan)
                                       .value_counts().max()).fillna(0)
Output
id
354 3.0
357 2.0
540 0.0
Name: value, dtype: float64

Slicing in Pandas Dataframes and comparing elements

Morning. Recently I have been trying to use pandas for creating large data tables for machine learning (I'm trying to move away from numpy as best I can).
However, I'm running into some issues, namely slicing pandas data frames.
I'd like to return the rows I specify, and reference and compare particular elements with those in other arrays. Here's a small amount of code I've implemented, with some outline:
import pandas as pd
import csv
import math
import random as nd
import numpy
#create the pandas dataframe from my csv. The Csv is entirely numerical data
#with exception of the first row vector which has column labels
df=pd.read_csv(r"C:\Users\Python\Downloads\Data for Brent - Secondattampatatdfrandomsample.csv")
#I use panda functionality to return a random sample of the data (a subset
#of the array)
df_sample=pd.DataFrame.sample(df,10)
It's at this point that I want to compare the first element along each row vector to the original data. Specifically, the first element in any row contains an id number.
If the elements of the original data frame and the sample frame match up, I'd like to compute a 3- and 6-month average of the associated column elements with the matching id number.
I want to say up front that I'm comfortable moving back to numpy and away from pandas, but there are model-training methods in pandas I hear a ton of good things about (my training is on the mathematics side of things and less so program development). Thanks for the input!
Edit: here is the sample input for the first 11 row vectors in the dataframe (id, year, month, x, y, z):
id year month x y z
0 2 2016 2 1130 343.627538 163660.060200
1 2 2016 4 859 913.314513 360633.159400
2 2 2016 5 931 858.548056 93608.190030
3 2 2016 6 489 548.314860 39925.669950
4 2 2016 7 537 684.441725 80270.240060
5 2 2016 8 618 673.887072 124041.560000
6 2 2016 9 1030 644.749493 88975.429980
7 2 2016 10 1001 543.312870 54874.599830
8 2 2016 11 1194 689.053707 79930.230000
9 2 2016 12 673 483.644736 27567.749940
10 2 2017 1 912 657.716386 54590.460070
11 2 2017 2 671 682.007537 52514.580380
Here is how the sample data is returned, in the same n-tuple format as before. I used native pandas functions to return a randomly generated subset of 10 row vectors out of almost 9000 entries:
2 2016 1 633 877.9282175 75890.97027
5185 2774 2016 4 184 399.418719 9974.375000
9441 4974 2017 2 239 135.520851 0.000000
5134 2745 2017 2 187 217.220657 7711.333333
8561 4063 2017 1 103 505.714286 18880.000000
3328 2033 2016 11 118 452.152542 7622.000000
3503 2157 2016 3 287 446.668831 8092.588235
5228 2791 2016 2 243 400.166008 12655.250000
9380 4708 2017 2 210 402.690583 5282.352941
1631 1178 2016 10 56 563.716667 16911.500000
2700 1766 2016 1 97 486.764151 6449.625000
I'd like to identify the appropriate positions in the sample array, search for identical elements in the original array, and compute averages (and eventually more rigorous statistical models) of their associated numerical data.
for id in df_sample['id'].unique():
    df.groupby('id').mean()[['x', 'y', 'z']].reset_index()
I'm not sure if this is exactly what you want but I'll walk through it to see if it gives you ideas. For each unique id in the sample (I did it for all of them, implement whatever check you like), I grouped the original dataframe by that id (all rows with id == 2 are smushed together) and took the mean of the resulting pandas.GroupBy object as required (which averages the smushed together rows, for each column not in the groupby call). Since this averages your month and year as well, and all I think I care about is x, y, and z, I selected those columns, and then for aesthetic purposes reset the index.
Alternatively, if you wanted the average for that id for each year in the original df, you could do
df.groupby(['id', 'year']).mean()[['x', 'y', 'z']].reset_index()
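Building on that, a small sketch (the restriction to the sample's ids and the _mean column suffix are assumptions, not part of the answer above) that computes the averages only for ids present in the random sample and merges them back onto the sample rows:
# Compute x/y/z means only for the ids that appear in the sample, then attach them to the sample rows.
sample_ids = df_sample['id'].unique()
means = (df[df['id'].isin(sample_ids)]
           .groupby('id')[['x', 'y', 'z']]
           .mean()
           .add_suffix('_mean')   # produces x_mean, y_mean, z_mean
           .reset_index())
df_sample = df_sample.merge(means, on='id', how='left')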
