Sorting based on row values with pandas - python

I have a data frame which looks as follows
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [1, 5, 6], [1, 8, 9], [2, 18, 9],
                            [3, 99, 10], [3, 0.3, 5], [2, 58, 78], [4, 8, 9]]),
                  columns=['id', 'point_A', 'point_B'])
Now I want to create a column that is the sum of point_A and point_B for each row. I can do that with this code: df["sum_of_all"] = df[["point_A", "point_B"]].sum(axis=1)
Now I want to sort and grade them based on sum_of_all, meaning the highest sum is graded as 1, and so on. The grading has to be done per id. How can I do that?
Update:
Once I have finished the sum and the sorting I get the output above. Now my goal is to assign a grade per id, i.e.: id 2 at index 6 -> grade 1, id 2 at index 3 -> grade 2, id 3 at index 4 -> grade 1, id 3 at index 5 -> grade 2, and so on.
That's the expectation.

IIUC
df2 = df.sort_values(by=['sum_of_all', 'id'], ascending=[False, False])
# within each id, rows are now in descending order of sum_of_all,
# so the running count per id is exactly the grade
df2['grade'] = df2.groupby('id')['sum_of_all'].cumcount() + 1
df2
Outcome
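Running the above should give roughly the following (all values come back as floats because the 0.3 promotes the whole np.array to float64):
    id  point_A  point_B  sum_of_all  grade
6  2.0     58.0     78.0       136.0      1
4  3.0     99.0     10.0       109.0      1
3  2.0     18.0      9.0        27.0      2
7  4.0      8.0      9.0        17.0      1
2  1.0      8.0      9.0        17.0      1
1  1.0      5.0      6.0        11.0      2
5  3.0      0.3      5.0         5.3      2
0  1.0      2.0      3.0         5.0      3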

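An alternative sketch that grades without pre-sorting, using groupby().rank() (method='first' breaks ties by row order, which is an assumption about the desired tie-breaking):
df['grade'] = (df.groupby('id')['sum_of_all']
                 .rank(ascending=False, method='first')
                 .astype(int))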

How to find the most frequent value of a column per row, where each column value is a list of values

I have a dataframe that, as a result of a previous group by, contains 5 rows and two columns. Column A is a unique name, and column B contains a list of unique numbers that correspond to different factors related to the unique name. How can I find the most common number (mode) for each row?
df = pd.DataFrame({"A": ["Name1", "Name2", ...], "B": [[3, 5, 6, 6], [1, 1, 1, 4], ...]})
I have tried:
df['C'] = df[['B']].mode(axis=1)
but this simply creates a copy of the lists from column B. Not really sure how to access each list in this case.
Result should be:
   A      B             C
0  Name1  [3, 5, 6, 6]  6
1  Name2  [1, 1, 1, 4]  1
Any help would be great.
Here's a method using the statistics module's mode function.
from statistics import mode
Two options:
df["C"] = df["B"].apply(mode)
df.head()
# A B C
# 0 Name1 [3, 5, 6, 6] 6
# 1 Name2 [1, 1, 1, 4] 1
Or
df["C"] = [mode(df["B"][i]) for i in range(len(df))]
df.head()
# A B C
# 0 Name1 [3, 5, 6, 6] 6
# 1 Name2 [1, 1, 1, 4] 1
I would use Pandas' .apply() function here. It executes a function on each element of a Series. First, we define the function; I'm taking the mode implementation from "Find the most common element in a list":
def mode(lst):
    return max(set(lst), key=lst.count)
Then, we apply this function to the B column to get C:
df['C'] = df['B'].apply(mode)
Our output is:
>>> df
A B C
0 Name1 [3, 5, 6, 6] 6
1 Name2 [1, 1, 1, 4] 1
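Worth noting: before Python 3.8, statistics.mode raises StatisticsError when a list has more than one mode, and max(set(lst), key=lst.count) rescans the list once per distinct value. A one-pass alternative sketch using collections.Counter:
from collections import Counter

# most_common(1) returns [(value, count)] for the most frequent element
df["C"] = df["B"].apply(lambda lst: Counter(lst).most_common(1)[0][0])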

Try to get the cross of 2 series of a pandas table

I am stuck on an issue with a massive pandas table. I would like to get a boolean column that flags where 2 series cross.
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8]})
I would like to add one column to my dataframe to get a result like this one:
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8],
                   'C': [0, -1, 0, 1]})
So basically to get:
0 when there is no cross between series B and A
-1 when series B crosses down through series A
1 when series B crosses up through series A
I need a vectorised calculation because my real table has more than one million rows.
Thank you
You can compute the relative position of the 2 columns with lt, then convert to integer and compute the diff:
m = df['A'].lt(df['B'])
df['C'] = m.astype(int).diff().fillna(0, downcast='infer')
output:
   A   B  C
0  1  10  0
1  2   1 -1
2  3   2  0
3  4   8  1
[Figure: line plot of columns A and B showing the crossing points]
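At the million-row scale mentioned in the question, the same logic can also be expressed in plain NumPy; a sketch (it assumes no NaN values in A or B; note also that the downcast= argument to fillna used above is deprecated in recent pandas releases):
import numpy as np

# True where B is above A
m = (df['A'] < df['B']).to_numpy()
# +1 where B crosses above A, -1 where it crosses below, 0 elsewhere;
# prepend=m[0] makes the first difference 0
df['C'] = np.diff(m.astype(np.int8), prepend=m[0])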

Automatically create multiple python datasets based on column names

I have a huge data set with columns like "Eas_1", "Eas_2", and so on up to "Eas_40", and "Nor_1" to "Nor_40". I want to automatically create multiple separate data sets, each consisting of all columns that end with the same number (grouped by the number in the column name), with that number pasted as the values of a new column (Bin).
My data frame:
df = pd.DataFrame({
    "Eas_1": [3, 4, 9, 1],
    "Eas_2": [4, 5, 10, 2],
    "Nor_1": [9, 7, 9, 2],
    "Nor_2": [10, 8, 10, 3],
    "Error_1": [2, 5, 1, 6],
    "Error_2": [5, 0, 3, 2],
})
I don't know how to create the Bin column and paste the column-name numbers as its values, but I can separate the data sets manually like this:
df1 = df.filter(regex='_1')
df2 = df.filter(regex='_2')
This would take a lot of effort, plus I would have to change the script every time I get new data. This is how I imagine the end result:
df1 = pd.DataFrame({
    "Eas_1": [3, 4, 9, 1],
    "Nor_1": [9, 7, 9, 2],
    "Error_1": [2, 5, 1, 6],
    "Bin": [1, 1, 1, 1],
})
Thanks in advance!
You can extract the suffixes with .str.extract, then groupby on those:
suffixes = df.columns.str.extract(r'(\d+)$', expand=False)
for label, data in df.groupby(suffixes, axis=1):
    print('-'*10, label, '-'*10)
    print(data)
Note: to collect your dataframes, you can do:
dfs = [data for _, data in df.groupby(suffixes, axis=1)]
# access the second dataframe
dfs[1]
Output:
---------- 1 ----------
   Eas_1  Nor_1  Error_1
0      3      9        2
1      4      7        5
2      9      9        1
3      1      2        6
---------- 2 ----------
   Eas_2  Nor_2  Error_2
0      4     10        5
1      5      8        0
2     10     10        3
3      2      3        2
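To also get the Bin column the question asks for, a sketch that selects each group of columns with a boolean mask instead (useful because the axis=1 form of groupby is deprecated in recent pandas versions; int(label) assumes every suffix is numeric):
dfs = []
for label in suffixes.dropna().unique():
    mask = suffixes == label      # boolean mask over the columns
    part = df.loc[:, mask].copy()
    part['Bin'] = int(label)      # paste the column-name number as values
    dfs.append(part)

dfs[0] then matches the df1 sketched in the question.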

How to apply rolling mean function while keeping all the observations with duplicated indices in time

I have a dataframe that has duplicated time indices and I would like to get the mean across all observations for the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked the pandas documentation and read previous posts on Stack Overflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of what my data frame looks like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4, 4, 4],
                   't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
                   'v1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t    v2
1    -
2    -
3    4.167
4    5
5    6.667
A rough proposal: concatenate 2 copies of the input frame in which the values in 't' are replaced by 't+1' and 't+2' respectively. This way, the column 't' comes to mean "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4, 4, 4],
                   't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
                   'v1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
n = df.shape[0]  # avoid shadowing the built-in len
incr = pd.DataFrame({'id': [0]*n, 't': [1]*n, 'v1': [0]*n})  # +1 in 't'
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1]  # drop the days that lack full values for the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
         v2
t
3  4.166667
4  5.000000
5  6.666667
Thank you for all the help. I ended up using groupby + rolling('2D'), and then dropping duplicates (keeping the last observation).
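As a closing sketch, one vectorised way to reproduce the expected table is to aggregate per day first, so duplicated observations still contribute individually, and then roll over the daily sums and counts (the reindex through max(t)+1 is needed so the last target day appears):
# per-day totals and observation counts (duplicates all contribute)
daily = (df.groupby('t')['v1'].agg(['sum', 'count'])
           .reindex(range(1, df['t'].max() + 2), fill_value=0))

# mean over the 2 previous days: rolling 2-day sums, shifted one day forward
v2 = (daily['sum'].rolling(2).sum() / daily['count'].rolling(2).sum()).shift(1)
v2.round(3)  # t=1, 2 -> NaN; t=3 -> 4.167; t=4 -> 5.0; t=5 -> 6.667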

python pandas deduplication with complex criteria

I have a dataframe below:
import pandas as pd
d = {'id': [1, 2, 3, 4, 4, 6, 1, 8, 9], 'cluster': [7, 2, 3, 3, 3, 6, 7, 8, 8]}
df = pd.DataFrame(data=d)
df = df.sort_values('cluster')
I want to keep ALL the rows of a cluster if that cluster contains more than one distinct id, including rows whose id repeats, since a different id appeared AT LEAST once within that cluster. The code I have been using to achieve this is below, BUT it drops too many rows for what I am looking for.
df = (df.assign(counts=df.count(axis=1))
        .sort_values(['id', 'counts'])
        .drop_duplicates(['id', 'cluster'], keep='last')
        .drop('counts', axis=1))
The output dataframe I am expecting, which the code above does not produce, would drop the rows at dataframe indexes 1, 5, 0, and 6 but keep dataframe indexes 2, 3, 4, 7, and 8. Essentially, it is what the code below produces:
df = df.loc[[2, 3, 4, 7, 8]]
I have looked at many pandas deduplication posts on Stack Overflow but have yet to find this scenario. Any help would be greatly appreciated.
I think we can do this with a single boolean, using .groupby().nunique():
con1 = df.groupby('cluster')['id'].nunique() > 1
# of these we only want the True indexes
cluster
2    False
3     True
6    False
7    False
8     True
df.loc[df['cluster'].isin(con1[con1].index)]

   id  cluster
2   3        3
3   4        3
4   4        3
7   8        8
8   9        8
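An equivalent one-liner sketch uses groupby().transform to align the per-cluster distinct-id count back onto the original rows, skipping the intermediate boolean Series:
df[df.groupby('cluster')['id'].transform('nunique') > 1]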
