I have a dataframe that looks like this:
ID Age Score
0 9 5 3
1 4 6 1
2 9 7 2
3 3 2 1
4 12 1 15
5 2 25 6
6 9 5 4
7 9 5 61
8 4 2 12
I want to sort based on the first column, then the second column, and so on.
So I want my output to be this:
ID Age Score
5 2 25 6
3 3 2 1
8 4 2 12
1 4 6 1
0 9 5 3
6 9 5 4
7 9 5 61
2 9 7 2
4 12 1 15
I know I can do the above with df.sort_values(df.columns.to_list()); however, I'm worried this might be quite slow for much larger dataframes (in terms of both columns and rows).
Is there a more efficient solution?
You can use numpy.lexsort to improve performance.
import numpy as np
import pandas as pd

a = df.to_numpy()
# np.lexsort sorts by the *last* key first; np.rot90 feeds it the columns of
# `a` in reverse order, so the first column becomes the primary sort key.
order = np.lexsort(np.rot90(a))
out = pd.DataFrame(a[order], index=df.index[order], columns=df.columns)
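For reference, running this on the sample frame from the question (reconstructed below) reproduces the expected output, including the reordered index:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':    [9, 4, 9, 3, 12, 2, 9, 9, 4],
                   'Age':   [5, 6, 7, 2, 1, 25, 5, 5, 2],
                   'Score': [3, 1, 2, 1, 15, 6, 4, 61, 12]})

a = df.to_numpy()
order = np.lexsort(np.rot90(a))
print(pd.DataFrame(a[order], index=df.index[order], columns=df.columns))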
Assuming as input a random square DataFrame of side n:
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
here is the comparison for 100 to 100M items (lower runtime is better):
[graph: runtime of the numpy.lexsort approach vs. df.sort_values]
[graph: the same comparison, speed relative to pandas]
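If you want to reproduce the comparison on your own machine, a rough timing sketch (absolute numbers vary by machine and data):
import time
import numpy as np
import pandas as pd

n = 1000
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))

start = time.perf_counter()
df.sort_values(df.columns.to_list())
print('sort_values:', time.perf_counter() - start)

start = time.perf_counter()
a = df.to_numpy()
order = np.lexsort(np.rot90(a))
pd.DataFrame(a[order], index=df.index[order], columns=df.columns)
print('lexsort:', time.perf_counter() - start)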
By still using df.sort_values() you can speed it up a bit by selecting the type of sorting algorithm. By default it's set to quicksort, but there are the alternatives 'mergesort', 'heapsort' and 'stable'.
Maybe specifying one of these would improve it? Note, though, that the docs state this option is only applied when sorting on a single column or label, so it may have no effect on a multi-column sort:
df.sort_values(df.columns.to_list(), kind="mergesort")
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
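A quick way to check whether kind makes a difference on your own frame (a sketch; timings vary):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100_000, 10)))

for kind in ('quicksort', 'mergesort', 'heapsort', 'stable'):
    start = time.perf_counter()
    df.sort_values(df.columns.to_list(), kind=kind)
    print(kind, time.perf_counter() - start)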
I want to select residual data that pass my threshold at least 3 times in a row, where my threshold is 3. I have attached the CSV data via the link below, and what I currently do for the filter is shown here, but I still need the time criterion. Consecutive data are rows that pass the threshold and are sequential in time.
df[df.residual_value >= 3]
Data csv
IIUC, you want to filter the rows that are greater than or equal to 3, but only if 3 consecutive rows match the criterion. You can use rolling+min: rolling(window=3).min() computes a backward-looking minimum, and shift(-2) realigns it so that each row sees the minimum of itself and the next two rows; a row is kept when that minimum is >= 3.
processing:
df[df['col'].rolling(window=3).min().shift(-2).ge(3)]
example dataset:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 100)})
>>> df.head(15)
col
0 5
1 0
2 3
3 3
4 7
5 9
6 3
7 5
8 2
9 4
10 7
11 6
12 8
13 8
14 1
output:
col
2 3
3 3
4 7
5 9
9 4
10 7
11 6
...
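If you instead want every row that belongs to a run of at least 3 qualifying values (not only the positions where a full window still fits ahead), here is a possible variant (a sketch, reusing the same rolling idea in both directions):
# True where a qualifying window of 3 starts at this row
starts = df['col'].rolling(window=3).min().shift(-2).ge(3)
# True where any of the 3 windows covering this row qualifies
mask = starts.astype(int).rolling(window=3, min_periods=1).max().astype(bool)
df[mask]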
I have a Pandas Series that represents a group count.
How can I create a new series containing, for each element, the maximum value reached within its count group (i.e., before the count restarts)?
Minimal example:
import pandas as pd
s_count = pd.Series([1,2,3,1,2,3,4,5,1,2,3,4])
Desired:
s_max_count_group = pd.Series([3,3,3,5,5,5,5,5,4,4,4,4])
Print result:
df = pd.DataFrame({
    'counts': s_count,
    'expected': s_max_count_group
})
print(df)
Display:
counts expected
0 1 3
1 2 3
2 3 3
3 1 5
4 2 5
5 3 5
6 4 5
7 5 5
8 1 4
9 2 4
10 3 4
11 4 4
I looked for similar questions and tested some answers, trying to use the fill, cumsum, diff and mask methods, but with no success so far.
We can identify the individual groups by comparing the series with 1 followed by cumsum, then group the given series on these identified groups and transform using max:
s_count.groupby(s_count.eq(1).cumsum()).transform('max')
0 3
1 3
2 3
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
dtype: int64
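For reference, the intermediate group labels from s_count.eq(1).cumsum() look like this (every restart of the count at 1 opens a new group):
>>> s_count.eq(1).cumsum()
0     1
1     1
2     1
3     2
4     2
5     2
6     2
7     2
8     3
9     3
10    3
11    3
dtype: int64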
I'm pretty new to python.
I am trying to concat two dataframes (df1, df2) where, if a row already exists in df1, it is not added; if not, it is added to df1.
I don't want to use .concat().drop_duplicates() because I don't want duplicates within the same DataFrame to be removed.
Backstory:
I have multiple CSV files that are exported from a piece of software in different locations. Once in a while I want to merge these into one file. The problem is that the exported files will contain the same data as before, along with the new records made within that period of time, so I need to check whether each record is already there, as I will be executing the same code each time I export the data.
For the sake of example:
import pandas as pd
main_df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[2,5,4,5],[9,8,7,6],[8,5,6,7]])
df1 = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[1,5,4,8],[7,3,5,7],[4,3,8,5],[4,3,8,5]])
main_df
0 1 2 3
0 1 2 3 4 --duplicates I want to include--
1 1 2 3 4 --duplicates I want to include--
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
df1
0 1 2 3
0 1 2 3 4 --duplicates I want to exclude--
1 1 2 3 4 --duplicates I want to exclude--
2 4 2 5 1 --duplicates I want to exclude--
3 2 4 1 5 --duplicates I want to exclude--
4 1 5 4 8
5 7 3 5 7
6 4 3 8 5 --duplicates I want to include--
7 4 3 8 5 --duplicates I want to include--
I need the end result to be
main_df (after code execution)
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
7 1 5 4 8
8 7 3 5 7
9 4 3 8 5
10 4 3 8 5
I hope I have explained my issue in a clear way. Thank you.
Check for every row in df1 whether it exists in main_df using pandas apply, and turn that into a mask by negating it with the ~ operator. I like using functools.partial to make it explicit that we are comparing against main_df.
import pandas as pd
from functools import partial
main_df = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[2,5,4,5],
[9,8,7,6],
[8,5,6,7]
])
df1 = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[1,5,4,8],
[7,3,5,7],
[4,3,8,5],
[4,3,8,5]
])
def has_row(df, row):
    # True if `row` matches every value of at least one row in `df`
    return (df == row).all(axis=1).any()

main_df_has_row = partial(has_row, main_df)

duplicate_rows = df1.apply(main_df_has_row, axis=1)
df1_add = df1.loc[~duplicate_rows]

# ignore_index=True renumbers the result 0..n-1, matching the desired output
pd.concat([main_df, df1_add], ignore_index=True)
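Row-wise apply can become slow on large frames. A possibly faster alternative (a sketch, not benchmarked) is a left merge with indicator=True; deduplicating the right side first ensures the merge cannot multiply rows:
# rows of df1 that match no row of main_df are marked 'left_only'
merged = df1.merge(main_df.drop_duplicates(), how='left', indicator=True)
# a left merge preserves df1's row order, so the mask aligns positionally
new_rows = df1[merged['_merge'].eq('left_only').to_numpy()]
pd.concat([main_df, new_rows], ignore_index=True)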
I'm new to Python and using pandas dataframes to store and work with a large dataset.
I'm interested in knowing whether it's possible to compare values between dataframes of similarly named columns. For example, the functionality I'm after would be similar to comparing the column 'A' in this dataframe:
A
0 9
1 9
2 5
3 8
4 7
5 9
6 2
7 2
8 5
9 7
to the column 'A' in this one:
A
0 6
1 3
2 7
3 8
4 2
5 5
6 1
7 8
8 4
9 9
Then, for each row I would determine which of the two 'A' values is smaller and add it to, say, a new column in the first dataframe called 'B':
A B
0 9 6
1 9 3
2 5 5
3 8 8
4 7 2
5 9 5
6 2 1
7 2 2
8 5 4
9 7 7
I'm aware of the
pandas.DataFrame.min
method, but as I understand it this will only locate the smallest value within one column, and can't be used to compare columns of different dataframes. I'm not sure of any other ways in which this functionality could be achieved.
Any suggestions for solving this (probably) very simple question would be much appreciated! Thank you.
You can use numpy.minimum():
import numpy as np
df1['B'] = np.minimum(df1.A, df2.A)
Or use Series.where() to replace values:
df1['B'] = df1['A'].where(df1.A < df2.A, df2.A)
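For reference, a self-contained run with the sample data from the question:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [9, 9, 5, 8, 7, 9, 2, 2, 5, 7]})
df2 = pd.DataFrame({'A': [6, 3, 7, 8, 2, 5, 1, 8, 4, 9]})

df1['B'] = np.minimum(df1.A, df2.A)
print(df1['B'].tolist())  # [6, 3, 5, 8, 2, 5, 1, 2, 4, 7], the row-wise minimum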