Comparing columns of different pandas dataframes - python

I'm new to Python and I'm using pandas dataframes to store and work with a large dataset.
I'd like to know whether it's possible to compare values in similarly named columns across dataframes. For example, the functionality I'm after would be similar to comparing the column 'A' in this dataframe:
A
0 9
1 9
2 5
3 8
4 7
5 9
6 2
7 2
8 5
9 7
to the column 'A' in this one:
A
0 6
1 3
2 7
3 8
4 2
5 5
6 1
7 8
8 4
9 9
Then, for each row I would determine which of the two 'A' values is smaller and add it to, say, a new column in the first dataframe called 'B':
A B
0 9 6
1 9 3
2 5 5
3 8 8
4 7 2
5 9 5
6 2 1
7 2 2
8 5 4
9 7 7
I'm aware of the pandas.DataFrame.min method, but as I understand it this only locates the smallest value within a single column and can't be used to compare columns across dataframes. I'm not sure of any other way in which this functionality could be achieved.
Any suggestions for solving this (probably) very simple question would be much appreciated! Thank you.

You can use numpy.minimum():
import numpy as np
df1['B'] = np.minimum(df1.A, df2.A)
Or use Series.where() to replace values:
df1['B'] = df1['A'].where(df1.A < df2.A, df2.A)
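Putting it together with the question's data, a minimal runnable sketch (using the df1/df2 names from the answer):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [9, 9, 5, 8, 7, 9, 2, 2, 5, 7]})
df2 = pd.DataFrame({'A': [6, 3, 7, 8, 2, 5, 1, 8, 4, 9]})

# Element-wise minimum of the two 'A' columns; both frames share the same
# RangeIndex, so the values line up row by row.
df1['B'] = np.minimum(df1.A, df2.A)
print(df1)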

Related

Ordering a dataframe by each column

I have a dataframe that looks like this:
ID Age Score
0 9 5 3
1 4 6 1
2 9 7 2
3 3 2 1
4 12 1 15
5 2 25 6
6 9 5 4
7 9 5 61
8 4 2 12
I want to sort based on the first column, then the second column, and so on.
So I want my output to be this:
ID Age Score
5 2 25 6
3 3 2 1
8 4 2 12
1 4 6 1
0 9 5 3
6 9 5 4
7 9 5 61
2 9 7 2
4 12 1 15
I know I can do the above with df.sort_values(df.columns.to_list()), however I'm worried this might be quite slow for much larger dataframes (in terms of columns and rows).
Is there a more optimal solution?
You can use numpy.lexsort to improve performance.
import numpy as np
a = df.to_numpy()
# np.lexsort treats its last key as the primary one, so rotate the array
# with np.rot90 to make the first column the primary sort key.
out = pd.DataFrame(a[np.lexsort(np.rot90(a))],
                   index=df.index, columns=df.columns)
Assuming as input a random square DataFrame of side n:
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
here is the comparison for 100 to 100M items (lower runtime is better):
[benchmark graph omitted; a second graph shows the same comparison as speed relative to pandas]
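As a quick sanity check on the question's data, here is a sketch (not part of the original answer; it also reorders the index labels so the result lines up with sort_values):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':    [9, 4, 9, 3, 12, 2, 9, 9, 4],
                   'Age':   [5, 6, 7, 2, 1, 25, 5, 5, 2],
                   'Score': [3, 1, 2, 1, 15, 6, 4, 61, 12]})

a = df.to_numpy()
order = np.lexsort(np.rot90(a))  # primary key = first column
out = pd.DataFrame(a[order], index=df.index[order], columns=df.columns)

# Should match sorting by every column, left to right.
assert out.equals(df.sort_values(df.columns.to_list()))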
By still using df.sort_values() you can speed it up a bit by choosing the sorting algorithm. By default it's 'quicksort', but there are also 'mergesort', 'heapsort' and 'stable'.
Maybe specifying one of these would improve it?
df.sort_values(df.columns.to_list(), kind="mergesort")
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
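One caveat worth checking: the linked docs note that kind is only applied when sorting on a single column or label, so for a multi-column sort it may make no difference. A rough timing sketch to verify on your own data (the size n and the random values are assumptions):
import timeit
import numpy as np
import pandas as pd

n = 1_000
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
cols = df.columns.to_list()

for kind in ('quicksort', 'mergesort', 'heapsort', 'stable'):
    t = timeit.timeit(lambda: df.sort_values(cols, kind=kind), number=3)
    print(f'{kind:<10} {t:.3f}s')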

Combining data frames with the same columns

I have a data frame that looks like this
A B C
1 4 7
2 5 8
3 6 9
And also another data frame that looks like this
A B C
2 1 7
4 3 9
6 5 8
How can I combine those two data frames to get a new data frame that looks like this
A B C
1 4 7
2 5 8
3 6 9
2 1 7
4 3 9
6 5 8
Basically, the two data frames have the same column names and number of columns. I just want to combine all of the rows. Would prefer using pandas to do this.
You can use append:
df1 = df1.append(df2)
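Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat does the same job. A minimal sketch with the question's data:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df2 = pd.DataFrame({'A': [2, 4, 6], 'B': [1, 3, 5], 'C': [7, 9, 8]})

# Stack the rows of df2 under df1; ignore_index renumbers the result 0..5.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)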

Remove parentheses and their contents if present in a df column

I have a dataframe where the top scores/instances have parentheses. I would like to remove the parentheses and only leave the number. How would I do so?
I have tried the code below, but it leaves me with NaNs for all the other numbers that do not have parentheses.
.str.replace(r"\(.*\)","")
This is what the columns look like:
0 1(1P)
1 3(3P)
2 2(2P)
3 4(RU)
4 5(RU)
5 6(RU)
6 8
7 7
8 11
9 13
I want clean columns with only numbers.
Thanks!
The reason is mixed values - numeric with strings; a possible solution is to cast everything to string, strip the parenthesised part (with regex=True so the pattern is treated as a regular expression), and convert back to int:
df['a'] = df['a'].astype(str).str.replace(r"\(.*\)", "", regex=True).astype(int)
print (df)
a
0 1
1 3
2 2
3 4
4 5
5 6
6 8
7 7
8 11
9 13
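An alternative sketch (not from the answer above): pull out the leading digits with str.extract instead of stripping the parentheses.
import pandas as pd

df = pd.DataFrame({'a': ['1(1P)', '3(3P)', '2(2P)', '4(RU)', '5(RU)',
                         '6(RU)', 8, 7, 11, 13]})

# Keep only the leading run of digits, then convert to int.
df['a'] = df['a'].astype(str).str.extract(r'^(\d+)', expand=False).astype(int)
print(df)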

How to concat two python DataFrames so that a row is only appended if it doesn't already exist

I'm pretty new to python.
I am trying to concat two dataframes (df1, df2) so that if a row already exists in df1 it is not added; if not, it is added to df1.
I don't want to use .concat().drop_duplicates() because I don't want duplicates within the same DataFrame to be removed.
Backstory:
I have multiple CSV files that are exported from a piece of software in different locations. Once in a while I want to merge these into one file. The problem is that each exported file will contain the same data as before along with the new records made within that period of time, so I need to check whether a record is already there, as I will be executing the same code each time I export the data.
for the sake of example:
import pandas as pd
main_df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[2,5,4,5],[9,8,7,6],[8,5,6,7]])
df1 = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[1,5,4,8],[7,3,5,7],[4,3,8,5],[4,3,8,5]])
main_df
0 1 2 3
0 1 2 3 4 --duplicates I want to include--
1 1 2 3 4 --duplicates I want to include--
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
df1
0 1 2 3
0 1 2 3 4 --duplicates I want to exclude--
1 1 2 3 4 --duplicates I want to exclude--
2 4 2 5 1 --duplicates I want to exclude--
3 2 4 1 5 --duplicates I want to exclude--
4 1 5 4 8
5 7 3 5 7
6 4 3 8 5 --duplicates I want to include--
7 4 3 8 5 --duplicates I want to include--
I need the end result to be
main_df (after code execution)
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
7 1 5 4 8
8 7 3 5 7
9 4 3 8 5
10 4 3 8 5
I hope I have explained my issue in a clear way. Thank you
Check for every row in df1 whether it exists in main_df using pandas apply, and turn that into a mask by negating it with the ~ operator. I like using functools.partial to make explicit that we are comparing to main_df.
import pandas as pd
from functools import partial
main_df = pd.DataFrame([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [4, 2, 5, 1],
    [2, 4, 1, 5],
    [2, 5, 4, 5],
    [9, 8, 7, 6],
    [8, 5, 6, 7],
])
df1 = pd.DataFrame([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [4, 2, 5, 1],
    [2, 4, 1, 5],
    [1, 5, 4, 8],
    [7, 3, 5, 7],
    [4, 3, 8, 5],
    [4, 3, 8, 5],
])
def has_row(df, row):
    # True if `row` matches every value of some row in `df`.
    return (df == row).all(axis=1).any()

main_df_has_row = partial(has_row, main_df)           # fix df to main_df
duplicate_rows = df1.apply(main_df_has_row, axis=1)   # mask: row already in main_df
df1_add = df1.loc[~duplicate_rows]                    # keep only the new rows
pd.concat([main_df, df1_add])
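For larger frames, a possible vectorised alternative (a sketch, not the answer's approach) is an anti-join with merge and indicator=True, continuing from the main_df and df1 defined above; main_df is deduplicated first so its repeated rows don't multiply rows of df1:
# Rows of df1 with no match in main_df (df1's own duplicates are kept).
merged = df1.merge(main_df.drop_duplicates(), how='left', indicator=True)
new_rows = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')
result = pd.concat([main_df, new_rows], ignore_index=True)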

Transpose in Pandas

I imported the data from a csv file with pandas. I want to split a column which contains 50 values (0 to 49) into 5 rows, each having ten values. Can anyone tell me how I can do this kind of transpose as a pandas frame?
Let me rephrase what I said:
I attached the data that I have. I wanted to select the second column and split it into five rows, each having 10 values.
This is the code I have done so far (I couldn't get a picture of all 50 rows, so I have only included 20):
import numpy as np
import pandas as pd
df = pd.read_csv('...csv')
df.iloc[:50,:2]
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(dict(mycolumn=np.random.randint(10, size=50)))
using numpy and reshape'ing, ignoring indices
pd.DataFrame(df.mycolumn.values.reshape(5, -1))
0 1 2 3 4 5 6 7 8 9
0 0 2 7 3 8 7 0 6 8 6
1 0 2 0 4 9 7 3 2 4 3
2 3 6 7 7 4 5 3 7 5 9
3 8 7 6 4 7 6 2 6 6 5
4 2 8 7 5 8 4 7 6 1 5
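To apply the same reshape idea to the column read from the csv, a sketch (the file name here is hypothetical, and the second column is assumed from the question's df.iloc[:50,:2]):
import pandas as pd

df = pd.read_csv('data.csv')                        # hypothetical file name
col = df.iloc[:50, 1]                               # second column, first 50 values
out = pd.DataFrame(col.to_numpy().reshape(5, -1))   # 5 rows x 10 columns
print(out)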
