how to count the match number between two dataframe fast?

how to count the match number between two dataframe fast? - python

I'm writing some programs on calculate the match item number between two dataframes.
for example,
A is the dataframe as : A = pd.DataFrame({'pick_num1':[1, 2, 3], 'pick_num2':[2, 3, 4], 'pick_num3':[4, 5, 6]})
B is the answer I want to match, like:
B = pd.DataFrame({'ans_num1':[1, 2, 3], 'ans_num2':[2, 3, 4], 'ans_num3':[4, 5, 6], 'ans_num4':[7, 8, 1], 'ans_num5':[9, 1, 9]})
DataFrame A
pick_num1 pick_num2 pick_num3 match_num
0 1 2 4 2
1 2 3 5 2
2 3 4 6 2
DataFrame B
ans_num1 ans_num2 ans_num3 ans_num4 ans_num5
0 1 2 4 7 9
1 2 3 5 8 1
2 3 4 6 1 9
and I want to append a new column of ['match_num'] at the end of A.
Now I have tried to write a mapping function to compare and calculate, and I found the speed is not that fast while the dataframe is huge, the functions are below:
def win_prb_func(df1, p_name):
df1['match_num'] += np.sum(pd.concat([df1[p_name]]*5, axis=1).values==df1[open_ball_name_ls].values, 1)
return df1
def compute_win_prb(df1):
return list(map(lambda p_name: win_prb_func(df1, p_name), pick_name_ls))
df1 = pd.concat([A, B], axis=1)
df1['win prb.'] = 0
result_df = compute_win_prb(df1)
where pick_name_ls is ['pick_num1', 'pick_num2', 'pick_num3'], and open_ball_name_ls is ['ans_num1', 'ans_num2', 'ans_num3', 'ans_num4', 'ans_num5'].
I'm wondering is it possible to make the computation more fast or smart than I did?
now the performance would is: 0.015626192092895508 seconds
Thank you for helping me!

You can use broadcasting instead of concatenating the columns:
def win_prb_func(df1, p_name):
df1['match_num'] += np.sum(df1[p_name].values[:, np.newaxis] == df1[open_ball_name_ls].values, 1)
return df1
Since df1[p_name].values will return an 1-D array, you have to convert it into the column vector by adding a new axis. It only takes me 0.004 second.

Related

Sorting matrix columns

I have a matrix 4*5 and I need to sort it by several columns.
Given these inputs:
sort_columns = [3, 1, 2, 4, 5, 2]
matrix = [[3, 1, 8, 1, 9],
[3, 7, 8, 2, 9],
[2, 7, 7, 1, 2],
[2, 1, 7, 1, 9]]
the matrix should first be sorted by the 3nd column (so the values 8, 8, 7, 7), then the sorted result should again be sorted by column 1 (values 3, 3, 2, 2) and so on.
So, after first sorting by column 3, the matrix would be:
2 7 7 1 2
2 1 7 1 9
3 1 8 1 9
3 7 8 2 9
and sorting on column 1 then has no effect as the values are already in the right order. The next column, 2, then makes the order:
2 1 7 1 9
3 1 8 1 9
2 7 7 1 2
3 7 8 2 9
etc.
After sorting on all the sort_columns numbers, I expect to get the result:
2 7 7 1 2
3 1 8 1 9
2 1 7 1 9
3 7 8 2 9
This is my code to sort the matrix:
def sort_matrix_columns(matrix, n, sort_columns):
for col in sort_columns:
column = col - 1
for i in range(n):
for j in range(i + 1, n):
if matrix[i][column] > matrix[j][column]:
temp = matrix[i]
matrix[i] = matrix[j]
matrix[j] = temp
which is called like this:
sort_matrix_columns(matrix, len(matrix), sort_columns)
But when I do I get the following wrong result:
3 1 8 1 9
2 1 7 1 9
2 7 7 1 2
3 7 8 2 9
Why am I getting the wrong order here? Where is my sort implementation failing?

The short answer is that your sort implementation is not stable.
A sort algorithm is stable when two entries in the sorted sequence keep the same (relative) order when their sort key is the same. For example, when sorting only by the first letter, a stable algorithm will always sort the sequence ['foo', 'flub', 'bar'] to be ['bar', 'foo', 'flub'], keeping the 'foo' and 'flub' values in the same relative order. Your algorithm would swap 'foo' and 'bar' (as 'f' > 'b' is true) without touching 'flub', and so you'd end up with ['bar', 'flub', 'foo'].
You need a stable sort algorithm when applying sort multiple times as you do when using multiple columns, because subsequent sortings should leave the original order applied by preceding sort operations when the value in the current column is the same between two rows.
You can see this when your implementation sorts by column 5, after first sorting on columns 3, 1, 2, 4. After those first 4 sort operations the matrix looks like this:
2 1 7 1 9
3 1 8 1 9
2 7 7 1 2
3 7 8 2 9
Your implementation then sorts by column 5, so by 9, 9, 2, 9. The first row is then swapped with the 3rd row (2 1 7 1 9 and 2 7 7 1 2, leaving the other rows all untouched. This changed the relative order of all the columns with a 9:
2 7 7 1 2 < - was third
3 1 8 1 9 < - so this row is now re-ordered!
2 1 7 1 9 < - was first
3 7 8 2 9
Sorting the above output by the 2nd column (7, 1, 1, 7) then leads to the wrong output you see.
A stable sort algorithm would have moved the 2 7 7 1 2 row to be the first row without reordering the other rows:
2 7 7 1 2 < - was third
2 1 7 1 9 < - was first
3 1 8 1 9 < - was second, stays *after* the first row
3 7 8 2 9 < - was third, stays *after* the second row
and sorting by the second column produces the correct output.
The default Python sort implementation, TimSort (named after its inventor, Tim Peters), is a stable sort function. You could just use that (via the list.sort() method and a sort key function):
def sort_matrix_columns(matrix, sort_columns):
for col in sort_columns:
matrix.sort(key=lambda row: row[col - 1])
Heads-up: I removed the n parameter from the function, for simplicity's sake.
Demo:
>>> def pm(m): print(*(' '.join(map(str, r)) for r in m), sep="\n")
...
>>> def sort_matrix_columns(matrix, sort_columns):
... for col in sort_columns:
... matrix.sort(key=lambda row: row[col - 1])
...
>>> sort_columns = [3, 1, 2, 4, 5, 2]
>>> matrix = [[3, 1, 8, 1, 9],
... [3, 7, 8, 2, 9],
... [2, 7, 7, 1, 2],
... [2, 1, 7, 1, 9]]
>>> sort_matrix_columns(matrix, sort_columns)
>>> pm(matrix)
2 1 7 1 9
3 1 8 1 9
2 7 7 1 2
3 7 8 2 9
You don't need to use loop, if you reverse the sort_columns list and use that to create a single sort key, you can do this with a single call:
def sort_matrix_columns(matrix, sort_columns):
matrix.sort(key=lambda r: [r[c - 1] for c in sort_columns[::-1]])
This works the same way, the most significant sort is the last column, only when two rows have the same value (a tie) would the one-but-last column sort matter, etc.
There are other stable sort algorithms, e.g. insertion or bubble sort would work just as well here. Wikipedia has a handy table of comparison sort algorithms that includes a 'stable' column, if you wanted to implement sorting yourself still.
E.g. here is a version using insertion sort:
def insertionsort_matrix_columns(matrix, sort_columns):
for col in sort_columns:
column = col - 1
for i in range(1, len(matrix)):
for j in range(i, 0, -1):
if matrix[j - 1][column] <= matrix[j][column]:
break
matrix[j - 1], matrix[j] = matrix[j], matrix[j - 1]
I didn't use a temp variable to swap two rows. In Python, you can swap two values simply by using tuple assignments.
Because insertion sort is stable, this produces the expected outcome:
>>> matrix = [[3, 1, 8, 1, 9],
... [3, 7, 8, 2, 9],
... [2, 7, 7, 1, 2],
... [2, 1, 7, 1, 9]]
>>> insertionsort_matrix_columns(matrix, sort_columns)
>>> pm(matrix)
2 1 7 1 9
3 1 8 1 9
2 7 7 1 2
3 7 8 2 9

How to Find Closest data points for a query data point in a pandas dataframe?

I have a query data point with 15 columns and I have a pandas data frame with same columns(15) and i want to find closest data points present in data frame to my query data point. can some one guide me on this ?
Example:
query data point
[1, 2, 3, 4]
df
1 3 5 6
2 7 9 1
2 8 1 8
5 4 9 0
2 4 6 7
here, below rows are closest , in the same way i want to retrieve first n closest data points to my query point.
1 3 5 6
2 4 6 7
I tried clustering but it was too complex for me to understand and KNN is expecting a target variable, so need your help .Thank you!

You can use the Euclidean distance or L2Norm to calculate the distance between each row of your dataframe and your query point.
df = pd.DataFrame([[1, 3, 5, 6],
[2, 7, 9, 1],
[2, 8, 1, 8],
[5, 4, 9, 0],
[2, 4, 6, 7]])
vec = [1, 2, 3, 4]
dist = df.sub(vec, axis=1).pow(2).sum(axis=1).pow(.5)
This gives the output,
0 3.000000
1 8.426150
2 7.549834
3 8.485281
4 4.795832
dtype: float64
You can select the shortest n distances, which give you the positions of n-closest data points to your query points.
Or you can use the np.linlag.norm
dist = np.linalg.norm(source.to_numpy() - vec, axis=1)
which gives you the output
array([3. , 8.42614977, 7.54983444, 8.48528137, 4.79583152])
Check out the answers to this question.

You can try:
query_point = [1, 2, 3, 4]
n = 2
n_closest_points = df.loc[(df - query_point).pow(2).sum(axis=1).nsmallest(n).index]
gives
0 1 2 3
0 1 3 5 6
4 2 4 6 7
We take the sum of squared distance between each row and the query_point by chaining subtraction (which broadcasts), taking square (pow) and summing (sum). Then we require the n closest rows via getting the rows that have the smallest distance (nsmallest). Then this gives a series with values being the squared distance and index indicating the desired rows, so we take its index and look them into the original df (.loc).

Appending seperate dataframes, each as column

I am currently working on the following:
data - with the correct index
for i in range(1, 11):
kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(data_values)
wcss.append(kmeans.inertia_)
kmeans = KMeans(n_clusters=2).fit(data_values)
y = kmeans.fit_predict(data_values) # prediction of k
df= pd.DataFrame(y,index = data.index)
....
#got here multiple dicts
Example of y:
[1 2 3 4 5 2 2 5 1 0 0 1 0 0 1 0 1 4 4 4 3 1 0 0 1 0 0 ...]
f = pd.DataFrame(y, columns = [buster] )
f.to_csv('busters.csv, mode = 'a')
y = clusters after determination
I dont know how did I stuck on this.. I am iterating over 20 dataframes, each one consists of one columns and values from 1-9. The index is irrelevent. I am trying to append all frame together but instead it just prints them one after the other. If I put ".T" to transpose it , I still got rows with irrelevent values as index, which I cant remove them because they are actually headers.
Needed result

If the dicts produced in each iteration look like {'Buster1': [0, 2, 2, 4, 5]}, {'Buster2': [1, 2, 3, 4, 5]} ..., using 5 elements here for illustration purposes, and all the lists, i.e., values in the dicts, have the same number of elements (as it is the case in your example), you could create a single dict and use pd.DataFrame directly. (You may also want to take a look at pandas.DataFrame.from_dict.)
You may have lists with more than 5 elements, more than 3 dicts (and thus columns), and you will be generating the dicts with a loop, but the code below should be sufficient for getting the idea.
>>> import pandas as pd
>>>
>>> d = {}
>>> # update d in every iteration
>>> d.update({'Buster 1': [0, 2, 2, 4, 5]})
>>> d.update({'Buster 2': [1, 2, 3, 4, 5]})
>>> # ...
>>> d.update({'Buster n': [0, 9, 3, 0, 0]})
>>>
>>> pd.DataFrame(d, columns=d.keys())
Buster 1 Buster 2 Buster n
0 0 1 0
1 2 2 9
2 2 3 3
3 4 4 0
4 5 5 0
If you have the keys, e.g., 'Buster 1', and values, e.g., [0, 2, 2, 4, 5], separated, as I believe is the case, you can simplify the above (and make it more efficient) by replacing d.update({'Buster 1': [0, 2, 2, 4, 5]}) with d['Buster 1']=[0, 2, 2, 4, 5].
I included columns=d.keys() because depending on your Python and pandas version the ordering of the columns may not be as you expect it to be. You can specify the ordering of the columns through specifying the order in which you provide the keys. For example:
>>> pd.DataFrame(d, columns=sorted(d.keys(),reverse=True))
Buster n Buster 2 Buster 1
0 0 1 0
1 9 2 2
2 3 3 2
3 0 4 4
4 0 5 5
Although it may not apply to your use case, if you do not want to print the index, you can take a look at How to print pandas DataFrame without index.

Passing a list of values to remap in Python

I was wondering if there was a way to pass a list of values/corresponding values that I am remapping in my dataframe. I am using a truncated version of my dataset and would rather not have to update the code one by one, there are about 30 different unique QFundMaster variables in my data.
QFundMaster NPS
0 3 1
1 5 2
2 10 3
3 23 9
4 26 1
The code I am using to remap the data is as follows:
df['Fund'] = df['QFundMaster'] \
.map({3: 'Fund1'\
,5: 'Fund2'\
,10: 'Fund3'\
,23: 'Fund4'\
,26: 'Fund5'})
The code works perfectly fine, but was after a way to pass a list of values/new values so I don't have to edit the code one by one and to make it more efficient. Any help in the right direction would be appreciated. Thanks!
print(df.Fund)
0 Fund1
1 Fund2
2 Fund3
3 Fund4
4 Fund5
Name: Fund, dtype: object

If you have the list of old_values and new_values, you could do:
import pandas as pd
data = [[3, 1],
[5, 2],
[10, 3],
[23, 9],
[26, 1]]
df =pd.DataFrame(data=data, columns=['QFundMaster', 'NPS'])
old_values = [3, 5, 10, 23, 26]
new_values = ['Fund1', 'Fund2', 'Fund3', 'Fund4', 'Fund5']
df['Fund'] = df['QFundMaster'].map(dict(zip(old_values, new_values)))
print(df)
Output
QFundMaster NPS Fund
0 3 1 Fund1
1 5 2 Fund2
2 10 3 Fund3
3 23 9 Fund4
4 26 1 Fund5

Pandas : determine mapping from unique rows to original dataframe

Given the following inputs:
In [18]: input
Out[18]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
2 1 5 9 1
3 1 5 9 1
In [26]: df = input.drop_duplicates()
Out[26]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
How would I go about getting an array that has the indices of the rows from the subset that are equivalent, eg:
resultant = [0, 1, 0, 0]
I.e. the '1' here is basically stating that (row[1] in input) == (row[1] in df). Since there will be fewer unique rows than there will be multiple values in 'resultant' that will equate to similar values in df. i.e (row[k] in input == row[k+N] in input) == (row[1] in df) could be a case.
I am looking for actual row number mapping from input:df.
While this example is trivial in my case i have a ton of dropped mappings that might map to one index as an example.
Why do I want this? I am training an autoencoder type system where the target sequence is non-unique.

One way would be to treat it as a groupby on all columns:
>> df.groupby(list(df.columns)).groups
{(1, 5, 9, 1): [0, 2, 3], (2, 6, 10, 2): [1]}
Another would be to sort and then compare, which is less efficient in theory but could very well be faster in some cases and is definitely easier to make more tolerant of error:
>>> ds = df.sort(list(df.columns))
>>> eqs = (ds != ds.shift()).all(axis=1).cumsum()
>>> ds.index.groupby(eqs)
{1: [0, 2, 3], 2: [1]}
This seems the right datastructure to me, but if you really do want an array with the group ids, that's easy too, e.g.
>>> eqs.sort_index() - 1
0 0
1 1
2 0
3 0
dtype: int64

Don't have pandas installed on this computer, but I think you could use df.iterrows() like:
def find_matching_row(row, df_slimmed):
for index, slimmed_row in df_slimmed.iterrows():
if slimmed_row.equals(row[slimmed_row.columns]):
return index
def rows_mappings(df, df_slimmed):
for _, row in df.iterrows():
yield find_matching_row(row, df_slimmed)
list(rows_mappings(df, input))
This is if you are interested in generating the resultant list in your example, I don't quite follow the latter part of your reasoning.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

how to count the match number between two dataframe fast? - python

Related

Sorting matrix columns

How to Find Closest data points for a query data point in a pandas dataframe?

Appending seperate dataframes, each as column

Passing a list of values to remap in Python

Pandas : determine mapping from unique rows to original dataframe

Categories

Resources