I have two columns A and B in a pandas dataframe, where values are repeated multiple times. Each unique value in A is expected to correspond to exactly one unique value in B (see the example below in the form of two lists). Since each value in each column is repeated multiple times, I would like to check whether a one-to-one relationship exists between the two columns. Is there any built-in function in pandas to check that? If not, is there an efficient way of achieving that task?
Example:
A = [1, 3, 3, 2, 1, 2, 1, 1]
B = [5, 12, 12, 10, 5, 10, 5, 5]
Here, for each 1 in A, the corresponding value in B is always 5 and nothing else. Similarly, 2-->10 and 3-->12. Hence, each number in A has exactly one corresponding number in B (and no other number). I have called this a one-to-one relationship. Now I want to check whether such a relationship exists between two columns of a pandas dataframe.
An example where this relationship is not satisfied:
A = [1, 3, 3, 2, 1, 2, 1, 1]
B = [5, 12, 12, 10, 5, 10, 7, 5]
Here, 1 in A doesn't have a unique corresponding value in B. It has two corresponding values - 5 and 7. Hence, the relationship is not satisfied.
Suppose you have a dataframe:
import pandas as pd
d = pd.DataFrame({'A': [1, 3, 1, 2, 1, 3, 2], 'B': [4, 6, 4, 5, 4, 6, 5]})
A DataFrame has a groupby method, which returns a GroupBy object. This is the interface for grouping rows by equal column values.
gb = d.groupby('A')
grouped_b_column = gb['B']
On the grouped rows you can perform an aggregation. Let's find the min and max value in every group:
res = grouped_b_column.agg(['min', 'max'])
>>> print(res)
   min  max
A
1    4    4
2    5    5
3    6    6
Now we just need to check that min and max are equal in every group, meaning every group consists of a single B value:
res['min'].equals(res['max'])
If this check passes, then for every A you have a unique B. You should then run the same check with the A and B columns swapped.
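Both direction checks can also be collapsed into one helper using nunique (a compact sketch, not from the original answer; `one_to_one` is a name chosen here):

```python
import pandas as pd

def one_to_one(df, col_a, col_b):
    # each value of col_a maps to exactly one value of col_b, and vice versa
    return bool(df.groupby(col_a)[col_b].nunique().eq(1).all()
                and df.groupby(col_b)[col_a].nunique().eq(1).all())

d = pd.DataFrame({'A': [1, 3, 1, 2, 1, 3, 2], 'B': [4, 6, 4, 5, 4, 6, 5]})
print(one_to_one(d, 'A', 'B'))  # True
```

This avoids building the min/max table entirely: each group must contain exactly one distinct value in the other column.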
I am working with a series in Python. What I want to achieve is to get the highest value out of every n values in the series.
For example:
if n is 3
Series: 2, 1, 3, 5, 3, 6, 1, 6, 9
Expected Series: 3, 6, 9
I have tried the nlargest function in pandas, but it returns the largest values in descending order, whereas I need the values in the order of the original series.
There are various options. If the series is guaranteed to have a length of a multiple of n, you could drop down to numpy and do a .reshape followed by .max along an axis.
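The numpy route could look like this (a sketch, assuming the length of the series is an exact multiple of n):

```python
import pandas as pd

n = 3
ser = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9])
# reshape into rows of n values each, then take the max of each row
out = pd.Series(ser.to_numpy().reshape(-1, n).max(axis=1))
print(out.tolist())  # [3, 6, 9]
```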
Otherwise, if the index is the default (0, 1, 2, ...), you can use groupby:
import pandas as pd
n = 3
ser = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9])
out = ser.groupby(ser.index // n).max()
out:
0 3
1 6
2 9
dtype: int64
I have a dataframe that, as a result of a previous group by, contains 5 rows and two columns. Column A is a unique name, and column B contains a list of numbers that correspond to different factors related to the name. How can I find the most common number (mode) for each row?
df = pd.DataFrame({"A": ["Name1", "Name2", ...], "B": [[3, 5, 6, 6], [1, 1, 1, 4], ...]})
I have tried:
df['C'] = df[['B']].mode(axis=1)
but this simply creates a copy of the lists from column B. Not really sure how to access each list in this case.
Result should be:
A       B             C
Name1   [3, 5, 6, 6]  6
Name2   [1, 1, 1, 4]  1
Any help would be great.
Here's a method using the statistics module's mode function:
from statistics import mode
Two options:
df["C"] = df["B"].apply(mode)
df.head()
# A B C
# 0 Name1 [3, 5, 6, 6] 6
# 1 Name2 [1, 1, 1, 4] 1
Or
df["C"] = [mode(df["B"][i]) for i in range(len(df))]
df.head()
# A B C
# 0 Name1 [3, 5, 6, 6] 6
# 1 Name2 [1, 1, 1, 4] 1
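A pandas-native alternative is to explode the lists and take the per-row mode (a sketch on the same example frame, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({"A": ["Name1", "Name2"],
                   "B": [[3, 5, 6, 6], [1, 1, 1, 4]]})
# explode each list onto its own rows, group by the original row index,
# and keep the first (smallest) mode per group
df["C"] = df["B"].explode().groupby(level=0).agg(lambda s: s.mode().iloc[0])
print(df["C"].tolist())  # [6, 1]
```

Note that Series.mode can return several values when there is a tie; `.iloc[0]` picks the smallest, whereas statistics.mode raises (or, in Python 3.8+, returns the first encountered).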
I would use pandas' .apply() function here; it executes a function on each element of a series. First, we define the function. I'm taking the mode implementation from "Find the most common element in a list":
def mode(lst):
    return max(set(lst), key=lst.count)
Then, we apply this function to the B column to get C:
df['C'] = df['B'].apply(mode)
Our output is:
>>> df
A B C
0 Name1 [3, 5, 6, 6] 6
1 Name2 [1, 1, 1, 4] 1
I am looking to quickly combine columns that are genetic complements of each other. I have a large data frame with counts and want to combine columns whose names are complements. I currently have a system that
Gets the complement of a column name
Checks the column names for the complement
Adds the columns together if there is a match
Then deletes the complement column
However, this is slow (checking every column name) and gives different column names depending on the ordering of the columns (i.e. it deletes different complement columns between runs). I was wondering if there is a way to incorporate a dictionary key:value pair to speed up the process and keep the output consistent. I have an example dataframe below with the desired result (ATTG|TAAC and CGGG|GCCC are complements).
df = pd.DataFrame({"ATTG": [3, 6, 0, 1],"CGGG" : [0, 2, 1, 4],
"TAAC": [0, 1, 0, 1], "GCCC" : [4, 2, 0, 0], "TTTT": [2, 1, 0, 1]})
## Current Pseudocode
for item in df.columns:
    if complement(item) in df.columns:
        df[item] = df[item] + df[complement(item)]
        del df[complement(item)]
## Desired Result
df_result = pd.DataFrame({"ATTG": [3, 7, 0, 2],"CGGG" : [4, 4, 1, 4], "TTTT": [2, 1, 0, 1]})
Translate the column names, then assign each column whichever of the translation and the original sorts first. This lets you group complements under a common name.
import numpy as np
mytrans = str.maketrans('ATCG', 'TAGC')
df.columns = np.sort([df.columns, [x.translate(mytrans) for x in df.columns]], axis=0)[0, :]
df.groupby(level=0, axis=1).sum()
#    AAAA  ATTG  CGGG
# 0     2     3     4
# 1     1     7     4
# 2     0     0     1
# 3     1     2     4
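Note that `groupby(..., axis=1)` is deprecated in pandas 2.x; the same grouping can be expressed by canonicalising the column names and grouping on a transpose (a sketch on the example frame):

```python
import pandas as pd

df = pd.DataFrame({"ATTG": [3, 6, 0, 1], "CGGG": [0, 2, 1, 4],
                   "TAAC": [0, 1, 0, 1], "GCCC": [4, 2, 0, 0],
                   "TTTT": [2, 1, 0, 1]})

mytrans = str.maketrans('ATCG', 'TAGC')
# canonical name: whichever of a column name and its complement sorts first
df.columns = [min(c, c.translate(mytrans)) for c in df.columns]
result = df.T.groupby(level=0).sum().T
print(result)
```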
Suppose I have a matrix (2d list in Python) and I want to create another matrix with the rank of all the elements based on the row and column they are in. The conditions are:
Rank should start from 1
Same rank should be provided for same elements in the same row or column
Same elements may have different ranks if they are in different rows or columns based on their particular row or column ranking
The maximum rank should be as small as possible
Suppose the given 5x5 matrix is:
18, 25, 7, 11, 11
33, 37, 14, 22, 25
29, 29, 11, 14, 11
25, 25, 14, 14, 11
29, 25, 14, 11, 7
The expected output is:
3, 4, 1, 2, 2
6, 7, 3, 4, 5
5, 5, 2, 3, 2
4, 4, 3, 3, 2
5, 4, 3, 2, 1
How can I write this in Python (or any programming language), and what is the algorithm behind solving this problem?
Try this nested list comprehension, where l is the matrix as a list of lists (note that it ranks each row independently, so it does not enforce the column conditions):
print([[sorted(set(i)).index(x) + 1 for x in i] for i in l])
Output:
[[3, 4, 1, 2, 2], [4, 5, 1, 2, 3], [3, 3, 1, 2, 1], [3, 3, 2, 2, 1], [5, 4, 3, 2, 1]]
Taking inspiration from Faruk Hossain's answer, you can start by creating tuples:
Create a tuple (value, row_number, col_number) for each element and store them in a list.
Sort the tuples in ascending order by value.
Iterate over the sorted tuples. For each tuple, find the maximum rank already assigned in its row and in its column (i.e. among elements with value <= current value). Set the current element's rank to the maximum of those two, plus 1 if that maximum does not come from an element equal to the current one.
Also note that an additional step is needed when the maxima of both the row and the column have a value equal to the current element's value: in that case the ranks of the equal elements must be updated recursively.
To see why the 4th step matters, take the matrix:
3 2 4
5 1 3
2 6 4
Here, if we rank the bottom-row 4 first, it gets rank 3; but when we then rank the top-row 4 as 4, we have a contradiction.
Time complexity - O(n^4)
You can do this in different ways. Here is a simple one.
Steps:
Create a tuple with (value, row_number, col_number) for each element and store them in a list.
Sort the tuple in ascending order by the value
Iterate through the list. At each iteration you get a tuple, which gives you the row and column of that value; we now rank that position (row, col). Look through the current row and column and find the maximum rank already assigned there. The current position's rank is then that maximum + 1 (or 1 if nothing in the row or column has been ranked yet). You can maintain two arrays to optimise this search.
Follow the previous steps until we finish the task for all elements.
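The steps above can be sketched like this; a small union-find merges equal values that share a row or column, handling the recursive-update case described in the other answer. The function name and structure are my own, not from either answer:

```python
from collections import defaultdict

def matrix_rank_transform(matrix):
    m, n = len(matrix), len(matrix[0])
    row_rank = [0] * m          # highest rank used so far in each row
    col_rank = [0] * n          # highest rank used so far in each column
    cells = defaultdict(list)   # value -> list of (row, col) positions
    for i in range(m):
        for j in range(n):
            cells[matrix[i][j]].append((i, j))
    answer = [[0] * n for _ in range(m)]
    for value in sorted(cells):
        # union-find over row ids 0..m-1 and column ids m..m+n-1,
        # so equal cells sharing a row or column end up with the same rank
        parent = list(range(m + n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for i, j in cells[value]:
            ri, cj = find(i), find(m + j)
            if ri != cj:
                parent[ri] = cj
        best = defaultdict(int)  # component -> highest neighbouring rank
        for i, j in cells[value]:
            r = find(i)
            best[r] = max(best[r], row_rank[i], col_rank[j])
        for i, j in cells[value]:
            rank = best[find(i)] + 1
            answer[i][j] = rank
            row_rank[i] = max(row_rank[i], rank)
            col_rank[j] = max(col_rank[j], rank)
    return answer
```

On the 5x5 matrix from the question this reproduces the expected output, and it keeps the maximum rank as small as possible.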
I was asked this question at a Software Engineering interview.
I have a very large pandas dataframe with two columns that I'd like to recursively lookup.
Given input of the following dataframe:
NewID, OldId
1, 0
2, 1
3, 2
5, 4
7, 6
8, 7
9, 5
I'd like to generate the series OriginalId:
NewID, OldId, OriginalId
1, 0, 0
2, 1, 0
3, 2, 0
5, 4, 4
7, 6, 6
8, 7, 6
9, 5, 4
This can be trivially solved by iterating over the sorted data and, for each row, checking whether OldId points to an existing NewId and, if so, setting OriginalId to that row's OriginalId.
This can be solved by iteratively merging and updating columns, by the following algorithm:
Merge OldId to NewId.
For any one that did not match, set OriginalId to OldId.
If they did match, set OldId to the matched row's OldId.
Repeat until OriginalIds are all filled in.
Feels like there should be a pandas friendly way to do this via cumulative sums or similar.
Easy: set the index to NewID, then map OldId through itself (note each pass resolves only one link of a chain):
df = df.set_index('NewID')
df['OriginalId'] = df['OldId'].map(df['OldId']).fillna(df['OldId'])
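For chains longer than two links, the map can be repeated until every chain bottoms out (a sketch on the question's data, assuming the NewID/OldId pairs form no cycles):

```python
import pandas as pd

df = pd.DataFrame({'NewID': [1, 2, 3, 5, 7, 8, 9],
                   'OldId': [0, 1, 2, 4, 6, 7, 5]}).set_index('NewID')
root = df['OldId']
# keep following OldId -> NewID links until none remain to follow
while root.isin(df.index).any():
    root = root.map(df['OldId']).fillna(root)
df['OriginalId'] = root.astype(int)
print(df['OriginalId'].tolist())  # [0, 0, 0, 4, 6, 6, 4]
```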