I have a very large pandas dataframe with two columns on which I'd like to perform a recursive lookup.
Given input of the following dataframe:
NewID, OldId
1, 0
2, 1
3, 2
5, 4
7, 6
8, 7
9, 5
I'd like to generate the series OriginalId:
NewID, OldId, OriginalId
1, 0, 0
2, 1, 0
3, 2, 0
5, 4, 4
7, 6, 6
8, 7, 6
9, 5, 4
This can be trivially solved by iterating over the sorted data: for each row, check whether OldId points to an existing NewID, and if so, set this row's OriginalId to that row's OriginalId.
It can also be solved by iteratively merging and updating columns, with the following algorithm:
Merge OldId against NewID.
For any row that did not match, set OriginalId to OldId.
For any row that did match, set OldId to the matched row's OldId.
Repeat until all OriginalIds are filled in.
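Spelled out, a literal sketch of that merge loop on the sample data (my illustration of the steps above, not a tuned solution):

import numpy as np
import pandas as pd

df = pd.DataFrame({'NewID': [1, 2, 3, 5, 7, 8, 9],
                   'OldId': [0, 1, 2, 4, 6, 7, 5]})

out = df.copy()
out['OriginalId'] = np.nan
while out['OriginalId'].isna().any():
    # Merge OldId against NewID to find the next link in each chain.
    merged = out.merge(df, left_on='OldId', right_on='NewID',
                       how='left', suffixes=('', '_next'))
    # No match: the chain ends here, so OriginalId is the current OldId.
    done = merged['NewID_next'].isna() & out['OriginalId'].isna()
    out.loc[done, 'OriginalId'] = out.loc[done, 'OldId']
    # Match: follow the link by replacing OldId with the matched row's OldId.
    follow = merged['NewID_next'].notna()
    out.loc[follow, 'OldId'] = merged.loc[follow, 'OldId_next'].astype(int).to_numpy()

out['OriginalId'] = out['OriginalId'].astype(int)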
Feels like there should be a pandas-friendly way to do this via cumulative sums or similar.
Easy:
df.set_index('NewID', inplace=True)
df['OriginalId'] = df['OldId']
while df['OriginalId'].isin(df.index).any():  # keep going while any OldId is itself a NewID
    df['OriginalId'] = df['OriginalId'].map(df['OldId']).fillna(df['OriginalId']).astype(int)
A single lookup only resolves one level of indirection; repeating the map lets chains such as 3 → 2 → 1 → 0 resolve fully.
I am working with a series in Python. What I want to achieve is to get the highest value out of every n values in the series.
For example:
if n is 3
Series: 2, 1, 3, 5, 3, 6, 1, 6, 9
Expected Series: 3, 6, 9
I have tried the nlargest function in pandas, but it returns the largest values in descending order, whereas I need the values in the order of the original series.
There are various options. If the series is guaranteed to have a length of a multiple of n, you could drop down to numpy and do a .reshape followed by .max along an axis.
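For example, a minimal sketch of that route, assuming the length is an exact multiple of n:

import numpy as np
import pandas as pd

n = 3
ser = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9])
# reshape into blocks of length n, then take the max of each block
out = pd.Series(ser.to_numpy().reshape(-1, n).max(axis=1))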
Otherwise, if the index is the default (0, 1, 2, ...), you can use groupby:
import pandas as pd
n = 3
ser = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9])
out = ser.groupby(ser.index // n).max()
out:
0 3
1 6
2 9
dtype: int64
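If the series has some other index, you can group by integer position instead (ser and n as above):

import numpy as np
out = ser.groupby(np.arange(len(ser)) // n).max()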
I have a dataframe that has duplicated time indices and I would like to get the mean across all observations for the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked the pandas documentation and read previous posts on Stack Overflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of what my data frame looks like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],'t': [1, 2, 3, 2, 1, 2, 2, 3, 4],'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t    v2
1    -
2    -
3    4.167
4    5
5    6.667
A rough proposal: concatenate two copies of the input frame in which the values in 't' are shifted by +1 and +2 respectively. This way, the column 't' comes to mean "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],
't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
nrows = df.shape[0]
incr = pd.DataFrame({'id': [0]*nrows, 't': [1]*nrows, 'v1': [0]*nrows})  # +1 in 't'
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1]  # drop the days that lack two full previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
v2
t
3 4.166667
4 5.000000
5 6.666667
Thank you for all the help. I ended up using groupby + rolling ('2D'), then dropping duplicates (keeping the last observation).
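For the record, a rough sketch of what that might look like; this is a reconstruction, not the poster's actual code, and it treats 't' as a day offset. Note it only produces rows for days present in the data (so a trailing day like t=5 would need to be appended first), and the earliest days get a partial window rather than a NaN:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4, 4, 4],
                   't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
                   'v1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Treat 't' as a day number so a time-based window can be used.
s = (df.assign(day=pd.to_datetime(df['t'], unit='D'))
       .sort_values('day')
       .set_index('day')['v1'])

# Mean of everything in the preceding 2 days: closed='left'
# excludes the current day's own observations from the window.
roll = s.rolling('2D', closed='left').mean()

# Duplicated timestamps share the same window, so keeping the
# last row per day is the drop-duplicates step.
v2 = roll[~roll.index.duplicated(keep='last')]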
I have a dataframe below:
import pandas as pd
d = {'id': [1, 2, 3, 4, 4, 6, 1, 8, 9], 'cluster': [7, 2, 3, 3, 3, 6, 7, 8, 8]}
df = pd.DataFrame(data=d)
df = df.sort_values('cluster')
I want to keep ALL the rows of a cluster if it contains more than one distinct id, including the rows where the id repeats, since a different id appeared AT LEAST once within that cluster. The code I have been using to achieve this is below, but it drops too many rows for what I am looking for.
df = (df.assign(counts=df.count(axis=1))
.sort_values(['id', 'counts'])
.drop_duplicates(['id','cluster'], keep='last')
.drop('counts', axis=1))
The expected output, which the code above does not produce, would drop the rows at dataframe indexes 0, 1, 5, and 6 but keep indexes 2, 3, 4, 7, and 8. Essentially, it should result in what the code below produces:
df = df.loc[[2, 3, 4, 7, 8]]
I have looked at many pandas deduplication posts on Stack Overflow but have yet to find this scenario. Any help would be greatly appreciated.
I think we can do this with a single boolean mask, using .groupby().nunique():
con1 = df.groupby('cluster')['id'].nunique() > 1
# of these we only want the True indexes
cluster
2 False
3 True
6 False
7 False
8 True
df.loc[(df['cluster'].isin(con1[con1].index))]
id cluster
2 3 3
3 4 3
4 4 3
7 8 8
8 9 8
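As an aside, the same filter can be written in one step with transform, which broadcasts the per-cluster distinct-id count back onto the rows:

df[df.groupby('cluster')['id'].transform('nunique') > 1]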
Suppose I have a matrix (2d list in Python) and I want to create another matrix with the rank of all the elements based on the row and column they are in. The conditions are:
Rank should start from 1
Same rank should be provided for same elements in the same row or column
Same elements may have different ranks if they are in different rows or columns based on their particular row or column ranking
The maximum rank should be as small as possible
Suppose the 5x5 matrix given is:
18, 25, 7, 11, 11
33, 37, 14, 22, 25
29, 29, 11, 14, 11
25, 25, 14, 14, 11
29, 25, 14, 11, 7
The expected output is:
3, 4, 1, 2, 2
6, 7, 3, 4, 5
5, 5, 2, 3, 2
4, 4, 3, 3, 2
5, 4, 3, 2, 1
How can I write this code in Python (or any programming language), and what is the algorithm behind solving this problem?
Try using this nested list comprehension:
l = [[18, 25, 7, 11, 11],
     [33, 37, 14, 22, 25],
     [29, 29, 11, 14, 11],
     [25, 25, 14, 14, 11],
     [29, 25, 14, 11, 7]]
print([[sorted(set(i)).index(x) + 1 for x in i] for i in l])
Output:
[[3, 4, 1, 2, 2], [4, 5, 1, 2, 3], [3, 3, 1, 2, 1], [3, 3, 2, 2, 1], [5, 4, 3, 2, 1]]
Note that this ranks each row independently and ignores the column constraint, which is why it does not match the expected output above.
Taking inspiration from Faruk Hossain's answer, you can start by creating tuples.
Create a tuple (value, row_number, col_number) for each element and store them in a list.
Sort the tuples in ascending order by value.
Iterate over the tuples in that order. For each tuple, find the maximum-ranked element in its row and in its column among elements that are already ranked (i.e. elements with value <= current value). Set the rank of the current element to the maximum of these two ranks, plus 1 if that maximum comes from an element whose value is not equal to the current one.
Also note that we need an additional step here: if the row maximum and the column maximum both happen to have a value equal to the current element's value, we need to recursively update the ranks of those equal elements.
To see why this 4th step matters, take the matrix:
3 2 4
5 1 3
2 6 4
Here, if we fill the bottom-row 4 first, we will rank it as 3; but when we later rank the top-row 4 as 4, we have a contradiction, since equal values in the same column must share a rank.
Time complexity - O(n^4)
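Here is a sketch of that algorithm in Python. Instead of the recursive re-update described in the last step, it unions equal values that share a row or column up front (a small union-find), so each connected group of equal values receives a single rank; the structure and names are mine, not the answerer's. It reproduces the expected 5x5 output above.

from collections import defaultdict

def rank_matrix(matrix):
    m, n = len(matrix), len(matrix[0])
    row_rank = [0] * m   # highest rank assigned so far in each row
    col_rank = [0] * n   # highest rank assigned so far in each column

    # Union-find over flattened cell indices, so equal values that
    # share a row or column end up with a single shared rank.
    parent = list(range(m * n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Group cell coordinates by value, processed in ascending order.
    by_value = defaultdict(list)
    for i in range(m):
        for j in range(n):
            by_value[matrix[i][j]].append((i, j))

    result = [[0] * n for _ in range(m)]
    for value in sorted(by_value):
        cells = by_value[value]
        # Union equal-valued cells that share a row or a column.
        seen_row, seen_col = {}, {}
        for i, j in cells:
            k = i * n + j
            if i in seen_row:
                parent[find(k)] = find(seen_row[i])
            seen_row[i] = k
            if j in seen_col:
                parent[find(k)] = find(seen_col[j])
            seen_col[j] = k
        # Each group's rank: one more than the best rank already
        # sitting in any row or column the group touches.
        group_rank = defaultdict(int)
        for i, j in cells:
            root = find(i * n + j)
            group_rank[root] = max(group_rank[root], row_rank[i], col_rank[j])
        for i, j in cells:
            rank = group_rank[find(i * n + j)] + 1
            result[i][j] = rank
            row_rank[i] = max(row_rank[i], rank)
            col_rank[j] = max(col_rank[j], rank)
    return result

print(rank_matrix([[18, 25, 7, 11, 11],
                   [33, 37, 14, 22, 25],
                   [29, 29, 11, 14, 11],
                   [25, 25, 14, 14, 11],
                   [29, 25, 14, 11, 7]]))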
You can do it in different ways.
I am providing a simple one.
Steps:
Create a tuple with (value, row_number, col_number) for each element and store them in a list.
Sort the tuple in ascending order by the value
Iterate through the list. Each tuple gives you the row and column number of a value, and we rank that position (row, col). Scan the current row and column for the maximum rank already assigned there; the current position's rank is then that maximum plus 1 (or plus 0 in some cases, when the maximum comes from an equal value). You can maintain two arrays to optimise this search.
Follow these steps until the task is finished for all elements.
I was asked this question at a Software Engineering interview.
I have two columns A and B in a pandas dataframe, where values are repeated multiple times. Each unique value of A is expected to have a corresponding unique value in B (see the example below in the form of two lists). But since each value in each column is repeated multiple times, I would like to check whether a one-to-one relationship exists between the two columns or not. Is there any built-in function in pandas to check that? If not, is there an efficient way of achieving the task?
Example:
A = [1, 3, 3, 2, 1, 2, 1, 1]
B = [5, 12, 12, 10, 5, 10, 5, 5]
Here, for each 1 in A, the corresponding value in B is always 5, and nothing else. Similarly, 2-->10, and 3-->12. Hence, each number in A has only one unique corresponding number in B (and no other number). I have called this a one-to-one relationship. Now I want to check whether such a relationship exists between two columns in a pandas dataframe or not.
An example where this relationship is not satisfied:
A = [1, 3, 3, 2, 1, 2, 1, 1]
B = [5, 12, 12, 10, 5, 10, 7, 5]
Here, 1 in A doesn't have a unique corresponding value in B. It has two corresponding values - 5 and 7. Hence, the relationship is not satisfied.
Suppose you have some dataframe:
import pandas as pd
d = pd.DataFrame({'A': [1, 3, 1, 2, 1, 3, 2], 'B': [4, 6, 4, 5, 4, 6, 5]})
d has a groupby method, which returns a GroupBy object. This is the interface for grouping rows by equal column values, for example.
gb = d.groupby('A')
grouped_b_column = gb['B']
On the grouped rows you can perform an aggregation. Let's find the min and max value in every group.
res = grouped_b_column.agg(['min', 'max'])
>>> print(res)
   min  max
A
1    4    4
2    5    5
3    6    6
Now we just check that min and max are equal in every group, so that every group consists of equal B values:
res['min'].equals(res['max'])
If this check passes, then for every A you have a unique B. You should then run the same check with the A and B columns swapped.
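A more compact way to run the same check in both directions uses nunique on each grouping:

import pandas as pd

d = pd.DataFrame({'A': [1, 3, 1, 2, 1, 3, 2], 'B': [4, 6, 4, 5, 4, 6, 5]})
# One-to-one holds iff every A maps to exactly one B and vice versa.
one_to_one = (d.groupby('A')['B'].nunique().eq(1).all()
              and d.groupby('B')['A'].nunique().eq(1).all())
print(one_to_one)  # True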