I want to calculate some values between type1 and type2 elements. For example, if I have indexes like (a,b) and (b,a), then the result should equal (a,b)+(b,a). In other words, I want to sum the values whenever a reversed index exists.
One way using frozenset:
df = df.reset_index()
df['total'].groupby(df[['type1', 'type2']].apply(frozenset, axis=1)).sum()
Output:
(b, a) 15
(c, a) 19
Name: total, dtype: int64
Let df be your DataFrame. Swap the first and second levels of the MultiIndex, concatenate the original and the new DataFrames, and calculate row sums:
pd.concat([df, df.swaplevel()], axis=1).sum(axis=1)
#a b 15
# c 19
#b a 15
#c a 19
The solution works even for rows that do not have a matching reversed row. Note that the result contains a duplicate row for each direct/reversed index pair, so you will have to filter out the unwanted rows.
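For instance, one way to keep each unordered pair only once (assuming the index levels are comparable, as with the strings here) is to keep only the rows whose levels are already in sorted order:
s = pd.concat([df, df.swaplevel()], axis=1).sum(axis=1)
s = s[s.index.get_level_values(0) <= s.index.get_level_values(1)]
#a  b    15
#   c    19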
Related
I'm trying to find the duplicates between 2 columns, where order is independent, but I need to keep the count of duplicates after dropping them:
df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])
This is my expected result:
source target count
0 A B 2
1 D B 1
3 B C 2
I've already tried several approaches, but I can't come close to a solution. It does not matter which combination is kept; in the expected result I kept the first occurrence.
The following approach creates a new column containing a frozenset of the values in the specified columns. The advantage is that all other columns are preserved in the final result. Furthermore, the indices are preserved exactly as in the expected output you posted:
df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])
# Create column with set of both columns
df['tmp'] = df.apply(lambda x: frozenset([x['source'], x['target']]), axis=1)
# Create count column based on new tmp column
df['count'] = df.groupby(['tmp'])['target'].transform('size')
# Drop duplicate rows based on new tmp column
df = df[~df.duplicated(subset='tmp', keep='first')]
# Remove tmp column
df = df.drop(columns='tmp')
df
Output:
source target count
0 A B 2
1 D B 1
3 B C 2
You can use df.duplicated() to see which rows are duplicated: the output is True if a row is a duplicate and False if it isn't. For more information and a practical example, check out the documentation.
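For instance, a minimal self-contained sketch, reusing the frozenset key idea from the answer above:
import pandas as pd

df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])
# build an order-independent key, then ask which rows repeat it
key = df.apply(lambda r: frozenset([r['source'], r['target']]), axis=1)
key.duplicated()
#0    False
#1    False
#2     True
#3    False
#4     True
#dtype: bool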
Create a summary based on applying a frozenset to your desired columns. Here we're using all columns.
summary = df.apply(frozenset, axis=1).value_counts()
This'll give you a Series of:
(A, B) 2
(C, B) 2
(B, D) 1
dtype: int64
You can then reconstruct a DataFrame by iterating over that Series, e.g.:
df2 = pd.DataFrame(((*idx, val) for idx, val in summary.items()), columns=[*df.columns, 'count'])
Which results in:
source target count
0 A B 2
1 C B 2
2 B D 1
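One caveat: frozensets are unordered, so the pairs may come out swapped relative to the input ((C, B) and (B, D) above). If a canonical order matters, a sorted tuple, which is my substitution rather than part of the answer above, works as the key instead:
summary = df.apply(lambda r: tuple(sorted(r)), axis=1).value_counts()
#(A, B)    2
#(B, C)    2
#(B, D)    1
#dtype: int64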
So I have four columns in a pandas dataframe: A, B, C, and D. Column A contains 30 words, 18 of which are in column B. Column C contains either a 1 or a 2 (the keyboard response to the column B words), and column D also contains a 1 or a 2 (the correct response).
What I need to do is see the total correct for only the words where columns A and B overlap. I understand how to compare the C and D columns to get the total correct once I have the right dataframe, but I am having a hard time wrapping my head around comparing the overlap in A and B.
Use Series.isin():
df.B.isin(df.A)
That will give you a boolean Series the same length as df.B indicating for each value in df.B whether it is also present in df.A.
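For instance, a minimal sketch with invented words (the real columns hold 30 words plus the 1/2 responses):
import pandas as pd

df = pd.DataFrame({'A': ['cat', 'dog', 'bird', 'fish'],
                   'B': ['dog', 'fox', 'cat', 'bird'],
                   'C': [1, 2, 1, 2],
                   'D': [1, 1, 1, 2]})

mask = df.B.isin(df.A)  # True where the word in B also appears in A
# compare responses with answers on the overlapping words only
total_correct = (df.loc[mask, 'C'] == df.loc[mask, 'D']).sum()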
For example, I have a dataframe called dat, and I want to apply a function to each column of the dataframe. If the return value is True, keep the column and move on to the next one; if the return value is False, drop the column and move on.
I know I can write a for loop to do this, but is there an efficient way?
You could do it like this, using boolean indexing on df.columns:
Say, for simplicity, we want to drop all columns where the sum is greater than 50:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [101, 102, 102, 102]})
r = df.apply(np.sum)  # apply the sum function to every column
c = r <= 50           # boolean test for each column
df[c[c].index]        # boolean indexing on the columns filters the dataframe
Output:
A
0 2
1 4
2 6
3 8
Updating an old answer:
df.loc[:, df.sum() <= 50]
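Equivalently, if you prefer to phrase it as a drop (the same condition, inverted):
df.drop(columns=df.columns[df.sum() > 50])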
I have a dataframe with millions of rows, unique indexes, and a column 'b' that has several repeated values.
I would like to generate a dataframe without the duplicated data, but I do not want to lose the index information. I want the new dataframe to have an index that is a concatenation of the indexes ("old_index1,old_index2") where 'b' had duplicated values, but that remains unchanged for rows where 'b' had unique values. The values of the 'b' column should remain unchanged, as in a keep='first' strategy. Example below.
Input dataframe:
df = pd.DataFrame(data=[[1, "non_duplicated_1"],
                        [2, "duplicated"],
                        [2, "duplicated"],
                        [3, "non_duplicated_2"],
                        [4, "non_duplicated_3"]],
                  index=['one', 'two', 'three', 'four', 'five'],
                  columns=['a', 'b'])
Desired output:
a b
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
The actual dataframe is quite large so I would like to avoid non-vectorized operations.
I am finding this surprisingly difficult... Any ideas?
You can use transform on the index column (after you use reset_index). Then, drop duplicates in column b:
df.index = df.reset_index().groupby('b')['index'].transform(','.join)
df.drop_duplicates('b',inplace=True)
>>> df
a b
index
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
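One caveat, an assumption on my part rather than something from the question: ','.join only works if the index labels are already strings. With a numeric index, cast first:
df.index = (df.reset_index()
              .groupby('b')['index']
              .transform(lambda s: ','.join(s.astype(str))))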
Setup
dct = {'index': ','.join, 'a': 'first'}
You can reset_index before the groupby so the old index becomes a regular column, then aggregate with the dict above:
df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
b a
index
one non_duplicated_1 1
two,three duplicated 2
four non_duplicated_2 3
five non_duplicated_3 4
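The column order here differs from the desired output (b before a); if that matters, reorder at the end:
(df.reset_index()
   .groupby('b', as_index=False, sort=False)
   .agg(dct)
   .set_index('index')[['a', 'b']])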
Suppose I have a dataframe of, say, 25 columns as follows:
A B C ...... I J ......... Y
I-1 yes 3 1-2-2017 100 james
I-2 no 4 NaN 100 ashok
I-3 NaN 9 2-10-2017 5 mary
I-4 yes NaN 2-10-2017 0 sania
I would like to obtain 3 dataframes from the above dataframe such that
a) the first dataframe consists of columns A to G
b) the second dataframe consists of column A and columns I to J.
c) the third dataframe consists of column A and columns K to Y.
How should I approach it? (Preferably in Python. Only some column values are illustrated; I can show more if required.)
You can create new DataFrames by using loc in combination with join:
df_a_to_g = df.loc[:, 'A':'G']
df_a_and_i_to_j = df.loc[:, ['A']].join(df.loc[:, 'I':'J'])
df_a_and_k_to_y = df.loc[:, ['A']].join(df.loc[:, 'K':'Y'])
If you want to select the columns numerically, you can use iloc instead of loc:
# Select the first column and columns 11 through 25 (K to Y).
# Positional indexing starts at 0, so column K (the 11th) sits at
# position 10, and the stop of a slice is exclusive, so 10:25
# covers K through Y.
df_new = df.iloc[:, [0]].join(df.iloc[:, 10:25])
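As a quick sanity check of the slicing, here is an invented 25-column frame (A through Y):
import string
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4 * 25).reshape(4, 25),
                  columns=list(string.ascii_uppercase[:25]))

df_new = df.iloc[:, [0]].join(df.iloc[:, 10:25])
df_new.columns.tolist()
#['A', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y']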