Comparing multiple columns for a single row - python
I am grouping columns and identifying which column in each group has values that differ from the rest. For example, I can group columns A, B, C, D and delete column A because it is the different one (in row 2 it is 2.1 while the others are 2). Likewise, I can group columns E, F, G, H and delete column G because it differs in row 0 (G is Blue while E, F and H are Red).
|   | A   | B | C    | D      | E     | F     | G     | H     |
|---|-----|---|------|--------|-------|-------|-------|-------|
| 0 | 1.0 | 1 | 1 in | 1 inch | Red   | Red   | Blue  | Red   |
| 1 | 2.0 | 2 | 2 in | 2 inch | Green | Green | Green | Green |
| 2 | 2.1 | 2 | 2 in | 2 inch | Blue  | Blue  | Blue  | Blue  |
What I have tried so far to compare values:
    import difflib

    text1 = '1.0'
    text2 = '1 in'
    text3 = '1 inch'
    # Note: SequenceMatcher only compares two sequences (a and b); the third
    # string passed here is actually consumed as the `autojunk` flag, so
    # text3 never takes part in the comparison.
    output = str(int(difflib.SequenceMatcher(None, text1, text2, text3).ratio()*100))

output: '28'
This does not work well for comparing numbers followed by a unit such as inches or mm. I then tried spacy.load('en_core_web_sm'), which works better, but it's still not quite there. Are there any ways to compare a group of values that are similar, like 1.0, 1, 1 in, 1 inch?
For columns containing only strings, you can use pandas' df.equals(), which compares two DataFrames or Series (columns):

    # Example
    df.E.equals(df.F)

The function below uses this to compare several columns against a single column I call the main (or template) column, which should be the column holding the "correct" values.
    def col_compare(main_col, *to_compare):
        '''Compares each column from a list to another column
        Inputs:
            * main_col: enter the column name (e.g. 'A')
            * to_compare: enter as many column names as you want (e.g. 'B', 'C') '''
        # note: relies on the global DataFrame `df` defined above
        # Columns to compare, as a list
        to_compare = list(to_compare)
        # List to store the names of the columns that differ
        results = []
        # Compare each column from the list with the template column
        for col in to_compare:
            if not df[main_col].equals(df[col]):
                results.append(col)
        print(f'Main Column: {main_col}')
        print(f'Compared to: {to_compare}')
        return f"The columns that have different values from {main_col} are {results}"
e.g.

    col_compare('E', 'F', 'G', 'H')

output:

    Main Column: E
    Compared to: ['F', 'G', 'H']
    The columns that have different values from E are ['G']
For columns A, B, C and D, where the values you want to compare are numbers sometimes followed by strings (units), one option is to extract the numbers into new columns used only for the comparison; you can drop them later.

You can create a new column like the one below for each column that mixes numbers and strings:

    import re
    df['C_num'] = df.C.apply(lambda x: int(re.search('[0-9]*', x).group()))

and then use the col_compare function above to run the comparison between the numeric columns, as in the sketch below.
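A minimal, hypothetical end-to-end sketch of that idea (the reconstructed df, the float-based extraction and the choice of 'B_num' as the template column are my assumptions, not part of the original answer):

    import re
    import pandas as pd

    # Hypothetical reconstruction of the example dataframe from the question
    df = pd.DataFrame({
        'A': [1.0, 2.0, 2.1],
        'B': [1, 2, 2],
        'C': ['1 in', '2 in', '2 in'],
        'D': ['1 inch', '2 inch', '2 inch'],
    })

    # Extract the leading number of every value as a float so all four
    # columns become directly comparable regardless of any trailing unit
    for col in ['A', 'B', 'C', 'D']:
        df[col + '_num'] = df[col].astype(str).apply(
            lambda x: float(re.search(r'[0-9.]+', x).group()))

    # Using B_num as the template flags A_num as the odd one out (2.1 vs 2.0)
    print(col_compare('B_num', 'A_num', 'C_num', 'D_num'))

Using float() instead of int() keeps decimals such as 2.1 distinct from 2; with the int() version shown above, 1.0 and 1 would both become 1, which may or may not be what you want.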
I found an answer to my question. Crystal L recommended that I use fuzzy matching, and I found it useful. Here is the documentation I followed: https://www.datacamp.com/community/tutorials/fuzzy-string-python. Here are a couple of things I tried:
    # Function to compare length and similar characters
    import numpy as np

    def levenshtein_ratio_and_distance(s, t, ratio_calc=False):
        """ levenshtein_ratio_and_distance:
            Calculates the Levenshtein distance between two strings.
            If ratio_calc = True, the function computes the
            Levenshtein distance ratio of similarity between two strings.
            For all i and j, distance[i, j] will contain the Levenshtein
            distance between the first i characters of s and the
            first j characters of t.
        """
        # Initialize matrix of zeros
        rows = len(s) + 1
        cols = len(t) + 1
        distance = np.zeros((rows, cols), dtype=int)

        # Populate matrix of zeros with the indices of each character of both strings
        for i in range(1, rows):
            for k in range(1, cols):
                distance[i][0] = i
                distance[0][k] = k

        # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
        for col in range(1, cols):
            for row in range(1, rows):
                if s[row-1] == t[col-1]:
                    cost = 0  # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
                else:
                    # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                    # the cost of a substitution is 2. If we calculate just the distance, then the cost of a substitution is 1.
                    if ratio_calc == True:
                        cost = 2
                    else:
                        cost = 1
                distance[row][col] = min(distance[row-1][col] + 1,        # Cost of deletions
                                         distance[row][col-1] + 1,        # Cost of insertions
                                         distance[row-1][col-1] + cost)   # Cost of substitutions
        if ratio_calc == True:
            # Computation of the Levenshtein Distance Ratio
            Ratio = ((len(s) + len(t)) - distance[row][col]) / (len(s) + len(t))
            return Ratio
        else:
            # print(distance)  # Uncomment if you want to see the matrix showing how the algorithm
            # computes the cost of deletions, insertions and/or substitutions
            # This is the minimum number of edits needed to convert string a to string b
            return "The strings are {} edits away".format(distance[row][col])
    Str1 = '1 mm'
    Str2 = '1 in'
    Distance = levenshtein_ratio_and_distance(Str1.lower(), Str2.lower())
    print(Distance)
    Ratio = levenshtein_ratio_and_distance(Str1.lower(), Str2.lower(), ratio_calc=True)
    print(Ratio)
    # pip install python-Levenshtein
    import Levenshtein as lev

    Str1 = '1 mm'
    Str2 = '1 in'
    Distance = lev.distance(Str1.lower(), Str2.lower())  # no trailing comma here, which would create a tuple
    print(Distance)
    Ratio = lev.ratio(Str1.lower(), Str2.lower())
    print(Ratio)
    # pip install fuzzywuzzy
    from fuzzywuzzy import fuzz

    Str1 = '2 inches'
    Str2 = '1 mm'
    Ratio = fuzz.ratio(Str1.lower(), Str2.lower())
    Partial_Ratio = fuzz.partial_ratio(Str1.lower(), Str2.lower())
    Token_Sort_Ratio = fuzz.token_sort_ratio(Str1, Str2)
    Token_Set_Ratio = fuzz.token_set_ratio(Str1, Str2)
    print(Ratio)
    print(Partial_Ratio)
    print(Token_Sort_Ratio)
    print(Token_Set_Ratio)
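As a rough illustration of how one of these scorers could be wired back into the original column-group problem, here is a hypothetical sketch (my own, not something from the answers): for each row it flags the column whose average similarity to the rest of its group falls below an arbitrary threshold. The reconstructed df and the threshold of 60 are assumptions.

    from fuzzywuzzy import fuzz
    import pandas as pd

    # Hypothetical reconstruction of the E-H part of the example table
    df = pd.DataFrame({
        'E': ['Red', 'Green', 'Blue'],
        'F': ['Red', 'Green', 'Blue'],
        'G': ['Blue', 'Green', 'Blue'],
        'H': ['Red', 'Green', 'Blue'],
    })

    group = ['E', 'F', 'G', 'H']
    threshold = 60  # arbitrary similarity cutoff (0-100)

    for idx, row in df[group].iterrows():
        for col in group:
            others = [c for c in group if c != col]
            # average similarity of this cell against the rest of its group
            avg = sum(fuzz.ratio(str(row[col]).lower(), str(row[c]).lower())
                      for c in others) / len(others)
            if avg < threshold:
                print(f'Row {idx}: column {col} looks different ({row[col]!r})')

For row 0 this prints that column G ('Blue') looks different from E, F and H ('Red'), matching the example in the question.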
Related
Python script to sum values according to conditions in a loop
I need to sum the values contained in a column (column 9) whenever a condition is satisfied: the condition is that the values belong to the same pair of individuals (column 1 and column 3), whether the pair is repeated or not. My input file looks like this:

    Sindhi_HGDP00171 0 Tunisian_39T 0 1 120437718 147097266 3.02 7.111
    Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 3.468
    Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 4.468
    IBS_HG01768 2 Moroccan_MRA46 1 1 34186193 36027711 30.46 3.108
    IBS_HG01710 1 Sardinian_HGDP01065 2 1 246117191 249120684 7.53 3.258
    IBS_HG01768 2 Moroccan_MRA46 2 1 34186193 37320967 43.4 4.418

So, for instance, I need the value of column 9 to be summed for each pair. Some of these pairs appear multiple times; in that case I need the sum of the column 9 values between IBS_HG01768 and Moroccan_MRA46, and the sum of the values between Sindhi_HGDP00183 and Sindhi_HGDP00206. Some pairs are not repeated, but I still need them to appear in the final results. What I have managed so far is to sum by group (population), i.e. I sum the column 9 values per pair of populations, such as Sindhi and Tunisian. I need the sum per pair of individuals. My script is this:

    import pandas as pd
    import numpy as np
    import itertools

    # define column names
    cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']

    # load data (the file needs to be in the same folder as the script)
    data = pd.read_csv("./Roma_Ref_All_sorted.txt", sep='\t', names=cols)

    # remove the sample ID from the ID1/ID2 columns and place it in two dedicated columns
    data[['ID1', 'ID1_samples']] = data['ID1'].str.split('_', expand=True)
    data[['ID2', 'ID2_samples']] = data['ID2'].str.split('_', expand=True)

    # get the groups list from both ID columns...
    groups_id1 = list(data.ID1.unique())
    groups_id2 = list(data.ID2.unique())
    groups = list(set(groups_id1 + groups_id2))

    # ... and all the possible pairs
    group_pairs = [i for i in itertools.combinations(groups, 2)]

    # subset the pairs containing Roma
    group_pairs_roma = [x for x in group_pairs
                        if ('Roma' in x[0] and x[0] != 'Romanian')
                        or ('Roma' in x[1] and x[1] != 'Romanian')]

    # prepare the output df
    result = pd.DataFrame(columns=['ID1', 'ID2', 'IBD_sum'])

    # loop over all the possible pairs and compute the sum of IBD length
    for idx, group_pair in enumerate(group_pairs_roma):
        id1 = group_pair[0]
        id2 = group_pair[1]
        ibd_sum = round(data.loc[((data['ID1'] == id1) & (data['ID2'] == id2)) |
                                 ((data['ID1'] == id2) & (data['ID2'] == id1)),
                                 'IBDLENGTH'].sum(), 3)
        result.loc[idx, ['ID1', 'ID2', 'IBD_sum']] = [id1, id2, ibd_sum]

    # save results
    result.to_csv("./groups_pairs_sum_IBD.txt", sep='\t', index=False)

My current output is something like this:

    ID1     ID2       IBD_sum
    Sindhi  IBS       3.275
    Sindhi  Moroccan  74.201
    Sindhi  Sindhi    119.359

While I need something like:

    ID1                 ID2                   IBD_sum
    Sindhi_individual1  Moroccan_individual1  3.275
    Sindhi_individual2  Moroccan_individual2  5.275
    Sindhi_individual3  IBS_individual1       4.275

I have tried substituting one line in my code, writing

    groups_id1 = list(data.ID1_samples.unique())
    groups_id2 = list(data.ID2_samples.unique())

and later

    ibd_sum = round(data.loc[((data['ID1_samples'] == id1) & (data['ID2_samples'] == id2)) |
                             ((data['ID1_samples'] == id2) & (data['ID2_samples'] == id1)),
                             'IBDLENGTH'].sum(), 3)

which in theory should work, because I set the individuals as pairs instead of the populations as pairs, but the output was empty. What could I do to edit the code for what I need?
I have solved the problem on my own, but using the R language. This is the code:

    ibd <- read.delim("input.txt", sep='\t')
    ibd_sum_indv <- ibd %>%
      group_by(ID1, ID2) %>%
      summarise(SIBD = sum(IBDLENGTH), NIBD = n()) %>%
      ungroup()
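For readers who want to stay in Python, a roughly equivalent pandas sketch (my own translation of the R snippet, assuming the same input file and column names as in the question, with ID1/ID2 kept as the full individual IDs, i.e. without the str.split step) would be:

    import pandas as pd

    cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']
    data = pd.read_csv("./Roma_Ref_All_sorted.txt", sep='\t', names=cols)

    # group by the pair of individual IDs and aggregate, mirroring the dplyr chain
    ibd_sum_indv = (
        data.groupby(['ID1', 'ID2'], as_index=False)
            .agg(SIBD=('IBDLENGTH', 'sum'), NIBD=('IBDLENGTH', 'size'))
    )
    print(ibd_sum_indv)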
Find combinations (without "of size=r") from a set with decreasing sum value using Python
(Revised for clarity 02-08-2021) This is similar to the question here: Find combinations of size r from a set with decreasing sum value. It is different from the answer posted in that link because I am looking for answers without "size r=3".

I have a set (array) of numbers. I need the sums of the combinations of those numbers sorted from largest to smallest, showing the numbers from the array that were used to get the total for that row. Any number in the array can only be used once per row, but not all of the numbers have to be used in each row as the total decreases. If a number is not used, a zero should be used as a placeholder instead, so I can create a CSV file with the columns aligned.

Input Example #1 with 7 numbers in the array: [30,25,20,15,10,5,1]

Desired Output Example #1 format, where the last number in each row is the total (sum) of the row:

    Beginning of list
    30,25,20,15,10,5,1,106
    30,25,20,15,10,5,0,105
    30,25,20,15,10,0,1,101
    30,25,20,15,10,0,0,100
    ...(all number combinations in between)
    30,0,0,0,0,0,0,30
    0,25,0,0,0,5,0,30
    ...(all number combinations in between)
    0,0,0,15,0,0,1,16
    0,0,0,15,0,0,0,15
    0,0,0,0,10,5,0,15
    0,0,0,0,10,0,1,11
    0,0,0,0,0,5,1,6
    0,0,0,0,0,5,0,5
    0,0,0,0,0,0,1,1
    End of list

Also, duplicate totals are allowed and preferred, showing the different combinations that produce the same row total. For Example #1:

    30,0,0,0,0,0,0,30
    0,25,0,0,0,5,0,30

For example, this is one row of output based on Input Example #1 above: 30,25,0,0,0,5,1,61. The last number in the row is the total. The total could also be the first number, but the important thing is that the output list is sorted in descending order by the total.

Input Example #2 with 5 numbers in the array: [20,15,10,5,1]

Desired Output Example #2 format, where the last number in each row is the total (sum) of the row:

    Beginning of list
    20,15,10,5,1,51
    20,15,10,5,0,50
    20,15,10,0,1,46
    20,15,10,0,0,45
    ...(all number combinations in between)
    20,0,10,0,0,30
    0,15,10,5,0,30
    ...(all number combinations in between)
    0,15,0,0,1,16
    0,15,0,0,0,15
    0,0,10,5,0,15
    0,0,10,0,1,11
    0,0,10,0,0,10
    0,0,0,5,1,6
    0,0,0,5,0,5
    0,0,0,0,1,1
    End of list

For Input Example #1 ([30,25,20,15,10,5,1]): every row of the output should use each number in the array at most once to get the total for that row, and the rows must be sorted in decreasing order by the sums of the numbers used.

The first output row would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 1 = 106
The second output row would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 0 = 105
The third output row would show the result of 30 + 25 + 20 + 15 + 10 + 0 + 1 = 101
...the rest of the rows would continue with the total (sum) of the row getting smaller until it reaches 1...
The third-to-last output row would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 1 = 6
The second-to-last output row would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 0 = 5
The last output row would show the result of 0 + 0 + 0 + 0 + 0 + 0 + 1 = 1

I started with the code provided by user Divyanshu, modified with different input numbers and with () added to the last line (but I need to use all the numbers in the array instead of size=4 as shown here):

    import itertools
    array = [30,25,20,15,10,5,1]
    size = 4
    answer = []   # to store all combinations
    order = []    # to store order according to sum
    number = 0    # index of combination
    for comb in itertools.combinations(array, size):
        answer.append(comb)
        order.append([sum(comb), number])   # storing sum and index
        number += 1
    order.sort(reverse=True)   # sorting in decreasing order
    for key in order:
        print(key[0], answer[key[1]])   # key[0] is sum of combination

So this is what I need as an input (in this example): [30,25,20,15,10,5,1]

size=4 in the above code limits the output to 4 of the numbers in the array. If I take out size=4 I get an error. I need to use the entire array of numbers. I can manually change size=4 to size=1 and run it, then size=2 and run it, and so on. Entering size=1 through size=7 in the code and running it (7 times in this example) gives me 7 different outputs of all possible combinations, one per size. I could then manually put the lists together, but that won't work for larger sets (arrays) of numbers. Can I modify the code referenced above, or do I need a different approach?
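As background for the answers below, the standard itertools idiom for producing every combination of every size in one pass (a minimal sketch of the general idea, not code from any of the posts) looks like this:

    import itertools

    array = [30, 25, 20, 15, 10, 5, 1]

    # "Powerset"-style iteration: all combinations of every size from 1 to len(array)
    all_combos = itertools.chain.from_iterable(
        itertools.combinations(array, size) for size in range(1, len(array) + 1))

    # Sort by sum, largest first, exactly as the original snippet does for one size
    for comb in sorted(all_combos, key=sum, reverse=True):
        print(sum(comb), comb)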
I think you could do it as follows. Import the following:

    import pandas as pd
    import numpy as np

The beginning of the code, as in the question:

    import itertools
    array = [30,25,20,15,10,5,1]
    size = 4
    answer = []   # to store all combinations
    order = []    # to store order according to sum
    number = 0    # index of combination
    for comb in itertools.combinations(array, size):
        answer.append(comb)
        order.append([sum(comb), number])   # storing sum and index
        number += 1
    order.sort(reverse=True)   # sorting in decreasing order
    for key in order:
        print(key[0], answer[key[1]])   # key[0] is sum of combination

The code that, I think, could help to obtain the final output:

    array_len = array.__len__()

    # Auxiliary dict to place each value at its position in the original array
    dict_array = {}
    for i in range(0, array_len):
        print(i)
        dict_array[array[i]] = i

    # Reorder the previous combinations
    aux = []
    for key in order:
        array_zeros = np.zeros([1, array_len + 1])
        for i in answer[key[1]]:
            print(i, dict_array[i])
            array_zeros[0][dict_array[i]] = i
        # Add the total
        array_zeros[0][array_len] = key[0]
        aux.append(array_zeros[0])

    # Transform into a dataframe
    aux = pd.DataFrame(aux)
    # This is to add the names to the columns of the dataframe
    aux.columns = array + ['total']
    aux = aux.astype(int)
    print(aux.head().astype(int))

       30  25  20  15  10  5  1  total
    0  30  25  20  15   0  0  0     90
    1  30  25  20   0  10  0  0     85
    2  30  25   0  15  10  0  0     80
    3  30  25  20   0   0  5  0     80
    4  30  25  20   0   0  0  1     76

Now the generalization for all sizes:

    import itertools
    import pandas as pd
    import numpy as np

    array = [30,25,20,15,10,5,1]
    array_len = array.__len__()
    answer = []   # to store all combinations
    order = []    # to store order according to sum
    number = 0    # index of combination
    for size in range(1, array_len + 1):
        print(size)
        for comb in itertools.combinations(array, size):
            answer.append(comb)
            order.append([sum(comb), number])   # storing sum and index
            number += 1
    order.sort(reverse=True)   # sorting in decreasing order
    for key in order:
        print(key[0], answer[key[1]])   # key[0] is sum of combination

    # Auxiliary dict to place each value at its position in the original array
    dict_array = {}
    for i in range(0, array_len):
        print(i)
        dict_array[array[i]] = i

    # Reorder the previous combinations
    aux = []
    for key in order:
        array_zeros = np.zeros([1, array_len + 1])
        for i in answer[key[1]]:
            print(i, dict_array[i])
            array_zeros[0][dict_array[i]] = i
        # Add the total
        array_zeros[0][array_len] = key[0]
        aux.append(array_zeros[0])

    # Transform into a dataframe
    aux = pd.DataFrame(aux)
    # This is to add the names to the columns of the dataframe
    aux.columns = array + ['total']
    aux = aux.astype(int)
    print(aux.head().astype(int))

       30  25  20  15  10  5  1  total
    0  30  25  20  15  10  5  1    106
    1  30  25  20  15  10  5  0    105
    2  30  25  20  15  10  0  1    101
    3  30  25  20  15  10  0  0    100
    4  30  25  20  15   0  5  1     96
Thanks to #RafaelValero (Rafael Valero) I was able to learn about pandas, numpy and dataframes. I looked up pandas options to get the desired output. Here is the final code, with some extra lines left in for reference but commented out:

    import itertools
    import pandas as pd
    import numpy as np

    array = [30,25,20,15,10,5,1]
    array_len = array.__len__()
    answer = []   # to store all combinations
    order = []    # to store order according to sum
    number = 0    # index of combination
    for size in range(1, array_len + 1):
        # Commented out the line below as it was giving extra information
        # print(size)
        for comb in itertools.combinations(array, size):
            answer.append(comb)
            order.append([sum(comb), number])   # storing sum and index
            number += 1
    order.sort(reverse=True)   # sorting in decreasing order

    # Commented out the two lines below as they were from the original code and gave extra information
    # for key in order:
    #     print(key[0], answer[key[1]])   # key[0] is sum of combination

    # Auxiliary dict to place each value at its position in the original array
    dict_array = {}
    for i in range(0, array_len):
        # Commented out the line below as it was giving extra information
        # print(i)
        dict_array[array[i]] = i

    # Reorder the previous combinations
    aux = []
    for key in order:
        array_zeros = np.zeros([1, array_len + 1])
        for i in answer[key[1]]:
            # Commented out the line below as it was giving extra information
            # print(i, dict_array[i])
            array_zeros[0][dict_array[i]] = i
        # Add the total
        array_zeros[0][array_len] = key[0]
        aux.append(array_zeros[0])

    # Transform into a dataframe
    aux = pd.DataFrame(aux)

    # This is to add the names to the columns of the dataframe
    # Update: removed the line below as I didn't need a header
    # aux.columns = array + ['total']

    aux = aux.astype(int)

    # Tried the option below first, but it was not necessary when using to_csv
    # pd.set_option('display.max_rows', None)

    print(aux.to_csv(index=False, header=None))

Searched references:
Similar question: Find combinations of size r from a set with decreasing sum value
Pandas references:
https://thispointer.com/python-pandas-how-to-display-full-dataframe-i-e-print-all-rows-columns-without-truncation/
https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_csv.html
Online compiler used: https://www.programiz.com/python-programming/online-compiler/

Output using Input Example #1 with 7 numbers in the array ([30,25,20,15,10,5,1]):

    30,25,20,15,10,5,1,106
    30,25,20,15,10,5,0,105
    30,25,20,15,10,0,1,101
    30,25,20,15,10,0,0,100
    30,25,20,15,0,5,1,96
    30,25,20,15,0,5,0,95
    30,25,20,0,10,5,1,91
    30,25,20,15,0,0,1,91
    30,25,20,0,10,5,0,90
    30,25,20,15,0,0,0,90
    30,25,0,15,10,5,1,86
    30,25,20,0,10,0,1,86
    30,25,0,15,10,5,0,85
    30,25,20,0,10,0,0,85
    30,0,20,15,10,5,1,81
    30,25,0,15,10,0,1,81
    30,25,20,0,0,5,1,81
    30,0,20,15,10,5,0,80
    30,25,0,15,10,0,0,80
    30,25,20,0,0,5,0,80
    0,25,20,15,10,5,1,76
    30,0,20,15,10,0,1,76
    30,25,0,15,0,5,1,76
    30,25,20,0,0,0,1,76
    0,25,20,15,10,5,0,75
    30,0,20,15,10,0,0,75
    30,25,0,15,0,5,0,75
    30,25,20,0,0,0,0,75
    0,25,20,15,10,0,1,71
    30,0,20,15,0,5,1,71
    30,25,0,0,10,5,1,71
    30,25,0,15,0,0,1,71
    0,25,20,15,10,0,0,70
    30,0,20,15,0,5,0,70
    30,25,0,0,10,5,0,70
    30,25,0,15,0,0,0,70
    0,25,20,15,0,5,1,66
    30,0,20,0,10,5,1,66
    30,0,20,15,0,0,1,66
    30,25,0,0,10,0,1,66
    0,25,20,15,0,5,0,65
    30,0,20,0,10,5,0,65
    30,0,20,15,0,0,0,65
    30,25,0,0,10,0,0,65
    0,25,20,0,10,5,1,61
    30,0,0,15,10,5,1,61
    0,25,20,15,0,0,1,61
    30,0,20,0,10,0,1,61
    30,25,0,0,0,5,1,61
    0,25,20,0,10,5,0,60
    30,0,0,15,10,5,0,60
    0,25,20,15,0,0,0,60
    30,0,20,0,10,0,0,60
    30,25,0,0,0,5,0,60
    0,25,0,15,10,5,1,56
    0,25,20,0,10,0,1,56
    30,0,0,15,10,0,1,56
    30,0,20,0,0,5,1,56
    30,25,0,0,0,0,1,56
    0,25,0,15,10,5,0,55
    0,25,20,0,10,0,0,55
    30,0,0,15,10,0,0,55
    30,0,20,0,0,5,0,55
    30,25,0,0,0,0,0,55
    0,0,20,15,10,5,1,51
    0,25,0,15,10,0,1,51
    0,25,20,0,0,5,1,51
    30,0,0,15,0,5,1,51
    30,0,20,0,0,0,1,51
    0,0,20,15,10,5,0,50
    0,25,0,15,10,0,0,50
    0,25,20,0,0,5,0,50
    30,0,0,15,0,5,0,50
    30,0,20,0,0,0,0,50
    0,0,20,15,10,0,1,46
    0,25,0,15,0,5,1,46
    30,0,0,0,10,5,1,46
    0,25,20,0,0,0,1,46
    30,0,0,15,0,0,1,46
    0,0,20,15,10,0,0,45
    0,25,0,15,0,5,0,45
    30,0,0,0,10,5,0,45
    0,25,20,0,0,0,0,45
    30,0,0,15,0,0,0,45
    0,0,20,15,0,5,1,41
    0,25,0,0,10,5,1,41
    0,25,0,15,0,0,1,41
    30,0,0,0,10,0,1,41
    0,0,20,15,0,5,0,40
    0,25,0,0,10,5,0,40
    0,25,0,15,0,0,0,40
    30,0,0,0,10,0,0,40
    0,0,20,0,10,5,1,36
    0,0,20,15,0,0,1,36
    0,25,0,0,10,0,1,36
    30,0,0,0,0,5,1,36
    0,0,20,0,10,5,0,35
    0,0,20,15,0,0,0,35
    0,25,0,0,10,0,0,35
    30,0,0,0,0,5,0,35
    0,0,0,15,10,5,1,31
    0,0,20,0,10,0,1,31
    0,25,0,0,0,5,1,31
    30,0,0,0,0,0,1,31
    0,0,0,15,10,5,0,30
    0,0,20,0,10,0,0,30
    0,25,0,0,0,5,0,30
    30,0,0,0,0,0,0,30
    0,0,0,15,10,0,1,26
    0,0,20,0,0,5,1,26
    0,25,0,0,0,0,1,26
    0,0,0,15,10,0,0,25
    0,0,20,0,0,5,0,25
    0,25,0,0,0,0,0,25
    0,0,0,15,0,5,1,21
    0,0,20,0,0,0,1,21
    0,0,0,15,0,5,0,20
    0,0,20,0,0,0,0,20
    0,0,0,0,10,5,1,16
    0,0,0,15,0,0,1,16
    0,0,0,0,10,5,0,15
    0,0,0,15,0,0,0,15
    0,0,0,0,10,0,1,11
    0,0,0,0,10,0,0,10
    0,0,0,0,0,5,1,6
    0,0,0,0,0,5,0,5
    0,0,0,0,0,0,1,1
Comparing values in different pairs of columns in Pandas
I would like to count how many times column A has the same value as B and as C. Similarly, I would like to count how many times A2 has the same value as B2 and as C2. I have this dataframe:

    ,A,B,C,A2,B2,C2
    2018-12-01,7,0,8,17,17,17
    2018-12-02,0,0,8,20,18,18
    2018-12-03,9,8,8,17,17,18
    2018-12-04,8,8,8,17,17,18
    2018-12-05,8,8,8,17,17,17
    2018-12-06,9,8,8,15,17,17
    2018-12-07,8,9,9,17,17,16
    2018-12-08,0,0,0,17,17,17
    2018-12-09,8,0,0,17,20,18
    2018-12-10,8,8,8,17,17,17
    2018-12-11,8,8,9,17,17,17
    2018-12-12,8,8,8,17,17,17
    2018-12-13,8,8,8,17,17,17
    2018-12-14,8,8,8,17,17,17
    2018-12-15,9,9,9,17,17,17
    2018-12-16,12,0,0,17,19,17
    2018-12-17,11,9,9,17,17,17
    2018-12-18,8,9,9,17,17,17
    2018-12-19,8,9,8,17,17,17
    2018-12-20,9,8,8,17,17,17
    2018-12-21,9,9,9,17,17,17
    2018-12-22,10,9,0,17,17,17
    2018-12-23,10,11,10,17,17,17
    2018-12-24,10,10,8,17,19,17
    2018-12-25,7,10,10,17,17,18
    2018-12-26,10,0,10,17,19,17
    2018-12-27,9,10,8,18,17,17
    2018-12-28,9,9,9,17,17,17
    2018-12-29,10,10,12,18,17,17
    2018-12-30,10,0,10,16,19,17
    2018-12-31,11,8,8,19,17,16

I expect the following values:

    A with B = 14
    A with C = 14
    A2 with B2 = 14
    A2 with C2 = 14

I have done this:

    ia = 0
    for i in range(0, len(dfr_h_max1)):
        if dfr_h_max1['A'][i] == dfr_h_max1['B'][i]:
            ia = ia + 1

    ib = 0
    for i in range(0, len(dfr_h_max1)):
        if dfr_h_max1['A'][i] == dfr_h_max1['C'][i]:
            ib = ib + 1

In order to take advantage of pandas, this is one possible solution:

    import numpy as np
    dfr_h_max1['que'] = np.where((dfr_h_max1['A'] == dfr_h_max1['B']), 1, 0)

After that I could sum all the elements in the new column 'que'. Another possibility could be related to some sort of boolean variable. Unfortunately, I still do not have enough knowledge about that. Are there any more efficient or elegant solutions?
The primary calculation you need here is, for example, dfr_h_max1['A'] == dfr_h_max1['B'] - as you've done in your edit. That gives you a Series of True/False values based on the equality of each pair of items in the two series. Since True evaluates to 1 and False evaluates to 0, the .sum() is the count of how many True's there were - hence, how many matches. Put that in a loop and add the required "text" for the output you want:

    mains = ('A', 'A2')                  # the main columns
    comps = (['B', 'C'], ['B2', 'C2'])   # columns to compare each main with

    for main, pair in zip(mains, comps):
        for col in pair:
            print(f'{main} with {col} = {(dfr_h_max1[main] == dfr_h_max1[col]).sum()}')
            # or without f-strings, do:
            # print(main, 'with', col, '=', (dfr_h_max1[main] == dfr_h_max1[col]).sum())

Output:

    A with B = 14
    A with C = 14
    A2 with B2 = 21
    A2 with C2 = 20

Btw, (df[main] == df[comp]).sum() using Series.sum() can also be written as sum(df[main] == df[comp]) using Python's builtin sum().

In case you have more than two "triplets" of columns (not just A & A2), change the mains and comps to this, so that it works on all triplets:

    mains = dfr_h_max1.columns[::3]          # main columns (A's), in steps of 3
    comps = zip(dfr_h_max1.columns[1::3],    # offset by 1 column (B's),
                dfr_h_max1.columns[2::3])    # offset by 2 columns (C's),
                                             # in steps of 3

(Or even using the column names / starting letter.)
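To illustrate that last remark, one hypothetical way to build mains and comps from the column names themselves, grouping by their leading letter (assuming the exact A/B/C/A2/B2/C2 naming from the question), could be:

    from collections import defaultdict

    # Group column names by their leading letter: {'A': ['A', 'A2'], 'B': ['B', 'B2'], ...}
    by_letter = defaultdict(list)
    for col in dfr_h_max1.columns:
        by_letter[col[0]].append(col)

    mains = by_letter['A']                             # ['A', 'A2']
    comps = list(zip(by_letter['B'], by_letter['C']))  # [('B', 'C'), ('B2', 'C2')]

    for main, pair in zip(mains, comps):
        for col in pair:
            print(f'{main} with {col} = {(dfr_h_max1[main] == dfr_h_max1[col]).sum()}')

This reproduces the same pairings as the hand-written tuples above.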
Pandas - Create new column from calculation over irregular string patterns
I have some data in a pandas dataframe like so:

    | Data            |
    |-----------------|
    | 10-9 8-6 100-2  |
    | 1-2 3-4         |
    | 55-45           |

Now my question is: using pandas, what is the best way to do the following? Calculate the average of the first numbers (before the hyphen) and the average of the second numbers (after the hyphen), then subtract the second average from the first and place the result in a new column. For example, for the first row, the value in the new column will be: average(10, 8, 100) - average(9, 6, 2). I am guessing I will need some sort of lambda function, but I am not sure how to go about it. Any help is appreciated. Thank you!
Make a function to contain the string parsing logic:

    import pandas as pd
    import numpy as np

    def string_handling(string):
        values = [it for it in string.strip().split(' ') if it]
        values = [v.split('-') for v in values]
        first_values = [int(v[0]) for v in values]
        second_values = [int(v[1]) for v in values]
        return pd.Series([np.mean(first_values), np.mean(second_values)])

Apply the function:

    df[['first_value', 'second_value']] = df['Data'].apply(string_handling)
    df['diff'] = df['first_value'] - df['second_value']
This might do the trick. split() will get rid of all the white space, and a list comprehension goes through all the tokens created by split() (e.g. ['10-9', '8-6', '100-2']).

    In [37]: df = DataFrame({'Data': [" 10-9 8-6 100-2 ", " 1-2 3-4 ", " 55-45 "]})

    In [38]: def process(cell):
        ...:     avg = []
        ...:     for i in range(2):
        ...:         l = [int(x.split("-")[i]) for x in cell.split()]
        ...:         avg.append(sum(l) * 1. / len(l))
        ...:     return avg[0] - avg[1]
        ...:

    In [39]: df['Data'].apply(process)
    Out[39]:
    0    33.666667
    1    -1.000000
    2    10.000000
    Name: Data, dtype: float64

Hope this helps!
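As a side note (my own addition, not part of either answer), the same calculation can be sketched without a row-level Python parser by using pandas' str.extractall; the regex and column handling below are assumptions based on the sample data:

    import pandas as pd

    df = pd.DataFrame({'Data': [" 10-9 8-6 100-2 ", " 1-2 3-4 ", " 55-45 "]})

    # Pull every "first-second" pair out of each cell, then average per original row
    pairs = (df['Data'].str.extractall(r'(\d+)-(\d+)')
                       .astype(int)
                       .groupby(level=0)
                       .mean())
    df['diff'] = pairs[0] - pairs[1]
    print(df)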
Advanced groupby column creation in pandas Dataframe
I have a catalogue of groups of galaxies in a DataFrame, 'compact', which consists mainly of a group id ('CG', int), a magnitude ('R', negative float) and a morphology ('Morph', string, for example 'S' or 'E'). I'm trying to construct a second pandas DataFrame with the following properties of the groups:

- 'Morph' of the object having the lowest 'R' in the group
- Difference between the second lowest and the lowest 'R' in the group
- Difference between the lowest 'R' in the group and the R of the group, defined as -2.5*log10(sum(10**(-0.4*R)))
- Proportions of objects having a given 'Morph' in the group (one column for 'S', one for other morphologies, for example), NOT COUNTING THE ONE HAVING THE LOWEST 'R'

I'm having trouble with the last one; could you help me write it? The other ones work but, as a secondary question, I would like to know whether I'm doing it cleanly or whether there is a better way. Here is my code (with a line for my last column which works but doesn't give exactly what I want, and an attempt in comments which doesn't work):

    GroupBy = compact.sort_values('R').groupby('CG', as_index=False)
    R2 = GroupBy.head(2).groupby('CG', as_index=False).last().R
    R1 = GroupBy.first().sort_values('CG').R
    DeltaR12 = R2 - R1
    MorphCen = GroupBy.first().sort_values('CG').Morph
    Group = GroupBy.first().sort_values('CG').CG
    RGroup = GroupBy.apply(lambda x: -2.5*np.log10((10**(-0.4*x.R)).sum()))
    DeltaR1gr = R1 - RGroup

    # Works, but counts the object with lowest R:
    PropS = GroupBy.apply(lambda x: 1.0*x.loc[x['Morph'] == 'S'].shape[0]/x.shape[0])

    # Tries to leave aside the lowest R, but doesn't work:
    # PropS = GroupBy.apply(lambda x: 1.0*x.loc[x['Morph'] == 'S' &
    #                       x['R']>x['R'].min()].shape[0]/x.shape[0])
    # PropRed = same as PropS, but for 'Morph' != 'S'

    CompactML = pd.DataFrame([Group, MorphCen, DeltaR12, DeltaR1gr]).transpose()
    CompactML.columns = ['CG', 'MorphCen', 'DeltaR12', 'DeltaR1gr']
First, it's nice if you provide actual data or create some fake data. Below I have created some fake data with 5 different integer CG groups, 2 types of morphology (S and E) and random negative numbers for 'R'. I have then redone all your aggregations in a custom function that computes each of the 4 requested aggregations in one line and sends the results back as a Series, which becomes one row per group in the resulting DataFrame.

    # create fake data
    df = pd.DataFrame({'CG': np.random.randint(0, 5, 100),
                       'Morph': np.random.choice(['S', 'E'], 100),
                       'R': np.random.rand(100) * -100})
    print(df.head())

       CG Morph          R
    0   3     E -72.377887
    1   2     E -26.126565
    2   0     E  -4.428494
    3   0     E  -2.055434
    4   4     E -93.341489

    # define custom aggregation function
    def my_agg(x):
        x = x.sort_values('R')
        morph = x.head(1)['Morph'].values[0]
        diff = x.iloc[0]['R'] - x.iloc[1]['R']
        diff2 = -2.5*np.log10(sum(10**(-0.4*x['R'])))
        prop = (x['Morph'].iloc[1:] == 'S').mean()
        return pd.Series([morph, diff, diff2, prop],
                         index=['morph', 'diff', 'diff2', 'prop'])

    # apply custom agg function
    df.groupby('CG').apply(my_agg)

       morph       diff      diff2      prop
    CG
    0      E  -1.562630 -97.676934  0.555556
    1      S  -3.228845 -98.398337  0.391304
    2      S  -6.537937 -91.092164  0.307692
    3      E  -0.023813 -99.919336  0.500000
    4      E -11.943842 -99.815734  0.705882
So, here is the final code, thanks to Ted Petrou:

    # define custom aggregation function
    def my_agg(x):
        x = x.sort_values('R')
        morph = x.head(1)['Morph'].values[0]
        diff = x.iloc[1]['R'] - x.iloc[0]['R']
        diff2 = x.iloc[0]['R'] + 2.5*np.log10(sum(10**(-0.4*x['R'])))
        prop = (x['Morph'].iloc[1:] == 'S').mean()
        return pd.Series([morph, diff, diff2, prop],
                         index=['MorphCen', 'DeltaR12', 'DeltaRGrp1', 'PropS'])

    # apply custom agg function
    compact.groupby('CG').apply(my_agg)
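The question also asked for the complementary proportion (a 'PropRed' column for 'Morph' != 'S', again not counting the lowest-R object). A minimal sketch of how the aggregation above could be extended to include it (my addition, not part of the posted answer):

    import numpy as np
    import pandas as pd

    # `compact` is assumed to be the question's catalogue DataFrame
    def my_agg(x):
        x = x.sort_values('R')
        morph = x.head(1)['Morph'].values[0]
        diff = x.iloc[1]['R'] - x.iloc[0]['R']
        diff2 = x.iloc[0]['R'] + 2.5*np.log10(sum(10**(-0.4*x['R'])))
        prop_s = (x['Morph'].iloc[1:] == 'S').mean()
        prop_red = (x['Morph'].iloc[1:] != 'S').mean()  # complement of PropS
        return pd.Series([morph, diff, diff2, prop_s, prop_red],
                         index=['MorphCen', 'DeltaR12', 'DeltaRGrp1', 'PropS', 'PropRed'])

    compact.groupby('CG').apply(my_agg)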