I'm trying to create a new column in my pandas dataframe that is the difference of two other columns, but the new column has values that are significantly different from what the differences between the values of those columns actually are. I have heard that 'float' values often don't subtract precisely, so I tried to convert the decimal values to integers by changing the columns' dtypes to 'int64' (as suggested here: Pandas Subtract Two Columns Not Working Correctly) and then multiplying each value by 100000:
import pandas as pd

# Read in data
data = pd.read_csv('/Users/aaron/Downloads/treatment_vs_control.csv')
# Cleaning and preprocessing
data.rename({"names": "Gene"}, axis=1, inplace=True)
columns = data.columns
garbage_filter = columns.str.startswith('names') | columns.str.startswith('0') | columns.str.startswith('1') | \
columns.str.startswith('Gene')
data = data.loc[:,garbage_filter]
scores_filter = columns.str.endswith('scores')
columns = data.columns
scores_filter = columns.str.endswith('scores')
data = data.iloc[:,~scores_filter]
## To create Diff columns correctly, change logFC columns to integer dtype
data = data.astype({'1_logfoldchanges': 'int64', '0_logfoldchanges': 'int64'})
data['1_logfoldchanges'] = data['1_logfoldchanges'] * 100000
data['0_logfoldchanges'] = data['0_logfoldchanges'] * 100000
data["diff_logfoldchanges0"] = data['0_logfoldchanges'] - data['1_logfoldchanges']
data["diff_logfoldchanges1"] = data['1_logfoldchanges'] - data['0_logfoldchanges']
data['1_logfoldchanges'] = data['1_logfoldchanges'] / 100000
data['0_logfoldchanges'] = data['0_logfoldchanges'] / 100000
data['diff_logfoldchanges0'] = data['diff_logfoldchanges0'] / 100000
data['diff_logfoldchanges1'] = data['diff_logfoldchanges1'] / 100000
data = data.astype({'1_logfoldchanges': 'float64', '0_logfoldchanges': 'float64'})
data.sort_values('diff_logfoldchanges0', ascending=False, inplace=True)
The values in the new column still do not equal the difference between the two original columns, and I haven't been able to find any questions on this site or elsewhere that help me resolve this. Could someone point out how to fix it? I would be extremely grateful for any help.
For reference, here is a snapshot of my data with the incorrect difference-column values:
EDIT: Here is a bit of my CSV data too:
names,0_scores,0_logfoldchanges,0_pvals,0_pvals_adj,1_scores,1_logfoldchanges,1_pvals,1_pvals_adj,2_scores,2_logfoldchanges,2_pvals,2_pvals_adj,3_scores,3_logfoldchanges,3_pvals,3_pvals_adj,4_scores,4_logfoldchanges,4_pvals,4_pvals_adj,5_scores,5_logfoldchanges,5_pvals,5_pvals_adj,6_scores,6_logfoldchanges,6_pvals,6_pvals_adj,7_scores,7_logfoldchanges,7_pvals,7_pvals_adj,8_scores,8_logfoldchanges,8_pvals,8_pvals_adj
0610005C13Rik,-0.06806567,-1.3434665,0.9457333570044608,0.9996994148075796,-0.06571575,-2.952315,0.9476041278614572,0.9998906553041256,0.17985639,1.9209933,0.8572653106998014,0.9994124851941415,-0.0023527155,0.85980946,0.9981228063933416,0.9993920957240323,0.0021153346,0.08053488,0.9983122084427253,0.9993417421686092,0.07239167,2.6473796,0.9422902189641795,0.9998255096296015,-0.029918168,-18.44805,0.9761323166853361,0.998901292435457,-0.021452557,-18.417543,0.9828846479876278,0.9994515175269552,-0.011279659,-18.393742,0.9910003250967939,0.9994694916208285
0610006L08Rik,-0.015597747,-15.159286,0.9875553033428832,0.9996994148075796,-0.015243248,-15.13933,0.9878381189626457,0.9998906553041256,-0.008116434,-14.795435,0.9935240935555751,0.9994124851941415,-0.0073064035,-14.765995,0.9941703851753109,0.9993920957240323,-0.0068988753,-14.752146,0.9944955375479378,0.9993417421686092,0.100005075,18.888618,0.9203402935001026,0.9998255096296015,-0.004986361,-14.696446,0.9960214758176429,0.998901292435457,-0.0035754263,-14.665947,0.9971472286106732,0.9994515175269552,-0.0018799432,-14.64215,0.9985000232597367,0.9994694916208285
0610009B22Rik,0.7292792,-0.015067068,0.46583086269639506,0.9070814087688549,0.42489842,0.18173021,0.67091072915639,0.9998906553041256,17.370018,1.0877438,1.3918130408174961e-67,6.801929840389262e-67,-6.5684495,-1.237505,5.084194721546539e-11,3.930798968247645e-10,-5.6669636,-0.42557448,1.4535041077956595e-08,5.6533712043729706e-08,-3.5668032,-0.5939982,0.0003613625764821466,0.001766427013499565,-7.15373,-1.7427195,8.445118618740649e-13,4.689924532441606e-12,-2.6011736,-0.66274893,0.009290541915973735,0.05767076032846401,1.7334439,1.2316034,0.08301681426158236,0.3860271115408991
Ideally, I'd like to create a 'diff_logfoldchanges0' column that is equal to the values from the '0_logfoldchanges' column minus the values from the '1_logfoldchanges' column. In the CSV data above, I believe that would be "-1.3434665 - -2.952315", "-15.159286 - -15.13933", and "-0.015067068 - 0.18173021".
By default, pd.read_csv uses a fast but less precise method of parsing floating point numbers:
import pandas as pd
import io
csv = """names,0_scores,0_logfoldchanges,0_pvals,0_pvals_adj,1_scores,1_logfoldchanges,1_pvals,1_pvals_adj,2_scores,2_logfoldchanges,2_pvals,2_pvals_adj,3_scores,3_logfoldchanges,3_pvals,3_pvals_adj,4_scores,4_logfoldchanges,4_pvals,4_pvals_adj,5_scores,5_logfoldchanges,5_pvals,5_pvals_adj,6_scores,6_logfoldchanges,6_pvals,6_pvals_adj,7_scores,7_logfoldchanges,7_pvals,7_pvals_adj,8_scores,8_logfoldchanges,8_pvals,8_pvals_adj
0610005C13Rik,-0.06806567,-1.3434665,0.9457333570044608,0.9996994148075796,-0.06571575,-2.952315,0.9476041278614572,0.9998906553041256,0.17985639,1.9209933,0.8572653106998014,0.9994124851941415,-0.0023527155,0.85980946,0.9981228063933416,0.9993920957240323,0.0021153346,0.08053488,0.9983122084427253,0.9993417421686092,0.07239167,2.6473796,0.9422902189641795,0.9998255096296015,-0.029918168,-18.44805,0.9761323166853361,0.998901292435457,-0.021452557,-18.417543,0.9828846479876278,0.9994515175269552,-0.011279659,-18.393742,0.9910003250967939,0.9994694916208285
0610006L08Rik,-0.015597747,-15.159286,0.9875553033428832,0.9996994148075796,-0.015243248,-15.13933,0.9878381189626457,0.9998906553041256,-0.008116434,-14.795435,0.9935240935555751,0.9994124851941415,-0.0073064035,-14.765995,0.9941703851753109,0.9993920957240323,-0.0068988753,-14.752146,0.9944955375479378,0.9993417421686092,0.100005075,18.888618,0.9203402935001026,0.9998255096296015,-0.004986361,-14.696446,0.9960214758176429,0.998901292435457,-0.0035754263,-14.665947,0.9971472286106732,0.9994515175269552,-0.0018799432,-14.64215,0.9985000232597367,0.9994694916208285
0610009B22Rik,0.7292792,-0.015067068,0.46583086269639506,0.9070814087688549,0.42489842,0.18173021,0.67091072915639,0.9998906553041256,17.370018,1.0877438,1.3918130408174961e-67,6.801929840389262e-67,-6.5684495,-1.237505,5.084194721546539e-11,3.930798968247645e-10,-5.6669636,-0.42557448,1.4535041077956595e-08,5.6533712043729706e-08,-3.5668032,-0.5939982,0.0003613625764821466,0.001766427013499565,-7.15373,-1.7427195,8.445118618740649e-13,4.689924532441606e-12,-2.6011736,-0.66274893,0.009290541915973735,0.05767076032846401,1.7334439,1.2316034,0.08301681426158236,0.3860271115408991"""
data = pd.read_csv(io.StringIO(csv))
print(data["0_logfoldchanges"][0]) # -1.3434665000000001 instead of -1.3434665
This difference is tiny (less than a quadrillionth of the original value) and usually not visible because the display rounds it, so in most contexts I would not call it 'significant' (it is likely to be insignificant relative to the precision and accuracy of the input data). It does, however, show up if you check your calculation by typing the same numbers into the Python interpreter by hand.
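For instance, comparing the parsed value against the literal written in the CSV makes the discrepancy visible (reusing the data frame read above):

print(data["0_logfoldchanges"][0] == -1.3434665)   # False with the default parser
print(f"{data['0_logfoldchanges'][0]:.17g}")       # shows the extra digits, e.g. -1.3434665000000001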
To read the values more precisely, use float_precision="round_trip":
data = pd.read_csv(io.StringIO(csv), float_precision="round_trip")
Subtracting now produces the expected values (the same as doing a conventional python subtraction):
difference = data["0_logfoldchanges"] - data["1_logfoldchanges"]
print(difference[0] == -1.3434665 - -2.952315) # checking first row - True
This is not due to floating point arithmetic being imprecise as such, but is specific to the way pandas parses CSV files (a general guide to floating point rounding is still worth reading for background). In general, converting to integers will not help, except sometimes when dealing with money or other quantities that have an exact decimal representation.
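As a small illustration of that exception (hypothetical prices, not the question's data), quantities with an exact decimal meaning can be kept as scaled integers:

price_a_cents = 1999                          # $19.99 stored as integer cents
price_b_cents = 550                           # $5.50
diff_cents = price_a_cents - price_b_cents    # exact integer arithmetic
print(diff_cents / 100)                       # 14.49, converted back only for display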
Did you try the sub method of pandas? I have done these arithmetic operations on float values many times without any issues.
Please try https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sub.html
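For reference, a minimal usage sketch with the column names from the question (assuming data has already been read with pd.read_csv) might be:

# hypothetical: data is the DataFrame read from the question's CSV
data["diff_logfoldchanges0"] = data["0_logfoldchanges"].sub(data["1_logfoldchanges"])
data["diff_logfoldchanges1"] = data["1_logfoldchanges"].sub(data["0_logfoldchanges"])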
I need to perform some simple calculations on a large number of combinations of rows or columns for a pandas dataframe. I need to figure out how to do so most efficiently because the number of combinations might go up above a billion.
The basic approach is easy--just performing means, comparison operators, and sums on subselections of a dataframe. But the only way I've figured out involves doing a loop over the combinations, which isn't very pythonic and isn't super efficient. Since efficiency will matter as the number of samples goes up I'm hoping there might be some smarter way to do this.
Right now I am building the list of combinations and then selecting those rows and doing the calculations using built-in pandas tools (see pseudo-code below). One possibility is to parallelize this, which should be pretty easy. However, I wonder if I'm missing a deeper way to do this more efficiently.
A few thoughts, ordered from big to small:
1. Is there some smart pandas/python or even some smart linear algebra way to do this? I haven't figured one out, but want to check.
2. Is the best approach to stick with pandas? Or convert to a numpy array and just do everything using numeric indices there, and then convert back to easier-to-understand dataframes?
3. Is the built-in mean() the best approach, or should I use some kind of apply()?
4. Is it faster to select rows or columns in any way? The matrix is symmetric, so it's easy to grab either.
5. I'm currently actually selecting 18 rows because each of the 6 rows actually has three entries with slightly different parameters--I could combine those into individual rows beforehand if it's faster to select 6 rows than 18 for some reason.
Here's a rough sketch of what I'm doing:
import pandas as pd
from itertools import combinations

df = from_excel()  # placeholder reader; test case is 30 rows & cols
df = df.set_index('Col1')  # column and row 1 are names, the rest are the actual matrix values

allSets = combinations(df.columns, 6)
temp = []
for s in allSets:
    avg1 = df.loc[list(s)].mean().mean()
    cnt1 = df.loc[list(s)].gt(0).sum().sum()
    temp.append([s, avg1, cnt1])

dfOut = pd.DataFrame(temp, columns=['Set', 'Average', 'Count'])
A few general considerations that should help:
1. Not that I know of, though the best place to ask is on Mathematics or Math Professionals, and it is worth a try. There may be a better way to frame the question if you are doing something very specific with the results - looking for a minimum/maximum, etc.
2. In general you are right that pandas, as a layer on top of NumPy, is probably not speeding things up. However, most of the heavy lifting is done at the NumPy level anyway, so until you are sure pandas is to blame, keep using it.
3. mean is better than your own function applied across rows or columns, because it uses the C implementation of mean in NumPy under the hood, which is always going to be faster than pure Python.
4. Given that pandas organizes data in a columnar fashion (i.e. each column is a contiguous NumPy array), selecting columns should be cheaper than selecting rows, so go column-wise where possible (a small timing sketch for checking this on your own data follows after these points).
5. It would be great to see an example of the data here.
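To check point 4 on real data, a small timing sketch along these lines could be used (the 30x30 frame here is a hypothetical stand-in, not the asker's actual matrix):

import time
from itertools import combinations, islice

import numpy as np
import pandas as pd

# Hypothetical symmetric 30x30 matrix with named rows/columns
rng = np.random.default_rng(0)
vals = rng.normal(size=(30, 30))
names = [f"c{i}" for i in range(30)]
df = pd.DataFrame((vals + vals.T) / 2, index=names, columns=names)

# Time a few thousand selections each way
sets = list(islice(combinations(names, 6), 5000))

t0 = time.perf_counter()
for s in sets:
    df.loc[list(s)].mean().mean()   # select 6 rows by label
t1 = time.perf_counter()
for s in sets:
    df[list(s)].mean().mean()       # select 6 columns by label (same values, since the matrix is symmetric)
t2 = time.perf_counter()

print(f"row selection:    {t1 - t0:.2f} s")
print(f"column selection: {t2 - t1:.2f} s")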
Now, some comments on the code:
use iloc and numeric indices instead of loc - it is way faster
the list(s) conversion in df.loc[list(s)].gt(0).sum().sum() looks redundant, but .loc actually needs it: a bare tuple is interpreted as a multi-axis indexer rather than as a list of row labels, so keep the list() call (it is cheap), or pass a list of integer positions to iloc
rather than a for loop that appends elements to a temporary list, use a generator (the temporary list is slow and unnecessary, because you are building a pandas DataFrame from it either way). Also, use tuples instead of lists wherever possible for maximum speed:
def gen_fun():
    allSets = combinations(df.columns, 6)
    for s in allSets:
        avg1 = df.loc[list(s)].mean().mean()
        cnt1 = df.loc[list(s)].gt(0).sum().sum()
        yield (s, avg1, cnt1)

dfOut = pd.DataFrame(gen_fun(), columns=['Set', 'Average', 'Count'])
Another thing: you can precompute the gt(0) comparison once for the whole dataframe and reuse it inside the loop, instead of recomputing it for every combination. This spares both memory and CPU time; see the sketch below.
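Putting these suggestions together, one possible sketch (using a small hypothetical frame, since the real data is not shown) could look like this:

import numpy as np
import pandas as pd
from itertools import combinations

# Hypothetical symmetric matrix standing in for the real data
# (kept small at 12x12 so the full loop over combinations finishes quickly)
rng = np.random.default_rng(0)
vals = rng.normal(size=(12, 12))
names = [f"c{i}" for i in range(12)]
df = pd.DataFrame((vals + vals.T) / 2, index=names, columns=names)

pos = df.gt(0)                                   # compute the gt(0) mask once, outside the loop

def gen_fun():
    for s in combinations(range(len(df.columns)), 6):
        idx = list(s)                            # integer positions for iloc
        avg1 = df.iloc[idx].mean().mean()        # positional selection instead of label lookups
        cnt1 = pos.iloc[idx].sum().sum()         # reuse the precomputed mask
        yield (tuple(df.columns[idx]), avg1, cnt1)

dfOut = pd.DataFrame(gen_fun(), columns=['Set', 'Average', 'Count'])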