Pandas Dataframe: Change each value's ones-digit - python

I am writing unit tests that check two data frames for equality by converting them to dictionaries and using unittest's assertDictEqual(). The context is that I'm converting Excel functions to Python, and because the two use different rounding systems, some values are off by merely +/- 1.
I've attempted to use DF.round(-1) to round to the nearest ten, but because of the +/- 1 offset some numbers round in opposite directions (for example 15 rounds up while 14 rounds down) and the test fails. All values in the 12x20 data frame are integers.
What I'm looking for (feel free to suggest any alternate solution):
A CLEAN way to test for approximate equality of data frames or nested dictionaries
or a way to make the ones-digit of each element '0' to avoid the rounding issue
Thank you, and please let me know if any additional context is required. Due to confidentiality issues and my NDA (non-disclosure agreement), I cannot share the code but I can formulate an example if necessary

You could take the element-wise absolute difference between the two DataFrames and check that all values are below a certain tolerance (in your case 1). For example, we can create two DataFrames with values in the interval [0.0, 1.0).
import numpy as np
import pandas as pd
np.random.seed(42)
# df1 and df2 are 10x10 DataFrames with values in the interval [0.0, 1.0)
df1 = pd.DataFrame(np.random.random_sample((10,10)))
df2 = pd.DataFrame(np.random.random_sample((10,10)))
Then the following should return True:
(abs(df2-df1) < 1).all(axis=None)
And you can write an assert statement like:
assert (abs(df2 - df1) < 1).all(axis=None)
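If your pandas version is recent enough (1.1+), you could also lean on pandas' own testing helper, which accepts an absolute tolerance. A minimal sketch with made-up integer frames:
import pandas as pd
import pandas.testing as pdt
dfa = pd.DataFrame({'x': [10, 20], 'y': [30, 40]})
dfb = pd.DataFrame({'x': [11, 19], 'y': [30, 41]})
# passes as long as no element differs by more than atol
pdt.assert_frame_equal(dfa, dfb, check_exact=False, atol=1)
assert_frame_equal raises an AssertionError with a readable diff when the tolerance is exceeded, which plays nicely with unittest.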

I'm not 100 percent sure I understand what you are trying to do, but why not just divide by 10 to drop the last digit that is bothering you?
Floor division with "//" keeps only the significant part of the number. You can then multiply by ten again if you want to keep the overall number size.
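For example (made-up integer values), floor-dividing by 10 and multiplying back by 10 zeroes out the ones digit of every element:
import pandas as pd
df = pd.DataFrame({'a': [14, 15, 26], 'b': [105, 99, 31]})
print((df // 10) * 10)  # column a becomes 10, 10, 20 and column b becomes 100, 90, 30
Keep in mind that two values straddling a multiple of ten (say 19 and 20) still end up in different buckets, so the tolerance-based comparison in the first answer is more robust for the +/- 1 case.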

Related

Pandas Split Scientific Notation into two Columns - Significand and Exponent

I have a column in a data frame called MARKET_VALUE that I need to pass to a downstream system in a defined format. MARKET_VALUE, a float, needs to be passed as two integer columns (significand, with no trailing zeros and exp) as follows
MARKET_VALUE    SIGNIFICAND    EXP
6.898806e+09    6898806          3
6.898806e+05    6898806         -1
6.898806e+03    6898806         -3
I contemplated using formatted strings but am convinced there must be a smarter solution. The data frame is large, containing millions of rows, so a solution that doesn't depend on apply would be preferable.
Generate a random pandas dataframe
I use a DataFrame consisting of 1e5 rows (you could try more to test the bottleneck)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random((100000, 2))**10, columns=['random1', 'random2'])
Use the .apply method
In this case I use standard Python string formatting:
.8E sets the number of digits after the decimal point.
[:-4] removes the exponent part, keeping only the significand.
[-3:] takes just the exponent with its sign, which is then converted to an int.
# get the significand
df.random1.apply(lambda x: f'{x:.8E}'[:-4].replace('.', ''))
# get the exp
df.random1.apply(lambda x: int(f'{x:.0E}'[-3:]))
On my laptop this took less than 100 ms.
I am still thinking about a faster (vectorized) solution, but for now I hope this helps.
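One possible vectorized direction is sketched below. It assumes strictly positive values and a fixed number of significant digits (7, matching the MARKET_VALUE examples in the question); the helper name is made up, and edge cases very close to exact powers of ten would need extra care because of log10 rounding.
import numpy as np
import pandas as pd

def split_significand_exp(values, sig_digits=7):
    # hypothetical helper, not part of pandas: round to sig_digits significant
    # digits, then strip trailing zeros while adjusting the exponent
    v = np.asarray(values, dtype=float)
    exp10 = np.floor(np.log10(v)).astype(int)             # decimal exponent
    mantissa = v / 10.0 ** exp10                           # roughly in [1, 10)
    significand = np.rint(mantissa * 10 ** (sig_digits - 1)).astype(np.int64)
    exp = exp10 - (sig_digits - 1)
    while np.any(significand % 10 == 0):                   # drop trailing zeros
        mask = significand % 10 == 0
        significand = np.where(mask, significand // 10, significand)
        exp = np.where(mask, exp + 1, exp)
    return significand, exp

mv = pd.Series([6.898806e+09, 6.898806e+05, 6.898806e+03])
sig, exp = split_significand_exp(mv)
print(sig)  # [6898806 6898806 6898806]
print(exp)  # [ 3 -1 -3]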

Trouble subtracting two column values correctly/precisely in pandas dataframe in Python

I'm trying to create a new column in my pandas dataframe which will be the difference of two other columns, but the new column has values that are significantly different from the differences between the values of those columns. I have heard that 'float' values often don't subtract precisely, so I tried converting the decimal values to integers by changing the columns' dtypes to 'int64' (as suggested here: Pandas Subtract Two Columns Not Working Correctly) and then multiplying each value by 100000:
# Read in data
data = pd.read_csv('/Users/aaron/Downloads/treatment_vs_control.csv')
# Cleaning and preprocessing
data.rename({"names": "Gene"}, axis=1, inplace=True)
columns = data.columns
garbage_filter = columns.str.startswith('names') | columns.str.startswith('0') | columns.str.startswith('1') | \
    columns.str.startswith('Gene')
data = data.loc[:,garbage_filter]
scores_filter = columns.str.endswith('scores')
columns = data.columns
scores_filter = columns.str.endswith('scores')
data = data.iloc[:,~scores_filter]
## To create Diff columns correctly, change logFC columns to integer dtype
data = data.astype({'1_logfoldchanges': 'int64', '0_logfoldchanges': 'int64'})
data['1_logfoldchanges'] = data['1_logfoldchanges'] * 100000
data['0_logfoldchanges'] = data['0_logfoldchanges'] * 100000
data["diff_logfoldchanges0"] = data['0_logfoldchanges'] - data['1_logfoldchanges']
data["diff_logfoldchanges1"] = data['1_logfoldchanges'] - data['0_logfoldchanges']
data['1_logfoldchanges'] = data['1_logfoldchanges'] / 100000
data['0_logfoldchanges'] = data['0_logfoldchanges'] / 100000
data['diff_logfoldchanges0'] = data['diff_logfoldchanges0'] / 100000
data['diff_logfoldchanges1'] = data['diff_logfoldchanges1'] / 100000
data = data.astype({'1_logfoldchanges': 'float64', '0_logfoldchanges': 'float64'})
data.sort_values('diff_logfoldchanges0', ascending=False, inplace=True)
The values in the new column still do not equal the differences between the two original columns, and I haven't been able to find any questions on this site or others that helped me resolve this. Could someone point out how I can fix this? I would be extremely grateful for any help.
For reference, here is a snapshot of my data with the incorrect difference-column values:
EDIT: Here is a bit of my CSV data too:
names,0_scores,0_logfoldchanges,0_pvals,0_pvals_adj,1_scores,1_logfoldchanges,1_pvals,1_pvals_adj,2_scores,2_logfoldchanges,2_pvals,2_pvals_adj,3_scores,3_logfoldchanges,3_pvals,3_pvals_adj,4_scores,4_logfoldchanges,4_pvals,4_pvals_adj,5_scores,5_logfoldchanges,5_pvals,5_pvals_adj,6_scores,6_logfoldchanges,6_pvals,6_pvals_adj,7_scores,7_logfoldchanges,7_pvals,7_pvals_adj,8_scores,8_logfoldchanges,8_pvals,8_pvals_adj
0610005C13Rik,-0.06806567,-1.3434665,0.9457333570044608,0.9996994148075796,-0.06571575,-2.952315,0.9476041278614572,0.9998906553041256,0.17985639,1.9209933,0.8572653106998014,0.9994124851941415,-0.0023527155,0.85980946,0.9981228063933416,0.9993920957240323,0.0021153346,0.08053488,0.9983122084427253,0.9993417421686092,0.07239167,2.6473796,0.9422902189641795,0.9998255096296015,-0.029918168,-18.44805,0.9761323166853361,0.998901292435457,-0.021452557,-18.417543,0.9828846479876278,0.9994515175269552,-0.011279659,-18.393742,0.9910003250967939,0.9994694916208285
0610006L08Rik,-0.015597747,-15.159286,0.9875553033428832,0.9996994148075796,-0.015243248,-15.13933,0.9878381189626457,0.9998906553041256,-0.008116434,-14.795435,0.9935240935555751,0.9994124851941415,-0.0073064035,-14.765995,0.9941703851753109,0.9993920957240323,-0.0068988753,-14.752146,0.9944955375479378,0.9993417421686092,0.100005075,18.888618,0.9203402935001026,0.9998255096296015,-0.004986361,-14.696446,0.9960214758176429,0.998901292435457,-0.0035754263,-14.665947,0.9971472286106732,0.9994515175269552,-0.0018799432,-14.64215,0.9985000232597367,0.9994694916208285
0610009B22Rik,0.7292792,-0.015067068,0.46583086269639506,0.9070814087688549,0.42489842,0.18173021,0.67091072915639,0.9998906553041256,17.370018,1.0877438,1.3918130408174961e-67,6.801929840389262e-67,-6.5684495,-1.237505,5.084194721546539e-11,3.930798968247645e-10,-5.6669636,-0.42557448,1.4535041077956595e-08,5.6533712043729706e-08,-3.5668032,-0.5939982,0.0003613625764821466,0.001766427013499565,-7.15373,-1.7427195,8.445118618740649e-13,4.689924532441606e-12,-2.6011736,-0.66274893,0.009290541915973735,0.05767076032846401,1.7334439,1.2316034,0.08301681426158236,0.3860271115408991
Ideally, I'd like to create a 'diff_logfoldchanges0' column that is equal to the values from the '0_logfoldchanges' column minus the values from the '1_logfoldchanges' column. In the CSV data below, I believe that might be "-1.3434665 - -2.952315", "-15.159286 - -15.13933", and "-0.015067068 - 0.18173021".
pd.read_csv by default uses a fast but less precise way of reading floating point numbers
import pandas as pd
import io
csv = """names,0_scores,0_logfoldchanges,0_pvals,0_pvals_adj,1_scores,1_logfoldchanges,1_pvals,1_pvals_adj,2_scores,2_logfoldchanges,2_pvals,2_pvals_adj,3_scores,3_logfoldchanges,3_pvals,3_pvals_adj,4_scores,4_logfoldchanges,4_pvals,4_pvals_adj,5_scores,5_logfoldchanges,5_pvals,5_pvals_adj,6_scores,6_logfoldchanges,6_pvals,6_pvals_adj,7_scores,7_logfoldchanges,7_pvals,7_pvals_adj,8_scores,8_logfoldchanges,8_pvals,8_pvals_adj
0610005C13Rik,-0.06806567,-1.3434665,0.9457333570044608,0.9996994148075796,-0.06571575,-2.952315,0.9476041278614572,0.9998906553041256,0.17985639,1.9209933,0.8572653106998014,0.9994124851941415,-0.0023527155,0.85980946,0.9981228063933416,0.9993920957240323,0.0021153346,0.08053488,0.9983122084427253,0.9993417421686092,0.07239167,2.6473796,0.9422902189641795,0.9998255096296015,-0.029918168,-18.44805,0.9761323166853361,0.998901292435457,-0.021452557,-18.417543,0.9828846479876278,0.9994515175269552,-0.011279659,-18.393742,0.9910003250967939,0.9994694916208285
0610006L08Rik,-0.015597747,-15.159286,0.9875553033428832,0.9996994148075796,-0.015243248,-15.13933,0.9878381189626457,0.9998906553041256,-0.008116434,-14.795435,0.9935240935555751,0.9994124851941415,-0.0073064035,-14.765995,0.9941703851753109,0.9993920957240323,-0.0068988753,-14.752146,0.9944955375479378,0.9993417421686092,0.100005075,18.888618,0.9203402935001026,0.9998255096296015,-0.004986361,-14.696446,0.9960214758176429,0.998901292435457,-0.0035754263,-14.665947,0.9971472286106732,0.9994515175269552,-0.0018799432,-14.64215,0.9985000232597367,0.9994694916208285
0610009B22Rik,0.7292792,-0.015067068,0.46583086269639506,0.9070814087688549,0.42489842,0.18173021,0.67091072915639,0.9998906553041256,17.370018,1.0877438,1.3918130408174961e-67,6.801929840389262e-67,-6.5684495,-1.237505,5.084194721546539e-11,3.930798968247645e-10,-5.6669636,-0.42557448,1.4535041077956595e-08,5.6533712043729706e-08,-3.5668032,-0.5939982,0.0003613625764821466,0.001766427013499565,-7.15373,-1.7427195,8.445118618740649e-13,4.689924532441606e-12,-2.6011736,-0.66274893,0.009290541915973735,0.05767076032846401,1.7334439,1.2316034,0.08301681426158236,0.3860271115408991"""
data = pd.read_csv(io.StringIO(csv))
print(data["0_logfoldchanges"][0]) # -1.3434665000000001 instead of -1.3434665
This difference is tiny (less than a quadrillionth of the original value) and usually not visible because the display rounds it, so in most contexts I would not call it 'significant' (it is likely insignificant relative to the precision/accuracy of the input data). It does, however, show up if you check the calculation by manually typing the same numbers into the Python interpreter.
To read the values more precisely, use float_precision="round_trip":
data = pd.read_csv(io.StringIO(csv), float_precision="round_trip")
Subtracting now produces the expected values (the same as doing a conventional python subtraction):
difference = data["0_logfoldchanges"] - data["1_logfoldchanges"]
print(difference[0] == -1.3434665 - -2.952315) # checking first row - True
This is not due to floating point being imprecise as such, but is specific to the way pandas reads CSV files. This is a good guide to floating point rounding. In general, converting to integers will not help, except sometimes when dealing with money or other quantities that have a precise decimal representation.
Did you try the sub method of pandas? I have done these arithmetic operations on float values many times without any issues.
Please try https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sub.html
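For completeness, a minimal sketch of that suggestion using the column names from the question. Note that Series.sub performs the same floating-point arithmetic as the - operator, so on its own it will not change the result; the float_precision="round_trip" fix above is what addresses the discrepancy.
data["diff_logfoldchanges0"] = data["0_logfoldchanges"].sub(data["1_logfoldchanges"])
data["diff_logfoldchanges1"] = data["1_logfoldchanges"].sub(data["0_logfoldchanges"])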

Calculating intermittent average

I have a huge dataframe with a lot of zero values, and I want to calculate the average of the numbers between the zero values. To keep it simple, the data contains, for example, 10 consecutive non-zero values, then a run of zeros, then values again. I just want to tell Python to calculate the average of each such patch of the data.
The pic shows an example
First of all, I'm a little confused about why you are using a DataFrame; data like this would more likely be stored in a pd.Series, and I would suggest keeping numeric data in a NumPy array. Assuming you have a pd.Series in front of you and are trying to calculate the moving average between two consecutive points, there are two approaches you can follow:
zero-padding for the last value, or
assuming circularity and taking the average of the last and the first value.
Here is the corresponding code:
import numpy as np
import pandas as pd
data_series = pd.Series([0,0,0.76231, 0.77669,0,0,0,0,0,0,0,0,0.66772, 1.37964, 2.11833, 2.29178, 0,0,0,0,0])
np_array = np.array(data_series)
# assuming zero padding
np_array_zero_pad = np.hstack((np_array, 0))
mvavrg_zeropad = [np.mean([np_array_zero_pad[i], np_array_zero_pad[i+1]]) for i in range(len(np_array_zero_pad)-1)]
# assuming circularity: append the first value so the last point is averaged with the first
np_array_circ_arr = np.hstack((np_array, np_array[0]))
np_array_circ_arr = [np.mean([np_array_circ_arr[i], np_array_circ_arr[i+1]]) for i in range(len(np_array_circ_arr)-1)]
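If what is needed is the average of each non-zero patch itself, as the question describes, rather than a moving average, one possible sketch is to label consecutive non-zero runs and group by that label (variable names here are made up, reusing the example series above):
import pandas as pd
s = pd.Series([0, 0, 0.76231, 0.77669, 0, 0, 0, 0, 0, 0, 0, 0,
               0.66772, 1.37964, 2.11833, 2.29178, 0, 0, 0, 0, 0])
nonzero = s != 0
patch_id = (nonzero != nonzero.shift()).cumsum()    # new id at every zero/non-zero boundary
patch_means = s[nonzero].groupby(patch_id[nonzero]).mean()
print(patch_means)   # one mean per non-zero patch: approx. 0.76950 and 1.61437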

Efficiently perform cheap calculations on many (1e6-1e10) combinations of rows in a pandas dataframe in python

I need to perform some simple calculations on a large number of combinations of rows or columns for a pandas dataframe. I need to figure out how to do so most efficiently because the number of combinations might go up above a billion.
The basic approach is easy--just performing means, comparison operators, and sums on subselections of a dataframe. But the only way I've figured out involves doing a loop over the combinations, which isn't very pythonic and isn't super efficient. Since efficiency will matter as the number of samples goes up I'm hoping there might be some smarter way to do this.
Right now I am building the list of combinations and then selecting those rows and doing the calculations using built-in pandas tools (see pseudo-code below). One possibility is to parallelize this, which should be pretty easy. However, I wonder if I'm missing a deeper way to do this more efficiently.
A few thoughts, ordered from big to small:
Is there some smart pandas/Python or even some smart linear-algebra way to do this? I haven't figured one out, but I want to check.
Is the best approach to stick with pandas? Or convert to a numpy array and just do everything using numeric indices there, and then convert back to easier-to-understand data-frames?
Is the built-in mean() the best approach, or should I use some kind of apply()?
Is it faster to select rows or columns in any way? The matrix is symmetric so it's easy to grab either.
I'm currently selecting 18 rows, because each of the 6 rows actually has three entries with slightly different parameters; I could combine those into individual rows beforehand if selecting 6 rows rather than 18 is faster for some reason.
Here's a rough sketch of what I'm doing:
from itertools import combinations
import pandas as pd

df = from_excel()  # Test case is 30 rows & cols
df = df.set_index('Col1')  # Column and row 1 are names, rest are the actual matrix values
allSets = combinations(df.columns, 6)
temp = []
for s in allSets:
    avg1 = df.loc[list(s)].mean().mean()
    cnt1 = df.loc[list(s)].gt(0).sum().sum()
    temp.append([s, avg1, cnt1])
dfOut = pd.DataFrame(temp, columns=['Set', 'Average', 'Count'])
A few general considerations that should help:
Not that I know of, though the best place to ask is Mathematics or Math Professionals, and it is worth a try. There may be a better way to frame the question if you are doing something very specific with the results, such as looking for a minimum/maximum.
In general, you are right that pandas, as a layer on top of NumPy, is probably not speeding things up. However, most of the heavy lifting is done at the NumPy level, and until you are sure pandas is to blame, keep using it.
mean is better than your own function applied across rows or columns, because it uses the C implementation of mean in NumPy under the hood, which is always going to be faster than pure Python.
Given that pandas organizes data column-wise (i.e. each column is a contiguous NumPy array), it is better to go row-wise.
It would be great to see an example of data here.
Now, some comments on the code:
use iloc and numeric indices instead of loc; it is way faster
it is unnecessary to turn the tuple into a list here: df.loc[list(s)].gt(0).sum().sum()
just use: df.loc[s].gt(0).sum().sum()
rather than a for loop that appends elements to a temporary list (which is slow and unnecessary, because you are creating a pandas DataFrame either way), use a generator. Also, use tuples instead of lists wherever possible for maximum speed:
def gen_fun():
    allSets = combinations(df.columns, 6)
    for s in allSets:
        avg1 = df.loc[list(s)].mean().mean()
        cnt1 = df.loc[list(s)].gt(0).sum().sum()
        yield (s, avg1, cnt1)

dfOut = pd.DataFrame(gen_fun(), columns=['Set', 'Average', 'Count'])
Another thing is that you can preprocess the DataFrame so the positive-value check is computed only once, avoiding the gt(0) operation in each loop iteration; see the sketch below.
This way you spare both memory and CPU time.
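A minimal sketch of that preprocessing idea, assuming the df and the combinations import from the question above; the positive mask is computed a single time instead of inside every iteration:
positive = df.gt(0)   # boolean mask, computed once

def gen_fun():
    for s in combinations(df.columns, 6):
        block = df.loc[list(s)]
        yield (s, block.mean().mean(), positive.loc[list(s)].sum().sum())

dfOut = pd.DataFrame(gen_fun(), columns=['Set', 'Average', 'Count'])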

Non-zero mean when calculating a Z-score in Python / Pandas

I am attempting to calculate z-scores at once for a series of columns, but inspecting the data reveals that the mean values of the columns are NOT 0, as you would expect after a z-score transformation.
As you can see by running the code below, column a and column d do not have 0 means in the newly created *_zscore columns.
import pandas as pd

df = pd.DataFrame({'a': [500, 4000, 20], 'b': [10, 20, 30], 'c': [30, 40, 50], 'd': [50, 400, 20]})
cols = list(df.columns)
for col in cols:
    col_zscore = col + '_zscore'
    df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
print(df.describe())
My actual data is obviously different, but the results are similar (i.e.: non-zero means). I have also used
from scipy import stats
stats.zscore(df)
which leads to a similar result. Doing the same transformation in R (i.e.: scaled.df <- scale(df)) works though.
Does anyone have an idea what is going on here? The columns with the error contain larger values, but it should still be possible to z-transform them.
EDIT: as Rob pointed out, the results are essentially 0.
Your mean values are of the order 10^-17, which for all practical purposes is equal to zero. The reason why you do not get exactly zero has to do with the way floating point numbers are represented (finite precision).
I'm surprised that you don't see it in R, but that may have to do with the example you used and the fact that scale is implemented a bit differently in R (e.g. ddof=1). In R you can see the same thing happening:
> mean(scale(c(5000,40000,2000)))
[1] 7.401487e-17
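The same check can be done in Python on the df built by the code in the question; a small sketch showing that the means of the *_zscore columns are zero up to floating-point noise:
import numpy as np
zscore_cols = [c for c in df.columns if c.endswith('_zscore')]
print(df[zscore_cols].mean())                   # values on the order of 1e-17
print(np.allclose(df[zscore_cols].mean(), 0))   # True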
