I have a column in a data frame called MARKET_VALUE that I need to pass to a downstream system in a defined format. MARKET_VALUE, a float, needs to be passed as two integer columns (significand, with no trailing zeros and exp) as follows
MARKET_VALUE    SIGNIFICAND    EXP
6.898806e+09    6898806        3
6.898806e+05    6898806        -1
6.898806e+03    6898806        -3
I contemplated using formatted strings but am convinced there must be a smarter solution. The data frame is large, containing millions of rows, so a solution that doesn't depend on apply would be preferable.
Generate a random pandas dataframe
I use a DataFrame consisting of 1e5 rows (you could try more rows to find the bottleneck).
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.random((100000,2))**10, columns=['random1', 'random2'])
Use .apply method
In this case I use the standard python formatting.
.8E formats the value in scientific notation with eight digits after the decimal point.
[:-4] removes the exponent part (for example E+09) and keeps only the significand.
[-3:] keeps only the exponent with its sign, which is then converted to an int.
# get the significand
df.random1.apply(lambda x: f'{x:.8E}'[:-4].replace('.', ''))
# get the exp (same .8E format as above, so rounding cannot shift the exponent)
df.random1.apply(lambda x: int(f'{x:.8E}'[-3:]))
On my laptop it took less than 100ms.
I am still thinking about a faster, vectorized solution, but I hope this helps for now.
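A vectorized alternative could look like the sketch below. It is not part of the answer above: the helper name split_significand_exp and the digits=7 default (taken from the seven significant digits in the question's examples) are assumptions. It scales each value to a fixed number of significant digits with NumPy, then strips trailing zeros in at most `digits` whole-array passes, so that value == significand * 10**exp.

```python
import numpy as np
import pandas as pd


def split_significand_exp(values, digits=7):
    """Hypothetical helper: return (significand, exp) integer arrays."""
    x = np.asarray(values, dtype=float)
    sig = np.zeros(x.shape, dtype=np.int64)
    exp = np.zeros(x.shape, dtype=np.int64)

    nz = x != 0  # zeros stay (0, 0); log10(0) is undefined
    # Decimal exponent of the leading digit of each nonzero value.
    e10 = np.floor(np.log10(np.abs(x[nz]))).astype(np.int64)
    # Power of ten that leaves `digits` significant digits as an integer.
    shift = e10 - (digits - 1)
    s = np.rint(x[nz] / np.power(10.0, shift)).astype(np.int64)

    # Strip trailing zeros: each pass removes one zero where present.
    for _ in range(digits):
        strip = (s != 0) & (s % 10 == 0)
        if not strip.any():
            break
        s[strip] //= 10
        shift[strip] += 1

    sig[nz] = s
    exp[nz] = shift
    return sig, exp


values = pd.Series([6.898806e+09, 6.898806e+05, 6.898806e+03, 0.0])
sig, exp = split_significand_exp(values)
result = pd.DataFrame({"MARKET_VALUE": values, "SIGNIFICAND": sig, "EXP": exp})
print(result)
```

Values with more than `digits` significant digits are rounded, so the default would need to be raised if the downstream system expects more precision.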
I am writing unit tests for 2 data frames to test for equality by converting them to dictionaries and using unittest's assertDictEqual(). The context is that I'm converting Excel functions to Python but due to their different rounding system, some values are off by merely +/- 1
I've tried DF.round(-1) to round to the nearest ten, but because of the +/- 1 difference some numbers round in opposite directions: for example, 15 would round up but 14 would round down, and the test would fail. All values in the 12x20 data frame are integers.
What I'm looking for (feel free to suggest any alternate solution):
A CLEAN way to test for approximate equality of data frames or nested dictionaries
or a way to make the ones-digit of each element '0' to avoid the rounding issue
Thank you, and please let me know if any additional context is required. Due to confidentiality issues and my NDA (non-disclosure agreement), I cannot share the code but I can formulate an example if necessary
You could take the element-wise absolute difference between the two DataFrames and check that all values are below a certain tolerance (in your case 1). For example, we can create two DataFrames with values in the interval [0.0, 1.0).
import numpy as np
import pandas as pd
np.random.seed(42)
# df1 and df2 are 10x10 DataFrames with values in the interval [0.0, 1.0)
df1 = pd.DataFrame(np.random.random_sample((10,10)))
df2 = pd.DataFrame(np.random.random_sample((10,10)))
Then the following should return True (using <= rather than < so that integer values differing by exactly 1 still pass):
(abs(df2 - df1) <= 1).all(axis=None)
And you can write an assert statement like:
assert (abs(df2 - df1) <= 1).all(axis=None)
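The same check can also be written with NumPy's testing helper, which reports the mismatching values when it fails. This is a sketch with made-up integer frames (`expected` and `actual` are hypothetical names); `rtol=0` switches off the relative tolerance so that only the absolute tolerance of 1 applies.

```python
import numpy as np
import pandas as pd

# Hypothetical frames that differ by at most 1 per cell.
expected = pd.DataFrame({"a": [10, 20, 30], "b": [40, 50, 60]})
actual = pd.DataFrame({"a": [11, 20, 29], "b": [40, 51, 60]})

# Passes when |actual - expected| <= atol for every element;
# raises AssertionError otherwise.
np.testing.assert_allclose(actual.to_numpy(), expected.to_numpy(), rtol=0, atol=1)
```

Inside a unittest test case this line can be called directly, since a failure surfaces as an ordinary AssertionError.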
I'm not 100 percent sure I understand what you are trying to do, but why not just divide by 10 to drop the last digit that is bothering you?
Floor division with "//" discards the ones digit. You can then multiply by ten if you want to keep the original magnitude.
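As a small illustration of this suggestion, with a made-up frame (the column names are assumptions):

```python
import pandas as pd

# Hypothetical integer data.
df = pd.DataFrame({"x": [14, 15, 27], "y": [120, 99, 101]})

# Floor-divide by 10 to drop the ones digit, then scale back up.
truncated = (df // 10) * 10
print(truncated)
```

Note that two values on either side of a multiple of ten (for example 19 and 20) still end up different after truncation, so this does not remove the off-by-one problem in every case.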
I have a huge dataframe with a lot of zero values. And, I want to calculate the average of the numbers between the zero values. To make it simple, the data shows for example 10 consecutive values then it renders zeros then values again. I just want to tell python to calculate the average of each patch of the data.
The pic shows an example
First of all, I'm a little confused about why you are using a DataFrame. This data would more naturally be stored in a pd.Series, and I would suggest storing numeric data in a NumPy array. Assuming that you have a pd.Series and want to calculate the moving average between each pair of consecutive points, there are two approaches you can follow:
zero-padding after the last value
assuming circularity and taking the average between the first and the last value
Here is the code for both:
import numpy as np
import pandas as pd
data_series = pd.Series([0,0,0.76231, 0.77669,0,0,0,0,0,0,0,0,0.66772, 1.37964, 2.11833, 2.29178, 0,0,0,0,0])
np_array = np.array(data_series)
#assuming zero_padding
np_array_zero_pad = np.hstack((np_array, 0))
mvavrg_zeropad = [np.mean([np_array_zero_pad[i], np_array_zero_pad[i+1]]) for i in range(len(np_array_zero_pad)-1)]
# assuming circularity: wrap around by appending the first value
np_array_circ_arr = np.hstack((np_array, np_array[0]))
mvavrg_circ = [np.mean([np_array_circ_arr[i], np_array_circ_arr[i+1]]) for i in range(len(np_array_circ_arr)-1)]
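If the goal is instead one mean per patch of nonzero values, as the question describes, a possible sketch is to label each run and group by that label. The names `nonzero`, `run_id`, and `patch_means` are assumptions for illustration; the idea is that the running count of zeros stays constant across a run of nonzero values, so it can serve as a group key.

```python
import pandas as pd

data_series = pd.Series([0, 0, 0.76231, 0.77669, 0, 0, 0, 0, 0, 0, 0, 0,
                         0.66772, 1.37964, 2.11833, 2.29178, 0, 0, 0, 0, 0])

# True where the value belongs to a patch of data.
nonzero = data_series != 0
# Every zero increments the counter, so all values in one
# uninterrupted nonzero run share the same id.
run_id = (~nonzero).cumsum()

# Mean of each nonzero run, indexed by its run id.
patch_means = data_series[nonzero].groupby(run_id[nonzero]).mean()
print(patch_means)
```

This stays vectorized, so it should scale to a large Series without a Python-level loop.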
I have a dataframe extracted with Pandas for which one of the columns looks something like this:
What I want to do is to extract the numerical values (floats) in this column, which by itself I could do. The issue comes because I have some cells, like the cell 20 in the image, in which I have more than one number, so I would like to make an average of these values. I think that for that I would first need to recognize the different groups of numerical values in the string (each float number) and then extract them as floats to then operate with them. I don't know how to do this.
Edit: I have found a solution to this using re.findall from Python's re module. It is based on an answer in the thread "Find all floats or ints in a given string".
for index, value in z.items():
    z[index] = statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))', value)])
Note that I haven't included a match for integers, and I only account for values up to 99, because of the type of data that I have.
However, I get a warning with this approach, due to the loop (there is no warning when I do it only for one element of the series):
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
Although I don't see any issue happening with my data, is this warning important?
I think you can benefit from pandas' vectorized operations here. Use str.findall on the original column, then apply pd.Series to expand each list of matches into columns and pd.to_numeric to convert the strings to a numeric type (the default return dtype is float64). Finally, calculate the average of the values in each row with .mean(axis=1).
import pandas as pd
d = {0: {0: '2.469 (VLT: emission host)',
1: '1.942 (VLT: absorption)',
2: '1.1715 (VLT: absorption)',
3: '0.42 (NOT: absorption)|0.4245 (GTC)|0.4250 (ESO-VLT UT2: absorption & emission)',
4: '3.3765 (VLT: absorption)',
5: '1.86 (Xinglong: absorption)| 1.86 (GMG: absorption)|1.859 (VLT: absorption)',
6: '<2.4 (NOT: inferred)'}}
df = pd.DataFrame(d)
print(df)
s_mean = df[0].str.findall(r'(?:\b\d{1,2}\b(?:\.\d*))')\
.apply(pd.Series)\
.apply(pd.to_numeric)\
.mean(axis=1)
print(s_mean)
Output from s_mean
0 2.469000
1 1.942000
2 1.171500
3 0.423167
4 3.376500
5 1.859667
6 2.400000
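A variant of the same idea, sketched here as an assumption rather than as part of the answer above, uses str.extractall, which returns one row per match and so avoids expanding lists into columns. The sample strings are a subset of the data above, and the pattern is the same one wrapped in a capture group, which extractall requires.

```python
import pandas as pd

s = pd.Series(['0.42 (NOT: absorption)|0.4245 (GTC)',
               '1.942 (VLT: absorption)',
               '<2.4 (NOT: inferred)'])

# One row per match, indexed by (original row, match number).
matches = s.str.extractall(r'(\b\d{1,2}\b\.\d*)')[0].astype(float)

# Average the matches that belong to the same original row.
means = matches.groupby(level=0).mean()
print(means)
```

Rows without any match do not appear in the result, so reindexing against the original index may be needed if those rows should show up as NaN.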
I have found a solution based on what I wrote previously in the Edit of the original post:
It consists of using re.findall() with a regular expression, as posted in the thread "Find all floats or ints in a given string":
statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))',string)])
Then, to loop over the dataframe column, just use the lambda x: method with the pandas apply command (df.apply). For this, I have defined a function (redshift_to_num) executing the operation above, and then apply this function to each element in the dataframe column:
import re
import pandas as pd
import statistics
def redshift_to_num(string):
    measures = [float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))', string)]
    mean = statistics.mean(measures)
    return mean
df.Redshift=df.Redshift.apply(lambda x: redshift_to_num(x))
Notes:
The data of interest in my case is stored in the dataframe column df.Redshift.
In the re.findall pattern I haven't included a match for integers, and I only account for values up to 99, because of the type of data that I have.
How to convert log2 transformed values back to normal scale in python
Any suggestions would be great
log2(x) is the inverse of 2**x (2 to the power of x). If you have a column of data that has been transformed by log2(x), all you have to do is perform the inverse operation:
df['colname'] = [2**i for i in df['colname']]
As suggested in the comment below, it would be more efficient to do:
df['colname'] = df['colname'].rpow(2)
rpow is a built-in pandas Series method. Its first argument is the base that is raised to the power of each element. You can also pass a fill_value argument, which is substituted for missing (NaN) values in the input before the computation.
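A quick round trip shows the inverse in action. The sample values are made up, and np.exp2 is included as an equivalent NumPy spelling of 2**x:

```python
import numpy as np
import pandas as pd

original = pd.Series([1.0, 2.0, 8.0, 0.5])
logged = np.log2(original)          # log2-transformed values

restored_rpow = logged.rpow(2)      # 2 ** logged, element-wise
restored_exp2 = np.exp2(logged)     # same operation via NumPy

print(restored_rpow)
print(restored_exp2)
```

Both forms operate on the whole column at once, which is why they are faster than the list comprehension.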
Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?
In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like to be int.
Thoughts?
Things tried:
I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case):
https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
)
This capability has been added to pandas beginning with version 0.24.
At this point, it requires the use of extension dtype 'Int64' (capitalized), rather than the default dtype 'int64' (lowercase).
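A minimal sketch of the difference, assuming pandas 0.24 or later (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Default behaviour: the missing value forces float64.
s_float = pd.Series([1, 2, np.nan])

# Nullable extension dtype: values stay integers, the gap stays missing.
s_int = pd.Series([1, 2, np.nan], dtype="Int64")

print(s_float.dtype)
print(s_int.dtype)
print(s_int)
```

The capital "I" matters: "int64" is the NumPy dtype, which cannot hold missing values.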
If performance is not the main issue, you can store strings instead.
df.col = df.col.dropna().apply(lambda x: str(int(x)) )
Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.
You can also temporarily duplicate the columns: one as you have it, with floats; the other one experimental, with ints or strings. Then insert asserts in every reasonable place, checking that the two are in sync. After enough testing you can let go of the floats.
In case you are trying to convert a float (1.143) vector to integer (1), and that vector has NAs, converting it to the new 'Int64' dtype will give you an error. In order to solve this you have to round the numbers and then do ".astype('Int64')"
s1 = pd.Series([1.434, 2.343, np.nan])
#without round() the next line returns an error
s1.astype('Int64')
#cannot safely cast non-equivalent float64 to int64
##with round() it works
s1.round().astype('Int64')
0 1
1 2
2 NaN
dtype: Int64
My use case is a float series that I want to round to integers. After .round() the values are still floats (for example 1.0), so you need the conversion to an integer dtype to remove the decimals.
This is not a solution for all cases, but in mine (genomic coordinates) I've resorted to using 0 as NaN:
a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
This at least allows the proper 'native' column type to be used, so operations like subtraction and comparison work as expected.
Pandas v0.24+
Functionality to support NaN in integer series is available from v0.24 onwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.
Pandas v0.23 and earlier
In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.
The docs suggest: "One possibility is to use dtype=object arrays instead." For example:
s = pd.Series([1, 2, 3, np.nan])
print(s.astype(object))
0 1
1 2
2 3
3 NaN
dtype: object
For cosmetic reasons, e.g. output to a file, this may be preferable.
Pandas v0.23 and earlier: background
NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcasted to float:
In the absence of high performance NA support being built into NumPy
from the ground up, the primary casualty is the ability to represent
NAs in integer arrays.
This trade-off is made largely for memory and performance reasons, and
also so that the resulting Series continues to be “numeric”.
The docs also provide rules for upcasting due to NaN inclusion:
Typeclass   Promotion dtype for storing NAs
floating    no change
object      no change
integer     cast to float64
boolean     cast to object
New for Pandas v1.0.0+
Nullable integer arrays no longer use numpy.nan as their missing-value marker.
They now use pandas.NA.
Please read: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
IntegerArray is currently experimental. Its API or implementation may
change without warning.
Changed in version 1.0.0: Now uses pandas.NA as the missing value
rather than numpy.nan.
In Working with missing data, we saw that pandas primarily uses NaN to
represent missing data. Because NaN is a float, this forces an array
of integers with any missing values to become floating point. In some
cases, this may not matter much. But if your integer column is, say,
an identifier, casting to float can be problematic. Some integers
cannot even be represented as floating point numbers.
If there are blanks in the text data, columns that would normally be integers will be cast to float64, because the int64 dtype cannot handle nulls. This can cause an inconsistent schema if you are loading multiple files, some with blanks (which will end up as float64) and others without (which will end up as int64).
This code will attempt to convert any number type columns to Int64 (as opposed to int64) since Int64 can handle nulls
import pandas as pd
import numpy as np
#show datatypes before transformation
mydf.dtypes
for c in mydf.select_dtypes(np.number).columns:
    try:
        mydf[c] = mydf[c].astype('Int64')
        print('casted {} as Int64'.format(c))
    except (TypeError, ValueError):
        # e.g. a float column with non-integer values cannot be cast safely
        print('could not cast {} to Int64'.format(c))
#show datatypes after transformation
mydf.dtypes
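On pandas 1.0 or later, DataFrame.convert_dtypes() may achieve something similar without an explicit loop. The sketch below uses a made-up `mydf` (the column names and values are assumptions) in which one float column holds only whole numbers plus a blank:

```python
import numpy as np
import pandas as pd

# Hypothetical frame as it might look after loading a file with blanks.
mydf = pd.DataFrame({"id": [1.0, 2.0, np.nan],
                     "price": [1.5, 2.25, np.nan]})

# Float columns whose values are all whole numbers become Int64;
# columns with real fractions are not converted to an integer dtype.
converted = mydf.convert_dtypes()
print(converted.dtypes)
```

Unlike the loop above, this also converts non-numeric columns to nullable extension dtypes, which may or may not be wanted.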
This is now possible, since pandas v0.24.0.
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values."
I know that OP has asked for NumPy or Pandas only, but I think it is worth mentioning polars as an alternative that supports the requested feature.
In Polars any missing values in an integer column are simply null values and the column remains an integer column.
See Polars - User Guide > Coming from Pandas for more info.