Why is the dataframe altered when using isin? - python

I don't understand what is happening. It was working fine, and suddenly, it's not.
I have a dataframe looking like this:
And when I try to filter the indicators, the data is altered to look like this:
This is the code I use to filter the indicators; I expect it to keep the same data:
dfCountry = data.copy()
goodIndicators = ['GDP per capita (current US$)', 'Internet users (per 100 people)', 'other indicators']
dfCountry = dfCountry[dfCountry["Indicator Name"].isin(goodIndicators)]

This is something pandas does with very large or very small numbers: it displays them in scientific notation. You could fix it by simply rounding your numbers. You can even pass a number to the round method to round to a specific number of decimals.
n_decimals = 0
df = df.round(n_decimals) # or df = df.round() since zero is default
You could also change the pandas configuration and specify yourself how many decimals pandas should show.
pd.set_option('display.float_format', lambda x: '%.5f' % x)
You could also just convert the float numbers to integers when you don't care about decimals.
for column in [c for c in df.columns if c.startswith('2')]:
    df.loc[:, column] = df.loc[:, column].apply(int)
More about this Pandas notation.

Related

Trouble subtracting two column values correctly/precisely in pandas dataframe in Python

I'm trying to create a new column in my pandas dataframe which will be the difference of two other columns, but the new column has values that are significantly different from what the differences between the values of the columns are. I have heard that 'float' values often don't subtract precisely, so I have tried to convert the decimal values here to integers by changing the columns' dtypes to 'int64' (as suggested here: Pandas Subtract Two Columns Not Working Correctly) and then multiplying each value by 100000:
# Read in data
data = pd.read_csv('/Users/aaron/Downloads/treatment_vs_control.csv')
# Cleaning and preprocessing
data.rename({"names": "Gene"}, axis=1, inplace=True)
columns = data.columns
garbage_filter = columns.str.startswith('names') | columns.str.startswith('0') | columns.str.startswith('1') | \
    columns.str.startswith('Gene')
data = data.loc[:,garbage_filter]
scores_filter = columns.str.endswith('scores')
columns = data.columns
scores_filter = columns.str.endswith('scores')
data = data.iloc[:,~scores_filter]
## To create Diff columns correctly, change logFC columns to integer dtype
data = data.astype({'1_logfoldchanges': 'int64', '0_logfoldchanges': 'int64'})
data['1_logfoldchanges'] = data['1_logfoldchanges'] * 100000
data['0_logfoldchanges'] = data['0_logfoldchanges'] * 100000
data["diff_logfoldchanges0"] = data['0_logfoldchanges'] - data['1_logfoldchanges']
data["diff_logfoldchanges1"] = data['1_logfoldchanges'] - data['0_logfoldchanges']
data['1_logfoldchanges'] = data['1_logfoldchanges'] / 100000
data['0_logfoldchanges'] = data['0_logfoldchanges'] / 100000
data['diff_logfoldchanges0'] = data['diff_logfoldchanges0'] / 100000
data['diff_logfoldchanges1'] = data['diff_logfoldchanges1'] / 100000
data = data.astype({'1_logfoldchanges': 'float64', '0_logfoldchanges': 'float64'})
data.sort_values('diff_logfoldchanges0', ascending=False, inplace=True)
The values in the new column still do not equal the difference in the two original columns and I haven't been able to find any questions on this site or others that have been able to help me resolve this. Could someone point out how I could fix this? I would be extremely grateful for any help.
For reference, here is a snapshot of my data with the incorrect difference-column values:
EDIT: Here is a bit of my CSV data too:
names,0_scores,0_logfoldchanges,0_pvals,0_pvals_adj,1_scores,1_logfoldchanges,1_pvals,1_pvals_adj,2_scores,2_logfoldchanges,2_pvals,2_pvals_adj,3_scores,3_logfoldchanges,3_pvals,3_pvals_adj,4_scores,4_logfoldchanges,4_pvals,4_pvals_adj,5_scores,5_logfoldchanges,5_pvals,5_pvals_adj,6_scores,6_logfoldchanges,6_pvals,6_pvals_adj,7_scores,7_logfoldchanges,7_pvals,7_pvals_adj,8_scores,8_logfoldchanges,8_pvals,8_pvals_adj
0610005C13Rik,-0.06806567,-1.3434665,0.9457333570044608,0.9996994148075796,-0.06571575,-2.952315,0.9476041278614572,0.9998906553041256,0.17985639,1.9209933,0.8572653106998014,0.9994124851941415,-0.0023527155,0.85980946,0.9981228063933416,0.9993920957240323,0.0021153346,0.08053488,0.9983122084427253,0.9993417421686092,0.07239167,2.6473796,0.9422902189641795,0.9998255096296015,-0.029918168,-18.44805,0.9761323166853361,0.998901292435457,-0.021452557,-18.417543,0.9828846479876278,0.9994515175269552,-0.011279659,-18.393742,0.9910003250967939,0.9994694916208285
0610006L08Rik,-0.015597747,-15.159286,0.9875553033428832,0.9996994148075796,-0.015243248,-15.13933,0.9878381189626457,0.9998906553041256,-0.008116434,-14.795435,0.9935240935555751,0.9994124851941415,-0.0073064035,-14.765995,0.9941703851753109,0.9993920957240323,-0.0068988753,-14.752146,0.9944955375479378,0.9993417421686092,0.100005075,18.888618,0.9203402935001026,0.9998255096296015,-0.004986361,-14.696446,0.9960214758176429,0.998901292435457,-0.0035754263,-14.665947,0.9971472286106732,0.9994515175269552,-0.0018799432,-14.64215,0.9985000232597367,0.9994694916208285
0610009B22Rik,0.7292792,-0.015067068,0.46583086269639506,0.9070814087688549,0.42489842,0.18173021,0.67091072915639,0.9998906553041256,17.370018,1.0877438,1.3918130408174961e-67,6.801929840389262e-67,-6.5684495,-1.237505,5.084194721546539e-11,3.930798968247645e-10,-5.6669636,-0.42557448,1.4535041077956595e-08,5.6533712043729706e-08,-3.5668032,-0.5939982,0.0003613625764821466,0.001766427013499565,-7.15373,-1.7427195,8.445118618740649e-13,4.689924532441606e-12,-2.6011736,-0.66274893,0.009290541915973735,0.05767076032846401,1.7334439,1.2316034,0.08301681426158236,0.3860271115408991
Ideally, I'd like to create a 'diff_logfoldchanges0' column that is equal to the values from the '0_logfoldchanges' column minus the values from the '1_logfoldchanges' column. In the CSV data below, I believe that might be "-1.3434665 - -2.952315", "-15.159286 - -15.13933", and "-0.015067068 - 0.18173021".
pd.read_csv by default uses a fast but less precise method of parsing floating point numbers:
import pandas as pd
import io
csv = """names,0_scores,0_logfoldchanges,0_pvals,0_pvals_adj,1_scores,1_logfoldchanges,1_pvals,1_pvals_adj,2_scores,2_logfoldchanges,2_pvals,2_pvals_adj,3_scores,3_logfoldchanges,3_pvals,3_pvals_adj,4_scores,4_logfoldchanges,4_pvals,4_pvals_adj,5_scores,5_logfoldchanges,5_pvals,5_pvals_adj,6_scores,6_logfoldchanges,6_pvals,6_pvals_adj,7_scores,7_logfoldchanges,7_pvals,7_pvals_adj,8_scores,8_logfoldchanges,8_pvals,8_pvals_adj
0610005C13Rik,-0.06806567,-1.3434665,0.9457333570044608,0.9996994148075796,-0.06571575,-2.952315,0.9476041278614572,0.9998906553041256,0.17985639,1.9209933,0.8572653106998014,0.9994124851941415,-0.0023527155,0.85980946,0.9981228063933416,0.9993920957240323,0.0021153346,0.08053488,0.9983122084427253,0.9993417421686092,0.07239167,2.6473796,0.9422902189641795,0.9998255096296015,-0.029918168,-18.44805,0.9761323166853361,0.998901292435457,-0.021452557,-18.417543,0.9828846479876278,0.9994515175269552,-0.011279659,-18.393742,0.9910003250967939,0.9994694916208285
0610006L08Rik,-0.015597747,-15.159286,0.9875553033428832,0.9996994148075796,-0.015243248,-15.13933,0.9878381189626457,0.9998906553041256,-0.008116434,-14.795435,0.9935240935555751,0.9994124851941415,-0.0073064035,-14.765995,0.9941703851753109,0.9993920957240323,-0.0068988753,-14.752146,0.9944955375479378,0.9993417421686092,0.100005075,18.888618,0.9203402935001026,0.9998255096296015,-0.004986361,-14.696446,0.9960214758176429,0.998901292435457,-0.0035754263,-14.665947,0.9971472286106732,0.9994515175269552,-0.0018799432,-14.64215,0.9985000232597367,0.9994694916208285
0610009B22Rik,0.7292792,-0.015067068,0.46583086269639506,0.9070814087688549,0.42489842,0.18173021,0.67091072915639,0.9998906553041256,17.370018,1.0877438,1.3918130408174961e-67,6.801929840389262e-67,-6.5684495,-1.237505,5.084194721546539e-11,3.930798968247645e-10,-5.6669636,-0.42557448,1.4535041077956595e-08,5.6533712043729706e-08,-3.5668032,-0.5939982,0.0003613625764821466,0.001766427013499565,-7.15373,-1.7427195,8.445118618740649e-13,4.689924532441606e-12,-2.6011736,-0.66274893,0.009290541915973735,0.05767076032846401,1.7334439,1.2316034,0.08301681426158236,0.3860271115408991"""
data = pd.read_csv(io.StringIO(csv))
print(data["0_logfoldchanges"][0]) # -1.3434665000000001 instead of -1.3434665
This difference is tiny (less than a quadrillionth of the original value) and usually not visible, because the display rounds it. In most contexts I would not call it 'significant' (it's likely to be insignificant relative to the precision / accuracy of the input data), but it does show up if you check your calculation by manually typing the same numbers into the Python interpreter.
To read the values more precisely, use float_precision="round_trip":
data = pd.read_csv(io.StringIO(csv), float_precision="round_trip")
Subtracting now produces the expected values (the same as doing a conventional python subtraction):
difference = data["0_logfoldchanges"] - data["1_logfoldchanges"]
print(difference[0] == -1.3434665 - -2.952315) # checking first row - True
This is not due to floating point being imprecise as such, but is specific to the way pandas reads CSV files. This is a good guide to floating point rounding. In general, converting to integers will not help, except sometimes when dealing with money or other quantities that have a precise decimal representation.
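As a minimal, pandas-free illustration of the distinction: binary floats cannot represent most decimal fractions exactly, and scaling by a power of ten does not remove the error that is already in the stored value.

```python
# 0.1 and 0.2 are stored as the nearest binary fractions, so their
# sum is not exactly 0.3.
total = 0.1 + 0.2
print(total)          # 0.30000000000000004
print(total == 0.3)   # False

# Scaling to "integers" does not help in general: the representation
# error is already baked in before the multiplication.
print(0.1 * 3)        # 0.30000000000000004
```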
Did you try the sub method of pandas? I have done these arithmetic operations on float values many times without any issues.
Please try https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sub.html
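For reference, a minimal sketch of Series.sub. Note that sub is simply the method form of the - operator (its extra axis and fill_value parameters are the usual reason to prefer it), so it performs the same floating point arithmetic and does not by itself change precision:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.5, 2.5], "b": [0.5, 1.0]})

# Equivalent to df["a"] - df["b"]
diff = df["a"].sub(df["b"])
print(diff.tolist())  # [1.0, 1.5]
```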

Remove scientific notation floats in a dataframe

I am receiving different series from a source. Some of those series have the values in big numbers (X billions). I then combine all the series to a dataframe with individual columns for each series.
Now, when I print the dataframe, the big numbers in the series are showed in scientific notation. Even printing the series individually shows the numbers in scientific notation.
Dataframe df (multiindex) output is:
Values
Item Sub
A 1 1.396567e+12
B 1 2.868929e+12
I have tried this:
pd.set_option('display.float_format', lambda x: '%,.2f' % x)
This doesn't work because:
it converts the format everywhere, and I only need the conversion in that specific dataframe.
it tries to convert all kinds of floats, not just those in scientific notation. So even if the float is 89.142, it will try to convert the format, and as there's no digit to put a ',' after, it shows an error.
Then I tried these:
df.round(2)
This only converted numeric floats to 2 decimals from existing 3 decimals. Didn't do anything to scientific values.
Then I tried:
df.astype(float)
Doesn't do anything visible. Output stayed the same.
How else can we change the scientific notation to normal float digits inside the dataframe? I do not want to create a new list with the converted values; the dataframe itself should show the values in normal terms.
Can you guys please help me find a solution for this?
Thank you.
Try df['column'] = df['column'].astype(str). If that does not work, you should change the type of the numbers to string before creating the pandas dataframe from your data.
I would suggest keeping everything in a float type and adjust the display setting.
For example, I have generated a df with some random numbers.
df = pd.DataFrame({"Item": ["A", "B"], "Sub": [1, 1],
                   "Value": [float(31132314122123.1), float(324231235232315.1)]})
# Item Sub Value
#0 A 1 3.113231e+13
#1 B 1 3.242312e+14
If we check df.dtypes, we can see that the Sub values are ints and the Value values are floats.
Item object
Sub int64
Value float64
dtype: object
You can then call pd.options.display.float_format = '{:.1f}'.format to suppress the scientific notation of the floats, while retaining the float format.
# Item Sub Value
#0 A 1 31132314122123.1
#1 B 1 324231235232315.1
Item object
Sub int64
Value float64
dtype: object
If you want the scientific notation back, you can call pd.reset_option('display.float_format')
Okay. I found something called option_context for pandas that allows to change the display options just for the particular case / action using a with statement.
with pd.option_context('display.float_format', '{:.2f}'.format):
    print(df)
So, we do not have to reset the options again as well as the options stay default for all other data in the file.
Sadly though, I could find no way to store different columns in different float formats (for example, one column as currency, comma-separated with 2 decimals, while the next column is a percentage, non-comma with 2 decimals).
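One partial workaround (a sketch with made-up column names) is to format a display-only copy column by column with map. This does turn the copy's values into strings, so the original frame must be kept for any computation:

```python
import pandas as pd

df = pd.DataFrame({"price": [1234.5, 67890.123], "share": [0.1234, 0.5678]})

# Format a throwaway copy for display; the original stays numeric.
shown = df.copy()
shown["price"] = shown["price"].map("{:,.2f}".format)
shown["share"] = shown["share"].map("{:.2%}".format)
print(shown)
```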

Is it possible to format numbers in a Data Frame without turning them into strings?

I would like to export a dataframe to Excel as xls and show numbers with a 1000 separator and 2 decimal places and percentages as % with 2 decimals, e.g. 54356 as 54,356.00 and 0.0345 as 3.45%
When I change the format in python using .map() and .format it displays them correctly in python, but it turns them into a string and when I export them to xls Excel does not recognize them as numbers/percentages.
import pandas as pd
d = {'Percent': [0.01, 0.0345], 'Number': [54464.43, 54356]}
df = pd.DataFrame(data=d)
df['Percent'] = pd.Series(["{0:.2f}%".format(val * 100) for val in df['Percent']], index = df.index)
df['Number'] = df['Number'].map('{:,.2f}'.format)
The data frame looks as expected, but the type of the cells is now str and if I export it to xls (df.to_excel('file.xls')), Excel shows the "The number in this cell is formatted as text or preceded by an apostrophe" warning.
You can edit the style property when you display the DataFrame, but the underlying data is not touched
df.style.format({
    'Number': '{0:.2f}'.format,
    'Percent': '{0:.2%}'.format,
})
see: pandas style guide
If I understand your question correctly, your goal concerns how the actual cells in the output Excel file are formatted. To do this, you may want to focus on the actual file rather than on formatting within the pandas DataFrame (i.e. your data is correct; the problem is the way it is displayed in the output).
You may want to try something like this.
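For instance, here is a sketch using pd.ExcelWriter with the xlsxwriter engine (an assumption: xlsxwriter must be installed; the sheet and column layout below match the example frame from the question). Excel number formats are applied to the cells themselves, so the values stay numeric:

```python
import pandas as pd

d = {'Percent': [0.01, 0.0345], 'Number': [54464.43, 54356]}
df = pd.DataFrame(data=d)

with pd.ExcelWriter('file.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    # With index=False, Percent lands in column A and Number in column B.
    pct_fmt = workbook.add_format({'num_format': '0.00%'})
    num_fmt = workbook.add_format({'num_format': '#,##0.00'})
    worksheet.set_column('A:A', 12, pct_fmt)
    worksheet.set_column('B:B', 14, num_fmt)
```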
maybe lambda would be more helpful
df['Percent'] = df['Percent'].apply(lambda x: round(x*100, 2))
df['Number'] = df['Number'].apply(lambda x: round(x, 2))
For df["Number"] : df.Number = df.Number.apply(lambda x : '{0:,.2f}'.format(x)) will work
For df["Percent"] : first multiply the series by 100. df["Percent"] = df.Percent.multiply(100)
Then use df.Percent = df.Percent.apply(lambda x: "{0:.2f}%".format(x))
Hope this helps

Pandas adding decimal points when using read_csv

I'm working with some csv files and using pandas to turn them into a dataframe. After that, I use an input to find values to delete
I'm hung up on one small issue: for some columns it's adding ".0" to the values in the column. It only does this in columns with numbers, so I'm guessing it's reading the column as a float. How do I prevent this from happening?
The part that really confuses me is that it only happens in a few columns, so I can't quite figure out a pattern. I need to chop off the ".0" so I can re-import it, and I feel like it would be easiest to prevent it from happening in the first place.
Thanks!
Here's a sample of my code:
clientid = int(input('What client ID needs to be deleted?'))
df1 = pd.read_csv('Client.csv')
clientclean = df1.loc[df1['PersonalID'] != clientid]
clientclean.to_csv('Client.csv', index=None)
Ideally, I'd like all of the values to be the same as the original csv file, but without the rows with the clientid from the user input.
If PersonalID is the header of the problematic column, try this:
import numpy as np

df1 = pd.read_csv('Client.csv', dtype={'PersonalID': np.int32})
Edit:
Since integer columns cannot hold NaN values, fill those in first.
You can try this on each problematic column:
df1[col] = df1[col].fillna(-9999) # or 0 or any value you want here
df1[col] = df1[col].astype(int)
You could go through each value, and if it is a number x, subtract int(x) from it; if this difference is 0.0 for every value in a column, convert that column to int. Or, if you're not dealing with any non-integers, you could just convert all values that are numbers to ints.
For an example of the latter (when your original data does not contain any non-integer numbers):
for index, row in df1.iterrows():
    for c, x in enumerate(row):
        if isinstance(x, float):
            df1.iloc[index, c] = int(x)
For an example of the former (if you want to keep non-integer numbers as non-integer numbers, but want to guarantee that integer numbers stay as integers):
import sys

for c, col in enumerate(df1.columns):
    foundNonInt = False
    for r, index in enumerate(df1.index):
        x = df1.iloc[r, c]
        if isinstance(x, float):
            if abs(x - int(x)) > sys.float_info.epsilon:
                foundNonInt = True
                break
    if not foundNonInt:
        df1.iloc[:, c] = df1.iloc[:, c].astype(int)
Note, the above method is not fool-proof: if by chance, a non-integer number column from the original data set contains non-integers that are all x.0000000, all the way to the last decimal place, this will fail.
It was a datatype issue.
ALollz's comment led me in the right direction. Pandas was assuming a data type of float, which added the decimal points.
I specified the datatype as object (from Akarius's comment) when using read_csv, which resolved the issue.
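A minimal sketch of that fix (the CSV content and column names here are made up for illustration): passing dtype=object (or str) to read_csv stops pandas from inferring numeric types, so values keep their exact text.

```python
import io
import pandas as pd

csv = "PersonalID,Zip\n1,02134\n2,90210\n"

# dtype=object keeps every column as text: no float inference,
# no trailing ".0", and leading zeros survive.
df1 = pd.read_csv(io.StringIO(csv), dtype=object)
print(df1['Zip'].tolist())  # ['02134', '90210']
```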

how to get rid of pandas converting large numbers in excel sheet to exponential?

In the excel sheet , i have two columns with large numbers.
But when i read the excel file with read_excel() and display the dataframe,
those two columns are printed in scientific format with exponential.
How can get rid of this format?
Thanks
Output in Pandas
The way scientific notation is applied is controlled via pandas' display options:
pd.set_option('display.float_format', '{:.2f}'.format)
df = pd.DataFrame({'Traded Value':[67867869890077.96,78973434444543.44],
'Deals':[789797, 789878]})
print(df)
Traded Value Deals
0 67867869890077.96 789797
1 78973434444543.44 789878
If this is simply for presentational purposes, you may convert your
data to strings while formatting them on a column-by-column basis:
df = pd.DataFrame({'Traded Value':[67867869890077.96,78973434444543.44],
'Deals':[789797, 789878]})
df
Deals Traded Value
0 789797 6.786787e+13
1 789878 7.897343e+13
df['Deals'] = df['Deals'].apply(lambda x: '{:d}'.format(x))
df['Traded Value'] = df['Traded Value'].apply(lambda x: '{:.2f}'.format(x))
df
Deals Traded Value
0 789797 67867869890077.96
1 789878 78973434444543.44
An alternative, more straightforward method would be to put the following line at the top of your code; it formats floats only:
pd.options.display.float_format = '{:.2f}'.format
Try '{:.0f}' with Sergey's answer; it worked for me.
