Convert selected elements in data frame from float to integer unsuccessful - python

I'm trying to convert a list of elements in the dataframe called "GDP" from floats to integers. The cells that I want to convert are specified by GDP.iloc[4,-10:]. I have tried the following methods:
for x in GDP.iloc[4,-10:]:
    pd.to_numeric(x, downcast='signed')
GDP.iloc[4,-10:]=GDP.iloc[4,-10:].astype(int)
GDP.iloc[4,-10:]=int(GDP.iloc[4,-10:])
However, none of them converts the floats to integers. No errors appear for methods 1 and 2, but method 3 raises the following error:
TypeError: cannot convert the series to <class 'int'>
The data can be found here: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
GDP = pd.read_csv('world_bank.csv',header=None)
Method 1:
for x in GDP.iloc[4,-10:]:
    pd.to_numeric(x, downcast='signed')
Method 2:
GDP.iloc[4,-10:]=GDP.iloc[4,-10:].astype(int)
Method 3:
GDP.iloc[4,-10:]=int(GDP.iloc[4,-10:])
Can someone help me out? Much appreciated.

You can use astype(np.int64) to convert to int
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# df.head()
df = df.fillna('custom_none_values')
# df.head()
df = df[df['1960'] != 'custom_none_values']
df['1960'] = df['1960'].astype(np.int64)
df.head()
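A minimal sketch of why the attempts in the question can appear to do nothing, and of two ways around it. pd.to_numeric returns a new value, so calling it in a loop without storing the result changes nothing; and a column has a single dtype, so writing integers into a few cells of a float64 column leaves them stored as floats. The small frame and its column names below are made up for illustration, and the sketch assumes the year column was read as float64.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the World Bank data: one float column with a gap.
df = pd.DataFrame({"country": ["A", "B", "C"], "1960": [1.0, np.nan, 3.0]})

# Option 1: drop the rows with missing values, then cast the whole column.
clean = df.dropna(subset=["1960"]).copy()
clean["1960"] = clean["1960"].astype(np.int64)

# Option 2: keep every row by using the nullable integer dtype "Int64".
nullable = df.copy()
nullable["1960"] = nullable["1960"].astype("Int64")

print(clean.dtypes)
print(nullable)
```

Option 2 only works when the non-missing values are whole numbers; otherwise round them first.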

Related

How do I convert one column from an imported csv using numpy from string to float?

I have two csv files which I have imported into Python using numpy.
The data has 2 columns:
[['month' 'total_rainfall']
['1982-01' '107.1']
['1982-02' '27.8']
['1982-03' '160.8']
['1982-04' '157']
['1982-05' '102.2']
I need to create a 2D array and calculate statistics on the 'total_rainfall' column (mean, standard deviation, min and max).
So I have this:
import numpy as np
datafile=np.genfromtxt("C:\rainfall-monthly-total.csv",delimiter=",",dtype=None,encoding=None)
print(datafile)
rainfall=np.asarray(datafile).astype(np.float32)
print (np.mean(datafile,axis=1))
ValueError: could not convert string to float: '2019-04'
Converting str to float is like below:
>>> a = "545.2222"
>>> float(a)
545.2222
>>> int(float(a))
545
But the error message says the problem is converting '2019-04' to float.
That conversion fails because '2019-04' is a date string, not a number, which is why you get the error.
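One way to avoid the date column entirely, sketched here under the assumption that the file is a plain comma-separated file with one header row: ask np.genfromtxt for the numeric column only. The inline text stands in for the real file, which is not available here.

```python
from io import StringIO

import numpy as np

# Inline stand-in for the CSV file, using the sample rows from the question.
csv_text = """month,total_rainfall
1982-01,107.1
1982-02,27.8
1982-03,160.8
1982-04,157
1982-05,102.2
"""

# skip_header=1 drops the column names; usecols=1 reads only total_rainfall.
rainfall = np.genfromtxt(StringIO(csv_text), delimiter=",", skip_header=1, usecols=1)

print("mean:", rainfall.mean())
print("std: ", rainfall.std())
print("min: ", rainfall.min())
print("max: ", rainfall.max())
```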
You can convert the rainfall values to float or int, but the date can't be converted directly. To turn the date into an int, split the string, build a date from the parts, and convert it to milliseconds:
from datetime import datetime
month1 = '1982-01'
year, month = (int(part) for part in month1.split('-'))  # datetime() needs ints, not strings
date = datetime(year, month, 1)
milliseconds = int(round(date.timestamp() * 1000))
This assumes the first day of the month.
Your error message reads could not convert string to float,
but actually your problem is a bit different.
Your array contains string columns, which should be converted:
month - to Period (month),
total_rainfall - to float.
Unfortunately, NumPy was designed to process arrays where all
cells are of the same type, so a much more convenient tool is pandas,
where each column can have its own type.
First, convert your NumPy array (I assume it is called arr) to a pandas
DataFrame:
import pandas as pd
df = pd.DataFrame(arr[1:], columns=arr[0])
I took column names from the initial row and data from
following rows. Print df to see the result.
So far both columns are still of object type (actually string),
so the only thing to do is to convert both columns,
each to its desired type:
df.month = pd.PeriodIndex(df.month, freq='M')
df.total_rainfall = df.total_rainfall.astype(float)
Now, when you run df.info(), you will see that both
columns are of proper types.
Use pandas to process your data as well; it is a more convenient tool.
E.g. to get quarterly sums, you can run:
df.set_index('month').resample('Q').sum()
getting (for your data sample):
        total_rainfall
month
1982Q1           295.7
1982Q2           259.2
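The statistics the question asks for (mean, standard deviation, min and max) can be sketched along the same lines. The array below is typed in by hand from the sample rows in the question and stands in for the one returned by np.genfromtxt.

```python
import numpy as np
import pandas as pd

# Stand-in for the string array from the question (header row first).
arr = np.array([
    ["month", "total_rainfall"],
    ["1982-01", "107.1"],
    ["1982-02", "27.8"],
    ["1982-03", "160.8"],
    ["1982-04", "157"],
    ["1982-05", "102.2"],
])

df = pd.DataFrame(arr[1:], columns=arr[0])
df["total_rainfall"] = df["total_rainfall"].astype(float)

# agg with a list of names returns one labelled value per statistic.
stats = df["total_rainfall"].agg(["mean", "std", "min", "max"])
print(stats)
```

Note that pandas computes the sample standard deviation (ddof=1) by default, while np.std uses ddof=0.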

How to convert Decimal128 to decimal in pandas dataframe

I have a dataframe with many (but not all) Decimal128 columns (taken from a mongodb collection). I can't perform any math or comparisons on them (e.g. '<' not supported between instances of 'Decimal128' and 'float').
What is the quickest/easiest way to convert all these to float or some simpler built-in type that I can work with?
There is the Decimal128 to_decimal() method, and pandas astype(), but how can I do it for all (the decimal128) columns in one step/helper method?
Edit, I've tried:
testdf = my_df.apply(lambda x: x.astype(str).astype(float) if isinstance(x, Decimal128) else x)
testdf[testdf["MyCol"] > 80].head()
but I get:
TypeError: '>' not supported between instances of 'Decimal128' and 'int'
Converting a single column using .astype(str).astype(float) works.
Casting full DataFrame.
df = df.astype(str).astype(float)
For single column. IDs is the name of the column.
df["IDs"] = df.IDs.astype(str).astype(float)
Test implementation
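from pprint import pprint
import bson
import pandas as pd
df = pd.DataFrame()
y = []
for i in range(1,6):
    i = i *2/3.5
    # Decimal128 is built from a string representation of the number
    y.append(bson.decimal128.Decimal128(str(i)))
pprint(y)
df["D128"] = y
# Go through str, since Decimal128 cannot be cast to float directly
df["D128"] = df.D128.astype(str).astype(float)
print("\n", df)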
Output:
[Decimal128('0.5714285714285714'),
Decimal128('1.1428571428571428'),
Decimal128('1.7142857142857142'),
Decimal128('2.2857142857142856'),
Decimal128('2.857142857142857')]
D128
0 0.571429
1 1.142857
2 1.714286
3 2.285714
4 2.857143
Just use:
df = df.astype(float)
You can also use apply or applymap (which applies an operation element-wise; it was renamed to DataFrame.map in pandas 2.1), although these are slower than the previous method.
df = df.applymap(float)
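I can't reproduce a Decimal128 number on my system. Can you please check whether the next line works for you?
df = df.apply(lambda col: col.astype(str).astype(float) if isinstance(col.iloc[0], bson.decimal128.Decimal128) else col)
It checks whether the first value in a column is a Decimal128 and, if so, converts that column to float. (Testing the column itself with isinstance never matches, because the column is a Series.)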
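A sketch of the "convert only the matching columns" idea as a reusable helper. Because bson may not be installed everywhere, the example uses the standard library's decimal.Decimal as a stand-in type; the helper name convert_typed_columns is hypothetical, and passing bson's Decimal128 class as value_type is an untested assumption based on the str-then-float cast that the question reports as working.

```python
from decimal import Decimal

import pandas as pd


def convert_typed_columns(df, value_type):
    """Return a copy where columns holding only value_type become float."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # Check the values, not the column: the column itself is a Series.
        if len(col) > 0 and col.map(lambda v: isinstance(v, value_type)).all():
            out[name] = col.astype(str).astype(float)
    return out


# Made-up data: one Decimal column, one ordinary string column.
my_df = pd.DataFrame({
    "price": [Decimal("0.5"), Decimal("2.25")],
    "name": ["a", "b"],
})

converted = convert_typed_columns(my_df, Decimal)
print(converted.dtypes)
print(converted[converted["price"] > 1.0])
```

The sketch does not handle missing values; columns that mix the target type with None or NaN are left untouched.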

pandas read_csv() converting imaginary to real

After reading a file with pandas using these lines:
import pandas as pd
import numpy as np
df = pd.read_csv('PN_lateral_n_eff.txt', header=None)
df.columns = ["effective_index"]
here is my output:
effective_index
0 2.568393573877396+1.139080496494329e-006i
1 2.568398351899841+1.129979376397734e-006i
2 2.568401556986464+1.123872317134941e-006i
After that, I cannot use NumPy to convert it into a real number, because the pandas dtype is object. I tried this:
np.real(df, dtype = float)
TypeError: real() got an unexpected keyword argument 'dtype'
Any way to do that?
Looks like astype(complex) works with Numpy arrays of strings, but not with Pandas Series of objects:
cmplx = (df['effective_index']
         .str.replace('i', 'j')  # Go engineering
         .values                 # Go NumPy
         .astype('str')          # Go string
         .astype(complex))       # Go complex (np.complex was removed from NumPy)
#array([ 2.56839357 +1.13908050e-06j, 2.56839835 +1.12997938e-06j,
# 2.56840156 +1.12387232e-06j])
df['effective_index'] = cmplx # Go Pandas again
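To get the real part that the question asks for, one possible sketch is to parse each string with Python's built-in complex() and then take the real component. The frame below is rebuilt by hand from the sample output in the question, and the column name real_part is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"effective_index": [
    "2.568393573877396+1.139080496494329e-006i",
    "2.568398351899841+1.129979376397734e-006i",
    "2.568401556986464+1.123872317134941e-006i",
]})

# Python writes the imaginary unit as 'j', so swap the trailing 'i' first.
cmplx = np.array([complex(s.replace("i", "j")) for s in df["effective_index"]])

df["real_part"] = cmplx.real  # float64 array holding the real components
print(df["real_part"])
```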

Converting list of strings to list of floats in pandas

I have what I assumed would be a super basic problem, but I'm unable to find a solution. In short, I have a column in a csv that is a list of numbers. This csv was generated by pandas with to_csv. When I read it back in with read_csv, the list of numbers is automatically converted into a string.
When then trying to use it I obviously get errors. When I try using the to_numeric function I get errors as well because it is a list, not a single number.
Is there any way to solve this? Posting code below for form, but probably not extremely helpful:
def write_func(dataset):
    features = featurize_list(dataset[column]) # Returns numpy array
    new_dataset = dataset.copy() # Don't want to modify the underlying dataframe
    new_dataset['Text'] = features
    new_dataset.rename(columns={'Text': 'Features'}, inplace=True)
    write(new_dataset, dataset_name)
def write(new_dataset, dataset_name):
    dump_location = feature_set_location(dataset_name, self)
    featurized_dataset.to_csv(dump_location)
def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(pd.to_numeric)
The Features column is the one in question. When I attempt to run the apply currently in read_func I get this error:
ValueError: Unable to parse string "[0.019636873200000002, 0.10695576670000001,...]" at position 0
I can't be the first person to run into this issue, is there some way to handle this at read/write time?
You want to use literal_eval as a converter passed to pd.read_csv. Below is an example of how that works.
from ast import literal_eval
from io import StringIO
import pandas as pd
txt = """col1|col2
a|[1,2,3]
b|[4,5,6]"""
df = pd.read_csv(StringIO(txt), sep='|', converters=dict(col2=literal_eval))
print(df)
col1 col2
0 a [1, 2, 3]
1 b [4, 5, 6]
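A sketch of the full round trip described in the question, assuming the Features column holds plain Python lists of floats when it is written. The frame and the in-memory buffer are stand-ins for the real dataset and file.

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

original = pd.DataFrame({
    "Label": ["a", "b"],
    "Features": [[0.1, 0.2], [0.3, 0.4]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# literal_eval safely parses the text "[0.1, 0.2]" back into a Python list.
restored = pd.read_csv(buffer, converters={"Features": literal_eval})
print(restored)
print(type(restored["Features"].iloc[0]))
```

If the column holds NumPy arrays rather than lists, their text form has no commas and literal_eval cannot parse it, so converting with .tolist() before writing is one option.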
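I have modified your last function a bit so that each string is parsed into a list before it is converted.
from ast import literal_eval
def read_func(read_location):
    df = pd.read_csv(read_location)
    # pd.to_numeric cannot parse the string "[...]" itself, so parse it into a list first
    df['Features'] = df['Features'].apply(lambda x : pd.to_numeric(literal_eval(x)))
    return df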

Py Pandas .format(dataframe)

As a Python newbie, I recently discovered that with Py 2.7 I can do something like:
print '{:20,.2f}'.format(123456789)
which will give the resulting output:
123,456,789.00
I'm now looking to have a similar outcome for a pandas df so my code was like:
import pandas as pd
import random
data = [[random.random()*10000 for i in range(1,4)] for j in range (1,8)]
df = pd.DataFrame (data)
print '{:20,.2f}'.format(df)
In this case I have the error:
Unknown format code 'f' for object of type 'str'
Any suggestions to perform something like '{:20,.2f}'.format(df) ?
For now, my idea is to index the dataframe (it's a small one), format each individual float within it, perhaps convert with astype(str), and rebuild the DF ... but that looks ugly :-( and I'm not even sure it'll work ..
What do you think? I'm stuck ... and would like better formatting for my dataframes when they are converted to reportlab grids.
import pandas as pd
import numpy as np
data = np.random.random((8,3))*10000
df = pd.DataFrame (data)
pd.options.display.float_format = '{:20,.2f}'.format
print(df)
yields (random output similar to)
0 1 2
0 4,839.01 6,170.02 301.63
1 4,411.23 8,374.36 7,336.41
2 4,193.40 2,741.63 7,834.42
3 3,888.27 3,441.57 9,288.64
4 220.13 6,646.20 3,274.39
5 3,885.71 9,942.91 2,265.95
6 3,448.75 3,900.28 6,053.93
The docstring for pd.set_option or pd.describe_option explains:
display.float_format: [default: None] [currently: None] : callable
The callable should accept a floating point number and return
a string with the desired format of the number. This is used
in some places like SeriesFormatter.
See core.format.EngFormatter for an example.
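The display option only affects how pandas prints the frame. If the formatted strings themselves are needed, for example to hand rows to another library, one possible sketch is to map the same format callable over each column. The fixed numbers below replace the random data so the result is predictable.

```python
import pandas as pd

df = pd.DataFrame([[1234.5, 9876543.219], [0.5, 42.0]])

fmt = "{:,.2f}".format  # callable: float -> string with thousands separators

# Apply the formatter to every value, column by column; the result holds strings.
formatted = df.apply(lambda col: col.map(fmt))
print(formatted)

# Plain nested lists of strings, e.g. as input for a table-drawing library.
rows = formatted.values.tolist()
print(rows)
```

Unlike pd.options.display.float_format, this leaves the global pandas settings untouched and keeps the original numeric frame intact.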
