Py Pandas .format(dataframe) - python

As a Python newbie, I recently discovered that with Python 2.7 I can do something like:
print '{:20,.2f}'.format(123456789)
which will give the resulting output:
123,456,789.00
I'm now looking for a similar outcome for a pandas DataFrame, so my code was:
import pandas as pd
import random
data = [[random.random()*10000 for i in range(1, 4)] for j in range(1, 8)]
df = pd.DataFrame(data)
print '{:20,.2f}'.format(df)
In this case I have the error:
Unknown format code 'f' for object of type 'str'
Any suggestions on how to perform something like '{:20,.2f}'.format(df)?
For now my idea is to index the DataFrame (it's a small one), format each individual float within it, maybe assign astype(str), and rebuild the DF ... but that looks so ugly :-( and I'm not even sure it'll work ..
What do you think? I'm stuck ... and would like to have a better format for my dataframes when they are converted to reportlab grids.

import pandas as pd
import numpy as np
data = np.random.random((7, 3)) * 10000
df = pd.DataFrame(data)
pd.options.display.float_format = '{:20,.2f}'.format
print(df)
yields (random output similar to)
          0         1         2
0  4,839.01  6,170.02    301.63
1  4,411.23  8,374.36  7,336.41
2  4,193.40  2,741.63  7,834.42
3  3,888.27  3,441.57  9,288.64
4    220.13  6,646.20  3,274.39
5  3,885.71  9,942.91  2,265.95
6  3,448.75  3,900.28  6,053.93
The docstring for pd.set_option or pd.describe_option explains:
display.float_format: [default: None] [currently: None] : callable
The callable should accept a floating point number and return
a string with the desired format of the number. This is used
in some places like SeriesFormatter.
See core.format.EngFormatter for an example.
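Note that display.float_format only changes how the frame is printed; the underlying values remain floats. If you need actual formatted strings in every cell (e.g. to hand to a reportlab grid), here is a minimal sketch using DataFrame.applymap, which applies the formatter elementwise (the same method is named DataFrame.map in pandas 2.1+):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random((7, 3)) * 10000)

# applymap runs the formatter on every cell and returns a DataFrame of strings
df_str = df.applymap('{:20,.2f}'.format)
print(df_str.dtypes)  # every column is now object (str)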

Related

type conversion in pandas on assignment of DataFrame series

I am noticing something a little strange in pandas (1.4.3). Is this the expected behaviour, the result of an optimization, or a bug? Basically I'd like to guarantee that the type does not change unexpectedly, or at least to see an error raised, so any tips are welcome.
If you assign all values of a series in a DataFrame this way, the dtype is altered
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({"a": np.array([1,2,3], dtype="int32")})
>>> df1.iloc[:, df1.columns.get_loc("a")] = 0
>>> df1["a"].dtype
dtype('int64')
and if you index the rows in a different way, pandas does not convert the dtype
>>> df2 = pd.DataFrame({"a": np.array([1,2,3], dtype="int32")})
>>> df2.iloc[0:len(df2.index), df2.columns.get_loc("a")] = 0
>>> df2["a"].dtype
dtype('int32')
Not really an answer, but some thoughts that might help you in your quest. My guess as to what is happening is this: of the multiple choices in your question, I am picking option A, optimization.
I think when pandas sees df1.iloc[:, df1.columns.get_loc("a")] = 0 it treats it as a full replacement of the column across all rows. No slicing, even though df1.iloc[: ... ] is involved: [:] gets translated into an all-rows, not-a-slice mode. When it sees = 0 it broadcasts that into a full column of int64. And since it is a full replacement, the new column simply takes the dtype of the value being assigned.
But when it sees df2.iloc[0:len(df2.index), df2.columns.get_loc("a")] = 0 it goes into index-slice mode. Even though the slice actually covers the whole column, pandas doesn't know that and makes an early decision to treat it as an index slice. Index-slice mode operates on the assumption that only part of the column is going to be updated, an update rather than a replacement, so the column retains its existing dtype.
I got the above hypothesis from looking around at this: https://github.com/pandas-dev/pandas/blob/main/pandas/core/indexes/base.py
If I didn't have a day job I might have the time to actually find the smoking gun in those 6242 lines of code.
If you look at this code (I wrote your code a little differently to see what is happening in the middle):
import pandas as pd
import numpy as np

dfx = pd.DataFrame({"x": np.array([4, 5, 6], dtype="int32")})
P = dfx.iloc[:, dfx.columns.get_loc("x")] = 0
P1 = dfx.iloc[:, dfx.columns.get_loc("x")]
# the column now holds zeros, but its dtype has silently been promoted
# to int64, the platform default integer type
print(P1)
print(P)
print(dfx["x"].dtype)  # int64

dfy = pd.DataFrame({"y": np.array([4, 5, 6], dtype="int32")})
Q = dfy.iloc[0:len(dfy.index), dfy.columns.get_loc("y")] = 0
print(Q)
Q1 = dfy.iloc[0:len(dfy.index), dfy.columns.get_loc("y")]
print(Q1)
print(dfy["y"].dtype)  # int32

print(len(dfx.index))
print(len(dfy.index))
I don't know why this is happening, but adding square brackets (so the column indexer is a list rather than a scalar) seems to solve the issue:
df1.iloc[:, [df1.columns.get_loc("a")]] = 0
Another solution seems to be:
df1.iloc[range(len(df1.index)), df1.columns.get_loc("a")] = 0
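If the goal is simply to fail loudly whenever an assignment would alter a dtype, here is a minimal defensive sketch (the helper name assign_preserving_dtype is mine, not a pandas API):
import numpy as np
import pandas as pd

def assign_preserving_dtype(df, col, value):
    # broadcast the value over the whole column, cast explicitly to the
    # existing dtype, and raise if the dtype changed anyway
    before = df[col].dtype
    df[col] = pd.Series(value, index=df.index, dtype=before)
    assert df[col].dtype == before, f"dtype changed: {before} -> {df[col].dtype}"

df1 = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})
assign_preserving_dtype(df1, "a", 0)
print(df1["a"].dtype)  # int32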

Importing numbers as string into a dataframe from text

I'm trying to import a text file into Python as a dataframe.
My text file essentially consists of 2 columns, both of which are numbers.
The problem is: I want one of the columns to be imported as a string (since many of the 'numbers' start with a zero, e.g. 0123, and I will need this column to merge the df with another later on)
My code looks like this:
mydata = pd.read_csv("text_file.txt", sep = "\t", dtype = {"header_col2": str})
However, I still lose the zeros in the output, so a 4-digit number is turned into a 3-digit number.
I'm assuming there is something wrong with my import code but I could not find any solution yet.
I'm new to python/pandas, so any help/suggestions would be much appreciated!
Hard to see why your original code is not working; it looks correct:
from io import StringIO
import pandas as pd
# this mimics your data
mock_txt = StringIO("""header_col2\theader_col3
0123\t5
0333\t10
""")
# same reading as you suggested
df = pd.read_csv(mock_txt, sep = "\t", dtype = {"header_col2": str})
# are they really strings?
assert isinstance(df.header_col2[0], str)
assert isinstance(df.header_col2[1], str)
P.S. As always at SO, it's really nice to have some of the data and a minimal working example with code in the original post.
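If the leading zeros were already lost in an earlier import, they can usually be restored afterwards, assuming the codes have a known fixed width (4 in this example):
# pad the column back to 4 characters with leading zeros
df["header_col2"] = df["header_col2"].astype(str).str.zfill(4)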

Converting list of strings to list of floats in pandas

I have what I assumed would be a super basic problem, but I'm unable to find a solution. The short of it is that I have a column in a csv that is a list of numbers. The csv was generated by pandas with to_csv. When trying to read it back in with read_csv, that list of numbers is automatically converted into a string.
When I then try to use it, I obviously get errors. Using the to_numeric function fails as well, because the value is a list, not a single number.
Is there any way to solve this? Posting code below for form, but probably not extremely helpful:
def write_func(dataset):
    features = featurize_list(dataset[column])  # Returns numpy array
    new_dataset = dataset.copy()  # Don't want to modify the underlying dataframe
    new_dataset['Text'] = features
    new_dataset.rename(columns={'Text': 'Features'}, inplace=True)
    write(new_dataset, dataset_name)

def write(new_dataset, dataset_name):
    dump_location = feature_set_location(dataset_name, self)
    new_dataset.to_csv(dump_location)

def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(pd.to_numeric)
The Features column is the one in question. When I attempt to run the apply currently in read_func I get this error:
ValueError: Unable to parse string "[0.019636873200000002, 0.10695576670000001,...]" at position 0
I can't be the first person to run into this issue; is there some way to handle this at read/write time?
You want to use literal_eval as a converter passed to pd.read_csv. Below is an example of how that works.
from ast import literal_eval
from io import StringIO
import pandas as pd
txt = """col1|col2
a|[1,2,3]
b|[4,5,6]"""
df = pd.read_csv(StringIO(txt), sep='|', converters=dict(col2=literal_eval))
print(df)
  col1       col2
0    a  [1, 2, 3]
1    b  [4, 5, 6]
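Since you asked about handling this at read/write time: one option (a sketch, not from the original post) is to serialize the list column explicitly on the way out, e.g. as JSON, which can then be parsed back reliably:
import json
import pandas as pd

df = pd.DataFrame({'Features': [[0.1, 0.2], [0.3, 0.4]]})
df['Features'] = df['Features'].apply(json.dumps)  # write each list as a JSON string
df.to_csv('features.csv', index=False)

# read the JSON strings back into real lists of floats
df2 = pd.read_csv('features.csv', converters={'Features': json.loads})
print(df2['Features'][0])  # [0.1, 0.2]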
I have modified your last function a bit and it works fine: parse the string back into a list with literal_eval first, then convert the items to numbers.
from ast import literal_eval

def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(literal_eval).apply(pd.to_numeric)

Converting pandas column to string w/o scientific notation

Lots of questions address this, but none of the solutions seem to work exactly as I need.
I have a dataframe with two columns of numbers with 10-20 digits each. These are actually ID #s, and I'd like to concatenate them. It looks like that's best done by first converting the values to strings.
However, when converting with .astype(str), pandas keeps the scientific notation, which won't fly.
Things I've tried:
Tried: the dtype arg ('str') or converters (using str()) in read_csv().
Outcome: df.dtypes still lists 'object', and values still display in scientific notation.
Tried: pd.set_option('display.float_format', lambda x: '%.0f' % x).
Outcome: displays fine in df.head(), but reverts to scientific notation upon coercion to string and concatenation using the + operator.
Tried: coercing to int, str, or str(int(x)).
Outcome: int works when I coerce one value with int(), but not when I use astype(int); using .apply() with int() throws an 'invalid literal for long() with base 10' error.
This feels like it should be pretty straightforward; I'm anxious to figure out what I'm missing.
What you tried sets the display format. You could just format the float as a string in the dataframe.
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'a': np.random.randint(low=1, high=100, size=10)*1e20,
                        'b': np.random.randint(low=1, high=100, size=10)*1e20})
df.apply(lambda x: '{0:20.0f}|{1:20.0f}'.format(x.a, x.b), axis=1)
Out[34]:
0 9699999999999998951424|4600000000000000000000
1 300000000000000000000|2800000000000000000000
2 9400000000000000000000|9000000000000000000000
3 2100000000000000000000|4500000000000000000000
4 5900000000000000000000|4800000000000000000000
5 7700000000000000000000|6200000000000000000000
6 1600000000000000000000|8000000000000000000000
7 100000000000000000000|400000000000000000000
8 9699999999999998951424|8000000000000000000000
9 4500000000000000000000|3500000000000000000000
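If you want the two formatted columns kept separate before concatenating, the same idea works per column with Series.map; a short sketch (column names as above):
# format each ID column as a plain digit string, then concatenate
df['a_str'] = df['a'].map('{:.0f}'.format)
df['b_str'] = df['b'].map('{:.0f}'.format)
combined = df['a_str'] + '|' + df['b_str']
Note the float precision loss already visible above (97e20 prints as 9699999999999998951424): with 10-20 digit IDs, floats cannot represent every value exactly, so if the exact digits matter, read the IDs as strings from the source instead of converting after the fact.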

Python Pandas Pivot - Why Fails

I have tried for a while to get this to work and I can't. I read the documentation and I must be misunderstanding something.
I have a DataFrame in long format and I want to make it wide; this is quite common, but I get an error:
from pandas import DataFrame
data = DataFrame({'value': [1,2,3,4,5,6,7,8,9,10,11,12],
                  'group': ['a','a','a','b','b','b','b','c','c','c','d','d']})
data.pivot(columns='group')
The error I get is (the last part, as the messages are quite extensive): ValueError: cannot label index with a null key
I tried this in Python (notebook) and also on the regular Python command line in OS X, with the same result.
Thanks for any insight, I am sure it will be something basic
From what you were trying to do: you did not pass an index, so pivot falls back to index=None and fails.
It should be:
data.pivot(data.index, 'group')
or,
# the format is pivot(index=None, columns=None, values=None)
data.pivot(index=data.index, columns='group')
However, I'm not entirely sure what output you expect. If you just want a shorter presentation, you can always transpose:
data.T
or, best for presentation in your case, use groupby:
data.groupby('group').sum()
       value
group
a          6
b         22
c         27
d         23
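If you really need a wide frame (one column per group), note that the groups here have unequal sizes, so you need an explicit row position within each group to serve as the index; a sketch, where pos is a made-up helper column:
# number the rows within each group, then pivot on that position
data['pos'] = data.groupby('group').cumcount()
wide = data.pivot(index='pos', columns='group', values='value')
print(wide)  # NaN where a group has fewer rows than the longest one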
