Is there a way to format the output of a particular column of a pandas DataFrame (e.g., as currency, with ${:,.2f}, or percentage, with {:,.2%}) without changing the data itself?
In this post I see that map can be used, but it changes the data to strings.
I also see that I can use .style.format (see here) to print a data frame with some formatting, but it returns a Styler object.
I would just like to change the default printout of the data frame itself, so that it always prints formatted as specified. (I suppose this means changing __repr__ or _repr_html_.) I'd assume there is a simple way of doing this in pandas, but I could not find it.
Any help would be greatly appreciated!
EDIT (for clarification): Suppose I have a data frame df:
df = pd.DataFrame({"Price": [1234.5, 3456.789], "Increase": [0.01234, 0.23456]})
I want the column Price to be formatted with "${:,.2f}" and column Increase to be formatted with "{:,.2%}" whenever I print df in a Jupyter notebook (with print or just running a cell ending in df).
I can use
df.style.format({"Price": "${:,.2f}", "Increase": "{:,.2%}"})
but I do not want to type that every time I print df.
I could also do
df["Price"] = df["Price"].map("${:,.2f}".format)
df["Increase"] = df["Increase"].map("{:,.2%}".format)
which does always print as I want (with print(df)), but this changes the columns from float64 to object, so I cannot manipulate the data frame anymore.
It would be a natural feature, but pandas cannot guess what your format is: each time you create a Styler it has to be told of such decisions, since it is a separate object, and it does not dynamically update if you change your DataFrame.
The best you can do is create a generic print helper that returns a Styler.
import pandas as pd

def p(df):
    # build a Styler with the per-column formats; the underlying data is untouched
    styler = df.style
    styler.format({"Price": "${:,.2f}", "Increase": "{:,.2%}"})
    return styler

df = pd.DataFrame([[1, 2], [3, 4]], columns=["Price", "Increase"])
p(df)  # in Jupyter, the returned Styler renders as a formatted table
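If a single global format for all floats is enough, pandas also has the display.float_format option; it cannot vary per column, but it does change the default printout everywhere, including plain print(df). A minimal sketch:

import pandas as pd

# applies to every float column in every DataFrame's text repr
pd.set_option("display.float_format", "{:,.2f}".format)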
[ 10-07-2022 - For anyone stopping by with the same issue: after much searching, I have yet to find a way that isn't convoluted and complicated to accurately pull mixed-type data from Excel using Pandas/Python. My solution is to convert the files using unoconv on the command line, which preserves the formatting, then read into pandas from there. ]
I have to concatenate 1000s of individual Excel workbooks, each with a single sheet, into one master sheet. I use a for loop to read them into a data frame, then concatenate that to a master data frame. There is one column in each that could represent currency, percentages, or just contain notes. Sometimes it has been filled out with explicit indicators in the cell, e.g. '$'; other times, someone has used cell formatting to indicate currency while leaving just a decimal in the cell. I've been using a formatting routine to catch some of this but have run into some edge cases.
Consider a case like the following:
In the actual spreadsheet, you see: $0.96
When read_excel siphons this in, it is represented as 0.96. Because of the mixed-type nature of the column, there is no sure way to know whether this is 96% or $0.96.
Is there a way to read excel files into a data frame for analysis and record what is visually represented in the cell, regardless of whether cell formatting was used or not?
I've tried using dtype="str" and dtype="object", and have tried both the default and openpyxl engines.
UPDATE
Taking the comments below into consideration, I'm rewriting with openpyxl.
import pandas as pd
import openpyxl
from pathlib import Path

def excel_concat(df_source):
    df_master = pd.DataFrame()
    for index, row in df_source.iterrows():
        excel_file = Path(row['Test Path']) / Path(row['Original Filename'])
        wb = openpyxl.load_workbook(filename=excel_file)
        ws = wb.active  # each workbook has a single sheet
        df_data = pd.DataFrame(ws.values)
        df_master = pd.concat([df_master, df_data], ignore_index=True)
    return df_master

df_master1 = excel_concat(df_excel_files)
This appears to be nothing more than a "longcut" to just calling the openpyxl engine with pandas. What am I missing in order to capture the visible values in the excel files?
Looking here, https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html, I noticed the following:
dtype : Type name or dict of column -> type, default None
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}. Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
converters : dict, default None
    Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
Do you think that might work for you?
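A minimal sketch of what that could look like, assuming the mixed column is named 'Notes' (a hypothetical name):

import pandas as pd

# keep every cell in the mixed column as its raw string representation
df = pd.read_excel('workbook.xlsx', converters={'Notes': str})

One caveat: a converter receives the already-parsed cell value, not its display formatting, so by itself it cannot tell $0.96 from 96%. To see the visual formatting, openpyxl exposes each cell's display format string via cell.number_format (e.g. '0.00%' versus a currency format), which could be inspected alongside the value.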
I am importing a file that is semicolon delimited. My code:
df = pd.read_csv('bank-full.csv', sep = ';')
print(df.shape)
When I use this in Jupyter Notebooks and Spyder I get a shape output of (45211, 1). When I print my dataframe the data looks like this at this point:
<bound method NDFrame.head of age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
0 58;"management";"married";"tertiary";"no";2143...
I can get the correct shape by using
df = pd.read_csv('bank-full.csv', sep = '[;]')
print(df.shape)
or
df = pd.read_csv('bank-full.csv', sep = '\;')
print(df.shape)
However, when I do this, the data seems to get pulled in as though each row were a single string. The first and last columns get preceding and trailing double quotation marks respectively, and nothing I attempt strips them, so either way I am stuck with many of my columns typed as object and unable to force them into integers when needed. My data comes out like this:
"age ""job"" ""marital"" ""education"" ""default"" \
0 "58 ""management"" ""married"" ""tertiary"" ""no""
with final column:
""y"""
0 ""no"""
I have reached out to those in my class and had them send me their .csv file, restarted from scratch, tried a different UI, and even copy/pasted their line of code to read and shape the data and get nothing. I have used every resource except asking this here and am out of ideas.
CSVs are usually separated by commas, but sometimes the cells are separated by different character(s). Since I don't have access to your exact dataset, I will give you advice that should help you overall.
First, look at the CSV and assess what character(s) are separating each value, then use that as the value in "sep" during your pd.read_csv() call.
Then, whatever columns you want to convert to numeric, you can use pd.to_numeric() to convert the data type. This may present problems if any of the values in the column cannot be converted to numeric, and you will then need to do additional data cleaning.
Below is an example of how to do this to a particular column that I am calling "col":
import pandas as pd

# a plain one-character separator keeps the fast C engine and proper quote
# handling; a regex like '[;]' forces the python engine, which leaves the
# literal quotation marks in the data
df = pd.read_csv('bank-full.csv', sep=';')

# 'col' is a placeholder for whichever column you want to convert
df['col'] = pd.to_numeric(df['col'])
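For the additional-cleaning case mentioned above, to_numeric also accepts errors='coerce', which turns unparseable values into NaN instead of raising, e.g.:

df['col'] = pd.to_numeric(df['col'], errors='coerce')  # bad values become NaN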
Let me know if you have further questions, or better yet, share the data with me if you can't get this to work for you.
I'm trying to read from an Excel file that gets pulled into Python and then split into numbers (integers and floats) and everything else. There are numerous columns of different types.
I currently bring in the data with
pd.read_excel
and then split the data up with
DataFrame.select_dtypes("number")
When users upload a time (so 12:30:00) they expect for it to be recognized as a time. However python (currently) treats it as dtype object.
If I specify the column with parse_dates then it works. However, since I don't know what the data is in advance, I ideally want this to be done automatically. I've tried setting parse_dates=True, but it doesn't seem to make a difference.
I'm not sure if there is a way to recognize the datetime after the file is uploaded. Again, however, I would want this to be done without having to specify the column (so anything that can be converted is).
Many Thanks
If your data contains only one column with dtype object (I assume it is a string) you can do the following:
1) filter the column with dtype object
import pandas as pd

# select the single object-dtype column as a Series
datetime_col = df.select_dtypes(object).iloc[:, 0]
2) convert it to seconds
datetime_col_in_seconds = pd.to_timedelta(datetime_col).dt.total_seconds()
Then you can re-append the converted column to your original data and/or do whatever processing you want.
Eventually, you can convert it back to datetime.
datetime_col = pd.to_datetime(datetime_col_in_seconds, unit='s')
If you have more than one column with dtype object you might have to do some more pre-processing, but I guess this is a good way to start tackling your particular case.
This does what I need
for column_name in df.columns:
    try:
        df.loc[:, column_name] = pd.to_timedelta(df.loc[:, column_name].astype(str))
    except ValueError:
        pass
This tries to convert every column into a timedelta. If a column can't be transformed, pd.to_timedelta raises a ValueError and the loop moves on to the next column.
After it runs, any columns that could be recognized as a timedelta format have been transformed.
I wonder how I can pass a pandas groupby result to HTML, formatted the way it prints in the console (pics below). to_html does not work; it says:
Series object has no attribute to_html()
(The one on the left is from the console; the one on the right is from my HTML view.)
Using reset_index() on your GroupBy object will enable you to treat it as a normal DataFrame i.e. apply to_html to it.
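For example, with hypothetical columns 'city' and 'sales':

counts = df.groupby('city')['sales'].sum()  # aggregation returns a Series
html = counts.reset_index().to_html()       # reset_index() yields a DataFrame first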
You can make sure you output a DataFrame, even if the output is a single series.
I can think of two ways.
results_series = df[column_name] # your results, returns a series
# method 1: select column from list, as a DataFrame
results_df = df[[column_name]] # returns a DataFrame
# method 2: after selection, generate a new DataFrame
results_df = pd.DataFrame(results_series)
# then, export to html
results_df.to_html('output.html')
The task is a very simple data analysis, where I download a report using an API and it comes as a CSV file. I have been trying to convert it correctly to a DataFrame using the following code:
@staticmethod
def convert_csv_to_data_frame(csv_buffer_file):
    data = StringIO(csv_buffer_file)
    dataframe = DataFrame.from_csv(path=data, index_col=0)
    return dataframe
However, since the CSV doesn't have an index column inside it, the first column of the data I need is being ignored by the DataFrame because it is considered the index column.
I wanted to know if there is a way to make the dataframe insert an index column automatically.
Your error here was to assume that param index_col=0 meant that it would not treat your csv as having an index column. This should've been index_col=None, which tells from_csv not to use any column as the index (note that, unlike read_csv, from_csv defaults to index_col=0):
@staticmethod
def convert_csv_to_data_frame(csv_buffer_file):
    data = StringIO(csv_buffer_file)
    dataframe = DataFrame.from_csv(path=data, index_col=None)  # no column is used as the index
    return dataframe
For more info consult the docs
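Note that DataFrame.from_csv was deprecated in pandas 0.21 and removed in 1.0; pd.read_csv is the replacement, and its index_col already defaults to None. A minimal sketch of the same helper using it:

import pandas as pd
from io import StringIO

@staticmethod
def convert_csv_to_data_frame(csv_buffer_file):
    # read_csv does not treat any column as the index unless told to
    return pd.read_csv(StringIO(csv_buffer_file))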