Passing pandas groupby result to html in a pretty way - python

I wonder how I could pass a pandas groupby result to HTML formatted the way it prints in the console (in my screenshot, the output on the left is from the console and the one on the right is from my HTML view). to_html does not work because it says that
'Series' object has no attribute 'to_html'

Using reset_index() on your groupby result will enable you to treat it as a normal DataFrame, i.e. you can apply to_html to it.
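
A minimal sketch of that approach (df, 'city' and 'sales' are hypothetical names, not from the question):

import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B"], "sales": [10, 20, 5]})

# the aggregation returns a Series indexed by 'city'
grouped = df.groupby("city")["sales"].sum()

# reset_index turns it back into a two-column DataFrame, which does have to_html
html = grouped.reset_index().to_html(index=False)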

You can make sure you output a DataFrame, even if the output is a single series.
I can think of two ways.
import pandas as pd

results_series = df[column_name]  # your results; selecting one column returns a Series
# method 1: select the column with a list of names, which returns a DataFrame
results_df = df[[column_name]]
# method 2: after selection, wrap the Series in a new DataFrame
results_df = pd.DataFrame(results_series)
# then, export to html (writes the table to output.html)
results_df.to_html('output.html')
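
Applied to the groupby case from the question, either method amounts to converting the Series before exporting; Series.to_frame() is another convenient way to do that (a sketch with hypothetical 'city' and 'sales' columns):

grouped = df.groupby("city")["sales"].sum()  # a Series
html = grouped.to_frame().to_html()          # to_frame() wraps it in a one-column DataFrame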

Related

Automatically format pandas table without changing data

Is there a way to format the output of a particular column of a pandas DataFrame (e.g., as currency, with ${:,.2f}, or percentage, with {:,.2%}) without changing the data itself?
In this post I see that map can be used, but it changes the data to strings.
I also see that I can use .style.format (see here) to print a data frame with some formatting, but it returns a Styler object.
I would like to just change the default printout of the DataFrame itself, so that it always prints formatted as specified. (I suppose this means changing __repr__ or _repr_html_.) I'd assume that there is a simple way of doing this in pandas, but I could not find it.
Any help would be greatly appreciated!
EDIT (for clarification): Suppose I have a data frame df:
df = pd.DataFrame({"Price": [1234.5, 3456.789], "Increase": [0.01234, 0.23456]})
I want the column Price to be formatted with "${:,.2f}" and column Increase to be formatted with "{:,.2%}" whenever I print df in a Jupyter notebook (with print or just running a cell ending in df).
I can use
df.style.format({"Price": "${:,.2f}", "Increase": "{:,.2%}"})
but I do not want to type that every time I print df.
I could also do
df["Price"] = df["Price"].map("${:,.2f}".format)
df["Increase"] = df["Increase"].map("{:,.2%}".format)
which does always print as I want (with print(df)), but this changes the columns from float64 to object, so I cannot manipulate the data frame anymore.
This would be a natural feature, but pandas cannot guess what your format is, and a Styler is a separate object that has to be told about such decisions each time it is created; it does not dynamically update if you change your DataFrame.
The best you can do is create a generic print helper via a Styler:
def p(df):
    styler = df.style
    styler.format({"Price": "${:,.2f}", "Increase": "{:,.2%}"})
    return styler

df = pd.DataFrame([[1, 2], [3, 4]], columns=["Price", "Increase"])
p(df)
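
If one global format for all floats is acceptable (unlike the per-column formats asked for above), pandas also has a display option that changes the default printout everywhere; a sketch of that more limited alternative:

import pandas as pd

# every float in every printed DataFrame will now use this format
pd.set_option("display.float_format", "{:,.2f}".format)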

Using apply on a pandas DataFrame with a function that returns another DataFrame

I have a function that takes a Series as input and returns a DataFrame. I want to run this function on every row in a DataFrame and collect all the rows from all the returned DataFrames into a single DataFrame. I know that whenever you want to do something over all rows, apply is the go-to function, but because the function returns a DataFrame, I'm not sure how to use apply to produce the result I need. What is the best way to do this? Is this where the itertuples function that I always see "don't use this, use apply instead" should actually be used?
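
A sketch of one common pattern under these assumptions (expand_row is a hypothetical function that returns a DataFrame per row): build the per-row DataFrames and concatenate them, which sidesteps the question of how apply should stitch them together.

import pandas as pd

def expand_row(row):
    # hypothetical example: turn one input row into two output rows
    return pd.DataFrame({"id": [row["id"], row["id"]],
                         "value": [row["value"], row["value"] * 2]})

df = pd.DataFrame({"id": [1, 2], "value": [10, 20]})

# iterrows yields (index, Series) pairs; concat glues all the pieces into one DataFrame
result = pd.concat([expand_row(row) for _, row in df.iterrows()],
                   ignore_index=True)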

Why does my Pandas DataFrame not display new order using `sort_values`?

New to Pandas, so maybe I'm missing a big idea?
I have a Pandas DataFrame of register transactions with shape like (500,4):
Time datetime64[ns]
Net Total float64
Tax float64
Total Due float64
I'm working through my code in a Python3 Jupyter notebook. I can't get past sorting any column. Working through the different code examples for sort, I'm not seeing the output reorder when I inspect the df. So, I've reduced the problem to trying to order just one column:
df.sort_values(by='Time')
# OR
df.sort_values(['Total Due'])
# OR
df.sort_values(['Time'], ascending=True)
No matter which column title, or which boolean argument I use, the displayed results never change order.
Thinking it could be a Jupyter thing, I've previewed the results using print(df), df.head(), and HTML(df.to_html()) (the last is for Jupyter notebooks). I've also rerun the whole notebook from the CSV import through to this code. And I'm also new to Python 3 (coming from 2.7), so I get stuck with that sometimes, but I don't see how that's relevant in this case.
Another post has a similar problem, Python pandas dataframe sort_values does not work. In that instance, the ordering was on a column type string. But as you can see all of the columns here are unambiguously sortable.
Why does my Pandas DataFrame not display new order using sort_values?
df.sort_values(['Total Due']) returns a sorted DataFrame, but it doesn't update df in place.
So do it explicitly:
df = df.sort_values(['Total Due'])
or
df.sort_values(['Total Due'], inplace=True)
My problem, FYI, was that I wasn't returning the resulting DataFrame, so PyCharm wasn't bothering to update said DataFrame. Naming the DataFrame after the return keyword fixed the issue.
Edit:
I had a bare return at the end of my method instead of
return df
which the debugger must have noticed, because df wasn't being updated in spite of my explicit, in-place sort.

Python pandas.DataFrame.from_csv

The task is a very simple data analysis, where I download a report using an API and it comes as a CSV file. I have been trying to convert it correctly to a DataFrame using the following code:
@staticmethod
def convert_csv_to_data_frame(csv_buffer_file):
    data = StringIO(csv_buffer_file)
    dataframe = DataFrame.from_csv(path=data, index_col=0)
    return dataframe
However, since the CSV doesn't have an index column inside it, the first column of the data I need is being ignored by the DataFrame because it is treated as the index column.
I wanted to know if there is a way to make the DataFrame insert an index column automatically.
Your error here was to assume that the param index_col=0 meant that it would not treat your CSV as having an index column. It should have been index_col=None, which tells pandas to generate a default integer index instead of using the first data column:
@staticmethod
def convert_csv_to_data_frame(csv_buffer_file):
    data = StringIO(csv_buffer_file)
    dataframe = DataFrame.from_csv(path=data, index_col=None)  # don't treat the first column as the index
    return dataframe
For more info consult the docs
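
Note that DataFrame.from_csv has since been deprecated (and removed in pandas 1.0); a sketch of the equivalent with pd.read_csv, which already defaults to index_col=None:

from io import StringIO
import pandas as pd

def convert_csv_to_data_frame(csv_buffer_file):
    # read_csv builds a default RangeIndex when no index_col is given
    return pd.read_csv(StringIO(csv_buffer_file))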

Viewing the content of a Spark Dataframe Column

I'm using Spark 1.3.1.
I am trying to view the values of a Spark DataFrame column in Python. With a Spark DataFrame, I can do df.collect() to view the contents of the DataFrame, but there is no such method for a Spark DataFrame column as far as I can see.
For example, the DataFrame df contains a column named 'zip_code'. So I can do df['zip_code'] and it returns a pyspark.sql.dataframe.Column type, but I can't find a way to view the values in df['zip_code'].
You can access the underlying RDD and map over it:
df.rdd.map(lambda r: r.zip_code).collect()
You can also use select if you don't mind results wrapped using Row objects:
df.select('zip_code').collect()
Finally, if you simply want to inspect the content, then the show method should be enough:
df.select('zip_code').show()
You can simply write:
df.select("your column's name").show()
In your case here, it will be:
df.select('zip_code').show()
To view the complete content:
print(df.select('zip_code').take(1))
(show only gives you a truncated overview).
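
If pandas is installed on the driver, another option (not from the answers above) is to convert the selected column to a pandas DataFrame for easier inspection; a sketch:

# collects the column to the driver as a pandas DataFrame
df.select('zip_code').toPandas()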
