Rendering an 12k row dataframe with MultiIndex in Jupyter Notebook

Rendering an 12k row dataframe with MultiIndex in Jupyter Notebook - python

A generalized recreation of the table I'm having issues with, which will come out being a table with 12000 rows and a MultiIndex. The big issue: I can't get this table to display in a rendered version of a Jupyter Notebook.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(12000, 6), columns=list('ABCDCD'))
second_cols = ([''] * 2) + ['X', 'X', 'Y', 'Y']
df.columns = pd.MultiIndex.from_arrays([df.columns,second_cols])
df = df.swaplevel(0,1,1).sort_index(1)
df
My company uses proprietary software that lets us turn an .ipynb file into a interactive report. However, most of my issues are Jupyter notebook-level. Here's what I've tried and what's happened:
Just display df: The tables print out, but only show rows [0:29, 19930:12000] and there is no sorting available.
display(HTML(df.to_html())): gives me the following error:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
DataViz and GoogleCharts do not support MultiIndexes.
Any ideas to make my dataframe render?! Perhaps packages that can paginate my table, etc? Note the big issue here is the MultiIndex.

Related

GUI for editing and saving a python pandas dataframe

In a python function I want to show the user a pandas dataframe and let the user edit the cells in the dataframe. My function should use the edited values in that dataframe (i.e. they should be saved).
I've tried pandasgui, but it does not seem to return the edits to the function.
Is there a function/library I can use for this?

Recently solved this problem with dtale
import pandas as pd
import dtale
df = pd.read_csv('table_data.csv')
dt = dtale.show(df) # create dtale with our df
dt.open_browser() # bring it to a new tab (optional)
df = dt.data # connect all the updates from dtale gui and our df
# (so rn if u edit any cell, you will immediately get the result saved in ur df)
Yesterday I came across with some bugs while using dtale. Filtering broke my changes and creating some new rows I dont need.
Usually I use dtale and pandasgui together.
Hope it helps!

Using Chunksize and Dask to process 8GB Redshift table in Pandas with Missingno

I have successfully connected Python to a redshift table with my Jupyter Notebook.
I sampled 1 day of data (176707 rows) and performed a function using Missingno to assess how much data is missing and where. No problem there.
Here's the code so far (redacted for security)...
#IMPORT NECESSARY PACKAGES FOR REDSHIFT CONNECTION AND DATA VIZ
import psycopg2
from getpass import getpass
from pandas import read_sql
import seaborn as sns
import missingno as msno
#PASSWORD INPUT PROMPT
pwd = getpass('password')
#REDSHIFT CREDENTIALS
config = { 'dbname': 'abcxyz',
'user':'abcxyz',
'pwd':pwd,
'host':'abcxyz.redshift.amazonaws.com',
'port':'xxxx'
}
#CONNECTION UDF USING REDSHIFT CREDS AS DEFINED ABOVE
def create_conn(*args,**kwargs):
config = kwargs['config']
try:
con=psycopg2.connect(dbname=config['dbname'], host=config['host'],
port=config['port'], user=config['user'],
password=config['pwd'])
return con
except Exception as err:
print(err)
#DEFINE CONNECTION
con = create_conn(config=config)
#SQL TO RETRIEVE DATASET AND STORE IN DATAFRAME
df = read_sql("select * from schema.table where date = '2020-06-07'", con=con)
# MISSINGNO VIZ
msno.bar(df, labels=True, figsize=(50, 20))
This produces the following, which is exactly what I want to see:
However, I need to perform this task on a subset of the entire table, not just one day.
I ran...
SELECT "table", size, tbl_rows FROM SVV_TABLE_INFO
...and I can see that the table is a total size of 9GB and 32.5M rows, although the sample I need to assess the data completion of is 11M rows
So far I have identified 2 options for retrieving a larger dataset than the ~18k rows from my initial attempt.
These are:
1) Using chunksize
2) Using Dask
Using Chunksize
I replaced the necessary line of code with this:
#SQL TO RETRIEVE DATASET AND STORE IN DATAFRAME
df = read_sql("select * from derived.page_views where column_name = 'something'", con=con, chunksize=100000)
This still took several hours to run on a MacBook Pro 2.2 GHz Intel Core i7 with 16 GB RAM and gave memory warnings toward the end of the task.
When it was complete I wasn't able to view the chunks anyway and the kernel disconnected, meaning the data held in memory was lost and I'd essentially wasted a morning.
My question is:
Assuming this is not an entirely foolish endeavour, would Dask be a better approach? If so, how could I perform this task using Dask?
The Dask documentation gives this example:
df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
npartitions=10, index_col='id') # doctest: +SKIP
But I don't understand how I could apply this to my scenario whereby I have connected to a redshift table in order to retrieve the data.
Any help gratefully recieved.

Moving dataframes between notebooks

I'm trying to move two dataframes from notebook1 to notebook2
I've tried using nbimporter:
import nbimporter
import notebook1 as nb1
nb1.df()
Which returns:
AttributeError: module 'notebook1' has no attribute 'df' (it does)
I also tried using ipynb but that didn't work either
I would just write it to a excel file and read it but the index gets messed up when reading it in the other notebook.

You could use a magic (literally what it's called, not me being cute lol) command called store. It works like this:
In notebook A:
df = pd.DataFrame(...)
%store df # Store the variable df in the IPython database
Then in another notebook B:
%store -r # This will load variables from the IPython database
df
An advantage of this approach is that you won't run into problems with datatypes changing or indexes getting messed up. This will work with variable types other than pandas dataframes too.
The official documentation displays some more features here

You could do something like this to save it as a csv:
df.to_csv('example.csv')
And then while accessing it in another notebook simply use:
df = pd.read_csv('example.csv', index_col=0)

I propose to use pickle to save then load your dataframe
From the first notebook
df.to_pickle("./df.pkl")
then from the second notebook
df = pd.read_pickle("./df.pkl")

How do I ensure 'read_excel' on Pandas reads the correct sheet?

The following piece of code is getting the data from Excel in the 5th row and the 14th row:
import pandas as pd
import pymssql
df=[]
fp = "G:\\Data\\Hotels\\ABZPD - Daily Strategy Tool.xlsm"
data = pd.read_excel(fp,sheet_name ="CRM View" )
row_date = data.loc[2, :]
row_sita = "ABZPD"
row_event = data.iloc[11, :]
df = pd.DataFrame({'date': row_date,
'sita': row_sita,
'event': row_event
})
print(df)
However, it is not actually using the worksheet I need it to. Instead of using "CRM View" (like I told it to!) it is using the worksheet "Previous CRM View". I assume this is because both worksheets have similar names.
So the question is, how do I get it to use the one that is called "CRM View"?

I was able to reproduce your problem. It didn't seem like it was about that the supplied sheet name is similar, it just read the first sheet in the file no matter what you put sheet_name to.
Anyway, It seemed like a bug so I checked what version of pandas I was running, which was 0.20.3. After updating to 0.22.0 the problem was gone and the right sheet was selected.
Edit: this was apparently a known bug in 0.20.3.

How to copy/paste a dataframe from iPython into Google Sheets or Excel?

I've been using iPython (aka Jupyter) quite a bit lately for data analysis and some machine learning. But one big headache is copying results from the notebook app (browser) into either Excel or Google Sheets so I can manipulate results or share them with people who don't use iPython.
I know how to convert results to csv and save. But then I have to dig through my computer, open the results and paste them into Excel or Google Sheets. That takes too much time.
And just highlighting a resulting dataframe and copy/pasting usually completely messes up the formatting, with columns overflowing. (Not to mention the issue of long resulting dataframes being truncated when printed in iPython.)
How can I easily copy/paste an iPython result into a spreadsheet?

Try using the to_clipboard() method. E.g., for a dataframe, df: df.to_clipboard() will copy said dataframe to your clipboard. You can then paste it into Excel or Google Docs.

If df.to_clipboard doesn't work. This will work.
import io
with io.StringIO() as buffer:
df.to_csv(buffer, sep=' ', index=False)
print(buffer.getvalue())
Then, you can copy the printed dataframe and paste it in Excel or Google Sheets.

Paste the output to an IDE like Atom and then paste in Google Sheets/Excel

I use display() instead of print() and it works fine for me. Example:
from IPython.display import display
import pandas as pd
dict = {'Name' : ['Alice', 'Bob', 'Charlie'],
'English' : [73, 55, 90],
'Math' : [78, 100, 33],
'Geography' : [92, 87, 72]}
df = pd.DataFrame(dict)
display(df)
The result can easily be copied and pasted into Excel and formatting won't be messed up. This method also works with Colab.

If you are able to make the csv or html available and reachable by a url - you can use this in google sheets.
=IMPORTDATA("url to the csv/html file")

In my experience SpreadSheet uses tabulation (\t) to separate cells and newline (\n) to separate rows.
Assuming this I wrote a simple function to convert from clipboard data:
def from_excel_to_list(copy_text):
"""Use it to copy and paste data from SpreadSheet software
(MS Excel, Libreoffice) and convert to a list
"""
if isinstance(copy_text, str):
array = []
rows = copy_text.split("\n") # splits rows
for row in rows:
if len(row): # removes empty lines
array.append(row.split("\t"))
return array
else:
raise TypeError("text must be string")
You can define the function inside Jupiter and use it in this way:
Copy with ctrl-c on the SpreadSheet and than call the function from_excel_to_list pasting the data with ctrl-v inside the double brackets
my_excel_converted = from_excel_to_list("""Paste here with ctrl-v the text""")
Example
Data from ctrl-c:
N U tot
1 18,236 18,236
17 20,37 346,29
5 6,318 31,59
Call The function:
from_excel_to_list("""N U tot
1 18,236 18,236
17 20,37 346,29
5 6,318 31,59
""")
Result in Jupiter:
[['N', 'U', 'tot'],
['1', '18,236', '18,236'],
['17', '20,37', '346,29'],
['5', '6,318', '31,59']]
This is a base for further elaboration.
The same approach can be used to obtain dictionary, namedtuple and so on.

For a small table, you can print the dataframe, use mouse to select the table, copy the table using Ctrl/Cmd + C, go to spreadsheet and paste the table, and you will get the following:
click on the first cell and insert a cell to fix the header:
Done.
PS: for a bigger table, some rows/columns will show as '...', refer to How do I expand the output display to see more columns of a Pandas DataFrame? to show all rows and columns. For a even bigger table (that is difficult to select using the mouse), this method is not so convenient.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Rendering an 12k row dataframe with MultiIndex in Jupyter Notebook - python

Related

GUI for editing and saving a python pandas dataframe

Using Chunksize and Dask to process 8GB Redshift table in Pandas with Missingno

Moving dataframes between notebooks

How do I ensure 'read_excel' on Pandas reads the correct sheet?

How to copy/paste a dataframe from iPython into Google Sheets or Excel?

Categories

Resources