print data_frame.head gives output not as a nice table - python

I have a data set taken from kaggle, and I want to get the result shown here
So, I took that code, changed it a bit and what I ran is this:
# get titanic & test csv files as a DataFrame
titanic_df = pd.read_csv("./input/train.csv")
test_df = pd.read_csv("./input/test.csv")
# preview the data
print titanic_df.head()
This works, as it outputs the right data, but not as neatly as in the tutorial... Can I make it right?
Here is my output (Python 2, Spyder):

Try Jupyter Notebook if you have not used it before. The IPython console (as used in Spyder) wraps a wide DataFrame and prints it across multiple lines, whereas a notebook renders it as a table. What you are seeing on Kaggle is itself a Jupyter notebook.
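If switching to a notebook is not an option, the wrapping in a plain console can often be reduced with pandas display options. Here is a minimal sketch, assuming a made-up wide DataFrame (the column names and the width of 1000 are illustrative, not taken from the question). The result is still plain text, not the HTML table a notebook shows.

```python
import pandas as pd

# Hypothetical wide frame standing in for the Kaggle data.
df = pd.DataFrame({f"col_{i}": range(3) for i in range(20)})

# Temporarily widen the console output so columns are neither wrapped nor elided.
with pd.option_context("display.width", 1000, "display.max_columns", None):
    wide_text = repr(df.head())

print(wide_text)
```

pd.set_option takes the same option names if the setting should persist for the whole session.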

Related

How do I only show certain cells' input or output when exporting a Jupyter Notebook from VSCode?

I only want certain cells and certain cell outputs to show up when I export my Jupyter Notebook from VSCode. I have not been able to get an answer that works from Google, StackOverflow, or ChatGPT.
So when I export the .ipynb file to HTML in VSCode, how do I control which cells are included in the HTML and which are not? For example, what would I do to include just the output of the cell below and not the actual code?
import pandas as pd
import seaborn as sns
df = pd.read_csv("file.csv")
sns.histplot(df['Variable 1'])
This post seems to indicate the best/only option is tagging cells then removing them with nbconvert. This seems inefficient in VSCode, especially compared to the easy output = FALSE or echo = FALSE in RStudio.
This seems like it should be an easy and common question, but I am getting no good solutions from the internet. ChatGPT suggested adding #hide-in-export to the cells I didn't want, but that didn't work.
The StackOverflow post I linked suggested using TagRemovePreprocessor with nbconvert and marking all the cells I want gone but that seems so clunky. Follow-up question: If tagging cells and removing them in export with nbconvert, what is the fastest way to tag cells in VSCode?
Although it is a bit cumbersome, I think this is still a feasible method: open the exported HTML in a browser, press F12 to open the developer tools, and delete the cells or cell outputs you do not want.
I still don't know if there is an easier way but here is what I have done with help from ChatGPT, this blog post, and this StackOverflow answer.
First, write a function that adds a cell tag to the cells you want to hide:
import json
def add_cell_tag(nb_path, tag, cell_indices):
    # Open the .ipynb file
    with open(nb_path, 'r', encoding='utf-8') as f:
        nb = json.load(f)
    # Get the cells from the notebook
    cells = nb['cells']
    # Add the tag to the specified cells
    for index in cell_indices:
        cell = cells[index]
        if 'metadata' not in cell:
            cell['metadata'] = {}
        if 'tags' not in cell['metadata']:
            cell['metadata']['tags'] = []
        cell['metadata']['tags'].append(tag)
    # Save the modified notebook
    with open(nb_path, 'w', encoding='utf-8') as f:
        json.dump(nb, f)
Second, run the function and add a tag (can be any string) to the cells you want to hide in the HTML export:
add_cell_tag(nb_path, 'hide-code', [0, 1, 2])
Finally, use nbconvert in the terminal to export and filter the notebook:
jupyter nbconvert --to html --TagRemovePreprocessor.remove_cell_tags=hide-code path/to/notebook.ipynb
The cells may be removed entirely, or just their input or just their output, using these options instead:
TagRemovePreprocessor.remove_input_tags
TagRemovePreprocessor.remove_single_output_tags
TagRemovePreprocessor.remove_all_outputs_tags
I'm not sure of the difference between those last two. Additionally, I had one helper function to count the cells in the notebook and another to clear all tags in the notebook.
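The two helpers mentioned above are not shown in the answer. A rough sketch of what they might look like follows, working on the notebook's JSON directly (the names count_cells and clear_all_tags are my own, not from the original post):

```python
import json

def count_cells(nb_path):
    """Return the number of cells in the notebook file."""
    with open(nb_path, "r", encoding="utf-8") as f:
        nb = json.load(f)
    return len(nb["cells"])

def clear_all_tags(nb_path):
    """Drop the 'tags' entry from every cell's metadata and save the file."""
    with open(nb_path, "r", encoding="utf-8") as f:
        nb = json.load(f)
    for cell in nb["cells"]:
        # Cells without metadata or without tags are left untouched.
        cell.get("metadata", {}).pop("tags", None)
    with open(nb_path, "w", encoding="utf-8") as f:
        json.dump(nb, f)
```

count_cells is handy for checking which indices are valid before tagging, and clear_all_tags resets the notebook if a tag was appended more than once.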

Writing pandas/numpy statements in python functions

I am working on multiple data sets with similar attributes (column names) in a Jupyter notebook, and it is tiresome to rerun the same commands for each data set to achieve the same result. Can I automate the process and run it for various data sets? Let's say I'm running the following commands for one data set in a Jupyter notebook:
data = pd.read_csv(r"/d/newfolder/test.csv", low_memory=False)
data.head()
list(data.columns)
data_new=data.sort_values(by='column_name')
Now I'd like to save all of these commands in one function and run it for different data sets in the notebook.
Can anyone help me out pls on what are the possible ways? Thanks in advance
IIUC, your issue is that something like print(df) doesn't show as pretty as if you just have df as the last line in a Jupyter cell.
You can have the pretty output whenever you want (as long as your jupyter is updated) by using display!
Modifying your code:
import pandas as pd
from IPython.display import display  # already available by default in recent Jupyter
def process_data(file):
    data = pd.read_csv(file, low_memory=False)
    display(data.head())
    display(data.columns)
    data_new = data.sort_values(by='column_name')
    display(data_new.head())
process_data(r"/d/newfolder/test.csv")
This will output data.head(), data.columns, and data_new.head() from a single cell.
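To run the same steps over several data sets, one option is to have the function return the sorted frame and call it in a loop. Here is a sketch, assuming hypothetical in-memory CSV data in place of real file paths and a sort column literally named column_name. It uses print so that it also runs outside Jupyter; in a notebook, display gives the table rendering.

```python
import io
import pandas as pd

def load_and_sort(source, sort_column):
    """Read a CSV (path or file-like object) and return it sorted by one column."""
    data = pd.read_csv(source, low_memory=False)
    return data.sort_values(by=sort_column)

# Hypothetical stand-ins for several CSV files that share the same columns.
sources = {
    "first": io.StringIO("column_name,value\n3,a\n1,b\n2,c\n"),
    "second": io.StringIO("column_name,value\n9,x\n7,y\n"),
}

# One sorted DataFrame per data set, keyed by a label of your choosing.
results = {name: load_and_sort(src, "column_name") for name, src in sources.items()}

for name, frame in results.items():
    print(name)
    print(frame.head())
```

With real files, the dictionary values would simply be the paths to each CSV.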

Saving `h2o_model.accuracy` printed output to a file

h2o_model.accuracy prints model validation data when executed in a Jupyter Notebook cell (which is desirable, despite the function name). How to save this whole validation output (entire notebook cell contents) to a file? Please test before suggesting redirections.
I'd be careful using %%capture, it doesn't capture html content (tables) in the stdout.
redirect_stdout works flawlessly when used from the Python CLI or a script. IPython/Jupyter might cause issues with tables, as they are displayed rather than printed. Note that you should not use .readlines() to get the results from StringIO; use .getvalue().
You can use h2o_model.save_model_details(path) to persist information about the model to a JSON file (which might serve you better in the long run, but it's not really human-readable).
If you really want output that looks like what you would get from a Jupyter notebook, you can use the following hack:
Create a template Jupyter notebook that contains:
import os
import h2o
h2o.connect(verbose=False)
h2o.get_model(os.environ["H2O_MODEL"])
and in your original notebook add
!H2O_MODEL={h2o_model.key} jupyter nbconvert --to html --execute template.ipynb --output={h2o_model.key}_results.html
You can also create an nbconvert template to hide the code cells.
You should call h2o_model.accuracy() (note the parentheses). The reason the whole model gets printed is a non-idiomatic implementation of __repr__ in h2o models, which prints rather than returning a string (there's a JIRA to fix that).
If you encounter some other situation where you would like to save printed output of some command, you can use redirect_stdout[1] to capture it (assuming you have python 3.4+).
[1] https://docs.python.org/3.9/library/contextlib.html#contextlib.redirect_stdout
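For reference, the redirect_stdout approach looks roughly like this. It is a sketch using a hypothetical noisy_report function in place of a real model call, and it only captures text that is actually printed, not notebook-rendered tables:

```python
import io
import os
import tempfile
from contextlib import redirect_stdout

def noisy_report():
    """Hypothetical stand-in for a call that prints instead of returning text."""
    print("first line of report")
    print("second line of report")

# Collect everything printed inside the with-block.
buffer = io.StringIO()
with redirect_stdout(buffer):
    noisy_report()

captured = buffer.getvalue()  # whole printed text as a single string

# Write the captured text to a file (a temp-dir path chosen for illustration).
out_path = os.path.join(tempfile.gettempdir(), "captured_report.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(captured)
```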
Ok, so only the h2o_model.accuracy output cannot be captured, while xgb_model.cross_validation_metrics_summary or even h2o_model alone can - e.g. like that:
%%capture captured_output
# print model validation
# data to `captured_output`
xgb_model
In another notebook cell:
# print(captured_output.stdout.replace("\n\n","\n"))
with open(filename, 'w') as f:
    f.write(captured_output.stdout.replace("\n\n", "\n"))

load code from a code cell from one jupyter notebook into another jupyter notebook

I want to load (i.e., copy the code as with %load) the code from a code cell in one jupyter notebook into another jupyter notebook (Jupyter running Python, but not sure if that matters). I would really like to enter something like
%load cell[5] notebookname.ipynb
The command would copy all code in cell 5 of notebookname.ipynb into the code cell of the notebook I am working on. Does anybody know a trick for doing that?
Adapting some code found here at Jupyter Notebook, the following will display the code of a specific cell in the specified notebook:
import io
from nbformat import read
def print_cell_code(fname, cellno):
    with io.open(fname, 'r', encoding='utf-8') as f:
        nb = read(f, 4)
    cell = nb.cells[cellno]
    print(cell.source)
print_cell_code("Untitled.ipynb", 2)
Not sure what you want to do once the code is there, but maybe this can be adapted to suit your needs. Try print(nb.cells) to see what read brings in.
You'll probably want to use or write your own nbconvert preprocessor to extract a cell from one notebook and insert it into another. It takes a good amount of research into these docs to understand how to write your own preprocessor, but this is the preferred way.
The quick-fix option is that the nbformat specification is predicated on JSON, which means that if you read in an ipynb file with pure Python (i.e., with open and read), you can call json.loads on it to turn the entire file into a dict. From there, you can access cells in the cells entry (which is a list of cells). So, something like this:
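import json
# Read both notebooks as plain JSON
with open("nb1.ipynb", "r", encoding="utf-8") as f1, open("nb2.ipynb", "r", encoding="utf-8") as f2:
    nb1, nb2 = json.loads(f1.read()), json.loads(f2.read())
# Adds nb1's first cell to the end of nb2 (in memory only; nb2.ipynb is not rewritten here)
nb2["cells"].append(nb1["cells"][0])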
This assumes (as does your question) there is no metadata conflict between the notebooks.
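To make the change stick, the modified dict has to be written back to disk. Here is a sketch of the full round trip, with a hypothetical helper name copy_cell and under the same assumption that nothing in the copied cell conflicts with the destination notebook (for example, a duplicate cell id):

```python
import json

def copy_cell(src_path, dst_path, cell_index):
    """Append one cell from the source notebook to the destination notebook and save it."""
    with open(src_path, "r", encoding="utf-8") as f:
        src = json.load(f)
    with open(dst_path, "r", encoding="utf-8") as f:
        dst = json.load(f)
    # The cell is copied as-is, including its outputs and metadata.
    dst["cells"].append(src["cells"][cell_index])
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(dst, f, indent=1)
```

Calling copy_cell("nb1.ipynb", "nb2.ipynb", 0) would then do the same as the snippet above and also save nb2.ipynb. Reopen the notebook in Jupyter afterwards to see the new cell.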

Cannot print the first 5 rows using df.head(5)

Complete noob here. I've been trying the following instructions to access a data set on Kaggle and to read the first 5 rows.
https://towardsdatascience.com/simple-and-multiple-linear-regression-with-python-c9ab422ec29c
I'm using Spyder, and when I run the following code, I only get a runfile wdir= line in the console.
Here is the code:
import pandas as pd
df=pd.read_csv('weight-height.csv')
df.head(5)
Output:
Code and Console Output
The Medium post is probably using a Jupyter notebook, which takes the value of the last expression in a cell and shows it as formatted output below the cell, without a print. In a regular Python script, IDLE, or other IDEs, you need to call the print function explicitly to write to the terminal/console.
import pandas as pd
df = pd.read_csv('weight-height.csv')
print(df.head(5))
