IPython / Jupyter: how to customize the display function - python

I have 3 basic problems when displaying things in IPython / JupyterLab.
(1) I have a pandas dataframe with many columns. First, I make sure I can see a decent portion of it:
import numpy as np
import pandas as pd
np.set_printoptions(linewidth=240,edgeitems=5,precision=3)
pd.set_option('display.width',1800) #width of the display in characters (not pixels)
pd.set_option('display.max_columns',100) #replace the number with None to print all columns
pd.set_option('display.max_rows',10) #max_columns/max_rows sets the maximum number of columns/rows displayed when a frame is pretty-printed
pd.set_option('display.min_rows',9) #once max_rows is exceeded, min_rows determines how many rows are shown in the truncated representation
pd.set_option('display.precision',3) #number of digits in the printed float number
If I print it, everything gets mushed together:
Is it possible to print text wide, i.e. in a way that each line (even if longer) is printed on only 1 line in output, with a slider when lines are wider than the window?
(2) If I display the mentioned dataframe, it looks really nice (has a slider), but some string entries are displayed over 4 rows:
How can I make sure every entry is displayed in 1 row?
(3) The code below produces the output, which works fine:
import numpy as np
import matplotlib.pyplot as plt
x=np.linspace(0.1,30,1000);
fig,ax=plt.subplots(1, 4, constrained_layout=True, figsize=[15,2])
ax=ax.ravel()
ax[0].plot( x, np.sin(x))
ax[1].plot( x, np.log(1+x))
ax[2].plot( x, np.sin(30/x))
ax[3].plot( x, np.sin(x)*np.sin(2*x))
plt.show()
However, when I change [15,2] to [35,2], the figure will only be as wide as the window. How can I achieve that larger widths produce a slider (like display of a dataframe) so that I can make images as wide as I wish?

You solved (1) already by deciding to display the dataframe with the method in (2). Using print to display a dataframe is not very useful in my opinion.
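If plain-text output is still wanted for (1), one option that is not part of the answer above is DataFrame.to_string(), which by default renders every row on a single line instead of wrapping columns. The sketch below uses a made-up frame named wide. Whether the output area then scrolls horizontally or wraps is decided by the front end, so treat that part as an assumption to check in your own JupyterLab.

```python
import numpy as np
import pandas as pd

# Made-up wide frame: 3 rows, 60 columns (illustration only)
wide = pd.DataFrame(np.arange(3 * 60).reshape(3, 60))

# to_string() does not wrap columns unless line_width is given
text = wide.to_string()
lines = text.splitlines()
print(text)
```

Each data row ends up as exactly one (long) line of text, preceded by one header line.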
(2): The display(df) automatically utilizes white spaces to wrap cell content. I did not find a pandas option to disable this behavior. Luckily, someone else had the same problem already and another person provided a solution.
You have to change the style properties of your dataframe. For this you use the Styler, which holds the styled dataframe. I made a short example from which you can copy the line:
import pandas as pd
# Construct data frame content
long_text = 'abcabcabc ' * 10
long_row = [long_text for _ in range(45)]
long_table = [long_row for _ in range(15)]
# Create dataframe
df = pd.DataFrame(long_table)
# Select data you want to output
# df_selected = df.head(5) # First five rows
# df_selected = df.tail(5) # Last five rows
df_selected = df.iloc[[1,3, 7]] # Select row 1,3 and 7
# Create styler of df_selected with changed style of white spaces
# (pandas >= 2.1 deprecates Styler.applymap in favor of Styler.map)
dataframe_styler = df_selected.style.applymap(lambda x:'white-space:nowrap')
# Display styler
display(dataframe_styler)
Output:
(3) As I already mentioned in the comment, you simply have to double click onto the diagram and it is displayed with a slider.

Related

Is there a way to show crosstab count of 0 in a plot.bar chart (Python)?

I'm fairly new to data analysis with Python. I used crosstab on my survey data to count how many responses each rating (1-5) received, then turned the counts into plot.bar charts to visualize them. I added colors from red to green to reflect the data better.
1 = red, 2 = orange, 3 = yellow, 4 = light green, 5 = green
Everything was fine until, in one crosstab, no one answered number 1, so the plot.bar chart only showed answers 2-5 and also colored them incorrectly, since there are only 4 answers now.
I tried dropna=False in the crosstab, but it didn't do anything.
Everything I have imported at the beginning (I don't know if this matters):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
%matplotlib inline
plt.style.use('seaborn-whitegrid')
The code showing how it's supposed to look and, below, what went wrong:
Is there a way to make the crosstab and/or plot.bar show all five categories of answers even if no one answered number 1 and start the colors from number 2 with orange?
You are using crosstab, but I don't know if that is a requirement. Below is an option that simply groups the data and plots it, which does what you need. You can first groupby() and count the number of occurrences of each value (1, 2, 3, ...). If any of the numbers 1-5 is not present, add a new row with a count of 0. After that, you can plot as you did in your code.
df=pd.DataFrame({'col':[2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5]})
rating=[1,2,3,4,5]
df_temp = df.groupby('col').size().reset_index(name='counts') ## New dataframe with the count of each rating that occurs
missing = pd.DataFrame({'col' : [x for x in rating if x not in df['col'].unique()]}) ## Ratings that nobody chose
df_temp = pd.concat([df_temp, missing]).fillna(0).astype(int).sort_values(by='col').reset_index(drop=True) ## DataFrame.append was removed in pandas 2.0; missing ratings get a count of 0
df_temp.set_index('col')['counts'].plot.bar(color = ('#EE9494', '#EEB594', '#EEDD94', '#CFEE94', '#9FEE94')) ## Index by rating so the bars are labelled 1-5
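A shorter variant of the same idea, offered as a sketch rather than as part of the answer above: count with value_counts() and then reindex() onto the full list of ratings, so that any rating nobody chose appears with a count of 0. The names ratings, colors and counts are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col': [2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5]})
ratings = [1, 2, 3, 4, 5]
colors = ['#EE9494', '#EEB594', '#EEDD94', '#CFEE94', '#9FEE94']

# Count each rating, then force all five ratings into the index;
# ratings that never occur are filled with 0 instead of being dropped
counts = df['col'].value_counts().reindex(ratings, fill_value=0)

# One bar per rating, so the i-th color always belongs to rating i+1
ax = counts.plot.bar(color=colors)
```

Because the index always holds all five ratings in order, the color for rating 2 stays orange even when rating 1 has no answers.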

Why is the code not plotting the expected output?

country = str(input())
import matplotlib.pyplot as plt
lines = f.readlines ()
x = []
y = []
results = []
for line in lines:
    words = line.split(',')
f.close()
plt.plot(x,y)
plt.show()
First problem is in the title of the plot. It is giving Population inCountryI instead of Population in Country I.
Second problem is in the graph.
While my answer could point out the mistakes in your code, I think it might also be enlightening to show another, perhaps more standard way, of doing this. This is particularly useful if you're going to do this more often, or with large datasets.
Handling CSV files and creating subgroups out of them by yourself is nice, but can become very tricky. Python already has a built-in csv module, but the Pandas library is nowadays basically the default (there are other options as well) for handling tabular data. Which means it is widely available, and/or easy to install. Plus it goes well with Matplotlib. (Read some of Pandas' user's guide for a good overview.)
With Pandas, you can use the following (I've put comments on the code in between the actual code):
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['figure.figsize'] = (8, 8)
# Read the CSV file into a Pandas dataframe
# For a normal CSV, this will work fine without tweaks
df = pd.read_csv('population.csv')
# Convert the month and year columns to a datetime
# Years have to be converted to string type for that
# '%b%Y' is the format for month abbreviation (English) and 4-digit year;
# see e.g. https://strftime.org/
# Instead of creating a new column, we set the date as the index ("row-indices")
# of the dataframe
df.index = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')
# We can remove the month and year columns now
df = df.drop(columns=['Month', 'Year'])
# For nicety, replace the dot in the country name with a space
df['Country'] = df['Country'].str.replace('.', ' ', regex=False)
# Group the dataframe by country, and loop over the groups
# The resulting grouped dataframes, `grouped`, will have just
# their index (date) values and population values
# The .plot() method will therefore automatically use
# the index/dates as x-axis, and the population as
# y-axis.
for country, grouped in df.groupby('Country'):
    # Use the convenience .plot() method
    grouped.plot()
    # Standard Matplotlib functions are still available
    plt.title(country)
The resulting plots are shown below (2, given the example data).
If you don't want a legend (since there is only one line), use grouped.plot(legend=False) instead.
If you want to pick one specific country, remove and replace the whole for-loop with the following
country = "Country II"
df[df['Country'] == country].plot()
If you want to do even more, also have a look at the Seaborn library.
Resulting plots:
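To try the date-index and groupby steps without the original file, here is a small self-contained sketch. The CSV text is hypothetical: only the Month, Year and Country columns are named in the answer, so the Population column name and all values are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for population.csv (column "Population" is assumed)
csv_text = """Month,Year,Country,Population
Jan,2020,Country.I,100
Feb,2020,Country.I,110
Jan,2020,Country.II,200
Feb,2020,Country.II,190
"""
df = pd.read_csv(io.StringIO(csv_text))

# Same steps as above: build a datetime index, drop the parts, tidy the names
df.index = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')
df = df.drop(columns=['Month', 'Year'])
df['Country'] = df['Country'].str.replace('.', ' ', regex=False)

# One Series per country, indexed by date
groups = {country: grouped['Population'] for country, grouped in df.groupby('Country')}
for country, series in groups.items():
    print(country, series.tolist())
```

Each entry in groups is what grouped.plot() would draw: dates on the x-axis and the population values on the y-axis.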

plotly python lines going backwards even after sorting values

I am trying to create a plot which shows each individual's trajectory as well as the mean. This is working OK except that there appear to be extra lines and the lines go backwards, even after sorting the values.
Example:
import pandas as pd
import plotly.graph_objects as go
df = pd.DataFrame({"id": [1,1,1,1,2,2,2,2],
"months": [0,1,2,3,0,1,2,3],
"outcome":[5,2,7,11,18,3,15,3]})
#sort by each individual and the months ie. time column
df.sort_values(by=["id", "months"], inplace=True)
#create mean to overlay on plot
grouped = df.groupby("months")["outcome"].mean().reset_index()
#create plot
fig = go.Figure()
fig.add_trace(go.Scatter(x= df['months'], y= df['outcome'], name = "Individuals"))
fig.add_trace(go.Scatter(x=grouped['months'], y=grouped['outcome'], name = "Mean"))
fig.write_image("test.jpeg", scale = 2)
fig.show()
Now that I'm looking at it it actually looks like it's just creating one giant line for all IDs together, whereas I'd like one line for ID 1, and one line for ID2.
Any help much appreciated. Thanks in advance.
I believe the issue is in your x-values. In Pycharm, I looked at the dataframe and it looks like this:
Your months go from 0-3 and then back to 0-3. I'm a little unclear on what you want to do though - do you want to display only the ones with IDs that match? Such as all the ID with 1 and ID with 2?
Let us know what you expect to see given this dataframe I'm showing, it would be helpful.
EDIT So, I couldn't read the original question. Looking at it more, I believe I can at least answer the first portion however that led me to another bug. The line in question should be changed like so:
fig.add_trace(go.Scatter(x=df['months'][df['id'] == 1], y=df['outcome'][df['id'] == 1], name="Individuals"))
This will pull from the dataframe only where the id == 1, however this then won't show on your graph since your grouped dataframe doesn't fall within the same bounds.

Plotly hovermode=x displays too many values

The hovermode='x' parameter on Plotly (specifically plotly express, which is what I'm using) isn't strict and can get very confusing. For example, I have a dataframe with rows for date (which will be the X), number (which will be the Y), and category (which will be the hue/color, making multiple lines).
If a category has a gap in dates (e.g. category1 has a value for 3/24 but category2 only has values for 3/23 and 3/25), when I hover over the x-value of 3/24, it will show the number value for category1 but instead of not showing anything at all for category2 (or showing a 0/nan), it will show the date and then number value of the closest point. So in this case, hovering on x=3/24 would produce 3 boxes: one (correct) box with the number for category1, and two (incorrect) boxes with the date and number for close category2 points that don't actually have values for 3/24.
In practice, I'm working with a very large, grouped dataset in pyspark, so ideally the categories that don't have data at that date would show a 0. However, not showing a box at all would be acceptable.
I've considered grouping the data and including rows with a count of 0 so that for each date every category has a row, but I couldn't find a way to do it in pyspark and pandas isn't fast enough.
I'm thinking that this must be possible somehow, because using the basic visualizations that Pyspark in Databricks offers works correctly--showing only hover text for categories that are actually present at the x value, and "0" for categories that aren't. The basic visualizations unfortunately only include 1000 rows, though, so they aren't viable.
Code to reproduce the issue:
import datetime
import numpy as np
import pandas as pd
import plotly.express as px
base_categories = ['category1', 'category2']
dates, categories, numbers = [], [], []
for i in range(0, 50):
    dates.append(datetime.datetime(2021, 3, 1, 12, np.random.randint(1, 30)))
    numbers.append(np.random.randint(1, 1000))
    categories.append(base_categories[np.random.randint(0, 2)])
df = pd.DataFrame(columns=['dates', 'categories', 'numbers'])
df.dates = dates
df.numbers = numbers
df.categories = categories
df = df.sort_values('dates')
fig = px.line(df, x='dates', y='numbers', color='categories')
fig.update_traces(mode='lines', hovertemplate=None)
fig.update_layout(height=450, width=750, hovermode='x')
fig
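For reference, the zero-filling idea mentioned above can be sketched in pandas on a tiny made-up frame. This only illustrates the reshaping. It does not address the speed concern for the real pyspark dataset, and the dates and numbers here are invented.

```python
import pandas as pd

# Made-up data: category2 has no row for 2021-03-24
df = pd.DataFrame({
    "dates": pd.to_datetime(["2021-03-23", "2021-03-24", "2021-03-25",
                             "2021-03-23", "2021-03-25"]),
    "categories": ["category1", "category1", "category1",
                   "category2", "category2"],
    "numbers": [10, 20, 30, 5, 7],
})

# Wide table with one column per category; missing combinations become 0
wide = df.pivot_table(index="dates", columns="categories",
                      values="numbers", aggfunc="sum", fill_value=0)

# Back to long form so every date has a row for every category
filled = wide.reset_index().melt(id_vars="dates", var_name="categories",
                                 value_name="numbers")
print(filled)
```

Passing filled instead of df to px.line would give every category a point at every date, so the hover boxes at a given x would all refer to that x.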

How to get multiple legends from multiple pandas plots

I've got two dataframes (both indexed on time), and I'd like plot columns from both dataframes together on the same plot, with legend as if there were two columns in the same dataframe.
If I turn on legend with one column, it works fine, but if I try to do both, the 2nd one overwrites the first one.
import pandas as pd
# Use ERDDAP's built-in relative time functionality to get the last 7 days:
start='now-7days'
stop='now'
# URL for wind data
url='http://www.neracoos.org/erddap/tabledap/E01_met_all.csv?\
station,time,air_temperature,barometric_pressure,wind_gust,wind_speed,\
wind_direction,visibility\
&time>=%s&time<=%s' % (start,stop)
# load CSV data into Pandas
df_met = pd.read_csv(url,index_col='time',parse_dates=True,skiprows=[1]) # skip the units row
# URL for wave data
url='http://www.neracoos.org/erddap/tabledap/E01_accelerometer_all.csv?\
station,time,mooring_site_desc,significant_wave_height,dominant_wave_period&\
time>=%s&time<=%s' % (start,stop)
# Load the CSV data into Pandas
df_wave = pd.read_csv(url,index_col='time',parse_dates=True,skiprows=[1]) # skip the units row
plotting one works fine:
df_met['wind_speed'].plot(figsize=(12,4),legend=True);
but if I try to plot both, the first legend disappears:
df_met['wind_speed'].plot(figsize=(12,4),legend=True)
df_wave['significant_wave_height'].plot(secondary_y=True,legend=True);
Okay, thanks to the comment by unutbu pointing me to essentially the same question (which I searched for but didn't find), I just need to modify my plot command to:
df_met['wind_speed'].plot(figsize=(12,4))
df_wave['significant_wave_height'].plot(secondary_y=True);
ax = plt.gca()  # needs "import matplotlib.pyplot as plt" outside pylab mode
lines = ax.left_ax.get_lines() + ax.right_ax.get_lines()
ax.legend(lines, [l.get_label() for l in lines])
and now I get this, which is what I was looking for:
Well. Almost. It would be nice to get the (right) and (left) on the legend to make it clear which scale was for which line. @unutbu to the rescue again:
df_met['wind_speed'].plot(figsize=(12,4))
df_wave['significant_wave_height'].plot(secondary_y=True);
ax = plt.gca()  # needs "import matplotlib.pyplot as plt" outside pylab mode
lines = ax.left_ax.get_lines() + ax.right_ax.get_lines()
ax.legend(lines, ['{} ({})'.format(l.get_label(), side) for l, side in zip(lines, ('left', 'right'))]);
produces:
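A variant that does not depend on gca() or on the left_ax/right_ax attributes is to keep the axes objects that the two plot calls return. This is a sketch on synthetic data standing in for df_met and df_wave, not the NERACOOS feed. mark_right=False is passed on the assumption that your pandas version would otherwise add its own "(right)" marker to the label.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two time-indexed dataframes (values are made up)
idx = pd.date_range("2021-01-01", periods=7, freq="D")
rng = np.random.default_rng(0)
df_met = pd.DataFrame({"wind_speed": rng.uniform(0, 15, len(idx))}, index=idx)
df_wave = pd.DataFrame({"significant_wave_height": rng.uniform(0, 3, len(idx))},
                       index=idx)

# Each plot call returns the axes it drew on: left first, then the secondary axis
ax_left = df_met["wind_speed"].plot(figsize=(12, 4))
ax_right = df_wave["significant_wave_height"].plot(secondary_y=True,
                                                   mark_right=False)

# Collect the lines from both axes and build one combined legend
lines = ax_left.get_lines() + ax_right.get_lines()
labels = [f"{line.get_label()} ({side})"
          for line, side in zip(lines, ("left", "right"))]
ax_left.legend(lines, labels)
```

Holding on to ax_left and ax_right makes it explicit which axis each line belongs to, rather than relying on which axes happens to be current.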
