I'm plotting a dataframe of binned values using seaborn
dist_to_next_melt = pd.melt(pd.DataFrame(dist_to_next))
dist_to_next_melt["bins"] = pd.qcut(dist_to_next_melt.index, 10)
print(dist_to_next_melt)
variable value bins
0 0 1 (-0.001, 91.7]
1 0 24 (-0.001, 91.7]
2 0 5 (-0.001, 91.7]
3 0 74 (-0.001, 91.7]
4 0 110 (-0.001, 91.7]
.. ... ... ...
913 0 290 (825.3, 917.0]
914 0 6 (825.3, 917.0]
915 0 15 (825.3, 917.0]
916 0 71 (825.3, 917.0]
917 0 0 (825.3, 917.0]
[918 rows x 3 columns]
(I can put the whole df in a pastebin if it seems relevant, but this doesn't look like it's an issue with the data)
I can get a basic plot of my data:
However, when I try to remove my error bars using sns.barplot(data=dist_to_next_melt, x="bins", y="value", color="pink", errorbar=None), as indicated in the docs, I get this error message: AttributeError: 'Rectangle' object has no property 'errorbar'. I was on seaborn 0.11.1; I have just updated to seaborn 0.12.2 and the problem persists.
I'm running this on a Jupyter Notebook, using Conda to manage my modules, if any of this is relevant.
Have I missed something obvious? Or am I using sns.barplot() incorrectly?
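For reference, a minimal sketch of the two call signatures (using the melted DataFrame from above; ci= is the pre-0.12 parameter for error bars, errorbar= is its 0.12+ replacement):
import seaborn as sns

# seaborn < 0.12: error bars are suppressed with ci=None
sns.barplot(data=dist_to_next_melt, x="bins", y="value", color="pink", ci=None)

# seaborn >= 0.12: ci= is deprecated in favour of errorbar=None
sns.barplot(data=dist_to_next_melt, x="bins", y="value", color="pink", errorbar=None)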
I'm working on a dataset with the following columns, N/A counts, and two example records:
Serial No. 0
GRE Score 0
TOEFL Score 0
University Rating 0
SOP 0
LOR 0
CGPA 0
Research 0
Chance of Admit 0
dtype: int64
0: 1 337 118 4 4.5 4.5 9.65 1 0.92
1: 2 324 107 4 4.0 4.5 8.87 1 0.76
The column Chance of Admit is a normalised value ranging from 0 to 1. What I wanted to do was take this column and output corresponding ordered values, where the chance falls into bins such as (low, medium, high) or (unlikely, doable, likely), etc.
What I have come across is that pandas has a built-in function named to_categorical; however, I don't understand it well enough, and what I've read still doesn't quite make sense to me.
This dataset would be used for a decision tree where the labels would be the chance of admit
Thank you for your help
Since they are "normalized" values... why would you need to categorize them? A simple threshold should work, right?
i.e.
0-0.33 low
0.33-0.66 medium
0.66-1.0 high
The only reason you would want to use an automated method would probably be if your number of categories keeps changing?
To do the categorization, you could use pandas cut, but you will need to determine the range and the number of bins (categories). From the docs, this should work, I think.
In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
In [9]: df.head(10)
Out[9]:
value group
0 65 60 - 69
1 49 40 - 49
2 56 50 - 59
3 43 40 - 49
4 43 40 - 49
5 91 90 - 99
6 32 30 - 39
7 87 80 - 89
8 36 30 - 39
9 8 0 - 9
You can then apply the same approach to your Chance of Admit column, filling in the ranges for your discrete bins either by threshold or automatically based on the number of bins.
For your reference:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
IIUC, you want to map a continuous variable to a categorical value based on ranges, for example:
0.96 -> high,
0.31 -> low
...
Pandas provides a function for just that, cut. From the documentation:
Use cut when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable.
Setup
Serial No. GRE Score TOEFL Score ... CGPA Research Chance of Admit
0 1 337 118 ... 9.65 1 0.92
1 2 324 107 ... 8.87 1 0.76
2 2 324 107 ... 8.87 1 0.31
3 2 324 107 ... 8.87 1 0.45
[4 rows x 9 columns]
Assuming the above setup, you could use cut like this:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(labels)
Output
0 high
1 high
2 low
3 medium
Name: Chance of Admit, dtype: category
Categories (3, object): [low < medium < high]
Notice that we are using 3 bins: [(0, 0.33], (0.33, 0.66], (0.66, 1.0]] and that the values of the column Chance of Admit are [0.92, 0.76, 0.31, 0.45]. If you want to change the label names, just change the value of the labels parameter, for example: labels=['unlikely', 'doable', 'likely']. If you need an ordinal value, do:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=list(range(3)))
print(labels)
Output
0 2
1 2
2 0
3 1
Name: Chance of Admit, dtype: category
Categories (3, int64): [0 < 1 < 2]
Finally, to put it all together, you could do the following to add it to your DataFrame:
df['group'] = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(df)
Output
Serial No. GRE Score TOEFL Score ... Research Chance of Admit group
0 1 337 118 ... 1 0.92 high
1 2 324 107 ... 1 0.76 high
2 2 324 107 ... 1 0.31 low
3 2 324 107 ... 1 0.45 medium
[4 rows x 10 columns]
I'm not getting my whole output, nor my column names, on my screen.
import sqlite3
import pandas as pd
hello = sqlite3.connect(r"C:\Users\ravjo\Downloads\Chinook.sqlite")
rs = hello.execute("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000")
df = pd.DataFrame(rs.fetchall())
hello.close()
print(df.head())
actual result:
0 1 2 3 4 ... 6 7 8 9 10
0 1 3390 3390 One and the Same 271 ... 23 None 217732 3559040 0.99
1 1 3392 3392 Until We Fall 271 ... 23 None 230758 3766605 0.99
2 1 3393 3393 Original Fire 271 ... 23 None 218916 3577821 0.99
3 1 3394 3394 Broken City 271 ... 23 None 228366 3728955 0.99
4 1 3395 3395 Somedays 271 ... 23 None 213831 3497176 0.99
[5 rows x 11 columns]
expected result:
PlaylistId TrackId TrackId Name AlbumId MediaTypeId \
0 1 3390 3390 One and the Same 271 2
1 1 3392 3392 Until We Fall 271 2
2 1 3393 3393 Original Fire 271 2
3 1 3394 3394 Broken City 271 2
4 1 3395 3395 Somedays 271 2
GenreId Composer Milliseconds Bytes UnitPrice
0 23 None 217732 3559040 0.99
1 23 None 230758 3766605 0.99
2 23 None 218916 3577821 0.99
3 23 None 228366 3728955 0.99
4 23 None 213831 3497176 0.99
The ... in the middle indicates that some of the data has been omitted from the display. If you want to see all of it, you should modify the pandas display options using the pandas.set_option() method. Documentation here.
In your case, you should set display.max_columns to None so that pandas displays an unlimited number of columns. You will also have to read the column names from the database or set them manually. Refer here on how to read the column names from the database itself.
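As an illustration (a minimal sketch, assuming the same Chinook query as in the question), the cursor's description attribute exposes the column names, which can be passed to the DataFrame constructor:
import sqlite3
import pandas as pd

hello = sqlite3.connect(r"C:\Users\ravjo\Downloads\Chinook.sqlite")
rs = hello.execute("SELECT * FROM PlaylistTrack "
                   "INNER JOIN Track ON PlaylistTrack.TrackId = Track.TrackId "
                   "WHERE Milliseconds < 250000")

# cursor.description is a sequence of 7-item tuples; the first item is the column name
columns = [desc[0] for desc in rs.description]
df = pd.DataFrame(rs.fetchall(), columns=columns)
hello.close()

pd.set_option("display.max_columns", None)  # show every column instead of eliding with ...
print(df.head())
Alternatively, pd.read_sql_query(query, hello) returns a DataFrame with the column names already filled in.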
To display all the columns, please use the code snippet below.
pd.set_option("display.max_columns",None)
By default, pandas limits the number of rows it displays. However, you can change this as per your needs. Here is a helper function I use whenever I need to print a full DataFrame:
def print_full(df):
    import pandas as pd
    pd.set_option('display.max_rows', len(df))
    print(df)
    pd.reset_option('display.max_rows')
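An equivalent approach (not from the answer above, just a sketch) uses pandas' option_context so the row limit is restored automatically, even if printing raises an error:
import pandas as pd

def print_full(df):
    # option_context resets display.max_rows when the block exits
    with pd.option_context('display.max_rows', len(df)):
        print(df)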
I have the below pandas column. I need to convert cells containing the word 'anaphylaxis' to 1 and the cells not containing the word to 0.
So far I have tried the following, but something is missing:
df['Name']= df['Name'].replace(r"^(.(?=anaphylaxis))*?$", 1,regex=True)
df['Name']= df['Name'].replace(r"^(.(?<!anaphylaxis))*?$", 0, regex=True)
ID Name
84 Drug-induced anaphylaxis
1041 Acute anaphylaxis
1194 Anaphylactic reaction
1483 Anaphylactic reaction, due to adverse effect o...
2226 Anaphylaxis, initial encounter
2428 Anaphylaxis
2831 Anaphylactic shock
4900 Other anaphylactic reaction
Use str.contains for case-insensitive matching.
import re
df['Name'] = df['Name'].str.contains(r'anaphylaxis', flags=re.IGNORECASE).astype(int)
Or, more concisely,
df['Name'] = df['Name'].str.contains(r'(?i)anaphylaxis').astype(int)
df
ID Name
0 84 1
1 1041 1
2 1194 0
3 1483 0
4 2226 1
5 2428 1
6 2831 0
7 4900 0
contains is useful when you also want to perform regex-based matching. Although in this case, you can probably get rid of the regex completely by adding regex=False for a bit more performance.
However, for even more performance, use a list comprehension.
import numpy as np
df['Name'] = np.array(['anaphylaxis' in x.lower() for x in df['Name']], dtype=int)
Or even better,
df['Name'] = [1 if 'anaphylaxis' in x.lower() else 0 for x in df['Name'].tolist()]
df
ID Name
0 84 1
1 1041 1
2 1194 0
3 1483 0
4 2226 1
5 2428 1
6 2831 0
7 4900 0
You can use pd.Series.str.contains without regex. This method returns a Boolean series, which we then convert to int.
df['Name']= df['Name'].str.contains('anaphylaxis', case=False, regex=False)\
.astype(int)
Result:
ID Name
0 84 1
1 1041 1
2 1194 0
3 1483 0
4 2226 1
5 2428 1
6 2831 0
7 4900 0
I'm very new to these libraries and I'm having trouble plotting this:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import random
df5 = pd.read_csv('../../../../datos/tiempos-exacto-variando-n-m0.csv', sep=', ', engine='python')
print(df5)
df5['n'] = df5['n'].apply(lambda x: x**2)
sns.jointplot(df5['n'], df5['tiempoTotal'], kind="reg")
sns.plt.show()
And I'm getting this output:
n m tiempoTotal
0 1 0 2274
1 2 0 3370
2 3 0 5709
3 4 0 8959
4 5 0 13354
5 6 0 18503
6 7 0 26329
7 8 0 33859
8 9 0 41110
9 10 0 52710
10 11 0 64364
11 12 0 74142
12 13 0 81072
13 14 0 69332
14 15 0 71027
15 16 0 89721
16 17 0 85459
17 18 0 95217
18 19 0 119210
19 20 0 136888
20 21 0 131903
21 22 0 138395
22 23 0 151222
23 24 0 163542
24 25 0 177236
25 26 0 192475
26 27 0 240162
27 28 0 260701
28 29 0 235752
29 30 0 250835
.. ... .. ...
580 581 0 88306854
581 582 0 89276420
582 583 0 87457875
583 584 0 90807004
584 585 0 87790003
585 586 0 89821530
586 587 0 89486585
587 588 0 88496901
588 589 0 89090661
589 590 0 89110803
590 591 0 90397942
591 592 0 94029839
592 593 0 92749859
593 594 0 105991135
594 595 0 95383921
595 596 0 105155207
596 597 0 114193414
597 598 0 98108892
598 599 0 97888966
599 600 0 103802453
600 601 0 97249346
601 602 0 101917488
602 603 0 104943847
603 604 0 98966140
604 605 0 97924262
605 606 0 97379587
606 607 0 97518808
607 608 0 99839892
608 609 0 100046492
609 610 0 103857464
[610 rows x 3 columns]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-21-63146953b89d> in <module>()
9 df5['n'] = df5['n'].apply(lambda x: x**2)
10 sns.jointplot(df5['n'], df5['tiempoTotal'], kind="reg")
---> 11 sns.plt.show()
AttributeError: 'module' object has no attribute 'plt'
I'm running this in my Jupyter Notebook with Python 2.7.12. Any ideas?
sns.plt.show() works fine for me using seaborn 0.7.1. It could be that this is different in other versions. However, if you import matplotlib.pyplot as plt anyway, you may as well simply use plt.show(), as sns.plt.show() only works because pyplot happens to be available inside the seaborn namespace.
Well, I ran into this issue as well with Seaborn 0.8.1. It turns out that calling sns.plt.show() is bad practice, and the fact that it ever worked was a bug which the developers fixed. Unfortunately, there are many tutorials out there that still advise using sns.plt.show(). This is how I solved it:
Import plt directly: import matplotlib.pyplot as plt
Before you plot anything, set the default aesthetic parameters: sns.set() - important, because otherwise you won't get the Seaborn palettes.
Replace all calls to sns.plt with plt
As of Seaborn 0.8.1, sns.plt.plot() raises the error module 'seaborn' has no attribute 'plt'.
sns.plot() also raises an error; these methods are not in Seaborn's API.
Dropping the "sns." to leave "plt.plot()" (as other answers suggest) does work; the Seaborn look only appears because we've called sns.set() earlier in the script... i.e. Seaborn is making an aesthetic change, while Matplotlib is still the object doing the plotting, via its plt.plot() method.
This script shows sns.set() in action... if you follow the comments and swap sns.set() between different locations in the script, it changes the appearance of the subplots. They look like Seaborn plots, but Matplotlib is doing the plotting.
Seaborn does of course have a load of its own plot methods (like sns.boxplot(), sns.violinplot() etc) but there is no longer a method sns.plt.plot().
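For illustration, a minimal sketch of this division of labour (the data here is made up, and this is not the script referenced above): Seaborn only sets the style, Matplotlib draws the plot.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set()  # apply Seaborn aesthetics; comment this out and the plot reverts to Matplotlib defaults

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))  # Matplotlib does the actual plotting
plt.show()              # not sns.plt.show()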
I just want to confirm that I got the same error using Jupyter inside Anaconda (Feb 2018). I got the code from here, but the error occurred. It turns out that I simply needed to add
import matplotlib.pyplot as plt
on top of
import seaborn as sns
and it works just fine using plt.show() instead of sns.plt.show().
Ensure you have updated your Python shell as well as IDEs like Anaconda.
For example, I had a persistent error in Spyder (hosted under Anaconda) with relplot and catplot until I updated Anaconda as well as seaborn (0.9.0).
Updating via the Anaconda command line should be pretty straightforward, as it was in my case.
I'm attempting to make a HeatMap just like this one using Bokeh.
Here is my DataFrame Data, from which I'm trying to make the HeatMap:
Day Code Total
0 1 6001 44
1 1 6002 40
2 1 6006 8
3 1 6008 2
4 1 6010 38
5 1 6011 1
6 1 6014 19
7 1 6018 1
8 1 6019 1
9 1 6023 10
10 1 6028 4
11 2 6001 17
12 2 6010 2
13 2 6014 4
14 2 6020 1
15 2 6028 2
16 3 6001 48
17 3 6002 24
18 3 6003 1
19 3 6005 1
20 3 6006 2
21 3 6008 18
22 3 6010 75
23 3 6011 1
24 3 6014 72
25 3 6023 34
26 3 6028 1
27 3 6038 3
28 4 6001 19
29 4 6002 105
30 5 6001 52
...
And here is my code:
from bokeh.io import output_file
from bokeh.io import show
from bokeh.models import (
ColumnDataSource,
HoverTool,
LinearColorMapper
)
from bokeh.plotting import figure
output_file('SHM_Test.html', title='SHM', mode='inline')
source = ColumnDataSource(Data)
TOOLS = "hover,save"
# Creating the Figure
SHM = figure(title="HeatMap",
x_range=[str(i) for i in range(1,32)],
y_range=[str(i) for i in range(6043,6000,-1)],
x_axis_location="above", plot_width=400, plot_height=970,
tools=TOOLS, toolbar_location='right')
# Figure Styling
SHM.grid.grid_line_color = None
SHM.axis.axis_line_color = None
SHM.axis.major_tick_line_color = None
SHM.axis.major_label_text_font_size = "5pt"
SHM.axis.major_label_standoff = 0
SHM.toolbar.logo = None
SHM.title.text_alpha = 0.3
# Color Mapping
CM = LinearColorMapper(palette='RdPu9', low=Data.Total.min(), high=Data.Total.max())
SHM.rect(x='Day', y="Code", width=1, height=1,source=source,
fill_color={'field': 'Total','transform': CM})
show(SHM)
When I execute my code I don't get any errors, but I just get an empty frame, as shown in the image below.
I've been struggling to find my mistake. Why am I getting this? Where is my error?
The problem with your code is that the data types you are setting for the x and y axis ranges and the data types in your ColumnDataSource are different. You are setting x_range and y_range to lists of strings, but from the look of your data in CSV format, the values will be treated as integers.
In your case, you want to make sure that your Day and Code columns are in string format.
This can easily be done using pandas:
Data['Day'] = Data['Day'].astype('str')
Data['Code'] = Data['Code'].astype('str')
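A minimal sketch of the order of operations (assuming the imports and the DataFrame Data from the question): convert the columns first, then build the ColumnDataSource, so the values match the string categories in x_range and y_range.
# assuming the imports and the DataFrame `Data` from the question
Data['Day'] = Data['Day'].astype(str)    # '1', '2', ... matches the x_range entries
Data['Code'] = Data['Code'].astype(str)  # '6001', '6002', ... matches the y_range entries

source = ColumnDataSource(Data)  # build the source only after the conversion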