Create many distribution plots using For loop with seaborn - python

I'm trying to create many distribution plots at once for a few different fields. I have written a simple for loop, but I always make the same mistake and Python doesn't understand what "i" is.
This is the code I have written:
for i in data.columns:
    sns.distplot(data[i])
KeyError: 'i'
I have also tried to put 'i' instead of i, but I get error:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
I believe my mistake is something basic that I don't know about loops, so understanding it will help me a lot in the future.
My end goal is to get many distribution plots (with skewness and kurtosis values) at once without writing each one of them.

To loop over the numeric columns only, use:
numeric_data = data.select_dtypes(include='number')
for i in numeric_data.columns:
    sns.distplot(numeric_data[i])

As mentioned in the comments, you cannot make a distplot from a string column. If you want to skip string columns, you can check each column's dtype as you iterate through them:
import numpy as np
for i in data.columns:
    if data[i].dtype == np.float64 or data[i].dtype == np.int64:
        sns.distplot(data[i])
    else:
        pass  # your code to handle string columns goes here
I ran a simple test based on what you needed and it works fine on my machine. Here is the code:
import seaborn as sns
import matplotlib.pyplot as plt
a = [1,2,3,4]
c = [1,4,6,7,4,6,7,4,3,5,543,543,54,46,656,76,43,56]
d = [43,3,3,56,5,76,686,876,8768,78,77,98,79,8798,987,978,98]
sns.distplot(a)
e = [a,c,d]
for i, col in enumerate(e):
    plt.figure(i)
    sns.distplot(col)
plt.show()
In your case, it would be like this:
import numpy as np
import matplotlib.pyplot as plt
for index, i in enumerate(data.columns):
    if data[i].dtype == np.float64 or data[i].dtype == np.int64:
        plt.figure(index)  # one figure per numeric column
        sns.distplot(data[i])
    else:
        pass  # your code to handle string columns goes here
plt.show()
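Since the end goal in the question is one plot per field together with skewness and kurtosis, here is a minimal sketch of how that could look. It is not the code from either answer: the function name plot_numeric_distributions, the demo DataFrame and the grid layout are assumptions made for illustration, and it draws with plain matplotlib histograms rather than seaborn. Because sns.distplot is deprecated in recent seaborn releases, sns.histplot(numeric[col], ax=ax) could be substituted for the ax.hist call.

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_numeric_distributions(data, ncols=3, bins=20):
    """Draw one histogram per numeric column; return the figure and a stats table."""
    numeric = data.select_dtypes(include="number")
    nrows = max(1, math.ceil(len(numeric.columns) / ncols))
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    axes = axes.ravel()
    # pandas computes sample skewness and excess kurtosis per column.
    stats = pd.DataFrame({"skew": numeric.skew(), "kurtosis": numeric.kurt()})
    for ax, col in zip(axes, numeric.columns):
        ax.hist(numeric[col].dropna(), bins=bins)
        ax.set_title(
            f"{col}: skew={stats.loc[col, 'skew']:.2f}, "
            f"kurt={stats.loc[col, 'kurtosis']:.2f}"
        )
    # Hide the axes left over when the grid has more cells than columns.
    for ax in axes[len(numeric.columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, stats


# Hypothetical sample data: two numeric columns and one string column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(size=200),
    "b": rng.exponential(size=200),
    "label": ["x"] * 200,
})
fig, stats = plot_numeric_distributions(demo)
```

Calling plt.show() afterwards displays the grid; the string column is skipped because select_dtypes keeps only numeric columns.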

Related

Pandas - plotting multiple histograms [duplicate]

I need some guidance in working out how to plot a block of histograms from grouped data in a pandas dataframe. Here's an example to illustrate my question:
from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')
In my ignorance I tried this code command:
df.groupby('Letter').hist()
which failed with the error message "TypeError: cannot concatenate 'str' and 'float' objects"
Any help most appreciated.
I'm on a roll, just found an even simpler way to do it using the by keyword in the hist method:
df['N'].hist(by=df['Letter'])
That's a very handy little shortcut for quickly scanning your grouped data!
For future visitors, the product of this call is the following chart:
One solution is to use matplotlib's histogram directly on each group. Iterating over the groupby object yields (key, DataFrame) pairs, so you can create a histogram for each one.
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')
for letter, group in grouped:
    plt.figure()  # a new figure for each letter
    plt.hist(group.N)
plt.show()
Your function is failing because the groupby dataframe you end up with has a hierarchical index and two columns (Letter and N), so when you call .hist() it tries to make a histogram of both columns, hence the str error.
This is the default behavior of pandas plotting functions (one plot per column), so if you reshape your data frame so that each letter is a column you will get exactly what you want.
df.reset_index().pivot(index='index', columns='Letter', values='N').hist()
The reset_index() is just to shove the current index into a column called index. Then pivot will take your data frame, collect all of the values of N for each Letter and make them a column. The resulting data frame has one row per original row (1000 here, with NaN wherever a row belongs to a different letter) and three columns (A, B, C). hist() will then produce one histogram per column and you can format the plots as needed.
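To see what the reshape produces before plotting, a small sketch along these lines may help. It rebuilds the question's data with a seeded generator (the seed and the name wide are assumptions for illustration) and passes the pivot arguments by keyword.

```python
import numpy as np
import pandas as pd

# Same layout as the question's example, with a fixed seed for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Letter": ["A"] * 300 + ["B"] * 400 + ["C"] * 300,
    "N": rng.standard_normal(1000),
})

# One column per letter; rows keep the original index, padded with NaN.
wide = df.reset_index().pivot(index="index", columns="Letter", values="N")
print(wide.shape)    # (1000, 3)
print(wide.count())  # number of non-NaN values in each letter's column
```

Calling wide.hist() on the result then draws one histogram per letter, ignoring the NaN padding.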
With a recent version of pandas, you can do
df.N.hist(by=df.Letter)
Just like with the solutions above, the axes will be different for each subplot. I have not solved that one yet.
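One possible way to get identical axes on every subplot is to create the grid yourself with shared axes and common bin edges. This is only a sketch under the question's data layout (the seed, figure size and bin count are arbitrary assumptions), not a change to the hist(by=...) shortcut itself.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Letter": ["A"] * 300 + ["B"] * 400 + ["C"] * 300,
    "N": rng.standard_normal(1000),
})

groups = df.groupby("Letter")["N"]
# Compute the bin edges once from all the data so every panel uses the same bins.
bin_edges = np.histogram_bin_edges(df["N"], bins=20)

# sharex/sharey make all subplots use the same axis limits.
fig, axes = plt.subplots(1, groups.ngroups, sharex=True, sharey=True, figsize=(9, 3))
for ax, (letter, values) in zip(axes, groups):
    ax.hist(values, bins=bin_edges)
    ax.set_title(letter)
fig.tight_layout()
```

DataFrame.hist also accepts sharex and sharey arguments, which may be enough if identical bin edges are not needed.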
I write this answer because I was looking for a way to plot together the histograms of different groups. What follows is not very smart, but it works fine for me. I use Numpy to compute the histogram and Bokeh for plotting. I think it is self-explanatory, but feel free to ask for clarifications and I'll be happy to add details (and write it better).
import numpy as np
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show
# df_trips: a DataFrame with 'locality', 'means' and 'speed' columns
figures = {
    'Transit': figure(title='Transit', x_axis_label='speed [km/h]', y_axis_label='frequency'),
    'Driving': figure(title='Driving', x_axis_label='speed [km/h]', y_axis_label='frequency'),
}
cols = {'Vienna': 'red', 'Turin': 'blue', 'Rome': 'orange'}
for (locality, means), group in df_trips.groupby(['locality', 'means']):
    fig = figures[means]
    h, b = np.histogram(group.speed.values)
    # vbar's x is the bar centre, so use the midpoint of each bin
    fig.vbar(x=(b[:-1] + b[1:]) / 2, top=h, width=b[1] - b[0], legend_label=locality, fill_color=cols[locality], alpha=0.5)
show(gridplot([
    [figures['Transit']],
    [figures['Driving']],
]))
I find this even easier and faster.
data_df.groupby('Letter').count()['N'].hist(bins=100)

Iteration, calculation via pandas

I am new to Python and I would like to ask something.
My code reads a CSV file, of which I want to use one column. I want to apply an equation that calculates a result depending on the value in that column, using for and if statements.
My code:
import pandas as pd
import matplotlib as mpl
import numpy as np
dfArxika = pd.read_csv('AIALL.csv', usecols=[0,1,2,3,4,5,6,7,8,9,10], header=None, index_col=False)
print(dfArxika.columns)
A=dfArxika[9]
for i in A:
    if (A(i)>=4.8 and A(i)<66):
        IA=(2.2*log10(A(i)/66)+5.5)
    elif A(i)>=66:
        IA=3.66*log10(A(i)/66)+5.5
    else:
        IA=2.2*log10(A(i)/66)+5.5
but the command window shows me the error:
TypeError: 'Series' object is not callable
Could you help me?
As @rdas mentioned in the comments, you are using parentheses () instead of brackets [] to index the values of your column. Also, log10 is never imported; it is available as np.log10.
I am not sure what IA is in your example, but this might work:
for i in range(len(A)):
    # A is a single column (a Series), so .loc takes just the row label
    if A.loc[i] >= 4.8 and A.loc[i] < 66:
        IA = 2.2*np.log10(A.loc[i]/66) + 5.5
    elif A.loc[i] >= 66:
        IA = 3.66*np.log10(A.loc[i]/66) + 5.5
    else:
        IA = 2.2*np.log10(A.loc[i]/66) + 5.5
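The loop above overwrites IA on every pass, so only the last value survives. If the intent is one result per row, a vectorized version is one option. The sketch below assumes that intent; the function name intensity is hypothetical. It relies on the fact that the first and last branches of the original code use the same formula, so only the slope changes at 66.

```python
import numpy as np
import pandas as pd


def intensity(values):
    """Apply the question's piecewise log10 formula to every element."""
    values = pd.Series(values, dtype=float)
    # The slope is 3.66 at or above 66 and 2.2 below it (both lower branches agree).
    slope = np.where(values >= 66, 3.66, 2.2)
    return slope * np.log10(values / 66) + 5.5


# Hypothetical sample values standing in for column 9 of the CSV.
result = intensity([6.6, 66.0, 660.0])
```

With the question's data this would be called as intensity(dfArxika[9]). Zero or negative inputs yield -inf or NaN from np.log10, so those rows may need separate handling.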

Python 3.6.5 returns '<' not supported between instances of 'tuple' and 'str' error message

I'm trying to split a data set into a training part and a testing part. I am struggling with a structural problem: the hierarchy of the data seems to be wrong for the code below to work.
I tried the following:
import pandas as pd
import pandas_datareader.data as web
data = pd.DataFrame(web.DataReader('SPY', data_source='morningstar')['Close'])
cutoff = '2015-1-1'
data = data[data.index < cutoff].dropna().copy()
As data.head() will reveal, data is not actually a pd.DataFrame but a pd.Series whose index is a pd.MultiIndex (as suggested also by the error which hints that each element is a tuple) rather than a pd.DatetimeIndex.
What you could do would be to simply let
df = data.unstack(0)
With that, df[df.index < cutoff] performs the filtering you are trying to do.
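The effect of unstack(0) can be reproduced without downloading anything. The sketch below builds a small stand-in Series with a (Symbol, Date) MultiIndex; the level names and the price values are made up for illustration and are not real quotes.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded data: a Series indexed by (Symbol, Date).
dates = pd.to_datetime(["2014-12-30", "2014-12-31", "2015-01-02"])
index = pd.MultiIndex.from_product([["SPY"], dates], names=["Symbol", "Date"])
close = pd.Series([207.6, 205.5, 205.4], index=index, name="Close")

# Move the first index level (Symbol) into the columns, leaving a DatetimeIndex.
df = close.unstack(0)

cutoff = "2015-1-1"
# Comparing a DatetimeIndex with a date string now works element-wise.
train = df[df.index < cutoff].dropna()
```

Here train keeps the two 2014 rows, while the same comparison on the original MultiIndex would compare tuples with a string.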

Python - Error scatter plotting with Matplotlib: Index out of range

I'm very new to Python, and I have a CSV file with three columns. They represent a transmission time in milliseconds, signal amplitude, and FM radio frequency in kHz. There are a lot of lines, but they look something like this:
My task is to find out which radio frequency is generating random noise and which is a structured signal. To do this, I'm trying to first find the unique values in the frequency column of my data file (column 3) and then plot them individually to find the structured data. My guess is that the 71.231012 frequency is the white noise (it seemed less frequent in the file), and so I'm basically trying to plot both frequencies to see if my guess is somewhat correct.
So far, this is my code:
from __future__ import division
import matplotlib.pyplot as mplot
import numpy as np
file=open("data.csv", "r")
data=file.read()
data=data.replace(" ", ",")
data=data.split("\n")
xscatter=[]
yscatter=[]
for row in data:
    row=row.split(",")
    row[2]=float(row[2])
    if row[2] == 71.231012:
        xscatter.append(row[2])
        yscatter.append(row[1])
mplot.scatter(xscatter, yscatter, color="blue", marker="o")
mplot.show()
But I keep getting this error:
row[2]=float(row[2])
IndexError: list index out of range
I'm not sure why this is the case; I thought that, with the split, I would have three indexes per row (0,1,2). And because I'm so new to Python, I'm also not sure how accurate or efficient my code is at doing what I want, but it's a start. I'd greatly appreciate some help.
EDIT: Here is a sample of my output after splitting the file, before the for loop:
The line row=row.split(",") sets the row variable to something like ['0.000000', '', '0.000000', '', '0.000000'], because every double space in the file became two commas. The IndexError itself comes from any line that splits into fewer than three items; for example, a blank line (such as the one left after the final newline) becomes [''], which has no index 2.
There are 2 ways of getting rid of the empty strings:
Change your row=row.split(",") to row=row.split(",,"), which removes those annoying empty strings from the list.
Change your data=data.replace(" ", ",") to data=data.replace("  ", ",") (two spaces), which has the same effect.
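A third option, sketched here under the assumption that the fields are separated by runs of spaces (or commas), is to let str.split() with no argument collapse the whitespace and to skip lines that do not have three fields. The function name parse_rows and the sample text are hypothetical.

```python
def parse_rows(text):
    """Parse whitespace- or comma-separated lines into (time, amplitude, frequency) tuples."""
    rows = []
    for line in text.splitlines():
        # split() with no argument treats any run of whitespace as one separator.
        fields = line.replace(",", " ").split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines instead of raising IndexError
        rows.append(tuple(float(f) for f in fields[:3]))
    return rows


# Hypothetical sample with double spaces and a trailing blank line.
sample = "0.000000  1.624345  7.614170\n0.000000  -0.611756  71.231012\n\n"
rows = parse_rows(sample)
```

Each tuple already holds floats, so the amplitude values can go straight into a scatter plot without further conversion.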
If you have an input csv file like the following,
0,1.62435,7.61417
0,-0.611756,7.61417
0,-0.528172,71.231
0,-1.07297,71.231
0,0.865408,7.61417
0,-2.30154,7.61417
0,1.74481,7.61417
0,-0.761207,7.61417
0,0.319039,71.231
0,-0.24937,71.231
1,1.46211,71.231
1,-2.06014,7.61417
1,-0.322417,71.231
1,-0.384054,7.61417
1,1.13377,7.61417
1,-1.09989,71.231
1,-0.172428,71.231
1,-0.877858,7.61417
1,0.0422137,71.231
1,0.582815,71.231
You can read it in using numpy.loadtxt and plot it separated by frequency value by looping over the respective unique frequencies in the last column.
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt("data/filename.csv", delimiter=",")
for freq in np.unique(data[:,2]):
    thisdata = data[data[:,2] == freq]
    plt.scatter(thisdata[:,0], thisdata[:,1], label="{}".format(freq))
plt.legend()
plt.show()
