pandas groupby objects, combining and plotting - python

I probably don't really understand when or how to use the groupby function of pandas.DataFrame. In the example below I want to bin my dataframe in petal length and calculate the number of entries, the mean and the spread for each bin. I can do that with three groupby calls, but then the answers end up in three separate objects, so I concat them afterwards. Now I have one object, but all of its columns are called sepal width (cm); passing names to concat did not work for me. I would also like to get the bin and the mean values, e.g. for plotting, but I do not know how to do that.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
data = pd.DataFrame(iris.data)
data.columns = iris.feature_names
data["bin"] = pd.cut(data["petal length (cm)"], 5)
g0 = data.groupby(["bin"])["sepal width (cm)"].count()
g1 = data.groupby(["bin"])["sepal width (cm)"].mean()
g2 = data.groupby(["bin"])["sepal width (cm)"].std()
# how to get better names?
g = pd.concat([g0, g1, g2], axis=1)
print(g)
# how to extract bin and mean e.g. for plotting?
#plt.plot(g.bin, g.mean)

About the second part of your question, you can use string manipulation.
If I understand correctly you can use this:
a = data['bin']
a1 = a.astype(str).str.strip('([])').str.split(',').str[0].astype(float)
a2 = a.astype(str).str.strip('([])').str.split(',').str[1].astype(float)
data['bin_center'] = (a1+a2)/2
g = data.groupby('bin_center')['sepal width (cm)'].agg(['count', 'mean', 'std'])
plt.plot(g.index, g['mean'])
By the way, if you don't really need the bin center and just want to see the plot with the bins,
you can use the dataframe plot method:
g = data.groupby('bin')['sepal width (cm)'].agg(['count', 'mean', 'std'])
print(g)
g['mean'].plot()
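The three separate groupby calls from the question can also be collapsed into a single agg call whose keyword names become the column names, and the bin centers can be read from the interval objects rather than parsed from strings. This is only a minimal sketch: it assumes a pandas version with named aggregation (0.25 or later) and uses a small synthetic frame with hypothetical columns x and y in place of the iris data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the iris columns (x and y are hypothetical names).
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.uniform(1.0, 7.0, 150),
                     "y": rng.normal(3.0, 0.4, 150)})
data["bin"] = pd.cut(data["x"], 5)

# One pass over the groups; the keyword names become the output column names.
g = data.groupby("bin", observed=False)["y"].agg(n="count", mean="mean", spread="std")

# Each index entry is a pandas Interval, which exposes its midpoint as .mid.
centers = [interval.mid for interval in g.index]
```

With these two pieces, something like plt.errorbar(centers, g["mean"], yerr=g["spread"]) would plot the mean against the bin center.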

Related

Extract nominal and standard deviation from ufloat inside a panda dataframe

For convenience I am using pandas dataframes to perform uncertainty propagation on a large set of data.
I then wish to plot the nominal value of my data set, but something like myDF['colLabel'].n won't work. How can I extract the nominal value and standard deviation from a dataframe in order to plot the nominal value with error bars?
Here is a MWE to make this concrete:
#%% MWE
import pandas as pd
from uncertainties import ufloat
import matplotlib.pyplot as plt
# building of a dataframe filled with ufloats
d = {'value1': [ufloat(1,.1),ufloat(3,.2),ufloat(5,.6),ufloat(8,.2)], 'value2': [ufloat(10,5),ufloat(50,2),ufloat(30,3),ufloat(5,1)]}
df = pd.DataFrame(data = d)
# plot of value2 vs. value1 with errorbars.
plt.plot(x = df['value1'].n, y = df['value2'].n)
plt.errorbar(x = df['value1'].n, y = df['value2'].n, xerr = df['value1'].s, yerr = df['value2'].s)
# obviously .n and .s won't work.
The error I get is AttributeError: 'Series' object has no attribute 'n', which suggests extracting the values from each series. Is there a shorter way to do it than looping over the data to separate the nominal and std values into two vectors?
Thanks.
EDIT: Using those functions from the package won't work either: uncertainties.nominal_value(df['value2']) and uncertainties.std_dev(df['value2'])
Actually, I solved it with the unumpy.nominal_values(arr) and unumpy.std_devs(arr) functions from uncertainties.

Pandas: Apply function to set of groups

I have the following problem:
Given a 2D dataframe whose first column holds values and whose second column gives the category of each point, I would like to run k-means on the means of the categories and, for each row, store the centroid closest to its category's mean as a new column in the original data frame.
I would like to do this using groupby.
More generally, my problem is that apply (to my knowledge) can only use functions that are defined on the individual groups (like mean()), whereas k-means needs information on all the groups. Is there a nicer way than transforming everything to numpy arrays and working with those?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2
k=4
raw_data = np.random.randint(0,100,size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A','B'])
groups = df.groupby('A')
means = groups.mean().unstack()
centroids, dictionary = kmeans2(means,k)
fig, ax = plt.subplots()
print(dictionary)
What I would like to get now is a new column in df that gives the value in dictionary for each entry.
You can achieve it by the following:
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans2
k = 4
raw_data = np.random.randint(0,100,size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A','B'])
groups = df.groupby('A')
means_data_frame = pd.DataFrame(groups.mean())
centroid, means_data_frame['cluster'] = kmeans2(means_data_frame['B'], k)
# join returns a new frame, so assign the result
df = df.join(means_data_frame, rsuffix='_mean', on='A')
This appends two more columns to df, B_mean and cluster, denoting the group's mean and the cluster that the group's mean is closest to, respectively.
If you really want to use apply, you can write a function that reads the cluster value from means_data_frame and assigns it to a new column in df.
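The lookup from a per-group table back to the individual rows can also be written with Series.map, which avoids both the join and a hand-written apply function. The sketch below is only an illustration of that lookup step: the data are made up, and the cluster labels are assigned by hand as a stand-in for the labels kmeans2 would return.

```python
import pandas as pd

# Small made-up frame; A is the category, B the value.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3, 3, 4, 4],
                   "B": [10, 12, 40, 44, 11, 13, 80, 90]})

# One row per category, indexed by A (means are 11, 42, 12 and 85).
means = df.groupby("A")[["B"]].mean()

# Hand-assigned labels, a stand-in for the second return value of kmeans2.
means["cluster"] = [0, 1, 0, 2]

# map looks up each row's category in the per-group table.
df["B_mean"] = df["A"].map(means["B"])
df["cluster"] = df["A"].map(means["cluster"])
```

Because map aligns on the index of the per-group table, any statistic that needs all groups at once (such as a clustering) can be computed once on that table and then broadcast back to the rows.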

Making histogram with Spark DataFrame column

I am trying to make a histogram with a column from a dataframe which looks like
DataFrame[C0: int, C1: int, ...]
If I were to make a histogram with the column C1, what should I do?
Some things I have tried are
df.groupBy("C1").count().histogram()
df.C1.countByValue()
Which do not work because of mismatch in data types.
The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency, you can use this bit of code to plot a simple histogram.
import matplotlib.pyplot as plt
# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)
# This is a bit awkward but I believe this is the correct way to do it
plt.hist(bins[:-1], bins=bins, weights=counts)
What worked for me is
df.groupBy("C1").count().rdd.values().histogram()
I had to convert to an RDD because I found the histogram method in the pyspark.RDD class, but not in the pyspark.sql module.
You can use histogram_numeric Hive UDAF:
import random
from pyspark.sql import HiveContext
random.seed(323)
sqlContext = HiveContext(sc)  # sc is an existing SparkContext
n = 3 # Number of buckets
df = sqlContext.createDataFrame(
sc.parallelize(enumerate(random.random() for _ in range(1000))),
["id", "v"]
)
hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))
hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3) |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+
You can also extract the column of interest and use histogram method on RDD:
df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
## 0.33410233677189705,
## 0.6661765640094703,
## 0.9982507912470436],
## [327, 326, 347])
Let's say your values in C1 are between 1 and 1000 and you want a histogram with 10 bins. You can do something like:
from pyspark.sql.functions import floor
# floor gives integer bin labels 0..9; plain division would leave fractional values
df.withColumn("bins", floor((df.C1 - 1) / 100)).groupBy("bins").count()
If your binning is more complex you can write a UDF for it (and at worst, you might need to analyze the column first, e.g. by using describe or some other method).
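The binning arithmetic itself can be checked without a cluster. The sketch below uses pandas as a stand-in for the Spark column expression, with a handful of made-up values: integer division assigns each value in 1..1000 to one of the bins 0..9, whereas plain division would produce fractional labels and almost every value would end up in its own group.

```python
import pandas as pd

# Made-up sample of C1 values in the range 1..1000.
c1 = pd.Series([1, 57, 100, 101, 250, 999, 1000], name="C1")

# Shift by one so that 1..100 falls in bin 0 and 901..1000 in bin 9.
bins = (c1 - 1) // 100

# Count the entries per bin, ordered by bin label.
counts = bins.value_counts().sort_index()
```

Bins with no entries are simply absent from counts, which is also what a groupBy followed by count would give.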
If you want to plot the histogram, you could use the pyspark_dist_explore package:
import matplotlib.pyplot as plt
from pyspark_dist_explore import hist
fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))
If you would like the data in a pandas DataFrame you could use:
pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))
One easy way could be
import pandas as pd
x = df.select('symboling').toPandas() # symboling is the column for histogram
x.plot(kind='hist')

class labels in Pandas scattermatrix

This question has been asked before, Multiple data in scatter matrix, but didn't receive an answer.
I'd like to make a scatter matrix, something like in the pandas docs, but with differently colored markers for different classes. For example, I'd like some points to appear in green and others in blue depending on the value of one of the columns (or a separate list).
Here's an example using the Iris dataset. The color of the points represents the species of Iris -- Setosa, Versicolor, or Virginica.
Does pandas (or matplotlib) have a way to make a chart like that?
Update: This functionality is now in the latest version of Seaborn. Here's an example.
The following was my stopgap measure:
def factor_scatter_matrix(df, factor, palette=None):
    '''Create a scatter matrix of the variables in df, with differently colored
    points depending on the value of df[factor].
    inputs:
        df: pandas.DataFrame containing the columns to be plotted, as well
            as factor.
        factor: string or pandas.Series. The column indicating which group
            each row belongs to.
        palette: A list of hex codes, at least as long as the number of groups.
            If omitted, a predefined palette will be used, but it only includes
            9 groups.
    '''
    import matplotlib.colors
    import numpy as np
    from pandas.plotting import scatter_matrix
    from scipy.stats import gaussian_kde
    if isinstance(factor, str):
        factor_name = factor  # save off the name
        factor = df[factor]  # extract column
        df = df.drop(factor_name, axis=1)  # remove from df, so it
        # doesn't get a row and col in the plot.
    classes = list(set(factor))
    if palette is None:
        palette = ['#e41a1c', '#377eb8', '#4eae4b',
                   '#994fa1', '#ff8101', '#fdfc33',
                   '#a8572c', '#f482be', '#999999']
    # check the palette size before building the class-to-color mapping
    if len(classes) > len(palette):
        raise ValueError('''Too many groups for the number of colors provided.
We only have {} colors in the palette, but you have {}
groups.'''.format(len(palette), len(classes)))
    color_map = dict(zip(classes, palette))
    colors = factor.apply(lambda group: color_map[group])
    axarr = scatter_matrix(df, figsize=(10, 10), marker='o', c=colors, diagonal=None)
    # overlay one kernel density estimate per group on each diagonal panel
    for rc in range(len(df.columns)):
        for group in classes:
            y = df[factor == group].iloc[:, rc].values
            gkde = gaussian_kde(y)
            ind = np.linspace(y.min(), y.max(), 1000)
            axarr[rc][rc].plot(ind, gkde.evaluate(ind), c=color_map[group])
    return axarr, color_map
As an example, we'll use the same dataset as in the question, available here
>>> import pandas as pd
>>> iris = pd.read_csv('iris.csv')
>>> axarr, color_map = factor_scatter_matrix(iris,'Name')
>>> color_map
{'Iris-setosa': '#377eb8',
'Iris-versicolor': '#4eae4b',
'Iris-virginica': '#e41a1c'}
Hope this is helpful!
You can also call scatter_matrix from pandas directly, as follows:
pd.plotting.scatter_matrix(df, color=colors)
with colors being a list of length len(df) containing the color of each row.
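One way to build such a per-row color list is to map the class labels through a dictionary of colors. The sketch below is an illustration under assumptions, not the answerer's exact code: the data, the label names and the palette are made up, the Agg backend is selected only so that it runs without a display, and the colors are passed through the c keyword, which scatter_matrix forwards to matplotlib's scatter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up numeric data with a separate Series of class labels.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])
labels = pd.Series(["x", "y", "z"] * 10)

# Hypothetical palette: one hex color per class.
palette = {"x": "#e41a1c", "y": "#377eb8", "z": "#4daf4a"}
colors = labels.map(palette).tolist()  # one color per row of df

# Extra keyword arguments are forwarded to the scatter calls of the off-diagonal panels.
axes = scatter_matrix(df, c=colors, marker="o", figsize=(6, 6))
```

The diagonal panels keep their default histograms here, since the forwarded keywords only affect the scatter panels.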

plot multiple data series from numpy array

I have a very ambitious project (for my novice level) using a numpy array, where I load a series of data and make different plots based on my needs. I have uploaded a slim version of my data file, input_data, and want to make plots based on F (where I would like to choose the desired F before looping). Each series should hold the data for one value in column E (e.g. A12 is one data series, A23 another data series in the plot, etc.), and on the x-axis I would like to use the corresponding values in D.
To summarize: for a chosen value in column F, I want 4 different data series (one per value in column E), with the data referenced on the x-axis to the value of column D (which is a date).
I stumbled at the first step (although I spent too much time on it), where I wanted to plot all data with the same F column identifier as one plot.
Here is what I have so far:
import os
import numpy as np
import matplotlib.pyplot as plt
N = 8 #different values on column F
M = 4 #different values on column E
dataset = open('array_data.txt').readlines()[1:]
data = np.genfromtxt(dataset)
my_array = data
day = len(my_array)/M/N # number of measurement sets - variation on column D
for i in range(0, len(my_array), N):
    plt.xlim(0, )
    plt.ylim(-1, 2)
    # repeated plt.plot calls draw on the same axes, so no plt.hold is needed
    plt.plot(my_array[i, 0], my_array[i, 2], 'o')
plt.show()
this does nothing.... and I still have a long way to go..
With pandas you can do:
import pandas as pd
dataset = pd.read_table("toplot.txt", sep="\t")
#make D index (automatically puts it on the x axis)
dataset.set_index("D", inplace=True)
#plotting R vs. D
dataset.R.plot()
#plotting F vs. D
dataset.F.plot()
dataset is a DataFrame object, and DataFrame.plot is just a wrapper around the matplotlib function to plot the series.
I'm not clear on how you want to plot it, but it sounds like you'll need to select some values of a column. This would be:
# get where F == 1000
maskF = dataset.F == 1000
# get the values where F == 1000
rows = dataset[maskF]
# get the values where A12 is in column E
rows = rows[rows.E == "A12"]
# remove the columns we don't want to see
del rows["E"]
del rows["F"]
#Plot the result
rows.plot(xlim=(0,None), ylim=(-1,2))
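To get one data series per value of E for a chosen F, as the question describes, the filtered rows can be pivoted so that each E value becomes its own column. This is a sketch under assumptions: the frame below is made up, and the column names D, E, F and R follow the answer above rather than the actual data file.

```python
import pandas as pd

# Made-up data: D is the x-axis value, E the series name, F the selector, R the measurement.
dataset = pd.DataFrame({
    "D": [1, 1, 2, 2, 1, 1, 2, 2],
    "E": ["A12", "A23", "A12", "A23"] * 2,
    "F": [1000] * 4 + [2000] * 4,
    "R": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

# Keep only the rows for the chosen F value.
chosen = dataset[dataset["F"] == 1000]

# One row per D, one column per E value.
wide = chosen.pivot(index="D", columns="E", values="R")
```

Calling wide.plot() then draws one line per E value against D, since DataFrame.plot uses the index for the x-axis and each column as a separate series.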
