import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = np.arange(1990,2061, 1)
dates = dates.astype('str').astype('datetime64')
df = pd.DataFrame(np.random.randint(0, dates.size, size=(dates.size,3)), columns=list('ABC'))
df['year'] = dates
cols = df.columns.tolist()
cols = [cols[-1]] + cols[:-1]
df = df[cols]
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.stackplot(df['year'], df.drop('year',axis=1))
Based on this code, I'm getting an error "TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"
I'm trying to figure out how to plot a DataFrame object with years in the first column, and then a stacked area chart from the subsequent columns (A, B, C).
Also, since I'm a complete beginner here... feel free to comment on my code to make it cleaner/better. I understand that if I use Matplotlib directly instead of the Pandas integrated plot method, I have more flexibility to adjust things later on?
Thanks!
I ran into two problems running your code.
First, stackplot seems to dislike string representations of dates; datetime data types can be finicky. Either use integers for your 'year' column, or use .values to convert from pandas to numpy datatypes as described in this question.
Secondly, according to the documentation for stackplot, when you call stackplot(x, y) with x an array of length N, y must be MxN, where M is the number of stacked series (here your columns A, B and C). Your df.drop('year', axis=1) will end up as NxM and throw another error at you. If you take the transpose, however, you can make it work.
If I just replace your final line with
ax.stackplot(df['year'].values, df.drop('year',axis=1).T)
I get a plot that looks like this:
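For completeness, here is a minimal sketch combining both fixes (the labels/legend lines are optional and just reuse the column names):
fig, ax = plt.subplots()
# Pass numpy datetimes via .values and transpose so stackplot gets an MxN array
ax.stackplot(df['year'].values, df.drop('year', axis=1).T, labels=list('ABC'))
ax.legend(loc='upper left')
plt.show()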
I was plotting a scatter plot to show null values in a dataframe. As you can see, the plt.scatter() call is not very expressive: the relation between list(range(0,1200)) and 'a' is not clear unless you read the previous lines. Can plt.scatter(x, y) be written in a more explicit way, so that somebody who only sees that line would understand how x and y are related?
a = []
for i in range(0,1200):
    feature_with_na = [feature for feature in df.columns if df[feature].isnull().sum() > i]
    a.append(len(feature_with_na))
plt.scatter(list(range(0,1200)), a)
On your x-axis you have a threshold number, and on the y-axis you want to plot the number of columns in your DataFrame that have more than that many null values.
Instead of your loop you can count the null values within each column once and use numpy broadcasting ([:, None]) to compare against an array of thresholds. This lets you define an xarr of thresholds and then reuse that same array in the comparison.
Sample Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.choice([1,2,3,4,5,np.NaN], (100,10)))
Code
# Range of 'x' values to consider
xarr = np.arange(0, 100)
# Broadcast: compare each column's null count against every threshold in xarr
plt.scatter(xarr, (df.isnull().sum().to_numpy() > xarr[:, None]).sum(axis=1))
ALollz's answer is good, but here's a less numpy-heavy alternative if that's more your thing:
feature_null_counts = df.isnull().sum()
n_nulls = list(range(100))
features_with_n_nulls = [sum(feature_null_counts > n) for n in n_nulls]
plt.scatter(n_nulls, features_with_n_nulls)
I want to normalize all the numeric values in my dataset.
I have taken my whole dataset into a pandas dataframe.
My code to do this so far:
for column in numeric:  # numeric = df._get_numeric_data()
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
But how do I verify this is correct?
I tried plotting a histogram for one of the columns before and after normalizing, by adding this piece of code before and after my for loop:
x=df['Below.Primary'] #Below.Primary is one of my column names
plt.hist(x, bins=45)
The blue histogram was before the for loop and the orange, after.
My total code looked like this:
plt.hist(df['Below.Primary'], bins=45)

for column in numeric:
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])

x = df['Below.Primary']
plt.hist(x, bins=45)
I don't see any reduction in scale. What have I done wrong? If this isn't correct, can someone point out the right way to do what I wanted?
Try using this:
scaler = preprocessing.StandardScaler()
df[col] = scaler.fit_transform(df[[col]])  # the scaler expects 2D input, hence df[[col]]
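For example, applied to all numeric columns at once (a minimal sketch; num_cols here just captures the numeric column names via the question's _get_numeric_data() call):
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
num_cols = df._get_numeric_data().columns  # names of the numeric columns
# fit_transform returns a 2D array with the same shape as df[num_cols]
df[num_cols] = scaler.fit_transform(df[num_cols])
Note that StandardScaler standardizes each column to zero mean and unit variance, which is different from sklearn's normalize/Normalizer (which rescales each row to unit norm).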
A couple general things first.
If numeric is a list of column names (which looks like it is the case), the for loop is not necessary.
A Pandas Series uses an ndarray under the hood, so you can just request the ndarray with Series.values instead of calling np.array(). See this page on the Pandas Series.
I am assuming you are using preprocessing from sklearn.
I recommend using sklearn.preprocessing.Normalizer for this.
import pandas as pd
from sklearn.preprocessing import Normalizer
### Without the for loop (recommended)
# this version returns array
normalizer = Normalizer()
normalized_values = normalizer.fit_transform(df[numeric])
# normalized_values is a 2D array which is useful
# for many applications
# to convert back to DataFrame
df = pd.DataFrame(normalized_values, columns = numeric)
### with the for-loop (not recommended)
for column in numeric:
    x_array = df[column].values.reshape(-1,1)
    df[column] = normalizer.fit_transform(x_array)
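To address the "how do I verify this is correct" part of the question: Normalizer rescales each row (sample) to unit L2 norm by default, so a quick sanity check on normalized_values from the recommended version above is:
import numpy as np

# Each row of normalized_values should now have (approximately) unit L2 norm
row_norms = np.linalg.norm(normalized_values, axis=1)
print(np.allclose(row_norms, 1.0))  # expect True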
You have to assign normalized_X back to the respective column while iterating.
for column in numeric:
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
    df[column] = normalized_X[0]  # normalize returns a 2D array, so take its first (only) row

x = df['Below.Primary']
plt.hist(x, bins=45)
This is undoubtedly a bit of a "can't see the wood for the trees" moment. I've been staring at this code for an hour and can't see what I've done wrong. I know it's staring me in the face but I just can't see it!
I'm trying to convert between two geographical co-ordinate systems using Python.
I have longitude (x-axis) and latitude (y-axis) values and want to convert to OSGB 1936. For a single point, I can do the following:
import numpy as np
import pandas as pd
import shapefile
import pyproj
inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
x1,y1 = (-2.772048, 53.364265)
x2,y2 = pyproj.transform(inProj,outProj,x1,y1)
print(x1,y1)
print(x2,y2)
This produces the following:
-2.772048 53.364265
348721.01039783185 385543.95241055806
Which seems reasonable and suggests that longitude of -2.772048 is converted to a co-ordinate of 348721.0103978.
In fact, I want to do this in a Pandas dataframe. The dataframe contains columns containing longitude and latitude and I want to add two additional columns that contain the converted co-ordinates (called newLong and newLat).
An exemplar dataframe might be:
latitude longitude
0 53.364265 -2.772048
1 53.632481 -2.816242
2 53.644596 -2.970592
And the code I've written is:
import numpy as np
import pandas as pd
import shapefile
import pyproj
inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
df = pd.DataFrame({'longitude':[-2.772048,-2.816242,-2.970592],'latitude':[53.364265,53.632481,53.644596]})
def convertCoords(row):
x2,y2 = pyproj.transform(inProj,outProj,row['longitude'],row['latitude'])
return pd.Series({'newLong':x2,'newLat':y2})
df[['newLong','newLat']] = df.apply(convertCoords,axis=1)
print(df)
Which produces:
latitude longitude newLong newLat
0 53.364265 -2.772048 385543.952411 348721.010398
1 53.632481 -2.816242 415416.003113 346121.990302
2 53.644596 -2.970592 416892.024217 335933.971216
But now it seems that the newLong and newLat values have been mixed up (compared with the results of the single point conversion shown above).
Where have I got my wires crossed to produce this result? (I apologise if it's completely obvious!)
When you do df[['newLong','newLat']] = df.apply(convertCoords,axis=1), the assignment is positional, but the column order of the df.apply output is not what you expect: because the Series is built from a dictionary, its index ends up ordered alphabetically (newLat before newLong), so the two columns come back swapped.
You can opt to return a Series with a fixed column ordering:
return pd.Series([x2, y2])
Alternatively, if you want to keep the convertCoords output labelled, then you can use .join to combine results instead:
return pd.Series({'newLong':x2,'newLat':y2})
...
df = df.join(df.apply(convertCoords, axis=1))
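Putting the first option together, a minimal corrected sketch (same inProj/outProj as above) would be:
def convertCoords(row):
    x2, y2 = pyproj.transform(inProj, outProj, row['longitude'], row['latitude'])
    # Return positionally so the values line up with the column names
    # on the left-hand side of the assignment below
    return pd.Series([x2, y2])

df[['newLong','newLat']] = df.apply(convertCoords, axis=1)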
Please note that pyproj's transform function also accepts arrays, which is quite useful for large dataframes and much faster than using a lambda/apply approach:
import pandas as pd
from pyproj import Proj, transform
inProj, outProj = Proj(init='epsg:4326'), Proj(init='epsg:27700')
df['newLon'], df['newLat'] = transform(inProj, outProj, df['longitude'].tolist(), df['latitude'].tolist())
I am trying to make a histogram with a column from a dataframe which looks like
DataFrame[C0: int, C1: int, ...]
If I were to make a histogram with the column C1, what should I do?
Some things I have tried are
df.groupBy("C1").count().histogram()
df.C1.countByValue()
Which do not work because of mismatch in data types.
The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency you can use this bit of code to plot a simple histogram.
import matplotlib.pyplot as plt
# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)
# This is a bit awkward but I believe this is the correct way to do it
plt.hist(bins[:-1], bins=bins, weights=counts)
What worked for me is
df.groupBy("C1").count().rdd.values().histogram(20)  # histogram() needs a bucket count (or explicit bucket edges)
I had to convert to an RDD because I found the histogram method in the pyspark.RDD class, but not in the Spark SQL module.
You can use histogram_numeric Hive UDAF:
import random
from pyspark.sql import HiveContext

random.seed(323)
sqlContext = HiveContext(sc)
n = 3 # Number of buckets
df = sqlContext.createDataFrame(
sc.parallelize(enumerate(random.random() for _ in range(1000))),
["id", "v"]
)
hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))
hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3) |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+
You can also extract the column of interest and use histogram method on RDD:
df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
## 0.33410233677189705,
## 0.6661765640094703,
## 0.9982507912470436],
## [327, 326, 347])
Let's say your values in C1 are between 1 and 1000 and you want to get a histogram of 10 bins. You can do something like:
from pyspark.sql.functions import floor
df.withColumn("bins", floor(df.C1/100)).groupBy("bins").count()
so that each value is mapped to an integer bin index. If your binning is more complex you can make a UDF for it (and at worst, you might need to analyze the column first, e.g. using describe or some other method).
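For instance, a minimal sketch of UDF-based binning (the bucket edges and labels here are purely illustrative):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_bucket(v):
    # Hypothetical bucket edges; adjust to your data
    if v is None:
        return None
    if v < 100:
        return "low"
    if v < 500:
        return "mid"
    return "high"

bucket_udf = udf(to_bucket, StringType())
df.withColumn("bins", bucket_udf(df.C1)).groupBy("bins").count().show()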
If you want to plot the histogram, you could use the pyspark_dist_explore package:
from pyspark_dist_explore import hist
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))
If you would like the data in a pandas DataFrame you could use:
from pyspark_dist_explore import pandas_histogram

pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))
One easy way could be
import pandas as pd
x = df.select('symboling').toPandas() # symboling is the column for histogram
x.plot(kind='hist')
I have floating point data in a Pandas dataframe. Each column represents a variable (they have string names) and each row a set of values (the rows have integer names which are not important).
>>> print data
       kppawr23    kppaspyd
1      3.312387   13.266040
2      2.775202    0.100000
3    100.000000  100.000000
4    100.000000   39.437420
5     17.017150   33.019040
...
I want to plot a histogram for each column. The best result I have achieved is with the hist method of dataframe:
data.hist(bins=20)
but I want the x-axis of each histogram to be on a log10 scale. And the bins to be on log10 scale too, but that is easy enough with bins=np.logspace(-2,2,20).
A workaround might be to log10 transform the data before plotting, but the approaches I have tried,
data.apply(math.log10)
and
data.apply(lambda x: math.log10(x))
give me a floating point error.
"cannot convert the series to {0}".format(str(converter)))
TypeError: ("cannot convert the series to <type 'float'>", u'occurred at index kppawr23')
You could use
ax.set_xscale('log')
data.hist() returns an array of axes. You'll need to call
ax.set_xscale('log') on each Axes, ax, to make each of them logarithmically
scaled.
For example,
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(2015)
N = 100
arr = np.random.random((N,2)) * np.logspace(-2,2,N)[:, np.newaxis]
data = pd.DataFrame(arr, columns=['kppawr23', 'kppaspyd'])
bins = np.logspace(-2,2,20)
axs = data.hist(bins=bins)
for ax in axs.ravel():
ax.set_xscale('log')
plt.gcf().tight_layout()
plt.show()
yields
By the way, to take the log of every value in the DataFrame, data, you could use
logdata = np.log10(data)
since NumPy ufuncs (such as np.log10) can be applied to pandas DataFrames and operate elementwise on all the values in the DataFrame.
data.apply(math.log10) did not work because apply tries to pass an entire column (a Series) of values to math.log10. math.log10 expects a scalar value only.
data.apply(lambda x: math.log10(x)) fails for the same reason that data.apply(math.log10) does. Moreover, if data.apply(func) and data.apply(lambda x: func(x)) were both viable options, the first should be preferred since the lambda function would just make the call a tad slower.
You could use data.apply(np.log10), again since the NumPy ufunc np.log10 can be applied to Series, but there is no reason to bother doing this when np.log10(data) works.
You could also use data.applymap(math.log10) since applymap calls
math.log10 on each value in data one-at-a-time. But this would be far slower
than calling the equivalent NumPy function, np.log10 on the entire
DataFrame. Still, it is worth knowing about applymap in case you need to call
some custom function which is not a ufunc.
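For instance, a minimal sketch of applymap with a custom non-ufunc function (the guard against non-positive values is purely illustrative):
import math
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.random((5, 2)) * 100, columns=['kppawr23', 'kppaspyd'])
# applymap calls the function on each scalar value, one at a time
logdata = data.applymap(lambda v: math.log10(v) if v > 0 else float('nan'))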