Seaborn plots kdeplot but not distplot - python

I want to plot data from a pandas dataframe column built from CouchDB. This is the code and the output for the data:
print df4.Patient_Age
Doc_ID
000103f8-7f48-4afd-b532-8e6c1028d965 99
00021ec5-9945-47f7-bfda-59cf8918f10b 92
0002510f-fb89-11e3-a6eb-742f68319ca7 32
00025550-9a97-44a4-84d9-1f6f7741f973 73
0002d1b8-b576-4db7-af55-b3f26f7ca63d 49
0002d40f-2b45-11e3-8f66-742f68319ca7 42
000307eb-18a6-47cd-bb03-33e484fad029 18
00033d3d-1345-4739-9522-b41b8db3ee23 42
00036d2e-0a51-4cfb-93d1-3e137a026f19 42
0003b054-5f3b-4553-8104-f71d7a940d84 10
Name: Patient_Age, dtype: object
If I execute this code:
sns.kdeplot(df4.Patient_Age)
the plot is generated as expected. However, when I run this:
sns.distplot(df4.Patient_Age)
I get the following error with distplot:
TypeError: unsupported operand type(s) for /: 'unicode' and 'long'
To correct the error, I use:
df4.Patient_Age = [int(i) for i in df4.Patient_Age]
all(isinstance(item,int) for item in df4.Patient_Age)
The output is:
False
What I would like to understand is:
Why was the kdeplot generated earlier but not the distplot?
When I changed the datatype to int, why do I still get False? And if the data is not int (as indicated by False), why does distplot work after the transformation?

The problem is that your values are not numeric. If you force them to integers or floats, it will work.
from io import StringIO
import pandas
import seaborn
seaborn.set(style='ticks')
data = StringIO("""\
Doc_ID Age
000103f8-7f48-4afd-b532-8e6c1028d965 99
00021ec5-9945-47f7-bfda-59cf8918f10b 92
0002510f-fb89-11e3-a6eb-742f68319ca7 32
00025550-9a97-44a4-84d9-1f6f7741f973 73
0002d1b8-b576-4db7-af55-b3f26f7ca63d 49
0002d40f-2b45-11e3-8f66-742f68319ca7 42
000307eb-18a6-47cd-bb03-33e484fad029 18
00033d3d-1345-4739-9522-b41b8db3ee23 42
00036d2e-0a51-4cfb-93d1-3e137a026f19 42
0003b054-5f3b-4553-8104-f71d7a940d84 10
""")
df = pandas.read_table(data, sep=r'\s+')
df['Age'] = df['Age'].astype(float)
df.info()
# prints
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
Doc_ID 10 non-null object
Age 10 non-null float64
dtypes: float64(1), object(1)
memory usage: 240.0+ bytes
So then:
seaborn.distplot(df['Age'])
gives me the expected distribution plot (a histogram with a KDE overlay).
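If the real column can contain values that do not parse cleanly as numbers (an assumption about your data, not something shown in the question), a slightly more defensive sketch is pandas.to_numeric with errors='coerce':
# Unparseable strings become NaN instead of raising, and the dtype becomes float64
df['Age'] = pandas.to_numeric(df['Age'], errors='coerce')
seaborn.distplot(df['Age'].dropna())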

Related

getting strange error while calculating z-score

I want to calculate the z-score of my whole dataset. I have tried two versions of the code, but unfortunately they both gave me the same error.
My first version is:
zee=stats.zscore(df)
print(zee)
My second version is:
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df))
print(z)
I am using Jupyter.
The error I get:
-----
TypeError Traceback (most recent call last)
<ipython-input-23-ef429aebacfd> in <module>
1 from scipy import stats
2 import numpy as np
----> 3 z = np.abs(stats.zscore(df))
4 print(z)
~/.local/lib/python3.8/site-packages/scipy/stats/stats.py in zscore(a, axis, ddof, nan_policy)
2495 sstd = np.nanstd(a=a, axis=axis, ddof=ddof, keepdims=True)
2496 else:
-> 2497 mns = a.mean(axis=axis, keepdims=True)
2498 sstd = a.std(axis=axis, ddof=ddof, keepdims=True)
2499
~/.local/lib/python3.8/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
160 ret = umr_sum(arr, axis, dtype, out, keepdims)
161 if isinstance(ret, mu.ndarray):
--> 162 ret = um.true_divide(
163 ret, rcount, out=ret, casting='unsafe', subok=False)
164 if is_float16_result and out is None:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
And here is the info for my dataframe, in case there is something wrong with it:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Region 100 non-null object
1 Country 100 non-null object
2 Item Type 100 non-null object
3 Sales Channel 100 non-null object
4 Order Priority 100 non-null object
5 Order Date 100 non-null object
6 Order ID 100 non-null int64
7 Ship Date 100 non-null object
8 Units Sold 100 non-null int64
9 Unit Price 100 non-null float64
10 Unit Cost 100 non-null float64
11 Total Revenue 100 non-null float64
12 Total Cost 100 non-null float64
13 Total Profit 100 non-null float64
dtypes: float64(5), int64(2), object(7)
memory usage: 11.1+ KB
Thanks in advance.
Your df contains non-numeric (object dtype) columns; pass only the int/float columns to your zscore call:
stats.zscore(df[['Unit Cost', 'Total Revenue', 'Total Cost', 'Total Profit']])
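If you would rather not list the columns by hand, a minimal sketch (assuming the df from your info() output) is to let pandas select the numeric columns for you:
from scipy import stats

# Keep only the int64/float64 columns; object columns such as Region or Country are dropped
numeric_cols = df.select_dtypes(include='number')
z = stats.zscore(numeric_cols)
print(z)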

Why does memory usage in Pandas report the same number for integers as for object dtype?

I'm trying to understand the difference in memory usage between integers and string (objects) dtypes in Pandas.
import numpy as np
import pandas as pd
df_int = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'), dtype=int)
As expected, this takes around 3.2 KB of memory as each column is a 64-bit integer
In [38]: df_int.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
A 100 non-null int64
B 100 non-null int64
C 100 non-null int64
D 100 non-null int64
dtypes: int64(4)
memory usage: 3.2 KB
However, when I initialize it with strings, it reports roughly the same memory usage:
import pandas as pd
df_str = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'), dtype=str)
In [40]: df_str.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
A 100 non-null object
B 100 non-null object
C 100 non-null object
D 100 non-null object
dtypes: object(4)
memory usage: 3.2+ KB
When I use sys.getsizeof, the difference is clear. For the dataframe containing only 64-bit integers, the size is roughly 3.3 KB (including the dataframe overhead of 24 bytes)
In [44]: sys.getsizeof(df_int)
Out[44]: 3304
For the dataframe initialized with integers converted to strings, it is nearly 24 KB
In [42]: sys.getsizeof(df_str)
Out[42]: 23984
Why does memory usage in Pandas report the same number for integers as for strings (object dtype)?
Following the docs, use 'deep' to get the actual value (otherwise it's an estimate)
df_str.info(memory_usage='deep')
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 100 entries, 0 to 99
#Data columns (total 4 columns):
#A 100 non-null object
#B 100 non-null object
#C 100 non-null object
#D 100 non-null object
#dtypes: object(4)
#memory usage: 23.3 KB
A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based in column dtype and number of rows assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.
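If you want the numbers per column rather than the info() summary, a short sketch using the df_int and df_str frames from the question:
# Per-column sizes in bytes; deep=True follows the object references to the underlying strings
print(df_int.memory_usage(deep=True))  # each int64 column: 100 rows * 8 bytes = 800 bytes
print(df_str.memory_usage(deep=True))  # each object column: 800 bytes of pointers + the string objects themselves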

Quickly sampling large number of rows from large dataframes in python

I have a very large dataframe (about 1.1M rows) and I am trying to sample it.
I have a list of indexes (about 70,000 indexes) that I want to select from the entire dataframe.
This is what I've tried so far, but all these methods take way too much time:
Method 1 - using pandas:
sample = pandas.read_csv("data.csv", index_col = 0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]
Method 2 - writing all the sampled lines to another CSV:
f = open("data.csv", 'r')
out = open("sampled_date.csv", 'w')
out.write(f.readline())   # copy the header line
total = 0
while 1:
    total += 1
    line = f.readline().strip()
    if line == '':
        break
    arr = line.split(",")
    if int(arr[0]) in sample_index_array:
        out.write(line + "\n")   # write the whole matching line
Can anyone suggest a better method, or a way to modify this so it runs faster?
Thanks
We don't have your data, so here is an example with two options:
after reading: use a pandas Index object to select a subset via the .iloc selection method
while reading: a predicate with the skiprows parameter
Given
A collection of indices and a (large) sample DataFrame written to test.csv:
import pandas as pd
import numpy as np
indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
A 1000000 non-null int32
B 1000000 non-null int32
C 1000000 non-null int32
D 1000000 non-null int32
dtypes: int32(4)
memory usage: 15.3 MB
Code
Option 1 - after reading
Convert a sample list of indices to an Index object and slice the loaded DataFrame:
idxs = pd.Index(indices)
subset = df.iloc[idxs, :]
print(subset)
The .iat and .at methods are even faster, but require scalar indices.
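For example, a quick sketch of scalar access with the frame built above (one cell at a time, not a slice):
# Positional scalar lookup: row 900, column position 1 ("B")
value_by_position = df.iat[900, 1]
# Label-based scalar lookup: index label 900, column "B"
value_by_label = df.at[900, "B"]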
Option 2 - while reading (Recommended)
We can write a predicate that keeps selected indices as the file is being read (more efficient):
pred = lambda x: x not in indices
data = pd.read_csv("test.csv", skiprows=pred, index_col=0, names="ABCD")
print(data)
See also the issue that led to extending skiprows.
Results
Both options produce the same output:
A B C D
1 74 95 28 4
2 87 3 49 94
3 53 54 34 97
10 58 41 48 15
20 86 20 92 11
30 36 59 22 5
67 49 23 86 63
78 98 63 60 75
900 26 11 71 85
2176 12 73 58 91
78776 42 30 97 96

Plotting Pandas' pivot_table from long data

I have an xls file with data organized in long format, with four columns: the variable name, the country name, the year, and the value.
After importing the data in Python with pandas.read_excel, I want to plot the time series of one variable for different countries. To do so, I create a pivot table that transforms the data into wide format. When I try to plot with matplotlib, I get an error:
ValueError: could not convert string to float: 'ZAF'
(where 'ZAF' is the label of one country)
What's the problem?
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_excel('raw_emissions_energy.xls','raw data', index_col = None, thousands='.',parse_cols="A,C,F,M")
data['Year'] = data['Year'].astype(str)
data['COU'] = data['COU'].astype(str)
# generate sub-datasets for specific VARs
data_CO2PROD = pd.pivot_table(data[(data['VAR']=='CO2_PBPROD')], index='COU', columns='Year')
plt.plot(data_CO2PROD)
The xls file with raw data looks like:
[screenshot: raw data in Excel]
This is what I get from data_CO2PROD.info()
<class 'pandas.core.frame.DataFrame'>
Index: 105 entries, ARE to ZAF
Data columns (total 16 columns):
(Value, 1990) 104 non-null float64
(Value, 1995) 105 non-null float64
(Value, 2000) 105 non-null float64
(Value, 2001) 105 non-null float64
(Value, 2002) 105 non-null float64
(Value, 2003) 105 non-null float64
(Value, 2004) 105 non-null float64
(Value, 2005) 105 non-null float64
(Value, 2006) 105 non-null float64
(Value, 2007) 105 non-null float64
(Value, 2008) 105 non-null float64
(Value, 2009) 105 non-null float64
(Value, 2010) 105 non-null float64
(Value, 2011) 105 non-null float64
(Value, 2012) 105 non-null float64
(Value, 2013) 105 non-null float64
dtypes: float64(16)
memory usage: 13.9+ KB
None
Using data_CO2PROD.plot() instead of plt.plot(data_CO2PROD) allowed me to plot the data; see http://pandas.pydata.org/pandas-docs/stable/visualization.html.
Simple code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data= pd.DataFrame(np.random.randn(3,4), columns=['VAR','COU','Year','VAL'])
data['VAR'] = ['CC','CC','KK']
data['COU'] =['ZAF','NL','DK']
data['Year']=['1987','1987','2006']
data['VAL'] = [32,33,35]
data['Year'] = data['Year'].astype(str)
data['COU'] = data['COU'].astype(str)
# generate sub-datasets for specific VARs
data_CO2PROD = pd.pivot_table(data=data[(data['VAR']=='CC')], index='COU', columns='Year')
data_CO2PROD.plot()
plt.show()
I think you need to add the values parameter to pivot_table:
data_CO2PROD = pd.pivot_table(data=data[(data['VAR']=='CC')],
                              index='COU',
                              columns='Year',
                              values='VAL')
data_CO2PROD.plot()
plt.show()
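One further note, as a hedged suggestion rather than part of the answers above: with index='COU' and columns='Year', .plot() puts the countries on the x-axis and draws one line per year. If you want the time series for each country (years on the x-axis), transpose first:
# Years on the x-axis, one line per country
data_CO2PROD.T.plot()
plt.show()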

pandas dataframe conversion for linear regression

I read a CSV file into a dataframe (name: data) that has a few columns. The first few columns are numeric (long) values (each column is a pandas.core.series.Series), and the last column (the label) is a binary response variable stored as the strings 'P(ass)'/'F(ail)'.
import statsmodels.api as sm
label = data.ix[:, -1]
label[label == 'P'] = 1
label[label == 'F'] = 0
fea = data.ix[:, 0: -1]
logit = sm.Logit(label, fea)
result = logit.fit()
print result.summary()
Pandas throws me this error message: "ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data)"
NumPy, pandas, etc. are already imported. I tried to convert the fea columns to float, but it still does not go through. Could someone tell me how to fix this?
Thanks
Update:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 68135 to 3002
Data columns (total 8 columns):
TestQty 500 non-null int64
WaferSize 500 non-null int64
ChuckTemp 500 non-null int64
Notch 500 non-null int64
ORIGINALDIEX 500 non-null int64
ORIGINALDIEY 500 non-null int64
DUTNo 500 non-null int64
PassFail 500 non-null object
dtypes: int64(7), object(1)
memory usage: 35.2+ KB
data.sum()
TestQty 530
WaferSize 6000
ChuckTemp 41395
Notch 135000
ORIGINALDIEX 12810
ORIGINALDIEY 7885
DUTNo 271132
PassFail 20
dtype: float64
Shouldn't your features be this:
fea = data.ix[:, 0:-1]
From your data, you can see that PassFail sums to 20 before you convert 'P' to 1 and 'F' to 0. I believe that is the source of your error.
To see what is in there, try:
data.PassFail.unique()
To verify that the counts total 500 (the number of rows in the DataFrame):
len(label[label == 0]) + len(label[label == 1])
Finally, try passing values to the function rather than Series and DataFrames:
logit = sm.Logit(label.values, fea.values)
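A related guess about the root cause: assigning 1/0 into the sliced Series keeps its object dtype, which is exactly what the "cast to numpy dtype of object" message complains about. A minimal sketch of an explicit conversion (using .iloc in place of the older .ix):
# Map the strings to numbers and force numeric dtypes before fitting
label = data.iloc[:, -1].map({'P': 1, 'F': 0}).astype(float)
fea = data.iloc[:, :-1].astype(float)
logit = sm.Logit(label, fea)
result = logit.fit()
print(result.summary())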
