I tried to apply a sequential colormap to a pandas DataFrame. This is my data, and I want to plot it as a heatmap:
      A     G     C     T    -
A     -  5823  1997  1248  962
G  9577     -  2683  2492  788
C  2404  2574     -  9569  722
T  1272  1822  5931     -  767
-   795   583   599   559    -
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(index=["A", "G", "C", "T", "-"], columns=["A", "G", "C", "T", "-"])
column_labels = list("AGCT-")
row_labels = list("AGCT-")
data = df
fig, ax = plt.subplots()
heatmap = ax.pcolor(data, cmap=plt.cm.Blues)
ax.set_xticks(np.arange(data.shape[0])+0.5, minor=False)
ax.set_yticks(np.arange(data.shape[1])+0.5, minor=False)
ax.invert_yaxis()
ax.xaxis.tick_top()
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(column_labels, minor=False)
plt.show()
But it keeps giving an error.
File "//anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "//anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 78, in execfile
builtins.execfile(filename, *where)
File "/Users/macbookpro/Desktop/mutations/first.py", line 115, in <module>
heatmap = ax.pcolor(data, cmap=plt.cm.Blues)
File "//anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 4967, in pcolor
collection.autoscale_None()
File "//anaconda/lib/python2.7/site-packages/matplotlib/cm.py", line 335, in autoscale_None
self.norm.autoscale_None(self._A)
File "//anaconda/lib/python2.7/site-packages/matplotlib/colors.py", line 956, in autoscale_None
self.vmax = ma.max(A)
File "//anaconda/lib/python2.7/site-packages/numpy/ma/core.py", line 6036, in max
return asanyarray(obj).max(axis=axis, fill_value=fill_value, out=out)
File "//anaconda/lib/python2.7/site-packages/numpy/ma/core.py", line 5280, in max
result = self.filled(fill_value).max(axis=axis, out=out).view(type(self))
AttributeError: 'str' object has no attribute 'view'
The problem is that the '-' characters in your dataframe are causing the values to be stored as strings, rather than integers.
You can convert your data frame to integers like this (the first part replaces '-' with 0, and the second part changes the data type to int):
df = df.where(df != '-', 0).astype(int)
df
      A     G     C     T    -
A     0  5823  1997  1248  962
G  9577     0  2683  2492  788
C  2404  2574     0  9569  722
T  1272  1822  5931     0  767
-   795   583   599   559    0
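As a variation on the conversion above, here is a minimal sketch that turns the '-' cells into NaN instead of 0, so the diagonal does not get a color of its own. It assumes the frame was built from the table in the question; the names raw, numeric and masked are only illustrative.

```python
import numpy as np
import pandas as pd

labels = list("AGCT-")
# Data copied from the question; '-' marks the cells without a value.
raw = [
    ["-", 5823, 1997, 1248, 962],
    [9577, "-", 2683, 2492, 788],
    [2404, 2574, "-", 9569, 722],
    [1272, 1822, 5931, "-", 767],
    [795, 583, 599, 559, "-"],
]
df = pd.DataFrame(raw, index=labels, columns=labels)

# Convert each column to numbers; anything unparseable ('-') becomes NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Mask the NaN cells so that a call such as ax.pcolor(masked, cmap=plt.cm.Blues)
# can leave them blank instead of coloring them.
masked = np.ma.masked_invalid(numeric.to_numpy())

print(numeric.dtypes)
print(masked.count(), "cells carry a value")
```

Whether 0 or NaN is the better stand-in depends on whether the diagonal should take part in the color scale.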
I am trying to plot two pandas series
Series A
Private 11210
Self-emp-not-inc 1321
Local-gov 1043
? 963
State-gov 683
Self-emp-inc 579
Federal-gov 472
Without-pay 7
Never-worked 3
Name: workclass, dtype: int64
Series B
Self-emp-not-inc 1321
Local-gov 1043
State-gov 683
Self-emp-inc 579
Federal-gov 472
Without-pay 7
Never-worked 3
Name: workclass, dtype: int64
g = sns.barplot(x=A.index, y=A.values, color='green', ax=faxes[ax_id]) # some subplot
g.set_xticklabels(g.get_xticklabels(), rotation=30)
sns.barplot(x=B.index, y=B.values, color='red', ax=faxes[ax_id])
The first plot draws as expected.
However, once I draw the second one, something goes wrong (a couple of bars disappear, the labels are incorrect, etc.).
Partially related: how can I use a log scale for the y-axis? (11K vs. 3 hides the small numbers completely.)
You can concatenate A and B joining the index. Rows that appear in one but not in the other will be filled in with NaN or NA and will not be shown in the bar plot.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
A = pd.Series({'Private': 11210,
'Self-emp-not-inc': 1321,
'Local-gov': 1043,
'?': 963,
'State-gov': 683,
'Self-emp-inc': 579,
'Federal-gov': 472,
'Without-pay': 7,
'Never-worked': 3}, name='workclass')
B = pd.Series({'Self-emp-not-inc': 1321,
'Local-gov': 1043,
'State-gov': 683,
'Self-emp-inc': 579,
'Federal-gov': 472,
'Without-pay': 7,
'Never-worked': 3}, name='workclass')
df = pd.concat([A.rename('workclass A'), B.rename('workclass B')], axis=1)
ax = df.plot.bar(rot=30, color=['darkgreen', 'crimson'])
plt.tight_layout()
plt.show()
The concatenated dataframe looks like:
                  workclass A  workclass B
Private                 11210          NaN
Self-emp-not-inc         1321       1321.0
Local-gov                1043       1043.0
?                         963          NaN
State-gov                 683        683.0
Self-emp-inc              579        579.0
Federal-gov               472        472.0
Without-pay                 7          7.0
Never-worked                3          3.0
Note that an integer can't be NaN, so B is automatically converted to a float type.
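The alignment and the dtype change can be checked on a shortened version of the two series. This is only a sketch: the series are cut down to a few entries from the question, and the commented-out plotting line uses the logy option of DataFrame.plot as one possible way to get the log-scaled y-axis the question asks about.

```python
import pandas as pd

# Shortened versions of the two series from the question.
A = pd.Series({"Private": 11210, "Self-emp-not-inc": 1321, "?": 963, "Never-worked": 3})
B = pd.Series({"Self-emp-not-inc": 1321, "Never-worked": 3})

# Joining on the index: labels missing from B are filled with NaN.
df = pd.concat([A.rename("workclass A"), B.rename("workclass B")], axis=1)

print(df)
print(df.dtypes)  # B becomes float because NaN cannot be stored in an integer column

# With matplotlib installed, a log-scaled bar plot could then be drawn with:
# df.plot.bar(rot=30, logy=True)
```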
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
A = {'Private':11210,
'Self-emp-not-inc':1321,
'Local-gov':1043,
'?':963,
'State-gov':683,
'Self-emp-inc':579,
'Federal-gov':472,
'Without-pay':7,
'Never-worked':3}
B = {'Self-emp-not-inc':1321,
'Local-gov':1043,
'State-gov':683,
'Self-emp-inc':579,
'Federal-gov':472,
'Without-pay':7,
'Never-worked':3}
df = pd.concat([pd.Series(A, name='A'), pd.Series(B, name='B')], axis=1)
sns.barplot(y=df.A.values, x=df.index, color='b', alpha=0.4, label='A')
sns.barplot(y=df.B.values, x=df.index, color='r', alpha=0.4, label='B', bottom=df.A.values)
plt.yscale('log')
I have made a pandas dataframe from a CSV file like so:
import pandas as pd
data = pd.read_csv('dataset.csv')
In it, there's a column called CLASS. These are the values contained in CLASS:
from collections import Counter
Counter(CLASS)
Out [1]: Counter({'1': 60783, '2': 37313, '3': 2564, '4': 959, ' ': 346, 'D': 27})
Now, I add a column to the dataframe manually, and save it in a new csv:
data['DURATION'] = DURATION
data.to_csv('new_dataset.csv')
Then, when I open the new CSV and check the values in CLASS, some of them have become integers!
dataset = pd.read_csv('new_dataset.csv')
CLASS = dataset['OCCUP_CLASS']
Counter(CLASS)
Out [1]: Counter({' ': 346,
1: 48418,
'1': 12365,
2: 16189,
'2': 21124,
3: 848,
'3': 1716,
4: 81,
'4': 878,
'D': 43})
Why is this happening? This creates problems as I cannot plot or make histograms of CLASS anymore, while before I was able to do so:
import matplotlib.pyplot as plt
plt.plot(CLASS)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-158-b6bafcfd7ad5> in <module>()
1 import matplotlib.pyplot as plt
----> 2 plt.plot(CLASS)
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\pyplot.py in plot(*args, **kwargs)
3356 mplDeprecation)
3357 try:
-> 3358 ret = ax.plot(*args, **kwargs)
3359 finally:
3360 ax._hold = washold
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
1853 "the Matplotlib list!)" % (label_namer, func.__name__),
1854 RuntimeWarning, stacklevel=2)
-> 1855 return func(ax, *args, **kwargs)
1856
1857 inner.__doc__ = _add_data_doc(inner.__doc__,
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_axes.py in plot(self, *args, **kwargs)
1526
1527 for line in self._get_lines(*args, **kwargs):
-> 1528 self.add_line(line)
1529 lines.append(line)
1530
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_base.py in add_line(self, line)
1930 line.set_clip_path(self.patch)
1931
-> 1932 self._update_line_limits(line)
1933 if not line.get_label():
1934 line.set_label('_line%d' % len(self.lines))
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_base.py in _update_line_limits(self, line)
1952 Figures out the data limit of the given line, updating self.dataLim.
1953 """
-> 1954 path = line.get_path()
1955 if path.vertices.size == 0:
1956 return
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\lines.py in get_path(self)
949 """
950 if self._invalidy or self._invalidx:
--> 951 self.recache()
952 return self._path
953
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\lines.py in recache(self, always)
655 if always or self._invalidy:
656 yconv = self.convert_yunits(self._yorig)
--> 657 y = _to_unmasked_float_array(yconv).ravel()
658 else:
659 y = self._y
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\cbook\__init__.py in _to_unmasked_float_array(x)
2048 return np.ma.asarray(x, float).filled(np.nan)
2049 else:
-> 2050 return np.asarray(x, float)
2051
2052
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
ValueError: could not convert string to float:
EDIT: Adding the first 20 rows of the 2 relevant columns from the dataset:
DURATION CLASS
10 1
14 1
-1 1
-1 1
0 1
-1 1
14 2
8 2
-1 1
14 3
-1 3
-1
-1 4
-1 4
-1 3
8 1
-1 2
-1 2
-1 1
EDIT 2: Output of print(dataset['CLASS'].value_counts()):
import pandas as pd
dataset = pd.read_csv('dataset.csv', dtype={'CLASS': str})
print(dataset['CLASS'].value_counts())
1 48418
2 21124
2 16189
1 12365
3 1716
4 878
3 848
346
4 81
D 43
Name: CLASS, dtype: int64
EDIT 3: Plotting is not a problem for blank elements, as shown with the following plot with the original data, where the blank point on x-axis is highlighted:
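Pandas tries to infer the data type of each column, but the inference is not always consistent, as you noticed: here CLASS ends up holding both integers and strings. This typically happens when a large file is parsed in chunks and each chunk is inferred separately. You can force the data type of a column in read_csv like this: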
dataset = pd.read_csv('new_dataset.csv', dtype={'CLASS': str})
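The effect of the dtype argument can be seen on a tiny in-memory file. This is a sketch with made-up rows (the real dataset is not reproduced here), read through io.StringIO instead of from disk.

```python
import io

import pandas as pd

# A made-up miniature of the file: CLASS holds only digits here.
csv_text = "DURATION,CLASS\n10,1\n14,2\n-1,3\n8,4\n"

# Left to itself, pandas infers an integer column.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype forced, every CLASS value is kept as a string.
forced = pd.read_csv(io.StringIO(csv_text), dtype={"CLASS": str})

print(inferred["CLASS"].tolist())
print(forced["CLASS"].tolist())
```

Once every value is a string, '1' and 1 can no longer show up as separate categories in Counter or value_counts.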
I have a dataframe like:
x1 y1 x2 y2
0 149 2653 2152 2656
1 149 2465 2152 2468
2 149 1403 2152 1406
3 149 1215 2152 1218
4 170 2692 2170 2695
5 170 2475 2170 2478
6 170 1413 2170 1416
7 170 1285 2170 1288
I need to pair up every two consecutive rows by their index, i.e. [0,1], [2,3], [4,5], [6,7], etc.,
and extract x1,y1 from the first row of each pair and x2,y2 from the second row of that pair.
Sample Output:
[[149,2653,2152,2468],[149,1403,2152,1218],[170,2692,2170,2478],[170,1413,2170,1288]]
Please feel free to ask if it's not clear.
So far I have tried grouping by pairs and using a shift operation,
but I didn't manage to build the paired records.
Python solution:
Select the column values at alternating row positions and convert them to lists:
a = df[['x2', 'y2']].iloc[1::2].values.tolist()
b = df[['x1', 'y1']].iloc[0::2].values.tolist()
And then zip and join together in list comprehension:
L = [y + x for x, y in zip(a, b)]
print (L)
[[149, 2653, 2152, 2468], [149, 1403, 2152, 1218],
[170, 2692, 2170, 2478], [170, 1413, 2170, 1288]]
Thank you, @user2285236, for another solution:
L = np.concatenate([df.loc[::2, ['x1', 'y1']], df.loc[1::2, ['x2', 'y2']]], axis=1).tolist()
Pure pandas solution:
First, apply DataFrameGroupBy.shift within each group of 2 rows:
df[['x2', 'y2']] = df.groupby(np.arange(len(df)) // 2)[['x2', 'y2']].shift(-1)
print (df)
x1 y1 x2 y2
0 149 2653 2152.0 2468.0
1 149 2465 NaN NaN
2 149 1403 2152.0 1218.0
3 149 1215 NaN NaN
4 170 2692 2170.0 2478.0
5 170 2475 NaN NaN
6 170 1413 2170.0 1288.0
7 170 1285 NaN NaN
Then remove the rows containing NaNs, convert to int, and then to a list:
print (df.dropna().astype(int).values.tolist())
[[149, 2653, 2152, 2468], [149, 1403, 2152, 1218],
[170, 2692, 2170, 2478], [170, 1413, 2170, 1288]]
Here's one solution via numpy.hstack. Note it is natural to feed numpy arrays directly to pd.DataFrame, since this is how Pandas stores data internally.
import numpy as np
arr = np.hstack((df[['x1', 'y1']].values[::2],
df[['x2', 'y2']].values[1::2]))
res = pd.DataFrame(arr)
print(res)
0 1 2 3
0 149 2653 2152 2468
1 149 1403 2152 1218
2 170 2692 2170 2478
3 170 1413 2170 1288
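The same pairing can also be sketched with a single reshape. It assumes an even number of rows and the column order x1, y1, x2, y2; the names pairs and result are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[149, 2653, 2152, 2656], [149, 2465, 2152, 2468],
     [149, 1403, 2152, 1406], [149, 1215, 2152, 1218],
     [170, 2692, 2170, 2695], [170, 2475, 2170, 2478],
     [170, 1413, 2170, 1416], [170, 1285, 2170, 1288]],
    columns=["x1", "y1", "x2", "y2"],
)

# View the 8x4 table as 4 pairs, each holding 2 rows of 4 columns.
pairs = df.to_numpy().reshape(-1, 2, 4)

# The first row of each pair supplies x1, y1; the second row supplies x2, y2.
result = np.hstack((pairs[:, 0, :2], pairs[:, 1, 2:])).tolist()
print(result)
```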
Here's a solution using a custom iterator based on iterrows(), but it's a bit clunky:
import pandas as pd
df = pd.DataFrame( columns=['x1','y1','x2','y2'], data=
[[149, 2653, 2152, 2656], [149, 2465, 2152, 2468], [149, 1403, 2152, 1406], [149, 1215, 2152, 1218],
[170, 2692, 2170, 2695], [170, 2475, 2170, 2478], [170, 1413, 2170, 1416], [170, 1285, 2170, 1288]] )
def iter_oddeven_pairs(df):
    row_it = df.iterrows()
    try:
        while True:
            # First row of the pair: keep x1, y1
            _, row = next(row_it)
            yield row.iloc[0:2]
            # Second row of the pair: keep x2, y2
            _, row = next(row_it)
            yield row.iloc[2:4]
    except StopIteration:
        pass
print(pd.concat([pair for pair in iter_oddeven_pairs(df)]))
So I have this df with the first column called "Week":
0 2018-01-07
1 2018-01-14
2 2018-01-21
3 2018-01-28
4 2018-02-04
5 2018-02-11
6 2018-02-18
7 2018-02-25
8 2018-03-04
9 2018-03-11
10 2018-03-18
11 2018-03-25
12 2018-04-01
13 2018-04-08
14 2018-04-15
15 2018-04-22
16 2018-04-29
17 2018-05-06
Name: Week, dtype: object
There are three other columns with different names and integers as values.
My idea is to plot these dates on the X axis and the other 3 integer columns on the Y axis.
I've tried everything I found, but nothing has worked yet...
I did:
df.set_index('Week')
df.plot()
plt.show()
This plotted fine, but the X axis is still a float in range(0, 17)...
I also tried:
df['Week'] = pd.to_datetime(df['Week'])
df.set_index('Week')
df.plot()
plt.show()
But I got this error:
Traceback (most recent call last):
File "C:\Users\mar\Desktop\Web Dev\PJ E\EZA.py", line 33, in <module>
df.plot()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 2677, in __call__
sort_columns=sort_columns, **kwds)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1902, in plot_frame
**kwds)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1729, in _plot
plot_obj.generate()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 258, in generate
self._post_plot_logic_common(ax, self.data)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 397, in _post_plot_logic_common
self._apply_axis_properties(ax.yaxis, fontsize=self.fontsize)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 470, in _apply_axis_properties
labels = axis.get_majorticklabels() + axis.get_minorticklabels()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\axis.py", line 1188, in get_majorticklabels
ticks = self.get_major_ticks()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\axis.py", line 1339, in get_major_ticks
numticks = len(self.get_major_locator()())
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 1054, in __call__
self.refresh()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 1074, in refresh
dmin, dmax = self.viewlim_to_dt()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 832, in viewlim_to_dt
return num2date(vmin, self.tz), num2date(vmax, self.tz)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 441, in num2date
return _from_ordinalf(x, tz)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 256, in _from_ordinalf
dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC)
ValueError: ordinal must be >= 1
Thanks in advance.
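set_index returns a new DataFrame and leaves df unchanged unless you pass inplace=True (or assign the result back to df), so you can do something like this: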
df['Week'] = pd.to_datetime(df['Week'])
df.set_index('Week', inplace=True)
df.plot()
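The difference between discarding and keeping the result of set_index can be checked on a small frame. This is a sketch with made-up data; the visits column is hypothetical and stands in for the three integer columns of the question.

```python
import pandas as pd

# Hypothetical miniature of the frame in the question.
df = pd.DataFrame({
    "Week": ["2018-01-07", "2018-01-14", "2018-01-21"],
    "visits": [3, 5, 4],
})
df["Week"] = pd.to_datetime(df["Week"])

# The return value is thrown away here, so df keeps its 0..n-1 index.
df.set_index("Week")
print(type(df.index).__name__)

# Keeping the result (or passing inplace=True) gives a date index,
# which is what df.plot() then uses for the X axis.
indexed = df.set_index("Week")
print(type(indexed.index).__name__)
```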
I have this test code, and it works well:
data = pd.read_csv('/home/noodle/B_train.csv')
print(data.head())
features = data.iloc[:, :-1].as_matrix()
targets = data.iloc[:, -1:].as_matrix()
targets = targets.reshape(-1)
print(targets.shape, utils.multiclass.type_of_target(targets))
clf = tree.DecisionTreeClassifier(max_depth=5)
scores = model_selection.cross_val_score(clf, features, targets)
print(scores)
The shape of targets is (115,) and type_of_target is 'binary'.
Here's the head of the data:
no x y z m k l t
0 17 1 4 1 1 1020 1 1
1 17 1 10 2 1 1037 2 1
2 18 1 5 1 1 1512 3 1
3 18 1 2 0 1 1440 1 1
4 15 1 4 1 1 465 1 1
Here comes the problem:
when I run another piece of code, it raises this error:
File "/home/noodle/PycharmProjects/qh/dc_tree.py", line 61, in find_common
scores = model_selection.cross_val_score(clf, features, labels, cv=5)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/model_selection/_validation.py", line 130, in cross_val_score
cv = check_cv(cv, y, classifier=is_classifier(estimator))
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/model_selection/_split.py", line 1584, in check_cv
(type_of_target(y) in ('binary', 'multiclass'))):
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/utils/multiclass.py", line 237, in type_of_target
if is_multilabel(y):
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/utils/multiclass.py", line 153, in is_multilabel
labels = np.unique(y)
File "/usr/local/python34/lib/python3.4/site-packages/numpy/lib/arraysetops.py", line 214, in unique
ar.sort()
TypeError: unorderable types: str() > float()
Here is the code and the head of the data:
data = data.as_matrix()
labels = data[:, 0]
features = data[:, 1:]
print(labels.shape, utils.multiclass.type_of_target(labels))
clf = RandomForestClassifier(n_estimators=i, max_depth=None,
min_samples_split=2, random_state=0)
scores = model_selection.cross_val_score(clf, features, labels, cv=5)
working data head:
flag UserInfo_1 UserInfo_2 UserInfo_3 UserInfo_4 ProductInfo_1
0 0 missing 5226.590000 0.000000 0.0 0.0
1 0 missing 0.000000 0.000000 0.0 0.0
2 0 missing 5272.206555 2412.077228 missing missing
3 0 missing 5272.206555 2412.077228 missing missing
4 0 missing 5272.206555 2412.077228 missing missing
The shape of labels is (4000,) and type_of_target is 'binary'. There seems to be no difference between labels and targets (in the test part) except the size of the first dimension, so I think the error may be caused by the strings in features. I didn't want to run get_dummies on my working data at first, so I tried changing the test data into:
no x y z m k l t
0 17 g 4 1 1 1020 1 1
1 17 g 10 2 1 1037 2 1
2 18 g 5 1 1 1512 3 1
3 18 g 2 0 1 1440 1 1
4 15 g 4 1 1 465 1 1
and ran it to figure out what's wrong, but it raises a different error:
Traceback (most recent call last):
File "/home/noodle/PycharmProjects/bigtest/tensortest.py", line 71, in <module>
scores = model_selection.cross_val_score(clf, features, targets1)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/model_selection/_validation.py", line 140, in cross_val_score
for train, test in cv_iter)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
result = ImmediateResult(func)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
self.results = batch()
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/model_selection/_validation.py", line 238, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/tree/tree.py", line 739, in fit
X_idx_sorted=X_idx_sorted)
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/tree/tree.py", line 122, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/python34/lib/python3.4/site-packages/sklearn/utils/validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'n'
So the error that the working part raises is not caused by the strings in the working data... right? How can I fix it?
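The first traceback fails inside np.unique(y), which suggests that the label array itself mixes str and float values. One way to check this is to count the Python types stored in the column. The sketch below uses made-up data: the column names follow the question, but the values (including the 'missing' placeholder inside flag) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data: 'flag' deliberately mixes ints, a string and floats.
data = pd.DataFrame({
    "flag": [0, 1, "missing", np.nan, 0.0],
    "UserInfo_2": [5226.59, 0.0, 5272.2, 2412.07, 0.0],
})

# Count which Python types actually sit in the label column.
type_counts = data["flag"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Coercing to numbers turns unparseable entries into NaN, which can then be
# dropped or imputed before the labels are handed to cross_val_score.
clean = pd.to_numeric(data["flag"], errors="coerce")
print(int(clean.isna().sum()), "labels are not numeric")
```

If the count shows only one type for the labels, the mix has to come from somewhere else, for example from converting the whole frame to a single object array before slicing out the first column.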