Descriptive statistics along rows of text file using Pandas - python

I am reading a text file using Pandas in Python 2.7. The dataset used in this question is related to a question that I had asked before here. To be specific, the first two rows and the first column of my data consist of text information. The following is a snapshot of a truncated version of my dataset.
The data file can be found here. I am using the helpful answers given here to load the dataset (df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)).
I want to get descriptive statistics of my pandas dataframe along rows, instead of columns. I have tried using df.describe(), but it gives me descriptive statistics along columns. I had a look at the answers given in this question, but I get the following error when I use the answers suggested in that link.
TypeError: ('unbound method describe() must be called with DataFrame instance as first argument (got Series instance instead)', u'occurred at index foxq1')
How can I get descriptive statistics using Pandas for the numerical entries in every row for the dataset that I have? Thanks in advance.
Following a few comments, I am including the actual code that I am using, and the error message:
The actual code is this:
import pandas as pd

df = pd.read_csv('dum.txt', sep='\t', header=[0,1], index_col=0)
df.apply(pd.DataFrame.describe, axis=1)
Error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-0d7a5fde0f42> in <module>()
----> 1 df.apply(pd.DataFrame.describe, axis=1)
2 #df.apply(pd.DataFrame.describe, axis=1)
/Users/LG/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4260 f, axis,
4261 reduce=reduce,
-> 4262 ignore_failures=ignore_failures)
4263 else:
4264 return self._apply_broadcast(f, axis)
/Users/LG/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
4356 try:
4357 for i, v in enumerate(series_gen):
-> 4358 results[i] = func(v)
4359 keys.append(v.name)
4360 except Exception as e:
TypeError: ('unbound method describe() must be called with DataFrame instance as first argument (got Series instance instead)', u'occurred at index object1')

From the question you referenced, you can just use this code (in other words, apply describe along the rows):
df.apply(pd.DataFrame.describe, axis=1)
And you get the following result:
count mean std min 25% 50% 75% max
object1 5.0 3.1 1.581139 1.1 2.1 3.1 4.1 5.1
object2 5.0 3.2 1.581139 1.2 2.2 3.2 4.2 5.2
object3 5.0 3.3 1.581139 1.3 2.3 3.3 4.3 5.3
object4 5.0 3.4 1.581139 1.4 2.4 3.4 4.4 5.4
object5 5.0 3.5 1.581139 1.5 2.5 3.5 4.5 5.5
object6 5.0 3.6 1.581139 1.6 2.6 3.6 4.6 5.6
object7 5.0 3.7 1.581139 1.7 2.7 3.7 4.7 5.7
object8 5.0 3.8 1.581139 1.8 2.8 3.8 4.8 5.8
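Note that on Python 2 (which your traceback shows), pd.DataFrame.describe is an unbound method, so apply() handing it a Series for each row raises exactly the TypeError you quoted. Going through pd.Series instead should avoid it:
# pd.Series.describe accepts the Series that apply() passes per row
df.apply(pd.Series.describe, axis=1)
# or, equivalently, via a lambda
df.apply(lambda row: row.describe(), axis=1)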

You can also use NumPy to obtain most of these statistics for the rows:
import numpy as np
import pandas as pd

df = pd.read_csv('dum.txt', sep='\t', header=[0,1], index_col=0)
print df
Type T1 T2 T3 T4 T5 T6 T7
Tag Tag1 Tag1 Tag1 Tag5 Tag5 Tag6 Tag6
object1 1.1 2.1 3.1 4.1 5.1 6.1 7.1
object2 1.2 2.2 3.2 4.2 5.2 6.2 7.2
object3 1.3 2.3 3.3 4.3 5.3 6.3 7.3
object4 1.4 2.4 3.4 4.4 5.4 6.4 7.4
object5 1.5 2.5 3.5 4.5 5.5 6.5 7.5
object6 1.6 2.6 3.6 4.6 5.6 6.6 7.6
object7 1.7 2.7 3.7 4.7 5.7 6.7 7.7
object8 1.8 2.8 3.8 4.8 5.8 6.8 7.8
data = df.values
data_mean = np.mean(data, axis=1)
data_std = np.std(data, axis=1)
data_min = np.min(data, axis=1)
data_max = np.max(data, axis=1)
print data_mean
[ 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8]
print data_std
[ 2. 2. 2. 2. 2. 2. 2. 2.]
print data_min
[ 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8]
print data_max
[ 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8]
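One caveat: np.std defaults to the population standard deviation (ddof=0), while pandas describe() and std() report the sample standard deviation (ddof=1), so the two will disagree. Pass ddof=1 to match pandas:
# sample standard deviation, as pandas computes it
data_std = np.std(data, axis=1, ddof=1)
You can also get these statistics straight from the DataFrame, without going through .values, with df.mean(axis=1), df.std(axis=1), and so on.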

Related

Python : TypeError when taking CSV data from url with IPython.display

I am trying to fetch the data from a direct URL in a Python Jupyter Notebook, but the error I am getting is really frustrating me.
Here is the link which I am fetching:
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
I am simply using the following in my Jupyter Notebook:
from IPython.display import HTML
HTML('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
The error I am getting is the following:
TypeError Traceback (most recent call last)
<ipython-input-15-0a8be2c0a7c6> in <module>
1 from IPython.display import HTML
----> 2 HTML('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\display.py in __init__(self, data, url, filename, metadata)
693 if warn():
694 warnings.warn("Consider using IPython.display.IFrame instead")
--> 695 super(HTML, self).__init__(data=data, url=url, filename=filename, metadata=metadata)
696
697 def _repr_html_(self):
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\display.py in __init__(self, data, url, filename, metadata)
619
620 self.reload()
--> 621 self._check_data()
622
623 def __repr__(self):
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\display.py in _check_data(self)
668 def _check_data(self):
669 if self.data is not None and not isinstance(self.data, str):
--> 670 raise TypeError("%s expects text, not %r" % (self.__class__.__name__, self.data))
671
672 class Pretty(TextDisplayObject):
TypeError: HTML expects text, not b'5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n4.7,3.2,1.3,0.2,Iris-setosa\n4.6, ....
Any help is really appreciated.
The error is due to the fact that HTML() doesn't expect CSV. When you pass it a URL, it downloads the content as raw bytes, and _check_data then rejects anything that isn't an HTML text string.
Use pandas' read_csv to easily and conveniently read the CSV file:
import pandas as pd
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
print(data)
Output:
5.1 3.5 1.4 0.2 Iris-setosa
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
.. ... ... ... ... ...
144 6.7 3.0 5.2 2.3 Iris-virginica
145 6.3 2.5 5.0 1.9 Iris-virginica
146 6.5 3.0 5.2 2.0 Iris-virginica
147 6.2 3.4 5.4 2.3 Iris-virginica
148 5.9 3.0 5.1 1.8 Iris-virginica
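Note that iris.data has no header row, so read_csv above has consumed the first sample (5.1,3.5,1.4,0.2,Iris-setosa) as column names and only 149 data rows remain. A small sketch that keeps all 150 rows (the column names here are made up for illustration):
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# the file ships without a header line, so supply names explicitly
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
data = pd.read_csv(url, header=None, names=cols)
print(data.shape)  # (150, 5)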

Python - Find the difference in two csv given the threshold values

I have two CSVs, one excerpted from the other. I want to compare the two and produce a new CSV with the differences between them. Originally I implemented sets and compared both; if the sets don't match, the row is appended to the new file. However, I ran into the issue that even a slight change in a number shows up as a difference between the two files. Is there a memory-efficient way to write the code so that it compares the cell values and appends the row only if the difference is more than 2? (A sketch of one approach follows the samples below.)
Samples of the two files are shown after the code. The code that I have used is as below:
orig = open('T1.csv', 'r')
new = open('T2.csv', 'r')
bigb = set(new) - set(orig)
print(bigb)
with open('different.csv', 'w') as file_out:
    for line in bigb:
        file_out.write(line)
orig.close()
new.close()
csv1:
0 1.1 -19.1 -29.1
1 2.1 -18.1 -28.1
2 3.1 -17.1 -27.1
3 4.1 -16.1 -26.1
4 5.1 -15.1 -25.1
5 6.1 -14.1 -24.1
6 7.1 -13.1 -23.1
7 8.1 -12.1 -22.1
8 9.1 -11.1 -21.1
9 10.1 -10.1 -20.1
10 11.1 -9.1 -19.1
csv2:
0 1.4 -19.6 -29.8
1 2.4 -18.6 -28.8
2 3.4 -17.6 -27.8
3 4.4 -16.6 -26.8
4 5.4 -15.6 -25.8
5 6.4 -14.6 -24.8
6 7.4 -13.6 -23.8
7 8.4 -12.6 -22.8
8 9.4 -11.6 -21.8
9 10.4 -10.6 -20.8
10 11.4 -9.6 -19.8
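A possible pandas-based approach (a minimal sketch, assuming both files are numeric tables of the same shape, whitespace-delimited as in the samples, and with no header row) is to read both into DataFrames and keep only the rows where some cell differs by more than the threshold:
import pandas as pd
t1 = pd.read_csv('T1.csv', delim_whitespace=True, header=None)
t2 = pd.read_csv('T2.csv', delim_whitespace=True, header=None)
# keep rows where any cell differs by more than 2
mask = (t1 - t2).abs().gt(2).any(axis=1)
t2[mask].to_csv('different.csv', sep=' ', index=False, header=False)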

Understanding the details of this Python Code

The task is to load the iris data set from sklearn and then make some plots. I wish to understand what each command is doing.
from sklearn.datasets import load_iris
Q1 Is load_iris a function in sklearn?
data = load_iris()
Q2 Now I believe this load_iris function is returning some output which we are storing as data. What exactly is the output of load_iris()? type etc?
df = pd.DataFrame(data.data, columns=data.feature_names)
Q3 Now we are storing this as a dataframe, but what are data.data and data.feature_names?
df['target_names'] = [data.target_names[i] for i in data.target]
Q4 I don't understand the right hand side of the above code
I need help with Questions 1, 2, 3 and 4. I tried looking at the scikit-learn documentation but didn't understand it. Also, this code is from an online course on edX, but they didn't explain the code.
Discover the power of interactivity of Jupyter/IPython.
I'm using IPython in this example.
Q1 Is load_iris a function in sklearn?
In [33]: type(load_iris)
Out[33]: function
Q2 Now I believe this load_iris function is returning some output
which we are storing as data. What exactly is the output of
load_iris()? type etc?
The docstring is very helpful:
In [34]: load_iris?
Signature: load_iris(return_X_y=False)
Docstring:
Load and return the iris dataset (classification).
The iris dataset is a classic and very easy multi-class classification
dataset.
================= ==============
Classes 3
Samples per class 50
Samples total 150
Dimensionality 4
Features real, positive
================= ==============
Read more in the :ref:`User Guide <datasets>`.
Parameters
----------
return_X_y : boolean, default=False.
If True, returns ``(data, target)`` instead of a Bunch object. See
below for more information about the `data` and `target` object.
.. versionadded:: 0.18
Returns
-------
data : Bunch
Dictionary-like object, the interesting attributes are:
'data', the data to learn, 'target', the classification labels,
'target_names', the meaning of the labels, 'feature_names', the
meaning of the features, and 'DESCR', the
full description of the dataset.
(data, target) : tuple if ``return_X_y`` is True
...
Print the description:
In [51]: print(data.DESCR)
Iris Plants Database
====================
Notes
-----
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
...
Q3 Now we are storing this as a dataframe. but what is data.data and data.feature_names
In [37]: type(data.data)
Out[37]: numpy.ndarray
In [88]: data.data.shape
Out[88]: (150, 4)
In [38]: df = pd.DataFrame(data.data, columns=data.feature_names)
In [39]: pd.set_option('display.max_rows', 10)
In [40]: df
Out[40]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
[150 rows x 4 columns]
In [41]: df.columns
Out[41]: Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], dtype='object')
In [42]: data.feature_names
Out[42]:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
Q4 I don't understand the right hand side of the above code
Execute the code and check the result - usually it's easy to see what has happened: data.target is an array of integer class labels (0, 1, 2) and data.target_names holds the corresponding names, so the comprehension looks up the name for each label. BTW, I'd use NumPy for this step:
In [49]: df['target_names'] = np.take(data.target_names, data.target)
In [50]: df
Out[50]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target_names
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
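For completeness, NumPy's integer-array indexing does the same lookup as np.take and mirrors the original list comprehension most directly:
# data.target holds integer labels (0, 1, 2); indexing target_names
# with that array picks the matching name for every sample
df['target_names'] = data.target_names[data.target]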

Python: List Nested Dictionary to pandas DataFrame Issue

I am struggling with a simple from_dict conversion. I have a dictionary of dictionaries of lists, as below (quite confusing to me as well):
dict_total = {'Jane' : {'a1' : [1.1,1.3,1.4,1.9],
'a2' : [3.1,2.4,2.3,1.2],
'a3' : [4.3,2.3,1.5,5.3],
'st' : ['d','dc','sc','sc']},
'Mark' : {'a1' : [3.1,2.3,1.3,1.9],
'a2' : [1.2,2.3,9.3,1.2],
'a3' : [1.1,5.5,1.2,5.3],
'st' : ['cs','s','wc','cd']}
}
The above is just a simple example, but my original contains more than 20,000 keys in dict_total. I want to convert this dictionary to a dataframe (hopefully in a loop) like below:
df_total =
a1 a2 a3 st
Jane 1.1 3.1 4.3 d
Jane 1.3 2.4 2.3 dc
Jane 1.4 2.3 1.5 sc
Jane 1.9 1.2 5.3 sc
Mark 3.1 1.2 1.1 cs
Mark 2.3 2.3 5.5 s
Mark 1.3 9.3 1.2 wc
Mark 1.9 1.2 5.3 cd
As you can see, the keys of dict_total become the index of the dataframe, the keys under "Jane" and "Mark" become the column names, and the lists supply the values.
I hope there is a pythonic way to solve this. Thanks.
I think you need concat with a dict comprehension, then remove the first index level by reset_index - concat given a dict builds a MultiIndex of (name, row number), and dropping level 1 leaves just the names:
df_total = (pd.concat({k: pd.DataFrame(v) for k, v in dict_total.items()})
.reset_index(level=1, drop=True))
print (df_total)
a1 a2 a3 st
Jane 1.1 3.1 4.3 d
Jane 1.3 2.4 2.3 dc
Jane 1.4 2.3 1.5 sc
Jane 1.9 1.2 5.3 sc
Mark 3.1 1.2 1.1 cs
Mark 2.3 2.3 5.5 s
Mark 1.3 9.3 1.2 wc
Mark 1.9 1.2 5.3 cd
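If the dict comprehension feels too dense, a plain loop (a readability-oriented sketch of the same idea) builds the same frame, though row order may differ:
frames = []
for name, d in dict_total.items():
    f = pd.DataFrame(d)
    f.index = [name] * len(f)  # repeat the person's name as the index
    frames.append(f)
df_total = pd.concat(frames)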

python pandas groupby and subtract columns from different groups

I have a dataframe df1
pid stat h1 h2 h3 h4 h5 h6 ... h20
1 a 3.2 3.5 6.2 7.1 1.2 2.3 ... 3.2
1 b 3.3 1.5 4.2 7.7 4.2 3.5 ... 8.4
1 a 3.1 3.8 2.2 1.1 6.2 5.3 ... 9.2
1 b 3.7 1.2 8.2 4.7 3.2 8.5 ... 2.4
: : : : : : : : : :
2 a 2.2 3.8 6.2 7.3 1.3 4.3 ... 3.2
2 b 4.3 1.3 4.2 5.7 2.2 3.1 ... 2.4
2 a 2.1 3.7 2.4 1.6 6.4 9.3 ... 9.6
2 b 3.8 1.3 8.7 3.7 7.2 8.3 ... 9.4
: : : : : : : : : :
3 a 2.2 3.8 6.2 7.3 1.3 4.3 ... 3.2
3 b 4.3 1.3 4.2 5.7 2.2 3.1 ... 2.4
3 a 2.1 3.7 2.4 1.6 6.4 9.3 ... 9.6
3 b 3.8 1.3 8.7 3.7 7.2 8.3 ... 9.4
: : : : : : : : : :
I would like to obtain groups indexed on pid and stat, and then subtract the h values of one group from the h values of another to build a final dataframe (df2). This final dataframe needs to be reindexed with numbers from 0 to len(groups). I want to repeat this iteratively for all permutations of pid, like 1-2, 1-3, 1-4, 2-1, 2-3, etc. I then need to perform other calculations on the final dataframe df2 (the values in the df2 below are not the exact differences, just a representation):
pid(string) stat h1p1-h1p2 h2p1-h2p2 h3p1-h3p2 h4p1-h4p2 h5p1-h5p2 h6p1-h6p2 ... h20p1-h20p2
1-2 a 3.2 3.5 6.2 7.1 1.2 2.3 ... 3.2
1-2 b 3.3 1.5 4.2 7.7 4.2 3.5 ... 8.4
1-2 a 3.1 3.8 2.2 1.1 6.2 5.3 ... 9.2
1-2 b 3.7 1.2 8.2 4.7 3.2 8.5 ... 2.4
1-3 ....
I looked at these options:
for (pid, stat), group in df1.groupby(['pid', 'stat']):
    print('pid = %s Stat = %s' % (pid, stat))
    print group
This gives me the groups, but I am not sure how to access the dataframes from this for loop and use them for subtracting from other groups. I also tried:
df_grouped = df.groupby(['pid', 'stat']).groups()
but I am still not sure how to access the new dataframe of groups and perform operations on it (note that .groups is an attribute, not a method, so the parentheses above will raise a TypeError). I would like to know if this can be done using groupby, or if there is a better approach. Thanks in advance!
I implemented a generator and ignored the stat column because it makes no difference in any group according to your sample. Please tell me if I did it wrong.
import pandas as pd
from itertools import permutations

def subtract_group(df, col):
    pid = df['pid'].unique()
    # select the piece with pid == i
    segment = lambda df, i: df[df['pid'] == i].reset_index()[col]
    for x, y in permutations(pid, 2):
        result_df = pd.DataFrame(segment(df, x) - segment(df, y))
        # rename columns
        result_df.columns = ["%sp%d-%sp%d" % (c, x, c, y) for c in col]
        # insert pid column
        result_df.insert(0, 'pid', '-'.join([str(x), str(y)]))
        yield result_df
You can test it with:
# column names in your case
columns = ['h' + str(i+1) for i in range(20)]
print next(subtract_group(df1, columns))
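Each yielded frame covers one ordered pid pair, so you can walk all of them like this (note the column labels differ from pair to pair, so it is simplest to keep the frames separate rather than concatenating them):
for diff in subtract_group(df1, columns):
    print(diff)  # one frame per pair: 1-2, 1-3, ..., 2-1, ...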
Hope it helps.
