I am struggling with a simple from_dict conversion. I have a dictionary of nested dictionaries of lists, as below (quite confusing to me as well):
dict_total = {'Jane' : {'a1' : [1.1,1.3,1.4,1.9],
'a2' : [3.1,2.4,2.3,1.2],
'a3' : [4.3,2.3,1.5,5.3],
'st' : ['d','dc','sc','sc']},
'Mark' : {'a1' : [3.1,2.3,1.3,1.9],
'a2' : [1.2,2.3,9.3,1.2],
'a3' : [1.1,5.5,1.2,5.3],
'st' : ['cs','s','wc','cd']}
}
The above is just a simple example; my original contains more than 20,000 keys in dict_total. I want to convert this dictionary to a dataframe (ideally without explicit loops) like below.
df_total =
a1 a2 a3 st
Jane 1.1 3.1 4.3 d
Jane 1.3 2.4 2.3 dc
Jane 1.4 2.3 1.5 sc
Jane 1.9 1.2 5.3 sc
Mark 3.1 1.2 1.1 cs
Mark 2.3 2.3 5.5 s
Mark 1.3 9.3 1.2 wc
Mark 1.9 1.2 5.3 cd
As you can see, the keys of dict_total become the index of the dataframe, each key under "Jane" and "Mark" becomes a column name, and the lists supply the values.
Hope there is a Pythonic way to solve this. Thanks.
I think you need concat with a dict comprehension, and then to remove the inner level of the resulting MultiIndex with reset_index:
import pandas as pd

df_total = (pd.concat({k: pd.DataFrame(v) for k, v in dict_total.items()})
              .reset_index(level=1, drop=True))
print (df_total)
a1 a2 a3 st
Jane 1.1 3.1 4.3 d
Jane 1.3 2.4 2.3 dc
Jane 1.4 2.3 1.5 sc
Jane 1.9 1.2 5.3 sc
Mark 3.1 1.2 1.1 cs
Mark 2.3 2.3 5.5 s
Mark 1.3 9.3 1.2 wc
Mark 1.9 1.2 5.3 cd
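To see why the inner level is dropped, here is the intermediate frame's index before reset_index (a quick sketch using the sample dict_total above):
intermediate = pd.concat({k: pd.DataFrame(v) for k, v in dict_total.items()})
print(intermediate.index)
# a MultiIndex with entries like ('Jane', 0), ('Jane', 1), ..., ('Mark', 3);
# reset_index(level=1, drop=True) discards the inner integer level and keeps
# only the names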
I have 4 CSV files with \t (tab) as the delimiter.
alok#alok-HP-Laptop-14s-cr1:~/tmp/krati$ for file in sample*.csv; do echo $file; cat $file; echo ; done
sample1.csv
ProbeID p_code intensities
B1_1_3 6170 2
B2_1_3 6170 2.2
B3_1_4 6170 2.3
12345 6170 2.4
1234567 6170 2.5
sample2.csv
ProbeID p_code intensities
B1_1_3 5320 3
B2_1_3 5320 3.2
B3_1_4 5320 3.3
12345 5320 3.4
1234567 5320 3.5
sample3.csv
ProbeID p_code intensities
B1_1_3 1234 4
B2_1_3 1234 4.2
B3_1_4 1234 4.3
12345 1234 4.4
1234567 1234 4.5
sample4.csv
ProbeID p_code intensities
B1_1_3 3120 5
B2_1_3 3120 5.2
B3_1_4 3120 5.3
12345 3120 5.4
1234567 3120 5.5
All 4 files have the same headers.
The ProbeID values are the same across all files, and in the same order. Each file has a single p_code value repeated throughout.
I have to merge all these CSV files into one in this format.
alok#alok-HP-Laptop-14s-cr1:~/tmp/krati$ cat output1.csv
ProbeID 6170 5320 1234 3120
B1_1_3 2 3 4 5
B2_1_3 2.2 3.2 4.2 5.2
B3_1_4 2.3 3.3 4.3 5.3
12345 2.4 3.4 4.4 5.4
1234567 2.5 3.5 4.5 5.5
In this output file, the columns are dynamic, based on the p_code values.
I can do this easily in plain Python using a dictionary. How can I produce such output using pandas?
We can achieve this using pandas.concat and DataFrame.pivot_table:
import os
import pandas as pd
# stack all sample*.csv files into one long frame
df = pd.concat(
    [pd.read_csv(f, sep="\t") for f in os.listdir() if f.endswith(".csv") and f.startswith("sample")],
    ignore_index=True
)
# one row per ProbeID, one column per p_code; "sum" is effectively a no-op
# here because each (ProbeID, p_code) pair occurs exactly once
df = df.pivot_table(index="ProbeID", columns="p_code", values="intensities", aggfunc="sum")
print(df)
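If you also want to write the pivoted frame back out as the tab-separated output1.csv shown above, something like this should work (note that pivot_table sorts the p_code columns in ascending order, so reorder them first if the original file order matters):
df.reset_index().to_csv("output1.csv", sep="\t", index=False)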
I have two CSVs, one excerpted from the other. I want to compare the two and write a new CSV with the differences between them. Originally I built sets from both files and compared them: if a line appeared in one set but not the other, I appended it to a new file. However, I ran into the issue that even a slight change in a number shows up as a difference between the files. Is there a memory-efficient way to write the code so that it compares the cell values and appends the row only if the difference is more than 2?
The code that I have used is below; samples of the two files follow it:
orig = open('T1.csv', 'r')
new = open('T2.csv', 'r')
# lines present in T2 but not in T1
bigb = set(new) - set(orig)
print(bigb)
with open('different.csv', 'w') as file_out:
    for line in bigb:
        file_out.write(line)
orig.close()
new.close()
csv1:
0 1.1 -19.1 -29.1
1 2.1 -18.1 -28.1
2 3.1 -17.1 -27.1
3 4.1 -16.1 -26.1
4 5.1 -15.1 -25.1
5 6.1 -14.1 -24.1
6 7.1 -13.1 -23.1
7 8.1 -12.1 -22.1
8 9.1 -11.1 -21.1
9 10.1 -10.1 -20.1
10 11.1 -9.1 -19.1
csv2:
0 1.4 -19.6 -29.8
1 2.4 -18.6 -28.8
2 3.4 -17.6 -27.8
3 4.4 -16.6 -26.8
4 5.4 -15.6 -25.8
5 6.4 -14.6 -24.8
6 7.4 -13.6 -23.8
7 8.4 -12.6 -22.8
8 9.4 -11.6 -21.8
9 10.4 -10.6 -20.8
10 11.4 -9.6 -19.8
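A minimal pandas sketch of the tolerance-based comparison (assuming both files are whitespace-delimited with no header row and have the same shape and row order; the threshold of 2 and the T1.csv/T2.csv names come from the question):
import pandas as pd

df1 = pd.read_csv('T1.csv', sep=r'\s+', header=None)
df2 = pd.read_csv('T2.csv', sep=r'\s+', header=None)
# keep only the rows where at least one cell differs by more than 2
mask = (df1 - df2).abs().gt(2).any(axis=1)
df2[mask].to_csv('different.csv', header=False, index=False)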
I am reading a text file using pandas in Python 2.7. The dataset used in this question is related to a question that I asked before here. To be specific, the first two rows and the first column of my data consist of text information. The following is a snapshot of a truncated version of my dataset.
The data file can be found here. I am using the helpful answers given here to load the dataset (df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)).
I want to get descriptive statistics of my pandas dataframe along rows, instead of columns. I have tried using df.describe(), but it gives me descriptive statistics along columns. I had a look at the answers given in this question, but I get the following error when I use the answers suggested in that link.
TypeError: ('unbound method describe() must be called with DataFrame instance as first argument (got Series instance instead)', u'occurred at index foxq1')
How can I get descriptive statistics using Pandas for the numerical entries in every row for the dataset that I have? Thanks in advance.
Following a few comments, I am including the actual code that I am using, and the error message:
The actual code is this:
df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)
df.apply(pd.DataFrame.describe, axis=1)
Error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-0d7a5fde0f42> in <module>()
----> 1 df.apply(pd.DataFrame.describe, axis=1)
2 #df.apply(pd.DataFrame.describe, axis=1)
/Users/LG/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4260 f, axis,
4261 reduce=reduce,
-> 4262 ignore_failures=ignore_failures)
4263 else:
4264 return self._apply_broadcast(f, axis)
/Users/LG/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
4356 try:
4357 for i, v in enumerate(series_gen):
-> 4358 results[i] = func(v)
4359 keys.append(v.name)
4360 except Exception as e:
TypeError: ('unbound method describe() must be called with DataFrame instance as first argument (got Series instance instead)', u'occurred at index object1')
From the question you referenced, you can just use this code (in other words, apply describe along the rows):
df.apply(pd.DataFrame.describe, axis=1)
And you get the following result:
count mean std min 25% 50% 75% max
object1 5.0 3.1 1.581139 1.1 2.1 3.1 4.1 5.1
object2 5.0 3.2 1.581139 1.2 2.2 3.2 4.2 5.2
object3 5.0 3.3 1.581139 1.3 2.3 3.3 4.3 5.3
object4 5.0 3.4 1.581139 1.4 2.4 3.4 4.4 5.4
object5 5.0 3.5 1.581139 1.5 2.5 3.5 4.5 5.5
object6 5.0 3.6 1.581139 1.6 2.6 3.6 4.6 5.6
object7 5.0 3.7 1.581139 1.7 2.7 3.7 4.7 5.7
object8 5.0 3.8 1.581139 1.8 2.8 3.8 4.8 5.8
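If you hit the unbound-method TypeError shown in the question, that is a Python 2 quirk: pd.DataFrame.describe is an unbound method there and rejects the Series that apply passes in for each row. Applying the Series method instead avoids the check:
df.apply(pd.Series.describe, axis=1)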
You can try using numpy to obtain much of the statistics for the rows:
df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)
print df
Type T1 T2 T3 T4 T5 T6 T7
Tag Tag1 Tag1 Tag1 Tag5 Tag5 Tag6 Tag6
object1 1.1 2.1 3.1 4.1 5.1 6.1 7.1
object2 1.2 2.2 3.2 4.2 5.2 6.2 7.2
object3 1.3 2.3 3.3 4.3 5.3 6.3 7.3
object4 1.4 2.4 3.4 4.4 5.4 6.4 7.4
object5 1.5 2.5 3.5 4.5 5.5 6.5 7.5
object6 1.6 2.6 3.6 4.6 5.6 6.6 7.6
object7 1.7 2.7 3.7 4.7 5.7 6.7 7.7
object8 1.8 2.8 3.8 4.8 5.8 6.8 7.8
import numpy as np

# row-wise statistics on the underlying ndarray
data = df.values
data_mean = np.mean(data, axis=1)
data_std = np.std(data, axis=1)
data_min = np.min(data, axis=1)
data_max = np.max(data, axis=1)
print data_mean
[ 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8]
print data_std
[ 2. 2. 2. 2. 2. 2. 2. 2.]
print data_min
[ 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8]
print data_max
[ 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8]
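If you also want the quartiles that describe reports, np.percentile works along the rows as well (a small sketch continuing from the arrays above):
data_q = np.percentile(data, [25, 50, 75], axis=1)  # one row per percentile, one column per object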
The task is to load the iris data set from sklearn and then make some plots. I wish to understand what each command is doing.
from sklearn.datasets import load_iris
Q1 Is load_iris a function in sklearn?
data = load_iris()
Q2 Now I believe this load_iris function is returning some output which we are storing as data. What exactly is the output of load_iris()? type etc?
df = pd.DataFrame(data.data, columns=data.feature_names)
Q3 Now we are storing this as a dataframe. but what is data.data and data.feature_names
df['target_names'] = [data.target_names[i] for i in data.target]
Q4 I don't understand the right hand side of the above code
Need help with Questions 1, 2, 3 and 4. I tried looking at the scikit-learn documentation but didn't understand it. Also, this code is from an online course on edX, but they didn't explain the code.
Discover the power of interactivity in Jupyter/IPython.
I'm using IPython in this example.
Q1 Is load_iris a function in sklearn?
In [33]: type(load_iris)
Out[33]: function
Q2 Now I believe this load_iris function is returning some output
which we are storing as data. What exactly is the output of
load_iris()? type etc?
Docstring - is very helpful:
In [34]: load_iris?
Signature: load_iris(return_X_y=False)
Docstring:
Load and return the iris dataset (classification).
The iris dataset is a classic and very easy multi-class classification
dataset.
================= ==============
Classes 3
Samples per class 50
Samples total 150
Dimensionality 4
Features real, positive
================= ==============
Read more in the :ref:`User Guide <datasets>`.
Parameters
----------
return_X_y : boolean, default=False.
If True, returns ``(data, target)`` instead of a Bunch object. See
below for more information about the `data` and `target` object.
.. versionadded:: 0.18
Returns
-------
data : Bunch
Dictionary-like object, the interesting attributes are:
'data', the data to learn, 'target', the classification labels,
'target_names', the meaning of the labels, 'feature_names', the
meaning of the features, and 'DESCR', the
full description of the dataset.
(data, target) : tuple if ``return_X_y`` is True
...
print description:
In [51]: print(data.DESCR)
Iris Plants Database
====================
Notes
-----
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
...
Q3 Now we are storing this as a dataframe. but what is data.data and data.feature_names
In [37]: type(data.data)
Out[37]: numpy.ndarray
In [88]: data.data.shape
Out[88]: (150, 4)
In [38]: df = pd.DataFrame(data.data, columns=data.feature_names)
In [39]: pd.set_option('display.max_rows', 10)
In [40]: df
Out[40]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
[150 rows x 4 columns]
In [41]: df.columns
Out[41]: Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], dtype='object')
In [42]: data.feature_names
Out[42]:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
Q4 I don't understand the right hand side of the above code
Execute the code and check the result - usually it's easy to see what has happened. BTW, I'd use NumPy for this step:
In [49]: df['target_names'] = np.take(data.target_names, data.target)
In [50]: df
Out[50]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target_names
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
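As for the list comprehension from the question: data.target is an array of integer class labels (0, 1 or 2) and data.target_names holds the matching names, so the comprehension simply looks up the name for each label one element at a time - np.take above is the vectorized version of the same lookup:
df['target_names'] = [data.target_names[i] for i in data.target]  # same result as the np.take call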
I have a dataframe df1
pid stat h1 h2 h3 h4 h5 h6 ... h20
1 a 3.2 3.5 6.2 7.1 1.2 2.3 ... 3.2
1 b 3.3 1.5 4.2 7.7 4.2 3.5 ... 8.4
1 a 3.1 3.8 2.2 1.1 6.2 5.3 ... 9.2
1 b 3.7 1.2 8.2 4.7 3.2 8.5 ... 2.4
: : : : : : : : : :
2 a 2.2 3.8 6.2 7.3 1.3 4.3 ... 3.2
2 b 4.3 1.3 4.2 5.7 2.2 3.1 ... 2.4
2 a 2.1 3.7 2.4 1.6 6.4 9.3 ... 9.6
2 b 3.8 1.3 8.7 3.7 7.2 8.3 ... 9.4
: : : : : : : : : :
3 a 2.2 3.8 6.2 7.3 1.3 4.3 ... 3.2
3 b 4.3 1.3 4.2 5.7 2.2 3.1 ... 2.4
3 a 2.1 3.7 2.4 1.6 6.4 9.3 ... 9.6
3 b 3.8 1.3 8.7 3.7 7.2 8.3 ... 9.4
: : : : : : : : : :
I would like to obtain groups indexed on pid and stat, and then subtract the h values of one group from the h values of another group for a final dataframe (df2). This final dataframe needs to be reindexed with numbers from 0 to len(groups). I want to repeat this for all permutations of pid, like 1-2, 1-3, 1-4, 2-1, 2-3, etc. I then need to perform other calculations on the final dataframe df2 (the values in the df2 below are not the exact subtracted values, just a representation):
pid(string) stat h1p1-h1p2 h2p1-h2p2 h3p1-h3p2 h4p1-h4p2 h5p1-h5p2 h6p1-h6p2 ... h20p1-h2p2
1-2 a 3.2 3.5 6.2 7.1 1.2 2.3 ... 3.2
1-2 b 3.3 1.5 4.2 7.7 4.2 3.5 ... 8.4
1-2 a 3.1 3.8 2.2 1.1 6.2 5.3 ... 9.2
1-2 b 3.7 1.2 8.2 4.7 3.2 8.5 ... 2.4
1-3 ....
I looked at the option of:
for (pid, stat), group in df1.groupby(['pid', 'stat']):
    print('pid = %s Stat = %s' % (pid, stat))
    print group
This gives me the groups, but I am not sure how to access the dataframes from this for loop and use them for subtracting from other groups. I also tried:
df_grouped = df.groupby(['pid', 'stat']).groups
but I am still not sure how to access the new dataframe of groups and perform operations on them. I would like to know if this can be done using groupby, or if there is a better approach. Thanks in advance!
I implemented a generator, and I ignored the stat column because it makes no difference in any of the groups in your sample. Please tell me if I got that wrong.
import pandas as pd
from itertools import permutations
def subtract_group(df, col):
    pid = df['pid'].unique()
    # select the piece with pid == i
    segment = lambda df, i: df[df['pid'] == i].reset_index()[col]
    for x, y in permutations(pid, 2):
        result_df = pd.DataFrame(segment(df, x) - segment(df, y))
        # rename columns
        result_df.columns = ["%sp%d-%sp%d" % (c, x, c, y) for c in col]
        # insert pid column
        result_df.insert(0, 'pid', '-'.join([str(x), str(y)]))
        yield result_df
You can test it with:
# column names in your case
columns = ['h' + str(i+1) for i in range(20)]
print next(subtract_group(df1, columns))
Hope it helps.
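If you want a single dataframe with all the permutations stacked and reindexed from 0, as described in the question, you can concatenate everything the generator yields. Since each yielded frame carries pair-specific column names, this sketch falls back to the generic hNp1-hNp2 names from the df2 layout in the question before stacking:
frames = []
for piece in subtract_group(df1, columns):
    piece.columns = ['pid'] + ['%sp1-%sp2' % (c, c) for c in columns]
    frames.append(piece)
df2 = pd.concat(frames, ignore_index=True)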