Importing a file containing text and numerical data using Python

I have a .txt file which has both text data and numerical data. The first two rows of the file hold essential information as text, and the first column (I am referring to the zeroth column as the first column) also holds essential data in text form. At all other locations in the file, the data is numerical. I wish to analyze the numerical data in the file using Python libraries, preferably numpy or pandas or a combination of both (for analyses like regression and correlation, e.g. with scikit-learn). I reiterate that all of the data in the file is essential for my analysis. The following snapshot (taken from Excel) shows a truncated version of the format of my data:
The data shown in this snapshot can be found here.
In particular, what I want is to import all the numerical data from this file using Python (numpy or pandas), and to be able to refer to specific rows and columns in this data using the text data in the first two rows (Type, Tag) and the first column (object number). In my actual data file, I have hundreds of thousands of rows (object types) and scores of columns.
I have already tried using numpy.loadtxt(...) and pandas.read_csv(...) to open this file, but I have either run into errors or ended up with the data loaded in clumsy formats. I would be really thankful for some direction on how to import the file in Python in a way that gives me the functionality I desire.

If I were you, I would use pandas, and import it using something like this:
df = pd.read_csv('dum.txt', sep='\t', header=[0, 1], index_col=0)
This gives you the dataframe:
>>> df
Type       T1   T2   T3   T4   T5
Tag      Good Good Good Good Good
object1   1.1  2.1  3.1  4.1  5.1
object2   1.2  2.2  3.2  4.2  5.2
object3   1.3  2.3  3.3  4.3  5.3
object4   1.4  2.4  3.4  4.4  5.4
object5   1.5  2.5  3.5  4.5  5.5
object6   1.6  2.6  3.6  4.6  5.6
object7   1.7  2.7  3.7  4.7  5.7
object8   1.8  2.8  3.8  4.8  5.8
And all of your columns are floats:
>>> df.dtypes
Type  Tag
T1    Good    float64
T2    Good    float64
T3    Good    float64
T4    Good    float64
T5    Good    float64
dtype: object
It contains a multi-indexed column header:
>>> df.columns
MultiIndex(levels=[['T1', 'T2', 'T3', 'T4', 'T5'], ['Good']],
           labels=[[0, 1, 2, 3, 4], [0, 0, 0, 0, 0]],
           names=['Type', 'Tag'])
And a regular index containing the object numbers from the first column:
>>> df.index
Index(['object1', 'object2', 'object3', 'object4', 'object5', 'object6',
       'object7', 'object8'],
      dtype='object')
Furthermore, you can convert your values to a numpy array of floats simply by using:
>>> df.values
array([[1.1, 2.1, 3.1, 4.1, 5.1],
       [1.2, 2.2, 3.2, 4.2, 5.2],
       [1.3, 2.3, 3.3, 4.3, 5.3],
       [1.4, 2.4, 3.4, 4.4, 5.4],
       [1.5, 2.5, 3.5, 4.5, 5.5],
       [1.6, 2.6, 3.6, 4.6, 5.6],
       [1.7, 2.7, 3.7, 4.7, 5.7],
       [1.8, 2.8, 3.8, 4.8, 5.8]])
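Once the frame is loaded this way, the text metadata can be used directly for selection. A minimal sketch (the column and row names are taken from the example file above):
>>> df.loc['object3']                    # one row, by object number
>>> df[('T2', 'Good')]                   # one column, by its (Type, Tag) pair
>>> df.xs('Good', axis=1, level='Tag')   # all columns whose Tag is 'Good'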

Use a regex separator to split on any whitespace, not only tabs, and engine='python' (the C parser does not support regex separators, so this avoids the fallback warning):
df = pd.read_csv('dum.txt', engine='python', sep=r'\s+')
print(df)
Output:
      Type    T1    T2    T3    T4    T5
0      Tag  Good  Good  Good  Good  Good
1  object1   1.1   2.1   3.1   4.1   5.1
2  object2   1.2   2.2   3.2   4.2   5.2
3  object3   1.3   2.3   3.3   4.3   5.3
4  object4   1.4   2.4   3.4   4.4   5.4
5  object5   1.5   2.5   3.5   4.5   5.5
6  object6   1.6   2.6   3.6   4.6   5.6
7  object7   1.7   2.7   3.7   4.7   5.7
8  object8   1.8   2.8   3.8   4.8   5.8
Or, if you want two header rows (I would not recommend this, because the result is harder to use):
df = pd.read_csv('dum.txt', engine='python', sep=r'\s+', header=[0, 1])
print(df)
Output:
      Type    T1    T2    T3    T4    T5
       Tag  Good  Good  Good  Good  Good
0  object1   1.1   2.1   3.1   4.1   5.1
1  object2   1.2   2.2   3.2   4.2   5.2
2  object3   1.3   2.3   3.3   4.3   5.3
3  object4   1.4   2.4   3.4   4.4   5.4
4  object5   1.5   2.5   3.5   4.5   5.5
5  object6   1.6   2.6   3.6   4.6   5.6
6  object7   1.7   2.7   3.7   4.7   5.7
7  object8   1.8   2.8   3.8   4.8   5.8
Otherwise, a plain pd.read_csv('dum.txt') with the default comma separator will return everything in one column:
Type\tT1\tT2\tT3\tT4\tT5
0 Tag\tGood\tGood\tGood\tGood\tGood
1 object1\t1.1\t2.1\t3.1\t4.1\t5.1
2 object2\t1.2\t2.2\t3.2\t4.2\t5.2
3 object3\t1.3\t2.3\t3.3\t4.3\t5.3
4 object4\t1.4\t2.4\t3.4\t4.4\t5.4
5 object5\t1.5\t2.5\t3.5\t4.5\t5.5
6 object6\t1.6\t2.6\t3.6\t4.6\t5.6
7 object7\t1.7\t2.7\t3.7\t4.7\t5.7
8 object8\t1.8\t2.8\t3.8\t4.8\t5.8
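If you go with the single-header read, you still have to peel off the Tag row and convert the remaining values to floats before doing any numerical work. A possible cleanup, assuming the file layout from the question:
df = pd.read_csv('dum.txt', engine='python', sep=r'\s+')
tags = df.iloc[0, 1:]                  # the Tag row ('Good', ...)
data = df.iloc[1:].set_index('Type')   # object1..object8 become the index
data = data.astype(float)              # values were read in as strings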

Related

What is the pandas equivalent of the R function %in%?

When we have a dataframe in R, we can check for which rows a column contains strings from a list using the operator %in%, which gives a Boolean output.
Concrete example: if we want to check in which rows the column species contains the strings "setosa" or "virginica" in the iris dataset, we can simply use the following code:
iris[, 'species'] %in% c('setosa', 'virginica')
How can we do the same thing in python for a pandas DataFrame?
The reason I want to do this is I want to filter the dataset and only keep rows with the species "setosa" or "virginica".
%in% in R is actually is.element:
r$> 1 %in% 1:2
[1] TRUE
r$> is.element(1, 1:2)
[1] TRUE
datar has ported a number of R functions to Python:
>>> from datar.all import c, f, is_element, filter
>>> from datar.datasets import iris
>>>
>>> iris >> filter(is_element(f.Species, c('setosa', 'virginica')))
    Sepal_Length  Sepal_Width  Petal_Length  Petal_Width    Species
       <float64>    <float64>     <float64>    <float64>   <object>
0            5.1          3.5           1.4          0.2     setosa
1            4.9          3.0           1.4          0.2     setosa
2            4.7          3.2           1.3          0.2     setosa
3            4.6          3.1           1.5          0.2     setosa
4            5.0          3.6           1.4          0.2     setosa
..           ...          ...           ...          ...        ...
95           6.7          3.0           5.2          2.3  virginica
96           6.3          2.5           5.0          1.9  virginica
97           6.5          3.0           5.2          2.0  virginica
98           6.2          3.4           5.4          2.3  virginica
99           5.9          3.0           5.1          1.8  virginica
[100 rows x 5 columns]
I am the author of the datar package. Feel free to submit issues if you have any questions.
pandas Series have an .isin() method, which is equivalent to the %in% operator in R. As pointed out by @rhug123, .isin can be applied directly to a Series; I have made the corresponding change to the code below.
Your R code above can be implemented in Python using pandas as follows, assuming that iris is a pandas DataFrame:
iris.species.isin(['setosa', 'virginica'])
You can then filter your DataFrame and only keep the rows with species 'setosa' or 'virginica' as follows:
iris[iris.species.isin(['setosa', 'virginica'])]
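To keep the complement instead (R's !(x %in% y)), negate the Boolean mask with ~:
iris[~iris.species.isin(['setosa', 'virginica'])]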

Move index values into column names in pandas Data Frame

I'm trying to reshape a multi-indexed data frame so that the values from the second level of the index are incorporated into the column names in the new data frame. In the data frame below, I want to move A and B from "source" into the columns so that I have s1_A, s1_B, s2_A, ..., s3_B.
I've tried creating the structure of the new data frame explicitly and populating it with a nested for loop to reassign the values, but it is excruciatingly slow. I've tried a number of functions from the pandas API, but without much luck. Any help would be much appreciated.
midx = pd.MultiIndex.from_product([[1, 2, 3], ['A', 'B']], names=["sample", "source"])
# np.tile fills the frame with the values shown below
# (np.ndarray(shape=(6, 3)) would give uninitialized memory)
df = pd.DataFrame(index=midx, columns=['s1', 's2', 's3'], data=np.tile([1.2, 3.4, 5.6], (6, 1)))
>>> df
               s1   s2   s3
sample source
1      A      1.2  3.4  5.6
       B      1.2  3.4  5.6
2      A      1.2  3.4  5.6
       B      1.2  3.4  5.6
3      A      1.2  3.4  5.6
       B      1.2  3.4  5.6
# Want to build a new data frame that looks like this:
>>> df_new
        s1_A  s1_B  s2_A  s2_B  s3_A  s3_B
sample
1        1.2   1.2   3.4   3.4   5.6   5.6
2        1.2   1.2   3.4   3.4   5.6   5.6
3        1.2   1.2   3.4   3.4   5.6   5.6
Here's how I'm currently doing it. It's extremely slow, and I know there must be a more idiomatic way to do this with pandas, but I'm still new to its API:
substances = df.columns.values
sources = ['A', 'B']
subst_and_src = sorted([subst + "_" + src for src in sources for subst in substances])
df_new = pd.DataFrame(index=df.index.unique(0), columns=subst_and_src)
# Runs forever
for (sample, source) in df.index:
    for subst in df.columns:
        df_new.loc[sample, subst + "_" + source] = df.loc[(sample, source), subst]
df = df.unstack(level=1)
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)
Prints:
        s1_A  s1_B  s2_A  s2_B  s3_A  s3_B
sample
1        1.2   1.2   3.4   3.4   5.6   5.6
2        1.2   1.2   3.4   3.4   5.6   5.6
3        1.2   1.2   3.4   3.4   5.6   5.6
Unstack into a new dataframe and collapse the resulting frame's multilevel column index using string formatting:
df1 = df.unstack()
df1.columns = df1.columns.map('{0[0]}_{0[1]}'.format)
        s1_A  s1_B  s2_A  s2_B  s3_A  s3_B
sample
1        1.2   1.2   3.4   3.4   5.6   5.6
2        1.2   1.2   3.4   3.4   5.6   5.6
3        1.2   1.2   3.4   3.4   5.6   5.6
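Equivalently, the column tuples can be joined with an f-string in a list comprehension:
df1.columns = [f"{col}_{src}" for col, src in df1.columns]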

How do I get the mean of a range of values in a 2D DataFrame

I have a 2D DataFrame (df_1) in my Jupyter notebook and want to copy the mean of certain ranges of values into a new DataFrame. The first bin (based on v_wind) should run from 3.00 to 3.10, averaging all corresponding values from p_abs. The data contains about 5502 rows.
p_abs v_wind
19.94 3.00
3.35 3.02
29.26 3.03
47.97 3.04
42.99 3.05
16.20 3.06
19.00 3.07
34.54 3.10
16.16 3.10
7.49 3.11
48.85 3.14
23.19 3.16
25.69 3.18
34.47 3.18
27.82 3.19
31.18 3.19
58.86 3.19
36.17 3.19
36.47 3.19
33.79 3.22
23.72 3.23
I already tried to summarise the DataFrame with:
df_1.groupby(['v_wind']).mean()
but this does not allow me to average all values within a range.
Could someone tell me how to create a new DataFrame (df_2) that looks like this:
p_abs v_wind
avg_value 3.1
avg_value 3.2
avg_value 3.3
avg_value 3.4
avg_value 3.5
avg_value 3.6
I am a bloody beginner in Python and thankful for any advice...
With pd.cut. You'll need to determine whether you want bins like [3, 3.1) or (3, 3.1] by specifying the right argument.
import pandas as pd
import numpy as np
bins = np.arange(3, 4, 0.1)
df.groupby(pd.cut(df.v_wind, bins=bins, right=False)).p_abs.mean()
v_wind
[3.0, 3.1) 25.530000
[3.1, 3.2) 31.740833
[3.2, 3.3) 28.755000
[3.3, 3.4) NaN
[3.4, 3.5) NaN
[3.5, 3.6) NaN
[3.6, 3.7) NaN
[3.7, 3.8) NaN
[3.8, 3.9) NaN
Name: p_abs, dtype: float64
If you want this to be more generalizable, instead of hardcoding the bins you could build "even" bins with:
space = 0.1
bins = np.arange(df['v_wind'].min() // space * space,
                 (df['v_wind'].max() + space) // space * space, space)
# array([3. , 3.1, 3.2, 3.3])
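To get a frame shaped like the desired df_2 (one p_abs mean per bin, with the bin's upper edge as v_wind), a possible sketch building on the groupby above:
means = df.groupby(pd.cut(df.v_wind, bins=bins, right=False)).p_abs.mean()
df_2 = pd.DataFrame({'p_abs': means.to_numpy(), 'v_wind': bins[1:].round(1)})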

python blaze postgresql can't print "distinct" iris species

Going through this tutorial about blaze, but using the iris dataset in a local PostgreSQL db.
I don't seem to get the same output as shown when using db.iris.Species.distinct() (see In [16] of the IPython notebook).
My connection string is postgresql://postgres:postgres@localhost:5432/blaze_test
and my simple Python code is:
import blaze as bz
db = bz.Data('postgresql://postgres:postgres@localhost:5432/blaze_test')
mySpecies = db.iris_data.species.distinct()
print mySpecies
All I get in the console (using the Spyder IDE) is distinct(_55.iris_data.species)
How can actually print the distinct species in the table?
NB: I know I am using a lowercase "s" for the "species" part in the code; otherwise I just get an error saying: 'Field' object has no attribute 'Species'
The printing mechanism is tripping you up a bit here.
The __str__ implementation (which is what Python's print function calls) returns a string version of the expression.
The __repr__ implementation (called when you execute a line in the interpreter) triggers computation and thus allows you to see the results of a computation.
In [2]: iris = Data(odo(os.path.abspath('./blaze/examples/data/iris.csv'), 'postgresql://localhost::iris'))
In [3]: iris
Out[3]:
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa
...
In [4]: iris.species.distinct()
Out[4]:
           species
0  Iris-versicolor
1   Iris-virginica
2      Iris-setosa
In [8]: print(str(iris.species.distinct()))
distinct(_1.species)
In [9]: print(repr(iris.species.distinct()))
           species
0  Iris-versicolor
1   Iris-virginica
2      Iris-setosa
If you want to shove the result into a concrete data structure like a pandas.Series, do this:
In [5]: odo(iris.species.distinct(), pd.Series)
Out[5]:
0    Iris-versicolor
1     Iris-virginica
2        Iris-setosa
Name: species, dtype: object
Ok, I think I know now. The rest of the YouTube video made it a bit clearer.
I should do something like output = odo(mySpecies, pd.DataFrame) or output = odo(mySpecies, list) and then print output to do the transformation.
Other solutions/points welcome.

What is the difference between the pandas agg and apply functions?

I can't figure out the difference between Pandas' .aggregate and .apply functions.
Take the following as an example: I load a dataset, do a groupby, define a simple function, and either use .agg or .apply.
As you may see, the printing statement within my function results in the same output after using .agg and .apply. The result, on the other hand, is different. Why is that?
import pandas as pd

iris = pd.read_csv('iris.csv')
by_species = iris.groupby('Species')

def f(x):
    print type(x)
    print x.head(3)
    return 1
Using apply:
by_species.apply(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[33]:
#Species
#setosa 1
#versicolor 1
#virginica 1
#dtype: int64
Using agg:
by_species.agg(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[34]:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Species
#setosa 1 1 1 1
#versicolor 1 1 1 1
#virginica 1 1 1 1
apply applies the function to each group (your Species). Your function returns 1, so you end up with 1 value for each of 3 groups.
agg aggregates each column (feature) for each group, so you end up with one value per column per group.
Do read the groupby docs, they're quite helpful. There are also a bunch of tutorials floating around the web.
(Note: these comparisons are relevant for DataFrameGroupBy objects.)
Some plausible advantages of using .agg() compared to .apply() for DataFrameGroupBy objects:
.agg() gives the flexibility of applying multiple functions at once, or of passing a list of functions to each column.
It also allows applying different functions to different columns at once, so you have control over each column with each operation.
Here is the link for more details: http://pandas.pydata.org/pandas-docs/version/0.13.1/groupby.html
The apply function, however, is limited to applying one function to each column of the dataframe at a time, so you might have to call it repeatedly to perform different operations on the same column.
Here are some example comparisons of .apply() vs .agg() for DataFrameGroupBy objects.
Given the following dataframe:
Given the following dataframe:
In [261]: df = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})
In [262]: df
Out[262]:
   name  score_1  score_2  score_3
0   Foo        5       10       10
1  Baar       10       15       20
2   Foo       15       10       30
3  Baar       10       25       40
Let's first see the operations using .apply():
In [263]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.sum())
Out[263]:
name  score_1
Baar  10         40
Foo   5          10
      15         10
Name: score_2, dtype: int64
In [264]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.min())
Out[264]:
name  score_1
Baar  10         15
Foo   5          10
      15         10
Name: score_2, dtype: int64
In [265]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.mean())
Out[265]:
name  score_1
Baar  10         20.0
Foo   5          10.0
      15         10.0
Name: score_2, dtype: float64
Now look at the same operations using .agg():
In [276]: df.groupby(["name", "score_1"]).agg({"score_3" :[np.sum, np.min, np.mean, np.max], "score_2":lambda x : x.mean()})
Out[276]:
                 score_2 score_3
                <lambda>     sum amin mean amax
name score_1
Baar 10               20      60   20   30   40
Foo  5                10      10   10   10   10
     15               10      30   30   30   30
So .agg() can be really handy for DataFrameGroupBy objects, compared to .apply(). But if you are handling pure DataFrame objects rather than DataFrameGroupBy objects, apply() can be very useful, since it can apply a function along any axis of the dataframe.
(For example, axis=0 implies a column-wise operation with .apply(), which is the default, while axis=1 implies a row-wise operation when dealing with pure dataframe objects.)
The main difference between apply and aggregate is:
apply() cannot be applied to multiple groups together; you have to call get_group() first, because the function is applied to one data frame at a time:
# Throws an error:
iris.groupby('Species').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
# Works fine:
iris.groupby('Species').get_group('setosa').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
agg() can be applied to multiple groups together, so there is no need for get_group():
iris.groupby('Species').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
iris.groupby('Species').get_group('versicolor').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
Refer here. Let me requote the same statement:
"Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for both aggregate and transform in many standard use cases. However, apply can handle some exceptional use cases."
More details, with examples, are presented in the pandas documentation (link provided above).
Also refer to this great write-up from @Ted Petrou and @Eric O Lebigot; the logic they use to investigate the difference between apply and transform can be reapplied to apply and agg. Then, to understand how axis works, refer to this link.
These links should help in getting better clarity on how they differ.
When using apply on a groupby, I have found that .apply will return the grouped columns. There is a note in the documentation (pandas.pydata.org/pandas-docs/stable/groupby.html):
"...Thus the grouped column(s) may be included in the output as well as set the indices."
.aggregate will not return the grouped columns.
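A quick way to see this for yourself; a small sketch (the frame and column names here are made up for illustration, and the exact behaviour varies across pandas versions):
import pandas as pd

d = pd.DataFrame({"name": ["Foo", "Baar"], "score_1": [5, 10]})
# apply receives each group as a DataFrame; depending on the pandas
# version, the grouping column "name" may still be present in it.
print(d.groupby("name").apply(lambda g: list(g.columns)))
# agg works column by column, so "name" is never part of the output values.
print(d.groupby("name").agg("sum"))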
Besides everything mentioned above, another difference I think no one has highlighted yet is that apply can be used to apply a function to a group of columns together, whereas agg applies a function to each column separately. An example, using the same data as above:
d = pd.DataFrame({"name": ["Foo", "Baar", "Foo", "Baar"], "score_1": [5, 10, 15, 10], "score_2": [10, 15, 10, 25], "score_3": [10, 20, 30, 40]})
Here, apply uses a function that sums up the values of all the (numeric) columns of a group together (the explicit column selection keeps the string column name out of the sum):
d.groupby(["name", "score_1"])[["score_2", "score_3"]].apply(lambda x: x.values.sum())
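With the numeric columns selected as above, this should give one combined value per group, along the lines of:
name  score_1
Baar  10         100
Foo   5           20
      15          40
dtype: int64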
