Add float column to string matrix NumPy - python

I'm looking for a method to add a column of float values to a matrix of string values.
Mymatrix =
[["a","b"],
["c","d"]]
I need to end up with a matrix like this:
[["a","b",0.4],
["c","d",0.6]]

I would suggest using a pandas DataFrame instead:
import pandas as pd
df = pd.DataFrame([["a","b",0.4],
["c","d",0.6]])
print(df)
   0  1    2
0  a  b  0.4
1  c  d  0.6
You can also specify column (Series) names:
df = pd.DataFrame([["a","b",0.4],
["c","d",0.6]], columns=['A', 'B', 'C'])
df
   A  B    C
0  a  b  0.4
1  c  d  0.6
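If you already have the string matrix and the float column as separate objects (Mymatrix and Mycol, as in the question and the answer below), you can build the same frame without retyping the data:
import numpy as np
import pandas as pd

Mymatrix = np.array([["a", "b"], ["c", "d"]])
Mycol = [0.4, 0.6]

df = pd.DataFrame(Mymatrix, columns=['A', 'B'])  # the string part
df['C'] = Mycol                                  # append the float column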

As noted, you can't mix data types in an ndarray, but you can in a structured or record array. Both let you mix datatypes, as defined by the dtype= argument (which specifies the field names and types). Record arrays additionally allow access to the fields of structured arrays by attribute instead of only by index. You also don't need for loops to copy the entire contents between arrays. See my example below (using your data):
import numpy as np

Mymatrix = np.array([["a", "b"], ["c", "d"]])
Mycol = np.array([0.4, 0.6])
dt = np.dtype([('col0', 'U1'), ('col1', 'U1'), ('col2', float)])
new_recarr = np.empty((2,), dtype=dt)  # structured array with named fields
new_recarr['col0'] = Mymatrix[:, 0]
new_recarr['col1'] = Mymatrix[:, 1]
new_recarr['col2'] = Mycol[:]
print(new_recarr)
Resulting output looks like this:
[('a', 'b', 0.4) ('c', 'd', 0.6)]
From there, use formatted strings to print.
You can also copy from a recarray to an ndarray if you reverse assignment order in my example.
Note: I discovered there can be a significant performance penalty when using recarrays. See answer in this thread:
is ndarray faster than recarray access?
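For context, here is a minimal sketch of how you might measure that difference yourself (the names and sizes are made up, and timings will vary by machine and NumPy version):
import numpy as np
import timeit

dt = np.dtype([('col0', 'U1'), ('col1', 'U1'), ('col2', float)])
arr = np.zeros(1000, dtype=dt)   # plain structured array
rec = arr.view(np.recarray)      # same buffer, recarray interface

print(timeit.timeit(lambda: arr['col2'], number=100000))  # field access
print(timeit.timeit(lambda: rec.col2, number=100000))     # attribute access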

You need to understand why you want to do this. NumPy is efficient because data are aligned in memory, so mixing types is generally a source of bad performance. But in your case you can preserve alignment, since all your strings have the same length. Since the types are not homogeneous, you can use a structured array:
import numpy as np

raw = [["a", "b", 0.4],
       ["c", "d", 0.6]]
dt = np.dtype([('col0', 'U1'), ('col1', 'U1'), ('col2', float)])
aligned = np.empty(len(raw), dtype=dt)
for i in range(len(raw)):
    aligned[i] = tuple(raw[i])  # assign one record (row) at a time
You can also use pandas, but you often lose some performance.
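Note that the loop is not strictly necessary: np.array accepts a list of tuples together with a structured dtype, so the same array can be built in one call:
aligned = np.array([tuple(row) for row in raw], dtype=dt)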

Related

Why does pandas.to_numeric result in a list of lists?

I am trying to import csv data into a pandas dataframe. To do this I am doing the following:
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
index = pd.MultiIndex.from_arrays((columns, units, descr))
df.columns = index
df.columns.names = ['Name','Unit','Description']
df = df.apply(pd.to_numeric)
data['isotherm'] = df
This produces e.g. the following table:
In: data['isotherm']
Out:
Name         Relative_Pressure  Volume_STP
Unit                         -       ccm/g
Description               p/p0
0                     0.042691     29.3601
1                     0.078319     30.3071
2                     0.129529     31.1643
3                     0.183355     31.8513
4                     0.233435     32.3972
5                     0.280847     32.8724
However, if I only want to get the values of the column Relative_Pressure, I get this output:
In: data['isotherm']['Relative_Pressure'].values
Out:
array([[0.042691],
[0.078319],
[0.129529],
[0.183355],
[0.233435],
[0.280847]])
Of course, I could now flatten every column I want to use:
x = [item for sublist in data['isotherm']['Relative_Pressure'].values for item in sublist]
However, this would be a lot of extra effort and would also hurt readability. How can I make sure the data is flat for the whole data frame?
array([[...]]) is not a list of lists, but a 2D numpy array. (I'm not sure why the values are returned as a single-column 2D array rather than a 1D array here, though. When I create a primitive DataFrame, a single column's values are returned as a 1D array.)
You can flatten it using numpy's built-in functions on the underlying array, e.g.
x = data['isotherm']['Relative_Pressure'].values.flatten()
Edit: This might be caused by the MultiIndex.
The direct way of indexing into one column of a frame with MultiIndex columns is with a tuple that names every level, as follows:
data['isotherm'][('Relative_Pressure', '-', 'p/p0')]
which will return a Series object whose .values attribute will give you the expected 1D array. The docs discuss this here
You should be careful using chained indexing like data['isotherm']['Relative_Pressure'] because you won't know if you are dealing with a copy of the data or a view of the data. Please do a SO search of pandas' SettingWithCopyWarning for more details or read the docs here.
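To see why the chained selection comes back two-dimensional, here is a small self-contained sketch (the level values mimic the question's frame; the data are made up). Selecting only the first level returns a sub-DataFrame, whose .values is 2D, while the full tuple returns a Series:
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays(
    [['Relative_Pressure', 'Volume_STP'], ['-', 'ccm/g'], ['p/p0', '']],
    names=['Name', 'Unit', 'Description'])
df = pd.DataFrame(np.random.rand(3, 2), columns=cols)

print(df['Relative_Pressure'].values.shape)                 # (3, 1) -- 2D
print(df[('Relative_Pressure', '-', 'p/p0')].values.shape)  # (3,)   -- 1D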

How to unpack the columns of a pandas DataFrame to multiple variables

Lists or numpy arrays can be unpacked into multiple variables if the dimensions match. Unpacking assigns along the first axis, so for a 2xN array the following will work:
import numpy as np
a,b = [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3], b=[4,5,6]
How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C'] # Rename cols and
df.index = ['i', 'ii'] # rows for clarity
The following does not work as expected:
a,b = df.T
# result: a='i', b='ii'
a,b,c = df
# result: a='A', b='B', c='C'
However, what I would like to get is the following:
a,b,c = unpack(df)
# result: a=df['A'], b=df['B'], c=df['C']
Is the function unpack already available in pandas? Or can it be mimicked in an easy way?
I just figured out that the following works, and it is already close to what I am trying to achieve:
a,b,c = df.T.values # Common
a,b,c = df.T.to_numpy() # Recommended
# a,b,c = df.T.as_matrix() # Deprecated
Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores its columns as separate Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns into a single ndarray, which likely involves copying and type conversion. The resulting container also has a single dtype that must be able to accommodate all the data in the data frame.
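A quick illustration of that single-dtype effect, with a made-up frame (with mixed column types the combined array falls back to object):
import pandas as pd

df_mixed = pd.DataFrame({"x": [1, 2], "y": [0.5, 1.5], "z": ["a", "b"]})
print(df_mixed.dtypes.tolist())   # [int64, float64, object]
print(df_mixed.to_numpy().dtype)  # object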
In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate the columns in one of the following ways (there are more options):
# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items()) # returns pd.Series
a,b,c = (df[c] for c in df) # returns pd.Series
Note that the above creates views! Modifying the data likely will trigger a SettingWithCopyWarning.
a.iloc[0] = "blabla" # raises SettingWithCopyWarning
If you want to modify the unpacked variables, you have to copy the columns.
# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items()) # returns pd.Series
a,b,c = (df[c].copy() for c in df) # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df) # returns np.ndarray
While this is cleaner, it requires more characters. I personally would not recommend unpacking at all for production code, where the more verbose and explicit alternatives below are clearer. But to avoid typing (e.g., in interactive shell sessions), it is still a fair option...
# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
The df.values method shown above is indeed a good solution, but it involves building a numpy array.
In case you want to access pandas Series methods after unpacking, I personally use a different approach.
For people like me who use a lot of method chaining, here is a solution: add a custom unpacking method to pandas. Note that this may not be very good for production pipelines, but it is very handy in ad-hoc data analyses.
df = pd.DataFrame({
"lat": [30, 40],
"lon": [0, 1],
})
This approach involves returning a generator on a .unpack() call.
from typing import Iterator

def unpack(self: pd.DataFrame) -> Iterator[pd.Series]:
    return (
        self[col]
        for col in self.columns
    )
pd.DataFrame.unpack = unpack
This can be used in two major ways.
Either directly as a solution to your problem:
lat, lon = df.unpack()
Or it can be used in method chaining.
Imagine a geo function that has to take a latitude series as its first argument and a longitude series as its second, named do_something_geographical(lat, lon):
df_result = (
df
.(...some method chaining...)
.assign(
geographic_result=lambda dataframe: do_something_geographical(*dataframe[["lat", "lon"]].unpack())
)
.(...some method chaining...)
)
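For completeness, a runnable version of the sketch above. The body of do_something_geographical is a made-up placeholder; unpack is the monkey-patched method defined earlier, and the * is needed to spread the generator over the two positional arguments:
from typing import Iterator

import pandas as pd

def unpack(self: pd.DataFrame) -> Iterator[pd.Series]:
    return (self[col] for col in self.columns)

pd.DataFrame.unpack = unpack

def do_something_geographical(lat, lon):
    return lat * 100 + lon  # placeholder computation

df = pd.DataFrame({"lat": [30, 40], "lon": [0, 1]})
df_result = df.assign(
    geographic_result=lambda d: do_something_geographical(*d[["lat", "lon"]].unpack())
)
print(df_result)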

Will changes in DataFrame.values always modify the values in the data frame?

The documentation says:
Numpy representation of NDFrame -- Source
What does "Numpy representation of NDFrame" mean? Will modifying this numpy representation affect my original dataframe? In other words, will .values return a copy or a view?
There are answers to questions on StackOverflow implicitly suggesting (relying on) that a view is returned. For example, in the accepted answer of Set values on the diagonal of pandas.DataFrame, np.fill_diagonal(df.values, 0) is used to set all values on the diagonal of df to 0; that is, a view is returned in this case. However, as shown in #coldspeed's answer, sometimes a copy is returned.
This feels very basic. It is just a bit weird to me because I cannot find more detailed documentation on the behavior of .values.
Another experiment that returns a view in addition to the current experiments in #coldspeed's answer:
df = pd.DataFrame([["A", "B"],["C", "D"]])
df.values[0][0] = 0
We get
df
   0  1
0  0  B
1  C  D
Even though it is mixed type now, we can still modify the original df by assigning to elements of df.values:
df.values[0][1] = 5
df
   0  1
0  0  5
1  C  D
TL;DR:
Whether values returns a copy (in which case changing the returned array would not change the DataFrame) or a view (in which case it would) is an implementation detail. Don't rely on either behavior. It could change if the pandas developers think it would be beneficial (for example, if they changed the internal structure of DataFrame).
I guess the documentation has changed since the question was asked; currently it reads:
pandas.DataFrame.values
Return a Numpy representation of the DataFrame.
Only the values in the DataFrame will be returned, the axes labels will be removed.
It doesn't mention NDFrame anymore - but simply mentions a "NumPy representation of the DataFrame". A NumPy representation could be either a view or a copy!
The documentation also contains a Note about mixed dtypes:
Notes
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.
e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type() convention, mixing int64 and uint64 will result in a float64 dtype.
From these Notes it's obvious that accessing the values of a DataFrame that contains different dtypes can (almost) never return a view, simply because the values need to be put into an array of the "lowest common denominator" dtype, and that involves a copy.
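These upcasting rules can be checked with numpy's result_type (a quick illustration; not necessarily the exact routine pandas uses internally):
import numpy as np

print(np.result_type(np.float16, np.float32))  # float32
print(np.result_type(np.int32, np.uint8))      # int32
print(np.result_type(np.int64, np.uint64))     # float64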
However it doesn't say anything about the view / copy behavior and that's by design. jreback mentioned on the pandas issue tracker 1 that this really is just an implementation detail:
this is an implementation detail. since you are getting a single dtyped numpy array, it is upcast to a compatible dtype. if you have mixed dtypes, then you almost always will have a copy (the exception is mixed float dtypes will not copy I think), but this is a numpy detail.
I agree this is not great, but it has been there from the beginning and will not change in current pandas. If exporting to numpy you need to take care.
Even the documentation of Series mentions nothing about a view:
pandas.Series.values
Return Series as ndarray or ndarray-like depending on the dtype
It even mentions that it might not return a plain array at all, depending on the dtype. That certainly includes the possibility (even if only hypothetically) that it returns a copy. It does not guarantee that you get a view.
When does .values return a view and when does it return a copy?
The answer is simply: It's an implementation detail and as long as it's an implementation detail there won't be any guarantees. The reason it's an implementation detail is because the pandas developers want to make certain that they can change the internal storage if they want to.
However in some cases it's impossible to create a view. For example with a DataFrame containing columns of different dtypes.
You might gain something by analyzing the behavior to date, but as long as it's an implementation detail you shouldn't really rely on it anyway.
However if you're interested: Pandas currently stores columns with the same dtype internally as multi-dimensional array. That has the advantage that you can operate on rows and columns very efficiently (at least as long as they have the same dtype). But if the DataFrame contains mixed types it will have several internal multi-dimensional arrays. One for each dtype. It's not possible to create a view that points into two distinct arrays (at least for NumPy) so when you have mixed dtypes you'll get a copy when you want the values.
A side-note, your example:
df = pd.DataFrame([["A", "B"],["C", "D"]])
df.values[0][0] = 0
Isn't mixed-dtype. It has a specific dtype: object. However object arrays can contain any Python object, so I can see why you would say/assume that it's of mixed types.
Personal note:
Personally I would have preferred that the values property only ever return views (or raise an error when it cannot), with an additional method (e.g. as_array) that always returns a copy, even when a view would be possible. That would certainly make the behavior more predictable; a property that silently performs an expensive copy is certainly unexpected.
1 This question has been mentioned in the issue post, so maybe the docs changed because of this question.
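For what it's worth, newer pandas versions (0.24+) provide DataFrame.to_numpy(), whose copy=True parameter guarantees a fresh array that is safe to mutate:
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
arr = df.to_numpy(copy=True)  # always a copy
arr[0, 0] = 99
print(df.iloc[0, 0])          # still 1 -- the frame is untouched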
Let's test it out.
First, with pd.Series objects.
In [750]: s = pd.Series([1, 2, 3])
In [751]: v = s.values
In [752]: v[0] = 10000
In [753]: s
Out[753]:
0    10000
1        2
2        3
dtype: int64
Now, for DataFrame objects. First, consider non-mixed dtypes -
In [780]: df = pd.DataFrame(1 - np.eye(3, dtype=int))
In [781]: df
Out[781]:
   0  1  2
0  0  1  1
1  1  0  1
2  1  1  0
In [782]: v = df.values
In [783]: v[0] = 12345
In [784]: df
Out[784]:
       0      1      2
0  12345  12345  12345
1      1      0      1
2      1      1      0
Modifications are made, so that means .values returned a view.
Now, consider a scenario with mixed dtypes -
In [755]: df = pd.DataFrame({'A' :[1, 2], 'B' : ['ccc', 'ddd']})
In [756]: df
Out[756]:
   A    B
0  1  ccc
1  2  ddd
In [757]: v = df.values
In [758]: v[0] = 123
In [759]: v[0, 1] = 'zzxxx'
In [760]: df
Out[760]:
   A    B
0  1  ccc
1  2  ddd
Here, .values returns a copy.
Observation
.values for a Series returns a view regardless of dtype, whereas for a DataFrame it depends: for homogeneous dtypes a view is returned, otherwise a copy.

Pandas df.columns.values.tostring()

When I use the following on a df...
df.columns.values.tostring()
I get the following, which is not at all like my column names (and there are far fewer columns than that). When I omit tostring(), I just get the column names.
b'0\x16B\n\x00\x00\x00\x00p\x84P\n\x00\x00\x00\x00\xf0\xe7x\t\x00\x00\x00\x00\xb0\xf3J\n\x00\x00\x00\x00p\xfc\t\x0c\x00\x00\x00\x000\xad\xd7\x00\x00\x00\x00\x00p\xae\xd7\x00\x00\x00\x00\x00\xf0\xab\xd7\x00\x00\x00\x00\x00(9\x05\x01\x00\x00\x00\x00\xf0\xa7\xdd\x0b\x00\x00\x00\x00p\xac\xdd\x0b\x00\x00\x00\x00\xf0\xed\xc1\x00\x00\x00\x00\x00\xb0\xa3\xdd\x0b\x00\x00\x00\x000g\xdd\x0b\x00\x00\x00\x00p\xf2\xb2\x0c\x00\x00\x00\x000\xf1\xb2\x0c\x00\x00\x00\x00\xf0\xf0\xb2\x0c\x00\x00\x00\x00\xb0\xf0\xb2\x0c\x00\x00\x00\x00\xa0w\x9a\x05\x00\x00\x00\x000\xae\xd7\x00\x00\x00\x00\x00\x90\x9c\xe4\x00\x00\x00\x00\x00\xd0U\n\x0c\x00\x00\x00\x00\xb0\xfa\t\x0c\x00\x00\x00\x00\xb0\n\xca\x00\x00\x00\x00\x00\x88\x8e\xbb\x00\x00\x00\x00\x00\xf0\x05\xca\x00\x00\x00\x00\x00\x90<y\t\x00\x00\x00\x00\x18?y\t\x00\x00\x00\x00\xb0\x01\xca\x00\x00\x00\x00\x00\xb0=y\t\x00\x00\x00\x00\xf8=y\t\x00\x00\x00\x00p\xac\xd7\x00\x00\x00\x00\x00\xb0\xad\xd7\x00\x00\x00\x00\x00'
I can't figure out why. The df is a product of several instances of pd.merge and type conversions.
This isn't really a pandas thing, it's a numpy thing. df.columns.values gives us a numpy array:
>>> df = pd.DataFrame({"A": [1,2,3], "B": [4,5,6]})
>>> df
   A  B
0  1  4
1  2  5
2  3  6
>>> df.columns
Index(['A', 'B'], dtype='object')
>>> df.columns.values
array(['A', 'B'], dtype=object)
The tostring method of a numpy array promises:
Construct Python bytes containing the raw data bytes in the array.
Constructs Python bytes showing a copy of the raw contents of data memory. The bytes object can be produced in either ‘C’ or ‘Fortran’, or ‘Any’ order (the default is ‘C’-order). ‘Any’ order means C-order unless the F_CONTIGUOUS flag in the array is set, in which case it means ‘Fortran’ order.
This function is a compatibility alias for tobytes. Despite its name it returns bytes not strings.
which is why you get something messy:
>>> df.columns.values.tostring()
b'\xe0N\x0e\xb7\x00\\\x14\xb7'
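If the goal was just to get the column names as a plain Python list, Index.tolist() (or the built-in list()) is probably what you want:
>>> df.columns.tolist()
['A', 'B']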

making multiple pandas data frames using a loop or list comprehension

I have a pandas data frame that I want to subdivide by row into 32 different slices (think of a large data set chopped by row into 32 smaller data sets). I can manually divide the data frame in this way:
df_a = df[df['Type']=='BROKEN PELVIS']
df_b = df[df['Type']=='ABDOMINAL STRAIN']
I'm assuming there is a much more Pythonic expression someone might like to share. I'm looking for something along the lines of:
for i in new1:
df_%s= df[df['#RIC']=='%s'] , %i
Hope that makes sense.
In these kinds of situations I think it's more pythonic to store the DataFrames in a python dictionary:
injuries = {injury: df[df['Type'] == injury] for injury in df['Type'].unique()}
injuries['BROKEN PELVIS'] # is the same as df_a above
Most of the time you don't need to create a new DataFrame and can use a groupby instead (it depends on what you're doing next); see http://pandas.pydata.org/pandas-docs/stable/groupby.html:
g = df.groupby('Type')
Update: in fact there is a method get_group to access these:
In [21]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])
In [22]: g = df.groupby(0)
In [23]: g.get_group('A')
Out[23]:
   0  1
0  A  2
1  A  4
Note: most of the time you don't need to do this; apply, aggregate and transform are your friends!
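For example, iterating over the groupby yields (key, sub-frame) pairs, so you rarely need to build all 32 frames up front; a minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'Type': ['BROKEN PELVIS', 'ABDOMINAL STRAIN', 'BROKEN PELVIS'],
                   'Severity': [3, 1, 2]})

for injury, frame in df.groupby('Type'):
    print(injury, len(frame))  # each frame is the slice for one injury type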
