Does pandas.Series.isin require the data to be hashable? I could not find this requirement in the documentation (neither under Series.isin nor Series; I do see that the index needs to be hashable, but nothing about the data).
import pandas as pd

foo = ['x', 'y']
bar = pd.Series(foo, index=['a', 'b'])
baz = pd.Series([foo[0]], index=['c'])
print(bar.isin(baz))
works as expected and returns
a     True
b    False
dtype: bool
However, the following fails with the error TypeError: unhashable type: 'list':
foo = [['x', 'y'], ['z', 't']]
bar = pd.Series(foo, index=['a', 'b'])
baz = pd.Series([foo[0]], index=['c'])
print(bar.isin(baz))
Is that intended? Is it documented somewhere? Or is it a bug in pandas?
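The error message suggests that isin builds a hash table of the values internally. A common workaround (a sketch, not an officially documented requirement) is to store hashable tuples instead of lists:
import pandas as pd

foo = [['x', 'y'], ['z', 't']]
# tuples are hashable, so the hash-based membership test succeeds
bar = pd.Series([tuple(v) for v in foo], index=['a', 'b'])
baz = pd.Series([tuple(foo[0])], index=['c'])
print(bar.isin(baz))
# a     True
# b    False
# dtype: bool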
I would like to view an object array with a dtype that encapsulates entire rows:
import numpy as np

data = np.array([['a', '1'], ['a', 'z'], ['b', 'a']], dtype=object)
dt = np.dtype([('x', object), ('y', object)])
data.view(dt)
I get an error:
TypeError: Cannot change data-type for object array.
I have tried the following workarounds:
dt2 = np.dtype([('x', np.object, 2)])
data.view(dt2)
data.view(np.uint8).view(dt)
data.view(np.void).view(dt)
All cases result in the same error. Is there some way to view an object array with a different dtype?
I have also tried a more general approach (this is for reference, since it's functionally identical to what's shown above):
dt = np.dtype(','.join(data.dtype.char * data.shape[1]))
dt2 = np.dtype([('x', data.dtype, data.shape[1])])
It seems that you can always force a view of a buffer using np.array:
view = np.array(data, dtype=dt, copy=not data.flags['C_CONTIGUOUS'])
While this is a quick and dirty approach, the data gets copied in this case, and dt2 does not get applied correctly:
>>> print(view.base)
None
>>> np.array(data, dtype=dt2, copy=not data.flags['C_CONTIGUOUS'])
array([[(['a', 'a'],), (['1', '1'],)],
       [(['a', 'a'],), (['z', 'z'],)],
       [(['b', 'b'],), (['a', 'a'],)]], dtype=[('x', 'O', (2,))])
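(The duplication presumably happens because dt2 declares a (2,)-subarray field, so np.array broadcasts each scalar element of data into that subarray; that is why every string appears twice.)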
For a more correct approach (in some circumstances), you can use the raw np.ndarray constructor:
real_view = np.ndarray(data.shape[:1], dtype=dt2, buffer=data)
This makes a true view of the data:
>>> real_view
array([(['a', '1'],), (['a', 'z'],), (['b', 'a'],)], dtype=[('x', 'O', (2,))])
>>> real_view.base is data
True
As shown, this only works when the data has C-contiguous rows.
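If you cannot guarantee contiguity up front, a small guard works (a sketch; note that in the non-contiguous case the result necessarily views a fresh copy rather than the original data):
buf = data if data.flags['C_CONTIGUOUS'] else np.ascontiguousarray(data)
# when a copy was made, real_view.base is the copy, not `data`
real_view = np.ndarray(buf.shape[:1], dtype=dt2, buffer=buf)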
I don't understand the following strange conversion behavior in pandas:
import pandas as pd

d = pd.DataFrame({'a': ['x', 'y'], 'b': ['s', 't']})
s = d['a'].astype('|S1')
print(s.dtypes)
d['a'] = s
print(d.dtypes)
print(s.dtypes)
print(d.astype('|S1').dtypes)
produces the following output:
|S1
a object
b object
dtype: object
|S1
a |S1
b |S1
dtype: object
When I convert a column as a pd.Series, it gets converted, but when it is put back into the DataFrame, it reverts to object. Yet the whole DataFrame can be converted. What gives?
I have been scouring the documentation for some reference to this behavior, but haven't found any clue.
Just for completeness, here is an abbreviated version of the pd.show_versions() output:
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
pandas : 0.25.1
numpy : 1.17.1
If I understand the issue correctly, try overwriting the column in place instead of assigning the converted Series to s first:
d = pd.DataFrame({'a':['x', 'y'], 'b': ['s', 't']})
d['a'] = d['a'].astype('|S1')
I have always been accustomed to using .astype(str) myself, but I am not sure of your specific situation.
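As a quick sanity check (a sketch; behavior may differ across pandas versions), inspect the dtypes right after the overwrite:
import pandas as pd

d = pd.DataFrame({'a': ['x', 'y'], 'b': ['s', 't']})
d['a'] = d['a'].astype('|S1')
# on pandas 0.25 the column may still report object here, matching the
# assignment behavior shown in the question
print(d.dtypes)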
I am trying to filter data from a large HDF store down to the required subset using the where argument of the read_hdf method:
phase = pd.read_hdf(DSPATH + '/phase-table.h5', 'phase', where='EXTSIDNAME=="A"')
According to the pandas documentation, I can specify any column defined in the dataset with basic logical conditions, and the syntax column_name == 'string literal' is supported. However, the library throws a ValueError for any column I try to specify:
ValueError: The passed where expression: EXTSIDNAME=="A"
contains an invalid variable reference
all of the variable references must be a reference to
an axis (e.g. 'index' or 'columns'), or a data_column
The currently defined references are: index,columns
The only condition that does not throw the error is 'index=1'.
The column exists in the data store: if I load the data without a filter, I can see that the columns I am trying to use in the where condition do exist:
Index(['EXTSIDNAME', 'HOSTNAME', 'TIMESTP', 'SUM_ENDDATE', 'MODULE_ID', 'MODULENAME',
       'MODULE_STARTDATE', 'MODULE_ENDDATE', 'PHASE_ID', 'PHASENAME',
       'PHASE_STARTDATE', 'PHASE_ENDDATE', 'ID', 'PhaseDuration'], dtype='object')
I am using the latest stable libraries from the Anaconda bundle.
If you created the HDF store with to_hdf(), you need to specify the data_columns parameter. A similar question is posted here.
An example:
import pandas as pd

d = {'Col': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
     'X': [1, 2, 3, 4, 5, 6, 7, 8]}
df = pd.DataFrame.from_dict(d)
df looks like this:
Col X
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 D 8
Let's write it to an .h5 file with to_hdf(). It's important that format is set to 'table':
df.to_hdf('myhdf.h5', 'somekey', format='table')
Now let's read it with read_hdf():
pd.read_hdf('myhdf.h5', key='somekey', where='Col==A')
This returns an error:
ValueError: The passed where expression: Col==A
contains an invalid variable reference
all of the variable references must be a reference to
an axis (e.g. 'index' or 'columns'), or a data_column
The currently defined references are: index,columns
What gives?
When you do to_hdf(), you need to also define data_columns like this:
df.to_hdf('myhdf.h5', 'somekey', format='table', data_columns=['Col', 'X'])
Now you can read data from .h5 file using where:
pd.read_hdf('myhdf.h5', key='somekey', where='Col==A')
Col X
0 A 1
1 A 2
2 A 3
With where as a list:
pd.read_hdf('myhdf.h5', key='somekey', where=['Col==A', 'X==2'])
Col X
1 A 2
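If you would rather not enumerate the queryable columns by hand, data_columns=True indexes every column (at some cost in file size and write speed):
# make every column usable in `where`
df.to_hdf('myhdf.h5', 'somekey', format='table', data_columns=True)
pd.read_hdf('myhdf.h5', key='somekey', where='X>5')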
Suppose I have two pandas DataFrames of the form:
>>> df
A B C
first 62.184209 39.414005 60.716563
second 51.508214 94.354199 16.938342
third 36.081861 39.440953 38.088336
>>> df1
A B C
first 0.828069 0.762570 0.717368
second 0.136098 0.991668 0.547499
third 0.120465 0.546807 0.346949
That I generated with:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([3, 3]) * 100,
                  columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
df1 = pd.DataFrame(np.random.random([3, 3]),
                   columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
What is the smartest and quickest way of getting something like this:
A B C
first 62.184209 39.414005 60.716563
first_s 0.828069 0.762570 0.717368
second 51.508214 94.354199 16.938342
second_s 0.136098 0.991668 0.547499
third 36.081861 39.440953 38.088336
third_s 0.120465 0.546807 0.346949
I guess I could do it with a for loop that takes even rows from the first frame and odd rows from the second, but that does not seem very efficient to me.
Try this:
In [501]: pd.concat([df, df1.set_index(df1.index + '_s')]).sort_index()
Out[501]:
A B C
first 62.184209 39.414005 60.716563
first_s 0.828069 0.762570 0.717368
second 51.508214 94.354199 16.938342
second_s 0.136098 0.991668 0.547499
third 36.081861 39.440953 38.088336
third_s 0.120465 0.546807 0.346949
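If you would rather not rely on the suffixed labels sorting directly after their originals, a positional interleave avoids sort_index entirely (a sketch, assuming both frames have the same length):
import numpy as np

stacked = pd.concat([df, df1.set_index(df1.index + '_s')])
# reorder positions 0..5 into 0, 3, 1, 4, 2, 5, alternating the two frames
order = np.arange(len(stacked)).reshape(2, -1).T.ravel()
result = stacked.iloc[order]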
I am trying to extract data from a CSV file using Python's pandas module. The experiment data has 6 columns (let's say a, b, c, d, e, f) and I have a list of model directories. Not every model has all 6 'species' (columns), so I need to split the data specifically for each model. Here is my code:
def read_experimental_data(self, experiment_path):
    path, fle = os.path.split(experiment_path)
    os.chdir(path)
    data_df = pandas.read_csv(experiment_path)
    # print(data_df)
    experiment_species = data_df.keys()  # (a, b, c, d, e, f)
    # print(experiment_species)
    for i in self.all_models_dirs:  # iterate through a list of model directories
        path, fle = os.path.split(i)
        model_specific_data = pandas.DataFrame()
        species_dct = self.get_model_species(i + '.xml')  # all the species (columns) in this particular model
        # print(species_dct)
        # keep only species that are included in model dir i
        for l in species_dct.keys():
            for m in experiment_species:
                if l == m:
                    # how do I collate these pandas Series into a single DataFrame?
                    print(data_df[m])
The above code gives me the correct data, but I'm having trouble collecting it in a usable format. I've tried to merge and concatenate the resulting Series, but no joy. Does anybody know how to do this?
Thanks
You can create a new DataFrame from data_df by passing it a list of the columns you want:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]})
df_filtered = df[['a', 'c']]
or, as an example using some of your variable names:
import pandas as pd
data_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6],
                        'd': [7, 8], 'e': [9, 10], 'f': [11, 12]})
experiment_species = data_df.keys()
species_dct = ['b', 'd', 'e', 'x', 'y', 'z']
good_columns = list(set(experiment_species).intersection(species_dct))
df_filtered = data_df[good_columns]
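One caveat: set intersection does not preserve column order. If the filtered frame should keep the columns in the order they appear in the experiment data, a list comprehension does that (same variable names as above):
# preserves the CSV's column order, unlike set.intersection
good_columns = [c for c in experiment_species if c in species_dct]
df_filtered = data_df[good_columns]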