Set single value on pandas multiindex dataframe - python

With a single-index dataframe, we can use loc to get, set, and change values:
>>> df=pd.DataFrame()
>>> df.loc['A',1]=1
>>> df
     1
A  1.0
>>> df.loc['A',1]=2
>>> df.loc['A',1]
2.0
However, with a multiindex dataframe, loc can get and change existing values:
>>> df=pd.DataFrame([['A','B',1]])
>>> df=df.set_index([0,1])
>>> df.loc[('A','B'),2]
1
>>> df.loc[('A','B'),2]=3
>>> df.loc[('A','B'),2]
3
but setting a value on a fresh, empty DataFrame seems to fail:
>>> df=pd.DataFrame()
>>> df.loc[('A','B'),2]=3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 688, in __setitem__
indexer = self._get_setitem_indexer(key)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 630, in _get_setitem_indexer
return self._convert_tuple(key, is_setter=True)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 754, in _convert_tuple
idx = self._convert_to_indexer(k, axis=i, is_setter=is_setter)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 1212, in _convert_to_indexer
return self._get_listlike_indexer(key, axis, raise_missing=True)[1]
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 1266, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 1308, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['A', 'B'], dtype='object')] are in the [index]"
Why is this, and what is the "right" way to use loc to set a single value in a multiindex dataframe?

This fails because the empty DataFrame's index is a plain single-level Index, so loc treats the tuple ('A', 'B') as a list of two separate labels rather than as one MultiIndex key, and neither label exists in the index.
You need to initialize the empty DataFrame with a MultiIndex that has the correct number of levels, for example using pandas.MultiIndex.from_arrays:
idx = pd.MultiIndex.from_arrays([[],[]])
df = pd.DataFrame(index=idx)
df.loc[('A','B'), 2] = 3
Output:
       2
A B  3.0
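For completeness, a minimal sketch of the same idea with named levels (the level names here are illustrative, not from the question):
import pandas as pd
# An empty two-level MultiIndex; naming the levels is optional but makes
# the printed frame easier to read.
idx = pd.MultiIndex.from_arrays([[], []], names=['outer', 'inner'])
df = pd.DataFrame(index=idx)
df.loc[('A', 'B'), 2] = 3  # creates the ('A', 'B') row and column 2 in one step
print(df)                  # one row, value 3.0 in column 2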

Related

NaN values not replaced into dask dataframe

I'm trying to convert a dask dataframe to a pandas dataframe with the following code:
import dask.dataframe as dd
uri = "mysql+pymysql://myUser:myPassword#myHost:myPort/myDatabase"
dataframe = dd.read_sql_table("myTable", uri, "id", columns=["id", "name", "type_id"])
df = dataframe.fillna(0)
print(len(df.index))
However, I'm facing the following error:
Traceback (most recent call last):
File "tmp.py", line 5, in <module>
print(len(df.index))
File "/home/user/.local/lib/python3.7/site-packages/dask/dataframe/core.py", line 593, in __len__
len, np.sum, token="len", meta=int, split_every=False
File "/home/user/.local/lib/python3.7/site-packages/dask/base.py", line 288, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/home/user/.local/lib/python3.7/site-packages/dask/base.py", line 570, in compute
results = schedule(dsk, keys, **kwargs)
File "/home/user/.local/lib/python3.7/site-packages/dask/threaded.py", line 87, in get
**kwargs
File "/home/user/.local/lib/python3.7/site-packages/dask/local.py", line 517, in get_async
raise_exception(exc, tb)
File "/home/user/.local/lib/python3.7/site-packages/dask/local.py", line 325, in reraise
raise exc
File "/home/user/.local/lib/python3.7/site-packages/dask/local.py", line 223, in execute_task
result = _execute_task(task, data)
File "/home/user/.local/lib/python3.7/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/home/user/.local/lib/python3.7/site-packages/dask/utils.py", line 35, in apply
return func(*args, **kwargs)
File "/home/user/.local/lib/python3.7/site-packages/dask/dataframe/io/sql.py", line 232, in _read_sql_chunk
return df.astype(meta.dtypes.to_dict(), copy=False)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 5683, in astype
col.astype(dtype=dtype[col_name], copy=copy, errors=errors)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 5698, in astype
new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 582, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 442, in apply
applied = getattr(b, f)(**kwargs)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 625, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 868, in astype_nansafe
raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer
The table I'm using has the following structure (retrieved using only pandas):
id   name     type_id
---------------------
2    name_2   3.0
3    name_3   3.0
4    name_4   1.0
6    name_6   NaN
7    name_7   2.0
...
I tried the same code without retrieving the 'type_id' column and it works as expected.
What I don't understand is why the NaN values are not replaced by 0, since I call fillna(0) before trying to convert the dataframe.
If I look at my database with phpMyAdmin, the pandas 'NaN' values are 'NULL' values.
By using df = dataframe.fillna(0) you are instructing dask to fill NaNs in all columns, which can be problematic. Specifying the column with NaNs explicitly might work:
df = dataframe.copy()
df["type_id"] = df["type_id"].astype('float').fillna(0)
Another option is to try dd.to_numeric:
df["type_id"] = dd.to_numeric(df["type_id"], errors="coerce").fillna(0)

pandas filter dataframe based on chained splits

I have a pandas dataframe which contains a column (column name filenames) with filenames. The filenames look something like:
long_file1_name_0.jpg
long_file2_name_1.jpg
long_file3_name_0.jpg
...
To filter, I do this (let's say `select_string="0"`):
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
but I get thrown this:
Traceback (most recent call last):
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "python_file.py", line 118, in <module>
main()
File "inference.py", line 57, in main
_=some_function(config_dict=config_dict, logger=logger, select_string=config_dict['global']['select_string'])
File "/file/location/dir/etc/fprint/dataloaders.py", line 31, in some_function2
logger=logger, select_string=select_string)
File "/file/location/dir/etc/fprint/preprocess.py", line 25, in df_preprocess
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 882, in __getitem__
return self._get_value(key)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 991, in _get_value
loc = self.index.get_loc(label)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 0
I think it does not like me chaining the splits, but I vaguely remember doing this some time ago and it worked... so I am perplexed why it throws this error.
PS: I do know how to solve this using .contains, but I would like to use this approach of comparing strings.
Any pointers would be great!
Here is another way, with .str.extract():
import pandas as pd
df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
                                'long_file2_name_1.jpg',
                                'long_file3_name_0.jpg',
                                'long_file3_name_33.jpg']})
Now, create a boolean mask. The squeeze() method ensures we have a Series, so the mask will work:
mask = (df['filename'].str.extract(r'\w+_(\d+)\.jpg')
          .astype(int)
          .eq(0)
          .squeeze())
print(df.loc[mask])
                filename
0  long_file1_name_0.jpg
2  long_file3_name_0.jpg
Assuming all rows contain .jpg; if not, change the split to use '.' instead:
select_string = str(0)  # select_string should be of type str
df_fp = df_fp[df_fp["filenames"].apply(lambda x: x.split(".jpg")[0].split("_")[-1]).astype(str) == select_string]
This part:
df_fp["filenames"].str.split(".jpg")[0]
returns the first row of the resulting Series (the split list for the row labeled 0), not the first element of each list.
What you are looking for is the expand parameter (it creates a new column for every element in the list after the split), combined with rsplit so the last piece always lands in the same column:
df[df['filenames'].str.split('.jpg', expand=True)[0].str.rsplit('_', n=1, expand=True)[1] == '0']
Alternatively you could do that via apply:
df[df['filenames'].apply(lambda x: x.split('.jpg')[0].split('_')[-1]) == '0']
but contains is definitely more appropriate here.
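For reference, the chained version the question was reaching for does work if every step stays on the .str accessor, so that indexing picks list elements instead of row labels; a minimal sketch on the question's data:
# .str[0] and .str[-1] index into each row's list, unlike plain [0],
# which looks up the row labeled 0 in the Series.
mask = df_fp["filenames"].str.split(".jpg").str[0].str.split("_").str[-1] == select_string
df_fp = df_fp[mask]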

DataFrame.apply with str.extract throws error even though function works on each column-series

With this example DataFrame:
df = pd.DataFrame([['A-3', 'B-4'], ['C-box', 'D1-go']])
Calling extract on individual columns as series works fine:
df.iloc[:, 0].str.extract('-(.+)')
df.iloc[:, 1].str.extract('-(.+)')
and also on the other axis:
df.iloc[0, :].str.extract('-(.+)')
df.iloc[1, :].str.extract('-(.+)')
So, I'd expect using apply would work (by applying extract to each column):
df.apply(lambda s: s.str.extract('-(.+)'), axis=0)
But it throws this error:
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\IPython\core\interactiveshell.py", line 3325, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-588-70b1808d5457>", line 2, in <module>
df.apply(lambda s: s.str.extract('-(.+)'))
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 260, in apply_standard
return self.wrap_results()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 308, in wrap_results
return self.wrap_results_for_axis()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 340, in wrap_results_for_axis
result = self.obj._constructor(data=results)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\frame.py", line 392, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 212, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 51, in arrays_to_mgr
index = extract_index(arrays)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 308, in extract_index
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Using axis=1 yields an unexpected result, a Series with each row being a Series:
Out[2]:
0 0
0 3
1 4
1 0
0 box
1 go
dtype: object
I'm using apply because I think this would result in the fastest execution time, but I am open to other suggestions.
You can use split instead.
df.apply(lambda s: s.str.split('-', expand=True)[1])
Out[1]:
     0   1
0    3   4
1  box  go
The default for the expand parameter in str.extract is True, and it returns a DataFrame. Since you are applying it to multiple columns, it tries to return multiple DataFrames. Set expand to False to handle that:
df.apply(lambda x: x.str.extract('-(.*)', expand=False))
     0   1
0    3   4
1  box  go
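Another way to keep extract's default DataFrame-per-column behavior and still end up with a single frame is to run it per column and concatenate the pieces; a minimal sketch:
import pandas as pd
df = pd.DataFrame([['A-3', 'B-4'], ['C-box', 'D1-go']])
# With expand=False each extract returns a Series, so concat can line
# them up side by side.
out = pd.concat([df[c].str.extract('-(.+)', expand=False) for c in df.columns], axis=1)
print(out)
#      0   1
# 0    3   4
# 1  box  go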

How to fix 'KeyError:****_target ' error in Python 3.7 [duplicate]

This question already has answers here:
NumPy Error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
(4 answers)
Solution to Error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
(2 answers)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
(10 answers)
Closed 3 years ago.
I got this code from YouTube; I'm not sure why the tutor (Sentdex) doesn't get the same error I do.
I have a Test.csv file with dates as the index:
Dates     'A Close'  'B Close'  'DLF Close'  'ICICI Close'
1 jan 18  555        111        122          400
2 jan 18  566        132        128          398
and so on .....
from collections import Counter
import numpy as np
import pandas as pd

hm_days = 7

def process_data(ticker):
    df = pd.read_csv('Test.csv', index_col=0)
    tickers = df.columns.values.tolist()
    df.fillna(0, inplace=True)
    for i in range(1, hm_days+1):
        df['{}_{}d'.format(ticker, i)] = (df[ticker].shift(-i) -
                                          df[ticker])/df[ticker]
    df.fillna(0, inplace=True)
    return tickers, df

def buy_sell_hold(*args):
    cols = [c for c in args]
    req = 0.02
    for col in cols:
        if all(col) > req:
            return 1
        if all(col) < -req:
            return -1
    return 0

def extract_feature(ticker):
    tickers, df = process_data(ticker)
    df['{}_target'.format(ticker)] = list(map(buy_sell_hold,
                                              df[['{}_{}d'.format(ticker, i)
                                                  for i in range(1, hm_days + 1)]].values))
    vals = df['{}_target'.format(ticker)].values.tolist()
    str_vals = [str(i) for i in vals]
    print('Data spread:', Counter(str_vals))
    df.fillna(0, inplace=True)
    df = df.replace([np.inf, -np.inf], np.nan)
    df.dropna(inplace=True)
    df_vals = df[[ticker for ticker in tickers]].pct_change()
    df_vals = df_vals.replace([np.inf, -np.inf], 0)
    df_vals.fillna(0, inplace=True)
    x = df_vals.values
    y = df['{}_target'.format(ticker)].values
    return x, y, df

extract_feature('DLF Close')
This is the error I am getting:
Traceback (most recent call last):
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'DLF Close_target'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Sudipto\Dropbox\Sentdex\PPF Backup\try.py", line 48, in <module>
extract_feature('DLF Close')
File "C:\Users\Sudipto\Dropbox\Sentdex\PPF Backup\try.py", line 33, in extract_feature
vals = df['{}_target'.format(ticker)].values.tolist()
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'DLF Close_target'
I gather the issue is with the line:
vals = df['{}_target'.format(ticker)].values.tolist()
I checked the code twice, thrice... and couldn't figure out what goes wrong when I call it for "DLF Close". Can anyone help me with this?

load text file with separate columns in python pandas

I have a text file that looks like this:
# Pearson correlation [n=344 #col=2]
# Name Name Value BiasCorr 2.50% 97.50% N: 2.50% N:97.50%
# --------------- --------------- -------- -------- -------- -------- -------- --------
101_DGCA3.1D[0] 101_LEC.1D[0] +0.85189 +0.85071 +0.81783 +0.87777 +0.82001 +0.87849
I have loaded it into python pandas using the following code:
import pandas as pd
data = pd.read_table('test.txt')
print data
However, I can't seem to access the different columns separately. I have tried using sep=' ' and copying the spaces between the columns in the text file, but I still don't get any column names, and trying to print data[0] gives me an error:
Traceback (most recent call last):
File "cut_afni_output.py", line 3, in <module>
print data[0]
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1969, in __getitem__
return self._getitem_column(key)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1976, in _getitem_column
return self._get_item_cache(key)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 1091, in _get_item_cache
values = self._data.get(item)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3211, in get
loc = self.items.get_loc(item)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/index.py", line 1759, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)
File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)
File "pandas/hashtable.pyx", line 668, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)
File "pandas/hashtable.pyx", line 676, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: 0
I haven't been able to set the header row manually because it seems like python views the whole thing as one column. How do I make the text file be read in as separate columns that I can call?
Try this:
In [33]: df = pd.read_csv(filename, comment='#', header=None, delim_whitespace=True)
In [34]: df
Out[34]:
                 0              1        2        3        4        5        6        7
0  101_DGCA3.1D[0]  101_LEC.1D[0]  0.85189  0.85071  0.81783  0.87777  0.82001  0.87849
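Since the real header line is itself commented out, the columns come back numbered 0 through 7; if named columns are wanted, read_csv's names parameter can supply them (these labels are illustrative, adapted from the file's commented header):
In [35]: cols = ['name1', 'name2', 'value', 'bias_corr', 'lo_2.5', 'hi_97.5', 'n_lo_2.5', 'n_hi_97.5']
In [36]: df = pd.read_csv(filename, comment='#', header=None, delim_whitespace=True, names=cols)
In [37]: df['value']
Out[37]:
0    0.85189
Name: value, dtype: float64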
