How can we concatenate two columns based on names? - python

I was working with a MultiIndex dataframe (which I find unbelievably complicated to work with). I flattened the MultiIndex into just level 0 with this line of code.
df_append.columns = df_append.columns.map('|'.join).str.strip('|')
Now, when I print columns, I get this.
Index(['IDRSSD', 'RCFD3531|TRDG ASSETS-US TREAS SECS IN DOM OFF',
'RCFD3532|TRDG ASSETS-US GOV AGC CORP OBLGS',
'RCFD3533|TRDG ASSETS-SECS ISSD BY ST POL SUB',
'TEXTF660|3RD ITEMIZED AMT FOR OTHR TRDG ASTS',
'Unnamed: 115_level_0|Unnamed: 115_level_1',
'Unnamed: 133_level_0|Unnamed: 133_level_1',
'Unnamed: 139_level_0|Unnamed: 139_level_1',
'Unnamed: 20_level_0|Unnamed: 20_level_1',
'Unnamed: 87_level_0|Unnamed: 87_level_1', 'file', 'schedule_code',
'year', 'qyear'],
dtype='object', length=202)
I am trying to concatenate two columns into one single column, like this.
df_append['period'] = df_append['IDRSSD'].astype(str) + '-' + df_append['qyear'].astype(str)
Here is the error that I am seeing.
Traceback (most recent call last):
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2895, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'IDRSSD'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-153-92d2e8486595>", line 1, in <module>
df_append['period'] = df_append['IDRSSD'].astype(str) + '-' + df_append['qyear'].astype(str)
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: 'IDRSSD'
To me, it looks like I have a column named 'IDRSSD' and a column named 'qyear', but Python disagrees. Or, perhaps I am misinterpreting the error message. Can I get these two columns concatenated into one, or is this impossible to do? Thanks everyone.

I tried the method below. It worked for me.
1.) First convert the column to string:
df_append['IDRSSD'] = df_append['IDRSSD'].astype(str)
df_append['qyear'] = df_append['qyear'].astype(str)
2.) Now join both columns into one using '-' as the separator:
df_append['period'] = df_append[['IDRSSD', 'qyear']].apply(lambda x: '-'.join(x), axis=1)

You can use df_append.columns = df_append.columns.to_flat_index() to change the MultiIndex into a one dimensional array of tuples. From there you should be able to manipulate the columns more easily, or at least see what the issue is.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.to_flat_index.html
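A minimal sketch of that flattening path, using hypothetical two-level column names standing in for the asker's real ones:

```python
import pandas as pd

# Hypothetical two-level columns standing in for the asker's frame.
df = pd.DataFrame([[1, "2020Q1"]],
                  columns=pd.MultiIndex.from_tuples(
                      [("IDRSSD", ""), ("qyear", "")]))

# to_flat_index() turns the MultiIndex into plain tuples...
df.columns = df.columns.to_flat_index()
print(df.columns.tolist())  # [('IDRSSD', ''), ('qyear', '')]

# ...which can then be joined into single strings, as in the question.
df.columns = ['|'.join(t).strip('|') for t in df.columns]
print(df.columns.tolist())  # ['IDRSSD', 'qyear']
```

Once the names are plain strings, lookups like df['IDRSSD'] work again.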

Use the apply method:
import pandas as pd

def concat(row):
    if ("col1" in row) and ("col2" in row):
        return str(row['col1']) + "-" + str(row['col2'])

df = pd.DataFrame([["1", "2"], ["1", "2"]], columns=["col1", "col2"])
df['col3'] = df.apply(concat, axis=1)
df

Related

pandas filter dataframe based on chained splits

I have a pandas dataframe which contains a column (column name filenames) with filenames. The filenames look something like:
long_file1_name_0.jpg
long_file2_name_1.jpg
long_file3_name_0.jpg
...
To filter, I do this (let's say `select_string = "0"`):
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
but I get thrown this:
Traceback (most recent call last):
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "python_file.py", line 118, in <module>
main()
File "inference.py", line 57, in main
_=some_function(config_dict=config_dict, logger=logger, select_string=config_dict['global']['select_string'])
File "/file/location/dir/etc/fprint/dataloaders.py", line 31, in some_function2
logger=logger, select_string=select_string)
File "/file/location/dir/etc/fprint/preprocess.py", line 25, in df_preprocess
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 882, in __getitem__
return self._get_value(key)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 991, in _get_value
loc = self.index.get_loc(label)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 0
I think it does not like me chaining the splits, but I vaguely remember doing this some time ago and it worked, so I am perplexed about why it throws this error.
PS: I do know how to solve this using .contains, but I would like to use this approach of comparing strings.
Any pointers would be great!
Here is another way, with .str.extract():
import pandas as pd

df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
                                'long_file2_name_1.jpg',
                                'long_file3_name_0.jpg',
                                'long_file3_name_33.jpg']})
Now, create a boolean mask. The squeeze() method ensures we have a Series, so the mask will work:
mask = (df['filename'].str.extract(r'\w+_(\d+)\.jpg')
        .astype(int)
        .eq(0)
        .squeeze())
print(df.loc[mask])
filename
0 long_file1_name_0.jpg
2 long_file3_name_0.jpg
Assuming all rows contain .jpg (if not, split on '.' instead):
select_string = str(0)  # select_string should be of type str
df_fp = df_fp[df_fp["filenames"].apply(lambda x: x.split(".jpg")[0].split("_")[-1]).astype(str) == select_string]
This part:
df_fp["filenames"].str.split(".jpg")[0]
returns the first row of the Series (the list of split parts for that one filename), not the last element of each row's list.
What you are looking for is the expand parameter (it creates a new column for every element produced by the split):
df[df['filenames'].str.split('.jpg', expand=True)[0].str.rsplit('_', n=1, expand=True)[1] == '0']
Alternatively you could do that via apply:
df[df['filenames'].apply(lambda x: x.split('.jpg')[0].split('_')[-1]) == '0']
but contains is definitely more appropriate here.
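A self-contained sketch of the expand=True behaviour, with filenames like the ones above (rsplit with n=1 grabs the part after the last underscore, no matter how many underscores a name contains):

```python
import pandas as pd

df = pd.DataFrame({'filenames': ['long_file1_name_0.jpg',
                                 'long_file2_name_1.jpg',
                                 'long_file3_name_0.jpg']})

# expand=True spreads the split parts across columns, so the
# operation stays vectorized over all rows.
stems = df['filenames'].str.split('.jpg', expand=True)[0]
last = stems.str.rsplit('_', n=1, expand=True)[1]
print(df[last == '0'])  # keeps the rows whose suffix is 0
```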

Why does pandas generate a KeyError when looking up date in date-indexed table?

Consider the following code:
date_index = np.array(['2019-01-01', '2019-01-02'], dtype=np.datetime64)
df = pd.DataFrame({'a': np.array([1, 2])}, index=date_index)
date_to_lookup = date_index[0]
print(df.at[date_to_lookup, 'a'])
One might expect it to work and print 1. Yet (at least in Anaconda python 3.7.3 with Pandas 0.24.2) it fails with the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../site-packages/pandas/core/indexing.py", line 2270, in __getitem__
return self.obj._get_value(*key, takeable=self._takeable)
File ".../site-packages/pandas/core/frame.py", line 2771, in _get_value
return engine.get_value(series._values, index)
File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 447, in pandas._libs.index.DatetimeEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 17897
It appears that Pandas DataFrame and Series objects always store dates as dtype 'datetime64[ns]' or 'datetime64[ns, tz]', and the issue arises because Pandas automatically converts 'datetime64[D]' dtype to 'datetime64[ns]' when creating the index, but does not do that when looking up an element in that index. I could avoid the error above by converting the key to 'datetime64[ns]'. E.g. both of the following lines successfully print 1:
print(df.at[pd.to_datetime(date_to_lookup), 'a'])
print(df.at[date_to_lookup.astype('datetime64[ns]'), 'a'])
This behavior (automatic dtype conversion when creating an index, but not when looking up an element) seems counterintuitive to me. What is the reason it was implemented this way? Is there some coding style one is expected to follow to avoid errors like this? Or is it a bug I should file?
You can avoid this by selecting by position with DataFrame.iat and Index.get_loc for the position of column a:
print(df.iat[0, df.columns.get_loc('a')])
#alternative
#print(df.iloc[0, df.columns.get_loc('a')])
1
Another idea is to use df.index for the selection instead of date_index[0]:
print(df.at[df.index[0], 'a'])
I think this is a bug you found in 0.24.2; it works on my system with python 3.7.2 and pandas 0.25.3:
date_index = np.array(['2019-01-01', '2019-01-02'], dtype=np.datetime64)
df = pd.DataFrame({'a': np.array([1, 2])}, index=date_index)
date_to_lookup = date_index[0]
print(df.at[date_to_lookup, 'a'])
1

ERROR (glitch in pandas?): Why can't Python retrieve values from a pandas.core.series.Series when its indexes are numbers stored as strings?

I'm going to cut to the grain. Everyone knows that a column, say col = df['field'], is a 'pandas.core.series.Series'. Likewise, counts = df['field'].value_counts() produced by the value_counts() method is a 'pandas.core.series.Series'.
And you can extract the value from the first row of a 'pandas.core.series.Series' with brackets: col[0] or counts[0].
Nonetheless, the indexes of col and counts are different, and this insight is, I think, the heart of the problem I'm about to present.
I have the following 'pandas.core.series.Series' data, generated by this code.
We read the data frame as df:
df = pd.read_csv('file.csv')
df has 'year' and 'product' columns; I get their unique values and transform them into strings:
vals_year = df['year'].astype('str').unique()
vals_product = df['product'].astype('str').unique()
This is the content of each variable:
>>>vals_year
>>>['16' '18' '17']
>>> vals_product
>>>['card' 'cash']
Then I use the value_counts() method to count values and create 'pandas.core.series.Series' objects:
cy = df['year'].value_counts()
cp = df['product'].value_counts()
This is the output:
>>>cy
>>>16 65
17 40
18 12
Name: year, dtype: int64
>>>cp
>>>card 123
cash 106
Name: product, dtype: int64
Here is the first value of cp:
>>>cp[0]
>>>123
But when I try to see the first value from cy this happens:
>>>cy[0]
Traceback (most recent call last):
File "C:.../Test3.py", line 44, in <module>
print(cr[0])
File "C:\...\venv\lib\site-packages\pandas\core\series.py", line 1064, in __getitem__
result = self.index.get_value(self, key)
File "C:\...\venv\lib\site-packages\pandas\core\indexes\base.py", line 4723, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
(I just copy-pasted the message.)
Why does this happen? It makes no sense! Is this a glitch in pandas? I believe the problem resides, as I said before, in the fact that the original values in the 'year' column were ints.
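That reading is likely correct, and can be checked with a small sketch (hypothetical data shaped like the counts above): value_counts() on an int column yields an integer-labelled index, so cy[0] asks for a label 0 that does not exist, while position-based access works.

```python
import pandas as pd

# 'year' holds ints, so value_counts() builds an integer-labelled index.
cy = pd.Series([16, 16, 16, 17, 17, 18], name='year').value_counts()

# cy[0] would be a *label* lookup on that Int64 index -> KeyError: 0.
# Position-based and explicit-label access both work:
print(cy.iloc[0])   # 3, the count of the most frequent year (16)
print(cy.loc[16])   # 3, the same value looked up by label
```

For cp, the labels are strings ('card', 'cash'), so cp[0] falls back to positional access, which is why only cy raised the error.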

pandas index_col="datetime" makes df['datetime'] unavailable

The title says it all.
The following bit of pseudo-code returns the following error:
df = pd.read_sql(query, conn, parse_dates=["datetime"],
index_col="datetime")
df['datetime']
I get :
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Users\admin\.virtualenvs\EnkiForex-ey09TNOL\lib\site-packages\pandas\core\indexes\base.py", line 2656, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'datetime'
Am I misunderstanding what's going on by indexing the datetime col? I can access all the other columns normally though.
An index is not a column. Think of the index as labels for the rows of the DataFrame. index_col='datetime' makes the datetime column (in the csv) the index of df. To access the index, use df.index.
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(d)
time = pd.date_range(end='4/5/2018', periods=2)
df.index = time
df.index
The result is DatetimeIndex(['2018-04-04', '2018-04-05'], dtype='datetime64[ns]', freq='D').
Just use df.index to get the information that index_col put into the index.
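If you do want datetime back as an ordinary column after reading, reset_index restores it; a minimal sketch with an in-memory frame standing in for the SQL result:

```python
import pandas as pd

# Stand-in for a frame read with index_col="datetime".
df = pd.DataFrame({'price': [1.0, 2.0]},
                  index=pd.to_datetime(['2018-04-04', '2018-04-05']))
df.index.name = 'datetime'

# reset_index turns the index back into a regular column.
out = df.reset_index()
print(out['datetime'])  # now accessible as a column again
```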

Not able to access a column in pandas data frame

I have data frame df.
df.columns gives this output
Index([u'Talk Time\t', u'Hold Time\t', u'Work Time\t', u'Call Type'], dtype='object')
Here, column 'Talk Time' has a "\t" character in it, so if I do the following, I get an error:
df['Talk Time']
Traceback (most recent call last):
File "<ipython-input-78-f2b7b9f43f59>", line 1, in <module>
old['Talk Time']
File "C:\Users\Admin\Anaconda\lib\site-packages\pandas\core\frame.py", line 1780, in __getitem__
return self._getitem_column(key)
File "C:\Users\Admin\Anaconda\lib\site-packages\pandas\core\frame.py", line 1787, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\Admin\Anaconda\lib\site-packages\pandas\core\generic.py", line 1068, in _get_item_cache
values = self._data.get(item)
File "C:\Users\Admin\Anaconda\lib\site-packages\pandas\core\internals.py", line 2849, in get
loc = self.items.get_loc(item)
File "C:\Users\Admin\Anaconda\lib\site-packages\pandas\core\index.py", line 1402, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas\index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas\index.c:3820)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:3700)
File "pandas\hashtable.pyx", line 696, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12323)
File "pandas\hashtable.pyx", line 704, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12274)
KeyError: 'Talk Time'
So I modify the columns to remove tab characters as follows:
for n in range(len(df.columns)):
    df.columns.values[n] = df.columns.values[n].rstrip()
Tab characters get removed, df.columns give the following output
Index([u'Talk Time', u'Hold Time', u'Work Time', u'Call Type'], dtype='object')
But still, when I try to access a column as
df['Talk Time']
I see the same error. Why is this happening?
The main issue is that you modified df.columns.values in place. That mutates the underlying array (which is why the new names show up when you print the columns), but the index's cached lookup machinery still knows only the old names. So df['Talk Time\t'] would still have worked, if you had tried it, but obviously that isn't the result you were waiting for.
The solution is to assign to df.columns itself instead of mutating df.columns.values:
df.columns = [c.rstrip() for c in df.columns]
This works according to your needs.
I can't reproduce your second error; however, you could do:
df.columns = [i.rstrip() for i in df.columns]
Maybe this will help!
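The same cleanup can also be written with the Index's vectorized string accessor; a minimal sketch with tab-suffixed names like the asker's:

```python
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=['Talk Time\t', 'Hold Time\t'])

# Rebinding df.columns (rather than mutating .values in place)
# rebuilds the index and its lookup table.
df.columns = df.columns.str.rstrip()
print(df['Talk Time'])
```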
