Accessing column values with columns set up as values of an index - python

I have the following dataframe named df:
     V1  V2
IDS
a     1   2
b     3   4
If I print out the index and the columns, this is the result:
> print(df.index)
Index(['a','b'],dtype='object',name='IDS',length=2)
> print(df.columns)
Index(['V1','V2'],dtype='object',length=2)
I want to perform a calculation on these two columns (row-wise) and add this to a new column. I have tried the following, but I can't seem to access the column as expected.
df['sum'] = df.apply(lambda row: row['V1'] + row['V2'], axis=1)
I get the following error running the last line of code:
KeyError: ('V1', 'occurred at index a')
How do I access these columns?
Update: the contrived example does not reproduce the error; here is the actual dataframe I am working with:
DATE ... gathering_size_100_to_26 shelter_in_place
FIPS
10001 2020-01-22 ... 2020-01-01 2020-01-01
10002 2020-01-22 ... 2020-01-01 2020-01-02
10003 2020-02-25 ... 2020-01-01 2020-01-03
... ... ... ... ...
9013 2020-02-22 ... 2020-01-01 2020-01-01
I want to take the difference between 'gathering_size_100_to_26' and 'DATE', as well as between 'shelter_in_place' and 'DATE', and replace these values in place.

df["v1_v2_sum"] = df["V1"] + df["V2"]
In any case, avoid df.apply with a Python UDF; it performs poorly and is only needed when there is no vectorized alternative.
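For the updated frame, the same vectorized approach applies. A minimal sketch, assuming the three columns hold parseable date strings (their exact dtypes are not shown in the question):
df['DATE'] = pd.to_datetime(df['DATE'])
# subtract column-wise; each result column becomes a Timedelta
df['gathering_size_100_to_26'] = pd.to_datetime(df['gathering_size_100_to_26']) - df['DATE']
df['shelter_in_place'] = pd.to_datetime(df['shelter_in_place']) - df['DATE']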

df = pd.DataFrame(data=[[0.8062, 0.9308], [0.364 , 0.6909]],index=['a','b'], columns=['V1','V2'])
print(df)
Output:
V1 V2
a 0.8062 0.9308
b 0.3640 0.6909
df['sum'] = df.apply(sum,axis=1)
print(df)
Output:
V1 V2 sum
a 0.8062 0.9308 1.7370
b 0.3640 0.6909 1.0549

I realized I had a typo... what was stated above (but reworked for my instance) works.

Related

Grouper() and agg() functions produce multiple copies when squashed

I have a sample dataframe as given below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'ID':['A', 'A', 'A', 'B','B','B'],
'Date':['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26', '2021-09-01 00:12:29','2021-09-01 11:20:58','2021-09-02 09:20:58'],
'Name':['xx','xx',NaN,'yy',NaN,NaN],
'Height':[174,174,NaN,160,NaN,NaN],
'Weight':[74,NaN,NaN,58,NaN,NaN],
'Gender':[NaN,'Male',NaN,NaN,'Female',NaN],
'Interests':[NaN,NaN,'Hiking,Sports',NaN,NaN,'Singing']}
df1 = pd.DataFrame(data)
df1
I want to combine the data present on the same date into a single row. The 'Date' column is in timestamp format. I have written code for it; here is my try:
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.agg(lambda x: ''.join(x.dropna().astype(str)))
.reset_index()
).replace('', np.nan)
This gives an output where, if a column has multiple entries with the same value on a date, the result repeats the value within a single cell, as shown below.
Obtained Output
However, I do not want the values to be repeated if there are multiple entries. The final output should look like the image shown below.
Required Output
The first row should have 'xx' and 174.0 instead of 'xxxx' and '174.0 174.0'.
Any help is greatly appreciated. Thank you.
In your case, replace the join-based agg with first:
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.first()
.reset_index()
).replace('', np.nan)
df_out
Out[113]:
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
Since you're only trying to keep the first available value for each column for each date, you can do:
>>> df1.groupby(["ID", pd.Grouper(key='Date', freq='D')]).agg("first").reset_index()
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
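If a single day can carry several distinct non-null values that you still want to keep (first keeps only one of them), a hedged variation on the original agg is to join only the unique values. This is a sketch, not part of either answer above:
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: ', '.join(x.dropna().astype(str).unique()))
             .reset_index()
             ).replace('', np.nan)
This still collapses '174.0 174.0' to a single '174.0', but would keep, say, two different interests logged on the same day.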

How to count the values in duplicated rows in pandas

Although this seems like an easy problem, I have been struggling with it for a while. I have two dataframes; I want to find the duplicates between them with respect to certain columns, and then I want to sum the values of both dataframes with respect to another column. I will do my best to show what I mean. The following tables describe the structure of the two dataframes, which I will call df1 and df2.
df1:
make   2019-12-01  2019-06-04
BMW    0           3
VW     1           3

df2:
make   2018-12-01  2019-06-04
TESLA  0           2
VW     2           2
This is my attempt:
df = pd.concat([df1, df2], axis=1)
df_2 = df[df.duplicated(subset=['make'], keep=False)]
df_2 = pd.DataFrame(df_2)
valuePosition1 = df_2.columns.get_loc(2019-12-01)
valuePosition2 = df_2.columns.get_loc(2018-12-01)
flow = min(df_2.iloc[:, valuePosition1].sum(), df_2.iloc[:, valuePosition2].sum())
print(flow)
The answer should be 1, as there is a VW in both df1['2019-12-01'] and df2['2018-12-01']. But I keep getting weird errors:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
which doesn't even seem to relate to what I am doing. I am really at my wits' end. Both dataframes are also very big, so I need a fast way to do this.
Any guidance or help would be appreciated!
It is better to concatenate along the row axis (concat(..., axis=0)) since duplicated expects to work along that axis:
Return boolean Series denoting duplicate rows.
You can also use loc (which is primarily label based) rather than iloc (which is primarily integer position based) considering you know the columns you're interested in.
import pandas as pd
df1 = pd.read_csv('sample1.csv', sep='\s+')
df2 = pd.read_csv('sample2.csv', sep='\s+')
df = pd.concat([df1,df2], axis=0)
print(df)
dfd = df[df.duplicated(subset=['make'], keep=False)]
print(dfd)
flow = min(dfd.loc[:, '2019-12-01'].sum(),
dfd.loc[:, '2018-12-01'].sum())
print(flow)
Output from df
make 2019-12-01 2019-06-04 2018-12-01
0 BMW 0.0 3 NaN
1 VW 1.0 3 NaN
0 TESLA NaN 2 0.0
1 VW NaN 2 2.0
Output from dfd
make 2019-12-01 2019-06-04 2018-12-01
1 VW 1.0 3 NaN
1 VW NaN 2 2.0
Output from flow
1.0
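If you don't have the sample CSV files at hand, the same pipeline can be reproduced with the frames built inline; a sketch using the data from the tables in the question:
import pandas as pd

df1 = pd.DataFrame({'make': ['BMW', 'VW'], '2019-12-01': [0, 1], '2019-06-04': [3, 3]})
df2 = pd.DataFrame({'make': ['TESLA', 'VW'], '2018-12-01': [0, 2], '2019-06-04': [2, 2]})

df = pd.concat([df1, df2], axis=0)   # row-wise; columns missing on one side become NaN
dfd = df[df.duplicated(subset=['make'], keep=False)]
flow = min(dfd['2019-12-01'].sum(), dfd['2018-12-01'].sum())
print(flow)  # 1.0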

How to get all information of the rows related to min and max values after a groupby function

I have the following data set.
df=pd.DataFrame({'listing_id':['12345','12349','12345','12349','12345'], 'price':[3,5,67,7,12]})
df['date'] = pd.date_range(start='1/2/2020', periods=len(df), freq='D')
df
And I would like to apply these aggregation functions:
df.groupby('listing_id').agg({'price':['count','mean', 'std','min','max']})
What is the best way to get the dates related to the min AND max price and put this information on the same row?
We can add idxmin and idxmax, then assign the values based on them:
s=df.groupby('listing_id')['price'].agg(['count','mean', 'std','min','max','idxmax','idxmin'])
...
s['Date_max']=df.reindex(s['idxmax'])['date'].values
s['Date_min']=df.reindex(s['idxmin'])['date'].values
s
count mean std ... idxmin Date_max Date_min
listing_id ...
12345 3 27.333333 34.645827 ... 0 2020-01-04 2020-01-02
12349 2 6.000000 1.414214 ... 1 2020-01-05 2020-01-03
[2 rows x 9 columns]
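For completeness, here is a self-contained sketch of the same idea (not the answer's elided code); it uses loc instead of reindex, which is equivalent here because df keeps its default integer index:
s = df.groupby('listing_id')['price'].agg(['count', 'mean', 'std', 'min', 'max', 'idxmax', 'idxmin'])
s['Date_max'] = df.loc[s['idxmax'], 'date'].values  # date of each group's max price
s['Date_min'] = df.loc[s['idxmin'], 'date'].values  # date of each group's min price
print(s)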

Apply Numpy function over entire Dataframe

I am applying this function over a dataframe df1 such as the following:
AA AB AC AD
2005-01-02 23:55:00 "EQUITY" "EQUITY" "EQUITY" "EQUITY"
2005-01-03 00:00:00 32.32 19.5299 32.32 31.0455
2005-01-04 00:00:00 31.9075 19.4487 31.9075 30.3755
2005-01-05 00:00:00 31.6151 19.5799 31.6151 29.971
2005-01-06 00:00:00 31.1426 19.7174 31.1426 29.9647
def func(x):
    for index, price in x.iteritems():
        x[index] = price / np.sum(x,axis=1)
    return x[index]
df3=func(df1.ix[1:])
However, I only get a single column returned as opposed to 3
2005-01-03 0.955843
2005-01-04 0.955233
2005-01-05 0.955098
2005-01-06 0.955773
2005-01-07 0.955877
2005-01-10 0.95606
2005-01-11 0.95578
2005-01-12 0.955621
I am guessing I am missing something in the formula to make it apply to the entire dataframe. Also, how can I return the first row, the one that contains strings?
You need to do it the following way:
def func(row):
    return row/np.sum(row)
df2 = pd.concat([df[:1], df[1:].apply(func, axis=1)], axis=0)
It has 2 steps :
df[:1] extracts the first row, which contains strings, while df[1:] represents the rest of the DataFrame. You concatenate them later on, which answers the second part of your question.
For operating over rows, you should use the apply() method.
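A fully vectorized variant of the same normalization is also possible. A sketch, assuming (as in the question) that row 0 holds strings and every row below it is numeric:
numeric = df[1:].astype(float)                         # drop the string header row, coerce to float
normalized = numeric.div(numeric.sum(axis=1), axis=0)  # divide each row by its own sum
df2 = pd.concat([df[:1], normalized], axis=0)          # stitch the string row back on top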

What does `ValueError: cannot reindex from a duplicate axis` mean?

I am getting a ValueError: cannot reindex from a duplicate axis when I am trying to set an index to a certain value. I tried to reproduce this with a simple example, but I could not do it.
Here is my session inside of an ipdb trace. I have a DataFrame with a string index, integer columns, and float values. However, when I try to create a sum index for the sum of all columns, I get the ValueError: cannot reindex from a duplicate axis error. I created a small DataFrame with the same characteristics, but was not able to reproduce the problem; what could I be missing?
I don't really understand what ValueError: cannot reindex from a duplicate axis means. What does this error message mean? Maybe this will help me diagnose the problem, and this is the most answerable part of my question.
ipdb> type(affinity_matrix)
<class 'pandas.core.frame.DataFrame'>
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype='int64')
ipdb> affinity_matrix.index
Index([u'001', u'002', u'003', u'004', u'005', u'008', u'009', u'010', u'011', u'014', u'015', u'016', u'018', u'020', u'021', u'022', u'024', u'025', u'026', u'027', u'028', u'029', u'030', u'032', u'033', u'034', u'035', u'036', u'039', u'040', u'041', u'042', u'043', u'044', u'045', u'047', u'047', u'048', u'050', u'053', u'054', u'055', u'056', u'057', u'058', u'059', u'060', u'061', u'062', u'063', u'065', u'067', u'068', u'069', u'070', u'071', u'072', u'073', u'074', u'075', u'076', u'077', u'078', u'080', u'082', u'083', u'084', u'085', u'086', u'089', u'090', u'091', u'092', u'093', u'094', u'095', u'096', u'097', u'098', u'100', u'101', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'121', u'122', ...], dtype='object')
ipdb> affinity_matrix.values.dtype
dtype('float64')
ipdb> 'sums' in affinity_matrix.index
False
Here is the error:
ipdb> affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis
I tried to reproduce this with a simple example, but I failed:
In [32]: import pandas as pd
In [33]: import numpy as np
In [34]: a = np.arange(35).reshape(5,7)
In [35]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))
In [36]: df.values.dtype
Out[36]: dtype('int64')
In [37]: df.loc['sums'] = df.sum(axis=0)
In [38]: df
Out[38]:
10 11 12 13 14 15 16
x 0 1 2 3 4 5 6
y 7 8 9 10 11 12 13
u 14 15 16 17 18 19 20
z 21 22 23 24 25 26 27
w 28 29 30 31 32 33 34
sums 70 75 80 85 90 95 100
This error usually arises when you join / assign to a column while the index has duplicate values. Since you are assigning to a row, I suspect that there is a duplicate value in affinity_matrix.columns, perhaps not shown in your question.
As others have said, you've probably got duplicate values in your original index. To find them, do this:
df[df.index.duplicated()]
Indices with duplicate values often arise if you create a DataFrame by concatenating other DataFrames. If you don't care about preserving the values of your index and you want them to be unique, then when you concatenate the data, set ignore_index=True.
Alternatively, to overwrite your current index with a new one, instead of using df.reindex(), set:
df.index = new_index
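For reference, a minimal sketch that reproduces the error with made-up data (note that the index printed in the question does contain u'047' twice):
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0]}, index=['046', '047', '047'])
print(df[df.index.duplicated()])  # surfaces the duplicated label
df.loc['sums'] = df.sum(axis=0)   # raises the duplicate-axis ValueError (exact wording varies by pandas version)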
Simple Fix
Run this before grouping
df = df.reset_index()
Thanks to this github comment for the solution.
For people who are still struggling with this error, it can also happen if you accidentally create a duplicate column with the same name. Remove duplicate columns like so:
df = df.loc[:,~df.columns.duplicated()]
Simply skip the error by using .values at the end:
affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0).values
I came across this error today when I wanted to add a new column like this
df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
I wanted to process the REMARK column of df_temp to return 1 or 0. However, I typed the wrong variable, df, and it returned an error like this:
----> 1 df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
2417 else:
2418 # set column
-> 2419 self._set_item(key, value)
2420
2421 def _setitem_slice(self, key, value):
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _set_item(self, key, value)
2483
2484 self._ensure_valid_index(value)
-> 2485 value = self._sanitize_column(key, value)
2486 NDFrame._set_item(self, key, value)
2487
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value, broadcast)
2633
2634 if isinstance(value, Series):
-> 2635 value = reindexer(value)
2636
2637 elif isinstance(value, DataFrame):
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in reindexer(value)
2625 # duplicate axis
2626 if not value.index.is_unique:
-> 2627 raise e
2628
2629 # other
ValueError: cannot reindex from a duplicate axis
As you can see, the right code should be
df_temp['REMARK_TYPE'] = df_temp.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
Because df and df_temp have a different number of rows, it returned ValueError: cannot reindex from a duplicate axis.
Hope you can understand it and my answer can help other people to debug their code.
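For what it's worth, a minimal sketch of the failure mode the traceback guards against (the value.index.is_unique check): assigning a Series whose index contains duplicate labels. The data here is made up:
import pandas as pd

df = pd.DataFrame({'REMARK': ['a', None, 'b']}, index=[0, 0, 1])  # duplicated label 0
df_temp = pd.DataFrame({'X': [1, 2]}, index=[0, 1])
# The Series built from df carries df's duplicated index, so aligning it
# to df_temp's index raises the duplicate-axis ValueError:
df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v) != 'nan' else 0)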
In my case, this error popped up not because of duplicate values, but because I attempted to join a shorter Series to a DataFrame: both had the same index, but the Series had fewer rows (it was missing the top few). The following worked for my purposes:
df.head()
SensA
date
2018-04-03 13:54:47.274 -0.45
2018-04-03 13:55:46.484 -0.42
2018-04-03 13:56:56.235 -0.37
2018-04-03 13:57:57.207 -0.34
2018-04-03 13:59:34.636 -0.33
series.head()
date
2018-04-03 14:09:36.577 62.2
2018-04-03 14:10:28.138 63.5
2018-04-03 14:11:27.400 63.1
2018-04-03 14:12:39.623 62.6
2018-04-03 14:13:27.310 62.5
Name: SensA_rrT, dtype: float64
df = series.to_frame().combine_first(df)
df.head(10)
SensA SensA_rrT
date
2018-04-03 13:54:47.274 -0.45 NaN
2018-04-03 13:55:46.484 -0.42 NaN
2018-04-03 13:56:56.235 -0.37 NaN
2018-04-03 13:57:57.207 -0.34 NaN
2018-04-03 13:59:34.636 -0.33 NaN
2018-04-03 14:00:34.565 -0.33 NaN
2018-04-03 14:01:19.994 -0.37 NaN
2018-04-03 14:02:29.636 -0.34 NaN
2018-04-03 14:03:31.599 -0.32 NaN
2018-04-03 14:04:30.779 -0.33 NaN
2018-04-03 14:05:31.733 -0.35 NaN
2018-04-03 14:06:33.290 -0.38 NaN
2018-04-03 14:07:37.459 -0.39 NaN
2018-04-03 14:08:36.361 -0.36 NaN
2018-04-03 14:09:36.577 -0.37 62.2
I wasted a couple of hours on the same issue. In my case, I had to reset_index() on a dataframe before using an apply function.
Before merging, or looking up from another indexed dataset, you need to reset the index, as one dataset can have only one index.
I got this error when I tried adding a column from a different table. Indeed I got duplicate index values along the way. But it turned out I was just doing it wrong: I actually needed to df.join the other table.
This pointer might help someone in a similar situation.
In my case it was caused by a mismatch in dimensions: I accidentally used a column from a different df during a mul operation.
This can also be a cause; I solved my problem as follows. It may happen even if you are trying to insert a column of DataFrame type into a DataFrame. You can try this:
df['my_new'] = pd.Series(my_new.values)
If you get this error after merging two dataframes, removing the suffixes, and then trying to write to Excel:
Your problem is that there are columns you are not merging on that are common to both source DataFrames. Pandas needs a way to say which one came from where, so it adds the suffixes, the defaults being '_x' on the left and '_y' on the right.
If you have a preference on which source data frame to keep the columns from, then you can set the suffixes and filter accordingly, for example if you want to keep the clashing columns from the left:
# Label the two sides, with no suffix on the side you want to keep
df = pd.merge(
df,
tempdf[what_i_care_about],
on=['myid', 'myorder'],
how='outer',
suffixes=('', '_delete_suffix') # Left gets no suffix, right gets something identifiable
)
# Discard the columns that acquired a suffix
df = df[[c for c in df.columns if not c.endswith('_delete_suffix')]]
Alternatively, you can drop one of each of the clashing columns prior to merging, then Pandas has no need to assign a suffix.
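A sketch of that drop-first route, with clashing_cols standing in for whatever overlapping non-key column names you identify (a hypothetical helper, not from the answer above):
clashing_cols = [c for c in tempdf[what_i_care_about].columns
                 if c in df.columns and c not in ['myid', 'myorder']]
df = pd.merge(df, tempdf[what_i_care_about].drop(columns=clashing_cols),
              on=['myid', 'myorder'], how='outer')  # no suffixes needed now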
Just add .to_numpy() to the end of the series you want to concatenate.
It happened to me when I appended two dataframes into another (df3 = df1.append(df2)), so the output was:
df1
A B
0 1 a
1 2 b
2 3 c
df2
A B
0 4 d
1 5 e
2 6 f
df3
A B
0 1 a
1 2 b
2 3 c
0 4 d
1 5 e
2 6 f
The simplest way to fix the indexes is the df.reset_index(drop=bool, inplace=bool) method, as Connor said... You can set the drop argument to True to avoid the old index being kept as a column, and inplace to True to make the reset permanent.
Here is the official reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
In addition, you can also use the ".set_index(keys=list, inplace=bool)" method, like this:
new_index_list = list(range(0, len(df3)))
df3['new_index'] = new_index_list
df3.set_index(keys='new_index', inplace=True)
Official reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
Make sure your index does not have any duplicates. I simply did df.reset_index(drop=True, inplace=True) and I don't get the error anymore! If you want to keep the index, just set drop to False.
df = df.reset_index(drop=True) worked for me
I was trying to create a histogram using seaborn.
sns.histplot(data=df, x='Blood Chemistry 1', hue='Outcome', discrete=False, multiple='stack')
I got ValueError: cannot reindex from a duplicate axis. To solve it, I had to select only the rows where x has no missing values:
data = df[~df['Blood Chemistry 1'].isnull()]
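An equivalent filter using dropna (a minor variation, not from the answer itself):
data = df.dropna(subset=['Blood Chemistry 1'])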
