KeyError in Pandas sort_values - python

I am trying to sort a dataframe by a particular column: "Lat". Although "Lat" clearly shows up when I print out the column names, I get a KeyError when I try to use it as the "by" parameter in the sort_values function. It doesn't matter which column name I use; I get a KeyError no matter what.
I have tried using different columns, running it in place, and stripping the column names; nothing seems to work.
print(lights_df.columns.tolist())
lights_by_lat = lights_df.sort_values(axis = 'columns', by = "Lat", kind = "mergesort")
outputs:
['the_geom', 'OBJECTID', 'TYPE', 'Lat', 'Long']
KeyError: 'Lat'
^output from trying to sort

All you have to do is remove the axis argument:
lights_by_lat = lights_df.sort_values(by = "Lat", kind = "mergesort")
and you should be good.
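For what it's worth, the reason the axis argument breaks things: with axis = 'columns', sort_values sorts the columns by the values in the row labelled by, so pandas looks for "Lat" in the row index rather than in the columns, and every column name raises a KeyError. A minimal sketch with made-up sample values:
import pandas as pd
lights_df = pd.DataFrame({
    'the_geom': ['POINT (-87.6 41.8)'],
    'OBJECTID': [1],
    'TYPE': ['street light'],
    'Lat': [41.8],
    'Long': [-87.6],
})
# lights_df.sort_values(axis = 'columns', by = "Lat")   # KeyError: 'Lat' (looks in the index)
lights_by_lat = lights_df.sort_values(by = "Lat", kind = "mergesort")  # sorts rows by the Lat column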

Related

Apply function to two columns of a Pandas dataframe

I'm finding several answers to this question, but none that seem to address or solve the error that pops up when I apply them. Per e.g. this answer I have a dataframe df and a function my_func(string_1,string_2) and I'm attempting to create a new column with the following:
df['new_column'] = df.apply(lambda x: my_func(x['old_col_1'], x['old_col_2']), axis=1)
I'm getting an error originating inside my_func telling me that old_col_1 is type float and not a string as expected. In particular, the first line of my_func is old_col_1 = old_col_1.lower(), and the error is
AttributeError: 'float' object has no attribute 'lower'
By including debug statements using dataframe printouts I've verified old_col_1 and old_col_2 are indeed both strings. If I explicitly cast them to strings when passing as arguments, then my_func behaves as you would expect if it were being fed numeric data cast as strings, though the column values are decidedly not numeric.
Per this answer I've even explicitly ensured these columns are not being "intelligently" cast incorrectly when creating the dataframe:
df = pd.read_excel(file_name, sheetname,header=0,converters={'old_col_1':str,'old_col_2':str})
The function my_func works very well when it's called on its own. All this is making me suspect that the indices or some other numeric data from the dataframe is being passed, and not (exclusively) the column values.
Other implementations seem to give the same problem. For instance,
df['new_column'] = np.vectorize(my_func)(df['old_col_1'],df['old_col_2'])
produces the same error. Variations (e.g. using df['old_col_1'].to_numpy() or df['old_col_1'].values in place of df['old_col_1']) don't change this.
Is it possible that you have np.nan/None/null data in your columns? If so, you might be getting an error similar to the one caused by this data:
import numpy as np
import pandas as pd
data = {
    'Column1': ['1', '2', np.nan, '3']
}
df = pd.DataFrame(data)
# Raises AttributeError: 'float' object has no attribute 'lower'
df['Column1'] = df['Column1'].apply(lambda x: x.lower())
df
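If missing values do turn out to be the culprit, a couple of possible workarounds (a sketch, assuming NaN rows can either be skipped or replaced with empty strings) would be:
df['Column1'] = df['Column1'].apply(lambda x: x.lower() if isinstance(x, str) else x)  # leave NaN untouched
# or
df['Column1'] = df['Column1'].fillna('').apply(lambda x: x.lower())  # treat NaN as an empty string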

Renaming pandas columns gives not found in index error

I have a data frame called v whose columns are ['self', 'id', 'desc', 'name', 'arch', 'rel']. When I rename it as follows, it won't let me drop columns, giving a "not found in axis" error.
case1:
for i in range(0, len(v.columns)):
    # I'm trying to add 'v_' prefix to all col names
    v.columns.values[i] = 'v_' + v.columns.values[i]
v.drop('v_self', 1)
# leads to error
KeyError: "['v_self'] not found in axis"
But if I do it as follows then it works fine
case2:
v.columns = ['v_self','v_id','v_desc','v_name','v_arch','v_rel']
v.drop('v_self',1)
# no error
In both cases, if I do the following, it gives the same result for the columns:
v.columns
#both cases gives
Index(['v_self', 'v_id', 'v_description', 'v_name', 'v_archived',
'v_released'],
dtype='object')
I can't understand why case 1 gives an error. Please help, thanks.
That's because .values returns the underlying values. You're not supposed to change those directly. Assigning directly to .columns is supported though.
Try something like this:
import pandas
df = pandas.DataFrame(
    [
        {key: 0 for key in ["self", "id", "desc", "name", "arch", "rel"]}
        for _ in range(100)
    ]
)
# Add a v_ to every column
df.columns = [f"v_{column}" for column in df.columns]
# Drop one column
df = df.drop(columns=["v_self"])
To your "case 1":
You have hit a pandas bug (#38547): "Direct renaming of 1 column seems to be accepted, but only old name is working".
It means that after that "renaming", you may delete the first column not by using
v.drop('v_self', 1)
but by using the old name
v.drop('self', 1)
Of course, the better option is not to use such a buggy renaming in current versions of pandas.
To rename columns by adding a prefix to every label, there is a direct DataFrame method, .add_prefix(), for exactly this:
v = df.add_prefix("v_")
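With the prefixes added this way, the drop from the question goes through; a small sketch reusing the column names from the question (an empty frame, just to illustrate):
import pandas as pd
v = pd.DataFrame(columns=['self', 'id', 'desc', 'name', 'arch', 'rel'])
v = v.add_prefix('v_')
v = v.drop(columns=['v_self'])   # no KeyError this time
print(v.columns.tolist())        # ['v_id', 'v_desc', 'v_name', 'v_arch', 'v_rel']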

Pandas Series from two-columned DataFrame produces a Series of NaN's

state_codes = pd.read_csv('name-abbr.csv', header=None)
state_codes.columns = ['State', 'Code']
codes = state_codes['Code']
states = pd.Series(state_codes['State'], index=state_codes['Code'])
name-abbr.csv is a two-columned CSV file of US state names in the first column and postal codes in the second: "Alabama" and "AL" in the first row, "Alaska" and "AK" in the second, and so forth.
The above code correctly sets the index, but the Series is all NaN. If I don't set the index, the state names correctly show. But I want both.
I also tried this line:
states = pd.Series(state_codes.iloc[:,0], index=state_codes.iloc[:,1])
Same result. How do I get this to work?
The reason is called alignment: pandas tries to match the existing index of state_codes['State'] with the new index built from state_codes['Code'], and because they are different, you get missing values in the output. To prevent this, convert the Series to a numpy array:
states = pd.Series(state_codes['State'].to_numpy(), index=state_codes['Code'])
Or you can use DataFrame.set_index:
states = state_codes.set_index('Code')['State']
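A small self-contained illustration of that alignment behaviour, with two made-up rows standing in for name-abbr.csv:
import pandas as pd
state_codes = pd.DataFrame({'State': ['Alabama', 'Alaska'], 'Code': ['AL', 'AK']})
# The Series state_codes['State'] carries its own index (0, 1); reindexing it by
# 'AL', 'AK' finds no matching labels, so every value becomes NaN
broken = pd.Series(state_codes['State'], index=state_codes['Code'])
# Passing a plain array (or using set_index) bypasses the alignment
states = pd.Series(state_codes['State'].to_numpy(), index=state_codes['Code'])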

Removing duplicates appearing in two or more columns Python

This question has the same problem, but it didn't help: How To Solve KeyError: u"None of [Index([..], dtype='object')] are in the [columns]"
First try:
df = pd.read_csv('ABCD.csv', index_col=['A'])
df=df.drop_duplicates(['A'],['B'])
KeyError: Index(['Sample_ID'], dtype='object')
Here I found out that it's impossible to remove the index itself, so I removed it from the top:
df = pd.read_csv('ABCD.csv')
df=df.drop_duplicates(['A'],['B'],keep = 'first')
TypeError: drop_duplicates() got multiple values for argument 'keep'
When I print type(df) it shows "DataFrame"; what could be the problem?
I thought that would be
df=df.drop_duplicates(['A', 'B'],keep = 'first')
instead of:
df=df.drop_duplicates(['A'],['B'],keep = 'first')
The subset must be a single list of columns, not separated into multiple arguments:
subset : column label or sequence of labels, optional (from the docs)
PS: You should use df.drop_duplicates(['A', 'B'], keep='first', inplace=True); you don't need to assign back to df when adding inplace.
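A minimal sketch with made-up data, showing the subset passed as one list:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y'], 'C': [10, 20, 30]})
# Rows 0 and 1 share the same A and B, so only the first of the pair is kept
df = df.drop_duplicates(['A', 'B'], keep='first')
print(df)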

How to specify my grouping with a lambda function and keep other columns I need in the dataframe?

I have grouped a dataframe. Then I want to use a lambda function throughout those groups. However, the code I am using is not returning the proper results. I'm thinking there is a problem with the grouping or perhaps the lambda function. Very specifically, I'm looking to take the grouped dataframe and only return the rows with the highest JUGCODE count. The n for how many of the highest rows to return depends on the SPOT_NUM for that group. So, every group should return a different number of rows.
I've tried changing the grouping in my original dataframe. Then I tried specifying the grouping I want in the line with the lambda function. I've also researched resetting the index. I'm not sure if that had anything to do with it. But I am still not getting the results I'm looking for.
#subsetting the logs
filtered_logs = log_master.loc[(log_master['BILL_TYPE'] == "Town")]
#grouping the logs by everything except the JUGCODE - because that's the count I need
town_grouped = filtered_logs.groupby(['SPOT_ID','TIME', 'R_CODE', 'SPOT_NUM', 'CONTRACT_ID', 'NAME', 'CHNN', 'ZONE', 'MONTH', 'BILL_TYPE', 'MASTER', 'DATE', ]).count()
town_grouped1 = town_grouped.reset_index()
town_grouped1.head(10)
That's how I grouped the dataframe that I'm going to use in my lambda function.
verified_spots = town_grouped1.groupby('SPOT_ID').apply(lambda grp: grp.nlargest((town_grouped1['SPOT_NUM'].iloc[0]), ['JUGCODE']))
I'm finding that some groups are returning more rows than the SPOT_NUM for its group. That's the problem I'm trying to solve.
sample data code:
data = [['363662402','17:24:29','1061',3,'647333','BL08253061','CMDY','Savannah','201610','tampa','COMIC','10/01/2016',30], ['363662402', '17:31:15','1061',3,'647333','BL08253062','CMDY','Savannah','201610','tampa','COMIC', '10/03/2016',30], ['363662402','17:34:15','1061',3,'647333','BL08253061','CMDY','Savannah','201610','tampa','COMIC','10/02/2016',29], ['363662402','17:34:30','1061',3,'647333','BL08253061','CMDY','Savannah','201610','tampa','COMIC','10/02/2016',1], ['363662403','16:26:14','1061',3,'647333','BL08258415','CMDY','Savannah','201610','tampa','COMIC','10/09/2016',30], ['363662394','20:39:12','1061',4,'647333','BL08253061','CMDY','Savannah','201610','tampa','COMIC','10/02/2016',30], ['363662394','22:48:01','1061',4,'647333','BL08253061','CMDY','Savannah','201610','tampa','COMIC','10/02/2016',30], ['363662394','22:40:21','1061',4,'647333','BL08253061','CMDY','Savannah','201610','tampa','COMIC','10/01/2016',29], ['363662394','19:44:51','1061',4,'647333','BL08253061','CMDY','Savannah','201610','tampa','COMIC','10/01/2016',23]]
town_grouped1 = pd.DataFrame(data, columns = ['SPOT_ID','TIME', 'R_CODE', 'SPOT_NUM', 'CONTRACT_ID', 'NAME', 'CHNN', 'ZONE', 'MONTH', 'BILL_TYPE', 'MASTER', 'DATE', 'JUGCODE'])
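One thing that stands out (an assumption, not a confirmed fix): the lambda reads town_grouped1['SPOT_NUM'].iloc[0] from the first row of the whole frame, so every group uses the same n instead of its own SPOT_NUM. A sketch of reading it from the group instead, assuming SPOT_NUM is constant within each SPOT_ID group:
verified_spots = town_grouped1.groupby('SPOT_ID').apply(
    lambda grp: grp.nlargest(int(grp['SPOT_NUM'].iloc[0]), 'JUGCODE')
)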
