Pandas: KeyError encountered when renaming a column - python

I have the below code to rename a column
df.rename(columns = {'Long Name 1':'Court'}, inplace = True)
But encounter the below error
KeyError: "['Long Name 1'] not in index"
Not sure why there is an error. When I look at the columns in the df, the name exists:
print(df.columns)
Result:
Index(['Activity', 'Date', 'Hirer Category', 'No of Slots', 'Slot Status', 'Start Time', 'Court', 'Long Name 1'], dtype='object')
Why am I not able to rename column 'Long Name 1'?

I can't reproduce your problem. I checked with dummy values and did not find any error; your code works fine. You can see the screenshot.
Link to the screenshot, as I don't have enough reputation to embed it.
Hope this helps. Thank you.
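A common cause of this kind of KeyError is invisible whitespace in the column name (a trailing space or tab), which survives a visual inspection of df.columns. A minimal sketch of how to check for and fix this; the trailing space here is an assumption for illustration, not something confirmed by the question:

```python
import pandas as pd

# hypothetical frame whose column carries a hidden trailing space
df = pd.DataFrame(columns=['Activity', 'Long Name 1 '])

# repr() makes hidden whitespace visible: 'Long Name 1 ' vs 'Long Name 1'
print([repr(c) for c in df.columns])

# strip whitespace from every column name, then retry the rename
df.columns = df.columns.str.strip()
df = df.rename(columns={'Long Name 1': 'Court'})
print(df.columns.tolist())
```

Passing errors='raise' to rename is also a handy way to surface silent mismatches during debugging.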

Related

Spark Dataframe column name change does not reflect

I am trying to rename some special characters from my spark dataframe. For some weird reason, it shows the updated column name when I print the schema, but any attempt to access the data results in an error complaining about the old column name. Here is what I am trying:
# Original Schema
upsertDf.columns
# Output: ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']
for c in upsertDf.columns:
    upsertDf = upsertDf.withColumnRenamed(c, c.replace(" ", "_").replace("(","__").replace(")","__").replace("{","___").replace("}","___").replace(",","____").replace(";","_____").replace("=","_"))
upsertDf.columns
# Works and returns expected result
# Output: ['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']
# Print contents of dataframe
# Throws error for original attribute name "col 0"
upsertDf.show()
AnalysisException: 'Attribute name "col 0" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;'
I have tried other options to rename the column (using alias, etc.) and they all return the same error. It's almost as if the show operation is using a cached version of the schema, but I can't figure out how to force it to use the new names.
Has anyone run into this issue before?
Have a look at this minimal example (using your renaming code, ran in a pyspark shell version 3.3.1):
df = spark.createDataFrame(
    [("test", "test", "test", "test", "test", "test")],
    ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']
)
df.columns
['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", "_").replace("(","__").replace(")","__").replace("{","___").replace("}","___").replace(",","____").replace(";","_____").replace("=","_"))
df.columns
['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']
df.show()
+-----+---------+-----------+------+---------+----------+
|col_0|col___0__|col____0___|col__0|col_____0|col______0|
+-----+---------+-----------+------+---------+----------+
| test|     test|       test|  test|     test|      test|
+-----+---------+-----------+------+---------+----------+
As you see, this executes successfully. So your renaming functionality is OK.
Since you haven't shared all your code (how upsertDf is defined), we can't really know exactly what's going on. But looking at your error message, this comes from ParquetSchemaConverter.scala in a Spark version earlier than 3.2.0 (this error message changed in 3.2.0, see SPARK-34402).
Make sure that you read in your data and then immediately rename the columns, without doing any other operation.
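As a sketch of the "rename immediately after reading" advice, the replace chain can also be written as a single regex that substitutes every character the Parquet writer rejects. Note this maps each forbidden character to a single underscore, so the resulting names differ from the multi-underscore scheme above. The spark.read line is hypothetical, since the question never shows how upsertDf is created:

```python
import re

# characters rejected by the Parquet writer: " ,;{}()\n\t="
FORBIDDEN = r'[ ,;{}()\n\t=]'

def sanitize(name):
    # replace each forbidden character with a single underscore
    return re.sub(FORBIDDEN, '_', name)

# hypothetical usage, immediately after reading and before any other operation:
# upsertDf = spark.read.parquet("some/path")
# upsertDf = upsertDf.toDF(*[sanitize(c) for c in upsertDf.columns])

print(sanitize('col 0'), sanitize('col; 0'))
```

toDF(*names) replaces all column names in one pass, which avoids any question of intermediate states during the loop.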

Getting KeyError while merging dataframe even though columns are correct

I am trying to merge two dataframes but keep getting KeyError.
I checked the column names I used and they look fine. I even trimmed the column names so that there are no leading or trailing spaces. Still I get the same error.
I have no clue why is it failing.
Can someone help me with this? I went through lot of posts here and in other sites but none seems to fix my issue.
This is the merge statement:
roaster_ilc_mrg = (pd.merge(roaster,ilcdata,left_on="Emp ID", right_on="Emp Serial"))
Here is the cols from both df:
roaster Columns: Index(['Emp ID', 'XID', 'Name', 'Team', 'Location', 'Site', 'Status'], dtype='object')
ilcdata Columns: Index(['Activity Code', 'Billing Code', 'Emp Serial', 'Emp Lastname',
'Emp Manager', 'Weekending Date', 'Total hours'],
dtype='object')
Error I see:
File "C:\Abraham\Python\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1563, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'Emp Serial'
I also looked at ilcdata.head() and the data looks fine.
If I do the below check for any column label in the ilcdata dataframe, I get the default value, i.e. 'No col':
print(ilcdata.get('Emp Serial', default='No col'))
But all those labels are present there... It's driving me crazy because I have used a similar merge before and it worked smoothly.
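One explanation that fits these symptoms is a hidden character in the column name, e.g. a UTF-8 BOM ('\ufeff') picked up when the file was read, which .strip() does not remove and which is invisible when the Index is printed. A sketch under that assumption; the BOM and the sample values are hypothetical:

```python
import pandas as pd

# hypothetical: the first header cell picked up a BOM when the CSV was read
ilcdata = pd.DataFrame({'\ufeffEmp Serial': [101], 'Emp Lastname': ['Smith']})
roaster = pd.DataFrame({'Emp ID': [101], 'Name': ['Smith']})

# repr() shows the exact characters pandas sees in each name
print([repr(c) for c in ilcdata.columns])

# remove the BOM and any stray whitespace, then merge
ilcdata.columns = ilcdata.columns.str.replace('\ufeff', '', regex=False).str.strip()
roaster_ilc_mrg = pd.merge(roaster, ilcdata, left_on='Emp ID', right_on='Emp Serial')
print(roaster_ilc_mrg)
```

If a BOM is the culprit, reading the file with encoding='utf-8-sig' avoids the problem at the source.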

Python pandas function to concat into one row different values into one column based on repeating values in another

Apologies, I didn't even know how to title/describe the issue I am having, so bear with me. I have the following code:
import pandas as pd

data = {'Invoice Number': [1279581, 1279581, 1229422, 1229422, 1229422],
        'Project Key': [263736, 263736, 259661, 259661, 259661],
        'Project Type': ['Visibility', 'Culture', 'Spend', 'Visibility', 'Culture']}
df = pd.DataFrame(data)
How do I get the output to basically group the Invoice Numbers so that there is only 1 row per Invoice Number and combine the multiple Project Types (per that 1 Invoice) into 1 row?
Code and desired output are below.
Thanks, much appreciated.
import pandas as pd
data = {'Invoice Number': [1279581, 1229422],
        'Project Key': [263736, 259661],
        'Project Type': ['Visibility_Culture', 'Spend_Visibility_Culture']}
output = pd.DataFrame(data)
output
>>> (df
     .groupby(['Invoice Number', 'Project Key'])['Project Type']
     .apply(lambda x: '_'.join(x))
     .reset_index()
    )
Invoice Number Project Key Project Type
0 1229422 259661 Spend_Visibility_Culture
1 1279581 263736 Visibility_Culture
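Equivalently, the join can be passed to .agg, with as_index=False taking the place of reset_index. A small self-contained version of the same approach:

```python
import pandas as pd

df = pd.DataFrame({
    'Invoice Number': [1279581, 1279581, 1229422, 1229422, 1229422],
    'Project Key': [263736, 263736, 259661, 259661, 259661],
    'Project Type': ['Visibility', 'Culture', 'Spend', 'Visibility', 'Culture'],
})

# one row per invoice, project types joined with '_' in their original order
out = (df
       .groupby(['Invoice Number', 'Project Key'], as_index=False)['Project Type']
       .agg('_'.join))
print(out)
```

Note that groupby sorts the group keys ascending by default; pass sort=False to keep the invoices in their original order.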

Formatting two columns with dollar signs during to_csv using pandas

I have a csv merge that has many columns. I am having trouble formatting price columns. I need them to follow this format: $1,000.00. Is there a function I can use to achieve this for just two columns (Sales Price and Payment Amount)? Here is my code so far:
df3 = pd.merge(df1, df2, how='left', on=['Org ID', 'Org Name'])
cols = ['Org Name', 'Org Type', 'Chapter', 'Join Date', 'Effective Date', 'Expire Date',
        'Transaction Date', 'Product Name', 'Sales Price',
        'Invoice Code', 'Payment Amount', 'Add Date']
df3 = df3[cols]
df3 = df3.fillna("-")
out_csv = root_out + "report-merged.csv"
df3.to_csv(out_csv, index=False)
A solution that I thought was going to work, but I get an error (ValueError: Unknown format code 'f' for object of type 'str'):
df3['Sales Price'] = df3['Sales Price'].map('${:,.2f}'.format)
Based on your error ("Unknown format code 'f' for object of type 'str'"), the columns that you are trying to format are being treated as strings. So using .astype(float) in the code below addresses this.
There is not a great way to set this formatting during (within) your to_csv call. However, in an intermediate line you could use:
cols = ['Sales Price', 'Payment Amount']
df3.loc[:, cols] = df3[cols].astype(float).applymap('${:,.2f}'.format)
Then call to_csv.
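A self-contained sketch of the above. Two caveats: in pandas 2.1+ DataFrame.applymap is deprecated in favor of DataFrame.map, so this version formats each column with Series.map instead; and it assumes the two columns hold only numeric strings, so the "-" placeholders introduced by fillna would first need handling (e.g. pd.to_numeric(..., errors='coerce')). The sample values are made up:

```python
import pandas as pd

# hypothetical data standing in for the merged report
df3 = pd.DataFrame({'Sales Price': ['1000', '2500.5'],
                    'Payment Amount': ['99.9', '0']})

cols = ['Sales Price', 'Payment Amount']
for c in cols:
    # cast the string column to float, then render as $1,000.00
    df3[c] = df3[c].astype(float).map('${:,.2f}'.format)

print(df3)
# df3.to_csv('report-merged.csv', index=False)
```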

Match lists based on Name and DOB

This seems like it should be easy, but I can't seem to find what I'm looking for...I have two lists of people, FirstName, LastName, Date of Birth, and I just want to know which people are in both lists, and which ones are in one but not the other.
I've tried something like
common = pd.merge(list1, list2, how='left', left_on=['Last', 'First', 'DOB'], right_on=['Patient Last Name', 'Patient First Name', 'Date of Birth']).dropna()
Based on something else I found online, but it gives me this error:
KeyError: 'Date of Birth'
I've verified that that is indeed the column heading in the second list, so I don't get what's wrong. Has anyone done matching like this? What's the easiest/fastest way? The names may have different formatting between lists, like "Smith-Jones" vs. "SmithJones" vs. "Smith Jones", but I get around that by stripping all spaces and punctuation from the names... I assume that's a good first step?
Try this, it should work:
from io import StringIO

import pandas as pd

TESTDATA = StringIO("""DOB;First;Last
2016-07-26;John;smith
2016-07-27;Mathew;George
2016-07-28;Aryan;Singh
2016-07-29;Ella;Gayau
""")
list1 = pd.read_csv(TESTDATA, sep=";")
TESTDATA = StringIO("""Date of Birth;Patient First Name;Patient Last Name
2016-07-26;John;smith
2016-07-27;Mathew;XXX
2016-07-28;Aryan;Singh
2016-07-20;Ella;Gayau
""")
list2 = pd.read_csv(TESTDATA, sep=";")
print(list2)
print(list1)
common = pd.merge(list1, list2, how='left',
                  left_on=['Last', 'First', 'DOB'],
                  right_on=['Patient Last Name', 'Patient First Name', 'Date of Birth']).dropna()
print(common)
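To also answer the "which ones are in one list but not the other" part, an outer merge with indicator=True labels each row as 'both', 'left_only', or 'right_only'. A sketch on a small subset of the same test data:

```python
import pandas as pd

list1 = pd.DataFrame({'DOB': ['2016-07-26', '2016-07-27'],
                      'First': ['John', 'Mathew'],
                      'Last': ['smith', 'George']})
list2 = pd.DataFrame({'Date of Birth': ['2016-07-26', '2016-07-28'],
                      'Patient First Name': ['John', 'Aryan'],
                      'Patient Last Name': ['smith', 'Singh']})

# outer keeps unmatched rows from both sides; _merge says where each row came from
merged = pd.merge(list1, list2, how='outer',
                  left_on=['Last', 'First', 'DOB'],
                  right_on=['Patient Last Name', 'Patient First Name', 'Date of Birth'],
                  indicator=True)

in_both = merged[merged['_merge'] == 'both']
only_in_list1 = merged[merged['_merge'] == 'left_only']
only_in_list2 = merged[merged['_merge'] == 'right_only']
print(merged['_merge'].value_counts())
```

This gives all three answers from a single merge instead of separate left and right passes.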
