Handling unicode names in DataFrame

I want to convert all my data in a DataFrame to uppercase. When I start conversion from column names I get this error:
Code:
xl = pd.ExcelFile(target_processed_directory + filename)
# check sheet names
print(xl.sheet_names[0])
# sheet to pandas dataframe
df = xl.parse(xl.sheet_names[0])
# make whole dataframe uppercase
df.columns = map(str.upper, df.columns)
Error :
TypeError: descriptor 'upper' requires a 'str' object but received a 'unicode'

When using Pandas you'll want to avoid for loops in Python, and you'll usually want to avoid map() as well. Those are the slow ways to do things, and if you want to build good habits, you'll avoid them whenever you can.
There are fast vectorized string operations available for Pandas string sequences. In this case, you want:
df.columns = df.columns.str.upper()
Docs: http://pandas.pydata.org/pandas-docs/stable/text.html
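Since the goal in the question is to uppercase the whole DataFrame and not just the column names, here is a minimal sketch of how that might look with the vectorized string methods. The sample data and column names are made up for illustration, and the code assumes Python 3 with a reasonably recent pandas.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sample data standing in for the parsed Excel sheet.
df = pd.DataFrame({"name": ["alice", "bob"], "city": ["oslo", "rome"], "age": [30, 41]})

# Vectorized uppercase of the column labels.
df.columns = df.columns.str.upper()

# Uppercase the values of every non-numeric (text) column; numbers are left alone.
for col in df.columns:
    if not is_numeric_dtype(df[col]):
        df[col] = df[col].str.upper()

print(df)
```

Checking the dtype first avoids calling .str on a numeric column, which would raise an AttributeError.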

Try using list comprehension instead of mapping str.upper.
df.columns = [c.upper() for c in df.columns]
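In Python 2.7, str and unicode are distinct types. str.upper belongs to the str type, so mapping it over unicode column names raises the TypeError, even though unicode has its own method with the same name. Calling c.upper() on each element uses the method of whichever type that element actually has.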

Related

Replacing dot with comma from a dataframe using Python

I have a dataframe for example df :
I'm trying to replace the dot with a comma to be able to do calculations in excel.
I used :
df = df.stack().str.replace('.', ',').unstack()
or
df = df.apply(lambda x: x.str.replace('.', ','))
Results :
Nothing changes, but I receive this warning at the end of the execution, with no errors:
FutureWarning: The default value of regex will change from True to
False in a future version. In addition, single character regular
expressions will *not* be treated as literal strings when regex=True.
View of what I have :
Expected Results :
Updated question with more information, thanks to @Pythonista anonymous:
print(df.dtypes)
returns :
Date         object
Open         object
High         object
Low          object
Close        object
Adj Close    object
Volume       object
dtype: object
I'm extracting data with the to_excel method:
df.to_excel()
I'm not exporting the dataframe in a .csv file but an .xlsx file

Where does the dataframe come from - how was it generated? Was it imported from a CSV file?
Your code works if you apply it to columns which are strings, as long as you remember to do
df = df.apply() and not just df.apply() , e.g.:
import pandas as pd
df = pd.DataFrame()
df['a'] = ['some . text', 'some . other . text']
# regex=False treats '.' as a literal dot and avoids the FutureWarning
df = df.apply(lambda x: x.str.replace('.', ',', regex=False))
print(df)
However, you are trying to do this with numbers, not strings.
To be precise, the other question is: what are the dtypes of your dataframe?
If you type
df.dtypes
what's the output?
I presume your columns are numeric and not strings, right? After all, if they are numbers they should be stored as such in your dataframe.
The next question: how are you exporting this table to Excel?
If you are saving a csv file, pandas' to_csv() method has a decimal argument which lets you specify the decimal separator (typically a dot in the English-speaking world and a comma in many countries in continental Europe). Look up the syntax.
If you are using the to_excel() method, it shouldn't matter because Excel should treat it internally as a number, and how it displays it (whether with a dot or comma for decimal separator) will typically depend on the options set in your computer.
Please clarify how you are exporting the data and what happens when you open it in Excel: does Excel treat it as a string? Or as a number, but you would like to see a different separator for the decimals?
Also look here for how to change decimal separators in Excel: https://www.officetooltips.com/excel_2016/tips/change_the_decimal_point_to_a_comma_or_vice_versa.html
UPDATE
OP, you have still not explained where the dataframe comes from. Do you import it from an external source? Do you create or calculate it yourself?
The fact that the columns are objects makes me think they are either stored as strings, or maybe some rows are numeric and some are not.
What happens if you try to convert a column to float?
df['Open'] = df['Open'].astype('float64')
If the entire column should be numeric but it's not, then start by cleansing your data.
Second question: what happens when you use Excel to open the file you have just created? Excel displays a comma, but which character Excel uses to separate decimals depends on the Windows/Mac/Excel settings, not on how pandas created the file. Have you tried the link I gave above? Can you change how Excel displays decimals? Also, does Excel treat those numbers as numbers or as strings?
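As a rough sketch of the cleansing step and of the decimal argument mentioned for CSV export: the example below uses made-up values in the 'Open' and 'Close' columns, and the ';' field separator is an assumption chosen so that commas can serve as decimal marks. pd.to_numeric with errors="coerce" turns anything unparseable into NaN, which makes bad rows easy to find.

```python
import pandas as pd

# Hypothetical object-dtype data: numbers stored as strings, one bad value.
df = pd.DataFrame({"Open": ["1.5", "2.25"], "Close": ["1.75", "oops"]})

# Convert every column to numbers; unparseable entries become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Rows that need cleansing before the data can be treated as numeric.
bad_rows = numeric[numeric.isna().any(axis=1)]
print(bad_rows)

# Dropping the bad rows is just one possible cleansing choice.
clean = numeric.dropna()

# For CSV export only: write floats with a comma as the decimal separator.
csv_text = clean.to_csv(index=False, sep=";", decimal=",")
print(csv_text)
```

With to_excel() none of the separator handling is needed once the columns are truly numeric, because Excel formats the numbers according to its own regional settings.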

pySpark list to dataframe

My code below creates a dataframe from lists of columns taken from other dataframes. I'm getting an error when I convert a set to a list with list(matchedList). How can I handle that set in order to add those columns to my dataframe?
The error is produced by + list(matchedList):
#extract columns that need to be conform
datasetMatched = dataset.select(selectedColumns +list(matchedList))
#display(datasetMatched)
TypeError: 'list' object is not callable

It probably happens due to shadowing the builtin list function. Make sure you didn't define any variable named list in your code.
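A minimal sketch that reproduces this kind of error in plain Python, without Spark. The function names and the sample column names are hypothetical; the point is only that assigning to the name list hides the builtin.

```python
def broken_select(selected_columns, matched):
    # Hypothetical earlier assignment that shadows the builtin list().
    list = ["some", "values"]
    return selected_columns + list(matched)  # raises TypeError


def fixed_select(selected_columns, matched):
    # No variable named "list" here, so list() is the builtin again.
    return selected_columns + sorted(list(matched))


try:
    broken_select(["id"], {"price", "qty"})
except TypeError as exc:
    message = str(exc)
    print(message)

columns = fixed_select(["id"], {"price", "qty"})
print(columns)
```

In a notebook, the shadowing assignment may live in a cell that ran much earlier; running del list in that session (or restarting it) makes the builtin reachable again.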

'function' object has no attribute 'str' in pandas

I am using below code to read and split the csv file strings separated by /
DATA IS
SRC_PATH TGT_PATH
/users/sn/Retail /users/am/am
/users/sn/Retail Reports/abc /users/am/am
/users/sn/Automation /users/am/am
/users/sn/Nidh /users/am/xzy
import pandas as pd
df = pd.read_csv(r'E:\RCTemplate.csv', index_col=None, header=0)
s1 = df.SRC_PATH.str.split('/', expand=True)
I get the correct split data in s1, but when I do a similar operation on a single row it throws the error "'function' object has no attribute 'str'".
The error is thrown by the code below:
df2= [(df.SRC_PATH.iloc[0])]
df4=pd.DataFrame([(df.SRC_PATH.iloc[0])],columns = ['first'])
newvar = df4.first.str.split('/', expand=True)

Pandas thinks you are trying to access the method dataframe.first().
This is why it's best practice to use square brackets to access dataframe columns rather than .column attribute access:
df4['first'].str.split() instead of df4.first.str.split()
Note that attribute access also causes other common problems, for example a column called 'name' clashing with the object's name attribute.
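A short sketch of the bracket-access fix, using the first path from the question. The second DataFrame, with a column called 'count', is a hypothetical extra included only to show the same clash with another method name.

```python
import pandas as pd

# Single-row frame built like df4 in the question.
df4 = pd.DataFrame(["/users/sn/Retail"], columns=["first"])

# Bracket access always returns the column, whatever it is called.
parts = df4["first"].str.split("/", expand=True)
print(parts)

# Hypothetical second example: a column named like an existing method.
counts = pd.DataFrame({"count": [1, 2, 3]})
print(callable(counts.count))    # the attribute is the DataFrame.count method
print(counts["count"].tolist())  # the brackets give the actual column
```

The leading '/' in the path produces an empty string as the first split element.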

Python is a strongly typed language, so does dtype=float when creating a DataFrame apply only to the values that make sense as float?

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print(df)
When creating the DataFrame, we set the data type to float but do not name which columns to convert. We leave it to Python to decide, based on the values, whether changing a column's data type makes sense.
Just wondering: with this practice, is Python still strictly strongly typed?
Thanks!

Python is a strongly, dynamically typed language (see Is Python strongly typed?).
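If you pass dtype=float explicitly when constructing a pd.DataFrame, Pandas will try to coerce all columns to float. In your case, the 'Name' column cannot be coerced to float, so it will remain as object (you can think of this as str in pure Python). This describes the pandas versions current when this was written; newer releases are stricter and may raise an error instead of silently falling back to object.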
In general, you can leave dtype blank to have Pandas infer datatypes for DataFrames. Then do a df.info() to see if dtypes make sense to you. The ones you don't approve, you can change them like so:
df['Age'] = df['Age'].astype(int)
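A minimal sketch of that workflow with the data from the question: let pandas infer the dtypes first, inspect them, then convert only the column that should change. Converting 'Age' to float here (instead of int) is just to mirror the dtype=float intent of the question.

```python
import pandas as pd

data = [["Alex", 10], ["Bob", 12], ["Clarke", 13]]

# No dtype argument: pandas infers a dtype per column.
df = pd.DataFrame(data, columns=["Name", "Age"])
print(df.dtypes)  # 'Age' is inferred as an integer dtype, 'Name' as text

# Convert only the column that should be float.
df["Age"] = df["Age"].astype(float)
print(df.dtypes)
```

Naming the column explicitly keeps the conversion visible in the code, and a column that cannot be converted fails loudly at that line.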

Pandas falsely converting strings to floats

I'm using a csv file from Excel to create a pandas data frame. Recently, I've encountered several ValueError messages regarding the dtypes of each column in the dataframe.
This is the most recent exception raised:
ValueError: could not convert string to float: 'OH'
After running pandas' dtypes method on my data frame, it shows that this particular column addr_state is an object, not a float.
I've pasted all my code below for clarification:
import pandas as pd
from sklearn.preprocessing import StandardScaler
work_path = 'C:\\Users\\Projects\\loans.csv'
unfiltered_y_df = pd.read_csv(work_path, low_memory=False, encoding='latin-1')
print(unfiltered_y_df.dtypes)
filtered_y_df = unfiltered_y_df.loc[unfiltered_y_df['loan_status'].isin(['Fully Paid', 'Charged Off', 'Default'])]
X = StandardScaler().fit_transform(filtered_y_df[[column for column in filtered_y_df]])
Y = filtered_y_df['loan_status']
Also, is it possible to explicitly write out the dtypes for each column? Right now I feel like that's the only way to solve this. Thanks in advance!

So two issues here I think:
To print out the types for each column, use the dtypes attribute (the older ftypes attribute has been removed from recent pandas versions):
i.e.
unfiltered_y_df.dtypes
You say 'addr_state' is an object, not a float. Well, that is the problem: StandardScaler() only works on numeric data, so it tries to coerce your state 'OH' to a float and can't, hence the error.
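On the question of writing out dtypes explicitly: read_csv accepts a dtype argument that maps column names to types. Below is a small sketch with an in-memory CSV; the columns loan_amnt and int_rate are invented for illustration, while addr_state and loan_status come from the question. It also shows one way to keep only the numeric columns before scaling, assuming the text columns are not meant to be features.

```python
import io

import pandas as pd

# Hypothetical miniature version of the loans file.
csv_text = (
    "loan_amnt,int_rate,addr_state,loan_status\n"
    "1000,7.5,OH,Fully Paid\n"
    "2500,11.0,CA,Charged Off\n"
)

# Explicit dtype for a chosen column; the others are still inferred.
df = pd.read_csv(io.StringIO(csv_text), dtype={"loan_amnt": "float64"})
print(df.dtypes)

# Keep only numeric columns; these are the ones a scaler can handle.
numeric_df = df.select_dtypes(include="number")
print(list(numeric_df.columns))
```

numeric_df could then be passed to StandardScaler().fit_transform() in place of the full frame; text columns such as addr_state would need to be encoded separately if they should be used as features.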
