Rounding inconsistencies when changing from currency to float in pandas - python

I am trying to convert a column (price), currently stored as an object dtype, so I can use it in a groupby. Changing this column to float sometimes keeps the same number of decimal places as the original, sometimes rounds to one decimal place, and sometimes drops the decimal places altogether. I would like the float values to match the original exactly, because an eventual reconciliation needs to be accurate to the decimal place. I have tried changing the column type with astype and also with pd.to_numeric. Ideally, price_3 and price_4 for apples should be 93927.83.
Any help would be greatly appreciated.
import pandas as pd
d = {'product': ['apples', 'pears', 'grapes', 'oranges'],
'price': ['$93,927.83' , '$9,868.23', '$110,838.10', '$10,093.88']}
df = pd.DataFrame(data=d)
df['price_2'] = df['price'].str.replace('$', '').str.replace(',', '').str.replace('(', '').str.replace(')', '')
df['price_3'] = df['price_2'].astype(float)
df['price_4'] = pd.to_numeric(df['price_2'])

The conversion itself is fine; what varies is how pandas displays the floats. You likely want to set the display precision:
import pandas
pandas.set_option('display.precision', 2)
df = pandas.DataFrame({
'product': ['apples', 'pears', 'grapes', 'oranges'],
'price': ['$93,927.83' , '$9,868.23', '$110,838.10', '$10,093.88']
})
df['price_2'] = df['price'].str.replace(r'[$,()]', '', regex=True)  # strip $, thousands separators and parentheses
df['price_3'] = df['price_2'].astype(float)
df['price_4'] = pandas.to_numeric(df['price_2'])
print(df['price_4'])
Giving you:
0 93927.83
1 9868.23
2 110838.10
3 10093.88
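Note that display.precision only changes how the numbers are printed; the underlying float64 values are untouched, so reconciliation sums still come out right. A quick check (a sketch, assuming the df built above):
print(f"{df['price_4'].sum():.2f}")  # 224728.04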

Try changing df['price_3'] = df['price_2'].astype(float) to df['price_3'] = df['price_2'].apply(Decimal) (after from decimal import Decimal). Decimal keeps the values exact, at the cost of the column becoming object dtype.
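A minimal sketch of that approach, building on the df from the question:
from decimal import Decimal

df['price_3'] = df['price_2'].apply(Decimal)
print(df['price_3'])  # each value keeps its original two decimal places (object dtype)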

Related

How to rename columns of a nested dataframe?

I have a Python function that cleans up my dataframe's column names (it replaces whitespace with _ and prepends _ if a column name begins with a number). These dataframes started out as JSONs that were converted to dataframes to make them easier to work with.
import re

def prepare_json(df):
    # prepend '_' to column names that start with a digit
    df = df.rename(lambda x: '_' + x if re.match(r'([0-9])\w+', x) else x, axis=1)
    # replace spaces in column names with '_'
    df = df.rename(lambda x: x.replace(' ', '_'), axis=1)
    return df
This works for simple JSONs like the following:
{"123asd":"test","test json":"test"}
Output:
{"_123asd":"test","test_json":"test"}
However, when I try it with a more complex dataframe it no longer works.
Here is an example:
{"SETDET":[{"SETPRTY":[{"DEAG":{"95R":[{"Data Source Scheme":"SCOM","Proprietary Code":"CH123456"}]}},{"SAFE":{"97A":[{"Account Number":"123456789"}]},"SELL":{"95P":[{"Identifier Code Location Code":"AB","Identifier Code Country Code":"AB","Identifier Code Logical Terminal":"XXX","Identifier Code Bank Code":"ABCD"}]}},{"PSET":{"95P":[{"Identifier Code Location Code":"ZZ","Identifier Code Country Code":"CH","Identifier Code Logical Terminal":"","Identifier Code Bank Code":"INSE"}]}}],"SETR":{"22F":[{"Data Source Scheme":"","Indicator":"TRAD"}]}}],"TRADDET":[{"Other":{"35B":[{"Identification of Security":"CH0012138530","Description of Security":"CREDIT SUISSE GROUP"}]},"SETT":{"98A":[{"Date":"20181127"}]},"TRAD":{"98A":[{"Date":"20181123"}]}}],"FIAC":[{"SAFE":{"97A":[{"Account Number":"0123-1234567-05-001"}]},"SETT":{"36B":[{"Quantity":"10,","Quantity Type Code":"UNIT"}]}}],"GENL":[{"SEME":{"20C":[{"Reference":"1234567890123456"}]},"Other":{"23G":[{"Subfunction":"","Function":"NEWM"}]},"PREP":{"98C":[{"Date":"20181123","Time":"165256"}]}}]}
Trying it out with this, I get the following error when writing the dataframe to BigQuery:
Invalid field name "97A". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 300 characters long. with loading dataframe
Maybe my solution helps you:
1. Convert the dictionary to a string.
2. Find all keys of the dictionary with a regex.
3. Replace spaces in the keys with _ and prepend _ to keys that start with a digit.
4. Convert the string back to a dictionary with ast.literal_eval(dict_string).
Try this:
import re
import ast
from copy import deepcopy

def my_replace(match):
    # insert '_' before the leading digit of a key, e.g. {'9 -> {'_9
    return match.group()[0] + match.group()[1] + "_" + match.group()[2]
dct = {"SETDET":[{"SETPRTY":[{"DEAG":{"95R":[{"Data Source Scheme":"SCOM","Proprietary Code":"CH123456"}]}},{"SAFE":{"97A":[{"Account Number":"123456789"}]},"SELL":{"95P":[{"Identifier Code Location Code":"AB","Identifier Code Country Code":"AB","Identifier Code Logical Terminal":"XXX","Identifier Code Bank Code":"ABCD"}]}},{"PSET":{"95P":[{"Identifier Code Location Code":"ZZ","Identifier Code Country Code":"CH","Identifier Code Logical Terminal":"","Identifier Code Bank Code":"INSE"}]}}],"SETR":{"22F":[{"Data Source Scheme":"","Indicator":"TRAD"}]}}],"TRADDET":[{"Other":{"35B":[{"Identification of Security":"CH0012138530","Description of Security":"CREDIT SUISSE GROUP"}]},"SETT":{"98A":[{"Date":"20181127"}]},"TRAD":{"98A":[{"Date":"20181123"}]}}],"FIAC":[{"SAFE":{"97A":[{"Account Number":"0123-1234567-05-001"}]},"SETT":{"36B":[{"Quantity":"10,","Quantity Type Code":"UNIT"}]}}],"GENL":[{"SEME":{"20C":[{"Reference":"1234567890123456"}]},"Other":{"23G":[{"Subfunction":"","Function":"NEWM"}]},"PREP":{"98C":[{"Date":"20181123","Time":"165256"}]}}]}
keys = re.findall(r"{\'.*?\': | \'.*?\': ", str(dct))  # every key, with its surrounding quotes
keys_bfr_chng = deepcopy(keys)
keys = [re.sub(r"\s+(?=\w)", '_', key) for key in keys]  # spaces inside keys -> '_'
keys = [re.sub(r"{\'\d", my_replace, key) for key in keys]  # prepend '_' to keys starting with a digit
dct = str(dct)
for i in range(len(keys)):
    dct = dct.replace(keys_bfr_chng[i], keys[i])
dct = ast.literal_eval(dct)  # back to a dictionary
print(dct)
type(dct)
Output:
{'SETDET': [{'SETPRTY': [{'DEAG': {'_95R': [{'Data_Source_Scheme': 'SCOM', 'Proprietary_Code': 'CH123456'}]}}, {'SAFE': {'_97A': [{'Account_Number': '123456789'}]}, 'SELL': {'_95P': [{'Identifier_Code_Location_Code': 'AB', 'Identifier_Code_Country_Code': 'AB', 'Identifier_Code_Logical_Terminal': 'XXX', 'Identifier_Code_Bank_Code': 'ABCD'}]}}, {'PSET': {'_95P': [{'Identifier_Code_Location_Code': 'ZZ', 'Identifier_Code_Country_Code': 'CH', 'Identifier_Code_Logical_Terminal': '', 'Identifier_Code_Bank_Code': 'INSE'}]}}], 'SETR': {'_22F': [{'Data_Source_Scheme': '', 'Indicator': 'TRAD'}]}}], 'TRADDET': [{'Other': {'_35B': [{'Identification_of_Security': 'CH0012138530', 'Description_of_Security': 'CREDIT SUISSE GROUP'}]}, 'SETT': {'_98A': [{'Date': '20181127'}]}, 'TRAD': {'_98A': [{'Date': '20181123'}]}}], 'FIAC': [{'SAFE': {'_97A': [{'Account_Number': '0123-1234567-05-001'}]}, 'SETT': {'_36B': [{'Quantity': '10,', 'Quantity_Type_Code': 'UNIT'}]}}], 'GENL': [{'SEME': {'_20C': [{'Reference': '1234567890123456'}]}, 'Other': {'_23G': [{'Subfunction': '', 'Function': 'NEWM'}]}, 'PREP': {'_98C': [{'Date': '20181123', 'Time': '165256'}]}}]}
dict
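As an alternative to the string round-trip, here is a sketch of a recursive helper (the name sanitize_keys and the variable raw_dct are hypothetical, not from the original answer) that applies the same renaming rules directly to the nested structure:
import re

def sanitize_keys(obj):
    # Walk dicts and lists, renaming every dict key:
    # prepend '_' to keys starting with a digit and replace spaces with '_'.
    if isinstance(obj, dict):
        return {('_' + k if re.match(r'\d', k) else k).replace(' ', '_'): sanitize_keys(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize_keys(item) for item in obj]
    return obj

clean = sanitize_keys(raw_dct)  # raw_dct: the original, unmodified nested dict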

How to make a column of categorised group in pandas

Given a column of “food” (apple, banana, carrot, donuts, egg, ...), I want to make a “category” column whose values correspond to each item in the “food” column.
For example, given the information below:
import pandas as pd
fruit =['apple', 'banana', 'orange']
veg =['carrot', 'onion']
meat=['chicken', 'pork', 'beef']
food = fruit + veg + meat
df = pd.DataFrame(food, columns=['food'])
df
When I write the code like this:
df[df['food']=='apple']['category']='fruit'
df[df['food']=='carrot']['category']='vegetable'
However, a SettingWithCopyWarning occurs when I write it this way.
What would be the best way to set this value?
That SettingWithCopyWarning comes from chained indexing. You can resolve it in a few different ways:
# Use loc
df['category'] = None # Initialize an empty column
df.loc[df['food']=='apple', 'category'] = 'fruit'
df.loc[df['food']=='carrot', 'category'] = 'vegetable'
# Use map (foods missing from the dict become NaN)
df['category'] = df['food'].map({
    'apple': 'fruit',
    'carrot': 'vegetable'
})
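Since the fruit, veg and meat lists already exist, a small sketch (assuming the df and lists from the question) that derives the full mapping from them so nothing ends up NaN:
category_map = {}
for category, items in [('fruit', fruit), ('vegetable', veg), ('meat', meat)]:
    category_map.update({item: category for item in items})
df['category'] = df['food'].map(category_map)
print(df)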

How do I split a names column in a pandas data frame if only some of the names have middle names?

I am working with a pandas data frame of names, and there are a few different formats. Some are 'first' 'last', others are 'first' 'middle' 'last', and others are 'first initial' 'second initial' 'last'. I would like to split these into three columns. I am currently trying to use the split function, but I am getting "ValueError: Columns must be same length as key" because some names split into two columns and others split into three. How can I get around this?
df = {'name': ['bradley efron', 'c arden pope', 'a l smith']}
mak_df[['First', 'Middle', 'Last']] = mak_df.Author_Name.str.split(" ", expand = True)
Here is a workaround:
import pandas as pd

list_of_names = ['bradley efron', 'c arden pope', 'a l smith']
new_list = []
for name in list_of_names:
    new_list.append(name.split(" "))
print(new_list)
for name in new_list:
    if len(name) == 2:
        name.insert(1, " ")  # pad two-part names with a blank middle name
print(new_list)
df = pd.DataFrame.from_records(new_list).T
df.index = ["first name", "middle name", "last name"]
df = df.T
print(df)
Output:
There's probably a better way to go about this, but here's what I've got:
import pandas as pd
import numpy as np

df = {'name': ['bradley efron', 'c arden pope', 'a l smith']}
df = pd.DataFrame(df)
df = df['name'].str.split(' ', expand=True)
df.columns = ['first', 'middle', 'last']
# two-part names end up with the surname in 'middle', so shift it across
df['last'] = np.where(df['last'].isnull(), df['middle'], df['last'])
df['middle'] = np.where(df['middle'] == df['last'], '', df['middle'])
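Another option (not from the original answers) is a single regex with an optional middle group via str.extract; the column names here are just examples:
import pandas as pd

df = pd.DataFrame({'name': ['bradley efron', 'c arden pope', 'a l smith']})
# first word, an optional middle word, last word
df[['first', 'middle', 'last']] = df['name'].str.extract(r'^(\S+)\s+(?:(\S+)\s+)?(\S+)$')
print(df)
Names without a middle part get NaN in the middle column, which can be turned into an empty string with fillna('').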

How do I use mapping of dictionary for value correction?

I have a pandas series whose unique values are something like:
['toyota', 'toyouta', 'vokswagen', 'volkswagen', 'vw', 'volvo']
Now I want to fix some of these values like:
toyouta -> toyota
(Note that not all values have mistakes; volvo, toyota etc. are already correct.)
I've tried making a dictionary where the key is the correct word and the value is the word to be corrected, and then mapping that onto my series.
This is how my code looks:
corrections = {'maxda': 'mazda', 'porcshce': 'porsche', 'toyota': 'toyouta', 'vokswagen': 'vw', 'volkswagen': 'vw'}
df.brands = df.brands.map(corrections)
print(df.brands.unique())
>>> [nan, 'mazda', 'porsche', 'toyouta', 'vw']
As you can see, the problem is that this way, all values not present in the dictionary are automatically converted to NaN. One solution is to map all the correct values to themselves, but I was hoping there is a better way to go about this.
Use:
df.brands = df.brands.map(corrections).fillna(df.brands)
Or:
df.brands = df.brands.map(lambda x: corrections.get(x, x))
Or:
df.brands = df.brands.replace(corrections)
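All three leave brands that are not in the dictionary untouched. A quick check of the replace variant (a sketch, assuming corrections is keyed by the misspelled value, e.g. 'toyouta': 'toyota'):
import pandas as pd

df = pd.DataFrame({'brands': ['toyota', 'toyouta', 'vokswagen', 'volkswagen', 'vw', 'volvo']})
corrections = {'toyouta': 'toyota', 'vokswagen': 'volkswagen'}
df.brands = df.brands.replace(corrections)
print(df.brands.unique())  # ['toyota' 'volkswagen' 'vw' 'volvo']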

Pandas DataFrame from Dictionary with Lists

I have an API that returns a single row of data as a Python dictionary. Most of the keys have a single value, but some of the keys have values that are lists (or even lists-of-lists or lists-of-dictionaries).
When I throw the dictionary into pd.DataFrame to try to convert it to a pandas DataFrame, it throws an "Arrays must be the same length" error. This is because it cannot process the keys that have multiple values (i.e. the keys whose values are lists).
How do I get pandas to treat the lists as 'single values'?
As a hypothetical example:
data = { 'building': 'White House', 'DC?': True,
'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia'] }
I want to turn it into a DataFrame like this:
ix building DC? occupants
0 'White House' True ['Barack', 'Michelle', 'Sasha', 'Malia']
Passing the dict directly broadcasts the scalars across the occupants list; it works the way you want if you pass a list (of rows):
In [11]: pd.DataFrame(data)
Out[11]:
DC? building occupants
0 True White House Barack
1 True White House Michelle
2 True White House Sasha
3 True White House Malia
In [12]: pd.DataFrame([data])
Out[12]:
DC? building occupants
0 True White House [Barack, Michelle, Sasha, Malia]
This turns out to be very trivial in the end:
import pandas

data = {'building': 'White House', 'DC?': True, 'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']}
df = pandas.DataFrame([data])
print(df)
Which results in:
DC? building occupants
0 True White House [Barack, Michelle, Sasha, Malia]
A solution for making a dataframe from a dictionary of lists where the keys become a sorted index and the column names are provided. Good for creating dataframes from scraped HTML tables.
d = {'B': [10, 11], 'A': [20, 21]}
df = pd.DataFrame(list(d.values()), columns=['C1', 'C2'], index=list(d.keys())).sort_index()
df
df
C1 C2
A 20 21
B 10 11
Would it be acceptable if, instead of having one entry with a list of occupants, you had individual entries for each occupant? If so, you could just do:
n = len(data['occupants'])
for key, val in data.items():
    if key != 'occupants':
        data[key] = n * [val]  # broadcast each scalar to a list of length n
EDIT: Actually, I'm getting this behavior in pandas (i.e. just with pd.DataFrame(data)) even without this pre-processing. What version are you using?
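For completeness, a small sketch of that idea (hypothetical variable names) that broadcasts the scalars into a copy and then calls the constructor directly:
import pandas as pd

data = {'building': 'White House', 'DC?': True,
        'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']}
n = len(data['occupants'])
expanded = {k: v if k == 'occupants' else [v] * n for k, v in data.items()}
print(pd.DataFrame(expanded))  # one row per occupant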
I had a closely related problem, but my data structure was a multi-level dictionary with lists in the second level dictionary:
result = {'hamster': {'confidence': 1, 'ids': ['id1', 'id2']},
'zombie': {'confidence': 1, 'ids': ['id3']}}
When importing this with pd.DataFrame([result]), I end up with columns named hamster and zombie. What I actually want is those as row labels, with confidence and ids as column labels. To achieve this, I used pd.DataFrame.from_dict:
In [42]: pd.DataFrame.from_dict(result, orient="index")
Out[42]:
confidence ids
hamster 1 [id1, id2]
zombie 1 [id3]
This works for me with python 3.8 + pandas 1.2.3.
If you know the keys of the dictionary beforehand, why not first create an empty data frame and then keep adding rows?
