Python/Pandas formatting values of a column if its header contains "Price" - python

Using pandas, I have read an .xlsx file that contains 4 columns: ID, Product, Buy Price and Sell Price.
I would like to format values under the columns that contain "Price" in their headers in the following way:
1399 would become $1,399.00
1538.9 would become $1,538.90
I understand how to address the column headers and impose the desired condition, but I don't know how to format the values themselves. This is how far I got:
for col in df.columns:
    if "Price" in col:
        print("This header has 'Price' in it")
    else:
        print(col)
ID
Name
This header has 'Price' in it
This header has 'Price' in it
How can I do this?

Try:
for col in df.columns:
    if "Price" in col:
        print("This header has 'Price' in it")
        df[col] = df[col].map('${:,.2f}'.format)
    else:
        print(col)
Or, if you can get all the matching column names into a list, use DataFrame.applymap:
cols = df.filter(like='Price').columns
df[cols] = df[cols].applymap('${:,.2f}'.format)
In the format string, :, inserts a comma as the thousands separator, and .2f formats the floats to 2 decimal places for the cents.
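As a quick standalone illustration of that format spec, independent of any DataFrame:

```python
# '${:,.2f}' = literal '$' + thousands separator (,) + two decimal places (.2f)
print('${:,.2f}'.format(1399))    # $1,399.00
print('${:,.2f}'.format(1538.9))  # $1,538.90
```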

I suggest you use py-moneyed; see below how to use it to transform a number into a string representing money:
import pandas as pd
from moneyed import Money, USD
res = pd.Series(data=[1399, 1538.9]).map(lambda x: str(Money(x, USD)))
print(res)
Output
0 $1,399.00
1 $1,538.90
dtype: object
Full code
import pandas as pd
from moneyed import Money, USD
# toy data
columns = ["ID", "Product", "Buy Price", "Sell Price"]
df = pd.DataFrame(data=[[0, 0, 1399, 1538.9]], columns=columns)
# find columns with Price in it
filtered = df.columns.str.contains("Price")
# transform the values of those columns
df.loc[:, filtered] = df.loc[:, filtered].applymap(lambda x: str(Money(x, USD)))
print(df)
Output
ID Product Buy Price Sell Price
0 0 0 $1,399.00 $1,538.90


Dataframe with empty column in the data

I have a list of lists with an header row and then the different value rows.
It can happen that in some cases the last "column" has an empty value for all the rows (if even just one row has a value it works fine), but DataFrame is not happy about that, because the number of columns differs from the header.
I'm thinking of adding a None value to the rows without a value before creating the DF, but I wonder if there is a better way to handle this case?
import pandas
data = [
    ["data1", "data2", "data3"],
    ["value11", "value12"],
    ["value21", "value22"],
    ["value31", "value32"]]
headers = data.pop(0)
dataframe = pandas.DataFrame(data, columns=headers)
You could do this:
import pandas as pd
data = [
["data1", "data2", "data3"],
["value11", "value12"],
["value21", "value22"],
["value31", "value32"]
]
# create dataframe
df = pd.DataFrame(data)
# set new column names
# this will use ["data1", "data2", "data3"] as new columns, because they are in the first row
df.columns = df.iloc[0].tolist()
# now that you have the right column names, just jump the first line
df = df.iloc[1:].reset_index(drop=True)
df
data1 data2 data3
0 value11 value12 None
1 value21 value22 None
2 value31 value32 None
Is this what you want?
You can use the DataFrame.reindex function to add missing columns. You could do something like this:
import pandas as pd
df = pd.DataFrame(data)
# Name only as many columns as the data actually has, to avoid a length-mismatch exception.
df.columns = headers[:df.shape[1]]
df = df.reindex(headers, axis=1)
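A minimal end-to-end sketch of this reindex approach, using the toy data from the question:

```python
import pandas as pd

data = [
    ["value11", "value12"],
    ["value21", "value22"],
]
headers = ["data1", "data2", "data3"]

df = pd.DataFrame(data)
# Name only the columns that actually exist...
df.columns = headers[:df.shape[1]]
# ...then reindex to the full header list; "data3" is added, filled with NaN.
df = df.reindex(headers, axis=1)
print(df)
```

reindex leaves existing columns untouched and only creates the missing ones, so it also works when more than one trailing column is empty.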

Using Panda, Update column values based on a list of ID and new Values

I have a df with ID and Sell columns. I want to update the Sell column using a list of new Sell values (not all rows need to be updated, just some of them). In all the examples I have seen, the value is always the same or comes from another column. In my case, I have a dynamic value.
This is what I would like:
file = 'something.csv'  # Has 300 rows
IDList = ['453164259', '453106168', '453163869', '453164463']  # IDs
SellList = [120, 270, 350, 410]  # Sell values
csv = path_pattern = os.path.join(os.getcwd(), file)
df = pd.read_csv(file)
df.loc[df['Id'].isin(IDList[x]), 'Sell'] = SellList[x]  # Update the rows with the corresponding Sell value of the ID
df.to_csv(file)
Any ideas?
Thanks in advance
Assuming 'id' is a string (as in IDList) and is not the index of your df:
IDList = ['453164259', '453106168', '453163869', '453164463']
SellList = [120, 270, 350, 410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if row['id'] in id_dict:
        df.loc[index, 'Sell'] = id_dict[row['id']]
If 'id' is the index:
IDList = ['453164259', '453106168', '453163869', '453164463']
SellList = [120, 270, 350, 410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if index in id_dict:
        df.loc[index, 'Sell'] = id_dict[index]
What I did is create a dictionary from IDList and SellList and then loop over the df using iterrows().
df = pd.read_csv('something.csv')
IDList= ['453164259','453106168','453163869','453164463']
SellList=[120,270,350,410]
This will work efficiently, especially for large files:
df.set_index('id', inplace=True)
df.loc[IDList, 'Sell'] = SellList
df = df.reset_index()  # not mandatory, just in case you need 'id' back as a column
df.to_csv(file)
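A self-contained sketch of the set_index/loc approach with an in-memory stand-in for the CSV (the column names 'id' and 'Sell' are taken from the question):

```python
import pandas as pd

# Toy stand-in for the 300-row CSV; the extra '999' row should stay untouched.
df = pd.DataFrame({
    "id": ["453164259", "453106168", "453163869", "453164463", "999"],
    "Sell": [0, 0, 0, 0, 0],
})
IDList = ["453164259", "453106168", "453163869", "453164463"]
SellList = [120, 270, 350, 410]

# Align on 'id' and assign all new values in one vectorized step.
df = df.set_index("id")
df.loc[IDList, "Sell"] = SellList
df = df.reset_index()
print(df)
```

Because the assignment is label-based, the order of IDList and SellList just has to match each other, not the row order of the file.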

How to find and add missing dates in a dataframe of sorted dates (descending order)?

In Python, I have a DataFrame with column 'Date' (format e.g. 2020-06-26). This column is sorted in descending order: 2020-06-26, 2020-06-25, 2020-06-24...
The other column 'Reviews' is made of text reviews of a website. My data can have multiple reviews on a given date or no reviews on another date. I want to find which dates are missing in column 'Date'. Then, for each missing date, add one row with the date in format='%Y-%m-%d' and an empty review in 'Reviews', to be able to plot them. How should I do this?
from datetime import date, timedelta
d = data['Date']
print(d[0])
print(d[-1])
date_set = set(d[-1] + timedelta(x) for x in range((d[0] - d[-1]).days))
missing = sorted(date_set - set(d))
missing = pd.to_datetime(missing, format='%Y-%m-%d')
idx = pd.date_range(start=min(data.Date), end=max(data.Date), freq='D')
#tried this
data = data.reindex(idx, fill_value=0)
data.head()
#Got TypeError: 'fill_value' ('0') is not in this Categorical's categories.
#also tried this
df2 = (pd.DataFrame(data.set_index('Date'), index=idx).fillna(0) + data.set_index('Date')).ffill().stack()
df2.head()
#Got ValueError: cannot reindex from a duplicate axis
This is my code:
from datetime import timedelta

for i in range(len(df)):
    if i > 0:
        prev = df.loc[i-1]["Date"]
        current = df.loc[i]["Date"]
        for a in range((prev - current).days):
            if a > 0:
                df.loc[df["Date"].count()] = [prev - timedelta(days=a), None]
df = df.sort_values("Date", ascending=False)
print(df)
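For completeness, a vectorized alternative that avoids the row-by-row loop (a sketch with made-up toy data; column names follow the question):

```python
import pandas as pd

# Toy data: 2020-06-23 and 2020-06-25 are missing, 2020-06-24 appears twice.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-06-26", "2020-06-24", "2020-06-24", "2020-06-22"]),
    "Reviews": ["great", "ok", "meh", "bad"],
})

# Build the full daily range, then keep only the dates not already present.
full_range = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")
missing = full_range.difference(df["Date"])

# One empty-review row per missing date, appended and re-sorted descending.
filler = pd.DataFrame({"Date": missing, "Reviews": ""})
df = (pd.concat([df, filler])
        .sort_values("Date", ascending=False)
        .reset_index(drop=True))
print(df)
```

Using Index.difference keeps duplicate review dates intact, which a plain reindex on 'Date' cannot do (that is what raised the "duplicate axis" error in the question).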

Column in DataFrame isn't recognised. KeyError: 'Date'

I'm in the initial stages of doing some 'machine learning'.
I'm trying to create a new data frame, and one of the columns doesn't appear to be recognised.
I've loaded an Excel file with 2 columns (removed the index). All fine.
Code:
df = pd.read_excel('scores.xlsx',index=False)
df=df.rename(columns=dict(zip(df.columns,['Date','Amount'])))
df.index=df['Date']
df=df[['Amount']]
#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date','Amount'])
for i in range(0, len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Amount'][i] = data['Amount'][i]
The error:
KeyError: 'Date'
Not really sure what's the problem here.
Any help greatly appreciated
I think in line 4 you reduce your dataframe to just one column, "Amount".
To add to @Grzegorz Skibinski's answer: the problem is that after line 4 there is no longer a 'Date' column. The Date column was assigned to the index and removed, and while the index is named "Date", you can't use 'Date' as a key to get the index; you have to use data.index[i] instead of data['Date'][i].
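A small runnable sketch of that fix, with toy data standing in for the Excel file (column names taken from the question, values invented):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2020-01-01", "2020-01-02"], "Amount": [10, 20]})
df.index = df["Date"]
df = df[["Amount"]]  # after this step, 'Date' lives only in the index

data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(len(data)), columns=["Date", "Amount"])
for i in range(len(data)):
    new_data.loc[i, "Date"] = data.index[i]         # read from the index, not data['Date']
    new_data.loc[i, "Amount"] = data["Amount"].iloc[i]
print(new_data)
```

Note the sketch also uses .loc and .iloc for the element access, which avoids the chained-assignment warnings that new_data['Date'][i] = ... triggers in recent pandas.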
It seems that you have an error in the formatting of your Date column.
To check that you don't have an error on the name of the columns you can print the columns names:
import pandas as pd
# create data
data_dict = {}
data_dict['Fruit '] = ['Apple', 'Orange']
data_dict['Price'] = [1.5, 3.24]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
# Print columns names
print(df.columns.values)
# Print "Fruit " column
print(df['Fruit '])
This code outputs:
['Fruit ' 'Price']
0 Apple
1 Orange
Name: Fruit , dtype: object
We clearly see that the "Fruit " column has a trailing space. This is an easy mistake to make, especially when using Excel.
If you try to access "Fruit" instead of "Fruit ", you get the error you have:
KeyError: 'Fruit'
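If whitespace in the headers turns out to be the culprit, a common fix is to normalize the column names right after loading (a sketch, reusing the toy data above):

```python
import pandas as pd

df = pd.DataFrame({"Fruit ": ["Apple", "Orange"], "Price": [1.5, 3.24]})

# Strip leading/trailing whitespace from every column name.
df.columns = df.columns.str.strip()
print(df.columns.tolist())  # ['Fruit', 'Price']
print(df["Fruit"])          # no KeyError now
```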

Python: Convert columns into date format and extract order

I am asking for help with transforming values into date format.
I have the following data structure:
ID ACT1 ACT2 ACT3 ACT4
1 154438.0 154104.0 155321.0 155321.0
2 154042.0 154073.0 154104.0 154104.0
...
The numbers in columns ACT1-4 need to be converted. Some rows contain NaN values.
I found that following function helps me to get a Gregorian date:
from datetime import datetime, timedelta
gregorian = datetime.strptime('1582/10/15', "%Y/%m/%d")
modified_date = gregorian + timedelta(days=154438)
datetime.strftime(modified_date, "%Y/%m/%d")
It would be great to know how I can apply this transformation to all columns except for "ID" and whether the approach is correct (or could be improved).
After the transformation is applied, I need to extract the order of column items, sorted by date in ascending order. For instance
ID ORDER
1 ACT1, ACT3, ACT4, ACT2
2 ACT2, ACT1, ACT3, ACT4
Thank you!
It sounds like you have two questions here.
1) To change to datetime:
import numpy as np
from datetime import datetime, timedelta

cols = [col for col in df.columns if col != 'ID']
df.loc[:, cols] = df.loc[:, cols].applymap(lambda x: datetime.strptime('1582/10/15', "%Y/%m/%d") + timedelta(days=x) if np.isfinite(x) else x)
2) To get the sorted column names:
df['ORDER'] = df.loc[:, cols].apply(lambda dr: ','.join(df.loc[:, cols].columns[dr.dropna().argsort()]), axis=1)
Note: the dropna above will omit columns with NaT values from the order string.
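Putting both steps together on the toy rows from the question (a sketch; a stable sort is used so that tied dates like ACT3/ACT4 keep their column order):

```python
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame({
    "ID": [1, 2],
    "ACT1": [154438.0, 154042.0],
    "ACT2": [154104.0, 154073.0],
    "ACT3": [155321.0, 154104.0],
    "ACT4": [155321.0, 154104.0],
})
cols = [c for c in df.columns if c != "ID"]

# Day counts -> Gregorian dates, NaN preserved.
base = datetime.strptime("1582/10/15", "%Y/%m/%d")
df[cols] = df[cols].applymap(
    lambda x: base + timedelta(days=x) if np.isfinite(x) else x)

# For each row, list the ACT columns in ascending date order.
df["ORDER"] = df[cols].apply(
    lambda row: ",".join(np.array(cols)[row.values.argsort(kind="stable")]),
    axis=1)
print(df[["ID", "ORDER"]])
```

Note the resulting order for ID 1 starts with ACT2 (its smallest day count, 154104.0), matching the answer's sorting logic rather than the hand-written example in the question.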
First, I would make the input comma-separated so that it's much easier to handle, of the form:
ID,ACT1,ACT2,ACT3,ACT4
1,154438.0,154104.0,155321.0,155321.0
2,154042.0,154073.0,154104.0,154104.0
Then you can read each line using a CSV reader, extracting key/value pairs that have your column names as keys. You pop the ID off that dictionary to get its value, i.e. 1, 2, etc., and you can then reorder according to the value, which is the date. The code is below:
#!/usr/bin/env python3
import csv
from operator import itemgetter

idAndTuple = {}
with open('time.txt') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        myID = row.pop('ID', None)
        reorderedList = sorted(row.items(), key=itemgetter(1))
        idAndTuple[myID] = reorderedList
        print(myID, reorderedList)
The result when you run this is:
1 [('ACT2', '154104.0'), ('ACT1', '154438.0'), ('ACT3', '155321.0'), ('ACT4', '155321.0')]
2 [('ACT1', '154042.0'), ('ACT2', '154073.0'), ('ACT3', '154104.0'), ('ACT4', '154104.0')]
which I think is what you are looking for.
