Handling a column with dates and missing dates - python

I have the following code to estimate the profit from the buy and sell prices of crypto tokens.
import pandas as pd
# Read text file into pandas DataFrame
# --------------------------------------
df = pd.read_csv("crypto.txt", comment="#", skip_blank_lines=True, delim_whitespace=True).dropna()
# Display DataFrame
# -----------------
print(df)
print()
# Replace commas in number
# --------------------------------------
df['BuyPrice'] = df['BuyPrice'].str.replace(',', '').astype(float)
df['SellPrice'] = df['SellPrice'].str.replace(',', '').astype(float)
df['Size'] = df['Size'].str.replace(',', '').astype(float)
df['Profit'] = df.SellPrice - df.BuyPrice
# Sort BuyPrice column in ascending way
# --------------------------------------
df = df.sort_values('BuyPrice', ignore_index=True)
#df = df.sort_values('BuyPrice').reset_index(drop=True)
print()
# Sum all the numerical values and create a 'Total' row
# -----------------------------------------------------
df.loc['Total'] = df.sum(numeric_only=True)
# Replace NaN by empty space
# ---------------------------
df = df.fillna('')
df = df.rename({'BuyPrice': 'Buy Price', 'SellPrice': 'Sell Price'}, axis=1)
# Display Final DataFrame
# -----------------
print(df)
Now the output only shows the rows with sensible entries in the 'Date' column. I get:
Coin BuyPrice SellPrice Size Date
1 1INCH 2,520 3180 10 23-10-2021
3 SHIB 500 450 200,000 27-10-2021
4 DOT 1650 2500 1 June 01, 2021
Coin Buy Price Sell Price Size Date Profit
0 SHIB 500.0 450.0 200000.0 27-10-2021 -50.0
1 DOT 1650.0 2500.0 1.0 June 01, 2021 850.0
2 1INCH 2520.0 3180.0 10.0 23-10-2021 660.0
Total 4670.0 6130.0 200011.0 1460.0
Clearly, the rows without dates have been dropped. How could one tackle this issue? How can pandas understand that these are dates?
crypto.txt file contains:
Coin BuyPrice SellPrice Size Date
#--- --------- ---------- ---- -----------
ADA 1,580 1,600 1 NA
1INCH 2,520 3180 10 23-10-2021
SHIB 261.6 450 200,000 NA
SHIB 500 450 200,000 27-10-2021
DOT 1650 2500 1 "June 01, 2021"
It seems I couldn't write the last entry within single quotes. Is it possible to convert all the dates into one single kind of format?
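One possible approach, sketched on a slightly simplified inline copy of crypto.txt (so it runs without the file): skip the blanket dropna() so the NA-date rows survive, and parse the two date formats explicitly, falling back to NaT for anything unparseable.

```python
import io
import pandas as pd

# Inline copy of crypto.txt so the sketch is self-contained
data = '''Coin BuyPrice SellPrice Size Date
ADA 1,580 1,600 1 NA
1INCH 2,520 3180 10 23-10-2021
SHIB 261.6 450 200,000 NA
SHIB 500 450 200,000 27-10-2021
DOT 1650 2500 1 "June 01, 2021"
'''

# sep='\s+' behaves like delim_whitespace=True and still honours the
# double-quoted "June 01, 2021" field; note: no dropna() here, so the
# rows whose Date is NA survive
df = pd.read_csv(io.StringIO(data), sep=r'\s+')

def parse_date(s):
    # try each known format in turn; unparseable values become NaT
    for fmt in ('%d-%m-%Y', '%B %d, %Y'):
        try:
            return pd.to_datetime(s, format=fmt)
        except (ValueError, TypeError):
            continue
    return pd.NaT

df['Date'] = df['Date'].apply(parse_date)
print(df)
```

Once 'Date' is a real datetime column, rendering everything in one uniform format is just `df['Date'].dt.strftime('%d-%m-%Y')`.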

Related

Python dataframe returning closest value above specified input in one row (pivot_table)

I have the following DataFrame, output_excel, containing inventory data and sales data for different products. See the DataFrame below:
Product 2022-04-01 2022-05-01 2022-06-01 2022-07-01 2022-08-01 2022-09-01 AvgMonthlySales Current Inventory
1 BE37908 1500 1400 1200 1134 1110 1004 150.208333 1500
2 BE37907 2000 1800 1800 1540 1300 1038 189.562500 2000
3 DE37907 5467 5355 5138 4926 4735 4734 114.729167 5467
Please note that in my example, today's date is 2022-04-01, so all inventory numbers for the months May through September are predicted values, while the AvgMonthlySales are the mean of actual, past sales for that specific product. The current inventory just displays today's value.
I also have another dataframe, df2, containing the lead time, the same sales data, and the calculated security stock for the same products. The formula for the security stock is ((leadtime in weeks / 4) + 1) * AvgMonthlySales:
Product AvgMonthlySales Lead time in weeks Security Stock
1 BE37908 250.208333 16 1251.04166
2 BE37907 189.562500 24 1326.9375
3 DE37907 114.729167 10 401.552084
What I am trying to achieve:
I want to create a new dataframe, which tells me how many months are left until our inventory drops below the security stock. For example, for the first product, BE37908, the security stock is ~1251 units, and by 2022-06-01 our inventory will drop below that number. So I want to return 2022-05-01, as this is the last month where our inventories are projected to be above the security stock. The whole output should look something like this:
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN
Please also note that the timeframe for the projections (the columns) can be set by the user, so we couldn't just select columns 2 through 7. However, the Product column will always be the first one, and the AvgMonthlySales and the Current Inventory columns will always be the last two.
To recap, I want to return the column with the smallest value above the security stock for each product. I have an idea on how to do that by column using argsort, but not by row. What is the best way to achieve this? Any tips?
You could try as follows:
import numpy as np

# create list with columns with dates
cols = [col for col in df.columns if col.startswith('20')]
# select cols, apply df.gt row-wise, sum and subtract 1
idx = df.loc[:,cols].gt(df2['Security Stock'], axis=0).sum(axis=1).sub(1)
# get the correct dates from the cols
# if the value == len(cols)-1, *all* values will have been greater so: np.nan
idx = [cols[i] if i != len(cols)-1 else np.nan for i in idx]
out = df['Product'].to_frame()
out['Last Date Above Security Stock'] = idx
print(out)
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN
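For reference, the suggestion above can be run end to end; the two frames below are reconstructed from the tables in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product': ['BE37908', 'BE37907', 'DE37907'],
    '2022-04-01': [1500, 2000, 5467],
    '2022-05-01': [1400, 1800, 5355],
    '2022-06-01': [1200, 1800, 5138],
    '2022-07-01': [1134, 1540, 4926],
    '2022-08-01': [1110, 1300, 4735],
    '2022-09-01': [1004, 1038, 4734],
})
df2 = pd.DataFrame({
    'Product': ['BE37908', 'BE37907', 'DE37907'],
    'Security Stock': [1251.04166, 1326.9375, 401.552084],
})

# date columns are the ones that look like years
cols = [c for c in df.columns if c.startswith('20')]

# count how many months sit above the security stock; minus one gives
# the positional index of the last month still above it
idx = df.loc[:, cols].gt(df2['Security Stock'], axis=0).sum(axis=1).sub(1)

# if *every* month is above the stock, there is no drop date -> NaN
out = df[['Product']].copy()
out['Last Date Above Security Stock'] = [
    cols[i] if i != len(cols) - 1 else np.nan for i in idx
]
print(out)
```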

merging quarterly and monthly data while doing ffill on multiindex

I am trying to merge a quarterly series and a monthly series, and in the process essentially "downsampling" the quarterly series. Both dataframes contain a DATE column, BANK, and the remaining columns are various values either in a monthly or quarterly format. The complication I have had is that it is a multiindex, so if I try:
merged_data=df1.join(df2).reset_index(['DATE', 'BANK_CODE']).ffill()
the forward fill for quarterly data up to the last monthly datapoint is not done for each respective bank as I intended. Could anyone help with this please? Note: I have also tried to resample the quarterly dataframe separately, however I do not know of a way to downsample it to a monthly level until a certain date (should be the latest date in the monthly data).
df2 = df2.set_index(['DATE']).groupby(['BANK']).resample('M')['VALUE'].ffill()
df1:
Date Bank Value1 Value2
2021-06-30 bank 1 2000 7000
2021-07-31 bank 1 3000 2000
2021-06-30 bank 2 6000 9000
df2:
Date Bank Value1 Value2
2021-06-30 bank 1 2000 5000
2021-09-30 bank 1 5000 4000
2021-06-30 bank 2 9000 10000
Here is a mini example.
Using the data provided, assuming df1 is monthly and df2 is quarterly.
Set index and resample your quarterly data to monthly:
# monthly data
x1 = df1.set_index(['Bank','Date'])
# quarterly data, resampling back to monthly
x2 = (df2.set_index('Date')
         .groupby('Bank')
         .resample('M')
         .ffill()
         .drop(columns='Bank'))
Merge both - I assume you want the product, not the union:
x1.join(x2, lsuffix='_m', rsuffix='_q', how='outer').fillna(0)
Value1_m Value2_m Value1_q Value2_q
Bank Date
bank 1 2021-06-30 2000.0 7000.0 2000 5000
2021-07-31 3000.0 2000.0 2000 5000
2021-08-31 0.0 0.0 2000 5000
2021-09-30 0.0 0.0 5000 4000
bank 2 2021-06-30 6000.0 9000.0 9000 10000
The _m suffixes are the values from df1, _q are from df2. I'm assuming you'll know how to explain or deal with the differences between monthly and quarterly values on the same dates.
As you can see, no need to specify the interval, this is provided automatically.
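A self-contained version of the above, with two small compatibility shims since newer pandas spells the month-end frequency 'ME' instead of 'M' and may exclude the grouping column from the resample result:

```python
import pandas as pd

df1 = pd.DataFrame({  # monthly data
    'Date': pd.to_datetime(['2021-06-30', '2021-07-31', '2021-06-30']),
    'Bank': ['bank 1', 'bank 1', 'bank 2'],
    'Value1': [2000, 3000, 6000],
    'Value2': [7000, 2000, 9000],
})
df2 = pd.DataFrame({  # quarterly data
    'Date': pd.to_datetime(['2021-06-30', '2021-09-30', '2021-06-30']),
    'Bank': ['bank 1', 'bank 1', 'bank 2'],
    'Value1': [2000, 5000, 9000],
    'Value2': [5000, 4000, 10000],
})

# 'M' was renamed to 'ME' (month-end) in pandas 2.2
major, minor = (int(x) for x in pd.__version__.split('.')[:2])
freq = 'ME' if (major, minor) >= (2, 2) else 'M'

x1 = df1.set_index(['Bank', 'Date'])
x2 = (df2.set_index('Date')
         .groupby('Bank')
         .resample(freq)
         .ffill()
         # older pandas keeps the grouping column; newer drops it itself
         .drop(columns='Bank', errors='ignore'))

merged = x1.join(x2, lsuffix='_m', rsuffix='_q', how='outer').fillna(0)
print(merged)
```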

how to expand a string into multiple rows in dataframe?

I want to split a string into multiple rows.
df.assign(MODEL_ABC = df['MODEL_ABC'].str.split('_').explode('MODEL_ABC'))
My output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
If I run it for the column individually, I get something like the below, but not for the entire dataframe:
A
B
This is my dataframe df:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A_B 75.0 25.0
expected output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
1 2018 First B 75.0 25.0
You can do the following: start by transforming the column into a list, then explode it to create multiple rows:
df['MODEL_ABC'] = df['MODEL_ABC'].str.split('_')
df = df.explode('MODEL_ABC')
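Put together on the one-row frame from the question (with a reset_index to get the clean 0, 1 index shown in the expected output):

```python
import pandas as pd

df = pd.DataFrame({'YEAR': [2018], 'PERIOD': ['First'],
                   'MODEL_ABC': ['A_B'], 'Price': [75.0], 'Qty': [25.0]})

# turn 'A_B' into ['A', 'B'], then give each list element its own row
df['MODEL_ABC'] = df['MODEL_ABC'].str.split('_')
df = df.explode('MODEL_ABC').reset_index(drop=True)
print(df)
```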

Combining multiple column names with the same starting phrase into one column name

I have a sample dataframe from an Excel file as below:
import pandas as pd

d = {"Id": [1, 2],
     "Freight charge - 694.5 KG # USD 0.68/KG": [340, 0],
     "Terminal Handling Charges": [0, 0],
     "IOR FEE": [0, 0],
     "Handling - 694.5 KG # USD 0.50/KG": [357, 0],
     "Delivery Cartage - 694.5 KG # USD 0.25/KG": [0, 0],
     "Fuel Surcharge - 694.5 KG # USD 0.25/KG": [346, 0],
     "War Risk Surcharge - 694.5 KG # USD 0.14/KG": [0, 0],
     "Freight charge - 97.5 KG # USD 1.30/KG": [0, 124],
     "Airway Bill Fee": [0, 0],
     "Handling": [0, 0],
     "Terminal Handling Charges - 97.5 KG # USD 0.18/KG": [0, 34],
     "Delivery Cartage- White glove service": [0, 20]}
df = pd.DataFrame.from_dict(d)
I have put 0, but in reality it would be NA.
I want to combine all columns which begin with a certain phrase into one column, and the values for it should come in separate rows. For example, I have columns above starting with "Freight charge -". I want to make them just one column, "Freight charge", and the values those columns have should become the values of this column. I want to do the same for other columns which have the same beginning phrase, e.g. "Delivery Cartage" should be named "Delivery Charges", and anywhere I have "Handling", "Handling charges".
Want something like below:
ID Freight Charges Handling Fuel Surcharge Delivery Charges
1 340 357 346 NA
2 124 NA NA 20
I have only added sample column names. There can be more than two columns with the same starting phrase (like "Freight charge") with different ending text, so I need a generic solution that can take any number of columns with the same starting phrase and convert them into one column name.
You can filter the columns as below (also, last line preserves the column names in order)
def colname(c):
    if 'freight charge' in c.lower():
        return 'Freight Charge'
    elif 'delivery cartage' in c.lower():
        return 'Delivery Charges'
    elif 'handling' in c.lower():
        return 'Handling charges'
    else:
        return c

cols = [colname(col) for col in df.columns]
df.columns = cols

# preserve the order of the columns
old_cols = df.columns.unique().values
and you can combine the values as
df = df.groupby(lambda x: x, axis=1).sum()
Update: re-order the columns as before
df = df[list(old_cols)]
Here is the expected output
import numpy as np
Replace 0 with NaN. Drop columns with less than 1 non-NaN value. Split the column names on the special character - and take string index 0. Finally, combine the columns with the same name:
df2=df.replace(0,np.nan).dropna(thresh=1, axis='columns')
df2.columns=df2.columns.str.split('([-])').str[0]
df2.groupby(lambda x:x, axis=1).sum()
Shorter version
df.columns=df.columns.str.split('([-])').str[0]
df.replace(0,np.nan).dropna(thresh=1, axis='columns').groupby(lambda x:x, axis=1).sum()
   Delivery Cartage  Freight charge  Fuel Surcharge  Handling   Id  Terminal Handling Charges
0               0.0           340.0           346.0     357.0  1.0                        0.0
1              20.0           124.0             0.0       0.0  2.0                       34.0
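Both answers rely on groupby(..., axis=1), which is deprecated in recent pandas. An equivalent sketch, on a trimmed copy of the question's dict, groups the transpose by its (now duplicated) index labels instead and transposes back:

```python
import numpy as np
import pandas as pd

# trimmed copy of the question's frame, enough to show the reshape
df = pd.DataFrame({
    "Id": [1, 2],
    "Freight charge - 694.5 KG # USD 0.68/KG": [340, 0],
    "Freight charge - 97.5 KG # USD 1.30/KG": [0, 124],
    "Handling - 694.5 KG # USD 0.50/KG": [357, 0],
    "Delivery Cartage- White glove service": [0, 20],
})

# keep only the text before the first '-' as the column key
df.columns = df.columns.str.split(r'\s*-').str[0].str.strip()

# group the transpose by its duplicated labels; min_count=1 keeps
# NaN where a group had no real values instead of summing to 0
out = df.replace(0, np.nan).T.groupby(level=0).sum(min_count=1).T
print(out)
```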

Dividing rows for specific columns by date+n in Pandas

I want to divide rows in my dataframe via specific columns.
That is, I have a column named 'ticker', and each ticker has attributes 'date' and 'price'.
I want to divide the price at date[i+2] by the price at date[i], where i and i+2 just mean the DAY and the DAY + 2 for that ticker. The date is also in proper datetime format for operations using pandas.
The data looks like:
date | ticker | price |
2002-01-30 A 20
2002-01-31 A 21
2002-02-01 A 21.4
2002-02-02 A 21.3
.
.
That means I want to select the price based off the ticker and the DAY and the DAY + 2 specifically for each ticker to calculate the ratio date[i+2]/date[i].
I've considered using iloc but I'm not sure how to select for specific tickers only to do the math on.
use groupby:
df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
0 NaN
1 NaN
2 1.070000
3 1.014286
Name: price, dtype: float64
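Reconstructing the sample data from the question, the one-liner can be checked end to end:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2002-01-30', '2002-01-31',
                            '2002-02-01', '2002-02-02']),
    'ticker': ['A', 'A', 'A', 'A'],
    'price': [20, 21, 21.4, 21.3],
})

# within each ticker, divide each price by the price two rows earlier;
# the first two rows of each ticker have no i-2 partner, hence NaN
df['ratio'] = df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
print(df)
```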
