I want to split a string into multiple rows. This is what I tried:
df.assign(MODEL_ABC = df['MODEL_ABC'].str.split('_').explode('MODEL_ABC'))
My output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
If I run the split on the column by itself I get the values below, but not the entire dataframe:
A
B
This is my dataframe df:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A_B 75.0 25.0
Expected output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
1 2018 First B 75.0 25.0
You can do the following: start by splitting the column into lists, then explode it to create multiple rows:
df['MODEL_ABC'] = df['MODEL_ABC'].str.split('_')
df = df.explode('MODEL_ABC')
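If you also want the index renumbered (0, 1, ...) as in the expected output, explode accepts an ignore_index flag (pandas >= 1.1):
df['MODEL_ABC'] = df['MODEL_ABC'].str.split('_')
df = df.explode('MODEL_ABC', ignore_index=True)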
I want to create a pivot table with months as the columns and items as rows. Currently the data is in a table that looks like this:
Item  Balance  Maturity
A     100      1/31/23
B     150      2/28/23
C     200      3/31/23
But I want the data to look like this:
   1/31/23  2/28/23  3/31/23
A      100
B      150      150
C      200      200      200
In Python I have created a date range with frequency 'M'. The idea I am trying to accomplish: if the date is less than the Maturity date, repeat the balance.
pivot and bfill:
(df.pivot(index='Item', columns='Maturity', values='Balance')
.sort_index(axis=1, key=lambda x: pd.to_datetime(x, dayfirst=False))
.bfill(axis=1)
)
NB: sort_index uses a conversion with to_datetime to ensure the correct order of the columns.
Output:
Maturity 1/31/23 2/28/23 3/31/23
Item
A 100.0 NaN NaN
B 150.0 150.0 NaN
C 200.0 200.0 200.0
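For reference, a self-contained sketch that reproduces this answer (the DataFrame literal is rebuilt from the question's table; sort_index with a key argument requires pandas >= 1.1):
import pandas as pd
df = pd.DataFrame({'Item': ['A', 'B', 'C'],
                   'Balance': [100, 150, 200],
                   'Maturity': ['1/31/23', '2/28/23', '3/31/23']})
out = (df.pivot(index='Item', columns='Maturity', values='Balance')
         .sort_index(axis=1, key=lambda x: pd.to_datetime(x))
         .bfill(axis=1))
print(out)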
I have the following code to estimate profit from buy and sell price of crypto token.
import pandas as pd
# Read text file into pandas DataFrame
# --------------------------------------
df = pd.read_csv("crypto.txt", comment="#", skip_blank_lines=True, delim_whitespace=True).dropna()
# Display DataFrame
# -----------------
print(df)
print()
# Replace commas in number
# --------------------------------------
df['BuyPrice'] = df['BuyPrice'].str.replace(',', '').astype(float)
df['SellPrice'] = df['SellPrice'].str.replace(',', '').astype(float)
df['Size'] = df['Size'].str.replace(',', '').astype(float)
df['Profit'] = df.SellPrice - df.BuyPrice
# Sort BuyPrice column in ascending way
# --------------------------------------
df = df.sort_values('BuyPrice', ignore_index=True)
#df = df.sort_values('BuyPrice').reset_index(drop=True)
print()
# Sum all the numerical values and create a 'Total' row
# -----------------------------------------------------
df.loc['Total'] = df.sum(numeric_only=True)
# Replace NaN by empty space
# ---------------------------
df = df.fillna('')
df = df.rename({'BuyPrice': 'Buy Price', 'SellPrice': 'Sell Price'}, axis=1)
# Display Final DataFrame
# -----------------
print(df)
Now the output only shows the rows with sensible entries in the 'Date' column. I get:
Coin BuyPrice SellPrice Size Date
1 1INCH 2,520 3180 10 23-10-2021
3 SHIB 500 450 200,000 27-10-2021
4 DOT 1650 2500 1 June 01, 2021
Coin Buy Price Sell Price Size Date Profit
0 SHIB 500.0 450.0 200000.0 27-10-2021 -50.0
1 DOT 1650.0 2500.0 1.0 June 01, 2021 850.0
2 1INCH 2520.0 3180.0 10.0 23-10-2021 660.0
Total 4670.0 6130.0 200011.0 1460.0
Clearly, the rows without dates have been ignored. How could one tackle this issue? How can pandas be made to understand that they are dates?
The crypto.txt file contains:
Coin BuyPrice SellPrice Size Date
#--- --------- ---------- ---- -----------
ADA 1,580 1,600 1 NA
1INCH 2,520 3180 10 23-10-2021
SHIB 261.6 450 200,000 NA
SHIB 500 450 200,000 27-10-2021
DOT 1650 2500 1 "June 01, 2021"
It seems I couldn't write the last entry within single quotation marks. Is it possible to convert all the dates into one single format?
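No answer is recorded here, but as a sketch of one possible direction (an addition, not from the original thread): read_csv's default quotechar is the double quote, which is why single quotes around the last entry don't work, and the .dropna() chained onto read_csv is what discards every row whose Date is NA. Keeping those rows and normalizing the mixed date strings could look like this (format='mixed' requires pandas >= 2.0):
import pandas as pd
# no .dropna(), so the rows whose Date is NA survive
df = pd.read_csv("crypto.txt", comment="#", skip_blank_lines=True, delim_whitespace=True)
# parse "23-10-2021" and "June 01, 2021" into one datetime column;
# "NA" is already read as NaN and becomes NaT
df["Date"] = pd.to_datetime(df["Date"], format="mixed", dayfirst=True)
# render every date in a single format, e.g. dd-mm-yyyy
df["Date"] = df["Date"].dt.strftime("%d-%m-%Y")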
I have some typical stock data. I want to create a column called "Volume_Count" that will count the number of 0 volume days per quarter. My ultimate goal is to remove all stocks that have 0 volume for more than 5 days in a quarter. By creating this column, I can write a simple statement to subset Vol_Count > 5.
A typical Dataset:
Stock Date Qtr Volume
XYZ 1/1/19 2019 Q1 0
XYZ 1/2/19 2019 Q1 598
XYZ 1/3/19 2019 Q1 0
XYZ 1/4/19 2019 Q1 0
XYZ 1/5/19 2019 Q1 0
XYZ 1/6/19 2019 Q1 2195
XYZ 1/7/19 2019 Q1 0
... ... and so on (for multiple stocks and quarters)
This is what I've tried, a one-liner:
df = df.groupby(['stock','Qtr'], as_index=False).filter(lambda x: len(x.Volume == 0) > 5)
However, this produced inconsistent results.
I want to remove the stock from the dataset only for the quarter where the volume == 0 for 5 or more days.
Note: I have multiple Stocks and Qtr in my dataset, therefore it's essential to groupby Qtr, Stock.
Desired Output:
I want to keep the dataset but remove any stock for a quarter if it has volume = 0 for more than 5 days; that might entail a stock not being in the dataset for 2019 Q1 (because Vol == 0 on more than 5 days) but being in the df for 2019 Q2 (Vol == 0 on fewer than 5 days).
Try this:
df[df['Volume'].eq(0).groupby([df['Stock'],df['Qtr']]).transform('sum') < 5]
Details:
1. Take the Volume column and compare it to zero: df['Volume'].eq(0) gives a boolean Series that is True for each zero-volume record.
2. Group that boolean Series by the 'Stock' and 'Qtr' columns and, via transform('sum'), assign to every record the count of True values (zero-volume days) in its group.
3. Comparing that count with 5 yields a boolean Series that is True where the count is less than 5; use it to boolean-index the original dataframe.
(This also shows why the original attempt misbehaved: len(x.Volume == 0) is simply the size of the group, not the number of zero-volume days.) A runnable check on the sample data is sketched below.
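A quick runnable check on (an abbreviated version of) the question's sample data:
import pandas as pd
df = pd.DataFrame({'Stock': ['XYZ'] * 7,
                   'Qtr': ['2019 Q1'] * 7,
                   'Volume': [0, 598, 0, 0, 0, 2195, 0]})
mask = df['Volume'].eq(0).groupby([df['Stock'], df['Qtr']]).transform('sum') < 5
# XYZ has five zero-volume days in 2019 Q1, so the whole quarter is dropped
print(df[mask])  # empty DataFrame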
I have a Pandas dataframe as follows
df = pd.DataFrame([['John', '1/1/2017','10'],
['John', '2/2/2017','15'],
['John', '2/2/2017','20'],
['John', '3/3/2017','30'],
['Sue', '1/1/2017','10'],
['Sue', '2/2/2017','15'],
['Sue', '3/2/2017','20'],
['Sue', '3/3/2017','7'],
['Sue', '4/4/2017','20']
],
columns=['Customer', 'Deposit_Date','DPD'])
What is the best way to calculate the PreviousMean column?
The column is the year-to-date average of DPD for that customer, i.e. it includes all DPDs up to, but not including, rows that match the current deposit date. If no previous records exist, it is null or 0.
(Screenshot of the expected output omitted; the first answer's output below shows the same column.)
Notes:
the data is grouped by Customer Name and expanding over Deposit Dates
within each group, the expanding mean is calculated using only values from the previous rows.
at the start of each new customer the mean is 0 or alternatively null as there are no previous records on which to form the mean
the data frame is ordered by Customer Name and Deposit_Date
Instead of grouping and expanding the mean, filter the dataframe on these conditions and calculate the mean of DPD:
Customer == current row's Customer
Deposit_Date < current row's Deposit_Date
Use df.apply to perform this operation for every row in the dataframe:
# Deposit_Date and DPD are built as strings in the question, so convert first
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
df['DPD'] = df['DPD'].astype(int)
df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)
Output:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
Here's one way to exclude repeated days from the mean calculation:
import numpy as np
# helper column: NaN for repeated (Customer Name, Deposit_Date) days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() > 0
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# expanding mean per customer (pd.expanding_mean was removed; use .expanding().mean())
df['CumMean'] = df.groupby('Customer Name')['DPD2'].transform(lambda x: x.expanding().mean())
# drop the helper column
df = df.drop(columns='DPD2')
print(df)
Customer Name Deposit_Date DPD CumMean
0 John 01/01/2017 10 10.0
1 John 01/01/2017 10 10.0
2 John 02/02/2017 20 15.0
3 John 03/03/2017 30 20.0
4 Sue 01/01/2017 10 10.0
5 Sue 01/01/2017 10 10.0
6 Sue 02/02/2017 20 15.0
7 Sue 03/03/2017 30 20.0
OK, here is the best solution I've come up with thus far.
The trick is to first create an aggregated table at the customer and deposit-date level containing a shifted mean. To calculate this mean, you first have to calculate the running sum and count.
s = df.groupby(['Customer Name', 'Deposit_Date'])['DPD'].agg(['count', 'sum'])
s.columns = ['DPD count', 'DPD sum']
s.reset_index(inplace=True)
s['DPD_CumSum'] = s.groupby('Customer Name')['DPD sum'].cumsum()
s['DPD_CumCount'] = s.groupby('Customer Name')['DPD count'].cumsum()
s['DPD_CumMean'] = s['DPD_CumSum'] / s['DPD_CumCount']
s['DPD_PrevMean'] = s.groupby('Customer Name')['DPD_CumMean'].shift(1)
df = df.merge(s[['Customer Name', 'Deposit_Date', 'DPD_PrevMean']], how='left',
              on=['Customer Name', 'Deposit_Date'])
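For what it's worth, the same shifted-mean idea can be written a bit more compactly with named aggregation (a sketch assuming pandas >= 0.25, not part of the original answer):
daily = (df.groupby(['Customer Name', 'Deposit_Date'], as_index=False)
           .agg(DPD_sum=('DPD', 'sum'), DPD_count=('DPD', 'count')))
g = daily.groupby('Customer Name')
daily['cum_mean'] = g['DPD_sum'].cumsum() / g['DPD_count'].cumsum()
# shift by one date within each customer so the current date is excluded
daily['DPD_PrevMean'] = daily.groupby('Customer Name')['cum_mean'].shift()
df = df.merge(daily[['Customer Name', 'Deposit_Date', 'DPD_PrevMean']],
              how='left', on=['Customer Name', 'Deposit_Date'])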
I have a dataframe containing one column with prices.
What's the best way to create a column calculating the rate of return between two rows (leaving the first or last null)?
For example the data frame looks as follows:
Date Price
2008-11-21 23.400000
2008-11-24 26.990000
2008-11-25 28.000000
2008-11-26 25.830000
Trying to add a column as follows:
Date Price Return
2008-11-21 23.400000 0.1534
2008-11-24 26.990000 0.0374
2008-11-25 28.000000 -0.0775
2008-11-26 25.830000 NaN
where the Return column is calculated as follows:
Return[row 0] = Price[row 1] / Price[row 0] - 1
Should I use a for loop, or is there a better way?
You can use shift to shift the rows and then div to divide the Series against itself shifted:
In [44]:
df['Return'] = (df['Price'].shift(-1).div(df['Price']) - 1)
df
Out[44]:
Date Price Return
0 2008-11-21 23.40 0.153419
1 2008-11-24 26.99 0.037421
2 2008-11-25 28.00 -0.077500
3 2008-11-26 25.83 NaN
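An equivalent spelling, as a cross-check: pct_change computes price[i] / price[i-1] - 1, so shifting its result back one row gives the same forward-looking return:
df['Return'] = df['Price'].pct_change().shift(-1)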