Pandas Coalesce Multiple Columns, NaN - python

I want to coalesce 4 columns using pandas. I've tried this:
final['join_key'] = final['book'].astype('str') + final['bdr'] + final['cusip'].fillna(final['isin']).fillna(final['Deal'].astype('str')).fillna(final['Id'])
When I use this it returns:
+-------+--------+-------+------+------+------------+------------------+
| book  | bdr    | cusip | isin | Deal | Id         | join_key         |
+-------+--------+-------+------+------+------------+------------------+
| 17236 | ETFROS |       |      |      | 8012398421 | 17236.0ETFROSnan |
+-------+--------+-------+------+------+------------+------------------+
The field Id is not properly appending to my join_key field.
Any help would be appreciated, thanks.
Update:
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| endOfDay   | book | bdr  | cusip     | isin         | Deal | Id         | join_key                   |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| 31/10/2019 | 15   | ITOR | 371494AM7 | US371494AM77 | 161  | 8013210731 | 20191031|15|ITOR|371494AM7 |
| 31/10/2019 | 15   | ITOR |           |              |      | 8011898573 | 20191031|15|ITOR|          |
| 31/10/2019 | 15   | ITOR |           |              |      | 8011898742 | 20191031|15|ITOR|          |
| 31/10/2019 | 15   | ITOR |           |              |      | 8011899418 | 20191031|15|ITOR|          |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
df['join_key'] = ("20191031|" + df['book'].astype('str') + "|" + df['bdr'] + "|" + df[['cusip', 'isin', 'Deal', 'id']].bfill(1)['cusip'].astype(str))
For some reason this code isn't picking up Id as part of the key.

The chained fillna calls on cusip are more complicated than they need to be. You can replace the chain with a single bfill across the columns:
final['join_key'] = (final['book'].astype('str') +
                     final['bdr'] +
                     final[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str))
Also note that the code in your update selects 'id' (lowercase), while the column in your table is named 'Id'; that mismatch is why the Id values never make it into the key.
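For the pipe-separated key from the update, here is a minimal runnable sketch (sample values are copied from the table above, and the hard-coded "20191031|" prefix is kept from the snippet in the question):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'book': [15, 15],
    'bdr': ['ITOR', 'ITOR'],
    'cusip': ['371494AM7', np.nan],
    'isin': ['US371494AM77', np.nan],
    'Deal': [161, np.nan],
    'Id': ['8013210731', '8011898573'],
})

# Coalesce cusip -> isin -> Deal -> Id, then build the pipe-separated key.
coalesced = df[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str)
df['join_key'] = "20191031|" + df['book'].astype(str) + "|" + df['bdr'] + "|" + coalesced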

Try this:
import pandas as pd
import numpy as np

# setup (ignore)
final = pd.DataFrame({
    'book': [17236],
    'bdr': ['ETFROS'],
    'cusip': [np.nan],
    'isin': [np.nan],
    'Deal': [np.nan],
    'Id': ['8012398421'],
})

# answer
final['join_key'] = (final['book'].astype('str') +
                     final['bdr'] +
                     final['cusip'].fillna(final['isin'])
                                   .fillna(final['Deal'])
                                   .fillna(final['Id'])
                                   .astype('str'))
Output
    book     bdr  cusip  isin  Deal          Id               join_key
0  17236  ETFROS    NaN   NaN   NaN  8012398421  17236ETFROS8012398421

Related

Pandas - Create new column w/values from another column based on str contains

I have two DataFrames: one with multiple columns and the other with just one. What I need is to join them based on a partial string match in a column. Example:
df1
| Name | Classification |
| -------- | -------------------------- |
| A | Transport/Bicycle/Mountain |
| B | Transport/City/Bus |
| C | Transport/Taxi/City |
| D | Transport/City/Uber |
| E | Transport/Mountain/Jeep |
df2
| Category |
| -------- |
| Mountain |
| City |
As you can see, the order within the Classification column is not well defined.
Desirable Output
| Name | Classification | Category |
| -------- | -------------------------- | -------- |
| A | Transport/Bicycle/Mountain | Mountain |
| B | Transport/City/Bus | City |
| C | Transport/Taxi/City | City |
| D | Transport/City/Uber | City |
| E | Transport/Mountain/Jeep | Mountain |
I'm stuck on this. Any ideas?
Many thanks in advance.
This implementation does the trick:
def get_cat(val):
    for cat in df2['Category']:
        if cat in val:
            return cat
    return None

df1['Category'] = df1['Classification'].apply(get_cat)
Note: as #Justin Ezequiel pointed out in the comments, you haven't specified what to do when both Mountain and City exist in the Classification. The current implementation uses the first Category that matches.
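For a larger category list, a vectorized alternative is to build an alternation pattern from df2 and use str.extract. This is a sketch, not the answer's method: re.escape guards against regex metacharacters in the category names, and note that the regex keeps the leftmost match in the string, which can differ from the loop's df2 order when more than one category appears.
import re

pattern = '(' + '|'.join(map(re.escape, df2['Category'])) + ')'
df1['Category'] = df1['Classification'].str.extract(pattern, expand=False)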
You can try this:
dff={"ne":[]}
for x in df1["Classification"]:
if a in df2 and a in x:
dff["ne"].append(a)
df1["Category"]=dff["ne"]
df1 will look like your desirable output.

How to run TA-Lib on multiple tickers in a single dataframe

I have a pandas dataframe named idf with data from 4/19/21 to 5/19/21 for 4675 tickers with the following columns: symbol, date, open, high, low, close, vol
|index |symbol |date |open |high |low |close |vol |EMA8|EMA21|RSI3|RSI14|
|-------|-------|-----------|-------|-------|-----------|-------|-------|----|-----|----|-----|
|0 |AACG |2021-04-19 |2.85 |3.03 |2.8000 |2.99 |173000 | | | | |
|1 |AACG |2021-04-20 |2.93 |2.99 |2.7700 |2.85 |73700 | | | | |
|2 |AACG |2021-04-21 |2.82 |2.95 |2.7500 |2.76 |93200 | | | | |
|3 |AACG |2021-04-22 |2.76 |2.95 |2.7200 |2.75 |56500 | | | | |
|4 |AACG |2021-04-23 |2.75 |2.88 |2.7000 |2.84 |277700 | | | | |
|... |... |... |... |... |... |... |... | | | | |
|101873 |ZYXI |2021-05-13 |13.94 |14.13 |13.2718 |13.48 |413200 | | | | |
|101874 |ZYXI |2021-05-14 |13.61 |14.01 |13.2200 |13.87 |225200 | | | | |
|101875 |ZYXI |2021-05-17 |13.72 |14.05 |13.5500 |13.82 |183600 | | | | |
|101876 |ZYXI |2021-05-18 |13.97 |14.63 |13.8300 |14.41 |232200 | | | | |
|101877 |ZYXI |2021-05-19 |14.10 |14.26 |13.7700 |14.25 |165600 | | | | |
I would like to use ta-lib to calculate several technical indicators like EMA of length 8 and 21, and RSI of 3 and 14.
I have been doing this with the following code after uploading the file and creating a dataframe named idf:
ind = pd.DataFrame()
tind = pd.DataFrame()

for ticker in idf['symbol'].unique():
    tind['rsi3'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 3).round(2)
    tind['rsi14'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 14).round(2)
    tind['ema8'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 8).round(2)
    tind['ema21'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 21).round(2)
    ind = ind.append(tind)
    tind = tind.iloc[0:0]

idf = pd.merge(idf, ind, left_index=True, right_index=True)
Is this the most efficient way of doing this?
If not, what is the easiest and fastest way to calculate indicator values and get those calculated indicator values into the dataframe idf?
Prefer to avoid a for loop if possible.
Any help is highly appreciated.
rsi = lambda x: talib.RSI(idf.loc[x.index, "close"], 14)
idf['rsi(14)'] = idf.groupby(['symbol']).apply(rsi).reset_index(0,drop=True)
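The same pattern extends to all four columns without an explicit loop. A sketch under the question's assumptions (talib imported as ta, rows already in date order within each symbol):
def add_indicators(g):
    # Compute all four indicators for one symbol's rows, keeping the group's index.
    out = pd.DataFrame(index=g.index)
    out['EMA8'] = ta.EMA(g['close'], 8).round(2)
    out['EMA21'] = ta.EMA(g['close'], 21).round(2)
    out['RSI3'] = ta.RSI(g['close'], 3).round(2)
    out['RSI14'] = ta.RSI(g['close'], 14).round(2)
    return out

idf = idf.join(idf.groupby('symbol', group_keys=False).apply(add_indicators))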

Pandas CSV dataframes

I have a dataframe like this:
+---+-------+------+-------+-------+
| id|  prop1| prop2|  prop3|  prop4|
+---+-------+------+-------+-------+
|  1| value1|value2| value3|   null|
|  2|value11|  null|value13|value14|
+---+-------+------+-------+-------+
I want to get this in python:
+-------+------------+
| id    | prop       |
+-------+------------+
| 1     | value1     |
| 1     | value2     |
| 1     | value3     |
| 1     | null       |
| 2     | value11    |
| 2     | null       |
+-------+------------+
import pandas as pd
import numpy as np

df1 = pd.read_csv('C:\Python27\programs\DF.csv', delimiter=',', index_col='id')
print(df1)
print('*************************************')

for i, j in df1.iterrows():
    df2 = (i, j)
    print(df2)
It seems you need to unpivot your dataframe, so use melt:
pd.melt(df,id_vars=['id'],value_vars=['prop1', 'prop2','prop3','prop4'])
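A runnable sketch with the sample values from the question; dropping melt's variable column and renaming the value column to prop matches the desired shape:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'prop1': ['value1', 'value11'],
    'prop2': ['value2', None],
    'prop3': ['value3', 'value13'],
    'prop4': [None, 'value14'],
})

out = (pd.melt(df, id_vars=['id'],
               value_vars=['prop1', 'prop2', 'prop3', 'prop4'],
               value_name='prop')
         .drop(columns='variable')
         .sort_values('id', kind='stable')   # group rows back by id
         .reset_index(drop=True))
print(out)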

Python - Pandas - Converting column with specific subsets into rows

I have a dataframe that looks like this below with Date, Price and Serial.
+----------+--------+--------+
| Date     | Price  | Serial |
+----------+--------+--------+
| 2/1/1996 | 0.5909 | 1      |
| 2/1/1996 | 0.5711 | 2      |
| 2/1/1996 | 0.5845 | 3      |
| 3/1/1996 | 0.5874 | 1      |
| 3/1/1996 | 0.5695 | 2      |
| 3/1/1996 | 0.584  | 3      |
+----------+--------+--------+
I would like to make it look like this, where Serial becomes the column names and the data sorts itself into the correct Date row and Serial column.
+----------+--------+--------+--------+
| Date     | 1      | 2      | 3      |
+----------+--------+--------+--------+
| 2/1/1996 | 0.5909 | 0.5711 | 0.5845 |
| 3/1/1996 | 0.5874 | 0.5695 | 0.584 |
+----------+--------+--------+--------+
I understand I can do this via a loop but just wondering if there is a more efficient way to do this?
Thanks for your kind help. Also curious if there is a better way to paste such tables rather than attaching images in my questions =x
You can use pandas.pivot_table:
res = (df.pivot_table(index='Date', columns='Serial', values='Price',
                      aggfunc='sum')
         .reset_index())
res.columns.name = ''
       Date       1       2       3
0  2/1/1996  0.5909  0.5711  0.5845
1  3/1/1996  0.5874  0.5695  0.5840
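Since each (Date, Serial) pair occurs only once in the sample, plain pivot works as well and makes the no-aggregation intent explicit (a sketch):
res = df.pivot(index='Date', columns='Serial', values='Price').reset_index()
res.columns.name = ''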

Printing Lists as Tabular Data, Group Rows

I need to format data, contained in a list of lists, as a table.
I can make a grid using tabulate:
import tabulate

x = [['Alice', 'min', 2],
     ['', 'max', 5],
     ['Bob', 'min', 8],
     ['', 'max', 15]]
header = ['Name', '', 'value']
print(tabulate.tabulate(x, headers=header, tablefmt="grid"))
+--------+-----+---------+
| Name   |     |   value |
+========+=====+=========+
| Alice  | min |       2 |
+--------+-----+---------+
|        | max |       5 |
+--------+-----+---------+
| Bob    | min |       8 |
+--------+-----+---------+
|        | max |      15 |
+--------+-----+---------+
However, we require grouping of rows, like this:
+--------+-----+---------+
| Name   |     |   value |
+========+=====+=========+
| Alice  | min |       2 |
+        +     +         +
|        | max |       5 |
+--------+-----+---------+
| Bob    | min |       8 |
+        +     +         +
|        | max |      15 |
+--------+-----+---------+
I tried using multiline rows (using "\n".join()), which is apparently supported in tabulate 0.8.3, but with no success.
This is required to run in the production server, so we can't use any heavy libraries. We are using tabulate because the whole tabulate library is a single file, and we can ship the file with the product.
You can try this:
x = [['Alice', 'min\nmax', '2\n5'],
     ['Bob', 'min\nmax', '8\n15']]
print(tabulate.tabulate(x, headers=header, tablefmt="grid"))
+--------+-----+---------+
| Name   |     | value   |
+========+=====+=========+
| Alice  | min | 2       |
|        | max | 5       |
+--------+-----+---------+
| Bob    | min | 8       |
|        | max | 15      |
+--------+-----+---------+
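If the data arrives in the original one-row-per-line form, a small hypothetical helper (group_rows is not part of tabulate) can collapse continuation rows, those with a blank Name, into the multiline form above. A sketch:
def group_rows(rows):
    # Fold rows with a blank name into the preceding named row,
    # joining the label and value cells with newlines.
    grouped = []
    for name, label, value in rows:
        if name:
            grouped.append([name, str(label), str(value)])
        else:
            grouped[-1][1] += '\n' + str(label)
            grouped[-1][2] += '\n' + str(value)
    return grouped

x = group_rows([['Alice', 'min', 2], ['', 'max', 5],
                ['Bob', 'min', 8], ['', 'max', 15]])
print(tabulate.tabulate(x, headers=header, tablefmt="grid"))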
