I have a pandas dataframe named idf with data from 4/19/21 to 5/19/21 for 4675 tickers with the following columns: symbol, date, open, high, low, close, vol
|index |symbol |date |open |high |low |close |vol |EMA8|EMA21|RSI3|RSI14|
|-------|-------|-----------|-------|-------|-----------|-------|-------|----|-----|----|-----|
|0 |AACG |2021-04-19 |2.85 |3.03 |2.8000 |2.99 |173000 | | | | |
|1 |AACG |2021-04-20 |2.93 |2.99 |2.7700 |2.85 |73700 | | | | |
|2 |AACG |2021-04-21 |2.82 |2.95 |2.7500 |2.76 |93200 | | | | |
|3 |AACG |2021-04-22 |2.76 |2.95 |2.7200 |2.75 |56500 | | | | |
|4 |AACG |2021-04-23 |2.75 |2.88 |2.7000 |2.84 |277700 | | | | |
|... |... |... |... |... |... |... |... | | | | |
|101873 |ZYXI |2021-05-13 |13.94 |14.13 |13.2718 |13.48 |413200 | | | | |
|101874 |ZYXI |2021-05-14 |13.61 |14.01 |13.2200 |13.87 |225200 | | | | |
|101875 |ZYXI |2021-05-17 |13.72 |14.05 |13.5500 |13.82 |183600 | | | | |
|101876 |ZYXI |2021-05-18 |13.97 |14.63 |13.8300 |14.41 |232200 | | | | |
|101877 |ZYXI |2021-05-19 |14.10 |14.26 |13.7700 |14.25 |165600 | | | | |
I would like to use ta-lib to calculate several technical indicators like EMA of length 8 and 21, and RSI of 3 and 14.
I have been doing this with the following code after uploading the file and creating a dataframe named idf:
import pandas as pd
import talib as ta

ind = pd.DataFrame()
tind = pd.DataFrame()

for ticker in idf['symbol'].unique():
    # Compute the indicators on this ticker's closing prices (keeps idf's index)
    tind['rsi3'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 3).round(2)
    tind['rsi14'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 14).round(2)
    tind['ema8'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 8).round(2)
    tind['ema21'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 21).round(2)

    # Collect this ticker's indicators, then reset the temporary frame
    ind = ind.append(tind)
    tind = tind.iloc[0:0]

# Join the collected indicator columns back onto idf by index
idf = pd.merge(idf, ind, left_index=True, right_index=True)
Is this the most efficient way of doing this?
If not, what is the easiest and fastest way to calculate the indicator values and get them into the dataframe idf?
I'd prefer to avoid a for loop if possible.
Any help is highly appreciated.
# Compute RSI(14) per symbol without an explicit loop: group by symbol and apply
# the indicator to each group's closing prices, keeping the original index.
rsi = lambda x: talib.RSI(idf.loc[x.index, "close"], 14)
idf['rsi(14)'] = idf.groupby(['symbol']).apply(rsi).reset_index(0, drop=True)
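The same groupby idea extends to all four columns without a loop or a merge, since transform keeps idf's original index. A minimal sketch, assuming TA-Lib is imported as ta (matching the question's code) and that idf is sorted by symbol and date:
import talib as ta

grouped_close = idf.groupby('symbol')['close']

# transform applies the indicator per symbol and returns a result aligned to idf's index
idf['EMA8'] = grouped_close.transform(lambda c: ta.EMA(c, 8)).round(2)
idf['EMA21'] = grouped_close.transform(lambda c: ta.EMA(c, 21)).round(2)
idf['RSI3'] = grouped_close.transform(lambda c: ta.RSI(c, 3)).round(2)
idf['RSI14'] = grouped_close.transform(lambda c: ta.RSI(c, 14)).round(2)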
I have a dataframe (df1) where I have listed some time frames:
| start | end | event name |
|-------|-----|------------|
| 1 | 3 | name_1 |
| 3 | 5 | name_2 |
| 2 | 6 | name_3 |
In these time frames, I would like to extract some data from another dataframe (df2). For example, I want to extend df1 with the average measurement from df2 inside the specified time range.
| timestamp | measurement |
|-----------|-------------|
| 1 | 5 |
| 2 | 7 |
| 3 | 5 |
| 4 | 9 |
| 5 | 2 |
| 6 | 7 |
| 7 | 8 |
I was thinking about a UDF that filters df2 by timestamp and evaluates the average. But in a UDF I cannot reference two dataframes:
import pyspark.sql.functions as f

def get_avg(start, end):
    # Filter df2 to the rows inside the time frame and average the measurement
    return df2.filter((df2.timestamp > start) & (df2.timestamp < end)).agg({"measurement": "avg"})

udf_1 = f.udf(get_avg)
df1.select(udf_1('start', 'end')).show()
This throws the error TypeError: cannot pickle '_thread.RLock' object.
How would I solve this issue efficiently?
In this case there is no need for a UDF; you can simply join over a range interval determined by the timestamps:
import pyspark.sql.functions as F
df1.join(df2, on=[(df2.timestamp > df1.start) & (df2.timestamp < df1.end)]) \
.groupby('start', 'end', 'event_name') \
.agg(F.mean('measurement').alias('avg')) \
.show()
+-----+---+----------+-----------------+
|start|end|event_name| avg|
+-----+---+----------+-----------------+
| 1| 3| name_1| 7.0|
| 3| 5| name_2| 9.0|
| 2| 6| name_3|5.333333333333333|
+-----+---+----------+-----------------+
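One note on the design: a non-equi join like this cannot be executed as a hash join, so when df1 (the time frames) is small it usually pays to broadcast it. A minimal sketch, assuming the same column names as in the answer above:
import pyspark.sql.functions as F

# Broadcast the small dataframe of time frames so the range condition
# can be evaluated against df2 without shuffling df2.
result = (F.broadcast(df1)
          .join(df2, on=[(df2.timestamp > df1.start) & (df2.timestamp < df1.end)])
          .groupby('start', 'end', 'event_name')
          .agg(F.mean('measurement').alias('avg')))
result.show()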
I want to coalesce 4 columns using pandas. I've tried this:
final['join_key'] = final['book'].astype('str') + final['bdr'] + final['cusip'].fillna(final['isin']).fillna(final['Deal'].astype('str')).fillna(final['Id'])
When I use this it returns:
+-------+--------+-------+------+------+------------+------------------+
| book | bdr | cusip | isin | Deal | Id | join_key |
+-------+--------+-------+------+------+------------+------------------+
| 17236 | ETFROS | | | | 8012398421 | 17236.0ETFROSnan |
+-------+--------+-------+------+------+------------+------------------+
The field Id is not properly appending to my join_key field.
Any help would be appreciated, thanks.
Update:
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| endOfDay | book | bdr | cusip | isin | Deal | Id | join_key |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| 31/10/2019 | 15 | ITOR | 371494AM7 | US371494AM77 | 161 | 8013210731 | 20191031|15|ITOR|371494AM7 |
| 31/10/2019 | 15 | ITOR | | | | 8011898573 | 20191031|15|ITOR| |
| 31/10/2019 | 15 | ITOR | | | | 8011898742 | 20191031|15|ITOR| |
| 31/10/2019 | 15 | ITOR | | | | 8011899418 | 20191031|15|ITOR| |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
df['join_key'] = ("20191031|" + df['book'].astype('str') + "|" + df['bdr'] + "|" + df[['cusip', 'isin', 'Deal', 'id']].bfill(1)['cusip'].astype(str))
For some reason this code isn't picking up Id as part of the key.
The last chain of fillna for cusip is too complicated. You can change it to bfill across the columns:
# Backfill across the row so the first non-null of cusip/isin/Deal/Id ends up in cusip
final['join_key'] = (final['book'].astype('str') +
                     final['bdr'] +
                     final[['cusip', 'isin', 'Deal', 'Id']].bfill(1)['cusip'].astype(str))
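For the updated, pipe-separated key, a hedged guess (not visible in the sample itself) is that the blank cusip/isin/Deal cells are empty strings rather than NaN, so bfill has nothing to skip over and never reaches Id. A sketch that normalises the blanks first, using the column names from the update's table:
import numpy as np

cols = ['cusip', 'isin', 'Deal', 'Id']
# Treat empty or whitespace-only cells as missing so bfill can fall through to Id
tmp = df[cols].replace(r'^\s*$', np.nan, regex=True)
df['join_key'] = ("20191031|" + df['book'].astype(str) + "|" + df['bdr'] + "|"
                  + tmp.bfill(axis=1)['cusip'].astype(str))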
Try this:
import pandas as pd
import numpy as np
# setup (ignore)
final = pd.DataFrame({
'book': [17236],
'bdr': ['ETFROS'],
'cusip': [np.nan],
'isin': [np.nan],
'Deal': [np.nan],
'Id': ['8012398421'],
})
# answer: coalesce cusip -> isin -> Deal -> Id, then build the key
final['join_key'] = (final['book'].astype('str') +
                     final['bdr'] +
                     final['cusip'].fillna(final['isin']).fillna(final['Deal']).fillna(final['Id']).astype('str'))
Output
book bdr cusip isin Deal Id join_key
0 17236 ETFROS NaN NaN NaN 8012398421 17236ETFROS8012398421
+------------+-----+--------+-----+-------------+
| Meth.name | Min| Max |Layer| Global name |
+------------+-----+--------+-----+-------------+
| DTS | 2600| 3041.2 | AC1 | DTS |
| GGK | 1800| 3200.0 | AC1 | DEN |
| DTP | 700 | 3041.0 | AC2 | DT |
| DS | 700 | 3041.0 | AC3 | CALI |
| PF1 | 2800| 3012.0 | AC3 | CALI |
| PF2 | 3000| 3041.0 | AC4 | CALI |
+------------+-----+--------+-----+-------------+
We have to drop duplicated rows by the "Global name" column, but in a specific way: we want to keep the row that gives the biggest intersection with the range built from the max value of the "Min" column and the min value of the "Max" column over the non-duplicated rows.
In the example above this range is [2600.0; 3041.0], so we want to keep only the row with Meth.name == 'DS', and the overall result should be:
+------------+-----+--------+-----+-------------+
| Meth.name | Min| Max |Layer| Global name |
+------------+-----+--------+-----+-------------+
| DTS | 2600| 3041.2 | AC1 | DTS |
| GGK | 1800| 3200.0 | AC1 | DEN |
| DTP | 700 | 3041.0 | AC2 | DT |
| DS | 700 | 3041.0 | AC3 | CALI |
+------------+-----+--------+-----+-------------+
This problem can, of course, be solved in several iterations (calculate the interval from the non-duplicated rows and then iteratively select, among the duplicated rows, those that give the biggest intersection), but I'm trying to find the most efficient approach.
Thank you
If the order of the rows is not important, you can do the following:
df['diff'] = df['Max'] - df['Min']
df = df.sort_values(['Global_name', 'diff'], ascending=True)
df.drop_duplicates('Global_name', keep='last')
From this question
Here is how I will go about it:
# Global names that appear more than once
dup_global_name = df.Global_name.value_counts()[df.Global_name.value_counts() > 1].index
dup_global_name = list(dup_global_name)

# Range determined by the non-duplicated rows: [max of their Min, min of their Max]
df_nodup = df[~df.Global_name.isin(dup_global_name)]
max_of_min = df_nodup['Min'].max()   # 2600 in the example
min_of_max = df_nodup['Max'].min()   # 3041.0 in the example

# Helper function: length of the intersection of [x.Min, x.Max] with that range
def calc_overlap(x):
    low = max(max_of_min, x.Min)
    high = min(min_of_max, x.Max)
    return max(high - low, 0)

# Filter duplicates
df_dup = df[df.Global_name.isin(dup_global_name)]
# Add overlap column
df_dup['overlap'] = df_dup.apply(calc_overlap, axis=1)
# Select max overlap within each Global_name
df_dup = df_dup.loc[df_dup.groupby('Global_name').overlap.idxmax()]
# Drop overlap col
df_dup.drop('overlap', axis=1, inplace=True)
# Concatenate with the non-duplicated ones
pd.concat([df_nodup, df_dup])
The desired output matches the table shown in the question (the DTS, GGK, DTP and DS rows).
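A more vectorized variant of the same idea, as a sketch (assuming the column is named Global_name as in the answers above): clip every row's interval to the range from the non-duplicated rows, then keep the row with the largest overlap per group.
counts = df.Global_name.value_counts()
unique_mask = df.Global_name.map(counts).eq(1)

# Range from the non-duplicated rows
lo = df.loc[unique_mask, 'Min'].max()
hi = df.loc[unique_mask, 'Max'].min()

# Vectorized overlap of each [Min, Max] with [lo, hi]; negative overlaps become 0
overlap = (df['Max'].clip(upper=hi) - df['Min'].clip(lower=lo)).clip(lower=0)

# For each Global_name keep the row with the largest overlap
# (non-duplicated names keep their only row)
result = df.loc[overlap.groupby(df.Global_name).idxmax()].sort_index()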
I have 2 views as below:
experiments:
select * from experiments;
+--------+--------------------+-----------------+
| exp_id | exp_properties | value |
+--------+--------------------+-----------------+
| 1 | indicator:chemical | phenolphthalein |
| 1 | base | NaOH |
| 1 | acid | HCl |
| 1 | exp_type | titration |
| 1 | indicator:color | faint_pink |
+--------+--------------------+-----------------+
calculations:
select * from calculations;
+--------+------------------------+--------------+
| exp_id | exp_report | value |
+--------+------------------------+--------------+
| 1 | molarity:base | 0.500000000 |
| 1 | volume:acid:in_ML | 23.120000000 |
| 1 | volume:base:in_ML | 5.430000000 |
| 1 | moles:H | 0.012500000 |
| 1 | moles:OH | 0.012500000 |
| 1 | molarity:acid | 0.250000000 |
+--------+------------------------+--------------+
I managed to pivot each of these views individually as below:
experiments_pivot:
+-------+--------------------+------+------+-----------+----------------+
|exp_id | indicator:chemical | base | acid | exp_type | indicator:color|
+-------+--------------------+------+------+-----------+----------------+
| 1 | phenolphthalein | NaOH | HCl | titration | faint_pink |
+-------+--------------------+------+------+-----------+----------------+
calculations_pivot:
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
|exp_id | molarity:base | molarity:acid | moles:H | moles:OH | volume:acid:in_ML| volume:base:in_ML |
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
| 1 | 0.500000000 | 0.250000000 | 0.012500000 | 0.012500000 | 23.120000000 | 5.430000000 |
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
My question is how to get these two pivot results into a single row. The desired result is as below:
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
|exp_id | indicator:chemical | base | acid | exp_type | indicator:color|molarity:base | molarity:acid | moles:H | moles:OH | volume:acid:in_ML| volume:base:in_ML |
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
| 1 | phenolphthalein | NaOH | HCl | titration | faint_pink | 0.500000000 | 0.250000000 | 0.012500000 | 0.012500000 | 23.120000000 | 5.430000000 |
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
Database used: MySQL
Important note: each of these views can have an increasing number of rows, hence I considered "dynamic pivoting" for each view individually.
For reference, below is the prepared statement I used to pivot experiments in MySQL (and a similar statement pivots the other view):
set @sql = NULL;

SELECT
  GROUP_CONCAT(DISTINCT
    CONCAT(
      'MAX(IF(exp_properties = ''',
      exp_properties,
      ''', value, NULL)) AS ',
      CONCAT('`', exp_properties, '`')
    )
  ) INTO @sql
FROM experiments;

set @sql = concat(
  'select exp_id, ',
  @sql,
  ' from experiments group by exp_id'
);

prepare stmt from @sql;
execute stmt;
I have a spark dataframe like this:
id | Operation | Value
-----------------------------------------------------------
1 | [Date_Min, Date_Max, Device] | [148590, 148590, iphone]
2 | [Date_Min, Date_Max, Review] | [148590, 148590, Good]
3 | [Date_Min, Date_Max, Review, Device] | [148590, 148590, Bad,samsung]
The result that I expect:
id | Operation | Value |
--------------------------
1 | Date_Min | 148590 |
1 | Date_Max | 148590 |
1 | Device | iphone |
2 | Date_Min | 148590 |
2 | Date_Max | 148590 |
2 | Review | Good |
3 | Date_Min | 148590 |
3 | Date_Max | 148590 |
3 | Review | Bad |
3 | Device | samsung|
I'm using Spark 2.1.0 with pyspark. I tried this solution but it worked only for one column.
Thanks
Here is an example dataframe built from the data above. I used this solution to answer your question.
df = spark.createDataFrame(
[[1, ['Date_Min', 'Date_Max', 'Device'], ['148590', '148590', 'iphone']],
[2, ['Date_Min', 'Date_Max', 'Review'], ['148590', '148590', 'Good']],
[3, ['Date_Min', 'Date_Max', 'Review', 'Device'], ['148590', '148590', 'Bad', 'samsung']]],
schema=['id', 'l1', 'l2'])
Here, you can define a udf to zip the two lists together for each row first.
from pyspark.sql.types import *
from pyspark.sql.functions import col, udf, explode
zip_list = udf(
lambda x, y: list(zip(x, y)),
ArrayType(StructType([
StructField("first", StringType()),
StructField("second", StringType())
]))
)
Finally, you can zip the two columns together and then explode the result.
df_out = df.withColumn("tmp", zip_list('l1', 'l2')).\
withColumn("tmp", explode("tmp")).\
select('id', col('tmp.first').alias('Operation'), col('tmp.second').alias('Value'))
df_out.show()
Output
+---+---------+-------+
| id|Operation| Value|
+---+---------+-------+
| 1| Date_Min| 148590|
| 1| Date_Max| 148590|
| 1| Device| iphone|
| 2| Date_Min| 148590|
| 2| Date_Max| 148590|
| 2| Review| Good|
| 3| Date_Min| 148590|
| 3| Date_Max| 148590|
| 3| Review| Bad|
| 3| Device|samsung|
+---+---------+-------+
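On newer Spark versions the same zip-and-explode can be done without a Python udf. A hedged sketch, assuming Spark 2.4+ where arrays_zip is available (so not the asker's 2.1.0, where the udf above is the way to go), and using the l1/l2 column names from the example dataframe:
import pyspark.sql.functions as F

# arrays_zip pairs the two arrays element-wise into an array of structs whose
# field names follow the zipped column names (l1 and l2 here).
df_out = (df
    .withColumn("tmp", F.explode(F.arrays_zip("l1", "l2")))
    .select("id",
            F.col("tmp.l1").alias("Operation"),
            F.col("tmp.l2").alias("Value")))
df_out.show()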
If using the DataFrame API, then try this:
import pyspark.sql.functions as F
your_df.select("id", F.explode("Operation"), F.explode("Value")).show()