I have a dataframe like this:
+---+-------+------+-------+-------+
| id| prop1 | prop2| prop3|prop4 |
+---+-------+------+-------+-------+
| 1| value1|value2| value3| null|
| 2|value11| null|value13|value14|
+---+-------+------+-------+-------+
I want to get this in Python:
+-------+------------+
| id    | prop       |
+-------+------------+
| 1     | value1     |
| 1     | value2     |
| 1     | value3     |
| 1     | null       |
| 2     | value11    |
| 2     | null       |
| 2     | value13    |
| 2     | value14    |
+-------+------------+
import pandas as pd
import numpy as np

# Note: index_col='id' moves the id values into the index rather than a column
df1 = pd.read_csv(r'C:\Python27\programs\DF.csv', delimiter=',', index_col='id')
print(df1)
print('*************************************')
for i, j in df1.iterrows():
    df2 = (i, j)
    print(df2)
It seems you need to unpivot your dataframe, so use melt:
pd.melt(df, id_vars=['id'], value_vars=['prop1', 'prop2', 'prop3', 'prop4'])
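For the dataframe above, a minimal sketch of the full pipeline (assuming 'id' is read as a regular column rather than as the index, and using the column names shown in the question):

import pandas as pd

df = pd.read_csv(r'C:\Python27\programs\DF.csv', delimiter=',')

# Unpivot so each prop column becomes its own row, then keep id and the value.
# kind='mergesort' is a stable sort, so prop1..prop4 stay in order within each id.
out = (pd.melt(df, id_vars=['id'],
               value_vars=['prop1', 'prop2', 'prop3', 'prop4'],
               value_name='prop')
         .sort_values('id', kind='mergesort')[['id', 'prop']])
print(out)

This prints one (id, prop) row per original cell, with missing values shown as NaN rather than null.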
I have a pandas dataframe named idf with data from 4/19/21 to 5/19/21 for 4675 tickers with the following columns: symbol, date, open, high, low, close, vol
|index |symbol |date |open |high |low |close |vol |EMA8|EMA21|RSI3|RSI14|
|-------|-------|-----------|-------|-------|-----------|-------|-------|----|-----|----|-----|
|0 |AACG |2021-04-19 |2.85 |3.03 |2.8000 |2.99 |173000 | | | | |
|1 |AACG |2021-04-20 |2.93 |2.99 |2.7700 |2.85 |73700 | | | | |
|2 |AACG |2021-04-21 |2.82 |2.95 |2.7500 |2.76 |93200 | | | | |
|3 |AACG |2021-04-22 |2.76 |2.95 |2.7200 |2.75 |56500 | | | | |
|4 |AACG |2021-04-23 |2.75 |2.88 |2.7000 |2.84 |277700 | | | | |
|... |... |... |... |... |... |... |... | | | | |
|101873 |ZYXI |2021-05-13 |13.94 |14.13 |13.2718 |13.48 |413200 | | | | |
|101874 |ZYXI |2021-05-14 |13.61 |14.01 |13.2200 |13.87 |225200 | | | | |
|101875 |ZYXI |2021-05-17 |13.72 |14.05 |13.5500 |13.82 |183600 | | | | |
|101876 |ZYXI |2021-05-18 |13.97 |14.63 |13.8300 |14.41 |232200 | | | | |
|101877 |ZYXI |2021-05-19 |14.10 |14.26 |13.7700 |14.25 |165600 | | | | |
I would like to use ta-lib to calculate several technical indicators like EMA of length 8 and 21, and RSI of 3 and 14.
I have been doing this with the following code after uploading the file and creating a dataframe named idf:
import pandas as pd
import talib as ta

ind = pd.DataFrame()
tind = pd.DataFrame()
for ticker in idf['symbol'].unique():
    # Compute the indicators for one ticker at a time, then collect the rows
    tind['rsi3'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 3).round(2)
    tind['rsi14'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 14).round(2)
    tind['ema8'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 8).round(2)
    tind['ema21'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 21).round(2)
    ind = ind.append(tind)
    tind = tind.iloc[0:0]
idf = pd.merge(idf, ind, left_index=True, right_index=True)
Is this the most efficient way of doing this?
If not, what is the easiest and fastest way to calculate indicator values and get those calculated indicator values into the dataframe idf?
Prefer to avoid a for loop if possible.
Any help is highly appreciated.
import talib

# Compute RSI(14) per symbol; x.index keeps the result aligned with idf's rows
rsi = lambda x: talib.RSI(idf.loc[x.index, "close"], 14)
idf['rsi(14)'] = idf.groupby(['symbol']).apply(rsi).reset_index(0, drop=True)
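If you want all four indicators without the explicit loop and append, one option is to compute them per symbol inside a single groupby. This is only a sketch, not a benchmark, and it assumes import talib as ta (as in the question's code) and that close is float64, which TA-Lib requires:

import talib as ta

def add_indicators(g):
    # g is the slice of idf for one symbol; TA-Lib returns numpy arrays,
    # which assign back by position onto the group's rows
    g = g.copy()
    g['EMA8'] = ta.EMA(g['close'], 8).round(2)
    g['EMA21'] = ta.EMA(g['close'], 21).round(2)
    g['RSI3'] = ta.RSI(g['close'], 3).round(2)
    g['RSI14'] = ta.RSI(g['close'], 14).round(2)
    return g

idf = idf.groupby('symbol', group_keys=False).apply(add_indicators)

This writes the indicator columns straight into idf, so no separate merge step is needed.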
Assuming that we have two dataframes
df_1
+--------+--------+-------+-------+
| id | col1 | col2 | col3 |
+--------+--------+-------+-------+
| A | 10 | 5 | 4 |
| B | 5 | 3 | 2 |
+--------+--------+-------+-------+
and df_2
+----------+--------+---------+
| col_name | col_t | col_d |
+----------+--------+---------+
| col1 | 3.3 | 2.2 |
| col3 | 1 | 2 |
+----------+--------+---------+
What I want to achieve is to join the two tables so that only the columns that appear under df_2's col_name are kept in df_1, i.e. the desired table would be
+--------+--------+-------+
| id | col1 | col3 |
+--------+--------+-------+
| A | 10 | 4 |
| B | 5 | 2 |
+--------+--------+-------+
However, I need to perform this action only through joins and/or a dataframe transpose or pivot, if possible.
I know the above could easily be achieved by just selecting the df_1 columns that appear in df_2's col_name, but that is not what I am looking for here.
One way to do this is to collect and dedupe the values in df_2.col_name with collect_list, then pass that list of column names to df_1:
from pyspark.sql.functions import collect_list

col_list = list(set(df_2.select(collect_list("col_name")).collect()[0][0]))
list_with_id = ['id'] + col_list
df_1[list_with_id].show()
Output:
+---+----+----+
| id|col1|col3|
+---+----+----+
| A| 10| 4|
| B| 5| 2|
+---+----+----+
Is this what you're looking for? (Assuming you want something dynamic and not manually selecting columns). I'm not using joins or pivots here though.
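If the joins/transpose constraint is firm, one dynamic route is to unpivot df_1, inner-join against df_2's column names, and pivot back. A sketch, assuming the three column names shown and that the stacked values share a common type:

from pyspark.sql.functions import expr, first

# Unpivot df_1 into (id, col_name, value) rows
long_df = df_1.select(
    "id",
    expr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col_name, value)")
)

# The inner join keeps only the column names present in df_2, then pivot back
result = (long_df
          .join(df_2.select("col_name").distinct(), on="col_name", how="inner")
          .groupBy("id")
          .pivot("col_name")
          .agg(first("value")))
result.show()

Note that the stack expression still lists the columns explicitly, so this mostly moves the column selection into the unpivot step rather than removing it.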
I want to coalesce 4 columns using pandas. I've tried this:
final['join_key'] = final['book'].astype('str') + final['bdr'] + final['cusip'].fillna(final['isin']).fillna(final['Deal'].astype('str')).fillna(final['Id'])
When I use this it returns:
+-------+--------+-------+------+------+------------+------------------+
| book | bdr | cusip | isin | Deal | Id | join_key |
+-------+--------+-------+------+------+------------+------------------+
| 17236 | ETFROS | | | | 8012398421 | 17236.0ETFROSnan |
+-------+--------+-------+------+------+------------+------------------+
The Id field is not being appended to my join_key field.
Any help would be appreciated, thanks.
Update:
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| endOfDay | book | bdr | cusip | isin | Deal | Id | join_key |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| 31/10/2019 | 15 | ITOR | 371494AM7 | US371494AM77 | 161 | 8013210731 | 20191031|15|ITOR|371494AM7 |
| 31/10/2019 | 15 | ITOR | | | | 8011898573 | 20191031|15|ITOR| |
| 31/10/2019 | 15 | ITOR | | | | 8011898742 | 20191031|15|ITOR| |
| 31/10/2019 | 15 | ITOR | | | | 8011899418 | 20191031|15|ITOR| |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
df['join_key'] = ("20191031|" + df['book'].astype('str') + "|" + df['bdr'] + "|" + df[['cusip', 'isin', 'Deal', 'id']].bfill(1)['cusip'].astype(str))
For some reason this code isn't picking up Id as part of the key.
The chained fillna starting from cusip is too complicated. You can replace it with bfill:
final['join_key'] = (final['book'].astype('str') +
                     final['bdr'] +
                     final[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str))
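For the updated example, note that the frame's column is Id with a capital I, while the update's code selects 'id'; depending on the pandas version, that either raises a KeyError or quietly reindexes to an all-NaN column, which would explain why Id never makes it into the key. A small sketch of the same bfill idea against rows shaped like the update (the literal '20191031|' prefix is taken from the update's own code):

import numpy as np
import pandas as pd

# Two rows mirroring the update: one with cusip present, one with only Id
df = pd.DataFrame({
    'book': [15, 15],
    'bdr': ['ITOR', 'ITOR'],
    'cusip': ['371494AM7', np.nan],
    'isin': ['US371494AM77', np.nan],
    'Deal': ['161', np.nan],
    'Id': ['8013210731', '8011898573'],
})

# Note the capital 'Id'; if the blanks are empty strings rather than NaN,
# replace them first with df.replace('', np.nan)
df['join_key'] = ("20191031|" + df['book'].astype(str) + "|" + df['bdr'] + "|" +
                  df[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str))
print(df['join_key'])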
Try this:
import pandas as pd
import numpy as np
# setup (ignore)
final = pd.DataFrame({
'book': [17236],
'bdr': ['ETFROS'],
'cusip': [np.nan],
'isin': [np.nan],
'Deal': [np.nan],
'Id': ['8012398421'],
})
# answer
final['join_key'] = final['book'].astype('str') + final['bdr'] + final['cusip'].fillna(final['isin']).fillna(final['Deal']).fillna(final['Id']).astype('str')
Output
book bdr cusip isin Deal Id join_key
0 17236 ETFROS NaN NaN NaN 8012398421 17236ETFROS8012398421
I have the following table:
df = spark.createDataFrame([(2, 'john', 1, '131434234342'),
                            (2, 'john', 1, '10-22-2018'),
                            (3, 'pete', 8, '10-22-2018'),
                            (3, 'pete', 8, '3258958304'),
                            (5, 'steve', 9, '124324234')],
                           ['id', 'name', 'value', 'date'])
df.show()
+----+-------+-------+--------------+
| id | name | value | date |
+----+-------+-------+--------------+
| 2 | john | 1 | 131434234342 |
| 2 | john | 1 | 10-22-2018 |
| 3 | pete | 8 | 10-22-2018 |
| 3 | pete | 8 | 3258958304 |
| 5 | steve | 9 | 124324234 |
+----+-------+-------+--------------+
I want to remove all duplicated rows (rows that share the same id, name, and value, but NOT necessarily the same date), so that I end up with:
+----+-------+-------+-----------+
| id | name | value | date |
+----+-------+-------+-----------+
| 5 | steve | 9 | 124324234 |
+----+-------+-------+-----------+
How can I do this in PySpark?
You could groupBy id, name and value and filter on the count column:
df = df.groupBy('id','name','value').count().where('count = 1')
df.show()
+---+-----+-----+-----+
| id| name|value|count|
+---+-----+-----+-----+
| 5|steve| 9| 1|
+---+-----+-----+-----+
You could then drop the count column if needed.
Do a groupBy on the columns you want, count, filter where the count equals 1, and then drop the count column, like below:
import pyspark.sql.functions as f
df = df.groupBy("id", "name", "value").agg(f.count("*").alias('cnt')).where('cnt = 1').drop('cnt')
You can add the date column to the groupBy condition if you want.
Hope this helps you
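Both approaches above drop the date column because it is not part of the groupBy. If date has to survive into the result, as in the desired output, one option is a window count over the duplicate-defining columns. A sketch, assuming the dataframe defined in the question:

import pyspark.sql.functions as f
from pyspark.sql import Window

# Count rows per (id, name, value) and keep only the ones that occur once
w = Window.partitionBy('id', 'name', 'value')
df_unique = (df.withColumn('cnt', f.count('*').over(w))
               .where('cnt = 1')
               .drop('cnt'))
df_unique.show()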
I have a spark dataframe like this:
id | Operation | Value
-----------------------------------------------------------
1 | [Date_Min, Date_Max, Device] | [148590, 148590, iphone]
2 | [Date_Min, Date_Max, Review] | [148590, 148590, Good]
3 | [Date_Min, Date_Max, Review, Device] | [148590, 148590, Bad,samsung]
The result that I expect:
id | Operation | Value |
--------------------------
1 | Date_Min | 148590 |
1 | Date_Max | 148590 |
1 | Device | iphone |
2 | Date_Min | 148590 |
2 | Date_Max | 148590 |
2 | Review | Good |
3 | Date_Min | 148590 |
3 | Date_Max | 148590 |
3 | Review | Bad |
3  | Device   | samsung|
I'm using Spark 2.1.0 with pyspark. I tried this solution but it worked only for one column.
Thanks
Here is an example dataframe built from the data above, which I'll use to work through your question.
df = spark.createDataFrame(
    [[1, ['Date_Min', 'Date_Max', 'Device'], ['148590', '148590', 'iphone']],
     [2, ['Date_Min', 'Date_Max', 'Review'], ['148590', '148590', 'Good']],
     [3, ['Date_Min', 'Date_Max', 'Review', 'Device'], ['148590', '148590', 'Bad', 'samsung']]],
    schema=['id', 'l1', 'l2'])
Here, you can first define a udf that zips the two lists together for each row.
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql.functions import col, udf, explode

zip_list = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        StructField("first", StringType()),
        StructField("second", StringType())
    ]))
)
Finally, you can zip the two columns together with the udf, then explode the resulting column.
df_out = df.withColumn("tmp", zip_list('l1', 'l2')).\
    withColumn("tmp", explode("tmp")).\
    select('id', col('tmp.first').alias('Operation'), col('tmp.second').alias('Value'))
df_out.show()
Output
+---+---------+-------+
| id|Operation| Value|
+---+---------+-------+
| 1| Date_Min| 148590|
| 1| Date_Max| 148590|
| 1| Device| iphone|
| 2| Date_Min| 148590|
| 2| Date_Max| 148590|
| 2| Review| Good|
| 3| Date_Min| 148590|
| 3| Date_Max| 148590|
| 3| Review| Bad|
| 3| Device|samsung|
+---+---------+-------+
If you prefer a DataFrame-only approach without a udf, note that Spark allows only one generator (explode) per select clause, so explode each array with posexplode and join the results back on the position:
import pyspark.sql.functions as F

ops = your_df.select("id", F.posexplode("Operation").alias("pos", "Operation"))
vals = your_df.select("id", F.posexplode("Value").alias("pos", "Value"))
ops.join(vals, ["id", "pos"]).drop("pos").show()
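As a side note, if upgrading past Spark 2.1.0 is an option, Spark 2.4+ has a built-in arrays_zip that avoids both the Python udf and the positional join. A sketch, assuming the l1/l2 column names from the example dataframe above:

from pyspark.sql.functions import arrays_zip, explode, col

# Zip the two arrays element-wise into an array of structs, then explode it
df_out = (df.withColumn("tmp", explode(arrays_zip("l1", "l2")))
            .select("id",
                    col("tmp.l1").alias("Operation"),
                    col("tmp.l2").alias("Value")))
df_out.show()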