I have a dataframe like this:
+---+-------+------+-------+-------+
| id| prop1 | prop2| prop3|prop4 |
+---+-------+------+-------+-------+
| 1| value1|value2| value3| null|
| 2|value11| null|value13|value14|
+---+-------+------+-------+-------+
I want to get this in Python:
+-------+------------+
| id    | prop       |
+-------+------------+
| 1     | value1     |
| 1     | value2     |
| 1     | value3     |
| 1     | null       |
| 2     | value11    |
| 2     | null       |
| 2     | value13    |
| 2     | value14    |
+-------+------------+
import pandas as pd
import numpy as np

# Note: index_col='id' moves the id values into the index rather than a column
df1 = pd.read_csv(r'C:\Python27\programs\DF.csv', delimiter=',', index_col='id')
print(df1)
print('*************************************')
for i, j in df1.iterrows():
    df2 = (i, j)
    print(df2)
It seems you need to unpivot your dataframe, so use melt:
pd.melt(df, id_vars=['id'], value_vars=['prop1', 'prop2', 'prop3', 'prop4'])
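For the dataframe above, a minimal sketch of the full pipeline (assuming 'id' is read as a regular column rather than as the index, and using the column names shown in the question):

import pandas as pd

df = pd.read_csv(r'C:\Python27\programs\DF.csv', delimiter=',')

# Unpivot so each prop column becomes its own row, then keep id and the value.
# kind='mergesort' is a stable sort, so prop1..prop4 stay in order within each id.
out = (pd.melt(df, id_vars=['id'],
               value_vars=['prop1', 'prop2', 'prop3', 'prop4'],
               value_name='prop')
         .sort_values('id', kind='mergesort')[['id', 'prop']])
print(out)

This prints one (id, prop) row per original cell, with missing values shown as NaN rather than null.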
I have a pandas dataframe named idf with data from 4/19/21 to 5/19/21 for 4675 tickers with the following columns: symbol, date, open, high, low, close, vol
|index |symbol |date |open |high |low |close |vol |EMA8|EMA21|RSI3|RSI14|
|-------|-------|-----------|-------|-------|-----------|-------|-------|----|-----|----|-----|
|0 |AACG |2021-04-19 |2.85 |3.03 |2.8000 |2.99 |173000 | | | | |
|1 |AACG |2021-04-20 |2.93 |2.99 |2.7700 |2.85 |73700 | | | | |
|2 |AACG |2021-04-21 |2.82 |2.95 |2.7500 |2.76 |93200 | | | | |
|3 |AACG |2021-04-22 |2.76 |2.95 |2.7200 |2.75 |56500 | | | | |
|4 |AACG |2021-04-23 |2.75 |2.88 |2.7000 |2.84 |277700 | | | | |
|... |... |... |... |... |... |... |... | | | | |
|101873 |ZYXI |2021-05-13 |13.94 |14.13 |13.2718 |13.48 |413200 | | | | |
|101874 |ZYXI |2021-05-14 |13.61 |14.01 |13.2200 |13.87 |225200 | | | | |
|101875 |ZYXI |2021-05-17 |13.72 |14.05 |13.5500 |13.82 |183600 | | | | |
|101876 |ZYXI |2021-05-18 |13.97 |14.63 |13.8300 |14.41 |232200 | | | | |
|101877 |ZYXI |2021-05-19 |14.10 |14.26 |13.7700 |14.25 |165600 | | | | |
I would like to use ta-lib to calculate several technical indicators like EMA of length 8 and 21, and RSI of 3 and 14.
I have been doing this with the following code after uploading the file and creating a dataframe named idf:
import pandas as pd
import talib as ta

ind = pd.DataFrame()
tind = pd.DataFrame()
for ticker in idf['symbol'].unique():
    # Compute the indicators for one ticker at a time, then collect the rows
    tind['rsi3'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 3).round(2)
    tind['rsi14'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 14).round(2)
    tind['ema8'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 8).round(2)
    tind['ema21'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 21).round(2)
    ind = ind.append(tind)
    tind = tind.iloc[0:0]
idf = pd.merge(idf, ind, left_index=True, right_index=True)
Is this the most efficient way of doing this?
If not, what is the easiest and fastest way to calculate indicator values and get those calculated indicator values into the dataframe idf?
Prefer to avoid a for loop if possible.
Any help is highly appreciated.
import talib

# Compute RSI(14) per symbol; x.index keeps the result aligned with idf's rows
rsi = lambda x: talib.RSI(idf.loc[x.index, "close"], 14)
idf['rsi(14)'] = idf.groupby(['symbol']).apply(rsi).reset_index(0, drop=True)
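If you want all four indicators without the explicit loop and append, one option is to compute them per symbol inside a single groupby. This is only a sketch, not a benchmark, and it assumes import talib as ta (as in the question's code) and that close is float64, which TA-Lib requires:

import talib as ta

def add_indicators(g):
    # g is the slice of idf for one symbol; TA-Lib returns numpy arrays,
    # which assign back by position onto the group's rows
    g = g.copy()
    g['EMA8'] = ta.EMA(g['close'], 8).round(2)
    g['EMA21'] = ta.EMA(g['close'], 21).round(2)
    g['RSI3'] = ta.RSI(g['close'], 3).round(2)
    g['RSI14'] = ta.RSI(g['close'], 14).round(2)
    return g

idf = idf.groupby('symbol', group_keys=False).apply(add_indicators)

This writes the indicator columns straight into idf, so no separate merge step is needed.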
Assuming that we have two dataframes
df_1
+--------+--------+-------+-------+
| id | col1 | col2 | col3 |
+--------+--------+-------+-------+
| A | 10 | 5 | 4 |
| B | 5 | 3 | 2 |
+--------+--------+-------+-------+
and df_2
+----------+--------+---------+
| col_name | col_t | col_d |
+----------+--------+---------+
| col1 | 3.3 | 2.2 |
| col3 | 1 | 2 |
+----------+--------+---------+
What I want to achieve is to join the two tables so that only the columns that appear under df_2's col_name are kept in df_1, i.e. the desired table would be
+--------+--------+-------+
| id | col1 | col3 |
+--------+--------+-------+
| A | 10 | 4 |
| B | 5 | 2 |
+--------+--------+-------+
However, I need to perform this action only through joins and/or a dataframe transpose or pivot, if possible.
I know the above could easily be achieved by just selecting the df_1 columns that appear in df_2's col_name, but that is not what I am looking for here.
One way to do this is to collect and dedupe the values in df_2.col_name with collect_list, then pass that list of column names to df_1:
from pyspark.sql.functions import collect_list

col_list = list(set(df_2.select(collect_list("col_name")).collect()[0][0]))
list_with_id = ['id'] + col_list
df_1[list_with_id].show()
Output:
+---+----+----+
| id|col1|col3|
+---+----+----+
| A| 10| 4|
| B| 5| 2|
+---+----+----+
Is this what you're looking for? (Assuming you want something dynamic and not manually selecting columns). I'm not using joins or pivots here though.
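If the joins/transpose constraint is firm, one dynamic route is to unpivot df_1, inner-join against df_2's column names, and pivot back. A sketch, assuming the three column names shown and that the stacked values share a common type:

from pyspark.sql.functions import expr, first

# Unpivot df_1 into (id, col_name, value) rows
long_df = df_1.select(
    "id",
    expr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col_name, value)")
)

# The inner join keeps only the column names present in df_2, then pivot back
result = (long_df
          .join(df_2.select("col_name").distinct(), on="col_name", how="inner")
          .groupBy("id")
          .pivot("col_name")
          .agg(first("value")))
result.show()

Note that the stack expression still lists the columns explicitly, so this mostly moves the column selection into the unpivot step rather than removing it.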
I want to coalesce 4 columns using pandas. I've tried this:
final['join_key'] = final['book'].astype('str') + final['bdr'] + final['cusip'].fillna(final['isin']).fillna(final['Deal'].astype('str')).fillna(final['Id'])
When I use this it returns:
+-------+--------+-------+------+------+------------+------------------+
| book | bdr | cusip | isin | Deal | Id | join_key |
+-------+--------+-------+------+------+------------+------------------+
| 17236 | ETFROS | | | | 8012398421 | 17236.0ETFROSnan |
+-------+--------+-------+------+------+------------+------------------+
The Id field is not being appended to my join_key field.
Any help would be appreciated, thanks.
Update:
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| endOfDay | book | bdr | cusip | isin | Deal | Id | join_key |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| 31/10/2019 | 15 | ITOR | 371494AM7 | US371494AM77 | 161 | 8013210731 | 20191031|15|ITOR|371494AM7 |
| 31/10/2019 | 15 | ITOR | | | | 8011898573 | 20191031|15|ITOR| |
| 31/10/2019 | 15 | ITOR | | | | 8011898742 | 20191031|15|ITOR| |
| 31/10/2019 | 15 | ITOR | | | | 8011899418 | 20191031|15|ITOR| |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
df['join_key'] = ("20191031|" + df['book'].astype('str') + "|" + df['bdr'] + "|" + df[['cusip', 'isin', 'Deal', 'id']].bfill(1)['cusip'].astype(str))
For some reason this code isn't picking up Id as part of the key.
The chained fillna starting from cusip is too complicated. You can replace it with bfill:
final['join_key'] = (final['book'].astype('str') +
                     final['bdr'] +
                     final[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str))
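For the updated example, note that the frame's column is Id with a capital I, while the update's code selects 'id'; depending on the pandas version, that either raises a KeyError or quietly reindexes to an all-NaN column, which would explain why Id never makes it into the key. A small sketch of the same bfill idea against rows shaped like the update (the literal '20191031|' prefix is taken from the update's own code):

import numpy as np
import pandas as pd

# Two rows mirroring the update: one with cusip present, one with only Id
df = pd.DataFrame({
    'book': [15, 15],
    'bdr': ['ITOR', 'ITOR'],
    'cusip': ['371494AM7', np.nan],
    'isin': ['US371494AM77', np.nan],
    'Deal': ['161', np.nan],
    'Id': ['8013210731', '8011898573'],
})

# Note the capital 'Id'; if the blanks are empty strings rather than NaN,
# replace them first with df.replace('', np.nan)
df['join_key'] = ("20191031|" + df['book'].astype(str) + "|" + df['bdr'] + "|" +
                  df[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str))
print(df['join_key'])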
Try this:
import pandas as pd
import numpy as np
# setup (ignore)
final = pd.DataFrame({
'book': [17236],
'bdr': ['ETFROS'],
'cusip': [np.nan],
'isin': [np.nan],
'Deal': [np.nan],
'Id': ['8012398421'],
})
# answer
final['join_key'] = final['book'].astype('str') + final['bdr'] + final['cusip'].fillna(final['isin']).fillna(final['Deal']).fillna(final['Id']).astype('str')
Output
book bdr cusip isin Deal Id join_key
0 17236 ETFROS NaN NaN NaN 8012398421 17236ETFROS8012398421
I have the following table:
df = spark.createDataFrame([(2, 'john', 1, '131434234342'),
                            (2, 'john', 1, '10-22-2018'),
                            (3, 'pete', 8, '10-22-2018'),
                            (3, 'pete', 8, '3258958304'),
                            (5, 'steve', 9, '124324234')],
                           ['id', 'name', 'value', 'date'])
df.show()
+----+-------+-------+--------------+
| id | name | value | date |
+----+-------+-------+--------------+
| 2 | john | 1 | 131434234342 |
| 2 | john | 1 | 10-22-2018 |
| 3 | pete | 8 | 10-22-2018 |
| 3 | pete | 8 | 3258958304 |
| 5 | steve | 9 | 124324234 |
+----+-------+-------+--------------+
I want to remove all duplicated rows (rows that share the same id, name, and value, but NOT necessarily the same date), so that I end up with:
+----+-------+-------+-----------+
| id | name | value | date |
+----+-------+-------+-----------+
| 5 | steve | 9 | 124324234 |
+----+-------+-------+-----------+
How can I do this in PySpark?
You could groupBy id, name and value and filter on the count column:
df = df.groupBy('id','name','value').count().where('count = 1')
df.show()
+---+-----+-----+-----+
| id| name|value|count|
+---+-----+-----+-----+
| 5|steve| 9| 1|
+---+-----+-----+-----+
You could then drop the count column if needed.
Do a groupBy on the columns you want, count, filter where the count equals 1, and then drop the count column, like below:
import pyspark.sql.functions as f
df = df.groupBy("id", "name", "value").agg(f.count("*").alias('cnt')).where('cnt = 1').drop('cnt')
You can add the date column to the groupBy condition if you want.
Hope this helps you
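Both approaches above drop the date column because it is not part of the groupBy. If date has to survive into the result, as in the desired output, one option is a window count over the duplicate-defining columns. A sketch, assuming the dataframe defined in the question:

import pyspark.sql.functions as f
from pyspark.sql import Window

# Count rows per (id, name, value) and keep only the ones that occur once
w = Window.partitionBy('id', 'name', 'value')
df_unique = (df.withColumn('cnt', f.count('*').over(w))
               .where('cnt = 1')
               .drop('cnt'))
df_unique.show()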
I have a spark dataframe like this:
id | Operation | Value
-----------------------------------------------------------
1 | [Date_Min, Date_Max, Device] | [148590, 148590, iphone]
2 | [Date_Min, Date_Max, Review] | [148590, 148590, Good]
3 | [Date_Min, Date_Max, Review, Device] | [148590, 148590, Bad,samsung]
The result that I expect:
id | Operation | Value |
--------------------------
1 | Date_Min | 148590 |
1 | Date_Max | 148590 |
1 | Device | iphone |
2 | Date_Min | 148590 |
2 | Date_Max | 148590 |
2 | Review | Good |
3 | Date_Min | 148590 |
3 | Date_Max | 148590 |
3 | Review | Bad |
3  | Device   | samsung|
I'm using Spark 2.1.0 with pyspark. I tried this solution but it worked only for one column.
Thanks
Here is an example dataframe built from the data above, which I'll use to work through your question.
df = spark.createDataFrame(
    [[1, ['Date_Min', 'Date_Max', 'Device'], ['148590', '148590', 'iphone']],
     [2, ['Date_Min', 'Date_Max', 'Review'], ['148590', '148590', 'Good']],
     [3, ['Date_Min', 'Date_Max', 'Review', 'Device'], ['148590', '148590', 'Bad', 'samsung']]],
    schema=['id', 'l1', 'l2'])
Here, you can first define a udf that zips the two lists together for each row.
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql.functions import col, udf, explode

zip_list = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        StructField("first", StringType()),
        StructField("second", StringType())
    ]))
)
Finally, you can zip the two columns together with the udf, then explode the resulting column.
df_out = df.withColumn("tmp", zip_list('l1', 'l2')).\
    withColumn("tmp", explode("tmp")).\
    select('id', col('tmp.first').alias('Operation'), col('tmp.second').alias('Value'))
df_out.show()
Output
+---+---------+-------+
| id|Operation| Value|
+---+---------+-------+
| 1| Date_Min| 148590|
| 1| Date_Max| 148590|
| 1| Device| iphone|
| 2| Date_Min| 148590|
| 2| Date_Max| 148590|
| 2| Review| Good|
| 3| Date_Min| 148590|
| 3| Date_Max| 148590|
| 3| Review| Bad|
| 3| Device|samsung|
+---+---------+-------+
If you prefer a DataFrame-only approach without a udf, note that Spark allows only one generator (explode) per select clause, so explode each array with posexplode and join the results back on the position:
import pyspark.sql.functions as F

ops = your_df.select("id", F.posexplode("Operation").alias("pos", "Operation"))
vals = your_df.select("id", F.posexplode("Value").alias("pos", "Value"))
ops.join(vals, ["id", "pos"]).drop("pos").show()
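As a side note, if upgrading past Spark 2.1.0 is an option, Spark 2.4+ has a built-in arrays_zip that avoids both the Python udf and the positional join. A sketch, assuming the l1/l2 column names from the example dataframe above:

from pyspark.sql.functions import arrays_zip, explode, col

# Zip the two arrays element-wise into an array of structs, then explode it
df_out = (df.withColumn("tmp", explode(arrays_zip("l1", "l2")))
            .select("id",
                    col("tmp.l1").alias("Operation"),
                    col("tmp.l2").alias("Value")))
df_out.show()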