Hi, I have a rather simple task, but none of the help I've found online seems to work.
I have a data set like this:
ID    | Px_1       | Px_2
theta | 106.013676 | 102.8024788702673
Rho   | 100.002818 | 102.62640389123405
gamma | 105.360589 | 107.21999706084836
Beta  | 106.133046 | 115.40449479551263
alpha | 106.821119 | 110.54312246081719
I want to find the minimum of each row and put it in a fourth column, so that the output is, for example, 102.802 for theta, because that is the smaller of its Px_1 and Px_2 values.
I tried this, but it doesn't work; I keep getting the max value instead:
df_subset = read.set_index('ID')[['Px_1','Px_2']]
d = df_subset.min( axis=1)
Thanks
You can try this
df["min"] = df[["Px_1", "Px_2"]].min(axis=1)
Select only the columns you need, here ["Px_1", "Px_2"], and call min(axis=1) to take the minimum across each row.
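For reference, a runnable sketch on the data from the question (assuming the dataframe is called read, as in the original snippet):
import pandas as pd

read = pd.DataFrame({
    'ID': ['theta', 'Rho', 'gamma', 'Beta', 'alpha'],
    'Px_1': [106.013676, 100.002818, 105.360589, 106.133046, 106.821119],
    'Px_2': [102.8024788702673, 102.62640389123405, 107.21999706084836, 115.40449479551263, 110.54312246081719],
})

df_subset = read.set_index('ID')[['Px_1', 'Px_2']]
df_subset['min'] = df_subset.min(axis=1)   # row-wise minimum, e.g. theta -> 102.8024...
print(df_subset)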
Given the following example dataframe:
advertiser_id | name    | amount      | total               | max_total_advertiser
4061          | source1 | -434.955284 | -354882.75336200005 | -355938.53950700007
4061          | source2 | -594.012216 | -355476.76557800005 | -355938.53950700007
4061          | source3 | -461.773929 | -355938.53950700007 | -355938.53950700007
I need to sum the amount and max_total_advertiser fields in order to get the correct total value in each row, taking into account that I need this total per group partitioned by advertiser_id. (The total column in the initial dataframe is incorrect, which is why I want to calculate it correctly.)
It should be something like this:
w = Window.partitionBy("advertiser_id").orderBy("advertiser_id")

df.withColumn("total_aux",
              when(lag("advertiser_id").over(w) == col("advertiser_id"),
                   lag("total_aux").over(w) + col("amount")
              ).otherwise(col("max_total_advertiser") + col("amount")))
This lag("total_aux") is not working because the column is not generated yet, that's what I want to achieve, if it is the first row in the group, sum the columns in the same row if not sum the previous obtained value with the current amount field.
Example output:
advertiser_id| name | amount | total_aux |
4061 |source1|-434.955284|-356373.494791 |
4061 |source2|-594.012216|-356967.507007 |
4061 |source3|-461.773929|-357429.280936 |
Thanks.
I assume that name is a distinct value for each advertiser_id and your dataset is therefore sortable by name. I also assume that max_total_advertiser contains the same value for each advertiser_id. If one of those is not the case, please add a comment.
What you need is a rangeBetween window which gives you all preceding and following rows within the specified range. We will use Window.unboundedPreceding as we want to sum up all the previous values.
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
(4061, 'source1',-434.955284,-354882.75336200005, -355938.53950700007),
(4061, 'source2',-594.012216,-355476.76557800005, -345938.53950700007),
(4062, 'source1',-594.012216,-355476.76557800005, -5938.53950700007),
(4062, 'source2',-594.012216,-355476.76557800005, -5938.53950700007),
(4061, 'source3',-461.773929,-355938.53950700007, -355938.53950700007)
]
columns = ['advertiser_id','name' ,'amount', 'total', 'max_total_advertiser']
df=spark.createDataFrame(l, columns)
w = Window.partitionBy('advertiser_id').orderBy('name').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('total', F.sum('amount').over(w) + df.max_total_advertiser)
df.show()
Output:
+-------------+-------+-----------+-------------------+--------------------+
|advertiser_id| name| amount| total|max_total_advertiser|
+-------------+-------+-----------+-------------------+--------------------+
| 4062|source1|-594.012216|-6532.5517230000705| -5938.53950700007|
| 4062|source2|-594.012216| -7126.563939000071| -5938.53950700007|
| 4061|source1|-434.955284| -356373.4947910001| -355938.53950700007|
| 4061|source2|-594.012216| -346967.5070070001| -345938.53950700007|
| 4061|source3|-461.773929|-357429.28093600005| -355938.53950700007|
+-------------+-------+-----------+-------------------+--------------------+
You might be looking for the orderBy() function. Does this work?
import pyspark.sql.functions as F
from pyspark.sql import Window

df.withColumn("cumulativeSum",
              F.sum(df["amount"]).over(Window.partitionBy("advertiser_id").orderBy("amount")))
I'm trying to find the correlation between the open and close prices of 150 cryptocurrencies using pandas.
Each cryptocurrency data is stored in its own CSV file and looks something like this:
| Date                | Open       | Close      |
|---------------------|------------|------------|
| 2019-02-01 00:00:00 | 0.00001115 | 0.00001119 |
| 2019-02-01 00:05:00 | 0.00001116 | 0.00001119 |
| ...                 | ...        | ...        |
I would like to find the correlation between the Close and Open column of every cryptocurrency.
As of right now, my code looks like this:
temporary_dataframe = pandas.DataFrame()
for csv_path, coin in zip(all_csv_paths, coin_name):
    data_file = pandas.read_csv(csv_path)
    temporary_dataframe[f"Open_{coin}"] = data_file["Open"]
    temporary_dataframe[f"Close_{coin}"] = data_file["Close"]
# Create all_open based on temporary_dataframe data.
corr_file = all_open.corr()
print(corr_file.unstack().sort_values().drop_duplicates())
Here is a part of the output (the output has a shape of (43661,)):
Open_QKC_BTC Close_QKC_BTC 0.996229
Open_TNT_BTC Close_TNT_BTC 0.996312
Open_ETC_BTC Close_ETC_BTC 0.996423
The problem is that I don't want to see the following correlations:
between columns starting with Close_ and Close_ (e.g. Close_USD_BTC and Close_ETH_BTC)
between columns starting with Open_ and Open_ (e.g. Open_USD_BTC and Open_ETH_BTC)
between the same coin (e.g. Open_USD_BTC and Close_USD_BTC).
In short, the perfect output would look like this:
Open_TNT_BTC Close_QKC_BTC 0.996229
Open_ETH_BTC Close_TNT_BTC 0.996312
Open_ADA_BTC Close_ETC_BTC 0.996423
(P.S.: I'm pretty sure this is not the most elegant way to do what I'm doing. If anyone has any suggestions on how to make this script better, I would be more than happy to hear them.)
Thank you very much in advance for your help!
This is quite messy, but it at least shows you an option.
Here I am generating some random data and have made the suffixes (coin names) simpler than in your case.
import string
import numpy as np
import pandas as pd
#Generate random data
prefix = ['Open_','Close_']
suffix = string.ascii_uppercase #All uppercase letter to simulate coin-names
var1 = [None] * 100
var2 = [None] * 100
for i in range(len(var1)):
    var1[i] = prefix[np.random.randint(0, len(prefix))] + suffix[np.random.randint(0, len(suffix))]
    var2[i] = prefix[np.random.randint(0, len(prefix))] + suffix[np.random.randint(0, len(suffix))]
df = pd.DataFrame(data = {'var1': var1, 'var2':var2 })
df['DropScenario_1'] = False
df['DropScenario_2'] = False
df['DropScenario_3'] = False
df['DropScenario_Final'] = False
df['DropScenario_1'] = df.apply(lambda row: (prefix[0] in row.var1) and (prefix[0] in row.var2), axis=1)  # Both are Open_
df['DropScenario_2'] = df.apply(lambda row: (prefix[1] in row.var1) and (prefix[1] in row.var2), axis=1)  # Both are Close_
df['DropScenario_3'] = df.apply(lambda row: row.var1[-1] == row.var2[-1], axis=1)  # Both coin suffixes are the same
#Combine all scenarios
df['DropScenario_Final'] = df['DropScenario_1'] | df['DropScenario_2'] | df['DropScenario_3']
#Keep only the part of the df that we want
df = df[df['DropScenario_Final'] == False]
#Drop our messy columns
df = df.drop(['DropScenario_1','DropScenario_2','DropScenario_3','DropScenario_Final'], axis = 1)
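To tie this back to the question's correlation output, the same three drop rules can be applied directly to the unstacked correlation series via its pair labels. A rough sketch, assuming corr_file is the result of all_open.corr() as in the question's code:
pairs = corr_file.unstack().drop_duplicates().reset_index()
pairs.columns = ['var1', 'var2', 'corr']

same_prefix = pairs['var1'].str.split('_').str[0] == pairs['var2'].str.split('_').str[0]          # Open/Open or Close/Close
same_coin = pairs['var1'].str.split('_', n=1).str[1] == pairs['var2'].str.split('_', n=1).str[1]  # same coin on both sides

print(pairs[~same_prefix & ~same_coin].sort_values('corr'))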
Hope this helps.
P.S. If you find the secret key to trading bitcoins without ending up on r/wallstreetbets, I'll take 5% ;)
I'm trying to do something that seems quite easy in Excel using VLOOKUP. All the times below are of timedelta datatype. I couldn't find a solution that fits my case by googling the errors.
DF1 (below) is my main DataFrame; one of its columns is Arrival time.
+--------+------+
|Arrival | idBin|
+--------+------+
|10:01:40| nan |
|10:03:12| nan |
|10:05:55| nan |
|10:05:10| nan |
+--------+------+
DF2 (below) is my parameters DataFrame with 1k+ time ranges (manually creating a dictionary seems impractical).
+--------+--------+------+
|start |end |idBin |
+--------+--------+------+
|10:00:00|10:00:30| 1 |
|10:00:31|10:01:00| 2 |
|10:01:01|10:01:30| 3 |
|10:01:31|10:02:00| 4 |
+--------+--------+------+
What I need is to get DF2.idBin into DF1.idBin where DF1.Arrival is between DF2.start and DF2.end.
What I tried so far:
Using .loc returns ValueError: Can only compare identically-labeled Series objects:
pd.DataFrame.loc[ (df1['arrival'] >= df2['start'])
& (df1['arrival'] <= df2['end']), 'idBin'] = df2['idBin']
Using date_range(), so I could transform it into a dictionary, but it returns TypeError: Cannot convert input [0 days 10:00:00] of type <class 'pandas._libs.tslibs.timedeltas.Timedelta'> to Timestamp:
dt_range = pd.date_range(start=df2['start'].min(), end=df2['end'].max(), name=df2['idBin'])
IIUC
import numpy as np
import pandas as pd

x = pd.Series(df2['idBin'].values, pd.IntervalIndex.from_arrays(df2['start'], df2['end']))
inds = np.array([np.flatnonzero(np.array([k in z for z in x.index])) for k in df1.Arrival])
bools = [arr.size > 0 for arr in inds]
df1.loc[bools, 'idBin'] = df2.iloc[[ind[0] for ind in inds[bools]]].idBin.values
# use .values so the idBin values are kept rather than reindexed against the new index
DF2_intervals = pd.Series(DF2['idBin'].values, index=pd.IntervalIndex.from_arrays(DF2['start'], DF2['end']))
DF1['idBin'] = DF1['Arrival'].map(DF2_intervals)
You can also turn that into one line to be more efficient, should you wish to.
Let me know if that works.
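One detail to check: IntervalIndex.from_arrays builds intervals closed on the right by default, so an Arrival exactly equal to a start time (e.g. 10:00:00) would not match. If both endpoints are meant to be inclusive, as the sample ranges suggest, a sketch with closed='both' would be:
import pandas as pd

# assumption: both endpoints of each DF2 range are inclusive
intervals = pd.IntervalIndex.from_arrays(DF2['start'], DF2['end'], closed='both')
DF2_intervals = pd.Series(DF2['idBin'].values, index=intervals)
DF1['idBin'] = DF1['Arrival'].map(DF2_intervals)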
I'm not sure if there is a pre-built solution, but you can do something similar to what you tried inside a user-defined function, apply it to the Arrival column of df1, and have that output a new column.
def match_idbin(date, df2):
    # select the idBin whose [start, end] range contains the date
    idbin = df2.loc[(df2['start'] <= date) &
                    (df2['end'] >= date), 'idBin']
    return idbin.iloc[0] if not idbin.empty else None

df1['idBin'] = df1['Arrival'].apply(lambda x: match_idbin(x, df2))
I have the following table:
+---------+------------+----------------+
| IRR     | Price List | Cambridge Data |
+=========+============+================+
| '1.56%' | '0' | '6/30/1989' |
+---------+------------+----------------+
| '5.17%' | '100' | '9/30/1989' |
+---------+------------+----------------+
| '4.44%' | '0' | '12/31/1990' |
+---------+------------+----------------+
I'm trying to write a calculator that updates the Price List field by making a simple calculation. The logic is basically this:
previous price * ( 1 + IRR%)
So for the last row, the calculation would be: 100 * (1 + 4.44%) = 104.44
Since I'm using petl, I'm trying to figure out how to update a field using the value above it and a value from the same row, and then populate this across the whole Price List column. I can't seem to find a petl utility for this. Should I just write a method manually? What do you think?
Try this:
# with pass_row=True the conversion can access other values from the same row;
# IRR arrives as a string like '1.56%', so parse it before doing the arithmetic
table = etl.convert(table, 'Price List',
                    lambda row: 100 * (1 + float(row.IRR.strip('%')) / 100),
                    pass_row=True)
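Note that convert() only sees the current row, so the 100 above is just a stand-in for the previous price. If what you need is the previous row's price times (1 + the current row's IRR), petl's addfieldusingcontext may be a better fit, since its query function receives the previous, current and next rows. A rough sketch under those assumptions, with IRR and Price List as strings like '4.44%' and '100', and the field name Price_computed used purely for illustration:
import petl as etl

# hypothetical sample table mirroring the question
table = [['IRR', 'Price List', 'Cambridge Data'],
         ['1.56%', '0', '6/30/1989'],
         ['5.17%', '100', '9/30/1989'],
         ['4.44%', '0', '12/31/1990']]

def price(prv, cur, nxt):
    # previous price * (1 + IRR%); keep the listed price on the first row
    if prv is None:
        return float(cur['Price List'])
    irr = float(cur['IRR'].strip('%')) / 100
    return round(float(prv['Price List']) * (1 + irr), 2)   # e.g. 100 * 1.0444 = 104.44

table2 = etl.addfieldusingcontext(table, 'Price_computed', price)
print(etl.lookall(table2))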
I use pyspark and work with the following dataframe:
+---------+----+--------------------+-------------------+
| id| sid| values| ratio|
+---------+----+--------------------+-------------------+
| 6052791|4178|[2#2#2#2#3#3#3#3#...|0.32673267326732675|
| 57908575|4178|[2#2#2#2#3#3#3#3#...| 0.3173076923076923|
| 78836630|4178|[2#2#2#2#3#3#3#3#...| 0.782608695652174|
|109252111|4178|[2#2#2#2#3#3#3#3#...| 0.2803738317757009|
|139428308|4385|[2#2#2#3#4#4#4#4#...| 1.140625|
|173158079|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|183739386|4390|[3#2#2#3#3#2#4#4#...|0.32080419580419584|
|206815630|4178|[2#2#2#2#3#3#3#3#...|0.14782608695652175|
|242251660|4320|[2#2#2#2#3#3#3#3#...| 0.1452991452991453|
|272670796|5038|[3#2#2#2#2#2#2#3#...| 0.2648648648648649|
|297848516|4320|[2#2#2#2#3#3#3#3#...|0.12195121951219512|
|346566485|4113|[2#3#3#2#2#2#2#3#...| 0.646823138928402|
|369667874|5038|[2#2#2#2#2#2#2#3#...| 0.4546293788454067|
|374645154|4320|[2#2#2#2#3#3#3#3#...|0.34782608695652173|
|400996010|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|401594848|4178|[3#3#6#6#3#3#4#4#...| 0.7647058823529411|
|401954629|4569|[3#3#3#3#3#3#3#3#...| 0.5520833333333333|
|417115190|4320|[2#2#2#2#3#3#3#3#...| 0.6235294117647059|
|423877535|4178|[2#2#2#2#3#3#3#3#...| 0.5538461538461539|
|445523599|4320|[2#2#2#2#3#3#3#3#...| 0.1271186440677966|
+---------+----+--------------------+-------------------+
What I want is to turn each sid (such as 4178) into a column and put the rounded ratio as its row value. The result should look as follows:
+---------+------+------+------+
|       id| 4178 | 4385 | 4390 |
+---------+------+------+------+
|  6052791| 0.32 |    0 |    0 |
+---------+------+------+------+
There should be one column per distinct sid, filled with the rounded ratio for the ids where that sid appears.
If a sid does not exist for a given id, that sid column has to contain 0.
You need a column to groupby, for which I am adding a new column called sNo.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(List((6052791, 4178, 0.42673267326732675),
(6052791, 4178, 0.22673267326732675),
(6052791, 4179, 0.62673267326732675),
(6052791, 4180, 0.72673267326732675),
(6052791, 4179, 0.82673267326732675),
(6052791, 4179, 0.92673267326732675))).toDF("id", "sid", "ratio")
df.withColumn("sNo", lit(1))
.groupBy("sNo")
.pivot("sid")
.agg(min("ratio"))
.show
This returns the following output:
+---+-------------------+------------------+------------------+
|sNo| 4178| 4179| 4180|
+---+-------------------+------------------+------------------+
| 1|0.22673267326732674|0.6267326732673267|0.7267326732673267|
+---+-------------------+------------------+------------------+
That sounds like a pivot, which could be done in Spark SQL (Scala version) as follows:
scala> ratios.
groupBy("id").
pivot("sid").
agg(first("ratio")).
show
+-------+-------------------+
| id| 4178|
+-------+-------------------+
|6052791|0.32673267326732675|
+-------+-------------------+
I'm still unsure how to select the other columns (4385 and 4390 in your example). It seems that you round the ratio and search for other sids that would match.
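Since the question is about PySpark and both answers are in Scala, here is a rough PySpark sketch of the same pivot that also rounds the ratio and fills sids that do not exist for an id with 0 (assuming df is the dataframe from the question):
import pyspark.sql.functions as F

result = (df
          .withColumn('ratio', F.round('ratio', 2))   # round the ratio as requested
          .groupBy('id')
          .pivot('sid')
          .agg(F.first('ratio'))
          .fillna(0))                                  # sids missing for an id become 0
result.show()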