I am trying (unsuccessfully) to create separate columns for embedded dictionary keys. Dict data looks like this:
{'averagePrice': 32.95,
'currentDayProfitLoss': 67.2,
'currentDayProfitLossPercentage': 0.02,
'instrument': {'assetType': 'EQUITY', 'cusip': '902104108', 'symbol': 'IIVI'},
'longQuantity': 120.0,
'marketValue': 4021.2,
'settledLongQuantity': 120.0,
'settledShortQuantity': 0.0,
'shortQuantity': 0.0}
The 'instrument' key is what I am trying to flatten into columns (i.e. assetType, cusip, symbol). Here is the code I last tried, and still no individual columns:
data = accounts_data_single
my_dict = data
headers = list(my_dict['securitiesAccount']['positions'])
dict1 = my_dict['securitiesAccount']['positions']
mypositions = pd.DataFrame(dict1)
pd.concat([mypositions.drop(['instrument'], axis=1), mypositions['instrument'].apply(pd.Series)], axis=1)
mypositions.to_csv('Amer_temp.csv')
Any suggestions are greatly appreciated
I am trying to get the nested keys/fieldnames all in columns and then all the stock positions in the rows. The above code works great except that the nested 'instrument' keys all end up in one column.
averagePrice currentDayProfitLoss ... assetType cusip symbol
22.5 500 ... Equity 013245 IIVI
450 250 ... Equity 321354 AAPL
etc
Here's a way to do this. Let's say d is your dict.
Step 1: Convert the dict to dataframe
d1 = pd.DataFrame.from_dict(d, orient='index').T.reset_index(drop=True)
Step 2: Convert the instrument column into dataframe
d2 = d1['instrument'].apply(pd.Series)
Step 3: Join the outputs of Step 1 and Step 2
df = pd.concat([d1.drop('instrument', axis=1), d2], axis=1)
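An alternative worth mentioning (a minimal sketch, assuming pandas 1.0+ and that positions is the my_dict['securitiesAccount']['positions'] list from the question) is pd.json_normalize, which flattens nested dicts in a single call:
import pandas as pd

positions = my_dict['securitiesAccount']['positions']   # list of position dicts
flat = pd.json_normalize(positions)                      # nested keys become 'instrument.assetType', etc.
flat.columns = [c.replace('instrument.', '') for c in flat.columns]   # optional: drop the prefix
flat.to_csv('Amer_temp.csv', index=False)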
Are you trying to do this:
pd.DataFrame(d).assign(**pd.DataFrame([x['instrument'] for x in d])).drop(columns='instrument')
output:
averagePrice currentDayProfitLoss currentDayProfitLossPercentage longQuantity marketValue settledLongQuantity settledShortQuantity shortQuantity assetType cusip symbol
0 32.95 67.2 0.02 120.0 4021.2 120.0 0.0 0.0 EQUITY 902104108 IIVI
1 31.95 63.2 0.01 100.0 3021.2 100.0 0.0 0.0 EQUITY 802104108 AAPL
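For context on the one-liner: pd.DataFrame([x['instrument'] for x in d]) builds a frame from just the nested instrument dicts, assign(**...) unpacks its columns onto the outer frame, and drop(columns='instrument') removes the original nested column.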
I am using an extremely large dataset with around 1.6 million individual entries for the timespan I am trying to observe (1948-1960). Here is how I load the dataset into pandas and attempt the averaging:
import pandas as pd
import pyreadr
data = pyreadr.read_r('C:/fileLocation/file.rds')
df = data[None]
df['time'] = pd.to_datetime(df['time'])
df.set_index('time', inplace=True)
df = df['1948':'1960']
print(df.info())
df_groups = df.groupby(['lat', 'lon'])['spei'].mean()
print(df_groups.head())
Here is the answer I get.
An example input/output could look as follows.
What I am trying to accomplish is to take pairs of latitude and longitude values, compute the average spei value for each pair, and then create a new pandas dataframe with those pairs and their average spei values, to be plotted later. Instead, I am getting only 5 rows of seemingly random latitude and longitude values rather than every unique pair with its average spei across all repeated lat/lon values. I've used this post to try to get some answers, but I have not been able to find a fix yet.
Thank you!
This should solve your issue:
import pandas as pd
# create sample dataframe
data = {
'lat': [40.0, 40.0, 41.0, 41.0, 42.0, 42.0],
'lon': [-105.0, -106.0, -105.0, -106.0, -105.0, -106.0],
'spei': [-1.2, -0.8, -0.5, -1.1, -1.3, -0.9]
}
df = pd.DataFrame(data)
# group by pairs of latitude and longitude and calculate the mean spei value for each pair
df_groups = df.groupby([df['lat'], df['lon']])['spei'].mean().reset_index()
df_groups.columns = ['lat', 'lon', 'spei_mean']
# print the resulting dataframe
print(df_groups)
which returns:
lat lon spei_mean
0 40.0 -106.0 -0.8
1 40.0 -105.0 -1.2
2 41.0 -106.0 -1.1
3 41.0 -105.0 -0.5
4 42.0 -106.0 -0.9
5 42.0 -105.0 -1.3
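Applied to the question's own data (a sketch assuming df has already been loaded, indexed and sliced as in the question, with lat, lon and spei columns), the same pattern would be:
df_groups = (df.groupby(['lat', 'lon'])['spei']
               .mean()
               .reset_index()
               .rename(columns={'spei': 'spei_mean'}))
print(df_groups)   # prints the full frame; df_groups.head() only shows the first 5 rows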
I have a pandas dataframe with values like the ones below, though in reality I am working with a lot more columns and historical data.
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over the columns to create a dataframe with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR,
where, for example, AUDUSD is calculated as the product of the AUD column and the USD column.
Here is what I tried:
for col in df:
    for cols in df:
        cf[col+cols] = df[col]*df[cols]
But it generates a table with unnecessary values like AUDAUD and USDUSD, or duplicate values like AUDUSD and USDAUD. I think if I could somehow make the second loop run from col+1 to the end of df, I should be able to resolve the issue, but I don't know how to do that.
Result i am looking for is a table with below columns and their values
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this :
from itertools import combinations
combos = list(combinations(df.columns, 2))
out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)
out.columns = out.columns.map("".join)
# Output :
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
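For reference, keys=combos labels each product column with the (first, second) column tuple it came from, and the final .map("".join) collapses those tuples into single names such as AUDUSD.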
I thought it intuitive that your first approach was to use an inner/outer loop, and I think this solution works in the same spirit:
# Added a Second Row for testing
df = pd.DataFrame(
{'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)
# Instantiated the Second DataFrame
cf = pd.DataFrame()
# Call the index of the columns as an integer
for i in range(len(df.columns)):
    # Increment the index + 1, so you aren't looking at the same column twice
    # Also, limit the range to the length of your columns
    for j in range(i+1, len(df.columns)):
        print(f'{df.columns[i]}' + f'{df.columns[j]}')  # VERIFY
        # Create a variable of the column names mashed together
        combine = f'{df.columns[i]}' + f'{df.columns[j]}'
        # Assign the rows to be a product of the mashed column series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]

print(cf)  # VERIFY
The console log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0
I have a sample dataframe as given below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Date': ['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26',
                 '2021-09-01 00:12:29', '2021-09-01 11:20:58', '2021-09-02 09:20:58'],
        'Name': ['xx', 'xx', NaN, 'yy', NaN, NaN],
        'Height': [174, 174, NaN, 160, NaN, NaN],
        'Weight': [74, NaN, NaN, 58, NaN, NaN],
        'Gender': [NaN, 'Male', NaN, NaN, 'Female', NaN],
        'Interests': [NaN, NaN, 'Hiking,Sports', NaN, NaN, 'Singing']}
df1 = pd.DataFrame(data)
df1
I want to combine the data present on the same date into a single row. The 'Date' column is in timestamp format. I have written some code for it; here is my TRY code:
TRY:
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: ''.join(x.dropna().astype(str)))
             .reset_index()
          ).replace('', np.nan)
This gives an output where, if there are multiple entries of the same value, the final result repeats those entries within the same row, as shown below.
Obtained Output
However, I do not want the values to be repeated if there are multiple entries. The final output should look like the image shown below.
Required Output
The first row should have 'xx' and 174.0 instead of 'xxxx' and '174.0 174.0'.
Any help is greatly appreciated. Thank you.
In your case, replace the agg join with first:
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .first()
             .reset_index()
          ).replace('', np.nan)
df_out
Out[113]:
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
Since you're only trying to keep the first available value for each column for each date, you can do:
>>> df1.groupby(["ID", pd.Grouper(key='Date', freq='D')]).agg("first").reset_index()
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
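Note that in both answers GroupBy.first() returns the first non-null value of each column within each group, which is why the repeated and missing entries collapse into a single value instead of being concatenated together.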
Here is the simplified sample dataset:
Price Signal
0 1.5
1 2.0 Buy
2 2.1
3 2.2
4 1.7 Sell
Here is the code to generate the above sample dataset for ease of reference:
price = [1.5, 2, 2.1, 2.2, 1.7]
signal = ['', 'Buy', '', '', 'Sell']
df = pd.DataFrame(zip(price,signal), columns = ['Price', 'Signal'])
Here is the task:
Assuming initial cash = 100 and stock position = 0, simulate the cash and stock position at each step based on the following code using .iterrows():
cash = 100
num_of_shares = 0
for index, row in df.iterrows():
    if row['Signal'] == 'Sell':
        if num_of_shares > 0:
            cash = num_of_shares * row['Price']
            num_of_shares = 0
    elif row['Signal'] == 'Buy':
        if cash > 0:
            num_of_shares = cash / row['Price']
            cash = 0
    df.loc[index, 'Position'] = num_of_shares
    df.loc[index, 'Cash'] = cash
Here is the result:
Price Signal Position Cash
0 1.5 0.0 100.0
1 2.0 Buy 50.0 0.0
2 2.1 50.0 0.0
3 2.2 50.0 0.0
4 1.7 Sell 0.0 85.0
Here is the question: is there any way to achieve the result faster than using .iterrows()?
You can use the to_dict() function in pandas to convert the data frame to a dictionary. Iterating over a dictionary is much faster than iterrows(); in my experience it can be on the order of 77x faster than an iterrows() loop.
for example:
Iterate over the dataframe with iterrows():
for key, value in df.iterrows():
    # key is your index
    # and value is your row
    ...
Iterate over the dataframe after converting it to a dictionary:
dict1 = df.to_dict(orient='index')
for key in dict1:
    value = dict1[key]
    # key is your index
    # value is your row
To compare the dataframe and the dictionary: suppose my dataframe has 3400 rows.
If I use pandas iterrows() it takes 2.45 sec to iterate over the full dataframe;
for the dictionary version of the same data it takes 0.13 sec.
So converting a pandas dataframe to a dictionary is good practice and an easy way to optimize this kind of code.
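As a sketch of how that could look for the Buy/Sell simulation in the question (assuming the df, cash and num_of_shares setup shown there), you can iterate over df.to_dict(orient='records') and assign the results back in one step:
records = df.to_dict(orient='records')   # list of plain row dicts

cash = 100
num_of_shares = 0
positions, cash_history = [], []
for row in records:
    if row['Signal'] == 'Sell' and num_of_shares > 0:
        cash = num_of_shares * row['Price']
        num_of_shares = 0
    elif row['Signal'] == 'Buy' and cash > 0:
        num_of_shares = cash / row['Price']
        cash = 0
    positions.append(num_of_shares)
    cash_history.append(cash)

df['Position'] = positions
df['Cash'] = cash_history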
I am having some issues applying several functions to my dataframe.
I have created some sample code to illustrate what I am trying to do. There might be a better way to do this specific operation than the way I am doing it, but I am looking for a general solution to my problem, since I am using several functions, rather than just the most efficient way to do this one thing.
Basically, I have one sample dataframe that looks like this (df1):
Ticker Date High Volume
0 AAPL 20200501 1.5 150
1 AAPL 20200501 1.2 100
2 AAPL 20200501 1.3 150
3 AAPL 20200502 1.4 130
4 AAPL 20200502 1.2 170
5 AAPL 20200502 1.1 160
6 TSLA 20200501 2.5 250
7 TSLA 20200501 2.2 200
8 TSLA 20200501 2.3 250
9 TSLA 20200502 2.4 230
10 TSLA 20200502 2.2 270
11 TSLA 20200502 2.1 260
and one sample dataframe that looks like this (df2):
Ticker Date Price SumVol
0 AAPL 20200508 1.2 0
1 TSLA 20200508 2.2 0
The values in the column 'SumVol' in df2 should be filled with the sum of the values in the 'Volume' column from df1, up until the first time the value in the 'Price' (df1) column is seen in df2, and where the date in df1 matches the date from df2.
desired output:
Ticker Date Price SumVol
0 AAPL 20200508 1.2 300
1 TSLA 20200508 2.2 500
For some reason I am unable to get this output, probably because I am doing something wrong in the line of code where I am trying to apply the function to the dataframe. I hope that someone here can help me out.
Full sample code including sample dataframes:
import pandas as pd
df1 = pd.DataFrame({'Ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'TSLA', 'TSLA', 'TSLA', 'TSLA', 'TSLA', 'TSLA'],
'Date': [20200501, 20200501, 20200501, 20200502, 20200502, 20200502, 20200501, 20200501, 20200501, 20200502, 20200502, 20200502],
'High': [1.5, 1.2, 1.3, 1.4, 1.2, 1.1, 2.5, 2.2, 2.3, 2.4, 2.2, 2.1],
'Volume': [150, 100, 150, 130, 170, 160, 250, 200, 250, 230, 270, 260]})
print(df1)
df2 = pd.DataFrame({'Ticker': ['AAPL', 'TSLA'],
'Date': [20200501, 20200502],
'Price': [1.4, 2.2],
'SumVol': [0,0]})
print(df2)
def VolSum(ticker, date, price):
    df11 = pd.DataFrame(df1)
    df11 = df11[df11['Ticker'] == ticker]
    df11 = df11[df11['Date'] == date]
    df11 = df11[df11['High'] < price]
    df11 = pd.DataFrame(df11)
    return df11.Volume.sum
df2['SumVol'].apply(VolSum(df2['Ticker'], df2['Date'], df2['Price']), inplace=True).reset_index(drop=True, inplace=True)
print(df2)
The first reason for the failure is that your function ends with
return df11.Volume.sum (without parentheses),
so you return the sum method itself, not the result of calling it.
Another reason is that you can apply a function to e.g. each row of a DataFrame,
but then you must pass the axis=1 parameter. In that case:
the function to be applied should take one parameter - the current row,
and its result can be assigned to the desired column.
And the third reason for the failure is that df2 contains e.g. dates not present
in df1, so you are not likely to find any matching rows.
How to get the expected result - Method 1
First, df2 must contain values that are likely to be matched with df1.
I defined df2 as:
Ticker Date Price SumVol
0 AAPL 20200501 1.4 0
1 TSLA 20200502 2.3 0
Then I changed your function to:
def VolSum(row):
    df11 = pd.DataFrame(df1)
    df11 = df11[df11['Ticker'] == row.Ticker]
    df11 = df11[df11['Date'] == row.Date]
    df11 = df11[df11['High'] < row.Price]
    return df11.Volume.sum()
And finally I generated the result as:
df2['SumVol'] = df2.apply(VolSum, axis=1)
The result is:
Ticker Date Price SumVol
0 AAPL 20200501 1.4 250
1 TSLA 20200502 2.3 530
How to get the expected result - Method 2
But a more concise and elegant method is to define the summing function as:
def VolSum2(row):
    return df1.query('Ticker == @row.Ticker and '
                     'Date == @row.Date and High < @row.Price').Volume.sum()
And apply it just the same way:
df2['SumVol'] = df2.apply(VolSum2, axis=1)
The result is of course the same.
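A closing note on Method 2: in DataFrame.query, names prefixed with @ are resolved from the surrounding Python scope, so @row.Ticker, @row.Date and @row.Price refer to the fields of the row object passed in by apply.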