pandas dataframe creating columns with loop - python

I'm trying to add new columns and fill them with data using for loops: take values from the Price column and insert 1000 of them into a new DataFrame column, and after every 1000 Price values start another new column, and so on.
import pandas as pd
import matplotlib.pyplot as plt
data_frame = pd.read_csv('candle_data.csv', names=['Time', 'Symbol','Side', 'Size', 'Price','1','2','3','4','5'])
price_df = pd.DataFrame()
count_tick = 0
count_candle = 0
for price in data_frame['Price']:
    if count_tick < 1000:
        price_df[count_candle] = price
        count_tick += 1
    elif count_tick == 1000:
        count_tick = 0
        count_candle += 1
price_df.head()

It's not necessary to loop through the DataFrame; you can use slicing to achieve this, as in the sample code below. I have loaded a DataFrame with 100 rows and create column 'col3' from the first 50 rows of 'col1', and then column 'col4' from the next 50 rows of 'col1'. You can modify the code below to point to your columns and the values that you want.
import pandas as pd
import numpy as np
if __name__ == '__main__':
    col1 = np.linspace(0,100,100)
    col2 = np.linspace(100, 200, 100)
    dict = {'col1':col1,'col2':col2}
    df = pd.DataFrame(dict)
    df['col3'] = df['col1'][0:50]
    df['col4'] = df['col1'][50:100]
    print(df)
Solution 2, based on added info from the comments:
import pandas as pd
import numpy as np
if __name__ == '__main__':
    pd.set_option('display.width', 100000)
    pd.set_option('display.max_columns', 500)
    ### partition size; for this example I have taken a low volume, 20
    part_size = 20
    ## number generation for the data frame
    col1 = np.linspace(0,100,100)
    col2 = np.linspace(100, 200, 100)
    ## create initial data frame
    dict = {'col1':col1,'col2':col2}
    df = pd.DataFrame(dict)
    len = df.shape[0]
    ## tells you how many new columns you need
    rec = int(len/part_size)
    _ = {}
    ## initialize slicing variables
    low = 0
    high = part_size
    print(len)
    for i in range(rec):
        if high >= len:
            _['col_name_here{0}'.format(i)] = df[low:]['col1']
            break
        else:
            _['col_name_here{0}'.format(i)] = df[low:high]['col1']
            low = high
            high += part_size
    df = df.assign(**_)
    print(df)
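If the goal is just to cut the Price column into consecutive blocks of 1000 values, one column per block, the same result can be had without manual counters by grouping on the integer index. A minimal sketch, assuming the file and column names from the original question; the last, shorter block is padded with NaN:
import pandas as pd

data_frame = pd.read_csv('candle_data.csv',
                         names=['Time', 'Symbol', 'Side', 'Size', 'Price',
                                '1', '2', '3', '4', '5'])

chunk = 1000
prices = data_frame['Price'].reset_index(drop=True)

# Rows 0-999 get group label 0, rows 1000-1999 get label 1, and so on;
# each group becomes one column of the new frame.
price_df = pd.DataFrame({
    candle: grp.reset_index(drop=True)
    for candle, grp in prices.groupby(prices.index // chunk)
})
print(price_df.head())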

Related

Subset a DataFrame

If I have this data frame:
df = pd.DataFrame(
{"A":[45,67,12,78,92,65,89,12,34,78],
"B":["h","b","f","d","e","t","y","p","w","q"],
"C":[True,False,False,True,False,True,True,True,True,True]})
How can I select 50% of the rows, so that column "C" is True in 90% of the selected rows and False in 10% of them?
Firstly, create a DataFrame with 1000 rows:
import pandas as pd
df = pd.DataFrame(
{"A":[45,67,12,78,92,65,89,12,34,78],
"B":["h","b","f","d","e","t","y","p","w","q"],
"C":[True,False,False,True,False,True,True,True,True,True]})
df = pd.concat([df]*100)
print(df)
Secondly, get true_row_num and false_row_num:
row_num, _ = df.shape
true_row_num = int(row_num * 0.5 * 0.9)
false_row_num = int(row_num * 0.5 * 0.1)
print(true_row_num, false_row_num)
Thirdly, randomly sample true_df and false_df respectively:
true_df = df[df["C"]].sample(true_row_num)
false_df = df[~df["C"]].sample(false_row_num)
new_df = pd.concat([true_df, false_df])
new_df = new_df.sample(frac=1.0).reset_index(drop=True) # shuffle
print(new_df["C"].value_counts())
I think if you calculate the needed sizes ex ante and then perform random sampling per group it might work. Look at something like this:
new=df.query('C==True').sample(int(0.5*len(df)*0.9)).append(df.query('C==False').sample(int(0.5*len(df)*0.1)))
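Note that DataFrame.append was deprecated and later removed in pandas 2.0, so on recent versions the same sampling logic can be written with pd.concat. A sketch, assuming df is the frame built above:
import pandas as pd

# Same sizes as the one-liner: 90% True rows, 10% False rows, 50% of df overall.
new = pd.concat([
    df.query('C == True').sample(int(0.5 * len(df) * 0.9)),
    df.query('C == False').sample(int(0.5 * len(df) * 0.1)),
])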

build indicator fractals with pandas

My DataFrame looks like this:
<DATE>,<TIME>,<PRICE>
20200702,110000,207.2400000
20200702,120000,207.4400000
20200702,130000,208.2400000
20200702,140000,208.8200000
20200702,150000,208.0700000
20200702,160000,208.8100000
20200702,170000,209.4300000
20200702,180000,208.8700000
20200702,190000,210.0000000
20200702,200000,209.6900000
20200702,210000,209.8700000
20200702,220000,209.8000000
20200702,230000,209.5900000
20200703,000000,209.6000000
20200703,110000,211.1800000
20200703,120000,209.3900000
20200703,130000,209.6400000
I want to add two more boolean columns here, called 'Up Fractal' and 'Down Fractal'.
It is the stock market indicator Fractals, with period 5.
It means:
The script runs from the first row to the last.
The script takes the current row and looks at its PRICE.
The script takes the 5 previous rows and the 5 next rows.
If the PRICE of the current row is the maximum of that window, it is called an 'Up Fractal': True in the 'Up Fractal' column.
If the PRICE of the current row is the minimum of that window, it is called a 'Down Fractal': True in the 'Down Fractal' column.
On a stock market chart it looks something like this (an example from the internet, not from my DataFrame).
It is easy for me to find fractals using standard Python methods, but I need the speed of pandas.
Please help; I am very new to the pandas library.
from binance.spot import Spot
import pandas as pd
from pandas import DataFrame
import numpy as np
if __name__ == '__main__':
    cl = Spot()
    r = cl.klines("BTCUSDT", "5m", limit="100")
    df = DataFrame(r).iloc[:, :6]
    df.columns = list("tohlcv")
    # number of rows to calculate fractal
    n = 10
    df = df.astype({'t': int})
    df = df.astype({'o': float})
    df = df.astype({'h': float})
    df = df.astype({'l': float})
    df = df.astype({'c': float})
    # the first way
    df['uf'] = (df['h'] == df['h'].rolling(n+n+1, center=True).max())
    df['df'] = (df['l'] == df['l'].rolling(n+n+1, center=True).min())
    # the second way
    df['upfractal'] = np.where(df['h'] == df['h'].rolling(n+n+1, center=True).max(), True, False)
    df['downfractal'] = np.where(df['l'] == df['l'].rolling(n+n+1, center=True).min(), True, False)
    print(df)
    df.to_csv('BTC_USD.csv')
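For the single-column DataFrame from the question, the same rolling trick can be applied directly to PRICE with the period of 5 asked for. A sketch, assuming the CSV layout shown above and a placeholder file name 'prices.csv':
import pandas as pd

# 'prices.csv' is a placeholder containing the <DATE>,<TIME>,<PRICE> data above.
df = pd.read_csv('prices.csv', header=0, names=['DATE', 'TIME', 'PRICE'])

n = 5                # 5 rows before and 5 rows after the current row
window = 2 * n + 1   # total size of the centered window

# A row is an Up Fractal if its PRICE is the window maximum,
# and a Down Fractal if it is the window minimum.
df['Up Fractal'] = df['PRICE'] == df['PRICE'].rolling(window, center=True).max()
df['Down Fractal'] = df['PRICE'] == df['PRICE'].rolling(window, center=True).min()
print(df)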

Remove outlier from time series data using pandas

I have one-minute data:
# Import data
import yfinance as yf
data = yf.download(tickers="MSFT", period="7d", interval="1m")
print(data.tail())
I would like to remove observations where the minute difference is greater than the daily difference, where the day refers to the day of the minute bar. I would like to apply this rule to every column except volume. Beginning of the code:
minute_diff = data.diff()
dail_diff = data.resample('D').last().diff().median()
# here remove rows from data where minute_diff is greater than daily diff
minute_diff = data.diff().reset_index()
dail_diff = data.resample('D').last().diff().median()
cols = minute_diff.columns.to_list()
cols.remove('Datetime')
for c in cols:
    minute_diff = minute_diff[(minute_diff[c] <= dail_diff[c]) | (minute_diff[c].isnull())]
data = data.loc[minute_diff['Datetime']]
import pandas as pd
# Import data
import yfinance as yf
data = yf.download(tickers="MSFT", period="7d", interval="1m")
data_minute = data.copy()
data_minute['Date'] = data_minute.index.astype('datetime64[ns]')
data_minute['Date'] = data_minute['Date'].dt.normalize()
#Create new column for difference of current close minus previous close
data_minute['Minute Close Difference'] = data_minute['Close'] - data_minute['Close'].shift(1)
#Convert minute data to daily data
data_daily = data_minute.resample('D').agg({'Open': 'first',
                                            'High': 'max',
                                            'Low': 'min',
                                            'Close': 'last',
                                            'Adj Close': 'last',
                                            'Volume': 'sum'})
data_daily['Date'] = data_daily.index.astype('datetime64[ns]')
data_daily['Date'] = data_daily['Date'].dt.normalize()
data_daily = data_daily.set_index('Date')
#Create new column for difference of current close minus previous close
data_daily['Daily Close Difference'] = data_daily['Close'] - data_daily['Close'].shift(1)
data_minute = pd.merge(data_minute,data_daily['Daily Close Difference'],how = 'left', left_on = 'Date', right_index = True)
data_minute = data_minute[data_minute['Minute Close Difference'].abs() <= data_minute['Daily Close Difference'].abs()]
data_minute
I have found the solution:
import numpy as np
import pandas as pd

daily_diff = data.resample('D').last().dropna().diff() * 25
daily_diff['diff_date'] = daily_diff.index.strftime('%Y-%m-%d')
data_test = data.diff()
data_test['diff_date'] = data_test.index.strftime('%Y-%m-%d')
data_test_diff = pd.merge(data_test, daily_diff, on='diff_date')
data_test_final = data_test_diff.loc[(np.abs(data_test_diff['close_x']) < np.abs(data_test_diff['close_y']))]
data_test_final['close_x'].plot()
indexer = (np.abs(data_test_diff['close_x']) < np.abs(data_test_diff['close_y']))
data_final = data.loc[indexer.values, :]
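A slightly shorter variant of the same idea maps each minute bar's day onto that day's daily difference instead of merging on a string date. This is only a sketch: it keeps just the Close rule, as the found solution does, and assumes data['Close'] is a Series (single-level columns), as the answers above do:
import numpy as np
import yfinance as yf

data = yf.download(tickers="MSFT", period="7d", interval="1m")

minute_diff = data['Close'].diff()
daily_diff = data['Close'].resample('D').last().diff()   # one value per day

# Look up each minute bar's day in the daily differences.
daily_for_bar = data.index.normalize().map(daily_diff)

keep = minute_diff.abs() < np.abs(daily_for_bar)
data_final = data.loc[keep.values]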

My loop always skips the first index

Every time I create a loop, it's common to have a problem with the first item:
For example:
dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfe]
dps = []
for i in df:
I only get the values of the second dataframe.
Using this:
dfd = quandl.get("FRED/DEXBZUS")
df = [dfd]
dps = []
for i in df:
I got this:
Empty DataFrame
Columns: []
Index: []
And if I use this (repeating the first one):
dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfd, dfe]
dps = []
for i in df:
I get both dataframes correctly.
Examples :
import quandl
import pandas as pd
#import matplotlib
import matplotlib.pyplot as plt
dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfe]
dps = []
for i in df:
    df1 = i.reset_index()
    results = pd.DataFrame(df1)
    results = results.rename(columns={'Date': 'ds', 'Value': 'y'})
    dps = pd.DataFrame(dps.append(results))
    print(dps)
Empty DataFrame
Columns: []
Index: []
ds y
0 2008-01-02 2.6010
1 2008-01-03 2.5979
2 2008-01-04 2.5709
3 2008-01-07 2.6027
4 2008-01-08 2.5796
UPDATE
As Bruno suggested, it is related to this function:
dps = pd.DataFrame(dps.append(results))
How do I append all the datasets into one data frame?
If you create the dataframe like result = pd.DataFrame(df1) and don't give columns, then by default it will take the first row as the columns, and later you are renaming the columns that were created by default.
So please create it as pd.DataFrame(df1, columns=[column_list]).
Then the first row will not be skipped.
# this will print every element in df
for i in df:
    print(i)
Also,
for dfIndex, i in enumerate(df):
    print(i)
    print(dfIndex)  # this will print the index of i in df
Note that indexes start at 0, not 1.
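For reference, the skipped first result comes from the line dps = pd.DataFrame(dps.append(results)): on the first pass dps is still a Python list, list.append returns None, and pd.DataFrame(None) is an empty frame, so the first DataFrame is discarded. A common pattern is to collect the frames in a list and concatenate once after the loop; a sketch reusing the names from the question:
import pandas as pd
import quandl

dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")

frames = []
for i in [dfd, dfe]:
    df1 = i.reset_index().rename(columns={'Date': 'ds', 'Value': 'y'})
    frames.append(df1)          # list.append mutates the list and returns None

dps = pd.concat(frames, ignore_index=True)   # one DataFrame with all rows
print(dps)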

Merging 2 dataframes based on datetime column, replacing old values with new values

Sample CSV data (the actual data has a huge amount of similar rows, roughly 150,000-270,000, with different dates and times, but this sample is where the condition for df_filter2 can be met):
ID,v1,c1,v2,c2,p1,p2,p3,p4,f1,r1,r2,Time_Stamp
8301,418,13.2,34.4,136,4673,1,-1,5524.5,0,49,0,22/6/2017 05:11:00
8301,419.3,2.3,0.7,-0.9,-0.6,1,-1,946.2,0,50,0,22/6/2017 05:11:01
8301,417.7,15.2,30.3,196.5,5962,1,-1,6355,0,49,0,22/6/2017 05:11:02
8301,418.7,2.3,0.7,-0.9,-0.6,1,-1,944.7,0,50,0,22/6/2017 05:11:03
8301,419.3,3.4,53.6,10.8,580.2,1,-1,1432.8,0,49,0,22/6/2017 05:11:04
8301,417.7,13.6,30.1,170.4,5122.7,1,-1,5681.8,0,50,0,22/6/2017 05:11:05
8301,418,11.5,41.2,105,4328.2,1,-1,4796.9,0,49,0,22/6/2017 05:11:07
8301,419.7,2.3,0.8,-0.9,-0.7,1,-1,946.9,0,51,0,22/6/2017 05:11:08
8301,419.7,2.3,40.6,-0.7,-27.9,1,-1,974,0,49,0,22/6/2017 05:11:09
8301,417.4,14.9,30.4,194.4,5903.8,1,-1,6215.4,0,51,0,22/6/2017 05:11:10
8301,417.7,14.7,30.5,186.2,5682.9,1,-1,6139.5,0,49,0,22/6/2017 05:11:11
8301,418,12,31.5,141.5,4456.9,1,-1,5012.5,0,51,0,22/6/2017 05:11:12
8301,419,2.3,0.7,-1.4,-0.9,1,-1,945.4,0,49,0,22/6/2017 05:11:13
8301,419,2.3,0.7,-0.9,-0.6,1,-1,945.4,0,50,0,22/6/2017 05:11:14
8301,419.7,2.3,0.8,-0.9,-0.7,1,-1,946.9,0,50,0,22/6/2017 05:11:15
8301,419,2.3,0.7,-0.9,-0.6,1,-1,945.4,0,49,0,22/6/2017 05:11:16
8301,419,2.3,32.9,-0.2,-5.7,1,-1,972.4,0,51,0,22/6/2017 05:11:17
8301,419.3,2.3,50.3,0.3,17.3,1,-1,973.2,0,49,0,22/6/2017 05:11:18
8301,417.4,15.2,30.5,197.4,6010.5,1,-1,6350,0,50,0,22/6/2017 05:11:19
8301,418.7,2.3,0.9,-0.9,-0.7,1,-1,944.7,0,49,0,22/6/2017 05:11:20
8301,419,2.3,42.9,-0.2,-7.4,1,-1,972.4,0,50,0,22/6/2017 05:11:21
8301,417.4,13.9,30.4,180,5477.6,1,-1,5811.8,0,49,0,22/6/2017 05:11:22
8301,419.7,2.3,0.9,-0.9,-0.8,1,-1,946.9,0,50,0,22/6/2017 05:11:23
8301,418.7,2.3,0.7,-0.9,-0.6,1,-1,944.7,0,50,0,22/6/2017 05:11:24
8301,418.3,2.3,0.6,-0.9,-0.5,1,-1,943.9,0,49,0,22/6/2017 05:11:25
Python code. Note: I have to read from a .csv file, not set the data as strings in a variable.
import numpy as np
from datetime import date,time,datetime
import pandas as pd
import csv
# ------ Reading from .csv file - initial df with old values ------
df = pd.read_csv('Data.csv')
df["Time_Stamp"] = pd.to_datetime(df["Time_Stamp"]) # convert to Datetime
def getMask(start, end):
    mask = (df['Time_Stamp'] > start) & (df['Time_Stamp'] <= end)
    return mask

start = '2017-06-22 05:00:00'
end = '2017-06-22 05:20:00'
timerange = df.loc[getMask(start, end)]
df_filter = timerange[timerange["AC_Input_Current"].le(3.0)] # new df with AC_Input_Current less than or equal to 3.0
where = (df_filter[df_filter["Time_Stamp"].diff().dt.total_seconds() > 1]["Time_Stamp"] - pd.Timedelta("1s")).astype(str).tolist()
# df_filter2 below
df_filter2 = timerange[timerange["Time_Stamp"].isin(where)] # Create new df with those rows
#print(df_filter2)
df_filter2["AC_Input_Current"] = 0.0 # Set c1 to 0.0
for index, row in df_filter2.iterrows():
    values = row.astype(str).tolist()
    print(','.join(values))
Here is the output of df_filter2, where the value of c1 has been set to 0.0.
I want to merge it back into df (where all the data is), replacing the rows of df with the rows from df_filter2 that have the same Time_Stamp. How do I do this?
Output:
8301,418.0,0.0,34.4,136.0,4673.0,1,-1,5524.5,0,49,0,2017-06-22 05:11:00
8301,417.7,0.0,30.3,196.5,5962.0,1,-1,6355.0,0,49,0,2017-06-22 05:11:02
8301,418.0,0.0,41.2,105.0,4328.2,1,-1,4796.9,0,49,0,2017-06-22 05:11:07
8301,418.0,0.0,31.5,141.5,4456.9,1,-1,5012.5,0,51,0,2017-06-22 05:11:12
8301,417.4,0.0,30.5,197.4,6010.5,1,-1,6350.0,0,50,0,2017-06-22 05:11:19
8301,417.4,0.0,30.4,180.0,5477.6,1,-1,5811.8,0,49,0,2017-06-22 05:11:22
EDIT - Wanted result (rows marked '< replaced' are taken from df_filter2):
ID,v1,c1,v2,c2,p1,p2,p3,p4,f1,r1,r2,Time_Stamp
8301,418.0,0.0,34.4,136.0,4673.0,1,-1,5524.5,0,49,0,2017-06-22 05:11:00 < replaced
8301,419.3,2.3,0.7,-0.9,-0.6,1,-1,946.2,0,50,0,22/6/2017 05:11:01
8301,417.7,0.0,30.3,196.5,5962.0,1,-1,6355.0,0,49,0,2017-06-22 05:11:02 < replaced
8301,418.7,2.3,0.7,-0.9,-0.6,1,-1,944.7,0,50,0,22/6/2017 05:11:03
8301,419.3,3.4,53.6,10.8,580.2,1,-1,1432.8,0,49,0,22/6/2017 05:11:04
8301,417.7,13.6,30.1,170.4,5122.7,1,-1,5681.8,0,50,0,22/6/2017 05:11:05
8301,418.0,0.0,41.2,105.0,4328.2,1,-1,4796.9,0,49,0,2017-06-22 05:11:07 < replaced
8301,419.7,2.3,0.8,-0.9,-0.7,1,-1,946.9,0,51,0,22/6/2017 05:11:08
8301,419.7,2.3,40.6,-0.7,-27.9,1,-1,974,0,49,0,22/6/2017 05:11:09
8301,417.4,14.9,30.4,194.4,5903.8,1,-1,6215.4,0,51,0,22/6/2017 05:11:10
8301,417.7,14.7,30.5,186.2,5682.9,1,-1,6139.5,0,49,0,22/6/2017 05:11:11
8301,418.0,0.0,31.5,141.5,4456.9,1,-1,5012.5,0,51,0,2017-06-22 05:11:12 < replaced
8301,419,2.3,0.7,-1.4,-0.9,1,-1,945.4,0,49,0,22/6/2017 05:11:13
8301,419,2.3,0.7,-0.9,-0.6,1,-1,945.4,0,50,0,22/6/2017 05:11:14
8301,419.7,2.3,0.8,-0.9,-0.7,1,-1,946.9,0,50,0,22/6/2017 05:11:15
8301,419,2.3,0.7,-0.9,-0.6,1,-1,945.4,0,49,0,22/6/2017 05:11:16
8301,419,2.3,32.9,-0.2,-5.7,1,-1,972.4,0,51,0,22/6/2017 05:11:17
8301,419.3,2.3,50.3,0.3,17.3,1,-1,973.2,0,49,0,22/6/2017 05:11:18
8301,417.4,0.0,30.5,197.4,6010.5,1,-1,6350.0,0,50,0,2017-06-22 05:11:19 < replaced
8301,418.7,2.3,0.9,-0.9,-0.7,1,-1,944.7,0,49,0,22/6/2017 05:11:20
8301,419,2.3,42.9,-0.2,-7.4,1,-1,972.4,0,50,0,22/6/2017 05:11:21
8301,417.4,0.0,30.4,180.0,5477.6,1,-1,5811.8,0,49,0,2017-06-22 05:11:22 < replaced
8301,419.7,2.3,0.9,-0.9,-0.8,1,-1,946.9,0,50,0,22/6/2017 05:11:23
8301,418.7,2.3,0.7,-0.9,-0.6,1,-1,944.7,0,50,0,22/6/2017 05:11:24
8301,418.3,2.3,0.6,-0.9,-0.5,1,-1,943.9,0,49,0,22/6/2017 05:11:25
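One way to get the wanted result is to index both frames by Time_Stamp and use DataFrame.update, which overwrites the matching rows of df with the values from df_filter2. A minimal sketch, assuming df and df_filter2 as built in the code above:
# Align both frames on Time_Stamp, then overwrite matching rows of df
# with the edited rows of df_filter2 (update aligns on index and columns).
df = df.set_index('Time_Stamp')
df.update(df_filter2.set_index('Time_Stamp'))
df = df.reset_index()
print(df)
reset_index brings Time_Stamp back as the first column, so reorder the columns afterwards if the original column order matters.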
