pandas dataframe creating columns with loop - python
I'm trying to add new columns and fill them with data using for loops: take values from the Price column and insert the first 1000 of them into a new dataframe column, then after 1000 Price values start a new column for the next 1000, and so on.
import pandas as pd
import matplotlib.pyplot as plt
data_frame = pd.read_csv('candle_data.csv', names=['Time', 'Symbol','Side', 'Size', 'Price','1','2','3','4','5'])
price_df = pd.DataFrame()
count_tick = 0
count_candle = 0
for price in data_frame['Price']:
    if count_tick < 1000:
        price_df[count_candle] = price
        count_tick += 1
    elif count_tick == 1000:
        count_tick = 0
        count_candle += 1

price_df.head()
It's not necessary to loop through the data frame; you can use slicing to achieve this. Look at the sample code below. I have loaded a DataFrame with 100 rows and create column 'col3' from the first 50 rows of 'col1', and after that column 'col4' from the next 50 rows of 'col1'. You could modify the code below to point to your columns and the values that you want.
import pandas as pd
import numpy as np
if __name__ == '__main__':
    col1 = np.linspace(0, 100, 100)
    col2 = np.linspace(100, 200, 100)
    data = {'col1': col1, 'col2': col2}
    df = pd.DataFrame(data)
    df['col3'] = df['col1'][0:50]
    df['col4'] = df['col1'][50:100]
    print(df)
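A quick hedged follow-up (not part of the original answer): because pandas aligns on the index, 'col4' above ends up with NaN in rows 0-49 and the sliced values in rows 50-99. If the intent is instead to have those 50 values start at row 0, one minimal sketch is to drop the original index before assigning:

import pandas as pd
import numpy as np

col1 = np.linspace(0, 100, 100)
df = pd.DataFrame({'col1': col1})

# .to_numpy() discards the original index, so the second block of 50
# values lands in rows 0-49 of the new column instead of rows 50-99
df['col4'] = pd.Series(df['col1'][50:100].to_numpy())
print(df[['col1', 'col4']].head())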
Solution 2, based on additional info from the comments:
import pandas as pd
import numpy as np
if __name__ == '__main__':
    pd.set_option('display.width', 100000)
    pd.set_option('display.max_columns', 500)
    ### partition size; for this example I have taken a low value of 20
    part_size = 20
    ## number generation for the data frame
    col1 = np.linspace(0, 100, 100)
    col2 = np.linspace(100, 200, 100)
    ## create the initial data frame
    data = {'col1': col1, 'col2': col2}
    df = pd.DataFrame(data)
    n_rows = df.shape[0]
    ## tells you how many new columns you need
    rec = int(n_rows / part_size)
    new_cols = {}
    ## initialize slicing variables
    low = 0
    high = part_size
    print(n_rows)
    for i in range(rec):
        if high >= n_rows:
            new_cols['col_name_here{0}'.format(i)] = df[low:]['col1']
            break
        else:
            new_cols['col_name_here{0}'.format(i)] = df[low:high]['col1']
            low = high
            high += part_size
    df = df.assign(**new_cols)
    print(df)
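For completeness, here is a minimal alternative sketch (not part of the original answer) that builds the per-chunk columns with a NumPy reshape instead of slicing in a loop; it assumes the number of rows is an exact multiple of the chunk size, and the names below are purely illustrative:

import pandas as pd
import numpy as np

prices = pd.Series(np.arange(100, dtype=float))  # stand-in for data_frame['Price']
chunk = 20                                       # 1000 in the original question

n_chunks = len(prices) // chunk                  # assumes len(prices) % chunk == 0
# reshape into one row per chunk, then transpose so each chunk becomes a column
chunked = prices.to_numpy()[:n_chunks * chunk].reshape(n_chunks, chunk).T
price_df = pd.DataFrame(chunked, columns=range(n_chunks))
print(price_df.head())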
Related
Subset a DataFrame
If I have this data frame:

df = pd.DataFrame(
    {"A": [45, 67, 12, 78, 92, 65, 89, 12, 34, 78],
     "B": ["h", "b", "f", "d", "e", "t", "y", "p", "w", "q"],
     "C": [True, False, False, True, False, True, True, True, True, True]})

How can I select 50% of the rows, so that column "C" is True in 90% of the selected rows and False in 10% of them?
Firstly, create a dataframe with 1000 rows:

import pandas as pd

df = pd.DataFrame(
    {"A": [45, 67, 12, 78, 92, 65, 89, 12, 34, 78],
     "B": ["h", "b", "f", "d", "e", "t", "y", "p", "w", "q"],
     "C": [True, False, False, True, False, True, True, True, True, True]})
df = pd.concat([df] * 100)
print(df)

Secondly, get the true_row_num and false_row_num:

row_num, _ = df.shape
true_row_num = int(row_num * 0.5 * 0.9)
false_row_num = int(row_num * 0.5 * 0.1)
print(true_row_num, false_row_num)

Thirdly, randomly sample true_df and false_df respectively:

true_df = df[df["C"]].sample(true_row_num)
false_df = df[~df["C"]].sample(false_row_num)
new_df = pd.concat([true_df, false_df])
new_df = new_df.sample(frac=1.0).reset_index(drop=True)  # shuffle
print(new_df["C"].value_counts())
I think if you calculate the needed sizes ex ante and then perform random sampling per group it might work. Look at something like this:

new = df.query('C == True').sample(int(0.5 * len(df) * 0.9)).append(
    df.query('C == False').sample(int(0.5 * len(df) * 0.1)))
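As a hedged follow-up (not from either answer above), the same idea can be wrapped in a small helper so the overall fraction and the True/False split become parameters; the function name and defaults here are only illustrative:

import pandas as pd

def stratified_bool_sample(df, col, frac=0.5, true_share=0.9, random_state=None):
    """Sample `frac` of the rows so that `true_share` of them have df[col] == True."""
    n_total = int(len(df) * frac)
    n_true = int(n_total * true_share)
    n_false = n_total - n_true
    # sampling is without replacement, so each group must contain enough rows
    true_part = df[df[col]].sample(n_true, random_state=random_state)
    false_part = df[~df[col]].sample(n_false, random_state=random_state)
    return pd.concat([true_part, false_part]).sample(frac=1.0, random_state=random_state)

# usage, assuming `df` is the 1000-row frame built above
# sampled = stratified_bool_sample(df, "C")
# print(sampled["C"].value_counts(normalize=True))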
build indicator fractals with pandas
My DataFrame looks like this:

<DATE>,<TIME>,<PRICE>
20200702,110000,207.2400000
20200702,120000,207.4400000
20200702,130000,208.2400000
20200702,140000,208.8200000
20200702,150000,208.0700000
20200702,160000,208.8100000
20200702,170000,209.4300000
20200702,180000,208.8700000
20200702,190000,210.0000000
20200702,200000,209.6900000
20200702,210000,209.8700000
20200702,220000,209.8000000
20200702,230000,209.5900000
20200703,000000,209.6000000
20200703,110000,211.1800000
20200703,120000,209.3900000
20200703,130000,209.6400000

I want to add two more boolean columns called 'Up Fractal' and 'Down Fractal'. It is the stock market indicator Fractals with period 5. It means: the script runs from the first row to the last. For the current row it looks at PRICE, together with the 5 previous rows and the 5 next rows. If the PRICE of the current row is the maximum of that window, it is an 'Up Fractal' (True value in the 'Up Fractal' column). If the PRICE of the current row is the minimum, it is a 'Down Fractal' (True value in the 'Down Fractal' column). On a stock market chart it looks something like this (this is an example from the internet, not about my DataFrame). It is easy for me to find fractals using standard Python methods, but I need the speed of pandas. Help me please; I am very new to the pandas library.
from binance.spot import Spot
import pandas as pd
from pandas import DataFrame
import numpy as np

if __name__ == '__main__':
    cl = Spot()
    r = cl.klines("BTCUSDT", "5m", limit="100")

    df = DataFrame(r).iloc[:, :6]
    df.columns = list("tohlcv")

    # number of rows on each side used to calculate the fractal
    n = 10

    df = df.astype({'t': int})
    df = df.astype({'o': float})
    df = df.astype({'h': float})
    df = df.astype({'l': float})
    df = df.astype({'c': float})

    # the first way
    df['uf'] = (df['h'] == df['h'].rolling(n + n + 1, center=True).max())
    df['df'] = (df['l'] == df['l'].rolling(n + n + 1, center=True).min())

    # the second way
    df['upfractal'] = np.where(df['h'] == df['h'].rolling(n + n + 1, center=True).max(), True, False)
    df['downfractal'] = np.where(df['l'] == df['l'].rolling(n + n + 1, center=True).min(), True, False)

    print(df)
    df.to_csv('BTC_USD.csv')
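As a hedged adaptation of the same rolling idea to the DataFrame from the question (period 5, a single PRICE column; the file name below is only a placeholder), it might look like this:

import pandas as pd

# placeholder file name; the question's data has <DATE>,<TIME>,<PRICE> columns
df = pd.read_csv('prices.csv')
df.columns = ['DATE', 'TIME', 'PRICE']

n = 5  # period from the question: 5 rows before and 5 rows after
window = 2 * n + 1

# a row is an Up Fractal if its PRICE is the maximum of the centered window,
# and a Down Fractal if it is the minimum; rows near the edges stay False
df['Up Fractal'] = df['PRICE'] == df['PRICE'].rolling(window, center=True).max()
df['Down Fractal'] = df['PRICE'] == df['PRICE'].rolling(window, center=True).min()
print(df)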
Remove outlier from time series data using pandas
I have one-minute data:

# Import data
import yfinance as yf
data = yf.download(tickers="MSFT", period="7d", interval="1m")
print(data.tail())

I would like to remove observations where the minute difference is greater than the daily difference, where "day" refers to the day of the minute bar. I would like to apply this rule to every column except volume. Beginning of the code:

minute_diff = data.diff()
dail_diff = data.resample('D').last().diff().median()
# here remove rows from data where minute_diff is greater than daily diff
minute_diff = data.diff().reset_index()
dail_diff = data.resample('D').last().diff().median()
cols = minute_diff.columns.to_list()
cols.remove('Datetime')
for c in cols:
    minute_diff = minute_diff[(minute_diff[c] <= dail_diff[c]) | (minute_diff[c].isnull())]
data = data.loc[minute_diff['Datetime']]
import pandas as pd
# Import data
import yfinance as yf
data = yf.download(tickers="MSFT", period="7d", interval="1m")

data_minute = data.copy()
data_minute['Date'] = data_minute.index.astype('datetime64[ns]')
data_minute['Date'] = data_minute['Date'].dt.normalize()

# Create new column for difference of current close minus previous close
data_minute['Minute Close Difference'] = data_minute['Close'] - data_minute['Close'].shift(1)

# Convert minute data to daily data
data_daily = data_minute.resample('D').agg({'Open': 'first',
                                            'High': 'max',
                                            'Low': 'min',
                                            'Close': 'last',
                                            'Adj Close': 'last',
                                            'Volume': 'sum'})
data_daily['Date'] = data_daily.index.astype('datetime64[ns]')
data_daily['Date'] = data_daily['Date'].dt.normalize()
data_daily = data_daily.set_index('Date')

# Create new column for difference of current close minus previous close
data_daily['Daily Close Difference'] = data_daily['Close'] - data_daily['Close'].shift(1)

data_minute = pd.merge(data_minute, data_daily['Daily Close Difference'], how='left',
                       left_on='Date', right_index=True)
data_minute = data_minute[data_minute['Minute Close Difference'].abs() <= data_minute['Daily Close Difference'].abs()]
data_minute
I have found the solution:

daily_diff = data.resample('D').last().dropna().diff() * 25
daily_diff['diff_date'] = daily_diff.index.strftime('%Y-%m-%d')
data_test = data.diff()
data_test['diff_date'] = data_test.index.strftime('%Y-%m-%d')
data_test_diff = pd.merge(data_test, daily_diff, on='diff_date')
data_test_final = data_test_diff.loc[(np.abs(data_test_diff['close_x']) < np.abs(data_test_diff['close_y']))]
data_test_final['close_x'].plot()
indexer = (np.abs(data_test_diff['close_x']) < np.abs(data_test_diff['close_y']))
data_final = data.loc[indexer.values, :]
My loop always skips the first index
Every time I create a loop function, it's common to have a problem with the first item. For example:

dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfe]
dps = []
for i in df:

I just get the second dataframe's values. Using this:

dfd = quandl.get("FRED/DEXBZUS")
df = [dfd]
dps = []
for i in df:

I got this:

Empty DataFrame
Columns: []
Index: []

And if I use this (repeating the first one):

dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfd, dfe]
dps = []
for i in df:

I get both dataframes correctly. Examples:

import quandl
import pandas as pd
#import matplotlib
import matplotlib.pyplot as plt

dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfe]
dps = []
for i in df:
    df1 = i.reset_index()
    results = pd.DataFrame(df1)
    results = results.rename(columns={'Date': 'ds', 'Value': 'y'})
    dps = pd.DataFrame(dps.append(results))
    print(dps)

Empty DataFrame
Columns: []
Index: []
            ds       y
0   2008-01-02  2.6010
1   2008-01-03  2.5979
2   2008-01-04  2.5709
3   2008-01-07  2.6027
4   2008-01-08  2.5796

UPDATE

As Bruno suggested, it is related to this function:

dps = pd.DataFrame(dps.append(results))

How do I append all the datasets into one data frame?
result = pd.DataFrame(df1)

If you create a dataframe like this and don't give columns, then by default it will take the first row as the columns, and later you are renaming the columns that were created by default. So please create it as pd.DataFrame(df1, columns=[column_list]); the first row will not be skipped.
# this will print every element in df
for i in df:
    print(i)

Also:

# this will print the index of i in df
for dfIndex, i in enumerate(df):
    print(i)
    print(dfIndex)

Note that indexes start at 0, not 1.
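Since the UPDATE asks how to append everything into a single data frame, here is a minimal hedged sketch (not from either answer above): collect the renamed frames in a list and concatenate once at the end, using pd.concat rather than reassigning the result of list.append (which returns None and caused the empty DataFrame in the question):

import quandl
import pandas as pd

dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")

frames = []
for i in [dfd, dfe]:
    results = i.reset_index().rename(columns={'Date': 'ds', 'Value': 'y'})
    frames.append(results)  # list.append mutates the list; don't reassign its return value

dps = pd.concat(frames, ignore_index=True)  # one DataFrame containing both series
print(dps)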
Q: Merging 2 dataframes based on datetime column, replacing old values with new values
Sample csv data (the actual data has a huge amount of similar rows [roughly 150,000 - 270,000] and different dates and times, but the sample data is where the condition for df_filter2 can be met):

ID,v1,c1,v2,c2,p1,p2,p3,p4,f1,r1,r2,Time_Stamp
8301,418,13.2,34.4,136,4673,1,-1,5524.5,0,49,0,22/6/2017 05:11:00
8301,419.3,2.3,0.7,-0.9,-0.6,1,-1,946.2,0,50,0,22/6/2017 05:11:01
8301,417.7,15.2,30.3,196.5,5962,1,-1,6355,0,49,0,22/6/2017 05:11:02
8301,418.7,2.3,0.7,-0.9,-0.6,1,-1,944.7,0,50,0,22/6/2017 05:11:03
8301,419.3,3.4,53.6,10.8,580.2,1,-1,1432.8,0,49,0,22/6/2017 05:11:04
8301,417.7,13.6,30.1,170.4,5122.7,1,-1,5681.8,0,50,0,22/6/2017 05:11:05
8301,418,11.5,41.2,105,4328.2,1,-1,4796.9,0,49,0,22/6/2017 05:11:07
8301,419.7,2.3,0.8,-0.9,-0.7,1,-1,946.9,0,51,0,22/6/2017 05:11:08
8301,419.7,2.3,40.6,-0.7,-27.9,1,-1,974,0,49,0,22/6/2017 05:11:09
8301,417.4,14.9,30.4,194.4,5903.8,1,-1,6215.4,0,51,0,22/6/2017 05:11:10
8301,417.7,14.7,30.5,186.2,5682.9,1,-1,6139.5,0,49,0,22/6/2017 05:11:11
8301,418,12,31.5,141.5,4456.9,1,-1,5012.5,0,51,0,22/6/2017 05:11:12
8301,419,2.3,0.7,-1.4,-0.9,1,-1,945.4,0,49,0,22/6/2017 05:11:13
8301,419,2.3,0.7,-0.9,-0.6,1,-1,945.4,0,50,0,22/6/2017 05:11:14
8301,419.7,2.3,0.8,-0.9,-0.7,1,-1,946.9,0,50,0,22/6/2017 05:11:15
8301,419,2.3,0.7,-0.9,-0.6,1,-1,945.4,0,49,0,22/6/2017 05:11:16
8301,419,2.3,32.9,-0.2,-5.7,1,-1,972.4,0,51,0,22/6/2017 05:11:17
8301,419.3,2.3,50.3,0.3,17.3,1,-1,973.2,0,49,0,22/6/2017 05:11:18
8301,417.4,15.2,30.5,197.4,6010.5,1,-1,6350,0,50,0,22/6/2017 05:11:19
8301,418.7,2.3,0.9,-0.9,-0.7,1,-1,944.7,0,49,0,22/6/2017 05:11:20
8301,419,2.3,42.9,-0.2,-7.4,1,-1,972.4,0,50,0,22/6/2017 05:11:21
8301,417.4,13.9,30.4,180,5477.6,1,-1,5811.8,0,49,0,22/6/2017 05:11:22
8301,419.7,2.3,0.9,-0.9,-0.8,1,-1,946.9,0,50,0,22/6/2017 05:11:23
8301,418.7,2.3,0.7,-0.9,-0.6,1,-1,944.7,0,50,0,22/6/2017 05:11:24
8301,418.3,2.3,0.6,-0.9,-0.5,1,-1,943.9,0,49,0,22/6/2017 05:11:25

Python code (note: I have to read from a .csv file and not set the data into strings in a variable):

import numpy as np
from datetime import date, time, datetime
import pandas as pd
import csv

# ------ Reading from .csv file - initial df with old values ------
df = pd.read_csv('Data.csv')
df["Time_Stamp"] = pd.to_datetime(df["Time_Stamp"])  # convert to Datetime

def getMask(start, end):
    mask = (df['Time_Stamp'] > start) & (df['Time_Stamp'] <= end)
    return mask

start = '2017-06-22 05:00:00'
end = '2017-06-22 05:20:00'

timerange = df.loc[getMask(start, end)]

df_filter = timerange[timerange["AC_Input_Current"].le(3.0)]  # new df with less or equal to 0.5

where = (df_filter[df_filter["Time_Stamp"].diff().dt.total_seconds() > 1]
         ["Time_Stamp"] - pd.Timedelta("1s")).astype(str).tolist()

# df_filter2 below
df_filter2 = timerange[timerange["Time_Stamp"].isin(where)]  # Create new df with those
#print(df_filter2)

df_filter2["AC_Input_Current"] = 0.0  # Set c1 to 0.0

for index, row in df_filter2.iterrows():
    values = row.astype(str).tolist()
    print(','.join(values))

Here is the output of df_filter2 where the value of c1 is edited to 0.0. I want to merge it together with df - where all the data is - but replacing the rows of df with the rows from df_filter2 where the Time_Stamp is the same. How do I do this?
Output:

8301,418.0,0.0,34.4,136.0,4673.0,1,-1,5524.5,0,49,0,2017-06-22 05:11:00
8301,417.7,0.0,30.3,196.5,5962.0,1,-1,6355.0,0,49,0,2017-06-22 05:11:02
8301,418.0,0.0,41.2,105.0,4328.2,1,-1,4796.9,0,49,0,2017-06-22 05:11:07
8301,418.0,0.0,31.5,141.5,4456.9,1,-1,5012.5,0,51,0,2017-06-22 05:11:12
8301,417.4,0.0,30.5,197.4,6010.5,1,-1,6350.0,0,50,0,2017-06-22 05:11:19
8301,417.4,0.0,30.4,180.0,5477.6,1,-1,5811.8,0,49,0,2017-06-22 05:11:22

EDIT - Wanted result (rows marked "< replaced" are taken from df_filter2):

ID,v1,c1,v2,c2,p1,p2,p3,p4,f1,r1,r2,Time_Stamp
8301,418.0,0.0,34.4,136.0,4673.0,1,-1,5524.5,0,49,0,2017-06-22 05:11:00 < replaced
8301,419.3,2.3,0.7,-0.9,-0.6,1,-1,946.2,0,50,0,22/6/2017 05:11:01
8301,417.7,0.0,30.3,196.5,5962.0,1,-1,6355.0,0,49,0,2017-06-22 05:11:02 < replaced
8301,418.7,2.3,0.7,-0.9,-0.6,1,-1,944.7,0,50,0,22/6/2017 05:11:03
8301,419.3,3.4,53.6,10.8,580.2,1,-1,1432.8,0,49,0,22/6/2017 05:11:04
8301,417.7,13.6,30.1,170.4,5122.7,1,-1,5681.8,0,50,0,22/6/2017 05:11:05
8301,418.0,0.0,41.2,105.0,4328.2,1,-1,4796.9,0,49,0,2017-06-22 05:11:07 < replaced
8301,419.7,2.3,0.8,-0.9,-0.7,1,-1,946.9,0,51,0,22/6/2017 05:11:08
8301,419.7,2.3,40.6,-0.7,-27.9,1,-1,974,0,49,0,22/6/2017 05:11:09
8301,417.4,14.9,30.4,194.4,5903.8,1,-1,6215.4,0,51,0,22/6/2017 05:11:10
8301,417.7,14.7,30.5,186.2,5682.9,1,-1,6139.5,0,49,0,22/6/2017 05:11:11
8301,418.0,0.0,31.5,141.5,4456.9,1,-1,5012.5,0,51,0,2017-06-22 05:11:12 < replaced
8301,419,2.3,0.7,-1.4,-0.9,1,-1,945.4,0,49,0,22/6/2017 05:11:13
8301,419,2.3,0.7,-0.9,-0.6,1,-1,945.4,0,50,0,22/6/2017 05:11:14
8301,419.7,2.3,0.8,-0.9,-0.7,1,-1,946.9,0,50,0,22/6/2017 05:11:15
8301,419,2.3,0.7,-0.9,-0.6,1,-1,945.4,0,49,0,22/6/2017 05:11:16
8301,419,2.3,32.9,-0.2,-5.7,1,-1,972.4,0,51,0,22/6/2017 05:11:17
8301,419.3,2.3,50.3,0.3,17.3,1,-1,973.2,0,49,0,22/6/2017 05:11:18
8301,417.4,0.0,30.5,197.4,6010.5,1,-1,6350.0,0,50,0,2017-06-22 05:11:19 < replaced
8301,418.7,2.3,0.9,-0.9,-0.7,1,-1,944.7,0,49,0,22/6/2017 05:11:20
8301,419,2.3,42.9,-0.2,-7.4,1,-1,972.4,0,50,0,22/6/2017 05:11:21
8301,417.4,0.0,30.4,180.0,5477.6,1,-1,5811.8,0,49,0,2017-06-22 05:11:22 < replaced
8301,419.7,2.3,0.9,-0.9,-0.8,1,-1,946.9,0,50,0,22/6/2017 05:11:23
8301,418.7,2.3,0.7,-0.9,-0.6,1,-1,944.7,0,50,0,22/6/2017 05:11:24
8301,418.3,2.3,0.6,-0.9,-0.5,1,-1,943.9,0,49,0,22/6/2017 05:11:25
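No answer is shown above, so here is a minimal hedged sketch of one way to do the replacement (assuming Time_Stamp values are unique in both frames and have already been converted with pd.to_datetime, as in the question's code): align both frames on Time_Stamp and use DataFrame.update, which overwrites matching rows and columns while leaving everything else untouched.

import pandas as pd

# df         - full data, Time_Stamp already converted with pd.to_datetime
# df_filter2 - rows whose c1 / AC_Input_Current was set to 0.0

merged = df.set_index('Time_Stamp')
# update() overwrites matching index/column cells in place with the
# non-NaN values from df_filter2; note it may upcast integer columns to float
merged.update(df_filter2.set_index('Time_Stamp'))
merged = merged.reset_index()

# merged now contains df with the df_filter2 rows substituted in
print(merged.head(30))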