Concatenating tables with axis=1 in Orange python - python

I'm fairly new to Orange.
I'm trying to split my rows into intervals by their elevation angle (elv).
For example, splitting the 90-degree range into 8 intervals gives 90/8 = 11.25 degrees per interval.
Here's the table I'm working with:
Here's what I did originally, separating them by their elv value:
Here's the result that I want: x rows and 16 columns, i.e. one pair of columns per elv interval.
But I want this done dynamically.
I collect the rows of each interval in a list and turn each list into a table with x rows and 2 columns.
This is what I originally did:
from Orange.data.table import Table
from Orange.data import Domain, ContinuousVariable, DiscreteVariable
import numpy
import pandas as pd
from pandas import DataFrame
df = pd.DataFrame()
num = 10 #number of intervals that we want to separate our elv into.
interval = 90.00/num #separating them into degree/interval
low = 0
high = interval
table = []
first = []
second = []
for i in range(num):
    between = []
    if i != 0: #not the first run
        low = high
        high = high + interval
    for row in in_data: #Run through the whole table to see if the elv falls in between interval
        if row[0] >= low and row[0] < high:
            between.append(row)
    elv = "elv" + str(i)
    err = "err" + str(i)
    domain = Domain([ContinuousVariable.make(err)],[ContinuousVariable.make(elv)])
    data = Table.from_numpy(domain, numpy.array(between)) #a new, separate table on every pass
    print("table number ", i)
    print(data[:3])
Here's the output
But as you can see, these are separate tables, and `data` is reassigned on every pass through the loop.
I need a way to concatenate these tables along axis=1.
Even the source code for Orange3 forbids this for some reason.
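One obstacle to placing the per-interval tables side by side is that each interval generally holds a different number of rows, so the shorter ones have to be padded first. The following is a minimal, library-agnostic sketch of that idea in plain Python, not an Orange API: it assumes each row is an (elv, err) pair with 0 <= elv <= 90, and the helper names `bin_rows` and `side_by_side` are made up for illustration.

```python
from itertools import zip_longest


def bin_rows(rows, num_bins, max_angle=90.0):
    """Group (elv, err) rows into num_bins equal-width elevation intervals."""
    width = max_angle / num_bins
    bins = [[] for _ in range(num_bins)]
    for elv, err in rows:
        # Clamp so that elv == max_angle lands in the last interval.
        index = min(int(elv // width), num_bins - 1)
        bins[index].append((elv, err))
    return bins


def side_by_side(bins, fill=None):
    """Lay the bins out as columns elv0, err0, elv1, err1, ... padding short bins."""
    wide = []
    for group in zip_longest(*bins, fillvalue=(fill, fill)):
        wide_row = []
        for elv, err in group:
            wide_row.extend([elv, err])
        wide.append(wide_row)
    return wide


# Made-up sample data: (elv, err) pairs.
rows = [(5.0, 0.1), (12.0, 0.2), (80.0, 0.3), (7.5, 0.4)]
wide = side_by_side(bin_rows(rows, num_bins=8))
header = [f"{name}{i}" for i in range(8) for name in ("elv", "err")]

print(header)
for wide_row in wide:
    print(wide_row)
```

The result is a rectangular list of rows with 2 * num_bins columns, which is the shape a single wide table needs. Using `float("nan")` as the `fill` value instead of `None` may be more convenient if the rows are later converted to a numeric array.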

Related

first attempt at python, error ("IndexError: index 8 is out of bounds for axis 0 with size 8") and efficiency question

I'm learning Python; I just began last week, haven't otherwise coded for about 20 years, and was never that advanced to begin with. I got the hello world thing down. Now I'm trying to backtest FX pairs. Any help up the learning curve is appreciated, and of course I'm scouring this site while working through my Lynda videos.
I'm getting a funky error, and I'm also wondering if there are blatantly more efficient ways to loop through columns of Excel data than the way I am.
The spreadsheet being read is simple: 56 FX pairs down column A, and 8 columns across whose headers are dates; the cells in each column are the respective FX pair's closing price on that date. The strategy starts at the top of the 2nd column (so that there is a return % that can be calculated vs. the prior period), calculates period-over-period % returns for each pair, identifies the maximum, and then "goes long" that highest performer, whose performance in the subsequent period is recorded as PnL to the portfolio ("p" in the code). It loops through that until the current, most recent column is read.
The error relates to using 8 columns instead of 7: it works when I limit the loop to 7 columns but not 8. When I use 8 I get a wall of text concluding with "IndexError: index 8 is out of bounds for axis 0 with size 8". I get a similar error when I use too many rows, 56 instead of 55; I think I'm missing the bottom row.
Here's my code:
#set up imports
import pandas as pd
#import spreadsheet
x1 = pd.ExcelFile(r"C:\Users\Gamblor\Desktop\Python\test2020.xlsx")
df = pd.read_excel(x1, "Sheet1", header=1)
#define counters for loops
o = 1 # observation counter
c = 3 # column counter
r = 0 # active row counter for sorting through for max
#define identifiers for the portfolio
rpos = 0 # static row, for identifying which currency pair is in column 0 of that row
p = 100 # portfolio size starts at $100
#define the stuff we are evaluating for
pair = df.iat[r,0] # starting pair at 0,0 where each loop will begin
pair_pct_rtn = 0 # starts out at zero, becomes something at first evaluation, then gets compared to each subsequent eval
pair_pct_rtn_calc = 0 # a second version of above, for comparison to prior return
#runs a loop starting at the top to find the max period/period % return in a specific column
while (c < 8): # manually limiting this to 5 columns left to right
    while (r < 55): # i am manually limiting this to 55 data rows per the spreadsheet ... would be better if automatic
        pair_pct_rtn_calc = ((df.iat[r,c])/(df.iat[r,c-1]) - 1)
        if pair_pct_rtn_calc > pair_pct_rtn: # if its a higher return, it must be the "max" to that point
            pair = df.iat[r,0] # identifies the max pair for this column observation, so far
            pair_pct_rtn = pair_pct_rtn_calc # sets pair_pct_rtn as the new max
            rpos = r # identifies the max pair's ROW for this column observation, so far
        r = r + 1 # adds to r in order to jump down and calc the next row
    print('in obs #', o ,', ', pair ,'did best at' ,pair_pct_rtn ,'.')
    o = o + 1
    # now adjust the portfolio by however well USDMXN did in the subsequent week
    p = p * ( 1 + ((df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1))
    print('then the subsequent period it did: ',(df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1)
    print('resulting in portfolio value of', p)
    rpos = 0
    r = 0
    pair_pct_rtn = 0
    c = c + 1 # adds to c in order to move to the next period to the right
print(p)
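Since indices are labelled from 0 onwards, the 8th element you are looking for will have index 7. Your loop also reads df.iat[rpos,c+1], so once c reaches 7 it asks for index 8, which is what the error reports. Likewise, row index 55 (the 56th row) will be your last row.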
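To illustrate the off-by-one point, here is a small sketch that derives the loop bounds from the data instead of hard-coding 8 and 55. It uses a plain nested list with made-up prices rather than the asker's spreadsheet, and unlike the original loop it picks the best performer even when every return is negative. With a DataFrame, `df.shape` gives the same (rows, columns) pair.

```python
# Hypothetical price grid: one row per pair, column 0 is the pair name,
# the remaining columns are closing prices for consecutive periods.
prices = [
    ["EURUSD", 1.10, 1.12, 1.11, 1.15],
    ["USDMXN", 19.0, 19.5, 20.5, 20.0],
    ["GBPUSD", 1.30, 1.29, 1.31, 1.30],
]

n_rows = len(prices)
n_cols = len(prices[0])
portfolio = 100.0
picks = []

# Each step needs a previous column (c - 1) and a next column (c + 1),
# so the last usable value of c is n_cols - 2.
for c in range(2, n_cols - 1):
    returns = [prices[r][c] / prices[r][c - 1] - 1 for r in range(n_rows)]
    best_row = max(range(n_rows), key=returns.__getitem__)
    next_return = prices[best_row][c + 1] / prices[best_row][c] - 1
    portfolio *= 1 + next_return
    picks.append(prices[best_row][0])

print(picks)
print(portfolio)
```

Because the upper bound comes from `n_cols`, adding or removing a date column does not require touching the loop, and the `c + 1` lookup can never run past the last column.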

identify data with zero value in python

I have data in following csv format
Date,State,City,Station Code,Minimum temperature (C),Maximum temperature (C),Rainfall (mm),Evaporation (mm),Sunshine (hours),Direction of maximum wind gust,Speed of maximum wind gust (km/h),9am Temperature (C),9am relative humidity (%),3pm Temperature (C),3pm relative humidity (%)
2017-12-25,VIC,Melbourne,086338,15.1,21.4,0,8.2,10.4,S,44,17.2,57,20.7,54
2017-12-25,VIC,Bendigo,081123,11.3,26.3,0,,,ESE,46,17.2,53,25.5,25
2017-12-25,QLD,Gold Coast,040764,22.3,35.7,0,,,SE,59,29.2,53,27.7,67
2017-12-25,SA,Adelaide,023034,13.9,29.5,0,10.8,12.4,N,43,18.6,42,27.7,17
The output for VIC should be
S : 1
ESE : 1
SE : 0
N : 0
However, I am getting this output:
S : 1
ESE : 1
So I would like to know how the unique values can be used to include the other 2 missing results. Below is the program, which reads a csv file:
import pandas as pd
#read file
df = pd.read_csv('climate_data_Dec2017.csv')
#marker
value = df['Date']
date = value == "2017-12-26"
marker = df[date]
#group data
directionwise_data = marker.groupby('Direction of maximum wind gust')
count = directionwise_data.size()
numbers = count.to_dict()
for key in numbers:
    print(key, ":", numbers[key])
To begin with, I'm not sure what you're trying to get from this.
Your data sample has no "2017-12-26" records, yet your code filters on that date. If I change the date to "2017-12-25", the code produces exactly what you're expecting for that sample. So my guess is that your full data has no "2017-12-26" records for SE and N, which is why those directions never show up in the grouped result. I suggest you build the unique set of the four directions you have in your df, then count their occurrences in a slice of your dataframe for the needed date.
Or if all you want is how many records for each direction you have by date, why not just pivot it like below:
output = df.pivot_table(index='Date', columns = 'Direction of maximum wind gust', aggfunc={'Direction of maximum wind gust':'count'}, fill_value=0)
EDIT:
OK, so I wrote this real quick. It should get you what you want, but you need to tell it which date you want:
import pandas as pd
#read csv
df = pd.read_csv('climate_data_Dec2017.csv')
#specify date
neededDate = '2017-12-25'
#slice dataframe to keep needed records based on the date
subFrame = df.loc[df['Date'] == neededDate].reset_index(drop=True)
#set count to zero
d1 = 0 #'S'
d2 = 0 #'SE'
d3 = 0 #'N'
d4 = 0 #'ESE'
#loop over slice and count directions
for i, row in subFrame.iterrows():
    direction = subFrame.at[i,'Direction of maximum wind gust']
    if direction == 'S':
        d1 = d1+1
    elif direction == 'SE':
        d2 = d2+1
    elif direction == 'N':
        d3 = d3+1
    elif direction == 'ESE':
        d4 = d4+1
#print directions count
print ('S = ' + str(d1))
print ('SE = ' + str(d2))
print ('N = ' + str(d3))
print ('ESE = ' + str(d4))
S = 1
SE = 1
N = 1
ESE = 1
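The suggestion of counting over the full set of directions can be sketched without hard-coding one counter per direction. This is a minimal standard-library sketch: the sample is a trimmed copy of the question's rows (only the columns used here), and filtering on State as well as Date is an assumption based on the "output for VIC" the asker describes.

```python
import csv
import io
from collections import Counter

# Trimmed copy of the sample rows from the question.
SAMPLE = """Date,State,City,Direction of maximum wind gust
2017-12-25,VIC,Melbourne,S
2017-12-25,VIC,Bendigo,ESE
2017-12-25,QLD,Gold Coast,SE
2017-12-25,SA,Adelaide,N
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
column = "Direction of maximum wind gust"

# Every direction seen anywhere in the file, in first-seen order.
all_directions = list(dict.fromkeys(row[column] for row in rows))

# Count only the rows of interest (here: one state on one date).
selected = Counter(
    row[column]
    for row in rows
    if row["State"] == "VIC" and row["Date"] == "2017-12-25"
)

# A Counter returns 0 for keys it has not seen, so absent directions are kept.
counts = {direction: selected[direction] for direction in all_directions}
for direction, count in counts.items():
    print(direction, ":", count)
```

In pandas the same idea would probably be written as a `value_counts()` on the filtered column followed by `reindex(all_directions, fill_value=0)`, which fills in the directions that the filter removed.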

Storing Stock OHLCV Data into Their Own Lists (Python)

I'm trying to store stock data (Open, High, Low, Close, Volume), pulled by pandas_datareader, into 5 distinct lists named accordingly. I am new to Python and am wondering where I am going wrong. I got it to cycle through a one-dimensional list of integer values and assign them to each list, but am unsure of how to handle the additional dimension of the f.head output. I have twice gotten a traceback error indicating index values out of range, but know that I've made a mistake beyond simple index range.
Open, High, Low, Close, Vol = [], [], [], [], []
col_data = [Open, High, Low, Close, Vol]
stock = 'BABA'
# data period
yStart = 2017
mStart = 11
dStart = 14
yEnd = 2018
mEnd = 2
dEnd = 14
import pandas as p
p.core.common.is_list_like = p.api.types.is_list_like
import pandas_datareader.data as pdr
from datetime import datetime
start = datetime(yStart,mStart,dStart)
end = datetime(yEnd,mEnd,dEnd)
f = pdr.DataReader(stock, 'morningstar', start, end)
f.head()
a = 0
b = 0
while a < len(col_data):
    b = 0
    while b < len(f):
        cur = (f.loc[f.index[b], col_data[a]])
        col_data[a].append(cur)
        b += 1
    a += 1
I would like to ultimately be able to print the individual lists ( like print(Open) and retrieve the list of Open prices ). Any advice/additional resources that might help would be appreciated.
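One likely source of trouble in the loop above is that `col_data[a]` is one of the lists to be filled, not a column name, so it cannot serve as the column label in `f.loc`. A minimal sketch of the usual alternative follows. It assumes the frame has columns named Open, High, Low, Close and Volume, and uses a small hand-built DataFrame as a stand-in for the data reader's output.

```python
import pandas as pd

# Stand-in for the frame returned by the data reader; the column names
# and values are assumptions for illustration only.
f = pd.DataFrame(
    {
        "Open": [10.0, 11.0, 12.0],
        "High": [10.5, 11.5, 12.5],
        "Low": [9.5, 10.5, 11.5],
        "Close": [10.2, 11.2, 12.2],
        "Volume": [1000, 1100, 1200],
    },
    index=pd.to_datetime(["2017-11-14", "2017-11-15", "2017-11-16"]),
)

# Select each column by name; .tolist() turns it into a plain Python list.
Open = f["Open"].tolist()
High = f["High"].tolist()
Low = f["Low"].tolist()
Close = f["Close"].tolist()
Vol = f["Volume"].tolist()

# Or collect every column at once, keyed by column name.
columns = {name: f[name].tolist() for name in f.columns}

print(Open)
print(columns["Close"])
```

No explicit row loop is needed, because each column already holds one value per row.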

How to compare these data sets from a csv? Python 2.7

I have a project where I'm trying to create a program that will take a csv data set from www.transtats.gov which is a data set for airline flights in the US. My goal is to find the flight from one airport to another that had the worst delays overall, meaning it is the "worst flight". So far I have this:
import csv
with open('826766072_T_ONTIME.csv') as csv_infile: #import and open CSV
    reader = csv.DictReader(csv_infile)
    total_delay = 0
    flight_count = 0
    flight_numbers = []
    delay_totals = []
    dest_list = [] #create empty list of destinations
    for row in reader:
        if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
            if row['FL_NUM'] not in flight_numbers:
                flight_numbers.append(row['FL_NUM'])
            if row['DEST'] not in dest_list: #if the dest is not already in the list
                dest_list.append(row['DEST']) #append the dest to dest_list
    for number in flight_numbers:
        for row in reader:
            if row['ORIGIN'] == 'BOS': #for flights leaving BOS
                if row['FL_NUM'] == number:
                    if float(row['CANCELLED']) < 1: #if the flight is not cancelled
                        if float(row['DEP_DELAY']) >= 0: #and the delay is greater or equal to 0 (some flights had negative delay?)
                            total_delay += float(row['DEP_DELAY']) #add time of delay to total delay
                            flight_count += 1 #add the flight to total flight count
    for row in reader:
        for number in flight_numbers:
            delay_totals.append(sum(row['DEP_DELAY']))
I was thinking that I could create a list of flight numbers and a list of the total delays from those flight numbers and compare the two and see which flight had the highest delay total. What is the best way to go about comparing the two lists?
I'm not sure if I understand you correctly, but I think you should use dict for this purpose, where key is a 'FL_NUM' and value is total delay.
In general I want to eliminate loops in Python code. For files that aren't massive I'll typically read through a data file once and build up some dicts that I can analyze at the end. The below code isn't tested because I don't have the original data but follows the general pattern I would use.
Since a flight is identified by the origin, destination, and flight number I would capture them as a tuple and use that as the key in my dict.
from collections import defaultdict
flight_delays = defaultdict(list) # look this up if you aren't familiar
for row in reader:
    if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
        if float(row['CANCELLED']) < 1: #skip cancelled flights (CSV values are strings, so convert first)
            flight = (row['ORIGIN'], row['DEST'], row['FL_NUM'])
            flight_delays[flight].append(float(row['DEP_DELAY']))
# Finished reading through data, now I want to calculate average delays
worst_flight = ""
worst_delay = 0
for flight, delays in flight_delays.items():
    average_delay = sum(delays) / len(delays)
    if average_delay > worst_delay:
        worst_flight = flight[0] + " to " + flight[1] + " on FL#" + flight[2]
        worst_delay = average_delay
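The dict-based pattern above can be tried end to end on a few made-up rows. This sketch reads the data in a single pass, which also sidesteps a problem in the question's code: a `csv.DictReader` is exhausted after the first `for row in reader` loop, so later loops over the same reader see no rows. The sample values, the `worst_flight` helper name, and the choice to count early departures as zero delay are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using the column names from the question.
SAMPLE = """ORIGIN,DEST,FL_NUM,CANCELLED,DEP_DELAY
BOS,JFK,100,0.00,15.00
BOS,JFK,100,0.00,45.00
BOS,ORD,200,0.00,10.00
BOS,ORD,200,1.00,
JFK,BOS,300,0.00,90.00
"""


def worst_flight(reader, origin="BOS"):
    """Return ((origin, dest, flight number), average delay) for the worst flight."""
    delays = defaultdict(list)
    for row in reader:  # single pass over the file
        if row["ORIGIN"] != origin or float(row["CANCELLED"]) >= 1:
            continue  # wrong airport or cancelled flight
        if row["DEP_DELAY"] == "":
            continue  # no recorded delay
        key = (row["ORIGIN"], row["DEST"], row["FL_NUM"])
        # Early departures (negative delays) are counted as zero delay here.
        delays[key].append(max(float(row["DEP_DELAY"]), 0.0))
    if not delays:
        return None
    averages = {key: sum(values) / len(values) for key, values in delays.items()}
    worst_key = max(averages, key=averages.get)
    return worst_key, averages[worst_key]


result = worst_flight(csv.DictReader(io.StringIO(SAMPLE)))
print(result)
```

Whether "worst" should mean the highest average delay or the highest total delay is a judgment call; switching to totals only requires replacing the average with `sum(values)`.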
A very simple solution would be to add two new variables:
max_delay = 0
delay_flight = 0
# Replace `if float(row['DEP_DELAY']) >= 0:` with:
if float(row['DEP_DELAY']) > max_delay:
    max_delay = float(row['DEP_DELAY'])
    delay_flight = row['FL_NUM'] #save the flight number (or the row number) for reference.

How to improve efficiency in while loop by pandas

I am new to Python. In my job I often deal with large amounts of data, so I began studying Python to improve my efficiency.
My first small trial is finding the nearest distance between two sets of coordinates.
I have two files, one named "book.csv" and the other named "macro.csv".
book.csv has three columns: BookName, Longitude, Latitude; macro.csv has three columns: MacroName, Longitude, Latitude.
The purpose of the trial is to find the nearest Macro to each book. I used pandas for this, and I now get the right result, but the efficiency is a little low: with 1500 books and 200 macros, it takes about 15 seconds.
Please help me improve the efficiency. Thanks. The following is my trial code:
#import pandas lib
from pandas import Series,DataFrame
import pandas as pd
#import geopy lib, to calculate the distance between two points
import geopy.distance
#def func, to calculate the distance, input parameter: two points coordinates(Lat,Lon),return m
#note: vincenty was removed in geopy 2.0; geopy.distance.geodesic is its replacement there
def dist(coord1,coord2):
    return geopy.distance.vincenty(coord1, coord2).m
#def func, to find the nearest result: including MacroName and distance
def find_nearest_macro(df_macro,df_book):
    #Get column content from dataframe to series
    # Macro
    s_macro_name = df_macro["MacroName"]
    s_macro_Lat = df_macro["Latitude"]
    s_macro_Lon = df_macro["Longitude"]
    # Book
    s_book_name = df_book["BookName"]
    s_book_Lat = df_book["Latitude"]
    s_book_Lon = df_book["Longitude"]
    #def a empty list, used to append nearest result
    nearest_macro = []
    nearest_dist = []
    #Loop through each book
    ibook = 0
    while ibook < len(s_book_name):
        #Get the coordinate of the current book
        book_coord = (s_book_Lat[ibook],s_book_Lon[ibook])
        #Give initial value to result: distance from this book (not book 0) to the first macro
        nearest_macro_name = s_macro_name[0]
        nearest_macro_dist = dist(book_coord, (s_macro_Lat[0],s_macro_Lon[0]))
        #Loop through each Macro, Reset the loop variable
        imacro = 1
        while imacro < len(s_macro_name):
            # Get the coordinate of the current Macro
            macro_cood = (s_macro_Lat[imacro],s_macro_Lon[imacro])
            #Calculate the distance between book and macro
            tempd = dist(book_coord,macro_cood)
            #if this macro is closer
            if tempd < nearest_macro_dist:
                #Update the result
                nearest_macro_dist = tempd
                nearest_macro_name = s_macro_name[imacro]
            #Increments the loop variable
            imacro = imacro + 1
        #Done with this book, append the nearest to the result
        nearest_macro.append(nearest_macro_name)
        nearest_dist.append(nearest_macro_dist)
        # Increments the loop variable
        ibook = ibook + 1
    #return nearest macro name and distance (a tuple can return 2 results)
    return (nearest_macro,nearest_dist)
# Assign the filename:
file_macro = '.\\TestFile\\Macro.csv'
file_book = '.\\TestFile\\Book.csv'
#read content from csv to dataframe
df_macro = pd.read_csv(file_macro)
df_book = pd.read_csv(file_book)
#find the nearest macro name and distance
t_nearest_result = find_nearest_macro(df_macro,df_book)
#create a new series, convert list to Series
s_nearest_marco_name = Series(t_nearest_result[0])
s_nearest_macro_dist = Series(t_nearest_result[1])
#insert the new Series to dataframe
df_book["NearestMacro"] = s_nearest_marco_name
df_book["NearestDist"] = s_nearest_macro_dist
print(df_book.head())
# write the new df_book to a new csv file
df_book.to_csv('.\\TestFile\\nearest.csv')
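Most of the time in the code above probably goes into calling the distance function once per book/macro pair from Python-level loops. A minimal sketch of one common alternative is to compute the whole distance matrix at once with NumPy. Note that this swaps the ellipsoidal vincenty distance for the spherical haversine formula, which typically differs by a fraction of a percent and could in principle change the winner when two macros are almost equally far away. The function names and sample coordinates are made up for illustration.

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; spherical approximation


def haversine_matrix(lat1, lon1, lat2, lon2):
    """Great-circle distances in metres between every point of set 1 and set 2."""
    # Shape set 1 as a column and set 2 as a row so broadcasting yields a matrix.
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    a = np.clip(a, 0.0, 1.0)  # guard against rounding just outside [0, 1]
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def nearest(book_lat, book_lon, macro_lat, macro_lon):
    """Return (index of the nearest macro, distance in metres) for each book."""
    d = haversine_matrix(book_lat, book_lon, macro_lat, macro_lon)
    idx = d.argmin(axis=1)
    return idx, d[np.arange(len(idx)), idx]


# Made-up coordinates: two books, three macros.
book_lat, book_lon = [0.0, 10.0], [0.0, 10.0]
macro_lat, macro_lon = [0.0, 10.0, 50.0], [1.0, 10.5, 50.0]
idx, dist_m = nearest(book_lat, book_lon, macro_lat, macro_lon)
print(idx, dist_m)
```

With the asker's frames, the Latitude and Longitude columns of `df_book` and `df_macro` could be passed in directly, and `idx` could then be used to look up the matching MacroName values. For 1500 books and 200 macros the matrix has 300,000 entries, which is small enough to hold in memory.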
