identify data with zero value in python - python

I have data in the following CSV format:
Date,State,City,Station Code,Minimum temperature (C),Maximum temperature (C),Rainfall (mm),Evaporation (mm),Sunshine (hours),Direction of maximum wind gust,Speed of maximum wind gust (km/h),9am Temperature (C),9am relative humidity (%),3pm Temperature (C),3pm relative humidity (%)
2017-12-25,VIC,Melbourne,086338,15.1,21.4,0,8.2,10.4,S,44,17.2,57,20.7,54
2017-12-25,VIC,Bendigo,081123,11.3,26.3,0,,,ESE,46,17.2,53,25.5,25
2017-12-25,QLD,Gold Coast,040764,22.3,35.7,0,,,SE,59,29.2,53,27.7,67
2017-12-25,SA,Adelaide,023034,13.9,29.5,0,10.8,12.4,N,43,18.6,42,27.7,17
The output for VIC should be
S : 1
ESE : 1
SE : 0
N : 0
However, I am getting this output:
S : 1
ESE : 1
I would like to know how a unique() function can be used to include the other 2 missing results. Below is the program, which reads a csv file:
import pandas as pd
#read file
df = pd.read_csv('climate_data_Dec2017.csv')
#marker
value = df['Date']
date = value == "2017-12-26"
marker = df[date]
#group data
directionwise_data = marker.groupby('Direction of maximum wind gust')
count = directionwise_data.size()
numbers = count.to_dict()
for key in numbers:
    print(key, ":", numbers[key])

To begin with, I'm not sure what you're trying to get from this.
Your data sample has no "2017-12-26" records, yet your code filters on that date. For the sample I changed the date to "2017-12-25" just to see what the code produces, and it prints a count for every direction present on that date. So I guess that in your full data there are no "2017-12-26" records with SE or N, which is why those directions are never grouped: groupby only reports values that occur in the filtered rows. I suggest you build the unique set of directions in your df, then count their occurrences in a slice of your dataframe for the needed date.
Or, if all you want is how many records you have for each direction by date, why not just pivot it like below:
output = df.pivot_table(index='Date', columns = 'Direction of maximum wind gust', aggfunc={'Direction of maximum wind gust':'count'}, fill_value=0)
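The "unique set of directions, then count in a slice" idea can be sketched roughly as follows. This is only an illustration: the inline CSV is a trimmed copy of the sample rows, and filtering on both the date and State == "VIC" is an assumption based on the expected output in the question.

```python
import io
import pandas as pd

# Trimmed copy of the sample rows from the question (only the columns used here).
csv_text = """Date,State,City,Direction of maximum wind gust
2017-12-25,VIC,Melbourne,S
2017-12-25,VIC,Bendigo,ESE
2017-12-25,QLD,Gold Coast,SE
2017-12-25,SA,Adelaide,N
"""
df = pd.read_csv(io.StringIO(csv_text))
col = "Direction of maximum wind gust"

# Every direction that appears anywhere in the file, in order of first appearance.
all_directions = df[col].dropna().unique()

# Assumption: the wanted slice is one date and one state.
vic = df[(df["Date"] == "2017-12-25") & (df["State"] == "VIC")]

# reindex adds the directions that are missing from the slice, with a count of 0.
counts = vic[col].value_counts().reindex(all_directions, fill_value=0)

for direction, n in counts.items():
    print(direction, ":", n)
```

The key step is reindex(..., fill_value=0): value_counts only knows about values present in the slice, so the full list of directions has to come from the unfiltered frame.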
EDIT:
OK, so I wrote this real quick. It should get you what you want, but you need to tell it which date you want:
import pandas as pd
#read csv
df = pd.read_csv('climate_data_Dec2017.csv')
#specify date
neededDate = '2017-12-25'
#slice dataframe to keep needed records based on the date
subFrame = df.loc[df['Date'] == neededDate].reset_index(drop=True)
#set count to zero
d1 = 0 #'S'
d2 = 0 #'SE'
d3 = 0 #'N'
d4 = 0 #'ESE'
#loop over slice and count directions
for i, row in subFrame.iterrows():
    direction = subFrame.at[i,'Direction of maximum wind gust']
    if direction == 'S':
        d1 = d1+1
    elif direction == 'SE':
        d2 = d2+1
    elif direction == 'N':
        d3 = d3+1
    elif direction == 'ESE':
        d4 = d4+1
#print directions count
print ('S = ' + str(d1))
print ('SE = ' + str(d2))
print ('N = ' + str(d3))
print ('ESE = ' + str(d4))
S = 1
SE = 1
N = 1
ESE = 1

Related

Concatenating tables with axis=1 in Orange python

I'm fairly new to Orange.
I'm trying to separate rows of angle (elv) into intervals.
Let's say, if I want to separate my 90-degree angle into 8 intervals, or 90/8 = 11.25 degrees per interval.
Here's the table I'm working with
Here's what I did originally, separating them by their elv value
Here's the result that I want, x rows 16 columns separated by their elv value.
But I want them done dynamically.
I list them out and turn each list into a table with x rows and 2 columns.
This is what I originally did
from Orange.data.table import Table
from Orange.data import Domain, ContinuousVariable, DiscreteVariable
import numpy
import pandas as pd
from pandas import DataFrame
df = pd.DataFrame()
num = 10 #number of intervals that we want to separate our elv into.
interval = 90.00/num #separating them into degree/interval
low = 0
high = interval
table = []
first = []
second = []
for i in range(num):
    between = []
    if i != 0: #not the first run
        low = high
        high = high + interval
    for row in in_data: #Run through the whole table to see if the elv falls in between interval
        if row[0] >= low and row[0] < high:
            between.append(row)
    elv = "elv" + str(i)
    err = "err" + str(i)
    domain = Domain([ContinuousVariable.make(err)],[ContinuousVariable.make(elv)])
    data = Table.from_numpy(domain, numpy.array(between))
    print("table number ", i)
    print(data[:3])
Here's the output
But as you can see, these are separate tables, reassigned on every loop iteration.
And I have to find a way to concatenate these tables along axis = 1.
Even the source code for Orange3 forbids this for some reason.
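One possible workaround, sketched here in pandas rather than with Orange tables, is to do the binning and the side-by-side concatenation in a DataFrame first. The column names "elv" and "err", the helper name split_into_interval_columns, and the sample values are all assumptions for illustration; converting the result back to an Orange Table is left out.

```python
import numpy as np
import pandas as pd

def split_into_interval_columns(df, num_intervals, max_angle=90.0):
    """Spread rows into elvN/errN column pairs, one pair per elevation interval."""
    edges = np.linspace(0.0, max_angle, num_intervals + 1)
    # right=False gives half-open bins [low, high), like the loop in the question.
    bin_index = pd.cut(df["elv"], bins=edges, right=False, labels=False)
    pieces = []
    for i in range(num_intervals):
        part = df.loc[bin_index == i, ["elv", "err"]].reset_index(drop=True)
        part.columns = [f"elv{i}", f"err{i}"]
        pieces.append(part)
    # axis=1 places the pieces side by side; shorter pieces are padded with NaN.
    return pd.concat(pieces, axis=1)

# Hypothetical sample data: elevation angle and an error value per row.
sample = pd.DataFrame({"elv": [5.0, 12.0, 50.0, 89.0, 3.0],
                       "err": [0.1, 0.2, 0.3, 0.4, 0.5]})
wide = split_into_interval_columns(sample, num_intervals=8)
print(wide.shape)
```

Because the intervals hold different numbers of rows, the wide table is padded with NaN, which is probably the reason a plain column-wise concatenation of unequal tables is awkward.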

What is the best way to make the data as stationary & inverse transform in time series - Python

I did the 1st differencing as the time series is not stationary.
When I apply the inverse transformation, some values come out negative, because diff() produces negative values. Is there a way to sort this out and bring the data back to its original scale, as close as possible to the expected result?
This is my Python code. Is there a way to fix it, or an alternative approach to making the data stationary and forecasting the series?
count = 0
def invert_transformation(df_train, df_forecast):
    """Revert back the differencing to get the forecast to original scale."""
    df_fc = df_forecast.copy()
    columns = df_train.columns
    if count > 0: # For 1st differencing
        print("Enter into invert transformation")
        for col in columns:
            df_fc[str(col)+'_f'] = df_train[col].iloc[-1] + df_fc[str(col)+'_f'].cumsum()
    print("df_fc: \n", df_fc)
    return df_fc
# Since the data is not stationary, I did the 1st difference
df_differenced = df_train.diff().dropna()
count = count + 1 #increase the count
count
....
....
model = VAR(df_differenced)
....
fc = model_fitted.forecast(y=forecast_input, steps=10)
df_forecast2 = pd.DataFrame(fc, index=df2.index[-nobs:], columns=df2.columns + '_f')
df_results = invert_transformation(df_train, df_forecast2)
The values of df_results (TS is the index column) are:
TS Field1_f Field2_f
44:13.0 6.826511e+05 1.198614e+06
44:14.0 -8.620101e+05 4.694556e+05
..
..
44:22.0 -1.401620e+07 -2.092826e+06
The values of df_differenced are:
TS Field1 Field2
43:34.0 187000.0 29000.0
43:35.0 175000.0 76722.0
43:36.0 -10000.0 31000.0
43:37.0 90000.0 42000.0
43:38.0 -130000.0 -42000.0
43:39.0 40000.0 -98444.0
..
..
44:11.0 -130000.0 40722.0
44:12.0 117000.0 -42444.0
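As a sanity check on the inversion step, here is a minimal sketch with toy numbers (the helper name invert_first_difference and the series values are made up for illustration). It shows that adding the cumulative sum of first differences to the last observed value recovers the original levels exactly when the differences are the true ones.

```python
import pandas as pd

def invert_first_difference(last_observed, differenced_values):
    """Undo a first difference: last known level plus the running sum of the differences."""
    return last_observed + differenced_values.cumsum()

# Toy series: the first three points play the role of training data.
series = pd.Series([100.0, 130.0, 125.0, 160.0, 150.0])
train, holdout = series.iloc[:3], series.iloc[3:]

# True first differences for the holdout period.
holdout_diffs = series.diff().iloc[3:]

recovered = invert_first_difference(train.iloc[-1], holdout_diffs)
print(recovered.tolist())
```

Under this reading, the cumsum step itself is not what introduces the negative values: if the forecast differences are large and negative, their running sum can pull the level below zero. That is an interpretation of the code above, not a diagnosis of the actual data.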

Double for loop to extract data from several urls

I am trying to get data from a website to write them on an excel file to be worked on. I have a main url scheme and I have to change the "year" and the "reference number" accordingly:
http://calcio-seriea.net/presenze/"year"/"reference number"/
I have already written part of the code, but I have one issue. First, I should keep the year fixed while the reference number takes every value in an interval of 18. Then the year increases by 1, and the reference number again takes every value in the next interval of 18. Here is an example:
Y = 1998 RN = [1142:1159];
Y = 1999 RN = [1160:1177];
Y = 2000 RN = [1178:1195];
Y = … RN = …
From year 2004 onward the interval size becomes 20, so
Y = 2004 RN = [1250:1269];
Y = 2005 RN = [1270:1289];
This continues up to and including year 2019.
This is the code I could make so far:
import pandas as pd
year = str(1998)
all_items = []
for i in range(1142, 1159):
    pattern = "http://calcio-seriea.net/presenze/" + year + "/" + str(i) + "/"
    df = pd.read_html(pattern)[6]
    all_items.append(df)
pd.DataFrame(all_items).to_csv(r"C:\Users\glcve\Desktop\data.csv", index = False, header = False)
print("Done!")
Thanks to all in advance
All that's missing is a pd.concat in your code. However, as you're calling the same method over and over, let's write a function so you can keep your code DRY.
import pandas as pd

def create_html_df(base_url, year, range_nums=()):
    """
    Returns a dataframe from a url/html table
    base_url : the url to target
    year : the target year.
    range_nums : the range of numbers, i.e. (1, 50); the stop value is excluded
    """
    start, stop = range_nums
    # strip any trailing slash so the joined url has no double slash
    base_url = base_url.rstrip("/")
    url_pat = [f"{base_url}/{year}/{i}/" for i in range(start, stop)]
    dfs = []
    for each_url in url_pat:
        df = pd.read_html(each_url)[6]
        dfs.append(df)
    return pd.concat(dfs)

final_df = create_html_df(base_url = "http://calcio-seriea.net/presenze/",
                          year = 1998,
                          range_nums = (1142, 1160)) # stop is exclusive, so this covers 1142 to 1159
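The year-by-year schedule from the question can be generated instead of typed out. The sketch below only builds the URLs and does not download anything. The helper names season_ranges and build_urls are hypothetical, and it assumes the pattern in the question holds without gaps: 18 reference numbers per year from 1998, 20 per year from 2004, through 2019.

```python
def season_ranges(first_year=1998, last_year=2019, first_ref=1142, switch_year=2004):
    """Yield (year, first_ref, last_ref) with inclusive reference numbers."""
    ref = first_ref
    for year in range(first_year, last_year + 1):
        # Assumption from the question: 18 numbers per year before 2004, 20 afterwards.
        size = 18 if year < switch_year else 20
        yield year, ref, ref + size - 1
        ref += size

def build_urls(base_url="http://calcio-seriea.net/presenze"):
    """Return every url in the schedule, outer loop over years, inner loop over references."""
    urls = []
    for year, low, high in season_ranges():
        for ref in range(low, high + 1):  # +1 because range excludes its stop value
            urls.append(f"{base_url}/{year}/{ref}/")
    return urls

ranges = list(season_ranges())
urls = build_urls()
print(ranges[0], ranges[6], len(urls))
```

Each tuple from season_ranges could be passed to create_html_df as range_nums=(low, high + 1), and the per-year frames combined with one final pd.concat.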

Pandas capitalization of compound interests

I am writing an emulation of a bank deposit account in pandas.
I got stuck on compound interest (the result of reinvesting interest, so that interest in the next period is earned on the principal sum plus previously accumulated interest).
So far I have the following code:
import pandas as pd
from pandas.tseries.offsets import MonthEnd
from datetime import datetime
# Create a date range
start = '2017-11-21'  # ISO format avoids day/month ambiguity when parsing
now = datetime.now()
date_rng = pd.date_range(start=start, end=now, freq='d')
# Create an example data frame with the timestamp data
df = pd.DataFrame(date_rng, columns=['Date'])
# Add column (EndOfMonth) - shows the last day of the current month
df['LastDayOfMonth'] = pd.to_datetime(df['Date']) + MonthEnd(0)
# Add columns for Debit, Credit, Total, Description
df['Debit'] = 0
df['Credit'] = 0
df['Total'] = 0
df['Description'] = ''
# Set the "IsItLastDay" flag (the comparison is vectorized, so no loop is needed)
df['IsItLastDay'] = (df['LastDayOfMonth'] == df['Date'])
# Add the transaction of the first deposit
df.loc[df.Date == '2017-11-21', ['Debit', 'Description']] = 10000, "First deposit"
# Calculate the principal sum (the sum of all deposits minus all withdrawals plus all compound interest)
df['Total'] = (df.Debit - df.Credit).cumsum()
# Calculate interest per day and Cumulative interest
# 11% is the interest rate per year
df['InterestPerDay'] = (df['Total'] * 0.11) / 365
df['InterestCumulative'] = ((df['Total'] * 0.11) / 365).cumsum()
# Change the order of columns
df = df[['Date', 'LastDayOfMonth', 'IsItLastDay', 'InterestPerDay', 'InterestCumulative', 'Debit', 'Credit', 'Total', 'Description']]
df.to_excel("results.xlsx")
The output file looks fine, but I need the following:
The "InterestCumulative" column should be added to the "Total" column on the last day of each month (compounding the interest).
At the beginning of each month the "InterestCumulative" column should be cleared (because the interest was added to the principal sum).
How can I do this?
You're going to need to loop, as your total changes depending on previous rows, which then affects the later rows. As a result your current interest calculations are wrong.
total = 0
cumulative_interest = 0
total_per_day = []
interest_per_day = []
cumulative_per_day = []
for day in df.itertuples():
    total += day.Debit - day.Credit
    interest = total * 0.11 / 365
    cumulative_interest += interest
    if day.IsItLastDay:
        total += cumulative_interest
    total_per_day.append(total)
    interest_per_day.append(interest)
    cumulative_per_day.append(cumulative_interest)
    if day.IsItLastDay:
        cumulative_interest = 0
df.Total = total_per_day
df.InterestPerDay = interest_per_day
df.InterestCumulative = cumulative_per_day
This is unfortunately a lot more confusing looking, but that's what happens when values depend on previous values. Depending on your exact requirements there may be nice ways to simplify this using math, but otherwise this is what you've got.
I've written this directly into stackoverflow so it may not be perfect.
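One way to read the "simplify this using math" remark: if there is a single deposit and no further transactions, the balance is constant within a month, so each month simply multiplies it by 1 + rate * days / 365. The sketch below rests on that assumption. The function name month_end_balances is hypothetical, and the end date is assumed to be a month end.

```python
import pandas as pd

def month_end_balances(principal, start, end, annual_rate=0.11):
    """Month-end balances for one deposit, with simple daily interest compounded monthly."""
    days = pd.date_range(start=start, end=end, freq="D")
    # Number of interest-bearing days that fall in each calendar month.
    days_per_month = pd.Series(1, index=days).groupby(days.to_period("M")).sum()
    # Within a month the balance is constant, so interest is linear in the day count.
    growth = 1 + annual_rate * days_per_month / 365
    return principal * growth.cumprod()

balances = month_end_balances(10000, "2017-11-21", "2018-01-31")
print(balances.round(2))
```

As soon as deposits or withdrawals happen mid-month this shortcut no longer applies, and the explicit loop is the safer choice.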

Find Pips value - 3 to 5 Digits forex pricing calculation

N.B. I'm asking about Python. I'm trying to subtract Price (trade opened) from Price.1 (trade closed) to get the number of pips, formatted as a whole number without decimals. However, I could not proceed because of restrictions when splitting the lists. I then tried the solution below, but it seems redundant.
I have created 4 lists and 4 loops to transform the floats to strings and change the format so that I can do the subtraction. Any idea how to get the number formatted correctly? I'd like something that goes directly into a float results column: if there are 3 digits before the decimal point, do 1000*100; if there is one digit before it, do *100/10.
# Price Trade Opened
listp = []
listpf = []
for i in df2['Price']:
    listp.append(format(i,'.5f'))
for i in listp:
    listpf.append(str(i))
# Price.1 trade closed.
listpp = []
listppf = []
for i in df2['Price.1']:
    listpp.append(format(i,'.5f'))
for i in listpp:
    listppf.append(str(i))
# Transform list into DF and remove punctuation. Thereby, I could subtract.
df3 = pd.DataFrame(listp)
col = ['Price']
df3.columns = col
# regex=False so that '.' is treated as a literal dot, not "any character"
df3 = df3.stack().str.replace('.', '', regex=False).unstack()
df4 = pd.DataFrame(listpp)
col = ['Price1']
df4.columns = col
df4 = df4.stack().str.replace('.', '', regex=False).unstack()
dfc = pd.concat([df3, df4], axis=1)
dfc = dfc.fillna(0)  # fillna returns a new frame, so assign it back
dfc.replace({'nan': 0}, inplace=True)
dfc['Price'] = pd.to_numeric(dfc['Price'])
dfc['Price1'] = pd.to_numeric(dfc['Price1'])
dfc['Result'] = (dfc['Price'] - dfc['Price1'])
dfc.head()
You should be able to calculate the difference between open and close values, and divide by the relevant multiplier for the pair. Like this:
def pip_calc(open, close):
    if str(open).index('.') >= 3: # JPY pair
        multiplier = 0.01
    else:
        multiplier = 0.0001
    pips = round((close - open) / multiplier)
    return int(pips)
pip_calc(112.65, 112.68)
# 3
pip_calc(1.6566, 1.6568)
# 2
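The same rule can be applied to whole columns at once, which would replace the list-and-string approach from the question. This is a sketch under assumptions: the helper name add_pips and the output column name Pips are made up, the column names Price and Price.1 come from the question, and "three digits before the decimal point" is approximated as price >= 100.

```python
import numpy as np
import pandas as pd

def add_pips(trades, open_col="Price", close_col="Price.1"):
    """Return a copy of trades with an integer Pips column (close minus open)."""
    # Assumption: prices of 100 or more are JPY-style pairs quoted to 2 decimals.
    multiplier = np.where(trades[open_col] >= 100, 0.01, 0.0001)
    pips = ((trades[close_col] - trades[open_col]) / multiplier).round().astype(int)
    return trades.assign(Pips=pips)

# Same two trades as the pip_calc calls above.
df2 = pd.DataFrame({"Price": [112.65, 1.6566], "Price.1": [112.68, 1.6568]})
result = add_pips(df2)
print(result)
```

Rows with missing prices would need to be dropped or filled first, because astype(int) cannot convert NaN.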
