I'm trying to learn how to analyze data in Python, so I'm using a dataset that I've already done some work on in PowerBI, and now I'm trying to reproduce the same plots in Python.
The Pandas dataframe is this...
And I'm trying to build a line plot like this one...
This line represents the number of 'Água e sabonete' and 'Fricção com álcool' entries in the column Ação divided by the total count of Ação.
This is how I managed to do it in PowerBI using DAX:
Adesão = VAR nReal = (COUNTROWS(FILTER(Tabela1,Tabela1[Ação]="Água e sabonete")) + COUNTROWS(FILTER(Tabela1,Tabela1[Ação]="Fricção com álcool")))
//VAR acao = COUNTA(Tabela1[Ação]
RETURN
DIVIDE(nReal,COUNTA(Tabela1[Ação]))
I want to know whether it is possible to do something similar to build the plot, or whether there is another way to build it in Python.
I haven't tried anything specific yet. I think it should be possible to build this with a function, but that is still too difficult for me to write on my own since I'm a beginner.
Any help would be greatly appreciated!
The idea here is to go through each month and count every time 'Água e sabonete' and 'Fricção com álcool' appear.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def dateprod(year):  # could change this to an array of each month instead
    year = str(year) + "-"
    dates = []
    for i in range(1, 13):
        if i >= 10:
            date = year + str(i)
        else:
            date = year + "0" + str(i)
        dates.append(date)
    return dates
def y_construct(year, search_1, search_2):
    y_vals = np.zeros(0)
    for j in dateprod(year):  # use the year that was passed in, not a hard-coded one
        score = 0
        temp_list = df.loc[str(j)]["Ação"].values  # gets all values of that month
        for k in temp_list:
            if k == str(search_1) or k == str(search_2):  # check whether the value is Água e sabonete or Fricção com álcool
                score += 1
        y_vals = np.append(y_vals, score)
    return y_vals / df["Ação"].count()  # divide by the total count of Ação, like COUNTA(Tabela1[Ação]) in the DAX
# this does the plotting
y_vals = y_construct(2022, "Água e sabonete", "Fricção com álcool")
x_label = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]
plt.ylabel("%")
plt.plot(x_label, y_vals)
plt.show()
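A more pandas-like alternative, as a sketch: it assumes df has a DatetimeIndex holding the dates and that the column is named 'Ação'. Note that here each month is divided by its own total (which is presumably what the PowerBI visual computes per month), rather than by the overall total as in y_construct above.

targets = ["Água e sabonete", "Fricção com álcool"]
monthly = (
    df["Ação"]
    .isin(targets)                     # True when the row is one of the two actions
    .groupby(df.index.to_period("M"))  # group the boolean flags by month
    .mean()                            # mean of booleans = share of matching rows
)
monthly.plot()
plt.ylabel("%")
plt.show()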
I am saving a large amount of data from some Monte Carlo simulations. I simulate 20 things over a period of 10 time steps using a varying number of random draws. So, for a given number of random draws, I have a folder with 10 .csv files (one for each time step), each of which has 20 columns of data and n rows per column, where n is the number of random draws in that simulation. Currently my basic code for loading the data looks something like this:
import pandas as pd
import numpy as np

load_path = r'...\path\to\data'
numScenarios = [100, 500, 1000, 2500, 5000, 10000, 20000]
yearsSimulated = np.arange(1, 11)

for n in numScenarios:
    folder_path = load_path + '\draws = ' + str(n)
    for year in yearsSimulated:
        filename = '\year ' + str(year) + '.csv'
        path = folder_path + filename
        df = pd.read_csv(path)
        # save df.describe() somewhere
I want to efficiently save df.describe() somehow so that I can compare how the number of random draws is affecting results for the 20 things for a given time step. That is, I would ultimately like some object that I can access easily that will store all the df.describe() outputs for each individual time step. I'm not sure of a nice way to do this though. Some previous questions seem to suggest that dictionaries may be the way to go here but I've not been able to get them going.
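A minimal sketch of that dictionary idea, keyed by (number of draws, time step) and reusing the loop above, could look like this:

summaries = {}
for n in numScenarios:
    folder_path = load_path + '\draws = ' + str(n)
    for year in yearsSimulated:
        path = folder_path + '\year ' + str(year) + '.csv'
        df = pd.read_csv(path)
        summaries[(n, year)] = df.describe()

# the summary table for, e.g., 500 draws at time step 3:
# summaries[(500, 3)]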
Edit:
My final approach is to use an answer to a question here with a bunch of loops. So now I have:
class ngram(dict):
    """Based on perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return super(ngram, self).__getitem__(item)
        except KeyError:
            value = self[item] = type(self)()
            return value

results = ngram()
for i, year in enumerate(years):
    year_str = str(year)
    ann_stats = pd.DataFrame()
    for j, n in enumerate(numScenarios):
        n_str = str(n)
        folder_path = load_path + '\draws = ' + str(n)
        filename = '\scenarios ' + str(year) + '.csv'
        path = folder_path + filename
        df = pd.read_csv(path)
        ann_stats['mean'] = df.mean()
        ann_stats['std. dev'] = df.std()
        ann_stats['1%'] = df.quantile(0.01)
        ann_stats['25%'] = df.quantile(0.25)
        ann_stats['50%'] = df.quantile(0.5)
        ann_stats['75%'] = df.quantile(0.75)
        ann_stats['99%'] = df.quantile(0.99)
        results[year_str][n_str] = ann_stats.T
And so now the summary data for each time step and number of draws is accessed as a dataframe with
test = results[year_str][n_str]
where the columns of test hold results for each of my 20 things.
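An alternative sketch, if a single indexable object is preferred over the nested dict: collect the same tables in a flat dict and combine them with pd.concat into one frame with a (year, draws) MultiIndex, so one time step can be compared across every draw count at once.

frames = {(year, n): results[year][n]
          for year in results for n in results[year]}
all_stats = pd.concat(frames, names=['year', 'draws'])

# every draw count for time step '3' in one DataFrame:
all_stats.loc['3']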
Say I have a csv file as follows:
GL000004250,1958.0833333333333,-1.4821428571428572
GL000004250,1958.1666666666667,-2.586206896551724
GL000004250,1958.25,-1.5733333333333333
GL000004250,1958.3333333333333,4.680000000000001
GL000004250,1958.4166666666667,9.944827586206895
GL000004250,1958.5,12.874193548387098
GL000004250,1958.5833333333333,12.21290322580645
GL000004250,1958.6666666666667,7.18148148148148
GL000004250,1958.75,2.187096774193549
GL000004250,1958.8333333333333,-0.9066666666666666
GL000004250,1958.9166666666667,0.3777777777777777
GL000004250,1959.0,0.43214285714285744
GL000004250,1959.0833333333333,-6.432142857142857
GL000004250,1959.1666666666667,-6.806451612903226
GL000004250,1959.25,0.6933333333333334
GL000004250,1959.3333333333333,5.780645161290322
GL000004250,1959.4166666666667,8.343333333333332
GL000004250,1959.5,10.71935483870968
GL000004250,1959.5833333333333,10.216129032258062
Where the second column is the year in decimal form and the third column is the data. I would like the program to find all the values from 1958 and average them, then 1959 and average them, etc.
If you're a beginner, start with the basics. Try with a loop and a dictionary to get a better handle on Python.
import numpy as np

# csvfile is the path to your csv file
with open(csvfile, 'r') as f:
    yearAvgs = dict()
    data = f.read().split('\n')
    for line in data:
        if line:
            year = int(float(line.split(',')[1]))
            val = float(line.split(',')[2])
            if year not in yearAvgs:
                yearAvgs[year] = []
            yearAvgs[year].append(val)

for k, v in yearAvgs.items():
    avg = np.mean(v)
    print("Year =", k, ": Mean =", avg)
Edit: If you're looking for a solution with pandas:
import pandas as pd

df = pd.read_csv(csvfile, names=['ID', 'Year', 'Value'])
df['Year'] = df['Year'].astype(int)
df.groupby('Year')['Value'].mean()  # select 'Value' so the non-numeric ID column is not aggregated
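If the file contains several station IDs and you also want a per-station, per-year mean, the same groupby can take both keys (just a guess at a likely next step):

df.groupby(['ID', 'Year'])['Value'].mean()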
I need some help dropping the NaN from the list generated in the code below. I'm trying to calculate the geometric average of the list of numbers labeled 'prices'. I can get as far as calculating the percent changes between the sequential numbers, but when I go to take the product of the list, there is a NaN that throws it off. I tried pandas.dropna(), but it didn't drop anything and gave me the same output. Any suggestions would be appreciated.
Thanks.
import pandas as pd
import math
import numpy as np

prices = [2, 3, 4, 3, 1, 3, 7, 8]
prices = pd.Series(prices)
prices = prices.iloc[::-1]
retlist = list(prices.pct_change())
retlist.reverse()
print(retlist)

calc = np.array([x + 1 for x in retlist])
print(calc)

def product(P):
    p = 1
    for i in P:
        p = i * p
    return p

print(product(calc))
retlist is a list which contains a NaN (pct_change always produces one, because the first element has nothing to compare against).
You can add a step to get rid of the NaN with the following code:
retlist = [i for i in retlist if not np.isnan(i)]
After this you can follow with the other steps. Do note that the size of the list changes.
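Alternatively, a sketch that stays in pandas the whole way, so that dropna() works before the product is taken (this assumes the prices Series as defined above, i.e. already reversed):

rets = prices.pct_change().dropna() + 1   # drop the leading NaN and turn returns into growth factors
geo_avg = rets.prod() ** (1 / len(rets))  # geometric average of the growth factors
print(geo_avg)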
I'm new to Python/pandas, so please don't judge. :)
I have a DataFrame with stock data (e.g., Date, Close value, ...).
Now I want to see whether a given Close value will hit a target value later that day (e.g., Close + 50 €, Close - 50 €).
I wrote a nested loop that checks every Close value against the following Close values of the same day:
def calc_zv(_df, _distance):
    _df['ZV_C'] = 0
    _df['ZV_P'] = 0
    for i in range(0, len(_df)):
        _date = _df.iloc[i].get('Date')
        target_put = _df.iloc[i].get('Close') - _distance
        target_call = _df.iloc[i].get('Close') + _distance
        for x in range(i, len(_df) - 1):
            a = _df.iloc[x + 1].get('Close')
            _date2 = _df.iloc[x + 1].get('Date')
            if target_call <= a and _date == _date2:
                _df.loc[_df.index[i], 'ZV_C'] = 1  # .loc instead of the deprecated .ix
                break
            elif target_put >= a and _date == _date2:
                _df.loc[_df.index[i], 'ZV_P'] = 1
                break
            elif _date != _date2:
                break
This works fine, but I wonder if there is a "better" (faster, more pandas-like) solution?
Thanks and best wishes.
M.
EDIT
hi again,
here is a sample data generator:
import numpy as np
import pandas as pd
from PX.indicator_macros import calc_zv
import datetime
abc = datetime.datetime.now()
print(abc)
df2 = pd.DataFrame({'Date': pd.Timestamp('20130102'),  # column named 'Date' so it matches calc_zv
                    'Close': pd.Series(np.random.randn(5000))})
#print(df2.to_string())
calc_zv(df2, 2)
#print(df2.to_string())
abc = datetime.datetime.now()
print(abc)
For 5000 rows I need approx. 10 s. I have stock data for 3 years (in 15-minute intervals), which takes several minutes.
Cheers
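Not a definitive answer, but here is a sketch of a more vectorized approach: group by day and compare each Close with the running maximum/minimum of the later closes of that day. Note that it flags whether each target is reached at any later point of the day, whereas the loop above stops at whichever target is hit first, so the two can differ when both targets are reached.

def calc_zv_fast(df, distance):
    def per_day(g):
        g = g.copy()
        c = g['Close']
        future_max = c[::-1].cummax()[::-1].shift(-1)  # max of all later closes that day
        future_min = c[::-1].cummin()[::-1].shift(-1)  # min of all later closes that day
        g['ZV_C'] = (future_max >= c + distance).astype(int)
        g['ZV_P'] = (future_min <= c - distance).astype(int)
        return g
    return df.groupby('Date', group_keys=False).apply(per_day)

calc_zv_fast(df2, 2) should then fill ZV_C/ZV_P without the Python-level inner loop.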
I have a dataframe of 600 000 x/y points with date-time information, along with another field, 'status', which holds extra descriptive information.
My objective is, for each record:
sum the 'status' column over the records that fall within a certain spatio-temporal buffer
the specific buffer is: within t - 8 hours and < 100 meters
Currently I have the data in a pandas data frame.
I could loop through the rows and, for each record, subset the dates of interest, then calculate distances and restrict the selection further. However, that is still quite slow with so many records.
THIS TAKES 4.4 hours to run.
I can see that I could create a 3-dimensional kd-tree with x, y, and date as epoch time. However, I am not certain how to restrict the distances properly when combining dates and geographic distances.
Here is some reproducible code for you guys to test on:
Import
import numpy.random as npr
import numpy as np  # imported as np so that np.random.seed and np.array below work
import pandas as pd
from pandas import DataFrame, date_range
from datetime import datetime, timedelta
Create data
np.random.seed(111)
Function to generate test data
def CreateDataSet(Number=1):
    Output = []
    for i in range(Number):
        # Create a date range with hour frequency
        date = date_range(start='10/1/2012', end='10/31/2012', freq='H')
        # Create long/lat data
        laty = npr.normal(4815862, 5000, size=len(date))
        longx = npr.normal(687993, 5000, size=len(date))
        # status of interest
        status = [0, 1]
        # Make a random list of statuses
        random_status = [status[npr.randint(low=0, high=len(status))] for i in range(len(date))]
        # user pool
        user = ['sally', 'derik', 'james', 'bob', 'ryan', 'chris']
        # Make a random list of users
        random_user = [user[npr.randint(low=0, high=len(user))] for i in range(len(date))]
        Output.extend(zip(random_user, random_status, date, longx, laty))
    return pd.DataFrame(Output, columns=['user', 'status', 'date', 'long', 'lat'])
#Create data
data = CreateDataSet(3)
len(data)
#some time deltas
before = timedelta(hours = 8)
after = timedelta(minutes = 1)
Function to speed up
def work(df):
    output = []
    # loop through the data frame by index
    for i in range(0, len(df)):
        l = []
        # first filter the data by date to have a smaller set to compute distances for
        # create a mask to query all dates within the range for date i
        date_mask = (df['date'] >= df['date'].iloc[i] - before) & (df['date'] <= df['date'].iloc[i] + after)
        # create a mask to query all users who are not user i (themselves)
        user_mask = df['user'] != df['user'].iloc[i]
        # apply masks
        dists_to_check = df[date_mask & user_mask]
        # for point i, create the coordinate to calculate distances from
        a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
        # create an array of coordinates to check on the masked data
        b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))
        # for j in the date-queried data
        for j in range(1, len(dists_to_check)):
            # compute the euclidean distance between point a and each point of b (the date-masked data)
            x = np.linalg.norm(a - np.array((b[0][j], b[1][j])))
            # if the distance is within our range of interest, append the index to a list
            if x <= 100:
                l.append(j)
            else:
                pass
        try:
            # use the list of desired indexes 'l' to query a final subset of the data
            data = dists_to_check.iloc[l]
            # summarize the column of interest, then append to the output list
            output.append(data['status'].sum())
        except IndexError:
            output.append(0)
            # print("There were no data to add")
    return pd.DataFrame(output)
Run code and time it
start = datetime.now()
out = work(data)
print(datetime.now() - start)
Is there a way to do this query in a vectorized way? Or should I be chasing another technique?
<3
Here is what at least somewhat solves my problem. Since the loop can operate on different parts of the data independently, parallelization makes sense here.
Using IPython.parallel...
from IPython.parallel import Client

cli = Client()
cli.ids
dview = cli[:]

with dview.sync_imports():
    import numpy as np
    import os
    from datetime import timedelta
    import pandas as pd

# We also need to add the time deltas and output list into the function as
# local variables, as well as add the IPython.parallel decorator
@dview.parallel(block=True)
def work(df):
    before = timedelta(hours=8)
    after = timedelta(minutes=1)
    output = []
    # ... the rest of the loop body is unchanged from work() above ...
Final time: 1:17:54.910206, about 1/4 of the original time.
I would still be very interested for anyone to suggest small speed improvements within the body of the function.
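For the spatial part, the kd-tree idea from the question could look roughly like the sketch below: scipy.spatial.cKDTree handles the 100 m radius, and the 8-hour window and the user check are still applied per point. This is only an outline built on the CreateDataSet frame above, not a tested drop-in replacement for work().

from scipy.spatial import cKDTree

def work_kdtree(df, radius=100):
    xy = df[['long', 'lat']].values
    tree = cKDTree(xy)
    # for every point, the positional indexes of all points within 100 m (the point itself included)
    neighbours = tree.query_ball_point(xy, r=radius)
    output = []
    for i, idx in enumerate(neighbours):
        cand = df.iloc[idx]
        mask = ((cand['date'] >= df['date'].iloc[i] - before) &
                (cand['date'] <= df['date'].iloc[i] + after) &
                (cand['user'] != df['user'].iloc[i]))
        output.append(cand.loc[mask, 'status'].sum())
    return pd.DataFrame(output)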