I see an issue with the aggregation function (part of the pipeline) during online ingest: the aggregation output is invalid (the output differs from the expectation; I get the value 0 instead of 6). The pipeline is really very simple.
See part of the code (Python and MLRun):
import datetime
import pandas as pd
import mlrun
import mlrun.feature_store as fstore
from mlrun.datastore.targets import ParquetTarget, NoSqlTarget
# Prepare data, four columns key0, key1, fn1, sysdate
data = {"key0":[1,1,1,1,1,1], "key1":[0,0,0,0,0,0],"fn1":[1,1,2,3,1,0],
"sysdate":[datetime.datetime(2021,1,1,1), datetime.datetime(2021,1,1,1),
datetime.datetime(2021,1,1,1), datetime.datetime(2021,1,1,1),
datetime.datetime(2021,1,1,1), datetime.datetime(2021,1,1,1)]}
# Create the project and a feature set with NoSqlTarget & ParquetTarget
project = mlrun.get_or_create_project("jist-agg", context='./', user_project=False)
feature_set = featureGetOrCreate(True, project_name, 'sample')  # helper (not shown) that returns the 'sample' feature set with both targets
# Add easy aggregation 'agg1'
feature_set.add_aggregation(name='fn1',column='fn1',operations=['count'],windows=['60d'],step_name="agg1")
# Ingest the data into the online and offline targets
input_df = pd.DataFrame(data)
output_df = fstore.ingest(feature_set, input_df, overwrite=True, infer_options=fstore.InferOptions.default())
# Read data from online source
svc=fstore.get_online_feature_service(fstore.FeatureVector("my-vec", ["sample.*"], with_indexes=True))
resp = svc.get([{"key0": 1, "key1":0} ])
# Output validation
assert resp[0]['fn1_count_60d'] == 6.0, 'Mistake in solution'
Do you see the mistake?
The whole code is valid; the issue is on the side of knowledge ;-).
The key information is that aggregation for the online target works from NOW back into history. If today is 03.02.2023, the aggregation window covers only the last 60 days (see windows=['60d'] in the code), while the source data is dated 01.01.2021, far outside that window.
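A quick way to see the mismatch (a minimal illustration using only the standard library, assuming the ingest runs today):
import datetime
now = datetime.datetime.now()                    # online aggregations are evaluated relative to NOW
cutoff = now - datetime.timedelta(days=60)       # start of the '60d' window
event_time = datetime.datetime(2021, 1, 1, 1)    # sysdate used in the sample data
print(event_time >= cutoff)                      # False -> the events fall outside the window, so the count is 0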
You have two possible solutions:
1. Change the input data (move the dates from 2021 to 2023):
# Prepare data, four columns key0, key1, fn1, sysdate
data = {"key0":[1,1,1,1,1,1], "key1":[0,0,0,0,0,0],"fn1":[1,1,2,3,1,0],
"sysdate":[datetime.datetime(2023,1,1,1), datetime.datetime(2023,1,1,1),
datetime.datetime(2023,1,1,1), datetime.datetime(2023,1,1,1),
datetime.datetime(2023,1,1,1), datetime.datetime(2023,1,1,1)]}
or
2. Extend the aggregation window (e.g. 3 years ≈ 1095 days):
# Add easy aggregation 'agg1'
feature_set.add_aggregation(name='fn1',column='fn1',operations=['count'],windows=['1095d'],step_name="agg1")
Hello Everyone!
I am brand new to python and have some simple data I want to separate and graph in a bar chart.
I have a data set on the cars currently being driven in California. They are separated by Year, Fuel type, Zip Code, Make, and 'Light/Heavy'.
I want to tell python to count the number of Gasoline cars, the number of diesel cars, the number of battery electric cars, etc.
How could I separate this data and then graph it on a bar chart? I am assuming it is quite easy, but I have been teaching myself Python for maybe a week.
I attached the data set, as well as some code that I have so far. It returned 'TRUE' when I tried to make subseries of the data as 'gas', 'diesel', etc., so I am assuming Python is just telling me "yes, it says gasoline there". I now just hope to gather all the "Gasoline" rows in the 'Fuel' column and add them up by the number in the 'Vehicles' column.
Any help would be very much appreciated!!!
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('~/Desktop/PYTHON/californiavehicles.csv')
print(df.head())
print(df.describe())
X = df['Fuel']
y = df['Vehicles']
gas = df[(df['Fuel']=='Gasoline','Flex-Fuel')]
diesel = df[(df['Fuel']=='Diesel and Diesel Hybrid')]
hybrid = df[(df['Fuel']=='Hybrid Gasoline', 'Plug-in Hybrid')]
electric = df[(df['Fuel']=='Battery Electric')]
I tried to create subseries of the data. I haven't tried to include the numbers in 'Vehicles' yet because I don't know how.
This lets you use the built-in conveniences of pandas. The short answer is to use this line:
df.groupby("Fuel")["Vehicles"].sum().plot.bar()
Longer answer with home-made data:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

N = 1000
placeholder = [pd.NA] * N
types = np.random.choice(["Gasoline", "Diesel", "Hybrid", "Battery"], size=N)
nr_vehicles = np.random.randint(low=1, high=100, size=N)

df = pd.DataFrame(
    {
        "Date": placeholder,
        "Zip": placeholder,
        "Model year": placeholder,
        "Fuel": types,
        "Make": placeholder,
        "Duty": placeholder,
        "Vehicles": nr_vehicles,
    }
)

# Sum the vehicle counts per fuel type and plot them as a bar chart
df.groupby("Fuel")["Vehicles"].sum().plot.bar()
plt.show()
You mentioned it's a CSV specifically. Read the file in line by line, split each line on commas (which produces a list for the current row), and then, if row[3] matches the fuel type you want, add that row's vehicle count to your total.
Example:
gas_cars = 0
with open("data.csv", "r") as file:
    for line in file:
        row = line.split(",")
        if row[3] == "Gasoline":
            gas_cars += int(row[6])  # number of vehicles on that row
        # ... repeat for the other fuel types ...
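If the file has a header row or quoted fields, the csv module from the standard library handles the splitting more robustly. A minimal sketch of the same idea (the column names "Fuel" and "Vehicles" are assumptions based on the question):
import csv
totals = {}  # fuel type -> total number of vehicles
with open("data.csv", newline="") as file:
    for row in csv.DictReader(file):
        totals[row["Fuel"]] = totals.get(row["Fuel"], 0) + int(row["Vehicles"])
print(totals.get("Gasoline", 0))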
Hello, I have the following code:
# Import Libraries
import numpy as np
import pandas as pd
import datetime as dt
# Connect to Drive
from google.colab import drive
drive.mount('/content/drive')
# Read Data
ruta = '/content/drive/MyDrive/example.csv'
df = pd.read_csv(ruta)
df.head(15)
d = pd.date_range(start="2015-01-01",end="2022-01-01", freq='MS')
dates = pd.DataFrame({"DATE":d})
df["DATE"] = pd.to_datetime(df["DATE"])
df_merge = pd.merge(dates, df, how='outer', on='DATE')
The data that I am using can be downloaded here: DATA
What I am trying to achieve is something known as a Rolling Year.
First I create this metric grouped by each category:
# ROLLING YEAR
##################################################################################################
# I want to make a Rolling Year for each category, i.e. how much each category sold from 12 months ago to the current month
# RY_ACTUAL: one year has 12 months, so I pass 12 as the rolling-window parameter
f = lambda x: x.rolling(12).sum()
df_merge["RY_ACTUAL"] = df_merge.groupby(["CATEGORY"])['Sales'].apply(f)
# RY_24: a rolling window of 24 months, to compare the current RY against the previous RY
f_1 = lambda x: x.rolling(24).sum()
df_merge["RY_24"] = df_merge.groupby(["CATEGORY"])['Sales'].apply(f_1)
# RY_LAST: subtract RY_ACTUAL from RY_24 to get the previous year's amount, i.e. RY vs RY-1
df_merge["RY_LAST"] = df_merge["RY_24"] - df_merge["RY_ACTUAL"]
##################################################################################################
df_merge.head(30)
And it works perfectly, because if you download the file and then filter, for example, for the "Blue" category, you can see something like this:
That means that if you stop at the 2015-November row, the RY_ACTUAL column shows the sum of all the values in the 12 records before it.
My next goal is to create a similar column using the rolling function, but with the following condition:
The column must sum the sales of ALL the categories, as long as the Colour/Animal column is equal to 'Colour'. For example, if I am standing on 2016-December, it should give me the sum of ALL the sales of the colours from 2016-January to 2016-December.
This was my attempt:
df_merge.loc[(df_merge['Colour/Animal'] == 'Colour'),'Sales'].apply(f)
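A possible direction (I have not verified it) would be to keep only the 'Colour' rows, sum the sales of all categories per month, take the 12-month rolling sum of that, and map it back onto df_merge by date; something like this (the new column name is just a placeholder):
colour_monthly = (df_merge.loc[df_merge['Colour/Animal'] == 'Colour']
                  .groupby('DATE')['Sales'].sum()
                  .sort_index()
                  .rolling(12).sum())
df_merge["RY_COLOUR_ALL"] = df_merge["DATE"].map(colour_monthly)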
Could anyone help me code this example correctly?
Thanks in advance, community!!!
I am trying to measure the Word Mover's Distance between a lot of texts using Gensim's Word2Vec tools in Python. I am comparing each text with all other texts, so I first use itertools to create pairwise combinations like [1,2,3] -> [(1,2), (1,3), (2,3)]. For memory's sake, I don't build the combinations by repeating all the texts in a big dataframe; instead I make a reference dataframe, combinations, holding the indices of the texts, which looks like:
   0  1
0  0  1
1  0  2
2  0  3
Then, in the comparison function, I use these indices to look up the texts in the original dataframe. The solution works fine, but I am wondering whether I would be able to do it with big datasets. For instance, I have a 300,000-row dataset of texts, which gives me about a hundred years' worth of computation on my laptop:
C(300000, 2) = 300000! / (2! · (300000 − 2)!)
             = (300000 · 299999) / (2 · 1)
             = 44999850000 combinations
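A quick sanity check of that count with the standard library (math.comb needs Python 3.8+):
import math
print(math.comb(300000, 2))  # 44999850000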
Is there any way this could be optimized further?
My code right now:
import multiprocessing
import itertools
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from gensim.models.word2vec import Word2Vec
from gensim.corpora.wikicorpus import WikiCorpus
def get_distance(row):
    try:
        sent1 = df.loc[row[0], 'text'].split()
        sent2 = df.loc[row[1], 'text'].split()
        return model.wv.wmdistance(sent1, sent2)  # Compute WMD
    except Exception as e:
        return np.nan
df = pd.read_csv('data.csv')
# I then set up the gensim model, let me know if you need that bit of code too.
# Make pairwise combination of all indices
combinations = pd.DataFrame(itertools.combinations(df.index, 2))
# To dask df and apply function
dcombinations = dd.from_pandas(combinations, npartitions= 2 * multiprocessing.cpu_count())
dcombinations['distance'] = dcombinations.apply(get_distance, axis=1)
with ProgressBar():
    combinations = dcombinations.compute()
You might use wmd-relax for a performance improvement. However, you'll first have to convert your model to spaCy and use the SimilarityHook, as described on their webpage:
import spacy
import wmd
nlp = spacy.load('en_core_web_md')
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))
I am trying to perform a Principal Component Analysis for work. While I have been successful in getting the Principal Components laid out, I don't really know how to assign the resulting component scores to each line item. I am looking for an output sort of like this:
Town         PrinComponent 1   PrinComponent 2   PrinComponent 3
Columbia             0.31989          -0.44216          -0.44369
Middletown          -0.37101          -0.24531          -0.47020
Harrisburg          -0.00974          -0.06105           0.32792
Newport             -0.38678           0.40935          -0.62996
The scikit-learn docs are not being helpful in this circumstance. Can anybody explain to me how I can reach this output?
The code I have so far is below.
import pandas as pd
from sklearn import decomposition, preprocessing

def perform_PCA(df):
    threshold = 0.1
    pca = decomposition.PCA(n_components=3)
    numpyMatrix = df.as_matrix().astype(float)  # df.to_numpy() in newer pandas
    scaled_data = preprocessing.scale(numpyMatrix)
    pca.fit(scaled_data)
    pca.transform(scaled_data)  # result is not assigned anywhere
    pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)
    # print(pca_components_df)
    # pca_components_df.to_csv('pca_components_df.csv')
    filtered = pca_components_df[abs(pca_components_df) > threshold]
    trans_filtered = filtered.T
    # print(filtered.T)  # transposed dataframe
    trans_filtered.to_csv('trans_filtered.csv')
    print(pca.explained_variance_ratio_)
I pumped the transformed array into the data argument of the DataFrame constructor, and then defined the columns and index by passing them to columns= and index= respectively.
pd.DataFrame(data=transformed, columns=["PC1", "PC2", "PC3"], index=df.index)
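Put together with the question's function, a minimal sketch could look like this (assuming df is indexed by Town as in the desired output; the names here are only illustrative):
import pandas as pd
from sklearn import decomposition, preprocessing

def pca_scores(df, n_components=3):
    scaled = preprocessing.scale(df.to_numpy().astype(float))
    pca = decomposition.PCA(n_components=n_components)
    transformed = pca.fit_transform(scaled)  # one row of scores per input row
    cols = ["PC{}".format(i + 1) for i in range(n_components)]
    return pd.DataFrame(data=transformed, columns=cols, index=df.index)

# scores = pca_scores(towns_df)  # towns_df: numeric features, index = Town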