Time series trend recognition in Python

I have a CSV containing selling figures for various dates.
Here is an example of the file:
DATE, ARTICLENO, QUANTITY
2018-07-17, 101, 50
2018-07-16, 101, 55
2018-07-16, 105, 36
2018-07-15, 105, 23
I read this into a pandas DataFrame and ran a basic k-means algorithm on it, but I need more help.
Data description:
The date column is the index of the DataFrame and gives the date of the selling value. There are multiple (Date, ArticleNo, Quantity) tuples, so there is one time series per article number. These series can have different lengths and starting dates, which makes predicting and recognizing trends (e.g. articles that sell well in summer or winter) even harder. The CSV is sorted by ArticleNo and Date.
Goal:
Cluster a given set of data from a CSV, create labels for articles that sell well in summer or winter (seasonal trends), and match future articles to those labels.
Here is what I have done so far (currently the date is not the index yet, but that is the goal):
from __future__ import absolute_import, division, print_function

import pandas as pd
import numpy as np
from matplotlib import pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans
import sys


def extract_articles(data, article_numbers):
    # one row per article number, quantities along the columns
    return pd.DataFrame(
        [
            data[data['ARTICLENO'] == article_no]['QUANTITY'].values
            for article_no in article_numbers
        ]
    ).fillna(0)


def read_csv_file(file_name, number_of_lines):
    return pd.read_csv(file_name, parse_dates=['DATE'],
                       nrows=number_of_lines)


def get_unique_article_numbers(data):
    return data['ARTICLENO'].unique()


def main():
    data = read_csv_file('statistic.csv', 400000)

    modeling_article_numbers = get_unique_article_numbers(data)
    print("Clustering on", len(modeling_article_numbers), "article numbers")
    modeling_data = extract_articles(data, modeling_article_numbers)
    modeling_data = modeling_data.iloc[:50, :]

    # 'switch' dataframe
    modeling_data = modeling_data.T
    modeling_data = modeling_data.pct_change().fillna(0)
    normalized_modeling_data = preprocessing.normalize(modeling_data,
                                                       norm='l2', axis=0)
    print(modeling_data)

    predicting_article_numbers = [30079229, 30079854, 30086845]
    predicting_article_data = extract_articles(data,
                                               predicting_article_numbers)
    predicting_article_data = predicting_article_data.pct_change().fillna(0)
    normalized_predicting_article_data = preprocessing.normalize(
        predicting_article_data, norm='l2'
    )

    kmeans = KMeans(n_clusters=5,
                    random_state=0).fit(normalized_modeling_data)
    print(kmeans.labels_)

    # for data, article_no in [
    #     (normalized_predicting_article_data, 430079229),
    #     (normalized_predicting_article_data, 430079854),
    #     (modeling_data, 430074590),
    # ]:
    #     print('Predicting article {0}'.format(article_no))
    #     print(kmeans.predict([data[0]]))

    for i, cluster_center in enumerate(kmeans.cluster_centers_):
        plp.plot(cluster_center, label='Center {0}'.format(i))
    plp.legend(loc='best')
    plp.title('Cluster based on ' + str(len(modeling_article_numbers)) +
              ' article numbers')
    plp.show()


main()
I transposed the DataFrame because it did not contain the series for each article number along axis 1.
My question is: how can I get a 'description' of each label? Can I name them?
Or is k-means the wrong algorithm for my purposes?

Have you tried making each article a row in your dataset? After reading your question, I'm not sure whether you did.
Once you have done that, you can aggregate your data, e.g. as quantity per week. If you have more than one year of data, use the average quantity per week. You then get a table with 52 features {week 1: sold 500; week 2: sold 520 ...} for every article.
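A minimal sketch of that weekly aggregation, assuming the column names from the CSV above (the file name statistic.csv is taken from the question's code):

import pandas as pd

df = pd.read_csv('statistic.csv', parse_dates=['DATE'])
df['WEEK'] = df['DATE'].dt.isocalendar().week

# one row per article, one column per calendar week,
# averaged over the years and filled with 0 where an article did not sell
weekly = (
    df.groupby(['ARTICLENO', 'WEEK'])['QUANTITY']
      .mean()
      .unstack('WEEK', fill_value=0)
)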
I don't think k-means is what you are looking for, because you know pretty well what you want, and that makes you a good "teacher" for your algorithm; ergo: use supervised algorithms.
For that you need to label at least some (ideally all) of your aggregated product data by hand, but it should be worth the work because of the better results.
You could also look into time-series seasonality analysis / time-series decomposition.
In any case, if you are familiar with scikit-learn, I would give the supervised algorithms (decision trees, random forests, SVMs, MLPClassifier ...) a chance; it might be much easier to accomplish.
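A hypothetical sketch of that supervised route, assuming the weekly feature table from the sketch above and a hand-made labels Series (neither exists in the original post):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 'labels' is a hand-assigned Series aligned with weekly.index,
# e.g. values such as 'summer seller', 'winter seller', 'flat'
X_train, X_test, y_train, y_test = train_test_split(
    weekly, labels, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))      # rough accuracy on held-out articles
print(clf.predict(weekly.iloc[[0]]))  # predicted seasonal label for one article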

Related

Issue with the aggregation function in the pipeline during online ingest

I see an issue with the aggregation function (part of the pipeline) during online ingest: the aggregation output is invalid (the output differs from what I expect; I get the value 0 instead of 6). The pipeline is really very simple.
See part of the code (Python and MLRun):
import datetime
import mlrun
import mlrun.feature_store as fstore
from mlrun.datastore.targets import ParquetTarget, NoSqlTarget
# Prepare data, four columns key0, key1, fn1, sysdate
data = {"key0":[1,1,1,1,1,1], "key1":[0,0,0,0,0,0],"fn1":[1,1,2,3,1,0],
"sysdate":[datetime.datetime(2021,1,1,1), datetime.datetime(2021,1,1,1),
datetime.datetime(2021,1,1,1), datetime.datetime(2021,1,1,1),
datetime.datetime(2021,1,1,1), datetime.datetime(2021,1,1,1)]}
# Create project and featureset with NoSqlTarget & ParquetTarget
project = mlrun.get_or_create_project("jist-agg",context='./', user_project=False)
feature_set=featureGetOrCreate(True,project_name, 'sample')
# Add easy aggregation 'agg1'
feature_set.add_aggregation(name='fn1',column='fn1',operations=['count'],windows=['60d'],step_name="agg1")
# Ingest data to the on-line and off-line targets
output_df=fstore.ingest(feature_set, input_df, overwrite=True, infer_options=fstore.InferOptions.default())
# Read data from online source
svc=fstore.get_online_feature_service(fstore.FeatureVector("my-vec", ["sample.*"], with_indexes=True))
resp = svc.get([{"key0": 1, "key1":0} ])
# Output validation
assert resp[0]['fn1_count_60d'] == 6.0, 'Mistake in solution'
Do you see the mistake?
The whole code is valid; the issue is on the side of knowledge ;-).
The key information is that aggregation for the online target works from NOW back into history. If today is 03.02.2023, then the aggregation window is the last 60 days (see windows=['60d'] in the code), while the source data are dated 01.01.2021 and therefore fall outside the window.
You have two possible solutions:
1. Change the input data (move the dates from 2021 to 2023):
# Prepare data, four columns key0, key1, fn1, sysdate
data = {"key0":[1,1,1,1,1,1], "key1":[0,0,0,0,0,0],"fn1":[1,1,2,3,1,0],
"sysdate":[datetime.datetime(2023,1,1,1), datetime.datetime(2023,1,1,1),
datetime.datetime(2023,1,1,1), datetime.datetime(2023,1,1,1),
datetime.datetime(2023,1,1,1), datetime.datetime(2023,1,1,1)]}
or
2. Extend the aggregation window (e.g. 3 years ≈ 1095 days):
# Add easy aggregation 'agg1'
feature_set.add_aggregation(name='fn1',column='fn1',operations=['count'],windows=['1095d'],step_name="agg1")

Looking for Simple Python Help: Counting the Number of Vehicles in a CSV by their Fuel Type

Hello Everyone!
I am brand new to Python and have some simple data that I want to separate and graph in a bar chart.
I have a data set on the cars currently being driven in California. They are separated by Year, Fuel type, Zip Code, Make, and 'Light/Heavy'.
I want to tell python to count the number of Gasoline cars, the number of diesel cars, the number of battery electric cars, etc.
How could I separate this data and then graph it in a bar chart? I am assuming it is quite easy, but I have been teaching myself Python for maybe a week.
I attached the data set, as well as the code I have so far. It returned 'TRUE' when I tried to make subseries of the data such as 'gas', 'diesel', etc. I assume Python is just telling me "yes, it says Gasoline there". I now hope to gather all the "Gasoline" rows in the 'Fuel' column and add them up using the numbers in the 'Vehicles' column.
Any help would be very much appreciated!!!
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('~/Desktop/PYTHON/californiavehicles.csv')
print(df.head())
print(df.describe())
X = df['Fuel']
y = df['Vehicles']
gas = df[(df['Fuel']=='Gasoline','Flex-Fuel')]
diesel = df[(df['Fuel']=='Diesel and Diesel Hybrid')]
hybrid = df[(df['Fuel']=='Hybrid Gasoline', 'Plug-in Hybrid')]
electric = df[(df['Fuel']=='Battery Electric')]
I tried to create subseries of the data. I haven't tried to include the numbers in 'Vehicles' yet because I don't know how.
This will let you use the built-in conveniences of pandas. The short answer is to use this line:
df.groupby("Fuel").sum().plot.bar()
Long answer, with home-made data:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

N = 1000
placeholder = [pd.NA] * N
types = np.random.choice(["Gasoline", "Diesel", "Hybrid", "Battery"], size=N)
nr_vehicles = np.random.randint(low=1, high=100, size=N)

df = pd.DataFrame(
    {
        "Date": placeholder,
        "Zip": placeholder,
        "Model year": placeholder,
        "Fuel": types,
        "Make": placeholder,
        "Duty": placeholder,
        "Vehicles": nr_vehicles,
    }
)

df.groupby("Fuel").sum().plot.bar()
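If you only need the totals rather than the plot, a small variation selects just the Vehicles column before summing:

df.groupby("Fuel")["Vehicles"].sum()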
You mentioned it's a CSV specifically. Read the file in line by line, split each line on commas (which gives a list for the current row), and then, if row[3] equals the fuel type, increment your count.
Example:
gas_cars = 0
with open("data.csv", "r") as file:
    next(file)  # skip the header row
    for line in file:
        row = line.split(",")
        if row[3] == "Gasoline":
            gas_cars += int(row[6])  # num cars for that car make
        # ...
        # ...
        # ...

Rolling Year Based on Condition

Hello, I have the following code:
# Import libraries
import numpy as np
import pandas as pd
import datetime as dt
# Connect to Drive
from google.colab import drive
drive.mount('/content/drive')
# Read Data
ruta = '/content/drive/MyDrive/example.csv'
df = pd.read_csv(ruta)
df.head(15)
d = pd.date_range(start="2015-01-01",end="2022-01-01", freq='MS')
dates = pd.DataFrame({"DATE":d})
df["DATE"] = pd.to_datetime(df["DATE"])
df_merge = pd.merge(dates, df, how='outer', on='DATE')
The data I am using can be downloaded here: DATA
What I am trying to achieve is something known as Rolling Year.
First I create this metric grouped by category:
# ROLLING YEAR
##################################################################################################
# I want to make a rolling year for each category, i.e. how much each category sold from 12 months ago up to the current month.

# RY_ACTUAL: one year has 12 months, so I pass 12 as the rolling window
f = lambda x: x.rolling(12).sum()
df_merge["RY_ACTUAL"] = df_merge.groupby(["CATEGORY"])['Sales'].apply(f)

# RY_24: a rolling window of 24 to compare the actual RY against the previous RY
f_1 = lambda x: x.rolling(24).sum()
df_merge["RY_24"] = df_merge.groupby(["CATEGORY"])['Sales'].apply(f_1)

# RY_LAST: subtract RY_ACTUAL from RY_24 to get the amount of the previous rolling year (RY-1)
df_merge["RY_LAST"] = df_merge["RY_24"] - df_merge["RY_ACTUAL"]
##################################################################################################
df_merge.head(30)
And it works perfectly: if you download the file and then filter, for example, for the "Blue" category, you will see something like this:
That means that if you stop at the row for November 2015, the RY_ACTUAL column holds the sum of the 12 preceding records.
My next goal is to create a similar column using the rolling function, but with the following condition:
The column must sum the sales of ALL the categories, as long as
the Colour/Animal column is equal to Colour. For example, if I
stop at December 2016, it should give me the sum of ALL the colour
sales from January 2016 to December 2016.
This was my attempt:
df_merge.loc[(df_merge['Colour/Animal'] == 'Colour'),'Sales'].apply(f)
Could anyone help me code this example correctly?
Thanks in advance, community!
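A possible starting point (an untested sketch, assuming df_merge has one row per DATE/CATEGORY and the columns DATE, Colour/Animal and Sales; the new column name RY_COLOUR_ALL is made up for illustration):

# monthly total over all categories, counting only the rows tagged as 'Colour'
colour_monthly = (
    df_merge.loc[df_merge['Colour/Animal'] == 'Colour']
            .groupby('DATE')['Sales']
            .sum()
            .sort_index()
)

# rolling 12-month sum of that monthly total
colour_ry = colour_monthly.rolling(12).sum()

# map the result back onto every row of df_merge via its DATE
df_merge['RY_COLOUR_ALL'] = df_merge['DATE'].map(colour_ry)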

Can I optimize this Word Mover's Distance look-up function?

I am trying to measure the Word Mover's Distance between a lot of texts using Gensim's Word2Vec tools in Python. I am comparing each text with all other texts, so I first use itertools to create pairwise combinations like [1,2,3] -> [(1,2), (1,3), (2,3)]. For memory's sake, I don't build the combinations by repeating all the texts in a big DataFrame; instead I make a reference DataFrame, combinations, containing the indices of the texts, which looks like:
   0  1
0  0  1
1  0  2
2  0  3
In the comparison function I then use these indices to look up the texts in the original DataFrame. The solution works fine, but I am wondering whether I would be able to do it with big datasets. For instance, I have a 300,000-row dataset of texts, which gives me about 100 years' worth of computation on my laptop:
C(300000, 2) = 300000! / (2! (300000 - 2)!)
             = (300000 · 299999) / 2
             = 44,999,850,000 combinations
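For reference, the same count checked in Python (math.comb requires Python 3.8+):

import math
print(math.comb(300_000, 2))  # 44999850000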
Is there any way this could be optimized better?
My code right now:
import multiprocessing
import itertools

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from gensim.models.word2vec import Word2Vec
from gensim.corpora.wikicorpus import WikiCorpus


def get_distance(row):
    try:
        sent1 = df.loc[row[0], 'text'].split()
        sent2 = df.loc[row[1], 'text'].split()
        return model.wv.wmdistance(sent1, sent2)  # compute WMD
    except Exception as e:
        return np.nan


df = pd.read_csv('data.csv')
# I then set up the gensim model, let me know if you need that bit of code too.

# Make pairwise combinations of all indices
combinations = pd.DataFrame(itertools.combinations(df.index, 2))

# To dask df and apply function
dcombinations = dd.from_pandas(combinations, npartitions=2 * multiprocessing.cpu_count())
dcombinations['distance'] = dcombinations.apply(get_distance, axis=1)
with ProgressBar():
    combinations = dcombinations.compute()
You might use wmd-relax for performance improvement. However, you'll first have to convert your model to spaCy and use the SimilarityHook as described on their webpage:
import spacy
import wmd
nlp = spacy.load('en_core_web_md')
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))

Python Scikit-Learn PCA: Get Component Score

I am trying to perform a Principal Component Analysis for work. While I have been successful in getting the principal components laid out, I don't really know how to assign the resulting component score to each line item. I am looking for an output something like this:
Town        PrinComponent 1  PrinComponent 2  PrinComponent 3
Columbia        0.31989         -0.44216         -0.44369
Middletown     -0.37101         -0.24531         -0.47020
Harrisburg     -0.00974         -0.06105          0.32792
Newport        -0.38678          0.40935         -0.62996
The scikit-learn docs are not being helpful in this circumstance. Can anybody explain to me how I can reach this output?
The code I have so far is below.
import pandas as pd
from sklearn import decomposition, preprocessing


def perform_PCA(df):
    threshold = 0.1
    pca = decomposition.PCA(n_components=3)
    numpyMatrix = df.to_numpy().astype(float)
    scaled_data = preprocessing.scale(numpyMatrix)
    pca.fit(scaled_data)
    pca.transform(scaled_data)
    pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)
    # print(pca_components_df)
    # pca_components_df.to_csv('pca_components_df.csv')
    filtered = pca_components_df[abs(pca_components_df) > threshold]
    trans_filtered = filtered.T
    # print(filtered.T)  # transformed dataframe
    trans_filtered.to_csv('trans_filtered.csv')
    print(pca.explained_variance_ratio_)
I pumped the transformed array into the data argument of the DataFrame constructor, and then set the columns and the index by passing them to columns= and index= respectively:
transformed = pca.transform(scaled_data)
pd.DataFrame(data=transformed, columns=["PC1", "PC2"], index=df.index)
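For completeness, a minimal end-to-end sketch with three components and a Town index (assumptions based on the desired output above, not code from the original answer):

import pandas as pd
from sklearn import decomposition, preprocessing

# df is assumed to be indexed by Town and to contain only numeric columns
scaled = preprocessing.scale(df.to_numpy(dtype=float))
pca = decomposition.PCA(n_components=3)
scores = pca.fit_transform(scaled)  # one row of component scores per town

score_df = pd.DataFrame(
    scores,
    columns=['PrinComponent 1', 'PrinComponent 2', 'PrinComponent 3'],
    index=df.index,
)
print(score_df.head())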
