Machine Learning UDF’s in Avalanche, VectorH, and Vector – Introduction and Creating the Model – Part 1 By Vidisha Sharma August 6, 2020 Recently in Avalanche, VectorH 6.0, and Vector 6.0, Actian introduced a capability for Scalar user-defined functions (UDF’s). This has given Avalanche, VectorH, and Vector a new dimension to run Machine Learning (ML) models in Python and JavaScript within database. More about UDF’s can be found in our documentation. Model creation is simple with so many available libraries such as Spark, Tensorflow, Python Scikit-learn (SKlearn), which is the most commonly used. Once a production-grade model is created, it needs to be deployed into production. Here Avalanche, Vector, and VectorH get an advantage by deploying these models directly in database, and therefore model scan be used to score data directly within the database. To demonstrate this, we used Python SKlearn to train the model. The focus of this blog is to demonstrate how a UDF would work in the context of deploying a machine learning model. We found a very interesting project called sklearn-porter, which transpiles the model to JavaScript and m2cgen, which can be used to transpile the model to both JavaScript and Python. Avalanche, Vector, and VectorH support both JavaScript and Python UDFs, and therefore our choice of library is m2cgen. Since our UDF’s are scalar UDFs, we needed to write some additional code for m2cgen to return scalar values. For showcasing the ML UDF, I choose the Iris dataset. It has just 4 columns and 150 rows, which makes the use case easy to comprehend. I will demonstrate an end to end test case that creates the table, loads data in database, builds the model using data from the database, and finally run the model inside the database. Iris Dataset The Iris dataset is easily available. It can be downloaded from Kaggle: https://www.kaggle.com/uciml/iris/data# It’s fields are ID (int), SepalLengthCm (float), SepalWidthCm(float), PetalLengthCm(float), PetalWidthCm (float), Species (varchar (20)). Details about Python Connection with Vector/VectorH It is discussed in https://www.actian.com/company/blog/integrating-python-vector-actianx/ on how to make python ODBC or JDBC connections. In this tutorial, I will be using ODBC connections. Connect to DB import pyodbc as pdb import pandas as pd import numpy as np conn = pdb.connect("dsn=Vector6;uid=actian;pwd=passwd" ) conn.setdecoding(pdb.SQL_CHAR, encoding='utf-8') conn.setdecoding(pdb.SQL_WCHAR, encoding='utf-8') conn.setencoding(encoding='utf-8') cursor = conn.cursor() iristbl='''create table iris1( id integer, sepallengthcm float, sepalwidthcm float, petallengthcm float, petalwidthcm float, species varchar(20))''' conn.execute(iristbl) conn.commit() I have not taken any partition as dataset has just 150 rows load data to DB This will help in bulk loading the data for CSV we downloaded from Kaggle query ="COPY iris() VWLOAD FROM '/home/actian/vidisha/datasets_19_420_Iris.csv' with fdelim=',', insertmode ='Bulk' ,header" conn.execute(query) conn.commit() Note: datasets_19_420_Iris.csv is the dataset I downloaded from Kaggle and used vwload to load the data to database. Building the Model Classification and Prediction are the two most important aspects of Machine Learning. With the Iris Dataset, we will create a simple logistic regression model for Iris classification. The focus here is not model building, however, showing how the model can be run inside the database. Checking the Data sql_case="select sepallengthcm ,sepalwidthcm, petalwidthcm ,petalwidthcm , CASE WHEN species='Iris-setosa' THEN '1' WHEN species='Iris-versicolor' THEN '2' ELSE '3' END as speciesclass FROM iris" iris_case=pd.read_sql(sql_case, conn) print(iris_case.shape) iris_case.info(verbose=True) iris_case.describe() iris_case.head(10) Split the Test and Train Data sql_case="select sepallengthcm ,sepalwidthcm, petalwidthcm ,petalwidthcm , CASE WHEN species='Iris-setosa' THEN '1' WHEN species='Iris-versicolor' THEN '2' ELSE '3' END as speciesclass FROM iris" iris_case=pd.read_sql(sql_case, conn) print(iris_case.shape) iris_case.info(verbose=True) iris_case.describe() iris_case.head(10) In the second part of this two-part article we will go through the steps to create the UDF’s in database. To learn more about the capabilities of all the Actian products, visit our website. About Vidisha Sharma Vidisha Sharma is working as consultant for Actian Support. She has done her master’s in computers with specialization in analytics. Vidisha has around 16 years of experience and working for Actian since last 10 years on various database technologies.