Data Engineering

ML UDFs in Actian Data Platform, VectorH, and Vector – Part 1

Actian Corporation

August 6, 2020

Machine Learning UDF showing a robot hand and a human hand

Recently in Actian Data Platform, VectorH 6.0, and Vector 6.0, Actian introduced a capability for Scalar user-defined functions (UDFs). This has given Actian Data Platform, VectorH, and Vector a new dimension to run Machine Learning (ML) models in Python and JavaScript within a database. More about UDFs can be found in our documentation.

Model creation is simple with so many available libraries such as Spark, Tensorflow, Python Scikit-learn (SKlearn), which is the most commonly used. Once a production-grade model is created, it needs to be deployed into production. Here Actian Data Platform, Vector, and VectorH get an advantage by deploying these models directly in the database, and therefore model scan be used to score data directly within the database.

To demonstrate this, we used Python SKlearn to train the model. The focus of this blog is to demonstrate how a UDF would work in the context of deploying a machine-learning model.

We found a very interesting project called sklearn-porter, which transpiles the model to JavaScript and m2cgen, which can be used to transpile the model to both JavaScript and Python. Actian Data Platform, Vector, and VectorH support both JavaScript and Python UDFs, and therefore our choice of library is m2cgen. Since our UDFs are scalar UDFs, we needed to write some additional code for m2cgen to return scalar values.

For showcasing the ML UDF, I chose the Iris dataset. It has just 4 columns and 150 rows, which makes the use case easy to comprehend. I will demonstrate an end-to-end test case that creates the table, loads data in the database, builds the model using data from the database, and finally runs the model inside the database.

Iris Dataset

The Iris dataset is easily available. It can be downloaded from Kaggle: https://www.kaggle.com/uciml/iris/data#

Its fields are ID (int), SepalLengthCm (float), SepalWidthCm(float), PetalLengthCm(float), PetalWidthCm (float), Species (varchar (20)).

Details About Python Connection with Vector/VectorH

It is discussed in https://www.actian.com/blog/data-warehouse/integrating-python-vector-ingres/ on how to make python ODBC or JDBC connections. In this tutorial, I will be using ODBC connections.

Connect to DB

import pyodbc as pdb
import pandas as pd
import numpy as np
conn = pdb.connect("dsn=Vector6;uid=actian;pwd=passwd" )
conn.setdecoding(pdb.SQL_CHAR, encoding='utf-8')
conn.setdecoding(pdb.SQL_WCHAR, encoding='utf-8')
conn.setencoding(encoding='utf-8')
cursor = conn.cursor()
iristbl='''create table iris1(
id integer,
sepallengthcm float,
sepalwidthcm float,
petallengthcm float,
petalwidthcm float,
species varchar(20))'''
conn.execute(iristbl)
conn.commit()

I have not taken any partition as dataset has just 150 rows

Load Data to DB

This will help in bulk loading the data for CSV we downloaded from Kaggle

query ="COPY iris() VWLOAD FROM '/home/actian/vidisha/datasets_19_420_Iris.csv' with fdelim=',', insertmode ='Bulk' ,header"
conn.execute(query)
conn.commit()

Note: datasets_19_420_Iris.csv is the dataset I downloaded from Kaggle and used vwload to load the data to database.

Building the Model

Classification and Prediction are the two most important aspects of Machine Learning. With the Iris Dataset, we will create a simple logistic regression model for Iris classification. The focus here is not model building, however, showing how the model can be run inside the database.

Checking the Data

sql_case="select sepallengthcm ,sepalwidthcm, petalwidthcm ,petalwidthcm , 
CASE 
WHEN species='Iris-setosa' THEN '1' 
WHEN species='Iris-versicolor' THEN '2' 
ELSE '3' 
END as speciesclass 
FROM iris"
iris_case=pd.read_sql(sql_case, conn)
print(iris_case.shape)
iris_case.info(verbose=True)
iris_case.describe()
iris_case.head(10)

Split the Test and Train Data

sql_case="select sepallengthcm ,sepalwidthcm, petalwidthcm ,petalwidthcm , 
CASE 
WHEN species='Iris-setosa' THEN '1' 
WHEN species='Iris-versicolor' THEN '2' 
ELSE '3' 
END as speciesclass 
FROM iris"
iris_case=pd.read_sql(sql_case, conn)
print(iris_case.shape)
iris_case.info(verbose=True)
iris_case.describe()
iris_case.head(10)

In the second part of this two-part article we will go through the steps to create the UDFs in database.

To learn more about the capabilities of all the Actian products, visit our website.

About Actian Corporation

Actian makes data easy. Our data platform simplifies how people connect, manage, and analyze data across cloud, hybrid, and on-premises environments. With decades of experience in data management and analytics, Actian delivers high-performance solutions that empower businesses to make data-driven decisions. Actian is recognized by leading analysts and has received industry awards for performance and innovation. Our teams share proven use cases at conferences (e.g., Strata Data) and contribute to open-source projects. On the Actian blog, we cover topics ranging from real-time data ingestion to AI-driven analytics.

Actian Data Intelligence Platform

Core Capabilities

Actian Data Observability

Core Capabilities

Actian Data Platform

Core Capabilities

Data Management

Products

Databases

Products

Product Overview

All Products

ML UDFs in Actian Data Platform, VectorH, and Vector – Part 1

Iris Dataset

Details About Python Connection with Vector/VectorH

Connect to DB

Load Data to DB

Building the Model

Checking the Data

Split the Test and Train Data

About Actian Corporation

Ready to Get Started?

ML UDFs in Actian Data Platform, VectorH, and Vector – Part 1

Iris Dataset

Details About Python Connection with Vector/VectorH

Connect to DB

Load Data to DB

Building the Model

Checking the Data

Split the Test and Train Data

About Actian Corporation

Related Tags

Subscribe to the Actian Blog

Subscribe

Thank you for subscribing to the Actian Blog!

Ready to Get Started?