
Machine Learning UDFs in Actian Data Platform, VectorH, and Vector – Creating UDFs in the Database – Part 2

Actian Corporation

August 6, 2020


In the first part of this two-part article, we created the model. In this installment, we will create the UDFs in the database.

We can convert the model to plain Python code; I used m2cgen to transpile the scikit-learn model into Python.

I will be creating three types of UDFs:

  1. Python UDF.
  2. JavaScript UDF.
  3. Saving the model parameters to JSON and then classifying using the JSON data.

Python UDF

The following single line of code gives us the model as Python code.

import m2cgen as m2c

# transpile the trained scikit-learn classifier to plain Python
code = m2c.export_to_python(clf)
The output we get is:

def score(input):
    return [0.2834483510261381 + input[0] * 0.4156346233598447
            + input[1] * 1.6410466066307934
            + input[2] * -2.4486206310542746
            + input[3] * -1.0775840518519904,
            1.271492823301712 + input[0] * 0.7334652955571552
            + input[1] * -2.0527321763924102
            + input[2] * 0.4049181379167901
            + input[3] * -1.4276343377888174,
            -1.403879236024429 + input[0] * -1.8886197150361799
            + input[1] * -1.701005052157171
            + input[2] * 2.803528886728411
            + input[3] * 2.6749639949986195]
The generated score function expects an array of floats, but our UDF will receive its input as a single comma-separated string, so we split the string and convert each element to a float (for example, '5.1,3.5,1.4,0.2' becomes [5.1, 3.5, 1.4, 0.2]):

input = [float(i) for i in lst.split(",")]

Linear Regression and Logistic Regression Maths

The formula for linear regression is:

Y = W₀ + X₁W₁ + X₂W₂ + X₃W₃ + …

Linear regression is designed for predicting continuous numerical values rather than for classification.

For logistic regression, however, we want an output between 0 and 1, so we apply a nonlinear function called the sigmoid:

H(X) = 1 / (1 + e^(−Y))

The output is interpreted in terms of probabilities.
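
As a quick illustration, here is a minimal sketch of the sigmoid applied to three made-up one-vs-rest raw scores (the values are illustrative, not from our model):

import math

def sigmoid(y):
    # H(X) = 1 / (1 + e^(-Y))
    return 1 / (1 + math.exp(-y))

# illustrative raw scores for the three one-vs-rest classifiers
scores = [2.1, -0.4, -3.0]
print([sigmoid(s) for s in scores])  # ~[0.891, 0.401, 0.047] -> class 1 wins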

We have four input values, and m2cgen gives us one linear equation per class, combining our inputs with the coefficients and intercepts calculated by the model. Applying the sigmoid function to each of the three equations shown earlier gives us one-vs-rest classification.

arr1 = []
for i in range(0, len(var0)):
    arr1.append(1 / (1 + math.exp(-var0[i])))
print(arr1)

After applying the sigmoid, we get an array of three probabilities. Since our UDF is scalar, we need to modify the code to return the index of the array element with the maximum probability (adding 1 so the classes are numbered 1 to 3):

index = 0
max_val = arr1[0]
for i in range(0, len(arr1)):
    if arr1[i] > max_val:
        index = i
        max_val = arr1[i]
return index + 1
Putting it all together, our function becomes:
import math

def score(lst):
    input = [float(i) for i in lst.split(",")]
    var0 = [((((0.24739569243110213) + ((input[0]) * (0.3677840447489154))) + ((input[1]) * (1.4151458935607388))) + ((input[2]) * (-2.124321635674482))) + ((input[3]) * (-0.9309638143559119)), ((((0.7060488675827682) + ((input[0]) * (0.6404296785789872))) + ((input[1]) * (-1.7889249557223028))) + ((input[2]) * (0.4056580513021318))) + ((input[3]) * (-1.106884750746711)), ((((-0.9529159159570318) + ((input[0]) * (-1.5615947756178696))) + ((input[1]) * (-1.1894433955845047))) + ((input[2]) * (2.1944766675309997))) + ((input[3]) * (2.0100453163309537))]
    print(var0)
    index = 0
    arr1 = []
    for i in range(0, len(var0)):
        arr1.append(1 / (1 + math.exp(-var0[i])))
    print(arr1)
    max_val = arr1[0]
    for i in range(0, len(arr1)):
        if arr1[i] > max_val:
            index = i
            max_val = arr1[i]
    return index + 1
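
Before wrapping this in SQL, we can test the function locally (the sample below is the first row of the iris dataset, so class 1, Iris-setosa, is expected):

print(score("5.1,3.5,1.4,0.2"))  # expected: 1 (Iris-setosa)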
Our UDF body is ready; we just need to wrap it in a CREATE OR REPLACE FUNCTION statement so it can be registered in the database. Note that the UDF below takes the argmax of the raw scores directly and skips the sigmoid step: since the sigmoid is monotonically increasing, it does not change which score is largest.
udf = '''CREATE OR REPLACE FUNCTION Iris_classifier(lst VARCHAR(100) not null)
return (int not null)
AS LANGUAGE PYTHON
SOURCE='
def score(lst):
    input = [float(i) for i in lst.split(",")]
    var0 = [((((0.24739569243110213) + ((input[0]) * (0.3677840447489154))) + ((input[1]) * (1.4151458935607388))) + ((input[2]) * (-2.124321635674482))) + ((input[3]) * (-0.9309638143559119)), ((((0.7060488675827682) + ((input[0]) * (0.6404296785789872))) + ((input[1]) * (-1.7889249557223028))) + ((input[2]) * (0.4056580513021318))) + ((input[3]) * (-1.106884750746711)), ((((-0.9529159159570318) + ((input[0]) * (-1.5615947756178696))) + ((input[1]) * (-1.1894433955845047))) + ((input[2]) * (2.1944766675309997))) + ((input[3]) * (2.0100453163309537))]
    # print(var0)
    index = 0
    max_val = var0[0]
    for i in range(0, len(var0)):
        if var0[i] > max_val:
            index = i
            max_val = var0[i]
    return index + 1
return score(lst)'
'''

Our UDF is ready, and we just have to add it to the database.

cursor.execute(udf)
conn.commit()
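
For completeness, here is a minimal sketch of the connection objects used above, assuming a pyodbc DSN configured for your Actian instance (the DSN name is a placeholder; part 1 of this article covers the actual setup):

import pyodbc

# hypothetical DSN name; substitute your own Actian Data Platform/Vector/VectorH data source
conn = pyodbc.connect("DSN=vector_dsn")
cursor = conn.cursor()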

The UDF is now in the database and can be used to classify the data. Notice that the four feature columns are concatenated into a single comma-separated string, since the scalar UDF takes one VARCHAR argument and splits it into floats itself.

predict = '''select CASE
                WHEN species='Iris-setosa' THEN '1'
                WHEN species='Iris-versicolor' THEN '2'
                ELSE '3'
            END as species,
            Iris_classifier(sepallengthcm || ',' || sepalwidthcm || ',' || petallengthcm || ',' || petalwidthcm) as classify
       from iris'''
predict1 = pd.read_sql(predict, conn)
predict1.head()
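
With the predictions in a DataFrame, a quick sanity check is to compare the label encoded by the CASE expression against the UDF output (a sketch; the column names follow the query above):

# the CASE expression returns the label as a string, so cast it before comparing
predict1["species"] = predict1["species"].astype(int)
accuracy = (predict1["species"] == predict1["classify"]).mean()
print("in-database scoring accuracy: {:.2%}".format(accuracy))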

JavaScript UDF

To transpile the model to JavaScript, we just call:

code = m2c.export_to_javascript(clf)

The generated output is:

function score(input) {
    return [((((0.2901789521827987) + ((input[0]) * (0.4467535618836661))) + ((input[1]) * (1.5843519667681565))) + ((input[2]) * (-2.409947591791464))) + ((input[3]) * (-1.0736156286007468)), ((((0.8330817501826279) + ((input[0]) * (1.0259003852575856))) + ((input[1]) * (-2.4327046722797347))) + ((input[2]) * (0.48745117088772905))) + ((input[3]) * (-1.8329094419137872)), ((((-1.141764377255606) + ((input[0]) * (-2.35933886039274))) + ((input[1]) * (-1.2090666108474617))) + ((input[2]) * (2.981435002839768))) + ((input[3]) * (2.99871035615134))];
}

Here, too, I return the index of the maximum raw score without applying the sigmoid: because the sigmoid is monotonic, the class with the maximum raw score is also the class with the maximum probability.

We just need to adapt it to the UDF syntax for JavaScript.

udf_JS = '''create function iris_lr_javascript1(lst VARCHAR(100) not null)
return (int not null) AS LANGUAGE JAVASCRIPT SOURCE='
function score(lst) {
    var input = lst.split(",").map(function(item) { return parseFloat(item); });
    var nums = [((((0.2901789521827987) + ((input[0]) * (0.4467535618836661))) + ((input[1]) * (1.5843519667681565))) + ((input[2]) * (-2.409947591791464))) + ((input[3]) * (-1.0736156286007468)), ((((0.8330817501826279) + ((input[0]) * (1.0259003852575856))) + ((input[1]) * (-2.4327046722797347))) + ((input[2]) * (0.48745117088772905))) + ((input[3]) * (-1.8329094419137872)), ((((-1.141764377255606) + ((input[0]) * (-2.35933886039274))) + ((input[1]) * (-1.2090666108474617))) + ((input[2]) * (2.981435002839768))) + ((input[3]) * (2.99871035615134))];
    var index = 0;
    for (var i = 0; i < nums.length; i++) {
        index = nums[i] > nums[index] ? i : index;
    }
    return index + 1;
}
return score(lst)'
'''
cursor.execute(udf_JS)
conn.commit()

And our UDF is added to the database; we can use it for classification just as we did with the Python UDF.
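
A quick way to smoke-test the new function is to call it on a single literal input (a sketch; the sample is the first row of the iris dataset, so 1, Iris-setosa, is the expected class):

cursor.execute("select iris_lr_javascript1('5.1,3.5,1.4,0.2')")
print(cursor.fetchone()[0])  # expected: 1 (Iris-setosa)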

UDF with JSON data

VectorH 6.0 and Vector 6.0 also support the JSON data type, so we can export the model parameters (coefficients and intercepts) to JSON and use them later for scoring.

import json

# collect the trained model's coefficients and intercepts
model_param = {}
model_param['coef'] = clf.coef_.tolist()
model_param['intercept'] = clf.intercept_.tolist()
b = json.dumps(model_param)
print(b)

The output is:

{"coef": [[0.4467535618836661, 1.5843519667681565, -2.409947591791464, -1.0736156286007468], [1.0259003852575856, -2.4327046722797347, 0.48745117088772905, -1.8329094419137872], [-2.35933886039274, -1.2090666108474617, 2.981435002839768, 2.99871035615134]], "intercept": [0.2901789521827987, 0.8330817501826279, -1.141764377255606]}
Now we embed these parameters in a Python UDF:

jsonstr = '''create function iris_lr_json(lst VARCHAR(100) not null)
return (int not null) AS LANGUAGE PYTHON SOURCE='
import json
def score1(lst):
    input = [float(i) for i in lst.split(",")]
    # triple-quoted so the double quotes inside the JSON need no escaping
    json_data = """{"coef": [[0.4156346233598447, 1.6410466066307934, -2.4486206310542746, -1.0775840518519904], [0.7334652955571552, -2.0527321763924102, 0.4049181379167901, -1.4276343377888174], [-1.8886197150361799, -1.701005052157171, 2.803528886728411, 2.6749639949986195]], "intercept": [0.2834483510261381, 1.271492823301712, -1.403879236024429]}"""
    dictjson = json.loads(json_data)
    # raise Exception("it is %s" % str(dictjson))
    var0 = [dictjson["intercept"][0] + input[0]*dictjson["coef"][0][0] + input[1]*dictjson["coef"][0][1] + input[2]*dictjson["coef"][0][2] + input[3]*dictjson["coef"][0][3],
            dictjson["intercept"][1] + input[0]*dictjson["coef"][1][0] + input[1]*dictjson["coef"][1][1] + input[2]*dictjson["coef"][1][2] + input[3]*dictjson["coef"][1][3],
            dictjson["intercept"][2] + input[0]*dictjson["coef"][2][0] + input[1]*dictjson["coef"][2][1] + input[2]*dictjson["coef"][2][2] + input[3]*dictjson["coef"][2][3]]
    # print(var0)
    index = 0
    max_val = var0[0]
    for i in range(0, len(var0)):
        if var0[i] > max_val:
            index = i
            max_val = var0[i]
    return index + 1
return score1(lst)'
'''
cursor.execute(jsonstr)
conn.commit()

The UDF can be called with a simple query, just as shown in the Python UDF section.
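
For example (a sketch mirroring the earlier scoring query):

predict_json = pd.read_sql(
    "select iris_lr_json(sepallengthcm || ',' || sepalwidthcm || ',' "
    "|| petallengthcm || ',' || petalwidthcm) as classify from iris",
    conn,
)
predict_json.head()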

Detailed code can be found at https://github.com/ActianCorp/MachineLearningUDFs/blob/master/Vector-UDF.ipynb

In case you have any questions, feel free to reach me at vidisha.sharma@actian.com

Conclusion

The attraction of in-database UDFs for ML scoring is the ability to score data where it lives, at very high speed. The model and the data are in the same place, which eliminates data movement. With all operations running in the database, your machine learning models can take full advantage of the Vector X100 engine. Python is one of the most widely used languages for ML development, so it is straightforward to migrate Python and JavaScript models into in-database scalar UDFs for batch and near-real-time scoring use cases.

To learn more about the capabilities of all the Actian products, visit our website.
