Data Mining Service definition in Cloud Computing

Authors: Manuel Parra-Royon, J.M. Benítez
Laboratory: http://dicits.ugr.es
Department: Soft Computing and Intelligent Information Systems, a University of Granada research group


This is the documentation for the set of ontologies that allow data mining services to be defined in Cloud Computing (dmcc-schema).

The set of schemas covers all the key aspects of a Cloud Computing service, with emphasis on SLA/SLO, service pricing, DM/ML workflows with experiments, authentication, and so on. In general, it contains everything needed to define a Cloud Computing Data Mining service.

Each vocabulary and schema has been defined separately, so that each service component, along with the auxiliary diagrams created for it, can be worked on more easily.

For each schema, the sections below list the Turtle file, the documentation, the validation datasets, and the entry in Linked Open Vocabularies (LOV).

dmcc-schema

With dmcc-schema, a cloud computing service can be defined using ontologies, and more specifically a Data Mining and Machine Learning service in the cloud. Cloud services require specific functional characteristics, which define aspects such as license agreements and service use, service authentication, service interaction, the provider's service catalog, and workflows with algorithms and Data Mining functions. dmcc-schema allows all these aspects to be defined in a single schema, adapted to the specific characteristics of this type of service: ML/DM over Cloud Computing services.

This schema follows the Linked Data recommendations, re-using other vocabularies and schemas to complement and enrich the service definition.

Schema

The schema in Turtle (.ttl) format is available at: http://cookingbigdata.com/linkeddata/dmcc-schema/

Documentation

Documentation is available here:
http://cookingbigdata.com/linkeddata/dmcc-schema/documentation/

Datasets

A set of datasets has been developed to validate the model using SPARQL (for example):

Linked Open Vocabularies

The vocabulary has been included in the LOV (Linked Open Vocabularies) platform (http://lov.okfn.org/):

http://lov.okfn.org/dataset/lov/vocabs/dmc

ccsla

This schema makes it easy to define service level agreements for cloud computing providers.

It covers terms, claims, service credits, and so on, as well as compensation when agreements are violated.

Schema

The schema in Turtle (.ttl) format is available at: http://cookingbigdata.com/linkeddata/ccsla/

Documentation

Documentation is available here:
http://cookingbigdata.com/linkeddata/ccsla/documentation/

Datasets

A set of datasets has been developed to validate the model using SPARQL (for example):

Linked Open Vocabularies

The vocabulary has been included in the LOV (Linked Open Vocabularies) platform (http://lov.okfn.org/):

http://lov.okfn.org/dataset/lov/vocabs/ccsla

ccpricing

This schema allows defining the pricing of cloud computing services, especially those related to Data Mining and Machine Learning, which are affected by characteristics such as instance type, region, storage space utilisation, etc. With ccpricing it is possible to define in a simple way all the possibilities of service pricing in Cloud for D

Schema

The schema in Turtle (.ttl) format is available at: http://cookingbigdata.com/linkeddata/ccpricing/

Documentation

Documentation is available here:
http://cookingbigdata.com/linkeddata/ccpricing/documentation/

Datasets

A set of datasets has been developed to validate the model using SPARQL (for example):

Linked Open Vocabularies

The vocabulary has been included in the LOV (Linked Open Vocabularies) platform (http://lov.okfn.org/):

http://lov.okfn.org/dataset/lov/vocabs/ccp

ccdm

To define experiments or data mining functions, a simple schema has been created. It uses part of the definitions proposed by mls-schema (Machine Learning Schema, http://lov.okfn.org/dataset/lov/vocabs/mls) and complements them for the workflow of a DM service in the Cloud.

With this vocabulary it is possible to define the input (parameters, datasets, etc.), output (models, datasets) and algorithms of a data mining service.

Schema

The schema in Turtle (.ttl) format is available at: http://cookingbigdata.com/linkeddata/ccdm/

Documentation

Documentation is available here:
http://cookingbigdata.com/linkeddata/ccdm/documentation/

Datasets

A set of datasets has been developed to validate the model using SPARQL (for example):

Linked Open Vocabularies

The vocabulary has been included in the LOV (Linked Open Vocabularies) platform (http://lov.okfn.org/):

http://lov.okfn.org/dataset/lov/vocabs/ccdm

ccinstances

The ccinstances vocabulary allows defining the types of instances offered by cloud computing providers, including CPU, RAM, network and storage, among others.

Schema

The schema in Turtle (.ttl) format is available at: http://cookingbigdata.com/linkeddata/ccinstances/

Documentation

Documentation is available here:
http://cookingbigdata.com/linkeddata/ccinstances/documentation/

Datasets

A set of datasets has been developed to validate the model using SPARQL (for example):

Linked Open Vocabularies

The vocabulary has been included in the LOV (Linked Open Vocabularies) platform (http://lov.okfn.org/):

http://lov.okfn.org/dataset/lov/vocabs/cci

ccregions

With ccregions, you can define the regions and availability zones for the instances and experiments of a Cloud Computing provider. This vocabulary makes it possible to define availability zones, data sovereignty, physical location, zone names and applicable legislation.

Schema

The schema in Turtle (.ttl) format is available at: http://cookingbigdata.com/linkeddata/ccregions/

Documentation

Documentation is available here:
http://cookingbigdata.com/linkeddata/ccregions/documentation/

Datasets

A set of datasets has been developed to validate the model using SPARQL (for example):

Linked Open Vocabularies

The vocabulary has been included in the LOV (Linked Open Vocabularies) platform (http://lov.okfn.org/):

http://lov.okfn.org/dataset/lov/vocabs/ccr

Individual examples

Examples of instantiation of the vocabularies defined in this document:

ccsla

Cloud Computing Service Level Agreement.

Use Case:

The Data Mining service offers two possible compensations for service failures, measured over one-month periods. You define the Monthly Uptime Percentage (MUP) and two intervals:

  • less than 99.0% implies a service credit of 30%.
  • 99.0% to 99.99% implies a service credit of 10%.

For the first MUP term, the SLA definition would be:

_:Term2 a ccsla:Term;
		rdfs:label "Term1";
		rdfs:comment "";
		ccsla:includeDefs _:def2;
		ccsla:hasCompensation _:comp2;
	.
_:def2 a ccsla:Definition;
	rdfs:label "Monthly Uptime Percentage <99.0%";
	rdfs:comment "Less than 99.0%. It is calculated by subtracting from 100% the percentage of minutes during the month in which any of the Included Products and Services, as applicable, was in the state of “Region Unavailable.” Monthly Uptime Percentage measurements exclude downtime resulting directly or indirectly from any Amazon Compute Services SLA Exclusion.";
	ccsla:hasDefinitionValue [
		a sc:structuredValue;
			sc:value [
				sc:maxValue "98.99";
				sc:minValue "0.0";
				sc:unitText "Percentage";
			];
	];
	.

_:comp2 a ccsla:Condition;
	rdfs:label "Commitments for Monthly Uptime Percentage";
	rdfs:comment "Commitments MUP Less than 99.0%";
	ccsla:includeValue [ 
		a sc:structuredValue;
			sc:value "30";
			sc:unitText "Service Credit Percentage";
	];
.
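The compensation rule of this use case can also be sketched in plain Python. This is only an illustration, not part of ccsla: the function names are hypothetical, the MUP formula follows the wording of the rdfs:comment above (100% minus the percentage of unavailable minutes), and treating anything above 99.99% as "no credit" is an assumption.

```python
def monthly_uptime_percentage(total_minutes: int, unavailable_minutes: int) -> float:
    """MUP: 100% minus the percentage of minutes the service was unavailable."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 - 100.0 * unavailable_minutes / total_minutes


def service_credit_percentage(mup: float) -> int:
    """Map a MUP value to the service credit percentage of the use case."""
    if mup < 99.0:
        return 30  # first term: MUP below 99.0%
    if mup <= 99.99:
        return 10  # second term: MUP from 99.0% to 99.99%
    return 0  # assumption: no compensation above 99.99%


minutes_in_month = 30 * 24 * 60  # a 30-day month
mup = monthly_uptime_percentage(minutes_in_month, 600)
print(round(mup, 2), service_credit_percentage(mup))
```

Each branch of service_credit_percentage corresponds to one ccsla:Term with its own definition interval and compensation value.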

ccregions

Cloud Computing regions and AZ.

Use Case:

Example for the N. Virginia Region (Amazon), including the entry point for the Region, its location, and two AZs.

<http://example.org/cc/AmazonRegion#us-east-1> a ccregions:Region;
	rdfs:label "Region and AZ.";
	rdfs:comment "Each region is completely independent. Each Availability Zone (AZ) is isolated, but connected";
	ccregions:region_code "us-east-1"^^xsd:string;
	ccregions:region_name "US East (N. Virginia)"^^xsd:string;
	ccregions:region_geocompilance "United States of America (USA) and Canada"^^xsd:string;
	ccregions:region_location _:rlocation;
	ccregions:region_dataresidency _:rdataresidency;
	ccregions:hasAvailabilityZone _:az_a,
	                              _:az_b;	
	ccregions:region_endpoint [
		a sc:EntryPoint;
			sc:urlTemplate "https://ec2.us-east-1.amazonaws.com";
	];
	.

_:az_a a ccregions:AvailabilityZone;
	ccregions:availabilityzone_name "us-east-1a"^^xsd:string;
	ccregions:availabilityzone_status "available"^^xsd:string;
	.

_:az_b a ccregions:AvailabilityZone;
	ccregions:availabilityzone_name "us-east-1b"^^xsd:string;
	ccregions:availabilityzone_status "available"^^xsd:string;
	.

_:rlocation a sc:Place;
	sc:address [
		a sc:PostalAddress;
			sc:addressLocality "Springfield"^^xsd:string;
			sc:addressRegion "VA"^^xsd:string;
			sc:postalCode "22151"^^xsd:string;
			sc:streetAddress "5617 Industrial Dr"^^xsd:string;
			sc:addressCountry "US"^^xsd:string;
		]
	.

ccdm

Cloud computing Service for Data Mining

Use case:

Create a Linear Regression service. This service requires an external dataset, has hyperparameters (algorithm parameters), and returns a model for the input data.

Define the service and link it to its input parameters, input and output:

<http://example.org/cc/MLService#LinearRegression> a ccdm:MLFunction;	
	dc:created "2017-04-20" ;
	dc:creator "Manuel Parra, Ruben Castro, J. Antonio Cortes" ;
	dc:title "Linear Regression" ;
	dc:description "Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression." ;
	dc:modified "2018-05-20" ;
	dc:publisher "DICITS_ML" ;
    ccdm:hasInputParameters _:LinearRegression_Service_InputParameters ;
    mls:hasInput _:LinearRegression_Service_Input ;
    mls:hasOutput _:LinearRegression_Service_Output	
	.

Set the input parameters:

_:LinearRegression_Service_InputParameters a ccdm:MLServiceInputParameters;
	dc:description "Input Parameters" ;
	dc:title "Input" ;    
	ccdm:Parameters _:response_parameter_01,
	                _:response_parameter_02,
                    _:response_parameter_03,
                    _:response_parameter_04 ;
	.

And set the attributes of each parameter:

_:response_parameter_01 a ccdm:MLServiceInputParameter ;
    ccdm:defaultvalue "" ;
    ccdm:mandatory "true" ;
    dc:description "Response variable and Formulae" ;
    dc:title "formula" 
	.
	
_:response_parameter_02 a ccdm:MLServiceInputParameter ;
    ccdm:defaultvalue "NULL" ;
    ccdm:mandatory "optional" ;
    dc:description "Optional vector specifying a subset of observations to be used in the fitting process" ;
    dc:title "subset" .

_:response_parameter_03 a ccdm:MLServiceInputParameter ;
    ccdm:defaultvalue "na.remove" ;
    ccdm:mandatory "optional" ;
    dc:description "A function which indicates what should happen when the data contain NAs" ;
    dc:title "na__action" .

_:response_parameter_04 a ccdm:MLServiceInputParameter ;
    ccdm:defaultvalue "NULL" ;
    ccdm:mandatory "optional" ;
    dc:description "Optional vector of weights to be used in the fitting process. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used" ;
    dc:title "weights" .
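To show how a service implementation might consume these parameter definitions, here is a minimal Python sketch. It is not part of ccdm: the PARAMETERS table is copied by hand from the Turtle above (titles, ccdm:mandatory and ccdm:defaultvalue), and resolve_parameters is a hypothetical helper that fills in defaults and rejects requests that lack a mandatory parameter.

```python
# Parameter definitions mirrored by hand from the Turtle example above.
PARAMETERS = {
    "formula": {"mandatory": True, "default": ""},
    "subset": {"mandatory": False, "default": "NULL"},
    "na__action": {"mandatory": False, "default": "na.remove"},
    "weights": {"mandatory": False, "default": "NULL"},
}


def resolve_parameters(supplied: dict) -> dict:
    """Return the full parameter set for a request, applying default values."""
    unknown = set(supplied) - set(PARAMETERS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, spec in PARAMETERS.items():
        if name in supplied:
            resolved[name] = supplied[name]
        elif spec["mandatory"]:
            raise ValueError(f"missing mandatory parameter: {name}")
        else:
            resolved[name] = spec["default"]
    return resolved


print(resolve_parameters({"formula": "y ~ x1 + x2"}))
```

In a real deployment the table would more likely be read from the RDF description than hard-coded.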

dmcc-schema

Data Mining Schema for Cloud Computing Services

Use case:

Create a Cloud Computing service offering K-Means (or Linear Regression, as shown in the example above) for Data Mining. This service comprises interaction, SLA, pricing and functions/operations, among other aspects.

The individual aspects of SLA, pricing and function/operation have been defined in the examples above. Now we define the service from the highest level, which integrates the other elements:

_:MLProvider_Dicits a dmcc:MLServiceProvider;
	rdfs:label "Machine Learning Provider"@en ;
	dcterms:description "DICITS Machine Learning Service Provider"@en ;
	gr:name "DICITS ML Services Provider";
	gr:legalName "University of Granada";
	gr:hasNAICS "541519";
	s:url "http://www.dicits.ugr.es";
	s:serviceLocation 
		[ a s:PostalAddress;
			s:addressCountry "ES";
			s:addressLocality "Granada";
			s:postalCode "18017";
		 ] ;
	s:contactPoint 
		[
			a s:ContactPoint;
			s:telephone "+34 958";
			s:contactType "Customer Service hotline";
			s:availableLanguage [ a s:Language;
									s:name "English";];
			s:email "serviceml@dicits.ugr.es";
		];
		
	dmcc:hasMLService _:MLServiceDicitsRF;
	dmcc:hasOfferCatalog _:MLServiceDicitsCatalog;
	.

Now, we need to define each of the above aspects of the Data Mining Service definition:

_:MLServiceDicitsRF a dmcc:MLService;
	rdfs:label "ML Service on Dicits.ugr.es"@en ;
	dcterms:description "DICITS Machine Learning Service"@en ;
	dmcc:hasInteractionPoint _:MLServiceInteraction;
	dmcc:hasServiceCommitment _:MLServiceSLA;
	dmcc:hasFunction _:MLServiceFunction;
	dmcc:hasAuthentication _:MLServiceAuth;
	dmcc:hasPricingPlan _:MLServicePricing;
	.

Each element, such as _:MLServicePricing or _:MLServiceSLA, can be instantiated using the examples above.

ccinstances

Cloud computing instances

Use case:

Example of Amazon instances for T2 and M5:

  • t2.micro with Intel Xeon, 1 core, 1 GB RAM and EBS storage
  • m5.2xlarge with Intel Xeon Platinum, 8 cores, 32 GB RAM and SSD/EBS storage

t2.micro

<http://example.org/cc/AmazonInstances#t2.micro> a ccinstances:Instance;

	rdfs:label "t2.micro, Baseline level of CPU performance";
	rdfs:comment "T2 instances are Burstable Performance Instances that provide a baseline level of CPU performance with the ability to burst above the baseline.";
	ccinstances:code "T2"^^xsd:string ;
	ccinstances:model "t2.micro"^^xsd:string ;
	ccinstances:type "General purpose"^^xsd:string ;
	ccinstances:hasCPU [ 
		a ccinstances:cpu;
				ccinstances:cpu_model "Intel Xeon E7-8893 v3"^^xsd:string ;
				ccinstances:cpu_code "E7-8893V3"^^xsd:string ;
				ccinstances:cpu_cores "1"^^xsd:integer;
				ccinstances:cpu_frecuency "3200"^^xsd:integer;
				ccinstances:max_frecuency "3500"^^xsd:integer;
				ccinstances:cpu_cache "45"^^xsd:integer;	
		];
	ccinstances:hasRAM [
		a ccinstances:ram;
				ccinstances:ram_size "1024"^^xsd:integer;
				ccinstances:ram_frecuency "2800"^^xsd:integer;
		];
	ccinstances:hasStorage [
		a ccinstances:storage;
			ccinstances:technology "EBS"^^xsd:string ;	
		];
	.

For m5.2xlarge

<http://example.org/cc/AmazonInstances#m5.2xlarge> a ccinstances:Instance;

	rdfs:label "m5.2xlarge, Baseline level of CPU performance";
	rdfs:comment "M5 instances are the latest generation of General Purpose Instances. This family provides a balance of compute, memory, and network resources, and it is a good choice for many applications";
	ccinstances:code "M5"^^xsd:string ;
	ccinstances:model "m5.2xlarge"^^xsd:string ;
	ccinstances:type "General purpose"^^xsd:string ;
	ccinstances:hasCPU [ 
		a ccinstances:cpu;
				ccinstances:cpu_model "Intel Xeon® Platinum 8175"^^xsd:string ;
				ccinstances:cpu_code "8175"^^xsd:string ;
				ccinstances:cpu_cores "8"^^xsd:integer;
				ccinstances:cpu_frecuency "2100"^^xsd:integer;
				ccinstances:max_frecuency "3800"^^xsd:integer;
				ccinstances:cpu_cache "38"^^xsd:integer;	
		];
	ccinstances:hasRAM [
		a ccinstances:ram;
				ccinstances:ram_size "32768"^^xsd:integer;
				ccinstances:ram_frecuency "2800"^^xsd:integer;
		];
	ccinstances:hasStorage [
		a ccinstances:storage;
			ccinstances:technology "SSD/EBS"^^xsd:string ;	
		];
	.
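Instance descriptions like these are typically used to match a workload to the hardware it needs. The following Python sketch illustrates that idea under stated assumptions: the Instance class and smallest_fitting_instance are hypothetical names, and the two records simply repeat the core counts and RAM sizes (in MB) from the examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instance:
    model: str
    cpu_cores: int
    ram_mb: int


# Values taken from the t2.micro and m5.2xlarge examples above.
INSTANCES = [
    Instance("t2.micro", 1, 1024),
    Instance("m5.2xlarge", 8, 32768),
]


def smallest_fitting_instance(min_cores: int, min_ram_mb: int) -> Optional[Instance]:
    """Return the smallest instance that satisfies both requirements, if any."""
    candidates = [
        instance
        for instance in INSTANCES
        if instance.cpu_cores >= min_cores and instance.ram_mb >= min_ram_mb
    ]
    return min(candidates, key=lambda i: (i.cpu_cores, i.ram_mb), default=None)


print(smallest_fitting_instance(4, 8192))
```

A request that no instance can satisfy returns None rather than raising, so callers can decide how to report it.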

ccpricing

Pricing for Data Mining services

Use case:

Two pricing plans:

  • Free usage, limited to 250 hours, with a small virtual machine instance and a fixed Region.
  • Regular usage, pay-as-you-go, using a pair of instances and a fixed Region.

Free Plan

Define the Service Pricing plans:

<http://example.org/cc/Plans> a ccprices:ServicePricing;
	ccprices:hasPricing _:FreePlan ,
	                    _:RegularPlan .

Free plan components: Region, instance, and MaxPrice (limited to 250 Hours of use):

#Free Plan
_:FreePlan a ccprices:PricesPlan;
	rdfs:label "PricesPlan Free";
	rdfs:comment "Limited Plan for simple users";
	ccprices:MaxPrice [
		 a gr:Offering;
			gr:hasPriceSpecification [
				a gr:PriceSpecification;
					gr:hasMaxValue 0.0;
					gr:priceCurrency "EUR";
				];
		];
	ccprices:hasComponentPrice 
		_:FreePlanComponents;
	.

_:FreePlanComponents a ccprices:Compound;
	rdfs:label "Price Components";
	rdfs:comment "List of components and limits";
	ccprices:hasRegion _:RegionNVirginia;
	ccprices:hasInstances _:instance01,
	                      _:instance02;
	ccprices:withMaxCompound [
		 a gr:Offering;
			gr:hasPriceSpecification [
				a gr:PriceSpecification;
					gr:hasCurrency "USD"^^xsd:string;
					gr:hasCurrencyValue "0.0"^^xsd:float ;
			];		
		 	gr:includeObject [ 
				a gr:TypeAndQuantityNode;
					gr:amountOfThisGood "250"^^xsd:integer ;
					gr:hasUnitOfMeasurement "HRS"^^xsd:string ;
			];	
	];
	.

Regular Plan

Define the Service Pricing plans for a Regular plan:

#Regular Plan
_:RegularPlan a ccprices:PricesPlan;
	rdfs:label "Regular Prices plan";
	rdfs:comment "Regular prices Plan including cost per instance, regions, etc.";
	ccprices:hasComponentPrice 
		_:RegularPlanComponents;
	.
	
_:RegularPlanComponents a ccprices:Compound;
	rdfs:label "Price Components";
	rdfs:comment "List of components and limits";
	ccprices:hasRegion _:RegionNVirginia;
	ccprices:hasInstances _:instance02;
  	ccprices:withMaxCompound [
  		 a gr:Offering;
  			gr:hasPriceSpecification [
  				a gr:PriceSpecification;
  					gr:hasCurrency "USD"^^xsd:string;
  					gr:hasCurrencyValue "0.0464"^^xsd:float ;
  			];		
  		 	gr:includeObject [ 
  				a gr:TypeAndQuantityNode;
  					gr:amountOfThisGood "1"^^xsd:integer ;
  					gr:hasUnitOfMeasurement "HRS"^^xsd:string ;
  			];	
  	];					  
	.
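As a rough illustration of how the two plans translate into a bill, the Python sketch below computes the cost of a number of usage hours. It is an assumption-laden example rather than part of ccpricing: plan_cost is a hypothetical function, the 250-hour limit and the hourly price of 0.0464 USD come from the plan definitions above, and billing per single instance-hour is assumed.

```python
from decimal import Decimal

FREE_PLAN_MAX_HOURS = 250  # limit taken from the free plan above
REGULAR_PRICE_PER_HOUR = Decimal("0.0464")  # USD per hour, from the regular plan


def plan_cost(plan: str, hours: int) -> Decimal:
    """Return the cost in USD of using the service for the given hours."""
    if hours < 0:
        raise ValueError("hours must not be negative")
    if plan == "free":
        if hours > FREE_PLAN_MAX_HOURS:
            raise ValueError("the free plan is limited to 250 hours")
        return Decimal("0.0")
    if plan == "regular":
        return REGULAR_PRICE_PER_HOUR * hours
    raise ValueError(f"unknown plan: {plan}")


print(plan_cost("regular", 100))
```

Decimal is used instead of float so that monetary amounts are not affected by binary rounding.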

Complete queries

In this section we present a series of example queries that use the proposed schema. They run against the following dataset, which includes a couple of data mining service providers, as well as various types of algorithms, prices, regions and instances.

Dataset

The complete dataset can be downloaded from here. For queries, we recommend using Apache Jena with Fuseki. This dataset has been developed to validate the model using SPARQL:

Queries

The following queries have been carried out:

Cloud Computing providers

Shows all providers and their names:

PREFIX s:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dmc: <http://cookingbigdata.com/linkeddata/dmcc-schema>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>

SELECT ?s ?name
WHERE {
  ?s ?p dmc:MLProvider .
  ?s dc:title ?name .
}
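A query like this one can be sent to a SPARQL endpoint over HTTP. The Python sketch below uses only the standard library and follows the SPARQL 1.1 Protocol (GET with a query parameter) and the SPARQL JSON results format. The endpoint URL, including the dataset name dmcc, is an assumption about a local Fuseki installation, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local Fuseki server exposing a dataset named "dmcc".
ENDPOINT = "http://localhost:3030/dmcc/sparql"

QUERY = """
PREFIX dmc: <http://cookingbigdata.com/linkeddata/dmcc-schema>
PREFIX dc:  <http://purl.org/dc/elements/1.1/>

SELECT ?s ?name
WHERE {
  ?s ?p dmc:MLProvider .
  ?s dc:title ?name .
}
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def rows_from_results(results_json: str) -> list:
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(results_json)
    variables = data["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in data["results"]["bindings"]
    ]


def run_query(endpoint: str, query: str) -> list:
    """Send the query and return its rows; requires a running endpoint."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return rows_from_results(response.read().decode("utf-8"))
```

With the dataset loaded into the server, run_query(ENDPOINT, QUERY) would return one dictionary per provider, with the keys s and name.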

Cloud Computing providers offering DM algorithms

Shows all providers and the services they offer:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX s:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dmc: <http://cookingbigdata.com/linkeddata/dmcc-schema>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>

SELECT ?name ?label ?algor
WHERE {
  ?s ?p dmc:MLProvider ;
        dc:title ?name ;
        s:label ?label ;
        dmc:hasMLService [
    	a dmc:MLService;
	    	s:label ?algor
  		];
        .
}

Cloud Computing providers offering Random Forest

Shows all providers that offer a Random Forest data mining service:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX s:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dmc: <http://cookingbigdata.com/linkeddata/dmcc-schema>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>

SELECT ?name ?label ?algor
WHERE {
  ?s ?p dmc:MLProvider ;
        dc:title ?name ;
        s:label ?label ;
        dmc:hasMLService [
    	a dmc:MLService;
	    	s:label ?algor
  		];
        .
  FILTER (?algor = "RandomForest")
}

Regions, Instances for Cloud Computing providers

Shows all regions and instances offered by providers to run the Random Forest algorithm:

PREFIX adms: <http://www.w3.org/ns/adms#>
PREFIX pr: <http://purl.org/ontology/prv/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX s:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dmc: <http://cookingbigdata.com/linkeddata/dmcc-schema>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>
PREFIX price:    <http://cookingbigdata.com/linkeddata/ccpricing/>

SELECT ?name ?label ?algor ?priceplan ?priceplancompound ?regioninstance ?data
WHERE {
  ?s ?p dmc:MLProvider ;
        dc:title ?name ;
        s:label ?label ;
        dmc:hasMLService [
    	 a dmc:MLService;
	    	s:label ?algor;
  		];        
  	   	dmc:hasPricingPlan [
    	 a price:PricesPlan;
    	   s:label ?priceplan;
    	   price:hasComponentPrice ?priceplancompound ;
	    ];      	
     	.  
  	?priceplancompound s:label ?regioninstance.
   	?priceplancompound price:hasInstances ?data. 

  FILTER (?algor = "RandomForest")
}

Best Cloud Computing providers offering Random Forest with prices

Shows all providers, ordered from lowest to highest price, for running a Random Forest algorithm on a dataset:

PREFIX adms: <http://www.w3.org/ns/adms#>
PREFIX pr: <http://purl.org/ontology/prv/core#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX s:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dmc: <http://cookingbigdata.com/linkeddata/dmcc-schema>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>
PREFIX price:    <http://cookingbigdata.com/linkeddata/ccpricing/>

SELECT ?name ?label ?algor ?priceplan ?priceplancompound ?aaaa ?cost 
WHERE {
  ?s ?p dmc:MLProvider ;
        dc:title ?name ;
        s:label ?label ;
        dmc:hasMLService [
    	 a dmc:MLService;
	    	s:label ?algor;
  		];        
  	   	dmc:hasPricingPlan [
    	 a price:PricesPlan;
    	   s:label ?priceplan;
    	   price:hasComponentPrice ?priceplancompound ;
	    ];      	
     	.  
  	?priceplancompound s:label ?aaaa.
    ?priceplancompound price:withMaxCompound [
    	a gr:Offering;
  	    	gr:hasPriceSpecification [
		    	a gr:PriceSpecification;
  					gr:hasCurrencyValue ?cost;
  		    ];
		 ].	

  FILTER (?algor = "RandomForest")
}

ORDER BY ASC(?cost) LIMIT 20

References