SciSPARQL

User Manual

 

TABLE OF CONTENTS

0. Overview

1. Getting started

2. SciSPARQL basics

3. Array queries

4. SciSPARQL views, second-order functions and closures

5. Foreign functions

6. Back-end storage

7. Calling SciSPARQL from C

8. MATLAB front-end

9. Advanced issues and work in progress

 

 


0. Overview

 

SciSPARQL is an extension of the W3C SPARQL 1.1 language, designed to accommodate the needs of scientific and industrial applications. It extends the standard query and update language in the following ways:

- syntax and semantics for querying for arrays and specifying array operations in the query,

- extensibility with foreign functions implemented in C, Java, Python, Lisp or MATLAB,

- parameterized queries (functions defined in terms of a SciSPARQL query) and updates.

 

The main purpose of SciSPARQL is the uniform handling of metadata and massive numeric data (multidimensional arrays) in terms of a high-level query language, scalable storage of this data in a client-server setting or cloud infrastructure, and easy reuse of the existing algorithmic libraries for filtering and post-processing the data.

 

SciSPARQL has been developed since 2014 at the Uppsala DataBase Laboratory (UDBL). It is fully implemented in our publicly available software prototype, the Scientific SPARQL Database Manager (SSDM), available at the project homepage:

 

http://www.it.uu.se/research/group/udbl/SciSPARQL/

 

SSDM can act either as a SciSPARQL server that stores the data and processes the queries and updates, or as a client communicating with a SciSPARQL server. As a server it supports extensibility with foreign functions and storage back-ends.

Technically, SSDM can be run as either

- a stand-alone Linux or Windows executable,

- a native library on top of the Java Virtual Machine, or

- an extension to the MATLAB environment.

 

The SPARQL language is the fundamental part of SciSPARQL, and is implemented according to the latest W3C Recommendations. The following documents should be used as a reference to SPARQL 1.1.

 

http://www.w3.org/TR/sparql11-query/

http://www.w3.org/TR/sparql11-update/

 

The following section provides a brief step-by-step introduction to SPARQL queries, as they constitute the essential basis for any use of SciSPARQL. For more formal definitions, additional examples and explanations, please consult the documents linked above.

 

1. Getting Started

 

SciSPARQL is implemented in SSDM, which is technically an extension of the Amos II system. It is distributed together with the Amos II executables, headers and documentation.

 

The following files belong to SSDM extension proper:

 

bin/ssdm.dll

bin/ssdm.dmp

bin/ssdm.cmd

ssdm/*.*

 

To run SSDM, use the bin/ssdm.cmd batch file. The bin/ directory should be the current directory or listed in the system PATH.

 

The SciSPARQL toploop

 

When started, the system enters an Amos II top loop where it reads SciSPARQL statements (including queries, function calls and system directives), executes them, and prints their results. The prompt in the SciSPARQL top loop is:

 

SciSPARQL n>

 

where n is a generation number. The generation number is increased every time a SciSPARQL statement is executed.

 

Typically you start by loading data. The LOAD directive allows loading local or remote Turtle or NTriples files containing RDF datasets. An unqualified filename is resolved relative to the current directory:

 

LOAD("talk.ttl");

 

The files are loaded into the 'default graph'. Multiple files can be loaded, and any RDF blank nodes used in the different datasets are renamed to keep them lexically distinct. To empty the default graph before loading, use LOAD with true as the second argument:

 

LOAD("talk.ttl",true);

 

The current dataset (i.e. the default graph) can be written to a local NTriples file using the DUMP directive:

 

DUMP("current.nt");

 

You can also load and execute SciSPARQL scripts, typically containing function definitions. SOURCE directive does exactly that:

 

SOURCE("talk.sparql");

 

At any point you can switch to the Lisp interpreter using the LISP directive, and return to the SciSPARQL toploop by evaluating the language-sparql symbol:

 

SPARQL 1> lisp;

Lisp 1> language-sparql

SPARQL 1>

 

To exit the toploop, use EXIT directive.

 

SPARQL 1> exit;

 

2. SciSPARQL basics

 

This section describes the main features of Scientific SPARQL that are also defined in SPARQL 1.1, providing relevant examples and detailing certain aspects that are not explained in detail in the W3C SPARQL 1.1 specifications.

 

Unless explicitly mentioned, these explanations do not imply any difference between the syntax and semantics of SciSPARQL and those of W3C SPARQL 1.1. For a summary of the current differences, see Section 9 of this manual.

 

2.2. SELECT: querying for values

 

The simplest SPARQL query just binds a value to a result variable:

 

SELECT (1 as ?res)

 

will return:

?res

1

 

In the most general form, a SciSPARQL select query has the following syntax:

 

(PREFIX <prefix>:<URI>)*

SELECT <select-modifier>* (<select-spec>+ | '*')

(FROM <graph-id>)*

(WHERE? <block>)?

(GROUP BY <var>+)?

(HAVING <expr>)?

(ORDER BY <expr>+)?

 

<block> is a dot-separated conjunction of conditions, enclosed in curly braces { }. The conditions can be:

- graph patterns

- UNIONs of alternative blocks

- OPTIONAL blocks

- FILTERs

- BIND conditions

- VALUES conditions

- nested queries

 

Expressions are explained in 2.2.4, and <select-spec> is either a variable or an expression bound to a variable:

 

<select-spec> ::= <var> | (<expr> AS <var>)
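
To illustrate how these clauses combine, here is a small sketch (the foaf: prefix is the one used throughout this section; the ex: prefix and the ex:birthyear property are made up for illustration) that projects a computed expression, removes duplicate results and orders them:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.org/>
SELECT DISTINCT ?name ((year(now()) - ?birthyear) AS ?age)
 WHERE { ?p foaf:name ?name ;
            ex:birthyear ?birthyear }
 ORDER BY DESC(?age) ?name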

 

 

2.2.1. Graph patterns

 

A variable can be bound in a graph pattern. For example, the query (Q1)

 

SELECT ?person

WHERE { ?person foaf:name "Alice" }

 

will bind the variable ?person to the subject of each triple in the default dataset where the predicate is the URI foaf:name and the object is the string "Alice".

 

Note: this query implies that the prefix foaf: is already defined in the session, which can be achieved by PREFIX directive:

 

PREFIX foaf: <http://xmlns.com/foaf/0.1/>;

 

We can also use a graphical representation of a graph pattern, drawing predicates as arrows, and subjects and objects as graph nodes. A node whose value is provided in the pattern is shown as a rectangle; a 'wildcard' node is depicted as an oval, labeled with a variable name if one is provided.

 

 

A graph pattern may be more complex and contain several variables, for example (Q2)

 

SELECT ?friend_name

WHERE { ?person foaf:name "Alice" .

        ?person foaf:knows ?friend .

        ?friend foaf:name ?friend_name }

 

Here '.' is used for conjunction, requiring that there should be a triple conforming to each triple pattern in order for the graph pattern to bind its variables.

 

Note that not all the variables are interesting for us, only the bindings of ?friend_name are returned.

 

This query (Q2) can be simplified in several ways:

1) We can use ';' instead of '.' to indicate that the next pattern will have the same subject:

 

SELECT ?friend_name

WHERE { ?person foaf:name "Alice" ;

                foaf:knows ?friend .

        ?friend foaf:name ?friend_name }

 

2) Since we are not interested in bindings for ?person, we do not need to provide a variable name - instead we can tell the parser to generate a wildcard by using an unlabeled blank node '[]' in the query:

 

SELECT ?friend_name

WHERE { [] foaf:name "Alice" ;

           foaf:knows ?friend .

        ?friend foaf:name ?friend_name }

 

3) We can get rid of ?friend variable as well, by substituting it with a blank-subject construct:

 

SELECT ?friend_name

WHERE { [] foaf:name "Alice" ;

           foaf:knows [ foaf:name ?friend_name ] }

 

The [ foaf:name ?friend_name ] construct denotes a wildcard subject of a triple pattern with specified predicate and object, and can be used as a blank node anywhere in a graph pattern. So this graph pattern will contain two unnamed 'blank' nodes:

 

 

Another query (Q3) will look for names of the people who know both Alice and Bob:

 

SELECT ?common_friend_name

WHERE { [] foaf:name ?common_friend_name ;

           foaf:knows [ foaf:name "Alice" ] ;

           foaf:knows [ foaf:name "Bob" ] }

 

or, graphically:

 

 

We can use ',' conjunction to indicate that the next triple pattern will have the same subject and predicate:

 

SELECT ?common_friend_name

WHERE { [] foaf:name ?common_friend_name ;

           foaf:knows [ foaf:name "Alice" ] ,

                      [ foaf:name "Bob" ] }

 

 

2.2.2.1. Regular path expressions

 

Algebraically, a path can be viewed as a binary predicate Path(X,Y), with X being its subject and Y being its object. This predicate is recursively defined below, together with the syntax of <path> expressions:

 

<path> is either:

- a predicate URI, matching a single step from subject to object;

- ^<path> - an inverse path, matching in the reverse direction, from object to subject;

- <path>/<path> - a sequence of two paths;

- <path>|<path> - an alternative of two paths;

- <path>+ - one or more repetitions of a path;

- <path>* - zero or more repetitions of a path;

- (<path>) - a parenthesized path, used to override operator precedence.

The following queries illustrate these operators. For example, a sequence of two predicates returns the names of the people Alice knows:

SELECT ?friend_name

 WHERE { [] foaf:name "Alice" ;

            foaf:knows/foaf:name ?friend_name }

 

 

SELECT ?friend

 WHERE { [] foaf:name "Alice" ;

            foaf:knows+ ?friend }

 

 

 

SELECT ?x

 WHERE { [] foaf:name "Alice" ;

            :employedBy/^:employedBy ?x }

 

One could also use predicate reversal to formulate a chain-shaped query (like Q5) with a single path:

 

SELECT ?x

 WHERE { "Alice" ^foaf:name/:employedBy/^:employedBy ?x }

 

 

SELECT ?x

 WHERE { [] foaf:name "Alice" ;

            foaf:knows | :employedBy/^:employedBy ?x }

 

 

SELECT ?x

 WHERE { [] foaf:name "Alice" ;

            (foaf:knows | :employedBy/^:employedBy)+ ?x }

 

 

Precedence of path operators

 

+ and * bind most tightly, followed by ^, then /, then |.
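
For example (our reading of these precedence rules), the path expression

foaf:knows|^:employedBy/:employedBy+

is interpreted as foaf:knows|((^:employedBy)/(:employedBy+)); parentheses, as in the last example above, can be used to override this.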

 

Algebraic properties of path operators

 

    ^ is distributive with respect to +, *, and |:

        ^A+ = (^A)+

        ^A* = (^A)*

        ^(A|B) = ^A|^B

 

    additionally, ^ reverses the path chains:

        ^(A/B) = ^B/^A

 

    | is commutative and distributive with respect to /:

        A|B = B|A

        (A|B)/C = A/C|B/C

        A/(B|C) = A/B|A/C

 

    * always includes +:

        (A+)* = A*

        (A*)+ = A*

 

2.2.2. Matching alternatives and DISTINCT option

 

Consider that foaf:knows relationship is not restricted to be symmetric in the dataset, so we would like to trace it in either direction. The following query returns names of all people who know Alice and all people whom Alice knows.

 

SELECT ?friend_name

WHERE { ?friend foaf:name ?friend_name .

        ?alice foaf:name "Alice" .

        { ?alice foaf:knows ?friend }

        UNION

        { ?friend foaf:knows ?alice } }

 

This query will effectively express two alternative graph patterns:

 

 

 

However, if the foaf:knows relationship happens to be mutual in some case, the same bindings will be generated twice for ?friend and ?friend_name. To avoid this, and return every person at most once, we should use the DISTINCT option and include the ?friend variable in the SELECT clause:

 

SELECT DISTINCT ?friend ?friend_name

WHERE { ?friend foaf:name ?friend_name .

        ?alice foaf:name "Alice" .

        { ?alice foaf:knows ?friend }

        UNION

        { ?friend foaf:knows ?alice } }

 

Note that in this case we are required to include ?friend in the result list, as the bindings for this variable are expected to be unique URIs  (or dataset-unique blank nodes) identifying different persons. If we apply the DISTINCT only to the ?friend_name variable, we will get a set of unique names, which might be shorter, as different people might happen to be namesakes.

 

There may be more than two UNION branches in the same conjunct, and a union branch can be any valid query block, including block containing nested UNIONs.

 

Additionally, different branches of the same union might provide bindings for different variables. For example, the following query might return a more informative result:

 

SELECT ?name_Alice_knows ?name_knows_Alice

WHERE { ?alice foaf:name "Alice" .

        { ?alice foaf:knows [ foaf:name ?name_Alice_knows ] }

        UNION

        { [ foaf:name ?name_knows_Alice ] foaf:knows ?alice } }

 

When applied to the dataset

 

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

 

_:alice foaf:name "Alice" ;

        foaf:knows _:bob ,

                   _:cindy .

_:bob foaf:name "Bob" ;

      foaf:knows _:alice .

_:cindy foaf:name "Cindy" .

_:erich foaf:name "Erich" ;

        foaf:knows _:alice .

 

will return the following bindings for its SELECT variables

 

?name_Alice_knows    ?name_knows_Alice
"Bob"                "Bob"
"Cindy"
                     "Erich"

 

The empty cells show that certain variables sometimes remain unbound. No further processing can be applied to them: the result of any expression involving these variables will also be unbound, and any filter depending on such an expression will not be satisfied in that case. The only exception is the bound() function, which will return either true or false.
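
For example, a sketch of the common negation idiom built on bound(), using the dataset above: it returns the names of the people who know Alice but whom Alice does not know back (here only "Erich"):

SELECT ?name
 WHERE { ?alice foaf:name "Alice" .
         ?person foaf:name ?name ;
                 foaf:knows ?alice .
         OPTIONAL { ?alice foaf:knows ?known .
                    FILTER (?known = ?person) } .
         FILTER (!bound(?known)) }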

 

2.2.3. OPTIONAL graph patterns

 

Consider we would like to get the names of all the people Alice knows, and also their emails, if they are available in the dataset. Whenever there is no email information, the name of a person should still be returned:

 

SELECT ?friend_name ?friend_email

WHERE { [] foaf:name "Alice" ;

           foaf:knows ?friend .

        ?friend foaf:name ?friend_name .

        OPTIONAL { ?friend foaf:email ?friend_email } }

 

When applied to the dataset

 

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

@prefix countries: <http://example.org/Countries#> .

 

_:alice foaf:name "Alice" ;

        foaf:knows _:bob ,

                   _:cindy ,

                   _:dave .

_:bob foaf:name "Bob" ;

      foaf:email "bob@example.org" ;

      foaf:phone "+4912123456789" ;

      foaf:residesIn countries:Germany .

_:cindy foaf:name "Cindy" ;

       foaf:phone "+46701234567" .

_:dave foaf:name "Dave" ;

        foaf:residesIn countries:Australia .

 

will return the required and the optional bindings

 

?friend_name    ?friend_email
"Bob"           "bob@example.org"
"Cindy"
"Dave"

 

Optional blocks can be nested. The nested block will provide new bindings for its variables only if the parent block succeeds. In the following query we are interested in the country information only if a phone number is returned:

 

SELECT ?friend_name ?friend_phone ?friend_country

WHERE { [] foaf:name "Alice" ;

           foaf:knows ?friend .

        ?friend foaf:name ?friend_name .

        OPTIONAL { ?friend foaf:phone ?friend_phone .

                   OPTIONAL { ?friend foaf:residesIn ?friend_country } } }

 

When applied to the same dataset, this query returns:

 

?friend_name    ?friend_phone       ?friend_country
"Bob"           "+4912123456789"    countries:Germany
"Cindy"         "+46701234567"
"Dave"

 

Note that there is no country information returned for Dave, since his phone is not found in the dataset.

 

There can also be several successive OPTIONAL blocks in a query block, and some of them might attempt to bind the same variable. In this case, the OPTIONAL block that appears earlier in the query gets the priority.

 

Consider the following query, where we are interested in the contact information of Alice's friends - preferably an email, otherwise a phone number, or nothing but the name.

 

SELECT ?friend_name ?friend_contact

WHERE { [] foaf:name "Alice" ;

           foaf:knows ?friend .

        ?friend foaf:name ?friend_name .

        OPTIONAL { ?friend foaf:email ?friend_contact } .

        OPTIONAL { ?friend foaf:phone ?friend_contact } }

 

Applied to the same dataset, the query will return

 

?friend_name    ?friend_contact
"Bob"           "bob@example.org"
"Cindy"         "+46701234567"
"Dave"

 

If we swapped the two OPTIONAL blocks in this query, a phone number would be returned for Bob as well, instead of his email address.

 

2.2.4. Expressions

 

Expressions can be used in filters and explicit bindings (FILTER and BIND clauses), in post-filters of grouping queries (HAVING clause), and in post-processing (when listed directly after the SELECT keyword, or in GROUP BY and ORDER BY lists). Filtering uses are not restricted to logical expressions, thanks to the notion of effective boolean value described in 2.2.4.2.

 

Logical and arithmetic expressions in SciSPARQL are formed by terms and operators.

 

Terms can be

- numeric, string or typed literals,

- URIs

- keywords true and false representing logical values

- variables

- function calls and typecasting

- array dereferences

 

2.2.4.1. Typed literals

 

Typed literals are syntactically formed by a string followed by ^^ delimiter and a complete or abbreviated URI indicating its type, for example

 

"1"^^xsd:integer

"10101110"^^<http://example.org/types/MyBitVector>

"2005-02-28T00:00:00Z"^^xsd:dateTime

 

The typed literals of type xsd:integer, xsd:float, xsd:string, xsd:double, xsd:dateTime, and xsd:boolean, found in SciSPARQL queries as well as in the imported Turtle/NTriples files are automatically converted to corresponding simple values. Other typed literals are stored together with their type URI and are considered comparable and equal when both the type URIs and value strings are the same.

 

2.2.4.2. Operators

 

Arithmetic operators are + - * /, and are only applicable to numbers.

 

Comparison operators include < <= > >=, which are only applicable if both operands are numbers or both are strings (in "strict" SciSPARQL, see 9.2), and = !=, which are applicable to operands of comparable types. All numeric types are comparable with each other; typed literals are only comparable when they are completely equal; string, dateTime, boolean and URI values are only comparable with operands of the same type.

 

Boolean operators include ||, &&, !, and operate on effective boolean values that can be derived from operands of any type. Effective boolean values of non-boolean types are described in the following table

 

type                                  effective boolean value
xsd:integer, xsd:float, xsd:double    false if equal to 0, true otherwise
xsd:string                            false if empty, true otherwise
xsd:dateTime                          always true
URI                                   always true
other                                 false if string part is empty, true otherwise
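
For example, a minimal sketch using a string value directly as a filter condition - according to the table above, it keeps only the solutions where ?email is a non-empty string (foaf:email is the property used in the earlier examples):

SELECT ?person
 WHERE { ?person foaf:email ?email .
         FILTER (?email) }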

 

2.2.4.3. Handling of unbound values

 

If a variable is unbound (e.g. because it is only bound inside an OPTIONAL graph pattern that did not match), the result of any expression involving that variable is also unbound. The exceptions are the boolean operators || and &&, which implement the following three-valued logic, according to the W3C SPARQL specifications:

 

A          B          A || B     A && B
true       true       true       true
true       false      true       false
false      true       true       false
false      false      false      false
true       unbound    true       unbound
unbound    true       true       unbound
false      unbound    unbound    false
unbound    false      unbound    false
unbound    unbound    unbound    unbound

 

Most built-in functions and all foreign functions are likewise not applicable to unbound values, effectively returning an unbound result. The only exception is the bound() function, which returns false if its argument is unbound. Unbound values can be seen as "empty cells" in SELECT query results. In CONSTRUCT queries, resulting triples with unbound terms are filtered out.

 

If an operator is not applicable to the values of its operands (like comparing a number to a string), or applying the operator produces an arithmetic error (like dividing by zero), the result is an error. FILTER and HAVING conditions do not distinguish between unbound, error, and effective boolean false results of their expressions. Consult 9.3 for more information.

 

2.2.4.4. Function calls

 

SciSPARQL distinguishes between ordinary functions, that return a result for each set of bindings for their arguments, and aggregate functions, which accumulate the bindings for their first argument and return at most one result. For example

 

SELECT (abs(?x) AS ?result)

 WHERE { :s :p ?x }

 

will return as many results as there are bindings for ?x to be found. In contrast,

 

SELECT (SUM(?x) AS ?result)

 WHERE { :s :p ?x }

 

will return a single result if there was at least one binding for ?x matched by the graph pattern; it returns nothing if the match fails. Syntactically, aggregate functions are indistinguishable from ordinary ones (unless the DISTINCT option is used), but their semantics is quite different, as shown in #.

 

All kinds of functions are identified by names that are unique on the server. SciSPARQL has all functions mentioned in the SPARQL 1.1 recommendations, with a few exceptions as noted in #.

 

Users can define their own functions using DEFINE FUNCTION and DEFINE AGGREGATE as described in chapter 4 for the "native" SciSPARQL functions, i.e. parameterized views, and in chapter 5 for the foreign functions.

 

2.2.4.5. Daplex semantics

 

The number of times the ordinary function is called is, conceptually, the number of sets of bindings found for its arguments.

 

Whenever independent properties are matched by the query, they form a Cartesian product, so the following query will return as many results as the number of bindings for ?x times the number of bindings for ?y.

 

SELECT ?x ?y

 WHERE { :s :x ?x ; :y ?y }

 

Similarly, a function call that is dependent on such variables will effectively operate on the Cartesian Product that combines them:

 

SELECT (?x + ?y AS ?result)

 WHERE { :s :x ?x ; :y ?y }

 

will return the same number of values. (Here we view the arithmetic operator '+' as a function of two arguments)

 

2.2.4.6. Typecasting

 

SciSPARQL supports a number of standard conversions between the basic RDF types, as shown in the following table.

 

 

Typecasting is invoked as a function of a single argument, where the type URI (either abbreviated, e.g. with the xsd: prefix, or complete) is used instead of a function name. The following example is also valid in SPARQL 1.1:

 

... FILTER ( xsd:dateTime(?date) < xsd:dateTime("2005-01-01T00:00:00Z") ) ...

 

 

2.2.5. Filters and explicit bindings

 

When filtering, the Effective Boolean Value (EBV, described in 2.2.4.2) of the expression is taken into account, and the results that satisfy the filter condition are passed on from the query.

 

However, when a filter condition contains equality or another predicate that can be used to derive one value from another, the SciSPARQL query optimizer may use this information as a faster way to answer the query. For example one of the possible execution plans for

 

SELECT ?x ?y

 WHERE { ?x :a ?a .

         ?y :b ?b .

         FILTER (?a = 2 * ?b) }

 

may consist of looking for ?x and ?a solutions first, assuming for ?b the value ?a / 2, and then looking for ?y solutions with the candidate ?b values already known (which is typically faster than matching patterns based only on the predicate). If the :b predicate happens to be more selective than :a (fewer instances of it), the ?y and ?b bindings will be looked up first.

 

In case when one of the variables used in FILTER may be unbound, for example

 

SELECT ?x ?p

 WHERE { ?x :a ?a .

         OPTIONAL { ?x ?p ?b } .

         FILTER (?a = 2 * ?b) }

 

no such strategy can be applied, and the filter expression will effectively require all its variables to be bound, thus making the OPTIONAL part obligatory. SciSPARQL takes care not to produce any false positives in all these cases.

 

BIND

 

An equivalent to the first query using BIND construct for explicit binding would be 

 

SELECT ?x ?y

 WHERE { ?x :a ?a .

         ?y :b ?b .

         BIND (2 * ?b AS ?a) }

 

and, despite the expressed asymmetry, the condition can be evaluated in both directions. It is up to the user to choose the most natural way to express the equality condition. However, FILTER is syntactically more powerful, allowing equality of arbitrary expressions, whereas BIND requires a variable on the right side.
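
For example, the following sketch (using the :a and :b predicates from the queries above) relates two computed expressions; such a condition can be written with FILTER but not with a single BIND:

SELECT ?x ?y
 WHERE { ?x :a ?a .
         ?y :b ?b .
         FILTER (?a + 1 = 2 * ?b) }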

 

VALUES

 

Another important alternative to extensively disjunctive filters, providing explicit alternative values for a variable

 

FILTER ( ?x = "a" || ?x = "b" || ?x = "c" ... )

 

or for a set of variables

 

FILTER ( (?x = "a" && ?y = 1) || ?y = 2 || ?x = "c" )

 

is the VALUES clause as defined in SPARQL 1.1

 

VALUES ?x { "a" "b" "c" }

 

or, respectively,

 

VALUES (?x ?y) { ("a" 1) (UNDEF 2) ("c" UNDEF) }

 

The keyword UNDEF indicates that the respective variable is not constrained by VALUES in the given alternative, and is bound otherwise (e.g. by a triple pattern). For example, the following query (Q#) would look for people named Alice whose email is alice@example.org, or anyone named Bob:

 

SELECT ?person

 WHERE { ?person foaf:name ?name ;

                 foaf:mbox ?email .

         VALUES (?name ?email) {

                  ("Alice" "alice@example.org")

                  ("Bob" UNDEF) } }

      

 

 

2.2.6. Result sequences

 

By default, the result set of a SciSPARQL query is a bag (i.e. a multiset) of mappings (otherwise called results in this document), where each output variable is either bound to a value or left unbound. Duplicate mappings can be eliminated with the DISTINCT option. The results are typically generated one by one and shipped back to the client - thus a query execution can be stopped before all results are found. No deterministic order of mappings in the query result can be assumed.

 

However, in many applications it is desirable to order the results into a sequence, and additionally to retrieve only specified sections of that sequence. The ORDER BY clause allows one to specify a list of expressions used for ordering. The order (numerical and alphabetical) is ascending by default, and can be changed to descending with the DESC() option for a list element. Whenever the first n ordering expressions provide the same values, the value of the (n+1)-th expression is used. For two mappings where all ordering expressions return the same (or incomparable) respective values, the order remains undefined.

 

Once the order is defined, one could use LIMIT and OFFSET clauses to support paged representation of query results, so the query will return a certain section of the result set. Only integer constants can follow these options, and OFFSET 0 would indicate the beginning of the sequence.

 

The section returned will be deterministic only if the ordering expressions specify a total ordering of the result set. For example (Q#), when selecting people based on their age (descending) and name, it's a good idea to include some unique value (a person node itself) in the end of the order list:

 

SELECT ?name ?age

 WHERE { ?p a foaf:Person ;

            foaf:name ?name ;

            foaf:age ?age }

 ORDER BY DESC(?age) ?name ?p

 OFFSET 50 LIMIT 10

 

Otherwise, if only one of two people having the same name and age is returned at the end of a section, the next section (with OFFSET 60) will not necessarily contain the other one.

 

It is also possible to use LIMIT option to limit possible unmanageably large result sets. However, in the absence of ordering, the subset returned will not be deterministic.

 

2.2.7. Aggregate queries

 

Expressions containing aggregate functions are only allowed in the SELECT list of a SciSPARQL query. The use of an aggregate function alters the semantics of the query, effectively breaking it into two parts: an inner query and an outer query. The GROUP BY clause may only be used in queries containing aggregation.

 

The processing of an aggregate query involves the following steps:

- the inner query is evaluated, producing bindings for the inner-query variables;

- the solutions are partitioned into groups according to the values of the GROUP BY expressions;

- the aggregate functions are computed over each group;

- the outer query filters the groups with the HAVING condition, evaluates the remaining SELECT expressions, and orders the result.

For example, the following query (Q#) will select all people having 2 or more employers, ordered by name:

 

SELECT ?name (COUNT(DISTINCT ?employer) AS ?employerCnt)

 WHERE { ?p a foaf:Person ;

            foaf:name ?name ;

            :employer ?employer }

 GROUP BY ?p

 HAVING (?employerCnt >= 2)

 ORDER BY ?name

 

Here, the inner query will simply select all people (i.e. instances of foaf:Person) and their employers from the RDF graph, resulting in a Cartesian product relation. Grouping is done on "person" nodes bound to ?p, resulting in one group per person. Then, inside each group, duplicate employers are removed, and count of remaining ones is computed.

 

The outer query deals with the grouped ?p and ?name variables and the computed ?employerCnt variable. Please note that since the ?name variable is used in the outer query (both for selection and ordering), it is automatically made available from the inner query. SPARQL 1.1 requires that the only variables from the inner query that can be used in the outer query are those mentioned in the GROUP BY clause. SciSPARQL detects such variables and implicitly adds them to the GROUP BY list. In most cases this does not change the semantics, as, e.g. in our case, ?name is functionally dependent on ?p anyway.

 

HAVING clause provides a filter condition in the outer query. Equality condition will never bind a variable in the outer query.

 

GROUP BY and ORDER BY clauses are not restricted to single variables - like the SELECT clause, they may contain lists of arbitrary expressions. Whereas GROUP BY can refer to any inner query variable, the other parts of the outer query can only use:

- variables listed in (or bound with AS in) the GROUP BY clause, and

- variables bound to the results of aggregate functions in the SELECT clause.

For example, one might query for number of people for every first letter in their names, with the letter with greatest count returned first, and letters with the same count returned alphabetically (Q#):

 

SELECT ?firstLetter (COUNT(?p) AS ?personCnt)

 WHERE { ?p a foaf:Person ;

            foaf:name ?name }

 GROUP BY (SUBSTR(?name, 1, 1) AS ?firstLetter)

 ORDER BY DESC(?personCnt) ?firstLetter

 

Here 4 variables are used in the outer query:

- ?name and ?p, coming from the inner query and used only as arguments to SUBSTR() and COUNT(), respectively,

- ?firstLetter, bound to the grouping expression in the GROUP BY clause,

- ?personCnt, bound to the aggregate result in the SELECT clause.

Built-in aggregate functions

 

SciSPARQL has the same set of built-in aggregate functions as SPARQL 1.1:

 

- COUNT() and SAMPLE(), operating on all kinds of values.

- SUM(), MIN(), MAX(), AVG(), with the addition of STDEV(), operating on numeric values or arrays of the same shape. No result is returned if the values are not compatible with each other. The base numeric type of the result will be the widest of (Integer, Double, Complex) that appears among the bindings for the argument, and is never Integer for AVG() and STDEV().

- GROUP_CONCAT(), operating on string values. No result is returned if not all bindings for the first argument are strings. The order of concatenated strings in the result is never guaranteed. The second argument specifies the delimiter string.

 

As mentioned earlier, users can define their own aggregate functions (both foreign and native to SciSPARQL) using DEFINE AGGREGATE.

 

Parameters to aggregate functions

 

It is only the first argument to the aggregate function that can depend on the inner query variables. The other arguments are allowed, but may only depend on the outer query variables. For example, the typical use for GROUP_CONCAT() would be

 

GROUP_CONCAT(?x, "|")

 

However, the following query would also be correct

 

SELECT ?pn (GROUP_CONCAT(?name, ?d) AS ?names)

 WHERE { ?p :pn ?pn ;

            :name ?name ;

            :language ?l .

         ?l :name_delimiter ?d }

 GROUP BY ?pn ?d

 

since the variable ?d is being grouped upon, and thus can be used in the outer query.

 

DISTINCT with aggregate functions

 

The keyword DISTINCT may precede the first argument to an aggregate function. In that case only the distinct, i.e. unique, argument values within each group are aggregated. For example,

 

SELECT (COUNT(DISTINCT ?x) AS ?result)

 WHERE { :s :p ?x }

 

will count the number of bindings for ?x that are not equal to each other.

 

COUNT(*)

 

It is possible to use '*' as a replacement for the argument of COUNT(). In that case, all query variables are used together as a single argument.

 

For example, whereas the query

 

SELECT *

 WHERE { :s :x ?x ; :y ?y }

 

will return the Cartesian product of all bindings found for ?x and ?y (returning two variables per result), applying COUNT(*) will instead return the cardinality of that Cartesian product:

 

SELECT (COUNT(*) AS ?result)

 WHERE { :s :x ?x ; :y ?y }

 

Using the DISTINCT option will additionally eliminate the non-unique pairs of ?x and ?y bindings before returning or counting.
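
For example, a sketch counting only the distinct ?x, ?y pairs in the query above:

SELECT (COUNT(DISTINCT *) AS ?result)
 WHERE { :s :x ?x ; :y ?y }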

 

 

2.2.8. Querying multiple graphs

 

So far we've only seen queries that address the default graph. It is also possible to specify the graph names for the complete query or parts of it, and even query for the graph name.

 

SciSPARQL Database Manager stores the default graph and a number of named graphs, identified by URIs. Any RDF mappings defined over other mediated data models are also identified by mapping-specific URIs for querying with SciSPARQL.

 

The usage of FROM, FROM NAMED, and GRAPH syntax is thoroughly described in section 13 of W3C SPARQL 1.1 specifications:

 

http://www.w3.org/TR/sparql11-query/#specDataset
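
For example, a minimal sketch following that syntax (assuming the foaf: prefix is defined), which looks for a person named "Alice" in every named graph and also returns the graph URI:

SELECT ?g ?person
 WHERE { GRAPH ?g { ?person foaf:name "Alice" } }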

 

2.3. CONSTRUCT: generating derived RDF graphs

 

Whereas the result set of a SELECT query is a set of variable mappings, where some variables might remain unbound in certain mappings, a CONSTRUCT query is designed to generate a set of RDF triples with all RDF terms in place.

 

A CONSTRUCT query is very similar to SELECT query, except that it provides a graph template to generate a set of triples for each solution. It is described in section 16.2 of W3C SPARQL 1.1 specifications:

 

http://www.w3.org/TR/sparql11-query/#construct

 

Please note that in the current version blank nodes in CONSTRUCT templates are not supported.
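
For example, a small sketch (with no blank nodes in the template, per the note above; the foaf: and empty prefixes are assumed to be defined, and :acquaintedWith is a made-up property) deriving a symmetric relation from foaf:knows:

CONSTRUCT { ?x :acquaintedWith ?y .
            ?y :acquaintedWith ?x }
 WHERE { ?x foaf:knows ?y }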

 

2.4. ASK queries

 

An ASK query is a simplification of a SELECT query that is guaranteed to return a single Boolean value: true if a solution exists and false otherwise.

 

For example, the following query checks if the default graph contains a person named Alice (Q#):

 

ASK { ?x a foaf:Person ; 
           foaf:name  "Alice" }
 

2.5. SciSPARQL Update language

 

SciSPARQL implements the DELETE/INSERT, DELETE DATA, and INSERT DATA statements as described in section 3 of the W3C SPARQL 1.1 Update specification:

 

http://www.w3.org/TR/sparql11-update/#updateLanguage

 

The difference between the DATA and the query-based update statements is that the former are executed at parse time, so the complete statement is not buffered in memory and can therefore be efficiently used for inserting or removing large sets of RDF triples.
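
As a sketch of the two kinds of statements (the URIs and values are made up, and the foaf: prefix is assumed to be defined):

INSERT DATA { <http://example.org/people#frank> foaf:name "Frank" }

DELETE { ?p foaf:phone ?old }
INSERT { ?p foaf:phone "+46000000000" }
 WHERE { ?p foaf:name "Cindy" ;
            foaf:phone ?old }

The first statement inserts the listed triples directly; the second one computes the triples to delete and insert from the solutions of its WHERE pattern.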

 

LOAD, CLEAR, CREATE, and DROP statements, for simplicity of the grammar, are implemented as functions, and SciSPARQL allows all kinds of function calls as top-level statements.

 

For example,

 

LOAD('alice.ttl', true, :aliceGraph)

 

would load the Turtle RDF document ./alice.ttl into the named graph identified by URI :aliceGraph (provided the empty prefix was defined in the session). If such graph does not exist, it will be created. If no third argument is given, default graph is addressed. The second argument specifies whether the target graph should be cleared first.

 

The string type of the first argument indicates that a local file (on the server filesystem) should be read. If a URI is used instead, the RDF document will be accessed over the Web:

 

LOAD(<http://example.org/alice.ttl>)

 

Clearing can also be achieved with

 

CLEAR(:aliceGraph)

 

for the same named graph, or just

 

CLEAR()

 

for the default graph.

 

SciSPARQL does not implement the COPY, MOVE, and ADD shortcuts, for the same reason of grammar simplicity. The document referred to above specifies the INSERT/DELETE equivalents for these.


3. Array queries

 

One of the primary distinctive features of Scientific SPARQL is the support for multidimensional numeric arrays (NMAs or arrays for short) as first-class citizens. Arrays appear among other possible values as objects of RDF triples, and can be addressed and processed using the array dereference syntax, SPARQL operators and functions extended to handle arrays, and specialized array functions.

 

Whereas the RDF data model can be used to represent nested lists of RDF values, using the rdf:first and rdf:rest properties, for realistically big arrays such a representation would be extremely inefficient: both for storage (2 triples per element + additional triples to represent nested dimensions) and for access to elements and slices (path length linear w.r.t. the sum of array dimensions).

 

Scientific SPARQL Database Manager provides efficient internal representations of arrays and their projections – both in-memory and in back-end storage, and also minimizes the amount of data transmitted in client-server setting by performing data-reducing operations (selection, aggregation) on the server.

 

For the specialized handling of arrays, SciSPARQL assumes they have a regular structure and elements of the same numeric type; the NMA datatype is thus specialized for array-related tasks, and is not a complete replacement for RDF collections.

 

3.1. Loading RDF with arrays

 

SciSPARQL Database Manager allows loading RDF data from Turtle (and NTriples as a subset) serialization formats. An RDF collection consisting of a nested list with four numbers, ((1 2) (3 4)), being an object in a triple with subject :s and predicate :p, will be represented in NTriples as

 

<http://example.org/s> <http://example.org/p> _:a .

_:a <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:b .

_:b <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> 1 .

_:b <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> _:c .

_:c <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> 2 .

_:c <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .

 

_:a <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> _:d .

_:d <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:e .

_:e <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> 3 .

_:e <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> _:f .

_:f <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> 4 .

_:f <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .

 

_:d <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .

 

The same nested list is shown graphically as part of RDF graph, where every triple is shown as an edge:

 

 

Besides abbreviating URIs with common prefixes, Turtle provides more compact syntax for RDF collections, so the same triple with the nested list as object can be encoded as:

 

@prefix : <http://example.org/> .

 

:s :p ((1 2) (3 4)) .

 

however, with the standard RDF data model, exactly the same 13 triples should be stored and made available to SPARQL queries, so that this compact notation is left as just a "syntactic sugar" in Turtle format.

 

As SciSPARQL extends the RDF data model with arrays, the SciSPARQL Database Manager is by default configured to recognize such collections already in the Turtle format and store them internally as NMAs. This behaviour is governed by the _sq_emit_nmas_ switch. An RDF collection is identified as an NMA if the following conditions hold:

- all leaf elements are numbers, and

- the collection has a regular (rectangular) structure, i.e. all sub-lists at the same nesting level have the same length and nesting depth.

The element type of array is Integer, if all numbers are integers, and Double otherwise. Although SSDM internally supports also Boolean and Complex arrays, acquiring this kind of data from Turtle will be added in future versions.

 

As follows from the above description, Turtle triple

 

:s :p (((1 2) (3 4)) ((5 6) (7 8))) .

 

will be stored with 3-dimensional Integer array of size 2x2x2 as an object,

 

:s :p (1 3.14 5) .

 

will contain 1-dimensional Double array of size 3, and

 

:s :p (1 (2 3) 4) .

 

will point to a stored RDF collection (defined by 6 triples), containing array (2 3) as its second element.

 

Backwards compatibility

 

Since SciSPARQL is a strict superset of SPARQL, besides providing its own syntax for addressing array elements and slices (described in the next section), it also allows the use of the rdf:first and rdf:rest predicates to address array elements, as if they were regular RDF collections.

 

This feature provides backwards compatibility, so that the following queries (to address number 3 stored in array in the first example here) are equivalent (Q#):

 

SELECT ?element21

 WHERE { :s :p ?array .

         ?array rdf:rest ?x .

         ?x rdf:first ?slice2 .

         ?slice2 rdf:first ?element21 }

 

and

 

SELECT (?array[2,1] AS ?element21)

 WHERE { :s :p ?array }

 

Certainly, the latter syntax is preferred for scalability. If we, for example, would need to address the element [x, y], in a 2D array, the traditional SPARQL syntax would require us to specify a property path of (x+y) triple patterns, and (x+y-1) additional variables.

             

3.2. Array dereference syntax

 

As we can see from the previous example, SciSPARQL allows array subscripts in square brackets, where subscripts for respective dimensions are separated with commas.

 

These are either single subscripts or range selections. By default, range selections can be specified with a colon as lo:hi, and selections with a stride as lo:stride:hi, where both lo and hi address the elements that are included in the selection, and the elements are counted from 1. This design was chosen to make MATLAB users feel at home.

 

However, with the _sq_python_ranges_ switch the user can opt for a different dialect of SciSPARQL, which supports the Python notation for ranges. In this case, elements are counted from 0, the hi element is never part of the selection, and optional strides are specified as lo:hi:stride. No other differences are introduced.
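
For example (our reading of the two dialects), for a 1-dimensional array ?a with at least 9 elements, the following two expressions select the same elements (the 1st, 3rd, 5th, 7th and 9th):

?a[1:2:9]     in the default (MATLAB-style) notation

?a[0:9:2]     with _sq_python_ranges_ set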

 

This switch only takes effect at the stage when a SciSPARQL query, update, or function definition are passed to the interpreter. The definitions of SciSPARQL functions and parameterized updates are stored internally in a way that is invariant to these syntactic differences, so it is safe to arbitrarily switch between the two dialects in the same session.

 

In the rest of this manual, the default (MATLAB) notation is used.

 

Either or both lo and hi values can be omitted, with default for lo being 1 (or 0 if _sq_python_ranges_ is set), and default for hi always being the array size in the respective dimension. Thus the expressions ?a[:] and ?a[:1:] are always equivalent to ?a.

 

If valid single subscripts for all array dimensions are specified, the array is dereferenced to a single element. Otherwise, complete ranges are assumed for the remaining dimensions.

 

SciSPARQL thus distinguishes between three kinds of array dereferences:

- dereference to a single element, when single subscripts are specified for all dimensions;

- projection, when single subscripts are specified for only some of the dimensions, reducing the dimensionality of the result;

- range selection, which preserves the dimensionality of the array.

The latter two are also collectively called array slicing operations.

 

If a range selection effectively specifies a single element, it is still treated as a range selection with respect to the dimensionality reduction. Thus, (unlike MATLAB) SciSPARQL makes a difference between arrays that have different number of "single-element" trailing dimensions, and between singleton arrays and numbers, so that ?a[2,3:3] is not equal to ?a[2,3]. For a 2d array ?a where these subscripts are valid, the former expression would return a 1d-projection with a single element in it, whereas the latter expression would dereference directly to that element.

 

Since SciSPARQL is optimized to handle very large arrays, any dereference operation that returns a derived array does not allocate any memory to store the new array's elements - internally, it just allocates a new descriptor object pointing to the same storage space. Thus, creating numerous projections and slices of arrays is very cheap, and is encouraged as a simple way to formulate many data-reduction operations, as shown in 3.3 - 3.4.

 

Variables bound to array subscripts

 

One important feature of SciSPARQL as a declarative query language is the possibility to automatically bind a query variable to its valid range of values. Just as a triple pattern

 

?x foaf:name "Alice" .

 

binds variable ?x to every node that has a property foaf:name with value "Alice", an array dereference expression

 

?a[?i]

 

with otherwise unbound variable ?i becomes an array access pattern: the variable ?i will assume all valid subscript values, that is, integers from 1 and up to the size of ?a array in its first dimension. Unless otherwise restricted, such binding will form a Cartesian product with bindings for other variables in the query solution. So, for example,

 

SELECT ?i ?j (?a[?i,?j] AS ?value)

 WHERE { :s :p ?a }

 

will return every element of the 2d array ?a (or the respective projections if ?a is an array of greater dimensionality, or nothing otherwise), together with the subscript values. Similarly,

 

SELECT ?i ?j (?a[?i,?j] AS ?value)

 WHERE { :s :p ?a .

         FILTER ( ?i <= ?j ) }

 

will return top-right triangle of ?a, and

 

SELECT ?i (?a[?i,?i] AS ?value)

 WHERE { :s :p ?a }

 

will return the diagonal elements.

 

3.3. Array arithmetics

 

The RDF-based arithmetics of SciSPARQL is not currently extended to arrays, except for the aggregate functions. Thus, it is possible to compute an element-wise sum of a bag of arrays (Q#):

 

SELECT (SUM(?a) AS ?result)

 WHERE { ?s :a ?a }

 

that for a dataset (G#)

 

:x :a (1 2 3) .

:y :a (4 0 6) .

 

will return the array (5 2 9). The functions MIN(), MAX(), AVG(), and STDEV() work in a similar way - they apply the respective aggregate function element-wise, checking that all input values are arrays of the same shape (or all numbers), and return an array of the same shape (or a number). Integer and Double arrays can be freely combined, and while SUM(), MIN(), and MAX() will return an Integer array if all input arrays were of type Integer, AVG() and STDEV() will always return arrays of type Double.

 

For example, using an array access pattern feature of SciSPARQL, one may compute an average column in a 2d array ?a with a query (Q#):

 

SELECT (AVG(?a[:,?j]) AS ?avg_column)

 WHERE { :s :p ?a }

 

Non-aggregate arithmetic operators +, -, *, / and functions round(), mod(), abs(), floor(), ceil() will only accept numbers as operands.

 

However, a number of simple functions that are not governed by the SPARQL 1.1 standard currently accept both numeric and array arguments, provided the latter are of the same shape:

 

SELECT (max2(?x,?y) AS ?result)

 WHERE { :x :a ?x . :y :a ?y };

 

applied to the dataset G# above, will return the array (4 2 6). With the same function applied to numbers, the query (Q#):

 

SELECT (max2(?x[1],?y[1]) AS ?result)

 WHERE { :x :a ?x . :y :a ?y };

 

will return the number 4. Function min2() will return the minimum of two operands.

 

Intra-array aggregation

 

Since arrays (of any shape) can be viewed as collections of their elements, another type of aggregate function can be applied to them. The functions array_sum(), array_avg(), array_min(), and array_max() take a single array as an argument, and return a single number - the result of the aggregate operation.

 

For example a simple query (Q#)

 

SELECT (array_sum(?a) AS ?result)

 WHERE { :s :p ?a }

 

will (in case of 2d array only!) be equivalent to an aggregate query:

 

SELECT (SUM(?a[?i, ?j]) AS ?result)

 WHERE { :s :p ?a }

 

but the computation is done much faster in the first case.

 

3.4. Other built-in array functions

 

adims(?x) - returns the shape of an array as a 1d Integer array containing the sizes of ?x in each dimension. To obtain the number of dimensions, use adims(adims(?x))[1].
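
For example, a sketch using adims() to select only the square 2d arrays among the objects of the :p property used in the examples above:

SELECT ?a
 WHERE { ?s :p ?a .
         FILTER (adims(adims(?a))[1] = 2 &&
                 adims(?a)[1] = adims(?a)[2]) }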

 

elttype(?x) - return element type of array, with 0 for Integer, 1 for Double, 2 for Complex

 

permute(?x, d1, d2, ...) - change the shape of array by rearranging its dimensions (generalized transposition). The integer values d1, d2, ... denote the new order for the array dimensions. Effect is same as in MATLAB permute() function:

 

http://se.mathworks.com/help/matlab/ref/permute.html

 

transpose(?x) - simple 2d matrix transposition, equivalent to permute(?x, 2, 1).

 

Rearranging array dimensions, similarly to array slicing operation, involves no copying of array elements, and thus produces a derived array.

 

first(?x) - equivalent to ?x[1], returns first element or projection to the first element in the first dimension

 

rest(?x) - equivalent to ?x[2:], returns the subset of array that excludes the first element in the first dimension. first() and rest() provide an alternative way to address arrays - as linked lists.

 


4. SciSPARQL Views, second-order functions and closures

 

SciSPARQL introduces more functional programming features into SPARQL:

- functions defined as parameterized SciSPARQL queries (views),

- second-order functions, taking other functions as arguments,

- closures, i.e. functions with some of their arguments fixed.

For example, the following definition introduces a function computing a person's age as a parameterized view:

DEFINE FUNCTION age(?person)

 AS SELECT (year(now()) - year(?birthday) AS ?result)

     WHERE { ?person foaf:birthday ?birthday ;

                     :isAlive true }

 

SciSPARQL functions help to express common subqueries, and can be used to simplify other queries. For example, finding the friends of Alice who are at least twice as old as she is (Q#):

 

SELECT ?friend

 WHERE { ?a foaf:name 'Alice' ;

            foaf:knows ?friend .

         FILTER (age(?friend) >= 2 * age(?a)) }

 

 

A second-order function takes another function as its argument. For example, ARGMAX() finds the argument value for which the given function returns its greatest result, so the oldest person can be found with:

ARGMAX(age)

 

 

A closure fixes some of the arguments of a function. For example, given a parameterized view computing the average income per city and year:

DEFINE FUNCTION avgIncome(?city ?year)

 AS SELECT (AVG(?income) AS ?result)

     WHERE { ?td a :TaxDeclaration ;

                 :filedIn ?city ;

                 :year ?year ;

                 :totalIncome ?income }

 

we might want to fix year parameter to find the richest city for that year, according to tax declarations:

 

ARGMAX(avgIncome(*, 2012));

 

(in current version ARGMAX() only expects a function of one parameter, since all functions have result width of 1)


 5. Foreign functions

 

 

6. Connecting to back-ends

 

 

7. Calling SciSPARQL from C

 


8. The MATLAB front-end

 

 

The MATLAB-SciSPARQL Link (MSL) is the interface provided for tight integration of SciSPARQL queries into MATLAB [#] - for connecting to SSDM servers, and for querying and updating RDF graphs with arrays.

 

A 32-bit version of MATLAB is required. MSL was tested with MATLAB Release 2012b for Windows, and we are interested in getting feedback on compatibility with other versions of MATLAB.

 

MSL is implemented as an extension to MATLAB environment, consisting of a library msl.dll, header file msl.h, and a number of .m files containing definitions of MATLAB classes and functions. All of these are collected in SQoND/embeddings/MATLAB/M subfolder of SSDM installation directory, which should be included into MATLAB library path:

 

addpath('/SQoND/embeddings/MATLAB/M');

 

The rest of this chapter is based on technical report [#]

 

8.1. Examples

 

This section explains how to:

- Initialize the SciSPARQL MATLAB client

- Create a connection to an SSDM server named udbl.it.uu.se

- Populate the SSDM database

- Send a query to the database server for evaluation

- Iteratively retrieve data from the database

- Unload MSL from MATLAB.

 

First, the MATLAB function sparqlInit() is called to load the MSL DLL into MATLAB and initialize the system. After MSL has been initialized, MATLAB users can connect to an SSDM server on host h by calling newConnection(h). The function newConnection() returns an object of class Connection which is used to communicate with the server.

 

8.1.1 Queries

 

The method Connection.sparql(q) -> Scan evaluates the query string q on the SSDM server connected to the Connection object. The result of a call is a Scan object that identifies the set of result tuples that are about to be emitted from the query. Since the result set can be very large, it is not returned as a single object but as a Scan, which contains methods for the application program to iterate over it and retrieve one result tuple at a time.

 

For example, the following program opens a connection to the SSDM server running on udbl.it.uu.se and retrieves all the triples in the default graph:

 

sparqlInit();

c = newConnection('udbl.it.uu.se')

s = c.sparql('SELECT ?s ?p ?o WHERE {?s ?p ?o}');

while not(s.endOf())

    s.getElement(1)

    s.getElement(2)

    s.getElement(3)

    s.nextRow();

end

c.close();

 

The method Scan.endOf() returns 1 when there are no more tuples in the scan, and 0 otherwise. The method Scan.getElement(p) retrieves the object at position p in the current tuple of the scan. The method Scan.nextRow() advances to the next tuple in the scan. The method Connection.close() closes a connection and frees all associated scans.

 

The objects created by MSL, e.g. connections, scans, and tuples, are stored in main-memory database on the client. The function getElement() converts the data representation used in SSDM to the corresponding MATLAB data type. The supported data types are scalar values, numeric arrays, character strings, and classes defined by MSL to represent RDF-specific data types. For example, the function makeURI(s) creates a new URI named s, which is an instance of a MATLAB class URITYPE defined by MSL. 

 

8.1.2 Function calls

 

SciSPARQL functions can be called from MATLAB using the overloaded MSL method Connection.sparql(fn, argl) -> Scan, where fn is the name of the SciSPARQL function to call and argl is an argument list represented as a MATLAB cell array[8.2], which can contain data of different types.  For example, the following code calls the SciSPARQL function plus to add two 2D arrays A and B together:

 

A = [1,2,3;4,5,6];

B = [7,8,9;10,11,12];

s = c.sparql('plus', {A, B});

if not(s.endOf())

  s.getElement(1)

end

 

The user can define SciSPARQL functions and install them on the SSDM server by, e.g., using the MATLAB workspace as an SSDM console, where commands are shipped to the SSDM server using the method Connection.sparqlDo(), which differs from Connection.sparql() in that it does not return a scan. For example, the SciSPARQL function square(x) can be defined on the server by:

 

c.sparqlDo('DEFINE FUNCTION square(?x) AS SELECT (?x*?x AS ?res)');

 

Note that the function on the server need to be defined only once, so a good choice of names is important.

 

8.1.3 Updates

 

The function Connection.insert(s, p, o) inserts a triple (s, p, o) into the SSDM database connected to by Connection object. For example, the following code adds values of the property http://example.org/ns#p to four different subjects:

c.usePrefix('', 'http://example.org/ns#');

p = makeUri('', 'p'); % make a URI object

A = [1,2,3;4,5,6]; % make a 2D array

B = cat(3, A, A+1, A+2, A+3); % make a 3D array

c.insert(makeUri('', 'x4'), p, 'simple string');

c.insert(makeUri('', 'x5'), p, 3);

c.insert(makeUri('', 'x6'), p, B);

c.insert(makeUri('', 'x7'), p, true);

 

The commonly used prefix is stored on the connection object (identified by an empty string). The variable p holds the URI of the added property. The variable A is a 2D array used to construct the 3D array B.

 

Another way to insert data is to use the standard SPARQL Update syntax [#], wrapped in a Connection.sparqlDo() call. The same goes for deletion. Often an update involves deleting many triples based on a query. For example,

 

c.sparqlDo('DELETE {?s <http://example.org/ns#p> ?o}');

 

removes all triples having the property http://example.org/ns#p. Since the prefix is already defined for the session, one could also use :p instead in the query.

 

Passing parameters into updates can be done either by placing them into the query string (not recommended) or by defining update in a named procedure shipped to the SSDM server. For example,

 

c.sparqlDo('DEFINE PROCEDURE delProp(?x) AS DELETE {?s ?x ?o}');

 

defines a procedure stored on the server that will delete all triples having the property ?x. Procedures are called the same way as functions. For example,

 

c.sparqlDo('delProp', {c.makeURI('', 'p')});

 

which also removes all triples in the database having the property http://example.org/ns#p.

 

8.2 MATLAB interface functions

 

8.2.1 Initializing and finalizing

 

sparqlInit()

This function loads the SSDM client DLL as MATLAB extension and initializes it.

 

sparqlInit(option)

In addition to the previous, this also initializes the MSL extensions given in the options. The only option currently supported is 'mat', which allows sending MATLAB values as binary .mat files to the server for storage. This enables store() and link() functions.

 

newConnection(hostid) -> Connection

The function creates a connection to a server running on a given host, hostid. This function returns a connection object. Most functions provided in MSL require a Connection argument. If hostid is an empty string, an embedded SSDM process is initialized.

 

setRemotePort(portNumber)

Set the port number a name server is expected to listen to. The default is 35021.

 

Connection.close()

This method closes the connection and frees memory allocated for the connection. After calling this function, the connection object cannot be used anymore. To obtain a new connection call newConnection(hostid). The purpose is to keep the memory clean when we want to re-initialize the SciSPARQL client while continuing to run the same MATLAB session.

 

closeAllConnections()

This function closes all connections and frees all memory allocated for all connections.

 

finalizeSystem()

This function frees all memory associated with SSDM and unloads the interface MSL DLL from the MATLAB application.

 

8.2.2 Sending queries to SSDM

 

Connection.usePrefix(prefix, str)

Register the prefix on the Connection. The prefixed URIs can be used in the queries and updates being sent to the server. The registered prefixes can also be used locally when constructing URIs with Connection.makeURI().

 

Connection.sparql(q) -> Scan

The method takes a query string q as an argument. It sends the query string to the SSDM server for execution. It returns a scan object, representing the result of the query that the MATLAB program can iterate over.

 

Connection.sparql(funName,  argList) -> Scan

This method calls a SPARQL function on the server. It takes two arguments, funName and argList, where funName is the SPARQL function name and argList is a MATLAB cell array holding the arguments passed to the SPARQL function. The result is returned as a scan.

 

Connection.sparqlDo(q)

Connection.sparqlDo(funName,  argList)

These methods are similar to the previous ones; the only difference is that they do not return a scan, so there is no scan for the user to free. They are primarily designed for SciSPARQL updates and function definitions.

 

Connection.sparqlPrint(q)

Connection.sparqlPrint(funName,  argList)

These methods are similar to the previous ones but do not return a scan. All results are printed to the interpreter console using the default row printer Scan.printRow(). They should be used with caution for very big result sets.
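
For example, to dump the entire default graph directly to the console, one might write:

c.sparqlPrint('SELECT ?s ?p ?o WHERE { ?s ?p ?o }');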

 

Connection.insert(S, P, O)

The function loads an RDF triple (S, P, O) into the database, with S as subject, P as predicate, and O as object. The subject and the predicate must be of type URITYPE, which is defined as a MATLAB data type and can be constructed by calling the makeURI() function. The object can be of any type. The representation of RDF data types in MATLAB is discussed in the next section.

 

Scan.endOf() -> Boolean

This function checks if the end of a scan has been reached, i.e. if the last tuple has already been retrieved and Scan.nextRow() has been called after that.

 

Scan.width() -> Integer

This function returns the number of elements in the current row of a scan. When applied to an empty scan, 0 is returned.

 

Scan.getElement(pos) -> value

This function returns an element from the current row of a scan. pos specifies which element in the current row to access. The RDF value at that position is mapped to the corresponding MATLAB data type; an array is mapped to a MATLAB array.

 

Scan.nextRow() -> Boolean

This function advances to the next tuple in the scan. It returns true on success, and false when the end of the scan has been reached, i.e. when Scan.endOf() will return true.

 

Scan.printRow()

The function prints all elements in the current tuple of the scan. RDF values such as URIs, typed literals, and language-tagged strings are printed in Turtle notation.

 

freeScan(Sid)

If a scan is no longer used, this function is called to free the memory allocated for the scan. The scan is also automatically freed when the end-of-scan is reached, so freeScan() is used only when prematurely ending a scan.

 

freeAllScans()

This function frees the memory allocated for all scans.
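
The following sketch puts the scan interface together by iterating over the values of the property inserted by the insert() examples earlier in this section. It assumes that a new scan is initially positioned on its first row and that element positions are 1-based, following MATLAB conventions.

s = c.sparql('SELECT ?s ?o WHERE { ?s <http://example.org/ns#p> ?o }');

while ~s.endOf()
    disp(s.getElement(1));   % the subject, a URITYPE object
    disp(s.getElement(2));   % the object, mapped to a native MATLAB value or array
    s.nextRow();             % the scan is freed automatically when its end is reached
end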

 

8.2.3 Constructors

 

makeTimeVal(timeVector)

This data constructor creates a TIMEVALTYPE object representing an RDF time value. It takes one argument, timeVector, which is a vector representing a point in time, e.g. [2013 12 25 12 55 33], and constructs a TIMEVALTYPE object that holds the value of timeVector as a property. The last component (seconds) can have a fractional part, which is translated to milliseconds when mapping to the corresponding RDF type.

 

makeTimeVal(timeVector, timeZone)

This function creates a TIMEVALTYPE object representing an RDF time value that also contains the time zone information passed in the timeZone parameter. That parameter is an integer number denoting seconds west of Greenwich, e.g. -3600 for Uppsala (Central European Time).
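
For example, the following sketch stores a Central European Time timestamp with a fractional second as a value of the :p property; the subject :x8 is chosen for illustration.

tv = makeTimeVal([2013 12 25 12 55 33.5], -3600);  % 12:55:33.5 CET on 25 December 2013

c.insert(c.makeURI('', 'x8'), c.makeURI('', 'p'), tv);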

 

Connection.makeURI(prefix, tail)

This function creates a URITYPE object representing an RDF URI. The URI string is constructed by concatenating the prefix registered with Connection.usePrefix() and identified by the prefix argument with the given tail string.

 

makeURI(uriString)

Connection.makeURI(uriString)

Creates a URITYPE object, representing an RDF URI, from the complete URI string uriString.
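
For example, with the empty prefix registered earlier in this section, the following two calls construct the same URI:

u1 = c.makeURI('', 'p');                  % registered prefix + tail

u2 = makeURI('http://example.org/ns#p');  % complete URI string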

 

makeUniStr(string, langTag)

This function creates a UNISTRINGTYPE object, representing language- and locale-tagged strings in RDF. It takes two arguments and stores them in the String and LangTag properties, respectively. To represent RDF string literals without language or locale tags, native MATLAB strings should be used.

 

makeTypedRDF(literal, datatype)

This function creates a TYPEDRDFTYPE object representing a typed RDF literal of the specified datatype, given as a string. A typed literal is a string combined with a datatype URI. Typed literals enable the RDF type system to be extended with new datatypes.
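
As an illustration, the following sketch stores a language-tagged string and a typed literal as values of the :p property; the subjects :x9 and :x10 and the xsd:decimal datatype URI are chosen for illustration.

c.insert(c.makeURI('', 'x9'), c.makeURI('', 'p'), makeUniStr('hejsan', 'sv'));

c.insert(c.makeURI('', 'x10'), c.makeURI('', 'p'), ...
         makeTypedRDF('2.5', 'http://www.w3.org/2001/XMLSchema#decimal'));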

 

 

 

 


9. Advanced issues and work in progress

 

SciSPARQL is designed to be a strict superset of

 

- SPARQL 1.1 Query Language, as defined in the W3C Recommendation of 21 March 2013:

http://www.w3.org/TR/sparql11-query/

 

- SPARQL 1.1 Update Language, as defined in the W3C Recommendation of 21 March 2013:

http://www.w3.org/TR/sparql11-update/

 

However, a small number of features of the SPARQL 1.1 Query Language are not implemented in the current SciSPARQL release. Below is the complete list, following the section numbering of the W3C recommendation:

 

(8.2) MINUS

The W3C recommendation is ambiguous here; the implied concept of "compatible" solutions contradicts its definition in Section 18.3 of the same document.

 

(16.2.1) blank nodes in CONSTRUCT templates (17.4.2.9) BNODE(), (17.4.2.12) UUID(), (17.4.2.13) STRUUID()

Currently we cannot guarantee distinct outputs for different solutions.

 

(16.4) DESCRIBE

The behavior is not defined by the standard, which says only that "It may be information the service deems relevant to the resources being described". We currently provide no such relevance defaults.

 

(17.4.1.2) IF, (17.4.1.3) COALESCE

In future releases we plan to extend our physical algebra with a branching operator.

 

(17.4.1.9 - 17.4.1.10) Lazy evaluation of IN lists, silent handling of errors

In future releases error catching from arithmetic operators will be made silent for these cases.

 

(17.4.3.13) Unicode

SciSPARQL stores RDF strings with language and locale information; however, Unicode escapes \uXXXX must be specified explicitly. Reading of Unicode input files is not currently supported.

 

(17.4.3.14) REGEX(), (17.4.3.15) REPLACE()

A subset of the REGEX() functionality is currently supported by the SciSPARQL regex() function, which is sufficient for most cases. We currently do not plan to implement the complete regular expression language as specified in

http://www.w3.org/TR/xpath-functions/#regex-syntax

 

(17.4.6) Hash functions MD5, SHA1, SHA256, SHA384, SHA512

In future releases we are going to use a standard library with implementations of these functions.

 

All other functionality of W3C SPARQL 1.1 Query and Update languages is implemented and the queries and updates produce the same results as shown in the examples on the recommendation pages.

 

In the following sections we clarify a number of advanced issues that might contradict some intuitions about the language semantics.

 

 

9.1. Open world assumption

 

Since the Semantic Web initiative developed as an interdisciplinary approach with roots in both AI and database technology, it retained the capacity to handle negative facts. For example, one could store the following RDF fact

 

:joe :isNobelPrizeLaureate false .

 

in addition to a positive fact

 

:joe a :Person .

 

and query for people for whom it is known that they are not Nobel Prize laureates:

 

SELECT ?person

 WHERE { ?person a :Person ;

                 :isNobelPrizeLaureate false }

 

In the traditional database paradigm, however, it would be useless to store such a fact. Instead, it is assumed that the state of the database completely models a certain aspect of reality (the closed-world assumption), so only "positive" facts need to be stored. The corresponding query would be

 

SELECT ?person

 WHERE { ?person a :Person .

         FILTER NOT EXISTS { ?person :isNobelPrizeLaureate true } }

 

These two queries are not equivalent in the eyes of the SPARQL standard, as the latter searches for people who are not recorded as Nobel Prize laureates. However, they become semantically equivalent under the closed-world assumption: if X is not recorded to be A, then it is known that X is not A.

 

SciSPARQL adopts the open-world assumption, under which the above two queries are not equivalent, and thus follows the standard.

 

However, in the current release these Boolean values are physically stored as integers, and thus are indistinguishable from the integer values 0 and 1. The datatype() function will report xsd:integer; all other comparisons, however, will use the Effective Boolean Value (EBV), as specified in Section 17.2.2 of the query standard.

 

An alternative (closed-world) approach would be to silently ignore negative facts upon insertion, and to disable queries that look for false literals in the stored data.

 

 

9.2. Default and strict SciSPARQL

 

SciSPARQL is designed to be a useful tool in scientific computing and data management, and to translate efficiently to SQL whenever storage back-ends are used. For this purpose a number of small changes were made to the standard SPARQL 1.1 semantics, which might generate extra query results (false positives in the eyes of the standard). At UDBL we believe that in some cases the user might actually want these extra results.

 

However, in order to legitimately claim to be a superset of the standard, a switch is provided to emulate the "strict" standard behavior. Here are the cases where it matters:

 

 

9.2.1. Comparing values of different types

 

In SPARQL, the result of comparing two values of different types is always false (more precisely, a silent error, which affects a filter in the same way as false). However, in the underlying system of SSDM all values are comparable with each other, which is suitable for sorting and constructing trees.

 

The query

 

SELECT (true AS ?res) WHERE { FILTER ('a' > 1) }

 

will return true by default. After setting (setq _sq_strict_ t) an extra type check is performed, and the query returns no results.

 

 

9.3. Run-time error handling

 

As follows from Section 17.2 of the standard, W3C SPARQL employs a three-valued logic, where the unbound value resulting from the use of OPTIONAL and UNION clauses is treated in the same way as the error value resulting, e.g., from failed numeric expressions like ?x/0 or from comparisons of incompatible types like 'a' > 1.

 

This value is propagated through the logical AND and OR operators (according to the truth tables in the same section of the standard) and through all kinds of functions, except bound(), if() and coalesce(). In other words, the errors are silent and might not propagate out of AND, OR, if(), and coalesce(). For example, the following FILTER expressions are still true, even though their sub-expressions ?x/0 > 3, 'a' > 2, and ?x/0, respectively, produce errors:

 

?x/0 > 3 || 5 > 3

!('a'>2 && false)

coalesce(?x/0, 5) > 3

 

The result would be the same if an unbound variable appeared in place of these sub-expressions.

 

The standard does not say explicitly, however, whether the reverse holds, i.e. whether the error value should be treated in the same way as unbound. For example, should the expression bound(5/0) evaluate to error or to false?

 

SciSPARQL does not distinguish between an error and a logical false resulting from an expression. Either of these values, if propagated to the result row, invalidates it (such rows are not returned). The unbound value is treated differently in two ways:

- it does not invalidate the result row,

- it is recognized by bound() and !bound() (see the example below).
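
For example, using the MATLAB front-end from Section 8 and the data inserted there, the following sketch illustrates both points; the property :q is hypothetical and has no triples in that data, so ?o remains unbound in every solution.

% every row is returned, with ?o unbound
c.sparqlPrint(['SELECT ?s ?o WHERE { ?s :p ?x . ' ...
               'OPTIONAL { ?x :q ?o } }']);

% no rows are returned, since bound(?o) is false for unbound ?o
c.sparqlPrint(['SELECT ?s ?o WHERE { ?s :p ?x . ' ...
               'OPTIONAL { ?x :q ?o } FILTER (bound(?o)) }']);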

 

This can be summarized in the following table for the bound() function:

 

?x                     bound(?x)    !bound(?x)
true                   true         false
false                  false        false
error                  false        false
unbound                false        true
any other RDF term     true         false

 

One might argue that bound(false) should be true; however, in the current implementation of SciSPARQL, false and error are equivalent.