Felis

Introduction

Felis is a way of describing database catalogs, scientific and otherwise, in a language and DBMS agnostic way. It’s built on concepts from JSON-LD/RDF and CSVW, but intended to provide a comprehensive way to describe tabular data, using annotations on tables, columns, and schemas, to document scientifically useful metadata as well as implementation-specific metadata for database management systems, file formats, and application data models.

When processing a Felis description, we envision SQLAlchemy to be the target implementation backend, so descriptions for Tables, Columns, Foreign Keys, Constraints, and Indexes should generally map very closely to SQLAlchemy parameters for those objects.

Liquibase descriptions were also consulted. Liquibase is oriented around the concept of a changeset. It should be the case that a Felis description could be transformed into a Liquibase changeset without too much effort.

JSON-LD

JSON-LD is a way of representing data in a linked fashion. The reason why it’s most interesting for this problem is because it provides a few algorithms for manipulating data which is useful for being able to also provide a succinct way to describe something that’s human readable, and algorithms to translate those descriptions into objects that are easier to process by a computer.

Because of the emphasis put on linking objects together, it provides a natural way of describing the fundamentally relational objects that make up a database, including metadata about them.

Felis is heavily influenced by work on CSVW, but CSVW is oriented a bit more towards publishing data to the web, and that doesn’t quite capture the use case of desribing tables, especially those which haven’t been created yet. Still, for services which may return CSV files, a translation to CSVW should be straightforward.

Some links that might be helpful for understanding JSON-LD:

http://arfon.org/json-ld-for-software-discovery-reuse-and-credit/index.html https://w3c.github.io/json-ld-syntax/#basic-concept

IRIs and @context

Linked Data, and the Web in general, uses IRIs (Internationalized Resource Identifiers as described in [RFC3987]) for unambiguous identification. This means the key in every annotation must be an IRI.

The simplest possible schema, a schema with one table which contains a point, represented in JSON, would look like the following:

{
  "name": "MySchema",
  "tables": [
    {
      "name": "Point",
      "columns": [
        {
          "name": "ra",
          "datatype": "float"
        },
        {
          "name": "dec",
          "datatype": "float"
        }
      ]
    }
  ]
}

We can infer that this is probably describing a schema, but it’s possible the definitions are ambiguous. IRIs help with this:

{
  "http://lsst.org/felis/name": "MySchema",
  "http://lsst.org/felis/tables": [
    {
      "http://lsst.org/felis/name": "Point",
      "http://lsst.org/felis/columns": [
        {
          "http://lsst.org/felis/name": "ra",
          "http://lsst.org/felis/datatype": "float"
        },
        {
          "http://lsst.org/felis/name": "dec",
          "http://lsst.org/felis/datatype": "float"
        }
      ]
    }
  ]
}

This provides unambiguous definitions to the semantics of each value, but it’s extremely wordy, compared to the natural JSON form.

To help with this, JSON-LD document has a context. Every Felis description should as well. @context is similar to an XML namespace.

Used to define the short-hand names that are used throughout a JSON-LD document. These short-hand names are called terms and help developers to express specific identifiers in a compact manner.

{
  "@context": "http://lsst.org/felis/",
  "name": "MySchema",
  "tables": [
    {
      "name": "Point",
      "columns": [
        {
          "name": "ra",
          "datatype": "float"
        },
        {
          "name": "dec",
          "datatype": "float"
        }
      ]
    }
  ]
}

This is fine, but the base vocabulary of Felis doesn’t help much with annotating columns with FITS or IVOA terms, for example. So we can add to our context more vocabulary terms.

{
  "@context": {
    "@vocab": "http://lsst.org/felis/",
    "ivoa": "http://ivoa.net/",
    "fits": "http://fits.gsfc.nasa.gov/FITS/4.0/"
  },
  "name": "MySchema",
  "tables": [
    {
      "name": "Point",
      "columns": [
        {
          "name": "ra",
          "datatype": "float",
          "ivoa:ucd": "pos.eq.ra;meta.main",
          "fits:tunit": "deg"
        },
        {
          "name": "dec",
          "datatype": "float",
          "ivoa:ucd": "pos.eq.dec;meta.main",
          "fits:tunit": "deg"
        }
      ]
    }
  ]
}

Currently, vocabularies aren’t formally defined for IVOA, FITS, MySQL, Oracle, Postgres, SQLite. For now, we won’t worry about that too much. For most descriptions of tables, we will recommend a default context of the following:

{
  "@context": {
    "@vocab": "http://lsst.org/felis/",
    "ivoa": "http://ivoa.net/",
    "votable": "http://ivoa.net/documents/VOTable/",
    "fits": "http://fits.gsfc.nasa.gov/FITS/4.0/",
    "mysql": "http://mysql.com/",
    "postgres": "http://posgresql.org/",
    "oracle": "http://oracle.com/database/",
    "sqlite": "http://sqlite.org/"
  }
}

@id

The main way to reference objects within a JSON-LD document is by id. The @id attribute of any object MUST be unique in that document. @id is the main way we use to reference objects in a Felis description, such as the columns referenced in an index, for example.

Tabular Data Models

This section defines the objects which make up the model.

The annotations provide information about the columns, tables, and schemas they are defined in. The values of an annotation may be a list, object, or atomic values. To maximize portability, it’s recommended to use atomic values everywhere possible. A list or a structured object, for example, may need to be serialized in target formats that only allow key-value metadata on column and table objects. This would include storage in a database as well.

Schemas

A schema is a group of tables.

A schema comprises a group of annotated tables and a set of annotations that relate to that group of tables. The core annotations of a schema are:

  • name — The name of this schema. In implementation terms, this typically maps to:
    • A schema in a CREATE SCHEMA statement in Postgres.
    • A database in a CREATE DATABASE statement in MySQL. There is also a synonym for this statement under CREATE SCHEMA.
    • A user in a CREATE USER statement in Oracle
    • A SQLite file, which might be named according to [name].db
  • @id — an identifier for this group of tables. This may be used for relating schemas together at a higher level. Typically, the name of the schema can be used as the id.
  • description — A textual description of this schema
  • tables — the list of tables in the schema. A schema MUST have one or more tables.

Schemas MAY in addition have any number of annotations which provide information about the group of tables. Annotations on a group of tables may include:

  • DBMS-specific information for a schema, especially for creating a schema.
  • IVOA metadata about the table
  • Schema versioning information
  • Column Groupings
  • Links to other schemas which may be related
  • Reference URLs
  • Provenance

Tables

A Table within a Schema. The core annotations of a table are:

  • name — The name of this table. In implementation terms, this typically maps to a table name in a CREATE TABLE statement in a MySQL/Oracle/Postgres/SQLite.
  • @id — an identifier for this table
  • description — A textual of this table
  • columns — the list of columns in the table. A table MUST have one or more columns and the order of the columns within the list is significant and MUST be preserved by applications.
  • primaryKey — A column reference that holds either a single reference to a column id or a list of column id references for compound primary keys.
  • constraints — the list of constraints for the table. A table MAY have zero or more constraints. Usually these are Forein Key constraints.
  • indexes — the list of indexes in the schema. A schema MAY have zero or more indexes.

Tables MAY in addition have any number of annotations which provide information about the table. Annotations on a table may include:

  • DBMS-specific information for a table, such as storage engine.
  • IVOA metadata about the table, such as utype
  • Links to other tables which may be related
  • Provenance

Columns

Represents a column in a table. The core annotations of a column are:

  • name — the name of the column.
  • @id — an identifier for this column
  • description — A textual description of this column
  • datatype — the expected datatype for the value of the column. This is the canonical datatype, but may often be overridden by additional annotations for DBMS or format-specific datatypes.
  • value — the default value for a column. This is used in DBMS systems that support it, and it may also be used when processing a table.
  • nullable — if the column is nullable. When set to false, this will cause a NOT NULL to be appended to SQL DDL. false. A missing value is assumed to be equivalent to true. If the value is set to false and the column is referenced in the primaryKey property of a table, then an error should be thrown during the processing of the metadata.
  • autoincrement — If the column is the primary key or part of a primary key, this may be used to specify autoincrement behavior. We derive semantics from SQLAlchemy.

Columns MAY in addition have any number of annotations which provide information about the column. Annotations on a table may include:

  • DBMS-specific information for a table, such as storage engine.
  • IVOA metadata about the table, such as utype
  • Links to other tables which may be related
  • Provenance

Indexes

This section is under development

An index that is annotated with a table. An index is typically associated with one or more columns from a table, but it may consist of expressions involving the columns of a table instead.

The core annotations of an index are:

  • name — The name of this index. This is optional.
  • @id — an identifier for this index
  • description — A textual description of this index
  • columns — A column reference property that holds either a single reference to a column description object within this schema, or an list of references. This annotation is mutually exclusive with the expressions annotation.
  • expressions — A column reference property that holds either a single column expression object, or a list of them. This annotation is mutually exclusive with the columns annotation.

Constraints

This section is under development

  • name — The name of this constraint. This is optional.
  • @id — an identifier for this constraint
  • @type — One of “ForeignKey”, “Unique”, “Check”. Required.
  • description — A description of this constraint
  • columns — A column reference property that holds either a single reference to a column description object within this schema, or an list of references.
  • referencedColumns — A column reference property that holds either a single reference to a column description object within this schema, or an list of references. Used on ForeignKey Constraints.
  • expression — A column expression object. Used on Check Constraints.
  • deferrable — If true, emit DEFERRABLE or NOT DEFERRABLE when issuing DDL for this constraint.
  • initially — If set, emit INITIALLY when issuing DDL for this constraint.

References

This section is under development

References are annotated objects which hold a reference to a single object, usually a Column or a Column Grouping. While a reference to a column might normally be just an @id, we create a special object so that the reference itself may be annotated with additional information. This is mostly useful in the case of Column Groupings.

In VOTable, this is similar to the FIELDref and PARAMref objects. It’s also similar a GROUP nested in a GROUP, which provides an implicit reference where the nested GROUP would have an implicit reference to the parent.

  • name — The name of this reference
  • @id — an identifier for this reference
  • description — A description of the reference
  • reference — The id of the object being referenced

Column Groupings

This section is incomplete

  • name — The name of this table. In implementation terms, this typically maps to a table name in a CREATE TABLE statement in a MySQL/Oracle/Postgres/SQLite.
  • @id — an identifier for this table
  • description — any number of additional annotations on the table. This annotation may be empty.
  • reference — A reference to another column grouping, if applicable.
  • columnReferences — A list of column references in the table. A Column Grouping MUST have one or more column references.

Datatypes

Type C++ Python Java JDBC SQLAlchemy[1] Notes
boolean bool bool boolean BOOLEAN BOOLEAN  
byte int8 int byte TINYINT SMALLINT 2
short int16 int short SMALLINT SMALLINT  
int int32 int int INTEGER INTEGER  
long int64 int long BIGINT BIGINT  
float float float float FLOAT FLOAT  
double double float double DOUBLE FLOAT(precision=53)  
char string str String CHAR CHAR 3
string string str String VARCHAR VARCHAR 3
unicode string str String NVARCHAR NVARCHAR 3
text string str String CLOB CLOB  
binary string bytes byte[] BLOB BLOB  
Type MySQL SQLite Oracle Postgres Avro Parquet Notes
boolean BIT(1) BOOLEAN NUMBER(1) BOOLEAN boolean BOOLEAN 5
byte TINYINT TINYINT NUMBER(3) SMALLINT int INT_8  
short SMALLINT SMALLINT NUMBER(5) SMALLINT int INT_16  
int INT INTEGER INTEGER INT int INT_32  
long BIGINT BIGINT NUMBER(38, 0) BIGINT long INT_64  
float FLOAT FLOAT FLOAT FLOAT float FLOAT  
double DOUBLE DOUBLE FLOAT(24) DOUBLE PRECISION double DOUBLE  
char CHAR CHAR CHAR CHAR string UTF8/STRING  
string VARCHAR VARCHAR VARCHAR2 VARCHAR string UTF8/STRING  
unicode NVARCHAR NVARCHAR NVARCHAR2 VARCHAR string UTF8/STRING  
text LONGTEXT TEXT CLOB TEXT string UTF8/STRING  
binary LONGBLOB BLOB BLOB BYTEA bytes BYTE_ARRAY  
Type xsd VOTable Notes
boolean boolean boolean  
byte byte unsignedByte 3
short short short  
int int int  
long long long  
float float float  
double double double  
char string char[] 3
string string char[] 3
unicode string unicodeChar[] 3
text string unicodeChar[]  
binary base64Binary unsignedByte[] 6

Notes:

  • [1]: This is the default SQLAlchemy Mapping
  • [2] SQLAlchemy has no “TinyInteger”, so you need to override, or the default is SMALLINT
  • [3] The length is an additional parameter elsewhere for VOTable types
  • [4] This is a single byte value between 0-255, not a member of a byte array. It’s preferable to not use this type.
  • [5] Parquet Logical types from Thrift
  • [6] There’s also hexBinary, but it was not considered as the target format is usually human-readable XML

Length Constraints

Processing Metadata

This section is under development

Creating annotated tables

This section is under development

Metadata Compatibility

This section is non-normative.

As mentioned before, to maximize portability, it’s recommended to use atomic values everywhere possible. A list or a structured object, for example, may need to be serialized as a string (usually JSON) for target formats that only allow key-value metadata on column and table objects. This would include un-mapped storage to a database table.

In the case where all annotations are pure atoms, we can represent the annotations in virtually every format or model which allows a way to store key-value metadata on table and columns. This includes parquet files and afw.table objects.

We assume that atomic values of an annotation will likely be stored as string in most formats. This means libraries processing the metadata may need to translate a formatted number back to a float or double. Most of this can probably be automated with a proper vocabulary for Felis.

Formats and Models

afw.table

A few of the metadata values for tables and columns are storeable on in the properties of a schema (table) or field.

YAML/JSON

This is the most natural format. Note that @id fields must be quoted in a YAML file.

FITS

A convention and vocabulary for FITS header keywords is being developed. In general, a FITS keyword includes a name, a value, and a comment.

Avro

As Avro is very similar to YAML and JSON

Parquet

Parquet files allow key-value metadata on column and table objects, though all values must be strings.

Relational Databases

Relational database do not necessarily have facilities to directly annotate columns and tables. However, we

VOTable

The annotations for columns and tables should be reused where possible. The Column Groupings are based off of the GROUP element in VOTable.

HDF5 and PyTables

PyTables is an opinionated way of representing tabular data in HDF5.