Felis¶

Introduction¶

Felis is a way of describing database catalogs, scientific and otherwise, in a language and DBMS agnostic way. It’s built on concepts from JSON-LD/RDF and CSVW, but intended to provide a comprehensive way to describe tabular data, using annotations on tables, columns, and schemas, to document scientifically useful metadata as well as implementation-specific metadata for database management systems, file formats, and application data models.

When processing a Felis description, we envision SQLAlchemy to be the target implementation backend, so descriptions for Tables, Columns, Foreign Keys, Constraints, and Indexes should generally map very closely to SQLAlchemy parameters for those objects.

Liquibase descriptions were also consulted. Liquibase is oriented around the concept of a changeset. It should be the case that a Felis description could be transformed into a Liquibase changeset without too much effort.

JSON-LD¶

JSON-LD is a way of representing data in a linked fashion. The reason why it’s most interesting for this problem is because it provides a few algorithms for manipulating data which is useful for being able to also provide a succinct way to describe something that’s human readable, and algorithms to translate those descriptions into objects that are easier to process by a computer.

Because of the emphasis put on linking objects together, it provides a natural way of describing the fundamentally relational objects that make up a database, including metadata about them.

Felis is heavily influenced by work on CSVW, but CSVW is oriented a bit more towards publishing data to the web, and that doesn’t quite capture the use case of desribing tables, especially those which haven’t been created yet. Still, for services which may return CSV files, a translation to CSVW should be straightforward.

Some links that might be helpful for understanding JSON-LD:

http://arfon.org/json-ld-for-software-discovery-reuse-and-credit/index.html https://w3c.github.io/json-ld-syntax/#basic-concept

IRIs and @context¶

Linked Data, and the Web in general, uses IRIs (Internationalized Resource Identifiers as described in [RFC3987]) for unambiguous identification. This means the key in every annotation must be an IRI.

The simplest possible schema, a schema with one table which contains a point, represented in JSON, would look like the following:

{
  "name": "MySchema",
  "tables": [
    {
      "name": "Point",
      "columns": [
        {
          "name": "ra",
          "datatype": "float"
        },
        {
          "name": "dec",
          "datatype": "float"
        }
      ]
    }
  ]
}

We can infer that this is probably describing a schema, but it’s possible the definitions are ambiguous. IRIs help with this:

{
  "http://lsst.org/felis/name": "MySchema",
  "http://lsst.org/felis/tables": [
    {
      "http://lsst.org/felis/name": "Point",
      "http://lsst.org/felis/columns": [
        {
          "http://lsst.org/felis/name": "ra",
          "http://lsst.org/felis/datatype": "float"
        },
        {
          "http://lsst.org/felis/name": "dec",
          "http://lsst.org/felis/datatype": "float"
        }
      ]
    }
  ]
}

This provides unambiguous definitions to the semantics of each value, but it’s extremely wordy, compared to the natural JSON form.

To help with this, JSON-LD document has a context. Every Felis description should as well. @context is similar to an XML namespace.

Used to define the short-hand names that are used throughout a JSON-LD document. These short-hand names are called terms and help developers to express specific identifiers in a compact manner.

{
  "@context": "http://lsst.org/felis/",
  "name": "MySchema",
  "tables": [
    {
      "name": "Point",
      "columns": [
        {
          "name": "ra",
          "datatype": "float"
        },
        {
          "name": "dec",
          "datatype": "float"
        }
      ]
    }
  ]
}

This is fine, but the base vocabulary of Felis doesn’t help much with annotating columns with FITS or IVOA terms, for example. So we can add to our context more vocabulary terms.

{
  "@context": {
    "@vocab": "http://lsst.org/felis/",
    "ivoa": "http://ivoa.net/",
    "fits": "http://fits.gsfc.nasa.gov/FITS/4.0/"
  },
  "name": "MySchema",
  "tables": [
    {
      "name": "Point",
      "columns": [
        {
          "name": "ra",
          "datatype": "float",
          "ivoa:ucd": "pos.eq.ra;meta.main",
          "fits:tunit": "deg"
        },
        {
          "name": "dec",
          "datatype": "float",
          "ivoa:ucd": "pos.eq.dec;meta.main",
          "fits:tunit": "deg"
        }
      ]
    }
  ]
}

Currently, vocabularies aren’t formally defined for IVOA, FITS, MySQL, Oracle, Postgres, SQLite. For now, we won’t worry about that too much. For most descriptions of tables, we will recommend a default context of the following:

{
  "@context": {
    "@vocab": "http://lsst.org/felis/",
    "ivoa": "http://ivoa.net/",
    "votable": "http://ivoa.net/documents/VOTable/",
    "fits": "http://fits.gsfc.nasa.gov/FITS/4.0/",
    "mysql": "http://mysql.com/",
    "postgres": "http://posgresql.org/",
    "oracle": "http://oracle.com/database/",
    "sqlite": "http://sqlite.org/"
  }
}

@id¶

The main way to reference objects within a JSON-LD document is by id. The @id attribute of any object MUST be unique in that document. @id is the main way we use to reference objects in a Felis description, such as the columns referenced in an index, for example.

Tabular Data Models¶

This section defines the objects which make up the model.

The annotations provide information about the columns, tables, and schemas they are defined in. The values of an annotation may be a list, object, or atomic values. To maximize portability, it’s recommended to use atomic values everywhere possible. A list or a structured object, for example, may need to be serialized in target formats that only allow key-value metadata on column and table objects. This would include storage in a database as well.

Schemas¶

A schema is a group of tables.

A schema comprises a group of annotated tables and a set of annotations that relate to that group of tables. The core annotations of a schema are:

name — The name of this schema. In implementation terms, this typically maps to:
- A schema in a CREATE SCHEMA statement in Postgres.
- A database in a CREATE DATABASE statement in MySQL. There is also a synonym for this statement under CREATE SCHEMA.
- A user in a CREATE USER statement in Oracle
- A SQLite file, which might be named according to [name].db
@id — an identifier for this group of tables. This may be used for relating schemas together at a higher level. Typically, the name of the schema can be used as the id.
description — A textual description of this schema
tables — the list of tables in the schema. A schema MUST have one or more tables.

Schemas MAY in addition have any number of annotations which provide information about the group of tables. Annotations on a group of tables may include:

DBMS-specific information for a schema, especially for creating a schema.
IVOA metadata about the table
Schema versioning information
Column Groupings
Links to other schemas which may be related
Reference URLs
Provenance

Tables¶

A Table within a Schema. The core annotations of a table are:

name — The name of this table. In implementation terms, this typically maps to a table name in a CREATE TABLE statement in a MySQL/Oracle/Postgres/SQLite.
@id — an identifier for this table
description — A textual of this table
columns — the list of columns in the table. A table MUST have one or more columns and the order of the columns within the list is significant and MUST be preserved by applications.
primaryKey — A column reference that holds either a single reference to a column id or a list of column id references for compound primary keys.
constraints — the list of constraints for the table. A table MAY have zero or more constraints. Usually these are Forein Key constraints.
indexes — the list of indexes in the schema. A schema MAY have zero or more indexes.

Tables MAY in addition have any number of annotations which provide information about the table. Annotations on a table may include:

DBMS-specific information for a table, such as storage engine.
IVOA metadata about the table, such as utype
Links to other tables which may be related
Provenance

Columns¶

Represents a column in a table. The core annotations of a column are:

name — the name of the column.
@id — an identifier for this column
description — A textual description of this column
datatype — the expected datatype for the value of the column. This is the canonical datatype, but may often be overridden by additional annotations for DBMS or format-specific datatypes.
value — the default value for a column. This is used in DBMS systems that support it, and it may also be used when processing a table.
nullable — if the column is nullable. When set to false, this will cause a NOT NULL to be appended to SQL DDL. false. A missing value is assumed to be equivalent to true. If the value is set to false and the column is referenced in the primaryKey property of a table, then an error should be thrown during the processing of the metadata.
autoincrement — If the column is the primary key or part of a primary key, this may be used to specify autoincrement behavior. We derive semantics from SQLAlchemy.

Columns MAY in addition have any number of annotations which provide information about the column. Annotations on a table may include:

DBMS-specific information for a table, such as storage engine.
IVOA metadata about the table, such as utype
Links to other tables which may be related
Provenance

Indexes¶

This section is under development

An index that is annotated with a table. An index is typically associated with one or more columns from a table, but it may consist of expressions involving the columns of a table instead.

The core annotations of an index are:

name — The name of this index. This is optional.
@id — an identifier for this index
description — A textual description of this index
columns — A column reference property that holds either a single reference to a column description object within this schema, or an list of references. This annotation is mutually exclusive with the expressions annotation.
expressions — A column reference property that holds either a single column expression object, or a list of them. This annotation is mutually exclusive with the columns annotation.

Constraints¶

This section is under development

name — The name of this constraint. This is optional.
@id — an identifier for this constraint
@type — One of “ForeignKey”, “Unique”, “Check”. Required.
description — A description of this constraint
columns — A column reference property that holds either a single reference to a column description object within this schema, or an list of references.
referencedColumns — A column reference property that holds either a single reference to a column description object within this schema, or an list of references. Used on ForeignKey Constraints.
expression — A column expression object. Used on Check Constraints.
deferrable — If true, emit DEFERRABLE or NOT DEFERRABLE when issuing DDL for this constraint.
initially — If set, emit INITIALLY when issuing DDL for this constraint.

References¶

This section is under development

References are annotated objects which hold a reference to a single object, usually a Column or a Column Grouping. While a reference to a column might normally be just an @id, we create a special object so that the reference itself may be annotated with additional information. This is mostly useful in the case of Column Groupings.

In VOTable, this is similar to the FIELDref and PARAMref objects. It’s also similar a GROUP nested in a GROUP, which provides an implicit reference where the nested GROUP would have an implicit reference to the parent.

name — The name of this reference
@id — an identifier for this reference
description — A description of the reference
reference — The id of the object being referenced

Column Groupings¶

This section is incomplete

name — The name of this table. In implementation terms, this typically maps to a table name in a CREATE TABLE statement in a MySQL/Oracle/Postgres/SQLite.
@id — an identifier for this table
description — any number of additional annotations on the table. This annotation may be empty.
reference — A reference to another column grouping, if applicable.
columnReferences — A list of column references in the table. A Column Grouping MUST have one or more column references.

Datatypes¶

Type	C++	Python	Java	JDBC	SQLAlchemy[1]	Notes
boolean	bool	bool	boolean	BOOLEAN	BOOLEAN
byte	int8	int	byte	TINYINT	SMALLINT	2
short	int16	int	short	SMALLINT	SMALLINT
int	int32	int	int	INTEGER	INTEGER
long	int64	int	long	BIGINT	BIGINT
float	float	float	float	FLOAT	FLOAT
double	double	float	double	DOUBLE	FLOAT(precision=53)
char	string	str	String	CHAR	CHAR	3
string	string	str	String	VARCHAR	VARCHAR	3
unicode	string	str	String	NVARCHAR	NVARCHAR	3
text	string	str	String	CLOB	CLOB
binary	string	bytes	byte[]	BLOB	BLOB

Type	MySQL	SQLite	Oracle	Postgres	Avro	Parquet	Notes
boolean	BIT(1)	BOOLEAN	NUMBER(1)	BOOLEAN	boolean	BOOLEAN	5
byte	TINYINT	TINYINT	NUMBER(3)	SMALLINT	int	INT_8
short	SMALLINT	SMALLINT	NUMBER(5)	SMALLINT	int	INT_16
int	INT	INTEGER	INTEGER	INT	int	INT_32
long	BIGINT	BIGINT	NUMBER(38, 0)	BIGINT	long	INT_64
float	FLOAT	FLOAT	FLOAT	FLOAT	float	FLOAT
double	DOUBLE	DOUBLE	FLOAT(24)	DOUBLE PRECISION	double	DOUBLE
char	CHAR	CHAR	CHAR	CHAR	string	UTF8/STRING
string	VARCHAR	VARCHAR	VARCHAR2	VARCHAR	string	UTF8/STRING
unicode	NVARCHAR	NVARCHAR	NVARCHAR2	VARCHAR	string	UTF8/STRING
text	LONGTEXT	TEXT	CLOB	TEXT	string	UTF8/STRING
binary	LONGBLOB	BLOB	BLOB	BYTEA	bytes	BYTE_ARRAY

Type	xsd	VOTable	Notes
boolean	boolean	boolean
byte	byte	unsignedByte	3
short	short	short
int	int	int
long	long	long
float	float	float
double	double	double
char	string	char[]	3
string	string	char[]	3
unicode	string	unicodeChar[]	3
text	string	unicodeChar[]
binary	base64Binary	unsignedByte[]	6

Notes:

[1]: This is the default SQLAlchemy Mapping
[2] SQLAlchemy has no “TinyInteger”, so you need to override, or the default is SMALLINT
[3] The length is an additional parameter elsewhere for VOTable types
[4] This is a single byte value between 0-255, not a member of a byte array. It’s preferable to not use this type.
[5] Parquet Logical types from Thrift
[6] There’s also hexBinary, but it was not considered as the target format is usually human-readable XML

Length Constraints¶

Processing Metadata¶

This section is under development

Creating annotated tables¶

This section is under development

Metadata Compatibility¶

This section is non-normative.

As mentioned before, to maximize portability, it’s recommended to use atomic values everywhere possible. A list or a structured object, for example, may need to be serialized as a string (usually JSON) for target formats that only allow key-value metadata on column and table objects. This would include un-mapped storage to a database table.

In the case where all annotations are pure atoms, we can represent the annotations in virtually every format or model which allows a way to store key-value metadata on table and columns. This includes parquet files and afw.table objects.

We assume that atomic values of an annotation will likely be stored as string in most formats. This means libraries processing the metadata may need to translate a formatted number back to a float or double. Most of this can probably be automated with a proper vocabulary for Felis.

Formats and Models¶

afw.table¶

A few of the metadata values for tables and columns are storeable on in the properties of a schema (table) or field.

YAML/JSON¶

This is the most natural format. Note that @id fields must be quoted in a YAML file.

FITS¶

A convention and vocabulary for FITS header keywords is being developed. In general, a FITS keyword includes a name, a value, and a comment.

Avro¶

As Avro is very similar to YAML and JSON

Parquet¶

Parquet files allow key-value metadata on column and table objects, though all values must be strings.

Relational Databases¶

Relational database do not necessarily have facilities to directly annotate columns and tables. However, we

VOTable¶

The annotations for columns and tables should be reused where possible. The Column Groupings are based off of the GROUP element in VOTable.

HDF5 and PyTables¶

PyTables is an opinionated way of representing tabular data in HDF5.

Navigation

Felis¶

Introduction¶

JSON-LD¶

IRIs and @context¶

@id¶

Tabular Data Models¶

Schemas¶

Tables¶

Columns¶

Indexes¶

Constraints¶

References¶

Column Groupings¶

Datatypes¶

Length Constraints¶

Processing Metadata¶

Creating annotated tables¶

Metadata Compatibility¶

Formats and Models¶

afw.table¶

YAML/JSON¶

FITS¶

Avro¶

Parquet¶

Relational Databases¶

VOTable¶

HDF5 and PyTables¶