=======
DataHub
=======

For users who want to analyze the data stored in CELUS themselves, CELUS
offers an export into DataHub. The export is updated daily and contains all
usage data from CELUS. Other data, such as information about harvesting
success, configured credentials, etc., is not included.

.. contents::
   :local:
   :depth: 2
   :backlinks: entry

Connecting to DataHub
=====================

ClickHouse connection parameters (host, ports, database name, user name, and
password) can be retrieved either in the CELUS UI under the "Data Hub" page
or through the public API with the same API key used for other read-only
integration endpoints. See :ref:`datahub-credentials`.

Technology
==========

DataHub is built on top of `ClickHouse <https://clickhouse.com/>`_, a
powerful column-oriented database engine. It is run and maintained by CELUS
on its own infrastructure. ClickHouse is compatible with Microsoft Power BI,
Apache Superset, Metabase, and many other analytical tools.

Database structure
==================

Report tables
-------------

Each report from your CELUS instance is stored in a separate table in the
database. For example, if you are harvesting TR, DR and PR reports, you will
have three tables in the database named ``TR``, ``DR`` and ``PR``. COUNTER
5.1 reports have their own tables with the ``51`` suffix, e.g. ``TR51``.
Non-COUNTER reports are exported as well.

Each table has the same overall layout, with optional title or item fields
and report-specific COUNTER dimensions. Please note the double underscore in
some of the column names, e.g. ``organization__name``.
**Common columns (every report table):**

- ``organization_id`` (``integer``) - internal CELUS id of the organization
  that the row belongs to
- ``organization__name`` (``string``) - name of that organization
- ``platform_id`` (``integer``) - internal CELUS id of the platform
- ``platform__name`` (``string``) - name of the platform
- ``platform__counter_registry_id`` (``string``) - COUNTER registry id of
  the platform, when available
- ``date`` (``date``) - usage date
- ``metric_id`` (``integer``) - internal CELUS id of the metric
- ``metric__short_name`` (``string``) - metric short name (for COUNTER
  reports, the COUNTER metric name)
- ``import_batch_id`` (``integer``) - internal id of the import batch (one
  harvest, upload, etc.); one batch always covers one month. Mainly for
  internal use and reconciliation
- ``last_update`` (``DateTime``) - when this row's content was last changed
- ``value`` (``integer``) - metric value (usage count)

**Title columns** (present when the report type uses titles, e.g. TR but not
PR):

- ``title_id`` (``integer``) - internal CELUS id of the title
- ``title__name`` (``string``)
- ``title__pub_type`` (``string``)
- ``title__isbn`` (``string``)
- ``title__issn`` (``string``)
- ``title__eissn`` (``string``)
- ``title__issn_l`` (``string``) - ISSN-L
- ``title__doi`` (``string``)
- ``title__proprietary_ids`` (array of strings) - proprietary identifiers
  from the COUNTER record

**Item columns** (present when the report type uses items):

- ``item_id`` (``integer``)
- ``item__name`` (``string``)
- ``item__publication_date`` (``date``)
- ``item__doi`` (``string``)
- ``item__isbn`` (``string``)
- ``item__issn`` (``string``)
- ``item__eissn`` (``string``)
- ``item__proprietary_ids`` (array of strings) - proprietary identifiers
  from the COUNTER record

**Report specific dimensions:**

For each dimension defined on the report type in CELUS, there is a column
named after that dimension's short name in *lowercase*.
These names usually match the COUNTER dimension names, but there are some
exceptions. The ``Platform`` dimension in COUNTER is stored under
``platform_in_counter_data`` (to distinguish it from CELUS platform
attributes).

If the report has a *Year of Publication* (``yop``) dimension, an extra
integer column ``yop__number`` is present. It contains a numeric year parsed
from the ``yop`` string (``0`` if no year could be determined), which is
convenient for sorting and numeric filters.

Example report specific dimensions for the COUNTER 5.1 Title Report (table
``TR51``):

- ``data_type`` (``string``) - the data type of the title
- ``access_method`` (``string``) - the access method
- ``platform_in_counter_data`` (``string``) - the platform name as it is
  found in the COUNTER data (the ``Platform`` column in the COUNTER data)
- ``access_type`` (``string``) - the access type
- ``publisher`` (``string``) - the publisher of the title
- ``yop`` (``string``) - the year of publication
- ``yop__number`` (``integer``) - the numeric year parsed from the ``yop``
  string (``0`` if no year could be determined)

Merged views (COUNTER 5 + 5.1)
------------------------------

For selected master report types (TR, DR, PR, IR_M1), the database also
contains SQL **views** named ``TR_merged``, ``DR_merged``, ``PR_merged``,
and ``IR_M1_merged``. Each view combines the COUNTER 5.1 table with the
corresponding COUNTER 5.0 table for the same logical report, so you can
query both versions in one place. The views include an extra string column
``source_report`` with the name of the underlying export table for each row
(for example ``TR51`` vs ``TR``). You can still use the dedicated ``TR`` /
``TR51`` (etc.) tables if you need only one COUNTER version.

The merged views use the same logic as the merging of COUNTER 5.1 and
COUNTER 5.0 tables in the reporting module in CELUS: when data for COUNTER
5.1 is available for a given month, it is used; otherwise the COUNTER 5.0
data is used.
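To see how a merged view combines the two COUNTER versions, you can group by
``source_report``. The following query is a sketch against the ``TR_merged``
view described above; it shows the date range and total usage contributed by
each underlying table:

.. code-block:: sql

   SELECT
       source_report,
       MIN(date) AS first_month,
       MAX(date) AS last_month,
       SUM(value) AS total_usage
   FROM TR_merged
   GROUP BY source_report

Because a given month comes from only one of the two source tables, the date
ranges reported for ``TR`` and ``TR51`` should not overlap.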
Tag tables
----------

The database also contains tag tables for titles, organizations and
platforms. They store the tags assigned to titles, organizations and
platforms, and can be joined with the main report tables to enable
filtering, grouping and other analytics based on tags. The tables are:

- ``title_tags``
- ``organization_tags``
- ``platform_tags``

The structure of the tag tables is as follows:

- ``{TARGET_ID}`` (``integer``) - internal CELUS id of the tagged target,
  where `TARGET` is one of (`title`, `organization`, `platform`)
- ``tag_id`` (``integer``) - internal CELUS id of the tag
- ``tag__name`` (``string``) - the name of the tag
- ``tag_class_id`` (``integer``) - internal CELUS id of the tag class that
  the tag belongs to
- ``tag_class__name`` (``string``) - the name of the tag class
- ``internal`` (``boolean``) - whether the tag class is **internal**, i.e.
  managed by CELUS and not by the user
- ``last_update`` (``DateTime``) - when this tag row was last synced to
  DataHub

The first column differs by table. In the ``title_tags`` table, it is
``title_id``; in the ``organization_tags`` table, it is ``organization_id``;
in the ``platform_tags`` table, it is ``platform_id``.

Example query:

.. code-block:: sql

   SELECT
       tag__name AS Topic,
       SUM(value) AS Unique_Item_Requests
   FROM TR
   INNER JOIN title_tags AS tt ON tt.title_id = TR.title_id
   WHERE (tt.tag_class__name = 'Scopus #2')
     AND (metric__short_name = 'Unique_Item_Requests')
   GROUP BY 1
   ORDER BY 2 DESC
   LIMIT 10

The above query uses the internal CELUS tag class `Scopus #2` to assign
topics to titles and then sums up the usage by topic. It prints the top 10
topics by usage as measured by the `Unique_Item_Requests` metric. The output
may look like this:

.. code-block:: text

       ┌─Topic────────────────────────────────────────┬─Unique_Item_Requests─┐
    1. │ Medicine                                     │              4435267 │
    2. │ Social Sciences                              │              3788952 │
    3. │ Biochemistry, Genetics and Molecular Biology │              2205736 │
    4. │ Arts and Humanities                          │              1775827 │
    5. │ Engineering                                  │              1543425 │
    6. │ Environmental Science                        │              1516750 │
    7. │ Agricultural and Biological Sciences         │              1487613 │
    8. │ Chemistry                                    │              1250674 │
    9. │ Business, Management and Accounting          │              1236660 │
   10. │ Psychology                                   │              1082768 │
       └──────────────────────────────────────────────┴──────────────────────┘

Keeping an external copy in sync
================================

DataHub is designed to handle the load of real-time analytics and reporting,
so there is no need to copy the DataHub report tables into your own
database. However, if you need to do so anyway (for backup purposes or local
processing), the following text describes how.

One simple option is to drop the DataHub tables and create fresh copies on a
regular basis. Because the usage data usually changes only on a monthly
basis, a monthly full refresh is a valid approach. If you need to sync the
data more frequently, or you want to save some bandwidth, you can use the
following incremental approach to keep your local copy in sync with DataHub.

Import batches
--------------

Every usage row belongs to exactly one ``import_batch_id``. An import batch
corresponds to one import (harvest, manual upload, etc.) and **one calendar
month** of data. Import batches are nearly immutable, so the basic approach
is to compare the set of import batch ids between your local copy and
DataHub and deal with the differences.

In addition, import batches carry a ``last_update`` timestamp, which should
additionally be used to detect changes to the import batch content. Such
changes typically occur when an older report is replaced by a new one and
interest data is thus updated; updates of this kind affect only the
"interest" report, not the underlying usage data reports. However, other
report types may also change, for example when title metadata is updated.
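The batch-level information needed for an incremental sync can be read from
the report tables themselves. The following query is a sketch against the
``TR`` table; run the equivalent against each report table you replicate:

.. code-block:: sql

   SELECT
       import_batch_id,
       MAX(last_update) AS batch_last_update,
       COUNT(*) AS row_count
   FROM TR
   GROUP BY import_batch_id

The resulting ``import_batch_id`` set and per-batch ``batch_last_update``
values are the inputs to the two comparison steps described below.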
Please note that although ``last_update`` is stored **on every row**, it
should be used only as a **batch-level** signal: the maximum of the
``last_update`` timestamps within an import batch determines the last time
that batch was changed. The two steps below are the intended way to stay in
sync.

Step 1: compare ``import_batch_id`` sets
++++++++++++++++++++++++++++++++++++++++

#. On each sync, compute the set of distinct ``import_batch_id`` values in
   DataHub for each report table you replicate.
#. Compare that set to the set of batch ids you already have locally. Then:

   * **Ids that appear in DataHub but not locally** — insert **all** rows
     for those ``import_batch_id`` values from DataHub into your mirror.
   * **Ids that appear locally but no longer in DataHub** — delete all rows
     with those ``import_batch_id`` values from your mirror.

Step 2: compare ``MAX(last_update)`` per shared batch
+++++++++++++++++++++++++++++++++++++++++++++++++++++

For each ``import_batch_id`` that exists **both** in DataHub and in your
mirror:

#. Compute ``MAX(last_update)`` over all rows with that ``import_batch_id``
   in DataHub.
#. Compare it to ``MAX(last_update)`` for the same id in your mirror.

If DataHub's maximum is **greater than** the local maximum, CELUS has
re-exported that batch — **delete** every row with that ``import_batch_id``
from your mirror and reload the full batch from DataHub. If both maxima are
equal (and server times are trustworthy), skip that batch.

Merged views
------------

Replicating **merged views** (``TR_merged``, etc.) is optional: they are
views over the base report tables. You can mirror only the base tables and
recreate equivalent logic in your environment if needed.

Tag tables
----------

Tag tables (``title_tags``, ``organization_tags``, ``platform_tags``) do
**not** carry ``import_batch_id``, so mirror them with a separate policy.
They are typically much smaller than the report tables, so a periodic full
refresh is the easiest way to keep them in sync.
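For such a full refresh, the current content of a tag table can simply be
dumped in one pass and reloaded into your mirror. As a sketch, using
ClickHouse's ``FORMAT`` clause to produce a CSV export (any format your
loading tool accepts works equally well):

.. code-block:: sql

   -- dump all current tag assignments for titles, for reloading elsewhere
   SELECT *
   FROM title_tags
   FORMAT CSVWithNames

On the receiving side, truncate the local copy of the table before loading
the dump, so that tags removed in CELUS also disappear from your mirror.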