DataHub¶
For users who want to analyze the data stored in CELUS themselves, CELUS offers an export into DataHub.
The export is updated daily and contains all usage data from CELUS. Other data, such as information about harvesting success, configured credentials, etc., is not included.
Connecting to DataHub¶
ClickHouse connection parameters (host, ports, database name, user name, and password) can be retrieved either in the CELUS UI under the “Data Hub” page or through the public API with the same API key used for other read-only integration endpoints. See DataHub credentials.
Technology¶
DataHub is built on top of ClickHouse, a powerful column-oriented database engine. It is run and maintained by CELUS on its own infrastructure.
ClickHouse is compatible with Microsoft Power BI, Apache Superset, Metabase and many other analytical tools.
Database structure¶
Report tables¶
Each report from your CELUS instance is stored in a separate table in the database. For example,
if you are harvesting TR, DR and PR reports, you will have three tables in the database with the
names TR, DR and PR. COUNTER 5.1 reports have their own tables with the 51 suffix,
e.g. TR51. Non-COUNTER reports are exported as well.
Each table has the same overall layout, with optional title or item fields and report-specific
COUNTER dimensions. Please note the double underscore in some of the column names, e.g.
organization__name.
Common columns (every report table):
- organization_id (integer) - internal CELUS id of the organization that the row belongs to
- organization__name (string) - name of that organization
- platform_id (integer) - internal CELUS id of the platform
- platform__name (string) - name of the platform
- platform__counter_registry_id (string) - COUNTER registry id of the platform, when available
- date (date) - usage date
- metric_id (integer) - internal CELUS id of the metric
- metric__short_name (string) - metric short name (for COUNTER reports, the COUNTER metric name)
- import_batch_id (integer) - internal id of the import batch (one harvest, upload, etc.); one batch always covers one month. Mainly for internal use and reconciliation
- last_update (DateTime) - when this row content was last changed
- value (integer) - metric value (usage count)
Title columns (present when the report type uses titles, e.g. TR but not PR):
- title_id (integer) - internal CELUS id of the title
- title__name (string)
- title__pub_type (string)
- title__isbn (string)
- title__issn (string)
- title__eissn (string)
- title__issn_l (string) - ISSN-L
- title__doi (string)
- title__proprietary_ids (array of strings) - proprietary identifiers from the COUNTER record
Item columns (present when the report type uses items):
- item_id (integer)
- item__name (string)
- item__publication_date (date)
- item__doi (string)
- item__isbn (string)
- item__issn (string)
- item__eissn (string)
- item__proprietary_ids (array of strings) - proprietary identifiers from the COUNTER record
Report specific dimensions: For each dimension defined on the report type in CELUS, there is a column
named after that dimension’s short name in lowercase. These names usually match the COUNTER dimension
names, but there are some exceptions. The Platform dimension in
COUNTER is stored under platform_in_counter_data (to distinguish it from CELUS platform
attributes).
If the report has a Year of Publication (yop) dimension, an extra integer column
yop__number is present. It contains a numeric year parsed from the yop string (0 if
no year could be determined), which is convenient for sorting and numeric filters.
Example report specific dimensions for the COUNTER 5.1 Title Report (table TR51):
- data_type (string) - the data type of the title
- access_method (string) - the access method
- platform_in_counter_data (string) - the platform name as it is found in the COUNTER data (the Platform column in the COUNTER data)
- access_type (string) - the access type
- publisher (string) - the publisher of the title
- yop (string) - the year of publication
- yop__number (integer) - the numeric year parsed from the yop string (0 if no year could be determined)
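Putting these columns together, a typical query against the TR51 table might break usage down by data type and publication year, using yop__number for numeric sorting (a sketch; the metric name is one example from COUNTER):

```sql
-- Yearly breakdown of usage by data type for one metric.
-- yop__number allows numeric ordering; 0 means the year is unknown.
SELECT
    data_type,
    yop__number,
    sum(value) AS total_usage
FROM TR51
WHERE metric__short_name = 'Unique_Item_Requests'
GROUP BY data_type, yop__number
ORDER BY data_type, yop__number
```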
Merged views (COUNTER 5 + 5.1)¶
For selected master report types (TR, DR, PR, IR_M1), the database also contains SQL views named
TR_merged, DR_merged, PR_merged, and IR_M1_merged. Each view combines the COUNTER 5.1
table with the corresponding COUNTER 5.0 table for the same logical report, so you can query both
versions in one place. The views include an extra string column source_report with the name of
the underlying export table for each row (for example TR51 vs TR). You can still use the
dedicated TR / TR51 (etc.) tables if you need only one COUNTER version.
The merged views use the same logic as the reporting module in CELUS when it merges the COUNTER 5.1 and COUNTER 5.0 tables: when COUNTER 5.1 data is available for a given month, it is used; otherwise the COUNTER 5.0 data is used.
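The source_report column makes it easy to see which COUNTER version supplied each month's data. For example (a sketch against the TR_merged view):

```sql
-- Monthly totals from the merged view; source_report preserves
-- the name of the underlying table (TR vs TR51) for each row.
SELECT
    date,
    source_report,
    sum(value) AS total_usage
FROM TR_merged
GROUP BY date, source_report
ORDER BY date
```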
Tag tables¶
The database also contains tag tables for titles, organizations and platforms. These tables are used to store the tags for the titles, organizations and platforms and can be joined with the main report tables to enable filtering, grouping and other analytics based on tags.
The tables are:
- title_tags
- organization_tags
- platform_tags
The structure of the tag tables is as follows:
- {TARGET_ID} (integer) - internal CELUS id of the tagged target, where TARGET is one of (title, organization, platform)
- tag_id (integer) - internal CELUS id of the tag
- tag__name (string) - the name of the tag
- tag_class_id (integer) - internal CELUS id of the tag class that the tag belongs to
- tag_class__name (string) - the name of the tag class
- internal (boolean) - whether the tag class is internal, i.e. managed by CELUS and not by the user
- last_update (DateTime) - when this tag row was last synced to DataHub
The first column differs by table. In the title_tags table, it is title_id;
in the organization_tags table, it is organization_id;
in the platform_tags table, it is platform_id.
Example query:
SELECT
tag__name AS Topic,
SUM(value) AS Unique_Item_Requests
FROM TR
INNER JOIN title_tags AS tt ON tt.title_id = TR.title_id
WHERE (tt.tag_class__name = 'Scopus #2') AND (metric__short_name = 'Unique_Item_Requests')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
The above query uses the internal CELUS tag class Scopus #2 to assign topics to titles and then sums up the usage by topic. It returns the top 10 topics by usage, as measured by the Unique_Item_Requests metric.
The output may look like this:
┌─Topic────────────────────────────────────────┬─Unique_Item_Requests─┐
1. │ Medicine │ 4435267 │
2. │ Social Sciences │ 3788952 │
3. │ Biochemistry, Genetics and Molecular Biology │ 2205736 │
4. │ Arts and Humanities │ 1775827 │
5. │ Engineering │ 1543425 │
6. │ Environmental Science │ 1516750 │
7. │ Agricultural and Biological Sciences │ 1487613 │
8. │ Chemistry │ 1250674 │
9. │ Business, Management and Accounting │ 1236660 │
10.│ Psychology │ 1082768 │
└──────────────────────────────────────────────┴──────────────────────┘
Keeping an external copy in sync¶
The DataHub is designed to handle the load of real-time analytics and reporting, so there is no need to copy the DataHub report tables into your own database. However, if you need to do so for some reason (for backup purposes or local processing), the following text describes how to do it.
One simple option is to drop the DataHub tables and create fresh copies on a regular basis. Because the usage data usually changes only on a monthly basis, a monthly full refresh is a valid approach.
If you need to sync the data more frequently, or you want to save some bandwidth, you can use the following incremental approach to keep your local copy in sync with the DataHub.
Import batches¶
Every usage row belongs to exactly one import_batch_id. An import batch corresponds to one import
(harvest, manual upload, etc.) and one calendar month of data.
Import batches are nearly immutable, so the basic approach is to compare the set of import batch ids between your local copy and the DataHub and deal with the differences.
Import batches also contain a last_update timestamp, which should additionally be used to detect
changes to an import batch's content. Such changes typically occur when an older report is replaced by a newer one
and the derived interest data is updated; these updates affect only the "interest" report, not the underlying
usage data reports. Other report types may also change, for example when title metadata is updated.
Please note that although last_update is stored on every row, it should be used only as a
batch-level signal: the maximum of the last_update timestamp for each import batch determines
the last time the import batch was changed.
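This batch-level signal can be computed with a single aggregation per table (a sketch, using the TR table as an example):

```sql
-- Latest change per import batch; comparing these values against the
-- timestamps stored during the previous sync shows which batches changed.
SELECT
    import_batch_id,
    max(last_update) AS batch_last_update
FROM TR
GROUP BY import_batch_id
```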
The two steps below are the intended way to stay in sync.
Step 1: compare import_batch_id sets¶
- On each sync, compute the set of distinct import_batch_id values in DataHub for each report table you replicate.
- Compare that set to the set of batch ids you already have locally.
Then:
- Ids that appear in DataHub but not locally: insert all rows for those import_batch_id values from DataHub into your mirror.
- Ids that appear locally but no longer in DataHub: delete all rows with those import_batch_id values from your mirror.
Step 2: compare last_update timestamps¶
For batch ids present both locally and in DataHub, compare the maximum last_update per batch in DataHub with the value you stored during the previous sync. If it is newer, delete the local rows for that batch and re-insert them from DataHub.
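Fetching the rows for a set of batches is then a plain filter on import_batch_id (a sketch; the ids 101 and 102 are placeholders for ids found during the comparison):

```sql
-- Pull all rows belonging to the batches that need to be (re-)inserted
-- into the local mirror.
SELECT *
FROM TR
WHERE import_batch_id IN (101, 102)
```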
Merged views¶
Replicating merged views (TR_merged, etc.) is optional: they are views over the base report
tables. You can mirror only the base tables and recreate equivalent logic in your environment if
needed.
Tag tables¶
Tag tables (title_tags, organization_tags, platform_tags) do not carry
import_batch_id. Mirror them with a separate policy. They are typically much smaller than the
report tables, so a periodic full refresh would be the easiest way to keep them in sync.