DataHub¶
For users who want to analyze the data stored in CELUS themselves, CELUS offers an export into DataHub.
The export is updated daily and contains all usage data from CELUS. Other data, such as information about harvesting success, configured credentials, etc., is not included.
Connecting to DataHub¶
ClickHouse connection parameters (host, ports, database name, user name, and password) can be retrieved either in the CELUS UI under the “Data Hub” page or through the public API with the same API key used for other read-only integration endpoints. See DataHub credentials.
Technology¶
DataHub is built on top of ClickHouse, a powerful column-oriented database engine. It is run and maintained by CELUS on its own infrastructure.
ClickHouse is compatible with Microsoft Power BI, Apache Superset, Metabase and many other analytical tools.
Database structure¶
Report tables¶
Each report from your CELUS instance is stored in a separate table in the database. For example,
if you are harvesting TR, DR and PR reports, you will have three tables in the database with the
names TR, DR and PR. COUNTER 5.1 reports have their own tables with the 51 suffix,
e.g. TR51. Non-COUNTER reports are exported as well.
Each table has the same overall layout, with optional title or item fields and report-specific
COUNTER dimensions. Please note the double underscore in some of the column names, e.g.
organization__name.
Common columns (every report table):
- organization_id (integer) - internal CELUS id of the organization that the row belongs to
- organization__name (string) - name of that organization
- platform_id (integer) - internal CELUS id of the platform
- platform__name (string) - name of the platform
- platform__counter_registry_id (string) - COUNTER registry id of the platform, when available
- date (date) - usage date
- metric_id (integer) - internal CELUS id of the metric
- metric__short_name (string) - metric short name (for COUNTER reports, the COUNTER metric name)
- import_batch_id (integer) - internal id of the import batch (one harvest, upload, etc.); one batch always covers one month. Mainly for internal use and reconciliation
- last_update (DateTime) - when this row content was last changed
- value (integer) - metric value (usage count)
Title columns (present when the report type uses titles, e.g. TR but not PR):
- title_id (integer) - internal CELUS id of the title
- title__name (string)
- title__pub_type (string)
- title__isbn (string)
- title__issn (string)
- title__eissn (string)
- title__issn_l (string) - ISSN-L
- title__doi (string)
- title__proprietary_ids (array of strings) - proprietary identifiers from the COUNTER record
Item columns (present when the report type uses items):
- item_id (integer)
- item__name (string)
- item__publication_date (date)
- item__doi (string)
- item__isbn (string)
- item__issn (string)
- item__eissn (string)
- item__proprietary_ids (array of strings) - proprietary identifiers from the COUNTER record
Report specific dimensions: For each dimension defined on the report type in CELUS, there is a column
named after that dimension’s short name in lowercase. These names usually match the COUNTER dimension
names, but there are some exceptions. The Platform dimension in
COUNTER is stored under platform_in_counter_data (to distinguish it from CELUS platform
attributes).
If the report has a Year of Publication (yop) dimension, an extra integer column
yop__number is present. It contains a numeric year parsed from the yop string (0 if
no year could be determined), which is convenient for sorting and numeric filters.
Example report specific dimensions for the COUNTER 5.1 Title Report (table TR51):
- data_type (string) - the data type of the title
- access_method (string) - the access method
- platform_in_counter_data (string) - the platform name as it is found in the COUNTER data (the Platform column in the COUNTER data)
- access_type (string) - the access type
- publisher (string) - the publisher of the title
- yop (string) - the year of publication
- yop__number (integer) - the numeric year parsed from the yop string (0 if no year could be determined)
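Putting these columns together, a typical query against the TR51 table might break usage down by data type and publication year, using yop__number for numeric sorting (a sketch; the metric name is one example from COUNTER):

```sql
-- Yearly breakdown of usage by data type for one metric.
-- yop__number allows numeric ordering; 0 means the year is unknown.
SELECT
    data_type,
    yop__number,
    sum(value) AS total_usage
FROM TR51
WHERE metric__short_name = 'Unique_Item_Requests'
GROUP BY data_type, yop__number
ORDER BY data_type, yop__number
```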
Merged views (COUNTER 5 + 5.1)¶
For selected master report types (TR, DR, PR, IR_M1), the database also contains SQL views named
TR_merged, DR_merged, PR_merged, and IR_M1_merged. Each view combines the COUNTER 5.1
table with the corresponding COUNTER 5.0 table for the same logical report, so you can query both
versions in one place. The views include an extra string column source_report with the name of
the underlying export table for each row (for example TR51 vs TR). You can still use the
dedicated TR / TR51 (etc.) tables if you need only one COUNTER version.
The merged views use the same logic as the reporting module in CELUS when it merges the COUNTER 5.1 and COUNTER 5.0 tables: when COUNTER 5.1 data is available for a given month, it is used; otherwise the COUNTER 5.0 data is used.
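The source_report column makes it easy to see which COUNTER version supplied each month's data. For example (a sketch against the TR_merged view):

```sql
-- Monthly totals from the merged view; source_report preserves
-- the name of the underlying table (TR vs TR51) for each row.
SELECT
    date,
    source_report,
    sum(value) AS total_usage
FROM TR_merged
GROUP BY date, source_report
ORDER BY date
```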
Tag tables¶
The database also contains tag tables for titles, organizations and platforms. These tables are used to store the tags for the titles, organizations and platforms and can be joined with the main report tables to enable filtering, grouping and other analytics based on tags.
The tables are:
- title_tags
- organization_tags
- platform_tags
The structure of the tag tables is as follows:
- {TARGET_ID} (integer) - internal CELUS id of the tagged target, where TARGET is one of (title, organization, platform)
- tag_id (integer) - internal CELUS id of the tag
- tag__name (string) - the name of the tag
- tag_class_id (integer) - internal CELUS id of the tag class that the tag belongs to
- tag_class__name (string) - the name of the tag class
- internal (boolean) - whether the tag class is internal, i.e. managed by CELUS and not by the user
- last_update (DateTime) - when this tag row was last synced to DataHub
The first column differs by table. In the title_tags table, it is title_id;
in the organization_tags table, it is organization_id;
in the platform_tags table, it is platform_id.
Example query:
SELECT
tag__name AS Topic,
SUM(value) AS Unique_Item_Requests
FROM TR
INNER JOIN title_tags AS tt ON tt.title_id = TR.title_id
WHERE (tt.tag_class__name = 'Scopus #2') AND (metric__short_name = 'Unique_Item_Requests')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
The above query uses the internal CELUS tag class Scopus #2 to assign topics to titles and then sums up the usage by topic. It returns the top 10 topics by usage, as measured by the Unique_Item_Requests metric.
The output may look like this:
┌─Topic────────────────────────────────────────┬─Unique_Item_Requests─┐
1. │ Medicine │ 4435267 │
2. │ Social Sciences │ 3788952 │
3. │ Biochemistry, Genetics and Molecular Biology │ 2205736 │
4. │ Arts and Humanities │ 1775827 │
5. │ Engineering │ 1543425 │
6. │ Environmental Science │ 1516750 │
7. │ Agricultural and Biological Sciences │ 1487613 │
8. │ Chemistry │ 1250674 │
9. │ Business, Management and Accounting │ 1236660 │
10.│ Psychology │ 1082768 │
└──────────────────────────────────────────────┴──────────────────────┘
Keeping an external copy in sync¶
The DataHub is designed to handle the load of real-time analytics and reporting, so there is no need to copy the DataHub report tables into your own database. However, if you need to do so for some reason (for backup purposes or local processing), the following text describes how to do it.
One simple option is to drop the DataHub tables and create fresh copies on a regular basis. Because the usage data usually changes only on a monthly basis, a monthly full refresh is a valid approach.
If you need to sync the data more frequently, or you want to save some bandwidth, you can use the following incremental approach to keep your local copy in sync with the DataHub.
Import batches¶
Every usage row belongs to exactly one import_batch_id. An import batch corresponds to one import
(harvest, manual upload, etc.) and one calendar month of data.
Import batches are nearly immutable, so the basic approach is to compare the set of import batch ids between your local copy and the DataHub and deal with the differences.
Import batches also contain a last_update timestamp, which should additionally be used to detect
changes to an import batch's content. Such changes typically occur when an older report is replaced by a newer one
and the derived interest data is updated; these updates affect only the "interest" report, not the underlying
usage data reports. Other report types may also change, for example when title metadata is updated.
Please note that although last_update is stored on every row, it should be used only as a
batch-level signal: the maximum of the last_update timestamp for each import batch determines
the last time the import batch was changed.
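This batch-level signal can be computed with a single aggregation per table (a sketch, using the TR table as an example):

```sql
-- Latest change per import batch; comparing these values against the
-- timestamps stored during the previous sync shows which batches changed.
SELECT
    import_batch_id,
    max(last_update) AS batch_last_update
FROM TR
GROUP BY import_batch_id
```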
The two steps below are the intended way to stay in sync.
Step 1: compare import_batch_id sets¶
- On each sync, compute the set of distinct import_batch_id values in DataHub for each report table you replicate.
- Compare that set to the set of batch ids you already have locally.
Then:
- Ids that appear in DataHub but not locally: insert all rows for those import_batch_id values from DataHub into your mirror.
- Ids that appear locally but no longer in DataHub: delete all rows with those import_batch_id values from your mirror.
Step 2: compare last_update timestamps¶
For batch ids present both locally and in DataHub, compare the maximum last_update per batch in DataHub with the value you stored during the previous sync. If it is newer, delete the local rows for that batch and re-insert them from DataHub.
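Fetching the rows for a set of batches is then a plain filter on import_batch_id (a sketch; the ids 101 and 102 are placeholders for ids found during the comparison):

```sql
-- Pull all rows belonging to the batches that need to be (re-)inserted
-- into the local mirror.
SELECT *
FROM TR
WHERE import_batch_id IN (101, 102)
```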
Merged views¶
Replicating merged views (TR_merged, etc.) is optional: they are views over the base report
tables. You can mirror only the base tables and recreate equivalent logic in your environment if
needed.
Tag tables¶
Tag tables (title_tags, organization_tags, platform_tags) do not carry
import_batch_id. Mirror them with a separate policy. They are typically much smaller than the
report tables, so a periodic full refresh would be the easiest way to keep them in sync.