feat: add source-agnostic ETL pipeline (convert2df) to standardize Scopus, Dimensions, PubMed & Lens for the dashboard by RaphS57 · Pull Request #25 · PRAISELab-PicusLab/bibliometrix-python

RaphS57 · 2026-06-23T00:13:57Z

No description provided.

Copilot

Pull request overview

Adds a new, source-agnostic ETL “convert2df” pipeline intended to standardize raw exports from multiple bibliographic sources into a WoS-like schema so existing analytics can run on a consistent, strongly-typed DataFrame.

Changes:

Introduces www/services/standardizer.py implementing EXTRACT → TRANSFORM → LOAD, including type contracts and validation.
Updates the dashboard upload path (functions/get_data.py) to prefer convert2df() with a fallback to the legacy biblio_json route.
Adds reproducibility artifacts (etl_demo.py, EXECUTION_LOG.md, ETL_REPORT.md) documenting the ETL behavior and example runs.
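As a rough illustration of the EXTRACT → TRANSFORM → LOAD shape described above, here is a minimal sketch in pandas. It is not the code in www/services/standardizer.py: the column mapping `SCOPUS_TO_WOS`, the three helper functions, and the sample record are assumptions made up for the example, and only a handful of WoS-style tags (AU, TI, PY, TC, DB) are covered.

```python
import pandas as pd

# Hypothetical mapping from one source's export headers to WoS-style tags.
SCOPUS_TO_WOS = {"Authors": "AU", "Title": "TI", "Year": "PY", "Cited by": "TC"}


def extract(rows):
    """EXTRACT: load raw records into a DataFrame without interpreting them."""
    return pd.DataFrame(rows)


def transform(raw, mapping):
    """TRANSFORM: rename columns, split multi-value fields, coerce numerics."""
    df = raw.rename(columns=mapping)
    # Multi-value field: split on ';' and drop empty fragments.
    df["AU"] = df["AU"].fillna("").apply(
        lambda s: [a.strip() for a in s.split(";") if a.strip()]
    )
    # Numeric fields: unparseable or missing values become 0.
    for col in ("PY", "TC"):
        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
    df["TI"] = df["TI"].fillna("")
    return df


def load(df, db_label):
    """LOAD: attach the provenance label and return the standardized frame."""
    out = df.copy()
    out["DB"] = db_label
    return out


rows = [
    {"Authors": "Doe J.; Roe R.", "Title": "A study", "Year": "2020", "Cited by": None},
]
standardized = load(transform(extract(rows), SCOPUS_TO_WOS), "SCOPUS")
```

Keeping the three phases as separate functions means a new source only needs its own mapping and extractor, while the transform and load steps stay shared.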

Reviewed changes

Copilot reviewed 6 out of 70 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| www/services/standardizer.py | New ETL pipeline module (`convert2df`, contracts, validation, SR calculation). |
| www/services/__init__.py | Exposes the new standardizer API via `from .standardizer import *`. |
| functions/get_data.py | Routes single-file uploads through `convert2df()` first, with fallback. |
| EXECUTION_LOG.md | Documents validation results and compatibility matrix for standardized data. |
| ETL_REPORT.md | Detailed design/report describing the ETL approach and rationale. |
| etl_demo.py | Demo script to standardize bundled datasets and emit CSV outputs. |


+#Provenance label written to the DB column (used by downstream functions to
+#check where the data comes from, e.g. SR() behaves differently for Scopus).
+DB_LABELS = {
+    "wos": "WEB_OF_SCIENCE",
+    "scopus": "SCOPUS",
+    "dimensions": "DIMENSIONS",
+    "lens": "LENS",
+    "pubmed": "PUBMED",
+    "cochrane": "COCHRANE",
+}
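For illustration, a caller might resolve a user-supplied source name to its provenance label as in the sketch below. The `resolve_db_label` helper is hypothetical and does not appear in the PR; only the `DB_LABELS` mapping is copied from the snippet above.

```python
# Copied from the diff above.
DB_LABELS = {
    "wos": "WEB_OF_SCIENCE",
    "scopus": "SCOPUS",
    "dimensions": "DIMENSIONS",
    "lens": "LENS",
    "pubmed": "PUBMED",
    "cochrane": "COCHRANE",
}


def resolve_db_label(source):
    """Hypothetical helper: normalize a source name and look up its DB label."""
    key = source.strip().lower()
    try:
        return DB_LABELS[key]
    except KeyError:
        # Fail loudly with the list of supported sources instead of a bare KeyError.
        raise ValueError(
            f"unsupported source {source!r}; expected one of {sorted(DB_LABELS)}"
        ) from None
```

Normalizing case and whitespace before the lookup keeps the label written to the DB column consistent regardless of how the source name reaches the pipeline.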


+    VALIDATION phase: programmatically verify the output contract.
+
+    Checks performed:
+        1. All mandatory columns exist.
+        2. No ``NaN`` / ``None`` value remains in any cell.
+        3. Multi-value columns are typed as ``list``.
+        4. Numeric columns (PY, TC) are integers.
+
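The four checks listed in this docstring could be expressed along the following lines. This is a sketch under assumptions, not the PR's validator: the `validate` function and the column lists `MANDATORY`, `LIST_COLUMNS`, and `INT_COLUMNS` are illustrative, and the real contract presumably covers more columns.

```python
import pandas as pd

# Assumed subset of the output contract, for illustration only.
MANDATORY = ["AU", "TI", "PY", "TC", "DB"]
LIST_COLUMNS = ["AU"]
INT_COLUMNS = ["PY", "TC"]


def validate(df):
    """Return a list of contract violations; an empty list means the frame is valid."""
    errors = []

    # 1. All mandatory columns exist.
    missing = [c for c in MANDATORY if c not in df.columns]
    if missing:
        errors.append(f"missing columns: {missing}")

    # 2. No NaN / None remains. Only scalar cells are tested, so list cells
    #    never reach pd.isna (which would return an array for them).
    for col in df.columns:
        nulls = df[col].apply(
            lambda v: pd.api.types.is_scalar(v) and bool(pd.isna(v))
        )
        if nulls.any():
            errors.append(f"null values in {col}")

    # 3. Multi-value columns hold Python lists in every row.
    for col in LIST_COLUMNS:
        if col in df.columns and not df[col].apply(
            lambda v: isinstance(v, list)
        ).all():
            errors.append(f"{col} is not list-typed")

    # 4. Numeric columns have an integer dtype.
    for col in INT_COLUMNS:
        if col in df.columns and not pd.api.types.is_integer_dtype(df[col]):
            errors.append(f"{col} is not integer-typed")

    return errors
```

Returning every violation rather than raising on the first one makes it easier to report the full state of a failed upload in a single pass.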


+                try:
+                    standardized = convert2df(
+                        file[0]["datapath"], source, filename=type
+                    )
+                    df.set(standardized)
+                except Exception:
+                    #Fallback to the original logic for any source / extension
+                    #not yet covered by the ETL pipeline (e.g. .bib files).
+                    json = biblio_json(file[0]["datapath"], source, type, author)
+                    df.set(pd.read_json(StringIO(json)))
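The snippet above falls back to the legacy route on any `Exception`, which also hides genuine bugs in the new pipeline. One alternative, sketched here with hypothetical names, is to fall back only on a dedicated error type and log the reason. `UnsupportedSourceError`, `load_with_fallback`, and the two stub loaders are assumptions for the example; the PR does not show which exceptions `convert2df` raises.

```python
import logging

logger = logging.getLogger(__name__)


class UnsupportedSourceError(ValueError):
    """Hypothetical error a standardizer could raise for inputs it does not cover."""


def load_with_fallback(path, source, primary, fallback):
    """Try `primary`; on an expected 'not covered' failure, log it and use `fallback`."""
    try:
        return primary(path, source)
    except UnsupportedSourceError as exc:
        logger.warning("ETL pipeline skipped %s (%s); using legacy loader", path, exc)
        return fallback(path, source)


# Stub loaders standing in for convert2df and the legacy biblio_json route.
def stub_primary(path, source):
    if path.endswith(".bib"):
        raise UnsupportedSourceError("'.bib' files are not covered")
    return ("standardized", source)


def stub_legacy(path, source):
    return ("legacy", source)


csv_result = load_with_fallback("export.csv", "scopus", stub_primary, stub_legacy)
bib_result = load_with_fallback("export.bib", "scopus", stub_primary, stub_legacy)
```

With this shape, an unexpected error such as a `KeyError` inside the transform step propagates to the caller instead of silently switching to the legacy path. The sketch also avoids rebinding names such as `type` and `json`, which in the original snippet shadow a builtin and a standard-library module name.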


RaphS57 added 4 commits June 23, 2026 02:01

Add files via upload · 9c9bc88
Add files via upload · 8f1d3f9
Add files via upload · 107ac36
Add files via upload · 8b67878

Copilot AI review requested due to automatic review settings June 23, 2026 00:13

Copilot started reviewing on behalf of RaphS57 June 23, 2026 00:14

Copilot AI reviewed Jun 23, 2026

RaphS57 wants to merge 4 commits into PRAISELab-PicusLab:main from RaphS57:main


2 participants
