feat: add source-agnostic ETL pipeline (convert2df) to standardize Scopus, Dimensions, PubMed & Lens for the dashboard#25

Open
RaphS57 wants to merge 4 commits into PRAISELab-PicusLab:main from RaphS57:main

RaphS57 commented Jun 23, 2026

No description provided.

Copilot AI review requested due to automatic review settings June 23, 2026 00:13

Copilot AI left a comment

Pull request overview

Adds a new, source-agnostic ETL “convert2df” pipeline intended to standardize raw exports from multiple bibliographic sources into a WoS-like schema so existing analytics can run on a consistent, strongly-typed DataFrame.

Changes:

  • Introduces www/services/standardizer.py implementing EXTRACT → TRANSFORM → LOAD, including type contracts and validation.
  • Updates the dashboard upload path (functions/get_data.py) to prefer convert2df() with a fallback to the legacy biblio_json route.
  • Adds reproducibility artifacts (etl_demo.py, EXECUTION_LOG.md, ETL_REPORT.md) documenting the ETL behavior and example runs.
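
The EXTRACT → TRANSFORM → LOAD flow described above can be sketched in a few lines. This is a minimal illustration, not the code in www/services/standardizer.py: the `FIELD_MAPS` table, the source column names, and the `standardize` helper are all assumptions made for the example, and it works on plain dict records so that it runs without pandas.

```python
import csv
import io

# Hypothetical mapping from source-specific CSV headers to WoS-style field tags.
# The header names are illustrative assumptions, not the PR's real tables.
FIELD_MAPS = {
    "scopus": {"Authors": "AU", "Title": "TI", "Year": "PY", "Cited by": "TC"},
    "dimensions": {"Authors": "AU", "Title": "TI", "PubYear": "PY", "Times cited": "TC"},
}
LIST_FIELDS = {"AU"}        # multi-value fields become lists
INT_FIELDS = {"PY", "TC"}   # numeric fields become integers


def extract(raw_text):
    """EXTRACT: parse a CSV export into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows, source):
    """TRANSFORM: rename columns to WoS tags and coerce each value's type."""
    mapping = FIELD_MAPS[source]
    records = []
    for row in rows:
        record = {}
        for src_col, tag in mapping.items():
            value = (row.get(src_col) or "").strip()
            if tag in LIST_FIELDS:
                record[tag] = [v.strip() for v in value.split(";") if v.strip()]
            elif tag in INT_FIELDS:
                record[tag] = int(value) if value.isdigit() else 0
            else:
                record[tag] = value
        records.append(record)
    return records


def load(records, source):
    """LOAD: attach a provenance label and hand the records to the caller."""
    for record in records:
        record["DB"] = source.upper()
    return records


def standardize(raw_text, source):
    """Run the three phases in order (hypothetical stand-in for convert2df)."""
    return load(transform(extract(raw_text), source), source)


sample = "Authors,Title,Year,Cited by\nDoe J.; Roe A.,An example,2020,7\n"
records = standardize(sample, "scopus")
print(records[0])
```

Keeping the per-source differences in a mapping table means that supporting another source is mostly a matter of adding one more entry, which is the usual argument for a source-agnostic design.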

Reviewed changes

Copilot reviewed 6 out of 70 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| www/services/standardizer.py | New ETL pipeline module (convert2df, contracts, validation, SR calculation). |
| www/services/__init__.py | Exposes the new standardizer API via from .standardizer import *. |
| functions/get_data.py | Routes single-file uploads through convert2df() first, with fallback. |
| EXECUTION_LOG.md | Documents validation results and compatibility matrix for standardized data. |
| ETL_REPORT.md | Detailed design/report describing the ETL approach and rationale. |
| etl_demo.py | Demo script to standardize bundled datasets and emit CSV outputs. |

Comment on lines +56 to +65
# Provenance label written to the DB column (used by downstream functions to
# check where the data comes from, e.g. SR() behaves differently for Scopus).
DB_LABELS = {
    "wos": "WEB_OF_SCIENCE",
    "scopus": "SCOPUS",
    "dimensions": "DIMENSIONS",
    "lens": "LENS",
    "pubmed": "PUBMED",
    "cochrane": "COCHRANE",
}
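
Since downstream functions branch on this label, a lookup that normalizes the key and fails loudly on an unknown source may be useful. The sketch below copies the dictionary from the diff; the `db_label` helper is hypothetical and is not part of the PR.

```python
# Copied from the diff above.
DB_LABELS = {
    "wos": "WEB_OF_SCIENCE",
    "scopus": "SCOPUS",
    "dimensions": "DIMENSIONS",
    "lens": "LENS",
    "pubmed": "PUBMED",
    "cochrane": "COCHRANE",
}


def db_label(source):
    """Return the provenance label for a source key (hypothetical helper).

    The key is stripped and lowercased first, so "Scopus" and " scopus "
    resolve to the same label. Unknown sources raise ValueError instead of
    silently producing a missing or misspelled label.
    """
    key = source.strip().lower()
    try:
        return DB_LABELS[key]
    except KeyError:
        known = ", ".join(sorted(DB_LABELS))
        raise ValueError(
            f"unknown source {source!r}; expected one of: {known}"
        ) from None


print(db_label(" Scopus "))
```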
Comment on lines +339 to +346
VALIDATION phase: programmatically verify the output contract.

Checks performed:
1. All mandatory columns exist.
2. No ``NaN`` / ``None`` value remains in any cell.
3. Multi-value columns are typed as ``list``.
4. Numeric columns (PY, TC) are integers.
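
The four checks listed in this docstring could look roughly like the sketch below. It is an illustration under stated assumptions: the column sets (`MANDATORY`, `LIST_COLUMNS`, `INT_COLUMNS`) are guesses apart from PY and TC, which the docstring names, and the function works on plain dict records rather than on the DataFrame the PR validates.

```python
import math

# Assumed column sets; only PY and TC are named in the docstring above.
MANDATORY = ("AU", "TI", "PY", "TC")
LIST_COLUMNS = ("AU",)
INT_COLUMNS = ("PY", "TC")


def is_null(value):
    """True for None and for float NaN, the two null forms the contract forbids."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def validate(records):
    """Return a list of human-readable contract violations (empty means valid)."""
    errors = []
    for i, rec in enumerate(records):
        # 1. All mandatory columns exist.
        for col in MANDATORY:
            if col not in rec:
                errors.append(f"row {i}: missing column {col}")
        # 2. No NaN / None value remains in any cell.
        for col, value in rec.items():
            if is_null(value):
                errors.append(f"row {i}: null value in {col}")
        # 3. Multi-value columns are typed as list.
        for col in LIST_COLUMNS:
            if col in rec and not isinstance(rec[col], list):
                errors.append(f"row {i}: {col} is not a list")
        # 4. Numeric columns are integers (bool is excluded on purpose).
        for col in INT_COLUMNS:
            if col in rec:
                value = rec[col]
                if isinstance(value, bool) or not isinstance(value, int):
                    errors.append(f"row {i}: {col} is not an integer")
    return errors


good = {"AU": ["Doe J."], "TI": "An example", "PY": 2020, "TC": 7}
bad = {"AU": "Doe J.", "TI": None, "PY": float("nan")}
print(validate([good]))
print(validate([bad]))
```

Collecting every violation instead of stopping at the first one makes a failed run easier to diagnose, at the cost of a slightly longer report.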

Comment thread functions/get_data.py
Comment on lines +49 to +58
try:
    standardized = convert2df(
        file[0]["datapath"], source, filename=type
    )
    df.set(standardized)
except Exception:
    # Fallback to the original logic for any source / extension
    # not yet covered by the ETL pipeline (e.g. .bib files).
    json = biblio_json(file[0]["datapath"], source, type, author)
    df.set(pd.read_json(StringIO(json)))
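
One possible refinement of this fallback, offered as a sketch rather than as what the PR does: catch only a dedicated "not covered" error and log it, so that genuine bugs in the new pipeline surface instead of being silently rerouted to the legacy path. `UnsupportedSourceError`, `load_with_fallback`, and the three demo loaders are hypothetical names introduced for the example.

```python
import logging

logger = logging.getLogger("upload")


class UnsupportedSourceError(ValueError):
    """Hypothetical error a pipeline could raise for inputs it does not cover."""


def load_with_fallback(path, primary, legacy):
    """Try the primary loader; fall back only when it reports 'unsupported'.

    Any other exception propagates, so real failures stay visible.
    """
    try:
        return primary(path)
    except UnsupportedSourceError as exc:
        logger.warning("primary loader skipped %s (%s); using legacy route", path, exc)
        return legacy(path)


# Demo loaders standing in for convert2df() and the legacy route.
def new_loader(path):
    if path.endswith(".bib"):
        raise UnsupportedSourceError(".bib files are not covered yet")
    return f"new:{path}"


def legacy_loader(path):
    return f"legacy:{path}"


def broken_loader(path):
    raise RuntimeError("bug inside the new pipeline")


print(load_with_fallback("refs.csv", new_loader, legacy_loader))
print(load_with_fallback("refs.bib", new_loader, legacy_loader))
```

With this shape, passing `broken_loader` as the primary raises `RuntimeError` to the caller rather than quietly producing legacy output.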