feature(extract-core): output page byte ranges #12

Merged
ClemDoum merged 1 commit into main from feature(extract-python)/page-by-ranges
Jun 25, 2026

@ClemDoum ClemDoum commented Jun 24, 2026

Contributor

Description

Output markdown page byte ranges, as a preliminary step toward supporting ICIJ/datashare#2229 in datashare-python's extract-worker.

Changes

extract-core

Changed

  • introduced Pages(total: int, bytes_ranges: list[tuple[int, int]]) and replaced ConversionOutput.pages: PageIndexes = [] with ConversionOutput.pages: Pages
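
For readers unfamiliar with the idea, here is a minimal sketch of how a type shaped like `Pages` could be used to slice a serialized document. Only the two field names and their types come from this PR; the dataclass base, the `page_slice` helper, and the half-open `[start, end)` range convention are assumptions made for illustration and may not match the merged code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pages:
    """Hypothetical stand-in mirroring the fields listed in the PR."""

    total: int
    bytes_ranges: list[tuple[int, int]]


def page_slice(document: bytes, pages: Pages, index: int) -> bytes:
    """Return the raw bytes of one page (hypothetical helper).

    Assumes each range is half-open: start inclusive, end exclusive.
    """
    start, end = pages.bytes_ranges[index]
    return document[start:end]


# Build a two-page document; offsets are counted in bytes, not characters,
# which matters as soon as the text contains non-ASCII characters.
first = "# Page 1\nhéllo\n".encode("utf-8")
second = "# Page 2\nworld\n".encode("utf-8")
document = first + second

pages = Pages(
    total=2,
    bytes_ranges=[(0, len(first)), (len(first), len(first) + len(second))],
)

second_page_text = page_slice(document, pages, 1).decode("utf-8")
```

Storing byte offsets rather than character offsets lets a consumer seek directly into the serialized file without decoding the pages that precede the one it needs.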

extract-python

Added

  • added the utils.write_pages helper to serialize markdown documents and yield page byte ranges
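
The general technique of serializing pages while yielding their byte ranges can be sketched as follows. This is not the actual `utils.write_pages` implementation, whose signature is not shown in this conversation; the function name `write_pages_sketch`, its parameters, and the half-open range convention are all assumptions.

```python
import io
from collections.abc import Iterable, Iterator
from typing import BinaryIO


def write_pages_sketch(
    pages_md: Iterable[str], out: BinaryIO
) -> Iterator[tuple[int, int]]:
    """Write each markdown page to `out` and yield its (start, end) byte range.

    Hypothetical illustration: ranges are half-open and measured in
    UTF-8 bytes relative to the first byte written by this function.
    """
    offset = 0
    for page in pages_md:
        data = page.encode("utf-8")
        out.write(data)
        yield offset, offset + len(data)
        offset += len(data)


buffer = io.BytesIO()
# The generator is lazy, so it must be consumed for the writes to happen.
ranges = list(write_pages_sketch(["# One\ncafé\n", "# Two\n"], buffer))

start, end = ranges[1]
second_page = buffer.getvalue()[start:end]
```

Because the helper is written as a generator, nothing reaches the output stream until the caller iterates over it, which keeps memory use proportional to a single page.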

@ClemDoum ClemDoum self-assigned this Jun 24, 2026
@ClemDoum ClemDoum force-pushed the feature(extract-python)/page-by-ranges branch 2 times, most recently from 8d5032d to b531afa Compare June 24, 2026 13:38
@ClemDoum ClemDoum marked this pull request as ready for review June 24, 2026 13:38
@ClemDoum ClemDoum requested a review from pirhoo June 24, 2026 13:39

@pirhoo pirhoo left a comment

Member

Thanks for the swift implem! I've made a few suggestions.

Comment thread extract-python/benches/compare.py Outdated
Comment thread extract-python/benches/compare.py Outdated
Comment thread extract-python/extract_python/docling_.py Outdated
Comment thread extract-python/extract_python/utils.py Outdated
@ClemDoum ClemDoum force-pushed the feature(extract-python)/page-by-ranges branch from b531afa to 0163035 Compare June 25, 2026 09:10
@ClemDoum ClemDoum requested a review from pirhoo June 25, 2026 09:10
@ClemDoum ClemDoum merged commit 297ee68 into main Jun 25, 2026
5 checks passed

2 participants