WIP: feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces. by adam-sutton-1992 · Pull Request #550 · CogStack/cogstack-nlp

adam-sutton-1992 · 2026-06-16T18:34:29Z

Hihi,

This is WIP so we'll top it off with a TODO here for now:

Unit Testing for transformer_NER - DONE!
Unit Testing for rawstring_tokenizer - DONE!
Testing all variants of components that are viable (aka all except RawStringTokenizer + ContextBasedLinking)
Parameter testing (mostly done)
README for transformer_ner
README for rawstring_tokenizer
Whatever mypy and linting fails I missed :D

This is the trainable MLM transformer model attempting to do NER. It is a binary BIOES NER model where each prediction is one of (Beginning-Ent, Inside-Ent, Outside-Ent, End-Ent, Single-Ent). This is a bit of an advancement over BIO models: E signals the end of a multi-token label, and S signals a standalone token label. We try to prioritise B and E tokens here for performance (i.e. to ensure we get well-formed predictions). We also have a CRF head after the MLM model to try to encourage more well-formed label predictions (i.e. only I and E after B).
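To make the tagging scheme concrete, here is a minimal sketch of BIOES labelling and of the "well-formed" constraint a CRF head would be encouraged to learn. The function names and the transition table are illustrative assumptions, not code from this PR.

```python
from typing import Dict, List, Set, Tuple

# Assumed transition table for a single entity type: which tags may follow which.
ALLOWED_NEXT: Dict[str, Set[str]] = {
    "B": {"I", "E"},       # an open entity must continue or end
    "I": {"I", "E"},
    "E": {"B", "S", "O"},  # after a closed entity, anything that starts fresh
    "S": {"B", "S", "O"},
    "O": {"B", "S", "O"},
}


def spans_to_bioes(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert token spans (start inclusive, end exclusive) to BIOES tags."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"  # single-token entity
        else:
            tags[start] = "B"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
            tags[end - 1] = "E"
    return tags


def is_well_formed(tags: List[str]) -> bool:
    """Check that a tag sequence only uses allowed BIOES transitions."""
    if not tags:
        return True
    # A sequence cannot start inside an entity or finish with one still open.
    if tags[0] in {"I", "E"} or tags[-1] in {"B", "I"}:
        return False
    return all(nxt in ALLOWED_NEXT[cur] for cur, nxt in zip(tags, tags[1:]))
```

For example, `spans_to_bioes(6, [(1, 4), (5, 6)])` labels a three-token entity and a single-token entity, while `is_well_formed(["B", "O"])` rejects a sequence where an entity is opened but never closed.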

transformer_ner.py is the main logic for the plugin, while transformer_ner_model is the logic for the model (such as initialisation, loading, and the forward step).

Rawstring_tokenizer is a tokenizer where all tokens are based on whitespace splits, i.e. new lines, tabs, and spaces. It still can't perfectly obtain all entities (e.g. sub-word entities), but it is an improvement where some entities don't have a spaCy representation. This mainly improves performance in pipelines that use transformer_ner and embedding_linker.
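As a rough illustration of the whitespace-splitting idea (not the actual rawstring_tokenizer implementation), the sketch below splits on any run of whitespace and keeps character offsets so tokens can be mapped back to the raw text. The function name `whitespace_tokenize` is hypothetical.

```python
import re
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start, end) for every maximal run of non-whitespace."""
    # \S+ matches anything between spaces, tabs, and new lines.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
```

Note that punctuation stays attached to its token and hyphenated words are not split, which is why sub-word entities remain out of reach with this approach.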

There are also additional changes to the embedding_linker. I understand these should probably be separate, however that's slipped through the cracks. Apologies. The changes are mainly more functionality and configurability:

Multiple entities per span
Short and long similarity thresholds, along with top_k entities being passed
Appending the pre_inference link candidates to those from vocab based methods (which are both a part of transformer_ner and vocab_ner).
Additional documentation in the code for all of this step by step.
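A minimal sketch of how threshold and top_k filtering of link candidates could work. This assumes that "short" and "long" refer to the character length of the mention, which the description does not confirm. All names and default values here are hypothetical and are not the embedding_linker config keys.

```python
from typing import List, Tuple


def select_candidates(
    mention: str,
    scored: List[Tuple[str, float]],  # (cui, similarity) pairs
    short_threshold: float = 0.9,     # assumed: stricter for short mentions
    long_threshold: float = 0.75,     # assumed: looser for long mentions
    short_max_chars: int = 4,         # assumed cut-off between short and long
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Keep up to top_k candidates whose similarity clears the threshold."""
    threshold = short_threshold if len(mention) <= short_max_chars else long_threshold
    kept = [(cui, score) for cui, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

Returning a list rather than a single best match is what allows multiple entities per span to be passed downstream.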

Performance-wise, you can expect trained models with reasonable configs to look like this (based on training / testing on Distemist & Snomed Entity Linking Benchmark):

Spacy Tokenizer + Vocab based NER + Context Based Linker: Recall 0.7
Spacy Tokenizer + Vocab based NER + Embedding Linking: Recall 0.75
Spacy Tokenizer + Transformer based NER + Context Based Linking: Recall 0.73
Rawstring Tokenizer + Transformer based NER + Embedding Linking: Recall 0.85

There are a few additional caveats with these metrics. The embedding linker is highly configurable, so you can go from a recall of 0.84-ish and a precision of 0.4 to a recall of 0.9 and a precision of 0.05. The metrics I have here are essentially from configurations I think make "sense". One such measure is: "if the recall goes up, and precision remains the same or improves, I'd consider that a solid improvement". I have documented these changes in performance within the config, so hopefully people can make informed decisions.
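The rule of thumb above can be written down as a small check. This is only a sketch: the annotation format `(start, end, cui)` and both function names are assumptions for illustration, not the evaluation code used for the numbers in this PR.

```python
from typing import Set, Tuple

Annotation = Tuple[int, int, str]  # assumed format: (start, end, cui)


def precision_recall(
    predicted: Set[Annotation], gold: Set[Annotation]
) -> Tuple[float, float]:
    """Exact-match precision and recall over sets of annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def is_solid_improvement(
    old: Tuple[float, float], new: Tuple[float, float]
) -> bool:
    """True when recall goes up and precision stays the same or improves."""
    (old_p, old_r), (new_p, new_r) = old, new
    return new_r > old_r and new_p >= old_p
```

Under this rule, moving from (precision 0.4, recall 0.84) to (precision 0.05, recall 0.9) would not count as a solid improvement, since precision drops.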


adam-sutton-1992 and others added 30 commits April 24, 2026 12:39

fixing the comp name requirement

75f6ff7

initial commit for transformer_ner

a2075be

changed cui embedding method and fixed mention_mask generation

3664576

fixed spacing

65ea764

Merge branch 'embedding_cui_longest_name' into feat(TransformerNER)_t…

07c85be

…rainable_bert_ner

CU-869d9n2rg: Avoid running pipe twice for supervised training

4e3feb9

CU-869d9n2rg: Fix typo

eca18ea

CU-869d9n2rg: Use separate callers for tokenizer and pipe in trainer

08267ab

CU-869d9n2rg: Use separate callers for tokenizer and pipe in trainer …

01c7898

…creation

CU-869d9n2rg: Fix training time instantiation

872a98a

CU-869d9n2rg: Set entity name when preparing for supervised training

69406bf

CU-869d9n2rg: Fix issue with trainable component counting

e83d8b4

CU-869d9n2rg: Add cui to entity for supervised training

c67fb32

CU-869d9n2rg: Using properly processed name when prepping for supervi…

68ee430

…sed training

CU-869d9n2rg: Simplify trainer class - use pipe for different callers

ac94a02

CU-869d9n2rg: Simplify trainer class in init

ad07225

CU-869d9n2rg: Fix trainer utils tests

2f30634

CU-869d9n2rg: Update trainer tests with new trainer object

7c4a071

CU-869d9n2rg: Update trainer utils tests with new trainer object

2afc942

CU-869d9n2rg: Remove unused import

86a155e

Merge branch 'pr-467' into feat(TransformerNER)_trainable_bert_ner

e402458

changes to embedding linker and transformer ner, along with new token…

d90755c

…izer

Merge remote-tracking branch 'origin/main' into feat(TransformerNER)_…

a4251c7

…trainable_bert_ner

progress on linking and ner models

ad658b0

merge with main

99793b7

embedding linker fixes / adaptations and config updates

1c9f0b2

mypy fixes that are never enough

220a734

linting!

83246a9

remove debugging print statements

d3eebe4

fixing remote mypy issues and errors thrown in running

e759245

adam-sutton-1992 added 3 commits June 16, 2026 22:17

fixed tests!

6383db7

testing and related fixes due to tests

1cf6b6a

re-added deid transformer-ner type

6650fd4

adam-sutton-1992 wants to merge 33 commits into main from feat(TransformerNER)_trainable_bert_ner
