WIP: feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces. #550

Open: adam-sutton-1992 wants to merge 33 commits into main from feat(TransformerNER)_trainable_bert_ner

adam-sutton-1992 commented Jun 16, 2026

Contributor

Hihi,

This is WIP so we'll top it off with a TODO here for now:

  • Unit testing for transformer_ner - DONE!
  • Unit testing for rawstring_tokenizer - DONE!
  • Testing all viable combinations of components (i.e. all except RawStringTokenizer + ContextBasedLinking)
  • Parameter testing (mostly done)
  • README for transformer_ner
  • README for rawstring_tokenizer
  • Whatever mypy and linting fails I missed :D

This is the trainable MLM transformer model attempting to do NER. It is a binary BIOES NER model: each token is predicted as one of Beginning-Ent, Inside-Ent, Outside-Ent, End-Ent, or Single-Ent. This is a step up from BIO models, because E signals the end of a multi-token label and S signals a standalone single-token label. We try to prioritise B and E tokens here for performance (i.e. to ensure we get well-formed predictions). We also have a CRF head after the MLM model to encourage well-formed label sequences (i.e. only I or E after B).
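As a rough illustration of the tagging scheme (not the code in this PR), here is a minimal sketch of how token spans map to untyped BIOES tags and what "well formed" means for a tag sequence. The function names and the `ALLOWED_NEXT` table are hypothetical and only there for illustration; the actual CRF head learns transition scores rather than applying hard rules like these.

```python
from typing import Dict, List, Set, Tuple


def spans_to_bioes(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert token-index spans (start, end inclusive) into untyped BIOES tags."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "S"  # single-token entity
        else:
            tags[start] = "B"  # first token of a multi-token entity
            for i in range(start + 1, end):
                tags[i] = "I"  # tokens strictly inside the entity
            tags[end] = "E"  # last token of a multi-token entity
    return tags


# Hypothetical hard-constraint table: which tag may follow which.
ALLOWED_NEXT: Dict[str, Set[str]] = {
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"B", "O", "S"},
    "S": {"B", "O", "S"},
    "O": {"B", "O", "S"},
}


def is_well_formed(tags: List[str]) -> bool:
    """Check that a tag sequence only contains valid BIOES transitions."""
    if not tags:
        return True
    # A sequence cannot open inside an entity or end with an unclosed one.
    if tags[0] in {"I", "E"} or tags[-1] in {"B", "I"}:
        return False
    return all(nxt in ALLOWED_NEXT[cur] for cur, nxt in zip(tags, tags[1:]))
```

For example, a six-token sentence with an entity over tokens 1 to 3 and another at token 5 gives `O B I E O S`, while a sequence such as `B O` is rejected because the entity opened by B is never closed.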

transformer_ner.py holds the main logic for the plugin, while transformer_ner_model holds the logic for the model (such as initialisation, loading, and the forward step).

Rawstring_tokenizer is a tokenizer where all tokens are based on whitespace splits, i.e. new lines, tabs, and spaces. It still can't perfectly capture all entities (for example sub-word entities), but it is an improvement where some entities don't have a spaCy token representation. This mainly improves performance in pipelines that use transformer_ner and embedding_linker.
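To make the idea concrete, here is a minimal sketch of whitespace-only tokenization that keeps character offsets. It is not the implementation in this PR: `whitespace_tokenize` is a hypothetical name, and the regex `\S+` splits on any Unicode whitespace, which may be slightly broader than what the plugin does.

```python
import re
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start_char, end_char) for every run of non-whitespace."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


tokens = whitespace_tokenize("chest pain,\tno fever\n")
# The second token is "pain," with the comma attached, which shows the
# sub-word limitation: an entity "pain" is only part of that token.
```

Keeping the offsets means every token maps straight back onto the raw string, so downstream components can work on the original text.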

There are also additional changes to the embedding_linker. I understand these should probably be in a separate PR, but that slipped through the cracks. Apologies. The changes mainly add functionality and configurability:

  1. Multiple entities per span
  2. Short and long similarity thresholds, along with top_k entities being passed
  3. Appending the pre_inference link candidates to those from vocab based methods (which are part of both transformer_ner and vocab_ner).
  4. Additional documentation in the code for all of this step by step.
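A sketch of how thresholds and top_k could interact when picking several candidates per span. This assumes that "short" and "long" refer to the length of the span text, which the description above does not confirm. Every name and default value here (`select_candidates`, `short_max_chars`, the thresholds) is hypothetical and not taken from the embedding_linker config.

```python
from typing import Dict, List, Tuple


def select_candidates(
    similarities: Dict[str, float],
    span_text: str,
    short_threshold: float = 0.9,  # assumed: stricter cut-off for short spans
    long_threshold: float = 0.7,  # assumed: looser cut-off for long spans
    short_max_chars: int = 4,  # assumed boundary between short and long
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Keep up to top_k concept ids whose similarity clears the threshold."""
    is_short = len(span_text) <= short_max_chars
    threshold = short_threshold if is_short else long_threshold
    kept = [(cui, score) for cui, score in similarities.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return kept[:top_k]


sims = {"C1": 0.95, "C2": 0.8, "C3": 0.72, "C4": 0.5}
long_span = select_candidates(sims, "myocardial infarction", top_k=2)
short_span = select_candidates(sims, "MI")
```

Under these assumptions the long span keeps its two best candidates, while the short span "MI" keeps only the one candidate above the stricter threshold.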

Performance-wise, you can expect trained models with reasonable configs to look like this (based on training and testing on DisTEMIST and the Snomed Entity Linking Benchmark):

  1. spaCy Tokenizer + Vocab based NER + Context Based Linking: Recall 0.7
  2. spaCy Tokenizer + Vocab based NER + Embedding Linking: Recall 0.75
  3. spaCy Tokenizer + Transformer based NER + Context Based Linking: Recall 0.73
  4. Rawstring Tokenizer + Transformer based NER + Embedding Linking: Recall 0.85

There are a few caveats to these metrics. The embedding linker is highly configurable, so you can go from a recall of about 0.84 with a precision of 0.4 to a recall of 0.9 with a precision of 0.05. The metrics I have here are essentially for configurations I think make "sense". One rule of thumb I use: if recall goes up and precision stays the same or improves, I'd consider that a solid improvement. I have documented these changes in performance within the config, so hopefully people can make informed decisions.
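For reference, a small sketch of the precision/recall trade-off and the rule of thumb above. It assumes exact-match scoring on (start, end, concept id) triples, which may not be how the benchmarks were actually scored, and the helper names are hypothetical.

```python
from typing import Set, Tuple

Annotation = Tuple[int, int, str]  # (start_char, end_char, concept id)
Metrics = Tuple[float, float]  # (precision, recall)


def precision_recall(predicted: Set[Annotation], gold: Set[Annotation]) -> Metrics:
    """Exact-match precision and recall over annotation triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def is_solid_improvement(old: Metrics, new: Metrics) -> bool:
    """Rule of thumb: recall goes up while precision stays the same or improves."""
    (old_p, old_r), (new_p, new_r) = old, new
    return new_r > old_r and new_p >= old_p


# Toy data: predicting more candidates finds everything but halves precision.
gold = {(0, 5, "A"), (6, 10, "B")}
predicted = {(0, 5, "A"), (6, 10, "B"), (11, 15, "C"), (16, 20, "D")}
metrics = precision_recall(predicted, gold)
```

With this rule, moving from precision 0.4 and recall 0.84 to precision 0.05 and recall 0.9 does not count as a solid improvement, because precision drops.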

adam-sutton-1992 and others added 30 commits April 24, 2026 12:39