WIP: feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces. #550
Open
adam-sutton-1992 wants to merge 33 commits into
Hihi,
This is WIP so we'll top it off with a TODO here for now:
This is the trainable MLM transformer model attempting to do NER. It is a binary BIOES NER model, where each prediction is one of Beginning-Ent, Inside-Ent, Outside-Ent, End-Ent, or Single-Ent. This is a step up from BIO models: E signals the end of a multi-token label, and S signals a standalone single-token label. We try to prioritise B and E tokens here for performance (i.e. to ensure we get well-formed predictions). We also have a CRF head after the MLM model to try to encourage more well-formed label predictions (i.e. only I and E after B).
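To make the BIOES scheme and the "well-formed" constraint concrete, here is a minimal sketch in plain Python. It is not the plugin's code: the function names, the inclusive `(start, end)` span convention, and the transition table are assumptions for illustration, and they only approximate the kind of constraint a CRF head is meant to encourage.

```python
from typing import Dict, List, Set, Tuple


def spans_to_bioes(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert entity spans to BIOES tags.

    Spans are (start, end) token indices, both inclusive (an assumption
    made for this sketch), and are expected not to overlap.
    """
    tags = ["O"] * n_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "S"          # single-token entity
        else:
            tags[start] = "B"          # first token of a multi-token entity
            for i in range(start + 1, end):
                tags[i] = "I"          # interior tokens
            tags[end] = "E"            # last token of a multi-token entity
    return tags


# Which tag may follow which, for a single entity type.
ALLOWED_NEXT: Dict[str, Set[str]] = {
    "O": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"O", "B", "S"},
    "S": {"O", "B", "S"},
}


def is_well_formed(tags: List[str]) -> bool:
    """Check that a tag sequence opens and closes every entity correctly."""
    if not tags:
        return True
    if tags[0] in {"I", "E"} or tags[-1] in {"B", "I"}:
        return False
    return all(nxt in ALLOWED_NEXT[cur] for cur, nxt in zip(tags, tags[1:]))
```

For example, `spans_to_bioes(6, [(1, 3), (5, 5)])` yields `["O", "B", "I", "E", "O", "S"]`, which passes `is_well_formed`, whereas `["B", "O"]` does not because a B must be followed by I or E.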
`transformer_ner.py` is the main logic for the plugin, while `transformer_ner_model` is the logic for the model (such as initialisation, loading, and the forward step).

`Rawstring_tokenizer` is a tokenizer where all tokens are based on whitespace splits, i.e. new lines, tabs, and spaces. It still can't perfectly obtain all entities (e.g. sub-word entities), but it is an improvement in cases where some entities don't have a spaCy representation. This mainly improves performance in pipelines that use transformer_ner and embedding_linker.
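As a rough illustration of whitespace-only tokenization, the sketch below splits on any run of spaces, tabs, or new lines and keeps character offsets. It is a hypothetical stand-in, not the tokenizer in this PR; the function name and the `(text, start, end)` tuple format are assumptions.

```python
import re
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start_char, end_char) for each whitespace-delimited token."""
    # \S+ matches maximal runs of non-whitespace, so spaces, tabs and
    # new lines all act as separators and never appear inside a token.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


tokens = whitespace_tokenize("chest pain,\tno\nfever")
# "pain," keeps its trailing comma: an entity that is only part of a
# whitespace-delimited token cannot be matched exactly by this scheme.
```

This also shows the sub-word limitation mentioned above: punctuation stays attached to its neighbouring word, so an entity such as "pain" inside the token "pain," has no exact token boundary.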
There are also additional changes to the embedding_linker. I understand these should probably be separate; however, that's slipped through the cracks. Apologies. The changes are mainly more functionality and configurability:
Performance-wise, you can expect trained models with reasonable configs to perform like this (based on training / testing on Distemist & Snomed Entity Linking Benchmark):
There are a few additional caveats with these metrics. The embedding linker is highly configurable, so you can go from a recall of 0.84-ish and a precision of 0.4 to a recall of 0.9 and a precision of 0.05. The metrics I have here are essentially from configurations I think make "sense". My rule of thumb for that is: "if the recall goes up, and precision remains the same or improves, I'd consider that a solid improvement". I have documented these changes in performance within the config, so hopefully people can make informed decisions.
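The rule of thumb above can be written down as a small check. This is only a sketch of that criterion: the `Metrics` class, the function name, and the tolerance are hypothetical and not part of the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float


def is_solid_improvement(old: Metrics, new: Metrics, tol: float = 1e-9) -> bool:
    """True if recall goes up while precision stays the same or improves."""
    recall_up = new.recall > old.recall + tol
    precision_held = new.precision >= old.precision - tol
    return recall_up and precision_held
```

Under this check, moving from precision 0.4 / recall 0.84 to precision 0.05 / recall 0.9 would not count as a solid improvement, because the recall gain is paid for with a large precision drop.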