Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were extracted from earlier published work. The data can be used to fine-tune a pre-trained large language model using transfer learning, to create a model that can be used in inference mode to create the labels automatically, thereby producing structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL (Relation) format, which is the export format of the doccano open source text annotation software used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form. These latter documents had to undergo OCR, which resulted in lower quality text and lower quality training data. The majority of the labelled data is from the higher quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold standard labelled dataset, and it is of insufficient volume to build a performant model. The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
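As an illustration of how the JSONL (Relation) records might be consumed, the sketch below turns each line into (head, relation, tail) triples. It assumes the usual doccano relation export layout, with a `text` field, an `entities` list (`id`, `label`, `start_offset`, `end_offset`) and a `relations` list (`from_id`, `to_id`, `type`); check these field names against the actual files before relying on them. The sample sentence, the label name `Formation` and the helper name `read_triples` are invented for this example and are not taken from the dataset.

```python
import json

# Illustrative record only: the sentence and labels are invented, and the
# field names follow the ASSUMED doccano JSONL (Relation) export layout.
SAMPLE_LINE = json.dumps({
    "id": 1,
    "text": "The Mercia Mudstone overlies the Sherwood Sandstone.",
    "entities": [
        {"id": 10, "label": "Formation", "start_offset": 4, "end_offset": 19},
        {"id": 11, "label": "Formation", "start_offset": 33, "end_offset": 51},
    ],
    "relations": [
        {"id": 100, "from_id": 10, "to_id": 11, "type": "overlies"},
    ],
})


def read_triples(lines):
    """Return ((head_text, head_label), relation_type, (tail_text, tail_label))
    for every relation found in an iterable of JSONL lines."""
    triples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        text = record["text"]
        # Map each entity id to its surface text and label.
        spans = {
            ent["id"]: (text[ent["start_offset"]:ent["end_offset"]], ent["label"])
            for ent in record.get("entities", [])
        }
        for rel in record.get("relations", []):
            triples.append((spans[rel["from_id"]], rel["type"], spans[rel["to_id"]]))
    return triples


if __name__ == "__main__":
    for head, rel_type, tail in read_triples([SAMPLE_LINE]):
        print(head, rel_type, tail)
```

To read a real file, pass an open file handle instead of the sample list, since iterating a text file yields one line at a time.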
- Date (Creation)
- 2024-02-15
- Citation identifier
- Point of contact
Organisation name: British Geological Survey; Role: Distributor
Organisation name: British Geological Survey; Role: Originator
Organisation name: British Geological Survey; Role: Point of contact
Organisation name: British Geological Survey; Role: Principal investigator
- Maintenance and update frequency
- Not planned
GEMET - INSPIRE themes, version 1.0
BGS Thesaurus of Geosciences
NGDC Deposited Data
Physical properties
Mathematical programming
- (non-INSPIRE)
Citable Data
Stratigraphic unit
- dataCentre
- Keywords
- Access constraints
- Other restrictions
- Other constraints
- licenceOGL
- Use constraints
- Other restrictions
- Other constraints
The copyright of materials derived from the British Geological Survey's work is vested in the Natural Environment Research Council [NERC]. No part of this work may be reproduced or transmitted in any form or by any means, or stored in a retrieval system of any nature, without the prior permission of the copyright holder, via the BGS Intellectual Property Rights Manager. Use by customers of information provided by the BGS is at the customer's own risk. In view of the disparate sources of information at BGS's disposal, including such material donated to BGS, that BGS accepts in good faith as being accurate, the Natural Environment Research Council (NERC) gives no warranty, expressed or implied, as to the quality or accuracy of the information supplied, or to the information's suitability for any use. NERC/BGS accepts no liability whatever in respect of loss, damage, injury or other occurrence however caused.
- Other constraints
Available under the Open Government Licence subject to the following acknowledgement accompanying the reproduced NERC materials "Contains NERC materials ©NERC [year]"
- Language
- English
- Topic category
- Geoscientific information
- Begin date
- 2023-11-01
- End date
- 2024-02-15
Reference System Information
- Distribution format
Name: jsonlines
Name: doccano JSONL (Relation)
Name: PURE relations
- Distributor contact
Organisation name: British Geological Survey
- OnLine resource
Name: BGS github repository
- OnLine resource
Name: Data
- OnLine resource
Name: Digital Object Identifier (DOI)
- Hierarchy level
- Non geographic dataset
- Other
non geographic dataset
Conformance result
- Title
INSPIRE Implementing rules laying down technical arrangements for the interoperability and harmonisation of Geology
- Date (Publication)
- 2011
- Explanation
See the referenced specification
- Pass
- No
Conformance result
- Title
Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services
- Date (Publication)
- 2010-12-08
- Explanation
- Pass
- No
- Statement
Data was sourced from three corpora already available under open licences. BGS memoirs/technical reports sentences were sourced from previously labelled data. This was converted to a new data format and additional labels were manually added to a subset of the data using the doccano open source text annotation software. A small sample of sentences was taken from selected DECC/OGA onshore hydrocarbons well reports and from selected Mineral Reconnaissance Programme reports. These reports were processed by 1. converting to machine readable text, 2. splitting into pages and sentences/paragraphs, 3. converting to doccano import JSONlines format, 5. manually annotating to label a range of geological concepts, 6. manually labelling to add relation labels to indicate how those concepts relate to each other, 7. exporting in doccano JSONL (Relations) format and also converting to format required by
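Steps 2 and 3 of the statement above (splitting into sentences and converting to doccano import JSONlines) could look roughly like the following. This is a minimal sketch, not the project's actual text processing scripts: the regular-expression sentence splitter is a deliberately naive stand-in, the function names `split_sentences` and `to_import_jsonl` are hypothetical, and the record layout (`text` with empty `entities` and `relations` lists) is an assumption about what doccano expects on import.

```python
import json
import re

# Naive boundary: sentence-ending punctuation, whitespace, then a capital.
# This is an illustrative stand-in, not the splitter used for the dataset.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(page_text):
    """Collapse line breaks left by text extraction, then split on the
    naive sentence boundary defined above."""
    flat = " ".join(page_text.split())
    return [s for s in _SENTENCE_END.split(flat) if s]


def to_import_jsonl(pages):
    """Build one JSON object per sentence, one per line, with empty label
    lists ready for manual annotation (ASSUMED import layout)."""
    lines = []
    for page_text in pages:
        for sentence in split_sentences(page_text):
            record = {"text": sentence, "entities": [], "relations": []}
            lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    pages = [
        "The shale is grey. It overlies\nthe limestone.",
        "Coal seams are thin.",
    ]
    print(to_import_jsonl(pages))
```

A splitter this simple will break on abbreviations followed by a capital letter (for example "St. Bees"), and OCR noise in the scanned reports would make boundaries less reliable still, so a dedicated sentence tokenizer is likely a better choice in practice.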
- File identifier
- 15ac4ca9-3be0-119e-e063-0937940a8990 XML
- Metadata language
- English
- Hierarchy level
- Non geographic dataset
- Hierarchy level name
non geographic dataset
- Date stamp
- 2025-03-06
- Metadata standard name
- Metadata standard version
- Metadata author
Organisation name: British Geological Survey; Role: Point of contact
- Dataset URI