Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition (NER) model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with relations between them such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were extracted from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488). The data can be used to fine-tune a pre-trained large language model using transfer learning, producing a model that can be run in inference mode to create the labels automatically and so generate structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL (Relation) format, which is the export format of doccano, the open source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published as PDF images. These documents had to undergo OCR, which resulted in lower quality text and lower quality training data. The majority of the labelled data is from the higher quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold standard labelled dataset, and it is of insufficient volume to build a performant model.
The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
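As an illustration of how the JSONL (Relation) export described above might be read, the following is a minimal Python sketch. It assumes the layout commonly produced by doccano for relation projects: one JSON object per line with a `text` field, an `entities` list (`id`, `label`, `start_offset`, `end_offset`) and a `relations` list (`id`, `from_id`, `to_id`, `type`). These field names are assumptions and should be checked against the actual files. The sample sentence, the `FORMATION` label and the function name `extract_triples` are hypothetical and are not taken from the dataset.

```python
import json

# Hypothetical record in the ASSUMED doccano JSONL (Relation) layout.
# The sentence and the "FORMATION" label are illustrative only.
SAMPLE_LINE = json.dumps({
    "id": 1,
    "text": "The Sherwood Sandstone overlies the Coal Measures.",
    "entities": [
        {"id": 10, "label": "FORMATION", "start_offset": 4, "end_offset": 22},
        {"id": 11, "label": "FORMATION", "start_offset": 36, "end_offset": 49},
    ],
    "relations": [
        {"id": 5, "from_id": 10, "to_id": 11, "type": "overlies"},
    ],
})


def extract_triples(jsonl_lines):
    """Return (head text, relation type, tail text) triples from JSONL lines."""
    triples = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        text = record["text"]
        # Map each entity id to the substring it covers in the sentence.
        spans = {
            ent["id"]: text[ent["start_offset"]:ent["end_offset"]]
            for ent in record.get("entities", [])
        }
        for rel in record.get("relations", []):
            head = spans.get(rel["from_id"])
            tail = spans.get(rel["to_id"])
            if head is None or tail is None:
                continue  # skip relations that point at a missing entity
            triples.append((head, rel["type"], tail))
    return triples


triples = extract_triples([SAMPLE_LINE])
print(triples)
```

To apply the same function to a downloaded file, the lines of an opened file object can be passed in place of the sample list.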
- Date (Creation)
- 2024-02-15
- Citation identifier
- http://data.bgs.ac.uk/id/dataHolding/13608217
- Point of contact
-
Organisation name | Individual name | Electronic mail address | Role
British Geological Survey | Enquiries | | Distributor
British Geological Survey | Enquiries | | Originator
British Geological Survey | Enquiries | | Point of contact
British Geological Survey | Enquiries | | Principal investigator
- Maintenance and update frequency
- Not planned
-
GEMET - INSPIRE themes, version 1.0
-
BGS Thesaurus of Geosciences
-
-
NGDC Deposited Data
-
Physical properties
-
Mathematical programming
-
data.gov.uk (non-INSPIRE)
-
Citable Data
-
Stratigraphic unit
-
- dataCentre
- Keywords
-
-
NERC_DDC
-
- Access constraints
- Other restrictions
- Other constraints
- licenceOGL
- Use constraints
- Other restrictions
- Other constraints
-
The copyright of materials derived from the British Geological Survey's work is vested in the Natural Environment Research Council [NERC]. No part of this work may be reproduced or transmitted in any form or by any means, or stored in a retrieval system of any nature, without the prior permission of the copyright holder, via the BGS Intellectual Property Rights Manager. Use by customers of information provided by the BGS is at the customer's own risk. In view of the disparate sources of information at BGS's disposal, including such material donated to BGS, that BGS accepts in good faith as being accurate, the Natural Environment Research Council (NERC) gives no warranty, expressed or implied, as to the quality or accuracy of the information supplied, or to the information's suitability for any use. NERC/BGS accepts no liability whatever in respect of loss, damage, injury or other occurrence however caused.
- Other constraints
-
Available under the Open Government Licence subject to the following acknowledgement accompanying the reproduced NERC materials "Contains NERC materials ©NERC [year]"
- Language
- English
- Topic category
-
- Geoscientific information
- Begin date
- 2023-11-01
- End date
- 2024-02-15
Reference System Information
- Distribution format
-
Name | Version
jsonlines | doccano JSONL(Relation)
jsonlines | PURE relations
- Distributor contact
-
Organisation name | Individual name | Electronic mail address | Role
British Geological Survey | Enquiries | | Distributor
- OnLine resource
-
Protocol | Linkage | Name
 | https://github.com/BritishGeologicalSurvey/princeton-nlp-relation-extraction | BGS github repository
- OnLine resource
-
Protocol | Linkage | Name
 | https://webapps.bgs.ac.uk/services/ngdc/accessions/index.html#item186633 | Data
- OnLine resource
-
Protocol | Linkage | Name
 | https://doi.org/10.5285/afba2d1d-8a5d-4b96-a6fa-c13b5d8d32cd | Digital Object Identifier (DOI)
- Hierarchy level
- Non geographic dataset
- Other
-
non geographic dataset
Conformance result
- Title
-
INSPIRE Implementing rules laying down technical arrangements for the interoperability and harmonisation of Geology
- Date (Publication)
- 2011
- Explanation
-
See the referenced specification
- Pass
- No
Conformance result
- Title
-
Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services
- Date (Publication)
- 2010-12-08
- Explanation
-
See http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2010:323:0011:0102:EN:PDF
- Pass
- No
- Statement
-
Data was sourced from three corpora already available under open licences. BGS memoirs/technical reports sentences were sourced from previously labelled data at https://github.com/BritishGeologicalSurvey/geo-ner-model/blob/main/bgs.3class.geo-all-data.txt. This was converted to a new data format, and additional labels were manually added to a subset of the data using the doccano open source text annotation software. A small sample of sentences was taken from selected DECC/OGA onshore hydrocarbons well reports (http://data.bgs.ac.uk/id/dataHolding/13607542) and from selected Mineral Reconnaissance Programme reports (http://data.bgs.ac.uk/id/dataHolding/13605457). These reports were processed by 1. converting to machine readable text, 2. splitting into pages and sentences/paragraphs, 3. converting to doccano import JSONlines format, 5. manually annotating to label a range of geological concepts, 6. manually labelling to add relation labels to indicate how those concepts relate to each other, 7. exporting in doccano JSONL (Relations) format and also converting to the format required by https://github.com/princeton-nlp/PURE.
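The final conversion step can be sketched roughly as follows. This is not the BGS processing script. It assumes that a doccano record carries character offsets (`start_offset`, `end_offset`) and that the PURE input layout uses a `doc_key`, a list of tokenised `sentences`, and `ner` / `relations` entries expressed as inclusive token indices, as described in the PURE repository. The regular-expression tokeniser, the function names, the sample sentence and the `FORMATION` label are all hypothetical simplifications, and each record is treated as a single sentence.

```python
import re

# Hypothetical doccano-style record (ASSUMED field names, illustrative text).
RECORD = {
    "id": 1,
    "text": "The Sherwood Sandstone overlies the Coal Measures.",
    "entities": [
        {"id": 10, "label": "FORMATION", "start_offset": 4, "end_offset": 22},
        {"id": 11, "label": "FORMATION", "start_offset": 36, "end_offset": 49},
    ],
    "relations": [{"id": 5, "from_id": 10, "to_id": 11, "type": "overlies"}],
}


def tokenize_with_offsets(text):
    """Split into word and punctuation tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_tokens(tokens, start, end):
    """Return inclusive (first, last) token indices overlapping [start, end)."""
    hits = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
    return (hits[0], hits[-1]) if hits else None


def to_pure_like(record, doc_key):
    """Convert one record to an ASSUMED PURE-style document (one sentence)."""
    tokens = tokenize_with_offsets(record["text"])
    token_spans = {}  # entity id -> (first token, last token)
    ner = []
    for ent in record.get("entities", []):
        span = char_span_to_tokens(tokens, ent["start_offset"], ent["end_offset"])
        if span is None:
            continue  # entity offsets do not overlap any token
        token_spans[ent["id"]] = span
        ner.append([span[0], span[1], ent["label"]])
    relations = []
    for rel in record.get("relations", []):
        head = token_spans.get(rel["from_id"])
        tail = token_spans.get(rel["to_id"])
        if head is None or tail is None:
            continue  # skip relations whose entities could not be mapped
        relations.append([head[0], head[1], tail[0], tail[1], rel["type"]])
    return {
        "doc_key": doc_key,
        "sentences": [[tok for tok, _, _ in tokens]],
        "ner": [ner],
        "relations": [relations],
    }


doc = to_pure_like(RECORD, "example-0")
print(doc)
```

A real conversion would need the tokeniser expected by the downstream model and a decision on how to handle OCR noise in the MRP and DECC text, neither of which is covered here.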
Metadata
- File identifier
- 15ac4ca9-3be0-119e-e063-0937940a8990
- Metadata language
- English
- Hierarchy level
- Non geographic dataset
- Hierarchy level name
-
non geographic dataset
- Date stamp
- 2024-12-20
- Metadata standard name
- UK GEMINI
- Metadata standard version
-
2.3
- Metadata author
-
Organisation name | Individual name | Electronic mail address | Role
British Geological Survey | Enquiries | | Point of contact
- Dataset URI