
Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model

This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition (NER) model and an Entity Relation Extraction model. Both are Natural Language Processing (NLP) techniques that help extract structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were extracted from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488). The data can be used to fine-tune a pre-trained large language model using transfer learning, producing a model that can be run in inference mode to create the labels automatically and so generate structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL (Relation) format, the export format of the doccano open source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are published only as PDF images. These documents had to undergo OCR, which resulted in lower quality text and lower quality training data. The majority of the labelled data comes from the higher quality BGS memoirs text. The dataset is a proof of concept. Only minimal peer review of the labelling has been conducted, so it should not be treated as a gold standard labelled dataset, and its volume is insufficient to build a performant model.
The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
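
The JSONL (Relation) export described above can be read with a few lines of Python. The sketch below assumes the field names doccano is understood to use for this export (`text`, `entities` with `start_offset`/`end_offset`/`label`, and `relations` with `from_id`/`to_id`/`type`); these should be checked against the actual files. The sample sentence and the label name `FORMATION` are invented for illustration and are not taken from the dataset.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def relation_triples(record):
    """Resolve each relation to a (head text, relation type, tail text) triple."""
    text = record["text"]
    # Map each entity id to the substring covered by its character offsets.
    spans = {
        e["id"]: text[e["start_offset"]:e["end_offset"]]
        for e in record["entities"]
    }
    return [
        (spans[r["from_id"]], r["type"], spans[r["to_id"]])
        for r in record["relations"]
    ]


# Hypothetical record, invented for illustration (not taken from the dataset).
text = "The Mercia Mudstone Group overlies the Sherwood Sandstone Group."
head, tail = "Mercia Mudstone Group", "Sherwood Sandstone Group"
sample = {
    "id": 1,
    "text": text,
    "entities": [
        {"id": 10, "label": "FORMATION",
         "start_offset": text.index(head),
         "end_offset": text.index(head) + len(head)},
        {"id": 11, "label": "FORMATION",
         "start_offset": text.index(tail),
         "end_offset": text.index(tail) + len(tail)},
    ],
    "relations": [{"id": 100, "from_id": 10, "to_id": 11, "type": "overlies"}],
}

# Round-trip through a JSON string to mimic one line of a JSONL file.
records = read_jsonl([json.dumps(sample)])
triples = relation_triples(records[0])
print(triples)
```

With a real export, the same function can be applied to an open file handle, for example `read_jsonl(open(path, encoding="utf-8"))`, where `path` is wherever the downloaded file has been saved.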


Date (Creation)
2024-02-15
Citation identifier
http://data.bgs.ac.uk/id/dataHolding/13608217
Point of contact
Organisation name | Individual name | Electronic mail address | Role
British Geological Survey | Enquiries | enquiries@bgs.ac.uk | Distributor
British Geological Survey | Enquiries | enquiries@bgs.ac.uk | Originator
British Geological Survey | Enquiries | enquiries@bgs.ac.uk | Point of contact
British Geological Survey | Enquiries | enquiries@bgs.ac.uk | Principal investigator
Maintenance and update frequency
Not planned

GEMET - INSPIRE themes, version 1.0

  • Geology

BGS Thesaurus of Geosciences

  • NGDC Deposited Data

  • Physical properties

  • Mathematical programming

  • data.gov.uk (non-INSPIRE)

  • Citable Data

  • Stratigraphic unit

dataCentre
  • data.gov.uk (non-INSPIRE)
  • NGDC Deposited Data
  • Citable Data
Keywords
  • NERC_DDC

Access constraints
Other restrictions
Other constraints
licenceOGL
Other constraints
Available under the Open Government Licence subject to the following acknowledgement accompanying the reproduced NERC materials "Contains NERC materials ©NERC [year]"
Use constraints
Other restrictions
Other constraints

The copyright of materials derived from the British Geological Survey's work is vested in the Natural Environment Research Council [NERC]. No part of this work may be reproduced or transmitted in any form or by any means, or stored in a retrieval system of any nature, without the prior permission of the copyright holder, via the BGS Intellectual Property Rights Manager. Use by customers of information provided by the BGS is at the customer's own risk. In view of the disparate sources of information at BGS's disposal, including such material donated to BGS, that BGS accepts in good faith as being accurate, the Natural Environment Research Council (NERC) gives no warranty, expressed or implied, as to the quality or accuracy of the information supplied, or to the information's suitability for any use. NERC/BGS accepts no liability whatever in respect of loss, damage, injury or other occurrence however caused.

Other constraints

Available under the Open Government Licence subject to the following acknowledgement accompanying the reproduced NERC materials "Contains NERC materials ©NERC [year]"

Language
English
Topic category
  • Geoscientific information
Begin date
2023-11-01
End date
2024-02-15

Reference System Information

No information provided.
Distribution format
Name | Version
jsonlines | doccano JSONL(Relation)
jsonlines | PURE relations

Distributor contact
Organisation name | Individual name | Electronic mail address | Role
British Geological Survey | Enquiries | enquiries@bgs.ac.uk | Distributor
OnLine resource
Linkage | Name
https://github.com/BritishGeologicalSurvey/princeton-nlp-relation-extraction | BGS github repository
https://webapps.bgs.ac.uk/services/ngdc/accessions/index.html#item186633 | Data
https://doi.org/10.5285/afba2d1d-8a5d-4b96-a6fa-c13b5d8d32cd | Digital Object Identifier (DOI)

Hierarchy level
Non geographic dataset
Other

non geographic dataset

Conformance result

Title

INSPIRE Implementing rules laying down technical arrangements for the interoperability and harmonisation of Geology

Date (Publication)
2011
Explanation

See the referenced specification

Pass
No

Conformance result

Title

Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services

Date (Publication)
2010-12-08
Explanation

See http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2010:323:0011:0102:EN:PDF

Pass
No
Statement

Data was sourced from three corpora already available under open licences. Sentences from BGS memoirs/technical reports were sourced from previously labelled data at https://github.com/BritishGeologicalSurvey/geo-ner-model/blob/main/bgs.3class.geo-all-data.txt. This was converted to a new data format, and additional labels were manually added to a subset of the data using the doccano open source text annotation software. A small sample of sentences was taken from selected DECC/OGA onshore hydrocarbons well reports (http://data.bgs.ac.uk/id/dataHolding/13607542) and from selected Mineral Reconnaissance Programme reports (http://data.bgs.ac.uk/id/dataHolding/13605457). These reports were processed by 1. converting to machine readable text, 2. splitting into pages and sentences/paragraphs, 3. converting to doccano import JSONlines format, 5. manually annotating to label a range of geological concepts, 6. manually labelling to add relation labels to indicate how those concepts relate to each other, 7. exporting in doccano JSONL (Relations) format and also converting to the format required by PURE (https://github.com/princeton-nlp/PURE).
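
The final step, converting a doccano relation export to the PURE input format, can be sketched as follows. This is an illustrative sketch only, not the conversion script used to build the dataset. It assumes the doccano field names `entities`/`relations` with character offsets, and the PURE layout of `doc_key`, `sentences`, `ner` and `relations` with inclusive token indices. The regex tokeniser, the sample sentence and the label name `FORMATION` are assumptions made for the example.

```python
import json
import re


def tokenize(text):
    """Return (token, start, end) triples; words and punctuation are split."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_tokens(tokens, start, end):
    """Map a character span to (first, last) token indices, both inclusive."""
    idx = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
    return idx[0], idx[-1]


def doccano_to_pure(record):
    """Convert one doccano relation record to a one-sentence PURE document."""
    tokens = tokenize(record["text"])
    spans = {
        e["id"]: char_span_to_tokens(tokens, e["start_offset"], e["end_offset"])
        for e in record["entities"]
    }
    labels = {e["id"]: e["label"] for e in record["entities"]}
    # PURE expects [start_token, end_token, label] per entity ...
    ner = [[*spans[i], labels[i]] for i in spans]
    # ... and [head_start, head_end, tail_start, tail_end, label] per relation.
    relations = [
        [*spans[r["from_id"]], *spans[r["to_id"]], r["type"]]
        for r in record["relations"]
    ]
    return {
        "doc_key": str(record["id"]),
        "sentences": [[tok for tok, _, _ in tokens]],
        "ner": [ner],
        "relations": [relations],
    }


# Hypothetical record, invented for illustration (not taken from the dataset).
text = "The Mercia Mudstone Group overlies the Sherwood Sandstone Group."
head, tail = "Mercia Mudstone Group", "Sherwood Sandstone Group"
sample = {
    "id": 1,
    "text": text,
    "entities": [
        {"id": 10, "label": "FORMATION",
         "start_offset": text.index(head),
         "end_offset": text.index(head) + len(head)},
        {"id": 11, "label": "FORMATION",
         "start_offset": text.index(tail),
         "end_offset": text.index(tail) + len(tail)},
    ],
    "relations": [{"id": 100, "from_id": 10, "to_id": 11, "type": "overlies"}],
}

pure = doccano_to_pure(sample)
print(json.dumps(pure))
```

Treating each record as a single-sentence document keeps the token indices simple. OCR-derived text may need a more careful tokeniser, and an entity span that covers no token would need explicit handling.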

Metadata

File identifier
15ac4ca9-3be0-119e-e063-0937940a8990
Metadata language
English
Hierarchy level
Non geographic dataset
Hierarchy level name

non geographic dataset

Date stamp
2025-06-17
Metadata standard name
UK GEMINI
Metadata standard version

2.3

Metadata author
Organisation name | Electronic mail address | Role
British Geological Survey | enquiries@bgs.ac.uk | Point of contact
Dataset URI

http://data.bgs.ac.uk/id/dataHolding/13608217

 
 
