Machine Learning & Exploration Data: How to overcome common issues during data preparation


Machine learning can be used in mineral exploration to discover complex patterns in geological data, helping to predict or identify the location of mineral deposits. But exploration data is often challenging for data scientists to work with because of its complexity; some practitioners estimate that up to 90% of their time is spent preparing data.

In this post I’ll explore common issues with exploration data and some methods to get around them, using examples from ExploreSA, a challenge we are currently running with the South Australian Government that brings geologists and data scientists together to identify new mineralisation in the Gawler Craton.

What are some of the general issues with exploration data?

  • There are many different data types to compare directly: a combination of structured and unstructured, gridded and point data.

  • It’s mostly spatial, collected at different resolutions, and often highly clustered.

  • The data is mostly historic, and similar ‘types’ have been collected using different methods. 

  • There’s no agreed definition of ‘mineralisation’.

To help get started, here are some specific examples that the Unearthed Community have used to prepare exploration data for the ExploreSA challenge.

Ingesting large data files of chemical data to extract meaningful information

This example is based on work by Russel Menezes from Radix Geo, which can be found on GitHub.

Particularly for projects covering large areas, such as the ExploreSA challenge, datasets associated with drill holes/boreholes can run to multiple gigabytes and be difficult to process.

The files contain the chemical data pertaining to the drillholes, but also a lot of other metadata and information that may or may not be useful for predictions.

Across a number of notebooks, Russel walks through extracting gold concentration information from the raw data files, which can then be used directly to train models.

The example demonstrates how to (a short sketch of these steps follows the list):

  • Deal with different units of concentration (e.g. ppm, ppb, %)

  • Remove null values

  • Filter the data, removing irrelevant records such as boreholes drilled for other purposes and old (less reliable) data, and keeping only specific mineral targets

  • Process the data in chunks to reduce the total processing load
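Putting those steps together, here is a minimal sketch of a chunked ingestion pipeline using pandas. This is not Russel’s actual code: the column names (CHEM_CODE, UNIT, VALUE, DRILLHOLE_NO, LONGITUDE, LATITUDE) and the file name are illustrative, so adapt them to the headers in the real SARIG export.

```python
import pandas as pd

# Conversion factors to bring every assay value to ppm.
# NOTE: column names and the input path are assumptions, not the
# actual SARIG schema.
TO_PPM = {"ppm": 1.0, "ppb": 0.001, "%": 10_000.0}

def extract_gold_ppm(path, chunksize=500_000):
    """Stream a multi-GB assay CSV and keep only usable gold results."""
    kept = []
    for chunk in pd.read_csv(path, chunksize=chunksize, low_memory=False):
        # Keep only gold assays.
        chunk = chunk[chunk["CHEM_CODE"] == "Au"]
        # Drop rows with no reported value or an unrecognised unit.
        chunk = chunk.dropna(subset=["VALUE", "UNIT"])
        chunk = chunk[chunk["UNIT"].isin(TO_PPM)]
        # Normalise everything to ppm so values are directly comparable.
        values = pd.to_numeric(chunk["VALUE"], errors="coerce")
        chunk["AU_PPM"] = values * chunk["UNIT"].map(TO_PPM)
        chunk = chunk.dropna(subset=["AU_PPM"])
        kept.append(chunk[["DRILLHOLE_NO", "LONGITUDE", "LATITUDE", "AU_PPM"]])
    return pd.concat(kept, ignore_index=True)

gold = extract_gold_ppm("sarig_dh_chem_exp.csv")  # hypothetical file name
```

Reading in chunks keeps memory usage bounded regardless of file size, and filtering each chunk before concatenating means only the relevant rows are ever held in memory at once.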

The image below shows the final output of this example: gold concentration in boreholes across the project area.

(Gold concentration data visualised by Russel Menezes across the Gawler Craton)
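A plot along these lines can be reproduced from the table built in the sketch above with a matplotlib scatter; a log colour scale is used because gold grades span orders of magnitude.

```python
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt

plot_df = gold[gold["AU_PPM"] > 0]  # LogNorm requires positive values

fig, ax = plt.subplots(figsize=(8, 8))
sc = ax.scatter(plot_df["LONGITUDE"], plot_df["LATITUDE"],
                c=plot_df["AU_PPM"], s=4, cmap="viridis",
                norm=mcolors.LogNorm())
fig.colorbar(sc, ax=ax, label="Au (ppm)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
plt.show()
```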

Classifying ‘mineralisation’ to build a training set

If you ask a geologist to define what economic mineralisation is, you’re unlikely to get an answer that’s useful for building any kind of training dataset. In his recent article, Machine Learning in Mineral Exploration — Understanding Classification Evaluation Metrics, Jack Maughan provides an example of how to classify data to help with this. 

Jack generates targets, or areas of interest, labelled 0 for barren or 1 for mineralised. Mineralisation is defined here as known elevated concentrations of base and precious metals (Au, Cu, Pb, Zn and/or Ag). The locations were chosen based on a combination of existing deposit locations (e.g. Olympic Dam, Carrapateena) and drillhole assays. Fifteen different features are then extracted for each target from a range of datasets, e.g. gravity, magnetic intensity, and resistivity.
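A common way to assemble a feature table like this in Python is to sample each co-registered geophysics raster at the target coordinates, for example with rasterio. The sketch below is an assumption about how such a table could be built, not Jack’s actual code; the file names and column names are illustrative, and the point coordinates must be in the same CRS as the rasters.

```python
import pandas as pd
import rasterio

# Hypothetical inputs: a table of target points labelled 0 (barren) /
# 1 (mineralised), plus a set of geophysics GeoTIFFs in the same CRS.
targets = pd.read_csv("targets.csv")  # assumed columns: x, y, label
layers = {
    "gravity": "gravity.tif",
    "tmi": "total_magnetic_intensity.tif",
    "resistivity": "resistivity.tif",
}

coords = list(zip(targets["x"], targets["y"]))
for name, path in layers.items():
    with rasterio.open(path) as src:
        # sample() yields one array per point; band 1 holds the value.
        targets[name] = [v[0] for v in src.sample(coords)]

# targets now holds one feature column per raster plus the 0/1 label,
# ready to feed into a classifier.
```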

(Target areas generated by Jack Maughan in his recent article)

Dealing with highly variable sample density

For a lot of geological datasets, sampling density can have a huge range. Probably the most variable is drillhole chemical assay data. Within each drillhole, samples are often taken every 1 m, but the drill holes themselves are spaced 50-100 m apart within a local area, and these clusters of drill holes can sit hundreds of kilometres apart. Imbalanced datasets such as these can lead to model outputs that appear accurate but aren’t relevant, so the data often needs manipulating prior to use.
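One simple way to reduce that clustering before training is to aggregate the points onto a regular grid, so each cell contributes a single value regardless of how densely it was drilled. A minimal sketch, assuming projected coordinates in metres and the column names shown (both assumptions):

```python
import numpy as np
import pandas as pd

def grid_thin(df, x="x", y="y", value="AU_PPM", cell=500.0):
    """Collapse clustered point data to one value per grid cell.

    Taking the median assay per cell (here 500 m, an illustrative
    choice) stops densely drilled areas from dominating the
    training set.
    """
    df = df.copy()
    # Assign each point to a grid cell.
    df["ix"] = np.floor(df[x] / cell).astype(int)
    df["iy"] = np.floor(df[y] / cell).astype(int)
    # One row per cell: mean position, median value.
    return (df.groupby(["ix", "iy"])
              .agg({x: "mean", y: "mean", value: "median"})
              .reset_index(drop=True))
```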

Geophysical data may also have variable sampling density. As part of the ExploreSA challenge, David McSkimming looked into this problem for gravity data. For the gravity dataset, most data points have been collected by ground surveys over the last 80 years, and gravity station spacing ranges from 50 m to 50 km. The effect of sample density on whole-of-state gravity layers is visually evident in the feature definition of the .tif images.

David describes how to use tools within QGIS to identify areas of the dataset where sampling density is sufficient to define discrete gravity features within the Gawler Craton, which can assist with targeting potential economic mineralisation. Check out his approach here.
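David’s workflow uses QGIS, but the same kind of density screening can also be scripted. Here is a sketch using a k-nearest-neighbour distance test with scipy; the spacing threshold and neighbour count are illustrative, not taken from David’s analysis.

```python
import numpy as np
from scipy.spatial import cKDTree

def dense_enough(xy, max_spacing=2000.0, k=5):
    """Flag stations whose k nearest neighbours all lie within
    max_spacing metres -- a rough proxy for areas where the grid
    can resolve discrete gravity features.
    """
    tree = cKDTree(xy)
    # Query k+1 neighbours because the nearest is the point itself.
    dist, _ = tree.query(xy, k=k + 1)
    return dist[:, -1] <= max_spacing

# Usage: xy is an (n, 2) array of projected station coordinates, e.g.
# xy = stations[["easting", "northing"]].to_numpy()
# keep = dense_enough(xy)
```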

Merging Images

Gridded images are typically created from geophysical survey data to facilitate interpretation. Because surveys are collected region by region and the files are quite large, the data is often initially received as unstitched tiles, as in the example below. Stitching the images into one file can make it much easier to use the data, particularly RGB values, in ML models.

This example from Jack Maughan shows the original total magnetic intensity gridded data for the Gawler Craton in tiles, alongside the stitched version. For a description of how this was done, and to access the stitched data, head to Jack’s post.

(GCAS data courtesy of the SA DEM, visualised and merged in QGIS.)
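Jack’s post covers the QGIS workflow; as a scripted alternative, the same stitching can be done with rasterio’s merge function. A minimal sketch, with illustrative file paths:

```python
import glob
import rasterio
from rasterio.merge import merge

# Open every tile (paths are hypothetical) and mosaic them into
# a single array plus the geotransform of the combined extent.
tiles = [rasterio.open(p) for p in glob.glob("gcas_tmi_tiles/*.tif")]
mosaic, transform = merge(tiles)

# Write one stitched GeoTIFF, reusing the first tile's metadata.
meta = tiles[0].meta.copy()
meta.update(height=mosaic.shape[1], width=mosaic.shape[2],
            transform=transform)
with rasterio.open("gcas_tmi_merged.tif", "w", **meta) as dst:
    dst.write(mosaic)

for t in tiles:
    t.close()
```

This assumes all tiles share the same CRS, band count, and data type, which is usually true for tiles cut from a single survey grid.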

There are many other examples of how to prepare exploration data for machine learning. Please share what else you would like to see or hear about from our community.

One of the aims of the ExploreSA Challenge is to help data scientists and geologists upskill on data preparation techniques, so the whole community can benefit by learning and growing together. You can find a wealth of information on this topic on the ExploreSA Challenge page. If you’re reading this before 30th April 2020, you can also submit your own data preparation ideas for the chance to win one of four $5000 prizes.

Thanks to everyone in the Unearthed Community who has already shared their ideas and processes in the ExploreSA Challenge. We are particularly grateful to Russel, Jack and David, whose work we have referenced.