Modelling and Propagation of Legacy Petrophysical Data for Mining Exploration (3/3)

Modeling

Barcelona 25/09/24 GEO3BCN Manuel David Soto, Juan Alcalde, Adrià Hernàndez-Pineda, Ramón Carbonel

import dotenv
import os

Introduction

The dispersion and scarcity of petrophysical data are well-known challenges in the mining sector. These issues are primarily driven by economic factors, but also by geological (such as sedimentary cover, weathering, erosion, or the depth of targets), geotechnical (e.g., slope or borehole stability), and even technical limitations or availability that have been resolved in other industries (for instance, sonic logs were not previously acquired due to a lack of interest in velocity field data).

To address the challenge of sparse and incomplete petrophysical data in the mining sector we have developed three Jupyter notebooks that tackle these issues through a suite of open-source Python tools. These tools support researchers in the initial exploration, visualization, and integration of data from diverse sources, filling gaps caused by technical limitations and ultimately enabling the complete modeling of missing properties (through standard and more advanced ML-based models). We applied these tools to both, recently acquired and legacy petrophysical data of two cores northwest of Collinstown (County Westmeath, Province of Leinster), Ireland, located 26 km west-northwest of the Navan mine. However, these tools are adaptable and applicable to mining data from any location.

After the exploratory data analysis (notebook 1/3) and filling the gaps in the petrophysical dataset of Collinstown (previous notebook 2/3), this third notebook is focused on modeling by Machine Learning (ML) algorithms entire missing variables, such as the Gamma Ray (GR) measurement, which is available in only one of the petrophysical boreholes. The tasks to perform in this notebook are:

  • Load and merge the legacy GR data of borehole TC-3660-008 into its petrophysical dataset.

  • Use the merged data to train and evaluate different ML algorithms capable of predicting GR data in other boreholes.

  • By using the trained best model, propagate the GR to the other petrophysical borehole.

  • Evaluate the possibility of propagating this (GR) or other properties to other boreholes with less available data.

As with the previous notebooks, these tasks are performed with open-source Python tools easily accessible by any researcher through a Python installation connected to the Internet.

Variables

The dataset used in this notebook is the imputed dataset from the previous notebook (2/3), features3. It contains the modelable petrophysical features of the two available boreholes; TC-1319-005 and TC-3660-008. Hole (text object) and Len (float) variables are for reference, Form (text object) is a categorical variable representing the Major Formations:

Name

Explanation

Unit

Hole

Hole name

From

Top of the sample

m

Len

Length of the core sample

cm

Den

Density

g/cm³

Vp

Compressional velocity

m/s

Vs

Shear velocity

m/s

Mag

Magnetic susceptibility

Ip

Chargeability or induced polarization

mv/V

Res

Resistivity

ohm·m

Form

Major formations or zone along the hole

The GR measurement from keyhole TC-3660-008 is the property we want to transfer to the other borehole, TC-1319-005. The challenge lies in the fact that both datasets have different ranges and intervals, and is not possible to make an ML model unless both datasets (new petrophysical features and legacy GR) are integrated into a single dataframe.

Libraries

The following are the Python libraries used along this notebook. PSL are Python Standard Libraries, UDL are User Defined Libraries, and PEL are Python External Libraries:

PLS

import sys
import warnings

#UDL
from vector_geology import basic_stat, geo

# PEL- Basic
import numpy as np
import pandas as pd
from tabulate import tabulate
import json

# PEL - Plotting
import matplotlib.pyplot as plt

# PEL - Filtering
from scipy.signal import butter, filtfilt

# # PEL - Data selection, transformation, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# # PEL- ML algorithms
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

# PEL - Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error

Settings

Activation of qt GUI Seed of random process

seed = 123

# Warning suppression
warnings.filterwarnings("ignore")

Data Loading

features3, imputed features, data load

features3 = pd.read_csv('Output/features3.csv', index_col=0)
features3.head()
Hole From Len Den Vp Vs Mag Ip Res Form
0 TC-1319-005 265.77 11.2 2.610000 5725.879959 3073.767206 0.108 36.995337 5489.392555 CALP
1 TC-1319-005 271.66 11.8 2.786624 5514.000000 2222.000000 0.108 26.060000 4600.000000 CALP
2 TC-1319-005 277.05 10.3 3.399000 5754.000000 2085.000000 0.237 17.520000 10200.000000 CALP
3 TC-1319-005 282.55 11.2 2.800579 5773.000000 2011.000000 0.151 28.130000 5800.000000 CALP
4 TC-1319-005 287.20 10.8 3.293000 6034.000000 2122.000000 0.176 6.880000 15500.000000 CALP


features3.describe()
From Len Den Vp Vs Mag Ip Res
count 329.000000 321.000000 329.000000 329.000000 329.000000 329.000000 329.000000 329.000000
mean 543.279696 10.223364 2.723043 5501.096555 2808.728436 0.142841 48.950658 3401.316252
std 262.864370 1.070127 0.143224 538.939658 517.989081 0.100134 24.374350 3623.480943
min 9.600000 7.500000 2.420000 3009.000000 1295.000000 0.001000 5.390000 156.900000
25% 355.580000 9.700000 2.679000 5266.000000 2448.000000 0.064000 31.808800 1200.000000
50% 547.800000 10.000000 2.707000 5587.000000 2994.000000 0.115000 46.200000 2300.000000
75% 750.000000 10.400000 2.731000 5849.000000 3231.000000 0.201000 65.470000 4200.000000
max 1049.200000 18.000000 3.650000 6478.000000 3740.000000 0.545000 171.730000 23900.000000


Columns in the features3

features3.columns
Index(['Hole', 'From', 'Len', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res', 'Form'], dtype='object')

Imputed features of borehole TC-3660-008

features3_8 = features3[features3.Hole=='TC-3660-008']
features3_8 = features3_8.reset_index(drop=True)
features3_8
Hole From Len Den Vp Vs Mag Ip Res Form
0 TC-3660-008 9.60 9.9 2.690 6226.000000 3426.000000 0.545 12.190000 18200.000000 CALP
1 TC-3660-008 12.86 9.5 2.702 6376.000000 3467.000000 0.435 11.260000 23000.000000 CALP
2 TC-3660-008 18.36 10.4 2.676 5810.000000 3180.000000 0.377 19.710000 6900.000000 CALP
3 TC-3660-008 22.90 10.2 2.686 6415.000000 3434.000000 0.143 26.681228 8414.846011 CALP
4 TC-3660-008 24.50 10.3 2.693 6095.000000 3388.000000 0.159 15.040000 13500.000000 CALP
... ... ... ... ... ... ... ... ... ... ...
163 TC-3660-008 814.55 9.6 2.763 5680.000000 3179.000000 0.158 70.940000 481.100000 LIS
164 TC-3660-008 819.92 NaN 2.649 5222.938307 2668.541294 0.024 54.048890 879.289776 LIS
165 TC-3660-008 825.60 9.9 2.762 5593.000000 1787.000000 0.337 78.440000 334.600000 PAL
166 TC-3660-008 828.80 9.9 2.804 5858.000000 1964.000000 0.311 73.730000 466.500000 PAL
167 TC-3660-008 833.40 10.1 2.735 5805.000000 3217.000000 0.308 33.860000 2300.000000 PAL

168 rows × 10 columns



Depth information of feature3

print('Petrophysical features of TC-360-008', '\n')
print('Top:', features3_8.From[0])
print('Base:', features3_8.From[len(features3_8) - 1], '\n')

print('Length:', len(features3_8))
print('Range:', features3_8.From[len(features3_8) - 1] - features3_8.From[0], '\n')
print('First step: {:.1f}'.format(features3_8.From[1] - features3_8.From[0]))

# Mean step
steps = []
for i in range(len(features3_8) - 1):
    steps.append(features3_8.From[i+1] - features3_8.From[i])
print('Mean step: {:.2f}'.format(np.mean(steps)))
Petrophysical features of TC-360-008

Top: 9.600000000000025
Base: 833.4

Length: 168
Range: 823.8

First step: 3.3
Mean step: 4.93

Legacy GR data of different holes

dotenv.load_dotenv()
base_path = os.getenv("PATH_TO_COLLINSTOWN_PETRO")
gr = pd.read_csv(f'{base_path}/collinstown_Gamma.csv')
gr.head()
Python-dotenv could not parse statement starting at line 13
Python-dotenv could not parse statement starting at line 14
Python-dotenv could not parse statement starting at line 15
Python-dotenv could not parse statement starting at line 16
Python-dotenv could not parse statement starting at line 17
HOLEID PROJECTCODE DEPTHSTEP RUN PRIORITY DEPTH GeoPhys_NGAM_API GeoPhys_NGAM_NLF50 Gamma_Temp Gamma_VBatt GeoPhys_NGAM_Speed PL_DER GeoPhys_Date_dr Logging_Method_dr Calibration_dr Diameter_dr IntegrationPeriod_dr WheelDiameter_dr Gamma_Overlap
0 TC-3660-007 Collinstown -0.1 1 1 0.7 26.9 NaN 25 8.31 8.57 3660 6-Sep-2019 IN_ROD 0.517073 N 800 100.454 NaN
1 TC-3660-007 Collinstown -0.1 1 1 0.8 26.8 NaN 25 8.31 8.58 3660 6-Sep-2019 IN_ROD 0.517073 N 800 100.454 NaN
2 TC-3660-007 Collinstown -0.1 1 1 0.9 18.1 NaN 25 8.32 8.46 3660 6-Sep-2019 IN_ROD 0.517073 N 800 100.454 NaN
3 TC-3660-007 Collinstown -0.1 1 1 1.0 19.4 NaN 25 8.32 8.43 3660 6-Sep-2019 IN_ROD 0.517073 N 800 100.454 NaN
4 TC-3660-007 Collinstown -0.1 1 1 1.1 19.5 NaN 25 8.32 8.51 3660 6-Sep-2019 IN_ROD 0.517073 N 800 100.454 NaN


Different holes in the legacy GR dataframe

gr.HOLEID.unique()
array(['TC-3660-007', 'TC-3660-008', 'TC-3839-002'], dtype=object)

Legacy GR of borehole TC-3660-008

gr8 = gr[gr.HOLEID == 'TC-3660-008'].reset_index(drop=True)
gr8
HOLEID PROJECTCODE DEPTHSTEP RUN PRIORITY DEPTH GeoPhys_NGAM_API GeoPhys_NGAM_NLF50 Gamma_Temp Gamma_VBatt GeoPhys_NGAM_Speed PL_DER GeoPhys_Date_dr Logging_Method_dr Calibration_dr Diameter_dr IntegrationPeriod_dr WheelDiameter_dr Gamma_Overlap
0 TC-3660-008 Collinstown -0.1 1 1 -0.2 27.3 NaN 25 8.14 7.37 3660 23-Oct-2019 IN_ROD 0.517073 B 800 100.454 0.0
1 TC-3660-008 Collinstown -0.1 1 1 -0.1 16.2 NaN 25 8.14 7.94 3660 23-Oct-2019 IN_ROD 0.517073 B 800 100.454 0.0
2 TC-3660-008 Collinstown -0.1 1 1 0.0 19.8 NaN 25 8.14 7.96 3660 23-Oct-2019 IN_ROD 0.517073 B 800 100.454 0.0
3 TC-3660-008 Collinstown -0.1 1 1 0.1 22.6 NaN 25 8.14 7.87 3660 23-Oct-2019 IN_ROD 0.517073 B 800 100.454 0.0
4 TC-3660-008 Collinstown -0.1 1 1 0.2 30.6 NaN 25 8.14 7.77 3660 23-Oct-2019 IN_ROD 0.517073 B 800 100.454 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8718 TC-3660-008 Collinstown -0.1 2 1 830.4 72.5 NaN 25 8.33 6.58 3660 04-Nov-2019 IN_ROD 0.517073 B 800 100.454 0.0
8719 TC-3660-008 Collinstown -0.1 2 1 830.5 80.3 NaN 25 8.33 6.20 3660 04-Nov-2019 IN_ROD 0.517073 B 800 100.454 0.0
8720 TC-3660-008 Collinstown -0.1 2 1 830.6 77.1 NaN 25 8.33 5.02 3660 04-Nov-2019 IN_ROD 0.517073 B 800 100.454 0.0
8721 TC-3660-008 Collinstown -0.1 2 1 830.7 72.5 NaN 25 8.33 4.55 3660 04-Nov-2019 IN_ROD 0.517073 B 800 100.454 0.0
8722 TC-3660-008 Collinstown -0.1 2 1 830.8 52.6 NaN 25 8.33 3.66 3660 04-Nov-2019 IN_ROD 0.517073 B 800 100.454 0.0

8723 rows × 19 columns



gr8.describe()
DEPTHSTEP RUN PRIORITY DEPTH GeoPhys_NGAM_API GeoPhys_NGAM_NLF50 Gamma_Temp Gamma_VBatt GeoPhys_NGAM_Speed PL_DER Calibration_dr IntegrationPeriod_dr WheelDiameter_dr Gamma_Overlap
count 8.723000e+03 8723.000000 8723.0 8723.000000 8723.000000 0.0 8723.0 8723.000000 8723.000000 8723.0 8.723000e+03 8723.0 8.723000e+03 8723.000000
mean -1.000000e-01 1.437006 1.0 417.895369 36.886003 NaN 25.0 8.251908 9.070378 3660.0 5.170728e-01 800.0 1.004540e+02 0.047117
std 1.387858e-17 0.496044 0.0 234.501159 17.433525 NaN 0.0 0.059756 0.815250 0.0 1.110287e-16 0.0 2.842334e-14 0.211901
min -1.000000e-01 1.000000 1.0 -0.200000 2.800000 NaN 25.0 8.140000 0.610000 3660.0 5.170728e-01 800.0 1.004540e+02 0.000000
25% -1.000000e-01 1.000000 1.0 217.850000 24.400000 NaN 25.0 8.200000 8.720000 3660.0 5.170728e-01 800.0 1.004540e+02 0.000000
50% -1.000000e-01 1.000000 1.0 435.900000 33.500000 NaN 25.0 8.250000 9.190000 3660.0 5.170728e-01 800.0 1.004540e+02 0.000000
75% -1.000000e-01 2.000000 1.0 612.750000 45.200000 NaN 25.0 8.300000 9.565000 3660.0 5.170728e-01 800.0 1.004540e+02 0.000000
max -1.000000e-01 2.000000 1.0 830.800000 119.800000 NaN 25.0 8.370000 19.680000 3660.0 5.170728e-01 800.0 1.004540e+02 1.000000


Depth information of the legacy GR

print('Legacy GR of TC-360-008', '\n')

print('Top:', gr8.DEPTH[0])
print('Base:', gr8.DEPTH[len(gr8) - 1], '\n')

print('Length:', len(gr8))
print('Range:', gr8.DEPTH[len(gr8) - 1] - gr8.DEPTH[2], '\n')

print('First step: {:.1f}'.format(gr8.DEPTH[3] - gr8.DEPTH[2]))

# Mean step
steps = []
for i in range(2, len(gr8) - 1):
    steps.append(gr8.DEPTH[i+1] - gr8.DEPTH[i])
print('Mean step: {:.2f}'.format(np.mean(steps)))
Legacy GR of TC-360-008

Top: -0.2
Base: 830.8

Length: 8723
Range: 830.8

First step: 0.1
Mean step: 0.10

Data Merging

To model a new GR in borehole TC-1319-005, the legacy GR (gr8) and petrophysical data of borehole TC-3660-008 (feature3_8) have to be integrated into a single dataframe, free of NaNs.

Depth Equalization

The first step in merging the process of the legacy GR into the petrophysical data of borehole TC-3660-008 is to equalize the depths of both dataframes by the use of conditional expressions:

Equalizing depths in feature3

features3_8 = features3_8[features3_8.From <= 830.8].reset_index(drop=True)
features3_8
Hole From Len Den Vp Vs Mag Ip Res Form
0 TC-3660-008 9.60 9.9 2.690 6226.000000 3426.000000 0.545 12.190000 18200.000000 CALP
1 TC-3660-008 12.86 9.5 2.702 6376.000000 3467.000000 0.435 11.260000 23000.000000 CALP
2 TC-3660-008 18.36 10.4 2.676 5810.000000 3180.000000 0.377 19.710000 6900.000000 CALP
3 TC-3660-008 22.90 10.2 2.686 6415.000000 3434.000000 0.143 26.681228 8414.846011 CALP
4 TC-3660-008 24.50 10.3 2.693 6095.000000 3388.000000 0.159 15.040000 13500.000000 CALP
... ... ... ... ... ... ... ... ... ... ...
162 TC-3660-008 809.47 9.5 2.786 5621.000000 2334.000000 0.105 55.200000 706.500000 P
163 TC-3660-008 814.55 9.6 2.763 5680.000000 3179.000000 0.158 70.940000 481.100000 LIS
164 TC-3660-008 819.92 NaN 2.649 5222.938307 2668.541294 0.024 54.048890 879.289776 LIS
165 TC-3660-008 825.60 9.9 2.762 5593.000000 1787.000000 0.337 78.440000 334.600000 PAL
166 TC-3660-008 828.80 9.9 2.804 5858.000000 1964.000000 0.311 73.730000 466.500000 PAL

167 rows × 10 columns



New depth information of feature3

print('Petrophysical features of TC-360-008', '\n')
print('Top:', features3_8.From[0])
print('Base:', features3_8.From[len(features3_8) - 1], '\n')

print('Length:', len(features3_8))
print('Range:', features3_8.From[len(features3_8) - 1] - features3_8.From[0], '\n')
print('First step: {:.1f}'.format(features3_8.From[1] - features3_8.From[0]))

# Mean step
steps = []
for i in range(len(features3_8) - 1):
    steps.append(features3_8.From[i+1] - features3_8.From[i])
print('Mean step: {:.2f}'.format(np.mean(steps)))
Petrophysical features of TC-360-008

Top: 9.600000000000025
Base: 828.8

Length: 167
Range: 819.1999999999999

First step: 3.3
Mean step: 4.93

Equalizing depths in the legacy GR

gr8 = gr8[(gr8.DEPTH >= 9.6) & (gr8.DEPTH <= 828.8)].reset_index(drop=True)
gr8
HOLEID PROJECTCODE DEPTHSTEP RUN PRIORITY DEPTH GeoPhys_NGAM_API GeoPhys_NGAM_NLF50 Gamma_Temp Gamma_VBatt GeoPhys_NGAM_Speed PL_DER GeoPhys_Date_dr Logging_Method_dr Calibration_dr Diameter_dr IntegrationPeriod_dr WheelDiameter_dr Gamma_Overlap
0 TC-3660-008 Collinstown -0.1 1 1 9.6 36.5 NaN 25 8.15 9.17 3660 23-Oct-2019 IN_ROD 0.517073 B 800 100.454 0.0
1 TC-3660-008 Collinstown -0.1 1 1 9.7 37.9 NaN 25 8.15 9.13 3660 23-Oct-2019 IN_ROD 0.517073 B 800 100.454 0.0
2 TC-3660-008 Collinstown -0.1 1 1 9.8 37.0 NaN 25 8.15 9.34 3660 23-Oct-2019 IN_ROD 0.517073 B 800 100.454 0.0
3 TC-3660-008 Collinstown -0.1 1 1 9.9 33.6 NaN 25 8.15 9.44 3660 23-Oct-2019 IN_ROD 0.517073 B 800 100.454 0.0
4 TC-3660-008 Collinstown -0.1 1 1 10.0 32.6 NaN 25 8.15 9.43 3660 23-Oct-2019 IN_ROD 0.517073 B 800 100.454 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8600 TC-3660-008 Collinstown -0.1 2 1 828.4 89.6 NaN 25 8.33 8.29 3660 04-Nov-2019 IN_ROD 0.517073 B 800 100.454 0.0
8601 TC-3660-008 Collinstown -0.1 2 1 828.5 88.3 NaN 25 8.33 8.31 3660 04-Nov-2019 IN_ROD 0.517073 B 800 100.454 0.0
8602 TC-3660-008 Collinstown -0.1 2 1 828.6 74.2 NaN 25 8.33 8.31 3660 04-Nov-2019 IN_ROD 0.517073 B 800 100.454 0.0
8603 TC-3660-008 Collinstown -0.1 2 1 828.7 84.0 NaN 25 8.33 7.83 3660 04-Nov-2019 IN_ROD 0.517073 B 800 100.454 0.0
8604 TC-3660-008 Collinstown -0.1 2 1 828.8 70.6 NaN 25 8.33 7.37 3660 04-Nov-2019 IN_ROD 0.517073 B 800 100.454 0.0

8605 rows × 19 columns



New depth information of the legacy GR

print('Legacy GR of TC-360-008', '\n')

print('Top:', gr8.DEPTH[0])
print('Base:', gr8.DEPTH[len(gr8) - 1], '\n')

print('Length:', len(gr8))
print('Range:', gr8.DEPTH[len(gr8) - 1] - gr8.DEPTH[2], '\n')

print('First step: {:.1f}'.format(gr8.DEPTH[3] - gr8.DEPTH[2]))

# Mean step
steps = []
for i in range(2, len(gr8) - 1):
    steps.append(gr8.DEPTH[i+1] - gr8.DEPTH[i])
print('Mean step: {:.2f}'.format(np.mean(steps)))
Legacy GR of TC-360-008

Top: 9.6
Base: 828.8

Length: 8605
Range: 819.0

First step: 0.1
Mean step: 0.10

Filtering and Dowsampling

Now that the depths of the legacy GR and the petrophysical data of borehole TC-3660-008 have been equalized, the next challenge is to merge the legacy GR into the petrophysical data. This involves condensing the 8605 legacy GR values into a new petrophysical feature with 167 values that can be used in the models. This task has been archived by combining a Butterwort filter and a downsampling function, an ideal combination for data with fast fluctuation such as the GR log (only downsampling produces a very spiky result). The effect of the filter, downsampling, and filter followed by downsampling can be appreciated at the end of this section.

Simplify the legacy GR

gr8 = gr8[['DEPTH', 'GeoPhys_NGAM_API']]
gr8
DEPTH GeoPhys_NGAM_API
0 9.6 36.5
1 9.7 37.9
2 9.8 37.0
3 9.9 33.6
4 10.0 32.6
... ... ...
8600 828.4 89.6
8601 828.5 88.3
8602 828.6 74.2
8603 828.7 84.0
8604 828.8 70.6

8605 rows × 2 columns



Rename series in the legacy GR

gr8 = gr8.rename(columns={'DEPTH':'depth', 'GeoPhys_NGAM_API':'gr'})
gr8
depth gr
0 9.6 36.5
1 9.7 37.9
2 9.8 37.0
3 9.9 33.6
4 10.0 32.6
... ... ...
8600 828.4 89.6
8601 828.5 88.3
8602 828.6 74.2
8603 828.7 84.0
8604 828.8 70.6

8605 rows × 2 columns



Downsampling the legacy GR but keeping the first and last value

reduc = int((len(gr8)) / 165)

gr8_middle = gr8.iloc[1:-1].iloc[::reduc, :]
gr8_down = pd.concat([gr8.iloc[[0]], gr8_middle, gr8.iloc[[-1]]])
gr8_down = gr8_down.sort_values('depth').reset_index(drop=True)
gr8_down
depth gr
0 9.6 36.5
1 9.7 37.9
2 14.9 38.8
3 20.1 25.0
4 25.3 15.0
... ... ...
163 810.9 27.2
164 816.1 63.5
165 821.3 49.8
166 826.5 72.8
167 828.8 70.6

168 rows × 2 columns



Butterworth filter on the legacy GR

b, a = butter(N=2, Wn=0.02, btype='low')
filtered_gr = filtfilt(b, a, gr8.gr)

gr8['gr_fil'] = filtered_gr
gr8 = gr8.sort_values('depth').reset_index(drop=True)
gr8
depth gr gr_fil
0 9.6 36.5 30.578377
1 9.7 37.9 30.455095
2 9.8 37.0 30.326338
3 9.9 33.6 30.192920
4 10.0 32.6 30.055744
... ... ... ...
8600 828.4 89.6 76.752640
8601 828.5 88.3 76.716145
8602 828.6 74.2 76.680680
8603 828.7 84.0 76.646916
8604 828.8 70.6 76.615594

8605 rows × 3 columns



Downsampling the filtered GR

gr8_fil_down_middle = gr8.iloc[1:-1].iloc[::reduc, :]
gr8_fil_down = pd.concat([gr8.iloc[[0]], gr8_fil_down_middle, gr8.iloc[[-1]]])

# Sort by depth
gr8_fil_down = gr8_fil_down.sort_values('depth').reset_index(drop=True)
gr8_fil_down
depth gr gr_fil
0 9.6 36.5 30.578377
1 9.7 37.9 30.455095
2 14.9 38.8 34.024174
3 20.1 25.0 23.740136
4 25.3 15.0 24.472588
... ... ... ...
163 810.9 27.2 59.538951
164 816.1 63.5 76.511278
165 821.3 49.8 68.944642
166 826.5 72.8 77.107392
167 828.8 70.6 76.615594

168 rows × 3 columns



Plot of filter and/or downsampling

plt.figure(figsize=(15, 8))

plt.subplot(121)
plt.plot(gr8.gr, gr8.depth, label='Original')
plt.legend()
plt.grid()
plt.xlabel('GR (API)')
plt.xlabel('Depth (m)')
plt.axis([0, 120, 850, 0])
plt.ylim(850, 0)

plt.subplot(122)
plt.plot(gr8_down.gr, gr8_down.depth, label='Down Samplig')
plt.plot(gr8.gr_fil, gr8.depth, label='Butterworth Filtered')
plt.plot(gr8_fil_down.gr_fil, gr8_fil_down.depth, label='Butterworth Filtered & Down Sampling')
plt.legend()
plt.grid()
plt.xlabel('GR (API)')
plt.xlabel('Depth (m)')
plt.axis([0, 120, 850, 0])
plt.ylim(850, 0)

plt.suptitle('Borehole TC-3660-008 GR', y=0.98)
plt.tight_layout()
Borehole TC-3660-008 GR

Integrated Dataframe

As the downsampling gave 168 values, we dropped the central value to reach 167 (the length of the petrophysical dataset) and then fused the filter and downsampled GR into the petrophysical data of borehole TC-3660-008.

Drop of central value in the filter and downsampled GR

del_index = len(gr8_fil_down)//2
gr8_fil_down = gr8_fil_down.drop(axis=0, index=del_index).reset_index(drop=True)
gr8_fil_down
depth gr gr_fil
0 9.6 36.5 30.578377
1 9.7 37.9 30.455095
2 14.9 38.8 34.024174
3 20.1 25.0 23.740136
4 25.3 15.0 24.472588
... ... ... ...
162 810.9 27.2 59.538951
163 816.1 63.5 76.511278
164 821.3 49.8 68.944642
165 826.5 72.8 77.107392
166 828.8 70.6 76.615594

167 rows × 3 columns



Final integrated dataframe without NaNs

features3_8['GR'] = gr8_fil_down.gr_fil
features3_8
Hole From Len Den Vp Vs Mag Ip Res Form GR
0 TC-3660-008 9.60 9.9 2.690 6226.000000 3426.000000 0.545 12.190000 18200.000000 CALP 30.578377
1 TC-3660-008 12.86 9.5 2.702 6376.000000 3467.000000 0.435 11.260000 23000.000000 CALP 30.455095
2 TC-3660-008 18.36 10.4 2.676 5810.000000 3180.000000 0.377 19.710000 6900.000000 CALP 34.024174
3 TC-3660-008 22.90 10.2 2.686 6415.000000 3434.000000 0.143 26.681228 8414.846011 CALP 23.740136
4 TC-3660-008 24.50 10.3 2.693 6095.000000 3388.000000 0.159 15.040000 13500.000000 CALP 24.472588
... ... ... ... ... ... ... ... ... ... ... ...
162 TC-3660-008 809.47 9.5 2.786 5621.000000 2334.000000 0.105 55.200000 706.500000 P 59.538951
163 TC-3660-008 814.55 9.6 2.763 5680.000000 3179.000000 0.158 70.940000 481.100000 LIS 76.511278
164 TC-3660-008 819.92 NaN 2.649 5222.938307 2668.541294 0.024 54.048890 879.289776 LIS 68.944642
165 TC-3660-008 825.60 9.9 2.762 5593.000000 1787.000000 0.337 78.440000 334.600000 PAL 77.107392
166 TC-3660-008 828.80 9.9 2.804 5858.000000 1964.000000 0.311 73.730000 466.500000 PAL 76.615594

167 rows × 11 columns



features3_8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Hole    167 non-null    object
 1   From    167 non-null    float64
 2   Len     159 non-null    float64
 3   Den     167 non-null    float64
 4   Vp      167 non-null    float64
 5   Vs      167 non-null    float64
 6   Mag     167 non-null    float64
 7   Ip      167 non-null    float64
 8   Res     167 non-null    float64
 9   Form    167 non-null    object
 10  GR      167 non-null    float64
dtypes: float64(9), object(2)
memory usage: 14.5+ KB

Modeling

Having the GR already integrated into the petrophysical data of borehole TC-3660-008 we are ready for modeling, which involves a regression analysis to generate predictive numerical outputs. The steps required are:

  • Split the data for training and test or evaluation

  • Select and run the regression models

  • Test or evaluate the results of the different models

  • With the best ML algorithm, generate a synthetic GR for borehole TC-1319-005

Data Split

Here we selected the petrophysical features and target (GR) in borehole TC-3660-008 for modeling, leaving 20% (34) of them aside for testing or evaluation of the regression algorithms.

features for modeling

features_model_8 = features3_8.drop(columns=['Hole', 'Len', 'Form', 'GR'])
print(features_model_8.shape)
features_model_8.head()
(167, 7)
From Den Vp Vs Mag Ip Res
0 9.60 2.690 6226.0 3426.0 0.545 12.190000 18200.000000
1 12.86 2.702 6376.0 3467.0 0.435 11.260000 23000.000000
2 18.36 2.676 5810.0 3180.0 0.377 19.710000 6900.000000
3 22.90 2.686 6415.0 3434.0 0.143 26.681228 8414.846011
4 24.50 2.693 6095.0 3388.0 0.159 15.040000 13500.000000


target or objetive

target_8 = features3_8.GR
target_8
0      30.578377
1      30.455095
2      34.024174
3      23.740136
4      24.472588
         ...
162    59.538951
163    76.511278
164    68.944642
165    77.107392
166    76.615594
Name: GR, Length: 167, dtype: float64

Split and shape of data for training and testing

X_train, X_test, y_train, y_test = train_test_split(features_model_8, target_8, test_size=0.2, random_state=seed)

# X_train = np.array(X_train).flatten()
# y_train = np.array(y_train).flatten()

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)

print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)
X_train shape: (133, 7)
y_train shape: (133,)
X_test shape: (34, 7)
y_test shape: (34,)
Regression Models

In this analysis, we trained and evaluated nine regression models, some of which are influenced by random processes (those using the random_state=seed argument). After testing with different seed values (e.g., 123, 456, 789), we observed variations in the top-performing model, typically alternating between Gradient Boosting, Extreme Gradient Boosting, and Random Forest. To mitigate this variability and ensure a more reliable evaluation, we implemented cross-validation for the calculation of the metrics.

Regression models and metrics with a fixed seed

seed = 123

models = []

models.append(('LR', LinearRegression()))
models.append(('L', Lasso()))
models.append(('R', Ridge()))
models.append(('SVR', SVR()))
models.append(('KNR', KNeighborsRegressor()))
models.append(('GB', GradientBoostingRegressor(random_state=seed)))
models.append(('DTR', DecisionTreeRegressor(random_state=seed)))
models.append(('RFR', RandomForestRegressor(random_state=seed)))
models.append(('XGB', xgb.XGBRegressor(objective ='reg:squarederror')))

headers = ['Model', 'Sum  of Residuals', 'MAE', 'MSE', 'RMSE', 'R2']
rows = []

for name, model in models:
    model.fit(X_train, y_train)
    predict = model.predict(X_test)

    sum_residual = sum(y_test - predict)
    mae = mean_absolute_error(y_test, predict)
    mse = mean_squared_error(y_test, predict)
    rmse = np.sqrt(mse)
    score = model.score(X_test, y_test)

    rows.append([name, sum_residual, mae, mse, rmse, score])

print('Seed:', seed)
print(tabulate(rows, headers=headers, tablefmt="fancy_outline"))
Seed: 123
╒═════════╤═════════════════════╤══════════╤══════════╤══════════╤════════════╕
│ Model   │   Sum  of Residuals │      MAE │      MSE │     RMSE │         R2 │
╞═════════╪═════════════════════╪══════════╪══════════╪══════════╪════════════╡
│ LR      │            -4.86117 │  8.93602 │ 129.384  │ 11.3747  │  0.230243  │
│ L       │            11.9249  │  8.86367 │ 136.354  │ 11.6771  │  0.188777  │
│ R       │             9.17722 │  8.87722 │ 135.115  │ 11.6239  │  0.196147  │
│ SVR     │           165.92    │ 10.0965  │ 182.715  │ 13.5172  │ -0.0870428 │
│ KNR     │            44.6092  │  8.80595 │ 153.988  │ 12.4092  │  0.0838649 │
│ GB      │           -18.2136  │  7.03068 │  73.9925 │  8.60189 │  0.55979   │
│ DTR     │           -34.3359  │ 12.1705  │ 210.477  │ 14.5078  │ -0.252209  │
│ RFR     │           -11.9149  │  7.9933  │ 100.311  │ 10.0155  │  0.403212  │
│ XGB     │           -55.5957  │  7.40284 │  82.3656 │  9.07555 │  0.509975  │
╘═════════╧═════════════════════╧══════════╧══════════╧══════════╧════════════╛

Regression models and metrics with a fixed seed and cross-validation

seed = 123

models = []

models.append(('LR', LinearRegression()))
models.append(('L', Lasso()))
models.append(('R', Ridge()))
models.append(('SVR', SVR()))
models.append(('KNR', KNeighborsRegressor()))
models.append(('GB', GradientBoostingRegressor(random_state=seed)))
models.append(('DTR', DecisionTreeRegressor(random_state=seed)))
models.append(('RFR', RandomForestRegressor(random_state=seed)))
models.append(('XGB', xgb.XGBRegressor(objective ='reg:squarederror')))

headers = ['Model', 'MAE', 'MSE', 'RMSE', 'R2']
rows = []

n_folds = 5

rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)))

for name, model in models:
    scores_mae = cross_val_score(model, X_train, y_train, cv=n_folds, scoring='neg_mean_absolute_error')
    scores_mse = cross_val_score(model, X_train, y_train, cv=n_folds, scoring='neg_mean_squared_error')
    scores_rmse = cross_val_score(model, X_train, y_train, cv=n_folds, scoring=rmse_scorer)
    scores_r2 = cross_val_score(model, X_train, y_train, cv=n_folds, scoring='r2')

    mae = -np.mean(scores_mae)
    mse = -np.mean(scores_mse)
    rmse = np.mean(scores_rmse)
    r2 = np.mean(scores_r2)

    rows.append([name, mae, mse, rmse, r2])

print('Seed:', seed)
print(tabulate(rows, headers=headers, tablefmt="fancy_outline"))
Seed: 123
╒═════════╤═════════╤══════════╤══════════╤═══════════╕
│ Model   │     MAE │      MSE │     RMSE │        R2 │
╞═════════╪═════════╪══════════╪══════════╪═══════════╡
│ LR      │ 7.11785 │  92.1091 │  9.4983  │ 0.268285  │
│ L       │ 7.30283 │  94.0936 │  9.60407 │ 0.252065  │
│ R       │ 7.2349  │  93.1695 │  9.55375 │ 0.259637  │
│ SVR     │ 7.88615 │ 124.49   │ 11.1176  │ 0.0328943 │
│ KNR     │ 8.07405 │ 114.013  │ 10.6201  │ 0.106625  │
│ GB      │ 7.14677 │ 104.942  │ 10.1178  │ 0.183119  │
│ DTR     │ 8.12116 │ 128.029  │ 11.1078  │ 0.0187    │
│ RFR     │ 7.31101 │  97.9016 │  9.80004 │ 0.230271  │
│ XGB     │ 7.08965 │ 100.852  │  9.96463 │ 0.195461  │
╘═════════╧═════════╧══════════╧══════════╧═══════════╛

Best Regression Model

After applying cross-validation, Linear Regression (LR) emerged as the top-performing model with an R² of 0.265, followed by Ridge Regression (R) with an R² of 0.256, and Lasso Regression (L) with an R² of 0.247.

Next, we selected features from borehole C-1319-005 to predict GR, based on the best model trained on borehole TC-3660-008. The chosen model is LR, with additional predictions generated using XGB.

Petrophysical features of borehole TC-1319-005

features3_5 = features3[features3.Hole=='TC-1319-005']
features3_5 = features3_5.reset_index(drop=True)
features3_5
Hole From Len Den Vp Vs Mag Ip Res Form
0 TC-1319-005 265.77 11.2 2.610000 5725.879959 3073.767206 0.108 36.995337 5489.392555 CALP
1 TC-1319-005 271.66 11.8 2.786624 5514.000000 2222.000000 0.108 26.060000 4600.000000 CALP
2 TC-1319-005 277.05 10.3 3.399000 5754.000000 2085.000000 0.237 17.520000 10200.000000 CALP
3 TC-1319-005 282.55 11.2 2.800579 5773.000000 2011.000000 0.151 28.130000 5800.000000 CALP
4 TC-1319-005 287.20 10.8 3.293000 6034.000000 2122.000000 0.176 6.880000 15500.000000 CALP
... ... ... ... ... ... ... ... ... ... ...
156 TC-1319-005 1034.65 9.8 2.470000 4279.000000 1808.000000 0.028 34.740000 1900.000000 P
157 TC-1319-005 1038.37 9.5 2.559000 5163.000000 1975.000000 0.221 47.820000 1400.000000 LIS
158 TC-1319-005 1041.85 9.8 2.479000 4925.000000 2648.579509 0.054 61.170000 906.500000 LIS
159 TC-1319-005 1046.35 10.6 2.455000 5217.025190 2674.318401 0.158 44.440000 1300.000000 LIS
160 TC-1319-005 1049.20 10.0 2.470000 4673.000000 2595.011555 0.157 78.880000 187.100000 PAL

161 rows × 10 columns



Petrophysical features of borehole TC-1319-005 for modeling

features_model_5 = features3_5.drop(columns=['Hole', 'Len', 'Form'])
print(features_model_5.shape)
features_model_5.head()
(161, 7)
From Den Vp Vs Mag Ip Res
0 265.77 2.610000 5725.879959 3073.767206 0.108 36.995337 5489.392555
1 271.66 2.786624 5514.000000 2222.000000 0.108 26.060000 4600.000000
2 277.05 3.399000 5754.000000 2085.000000 0.237 17.520000 10200.000000
3 282.55 2.800579 5773.000000 2011.000000 0.151 28.130000 5800.000000
4 287.20 3.293000 6034.000000 2122.000000 0.176 6.880000 15500.000000


GR by Linear Regressor

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

gr5_lr = lr_model.predict(features_model_5)
gr5_lr
array([37.48012316, 40.29644384, 23.09190308, 44.8346237 , 27.11026391,
       35.69517186, 35.80346419, 42.98681624, 21.71842209, 50.42072407,
       43.10646152, 42.64311325, 36.81285745, 35.75724267, 24.77325061,
       38.88608028, 18.99647974, 18.34440846, 24.71733796, 49.77336379,
       31.50291417, 50.27812368, 12.68860118, 40.97470383, 41.35654788,
        9.46749403, 45.41312187, 57.11531171, 52.44942385, 55.00547771,
       54.99299988, 58.31717629, 63.00910746, 38.18780739, 43.40488394,
       53.54787669, 38.28788144, 54.77673522, 53.94068928, 50.00196899,
       53.33280443, 51.97077575, 43.94754544, 44.7692286 , 55.2820785 ,
       42.99287884, 43.53616197, 43.90804534, 45.1923788 , 43.3995711 ,
       52.41320378, 38.40547193, 49.45734775, 54.47626681, 35.21860407,
       40.97396549, 44.98800414, 52.50782359, 33.17683036, 56.93967984,
       31.80814629, 38.25475496, 58.61926752, 46.56422313, 55.22333646,
       48.95181455, 56.40445626, 56.21276624, 40.55208937, 44.54711027,
       34.30283305, 41.41804912, 56.44581239, 50.82112976, 38.51075176,
       42.79861258, 45.19983303, 50.65237446, 38.59099861, 52.70893187,
       48.80542021, 42.14681373, 62.89384827, 46.03558981, 41.10928562,
       48.24311094, 41.97767837, 45.75722795, 48.64259008, 48.654642  ,
       45.50473295, 40.33085351, 44.94789532, 43.92711492, 44.22012991,
       49.9342732 , 60.21970233, 62.67858876, 62.02673151, 52.49919054,
       46.20133463, 50.31771774, 44.34999683, 54.34472837, 32.06757116,
       46.52112522, 66.25802377, 58.53394378, 55.59449825, 61.29707256,
       51.95505826, 53.61999492, 51.86700364, 54.94377794, 54.86303988,
       59.14183574, 66.23808415, 55.51343349, 59.19185001, 52.3024468 ,
       48.61038302, 64.82166153, 46.11411479, 64.31701643, 63.04114529,
       52.15649911, 56.70399617, 51.08228262, 65.68280901, 59.08239203,
       51.20549995, 38.84526869, 62.50110662, 48.03486534, 54.3338539 ,
       53.59520789, 51.58907318, 51.19657462, 53.01946386, 47.82961527,
       71.89787018, 65.52042076, 54.58325456, 51.06668569, 68.48469698,
       64.49554711, 48.54356557, 51.14052389, 51.62244032, 50.396638  ,
       59.06133645, 68.13949395, 75.84718351, 74.02208004, 62.65090578,
       71.2700738 , 67.4500938 , 69.8234084 , 61.83761112, 64.01707194,
       64.17237329])

GR by Extreme Gradient Boosting Regressor

xgb_model = xgb.XGBRegressor(objective ='reg:squarederror')
xgb_model.fit(X_train, y_train)

gr5_xgb = xgb_model.predict(features_model_5)
gr5_xgb
array([34.75164 , 40.796104, 36.992447, 42.021427, 35.146862, 32.735653,
       31.437817, 44.679855, 25.369324, 37.394672, 37.02685 , 34.253475,
       41.54279 , 35.28796 , 37.156284, 27.499958, 24.900589, 59.073536,
       25.334816, 33.707527, 30.582035, 36.551727, 58.758087, 39.717037,
       37.68949 , 51.238953, 37.855713, 37.301083, 36.851   , 33.767624,
       33.657173, 35.071648, 39.557102, 36.829685, 33.846024, 37.697605,
       37.730824, 34.91515 , 36.3279  , 38.628242, 37.983574, 38.069294,
       38.160236, 38.536686, 38.399925, 38.082027, 39.27452 , 37.119156,
       39.841293, 45.398674, 35.81751 , 38.662144, 35.173935, 34.593754,
       33.010708, 39.53115 , 39.431847, 41.398838, 35.45763 , 35.191555,
       34.551064, 31.259857, 38.560356, 37.284836, 41.16499 , 38.056583,
       41.03737 , 35.434082, 38.705368, 39.224365, 36.462032, 36.3096  ,
       36.547577, 45.77963 , 41.261803, 33.92088 , 34.588528, 39.656155,
       39.15763 , 37.552555, 38.922432, 39.36756 , 38.62001 , 38.477215,
       41.59901 , 40.979454, 47.4265  , 53.00436 , 54.575684, 48.65887 ,
       52.29925 , 49.431183, 51.90922 , 49.391224, 52.66824 , 49.652584,
       48.35062 , 44.88772 , 53.68105 , 40.471123, 49.45793 , 51.189827,
       48.084194, 53.25084 , 50.381542, 59.261337, 65.46839 , 41.859623,
       46.78074 , 45.050728, 41.25792 , 49.75928 , 49.73571 , 60.701645,
       60.768967, 61.67022 , 61.20107 , 61.175545, 61.259388, 56.77341 ,
       55.918613, 51.66773 , 54.700287, 54.94038 , 57.015076, 53.717358,
       53.812485, 66.552025, 67.536354, 56.427017, 57.21038 , 49.67948 ,
       54.108974, 54.36197 , 55.57285 , 53.881783, 55.802116, 55.35852 ,
       54.312916, 55.4986  , 62.86732 , 53.801105, 54.689976, 54.729584,
       51.35334 , 66.19035 , 68.41729 , 55.093616, 68.71511 , 56.620056,
       53.18428 , 55.686577, 52.945896, 52.141483, 55.897858, 50.89671 ,
       54.226032, 51.24211 , 68.31796 , 50.653297, 74.86713 ],
      dtype=float32)

Merge the two synthetic GRs into the petrophysical data of borehole TC-1319-005

features3_5['GR_lr'], features3_5['GR_xgb']  = gr5_lr, gr5_xgb
features3_5
Hole From Len Den Vp Vs Mag Ip Res Form GR_lr GR_xgb
0 TC-1319-005 265.77 11.2 2.610000 5725.879959 3073.767206 0.108 36.995337 5489.392555 CALP 37.480123 34.751640
1 TC-1319-005 271.66 11.8 2.786624 5514.000000 2222.000000 0.108 26.060000 4600.000000 CALP 40.296444 40.796104
2 TC-1319-005 277.05 10.3 3.399000 5754.000000 2085.000000 0.237 17.520000 10200.000000 CALP 23.091903 36.992447
3 TC-1319-005 282.55 11.2 2.800579 5773.000000 2011.000000 0.151 28.130000 5800.000000 CALP 44.834624 42.021427
4 TC-1319-005 287.20 10.8 3.293000 6034.000000 2122.000000 0.176 6.880000 15500.000000 CALP 27.110264 35.146862
... ... ... ... ... ... ... ... ... ... ... ... ...
156 TC-1319-005 1034.65 9.8 2.470000 4279.000000 1808.000000 0.028 34.740000 1900.000000 P 67.450094 54.226032
157 TC-1319-005 1038.37 9.5 2.559000 5163.000000 1975.000000 0.221 47.820000 1400.000000 LIS 69.823408 51.242111
158 TC-1319-005 1041.85 9.8 2.479000 4925.000000 2648.579509 0.054 61.170000 906.500000 LIS 61.837611 68.317963
159 TC-1319-005 1046.35 10.6 2.455000 5217.025190 2674.318401 0.158 44.440000 1300.000000 LIS 64.017072 50.653297
160 TC-1319-005 1049.20 10.0 2.470000 4673.000000 2595.011555 0.157 78.880000 187.100000 PAL 64.172373 74.867126

161 rows × 12 columns



There are no NaNs in the merge data of borehole TC-1319-005

features3_5.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Hole    161 non-null    object
 1   From    161 non-null    float64
 2   Len     161 non-null    float64
 3   Den     161 non-null    float64
 4   Vp      161 non-null    float64
 5   Vs      161 non-null    float64
 6   Mag     161 non-null    float64
 7   Ip      161 non-null    float64
 8   Res     161 non-null    float64
 9   Form    161 non-null    object
 10  GR_lr   161 non-null    float64
 11  GR_xgb  161 non-null    float32
dtypes: float32(1), float64(9), object(2)
memory usage: 14.6+ KB

Understanding the Model

Beyond determining which regression model performs better, it is important to examine the coefficients of the Linear Regression (LR) model to understand the relative importance and influence of each feature. The magnitude and sign of these coefficients reveal which features contribute most to the target, whether positively or negatively.

The table and plot below (negative coefficients were inverted for the logarithm, and colored in red) show that Den has a strong negative impact, while Mag has a significant positive effect. In contrast, Vp, Vs, and Ip have much smaller impacts, and the influence of Res is insignificant. The intercept represents the baseline value when all features are zero.

X_train
From Den Vp Vs Mag Ip Res
28 140.35 2.633 5814.377420 3138.868627 0.090 31.808800 6387.399758
13 64.54 2.677 5643.825284 3011.626392 0.038 35.190000 2500.000000
74 399.33 2.829 5531.000000 3153.000000 0.103 67.780000 397.000000
24 120.40 2.723 5805.808808 3084.328392 0.071 30.704049 6429.850185
85 442.50 2.678 5682.000000 2890.000000 0.073 55.790000 4700.000000
... ... ... ... ... ... ... ...
17 84.42 2.689 5858.000000 3289.000000 0.031 20.760000 5300.000000
98 504.62 2.712 5598.000000 3445.000000 0.035 93.260000 1500.000000
66 349.95 2.693 5587.000000 3135.000000 0.042 24.750000 4200.000000
126 640.00 2.725 6037.000000 3367.000000 0.234 73.220000 2900.000000
109 559.00 2.721 5747.000000 3289.000000 0.078 31.310000 3700.000000

133 rows × 7 columns



Linear regressor coefficients

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

print('Features:', X_train.columns, '\n')
print("Coeficientes:", lr_model.coef_, '\n')
print("Intercepto:", lr_model.intercept_)
Features: Index(['From', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res'], dtype='object')

Coeficientes: [ 2.27539299e-02 -3.56367531e+01  4.98929080e-03 -1.28090298e-02
  1.19706930e+01  7.13230615e-02  1.08666969e-04]

Intercepto: 130.7206610710383

Linear regressor coefficients in table

features_list = list(X_train.columns)
features_list.append('Intercepto')

coef_list = list(lr_model.coef_)
coef_list.append(lr_model.intercept_)

table = list(zip(features_list, coef_list))
print(tabulate(table, headers=['Feature', 'Coefficient'], tablefmt="fancy_outline"))
╒════════════╤═══════════════╕
│ Feature    │   Coefficient │
╞════════════╪═══════════════╡
│ From       │   0.0227539   │
│ Den        │ -35.6368      │
│ Vp         │   0.00498929  │
│ Vs         │  -0.012809    │
│ Mag        │  11.9707      │
│ Ip         │   0.0713231   │
│ Res        │   0.000108667 │
│ Intercepto │ 130.721       │
╘════════════╧═══════════════╛

Linear regressor coefficients in linear and log10 plot

plt.figure(figsize=(13,5))

plt.subplot(121)
plt.bar(features_list, (coef_list), label='Positive')
plt.bar(features_list[1], (coef_list[1]), color='r', label='negative')
plt.legend()
plt.ylabel('Coeficients')
plt.grid()

plt.subplot(122)
plt.bar(features_list, np.log10(coef_list))
plt.bar(features_list[1], np.log10(-coef_list[1]), color='r')
plt.bar(features_list[3], np.log10(-coef_list[3]), color='r')
plt.ylabel('Log10[Coeficients]')
plt.grid()

plt.tight_layout;
03 no 3 modeling
<function tight_layout at 0x7f32d4e31360>

Graphic Comparison

The legacy GR data and major formation tops from borehole TC-3660-008 were integrated in the first notebook (1/3), provide a reference for comparing the new ML-generated GRs in borehole TC-1319-005. While the Linear Regression (LR) model achieved a higher R² score, the GR predictions generated by the XGB model (with an R² of only 0.170) more closely follow the patterns observed in the reference borehole TC-3660-008.

The original and ML-generated GRs are presented in the plot at the end of the section. To make the comparison easier between both ML-generated GRs, the GR generated by the XGB algorithm (in orange) was displayed to the right by adding 40 API units.

Load of tops from borehole TC-3660-008

# Leer el archivo JSON
with open(f'{base_path}/tops_list_8.json', 'r') as json_file:
    tops_list_8 = json.load(json_file)

tops_list_8
[{'name': 'CALP', 'top': 9.6, 'base': 270.95, 'color': '#CCDF9A'}, {'name': 'FLT', 'top': 270.95, 'base': 304.3, 'color': '#FF0000'}, {'name': 'CALP', 'top': 304.3, 'base': 345.9, 'color': '#CCDF9A'}, {'name': 'VISS', 'top': 345.9, 'base': 363.8, 'color': '#D9E9F0'}, {'name': 'FLT', 'top': 363.8, 'base': 369.26, 'color': '#FF0000'}, {'name': 'VISS', 'top': 369.26, 'base': 399.33, 'color': '#D9E9F0'}, {'name': 'FLT', 'top': 399.33, 'base': 401.88, 'color': '#FF0000'}, {'name': 'VISS', 'top': 401.88, 'base': 424.46, 'color': '#D9E9F0'}, {'name': 'FLT', 'top': 424.46, 'base': 460.16, 'color': '#FF0000'}, {'name': 'S', 'top': 460.16, 'base': 504.62, 'color': '#8CBAE2'}, {'name': 'LABL', 'top': 504.62, 'base': 724.5, 'color': '#8A6E9F'}, {'name': 'SP', 'top': 724.5, 'base': 785.08, 'color': '#ABA16C'}, {'name': 'P', 'top': 785.08, 'base': 814.55, 'color': '#EBDE98'}, {'name': 'LIS', 'top': 814.55, 'base': 825.6, 'color': '#806000'}, {'name': 'PAL', 'top': 825.6, 'base': 833.5, 'color': '#2A7C43'}]

Tops in borehole TC-1319-005

features3_5.Form.unique()
array(['CALP', 'LABL', 'SP', 'P', 'LIS', 'PAL'], dtype=object)

Extraction of top depths in borehole TC-1319-005

tops5 = pd.DataFrame(features3_5.From[features3_5.Form.ne(features3_5.Form.shift())])
tops5
From
0 265.77
46 503.40
115 830.45
131 911.10
157 1038.37
160 1049.20


Extraction tops names in borehole TC-1319-005

tops5['Top'] = features3_5.Form[features3_5.Form.ne(features3_5.Form.shift())]
tops5
From Top
0 265.77 CALP
46 503.40 LABL
115 830.45 SP
131 911.10 P
157 1038.37 LIS
160 1049.20 PAL


Reset of indexes

tops5 = tops5.reset_index(drop=True)
tops5
From Top
0 265.77 CALP
1 503.40 LABL
2 830.45 SP
3 911.10 P
4 1038.37 LIS
5 1049.20 PAL


Colors of the formations in borehole TC-1319-005

tops5['color'] = pd.Series([ '#CCDF9A', '#8A6E9F', '#ABA16C', '#EBDE98', '#806000', '#2A7C43'])
tops5
From Top color
0 265.77 CALP #CCDF9A
1 503.40 LABL #8A6E9F
2 830.45 SP #ABA16C
3 911.10 P #EBDE98
4 1038.37 LIS #806000
5 1049.20 PAL #2A7C43


Bottom of the last formation

new_row5 = pd.DataFrame([{'From':1059.20, 'Top': '', 'color':''}])
new_row5
From Top color
0 1059.2


Merge of dataframes

tops5 = pd.concat([tops5, new_row5], ignore_index=True)
tops5
From Top color
0 265.77 CALP #CCDF9A
1 503.40 LABL #8A6E9F
2 830.45 SP #ABA16C
3 911.10 P #EBDE98
4 1038.37 LIS #806000
5 1049.20 PAL #2A7C43
6 1059.20


Rename of columns of the dataframe

tops5 = tops5.rename(columns={'From':'depth', 'Top':'name'})
tops5
depth name color
0 265.77 CALP #CCDF9A
1 503.40 LABL #8A6E9F
2 830.45 SP #ABA16C
3 911.10 P #EBDE98
4 1038.37 LIS #806000
5 1049.20 PAL #2A7C43
6 1059.20


Convertion of the dataframe to a list of dictionaries by UDL, borehole TC-1319-005

tops_list_5 = geo.plot_tops(tops5)
tops_list_5
[{'name': 'CALP', 'top': np.float64(265.77), 'base': np.float64(503.4), 'color': '#CCDF9A'}, {'name': 'LABL', 'top': np.float64(503.4), 'base': np.float64(830.45), 'color': '#8A6E9F'}, {'name': 'SP', 'top': np.float64(830.45), 'base': np.float64(911.1), 'color': '#ABA16C'}, {'name': 'P', 'top': np.float64(911.1), 'base': np.float64(1038.37), 'color': '#EBDE98'}, {'name': 'LIS', 'top': np.float64(1038.37), 'base': np.float64(1049.2), 'color': '#806000'}, {'name': 'PAL', 'top': np.float64(1049.2), 'base': np.float64(1059.2), 'color': '#2A7C43'}]

Saving the tops to a json file

with open('Output/tops_list_5.json', 'w') as json_file:
    json.dump(tops_list_5, json_file, indent=4)

Plot of GRs with formations

plt.figure(figsize=(10, 8))
plt.subplot(141)
plt.plot(gr8.iloc[:,1], gr8.iloc[:,0], label='Real', c='lightblue')
# plt.plot(gr8.iloc[:,2].rolling(window=250).mean(), gr8.iloc[:,1], label='Mean 400')
plt.xlabel('GR (API)')

# Butterworth filter
b, a = butter(N=2, Wn=0.02, btype='low')
filtered_data = filtfilt(b, a, gr8.iloc[:,1])
plt.plot(filtered_data, gr8.iloc[:,0], label='BWF')
plt.legend()
plt.grid()
plt.xlabel('GR (API)')
plt.ylabel('Depth (m)')
plt.axis([0, 120, 1100, 0])

plt.subplot(142)
for i in range(0, len(tops_list_8)):
    plt.axhspan(tops_list_8[i]['top'], tops_list_8[i]['base'], color=tops_list_8[i]['color'])
    plt.text(122, tops_list_8[i]['top'], tops_list_8[i]['name'], fontsize=6, va='center')
plt.axis([0, 120, 1100, 0])
plt.xticks([])
plt.xlabel('Formations')

plt.subplot(143)
plt.plot(features3_5.GR_lr, features3_5.From, label='LR')
plt.plot(features3_5.GR_xgb + 40, features3_5.From, label='XGB + 40')
plt.legend()
plt.grid()
plt.xlabel('GR (API)')
plt.axis([0, 120, 1100, 0])

plt.subplot(144)
for i in range(0, len(tops_list_5)):
    plt.axhspan(tops_list_5[i]['top'], tops_list_5[i]['base'], color=tops_list_5[i]['color'])
    plt.text(122, tops_list_5[i]['top'], tops_list_5[i]['name'], fontsize=7, va='center')
plt.axis([0, 120, 1100, 0])
plt.xticks([])
plt.xlabel('Formations')

plt.suptitle('GR and Formations in Boreholes\nTC-3660-008                                                                   TC-1319-005', y=0.98)
plt.tight_layout()
plt.savefig('Output/all_grs.png', dpi=300)
plt.show()
GR and Formations in Boreholes TC-3660-008                                                                   TC-1319-005

Observations

Here are some observations related to the tasks covered in this notebook:

  • This notebook highlights the utility of open-source Python tools in addressing the issue of incomplete and sparse petrophysical data in mining. The various regression models employed demonstrate that machine learning techniques can be effective in filling data gaps, even when direct measurements are limited or unavailable.

  • More complex models, such as Extreme Gradient Boosting, provided higher predictive accuracy following the pattern of the reference GR, but simpler models like Linear Regression offered better interpretability. This trade-off between accuracy and interpretability is a crucial consideration for practical applications in mining data analysis.

  • The coefficients in the Linear Regression model indicated that variables such as Den and Mag significantly impact the target variable (GR), demonstrating strong correlation and critical predictive roles in petrophysical properties. In contrast, Vp and Res were less important, contributing minimally to the model’s predictions.

Final Remarks

The analysis of the petrophysical dataset from Collinstown highlights challenges related to data dispersion and scarcity, as well as the potential of machine learning (ML) tools to enhance the value of mining data.

Data dispersion, particularly in Den and Vp, complicates the generation of accurate models and reflects the inherent heterogeneity of the subsurface. Variations across boreholes and core samples, along with some false anomalies from measurement issues, must be addressed to ensure data integrity and reliable modeling. Low correlation (or low covariance) in the dataset, resulting from fractures and methodological limitations, leads to models with low R² values.

ML algorithms and Python-based tools have proven valuable for integrating legacy data with new sources, thereby improving dataset quality. Techniques such as imputation and regression modeling help fill gaps and enhance the resolution of petrophysical properties. Overall, ML tools present significant opportunities to improve legacy data and support its integration with new datasets, leading to more accurate mining models.

Both this analysis and the previous notebook demonstrate the effective use of open-source Python tools and ML models to address incomplete and sparse petrophysical data in mining exploration. This adaptable workflow can be applied across various geological contexts, benefiting other mining projects as well.

Next Steps

Here are some possible next steps to build on the work from this and the previous notebooks:

  • Consider exploring classification models, especially if there is interest in classifying different types of rocks or deposits based on petrophysical properties rather than predicting continuous variables.

  • Explore advanced feature engineering or transformation techniques to uncover more complex relationships between variables. This could potentially improve the model’s performance.

  • Further optimize the machine learning models by performing hyperparameter tuning. This can improve the accuracy of the predictions and refine the model selection process.

  • Apply the propagation of missing properties using the models obtained here to other boreholes with limited data.

These steps would extend the current analysis, enhance model accuracy, and contribute to more practical applications in mining exploration.

Total running time of the script: (0 minutes 3.688 seconds)

Gallery generated by Sphinx-Gallery