Modelling and Propagation of Legacy Petrophysical Data for Mining Exploration (3/3)¶
Modeling
Barcelona 25/09/24 GEO3BCN Manuel David Soto, Juan Alcalde, Adrià Hernàndez-Pineda, Ramón Carbonel
import dotenv
import os
Introduction¶
The dispersion and scarcity of petrophysical data are well-known challenges in the mining sector. These issues are primarily driven by economic factors, but also by geological conditions (such as sedimentary cover, weathering, erosion, or the depth of targets), geotechnical constraints (e.g., slope or borehole stability), and technical limitations or equipment availability that have long been resolved in other industries (for instance, sonic logs were not acquired in the past due to a lack of interest in velocity field data).
To address the challenge of sparse and incomplete petrophysical data in the mining sector, we have developed three Jupyter notebooks that tackle these issues through a suite of open-source Python tools. These tools support researchers in the initial exploration, visualization, and integration of data from diverse sources, filling gaps caused by technical limitations and ultimately enabling the complete modeling of missing properties (through standard and more advanced ML-based models). We applied these tools to both recently acquired and legacy petrophysical data from two cores northwest of Collinstown (County Westmeath, Province of Leinster), Ireland, located 26 km west-northwest of the Navan mine. However, these tools are adaptable and applicable to mining data from any location.
After the exploratory data analysis (notebook 1/3) and the filling of gaps in the petrophysical dataset of Collinstown (notebook 2/3), this third notebook focuses on using Machine Learning (ML) algorithms to model entirely missing variables, such as the Gamma Ray (GR) measurement, which is available in only one of the petrophysical boreholes. The tasks to perform in this notebook are:
Load and merge the legacy GR data of borehole TC-3660-008 into its petrophysical dataset.
Use the merged data to train and evaluate different ML algorithms capable of predicting GR data in other boreholes.
Using the best trained model, propagate the GR to the other petrophysical borehole.
Evaluate the possibility of propagating this (GR) or other properties to other boreholes with less available data.
As with the previous notebooks, these tasks are performed with open-source Python tools easily accessible by any researcher through a Python installation connected to the Internet.
Variables¶
The dataset used in this notebook is the imputed dataset from the previous notebook (2/3), features3. It contains the modelable petrophysical features of the two available boreholes, TC-1319-005 and TC-3660-008. The Hole (text object) and Len (float) variables are for reference, and Form (text object) is a categorical variable representing the Major Formations:
Name | Explanation | Unit
---|---|---
Hole | Hole name |
From | Top of the sample | m
Len | Length of the core sample | cm
Den | Density | g/cm³
Vp | Compressional velocity | m/s
Vs | Shear velocity | m/s
Mag | Magnetic susceptibility |
Ip | Chargeability or induced polarization | mV/V
Res | Resistivity | ohm·m
Form | Major formations or zone along the hole |
The GR measurement from borehole TC-3660-008 is the property we want to transfer to the other borehole, TC-1319-005. The challenge lies in the fact that both datasets have different ranges and sampling intervals, and it is not possible to build an ML model unless both datasets (new petrophysical features and legacy GR) are integrated into a single dataframe.
Libraries¶
The following are the Python libraries used throughout this notebook. PSL are Python Standard Libraries, UDL are User Defined Libraries, and PEL are Python External Libraries:
# PSL
import sys
import warnings
# UDL
from vector_geology import basic_stat, geo
# PEL - Basic
import numpy as np
import pandas as pd
from tabulate import tabulate
import json
# PEL - Plotting
import matplotlib.pyplot as plt
# PEL - Filtering
from scipy.signal import butter, filtfilt
# PEL - Data selection, transformation, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# PEL - ML algorithms
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
# PEL - Metrics
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, r2_score
Settings¶
Seed of the random processes
seed = 123
# Warning suppression
warnings.filterwarnings("ignore")
Data Loading¶
Load of features3, the imputed features from the previous notebook
features3 = pd.read_csv('Output/features3.csv', index_col=0)
features3.head()
features3.describe()
Columns in features3
features3.columns
Index(['Hole', 'From', 'Len', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res', 'Form'], dtype='object')
Imputed features of borehole TC-3660-008
features3_8 = features3[features3.Hole=='TC-3660-008']
features3_8 = features3_8.reset_index(drop=True)
features3_8
Depth information of features3_8
print('Petrophysical features of TC-3660-008', '\n')
print('Top:', features3_8.From[0])
print('Base:', features3_8.From[len(features3_8) - 1], '\n')
print('Length:', len(features3_8))
print('Range:', features3_8.From[len(features3_8) - 1] - features3_8.From[0], '\n')
print('First step: {:.1f}'.format(features3_8.From[1] - features3_8.From[0]))
# Mean step
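# (equivalently: np.mean(np.diff(features3_8.From)))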
steps = []
for i in range(len(features3_8) - 1):
steps.append(features3_8.From[i+1] - features3_8.From[i])
print('Mean step: {:.2f}'.format(np.mean(steps)))
Petrophysical features of TC-3660-008
Top: 9.600000000000025
Base: 833.4
Length: 168
Range: 823.8
First step: 3.3
Mean step: 4.93
Legacy GR data of different holes
dotenv.load_dotenv()
base_path = os.getenv("PATH_TO_COLLINSTOWN_PETRO")
gr = pd.read_csv(f'{base_path}/collinstown_Gamma.csv')
gr.head()
Different holes in the legacy GR dataframe
gr.HOLEID.unique()
array(['TC-3660-007', 'TC-3660-008', 'TC-3839-002'], dtype=object)
Legacy GR of borehole TC-3660-008
gr8 = gr[gr.HOLEID == 'TC-3660-008'].reset_index(drop=True)
gr8
gr8.describe()
Depth information of the legacy GR
print('Legacy GR of TC-3660-008', '\n')
print('Top:', gr8.DEPTH[0])
print('Base:', gr8.DEPTH[len(gr8) - 1], '\n')
print('Length:', len(gr8))
print('Range:', gr8.DEPTH[len(gr8) - 1] - gr8.DEPTH[2], '\n')
print('First step: {:.1f}'.format(gr8.DEPTH[3] - gr8.DEPTH[2]))
# Mean step
steps = []
for i in range(2, len(gr8) - 1):
steps.append(gr8.DEPTH[i+1] - gr8.DEPTH[i])
print('Mean step: {:.2f}'.format(np.mean(steps)))
Legacy GR of TC-3660-008
Top: -0.2
Base: 830.8
Length: 8723
Range: 830.8
First step: 0.1
Mean step: 0.10
Data Merging¶
To model a new GR in borehole TC-1319-005, the legacy GR (gr8) and the petrophysical data of borehole TC-3660-008 (features3_8) have to be integrated into a single dataframe, free of NaNs.
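As an aside, a nearest-depth join could also be written directly with pandas' merge_asof. The sketch below is only an illustration (it assumes both frames are sorted by their depth columns) and is not used here, because filtering plus downsampling handles the noisy, densely sampled GR better:
# Illustrative alternative (not used below): join each sample to the nearest GR reading
merged_nearest = pd.merge_asof(features3_8.sort_values('From'), gr8.sort_values('DEPTH'),
                               left_on='From', right_on='DEPTH', direction='nearest')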
Depth Equalization¶
The first step in merging the legacy GR into the petrophysical data of borehole TC-3660-008 is to equalize the depths of both dataframes by means of conditional expressions:
Equalizing depths in features3_8
features3_8 = features3_8[features3_8.From <= 830.8].reset_index(drop=True)
features3_8
New depth information of features3_8
print('Petrophysical features of TC-3660-008', '\n')
print('Top:', features3_8.From[0])
print('Base:', features3_8.From[len(features3_8) - 1], '\n')
print('Length:', len(features3_8))
print('Range:', features3_8.From[len(features3_8) - 1] - features3_8.From[0], '\n')
print('First step: {:.1f}'.format(features3_8.From[1] - features3_8.From[0]))
# Mean step
steps = []
for i in range(len(features3_8) - 1):
steps.append(features3_8.From[i+1] - features3_8.From[i])
print('Mean step: {:.2f}'.format(np.mean(steps)))
Petrophysical features of TC-3660-008
Top: 9.600000000000025
Base: 828.8
Length: 167
Range: 819.1999999999999
First step: 3.3
Mean step: 4.93
Equalizing depths in the legacy GR
gr8 = gr8[(gr8.DEPTH >= 9.6) & (gr8.DEPTH <= 828.8)].reset_index(drop=True)
gr8
New depth information of the legacy GR
print('Legacy GR of TC-3660-008', '\n')
print('Top:', gr8.DEPTH[0])
print('Base:', gr8.DEPTH[len(gr8) - 1], '\n')
print('Length:', len(gr8))
print('Range:', gr8.DEPTH[len(gr8) - 1] - gr8.DEPTH[2], '\n')
print('First step: {:.1f}'.format(gr8.DEPTH[3] - gr8.DEPTH[2]))
# Mean step
steps = []
for i in range(2, len(gr8) - 1):
steps.append(gr8.DEPTH[i+1] - gr8.DEPTH[i])
print('Mean step: {:.2f}'.format(np.mean(steps)))
Legacy GR of TC-3660-008
Top: 9.6
Base: 828.8
Length: 8605
Range: 819.0
First step: 0.1
Mean step: 0.10
Filtering and Downsampling¶
Now that the depths of the legacy GR and the petrophysical data of borehole TC-3660-008 have been equalized, the next challenge is to merge the legacy GR into the petrophysical data. This involves condensing the 8605 legacy GR values into a new petrophysical feature with 167 values that can be used in the models. This task has been achieved by combining a Butterworth filter and a downsampling function, an ideal combination for data with fast fluctuations such as the GR log (downsampling alone produces a very spiky result). The effect of the filter, the downsampling, and the filter followed by downsampling can be appreciated at the end of this section.
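A note on the filter parameter used below: in scipy's butter, Wn is the cutoff frequency normalized by the Nyquist frequency. A minimal sanity check, assuming the ~0.1 m sampling interval reported above, translates Wn = 0.02 into depth units:
# Hedged sanity check, assuming the ~0.1 m sample interval of the legacy GR
fs = 1 / 0.1             # sampling rate, samples per metre
nyquist = fs / 2         # 5 cycles per metre
cutoff = 0.02 * nyquist  # Wn = 0.02 -> 0.1 cycles per metre
print('Cutoff: {:.1f} cycles/m, so variations shorter than ~{:.0f} m are smoothed out'.format(cutoff, 1 / cutoff))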
Simplify the legacy GR
gr8 = gr8[['DEPTH', 'GeoPhys_NGAM_API']]
gr8
Rename series in the legacy GR
gr8 = gr8.rename(columns={'DEPTH':'depth', 'GeoPhys_NGAM_API':'gr'})
gr8
Downsampling the legacy GR but keeping the first and last value
reduc = int((len(gr8)) / 165)
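# reduc (about 52 here) steps through the interior rows so that roughly 165 samples remain between the endpoints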
gr8_middle = gr8.iloc[1:-1].iloc[::reduc, :]
gr8_down = pd.concat([gr8.iloc[[0]], gr8_middle, gr8.iloc[[-1]]])
gr8_down = gr8_down.sort_values('depth').reset_index(drop=True)
gr8_down
Butterworth filter on the legacy GR
b, a = butter(N=2, Wn=0.02, btype='low')
filtered_gr = filtfilt(b, a, gr8.gr)
gr8['gr_fil'] = filtered_gr
gr8 = gr8.sort_values('depth').reset_index(drop=True)
gr8
Downsampling the filtered GR
gr8_fil_down_middle = gr8.iloc[1:-1].iloc[::reduc, :]
gr8_fil_down = pd.concat([gr8.iloc[[0]], gr8_fil_down_middle, gr8.iloc[[-1]]])
# Sort by depth
gr8_fil_down = gr8_fil_down.sort_values('depth').reset_index(drop=True)
gr8_fil_down
Plot of filter and/or downsampling
plt.figure(figsize=(15, 8))
plt.subplot(121)
plt.plot(gr8.gr, gr8.depth, label='Original')
plt.legend()
plt.grid()
plt.xlabel('GR (API)')
plt.ylabel('Depth (m)')
plt.axis([0, 120, 850, 0])
plt.ylim(850, 0)
plt.subplot(122)
plt.plot(gr8_down.gr, gr8_down.depth, label='Downsampling')
plt.plot(gr8.gr_fil, gr8.depth, label='Butterworth Filtered')
plt.plot(gr8_fil_down.gr_fil, gr8_fil_down.depth, label='Butterworth Filtered & Downsampled')
plt.legend()
plt.grid()
plt.xlabel('GR (API)')
plt.ylabel('Depth (m)')
plt.axis([0, 120, 850, 0])
plt.ylim(850, 0)
plt.suptitle('Borehole TC-3660-008 GR', y=0.98)
plt.tight_layout()
Integrated Dataframe¶
As the downsampling gave 168 values, we dropped the central value to reach 167 (the length of the petrophysical dataset) and then fused the filtered and downsampled GR into the petrophysical data of borehole TC-3660-008.
Drop of the central value in the filtered and downsampled GR
del_index = len(gr8_fil_down)//2
gr8_fil_down = gr8_fil_down.drop(axis=0, index=del_index).reset_index(drop=True)
gr8_fil_down
Final integrated dataframe without NaNs
features3_8['GR'] = gr8_fil_down.gr_fil
features3_8
features3_8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Hole 167 non-null object
1 From 167 non-null float64
2 Len 159 non-null float64
3 Den 167 non-null float64
4 Vp 167 non-null float64
5 Vs 167 non-null float64
6 Mag 167 non-null float64
7 Ip 167 non-null float64
8 Res 167 non-null float64
9 Form 167 non-null object
10 GR 167 non-null float64
dtypes: float64(9), object(2)
memory usage: 14.5+ KB
Modeling¶
Having the GR integrated into the petrophysical data of borehole TC-3660-008, we are ready for modeling, which involves a regression analysis to generate predictive numerical outputs. The steps required are:
Split the data for training and test or evaluation
Select and run the regression models
Test or evaluate the results of the different models
With the best ML algorithm, generate a synthetic GR for borehole TC-1319-005
Data Split¶
Here we selected the petrophysical features and the target (GR) in borehole TC-3660-008 for modeling, leaving 20% of the samples (34) aside for testing or evaluation of the regression algorithms.
Features for modeling
features_model_8 = features3_8.drop(columns=['Hole', 'Len', 'Form', 'GR'])
print(features_model_8.shape)
features_model_8.head()
(167, 7)
Target or objective
target_8 = features3_8.GR
target_8
0 30.578377
1 30.455095
2 34.024174
3 23.740136
4 24.472588
...
162 59.538951
163 76.511278
164 68.944642
165 77.107392
166 76.615594
Name: GR, Length: 167, dtype: float64
Split and shape of data for training and testing
X_train, X_test, y_train, y_test = train_test_split(features_model_8, target_8, test_size=0.2, random_state=seed)
# X_train = np.array(X_train).flatten()
# y_train = np.array(y_train).flatten()
print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)
X_train shape: (133, 7)
y_train shape: (133,)
X_test shape: (34, 7)
y_test shape: (34,)
Regression Models¶
In this analysis, we trained and evaluated nine regression models, some of which are influenced by random processes (those using the random_state=seed argument). After testing with different seed values (e.g., 123, 456, 789), we observed variations in the top-performing model, typically alternating between Gradient Boosting, Extreme Gradient Boosting, and Random Forest. To mitigate this variability and ensure a more reliable evaluation, we implemented cross-validation for the calculation of the metrics.
Regression models and metrics with a fixed seed
seed = 123
models = []
models.append(('LR', LinearRegression()))
models.append(('L', Lasso()))
models.append(('R', Ridge()))
models.append(('SVR', SVR()))
models.append(('KNR', KNeighborsRegressor()))
models.append(('GB', GradientBoostingRegressor(random_state=seed)))
models.append(('DTR', DecisionTreeRegressor(random_state=seed)))
models.append(('RFR', RandomForestRegressor(random_state=seed)))
models.append(('XGB', xgb.XGBRegressor(objective ='reg:squarederror')))
headers = ['Model', 'Sum of Residuals', 'MAE', 'MSE', 'RMSE', 'R2']
rows = []
for name, model in models:
model.fit(X_train, y_train)
predict = model.predict(X_test)
sum_residual = sum(y_test - predict)
mae = mean_absolute_error(y_test, predict)
mse = mean_squared_error(y_test, predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
rows.append([name, sum_residual, mae, mse, rmse, score])
print('Seed:', seed)
print(tabulate(rows, headers=headers, tablefmt="fancy_outline"))
Seed: 123
╒═════════╤═════════════════════╤══════════╤══════════╤══════════╤════════════╕
│ Model │ Sum of Residuals │ MAE │ MSE │ RMSE │ R2 │
╞═════════╪═════════════════════╪══════════╪══════════╪══════════╪════════════╡
│ LR │ -4.86117 │ 8.93602 │ 129.384 │ 11.3747 │ 0.230243 │
│ L │ 11.9249 │ 8.86367 │ 136.354 │ 11.6771 │ 0.188777 │
│ R │ 9.17722 │ 8.87722 │ 135.115 │ 11.6239 │ 0.196147 │
│ SVR │ 165.92 │ 10.0965 │ 182.715 │ 13.5172 │ -0.0870428 │
│ KNR │ 44.6092 │ 8.80595 │ 153.988 │ 12.4092 │ 0.0838649 │
│ GB │ -18.2136 │ 7.03068 │ 73.9925 │ 8.60189 │ 0.55979 │
│ DTR │ -34.3359 │ 12.1705 │ 210.477 │ 14.5078 │ -0.252209 │
│ RFR │ -11.9149 │ 7.9933 │ 100.311 │ 10.0155 │ 0.403212 │
│ XGB │ -55.5957 │ 7.40284 │ 82.3656 │ 9.07555 │ 0.509975 │
╘═════════╧═════════════════════╧══════════╧══════════╧══════════╧════════════╛
Regression models and metrics with a fixed seed and cross-validation
seed = 123
models = []
models.append(('LR', LinearRegression()))
models.append(('L', Lasso()))
models.append(('R', Ridge()))
models.append(('SVR', SVR()))
models.append(('KNR', KNeighborsRegressor()))
models.append(('GB', GradientBoostingRegressor(random_state=seed)))
models.append(('DTR', DecisionTreeRegressor(random_state=seed)))
models.append(('RFR', RandomForestRegressor(random_state=seed)))
models.append(('XGB', xgb.XGBRegressor(objective ='reg:squarederror')))
headers = ['Model', 'MAE', 'MSE', 'RMSE', 'R2']
rows = []
n_folds = 5
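# With make_scorer's default greater_is_better=True, cross_val_score returns the raw (positive) RMSE values, so no sign flip is needed below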
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)))
for name, model in models:
scores_mae = cross_val_score(model, X_train, y_train, cv=n_folds, scoring='neg_mean_absolute_error')
scores_mse = cross_val_score(model, X_train, y_train, cv=n_folds, scoring='neg_mean_squared_error')
scores_rmse = cross_val_score(model, X_train, y_train, cv=n_folds, scoring=rmse_scorer)
scores_r2 = cross_val_score(model, X_train, y_train, cv=n_folds, scoring='r2')
mae = -np.mean(scores_mae)
mse = -np.mean(scores_mse)
rmse = np.mean(scores_rmse)
r2 = np.mean(scores_r2)
rows.append([name, mae, mse, rmse, r2])
print('Seed:', seed)
print(tabulate(rows, headers=headers, tablefmt="fancy_outline"))
Seed: 123
╒═════════╤═════════╤══════════╤══════════╤═══════════╕
│ Model │ MAE │ MSE │ RMSE │ R2 │
╞═════════╪═════════╪══════════╪══════════╪═══════════╡
│ LR │ 7.11785 │ 92.1091 │ 9.4983 │ 0.268285 │
│ L │ 7.30283 │ 94.0936 │ 9.60407 │ 0.252065 │
│ R │ 7.2349 │ 93.1695 │ 9.55375 │ 0.259637 │
│ SVR │ 7.88615 │ 124.49 │ 11.1176 │ 0.0328943 │
│ KNR │ 8.07405 │ 114.013 │ 10.6201 │ 0.106625 │
│ GB │ 7.14677 │ 104.942 │ 10.1178 │ 0.183119 │
│ DTR │ 8.12116 │ 128.029 │ 11.1078 │ 0.0187 │
│ RFR │ 7.31101 │ 97.9016 │ 9.80004 │ 0.230271 │
│ XGB │ 7.08965 │ 100.852 │ 9.96463 │ 0.195461 │
╘═════════╧═════════╧══════════╧══════════╧═══════════╛
Best Regression Model¶
After applying cross-validation, Linear Regression (LR) emerged as the top-performing model with a mean R² of 0.268, followed by Ridge Regression (R) with 0.260 and Lasso Regression (L) with 0.252.
Next, we selected features from borehole TC-1319-005 to predict GR, based on the best model trained on borehole TC-3660-008. The chosen model is LR, with additional predictions generated using XGB.
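As a quick programmatic check, the top model can be read from the rows list built for the cross-validation table above (each row is [name, MAE, MSE, RMSE, R2]); this is just a convenience sketch:
# Pick the model with the highest mean cross-validated R2
best = max(rows, key=lambda row: row[-1])
print('Best model by mean CV R2: {} ({:.3f})'.format(best[0], best[-1]))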
Petrophysical features of borehole TC-1319-005
features3_5 = features3[features3.Hole=='TC-1319-005']
features3_5 = features3_5.reset_index(drop=True)
features3_5
Petrophysical features of borehole TC-1319-005 for modeling
features_model_5 = features3_5.drop(columns=['Hole', 'Len', 'Form'])
print(features_model_5.shape)
features_model_5.head()
(161, 7)
GR by Linear Regressor
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
gr5_lr = lr_model.predict(features_model_5)
gr5_lr
array([37.48012316, 40.29644384, 23.09190308, 44.8346237 , 27.11026391,
35.69517186, 35.80346419, 42.98681624, 21.71842209, 50.42072407,
43.10646152, 42.64311325, 36.81285745, 35.75724267, 24.77325061,
38.88608028, 18.99647974, 18.34440846, 24.71733796, 49.77336379,
31.50291417, 50.27812368, 12.68860118, 40.97470383, 41.35654788,
9.46749403, 45.41312187, 57.11531171, 52.44942385, 55.00547771,
54.99299988, 58.31717629, 63.00910746, 38.18780739, 43.40488394,
53.54787669, 38.28788144, 54.77673522, 53.94068928, 50.00196899,
53.33280443, 51.97077575, 43.94754544, 44.7692286 , 55.2820785 ,
42.99287884, 43.53616197, 43.90804534, 45.1923788 , 43.3995711 ,
52.41320378, 38.40547193, 49.45734775, 54.47626681, 35.21860407,
40.97396549, 44.98800414, 52.50782359, 33.17683036, 56.93967984,
31.80814629, 38.25475496, 58.61926752, 46.56422313, 55.22333646,
48.95181455, 56.40445626, 56.21276624, 40.55208937, 44.54711027,
34.30283305, 41.41804912, 56.44581239, 50.82112976, 38.51075176,
42.79861258, 45.19983303, 50.65237446, 38.59099861, 52.70893187,
48.80542021, 42.14681373, 62.89384827, 46.03558981, 41.10928562,
48.24311094, 41.97767837, 45.75722795, 48.64259008, 48.654642 ,
45.50473295, 40.33085351, 44.94789532, 43.92711492, 44.22012991,
49.9342732 , 60.21970233, 62.67858876, 62.02673151, 52.49919054,
46.20133463, 50.31771774, 44.34999683, 54.34472837, 32.06757116,
46.52112522, 66.25802377, 58.53394378, 55.59449825, 61.29707256,
51.95505826, 53.61999492, 51.86700364, 54.94377794, 54.86303988,
59.14183574, 66.23808415, 55.51343349, 59.19185001, 52.3024468 ,
48.61038302, 64.82166153, 46.11411479, 64.31701643, 63.04114529,
52.15649911, 56.70399617, 51.08228262, 65.68280901, 59.08239203,
51.20549995, 38.84526869, 62.50110662, 48.03486534, 54.3338539 ,
53.59520789, 51.58907318, 51.19657462, 53.01946386, 47.82961527,
71.89787018, 65.52042076, 54.58325456, 51.06668569, 68.48469698,
64.49554711, 48.54356557, 51.14052389, 51.62244032, 50.396638 ,
59.06133645, 68.13949395, 75.84718351, 74.02208004, 62.65090578,
71.2700738 , 67.4500938 , 69.8234084 , 61.83761112, 64.01707194,
64.17237329])
GR by Extreme Gradient Boosting Regressor
xgb_model = xgb.XGBRegressor(objective ='reg:squarederror')
xgb_model.fit(X_train, y_train)
gr5_xgb = xgb_model.predict(features_model_5)
gr5_xgb
array([34.75164 , 40.796104, 36.992447, 42.021427, 35.146862, 32.735653,
31.437817, 44.679855, 25.369324, 37.394672, 37.02685 , 34.253475,
41.54279 , 35.28796 , 37.156284, 27.499958, 24.900589, 59.073536,
25.334816, 33.707527, 30.582035, 36.551727, 58.758087, 39.717037,
37.68949 , 51.238953, 37.855713, 37.301083, 36.851 , 33.767624,
33.657173, 35.071648, 39.557102, 36.829685, 33.846024, 37.697605,
37.730824, 34.91515 , 36.3279 , 38.628242, 37.983574, 38.069294,
38.160236, 38.536686, 38.399925, 38.082027, 39.27452 , 37.119156,
39.841293, 45.398674, 35.81751 , 38.662144, 35.173935, 34.593754,
33.010708, 39.53115 , 39.431847, 41.398838, 35.45763 , 35.191555,
34.551064, 31.259857, 38.560356, 37.284836, 41.16499 , 38.056583,
41.03737 , 35.434082, 38.705368, 39.224365, 36.462032, 36.3096 ,
36.547577, 45.77963 , 41.261803, 33.92088 , 34.588528, 39.656155,
39.15763 , 37.552555, 38.922432, 39.36756 , 38.62001 , 38.477215,
41.59901 , 40.979454, 47.4265 , 53.00436 , 54.575684, 48.65887 ,
52.29925 , 49.431183, 51.90922 , 49.391224, 52.66824 , 49.652584,
48.35062 , 44.88772 , 53.68105 , 40.471123, 49.45793 , 51.189827,
48.084194, 53.25084 , 50.381542, 59.261337, 65.46839 , 41.859623,
46.78074 , 45.050728, 41.25792 , 49.75928 , 49.73571 , 60.701645,
60.768967, 61.67022 , 61.20107 , 61.175545, 61.259388, 56.77341 ,
55.918613, 51.66773 , 54.700287, 54.94038 , 57.015076, 53.717358,
53.812485, 66.552025, 67.536354, 56.427017, 57.21038 , 49.67948 ,
54.108974, 54.36197 , 55.57285 , 53.881783, 55.802116, 55.35852 ,
54.312916, 55.4986 , 62.86732 , 53.801105, 54.689976, 54.729584,
51.35334 , 66.19035 , 68.41729 , 55.093616, 68.71511 , 56.620056,
53.18428 , 55.686577, 52.945896, 52.141483, 55.897858, 50.89671 ,
54.226032, 51.24211 , 68.31796 , 50.653297, 74.86713 ],
dtype=float32)
Merge the two synthetic GRs into the petrophysical data of borehole TC-1319-005
features3_5['GR_lr'], features3_5['GR_xgb'] = gr5_lr, gr5_xgb
features3_5
There are no NaNs in the merged data of borehole TC-1319-005
features3_5.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Hole 161 non-null object
1 From 161 non-null float64
2 Len 161 non-null float64
3 Den 161 non-null float64
4 Vp 161 non-null float64
5 Vs 161 non-null float64
6 Mag 161 non-null float64
7 Ip 161 non-null float64
8 Res 161 non-null float64
9 Form 161 non-null object
10 GR_lr 161 non-null float64
11 GR_xgb 161 non-null float32
dtypes: float32(1), float64(9), object(2)
memory usage: 14.6+ KB
Understanding the Model¶
Beyond determining which regression model performs better, it is important to examine the coefficients of the Linear Regression (LR) model to understand the relative importance and influence of each feature. The magnitude and sign of these coefficients reveal which features contribute most to the target, whether positively or negatively.
The table and plot below (negative coefficients were inverted for the logarithm and colored in red) show that Den has a strong negative impact, while Mag has a significant positive effect. In contrast, Vp, Vs, and Ip have much smaller impacts, and the influence of Res is insignificant. The intercept represents the baseline value when all features are zero.
X_train
Linear regressor coefficients
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
print('Features:', X_train.columns, '\n')
print("Coeficientes:", lr_model.coef_, '\n')
print("Intercepto:", lr_model.intercept_)
Features: Index(['From', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res'], dtype='object')
Coefficients: [ 2.27539299e-02 -3.56367531e+01 4.98929080e-03 -1.28090298e-02
1.19706930e+01 7.13230615e-02 1.08666969e-04]
Intercept: 130.7206610710383
Linear regressor coefficients in table
features_list = list(X_train.columns)
features_list.append('Intercept')
coef_list = list(lr_model.coef_)
coef_list.append(lr_model.intercept_)
table = list(zip(features_list, coef_list))
print(tabulate(table, headers=['Feature', 'Coefficient'], tablefmt="fancy_outline"))
╒════════════╤═══════════════╕
│ Feature │ Coefficient │
╞════════════╪═══════════════╡
│ From │ 0.0227539 │
│ Den │ -35.6368 │
│ Vp │ 0.00498929 │
│ Vs │ -0.012809 │
│ Mag │ 11.9707 │
│ Ip │ 0.0713231 │
│ Res │ 0.000108667 │
│ Intercept  │ 130.721       │
╘════════════╧═══════════════╛
Linear regressor coefficients in linear and log10 plot
plt.figure(figsize=(13,5))
plt.subplot(121)
plt.bar(features_list, coef_list, label='Positive')
plt.bar(features_list[1], coef_list[1], color='r', label='Negative')
plt.legend()
plt.ylabel('Coefficients')
plt.grid()
plt.subplot(122)
plt.bar(features_list, np.log10(coef_list))
plt.bar(features_list[1], np.log10(-coef_list[1]), color='r')
plt.bar(features_list[3], np.log10(-coef_list[3]), color='r')
plt.ylabel('Log10[Coefficients]')
plt.grid()
plt.tight_layout()
Graphic Comparison¶
The legacy GR data and major formation tops from borehole TC-3660-008, integrated in the first notebook (1/3), provide a reference for comparing the new ML-generated GRs in borehole TC-1319-005. While the Linear Regression (LR) model achieved a higher cross-validated R² score, the GR predictions generated by the XGB model (with a mean R² of only 0.195) more closely follow the patterns observed in the reference borehole TC-3660-008.
The original and ML-generated GRs are presented in the plot at the end of this section. To ease the comparison between both ML-generated GRs, the GR generated by the XGB algorithm (in orange) was displaced to the right by adding 40 API units.
Load of tops from borehole TC-3660-008
# Read the JSON file
with open(f'{base_path}/tops_list_8.json', 'r') as json_file:
tops_list_8 = json.load(json_file)
tops_list_8
[{'name': 'CALP', 'top': 9.6, 'base': 270.95, 'color': '#CCDF9A'}, {'name': 'FLT', 'top': 270.95, 'base': 304.3, 'color': '#FF0000'}, {'name': 'CALP', 'top': 304.3, 'base': 345.9, 'color': '#CCDF9A'}, {'name': 'VISS', 'top': 345.9, 'base': 363.8, 'color': '#D9E9F0'}, {'name': 'FLT', 'top': 363.8, 'base': 369.26, 'color': '#FF0000'}, {'name': 'VISS', 'top': 369.26, 'base': 399.33, 'color': '#D9E9F0'}, {'name': 'FLT', 'top': 399.33, 'base': 401.88, 'color': '#FF0000'}, {'name': 'VISS', 'top': 401.88, 'base': 424.46, 'color': '#D9E9F0'}, {'name': 'FLT', 'top': 424.46, 'base': 460.16, 'color': '#FF0000'}, {'name': 'S', 'top': 460.16, 'base': 504.62, 'color': '#8CBAE2'}, {'name': 'LABL', 'top': 504.62, 'base': 724.5, 'color': '#8A6E9F'}, {'name': 'SP', 'top': 724.5, 'base': 785.08, 'color': '#ABA16C'}, {'name': 'P', 'top': 785.08, 'base': 814.55, 'color': '#EBDE98'}, {'name': 'LIS', 'top': 814.55, 'base': 825.6, 'color': '#806000'}, {'name': 'PAL', 'top': 825.6, 'base': 833.5, 'color': '#2A7C43'}]
Tops in borehole TC-1319-005
features3_5.Form.unique()
array(['CALP', 'LABL', 'SP', 'P', 'LIS', 'PAL'], dtype=object)
Extraction of top depths in borehole TC-1319-005
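# Form.ne(Form.shift()) is True where the formation name differs from the previous row, i.e., at each formation top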
tops5 = pd.DataFrame(features3_5.From[features3_5.Form.ne(features3_5.Form.shift())])
tops5
Extraction of top names in borehole TC-1319-005
tops5['Top'] = features3_5.Form[features3_5.Form.ne(features3_5.Form.shift())]
tops5
Reset of indexes
tops5 = tops5.reset_index(drop=True)
tops5
Colors of the formations in borehole TC-1319-005
tops5['color'] = pd.Series([ '#CCDF9A', '#8A6E9F', '#ABA16C', '#EBDE98', '#806000', '#2A7C43'])
tops5
Bottom of the last formation
new_row5 = pd.DataFrame([{'From':1059.20, 'Top': '', 'color':''}])
new_row5
Merge of dataframes
tops5 = pd.concat([tops5, new_row5], ignore_index=True)
tops5
Rename of columns of the dataframe
tops5 = tops5.rename(columns={'From':'depth', 'Top':'name'})
tops5
Conversion of the dataframe to a list of dictionaries by a UDL, borehole TC-1319-005
tops_list_5 = geo.plot_tops(tops5)
tops_list_5
[{'name': 'CALP', 'top': np.float64(265.77), 'base': np.float64(503.4), 'color': '#CCDF9A'}, {'name': 'LABL', 'top': np.float64(503.4), 'base': np.float64(830.45), 'color': '#8A6E9F'}, {'name': 'SP', 'top': np.float64(830.45), 'base': np.float64(911.1), 'color': '#ABA16C'}, {'name': 'P', 'top': np.float64(911.1), 'base': np.float64(1038.37), 'color': '#EBDE98'}, {'name': 'LIS', 'top': np.float64(1038.37), 'base': np.float64(1049.2), 'color': '#806000'}, {'name': 'PAL', 'top': np.float64(1049.2), 'base': np.float64(1059.2), 'color': '#2A7C43'}]
Saving the tops to a json file
with open('Output/tops_list_5.json', 'w') as json_file:
json.dump(tops_list_5, json_file, indent=4)
Plot of GRs with formations
plt.figure(figsize=(10, 8))
plt.subplot(141)
plt.plot(gr8.iloc[:,1], gr8.iloc[:,0], label='Real', c='lightblue')
# plt.plot(gr8.iloc[:,2].rolling(window=250).mean(), gr8.iloc[:,1], label='Mean 400')
plt.xlabel('GR (API)')
# Butterworth filter
b, a = butter(N=2, Wn=0.02, btype='low')
filtered_data = filtfilt(b, a, gr8.iloc[:,1])
plt.plot(filtered_data, gr8.iloc[:,0], label='BWF')
plt.legend()
plt.grid()
plt.xlabel('GR (API)')
plt.ylabel('Depth (m)')
plt.axis([0, 120, 1100, 0])
plt.subplot(142)
for i in range(0, len(tops_list_8)):
plt.axhspan(tops_list_8[i]['top'], tops_list_8[i]['base'], color=tops_list_8[i]['color'])
plt.text(122, tops_list_8[i]['top'], tops_list_8[i]['name'], fontsize=6, va='center')
plt.axis([0, 120, 1100, 0])
plt.xticks([])
plt.xlabel('Formations')
plt.subplot(143)
plt.plot(features3_5.GR_lr, features3_5.From, label='LR')
plt.plot(features3_5.GR_xgb + 40, features3_5.From, label='XGB + 40')
plt.legend()
plt.grid()
plt.xlabel('GR (API)')
plt.axis([0, 120, 1100, 0])
plt.subplot(144)
for i in range(0, len(tops_list_5)):
plt.axhspan(tops_list_5[i]['top'], tops_list_5[i]['base'], color=tops_list_5[i]['color'])
plt.text(122, tops_list_5[i]['top'], tops_list_5[i]['name'], fontsize=7, va='center')
plt.axis([0, 120, 1100, 0])
plt.xticks([])
plt.xlabel('Formations')
plt.suptitle('GR and Formations in Boreholes\nTC-3660-008 TC-1319-005', y=0.98)
plt.tight_layout()
plt.savefig('Output/all_grs.png', dpi=300)
plt.show()
Observations¶
Here are some observations related to the tasks covered in this notebook:
This notebook highlights the utility of open-source Python tools in addressing the issue of incomplete and sparse petrophysical data in mining. The various regression models employed demonstrate that machine learning techniques can be effective in filling data gaps, even when direct measurements are limited or unavailable.
More complex models, such as Extreme Gradient Boosting, produced synthetic GRs that more closely follow the pattern of the reference GR, while simpler models like Linear Regression scored higher under cross-validation and offer better interpretability. This trade-off between accuracy and interpretability is a crucial consideration for practical applications in mining data analysis.
The coefficients in the Linear Regression model indicated that variables such as Den and Mag significantly impact the target variable (GR), demonstrating strong correlation and critical predictive roles in petrophysical properties. In contrast, Vp and Res were less important, contributing minimally to the model’s predictions.
Final Remarks¶
The analysis of the petrophysical dataset from Collinstown highlights challenges related to data dispersion and scarcity, as well as the potential of machine learning (ML) tools to enhance the value of mining data.
Data dispersion, particularly in Den and Vp, complicates the generation of accurate models and reflects the inherent heterogeneity of the subsurface. Variations across boreholes and core samples, along with some false anomalies from measurement issues, must be addressed to ensure data integrity and reliable modeling. Low correlation (or low covariance) in the dataset, resulting from fractures and methodological limitations, leads to models with low R² values.
ML algorithms and Python-based tools have proven valuable for integrating legacy data with new sources, thereby improving dataset quality. Techniques such as imputation and regression modeling help fill gaps and enhance the resolution of petrophysical properties. Overall, ML tools present significant opportunities to improve legacy data and support its integration with new datasets, leading to more accurate mining models.
Both this analysis and the previous notebook demonstrate the effective use of open-source Python tools and ML models to address incomplete and sparse petrophysical data in mining exploration. This adaptable workflow can be applied across various geological contexts, benefiting other mining projects as well.
Next Steps¶
Here are some possible next steps to build on the work from this and the previous notebooks:
Consider exploring classification models, especially if there is interest in classifying different types of rocks or deposits based on petrophysical properties rather than predicting continuous variables.
Explore advanced feature engineering or transformation techniques to uncover more complex relationships between variables. This could potentially improve the model’s performance.
Further optimize the machine learning models by performing hyperparameter tuning (see the sketch after this list). This can improve the accuracy of the predictions and refine the model selection process.
Apply the propagation of missing properties using the models obtained here to other boreholes with limited data.
These steps would extend the current analysis, enhance model accuracy, and contribute to more practical applications in mining exploration.
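As an illustration of the hyperparameter-tuning step above, here is a minimal sketch using scikit-learn's GridSearchCV on the Random Forest regressor; the grid values are arbitrary placeholders, not tuned recommendations:
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; widen or refine it for real tuning
param_grid = {'n_estimators': [100, 300],
              'max_depth': [None, 5, 10]}
grid = GridSearchCV(RandomForestRegressor(random_state=seed), param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)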
Total running time of the script: (0 minutes 3.688 seconds)