Note

Go to the end to download the full example code

Modeling and Propagation of Petrophysical Data for Mining Exploration (2/3)¶

Cleaning and Filling the Gaps

Barcelona 25/09/24 GEO3BCN Manuel David Soto, Juan Alcalde, Adrià Hernàndez-Pineda, Ramón Carbonel

Introduction¶

The dispersion and scarcity of petrophysical data are well-known challenges in the mining sector. These issues are primarily driven by economic factors, but also by geological factors such as sedimentary cover, weathering, erosion, or the depth of targets, geotechnical issues (e.g., slope or borehole stability), and even technical limitations or availability that have been resolved in other industries (for instance, sonic logs were not previously acquired due to a lack of interest in velocity field data).

To address the challenge of sparse and incomplete petrophysical data in the mining sector, we have developed three Jupyter notebooks that tackle these issues through a suite of open-source Python tools. These tools support researchers in the initial exploration, visualization, and integration of data from diverse sources, filling gaps caused by technical limitations and ultimately enabling the complete modeling of missing properties (through standard and more advanced ML-based models). We applied these tools to both recently acquired and legacy petrophysical data of two cores northwest of Collinstown (County Westmeath, Province of Leinster), Ireland, located 26 km west-northwest of the Navan mine. However, these tools are adaptable and applicable to mining data from any location.

After the EDA and integration of the petrophysical dataset of Collinstown (notebook 1/3), this second notebook gathers different tasks that can be grouped into what is called, in data science, data mining. These tasks are:

Record the positions of the original NaNs.
Record the positions and the values of the anomalies to be deleted.
Delete the anomalies.
Fill out all NaNs, both original and those resulting from deleting the anomalies, using different means (imputations, empirical formulas, simple ML models).
Compare the effectiveness of the different filling gap methods.
Use the better option to fill in the gaps and deliver the corrected petrophysical data for further investigation.

As with the previous notebook, these tasks are performed with open-source Python tools that are easily accessible by any researcher through a Python installation connected to the Internet.

Variables¶

The dataset used in this notebook is the ‘features’ dataset from the previous notebook (1/3). It contains the modelable petrophysical features with their respective anomalies. ‘Hole’ (text object) and ‘Len’ (float) variables are for reference, ‘Form’ (text object) is a categorical variable representing the major formations:

Name	Explanation	Unit
Hole	Hole name
From	Top of the sample	m
Len	Length of the core sample	cm
Den	Density	g/cm³
Vp	Compressional velocity	m/s
Vs	Shear velocity	m/s
Mag	Magnetic susceptibility
Ip	Chargeability or induced polarization	mv/V
Res	Resistivity	ohm·m
Form	Major formations or zone along the hole

Libraries¶

The following are the Python libraries used along this notebook. PSL are Python Standard Libraries, UDL are User Defined Libraries, and PEL are Python External Libraries:

PLS

import sys
import warnings

# UDL
from vector_geology import basic_stat, geo

# PEL- Basic
import numpy as np
import pandas as pd
from tabulate import tabulate

# PEL - Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# PEL - Data selection, transformation, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

# PEL - Imputer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# PEL- ML algorithms
from scipy.optimize import curve_fit
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn import tree
import xgboost as xgb

# PEL - Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

Settings¶

Seed of random process

seed = 123

# Warning suppression
warnings.filterwarnings("ignore")

Data Loading¶

Features data load

features = pd.read_csv('Output/features.csv', index_col=0)
features.head()

	Hole	From	Len	Den	Vp	Vs	Mag	Ip	Res	Form
0	TC-1319-005	265.77	11.2	2.610	2179.0	NaN	0.108	NaN	NaN	CALP
1	TC-1319-005	271.66	11.8	4.318	5514.0	2222.0	0.108	26.06	4600.0	CALP
2	TC-1319-005	277.05	10.3	3.399	5754.0	2085.0	0.237	17.52	10200.0	CALP
3	TC-1319-005	282.55	11.2	4.342	5773.0	2011.0	0.151	28.13	5800.0	CALP
4	TC-1319-005	287.20	10.8	3.293	6034.0	2122.0	0.176	6.88	15500.0	CALP

features.describe()

	From	Len	Den	Vp	Vs	Mag	Ip	Res
count	329.000000	321.000000	327.000000	310.000000	289.000000	326.000000	303.000000	303.000000
mean	543.279696	10.223364	2.761664	5441.990323	2772.937716	0.430543	49.738548	3283.762376
std	262.864370	1.070127	0.293938	700.858075	653.446129	5.192919	25.145208	3714.851096
min	9.600000	7.500000	2.420000	992.000000	0.000000	0.001000	5.390000	156.900000
25%	355.580000	9.700000	2.678500	5196.000000	2294.000000	0.064000	31.555000	1100.000000
50%	547.800000	10.000000	2.706000	5575.000000	3041.000000	0.114500	47.040000	2100.000000
75%	750.000000	10.400000	2.731000	5857.000000	3248.000000	0.205250	67.190000	3850.000000
max	1049.200000	18.000000	5.074000	6478.000000	3740.000000	93.886000	171.730000	23900.000000

Columns in the features dataframe

features.columns

Index(['Hole', 'From', 'Len', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res', 'Form'], dtype='object')

Features Cleaning¶

The plots below show the anomalous values of the four affected variables. Then, along the section, we will register the position of the anomalies and then delete them, according to the findings of the previous notebook (1/3):

Feat.	Deliting Anomalies
Den	Values above 4 g/cm³ are related to bad measurements in core samples longer than 11.2 cm.
Vp	Values below 3000 m/s are related to fractures in the core samples.
Vs	Values less than 1000 m/s are also related to core fractures.
Mag	The anomaly of 93.9 is above the range of the measuring equipment.
Ip	We see no reasons to discard outliers.
Res	We see no reasons to discard outliers.

Bad data areas in Vp, Vs, Mag vs. Den

plt.figure(figsize=(16, 5))

# len vs. Den
plt.subplot(131)
plt.scatter(features.Vp, features.Den, alpha=0.65, edgecolors='k', c='skyblue')
plt.axhspan(4, 5.25, color='r', alpha=0.2)
plt.axvspan(500, 3000, color='r', alpha=0.2)
plt.text(1100, 4.7, 'Bad Data Area')
plt.axis([500, 7000, 2, 5.25])
plt.xlabel('Vp (m/s)')
plt.ylabel('Den (g/cm3)')
plt.grid()

# Den vs. Vs bad data area
plt.subplot(132)
plt.scatter(features.Vs, features.Den, alpha=0.65, edgecolors='k', c='skyblue')
plt.axhspan(4, 5.25, color='r', alpha=0.2)
plt.axvspan(0, 1000, color='r', alpha=0.2)
plt.text(75, 4.7, 'Bad Data Area')
plt.axis([0, 4000, 2, 5.25])
plt.xlabel('Vs (m/s)')
plt.ylabel('Den (g/cm3)')
plt.grid()

# Den vs. Vs bad data area
plt.subplot(133)
plt.scatter(features.Mag, features.Den, alpha=0.65, edgecolors='k', c='skyblue')
plt.axvspan(20, 100, color='r', alpha=0.2)
plt.axhspan(4, 5.25, color='r', alpha=0.2)
plt.text(2, 4.7, 'Bad Data Area')
plt.axis([-5, 100, 2, 5.25])
plt.xlabel('Magnetic Susceptibility')
plt.ylabel('Den (g/cm3)')
plt.grid()

plt.suptitle('Variable with Anomalies in the Keywells')

plt.tight_layout();

NaN and Anomalies Preservation¶

The folowing dataframes preserves the location of original NaN (1) and anomalies (2):

nan_ano_loc = features.copy()

Fill the datafreme with 1 and 2

for column in features.columns:
    nan_ano_loc[column] = np.where(pd.isna(features[column]), 1, 0)

nan_ano_loc.Den = np.where(features.Den > 4, 2, 0)
nan_ano_loc.Vp = np.where(features.Vp < 3000, 2, 0)
nan_ano_loc.Vs = np.where(features.Vs < 1000, 2, 0)
nan_ano_loc.Mag = np.where(features.Mag > 20, 2, 0)
nan_ano_loc.head()

	Den	Vp	Ip	Res
0	0	2	1	1
1	2	0	0	0
2	0	0	0	0
3	2	0	0	0
4	0	0	0	0

nan_ano_loc.describe()

	Hole	From	Len	Den	Vp	Vs	Mag	Ip	Res	Form
count	329.0	329.0	329.000000	329.000000	329.000000	329.000000	329.000000	329.000000	329.000000	329.0
mean	0.0	0.0	0.024316	0.048632	0.030395	0.030395	0.006079	0.079027	0.079027	0.0
std	0.0	0.0	0.154263	0.308527	0.245049	0.245049	0.110264	0.270192	0.270192	0.0
min	0.0	0.0	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.0
25%	0.0	0.0	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.0
50%	0.0	0.0	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.0
75%	0.0	0.0	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.0
max	0.0	0.0	1.000000	2.000000	2.000000	2.000000	2.000000	1.000000	1.000000	0.0

Save the locations of nans and anomalies

nan_ano_loc.to_csv('Output/nan_ano_loc.csv')

Preservation of the anomalies

anomalies = features[features.Den > 4]
anomalies = pd.concat([anomalies, features[features.Vp < 3000]])
anomalies = pd.concat([anomalies, features[features.Vs < 1000]])
anomalies = pd.concat([anomalies, features[features.Mag > 2]])

anomalies = anomalies.drop_duplicates().sort_index()
anomalies

	Hole	From	Len	Den	Vp	Vs	Mag	Ip	Res	Form
0	TC-1319-005	265.77	11.2	2.610	2179.0	NaN	0.108	NaN	NaN	CALP
1	TC-1319-005	271.66	11.8	4.318	5514.0	2222.0	0.108	26.06	4600.0	CALP
3	TC-1319-005	282.55	11.2	4.342	5773.0	2011.0	0.151	28.13	5800.0	CALP
6	TC-1319-005	301.50	11.6	4.251	5686.0	2755.0	0.132	14.77	6000.0	CALP
7	TC-1319-005	313.65	11.4	4.193	5588.0	2039.0	0.068	20.37	5400.0	CALP
11	TC-1319-005	334.65	11.3	4.079	5678.0	2198.0	0.036	34.51	2000.0	CALP
12	TC-1319-005	340.00	11.9	4.444	5561.0	2581.0	0.024	14.92	4800.0	CALP
13	TC-1319-005	344.77	11.2	4.123	5359.0	NaN	0.041	45.93	1000.0	CALP
24	TC-1319-005	400.10	13.9	5.074	5367.0	2422.0	0.066	NaN	NaN	CALP
77	TC-1319-005	645.30	11.0	2.725	992.0	NaN	0.308	98.11	293.3	LABL
79	TC-1319-005	653.95	12.0	2.936	2321.0	1929.0	0.342	88.45	443.4	LABL
159	TC-1319-005	1046.35	10.6	2.455	2325.0	NaN	0.158	44.44	1300.0	LIS
174	TC-3660-008	64.54	9.4	2.677	2802.0	0.0	0.038	35.19	2500.0	CALP
198	TC-3660-008	188.03	9.4	2.679	5345.0	0.0	0.098	NaN	NaN	CALP
243	TC-3660-008	432.95	9.7	2.779	4641.0	178.0	NaN	69.69	515.7	FLT
248	TC-3660-008	452.30	10.4	2.686	5652.0	0.0	0.053	27.54	4000.0	FLT
305	TC-3660-008	734.80	9.5	2.708	5026.0	2914.0	93.886	71.63	1200.0	SP
312	TC-3660-008	774.21	9.6	2.787	6173.0	0.0	0.123	73.08	373.2	SP

Number and % of anomalies by boreholes

print('Total anomalies:', len(anomalies), '({:.1f}%)'.format(100 * len(anomalies) / (features.shape[0] * features.shape[1])))
anomalies.Hole.value_counts()

Total anomalies: 18 (0.5%)

Hole
TC-1319-005    12
TC-3660-008     6
Name: count, dtype: int64

Save the anomalies

anomalies.to_csv('Output/anomalies.csv')

Rows with anomalies

list(anomalies.index)

[0, 1, 3, 6, 7, 11, 12, 13, 24, 77, 79, 159, 174, 198, 243, 248, 305, 312]

Anomalies Deletion¶

features2, new features dataframe without anomalies

features2 = features.copy()

features2['Den'] = np.where(features.Den > 4, np.nan, features.Den)
features2['Vp'] = np.where(features.Vp < 3000, np.nan, features.Vp)
features2['Vs'] = np.where(features.Vs < 1000, np.nan, features.Vs)
features2['Mag'] = np.where(features.Mag > 20, np.nan, features.Mag)

features2.head()

	Hole	From	Len	Den	Vp	Vs	Mag	Ip	Res	Form
0	TC-1319-005	265.77	11.2	2.610	NaN	NaN	0.108	NaN	NaN	CALP
1	TC-1319-005	271.66	11.8	NaN	5514.0	2222.0	0.108	26.06	4600.0	CALP
2	TC-1319-005	277.05	10.3	3.399	5754.0	2085.0	0.237	17.52	10200.0	CALP
3	TC-1319-005	282.55	11.2	NaN	5773.0	2011.0	0.151	28.13	5800.0	CALP
4	TC-1319-005	287.20	10.8	3.293	6034.0	2122.0	0.176	6.88	15500.0	CALP

features2.describe()

	From	Len	Den	Vp	Vs	Mag	Ip	Res
count	329.000000	321.000000	319.000000	305.000000	284.000000	325.000000	303.000000	303.000000
mean	543.279696	10.223364	2.721755	5496.386885	2821.130282	0.142988	49.738548	3283.762376
std	262.864370	1.070127	0.145217	556.077848	547.476118	0.100706	25.145208	3714.851096
min	9.600000	7.500000	2.420000	3009.000000	1295.000000	0.001000	5.390000	156.900000
25%	355.580000	9.700000	2.678000	5220.000000	2368.250000	0.064000	31.555000	1100.000000
50%	547.800000	10.000000	2.705000	5581.000000	3041.000000	0.114000	47.040000	2100.000000
75%	750.000000	10.400000	2.726500	5858.000000	3248.000000	0.203000	67.190000	3850.000000
max	1049.200000	18.000000	3.650000	6478.000000	3740.000000	0.545000	171.730000	23900.000000

Boxplot of each feature without anomalies

features2.plot.box(subplots=True, grid=False, figsize=(12, 7), layout=(3, 4), flierprops={"marker": "."})
plt.suptitle('Features Without Anomalies')
plt.tight_layout();

Save features without anomalies

features2.to_csv('Output/features2.csv')

NaNs in the Features¶

Original NaNs

features.isna().sum()

Hole     0
From     0
Len      8
Den      2
Vp      19
Vs      40
Mag      3
Ip      26
Res     26
Form     0
dtype: int64

Total original NaNs

features.isna().sum().sum()

np.int64(124)

% of original NaNs

features.isna().sum() / len(features) * 100

Hole     0.000000
From     0.000000
Len      2.431611
Den      0.607903
Vp       5.775076
Vs      12.158055
Mag      0.911854
Ip       7.902736
Res      7.902736
Form     0.000000
dtype: float64

Total % original and new NaNs

print('% NaN in Features:', round(100 * features.isna().sum().sum() / (features.shape[0] * features.shape[1]), 1))

% NaN in Features: 3.8

Original and new NaNs

features2.isna().sum()

Hole     0
From     0
Len      8
Den     10
Vp      24
Vs      45
Mag      4
Ip      26
Res     26
Form     0
dtype: int64

Total original and new NaNs

features2.isna().sum().sum()

np.int64(143)

Total % original and new NaNs

print('% NaN in features2:', round(100 * features2.isna().sum().sum() / (features2.shape[0] * features2.shape[1]), 1))

% NaN in features2: 4.3

% Total original and new NaNs by feature

features2.isna().sum() / len(features) * 100

Hole     0.000000
From     0.000000
Len      2.431611
Den      3.039514
Vp       7.294833
Vs      13.677812
Mag      1.215805
Ip       7.902736
Res      7.902736
Form     0.000000
dtype: float64

Bar plot of NaNs

features2.isna().sum().plot.bar(figsize=(8, 3), color='r', ylabel='Count', title='All Features NaN', label='Without anomalies', legend=True)
features.isna().sum().plot.bar(figsize=(8, 3), color='b', label='With anomalies', legend=True, grid=True)

<Axes: title={'center': 'All Features NaN'}, ylabel='Count'>

Bar plot of NaNs (%)

(features2.isna().sum() / len(features) * 100).plot.bar(figsize=(8, 3), color='r', label='Without anomalies', legend=True)
(features.isna().sum() / len(features) * 100).plot.bar(figsize=(8, 3), color='b', label='With anomalies', legend=True,
                                                       title='All Features NaN', ylabel='%', grid=True);

<Axes: title={'center': 'All Features NaN'}, ylabel='%'>

Total rows at least one with NaN and their indexes

total_nan_index = list(features2[pd.isna(features2).any(axis=1)].index)
print('Total rows with at least one NaN:', len(total_nan_index), '\n', '\n', 'Rows indexes:')
total_nan_index

Total rows with at least one NaN: 62

 Rows indexes:

[0, 1, 3, 6, 7, 11, 12, 13, 15, 24, 36, 45, 47, 63, 68, 71, 75, 77, 79, 99, 100, 105, 110, 111, 112, 113, 114, 115, 117, 118, 135, 158, 159, 160, 164, 174, 183, 185, 187, 189, 191, 194, 197, 198, 201, 206, 209, 210, 229, 233, 240, 243, 248, 251, 276, 305, 311, 312, 318, 319, 320, 325]

Filling the holes¶

Imputations Evaluation¶

Imputation refers to the alternative process of filling in all the missing values (NaN) or gaps in a dataset. Instead of removing rows or columns with missing values, those values are replaced (imputed) with estimated values based on information available in the entire dataset (in all variables). This is typically done when you want to preserve the other values in the rows where the NaNs are, and when the missing values for each variable or column do not exceed 25%.

Since multiple imputation methods are available, it’s important to evaluate each one and select the method that yields the best results, based on metrics such as the sum of residuals, MAE, MSE, RMSE, and R² (formulas of these metrics in the annex). In this case, we tested six imputation methods:

Simple mean (mean)
Simple K-Means (knn)
Iterative Impute (ii)
Normalized mean (nmean)
Normalized K-Means (nknn)
Normalized Iterative Imputer (nii)

To compare these methods, we first created fictitious gaps in the Vp values (just this variable due to its importance, and for simplicity and speed) of a non-NaN dataset to compare them later the real values of Vp against the imputed. By doing so, we can compute each method’s metrics and determine which performs best. In addition to these metrics, at the end of the section, there is a plot with the resulting values of each imputer method with respect to the real values, showing less dispersion towards the top of the plot.

Non-NaN dataframe to test the imputation

features2_num = features2[['From', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res']]
features2_num_nonan = features2_num.dropna()
features2_num_nonan.head()

	From	Den	Vp	Vs	Mag	Ip	Res
2	277.05	3.399	5754.0	2085.0	0.237	17.52	10200.0
4	287.20	3.293	6034.0	2122.0	0.176	6.88	15500.0
5	296.75	2.782	4661.0	2294.0	0.107	27.81	2100.0
8	318.75	3.386	5815.0	2115.0	0.060	8.53	10700.0
9	325.75	2.616	4785.0	1946.0	0.040	84.90	296.1

Generation of random mask for the fictitious NaNs or gaps

np.random.seed(seed)
missing_rate = 0.2  # Porcentaje de valores faltantes
Vp_mask = np.random.rand(features2_num_nonan.shape[0]) < missing_rate
Vp_mask

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False,  True, False, False,  True,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False,  True, False,  True, False,
       False, False,  True, False, False,  True,  True, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False,  True,  True, False, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False,  True, False, False, False,
        True, False, False, False, False,  True, False, False, False,
       False, False, False,  True, False, False,  True, False, False,
       False, False,  True, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
        True, False, False, False,  True,  True,  True, False, False,
       False,  True, False, False, False, False,  True, False, False,
       False, False, False, False, False,  True, False,  True, False,
       False,  True, False,  True, False, False, False, False,  True,
       False, False,  True, False, False, False, False, False, False,
        True, False, False,  True, False, False, False, False, False,
        True, False, False, False,  True, False,  True, False, False,
       False,  True, False, False, False, False,  True, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False,  True, False, False, False,  True, False, False,
        True, False, False, False, False,  True])

New gaps in Vp of the Non-NaN dataframe

features2_new_missing = features2_num_nonan.copy()
features2_new_missing.loc[Vp_mask, 'Vp'] = np.nan
features2_new_missing

	From	Den	Vp	Vs	Mag	Ip	Res
2	277.05	3.399	5754.0	2085.0	0.237	17.52	10200.0
4	287.20	3.293	6034.0	2122.0	0.176	6.88	15500.0
5	296.75	2.782	4661.0	2294.0	0.107	27.81	2100.0
8	318.75	3.386	5815.0	2115.0	0.060	8.53	10700.0
9	325.75	2.616	4785.0	1946.0	0.040	84.90	296.1
...	...	...	...	...	...	...	...
323	809.47	2.786	5621.0	2334.0	0.105	55.20	706.5
324	814.55	2.763	5680.0	3179.0	0.158	70.94	481.1
326	825.60	2.762	5593.0	1787.0	0.337	78.44	334.6
327	828.80	2.804	5858.0	1964.0	0.311	73.73	466.5
328	833.40	2.735	NaN	3217.0	0.308	33.86	2300.0

267 rows × 7 columns

At index 328 the original Vp is

features.loc[328]

Hole    TC-3660-008
From          833.4
Len            10.1
Den           2.735
Vp           5805.0
Vs           3217.0
Mag           0.308
Ip            33.86
Res          2300.0
Form            PAL
Name: 328, dtype: object

True Vp values

true_vp = features2_num_nonan[Vp_mask].Vp.reset_index(drop=True)
true_vp.head()

  5196.0
  6159.0
  6037.0
  5815.0
  4930.0
Name: Vp, dtype: float64

Evaluation of the imputers

# List of imputers
imputers = []

# Initialize imputers
mean_imputer = SimpleImputer(strategy='mean')
ii_imputer = IterativeImputer(max_iter=10, random_state=seed)
knn_imputer = KNNImputer(n_neighbors=5)

# Normalization
scaler = StandardScaler()
features2_new_missing_scaled = scaler.fit_transform(features2_new_missing)

# Append of imputer (name, norm, imputer, data)
imputers.append(('mean', 'no', mean_imputer, features2_new_missing))
imputers.append(('ii', 'no', ii_imputer, features2_new_missing))
imputers.append(('knn', 'no', knn_imputer, features2_new_missing))
imputers.append(('nmean', 'yes', mean_imputer, features2_new_missing_scaled))
imputers.append(('nii', 'yes', ii_imputer, features2_new_missing_scaled))
imputers.append(('nknn', 'yes', knn_imputer, features2_new_missing_scaled))

# Results table
headers = ['Imputer', 'Normalization', 'Sum Residual', 'MAE', 'MSE', 'RMSE', 'R2']
rows = []

# Loop over the imputers
for name, norm, imputer, data in imputers:
    # Impute the data
    imputed_data = imputer.fit_transform(data)

    # Reverse the normalization if data was normalized
    if norm == 'yes':
        imputed_data = scaler.inverse_transform(imputed_data)

    # Convert the array to dataframe
    imputed_data_df = pd.DataFrame(imputed_data, columns=features2_new_missing.columns)
    imputed_vp = imputed_data_df.loc[Vp_mask, 'Vp'].reset_index(drop=True)

    # Calculate residuals and metrics
    residual = sum(true_vp - imputed_vp)
    mae = mean_absolute_error(true_vp, imputed_vp)
    mse = mean_squared_error(true_vp, imputed_vp)
    rmse = np.sqrt(mse)
    r2 = r2_score(true_vp, imputed_vp)

    # Create a dataframe for imputer using globals()
    globals()[f'imputer_{name}'] = imputed_data_df.copy()

    # Append results
    rows.append([name, norm, residual, mae, mse, rmse, r2])

# Print the results in a formatted table

print('Seed:', seed)

print(tabulate(rows, headers=headers, tablefmt="fancy_outline"))

Seed: 123
╒═══════════╤═════════════════╤════════════════╤═════════╤════════╤═════════╤═════════════╕
│ Imputer   │ Normalization   │   Sum Residual │     MAE │    MSE │    RMSE │          R2 │
╞═══════════╪═════════════════╪════════════════╪═════════╪════════╪═════════╪═════════════╡
│ mean      │ no              │        1779.8  │ 394.903 │ 256363 │ 506.323 │ -0.00517294 │
│ ii        │ no              │       -1106.23 │ 308.72  │ 163029 │ 403.768 │  0.360781   │
│ knn       │ no              │        -490.6  │ 310.118 │ 168360 │ 410.317 │  0.339878   │
│ nmean     │ yes             │        1779.8  │ 394.903 │ 256363 │ 506.323 │ -0.00517294 │
│ nii       │ yes             │        -646.02 │ 296.29  │ 160690 │ 400.862 │  0.36995    │
│ nknn      │ yes             │        -924.4  │ 298.808 │ 166131 │ 407.592 │  0.348616   │
╘═══════════╧═════════════════╧════════════════╧═════════╧════════╧═════════╧═════════════╛

imputer_nii.head()

	From	Den	Vp	Vs	Mag	Ip	Res
0	277.05	3.399	5754.0	2085.0	0.237	17.52	10200.0
1	287.20	3.293	6034.0	2122.0	0.176	6.88	15500.0
2	296.75	2.782	4661.0	2294.0	0.107	27.81	2100.0
3	318.75	3.386	5815.0	2115.0	0.060	8.53	10700.0
4	325.75	2.616	4785.0	1946.0	0.040	84.90	296.1

Plot of the real Vp vs. the imputed Vp

plt.figure(figsize=(8, 8))
plt.scatter(true_vp, true_vp, color='k', label='True Values')
plt.scatter(true_vp, imputer_mean[Vp_mask].Vp, label='Mean (mean)')
plt.scatter(true_vp, imputer_nmean[Vp_mask].Vp, label='Normalized Mean (nmean)')
plt.scatter(true_vp, imputer_ii[Vp_mask].Vp, label='Iterative (ii)')
plt.scatter(true_vp, imputer_nii[Vp_mask].Vp, label='Normalized (nii)')
plt.scatter(true_vp, imputer_knn[Vp_mask].Vp, label='K-Mean Neighbor (knn)')
plt.scatter(true_vp, imputer_nknn[Vp_mask].Vp, label='Normalized K-Mean Neighbor (nknn)')
plt.xlabel('True Vp (m/s)')
plt.ylabel('Imputed Vp (m/s)')
plt.title('Real Vp vs. Imputed Vp')
plt.axis([4000, 6750, 4000, 6750])
plt.legend()
plt.grid()
plt.show()

Best Imputation

By a closer examination of the metrics, we can conclude that the normalized Iterative Imputer (nii), also called MICE (Multiple Imputation by Chained Equations), performs the best (or least poorly), particularly for the Vp variable, with the mean of the imputed values increasing just 1.2 % with respect to the mean of the real values. Consequently, we used the nii imputer to fill the gaps across all variables and saved the updated dataset in a new dataframe (features3).

The multiplot of this section shows that the imputation process maintains the shape of the boxplot of the variables from which the anomalies have been removed. In other words, this multiplot of the imputed dataset (features3) is almost identical to the multiplot of features2 in section 7.2.

Normalized interactive imputation of features2

# Data normalization
scaler = StandardScaler()
features2_scaled = scaler.fit_transform(features2_num)

# Imputation of normalized data
nii_features3_array = ii_imputer.fit_transform(features2_scaled)

# Reverse normalization and new imputed dataframe
features3 = pd.DataFrame(scaler.inverse_transform(nii_features3_array), columns=features2_num.columns)

features3.head()

	From	Den	Vp	Vs	Mag	Ip	Res
0	265.77	2.610000	5725.879959	3073.767206	0.108	36.995337	5489.392555
1	271.66	2.786624	5514.000000	2222.000000	0.108	26.060000	4600.000000
2	277.05	3.399000	5754.000000	2085.000000	0.237	17.520000	10200.000000
3	282.55	2.800579	5773.000000	2011.000000	0.151	28.130000	5800.000000
4	287.20	3.293000	6034.000000	2122.000000	0.176	6.880000	15500.000000

features3.describe()

	From	Den	Vp	Vs	Mag	Ip	Res
count	329.000000	329.000000	329.000000	329.000000	329.000000	329.000000	329.000000
mean	543.279696	2.723043	5501.096555	2808.728436	0.142841	48.950658	3401.316252
std	262.864370	0.143224	538.939658	517.989081	0.100134	24.374350	3623.480943
min	9.600000	2.420000	3009.000000	1295.000000	0.001000	5.390000	156.900000
25%	355.580000	2.679000	5266.000000	2448.000000	0.064000	31.808800	1200.000000
50%	547.800000	2.707000	5587.000000	2994.000000	0.115000	46.200000	2300.000000
75%	750.000000	2.731000	5849.000000	3231.000000	0.201000	65.470000	4200.000000
max	1049.200000	3.650000	6478.000000	3740.000000	0.545000	171.730000	23900.000000

Boxplot of each imputed feature

features3.plot.box(subplots=True, grid=False, figsize=(12, 7), layout=(3, 4), flierprops={"marker": "."})
plt.suptitle('Imputed Features')
plt.tight_layout()

Copy reference variable to features3

features3[['Hole', 'Len', 'Form']] = features2[['Hole', 'Len', 'Form']]

Order of the variables

features3_series = list(features2.columns)
features3_series

['Hole', 'From', 'Len', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res', 'Form']

Reordering features3

features3 = features3[features3_series]
features3.head()

	Hole	From	Len	Den	Vp	Vs	Mag	Ip	Res	Form
0	TC-1319-005	265.77	11.2	2.610000	5725.879959	3073.767206	0.108	36.995337	5489.392555	CALP
1	TC-1319-005	271.66	11.8	2.786624	5514.000000	2222.000000	0.108	26.060000	4600.000000	CALP
2	TC-1319-005	277.05	10.3	3.399000	5754.000000	2085.000000	0.237	17.520000	10200.000000	CALP
3	TC-1319-005	282.55	11.2	2.800579	5773.000000	2011.000000	0.151	28.130000	5800.000000	CALP
4	TC-1319-005	287.20	10.8	3.293000	6034.000000	2122.000000	0.176	6.880000	15500.000000	CALP

Save features3

features3.to_csv('Output/features3.csv')

Indixes of Vp values in features2

features2.Vp[~features2.Vp.isna()]

    5514.0
    5754.0
    5773.0
    6034.0
    4661.0
        ...
  5621.0
  5680.0
  5593.0
  5858.0
  5805.0
Name: Vp, Length: 305, dtype: float64

Indixes of Vp NaNs in features2

features2.Vp[features2.Vp.isna()]

   NaN
  NaN
  NaN
  NaN
  NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
Name: Vp, dtype: float64

Vp_nan_index = list(features2.Vp[features2.Vp.isna()].index)
Vp_nan_index

[0, 15, 36, 77, 79, 111, 118, 159, 174, 183, 185, 187, 189, 191, 194, 201, 210, 233, 240, 251, 276, 311, 320, 325]

Original Vp without anomalies

features2.Vp.describe()

count     305.000000
mean     5496.386885
std       556.077848
min      3009.000000
25%      5220.000000
50%      5581.000000
75%      5858.000000
max      6478.000000
Name: Vp, dtype: float64

Imputed Vp

features3.Vp.iloc[Vp_nan_index].describe()

count      24.000000
mean     5560.948607
std       225.846356
min      5210.315493
25%      5369.531486
50%      5605.088801
75%      5779.916352
max      5854.418200
Name: Vp, dtype: float64

% of increase of the mean of the imputed Vp compared with the real Vp

vp_change_den = 100 * (features3.Vp.iloc[Vp_nan_index].mean() - features2.Vp.mean()) / features2.Vp.mean()
print('Increase of the mean:', '{:.1f} %'.format(vp_change_den))

Increase of the mean: 1.2 %

nii Vp for comparison

nii_vp = features3.Vp.iloc[Vp_nan_index]
nii_vp.head()

   5725.879959
  5604.870331
  5558.402401
  5374.912754
  5210.315493
Name: Vp, dtype: float64

Estimating Vp¶

The calculation of a single variable such as Vp (again, just this variable due to its importance, for simplicity and speed) allows us to try and evaluate different methods for filling the gaps, from empirical formulas to the simplest Machine Learning (ML) algorithms.

Vp by Gardner¶

The Gardner’s formula is a well-known empirical formula that allows us to transform the density (Den) into compressional velocity (Vp):

\[Vp = \alpha * \phi ^ {\beta}\]

Where

$\phi$ is the density and the standard coefficients (reference: https://wiki.seg.org/wiki/Dictionary:Gardner%E2%80%99s_equation) are:

\[\alpha \approx 4348\]

\[\beta = 0.25\]

Contrary to the expected increasing trend between Den and Vp, the plot below shows an anomalous vertical trend, where multiple Vp values are associated with a single Den value (around 2.7). This anomalous trend results in a negative R² (-0.021, the worst so far) when Vp is calculated using all values, and this score improves slightly when we filter out the pairs that fall outside the expected increasing trend. This is confirmed by the low covariance presented by the Den-Vp pair in the covariance matrix of this section (low correlation coefficient in the previous notebook 1/3). %% Plot of Den vs. Vp

plt.grid()
plt.scatter(features2.Den, features2.Vp, alpha=0.65, edgecolors='k')
plt.xlabel('Den (g/cm$^3$)')
plt.ylabel('Vp (m/s)')
plt.title('Data Without Anomalies')

Text(0.5, 1.0, 'Data Without Anomalies')

features2.select_dtypes(include=['number'])

	From	Len	Den	Vp	Vs	Mag	Ip	Res
0	265.77	11.2	2.610	NaN	NaN	0.108	NaN	NaN
1	271.66	11.8	NaN	5514.0	2222.0	0.108	26.06	4600.0
2	277.05	10.3	3.399	5754.0	2085.0	0.237	17.52	10200.0
3	282.55	11.2	NaN	5773.0	2011.0	0.151	28.13	5800.0
4	287.20	10.8	3.293	6034.0	2122.0	0.176	6.88	15500.0
...	...	...	...	...	...	...	...	...
324	814.55	9.6	2.763	5680.0	3179.0	0.158	70.94	481.1
325	819.92	NaN	2.649	NaN	NaN	0.024	NaN	NaN
326	825.60	9.9	2.762	5593.0	1787.0	0.337	78.44	334.6
327	828.80	9.9	2.804	5858.0	1964.0	0.311	73.73	466.5
328	833.40	10.1	2.735	5805.0	3217.0	0.308	33.86	2300.0

329 rows × 8 columns

Covariance of the features2

vari_mat = features2.select_dtypes(include=['number']).cov()
vari_mat

	From	Len	Den	Vp	Vs	Mag	Ip	Res
From	69097.677012	-16.222055	-7.748490	-53382.663546	-40566.665284	7.019153	2766.653801	-5.105614e+05
Len	-16.222055	1.145171	0.052440	-49.309151	-165.827338	-0.010128	-0.338487	-2.218782e+02
Den	-7.748490	0.052440	0.021088	2.888883	-12.299676	-0.000414	-0.243046	5.485842e+01
Vp	-53382.663546	-49.309151	2.888883	309222.573512	93428.327238	0.964435	-3777.361458	9.111955e+05
Vs	-40566.665284	-165.827338	-12.299676	93428.327238	299730.099574	-2.230359	-1474.676736	5.989462e+05
Mag	7.019153	-0.010128	-0.000414	0.964435	-2.230359	0.010142	0.674282	-1.611072e+01
Ip	2766.653801	-0.338487	-0.243046	-3777.361458	-1474.676736	0.674282	632.281488	-6.145102e+04
Res	-510561.399578	-221.878244	54.858423	911195.549174	598946.207455	-16.110723	-61451.019939	1.380012e+07

vari_mat.describe()

	From	Len	Den	Vp	Vs	Mag	Ip	Res
count	8.000000	8.000000	8.000000	8.000000	8.000000	8.000000	8.000000	8.000000e+00
mean	-66582.918624	-56.548474	4.691151	157080.121136	118735.366859	-1.210452	-7913.003762	1.842258e+06
std	183023.873982	87.703069	20.884445	325158.301042	222378.924444	6.584200	21711.748328	4.851295e+06
min	-510561.399578	-221.878244	-12.299676	-53382.663546	-40566.665284	-16.110723	-61451.019939	-5.105614e+05
25%	-43770.664850	-78.438698	-2.119407	-981.322228	-493.039688	-0.565186	-2050.347917	-1.552916e+04
50%	-11.985273	-8.280271	0.010337	1.926659	-7.265017	0.004864	-0.290766	1.937385e+01
75%	696.927815	0.005514	0.761551	147376.888806	145003.770322	0.746820	158.576084	6.770085e+05
max	69097.677012	1.145171	54.858423	911195.549174	598946.207455	7.019153	2766.653801	1.380012e+07

Covariance matrix

sns.heatmap(vari_mat, vmin=-500, vmax=500, center=0, linewidths=.1, cmap='seismic_r', annot=True)
plt.title('Covariance Matrix')
plt.tight_layout()
plt.show()

Gardner Vp in features2, with the standard alpha of 4348

features2['VpG'] = 4348 * (features2.Den) ** 0.25
features2['VpG']

    5526.493483
            NaN
    5903.741408
            NaN
    5857.165129
          ...
  5605.763308
  5547.023745
  5605.256023
  5626.444480
  5591.506938
Name: VpG, Length: 329, dtype: float64

Drop NaNs for metrics calculation

features2_nonan = features2.dropna()
features2_nonan.head()

	Hole	From	Len	Den	Vp	Vs	Mag	Ip	Res	Form	VpG
2	TC-1319-005	277.05	10.3	3.399	5754.0	2085.0	0.237	17.52	10200.0	CALP	5903.741408
4	TC-1319-005	287.20	10.8	3.293	6034.0	2122.0	0.176	6.88	15500.0	CALP	5857.165129
5	TC-1319-005	296.75	10.3	2.782	4661.0	2294.0	0.107	27.81	2100.0	CALP	5615.375681
8	TC-1319-005	318.75	10.7	3.386	5815.0	2115.0	0.060	8.53	10700.0	CALP	5898.088351
9	TC-1319-005	325.75	10.0	2.616	4785.0	1946.0	0.040	84.90	296.1	CALP	5529.666895

Metrics of the Gardner calculation with all values

vpg_metrics = basic_stat.metrics(features2_nonan.Vp, features2_nonan.VpG)

print("Metrics for Gardner:\n")
print(vpg_metrics)

Metrics for Gardner:

{'Sum of Residuals': np.float64(-10723.609649536735), 'MAE': np.float64(365.5125465596317), 'MSE': np.float64(254469.14816210992), 'RMSE': np.float64(504.44935143392735), 'R2': -0.021743605762918117}

Metrics of the Gardner calculation with filtered values

vpg_metrics2 = basic_stat.metrics(
    features2_nonan.Vp[(features2_nonan.Vp > 4000) & (features2_nonan.Vp < 6000)],
    features2_nonan.VpG[(features2_nonan.Vp > 4000) & (features2_nonan.Vp < 6000)]
)

print("Metrics for Gardner with Filtered Data Vp):\n")
print(vpg_metrics2)

Metrics for Gardner with Filtered Data Vp):

{'Sum of Residuals': np.float64(-25589.572721078162), 'MAE': np.float64(295.9254751013969), 'MSE': np.float64(162331.10374522334), 'RMSE': np.float64(402.9033429313082), 'R2': -0.0937786027043972}

Gardner Vp for comparison

gard_vp = features2.VpG.iloc[Vp_nan_index]
gard_vp.head()

   5526.493483
  5520.661324
  5633.454391
  5586.388848
  5691.523695
Name: VpG, dtype: float64

Simpliest ML models for Vp¶

Machine Learning (ML) algorithms offer a wide range of options to calculate missing values of a single variable. From the simplest and most well-known, such as linear regression with an independent variable, to the most complex. Below we use the simplest ML algorithms and their respective metrics, applied again to predict Vp and fill its gaps. But first, we split the non-NaN portion of the filtered data (features2) for training and testing.

features2_num_nonan.head()

	From	Den	Vp	Vs	Mag	Ip	Res
2	277.05	3.399	5754.0	2085.0	0.237	17.52	10200.0
4	287.20	3.293	6034.0	2122.0	0.176	6.88	15500.0
5	296.75	2.782	4661.0	2294.0	0.107	27.81	2100.0
8	318.75	3.386	5815.0	2115.0	0.060	8.53	10700.0
9	325.75	2.616	4785.0	1946.0	0.040	84.90	296.1

Target from filtered features2

# The target or objective of the model is the Vp
target = features2_num_nonan.Vp[(features2_num_nonan.Vp > 4000) & (features2_num_nonan.Vp < 6000)]
print(target.shape)
target.head()

(225,)

   5754.0
   4661.0
   5815.0
   4785.0
  5707.0
Name: Vp, dtype: float64

Filtered density is the independent feature to compute Vp, the target

features2_den = features2_num_nonan.Den[(features2_num_nonan.Vp > 4000) & (features2_num_nonan.Vp < 6000)]
features2_den.head()

   3.399
   2.782
   3.386
   2.616
  2.821
Name: Den, dtype: float64

Split and shape of data for training and testing

X_train, X_test, y_train, y_test = train_test_split(features2_den, target, test_size=0.2, random_state=seed)

X_train = np.array(X_train).reshape(-1, 1)
X_test = np.array(X_test).reshape(-1, 1)
y_train = np.array(y_train).reshape(-1, 1)
y_test = np.array(y_test).reshape(-1, 1)

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)

print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)

X_train shape: (180, 1)
y_train shape: (180, 1)
X_test shape: (45, 1)
y_test shape: (45, 1)

Simple Linear Regression¶

The linear regression is perhaps one of the simplest ML algorithms, it allows us to define a simple formula to compute Vp from Den, the only feature. The regression using all the data gave a negative $R^2$ (the predictions are worse than if it had simply predicted the mean), while applying a filter in the input data(4000 < Vp < 6000) resulted in a positive $R^2$, although a very small one (0.003).

Linear Regression

# Model training and prediction
lr_model = LinearRegression().fit(X_train, y_train)
lr_predict = lr_model.predict(np.array(X_test))

# Coheficients
coef = lr_model.coef_.item()
inter = lr_model.intercept_.item()

Metric of the linear regression with filtered data

lr_metrics = basic_stat.metrics(y_test, lr_predict)

print("Metrics for Linear Regressor:\n")
print(lr_metrics)

Metrics for Linear Regressor:

{'Sum of Residuals': np.float64(-663.6657176180624), 'MAE': np.float64(321.4070681962207), 'MSE': np.float64(152576.8824032091), 'RMSE': np.float64(390.61090922196365), 'R2': 0.002886141920427243}

Plot of filtered data and regression

plt.scatter(X_test, y_test, label='Real data', alpha=0.65, edgecolors='k')
plt.plot(X_test, lr_predict, c='r', label='Prediction')
plt.title(f"Vp = {coef:.1f} * den + {inter:.1f}")
plt.suptitle('Linear Regression')
plt.xlabel('Density (g/cm$^3$)')
plt.ylabel('Vp (m/s)')
plt.grid()
plt.legend();

Linear Regression, Vp = 175.7 * den + 4996.3

<matplotlib.legend.Legend object at 0x7f3199fc2ce0>

lr Vp for comparison

#  Density of the Vp calculations
den_feature = np.array(features2.Den.iloc[Vp_nan_index]).reshape(-1, 1)

lr_vp = lr_model.predict(den_feature)
lr_vp

array([[5454.81134299],
       [5452.87886354],
       [5491.35277254],
       [5475.01453721],
       [5512.08300661],
       [5473.43341766],
       [5454.98702294],
       [5427.58095077],
       [5466.58189962],
       [5467.28461942],
       [5474.66317731],
       [5468.33869912],
       [5458.85198183],
       [5466.05485977],
       [5454.81134299],
       [5517.70476501],
       [5469.74413872],
       [5470.27117857],
       [5467.98733922],
       [5478.52813621],
       [5470.27117857],
       [5462.01422093],
       [5495.74477128],
       [5461.66286103]])

Non Linear Fit¶

Although the non_linear fit is more powerful since it can fit the alpha and beta coefficients of a non-linear curve of density to Vp, the $R^2$ metric (0.004) does not differ much from the linear regression with the filtered data.

Reshaping the filtered data

X_train = np.array(X_train).flatten()
y_train = np.array(y_train).flatten()

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)

X_train shape: (180,)
y_train shape: (180,)

Gardner function for double fitting, for alpha and beta

def gardner_model(X_train, alpha, beta):
    return alpha * X_train ** beta

Model fit

popt, pcov = curve_fit(gardner_model, X_train, y_train, p0=[0, 0])  # p0 is the initial estimation of alpha y beta

Coheficients and prediction

alpha, beta = popt

print('Alpha:', alpha)
print('Beta:', beta)

nlf_predict = gardner_model(X_test, alpha, beta)

Alpha: 4944.85723377095
Beta: 0.1017276747649569

Metric of the non linear fit with filtered data

nlf_metrics = basic_stat.metrics(y_test, nlf_predict)

print("Metrics for Non Linear Fit:\n")
print(nlf_metrics)

Metrics for Non Linear Fit:

{'Sum of Residuals': np.float64(-672.1847517458073), 'MAE': np.float64(321.430456631444), 'MSE': np.float64(152446.9582936085), 'RMSE': np.float64(390.4445649431024), 'R2': 0.003735216355170601}

Plot of filtered data and NLF equation

plt.scatter(X_test, y_test, label='Real data')
plt.plot(X_test, nlf_predict, c='r', label='Prediction')
plt.title(f"Vp = {alpha:.2f} * den ** {beta:.2f}")
plt.xlabel("Density (g/cm3)")
plt.suptitle('Non Linear Fit')
plt.ylabel("Vp (m/s)")
plt.grid()
plt.show()

Non Linear Fit, Vp = 4944.86 * den ** 0.10

nlf Vp for comparison

nlf_vp = gardner_model(den_feature, alpha, beta)
nlf_vp

array([[5451.77127868],
       [5449.42946455],
       [5494.46242706],
       [5475.7369865 ],
       [5517.43839526],
       [5473.89450644],
       [5451.98373102],
       [5417.92250818],
       [5465.84650885],
       [5466.67677416],
       [5475.32801939],
       [5467.92008814],
       [5456.63928719],
       [5465.22307824],
       [5451.77127868],
       [5523.52610545],
       [5469.57396647],
       [5470.19303406],
       [5467.50592748],
       [5479.81189483],
       [5470.19303406],
       [5460.42246055],
       [5499.40142812],
       [5460.00324939]])

Vps Comparison¶

The real Vp values allow us to evaluate the quality of different procedures that estimate Vp, such as different imputation algorithms, the Gardner empirical formula, and the simpliest ML regression model.

Datos y títulos a utilizar en el bucle

plt.figure(figsize=(9, 15))

# Histograms
plt.subplot(7, 1, 1)
plt.grid(zorder=2)
plt.hist(features2.Vp, bins=20, zorder=3)
plt.xlim(2500, 7000)
plt.xlabel('m/s')
plt.title('Real Vp')

plt.subplot(7, 1, 2)
box = plt.boxplot(features2.Vp.dropna(), vert=False, showmeans=True, meanline=True, patch_artist=False)
mean_line = box['means'][0]  # Obtener la línea de la media
median_line = box['medians'][0]  # Obtener la línea de la mediana
plt.legend([mean_line, median_line], ['Mean', 'Median'], loc='upper left')
plt.xlim(2500, 7000)
plt.title('Real Vp')
plt.xlabel('m/s')
plt.grid()

plt.subplot(7, 1, 3)
plt.grid(zorder=2)
plt.hist(gard_vp, label='Gardner Vp', zorder=4)
plt.hist(nii_vp, label='NII Imputer Vp', zorder=7)
plt.hist(lr_vp, label='LR Vp', zorder=9)
plt.hist(nlf_vp, label='NLF Vp', zorder=10)
plt.xlim(2500, 7000)
plt.xlabel('m/s')
plt.legend()
plt.title('Estimated Vps')

# Boxplots
data_list = [gard_vp, nii_vp, lr_vp, nlf_vp]
titles = ['Gardner Vp', 'NII Inputer Vp', 'LR Vp', 'NLF Vp']
means = [np.mean(features2.Vp.dropna()), np.mean(gard_vp), np.mean(nii_vp), np.mean(lr_vp), np.mean(nlf_vp)]

for i in range(len(data_list)):
    plt.subplot(7, 1, i + 4)
    plt.boxplot(data_list[i], vert=False, showmeans=True, meanline=True)
    vp_change = 100 * (data_list[i].mean() - features2.Vp.mean()) / features2.Vp.mean()
    plt.text(means[i] - 500, 1.25, 'Mean: {:.0f}, % change: {:.1f}'.format(means[i], vp_change))
    plt.xlim(2500, 7000)
    plt.title(titles[i])
    plt.xlabel('m/s')
    plt.grid()

plt.tight_layout()

Real Vp, Real Vp, Estimated Vps, Gardner Vp, NII Inputer Vp, LR Vp, NLF Vp

Observations¶

Here are some observations related to the tasks covered in this notebook:

Anomalies represent 0.5% (18) of the data, and after removing them we reach a total of 4.3% of NaN (142).
The metrics favor the nii imputer. However, the knn, ii, and nknn imputer provided similar results. In the end, the dataset without anomalies (features2) was filled with the nii imputer (features3) because it was the best option.
In addition to the correlation matrix (described in the previous notebook 1/3), it is important to explore the covariance matrix because it provides valuable information regarding the scale-dependent relationships between variables, reflecting how changes in one variable are associated with changes in another in terms of their actual units.
The poor correlation (low covariance) between Den and Vp was not realized until Gardner’s Vp was calculated. This gave negative $R^2$, even after filtering part of the data outside the increasing trend.
Filtering the Den and Vp values outside the increasing trend allows us to obtain positive but very small $R^2$ for the LR and the NLF ML algorithms.

Next Step¶

The next and last notebook (3/3), after the imputation, is going to focus on the prediction of an entire variable (GR), missing in one of the available boreholes.

Reference¶

Gardner’s Equation

Annex - Regression Metrics¶

These metrics help you understand different aspects of prediction accuracy and are critical for evaluating and comparing regression models. The main metrics are:

The Sum of Residuals is the sum of errors between the predicted and the actual values. Ideally, it should add up to 0.

\[\text{Sum of Residuals} = \sum_{i=1}^{n} (y_i - \hat{y}_i)\]

Where:

$y_i$ is the real value.
$\hat{y}_i$ is the predicted value.

The MAE measures the average of the absolute errors between the predicted and the actual values. It is an intuitive and easy-to-interpret metric.

\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\]

Where:

$n$ is the number of observations.

The MSE measures the average of the squared errors between the predicted values and the actual values. It penalizes large errors more than the MAE.

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

The RMSE is the square root of the MSE. It is expressed in the same units as the output variable, which makes it more interpretable.

\[\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{\text{MSE}}\]

The $R^2$ measures the proportion of the variance of the dependent or predicted variable against the independent variables. Normally it ranges from 0 (not fit at all) to 1 (perfect fit). A rare negative value indicates the prediction model is worse than a prediction with the mean.

\[R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\]

Where:

$\bar{y}$ is the mean of the real values $y_i$.

Total running time of the script: (0 minutes 1.429 seconds)

Gallery generated by Sphinx-Gallery

Modeling and Propagation of Petrophysical Data for Mining Exploration (2/3)¶

Introduction¶

Variables¶

Libraries¶

Settings¶

Data Loading¶

Features Cleaning¶

NaN and Anomalies Preservation¶

Anomalies Deletion¶

NaNs in the Features¶

Filling the holes¶

Imputations Evaluation¶

Estimating Vp¶

Vp by Gardner¶

Simpliest ML models for Vp¶

Simple Linear Regression¶

Non Linear Fit¶

Vps Comparison¶

Observations¶

Next Step¶

Reference¶

Annex - Regression Metrics¶

Vector Geology

Navigation