.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/05_petrophysics/02_no_2_filling_gaps.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_05_petrophysics_02_no_2_filling_gaps.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_05_petrophysics_02_no_2_filling_gaps.py:

Modeling and Propagation of Petrophysical Data for Mining Exploration (2/3)
============================================================================

**Cleaning and Filling the Gaps**

Barcelona 25/09/24
GEO3BCN

Manuel David Soto, Juan Alcalde, Adrià Hernàndez-Pineda, Ramón Carbonel

Introduction
-------------

The dispersion and scarcity of petrophysical data are well-known challenges in the mining sector. These issues are driven primarily by economic factors, but also by geological factors (sedimentary cover, weathering, erosion, or the depth of targets), by geotechnical issues (e.g., slope or borehole stability), and by technical limitations or availability problems that other industries have already solved (for instance, sonic logs were long not acquired in mining because there was little interest in velocity data).

To address the challenge of sparse and incomplete petrophysical data in the mining sector, we have developed three Jupyter notebooks that tackle these issues with a suite of open-source Python tools. These tools support researchers in the initial exploration, visualization, and integration of data from diverse sources, fill gaps caused by technical limitations, and ultimately enable the complete modeling of missing properties (through standard and more advanced ML-based models). We applied these tools to recently acquired and legacy petrophysical data from two cores drilled northwest of Collinstown (County Westmeath, Province of Leinster), Ireland, located 26 km west-northwest of the Navan mine. However, these tools are adaptable and applicable to mining data from any location.

After the EDA and integration of the petrophysical dataset of Collinstown (notebook 1/3), this second notebook gathers tasks that, in data science, fall under data mining. These tasks are:

* Record the positions of the original NaNs.
* Record the positions and values of the anomalies to be deleted.
* Delete the anomalies.
* Fill in all NaNs, both the original ones and those created by deleting the anomalies, using different means (imputations, empirical formulas, simple ML models).
* Compare the effectiveness of the different gap-filling methods.
* Use the best option to fill the gaps and deliver the corrected petrophysical data for further investigation.

As with the previous notebook, these tasks are performed with open-source Python tools that are easily accessible to any researcher through a Python installation connected to the Internet.

.. GENERATED FROM PYTHON SOURCE LINES 30-57

Variables
---------

The dataset used in this notebook is the 'features' dataset from the previous notebook (1/3). It contains the modelable petrophysical features with their respective anomalies.
'Hole' (text object) and 'Len' (float) are reference variables, while 'Form' (text object) is a categorical variable representing the major formations:

+------+------------------------------------------------+--------+ | Name | Explanation | Unit | +======+================================================+========+ | Hole | Hole name | - | +------+------------------------------------------------+--------+ | From | Top of the sample | m | +------+------------------------------------------------+--------+ | Len | Length of the core sample | cm | +------+------------------------------------------------+--------+ | Den | Density | g/cm³ | +------+------------------------------------------------+--------+ | Vp | Compressional velocity | m/s | +------+------------------------------------------------+--------+ | Vs | Shear velocity | m/s | +------+------------------------------------------------+--------+ | Mag | Magnetic susceptibility | - | +------+------------------------------------------------+--------+ | Ip | Chargeability or induced polarization | mv/V | +------+------------------------------------------------+--------+ | Res | Resistivity | ohm·m | +------+------------------------------------------------+--------+ | Form | Major formations or zone along the hole | - | +------+------------------------------------------------+--------+

.. GENERATED FROM PYTHON SOURCE LINES 59-62

Libraries
---------

The following are the Python libraries used throughout this notebook. PSL are Python Standard Libraries, UDL are User Defined Libraries, and PEL are Python External Libraries:

.. GENERATED FROM PYTHON SOURCE LINES 64-65

PSL

.. GENERATED FROM PYTHON SOURCE LINES 65-103

.. code-block:: Python

    import sys
    import warnings

    # UDL
    from vector_geology import basic_stat, geo

    # PEL - Basic
    import numpy as np
    import pandas as pd
    from tabulate import tabulate

    # PEL - Plotting
    import matplotlib.pyplot as plt
    import seaborn as sns

    # PEL - Data selection, transformation, preprocessing
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import PolynomialFeatures

    # PEL - Imputer
    from sklearn.impute import KNNImputer, SimpleImputer
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer

    # PEL - ML algorithms
    from scipy.optimize import curve_fit
    from sklearn.linear_model import LinearRegression, Lasso, Ridge
    from sklearn.svm import SVR
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn import tree
    import xgboost as xgb

    # PEL - Metrics
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

.. GENERATED FROM PYTHON SOURCE LINES 104-106

Settings
--------

.. GENERATED FROM PYTHON SOURCE LINES 108-109

Seed for the random processes

.. GENERATED FROM PYTHON SOURCE LINES 109-114

.. code-block:: Python

    seed = 123

    # Warning suppression
    warnings.filterwarnings("ignore")

.. GENERATED FROM PYTHON SOURCE LINES 115-117

Data Loading
------------

.. GENERATED FROM PYTHON SOURCE LINES 119-120

Features data load

.. GENERATED FROM PYTHON SOURCE LINES 120-123

.. code-block:: Python

    features = pd.read_csv('Output/features.csv', index_col=0)
    features.head()

..
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Hole</th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> <th>Form</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>TC-1319-005</td> <td>265.77</td> <td>11.2</td> <td>2.610</td> <td>2179.0</td> <td>NaN</td> <td>0.108</td> <td>NaN</td> <td>NaN</td> <td>CALP</td> </tr> <tr> <th>1</th> <td>TC-1319-005</td> <td>271.66</td> <td>11.8</td> <td>4.318</td> <td>5514.0</td> <td>2222.0</td> <td>0.108</td> <td>26.06</td> <td>4600.0</td> <td>CALP</td> </tr> <tr> <th>2</th> <td>TC-1319-005</td> <td>277.05</td> <td>10.3</td> <td>3.399</td> <td>5754.0</td> <td>2085.0</td> <td>0.237</td> <td>17.52</td> <td>10200.0</td> <td>CALP</td> </tr> <tr> <th>3</th> <td>TC-1319-005</td> <td>282.55</td> <td>11.2</td> <td>4.342</td> <td>5773.0</td> <td>2011.0</td> <td>0.151</td> <td>28.13</td> <td>5800.0</td> <td>CALP</td> </tr> <tr> <th>4</th> <td>TC-1319-005</td> <td>287.20</td> <td>10.8</td> <td>3.293</td> <td>6034.0</td> <td>2122.0</td> <td>0.176</td> <td>6.88</td> <td>15500.0</td> <td>CALP</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 124-126 .. code-block:: Python features.describe() .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>329.000000</td> <td>321.000000</td> <td>327.000000</td> <td>310.000000</td> <td>289.000000</td> <td>326.000000</td> <td>303.000000</td> <td>303.000000</td> </tr> <tr> <th>mean</th> <td>543.279696</td> <td>10.223364</td> <td>2.761664</td> <td>5441.990323</td> <td>2772.937716</td> <td>0.430543</td> <td>49.738548</td> <td>3283.762376</td> </tr> <tr> <th>std</th> <td>262.864370</td> <td>1.070127</td> <td>0.293938</td> <td>700.858075</td> <td>653.446129</td> <td>5.192919</td> <td>25.145208</td> <td>3714.851096</td> </tr> <tr> <th>min</th> <td>9.600000</td> <td>7.500000</td> <td>2.420000</td> <td>992.000000</td> <td>0.000000</td> <td>0.001000</td> <td>5.390000</td> <td>156.900000</td> </tr> <tr> <th>25%</th> <td>355.580000</td> <td>9.700000</td> <td>2.678500</td> <td>5196.000000</td> <td>2294.000000</td> <td>0.064000</td> <td>31.555000</td> <td>1100.000000</td> </tr> <tr> <th>50%</th> <td>547.800000</td> <td>10.000000</td> <td>2.706000</td> <td>5575.000000</td> <td>3041.000000</td> <td>0.114500</td> <td>47.040000</td> <td>2100.000000</td> </tr> <tr> <th>75%</th> <td>750.000000</td> <td>10.400000</td> <td>2.731000</td> <td>5857.000000</td> <td>3248.000000</td> <td>0.205250</td> <td>67.190000</td> <td>3850.000000</td> </tr> <tr> <th>max</th> <td>1049.200000</td> <td>18.000000</td> <td>5.074000</td> <td>6478.000000</td> <td>3740.000000</td> <td>93.886000</td> <td>171.730000</td> <td>23900.000000</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. 
.. GENERATED FROM PYTHON SOURCE LINES 127-128

Columns in the features dataframe

.. GENERATED FROM PYTHON SOURCE LINES 128-130

.. code-block:: Python

    features.columns

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Index(['Hole', 'From', 'Len', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res', 'Form'], dtype='object')

.. GENERATED FROM PYTHON SOURCE LINES 131-153

Features Cleaning
------------------

The plots below show the anomalous values of the four affected variables. Later in this section we register the positions of the anomalies and then delete them, according to the findings of the previous notebook (1/3):

+------+--------------------------------------------------------------+ | Feat.| Deleting Anomalies | +======+==============================================================+ | Den | Values above 4 g/cm³ are related to bad measurements in core | | | samples longer than 11.2 cm. | +------+--------------------------------------------------------------+ | Vp | Values below 3000 m/s are related to fractures in the core | | | samples. | +------+--------------------------------------------------------------+ | Vs | Values less than 1000 m/s are also related to core fractures.| +------+--------------------------------------------------------------+ | Mag | The anomaly of 93.9 is above the range of the measuring | | | equipment. | +------+--------------------------------------------------------------+ | Ip | We see no reasons to discard outliers. | +------+--------------------------------------------------------------+ | Res | We see no reasons to discard outliers. | +------+--------------------------------------------------------------+

.. GENERATED FROM PYTHON SOURCE LINES 155-156

Bad data areas in Vp, Vs, Mag vs. Den

.. GENERATED FROM PYTHON SOURCE LINES 156-195

.. code-block:: Python

    plt.figure(figsize=(16, 5))

    # Vp vs. Den bad data area
    plt.subplot(131)
    plt.scatter(features.Vp, features.Den, alpha=0.65, edgecolors='k', c='skyblue')
    plt.axhspan(4, 5.25, color='r', alpha=0.2)
    plt.axvspan(500, 3000, color='r', alpha=0.2)
    plt.text(1100, 4.7, 'Bad Data Area')
    plt.axis([500, 7000, 2, 5.25])
    plt.xlabel('Vp (m/s)')
    plt.ylabel('Den (g/cm3)')
    plt.grid()

    # Vs vs. Den bad data area
    plt.subplot(132)
    plt.scatter(features.Vs, features.Den, alpha=0.65, edgecolors='k', c='skyblue')
    plt.axhspan(4, 5.25, color='r', alpha=0.2)
    plt.axvspan(0, 1000, color='r', alpha=0.2)
    plt.text(75, 4.7, 'Bad Data Area')
    plt.axis([0, 4000, 2, 5.25])
    plt.xlabel('Vs (m/s)')
    plt.ylabel('Den (g/cm3)')
    plt.grid()

    # Mag vs. Den bad data area
    plt.subplot(133)
    plt.scatter(features.Mag, features.Den, alpha=0.65, edgecolors='k', c='skyblue')
    plt.axvspan(20, 100, color='r', alpha=0.2)
    plt.axhspan(4, 5.25, color='r', alpha=0.2)
    plt.text(2, 4.7, 'Bad Data Area')
    plt.axis([-5, 100, 2, 5.25])
    plt.xlabel('Magnetic Susceptibility')
    plt.ylabel('Den (g/cm3)')
    plt.grid()

    plt.suptitle('Variable with Anomalies in the Keywells')
    plt.tight_layout();

.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_001.png
   :alt: Variable with Anomalies in the Keywells
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 196-199

NaN and Anomalies Preservation
--------------------------------

The following dataframe preserves the locations of the original NaNs (1) and of the anomalies (2):

.. GENERATED FROM PYTHON SOURCE LINES 201-203

.. code-block:: Python

    nan_ano_loc = features.copy()

.. GENERATED FROM PYTHON SOURCE LINES 204-205

Fill the dataframe with 1s and 2s

..
GENERATED FROM PYTHON SOURCE LINES 205-214 .. code-block:: Python for column in features.columns: nan_ano_loc[column] = np.where(pd.isna(features[column]), 1, 0) nan_ano_loc.Den = np.where(features.Den > 4, 2, 0) nan_ano_loc.Vp = np.where(features.Vp < 3000, 2, 0) nan_ano_loc.Vs = np.where(features.Vs < 1000, 2, 0) nan_ano_loc.Mag = np.where(features.Mag > 20, 2, 0) nan_ano_loc.head() .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Hole</th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> <th>Form</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>2</td> <td>0</td> <td>0</td> <td>1</td> <td>1</td> <td>0</td> </tr> <tr> <th>1</th> <td>0</td> <td>0</td> <td>0</td> <td>2</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> </tr> <tr> <th>2</th> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> </tr> <tr> <th>3</th> <td>0</td> <td>0</td> <td>0</td> <td>2</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> </tr> <tr> <th>4</th> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 215-217 .. code-block:: Python nan_ano_loc.describe() .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Hole</th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> <th>Form</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>329.0</td> <td>329.0</td> <td>329.000000</td> <td>329.000000</td> <td>329.000000</td> <td>329.000000</td> <td>329.000000</td> <td>329.000000</td> <td>329.000000</td> <td>329.0</td> </tr> <tr> <th>mean</th> <td>0.0</td> <td>0.0</td> <td>0.024316</td> <td>0.048632</td> <td>0.030395</td> <td>0.030395</td> <td>0.006079</td> <td>0.079027</td> <td>0.079027</td> <td>0.0</td> </tr> <tr> <th>std</th> <td>0.0</td> <td>0.0</td> <td>0.154263</td> <td>0.308527</td> <td>0.245049</td> <td>0.245049</td> <td>0.110264</td> <td>0.270192</td> <td>0.270192</td> <td>0.0</td> </tr> <tr> <th>min</th> <td>0.0</td> <td>0.0</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.0</td> </tr> <tr> <th>25%</th> <td>0.0</td> <td>0.0</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.0</td> </tr> <tr> <th>50%</th> <td>0.0</td> <td>0.0</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.0</td> </tr> <tr> <th>75%</th> <td>0.0</td> <td>0.0</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.0</td> </tr> <tr> 
<th>max</th> <td>0.0</td> <td>0.0</td> <td>1.000000</td> <td>2.000000</td> <td>2.000000</td> <td>2.000000</td> <td>2.000000</td> <td>1.000000</td> <td>1.000000</td> <td>0.0</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 218-219 Save the locations of nans and anomalies .. GENERATED FROM PYTHON SOURCE LINES 219-222 .. code-block:: Python nan_ano_loc.to_csv('Output/nan_ano_loc.csv') .. GENERATED FROM PYTHON SOURCE LINES 223-224 Preservation of the anomalies .. GENERATED FROM PYTHON SOURCE LINES 224-229 .. code-block:: Python anomalies = features[features.Den > 4] anomalies = pd.concat([anomalies, features[features.Vp < 3000]]) anomalies = pd.concat([anomalies, features[features.Vs < 1000]]) anomalies = pd.concat([anomalies, features[features.Mag > 2]]) .. GENERATED FROM PYTHON SOURCE LINES 230-233 .. code-block:: Python anomalies = anomalies.drop_duplicates().sort_index() anomalies .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Hole</th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> <th>Form</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>TC-1319-005</td> <td>265.77</td> <td>11.2</td> <td>2.610</td> <td>2179.0</td> <td>NaN</td> <td>0.108</td> <td>NaN</td> <td>NaN</td> <td>CALP</td> </tr> <tr> <th>1</th> <td>TC-1319-005</td> <td>271.66</td> <td>11.8</td> <td>4.318</td> <td>5514.0</td> <td>2222.0</td> <td>0.108</td> <td>26.06</td> <td>4600.0</td> <td>CALP</td> </tr> <tr> <th>3</th> <td>TC-1319-005</td> <td>282.55</td> <td>11.2</td> <td>4.342</td> <td>5773.0</td> <td>2011.0</td> <td>0.151</td> <td>28.13</td> <td>5800.0</td> <td>CALP</td> </tr> <tr> <th>6</th> <td>TC-1319-005</td> <td>301.50</td> <td>11.6</td> <td>4.251</td> <td>5686.0</td> <td>2755.0</td> <td>0.132</td> <td>14.77</td> <td>6000.0</td> <td>CALP</td> </tr> <tr> <th>7</th> <td>TC-1319-005</td> <td>313.65</td> <td>11.4</td> <td>4.193</td> <td>5588.0</td> <td>2039.0</td> <td>0.068</td> <td>20.37</td> <td>5400.0</td> <td>CALP</td> </tr> <tr> <th>11</th> <td>TC-1319-005</td> <td>334.65</td> <td>11.3</td> <td>4.079</td> <td>5678.0</td> <td>2198.0</td> <td>0.036</td> <td>34.51</td> <td>2000.0</td> <td>CALP</td> </tr> <tr> <th>12</th> <td>TC-1319-005</td> <td>340.00</td> <td>11.9</td> <td>4.444</td> <td>5561.0</td> <td>2581.0</td> <td>0.024</td> <td>14.92</td> <td>4800.0</td> <td>CALP</td> </tr> <tr> <th>13</th> <td>TC-1319-005</td> <td>344.77</td> <td>11.2</td> <td>4.123</td> <td>5359.0</td> <td>NaN</td> <td>0.041</td> <td>45.93</td> <td>1000.0</td> <td>CALP</td> </tr> <tr> <th>24</th> <td>TC-1319-005</td> <td>400.10</td> <td>13.9</td> <td>5.074</td> <td>5367.0</td> <td>2422.0</td> <td>0.066</td> <td>NaN</td> <td>NaN</td> <td>CALP</td> </tr> <tr> <th>77</th> <td>TC-1319-005</td> <td>645.30</td> <td>11.0</td> <td>2.725</td> <td>992.0</td> <td>NaN</td> <td>0.308</td> <td>98.11</td> <td>293.3</td> <td>LABL</td> </tr> <tr> <th>79</th> <td>TC-1319-005</td> <td>653.95</td> <td>12.0</td> <td>2.936</td> <td>2321.0</td> <td>1929.0</td> <td>0.342</td> <td>88.45</td> <td>443.4</td> <td>LABL</td> </tr> <tr> <th>159</th> <td>TC-1319-005</td> <td>1046.35</td> <td>10.6</td> <td>2.455</td> <td>2325.0</td> <td>NaN</td> <td>0.158</td> 
<td>44.44</td> <td>1300.0</td> <td>LIS</td> </tr> <tr> <th>174</th> <td>TC-3660-008</td> <td>64.54</td> <td>9.4</td> <td>2.677</td> <td>2802.0</td> <td>0.0</td> <td>0.038</td> <td>35.19</td> <td>2500.0</td> <td>CALP</td> </tr> <tr> <th>198</th> <td>TC-3660-008</td> <td>188.03</td> <td>9.4</td> <td>2.679</td> <td>5345.0</td> <td>0.0</td> <td>0.098</td> <td>NaN</td> <td>NaN</td> <td>CALP</td> </tr> <tr> <th>243</th> <td>TC-3660-008</td> <td>432.95</td> <td>9.7</td> <td>2.779</td> <td>4641.0</td> <td>178.0</td> <td>NaN</td> <td>69.69</td> <td>515.7</td> <td>FLT</td> </tr> <tr> <th>248</th> <td>TC-3660-008</td> <td>452.30</td> <td>10.4</td> <td>2.686</td> <td>5652.0</td> <td>0.0</td> <td>0.053</td> <td>27.54</td> <td>4000.0</td> <td>FLT</td> </tr> <tr> <th>305</th> <td>TC-3660-008</td> <td>734.80</td> <td>9.5</td> <td>2.708</td> <td>5026.0</td> <td>2914.0</td> <td>93.886</td> <td>71.63</td> <td>1200.0</td> <td>SP</td> </tr> <tr> <th>312</th> <td>TC-3660-008</td> <td>774.21</td> <td>9.6</td> <td>2.787</td> <td>6173.0</td> <td>0.0</td> <td>0.123</td> <td>73.08</td> <td>373.2</td> <td>SP</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 234-235 Number and % of anomalies by boreholes .. GENERATED FROM PYTHON SOURCE LINES 235-239 .. code-block:: Python print('Total anomalies:', len(anomalies), '({:.1f}%)'.format(100 * len(anomalies) / (features.shape[0] * features.shape[1]))) anomalies.Hole.value_counts() .. rst-class:: sphx-glr-script-out .. code-block:: none Total anomalies: 18 (0.5%) Hole TC-1319-005 12 TC-3660-008 6 Name: count, dtype: int64 .. GENERATED FROM PYTHON SOURCE LINES 240-241 Save the anomalies .. GENERATED FROM PYTHON SOURCE LINES 241-243 .. code-block:: Python anomalies.to_csv('Output/anomalies.csv') .. GENERATED FROM PYTHON SOURCE LINES 244-245 Rows with anomalies .. GENERATED FROM PYTHON SOURCE LINES 245-248 .. code-block:: Python list(anomalies.index) .. rst-class:: sphx-glr-script-out .. code-block:: none [0, 1, 3, 6, 7, 11, 12, 13, 24, 77, 79, 159, 174, 198, 243, 248, 305, 312] .. GENERATED FROM PYTHON SOURCE LINES 249-251 Anomalies Deletion ------------------ .. GENERATED FROM PYTHON SOURCE LINES 253-254 features2, new features dataframe without anomalies .. GENERATED FROM PYTHON SOURCE LINES 254-264 .. code-block:: Python features2 = features.copy() features2['Den'] = np.where(features.Den > 4, np.nan, features.Den) features2['Vp'] = np.where(features.Vp < 3000, np.nan, features.Vp) features2['Vs'] = np.where(features.Vs < 1000, np.nan, features.Vs) features2['Mag'] = np.where(features.Mag > 20, np.nan, features.Mag) features2.head() .. 
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Hole</th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> <th>Form</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>TC-1319-005</td> <td>265.77</td> <td>11.2</td> <td>2.610</td> <td>NaN</td> <td>NaN</td> <td>0.108</td> <td>NaN</td> <td>NaN</td> <td>CALP</td> </tr> <tr> <th>1</th> <td>TC-1319-005</td> <td>271.66</td> <td>11.8</td> <td>NaN</td> <td>5514.0</td> <td>2222.0</td> <td>0.108</td> <td>26.06</td> <td>4600.0</td> <td>CALP</td> </tr> <tr> <th>2</th> <td>TC-1319-005</td> <td>277.05</td> <td>10.3</td> <td>3.399</td> <td>5754.0</td> <td>2085.0</td> <td>0.237</td> <td>17.52</td> <td>10200.0</td> <td>CALP</td> </tr> <tr> <th>3</th> <td>TC-1319-005</td> <td>282.55</td> <td>11.2</td> <td>NaN</td> <td>5773.0</td> <td>2011.0</td> <td>0.151</td> <td>28.13</td> <td>5800.0</td> <td>CALP</td> </tr> <tr> <th>4</th> <td>TC-1319-005</td> <td>287.20</td> <td>10.8</td> <td>3.293</td> <td>6034.0</td> <td>2122.0</td> <td>0.176</td> <td>6.88</td> <td>15500.0</td> <td>CALP</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 265-267 .. code-block:: Python features2.describe() .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>329.000000</td> <td>321.000000</td> <td>319.000000</td> <td>305.000000</td> <td>284.000000</td> <td>325.000000</td> <td>303.000000</td> <td>303.000000</td> </tr> <tr> <th>mean</th> <td>543.279696</td> <td>10.223364</td> <td>2.721755</td> <td>5496.386885</td> <td>2821.130282</td> <td>0.142988</td> <td>49.738548</td> <td>3283.762376</td> </tr> <tr> <th>std</th> <td>262.864370</td> <td>1.070127</td> <td>0.145217</td> <td>556.077848</td> <td>547.476118</td> <td>0.100706</td> <td>25.145208</td> <td>3714.851096</td> </tr> <tr> <th>min</th> <td>9.600000</td> <td>7.500000</td> <td>2.420000</td> <td>3009.000000</td> <td>1295.000000</td> <td>0.001000</td> <td>5.390000</td> <td>156.900000</td> </tr> <tr> <th>25%</th> <td>355.580000</td> <td>9.700000</td> <td>2.678000</td> <td>5220.000000</td> <td>2368.250000</td> <td>0.064000</td> <td>31.555000</td> <td>1100.000000</td> </tr> <tr> <th>50%</th> <td>547.800000</td> <td>10.000000</td> <td>2.705000</td> <td>5581.000000</td> <td>3041.000000</td> <td>0.114000</td> <td>47.040000</td> <td>2100.000000</td> </tr> <tr> <th>75%</th> <td>750.000000</td> <td>10.400000</td> <td>2.726500</td> <td>5858.000000</td> <td>3248.000000</td> <td>0.203000</td> <td>67.190000</td> <td>3850.000000</td> </tr> <tr> <th>max</th> <td>1049.200000</td> <td>18.000000</td> <td>3.650000</td> <td>6478.000000</td> <td>3740.000000</td> <td>0.545000</td> <td>171.730000</td> <td>23900.000000</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. 
GENERATED FROM PYTHON SOURCE LINES 268-269 Boxplot of each feature without anomalies .. GENERATED FROM PYTHON SOURCE LINES 269-274 .. code-block:: Python features2.plot.box(subplots=True, grid=False, figsize=(12, 7), layout=(3, 4), flierprops={"marker": "."}) plt.suptitle('Features Without Anomalies') plt.tight_layout(); .. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_002.png :alt: Features Without Anomalies :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 275-276 Save features without anomalies .. GENERATED FROM PYTHON SOURCE LINES 276-279 .. code-block:: Python features2.to_csv('Output/features2.csv') .. GENERATED FROM PYTHON SOURCE LINES 280-282 NaNs in the Features -------------------- .. GENERATED FROM PYTHON SOURCE LINES 284-285 Original NaNs .. GENERATED FROM PYTHON SOURCE LINES 285-288 .. code-block:: Python features.isna().sum() .. rst-class:: sphx-glr-script-out .. code-block:: none Hole 0 From 0 Len 8 Den 2 Vp 19 Vs 40 Mag 3 Ip 26 Res 26 Form 0 dtype: int64 .. GENERATED FROM PYTHON SOURCE LINES 289-290 Total original NaNs .. GENERATED FROM PYTHON SOURCE LINES 290-293 .. code-block:: Python features.isna().sum().sum() .. rst-class:: sphx-glr-script-out .. code-block:: none np.int64(124) .. GENERATED FROM PYTHON SOURCE LINES 294-295 % of original NaNs .. GENERATED FROM PYTHON SOURCE LINES 295-298 .. code-block:: Python features.isna().sum() / len(features) * 100 .. rst-class:: sphx-glr-script-out .. code-block:: none Hole 0.000000 From 0.000000 Len 2.431611 Den 0.607903 Vp 5.775076 Vs 12.158055 Mag 0.911854 Ip 7.902736 Res 7.902736 Form 0.000000 dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 299-300 Total % original and new NaNs .. GENERATED FROM PYTHON SOURCE LINES 300-303 .. code-block:: Python print('% NaN in Features:', round(100 * features.isna().sum().sum() / (features.shape[0] * features.shape[1]), 1)) .. rst-class:: sphx-glr-script-out .. code-block:: none % NaN in Features: 3.8 .. GENERATED FROM PYTHON SOURCE LINES 304-305 Original and new NaNs .. GENERATED FROM PYTHON SOURCE LINES 305-308 .. code-block:: Python features2.isna().sum() .. rst-class:: sphx-glr-script-out .. code-block:: none Hole 0 From 0 Len 8 Den 10 Vp 24 Vs 45 Mag 4 Ip 26 Res 26 Form 0 dtype: int64 .. GENERATED FROM PYTHON SOURCE LINES 309-310 Total original and new NaNs .. GENERATED FROM PYTHON SOURCE LINES 310-313 .. code-block:: Python features2.isna().sum().sum() .. rst-class:: sphx-glr-script-out .. code-block:: none np.int64(143) .. GENERATED FROM PYTHON SOURCE LINES 314-315 Total % original and new NaNs .. GENERATED FROM PYTHON SOURCE LINES 315-318 .. code-block:: Python print('% NaN in features2:', round(100 * features2.isna().sum().sum() / (features2.shape[0] * features2.shape[1]), 1)) .. rst-class:: sphx-glr-script-out .. code-block:: none % NaN in features2: 4.3 .. GENERATED FROM PYTHON SOURCE LINES 319-320 % Total original and new NaNs by feature .. GENERATED FROM PYTHON SOURCE LINES 320-323 .. code-block:: Python features2.isna().sum() / len(features) * 100 .. rst-class:: sphx-glr-script-out .. code-block:: none Hole 0.000000 From 0.000000 Len 2.431611 Den 3.039514 Vp 7.294833 Vs 13.677812 Mag 1.215805 Ip 7.902736 Res 7.902736 Form 0.000000 dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 324-325 Bar plot of NaNs .. GENERATED FROM PYTHON SOURCE LINES 325-329 .. 
.. code-block:: Python

    features2.isna().sum().plot.bar(figsize=(8, 3), color='r', ylabel='Count', title='All Features NaN', label='Without anomalies', legend=True)
    features.isna().sum().plot.bar(figsize=(8, 3), color='b', label='With anomalies', legend=True, grid=True)

.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_003.png
   :alt: All Features NaN
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_003.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    <Axes: title={'center': 'All Features NaN'}, ylabel='Count'>

.. GENERATED FROM PYTHON SOURCE LINES 330-331

Bar plot of NaNs (%)

.. GENERATED FROM PYTHON SOURCE LINES 331-335

.. code-block:: Python

    (features2.isna().sum() / len(features) * 100).plot.bar(figsize=(8, 3), color='r', label='Without anomalies', legend=True)
    (features.isna().sum() / len(features) * 100).plot.bar(figsize=(8, 3), color='b', label='With anomalies', legend=True, title='All Features NaN', ylabel='%', grid=True);

.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_004.png
   :alt: All Features NaN
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_004.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    <Axes: title={'center': 'All Features NaN'}, ylabel='%'>

.. GENERATED FROM PYTHON SOURCE LINES 336-337

Total rows with at least one NaN and their indexes

.. GENERATED FROM PYTHON SOURCE LINES 337-342

.. code-block:: Python

    total_nan_index = list(features2[pd.isna(features2).any(axis=1)].index)
    print('Total rows with at least one NaN:', len(total_nan_index), '\n', '\n', 'Rows indexes:')
    total_nan_index

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Total rows with at least one NaN: 62

    Rows indexes:

    [0, 1, 3, 6, 7, 11, 12, 13, 15, 24, 36, 45, 47, 63, 68, 71, 75, 77, 79, 99, 100, 105, 110, 111, 112, 113, 114, 115, 117, 118, 135, 158, 159, 160, 164, 174, 183, 185, 187, 189, 191, 194, 197, 198, 201, 206, 209, 210, 229, 233, 240, 243, 248, 251, 276, 305, 311, 312, 318, 319, 320, 325]

.. GENERATED FROM PYTHON SOURCE LINES 343-359

Filling the holes
------------------

Imputations Evaluation
++++++++++++++++++++++++

Imputation is the process of filling in all the missing values (NaN) or gaps in a dataset. Instead of removing rows or columns with missing values, those values are replaced (imputed) with estimates based on the information available in the entire dataset (in all variables). This is typically done when you want to preserve the other values in the rows where the NaNs are, and when the missing values in each variable or column do not exceed 25%.

Since multiple imputation methods are available, it is important to evaluate each one and select the method that yields the best results, based on metrics such as the sum of residuals, MAE, MSE, RMSE, and R² (the formulas of these metrics are given in the annex). In this case, we tested six imputation methods:

* Simple mean (mean)
* Simple K-Nearest Neighbors (knn)
* Iterative Imputer (ii)
* Normalized mean (nmean)
* Normalized K-Nearest Neighbors (nknn)
* Normalized Iterative Imputer (nii)

To compare these methods, we first created fictitious gaps in the Vp values (only this variable, because of its importance and for simplicity and speed) of a NaN-free dataset, so that the real Vp values could later be compared against the imputed ones. By doing so, we can compute each method's metrics and determine which performs best.
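For reference, with :math:`y_i` the true Vp values, :math:`\hat{y}_i` the imputed values, :math:`\bar{y}` the mean of the true values, and :math:`n` the number of masked samples, the metrics used below are:

.. math:: \text{Sum of residuals} = \sum_{i=1}^{n} (y_i - \hat{y}_i)

.. math:: \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

.. math:: \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}}

.. math:: R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

These definitions match what ``mean_absolute_error``, ``mean_squared_error``, and ``r2_score`` compute in the evaluation loop further down; they are repeated here only for convenience (the annex also lists them).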
In addition to these metrics, at the end of the section, there is a plot with the resulting values of each imputer method with respect to the real values, showing less dispersion towards the top of the plot. .. GENERATED FROM PYTHON SOURCE LINES 361-362 Non-NaN dataframe to test the imputation .. GENERATED FROM PYTHON SOURCE LINES 362-367 .. code-block:: Python features2_num = features2[['From', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res']] features2_num_nonan = features2_num.dropna() features2_num_nonan.head() .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>2</th> <td>277.05</td> <td>3.399</td> <td>5754.0</td> <td>2085.0</td> <td>0.237</td> <td>17.52</td> <td>10200.0</td> </tr> <tr> <th>4</th> <td>287.20</td> <td>3.293</td> <td>6034.0</td> <td>2122.0</td> <td>0.176</td> <td>6.88</td> <td>15500.0</td> </tr> <tr> <th>5</th> <td>296.75</td> <td>2.782</td> <td>4661.0</td> <td>2294.0</td> <td>0.107</td> <td>27.81</td> <td>2100.0</td> </tr> <tr> <th>8</th> <td>318.75</td> <td>3.386</td> <td>5815.0</td> <td>2115.0</td> <td>0.060</td> <td>8.53</td> <td>10700.0</td> </tr> <tr> <th>9</th> <td>325.75</td> <td>2.616</td> <td>4785.0</td> <td>1946.0</td> <td>0.040</td> <td>84.90</td> <td>296.1</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 368-369 Generation of random mask for the fictitious NaNs or gaps .. GENERATED FROM PYTHON SOURCE LINES 369-375 .. code-block:: Python np.random.seed(seed) missing_rate = 0.2 # Porcentaje de valores faltantes Vp_mask = np.random.rand(features2_num_nonan.shape[0]) < missing_rate Vp_mask .. rst-class:: sphx-glr-script-out .. 
code-block:: none array([False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, True, False, True, False, False, False, True, False, False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, True, False, False, False, False, True, False, False, False, False, False, False, True, False, False, True, False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, False, True, True, True, False, False, False, True, False, False, False, False, True, False, False, False, False, False, False, False, True, False, True, False, False, True, False, True, False, False, False, False, True, False, False, True, False, False, False, False, False, False, True, False, False, True, False, False, False, False, False, True, False, False, False, True, False, True, False, False, False, True, False, False, False, False, True, False, False, False, True, False, False, False, False, False, False, False, False, False, True, False, False, False, True, False, False, True, False, False, False, False, True]) .. GENERATED FROM PYTHON SOURCE LINES 376-377 New gaps in Vp of the Non-NaN dataframe .. GENERATED FROM PYTHON SOURCE LINES 377-382 .. code-block:: Python features2_new_missing = features2_num_nonan.copy() features2_new_missing.loc[Vp_mask, 'Vp'] = np.nan features2_new_missing .. 
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>2</th> <td>277.05</td> <td>3.399</td> <td>5754.0</td> <td>2085.0</td> <td>0.237</td> <td>17.52</td> <td>10200.0</td> </tr> <tr> <th>4</th> <td>287.20</td> <td>3.293</td> <td>6034.0</td> <td>2122.0</td> <td>0.176</td> <td>6.88</td> <td>15500.0</td> </tr> <tr> <th>5</th> <td>296.75</td> <td>2.782</td> <td>4661.0</td> <td>2294.0</td> <td>0.107</td> <td>27.81</td> <td>2100.0</td> </tr> <tr> <th>8</th> <td>318.75</td> <td>3.386</td> <td>5815.0</td> <td>2115.0</td> <td>0.060</td> <td>8.53</td> <td>10700.0</td> </tr> <tr> <th>9</th> <td>325.75</td> <td>2.616</td> <td>4785.0</td> <td>1946.0</td> <td>0.040</td> <td>84.90</td> <td>296.1</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>323</th> <td>809.47</td> <td>2.786</td> <td>5621.0</td> <td>2334.0</td> <td>0.105</td> <td>55.20</td> <td>706.5</td> </tr> <tr> <th>324</th> <td>814.55</td> <td>2.763</td> <td>5680.0</td> <td>3179.0</td> <td>0.158</td> <td>70.94</td> <td>481.1</td> </tr> <tr> <th>326</th> <td>825.60</td> <td>2.762</td> <td>5593.0</td> <td>1787.0</td> <td>0.337</td> <td>78.44</td> <td>334.6</td> </tr> <tr> <th>327</th> <td>828.80</td> <td>2.804</td> <td>5858.0</td> <td>1964.0</td> <td>0.311</td> <td>73.73</td> <td>466.5</td> </tr> <tr> <th>328</th> <td>833.40</td> <td>2.735</td> <td>NaN</td> <td>3217.0</td> <td>0.308</td> <td>33.86</td> <td>2300.0</td> </tr> </tbody> </table> <p>267 rows × 7 columns</p> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 383-384 At index 328 the original Vp is .. GENERATED FROM PYTHON SOURCE LINES 384-387 .. code-block:: Python features.loc[328] .. rst-class:: sphx-glr-script-out .. code-block:: none Hole TC-3660-008 From 833.4 Len 10.1 Den 2.735 Vp 5805.0 Vs 3217.0 Mag 0.308 Ip 33.86 Res 2300.0 Form PAL Name: 328, dtype: object .. GENERATED FROM PYTHON SOURCE LINES 388-389 True Vp values .. GENERATED FROM PYTHON SOURCE LINES 389-393 .. code-block:: Python true_vp = features2_num_nonan[Vp_mask].Vp.reset_index(drop=True) true_vp.head() .. rst-class:: sphx-glr-script-out .. code-block:: none 0 5196.0 1 6159.0 2 6037.0 3 5815.0 4 4930.0 Name: Vp, dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 394-395 Evaluation of the imputers .. GENERATED FROM PYTHON SOURCE LINES 395-451 .. 
code-block:: Python # List of imputers imputers = [] # Initialize imputers mean_imputer = SimpleImputer(strategy='mean') ii_imputer = IterativeImputer(max_iter=10, random_state=seed) knn_imputer = KNNImputer(n_neighbors=5) # Normalization scaler = StandardScaler() features2_new_missing_scaled = scaler.fit_transform(features2_new_missing) # Append of imputer (name, norm, imputer, data) imputers.append(('mean', 'no', mean_imputer, features2_new_missing)) imputers.append(('ii', 'no', ii_imputer, features2_new_missing)) imputers.append(('knn', 'no', knn_imputer, features2_new_missing)) imputers.append(('nmean', 'yes', mean_imputer, features2_new_missing_scaled)) imputers.append(('nii', 'yes', ii_imputer, features2_new_missing_scaled)) imputers.append(('nknn', 'yes', knn_imputer, features2_new_missing_scaled)) # Results table headers = ['Imputer', 'Normalization', 'Sum Residual', 'MAE', 'MSE', 'RMSE', 'R2'] rows = [] # Loop over the imputers for name, norm, imputer, data in imputers: # Impute the data imputed_data = imputer.fit_transform(data) # Reverse the normalization if data was normalized if norm == 'yes': imputed_data = scaler.inverse_transform(imputed_data) # Convert the array to dataframe imputed_data_df = pd.DataFrame(imputed_data, columns=features2_new_missing.columns) imputed_vp = imputed_data_df.loc[Vp_mask, 'Vp'].reset_index(drop=True) # Calculate residuals and metrics residual = sum(true_vp - imputed_vp) mae = mean_absolute_error(true_vp, imputed_vp) mse = mean_squared_error(true_vp, imputed_vp) rmse = np.sqrt(mse) r2 = r2_score(true_vp, imputed_vp) # Create a dataframe for imputer using globals() globals()[f'imputer_{name}'] = imputed_data_df.copy() # Append results rows.append([name, norm, residual, mae, mse, rmse, r2]) # Print the results in a formatted table print('Seed:', seed) print(tabulate(rows, headers=headers, tablefmt="fancy_outline")) .. rst-class:: sphx-glr-script-out .. code-block:: none Seed: 123 ╒═══════════╤═════════════════╤════════════════╤═════════╤════════╤═════════╤═════════════╕ │ Imputer │ Normalization │ Sum Residual │ MAE │ MSE │ RMSE │ R2 │ ╞═══════════╪═════════════════╪════════════════╪═════════╪════════╪═════════╪═════════════╡ │ mean │ no │ 1779.8 │ 394.903 │ 256363 │ 506.323 │ -0.00517294 │ │ ii │ no │ -1106.23 │ 308.72 │ 163029 │ 403.768 │ 0.360781 │ │ knn │ no │ -490.6 │ 310.118 │ 168360 │ 410.317 │ 0.339878 │ │ nmean │ yes │ 1779.8 │ 394.903 │ 256363 │ 506.323 │ -0.00517294 │ │ nii │ yes │ -646.02 │ 296.29 │ 160690 │ 400.862 │ 0.36995 │ │ nknn │ yes │ -924.4 │ 298.808 │ 166131 │ 407.592 │ 0.348616 │ ╘═══════════╧═════════════════╧════════════════╧═════════╧════════╧═════════╧═════════════╛ .. GENERATED FROM PYTHON SOURCE LINES 452-454 .. code-block:: Python imputer_nii.head() .. 
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>277.05</td> <td>3.399</td> <td>5754.0</td> <td>2085.0</td> <td>0.237</td> <td>17.52</td> <td>10200.0</td> </tr> <tr> <th>1</th> <td>287.20</td> <td>3.293</td> <td>6034.0</td> <td>2122.0</td> <td>0.176</td> <td>6.88</td> <td>15500.0</td> </tr> <tr> <th>2</th> <td>296.75</td> <td>2.782</td> <td>4661.0</td> <td>2294.0</td> <td>0.107</td> <td>27.81</td> <td>2100.0</td> </tr> <tr> <th>3</th> <td>318.75</td> <td>3.386</td> <td>5815.0</td> <td>2115.0</td> <td>0.060</td> <td>8.53</td> <td>10700.0</td> </tr> <tr> <th>4</th> <td>325.75</td> <td>2.616</td> <td>4785.0</td> <td>1946.0</td> <td>0.040</td> <td>84.90</td> <td>296.1</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 455-456 Plot of the real Vp vs. the imputed Vp .. GENERATED FROM PYTHON SOURCE LINES 456-473 .. code-block:: Python plt.figure(figsize=(8, 8)) plt.scatter(true_vp, true_vp, color='k', label='True Values') plt.scatter(true_vp, imputer_mean[Vp_mask].Vp, label='Mean (mean)') plt.scatter(true_vp, imputer_nmean[Vp_mask].Vp, label='Normalized Mean (nmean)') plt.scatter(true_vp, imputer_ii[Vp_mask].Vp, label='Iterative (ii)') plt.scatter(true_vp, imputer_nii[Vp_mask].Vp, label='Normalized (nii)') plt.scatter(true_vp, imputer_knn[Vp_mask].Vp, label='K-Mean Neighbor (knn)') plt.scatter(true_vp, imputer_nknn[Vp_mask].Vp, label='Normalized K-Mean Neighbor (nknn)') plt.xlabel('True Vp (m/s)') plt.ylabel('Imputed Vp (m/s)') plt.title('Real Vp vs. Imputed Vp') plt.axis([4000, 6750, 4000, 6750]) plt.legend() plt.grid() plt.show() .. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_005.png :alt: Real Vp vs. Imputed Vp :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_005.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 474-479 Best Imputation +++++++++++++++ By a closer examination of the metrics, we can conclude that the normalized Iterative Imputer (nii), also called MICE (Multiple Imputation by Chained Equations), performs the best (or least poorly), particularly for the Vp variable, with the mean of the imputed values increasing just 1.2 % with respect to the mean of the real values. Consequently, we used the nii imputer to fill the gaps across all variables and saved the updated dataset in a new dataframe (features3). The multiplot of this section shows that the imputation process maintains the shape of the boxplot of the variables from which the anomalies have been removed. In other words, this multiplot of the imputed dataset (features3) is almost identical to the multiplot of features2 in section 7.2. .. GENERATED FROM PYTHON SOURCE LINES 481-482 Normalized interactive imputation of features2 .. GENERATED FROM PYTHON SOURCE LINES 482-495 .. 
code-block:: Python # Data normalization scaler = StandardScaler() features2_scaled = scaler.fit_transform(features2_num) # Imputation of normalized data nii_features3_array = ii_imputer.fit_transform(features2_scaled) # Reverse normalization and new imputed dataframe features3 = pd.DataFrame(scaler.inverse_transform(nii_features3_array), columns=features2_num.columns) features3.head() .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>265.77</td> <td>2.610000</td> <td>5725.879959</td> <td>3073.767206</td> <td>0.108</td> <td>36.995337</td> <td>5489.392555</td> </tr> <tr> <th>1</th> <td>271.66</td> <td>2.786624</td> <td>5514.000000</td> <td>2222.000000</td> <td>0.108</td> <td>26.060000</td> <td>4600.000000</td> </tr> <tr> <th>2</th> <td>277.05</td> <td>3.399000</td> <td>5754.000000</td> <td>2085.000000</td> <td>0.237</td> <td>17.520000</td> <td>10200.000000</td> </tr> <tr> <th>3</th> <td>282.55</td> <td>2.800579</td> <td>5773.000000</td> <td>2011.000000</td> <td>0.151</td> <td>28.130000</td> <td>5800.000000</td> </tr> <tr> <th>4</th> <td>287.20</td> <td>3.293000</td> <td>6034.000000</td> <td>2122.000000</td> <td>0.176</td> <td>6.880000</td> <td>15500.000000</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 496-498 .. code-block:: Python features3.describe() .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>329.000000</td> <td>329.000000</td> <td>329.000000</td> <td>329.000000</td> <td>329.000000</td> <td>329.000000</td> <td>329.000000</td> </tr> <tr> <th>mean</th> <td>543.279696</td> <td>2.723043</td> <td>5501.096555</td> <td>2808.728436</td> <td>0.142841</td> <td>48.950658</td> <td>3401.316252</td> </tr> <tr> <th>std</th> <td>262.864370</td> <td>0.143224</td> <td>538.939658</td> <td>517.989081</td> <td>0.100134</td> <td>24.374350</td> <td>3623.480943</td> </tr> <tr> <th>min</th> <td>9.600000</td> <td>2.420000</td> <td>3009.000000</td> <td>1295.000000</td> <td>0.001000</td> <td>5.390000</td> <td>156.900000</td> </tr> <tr> <th>25%</th> <td>355.580000</td> <td>2.679000</td> <td>5266.000000</td> <td>2448.000000</td> <td>0.064000</td> <td>31.808800</td> <td>1200.000000</td> </tr> <tr> <th>50%</th> <td>547.800000</td> <td>2.707000</td> <td>5587.000000</td> <td>2994.000000</td> <td>0.115000</td> <td>46.200000</td> <td>2300.000000</td> </tr> <tr> <th>75%</th> <td>750.000000</td> <td>2.731000</td> <td>5849.000000</td> <td>3231.000000</td> <td>0.201000</td> <td>65.470000</td> <td>4200.000000</td> </tr> <tr> <th>max</th> <td>1049.200000</td> <td>3.650000</td> <td>6478.000000</td> <td>3740.000000</td> <td>0.545000</td> <td>171.730000</td> <td>23900.000000</td> </tr> </tbody> </table> </div> </div> 
<br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 499-500 Boxplot of each imputed feature .. GENERATED FROM PYTHON SOURCE LINES 500-505 .. code-block:: Python features3.plot.box(subplots=True, grid=False, figsize=(12, 7), layout=(3, 4), flierprops={"marker": "."}) plt.suptitle('Imputed Features') plt.tight_layout() .. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_006.png :alt: Imputed Features :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_006.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 506-507 Copy reference variable to features3 .. GENERATED FROM PYTHON SOURCE LINES 507-510 .. code-block:: Python features3[['Hole', 'Len', 'Form']] = features2[['Hole', 'Len', 'Form']] .. GENERATED FROM PYTHON SOURCE LINES 511-512 Order of the variables .. GENERATED FROM PYTHON SOURCE LINES 512-516 .. code-block:: Python features3_series = list(features2.columns) features3_series .. rst-class:: sphx-glr-script-out .. code-block:: none ['Hole', 'From', 'Len', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res', 'Form'] .. GENERATED FROM PYTHON SOURCE LINES 517-518 Reordering features3 .. GENERATED FROM PYTHON SOURCE LINES 518-521 .. code-block:: Python features3 = features3[features3_series] features3.head() .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Hole</th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> <th>Form</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>TC-1319-005</td> <td>265.77</td> <td>11.2</td> <td>2.610000</td> <td>5725.879959</td> <td>3073.767206</td> <td>0.108</td> <td>36.995337</td> <td>5489.392555</td> <td>CALP</td> </tr> <tr> <th>1</th> <td>TC-1319-005</td> <td>271.66</td> <td>11.8</td> <td>2.786624</td> <td>5514.000000</td> <td>2222.000000</td> <td>0.108</td> <td>26.060000</td> <td>4600.000000</td> <td>CALP</td> </tr> <tr> <th>2</th> <td>TC-1319-005</td> <td>277.05</td> <td>10.3</td> <td>3.399000</td> <td>5754.000000</td> <td>2085.000000</td> <td>0.237</td> <td>17.520000</td> <td>10200.000000</td> <td>CALP</td> </tr> <tr> <th>3</th> <td>TC-1319-005</td> <td>282.55</td> <td>11.2</td> <td>2.800579</td> <td>5773.000000</td> <td>2011.000000</td> <td>0.151</td> <td>28.130000</td> <td>5800.000000</td> <td>CALP</td> </tr> <tr> <th>4</th> <td>TC-1319-005</td> <td>287.20</td> <td>10.8</td> <td>3.293000</td> <td>6034.000000</td> <td>2122.000000</td> <td>0.176</td> <td>6.880000</td> <td>15500.000000</td> <td>CALP</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 522-523 Save features3 .. GENERATED FROM PYTHON SOURCE LINES 523-525 .. code-block:: Python features3.to_csv('Output/features3.csv') .. GENERATED FROM PYTHON SOURCE LINES 526-527 Indixes of Vp values in features2 .. GENERATED FROM PYTHON SOURCE LINES 527-530 .. code-block:: Python features2.Vp[~features2.Vp.isna()] .. rst-class:: sphx-glr-script-out .. code-block:: none 1 5514.0 2 5754.0 3 5773.0 4 6034.0 5 4661.0 ... 323 5621.0 324 5680.0 326 5593.0 327 5858.0 328 5805.0 Name: Vp, Length: 305, dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 531-532 Indixes of Vp NaNs in features2 .. GENERATED FROM PYTHON SOURCE LINES 532-535 .. 
.. code-block:: Python

    features2.Vp[features2.Vp.isna()]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0      NaN
    15     NaN
    36     NaN
    77     NaN
    79     NaN
    111    NaN
    118    NaN
    159    NaN
    174    NaN
    183    NaN
    185    NaN
    187    NaN
    189    NaN
    191    NaN
    194    NaN
    201    NaN
    210    NaN
    233    NaN
    240    NaN
    251    NaN
    276    NaN
    311    NaN
    320    NaN
    325    NaN
    Name: Vp, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 536-539

.. code-block:: Python

    Vp_nan_index = list(features2.Vp[features2.Vp.isna()].index)
    Vp_nan_index

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [0, 15, 36, 77, 79, 111, 118, 159, 174, 183, 185, 187, 189, 191, 194, 201, 210, 233, 240, 251, 276, 311, 320, 325]

.. GENERATED FROM PYTHON SOURCE LINES 540-541

Original Vp without anomalies

.. GENERATED FROM PYTHON SOURCE LINES 541-543

.. code-block:: Python

    features2.Vp.describe()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    count     305.000000
    mean     5496.386885
    std       556.077848
    min      3009.000000
    25%      5220.000000
    50%      5581.000000
    75%      5858.000000
    max      6478.000000
    Name: Vp, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 544-545

Imputed Vp

.. GENERATED FROM PYTHON SOURCE LINES 545-548

.. code-block:: Python

    features3.Vp.iloc[Vp_nan_index].describe()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    count      24.000000
    mean     5560.948607
    std       225.846356
    min      5210.315493
    25%      5369.531486
    50%      5605.088801
    75%      5779.916352
    max      5854.418200
    Name: Vp, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 549-550

% of increase of the mean of the imputed Vp compared with the real Vp

.. GENERATED FROM PYTHON SOURCE LINES 550-554

.. code-block:: Python

    vp_change_den = 100 * (features3.Vp.iloc[Vp_nan_index].mean() - features2.Vp.mean()) / features2.Vp.mean()
    print('Increase of the mean:', '{:.1f} %'.format(vp_change_den))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Increase of the mean: 1.2 %

.. GENERATED FROM PYTHON SOURCE LINES 555-556

nii Vp for comparison

.. GENERATED FROM PYTHON SOURCE LINES 556-560

.. code-block:: Python

    nii_vp = features3.Vp.iloc[Vp_nan_index]
    nii_vp.head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0     5725.879959
    15    5604.870331
    36    5558.402401
    77    5374.912754
    79    5210.315493
    Name: Vp, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 561-585

Estimating Vp
--------------

The calculation of a single variable such as Vp (again, just this variable because of its importance, and for simplicity and speed) lets us try out and evaluate different methods for filling the gaps, from empirical formulas to the simplest Machine Learning (ML) algorithms.

Vp by Gardner
++++++++++++++

Gardner's formula is a well-known empirical relation for transforming density (Den) into compressional velocity (Vp):

.. math:: Vp = \alpha \, \rho ^ {\beta}

where :math:`\rho` is the density and the standard coefficients (reference: https://wiki.seg.org/wiki/Dictionary:Gardner%E2%80%99s_equation) are:

.. math:: \alpha \approx 4348

.. math:: \beta = 0.25

Contrary to the expected increasing trend between Den and Vp, the plot below shows an anomalous vertical trend, where multiple Vp values are associated with a single Den value (around 2.7). This anomalous trend results in a negative R² (-0.021, the worst so far) when Vp is calculated using all values, and the score improves only slightly when we filter out the pairs that fall outside the expected increasing trend. This is confirmed by the low covariance of the Den-Vp pair in the covariance matrix of this section (and by the low correlation coefficient in the previous notebook 1/3). An illustrative Gardner-type fit is sketched next, followed by the plot of Den vs. Vp.

..
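As a rough, illustrative sketch (not part of the original notebook), Gardner-type coefficients can be fitted directly to the cleaned data with SciPy's ``curve_fit``, which is already imported above. The helper name ``gardner_vp`` and the starting guesses are assumptions made only for this example:

.. code-block:: Python

    # Sketch: fit Vp = alpha * Den**beta on the rows of features2 where both values exist.
    def gardner_vp(den, alpha, beta):
        return alpha * den ** beta

    pairs = features2[['Den', 'Vp']].dropna()

    # Start from the "standard" coefficients quoted above and print the fitted values.
    popt, _ = curve_fit(gardner_vp, pairs.Den, pairs.Vp, p0=[4348.0, 0.25])
    print('Fitted alpha = {:.0f}, beta = {:.3f}'.format(popt[0], popt[1]))

Given the near-vertical Den-Vp trend discussed above, such a fit is not expected to perform well on these data; the sketch only makes the formula concrete.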
GENERATED FROM PYTHON SOURCE LINES 585-592 .. code-block:: Python plt.grid() plt.scatter(features2.Den, features2.Vp, alpha=0.65, edgecolors='k') plt.xlabel('Den (g/cm$^3$)') plt.ylabel('Vp (m/s)') plt.title('Data Without Anomalies') .. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_007.png :alt: Data Without Anomalies :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_007.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none Text(0.5, 1.0, 'Data Without Anomalies') .. GENERATED FROM PYTHON SOURCE LINES 593-595 .. code-block:: Python features2.select_dtypes(include=['number']) .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>265.77</td> <td>11.2</td> <td>2.610</td> <td>NaN</td> <td>NaN</td> <td>0.108</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>1</th> <td>271.66</td> <td>11.8</td> <td>NaN</td> <td>5514.0</td> <td>2222.0</td> <td>0.108</td> <td>26.06</td> <td>4600.0</td> </tr> <tr> <th>2</th> <td>277.05</td> <td>10.3</td> <td>3.399</td> <td>5754.0</td> <td>2085.0</td> <td>0.237</td> <td>17.52</td> <td>10200.0</td> </tr> <tr> <th>3</th> <td>282.55</td> <td>11.2</td> <td>NaN</td> <td>5773.0</td> <td>2011.0</td> <td>0.151</td> <td>28.13</td> <td>5800.0</td> </tr> <tr> <th>4</th> <td>287.20</td> <td>10.8</td> <td>3.293</td> <td>6034.0</td> <td>2122.0</td> <td>0.176</td> <td>6.88</td> <td>15500.0</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>324</th> <td>814.55</td> <td>9.6</td> <td>2.763</td> <td>5680.0</td> <td>3179.0</td> <td>0.158</td> <td>70.94</td> <td>481.1</td> </tr> <tr> <th>325</th> <td>819.92</td> <td>NaN</td> <td>2.649</td> <td>NaN</td> <td>NaN</td> <td>0.024</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>326</th> <td>825.60</td> <td>9.9</td> <td>2.762</td> <td>5593.0</td> <td>1787.0</td> <td>0.337</td> <td>78.44</td> <td>334.6</td> </tr> <tr> <th>327</th> <td>828.80</td> <td>9.9</td> <td>2.804</td> <td>5858.0</td> <td>1964.0</td> <td>0.311</td> <td>73.73</td> <td>466.5</td> </tr> <tr> <th>328</th> <td>833.40</td> <td>10.1</td> <td>2.735</td> <td>5805.0</td> <td>3217.0</td> <td>0.308</td> <td>33.86</td> <td>2300.0</td> </tr> </tbody> </table> <p>329 rows × 8 columns</p> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 596-597 Covariance of the features2 .. GENERATED FROM PYTHON SOURCE LINES 597-600 .. code-block:: Python vari_mat = features2.select_dtypes(include=['number']).cov() vari_mat .. 
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>From</th> <td>69097.677012</td> <td>-16.222055</td> <td>-7.748490</td> <td>-53382.663546</td> <td>-40566.665284</td> <td>7.019153</td> <td>2766.653801</td> <td>-5.105614e+05</td> </tr> <tr> <th>Len</th> <td>-16.222055</td> <td>1.145171</td> <td>0.052440</td> <td>-49.309151</td> <td>-165.827338</td> <td>-0.010128</td> <td>-0.338487</td> <td>-2.218782e+02</td> </tr> <tr> <th>Den</th> <td>-7.748490</td> <td>0.052440</td> <td>0.021088</td> <td>2.888883</td> <td>-12.299676</td> <td>-0.000414</td> <td>-0.243046</td> <td>5.485842e+01</td> </tr> <tr> <th>Vp</th> <td>-53382.663546</td> <td>-49.309151</td> <td>2.888883</td> <td>309222.573512</td> <td>93428.327238</td> <td>0.964435</td> <td>-3777.361458</td> <td>9.111955e+05</td> </tr> <tr> <th>Vs</th> <td>-40566.665284</td> <td>-165.827338</td> <td>-12.299676</td> <td>93428.327238</td> <td>299730.099574</td> <td>-2.230359</td> <td>-1474.676736</td> <td>5.989462e+05</td> </tr> <tr> <th>Mag</th> <td>7.019153</td> <td>-0.010128</td> <td>-0.000414</td> <td>0.964435</td> <td>-2.230359</td> <td>0.010142</td> <td>0.674282</td> <td>-1.611072e+01</td> </tr> <tr> <th>Ip</th> <td>2766.653801</td> <td>-0.338487</td> <td>-0.243046</td> <td>-3777.361458</td> <td>-1474.676736</td> <td>0.674282</td> <td>632.281488</td> <td>-6.145102e+04</td> </tr> <tr> <th>Res</th> <td>-510561.399578</td> <td>-221.878244</td> <td>54.858423</td> <td>911195.549174</td> <td>598946.207455</td> <td>-16.110723</td> <td>-61451.019939</td> <td>1.380012e+07</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 601-603 .. code-block:: Python vari_mat.describe() .. 
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>8.000000</td> <td>8.000000</td> <td>8.000000</td> <td>8.000000</td> <td>8.000000</td> <td>8.000000</td> <td>8.000000</td> <td>8.000000e+00</td> </tr> <tr> <th>mean</th> <td>-66582.918624</td> <td>-56.548474</td> <td>4.691151</td> <td>157080.121136</td> <td>118735.366859</td> <td>-1.210452</td> <td>-7913.003762</td> <td>1.842258e+06</td> </tr> <tr> <th>std</th> <td>183023.873982</td> <td>87.703069</td> <td>20.884445</td> <td>325158.301042</td> <td>222378.924444</td> <td>6.584200</td> <td>21711.748328</td> <td>4.851295e+06</td> </tr> <tr> <th>min</th> <td>-510561.399578</td> <td>-221.878244</td> <td>-12.299676</td> <td>-53382.663546</td> <td>-40566.665284</td> <td>-16.110723</td> <td>-61451.019939</td> <td>-5.105614e+05</td> </tr> <tr> <th>25%</th> <td>-43770.664850</td> <td>-78.438698</td> <td>-2.119407</td> <td>-981.322228</td> <td>-493.039688</td> <td>-0.565186</td> <td>-2050.347917</td> <td>-1.552916e+04</td> </tr> <tr> <th>50%</th> <td>-11.985273</td> <td>-8.280271</td> <td>0.010337</td> <td>1.926659</td> <td>-7.265017</td> <td>0.004864</td> <td>-0.290766</td> <td>1.937385e+01</td> </tr> <tr> <th>75%</th> <td>696.927815</td> <td>0.005514</td> <td>0.761551</td> <td>147376.888806</td> <td>145003.770322</td> <td>0.746820</td> <td>158.576084</td> <td>6.770085e+05</td> </tr> <tr> <th>max</th> <td>69097.677012</td> <td>1.145171</td> <td>54.858423</td> <td>911195.549174</td> <td>598946.207455</td> <td>7.019153</td> <td>2766.653801</td> <td>1.380012e+07</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 604-605 Covariance matrix .. GENERATED FROM PYTHON SOURCE LINES 605-611 .. code-block:: Python sns.heatmap(vari_mat, vmin=-500, vmax=500, center=0, linewidths=.1, cmap='seismic_r', annot=True) plt.title('Covariance Matrix') plt.tight_layout() plt.show() .. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_008.png :alt: Covariance Matrix :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_008.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 612-613 Gardner Vp in features2, with the standard alpha of 4348 .. GENERATED FROM PYTHON SOURCE LINES 613-617 .. code-block:: Python features2['VpG'] = 4348 * (features2.Den) ** 0.25 features2['VpG'] .. rst-class:: sphx-glr-script-out .. code-block:: none 0 5526.493483 1 NaN 2 5903.741408 3 NaN 4 5857.165129 ... 324 5605.763308 325 5547.023745 326 5605.256023 327 5626.444480 328 5591.506938 Name: VpG, Length: 329, dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 618-619 Drop NaNs for metrics calculation .. GENERATED FROM PYTHON SOURCE LINES 619-623 .. code-block:: Python features2_nonan = features2.dropna() features2_nonan.head() .. 
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Hole</th> <th>From</th> <th>Len</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> <th>Form</th> <th>VpG</th> </tr> </thead> <tbody> <tr> <th>2</th> <td>TC-1319-005</td> <td>277.05</td> <td>10.3</td> <td>3.399</td> <td>5754.0</td> <td>2085.0</td> <td>0.237</td> <td>17.52</td> <td>10200.0</td> <td>CALP</td> <td>5903.741408</td> </tr> <tr> <th>4</th> <td>TC-1319-005</td> <td>287.20</td> <td>10.8</td> <td>3.293</td> <td>6034.0</td> <td>2122.0</td> <td>0.176</td> <td>6.88</td> <td>15500.0</td> <td>CALP</td> <td>5857.165129</td> </tr> <tr> <th>5</th> <td>TC-1319-005</td> <td>296.75</td> <td>10.3</td> <td>2.782</td> <td>4661.0</td> <td>2294.0</td> <td>0.107</td> <td>27.81</td> <td>2100.0</td> <td>CALP</td> <td>5615.375681</td> </tr> <tr> <th>8</th> <td>TC-1319-005</td> <td>318.75</td> <td>10.7</td> <td>3.386</td> <td>5815.0</td> <td>2115.0</td> <td>0.060</td> <td>8.53</td> <td>10700.0</td> <td>CALP</td> <td>5898.088351</td> </tr> <tr> <th>9</th> <td>TC-1319-005</td> <td>325.75</td> <td>10.0</td> <td>2.616</td> <td>4785.0</td> <td>1946.0</td> <td>0.040</td> <td>84.90</td> <td>296.1</td> <td>CALP</td> <td>5529.666895</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 624-625 Metrics of the Gardner calculation with all values .. GENERATED FROM PYTHON SOURCE LINES 625-630 .. code-block:: Python vpg_metrics = basic_stat.metrics(features2_nonan.Vp, features2_nonan.VpG) print("Metrics for Gardner:\n") print(vpg_metrics) .. rst-class:: sphx-glr-script-out .. code-block:: none Metrics for Gardner: {'Sum of Residuals': np.float64(-10723.609649536735), 'MAE': np.float64(365.5125465596317), 'MSE': np.float64(254469.14816210992), 'RMSE': np.float64(504.44935143392735), 'R2': -0.021743605762918117} .. GENERATED FROM PYTHON SOURCE LINES 631-632 Metrics of the Gardner calculation with filtered values .. GENERATED FROM PYTHON SOURCE LINES 632-641 .. code-block:: Python vpg_metrics2 = basic_stat.metrics( features2_nonan.Vp[(features2_nonan.Vp > 4000) & (features2_nonan.Vp < 6000)], features2_nonan.VpG[(features2_nonan.Vp > 4000) & (features2_nonan.Vp < 6000)] ) print("Metrics for Gardner with Filtered Vp Data:\n") print(vpg_metrics2) .. rst-class:: sphx-glr-script-out .. code-block:: none Metrics for Gardner with Filtered Vp Data: {'Sum of Residuals': np.float64(-25589.572721078162), 'MAE': np.float64(295.9254751013969), 'MSE': np.float64(162331.10374522334), 'RMSE': np.float64(402.9033429313082), 'R2': -0.0937786027043972} .. GENERATED FROM PYTHON SOURCE LINES 642-643 Gardner Vp for comparison .. GENERATED FROM PYTHON SOURCE LINES 643-647 .. code-block:: Python gard_vp = features2.VpG.iloc[Vp_nan_index] gard_vp.head() .. rst-class:: sphx-glr-script-out .. code-block:: none 0 5526.493483 15 5520.661324 36 5633.454391 77 5586.388848 79 5691.523695 Name: VpG, dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 648-652 Simplest ML models for Vp ++++++++++++++++++++++++++ Machine Learning (ML) algorithms offer a wide range of options to calculate missing values of a single variable.
These range from the simplest and most well-known, such as linear regression with a single independent variable, to the most complex. Below we use the simplest ML algorithms and their respective metrics, applied again to predict Vp and fill its gaps. But first, we split the non-NaN portion of the filtered data (features2) for training and testing. .. GENERATED FROM PYTHON SOURCE LINES 654-656 .. code-block:: Python features2_num_nonan.head() .. raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>From</th> <th>Den</th> <th>Vp</th> <th>Vs</th> <th>Mag</th> <th>Ip</th> <th>Res</th> </tr> </thead> <tbody> <tr> <th>2</th> <td>277.05</td> <td>3.399</td> <td>5754.0</td> <td>2085.0</td> <td>0.237</td> <td>17.52</td> <td>10200.0</td> </tr> <tr> <th>4</th> <td>287.20</td> <td>3.293</td> <td>6034.0</td> <td>2122.0</td> <td>0.176</td> <td>6.88</td> <td>15500.0</td> </tr> <tr> <th>5</th> <td>296.75</td> <td>2.782</td> <td>4661.0</td> <td>2294.0</td> <td>0.107</td> <td>27.81</td> <td>2100.0</td> </tr> <tr> <th>8</th> <td>318.75</td> <td>3.386</td> <td>5815.0</td> <td>2115.0</td> <td>0.060</td> <td>8.53</td> <td>10700.0</td> </tr> <tr> <th>9</th> <td>325.75</td> <td>2.616</td> <td>4785.0</td> <td>1946.0</td> <td>0.040</td> <td>84.90</td> <td>296.1</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 657-658 Target from filtered features2 .. GENERATED FROM PYTHON SOURCE LINES 658-664 .. code-block:: Python # The target or objective of the model is Vp target = features2_num_nonan.Vp[(features2_num_nonan.Vp > 4000) & (features2_num_nonan.Vp < 6000)] print(target.shape) target.head() .. rst-class:: sphx-glr-script-out .. code-block:: none (225,) 2 5754.0 5 4661.0 8 5815.0 9 4785.0 10 5707.0 Name: Vp, dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 665-666 Filtered density is the independent feature used to compute Vp, the target .. GENERATED FROM PYTHON SOURCE LINES 666-669 .. code-block:: Python features2_den = features2_num_nonan.Den[(features2_num_nonan.Vp > 4000) & (features2_num_nonan.Vp < 6000)] features2_den.head() .. rst-class:: sphx-glr-script-out .. code-block:: none 2 3.399 5 2.782 8 3.386 9 2.616 10 2.821 Name: Den, dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 670-671 Split and shape of the data for training and testing .. GENERATED FROM PYTHON SOURCE LINES 671-685 .. code-block:: Python X_train, X_test, y_train, y_test = train_test_split(features2_den, target, test_size=0.2, random_state=seed) X_train = np.array(X_train).reshape(-1, 1) X_test = np.array(X_test).reshape(-1, 1) y_train = np.array(y_train).reshape(-1, 1) y_test = np.array(y_test).reshape(-1, 1) print('X_train shape:', X_train.shape) print('y_train shape:', y_train.shape) print('X_test shape:', X_test.shape) print('y_test shape:', y_test.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none X_train shape: (180, 1) y_train shape: (180, 1) X_test shape: (45, 1) y_test shape: (45, 1) .. GENERATED FROM PYTHON SOURCE LINES 686-690 Simple Linear Regression ++++++++++++++++++++++++ Linear regression is perhaps the simplest ML algorithm; it allows us to define a simple formula to compute Vp from Den, the only feature.
The regression using all the data gave a negative :math:`R^2` (the predictions are worse than simply predicting the mean), while applying a filter to the input data (4000 < Vp < 6000) resulted in a positive :math:`R^2`, although a very small one (0.003). .. GENERATED FROM PYTHON SOURCE LINES 692-693 Linear Regression .. GENERATED FROM PYTHON SOURCE LINES 693-702 .. code-block:: Python # Model training and prediction lr_model = LinearRegression().fit(X_train, y_train) lr_predict = lr_model.predict(np.array(X_test)) # Coefficients coef = lr_model.coef_.item() inter = lr_model.intercept_.item() .. GENERATED FROM PYTHON SOURCE LINES 703-704 Metrics of the linear regression with filtered data .. GENERATED FROM PYTHON SOURCE LINES 704-710 .. code-block:: Python lr_metrics = basic_stat.metrics(y_test, lr_predict) print("Metrics for Linear Regressor:\n") print(lr_metrics) .. rst-class:: sphx-glr-script-out .. code-block:: none Metrics for Linear Regressor: {'Sum of Residuals': np.float64(-663.6657176180624), 'MAE': np.float64(321.4070681962207), 'MSE': np.float64(152576.8824032091), 'RMSE': np.float64(390.61090922196365), 'R2': 0.002886141920427243} .. GENERATED FROM PYTHON SOURCE LINES 711-712 Plot of filtered data and regression .. GENERATED FROM PYTHON SOURCE LINES 712-722 .. code-block:: Python plt.scatter(X_test, y_test, label='Real data', alpha=0.65, edgecolors='k') plt.plot(X_test, lr_predict, c='r', label='Prediction') plt.title(f"Vp = {coef:.1f} * den + {inter:.1f}") plt.suptitle('Linear Regression') plt.xlabel('Density (g/cm$^3$)') plt.ylabel('Vp (m/s)') plt.grid() plt.legend(); .. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_009.png :alt: Linear Regression, Vp = 175.7 * den + 4996.3 :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_009.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none <matplotlib.legend.Legend object at 0x7f3199fc2ce0> .. GENERATED FROM PYTHON SOURCE LINES 723-724 lr Vp for comparison .. GENERATED FROM PYTHON SOURCE LINES 724-731 .. code-block:: Python # Density at the depths of the Vp gaps den_feature = np.array(features2.Den.iloc[Vp_nan_index]).reshape(-1, 1) lr_vp = lr_model.predict(den_feature) lr_vp .. rst-class:: sphx-glr-script-out .. code-block:: none array([[5454.81134299], [5452.87886354], [5491.35277254], [5475.01453721], [5512.08300661], [5473.43341766], [5454.98702294], [5427.58095077], [5466.58189962], [5467.28461942], [5474.66317731], [5468.33869912], [5458.85198183], [5466.05485977], [5454.81134299], [5517.70476501], [5469.74413872], [5470.27117857], [5467.98733922], [5478.52813621], [5470.27117857], [5462.01422093], [5495.74477128], [5461.66286103]]) .. GENERATED FROM PYTHON SOURCE LINES 732-735 Non Linear Fit ++++++++++++++ Although the non-linear fit is more powerful, since it fits both the alpha and beta coefficients of a non-linear density-to-Vp curve, its :math:`R^2` metric (0.004) does not differ much from that of the linear regression with the filtered data. .. GENERATED FROM PYTHON SOURCE LINES 737-738 Reshaping the filtered data .. GENERATED FROM PYTHON SOURCE LINES 738-746 .. code-block:: Python X_train = np.array(X_train).flatten() y_train = np.array(y_train).flatten() print('X_train shape:', X_train.shape) print('y_train shape:', y_train.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none X_train shape: (180,) y_train shape: (180,) .. GENERATED FROM PYTHON SOURCE LINES 747-748 Gardner function for fitting both coefficients, alpha and beta ..
GENERATED FROM PYTHON SOURCE LINES 748-753 .. code-block:: Python def gardner_model(X_train, alpha, beta): return alpha * X_train ** beta .. GENERATED FROM PYTHON SOURCE LINES 754-755 Model fit .. GENERATED FROM PYTHON SOURCE LINES 755-757 .. code-block:: Python popt, pcov = curve_fit(gardner_model, X_train, y_train, p0=[0, 0]) # p0 is the initial estimate of alpha and beta .. GENERATED FROM PYTHON SOURCE LINES 758-759 Coefficients and prediction .. GENERATED FROM PYTHON SOURCE LINES 759-767 .. code-block:: Python alpha, beta = popt print('Alpha:', alpha) print('Beta:', beta) nlf_predict = gardner_model(X_test, alpha, beta) .. rst-class:: sphx-glr-script-out .. code-block:: none Alpha: 4944.85723377095 Beta: 0.1017276747649569 .. GENERATED FROM PYTHON SOURCE LINES 768-769 Metrics of the non-linear fit with filtered data .. GENERATED FROM PYTHON SOURCE LINES 769-774 .. code-block:: Python nlf_metrics = basic_stat.metrics(y_test, nlf_predict) print("Metrics for Non Linear Fit:\n") print(nlf_metrics) .. rst-class:: sphx-glr-script-out .. code-block:: none Metrics for Non Linear Fit: {'Sum of Residuals': np.float64(-672.1847517458073), 'MAE': np.float64(321.430456631444), 'MSE': np.float64(152446.9582936085), 'RMSE': np.float64(390.4445649431024), 'R2': 0.003735216355170601} .. GENERATED FROM PYTHON SOURCE LINES 775-776 Plot of filtered data and NLF equation .. GENERATED FROM PYTHON SOURCE LINES 776-785 .. code-block:: Python plt.scatter(X_test, y_test, label='Real data') plt.plot(X_test, nlf_predict, c='r', label='Prediction') plt.title(f"Vp = {alpha:.2f} * den ** {beta:.2f}") plt.xlabel("Density (g/cm3)") plt.suptitle('Non Linear Fit') plt.ylabel("Vp (m/s)") plt.grid() plt.show() .. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_010.png :alt: Non Linear Fit, Vp = 4944.86 * den ** 0.10 :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_010.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 786-787 nlf Vp for comparison .. GENERATED FROM PYTHON SOURCE LINES 787-790 .. code-block:: Python nlf_vp = gardner_model(den_feature, alpha, beta) nlf_vp .. rst-class:: sphx-glr-script-out .. code-block:: none array([[5451.77127868], [5449.42946455], [5494.46242706], [5475.7369865 ], [5517.43839526], [5473.89450644], [5451.98373102], [5417.92250818], [5465.84650885], [5466.67677416], [5475.32801939], [5467.92008814], [5456.63928719], [5465.22307824], [5451.77127868], [5523.52610545], [5469.57396647], [5470.19303406], [5467.50592748], [5479.81189483], [5470.19303406], [5460.42246055], [5499.40142812], [5460.00324939]]) .. GENERATED FROM PYTHON SOURCE LINES 791-794 Vps Comparison -------------- The real Vp values allow us to evaluate the quality of the different procedures that estimate Vp, such as the different imputation algorithms, the Gardner empirical formula, and the simplest ML regression models.
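Before the graphical comparison below, the four sets of estimates can also be summarized numerically in a single table. The following is a minimal sketch (not part of the generated example output) that reuses the variables created above (features2, gard_vp, nii_vp, lr_vp, nlf_vp); the percentage change is computed in the same way as in the boxplot loop that follows.

.. code-block:: Python

    # Sketch: tabular summary of the estimated Vps against the real Vp mean
    import numpy as np
    import pandas as pd

    real_mean = features2.Vp.mean()

    # Flatten Series and (n, 1) arrays to plain 1D arrays
    estimates = {
        'Gardner Vp': np.ravel(gard_vp),
        'NII Imputer Vp': np.ravel(nii_vp),
        'LR Vp': np.ravel(lr_vp),
        'NLF Vp': np.ravel(nlf_vp),
    }

    summary = pd.DataFrame({
        'Mean (m/s)': {name: vals.mean() for name, vals in estimates.items()},
        'Std (m/s)': {name: vals.std() for name, vals in estimates.items()},
        '% change vs. real Vp': {name: 100 * (vals.mean() - real_mean) / real_mean
                                 for name, vals in estimates.items()},
    })

    print(summary.round(2))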
.. GENERATED FROM PYTHON SOURCE LINES 796-797 Data and titles used in the comparison loop .. GENERATED FROM PYTHON SOURCE LINES 797-845 .. code-block:: Python plt.figure(figsize=(9, 15)) # Histograms plt.subplot(7, 1, 1) plt.grid(zorder=2) plt.hist(features2.Vp, bins=20, zorder=3) plt.xlim(2500, 7000) plt.xlabel('m/s') plt.title('Real Vp') plt.subplot(7, 1, 2) box = plt.boxplot(features2.Vp.dropna(), vert=False, showmeans=True, meanline=True, patch_artist=False) mean_line = box['means'][0] # Get the mean line median_line = box['medians'][0] # Get the median line plt.legend([mean_line, median_line], ['Mean', 'Median'], loc='upper left') plt.xlim(2500, 7000) plt.title('Real Vp') plt.xlabel('m/s') plt.grid() plt.subplot(7, 1, 3) plt.grid(zorder=2) plt.hist(gard_vp, label='Gardner Vp', zorder=4) plt.hist(nii_vp, label='NII Imputer Vp', zorder=7) plt.hist(lr_vp, label='LR Vp', zorder=9) plt.hist(nlf_vp, label='NLF Vp', zorder=10) plt.xlim(2500, 7000) plt.xlabel('m/s') plt.legend() plt.title('Estimated Vps') # Boxplots data_list = [gard_vp, nii_vp, lr_vp, nlf_vp] titles = ['Gardner Vp', 'NII Imputer Vp', 'LR Vp', 'NLF Vp'] # Means of the estimated Vps (aligned with data_list and titles) means = [np.mean(gard_vp), np.mean(nii_vp), np.mean(lr_vp), np.mean(nlf_vp)] for i in range(len(data_list)): plt.subplot(7, 1, i + 4) plt.boxplot(data_list[i], vert=False, showmeans=True, meanline=True) vp_change = 100 * (data_list[i].mean() - features2.Vp.mean()) / features2.Vp.mean() plt.text(means[i] - 500, 1.25, 'Mean: {:.0f}, % change: {:.1f}'.format(means[i], vp_change)) plt.xlim(2500, 7000) plt.title(titles[i]) plt.xlabel('m/s') plt.grid() plt.tight_layout() .. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_011.png :alt: Real Vp, Real Vp, Estimated Vps, Gardner Vp, NII Imputer Vp, LR Vp, NLF Vp :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_011.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 846-924 Observations ------------ Here are some observations related to the tasks covered in this notebook: * Anomalies represent 0.5% (18) of the data, and after removing them we reach a total of 4.3% NaNs (142). * The metrics favor the nii imputer. However, the knn, ii, and nknn imputers provided similar results. In the end, the dataset without anomalies (features2) was filled with the nii imputer (features3) because it was the best option. * In addition to the correlation matrix (described in the previous notebook 1/3), it is important to explore the covariance matrix because it provides valuable information regarding the scale-dependent relationships between variables, reflecting how changes in one variable are associated with changes in another in terms of their actual units (for two variables, the covariance equals the correlation coefficient multiplied by the standard deviations of both). * The poor correlation (low covariance) between Den and Vp did not become apparent until Gardner's Vp was calculated. This gave a negative :math:`R^2`, even after filtering out part of the data outside the increasing trend. * Filtering out the Den and Vp values outside the increasing trend yields positive but very small :math:`R^2` values for the LR and NLF ML algorithms. Next Step --------- After the imputation, the next and last notebook (3/3) focuses on the prediction of an entire variable (GR) that is missing in one of the available boreholes. Reference --------- `Gardner's Equation <https://wiki.seg.org/wiki/Dictionary:Gardner%E2%80%99s_equation>`_ Annex - Regression Metrics -------------------------- These metrics help you understand different aspects of prediction accuracy and are critical for evaluating and comparing regression models.
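The metric dictionaries shown above are produced by the basic_stat.metrics helper used throughout this notebook. Its exact implementation is not reproduced here, but an equivalent computation can be sketched with NumPy and scikit-learn, following the definitions given below:

.. code-block:: Python

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    def regression_metrics(y_true, y_pred):
        """Return the regression metrics defined in this annex as a dictionary."""
        y_true = np.asarray(y_true, dtype=float).ravel()
        y_pred = np.asarray(y_pred, dtype=float).ravel()
        mse = mean_squared_error(y_true, y_pred)
        return {
            'Sum of Residuals': float(np.sum(y_true - y_pred)),
            'MAE': mean_absolute_error(y_true, y_pred),
            'MSE': mse,
            'RMSE': float(np.sqrt(mse)),
            'R2': r2_score(y_true, y_pred),
        }

    # A perfect prediction gives zero residuals and R2 = 1
    print(regression_metrics([5500.0, 5600.0, 5700.0], [5500.0, 5600.0, 5700.0]))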
The main metrics are: Sum of Residuals ~~~~~~~~~~~~~~~~ The Sum of Residuals is the sum of the errors between the actual and the predicted values. Ideally, it should add up to 0. .. math:: \text{Sum of Residuals} = \sum_{i=1}^{n} (y_i - \hat{y}_i) Where: - :math:`y_i` is the real value. - :math:`\hat{y}_i` is the predicted value. MAE - Mean Absolute Error ~~~~~~~~~~~~~~~~~~~~~~~~~ The MAE measures the average of the absolute errors between the predicted and the actual values. It is an intuitive and easy-to-interpret metric. .. math:: \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| Where: - :math:`n` is the number of observations. MSE - Mean Squared Error ~~~~~~~~~~~~~~~~~~~~~~~~ The MSE measures the average of the squared errors between the predicted and the actual values. It penalizes large errors more than the MAE. .. math:: \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 RMSE - Root Mean Squared Error ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The RMSE is the square root of the MSE. It is expressed in the same units as the output variable, which makes it easier to interpret. .. math:: \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{\text{MSE}} Coefficient of Determination or Score (:math:`R^2`) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The :math:`R^2` measures the proportion of the variance of the dependent (predicted) variable that is explained by the independent variables. Normally it ranges from 0 (no fit at all) to 1 (perfect fit). A negative value, which is rare, indicates that the model performs worse than simply predicting the mean. .. math:: R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} Where: - :math:`\bar{y}` is the mean of the real values :math:`y_i`. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 1.429 seconds) .. _sphx_glr_download_examples_05_petrophysics_02_no_2_filling_gaps.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: 02_no_2_filling_gaps.ipynb <02_no_2_filling_gaps.ipynb>` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: 02_no_2_filling_gaps.py <02_no_2_filling_gaps.py>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_