.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/05_petrophysics/02_no_2_filling_gaps.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_05_petrophysics_02_no_2_filling_gaps.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_05_petrophysics_02_no_2_filling_gaps.py:


Modeling and Propagation of Petrophysical Data for Mining Exploration (2/3)
===========================================================================
**Cleaning and Filling the Gaps**

Barcelona 25/09/24
GEO3BCN
Manuel David Soto, Juan Alcalde, Adrià Hernàndez-Pineda, Ramón Carbonel

Introduction
-------------

The dispersion and scarcity of petrophysical data are well-known challenges in the mining sector. These issues are primarily driven by economic factors, but also by geological factors such as sedimentary cover, weathering, erosion, or the depth of targets, geotechnical issues (e.g., slope or borehole stability), and even technical limitations or availability that have been resolved in other industries (for instance, sonic logs were not previously acquired due to a lack of interest in velocity field data).

To address the challenge of sparse and incomplete petrophysical data in the mining sector, we have developed three Jupyter notebooks that tackle these issues through a suite of open-source Python tools. These tools support researchers in the initial exploration, visualization, and integration of data from diverse sources, filling gaps caused by technical limitations and ultimately enabling the complete modeling of missing properties (through standard and more advanced ML-based models). We applied these tools to both recently acquired and legacy petrophysical data of two cores northwest of Collinstown (County Westmeath, Province of Leinster), Ireland, located 26 km west-northwest of the Navan mine. However, these tools are adaptable and applicable to mining data from any location.

After the EDA and integration of the petrophysical dataset of Collinstown (notebook 1/3), this second notebook gathers different tasks that can be grouped into what is called, in data science, data mining. These tasks are:

* Record the positions of the original NaNs.
* Record the positions and the values of the anomalies to be deleted.
* Delete the anomalies.
* Fill out all NaNs, both original and those resulting from deleting the anomalies, using different means (imputations, empirical formulas, simple ML models).
* Compare the effectiveness of the different filling gap methods.
* Use the better option to fill in the gaps and deliver the corrected petrophysical data for further investigation.

As with the previous notebook, these tasks are performed with open-source Python tools that are easily accessible by any researcher through a Python installation connected to the Internet.

.. GENERATED FROM PYTHON SOURCE LINES 30-57

Variables
---------
The dataset used in this notebook is the 'features' dataset from the previous notebook (1/3). It contains the modelable petrophysical features with their respective anomalies. 'Hole' (text object) and 'Len' (float) variables are for reference, 'Form' (text object) is a categorical variable representing the major formations:

+------+------------------------------------------------+--------+
| Name | Explanation                                    | Unit   |
+======+================================================+========+
| Hole | Hole name                                      | -      |
+------+------------------------------------------------+--------+
| From | Top of the sample                              | m      |
+------+------------------------------------------------+--------+
| Len  | Length of the core sample                      | cm     |
+------+------------------------------------------------+--------+
| Den  | Density                                        | g/cm³  |
+------+------------------------------------------------+--------+
| Vp   | Compressional velocity                         | m/s    |
+------+------------------------------------------------+--------+
| Vs   | Shear velocity                                 | m/s    |
+------+------------------------------------------------+--------+
| Mag  | Magnetic susceptibility                        | -      |
+------+------------------------------------------------+--------+
| Ip   | Chargeability or induced polarization          | mv/V   |
+------+------------------------------------------------+--------+
| Res  | Resistivity                                    | ohm·m  |
+------+------------------------------------------------+--------+
| Form | Major formations or zone along the hole        | -      |
+------+------------------------------------------------+--------+

.. GENERATED FROM PYTHON SOURCE LINES 59-62

Libraries
--------
The following are the Python libraries used along this notebook. PSL are Python Standard Libraries, UDL are User Defined Libraries, and PEL are Python External Libraries:

.. GENERATED FROM PYTHON SOURCE LINES 64-65

PLS

.. GENERATED FROM PYTHON SOURCE LINES 65-103

.. code-block:: Python

    import sys
    import warnings

    # UDL
    from vector_geology import basic_stat, geo

    # PEL- Basic
    import numpy as np
    import pandas as pd
    from tabulate import tabulate

    # PEL - Plotting
    import matplotlib.pyplot as plt
    import seaborn as sns

    # PEL - Data selection, transformation, preprocessing
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import PolynomialFeatures

    # PEL - Imputer
    from sklearn.impute import KNNImputer, SimpleImputer
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer

    # PEL- ML algorithms
    from scipy.optimize import curve_fit
    from sklearn.linear_model import LinearRegression, Lasso, Ridge
    from sklearn.svm import SVR
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn import tree
    import xgboost as xgb

    # PEL - Metrics
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score








.. GENERATED FROM PYTHON SOURCE LINES 104-106

Settings
--------

.. GENERATED FROM PYTHON SOURCE LINES 108-109

Seed of random process

.. GENERATED FROM PYTHON SOURCE LINES 109-114

.. code-block:: Python

    seed = 123

    # Warning suppression
    warnings.filterwarnings("ignore")








.. GENERATED FROM PYTHON SOURCE LINES 115-117

Data Loading
------------

.. GENERATED FROM PYTHON SOURCE LINES 119-120

Features data load

.. GENERATED FROM PYTHON SOURCE LINES 120-123

.. code-block:: Python

    features = pd.read_csv('Output/features.csv', index_col=0)
    features.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Hole</th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
          <th>Form</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>TC-1319-005</td>
          <td>265.77</td>
          <td>11.2</td>
          <td>2.610</td>
          <td>2179.0</td>
          <td>NaN</td>
          <td>0.108</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>1</th>
          <td>TC-1319-005</td>
          <td>271.66</td>
          <td>11.8</td>
          <td>4.318</td>
          <td>5514.0</td>
          <td>2222.0</td>
          <td>0.108</td>
          <td>26.06</td>
          <td>4600.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>2</th>
          <td>TC-1319-005</td>
          <td>277.05</td>
          <td>10.3</td>
          <td>3.399</td>
          <td>5754.0</td>
          <td>2085.0</td>
          <td>0.237</td>
          <td>17.52</td>
          <td>10200.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>3</th>
          <td>TC-1319-005</td>
          <td>282.55</td>
          <td>11.2</td>
          <td>4.342</td>
          <td>5773.0</td>
          <td>2011.0</td>
          <td>0.151</td>
          <td>28.13</td>
          <td>5800.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>4</th>
          <td>TC-1319-005</td>
          <td>287.20</td>
          <td>10.8</td>
          <td>3.293</td>
          <td>6034.0</td>
          <td>2122.0</td>
          <td>0.176</td>
          <td>6.88</td>
          <td>15500.0</td>
          <td>CALP</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 124-126

.. code-block:: Python

    features.describe()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>count</th>
          <td>329.000000</td>
          <td>321.000000</td>
          <td>327.000000</td>
          <td>310.000000</td>
          <td>289.000000</td>
          <td>326.000000</td>
          <td>303.000000</td>
          <td>303.000000</td>
        </tr>
        <tr>
          <th>mean</th>
          <td>543.279696</td>
          <td>10.223364</td>
          <td>2.761664</td>
          <td>5441.990323</td>
          <td>2772.937716</td>
          <td>0.430543</td>
          <td>49.738548</td>
          <td>3283.762376</td>
        </tr>
        <tr>
          <th>std</th>
          <td>262.864370</td>
          <td>1.070127</td>
          <td>0.293938</td>
          <td>700.858075</td>
          <td>653.446129</td>
          <td>5.192919</td>
          <td>25.145208</td>
          <td>3714.851096</td>
        </tr>
        <tr>
          <th>min</th>
          <td>9.600000</td>
          <td>7.500000</td>
          <td>2.420000</td>
          <td>992.000000</td>
          <td>0.000000</td>
          <td>0.001000</td>
          <td>5.390000</td>
          <td>156.900000</td>
        </tr>
        <tr>
          <th>25%</th>
          <td>355.580000</td>
          <td>9.700000</td>
          <td>2.678500</td>
          <td>5196.000000</td>
          <td>2294.000000</td>
          <td>0.064000</td>
          <td>31.555000</td>
          <td>1100.000000</td>
        </tr>
        <tr>
          <th>50%</th>
          <td>547.800000</td>
          <td>10.000000</td>
          <td>2.706000</td>
          <td>5575.000000</td>
          <td>3041.000000</td>
          <td>0.114500</td>
          <td>47.040000</td>
          <td>2100.000000</td>
        </tr>
        <tr>
          <th>75%</th>
          <td>750.000000</td>
          <td>10.400000</td>
          <td>2.731000</td>
          <td>5857.000000</td>
          <td>3248.000000</td>
          <td>0.205250</td>
          <td>67.190000</td>
          <td>3850.000000</td>
        </tr>
        <tr>
          <th>max</th>
          <td>1049.200000</td>
          <td>18.000000</td>
          <td>5.074000</td>
          <td>6478.000000</td>
          <td>3740.000000</td>
          <td>93.886000</td>
          <td>171.730000</td>
          <td>23900.000000</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 127-128

Columns in the features dataframe

.. GENERATED FROM PYTHON SOURCE LINES 128-130

.. code-block:: Python

    features.columns





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Index(['Hole', 'From', 'Len', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res', 'Form'], dtype='object')



.. GENERATED FROM PYTHON SOURCE LINES 131-153

Features Cleaning
-----------------
The plots below show the anomalous values of the four affected variables. Then, along the section, we will register the position of the anomalies and then delete them, according to the findings of the previous notebook (1/3):

+------+--------------------------------------------------------------+
| Feat.| Deliting Anomalies                                           |
+======+==============================================================+
| Den  | Values above 4 g/cm³ are related to bad measurements in core |
|      | samples longer than 11.2 cm.                                 |
+------+--------------------------------------------------------------+
| Vp   | Values below 3000 m/s are related to fractures in the core   |
|      | samples.                                                     |
+------+--------------------------------------------------------------+
| Vs   | Values less than 1000 m/s are also related to core fractures.|
+------+--------------------------------------------------------------+
| Mag  | The anomaly of 93.9 is above the range of the measuring      |
|      | equipment.                                                   |
+------+--------------------------------------------------------------+
| Ip   | We see no reasons to discard outliers.                       |
+------+--------------------------------------------------------------+
| Res  | We see no reasons to discard outliers.                       |
+------+--------------------------------------------------------------+

.. GENERATED FROM PYTHON SOURCE LINES 155-156

Bad data areas in Vp, Vs, Mag vs. Den

.. GENERATED FROM PYTHON SOURCE LINES 156-195

.. code-block:: Python


    plt.figure(figsize=(16, 5))

    # len vs. Den
    plt.subplot(131)
    plt.scatter(features.Vp, features.Den, alpha=0.65, edgecolors='k', c='skyblue')
    plt.axhspan(4, 5.25, color='r', alpha=0.2)
    plt.axvspan(500, 3000, color='r', alpha=0.2)
    plt.text(1100, 4.7, 'Bad Data Area')
    plt.axis([500, 7000, 2, 5.25])
    plt.xlabel('Vp (m/s)')
    plt.ylabel('Den (g/cm3)')
    plt.grid()

    # Den vs. Vs bad data area
    plt.subplot(132)
    plt.scatter(features.Vs, features.Den, alpha=0.65, edgecolors='k', c='skyblue')
    plt.axhspan(4, 5.25, color='r', alpha=0.2)
    plt.axvspan(0, 1000, color='r', alpha=0.2)
    plt.text(75, 4.7, 'Bad Data Area')
    plt.axis([0, 4000, 2, 5.25])
    plt.xlabel('Vs (m/s)')
    plt.ylabel('Den (g/cm3)')
    plt.grid()

    # Den vs. Vs bad data area
    plt.subplot(133)
    plt.scatter(features.Mag, features.Den, alpha=0.65, edgecolors='k', c='skyblue')
    plt.axvspan(20, 100, color='r', alpha=0.2)
    plt.axhspan(4, 5.25, color='r', alpha=0.2)
    plt.text(2, 4.7, 'Bad Data Area')
    plt.axis([-5, 100, 2, 5.25])
    plt.xlabel('Magnetic Susceptibility')
    plt.ylabel('Den (g/cm3)')
    plt.grid()

    plt.suptitle('Variable with Anomalies in the Keywells')

    plt.tight_layout();



.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_001.png
   :alt: Variable with Anomalies in the Keywells
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 196-199

NaN and Anomalies Preservation
------------------------------
The folowing dataframes preserves the location of original NaN (1) and anomalies (2):

.. GENERATED FROM PYTHON SOURCE LINES 201-203

.. code-block:: Python

    nan_ano_loc = features.copy()








.. GENERATED FROM PYTHON SOURCE LINES 204-205

Fill the datafreme with 1 and 2

.. GENERATED FROM PYTHON SOURCE LINES 205-214

.. code-block:: Python


    for column in features.columns:
        nan_ano_loc[column] = np.where(pd.isna(features[column]), 1, 0)

    nan_ano_loc.Den = np.where(features.Den > 4, 2, 0)
    nan_ano_loc.Vp = np.where(features.Vp < 3000, 2, 0)
    nan_ano_loc.Vs = np.where(features.Vs < 1000, 2, 0)
    nan_ano_loc.Mag = np.where(features.Mag > 20, 2, 0)
    nan_ano_loc.head()





.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Hole</th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
          <th>Form</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>2</td>
          <td>0</td>
          <td>0</td>
          <td>1</td>
          <td>1</td>
          <td>0</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>2</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
        </tr>
        <tr>
          <th>3</th>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>2</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
          <td>0</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 215-217

.. code-block:: Python

    nan_ano_loc.describe()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Hole</th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
          <th>Form</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>count</th>
          <td>329.0</td>
          <td>329.0</td>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.0</td>
        </tr>
        <tr>
          <th>mean</th>
          <td>0.0</td>
          <td>0.0</td>
          <td>0.024316</td>
          <td>0.048632</td>
          <td>0.030395</td>
          <td>0.030395</td>
          <td>0.006079</td>
          <td>0.079027</td>
          <td>0.079027</td>
          <td>0.0</td>
        </tr>
        <tr>
          <th>std</th>
          <td>0.0</td>
          <td>0.0</td>
          <td>0.154263</td>
          <td>0.308527</td>
          <td>0.245049</td>
          <td>0.245049</td>
          <td>0.110264</td>
          <td>0.270192</td>
          <td>0.270192</td>
          <td>0.0</td>
        </tr>
        <tr>
          <th>min</th>
          <td>0.0</td>
          <td>0.0</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.0</td>
        </tr>
        <tr>
          <th>25%</th>
          <td>0.0</td>
          <td>0.0</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.0</td>
        </tr>
        <tr>
          <th>50%</th>
          <td>0.0</td>
          <td>0.0</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.0</td>
        </tr>
        <tr>
          <th>75%</th>
          <td>0.0</td>
          <td>0.0</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.000000</td>
          <td>0.0</td>
        </tr>
        <tr>
          <th>max</th>
          <td>0.0</td>
          <td>0.0</td>
          <td>1.000000</td>
          <td>2.000000</td>
          <td>2.000000</td>
          <td>2.000000</td>
          <td>2.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>0.0</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 218-219

Save the locations of nans and anomalies

.. GENERATED FROM PYTHON SOURCE LINES 219-222

.. code-block:: Python


    nan_ano_loc.to_csv('Output/nan_ano_loc.csv')








.. GENERATED FROM PYTHON SOURCE LINES 223-224

Preservation of the anomalies

.. GENERATED FROM PYTHON SOURCE LINES 224-229

.. code-block:: Python


    anomalies = features[features.Den > 4]
    anomalies = pd.concat([anomalies, features[features.Vp < 3000]])
    anomalies = pd.concat([anomalies, features[features.Vs < 1000]])
    anomalies = pd.concat([anomalies, features[features.Mag > 2]])







.. GENERATED FROM PYTHON SOURCE LINES 230-233

.. code-block:: Python

    anomalies = anomalies.drop_duplicates().sort_index()
    anomalies






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Hole</th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
          <th>Form</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>TC-1319-005</td>
          <td>265.77</td>
          <td>11.2</td>
          <td>2.610</td>
          <td>2179.0</td>
          <td>NaN</td>
          <td>0.108</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>1</th>
          <td>TC-1319-005</td>
          <td>271.66</td>
          <td>11.8</td>
          <td>4.318</td>
          <td>5514.0</td>
          <td>2222.0</td>
          <td>0.108</td>
          <td>26.06</td>
          <td>4600.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>3</th>
          <td>TC-1319-005</td>
          <td>282.55</td>
          <td>11.2</td>
          <td>4.342</td>
          <td>5773.0</td>
          <td>2011.0</td>
          <td>0.151</td>
          <td>28.13</td>
          <td>5800.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>6</th>
          <td>TC-1319-005</td>
          <td>301.50</td>
          <td>11.6</td>
          <td>4.251</td>
          <td>5686.0</td>
          <td>2755.0</td>
          <td>0.132</td>
          <td>14.77</td>
          <td>6000.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>7</th>
          <td>TC-1319-005</td>
          <td>313.65</td>
          <td>11.4</td>
          <td>4.193</td>
          <td>5588.0</td>
          <td>2039.0</td>
          <td>0.068</td>
          <td>20.37</td>
          <td>5400.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>11</th>
          <td>TC-1319-005</td>
          <td>334.65</td>
          <td>11.3</td>
          <td>4.079</td>
          <td>5678.0</td>
          <td>2198.0</td>
          <td>0.036</td>
          <td>34.51</td>
          <td>2000.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>12</th>
          <td>TC-1319-005</td>
          <td>340.00</td>
          <td>11.9</td>
          <td>4.444</td>
          <td>5561.0</td>
          <td>2581.0</td>
          <td>0.024</td>
          <td>14.92</td>
          <td>4800.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>13</th>
          <td>TC-1319-005</td>
          <td>344.77</td>
          <td>11.2</td>
          <td>4.123</td>
          <td>5359.0</td>
          <td>NaN</td>
          <td>0.041</td>
          <td>45.93</td>
          <td>1000.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>24</th>
          <td>TC-1319-005</td>
          <td>400.10</td>
          <td>13.9</td>
          <td>5.074</td>
          <td>5367.0</td>
          <td>2422.0</td>
          <td>0.066</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>77</th>
          <td>TC-1319-005</td>
          <td>645.30</td>
          <td>11.0</td>
          <td>2.725</td>
          <td>992.0</td>
          <td>NaN</td>
          <td>0.308</td>
          <td>98.11</td>
          <td>293.3</td>
          <td>LABL</td>
        </tr>
        <tr>
          <th>79</th>
          <td>TC-1319-005</td>
          <td>653.95</td>
          <td>12.0</td>
          <td>2.936</td>
          <td>2321.0</td>
          <td>1929.0</td>
          <td>0.342</td>
          <td>88.45</td>
          <td>443.4</td>
          <td>LABL</td>
        </tr>
        <tr>
          <th>159</th>
          <td>TC-1319-005</td>
          <td>1046.35</td>
          <td>10.6</td>
          <td>2.455</td>
          <td>2325.0</td>
          <td>NaN</td>
          <td>0.158</td>
          <td>44.44</td>
          <td>1300.0</td>
          <td>LIS</td>
        </tr>
        <tr>
          <th>174</th>
          <td>TC-3660-008</td>
          <td>64.54</td>
          <td>9.4</td>
          <td>2.677</td>
          <td>2802.0</td>
          <td>0.0</td>
          <td>0.038</td>
          <td>35.19</td>
          <td>2500.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>198</th>
          <td>TC-3660-008</td>
          <td>188.03</td>
          <td>9.4</td>
          <td>2.679</td>
          <td>5345.0</td>
          <td>0.0</td>
          <td>0.098</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>243</th>
          <td>TC-3660-008</td>
          <td>432.95</td>
          <td>9.7</td>
          <td>2.779</td>
          <td>4641.0</td>
          <td>178.0</td>
          <td>NaN</td>
          <td>69.69</td>
          <td>515.7</td>
          <td>FLT</td>
        </tr>
        <tr>
          <th>248</th>
          <td>TC-3660-008</td>
          <td>452.30</td>
          <td>10.4</td>
          <td>2.686</td>
          <td>5652.0</td>
          <td>0.0</td>
          <td>0.053</td>
          <td>27.54</td>
          <td>4000.0</td>
          <td>FLT</td>
        </tr>
        <tr>
          <th>305</th>
          <td>TC-3660-008</td>
          <td>734.80</td>
          <td>9.5</td>
          <td>2.708</td>
          <td>5026.0</td>
          <td>2914.0</td>
          <td>93.886</td>
          <td>71.63</td>
          <td>1200.0</td>
          <td>SP</td>
        </tr>
        <tr>
          <th>312</th>
          <td>TC-3660-008</td>
          <td>774.21</td>
          <td>9.6</td>
          <td>2.787</td>
          <td>6173.0</td>
          <td>0.0</td>
          <td>0.123</td>
          <td>73.08</td>
          <td>373.2</td>
          <td>SP</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 234-235

Number and % of anomalies by boreholes

.. GENERATED FROM PYTHON SOURCE LINES 235-239

.. code-block:: Python


    print('Total anomalies:', len(anomalies), '({:.1f}%)'.format(100 * len(anomalies) / (features.shape[0] * features.shape[1])))
    anomalies.Hole.value_counts()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Total anomalies: 18 (0.5%)

    Hole
    TC-1319-005    12
    TC-3660-008     6
    Name: count, dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 240-241

Save the anomalies

.. GENERATED FROM PYTHON SOURCE LINES 241-243

.. code-block:: Python


    anomalies.to_csv('Output/anomalies.csv')







.. GENERATED FROM PYTHON SOURCE LINES 244-245

Rows with anomalies

.. GENERATED FROM PYTHON SOURCE LINES 245-248

.. code-block:: Python


    list(anomalies.index)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    [0, 1, 3, 6, 7, 11, 12, 13, 24, 77, 79, 159, 174, 198, 243, 248, 305, 312]



.. GENERATED FROM PYTHON SOURCE LINES 249-251

Anomalies Deletion
------------------

.. GENERATED FROM PYTHON SOURCE LINES 253-254

features2, new features dataframe without anomalies

.. GENERATED FROM PYTHON SOURCE LINES 254-264

.. code-block:: Python


    features2 = features.copy()

    features2['Den'] = np.where(features.Den > 4, np.nan, features.Den)
    features2['Vp'] = np.where(features.Vp < 3000, np.nan, features.Vp)
    features2['Vs'] = np.where(features.Vs < 1000, np.nan, features.Vs)
    features2['Mag'] = np.where(features.Mag > 20, np.nan, features.Mag)

    features2.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Hole</th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
          <th>Form</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>TC-1319-005</td>
          <td>265.77</td>
          <td>11.2</td>
          <td>2.610</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.108</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>1</th>
          <td>TC-1319-005</td>
          <td>271.66</td>
          <td>11.8</td>
          <td>NaN</td>
          <td>5514.0</td>
          <td>2222.0</td>
          <td>0.108</td>
          <td>26.06</td>
          <td>4600.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>2</th>
          <td>TC-1319-005</td>
          <td>277.05</td>
          <td>10.3</td>
          <td>3.399</td>
          <td>5754.0</td>
          <td>2085.0</td>
          <td>0.237</td>
          <td>17.52</td>
          <td>10200.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>3</th>
          <td>TC-1319-005</td>
          <td>282.55</td>
          <td>11.2</td>
          <td>NaN</td>
          <td>5773.0</td>
          <td>2011.0</td>
          <td>0.151</td>
          <td>28.13</td>
          <td>5800.0</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>4</th>
          <td>TC-1319-005</td>
          <td>287.20</td>
          <td>10.8</td>
          <td>3.293</td>
          <td>6034.0</td>
          <td>2122.0</td>
          <td>0.176</td>
          <td>6.88</td>
          <td>15500.0</td>
          <td>CALP</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 265-267

.. code-block:: Python

    features2.describe()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>count</th>
          <td>329.000000</td>
          <td>321.000000</td>
          <td>319.000000</td>
          <td>305.000000</td>
          <td>284.000000</td>
          <td>325.000000</td>
          <td>303.000000</td>
          <td>303.000000</td>
        </tr>
        <tr>
          <th>mean</th>
          <td>543.279696</td>
          <td>10.223364</td>
          <td>2.721755</td>
          <td>5496.386885</td>
          <td>2821.130282</td>
          <td>0.142988</td>
          <td>49.738548</td>
          <td>3283.762376</td>
        </tr>
        <tr>
          <th>std</th>
          <td>262.864370</td>
          <td>1.070127</td>
          <td>0.145217</td>
          <td>556.077848</td>
          <td>547.476118</td>
          <td>0.100706</td>
          <td>25.145208</td>
          <td>3714.851096</td>
        </tr>
        <tr>
          <th>min</th>
          <td>9.600000</td>
          <td>7.500000</td>
          <td>2.420000</td>
          <td>3009.000000</td>
          <td>1295.000000</td>
          <td>0.001000</td>
          <td>5.390000</td>
          <td>156.900000</td>
        </tr>
        <tr>
          <th>25%</th>
          <td>355.580000</td>
          <td>9.700000</td>
          <td>2.678000</td>
          <td>5220.000000</td>
          <td>2368.250000</td>
          <td>0.064000</td>
          <td>31.555000</td>
          <td>1100.000000</td>
        </tr>
        <tr>
          <th>50%</th>
          <td>547.800000</td>
          <td>10.000000</td>
          <td>2.705000</td>
          <td>5581.000000</td>
          <td>3041.000000</td>
          <td>0.114000</td>
          <td>47.040000</td>
          <td>2100.000000</td>
        </tr>
        <tr>
          <th>75%</th>
          <td>750.000000</td>
          <td>10.400000</td>
          <td>2.726500</td>
          <td>5858.000000</td>
          <td>3248.000000</td>
          <td>0.203000</td>
          <td>67.190000</td>
          <td>3850.000000</td>
        </tr>
        <tr>
          <th>max</th>
          <td>1049.200000</td>
          <td>18.000000</td>
          <td>3.650000</td>
          <td>6478.000000</td>
          <td>3740.000000</td>
          <td>0.545000</td>
          <td>171.730000</td>
          <td>23900.000000</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 268-269

Boxplot of each feature without anomalies

.. GENERATED FROM PYTHON SOURCE LINES 269-274

.. code-block:: Python


    features2.plot.box(subplots=True, grid=False, figsize=(12, 7), layout=(3, 4), flierprops={"marker": "."})
    plt.suptitle('Features Without Anomalies')
    plt.tight_layout();




.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_002.png
   :alt: Features Without Anomalies
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_002.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 275-276

Save features without anomalies

.. GENERATED FROM PYTHON SOURCE LINES 276-279

.. code-block:: Python


    features2.to_csv('Output/features2.csv')








.. GENERATED FROM PYTHON SOURCE LINES 280-282

NaNs in the Features
--------------------

.. GENERATED FROM PYTHON SOURCE LINES 284-285

Original NaNs

.. GENERATED FROM PYTHON SOURCE LINES 285-288

.. code-block:: Python


    features.isna().sum()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Hole     0
    From     0
    Len      8
    Den      2
    Vp      19
    Vs      40
    Mag      3
    Ip      26
    Res     26
    Form     0
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 289-290

Total original NaNs

.. GENERATED FROM PYTHON SOURCE LINES 290-293

.. code-block:: Python


    features.isna().sum().sum()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    np.int64(124)



.. GENERATED FROM PYTHON SOURCE LINES 294-295

% of original NaNs

.. GENERATED FROM PYTHON SOURCE LINES 295-298

.. code-block:: Python


    features.isna().sum() / len(features) * 100





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Hole     0.000000
    From     0.000000
    Len      2.431611
    Den      0.607903
    Vp       5.775076
    Vs      12.158055
    Mag      0.911854
    Ip       7.902736
    Res      7.902736
    Form     0.000000
    dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 299-300

Total % original and new NaNs

.. GENERATED FROM PYTHON SOURCE LINES 300-303

.. code-block:: Python


    print('% NaN in Features:', round(100 * features.isna().sum().sum() / (features.shape[0] * features.shape[1]), 1))





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    % NaN in Features: 3.8




.. GENERATED FROM PYTHON SOURCE LINES 304-305

Original and new NaNs

.. GENERATED FROM PYTHON SOURCE LINES 305-308

.. code-block:: Python


    features2.isna().sum()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Hole     0
    From     0
    Len      8
    Den     10
    Vp      24
    Vs      45
    Mag      4
    Ip      26
    Res     26
    Form     0
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 309-310

Total original and new NaNs

.. GENERATED FROM PYTHON SOURCE LINES 310-313

.. code-block:: Python


    features2.isna().sum().sum()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    np.int64(143)



.. GENERATED FROM PYTHON SOURCE LINES 314-315

Total % original and new NaNs

.. GENERATED FROM PYTHON SOURCE LINES 315-318

.. code-block:: Python


    print('% NaN in features2:', round(100 * features2.isna().sum().sum() / (features2.shape[0] * features2.shape[1]), 1))





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    % NaN in features2: 4.3




.. GENERATED FROM PYTHON SOURCE LINES 319-320

% Total original and new NaNs by feature

.. GENERATED FROM PYTHON SOURCE LINES 320-323

.. code-block:: Python


    features2.isna().sum() / len(features) * 100





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Hole     0.000000
    From     0.000000
    Len      2.431611
    Den      3.039514
    Vp       7.294833
    Vs      13.677812
    Mag      1.215805
    Ip       7.902736
    Res      7.902736
    Form     0.000000
    dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 324-325

Bar plot of NaNs

.. GENERATED FROM PYTHON SOURCE LINES 325-329

.. code-block:: Python


    features2.isna().sum().plot.bar(figsize=(8, 3), color='r', ylabel='Count', title='All Features NaN', label='Without anomalies', legend=True)
    features.isna().sum().plot.bar(figsize=(8, 3), color='b', label='With anomalies', legend=True, grid=True)




.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_003.png
   :alt: All Features NaN
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_003.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <Axes: title={'center': 'All Features NaN'}, ylabel='Count'>



.. GENERATED FROM PYTHON SOURCE LINES 330-331

Bar plot of NaNs (%)

.. GENERATED FROM PYTHON SOURCE LINES 331-335

.. code-block:: Python


    (features2.isna().sum() / len(features) * 100).plot.bar(figsize=(8, 3), color='r', label='Without anomalies', legend=True)
    (features.isna().sum() / len(features) * 100).plot.bar(figsize=(8, 3), color='b', label='With anomalies', legend=True,
                                                           title='All Features NaN', ylabel='%', grid=True);



.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_004.png
   :alt: All Features NaN
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_004.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <Axes: title={'center': 'All Features NaN'}, ylabel='%'>



.. GENERATED FROM PYTHON SOURCE LINES 336-337

Total rows at least one with NaN and their indexes

.. GENERATED FROM PYTHON SOURCE LINES 337-342

.. code-block:: Python


    total_nan_index = list(features2[pd.isna(features2).any(axis=1)].index)
    print('Total rows with at least one NaN:', len(total_nan_index), '\n', '\n', 'Rows indexes:')
    total_nan_index





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Total rows with at least one NaN: 62 
 
     Rows indexes:

    [0, 1, 3, 6, 7, 11, 12, 13, 15, 24, 36, 45, 47, 63, 68, 71, 75, 77, 79, 99, 100, 105, 110, 111, 112, 113, 114, 115, 117, 118, 135, 158, 159, 160, 164, 174, 183, 185, 187, 189, 191, 194, 197, 198, 201, 206, 209, 210, 229, 233, 240, 243, 248, 251, 276, 305, 311, 312, 318, 319, 320, 325]



.. GENERATED FROM PYTHON SOURCE LINES 343-359

Filling the holes
-----------------
Imputations Evaluation
++++++++++++++++++++++
Imputation refers to the alternative process of filling in all the missing values (NaN) or gaps in a dataset. Instead of removing rows or columns with missing values, those values are replaced (imputed) with estimated values based on information available in the entire dataset (in all variables). This is typically done when you want to preserve the other values in the rows where the NaNs are, and when the missing values for each variable or column do not exceed 25%. 

Since multiple imputation methods are available, it's important to evaluate each one and select the method that yields the best results, based on metrics such as the sum of residuals, MAE, MSE, RMSE, and R² (formulas of these metrics in the annex). In this case, we tested six imputation methods: 

* Simple mean (mean)
* Simple K-Means (knn)
* Iterative Impute (ii)
* Normalized mean (nmean)
* Normalized K-Means (nknn)
* Normalized Iterative Imputer (nii)

To compare these methods, we first created fictitious gaps in the Vp values (just this variable due to its importance, and for simplicity and speed) of a non-NaN dataset to compare them later the real values of Vp against the imputed. By doing so, we can compute each method's metrics and determine which performs best. In addition to these metrics, at the end of the section, there is a plot with the resulting values of each imputer method with respect to the real values, showing less dispersion towards the top of the plot.

.. GENERATED FROM PYTHON SOURCE LINES 361-362

Non-NaN dataframe to test the imputation

.. GENERATED FROM PYTHON SOURCE LINES 362-367

.. code-block:: Python


    features2_num = features2[['From', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res']]
    features2_num_nonan = features2_num.dropna()
    features2_num_nonan.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>2</th>
          <td>277.05</td>
          <td>3.399</td>
          <td>5754.0</td>
          <td>2085.0</td>
          <td>0.237</td>
          <td>17.52</td>
          <td>10200.0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>287.20</td>
          <td>3.293</td>
          <td>6034.0</td>
          <td>2122.0</td>
          <td>0.176</td>
          <td>6.88</td>
          <td>15500.0</td>
        </tr>
        <tr>
          <th>5</th>
          <td>296.75</td>
          <td>2.782</td>
          <td>4661.0</td>
          <td>2294.0</td>
          <td>0.107</td>
          <td>27.81</td>
          <td>2100.0</td>
        </tr>
        <tr>
          <th>8</th>
          <td>318.75</td>
          <td>3.386</td>
          <td>5815.0</td>
          <td>2115.0</td>
          <td>0.060</td>
          <td>8.53</td>
          <td>10700.0</td>
        </tr>
        <tr>
          <th>9</th>
          <td>325.75</td>
          <td>2.616</td>
          <td>4785.0</td>
          <td>1946.0</td>
          <td>0.040</td>
          <td>84.90</td>
          <td>296.1</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 368-369

Generation of random mask for the fictitious NaNs or gaps

.. GENERATED FROM PYTHON SOURCE LINES 369-375

.. code-block:: Python


    np.random.seed(seed)
    missing_rate = 0.2  # Porcentaje de valores faltantes
    Vp_mask = np.random.rand(features2_num_nonan.shape[0]) < missing_rate
    Vp_mask





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([False, False, False, False, False, False, False, False, False,
           False, False, False, False,  True, False, False,  True,  True,
           False, False, False, False, False, False, False, False, False,
           False, False, False,  True, False, False, False, False, False,
           False, False, False, False, False,  True, False, False, False,
           False, False, False, False, False,  True, False, False, False,
           False, False, False, False, False, False, False, False, False,
           False, False,  True, False, False,  True, False,  True, False,
           False, False,  True, False, False,  True,  True, False, False,
           False, False, False, False, False, False,  True, False, False,
           False, False, False, False, False, False,  True, False, False,
           False, False, False,  True,  True, False, False, False, False,
           False, False, False, False,  True, False, False, False, False,
           False,  True, False, False, False, False, False, False, False,
           False, False, False, False, False,  True, False, False, False,
            True, False, False, False, False,  True, False, False, False,
           False, False, False,  True, False, False,  True, False, False,
           False, False,  True, False, False, False, False, False, False,
           False,  True, False, False, False, False, False, False, False,
            True, False, False, False,  True,  True,  True, False, False,
           False,  True, False, False, False, False,  True, False, False,
           False, False, False, False, False,  True, False,  True, False,
           False,  True, False,  True, False, False, False, False,  True,
           False, False,  True, False, False, False, False, False, False,
            True, False, False,  True, False, False, False, False, False,
            True, False, False, False,  True, False,  True, False, False,
           False,  True, False, False, False, False,  True, False, False,
           False,  True, False, False, False, False, False, False, False,
           False, False,  True, False, False, False,  True, False, False,
            True, False, False, False, False,  True])



.. GENERATED FROM PYTHON SOURCE LINES 376-377

New gaps in Vp of the Non-NaN dataframe 

.. GENERATED FROM PYTHON SOURCE LINES 377-382

.. code-block:: Python


    features2_new_missing = features2_num_nonan.copy()
    features2_new_missing.loc[Vp_mask, 'Vp'] = np.nan
    features2_new_missing






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>2</th>
          <td>277.05</td>
          <td>3.399</td>
          <td>5754.0</td>
          <td>2085.0</td>
          <td>0.237</td>
          <td>17.52</td>
          <td>10200.0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>287.20</td>
          <td>3.293</td>
          <td>6034.0</td>
          <td>2122.0</td>
          <td>0.176</td>
          <td>6.88</td>
          <td>15500.0</td>
        </tr>
        <tr>
          <th>5</th>
          <td>296.75</td>
          <td>2.782</td>
          <td>4661.0</td>
          <td>2294.0</td>
          <td>0.107</td>
          <td>27.81</td>
          <td>2100.0</td>
        </tr>
        <tr>
          <th>8</th>
          <td>318.75</td>
          <td>3.386</td>
          <td>5815.0</td>
          <td>2115.0</td>
          <td>0.060</td>
          <td>8.53</td>
          <td>10700.0</td>
        </tr>
        <tr>
          <th>9</th>
          <td>325.75</td>
          <td>2.616</td>
          <td>4785.0</td>
          <td>1946.0</td>
          <td>0.040</td>
          <td>84.90</td>
          <td>296.1</td>
        </tr>
        <tr>
          <th>...</th>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
        </tr>
        <tr>
          <th>323</th>
          <td>809.47</td>
          <td>2.786</td>
          <td>5621.0</td>
          <td>2334.0</td>
          <td>0.105</td>
          <td>55.20</td>
          <td>706.5</td>
        </tr>
        <tr>
          <th>324</th>
          <td>814.55</td>
          <td>2.763</td>
          <td>5680.0</td>
          <td>3179.0</td>
          <td>0.158</td>
          <td>70.94</td>
          <td>481.1</td>
        </tr>
        <tr>
          <th>326</th>
          <td>825.60</td>
          <td>2.762</td>
          <td>5593.0</td>
          <td>1787.0</td>
          <td>0.337</td>
          <td>78.44</td>
          <td>334.6</td>
        </tr>
        <tr>
          <th>327</th>
          <td>828.80</td>
          <td>2.804</td>
          <td>5858.0</td>
          <td>1964.0</td>
          <td>0.311</td>
          <td>73.73</td>
          <td>466.5</td>
        </tr>
        <tr>
          <th>328</th>
          <td>833.40</td>
          <td>2.735</td>
          <td>NaN</td>
          <td>3217.0</td>
          <td>0.308</td>
          <td>33.86</td>
          <td>2300.0</td>
        </tr>
      </tbody>
    </table>
    <p>267 rows × 7 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 383-384

At index 328 the original Vp is

.. GENERATED FROM PYTHON SOURCE LINES 384-387

.. code-block:: Python


    features.loc[328]





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Hole    TC-3660-008
    From          833.4
    Len            10.1
    Den           2.735
    Vp           5805.0
    Vs           3217.0
    Mag           0.308
    Ip            33.86
    Res          2300.0
    Form            PAL
    Name: 328, dtype: object



.. GENERATED FROM PYTHON SOURCE LINES 388-389

True Vp values

.. GENERATED FROM PYTHON SOURCE LINES 389-393

.. code-block:: Python


    true_vp = features2_num_nonan[Vp_mask].Vp.reset_index(drop=True)
    true_vp.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0    5196.0
    1    6159.0
    2    6037.0
    3    5815.0
    4    4930.0
    Name: Vp, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 394-395

Evaluation of the imputers

.. GENERATED FROM PYTHON SOURCE LINES 395-451

.. code-block:: Python


    # List of imputers
    imputers = []

    # Initialize imputers
    mean_imputer = SimpleImputer(strategy='mean')
    ii_imputer = IterativeImputer(max_iter=10, random_state=seed)
    knn_imputer = KNNImputer(n_neighbors=5)

    # Normalization
    scaler = StandardScaler()
    features2_new_missing_scaled = scaler.fit_transform(features2_new_missing)

    # Append of imputer (name, norm, imputer, data)
    imputers.append(('mean', 'no', mean_imputer, features2_new_missing))
    imputers.append(('ii', 'no', ii_imputer, features2_new_missing))
    imputers.append(('knn', 'no', knn_imputer, features2_new_missing))
    imputers.append(('nmean', 'yes', mean_imputer, features2_new_missing_scaled))
    imputers.append(('nii', 'yes', ii_imputer, features2_new_missing_scaled))
    imputers.append(('nknn', 'yes', knn_imputer, features2_new_missing_scaled))

    # Results table
    headers = ['Imputer', 'Normalization', 'Sum Residual', 'MAE', 'MSE', 'RMSE', 'R2']
    rows = []

    # Loop over the imputers
    for name, norm, imputer, data in imputers:
        # Impute the data
        imputed_data = imputer.fit_transform(data)

        # Reverse the normalization if data was normalized
        if norm == 'yes':
            imputed_data = scaler.inverse_transform(imputed_data)

        # Convert the array to dataframe
        imputed_data_df = pd.DataFrame(imputed_data, columns=features2_new_missing.columns)
        imputed_vp = imputed_data_df.loc[Vp_mask, 'Vp'].reset_index(drop=True)

        # Calculate residuals and metrics
        residual = sum(true_vp - imputed_vp)
        mae = mean_absolute_error(true_vp, imputed_vp)
        mse = mean_squared_error(true_vp, imputed_vp)
        rmse = np.sqrt(mse)
        r2 = r2_score(true_vp, imputed_vp)

        # Create a dataframe for imputer using globals()
        globals()[f'imputer_{name}'] = imputed_data_df.copy()

        # Append results
        rows.append([name, norm, residual, mae, mse, rmse, r2])

    # Print the results in a formatted table

    print('Seed:', seed)

    print(tabulate(rows, headers=headers, tablefmt="fancy_outline"))




.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Seed: 123
    ╒═══════════╤═════════════════╤════════════════╤═════════╤════════╤═════════╤═════════════╕
    │ Imputer   │ Normalization   │   Sum Residual │     MAE │    MSE │    RMSE │          R2 │
    ╞═══════════╪═════════════════╪════════════════╪═════════╪════════╪═════════╪═════════════╡
    │ mean      │ no              │        1779.8  │ 394.903 │ 256363 │ 506.323 │ -0.00517294 │
    │ ii        │ no              │       -1106.23 │ 308.72  │ 163029 │ 403.768 │  0.360781   │
    │ knn       │ no              │        -490.6  │ 310.118 │ 168360 │ 410.317 │  0.339878   │
    │ nmean     │ yes             │        1779.8  │ 394.903 │ 256363 │ 506.323 │ -0.00517294 │
    │ nii       │ yes             │        -646.02 │ 296.29  │ 160690 │ 400.862 │  0.36995    │
    │ nknn      │ yes             │        -924.4  │ 298.808 │ 166131 │ 407.592 │  0.348616   │
    ╘═══════════╧═════════════════╧════════════════╧═════════╧════════╧═════════╧═════════════╛




.. GENERATED FROM PYTHON SOURCE LINES 452-454

.. code-block:: Python

    imputer_nii.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>277.05</td>
          <td>3.399</td>
          <td>5754.0</td>
          <td>2085.0</td>
          <td>0.237</td>
          <td>17.52</td>
          <td>10200.0</td>
        </tr>
        <tr>
          <th>1</th>
          <td>287.20</td>
          <td>3.293</td>
          <td>6034.0</td>
          <td>2122.0</td>
          <td>0.176</td>
          <td>6.88</td>
          <td>15500.0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>296.75</td>
          <td>2.782</td>
          <td>4661.0</td>
          <td>2294.0</td>
          <td>0.107</td>
          <td>27.81</td>
          <td>2100.0</td>
        </tr>
        <tr>
          <th>3</th>
          <td>318.75</td>
          <td>3.386</td>
          <td>5815.0</td>
          <td>2115.0</td>
          <td>0.060</td>
          <td>8.53</td>
          <td>10700.0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>325.75</td>
          <td>2.616</td>
          <td>4785.0</td>
          <td>1946.0</td>
          <td>0.040</td>
          <td>84.90</td>
          <td>296.1</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 455-456

Plot of the real Vp vs. the imputed Vp

.. GENERATED FROM PYTHON SOURCE LINES 456-473

.. code-block:: Python


    plt.figure(figsize=(8, 8))
    plt.scatter(true_vp, true_vp, color='k', label='True Values')
    plt.scatter(true_vp, imputer_mean[Vp_mask].Vp, label='Mean (mean)')
    plt.scatter(true_vp, imputer_nmean[Vp_mask].Vp, label='Normalized Mean (nmean)')
    plt.scatter(true_vp, imputer_ii[Vp_mask].Vp, label='Iterative (ii)')
    plt.scatter(true_vp, imputer_nii[Vp_mask].Vp, label='Normalized (nii)')
    plt.scatter(true_vp, imputer_knn[Vp_mask].Vp, label='K-Mean Neighbor (knn)')
    plt.scatter(true_vp, imputer_nknn[Vp_mask].Vp, label='Normalized K-Mean Neighbor (nknn)')
    plt.xlabel('True Vp (m/s)')
    plt.ylabel('Imputed Vp (m/s)')
    plt.title('Real Vp vs. Imputed Vp')
    plt.axis([4000, 6750, 4000, 6750])
    plt.legend()
    plt.grid()
    plt.show()




.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_005.png
   :alt: Real Vp vs. Imputed Vp
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_005.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 474-479

Best Imputation
 +++++++++++++++  
By a closer examination of the metrics, we can conclude that the normalized Iterative Imputer (nii), also called MICE (Multiple Imputation by Chained Equations), performs the best (or least poorly), particularly for the Vp variable, with the mean of the imputed values increasing just 1.2 % with respect to the mean of the real values. Consequently, we used the nii imputer to fill the gaps across all variables and saved the updated dataset in a new dataframe (features3).

The multiplot of this section shows that the imputation process maintains the shape of the boxplot of the variables from which the anomalies have been removed. In other words, this multiplot of the imputed dataset (features3) is almost identical to the multiplot of features2 in section 7.2.

.. GENERATED FROM PYTHON SOURCE LINES 481-482

Normalized interactive imputation of features2

.. GENERATED FROM PYTHON SOURCE LINES 482-495

.. code-block:: Python


    # Data normalization
    scaler = StandardScaler()
    features2_scaled = scaler.fit_transform(features2_num)

    # Imputation of normalized data
    nii_features3_array = ii_imputer.fit_transform(features2_scaled)

    # Reverse normalization and new imputed dataframe
    features3 = pd.DataFrame(scaler.inverse_transform(nii_features3_array), columns=features2_num.columns)

    features3.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>265.77</td>
          <td>2.610000</td>
          <td>5725.879959</td>
          <td>3073.767206</td>
          <td>0.108</td>
          <td>36.995337</td>
          <td>5489.392555</td>
        </tr>
        <tr>
          <th>1</th>
          <td>271.66</td>
          <td>2.786624</td>
          <td>5514.000000</td>
          <td>2222.000000</td>
          <td>0.108</td>
          <td>26.060000</td>
          <td>4600.000000</td>
        </tr>
        <tr>
          <th>2</th>
          <td>277.05</td>
          <td>3.399000</td>
          <td>5754.000000</td>
          <td>2085.000000</td>
          <td>0.237</td>
          <td>17.520000</td>
          <td>10200.000000</td>
        </tr>
        <tr>
          <th>3</th>
          <td>282.55</td>
          <td>2.800579</td>
          <td>5773.000000</td>
          <td>2011.000000</td>
          <td>0.151</td>
          <td>28.130000</td>
          <td>5800.000000</td>
        </tr>
        <tr>
          <th>4</th>
          <td>287.20</td>
          <td>3.293000</td>
          <td>6034.000000</td>
          <td>2122.000000</td>
          <td>0.176</td>
          <td>6.880000</td>
          <td>15500.000000</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 496-498

.. code-block:: Python

    features3.describe()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>count</th>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.000000</td>
          <td>329.000000</td>
        </tr>
        <tr>
          <th>mean</th>
          <td>543.279696</td>
          <td>2.723043</td>
          <td>5501.096555</td>
          <td>2808.728436</td>
          <td>0.142841</td>
          <td>48.950658</td>
          <td>3401.316252</td>
        </tr>
        <tr>
          <th>std</th>
          <td>262.864370</td>
          <td>0.143224</td>
          <td>538.939658</td>
          <td>517.989081</td>
          <td>0.100134</td>
          <td>24.374350</td>
          <td>3623.480943</td>
        </tr>
        <tr>
          <th>min</th>
          <td>9.600000</td>
          <td>2.420000</td>
          <td>3009.000000</td>
          <td>1295.000000</td>
          <td>0.001000</td>
          <td>5.390000</td>
          <td>156.900000</td>
        </tr>
        <tr>
          <th>25%</th>
          <td>355.580000</td>
          <td>2.679000</td>
          <td>5266.000000</td>
          <td>2448.000000</td>
          <td>0.064000</td>
          <td>31.808800</td>
          <td>1200.000000</td>
        </tr>
        <tr>
          <th>50%</th>
          <td>547.800000</td>
          <td>2.707000</td>
          <td>5587.000000</td>
          <td>2994.000000</td>
          <td>0.115000</td>
          <td>46.200000</td>
          <td>2300.000000</td>
        </tr>
        <tr>
          <th>75%</th>
          <td>750.000000</td>
          <td>2.731000</td>
          <td>5849.000000</td>
          <td>3231.000000</td>
          <td>0.201000</td>
          <td>65.470000</td>
          <td>4200.000000</td>
        </tr>
        <tr>
          <th>max</th>
          <td>1049.200000</td>
          <td>3.650000</td>
          <td>6478.000000</td>
          <td>3740.000000</td>
          <td>0.545000</td>
          <td>171.730000</td>
          <td>23900.000000</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 499-500

Boxplot of each imputed feature

.. GENERATED FROM PYTHON SOURCE LINES 500-505

.. code-block:: Python


    features3.plot.box(subplots=True, grid=False, figsize=(12, 7), layout=(3, 4), flierprops={"marker": "."})
    plt.suptitle('Imputed Features')
    plt.tight_layout()




.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_006.png
   :alt: Imputed Features
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_006.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 506-507

Copy reference variable to features3

.. GENERATED FROM PYTHON SOURCE LINES 507-510

.. code-block:: Python


    features3[['Hole', 'Len', 'Form']] = features2[['Hole', 'Len', 'Form']]








.. GENERATED FROM PYTHON SOURCE LINES 511-512

Order of the variables

.. GENERATED FROM PYTHON SOURCE LINES 512-516

.. code-block:: Python


    features3_series = list(features2.columns)
    features3_series





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    ['Hole', 'From', 'Len', 'Den', 'Vp', 'Vs', 'Mag', 'Ip', 'Res', 'Form']



.. GENERATED FROM PYTHON SOURCE LINES 517-518

Reordering features3

.. GENERATED FROM PYTHON SOURCE LINES 518-521

.. code-block:: Python

    features3 = features3[features3_series]
    features3.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Hole</th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
          <th>Form</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>TC-1319-005</td>
          <td>265.77</td>
          <td>11.2</td>
          <td>2.610000</td>
          <td>5725.879959</td>
          <td>3073.767206</td>
          <td>0.108</td>
          <td>36.995337</td>
          <td>5489.392555</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>1</th>
          <td>TC-1319-005</td>
          <td>271.66</td>
          <td>11.8</td>
          <td>2.786624</td>
          <td>5514.000000</td>
          <td>2222.000000</td>
          <td>0.108</td>
          <td>26.060000</td>
          <td>4600.000000</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>2</th>
          <td>TC-1319-005</td>
          <td>277.05</td>
          <td>10.3</td>
          <td>3.399000</td>
          <td>5754.000000</td>
          <td>2085.000000</td>
          <td>0.237</td>
          <td>17.520000</td>
          <td>10200.000000</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>3</th>
          <td>TC-1319-005</td>
          <td>282.55</td>
          <td>11.2</td>
          <td>2.800579</td>
          <td>5773.000000</td>
          <td>2011.000000</td>
          <td>0.151</td>
          <td>28.130000</td>
          <td>5800.000000</td>
          <td>CALP</td>
        </tr>
        <tr>
          <th>4</th>
          <td>TC-1319-005</td>
          <td>287.20</td>
          <td>10.8</td>
          <td>3.293000</td>
          <td>6034.000000</td>
          <td>2122.000000</td>
          <td>0.176</td>
          <td>6.880000</td>
          <td>15500.000000</td>
          <td>CALP</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 522-523

Save features3

.. GENERATED FROM PYTHON SOURCE LINES 523-525

.. code-block:: Python

    features3.to_csv('Output/features3.csv')








.. GENERATED FROM PYTHON SOURCE LINES 526-527

Indixes of Vp values in features2

.. GENERATED FROM PYTHON SOURCE LINES 527-530

.. code-block:: Python


    features2.Vp[~features2.Vp.isna()]





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    1      5514.0
    2      5754.0
    3      5773.0
    4      6034.0
    5      4661.0
            ...  
    323    5621.0
    324    5680.0
    326    5593.0
    327    5858.0
    328    5805.0
    Name: Vp, Length: 305, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 531-532

Indixes of Vp NaNs in features2

.. GENERATED FROM PYTHON SOURCE LINES 532-535

.. code-block:: Python


    features2.Vp[features2.Vp.isna()]





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0     NaN
    15    NaN
    36    NaN
    77    NaN
    79    NaN
    111   NaN
    118   NaN
    159   NaN
    174   NaN
    183   NaN
    185   NaN
    187   NaN
    189   NaN
    191   NaN
    194   NaN
    201   NaN
    210   NaN
    233   NaN
    240   NaN
    251   NaN
    276   NaN
    311   NaN
    320   NaN
    325   NaN
    Name: Vp, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 536-539

.. code-block:: Python

    Vp_nan_index = list(features2.Vp[features2.Vp.isna()].index)
    Vp_nan_index





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    [0, 15, 36, 77, 79, 111, 118, 159, 174, 183, 185, 187, 189, 191, 194, 201, 210, 233, 240, 251, 276, 311, 320, 325]



.. GENERATED FROM PYTHON SOURCE LINES 540-541

Original Vp without anomalies

.. GENERATED FROM PYTHON SOURCE LINES 541-543

.. code-block:: Python

    features2.Vp.describe()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    count     305.000000
    mean     5496.386885
    std       556.077848
    min      3009.000000
    25%      5220.000000
    50%      5581.000000
    75%      5858.000000
    max      6478.000000
    Name: Vp, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 544-545

Imputed Vp

.. GENERATED FROM PYTHON SOURCE LINES 545-548

.. code-block:: Python


    features3.Vp.iloc[Vp_nan_index].describe()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    count      24.000000
    mean     5560.948607
    std       225.846356
    min      5210.315493
    25%      5369.531486
    50%      5605.088801
    75%      5779.916352
    max      5854.418200
    Name: Vp, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 549-550

% of increase of the mean of the imputed Vp compared with the real Vp

.. GENERATED FROM PYTHON SOURCE LINES 550-554

.. code-block:: Python


    vp_change_den = 100 * (features3.Vp.iloc[Vp_nan_index].mean() - features2.Vp.mean()) / features2.Vp.mean()
    print('Increase of the mean:', '{:.1f} %'.format(vp_change_den))





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Increase of the mean: 1.2 %




.. GENERATED FROM PYTHON SOURCE LINES 555-556

nii Vp for comparison

.. GENERATED FROM PYTHON SOURCE LINES 556-560

.. code-block:: Python


    nii_vp = features3.Vp.iloc[Vp_nan_index]
    nii_vp.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0     5725.879959
    15    5604.870331
    36    5558.402401
    77    5374.912754
    79    5210.315493
    Name: Vp, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 561-585

Estimating Vp
-------------
The calculation of a single variable such as Vp (again, just this variable due to its importance, for simplicity and speed) allows us to try and evaluate different methods for filling the gaps, from empirical formulas to the simplest Machine Learning (ML) algorithms.

Vp by Gardner
+++++++++++++
The Gardner's formula is a well-known empirical formula that allows us to transform the density (Den) into compressional velocity (Vp):

.. math::
   Vp = \alpha * \phi ^ {\beta}

Where

:math:`\phi` is the density and the standard coefficients (reference: https://wiki.seg.org/wiki/Dictionary:Gardner%E2%80%99s_equation) are:

.. math::
   \alpha \approx 4348

.. math::
   \beta = 0.25

Contrary to the expected increasing trend between Den and Vp, the plot below shows an anomalous vertical trend, where multiple Vp values are associated with a single Den value (around 2.7). This anomalous trend results in a negative R² (-0.021, the worst so far) when Vp is calculated using all values, and this score improves slightly when we filter out the pairs that fall outside the expected increasing trend. This is confirmed by the low covariance presented by the Den-Vp pair in the covariance matrix of this section (low correlation coefficient in the previous notebook 1/3).
%%
Plot of Den vs. Vp

.. GENERATED FROM PYTHON SOURCE LINES 585-592

.. code-block:: Python


    plt.grid()
    plt.scatter(features2.Den, features2.Vp, alpha=0.65, edgecolors='k')
    plt.xlabel('Den (g/cm$^3$)')
    plt.ylabel('Vp (m/s)')
    plt.title('Data Without Anomalies')




.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_007.png
   :alt: Data Without Anomalies
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_007.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Text(0.5, 1.0, 'Data Without Anomalies')



.. GENERATED FROM PYTHON SOURCE LINES 593-595

.. code-block:: Python

    features2.select_dtypes(include=['number'])






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>265.77</td>
          <td>11.2</td>
          <td>2.610</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.108</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>1</th>
          <td>271.66</td>
          <td>11.8</td>
          <td>NaN</td>
          <td>5514.0</td>
          <td>2222.0</td>
          <td>0.108</td>
          <td>26.06</td>
          <td>4600.0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>277.05</td>
          <td>10.3</td>
          <td>3.399</td>
          <td>5754.0</td>
          <td>2085.0</td>
          <td>0.237</td>
          <td>17.52</td>
          <td>10200.0</td>
        </tr>
        <tr>
          <th>3</th>
          <td>282.55</td>
          <td>11.2</td>
          <td>NaN</td>
          <td>5773.0</td>
          <td>2011.0</td>
          <td>0.151</td>
          <td>28.13</td>
          <td>5800.0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>287.20</td>
          <td>10.8</td>
          <td>3.293</td>
          <td>6034.0</td>
          <td>2122.0</td>
          <td>0.176</td>
          <td>6.88</td>
          <td>15500.0</td>
        </tr>
        <tr>
          <th>...</th>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
        </tr>
        <tr>
          <th>324</th>
          <td>814.55</td>
          <td>9.6</td>
          <td>2.763</td>
          <td>5680.0</td>
          <td>3179.0</td>
          <td>0.158</td>
          <td>70.94</td>
          <td>481.1</td>
        </tr>
        <tr>
          <th>325</th>
          <td>819.92</td>
          <td>NaN</td>
          <td>2.649</td>
          <td>NaN</td>
          <td>NaN</td>
          <td>0.024</td>
          <td>NaN</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>326</th>
          <td>825.60</td>
          <td>9.9</td>
          <td>2.762</td>
          <td>5593.0</td>
          <td>1787.0</td>
          <td>0.337</td>
          <td>78.44</td>
          <td>334.6</td>
        </tr>
        <tr>
          <th>327</th>
          <td>828.80</td>
          <td>9.9</td>
          <td>2.804</td>
          <td>5858.0</td>
          <td>1964.0</td>
          <td>0.311</td>
          <td>73.73</td>
          <td>466.5</td>
        </tr>
        <tr>
          <th>328</th>
          <td>833.40</td>
          <td>10.1</td>
          <td>2.735</td>
          <td>5805.0</td>
          <td>3217.0</td>
          <td>0.308</td>
          <td>33.86</td>
          <td>2300.0</td>
        </tr>
      </tbody>
    </table>
    <p>329 rows × 8 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 596-597

Covariance of the features2

.. GENERATED FROM PYTHON SOURCE LINES 597-600

.. code-block:: Python

    vari_mat = features2.select_dtypes(include=['number']).cov()
    vari_mat






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>From</th>
          <td>69097.677012</td>
          <td>-16.222055</td>
          <td>-7.748490</td>
          <td>-53382.663546</td>
          <td>-40566.665284</td>
          <td>7.019153</td>
          <td>2766.653801</td>
          <td>-5.105614e+05</td>
        </tr>
        <tr>
          <th>Len</th>
          <td>-16.222055</td>
          <td>1.145171</td>
          <td>0.052440</td>
          <td>-49.309151</td>
          <td>-165.827338</td>
          <td>-0.010128</td>
          <td>-0.338487</td>
          <td>-2.218782e+02</td>
        </tr>
        <tr>
          <th>Den</th>
          <td>-7.748490</td>
          <td>0.052440</td>
          <td>0.021088</td>
          <td>2.888883</td>
          <td>-12.299676</td>
          <td>-0.000414</td>
          <td>-0.243046</td>
          <td>5.485842e+01</td>
        </tr>
        <tr>
          <th>Vp</th>
          <td>-53382.663546</td>
          <td>-49.309151</td>
          <td>2.888883</td>
          <td>309222.573512</td>
          <td>93428.327238</td>
          <td>0.964435</td>
          <td>-3777.361458</td>
          <td>9.111955e+05</td>
        </tr>
        <tr>
          <th>Vs</th>
          <td>-40566.665284</td>
          <td>-165.827338</td>
          <td>-12.299676</td>
          <td>93428.327238</td>
          <td>299730.099574</td>
          <td>-2.230359</td>
          <td>-1474.676736</td>
          <td>5.989462e+05</td>
        </tr>
        <tr>
          <th>Mag</th>
          <td>7.019153</td>
          <td>-0.010128</td>
          <td>-0.000414</td>
          <td>0.964435</td>
          <td>-2.230359</td>
          <td>0.010142</td>
          <td>0.674282</td>
          <td>-1.611072e+01</td>
        </tr>
        <tr>
          <th>Ip</th>
          <td>2766.653801</td>
          <td>-0.338487</td>
          <td>-0.243046</td>
          <td>-3777.361458</td>
          <td>-1474.676736</td>
          <td>0.674282</td>
          <td>632.281488</td>
          <td>-6.145102e+04</td>
        </tr>
        <tr>
          <th>Res</th>
          <td>-510561.399578</td>
          <td>-221.878244</td>
          <td>54.858423</td>
          <td>911195.549174</td>
          <td>598946.207455</td>
          <td>-16.110723</td>
          <td>-61451.019939</td>
          <td>1.380012e+07</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 601-603

.. code-block:: Python

    vari_mat.describe()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>count</th>
          <td>8.000000</td>
          <td>8.000000</td>
          <td>8.000000</td>
          <td>8.000000</td>
          <td>8.000000</td>
          <td>8.000000</td>
          <td>8.000000</td>
          <td>8.000000e+00</td>
        </tr>
        <tr>
          <th>mean</th>
          <td>-66582.918624</td>
          <td>-56.548474</td>
          <td>4.691151</td>
          <td>157080.121136</td>
          <td>118735.366859</td>
          <td>-1.210452</td>
          <td>-7913.003762</td>
          <td>1.842258e+06</td>
        </tr>
        <tr>
          <th>std</th>
          <td>183023.873982</td>
          <td>87.703069</td>
          <td>20.884445</td>
          <td>325158.301042</td>
          <td>222378.924444</td>
          <td>6.584200</td>
          <td>21711.748328</td>
          <td>4.851295e+06</td>
        </tr>
        <tr>
          <th>min</th>
          <td>-510561.399578</td>
          <td>-221.878244</td>
          <td>-12.299676</td>
          <td>-53382.663546</td>
          <td>-40566.665284</td>
          <td>-16.110723</td>
          <td>-61451.019939</td>
          <td>-5.105614e+05</td>
        </tr>
        <tr>
          <th>25%</th>
          <td>-43770.664850</td>
          <td>-78.438698</td>
          <td>-2.119407</td>
          <td>-981.322228</td>
          <td>-493.039688</td>
          <td>-0.565186</td>
          <td>-2050.347917</td>
          <td>-1.552916e+04</td>
        </tr>
        <tr>
          <th>50%</th>
          <td>-11.985273</td>
          <td>-8.280271</td>
          <td>0.010337</td>
          <td>1.926659</td>
          <td>-7.265017</td>
          <td>0.004864</td>
          <td>-0.290766</td>
          <td>1.937385e+01</td>
        </tr>
        <tr>
          <th>75%</th>
          <td>696.927815</td>
          <td>0.005514</td>
          <td>0.761551</td>
          <td>147376.888806</td>
          <td>145003.770322</td>
          <td>0.746820</td>
          <td>158.576084</td>
          <td>6.770085e+05</td>
        </tr>
        <tr>
          <th>max</th>
          <td>69097.677012</td>
          <td>1.145171</td>
          <td>54.858423</td>
          <td>911195.549174</td>
          <td>598946.207455</td>
          <td>7.019153</td>
          <td>2766.653801</td>
          <td>1.380012e+07</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 604-605

Covariance matrix

.. GENERATED FROM PYTHON SOURCE LINES 605-611

.. code-block:: Python


    sns.heatmap(vari_mat, vmin=-500, vmax=500, center=0, linewidths=.1, cmap='seismic_r', annot=True)
    plt.title('Covariance Matrix')
    plt.tight_layout()
    plt.show()




.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_008.png
   :alt: Covariance Matrix
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_008.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 612-613

Gardner Vp in features2, with the standard alpha of 4348

.. GENERATED FROM PYTHON SOURCE LINES 613-617

.. code-block:: Python


    features2['VpG'] = 4348 * (features2.Den) ** 0.25
    features2['VpG']





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0      5526.493483
    1              NaN
    2      5903.741408
    3              NaN
    4      5857.165129
              ...     
    324    5605.763308
    325    5547.023745
    326    5605.256023
    327    5626.444480
    328    5591.506938
    Name: VpG, Length: 329, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 618-619

Drop NaNs for metrics calculation

.. GENERATED FROM PYTHON SOURCE LINES 619-623

.. code-block:: Python


    features2_nonan = features2.dropna()
    features2_nonan.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Hole</th>
          <th>From</th>
          <th>Len</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
          <th>Form</th>
          <th>VpG</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>2</th>
          <td>TC-1319-005</td>
          <td>277.05</td>
          <td>10.3</td>
          <td>3.399</td>
          <td>5754.0</td>
          <td>2085.0</td>
          <td>0.237</td>
          <td>17.52</td>
          <td>10200.0</td>
          <td>CALP</td>
          <td>5903.741408</td>
        </tr>
        <tr>
          <th>4</th>
          <td>TC-1319-005</td>
          <td>287.20</td>
          <td>10.8</td>
          <td>3.293</td>
          <td>6034.0</td>
          <td>2122.0</td>
          <td>0.176</td>
          <td>6.88</td>
          <td>15500.0</td>
          <td>CALP</td>
          <td>5857.165129</td>
        </tr>
        <tr>
          <th>5</th>
          <td>TC-1319-005</td>
          <td>296.75</td>
          <td>10.3</td>
          <td>2.782</td>
          <td>4661.0</td>
          <td>2294.0</td>
          <td>0.107</td>
          <td>27.81</td>
          <td>2100.0</td>
          <td>CALP</td>
          <td>5615.375681</td>
        </tr>
        <tr>
          <th>8</th>
          <td>TC-1319-005</td>
          <td>318.75</td>
          <td>10.7</td>
          <td>3.386</td>
          <td>5815.0</td>
          <td>2115.0</td>
          <td>0.060</td>
          <td>8.53</td>
          <td>10700.0</td>
          <td>CALP</td>
          <td>5898.088351</td>
        </tr>
        <tr>
          <th>9</th>
          <td>TC-1319-005</td>
          <td>325.75</td>
          <td>10.0</td>
          <td>2.616</td>
          <td>4785.0</td>
          <td>1946.0</td>
          <td>0.040</td>
          <td>84.90</td>
          <td>296.1</td>
          <td>CALP</td>
          <td>5529.666895</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 624-625

Metrics of the Gardner calculation with all values

.. GENERATED FROM PYTHON SOURCE LINES 625-630

.. code-block:: Python

    vpg_metrics = basic_stat.metrics(features2_nonan.Vp, features2_nonan.VpG)

    print("Metrics for Gardner:\n")
    print(vpg_metrics)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Metrics for Gardner:

    {'Sum of Residuals': np.float64(-10723.609649536735), 'MAE': np.float64(365.5125465596317), 'MSE': np.float64(254469.14816210992), 'RMSE': np.float64(504.44935143392735), 'R2': -0.021743605762918117}




.. GENERATED FROM PYTHON SOURCE LINES 631-632

Metrics of the Gardner calculation with filtered values

.. GENERATED FROM PYTHON SOURCE LINES 632-641

.. code-block:: Python


    vpg_metrics2 = basic_stat.metrics(
        features2_nonan.Vp[(features2_nonan.Vp > 4000) & (features2_nonan.Vp < 6000)],
        features2_nonan.VpG[(features2_nonan.Vp > 4000) & (features2_nonan.Vp < 6000)]
    )

    print("Metrics for Gardner with Filtered Data Vp):\n")
    print(vpg_metrics2)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Metrics for Gardner with Filtered Data Vp):

    {'Sum of Residuals': np.float64(-25589.572721078162), 'MAE': np.float64(295.9254751013969), 'MSE': np.float64(162331.10374522334), 'RMSE': np.float64(402.9033429313082), 'R2': -0.0937786027043972}




.. GENERATED FROM PYTHON SOURCE LINES 642-643

Gardner Vp for comparison

.. GENERATED FROM PYTHON SOURCE LINES 643-647

.. code-block:: Python


    gard_vp = features2.VpG.iloc[Vp_nan_index]
    gard_vp.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0     5526.493483
    15    5520.661324
    36    5633.454391
    77    5586.388848
    79    5691.523695
    Name: VpG, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 648-652

Simpliest ML models for Vp
++++++++++++++++++++++++++

Machine Learning (ML) algorithms offer a wide range of options to calculate missing values of a single variable. From the simplest and most well-known, such as linear regression with an independent variable, to the most complex. Below we use the simplest ML algorithms and their respective metrics, applied again to predict Vp and fill its gaps. But first, we split the non-NaN portion of the filtered data (features2) for training and testing.

.. GENERATED FROM PYTHON SOURCE LINES 654-656

.. code-block:: Python

    features2_num_nonan.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>From</th>
          <th>Den</th>
          <th>Vp</th>
          <th>Vs</th>
          <th>Mag</th>
          <th>Ip</th>
          <th>Res</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>2</th>
          <td>277.05</td>
          <td>3.399</td>
          <td>5754.0</td>
          <td>2085.0</td>
          <td>0.237</td>
          <td>17.52</td>
          <td>10200.0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>287.20</td>
          <td>3.293</td>
          <td>6034.0</td>
          <td>2122.0</td>
          <td>0.176</td>
          <td>6.88</td>
          <td>15500.0</td>
        </tr>
        <tr>
          <th>5</th>
          <td>296.75</td>
          <td>2.782</td>
          <td>4661.0</td>
          <td>2294.0</td>
          <td>0.107</td>
          <td>27.81</td>
          <td>2100.0</td>
        </tr>
        <tr>
          <th>8</th>
          <td>318.75</td>
          <td>3.386</td>
          <td>5815.0</td>
          <td>2115.0</td>
          <td>0.060</td>
          <td>8.53</td>
          <td>10700.0</td>
        </tr>
        <tr>
          <th>9</th>
          <td>325.75</td>
          <td>2.616</td>
          <td>4785.0</td>
          <td>1946.0</td>
          <td>0.040</td>
          <td>84.90</td>
          <td>296.1</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 657-658

Target from filtered features2

.. GENERATED FROM PYTHON SOURCE LINES 658-664

.. code-block:: Python


    # The target or objective of the model is the Vp
    target = features2_num_nonan.Vp[(features2_num_nonan.Vp > 4000) & (features2_num_nonan.Vp < 6000)]
    print(target.shape)
    target.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (225,)

    2     5754.0
    5     4661.0
    8     5815.0
    9     4785.0
    10    5707.0
    Name: Vp, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 665-666

Filtered density is the independent feature to compute Vp, the target

.. GENERATED FROM PYTHON SOURCE LINES 666-669

.. code-block:: Python

    features2_den = features2_num_nonan.Den[(features2_num_nonan.Vp > 4000) & (features2_num_nonan.Vp < 6000)]
    features2_den.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    2     3.399
    5     2.782
    8     3.386
    9     2.616
    10    2.821
    Name: Den, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 670-671

Split and shape of data for training and testing

.. GENERATED FROM PYTHON SOURCE LINES 671-685

.. code-block:: Python


    X_train, X_test, y_train, y_test = train_test_split(features2_den, target, test_size=0.2, random_state=seed)

    X_train = np.array(X_train).reshape(-1, 1)
    X_test = np.array(X_test).reshape(-1, 1)
    y_train = np.array(y_train).reshape(-1, 1)
    y_test = np.array(y_test).reshape(-1, 1)

    print('X_train shape:', X_train.shape)
    print('y_train shape:', y_train.shape)

    print('X_test shape:', X_test.shape)
    print('y_test shape:', y_test.shape)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    X_train shape: (180, 1)
    y_train shape: (180, 1)
    X_test shape: (45, 1)
    y_test shape: (45, 1)




.. GENERATED FROM PYTHON SOURCE LINES 686-690

Simple Linear Regression
++++++++++++++++++++++++

The linear regression is perhaps one of the simplest ML algorithms, it allows us to define a simple formula to compute Vp from Den, the only feature. The regression using all the data gave a negative $R^2$ (the predictions are worse than if it had simply predicted the mean), while applying a filter in the input data(4000 < Vp < 6000) resulted in a positive $R^2$, although a very small one (0.003).

.. GENERATED FROM PYTHON SOURCE LINES 692-693

Linear Regression

.. GENERATED FROM PYTHON SOURCE LINES 693-702

.. code-block:: Python


    # Model training and prediction
    lr_model = LinearRegression().fit(X_train, y_train)
    lr_predict = lr_model.predict(np.array(X_test))

    # Coheficients
    coef = lr_model.coef_.item()
    inter = lr_model.intercept_.item()








.. GENERATED FROM PYTHON SOURCE LINES 703-704

Metric of the linear regression with filtered data

.. GENERATED FROM PYTHON SOURCE LINES 704-710

.. code-block:: Python


    lr_metrics = basic_stat.metrics(y_test, lr_predict)

    print("Metrics for Linear Regressor:\n")
    print(lr_metrics)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Metrics for Linear Regressor:

    {'Sum of Residuals': np.float64(-663.6657176180624), 'MAE': np.float64(321.4070681962207), 'MSE': np.float64(152576.8824032091), 'RMSE': np.float64(390.61090922196365), 'R2': 0.002886141920427243}




.. GENERATED FROM PYTHON SOURCE LINES 711-712

Plot of filtered data and regression

.. GENERATED FROM PYTHON SOURCE LINES 712-722

.. code-block:: Python


    plt.scatter(X_test, y_test, label='Real data', alpha=0.65, edgecolors='k')
    plt.plot(X_test, lr_predict, c='r', label='Prediction')
    plt.title(f"Vp = {coef:.1f} * den + {inter:.1f}")
    plt.suptitle('Linear Regression')
    plt.xlabel('Density (g/cm$^3$)')
    plt.ylabel('Vp (m/s)')
    plt.grid()
    plt.legend();




.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_009.png
   :alt: Linear Regression, Vp = 175.7 * den + 4996.3
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_009.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <matplotlib.legend.Legend object at 0x7f3199fc2ce0>



.. GENERATED FROM PYTHON SOURCE LINES 723-724

lr Vp for comparison

.. GENERATED FROM PYTHON SOURCE LINES 724-731

.. code-block:: Python


    #  Density of the Vp calculations
    den_feature = np.array(features2.Den.iloc[Vp_nan_index]).reshape(-1, 1)

    lr_vp = lr_model.predict(den_feature)
    lr_vp





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([[5454.81134299],
           [5452.87886354],
           [5491.35277254],
           [5475.01453721],
           [5512.08300661],
           [5473.43341766],
           [5454.98702294],
           [5427.58095077],
           [5466.58189962],
           [5467.28461942],
           [5474.66317731],
           [5468.33869912],
           [5458.85198183],
           [5466.05485977],
           [5454.81134299],
           [5517.70476501],
           [5469.74413872],
           [5470.27117857],
           [5467.98733922],
           [5478.52813621],
           [5470.27117857],
           [5462.01422093],
           [5495.74477128],
           [5461.66286103]])



.. GENERATED FROM PYTHON SOURCE LINES 732-735

Non Linear Fit
++++++++++++++
Although the non_linear fit is more powerful since it can fit the alpha and beta coefficients of a non-linear curve of density to Vp, the $R^2$ metric (0.004) does not differ much from the linear regression with the filtered data.

.. GENERATED FROM PYTHON SOURCE LINES 737-738

Reshaping the filtered data

.. GENERATED FROM PYTHON SOURCE LINES 738-746

.. code-block:: Python


    X_train = np.array(X_train).flatten()
    y_train = np.array(y_train).flatten()

    print('X_train shape:', X_train.shape)
    print('y_train shape:', y_train.shape)






.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    X_train shape: (180,)
    y_train shape: (180,)




.. GENERATED FROM PYTHON SOURCE LINES 747-748

Gardner function for double fitting, for alpha and beta

.. GENERATED FROM PYTHON SOURCE LINES 748-753

.. code-block:: Python


    def gardner_model(X_train, alpha, beta):
        return alpha * X_train ** beta









.. GENERATED FROM PYTHON SOURCE LINES 754-755

Model fit

.. GENERATED FROM PYTHON SOURCE LINES 755-757

.. code-block:: Python


    popt, pcov = curve_fit(gardner_model, X_train, y_train, p0=[0, 0])  # p0 is the initial estimation of alpha y beta







.. GENERATED FROM PYTHON SOURCE LINES 758-759

Coheficients and prediction

.. GENERATED FROM PYTHON SOURCE LINES 759-767

.. code-block:: Python


    alpha, beta = popt

    print('Alpha:', alpha)
    print('Beta:', beta)

    nlf_predict = gardner_model(X_test, alpha, beta)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Alpha: 4944.85723377095
    Beta: 0.1017276747649569




.. GENERATED FROM PYTHON SOURCE LINES 768-769

Metric of the non linear fit with filtered data

.. GENERATED FROM PYTHON SOURCE LINES 769-774

.. code-block:: Python

    nlf_metrics = basic_stat.metrics(y_test, nlf_predict)

    print("Metrics for Non Linear Fit:\n")
    print(nlf_metrics)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Metrics for Non Linear Fit:

    {'Sum of Residuals': np.float64(-672.1847517458073), 'MAE': np.float64(321.430456631444), 'MSE': np.float64(152446.9582936085), 'RMSE': np.float64(390.4445649431024), 'R2': 0.003735216355170601}




.. GENERATED FROM PYTHON SOURCE LINES 775-776

Plot of filtered data and NLF equation

.. GENERATED FROM PYTHON SOURCE LINES 776-785

.. code-block:: Python

    plt.scatter(X_test, y_test, label='Real data')
    plt.plot(X_test, nlf_predict, c='r', label='Prediction')
    plt.title(f"Vp = {alpha:.2f} * den ** {beta:.2f}")
    plt.xlabel("Density (g/cm3)")
    plt.suptitle('Non Linear Fit')
    plt.ylabel("Vp (m/s)")
    plt.grid()
    plt.show()




.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_010.png
   :alt: Non Linear Fit, Vp = 4944.86 * den ** 0.10
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_010.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 786-787

nlf Vp for comparison

.. GENERATED FROM PYTHON SOURCE LINES 787-790

.. code-block:: Python

    nlf_vp = gardner_model(den_feature, alpha, beta)
    nlf_vp





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([[5451.77127868],
           [5449.42946455],
           [5494.46242706],
           [5475.7369865 ],
           [5517.43839526],
           [5473.89450644],
           [5451.98373102],
           [5417.92250818],
           [5465.84650885],
           [5466.67677416],
           [5475.32801939],
           [5467.92008814],
           [5456.63928719],
           [5465.22307824],
           [5451.77127868],
           [5523.52610545],
           [5469.57396647],
           [5470.19303406],
           [5467.50592748],
           [5479.81189483],
           [5470.19303406],
           [5460.42246055],
           [5499.40142812],
           [5460.00324939]])



.. GENERATED FROM PYTHON SOURCE LINES 791-794

Vps Comparison
--------------    
The real Vp values allow us to evaluate the quality of different procedures that estimate Vp, such as different imputation algorithms, the Gardner empirical formula, and the simpliest ML regression model.

.. GENERATED FROM PYTHON SOURCE LINES 796-797

Datos y títulos a utilizar en el bucle

.. GENERATED FROM PYTHON SOURCE LINES 797-845

.. code-block:: Python

    plt.figure(figsize=(9, 15))

    # Histograms
    plt.subplot(7, 1, 1)
    plt.grid(zorder=2)
    plt.hist(features2.Vp, bins=20, zorder=3)
    plt.xlim(2500, 7000)
    plt.xlabel('m/s')
    plt.title('Real Vp')

    plt.subplot(7, 1, 2)
    box = plt.boxplot(features2.Vp.dropna(), vert=False, showmeans=True, meanline=True, patch_artist=False)
    mean_line = box['means'][0]  # Obtener la línea de la media
    median_line = box['medians'][0]  # Obtener la línea de la mediana
    plt.legend([mean_line, median_line], ['Mean', 'Median'], loc='upper left')
    plt.xlim(2500, 7000)
    plt.title('Real Vp')
    plt.xlabel('m/s')
    plt.grid()

    plt.subplot(7, 1, 3)
    plt.grid(zorder=2)
    plt.hist(gard_vp, label='Gardner Vp', zorder=4)
    plt.hist(nii_vp, label='NII Imputer Vp', zorder=7)
    plt.hist(lr_vp, label='LR Vp', zorder=9)
    plt.hist(nlf_vp, label='NLF Vp', zorder=10)
    plt.xlim(2500, 7000)
    plt.xlabel('m/s')
    plt.legend()
    plt.title('Estimated Vps')

    # Boxplots
    data_list = [gard_vp, nii_vp, lr_vp, nlf_vp]
    titles = ['Gardner Vp', 'NII Inputer Vp', 'LR Vp', 'NLF Vp']
    means = [np.mean(features2.Vp.dropna()), np.mean(gard_vp), np.mean(nii_vp), np.mean(lr_vp), np.mean(nlf_vp)]

    for i in range(len(data_list)):
        plt.subplot(7, 1, i + 4)
        plt.boxplot(data_list[i], vert=False, showmeans=True, meanline=True)
        vp_change = 100 * (data_list[i].mean() - features2.Vp.mean()) / features2.Vp.mean()
        plt.text(means[i] - 500, 1.25, 'Mean: {:.0f}, % change: {:.1f}'.format(means[i], vp_change))
        plt.xlim(2500, 7000)
        plt.title(titles[i])
        plt.xlabel('m/s')
        plt.grid()

    plt.tight_layout()




.. image-sg:: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_011.png
   :alt: Real Vp, Real Vp, Estimated Vps, Gardner Vp, NII Inputer Vp, LR Vp, NLF Vp
   :srcset: /examples/05_petrophysics/images/sphx_glr_02_no_2_filling_gaps_011.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 846-924

Observations
------------

Here are some observations related to the tasks covered in this notebook:

* Anomalies represent 0.5% (18) of the data, and after removing them we reach a total of 4.3% of NaN (142).
* The metrics favor the nii imputer. However, the knn, ii, and nknn imputer provided similar results. In the end, the dataset without anomalies (features2) was filled with the nii imputer (features3) because it was the best option.
* In addition to the correlation matrix (described in the previous notebook 1/3), it is important to explore the covariance matrix because it provides valuable information regarding the scale-dependent relationships between variables, reflecting how changes in one variable are associated with changes in another in terms of their actual units.
* The poor correlation (low covariance) between Den and Vp was not realized until Gardner's Vp was calculated. This gave negative :math:`R^2`, even after filtering part of the data outside the increasing trend.
* Filtering the Den and Vp values outside the increasing trend allows us to obtain positive but very small :math:`R^2` for the LR and the NLF ML algorithms.

Next Step
---------

The next and last notebook (3/3), after the imputation, is going to focus on the prediction of an entire variable (GR), missing in one of the available boreholes.

Reference
---------

`Gardner's Equation <https://wiki.seg.org/wiki/Dictionary:Gardner%E2%80%99s_equation>`_

Annex - Regression Metrics
--------------------------

These metrics help you understand different aspects of prediction accuracy and are critical for evaluating and comparing regression models. The main metrics are:

Sum of Residuals
~~~~~~~~~~~~~~~~

The Sum of Residuals is the sum of errors between the predicted and the actual values. Ideally, it should add up to 0.

.. math::
  \text{Sum of Residuals} = \sum_{i=1}^{n} (y_i - \hat{y}_i)

Where:

- :math:`y_i` is the real value.
- :math:`\hat{y}_i` is the predicted value.

MAE - Mean Absolute Error
~~~~~~~~~~~~~~~~~~~~~~~~~

The MAE measures the average of the absolute errors between the predicted and the actual values. It is an intuitive and easy-to-interpret metric.

.. math::
  \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

Where:

- :math:`n` is the number of observations.

MSE - Mean Squared Error
~~~~~~~~~~~~~~~~~~~~~~~~

The MSE measures the average of the squared errors between the predicted values and the actual values. It penalizes large errors more than the MAE.

.. math::
  \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

RMSE - Root Mean Squared Error
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The RMSE is the square root of the MSE. It is expressed in the same units as the output variable, which makes it more interpretable.

.. math::
   \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{\text{MSE}}

Coefficient of Determination or Score (:math:`R^2`)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :math:`R^2` measures the proportion of the variance of the dependent or predicted variable against the independent variables. Normally it ranges from 0 (not fit at all) to 1 (perfect fit). A rare negative value indicates the prediction model is worse than a prediction with the mean.

.. math::
   R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

Where:

- :math:`\bar{y}` is the mean of the real values :math:`y_i`.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.429 seconds)


.. _sphx_glr_download_examples_05_petrophysics_02_no_2_filling_gaps.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 02_no_2_filling_gaps.ipynb <02_no_2_filling_gaps.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 02_no_2_filling_gaps.py <02_no_2_filling_gaps.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_