Filling in the Gaps: AI-Powered Data Imputation Using Autoencoders

All Python code is provided in a Jupyter notebook so that you can easily replicate our work. Your results may vary due to the random nature of the models.

Missing data is a fact of life in real-world machine learning tasks. Perhaps a sensor failed during data collection, or a user left certain questions unanswered—whatever the cause, incomplete records can disrupt your model-building pipeline. The process of filling in these gaps with plausible estimates—known as imputation—is crucial to preserve as much information as possible, prevent biased analyses, and keep machine learning workflows smooth.

In this article, we’ll look at autoencoders, a type of deep learning model from the AI toolbox, and see how they can outperform more traditional imputation techniques such as random forest imputation.

Along the way, we’ll show real results illustrating how much better the autoencoder approach can be when reconstructing missing features.

Why Imputation Matters

Imagine you have a dataset with hundreds of columns—each representing a feature describing, say, housing data or customer behavior. Your analysis or machine learning model depends on these features being present and accurate. Unfortunately, in many real-world settings:

• Some sensors might go offline.

• Some survey questions might be left blank.

• Some measurements might be corrupted or simply not recorded.

If you discard rows with missing data, you reduce the size of your dataset—sometimes drastically—which can rob your model of valuable information. Imputation is about filling in these blank spaces with reasonable estimates so that no row has to be thrown out. While there are many ways to do this (e.g., mean, median, nearest-neighbor, or random forest imputation), we found that an autoencoder-based approach can be an even more accurate imputation strategy.
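For context, the simpler baselines mentioned above are available off the shelf; here is a minimal scikit-learn sketch (the toy columns are hypothetical, not our housing data).

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy frame with a few gaps; columns are illustrative only.
df = pd.DataFrame({
    "feature_a": [8.3, np.nan, 5.6, 7.1],
    "feature_b": [41.0, 21.0, np.nan, 34.0],
})

mean_filled = SimpleImputer(strategy="mean").fit_transform(df)
median_filled = SimpleImputer(strategy="median").fit_transform(df)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)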

What Are Autoencoders?

An autoencoder is a special type of neural network designed to learn a compressed representation of its input data, also known as an “embedding”. What makes autoencoders unique is that the input and the output they predict are the same data. This is by design: The network’s goal is to reconstruct its input as closely as possible, even though it first compresses the input into a bottleneck layer.

Conceptually, an autoencoder has two main parts:

  1. Encoder: This “downward” section reduces the input’s dimensionality, forcing the data through a lower-dimensional bottleneck (the latent space or embedding).

  2. Decoder: This “upward” section attempts to reconstruct the original input from that compressed embedding.

By requiring the network to reproduce its original input through a smaller set of neurons, the autoencoder effectively learns the most important patterns or features needed for reconstruction. In other words, it develops an internal representation that captures the essential structure of the data, often revealing informative, lower-dimensional features that can be used for various tasks—like imputation.

Figure 1: This diagram depicts a fully connected autoencoder network. Each circle (node) represents a neuron in a given layer, with lines (edges) denoting the weights that connect every neuron from one layer to the next. The encoder (left-to-middle) compresses the input features into a bottleneck (latent space), while the decoder (middle-to-right) reconstructs the original input from that condensed representation. Because the output and input match, the network learns the essential patterns underlying the data, enabling tasks like imputation of missing values.

Why Autoencoders for Imputation?

For imputation, we leverage the fact that an autoencoder is great at reconstructing its input. When some features in a row are missing:

  1. We feed the partial data row into the encoder, filling in missing values with placeholders (like a constant value) or using a learned approach.

  2. The encoder transforms that row into a low-dimensional embedding, drawing on patterns it’s learned across all the data.

  3. The decoder then reconstructs the full row—including estimates for the previously missing values.

Comparing Autoencoder to Random Forest Imputation

Random forest imputation is a widely used technique because it leverages the predictive power of decision tree ensembles to estimate missing values. The method works by training a random forest model to predict the missing values of a feature using the other available features as inputs. This approach is particularly effective because random forests can handle complex, non-linear relationships between variables and are robust to overfitting due to their ensemble nature. Random forest imputation is also versatile, accommodating both categorical and numerical data. Its popularity stems from its ease of use, availability in many machine learning libraries, and ability to produce reasonable imputation results without requiring extensive tuning or specialized preprocessing. However, as we demonstrate in this article, random forest imputation can struggle when feature relationships are intricate or when the dataset lacks sufficient data to capture the subtleties of those relationships, situations where autoencoders may excel.
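For readers who want a concrete starting point, one common way to build a random forest imputer is scikit-learn’s IterativeImputer wrapped around a RandomForestRegressor. The sketch below is a minimal illustration with a tiny made-up matrix, not the exact implementation used in our experiments.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Tiny illustrative matrix; np.nan marks the missing entries.
X_with_nans = np.array([
    [8.3, 41.0, 6.9],
    [8.3, np.nan, 6.2],
    [np.nan, 52.0, 8.3],
    [5.6, 52.0, np.nan],
])

# Round-robin imputation: each feature with missing values is modeled as a
# function of the remaining features using a random forest.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_filled = rf_imputer.fit_transform(X_with_nans)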

Autoencoder Training on Housing Data

To demonstrate the power of autoencoder-based imputation, we experimented with a housing dataset containing critical features like median income, average room counts, and geographical coordinates. These variables capture diverse aspects of each neighborhood—spanning economic status (e.g., MedInc), demographics (Population, AveOccup), living conditions (HouseAge, AveRooms, AveBedrms), and location (Latitude, Longitude). 

Our goal was to train an autoencoder that learns to reconstruct these eight normalized features from a condensed latent representation, thereby enabling accurate imputation for any missing entries. 

Below, we detail the network architecture, training procedure, and validation results.

We feed the housing features into the network’s input layer. The encoder then progressively reduces these inputs through fully connected (Dense) layers of 256, 128, 64, and finally 32 neurons, each followed by LayerNormalization and Dropout to stabilize training and prevent overfitting. The decoder mirrors this process in reverse, expanding from 32 back up to 256 neurons before outputting the reconstructed data. A custom NonNegativeOutputLayer forces nonnegative predictions for features that cannot be negative (e.g., median income and population), while leaving Latitude and Longitude unconstrained. We normalize the input features so that all variables share similar numeric scales, allowing the neural network (and its optimizer) to converge more smoothly and avoid numerical instabilities during training. The training and validation loss curves are shown in Fig. 2.
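For concreteness, here is a minimal Keras sketch of the architecture described above. The layer widths (256, 128, 64, 32) and the use of LayerNormalization and Dropout come from the description; the dropout rate, activations, column ordering, and the exact behavior of NonNegativeOutputLayer are assumptions, not a copy of our notebook code.

import tensorflow as tf
from tensorflow.keras import layers, Model

N_FEATURES = 8      # MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
LATENT_DIM = 32     # bottleneck size from the description above
DROPOUT_RATE = 0.1  # assumed; the rate is not stated above

class NonNegativeOutputLayer(layers.Layer):
    """Stand-in for the custom output layer: clip features that cannot be
    negative, assuming Latitude and Longitude are the last two columns."""
    def call(self, x):
        nonneg = tf.nn.relu(x[:, :6])
        unconstrained = x[:, 6:]
        return tf.concat([nonneg, unconstrained], axis=1)

def dense_block(x, units):
    x = layers.Dense(units, activation="relu")(x)
    x = layers.LayerNormalization()(x)
    return layers.Dropout(DROPOUT_RATE)(x)

inputs = layers.Input(shape=(N_FEATURES,))
x = inputs
for units in (256, 128, 64):
    x = dense_block(x, units)
latent = dense_block(x, LATENT_DIM)   # bottleneck / embedding
x = latent
for units in (64, 128, 256):
    x = dense_block(x, units)
outputs = NonNegativeOutputLayer()(layers.Dense(N_FEATURES)(x))

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, latent)       # used later to inspect embeddings
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_train_norm, X_train_norm, validation_split=0.1,
#                 epochs=100, batch_size=64)  # X_train_norm: normalized training features (hypothetical name)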

Figure 2: This plot displays training and validation loss (mean squared error, MSE). Both curves steadily slope downward, indicating that the autoencoder is successfully learning to reconstruct the housing features. Near the final epochs the training loss begins to plateau, suggesting the model has reached a satisfactory level of training and is no longer significantly improving. The modest gap between training and validation curves indicates that overfitting is relatively contained, as the validation loss follows a similar downward trajectory.

With our autoencoder trained, we can infer the latent embedding for any house (in or out of the training set). In Fig. 3, we first show a small slice of the original, unnormalized housing data, including columns like MedInc, Population, and Longitude.  Below it, we display the corresponding embeddings generated by our autoencoder’s bottleneck layer. 

Notice how each row is transformed into a compact vector of latent values, reflecting the essential structure that the model has learned to retain.

Figure 3: Here, we display three sample rows of our original housing data (with 8 features like MedInc and HouseAge) alongside their latent embeddings in the autoencoder’s 32-dimensional bottleneck layer. Although it might seem counterintuitive to have more bottleneck dimensions than the original feature count, these extra nodes enable the network to learn nonlinear transformations that better capture relationships among the features, thereby promoting more accurate reconstructions and imputations when data is partially missing.
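Assuming the encoder model from the architecture sketch above and a normalized feature matrix (X_test_norm is a hypothetical name), extracting the embeddings for a few rows is straightforward:

# Assumes `encoder` from the sketch above; X_test_norm is a hypothetical
# normalized matrix of shape (n_rows, 8).
sample_rows = X_test_norm[:3]
embeddings = encoder.predict(sample_rows)   # shape: (3, 32)
print(embeddings.round(3))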

The Experiment: Comparing Autoencoder vs. Random Forest Imputation

We conducted an experiment on a housing dataset in which we deliberately introduced missing values for certain features. We then applied two different approaches to fill in the blanks:

  1. Autoencoder Imputation: Our AI approach, where we train a single autoencoder to reconstruct all features simultaneously.

  2. Random Forest Imputation: A popular method that uses a random forest to predict each missing feature from the remaining features.

To evaluate these approaches, we compare the imputed values to the actual (original) values. This is possible because we artificially masked known data, so we still have the “true” numbers to compare against. We then summarize the results using error statistics (e.g., mean absolute difference) to see which method does a better job. It is important to note that we only introduced missing values in a test set that is separate from the data used to train the autoencoder network. That way we aren't cheating by having the model impute data it has already seen.
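Our notebook handles the details, but the masking-and-scoring logic can be sketched as follows; the 20% masking rate and the variable names are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(42)

# X_test_norm: held-out, fully observed, normalized feature matrix (hypothetical name).
X_true = X_test_norm.copy()
mask = rng.random(X_true.shape) < 0.2    # hide roughly 20% of entries (assumed rate)

X_masked = X_true.copy()
X_masked[mask] = np.nan                  # np.nan for the random forest imputer; the autoencoder uses -1 (see below)

def error_summary(X_imputed):
    """Absolute differences on the artificially hidden entries only."""
    abs_err = np.abs(X_imputed[mask] - X_true[mask])
    return abs_err.mean(), np.median(abs_err)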

Handling Missing Values in the Autoencoder

Technically, the autoencoder cannot process “missing” entries. Therefore, we designate a named placeholder, MISSING, and set it to -1. We chose -1 because it lies outside the typical range of the normalized housing features (e.g., median income or population) while still having unit magnitude, which helps avoid numerical and optimization issues during training.

During inference, each row, including any MISSING placeholders, is passed through the encoder, which condenses the data into a low-dimensional latent representation. The decoder then tries to reconstruct the full row, and the MISSING placeholders are replaced with the decoded (imputed) values.

This works because, during training, the autoencoder sees only complete data; no placeholder values are present. It learns to map typical feature patterns (e.g., realistic housing measurements) to themselves through the encoder–decoder path. Once it is trained, if a row contains one or more MISSING placeholders, the network uses all of the other valid features, together with the correlations it has learned, to infer plausible numbers for the missing ones. Since -1 is far outside valid housing ranges, the decoder naturally replaces it with a coherent reconstruction aligned with the latent structure it discovered.
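Putting this together, a minimal sketch of the autoencoder imputation step might look like the following, where autoencoder is the trained model and X_masked holds np.nan for the hidden entries (variable names are illustrative):

import numpy as np

MISSING = -1.0   # placeholder value described above

def autoencoder_impute(X_masked, autoencoder):
    """Impute NaN entries with the trained autoencoder's reconstruction."""
    # Replace NaNs with the placeholder so the network can process every row.
    X_in = np.where(np.isnan(X_masked), MISSING, X_masked)
    # Encode and decode the full batch of rows.
    X_reconstructed = autoencoder.predict(X_in)
    # Keep observed values as-is; use the decoded values only where data was missing.
    return np.where(np.isnan(X_masked), X_reconstructed, X_masked)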

Results: Autoencoder vs. Random Forest

After applying both imputation methods (autoencoder and random forest), we compare them by calculating the absolute differences between the imputed and original values for each feature (see Figure 4). For all features, the autoencoder far outperforms the random forest, highlighting its effectiveness in capturing complex feature relationships and reconstructing missing values more faithfully.

Figure 4: Accuracy comparison of our tuned autoencoder and random forest imputation models, based on the absolute differences between the imputed and original values for each feature. Each pair of boxplots (one for the autoencoder and one for the random forest) shows the distribution of errors on a log scale, where the box spans the interquartile range (IQR) and the horizontal line inside the box marks the median error. The whiskers extend to capture more extreme values. A lower box/whisker in the log-scale error indicates smaller differences, thus greater accuracy in filling missing values.

The performance advantage of the autoencoder imputation model can be quantified by forming the ratio of the median absolute errors for each feature. Table 1 shows that the autoencoder imputation model is approximately 3-6 times more accurate than the random forest imputation model. Quite impressive!

Table 1: The ratio of the median absolute error of the random forest imputation model to that of the autoencoder imputation model. A value of 10 would mean that the autoencoder (AE) model’s median error was an order of magnitude smaller than the random forest (RF) model’s. The values in this table are rounded to one decimal place for simplicity.
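For completeness, the per-feature ratios in Table 1 can be computed along these lines; abs_err_rf and abs_err_ae are hypothetical containers of the absolute errors collected during the evaluation step.

import numpy as np
import pandas as pd

# abs_err_rf / abs_err_ae: dicts mapping feature name -> array of absolute errors
# on the masked entries for that feature (hypothetical names).
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

ratios = pd.Series({
    name: np.median(abs_err_rf[name]) / np.median(abs_err_ae[name])
    for name in feature_names
})
print(ratios.round(1))   # values above 1 favor the autoencoder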

Summary 

In this article, we explored the power of autoencoders as an advanced AI-driven solution for data imputation, demonstrating their ability to outperform a traditional method like random forest imputation. Using a housing dataset, we showed how autoencoders leverage a compressed latent representation to effectively reconstruct missing values, capturing complex feature relationships in ways that random forests struggle to replicate.

Through carefully designed experiments, we quantified the performance of both methods. The results were clear: the autoencoder consistently outperformed random forest, with median errors roughly 3 to 6 times smaller across the various features. This highlights the strength of deep learning models in learning the underlying structure of data, even when faced with incomplete information.

By handling missing entries with a designated placeholder during inference and leveraging the network’s ability to create a unified reconstruction, the autoencoder imputation model demonstrated both robustness and accuracy. These results illustrate not only the practical value of autoencoders for imputation but also their broader potential for tasks like feature extraction, denoising, and anomaly detection in real-world datasets.

Ultimately, this work underscores the importance of adopting modern AI techniques in tackling data challenges. Autoencoders offer a compelling alternative to traditional imputation methods, especially when dealing with datasets that exhibit intricate feature interdependencies. By embracing such innovative tools, we can drive more accurate analyses, more reliable models, and better decision-making across a wide range of applications.


Struggling with Missing Data?

With 20+ years of experience, our team of AI experts works closely with you to deliver impactful, lasting results. Using advanced techniques like autoencoders, we can tackle complex data challenges, fill the gaps in your datasets, and build robust machine learning workflows that drive reliable outcomes.

Let’s Solve Your Data Problems Together.
Schedule a free consultation today and discover how Xyonix can transform your data into actionable solutions:


Discover how the Xyonix Pathfinder process can help you identify opportunities, streamline operations, and deliver personalized experiences that leave a lasting impact.