1. Introduction & Overview
The integration of photovoltaic (PV) solar power into industrial processes is a key strategy for reducing greenhouse gas emissions and enhancing sustainability. However, the inherent intermittency and variability of solar energy pose significant challenges for grid stability and reliable energy supply. Accurate short-term prediction of PV power generation is therefore critical for effective energy management, load balancing, and operational planning.
This paper presents a novel machine learning framework for 1-hour ahead solar power prediction. The core innovation lies in its two-stage approach: first, expanding the original feature set into a higher-dimensional space using Chebyshev polynomials and trigonometric functions; second, employing a tailored feature selection scheme coupled with constrained linear regression to build weather-specific predictive models. The proposed method aims to capture complex, non-linear relationships between meteorological variables and power output more effectively than standard models.
2. Methodology
2.1 Data & Input Features
The model utilizes historical time-series data encompassing both PV system output and relevant environmental factors. Key input features include:
- Autoregressive Term: The solar power generation from the previous 15-minute interval.
- Weather Conditions: Categorical data (e.g., clear, cloudy, rainy).
- Meteorological Variables: Temperature, dew point, humidity, and wind speed.
- Temporal Features: Implicitly considered through the time-series nature of the data.
2.2 Feature Construction with Chebyshev Polynomials
To model potential non-linearities, the original feature vector $\mathbf{x}$ is transformed into a higher-dimensional space. Each continuous input feature $x_i$ is first scaled to the interval $[-1, 1]$, on which Chebyshev polynomials are defined, and a set of Chebyshev polynomials of the first kind $T_k(x_i)$ is generated up to a specified degree $K$. The Chebyshev polynomial of degree $k$ is defined recursively:
$T_0(x) = 1$
$T_1(x) = x$
$T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x)$
Trigonometric functions (sine and cosine) of the features are also added to capture periodic patterns. This construction creates a rich, expressive feature space $\Phi(\mathbf{x})$ capable of representing complex functional relationships.
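As an illustration, the expansion described above can be sketched in Python. The polynomial degree, the $\pi$ scaling inside the trigonometric terms, and the assumption that features are pre-scaled to $[-1, 1]$ are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def chebyshev_features(x, degree):
    """Chebyshev polynomials T_0..T_degree of x via the recurrence
    T_{k+1}(x) = 2*x*T_k(x) - T_{k-1}(x). Assumes x is scaled to [-1, 1]."""
    feats = [np.ones_like(x), x]
    for _ in range(2, degree + 1):
        feats.append(2 * x * feats[-1] - feats[-2])
    return feats[: degree + 1]

def expand_features(X, degree=3):
    """Expand each column of X (n_samples, n_features) with Chebyshev
    polynomials of degrees 1..degree plus sine and cosine terms."""
    cols = []
    for j in range(X.shape[1]):
        x = X[:, j]
        cols.extend(chebyshev_features(x, degree)[1:])  # skip constant T_0
        cols.append(np.sin(np.pi * x))
        cols.append(np.cos(np.pi * x))
    return np.column_stack(cols)

X = np.random.uniform(-1, 1, size=(100, 4))
Phi = expand_features(X, degree=3)
print(Phi.shape)  # (100, 20): 4 features x (3 polynomials + 2 trig terms)
```

With degree 3, each original feature contributes five derived columns, so even a handful of inputs quickly produces the rich space $\Phi(\mathbf{x})$ the text describes.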
2.3 Feature Selection & Constrained Regression
Not all constructed features are relevant. A wrapper-based feature selection method is employed to identify the most predictive subset for different weather conditions. Subsequently, a constrained linear regression model is fit:
$\min_{\beta} \| \mathbf{y} - \Phi(\mathbf{X})\beta \|_2^2$
subject to constraints on the coefficients $\beta$ (e.g., non-negativity constraints if physical relationships dictate that certain inputs should only positively influence output). This step ensures model parsimony and physical interpretability while maintaining accuracy.
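A minimal sketch of such a sign-constrained fit, using SciPy's bounded least-squares solver on synthetic data. The non-negativity bound mirrors the example constraint mentioned above; the paper's full constraint set is not specified here:

```python
import numpy as np
from scipy.optimize import lsq_linear

# Synthetic design matrix Phi (stand-in for expanded features) and target y
rng = np.random.default_rng(0)
Phi = rng.uniform(-1, 1, size=(200, 5))
true_beta = np.array([1.5, 0.0, 2.0, 0.7, 0.3])
y = Phi @ true_beta + 0.05 * rng.standard_normal(200)

# Constrained least squares: all coefficients forced non-negative
res = lsq_linear(Phi, y, bounds=(0.0, np.inf))
beta_hat = res.x
print(np.round(beta_hat, 2))
```

Per-coefficient bounds (e.g., a non-positive bound on a cloud-cover term) can be passed as arrays instead of scalars.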
3. Experimental Results & Analysis
3.1 Performance Metrics
The primary metric for evaluation is the Mean Squared Error (MSE) between the predicted and actual 1-hour ahead PV power output. Lower MSE indicates higher predictive accuracy.
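For reference, the metric is simply the average of squared prediction errors:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted power output."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

print(mse([150.0, 160.0, 155.0], [152.0, 158.0, 151.0]))  # 8.0
```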
Performance Summary:
- Proposed Method: Achieved the lowest MSE across all test scenarios.
- Key Advantage: Superior performance under diverse weather conditions, particularly during transient periods (e.g., passing clouds).
3.2 Comparison with Baseline Models
The proposed framework was benchmarked against several classical machine learning models:
- Support Vector Machine (SVM) / Support Vector Regression (SVR)
- Random Forest (RF)
- Gradient Boosting Decision Tree (GBDT)
Result: The Chebyshev-based feature construction and selection approach consistently yielded lower MSE than all baseline models. This demonstrates the efficacy of explicitly engineering a high-dimensional feature space tailored to the solar forecasting problem, compared to relying solely on the inherent feature combination capabilities of ensemble tree methods or kernel tricks in SVM.
4. Technical Details & Mathematical Framework
The model can be summarized as a function $f$ mapping inputs to the 1-hour ahead prediction $\hat{P}_{t+1}$:
$\hat{P}_{t+1} = f(\mathbf{x}_t) = \beta_0 + \sum_{j \in S} \beta_j \phi_j(\mathbf{x}_t)$
where:
- $\mathbf{x}_t$ is the feature vector at time $t$.
- $\{\phi_j\}$ are the selected basis functions from the Chebyshev/trigonometric expansion.
- $S$ is the set of indices selected by the feature selection algorithm.
- $\beta$ are the coefficients estimated via constrained least squares.
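The prediction equation above maps directly to code. The two basis functions and all coefficients below are hypothetical, chosen only to show the structure $\hat{P}_{t+1} = \beta_0 + \sum_{j \in S} \beta_j \phi_j(\mathbf{x}_t)$:

```python
import numpy as np

def predict(x_t, basis_funcs, selected, beta0, beta):
    """1-hour-ahead prediction: beta0 plus the weighted sum of the
    selected basis functions phi_j evaluated at the feature vector x_t."""
    return beta0 + sum(b * basis_funcs[j](x_t) for j, b in zip(selected, beta))

# Hypothetical two-term model: beta0 + b1*x0 + b2*T_2(x1), with T_2(u) = 2u^2 - 1
basis = [lambda x: x[0], lambda x: 2 * x[1] ** 2 - 1]
p_hat = predict(np.array([0.6, 0.2]), basis, selected=[0, 1],
                beta0=0.1, beta=[1.0, 0.5])
print(round(p_hat, 3))  # 0.24
```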
5. Analysis Framework: A Non-Code Example
Consider a simplified scenario for predicting power at noon on a partly cloudy day. The framework's workflow is:
- Input: Features at 11:45 AM: Power=150 kW, Temperature=25°C, Humidity=60%, Cloud Cover Index=0.5 (partly cloudy).
- Feature Construction: Scale each continuous feature to $[-1, 1]$ (the interval on which Chebyshev polynomials are defined), then create new features such as $T_2(\tilde{T}) = 2\tilde{T}^2 - 1$ for the scaled temperature $\tilde{T}$, $\sin(\text{Humidity})$, and the interaction $\text{Cloud Cover} \cdot T_1(\tilde{T})$. This might generate 20+ derived features.
- Feature Selection (for the "Partly Cloudy" model): The wrapper method identifies that only 5 of these features are critical for prediction under these conditions, e.g., $P_{t-1}$, $T_2(\tilde{T})$, $\text{Cloud Cover}$, $\sin(\text{Humidity})$, and an interaction term.
- Constrained Prediction: The "Partly Cloudy"-specific regression model, using only the 5 selected features and their pre-learned coefficients (with the constraint that the cloud-cover coefficient is non-positive), produces the prediction $\hat{P}_{12:00} = 165\ \text{kW}$.
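The walkthrough above can be reproduced numerically. Every number below that is not stated in the text (the assumed site temperature range, all learned coefficients) is illustrative, so the computed value differs from the 165 kW quoted above:

```python
import numpy as np

# Inputs at 11:45 AM, from the worked example in the text
power_prev, temp_c, humidity, cloud = 150.0, 25.0, 0.60, 0.5

# Scale temperature to [-1, 1] over an ASSUMED site range of -10..50 C
t = 2 * (temp_c - (-10)) / (50 - (-10)) - 1

# Derived features: lagged power, Chebyshev T_2 of scaled temp,
# cloud cover, a trig term, and one interaction term
features = {
    "power_prev": power_prev,
    "T2_temp": 2 * t**2 - 1,
    "cloud": cloud,
    "sin_humidity": np.sin(np.pi * humidity),
    "cloud_x_temp": cloud * t,
}

# ILLUSTRATIVE "Partly Cloudy" coefficients; cloud coefficient kept <= 0
beta0 = 40.0
beta = {"power_prev": 0.9, "T2_temp": 5.0, "cloud": -20.0,
        "sin_humidity": 3.0, "cloud_x_temp": 2.0}

p_hat = beta0 + sum(beta[k] * v for k, v in features.items())
print(round(p_hat, 1))  # ~163.3 kW with these made-up coefficients
```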
6. Future Applications & Research Directions
- Hybrid Physics-ML Models: Integrating the proposed data-driven approach with physical PV performance models (like those from NREL's System Advisor Model) could enhance robustness and extrapolation capability.
- Probabilistic Forecasting: Extending the framework to output prediction intervals (e.g., via quantile regression on the selected features) is crucial for risk-aware grid operations.
- Edge Computing for Distributed PV: Deploying lightweight versions of the feature selection and regression models on edge devices at individual solar farms for real-time, localized forecasting.
- Transfer Learning Across Climates: Investigating how feature sets selected for one geographic region can be adapted or fine-tuned for another with different weather patterns.
- Integration with Deep Learning: Using the selected Chebyshev features as informative inputs to a recurrent neural network (RNN) or transformer model to capture long-term temporal dependencies beyond one hour.
7. References
- Yang, Y., Mao, J., Nguyen, R., Tohmeh, A., & Yeh, H. G. (Year). Feature Construction and Selection for PV Solar Power Modeling. Journal/Conference Name.
- Mellit, A., & Pavan, A. M. (2010). A 24-h forecast of solar irradiance using artificial neural network: Application for performance prediction of a grid-connected PV plant at Trieste, Italy. Solar Energy, 84(5), 807-821.
- National Renewable Energy Laboratory (NREL). (2023). Solar Forecasting. https://www.nrel.gov/grid/solar-forecasting.html
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer. (For foundations on feature expansion and regularization).
- Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1125-1134). (Cited as an example of a transformative framework in another ML domain, analogous to the feature construction approach here).
8. Analyst's Perspective: Core Insight & Critique
Core Insight: This paper's real contribution isn't just another solar forecast model; it's a disciplined, two-step feature engineering protocol that decouples representation learning from model fitting. By explicitly constructing a high-dimensional Chebyshev space, it forces the model to consider specific non-linear and interaction terms that black-box models like GBDT might stumble upon inefficiently or not at all. It's a move from "hope the algorithm finds it" to "architect the space where the signal lives." This is reminiscent of the philosophy behind successful frameworks in other fields, like the carefully designed generator/discriminator architectures in CycleGAN that structure the learning problem for unpaired image translation.
Logical Flow: The logic is sound and elegant: 1) Acknowledge the complex, non-linear physics of solar generation. 2) Don't just throw raw data at a non-linear model; instead, systematically expand the input space with mathematically justified basis functions (Chebyshev polynomials are excellent for approximation). 3) Use a wrapper method for feature selection—a computationally expensive but targeted approach—to prune this space down to a weather-condition-specific, interpretable subset. 4) Apply constrained regression to inject physical prior knowledge (e.g., "more clouds cannot produce more power"). This pipeline is more principled than the typical "grid-search-over-hyperparameters" approach applied to off-the-shelf ML models.
Strengths & Flaws:
Strengths: The method achieves superior MSE, proving its empirical value. The weather-specific modeling is pragmatic. The use of constraints adds a layer of robustness and interpretability often missing in pure ML approaches. It's a great example of "glass-box" ML for engineering systems.
Flaws: The computational cost of the wrapper-based feature selection for each weather type is a major bottleneck for real-time adaptation or large-scale deployment. The paper lacks a discussion on the stability of the selected feature sets—do they change wildly with slightly different training data? Furthermore, while beating SVR, RF, and GBDT is good, a comparison against a well-tuned deep learning model (e.g., an LSTM or Temporal Fusion Transformer) or a sophisticated gradient boosting implementation like XGBoost with its own feature interaction capabilities is a glaring omission in 2023+ research.
Actionable Insights: For industry practitioners, this paper is a blueprint for building more reliable, site-specific forecast models. The immediate takeaway is to invest in feature engineering infrastructure before jumping to complex algorithms. Start by implementing this Chebyshev expansion pipeline on your historical data. However, for operational systems, replace the wrapper method with a more scalable filter method (like mutual information) or embedded method (like LASSO regression) for feature selection to reduce computational overhead. Partner with domain experts to define the most critical physical constraints for the regression. This hybrid, thoughtful approach will likely yield better returns than simply renting a larger cloud instance to train a bigger neural network.
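A sketch of the embedded alternative suggested above: LASSO-based selection on a synthetic expanded feature matrix, using scikit-learn. The data and the sparsity pattern are fabricated purely to demonstrate the mechanics:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic expanded feature matrix: 20 candidate features, only 3 informative
rng = np.random.default_rng(1)
Phi = rng.uniform(-1, 1, size=(500, 20))
beta_true = np.zeros(20)
beta_true[[0, 3, 7]] = [2.0, -1.5, 1.0]
y = Phi @ beta_true + 0.1 * rng.standard_normal(500)

# Cross-validated LASSO shrinks irrelevant coefficients toward zero;
# the surviving indices play the role of the selected set S
lasso = LassoCV(cv=5).fit(Phi, y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-3)
print(selected)  # should include the informative indices 0, 3, 7
```

Unlike the wrapper method, this runs in a single fit per weather regime, which is what makes it attractive for the operational setting described above.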