Use tidymodels instead of base libraries
Certainly! The **tidymodels** framework in R provides a cohesive set of packages designed for modeling and machine learning. Instead of calling modeling functions such as `glm()` directly, you describe the data split, preprocessing, model, and evaluation through a consistent, pipe-friendly interface.
Here’s a simple example of how you can use **tidymodels** to build a machine learning pipeline. We will use the **iris** dataset to predict `Species` from the four flower measurements, following these steps:
1. Load necessary libraries.
2. Split the data into training and testing sets.
3. Preprocess the data.
4. Train a model (using multinomial regression, since `Species` has three classes).
5. Evaluate the model.
Here’s a complete example:
```R
# Load necessary libraries
library(tidymodels)
# Set the seed for reproducibility
set.seed(123)
# Load the iris dataset
data("iris")
# Split the data into training (80%) and testing (20%) sets
iris_split <- initial_split(iris, prop = 0.8, strata = Species)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
# Define a recipe for preprocessing
# We'll center and scale the numeric variables
iris_recipe <- recipe(Species ~ ., data = iris_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  # iris has no nominal predictors, so this step is a no-op here;
  # it is kept to show the usual pattern for categorical features
  step_dummy(all_nominal_predictors(), one_hot = TRUE)
# Specify the model (multinomial regression, because Species has three classes;
# logistic_reg() with the "glm" engine handles only binary outcomes)
multinom_model <- multinom_reg() %>%
  set_engine("nnet")
# Create a workflow to tie the recipe and model together
iris_workflow <- workflow() %>%
  add_recipe(iris_recipe) %>%
  add_model(multinom_model)
# Fit the model on the training set
iris_fit <- fit(iris_workflow, data = iris_train)
# Make predictions on the test set
iris_predictions <- predict(iris_fit, iris_test, type = "class") %>%
  bind_cols(iris_test)
# Evaluate the model's performance
iris_metrics <- iris_predictions %>%
  metrics(truth = Species, estimate = .pred_class)
# Show the metrics
print(iris_metrics)
# Confusion matrix for further evaluation
confusion_matrix <- iris_predictions %>%
  conf_mat(truth = Species, estimate = .pred_class)
# Print the confusion matrix
print(confusion_matrix)
```
### Explanation of the Code:
1. **Loading Libraries**: We first load the `tidymodels` library, which includes several packages for data wrangling, modeling, and evaluation.
2. **Data Splitting**: We split the iris dataset into training and testing datasets while stratifying by species to ensure balanced splits.
3. **Recipe**: We create a preprocessing recipe that normalizes the numeric predictors; the `step_dummy()` step shows how categorical predictors would be converted to dummy variables, though iris has none, so it does nothing here.
4. **Model Specification**: We specify a multinomial regression model with `multinom_reg()` and the `"nnet"` engine, because `Species` has three classes; a binary logistic regression (`logistic_reg()` with the `"glm"` engine) cannot fit a three-class outcome.
5. **Workflow**: We create a workflow by combining the recipe and model.
6. **Model Training**: We fit the model to the training dataset using the `fit()` function.
7. **Making Predictions**: We predict the species for the test dataset.
8. **Evaluation**: `metrics()` returns accuracy and Cohen's kappa for the class predictions, and `conf_mat()` produces a confusion matrix for a more detailed look at the errors. A sketch of probability-based metrics follows this list.
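If you also want probability-based metrics such as ROC AUC, you can predict class probabilities and pass them to `roc_auc()`. Below is a minimal sketch reusing the fitted workflow and test set from the example above; the `.pred_*` column names are the ones parsnip generates for the three `Species` levels:
```R
# Sketch: probability predictions and multiclass ROC AUC
# (reuses iris_fit and iris_test from the example above)
iris_prob_predictions <- predict(iris_fit, iris_test, type = "prob") %>%
  bind_cols(iris_test %>% select(Species))

# With three classes, roc_auc() uses the Hand-Till multiclass estimator by default
iris_prob_predictions %>%
  roc_auc(truth = Species, .pred_setosa, .pred_versicolor, .pred_virginica)
```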
### Notes:
- This is just a simple example; **tidymodels** also offers many other models, hyperparameter tuning, and preprocessing steps. The sketch below shows one way to swap a tuned random forest into the same workflow.
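As one illustration, here is a hedged sketch that replaces the model in the workflow with a random forest and tunes `mtry` by 5-fold cross-validation. It reuses `iris_workflow` and `iris_train` from the example above, assumes the **ranger** package is installed, and the grid values are illustrative only:
```R
# Sketch: replace the model in the existing workflow and tune it
# (reuses iris_workflow and iris_train; assumes the ranger package is installed)
rf_model <- rand_forest(mtry = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_workflow <- iris_workflow %>%
  update_model(rf_model)

# 5-fold cross-validation on the training set, stratified by Species
iris_folds <- vfold_cv(iris_train, v = 5, strata = Species)

# Try a small, illustrative grid of mtry values (iris has four predictors)
rf_results <- tune_grid(
  rf_workflow,
  resamples = iris_folds,
  grid = tibble(mtry = c(1, 2, 3, 4))
)

# Inspect the best-performing settings by accuracy
show_best(rf_results, metric = "accuracy")
```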
Make sure that you have `tidymodels` installed in your R environment. You can install it by running:
```R
install.packages("tidymodels")
```
You may also need engine packages for specific models; **tidymodels** does not always install these for you, but parsnip will tell you which engine package is missing when you try to use it.
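For example, if you try the random forest sketch above, the **ranger** engine package would need to be installed separately:
```R
# ranger is the engine package assumed by the random forest sketch above
install.packages("ranger")
```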