This dataset has only one observation where weather = 4. Because weather is a categorical variable, R will throw an error if that observation ends up in the test data split: R expects each categorical variable in the test data to have the same levels it found in the training data split. Therefore, this observation must be removed.
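For readers who want to see the filtering logic outside Azure ML, here is a minimal sketch in plain Python. The rows are made-up sample values, and the helper name `drop_rare_levels` and its `min_count` threshold are assumptions for illustration only, not part of the experiment.

```python
from collections import Counter

# Made-up sample rows; only the column names mirror the bike rental dataset.
rows = [
    {"weather": 1, "count": 16},
    {"weather": 2, "count": 40},
    {"weather": 1, "count": 32},
    {"weather": 4, "count": 164},  # the lone weather = 4 observation
    {"weather": 3, "count": 13},
    {"weather": 2, "count": 1},
    {"weather": 3, "count": 8},
]


def drop_rare_levels(rows, column, min_count=2):
    """Keep only rows whose value in `column` occurs at least `min_count` times.

    Hypothetical helper: a level seen fewer than `min_count` times cannot
    appear in both the training and the test split.
    """
    level_counts = Counter(row[column] for row in rows)
    return [row for row in rows if level_counts[row[column]] >= min_count]


filtered = drop_rare_levels(rows, "weather")
remaining_levels = sorted({row["weather"] for row in filtered})
print(remaining_levels)  # [1, 2, 3]
```

Counting levels first, rather than hard-coding weather = 4, keeps the sketch usable if another rare level turns up in a different extract of the data.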
Before creating our random forest, we must identify the columns that add little to no value for predictive modeling so that they can be dropped.
Since we are predicting total count, the registered and casual bike rental columns must be dropped. Together, these two values add up to the total count, so a model that kept them would score well but be uninformative: it would simply sum them to recover the total count. One could train separate models to predict casual and registered bike rentals independently. Azure ML would make it very easy to include these models in our experiment after creating one for total count.
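A quick way to confirm this kind of target leakage is to check the sum row by row. The sketch below uses made-up sample values in plain Python; only the column names `casual`, `registered`, and `count` come from the dataset.

```python
# Made-up sample rows; column names mirror the bike rental dataset.
rows = [
    {"casual": 3, "registered": 13, "count": 16},
    {"casual": 8, "registered": 32, "count": 40},
    {"casual": 0, "registered": 1, "count": 1},
]

# If this holds for every row, casual and registered fully determine the target.
leaks_target = all(
    row["casual"] + row["registered"] == row["count"] for row in rows
)
print(leaks_target)  # True
```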
The third candidate for removal is the datetime column. Each observation has a unique date-time, so this column will just add noise to our model, especially since we have already extracted all the useful information from it (day of week, time of day, etc.).
Now that the dropped columns have been chosen, drag in the “Project Columns” module to drop datetime, casual, and registered. Launch the column selector and select “All columns” from the dropdown next to “Begin With.” Change “Include” to “Exclude” using the dropdown and then select the columns we are dropping.
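The module's "begin with all columns, then exclude" behavior can be sketched in plain Python as follows. This only illustrates the logic and is not how Azure ML implements it; the function name `exclude_columns`, the feature names `hour` and `weekday`, and the sample values are assumptions.

```python
# Columns chosen for removal in the text above.
EXCLUDED = {"datetime", "casual", "registered"}


def exclude_columns(rows, excluded):
    """Start from all columns and drop the excluded ones, preserving order."""
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


# Made-up sample row; "hour" and "weekday" stand in for the extracted features.
rows = [
    {
        "datetime": "2011-01-01 00:00:00",
        "hour": 0,
        "weekday": 6,
        "casual": 3,
        "registered": 13,
        "count": 16,
    }
]

projected = exclude_columns(rows, EXCLUDED)
print(list(projected[0].keys()))  # ['hour', 'weekday', 'count']
```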