Random Forest missing values in cases where the variables do not apply
SOME BACKGROUND
I am training a Random Forest regressor to predict crop yield. Some of my predictor variables apply only to some cases, e.g. I have a variable denoting the number of rows, which only applies to crops grown in a polytunnel. If the crops are grown in a glasshouse, the number of rows does not apply, so it is left as a null value. I also have another variable which denotes whether the crop is grown under a polytunnel or in a glasshouse.
THE PROBLEM
As Random Forest does not handle missing values, is there a strategy for variables that are null because they do not apply to a given case? Tutorials and papers on the topic suggest imputing the values, but in the scenarios they consider the variables still apply and are missing because of some external factor (e.g. rich people don't generally want to reveal their salaries).
machine-learning null regression missing-data imputation
Yes, the best way to approach the problem is to give those cases a special value. For example, if the number of rows for the polytunnel crops lies in [0, 100], assign -1 to all the glasshouse samples. Ideally, the tree will then use the polytunnel/glasshouse variable to split the data. The polytunnel data will be evaluated according to the number of rows, while in the glasshouse branch the number of rows will be ignored since it is constant.
– Roberto
Nov 13 '18 at 13:54
Thank you for the answer - I have now applied your method to my data. My only worry is whether it will actually split on glasshouse/polytunnel - for all I know, random forest might decide to use the number of rows first, in which case the -1 fill values will have interesting consequences. I recognise this depends on the underlying data, so as long as I am taking the best approach in the current circumstances, I am happy!
– Bodwin
Nov 14 '18 at 8:49
That is a fair point. I suggest checking what happens by plotting the tree structure. If you have a small dataset, you could try computing the entropy/Gini values to check manually what happens. I will post the comment as an answer.
– Roberto
Nov 14 '18 at 8:52
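The manual check suggested in the comment above can be sketched in plain Python. Since this is a regressor, the sketch assumes squared error as the split criterion rather than entropy or Gini, which are classification criteria. The data, the column layout `[is_polytunnel, n_rows]` and the helper names `sse` and `best_split` are all made up for illustration.

```python
# Toy data (hypothetical): columns are [is_polytunnel, n_rows],
# with -1 as the fill value for glasshouse rows.
X = [[1, 10], [1, 20], [1, 30], [0, -1], [0, -1], [0, -1]]
y = [300.0, 400.0, 500.0, 700.0, 710.0, 720.0]


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y, feature):
    """Return (threshold, total_sse) for the best binary split on one feature."""
    distinct = sorted({row[feature] for row in X})
    best = (None, sse(y))  # no split: error of the parent node
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [t for row, t in zip(X, y) if row[feature] <= threshold]
        right = [t for row, t in zip(X, y) if row[feature] > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


split_on_structure = best_split(X, y, feature=0)
split_on_rows = best_split(X, y, feature=1)
print("structure:", split_on_structure)
print("n_rows:   ", split_on_rows)
```

On this toy data both features reach the same squared error at their best split: the threshold 4.5 on the row count falls between the -1 fill value and the smallest real count, so it separates exactly the glasshouse rows, just as the structure variable does.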
asked Nov 13 '18 at 9:09
Bodwin
1 Answer
The best way to approach the problem is to give those cases a special value.
For example, if the number of rows for the polytunnel crops lies in [0, 100], assign -1 to all the glasshouse samples.
Ideally, the tree will then use the polytunnel/glasshouse variable to split the data. The polytunnel data will be evaluated according to the number of rows, while in the glasshouse branch the number of rows will be ignored since it is constant.
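A minimal sketch of that encoding in plain Python, assuming hypothetical field names (`structure`, `n_rows`, `yield_kg`) and made-up values; the fill value only needs to lie outside the valid range of the row count:

```python
# Hypothetical records; None marks "number of rows does not apply".
records = [
    {"structure": "polytunnel", "n_rows": 12, "yield_kg": 340.0},
    {"structure": "polytunnel", "n_rows": 30, "yield_kg": 610.0},
    {"structure": "glasshouse", "n_rows": None, "yield_kg": 480.0},
    {"structure": "glasshouse", "n_rows": None, "yield_kg": 505.0},
]

SENTINEL = -1  # outside the valid range [0, 100] of n_rows


def encode(record):
    """Return the numeric feature vector [is_polytunnel, n_rows] for one record."""
    is_polytunnel = 1 if record["structure"] == "polytunnel" else 0
    n_rows = record["n_rows"]
    if n_rows is None:
        # "Not applicable" is only expected for glasshouse crops; a null on a
        # polytunnel record would be a genuinely missing value instead.
        if is_polytunnel:
            raise ValueError("n_rows is missing for a polytunnel record")
        n_rows = SENTINEL
    return [is_polytunnel, n_rows]


X = [encode(r) for r in records]
y = [r["yield_kg"] for r in records]
print(X)
```

The resulting `X` and `y` contain no nulls, so they can be passed to a random forest implementation that rejects missing values. Raising on a null polytunnel row keeps "does not apply" separate from "was not recorded", which is the distinction the question draws.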
answered Nov 14 '18 at 8:57
Roberto