Random Forest missing values in cases where the variables do not apply



























SOME BACKGROUND



I am working on training a Random Forest regressor to predict crop yield. Some of my predictor variables apply only to some cases. For example, I have a variable denoting the number of rows, which only applies to crops grown in a polytunnel. If the crops are grown in a glasshouse, the number of rows does not apply, so it is left as a null value. I also have another variable which denotes whether the crop is grown in a polytunnel or a glasshouse.



THE PROBLEM



As Random Forest does not handle missing values, is there a strategy for dealing with variables that are null because they do not apply? Tutorials and papers on the topic suggest imputing the values, but in the scenarios they consider, the variables still apply and are missing because of some external factor (e.g. rich people generally don't want to reveal their salaries).

































  • Yes, the best way to approach the problem is to give those cases a special value. For example, if the number of rows for polytunnel crops ranges over [0, 100], assign -1 to all the glasshouse samples. Ideally, the tree will use the polytunnel/glasshouse variable to split the data. The polytunnel data will then be evaluated according to the number of rows, while the number of rows will be ignored for the glasshouse data since it is constant there.

    – Roberto
    Nov 13 '18 at 13:54











  • Thank you for the answer - I have now applied your method to my data. My only worry is whether it will actually split on glasshouse/polytunnel - for all I know, Random Forest might decide to split on the number of rows first, in which case the -1 fill values could have interesting consequences. I recognise this depends on the underlying data, so as long as I am taking the best approach in the current circumstances, I am happy!

    – Bodwin
    Nov 14 '18 at 8:49













  • That is a fair point. I suggest plotting the tree structure to check what happens. If you have a small dataset, you could compute the impurity values (entropy/Gini for a classifier, variance reduction for a regressor) to check manually what happens. I will post the comment as an answer.

    – Roberto
    Nov 14 '18 at 8:52
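
The manual check suggested above can be sketched roughly as follows. Since the model here is a regressor, the sketch scores candidate splits by variance reduction rather than entropy or Gini. The toy data and the helper name `variance_reduction` are hypothetical and only serve to illustrate the idea.

```python
import numpy as np


def variance_reduction(y, left_mask):
    """Decrease in weighted variance when y is split by a boolean mask."""
    y = np.asarray(y, dtype=float)
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that leaves one side empty gains nothing
    weighted = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted


# Toy data (assumed): glasshouse rows carry the fill value -1 for n_rows.
is_polytunnel = np.array([0, 0, 0, 1, 1, 1])
n_rows = np.array([-1, -1, -1, 10, 50, 90])
yield_kg = np.array([80.0, 82.0, 78.0, 50.0, 60.0, 70.0])

# Candidate 1: split on the glasshouse/polytunnel indicator.
gain_structure = variance_reduction(yield_kg, is_polytunnel == 0)
# Candidate 2: split on n_rows between the fill value and the valid range.
gain_sentinel = variance_reduction(yield_kg, n_rows <= -0.5)
# Candidate 3: split on n_rows inside the valid range.
gain_mid = variance_reduction(yield_kg, n_rows <= 30)

print(gain_structure, gain_sentinel, gain_mid)
```

In this toy case the first two candidates select exactly the same rows, so they score identically: a split on the number of rows just above the fill value isolates the glasshouse samples in the same way the indicator does.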
















machine-learning null regression missing-data imputation
















asked Nov 13 '18 at 9:09









Bodwin


























1 Answer














The best way to approach the problem is to give those cases a special value.

For example, if the number of rows for polytunnel crops ranges over [0, 100], assign -1 to all the glasshouse samples.

Ideally, the tree will use the polytunnel/glasshouse variable to split the data. The polytunnel data will then be evaluated according to the number of rows, while the number of rows will be ignored for the glasshouse data since it is constant there.
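
A minimal sketch of this approach with pandas and scikit-learn might look like the following. The column names (`is_polytunnel`, `n_rows`, `yield_kg`), the synthetic data and the hyperparameters are all assumptions made for illustration; none of them come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: n_rows only exists for polytunnel crops, else NaN.
is_polytunnel = rng.integers(0, 2, size=n)
rows = rng.integers(0, 101, size=n).astype(float)
df = pd.DataFrame({
    "is_polytunnel": is_polytunnel,
    "n_rows": np.where(is_polytunnel == 1, rows, np.nan),
})

# Synthetic target (assumed relationship, purely for the demo).
signal = np.where(is_polytunnel == 1, 50.0 + 0.3 * rows, 80.0)
df["yield_kg"] = signal + rng.normal(0.0, 1.0, size=n)

# Fill "does not apply" with a value outside the valid range [0, 100].
SENTINEL = -1
X = df[["is_polytunnel", "n_rows"]].fillna({"n_rows": SENTINEL})
y = df["yield_kg"]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Inspect the top of one tree to see which feature it splits on first.
print(export_text(model.estimators_[0],
                  feature_names=list(X.columns), max_depth=2))
print(dict(zip(X.columns, model.feature_importances_)))
```

Because the fill value lies outside the valid range, a tree that splits on `n_rows` first can still separate the glasshouse samples with a threshold between -1 and 0, so the order in which the two features are used matters less than it might seem. Printing a few trees as above is one way to check what the forest actually does on real data.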






        answered Nov 14 '18 at 8:57









        Roberto
