R: how to debug “factor has new levels” error for linear model and prediction











up vote
1
down vote

favorite












I am trying to make and test a linear model as follows:



lm_model <- lm(Purchase ~., data = train)
lm_prediction <- predict(lm_model, test)


This results in the following error, stating that the Product_Category_1 column has values that exist in the test data frame but not the train data frame):




factor Product_Category_1 has new levels 7, 9, 14, 16, 17, 18




However, if I check these they definitely look to appear in both data frames:



> nrow(subset(train, Product_Category_1 == "7"))
[1] 2923
> nrow(subset(test, Product_Category_1 == "7"))
[1] 745
> nrow(subset(train, Product_Category_1 == "9"))
[1] 312
> nrow(subset(test, Product_Category_1 == "9"))
[1] 92


Also showing the table for train and test show they have the same factors:



> table(train$Product_Category_1)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
110820 18818 15820 9265 118955 16159 2923 89511 312 4030 19113 3108 4407 1201 4991 7730 467 2430
> table(test$Product_Category_1)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
27533 4681 4029 2301 29637 4005 745 22621 92 1002 4847 767 1033 299 1212 1967 100 645
>









share|improve this question




















  • 1




    What do table(train$Product_Category_1) and table(test$Product_Category_1) tell you? There is not much we can do to help without a reproducible example here.
    – Roland
    Jul 27 at 6:53










  • table(train$Product_Category_1) and table(test$Product_Category_1) show they have the same factors (edited the post)
    – ZhouW
    Jul 27 at 7:03










  • still need a reproducible example. Read this: stackoverflow.com/questions/5963269/…
    – dww
    Jul 27 at 8:13






  • 1




    One possibility are NA values in other variables. lm applies na.omit and thereby could remove all observations of a specific factor level.
    – Roland
    Jul 27 at 8:48















up vote
1
down vote

favorite












I am trying to make and test a linear model as follows:



lm_model <- lm(Purchase ~., data = train)
lm_prediction <- predict(lm_model, test)


This results in the following error, stating that the Product_Category_1 column has values that exist in the test data frame but not the train data frame):




factor Product_Category_1 has new levels 7, 9, 14, 16, 17, 18




However, if I check these they definitely look to appear in both data frames:



> nrow(subset(train, Product_Category_1 == "7"))
[1] 2923
> nrow(subset(test, Product_Category_1 == "7"))
[1] 745
> nrow(subset(train, Product_Category_1 == "9"))
[1] 312
> nrow(subset(test, Product_Category_1 == "9"))
[1] 92


Also showing the table for train and test show they have the same factors:



> table(train$Product_Category_1)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
110820 18818 15820 9265 118955 16159 2923 89511 312 4030 19113 3108 4407 1201 4991 7730 467 2430
> table(test$Product_Category_1)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
27533 4681 4029 2301 29637 4005 745 22621 92 1002 4847 767 1033 299 1212 1967 100 645
>









share|improve this question




















  • 1




    What do table(train$Product_Category_1) and table(test$Product_Category_1) tell you? There is not much we can do to help without a reproducible example here.
    – Roland
    Jul 27 at 6:53










  • table(train$Product_Category_1) and table(test$Product_Category_1) show they have the same factors (edited the post)
    – ZhouW
    Jul 27 at 7:03










  • still need a reproducible example. Read this: stackoverflow.com/questions/5963269/…
    – dww
    Jul 27 at 8:13






  • 1




    One possibility are NA values in other variables. lm applies na.omit and thereby could remove all observations of a specific factor level.
    – Roland
    Jul 27 at 8:48













up vote
1
down vote

favorite









up vote
1
down vote

favorite











I am trying to make and test a linear model as follows:



lm_model <- lm(Purchase ~., data = train)
lm_prediction <- predict(lm_model, test)


This results in the following error, stating that the Product_Category_1 column has values that exist in the test data frame but not the train data frame):




factor Product_Category_1 has new levels 7, 9, 14, 16, 17, 18




However, if I check these they definitely look to appear in both data frames:



> nrow(subset(train, Product_Category_1 == "7"))
[1] 2923
> nrow(subset(test, Product_Category_1 == "7"))
[1] 745
> nrow(subset(train, Product_Category_1 == "9"))
[1] 312
> nrow(subset(test, Product_Category_1 == "9"))
[1] 92


Also showing the table for train and test show they have the same factors:



> table(train$Product_Category_1)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
110820 18818 15820 9265 118955 16159 2923 89511 312 4030 19113 3108 4407 1201 4991 7730 467 2430
> table(test$Product_Category_1)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
27533 4681 4029 2301 29637 4005 745 22621 92 1002 4847 767 1033 299 1212 1967 100 645
>









share|improve this question















I am trying to make and test a linear model as follows:



lm_model <- lm(Purchase ~., data = train)
lm_prediction <- predict(lm_model, test)


This results in the following error, stating that the Product_Category_1 column has values that exist in the test data frame but not the train data frame):




factor Product_Category_1 has new levels 7, 9, 14, 16, 17, 18




However, if I check these they definitely look to appear in both data frames:



> nrow(subset(train, Product_Category_1 == "7"))
[1] 2923
> nrow(subset(test, Product_Category_1 == "7"))
[1] 745
> nrow(subset(train, Product_Category_1 == "9"))
[1] 312
> nrow(subset(test, Product_Category_1 == "9"))
[1] 92


Also showing the table for train and test show they have the same factors:



> table(train$Product_Category_1)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
110820 18818 15820 9265 118955 16159 2923 89511 312 4030 19113 3108 4407 1201 4991 7730 467 2430
> table(test$Product_Category_1)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
27533 4681 4029 2301 29637 4005 745 22621 92 1002 4847 767 1033 299 1212 1967 100 645
>






r regression linear-regression prediction lm






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jul 27 at 11:43









李哲源

46.8k1491142




46.8k1491142










asked Jul 27 at 6:46









ZhouW

342423




342423








  • 1




    What do table(train$Product_Category_1) and table(test$Product_Category_1) tell you? There is not much we can do to help without a reproducible example here.
    – Roland
    Jul 27 at 6:53










  • table(train$Product_Category_1) and table(test$Product_Category_1) show they have the same factors (edited the post)
    – ZhouW
    Jul 27 at 7:03










  • still need a reproducible example. Read this: stackoverflow.com/questions/5963269/…
    – dww
    Jul 27 at 8:13






  • 1




    One possibility are NA values in other variables. lm applies na.omit and thereby could remove all observations of a specific factor level.
    – Roland
    Jul 27 at 8:48














  • 1




    What do table(train$Product_Category_1) and table(test$Product_Category_1) tell you? There is not much we can do to help without a reproducible example here.
    – Roland
    Jul 27 at 6:53










  • table(train$Product_Category_1) and table(test$Product_Category_1) show they have the same factors (edited the post)
    – ZhouW
    Jul 27 at 7:03










  • still need a reproducible example. Read this: stackoverflow.com/questions/5963269/…
    – dww
    Jul 27 at 8:13






  • 1




    One possibility are NA values in other variables. lm applies na.omit and thereby could remove all observations of a specific factor level.
    – Roland
    Jul 27 at 8:48








1




1




What do table(train$Product_Category_1) and table(test$Product_Category_1) tell you? There is not much we can do to help without a reproducible example here.
– Roland
Jul 27 at 6:53




What do table(train$Product_Category_1) and table(test$Product_Category_1) tell you? There is not much we can do to help without a reproducible example here.
– Roland
Jul 27 at 6:53












table(train$Product_Category_1) and table(test$Product_Category_1) show they have the same factors (edited the post)
– ZhouW
Jul 27 at 7:03




table(train$Product_Category_1) and table(test$Product_Category_1) show they have the same factors (edited the post)
– ZhouW
Jul 27 at 7:03












still need a reproducible example. Read this: stackoverflow.com/questions/5963269/…
– dww
Jul 27 at 8:13




still need a reproducible example. Read this: stackoverflow.com/questions/5963269/…
– dww
Jul 27 at 8:13




1




1




One possibility are NA values in other variables. lm applies na.omit and thereby could remove all observations of a specific factor level.
– Roland
Jul 27 at 8:48




One possibility are NA values in other variables. lm applies na.omit and thereby could remove all observations of a specific factor level.
– Roland
Jul 27 at 8:48












1 Answer
1






active

oldest

votes

















up vote
5
down vote













Table of Contents:




  • A simple example for walkthrough

  • Suggestion for users

  • Helpful information that we can get from the fitted model object

  • OK, I see what the problem is now, but how to make predict work?

  • Is there a better way to avoid such problem at all?




A simple example for walkthrough



Here is simple enough reproducible example to hint you what has happened.



train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d


I am fitting a model with more parameters than data so the model is rank-deficient (to be explained in the end). However, this does not affect how lm and predict work.



If you just check table(train$f) and table(test$f) it is not useful as the problem is not caused by variable f but by NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases) for model fitting. They have to do so as otherwise the underlying FORTRAN routine for QR factorization would fail because it can not handle NA. If you check the documentation at ?lm you will see this function has an argument na.action which defaults to na.omit. You can also set it to na.exclude but na.pass which retains NA will cause FORTRAN error:



fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'


Let's remove NA from the training dataset.



train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d


f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):



## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())


This is not user controllable. The reason is that if an unused level is included, it will generate a column of zeros in the model matrix.



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


This is undesired as it produces NA coefficient for dummy variable fd. By drop.unused.levels = TRUE as forced by lm and glm:



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


The fd is gone, and



mf$f
#[1] a b c
#Levels: a b c


The now non-existing "d" level will cause the "new factor level" error in predict.





Suggestion for users



It is highly recommended that all users do the following manually when fitting models:





  • [No. 1] remove incomplete cases;


  • [No. 2] drop unused factor levels.


This is exactly the procedure as recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? This gets users aware of what lm and glm do under the hood, and makes their debugging life much easier.



Note, there should be another recommendation in the list:





  • [No. 0] do subsetting yourself


Users may occasionally use subset argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, thus you may get "new factor levels" when using predict later.



The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust. Ask your function to return an informative error rather than waiting for lm and glm to complain.





Helpful information that we can get from the fitted model object



lm and glm return an xlevels value in the fitted object. It contains the factor levels actually used for model fitting.



fit$xlevels
#$f
#[1] "a" "b" "c"


So in case you have not followed the recommendations listed above and have got into trouble with factor levels, this xlevels should be the first thing to inspect.



If you want to use something like table to count how many cases there are for each factor levels, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although making a model matrix can use much RAM.





OK, I see what the problem is now, but how to make predict work?



If you can not choose to work with a different set of train and test dataset (see the next section), you need to set those factor levels in the test but not in xlevels to NA. Then predict will just predict NA for such incomplete cases.





Is there a better way to avoid such problem at all?



People split data into train and test as they want to do cross-validation. The first step is to apply na.omit on your full dataset to get rid of NA noise. Then we could do a random partitioning on what is left, but this this naive way may end up with




  • some factor levels in test but not in train (oops, we get "new factor level" error when using predict);

  • some factor variables in train only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm and glm);


So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.



There is in fact another hazard, but not causing programming errors:




  • the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).


Regarding the rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problem for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? However, such issue is more difficult to avoid, particularly if you have many factors and possibly with interaction.






share|improve this answer























  • Or filling NA's in the training dataset with multiple imputation might be a different way to avoid the problem.
    – Roland
    Jul 30 at 8:30











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














 

draft saved


draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f51552203%2fr-how-to-debug-factor-has-new-levels-error-for-linear-model-and-prediction%23new-answer', 'question_page');
}
);

Post as a guest
































1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
5
down vote













Table of Contents:




  • A simple example for walkthrough

  • Suggestion for users

  • Helpful information that we can get from the fitted model object

  • OK, I see what the problem is now, but how to make predict work?

  • Is there a better way to avoid such problem at all?




A simple example for walkthrough



Here is simple enough reproducible example to hint you what has happened.



train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d


I am fitting a model with more parameters than data so the model is rank-deficient (to be explained in the end). However, this does not affect how lm and predict work.



If you just check table(train$f) and table(test$f) it is not useful as the problem is not caused by variable f but by NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases) for model fitting. They have to do so as otherwise the underlying FORTRAN routine for QR factorization would fail because it can not handle NA. If you check the documentation at ?lm you will see this function has an argument na.action which defaults to na.omit. You can also set it to na.exclude but na.pass which retains NA will cause FORTRAN error:



fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'


Let's remove NA from the training dataset.



train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d


f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):



## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())


This is not user controllable. The reason is that if an unused level is included, it will generate a column of zeros in the model matrix.



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


This is undesired as it produces NA coefficient for dummy variable fd. By drop.unused.levels = TRUE as forced by lm and glm:



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


The fd is gone, and



mf$f
#[1] a b c
#Levels: a b c


The now non-existing "d" level will cause the "new factor level" error in predict.





Suggestion for users



It is highly recommended that all users do the following manually when fitting models:





  • [No. 1] remove incomplete cases;


  • [No. 2] drop unused factor levels.


This is exactly the procedure as recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? This gets users aware of what lm and glm do under the hood, and makes their debugging life much easier.



Note, there should be another recommendation in the list:





  • [No. 0] do subsetting yourself


Users may occasionally use subset argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, thus you may get "new factor levels" when using predict later.



The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust. Ask your function to return an informative error rather than waiting for lm and glm to complain.





Helpful information that we can get from the fitted model object



lm and glm return an xlevels value in the fitted object. It contains the factor levels actually used for model fitting.



fit$xlevels
#$f
#[1] "a" "b" "c"


So in case you have not followed the recommendations listed above and have got into trouble with factor levels, this xlevels should be the first thing to inspect.



If you want to use something like table to count how many cases there are for each factor levels, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although making a model matrix can use much RAM.





OK, I see what the problem is now, but how to make predict work?



If you can not choose to work with a different set of train and test dataset (see the next section), you need to set those factor levels in the test but not in xlevels to NA. Then predict will just predict NA for such incomplete cases.





Is there a better way to avoid such problem at all?



People split data into train and test as they want to do cross-validation. The first step is to apply na.omit on your full dataset to get rid of NA noise. Then we could do a random partitioning on what is left, but this this naive way may end up with




  • some factor levels in test but not in train (oops, we get "new factor level" error when using predict);

  • some factor variables in train only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm and glm);


So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.



There is in fact another hazard, but not causing programming errors:




  • the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).


Regarding the rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problem for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? However, such issue is more difficult to avoid, particularly if you have many factors and possibly with interaction.






share|improve this answer























  • Or filling NA's in the training dataset with multiple imputation might be a different way to avoid the problem.
    – Roland
    Jul 30 at 8:30















up vote
5
down vote













Table of Contents:




  • A simple example for walkthrough

  • Suggestion for users

  • Helpful information that we can get from the fitted model object

  • OK, I see what the problem is now, but how to make predict work?

  • Is there a better way to avoid such problem at all?




A simple example for walkthrough



Here is simple enough reproducible example to hint you what has happened.



train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d


I am fitting a model with more parameters than data so the model is rank-deficient (to be explained in the end). However, this does not affect how lm and predict work.



If you just check table(train$f) and table(test$f) it is not useful as the problem is not caused by variable f but by NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases) for model fitting. They have to do so as otherwise the underlying FORTRAN routine for QR factorization would fail because it can not handle NA. If you check the documentation at ?lm you will see this function has an argument na.action which defaults to na.omit. You can also set it to na.exclude but na.pass which retains NA will cause FORTRAN error:



fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'


Let's remove NA from the training dataset.



train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d


f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):



## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())


This is not user controllable. The reason is that if an unused level is included, it will generate a column of zeros in the model matrix.



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


This is undesired as it produces NA coefficient for dummy variable fd. By drop.unused.levels = TRUE as forced by lm and glm:



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


The fd is gone, and



mf$f
#[1] a b c
#Levels: a b c


The now non-existing "d" level will cause the "new factor level" error in predict.





Suggestion for users



It is highly recommended that all users do the following manually when fitting models:





  • [No. 1] remove incomplete cases;


  • [No. 2] drop unused factor levels.


This is exactly the procedure as recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? This gets users aware of what lm and glm do under the hood, and makes their debugging life much easier.



Note, there should be another recommendation in the list:





  • [No. 0] do subsetting yourself


Users may occasionally use subset argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, thus you may get "new factor levels" when using predict later.



The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust. Ask your function to return an informative error rather than waiting for lm and glm to complain.





Helpful information that we can get from the fitted model object



lm and glm return an xlevels value in the fitted object. It contains the factor levels actually used for model fitting.



fit$xlevels
#$f
#[1] "a" "b" "c"


So in case you have not followed the recommendations listed above and have got into trouble with factor levels, this xlevels should be the first thing to inspect.



If you want to use something like table to count how many cases there are for each factor levels, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although making a model matrix can use much RAM.





OK, I see what the problem is now, but how to make predict work?



If you can not choose to work with a different set of train and test dataset (see the next section), you need to set those factor levels in the test but not in xlevels to NA. Then predict will just predict NA for such incomplete cases.





Is there a better way to avoid such problem at all?



People split data into train and test as they want to do cross-validation. The first step is to apply na.omit on your full dataset to get rid of NA noise. Then we could do a random partitioning on what is left, but this this naive way may end up with




  • some factor levels in test but not in train (oops, we get "new factor level" error when using predict);

  • some factor variables in train only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm and glm);


So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.



There is in fact another hazard, but not causing programming errors:




  • the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).


Regarding the rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problem for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? However, such issue is more difficult to avoid, particularly if you have many factors and possibly with interaction.






share|improve this answer























  • Or filling NA's in the training dataset with multiple imputation might be a different way to avoid the problem.
    – Roland
    Jul 30 at 8:30













up vote
5
down vote










up vote
5
down vote









Table of Contents:




  • A simple example for walkthrough

  • Suggestion for users

  • Helpful information that we can get from the fitted model object

  • OK, I see what the problem is now, but how to make predict work?

  • Is there a better way to avoid such problem at all?




A simple example for walkthrough



Here is simple enough reproducible example to hint you what has happened.



train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d


I am fitting a model with more parameters than data so the model is rank-deficient (to be explained in the end). However, this does not affect how lm and predict work.



If you just check table(train$f) and table(test$f) it is not useful as the problem is not caused by variable f but by NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases) for model fitting. They have to do so as otherwise the underlying FORTRAN routine for QR factorization would fail because it can not handle NA. If you check the documentation at ?lm you will see this function has an argument na.action which defaults to na.omit. You can also set it to na.exclude but na.pass which retains NA will cause FORTRAN error:



fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'


Let's remove NA from the training dataset.



train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d


f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):



## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())


This is not user controllable. The reason is that if an unused level is included, it will generate a column of zeros in the model matrix.



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


This is undesired as it produces NA coefficient for dummy variable fd. By drop.unused.levels = TRUE as forced by lm and glm:



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


The fd is gone, and



mf$f
#[1] a b c
#Levels: a b c


The now non-existing "d" level will cause the "new factor level" error in predict.





Suggestion for users



It is highly recommended that all users do the following manually when fitting models:





  • [No. 1] remove incomplete cases;


  • [No. 2] drop unused factor levels.


This is exactly the procedure as recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? This gets users aware of what lm and glm do under the hood, and makes their debugging life much easier.



Note, there should be another recommendation in the list:





  • [No. 0] do subsetting yourself


Users may occasionally use subset argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, thus you may get "new factor levels" when using predict later.



The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust. Ask your function to return an informative error rather than waiting for lm and glm to complain.





Helpful information that we can get from the fitted model object



lm and glm return an xlevels value in the fitted object. It contains the factor levels actually used for model fitting.



fit$xlevels
#$f
#[1] "a" "b" "c"


So in case you have not followed the recommendations listed above and have got into trouble with factor levels, this xlevels should be the first thing to inspect.



If you want to use something like table to count how many cases there are for each factor levels, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although making a model matrix can use much RAM.





OK, I see what the problem is now, but how to make predict work?



If you can not choose to work with a different set of train and test dataset (see the next section), you need to set those factor levels in the test but not in xlevels to NA. Then predict will just predict NA for such incomplete cases.





Is there a better way to avoid such problem at all?



People split data into train and test as they want to do cross-validation. The first step is to apply na.omit on your full dataset to get rid of NA noise. Then we could do a random partitioning on what is left, but this this naive way may end up with




  • some factor levels in test but not in train (oops, we get "new factor level" error when using predict);

  • some factor variables in train only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm and glm);


So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.



There is in fact another hazard, but not causing programming errors:




  • the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).


Regarding the rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problem for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? However, such issue is more difficult to avoid, particularly if you have many factors and possibly with interaction.






share|improve this answer














Table of Contents:




  • A simple example for walkthrough

  • Suggestion for users

  • Helpful information that we can get from the fitted model object

  • OK, I see what the problem is now, but how to make predict work?

  • Is there a better way to avoid such problem at all?




A simple example for walkthrough



Here is simple enough reproducible example to hint you what has happened.



train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d


I am fitting a model with more parameters than data so the model is rank-deficient (to be explained in the end). However, this does not affect how lm and predict work.



If you just check table(train$f) and table(test$f) it is not useful as the problem is not caused by variable f but by NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases) for model fitting. They have to do so as otherwise the underlying FORTRAN routine for QR factorization would fail because it can not handle NA. If you check the documentation at ?lm you will see this function has an argument na.action which defaults to na.omit. You can also set it to na.exclude but na.pass which retains NA will cause FORTRAN error:



fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'


Let's remove NA from the training dataset.



train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d


f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):



## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())


This is not user controllable. The reason is that if an unused level is included, it will generate a column of zeros in the model matrix.



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


This is undesired as it produces NA coefficient for dummy variable fd. By drop.unused.levels = TRUE as forced by lm and glm:



mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"


The fd is gone, and



mf$f
#[1] a b c
#Levels: a b c


The now non-existing "d" level will cause the "new factor level" error in predict.





Suggestion for users



It is highly recommended that all users do the following manually when fitting models:





  • [No. 1] remove incomplete cases;


  • [No. 2] drop unused factor levels.


This is exactly the procedure as recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? This gets users aware of what lm and glm do under the hood, and makes their debugging life much easier.



Note, there should be another recommendation in the list:





  • [No. 0] do subsetting yourself


Users may occasionally use subset argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, thus you may get "new factor levels" when using predict later.



The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust. Ask your function to return an informative error rather than waiting for lm and glm to complain.





Helpful information that we can get from the fitted model object



lm and glm return an xlevels value in the fitted object. It contains the factor levels actually used for model fitting.



fit$xlevels
#$f
#[1] "a" "b" "c"


So in case you have not followed the recommendations listed above and have got into trouble with factor levels, this xlevels should be the first thing to inspect.



If you want to use something like table to count how many cases there are for each factor levels, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although making a model matrix can use much RAM.





OK, I see what the problem is now, but how to make predict work?



If you can not choose to work with a different set of train and test dataset (see the next section), you need to set those factor levels in the test but not in xlevels to NA. Then predict will just predict NA for such incomplete cases.





Is there a better way to avoid such problem at all?



People split data into train and test as they want to do cross-validation. The first step is to apply na.omit on your full dataset to get rid of NA noise. Then we could do a random partitioning on what is left, but this this naive way may end up with




  • some factor levels in test but not in train (oops, we get "new factor level" error when using predict);

  • some factor variables in train only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm and glm);


So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.



There is in fact another hazard, but not causing programming errors:




  • the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).


Regarding the rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problem for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? However, such issue is more difficult to avoid, particularly if you have many factors and possibly with interaction.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jul 27 at 15:25

























answered Jul 27 at 10:25









李哲源

46.8k1491142




46.8k1491142












  • Or filling NA's in the training dataset with multiple imputation might be a different way to avoid the problem.
    – Roland
    Jul 30 at 8:30


















  • Or filling NA's in the training dataset with multiple imputation might be a different way to avoid the problem.
    – Roland
    Jul 30 at 8:30
















Or filling NA's in the training dataset with multiple imputation might be a different way to avoid the problem.
– Roland
Jul 30 at 8:30




Or filling NA's in the training dataset with multiple imputation might be a different way to avoid the problem.
– Roland
Jul 30 at 8:30


















 

draft saved


draft discarded



















































 


draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f51552203%2fr-how-to-debug-factor-has-new-levels-error-for-linear-model-and-prediction%23new-answer', 'question_page');
}
);

Post as a guest




















































































Popular posts from this blog

Full-time equivalent

さくらももこ

13 indicted, 8 arrested in Calif. drug cartel investigation