Validating Fuzzy Clustering



























I would like to use fuzzy C-means clustering on a large unsupervised data set of 41 variables and 415 observations. However, I am stuck on validating the resulting clusters. When I plot the result for an arbitrary number of clusters, only 54% of the total variance is explained, which is not great, and there are no clearly separated clusters as there would be with the iris data set, for example.



First I ran fcm on my scaled data with 3 clusters just to see what happens, but since I am trying to find the optimal number of clusters, I do not want to fix an arbitrary number in advance.



So I searched Google for "validate fuzzy clustering in R". This link here was good, but I would still have to try a bunch of different numbers of clusters. I looked at the advclust, ppclust, and clValid packages but could not find a walkthrough for their functions. I also read the documentation of each package, but could not work out what to do next.



I stepped through several candidate numbers of clusters and checked each one with the k.crisp component returned by fanny, starting at 100 and working down to 4. Based on the description of that component in the documentation,




k.crisp = integer (≤ k) giving the number of crisp clusters; can be less than k, where it's recommended to decrease memb.exp.




it doesn't seem like a valid approach, because it only compares the number of crisp clusters with the number of fuzzy clusters.



Is there a function that checks the validity of my clusters for 2 to 10 clusters? Also, is it worthwhile to check the validity of 1 cluster? That may be a stupid question, but I have a strange feeling that 1 cluster might turn out to be optimal. (Any tips on what to do if I end up with 1 cluster, besides cry a little on the inside?)



Code



library(cluster)
library(factoextra)
library(ppclust)
library(advclust)
library(clValid)
data(iris)
df <- sapply(iris[-5], scale)  # standardise the four numeric columns
res.fanny <- fanny(df, 3, metric = 'SqEuclidean')
res.fanny$k.crisp
# When I try to use euclidean, I get the warning that all memberships are very close to 1/l.
# Maybe increase memb.exp, which I don't fully understand.
# From my understanding, using SqEuclidean is equivalent to fuzzy C-means (see the website below).
# Ultimately I do want to use C-means, hence the SqEuclidean distance.
fviz_cluster(res.fanny, ellipse.type = 'norm', palette = 'jco', ggtheme = theme_minimal(), legend = 'right')
fviz_silhouette(res.fanny, palette = 'jco', ggtheme = theme_minimal())

# With ppclust
set.seed(123)
res.fcm<-fcm(df,centers=3,nstart=10)


website as mentioned above.
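For context on the memb.exp warning mentioned in the code comments, the standard fuzzy C-means formulation may help. This is only a sketch in conventional notation (n observations x_i, k centers v_j, memberships u_ij, fuzzifier m); none of these symbols appear in the post itself, and in fanny the fuzzifier corresponds to the memb.exp argument.

```latex
% Fuzzy C-means objective, minimised over memberships U and centers V
J_m(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{\,m} \, \lVert x_i - v_j \rVert^{2},
\qquad \sum_{j=1}^{k} u_{ij} = 1, \quad m > 1

% Membership update for fixed centers
u_{ij} = \left[ \sum_{l=1}^{k}
  \left( \frac{\lVert x_i - v_j \rVert}{\lVert x_i - v_l \rVert} \right)^{2/(m-1)}
\right]^{-1}
```

As m approaches 1 the exponent 2/(m-1) grows and the memberships become nearly crisp; as m increases the exponent shrinks and every u_ij drifts toward 1/k, which is the situation the warning describes. Lowering memb.exp toward 1 therefore tends to sharpen the partition.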










      r validation cluster-analysis














asked Nov 12 '18 at 23:09 by Jack Armstrong (edited Nov 13 '18 at 7:16)
























          1 Answer














As far as I know, you need to run the clustering for several different numbers of clusters and see how the percentage of variance explained changes with the number of clusters. This is called the elbow method.



library(ppclust)

# Total within-cluster sum of squares for k = 2, ..., 10
wss <- sapply(2:10, function(k) {
  fcm(df, centers = k, nstart = 10)$sumsqrs$tot.within.ss
})

plot(2:10, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")


[Figure: total within-cluster sum of squares plotted against the number of clusters]



After k = 5, the total within-cluster sum of squares changes only slowly, so k = 5 is a good candidate for the optimal number of clusters according to the elbow method.
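The elbow plot is a visual heuristic. As a rough sketch of a more numeric comparison, the loop below refits cluster::fanny for k = 2 to 10 on the scaled iris data and records the normalized Dunn partition coefficient and the average silhouette width of the hardened clustering. The names x, ks and validity are illustrative, and using fanny rather than ppclust::fcm is an assumption made so that only the cluster package is needed.

```r
# Sketch: compare fuzzy partitions for k = 2..10 with cluster::fanny().
library(cluster)

x <- scale(iris[, -5])   # standardised numeric columns, as in the question

ks <- 2:10
validity <- do.call(rbind, lapply(ks, function(k) {
  # Squared Euclidean distances correspond to fuzzy C-means in fanny().
  # Warnings (e.g. about convergence) are silenced for this sketch only.
  fit <- suppressWarnings(fanny(x, k, metric = "SqEuclidean", memb.exp = 2))
  sil <- fit$silinfo$avg.width   # NULL if no silhouette could be computed
  data.frame(
    k         = k,
    k_crisp   = fit$k.crisp,                       # crisp clusters actually used
    dunn_norm = unname(fit$coeff["normalized"]),   # 0 = fully fuzzy, 1 = crisp
    avg_sil   = if (is.null(sil)) NA_real_ else sil
  )
}))

print(validity)
```

Larger values of dunn_norm and avg_sil point to a crisper, better separated partition, and a k_crisp below k suggests that some requested clusters are not the closest cluster for any observation. These indices are guides rather than formal tests, so they are best read together with the elbow plot.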






























• I am looking more for a formal method. But also isn't that using K-means clustering?
  – Jack Armstrong, Nov 13 '18 at 10:19

• The objective is similar so I think that we can use this method. Please check this paper, researchgate.net/publication/… They are using k = 1 as null hypothesis and use some kind of measure and look for an "elbow" on a graph.
  – boyaronur, Nov 13 '18 at 10:53
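One established way to include k = 1 in the comparison is the gap statistic (Tibshirani et al.), available as cluster::clusGap. This is a swapped-in technique, not necessarily the measure used in the paper linked above. The sketch below assumes the scaled iris data and a hypothetical wrapper, fanny_crisp, that hardens the fuzzy memberships to the nearest crisp cluster, which is a simplification; B and K.max are kept small only to keep the run short.

```r
# Sketch: gap statistic over k = 1..5, using hardened fanny() clusterings.
library(cluster)

x <- scale(iris[, -5])

# clusGap() expects a function(x, k) returning a list with a 'cluster' component.
fanny_crisp <- function(x, k) {
  if (k == 1) return(list(cluster = rep(1L, nrow(x))))   # trivial partition
  fit <- suppressWarnings(fanny(x, k, metric = "SqEuclidean"))
  list(cluster = fit$clustering)   # nearest crisp cluster per observation
}

set.seed(123)
gap <- clusGap(x, FUNcluster = fanny_crisp, K.max = 5, B = 10, verbose = FALSE)

print(gap$Tab)   # columns: logW, E.logW, gap, SE.sim

# Smallest k whose gap is within one standard error of the next one
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
best_k
```

If best_k comes out as 1, the data show no more structure than the uniform reference distribution under this criterion, which is a legitimate finding rather than a failure. A larger B gives more stable standard errors.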













answered Nov 13 '18 at 9:16 by boyaronur












