I’m trying to execute a Principal Components Analysis, but I’m getting the error: Error in colMeans(x, na.rm = TRUE) : ‘x’ must be numeric
I know all the columns have to be numeric, but how do I handle character columns in the data set? E.g.:
data(birth.death.rates.1966)
data2 <- birth.death.rates.1966
princ <- prcomp(data2)
- data2 example of data below:
Should I add a new column that maps each country name to a numeric code? If so, how do I do this in R?
You can convert a character vector to numeric values by going via a factor. Each unique value then gets a unique integer code. In this example there are four values, so the codes are 1 to 4, in alphabetical order, I think:
> d = data.frame(country=c("foo","bar","baz","qux"),x=runif(4),y=runif(4))
> d
  country          x         y
1     foo 0.84435112 0.7022875
2     bar 0.01343424 0.5019794
3     baz 0.09815888 0.5832612
4     qux 0.18397525 0.8049514
> d$country = as.numeric(as.factor(d$country))
> d
  country          x         y
1       3 0.84435112 0.7022875
2       1 0.01343424 0.5019794
3       2 0.09815888 0.5832612
4       4 0.18397525 0.8049514
You can then run
> prcomp(d)
Standard deviations:
1.308665216 0.339983614 0.009141194

Rotation:
               PC1          PC2          PC3
country -0.9858920  0.132948161 -0.101694168
x       -0.1331795 -0.991081523 -0.004541179
y       -0.1013910  0.009066471  0.994805345
Whether this makes sense for your application is up to you. Maybe you just want to drop the first column:
prcomp(d[,-1])
and work with the numeric data, which seems to be what the other “answers” are trying to achieve.
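A minimal sketch of the same idea without hard-coding the position of the character column. The data frame is a toy stand-in with made-up values, and selecting columns with sapply(d, is.numeric) is just one way to do it:

```r
# Toy data standing in for the real data set (values are illustrative only)
set.seed(1)
d <- data.frame(country = c("foo", "bar", "baz", "qux"),
                x = runif(4), y = runif(4),
                stringsAsFactors = FALSE)

# Logical vector: TRUE for every numeric column
num_cols <- sapply(d, is.numeric)

# Run PCA on the numeric columns only
princ <- prcomp(d[, num_cols, drop = FALSE])
summary(princ)
```

This keeps working if the character column is not the first one, or if there is more than one.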
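The first column of the data frame is character, so you can move it into the row names (the result has to be assigned back to data2, otherwise prcomp still sees the character column):
library(tidyverse)
data2 <- data2 %>% remove_rownames() %>% column_to_rownames(var="country")
princ <- prcomp(data2)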
Alternatively, in base R (set the row names before dropping the column):
rownames(data2) <- data2[,1]
data2 <- data2[,-1]
princ <- prcomp(data2)
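A sketch of the row-name approach on a made-up data frame. The column names country, birth and death and all the values are assumptions for illustration, not the real columns of birth.death.rates.1966. Keeping the labels as row names means they carry through to the PCA scores:

```r
# Hypothetical data: one character label column plus two numeric columns
df <- data.frame(country = c("foo", "bar", "baz", "qux"),
                 birth = c(36.4, 37.3, 42.1, 55.8),
                 death = c(14.6, 8.0, 15.3, 25.6),
                 stringsAsFactors = FALSE)

rownames(df) <- df$country   # copy the labels into the row names first
df$country <- NULL           # then drop the character column

princ <- prcomp(df)
rownames(princ$x)            # the scores are labelled by country
```

The labels stay available for plotting (for example with biplot(princ)) without taking part in the analysis.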
In R, converting character data to a factor does not make it genuinely numeric. The integer codes let a machine learning model handle the data mathematically, but they are still labels, not measurements.
Example: if a list of names is encoded as integers, one name ends up with a larger code than another, and the model may treat that difference as meaningful.
That should not be the case, because names (text that only labels a specific row) generally should not determine how a model behaves.
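To illustrate why the integer codes carry no real meaning, here is a small sketch with made-up names: the same character vector gets different codes depending only on how the factor levels happen to be ordered.

```r
names_vec <- c("foo", "bar", "baz", "qux")

# Default: levels are sorted, so the codes follow alphabetical order
codes_sorted <- as.integer(factor(names_vec))

# Same data, but with levels in order of appearance instead
codes_appearance <- as.integer(factor(names_vec, levels = unique(names_vec)))

codes_sorted       # 3 1 2 4
codes_appearance   # 1 2 3 4
```

A model fed either set of codes would read an ordering and a distance into the names that the data never contained.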
Also, if you try to work with this data as though it were numeric, you may get the following error:
Error in colMeans(x, na.rm = TRUE) : ‘x’ must be numeric
The reason for this error is explained above.
To overcome this problem, scale only the numeric columns:
training_set[,2:3] = scale(training_set[,2:3])
test_set[,2:3] = scale(test_set[,2:3])
In the following image, columns 1 and 4 hold encoded data and cannot be treated as numeric. Columns 2 and 3 contained numeric data from the start, so we can run our model on that part of the data only. The code above just shows how to select it: all rows, and columns 2 and 3.
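A minimal sketch of scaling only the numeric columns, assuming a hypothetical training_set whose columns 1 and 4 are encoded categories and whose columns 2 and 3 are numeric. The column names and values are invented for illustration:

```r
# Hypothetical data: columns 1 and 4 are encoded labels, 2 and 3 are numeric
training_set <- data.frame(Country   = factor(c("foo", "bar", "baz", "qux")),
                           Age       = c(44, 27, 30, 38),
                           Salary    = c(72000, 48000, 54000, 61000),
                           Purchased = factor(c("No", "Yes", "No", "Yes")))

# scale() on the whole data frame would fail with "'x' must be numeric",
# so pass it only columns 2 and 3
training_set[, 2:3] <- scale(training_set[, 2:3])

colMeans(training_set[, 2:3])  # both close to 0 after centering
```

The factor columns are left untouched, so they can still be used as labels or grouping variables afterwards.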