I am reading through predict() in R and am confused:
There is a dataset, Spam, from which we have created training and test data sets using random sampling. We have used the training data set (trainSpam) to train the model. We want to see how good the model is by testing it on the test data set (testSpam).
predictionModel = glm(numType ~ charDollar, family ="binomial", data = trainSpam)
predictionTest = predict(predictionModel, testSpam)
predictedSpam = rep("nonspam", dim(testSpam)[1])
predictedSpam[predictionModel$fitted > 0.5] = "spam"  # Here is my problem
In the line where we say:
predictedSpam[predictionModel$fitted > 0.5] = "spam"
how does predictionModel$fitted predict spam in the test data? It seems to be using the fitted values from the training data, yet we then go on to compare the result with the actual spam labels of the test data. Can someone explain?
Here is what I understood. In the line:
predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)
We create a model using the trainSpam data.
In the later line:
predictedSpam[predictionModel$fitted > 0.5] = "spam"
we are using predictionModel$fitted, which was fitted on the training data, to decide which rows are classified as spam. Shouldn’t we rather use something like predictionTest to identify the spam?
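To check my understanding, here is a minimal sketch on simulated data. The data frame simSpam, the 120/80 split, and the names trainSim, testSim, model, testProb and predictedTest are my own assumptions and are not from the slides; only the column names numType and charDollar match the code above. It suggests that the fitted values have one entry per training row, while predict() on the test set has one entry per test row. As far as I can tell, predict() for a glm returns values on the log-odds scale unless type = "response" is given, so I assume that is needed before comparing against 0.5.

```r
# Hypothetical simulated data standing in for the spam dataset;
# only the column names numType and charDollar come from the question.
set.seed(1)
n <- 200
simSpam <- data.frame(charDollar = rexp(n, rate = 5))
simSpam$numType <- rbinom(n, 1, plogis(-1 + 4 * simSpam$charDollar))

# Random split into training and test sets of different sizes.
trainIdx <- sample(n, 120)
trainSim <- simSpam[trainIdx, ]
testSim  <- simSpam[-trainIdx, ]

model <- glm(numType ~ charDollar, family = "binomial", data = trainSim)

# model$fitted is shorthand (partial matching) for model$fitted.values,
# which holds one fitted probability per TRAINING row.
length(model$fitted.values)  # 120, same as nrow(trainSim)

# predict() with newdata returns one value per TEST row.
# type = "response" asks for probabilities instead of log-odds.
testProb <- predict(model, newdata = testSim, type = "response")
length(testProb)             # 80, same as nrow(testSim)

# Classify the test rows using the test-set probabilities.
predictedTest <- rep("nonspam", nrow(testSim))
predictedTest[testProb > 0.5] <- "spam"
table(predictedTest, testSim$numType)
```

If this is right, indexing a test-length vector with the training-length fitted values only lines up by accident, which is why I suspect predictionTest was intended.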
This is where I am reading from: https://github.com/jtleek/dataanalysis/blob/master/week2/002structureOfADataAnalysis2/structureOfADataAnalysis2.pdf