I was wondering what data points you are using to flag something as spam. One that I thought might be useful is mean time between posts/word count, or some variant. I saw an account today that was posting a 500-word article every 10 minutes or so. Sorry, but nobody writes that fast and ends up with that quality. These had to be cut and paste - maybe even blatant plagiarism.
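To make that heuristic concrete, here is a rough sketch of how it might be computed. It assumes each post is available as a (timestamp, text) pair sorted by time. The function names and the 40 words-per-minute threshold are made up for illustration and would need tuning against real data; none of this reflects how the project currently works.

```python
from datetime import datetime, timedelta

def words_per_minute(posts):
    """Word count of each post divided by the minutes since the previous post.

    posts: list of (datetime, text) tuples, sorted by time (assumed format).
    """
    rates = []
    for (prev_time, _), (cur_time, text) in zip(posts, posts[1:]):
        minutes = (cur_time - prev_time).total_seconds() / 60
        if minutes <= 0:
            continue  # skip identical or out-of-order timestamps
        rates.append(len(text.split()) / minutes)
    return rates

def looks_too_fast(posts, max_wpm=40.0, min_gaps=2):
    """Flag an account whose mean rate exceeds max_wpm (a guessed threshold)."""
    rates = words_per_minute(posts)
    if len(rates) < min_gaps:
        return False  # not enough history to judge
    return sum(rates) / len(rates) > max_wpm

# Example: a 500-word post every 10 minutes works out to 50 words per minute.
start = datetime(2020, 1, 1, 12, 0)
account = [(start + timedelta(minutes=10 * i), "word " * 500) for i in range(4)]
print(words_per_minute(account))
print(looks_too_fast(account))
```

A rate like this is only one signal. Someone who drafts offline and pastes several finished posts in a row would trip it too, so it probably works best combined with other features.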
In theory, you could trap plagiarism by comparing consecutive posts and seeing if there is linguistic consistency between them. Someone cutting and pasting content from other sites would show variations in vocabulary, sentence structure and other linguistic markers. Those markers would be similar if the posts were all written by the same person.
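A very simplified sketch of that comparison follows. It assumes relative function-word frequencies as the only linguistic marker and cosine similarity as the consistency measure. The word list and function names are illustrative choices, not an established method or anything the project uses; a real system would want far richer features and longer texts.

```python
import math
import re
from collections import Counter

# Small illustrative list of common English function words (an assumption).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is",
                  "it", "for", "with", "as", "but", "on", "not")

def style_profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid dividing by zero on empty text
    counts = Counter(words)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(a, b):
    """Cosine of the angle between two profiles; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def consecutive_similarities(posts):
    """Similarity between each post and the one before it."""
    profiles = [style_profile(p) for p in posts]
    return [cosine_similarity(a, b) for a, b in zip(profiles, profiles[1:])]
```

Under this sketch, an account whose consecutive posts repeatedly score low against each other would be a candidate for closer inspection. Short posts give noisy profiles, so the scores only mean much for longer texts such as the 500-word articles mentioned above.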
I know there is academic work that has done this very thing, but I suspect it would be difficult to do in real time.
Anyway, good work on this project. Keep it up!
That kind of feature isn't currently used, but I've considered something similar. The project is still a work in progress, so thanks for your thoughts.