Bot that detects spam in comments #2 (more training data, SVM classifier, checking user previous comments, whitelist / blacklist / scamlist)

in #utopian-io, 7 years ago (edited)

I updated a bot whose purpose is to detect spam comments on the Steem blockchain. It uses the Multinomial Naive Bayes algorithm combined with SVM (model stacking). It can reply to a spam comment and downvote it. I built it for the #polish community, but it can be adapted to any tag (or all tags); that is only a matter of the training file.

Github repository

Log from console:
image.png

I have stacked 4 algorithms: Multinomial Naive Bayes and 3 variants of SVM.

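from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, NuSVC

# StackedModel is not a scikit-learn class (presumably defined in the repository); C is set elsewhere.
self.model = StackedModel([
    MultinomialNB(),
    SVC(kernel='linear', C=C, probability=True),
    SVC(kernel='rbf', gamma=0.7, C=C, probability=True),
    NuSVC(probability=True)
])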
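One simple way to combine several probabilistic classifiers is to average the spam probability each of them reports (soft voting). The sketch below only illustrates that idea; the repository's StackedModel may combine its members differently, for example with a meta-classifier. The `stacked_spam_probability` helper and `StubModel` are hypothetical names. The only thing borrowed from scikit-learn is the shape of its interface: a `classes_` attribute and a `predict_proba` method that returns one row of class probabilities per sample.

```python
from statistics import mean


def stacked_spam_probability(models, sample, spam_label="spam"):
    """Average the spam probability over all base models (soft voting)."""
    probabilities = []
    for model in models:
        # Columns of predict_proba follow the order of model.classes_
        spam_column = list(model.classes_).index(spam_label)
        probabilities.append(model.predict_proba([sample])[0][spam_column])
    return mean(probabilities)


class StubModel:
    """Hypothetical stand-in for a trained classifier with a fixed output."""

    classes_ = ["ham", "spam"]

    def __init__(self, spam_probability):
        self.spam_probability = spam_probability

    def predict_proba(self, samples):
        return [[1 - self.spam_probability, self.spam_probability] for _ in samples]


models = [StubModel(0.9), StubModel(0.6), StubModel(0.3)]
print(stacked_spam_probability(models, "UPVOTED RESTEEMED"))
```

Because the helper relies only on `classes_` and `predict_proba`, fitted scikit-learn estimators created with `probability=True` should plug in, provided the sample is already vectorized the way those estimators expect.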

To check the accuracy, I calculated a confusion matrix for each algorithm.

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) 

Confusion matrix:
[[65  1]
 [ 0 45]] 

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False) 

Confusion matrix:
[[65  1]
 [ 0 45]] 

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.7, kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False) 

Confusion matrix:
[[66  0]
 [ 0 45]] 

NuSVC(cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, nu=0.5, probability=True, random_state=None,
   shrinking=True, tol=0.001, verbose=False) 

Confusion matrix:
[[65  1]
 [ 1 44]] 

The confusion matrix for the stacked model looks as follows.

Stacked model
Confusion matrix:
[[65  1]
 [ 0 45]]

image.png

As you can see, the results are similar for each algorithm on its own as well as for the stacked model. You have to experiment a bit here to find the best combination. The results will probably change slightly as the data set grows.
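To put a number on "similar", accuracy can be read straight off each confusion matrix as the diagonal sum divided by the total. The snippet below does that for the five matrices printed above; the `accuracy` helper is just an illustration, not code from the repository.

```python
def accuracy(matrix):
    """Share of correct predictions: diagonal sum divided by the total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrices copied from the output above
matrices = {
    "MultinomialNB": [[65, 1], [0, 45]],
    "SVC linear": [[65, 1], [0, 45]],
    "SVC rbf": [[66, 0], [0, 45]],
    "NuSVC": [[65, 1], [1, 44]],
    "Stacked model": [[65, 1], [0, 45]],
}

for name, matrix in matrices.items():
    print(f"{name}: {accuracy(matrix):.4f}")
```

On this test set of 111 comments the accuracies range from 109/111 to 111/111, so the models differ by at most two comments.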

The bot checks not only the current comment but also the user's previous comments. I think a single "nice photo" comment is fine, but if a user posts this type of comment all the time, it is considered spam:

image.png
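The history check could look roughly like the sketch below. It is an assumption, not the bot's actual rule: `is_habitual_spammer`, the `min_share` cutoff and the toy `looks_generic` classifier are hypothetical, and the real bot would call its trained model instead.

```python
def is_habitual_spammer(previous_comments, is_spam, min_share=0.5):
    """Return True when at least min_share of the previous comments look like spam."""
    if not previous_comments:
        return False
    flagged = sum(1 for comment in previous_comments if is_spam(comment))
    return flagged / len(previous_comments) >= min_share


def looks_generic(comment):
    # Toy stand-in for the trained classifier
    return comment.strip().lower() in {"nice photo", "very nice", "upvoted"}


history = [
    "nice photo",
    "Very nice",
    "Thanks, I learned a lot about Churchill here.",
]
print(is_habitual_spammer(history, looks_generic))
```

With this rule a single generic comment among otherwise substantive ones stays below the cutoff and is left alone.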

The bot also pays attention to repeated, generic comments:

image.png
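One way to spot repeated comments is to normalize each comment and count duplicates. This is a minimal sketch under that assumption; `normalize`, `repeated_comments` and the `min_repeats` cutoff are hypothetical and may not match how the bot does it.

```python
import re
from collections import Counter


def normalize(comment):
    """Lowercase and collapse punctuation and whitespace so near-identical comments match."""
    return re.sub(r"\W+", " ", comment.lower()).strip()


def repeated_comments(comments, min_repeats=3):
    """Return normalized comments that occur at least min_repeats times."""
    counts = Counter(normalize(comment) for comment in comments)
    return {text: n for text, n in counts.items() if n >= min_repeats}


comments = [
    "Nice post!",
    "nice post",
    "NICE POST!!",
    "Great analysis of the rewards pool.",
]
print(repeated_comments(comments))
```

The `\W+` pattern is Unicode-aware in Python 3, so Polish letters survive normalization.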

And even scams (if the user is on the scamlist):
image.png
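The post does not spell out how the three lists interact with the classifier, so the following is only one plausible ordering, stated as an assumption: whitelisted users are never flagged, users on the scamlist or blacklist are always flagged, and everyone else is judged by the model's spam probability. The `decide` function and its return values are hypothetical.

```python
def decide(author, spam_probability, threshold, whitelist, blacklist, scamlist):
    """Hypothetical decision order: lists first, classifier second."""
    if author in whitelist:
        return "ham"
    if author in scamlist:
        return "scam"
    if author in blacklist:
        return "spam"
    return "spam" if spam_probability >= threshold else "ham"


whitelist = {"trusted-user"}
blacklist = {"known-spammer"}
scamlist = {"known-scammer"}

print(decide("trusted-user", 0.99, 0.8, whitelist, blacklist, scamlist))
print(decide("someone-else", 0.85, 0.8, whitelist, blacklist, scamlist))
```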

Running

$ POSTING_KEY=<posting_key> spam_detector.py config.json

The private posting key is passed to the bot as an environment variable, so it never has to be written to the configuration file.
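Reading the key inside the script could look like the sketch below. Only the variable name POSTING_KEY comes from the command above; the `read_posting_key` helper and its error message are assumptions.

```python
import os


def read_posting_key(environ=os.environ):
    """Fetch the private posting key from the environment, failing loudly if it is absent."""
    key = environ.get("POSTING_KEY")
    if not key:
        raise RuntimeError("POSTING_KEY environment variable is not set")
    return key


# Typical use at startup:
# posting_key = read_posting_key()
```

Accepting the environment as a parameter keeps the function easy to test without touching the real environment.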

Configuration

All parameters are stored in the config.json file.

| Key | Value |
| --- | --- |
| account | account used by the bot |
| nodes | list of Steem nodes |
| tags | tags which are observed |
| probability_threshold | threshold to classify as spam |
| training_file | input training file |
| blacklist_file | file containing the blacklist |
| whitelist_file | file containing the whitelist |
| scamlist_file | file containing users who post scams |
| reply_mode | 0 - without reply, 1 - with reply |
| vote_mode | 0 - without vote, 1 - with vote |
| vote_weight | weight of the vote from range [-100.0, 100.0] |
| num_previous_comments | number of user comments that are investigated |
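A config.json using these keys could be loaded and sanity-checked as sketched below. The key names come from the table above. Everything else is an assumption: all values in the example are placeholders, and the `parse_config` helper, its decision to treat every key as required, and its range checks are hypothetical.

```python
import json

REQUIRED_KEYS = {
    "account", "nodes", "tags", "probability_threshold", "training_file",
    "blacklist_file", "whitelist_file", "scamlist_file", "reply_mode",
    "vote_mode", "vote_weight", "num_previous_comments",
}

# Placeholder values for illustration only
EXAMPLE_CONFIG = """
{
    "account": "my-bot-account",
    "nodes": ["https://api.steemit.com"],
    "tags": ["polish"],
    "probability_threshold": 0.85,
    "training_file": "training.txt",
    "blacklist_file": "blacklist.txt",
    "whitelist_file": "whitelist.txt",
    "scamlist_file": "scamlist.txt",
    "reply_mode": 1,
    "vote_mode": 0,
    "vote_weight": -10.0,
    "num_previous_comments": 10
}
"""


def parse_config(text):
    """Parse the JSON text and check the documented keys and value ranges."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    if not -100.0 <= config["vote_weight"] <= 100.0:
        raise ValueError("vote_weight must be within [-100.0, 100.0]")
    for mode in ("reply_mode", "vote_mode"):
        if config[mode] not in (0, 1):
            raise ValueError(f"{mode} must be 0 or 1")
    return config


config = parse_config(EXAMPLE_CONFIG)
print(config["tags"], config["probability_threshold"])
```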

The training file contains one comment per row, each prefixed with the label ham or spam, as in the sample below:

ham    Wow. Even though I was well aware of Churchill's later career, I actually didn't know he was here during the Anglo Boer war, let alone as a prisoner of war. Thank you for a very interesting and informative post!
ham    Yea this post isn't really about fixing all the problems on Steem - it's just that there always seems to be a lot of drama over the trending page, and i think it's a bad thing for new people coming to the site to see first, so just throwing out the idea of getting rid of it for now.
ham    Yea, I believe there was something about notifications in one of the SteemIt, Inc roadmaps but don't quote me on that. Notifications are really important though, can't expect everyone to use
ham    Yeah, I may have to sit down & do a post or two myself! It’s fun to imagine! Other than promoted posts, I do think we should have advertising, albeit in a very user focused & friendly way.
ham    Yes I agree. My suggestion was based on how things actually are currently which as you said is not representative of the best posts. I don’t believe that is going to change any time soon, if ever, so in the mean time I think it would be better to just get rid of that page.
ham    Yes! This thought never occurred to me before, but your idea is perfect!! I think it would help underpaid content creators be noticed. Better yet, don't sort people based on potential payout. Create an algorithm that sorts out such things as grammatical and spelling errors, "articles" that are too short, authors that post 10 times per day, copy/paste content, ect. and only the highest quality bloggers would make it to the top...
ham    Yes, there are only a few flagging because majority is scared. He has already ruined many people's accounts and reps and flagged all of their posts to $0.00 for voicing opinons. People disagree with the rewards of his posts. You are well aware of haejin's 10-12 posts per day reaching an easy $350 per post every time. I don't think anyone is against his predictions in the sense that anyone is able to use common sense and choose if they invest or not based on his predictions. I have not seen any whales helping recover these people's accounts for flagging him. Perhaps this is not an unjustified flag war? I have sacrificed my entire blog and all earnings for six weeks to try and lower the rewards. I am not scared of the consequences as I know what they are. People are scared though so I think if a lot of users delegate a small portion of their Steem power to one of these accounts then the rewards can be lower substantially. I also feel that it would be a more organized approach at flagging him as it will be a scheduled downvote of 10 posts every evening. I feel that if enough people make the delegation's he will be unable to flag every user that delegated down to $0.00 as he would have to use all of his power flagging instead of upvoting himself. You can count on support from whales to resolve unjustified flag wars, if you feel like post are more over-valued than the majority of Steem content then flag them and don't be scared of reprisals.
ham    you are right. As it is now, he's spending a tonne of his vote power flagging anyone who disagrees with his rewards. He cannot flag everyone it would cut into his profits, as his vote power drains to 0. If rancho comes in and starts flagging too, then they are making even less money because now he's wasting his vote power by flagging instead of upvoting the 10 posts a day that he has to.
ham    You know.. I delegated what little SP i can afford exactly because you took the risk. Now if he did wanna go all out flag, he'd had to waste his vp on both you and me. if enough people did it we can even go against the biggest abusers too.
ham    Your concept is very solid, it might seem hard to implement in the start but I know that if you keep at it you will reach your goal!I cannot wait to start using your system!
spam    i follow you
spam    Upvote, follow, resteem
spam    UPVOTED
spam    UPVOTED & RESTEEMED
spam    Upvoted and followed you back
spam    UPVOTED RESTEEMED
spam    very funny
spam    very nice
spam    Write Link, send 0.100 sbd. 3000+ followers can see you (resteem)
spam    Yes very nice post.
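Rows in this format can be read by splitting each line at the first run of whitespace. This is a minimal sketch, assuming the label and the comment are separated by a tab or spaces as in the sample above; the `parse_training_rows` helper is hypothetical, and the repository (which lists pandas among its libraries) may load the file differently.

```python
def parse_training_rows(lines):
    """Turn 'label<whitespace>text' lines into (label, text) tuples."""
    rows = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split at the first run of whitespace
        if len(parts) != 2 or parts[0] not in ("ham", "spam"):
            raise ValueError(f"line {number}: expected 'ham' or 'spam' followed by text")
        rows.append((parts[0], parts[1]))
    return rows


sample = [
    "ham    you are right. As it is now, he's spending a tonne of his vote power.",
    "spam    UPVOTED & RESTEEMED",
    "spam\tvery nice",
]
for label, text in parse_training_rows(sample):
    print(label, "|", text)
```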

Technology Stack

  • Python 3.6
  • libraries: steem-python, scikit-learn, pandas, textblob, bs4

The repository contains a requirements.txt file.

Roadmap:

  • enlarging the training set
  • adding a new algorithm such as Support Vector Machine
  • taking into account previous comments, not only current one
  • adding to blacklist / whitelist
  • taking into account user reputation
  • tuning parameters in existing algorithms
  • adding a new algorithm such as Neural Network and maybe Random Forest
  • enlarging the training set (again)



Posted on Utopian.io - Rewarding Open Source Contributors


Thank you for the contribution. It has been approved.

You can contact us on Discord.
[utopian-moderator]

Hey @vladimir-simovic, I just gave you a tip for your hard work on moderation. Upvote this comment to support the utopian moderators and increase your future rewards!

I am very glad to see this!
This could improve the quality of the comments.

This is a good, useful, helpful and beneficial bot for the community, like @cheetah.

A further improvement would be for the bot to get a lot of Steem Power and flag these spam comments itself.

I think comments like "Nice post", "Nice photo" and "Please follow me" should be automatically flagged as spam (especially if their authors upvote their own comments). These comments are meaningless to the author of the original post, and they show that their writers don't really care about the post or its author; they only want attention for their own profiles and posts.

Great work! :)

What I really want to see is a REST API where I just submit the comment URL or comment body and get back the probability that it is spam. That would be epic.

Upvote + resteem
That's all I can do for now.
Regards.

PS. Now I know: I'll write more about this!

Amazing, I think @emrebeyler was thinking about developing this. Can't wait to see it in action!

Impressive! What is the efficiency ratio (without false-positive)?

I divided the entire dataset in the ratio 80:20 into a training and test set. After training the model, I carried out the test, resulting in the following confusion matrix.

As you can see, the results are pretty good here. Only one type II error (False Negative).
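An 80:20 split like the one described can be sketched as follows. The `split_dataset` helper, the shuffling and the fixed seed are assumptions; the post does not say how the rows were assigned to the two sets.

```python
import random


def split_dataset(rows, train_share=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


rows = [("spam", f"comment {i}") for i in range(10)]
train, test = split_dataset(rows)
print(len(train), len(test))
```

Shuffling before the cut matters when the file is sorted by label, as the training sample in the post is; otherwise the test set would contain only one class.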

But the real challenge here is defining precisely what is spam and what is not. At the beginning I was probably overzealous; now I try to be more balanced. For example, I do not treat single comments like "nice photo" or "please follow me" as true spam, only those that are repeated over and over again.

That is why I constantly analyze the results and, if necessary, I make corrections in the data set / parameters, so that it all works well in practice, and not only in theory. The other thing is the fact that a lot of comments that the bot classifies as spam have to be ignored (bid-bots, photocontests, welcoming users) because marking them as spam would not end well :)
