Machine learning and Steem #2: Multiclass classification (content creator vs scammer vs comment spammer vs bid-bot)
Repository
https://github.com/keras-team/keras
What Will I Learn?
- Collect data with beem and from SteemSQL
- Preprocess data to improve classification accuracy
- Visualize features
- Build a neural network model for a multiclass classification problem
- Measure model performance
- Use different neural network architectures
- Use different neural network optimizers
Requirements
- python
- basic concepts of data analysis / machine learning
Tools:
- python 3
- pandas
- matplotlib
- seaborn
- jupyter notebook
- scikit-learn
- keras + tensorflow (as the keras backend)
It looks like a lot of libraries, but it's a standard python toolset for data analysis / machine learning.
Difficulty
- Intermediate
Tutorial Contents
- Problem description
- Collecting account list for different classes
- Collecting account features from beem
- Collecting account features from SteemSQL
- Visualization of the dataset
- Preprocessing
- Building model for multiclass classification problem
- Measuring model performance
- Comparing different neural network architectures
- Comparing different neural network optimizers
- Conclusions
Problem description
The purpose of this tutorial is to learn how to build a neural network model for a multiclass classification problem.
We will use the following classes:
- content creator
- scammer (an account that writes comments with links to phishing sites)
- comment spammer (an account that spams short comments such as "nice", "beautiful", "upvoted")
- bid-bot
Each class will have 100 records.
Collecting account list for different classes
Accounts creating content will be collected from the SteemSQL database with the following query:
SELECT TOP 100 author
FROM Comments (NOLOCK)
WHERE depth = 0 AND
category in ('utopian-io', 'dtube', 'dlive')
ORDER BY NEWID()
Scammers will be selected from the comments of the user guard, who warns about this type of account. For example:
SELECT TOP 100 account
FROM (
SELECT SUBSTRING(body, CHARINDEX('@', body) + 1, CHARINDEX(' leads', body) - CHARINDEX('@', body) - 1) as account, *
FROM Comments (NOLOCK)
WHERE depth > 0 AND
author = 'guard' AND CONTAINS(body, 'phishing')) C
GROUP BY account
ORDER BY COUNT(*) DESC
The users who have written the most comments such as "nice", "beautiful", "upvoted" will be selected as comment spammers:
SELECT TOP 100 author
FROM Comments (NOLOCK)
WHERE depth = 1 AND
created BETWEEN GETUTCDATE() - 90 AND GETUTCDATE() AND
((CONTAINS (body, 'nice') AND body LIKE 'nice') OR
(CONTAINS (body, 'beautiful') AND body LIKE 'beautiful') OR
(CONTAINS (body, 'upvoted') AND body LIKE 'upvoted'))
GROUP BY author
ORDER BY COUNT(*) DESC
The list of bid-bots has been collected manually from https://steembottracker.com/.
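The post does not show how these queries are run from Python. Below is a minimal sketch using pyodbc; the driver, server address, and credentials are placeholders, and the helper name run_query and the *_sql variables are mine, introduced just for illustration:

import pyodbc

# Placeholder connection details for SteemSQL; substitute real credentials.
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=vip.steemsql.com;DATABASE=DBSteem;'
                      'UID=<user>;PWD=<password>')
cursor = conn.cursor()

def run_query(sql):
    # Collect the first column of every result row into a plain list.
    return [row[0] for row in cursor.execute(sql)]

# Hypothetical variables holding the three queries shown above.
content_creators = run_query(content_creators_sql)
scammers = run_query(scammers_sql)
comment_spammers = run_query(comment_spammers_sql)
bid_bots = [...]  # filled in by hand from steembottracker.com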
Collecting account features from beem
The features to be analysed are:
- followers
- followings
- follow ratio
- muters
- reputation
- effective sp
- own sp
- sp ratio
- curation_rewards
- posting_rewards
- witnesses_voted_for
- posts
- average_post_len
- comments
- average_comment_len
- comments_with_link_ratio
- posts_to_comments_ratio
Some of the features will be collected with the beem script below.
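One detail the script glosses over: beem needs a Steem RPC node to query. By default it uses its built-in node list, but it can be pointed at a specific node before the loop runs (a sketch; the node URL is just an example):

from beem import Steem
from beem.instance import set_shared_steem_instance

# Make all subsequent Account(...) calls use this node (URL is an example).
set_shared_steem_instance(Steem(node='https://api.steemit.com'))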
from collections import defaultdict
from beem.account import Account

full_list = content_creators + scammers + comment_spammers + bid_bots
d = defaultdict(lambda: defaultdict(int))
for name in full_list:
    account = Account(name)
    foll = account.get_follow_count()
    d[name]['name'] = name
    d[name]['followers'] = foll['follower_count']
    d[name]['followings'] = foll['following_count']
    # Guard against accounts with zero followers.
    d[name]['follow ratio'] = (foll['following_count'] / foll['follower_count']
                               if foll['follower_count'] > 0 else 0)
    d[name]['muters'] = len(account.get_muters())
    d[name]['reputation'] = account.get_reputation()
    d[name]['effective sp'] = account.get_steem_power()
    own_sp = account.get_steem_power(onlyOwnSP=True)
    d[name]['own sp'] = own_sp
    d[name]['sp ratio'] = account.get_steem_power() / own_sp if own_sp > 0 else 0
Collecting account features from SteemSQL
The remaining features will be downloaded from the SteemSQL database. This will be done with several separate queries, to keep them simple.
The following query will collect features such as:
- curation_rewards
- posting_rewards
- witnesses_voted_for
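The queries below use a helper to_sql_list that the post does not define; judging by how it is concatenated into the IN clauses, it presumably renders a Python list as a SQL value list (assumed behavior):

def to_sql_list(names):
    # ['a', 'b'] -> "('a', 'b')" -- assumed, based on how it is used below.
    return '(' + ', '.join("'" + name + "'" for name in names) + ')'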
query = """\
SELECT
name,
curation_rewards,
posting_rewards,
witnesses_voted_for
FROM Accounts (NOLOCK) a
WHERE name in """ + to_sql_list(full_list)
for row in cursor.execute(query):
    name = row[0]
    curation_rewards = row[1] / 1000.0
    posting_rewards = row[2] / 1000.0
    witnesses_voted_for = row[3]
    d[name]['curation_rewards'] = curation_rewards
    d[name]['posting_rewards'] = posting_rewards
    d[name]['witnesses_voted_for'] = witnesses_voted_for
The following query will collect features such as:
- posts
- average_post_len
query = """\
SELECT
author,
COUNT(*),
AVG(LEN(body))
FROM Comments (NOLOCK) a
WHERE depth = 0 AND
created BETWEEN GETUTCDATE() - 90 AND GETUTCDATE()
AND author in """ + to_sql_list(full_list) + """
GROUP BY author"""
for row in cursor.execute(query):
    name = row[0]
    posts = row[1]
    average_post_len = row[2]
    d[name]['posts'] = posts
    d[name]['average_post_len'] = average_post_len
The last query will collect features such as:
- comments
- average_comment_len
- comments_with_link_ratio
- posts_to_comments_ratio
query="""\
SELECT
author,
COUNT(*),
AVG(LEN(body)),
CAST(SUM(CASE WHEN body LIKE '%http%' THEN 1 ELSE 0 END) as DECIMAL(10, 3)) / COUNT(*)
FROM Comments (NOLOCK) a
WHERE depth > 0 AND
created BETWEEN GETUTCDATE() - 90 AND GETUTCDATE()
AND author in """ + to_sql_list(full_list) + """
GROUP BY author
"""
for row in cursor.execute(query):
    name = row[0]
    comments = row[1]
    average_comment_len = row[2]
    comments_with_link_ratio = row[3]
    d[name]['comments'] = comments
    d[name]['average_comment_len'] = average_comment_len
    d[name]['comments_with_link_ratio'] = comments_with_link_ratio
    d[name]['posts_to_comments_ratio'] = d[name]['posts'] / comments if comments > 0 else 0
You also need to add the appropriate class, depending on which set the account belongs to.
def to_class(name):
    if name in content_creators:
        return 0
    elif name in scammers:
        return 1
    elif name in comment_spammers:
        return 2
    else:
        return 3

for name in full_list:
    d[name]['class'] = to_class(name)
All the data will be saved to the file data.csv.
columns = ['name', 'followers', 'followings', 'follow ratio', 'muters',
'reputation', 'effective sp', 'own sp', 'sp ratio', 'curation_rewards',
'posting_rewards', 'witnesses_voted_for', 'posts', 'average_post_len', 'comments',
'average_comment_len', 'comments_with_link_ratio', 'posts_to_comments_ratio', 'class']
with open('data.csv', 'w') as f:
    f.write(','.join(columns) + '\n')
    for name in full_list:
        row = [d[name][column] for column in columns]
        f.write(','.join(map(str, row)) + '\n')
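As an aside, the same CSV can be produced more compactly with pandas (an equivalent alternative, not the author's code):

import pandas as pd

# Build a DataFrame from the nested dict, keep the column order, skip the index.
pd.DataFrame.from_dict(d, orient='index')[columns].to_csv('data.csv', index=False)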
Visualization of the dataset
The dataset is loaded with the following command.
import pandas as pd

df = pd.read_csv('data.csv', index_col=None, sep=",")
It looks as follows:
name followers followings follow ratio muters reputation effective sp own sp sp ratio curation_rewards posting_rewards witnesses_voted_for posts average_post_len comments average_comment_len comments_with_link_ratio posts_to_comments_ratio class
0 alucare 556 119 0.214029 1 58.640845 4.417161e+02 441.716135 1.000000 21.275 433.544 4 111 630 387 97 0.074935 0.286822 0
1 ayufitri 644 318 0.493789 4 46.679335 1.503135e+01 8.926123 1.683973 0.037 17.479 30 301 1016 717 125 0.054393 0.419805 0
2 imh3ll 87 26 0.298851 0 53.589527 5.795137e-01 67.930278 0.008531 0.545 132.534 0 0 0 2 24 0.000000 0.000000 0
3 andeladenaro 205 75 0.365854 0 56.414916 9.938178e+01 99.381785 1.000000 0.158 195.882 1 0 0 0 0 0.000000 0.000000 0
4 shenoy 1225 1576 1.286531 3 65.416593 2.577685e+03 2063.983517 1.248888 46.552 2581.548 0 88 568 41 34 0.048780 2.146341 0
5 tidylive 542 70 0.129151 2 64.583380 8.252583e+02 825.258289 1.000000 18.131 1891.259 1 88 496 61 99 0.245902 1.442623 0
6 arpita182 26 0 0.000000 0 27.253055 1.505696e+01 0.173166 86.950855 0.000 0.143 0 0 0 0 0 0.000000 0.000000 0
7 world-of-music 45 1 0.022222 5 15.326558 5.008056e+00 0.100673 49.745673 0.000 0.000 0 0 0 0 0 0.000000 0.000000 0
8 mirnasahara 1145 2038 1.779913 2 47.750057 1.563195e+01 15.631949 1.000000 0.128 29.303 0 46 74 12 29 0.000000 3.833333 0
9 haveaniceday 57 2 0.035088 0 42.550673 8.587180e-01 0.858718 1.000000 0.000 2.137 0 0 0 0 0 0.000000 0.000000 0
10 blackhypnotik 513 70 0.136452 4 61.570600 4.559381e+02 455.938075 1.000000 5.423 893.037 1 189 955 70 372 0.671429 2.700000 0
11 openingiszwagra 207 42 0.202899 0 25.000000 1.505430e+01 0.100362 149.999441 0.001 0.000 0 47 395 7 28 0.000000 6.714286 0
12 swelker101 1610 131 0.081366 13 63.245558 1.681717e+03 5831.538042 0.288383 61.037 1845.328 26 45 731 52 106 0.057692 0.865385 0
13 phoneinf 1032 19 0.018411 14 65.778281 6.021037e+03 2418.063434 2.490024 50.373 1789.770 7 451 659 1573 184 0.052765 0.286713 0
14 spawn09 989 93 0.094034 6 62.882242 1.253178e+02 125.317849 1.000000 10.626 1736.275 0 240 598 65 32 0.015385 3.692308 0
15 dschense 112 34 0.303571 1 28.768973 5.017238e+00 0.637683 7.867919 0.012 0.233 0 0 0 0 0 0.000000 0.000000 0
16 brainpod 310 187 0.603226 0 53.062904 3.407024e+01 34.070244 1.000000 0.047 77.218 2 44 1219 64 123 0.031250 0.687500 0
17 jezieshasan 350 216 0.617143 1 42.953291 6.709893e+00 5.705882 1.175961 0.447 12.838 0 0 0 0 0 0.000000 0.000000 0
18 hmagellan 453 169 0.373068 0 45.283351 1.004239e+01 689.153713 0.014572 0.596 30.575 0 0 0 3 224 0.000000 0.000000 0
19 marishkamoon 191 1 0.005236 0 35.302759 1.326083e+01 13.260832 1.000000 0.000 1.042 0 24 311 0 0 0.000000 0.000000 0
20 xee-shan 198 8 0.040404 0 52.254091 3.666877e+01 36.668770 1.000000 0.076 72.899 0 27 1516 3 44 0.000000 9.000000 0
21 marcmichael 300 78 0.260000 2 39.739864 1.504864e+01 1.925210 7.816625 0.001 2.900 0 80 769 2 91 0.000000 40.000000 0
22 ridvanunver 425 53 0.124706 0 57.346462 1.128660e+02 112.865950 1.000000 0.211 265.501 1 86 927 16 62 0.062500 5.375000 0
23 cheaky 1131 48 0.042440 0 66.096330 1.512192e+03 1512.191914 1.000000 65.840 3511.447 2 71 309 30 27 0.000000 2.366667 0
24 dbroze 2385 76 0.031866 10 66.849140 2.073023e+03 2073.022665 1.000000 56.955 3972.949 0 60 2492 4 88 0.250000 15.000000 0
25 jarendesta 1911 55 0.028781 6 70.460557 1.206929e+04 12069.286644 1.000000 235.837 9692.369 0 101 1771 91 68 0.000000 1.109890 0
26 bitcoinmarketss 628 606 0.964968 3 -1.445307 1.412191e+01 0.646501 21.843597 0.007 0.078 3 2 137 1 9 0.000000 2.000000 0
27 revehs 180 0 0.000000 0 25.000000 1.505181e+01 0.109308 137.700692 0.000 0.017 0 33 531 0 0 0.000000 0.000000 0
28 skyshine 504 847 1.680556 0 52.665926 4.753474e+01 47.534744 1.000000 0.118 94.111 1 38 296 4 29 0.000000 9.500000 0
29 adamkokesh 14144 8882 0.627969 100 73.792439 5.002053e+05 5791.887941 86.363087 4782.833 22908.451 0 127 2114 23 122 0.043478 5.521739 0
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Let's look at the dataset first. We can expect, for example, that:
- scammers and comment spammers have low reputation
- scammers and comment spammers add a lot of comments
- most of the comments from scammers contain a link to some site
- bid-bots have the highest effective STEEM POWER
- bid-bots get most of their STEEM POWER from delegation
- bid-bots follow fewer users
Many more such hypotheses could be added. Now let's look at the average feature values for each class.
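The per-class averages shown below can be reproduced with a one-line pandas groupby (this step is not shown in the post):

# Mean of every numeric feature per class
# (0 = content creator, 1 = scammer, 2 = comment spammer, 3 = bid-bot).
print(df.groupby('class').mean())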
Feature | content-creator average | scammer average | comment-spammer average | bid-bot average |
---|---|---|---|---|
followers | 1056.673 | 340.810 | 577.545 | 2328.080 |
followings | 535.990 | 382.760 | 456.354 | 1753.760 |
follow ratio | 0.448 | 0.764 | 0.757 | 0.420 |
muters | 5.673 | 5.520 | 2.303 | 14.820 |
reputation | 49.285 | 13.538 | 41.884 | 45.506 |
effective sp | 5353.143 | 104.365 | 226.147 | 214445.739 |
own sp | 731.946 | 96.332 | 1224.500 | 12243.641 |
sp ratio | 33167243.073 | 22.164 | 28.494 | 121.184 |
curation_rewards | 155.473 | 2.196 | 13.305 | 4703.424 |
posting_rewards | 1800.861 | 78.733 | 149.129 | 938.840 |
witnesses_voted_for | 4.020 | 3.540 | 3.010 | 2.270 |
posts | 55.980 | 20.540 | 64.677 | 15.620 |
average_post_len | 876.465 | 961.980 | 939.040 | 1930.990 |
comments | 101.495 | 121.200 | 360.596 | 3390.350 |
average_comment_len | 71.990 | 67.580 | 27.818 | 408.540 |
comments_with_link_ratio | 0.045 | 0.179 | 0.025 | 0.481 |
posts_to_comments_ratio | 3.395 | 2.692 | 0.307 | 0.159 |
We can see large differences in the average values here. However, this is not enough. Let's also look at some charts.
(Charts omitted: followers + followings; muters + reputation; follow ratio + sp ratio; curation rewards + posting rewards; posts + comments; average_post_len + average_comment_len; comments_with_link_ratio + posts_to_comments_ratio.)
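The original charts are not reproduced here, but pairwise plots of this kind are easy to recreate with seaborn (a sketch; the author's exact plotting code is not shown):

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter two features against each other, coloured by class; log scales
# help because the values span several orders of magnitude.
sns.scatterplot(x='followers', y='followings', hue='class', data=df)
plt.xscale('log')
plt.yscale('log')
plt.show()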
This gives us a much better picture of the situation. It is very important to have a good understanding of our data.
Preprocessing
Preprocessing prepares the data so that it is more useful to the model being built. We will use QuantileTransformer.
from sklearn.preprocessing import QuantileTransformer
from keras.utils import to_categorical

# 'name' is a string and 'class' is the label, so neither belongs in X.
X_cols = [c for c in columns if c not in ('name', 'class')]
y_cols = ['class']
X = pd.DataFrame(QuantileTransformer().fit_transform(df[X_cols]))
y = to_categorical(df[y_cols])
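The X_train/X_test and y_train/y_test sets used in the next sections are not derived in the post; a standard split would look like this (the 80/20 ratio is my assumption):

from sklearn.model_selection import train_test_split

# Hold out 20% of the accounts for testing (the exact ratio used in the
# post is not stated, so this is an assumption).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)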
The difference between raw and preprocessed data is shown in the charts below.

(Charts omitted: raw data vs. StandardScaler vs. QuantileTransformer.)

StandardScaler standardizes features by removing the mean and scaling to unit variance. QuantileTransformer not only maps the data into the range [0, 1], but also spreads it evenly across that range.
For the neural network, the best input is data scaled to the range [0, 1].
Building model for multiclass classification problem
The code building the neural network model is relatively simple. There are only two layers (no hidden layer for now):
- input layer with 17 neurons
- output layer with 4 neurons
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(17, input_dim=17, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=40, batch_size=1, verbose=0)
score = model.evaluate(X_test, y_test, verbose=0)
y_pred = model.predict_classes(X_test)
A short glossary:
Name | Description |
---|---|
model.add | adds new layer |
Dense | fully connected layer |
relu | ReLU activation function |
softmax | activation that turns the outputs into a probability distribution over the classes |
categorical_crossentropy | loss used for multiclass classification problem |
adam | Adaptive Moment Estimation optimizer, basically RMSProp with momentum |
accuracy | percentage of correctly classified inputs |
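For reference, softmax turns the four output activations $z_i$ into class probabilities, and categorical crossentropy compares them against the one-hot label $y$:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{4} e^{z_j}}, \qquad L = -\sum_{i=1}^{4} y_i \log p_i$$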
Measuring model performance
After training the neural network on the training set, we want to check its effectiveness on the test set. To do this, we will display the accuracy and the confusion matrix, which shows how many records of each true class were assigned to each predicted class.
import numpy as np
from sklearn.metrics import confusion_matrix

print('accuracy: %.2f' % score[1])
cm = confusion_matrix(np.argmax(y_test, axis=1), y_pred)
plot_confusion_matrix(cm)
accuracy: 0.81
The obtained accuracy is 81%. That is acceptable, but it could be better.
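The plot_confusion_matrix helper used above is not a library function and its code is not shown in the post; a minimal version using seaborn could look like this:

import matplotlib.pyplot as plt
import seaborn as sns

def plot_confusion_matrix(cm):
    # Rows are true classes, columns are predicted classes (0 = content
    # creator, 1 = scammer, 2 = comment spammer, 3 = bid-bot).
    labels = ['creator', 'scammer', 'spammer', 'bid-bot']
    sns.heatmap(cm, annot=True, fmt='d',
                xticklabels=labels, yticklabels=labels)
    plt.xlabel('predicted class')
    plt.ylabel('true class')
    plt.show()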
Comparing different neural network architectures
We will now try to build a larger neural network. We will increase the number of neurons in the first layer and add two hidden layers.
model = Sequential()
model.add(Dense(85, input_dim=17, activation='relu'))
model.add(Dense(40, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(4, activation='softmax'))
accuracy: 0.83
Accuracy is slightly better, but the difference is small.
Comparing different neural network optimizers
We will try to use other optimizers. The optimizer is an algorithm for updating neural network weights during the learning process.
for optimizer in ['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam']:
    model = Sequential()
    model.add(Dense(85, input_dim=17, activation='relu'))
    model.add(Dense(40, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(4, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=40, batch_size=1, verbose=0)
    score = model.evaluate(X_test, y_test, verbose=0)
    y_pred = model.predict_classes(X_test)
    cm = confusion_matrix(np.argmax(y_test, axis=1), y_pred)
    plot_confusion_matrix(cm)
    print('%s|accuracy: %.2f' % (optimizer, score[1]))
Optimizer | Accuracy |
---|---|
sgd | 0.82 |
rmsprop | 0.78 |
adagrad | 0.80 |
adadelta | 0.80 |
adam | 0.80 |
adamax | 0.81 |
nadam | 0.81 |

(The per-optimizer confusion-matrix charts are omitted here.)
The results are almost the same as for the Adam optimizer used earlier. In my opinion, changing the neural network itself will not help much here. In order to increase accuracy, a much larger dataset would need to be collected, and possibly new features would need to be added.
Conclusions
- the bigger the dataset, the better we can train the neural network
- we should have a good understanding of our data
- preprocessing can make a huge difference
- building the optimal neural network model requires experimenting
- it is worth experimenting with different neural network architectures / optimizers