Natural Language Text Analysis for January 2018


MT Analysis Banner.png

Abstract

This is the third and final part of a Natural Language Text Analysis, and a follow-on to prior work that analyzed text and emoji. This analysis differs in that it focuses on natural language word-cluster analysis, seeking markers of successful or high-quality posts.

Scope

We will be considering data for the month of January 2018, including foreign languages and character sets. We exclude the following multimedia categories:

| Category | Excluded tags |
|---|---|
| Music & Video | dtube, youtube, music |
| Photography | photography, colorchallenge, architecturalphotography, vehiclephotography, photofeed, photodmania |
| Memes | decentmemes, meme |

The dataset includes 1,203,022 posts from 58,846 categories (excluding those above) and 110,474 distinct authors.

Tools

The analysis will be performed in R using only Open Source tools (particularly Quanteda) and on a 10-year-old MacBook(!).

Top 10 most popular categories by Post count

| Rank | Category | Total Posts | Avg Votes | Authors |
|---|---|---|---|---|
| 1 | life | 59,087 | 11.327619 | 16,804 |
| 2 | bitcoin | 37,777 | 11.784737 | 10,059 |
| 3 | news | 30,257 | 5.156856 | 3,973 |
| 4 | kr | 29,771 | 13.549830 | 3,282 |
| 5 | spanish | 29,626 | 29.367886 | 5,621 |
| 6 | cryptocurrency | 28,918 | 11.535791 | 8,443 |
| 7 | art | 28,389 | 12.981331 | 6,979 |
| 8 | steemit | 28,140 | 16.085821 | 10,653 |
| 9 | food | 26,751 | 11.792718 | 7,894 |
| 10 | introduceyourself | 21,460 | 13.443290 | 16,079 |

Vote Statistical Summary

Standard Deviation
    46.51859
Quantiles
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 -93    1    1    2    3    4    6    8   12   23 4023 
Summary
   Min. 1st Qu.  Median    Mean   3rd Qu.    Max. 
 -93.00    2.00    4.00   12.87   10.00     4023.00 

Subsetting Best & Worst Posts

Here we subset posts by votes (specifically, net_votes per post) and illustrate the authors of those posts. We will subset at the bottom 10th percentile (< 1 vote) and the top 90th percentile (> 23 votes).
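A minimal sketch of this subsetting step in R, assuming the post data sits in a data frame called `posts` with a `net_votes` column (illustrative names, not necessarily those used in the actual script):

```r
# Percentile cut-offs on net votes per post
cutoffs <- quantile(posts$net_votes, probs = c(0.10, 0.90))

# Bottom 10th percentile (< 1 vote) and top 90th percentile (> 23 votes)
worst_posts <- posts[posts$net_votes < cutoffs[[1]], ]
best_posts  <- posts[posts$net_votes > cutoffs[[2]], ]
```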

Illustrated here are the top and bottom 30-user cohorts from each subset.

Rplot02.png

Lexical Diversity

There is some evidence to suggest that vocabulary is correlated with intelligence. In this study we equate size of vocabulary with Lexical Diversity to test the hypothesis that smart users produce high performing, high quality content.

In Natural Language Processing, sentences are tokenized into individual words. Strictly speaking, a lexical token is a character vector and may not be an actual word. e.g. "Ah-ha!" would be a valid token. A Chinese character would also be a valid token.

In this analysis we use Token count as a proxy for Lexical Diversity. We also use the terms "word" and "token" interchangeably.
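A rough sketch of how the per-author token counts might be computed with Quanteda, assuming the post text lives in `best_posts$body` and the author name in `best_posts$author` (both hypothetical column names):

```r
library(quanteda)

# Tokenize the post bodies and build a document-feature matrix,
# then group by author so each row is one author's combined text
toks <- tokens(tolower(best_posts$body), remove_punct = TRUE)
author_dfm <- dfm_group(dfm(toks), groups = best_posts$author)

# Distinct tokens (types) per author -- the proxy for lexical diversity
lex_div <- ntype(author_dfm)
head(sort(lex_div, decreasing = TRUE), 30)
```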

Rplot01.png

This chart illustrates Global Lexical Diversity for all Posts. These are users with lexical diversity greater than 300.

This is an arbitrary cut-off. We know the average four-year-old native English speaker knows approximately 5,000 distinct words. I am intentionally setting the cut-off well below average to include non-native English speakers.

We can now compare these authors with high lexical diversity (big vocabularies and presumably smart) to high and low performing post authors (our subsets).

Highest Voted Authors

Our top performing user (@haejin) by vote count doesn't appear in our list of users with high lexical diversity (the smart people with big vocabularies). This doesn't suggest he isn't smart, just that he uses a narrow vocabulary. This user publishes specialized, technical content on Elliott Wave Analysis.

The following 14 authors from the Top Performing Post subset also appear in the High Lexical Diversity group. These users have large vocabularies and high votes.

| Rank | Author | Tokens | Total Votes |
|---|---|---|---|
| 1 | @glenalbrethsen | 1,991 | 23 |
| 2 | @amf6 | 1,381 | 91 |
| 3 | @karyroa | 1,224 | 40 |
| 4 | @aqiel | 864 | 37 |
| 5 | @orianandreina18 | 748 | 28 |
| 6 | @justyy | 689 | 5,602 |
| 7 | @gexi | 664 | 83 |
| 8 | @abialfatih | 551 | 30 |
| 9 | @svitlaangel | 551 | 23 |
| 10 | @lorenitaarmy | 517 | 3,513 |
| 11 | @mellisaramirez | 451 | 39 |
| 12 | @pataty69 | 410 | 52 |
| 13 | @meidy | 407 | 48 |
| 14 | @michaelizer | 347 | 164 |

Lowest Voted Authors

We also have six users with low-performing posts and high lexical diversity. These users have low-performing posts but also large vocabularies. This would suggest that being smart and having a big vocabulary is not an indicator of how well your content will perform.

| Rank | Author | Tokens | Total Votes |
|---|---|---|---|
| 1 | @ddd67 | 2,551 | 0 |
| 2 | @oneness | 2,284 | 0 |
| 3 | @karyroa | 1,224 | 0 |
| 4 | @shemzy | 453 | 0 |
| 5 | @meidy | 407 | 0 |
| 6 | @mhmtbhtyr | 315 | 0 |

Visual Comparison

Lining the plots up, we observe that the Top Performing cohort (middle chart) contains considerably more authors with Lexical Diversity greater than 300. We're seeing more users in the Top Performing cohort using large vocabularies. While we cannot draw conclusions, it would suggest there are more smart people in the top performing cohort.

Rplot03.png

Word Frequency

To analyze word frequency we must coerce our post data into a Corpus and then into a Document-Feature Matrix (DFM). Creating the Corpus and DFMs takes approximately 45 minutes of processing time.
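A hedged sketch of that step with Quanteda; `posts`, `permlink`, and `body` are assumed names rather than the exact ones used here:

```r
library(quanteda)

# Build a corpus from the post data frame
corp <- corpus(posts, docid_field = "permlink", text_field = "body")

# Tokenize, lower-case, and drop punctuation, then build the DFM
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)
posts_dfm <- dfm(toks)

topfeatures(posts_dfm, 20)   # 20 most frequent tokens
```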


Rplot07.png

During this process, I removed the following stop words. Stop words are discussed in earlier posts. This step is considered data preprocessing or cleansing.

Stop Word List

"steem","steemit","steemian","steemians","resteem","upvote","upvotes","post","SBD","SP","jpeg","jpg","png","www","com","td","re","nbsp","p","li","br","strong","quote","s3","amazonaws'com","steemit'com","steemitimages'com","img","height","width","src","center","em","html","de","href","hr","blockquote","h1","h2","h3","960","720","div","en","que","la","will","y","el","https","http","do","does","did","has","have","had","is","am","are","was","were","be","being","been","may","must","might","should","could","would","shall","will","can","un","get","alt","_blank","i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","yourselves","he","him","his","himself","she","her","hers","herself","it","its","itself", "they","them","their","theirs","themselves","what", "which","who","whom","this","that","these","those","am","is","are","was","were", "be","been","being","have","has","had", "having","do","does","did","doing","would", "should","could","ought","i'm","you're","he's", "she's","it's","we're","they're","i've","you've", "we've","they've","i'd","you'd","he'd","she'd", "we'd","they'd","i'll","you'll","he'll","she'll", "we'll","they'll","isn't","aren't","wasn't","weren't", "hasn't","haven't","hadn't","doesn't","don't","didn't", "won't","wouldn't","shan't","shouldn't","can't","cannot", "couldn't","mustn't","let's","that's","who's","what's", "here's","there's","when's","where's","why's","how's", "a","an","the","and","but","if", "or","because","as","until","while","of", "at","by","for","with","about","against", "between","into","through","during","before","after", "above","below","to","from","up","down", "in","out","on","off","over","under", "again","further","then","once","here","there", "when","where","why","how","all","any", "both","each","few","more","most","other", "some","such","no","nor","not","only", "own","same","so","than","too","very"

Sparse Word Removal

As part of data preparation for illustration we also remove sparse terms. This is known as trimming the DFM of sparse terms. It leaves us with words (terms, tokens) that are used more than 10 times and appear in more than 25% of posts.
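In Quanteda this trimming is done with `dfm_trim()`; a sketch under the assumption that the DFM is called `posts_dfm` (argument names are those of recent Quanteda releases; older versions used `min_count`):

```r
# Keep terms used at least 10 times overall and appearing in at least
# 25% of documents
trimmed_dfm <- dfm_trim(posts_dfm,
                        min_termfreq = 10,   termfreq_type = "count",
                        min_docfreq  = 0.25, docfreq_type  = "prop")
```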

Top Performer Cohort

When we examine the most frequently used words in the Top Performing Cohort, we again observe high lexical diversity. This word cloud illustrates words used more than 5,000 times across all Top Voted posts.
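The word cloud itself can be drawn with Quanteda's `textplot_wordcloud()` (moved to the quanteda.textplots package in later releases); `top_dfm` is an assumed name for the DFM built from the Top Performing subset:

```r
# Word cloud of terms used 5,000 or more times in the top-voted posts
textplot_wordcloud(top_dfm, min_count = 5000, max_words = 200)
```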

Rplot04.png

These are the Top 10 tokens in this wordcloud. I have chosen not to remove numbers at this point.

| Token | Count |
|---|---|
| image | 101,916 |
| one | 94,650 |
| 1 | 91,955 |
| like | 83,957 |
| time | 80,139 |
| just | 70,767 |
| also | 67,883 |
| 2 | 67,275 |
| people | 65,488 |
| new | 54,249 |

Bottom Performer Cohort

By comparison, the bottom performing posts have far fewer words used 5,000 or more times. The lexical diversity is much lower.

Rplot06.png

The Top 10 most frequently used words in this cohort appear to be a subset of those used in the Top Performer Cohort (above). In other words, a similar core vocabulary is used, but fewer times and in the context of a much narrower vocabulary.

| Token | Count |
|---|---|
| like | 90,445 |
| one | 76,413 |
| 1 | 72,181 |
| new | 72,170 |
| now | 69,627 |
| also | 66,945 |
| just | 62,857 |
| time | 61,338 |
| people | 54,410 |
| 2 | 50,133 |

Topic Models

Latent Dirichlet allocation (LDA) is a generative statistical model used to identify groups of similar words across documents. I'm using it here in an attempt to identify Topics or Themes.

This is as much art as science, as it requires manual tweaking of the algorithm's parameters. The algorithm is computationally intensive and takes a long time to run on my crappy MacBook, so I have not invested a lot of time seeking the optimum set of parameters.
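One way to fit such a model is via the topicmodels package, feeding it the trimmed Quanteda DFM; the package choice, k = 5, and the seed are illustrative assumptions rather than the exact setup used here:

```r
library(topicmodels)

# Convert the quanteda DFM into the format topicmodels expects,
# then fit a 5-topic LDA model
lda_input <- convert(trimmed_dfm, to = "topicmodels")
lda_fit   <- LDA(lda_input, k = 5, control = list(seed = 1234))

# Top 10 terms per topic, as tabulated below
terms(lda_fit, 10)
```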

Top Performer, Top 5 Topics

      Topic 1  Topic 2 Topic 3  Topic 4 Topic 5
 [1,] "people" "one"   "source" "image" "1"    
 [2,] "time"   "also"  "new"    "watch" "2"    
 [3,] "life"   "like"  "follow" "2017"  "3"    
 [4,] "just"   "first" "2018"   "part"  "4"    
 [5,] "know"   "even"  "use"    "made"  "5"    
 [6,] "day"    "way"   "also"   "2018"  "10"   
 [7,] "see"    "well"  "come"   "used"  "20"   
 [8,] "like"   "much"  "world"  "today" "7"    
 [9,] "go"     "just"  "year"   "day"   "6"    
[10,] "now"    "time"  "time"   "long"  "2018" 

Bottom Performer, Top 5 Topics

      Topic 1  Topic 2  Topic 3  Topic 4 Topic 5
 [1,] "time"   "one"    "also"   "1"     "like" 
 [2,] "people" "first"  "new"    "2"     "now"  
 [3,] "make"   "see"    "one"    "3"     "just" 
 [4,] "good"   "people" "make"   "new"   "new"  
 [5,] "know"   "just"   "first"  "first" "make" 
 [6,] "just"   "know"   "time"   "make"  "first"
 [7,] "like"   "make"   "people" "time"  "one"  
 [8,] "see"    "like"   "like"   "one"   "know" 
 [9,] "one"    "good"   "3"      "like"  "time" 
[10,] "first"  "time"   "2"      "good"  "3"  

Lexical Dispersion Plots

Using the most frequently used terms from above we can examine how they're used in the top performing posts by Author. This plot illustrates how authors are using particular high frequency words.
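A sketch of how such a lexical dispersion ("x-ray") plot can be produced with Quanteda, assuming `top_corpus` holds the top-performing posts and using a few terms from the frequency list above (`textplot_xray()` lives in quanteda.textplots in newer releases):

```r
# Keyword-in-context hits for a few high-frequency terms,
# plotted as lexical dispersion stripes per document
toks_top <- tokens(top_corpus)
textplot_xray(
  kwic(toks_top, pattern = "image"),
  kwic(toks_top, pattern = "people"),
  kwic(toks_top, pattern = "time")
)
```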

Rplot08.png

Hierarchical Clusters & Dendrograms

Finally, we perform a hierarchical cluster analysis using an agglomeration method called "ward.D". We are examining similar clusters across posts from authors in each subset. I have attempted to bin the clusters into 9 arbitrary groups.

We are suggesting the users in each red box are using statistically similar vocabulary.
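A sketch of this clustering step, assuming `subset_dfm` is a trimmed DFM for one cohort, grouped so that each document is one author (`textstat_dist()` sits in quanteda.textstats in newer releases):

```r
# Distance matrix between authors, then Ward agglomerative clustering
author_dfm <- dfm_weight(subset_dfm, scheme = "prop")
d  <- textstat_dist(author_dfm, method = "euclidean")
hc <- hclust(as.dist(d), method = "ward.D")

# Dendrogram binned into 9 arbitrary groups (the red boxes)
plot(hc, cex = 0.6)
rect.hclust(hc, k = 9)
```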

Rplot09.png

Rplot10.png

Conclusions

This analysis only scratches the surface and is severely limited by my time and crappy laptop. However, it suggests that authors with high lexical diversity (the potentially smarter ones) tend to gain more upvotes.

Readers should note some acknowledged confounding factors, such as:

  1. Some posts are duplicated and translated into two or more languages. This doubles or trebles their token count.
  2. I was unable to account for the effects of Resteeming posts of others.
  3. Some content is straight language translations of other work (e.g. Utopian.io translations)
  4. Several kanji (or other CJK) characters can equate to a single English word. This distorts the token count in favor of Asian-language speakers.



Posted on Utopian.io - Rewarding Open Source Contributors
