Exploratory Analysis: Statistical User Profiling for Data Driven Votes and Improved Business Performance
The world around us becomes more and more data driven as the days go by. Organisations of all sizes are using data to make informed decisions to grow their organisation, improve business processes and increase their bottom line.
Repository
https://github.com/steemit/steem
Proof of Brain
The current voting system on STEEM works something like this:
• You find content you like, you vote on it
• You have a friend you want to support, you vote on their content
• You want to show someone your gratitude, you vote
• You find a project worth supporting, you vote
• You sell votes
• You don’t bother voting
This all works around the proof of brain concept, where what is supposed to happen is that the best content is rewarded.
However, we all know that the best content is not always the content that gets the rewards and even the term ‘best content’ can be rather subjective.
Does the current voting system take into account the business needs of Steemit.com and future Dapps? For example, it has been well discussed that Steemit lacks a large, active middle class, and that the initial distribution of coins was weighted heavily towards a few accounts. The knock-on effect of these problems makes Steemit an uphill struggle for some. With such a small middle class on Steemit, creating a fairer distribution of tokens is massively hard work. This is only one of the business problems faced by the STEEM business ecosystem.
( Edit - the next paragraph has been edited from a statement to a question)
Does Proof of Brain voting favor creators that are working to build a sustainable platform? Does Proof of Brain voting support decentralized departments, such as a central marketing department (an example could be promo-steemit) or a central advertising committee for all DApps? Does Proof of Brain voting support accounts with a long-term vision? Does Proof of Brain voting support the creators who bring the traffic to Steemit in the first place? Does Proof of Brain voting support loyalty?
Yet in a business, these and many more operational aspects would be supported. This could be a major weakness, and my concern is that DApps created on the blockchain will face the same problems as Steemit.com and Steemit Inc. if they do not have a business and operational plan for growth and success.
What is Statistical User Profiling and how can it be used?
The STEEM blockchain is full of data. Metrics are captured that some companies could only dream of having. Yet this data is not used to improve the weaknesses of Steemit.com.
Steemit.com, like any future Dapp created on the blockchain, is a social media, social networking, content sharing and publishing type of platform. Its user base is both its supplier and its customer. Who hasn’t heard of customer profiling? Most of us have seen it in action. A simple example is an online store, which can use data to know where you come from and offer you location-based pricing.
Customer profiling has also led to improvements in machine learning algorithms. Look at Amazon, for example: using customer profiling and other techniques, they can identify customers that are likely to abandon their carts. With this information, they can aim smart marketing at those customers, and have managed to reduce the number of abandoned carts.
Statistical user profiling does not have to be complex. I have carried out a sample analysis to give you a demonstration and put forward some ways it can be used to identify accounts to vote on that are in the best business interests of the platform.
STEEM data Statistical user profiling - Actual analysis
For this analysis I have explored the data for one Dapp on the steem blockchain for the period of July. I do not wish to divulge the actual app in question as the purpose is to show you what is possible. Only one sample of data is therefore needed for this purpose as the aim is not to compare Dapps.
The code for this analysis can be found at the bottom of the post, however I have blacked out the app name from the code.
Choice of Analysis
For this example, I have chosen to carry out a descriptive statistical analysis of the user base.
Descriptive Statistics provide a summary of a given data set and they are broken down into measures of central tendency and measures of spread.
A typical descriptive analysis would include a histogram, maybe a box plot and the following metrics:
- Mean: This is the average.
- Standard Error: measures how far the sample mean of the data is likely to be from the true population mean.
- Median: The central number.
- Mode: The number that shows up most.
- Standard Deviation: the typical distance of the values from the mean, calculated as the square root of the variance.
- Sample Variance: Measure of spread, mathematically defined as the sum of the squared differences from the mean divided by n - 1 (one less than the number of values).
- Kurtosis: measures the amount of probability in the tails. The value is often compared to the kurtosis of the normal distribution, which is equal to 3. If the kurtosis is greater than 3, then the dataset has heavier tails than a normal distribution. Note that Excel reports excess kurtosis (kurtosis minus 3), so there the comparison value is 0.
- Skewness: measures the asymmetry of the distribution. Negatively skewed data is left-skewed and positively skewed data is right-skewed.
- Range: This is the total spread of the data.
- Minimum: Will return the lowest value
- Maximum: Will return the highest value
- Sum: Will add all the values together
- Count: Will count the number of values
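As a rough illustration, the metrics above could be computed along these lines in Python. This is only a sketch: the analysis in this post was done in Excel, and the function name `describe` and the sample values are made up for the example. Skewness and kurtosis use the sample-adjusted formulas that, to my understanding, Excel's SKEW and KURT functions use, so the kurtosis here is excess kurtosis (a normal distribution gives 0, not 3).

```python
import math
import statistics


def describe(values):
    """Return a dictionary of descriptive statistics for a list of numbers."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values for sample kurtosis")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    if sd == 0:
        raise ValueError("all values are identical; skewness is undefined")

    # Sums of standardised deviations raised to the 3rd and 4th power
    z3 = sum(((x - mean) / sd) ** 3 for x in values)
    z4 = sum(((x - mean) / sd) ** 4 for x in values)

    skewness = n / ((n - 1) * (n - 2)) * z3
    kurtosis = (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )

    return {
        "mean": mean,
        "standard_error": sd / math.sqrt(n),
        "median": statistics.median(values),
        "mode": statistics.mode(values),  # first mode if there are several
        "standard_deviation": sd,
        "sample_variance": statistics.variance(values),
        "kurtosis": kurtosis,  # excess kurtosis
        "skewness": skewness,
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }


# Made-up sample: % of posts made via the app for eight accounts
sample = [10, 20, 20, 35, 50, 80, 95, 100]
stats = describe(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Running the same function on the full population and on the supported subset gives two tables that can be compared side by side.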
By comparing an entire population (all creators posting to a particular Dapp) to a population subset (creators supported by votes from the Dapp), areas of weakness can be quickly identified. KPIs can be set and monitored and business objectives can be aligned.
Descriptive statistics give you an idea of where you are not performing well and where to start looking for improvements.
Example 1.
The app wants to reward loyal creators. For these examples the entire population would be all the contributors that have posted to the app. We could take a count of their total posts and then a count of posts made via the app or within the tag. From here we can get a % of posts made via the app and then run our descriptive stats on this.
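The calculation described above could be sketched like this in Python. The data layout (one `(author, posted_via_app)` pair per post), the function names and the sample rows are all assumptions for illustration; in the actual analysis the counts came from the queries at the end of the post and the histogram was built in Excel.

```python
from collections import Counter

# Hypothetical rows: one (author, posted_via_app) pair per top-level post
posts = [
    ("alice", True), ("alice", True), ("alice", False),
    ("bob", True),
    ("carol", True), ("carol", False), ("carol", False), ("carol", False),
]


def app_share_by_author(posts):
    """Return the percentage of each author's posts that were made via the app."""
    total = Counter()
    via_app = Counter()
    for author, posted_via_app in posts:
        total[author] += 1
        if posted_via_app:
            via_app[author] += 1
    return {author: 100 * via_app[author] / total[author] for author in total}


def histogram(shares, width=10):
    """Count accounts per bucket, keyed by the bucket's lower bound.

    A share of exactly 100 is placed in the last bucket (90-100).
    """
    buckets = Counter()
    for share in shares:
        lower = min(int(share // width) * width, 100 - width)
        buckets[lower] += 1
    return dict(sorted(buckets.items()))


shares = app_share_by_author(posts)
print(shares)
print(histogram(shares.values()))
```

The list of percentages in `shares` is what the descriptive statistics would then be run on.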
Below is a sample output. We can see from the histogram that a large portion of accounts, 1636, have posted 90-100% of their posts via the app. These are the most loyal creators (if that is the metric chosen).
Our population subset for comparison would be all the authors supported with votes directly from the Dapp. Let’s see if we reward our loyal users well.
We can see from the histogram that we do tend to reward loyal users more than other users, as the highest bar is in the 91-100% bracket. 75% of the accounts voted on have 83% or less of their posts done via the app. 25% of the accounts voted on have only 26% or less of their posts done via the app or tag.
The median value of the supported subset is higher than the median value of the total population. Without a target or KPI, this data would be meaningless. However, aligned with a business objective, it could add a lot of value and help identify accounts to be voted on based on loyalty.
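To show how such a population versus subset comparison might look in code, here is a small Python sketch. The two lists are invented numbers, not the data from this analysis, and `quartiles` is a hypothetical helper. It uses the "inclusive" method of `statistics.quantiles`, which interpolates between data points and treats the smallest and largest values as the 0th and 100th percentiles.

```python
import statistics


def quartiles(values):
    """Return the 25th, 50th and 75th percentiles of the values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3}


# Made-up data: % of posts made via the app
population = [5, 10, 20, 40, 60, 80, 90, 95, 100, 100]  # everyone posting to the app
supported = [20, 60, 90, 100, 100]  # only the accounts the app voted on

pop_q = quartiles(population)
sup_q = quartiles(supported)
print("population:", pop_q)
print("supported: ", sup_q)

# A simple check that could feed a KPI: is the supported median above the population median?
print("supported median is higher:", sup_q["median"] > pop_q["median"])
```

What counts as a good result depends entirely on the target the app sets for itself.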
Example 2.
The app wants to reward users based on the age of the account (this could be another loyalty measure). Age can be quickly calculated by subtracting the account creation date from the current date. We can then run our descriptive stats to judge the distribution of the data.
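The age calculation is a simple date subtraction. A minimal Python sketch, with made-up account names and creation dates, and a fixed reference date instead of today's date so the result is repeatable:

```python
from datetime import date


def account_age_days(created, as_of):
    """Return the account age in whole days on the reference date."""
    return (as_of - created).days


# Hypothetical accounts and creation dates
accounts = {
    "alice": date(2018, 6, 15),
    "bob": date(2017, 12, 1),
}
as_of = date(2018, 8, 1)  # reference date chosen for the example

ages = {name: account_age_days(created, as_of) for name, created in accounts.items()}
print(ages)
```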
Looking at the full population, we can see that a large portion of the posts are coming from accounts less than 57 days old. The median age of accounts posting is 157 days, so 50% of the posts are made from accounts older than this and 50% from accounts that are younger. 50% of creators are between 47 and 261 days old.
Let’s look now at our subset of data, the accounts receiving support via votes.
From this we can see the largest supported account age bracket is accounts that are 160-210 days old. The histogram for the subset is more normally distributed than the entire population, although both sets of data are skewed to the right.
Again, it would really depend on the business needs and set KPIs to make decisions based on this data. If the organisation wanted to support new accounts that post 90% via their app, this analysis makes it very easy to find the accounts that should be targeted.
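Once each account has an age and a share of posts via the app, picking out the accounts to target is a simple filter. The sketch below is an assumption of how that might look in Python: the `Profile` class, the `vote_candidates` function, the sample accounts and the thresholds are all hypothetical and would be set by the app's own KPIs.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    name: str
    age_days: int      # account age in days
    app_share: float   # % of the account's posts made via the app


def vote_candidates(profiles, max_age_days, min_app_share):
    """Return the names of accounts that are young enough and loyal enough."""
    return [
        p.name
        for p in profiles
        if p.age_days <= max_age_days and p.app_share >= min_app_share
    ]


# Made-up profiles
profiles = [
    Profile("alice", 30, 95.0),
    Profile("bob", 400, 100.0),   # loyal, but not a new account
    Profile("carol", 20, 40.0),   # new, but posts mostly elsewhere
    Profile("dave", 57, 90.0),
]

# Hypothetical thresholds: accounts up to 57 days old posting at least 90% via the app
candidates = vote_candidates(profiles, max_age_days=57, min_app_share=90.0)
print(candidates)
```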
Further Examples
Dapps may have a business need to support accounts that power up, or that do not power down.
The median of the supported sample in this case is higher than the median of the general population. Is this in line with the organisational goals? Only 152 of the accounts voted on have powered up, whereas 1004 accounts powered up in total. These 152 accounts powered up 79K whereas the entire population powered up 146K, so this data could suggest that the larger power ups are being rewarded. Obviously more work would be required to confirm this, however with an analysis like this, it’s easy to drill down into more details.
We can also look at the power downs.
Again, the business objectives would decide which metrics are important.
Owned SP could be another influencing factor when it comes to data driven voting.
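The power-up figures quoted above can be turned into shares and averages with a few lines of arithmetic. This Python sketch uses the rounded values from the text (79K and 146K), so the results are approximate, and it assumes the 152 voted accounts are part of the 1004 accounts that powered up.

```python
# Figures quoted in the text (the amounts are rounded)
voted_accounts = 152
all_accounts = 1004
voted_amount = 79_000
all_amount = 146_000

# Share of powering-up accounts that were voted on, and their share of the amount
account_share = voted_accounts / all_accounts
amount_share = voted_amount / all_amount

# Average power up per account, voted versus not voted
avg_voted = voted_amount / voted_accounts
avg_not_voted = (all_amount - voted_amount) / (all_accounts - voted_accounts)

print(f"voted accounts as share of all accounts: {account_share:.1%}")
print(f"their share of the amount powered up:    {amount_share:.1%}")
print(f"average power up, voted accounts:        {avg_voted:.0f}")
print(f"average power up, other accounts:        {avg_not_voted:.0f}")
```

A small share of accounts holding a large share of the amount is what hints that larger power ups are being rewarded, although averages alone cannot confirm it.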
Above shows the entire population sample and below the supported subset.
Other metrics that could be easily included in profiling users
Other metrics that we could quickly take from the blockchain include, but are not limited to:
• Number of comments, and depth of comments the user leaves and/or receives
• Number of votes and voting weight used/received
• Reputation
• Posting language
• Controlling SP as opposed to Owned SP
Non-blockchain data can also be used. This data would have to be captured by the Dapp as part of its normal operations:
• Geolocation
• Traffic (both unique and total views)
• UA currently under development by @scipio
Conclusion
Data driven voting could easily be used on Steemit.com, DApps and future SMTs. The biggest drawback of using it is that quality of content cannot be measured so easily with descriptive statistics (although other data driven metrics could be established).
However, it could solve other problems faced by Steemit, one of which is the lack of a middle class. Maybe @ned could consider using data-based voting with the remaining SP held by @Misterdelegation to improve Steemit.com.
Oversights have been made by Steemit Inc in terms of running Steemit.com. I understand that Steemit Inc’s focus is on the blockchain and not Steemit.com. My hope is that the developers of the apps are focused not just on developing the app but also on the business.
In this day and age, data drives business and business success. Should data driven voting be now given some consideration? Should it be considered as part of the growth strategy for new Apps and SMTs on the STEEM blockchain? How much data driven voting do you think is happening at the moment?
Data and queries
I wish to keep the name of the app private as I do not wish to seem biased in favor of any particular app. Therefore I have amended the code to include XXX instead of the identifier. The data was collected, transformed and modeled using Power BI; however, Power BI does not include the functions needed for a statistical analysis like the one above. I therefore transferred the modeled data into Excel to carry out the descriptive statistics and produce the visualizations.
All data was collected using SteemSQL, held and managed by @arcange.
All Posts
let
// Note: BETWEEN is inclusive at both ends, so this filter also returns rows dated 2018-08-01. The same applies to the other date filters in this section.
Source = Sql.Database("vip.steemsql.com", "DBSteem", [Query="select id, author, created, category, total_payout_value from comments#(lf) where CONVERT(DATE,created) BETWEEN '2018-07-01' AND '2018-08-01' and depth = 0"]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"created", type date}})
in
#"Changed Type"
SP Power Down
let
Source = Sql.Database("vip.steemsql.com", "DBSteem", [Query="select timestamp, from_account, deposited#(lf)from VOFillVestingWithdraws#(lf) where CONVERT(DATE,timestamp) BETWEEN '2018-07-01' AND '2018-08-01'"]),
#"Replaced Value" = Table.ReplaceValue(Source,"STEEM","",Replacer.ReplaceText,{"deposited"}),
#"Changed Type" = Table.TransformColumnTypes(#"Replaced Value",{{"deposited", type number}}),
#"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"deposited", "SP powerd down"}}),
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Renamed Columns", {"SP powerd down"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Removed Errors",{{"timestamp", type date}})
in
#"Changed Type1"
SP Powered UP
let
Source = Sql.Database("vip.steemsql.com", "DBSteem", [Query="select *#(lf)#(lf)from Txtransfers#(lf) where CONVERT(DATE,timestamp) BETWEEN '2018-07-01' AND '2018-08-01'"]),
#"Filtered Rows" = Table.SelectRows(Source, each ([type] = "transfer_to_vesting")),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"type", "ID", "from", "amount", "amount_symbol", "timestamp"}),
#"Changed Type" = Table.TransformColumnTypes(#"Removed Other Columns",{{"timestamp", type date}}),
#"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"amount", "SP Powered up"}})
in
#"Renamed Columns"
Account Data
let
Source = Sql.Database("vip.steemsql.com", "DBSteem", [Query="select name, created, vesting_shares, delegated_vesting_shares, received_vesting_shares from accounts#(lf) "]),
#"Replaced Value" = Table.ReplaceValue(Source,"VESTS","",Replacer.ReplaceText,{"vesting_shares", "delegated_vesting_shares", "received_vesting_shares"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Replaced Value",{{"vesting_shares", type number}, {"delegated_vesting_shares", type number}, {"received_vesting_shares", type number}}),
#"Changed Type" = Table.TransformColumnTypes(#"Changed Type1",{{"created", type date}}),
#"Added Custom" = Table.AddColumn(#"Changed Type", "controlling vesting shares", each [vesting_shares]+[received_vesting_shares]-[delegated_vesting_shares]),
#"Changed Type2" = Table.TransformColumnTypes(#"Added Custom",{{"controlling vesting shares", type number}}),
// 0.000495 is a hard-coded VESTS-to-SP conversion factor; the real rate changes over time, so the SP figures are approximate
#"Added Custom1" = Table.AddColumn(#"Changed Type2", "Controlling SP", each [controlling vesting shares]*.000495),
#"Added Custom2" = Table.AddColumn(#"Added Custom1", "Owned SP", each [vesting_shares]*.000495),
#"Changed Type3" = Table.TransformColumnTypes(#"Added Custom2",{{"Controlling SP", Currency.Type}, {"Owned SP", Currency.Type}})
in
#"Changed Type3"
Posts via APP/TAG
let
Source = Sql.Database("vip.steemsql.com", "DBSteem", [Query="select id, author, created, category, total_payout_value, json_metadata from comments#(lf) where CONVERT(DATE,created) BETWEEN '2018-07-01' AND '2018-08-01' and depth = 0"]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"created", type date}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type", each Text.Contains([json_metadata], "XXX"))
in
#"Filtered Rows"
Votes by APP
let
Source = Sql.Database("vip.steemsql.com", "DBSteem", [Query="select *#(lf)from TXvotes#(lf) where CONVERT(DATE,timestamp) BETWEEN '2018-07-01' AND '2018-08-01'#(lf)and voter = 'XXX' "]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"timestamp", type date}}),
#"Added Custom" = Table.AddColumn(#"Changed Type", "% weight", each [weight]/10000),
#"Changed Type1" = Table.TransformColumnTypes(#"Added Custom",{{"% weight", Percentage.Type}})
in
#"Changed Type1"
Hi @paulag, seems like you hit quite a controversial topic with that analysis! From a technical perspective I understand that you took a few metrics as examples to show the possibilities and much, much more could be done.
While I'm "guilty" myself of applying methods like these to some extent with my other accounts, I'm a bit skeptical whether Steemit or dApp stake applying data driven voting would really be a good idea. I think the metrics would have to be chosen much more carefully and be more differentiated than the examples. Would new users join an app that rewards older accounts more highly? At which point does "loyalty" turn into "spam"? Gentlebot applied data driven voting to some extent, and it didn't take long until people realized that a simple "nice post" could give a pretty good chance of a $1.5 vote (this seems to be fixed now). Algorithms can and will be tricked, but so can humans.
I think the current set of dApps with human curation or moderation is a step in the right direction for Steem. Of course, an algorithm scales much more easily with growth than human resources. I'm torn, but in my personal opinion, I think I'd prefer to add more "brain" to the process instead of replacing it with an algorithm.
Yep, looks like it hit a nerve, but let's face it, data is a fact of business these days.
I agree, more 'brain' would be better, but like you have pointed out, algorithms can often scale better. And yes, the metrics must be chosen carefully, and would probably differ between apps depending on the business needs. I was previously working on a 'contribution score' metric and, as mentioned a few times, UA is on the way in some form. 1UP, as mentioned in the comments, is also using data to help. I think some people and apps are going to use data to help them grow, and probably more than are willing to let on right now :-)
Brilliant analysis and super close to what I currently deal with. I am developing the 1UP SMT project to create a 1-account-1-vote funnel for the best content from various communities, via a gamified platform that does nothing else but curate Steem content with both Steem and 1UP votes. Data like this, probably in the form of UA from @scipio, will allow us to make this happen. My goal is to keep the entry level as low as possible, but without opening the door too much for abuse via multiple accounts. The whole system depends on smart data to run mostly autonomously.
I will have a closer look at your suggested metrics. Good work!
From what I know, UA will be available as an API call. This will be cool for developers if it is the case. I think more and more people will start using data to identify accounts that fall within certain criteria as we grow.
If I can be of any assistance with 1UP, even just for brainstorming, please do let me know. 😊 And thanks for the comment. I was starting to think the concept got a little lost on people.
Posts like this, with such an amount of data and statistics, are something that I love. Will you be doing more of these in the future?
PS: There's a typo in the post:
All rewards from this comment will go to charity
thanks, I have updated the error
Yes you are right
So correct analysis
I put forth this same idea last year although not as elaborate as you.
really, and how did the community respond?
See the post for yourself: https://steemit.com/smt/@dana-edwards/adding-utility-to-steem-by-data-analytics-using-a-smart-media-token
Your post was much more elaborate and detailed as I didn't dive all the way in like you did. But we agree on the vision.