RE: [Steem Rep] Update - September 2024 | AI-Comments | Tags | Trendings Scores
For trending, my guess is that it might be best to forget about the total value, and instead experiment with something like these:
- Median vote value
- Median vote value * number of voters
- Median vote value * log(number of voters)
- Median vote value after discarding the highest value vote.
- Total payout after discarding the highest value vote
etc...
We clearly don't want to go with raw numbers of votes, because that could be gamed with alt accounts, but there might be other solutions that don't depend on trying to maintain tables of investors and curation trails. Maintaining those tables sounds like a nightmare challenge to me.
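To make the candidates above concrete, here is a minimal sketch of how they could be computed for one post. It assumes each vote is available simply as a numeric value in a list; the function name `candidate_scores` and the dictionary keys are hypothetical and not taken from any Steem software.

```python
import math
from statistics import median


def candidate_scores(vote_values):
    """Compute the candidate trending metrics for a single post.

    vote_values: list of per-vote values (assumed input format).
    """
    if not vote_values:
        return {}
    n = len(vote_values)
    med = median(vote_values)
    # Drop one instance of the highest-value vote.
    trimmed = sorted(vote_values)[:-1]
    return {
        "median": med,
        "median_x_voters": med * n,
        "median_x_log_voters": med * math.log(n),
        "median_without_top": median(trimmed) if trimmed else 0.0,
        "total_without_top": sum(trimmed),
    }
```

Calling `candidate_scores([1.0, 2.0, 3.0, 100.0])` shows how little the single large vote moves the median-based metrics compared with a plain total.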
By the way, I have now started with the first attempts.
You can also compare steemit.com and https://steemit.steemnet.org/trending. The full bot list from the-gorilla is currently filtered out there.
I have experimented locally with a variant that weights the rshares according to time (under 5 minutes = 0, between 5 and 10 minutes linearly increasing, from 10 minutes = 1). This shows even more significant differences.
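A rough sketch of that time weighting, not the actual local code. It assumes the time in question is the number of minutes between the post's creation and the vote, and that votes arrive as `(rshares, minutes)` pairs; both the function names and that input format are hypothetical.

```python
def time_weight(vote_age_minutes):
    """Linear ramp: 0 up to 5 minutes, rising to 1 at 10 minutes and beyond."""
    if vote_age_minutes <= 5:
        return 0.0
    if vote_age_minutes >= 10:
        return 1.0
    return (vote_age_minutes - 5) / 5


def weighted_rshares(votes):
    """Sum rshares scaled by the time weight.

    votes: iterable of (rshares, minutes_after_post) pairs (assumed format).
    """
    return sum(rshares * time_weight(minutes) for rshares, minutes in votes)
```

With this ramp, a vote cast 2 minutes after posting counts for nothing, one cast at 7.5 minutes counts for half, and anything from 10 minutes onward counts in full.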
I will compare the variants mentioned above.
You can also see the score in the PostSummary header on my test page. Overall, the values differ only very minimally.
Edit: new URL to use caching.
I see the scores in your test page, but I'm still seeing posts that were voted by accounts that I would've expected to be filtered as bots. Maybe I'm misunderstanding the filtering.
You probably assumed that posts with bot votes would be filtered out completely.
That was not my intention. I ‘only’ wanted to influence the calculation of the score by either filtering out or weakening the bot votes (i.e. their rshares) as far as possible. That way, the posts with bot votes should not end up so high in the ranking.
But as I suspected, there are still too many other (trail) votes that distort the result.
I have now calculated with various other methods. Since the median of the rshares results in the biggest changes, I would implement the median calculation in the test environment as a test.
I had a long conversation with ChatGPT today in which the use of percentiles emerged as quite promising to identify the biggest outlier votes. A weakening could then be applied to these.
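One way the percentile idea could look, purely as an illustration: votes above a chosen percentile keep only a fraction of the amount by which they exceed it. The 90th percentile, the factor of 0.5, the nearest-rank percentile method and both function names are assumptions made for this sketch, not values or code from the discussion.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def dampen_outliers(rshares_list, pct=90, factor=0.5):
    """Weaken the part of each vote that exceeds the pct-th percentile.

    pct and factor are illustrative choices only.
    """
    if not rshares_list:
        return []
    cap = percentile(rshares_list, pct)
    # Votes at or below the cap are unchanged; the excess above it is scaled.
    return [r if r <= cap else cap + (r - cap) * factor for r in rshares_list]
```

For ten votes of 10, 20, ..., 90 and one of 1000, only the 1000 vote is touched; everything at or below the 90th percentile stays as it was.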
Ah, you're right. I was assuming it would be total filtering. But your approach makes more sense. I suppose they should get credit for legitimate organic votes, even if they're using bots.
That was my gut feeling. I look forward to seeing that. My only reservation is for posts with a small number of votes. I guess the median would need to be combined with some sort of minimum number of votes and/or rshares.
That makes a lot of sense. It hadn't crossed my mind, but I like it.
That's true. I have already observed that posts with one vote had a higher trending score than posts with many (good) votes. A sensible combination is absolutely essential.
Unfortunately, the ‘old’ scores are higher, so the trending page is still dominated by them...
It looks very different, though. I think there's a little bit of improvement, already.
Forgot to mention: This was why I came up with median * log(number of votes), too. Someone could try to manipulate the score with fake accounts, but it gets expensive to fund them to high enough levels that they won't drag down the median.
To prevent the curve from rising too quickly with smaller vote numbers, I have now used log10. We'll see how the values develop.
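The shape of the log10 variant, as a minimal sketch. It assumes the score is just the median vote value times log10 of the vote count; whatever is running in the test environment may work on rshares and include other factors, and the name `trending_score` is made up for this example.

```python
import math
from statistics import median


def trending_score(vote_values):
    """Median vote value scaled by log10 of the number of votes."""
    if not vote_values:
        return 0.0
    return median(vote_values) * math.log10(len(vote_values))
```

Since log10(1) is 0, a post with a single vote scores 0 under this formula, which deals with the one-vote case mentioned earlier. Ten votes give a multiplier of 1 and a hundred votes a multiplier of 2, so the curve rises slowly.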
When debugging the new function, I noticed that there is a bug in the caching wrapper that makes the caching unusable (https://github.com/steemit/hivemind/issues/338).
I would fix that now :-)
This is fascinating. First, I was able to knock a post with low numbers of votes off of your trending page just by voting for it with a small value, so that's an interesting dynamic. It actually gives an advantage to people who want to reduce a post's visibility (with low numbers of votes), and this has ramifications that I hadn't considered for alt-accounts. Not sure if that's a problem - or how big of a problem. Definitely something to be aware of, though.
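A small illustration of that dynamic, assuming a score of median * log10(number of votes); the exact formula live on the test page at the time may have differed, and the function name and vote values are invented for the example.

```python
import math
from statistics import median


def median_log_score(vote_values):
    """Hypothetical score: median vote value * log10(number of votes)."""
    return median(vote_values) * math.log10(len(vote_values))


# A post with few votes: one tiny extra vote shifts the median sharply down.
few_before = [0.5, 20.0]
few_after = few_before + [0.01]

# A post with many similar votes: the same tiny vote leaves the median alone.
many_before = [10.0] * 20
many_after = many_before + [0.01]
```

Comparing `median_log_score(few_before)` with `median_log_score(few_after)` shows the score dropping, while the post with twenty votes barely changes (it even rises slightly, because the vote count grows and the median does not move).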
Second, and more importantly, does this mean that if 10 web sites all run condenser and hivemind, they can use 10 different customized trending algorithms in order to distinguish themselves? That's a competitive dynamic that I was not aware of.
Thank you very much. I was hoping for more suggestions from you. And you didn't disappoint me :-)
I've been thinking for a few days about how I could make effective comparative calculations (with the possibility of visualisation) with reasonable effort. Since the votes are not directly linked to the post with all the data, I would have to spend a lot of time collating the data first. But I think the data for your suggestions might even be available with one request.
Let's see...