从用户连续活跃的最大天数说起

in #cn6 years ago

问题引入

最近经常碰到这样的问题,每天每个城市播放最多的10首歌,某月每支股票连续下跌/上涨的最大天数,用户连续活跃的最大天数,初步看起来都和分析函数相关,考验逻辑思维和写复杂SQL的能力。

以用户连续活跃的最大天数为例

以Oracle的分析函数语法说明,首先模拟一些用户活跃的数据,

-- 建表语句
DROP TABLE sigin;
create table sigin(
userid int, 
sigindate varchar2(20) 
); 

-- 模拟数据插入

insert into sigin values(1,'2017-01-01');
insert into sigin values(1,'2017-01-02');
insert into sigin values(1,'2017-01-03');
insert into sigin values(1,'2017-01-04');
insert into sigin values(2,'2017-01-01');
insert into sigin values(2,'2017-01-02');
insert into sigin values(2,'2017-01-03');

insert into sigin values(1,'2017-01-10');
insert into sigin values(2,'2017-01-10');
insert into sigin values(1,'2017-01-11');
insert into sigin values(2,'2017-01-11');
insert into sigin values(1,'2017-01-12');
insert into sigin values(2,'2017-01-12');
commit;

大体思路如下:

  • 首先根据userid进行分组,将用户所有活跃的记录依次按照时间排序并标上序号。
  • 根据时间有序的特点,将所有时间减去它对应的序号,获取连续活跃时间唯一的时间点。如
2017-01-01 1
2017-01-02 2
2017-01-03 3

时间减去序号,得唯一时间2016-12-31。

  • 根据userid和这个连续活跃时间唯一的时间点进行分组,计算连续活跃天数。
-- 每个用户的几段连续活跃的天数
select 
userid,
to_date(sigindate,'yyyy-mm-dd')-sigin_rank as date_rank,
count(1) as sigincount
from 
(
select 
userid,
sigindate,
row_number() over(partition by userid order by sigindate) as sigin_rank
 from sigin
 ) c group by userid, to_date(sigindate,'yyyy-mm-dd')-sigin_rank;

得到结果1如下,

USERID  DATE_RANK   SIGINCOUNT
1   2017/1/5    3
2   2017/1/6    3
1   2016/12/31  4
2   2016/12/31  3

上述方法可以找到每个用户的连续活跃天数,但用户中间有中断时程序就无法满足,一个用户出现了多条记录,分别为用户的多段连续活跃所产生。

我们最终的目标是得到用户连续活跃的最大天数,可利用上述方法所得到的结果,在外面再嵌套一层,针对userid进行group by,得到每个用户的最大活跃天数。

select d.userid, Max(d.sigincount) as max_sigincount from (
select 
userid,
to_date(sigindate,'yyyy-mm-dd')-sigin_rank as date_rank,
count(1) as sigincount
from 
(
select 
userid,sigindate,row_number() over(partition by userid order by sigindate) as sigin_rank
 from sigin
 ) c group by userid ,to_date(sigindate,'yyyy-mm-dd')-sigin_rank 

) d  group by d.userid

得到结果2如下,

USERID  MAX_SIGINCOUNT
1   4
2   3

如果还需要得到用户连续活跃最大天数中这一段的首次活跃时间,可以把以上两个结果进行关联得到。

-- 每个用户连续活跃的最大天数和连续活跃的第一天的时间
 select f.userid,g.date_rank+1,f.max_sigincount from (
select d.userid, Max(d.sigincount) as max_sigincount from (
select 
userid,
to_date(sigindate,'yyyy-mm-dd')-sigin_rank as date_rank,
count(1) as sigincount
from 
(
select 
userid,sigindate,row_number() over(partition by userid order by sigindate) as sigin_rank
 from sigin
 ) c group by userid ,to_date(sigindate,'yyyy-mm-dd')-sigin_rank 

) d  group by d.userid
) f inner join (
select 
userid,
to_date(sigindate,'yyyy-mm-dd')-sigin_rank  as date_rank,
count(1) as sigincount
from 
(
select 
userid,sigindate,row_number() over(partition by userid order by sigindate) as sigin_rank
 from sigin
 ) c group by userid ,to_date(sigindate,'yyyy-mm-dd')-sigin_rank 

) g on f.userid = g.userid and f.max_sigincount = g.sigincount;

得到结果3如下,

USERID  G.DATE_RANK+1   MAX_SIGINCOUNT
2   2017/1/7    3
1   2017/1/1    4
2   2017/1/1    3

结果3还存在一个问题,如果用户有两段连续活跃的天数相同且最大,则第二段连续活跃的首次活跃时间是不对的,这个问题怎么解决呢?欢迎留言你的解决方案。


本文首发于steem,感谢阅读,转载请注明。

https://steemit.com/@padluo


微信公众号「padluo」,分享数据科学家的自我修养,既然遇见,不如一起成长。

数据分析


读者交流电报群

https://t.me/sspadluo


知识星球交流群

知识星球读者交流群

Sort:  

Congratulations @padluo! You received a personal award!

Happy Birthday! - You are on the Steem blockchain for 2 years!

You can view your badges on your Steem Board and compare to others on the Steem Ranking

Do not miss the last post from @steemitboard:

SteemitBoard Ranking update - A better rich list comparator
Vote for @Steemitboard as a witness to get one more award and increased upvotes!

Congratulations @padluo! You received a personal award!

Happy Birthday! - You are on the Steem blockchain for 1 year!

Click here to view your Board

Support SteemitBoard's project! Vote for its witness and get one more award!

Coin Marketplace

STEEM 0.20
TRX 0.25
JST 0.038
BTC 96601.54
ETH 3445.82
USDT 1.00
SBD 3.09