一个有意思的大数据SQL

On 2019年9月5日By yuer

最近对大数据有兴趣，但是没有实际的业务场景和需求。

这篇文章简单介绍了一下大数据挖掘：https://www.infoq.cn/article/more-time-of-big-data-mining-is-used-to-clean-the-data，其中的HIVE小案例特别有意思，于是记录如下。

需求

我简化了一下需求，就是有一个这样的很大日志文件：

mid	ip	systime
102011	101.1.1.2	2019-07-02
101111	101.1.1.5	2019-07-02

mid是用户ID，ip是登录来源IP，system是登录时间。

希望为每一个用户，找出和该用户使用过相同IP地址进行登录的其他用户。

一个看起来很有意思但又比较简单的数据挖掘需求。

第一步：日志去重

因为同一个用户可能多次在同一个IP登录，所以挖掘前应该先考虑按（mid,ip）去重：

select ip,mid from tmp.fraud_sheep_behavdetail_union where 
systime>='2016-11-06' group by ip,mid

1 2	select ip,mid from tmp.fraud_sheep_behavdetail_union where systime>='2016-11-06' group by ip,mid

把它作为一个去重后的临时表。

第二步：按IP聚集

以IP作为分组依据，把使用过该IP的用户聚集到一起，那么同一个IP下的用户们一定是共享过IP的：

select ip,concat_ws('_',collect_set(cast(mid as string))) as midset 
from tmp.fraud_sheep_behavdetail_union 
where ip is not null and systime>='2016-11-06' group by ip

select ip,concat_ws('_',collect_set(cast(mid as string))) as midset

from tmp.fraud_sheep_behavdetail_union

where ip is not null and systime>='2016-11-06' group by ip

collect_set是HIVE的功能，可以对值进行去重；concat_ws用于连接多个值。

因此经过group by ip后，使用相同ip的mid用户列表就被用”_”连成了1列。

第三步：以IP为中间媒介，关联用户->其他用户集合

最终希望知道每个mid与哪一些mid共享过IP。

因为每个mid可能使用过多个IP登录，而每个IP又关联了所有的一组mid，所以可以用IP作为媒介把mid和一组mid连接到一起。

把第一步和第二步的表作为2个临时表进行join，然后按照mid再次聚集，就可以得到每个mid关联的所有mid了：

select s1.mid,concat_ws('_',collect_set(s2.midset)) as ip_midset 
from 
(
select ip,mid 
from tmp.fraud_sheep_behavdetail_union
where systime>='2016-11-06' group by ip,mid
) s1 
join 
( 
select ip,concat_ws('_',collect_set(cast(mid as string))) as midset 
from tmp.fraud_sheep_behavdetail_union 
where ip is not null and systime>='2016-11-06' group by ip
) s2 
on (s1.ip=s2.ip) 
group by s1.mid

select s1.mid,concat_ws('_',collect_set(s2.midset)) as ip_midset

from

(

select ip,mid

from tmp.fraud_sheep_behavdetail_union

where systime>='2016-11-06' group by ip,mid

) s1

join

(

select ip,concat_ws('_',collect_set(cast(mid as string))) as midset

from tmp.fraud_sheep_behavdetail_union

where ip is not null and systime>='2016-11-06' group by ip

) s2

on (s1.ip=s2.ip)

group by s1.mid

不过这个SQL得到的ip_midset列内是包含重复mid的，因为1个用户登录过的多个IP可能也被另外某个用户同时访问过，只需要进一步对这一列做字符串拆分和去重即可。

总结

对于一个给定的挖掘需求，从编码角度思考出计算流程并不复杂，但是要将其转换为SQL方式或者大数据计算的方式，则需要对相关大数据技术有比较全面的了解，才能自然而然的进行技术过渡。

如果文章帮助您解决了工作难题，您可以帮我点击屏幕上的任意广告，或者赞助少量费用来支持我的持续创作，谢谢~

需求

第一步：日志去重

第二步：按IP聚集

第三步：以IP为中间媒介，关联用户->其他用户集合

总结

发表回复 取消回复

发表回复取消回复