表关联
内连接(INNER JOIN)
返回两个表中满足关联条件的记录。
SELECT * FROM t1 INNER JOIN t2 ON t1.col1 = t2.col2;
左连接(LEFT JOIN)
返回左表中的所有记录,以及右表中满足关联条件的记录。
SELECT * FROM t1 LEFT JOIN t2 ON t1.col1 = t2.col2;
右连接(RIGHT JOIN)
返回右表中的所有记录,以及左表中满足关联条件的记录。
SELECT * FROM t1 RIGHT JOIN t2 ON t1.col1 = t2.col2;
全连接(FULL OUTER JOIN)
返回左表和右表中的所有记录。
hive full join多表多关联键联合查询
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.col1 = t2.col2;
DDL
字段操作
--添加字段 alter table app.table_name add columns(bu_name STRING COMMENT "事业部名称") CASCADE; --修改字段类型(修改为double) Alter table tmp.tmp_zp_tablename column columnname columnname double; --调整列位置 alter table app.table_name change bu_name bu_name STRING after col_a;
注意不要直接对有数据的表进行字段顺序调整,会导致历史分区数据错误。
分区操作
--删除分区 alter table tmp.tmp_zp_tablename drop if exists partition(dt='2020-10-24');
常用函数
官网文档
sort_array
sort_array(Array) 只有一个参数
根据自然顺序按升序对输入数组进行排序
SELECT sort_array(array(5, 2, 8, 1, 7)) AS sorted_array;
使用中常和collect函数使用 sort_array(collect_set())
concat_ws
concat_ws(separator, string1, string2, …)
用于将多个字符串连接在一起,中间使用指定的分隔符进行分隔。
SELECT concat_ws(',', 'Hello', 'World') AS result;
常和数组集合函数使用,collect_set collect_list 将数据内容转为字符串
concat_ws(‘,’,collect_set(col) )
collect_set collect_list
collect_set函数可以将指定字段的所有不重复的值,以Set的形式返回。Set是一种无序且不包含重复元素的数据结构。
collect_list函数可以将指定字段的所有值,以List的形式返回。List是一种有序且允许重复元素的数据结构。
SELECT collect_set(name) FROM student;
注意
collect_set和collect_list函数只能应用于对一个字段进行聚合操作,不能对多个字段同时聚合。
collect_set和collect_list函数的性能较差,当数据量较大时,可能会影响查询性能。
collect_set和collect_list函数都是在Reducer阶段进行聚合操作,因此在分布式环境下,需要确保数据被正确分组。
length size
length(string A) Returns the length of the string.
size(Map
size(Array) Returns the number of elements in the array type.
TRUNC
TRUNC(number,num_digits)Number需要截尾取整的数字。Num_digits用于指定取整精度的数字,默认值为0。TRUNC()函数截取时不进行四舍五入。
select trunc(123.458) from dual --123 select trunc(123.458,0) from dual --123 select trunc(123.458,1) from dual --123.4 select trunc(123.458,-1) from dual --120 select trunc(123.458,-4) from dual --0 select trunc(123.458,4) from dual --123.458 select trunc(123) from dual --123 select trunc(123,1) from dual --123 select trunc(123,-1) from dual --120
lag/lead
查询每个顾客上次的购买时间
select *,lag(orderdate) over(partition by name order by orderdate) from business; +----------------+---------------------+----------------+---------------+--+ | business.name | business.orderdate | business.cost | lag_window_0 | +----------------+---------------------+----------------+---------------+--+ | jack | 2017-01-01 | 10 | NULL | | jack | 2017-01-05 | 46 | 2017-01-01 | | jack | 2017-01-08 | 55 | 2017-01-05 | | jack | 2017-02-03 | 23 | 2017-01-08 | | jack | 2017-04-06 | 42 | 2017-02-03 | | mart | 2017-04-08 | 62 | NULL | | mart | 2017-04-09 | 68 | 2017-04-08 | | mart | 2017-04-11 | 75 | 2017-04-09 | | mart | 2017-04-13 | 94 | 2017-04-11 | | neil | 2017-05-10 | 12 | NULL | | neil | 2017-06-12 | 80 | 2017-05-10 | | tony | 2017-01-02 | 15 | NULL | | tony | 2017-01-04 | 29 | 2017-01-02 | | tony | 2017-01-07 | 50 | 2017-01-04 | +----------------+---------------------+----------------+---------------+--+ select *,lag(orderdate,1,"1970-01-01") over(partition by name order by orderdate) from business; --lag --lag(col,n,DEFAULT) 第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL) --与LAG相反 --LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值。第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL) +----------------+---------------------+----------------+---------------+--+ | business.name | business.orderdate | business.cost | lag_window_0 | +----------------+---------------------+----------------+---------------+--+ | jack | 2017-01-01 | 10 | 1970-01-01 | | jack | 2017-01-05 | 46 | 2017-01-01 | | jack | 2017-01-08 | 55 | 2017-01-05 | | jack | 2017-02-03 | 23 | 2017-01-08 | | jack | 2017-04-06 | 42 | 2017-02-03 | | mart | 2017-04-08 | 62 | 1970-01-01 | | mart | 2017-04-09 | 68 | 2017-04-08 | | mart | 2017-04-11 | 75 | 2017-04-09 | | mart | 2017-04-13 | 94 | 2017-04-11 | | neil | 2017-05-10 | 12 | 1970-01-01 | | neil | 2017-06-12 | 80 | 2017-05-10 | | tony | 2017-01-02 | 15 | 1970-01-01 | | tony | 2017-01-04 | 29 | 2017-01-02 | | tony | 2017-01-07 | 50 | 2017-01-04 | +----------------+---------------------+----------------+---------------+--+
ntile
用于将分组数据按照顺序切分成n片(不是严格等分),返回当前记录所在的切片值。
--查询前20%时间的订单信息 select *,ntile(5) tgroup over(order by orderdate) from business; +----------------+---------------------+----------------+-----------------+--+ | business.name | business.orderdate | business.cost | ntile_window_0 | +----------------+---------------------+----------------+-----------------+--+ | jack | 2017-01-01 | 10 | 1 | | tony | 2017-01-02 | 15 | 1 | | tony | 2017-01-04 | 29 | 1 | | jack | 2017-01-05 | 46 | 2 | | tony | 2017-01-07 | 50 | 2 | | jack | 2017-01-08 | 55 | 2 | | jack | 2017-02-03 | 23 | 3 | | jack | 2017-04-06 | 42 | 3 | | mart | 2017-04-08 | 62 | 3 | | mart | 2017-04-09 | 68 | 4 | | mart | 2017-04-11 | 75 | 4 | | mart | 2017-04-13 | 94 | 4 | | neil | 2017-05-10 | 12 | 5 | | neil | 2017-06-12 | 80 | 5 | +----------------+---------------------+----------------+-----------------+--+ select * from (select *,ntile(5) tgroup over(order by orderdate) from business) t1 where t1.tgroup=1;
persent_rank
分组内当前行的RANK值-1/分组内总行数-1
select *,percent_rank() over(order by orderdate) pr from business; +----------------+---------------------+----------------+----------------------+--+ | business.name | business.orderdate | business.cost | pr | +----------------+---------------------+----------------+----------------------+--+ | jack | 2017-01-01 | 10 | 0.0 | | tony | 2017-01-02 | 15 | 0.07692307692307693 | | tony | 2017-01-04 | 29 | 0.15384615384615385 | | jack | 2017-01-05 | 46 | 0.23076923076923078 | | tony | 2017-01-07 | 50 | 0.3076923076923077 | | jack | 2017-01-08 | 55 | 0.38461538461538464 | | jack | 2017-02-03 | 23 | 0.46153846153846156 | | jack | 2017-04-06 | 42 | 0.5384615384615384 | | mart | 2017-04-08 | 62 | 0.6153846153846154 | | mart | 2017-04-09 | 68 | 0.6923076923076923 | | mart | 2017-04-11 | 75 | 0.7692307692307693 | | mart | 2017-04-13 | 94 | 0.8461538461538461 | | neil | 2017-05-10 | 12 | 0.9230769230769231 | | neil | 2017-06-12 | 80 | 1.0 | +----------------+---------------------+----------------+----------------------+--+
开窗函数
示例表:
+----------------+---------------------+----------------+--+ | business.name | business.orderdate | business.cost | +----------------+---------------------+----------------+--+ | jack | 2017-01-01 | 10 | | tony | 2017-01-02 | 15 | | jack | 2017-02-03 | 23 | | tony | 2017-01-04 | 29 | | jack | 2017-01-05 | 46 | | jack | 2017-04-06 | 42 | | tony | 2017-01-07 | 50 | | jack | 2017-01-08 | 55 | | mart | 2017-04-08 | 62 | | mart | 2017-04-09 | 68 | | neil | 2017-05-10 | 12 | | mart | 2017-04-11 | 75 | | neil | 2017-06-12 | 80 | | mart | 2017-04-13 | 94 | +----------------+---------------------+----------------+--+``` ```sql select name,orderdate,cost, sum(cost) over() as sample1,--所有行相加 sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加 sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加 sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合 sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合 sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行 sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行 from business;
其中sample3和sample4是一样的,都是按name分组,组内数据累加。上面总共开了7个窗口函数,select执行完了之后(select不需要执行MapReduce程序),每多一个窗口,就多一个MapReduce执行函数,但是这个前提是窗口开的不一样,只有窗口开的不一样才有额外的MapReduce,sample3~sample7的窗口都是一样的,只不过他们各自加的行的范围不一样而已,所以窗口都是一个窗口。
排序函数
排序函数有rank()、dense_rank()、row_number(),下面对比差异。
给定下表:
+-------------+----------------+--------------+--+ | score.name | score.subject | score.score | +-------------+----------------+--------------+--+ | 孙悟空 | 语文 | 87 | | 孙悟空 | 数学 | 95 | | 孙悟空 | 英语 | 68 | | 大海 | 语文 | 94 | | 大海 | 数学 | 56 | | 大海 | 英语 | 84 | | 宋宋 | 语文 | 64 | | 宋宋 | 数学 | 86 | | 宋宋 | 英语 | 84 | | 婷婷 | 语文 | 65 | | 婷婷 | 数学 | 85 | | 婷婷 | 英语 | 78 | +-------------+----------------+--------------+--+ select *,rank() over(partition by subject order by score desc) r, dense_rank() over(partition by subject order by score desc) dr, row_number() over(partition by subject order by score desc) rr from score; +-------------+----------------+--------------+----+-----+-----+--+ | score.name | score.subject | score.score | r | dr | rr | +-------------+----------------+--------------+----+-----+-----+--+ | 孙悟空 | 数学 | 95 | 1 | 1 | 1 | | 宋宋 | 数学 | 86 | 2 | 2 | 2 | | 婷婷 | 数学 | 85 | 3 | 3 | 3 | | 大海 | 数学 | 56 | 4 | 4 | 4 | | 宋宋 | 英语 | 84 | 1 | 1 | 1 | | 大海 | 英语 | 84 | 1 | 1 | 2 | | 婷婷 | 英语 | 78 | 3 | 2 | 3 | | 孙悟空 | 英语 | 68 | 4 | 3 | 4 | | 大海 | 语文 | 94 | 1 | 1 | 1 | | 孙悟空 | 语文 | 87 | 2 | 2 | 2 | | 婷婷 | 语文 | 65 | 3 | 3 | 3 | | 宋宋 | 语文 | 64 | 4 | 4 | 4 | +-------------+----------------+--------------+----+-----+-----+--+
注:排序还可以用累加至当前行实现,效果和row_number()相同
count(1) over(partition by subject order by score desc rows between unbounded preceding and current row) as rank
时间函数
months_between
MONTHS_BETWEEN (date1, date2)用于计算date1和date2之间有几个月。如果date1在日历中比date2晚,那么MONTHS_BETWEEN()就返回一个正数。如果date1在日历中比date2早,那么MONTHS_BETWEEN()就返回一个负数。如果date1和date2日期一样,那MONTHS_BETWEEN()就返回一个0。
hive> select months_between('2020-10-21','2020-08-20'); OK 2.03225806 Time taken: 0.995 seconds, Fetched: 1 row(s) hive> select months_between('2020-08-20','2020-10-21'); OK -2.03225806 Time taken: 0.076 seconds, Fetched: 1 row(s) hive> select months_between('2020-08-20','2020-08-20'); OK 0.0 Time taken: 0.056 seconds, Fetched: 1 row(s) # 行专列/列转行 https://zhuanlan.zhihu.com/p/115913870 https://blog.csdn.net/jiantianming2/article/details/79189672 ## Hive Map Reduce个数如何设置? 来自面试官的10大连环拷问 https://zhuanlan.zhihu.com/p/270002498
猜你喜欢
- 15小时前Kafka系列 - Kafka一篇入门
- 15小时前打败一切NeRF! 3D Gaussian Splatting 的 简单入门知识
- 15小时前前端超好玩的小游戏合集来啦--周末两天用html5做一个3D飞行兔子萝卜小游戏
- 14小时前段子特仑苏什么意思(特仑苏 段子)
- 9小时前魔方六面还原公式(魔方六面还原公式口诀)
- 8小时前柏多音字组词(柏多音字组词是什么)
- 6小时前隔离开关是指承担接通和断开电流任务(隔离开关指承担接通和断开电流任务,将电路与电源断开)
- 6小时前奋笔勤书还是奋笔疾书(奋笔疾书还是奋书疾笔)
- 1小时前厦门自贸区蚕丝被真假(厦门自贸区蚕丝被好吗)
- 57分钟前吉利ec7(吉利ec715rv)
网友评论
- 搜索
- 最新文章
- 热门文章