HBase: How Wide and Narrow Tables Affect Region Splits

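HBase's hbase.hregion.max.filesize property sets the threshold for splitting a region. Its default value is 268435456 bytes (256 MB); when a column family's store file grows beyond this value, the region is split into two.
An HBase row can hold a very large number of columns, so there are two ways to design a table: a wide table (one row with many columns) and a narrow table (many rows with few columns each).
Take a table that stores users' emails as an example.
With the wide design, all of a user's emails are stored in a single row: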
userid1 email1 email2 email3 ... ... ... ... ... emailn
userid2 email1 email2 email3 ... ... ... ... ... emailn
useridn
With the narrow design, the row key is composed of the user ID and the email ID:
userid1_emailid1  email1
userid1_emailid2  email2
userid1_emailid3  email3
userid1_emailidn  emailn
userid2_emailid1  email1
userid2_emailid2  email2
userid2_emailid3  email3
userid2_emailidn  emailn
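The two layouts can be sketched in plain Java to make the difference in row counts concrete. This is only an illustration under assumed names: the class MailTableLayouts and its methods are hypothetical and do not use the HBase client API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MailTableLayouts {
    // Wide layout: one row per user, one column per email.
    static Map<String, List<String>> wideRows(String[] users, int emailsPerUser) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        for (String user : users) {
            List<String> columns = new ArrayList<>();
            for (int i = 1; i <= emailsPerUser; i++) {
                columns.add("email" + i);
            }
            rows.put(user, columns);
        }
        return rows;
    }

    // Narrow layout: one row per email, row key = user ID + "_" + email ID.
    static List<String> narrowRowKeys(String[] users, int emailsPerUser) {
        List<String> keys = new ArrayList<>();
        for (String user : users) {
            for (int i = 1; i <= emailsPerUser; i++) {
                keys.add(user + "_emailid" + i);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        String[] users = {"userid1", "userid2"};
        System.out.println("wide rows:   " + wideRows(users, 3).size());
        System.out.println("narrow rows: " + narrowRowKeys(users, 3).size());
    }
}
```

With the same data, the wide layout has as many distinct row keys as there are users, while the narrow layout has one distinct row key per email.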
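These two designs affect region splitting differently. While reading the HFileOutputFormat code today, I noticed that the RecordWriter it creates places a restriction on splitting:

the data is divided only when the row key changes. As long as the row key stays the same, no division happens, even if the size already exceeds hbase.hregion.max.filesize.
The RecordWriter code: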

    public void write(ImmutableBytesWritable row, KeyValue kv)
        throws IOException {
      long length = kv.getLength();
      byte [] family = kv.getFamily();
      WriterLength wl = this.writers.get(family);
      // Roll to a new writer for the first KeyValue of a family, or when the
      // size limit is reached AND the row differs from the previous row.
      if (wl == null || ((length + wl.written) >= maxsize) &&
          Bytes.compareTo(this.previousRow, 0, this.previousRow.length,
            kv.getBuffer(), kv.getRowOffset(), kv.getRowLength()) != 0) {
        // Get a new writer.
        Path basedir = new Path(outputdir, Bytes.toString(family));
        if (wl == null) {
          wl = new WriterLength();
          this.writers.put(family, wl);
          if (this.writers.size() > 1) throw new IOException("One family only");
          // If wl == null, first file in family.  Ensure family dir exists.
          if (!fs.exists(basedir)) fs.mkdirs(basedir);
        }
        wl.writer = getNewWriter(wl.writer, basedir);
        LOG.info("Writer=" + wl.writer.getPath() +
          ((wl.written == 0)? "": ", wrote=" + wl.written));
        wl.written = 0;
      }
      kv.updateLatestStamp(this.now);
      wl.writer.append(kv);
      wl.written += length;
      // Copy the row so we know when a row transition.
      this.previousRow = kv.getRow();
    }

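The condition in the if statement shows that the division takes place only when the accumulated size reaches hbase.hregion.max.filesize and the current row differs from the previously written row. Two consequences follow:
1. In a wide table, a single row larger than hbase.hregion.max.filesize is not split.
2. When many versions are written under the same row key, the data is not split even if its size exceeds hbase.hregion.max.filesize.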
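The effect of that condition can be reproduced with a small standalone simulation. This is a simplified stand-in rather than HBase code: the class RollSimulation, the helpers countFiles and keys, the fixed cell length of 4 and the limit of 10 are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RollSimulation {
    // Counts how many output files a writer would produce if it applied the
    // same rule as the RecordWriter above: roll only when the size limit is
    // reached AND the row key differs from the previous one.
    static int countFiles(byte[][] rowKeys, long cellLength, long maxSize) {
        int files = 0;
        long written = 0;
        byte[] previousRow = new byte[0];
        for (byte[] row : rowKeys) {
            boolean firstCell = (files == 0);
            boolean limitReached = written + cellLength >= maxSize;
            boolean rowChanged = !Arrays.equals(previousRow, row);
            if (firstCell || (limitReached && rowChanged)) {
                files++;      // start a new file
                written = 0;
            }
            written += cellLength;
            previousRow = row;
        }
        return files;
    }

    // Converts row key strings to byte arrays.
    static byte[][] keys(String... names) {
        byte[][] out = new byte[names.length][];
        for (int i = 0; i < names.length; i++) {
            out[i] = names[i].getBytes(StandardCharsets.UTF_8);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] wide = new String[10];
        Arrays.fill(wide, "row1");            // ten cells, one row key
        String[] narrow = new String[10];
        for (int i = 0; i < 10; i++) {
            narrow[i] = "row" + i;            // ten cells, ten row keys
        }
        System.out.println("wide:   " + countFiles(keys(wide), 4, 10));
        System.out.println("narrow: " + countFiles(keys(narrow), 4, 10));
    }
}
```

In this sketch the input with a single row key never rolls past its first file, however far it grows beyond the limit, while the input with distinct row keys rolls each time the limit is reached at a row boundary.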

Let's verify this.
To see the effect as early as possible, change two configuration parameters in hbase-site.xml to very small values:

    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>5</value>
      <description>
      Memstore will be flushed to disk if size of the memstore
      exceeds this number of bytes.  Value is checked by a thread that runs
      every hbase.server.thread.wakefrequency.
      </description>
    </property>
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>10</value>
      <description>
      Maximum HStoreFile size. If any one of a column families' HStoreFiles has
      grown to exceed this value, the hosting HRegion is split in two.
      Default: 256M.
      </description>
    </property>

Create the test tables t1 and t2:

    hbase(main):076:0* create 't1','f1'
    0 row(s) in 1.6460 seconds

    hbase(main):077:0> create 't2','f1'
    0 row(s) in 1.1790 seconds

Check the system table .META. (terminal line wrapping removed):

    hbase(main):081:0* scan '.META.'
    ROW COLUMN+CELL
     t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad. column=info:regioninfo, timestamp=1314720667384, value=REGION => {NAME => 't1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad.', STARTKEY => '', ENDKEY => '', ENCODED => d8acd6bc659ac8326b88850d645a90ad, TABLE => {{NAME => 't1', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
     t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad. column=info:server, timestamp=1314720667941, value=yinjie:60020
     t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad. column=info:serverstartcode, timestamp=1314720667941, value=1314716290123
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:regioninfo, timestamp=1314720672241, value=REGION => {NAME => 't2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71.', STARTKEY => '', ENDKEY => '', ENCODED => 16bb3d2563eab3b4e25477c64e007e71, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:server, timestamp=1314720672346, value=yinjie:60020
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:serverstartcode, timestamp=1314720672346, value=1314716290123
    2 row(s) in 0.0230 seconds

At this point t1 and t2 each have exactly one region.
First, insert 10 records into t1, all with the same row key:

    hbase(main):086:0* for i in 0..9 do\
    hbase(main):087:1* put 't1','row1',"f1:c#{i}","swallow#{i}"\
    hbase(main):088:1* end
    0 row(s) in 0.0180 seconds

    0 row(s) in 0.0070 seconds

    0 row(s) in 0.0420 seconds

    0 row(s) in 0.0620 seconds

    0 row(s) in 0.0120 seconds

    0 row(s) in 0.0770 seconds

    0 row(s) in 0.0150 seconds

    0 row(s) in 0.1290 seconds

    0 row(s) in 10.0740 seconds

    0 row(s) in 0.1230 seconds

    => 0..9
    hbase(main):089:0>

Check the records in t1:

    hbase(main):089:0> scan 't1'
    ROW                  COLUMN+CELL
     row1                column=f1:c0, timestamp=1314720946495, value=swallow0
     row1                column=f1:c1, timestamp=1314720946507, value=swallow1
     row1                column=f1:c2, timestamp=1314720946903, value=swallow2
     row1                column=f1:c3, timestamp=1314720946939, value=swallow3
     row1                column=f1:c4, timestamp=1314720946976, value=swallow4
     row1                column=f1:c5, timestamp=1314720947055, value=swallow5
     row1                column=f1:c6, timestamp=1314720947070, value=swallow6
     row1                column=f1:c7, timestamp=1314720947198, value=swallow7
     row1                column=f1:c8, timestamp=1314720957272, value=swallow8
     row1                column=f1:c9, timestamp=1314720957392, value=swallow9
    1 row(s) in 0.0300 seconds

Check .META. again:

    hbase(main):090:0> scan '.META.'
    ROW COLUMN+CELL
     t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad. column=info:regioninfo, timestamp=1314720667384, value=REGION => {NAME => 't1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad.', STARTKEY => '', ENDKEY => '', ENCODED => d8acd6bc659ac8326b88850d645a90ad, TABLE => {{NAME => 't1', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
     t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad. column=info:server, timestamp=1314720667941, value=yinjie:60020
     t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad. column=info:serverstartcode, timestamp=1314720667941, value=1314716290123
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:regioninfo, timestamp=1314720672241, value=REGION => {NAME => 't2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71.', STARTKEY => '', ENDKEY => '', ENCODED => 16bb3d2563eab3b4e25477c64e007e71, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:server, timestamp=1314720672346, value=yinjie:60020
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:serverstartcode, timestamp=1314720672346, value=1314716290123
    2 row(s) in 0.0210 seconds

t1 still has only one region.

Next, insert the same 10 values into t2, this time with a different row key for each record:

    hbase(main):091:0> for i in 0..9 do\
    hbase(main):092:1* put 't2',"row#{i}","f1:c#{i}","swallow#{i}"\
    hbase(main):093:1* end
    0 row(s) in 0.1140 seconds

    0 row(s) in 0.0080 seconds

    0 row(s) in 0.0410 seconds

    0 row(s) in 0.0820 seconds

    0 row(s) in 0.0210 seconds

    0 row(s) in 0.0410 seconds

    0 row(s) in 0.0200 seconds

    0 row(s) in 0.1210 seconds

    0 row(s) in 0.0140 seconds

    0 row(s) in 0.0360 seconds

    => 0..9

Check the records in t2:

    hbase(main):097:0* scan 't2'
    ROW                  COLUMN+CELL
     row0                column=f1:c0, timestamp=1314721110769, value=swallow0
     row1                column=f1:c1, timestamp=1314721110787, value=swallow1
     row2                column=f1:c2, timestamp=1314721110830, value=swallow2
     row3                column=f1:c3, timestamp=1314721110916, value=swallow3
     row4                column=f1:c4, timestamp=1314721110932, value=swallow4
     row5                column=f1:c5, timestamp=1314721110971, value=swallow5
     row6                column=f1:c6, timestamp=1314721110989, value=swallow6
     row7                column=f1:c7, timestamp=1314721111121, value=swallow7
     row8                column=f1:c8, timestamp=1314721111130, value=swallow8
     row9                column=f1:c9, timestamp=1314721111172, value=swallow9
    10 row(s) in 1.0450 seconds

Check .META. once more:

    hbase(main):102:0> scan '.META.'
    ROW COLUMN+CELL
     t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad. column=info:regioninfo, timestamp=1314720667384, value=REGION => {NAME => 't1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad.', STARTKEY => '', ENDKEY => '', ENCODED => d8acd6bc659ac8326b88850d645a90ad, TABLE => {{NAME => 't1', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
     t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad. column=info:server, timestamp=1314720667941, value=yinjie:60020
     t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad. column=info:serverstartcode, timestamp=1314720667941, value=1314716290123
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:regioninfo, timestamp=1314721112130, value=REGION => {NAME => 't2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71.', STARTKEY => '', ENDKEY => '', ENCODED => 16bb3d2563eab3b4e25477c64e007e71, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:server, timestamp=1314720672346, value=yinjie:60020
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:serverstartcode, timestamp=1314720672346, value=1314716290123
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:splitA, timestamp=1314721112130, value=REGION => {NAME => 't2,,1314721111490.71df02214242923574b71fe5e2a19360.', STARTKEY => '', ENDKEY => 'row0', ENCODED => 71df02214242923574b71fe5e2a19360, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
     t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71. column=info:splitB, timestamp=1314721112130, value=REGION => {NAME => 't2,row0,1314721111490.915ee8d4a32c59a4ec3960e335b061ca.', STARTKEY => 'row0', ENDKEY => '', ENCODED => 915ee8d4a32c59a4ec3960e335b061ca, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
     t2,,1314721111490.71df02214242923574b71fe5e2a19360. column=info:regioninfo, timestamp=1314721112267, value=REGION => {NAME => 't2,,1314721111490.71df02214242923574b71fe5e2a19360.', STARTKEY => '', ENDKEY => 'row0', ENCODED => 71df02214242923574b71fe5e2a19360, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
     t2,,1314721111490.71df02214242923574b71fe5e2a19360. column=info:server, timestamp=1314721112267, value=yinjie:60020
     t2,,1314721111490.71df02214242923574b71fe5e2a19360. column=info:serverstartcode, timestamp=1314721112267, value=1314716290123
     t2,row0,1314721111490.915ee8d4a32c59a4ec3960e335b061ca. column=info:regioninfo, timestamp=1314721112627, value=REGION => {NAME => 't2,row0,1314721111490.915ee8d4a32c59a4ec3960e335b061ca.', STARTKEY => 'row0', ENDKEY => '', ENCODED => 915ee8d4a32c59a4ec3960e335b061ca, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
     t2,row0,1314721111490.915ee8d4a32c59a4ec3960e335b061ca. column=info:server, timestamp=1314721112627, value=yinjie:60020
     t2,row0,1314721111490.915ee8d4a32c59a4ec3960e335b061ca. column=info:serverstartcode, timestamp=1314721112627, value=1314716290123
    4 row(s) in 0.0380 seconds

The region of t2 has been split: the original region is now marked OFFLINE and SPLIT, and two daughter regions (splitA and splitB) have appeared.
