Apache Sqoop is a tool designed to efficiently transfer large amounts of data between Apache Hadoop and structured data stores such as relational databases. Sqoop provides an incremental import tool that you can use to retrieve only those rows that are newer than a previously imported set of rows, and a merge tool that lets you combine two data sets, with entries in one data set overwriting entries in an older data set.
Let’s examine a typical scenario in detail. Consider a remote RDBMS (DB2, for example) with a STUDENT table that contains basic information about students. The STUDENT table includes a column named ID (INTEGER) and a column named TIME (DATE).
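For illustration, a minimal sketch of such a table, created through the DB2 command line processor, might look like the following (the NAME column is an assumption added for illustration; only ID and TIME come from the scenario):

# Hypothetical DDL; only the ID and TIME columns are part of the scenario.
db2 "CREATE TABLE STUDENT (ID INTEGER NOT NULL, NAME VARCHAR(64), TIME DATE)"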
You can use the Sqoop incremental import tool to update the latest student information from DB2 to Hive, as shown in the following example:
sqoop import --incremental lastmodified --check-column TIME --last-value 2017-02-08 ...
The incremental import operation runs based on values in the TIME column and imports records from “2017-02-08” to the current time. Corresponding information in the Sqoop log might look like the following example:
17/02/08 22:44:47 INFO tool.ImportTool: Incremental import based on column TIME
17/02/08 22:44:47 INFO tool.ImportTool: Lower bound value: '2017-02-08'
17/02/08 22:44:47 INFO tool.ImportTool: Upper bound value: '2017-02-09 22:44:47.564217'
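For reference, a fuller form of the import command might look like the following sketch; the JDBC URL, credentials, and target directory are hypothetical and would need to match your environment:

# Sketch of a complete incremental import; the connection details and
# target directory are assumptions, not taken from the original example.
sqoop import --connect jdbc:db2://db2host:50000/SAMPLE \
  --username <user> --password <password> \
  --table STUDENT \
  --target-dir /apps/hive/warehouse/student \
  --append \
  --incremental lastmodified --check-column TIME --last-value 2017-02-08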
The Hive STUDENT table is updated with the latest records, and a new file is created under the table directory in HDFS:
[ambari-qa@mauve1 ~]$ hadoop fs -ls /apps/hive/warehouse/student
Found 2 items
-rwxrwx--- 1 ambari-qa hadoop 26 2017-02-08 23:13 /apps/hive/warehouse/student/part-m-00000
-rwxrwx--- 1 ambari-qa hadoop 26 2017-02-09 21:51 /apps/hive/warehouse/student/part-m-00000_copy_1
The new file includes the new records from 2017-02-08 to 2017-02-09.
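To spot-check what was imported, you could query the table from the Hive command line (assuming the Hive table is named student):

# Hypothetical spot check of the imported rows.
hive -e "SELECT * FROM student"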
Because the Hive table directory can grow significantly in size with daily incremental import jobs, it is a good idea to use the Sqoop merge tool to generate a more streamlined file.
Let’s take a look at another example. You can use the codegen command to generate code that interacts with database records. For example:
sqoop codegen --connect <jdbc-url> --username <user> --password <password> --table STUDENT --outdir /tmp/sqoop --fields-terminated-by '\t'
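The generated .java source is written to the directory given by --outdir, and Sqoop compiles it into a jar under a per-user compile directory; that jar and class are what the merge commands below reference. A quick, hypothetical way to locate the artifacts:

# The source lands in --outdir; the compiled jar lands under a
# per-user compile directory, as in the merge examples below.
ls /tmp/sqoop
ls /tmp/sqoop-ambari-qa/compile/*/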
You can then use the Sqoop merge tool to “flatten” two data sets into one, as shown in the following example:
sqoop merge --new-data /apps/hive/warehouse/student/part-m-00000 \
--onto /apps/hive/warehouse/student/part-m-00000_copy_1 \
--target-dir /tmp/sqoop_merge \
--jar-file /tmp/sqoop-ambari-qa/compile/9062c87c959e4090dcec5995a439b514/TIME.jar \
--class-name TIME \
--merge-key TIME
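When the merge job finishes, you can inspect the consolidated output in the target directory; the part-file name shown here is an assumption based on typical MapReduce output naming:

# Inspect the merged output; part-r-00000 is an assumed file name.
hadoop fs -ls /tmp/sqoop_merge
hadoop fs -cat /tmp/sqoop_merge/part-r-00000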
You can also use the merge tool to extract specific data from HDFS. For example, to extract the first two months of student data, you could run a command similar to the following example:
sqoop merge --new-data /apps/hive/warehouse/student/part-m-00000_copy_1 \
--onto /apps/hive/warehouse/student/part-m-00000_copy_2 \
--target-dir /tmp/student_first_two_month \
--jar-file /tmp/sqoop-ambari-qa/compile/9062c87c959e4090dcec5995a439b514/TIME.jar \
--class-name TIME \
--merge-key TIME
After the merge operation completes, you could import the data back into a Hive or HBase data store.
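For example, a minimal sketch of reloading the merged file into the Hive table (assuming the table is named student and its current contents can be replaced) might look like this:

# Hypothetical reload; LOAD DATA INPATH moves the merged files from
# /tmp/sqoop_merge into the table's warehouse directory.
hive -e "LOAD DATA INPATH '/tmp/sqoop_merge' OVERWRITE INTO TABLE student"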
For more information about Sqoop incremental options, see Incremental Imports.