Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. Sqoop is an open source software product of the Apache Software Foundation.
The following prerequisite knowledge is required to work with Sqoop:
- Basic computer technology and terminology
- Familiarity with command-line interfaces such as bash
- Relational database management systems
- Basic familiarity with the purpose and operation of Hadoop
Before you can use Sqoop, a release of Hadoop must be installed and configured. Sqoop currently supports four major Hadoop release lines: 0.20, 0.23, 1.0, and 2.0.
Sqoop has an extension framework that makes it possible to import data from—and export data to—any external storage system that has bulk data transfer capabilities. A Sqoop connector is a modular component that uses this framework to enable Sqoop imports and exports. Sqoop ships with connectors for working with a range of popular databases, including MySQL, PostgreSQL, Oracle, SQL Server, DB2, and Netezza. There is also a generic JDBC connector for connecting to any database that supports Java’s JDBC protocol. Sqoop provides optimized MySQL, PostgreSQL, Oracle, and Netezza connectors that use database-specific APIs to perform bulk transfers more efficiently.
Sqoop is capable of importing into a few different file formats. Text files (the default) offer a human-readable representation of data, platform independence, and the simplest structure. However, they cannot hold binary fields (such as database columns of type VARBINARY), and distinguishing between null values and String-based fields containing the value “null” can be problematic (although using the --null-string import option allows you to control the representation of null values).
To handle these conditions, Sqoop also supports SequenceFiles and Avro datafiles. These binary formats provide the most precise representation possible of the imported data. They also allow data to be compressed while retaining MapReduce’s ability to process different sections of the same file in parallel. However, current versions of Sqoop cannot load Avro datafiles.
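For instance, the file format can be chosen per import with the corresponding flag; the connect string, table, and target directories below are illustrative placeholders rather than values from an actual setup:
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--table employees --as-sequencefile --target-dir /data/employees_seq
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--table employees --as-avrodatafile --target-dir /data/employees_avro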
Before the import can start, Sqoop uses JDBC to examine the table it is to import. It retrieves a list of all the columns and their SQL data types. These SQL types (VARCHAR, INTEGER, etc.) can then be mapped to Java data types (String, Integer, etc.), which will hold the field values in MapReduce applications. Sqoop’s code generator will use this information to create a table-specific class to hold a record extracted from the table. Sqoop does not need to import an entire table at a time. For example, a subset of the table’s columns can be specified for import. Users can also specify a WHERE clause to include in queries via the --where argument, which bounds the rows of the table to import.
Before processing imported data, the delimited text must be parsed so that the field values can be extracted and converted to the appropriate data types. The generated table class provided by Sqoop can automate this process, allowing you to focus on the actual MapReduce job to run. Each autogenerated class has several overloaded methods named parse() that operate on the data represented as Text, CharSequence, char[], or other common types.
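The same class can also be generated on its own with the codegen tool, which is handy when you want the parsing code without re-running an import. A minimal sketch, with placeholder connect string, table, and output locations:
$ sqoop codegen --connect jdbc:mysql://database.example.com/employees \
--table employees --outdir /tmp/sqoop-codegen --class-name Employee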
An export in Sqoop uses HDFS as the source of data and a remote database as the destination. In the previous sections, we imported some data and then performed some analysis using Hive. We can export the results of this analysis to a database for consumption by other tools. Before exporting a table from HDFS to a database, we must prepare the database to receive the data by creating the target table. Although Sqoop can infer which Java types are appropriate to hold SQL data types, this translation does not work in both directions. Before performing the export, Sqoop picks a strategy based on the database connect string. For most systems, Sqoop uses JDBC. Sqoop then generates a Java class based on the target table definition. This generated class has the ability to parse records from text files and insert values of the appropriate types into a table (in addition to the ability to read the columns from a ResultSet). A MapReduce job is then launched that reads the source datafiles from HDFS, parses the records using the generated class, and executes the chosen export strategy.
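As a sketch of that flow, assuming the analysis results sit under a hypothetical /user/hive/warehouse/zip_profits directory written by Hive with its default Ctrl-A field delimiter, the table creation and export might look like the following (host, database, and table names are placeholders):
$ mysql -h database.example.com -e "CREATE TABLE zip_profits (zip INTEGER, profit DECIMAL(8,2))" corp
$ sqoop export --connect jdbc:mysql://database.example.com/corp --table zip_profits \
--export-dir /user/hive/warehouse/zip_profits --input-fields-terminated-by '\0001'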
Installation
- Downloading Sqoop – We can download the latest version of Sqoop from the Sqoop project website.
- Extract Sqoop – The following commands are used to extract the Sqoop tarball and move it to the /usr/lib/sqoop directory.
$ tar -xvf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
$ su
password:
# mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
# exit
- Configuring bashrc – You have to set up the Sqoop environment by appending the following lines to the ~/.bashrc file:
#Sqoop
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
The following command is used to source the ~/.bashrc file:
$ source ~/.bashrc
- Configuring Sqoop – To configure Sqoop with Hadoop, you need to edit the sqoop-env.sh file, which is placed in the $SQOOP_HOME/conf directory. First, change to the Sqoop config directory and create sqoop-env.sh from the template using the following commands:
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
Open sqoop-env.sh and edit the following lines:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
- Download and configure the database connector – We can download the mysql-connector-java-5.1.30.tar.gz file as the MySQL database connector.
The following commands are used to extract the mysql-connector-java tarball and move mysql-connector-java-5.1.30-bin.jar to the /usr/lib/sqoop/lib directory.
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ su
password:
# cd mysql-connector-java-5.1.30
# mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
- Verifying Sqoop – The following command is used to verify the Sqoop version.
$ cd $SQOOP_HOME/bin
$ sqoop-version
Expected output:
14/12/17 14:52:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
Sqoop 1.4.5 git commit id 5b34accaca7de251fc91161733f906af2eddbe83
Compiled by abe on Fri Aug 1 11:19:26 PDT 2014
Basic Usage
With Sqoop, you can import data from a relational database system or a mainframe into HDFS. The input to the import process is either a database table or a set of mainframe datasets. For databases, Sqoop will read the table row by row into HDFS. For mainframe datasets, Sqoop will read records from each mainframe dataset into HDFS. The output of this import process is a set of files containing a copy of the imported table or datasets. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
A by-product of the import process is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided to you, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.
After manipulating the imported records (for example, with MapReduce or Hive) you may have a result data set which you can then export back to the relational database. Sqoop’s export process will read a set of delimited text files from HDFS in parallel, parse them into records, and insert them as new rows in a target database table, for consumption by external applications or users.
Sqoop includes some other commands which allow you to inspect the database you are working with. For example, you can list the available database schemas (with the sqoop-list-databases tool) and tables within a schema (with the sqoop-list-tables tool). Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).
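For example, the following invocations (with a placeholder connect string and credentials) list the databases on a MySQL server, list the tables in one of them, and run a quick ad-hoc query through sqoop-eval:
$ sqoop list-databases --connect jdbc:mysql://database.example.com/ --username venkatesh -P
$ sqoop list-tables --connect jdbc:mysql://database.example.com/employees --username venkatesh -P
$ sqoop eval --connect jdbc:mysql://database.example.com/employees \
--username venkatesh -P --query "SELECT * FROM employees LIMIT 3"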
Most aspects of the import, code generation, and export processes can be customized. For databases, you can control the specific row range or columns imported. You can specify particular delimiters and escape characters for the file-based representation of the data, as well as the file format used. You can also control the class or package names used in generated code. Subsequent sections of this document explain how to specify these and other arguments to Sqoop.
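As one illustrative combination of these options (all names and paths here are placeholders), an import could restrict the rows and columns, switch to tab-delimited output, and put the generated code in a chosen package:
$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
--columns "name,employee_id,jobtitle" --where "id > 400" \
--fields-terminated-by '\t' --package-name com.example.sqoop \
--target-dir /data/employees_subset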
Sqoop Tools
Sqoop is a collection of related tools. To use Sqoop, you specify the tool you want to use and the arguments that control the tool. If Sqoop is compiled from its own source, you can run Sqoop without a formal installation process by running the bin/sqoop program. Users of a packaged deployment of Sqoop (such as an RPM shipped with Apache Bigtop) will see this program installed as /usr/bin/sqoop. Sqoop ships with a help tool. To display a list of all available tools, type the following command:
$ sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import mainframe datasets to HDFS
list-databases List available databases on a server
list-tables List available tables in a database
version Display version information
You can display help for a specific tool by entering: sqoop help (tool-name); for example, sqoop help import. You can also add the --help argument to any command: sqoop import --help.
Using Command Aliases
In addition to typing the sqoop (toolname) syntax, you can use alias scripts that specify the sqoop-(toolname) syntax. For example, the scripts sqoop-import, sqoop-export, etc. each select a specific tool.
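For instance, the two commands below are equivalent (the connect string and table are placeholders):
$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees
$ sqoop-import --connect jdbc:mysql://database.example.com/employees --table employees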
Controlling the Hadoop Installation
You invoke Sqoop through the program launch capability provided by Hadoop. The sqoop command-line program is a wrapper which runs the bin/hadoop script shipped with Hadoop. If you have multiple installations of Hadoop present on your machine, you can select the Hadoop installation by setting the $HADOOP_COMMON_HOME and $HADOOP_MAPRED_HOME environment variables.
For example:
$ HADOOP_COMMON_HOME=/path/to/some/hadoop \
HADOOP_MAPRED_HOME=/path/to/some/hadoop-mapreduce \
sqoop import --arguments…
or:
$ export HADOOP_COMMON_HOME=/some/path/to/hadoop
$ export HADOOP_MAPRED_HOME=/some/path/to/hadoop-mapreduce
$ sqoop import --arguments…
If either of these variables is not set, Sqoop will fall back to $HADOOP_HOME. If that is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.
The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set.
Using Generic and Specific Arguments
To control the operation of each Sqoop tool, you use generic and specific arguments.
For example:
$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
--connect <jdbc-uri> Specify JDBC connect string
--connect-manager <class-name> Specify connection manager class to use
--driver <class-name> Manually specify JDBC driver class to use
--hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME
--help Print usage instructions
--password-file Set path for file containing authentication password
-P Read password from console
--password <password> Set authentication password
--username <username> Set authentication username
--verbose Print more information while working
--hadoop-home <dir> Deprecated. Override $HADOOP_HOME
[…]
Generic Hadoop command-line arguments:
(must precede any tool-specific arguments)
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is:
bin/hadoop command [genericOptions] [commandOptions]
You must supply the generic arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect). Note that generic Hadoop arguments are preceded by a single dash character (-), whereas tool-specific arguments start with two dashes (--), unless they are single-character arguments such as -P.
The -conf, -D, -fs and -jt arguments control the configuration and Hadoop server settings. For example, -D mapred.job.name=<job_name> can be used to set the name of the MapReduce job that Sqoop launches. If not specified, the name defaults to the jar name for the job, which is derived from the table name used.
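As a small sketch (the connect string and table are placeholders), the job name can be overridden like this, with the generic -D argument placed between the tool name and the tool-specific arguments:
$ sqoop import -D mapred.job.name=employees_import \
--connect jdbc:mysql://database.example.com/employees --table employees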
The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop’s internal argument-parsing system.
Using Options Files to Pass Arguments
When using Sqoop, the command line options that do not change from invocation to invocation can be put in an options file for convenience. An options file is a text file where each line identifies an option, in the order that it would otherwise appear on the command line. Options files allow specifying a single option on multiple lines by using the backslash character at the end of intermediate lines. Comments are also supported within options files; they begin with the hash character, must be specified on a new line, and may not be mixed with option text. All comments and empty lines are ignored when options files are expanded. Unless options appear as quoted strings, any leading or trailing spaces are ignored. Quoted strings, if used, must not extend beyond the line on which they are specified.
Options files can be specified anywhere in the command line as long as the options within them follow the otherwise prescribed rules of options ordering. For instance, regardless of where the options are loaded from, they must follow the ordering such that generic options appear first, tool-specific options next, followed finally by options that are intended to be passed to child programs.
To specify an options file, simply create an options file in a convenient location and pass it to the command line via the --options-file argument. Whenever an options file is specified, it is expanded on the command line before the tool is invoked. You can specify more than one options file within the same invocation if needed. For example, the following Sqoop import invocation can alternatively be specified as shown below:
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
$ sqoop --options-file /users/homer/work/import.txt --table TEST
where the options file /users/homer/work/import.txt contains the following:
import
--connect
jdbc:mysql://localhost/db
--username
foo
The options file can have empty lines and comments for readability purposes. So the above example would work exactly the same if the options file /users/homer/work/import.txt contained the following:
#
# Options file for Sqoop import
#
# Specifies the tool being invoked
import
# Connect parameter and value
--connect
jdbc:mysql://localhost/db
# Username parameter and value
--username
foo
#
# Remaining options should be specified in the command line.
#
sqoop-import
The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
While the Hadoop generic arguments must precede any import arguments, you can type the import arguments in any order with respect to one another.
Common arguments
| Argument | Description |
| --- | --- |
| --connect <jdbc-uri> | Specify JDBC connect string |
| --connection-manager <class-name> | Specify connection manager class to use |
| --driver <class-name> | Manually specify JDBC driver class to use |
| --hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
| --help | Print usage instructions |
| --password-file | Set path for a file containing the authentication password |
| -P | Read password from console |
| --password <password> | Set authentication password |
| --username <username> | Set authentication username |
| --verbose | Print more information while working |
| --connection-param-file <filename> | Optional properties file that provides connection parameters |
| --relaxed-isolation | Set connection transaction isolation to read uncommitted for the mappers. |
Connecting to a Database Server – Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. For example:
$ sqoop import --connect jdbc:mysql://database.example.com/employees
This string will connect to a MySQL database named employees on the host database.example.com. It’s important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on TaskTracker nodes throughout your MapReduce cluster; if you specify the literal name localhost, each node will connect to a different database (or more likely, no database at all). Instead, you should use the full hostname or IP address of the database host that can be seen by all your remote nodes.
You might need to authenticate against the database before you can access it. You can use the --username argument to supply a username to the database. Sqoop provides a couple of different ways, secure and non-secure, to supply a password to the database, as detailed below.
The secure way of supplying a password to the database is to save the password in a file in the user's home directory with 400 permissions and to specify the path to that file using the --password-file argument; this is the preferred method of entering credentials. Sqoop will then read the password from the file and pass it to the MapReduce cluster using secure means, without exposing the password in the job configuration. The file containing the password can be either on the local FS or on HDFS. For example:
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username venkatesh --password-file ${user.home}/.password
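A minimal sketch of preparing such a password file on the local filesystem (the filename and location follow the example above; echo -n avoids a trailing newline being treated as part of the password):
$ echo -n "secret" > $HOME/.password
$ chmod 400 $HOME/.password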
Validation arguments
| Argument | Description |
| --- | --- |
| --validate | Enable validation of data copied, supports single table copy only. |
| --validator <class-name> | Specify validator class to use. |
| --validation-threshold <class-name> | Specify validation threshold class to use. |
| --validation-failurehandler <class-name> | Specify validation failure handler class to use. |
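For example, validation can be switched on for a straightforward single-table copy (connect string and paths are placeholders); by default the row counts of the source table and the imported data are compared:
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--table employees --target-dir /data/employees --validate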
Import control arguments:
| Argument | Description |
| --- | --- |
| --append | Append data to an existing dataset in HDFS |
| --as-avrodatafile | Imports data to Avro Data Files |
| --as-sequencefile | Imports data to SequenceFiles |
| --as-textfile | Imports data as plain text (default) |
| --as-parquetfile | Imports data to Parquet Files |
| --boundary-query <statement> | Boundary query to use for creating splits |
| --columns <col,col,col…> | Columns to import from table |
| --delete-target-dir | Delete the import target directory if it exists |
| --direct | Use direct connector if exists for the database |
| --fetch-size <n> | Number of entries to read from database at once. |
| --inline-lob-limit <n> | Set the maximum size for an inline LOB |
| -m,--num-mappers <n> | Use n map tasks to import in parallel |
| -e,--query <statement> | Import the results of statement. |
| --split-by <column-name> | Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option. |
| --autoreset-to-one-mapper | Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by <col> option. |
| --table <table-name> | Table to read |
| --target-dir <dir> | HDFS destination dir |
| --warehouse-dir <dir> | HDFS parent for table destination |
| --where <where clause> | WHERE clause to use during import |
| -z,--compress | Enable compression |
| --compression-codec <c> | Use Hadoop codec (default gzip) |
| --null-string <null-string> | The string to be written for a null value for string columns |
| --null-non-string <null-string> | The string to be written for a null value for non-string columns |
The --null-string and --null-non-string arguments are optional. If not specified, then the string “null” will be used.
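For example, a common convention (not a requirement) is to write \N for nulls so that Hive recognizes them; the extra backslash keeps the shell and Sqoop from consuming the escape:
$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
--null-string '\\N' --null-non-string '\\N' --target-dir /data/employees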
Selecting the Data to Import – Sqoop typically imports data in a table-centric fashion. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in a database.
By default, all columns within a table are selected for import. Imported data is written to HDFS in its “natural order;” that is, a table containing columns A, B, and C results in an import of data such as:
A1,B1,C1
A2,B2,C2
…
You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to import. For example: --columns “name,employee_id,jobtitle”.
You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column list> FROM <table name>. You can append a WHERE clause to this with the --where argument. For example: --where “id > 400”. Only rows where the id column has a value greater than 400 will be imported.
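Putting the two together, a hypothetical import of a few columns of the most recent rows might look like:
$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
--columns "name,employee_id,jobtitle" --where "id > 400" \
--target-dir /data/recent_employees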
By default, Sqoop uses the query select min(<split-by>), max(<split-by>) from <table name> to find boundaries for creating splits. In some cases this query is not optimal, so you can specify any arbitrary query returning two numeric columns using the --boundary-query argument.
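For instance, a boundary query could narrow the split range to the rows actually being imported (table, column, and predicate are illustrative):
$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees \
--where "id > 400" --split-by id \
--boundary-query "SELECT MIN(id), MAX(id) FROM employees WHERE id > 400" \
--target-dir /data/recent_employees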
Free-form Query Imports – Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.
When importing a free-form query, you must specify a destination directory with --target-dir.
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
For example:
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
Alternately, the query can be executed once and imported serially, by specifying a single map task with -m 1:
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
sqoop-export
The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
The default operation is to transform these into a set of INSERT statements that inject the records into the database. In “update mode,” Sqoop will generate UPDATE statements that replace existing records in the database, and in “call mode” Sqoop will make a stored procedure call for each record.
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Although the Hadoop generic arguments must precede any export arguments, the export arguments can be entered in any order with respect to one another.
Common arguments
| Argument | Description |
| --- | --- |
| --connect <jdbc-uri> | Specify JDBC connect string |
| --connection-manager <class-name> | Specify connection manager class to use |
| --driver <class-name> | Manually specify JDBC driver class to use |
| --hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
| --help | Print usage instructions |
| --password-file | Set path for a file containing the authentication password |
| -P | Read password from console |
| --password <password> | Set authentication password |
| --username <username> | Set authentication username |
| --verbose | Print more information while working |
| --connection-param-file <filename> | Optional properties file that provides connection parameters |
| --relaxed-isolation | Set connection transaction isolation to read uncommitted for the mappers. |
Validation arguments
| Argument | Description |
| --- | --- |
| --validate | Enable validation of data copied, supports single table copy only. |
| --validator <class-name> | Specify validator class to use. |
| --validation-threshold <class-name> | Specify validation threshold class to use. |
| --validation-failurehandler <class-name> | Specify validation failure handler class to use. |
Export control arguments:
| Argument | Description |
| --- | --- |
| --columns <col,col,col…> | Columns to export to table |
| --direct | Use direct export fast path |
| --export-dir <dir> | HDFS source path for the export |
| -m,--num-mappers <n> | Use n map tasks to export in parallel |
| --table <table-name> | Table to populate |
| --call <stored-proc-name> | Stored procedure to call |
| --update-key <col-name> | Anchor column to use for updates. Use a comma-separated list of columns if there is more than one column. |
| --update-mode <mode> | Specify how updates are performed when new rows are found with non-matching keys in the database. Legal values for mode include updateonly (default) and allowinsert. |
| --input-null-string <null-string> | The string to be interpreted as null for string columns |
| --input-null-non-string <null-string> | The string to be interpreted as null for non-string columns |
| --staging-table <staging-table-name> | The table in which data will be staged before being inserted into the destination table. |
| --clear-staging-table | Indicates that any data present in the staging table can be deleted. |
| --batch | Use batch mode for underlying statement execution. |
The --export-dir argument and one of --table or --call are required. These specify the table to populate in the database (or the stored procedure to call), and the directory in HDFS that contains the source data.
By default, all columns within a table are selected for export. You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to export. For example: --columns “col1,col2,col3”. Note that columns that are not included in the --columns parameter must either have a defined default value or allow NULL values. Otherwise your database will reject the exported data, which in turn will make the Sqoop job fail.
You can control the number of mappers independently from the number of files present in the directory. Export performance depends on the degree of parallelism. By default, Sqoop will use four tasks in parallel for the export process. This may not be optimal; you will need to experiment with your own particular setup. Additional tasks may offer better concurrency, but if the database is already bottlenecked on updating indices, invoking triggers, and so on, then additional load may decrease performance. The --num-mappers or -m arguments control the number of map tasks, which is the degree of parallelism used.
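For instance, an export could be run with eight parallel map tasks as a starting point for tuning (connect string, table, and path are placeholders):
$ sqoop export --connect jdbc:mysql://database.example.com/corp --table bar \
--export-dir /results/bar_data -m 8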
Some databases provide a direct mode for exports as well. Use the --direct argument to specify this codepath. This may be higher-performance than the standard JDBC codepath. The --input-null-string and --input-null-non-string arguments are optional. If --input-null-string is not specified, then the string “null” will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string “null” and the empty string will be interpreted as null for non-string columns. Note that the empty string will always be interpreted as null for non-string columns, in addition to any other string specified by --input-null-non-string.
Since Sqoop breaks down the export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option, which acts as an auxiliary table used to stage exported data. The staged data is finally moved to the destination table in a single transaction.
In order to use the staging facility, you must create the staging table prior to running the export job. This table must be structurally identical to the target table. This table should either be empty before the export job runs, or the --clear-staging-table option must be specified. If the staging table contains data and the --clear-staging-table option is specified, Sqoop will delete all of the data before starting the export job.
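A sketch of an export routed through a staging table, assuming a structurally identical table named foo_stage has been created beforehand (all names and paths are placeholders):
$ sqoop export --connect jdbc:mysql://database.example.com/corp --table foo \
--staging-table foo_stage --clear-staging-table --export-dir /path/to/data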
Inserts vs. Updates – By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.
If you specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is treated as an UPDATE statement that modifies an existing row. The row a statement modifies is determined by the column name(s) specified with --update-key. For example, consider the following table definition:
CREATE TABLE foo(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);
Consider also a dataset in HDFS containing records like these:
0,this is a test,42
1,some more data,100
…
Running sqoop-export --table foo --update-key id --export-dir /path/to/data --connect … will run an export job that executes SQL statements based on the data like so:
UPDATE foo SET msg='this is a test', bar=42 WHERE id=0;
UPDATE foo SET msg='some more data', bar=100 WHERE id=1;
…
If an UPDATE statement modifies no rows, this is not considered an error; the export will silently continue. (In effect, this means that an update-based export will not insert new rows into the database.) Likewise, if the column specified with –update-key does not uniquely identify rows and multiple rows are updated by a single statement, this condition is also undetected.
The --update-key argument can also be given a comma-separated list of column names, in which case Sqoop will match all keys from this list before updating any existing record.
Depending on the target database, you may also specify the --update-mode argument with allowinsert mode if you want to update rows if they exist in the database already or insert rows if they do not exist yet.
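Continuing the foo example, an upsert-style export might be expressed as follows; whether allowinsert actually works depends on the target database and its connector:
$ sqoop export --connect jdbc:mysql://database.example.com/corp --table foo \
--update-key id --update-mode allowinsert --export-dir /path/to/data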
Exports and Transactions – Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database; these have separate transactions from one another. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound and cause out-of-memory conditions. Therefore, an export is not an atomic process. Partial results from the export will become visible before the export is complete.