Import and replicate data from MySQL to Clickhouse

引入

本次将数据从 MySQL 备份到 ClickHouse 采用的是 MinervaDB ClickHouse MySQL Data Reader ，该项目基础版本来自于 Alitinity/clickhouse-mysql-data-reader。

安装

1
2
3

pip3 install mysqlclient
pip3 install mysql-replication
pip3 install clickhouse-driver

这里必须使用 pip3，并且最少需要 Python 3.4 版本。

下面我们直接拉取主版本分支，保证使用的是最新版本。

1 2	git clone https://github.com/Altinity/clickhouse-mysql-data-reader pip3 install -e clickhouse-mysql-data-reader/

配置

使用 root 用户登录 MySQL 创建 clickhouse-mysql-data-reader 访问数据所需要的用户和权限。

mysql> CREATE USER 'clickhousereader'@'%' IDENTIFIED BY 'MDB@2019';
Query OK, 0 rows affected (0.02 sec)
mysql> CREATE USER 'clickhousereader'@'127.0.0.1' IDENTIFIED BY 'MDB@2019';
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE USER 'clickhousereader'@'localhost' IDENTIFIED BY 'MDB@2019';
Query OK, 0 rows affected (0.02 sec)
mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'clickhousereader'@'%';
Query OK, 0 rows affected (0.01 sec)
mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'clickhousereader'@'127.0.0.1';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'clickhousereader'@'localhost';
Query OK, 0 rows affected, 1 warning (0.01 sec)

my.cnf（前提需要开启 bin log）

[mysqld]
# mandatory
server-id = 200
log_bin = /var/lib/mysql/bin.log
binlog-format = row # very important if you want to receive write, update and delete row
events
# optional
expire_logs_days = 30
max_binlog_size = 900M
# setup listen address
bind-address = 0.0.0.0

在完成上面的配置之后，我们就可以将数据导入到 ClickHouse 中。
这里建议使用从库导入数据，另外不建议再修改表结构避免数据损坏。

优点和限制

仅支持 Insert 事务。
不支持 Delete 和 Update。
不支持 DDL，MySQL 源端任何不兼容的修改都会破坏同步数据的流畅。

运行同步历史数据

在完成上面的内容之后，下面就可以将历史数据导入到 ClickHouse 中。下面就是运行导入的脚本：

建议在确定表结构的实例上进行导入，以避免可能发生数据损坏。另外建议使用 MySQL 从机作为数据源。

clickhouse-mysql --src-host=127.0.0.1 --src-user=mysql_user --src-password=password --src-tables=wiki.pageviews --dst-host=127.0.0.1 --dst-port=9000 --dst-user=ck_user --dst-password=password --dst-schema=risk_engine --dst-create-table --migrate-table

上面的脚本连接主机 192.168.56.101 上的 MySQL，他会迁移 wiki 库的 pageviews 表中的数据到主机 127.0.0.1 上的 ClickHouse 中，并且会自动建表和迁移数据。

需要对迁移数据做筛选时，可使用 --src-tables-where-clauses 参数。

剩下的就是验证同步数据前后的数量是否一致即可。

监听 `bin log` 同步实时数据

另外还可以在 bin log 日志文件中指定clickhouse-mysql 工具开始同步的位置同步数据。

clickhouse-mysql --src-host=127.0.0.1 --src-user=mysql_user --src-password=password --src-tables=db.table1,db.table2 --dst-host=127.0.0.1 --dst-port=9000 --dst-user=ck_user --dst-password=password --dst-schema=db --dst-create-table --src-resume --src-wait --nice-pause=1 --log-level=info --csvpool --mempool-max-flush-interval=60 --mempool-max-events-num=1 --pump-data --src-server-id=222 --log-file=clickhouse-mysql.log --binlog-position-file=/tmp/ck

常见异常

密码配置错误

1 2	CRITICAL:Code: 516. DB::Exception: default: Authentication failed: password is incorrect or there is no user with such name.

解决方法：
修改为正确的密码即可。

缺少 Python 文件

1 2	MySQLdb/_mysql.c:46:20: fatal error: Python.h: No such file or directory #include "Python.h"

解决方法：

1	yum install python3-devel

gcc 命令失败

1 2	compilation terminated. error: command 'gcc' failed with exit status 1

解决方法：

1	yum install gcc

MySQL 配置文件缺失
1
OSError: mysql_config not found
解决方法：
1
yum install mysql-devel
在安装过程中可能会出现密钥失效的情况，可以更新密钥：
1
rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2022
未找到 pip3 命令
1
yum install pyhton3-pip

参数解析

-h, --help              show this help message and exit
查看帮助文档

--config-file CONFIG_FILE
                        Path to config file. Default - not specified
配置文件路径。

--log-file LOG_FILE     Path to log file. Default - not specified
日志文件路径。

--log-level LOG_LEVEL   Log Level. Default - NOTSET
日志级别。

--nice-pause NICE_PAUSE
                        Make specified (in sec) pause between attempts to read binlog stream
读取前后两次 binlog 之间暂停的时间（以秒为单位）。

--dry                   Dry mode - do not do anything that can harm. Useful
                        for debugging.
空跑模式，不做任何有害的事情，用于调试。

--daemon                Daemon mode - go to background.
守护模式，转向后台运行。

--pid-file PID_FILE     Pid file to be used by the app in daemon mode
应用程序在守护模式下使用的 PID 文件。

--binlog-position-file BINLOG_POSITION_FILE
                        File to write binlog position to during bin log
                        reading and to read position from on start
在读取 binlog 文件期间要将 binlog 写入文件的位置和从开始读取位置的文件。

--mempool               Cache data in mem.
缓存内存中的数据

--mempool-max-events-num MEMPOOL_MAX_EVENTS_NUM
                        Max events number to pool - triggering pool flush
缓存池中的最大事件数，会触发缓存池的刷新。

--mempool-max-rows-num MEMPOOL_MAX_ROWS_NUM
                        Max rows number to pool - triggering pool flush
缓存池中的最大行数，会触发缓存池的刷新。

--mempool-max-flush-interval MEMPOOL_MAX_FLUSH_INTERVAL
                        Max seconds number between pool flushes
缓存池之间刷新的最大秒数。

--csvpool               Cache data in CSV pool files on disk. Requires memory
                        pooling, thus enables --mempool even if it is not explicitly specified
将 CSV 池的数据缓存在磁盘中。过程中需要使用内存池，即使没有明确指定也要启用 --mempool。

--csvpool-file-path-prefix CSVPOOL_FILE_PATH_PREFIX
                        File path prefix to CSV pool files
CSV 池文件前缀路径。

--csvpool-keep-files    Keep CSV pool files. Useful for debugging
保留 CSV 池文件，用于调试。

--create-table-sql-template
                        Prepare CREATE TABLE SQL template(s).
准备 CREATE TABLE SQL 模版。

--create-table-sql      Prepare CREATE TABLE SQL statement(s).
准备 CREATE TABLE SQL 语句。

--with-create-database
                        Prepend each CREATE TABLE SQL statement(s) with CREATEDATABASE statement
在每一个 CREATE TABLE SQL 语句前添加 CREATE DATABASE 语句。

--create-table-json-template
                        Prepare CREATE TABLE template(s) as JSON. Useful for IPC
准备 JSON 格式的 CREATE TABLE 模版，用于 IPC。

--migrate-table         Migrate table(s). Copy existing data from MySQL
                        table(s) with SELECT statement. Binlog is not read
                        during this procedure - just copy data from the src
                        table(s). IMPORTANT!. Target table has to be created
                        in ClickHouse or it has to be created wit--dst-
                        create-table and possibly wit--with-create-database
                        options. Se--create-table-sql-template an--create-
                        table-sql options for additional info.
合并表。使用 SELECT 语句从 MySQL 表中复制数据。在此期间从 binlog 文件中读取数据，仅从源表中复制数据。目标表需要在 ClickHouse 中被创建 或者使用 --dst-create-table 参数创建表和使用过 --with-create-database 创建库。更多的信息可以参考 --create-table-sql-template 和 --create-table-sql。

--pump-data             Pump data from MySQL binlog into ClickHouse. Copy rows
                        from binlog until the end of binlog reached. When end
                        of binlog reached, process ends. Use in combination
                        wit--src-wait in case would like to continue and
                        wait for new rows after end of binlog reached
从 MYSQL 的 binlog 中抽取数据到 ClickHosue。从 binlog 中复制行，直到到达 binlog 的末尾。当到达 binlog 末尾时，运行停止。如果你想在到达 binlog 末尾时可以等待新的行到达然后重新运行，可以使用 --src-wait 参数。

--install               Install service file(s)
安装服务文件。

--src-server-id SRC_SERVER_ID
                        Set server_id to be used when reading date from MySQL
                        src. Ex.: 1
设置从 MySQL 源读取数据时要使用的 server_id。例如：1。

--src-host SRC_HOST     Host to be used when reading from src. Ex.: 127.0.0.1
从源读取数据需要使用的 Host。例如：127.0.0.1。

--src-port SRC_PORT     Port to be used when reading from src. Ex.: 3306
从源读取数据需要使用的 Port。例如：3306。

--src-user SRC_USER     Username to be used when reading from src. Ex.: root
从源读取数据需要使用的 Username。例如：root。

--src-password SRC_PASSWORD
                        Password to be used when reading from src. Ex.: qwerty
从源读取数据需要使用的 Password。例如：qwerty。

--src-schemas SRC_SCHEMAS
                        Comma-separated list of databases (a.k.a schemas) to
                        be used when reading from src. Ex.: db1,db2,db3
从源读取数据需要使用的数据库列表（以逗号分隔），例如：db1,db2,db3。

--src-tables SRC_TABLES
                        Comma-separated list of tables to be used when reading
                        from src. Ex.: table1,table2,table3Ex.:
                        db1.table1,db2.table2,db3.table3
从源读取数据需要使用的表（以逗号分隔），例如：table1,table2,table3，例如：db1.table1,db2.table2,db3.table3。

--src-tables-where-clauses SRC_TABLES_WHERE_CLAUSES
                        Comma-separated list of WHERE clauses for tables to be
                        migrated. Ex.: db1.t1="a=1 and b=2",db2.t2="c=3 and
                        k=4". Accepts both (comma-separated) clause (useful
                        for short clauses) or file where clause is located
                        (useful for long clauses)
要迁移表的 WHERE 子句（以逗号分隔）。例如：db1.t1="a=1 and b=2",db2.t2="c=3 and k=4"。

--src-tables-prefixes SRC_TABLES_PREFIXES
                        Comma-separated list of table prefixes to be used when
                        reading from src.Useful when we need to process
                        unknown-in-advance tables, say day-named log tables,
                        as log_2017_12_27Ex.: mylog_,anotherlog_,extralog_3
从源读取数据要使用的表前缀列表。在处理未知的预先表时非常有用，例如 log_2017_12_27 日期命名的日志表。例如：mylog_,anotherlog_,extralog_3。

--src-wait               Wait indefinitely for new records to come.
无限期等待新的记录出现。

--src-resume            Resume reading from previous position. Previous
                        position is read from `binlog-position-file`
从先前位置开始读取。从 binlog-position-file 上读取先前位置。

--src-binlog-file SRC_BINLOG_FILE
                        Binlog file to be used to read from src. Related to
                        `binlog-position-file`. Ex.: mysql-bin.000024
从源读取数据需要使用的 binlog 文件，与 binlog-position-file 有关。例如：mysql-bin.000024。

--src-binlog-position SRC_BINLOG_POSITION
                        Binlog position to be used when reading from src.
                        Related to `binlog-position-file`. Ex.: 5703
从源读取数据需要使用的 binlog 位置，与 binlog-position-file 有关。例如：5703。

--src-file SRC_FILE     Source file to read data from. CSV
从中读取数据的源文件，例如：CSV。

--dst-file DST_FILE     Target file to be used when writing data. CSV
写入数据需要使用的目标文件，例如：CSV。

--dst-host DST_HOST     Host to be used when writing to dst. Ex.: 127.0.0.1
写入目标需要使用的 Host，例如：127.0.0.1。

--dst-port DST_PORT     Port to be used when writing to dst. Ex.: 9000
写入目标需要使用的 Port，例如：9000。

--dst-user DST_USER     Username to be used when writing to dst. Ex: default
写入目标需要使用的 Username，例如：default。

--dst-password DST_PASSWORD
                        Password to be used when writing to dst. Ex.: qwerty
写入目标需要使用的 Password，例如：qwerty。

--dst-schema DST_SCHEMA
                        Database (a.k.a schema) to be used to create tables in
                        ClickHouse. It overwrites source database(s) name(s),
                        so tables in ClickHouse would be located in
                        differently named db than in MySQL. Ex.: db1
在 ClickHouse 中被用于创建表的 Database，他会覆盖源数据库的名称，因此跟 MYSQL 相比在 ClickHouse 本地可以用不同的名称，例如：db1。

--dst-distribute        Whether to add distribute table
是否添加分发表。

--dst-cluster DST_CLUSTER
                        Cluster to be used when writing to dst. Ex.: cluster1
写入目标需要使用的 Cluster，例如：cluster1。

--dst-table DST_TABLE
                        Table to be used when writing to dst. Ex.: table1
写入目标需要使用的 table，例如：table1。

--dst-table-prefix DST_TABLE_PREFIX
                        Prefix to be used when creating dst table. Ex.:copy_table_
当创建目标表时需要使用的前缀，例如：copy_yable_。

--dst-create-table      Prepare and run CREATE TABLE SQL statement(s).
准备和运行 CREATE TABLE SQL 语句。

--column-default-value [COLUMN_DEFAULT_VALUE [COLUMN_DEFAULT_VALUE ...]]
                        Set of key=value pairs for columns default values.
                        Ex.: date_1=2000-01-01 timestamp_1=2002-01-01 01:02:03
设置列默认值的 key=value 对，例如：date_1=2000-01-01，timestamp_1=2002-01-01 01:02:03。

--column-skip [COLUMN_SKIP [COLUMN_SKIP ...]]
                        Set of column names to skip. Ex.: column1 column2
设置跳过的列名称，例如：column1，column2。

--ch-converter-file CH_CONVERTER_FILE
                        Filename where to search for CH converter class
搜索 CH 转换器类的文件名。

--ch-converter-class CH_CONVERTER_CLASS
                        Converter class name i--ch-converter-file file
--ch-converter-file 文件中的转换器类名。

引用

MySQL-to-ClickHouse

个人备注

此博客内容均为作者学习所做笔记，侵删！
若转作其他用途，请注明来源！

引入

安装

配置