cdk_delta_glue

Run Delta Lake in AWS Glue using the AWS Cloud Development Kit (CDK).

Overview

[Figure: overview]

Usage

# install cdk and python deps
./install.sh

# deploy the stack to your account (first set your AWS profile in ./deploy.sh)
./deploy.sh

DMS Dummy Data

Full load

parquet-tools show assets/dms/full/data_ronald/LOAD00000001.parquet

+----------------------------+------+---------------------+-----------+---------+
| TIMESTAMP                  |   id | datetime            |   channel |   value |
|----------------------------+------+---------------------+-----------+---------|
| 2021-06-11 08:28:55.010260 |    2 | 2021-05-07 09:21:25 |        24 |      12 |
| 2021-06-11 08:28:55.017027 |    3 | 2021-05-07 09:21:25 |        26 |      13 |
| 2021-06-11 08:28:55.017043 |    4 | 2021-05-07 09:21:25 |        28 |      14 |
+----------------------------+------+---------------------+-----------+---------+

Changes

parquet-tools show assets/dms/cdc/data_ronald/2021/06/11/20210611-084409279.parquet

+------+----------------------------+------+---------------------+-----------+---------+
| Op   | TIMESTAMP                  |   id | datetime            |   channel |   value |
|------+----------------------------+------+---------------------+-----------+---------|
| U    | 2021-06-11 08:43:04.000000 |    2 | 2021-05-07 09:21:25 |        48 |      12 |
+------+----------------------------+------+---------------------+-----------+---------+

parquet-tools show assets/dms/cdc/data_ronald/2021/06/11/20210611-084751679.parquet

+------+----------------------------+------+---------------------+-----------+---------+
| Op   | TIMESTAMP                  |   id | datetime            |   channel |   value |
|------+----------------------------+------+---------------------+-----------+---------|
| I    | 2021-06-11 08:46:47.000000 |    1 | 2021-06-11 08:46:47 |        11 |      11 |
| D    | 2021-06-11 08:47:42.000000 |    1 | 2021-06-11 08:46:47 |        11 |      11 |
+------+----------------------------+------+---------------------+-----------+---------+

parquet-tools show assets/dms/cdc/data_ronald/2021/06/11/20210611-084932768.parquet

+------+----------------------------+------+---------------------+-----------+---------+
| Op   | TIMESTAMP                  |   id | datetime            |   channel |   value |
|------+----------------------------+------+---------------------+-----------+---------|
| I    | 2021-06-11 08:48:30.000000 |    1 | 2021-06-11 08:48:30 |       100 |     200 |
| I    | 2021-06-11 08:48:30.000000 |   10 | 2021-06-11 08:48:30 |      1000 |    2000 |
+------+----------------------------+------+---------------------+-----------+---------+
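
The change files above contain several operations for the same id (id 1 is inserted, deleted, and inserted again). A common step before merging, assumed here rather than taken from this repository's Glue job, is to keep only the most recent change per id. The sketch below does that in plain Python on the rows shown above; `CDC_ROWS` and `latest_change_per_id` are hypothetical names, and the datetime column is left out for brevity.

```python
# CDC rows copied from the three change files above:
# (Op, TIMESTAMP, id, channel, value). The datetime column is omitted.
CDC_ROWS = [
    ("U", "2021-06-11 08:43:04.000000", 2, 48, 12),
    ("I", "2021-06-11 08:46:47.000000", 1, 11, 11),
    ("D", "2021-06-11 08:47:42.000000", 1, 11, 11),
    ("I", "2021-06-11 08:48:30.000000", 1, 100, 200),
    ("I", "2021-06-11 08:48:30.000000", 10, 1000, 2000),
]


def latest_change_per_id(rows):
    """Return {id: row}, keeping the row with the newest TIMESTAMP per id.

    Hypothetical helper for illustration only.
    """
    latest = {}
    for row in rows:
        timestamp, row_id = row[1], row[2]
        # Fixed-width timestamp strings sort chronologically as plain text.
        # On a tie, the row that appears later in the input wins.
        if row_id not in latest or timestamp >= latest[row_id][1]:
            latest[row_id] = row
    return latest
```

Calling `latest_change_per_id(CDC_ROWS)` leaves one row each for ids 1, 2, and 10. For id 1 the surviving row is the second insert, so the earlier insert and delete never reach the merge.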

Table after the merge

+----------------------------+------+---------------------+-----------+---------+
| TIMESTAMP                  |   id | datetime            |   channel |   value |
+----------------------------+------+---------------------+-----------+---------+
| 2021-06-11 08:48:30.000000 |    1 | 2021-06-11 10:48:30 |       100 |     200 |
| 2021-06-11 08:43:04.000000 |    2 | 2021-05-07 11:21:25 |        48 |      12 |
| 2021-06-11 08:28:55.017027 |    3 | 2021-05-07 11:21:25 |        26 |      13 |
| 2021-06-11 08:28:55.017043 |    4 | 2021-05-07 11:21:25 |        28 |      14 |
| 2021-06-11 08:48:30.000000 |   10 | 2021-06-11 10:48:30 |      1000 |    2000 |
+----------------------------+------+---------------------+-----------+---------+
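
To make the merged result easier to check, here is a minimal sketch that replays the change records on top of the full load in plain Python. It only illustrates the insert/update/delete semantics of the `Op` column; it is not the Delta Lake merge the Glue job runs. `FULL_LOAD`, `CHANGES`, and `apply_changes` are hypothetical names, and only the id, channel, and value columns are tracked.

```python
# Full load rows from above: id -> (channel, value).
FULL_LOAD = {2: (24, 12), 3: (26, 13), 4: (28, 14)}

# Change records in file order: (Op, id, channel, value).
CHANGES = [
    ("U", 2, 48, 12),
    ("I", 1, 11, 11),
    ("D", 1, 11, 11),
    ("I", 1, 100, 200),
    ("I", 10, 1000, 2000),
]


def apply_changes(table, changes):
    """Replay CDC operations on a copy of the table and return the result."""
    merged = dict(table)
    for op, row_id, channel, value in changes:
        if op == "D":
            # Delete: drop the row if it exists.
            merged.pop(row_id, None)
        else:
            # Insert ("I") and update ("U") are both treated as upserts.
            merged[row_id] = (channel, value)
    return merged


MERGED = apply_changes(FULL_LOAD, CHANGES)
```

`MERGED` contains ids 1, 2, 3, 4, and 10 with the same channel and value pairs as the table above.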

Athena Table

Create the Athena table by running the following SQL statement in the AWS Athena console:

CREATE EXTERNAL TABLE `data_ronald` (
  `id` bigint COMMENT '',
  `datetime` timestamp COMMENT '',
  `channel` bigint COMMENT '',
  `value` float COMMENT ''
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<YOUR_ACCOUNT_ID>-cdk-delta-glue/delta/data_ronald/_symlink_format_manifest/'
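
The `<YOUR_ACCOUNT_ID>` placeholder has to be replaced before the statement will run. As a small convenience sketch, the Python below fills it in and rejects anything that is not a 12-digit AWS account ID. `DDL_TEMPLATE` and `build_ddl` are hypothetical names that are not part of this repository; the statement text is copied from above.

```python
DDL_TEMPLATE = """CREATE EXTERNAL TABLE `data_ronald` (
  `id` bigint COMMENT '',
  `datetime` timestamp COMMENT '',
  `channel` bigint COMMENT '',
  `value` float COMMENT ''
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://{account_id}-cdk-delta-glue/delta/data_ronald/_symlink_format_manifest/'"""


def build_ddl(account_id: str) -> str:
    """Return the CREATE TABLE statement for the given AWS account ID."""
    # AWS account IDs are exactly 12 decimal digits.
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("expected a 12-digit AWS account ID")
    return DDL_TEMPLATE.format(account_id=account_id)
```

For example, `print(build_ddl("123456789012"))` prints a statement that can be pasted into the Athena query editor.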

Query the table

[Figure: overview]