GitXplorerGitXplorer
o

cdk_delta_glue

public
0 stars
0 forks
0 issues

Commits

List of commits on branch master.
Unverified
c4818da0949783b6a207d3a1bb0a5ff7418a72ce

update readme

oorangewise committed 3 years ago
Unverified
cf0789f5b9b3cb90463a54ad323b299cf6958002

format

oorangewise committed 4 years ago
Unverified
fa204785c9746c097772685d2ac178f0d1e43c94

add merged table to readme

oorangewise committed 4 years ago
Unverified
2e36a29269425097cbd15a00203166f0a7f17380

refactor spark job

oorangewise committed 4 years ago
Unverified
7372d211862e07400f5cb2294f617457ae1baa54

add cdc merge job

oorangewise committed 4 years ago
Unverified
20d44303dba10ca68239763ea9119ed993c3ef4a

add dms dummy parquet files

oorangewise committed 4 years ago

README

The README file for this repository.

cdk_delta_glue

Run Delta Lake in AWS Glue using the Cloud Development Kit.

Overview

overview

Usage

# install cdk and python deps
./install.sh

# deploy stack to your account (first change the profile in the ./deploy.sh)
./deploy.sh

DMS Dummy Data

Full load

parquet-tools show assets/dms/full/data_ronald/LOAD00000001.parquet

+----------------------------+------+---------------------+-----------+---------+
| TIMESTAMP                  |   id | datetime            |   channel |   value |
|----------------------------+------+---------------------+-----------+---------|
| 2021-06-11 08:28:55.010260 |    2 | 2021-05-07 09:21:25 |        24 |      12 |
| 2021-06-11 08:28:55.017027 |    3 | 2021-05-07 09:21:25 |        26 |      13 |
| 2021-06-11 08:28:55.017043 |    4 | 2021-05-07 09:21:25 |        28 |      14 |
+----------------------------+------+---------------------+-----------+---------+

Changes

parquet-tools show assets/dms/cdc/data_ronald/2021/06/11/20210611-084409279.parquet

+------+----------------------------+------+---------------------+-----------+---------+
| Op   | TIMESTAMP                  |   id | datetime            |   channel |   value |
|------+----------------------------+------+---------------------+-----------+---------|
| U    | 2021-06-11 08:43:04.000000 |    2 | 2021-05-07 09:21:25 |        48 |      12 |
+------+----------------------------+------+---------------------+-----------+---------+

parquet-tools show assets/dms/cdc/data_ronald/2021/06/11/20210611-084751679.parquet

+------+----------------------------+------+---------------------+-----------+---------+
| Op   | TIMESTAMP                  |   id | datetime            |   channel |   value |
|------+----------------------------+------+---------------------+-----------+---------|
| I    | 2021-06-11 08:46:47.000000 |    1 | 2021-06-11 08:46:47 |        11 |      11 |
| D    | 2021-06-11 08:47:42.000000 |    1 | 2021-06-11 08:46:47 |        11 |      11 |
+------+----------------------------+------+---------------------+-----------+---------+

parquet-tools show assets/dms/cdc/data_ronald/2021/06/11/20210611-084932768.parquet

+------+----------------------------+------+---------------------+-----------+---------+
| Op   | TIMESTAMP                  |   id | datetime            |   channel |   value |
|------+----------------------------+------+---------------------+-----------+---------|
| I    | 2021-06-11 08:48:30.000000 |    1 | 2021-06-11 08:48:30 |       100 |     200 |
| I    | 2021-06-11 08:48:30.000000 |   10 | 2021-06-11 08:48:30 |      1000 |    2000 |
+------+----------------------------+------+---------------------+-----------+---------+

Table after the merge

+----------------------------+------+---------------------+-----------+---------+
| TIMESTAMP                  |   id | datetime            |   channel |   value |
+----------------------------+------+---------------------+-----------+---------+
| 2021-06-11 08:48:30.000000 |    1 | 2021-06-11 10:48:30 |       100 |     200 |
| 2021-06-11 08:43:04.000000 |    2 | 2021-05-07 11:21:25 |        48 |      12 |
| 2021-06-11 08:28:55.017027 |    3 | 2021-05-07 11:21:25 |        26 |      13 |
| 2021-06-11 08:28:55.017043 |    4 | 2021-05-07 11:21:25 |        28 |      14 |
| 2021-06-11 08:48:30.000000 |   10 | 2021-06-11 10:48:30 |      1000 |    2000 |
+----------------------------+------+---------------------+-----------+---------+

Athena Table

Create the Athena table by running the following sql statement (in the sAWS Athena console):

CREATE EXTERNAL TABLE `data_ronald` (
  `id` bigint COMMENT '',
  `datetime` timestamp COMMENT '',
  `channel` bigint COMMENT '',
  `value` float COMMENT ''
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<YOUR_ACCOUNT_ID>-cdk-delta-glue/delta/data_ronald/_symlink_format_manifest/'

Query the table

overview