
[Qwiklab] Cloud Composer Copying BigQuery Tables Across Different Locations

by BTC_유온 2022. 6. 3.

Data Engineering course (in Qwiklab)

Cloud Composer Copying BigQuery Tables Across Different Locations
https://www.cloudskillsboost.google/focuses/3528?locale=en&parent=catalog

 




Part 1. Overview
(1) Related services
ⓐ Cloud Composer
A fully managed workflow orchestration service for creating, scheduling, monitoring, and managing workflows that span clouds and on-premises data centers.

[GCP Data] Cloud Composer Overview and Features
https://blog.naver.com/ohn-lab/222599687351

[GCP 뽀개기] CH 11. Cloud Composer-①
https://blog.naver.com/ohn-lab/222606082089

[GCP 뽀개기] CH 11. Cloud Composer-②
https://blog.naver.com/ohn-lab/222607072431


ⓑ BigQuery
A relational, serverless cloud data warehouse that can query large-scale datasets in a matter of seconds.
[GCP in Action] CH 19. BigQuery-①
https://blog.naver.com/ohn-lab/222601565238

[GCP in Action] CH 19. BigQuery-②
https://blog.naver.com/ohn-lab/222605131452

(2) Lab objectives
Create and run an Apache Airflow workflow in Cloud Composer that performs the following tasks:

Reads from a config file the list of tables to copy
Exports the list of tables from a BigQuery dataset located in US to Cloud Storage
Copies the exported tables from US to EU Cloud Storage buckets
Imports the list of tables into the target BigQuery Dataset in EU


Part 2. Preparation
(1) Create Cloud Composer environment
Version : Composer 1
Name : composer-advanced-lab
Location : us-central1
Zone : us-central1-a

(2) Create Cloud Storage buckets
The Cloud Storage buckets are created ad hoc to stage the BigQuery tables while they are copied from one location to the other (a scripted equivalent is sketched after this list).

ⓐ Cloud Storage bucket in the US
Name : <any-unique-name-1>-us
Location : US

ⓑ Cloud Storage bucket in the EU
Name : <any-unique-name-2>-eu
Location : EU
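
If you prefer to script this step rather than use the console, a minimal sketch with the google-cloud-storage Python client could look as follows (the bucket names are placeholders; pick globally unique names):

from google.cloud import storage

client = storage.Client()

# Placeholder names; Cloud Storage bucket names must be globally unique.
US_BUCKET = 'my-unique-name-us'
EU_BUCKET = 'my-unique-name-eu'

# Staging bucket in the US multi-region (source side).
client.create_bucket(US_BUCKET, location='US')

# Staging bucket in the EU multi-region (destination side).
client.create_bucket(EU_BUCKET, location='EU')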


(3) BigQuery destination dataset

Dataset ID : nyc_tlc_EU
Location : EU (multi-region in the European Union)
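
The destination dataset can also be created programmatically; a small sketch with the google-cloud-bigquery client, using the dataset ID from the lab and leaving everything else at its defaults:

from google.cloud import bigquery

client = bigquery.Client()

# Destination dataset in the EU multi-region.
dataset = bigquery.Dataset(f'{client.project}.nyc_tlc_EU')
dataset.location = 'EU'
client.create_dataset(dataset, exists_ok=True)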


Part 3. Creating the workflow
(1) Defining the workflow
A Cloud Composer workflow is defined as a DAG (directed acyclic graph).
The workflow is created from the code in bq_copy_across_locations.py:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/composer/workflows/bq_copy_across_locations.py


Operators used for the workflow tasks (a sketch of how they chain together follows this list)
ⓐ DummyOperator
Creates Start and End dummy tasks for better visual representation of the DAG

ⓑ BigQueryToCloudStorageOperator
Exports BigQuery tables to Cloud Storage buckets using Avro format

ⓒ GoogleCloudStorageToGoogleCloudStorageOperator
Copies files across Cloud Storage buckets

ⓓ GoogleCloudStorageToBigQueryOperator
Imports tables from Avro files in Cloud Storage bucket
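
The real bq_copy_across_locations.py builds its tasks dynamically from the master CSV; the sketch below hard-codes a single table just to show how the four operators chain together. The table names, bucket names, and DAG id are illustrative assumptions, and the import paths correspond to Airflow 1.x (Composer 1); the lab may load some of these operators from the plugin mentioned below instead.

import datetime

from airflow import models
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator
from airflow.contrib.operators.gcs_to_gcs import (
    GoogleCloudStorageToGoogleCloudStorageOperator,
)
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime(2022, 6, 1),
}

with models.DAG(
        'bq_copy_single_table_sketch',
        default_args=default_args,
        schedule_interval=None) as dag:

    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    # 1. Export the source table from the US dataset to the US bucket as Avro.
    export_to_us_bucket = BigQueryToCloudStorageOperator(
        task_id='export_table_to_us_bucket',
        source_project_dataset_table='nyc_tlc_US.example_table',  # illustrative
        destination_cloud_storage_uris=['gs://my-unique-name-us/example_table/*.avro'],
        export_format='AVRO',
    )

    # 2. Copy the exported Avro files from the US bucket to the EU bucket.
    copy_us_to_eu = GoogleCloudStorageToGoogleCloudStorageOperator(
        task_id='copy_us_bucket_to_eu_bucket',
        source_bucket='my-unique-name-us',
        source_object='example_table/*.avro',
        destination_bucket='my-unique-name-eu',
        destination_object='example_table/',
    )

    # 3. Load the Avro files from the EU bucket into the EU destination dataset.
    import_to_eu_dataset = GoogleCloudStorageToBigQueryOperator(
        task_id='import_table_to_eu_dataset',
        bucket='my-unique-name-eu',
        source_objects=['example_table/*.avro'],
        destination_project_dataset_table='nyc_tlc_EU.example_table',
        source_format='AVRO',
        write_disposition='WRITE_TRUNCATE',
    )

    start >> export_to_us_bucket >> copy_us_to_eu >> import_to_eu_dataset >> end

Avro is used as the export format because it carries the table schema along with the data, which keeps the re-import into BigQuery lossless.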



Function
read_master_file() : reads the config file and builds the list of tables to copy (a minimal sketch follows below)
ⓐ [using data] master CSV file : used to create the Airflow tasks in the DAG dynamically
ⓑ [param] table_list_file : (String) the file location of the master file
ⓒ [return] master_record_all : (List) list of Python dictionaries, each containing the information for a single row of the master CSV file
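
A minimal sketch of what such a function could look like, assuming a two-column master CSV (source table, destination table) with a header row; the exact column layout of the lab's file may differ:

import csv
import logging


def read_master_file(table_list_file):
    """Read the master CSV and return a list of dicts, one per table to copy.

    :param table_list_file: (String) file location of the master file
    :return: master_record_all: (List) list of dicts, one per CSV row
    """
    master_record_all = []
    logging.info('Reading table list from %s', table_list_file)
    with open(table_list_file, 'r') as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # skip the header row
        for row in reader:
            master_record_all.append({
                'table_source': row[0].strip(),
                'table_dest': row[1].strip(),
            })
    return master_record_all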

Plugins


(2) Viewing environment information
Creating a virtual environment

* virtualenv
Lets you set up multiple isolated Python environments on a single server, each managing its own package versions independently.

Activate the virtual environment


(3) Setting DAGs Cloud Storage bucket
When the Composer environment is created, a bucket for the Python files that define the DAGs is created automatically along with it.
Look up the name of this bucket, substitute it for YOURDAGSBUCKET, and run the command below.
DAGS_BUCKET=us-central1-composer-advanc-YOURDAGSBUCKET-bucket


(4) Setting Airflow variables
ENVIRONMENT_NAME : name of the Cloud Composer environment
LOCATION : region in which the Cloud Composer environment was created
KEY, VALUE : the variable to set and its value

For ENVIRONMENT_NAME and LOCATION, use the Cloud Composer environment created earlier:
Name : composer-advanced-lab
Location : us-central1
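
Inside the DAG these values are read back with Airflow's Variable API. The variable names below (table_list_file_path, gcs_source_bucket, gcs_dest_bucket) are the keys the sample workflow is commonly configured with; treat them as an assumption and check the lab instructions for the exact keys:

from airflow import models

# Values set via `gcloud composer environments run ... variables -- --set KEY VALUE`
# are available to the DAG at parse time.
table_list_file_path = models.Variable.get('table_list_file_path')
source_bucket = models.Variable.get('gcs_source_bucket')
dest_bucket = models.Variable.get('gcs_dest_bucket')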


Part 4. Running the workflow
(1) Uploading the DAG and dependencies to Cloud Storage
Upload the Python code that defines the DAG, together with the related data files, to the DAG bucket.
The files to upload are provided by GCP; their contents are described above.

$DAGS_BUCKET : the variable set earlier; it refers to the bucket that Composer created automatically for storing the DAG-related files.
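
The lab copies these files with gsutil (or via the console); an equivalent sketch using the google-cloud-storage client, with the bucket name and local path as placeholders:

from google.cloud import storage

# Value of $DAGS_BUCKET from the previous step (placeholder shown here).
DAGS_BUCKET = 'us-central1-composer-advanc-YOURDAGSBUCKET-bucket'

client = storage.Client()
bucket = client.bucket(DAGS_BUCKET)

# DAG definitions go under the dags/ prefix of the Composer bucket.
bucket.blob('dags/bq_copy_across_locations.py').upload_from_filename('bq_copy_across_locations.py')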

(2) Using the Airflow UI

(3) Validate the results
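
One quick way to confirm the copy succeeded, besides checking the Airflow UI, is to list the tables that landed in the EU destination dataset with the BigQuery client (a minimal check, not part of the lab's own instructions):

from google.cloud import bigquery

client = bigquery.Client()

# The imported tables should now appear in the EU destination dataset.
for table in client.list_tables('nyc_tlc_EU'):
    print(table.table_id, client.get_table(table.reference).num_rows)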
