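A data engineer needs to build an extract, transform, and load (ETL) pipeline to process and load data from 10 source systems into 10 tables in an Amazon Redshift database. All the source systems generate .csv, JSON, or Apache Parquet files every 15 minutes. The source systems all deliver files into one Amazon S3 bucket. The file sizes range from 10 MB to 20 GB. The ETL pipeline must function correctly despite changes to the data schema. Which data pipeline solutions will meet these requirements? (Choose two.)
Answer: A, B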
An Amazon EventBridge rule that runs an AWS Glue job every 15 minutes, or one that invokes an AWS Glue workflow every 15 minutes, meets the requirements. AWS Glue is a serverless ETL service that processes and loads data from many sources into many targets, including Amazon Redshift. It handles CSV, JSON, and Parquet files and supports schema evolution, so it can adapt as the data schema changes over time. AWS Glue also uses Apache Spark for distributed processing and transformation of large datasets. AWS Glue integrates with Amazon EventBridge, a serverless event bus service that triggers actions based on rules and schedules. With an EventBridge rule, you can invoke an AWS Glue job or workflow every 15 minutes and configure it to run an AWS Glue crawler first and then load the data into the Amazon Redshift tables. The result is a cost-effective, scalable ETL pipeline that handles data from 10 source systems and keeps working when the data schema changes.
The other options do not meet the requirements:
* Option C configures an AWS Lambda function to invoke an AWS Glue crawler when a file is loaded into the S3 bucket and creates a second Lambda function to run the AWS Glue job. This is not feasible because it requires many Lambda invocations and coordination between them. AWS Lambda also has limits on execution time, memory, and concurrency, which can affect the performance and reliability of the ETL pipeline.
* Option D configures an AWS Lambda function to invoke an AWS Glue workflow when a file is loaded into the S3 bucket. The Lambda function is unnecessary because an Amazon EventBridge rule can invoke the AWS Glue workflow directly.
* Option E configures an AWS Lambda function to invoke an AWS Glue job when a file is loaded into the S3 bucket and configures the AWS Glue job to put smaller partitions of the DataFrame into an Amazon Kinesis Data Firehose delivery stream. This is not cost-effective because it incurs additional costs for Lambda invocations and data delivery. In addition, using Amazon Kinesis Data Firehose to load frequent, small batches of data into Amazon Redshift can cause performance issues and data fragmentation.
References:
* AWS Glue
* Amazon EventBridge
* Using AWS Glue to run ETL jobs against non-native JDBC data sources
* AWS Lambda quotas
* Amazon Kinesis Data Firehose quotas
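The 15-minute schedule described in the explanation can be sketched as the request parameters for the EventBridge `put_rule` and `put_targets` operations. This is a minimal sketch, not a definitive setup: the rule name, target ID, workflow ARN, and role ARN are hypothetical placeholders, and it assumes the AWS Glue workflow is registered as the rule target and starts with an event-type trigger. The dictionaries are built without calling AWS, so they can be inspected before use.

```python
def build_schedule_rule(rule_name: str, minutes: int = 15) -> dict:
    """Keyword arguments for EventBridge put_rule with a fixed-rate schedule."""
    # EventBridge rate expressions use the singular unit only for a value of 1.
    unit = "minute" if minutes == 1 else "minutes"
    return {
        "Name": rule_name,
        "ScheduleExpression": f"rate({minutes} {unit})",
        "State": "ENABLED",
    }


def build_workflow_target(rule_name: str, workflow_arn: str, role_arn: str) -> dict:
    """Keyword arguments for EventBridge put_targets pointing at a Glue workflow."""
    return {
        "Rule": rule_name,
        "Targets": [
            # "glue-etl-workflow" is an arbitrary, hypothetical target ID.
            {"Id": "glue-etl-workflow", "Arn": workflow_arn, "RoleArn": role_arn}
        ],
    }


# Hypothetical names and ARNs, for illustration only.
rule_kwargs = build_schedule_rule("etl-every-15-minutes")
target_kwargs = build_workflow_target(
    "etl-every-15-minutes",
    "arn:aws:glue:us-east-1:123456789012:workflow/example-etl-workflow",
    "arn:aws:iam::123456789012:role/example-eventbridge-role",
)
print(rule_kwargs["ScheduleExpression"])
```

With boto3 installed and credentials configured, these dictionaries could be passed as `events.put_rule(**rule_kwargs)` and `events.put_targets(**target_kwargs)`, where `events = boto3.client("events")`.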
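Data-Engineer-Associate-KR Question 62
A data engineer must build an enterprise data catalog based on the company's Amazon S3 buckets and Amazon RDS databases. The data catalog must include storage format metadata for the data in the catalog. Which solution will meet these requirements with the least effort?
Answer: A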
To build an enterprise data catalog that includes storage format metadata, the lowest-effort solution is an AWS Glue crawler. A crawler scans Amazon S3 buckets and Amazon RDS databases and automatically populates a data catalog with metadata such as the schema and the storage format (for example, CSV or Parquet). AWS Glue crawler classifiers recognize the format of the data, and the crawler stores that format directly in the catalog.
Correct option: Use an AWS Glue crawler to build a data catalog. Use AWS Glue crawler classifiers to recognize the format of data and store the format in the catalog. This option meets the requirements with the least effort because Glue crawlers automate the discovery and cataloging of data from multiple sources, including S3 and RDS, and recognize various file formats through classifiers.
The other options involve additional manual steps, such as having data stewards inspect the data, or rely on services such as Amazon Macie, which focuses on sensitive data detection rather than format cataloging.
References:
* AWS Glue Crawler Documentation
* AWS Glue Classifiers
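The crawler setup described above can be sketched as the request parameters for the AWS Glue `create_crawler` operation. This is a sketch under stated assumptions: the crawler name, role ARN, database name, bucket path, JDBC connection name, and classifier name are all hypothetical. The `Classifiers` entry lists custom classifiers only; AWS Glue applies its built-in classifiers for common formats when no custom classifier matches.

```python
def build_crawler_request(name, role_arn, database, s3_paths,
                          jdbc_connection, jdbc_path, classifiers=()):
    """Keyword arguments for the AWS Glue create_crawler operation."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            # One S3 target per bucket path to crawl.
            "S3Targets": [{"Path": path} for path in s3_paths],
            # The RDS database is reached through a Glue JDBC connection.
            "JdbcTargets": [{"ConnectionName": jdbc_connection, "Path": jdbc_path}],
        },
        # Names of custom classifiers; may be empty.
        "Classifiers": list(classifiers),
    }


# Hypothetical values, for illustration only.
crawler_kwargs = build_crawler_request(
    name="enterprise-catalog-crawler",
    role_arn="arn:aws:iam::123456789012:role/example-glue-crawler-role",
    database="enterprise_catalog",
    s3_paths=["s3://example-data-bucket/raw/"],
    jdbc_connection="example-rds-connection",
    jdbc_path="exampledb/%",
    classifiers=["example-custom-csv-classifier"],
)
print(sorted(crawler_kwargs["Targets"]))
```

With boto3 available, the request could be submitted as `boto3.client("glue").create_crawler(**crawler_kwargs)`.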
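Data-Engineer-Associate-KR Question 63
A company has a frontend ReactJS website that uses Amazon API Gateway to invoke REST APIs. The APIs perform the functionality of the website. A data engineer needs to write a Python script that can be occasionally invoked through API Gateway. The code must return results to API Gateway. Which solution will meet these requirements with the least operational overhead?
Answer: B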
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. You can use Lambda to create functions that perform custom logic and integrate with other AWS services, such as API Gateway. Lambda automatically scales your application by running code in response to each trigger, and you pay only for the compute time you consume [1]. Amazon ECS is a fully managed container orchestration service for running and scaling containerized applications on AWS. You can use ECS to deploy, manage, and scale Docker containers on either Amazon EC2 instances or AWS Fargate, a serverless compute engine for containers [2]. Amazon EKS is a fully managed Kubernetes service that runs Kubernetes clusters on AWS without requiring you to install, operate, or maintain your own Kubernetes control plane. You can use EKS to deploy, manage, and scale containerized applications with Kubernetes on AWS [3].
The solution that meets the requirements with the least operational overhead is to create an AWS Lambda Python function with provisioned concurrency. This solution has the following advantages:
* You do not provision, manage, or scale any servers or clusters, because Lambda handles the infrastructure. This reduces the operational complexity and cost of running your code.
* You write the Python script as a Lambda function and integrate it with API Gateway through a simple configuration. API Gateway can invoke the Lambda function synchronously or asynchronously and return the results to the frontend website.
* Provisioned concurrency keeps the function initialized and ready to respond in double-digit milliseconds, so API requests do not incur cold start delays. You specify the number of concurrent executions to provision for the function.
Option A is incorrect because it deploys a custom Python script on an Amazon ECS cluster. This solution has the following disadvantages:
* You must provision, manage, and scale your own ECS cluster on either EC2 instances or Fargate. This increases the operational complexity and cost of running your code.
* You must package the Python script as a Docker container image and store it in a container registry, such as Amazon ECR or Docker Hub. This adds an extra step to the deployment process.
* You must configure the ECS cluster to integrate with API Gateway through either an Application Load Balancer or a Network Load Balancer. This adds another layer of complexity to the architecture.
Option C is incorrect because it deploys a custom Python script that can integrate with API Gateway on Amazon EKS. This solution has the following disadvantages:
* You must provision, manage, and scale your own EKS cluster on either EC2 instances or Fargate. This increases the operational complexity and cost of running your code.
* You must package the Python script as a Docker container image and store it in a container registry, such as Amazon ECR or Docker Hub. This adds an extra step to the deployment process.
* You must configure the EKS cluster to integrate with API Gateway through an Application Load Balancer, a Network Load Balancer, or a service of type LoadBalancer. This adds another layer of complexity to the architecture.
Option D is incorrect because it creates an AWS Lambda function and tries to keep it warm by scheduling an Amazon EventBridge rule to invoke the function every 5 minutes with mock events. This solution has the following disadvantages:
* It does not guarantee that the Lambda function is always warm, because Lambda may scale down the function if it receives no requests for a long period of time. This can cause cold start delays when API Gateway invokes the function.
* It incurs unnecessary costs, because you pay for the compute time of the Lambda function every time the EventBridge rule invokes it, even though the invocation performs no useful work [1].
References:
[1] AWS Lambda - Features
[2] Amazon Elastic Container Service - Features
[3] Amazon Elastic Kubernetes Service - Features
[4] Building API Gateway REST API with Lambda integration - Amazon API Gateway
[5] Improving latency with Provisioned Concurrency - AWS Lambda
[6] Integrating Amazon ECS with Amazon API Gateway - Amazon Elastic Container Service
[7] Integrating Amazon EKS with Amazon API Gateway - Amazon Elastic Kubernetes Service
[8] Managing concurrency for a Lambda function - AWS Lambda
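As an illustration of how a Python script returns results to API Gateway, here is a minimal sketch of a Lambda handler. It assumes the REST API uses Lambda proxy integration, in which the function returns a dictionary with `statusCode`, `headers`, and a string `body`. The `name` query parameter and the greeting payload are hypothetical and stand in for the script's real logic.

```python
import json


def lambda_handler(event, context):
    """Handle an API Gateway proxy request and return a JSON response."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")  # hypothetical parameter

    result = {"message": f"hello, {name}"}

    # Proxy integration expects the body to be a string, not a dict.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }


# Local invocation with a hand-built event; context is unused here.
response = lambda_handler({"queryStringParameters": {"name": "data"}}, None)
print(response["statusCode"], response["body"])
```

Provisioned concurrency is configured on the function version or alias rather than in the handler code, so the handler itself stays unchanged.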
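Data-Engineer-Associate-KR Question 64
A data engineer at a company needs to optimize the performance of table SQL queries. The company stores data in an Amazon Redshift cluster. The data engineer cannot increase the size of the cluster because of budget constraints. The company stores the data in multiple tables and loads the data by using the EVEN distribution style. Some tables are hundreds of gigabytes in size. Other tables are less than 10 MB in size. Which solution will meet these requirements?
Answer: D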
This solution optimizes the performance of table SQL queries without increasing the size of the cluster. With the ALL distribution style for rarely updated small tables, a copy of the entire table is stored on every node in the cluster, which eliminates data redistribution during joins. This can improve query performance significantly, especially for frequently joined dimension tables. However, the ALL distribution style also increases the storage space and the load time, so it is suitable only for small tables that are not updated frequently or extensively. Specifying primary and foreign keys for all tables helps the query optimizer generate better query plans and avoid unnecessary scans or joins. You can also use the AUTO distribution style to let Amazon Redshift choose the optimal distribution style based on the table size and the query patterns.
References:
* Choose the best distribution style
* Distribution styles
* Working with data distribution styles
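The rule described above (ALL distribution only for small, rarely updated tables) can be sketched as a helper that emits Redshift `ALTER TABLE ... ALTER DISTSTYLE ALL` statements. The 10 MB threshold mirrors the question, while the table names, sizes, and the `rarely_updated` flag are hypothetical inputs that would come from your own table inventory.

```python
import re

SMALL_TABLE_BYTES = 10 * 1024 * 1024  # 10 MB, taken from the question

# Accept only plain or schema-qualified identifiers before building SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def diststyle_all_statement(table, size_bytes, rarely_updated):
    """Return an ALTER statement for small, rarely updated tables, else None."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if size_bytes < SMALL_TABLE_BYTES and rarely_updated:
        return f"ALTER TABLE {table} ALTER DISTSTYLE ALL;"
    # Large or frequently updated tables are left unchanged by this sketch.
    return None


# Hypothetical inventory: (table name, size in bytes, rarely updated?)
tables = [
    ("dim_region", 4 * 1024 * 1024, True),
    ("dim_promo", 6 * 1024 * 1024, False),
    ("fact_sales", 300 * 1024 ** 3, False),
]
statements = [s for s in (diststyle_all_statement(*t) for t in tables) if s]
print(statements)
```

Only `dim_region` qualifies in this sample, because `dim_promo` is updated often and `fact_sales` is far above the size threshold.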
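Data-Engineer-Associate-KR Question 65
A company is building a data warehouse solution by using Amazon Redshift. The company is loading hundreds of files into a fact table that is in a Redshift cluster. The company wants the data warehouse solution to achieve the greatest possible throughput. The solution must use cluster resources optimally when the company loads data into the fact table. Which solution will meet these requirements?
Answer: D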
To achieve the highest throughput and use cluster resources efficiently while loading data into an Amazon Redshift cluster, the optimal approach is a single COPY command that ingests the data in parallel.
Option D: Use a single COPY command to load the data into the Redshift cluster. The COPY command is designed to load data from multiple files in parallel into a Redshift table, using all the cluster nodes to optimize the load process. Redshift is optimized for parallel processing, and a single COPY command can load multiple files at once, which maximizes throughput.
Options A, B, and C either add unnecessary complexity or rely on inefficient approaches, such as multiple COPY commands or INSERT statements, which are not optimized for bulk loading.
References:
* Amazon Redshift COPY Command Documentation
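To make the single-COPY approach concrete, the following sketch builds one COPY statement that points at a shared Amazon S3 key prefix, so that all files under the prefix are loaded by the same command. The table name, bucket, prefix, and IAM role ARN are hypothetical, and the sketch assumes the files share one format.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, file_format="CSV"):
    """Build one COPY statement that loads every file under an S3 key prefix."""
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must start with 's3://'")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {file_format};"
    )


# Hypothetical table, bucket, and role, for illustration only.
copy_sql = build_copy_statement(
    table="sales_fact",
    s3_prefix="s3://example-load-bucket/sales/2024/",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-copy-role",
)
print(copy_sql)
```

Issuing one statement for the whole prefix, rather than one COPY per file, leaves the parallelization across files to Redshift.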