Free online access: Amazon.Data-Engineer-Associate-KR.v2026-04-11.q93 practice exam (Page 15)

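Data-Engineer-Associate-KR Question 66

A media company wants to improve a system that recommends media content to customers based on user behavior and preferences. To improve the recommendation system, the company needs to incorporate insights from third-party datasets into the company's existing analytics platform.
The company wants to minimize the effort and time required to incorporate third-party datasets.
Which solution will meet these requirements with the LEAST operational overhead?

A. Use API calls to access and integrate third-party datasets from AWS Data Exchange.

B. Use API calls to access and integrate third-party datasets from AWS.

C. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories.

D. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR).

Answer: A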

AWS Data Exchange is a service that makes it easy to find, subscribe to, and use third-party data in the cloud. It provides a secure and reliable way to access and integrate data from various sources, such as data providers, public datasets, or AWS services. Using AWS Data Exchange, you can browse and subscribe to data products that suit your needs, and then use API calls or the AWS Management Console to export the data to Amazon S3, where you can use it with your existing analytics platform. This solution minimizes the effort and time required to incorporate third-party datasets, as you do not need to set up and manage data pipelines, storage, or access controls. You also benefit from the data quality and freshness provided by the data providers, who can update their data products as frequently as needed [1][2].
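As a rough illustration of the "API calls" path described above, the following sketch builds the request for an AWS Data Exchange export job and submits it through a client object that is passed in (for example, one created with `boto3.client("dataexchange")`). The data set ID, revision ID, and bucket name are hypothetical placeholders, and the sketch assumes the subscription to the data product already exists.

```python
from typing import Any, Dict


def build_export_job_request(data_set_id: str, revision_id: str, bucket: str) -> Dict[str, Any]:
    """Build the request for a job that exports one revision of a data set to S3."""
    return {
        "Type": "EXPORT_REVISIONS_TO_S3",
        "Details": {
            "ExportRevisionsToS3": {
                "DataSetId": data_set_id,
                "RevisionDestinations": [
                    {
                        "RevisionId": revision_id,
                        "Bucket": bucket,
                        # Each asset is written under its own name (assumed layout).
                        "KeyPattern": "${Asset.Name}",
                    }
                ],
            }
        },
    }


def export_revision(client: Any, data_set_id: str, revision_id: str, bucket: str) -> str:
    """Create and start the export job, then return the job ID.

    `client` is expected to behave like boto3.client("dataexchange").
    """
    job = client.create_job(**build_export_job_request(data_set_id, revision_id, bucket))
    client.start_job(JobId=job["Id"])
    return job["Id"]
```

Passing the client in keeps the request-building logic easy to check without AWS credentials; a real caller would also poll the job until it completes.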
The other options are not optimal for the following reasons:
B. Use API calls to access and integrate third-party datasets from AWS. This option is vague and does not specify which AWS service or feature is used to access and integrate third-party datasets. AWS offers a variety of services and features that can help with data ingestion, processing, and analysis, but not all of them are suitable for the given scenario. For example, AWS Glue is a serverless data integration service that can help you discover, prepare, and combine data from various sources, but it requires you to create and run data extraction, transformation, and loading (ETL) jobs, which can add operational overhead [3].
C. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories. This option is not feasible, as AWS CodeCommit is a source control service that hosts secure Git-based repositories, not a data source that can be accessed by Amazon Kinesis Data Streams. Amazon Kinesis Data Streams is a service that enables you to capture, process, and analyze data streams in real time, such as clickstream data, application logs, or IoT telemetry. It does not support accessing and integrating data from AWS CodeCommit repositories, which are meant for storing and managing code, not data.
D. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR). This option is also not feasible, as Amazon ECR is a fully managed container registry service that stores, manages, and deploys container images, not a data source that can be accessed by Amazon Kinesis Data Streams. Amazon Kinesis Data Streams does not support accessing and integrating data from Amazon ECR, which is meant for storing and managing container images, not data.
1: AWS Data Exchange User Guide
2: AWS Data Exchange FAQs
3: AWS Glue Developer Guide
AWS CodeCommit User Guide
Amazon Kinesis Data Streams Developer Guide
Amazon Elastic Container Registry User Guide
Build a Continuous Delivery Pipeline for Your Container Images with Amazon ECR as Source

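Data-Engineer-Associate-KR Question 67

A company uses AWS Glue jobs to implement several data pipelines. The pipelines are critical to the company.
The company needs to implement a monitoring mechanism that will alert stakeholders if a problem occurs in the pipelines.
Which solution will meet these requirements with the LEAST operational overhead?

A. Create an Amazon EventBridge rule to match AWS Glue job failure events. Configure the rule to target an AWS Lambda function to process the events. Configure the function to send notifications to an Amazon Simple Notification Service (Amazon SNS) topic.

B. Configure an Amazon CloudWatch Logs log group for the AWS Glue jobs. Create an Amazon EventBridge rule to match new log creation events in the log group. Configure the rule to target an AWS Lambda function that reads the logs and sends notifications to an Amazon Simple Notification Service (Amazon SNS) topic if AWS Glue job failure logs are present.

C. Create an Amazon EventBridge rule to match AWS Glue job failure events. Define an Amazon CloudWatch metric based on the EventBridge rule. Set up a CloudWatch alarm based on the metric to send notifications to an Amazon Simple Notification Service (Amazon SNS) topic.

D. Configure an Amazon CloudWatch Logs log group for the AWS Glue jobs. Create an Amazon EventBridge rule to match new log creation events in the log group. Configure the rule to send notifications to an Amazon Simple Notification Service (Amazon SNS) topic.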
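
Several of the options above rely on an EventBridge rule that matches AWS Glue job failure events. The sketch below shows what such an event pattern could look like, together with a small handler that turns a matching event into an SNS notification. It is an illustration only and does not indicate which option is correct; the topic ARN is a hypothetical placeholder, and the SNS client is passed in (for example, `boto3.client("sns")`).

```python
import json
from typing import Any, Dict

# Event pattern for an EventBridge rule that matches failed AWS Glue job runs.
GLUE_JOB_FAILED_PATTERN: Dict[str, Any] = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED"]},
}


def format_alert(event: Dict[str, Any]) -> Dict[str, str]:
    """Turn a Glue job state-change event into an SNS subject and message."""
    detail = event.get("detail", {})
    job_name = detail.get("jobName", "unknown-job")
    body = {
        "jobName": job_name,
        "jobRunId": detail.get("jobRunId"),
        "state": detail.get("state"),
        "message": detail.get("message"),
    }
    return {
        # SNS subjects are limited to 100 characters.
        "Subject": f"AWS Glue job failed: {job_name}"[:100],
        "Message": json.dumps(body, indent=2),
    }


def handle_event(event: Dict[str, Any], sns_client: Any, topic_arn: str) -> None:
    """Publish the formatted alert; `sns_client` should behave like boto3.client("sns")."""
    sns_client.publish(TopicArn=topic_arn, **format_alert(event))
```

In a Lambda-based design, `handle_event` would be called from the function's entry point with the topic ARN read from configuration.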

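Data-Engineer-Associate-KR Question 68

A data engineer needs to use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII.
Which solution will meet this requirement with the LEAST operational effort?

A. Use an Amazon Kinesis Data Firehose delivery stream to process the dataset. Create an AWS Lambda transform function to identify the PII. Use an AWS SDK to obfuscate the PII. Set the S3 data lake as the target for the delivery stream.

B. Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.

C. Use the Detect PII transform in AWS Glue Studio to identify the PII. Create a rule in AWS Glue Data Quality to obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.

D. Ingest the dataset into Amazon DynamoDB. Create an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data. Use the same Lambda function to ingest the data into the S3 data lake.

Answer: C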

AWS Glue is a fully managed service that provides a serverless data integration platform for data preparation, data cataloging, and data loading. AWS Glue Studio is a graphical interface that allows you to easily author, run, and monitor AWS Glue ETL jobs. AWS Glue Data Quality is a feature that lets you define predefined or custom rules and evaluate your data against them. AWS Step Functions is a service that allows you to coordinate multiple AWS services into serverless workflows.
Using the Detect PII transform in AWS Glue Studio, you can automatically identify and label the PII in your dataset, such as names, addresses, phone numbers, email addresses, etc. You can then create a rule in AWS Glue Data Quality to obfuscate the PII, such as masking, hashing, or replacing the values with dummy data. You can also use other rules to validate and cleanse your data, such as checking for null values, duplicates, outliers, etc. You can then use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake. You can use AWS Glue DataBrew to visually explore and transform the data, AWS Glue crawlers to discover and catalog the data, and AWS Glue jobs to load the data into the S3 data lake.
This solution will meet the requirement with the least operational effort, as it leverages the serverless and managed capabilities of AWS Glue, AWS Glue Studio, AWS Glue Data Quality, and AWS Step Functions. You do not need to write any code to identify or obfuscate the PII, as you can use the built-in transforms and rules in AWS Glue Studio and AWS Glue Data Quality. You also do not need to provision or manage any servers or clusters, as AWS Glue and AWS Step Functions scale automatically based on the demand.
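To make the obfuscation techniques named above (masking, hashing, replacing values) concrete, here is a minimal standard-library sketch. It is not the AWS Glue Detect PII transform or an AWS Glue Data Quality rule; it only illustrates the kind of work that the hand-written options would have to implement and maintain themselves. The email regular expression and the placeholder text are simplifying assumptions.

```python
import hashlib
import re

# Simplified email matcher (an assumption for illustration, not a full RFC 5322 parser).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[REDACTED_EMAIL]") -> str:
    """Replace every email-like substring with a fixed placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def mask_keep_last(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask all but the last `visible` characters of a value."""
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


def hash_value(value: str, salt: str = "") -> str:
    """Replace a value with a salted SHA-256 digest so equal inputs still match."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
```

Masking keeps a value partly readable, hashing preserves joinability across records without exposing the original, and redaction removes the value entirely; which one fits depends on how the data lake consumers use the field.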
The other options are not as efficient as using the Detect PII transform in AWS Glue Studio, creating a rule in AWS Glue Data Quality, and using an AWS Step Functions state machine:
A. Using an Amazon Kinesis Data Firehose delivery stream to process the dataset, creating an AWS Lambda transform function to identify the PII, using an AWS SDK to obfuscate the PII, and setting the S3 data lake as the target for the delivery stream will require more operational effort, as you will need to write and maintain code to identify and obfuscate the PII, as well as manage the Lambda function and its resources.
B. Using the Detect PII transform in AWS Glue Studio to identify the PII, obfuscating the PII, and using an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake will not be as effective as creating a rule in AWS Glue Data Quality to obfuscate the PII, as you will need to manually obfuscate the PII after identifying it, which can be error-prone and time-consuming.
D. Ingesting the dataset into Amazon DynamoDB, creating an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data, and using the same Lambda function to ingest the data into the S3 data lake will require more operational effort, as you will need to write and maintain code to identify and obfuscate the PII, as well as manage the Lambda function and its resources. You will also incur additional costs and complexity by using DynamoDB as an intermediate data store, which may not be necessary for your use case.
References:
* AWS Glue
* AWS Glue Studio
* AWS Glue Data Quality
* AWS Step Functions
* AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 6: Data Integration and Transformation, Section 6.1: AWS Glue

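Data-Engineer-Associate-KR Question 69

A security company stores IoT data that is in JSON format in an Amazon S3 bucket. The data structure can change when the company upgrades the IoT devices. The company wants to create a data catalog that includes the IoT data. The company's analytics department will use the data catalog to index the data.
Which solution will meet these requirements MOST cost-effectively?

A. Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create a new AWS Glue workload to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.

B. Create an Amazon Redshift provisioned cluster. Create an Amazon Redshift Spectrum database for the analytics department to explore the data that is in Amazon S3. Create Redshift stored procedures to load the data into Amazon Redshift.

C. Create an Amazon Athena workgroup. Explore the data that is in Amazon S3 by using Apache Spark through Athena. Provide the Athena workgroup schema and tables to the analytics department.

D. Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create AWS Lambda user defined functions (UDFs) by using the Amazon Redshift Data API. Create an AWS Step Functions job to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.

Answer: C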

The best solution to meet the requirements of creating a data catalog that includes the IoT data, and allowing the analytics department to index the data, most cost-effectively, is to create an Amazon Athena workgroup, explore the data that is in Amazon S3 by using Apache Spark through Athena, and provide the Athena workgroup schema and tables to the analytics department.
Amazon Athena is a serverless, interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL or Python [1]. Amazon Athena also supports Apache Spark, an open-source distributed processing framework that can run large-scale data analytics applications across clusters of servers [2]. You can use Athena to run Spark code on data in Amazon S3 without having to set up, manage, or scale any infrastructure. You can also use Athena to create and manage external tables that point to your data in Amazon S3, and store them in an external data catalog, such as AWS Glue Data Catalog, Amazon Athena Data Catalog, or your own Apache Hive metastore [3]. You can create Athena workgroups to separate query execution and resource allocation based on different criteria, such as users, teams, or applications [4]. You can share the schemas and tables in your Athena workgroup with other users or applications, such as Amazon QuickSight, for data visualization and analysis [5].
Using Athena and Spark to create a data catalog and explore the IoT data in Amazon S3 is the most cost-effective solution, as you pay only for the queries you run or the compute you use, and you pay nothing when the service is idle [1]. You also save on the operational overhead and complexity of managing data warehouse infrastructure, as Athena and Spark are serverless and scalable. You can also benefit from the flexibility and performance of Athena and Spark, as they support various data formats, including JSON, and can handle schema changes and complex queries efficiently.
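The point about handling schema changes can be pictured with a small schema-on-read sketch: the schema is derived from the JSON records at read time, so a field that appears after a device upgrade simply shows up in the inferred result. This is a plain-Python illustration of the idea, not how Athena or Spark implement it internally (in Spark, `spark.read.json` performs its own inference). The field names are hypothetical.

```python
import json
from typing import Dict, Iterable, Set


def infer_schema(json_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect, for every top-level field, the set of Python type names observed."""
    schema: Dict[str, Set[str]] = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical IoT records: the second one comes from upgraded firmware
# and carries an extra field plus an integer temperature reading.
records = [
    '{"device_id": "a1", "temp": 21.5}',
    '{"device_id": "a2", "temp": 22, "firmware": "2.0"}',
]
schema = infer_schema(records)
```

Because nothing is loaded into a warehouse table with a fixed layout, the older records remain readable and the new `firmware` field is picked up without a migration step.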
Option A is not the best solution, as creating an AWS Glue Data Catalog, configuring an AWS Glue Schema Registry, and creating a new AWS Glue workload to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless would incur more costs and complexity than using Athena and Spark. AWS Glue Data Catalog is a persistent metadata store that contains table definitions, job definitions, and other control information to help you manage your AWS Glue components [6]. AWS Glue Schema Registry is a feature of AWS Glue that allows you to centrally store and manage the schemas of your streaming data [7]. AWS Glue is a serverless data integration service that makes it easy to prepare, clean, enrich, and move data between data stores [8]. Amazon Redshift Serverless is a feature of Amazon Redshift, a fully managed data warehouse service, that allows you to run and scale analytics without having to manage data warehouse infrastructure [9]. While these services are powerful and useful for many data engineering scenarios, they are not necessary or cost-effective for creating a data catalog and indexing the IoT data in Amazon S3. The AWS Glue Data Catalog charges you based on the number of objects stored and the number of requests made [6]. AWS Glue charges you based on the compute capacity (DPU-hours) that your ETL jobs consume [8]. Amazon Redshift Serverless charges you based on the compute capacity (RPU-hours) that your workloads consume and on the storage you use [9]. These costs can add up quickly, especially if you have large volumes of IoT data and frequent schema changes. Moreover, using AWS Glue and Amazon Redshift Serverless would introduce additional latency and complexity, as you would have to ingest the data from Amazon S3 to Amazon Redshift Serverless, and then query it from there, instead of querying it directly from Amazon S3 using Athena and Spark.
Option B is not the best solution, as creating an Amazon Redshift provisioned cluster, creating an Amazon Redshift Spectrum database for the analytics department to explore the data that is in Amazon S3, and creating Redshift stored procedures to load the data into Amazon Redshift would incur more costs and complexity than using Athena and Spark. Amazon Redshift provisioned clusters are clusters that you create and manage by specifying the number and type of nodes, and the amount of storage and compute capacity [10]. Amazon Redshift Spectrum is a feature of Amazon Redshift that allows you to query and join data across your data warehouse and your data lake using standard SQL [11]. Redshift stored procedures are SQL statements that you can define and store in Amazon Redshift, and then call them by using the CALL command [12]. While these features are powerful and useful for many data warehousing scenarios, they are not necessary or cost-effective for creating a data catalog and indexing the IoT data in Amazon S3. Amazon Redshift provisioned clusters charge you based on the node type, the number of nodes, and the duration of the cluster [10]. Amazon Redshift Spectrum charges you based on the amount of data scanned by your queries [11]. These costs can add up quickly, especially if you have large volumes of IoT data and frequent schema changes. Moreover, using Amazon Redshift provisioned clusters and Spectrum would introduce additional latency and complexity, as you would have to provision and manage the cluster, create an external schema and database for the data in Amazon S3, and load the data into the cluster using stored procedures, instead of querying it directly from Amazon S3 using Athena and Spark.
Option D is not the best solution, as creating an AWS Glue Data Catalog, configuring an AWS Glue Schema Registry, creating AWS Lambda user defined functions (UDFs) by using the Amazon Redshift Data API, and creating an AWS Step Functions job to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless would incur more costs and complexity than using Athena and Spark. AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers [13]. AWS Lambda UDFs are Lambda functions that you can invoke from within an Amazon Redshift query. Amazon Redshift Data API is a service that allows you to run SQL statements on Amazon Redshift clusters using HTTP requests, without needing a persistent connection. AWS Step Functions is a service that lets you coordinate multiple AWS services into serverless workflows. While these services are powerful and useful for many data engineering scenarios, they are not necessary or cost-effective for creating a data catalog and indexing the IoT data in Amazon S3. The AWS Glue Data Catalog charges you based on the number of objects stored and the number of requests made [6]. AWS Lambda charges you based on the number of requests and the duration of your functions [13]. Amazon Redshift Serverless charges you based on the compute capacity (RPU-hours) that your workloads consume and on the storage you use [9]. AWS Step Functions charges you based on the number of state transitions in your workflows. These costs can add up quickly, especially if you have large volumes of IoT data and frequent schema changes. Moreover, using AWS Glue, AWS Lambda, Amazon Redshift Data API, and AWS Step Functions would introduce additional latency and complexity, as you would have to create and invoke Lambda functions to ingest the data from Amazon S3 to Amazon Redshift Serverless using the Data API, and coordinate the ingestion process using Step Functions, instead of querying it directly from Amazon S3 using Athena and Spark.
References:
1: What is Amazon Athena?
2: Apache Spark on Amazon Athena
3: Creating tables, updating the schema, and adding new partitions in the Data Catalog from AWS Glue ETL jobs
4: Managing Athena workgroups
5: Using Amazon QuickSight to visualize data in Amazon Athena
6: AWS Glue Data Catalog
7: AWS Glue Schema Registry
8: What is AWS Glue?
9: Amazon Redshift Serverless
10: Amazon Redshift provisioned clusters
11: Querying external data using Amazon Redshift Spectrum
12: Using stored procedures in Amazon Redshift
13: What is AWS Lambda?
Creating and using AWS Lambda UDFs
Using the Amazon Redshift Data API
What is AWS Step Functions?
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide

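Data-Engineer-Associate-KR Question 70

A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the company's operational databases into an Amazon S3 based data lake. The ETL workflows use AWS Glue and Amazon EMR to process data.
The company wants to improve the existing architecture to provide automated orchestration and to require minimal manual effort.
Which solution will meet these requirements with the LEAST operational overhead?

A. AWS Glue workflows

B. AWS Step Functions tasks

C. AWS Lambda functions

D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows

Answer: A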

AWS Glue workflows are a feature of AWS Glue that enable you to create and visualize complex ETL pipelines using AWS Glue components, such as crawlers, jobs, and triggers. AWS Glue workflows provide automated orchestration and require minimal manual effort, as they handle dependency resolution, error handling, state management, and resource allocation for your ETL workflows. You can use AWS Glue workflows to ingest data from your operational databases into your Amazon S3 based data lake, and then use AWS Glue and Amazon EMR to process the data in the data lake. This solution will meet the requirements with the least operational overhead, as it leverages the serverless and fully managed nature of AWS Glue, and the scalability and flexibility of Amazon EMR [1][2].
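The dependency handling described above is expressed in AWS Glue through triggers attached to a workflow. The sketch below builds the requests for a two-job workflow (an on-demand start trigger, then a conditional trigger that fires when the first job succeeds) and submits them through a client that is passed in, such as `boto3.client("glue")`. The workflow and job names are hypothetical, and both jobs are assumed to exist already.

```python
from typing import Any, Dict


def build_workflow_requests(workflow: str, first_job: str, second_job: str) -> Dict[str, Dict[str, Any]]:
    """Build the requests for a workflow that runs `second_job` after `first_job` succeeds."""
    return {
        "workflow": {"Name": workflow},
        "start_trigger": {
            "Name": f"{workflow}-start",
            "WorkflowName": workflow,
            "Type": "ON_DEMAND",
            "Actions": [{"JobName": first_job}],
        },
        "next_trigger": {
            "Name": f"{workflow}-after-{first_job}",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "StartOnCreation": True,
            "Predicate": {
                "Conditions": [
                    {"LogicalOperator": "EQUALS", "JobName": first_job, "State": "SUCCEEDED"}
                ]
            },
            "Actions": [{"JobName": second_job}],
        },
    }


def create_workflow(glue_client: Any, workflow: str, first_job: str, second_job: str) -> None:
    """Create the workflow and its triggers; `glue_client` should behave like boto3.client("glue")."""
    requests = build_workflow_requests(workflow, first_job, second_job)
    glue_client.create_workflow(**requests["workflow"])
    glue_client.create_trigger(**requests["start_trigger"])
    glue_client.create_trigger(**requests["next_trigger"])
```

Once created, a run would be started with the client's `start_workflow_run(Name=...)` call or from the console.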
The other options are not optimal for the following reasons:
B. AWS Step Functions tasks. AWS Step Functions is a service that lets you coordinate multiple AWS services into serverless workflows. You can use AWS Step Functions tasks to invoke AWS Glue and Amazon EMR jobs as part of your ETL workflows, and use AWS Step Functions state machines to define the logic and flow of your workflows. However, this option would require more manual effort than AWS Glue workflows, as you would need to write JSON code to define your state machines, handle errors and retries, and monitor the execution history and status of your workflows [3].
C. AWS Lambda functions. AWS Lambda is a service that lets you run code without provisioning or managing servers. You can use AWS Lambda functions to trigger AWS Glue and Amazon EMR jobs as part of your ETL workflows, and use AWS Lambda event sources and destinations to orchestrate the flow of your workflows. However, this option would also require more manual effort than AWS Glue workflows, as you would need to write code to implement your business logic, handle errors and retries, and monitor the invocation and execution of your Lambda functions. Moreover, AWS Lambda functions have limitations on the execution time, memory, and concurrency, which may affect the performance and scalability of your ETL workflows.
D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows. Amazon MWAA is a managed service that makes it easy to run open source Apache Airflow on AWS. Apache Airflow is a popular tool for creating and managing complex ETL pipelines using directed acyclic graphs (DAGs). You can use Amazon MWAA workflows to orchestrate AWS Glue and Amazon EMR jobs as part of your ETL workflows, and use the Airflow web interface to visualize and monitor your workflows. However, this option would have more operational overhead than AWS Glue workflows, as you would need to set up and configure your Amazon MWAA environment, write Python code to define your DAGs, and manage the dependencies and versions of your Airflow plugins and operators.
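For comparison, the "JSON code" that option B refers to is an Amazon States Language definition. The sketch below generates one that runs an AWS Glue job and then submits a step to an existing Amazon EMR cluster, with a retry policy on the first task. The job name, cluster ID, and script location are hypothetical inputs, and the definition is a minimal example rather than a production-ready state machine.

```python
import json


def build_state_machine_definition(glue_job_name: str, emr_cluster_id: str, spark_script_uri: str) -> str:
    """Return an Amazon States Language definition as a JSON string."""
    definition = {
        "Comment": "Run a Glue job, then an EMR step (illustrative sketch)",
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # The .sync suffix makes the state wait until the job run finishes.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "Retry": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "IntervalSeconds": 30,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "Next": "RunEmrStep",
            },
            "RunEmrStep": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId": emr_cluster_id,
                    "Step": {
                        "Name": "process-data",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", spark_script_uri],
                        },
                    },
                },
                "End": True,
            },
        },
    }
    return json.dumps(definition, indent=2)
```

The retry block and the explicit state ordering are the parts the explanation counts as manual effort: they have to be written and maintained by hand for each workflow.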
1: AWS Glue Workflows
2: AWS Glue and Amazon EMR
3: AWS Step Functions
AWS Lambda
Amazon Managed Workflows for Apache Airflow