Migrate data from Google Cloud Storage to Amazon S3 using AWS Glue


Today, we're pleased to announce a new AWS Glue connector for Google Cloud Storage that allows you to move data bi-directionally between Google Cloud Storage and Amazon Simple Storage Service (Amazon S3).

We've seen that there is a demand to design applications that enable data to be portable across cloud environments and give you the ability to derive insights from multiple data sources. One of the data sources you can now quickly integrate with is Google Cloud Storage, a managed service for storing both unstructured data and structured data. With this connector, you can bring the data from Google Cloud Storage to Amazon S3.

In this post, we go over how the new connector works, introduce the connector's functions, and provide you with key steps to set it up. We provide you with prerequisites, share how to subscribe to this connector in AWS Marketplace, and describe how to create and run AWS Glue for Apache Spark jobs with it.

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue natively integrates with various data stores such as MySQL, PostgreSQL, MongoDB, and Apache Kafka, along with AWS data stores such as Amazon S3, Amazon Redshift, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB. AWS Glue Marketplace connectors allow you to discover and integrate additional data sources, such as software as a service (SaaS) applications and your custom data sources. With just a few clicks, you can search and select connectors from AWS Marketplace and begin your data preparation workflow in minutes.

How the connector works

This connector relies on the Spark DataSource API in AWS Glue and calls Hadoop's FileSystem interface. The latter has implemented libraries for reading and writing various distributed or traditional storage. This connector also includes the Google Cloud Storage Connector for Hadoop, which lets you run Apache Hadoop or Apache Spark jobs directly on data in Google Cloud Storage. AWS Glue loads the library from the Amazon Elastic Container Registry (Amazon ECR) repository during initialization (as a connector), reads the connection credentials using AWS Secrets Manager, and reads data source configurations from input parameters. When AWS Glue has internet access, the Spark job in AWS Glue can read from and write to Google Cloud Storage.

Solution overview

The following architecture diagram shows how AWS Glue connects to Google Cloud Storage for data ingestion.

In the following sections, we show you how to create a new secret for Google Cloud Storage in Secrets Manager, subscribe to the AWS Glue connector, and move data from Google Cloud Storage to Amazon S3.

Prerequisites

You need the following prerequisites:

  • An account in Google Cloud and your data path in Google Cloud Storage. Prepare the GCP account keys file in advance and upload it to the S3 bucket. For instructions, refer to Create a service account key.
  • A Secrets Manager secret to store a Google Cloud secret.
  • An AWS Identity and Access Management (IAM) role for the AWS Glue job with the following policies:
    • AWSGlueServiceRole, which allows the AWS Glue service role access to related services.
    • AmazonEC2ContainerRegistryReadOnly, which provides read-only access to Amazon Elastic Container Registry (Amazon ECR) repositories. This policy is for using AWS Marketplace's connector libraries.
    • A Secrets Manager policy, which provides read access to the secret in Secrets Manager.
    • An S3 bucket policy for the S3 bucket that you need to load ETL (extract, transform, and load) data from Google Cloud Storage.

We assume that you are already familiar with the key concepts of Secrets Manager, IAM, and AWS Glue. Regarding IAM, these roles should be granted the permissions needed to communicate with AWS services and nothing more, according to the principle of least privilege.
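
As a rough illustration of least privilege for the job role, the following Python sketch builds an inline policy document that allows reading one secret, reading the key file object, and writing to the target bucket. The helper name build_job_policy, the ARN, and the bucket names are hypothetical placeholders rather than values from this post, and the exact set of actions your job needs may differ.

```python
import json


def build_job_policy(secret_arn: str, key_bucket: str, key_object: str,
                     target_bucket: str) -> dict:
    """Build a minimal IAM policy document for the Glue job role (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Read only the secret that points at the GCP key file.
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": [secret_arn],
            },
            {   # Read only the key file object itself (assumed to be needed).
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{key_bucket}/{key_object}"],
            },
            {   # Write migrated data to the target bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{target_bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # All identifiers below are made-up examples.
    policy = build_job_policy(
        "arn:aws:secretsmanager:us-east-1:111122223333:secret:example-secret",
        "example-keys-bucket",
        "example-key.json",
        "example-target-bucket",
    )
    print(json.dumps(policy, indent=2))
```

The managed policies listed above (AWSGlueServiceRole and AmazonEC2ContainerRegistryReadOnly) would still be attached separately; this document only covers the secret and bucket access.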

Create a new secret for Google Cloud Storage in Secrets Manager

Complete the following steps to create a secret in Secrets Manager to store the Google Cloud Storage credentials:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, select Other type of secret.
  3. Enter your key as keyS3Uri and the value as your key file in the S3 bucket, for example, s3://keys/project-gcs-connector **.json.
  4. Leave the rest of the options at their default.
  5. Choose Next.
  6. Provide a name for the secret, such as googlecloudstorage_credentials.
  7. Follow the rest of the steps to store the secret.
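
If you prefer to script the console steps above, a minimal sketch with boto3 could look like the following. It assumes that a key/value secret created on the console is stored as a JSON object, so the secret string is a JSON document with the single key keyS3Uri. The function names and the example URI are hypothetical, and boto3 is only imported when no client is passed in.

```python
import json


def build_secret_string(key_s3_uri: str) -> str:
    """Return the JSON body for the secret: a single key, keyS3Uri."""
    if not key_s3_uri.startswith("s3://"):
        raise ValueError("keyS3Uri must be an s3:// URI")
    return json.dumps({"keyS3Uri": key_s3_uri})


def create_gcs_secret(name: str, key_s3_uri: str, client=None) -> dict:
    """Store the secret in Secrets Manager.

    When client is None, boto3 and valid AWS credentials are required.
    """
    if client is None:
        import boto3  # imported lazily so build_secret_string works without boto3
        client = boto3.client("secretsmanager")
    return client.create_secret(
        Name=name,
        SecretString=build_secret_string(key_s3_uri),
    )


# Example call (not executed here; the URI is a made-up placeholder):
# create_gcs_secret("googlecloudstorage_credentials", "s3://example-keys/key.json")
```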

Subscribe to the AWS Glue connector for Google Cloud Storage

To subscribe to the connector, complete the following steps:

  1. Navigate to the Google Cloud Storage Connector for AWS Glue on AWS Marketplace.
  2. On the product page for the connector, use the tabs to view information about the connector. If you decide to purchase this connector, choose Continue to Subscribe.
  3. Review the pricing terms and the seller's End User License Agreement, then choose Accept Terms.
  4. Continue to the next step by choosing Continue to Configuration.
  5. On the Configure this software page, choose the fulfillment options and the version of the connector to use. We have provided two options for the Google Cloud Storage Connector, AWS Glue 3.0 and AWS Glue 4.0. In this example, we focus on AWS Glue 4.0. After selecting Glue 3.0 or Glue 4.0, select the corresponding AWS Glue version when you configure the AWS Glue job.
  6. Choose Continue to Launch.
  7. On the Launch this software page, you can review the Usage Instructions provided by AWS. When you're ready to continue, choose Activate the Glue connector in AWS Glue Studio.

The console will display the Create marketplace connection page in AWS Glue Studio.

Move data from Google Cloud Storage to Amazon S3

To move your data to Amazon S3, you must configure the custom connection and then set up an AWS Glue job.

Create a custom connection in AWS Glue

An AWS Glue connection stores connection information for a particular data store, including login credentials, URI strings, virtual private cloud (VPC) information, and more. Complete the following steps to create your connection:

  1. On the AWS Glue console, choose Connectors in the navigation pane.
  2. Choose Create connection.
  3. For Connector, choose Google Cloud Storage Connector for AWS Glue.
  4. For Name, enter a name for the connection (for example, GCSConnection).
  5. Enter an optional description.
  6. For AWS secret, enter the secret you created (googlecloudstorage_credentials).
  7. Choose Create connection and activate connector.

The connector and connection information is now visible on the Connectors page.

Create an AWS Glue job and configure connection options

Complete the following steps:

  1. On the AWS Glue console, choose Connectors in the navigation pane.
  2. Choose the connection you created (GCSConnection).
  3. Choose Create job.
  4. On the Node properties tab in the node details pane, enter the following information:
    • For Name, enter Google Cloud Storage Connector for AWS Glue. This name should be unique among all the nodes for this job.
    • For Node type, choose the Google Cloud Storage Connector.
  5. On the Data source properties tab, provide the following information:
    • For Connection, choose the connection you created (GCSConnection).
    • For Key, enter path, and for Value, enter your Google Cloud Storage URI (for example, gs://bucket/covid-csv-data/).
    • Enter another key-value pair. For Key, enter fileFormat. For Value, enter csv, because our sample data is in this format.
  6. On the Job details tab, for IAM Role, choose the IAM role mentioned in the prerequisites.
  7. For Glue version, choose your AWS Glue version.
  8. Continue to create your ETL job. For instructions, refer to Creating ETL jobs with AWS Glue Studio.
  9. Choose Run to run your job.
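
For readers who work with job scripts rather than the visual editor, the key-value pairs from the Data source properties tab map roughly onto the connection options of a Glue read. The sketch below is an approximation, not the script that AWS Glue Studio generates: it assumes the usual Marketplace connector pattern (connection type marketplace.spark with a connectionName option), and the function names and transformation_ctx value are hypothetical. read_from_gcs only works inside an AWS Glue job, where glue_context is a GlueContext.

```python
def gcs_connection_options(connection_name: str, gcs_path: str,
                           file_format: str = "csv") -> dict:
    """Mirror the key-value pairs entered on the Data source properties tab."""
    if not gcs_path.startswith("gs://"):
        raise ValueError("path must be a gs:// URI")
    return {
        "connectionName": connection_name,  # the AWS Glue connection, e.g. GCSConnection
        "path": gcs_path,                   # Google Cloud Storage URI
        "fileFormat": file_format,          # format of the source files
    }


def read_from_gcs(glue_context, options: dict):
    """Read a DynamicFrame through the Marketplace connector (assumed pattern)."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options=options,
        transformation_ctx="gcs_source",  # hypothetical bookmark/context name
    )


# Inside a Glue job script, usage might look like:
# options = gcs_connection_options("GCSConnection", "gs://bucket/covid-csv-data/")
# frame = read_from_gcs(glueContext, options)
```

Keeping the options in a small helper makes it easier to validate the gs:// path before the job starts reading.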

After the job succeeds, we can check the logs in Amazon CloudWatch.

The data is ingested into Amazon S3, as shown in the following screenshot.

We are now able to import data from Google Cloud Storage to Amazon S3.
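
As an alternative to choosing Run on the console, a job run can also be started and monitored from code. The following sketch takes a Glue client (for example, boto3.client("glue")) and polls until the run reaches a terminal state. The function name and polling interval are hypothetical choices, and the job name is whatever you called your job.

```python
import time

# Job run states after which the run will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name: str, poll_seconds: float = 30.0) -> str:
    """Start a Glue job run and poll until it reaches a terminal state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


# Example call (not executed here; requires boto3 and AWS credentials):
# import boto3
# final_state = run_job_and_wait(boto3.client("glue"), "<your_job_name>")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.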

Scaling considerations

In this example, we set the AWS Glue capacity as 10 DPU (Data Processing Units). A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. To scale your AWS Glue job, you can increase the number of DPU, and also take advantage of Auto Scaling. With Auto Scaling enabled, AWS Glue automatically adds and removes workers from the cluster depending on the workload. This removes the need for you to experiment and decide on the number of workers to assign for your AWS Glue ETL jobs. If you choose the maximum number of workers, AWS Glue will adapt the right size of resources for the workload.
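
To make the capacity figures concrete, the short sketch below multiplies out the DPU definition given above and shows one way the Auto Scaling settings could be expressed as job settings. The helper names are hypothetical. The G.1X worker type (one DPU per worker) and the --enable-auto-scaling job argument are assumptions about how you would configure the job outside the console.

```python
VCPUS_PER_DPU = 4       # from the DPU definition above
MEMORY_GB_PER_DPU = 16  # from the DPU definition above


def capacity_for(dpus: int) -> dict:
    """Total vCPUs and memory for a given number of DPUs."""
    if dpus <= 0:
        raise ValueError("dpus must be positive")
    return {
        "vcpus": dpus * VCPUS_PER_DPU,
        "memory_gb": dpus * MEMORY_GB_PER_DPU,
    }


def auto_scaling_job_settings(max_workers: int) -> dict:
    """Job settings that cap the worker count and turn on Auto Scaling."""
    return {
        "WorkerType": "G.1X",            # assumed worker type: 1 DPU per worker
        "NumberOfWorkers": max_workers,  # acts as the upper bound with Auto Scaling
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    }


print(capacity_for(10))  # {'vcpus': 40, 'memory_gb': 160}
```

With 10 DPU, the job in this post therefore has 40 vCPUs and 160 GB of memory available in total.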

Clean up

To clean up your resources, complete the following steps:

  1. Remove the AWS Glue job, the connection, and the secret in Secrets Manager with the following commands:
    aws glue delete-job --job-name <your_job_name>
    aws glue delete-connection --connection-name <your_connection_name>
    aws secretsmanager delete-secret --secret-id <your_secretsmanager_id>

  2. Cancel the subscription to the Google Cloud Storage Connector for AWS Glue:
    • On the AWS Marketplace console, go to the Manage subscriptions page.
    • Select the subscription for the product that you want to cancel.
    • On the Actions menu, choose Cancel subscription.
    • Read the information provided and select the acknowledgement check box.
    • Choose Yes, cancel subscription.
  3. Delete the data in the S3 buckets.
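
The CLI commands in step 1 have direct boto3 equivalents, which can be convenient if the cleanup is part of a larger script. This is a minimal sketch: the function name clean_up is hypothetical, and the clients are passed in so the function itself does not require AWS credentials to import.

```python
def clean_up(glue, secrets, job_name: str, connection_name: str,
             secret_id: str) -> None:
    """Delete the Glue job, the Glue connection, and the secret, in that order."""
    glue.delete_job(JobName=job_name)
    glue.delete_connection(ConnectionName=connection_name)
    # By default, Secrets Manager schedules the deletion with a recovery window
    # instead of removing the secret immediately.
    secrets.delete_secret(SecretId=secret_id)


# Example call (not executed here; requires boto3 and AWS credentials):
# import boto3
# clean_up(boto3.client("glue"), boto3.client("secretsmanager"),
#          "<your_job_name>", "<your_connection_name>", "<your_secretsmanager_id>")
```

Canceling the Marketplace subscription and deleting the S3 data still follow steps 2 and 3 above.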

Conclusion

In this post, we showed how to use AWS Glue and the new connector for ingesting data from Google Cloud Storage to Amazon S3. This connector provides access to Google Cloud Storage, facilitating cloud ETL processes for operational reporting, backup and disaster recovery, data governance, and more.

This connector enables your data to be portable across Google Cloud Storage and Amazon S3. We welcome any feedback or questions in the comments section.

About the authors

Qiushuang Feng is a Solutions Architect at AWS, responsible for Enterprise customers' technical architecture design, consulting, and design optimization on AWS Cloud services. Before joining AWS, Qiushuang worked in IT companies such as IBM and Oracle, and accumulated rich practical experience in development and analytics.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is passionate about architecting fast-growing data environments, diving deep into distributed big data software like Apache Spark, building reusable software artifacts for data lakes, and sharing knowledge in AWS Big Data blog posts.

Greg Huang is a Senior Solutions Architect at AWS with expertise in technical architecture design and consulting for the China G1000 team. He is dedicated to deploying and utilizing enterprise-level applications on AWS Cloud services. He possesses nearly 20 years of rich experience in large-scale enterprise application development and implementation, having worked in the cloud computing field for many years. He has extensive experience in helping various types of enterprises migrate to the cloud. Prior to joining AWS, he worked for well-known IT enterprises such as Baidu and Oracle.

Maciej Torbus is a Principal Customer Solutions Manager within Strategic Accounts at Amazon Web Services. With extensive experience in large-scale migrations, he focuses on helping customers move their applications and systems to highly reliable and scalable architectures in AWS. Outside of work, he enjoys sailing, traveling, and restoring vintage mechanical watches.