Query your Apache Hive metastore with AWS Lake Formation permissions


Apache Hive is a SQL-based data warehouse system for processing highly distributed datasets on the Apache Hadoop platform. There are two key components to Apache Hive: the Hive SQL query engine and the Hive metastore (HMS). The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive metastore to retrieve metadata to run queries. The Hive metastore can be hosted on an Apache Hadoop cluster or can be backed by a relational database that is external to a Hadoop cluster. Although the Hive metastore stores the metadata of tables, the actual data of the table could reside on Amazon Simple Storage Service (Amazon S3), the Hadoop Distributed File System (HDFS) of the Hadoop cluster, or any other Hive-supported data store.

Because Apache Hive was built on top of Apache Hadoop, many organizations have been using the software for as long as they have been using Hadoop for big data processing. Also, the Hive metastore provides flexible integration with many other open-source big data software like Apache HBase, Apache Spark, Presto, and Apache Impala. Therefore, organizations have come to host huge volumes of metadata of their structured datasets in the Hive metastore. A metastore is a critical part of a data lake, and having this information available, wherever it resides, is important. However, many AWS analytics services don’t integrate natively with the Hive metastore, and therefore, organizations have had to migrate their data to the AWS Glue Data Catalog to use these services.

AWS Lake Formation has launched support for managing user access to Apache Hive metastores through a federated AWS Glue connection. Previously, you could use Lake Formation to manage user permissions on AWS Glue Data Catalog resources only. With the Hive metastore connection from AWS Glue, you can connect to a database in a Hive metastore external to the Data Catalog, map it to a federated database in the Data Catalog, apply Lake Formation permissions on the Hive database and tables, share them with other AWS accounts, and query them using services such as Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL (extract, transform, and load). For additional details on how the Hive metastore integration with Lake Formation works, refer to Managing permissions on datasets that use external metastores.

Use cases for Hive metastore integration with the Data Catalog include the following:

  • An external Apache Hive metastore used for legacy big data workloads like on-premises Hadoop clusters with data in Amazon S3
  • Transient Amazon EMR workloads with underlying data in Amazon S3 and the Hive metastore on Amazon Relational Database Service (Amazon RDS) clusters

In this post, we demonstrate how to apply Lake Formation permissions on a Hive metastore database and tables and query them using Athena. We illustrate a cross-account sharing use case, where a Lake Formation steward in producer account A shares a federated Hive database and tables using LF-Tags to consumer account B.

Solution overview

Producer account A hosts an Apache Hive metastore in an EMR cluster, with underlying data in Amazon S3. We launch the AWS Glue Hive metastore connector from AWS Serverless Application Repository in account A and create the Hive metastore connection in account A’s Data Catalog. After we create the HMS connection, we create a database in account A’s Data Catalog (called the federated database) and map it to a database in the Hive metastore using the connection. The tables from the Hive database are then accessible to the Lake Formation admin in account A, just like any other tables in the Data Catalog. The admin then sets up Lake Formation tag-based access control (LF-TBAC) on the federated Hive database and shares it with account B.

The data lake users in account B access the Hive database and tables of account A just as they would query any other shared Data Catalog resource using Lake Formation permissions.

The following diagram illustrates this architecture.

The solution consists of steps in both accounts. In account A, perform the following steps:

  1. Create an S3 bucket to host the sample data.
  2. Launch an EMR 6.10 cluster with Hive. Download the sample data to the S3 bucket. Create a database and external tables, pointing to the downloaded sample data, in its Hive metastore.
  3. Deploy the application GlueDataCatalogFederation-HiveMetastore from AWS Serverless Application Repository and configure it to use the Amazon EMR Hive metastore. This creates an AWS Glue connection to the Hive metastore that shows up on the Lake Formation console.
  4. Using the Hive metastore connection, create a federated database in the AWS Glue Data Catalog.
  5. Create LF-Tags and associate them with the federated database.
  6. Grant permissions on the LF-Tags to account B. Grant database and table permissions to account B using LF-Tag expressions.

In account B, perform the following steps:

  1. As a data lake admin, review and accept the AWS Resource Access Manager (AWS RAM) invitations for the shares from account A.
  2. The data lake admin then sees the shared database and tables. The admin creates a resource link to the database and grants fine-grained permissions to a data analyst in this account.
  3. Both the data lake admin and the data analyst query the Hive tables that are available to them using Athena.

Account A has the following personas:

  • hmsblog-producersteward – Manages the data lake in the producer account A

Account B has the following personas:

  • hmsblog-consumersteward – Manages the data lake in the consumer account B
  • hmsblog-analyst – A data analyst who needs access to selected Hive tables

Prerequisites

To follow the tutorial in this post, you need the following:

Lake Formation and AWS CloudFormation setup in account A

To keep the setup simple, we have an IAM admin registered as the data lake admin. Complete the following steps:

  1. Sign in to the AWS Management Console and choose the us-west-2 Region.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  3. Choose Manage Administrators in the Data lake administrators section.
  4. Under IAM users and roles, choose the IAM admin user that you are logged in as and choose Save.
  5. Choose Launch Stack to deploy the CloudFormation template:
  6. Choose Next.
  7. Provide a name for the stack and choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation takes about 10 minutes. The stack establishes the producer account A setup as follows:

  • Creates an S3 data lake bucket
  • Registers the data lake bucket to Lake Formation with the Enable catalog federation flag
  • Launches an EMR 6.10 cluster with Hive and runs two steps in Amazon EMR:
    • Downloads the sample data from a public S3 bucket to the newly created bucket
    • Creates a Hive database and four external tables for the data in Amazon S3, using an HQL script
  • Creates an IAM user (hmsblog-producersteward) and sets this user as Lake Formation administrator
  • Creates LF-Tags (LFHiveBlogCampaignRole = Admin, Analyst)

Review CloudFormation stack output in account A

To review the output of your CloudFormation stack, complete the following steps:

  1. Log in to the console as the IAM admin user you used earlier to run the CloudFormation template.
  2. Open the CloudFormation console in another browser tab.
  3. Review and note down the stack Outputs tab details.
  4. Choose the link under Value for ProducerStewardCredentials.

This opens the AWS Secrets Manager console.

  1. Choose Retrieve secret value and note down the credentials of hmsblog-producersteward.

Set up a federated AWS Glue connection in account A

To set up a federated AWS Glue connection, complete the following steps:

  1. Open the AWS Serverless Application Repository console in another browser tab.
  2. In the navigation pane, choose Available applications.
  3. Select Show apps that create custom IAM roles or resource policies.
  4. In the search bar, enter Glue.

This lists various applications.

  1. Choose the application named GlueDataCatalogFederation-HiveMetastore.

This opens the AWS Lambda console configuration page for a Lambda function that runs the connector application code.

To configure the Lambda function, you need details of the EMR cluster launched by the CloudFormation stack.

  1. On another tab of your browser, open the Amazon EMR console.
  2. Navigate to the cluster launched for this post and note down the following details from the cluster details page:
    1. Primary node public DNS
    2. Subnet ID
    3. Security group ID of the primary node

  3. Back on the Lambda configuration page, under Review, configure, and deploy, in the Application settings section, provide the following details. Leave the rest as the default values.
    1. For GlueConnectionName, enter hive-metastore-connection.
    2. For HiveMetastoreURIs, enter thrift://<Primary-node-public-DNS-of your-EMR>:9083. For example, thrift://ec2-54-70-203-146.us-west-2.compute.amazonaws.com:9083, where 9083 is the Hive metastore port in the EMR cluster.
    3. For VPCSecurityGroupIds, enter the security group ID of the EMR primary node.
    4. For VPCSubnetIds, enter the subnet ID of the EMR cluster.
  4. Choose Deploy.

Wait for the Create Completed status of the Lambda application. You can review the details of the Lambda application on the Lambda console.
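The application settings from the preceding steps can also be assembled in a small script before you paste them into the console. The following is a minimal sketch, not part of the connector itself: the helper names hive_metastore_uri and connector_settings are hypothetical, and the security group and subnet IDs are placeholders. It only builds and validates the values and does not call any AWS API.

```python
def hive_metastore_uri(primary_node_dns: str, port: int = 9083) -> str:
    """Build the thrift URI for the HiveMetastoreURIs setting.

    9083 is the Hive metastore port on the EMR cluster, as used in this post.
    """
    host = primary_node_dns.strip()
    # Reject values that already carry a scheme, path, or port.
    if not host or "/" in host or ":" in host:
        raise ValueError(f"expected a bare DNS name, got {primary_node_dns!r}")
    return f"thrift://{host}:{port}"


def connector_settings(primary_node_dns: str, security_group_id: str,
                       subnet_id: str,
                       connection_name: str = "hive-metastore-connection") -> dict:
    """Collect the four application settings named in the steps above."""
    return {
        "GlueConnectionName": connection_name,
        "HiveMetastoreURIs": hive_metastore_uri(primary_node_dns),
        "VPCSecurityGroupIds": security_group_id,
        "VPCSubnetIds": subnet_id,
    }


if __name__ == "__main__":
    # The DNS name is the example from this post; the IDs are placeholders.
    settings = connector_settings(
        "ec2-54-70-203-146.us-west-2.compute.amazonaws.com",
        "sg-0123456789abcdef0",
        "subnet-0123456789abcdef0",
    )
    for key, value in settings.items():
        print(f"{key} = {value}")
```

Validating the DNS name up front catches the common mistake of pasting a value that already includes thrift:// or a port.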

  1. Open the Lake Formation console and in the navigation pane, choose Data sharing.

You should see hive-metastore-connection under Connections.

  1. Choose it and review the details.
  2. In the navigation pane, under Administrative roles and tasks, choose LF-Tags.

You should see the created LF-Tag LFHiveBlogCampaignRole with two values: Analyst and Admin.

  1. Choose LF-Tag permissions and choose Grant.
  2. Choose IAM users and roles and enter hmsblog-producersteward.
  3. Under LF-Tags, choose Add LF-Tag.
  4. Enter LFHiveBlogCampaignRole for Key and enter Analyst and Admin for Values.
  5. Under Permissions, select Describe and Associate for LF-Tag permissions and Grantable permissions.
  6. Choose Grant.

This gives LF-Tag permissions to the producer steward.

  1. Log out as the IAM administrator user.

Grant Lake Formation permissions as producer steward

Complete the following steps:

  1. Sign in to the console as hmsblog-producersteward, using the credentials from the CloudFormation stack Outputs tab that you noted down earlier.
  2. On the Lake Formation console, in the navigation pane, choose Administrative roles and tasks.
  3. Under Database creators, choose Grant.
  4. Add hmsblog-producersteward as a database creator.
  5. In the navigation pane, choose Data sharing.
  6. Under Connections, choose the hive-metastore-connection link.
  7. On the Connection details page, choose Create database.
  8. For Database name, enter federated_emrhivedb.

This is the federated database in the local AWS Glue Data Catalog that points to a Hive metastore database. It is a one-to-one mapping of a database in the Data Catalog to a database in the external Hive metastore.

  1. For Database identifier, enter the name of the database in the EMR Hive metastore that was created by the Hive SQL script. For this post, we use emrhms_salesdb.
  2. Once created, select federated_emrhivedb and choose View tables.

This fetches the database and table metadata from the Hive metastore on the EMR cluster and displays the tables created by the Hive script.
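The one-to-one mapping between the federated database and the Hive database can also be expressed as an AWS Glue CreateDatabase request. The following sketch only builds the request dictionary. It assumes that the console's Database identifier field maps directly to the Identifier field of the FederatedDatabase structure, which this post does not confirm; federated_database_request is a hypothetical helper name.

```python
def federated_database_request(name: str, hive_database: str,
                               connection_name: str) -> dict:
    """Build keyword arguments for the Glue CreateDatabase call.

    Assumption: Identifier takes the Hive database name as entered in the
    console's "Database identifier" field.
    """
    return {
        "DatabaseInput": {
            "Name": name,
            "FederatedDatabase": {
                "Identifier": hive_database,
                "ConnectionName": connection_name,
            },
        }
    }


if __name__ == "__main__":
    request = federated_database_request(
        name="federated_emrhivedb",
        hive_database="emrhms_salesdb",
        connection_name="hive-metastore-connection",
    )
    print(request)
    # With boto3 installed and hmsblog-producersteward credentials configured,
    # the request could be sent like this (not executed here):
    #   import boto3
    #   boto3.client("glue", region_name="us-west-2").create_database(**request)
```

Keeping the request as plain data makes it easy to review the mapping before anything is created in the Data Catalog.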

Now you associate the LF-Tags created by the CloudFormation script with this federated database and share it with consumer account B using LF-Tag expressions.

  1. In the navigation pane, choose Databases.
  2. Select federated_emrhivedb and on the Actions menu, choose Edit LF-Tags.
  3. Choose Assign new LF-Tag.
  4. Enter LFHiveBlogCampaignRole for Assigned keys and Admin for Values, then choose Save.
  5. In the navigation pane, choose Data lake permissions.
  6. Choose Grant.
  7. Select External accounts and enter the consumer account B number.
  8. Under LF-Tags or catalog resources, choose Resource matched by LF-Tags.
  9. Choose Add LF-Tag.
  10. Enter LFHiveBlogCampaignRole for Key and Admin for Values.
  11. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  12. In the Table permissions section, select Select and Describe for Table permissions and Grantable permissions.
  13. Choose Grant.
  14. In the navigation pane, under Administrative roles and tasks, choose LF-Tag permissions.
  15. Choose Grant.
  16. Select External accounts and enter the account ID of consumer account B.
  17. Under LF-Tags, enter LFHiveBlogCampaignRole for Key and enter Analyst and Admin for Values.
  18. Under Permissions, select Describe and Associate under LF-Tag permissions and Grantable permissions.
  19. Choose Grant and verify that the granted LF-Tag permissions display correctly.
  20. In the navigation pane, choose Data lake permissions.

You can review and verify the data lake permissions granted to account B.

  1. In the navigation pane, under Administrative roles and tasks, choose LF-Tag permissions.

You can review and verify the LF-Tag permissions granted to account B.

  1. Log out of account A.
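The cross-account grants made in this section can be summarized as three Lake Formation GrantPermissions requests. The following is a minimal sketch that only builds the request dictionaries, under the assumption that the console choices map to the LFTagPolicy and LFTag resource types as shown. The helper names and the account ID 111122223333 (standing in for consumer account B) are placeholders.

```python
TAG_KEY = "LFHiveBlogCampaignRole"


def lf_tag_policy_grant(account_id: str, resource_type: str,
                        tag_values: list, permissions: list) -> dict:
    """Grant on all resources of one type that match an LF-Tag expression."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": account_id},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": resource_type,  # "DATABASE" or "TABLE"
                "Expression": [{"TagKey": TAG_KEY, "TagValues": list(tag_values)}],
            }
        },
        "Permissions": list(permissions),
        # The steps above also select the same grantable permissions.
        "PermissionsWithGrantOption": list(permissions),
    }


def lf_tag_grant(account_id: str, tag_values: list) -> dict:
    """Grant Describe and Associate on the LF-Tag itself."""
    permissions = ["DESCRIBE", "ASSOCIATE"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": account_id},
        "Resource": {"LFTag": {"TagKey": TAG_KEY, "TagValues": list(tag_values)}},
        "Permissions": permissions,
        "PermissionsWithGrantOption": list(permissions),
    }


def grants_for_consumer(account_id: str) -> list:
    """Return the three grants described in this section, in order."""
    return [
        lf_tag_policy_grant(account_id, "DATABASE", ["Admin"], ["DESCRIBE"]),
        lf_tag_policy_grant(account_id, "TABLE", ["Admin"], ["SELECT", "DESCRIBE"]),
        lf_tag_grant(account_id, ["Analyst", "Admin"]),
    ]


if __name__ == "__main__":
    for grant in grants_for_consumer("111122223333"):  # placeholder account B ID
        print(grant)
    # With boto3 and producer steward credentials, each dictionary could be
    # passed to boto3.client("lakeformation").grant_permissions(**grant).
```

Because the database and table grants use the tag expression rather than resource names, any table later tagged LFHiveBlogCampaignRole = Admin falls under the same share.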

Lake Formation and AWS CloudFormation setup in account B

To keep the setup simple, we use an IAM admin registered as the data lake admin.

  1. Sign in to the AWS Management Console of account B and select the us-west-2 Region.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  3. Choose Manage Administrators in the Data lake administrators section.
  4. Under IAM users and roles, choose the IAM admin user that you are logged in as and choose Save.
  5. Choose Launch Stack to deploy the CloudFormation template:
  6. Choose Next.
  7. Provide a name for the stack and choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation should take about 5 minutes. The stack establishes the consumer account B setup as follows:

  • Creates an IAM user hmsblog-consumersteward and sets this user as Lake Formation administrator
  • Creates another IAM user hmsblog-analyst
  • Creates an S3 data lake bucket to store Athena query results, with ListBucket and write object permissions for both hmsblog-consumersteward and hmsblog-analyst

Note down the stack output details.

Accept resource shares in account B

Sign in to the console as hmsblog-consumersteward and complete the following steps:

  1. On the AWS CloudFormation console, navigate to the stack Outputs tab.
  2. Choose the link for ConsumerStewardCredentials to be redirected to the Secrets Manager console.
  3. On the Secrets Manager console, choose Retrieve secret value and copy the password for the consumer steward user.
  4. Use the ConsoleIAMLoginURL value from the CloudFormation template Output to log in to account B with the consumer steward user name hmsblog-consumersteward and the password you copied from Secrets Manager.
  5. Open the AWS RAM console in another browser tab.
  6. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations.

You should see two resource share invitations from producer account A: one for a database-level share and one for a table-level share.

  1. Choose each resource share link, review the details, and choose Accept.

After you accept the invitations, the status of the resource shares changes from Pending to Active.

  1. Open the Lake Formation console in another browser tab.
  2. In the navigation pane, choose Databases.

You should see the shared database federated_emrhivedb from producer account A.

  1. Choose the database and choose View tables to review the list of tables shared under that database.

You should see the four tables of the Hive database that is hosted on the EMR cluster in the producer account.

Grant permissions in account B

To grant permissions in account B, complete the following steps as hmsblog-consumersteward:

  1. On the Lake Formation console, in the navigation pane, choose Administrative roles and tasks.
  2. Under Database creators, choose Grant.
  3. For IAM users and roles, enter hmsblog-consumersteward.
  4. For Catalog permissions, select Create database.
  5. Choose Grant.

This allows hmsblog-consumersteward to create a database resource link.

  1. In the navigation pane, choose Databases.
  2. Select federated_emrhivedb and on the Actions menu, choose Create resource link.
  3. Enter rl_federatedhivedb for Resource link name and choose Create.
  4. Choose Databases in the navigation pane.
  5. Select the resource link rl_federatedhivedb and on the Actions menu, choose Grant.
  6. Choose hmsblog-analyst for IAM users and roles.
  7. Under Resource link permissions, select Describe, then choose Grant.
  8. Select Databases in the navigation pane.
  9. Select the resource link rl_federatedhivedb and on the Actions menu, choose Grant on target.
  10. Choose hmsblog-analyst for IAM users and roles.
  11. Choose hms_productcategory and hms_supplier for Tables.
  12. For Table permissions, select Select and Describe, then choose Grant.
  13. In the navigation pane, choose Data lake permissions and review the permissions granted to hmsblog-analyst.
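The Grant on target step above gives the analyst access to two tables that still live in the producer's catalog. The following sketch builds the corresponding Lake Formation GrantPermissions requests as plain dictionaries. It assumes that the target tables are addressed by the producer account's catalog ID; the helper name analyst_table_grants and both account IDs are placeholders.

```python
def analyst_table_grants(producer_account_id: str, consumer_account_id: str,
                         tables: list,
                         database: str = "federated_emrhivedb",
                         user: str = "hmsblog-analyst") -> list:
    """Build one Select/Describe grant per shared table for the analyst."""
    principal_arn = f"arn:aws:iam::{consumer_account_id}:user/{user}"
    return [
        {
            "Principal": {"DataLakePrincipalIdentifier": principal_arn},
            "Resource": {
                "Table": {
                    # The shared table belongs to the producer's Data Catalog.
                    "CatalogId": producer_account_id,
                    "DatabaseName": database,
                    "Name": table,
                }
            },
            "Permissions": ["SELECT", "DESCRIBE"],
        }
        for table in tables
    ]


if __name__ == "__main__":
    grants = analyst_table_grants(
        producer_account_id="444455556666",   # placeholder for account A
        consumer_account_id="111122223333",   # placeholder for account B
        tables=["hms_productcategory", "hms_supplier"],
    )
    for grant in grants:
        print(grant)
    # With boto3 and consumer steward credentials, each dictionary could be
    # passed to boto3.client("lakeformation").grant_permissions(**grant).
```

No grantable permissions are included here, matching the steps above, so the analyst cannot pass the access on to other principals.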

Query the Apache Hive database of the producer from the consumer Athena

Complete the following steps:

  1. On the Athena console, navigate to the query editor.
  2. Choose Edit settings to configure the Athena query results bucket.
  3. Browse and choose the S3 bucket hmsblog-athenaresults-<your-account-B>-us-west-2 that the CloudFormation template created.
  4. Choose Save.

hmsblog-consumersteward has access to all four tables under federated_emrhivedb from the producer account.

  1. In the Athena query editor, choose the database rl_federatedhivedb and run a query on any of the tables.

You were able to query an external Apache Hive metastore database of the producer account through the AWS Glue Data Catalog and Lake Formation permissions using Athena from the recipient consumer account.
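The same query can be prepared for the Athena StartQueryExecution API instead of the query editor. The following is a minimal sketch that only builds the request. The helper name athena_query_request, the row limit, and the account ID in the bucket name are illustrative assumptions; the database and table names come from this post.

```python
def athena_query_request(table: str, results_bucket: str,
                         database: str = "rl_federatedhivedb",
                         limit: int = 10) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution call."""
    # Allow only simple identifiers so the name can be embedded in the SQL text.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return {
        "QueryString": f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": f"s3://{results_bucket}/"},
    }


if __name__ == "__main__":
    request = athena_query_request(
        table="hms_supplier",
        # Placeholder account ID in the results bucket name.
        results_bucket="hmsblog-athenaresults-111122223333-us-west-2",
    )
    print(request["QueryString"])
    # With boto3 and credentials for account B (not executed here):
    #   import boto3
    #   boto3.client("athena", region_name="us-west-2").start_query_execution(**request)
```

Querying through the resource link rl_federatedhivedb rather than the shared database name mirrors what the query editor does when you choose that database.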

  1. Sign out of the console as hmsblog-consumersteward and sign back in as hmsblog-analyst.
  2. Use the same method as explained earlier to get the login credentials from the CloudFormation stack Outputs tab.

hmsblog-analyst has Describe permissions on the resource link and access to two of the four Hive tables. You can verify that you see them on the Databases and Tables pages on the Lake Formation console.

On the Athena console, you now configure the Athena query results bucket, similar to how you configured it as hmsblog-consumersteward.

  1. In the query editor, choose Edit settings.
  2. Browse and choose the S3 bucket hmsblog-athenaresults-<your-account-B>-us-west-2 that the CloudFormation template created.
  3. Choose Save.
  4. In the Athena query editor, choose the database rl_federatedhivedb and run a query on the two tables.
  5. Sign out of the console as hmsblog-analyst.

You were able to restrict sharing of the external Apache Hive metastore tables from one account to another using Lake Formation permissions and query them using Athena. You can also query the Hive tables using Redshift Spectrum, Amazon EMR, and AWS Glue ETL from the consumer account.

Clean up

To avoid incurring charges on the AWS resources created in this post, you can perform the following steps.

Clean up resources in account A

There are two CloudFormation stacks associated with producer account A. You need to delete the dependencies and the two stacks in the correct order.

  1. Log in as the admin user to producer account A.
  2. On the Lake Formation console, choose Data lake permissions in the navigation pane.
  3. Choose Grant.
  4. Grant Drop permissions to your role or user on federated_emrhivedb.
  5. In the navigation pane, choose Databases.
  6. Select federated_emrhivedb and on the Actions menu, choose Delete to delete the federated database that is associated with the Hive metastore connection.

This makes the AWS Glue connection’s CloudFormation stack ready to be deleted.

  1. In the navigation pane, choose Administrative roles and tasks.
  2. Under Database creators, select Revoke and remove hmsblog-producersteward permissions.
  3. On the CloudFormation console, delete the stack named serverlessrepo-GlueDataCatalogFederation-HiveMetastore first.

This is the stack created by your AWS SAM application for the Hive metastore connection. Wait for it to complete deletion.

  1. Delete the CloudFormation stack that you created for the producer account setup.

This deletes the S3 buckets, EMR cluster, custom IAM roles and policies, and the LF-Tags, database, tables, and permissions.

Clean up resources in account B

Complete the following steps in account B:

  1. Revoke permission to hmsblog-consumersteward as database creator, similar to the steps in the previous section.
  2. Delete the CloudFormation stack that you created for the consumer account setup.

This deletes the IAM users, S3 bucket, and all the permissions from Lake Formation.

If there are any resource links and permissions left, delete them manually in Lake Formation from both accounts.

Conclusion

In this post, we showed you how to launch the AWS Glue Hive metastore federation application from AWS Serverless Application Repository, configure it with a Hive metastore running on an EMR cluster, create a federated database in the AWS Glue Data Catalog, and map it to a Hive metastore database on the EMR cluster. We illustrated how to share and access the Hive database tables for a cross-account scenario and the benefits of using Lake Formation to restrict permissions.

All Lake Formation features, such as sharing to IAM principals within the same account, sharing to external accounts, sharing to external account IAM principals, restricting column access, and setting data filters, work on federated Hive databases and tables. You can use any of the AWS analytics services that are integrated with Lake Formation, such as Athena, Redshift Spectrum, AWS Glue ETL, and Amazon EMR, to query the federated Hive database and tables.

We encourage you to check out the features of the AWS Glue Hive metastore federation connector and explore Lake Formation permissions on your Hive database and tables. Please comment on this post or talk to your AWS Account Team to share feedback on this feature.

For more details, see Managing permissions on datasets that use external metastores.


About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.
