AWS DMS: How to migrate data to Amazon S3?
- arjun5792
- Oct 29, 2022
- 2 min read
Have questions about how to use AWS DMS to move data to Amazon S3? You can rely on us!
As part of our Server Management Services at Skynats, we frequently handle requests to migrate data using AWS DMS for our clients using AWS.
Let's look at how our support engineers handle this for our clients.
Migration of data to Amazon S3 using AWS DMS
Here, data in Apache Parquet (.parquet) format will be transferred to Amazon Simple Storage Service (Amazon S3).
If we use AWS DMS replication engine version 3.1.3 or later, we can write data to an S3 bucket in Apache Parquet format. Parquet version 1.0 is the default output version.
The procedures that our support engineers use for the migration are as follows:
1. First, we must create a target Amazon S3 endpoint using the AWS DMS console.
2. Then add the following extra connection attribute (ECA):
dataFormat=parquet;
We should also review the other extra connection attributes that can be used when storing Parquet objects in an S3 target.
Alternatively, use the create-endpoint command in the AWS Command Line Interface (AWS CLI) to create the target Amazon S3 endpoint:
aws dms create-endpoint --endpoint-identifier s3-target-parquet --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "<IAM role ARN for S3 endpoint>", "BucketName": "<S3 bucket name to migrate to>", "DataFormat": "parquet"}'
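The same endpoint can also be created programmatically with boto3, AWS's Python SDK. The sketch below only builds the request parameters; the identifier, role ARN, and bucket name are placeholder values you must replace, and actually calling `create_endpoint` requires valid AWS credentials.

```python
import json

def build_s3_endpoint_params(identifier, role_arn, bucket_name):
    """Build create_endpoint arguments for a Parquet-format S3 target."""
    return {
        "EndpointIdentifier": identifier,
        "EndpointType": "target",
        "EngineName": "s3",
        "S3Settings": {
            "ServiceAccessRoleArn": role_arn,
            "BucketName": bucket_name,
            "DataFormat": "parquet",
        },
    }

# With credentials configured, these parameters would be passed to boto3:
#   import boto3
#   boto3.client("dms").create_endpoint(**params)
params = build_s3_endpoint_params(
    "s3-target-parquet",                            # placeholder identifier
    "arn:aws:iam::123456789012:role/dms-s3-role",   # placeholder role ARN
    "my-migration-bucket",                          # placeholder bucket
)
print(json.dumps(params["S3Settings"], indent=2))
```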
3. After that, we can use the extra connection attribute below to specify the Parquet version of the output files:
parquetVersion=PARQUET_2_0;
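Extra connection attributes are semicolon-separated, so the data format and the Parquet version can be set together in a single ECA string:

```
dataFormat=parquet;parquetVersion=PARQUET_2_0;
```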
4. Run the describe-endpoints command to check whether the S3 endpoint we created has the S3 setting DataFormat (or the extra connection attribute dataFormat) set to "parquet".
We can use the following command to verify the S3 setting DataFormat:
aws dms describe-endpoints --filters Name=endpoint-arn,Values=<S3 target endpoint ARN> --query "Endpoints[].S3Settings.DataFormat"
[
"parquet"
]
5. If the DataFormat value comes back as csv instead, we must recreate the endpoint.
6. Once we have the output file in Parquet format, we can install the Apache Parquet command-line tool (parquet-cli) to parse it:
pip install parquet-cli --user
7. After that, look at the file format:
parq LOAD00000001.parquet
# Metadata
<pyarrow._parquet.FileMetaData object at 0x10e948aa0>
created_by: AWS
num_columns: 2
num_rows: 2
num_row_groups: 1
format_version: 1.0
serialized_size: 169
8. Lastly, we can print the contents of the file:
parq LOAD00000001.parquet --head
i c
0 1 insert1
1 2 insert2