[PYTHON] The story of copying data from S3 to Google's TeamDrive

Background

I don't think there is much demand for this, but I set things up so that files stored in S3 are also saved to another vendor's storage and can be shared easily. Using the Google Drive API was not difficult as long as I followed the documentation, but I did hit a few snags when the file size was large. Since I may do a similar implementation again, I'm leaving this as a memo just in case.

What I want to do

Automatically transfer S3 data to Google Drive.

What I did

I migrated the contents of an S3 bucket to a specified Team Drive using Python.

Implementation details

  1. Download the file from S3 to /tmp.
  2. Check whether the file already exists in Google Drive.
  3. If it exists, overwrite it.
  4. If it does not exist, upload it as a new file (a rough sketch of this flow follows below).
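Expressed as code, the flow is roughly the skeleton below. This is only an outline in my own naming, not the actual implementation: the helper functions are placeholders, and the real source appears later in this post.

import os

# Rough outline of the flow only -- the helpers are placeholders,
# not functions from the actual script shown later in this post.
def sync_s3_object_to_drive(bucket: str, key: str, folder_id: str) -> None:
    # 1. Download the file from S3 to /tmp
    file_name = key.split("/")[-1]
    file_path = os.path.join("/tmp", file_name)
    download_from_s3(bucket, key, file_path)

    # 2. Check whether a file with the same name already exists in the Drive folder
    existing_file_id = find_drive_file(folder_id, file_name)

    if existing_file_id:
        # 3. Overwrite the existing file
        overwrite_drive_file(existing_file_id, file_path)
    else:
        # 4. Upload it as a new file
        create_drive_file(folder_id, file_path)

def download_from_s3(bucket: str, key: str, file_path: str) -> None:
    ...  # boto3 download_file, see the full source below

def find_drive_file(folder_id: str, file_name: str):
    ...  # Drive v3 files.list query on name and parents, see the full source below

def overwrite_drive_file(file_id: str, file_path: str) -> None:
    ...  # resumable upload against files/{fileId}, see the full source below

def create_drive_file(folder_id: str, file_path: str) -> None:
    ...  # resumable upload against files, see the full source below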

Advance preparation

- The files to be migrated exist in an S3 bucket
- GoogleClientId and GoogleClientSecret (see reference site)
- GoogleRefreshToken (see reference site)
- FolderId of the destination Google Drive folder
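In the source below these settings are hard-coded for simplicity. As a side note, they could instead be kept out of the code, for example in environment variables; the environment variable names here are my own choice, not something the script requires.

import os

# Hypothetical alternative: read the settings from environment variables
# instead of hard-coding them in the script.
CONTENT_BUCKET_NAME = os.environ["CONTENT_BUCKET_NAME"]
CONTENT_BACKUP_KEY = os.environ["CONTENT_BACKUP_KEY"]
GOOGLE_CLIENT_ID = os.environ["GOOGLE_CLIENT_ID"]
GOOGLE_CLIENT_SECRET = os.environ["GOOGLE_CLIENT_SECRET"]
GOOGLE_REFRESH_TOKEN = os.environ["GOOGLE_REFRESH_TOKEN"]
GOOGLE_FOLDER_ID = os.environ["GOOGLE_FOLDER_ID"]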

Here is the source I actually implemented.

# Download File from S3 to Local tmp Dir
# Upload a file to Google Drive

import os
import boto3
import json
import requests
import magic



## setting info
CONTENT_BUCKET_NAME = 'MY_S3_BUCKET_NAME'
CONTENT_BACKUP_KEY = 'MY_S3_BUCKET_KEY'
GOOGLE_CLIENT_ID = "XXXXXXXXXXXX.apps.googleusercontent.com"
GOOGLE_CLIENT_SECRET = "XXXXXXXXXXXX"
GOOGLE_REFRESH_TOKEN = "XXXXXXXXXXXX"
GOOGLE_FOLDER_ID = 'GOOGLE_FOLDER_ID'


s3 = boto3.resource('s3')
 
# Download the target object from S3 to the local tmp directory
bucket = CONTENT_BUCKET_NAME
key = CONTENT_BACKUP_KEY
file_name = key.split("/")[-1]  # last path component of the key
file_path = os.path.join("/tmp", file_name)
s3.Object(bucket, key).download_file(file_path)
filesize = os.path.getsize(file_path)
fname, extension = os.path.splitext(file_name)

# Exchange the refresh token for a fresh access token
access_token_url = 'https://accounts.google.com/o/oauth2/token'
headers = {"Content-Type": "application/json", "X-Accept": "application/json"}
refresh_token_request = {"grant_type": "refresh_token", "client_id": GOOGLE_CLIENT_ID, "client_secret": GOOGLE_CLIENT_SECRET, "refresh_token": GOOGLE_REFRESH_TOKEN}
token_response = requests.post(access_token_url, headers=headers, data=json.dumps(refresh_token_request))
access_token = token_response.json()['access_token']

# Check whether a file with the same name already exists in the destination folder
listUrl = "https://www.googleapis.com/drive/v3/files"
headers = {
    'Host':'www.googleapis.com',
    'Authorization': 'Bearer ' + access_token,
    'Content-Type':'application/json; charset=UTF-8',
    "X-Accept":"application/json"
}
qs= { "q": "'" + GOOGLE_FOLDER_ID + "' in parents and name='" + file_name + "' and trashed=false",
      "supportsAllDrives": True,
      "includeItemsFromAllDrives": True
    }

fileExistCheck = requests.get(listUrl, params=qs, headers=headers)
responseJsonFiles = fileExistCheck.json()['files']
searchResponseLength = len(responseJsonFiles)

# Detect the MIME type of the downloaded file
mime = magic.Magic(mime=True)
mimeType = mime.from_file(file_path) 

# Headers for initiating the resumable upload session (the body of this request
# is JSON metadata, so requests sets the Content-Length itself)
headers = {
    'Host':'www.googleapis.com',
    'Authorization': 'Bearer ' + access_token,
    'Content-Type':'application/json; charset=UTF-8',
    'X-Upload-Content-Type': mimeType,
    'X-Upload-Content-Length': str(filesize)
}

with open(file_path, 'rb') as data:
  file_name = os.path.basename(file_path)
  metadata = {
    "name": file_name,
    "title": file_name,
    "parents": [GOOGLE_FOLDER_ID],
    'kind': 'drive#permission',
    "permissionDetails": [
      {
        "permissionType": "file",
        "role": "organizer"
      }
    ],
  }
 
  # Headers for sending the actual file content to the resumable session URL
  upload_headers = {
    'Authorization': 'Bearer ' + access_token,
    'Content-Type': mimeType
  }

  # The file does not exist yet: create it with a new resumable upload session.
  if searchResponseLength < 1:
    postUrl = "https://www.googleapis.com/upload/drive/v3/files?uploadType=resumable&supportsAllDrives=true"
    r = requests.post(postUrl, data=json.dumps(metadata), headers=headers)
    r.raise_for_status()
    # Session URL to which the file content is uploaded
    uploadUrl = r.headers['Location']

    r2 = requests.post(uploadUrl, data=data, headers=upload_headers)

  # The file already exists: overwrite its content.
  else:
    fileId = responseJsonFiles[0]['id']
    metadata = {
      "filename": file_name,
      "name": file_name,
      "title": file_name,
      'kind': 'drive#permission',
      "permissionDetails": [
        {
          "permissionType": "file",
          "role": "organizer"
        }
      ]
    }

    patchUrl = "https://www.googleapis.com/upload/drive/v3/files/" + fileId + "?uploadType=resumable&supportsAllDrives=true"
    r = requests.patch(patchUrl, data=json.dumps(metadata), headers=headers)
    r.raise_for_status()
    uploadUrl = r.headers['Location']
    r2 = requests.patch(uploadUrl, data=data, headers=upload_headers)

In closing

I'm sure there is room for improvement, so if you know an easier way to implement this, please leave a comment.
