Working with Amazon S3 using boto: Multithreaded Edition!
Let's say you need to update lots of keys in Amazon S3. If your bucket contains many objects, doing this one key at a time can be quite slow. Of course, as a Python developer, you're using the nifty boto library. We can update all of your keys much, much faster by doing the work in parallel!
In this first example, I will enable caching (by setting a Cache-Control header) on all of the objects in my bucket.
from multiprocessing import Pool

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket_foo')

# Headers we want on every object; Content-Type is filled in per key below.
cache_control = {'Cache-Control': 'no-transform,public,max-age=300,s-maxage=900'}

def update(key):
    # Re-fetch the key so we have its current metadata (e.g. Content-Type).
    k = bucket.get_key(key.name)
    cache_control.update({'Content-Type': k.content_type})
    k.metadata.update(cache_control)
    # Copying the key onto itself with new metadata is how you "update"
    # metadata on an existing S3 object.
    key.copy(k.bucket.name,
             k.name,
             k.metadata,
             preserve_acl=True)
    print(k.name)

pool = Pool(processes=100)
pool.map(update, bucket.list())
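Despite the "multithreaded" in the title, multiprocessing.Pool actually fans the work out to separate processes. Since this job is almost entirely I/O-bound (waiting on S3), plain threads work just as well, and multiprocessing.dummy exposes the same Pool API backed by threads. A minimal sketch of the swap, reusing the update() function above:

# Same map-based fan-out, but with a thread pool instead of processes.
from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(100)            # 100 worker threads
pool.map(update, bucket.list())   # same update() as above
pool.close()
pool.join()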
In this next example, I will enable public access to all of the objects in my bucket.
from multiprocessing import Pool

import boto

# The canonical URI for the S3 "AllUsers" group, i.e. public access.
all_users = 'http://acs.amazonaws.com/groups/global/AllUsers'

conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket_foo')

def update(key):
    acl = key.get_acl()
    # Only touch keys that don't already have an AllUsers grant.
    if not any(grant.uri == all_users for grant in acl.acl.grants):
        key.make_public()
    print(key.name)

pool = Pool(processes=100)
pool.map(update, bucket.list())
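Both examples crawl the entire bucket. boto's bucket.list() also accepts a prefix, which is handy when only part of the bucket needs updating, and it's good practice to close and join the pool once the map returns. A small sketch (the 'images/' prefix is just a placeholder):

pool = Pool(processes=100)
# 'images/' is a placeholder prefix; only keys under it will be processed.
pool.map(update, bucket.list(prefix='images/'))
pool.close()   # no more work will be submitted
pool.join()    # wait for the workers to finish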
If you're running this on Windows, a slight change is necessary: multiprocessing starts its workers by re-importing your script, so the pool must be created under an if __name__ == '__main__': guard (and freeze_support() called if the script will be frozen into an executable):
from multiprocessing import Pool, freeze_support

if __name__ == '__main__':
    freeze_support()
    pool = Pool(processes=100)
    pool.map(update, bucket.list())
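Because Windows workers re-import the module rather than forking, the connection setup and the update() function need to stay at module level so each child process recreates them on import. A sketch of how the whole script fits together under those assumptions, using the public-access example:

from multiprocessing import Pool, freeze_support

import boto

# Module-level setup runs again in every worker process when the module
# is re-imported, which is how conn, bucket, and update become available there.
conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket_foo')

def update(key):
    key.make_public()   # or any of the update() bodies shown above
    print(key.name)

if __name__ == '__main__':
    freeze_support()
    pool = Pool(processes=100)
    pool.map(update, bucket.list())
    pool.close()
    pool.join()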