Working with Amazon S3 using boto: Multithreaded Edition!

Let's say you need to update lots of keys in Amazon S3. If your bucket holds many objects, doing this one key at a time can be quite slow. Of course, as a Python developer, you're using the nifty boto library. By spreading the work across a pool of workers, we can update all of your keys much, much faster!
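
The examples below use a process pool from multiprocessing, but the same pattern works with a pool of threads: multiprocessing.dummy exposes an identical Pool API backed by threads, which is usually plenty for I/O-bound S3 calls. Here is a minimal sketch of the pattern, reusing the bucket name from the examples below and a placeholder update() that only prints each key's name:


from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but threads

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket_foo')


def update(key):
    # Replace this with whatever per-key work you need to do.
    print(key.name)


pool = Pool(processes=100)       # 100 worker threads
pool.map(update, bucket.list())  # run update() against every key in the bucket
pool.close()
pool.join()
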

In this example, I will enable caching for all of the objects in my bucket. S3 doesn't let you edit an object's metadata in place, so each worker fetches the key, merges in the new headers, and copies the key over itself.


from multiprocessing import Pool
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket_foo')

cache_control = {'Cache-Control': 'no-transform,public,max-age=300,s-maxage=900'}


def update(key):
    # Fetch the full key so we have its current metadata and content type.
    k = bucket.get_key(key.name)
    metadata = dict(k.metadata)
    metadata.update(cache_control)
    metadata['Content-Type'] = k.content_type
    # Metadata can only be changed by copying the key onto itself with
    # replacement headers.
    k.copy(k.bucket.name,
           k.name,
           metadata,
           preserve_acl=True)
    print(k.name)


pool = Pool(processes=100)
pool.map(update, bucket.list())


In this example, I will enable public access to all of the objects in my bucket, skipping any keys that are already public.


from multiprocessing import Pool
import boto

all_users = 'http://acs.amazonaws.com/groups/global/AllUsers'
conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket_foo')


def update(key):
    acl = key.get_acl()
    # Only touch keys that aren't already readable by everyone.
    if not any(grant.uri == all_users for grant in acl.acl.grants):
        key.make_public()
        print(key.name)


pool = Pool(processes=100)
pool.map(update, bucket.list())

If you're running this on Windows, a slight change is necessary, since multiprocessing spawns fresh processes there rather than forking:


from multiprocessing import freeze_support  # add this next to the Pool import

if __name__ == '__main__':
    freeze_support()
    pool = Pool(processes=100)
    pool.map(update, bucket.list())
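
Putting it all together, a Windows-friendly version of the public-access example might look like the following sketch (same bucket name and pool size as above; the __main__ guard keeps the pool from being created again inside the spawned workers):


from multiprocessing import Pool, freeze_support

import boto

all_users = 'http://acs.amazonaws.com/groups/global/AllUsers'
conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket_foo')


def update(key):
    acl = key.get_acl()
    # Only touch keys that aren't already readable by everyone.
    if not any(grant.uri == all_users for grant in acl.acl.grants):
        key.make_public()
        print(key.name)


if __name__ == '__main__':
    freeze_support()
    pool = Pool(processes=100)
    pool.map(update, bucket.list())
    pool.close()
    pool.join()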