How to List Blobs Properly with Azure Storage Client

This article is about a few methods of the Azure Storage .NET Client Library that are often misused. I have seen several cases where people used them improperly, and doing so tends to introduce bugs that you wouldn’t see on day one but that appear much later on, which is the most dangerous kind.

The first method, ListBlobsSegmentedAsync, lists the blobs (files) in a container (folder or bucket) of an Azure Storage account. The second method, ListContainersSegmentedAsync, lists the containers (folders) in the account.

Since the two do almost the same thing, I will only talk about ListBlobsSegmentedAsync. This method has a blocking alternative called ListBlobs. The obvious difference between the two is that one performs asynchronous I/O and the other does not, so it blocks the calling thread (which is bad if you’re running on a thread pool). The more significant difference, however, shows up when your container has more than 5,000 blobs to list: that is the API limit for a single listing request, so the results are paginated. In this case, ListBlobs requests the pages one after another and keeps the thread blocked until the list of all blobs has been downloaded, which can easily take several minutes if you have a six-digit number of blobs in the container.

ListBlobsSegmentedAsync, on the other hand, initiates the REST API request using asynchronous networking libraries and yields the thread to other running tasks, since its implementation uses .NET’s async/await. When the task completes, it returns at most 5,000 blobs and, if there are more results, a continuation token, which you are supposed to pass to another call of this method if you want to get “Page 2”. This is the point where people usually forget to call the method again, because while they’re testing there are probably fewer than 5,000 blobs in the container. Later on, once the results are paginated, the code silently starts seeing only the first 5,000 results.
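To make the pitfall concrete, here is a minimal simulation that does not touch Azure at all. FakeSegment and FakeContainer are hypothetical stand-ins invented for this sketch, not library types, and the 12,000 total is an arbitrary number; only the 5,000 page size mirrors the limit described above. It compares a single segmented call with a loop that follows the continuation token.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for one page of results plus its continuation token.
public sealed class FakeSegment
{
    public List<string> Results { get; set; }
    public int? ContinuationToken { get; set; } // null means "no more pages"
}

// Hypothetical stand-in for a container; not part of the Azure client library.
public static class FakeContainer
{
    public const int PageSize = 5000;    // mirrors the 5,000-item limit
    public const int TotalBlobs = 12000; // arbitrary size chosen for the demo

    // Simulates a segmented listing call: returns at most PageSize names.
    public static Task<FakeSegment> ListSegmentedAsync(int? token)
    {
        int start = token ?? 0;
        int count = Math.Min(PageSize, TotalBlobs - start);
        List<string> names = Enumerable.Range(start, count)
                                       .Select(i => "blob-" + i)
                                       .ToList();
        int next = start + count;
        return Task.FromResult(new FakeSegment
        {
            Results = names,
            ContinuationToken = next < TotalBlobs ? next : (int?)null
        });
    }

    // The bug: a single call, with the continuation token ignored.
    public static async Task<int> CountFirstPageOnlyAsync()
    {
        FakeSegment segment = await ListSegmentedAsync(null);
        return segment.Results.Count;
    }

    // The fix: keep calling until the token comes back null.
    public static async Task<int> CountAllPagesAsync()
    {
        int total = 0;
        int? token = null;
        do
        {
            FakeSegment segment = await ListSegmentedAsync(token);
            total += segment.Results.Count;
            token = segment.ContinuationToken;
        }
        while (token != null);
        return total;
    }
}

public static class Program
{
    public static async Task Main()
    {
        Console.WriteLine(await FakeContainer.CountFirstPageOnlyAsync()); // prints 5000
        Console.WriteLine(await FakeContainer.CountAllPagesAsync());      // prints 12000
    }
}
```

With fewer than 5,000 items both methods return the same count, which is exactly why the bug tends to survive testing.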

Why is this a problem?

Naming. If I were designing this library and there was already a method called ListBlobs, I would provide its async equivalent as ListBlobsAsync, built on top of ListBlobsSegmentedAsync. It would look like this:

public async Task<List<IListBlobItem>> ListBlobsAsync()
{
    // A null token requests the first page.
    BlobContinuationToken continuationToken = null;
    List<IListBlobItem> results = new List<IListBlobItem>();
    do
    {
        var response = await ListBlobsSegmentedAsync(continuationToken);
        continuationToken = response.ContinuationToken;
        results.AddRange(response.Results);
    }
    while (continuationToken != null); // a null token means there are no more pages
    return results;
}

This way, developers could keep coding without stopping to figure out what “segmented” means and how the two methods differ. They would just pick ListBlobsAsync, assuming by naming convention that it is the exact async equivalent of ListBlobs, and move on.

The same goes for ListContainersSegmentedAsync. A more helpful ListContainersAsync method would look like this:

public async Task<List<CloudBlobContainer>> ListContainersAsync()
{
    BlobContinuationToken continuationToken = null;
    List<CloudBlobContainer> results = new List<CloudBlobContainer>();
    do
    {
        var response = await ListContainersSegmentedAsync(continuationToken);
        continuationToken = response.ContinuationToken;
        results.AddRange(response.Results);
    }
    while (continuationToken != null);
    return results;
}

You can add those as extension methods to the CloudBlobContainer and CloudBlobClient classes, respectively.
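As a rough sketch of what that could look like, the two helpers can live in one static class. The class name BlobListingExtensions is made up for this example, and the using directive assumes the Microsoft.Azure.Storage.Blob package; older projects that reference WindowsAzure.Storage would use the Microsoft.WindowsAzure.Storage.Blob namespace instead.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Storage.Blob; // assumption: swap for Microsoft.WindowsAzure.Storage.Blob on the older package

// Hypothetical helper class; the name is not part of the client library.
public static class BlobListingExtensions
{
    // Lists every blob in the container by following continuation tokens.
    public static async Task<List<IListBlobItem>> ListBlobsAsync(this CloudBlobContainer container)
    {
        BlobContinuationToken continuationToken = null; // null requests the first page
        var results = new List<IListBlobItem>();
        do
        {
            BlobResultSegment response = await container.ListBlobsSegmentedAsync(continuationToken);
            continuationToken = response.ContinuationToken;
            results.AddRange(response.Results);
        }
        while (continuationToken != null); // null means no more pages
        return results;
    }

    // Lists every container in the account by following continuation tokens.
    public static async Task<List<CloudBlobContainer>> ListContainersAsync(this CloudBlobClient client)
    {
        BlobContinuationToken continuationToken = null;
        var results = new List<CloudBlobContainer>();
        do
        {
            ContainerResultSegment response = await client.ListContainersSegmentedAsync(continuationToken);
            continuationToken = response.ContinuationToken;
            results.AddRange(response.Results);
        }
        while (continuationToken != null);
        return results;
    }
}
```

Call sites then read as container.ListBlobsAsync() and client.ListContainersAsync(). Keep in mind that both helpers hold the complete listing in memory, so for very large containers it may be better to process each segment as it arrives.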

These are, arguably, just helper methods missing from the client library. Their absence, again arguably, introduces some friction, and the existing methods do not exactly follow the principle of least astonishment. People usually end up misusing ListBlobsSegmentedAsync by making a single call and hoping it will return all the blobs at once.

So now you know how to correctly use these somewhat cryptic methods. Happy coding!
