Cloud Storage

bibtutils.gcp.storage

Functionality making use of GCP’s Cloud Storage.

See the official Cloud Storage Python Client documentation here: https://cloud.google.com/python/docs/reference/storage/latest

bibtutils.gcp.storage.create_bucket(project, bucket_name, location='US', credentials=None)

Creates a Google Cloud Storage bucket in the specified project.
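
A minimal usage sketch (the project and bucket names below are placeholders):

from bibtutils.gcp.storage import create_bucket
bucket = create_bucket('my-project', 'my-unique-bucket-name')
print(bucket.name)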

Parameters:
  • project (str) – the project in which to create the bucket. The account being used must have “Storage Admin” rights on the GCP project.

  • bucket_name (str) – the name of the bucket to create. Note that bucket names must be globally unique across GCS and must adhere to the bucket naming guidelines: https://cloud.google.com/storage/docs/naming-buckets

  • location ((Optional) str) – if specified, creates the bucket in the desired location/region. Supported locations and regions are listed here: https://cloud.google.com/storage/docs/locations. Defaults to 'US' if unspecified.

  • credentials (google.oauth2.credentials.Credentials) – (Optional) the credentials object to use when making the API call, if not using the credentials of the account running the function.

Return type:

google.cloud.storage.bucket.Bucket

Returns:

The bucket created during this function call.

bibtutils.gcp.storage.read_gcs(bucket_name, blob_name, decode=True, credentials=None)

Reads the contents of a blob from GCS. The executing account must have (at least) read permissions on the bucket/blob.

Note that for extremely large files, setting decode=True can increase runtime substantially.

from bibtutils.gcp.storage import read_gcs
data = read_gcs('my_bucket', 'my_blob')
print(data)
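
For binary blobs, a sketch using decode=False to get the raw bytes back (bucket and blob names are placeholders):

from bibtutils.gcp.storage import read_gcs
raw_bytes = read_gcs('my_bucket', 'my_binary_blob', decode=False)
print(len(raw_bytes))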
Parameters:
  • bucket_name (str) – the bucket hosting the specified blob.

  • blob_name (str) – the blob to read from GCS.

  • decode (bool) – (Optional) whether or not to decode the blob contents into utf-8. Defaults to True.

  • credentials (google.oauth2.credentials.Credentials) – (Optional) the credentials object to use when making the API call, if not using the credentials of the account running the function.

Return type:

str (or bytes, if decode=False)

Returns:

the blob's contents, decoded to utf-8 (or raw bytes if decode=False).

bibtutils.gcp.storage.read_gcs_nldjson(bucket_name, blob_name, **kwargs)

Reads a blob in JSON NLD (newline-delimited JSON) format from GCS and returns it as a list of dicts. Any extra arguments (kwargs) are passed to the read_gcs() function.

from bibtutils.gcp.storage import read_gcs_nldjson
data = read_gcs_nldjson('my_bucket', 'my_nldjson_blob')
print([item['favorite_color'] for item in data])
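
Since extra keyword arguments are forwarded to read_gcs(), options such as credentials can be passed through. A sketch, assuming creds is a pre-built google.oauth2.credentials.Credentials object:

from bibtutils.gcp.storage import read_gcs_nldjson
# creds is assumed to be an existing google.oauth2.credentials.Credentials instance
data = read_gcs_nldjson('my_bucket', 'my_nldjson_blob', credentials=creds)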
Parameters:
  • bucket_name (str) – the bucket hosting the specified blob.

  • blob_name (str) – the blob to read from GCS.

Return type:

list

Returns:

the data from the blob, converted into a list of dict.

bibtutils.gcp.storage.write_gcs(bucket_name, blob_name, data, mime_type='text/plain', create_bucket_if_not_found=False, timeout=60, credentials=None)

Writes a string or bytes object to the given blob in the given bucket. The executing account must have (at least) write permissions on the bucket. If data is a str, it will be encoded as utf-8 before uploading.

from bibtutils.gcp.storage import write_gcs
write_gcs('my_bucket', 'my_blob', data='my favorite color is blue')
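
A sketch exercising the optional arguments from the signature above (bucket and blob names are placeholders):

from bibtutils.gcp.storage import write_gcs
write_gcs(
    'my_bucket',
    'report.json',
    data='{"status": "ok"}',
    mime_type='application/json',
    create_bucket_if_not_found=True,
)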
Parameters:
  • bucket_name (str) – the name of the bucket to which to write.

  • blob_name (str) – the name of the blob to write.

  • data (str OR bytes) – the data to be written.

  • create_bucket_if_not_found (bool) – (Optional) if True, will attempt to create the bucket if it does not exist. Defaults to False.

  • credentials (google.oauth2.credentials.Credentials) – (Optional) the credentials object to use when making the API call, if not using the credentials of the account running the function.

  • mime_type (str) – (Optional) the MIME type of the data being uploaded. Defaults to 'text/plain'.

  • timeout (int) – (Optional) the timeout for the upload request, in seconds. Defaults to 60.

bibtutils.gcp.storage.write_gcs_nldjson(bucket_name, blob_name, json_data, add_date=False, **kwargs)

Writes a dict (or a list of dicts) to the given blob in the given bucket. The executing account must have (at least) write permissions on the bucket. Use in conjunction with upload_gcs_json() to upload JSON data to BigQuery tables. Any extra arguments (kwargs) are passed to the write_gcs() function.

from bibtutils.gcp.storage import write_gcs_nldjson
write_gcs_nldjson(
    'my_bucket',
    'my_nldjson_blob',
    json_data=[
        {'name': 'leo', 'favorite_color': 'red'},
        {'name': 'matthew', 'favorite_color': 'blue'}
    ]
)
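
A sketch with the optional flags enabled (the exact field name that add_date attaches to each row is determined by the library):

from bibtutils.gcp.storage import write_gcs_nldjson
write_gcs_nldjson(
    'my_bucket',
    'my_nldjson_blob',
    json_data={'name': 'leo', 'favorite_color': 'red'},  # a single dict is treated as one row
    add_date=True,
    create_bucket_if_not_found=True,  # forwarded to write_gcs()
)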
Parameters:
  • bucket_name (str) – the name of the bucket to which to write.

  • blob_name (str) – the name of the blob to write.

  • json_data (list OR dict) – the data to be written. Can be a list or a dict; a dict is treated as one row of data (and converted to a one-item list). The data will be converted to a JSON NLD formatted string before uploading, for compatibility with upload_gcs_json().

  • add_date (bool) – (Optional) whether or not to add upload date to the data before upload. Defaults to False.

  • create_bucket_if_not_found (bool) – (Optional) if True, will attempt to create the bucket if it does not exist. Defaults to False.