BigQuery

bibtutils.gcp.bigquery

Functionality making use of GCP’s BigQuery.

See the official BigQuery Python Client documentation.

bibtutils.gcp.bigquery.create_and_upload(bucket_name, blob_name, bq_project, dataset, table, append=True, ignore_unknown=True, autodetect_schema=False, schema_json=None, generate_schema=False, time_partitioning_interval=None, time_partitioning_field=None, already_created_ok=False, credentials=None, await_result=True)[source]

Combines the functionality of create_table() and upload_gcs_json(). You may specify whether upload should proceed if the table already exists. You may also either specify the desired schema or ask BQ to autodetect the schema.

Note this function does NOT support kwargs.
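
For example, a minimal call might look like the following (the bucket, blob, project, dataset, and table names are placeholders):

from bibtutils.gcp.bigquery import create_and_upload
# Create my_project.my_dataset.my_table if it does not already exist,
# then load the NLD JSON blob into it, letting BQ detect the schema.
create_and_upload(
    bucket_name='my_bucket',
    blob_name='my_nldjson_blob',
    bq_project='my_project',
    dataset='my_dataset',
    table='my_table',
    autodetect_schema=True,
    already_created_ok=True
)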

Parameters:
  • bucket_name (str) – the bucket hosting the specified blob.

  • blob_name (str) – the blob to upload to BQ. Must be in JSON NLD format.

  • bq_project (str) – the project hosting the specified BQ dataset.

  • dataset (str) – the dataset hosting the specified table.

  • table (str) – the table to which to upload the blob.

  • append (bool) – (Optional) If true, appends to the table; if false, overwrites it. Defaults to True.

  • ignore_unknown (bool) – (Optional) If true, ignores values not reflected in the table schema while uploading. Defaults to True.

  • autodetect_schema (bool) – (Optional) If true, instructs BQ to automatically detect the schema of the data being uploaded. Defaults to False. Should be set to False if schema_json is defined.

  • schema_json (dict) – (Optional) the schema for the new table. Defaults to an empty list (no schema). The format of the schema should be identical to what is returned by bq show --format=prettyjson project:dataset.table | jq '.schema.fields'

  • time_partitioning_interval (str) – (Optional) if specified, will create the table with the time partitioning interval desired. Only recognizes values of: HOUR, DAY, MONTH, or YEAR. Capitalization doesn’t matter. If value is unrecognized, this parameter will simply be ignored. Defaults to None.

  • time_partitioning_field (str) – (Optional) if specified, will create the table with time partitioning on the desired field. Any value specified must match a top-level DATE, DATETIME, or TIMESTAMP field in the table. The field must also be NULLABLE or REQUIRED according to the schema. Defaults to None.

  • already_created_ok (bool) – (Optional) whether or not to proceed with data upload if the table already exists. Defaults to False (will fail if table exists).

  • credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.

  • await_result (bool) – (Optional) Whether to block and await the job result, or return None as soon as the job is submitted. Defaults to True.

bibtutils.gcp.bigquery.create_dataset(bq_project, dataset_name, description=None, location='US', credentials=None, **kwargs)[source]

Creates a dataset in BigQuery using the specified parameters.

Any extra args (kwargs) are passed to google.cloud.bigquery.client.Client.create_dataset().
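
For example, a minimal call might look like this (the project name, dataset name, and description are placeholders):

from bibtutils.gcp.bigquery import create_dataset
# Create a dataset in the default US location with a short description.
create_dataset(
    bq_project='my_project',
    dataset_name='my_dataset',
    description='Logs ingested by bibtutils.'
)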

Parameters:
  • bq_project (str) – the project in which to create the dataset.

  • dataset_name (str) – the name of the dataset to be created.

  • description (str) – (Optional) the description for the dataset. Defaults to None.

  • location (str) – (Optional) if specified, creates the dataset in the desired location/region. Supported locations and regions are listed at https://cloud.google.com/bigquery/docs/locations#locations_and_regions. Defaults to 'US'.

  • credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.

bibtutils.gcp.bigquery.create_table(bq_project, dataset, table, schema_json=[], time_partitioning_interval=None, time_partitioning_field=None, credentials=None, **kwargs)[source]

Creates a table in BigQuery using the specified parameters.

Any extra args (kwargs) are passed to the google.cloud.bigquery.client.Client.create_table() method.
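
For example, a call creating a day-partitioned table with an explicit schema might look like this (the project, dataset, table, and field names are placeholders; the schema follows the format described under schema_json below):

from bibtutils.gcp.bigquery import create_table
# Schema in the same format as:
#   bq show --format=prettyjson project:dataset.table | jq '.schema.fields'
schema = [
    {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'favorite_color', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'created_at', 'type': 'TIMESTAMP', 'mode': 'REQUIRED'},
]
# Create the table, partitioned by day on the created_at timestamp field.
create_table(
    bq_project='my_project',
    dataset='my_dataset',
    table='my_table',
    schema_json=schema,
    time_partitioning_interval='DAY',
    time_partitioning_field='created_at'
)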

Parameters:
  • bq_project (str) – the project in which to find the dataset.

  • dataset (str) – the dataset in which to create the table.

  • table (str) – the name of the table to create.

  • schema_json (dict) – (Optional) the schema for the new table. Defaults to an empty list (no schema). The format of the schema should be identical to what is returned by bq show --format=prettyjson project:dataset.table | jq '.schema.fields'

  • time_partitioning_interval (str) – (Optional) if specified, will create the table with the time partitioning interval desired. Only recognizes values of: HOUR, DAY, MONTH, or YEAR. Capitalization doesn’t matter. If value is unrecognized, this parameter will simply be ignored. Defaults to None.

  • time_partitioning_field (str) – (Optional) if specified, will create the table with time partitioning on the desired field. Any value specified must match a top-level DATE, DATETIME, or TIMESTAMP field in the table. If also specifying HOUR as the partition interval, this parameter cannot be DATE. The field must also be NULLABLE or REQUIRED according to the schema. Defaults to None.

  • credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.

bibtutils.gcp.bigquery.delete_dataset(bq_project, dataset_name, delete_contents=False, not_found_ok=False, credentials=None, **kwargs)[source]

Deletes a dataset in BigQuery using the specified parameters.

Any extra args (kwargs) are passed to google.cloud.bigquery.client.Client.delete_dataset().
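
For example (the project and dataset names are placeholders):

from bibtutils.gcp.bigquery import delete_dataset
# Delete the dataset and any tables it contains; do not raise if it is already gone.
delete_dataset(
    bq_project='my_project',
    dataset_name='my_dataset',
    delete_contents=True,
    not_found_ok=True
)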

Parameters:
  • bq_project (str) – the project in which to find the dataset.

  • dataset_name (str) – the name of the dataset to be deleted.

  • delete_contents (bool) – (Optional) If true, the dataset's tables are deleted along with the dataset. Defaults to False, in which case the dataset will not be deleted if it contains tables.

  • not_found_ok (bool) – (Optional) If true, suppresses the error raised if the dataset is not found. Defaults to False (errors are not suppressed).

  • credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.

bibtutils.gcp.bigquery.delete_table(bq_project, dataset, table, credentials=None, **kwargs)[source]

Deletes the specified table in BigQuery.

Any extra args (kwargs) are passed to the google.cloud.bigquery.client.Client.delete_table() method.
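
For example (the project, dataset, and table names are placeholders):

from bibtutils.gcp.bigquery import delete_table
# Delete my_project.my_dataset.my_table.
delete_table(
    bq_project='my_project',
    dataset='my_dataset',
    table='my_table'
)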

Parameters:
  • bq_project (str) – the bq project where the dataset lives.

  • dataset (str) – the bq dataset where the table lives.

  • table (str) – the bq table to delete.

  • credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.

bibtutils.gcp.bigquery.query(query, query_project=None, credentials=None, await_result=True)[source]

Sends the user-supplied query to BQ and returns the result as a list of dicts. The account running the query must have Job Create permissions in the GCP Project and at least Data Viewer on the target dataset.

from bibtutils.gcp.bigquery import query
fav_color_blue = query(
    'select name, favorite_color '
    'from `my_project.my_dataset.my_table` '
    'where favorite_color="blue"'
)
for row in fav_color_blue:
    print(row['name'])
Parameters:
  • query (str) – a full BQ query (e.g. 'select * from `x.y.z` where a=b')

  • query_project (str) – (Optional) the ID of the project in which to run the query. If not specified, defaults to the project associated with the environment's credentials.

  • credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.

  • await_result (bool) – (Optional) Whether to block and await the job result, or return None as soon as the job is submitted. Defaults to True.

Return type:

list

Returns:

a list of dicts, one dict per row of the result table.

bibtutils.gcp.bigquery.upload_gcs_json(bucket_name, blob_name, bq_project, dataset, table, append=True, ignore_unknown=True, autodetect_schema=False, schema_json=None, credentials=None, await_result=True, **kwargs)[source]

Uploads a GCS blob in JSON NLD format to the specified table in BQ.

Executing account must have both read permissions on the bucket/blob and edit permissions on the dataset, in addition to the IAM bigquery jobs user role in the project. NLD JSON file schema must match that of the destination table.

Use write_gcs_nldjson() to get a properly formatted blob from JSON objects.

Any extra args (kwargs) are passed to google.cloud.bigquery.client.Client.load_table_from_uri().

from bibtutils.gcp.bigquery import upload_gcs_json
upload_gcs_json(
    bucket_name='my_bucket',
    blob_name='my_nldjson_blob',
    bq_project='my_project',
    dataset='my_dataset',
    table='my_table'
)
Parameters:
  • bucket_name (str) – the bucket hosting the specified blob.

  • blob_name (str) – the blob to upload to BQ. Must be in JSON NLD format.

  • bq_project (str) – the project hosting the specified BQ dataset.

  • dataset (str) – the dataset hosting the specified table.

  • table (str) – the table to which to upload the blob.

  • append (bool) – (Optional) If true, appends to the table; if false, overwrites it. Defaults to True.

  • ignore_unknown (bool) – (Optional) If true, ignores values not reflected in the table schema while uploading. Defaults to True.

  • autodetect_schema (bool) – (Optional) If true, instructs BQ to automatically detect the schema of the data being uploaded. Defaults to False.

  • schema_json (dict) – (Optional) the schema for the new table. Defaults to an empty list (no schema). The format of the schema should be identical to what is returned by bq show --format=prettyjson project:dataset.table | jq '.schema.fields'

  • credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.

  • await_result (bool) – (Optional) Whether to block and await the job result, or return None as soon as the job is submitted. Defaults to True.
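
As a variation on the example above, an explicit schema may be supplied instead of autodetection, and append=False may be used to overwrite the destination table (all resource names here are placeholders; the schema follows the format described under schema_json above):

from bibtutils.gcp.bigquery import upload_gcs_json
# Schema in the same format as:
#   bq show --format=prettyjson project:dataset.table | jq '.schema.fields'
schema = [
    {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'favorite_color', 'type': 'STRING', 'mode': 'NULLABLE'},
]
# Overwrite the destination table with the blob's contents.
upload_gcs_json(
    bucket_name='my_bucket',
    blob_name='my_nldjson_blob',
    bq_project='my_project',
    dataset='my_dataset',
    table='my_table',
    append=False,
    schema_json=schema
)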