BigQuery¶
bibtutils.gcp.bigquery¶
Functionality making use of GCP’s BigQuery.
See the official BigQuery Python Client documentation: https://cloud.google.com/python/docs/reference/bigquery/latest.
- bibtutils.gcp.bigquery.create_and_upload(bucket_name, blob_name, bq_project, dataset, table, append=True, ignore_unknown=True, autodetect_schema=False, schema_json=None, generate_schema=False, time_partitioning_interval=None, time_partitioning_field=None, already_created_ok=False, credentials=None, await_result=True)[source]¶
Combines the functionality of create_table() and upload_gcs_json(). You may specify whether the upload should proceed if the table already exists. You may also either specify the desired schema or ask BQ to autodetect it. Note that this function does NOT support kwargs.
- Parameters:
bucket_name (str) – the bucket hosting the specified blob.
blob_name (str) – the blob to upload to BQ. Must be in NLD JSON format.
bq_project (str) – the project hosting the specified BQ dataset.
dataset (str) – the dataset hosting the specified table.
table (str) – the table to which to upload the blob.
append (bool) – (Optional) if true, appends to the table; if false, overwrites it. Defaults to True.
ignore_unknown (bool) – (Optional) if true, ignores values not reflected in the table schema while uploading. Defaults to True.
autodetect_schema (bool) – (Optional) if true, instructs BQ to automatically detect the schema of the data being uploaded. Defaults to False. Should be set to False if schema_json is defined.
schema_json (dict) – (Optional) the schema for the new table. Defaults to None (no schema). The format of the schema should be identical to what is returned by bq show --format=prettyjson project:dataset.table | jq '.schema.fields'
time_partitioning_interval (str) – (Optional) if specified, creates the table with the desired time partitioning interval. Only recognizes the values HOUR, DAY, MONTH, or YEAR. Capitalization doesn't matter. If the value is unrecognized, this parameter is simply ignored. Defaults to None.
time_partitioning_field (str) – (Optional) if specified, creates the table with time partitioning on the desired field. Any value specified must match a top-level DATE, DATETIME, or TIMESTAMP field in the table. The field must also be NULLABLE or REQUIRED according to the schema. Defaults to None.
already_created_ok (bool) – (Optional) whether or not to proceed with the data upload if the table already exists. Defaults to False (will fail if the table exists).
credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.
await_result (bool) – whether to block and await the job result, or simply return None once the job is submitted.
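As a sketch of how the parameters above fit together, the following builds an argument set for create_and_upload() with an explicit schema and day-level partitioning. All bucket, project, dataset, and field names are hypothetical, and the call itself is commented out because it requires GCP credentials:

```python
# Hypothetical argument set for create_and_upload(): explicit schema,
# day-level time partitioning on a TIMESTAMP field.
params = {
    "bucket_name": "my_bucket",
    "blob_name": "events.nldjson",      # blob must be NLD JSON
    "bq_project": "my_project",
    "dataset": "my_dataset",
    "table": "events",
    "append": True,
    "autodetect_schema": False,         # False because schema_json is given
    "schema_json": [
        {"name": "name", "type": "STRING", "mode": "REQUIRED"},
        {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"},
    ],
    "time_partitioning_interval": "DAY",
    "time_partitioning_field": "created_at",  # top-level TIMESTAMP, NULLABLE
    "already_created_ok": True,
}
# from bibtutils.gcp.bigquery import create_and_upload
# create_and_upload(**params)   # requires GCP credentials
```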
- bibtutils.gcp.bigquery.create_dataset(bq_project, dataset_name, description=None, location='US', credentials=None, **kwargs)[source]¶
Creates a dataset in BigQuery using the specified parameters.
- Any extra args (kwargs) are passed to google.cloud.bigquery.client.Client.create_dataset().
- Parameters:
bq_project (str) – the project in which to create the dataset.
dataset_name (str) – the name of the dataset to be created.
description (str) – (Optional) the description for the dataset. Defaults to None.
location (str) – (Optional) the location/region in which to create the dataset. Supported locations and regions are listed at https://cloud.google.com/bigquery/docs/locations. Defaults to 'US'.
credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.
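A minimal sketch of creating a dataset in a non-default region. The project and dataset names are hypothetical, and the call is commented out because it requires GCP credentials; extra kwargs such as exists_ok are forwarded to the underlying client method:

```python
# Hypothetical arguments for create_dataset(), placing the dataset in
# the EU multi-region instead of the default 'US'.
dataset_args = {
    "bq_project": "my_project",
    "dataset_name": "my_dataset",
    "description": "Landing zone for raw events",
    "location": "EU",
}
# from bibtutils.gcp.bigquery import create_dataset
# exists_ok=True is forwarded to Client.create_dataset(), so an
# already-existing dataset does not raise an error.
# create_dataset(**dataset_args, exists_ok=True)
```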
- bibtutils.gcp.bigquery.create_table(bq_project, dataset, table, schema_json=[], time_partitioning_interval=None, time_partitioning_field=None, credentials=None, **kwargs)[source]¶
Creates a table in BigQuery using the specified parameters.
- Any extra args (kwargs) are passed to the google.cloud.bigquery.client.Client.create_table() method.
- Parameters:
bq_project (str) – the project in which to find the dataset.
dataset (str) – the dataset in which to create the table.
table (str) – the name of the table to create.
schema_json (dict) – (Optional) the schema for the new table. Defaults to an empty list (no schema). The format of the schema should be identical to what is returned by bq show --format=prettyjson project:dataset.table | jq '.schema.fields'
time_partitioning_interval (str) – (Optional) if specified, creates the table with the desired time partitioning interval. Only recognizes the values HOUR, DAY, MONTH, or YEAR. Capitalization doesn't matter. If the value is unrecognized, this parameter is simply ignored. Defaults to None.
time_partitioning_field (str) – (Optional) if specified, creates the table with time partitioning on the desired field. Any value specified must match a top-level DATE, DATETIME, or TIMESTAMP field in the table. If also specifying HOUR as the partition interval, this parameter cannot be a DATE field. The field must also be NULLABLE or REQUIRED according to the schema. Defaults to None.
credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.
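For reference, a schema_json value in the expected format might look like the following (field names here are illustrative, not from the library). The same structure also determines which fields are valid for time_partitioning_field:

```python
import json

# Schema in the same shape as the output of:
#   bq show --format=prettyjson project:dataset.table | jq '.schema.fields'
# i.e. a list of field objects with name/type/mode.
schema_json = json.loads("""
[
  {"name": "name",       "type": "STRING",    "mode": "REQUIRED"},
  {"name": "age",        "type": "INTEGER",   "mode": "NULLABLE"},
  {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"}
]
""")

# Fields eligible for time_partitioning_field: top-level DATE, DATETIME,
# or TIMESTAMP fields that are NULLABLE or REQUIRED.
partition_candidates = [
    f["name"] for f in schema_json
    if f["type"] in ("DATE", "DATETIME", "TIMESTAMP")
    and f["mode"] in ("NULLABLE", "REQUIRED")
]
print(partition_candidates)  # ['created_at']
```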
- bibtutils.gcp.bigquery.delete_dataset(bq_project, dataset_name, delete_contents=False, not_found_ok=False, credentials=None, **kwargs)[source]¶
Deletes a dataset in BigQuery using the specified parameters.
- Any extra args (kwargs) are passed to google.cloud.bigquery.client.Client.delete_dataset().
- Parameters:
bq_project (str) – the project in which to find the dataset.
dataset_name (str) – the name of the dataset to be deleted.
delete_contents (bool) – (Optional) if true, deletes the dataset even if it still contains tables. Defaults to False, in which case deletion fails if the dataset contains any tables.
not_found_ok (bool) – (Optional) if true, suppresses the error raised when the dataset is not found. Defaults to False (errors are not suppressed).
credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.
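A minimal sketch of the two safety flags on delete_dataset(). The project and dataset names are hypothetical, and the call itself is commented out because it requires GCP credentials:

```python
# Documented defaults for delete_dataset()'s safety flags: with these
# values, deletion fails if the dataset contains tables, and a missing
# dataset raises an error.
defaults = {"delete_contents": False, "not_found_ok": False}

# To force-delete a dataset and tolerate its absence:
# from bibtutils.gcp.bigquery import delete_dataset
# delete_dataset("my_project", "old_dataset",
#                delete_contents=True,  # also remove any tables inside
#                not_found_ok=True)     # no error if already gone
```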
- bibtutils.gcp.bigquery.delete_table(bq_project, dataset, table, credentials=None, **kwargs)[source]¶
Deletes the specified table.
- Any extra args (kwargs) are passed to the google.cloud.bigquery.client.Client.delete_table() method.
- Parameters:
bq_project (str) – the BQ project where the dataset lives.
dataset (str) – the BQ dataset where the table lives.
table (str) – the BQ table to delete.
credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.
- bibtutils.gcp.bigquery.query(query, query_project=None, credentials=None, await_result=True)[source]¶
Sends the user-supplied query to BQ and returns the result as a list of dicts. The account running the query must have Job Create permissions in the GCP Project and at least Data Viewer on the target dataset.
```python
from bibtutils.gcp.bigquery import query

fav_color_blue = query(
    'select name, favorite_color '
    'from `my_project.my_dataset.my_table` '
    'where favorite_color="blue"'
)
for row in fav_color_blue:
    print(row['name'])
```
- Parameters:
query (str) – a full BQ query (e.g. 'select * from `x.y.z` where a=b')
query_project (str) – the ID of the project in which to run the query. If not specified, defaults to the environment credentials' project.
credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.
await_result (bool) – whether to block and await the job result, or simply return None once the job is submitted.
- Return type:
list
- Returns:
a list of dicts, one dict per row in the result table.
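Since the function returns a plain list of dicts, the result can be handled with ordinary Python. The rows below are a stand-in for a query result (no BQ call is made; the column names are illustrative):

```python
# Stand-in for what query() would return: one dict per result row,
# keyed by column name.
rows = [
    {"name": "ada", "favorite_color": "blue"},
    {"name": "grace", "favorite_color": "blue"},
]

# Typical post-processing: project out a single column.
names = [row["name"] for row in rows]
print(names)  # ['ada', 'grace']
```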
- bibtutils.gcp.bigquery.upload_gcs_json(bucket_name, blob_name, bq_project, dataset, table, append=True, ignore_unknown=True, autodetect_schema=False, schema_json=None, credentials=None, await_result=True, **kwargs)[source]¶
Uploads a GCS blob in JSON NLD format to the specified table in BQ.
Executing account must have both read permissions on the bucket/blob and edit permissions on the dataset, in addition to the IAM bigquery jobs user role in the project. NLD JSON file schema must match that of the destination table.
Use write_gcs_nldjson() to get a properly formatted blob from JSON objects.
- Any extra args (kwargs) are passed to google.cloud.bigquery.client.Client.load_table_from_uri().
```python
from bibtutils.gcp.bigquery import upload_gcs_json

upload_gcs_json(
    bucket_name='my_bucket',
    blob_name='my_nldjson_blob',
    bq_project='my_project',
    dataset='my_dataset',
    table='my_table'
)
```
- Parameters:
bucket_name (str) – the bucket hosting the specified blob.
blob_name (str) – the blob to upload to BQ. Must be in NLD JSON format.
bq_project (str) – the project hosting the specified BQ dataset.
dataset (str) – the dataset hosting the specified table.
table (str) – the table to which to upload the blob.
append (bool) – (Optional) if true, appends to the table; if false, overwrites it. Defaults to True.
ignore_unknown (bool) – (Optional) if true, ignores values not reflected in the table schema while uploading. Defaults to True.
autodetect_schema (bool) – (Optional) if true, instructs BQ to automatically detect the schema of the data being uploaded. Defaults to False.
schema_json (dict) – (Optional) the schema for the new table. Defaults to None (no schema). The format of the schema should be identical to what is returned by bq show --format=prettyjson project:dataset.table | jq '.schema.fields'
credentials (google.oauth2.credentials.Credentials) – the credentials object to use when making the API call, if not to use the account running the function for authentication.
await_result (bool) – whether to block and await the job result, or simply return None once the job is submitted.
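For context on the expected blob format: NLD (newline-delimited) JSON is simply one JSON object per line. The following shows that conversion locally, with no GCS or BQ calls (in real use, write_gcs_nldjson() produces such a blob for you; the record contents are illustrative):

```python
import json

# Convert a list of dicts to newline-delimited JSON: one serialized
# object per line, the format upload_gcs_json() expects the blob to be in.
records = [
    {"name": "ada", "age": 36},
    {"name": "grace", "age": 85},
]
nld_json = "\n".join(json.dumps(r) for r in records)
print(nld_json)
# {"name": "ada", "age": 36}
# {"name": "grace", "age": 85}
```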