Connect to filesystem Data Assets
Use the information provided here to connect to Data Assets stored on Amazon S3, Google Cloud Storage (GCS), Microsoft Azure Blob Storage, or local filesystems. Great Expectations (GX) uses the term Data Asset when referring to data in its original format, and the term Data Source when referring to the storage location for Data Assets.
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- Filesystem
Amazon S3 Data Source
Connect to an Amazon S3 Data Source.
- pandas
- Spark
The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- An installation of GX set up to work with S3
- Access to data on an S3 bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create an Amazon S3 Data Source:
- name: The Data Source name. In the following examples, this is "my_s3_datasource".
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
1. Run the following Python code to define name, bucket_name, and boto3_options:

datasource_name = "my_s3_datasource"
bucket_name = "my_bucket"
boto3_options = {}

Additional options for boto3_options: the boto3_options parameter allows you to pass the following information:

- endpoint_url: Specifies an S3 endpoint. You can use an environment variable such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the corresponding environment variable.
- region_name: Your AWS region name.
2. Run the following Python code to pass name, bucket_name, and boto3_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_s3(
name=datasource_name, bucket=bucket_name, boto3_options=boto3_options
)
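If you need the non-default boto3_options described above, the call might look like the following. This is a hedged sketch: the endpoint and region values are hypothetical examples, and "${S3_ENDPOINT}" is resolved from the environment variable of the same name.

boto3_options = {
    "endpoint_url": "${S3_ENDPOINT}",  # substituted with the S3_ENDPOINT environment variable
    "region_name": "us-east-1",  # hypothetical region
}
datasource = context.sources.add_pandas_s3(
    name="my_s3_datasource_with_options",  # hypothetical name, to avoid clashing with the example above
    bucket=bucket_name,
    boto3_options=boto3_options,
)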
Add data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name, batching_regex=batching_regex, s3_prefix=s3_prefix
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your S3 bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
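As a minimal sketch of how those keys are used (assuming the data_asset created above), pass them as options when you build a Batch Request:

# Request only the Batch whose file name matched year "2021" and month "12".
batch_request = data_asset.build_batch_request(options={"year": "2021", "month": "12"})
batches = data_asset.get_batch_list_from_batch_request(batch_request)
print(len(batches))  # 1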
Next steps
The following examples connect to .csv data.
Prerequisites
- An installation of GX set up to work with S3
- Access to data on an S3 bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create an Amazon S3 Data Source:
- name: The Data Source name. In the following examples, this is "my_s3_datasource".
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
1. Run the following Python code to define name, bucket_name, and boto3_options:

datasource_name = "my_s3_datasource"
bucket_name = "my_bucket"
boto3_options = {}

Additional options for boto3_options: the boto3_options parameter allows you to pass the following information:

- endpoint_url: Specifies an S3 endpoint. You can use an environment variable such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the corresponding environment variable.
- region_name: Your AWS region name.
2. Run the following Python code to pass name, bucket_name, and boto3_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_s3(
name=datasource_name, bucket=bucket_name, boto3_options=boto3_options
)
Add data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
s3_prefix=s3_prefix,
header=True,
infer_schema=True,
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your S3 bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
Next steps
Microsoft Azure Blob Storage
Connect to a Microsoft Azure Blob Storage Data Source.
- pandas
- Spark
Use Pandas to connect to data stored in Microsoft Azure Blob Storage. The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- GX installed and set up to work with Azure Blob Storage
- Access to data in Azure Blob Storage
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: The Data Source name. In the following examples, this is "my_datasource".
- azure_options: Authentication settings.
1. Run the following Python code to define name and azure_options:

datasource_name = "my_datasource"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
2. Run the following Python code to pass name and azure_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_abs(
    name=datasource_name, azure_options=azure_options
)

Where did that account URL come from? In the previous example, the value for account_url is substituted with the contents of the AZURE_STORAGE_ACCOUNT_URL environment variable you configured when you installed GX and set up your Azure Blob Storage dependencies.
Add data to the Data Source as a Data Asset
To specify the data to connect to, provide the following elements:
- name: A name by which you can reference the Data Asset in the future.
- batching_regex: A regular expression that indicates which files to treat as Batches in your Data Asset and how to identify them.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: A string indicating what part of the batching_regex to truncate from the final Batch names.
- abs_recursive_file_discovery: Optional. A boolean (True/False) indicating whether files should be searched for recursively in subfolders.
1. Run the following Python code to define the connection values:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
abs_container = "my_container"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
2. Run the following Python code to create the Data Asset:

data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
abs_container=abs_container,
abs_name_starts_with=abs_name_starts_with,
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your Azure Blob Storage container has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
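If your files are spread across subfolders beneath abs_name_starts_with, you can enable the optional abs_recursive_file_discovery flag described in the list of connection elements above. This is a hedged sketch; the asset name is a hypothetical example:

data_asset = datasource.add_csv_asset(
    name="my_recursive_taxi_asset",  # hypothetical asset name
    batching_regex=batching_regex,
    abs_container=abs_container,
    abs_name_starts_with=abs_name_starts_with,
    abs_recursive_file_discovery=True,  # also search subfolders for matching files
)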
Next steps
Use Spark to connect to data stored in Microsoft Azure Blob Storage. The following examples connect to .csv data.
Prerequisites
- GX installed and set up to work with Azure Blob Storage
- Access to data in Azure Blob Storage
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: The Data Source name. In the following examples, this is "my_datasource".
- azure_options: Authentication settings.
1. Run the following Python code to define name and azure_options:

datasource_name = "my_datasource"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
}
2. Run the following Python code to pass name and azure_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_abs(
    name=datasource_name, azure_options=azure_options
)

Where did that account URL come from? In the previous example, the value for account_url is substituted with the contents of the AZURE_STORAGE_ACCOUNT_URL environment variable you configured when you installed GX and set up your Azure Blob Storage dependencies.
Add data to the Data Source as a Data Asset
To specify the data to connect to, provide the following elements:
- name: A name by which you can reference the Data Asset in the future.
- batching_regex: A regular expression that indicates which files to treat as Batches in your Data Asset and how to identify them.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: A string indicating what part of the batching_regex to truncate from the final Batch names.
- abs_recursive_file_discovery: Optional. A boolean (True/False) indicating whether files should be searched for recursively in subfolders.
1. Run the following Python code to define the connection values:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
abs_container = "my_container"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
2. Run the following Python code to create the Data Asset:

data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
abs_container=abs_container,
header=True,
infer_schema=True,
abs_name_starts_with=abs_name_starts_with,
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your Azure Blob Storage container has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
Next steps
GCS Data Source
Connect to a GCS Data Source.
- pandas
- Spark
The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- An installation of GX set up to work with GCS
- Access to data in a GCS bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a GCS Data Source:
- name: The Data Source name. In the following examples, this is "my_gcs_datasource".
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
1. Run the following Python code to define name, bucket_or_name, and gcs_options:

datasource_name = "my_gcs_datasource"
bucket_or_name = "my_bucket"
gcs_options = {}
2. Run the following Python code to pass name, bucket_or_name, and gcs_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_gcs(
name=datasource_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
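If your environment doesn't supply default Google Cloud credentials, gcs_options can also carry authentication details. The following is only a sketch under that assumption: the "filename" key and the path to a service account key file are illustrative, so check the Google Cloud authentication documentation for the options that apply to your setup.

gcs_options = {
    "filename": "/path/to/my/service_account_key.json",  # hypothetical path to a service account key file
}
datasource = context.sources.add_pandas_gcs(
    name="my_gcs_datasource_with_credentials",  # hypothetical name
    bucket_or_name=bucket_or_name,
    gcs_options=gcs_options,
)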
Add GCS data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name, batching_regex=batching_regex, gcs_prefix=gcs_prefix
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your GCS bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
Related documentation
For more information on Google Cloud and authentication, see the Google Cloud documentation on authentication.
Use Spark to connect to a GCS Data Source. The following examples connect to .csv data.
Prerequisites
- An installation of GX set up to work with GCS
- Access to data on a GCS bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a GCS Data Source:
- name: The Data Source name. In the following examples, this is "my_gcs_datasource".
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
1. Run the following Python code to define name, bucket_or_name, and gcs_options:

datasource_name = "my_gcs_datasource"
bucket_or_name = "my_bucket"
gcs_options = {}
2. Run the following Python code to pass name, bucket_or_name, and gcs_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_gcs(
name=datasource_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Add GCS data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
gcs_prefix=gcs_prefix,
header=True,
infer_schema=True,
)
header and infer_schema: In the previous example, there are two optional parameters. If the file does not have a header line, the header parameter can be left out because it defaults to False. If you do not want GX to infer the schema of your file, you can exclude the infer_schema parameter because it also defaults to False.
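As a minimal sketch of that note, a headerless file with no schema inference would omit both optional parameters (the asset name here is a hypothetical example):

data_asset = datasource.add_csv_asset(
    name="my_headerless_taxi_asset",  # hypothetical asset name
    batching_regex=batching_regex,
    gcs_prefix=gcs_prefix,
    # header and infer_schema are omitted, so both default to False
)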
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your GCS bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
Related documentation
For more information on Google Cloud and authentication, see the Google Cloud documentation on authentication.
Filesystem Data Source
Connect to filesystem Data Assets.
- Single file with pandas
- Multiple files with pandas
- Multiple files with Spark
Use Pandas to connect to data stored in files on a filesystem. The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with Data Source dependencies.
- A Data Context.
- Access to filesystem Data Assets
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Specify a file to read into a Data Asset
Run the following Python code to read the data in individual files directly into a Validator with Pandas:
validator = context.sources.pandas_default.read_csv(
"https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)
In this example, we are connecting to a csv file. However, Great Expectations supports connecting to most types of files that Pandas has read_* methods for.
Because you will be using Pandas to connect to these files, the specific add_*_asset methods that will be available to you will be determined by your currently installed version of Pandas.
For more information on which Pandas read_* methods are available to you as add_*_asset methods, please reference the official Pandas Input/Output documentation for the version of Pandas that you have installed.
In the GX Python API, add_*_asset methods will require the same parameters as the corresponding Pandas read_* method, with one caveat: In Great Expectations, you will also be required to provide a value for an asset_name parameter.
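For example, reading a JSON file instead of a CSV follows the same pattern, provided your installed Pandas version supports read_json. This is a hedged sketch, and the path is a hypothetical example:

validator = context.sources.pandas_default.read_json(
    "path/to/my_file.json"  # hypothetical path; any file type with a Pandas read_* method works the same way
)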
Create a Data Source (Optional)
Modify the following code to connect to your Data Source (a Data Source provides a standard API for accessing and interacting with data from a wide variety of source systems). If you don't have data available for testing, you can use the NYC taxi data. The NYC taxi data is open source, and it is updated every month. An individual record in the data corresponds to one taxi trip.
Do not include sensitive information such as credentials in the configuration when you connect to your Data Source. This information appears as plain text in the database. If you must include credentials or a full connection string, GX recommends using a config variables file.
# Give your Datasource a name
datasource_name = None
datasource = context.sources.add_pandas(datasource_name)
# Give your first Asset a name
asset_name = None
path_to_data = None
# to use sample data uncomment next line
# path_to_data = "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)
# Build batch request
batch_request = asset.build_batch_request()
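From here, a common next step is to get a Validator for that Batch Request so you can evaluate Expectations interactively. This is a minimal sketch; the Expectation Suite name is a hypothetical example:

context.add_or_update_expectation_suite("my_expectation_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_expectation_suite",
)
print(validator.head())  # preview the first rows of the Batch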
Next steps
Use Pandas to connect to data stored in files on a filesystem. The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with Data Source dependencies.
- A Data Context.
- Access to filesystem Data Assets
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Filesystem Data Source:
- name: The Data Source name.
- base_directory: The path to the folder containing the files the Data Source connects to.
1. Run the following Python code to define name and base_directory and store the information in the Python variables datasource_name and path_to_folder_containing_csv_files:

datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"

Using relative paths as the base_directory of a Filesystem Data Source: if you are using a Filesystem Data Context, you can provide a path for base_directory that is relative to the folder containing your Data Context. However, an in-memory Ephemeral Data Context doesn't exist in a folder, so relative paths are instead resolved relative to the folder your Python code is being executed in.
2. Run the following Python code to pass name and base_directory as parameters when you create your Data Source:

datasource = context.sources.add_pandas_filesystem(
name=datasource_name, base_directory=path_to_folder_containing_csv_files
)
You can access files that are nested in folders under your Data Source's base_directory!
If your Data Assets are located in multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Data Source, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
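For example, if your monthly files live in a subfolder beneath base_directory, the relative folder path can be included in the batching_regex. The folder name below is a hypothetical example:

# Matches files such as "2021_data/yellow_tripdata_sample_2021-07.csv" relative to base_directory.
batching_regex = r"2021_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"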
Add a Data Asset to the Data Source
A Data Asset requires the following information to be defined:
- name: The Data Asset name. Helpful when you define multiple Data Assets in the same Data Source.
- batching_regex: A regular expression that matches the files to be included in the Data Asset.
What if batching_regex matches multiple files? Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example:
Let's say that you have a filesystem Data Source pointing to a base folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2019-03.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2}).csv", your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names "year" and "month". When you send a Batch Request query featuring this Data Asset in the future, you can use these group names with their respective values as options to control which Batches will be returned.
For example, you could return all Batches in the year of 2021, or the one Batch for July of 2020.
1. Run the following Python code to define name and batching_regex and store the information in the Python variables asset_name and batching_regex:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
2. Run the following Python code to pass name and batching_regex as parameters when you create your Data Asset:

datasource.add_csv_asset(name=asset_name, batching_regex=batching_regex)

Using Pandas to connect to different file types: in this example, we are connecting to a csv file. However, Great Expectations supports connecting to most types of files that Pandas has read_* methods for. Because you will be using Pandas to connect to these files, the specific add_*_asset methods that will be available to you will be determined by your currently installed version of Pandas. For more information on which Pandas read_* methods are available to you as add_*_asset methods, please reference the official Pandas Input/Output documentation for the version of Pandas that you have installed. In the GX Python API, add_*_asset methods will require the same parameters as the corresponding Pandas read_* method, with one caveat: in Great Expectations, you will also be required to provide a value for an asset_name parameter.
Add additional files as Data Assets (Optional)
Your Data Source can contain multiple Data Assets. If you have additional files to connect to, you can provide different name and batching_regex parameters to create additional Data Assets for those files in your Data Source. You can even include the same files in multiple Data Assets, if a given file matches the batching_regex of more than one Data Asset.
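For instance, a second Data Asset in the same Data Source could select only the 2021 files; the asset name and regex below are hypothetical examples:

datasource.add_csv_asset(
    name="taxi_data_2021_only",
    batching_regex=r"yellow_tripdata_sample_2021-(?P<month>\d{2})\.csv",
)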
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
Related documentation
For more information on Pandas read_* methods, see the Pandas Input/output documentation.
Use Spark to connect to data stored in files on a filesystem. The following examples connect to .csv data.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with Data Source dependencies.
- A Data Context.
- Access to filesystem Data Assets
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Filesystem Data Source:
- name: The Data Source name.
- base_directory: The path to the folder containing the files the Data Source connects to.
1. Run the following Python code to define name and base_directory and store the information in the Python variables datasource_name and path_to_folder_containing_csv_files:

datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"

Using relative paths as the base_directory of a Filesystem Data Source: if you are using a Filesystem Data Context, you can provide a path for base_directory that is relative to the folder containing your Data Context. However, an in-memory Ephemeral Data Context doesn't exist in a folder, so relative paths are instead resolved relative to the folder your Python code is being executed in.
2. Run the following Python code to pass name and base_directory as parameters when you create your Data Source:

datasource = context.sources.add_spark_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)

What if my Data Assets are located in different folders? You can access files that are nested in folders under your Data Source's base_directory. If your Data Assets are located in multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Data Source, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
Add a Data Asset to the Data Source
A Data Asset requires the following information to be defined:
- name: The Data Asset name. Helpful when you define multiple Data Assets in the same Data Source.
- batching_regex: A regular expression that matches the files to be included in the Data Asset.
What if batching_regex matches multiple files? Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example:
Let's say that you have a filesystem Data Source pointing to a base folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2019-03.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2}).csv", your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names "year" and "month". When you send a Batch Request query featuring this Data Asset in the future, you can use these group names with their respective values as options to control which Batches will be returned.
For example, you could return all Batches in the year of 2021, or the one Batch for July of 2020.
1. Run the following Python code to define name and batching_regex and store the information in the Python variables asset_name and batching_regex:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

In addition, the argument header informs the Spark DataFrame reader that the files contain a header row, while the argument infer_schema instructs the Spark DataFrame reader to make a best effort to determine the schema of the columns automatically.
2. Run the following Python code to pass name and batching_regex and the optional header and infer_schema arguments as parameters when you create your Data Asset:

datasource.add_csv_asset(
name=asset_name, batching_regex=batching_regex, header=True, infer_schema=True
)
Add additional files as Data Assets (Optional)
Your Data Source can contain multiple Data Assets. If you have additional files to connect to, you can provide different name and batching_regex parameters to create additional Data Assets for those files in your Data Source. You can even include the same files in multiple Data Assets, if a given file matches the batching_regex of more than one Data Asset.
Next steps
Related documentation
For more information about storing credentials for use with GX, see How to configure credentials.