Connect to filesystem Data Assets
Use the information provided here to connect to Data Assets stored on Amazon S3, Google Cloud Storage (GCS), Microsoft Azure Blob Storage, or local filesystems. Great Expectations (GX) uses the term Data Asset when referring to data in its original format, and the term Data Source when referring to the storage location for Data Assets.
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- Filesystem
Amazon S3 Data Source
Connect to an Amazon S3 Data Source.
- pandas
- Spark
The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- An installation of GX set up to work with S3
- Access to data on an S3 bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create an Amazon S3 Data Source:
- name: The Data Source name. In the following examples, this is "my_s3_datasource".
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
1. Run the following Python code to define name, bucket_name, and boto3_options:

datasource_name = "my_s3_datasource"
bucket_name = "my_bucket"
boto3_options = {}

Additional options for boto3_options: the boto3_options parameter allows you to pass the following information:

- endpoint_url: Specifies an S3 endpoint. You can use an environment variable such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the corresponding environment variable.
- region_name: Your AWS region name.
2. Run the following Python code to pass name, bucket_name, and boto3_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_s3(
name=datasource_name, bucket=bucket_name, boto3_options=boto3_options
)
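If you need the non-default boto3_options described above, the call might look like the following. This is a hedged sketch: the endpoint and region values are hypothetical examples, and "${S3_ENDPOINT}" is resolved from the environment variable of the same name.

boto3_options = {
    "endpoint_url": "${S3_ENDPOINT}",  # substituted with the S3_ENDPOINT environment variable
    "region_name": "us-east-1",  # hypothetical region
}
datasource = context.sources.add_pandas_s3(
    name="my_s3_datasource_with_options",  # hypothetical name, to avoid clashing with the example above
    bucket=bucket_name,
    boto3_options=boto3_options,
)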
Add data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name, batching_regex=batching_regex, s3_prefix=s3_prefix
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your S3 bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
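As a minimal sketch of how those keys are used (assuming the data_asset created above), pass them as options when you build a Batch Request:

# Request only the Batch whose file name matched year "2021" and month "12".
batch_request = data_asset.build_batch_request(options={"year": "2021", "month": "12"})
batches = data_asset.get_batch_list_from_batch_request(batch_request)
print(len(batches))  # 1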
Next steps
The following examples connect to .csv data.
Prerequisites
- An installation of GX set up to work with S3
- Access to data on an S3 bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create an Amazon S3 Data Source:
- name: The Data Source name. In the following examples, this is "my_s3_datasource".
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
1. Run the following Python code to define name, bucket_name, and boto3_options:

datasource_name = "my_s3_datasource"
bucket_name = "my_bucket"
boto3_options = {}

Additional options for boto3_options: the boto3_options parameter allows you to pass the following information:

- endpoint_url: Specifies an S3 endpoint. You can use an environment variable such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the corresponding environment variable.
- region_name: Your AWS region name.
2. Run the following Python code to pass name, bucket_name, and boto3_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_s3(
name=datasource_name, bucket=bucket_name, boto3_options=boto3_options
)
Add data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
s3_prefix=s3_prefix,
header=True,
infer_schema=True,
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your S3 bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
Next steps
Microsoft Azure Blob Storage
Connect to a Microsoft Azure Blob Storage Data Source.
- pandas
- Spark
Use Pandas to connect to data stored in Microsoft Azure Blob Storage. The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- GX installed and set up to work with Azure Blob Storage
- Access to data in Azure Blob Storage
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: The Data Source name. In the following examples, this is "my_datasource".
- azure_options: Authentication settings.
1. Run the following Python code to define name and azure_options:

datasource_name = "my_datasource"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
2. Run the following Python code to pass name and azure_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_abs(
    name=datasource_name, azure_options=azure_options
)

Where did that account URL come from? In the previous example, the value for account_url is substituted with the contents of the AZURE_STORAGE_ACCOUNT_URL environment variable you configured when you installed GX and set up your Azure Blob Storage dependencies.
Add data to the Data Source as a Data Asset
To specify the data to connect to, provide the following elements:
- name: A name by which you can reference the Data Asset in the future.
- batching_regex: A regular expression that indicates which files to treat as Batches in your Data Asset and how to identify them.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: A string indicating what part of the batching_regex to truncate from the final Batch names.
- abs_recursive_file_discovery: Optional. A boolean (True/False) indicating whether files should be searched for recursively in subfolders.
1. Run the following Python code to define the connection values:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
abs_container = "my_container"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
2. Run the following Python code to create the Data Asset:

data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
abs_container=abs_container,
abs_name_starts_with=abs_name_starts_with,
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your Azure Blob Storage container has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
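If your files are spread across subfolders beneath abs_name_starts_with, you can enable the optional abs_recursive_file_discovery flag described in the list of connection elements above. This is a hedged sketch; the asset name is a hypothetical example:

data_asset = datasource.add_csv_asset(
    name="my_recursive_taxi_asset",  # hypothetical asset name
    batching_regex=batching_regex,
    abs_container=abs_container,
    abs_name_starts_with=abs_name_starts_with,
    abs_recursive_file_discovery=True,  # also search subfolders for matching files
)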
Next steps
Use Spark to connect to data stored in Microsoft Azure Blob Storage. The following examples connect to .csv data.
Prerequisites
- GX installed and set up to work with Azure Blob Storage
- Access to data in Azure Blob Storage
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: The Data Source name. In the following examples, this is "my_datasource".
- azure_options: Authentication settings.
1. Run the following Python code to define name and azure_options:

datasource_name = "my_datasource"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
}
2. Run the following Python code to pass name and azure_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_abs(
    name=datasource_name, azure_options=azure_options
)

Where did that account URL come from? In the previous example, the value for account_url is substituted with the contents of the AZURE_STORAGE_ACCOUNT_URL environment variable you configured when you installed GX and set up your Azure Blob Storage dependencies.
Add data to the Data Source as a Data Asset
To specify the data to connect to, provide the following elements:
- name: A name by which you can reference the Data Asset in the future.
- batching_regex: A regular expression that indicates which files to treat as Batches in your Data Asset and how to identify them.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: A string indicating what part of the batching_regex to truncate from the final Batch names.
- abs_recursive_file_discovery: Optional. A boolean (True/False) indicating whether files should be searched for recursively in subfolders.
1. Run the following Python code to define the connection values:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
abs_container = "my_container"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
2. Run the following Python code to create the Data Asset:

data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
abs_container=abs_container,
header=True,
infer_schema=True,
abs_name_starts_with=abs_name_starts_with,
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your Azure Blob Storage container has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
Next steps
GCS Data Source
Connect to a GCS Data Source.
- pandas
- Spark
The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- An installation of GX set up to work with GCS
- Access to data in a GCS bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a GCS Data Source:
- name: The Data Source name. In the following examples, this is "my_gcs_datasource".
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
1. Run the following Python code to define name, bucket_or_name, and gcs_options:

datasource_name = "my_gcs_datasource"
bucket_or_name = "my_bucket"
gcs_options = {}
2. Run the following Python code to pass name, bucket_or_name, and gcs_options as parameters when you create your Data Source:

datasource = context.sources.add_pandas_gcs(
name=datasource_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
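If your environment doesn't supply default Google Cloud credentials, gcs_options can also carry authentication details. The following is only a sketch under that assumption: the "filename" key and the path to a service account key file are illustrative, so check the Google Cloud authentication documentation for the options that apply to your setup.

gcs_options = {
    "filename": "/path/to/my/service_account_key.json",  # hypothetical path to a service account key file
}
datasource = context.sources.add_pandas_gcs(
    name="my_gcs_datasource_with_credentials",  # hypothetical name
    bucket_or_name=bucket_or_name,
    gcs_options=gcs_options,
)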
Add GCS data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name, batching_regex=batching_regex, gcs_prefix=gcs_prefix
)
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your GCS bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
Related documentation
For more information on Google Cloud and authentication, see the Google Cloud documentation on authentication.
Use Spark to connect to a GCS Data Source. The following examples connect to .csv data.
Prerequisites
- An installation of GX set up to work with GCS
- Access to data on a GCS bucket
Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a GCS Data Source:
- name: The Data Source name. In the following examples, this is "my_gcs_datasource".
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
1. Run the following Python code to define name, bucket_or_name, and gcs_options:

datasource_name = "my_gcs_datasource"
bucket_or_name = "my_bucket"
gcs_options = {}
2. Run the following Python code to pass name, bucket_or_name, and gcs_options as parameters when you create your Data Source:

datasource = context.sources.add_spark_gcs(
name=datasource_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Add GCS data to the Data Source as a Data Asset
Run the following Python code:
asset_name = "my_taxi_data_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
name=asset_name,
batching_regex=batching_regex,
gcs_prefix=gcs_prefix,
header=True,
infer_schema=True,
)
header and infer_schema: In the previous example, there are two optional parameters. If the file does not have a header line, the header parameter can be left out because it defaults to False. If you do not want GX to infer the schema of your file, you can exclude the infer_schema parameter because it also defaults to False.
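As a minimal sketch of that note, a headerless file with no schema inference would omit both optional parameters (the asset name here is a hypothetical example):

data_asset = datasource.add_csv_asset(
    name="my_headerless_taxi_asset",  # hypothetical asset name
    batching_regex=batching_regex,
    gcs_prefix=gcs_prefix,
    # header and infer_schema are omitted, so both default to False
)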
Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.
For example:
Let's say that your GCS bucket has the following files:
- "yellow_tripdata_sample_2021-11.csv"
- "yellow_tripdata_sample_2021-12.csv"
- "yellow_tripdata_sample_2023-01.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain 3 Batches, one corresponding to each matched file. You can then use the keys year and month to indicate exactly which file you want to request from the available Batches.
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
Related documentation
For more information on Google Cloud and authentication, see the Google Cloud documentation on authentication.
Filesystem Data Source
Connect to filesystem Data Assets.
- Single file with pandas
- Multiple files with pandas
- Multiple files with Spark
Use Pandas to connect to data stored in files on a filesystem. The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with Data Source dependencies.
- A Data Context.
- Access to filesystem Data Assets
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Specify a file to read into a Data Asset
Run the following Python code to read the data in individual files directly into a Validator with Pandas:
validator = context.sources.pandas_default.read_csv(
"https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)
In this example, we are connecting to a csv file. However, Great Expectations supports connecting to most types of files that Pandas has read_* methods for.
Because you will be using Pandas to connect to these files, the specific add_*_asset methods that will be available to you will be determined by your currently installed version of Pandas.
For more information on which Pandas read_* methods are available to you as add_*_asset methods, please reference the official Pandas Input/Output documentation for the version of Pandas that you have installed.
In the GX Python API, add_*_asset methods will require the same parameters as the corresponding Pandas read_* method, with one caveat: In Great Expectations, you will also be required to provide a value for an asset_name parameter.
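For example, reading a JSON file instead of a CSV follows the same pattern, provided your installed Pandas version supports read_json. This is a hedged sketch, and the path is a hypothetical example:

validator = context.sources.pandas_default.read_json(
    "path/to/my_file.json"  # hypothetical path; any file type with a Pandas read_* method works the same way
)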
Create a Data Source (Optional)
Modify the following code to connect to your Data Source (a Data Source provides a standard API for accessing and interacting with data from a wide variety of source systems). If you don't have data available for testing, you can use the NYC taxi data. The NYC taxi data is open source, and it is updated every month. An individual record in the data corresponds to one taxi trip.
Do not include sensitive information such as credentials in the configuration when you connect to your Data Source. This information appears as plain text in the database. If you must include credentials or a full connection string, GX recommends using a config variables file.
# Give your Datasource a name
datasource_name = None
datasource = context.sources.add_pandas(datasource_name)
# Give your first Asset a name
asset_name = None
path_to_data = None
# to use sample data uncomment next line
# path_to_data = "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)
# Build batch request
batch_request = asset.build_batch_request()
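From here, a common next step is to get a Validator for that Batch Request so you can evaluate Expectations interactively. This is a minimal sketch; the Expectation Suite name is a hypothetical example:

context.add_or_update_expectation_suite("my_expectation_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_expectation_suite",
)
print(validator.head())  # preview the first rows of the Batch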
Next steps
Use Pandas to connect to data stored in files on a filesystem. The following examples connect to .csv data. However, GX supports most of the Pandas read methods.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with Data Source dependencies.
- A Data Context.
- Access to filesystem Data Assets
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Filesystem Data Source:
- name: The Data Source name.
- base_directory: The path to the folder containing the files the Data Source connects to.
1. Run the following Python code to define name and base_directory and store the information in the Python variables datasource_name and path_to_folder_containing_csv_files:

datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"

Using relative paths as the base_directory of a Filesystem Data Source: if you are using a Filesystem Data Context, you can provide a path for base_directory that is relative to the folder containing your Data Context. However, an in-memory Ephemeral Data Context doesn't exist in a folder, so relative paths are instead resolved relative to the folder your Python code is being executed in.
2. Run the following Python code to pass name and base_directory as parameters when you create your Data Source:

datasource = context.sources.add_pandas_filesystem(
name=datasource_name, base_directory=path_to_folder_containing_csv_files
)
You can access files that are nested in folders under your Data Source's base_directory!
If your Data Assets are located in multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Data Source, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
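For example, if your monthly files live in a subfolder beneath base_directory, the relative folder path can be included in the batching_regex. The folder name below is a hypothetical example:

# Matches files such as "2021_data/yellow_tripdata_sample_2021-07.csv" relative to base_directory.
batching_regex = r"2021_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"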
Add a Data Asset to the Data Source
A Data Asset requires the following information to be defined:
- name: The Data Asset name. Helpful when you define multiple Data Assets in the same Data Source.
- batching_regex: A regular expression that matches the files to be included in the Data Asset.
What if batching_regex matches multiple files? Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example:
Let's say that you have a filesystem Data Source pointing to a base folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2019-03.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2}).csv", your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names "year" and "month". When you send a Batch Request query featuring this Data Asset in the future, you can use these group names with their respective values as options to control which Batches will be returned.
For example, you could return all Batches in the year of 2021, or the one Batch for July of 2020.
1. Run the following Python code to define name and batching_regex and store the information in the Python variables asset_name and batching_regex:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
2. Run the following Python code to pass name and batching_regex as parameters when you create your Data Asset:

datasource.add_csv_asset(name=asset_name, batching_regex=batching_regex)

Using Pandas to connect to different file types: in this example, we are connecting to a csv file. However, Great Expectations supports connecting to most types of files that Pandas has read_* methods for. Because you will be using Pandas to connect to these files, the specific add_*_asset methods that will be available to you will be determined by your currently installed version of Pandas. For more information on which Pandas read_* methods are available to you as add_*_asset methods, please reference the official Pandas Input/Output documentation for the version of Pandas that you have installed. In the GX Python API, add_*_asset methods will require the same parameters as the corresponding Pandas read_* method, with one caveat: in Great Expectations, you will also be required to provide a value for an asset_name parameter.
Add additional files as Data Assets (Optional)
Your Data Source can contain multiple Data Assets. If you have additional files to connect to, you can provide different name and batching_regex parameters to create additional Data Assets for those files in your Data Source. You can even include the same files in multiple Data Assets, if a given file matches the batching_regex of more than one Data Asset.
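For instance, a second Data Asset in the same Data Source could select only the 2021 files; the asset name and regex below are hypothetical examples:

datasource.add_csv_asset(
    name="taxi_data_2021_only",
    batching_regex=r"yellow_tripdata_sample_2021-(?P<month>\d{2})\.csv",
)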
Next steps
- How to organize Batches in a file-based Data Asset
- How to request Data from a Data Asset
- How to create Expectations while interactively evaluating a set of data
Related documentation
For more information on Pandas read_* methods, see the Pandas Input/output documentation.
Use Spark to connect to data stored in files on a filesystem. The following examples connect to .csv data.
Prerequisites
- A Great Expectations instance. See Install Great Expectations with Data Source dependencies.
- A Data Context.
- Access to filesystem Data Assets
Import the GX module and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source
The following information is required when you create a Filesystem Data Source:
- name: The Data Source name.
- base_directory: The path to the folder containing the files the Data Source connects to.
1. Run the following Python code to define name and base_directory and store the information in the Python variables datasource_name and path_to_folder_containing_csv_files:

datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<insert_path_to_files_here>"

Using relative paths as the base_directory of a Filesystem Data Source: if you are using a Filesystem Data Context, you can provide a path for base_directory that is relative to the folder containing your Data Context. However, an in-memory Ephemeral Data Context doesn't exist in a folder, so relative paths are instead resolved relative to the folder your Python code is being executed in.
2. Run the following Python code to pass name and base_directory as parameters when you create your Data Source:

datasource = context.sources.add_spark_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)

What if my Data Assets are located in different folders? You can access files that are nested in folders under your Data Source's base_directory. If your Data Assets are located in multiple folders, you can use the folder that contains those folders as your base_directory. When you define a Data Asset for your Data Source, you can then include the folder path (relative to your base_directory) in the regular expression that indicates which files to connect to.
Add a Data Asset to the Data Source
A Data Asset requires the following information to be defined:
- name: The Data Asset name. Helpful when you define multiple Data Assets in the same Data Source.
- batching_regex: A regular expression that matches the files to be included in the Data Asset.
What if batching_regex matches multiple files? Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become an individual Batch inside your Data Asset.
For example:
Let's say that you have a filesystem Data Source pointing to a base folder that contains the following files:
- "yellow_tripdata_sample_2019-03.csv"
- "yellow_tripdata_sample_2020-07.csv"
- "yellow_tripdata_sample_2021-02.csv"
If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2019-03.csv", your Data Asset will contain only one Batch, which will correspond to that file.
However, if you define a partial file name with a regex group, such as r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2}).csv", your Data Asset can be organized ("partitioned") into Batches according to the two dimensions defined by the group names "year" and "month". When you send a Batch Request query featuring this Data Asset in the future, you can use these group names with their respective values as options to control which Batches will be returned.
For example, you could return all Batches in the year of 2021, or the one Batch for July of 2020.
1. Run the following Python code to define name and batching_regex and store the information in the Python variables asset_name and batching_regex:

asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

In addition, the argument header informs the Spark DataFrame reader that the files contain a header row, while the argument infer_schema instructs the Spark DataFrame reader to make a best effort to determine the schema of the columns automatically.
2. Run the following Python code to pass name and batching_regex and the optional header and infer_schema arguments as parameters when you create your Data Asset:

datasource.add_csv_asset(
name=asset_name, batching_regex=batching_regex, header=True, infer_schema=True
)
Add additional files as Data Assets (Optional)
Your Data Source can contain multiple Data Assets. If you have additional files to connect to, you can provide different name and batching_regex parameters to create additional Data Assets for those files in your Data Source. You can even include the same files in multiple Data Assets, if a given file matches the batching_regex of more than one Data Asset.
Next steps
Related documentation
For more information about storing credentials for use with GX, see How to configure credentials.