
Commit 99bcaf1

Release 5.4 - Hello Meta, SFMC, Sustainability, and Credly Badger! 🐈
So long @Lsubatin and thank you for everything!
1 parent 9736bda commit 99bcaf1


60 files changed: +3869 −2168 lines changed

LICENSE

+1 −1

@@ -187,7 +187,7 @@
  same "printed page" as the copyright notice for easier
  identification within third-party archives.

- Copyright [yyyy] [name of copyright owner]
+ Copyright 2024 Google

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.

README.md

+42 −557
Large diffs are not rendered by default.

README_Marketing.md

+648
Large diffs are not rendered by default.

README_SAP.md

+74

# Integration options for SAP ECC or SAP S/4HANA

## Deployment Configuration for SAP

| Parameter | Meaning | Default Value | Description |
| ------------------------ | ----------------------- | -------------- | ------------------------------------------------------------------------ |
| `SAP.deployCDC` | Deploy CDC | `true` | Generate CDC processing scripts to run as DAGs in Cloud Composer. |
| `SAP.datasets.raw` | Raw landing dataset | - | Used by the CDC process, this is where the replication tool lands the data from SAP. If using test data, create an empty dataset. |
| `SAP.datasets.cdc` | CDC Processed Dataset | - | Dataset that works as a source for the reporting views and a target for the CDC processing DAGs. If using test data, create an empty dataset. |
| `SAP.datasets.reporting` | Reporting Dataset SAP | `"REPORTING"` | Name of the dataset that is accessible to end users for reporting, where views and user-facing tables are deployed. |
| `SAP.datasets.ml` | ML dataset | `"ML_MODELS"` | Name of the dataset that stages results of Machine Learning algorithms or BQML models. |
| `SAP.SQLFlavor` | SQL flavor for source system | `"ecc"` | `s4` or `ecc`. For test data, keep the default value (`ecc`). For Demand Sensing, only `ecc` test data is provided at this time. |
| `SAP.mandt` | Mandant or Client | `"100"` | Default mandant or client for SAP. For test data, keep the default value (`100`). For Demand Sensing, use `900`. |

Note: While there is no minimum required version of SAP, the ECC models have been developed on the earliest currently supported version of SAP ECC. Differences in fields between systems are expected, regardless of the version.

## Loading SAP data into BigQuery

### **Prerequisites for SAP replication**

- Cortex Data Foundation expects SAP tables to be replicated with the same field names and types as they are created in SAP.
- As long as the tables are replicated with the same format, field names, and granularity as in the source, there is no requirement to use a specific replication tool.
- Table names need to be created in BigQuery in lowercase.
- The list of tables used by the SAP models is available and configurable in the CDC [setting.yaml](https://github.com/GoogleCloudPlatform/cortex-dag-generator/blob/main/setting.yaml). If a table is not present during deployment, the models depending on it will fail; other models will deploy successfully.
- If in doubt about a conversion option, we recommend following the [default table mapping](https://cloud.google.com/solutions/sap/docs/bq-connector/latest/planning#default_data_type_mapping).
- **`DD03L` for SAP metadata**: If you are not planning on deploying test data, and you are planning on generating CDC DAG scripts during deployment, make sure table `DD03L` is replicated from SAP in the source project. This table contains metadata about tables, such as the list of keys, and is needed for the CDC generator and dependency resolver to work. It also allows you to add tables not currently covered by the model, such as custom or Z tables, so that CDC scripts are generated for them.

> **Note**: **What happens if I have minor differences in a table name?** Because SAP systems may have minor variations due to versions, or due to add-on and append structures in tables, or because some replication tools may handle special characters slightly differently, some views may fail because they cannot find a field. We recommend executing the deployment with `turboMode : false` to spot most failures in one go. Examples of this are:
> - Fields starting with `_` (e.g., `_DATAAGING`) have their `_` removed
> - Fields cannot start with `/` in BigQuery
>
> In this case, you can adapt the failing view to select the field as it is landed by your replication tool of choice.

## **Change Data Capture (CDC) processing**

There are two main ways for replication tools to load records from SAP:
- Append-always: Insert every change in a record with a timestamp and an operation flag (Insert, Update, Delete), so the last version can be identified.
- Update when landing (merge or upsert): This creates an updated version of a record on landing in the CDC processed dataset. It performs the CDC operation in BigQuery.

![CDC options for SAP](images/cdc_options.png)

Cortex Data Foundation supports both modes (append-always or update when landing). For append-always, we provide CDC processing templates.

> **Note**: Some functionality will need to be commented out for update-on-landing mode. For example, [OneTouchOrder.sql](https://github.com/GoogleCloudPlatform/cortex-reporting/blob/main/OneTouchOrder.sql) and all its dependent queries. The functionality can be replaced with tables like CDPOS.

### Configure CDC templates for tools replicating in append-always mode

#### **Configure CDC for SAP**

> **Note**: **We strongly recommend configuring this file according to your needs.** Some default frequencies may result in unnecessary cost if the business does not require that level of data freshness.

If using a tool that runs in append-always mode, Cortex Data Foundation provides CDC templates to automate the updates and create a _latest version of the truth_ or digital twin in the CDC processed dataset.

You can use the configuration in the file [`setting.yaml`](https://github.com/GoogleCloudPlatform/cortex-dag-generator/blob/main/setting.yaml) if you need to generate change-data-capture processing scripts. See the [Appendix - Setting up CDC Processing](./README.md#setting-up-cdc-processing) for options. For test data, you can leave this file as a default.
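
As an illustration, a minimal sketch of entries in that file, assuming the `data_to_replicate` list layout used by the generator (the table names and frequencies here are examples only):

```yaml
data_to_replicate:
  - base_table: vbak          # example: refresh the sales header table hourly
    load_frequency: "@hourly"
  - base_table: dd03l         # example: SAP metadata, daily is usually enough
    load_frequency: "@daily"
```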

Make any changes to the [DAG templates](https://github.com/GoogleCloudPlatform/cortex-dag-generator/blob/main/src/template_dag/dag_sql.py) as required by your instance of Airflow or Cloud Composer. You will find more information in the [Appendix - Gathering Cloud Composer settings](./README.md#gathering-cloud-composer-settings).

This module is optional. If you want to add or process tables individually after deployment, you can modify the `setting.yaml` file to process only the tables you need and re-execute the specific module by calling `src/SAP_CDC/cloudbuild.cdc.yaml` directly.
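
For example, assuming you run it from the root of your checkout with the Cloud Build API enabled, re-executing the module could look like the following sketch (check the build file in your version for the substitutions it expects):

```bash
# Illustrative only: re-run just the SAP CDC module after editing setting.yaml.
gcloud builds submit --config=src/SAP_CDC/cloudbuild.cdc.yaml .
```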

#### Performance optimization for CDC Tables

For certain CDC datasets, you may want to take advantage of BigQuery [table partitioning](https://cloud.google.com/bigquery/docs/partitioned-tables), [table clustering](https://cloud.google.com/bigquery/docs/clustered-tables), or both. This choice depends on many factors: the size and data of the table, the columns available in the table, and your need for real-time data with views versus data materialized as tables. By default, CDC settings do not apply table partitioning or table clustering; the choice is yours to configure based on what works best for you.

To create tables with partitions and/or clusters, update the CDC `setting.yaml` file with the relevant configurations. See the Appendix section [Table Partition and Cluster Settings](./README.md#table-partition-and-cluster-settings) for details on how to configure this.
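
For illustration, and assuming the `partition_details` and `cluster_details` keys described in that appendix, a daily-loaded table could be declared partitioned and clustered along these lines (the column names are examples only):

```yaml
data_to_replicate:
  - base_table: vbap
    load_frequency: "@daily"
    partition_details:
      column: "erdat"          # example: time-partition on the creation date
      partition_type: "time"
      time_grain: "day"        # aligns with the @daily load frequency
    cluster_details:
      columns: ["vbeln"]       # example: cluster on the sales document number
```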

> **NOTE**:
> 1. This feature only applies when a dataset in `setting.yaml` is configured for replication as a table (e.g., `load_frequency = "@daily"`) and not defined as a view (`load_frequency = "RUNTIME"`).
> 2. A table can be both partitioned and clustered.

> **Important ⚠️**: If you are using a replication tool that allows partitions in the raw dataset, like the BigQuery Connector for SAP, we recommend [setting time-based partitions](https://cloud.google.com/solutions/sap/docs/bq-connector/latest/planning#table_partitioning) in the raw tables. The type of partition will work better if it matches the frequency of the CDC DAGs in the `setting.yaml` configuration.

You can read more about partitioning and clustering for SAP [here](https://cloud.google.com/blog/products/sap-google-cloud/design-considerations-for-sap-data-modeling-in-bigquery).

README_SFDC.md

+74

# Integration options for Salesforce

## Deployment Configuration for Salesforce

| Parameter | Meaning | Default Value | Description |
| ------------------ | ------------- | ---------------------- | ------------------|
| `SFDC.deployCDC` | Deploy CDC | `true` | Generate CDC processing scripts to run as DAGs in Cloud Composer. See the documentation for different ingestion options for Salesforce. |
| `SFDC.createMappingViews` | Create mapping views | `true` | The provided DAGs to fetch new records from the Salesforce APIs update records on landing. When this value is set to **true**, the deployment generates views in the CDC processed dataset that surface tables with the "latest version of the truth" from the Raw dataset. If **false** and `SFDC.deployCDC` is `true`, DAGs are generated with change data capture processing based on `SystemModstamp`. See details on [CDC processing for Salesforce](./README_SFDC.md#configure-api-integration-and-cdc-for-salesforce). |
| `SFDC.createPlaceholders` | Create Placeholders | `true` | Create empty placeholder tables in case they are not generated by the ingestion process, so the downstream reporting deployment executes without failure. |
| `SFDC.datasets.raw` | Raw landing dataset | - | Used by the CDC process, this is where the replication tool lands the data from SFDC. If using test data, create an empty dataset. |
| `SFDC.datasets.cdc` | CDC Processed Dataset | - | Dataset that works as a source for the reporting views and a target for the CDC processing DAGs. If using test data, create an empty dataset. |
| `SFDC.datasets.reporting` | Reporting Dataset SFDC | `"REPORTING_SFDC"` | Name of the dataset that is accessible to end users for reporting, where views and user-facing tables are deployed. |

## Loading Salesforce data into BigQuery

We provide a replication solution based on Python scripts scheduled in [Apache Airflow](https://airflow.apache.org/) and [Salesforce Bulk API 2.0](https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/bulk_api_2_0.htm). These Python scripts can be adapted and scheduled in your tool of choice.

There are three sets of processing options for data integration:
- API call and load into Raw datasets, updating existing records if needed
- Source-to-target structure mapping views
- CDC processing scripts

If you have datasets already loaded through a different tool in append-always mode, the CDC processing scripts contain mapping files to map the schema of the tables as generated by your tool into the names and data types of the structure required by the reporting views in Cortex Data Foundation. You can also add custom fields in the schema definition so they are incorporated in the CDC processing.

> **Note**: For the CDC scripts to work, the **Id** for each API (e.g., `Account Id`) and the [**SystemModstamp**](https://developer.salesforce.com/docs/atlas.en-us.object_reference.meta/object_reference/system_fields.htm) need to be present in the source table. These fields should either have their original names (`Id`, `SystemModstamp`) or be mapped respectively to `{object_name}Id` and `SystemModstamp`.
>
> For example, the source table with data for the Account object should have the original `Id` and `SystemModstamp` fields. If these fields have different names, then the `src/SFDC/src/table_schema/accounts.csv` file must be updated with the Id field's name mapped to `AccountId` and the system modification timestamp field mapped to `SystemModstamp`.
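
As a sketch of what such a mapping file can contain, assuming a `SourceField,TargetField,TargetType` column layout (check the mapping files in the repository for the exact header) and hypothetical source field names landed by another tool:

```csv
SourceField,TargetField,TargetType
sfdc_account_id,AccountId,String
last_modified_ts,SystemModstamp,Timestamp
```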

If you already have the replication and CDC working for the Salesforce APIs and only need the mapping, you can edit the [mapping files](https://github.com/GoogleCloudPlatform/cortex-salesforce/tree/main/src/table_schema) to generate views that translate the structure generated by the integration tool to the structure expected by Cortex Data Foundation reporting views.

## Salesforce data requirements

* The structure of the source tables follows *snake_case* naming in plural, e.g., `some_objects`. The columns have the same data types as how Salesforce represents them internally. Some fields have been renamed for better readability in the reporting layer.
* Any required tables that do not exist within the raw dataset will be created as empty tables during deployment. This ensures the CDC deployment step runs correctly.
* For the CDC scripts to work, the **Id** for each API (e.g., `Account Id`) and the [**SystemModstamp**](https://developer.salesforce.com/docs/atlas.en-us.object_reference.meta/object_reference/system_fields.htm) need to be present in the source table. The provided Raw processing scripts fetch these fields automatically from the APIs and update the target replication table.
* The provided Raw processing scripts do not require additional change data capture processing. This behavior is set during deployment by default.

### **Source tables for Currency Conversion in Salesforce**

The currency conversion functionality of Salesforce relies on the existence of the objects `CurrencyTypes` and `DatedConversionRates` within the source Salesforce system, which are available only if [Advanced Currency Management](https://help.salesforce.com/s/articleView?id=sf.administration_about_advanced_currency_management.htm) is enabled. If it is not, you may want to remove the relevant entries from `src/SFDC/config/ingestion_settings.yaml` to avoid running into errors during the Salesforce to Raw extraction.

If these tables are not available, we will automatically create empty placeholder tables for them during deployment to avoid breaking the Reporting logic.

## Configure Salesforce integration with Cortex-provided ingestion templates

### Configure API integration and CDC for Salesforce

Following a principle of openness, customers are free to use the provided replication scripts for Salesforce or a data replication tool of their choice, as long as the data meets the same structure and level of aggregation as provided by the Salesforce APIs. If you are using another tool for replication, that tool can either append updates as new records (the _[append always](https://cloud.google.com/bigquery/docs/migration/database-replication-to-bigquery-using-change-data-capture#overview_of_cdc_data_replication)_ pattern) or update existing records with changes when landing the data in BigQuery. If the tool does not update the records and instead replicates any changes as new records into a target (Raw) table, Cortex Data Foundation provides the option to create change-data-capture processing scripts.

To ensure that table names, field names, and data types are consistent with the structures expected by Cortex regardless of the replication tool, you can modify the mapping configuration to map your replication tool or existing schemata. This will generate mapping views compatible with the structure expected by Cortex Data Foundation.

![Three options depending on replication tool](images/dataflows.png)

You can use the configuration in [`setting.yaml`](https://github.com/GoogleCloudPlatform/cortex-salesforce/blob/main/config/setting.yaml) to configure the generation of scripts that call the Salesforce APIs and replicate the data into the Raw dataset (section `salesforce_to_raw_tables`) and the generation of scripts that process changes landing in the Raw dataset into the CDC processed dataset (section `raw_to_cdc_tables`).
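
As a sketch, with per-table entry keys modeled loosely on the repository's `setting.yaml` (the object and key names here are illustrative), the two sections could look like this:

```yaml
salesforce_to_raw_tables:      # scripts that call the Salesforce APIs -> Raw
  - base_table: accounts       # illustrative object entry
    load_frequency: "@daily"

raw_to_cdc_tables:             # scripts that process Raw -> CDC processed
  - base_table: accounts
    load_frequency: "@daily"
```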

By default, the scripts provided to read from the APIs update changes into the Raw dataset, so CDC processing scripts are not required, and mapping views to align the source schema to the expected schema are created instead.

The generation of CDC processing scripts is not executed if `SFDC.createMappingViews` in the [config.json](https://github.com/GoogleCloudPlatform/cortex-data-foundation/blob/main/config/config.json#L29) file remains `true` (default behavior). If CDC scripts are required, set `SFDC.createMappingViews` to `false`. This second step also allows for mapping from the source schemata to the schemata required by Cortex Data Foundation.

The following example of a `setting.yaml` configuration file illustrates the generation of mapping views when a replication tool updates the data directly into the replicated dataset, as illustrated in `option 3` (i.e., no CDC is required, only re-mapping of tables and field names). Since no CDC is required, this option executes as long as the parameter `SFDC.createMappingViews` in the config.json file remains `true`.

![settings.yaml example](images/settingyaml.png)

In this example, removing the configuration for a base table, or for all of them, from a section will skip the generation of DAGs for that base table or for the entire section, as illustrated for `salesforce_to_raw_tables`. For this scenario, setting the parameter `deployCDC : False` has the same effect, as no CDC processing scripts need to be generated.

The following example illustrates the mapping of the field `unicornId`, as landed by a replication tool, to the name and type expected by Cortex Data Foundation, `AccountId` as a `String`.

![Only remap](images/remap.png)
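
Expressed in the same illustrative mapping-file layout sketched earlier, that remap would be a single row:

```csv
SourceField,TargetField,TargetType
unicornId,AccountId,String
```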

Make any changes to the [DAG templates for CDC](https://github.com/GoogleCloudPlatform/cortex-salesforce/tree/main/src/cdc_dag_generator/templates) or for [Raw](https://github.com/GoogleCloudPlatform/cortex-salesforce/tree/main/src/raw_dag_generator/templates) as required by your instance of Airflow or Cloud Composer. You will find more information in the [Appendix - Gathering Cloud Composer settings](./README.md#gathering-cloud-composer-settings).

If you do not need any DAGs for Raw data generation from API calls or for CDC processing, set the [parameter](#deployment-configuration-for-salesforce) `deployCDC` to `false`. Alternatively, you can remove the contents of the sections in [`setting.yaml`](https://github.com/GoogleCloudPlatform/cortex-salesforce/blob/main/config/setting.yaml). If your data structures are known to be consistent with those expected by Cortex Data Foundation, you can skip the generation of mapping views by setting the [parameter](#deployment-configuration-for-salesforce) `SFDC.createMappingViews` to `false`.
