DataIQ Tutorial with Code Examples


Example 1: Creating a Data Catalog in DataIQ

from dataiq.sdk import DataCatalog

# Create a catalog client, register the dataset, attach metadata, then publish it.
catalog = DataCatalog()
catalog.create_dataset("customer_data", description="Customer demographics and transactions")
catalog.add_metadata("customer_data", {"source": "CRM", "updated": "2024-02-24"})
catalog.publish("customer_data")

Explanation

  1. Initialize Data Catalog – DataCatalog() creates a DataIQ catalog instance, the entry point for managing data assets, governance, and metadata tracking within the platform.

  2. Create a Dataset – catalog.create_dataset("customer_data", ...) registers a dataset in DataIQ, making it accessible for analysis, lineage tracking, and collaboration.

  3. Add Metadata – catalog.add_metadata("customer_data", ...) attaches metadata, such as the source system and the last-updated date, improving discoverability.

  4. Publish Dataset – catalog.publish("customer_data") makes the dataset available for use across the organization, subject to governance policies and quality standards.
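
The register, annotate, publish sequence above can be sketched with a small in-memory stand-in. This is only an illustration of the lifecycle, not DataIQ's implementation: the InMemoryCatalog and DatasetEntry classes and the get() helper are hypothetical names introduced here, and the method names simply mirror the calls in Example 1.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record: a description, free-form metadata, and a publish flag."""
    description: str
    metadata: dict = field(default_factory=dict)
    published: bool = False


class InMemoryCatalog:
    """Hypothetical stand-in for a data catalog; not the DataIQ implementation."""

    def __init__(self):
        self._entries = {}

    def create_dataset(self, name, description=""):
        # Registering the same name twice is treated as an error in this sketch.
        if name in self._entries:
            raise ValueError(f"dataset {name!r} is already registered")
        self._entries[name] = DatasetEntry(description=description)

    def add_metadata(self, name, metadata):
        # Merge new keys into existing metadata; an unknown dataset raises KeyError.
        self._entries[name].metadata.update(metadata)

    def publish(self, name):
        self._entries[name].published = True

    def get(self, name):
        return self._entries[name]


catalog = InMemoryCatalog()
catalog.create_dataset("customer_data", description="Customer demographics and transactions")
catalog.add_metadata("customer_data", {"source": "CRM", "updated": "2024-02-24"})
catalog.publish("customer_data")
entry = catalog.get("customer_data")
print(entry.published, entry.metadata)
```

In this sketch a dataset must be registered before metadata can be attached or the dataset published, which is the ordering the four steps imply.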


Example 2: Data Quality Check with DataIQ

from dataiq.sdk import DataQuality

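# Register a rule, run all rules for the dataset, then fetch the outcome.
dq = DataQuality()
dq.add_rule("customer_data", "email_valid", "email IS NOT NULL AND email LIKE '%@%'")
dq.run_checks("customer_data")
results = dq.get_results("customer_data")  # keep the results for review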

Explanation

  1. Initialize Data Quality – dq = DataQuality() initializes the DataIQ Data Quality module, which applies validation rules to datasets for integrity checks.

  2. Define a Rule – dq.add_rule("customer_data", "email_valid", "email IS NOT NULL AND email LIKE '%@%'") registers a rule named email_valid that requires each email to be non-null and to contain an @ character. This is a basic presence check, not full email-format validation.

  3. Run Quality Checks – dq.run_checks("customer_data") executes the defined validation rules across the dataset, identifying records that do not meet the criteria.

  4. Retrieve Check Results – dq.get_results("customer_data") fetches the validation outcomes, helping analysts review failed records and take corrective action.
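
To see what the rule expression accepts and rejects, it can be run as an ordinary SQL predicate. The sketch below uses Python's built-in sqlite3 module rather than DataIQ, and the four sample rows are made up for illustration; how DataIQ itself evaluates rule expressions is not shown in this tutorial.

```python
import sqlite3

# The rule expression from Example 2, used here as a plain SQL predicate.
RULE = "email IS NOT NULL AND email LIKE '%@%'"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_data (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customer_data (id, email) VALUES (?, ?)",
    [(1, "ana@example.com"), (2, None), (3, "no-at-sign"), (4, "@")],
)

# Rows that satisfy the rule, and rows that violate it.
passed_ids = [row[0] for row in conn.execute(
    f"SELECT id FROM customer_data WHERE {RULE} ORDER BY id"
)]
failed_ids = [row[0] for row in conn.execute(
    f"SELECT id FROM customer_data WHERE NOT ({RULE}) ORDER BY id"
)]
conn.close()

print("passed:", passed_ids)
print("failed:", failed_ids)
```

The NULL email and the value without an @ fail the rule, while the bare "@" in row 4 passes. A stricter pattern would be needed where real format validation matters.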


Example 3: Automating Data Classification in DataIQ

from dataiq.sdk import DataClassifier

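# Train on labeled data, classify a new dataset, then persist the model.
classifier = DataClassifier()
classifier.train("customer_data", labels=["personal", "transactional"])
predictions = classifier.predict("new_customer_data")  # keep the predicted labels
classifier.save_model("customer_classifier")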

Explanation

  1. Initialize Data Classifier – classifier = DataClassifier() creates a classification model in DataIQ that automatically labels data according to predefined categories.

  2. Train the Model – classifier.train("customer_data", labels=["personal", "transactional"]) trains the model on historical data so it can categorize records as “personal” or “transactional.”

  3. Apply Predictions – classifier.predict("new_customer_data") uses the trained model to classify incoming data, giving consistent labels for governance and compliance.

  4. Save the Model – classifier.save_model("customer_classifier") persists the trained model so it can be reused for ongoing classification without retraining.
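
The train, predict, save cycle can be illustrated with a deliberately simple token-count classifier in plain Python. This is a sketch under stated assumptions, not the algorithm DataIQ uses: the train, predict, and tokenize functions, the field-name training examples, and the JSON round trip standing in for save_model are all hypothetical choices made for this example.

```python
import json
from collections import Counter, defaultdict


def tokenize(text):
    """Split text such as 'order_total' into lowercase word tokens."""
    return text.lower().replace("_", " ").split()


def train(examples):
    """Count how often each token appears under each label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(tokenize(text))
    return {label: dict(counter) for label, counter in counts.items()}


def predict(model, text):
    """Return the label whose training tokens overlap most with the text."""
    tokens = tokenize(text)
    scores = {
        label: sum(counts.get(token, 0) for token in tokens)
        for label, counts in model.items()
    }
    # Ties (including all-zero scores) fall back to the first label seen in training.
    return max(scores, key=scores.get)


# Made-up training examples: field names paired with a label.
examples = [
    ("first_name last_name email", "personal"),
    ("phone_number home_address", "personal"),
    ("order_id order_total", "transactional"),
    ("payment_amount payment_date", "transactional"),
]
model = train(examples)

# Serializing and reloading stands in for saving the model: no retraining is needed.
restored = json.loads(json.dumps(model))
label_a = predict(restored, "customer email address")
label_b = predict(restored, "order payment total")
print(label_a, label_b)
```

Because the model is just a table of counts, persisting it is cheap, which is the point of step 4: the saved artifact can classify new data later without repeating the training step.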


Example 4: Enforcing Access Control in DataIQ

from dataiq.sdk import AccessControl

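# Grant one user read access, revoke another user's access, then list what remains.
ac = AccessControl()
ac.grant_permission("customer_data", "user123", "read")
ac.revoke_permission("customer_data", "user456")
permissions = ac.list_permissions("customer_data")  # keep the list for auditing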

Explanation

  1. Initialize Access Control – ac = AccessControl() initializes the security module in DataIQ, allowing fine-grained control over who can access datasets.

  2. Grant Read Permission – ac.grant_permission("customer_data", "user123", "read") lets user123 view, but not modify, the dataset.

  3. Revoke User Access – ac.revoke_permission("customer_data", "user456") removes access for user456, enforcing security policies and protecting sensitive information.

  4. List Current Permissions – ac.list_permissions("customer_data") retrieves all permissions assigned on the dataset, supporting auditing and compliance tracking.
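
The grant, revoke, list behavior described above can be modeled with a nested dictionary. The sketch below is an in-memory stand-in, not DataIQ's security module: the InMemoryAccessControl class and its is_allowed() helper are hypothetical, and treating "read" and "write" as independent permission levels is an assumption made for the example.

```python
from collections import defaultdict


class InMemoryAccessControl:
    """Hypothetical stand-in for a permission store; not the DataIQ implementation."""

    def __init__(self):
        # dataset name -> user id -> set of granted permission levels
        self._grants = defaultdict(dict)

    def grant_permission(self, dataset, user, level):
        self._grants[dataset].setdefault(user, set()).add(level)

    def revoke_permission(self, dataset, user):
        # Remove every grant the user holds; revoking an unknown user is a no-op.
        self._grants[dataset].pop(user, None)

    def list_permissions(self, dataset):
        return {user: sorted(levels) for user, levels in self._grants[dataset].items()}

    def is_allowed(self, dataset, user, level):
        return level in self._grants[dataset].get(user, set())


ac = InMemoryAccessControl()
ac.grant_permission("customer_data", "user123", "read")
ac.grant_permission("customer_data", "user456", "read")
ac.revoke_permission("customer_data", "user456")
permissions = ac.list_permissions("customer_data")
print(permissions)
print(ac.is_allowed("customer_data", "user123", "write"))
```

Here a read grant does not imply write access, matching the view-but-not-modify behavior in step 2, and a revoked user no longer appears in the listing.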


These four examples demonstrate core DataIQ features: cataloging, quality checks, classification, and access control.