Semi-Structured Data

Semi-structured data contains organizational tags or metadata but lacks a rigid schema. Learn how CDPs ingest and normalize semi-structured formats.

CDP.com Staff

Semi-structured data is data that contains some organizational properties, such as tags or metadata, but does not conform to a rigid schema the way structured data in relational databases does. It sits between structured data and unstructured data: metadata tags give context about what each data point means, but, like unstructured data, it is not collected according to a particular data model or schema.

Semi-structured data is the fastest-growing category of enterprise data because most modern digital systems — web analytics, mobile SDKs, marketing platforms, IoT devices — emit data in formats like JSON, XML, Avro, and Parquet rather than flat relational tables. For organizations building a customer data infrastructure, the ability to ingest and normalize semi-structured data determines how quickly new sources can be connected and how completely customer profiles can be built.

Semi-Structured Data vs. Unstructured Data

An image file on its own, for example, may be considered unstructured data. Adding ALT text or tags that describe what the image shows turns it into semi-structured data, because the file now carries machine-readable context about its content.

The key distinction is that semi-structured data carries embedded metadata that describes its own structure — field names, nesting, and data types are encoded within the data itself. Unstructured data (raw images, audio, free-text documents) has no self-describing structure whatsoever. This self-describing quality is what makes semi-structured data machine-parseable without requiring a predefined data model, while unstructured data requires specialized processing like natural language processing or computer vision.
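The self-describing quality can be illustrated with a short sketch. It assumes a made-up JSON event and a hypothetical helper named describe; no schema is supplied, yet the field names, nesting, and value types can all be recovered from the record itself.

```python
import json


def describe(value, path=""):
    """Yield (path, type name) pairs found by walking a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from describe(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        # List items share one path, marked with [] to show the nesting.
        for child in value:
            yield from describe(child, f"{path}[]")
    else:
        yield path, type(value).__name__


# Illustrative record; the field names are assumptions, not a real CDP payload.
raw = '{"event": "page_view", "user": {"id": 42, "tags": ["new", "mobile"]}, "ts": 1700000000.5}'

structure = sorted(set(describe(json.loads(raw))))
for path, type_name in structure:
    print(path, type_name)
```

The same walk over a raw image or audio file would find nothing to report, which is why unstructured data needs specialized processing instead.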

Semi-structured data is also one of the fastest-growing areas of data. One driver is the increase in metadata tagging across documents, images, and video, which helps classify and categorize content for search engine optimization and organization. As organizations build out their data pipelines, handling semi-structured formats efficiently becomes critical to downstream analytics.

Common Types of Semi-Structured Data

Different types of semi-structured data include:

  • JSON (JavaScript Object Notation) — The dominant format for web and mobile event tracking, API responses, and webhook payloads. Most CDP event streams arrive as nested JSON objects
  • XML (Extensible Markup Language) — Common in legacy enterprise systems, SOAP APIs, and data feeds from financial and healthcare platforms
  • Emails — Unstructured body text combined with structured metadata like subject lines, timestamps, sender addresses, and headers
  • Log files — Server logs, application logs, and clickstream data with consistent field patterns but no formal schema
  • Metadata-enriched media — Images, videos, and documents with embedded EXIF data, tags, or annotations
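The email entry in the list above is a convenient way to see structured metadata and unstructured content side by side. The following is a minimal sketch using Python's standard email package; the message itself is invented for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Invented sample message: headers first, a blank line, then free-text body.
raw_email = (
    "From: ana@example.com\n"
    "To: support@example.com\n"
    "Subject: Order question\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "\n"
    "Hi, I never received my order. Can you help?\n"
)

message = message_from_string(raw_email)

# Structured part: named header fields that can be read directly.
headers = {name: message[name] for name in ("From", "To", "Subject", "Date")}
sent_at = parsedate_to_datetime(headers["Date"])

# Unstructured part: the body is plain text with no internal field names.
body = message.get_payload()

print(headers["Subject"], sent_at.isoformat())
print(body.strip())
```

The headers can be filtered and sorted like database columns, while understanding the body would still require text analysis.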

How Customer Data Platforms Manage Semi-Structured Data

Most CDP data sources produce semi-structured data. Clickstream events arrive as nested JSON with variable properties depending on the page or action. CRM webhook payloads include structured contact fields alongside free-form custom attributes. Marketing platform exports deliver XML feeds with campaign metadata nested inside irregular hierarchies. A CDP must ingest all of these formats without requiring source systems to restructure their output.

CDPs handle this through schema-on-read ingestion — accepting raw semi-structured data as-is, then applying data modeling and normalization rules during processing. This approach lets teams onboard new data sources in hours rather than weeks, because there is no upfront schema design or migration required. The CDP’s ingestion layer parses nested structures, flattens hierarchies, and maps fields to a canonical profile schema that feeds identity resolution and customer segmentation.
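As a rough sketch of the schema-on-read idea, the example below keeps the raw event untouched and applies a mapping only when the record is read. The function names, the FIELD_MAP contents, and the event fields are all hypothetical; a real CDP's ingestion layer will differ.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one dict with dotted keys (lists are kept as values)."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical mapping from source paths to canonical profile fields.
FIELD_MAP = {
    "user.contact.email": "email",
    "context.device.cookie_id": "cookie_id",
    "properties.page": "last_page",
}


def read_with_schema(raw_json, field_map=FIELD_MAP):
    """Apply the mapping at read time; unmapped fields are preserved under 'extras'."""
    flat = flatten(json.loads(raw_json))
    profile = {target: flat[source] for source, target in field_map.items() if source in flat}
    profile["extras"] = {key: value for key, value in flat.items() if key not in field_map}
    return profile


# The raw event is stored as received; nothing about it is changed up front.
raw_event = json.dumps({
    "user": {"contact": {"email": "ana@example.com"}},
    "context": {"device": {"cookie_id": "c-123"}},
    "properties": {"page": "/pricing", "referrer": "search"},
})

profile = read_with_schema(raw_event)
print(profile)
```

Because the mapping lives outside the stored data, supporting a new source in this sketch means adding entries to the mapping rather than migrating a table.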

Effective semi-structured data handling is particularly critical for identity resolution. A single customer interaction may arrive as a JSON event containing an email address nested three levels deep, a cookie ID in a different field, and a transaction amount in yet another. The CDP must reliably extract identity keys from these variable structures to match events to the correct unified profile. Without robust semi-structured parsing, identity graphs become incomplete, and downstream personalization and data activation suffer.
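One way to picture the extraction step is a recursive search that looks for known identity keys at any depth. This is only a sketch: the IDENTITY_KEYS set, the function name, and the event layout are assumptions chosen to mirror the scenario described above.

```python
# Hypothetical set of field names treated as identity keys.
IDENTITY_KEYS = {"email", "cookie_id", "customer_id"}


def find_identity_keys(value, found=None):
    """Walk dicts and lists, collecting the first value seen for each identity key."""
    if found is None:
        found = {}
    if isinstance(value, dict):
        for key, child in value.items():
            if key in IDENTITY_KEYS and isinstance(child, (str, int)):
                found.setdefault(key, child)
            else:
                find_identity_keys(child, found)
    elif isinstance(value, list):
        for child in value:
            find_identity_keys(child, found)
    return found


# Email nested three levels deep, cookie ID at the top level, amount elsewhere.
event = {
    "type": "purchase",
    "payload": {"customer": {"contact": {"email": "ana@example.com"}}},
    "cookie_id": "c-123",
    "order": {"amount": 59.90},
}

identities = find_identity_keys(event)
print(identities)
```

Searching by key name rather than by fixed path lets the same function handle events whose nesting differs from source to source, at the cost of possible false matches when unrelated fields share a name.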

Data collection needs to be standardized for data integration to succeed. Whether the destination is a data warehouse or a unified customer profile, the source data is often fragmented across disparate silos. The right technology can gather that data and combine it in a standardized way, supported by strong data governance frameworks.

FAQ

What is the difference between semi-structured and structured data?

Structured data conforms to a predefined schema with fixed fields and data types; semi-structured data carries its own metadata but has no rigid schema. Structured data lives in relational database tables with rows and columns. Semi-structured data uses formats like JSON or XML where fields are self-describing and can vary between records, making it more flexible but harder to query with standard SQL.
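The difference can be shown with a few invented JSON records whose fields vary from one record to the next. In this sketch, fitting them into fixed rows and columns means first discovering the union of field names and then filling the gaps with empty values.

```python
import json

# Invented records: each one is self-describing, but the fields differ.
lines = [
    '{"id": 1, "email": "ana@example.com"}',
    '{"id": 2, "plan": "pro", "email": "ben@example.com"}',
    '{"id": 3}',
]

records = [json.loads(line) for line in lines]

# A table needs one fixed column list, so take the union of all field names.
columns = sorted({key for record in records for key in record})

# Each row must supply every column; missing fields become None.
rows = [tuple(record.get(column) for column in columns) for record in records]

print(columns)
for row in rows:
    print(row)
```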

Why is semi-structured data important for CDPs?

Most real-time customer interactions — clickstreams, mobile events, API webhooks — produce semi-structured data that CDPs must ingest to build complete profiles. If a CDP cannot parse nested JSON payloads or variable XML feeds, it misses critical identity signals and behavioral context. Schema-on-read ingestion lets CDPs accept these formats immediately without requiring source systems to restructure their output.

How do CDPs transform semi-structured data into unified profiles?

CDPs parse semi-structured formats, extract identity keys and attributes, then map them to a canonical schema through normalization rules. The ingestion layer flattens nested structures, resolves field name variations across sources, and feeds standardized records into the identity resolution engine. This transforms raw JSON events and XML feeds into the structured, queryable profiles that power segmentation and activation.
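Resolving field name variations can be sketched as an alias lookup. The alias table, the normalize function, and the two sample records below are hypothetical and stand in for whatever normalization rules a given CDP applies.

```python
# Hypothetical alias table: canonical field name -> names seen across sources.
ALIASES = {
    "email": {"email", "email_address", "e-mail", "useremail"},
    "first_name": {"first_name", "firstname", "fname", "given_name"},
}

# Invert the table so each alias points at its canonical name.
LOOKUP = {alias: canonical for canonical, aliases in ALIASES.items() for alias in aliases}


def normalize(record):
    """Map known aliases to canonical names; unknown or empty fields are dropped."""
    normalized = {}
    for key, value in record.items():
        canonical = LOOKUP.get(key.strip().lower().replace(" ", "_"))
        if canonical is not None and value not in (None, ""):
            normalized.setdefault(canonical, value)
    return normalized


# The same customer as reported by two sources with different naming habits.
crm_record = {"Email_Address": "ana@example.com", "FName": "Ana"}
web_record = {"userEmail": "ana@example.com", "firstName": "Ana", "page": "/pricing"}

print(normalize(crm_record))
print(normalize(web_record))
```

Once both records share the same field names, they can be compared directly and passed to identity resolution as candidates for the same profile.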

Related Terms

  • Data Modeling — Defines the schemas that give semi-structured data more formal organization
  • Data Lakehouse — Storage architecture that handles semi-structured formats alongside structured data
  • ETL and ELT — Processes that transform semi-structured data into queryable structured formats
  • Data Validation — Ensures semi-structured data meets quality standards before downstream use