GA4’s BigQuery export contains two user identifiers: user_id (when user is authenticated) and user_pseudo_id (GA4’s device cookie). A single user often has multiple user_pseudo_ids across devices but one consistent user_id when logged in. Matching these identifiers is critical for accurate user counting and cross-device journey analysis. This guide shows the SQL patterns to deduplicate and stitch user sessions in BigQuery.

BigQuery GA4 User ID vs Client ID Matching: Dedupl

Understanding User ID vs User Pseudo ID in GA4

user_id: Your internal user identifier (database primary key, email, etc.). Set explicitly when user authenticates. Null for anonymous users.

user_pseudo_id: GA4’s automatic device identifier (stored in _ga cookie). Always present, one per browser/device.

A typical user journey involves multiple pseudo IDs:

  • Day 1: Visits on desktop anonymously → user_pseudo_id = abc123, user_id = null
  • Day 2: Visits on mobile anonymously → user_pseudo_id = def456, user_id = null
  • Day 3: Logs in on desktop → user_pseudo_id = abc123, user_id = user_12345
  • Day 4: Logs in on mobile → user_pseudo_id = def456, user_id = user_12345

GA4’s reporting merges these automatically (if Blended reporting identity is enabled), but in BigQuery, you see the raw events. To count unique users accurately, you must deduplicate.

Basic User Deduplication Query

-- Count unique users, deduplicating user_id and user_pseudo_id
WITH user_mapping AS (
  SELECT DISTINCT
    user_id,
    user_pseudo_id
  FROM `project.dataset.events_*` as events
  WHERE _TABLE_SUFFIX = '20240115'
    AND (user_id IS NOT NULL OR user_pseudo_id IS NOT NULL)
),

-- For each user_id, find all associated user_pseudo_ids
user_device_mapping AS (
  SELECT
    user_id,
    ARRAY_AGG(DISTINCT user_pseudo_id) AS all_pseudo_ids
  FROM user_mapping
  WHERE user_id IS NOT NULL
  GROUP BY user_id
),

-- Count unique users (deduped)
user_counts AS (
  SELECT
    COUNT(DISTINCT user_id) AS authenticated_users,
    COUNT(DISTINCT user_pseudo_id) AS device_count,
    COUNT(DISTINCT COALESCE(user_id, user_pseudo_id)) AS total_unique_users
  FROM user_mapping
)

SELECT * FROM user_counts;

Advanced: Creating a User Profile Table

For comprehensive cross-device analysis, create a user profile table mapping all user identifiers:

-- Create a master user mapping table
CREATE OR REPLACE TABLE `project.dataset.ga4_user_master` AS

WITH user_id_to_pseudo_ids AS (
  -- For authenticated users, collect all device IDs
  SELECT
    user_id,
    ARRAY_AGG(DISTINCT user_pseudo_id IGNORE NULLS) AS authenticated_device_ids,
    COUNT(DISTINCT user_pseudo_id) AS device_count,
    MIN(event_timestamp) AS first_event_time,
    MAX(event_timestamp) AS last_event_time
  FROM `project.dataset.events_*` as events
  WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAYS))
    AND user_id IS NOT NULL
  GROUP BY user_id
),

unidentified_pseudo_ids AS (
  -- Anonymous device IDs (no associated user_id)
  SELECT
    NULL AS user_id,
    user_pseudo_id,
    ARRAY[] AS authenticated_device_ids,
    1 AS device_count,
    MIN(event_timestamp) AS first_event_time,
    MAX(event_timestamp) AS last_event_time
  FROM `project.dataset.events_*` as events
  WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAYS))
    AND user_id IS NULL
    AND user_pseudo_id NOT IN (
      -- Exclude pseudo IDs that are mapped to authenticated users
      SELECT DISTINCT pseudo_id FROM UNNEST(
        (SELECT ARRAY_AGG(all_pseudo_ids) FROM (
          SELECT ARRAY_CONCAT(authenticated_device_ids) as all_pseudo_ids
          FROM user_id_to_pseudo_ids
        ))
      ) AS pseudo_id
    )
  GROUP BY user_pseudo_id
)

SELECT
  COALESCE(a.user_id, CONCAT('anon_', b.user_pseudo_id)) AS canonical_user_id,
  a.user_id,
  COALESCE(ARRAY_LENGTH(a.authenticated_device_ids), 0) AS device_count,
  a.authenticated_device_ids,
  a.first_event_time,
  a.last_event_time
FROM user_id_to_pseudo_ids a

UNION ALL

SELECT
  CONCAT('anon_', b.user_pseudo_id),
  NULL,
  1,
  [b.user_pseudo_id],
  b.first_event_time,
  b.last_event_time
FROM unidentified_pseudo_ids b;

This table becomes your source of truth for user identity. Query it to answer:

  • “How many authenticated users visited on multiple devices?” → Count rows where device_count > 1
  • “Which devices belong to user X?” → Look up in authenticated_device_ids
  • “What’s the cross-device journey for user Y?” → Join this table to events

Cross-Device Journey Analysis

Once you have a user master table, analyze cross-device behavior:

Guide BigQuery GA4 User ID vs Client
-- Analyze cross-device user journeys
SELECT
  canonical_user_id,
  COUNT(DISTINCT user_pseudo_id_in_events) AS device_count,
  ARRAY_AGG(DISTINCT CASE 
    WHEN device_type = 'mobile' THEN 'mobile'
    WHEN device_type = 'desktop' THEN 'desktop'
    ELSE 'tablet'
  END) AS devices_used,
  MIN(event_timestamp) AS first_touch,
  MAX(event_timestamp) AS last_touch,
  TIMESTAMP_DIFF(MAX(event_timestamp), MIN(event_timestamp), DAY) AS days_in_journey,
  COUNTIF(event_name = 'purchase') AS purchase_events,
  SUM(CASE WHEN event_name = 'purchase' THEN value ELSE 0 END) AS total_value
FROM user_master
JOIN `project.dataset.events_*` as events
  ON user_master.canonical_user_id = COALESCE('user_' || events.user_id, 'anon_' || events.user_pseudo_id)
WHERE _TABLE_SUFFIX >= '20240115'
GROUP BY canonical_user_id
HAVING COUNT(DISTINCT user_pseudo_id_in_events) > 1;  -- Multi-device users only

Handling User ID Changes and Merges

Sometimes users change their user_id (account merge, recovery, etc.). Handle this:

-- If a single user_pseudo_id is associated with multiple user_ids over time
SELECT
  user_pseudo_id,
  ARRAY_AGG(DISTINCT user_id) AS associated_user_ids,
  COUNT(DISTINCT user_id) AS user_id_count
FROM `project.dataset.events_*`
WHERE user_pseudo_id IS NOT NULL
  AND user_id IS NOT NULL
GROUP BY user_pseudo_id
HAVING COUNT(DISTINCT user_id) > 1;

-- If you find these, you need a user_id mapping table to handle merges
-- (beyond the scope here, but involves tracking when user_id A merged into user_id B)

Attribution Across Devices with Deduplicated Users

Once you have deduplicated users, attribute conversions across devices:

-- Multi-touch attribution: which channels touched before purchase?
WITH user_events AS (
  SELECT
    canonical_user_id,
    user_pseudo_id,
    traffic_source.source,
    traffic_source.medium,
    event_name,
    event_timestamp,
    ROW_NUMBER() OVER (PARTITION BY canonical_user_id ORDER BY event_timestamp) AS event_sequence
  FROM user_master
  JOIN `project.dataset.events_*` as events ON user_master.canonical_user_id = ...
),

purchase_events AS (
  SELECT DISTINCT canonical_user_id
  FROM user_events
  WHERE event_name = 'purchase'
),

user_channels AS (
  SELECT
    e.canonical_user_id,
    ARRAY_AGG(DISTINCT CONCAT(e.source, '/', e.medium)) AS channels_before_purchase,
    COUNT(DISTINCT e.event_sequence) AS events_before_purchase
  FROM user_events e
  JOIN purchase_events p ON e.canonical_user_id = p.canonical_user_id
  GROUP BY e.canonical_user_id
)

SELECT
  channels_before_purchase,
  COUNT(*) AS user_count
FROM user_channels
GROUP BY channels_before_purchase
ORDER BY user_count DESC;

Reconciliation: BigQuery Deduplicated Count vs GA4 Report Count

Your deduplicated user count in BigQuery should roughly match GA4’s reported user count (within 5-10% variance):

  • GA4 uses: Blended reporting identity (merges user_id with user_pseudo_id automatically)
  • BigQuery should use: Your deduplicated logic (COALESCE user_id with user_pseudo_id)
  • Expected variance: 5-10% due to data processing delays and sampling
  • Large variance (>15%): Indicates user_id not set correctly or deduplication logic error
-- Compare counts
SELECT
  'GA4 Reported Users' AS source,
  (SELECT COUNT(DISTINCT user_id) FROM `project.dataset.events_*` WHERE user_id IS NOT NULL) AS user_count
UNION ALL
SELECT
  'BigQuery Deduplicated',
  (SELECT COUNT(DISTINCT canonical_user_id) FROM `project.dataset.ga4_user_master`)

User deduplication is the foundation of accurate cross-device analytics. Getting it right unlocks insights into how users switch between devices and complete multi-device journeys.

Related guides: BigQuery GA4 Event Params, GA4 User ID Cross-Device, GA4 Reporting Identity.

Leave a Comment