[Feature](Iceberg) Implement expire_snapshots procedure for Iceberg tables#59979

Merged
morningman merged 6 commits into apache:master from suxiaogang223:impl_expire_snapshots
Feb 4, 2026

Conversation

@suxiaogang223
Contributor

@suxiaogang223 suxiaogang223 commented Jan 16, 2026

What problem does this PR solve?

Summary

This PR implements the expire_snapshots procedure for Iceberg tables, following the Apache Iceberg Spark procedure specification. This procedure removes old snapshots from Iceberg tables to free up storage space and improve metadata performance.

Changes

Main Implementation

  • File: fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/action/IcebergExpireSnapshotsAction.java
    • Implemented executeAction() method to expire snapshots using Iceberg's ExpireSnapshots API
    • Added getResultSchema() method returning 6-column output matching Spark's schema
    • Added parseTimestamp() helper method to support ISO datetime and milliseconds formats
    • Updated validation to allow snapshot_ids as a standalone parameter
    • Fixed retain_last behavior: when specified alone, automatically sets expireOlderThan to current time
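
The source of parseTimestamp() is not shown in this conversation. As a rough illustration of accepting both formats, the minimal Java sketch below treats an all-digit value as epoch milliseconds and anything else as an ISO local datetime. The class name TimestampSketch, the explicit ZoneId argument, and the error message are assumptions made for this example, not the code merged in this PR.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration only; not the merged Doris code.
final class TimestampSketch {
    private TimestampSketch() {}

    /**
     * Converts an older_than value to epoch milliseconds.
     * Accepts an all-digit string (epoch milliseconds) or an ISO local
     * datetime such as 2024-01-01T00:00:00, interpreted in the given zone.
     * Passing the zone explicitly is an assumption of this sketch.
     */
    static long parseTimestamp(String value, ZoneId zone) {
        String trimmed = value.trim();
        // All digits: treat the value as epoch milliseconds.
        if (!trimmed.isEmpty() && trimmed.chars().allMatch(Character::isDigit)) {
            return Long.parseLong(trimmed);
        }
        try {
            // Otherwise expect ISO_LOCAL_DATE_TIME, e.g. 2024-01-01T00:00:00.
            return LocalDateTime.parse(trimmed).atZone(zone).toInstant().toEpochMilli();
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException("Invalid timestamp: " + value, e);
        }
    }
}
```

With ZoneOffset.UTC as the zone, "1704067200000" and "2024-01-01T00:00:00" resolve to the same instant.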

Supported Parameters

| Parameter | Description |
|-----------|-------------|
| `older_than` | Timestamp before which snapshots will be removed (ISO datetime or milliseconds) |
| `retain_last` | Number of ancestor snapshots to preserve |
| `snapshot_ids` | Comma-separated list of specific snapshot IDs to expire |
| `max_concurrent_deletes` | Size of thread pool for delete operations |
| `clean_expired_metadata` | When true, cleans up unused partition specs and schemas |
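
The way these parameters interact can be sketched in a few lines of Java. This is a hedged illustration, not the merged implementation: the class ExpireArgsSketch, its field names, the rule that at least one of the three selectors must be present, and the lower bound of 1 for retain_last are all assumptions. Only two behaviors come from the PR description: snapshot_ids may be used on its own, and retain_last given alone implies a cutoff of the current time. For brevity, older_than is assumed to be in millisecond form already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical argument holder for illustration only.
final class ExpireArgsSketch {
    final Long olderThanMillis;   // null when no time cutoff applies
    final Integer retainLast;     // null when not specified
    final List<Long> snapshotIds; // empty when not specified

    private ExpireArgsSketch(Long olderThanMillis, Integer retainLast, List<Long> snapshotIds) {
        this.olderThanMillis = olderThanMillis;
        this.retainLast = retainLast;
        this.snapshotIds = snapshotIds;
    }

    static ExpireArgsSketch resolve(Map<String, String> props, long nowMillis) {
        String olderThan = props.get("older_than");
        String retain = props.get("retain_last");
        String ids = props.get("snapshot_ids");

        // Assumed rule: some selector must be given; snapshot_ids alone is enough.
        if (olderThan == null && retain == null && ids == null) {
            throw new IllegalArgumentException(
                    "At least one of older_than, retain_last or snapshot_ids is required");
        }

        // Split the comma-separated ID list into longs.
        List<Long> snapshotIds = new ArrayList<>();
        if (ids != null) {
            for (String part : ids.split(",")) {
                snapshotIds.add(Long.parseLong(part.trim()));
            }
        }

        Integer retainLast = null;
        if (retain != null) {
            retainLast = Integer.parseInt(retain.trim());
            if (retainLast < 1) { // assumed lower bound
                throw new IllegalArgumentException("retain_last must be at least 1");
            }
        }

        Long olderThanMillis = null;
        if (olderThan != null) {
            olderThanMillis = Long.parseLong(olderThan.trim());
        } else if (retainLast != null && ids == null) {
            // retain_last specified alone: use the current time as the cutoff.
            olderThanMillis = nowMillis;
        }
        return new ExpireArgsSketch(olderThanMillis, retainLast, snapshotIds);
    }
}
```

Passing the current time in as an argument, rather than reading the clock inside resolve, keeps the defaulting rule easy to test.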

Output Schema

The procedure returns 6 columns:

  • deleted_data_files_count
  • deleted_position_delete_files_count
  • deleted_equality_delete_files_count
  • deleted_manifest_files_count
  • deleted_manifest_lists_count
  • deleted_statistics_files_count

Test Updates

  • File: regression-test/suites/external_table_p0/iceberg/action/test_iceberg_execute_actions.groovy
    • Added functional tests for expire_snapshots with retain_last parameter
    • Added validation tests for snapshot_ids parameter
    • Updated error message expectations

Usage Example

-- Expire snapshots, keeping only the last 2
ALTER TABLE catalog.db.table EXECUTE expire_snapshots("retain_last" = "2");

-- Expire snapshots older than a specific timestamp
ALTER TABLE catalog.db.table EXECUTE expire_snapshots("older_than" = "2024-01-01T00:00:00");

-- Expire specific snapshots by ID
ALTER TABLE catalog.db.table EXECUTE expire_snapshots("snapshot_ids" = "123456789,987654321");

-- Combine parameters
ALTER TABLE catalog.db.table EXECUTE expire_snapshots("older_than" = "2024-06-01T00:00:00", "retain_last" = "5");

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Jan 16, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223 suxiaogang223 marked this pull request as draft January 16, 2026 14:13
@suxiaogang223
Contributor Author

run external

@suxiaogang223 suxiaogang223 changed the title from "impl expire_snapshots" to "[Feature](Iceberg) Implement expire_snapshots procedure for Iceberg tables" Jan 17, 2026
@suxiaogang223 suxiaogang223 marked this pull request as ready for review February 3, 2026 12:10
@suxiaogang223
Contributor Author

run external

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/98) 🎉
Increment coverage report
Complete coverage report

@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32135 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 36a8f715a204c69d0d567f3ce7e2eb54ec0b8e27, data reload: false

------ Round 1 ----------------------------------
query	cold_ms	hot1_ms	hot2_ms	min_hot_ms
q1	17625	5193	5024	5024
q2	2059	310	185	185
q3	10179	1305	757	757
q4	10215	849	315	315
q5	7539	2159	1913	1913
q6	199	187	152	152
q7	875	744	606	606
q8	9277	1370	1121	1121
q9	5216	4869	4906	4869
q10	6782	1978	1561	1561
q11	517	279	276	276
q12	339	384	227	227
q13	17766	4040	3265	3265
q14	240	245	221	221
q15	908	830	815	815
q16	674	711	615	615
q17	654	818	440	440
q18	6846	6478	7549	6478
q19	1233	1067	635	635
q20	414	362	255	255
q21	2870	2283	2128	2128
q22	382	335	277	277
Total cold run time: 102809 ms
Total hot run time: 32135 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold_ms	hot1_ms	hot2_ms	min_hot_ms
q1	5756	5515	5525	5515
q2	264	339	253	253
q3	2406	2911	2566	2566
q4	1433	1844	1386	1386
q5	4881	4556	4518	4518
q6	224	181	153	153
q7	2023	1929	1768	1768
q8	2714	2478	2405	2405
q9	7489	7424	7399	7399
q10	2904	3091	2712	2712
q11	534	463	469	463
q12	661	778	611	611
q13	3925	4279	3304	3304
q14	277	286	264	264
q15	850	804	788	788
q16	648	687	639	639
q17	1058	1233	1277	1233
q18	7450	7403	7225	7225
q19	818	817	818	817
q20	1970	2056	1890	1890
q21	4516	4273	4079	4079
q22	568	534	485	485
Total cold run time: 53369 ms
Total hot run time: 50473 ms

@doris-robot

ClickBench: Total hot run time: 28.25 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 36a8f715a204c69d0d567f3ce7e2eb54ec0b8e27, data reload: false

query	cold_s	hot1_s	hot2_s
query1	0.06	0.04	0.04
query2	0.09	0.05	0.04
query3	0.25	0.08	0.08
query4	1.60	0.11	0.11
query5	0.27	0.25	0.25
query6	1.16	0.67	0.67
query7	0.03	0.02	0.03
query8	0.05	0.04	0.04
query9	0.56	0.50	0.49
query10	0.57	0.55	0.55
query11	0.15	0.09	0.09
query12	0.14	0.10	0.11
query13	0.64	0.61	0.62
query14	1.09	1.06	1.05
query15	0.87	0.86	0.88
query16	0.39	0.38	0.39
query17	1.09	1.16	1.12
query18	0.23	0.20	0.21
query19	2.12	1.96	2.04
query20	0.02	0.02	0.01
query21	15.41	0.25	0.15
query22	5.25	0.05	0.05
query23	16.00	0.28	0.10
query24	1.28	0.62	0.34
query25	0.11	0.07	0.05
query26	0.14	0.14	0.13
query27	0.05	0.06	0.05
query28	3.34	1.15	0.97
query29	12.59	3.92	3.16
query30	0.28	0.14	0.11
query31	2.81	0.65	0.39
query32	3.24	0.61	0.50
query33	3.30	3.22	3.29
query34	16.43	5.38	4.74
query35	4.80	4.74	4.85
query36	0.65	0.51	0.49
query37	0.12	0.07	0.07
query38	0.07	0.04	0.05
query39	0.05	0.03	0.03
query40	0.19	0.16	0.16
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.03
Total cold run time: 97.66 s
Total hot run time: 28.25 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/98) 🎉
Increment coverage report
Complete coverage report

@github-actions github-actions bot added the approved label ("Indicates a PR has been approved by one committer.") Feb 4, 2026
@github-actions
Contributor

github-actions bot commented Feb 4, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Feb 4, 2026

PR approved by anyone and no changes requested.

Contributor

@kaka11chen kaka11chen left a comment

LGTM

@morningman morningman merged commit 6d9883e into apache:master Feb 4, 2026
32 checks passed
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Feb 10, 2026
…ables (apache#59979)

- Issue Number:  apache#58199

This PR implements the `expire_snapshots` procedure for Iceberg tables,
following the Apache Iceberg Spark procedure specification. This
procedure removes old snapshots from Iceberg tables to free up storage
space and improve metadata performance.

- **File:** `fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/action/IcebergExpireSnapshotsAction.java`
  - Implemented `executeAction()` method to expire snapshots using Iceberg's `ExpireSnapshots` API
  - Added `getResultSchema()` method returning 6-column output matching Spark's schema
  - Added `parseTimestamp()` helper method to support ISO datetime and milliseconds formats
  - Updated validation to allow `snapshot_ids` as a standalone parameter
  - Fixed `retain_last` behavior: when specified alone, automatically sets `expireOlderThan` to current time

| Parameter | Description |
|-----------|-------------|
| `older_than` | Timestamp before which snapshots will be removed (ISO datetime or milliseconds) |
| `retain_last` | Number of ancestor snapshots to preserve |
| `snapshot_ids` | Comma-separated list of specific snapshot IDs to expire |
| `max_concurrent_deletes` | Size of thread pool for delete operations |
| `clean_expired_metadata` | When true, cleans up unused partition specs and schemas |

The procedure returns 6 columns:
- `deleted_data_files_count`
- `deleted_position_delete_files_count`
- `deleted_equality_delete_files_count`
- `deleted_manifest_files_count`
- `deleted_manifest_lists_count`
- `deleted_statistics_files_count`

- **File:** `regression-test/suites/external_table_p0/iceberg/action/test_iceberg_execute_actions.groovy`
  - Added functional tests for `expire_snapshots` with `retain_last` parameter
  - Added validation tests for `snapshot_ids` parameter
  - Updated error message expectations

```sql
-- Expire snapshots, keeping only the last 2
ALTER TABLE catalog.db.table EXECUTE expire_snapshots("retain_last" = "2");

-- Expire snapshots older than a specific timestamp
ALTER TABLE catalog.db.table EXECUTE expire_snapshots("older_than" = "2024-01-01T00:00:00");

-- Expire specific snapshots by ID
ALTER TABLE catalog.db.table EXECUTE expire_snapshots("snapshot_ids" = "123456789,987654321");

-- Combine parameters
ALTER TABLE catalog.db.table EXECUTE expire_snapshots("older_than" = "2024-06-01T00:00:00", "retain_last" = "5");
```

(cherry picked from commit 6d9883e)
yiguolei pushed a commit that referenced this pull request Feb 12, 2026
@suxiaogang223 suxiaogang223 deleted the impl_expire_snapshots branch February 13, 2026 03:34
ybtsdst pushed a commit to ybtsdst/doris that referenced this pull request Feb 27, 2026
@yiguolei yiguolei mentioned this pull request Mar 7, 2026
@suxiaogang223
Contributor Author

Documentation for expire_snapshots has already been added in the website repo:

So no additional documentation PR is needed for this feature.

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.0.4-merged, kind/need-document-merged, reviewed

6 participants