While diffusion models show promising results in image editing given a target prompt, achieving both prompt fidelity and background preservation remains difficult. Recent works have introduced score distillation techniques that leverage the rich generative prior of text-to-image diffusion models to solve this task without additional fine-tuning. However, these methods often struggle with tasks such as object insertion. Our investigation of these failures reveals significant variations in gradient magnitude and spatial distribution, making hyperparameter tuning highly input-specific or unsuccessful. To address this, we propose two simple yet effective modifications: attention-based spatial regularization and gradient filtering-normalization, both aimed at reducing these variations during gradient updates. Experimental results show our method outperforms state-of-the-art score distillation techniques in prompt fidelity, improving successful edits while preserving the background. Users also preferred our method over state-of-the-art techniques across three metrics, and by 58-64% overall.
Overview. Given an input image and a target prompt, we obtain the gradient of the SBP loss and an attention-based mask. We then apply spatial regularization, gradient filtering, and gradient normalization to modify the image so that it matches the prompt.
Given an input source image and a target prompt describing how the image should be modified, our goal is to modify the image to match the prompt. Our method builds upon score distillation sampling with a simple L2 regularization. We address a key challenge: how to modulate variations in gradient magnitudes and their spatial distributions.
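As a rough illustration of what a score distillation update with L2 regularization might look like, the sketch below performs one gradient step on a tiny single-channel image. Everything here is an assumption made for illustration, not the paper's implementation: the function name `regularized_update`, the parameters `lr` and `lam`, the nested-list image type, and in particular the choice to weight the regularizer by `1 - mask` so that pixels outside the attended region are pulled back toward the source image.

```python
from typing import List

Image = List[List[float]]  # single-channel image as nested lists (illustrative only)


def regularized_update(x: Image, x_src: Image, grad: Image, mask: Image,
                       lr: float = 0.1, lam: float = 0.5) -> Image:
    """One gradient step with a spatially weighted L2 pull toward the source.

    Assumed form (hypothetical, not taken from the paper):
        x <- x - lr * (grad + lam * (1 - mask) * (x - x_src))
    where mask is in [0, 1] and is high where the edit is expected.
    """
    out: Image = []
    for i, row in enumerate(x):
        new_row = []
        for j, v in enumerate(row):
            # Gradient of (lam / 2) * (1 - mask) * (x - x_src)^2 with respect to x.
            reg = lam * (1.0 - mask[i][j]) * (v - x_src[i][j])
            new_row.append(v - lr * (grad[i][j] + reg))
        out.append(new_row)
    return out
```

With a zero distillation gradient, a pixel where `mask` is 1 stays unchanged, while a pixel where `mask` is 0 moves toward its source value. This is the background-preserving behavior the regularizer is meant to provide under these assumptions.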
Challenge: gradients from score distillation sampling need modulation. Gradients vary across timesteps, noise samples, prompts, and images. The standard deviation of the gradient magnitudes can be used to filter out less focused and "counterproductive" gradients.
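The filtering and normalization idea can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the helper `filter_and_normalize`, the parameter `std_threshold`, and the rule "discard a gradient whose magnitude standard deviation falls below the threshold" are hypothetical, as is the choice of L2 normalization. The intuition assumed here is that a spatially uniform gradient (low standard deviation) is unfocused, whereas a gradient concentrated in a small region has a high standard deviation.

```python
import math
from statistics import pstdev
from typing import List, Optional

Grad = List[List[float]]  # per-pixel gradient as nested lists (illustrative only)


def filter_and_normalize(grad: Grad, std_threshold: float,
                         eps: float = 1e-8) -> Optional[Grad]:
    """Return a unit-norm gradient, or None if the gradient is filtered out."""
    mags = [abs(g) for row in grad for g in row]
    # Assumed filter: low spread in magnitudes means the gradient is unfocused.
    if pstdev(mags) < std_threshold:
        return None  # the caller would skip this update and resample
    # Assumed normalization: rescale to unit L2 norm so step sizes are comparable.
    norm = math.sqrt(sum(m * m for m in mags))
    return [[g / (norm + eps) for g in row] for row in grad]
```

In an optimization loop, a `None` result would mean drawing a new timestep and noise sample instead of applying the update. Normalizing the surviving gradients keeps a single learning rate usable across inputs, which is the motivation given for reducing magnitude variation.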
Results gallery (input/output image pairs; images not shown). Edit prompts, in order: + helicopter; notebook → drawing of pikachu; coffee → matcha; + lightning; food → wagyu steak; + stylish sunglasses; + bird; + straw; + kite in the sky; + small boat; + party hat; + glasses; meatballs → chrome balls; + lantern; + painting; + google logo; + flying airplane; + lamp; + bracelet; + glasses; + camel; + rabbit; + corgi; + seafood; + dragon; + horse; + rainbow; + long red necktie; + jam; blonde hair; + suitcase; + thunder; + burger flag; lake → lava; + blindfold; blue eyes; + dining table; freeze the lake; + hot air balloon; + sun; + vase; + vase; + hat; + sunglasses; + scarf.
@misc{chinchuthakun2025lusd,
title={LUSD: Localized Update Score Distillation for Text-Guided Image Editing},
author={Worameth Chinchuthakun and Tossaporn Saengja and Nontawat Tritrong and Pitchaporn Rewatbowornwong and Pramook Khungurn and Supasorn Suwajanakorn},
year={2025},
eprint={2503.11054},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2503.11054},
}