Memoize file contents by path in CachedParser to skip redundant reads#5928
Memoize file contents by path in CachedParser to skip redundant reads#5928SanderMuller wants to merge 1 commit into
Conversation
CachedParser::parseFile() read the whole file via FileReader::read() on every call, before the content-keyed node cache. The same file is parsed many times (a trait file once per class that uses it - on Tempest, 73 498 parseFile calls for 2 327 distinct files, 96.8% redundant reads; the 256-entry content cache thrashes on hot traits), so the read is repeated even when nothing changed. Memoize the contents by path, keyed by mtime, and skip the re-read when the file is unchanged. clearstatcache() before the mtime check keeps this correct in long-running processes (PHPStan Pro, fixer worker) where a file may be edited between calls, so an edited file is always re-read and re-parsed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
How to reproduce/measure the performance improvement? |
|
Here's the A/B I used to measure it. Swap only # in the phpstan checkout, swap just the one file between the two versions:
git show <ver>:src/Parser/CachedParser.php > src/Parser/CachedParser.php # <ver> = this branch, then bad7874ec
# from the target project, cold + single-process, 3 reps:
rm -rf <tmpDir> && /usr/bin/time php <phpstan>/bin/phpstan analyse -l 8 --no-progress <paths>The mechanism: So the win tracks how much cross-file re-read redundancy a codebase has, which makes it corpus-dependent. Single-process, cold, 3 reps, median:
So it's a real win on high-fan-in codebases and neutral on low-redundancy ones. The content memo costs about 4MB regardless of whether the project benefits. If that always-paid cost is the concern, a bounded variant (memoize only files that get read more than once) would make the memory track the benefit; happy to do that if you'd prefer it. The red CI check is the flaky |
CachedParser::parseFile()reads the whole file viaFileReader::read()on every call, before the content-keyed node cache. The same file is parsed many times (a trait file once per class that uses it), so the read repeats even when nothing changed. On Tempest that is 73,498parseFilecalls for 2,327 distinct files, 96.8% of them redundant; hot trait files are re-read thousands of times and the 256-entry content cache thrashes on them.This memoizes the contents by path, keyed by mtime, and skips the re-read when the file is unchanged.
clearstatcache()before the mtime check keeps it correct in long-running processes (PHPStan Pro, the fixer worker), where a file can change between calls: an edited file gets a new mtime, so it is read and parsed again. A test covers both paths, that an unchanged file is not re-read and a file with a newer mtime is.Effect (CPU = user+sys,
/usr/bin/time, result-cold), measured on Tempest and on phpstan-src's own source:This is a CPU / shared-runner saving on result-cold runs (the path taken after
composer updateor a config change in CI), not a latency win: parallel analysis is CPU-bound, so wall time is unchanged. The saving is a roughly constant amount of syscall time, so it is a larger fraction on lighter-per-file projects. Output is byte-identical before and after on both codebases.Memory is about +4 MB on Tempest for the contents held by path; the node ASTs are not duplicated, since the existing content cache still holds those.