- The use of `cPickle` instead of `pickle` in `CachingUtilities.py` for object serialization can potentially boost performance significantly for large projects. (Default in Python 3: `pickle` already uses the C implementation when it is available.)
- Find a reliable hash for collection types that don't guarantee an order (e.g. dict, set). issue link [WIP] See also: https://github.com/coala/coala/issues/4350#issuecomment-363723566
- Tracking directories
- glob pattern match issue
- context-object refactoring
- gitignore tracking
- yield_once for hashable types
- lazy property
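For the order-independent hashing of `dict` and `set` mentioned above, one possible approach is to reduce every object to a canonical string before hashing it. Python's built-in `hash()` is not suitable for a persistent cache because string hashes are randomized per process. The following is only a sketch: the helper `_canonical` is hypothetical, the name `persistent_hash` is borrowed from these notes but the body is not coala's implementation, and the string encoding is not hardened against deliberately crafted collisions.

```python
import hashlib


def _canonical(obj):
    """Reduce obj to a string that ignores dict/set iteration order."""
    if isinstance(obj, dict):
        # Canonicalize each pair first, then sort the resulting strings.
        items = sorted(
            "{}:{}".format(_canonical(k), _canonical(v))
            for k, v in obj.items())
        return "dict(" + ",".join(items) + ")"
    if isinstance(obj, (set, frozenset)):
        return "set(" + ",".join(sorted(_canonical(e) for e in obj)) + ")"
    if isinstance(obj, (list, tuple)):
        # Ordered containers keep their order.
        inner = ",".join(_canonical(e) for e in obj)
        return type(obj).__name__ + "(" + inner + ")"
    # Fallback, assumed stable for simple values (str, int, bool, None).
    return "{}:{!r}".format(type(obj).__name__, obj)


def persistent_hash(obj):
    """Return a digest that is stable across runs and insertion order."""
    return hashlib.sha1(_canonical(obj).encode("utf-8")).hexdigest()
```

Two dicts with the same items in a different insertion order produce the same digest, while two lists with the same elements in a different order do not.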
- Nextgen Core Integration: Important for closing this issue: writing extensive tests that exercise all main features of the new core relevant to coala users by invoking coala itself, especially for caching (different bear types, different file scenarios).
- Cache control: Providing caching flags
  - `--cache-strategy` / `--cache-protocol`: Controls how coala manages caches for the next run.
    - `none`: Don't use a cache at all. A shortcut flag, `--no-cache`, could additionally be implemented, effectively meaning `--cache-protocol=none`.
    - `primitive`: Use a cache that grows infinitely. All cache entries are stored for all following runs and are never removed. Effective when many recurrent changes happen in coafiles and settings. Fastest in storing.
    - `lri` / `last-recently-used` (default flag): Cached items persist only until the next run. Stretch issue: Implement count parameters that control when to discard items from the cache, e.g. discard an item after 3 runs of coala that did not use it.
  - `--clear-cache`: Clears the cache.
  - `--export-cache` / `--import-cache`: Maybe useful for sharing caches. For example, a CI server runs coala for a project, and you can download the cache from there as an artifact to speed up your builds / coala runs.
  - `--cache-compression`: Accepts as arguments:
    - `none`: No cache compression. This is the default.
    - Other flags that specify common compression capabilities Python provides (for example lzma or gzip). Cache compression should be evaluated for its effectiveness first: the cache will mainly store hashes, which usually contain little redundancy, so the gain might be very low. The small performance penalty when loading the cache might not be worth a possibly very small reduction in cache size.
  - `--optimize-cache`: Accepts a small performance penalty to make cache loading faster. In particular, this feature shall utilize `pickletools.optimize`, though its use is not exclusive to this flag.
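The proposed flags could be declared with `argparse` roughly as follows. This is a sketch of the proposal only: none of these options is claimed to exist in coala today, the function name `parse_cache_args` is hypothetical, and taking a file path for `--export-cache` / `--import-cache` is an assumption.

```python
import argparse


def parse_cache_args(argv):
    """Parse only the proposed caching flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="coala")
    parser.add_argument(
        "--cache-protocol", "--cache-strategy", dest="cache_protocol",
        choices=("none", "primitive", "lri", "last-recently-used"),
        default="lri", help="how caches are managed for the next run")
    parser.add_argument(
        "--no-cache", action="store_true",
        help="shortcut for --cache-protocol=none")
    parser.add_argument("--clear-cache", action="store_true")
    # Assumption: both flags take the path of the cache file.
    parser.add_argument("--export-cache", metavar="FILE")
    parser.add_argument("--import-cache", metavar="FILE")
    parser.add_argument(
        "--cache-compression", choices=("none", "lzma", "gzip"),
        default="none")
    parser.add_argument("--optimize-cache", action="store_true")

    args = parser.parse_args(argv)
    if args.no_cache:
        # Resolve the shortcut after parsing so both spellings agree.
        args.cache_protocol = "none"
    return args
```

Resolving `--no-cache` after parsing keeps a single attribute, `cache_protocol`, as the source of truth for the rest of the program.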
- Loading bears takes time and can be improved using cib. Reference: this issue
- Bear loading needs to be improved
- `yield_once` takes a lot of CPU time. It can be found in `coala_utils.decorators`, `Collectors.py`, `Importers.py` and `Globbing.py`. Performance can be optimized by not using `yield_once` in cases where we expect that only distinct results (without any duplication) will be yielded.
- Significant files and packages of interest: `coalib.misc.CachingUtilities`, `coalib/processes/Processing.py`, `coalib/misc/Caching.py` and `coala_main.py`.
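To illustrate where the cost of `yield_once` comes from, here is a minimal stand-in for such a decorator. It is an illustrative sketch and not necessarily identical to the version in coala_utils; the two example generators are hypothetical.

```python
from functools import wraps


def yield_once(generator_func):
    """Illustrative stand-in: yield each distinct item only once."""
    @wraps(generator_func)
    def wrapper(*args, **kwargs):
        seen = set()
        for item in generator_func(*args, **kwargs):
            # Every item is hashed and looked up; this is the overhead
            # that can be avoided when duplicates cannot occur.
            if item not in seen:
                seen.add(item)
                yield item
    return wrapper


@yield_once
def with_duplicates():
    yield from ["a.py", "b.py", "a.py"]


def already_distinct():
    # Results are known to be distinct, so the decorator is omitted.
    yield from ["a.py", "b.py"]
```

Both generators produce the same sequence, but only the decorated one pays for the set bookkeeping.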
- Need to read RELEASE_NOTES.rst and coalib/parsing/DefaultArgParser.py.
- FileProxy or FileFactory class will construct objects that contain useful information about files.
- These are persistent objects.
- These proxy objects will replace the contents of the file-dict used by coala, which currently maps filenames to their contents: `file-dict = {filename: FileProxyObject, ...}`
- Should provide different interfaces to files like
- utf8-decoded
- with line endings
- without line endings
- binary file
- One of the most important properties of the FileProxy object would be a `last-modified-timestamp` to be used for caching.
- Even though the FileProxy objects will be hashed using `persistent_hash`, it might still be beneficial to have a `_hash` method inside the FileProxy class that returns the hash of the file content.
- The file bears will now be passed these proxy objects instead of the file contents. We will just be storing the file name and timestamp. See the previous file proxy implementation by Udayan: https://github.com/coala/coala/pull/2784
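A FileProxy along the lines described above might look like the following. This is a sketch under assumptions: the attribute and method names (`timestamp`, `binary`, `string`, `lines`) are hypothetical and are not taken from the linked pull request; only `_hash` follows the name used in these notes.

```python
import hashlib
import os


class FileProxy:
    """Hypothetical sketch of a file proxy; names are illustrative."""

    def __init__(self, filename):
        self.filename = os.path.abspath(filename)
        # Last-modified timestamp, intended as the cheap cache key.
        self.timestamp = os.path.getmtime(self.filename)
        self._raw = None  # contents are read lazily on first access

    def binary(self):
        """Return the raw bytes of the file."""
        if self._raw is None:
            with open(self.filename, "rb") as handle:
                self._raw = handle.read()
        return self._raw

    def string(self):
        """Return the utf8-decoded contents."""
        return self.binary().decode("utf-8")

    def lines(self, keepends=True):
        """Return the lines with or without their line endings."""
        return self.string().splitlines(keepends=keepends)

    def _hash(self):
        """Return a hash of the file content."""
        return hashlib.sha1(self.binary()).hexdigest()
```

The file-dict would then be built as `{proxy.filename: proxy}`, and a bear would call `proxy.lines()` instead of receiving the contents directly.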
- These will also reside in the file-dict (I think it should be called proxy-dict)
- A DirectoryBear can also be implemented to only work on directories by extracting objects from the proxy-dict. The directory paths have a trailing slash unlike file paths which can be used to make the distinction.
- The 2 proxy objects (file and directory) can also be differentiated using a simple type check inside bears.
A Directory object can cache file proxies. While walking the file tree and constructing Directory objects, we always check whether the directory's timestamp has changed. If it has not, we do a cache lookup, `dir_cache.get(my_directory)`, which constructs the file proxies as in the previous run. This is still quite hard to implement, because it needs tight integration with globs and more control over them, for example an iterator that lets you tell the glob which directories to skip while walking.
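The timestamp check and the skippable walk could be combined roughly as shown here. This is only a sketch: `dir_cache` is modeled as a plain dict mapping a directory path to `(mtime, files, subdirs)`, the function names are hypothetical, and it returns file paths rather than proxy objects to stay short. It relies on a directory's modification time changing when entries are added or removed, which is subject to the timestamp granularity of the file system.

```python
import os


def list_directory(path, dir_cache):
    """Return (files, subdirs) of path, reusing the cache if unchanged."""
    mtime = os.stat(path).st_mtime_ns
    cached = dir_cache.get(path)
    if cached is not None and cached[0] == mtime:
        # Directory unchanged since the last run: skip the listing.
        return cached[1], cached[2]
    files, subdirs = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.name)
            else:
                files.append(entry.name)
    files.sort()
    subdirs.sort()
    dir_cache[path] = (mtime, files, subdirs)
    return files, subdirs


def walk(path, dir_cache, skip=()):
    """Yield file paths below path, not descending into skipped names."""
    files, subdirs = list_directory(path, dir_cache)
    for name in files:
        yield os.path.join(path, name)
    for name in subdirs:
        if name not in skip:
            yield from walk(os.path.join(path, name), dir_cache, skip)
```

Subdirectories are still visited on every run, since a change deeper in the tree does not update the timestamp of its ancestors.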
- A Future is an object that doesn't have a result yet and is returned and handled by the executors used inside the core, while the core refers to tasks as (args, kwargs) objects that bears can pass to offload work into the core.
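The relationship between a task and a Future can be shown with the standard library's `concurrent.futures`. This is a generic sketch and not the core's actual scheduling code; the function `analyze` and its parameters are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def analyze(filename, *, max_line_length=80):
    # Hypothetical unit of work that a bear offloads.
    return "{}:{}".format(filename, max_line_length)


# A task in the sense used above: an (args, kwargs) pair.
task = (("a.py",), {"max_line_length": 100})

with ThreadPoolExecutor(max_workers=2) as executor:
    args, kwargs = task
    # submit() returns immediately with a Future that has no result yet.
    future = executor.submit(analyze, *args, **kwargs)
    # result() blocks until the executor has finished the task.
    result = future.result()
```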
- There are 3 steps involved in caching:
  - Populating the cache
  - Keeping the cache in sync
  - Managing the cache size
- Population processes are of 2 kinds:
  - Upfront (when we know all the data that we want to cache beforehand)
  - Lazy (cache as needed, with an initial check for possible duplicates)

  Note: Lazy population takes less initial cache build time than upfront population, but it might still cause one-off delays if there are checks in place for pre-existing cached objects (which might not be there at all).
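The difference between the two population kinds can be sketched as follows. The names `populate_upfront` and `LazyCache` are hypothetical and exist only for this illustration.

```python
def populate_upfront(compute, keys):
    """Upfront: all keys are known, so every value is computed now."""
    return {key: compute(key) for key in keys}


class LazyCache:
    """Lazy: a value is computed the first time its key is requested."""

    def __init__(self, compute):
        self._compute = compute
        self._data = {}
        self.misses = 0

    def get(self, key):
        # Initial check for an already cached entry.
        if key not in self._data:
            self.misses += 1
            # One-off delay: the first access pays for the computation.
            self._data[key] = self._compute(key)
        return self._data[key]
```

With `LazyCache(len)`, the first `get("abc")` computes and stores the value, and any later `get("abc")` is served from the cache without increasing `misses`.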
- Cache size management: These are the approaches for cache eviction
  - Time based eviction: Either keep a separate thread for this (costly approach) or evict data at the time of reading it.
  - First in, first out (FIFO)
  - First in, last out (FILO)
  - Least accessed (not recommended, since older values tend to have accumulated more accesses)
  - Least time between access: When a value is accessed, the cache marks the time of the access and increases the access count. When the value is accessed the next time, the cache increments the access count and calculates the average time between all accesses. Values that were once accessed a lot but fade in popularity will have a dropping average time between accesses. Sooner or later the average may drop low enough that the value will be evicted (seems costly).
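Two of the approaches above, FIFO and time based eviction at read time, can be combined in a few lines. This is a sketch only: the class name `EvictingCache` and its parameters `max_size`, `ttl` and `clock` are hypothetical, and the injectable clock exists purely to make the behavior easy to test.

```python
import time
from collections import OrderedDict


class EvictingCache:
    """Sketch: FIFO size limit plus time based eviction on read."""

    def __init__(self, max_size, ttl, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self._clock = clock
        self._items = OrderedDict()  # key -> (stored_at, value)

    def __len__(self):
        return len(self._items)

    def put(self, key, value):
        if key in self._items:
            # Re-storing a key counts as a fresh insertion.
            del self._items[key]
        elif len(self._items) >= self.max_size:
            # FIFO: drop the entry that was inserted first.
            self._items.popitem(last=False)
        self._items[key] = (self._clock(), value)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl:
            # Expired entries are evicted when read, so no separate
            # cleanup thread is needed.
            del self._items[key]
            return default
        return value
```

An expired entry that is never read again stays in memory until FIFO pushes it out, which is the trade-off of evicting at read time instead of using a background thread.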
This will provide some reference as to how caching works and is implemented