Treatment of Top-Level Statements

IMPLEMENTATION

Unnecessary Limitations
Fundamental Limitations
Internals

TODO

Data Tracking
Aliasing and Scoping
Typing
Subclass Method Calls
Function References
Predefined Symbols
Top-Level Statements
Preprocessing
Pragmas
Debugging and Reporting Features
Lint-Like Features

AUTHOR

NAME

  jslink.pl - eliminate unused code from a JavaScript library

SYNOPSIS

  jslink.pl -pre cat -i myapp.js -l lib.js -o -

Options are:

   -pre cat                  # preprocessor to apply to input files (but not -e).
   -e 'myfunc(3); foobar;'   # an "anchor" expression.
   -h index.html             # an "anchor" html file, whose script will be used. unimplemented.
   -i myapp.js               # an "anchor" script file, which will pull in things from library files.  
   -l lib.js                 # a library file.
   -o output.js              # the -l files with unneeded code removed. defaults to '-' (stdout).
   -debug debug              # default is none (no debugging)
   -warn functionmatch,instmeth,ambigs,dups                # default is none (no warnings about things that may be acceptable). 'all' is also supported.
   -dump used,unused,usedby,refs,undefs                    # default is none (no dumping). 'all' is also supported.
   -print used,filemarker,skipped,sourcelines              # default is 'used,filemarker' (print used code from libraries)
   -trace symname            # issue debug output every time symname is seen. unimplemented.
   -tabwidth 4               # set the number of spaces that tabs are interpreted as, if different from default of 8.
   -nestedassigns 0          # whether to attempt to track assignments of nested function definitions to other symbols.

All output except for -o (from -debug, -warn, or -dump) is sent to STDERR.

DESCRIPTION

jslink.pl determines which code is "dead" by following the transitive closure of references to symbols, starting from one or more "anchor" files.
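As a rough illustration of what following the transitive closure means, the sketch below walks a small reference graph from an anchor symbol. The graph, the symbol names, and the reachable() helper are all hypothetical; this is not the tool's actual code, which is written in Perl.

```javascript
// Hypothetical reference graph: each symbol maps to the symbols it refers to.
const refs = {
  main: ["render", "log"],
  render: ["escapeHtml"],
  escapeHtml: [],
  log: [],
  unusedHelper: ["log"],
};

// Collect every symbol reachable from the anchor symbols (worklist traversal).
function reachable(graph, anchors) {
  const used = new Set();
  const pending = [...anchors];
  while (pending.length > 0) {
    const name = pending.pop();
    if (used.has(name)) continue; // already visited
    used.add(name);
    for (const target of graph[name] || []) pending.push(target);
  }
  return used;
}

const used = reachable(refs, ["main"]);
// Anything never reached from an anchor is considered dead.
const dead = Object.keys(refs).filter((name) => !used.has(name));
console.log(dead); // [ 'unusedHelper' ]
```

Note that unusedHelper refers to log, but nothing reachable from the anchor refers to unusedHelper, so it is the only symbol dropped.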

It eliminates whole definitions only; it does not do anything even approaching full dead code elimination, as might be done within function bodies.

It will eliminate nested functions if they are not used.

It has knowledge of the builtin ECMAScript objects and their method names, as well as many DOM objects and their methods. When a method whose name matches a builtin method name is called on a parameter, it assumes that the parameter is indeed that kind of object. It does not currently do any sort of detailed static analysis that might actually prove this.

It also has knowledge of all builtin ECMAScript global functions, and so knows not to try to find their definitions. Note that this means that it is up to you to determine whether you need to provide definitions for missing builtin functions or missing builtin object methods (for example, Array.prototype.push for IE 5.0).

Treatment of Top-Level Statements

If it finds any references at all to a symbol in some library file (global data or function), it will not only pull in the definitions of those symbols, it will then also include *all* of the top-level statements in that file. That is because we don't want to try to analyze the necessity of the load-time statements; it is all or nothing for them. If any of those load-time statements have references to functions in the same file, then those functions are pulled in too. For this reason, it is better to supply several smaller JavaScript files (or to keep load-time statements few).
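To make the all-or-nothing rule concrete, here is a hypothetical library file. The function names are invented for illustration, and the comments describe what the rule above implies for each definition if the anchor script refers only to formatDate().

```javascript
// Hypothetical contents of a library file, lib.js.
var tooltipsReady = false;

// Referenced by the anchor script: pulled in.
function formatDate(d) { return d.toISOString(); }

// Not referenced by the anchor, but referenced by the load-time statement
// below, so it is pulled in as soon as anything in this file is used.
function initTooltips() { tooltipsReady = true; }

// Not referenced by the anchor or by any load-time statement: eliminated.
function parseQuery(s) { return s.split("&"); }

// Top-level (load-time) statement: kept whole once any symbol here is used.
initTooltips();
```

Moving initTooltips() and its call into a separate file would let an application that needs only formatDate() avoid both.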

IMPLEMENTATION

Unnecessary Limitations

This implementation is based on a crude parser using regular expressions, which makes assumptions about indentation in order to identify the beginnings and endings of function definitions.

The assumptions about indentation are valid if a pretty-printing preprocessor is used. The default preprocessor is just 'cat'. (We also have a Rhino-based preprocessor that we hacked together, but it is dependent on Rhino patches we have not yet organized.)

Even if the indentation is regular, there are still lots of problems with this implementation: not properly skipping literals (String and RegExp), not properly parsing all expressions that might be a function call, etc.
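The indentation assumption can be sketched as follows. This is a simplified illustration in JavaScript, not the tool's actual regular expressions: it assumes that a top-level definition starts with "function name(" in column 0 and ends at the next line that has "}" in column 0. The findTopLevelFunctions() helper is hypothetical.

```javascript
// Locate top-level function definitions in pretty-printed source by
// relying purely on indentation (zero-based line numbers).
function findTopLevelFunctions(source) {
  const lines = source.split("\n");
  const defs = [];
  let current = null;
  lines.forEach((line, i) => {
    const m = /^function\s+([A-Za-z_$][\w$]*)\s*\(/.exec(line);
    if (m && current === null) {
      current = { name: m[1], startline: i };
    } else if (current !== null && /^\}/.test(line)) {
      defs.push({ name: current.name, startline: current.startline, lastline: i });
      current = null;
    }
  });
  return defs;
}

const sample = [
  "function outer(x) {",
  "    function inner(y) {",
  "        return y + 1;",
  "    }",
  "    return inner(x);",
  "}",
  "var ready = true;",
].join("\n");

const defs = findTopLevelFunctions(sample);
console.log(defs);
```

The same sketch shows the weakness described above: a multi-line string or regexp literal that happens to place "}" in column 0 would end the definition early, because literals are not skipped.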

There are a variety of alternative implementation approaches, any of which would be more interesting and probably better, such as:

  - based on a real ECMAScript grammar, or
  - based on extending some real ECMAScript interpreter (such as SpiderMonkey or
    Rhino, which do not explicitly use a grammar), or
  - implemented in ECMAScript itself, such as something based on Narcissus, or
  - implemented by analyzing the result of a "real" ECMAScript/JScript linker
    (which would work with a "real" ECMAScript compiler), to see what symbols it pulls in.
  - perform source translation to some other programming language that has
    more mature tools.

Fundamental Limitations

Given the possible uses of eval(), function lookup tables, computed function names, dynamically added object methods, redefinition of functions, and so on, it is practically impossible to do this job fully automatically and correctly.

These challenges can result in both false negatives and false positives. We can for example entirely miss a dependence on some code whose entry point is a string that might even come from the outside environment.
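A minimal example of such a missed dependence, using invented names: the method name is assembled at run time, so the identifiers asCsv and asTsv never appear at the call site for a source scanner to find.

```javascript
// Hypothetical library object with two entry points.
var Exporters = {
  asCsv: function (rows) { return rows.join(","); },
  asTsv: function (rows) { return rows.join("\t"); },
};

function exportRows(format, rows) {
  // The method name is computed; "format" could even come from user input
  // or a URL parameter, so no static reference to asCsv or asTsv exists here.
  return Exporters["as" + format](rows);
}

console.log(exportRows("Csv", ["a", "b"])); // a,b
```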

On the other hand, when applied to an application that has a "driver" or "plugin" model, we might end up pulling in all available drivers/plugins, just because their entry points show up in some global registry table.

We will also tend to pull in code even if the reference is protected by a conditional:

   if (typeof someFunc != 'undefined') someFunc()

The intent of the programmer in such code is typically to call someFunc() only if the code that defines it has been loaded for some other reason (cf. "weak references" in compiled languages).

It is not really possible to resolve all of these situations automatically. Some guidance from the programmer is needed, either as pragmas in the code itself or as some external configuration.

Internals

As we parse the input, we build up a list of definition objects, which have these keys:

   deftype     # one of: 'ctormeth', 'protometh', 'instmeth', 'globalfunc', 'localfunc', 'assign', 'singleton'
   is_global   # whether this is a top-level expression in a file (versus a function definition).
   actualname  # the string that actually occurred in the source code for the definition, such as 'MyClass.prototype.sort'.
   justname    # the last identifier name in the dotted name of the definition, such as 'sort'. key for $DEFS_BY_UNQUAL.
   qualname    # the fully qualified name for the defined symbol, which may be more explicit than appeared in source code. key for $DEFS_BY_QUAL.
   parentqual  # the value of qualname of the parent definition (in nesting level), if any.
   protoname   # the name of the prototype class, such as 'MyClass' (if any -- just for deftype 'protometh')
   aliasto     # the name of another symbol that this is an alias for (as in "Foo.prototype.meth = aliasfunc").
   level       # the nesting level of the definition
   filename    # name of the file (or expression) this came from.
   startline   # zero-based line number this definition starts with.
   lastline    # zero-based line number of last line of this definition.
   lines       # array ref of source lines corresponding to startline to lastline.
   params      # a hash ref with keys which are the parameter names of the defined function.
   refs        # the references from this definition to other symbols. hash ref from qualnames to [$reftype, $lineno]
   undefs      # array ref of symbols in refs that are apparently not defined anywhere.
   usedby      # array ref of other definition objects which refer (directly) to this one. opposite of refs.
   used        # boolean to indicate whether it is part of the transitive closure (actually it is the loopcount).
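For orientation, one such definition object might look roughly like the following, written here as a JavaScript object literal rather than a Perl hash. Every value is illustrative: the reftype string "call", the level value, the line numbers, and the swap method are assumptions, not taken from the tool's output.

```javascript
// Illustrative definition object for a prototype method in a library file.
const def = {
  deftype: "protometh",
  is_global: false,
  actualname: "MyClass.prototype.sort",
  justname: "sort",                      // last identifier of the dotted name
  qualname: "MyClass.prototype.sort",
  parentqual: null,                      // no enclosing definition
  protoname: "MyClass",
  aliasto: null,
  level: 0,                              // assumed value for a top-level definition
  filename: "lib.js",
  startline: 10,                         // zero-based
  lastline: 12,
  lines: [
    "MyClass.prototype.sort = function (cmp) {",
    "    this.swap(0, 1);",
    "};",
  ],
  params: { cmp: true },                 // keys are the parameter names
  refs: { "MyClass.prototype.swap": ["call", 11] }, // qualname -> [reftype, lineno]
  undefs: [],
  usedby: [],
  used: 0,                               // not (yet) part of the closure
};

// lines covers startline through lastline inclusive.
console.log(def.lines.length === def.lastline - def.startline + 1); // true
```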

TODO

Data Tracking

Track definitions of global data variables, and references to them.

   # global data
   ^var $NAMERE = 
   # member data
   ^\ *this\.$NAMERE =
   # local data
   ^\ *var $NAMERE =
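The three patterns above could be tried out with a sketch like this one, written in JavaScript for illustration. The NAME pattern stands in for $NAMERE and is an assumption about what it matches; the dot after "this" is escaped here, and the global pattern is tested first because the local pattern would otherwise also match unindented lines.

```javascript
// Assumed stand-in for $NAMERE: a simple identifier.
const NAME = "[A-Za-z_$][A-Za-z0-9_$]*";

// Ordered list of [kind, pattern]; the first match wins.
const patterns = [
  ["global", new RegExp("^var (" + NAME + ") =")],
  ["member", new RegExp("^ *this\\.(" + NAME + ") =")],
  ["local", new RegExp("^ *var (" + NAME + ") =")],
];

// Classify one source line as a data definition, or return null.
function classify(line) {
  for (const [kind, re] of patterns) {
    const m = re.exec(line);
    if (m) return { kind: kind, name: m[1] };
  }
  return null;
}

console.log(classify("var count = 0;"));      // kind "global", name "count"
console.log(classify("    this.total = 0;")); // kind "member", name "total"
console.log(classify("    var i = 0;"));      // kind "local", name "i"
```

Like the rest of the parser, this relies on indentation to tell global declarations from local ones.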

Aliasing and Scoping

Make sure aliased definitions are not used to resolve references across lexical blocks.

Warn about shadowed variable/function names. Implement 'localfunc' and 'assign' properly (see $ANALYZE_NESTED_ASSIGNMENTS).

Track alias definitions (in 'assign' case, LHS inherits all the methods available from RHS). Things like "a = b.c.d;".

Track creation of global instances, things like "var foo = new SomeFunc();".

Ultimately, proper tracking of aliases could allow typing to be tightened up.

Typing

Right now, if we can't resolve a reference using the fully qualified name, we'll match to any definition of a function with the same unqualified name. This has the unfortunate consequence of potentially pulling in the wrong code.
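The fallback can be sketched as a two-step lookup. The two maps below are hypothetical stand-ins for $DEFS_BY_QUAL and $DEFS_BY_UNQUAL, and the class names are invented; the point is only to show how an unresolved qualified name ends up matching every definition with the same unqualified name.

```javascript
// Hypothetical definitions found in the library files.
const definitions = [
  { qualname: "Stack.prototype.push" },
  { qualname: "Logger.prototype.flush" },
  { qualname: "Cache.prototype.flush" },
];

// Stand-ins for the $DEFS_BY_QUAL and $DEFS_BY_UNQUAL indexes.
const defsByQual = new Map();
const defsByUnqual = new Map();
for (const def of definitions) {
  def.justname = def.qualname.split(".").pop();
  defsByQual.set(def.qualname, def);
  if (!defsByUnqual.has(def.justname)) defsByUnqual.set(def.justname, []);
  defsByUnqual.get(def.justname).push(def);
}

// Try the fully qualified name first; otherwise fall back to every
// definition that shares the unqualified name.
function resolve(refName) {
  if (defsByQual.has(refName)) return [defsByQual.get(refName)];
  return defsByUnqual.get(refName.split(".").pop()) || [];
}

const exact = resolve("Stack.prototype.push"); // one definition
const loose = resolve("out.flush");            // both flush definitions
console.log(exact.length, loose.length); // 1 2
```

Here a call written as out.flush() pulls in both Logger.prototype.flush and Cache.prototype.flush, even if only one of them is ever used.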

We also currently ignore all references to methods that share a name with any DOM or builtin ECMAScript object method. This means we can miss undefined symbols.

A proper fix would involve attempting full data flow analysis to determine variable and argument types, and/or capitalizing on some declaration mechanism (such as commented-out types, or reliance on JScript .NET source code prior to down-translation).

Other benefits would be obtained as well, such as catching errors like calling getElementById on a Window object instead of a Document object.

Subclass Method Calls

Track calls to prototype functions from within subprototype method bodies (based on remembering the set of the subprototype's prototype object).

Function References

Do something about named function references passed as arguments (no parens), though we win somewhat with the "justname" lookup on the other side.

Scope locally bound functions such as "OuterFunction.inner".

Extend $QUALRE to handle function calls with parameters in dotted expressions: "foobar(1,2).println()"

Handle (ignore) calls from literals, such as '00000'.substring(2)

Predefined Symbols

Make complete list of ECMAScript functions, variables, and builtin object methods.

Make complete (or somewhat complete) list of DOM object methods.

Top-Level Statements

Break up top-level statements into multiple continuous sequences within the file, or even per whole statement. This would make for better diagnostics (tracking line number for the definition/reference instead of just line -1). It would also pave the way for perhaps excluding more blocks from the final code.

Preprocessing

Get Rhino to issue original file line numbers. Also maybe get Rhino to mangle/change names, or do other transforms (such as conditional "in").

Exclude pattern matches within quoted strings and regexp literals.

Support regular expressions that identify strings containing references which should be considered (versus ignoring all strings).

Support for -h: allow anchor refs to come from an html page.

Support -D of expressions known to be false or true, to exclude references in:

    if (FALSEEXPR) {...} 
    if (!FALSEEXPR) {} else {...}
    if (TRUEEXPR) {} else {...}

   calls to eval 
   computed apply and computed call
   js file names
   unknown method on builtin object
   redefinition of builtin object method

AUTHOR

This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Alternatively, this is licensed under Academic Free License version 1.2.