Sunday, September 28, 2008

C.B. Martin's Mind in Nature

I am reading a review of C.B. Martin's Mind in Nature (2008).

mind is just another
  • system of dispositional states
  • capable of complex, spatially and temporally projective, directed, regulative adjustments and control (p. 111ff).

Martin holds that the same basic functions and properties (
  • positive and negative feedback,
  • feedforward,
  • use,
  • material of use,
  • correlativity of manifestation and disposition base,
  • representation and
  • content)
that are found in the non-mental, non-conscious and non-linguistic occur in the mental, conscious and linguistic as well (pp. 115-116).


Tuesday, September 9, 2008

documentary linguistics

[to do: clean up my text, give it some proper footnotes from Quakenbush and Himmelmann at least, and republish on Linguistic Exploration.]

I guess it is fair to say that "descriptive linguistics" is accepted as a term distinct from theoretical and applied linguistics, although obviously there is a gradient. Perhaps we can refer to these as (overlapping) fields within the discipline, orthogonal to the usual division by level of representation: phonetics, phonology, morphology, syntax, semantics, pragmatics. There are also the "border fields" which do not belong to the linguistics discipline exclusively, but bring in additional methods from neighboring disciplines, e.g. computational linguistics.

So perhaps we can think of documentary linguistics as the subfield of descriptive linguistics that borders on the distinct and applied discipline of information technology use (not computer science research, which is what computational linguists often do).

The target audience of a language engineering environment for linguistic exploration would be professional and non-professional practitioners of documentary linguistics. Non-professionals would include language teachers and students in general, while professionals would include participants in professionally managed projects (including language teachers doing graduate research in applied linguistics) whose background may be linguistic sciences, education, professional editing, creative writing, information technology or something else.

J. Stephen Quakenbush. SIL International and Endangered Austronesian Languages.

[Note 5: The concept of “language documentation” as the product of “documentary linguistics” is discussed further below. The primary distinctive of language “documentation” is its focus on primary data, collected, annotated, and made available as “a lasting multipurpose record of a language.” (cf. Himmelmann 2006:1).]
Awareness of the importance of language documentation has been growing worldwide over the past couple of decades along with awareness and concern over language endangerment. Language documentation has to do with producing a lasting record of representative samples of a language. As traditionally practiced by SIL, and indeed by the whole Western linguistic enterprise, language documentation has focused on the production of resources for the linguist or academician more than on resources that directly benefit speakers of the language being documented. In the tradition of early twentieth century American linguists Sapir and Bloomfield, field linguists have gone out to produce grammatical descriptions and text collections which would be published by major universities and academic publishing companies in order to advance the understanding of their fellow linguists. [Note 23: Bloomfield’s 1917 Tagalog texts with grammatical analysis is still considered a classic of this sort.]
In summary, language data on which much linguistic analysis and description is based has rarely been published as such. This is as true of SIL-published data on Austronesian languages as much as it is true of material published by other field linguists on less commonly studied languages around the world. Where language data has been published, it has usually not been “primary data,” but rather “secondary data” that has been edited, systematized or regularized in some way.
The past decade has seen increasing interest in the documentation of representative primary data in a form that will be permanently accessible to speakers and researchers in an electronic environment. Indeed, a new sub-discipline of linguistics has appeared bearing the name of Documentary Linguistics. The website of the Hans Rausing Endangered Languages Project credits Nikolaus Himmelmann as a catalyst for the development of this discipline, citing his 1998 paper entitled “Documentary and descriptive linguistics.” In it, Himmelmann (1998: 166) states that
"The aim of a language documentation is to provide a comprehensive record of the linguistic practices characteristic of a given speech community... This... differs fundamentally from... language description [which] aims at the record of a language... as a system of abstract elements, constructions, and rules."
Himmelmann, Gippert and Mosel (2006: v) specify that documentary linguistics is concerned with the “methods, tools and theoretical underpinnings for compiling a representative and lasting multipurpose record of a natural language or one of its varieties.”
Selected references

Himmelmann, Nikolaus P. 2006. Language documentation: What is it and what is it good for? In Gippert, Himmelmann and Mosel, eds. Essentials of language documentation, 1-30. Berlin: Mouton de Gruyter.

Himmelmann, Nikolaus P. 2002. Documentary and descriptive linguistics (full version). In Osamu Sakiyama and Fubito Endo, eds. Lectures on endangered languages: 5 (Endangered Languages of the Pacific Rim, Kyoto, 2002).
Himmelmann, Nikolaus P. 1998. Documentary and descriptive linguistics. Linguistics 36.161-195.

Himmelmann, Nikolaus P., Jost Gippert and Ulrike Mosel. 2006. Editors’ preface. In Gippert, Himmelmann and Mosel, eds. Essentials of language documentation, v-vii. Berlin: Mouton de Gruyter.

List: researchers



Linguists, digital data


Austronesianists
Carl Rubino's list of linguists working on Philippine languages 
Himmelmann, Nikolaus P. ANU and Ruhr-U Bochum paper on Tagalog zero anaphora (transitives)
J. Stephen Quakenbush. Agutaynen, SIL. Endangered AN
Starosta, Stanley. Austronesian ‘Focus’ as Derivation: Evidence from Nominalization. Language and Linguistics 3.2:427-479, 2002.   Formosan and Philippine examples, seamless morphology.



General Linguistics
Lauri Karttunen. Word Play. ACL Lifetime Achievement Award talk. 26 pp.
PARC, Stanford.
Reviews two lines of research that lie at opposite ends of the field: semantics and morphology. The semantic part deals with issues from the 1970s such as discourse referents, implicative verbs, presuppositions, and questions. The second part presents a brief history of the application of finite-state transducers to linguistic analysis, starting with the advent of two-level morphology in the early 1980s and culminating in successful commercial applications in the 1990s. It offers some commentary on the relationship, or the lack thereof, between computational and paper-and-pencil linguistics. The final section returns to the semantic issues and their application to currently popular tasks such as textual inference and question answering.
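As a reminder of what the finite-state thread means in practice, here is a toy transducer-style rule in Python; the rule, states, and examples are my own illustration, not Karttunen's systems:

    # Toy illustration of the finite-state idea behind two-level morphology:
    # epenthesize 'e' between a stem-final sibilant and the plural -s.
    SIBILANTS = set("sxz")

    def pluralize(lexical):
        """Map a lexical form like 'fox+s' to its surface form."""
        surface = []
        state = "stem"  # the finite control: 'stem' or 'after_sibilant'
        for ch in lexical:
            if ch == "+":  # morpheme boundary: realized as nothing...
                if state == "after_sibilant":
                    surface.append("e")  # ...or as epenthetic 'e'
                continue
            surface.append(ch)
            state = "after_sibilant" if ch in SIBILANTS else "stem"
        return "".join(surface)

    assert pluralize("fox+s") == "foxes"
    assert pluralize("cat+s") == "cats"

A real two-level system compiles rules like this into transducers and composes them; the character-by-character finite control is the essential idea.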
Historical Linguistics


Departments

Payap U, home of WeSay software project with SIL

List: Proceedings



Papers from the Workshop on Web-Based Language Documentation and Description


miscellaneous links

The Linguist List calls and conferences


Saturday, September 6, 2008

Incremental Sigmoid Belief Networks

[python obj model  time   01:09:10]

talk at Google Tech Talks
James Henderson, U Geneva

ISBNs provide a powerful method of feature induction.

This talk reminds me that I have a lot to learn about statistical processing and machine learning.

new terms:

marginalize
... related to summing over all data, which is avoided
fully factorized
... without any links
beam search
... look at the 100 best options with each new word (see the sketch below)
branching factor
... limits blow-up
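A minimal sketch of the beam idea (the extend callback is a placeholder for whatever the parser's transition model provides; nothing here is Henderson's code):

    import heapq

    def beam_search(words, extend, beam_width=100):
        """Keep only the beam_width best partial analyses as each new
        word is consumed. `extend(hyp, word)` is assumed to yield
        (new_hyp, log_prob_increment) pairs."""
        beam = [(0.0, ())]  # (cumulative log-prob, partial analysis)
        for word in words:
            candidates = [(score + inc, new_hyp)
                          for score, hyp in beam
                          for new_hyp, inc in extend(hyp, word)]
            # Pruning to beam_width is what keeps the branching factor
            # from blowing up.
            beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        return max(beam, key=lambda c: c[0]) if beam else None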

Has ability to pass features.

Simple Synchrony Networks (Henderson 2003) are claimed to be a strictly feed-forward approximation, equivalent to neural networks (presumably with backpropagation).

The means in the mean field approximation turn out to be equivalent to the activation value of an edge in the neural network, using discrete (0-1) random variables.
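A small numeric sketch of that equivalence, assuming logistic units and invented weights: the mean-field mean of each binary (0-1) variable is the sigmoid of the weighted means of its inputs, which is exactly a feed-forward activation.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Invented two-layer example.
    weights = [[0.5, -1.2], [2.0, 0.3]]  # hidden unit i <- visible unit j
    bias = [0.1, -0.4]
    visible_means = [1.0, 0.0]           # observed 0-1 variables, clamped

    # Mean of each binary hidden variable under the mean field
    # approximation; each doubles as the activation of the corresponding
    # unit in the equivalent feed-forward network.
    hidden_means = [
        sigmoid(bias[i] + sum(w * v for w, v in zip(weights[i], visible_means)))
        for i in range(2)
    ]
    print(hidden_means)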

Perhaps hidden variables and visible variables capture the intuition about structural and substructural (substratal? features implicit from lexicon) analysis.

Using models with vectors of typed features, rather than trying to induce a grammar on atomic symbols.

Software

SSN Statistical Parser: A broad coverage natural language syntactic parser. 
ISBN Dependency Parser: The statistical dependency parser described in [Titov and Henderson, IWPT 2007] and evaluated in [Titov and Henderson, EMNLP-CoNLL 2007]. 

I. Titov and J. Henderson. A Latent Variable Model for Generative Dependency Parsing. In Proc. International Conference on Parsing Technologies (IWPT 2007), Prague, Czech Republic, 2007.

I. Titov and J. Henderson. Fast and Robust Multilingual Dependency Parsing with a Generative Latent Variable Model. In Proc. Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), Prague, Czech Republic, 2007. (CoNLL Shared Task, 3rd result out of 23)


James Henderson, Peter Lane
A Connectionist Architecture for Learning to Parse (1998)  (8 citations)
Dept of Computer Science, Univ of Exeter  PDF

Grammatical Frameworks

Reference sites

a general ConstructionGrammar site

Laura A. Michaelis pubs
Zwicky: "Dealing out meaning", on construction grammar (Berkeley Linguistics Society, 1994). Also local
Benjamin K. Bergen (UH Manoa) pubs

HPSG


DELPH-IN. Edited collection
 
Tibor Kiss Ruhr-U Bochum

some papers
  • Graham Wilcock. An OWL Ontology for HPSG. University of Helsinki. [integrated with an existing OWL ontology, GOLD, as a community of practice extension.]


Miscellaneous People

Arnold Zwicky, the founder of the OUTiL (OUT in Linguistics) mailing list

video talks

01:17:25 From: UserGroupsatGoogle

01:00:58 From: googletechtalks

26:59 From: pycon08

Google Tech Talks, February 28, 2008
55:59

ABSTRACT

Treebank parsing can be seen as the search for an optimally refined grammar consistent with a coarse training treebank. We describe a method in which a minimal grammar is hierarchically refined using EM to give accurate, compact grammars. The resulting grammars are extremely compact compared to other high-performance parsers, yet the parser gives the best published accuracies on several languages, as well as the best generative parsing numbers in English. In addition, we give an associated coarse-to-fine inference scheme which vastly improves inference time with no loss in test set accuracy.

Slides: http://www.eecs.berkeley.edu/~petrov/...

Speaker: Slav Petrov
Slav Petrov is a Ph.D. Candidate at University of California Berkeley Dept of Computer Science, where he is also a research assistant working with Dan Klein and Jitendra Malik on inducing latent structure for perception problems in vision and language. 
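The coarse-to-fine inference scheme mentioned in the abstract is easy to sketch schematically. This is my own reconstruction of the general technique, not Petrov's implementation; posteriors is a placeholder for a real inside-outside chart parser, and the "NP-3 -> NP" naming is assumed for illustration:

    def coarse_to_fine(sentence, coarse_grammar, fine_grammar, posteriors,
                       threshold=1e-4):
        """Schematic coarse-to-fine pruning: parse cheaply with the coarse
        grammar, then restrict the refined parse to chart items whose
        coarse projection had non-negligible posterior probability.
        `posteriors(sentence, grammar)` is assumed to return a dict
        mapping (span, label) chart items to posterior probabilities."""
        coarse_items = posteriors(sentence, coarse_grammar)
        allowed = {item for item, p in coarse_items.items() if p >= threshold}
        fine_items = posteriors(sentence, fine_grammar)
        # Keep a refined item only if its coarse projection survived pruning.
        return {(span, label): p for (span, label), p in fine_items.items()
                if (span, label.split("-")[0]) in allowed}

The cheap coarse pass prunes most of the refined search space, which is where the claimed speedup without accuracy loss comes from.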



Google Tech Talks, April 17, 2008 ABSTRACT Modeling human sentence-processing can help us (more)
49:35

ABSTRACT
Modeling human sentence-processing can help us both better understand how the brain processes language, and also help improve user interfaces. For example, our systems could compare different (computer-generated) sentences and produce ones that are easiest to understand.
I will talk about my work on evaluating theories about syntactic processing difficulty on a large eye-tracking corpus, and present a model of sentence processing which uses an incremental, fully connected parsing strategy.
Speaker: Vera Demberg
Vera Demberg is a Ph.D. student in Computational Linguistics from the University of Edinburgh, Scotland. Her research focus is on building computational models of human sentence processing.
Vera obtained a Diplom (MSc) in Computational Linguistics from Stuttgart University, and a MSc in Artificial Intelligence from the University of Edinburgh. She has published papers in a number of top venues for language processing and psycholinguistic research, including ACL, EACL, CogSci and Cognition.
For her PhD research, she's been awarded the AMLaP Young Scientist Award for best platform presentation by a junior scientist. She was a finalist for the Google Europe Anita Borg Memorial Scholarship in 2007.

Short videos

08:52 From: lingosteve
02:20 From: lingosteve

Python

01:40:15 From: googletechtalks

01:06:41  From: googletechtalks

Cognitive Science

01:37:42 From: googletechtalks

01:02:13
ABSTRACT

Neurocomputational models provide fundamental insights towards understanding the human brain circuits for learning new associations and organizing our world into appropriate categories. In this talk I will review the information-processing functions of four interacting brain systems for learning and categorization:

(1) the basal ganglia, which incrementally adjust choice behaviors using environmental feedback about the consequences of our actions,

(2) the hippocampus, which supports learning in other brain regions through the creation of new stimulus representations (and, hence, new similarity relationships) that reflect important statistical regularities in the environment,

(3) the medial septum, which works in a feedback loop with the hippocampus, using novelty detection to alter the rate at which stimulus representations are updated through experience, and

(4) the frontal lobes, which provide for selective attention and executive control of learning and memory.

The computational models to be described have been evaluated through a variety of empirical methodologies including human functional brain imaging, studies of patients with localized brain damage due to injury or early-stage neurodegenerative diseases, behavioral genetic studies of naturally-occurring individual variability, as well as comparative lesion and genetic studies with rodents. Our applications of these models to engineering and computer science include automated anomaly detection systems for mechanical fault diagnosis on US Navy helicopters and submarines as well as more recent contributions to the DoD's DARPA program for Biologically Inspired Cognitive Architectures (BICA).

Speaker: Dr. Mark Gluck
Mark Gluck is a Professor of Neuroscience at Rutgers University - Newark, co-director of the Rutgers Memory Disorders Project, and publisher of the public health newsletter, Memory Loss and the Brain. He works at the interface between neuroscience, psychology, and computer science, where his research focuses on the neural bases of learning and memory, and the consequences of memory loss due to aging, trauma, and disease. He is the co-author of "Gateway to Memory: An Introduction to Neural Network Models of the Hippocampus and Memory " (MIT Press, 2001) and a forthcoming undergraduate textbook, "Learning and Memory: From Brain to Behavior." He has edited several other books and has published over 60 scientific journal articles. His awards include the Distinguished Scientific Award for Early Career Contributions from the American Psychological Society and the Young Investigator Award for Cognitive and Neural Sciences from the Office of Naval Research. In 1996, he was awarded a NSF Presidential Early Career Award for Scientists and Engineers by President Bill Clinton. For more information,



Miscellaneous

Google Tech Talks, January 29, 2008 ABSTRACT IPv6 and the DNS Speaker: Suzanne Woolf (more)

[TRANSLATED] jQuery
Google Tech Talks, April 3, 2008 ABSTRACT jQuery is a JavaScript library that stand (more)
01:00:37


Google Tech Talks June 4, 2008 ABSTRACT In software engineering, aspects are concerns t (more)
40:12
ABSTRACT

In software engineering, aspects are concerns that cut across multiple modules. They can lead to the common problems of concern tangling and scattering: concern tangling is where software concerns are not represented independently of each other; concern scattering is where a software concern is represented in multiple remote places in a software artifact. Although aspect-oriented programming is relatively well understood, aspect-oriented modeling (i.e., the representation of aspects during requirements engineering, architecture, design) is still rather immature. Although a wide variety of approaches to aspect-oriented modeling have been suggested, there is, as yet, no common consensus on how aspect-oriented models should be captured, manipulated and reasoned about. This talk presents MATA (Modeling Aspects Using a Transformation Approach), which is a unified way of handling aspects for any well-defined modeling language. The talk will argue why MATA is necessary and highlight some of the key features of MATA. In particular, the talk will motivate the decision to base MATA on graph transformations and will describe an application of MATA to modeling security concerns.
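Since the abstract leans on graph transformations, here is a toy sketch of the flavor; this is my own illustration of graph rewriting applied to a cross-cutting concern, with invented model, rule, and names, not MATA's actual rule language:

    # A model as a set of labeled edges; weaving a cross-cutting concern
    # (auditing) is a graph rewrite that redirects matching edges.
    model = {("User", "calls", "login"), ("Admin", "calls", "login")}

    def weave(model, target, advice):
        """Redirect every 'calls' edge into `target` through `advice`."""
        woven = set()
        for (src, label, dst) in model:
            if label == "calls" and dst == target and src != advice:
                woven.add((src, "calls", advice))    # rewrite: go via the aspect
                woven.add((advice, "calls", target))
            else:
                woven.add((src, label, dst))
        return woven

    print(sorted(weave(model, "login", "audit")))
    # Both User and Admin call paths now pass through 'audit': one rule,
    # applied wherever the pattern matches, instead of scattered edits.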

Speaker: Jon Whittle
Prof. Jon Whittle joined Lancaster University in August 2007 as a Professor of Software Engineering. Previously, he was an Associate Professor at George Mason University, Fairfax, VA, USA, and, prior to that, he was a researcher and contractor technical area lead at NASA Ames Research Center. In July 2007, he was awarded a highly prestigious Wolfson Merit Award from the Royal Society in the UK. Jon's research interests are in model-driven software development, formal methods, secure software development, requirements engineering and domain-specific methods for software engineering. His research has been recognized by a number of Best Paper awards, including the IEE Software Premium prize (with João Araújo). He is Chair of the Steering Committee of the International Conference on Model-Driven Engineering, Languages and Systems and has been a program committee member of this conference since 2002 (including experience track PC chair in 2006). He has served on over 30 program committees for international conferences and workshops. He is an Associate Editor of the Journal of Software and Systems Modeling. Jon has also been a guest editor of the IEEE Transactions on Software Engineering, the Journal of Software Quality, and has co-edited two special issues of the Journal of Software and Systems Modeling.


browsed googletechtalks until 300 (WINE conf 2007)

Convergent Grammar

[convert into an essay or review article at Linguistic Exploration]

CVG Course at ESSLLI 08, Day 1 Slides

(23) Convergent Grammar (CVG): a Look Ahead

• CVG is closely related to both ACG and HPSG.
• Like ACG—but unlike other frameworks descended from EMG—CVG uses Curry-Howard proof terms (which we will explain) to denote NL syntactic entities.
• This makes it easy to connect CVG to mainstream generative grammar because the proof terms are really just a more precise version of EST/GB-style labelled bracketings.
• Like HPSG—but unlike other frameworks descended from EMG—the relation between syntax and semantics is not a function, but rather is one-to-many.

Curry-Howard Correspondence

(30)
The basic ideas of CH are that, if you let the atomic formulas be the types of a TLC, then
1. a formula is the same thing as a type.
2. A formula A has a proof iff there is a combinator (closed term containing no basic constants) of type A.
• Hence the Curry-Howard slogan: formulas = types, proofs = terms

(34)
• Variables correspond to hypotheses.
• Basic constants correspond to nonlogical axioms.
• Derivability of Γ ⊢ a : A corresponds to A being provable from the hypotheses in Γ.
• Application corresponds to Modus Ponens.
• Abstraction corresponds to Hypothetical Proof.
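A tiny sketch of the slogan in Python's type notation (my own illustration, not CVG machinery): a closed term of a type is a proof of the corresponding formula.

    from typing import Callable, TypeVar

    A = TypeVar("A")
    B = TypeVar("B")

    # The K combinator is a closed term of type A -> (B -> A),
    # hence a proof of the formula A -> (B -> A).
    def k(a: A) -> Callable[[B], A]:
        return lambda b: a

    # Application is Modus Ponens: from f : A -> B and a : A, infer f(a) : B.
    def modus_ponens(f: Callable[[A], B], a: A) -> B:
        return f(a)

    # By contrast, there is no closed term of type A -> B for arbitrary
    # A and B, matching the unprovability of the formula A -> B.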

(31) Notation for ND Proof Theory

• An ND proof theory consists of inference rules, which have premisses and a conclusion.
• An n-ary rule is one with n premisses, and a 0-ary rule is called an axiom.
• Premisses and conclusions have the format of a judgment:

                        Γ ⊢ a : A

read ‘a is a proof of A with hypotheses Γ’.
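Written out in that judgment format (these are the standard TLC natural-deduction rules, not copied from the slides), the two correspondences from (34) are:

    Γ ⊢ f : A → B        Γ ⊢ a : A
    -------------------------------  (Application = Modus Ponens)
    Γ ⊢ f a : B

    Γ, x : A ⊢ b : B
    ----------------------  (Abstraction = Hypothetical Proof)
    Γ ⊢ λx.b : A → B

For example, starting from the hypothesis x : A ⊢ x : A, Abstraction discharges the hypothesis and yields ⊢ λx.x : A → A, a closed proof of A → A.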

Autonomy of syntax is possible, but not at the granularity of word strings. The syntactic parallel to semantic hypotheses must be both words and constructions.

A deterministic functional interface from syntax to semantics seems to be less of a fit to real language than a non-deterministic relational interface. [find slide to quote]

"variables correspond to hypothesis" — does this mean the granularity is that established by referential indexes? Does it make sense to consider a finer granularity? Does this tie up with DRT?

Perhaps we retain the granularity of referential indexes, but consider implicit relations contributed by the constructions (and also by the individual words). This explains why schema instances in working memory have the granularity they have. There may be some finer granularity less accessible to consciousness, but the level of folk semantics demands an explanatory account.

(38) ND-Style Syntax

• The inference rules are the syntax rules.
• The formulas/types are the syntactic categories.
• The proofs/terms are the syntactic expressions.
• The basic constants are the syntactic words.
• The variables are traces.
• The context of a judgment is the list of traces still unbound at that point in the proof.

This slide confirms the granularity of variables as referential indexes or traces. Can at least some of the syntax rules (for phrase structure) be considered as types and hypotheses from the lexicon, rather than as inference rules for constructing results? You still need inference rules on how to combine the word level and the construction level. And the elements of the construction semantics may be implicit, below the surface.
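One way to experiment with "the inference rules are the syntax rules" is a toy derivation checker; this is a generic categorial sketch with an invented lexicon, not CVG's actual rule set:

    # Categories are formulas, lexical entries are axioms, and the one
    # inference rule is application (modus ponens).
    LEXICON = {
        "Kim": "NP",
        "Sandy": "NP",
        "likes": ("NP", ("NP", "S")),  # NP -> (NP -> S); typing invented
    }

    def apply_rule(fun_cat, arg_cat):
        """Application: a term of type (A -> B) plus a term of type A gives B."""
        if isinstance(fun_cat, tuple) and fun_cat[0] == arg_cat:
            return fun_cat[1]
        return None  # the rule does not apply

    vp = apply_rule(LEXICON["likes"], LEXICON["Sandy"])  # ("NP", "S")
    s = apply_rule(vp, LEXICON["Kim"])                   # "S"
    assert s == "S"  # "Kim likes Sandy" is derivable as an S

Treating constructions as further axioms alongside the words would be the natural place to test the granularity question raised above.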

(40) Basic Categories

• To get started: S, NP, and N. Others will be added as needed.
• Here we ignore morphosyntactic details such as case, agreement, and verb inflection.
• In a more detailed CVG, these would be handled (much as in pregroup grammars) by subtyping.

"pregroup grammars"? Does this subtyping refer to feature structures at a finer granularrity?

(41) Function Categories

• As in many frameworks (RG, HPSG, LFG, DG, traditional grammar), grammatical functions (gramfuns) like subject and complement are treated as theoretical primitives.
• To start we just assume the gramfuns subject (s) and complement (c). Others will be added as needed.

Can gramfuns be extended to handle thematic roles? Word-specific participant roles? This could be a finer-grained semantics, capturing the insights from lexical semantics and its data.

(56) An Embedded Constituent Question

⊢ [what_fill t (s Kim (likes t c))] : Q

• Here what is an operator of type NPQS: it combines with an S containing an unbound NP trace to form a Q, while binding the trace.
• Notice that what is not analyzed as a “projection” of a “functional category”: there is no null complementizer with respect to which the operator is a “specifier”.

Can something like this be used to analyze "ang" in "babae ang bumili ng lasones"?


Day 2

(2) Some Examples of Overt Movement

a. John_i, Fido bit t_i. [Topicalization]
b. I wonder [who_i Fido bit t_i]. [Indirect Question]
c. Who_i did Fido bite t_i? [Direct Question]
d. The neighbor [who_i Fido bit t_i] was John. [Relative Clause]
e. Felix bit [who(ever)_i Fido bit t_i]. [Free Relative]
f. It was John [who_i Fido bit t_i]. [Cleft]
g. [Who_i Fido bit t_i] was John. [Plain Pseudocleft]
h. [Who_i Fido bit t_i] was he bit John. [Amalgamated Pseudocleft]
i. [[The more cats]_i Fido bit t_i], [[the more dogs]_j Felix scratched t_j]. [Left and right sides of Correlative Comparatives]

In all these examples, the expression on the left periphery that is coindexed with the trace is called the filler, or extractee, or dislocated expression.

It seems likely that the dislocated "ang" in the Tagalog construction above can be analyzed as Overt Movement in this sense.

This list seems to capture what Goldberg 1995 referred to as nonbasic constructions:

... it is not being claimed that all clause-level constructions encode scenes basic to human experience. Nonbasic clause-level constructions such as cleft constructions, question constructions, and topicalization constructions (and possibly passives) are primarily designed to provide an alternative information structure of the clause by allowing various arguments to be topicalized or focused. Thus children must also be sensitive to the pragmatic information structure of the clause (Halliday 1967) and must learn additional constructions which can encode the pragmatic information structure in accord with the message to be conveyed. These cases are not discussed further here (cf. Lambrecht 1987, 1994).

This would hint that passives are not to be analyzed the same way. Also, the intentional "design" might be glossed as "statistically selected to fill the social function." What representation might be suitable for this pragmatic information structure?

The dislocated "ang" introduces a trace for the intiator of the event of the specified VP, and the predicative noun characterizes that initiator with a common noun type. It is a bit like the Free Relative construction, but with a nominal-type predicate instead of a subject-specified action-verb predicator. So the pragmatic information might be glossed: "[(it) (was) a woman]i [whoeveri bought the lanzones fruit]"



Friday, September 5, 2008

Onomasiology

I learned a new word today: onomasiology,
"a branch of linguistics concerned with the question 'how do you express X?' ... as a part of lexicology, [it] departs from a concept (i.e. an idea, an object, a quality, an activity etc.) and asks for its names. The opposite approach is known as semasiology: here one departs from a word and asks what it means, or what concepts the word refers to."
It seems to be a field relevant to distinctions within a lexical field.

English          Cebuano  Tagalog
Seed             Lisu     Butó
Bone (Tetrapod)  Bukóg    Butó
Bone (Fish)      Bukóg    Tiník
Thorn            Tunók    Tiník

I should gather data for these concepts from several Central Philippine languages, and see what it tells me about shared derived characters.
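Even this spreadsheet-sized comparison can be scripted. A minimal sketch using only the forms from the table above (further Central Philippine languages would just be extra entries):

    from collections import defaultdict

    # concept -> {language: form}; data from the table above.
    FORMS = {
        "seed":            {"Cebuano": "lisu",  "Tagalog": "butó"},
        "bone (tetrapod)": {"Cebuano": "bukóg", "Tagalog": "butó"},
        "bone (fish)":     {"Cebuano": "bukóg", "Tagalog": "tiník"},
        "thorn":           {"Cebuano": "tunók", "Tagalog": "tiník"},
    }

    for concept, by_lang in FORMS.items():
        groups = defaultdict(list)
        for lang, form in by_lang.items():
            groups[form].append(lang)
        # Languages sharing a form for a concept are candidates for a
        # shared character, though only shared innovations subgroup.
        print(concept, dict(groups))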

"The coinage of a new designation can be incited by various forces (cf. Grzega 2004):

  • difficulties in classifying the thing to be named or attributing the right word to the thing to be named, thus confusing designations
  • fuzzy difference between superordinate and subordinate term due to the monopoly of the prototypical member of a category in the real world
  • everyday contact situations
  • institutionalized and non-institutionalized linguistic pre- and proscriptivism
  • flattery
  • insult
  • disguising things (i.e. euphemistic language, doublespeak)
  • taboo
  • avoidance of words that are phonetically similar or identical to negatively associated words
  • abolition of forms that can be ambiguous in many contexts
  • word play/punning
  • excessive length of words
  • morphological misinterpretation (creation of transparency by changes within a word = folk-etymology)
  • deletion of irregularity
  • desire for plastic/illustrative/telling names for a thing
  • natural prominence of a concept
  • cultural-induced prominence of a concept
  • changes in the world
  • changes in the categorization of the world
  • prestige/fashion (based on the prestige of another language or variety, of certain word-formation patterns, or of certain semasiological centers of expansion)

The following alleged motives found in many works have been shown to be invalid by Grzega (2004): decrease in salience, reading errors, laziness, excessive phonetic shortness, difficult sound combinations, unclear stress patterns, cacophony."

  - Onomasiology, Wikipedia, citing:

Grzega, Joachim (2004), Bezeichnungswandel: Wie, Warum, Wozu? Ein Beitrag zur englischen und allgemeinen Onomasiologie. Heidelberg: Winter, ISBN 3-8253-5016-9. (reviewed by Bernhard Kelle in Zeitschrift für Dialektologie und Linguistik vol. 73.1 (2006), p. 92-95)

Web journals touching on lexical semantics and computational linguistics





Linguistik Online
 -  Helpful Internet Sources: links to online dictionaries and linguistic atlases


Thursday, September 4, 2008

papers on the Web September 2008

Brett Kessler

Kessler's encyclopedia entry on Language Families

Brett Kessler. (in press). Language Families. In Hogan, P. C. (Ed.). The Cambridge Encyclopedia of the Language Sciences.
http://spell.psychology.wustl.edu/~bkessler/CELS/Language_Families.pdf

He points out that it is a model of divergence, not similarity, that is the basis of family relationships.
"... a particular model of LANGUAGE CHANGE: divergence. When innovations in one part of a language community fail to spread to other parts, differences accumulate until the community can be said to speak different languages."
"there is no requirement that cognates be similar at all (e.g., English two is related to Armenian yerku), and many sources of similarity are disavowed as being irrelevant to the model. These include borrowing (see CONTACT, LANGUAGE), onomatopoeia, universals (ABSOLUTE AND STATISTICAL UNIVERSALS), and chance similarities."
The problem I am interested in, related to the Bisayan subfamily, is called cladogenesis: "Subgrouping seeks to uncover the history of the divergence (cladogenesis) of a language family." What is important is not to look for cognate sets, but for a sister subfamily that has a base character rather than a member of the cognate set.
"the linguist looks for evidence that some proper subset of those languages may have descended from an intermediate common ancestor. This is done by looking for shared innovations (synapomorphies) – sound changes or new words or grammatical constructions that were not in the ancestor language but are found in two or more of the descendant languages."
"some of Greenberg’s key ideas can be transformed into algorithmic (reproducible) methodologies that introduce to language family research the benefit of statistical significance testing. Oswalt’s procedure (1998) minimized experimenter bias by requiring that a specific concept list be used and that one specify in advance specific criteria for measuring degree of similarity between two languages. Baxter and Manaster Ramer (2000) added reliable significance testing procedures based on randomization tests. Kessler and Lehtonen (2006) adapted the technique to handle multiple languages in a single test, informally confirming Greenberg’s claim that such large-scale comparisons are inherently more powerful than two-language comparisons. Ringe (1992; see Kessler 2001 for extensive discussion and methodological refinements) measured not similarity but the number of recurrent sound correspondences. This has the advantages both of being closer to the traditional comparative method and of generating correspondences useful for subgrouping and reconstruction. Disappointingly, however, none of these neo-Greenbergian techniques found evidence for the deep relations that were advertised for the original, impressionistic, method."
There may be hope for quantifying evidence.
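The randomization-test idea credited to Baxter and Manaster Ramer above is simple enough to sketch. The similarity measure and the four-item lists below are toys of my own; a real study would use a principled metric and a fixed concept list (cf. Kessler 2001):

    import random

    def similarity(list_a, list_b):
        """Toy measure: count aligned pairs whose forms share a first letter."""
        return sum(a[0] == b[0] for a, b in zip(list_a, list_b))

    def randomization_test(list_a, list_b, trials=10000, seed=0):
        """Estimate how often chance alignments score at least as well
        as the true concept-by-concept alignment."""
        rng = random.Random(seed)
        observed = similarity(list_a, list_b)
        hits = 0
        for _ in range(trials):
            shuffled = list_b[:]
            rng.shuffle(shuffled)  # destroy the concept alignment
            if similarity(list_a, shuffled) >= observed:
                hits += 1
        return hits / trials  # estimated p-value

    p = randomization_test(["liso", "bukog", "bukog", "tunok"],
                           ["buto", "buto", "tinik", "tinik"])
    print(p)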
"The recent development of computational cladistic methods similar to those used in biology (e.g., Ringe, Warnow, and Taylor 2002) is a tremendous advance in helping the linguist find optimal trees. In addition, several solutions to the problem of borrowing have emerged in the form of programs that construct networks instead of trees. Shared innovations that cannot be cleanly attributed to a shared ancestor are taken as evidence of contact, obviating somewhat the need to make a priori judgments about whether borrowing was involved (e.g., Bryant, Philimon, and Gray 2005; Nakhleh, Ringe, and Warnow 2005)."
"Recent computer techniques add simplicity, reproducibility, and quantitative rigor to methodologies for proving relationships between languages, but so far there has been no noticeable increase in power over what experts are able to do by hand."
References
  • Baxter, William H. and Alexis Manaster Ramer. 2000. “Beyond Lumping and Splitting: Probabilistic Issues in Historical Linguistics.” In Time Depth in Historical Linguistics, ed. Colin Renfrew, April McMahon and Larry Trask, 167–188. Cambridge, England: McDonald Institute for Archaeological Research.
  • Blust, Robert. 1999. “Subgrouping, Circularity and Extinction: Some Issues in Austronesian Comparative Linguistics.” In Selected Papers from the Eighth International Conference on Austronesian Linguistics, ed. E. Zeitoun and P. J. K Li, 31–94. Taipei: Academia Sinica.
  • Bryant, David, Flavia Filimon, and Russell D. Gray. 2005. “Untangling Our Past: Languages, Trees, Splits and Networks.” In The Evolution of Cultural Diversity: A Phylogenetic Approach, ed. Ruth Mace, Clare J. Holden, and Stephen Shennan, 69–85. London: UCL Press.
  • Cavalli-Sforza, Luigi Luca, Paolo Menozzi, and Alberto Piazza, 1994. The History and Geography of Human Genes. Princeton University Press
  • Gordon, Raymond G., Jr. (ed.). 2005. Ethnologue: Languages of the World. 15th ed. Dallas, TX: SIL International. Content also available online at http://www.ethnologue.com/
  • Greenberg, Joseph H. 1963. “The Languages of Africa.” International Journal of American Linguistics, supplement 29(1), pt. 2.
  • Greenberg, Joseph H., 1987. Language in the Americas. Stanford (CA): Stanford University Press.
  • Greenberg, Joseph H., 2002. Indo-European and its Closest Relatives: the Eurasiatic Language Family: Lexicon. Stanford, CA: Stanford University Press.
  • Kessler, Brett, 2001. The Significance of Word Lists. Stanford, CA: Center for the Study of Language and Information.
  • Kessler, Brett and Annukka Lehtonen. 2006. “Multilateral Comparison and Significance Testing of the Indo-Uralic Question.” In Phylogenetic Methods and the Prehistory of Languages, ed. P. Forster and C. Renfrew, 33–42. Cambridge, England: McDonald Institute for Archaeological Research.
  • Mallory, J. P. 1989. In Search of the Indo-Europeans: Language, Archaeology and Myth. London: Thames & Hudson.
  • Nakhleh, Luay, Don Ringe, and Tandy Warnow. 2005. “Perfect Phylogenetic Networks: A New Methodology for Reconstructing the Evolutionary History of Natural Languages.” Language 81: 382–420.
  • Oswalt, Robert L., 1998. “A Probabilistic Evaluation of North Eurasiatic Nostratic.” In Nostratic: Sifting the Evidence, ed. J. C. Salmons and B. D. Joseph, 199–216. Amsterdam: Benjamins.
  • Renfrew, Colin. 1987. Archaeology and Language: The Puzzle of Indo-European Origins. London: Pimlico.
  • Ringe, Don A., Jr., 1992. On Calculating the Factor of Chance in Language Comparison. Philadelphia, PA: American Philosophical Society.
  • Ringe, Don, Tandy Warnow, and A. Taylor. 2002. “Indo-European and Computational Cladistics.” Transactions of the Philological Society 100: 59–129.
  • Swadesh, Morris, 1955. “Towards Greater Accuracy in Lexicostatistic Dating.” International Journal of American Linguistics 21: 121–37.
  • Thomason, Sarah Grey and Terrence Kaufman. 1988. Language Contact, Creolization, and Genetic Linguistics. Berkeley, CA: University of California Press.
Kessler's home page.

His thesis at Stanford was on:
Thesis title: Estimating the Probability of Historical Connections Between Languages. Available through UMI. However, a significantly revised version is published by CSLI Publications under the title The Significance of Word Lists: Statistical Tests for Investigating Historical Connections Between Languages and is distributed by The University of Chicago Press (2001; ISBN cloth 1-575862-99-9, paper 1-575863-00-6). From the preface: