Thursday, November 27, 2008

evo1: Vocalization Displays and the Recognition of Typed Object-Features

The evolution of language is difficult to explain. Language is a complex faculty with many aspects or components, and it is hard to see why rudimentary versions of those aspects would have conferred selectional advantages for differential reproduction in hominids. It would therefore help to identify preadaptations: characters that developed their complexity while serving some other function and were later recruited for the new function of language-based communication.

My working hypothesis is that the capacity for visual recognition of objects and scenes, specifically recognition of physical objects and behaving animals in the shared immediate social environment of hominids, gave rise to a complex of preadaptations for language. One preadaptation is 1) a shared conceptual scheme of individuation, under which certain scenes were classified as involving significant happenings or objects. This scheme would eventually give rise to the mental lexicon and the related capacity for semantic memory. Another is 2) the capacity to represent a scene and its individual objects in terms of typed structures of object-features. By the hyphenated term object-feature I mean higher-level visual features such as shape outline and component shape, as distinguished from low-level features such as edge, color, or texture. Both these preadaptations were refined and repurposed in the context of 3) recognizing social displays of individuals of the same species. Some displays, like pointing, have 4) a property of intentionality. Intentional displays are produced to trigger an observer's attunement to a relation between the structure of the display and a scene in which some salient aspects are referenced: physical objects in the immediate environment are referred to by a shift of gaze or a hand movement, and happenings are referenced in relation to a doer or an undergoer. Vocalization displays are clearly central to the evolution of language, but gestural displays, which support the repeated emergence of sign languages (among deaf and hearing communities), are also of research interest.
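To make the idea of a typed structure of object-features concrete, here is a minimal sketch in Python. The type names and attributes (Scene, RecognizedObject, ObjectFeature, and so on) are my own illustrative assumptions, not a claim about how visual recognition is actually encoded; the point is only that a scene can be represented as a typed structure whose parts are higher-level object-features rather than low-level edges, colors, or textures.

# Illustrative sketch: a scene represented as typed structures of
# object-features (shape outline, component shapes), as opposed to
# low-level features like edges, color, or texture.
# All type and attribute names here are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectFeature:
    """A higher-level visual feature, e.g. a shape outline or component shape."""
    feature_type: str          # e.g. "shape-outline", "component-shape"
    description: str

@dataclass
class RecognizedObject:
    """An individuated object, classified under a shared scheme of individuation."""
    object_type: str           # e.g. "animal", "tool"
    features: List[ObjectFeature] = field(default_factory=list)

@dataclass
class Scene:
    """A classified scene: a significant happening involving recognized objects."""
    happening_type: str        # e.g. "chasing", "approaching"
    participants: List[RecognizedObject] = field(default_factory=list)

# Example: a scene classified as a chase between two recognized animals.
prey = RecognizedObject("animal", [ObjectFeature("shape-outline", "quadruped silhouette")])
predator = RecognizedObject("animal", [ObjectFeature("component-shape", "elongated torso")])
scene = Scene("chasing", [predator, prey])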

How did the use of articulatory gestures in vocalization evolve into phonology? One possibility is song, perhaps evolved from gibbon-like vocalizations to mark territory. Tone becomes differentiated and worth remembering, and song is born. The continuous articulation of vowels adds another dimension to song, and this could eventually evolve into a discrete set of distinguished vowel phonemes. Why would consonants and syllabification be added to the mix? And why would song recognition and articulation capacities be repurposed for referring gestures, replacing eye gaze and pointing with nouns or noun phrases?

From studying the semantics-syntax interface, we can theorize about two large layers in the structure of utterances. These layers define the linguistic types that an utterance (or sign language expression) is recognized to instantiate. One layer is the referring expression, where a concrete object or abstraction is referred to and a corresponding mental concept takes a place in a shared information structure where it can be referred to again. In other words, referring expressions introduce referential indices into discourse and create the possibility of anaphoric resolution (a small code sketch of this follows the list below). We have three levels here:
  1. the etic level of a complex nominal utterance in the context where the speaker picks out a particular situation for the listener,
  2. the emic level of the types from the mental lexicon and language faculty that recognize the utterance as an instance of a structured phrasal type involving lexical items and a distinguished head item, and
  3. the level of the shared information state including referential indices.
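As a rough illustration of the third level, here is a sketch, in the same illustrative Python style, of how a referring expression could introduce a referential index into a shared information state, making later anaphoric resolution possible. The names (InformationState, introduce, resolve_anaphor) and the naive most-recent-match strategy are assumptions for exposition, with only a loose resemblance to discourse representation structures.

# Sketch of a shared information state holding referential indices.
# The class and method names here are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Referent:
    index: str                 # referential index, e.g. "x1"
    description: str           # content of the referring expression

@dataclass
class InformationState:
    referents: List[Referent] = field(default_factory=list)

    def introduce(self, description: str) -> Referent:
        """A referring expression introduces a fresh referential index."""
        ref = Referent(index=f"x{len(self.referents) + 1}", description=description)
        self.referents.append(ref)
        return ref

    def resolve_anaphor(self, description: str) -> Optional[Referent]:
        """Anaphoric resolution: pick the most recently introduced matching referent."""
        for ref in reversed(self.referents):
            if description in ref.description:
                return ref
        return None

# "A large dog" introduces x1; "the dog" later resolves back to x1.
state = InformationState()
state.introduce("large dog")
antecedent = state.resolve_anaphor("dog")   # -> Referent(index="x1", description="large dog")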
The other large layer is that of speech acts made with clausal utterances (or the sentences used in them), which introduce a semantics and pragmatics of what Searle calls conditions of satisfaction. In a request, the shared information state concerns a state which does not yet obtain but where the listener can act to fulfill those conditions. A promise mentions a possible future state where the speaker (or signer) will act to fulfill the conditions. And an assertive speech act calls attention to a state where the situation referred to (a visible scene, or some abstract situation) has its participants picked out by referential indices, standing in a happening-relation named by the head verb or predicator. Any modifiers also introduce properties of the participants or situation (a code sketch of this layer follows the list below). Returning to the three levels, at
  1. the etic level, a clausal utterance is used to perform a speech act between speaker and listener
  2. the emic level, a sentence or clause is recognized as consisting of a head verb, arguments that have dependency relations to that verb, and additional modifiers
  3. the shared information state level, of a discourse representation structure where referential indices are resolved to salient items in discourse, creating shared attunement to a described situation, which may be part of the concrete situation surrounding speaker and listener, or else a more abstract situation that does not (yet) exist in the physical surroundings.
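To ground the second layer in the same illustrative style, here is a sketch of a clausal utterance recognized at the emic level as a head verb with dependent arguments and modifiers, paired with very rough conditions of satisfaction for the three speech act types mentioned above. The class names and the satisfaction strings are assumptions for illustration, not a proposed analysis of Searle's account.

# Sketch: a clause as a head verb (predicator) with argument dependencies
# and modifiers, plus schematic conditions of satisfaction per speech act type.
# All names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Argument:
    role: str                  # dependency relation, e.g. "doer", "undergoer"
    referent_index: str        # referential index, e.g. "x1"

@dataclass
class Clause:
    head_verb: str             # the predicator naming a happening-relation
    arguments: List[Argument] = field(default_factory=list)
    modifiers: List[str] = field(default_factory=list)

@dataclass
class SpeechAct:
    act_type: str              # "assertion", "request", or "promise"
    clause: Clause

    def conditions_of_satisfaction(self) -> str:
        """Schematic conditions of satisfaction, varying with the act type."""
        relation = self.clause.head_verb
        if self.act_type == "assertion":
            return f"the described situation obtains: {relation}(...)"
        if self.act_type == "request":
            return f"the listener acts to bring it about that {relation}(...)"
        if self.act_type == "promise":
            return f"the speaker acts to bring it about that {relation}(...)"
        return "unspecified"

# Assertive use of "the dog quickly chased the cat": doer x1, undergoer x2.
clause = Clause("chase", [Argument("doer", "x1"), Argument("undergoer", "x2")], ["quickly"])
act = SpeechAct("assertion", clause)
print(act.conditions_of_satisfaction())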
This is the rudimentary model of language that I am using. I would need to demonstrate that a language faculty supporting this model is systematically related to existing visual recognition faculties (of objects, scenes, and social displays) that provided some preadaptations for the relatively rapid emergence of complex language among hominids.

First I will discuss typed feature structures in language and high-level vision; then I will discuss shared schemes of information in gestural displays and language. Future posts will cover:
  • Feature structures for phrasal features and visual object-features and scenes
  • Typing and semantic memory
  • Schemes of individuation
  • Shared schemes and the first-person character of perceptual qualia
  • Types in the mental lexicon
