Saturday, November 1, 2014

Topic 8 - Subject Thesaurus Construction

Free software download for Thesaurus construction on a Windows PC only (!)



Subject Thesaurus Construction

Aspects in Need of Consideration

 

The subject field

Does your subject have easily defined boundaries? 
What are the core areas? 
How often do new developments take place? 
What are the peripheral areas? 
Are there any existing schemes in the subject field?

The Collection

What sort of documents/items/information objects do you have? 
Books? 
Electronic documents? 
Reports or continuing resources?  This may seem a strange thing to take into consideration but thinking about this will help you decide on the depth of indexing required.  A full-text database, for example, will need greater specificity in the indexing terms.

Language Considerations

Do the intended users of the thesaurus have any special language requirements? 
For example: will it be used by scientists who might prefer to use scientific terms, or will it be used by a user who is more familiar with everyday language usage?

Thesaurus Users

Is the thesaurus intended for use by end users or information professionals?  If it is aimed at end users it should be user friendly and the controlled language should be as unobtrusive as possible.  Use as many natural language terms and forms as you can.

Questions, Searches and Profiles

Ask yourself what sort of queries users will be making. 
Will they be of a general nature or will they be very specific? 
Your response will impact on the design of the thesaurus.

Resources

The really big question to consider is the amount of resources you have at your disposal.  Thesaurus construction is a costly exercise. 
Can your organisation afford it? 
What about staff?  Can you free up staff time so they have the time to do the job properly? 
What about your access to thesaurus software?

Adaptation

Is there a thesaurus available in the field?  If so, is it possible that it can be adapted?  This is a less costly option than producing one from scratch.  It may not be an ideal solution but it may be a compromise that you can make.

Once the Issues Have Been Considered, What’s Next?
If you have considered the above issues and have decided the construction of a new thesaurus is necessary, then remember that it cannot be done in a quick and dirty fashion.  A professional thesaurus should conform to the standards set down by the International Organization for Standardization (ISO).  There are also standards in individual countries which need to be met.  These standards cover all aspects of thesaurus construction such as word control, grammatical form, ambiguous terms and the use of explanatory notes.

Once you have your facet analysis, you are ready for the next stage in the process – turning it into a thesaurus.  


Vocabulary Control


Indexing terms

Let’s begin with the actual indexing terms – preferred and non-preferred.  In the ISO standards an indexing term is described as being the representation of a concept.  This is not a new idea as we have discussed this before in relation to subject indexing in general.  The representation can be made by using one word or a combination of words.  A preferred term is that which is consistently used to represent a concept.  The non-preferred form is usually a synonym (equivalent term).  In the literature this is also referred to as a non-descriptor.

Indexing terms are usually broken down into two types:

·      Concrete entities
·      Abstract concepts

Knowing which category a term belongs to is important.  Concrete entities are usually made up of things and their parts, e.g. cars and gear levers, or materials such as steel or plastic. 

Abstract concepts cover actions and events, abstract entities, properties of things, materials and actions.  An example of these might be strength, durability or management.  They also include disciplines and sciences such as law or physics.

At this point you might still be asking yourself why it matters which category a term is in.  Knowing the category helps you decide whether a term is going to be plural or singular in the thesaurus, and it also helps to verify the validity of the facet analysis.  For example: in the English language, concrete entities are usually nouns, and if you can ask how many of the item you can have, it is usually recorded in the plural form.  An exception to this, according to Aitchison, is body parts, where we have to modify our thinking and use terms such as ‘mouth’ or ‘renal system’ – the singular form.  If you have concrete entities such as ‘mercury’ or ‘water’, you can’t ask how many of them you can have, so they are always recorded in the singular.

Complex, isn’t it?
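The singular/plural rule above can be sketched in a few lines of Python. This is only an illustration, not part of any standard: the `Kind` categories and the `preferred_form` helper are hypothetical names, and keeping abstract concepts in the singular is an assumption based on the examples given (strength, durability, management).

```python
from enum import Enum


class Kind(Enum):
    COUNTABLE_ENTITY = "countable concrete entity"  # cars, gear levers
    MATERIAL = "non-count concrete entity"          # steel, water, mercury
    BODY_PART = "body part or system"               # mouth, renal system
    ABSTRACT = "abstract concept"                   # strength, management


def preferred_form(singular: str, plural: str, kind: Kind) -> str:
    """Pick the form to record, using the 'how many can you have?' test."""
    if kind is Kind.COUNTABLE_ENTITY:
        return plural
    # Materials, body parts (the exception attributed to Aitchison) and,
    # by assumption, abstract concepts are recorded in the singular.
    return singular


print(preferred_form("car", "cars", Kind.COUNTABLE_ENTITY))
print(preferred_form("mouth", "mouths", Kind.BODY_PART))
print(preferred_form("water", "waters", Kind.MATERIAL))
```

Deciding the category is still a human judgement; the code only records the consequence of that decision.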

Spelling

You must decide to adopt a particular version of the English language.  For example: American English or Australian English.

Punctuation

Punctuation should be avoided as much as possible as it can cause retrieval problems.  The hyphen is probably the cause of most difficulties which crop up.  If you leave it out, what are you going to replace it with?  You must decide whether to leave a space or join the two words together.  Whatever your decision, make sure you are consistent in your practice.
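One way to keep the hyphen decision consistent is to apply it in a single place. The sketch below is a hypothetical helper (`normalise_hyphens` and its `policy` values are made up for illustration) that either replaces a hyphen with a space or joins the two words.

```python
def normalise_hyphens(term: str, policy: str = "space") -> str:
    """Apply one hyphen policy to a term: 'space' or 'join'."""
    if policy == "space":
        replacement = " "
    elif policy == "join":
        replacement = ""
    else:
        raise ValueError("policy must be 'space' or 'join'")
    # Collapse any repeated whitespace left behind by the replacement.
    return " ".join(term.replace("-", replacement).split())


print(normalise_hyphens("on-line catalogues", policy="space"))
print(normalise_hyphens("on-line catalogues", policy="join"))
```

Whichever policy you choose, running every candidate term through the same function avoids mixed forms creeping into the thesaurus.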

Homographs

These are words which have the same spelling but have a different meaning.  For example: Cell (Biology) and Cell (Battery).  The use of qualifiers in brackets as shown in the previous sentence can help overcome problems of meaning.
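As a rough sketch of how bracketed qualifiers might be handled in software, the following Python splits an entry into its term and qualifier and groups homographs together. The function names are hypothetical and the pattern assumes a single qualifier at the end of the entry.

```python
import re
from collections import defaultdict

# A term followed by one qualifier in round brackets, e.g. "Cell (Biology)".
_QUALIFIED = re.compile(r"^(?P<term>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_qualifier(entry: str):
    """Return (term, qualifier); qualifier is None when there is none."""
    entry = entry.strip()
    match = _QUALIFIED.match(entry)
    if match:
        return match.group("term"), match.group("qualifier")
    return entry, None


def group_homographs(entries):
    """Map each bare term to the list of qualifiers that disambiguate it."""
    groups = defaultdict(list)
    for entry in entries:
        term, qualifier = split_qualifier(entry)
        if qualifier is not None:
            groups[term].append(qualifier)
    return dict(groups)


print(group_homographs(["Cell (Biology)", "Cell (Battery)", "Steel"]))
```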

Scope Notes
Scope notes should only be used when absolutely necessary.  They are used to explain how you want the preferred term to be used or to explain how the term is to be interpreted.  Scope notes should not be used to define terms on a regular basis.  If your facet analysis has been correctly carried out, you should not have to use a large number of scope notes.

Finishing Touches

When you have reached this point in the construction process, there should be a series of hierarchically structured facets which are ready to turn into a thesaurus.  We would recommend that all the conventional thesaurus relationships and their abbreviations are used:

Use
UF Used for
BT Broader term
NT Narrower term
RT Related term      NB: RTs never come from the same facet!

Following through on turning the facets into a thesaurus is the easy bit!  And with a bit of luck, you will have the software to do it for you.
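To give an idea of what such software does behind the scenes, here is a minimal Python sketch of a structure that stores the relationships listed above and keeps the reciprocal entries in step (BT with NT, USE with UF, RT in both directions). The class, its method names and the sample terms are illustrative assumptions, not taken from any particular thesaurus package.

```python
from collections import defaultdict


class Thesaurus:
    """Stores the conventional relationships and keeps reciprocals in sync."""

    def __init__(self):
        self.use = {}                     # non-preferred -> preferred (USE)
        self.used_for = defaultdict(set)  # preferred -> non-preferred (UF)
        self.broader = defaultdict(set)   # term -> broader terms (BT)
        self.narrower = defaultdict(set)  # term -> narrower terms (NT)
        self.related = defaultdict(set)   # term -> related terms (RT)

    def add_equivalence(self, non_preferred, preferred):
        self.use[non_preferred] = preferred
        self.used_for[preferred].add(non_preferred)

    def add_hierarchy(self, narrower, broader):
        self.broader[narrower].add(broader)
        self.narrower[broader].add(narrower)

    def add_related(self, a, b):
        self.related[a].add(b)
        self.related[b].add(a)

    def entry(self, term):
        """Return the display lines for one term."""
        if term in self.use:
            return [term, f"  USE {self.use[term]}"]
        lines = [term]
        for label, table in (("UF", self.used_for), ("BT", self.broader),
                             ("NT", self.narrower), ("RT", self.related)):
            lines += [f"  {label} {t}" for t in sorted(table.get(term, ()))]
        return lines


# Sample terms are made up for illustration.
thesaurus = Thesaurus()
thesaurus.add_equivalence("Automobiles", "Cars")
thesaurus.add_hierarchy("Gear levers", "Cars")
thesaurus.add_related("Cars", "Steel")  # different facets: entity and material
print("\n".join(thesaurus.entry("Cars")))
```

Adding one relationship through a method like `add_hierarchy` records both directions at once, which is the main chore that thesaurus software takes off your hands.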

You might like to look at the demonstrations available at the following site:


This is quite an interesting way of testing your facet analysis. At Curtin, we used this software in focus groups to assess reactions to terms in the new enterprise-wide classification scheme which will be used with the electronic document management system. The groups found it quite fun to use, and the added bonus is that we could save the groups' responses for further analysis, which made our scheme more accurate and user friendly.



Some Extra Hints on Creating an Indexing Language
DO consult colleagues at all stages and seek expert help in areas of your subject which are technical.

DO record all important decisions so that it is not necessary to go back and re-invent the wheel.

DO keep a sense of proportion.  This is a job which cannot be done perfectly.  Indexing languages need to be continually fine tuned.

DO be prepared to make the final decisions as the information specialist.  The information professional is the one who knows about indexing, not the office expert on widgets.  Find out about widgets from the expert for your widget manufacture thesaurus, then make decisions based on best indexing practice.

DON'T keep too many previous drafts.  Learn to tell a draft that contains a genuinely alternative analysis, which might prove worth going back to, from one that is a dead end.

DON'T get too hung up on the hard bits.  Have a place to note them but concentrate on getting the basic structure right.  Perhaps when that is achieved all will become clear.

DON'T expect to be done in a short time, and don't let the boss labour under the delusion that you will be.  You will both be disappointed.

DON'T get downhearted.

Appendix

Retrieval from the APAIS Thesaurus explanation: 


Reference      Abbrev  Meaning
USE                    Indicates the preferred term, e.g. Currency USE Money
Note                   Scope notes are used to indicate the meaning or application of certain descriptors.
USED FOR       UF      Indicates the non-preferred terms which the synonymous preferred term encompasses, e.g. Wildlife UF Fauna
BROADER TERM   BT      Indicates the name of the class of which the term is a member, e.g. Consumption tax BT Taxation
NARROWER TERM  NT      Indicates members of the class represented by the term, e.g. Women NT Aboriginal Women; Women NT Professional women
RELATED TERM   RT      Indicates concepts associated with the term but not related in a class membership way, e.g. Counselling RT Crisis centres; Counselling RT Social work
 Non preferred terms: ND refers to
Preferred term: PT
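
The APAIS examples above can be used to sketch how a retrieval system might apply these references at search time: a non-preferred term is switched to its preferred term via USE, and the search is then widened with the narrower terms. The dictionaries and the `resolve`/`expand` helpers below are hypothetical and hold only the examples quoted in the table.

```python
# Non-preferred term (lower-cased) -> preferred term.
USE = {"currency": "Money", "fauna": "Wildlife"}

# Preferred term -> narrower terms.
NARROWER = {"Women": ["Aboriginal Women", "Professional women"]}


def resolve(term: str) -> str:
    """Follow a USE reference if the term is non-preferred."""
    cleaned = term.strip()
    return USE.get(cleaned.lower(), cleaned)


def expand(term: str) -> list:
    """Return the preferred term followed by its narrower terms."""
    preferred = resolve(term)
    return [preferred] + NARROWER.get(preferred, [])


print(resolve("Currency"))
print(expand("Women"))
```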




Wednesday, October 29, 2014

Week 5 - RDF = Resource Description Framework

Great video on YouTube visualising the relationship between
RDFs
URLs
URNs
URIs

A URL locates, a URN names, and a URI identifies (it covers both locating and naming), which is something you can find out when you look at how they are named: Uniform Resource Identifier (URI), Uniform Resource Name (URN), Uniform Resource Locator (URL).
URIs are the basic building blocks of RDF, the Resource Description Framework.
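
A small Python sketch of the distinction, using the standard library's `urllib.parse`. The `classify` helper and the sample identifiers are made up for illustration, and treating every scheme other than `urn` as a locator is a deliberate simplification.

```python
from urllib.parse import urlparse


def classify(uri: str) -> str:
    """Very rough URI classification based on the scheme alone."""
    scheme = urlparse(uri).scheme.lower()
    if not scheme:
        return "not an absolute URI"
    if scheme == "urn":
        return "URN"  # names a resource without saying where it is
    return "URL"      # simplification: any other scheme is treated as a locator


for candidate in ("https://example.org/books/1",
                  "urn:isbn:0451450523",
                  "books/1"):
    print(candidate, "->", classify(candidate))
```

Both the URL and the URN are URIs; the function only reports which kind each one is.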

Each RDF statement is created from a Subject - Predicate - Object relation such as
"Jessica sells books"
Subject Predicate Object
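
To make the triple idea concrete, here is a minimal Python sketch that stores statements as (subject, predicate, object) tuples and answers simple pattern queries. It uses plain Python rather than an RDF library, and the `http://example.org/` URIs, the second statement and the `match` helper are assumptions made up for illustration.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


EX = "http://example.org/"  # hypothetical namespace

graph = {
    Triple(EX + "Jessica", EX + "sells", EX + "books"),
    Triple(EX + "Jessica", EX + "livesIn", EX + "Perth"),
}


def match(triples, s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.obj == o)
    )


# "What does Jessica sell?"
for triple in match(graph, s=EX + "Jessica", p=EX + "sells"):
    print(triple.obj)
```

Each part of the statement is a URI, so two datasets that use the same URI are talking about the same thing.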

Following Peter's presentation further, we get to the point where I doubt that we can talk about making computers understand ontology. What we actually do in trying to make computers "understand ontology" is to break complex semantics into bite-sized triplets which together leave no room for interpretation. The computer is forced to use yes/no, or binary, logic to come to unique conclusions. RDF is dumbing down complex logic into a set of triplets.

Interestingly, Peter makes a mistake that should lead to a wrong response in his slide about trust. In explaining the simple question, "Who is the President of the United States?", he takes the usual approach of the US American, forgetting that there might be a few more clues missing than explained in the longer statement, "Who is the President of the United States according to the latest trustworthy data I have?" Of course it should be "Who is the President of the United States of America according to the latest trustworthy data I have?" Humans know that United States is a short form for United States of America. While this may sound like splitting hairs, the crux of explaining things to computers is that you have to be as precise as possible to leave no room for alternatives. The room for ambiguity is what really creates the big problem.

Another problem I see is that one and the same sentence in English can have a set of different meanings. While we set the meanings apart by emphasising/stressing different parts of the sentence, depending on what we want to express, computers see the exact same sentence. A precarious example could be "Peter helped his grandfather to get off." While "getting off" has different meanings, the sentence doesn't give any further indication of how Peter helped his grandfather to get off, or of what he got off.

Further, considering the global nature of the Web, how will RDF help to mediate between different languages? Is the Subject Predicate Object triplet present in every language across the globe?

Also, what is commonly referred to as "big data" is stored in relational databases. Facebook, Google and a large number of social media outlets already know what you want by following your browser activity on the World Wide Web or in their confined spaces. Meaning is not added by triplets, but by analysing online behaviour. I regard this as a separate ontology about the consumer. While RDF provides semantics about things, big data provides semantics about humans.