@Matthew We'll need to maintain a reference data table, preferably through an admin web interface. I don't believe OA currently has one (though it will probably need one in future). The table would let us enter a list of root terms, plus the alternative terms that translate to each root term.
For example, when the clean URL engine is processing, say:
"Carbon Pollution Reduction Scheme Bill 2009"
It picks up the term "pollution" as a match against the alternative terms, finds the mapped root term "environment" and sets that as the topic.
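A rough sketch of that lookup in Python, just to illustrate the idea. The dictionary contents, the function name find_topic and the whole-word matching rule are all assumptions for illustration, not anything OA has today:

```python
import re

# Hypothetical sample of the alternative -> root mapping. The real data
# would come from the reference data table, not a hard-coded dict.
ALTERNATIVE_TO_ROOT = {
    "pollution": "environment",
    "wetlands": "environment",
    "customs tariff": "trade",
}


def find_topic(title):
    """Return the root term for the first alternative term found in the
    title, or None if nothing matches."""
    # Lower-case the title and reduce it to words separated by single spaces.
    normalised = " ".join(re.findall(r"[a-z0-9]+", title.lower()))
    for alternative, root in ALTERNATIVE_TO_ROOT.items():
        # Whole-word match, so "union" would not match inside "reunion".
        if re.search(r"\b" + re.escape(alternative) + r"\b", normalised):
            return root
    return None


print(find_topic("Carbon Pollution Reduction Scheme Bill 2009"))
```

This version simply returns the first match it finds, so it does not deal with a term that could belong to more than one topic.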
Another example:
"Nuclear Non-Proliferation"
"Nuclear" might match "energy" as the root term, or "defence". I'm open to ideas on how to resolve this conflict, but a straightforward priority ranking would probably do: if the parser picks up "energy" first, it would then be overridden by "defence" further down the taxonomy as the parser continues to match against the remaining root and alternative terms. I'm not sure that's robust enough; we'd need to test it once we've actually developed the taxonomy.
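One way the priority ranking could work, sketched in Python. The rows, the ROOT_PRIORITY numbers and the rule that the highest number wins are assumptions made up for this example; the actual ranking would need to be decided once the taxonomy exists:

```python
import re

# Hypothetical rows: (alternative term, root term).
# "nuclear" is deliberately mapped to two different roots.
TAXONOMY = [
    ("nuclear", "energy"),
    ("nuclear", "defence"),
    ("pollution", "environment"),
]

# Hypothetical ranking: when several roots match, the highest number wins.
ROOT_PRIORITY = {"environment": 1, "energy": 2, "defence": 3}


def resolve_topic(title):
    """Collect every root whose alternative term appears in the title,
    then return the one with the highest priority (or None)."""
    normalised = " ".join(re.findall(r"[a-z0-9]+", title.lower()))
    padded = " " + normalised + " "
    # Padding with spaces gives a whole-word (or whole-phrase) match.
    matched = {root for alt, root in TAXONOMY if " " + alt + " " in padded}
    if not matched:
        return None
    return max(matched, key=lambda root: ROOT_PRIORITY[root])


print(resolve_topic("Nuclear Non-Proliferation"))
```

With these sample rankings, "Nuclear Non-Proliferation" matches both "energy" and "defence", and "defence" wins because it is ranked higher.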
I'm guessing we'd have about 15 root words and a total of 50 alternative terms mapped to them.
More examples:
"Wetlands" would map to "environment"
"Nation building" would map to "infrastructure"
"Telecommunications Interception" would map to "law-enforcement"
"Customs Tariff" would map to "trade"
etc.
If a title can't be mapped, the topic would simply be omitted. We'd keep an eye on those cases and continue entering alternative terms into the taxonomy (and even expand the list of root words / topics if required) to cater for more keywords in debate titles as we notice them.
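To help with keeping an eye on unmapped titles, something like the following could report which words turn up most often in titles that got no topic. This is only a sketch; the function name unmapped_words and the sample mapping are hypothetical, and it only checks single-word alternative terms:

```python
import re
from collections import Counter

# Hypothetical sample mapping (single-word alternative terms only).
ALTERNATIVE_TO_ROOT = {"wetlands": "environment", "union": "employment"}


def unmapped_words(titles):
    """Count the words in titles where no alternative term matched, as
    candidates for new entries in the taxonomy."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        if not any(word in ALTERNATIVE_TO_ROOT for word in words):
            counts.update(words)
    return counts


report = unmapped_words(["Wetlands Protection", "Customs Tariff", "Customs Amendment"])
print(report.most_common(3))
```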
Does that make sense?
The actual table would look something like:
alternative, root
pollution, environment
wetlands, environment
carbon, environment
climate, environment
sorry day, indigenous
native land, indigenous
workplace, employment
union, employment
...
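If the table were exported as comma-separated text in that shape, loading it into a lookup could be as simple as the sketch below. The export format, the TABLE_TEXT sample and the function name load_taxonomy are assumptions; a real implementation would more likely read from the database behind the admin interface:

```python
import csv
import io

# Sample rows copied from the table above; a real export would have more.
TABLE_TEXT = """alternative, root
pollution, environment
wetlands, environment
carbon, environment
climate, environment
sorry day, indigenous
native land, indigenous
workplace, employment
union, employment
"""


def load_taxonomy(text):
    """Parse 'alternative, root' rows into a dict keyed by alternative term."""
    # skipinitialspace drops the space that follows each comma.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    mapping = {}
    for row in reader:
        mapping[row["alternative"].strip().lower()] = row["root"].strip().lower()
    return mapping


taxonomy = load_taxonomy(TABLE_TEXT)
print(taxonomy["sorry day"])
```

Keying the dict on the alternative term means each term maps to exactly one root, so a term like "nuclear" that could belong to two topics would need a different structure (for example a list of rows).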
I'm happy to work up a schema/model for clean URLs. Just assign the issue to me.