wtf-plugin-classify
v2.1.0
basic classifier for wikipedia articles
This plugin uses a large number of heuristics to classify a Wikipedia article into a basic Person/Place/Thing scheme.
Things it looks at:
- infoboxes (like {{Infobox Person ...}})
- categories (like '[[Category:Canadian Saxophone Players]]')
- templates (like {{Liechtenstein-sport-bio-stub}})
- sections (like '==Early life==')
- titles (like 'John Smith (poet)')
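As an illustration of the last signal, here is a minimal sketch of pulling the parenthetical hint out of a title. `titleHint` is a hypothetical helper written for this example; it is not part of the plugin's API, and the plugin's own title handling may differ.

```javascript
// Hypothetical helper (not part of wtf-plugin-classify):
// returns the trailing parenthetical of a title, lower-cased, or null if there is none.
function titleHint(title) {
  // match a final "(...)" group with no nested parentheses
  const match = title.match(/\(([^()]+)\)\s*$/)
  return match ? match[1].trim().toLowerCase() : null
}

console.log(titleHint('John Smith (poet)')) // 'poet'
console.log(titleHint('Toronto Raptors')) // null
```

A hint like 'poet' would still need to be mapped to a type, and would presumably be weighed against the other signals.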
const wtf = require('wtf_wikipedia')
wtf.extend(require('wtf-plugin-classify'))
wtf.fetch('Toronto Raptors').then((doc) => {
  let res = doc.classify()
  // {
  //   type: 'Organization/SportsTeam',
  //   score: 0.9,
  //   details: {...}
  // }
})
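The `type` string in the result above looks like a slash-separated path. A small sketch of splitting it into its broad and specific parts follows; `splitType` is a hypothetical helper, and treating an unclassified page as a missing `type` is an assumption rather than documented behaviour.

```javascript
// Hypothetical helper (not part of the plugin): split a type path such as
// 'Organization/SportsTeam' into its broadest and most specific parts.
function splitType(type) {
  // Assumption: unclassified pages have no type string
  if (!type) {
    return { root: null, leaf: null }
  }
  const parts = type.split('/')
  return { root: parts[0], leaf: parts[parts.length - 1] }
}

// a result object shaped like the example output above
const res = { type: 'Organization/SportsTeam', score: 0.9 }
const { root, leaf } = splitType(res.type)
console.log(root) // 'Organization'
console.log(leaf) // 'SportsTeam'
```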
<script src="https://unpkg.com/wtf_wikipedia"></script>
<script src="https://unpkg.com/wtf-plugin-classify"></script>
<script defer>
  wtf.plugin(window.wtfClassify)
  wtf.fetch('Radiohead', function (err, doc) {
    console.log(doc.classify())
  })
</script>
Justification:
Traversing Wikipedia's categories to find, say, all the People or Places is a notoriously broken strategy.
Infoboxes like {{Infobox person}} are a really clear signal, but get muddled quickly with things like {{Infobox architect}}.
This library tries to do this sort of work, to determine whether a page is about a Person, a Place, or an Organization, in broad terms.
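To make the infobox problem concrete, here is a rough sketch of a name-to-type lookup. The table and the `typeFromInfobox` function are hypothetical and purely illustrative; the plugin's real mappings are much larger and are not reproduced here.

```javascript
// Hypothetical lookup table, for illustration only.
// Note that 'architect' has to be listed separately to be recognised as a Person.
const INFOBOX_TYPES = new Map([
  ['person', 'Person'],
  ['architect', 'Person'],
  ['settlement', 'Place'],
  ['company', 'Organization'],
])

// Normalise a template name like 'Infobox architect' and look it up.
function typeFromInfobox(name) {
  const key = name
    .toLowerCase()
    .replace(/^infobox[\s_]+/, '')
    .trim()
  return INFOBOX_TYPES.get(key) || null
}

console.log(typeFromInfobox('Infobox person')) // 'Person'
console.log(typeFromInfobox('Infobox architect')) // 'Person'
console.log(typeFromInfobox('Infobox spaceship')) // null
```

Every specialised infobox needs its own entry, which is why a single signal like this is unlikely to be enough on its own.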
Types:
Person:
  Athlete:
    AmericanFootballPlayer : true
    BaseballPlayer : true
    FootballPlayer : true
    BasketballPlayer : true
    HockeyPlayer : true
  Creator:
    Actor : true
    Musician : true
    Author : true
    Director : true
  Politician : true
Place:
  Jurisdiction:
    City : true
    Country : true
  Structure:
    Bridge : true
    Airport : true
  BodyOfWater : true
Organization:
  MusicalGroup : true
  Company : true
  SportsTeam : true
  PoliticalParty : true
  School : true
Event:
  Disaster : true
  Election : true
  MilitaryConflict : true
  SportsEvent : true
Creation:
  CreativeWork:
    Album : true
    Book : true
    Film : true
    TVShow : true
    Play : true
    Song : true
    VideoGame : true
  Product : true
  FictionalCharacter : true
Concept:
  MedicalCondition : true
  Organism : true
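The nested types above can be flattened into slash-separated paths like the 'Organization/SportsTeam' value shown in the usage example. The sketch below uses a small hand-copied subset of the tree; `listPaths` is a hypothetical helper, and the assumption that deeper types are joined the same way (for example 'Place/Jurisdiction/City') is not confirmed by this README.

```javascript
// A small subset of the type tree, copied by hand for illustration
const tree = {
  Organization: { MusicalGroup: true, SportsTeam: true },
  Place: { Jurisdiction: { City: true, Country: true } },
}

// Hypothetical helper: walk the tree depth-first and collect every path,
// joining parent and child names with '/'
function listPaths(node, prefix = '') {
  const paths = []
  for (const [name, child] of Object.entries(node)) {
    const path = prefix ? `${prefix}/${name}` : name
    paths.push(path)
    // leaves are marked with `true`; only objects have children
    if (child && typeof child === 'object') {
      paths.push(...listPaths(child, path))
    }
  }
  return paths
}

const paths = listPaths(tree)
console.log(paths.includes('Organization/SportsTeam')) // true
console.log(paths.length) // 7
```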
As of March 2020, it can classify ~65% of English Wikipedia articles:
null: 37.71%
People: 18.86%
Place: 14.01%
Organization: 8.27%
CreativeWork: 5.38%
Event: 4.57%
Thing: 5.75%
i18n
It is trained on the English Wikipedia, but may also give reasonable results in other languages.
It may help to first require wtf-plugin-i18n, which maps many templates to their English forms.
Work in progress.
MIT