delicious spam
At R&Mi we've been discussing how we could increase the "findability" of BBC radio programmes. One of the ideas was to give them a presence on sites like del.icio.us or flickr (OK, so that may work better for TV). Then users of these sites may run across our content serendipitously - rather than when explicitly searching for BBC content. We are, in a way, diverting people's attention to our programmes.
So I spent a bit of time this afternoon writing a bit of code to...
- take a programme page (e.g. the canonical In Our Time example)
- run its text through the Yahoo Term Extraction API to get a list of tags
- post to del.icio.us
- repeat for the next programme
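The steps above could be sketched roughly as follows. This is not the code I wrote that afternoon, just a minimal illustration. `bookmark_programmes`, `extract_tags` and `post_bookmark` are hypothetical names: the last two stand in for the Yahoo Term Extraction call and the del.icio.us post, whose real interfaces aren't shown here, and the programme pages are assumed to have been fetched already as (url, text) pairs.

```python
from typing import Callable, Iterable, List, Tuple


def bookmark_programmes(
    pages: Iterable[Tuple[str, str]],
    extract_tags: Callable[[str], List[str]],
    post_bookmark: Callable[[str, List[str]], None],
) -> List[Tuple[str, List[str]]]:
    """Tag and post each programme page in turn.

    extract_tags and post_bookmark are hypothetical stand-ins for the
    term-extraction service and the bookmarking service.
    """
    posted = []
    for url, text in pages:
        # Run the page text through the term extractor to get a list of tags.
        tags = extract_tags(text)
        if not tags:
            # Nothing useful came back, so skip rather than post an untagged link.
            continue
        # Post the bookmark, then move on to the next programme.
        post_bookmark(url, tags)
        posted.append((url, tags))
    return posted
```

Passing the two services in as functions means the loop can be tried out with dummy versions before anything is actually posted to a live site.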
But I haven't gone any further because I'm a bit worried about the ethics of this. Would we (i.e. the BBC) be spamming the site?
A lot of the value of these community bookmarking/photo/whatever sites is from the behaviour of the users who take care over what they post and how they tag. And is that something that just comes from a dedicated community? Will these sites lose their value if large numbers of links are automatically posted? What if we got our editorial teams to manually post the links - is that different? Actually, what happens when these sites go more mainstream - any studies on the effects?
Matt reminded me that Joshua Schachter of del.icio.us mentioned the issue of spamming at the recent Carson Workshops Summit and commented that when they find someone spamming they just let them keep posting but it all goes into a big black hole. He also suggested that...
...the value in Delicious is in the "attention" - auto-tagging detracts from this
Also note that nobody else on del.icio.us has yet used any of the tags knighterrant, tiltingatwindmills or belies. Why not?
3 comments:
If the url is already in del.icio.us before you start, then you haven't gained a huge amount by adding it again. Granted it pops to the top of the list for a while, but that aside...
You do gain more tag coverage, but then while the Yahoo term extractor is good, it isn't that good ...
queenmaryuniversityoflondon
... doesn't really trip off the keyboard too easily.
If the url isn't already in del.icio.us, one possibility is to wonder why not, rather than auto-splatting it in. Maybe that page needs to be removed and replaced with something better!
Rather than the Y! term extractor, have you considered trying word frequency? Take the most frequent words on the page, minus common stop words.
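A minimal sketch of the word-frequency idea, assuming the page text has already been stripped of markup. The function name `frequent_words`, the tiny stop-word list and the three-letter minimum are all illustrative assumptions; a real stop-word list would be much longer.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stop-word list (an assumption, not a standard one).
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "from",
    "has", "have", "in", "is", "it", "its", "of", "on", "or", "that", "the",
    "this", "to", "was", "were", "which", "with",
})


def frequent_words(text, limit=10, stop_words=STOP_WORDS):
    """Return up to `limit` of the most frequent non-stop words in `text`."""
    # Lowercase the text and pull out runs of letters, allowing one inner apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Count everything that is not a stop word and is at least three letters long.
    counts = Counter(w for w in words if w not in stop_words and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]
```

Unlike a term extractor, this only ever produces single words, so it won't come up with run-together phrases, but it won't find multi-word terms either.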
Last up: adding a "tag this on del.icio.us" link might be complicated to square with the BBC's fairness rules, but it would probably get you better results. Plus you'd be giving something back to del.icio.us by promoting them a bit.
Or better yet, let users tag the pages and use del.icio.us transparently as the backend for your data (so the user submits to the BBC, and you proxy and tag it on del.icio.us), but have all the tags appear as coming from one "bbcrmi" user.
One uber tag cloud for BBC users ... mmmm.
I was thinking of the increased tag coverage rather than posting the url for the sake of it. And in terms of the tag quality - well that's something that could be worked on, I just used the Yahoo term extractor cos it's there and easy to use. Hmmm...maybe we could use some of the internal tagging data that's being worked on at the moment.
And creating a proxy for tagging pages - nice idea. Though if you aggregate all the users into one super-user, what is an individual's motivation to tag a page, apart from the general navigation goodness that could result?
Hi everybody!
TermExtractor, my master's thesis, is online at the
address http://lcl2.di.uniroma1.it.
TermExtractor is a FREE and high-performing software package for Terminology
Extraction. The software helps a web community to
extract and validate relevant domain terms in their
interest domain, by submitting an archive of
domain-related documents in any format
(txt, pdf, ps, dvi, tex, doc, rtf, ppt, xls, xml,
html/htm, chm, wpd and also zip archives.)
TermExtractor extracts terminology consensually
referred to in a specific application domain. The
software takes as input a corpus of domain documents,
parses the documents, and extracts a list of
"syntactically plausible" terms (e.g. compounds,
adjective-nouns, etc.).
Document parsing assigns greater importance
to terms with text layouts (title, bold, italic,
underlined, etc.). Two entropy-based measures, called
Domain Relevance and Domain Consensus, are then used.
Domain Consensus is used to select only the terms
which are consensually referred to throughout the corpus
documents. Domain Relevance is used to select only the terms
which are relevant to the domain of interest; Domain
Relevance is computed with reference to a set of
contrastive terminologies from different domains.
Finally, extracted terms are further filtered using
Lexical Cohesion, which measures the degree of
association of all the words in a terminological
string.
I'd like you to participate in the TermExtractor
evaluation task. The results of your evaluation will be
put in a paper (I enclose a draft). Please contact me
if you want to participate (this is very important for
me!).
MANY THANKS!!!
--
Francesco Sclano
home page: http://lcl2.di.uniroma1.it/~sclano
msn: francesco_sclano@yahoo.it
skype: francesco978