
Competition Produces Vandalism Detection For Wikis

timothy posted about 4 years ago | from the citation-needed dept.


marpot writes "Recently, the 1st International Competition on Wikipedia Vandalism Detection (PDF) finished: 9 groups (5 from the USA, 1 affiliated with Google) tried their best at detecting all vandalism cases in a large-scale evaluation corpus. The winning approach (PDF) detects 20% of all vandalism cases without misclassifying regular edits; moreover, it can be adjusted to detect 95% of the vandalism edits while misclassifying only 30% of all regular edits. Thus, by applying both settings, manual double-checking would only be required on 34% of all edits. It is not yet known whether the rule-based bots on Wikipedia can compete with this machine-learning-based strategy. Anyway, there is still a lot of potential for improvement, since the top 2 detectors use entirely different detection paradigms: the first analyzes an edit's content, whereas the second (PDF) analyzes an edit's context using WikiTrust."


First Vandal (1, Interesting)

Anonymous Coward | about 4 years ago | (#33703216)

here we are knocking at the friendly gates of wikipedia and now they want to throw us out.
Is there no hospitality in this world ?

ad: Searching for experienced pillagers and fortress busters. Exp 5 years min. Trebuchet skills a plus. Bring your own broadsword.

Re:First Vandal (0)

Anonymous Coward | about 4 years ago | (#33703352)

I'd be interested. Point of contact? ;)

Re:First Vandal (0)

Anonymous Coward | about 4 years ago | (#33703484)

Sorry, I run with the visigoths.

Sounds interesting... (0)

Anonymous Coward | about 4 years ago | (#33703316)

I'd like to hear more about MATT IS GAY LOL

20% with no false positives? (3, Insightful)

Dan East (318230) | about 4 years ago | (#33703402)

If the algorithm can detect 20% with perfection, then that must constitute extremely low-hanging fruit. That type of vandalism is just an annoyance. It is so obvious that the end user readily recognizes it as such and can skip over it or revert the edit.

The real issue is disinformation, which is vastly more subtle. The only defense is fact-checking or seeking out references. If the algorithm is capable of recognizing that kind of vandalism then the developers should have the software writing all the articles in the first place, because it'd have to be pretty spectacular to manage that.

Re:20% with no false positives? (1)

Moryath (553296) | about 4 years ago | (#33703638)

The major problem now is that 99% of all good edits submitted to wikipedia are reverted anyways as false positives.

The reason for this is that corrupt administrators do nothing to stop it, and corrupt idiots wanting to become admins just sit all day on the semi-automated tools like "Twinkle" or "Huggle" reverting anything in sight to get their edit counts up.

The real issue is disinformation, which is vastly more subtle. The only defense is fact-checking or seeking out references.

While true, the larger problem with wikipedia is, and has always been, cadres of editors who carry around administrator support to ban any opposing viewpoint, picking off anyone who comes to their "owned" articles one at a time so as to prevent a consensus change. I spent a good amount of time analyzing the writings of a former wikipedia administrator [livejournal.com] who gave up on the whole thing, as well as the old writings of another one [blogspot.com] who, after being part of the corrupt system a long while, somehow stepped on the wrong toes and got finally tossed out. Comparing them to wikipedia behavior today, it's apparent that nothing has changed and the whole system, especially the "administrators", "bureaucrats", and "arbitration committee", are completely corrupt.

You can dig up all the sources you want; all they have to do is scream how it's not a "real" source, or simply revert-war you as a group and then accuse you of "breaking 3RR", and you're toast. Have the temerity to reveal an organized campaign by these groups in an unblock request, and they'll send along one of the Wiki-Assassins like Sandstein, FisherQueen, Barek, or the various other "unblock patrollers" to abuse you, harass you, and finally just hound you into something they can use to push for an indefinite ban. Often, they'll just out-and-out lie, claiming someone ran "checkuser" (a tool that NEVER comes up with a response other than guilty) and that you are a "sockpuppet" of someone else.

Re:20% with no false positives? (2, Insightful)

bunratty (545641) | about 4 years ago | (#33703782)

Care to show us even one article where 99% of good edits are reverted? Remember, that will mean that over 99% of all edits are reverted.

Re:20% with no false positives? (2, Informative)

Rhaban (987410) | about 4 years ago | (#33704636)

Care to show us even one article where 99% of good edits are reverted? Remember, that will mean that over 99% of all edits are reverted.

not if there are bad edits that are not reverted.

Re:20% with no false positives? (1)

bunratty (545641) | about 4 years ago | (#33704690)

I suppose if there were lots of bad edits, and good edits were reverted more often than bad edits, that could be true. Care to find a page like that?

Re:20% with no false positives? (0)

Anonymous Coward | about 4 years ago | (#33705272)

No true Scotsman... [wikipedia.org]

Re:20% with no false positives? (1)

AmiMoJo (196126) | about 4 years ago | (#33710344)

not if there are bad edits that are not reverted.

Depends on your definition of bad edits. Take for example an edit with useful information but unsourced or poorly worded. Delete or keep and try to improve?

Increasingly WP seems to be going for the former. Ditto with useful articles that are about niche subjects, particularly software, which are deemed not notable enough to keep. Whatever happened to "Wikipedia is not paper"?

Re:20% with no false positives? (1)

Rhaban (987410) | about 4 years ago | (#33710444)

the definition of good or bad edits has nothing to do with my previous comment.

If 50% of a page's edits are good and the other 50% are bad, and 99% of the good edits were reverted while none of the bad edits ever were, then only 49.5% of all edits were reverted.

It's just mathematics; nothing to do with WP politics of editing and reversion.
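A quick sketch of that arithmetic, using made-up counts matching the example above:

```python
# Hypothetical page history: 100 good edits, 100 bad edits.
good_edits, bad_edits = 100, 100
good_reverted = 99    # 99% of the good edits get reverted
bad_reverted = 0      # none of the bad edits do

fraction = (good_reverted + bad_reverted) / (good_edits + bad_edits)
print(fraction)  # 0.495
```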

On a side note, I agree with you about niche subjects. The notability threshold should be lower.

Re:20% with no false positives? (1)

AmiMoJo (196126) | about 4 years ago | (#33716804)

I wasn't really arguing with your mathematics, just pointing out that any kind of measurement of good/bad edits is subjective.

Re:20% with no false positives? (0)

Anonymous Coward | about 4 years ago | (#33706238)

You forgot to mention the BRD "essay" defense for reverts. I've had that thrown at me as well. It doesn't matter that it's an essay, not policy. It is used as an excuse to revert without discussion. The "discussion" part of BRD is never applied to the "owners", just to anyone wanting to make a change.

Re:20% with no false positives? (1)

Eivind (15695) | about 4 years ago | (#33708314)

The major problem now is that 99% of all good edits submitted to wikipedia are reverted anyways as false positives.

That's just nonsense, and you know it. It does indeed happen that good edits are reverted, but it does not happen 99% of the time, not even close to that.

It's hard to say which is the bigger problem: good edits that are reverted, or bad edits that aren't. My guess is that both these problems are about equal at the moment, and neither of them is particularly large. That is, Wikipedia seems to be progressing nicely, and it's not at all a problem for most good-faith contributors to have their contributions "stick".

Re:20% with no false positives? (1)

kryptKnight (698857) | about 4 years ago | (#33704906)

If the algorithm can detect 20% with perfection, then that must constitute extremely low-hanging fruit. That type of vandalism is just an annoyance. It is so obvious that the end user readily recognizes it as such and can skip over it or revert the edit.

You have to consider that the people doing the vast majority of vandalism reversions aren't end users; they're registered Wikipedians who maintain articles as a hobby. Automatically reverting 20% of the vandalism means contributors have that much more time to spend verifying uncited claims in other articles.

Re:20% with no false positives? (1)

migla (1099771) | about 4 years ago | (#33705410)

Just run the thing a few times and you'll get almost all of it, duh!

20%? (1)

ProdigyPuNk (614140) | about 4 years ago | (#33703404)

I'm sure it's relatively easy to find 20% of the incidents of vandalism when it's a blatant 'rip out half the page and write profanities' sort of thing, but even those results aren't that great. They can 'turn it up' a bit and catch a higher percentage, but that seems to be a slightly bad idea. If wikipedia is based on information from the community at large, I really doubt the people that insert such knowledge will be thrilled when their edits are deleted immediately.

Also, what about more subtle vandalism, the kind that's hard to detect? A few edits that introduce bias in an article, for example. It's this reason that college students everywhere read wikipedia to get a general idea of a topic, and then go elsewhere to places they can actually cite in a paper.

Re:20%? (0)

Anonymous Coward | about 4 years ago | (#33703496)

I really doubt the people that insert such knowledge will be thrilled when their edits are deleted immediately.

Looks like you missed the part that mentioned "manual double-checking" for the edits that the "turned up" algorithm flagged but not the original one.

and the reversionists? (2, Insightful)

Anonymous Coward | about 4 years ago | (#33703410)

The people who "own" a page with the assistance of powerful insiders and revert any changes to their "pet" pages, even spelling fixes or simple corrections to bad information?

Will edits of *those* insiders, who are ruining wikipedia for the rest of us, be flagged by the algorithm as vandalism?

Re:and the reversionists? (2)

bunratty (545641) | about 4 years ago | (#33703796)

Can you show us a page where any changes, even spelling fixes or simple corrections, are reverted?

Re:and the reversionists? (1)

CarpetShark (865376) | about 4 years ago | (#33704492)

Can you show us a page where any changes, even spelling fixes or simple corrections, are reverted?

Can you show me where the goal of wikipedia is documented as being so low that only spelling fixes and simple corrections are needed? That sounds more like recaptcha than a wiki.

Re:and the reversionists? (1)

bunratty (545641) | about 4 years ago | (#33704644)

I didn't make such a claim.

Re:and the reversionists? (1)

AthanasiusKircher (1333179) | about 4 years ago | (#33706280)

Can you show us a page where any changes, even spelling fixes or simple corrections, are reverted?

No, except in edit wars, I haven't seen spelling fixes randomly reverted.

But I have seen pages where simple factual errors have been corrected, along with citations, AND even a note about the edit on the Talk Page, and they are still reverted. It most often happens on articles too obscure to be policed well that are likely to attract people with agendas. (For example, I've seen both left-wing and right-wing religious crazies peddling their incorrect historical/factual assertions on obscure pages on religious topics.)

I've also seen a territorial admin who kept deleting things even after an academic familiar with the field did a survey of dozens of the standard textbooks in an area and posted the results on the Talk Page, proving that the admin's view on the subject was absolutely wrong.

This is perhaps a more wacko subset of the territorial editors the GP was referencing, who are also a problem when it comes to actually making progress on an article.

By the way, I personally know three good-natured experts in a field who were willing to write basic articles but have stopped working on Wikipedia after various run-ins with the aforementioned admin. (They have their own careers that pay them to do research; why should they offer their expertise for free while having to deal with some idiot halfway across the world questioning everything?) But that admin lives on Wikipedia and goes around fixing random spelling/grammar errors all day... so who does the rest of the Wiki bureaucracy side with?

This is a particularly egregious case, but the GP is right that good people are driven away from Wikipedia by territorial editors and even admins.

Re:and the reversionists? (1)

Idarubicin (579475) | about 4 years ago | (#33707522)

I've also seen a territorial admin who kept deleting things even after an academic familiar with the field did a survey of dozens of the standard textbooks in an area and posted the results on the Talk Page, proving that the admin's view on the subject was absolutely wrong.

[Citation needed]?

Which article? Which admin? When?

Re:and the reversionists? (0)

Anonymous Coward | about 4 years ago | (#33706400)

I can show you a page that is "owned", where constructive edits are reverted:

F-16 [wikipedia.org]

Try reading that page as someone that knows nothing about fighter aircraft, and then try making any kind of copyedit change to make the article easier to read, such as summarizing any of the rambling prose, or making any of the jargon more readable. Try adding a jargon tag to a jargon-y word or try tagging a section for copyedit.

Note that there is already an article about the Lightweight Fighter program, but this article rehashes it in a rambling way that makes it difficult to understand in the context of the F-16. Note on the talk page how any questions aimed at genuine enhancement of the article are ignored or ganged up on over technicalities and "policies". Also note how consensus is frequently mentioned, but rarely seen on the talk page *before* edits are made to the article.

This article is why I no longer believe in Wikipedia.

Wikiality (1)

Eevee (535658) | about 4 years ago | (#33703428)

The elephant^wvandalized article population in Africa^wWikipedia has tripled in the past six months.

Iterate (0)

Anonymous Coward | about 4 years ago | (#33703454)

That's good. Start out with a blank page and run it through the algorithm a couple of thousand times till it becomes true.

100% effective method (3, Funny)

robably (1044462) | about 4 years ago | (#33703478)

Thus, by applying both settings, manual double-checking would only be required on 34% of all edits.

Or, you know, just keep applying the first setting that always correctly detects 20% of vandalism on the 80% that's left over, until there's nothing left. Problem solved.

Re:100% effective method (0)

TeXMaster (593524) | about 4 years ago | (#33703554)

Doesn't work. It's multiplicative, not additive, meaning that the second time you only get 20% of the 80% you had left, i.e. 16%, for a 36% cumulative cleansing, thus remaining with 64% of the original, etc.

Re:100% effective method (1)

TeXMaster (593524) | about 4 years ago | (#33703586)

To clarify: the sequence asymptotically approaches 100%, but you'll never get there in a finite number of steps. (That's the purely mathematical side of things. In practice, the number of vandalism edits is discrete, so if the percentage is rounded to the nearest integer you _would_ get 100% in a finite number of steps, but not if it always rounds down. Also, the vandalism cleanup takes time, so even if the number of steps is finite, more vandalism would be created in the meantime...)
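The diminishing-returns sequence can be sketched directly (a pure math illustration, assuming a constant 20% catch rate on whatever vandalism remains after each pass):

```python
remaining = 1.0   # fraction of vandalism still undetected
for n in range(1, 6):
    remaining *= 0.80             # each pass catches 20% of what is left
    print(n, f"{1 - remaining:.5f}")
# cumulative detection: 0.20000, 0.36000, 0.48800, 0.59040, 0.67232
# approaching, but never reaching, 100% in finitely many passes
```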

Re:100% effective method (1)

robably (1044462) | about 4 years ago | (#33703690)

I still don't get it. Can you clarify your clarification, please?

In fact, can you continue clarifying 20% of the 80% left from each previous clarification until you reach infinity? I think that'll do it.

Re:100% effective method (0)

Anonymous Coward | about 4 years ago | (#33704552)

of course! Just like if you zip an already zipped file it will always get smaller until it's only 1 byte long. Diminishing returns? Entropy? Never heard of it.

Re:100% effective method (3, Funny)

Zocalo (252965) | about 4 years ago | (#33703678)

Otherwise known as Zeno's Dichotomy Paradox [wikipedia.org] (often shortened to just "Xeno's Paradox", although he in fact suggested three).

I suppose I should now go and vandalise the article to keep in the spirit of things. Hang on, I'm half way there...

Re:100% effective method (1)

bunratty (545641) | about 4 years ago | (#33703800)

Whoosh!

Re:100% effective method (1)

Mitchell314 (1576581) | about 4 years ago | (#33703816)

I was going to point out that it relies on the assumption that it cleans 20% of any given article. I would say it cleans 20% of an average article, and once an arbitrary article is cleaned, it's no longer average, i.e. that 20% figure no longer applies. I don't know what the new figure is, but I suspect reapplying would immediately or quickly converge to 0%. Machine learning approaches especially, I'd guess, would (eventually) get to the point of covering everything they can in the initial pass.

Manual double checking? (2, Interesting)

structural_biologist (1122693) | about 4 years ago | (#33703500)

I don't know where that 34% figure comes from for the manual double checking. The test set contains about 60% vandalism and 40% real edits, so I'll assume this represents the rate of vandalism on wikipedia. Now, consider a set of 1000 edits. 600 would be vandalism while 400 would be real edits. The second filter would catch 570 instances of real vandalism along with 120 false positives. Even if you used the first filter to automatically remove the 120 instances of vandalism it finds, you would still be left with a set of 450 instances of vandalism + 120 false positives to check. This means that you would have to sort through about 57% of the original edits in order to identify the 120 false positives.

Re:Manual double checking? (0)

Anonymous Coward | about 4 years ago | (#33703542)

It's worse than that. The best it can do in detecting vandalism is " it can be adjusted to detect 95%". Even ignoring the false positives that means that one vandalising edit in 20 is being missed. That's obviously unacceptably high so you still need to manually review all the remaining edits classified as valid to pick out the ones that aren't.

Re:Manual double checking? (1)

allo (1728082) | about 4 years ago | (#33705744)

but someone without much knowledge of the subjects could look at the 95% and trash them. The person who knows how to tell subtle vandalism from good edits needs only to look at the remaining 5%.

Re:Manual double checking? (1)

monkyyy (1901940) | about 4 years ago | (#33706170)

that person has got to be very busy

Re:Manual double checking? (0)

Anonymous Coward | about 4 years ago | (#33703634)

According to the 2nd link, the vandalism rate on Wikipedia is 2391/28468 = 0.084, not 0.60!

lower bound: 0.084*(0.95-0.20) + (1-0.084)*0.30 = 0.33780
upper bound: 0.084*(1-0.20) + (1-0.084)*0.30 = 0.34200

Looks like 34% to me.

Re:Manual double checking? (1, Informative)

Anonymous Coward | about 4 years ago | (#33704862)

According to the 2nd link, the vandalism rate on Wikipedia is 2391/28468 = 0.084, not 0.60!

 
The second link actually says:

The corpus compiles 32452 edits on 28468 Wikipedia articles, among which 2391 vandalism edits have been identified.

 
So that is a vandalism rate of 2391/32452 = 0.074. When I do the math I get 33% of all edits requiring a manual check. The vast majority of them are false positives.

0.074 * (0.95-0.20) + (1-0.074) * 0.30 = 0.0555 + 0.2778 = 0.3333
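The arithmetic in this thread is easy to verify, using the corpus figures quoted above:

```python
vandalism_rate = 2391 / 32452        # corpus vandalism rate, ~0.074
recall_hi, fp_rate_hi = 0.95, 0.30   # high-recall setting
recall_perfect = 0.20                # zero-false-positive setting

# Edits needing manual review: vandalism flagged by the high-recall setting
# but not already auto-reverted by the perfect setting, plus false positives.
to_review = (vandalism_rate * (recall_hi - recall_perfect)
             + (1 - vandalism_rate) * fp_rate_hi)
print(round(to_review, 3))  # 0.333
```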

Re:Manual double checking? (1)

structural_biologist (1122693) | about 4 years ago | (#33704928)

Thank you two for the clarification. I should really read the links more thoroughly before posting.

i hope they include this algorithm asap... (1)

mapkinase (958129) | about 4 years ago | (#33703520)

...because I am tired of my small edits here and there to be classified automatically as vandalism.

i hope they ban this algorithm asap... (0)

Anonymous Coward | about 4 years ago | (#33708992)

...because I am tired of my small edits here and penis to be classified automatically as vagina.

There is a pretty simple heuristic (2, Interesting)

pieterh (196118) | about 4 years ago | (#33703546)

This comes from personally maintaining some 200+ wikis on Wikidot.com.

There are two kinds of vandals: those in the community of contributors, and those outside it. The first class of vandals cannot easily be detected automatically but when a wiki is actively built, the community will easily and happily fix damage done by these. The second class are usually spammers and come along when the wiki is stale. They are easily detected by the fact that a long static page is suddenly edited by an unknown person. It's very rare to find a real edit happening late after a wiki has solidified. We handle the second type of vandalism trivially by getting email notifications on any edits.

Trick is, wikis (maybe not Wikipedia but then certainly individual pages) don't have random life cycles but go through growth and stasis.
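The staleness heuristic described above can be sketched in a few lines. This is a hypothetical illustration, with an assumed 180-day dormancy threshold, not anything Wikidot actually runs:

```python
import time

STALE_AFTER = 180 * 24 * 3600   # 180 days in seconds (assumed threshold)

def needs_review(last_edit_ts, editor, known_editors, now=None):
    """Flag an edit when a long-dormant page is touched by an unknown editor."""
    now = time.time() if now is None else now
    dormant = (now - last_edit_ts) > STALE_AFTER
    return dormant and editor not in known_editors

# A page untouched for a year, edited by a stranger: review it.
year = 365 * 24 * 3600
print(needs_review(0, "stranger", {"alice", "bob"}, now=year))   # True
print(needs_review(0, "alice", {"alice", "bob"}, now=year))      # False
```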

Re:There is a pretty simple heuristic (1)

AthanasiusKircher (1333179) | about 4 years ago | (#33706146)

The second class are usually spammers and come along when the wiki is stale. They are easily detected by the fact that a long static page is suddenly edited by an unknown person. It's very rare to find a real edit happening late after a wiki has solidified.

Ah... now I know why people revert my generally anonymous but high quality edits on neglected articles. Anyone who edits a dormant article must be a spammer or vandal? I don't think this is true.

Trick is, wikis (maybe not Wikipedia but then certainly individual pages) don't have random life cycles but go through growth and stasis.

While I guess you're correct in general, I've seen quite a few situations on Wikipedia where a new user coming in and taking a look at an established article actually leads to a period of revision, reconsideration, and perhaps growth on a given page.

I'm not saying your opinion isn't correct, but it is an overgeneralization. Effectively, I think why your "second class" of vandals is easier to detect than the first is that it's easier to spot bad edits on a page when there are few edits made on a page. Which seems pretty obvious...

top 2 (2, Insightful)

trb (8509) | about 4 years ago | (#33703672)

Anyway, there is still a lot of potential for improvement, since the top 2 detectors use entirely different detection paradigms

This implies that the lower-scoring detectors are less valuable in terms of looking for sources of improvement. That's not true, and that wasn't stated in the paper's "Conclusions" section. If the lowest scoring detector finds 5% of the bad data, and it's a different slice from what the other detectors find, then that's quite valuable.

Machine learning - right (4, Informative)

Animats (122034) | about 4 years ago | (#33703750)

Wikipedia already has programs which detect most of the blatant vandalism. Page blanking and big deletions are caught immediately. Deletions that delete references generate warnings. Incoming text that duplicates other content on the Web is caught. That gets rid of most of the blatant vandalism. It's not a serious problem on Wikipedia.

The current headaches are mostly advertising, fancruft, and pushing of some political point of view. That's hard to deal with using what is, after all, a rather dumb machine learning algorithm that has no model of the content or subject matter.

There already IS a competitive angle (2, Insightful)

Grimbleton (1034446) | about 4 years ago | (#33703756)

They already compete to be the first to revert edits they disagree with.

Re:There already IS a competitive angle (0)

Anonymous Coward | about 4 years ago | (#33703818)

They already compete to be the first to revert edits they disagree with.

You're doing it wrong. Anyone can revert first. The objective of the wiki game is to be the LAST to revert edits you disagree with. That takes a lot more skill.

Mine works 100% of the time. (1)

hedwards (940851) | about 4 years ago | (#33703832)

It just characterizes all the edits to the conservapedia to be vandalism.

Hah, bout time. (2, Insightful)

OnePumpChump (1560417) | about 4 years ago | (#33704384)

4chan and Somethingawful have been having Wikipedia vandalizing competitions for years. (Usually, whoever's edit or fake article stays the longest wins.)

Re:Hah, bout time. (1)

DamienRBlack (1165691) | about 4 years ago | (#33704756)

I made a fake article that has been up for three years and ten months. It has even been brushed up a little bit by a few people. The article is full of fake companies, fake people and fake ideas. Do I win? I'd tell you what it is, but I want to see how long it stays up and if I post it someone will see to taking it down.

Re:Hah, bout time. (1)

ultranova (717540) | about 4 years ago | (#33708474)

I'd tell you what it is, but I want to see how long it stays up and if I post it someone will see to taking it down.

And you want to see how many other pages get taken down in the hunt for the fake.

Good troll, bro :).

Re:Hah, bout time. (0)

Anonymous Coward | about 4 years ago | (#33704882)

Hmm, these people must have a pretty good idea what it takes to beat the bots.
They could give valuable input.
Maybe someone would start bragging...

Rules can only get so much (3, Informative)

tawker (860711) | about 4 years ago | (#33704414)

As the owner of the first vandalism-reverting bot in mainstream use -- http://en.wikipedia.org/wiki/User:Tawkerbot2 [wikipedia.org] -- I guess I have a bit of perspective on the whole problem. Originally the bot was designed/created to auto-revert one very specific type of vandalism: a user who would put a picture of SpongeBob SquarePants (or Squidward or some other cartoon character) into pages while blanking them. That was pretty easy to get.

Next we went after stuff like full page blanking, ALL CAP LETTER UPDATES, and additions of a tonne of bad words, based on common vandalism trends (i.e., if a page had 0 profanity on it and someone added a few words, it would be reverted -- again, not too many false positives). That basically caught the "dumb kid" type of vandalism, and it was amazing how much lower a percentage of total edits it caught when students went back to school.

The only problem: at the time, it was a resource pig. The bot was originally running on a P2 300MHz with a grand total of 256MB of RAM, and the load got to be so high that we had to move it about 5 times.

It's interesting to note that at first, many, many people were opposed to the idea of automated vandalism reversion; it was almost a contest to revert stuff first, and the bot would win a vast majority of the time. However, as time went on, my inbox started getting rather full whenever I had a power outage, the cat knocked the cord out of the box hosting it, etc. Community reaction to bots doing the grunt work in vandalism really changed.

Anyways, just my 2c on it, and, just for the heck of it, to prove I'm actually the Tawker on wiki: http://en.wikipedia.org/w/index.php?title=User%3ATawker&action=historysubmit&diff=387163504&oldid=268687392 [wikipedia.org]
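The kinds of rules described here (blanking, all-caps bursts, sudden profanity on a clean page) are easy to sketch. This is a hypothetical illustration in the spirit of those heuristics, not the bot's actual code:

```python
BAD_WORDS = {"stupid", "sucks", "moron"}   # tiny stand-in word list

def looks_like_vandalism(old_text: str, new_text: str) -> bool:
    # Rule 1: page blanking or a huge deletion.
    if len(new_text) < 0.1 * len(old_text):
        return True
    old_words = set(old_text.lower().split())
    new_words = set(new_text.lower().split())
    # Rule 2: profanity added to a page that previously had none.
    if not (BAD_WORDS & old_words) and BAD_WORDS & new_words:
        return True
    # Rule 3: a burst of ALL-CAPS words among the added text.
    added = [w for w in new_text.split() if w not in old_text.split()]
    return sum(1 for w in added if w.isupper() and len(w) > 2) >= 5

print(looks_like_vandalism("A sourced article about physics.", ""))        # True
print(looks_like_vandalism("Plain text.", "Plain text. This page SUCKS"))  # True
```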

Re:Rules can only get so much (1, Informative)

Anonymous Coward | about 4 years ago | (#33705142)

It looks like the winning entry [uni-weimar.de] uses all of those attributes plus a bunch more. From pages 3-4 of the paper.

 

  1. Anonymous -- Whether the editor is anonymous or not.

    Vandals are likely to be anonymous. This feature is used in one way or another in
    most working anti-vandalism bots such as ClueBot and AVBOT. In the PAN-WVC-
    10 training set (Potthast, 2010), anonymous edits represent 29% of the regular edits
    and 87% of the vandalism edits.

  2. Comment length -- Length in characters of the edit summary.

    Long comments might indicate regular editing and short or blank ones might suggest vandalism, however, this feature is quite weak, since leaving an empty comment in regular editing is a common practice.

  3. Upper to lower ratio -- Uppercase to lowercase letters ratio

    Vandals often do not follow capitalization rules, writing everything in lowercase or
    in uppercase.

  4. Upper to all ratio -- Uppercase letters to all letters ratio.
  5. Digit ratio -- Digit to all characters ratio

    This feature helps to spot minor edits that only change numbers, which might help to find some cases of subtle vandalism where the vandal changes arbitrarily a date or a number to introduce misinformation.

  6. Non-alphanumeric ratio -- Non-alphanumeric to all characters ratio

    An excess of non-alphanumeric characters in short texts might indicate excessive
    use of exclamation marks or emoticons.

  7. Character diversity -- Measure of different characters compared to the length of inserted text.

    This feature helps to spot random keyboard hits and other non-sense. It should take
    into account QWERTY keyboard layout in the future.

  8. Character distribution -- Kullback-Leibler divergence of the character distribution of the inserted text with respect to the expectation. Useful to detect non-sense.
  9. Compressibility -- Compression rate of inserted text using the LZW algorithm.

    Useful to detect non-sense, repetitions of the same character or words, etc.

  10. Size increment -- Absolute increment of size, i.e., |new| - |old|.

    The value of this feature is already well-established. ClueBot uses various thresholds of size increment for its heuristics, e.g., a big size decrement is considered an
    indicator of blanking.

  11. Size ratio -- Size of the new revision relative to the old revision

    Complements size increment.

  12. Average term frequency -- Average relative frequency of inserted words in the new
    revision.

    In long and well-established articles too many words that do not appear in the rest
    of the article indicates that the edit might be including non-sense or non-related
    content.

  13. Longest word -- Length of the longest word in inserted text.

    Useful to detect non-sense.

  14. Longest character sequence -- Longest consecutive sequence of the same character in
    the inserted text.

    Long sequences of the same character are frequent in vandalism (e.g. aaggggghhhhhhh!!!!!, soooooo huge).

Along with analyzing those basic stats, the winning entry also examines categories of words.

 

  1. Vulgarisms -- Vulgar and offensive words, e.g., fuck, suck, stupid.
  2. Pronouns -- First and second person pronouns, including slang spellings, e.g., I, you, ya.
  3. Biased -- Colloquial words with high bias, e.g., coolest, huge.
  4. Sex -- Non-vulgar sex-related words, e.g., sex, penis, nipple.
  5. Bad -- Hodgepodge category for colloquial contractions (e.g. wanna, gotcha), typos (e.g.
    dosent), etc.
  6. All -- A meta-category, containing vulgarisms, pronouns, biased, sex-related and bad
    words.
  7. Good -- Words rarely used by vandals, mainly wiki-syntax elements (e.g. __TOC__, )

I don't understand everything in the paper. My impression is that a large set of known edits is fed into the tool, which then uses the stats from that set to build up a statistical model of the values it should expect in each category, and rejects any edit that appears far out of the norm.

I'm sure this tool will have a lot of false positives on certain articles just because of the nature of the article, while working well on more typical ones.
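A few of the listed features are straightforward to compute. A minimal sketch, with zlib's DEFLATE standing in for the paper's LZW compressibility measure:

```python
import zlib
from itertools import groupby

def edit_features(inserted: str) -> dict:
    letters = [c for c in inserted if c.isalpha()]
    n = max(len(inserted), 1)
    data = inserted.encode("utf-8")
    return {
        # 4. Uppercase-letters-to-all-letters ratio
        "upper_to_all": sum(c.isupper() for c in letters) / max(len(letters), 1),
        # 5. Digit-to-all-characters ratio
        "digit_ratio": sum(c.isdigit() for c in inserted) / n,
        # 7. Character diversity: distinct characters vs. length
        "char_diversity": len(set(inserted)) / n,
        # 9. Compressibility (zlib stand-in): repetitive text compresses well
        "compress_ratio": len(zlib.compress(data)) / max(len(data), 1),
        # 14. Longest consecutive run of the same character
        "longest_run": max((sum(1 for _ in g) for _, g in groupby(inserted)),
                           default=0),
    }

feats = edit_features("aaggggghhhhhhh!!!!!")
print(feats["longest_run"])   # 7 (the run of h's)
```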

Non-AI algo that can tell good edits from bad? No. (1)

CarpetShark (865376) | about 4 years ago | (#33704480)

without misclassifying regular edits

I suspect that, if the regular edits weren't misclassified by the algorithms, then either:

a) the reference classification was itself incorrect, or
b) the samples were much too convenient, compared to the kind of complex changes needed to improve pages in real life.

Can it detect "spin"? (1)

presidenteloco (659168) | about 4 years ago | (#33705360)

Detecting spam-like vandalism would seem to be fairly easy.

Far more insidious is politically spun issue-framing masquerading as objective description of events or topics.

It is truly amazing what you can hide in there by using high-falutin', officious, grammatically correct language to accomplish your spin. Oft' times you can even fool the domain experts.

Physicists say that everything is either "spin-UP" or "spin-DOWN". Master spin-doctors say the same thing.

Simple (1)

eonduckem (1107975) | about 4 years ago | (#33708648)

if $edit = EncyclopediaDramatica.match then $vandalism = TRUE;

Why? (1)

xenobyte (446878) | about 4 years ago | (#33708960)

Why vandalize articles in the first place?

Sure, stupid spammers think replacing an article with a badly spelled advert for ViAGRa is the way to go, and morons think that they gain something from inserting "I'M GAY!!!!!" into an article about someone they dislike, but why just do damage for no other purpose than destroying other people's hard work?

I just don't get it.

These trolls/vandals need to get their asses kicked - hard. Or maybe just have something of theirs broken, just for the fun of it, and see if they find that funny.
