Resources for Research on Web Spam
Mailing List
We recommend that you subscribe to our mailing list. Datasets, challenges and conferences related to Web spam are announced on this low-volume, announcements-only list.
Datasets
We host Web spam datasets developed through a collaborative effort by a team of volunteers. The goal of our dataset activity is to make available reference collections that should be:
- Large: the collections should include many examples of spam and non-spam content.
- Clean: the collections should contain few classification errors.
- Uniform: the collections should represent a uniform random sample over a set of pages or hosts.
- Broad: the collections should include as many different Web spam aspects as possible.
- Open: the collections should be freely available to researchers.
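As an illustration of the "Uniform" criterion, the following minimal sketch draws a uniform random sample of hosts for labeling. It is not the procedure used to build the hosted collections; the function name `sample_hosts`, the host names and the seed are assumptions made up for the example.

```python
import random


def sample_hosts(hosts, k, seed=None):
    """Draw k distinct hosts uniformly at random, without replacement.

    Duplicates are removed first so that every distinct host has the same
    chance of being selected; sorting makes a seeded run reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(set(hosts)), k)


# Hypothetical host list, for illustration only.
hosts = ["a.example", "b.example", "c.example", "d.example", "e.example"]
to_label = sample_hosts(hosts, 3, seed=0)
```

Sampling without replacement over deduplicated hosts keeps any single host from being over-represented in the set that is later labeled as spam or non-spam.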
We currently host a set of collections for research on Web spam. See datasets >>.
See also
Web Spam Challenge — competition to identify methods for detecting Web Spam.
AIRWeb — workshop on Adversarial Information Retrieval on the Web
Source code (archived) — Truncated PageRank and Adaptive Estimation of Supporters, the algorithms proposed in a WebKDD'06 paper.
For inquiries, please contact Carlos Castillo.
Last updated: January 15, 2008.