[sigtag-l] ICWSM 2009 Data Challenge

Emma L. Tonkin e.tonkin at ukoln.ac.uk
Sun Oct 26 18:28:40 EDT 2008

Hi all (hello again, to those of you whom I've already met at 
this year's ASIS&T AM),

I saw this on the IR mailing list and thought that it might be 
interesting for the more mathematically inclined SIG-TAGgers.

-------- Original Message --------
Subject: [SIG-IRList] Data Challenge/ICWSM 2009: International 
Conference on Weblogging and Social Media, San Jose, CA, USA; May 17-20, 

                               ICWSM 2009
        International Conference on Weblogging and Social Media
                   San Jose, CA, USA; May 17-20, 2009

                             Data Challenge
                         Call for Participation

Continuing the ICWSM tradition, ICWSM 2009 is making a dataset
available to researchers in the blog and social media fields.  We
invite you to download the dataset, explore it, learn something
interesting about it, and submit a paper about it to ICWSM 2009.

The dataset, provided by Spinn3r.com, is a set of 44 million blog
posts made between August 1st and October 1st, 2008.  Each post
includes the text as syndicated, as well as metadata such as the
blog's homepage and timestamps.  The data is formatted in XML and is
further arranged into tiers that roughly approximate search engine
ranking.  The total size of the dataset is 142 GB uncompressed (27 GB

(We also anticipate possibly releasing additional datasets.  Stay

For details on how to get the dataset, including a usage agreement,
please see the data page on the conference website,
http://www.icwsm.org/2009/data/.  There is also a mailing list and
Google Code site for sharing ideas and resources.

This dataset spans a number of big news events (the Olympics; both US
presidential nominating conventions; the beginnings of the financial
crisis; ...) as well as everything else you might expect to find
posted to blogs.  ICWSM invites research studies of this data,
including but not limited to

  - link analysis
  - social network extraction
  - tracing the evolution of news
  - blog search and filtering
  - psychological, sociological, ethnographic, or personality-based
  - analysis of influence among bloggers
  - blog summarization and discourse analysis

Instructions for submitting papers to ICWSM may be found at
http://icwsm.org/2009/cfp.shtml.  When submitting your paper, indicate
that it makes use of the dataset.  Dataset papers will be reviewed for
the main conference, and additionally for presentation at the data
challenge workshop to take place on May 20th, 2009 (the last day of
the conference).  While we anticipate that several dataset papers may
appear in the main conference, the data challenge workshop will
provide an opportunity for in-depth discussion of the dataset in a
more focused forum.

We will be making a collaborative website available for sharing tools,
indexes, or other extracts of the dataset.  Please see the ICWSM
website for links.

Ian Soboroff, NIST
Akshay Java, UMBC
ICWSM 2009 Data Chairs
