FLOSSmole, FLOSShub and the SRDA repositories

FLOSSmole, FLOSShub and the SourceForge Research Data Archive (SRDA) are repositories of papers and data describing how FLOSS is developed. Collectively these community resources store several terabytes of data, including both artifacts of the software creation process and metadata about FLOSS projects and developers. This presentation describes a new collaborative project that will expand, enhance, and integrate these community resources serving the FLOSS research community. The project will capitalize on years of successful collaboration among the maintainers of the resources to become a new developmental framework for facilitating access to the massive amounts of data collected by many previously unconnected FLOSS research efforts. Goals also include development of tools to mine that data, an expanded papers repository, and a unified portal to promote collaboration. The presentation will describe the current metadata archives, architectures, and services, followed by planned integration and enhancements.

More Decks by FLOSS Community Metrics meeting

Transcript

  1. FLOSSmole, FLOSShub and the SRDA Repositories: Past, Present, and Future

    Greg Madey, University of Notre Dame (SRDA); Megan Squire, Elon University (FLOSSmole)
  2. Who do we serve?

    •  Mostly academics: graduate students, faculty, post-docs, class projects, etc.
    •  Researchers interested in: FLOSS, software engineering, sociotechnical processes, social networks, data/text mining, economics, management information systems, open processes, citizen science/engineering, innovation, …
  3. How do we serve?

    •  Collect, archive, curate metadata about FLOSS: project and developer statistics, discussion forums, bugs/issues, releases, developer roles, project governance, project evolution, success criteria, history, …
    •  Mostly data from forges …
  4. SRDA & FLOSSmole

    •  SRDA: started in 2002; crawler, direct data dump; PostgreSQL
    •  FLOSSmole: started in 2004; crawlers; MySQL
    … SRDA 112 dumps FLOSSmole/SourceForge data
  5. 2002-2009

    •  Focus was on collecting metadata and storing it
    SRDA collects from SourceForge. FLOSSmole began collecting from other forges as well:
    •  Freshmeat
    •  RubyForge
    •  ObjectWeb
    •  Free Software Foundation
    •  Alioth
    •  Launchpad
    •  Tigris
    •  Google Code
    •  GitHub
  6. 2002-2009

    •  This period was "The Age of the Forge"
    Image from Squire & Williams, 2012
  7. 2002-2009

    •  Growing our user base and supporting our users so they can conduct research using our data and write scholarly papers
    •  Example: Crowston, Kevin, Kangning Wei, James Howison, and Andrea Wiggins. "Free/Libre open-source software development: What we know and what we do not know." ACM Computing Surveys (CSUR) 44, no. 2 (2012): 7.
  8. 2002-2009

    •  We finally all met in the same room at the same time! (Limerick, OSS 2007)
    •  2nd WoPDaSD, 2007 (organized by Jesús González-Barahona, Megan Squire, Gregorio Robles)
    •  Kevin Crowston
    •  Greg Madey
    •  Walt Scacchi
  9. 2002-2009

    Both teams won NSF grants to support work
    •  ISS-0222829
    •  CNS-0751120
    •  CNS-0708437
    •  CNS-0708767
    these were Megan's & Kevin Crowston's
    these were Greg's
  10. In 2009-10 a few things happened...

    1. FLOSSmole stopped collecting from SF
    Due to a major SF site overhaul that broke the crawlers - again
  11. In 2009-10 a few things happened...

    2. SRDA & FLOSSmole merged mailing lists
    We acknowledged that our users are largely the same
  12. In 2009-10 a few things happened...

    3. We started working on CRA-CCC grant with Walt Scacchi
    This made us more of an intentional community
    Scacchi, Walt, Kevin Crowston, Chris Jensen, Greg Madey, Megan Squire, Thomas Alspaugh, Les Gasser et al. "Towards a science of open source systems." (2010).
  13. In 2009-10 a few things happened...

    4. OSS 2010 conference was held in the US at the University of Notre Dame
    Several meetings and collaborative efforts, joint presentations, panels, etc.
  14. In 2009-10 a few things happened...

    5. We created FLOSShub/biblio
    •  We inherited the MIT F/OSS paper repository (Lakhani, von Hippel, and Hill)
    •  Since then we have added about 1000 papers to it
  15. In 2009-10 a few things happened...

    6. The ascendance of GitHub and the decline of other forges begins
    Graph shows the decline in new project registrations on RubyForge
  16. Statistics …

    SRDA
    •  ~150 paper cites
    •  ~470 registered user accounts (perhaps double the number of users)
    FLOSSmole
    •  ~200 paper cites
    •  unknown # of users
      o  203 on mailing list
      o  43 active db user accounts
      o  hundreds of thousands of downloads
  17. Current Machines & Sizes

    SRDA:
    •  Currently on 2nd gen machine: 2 cores, 12GB memory, 7TB storage, SRDA data = 1.5TB
    •  New machine: 64 cores, 128GB memory, 12TB storage
  18. Current Machines & Sizes

    FLOSSmole:
    •  Currently several terabytes of data stored at Elon (development server) and Syracuse (production server)
    •  We briefly hosted at San Diego Supercomputer Center and on Amazon Web Services
  19. More recently... FLOSSmole began shifting gears

    •  Away from metadata and away from forges due to the ascendance of GitHub & newer tools (Forge++ paper)
    •  Towards more complicated artifacts: collecting & analyzing text, especially developer communication (email, IRC, Stack Overflow, Twitter)
    •  Primary focus is still on building re-usable data sets
  20. Goals for 2014 & beyond

    1. Continue integrating SRDA & FLOSSmole
    Grant funding (supposedly forthcoming) will help
    2. Continue FLOSS focus
    FLOSS is no longer a "weird" phenomenon, but it's still worth studying
    3. Continue data focus
    •  Our primary job is data collection, archiving, and curation
    •  Our secondary job is analysis
  21. Examples of current focus areas

    Data source: Email
    Projects: Apache, LKML
    Example 1: Review of how 72 FLOSS research papers have used email archives
    Example 2: Using email archives to study developer reactions to an innovation (paper under review)
  22. Examples of current focus areas

    Data source: IRC logs
    Projects: Ubuntu, Apache projects, Django, etc.
    Example: Using FLOSS developer dialogue to train natural language classifiers to discern humor, insults, and profanity (under review)
    All IRC data is available on FLOSSmole; some as flat files and all in MySQL
  23. Examples of current focus areas

    Data source: Messy metadata
    Project: Apache
    Example: Use the Apache Board meeting minutes (very messy!) to create data sets of who's-who on Apache (paper link & data set)
  24. Examples of current focus areas

    Data source: Twitter
    Projects: all FLOSS
    Example: In order to mine Twitter, we need to know WHO are the FLOSS developers/projects on there (paper link & data set)
  25. Examples of newer focus areas

    Data source: Stack Overflow
    •  Critically important as a "Forge 2.0" code/knowledge repository
    •  Not specific to FLOSS, but very important to both FLOSS and non-FLOSS
    Example 1: MSR challenge 2013 was on Stack Overflow
    Example 2: "A Bit of Code": How the Stack Overflow community creates quality postings
    Example 3: Many FLOSS projects are outsourcing their developer support to Stack Overflow. Does this work? (paper under review)
  26. Growth Areas

    •  Text sources - FLOSSmole is particularly interested in social coding and communication artifacts
    •  Donations - FLOSSmole is starting to get more donations of data, although there is still a lot of resistance
  27. Ongoing Challenges

    •  Hardware - Constantly growing, need funding for machines, backups, disks
    •  Software - New data sources don't always fit in relational model; newer analysis methods need more sophisticated software
    •  Access & Support - community always needs more/better support, docs, examples, PR, integration
  28. Ongoing Challenges

    •  Funding to enable integration … several fails, tentative small success!
    FOSSter: A Data Sharing Infrastructure for the Open Source Research Community
    •  FOSSter Advisory Board: chaired by Scacchi - Consultant (UC Irvine)
    •  FLOSSmole: Squire - PI (Elon University)
    •  FLOSShub: Crowston - Consultant (Syracuse University)
    •  SRDA: Madey - PI (Notre Dame)