A blog by Dave Lester.

Recent Projects (February 2013 Edition)


With the month almost complete, I’m overdue for an update about what’s keeping me busy:

Drone Lab

Since attending Drone Games last fall, I’ve been crazy about personal drones and Nodecopter. I bit the bullet and bought a Parrot AR Drone, my friends have them, and the School of Information bought one. I’ve also rallied a posse of fellow graduate students to participate in a group directed study this semester that we’re calling Drone Lab. I was even called a ‘drone fiend’ on Twitter, based upon my enthusiasm.

The structure and idea of Drone Lab is simple: every Friday we meet to discuss and program AR Drones, some of us simply hacking while others work on more ambitious projects. So far the development has ranged in complexity from programming basic flight patterns, to exploring computer vision, to hooking up an Arduino in hopes of attaching additional sensors. Ultimately, I think our collective goal is to deliver a taco (see Tacocopter). Random people have even started showing up to check it out, interested in learning how to program the drones – it’s pretty amazing.

I think Drone Lab has been a great and informal way for a group of us to work together, creating a local community of hackers. I’m not sure if this will remain a small group, or turn into something more open and large, but I’m having a great time. We have a shared repo of drone scripts on GitHub if you’re interested in checking out some of the code we’re experimenting with.

Harlem Shake (Drone Edition)

With the help of Arthur Che and Gilbert Hernandez, I directed a Harlem Shake at the UC Berkeley School of Information, featuring an AR Drone! I know the meme is dead, but it’s worth checking out for lols.

Mozilla Open Badges

I’m sprinting with the rest of the Open Badges team as we approach our mid-March 1.0 release. Working as an Integration Engineer, I’m helping partners and community developers become Open Badges-compliant while also contributing code, including Facebook integration for an upcoming release of the Mozilla backpack, and merging some large community contributions for a future version of WPBadger. It’s awesome. The badges team is wildly talented, and it feels like the community has momentum as we work toward hitting 1.0.

GitHub for Collaborative Writing

As part of my Computer-Mediated Communication (CMC) course, my team is researching and prototyping a tool that uses GitHub as a backend for collaborative writing. We’ll be looking at a variety of domains (academic writing, journalism, business materials, and technical documentation) in order to identify overlapping needs and features that we could integrate into our tool. Some social and community features of GitHub like issues, comments, and forking will likely be integrated into our tool in addition to the distributed version control features of git. My contribution will be a technical evaluation of GitHub’s API, along with work on our prototype.

When explaining the implications of this project, I’ve pointed people to one of the slickest tools I’ve recently seen, called Huboard. Huboard is a Kanban-style project management tool that allows you to organize GitHub tickets at various stages of development, and creates a radically different workflow from how you normally interact with tickets. And while the details of the prototype we’re developing for CMC are still up in the air, I hope that whatever we come up with is similarly simple in its approach, making use of GitHub to provide a lightweight visual and organizational layer. GitHub is a tremendously powerful platform, and finding non-code uses for it will be great.


Drinkly

Last but not least, Michael Hintze and I have been busy working together on our final master’s project, called Drinkly. This project is solidly in the ‘fun’ category, with the purpose of making it easy to find happy hour specials at nearby bars. Drinkly is the latest of a number of projects we’ve worked on related to the chicken-and-egg problem of incentivising businesses to share information with consumers in a dedicated channel. We’ve received some positive feedback so far. More details on that next month.

Designing by Principle


Michael Hintze and I recently met up with Peter Merholz to discuss our I School final master’s project. With three months until graduation, we were seeking advice on product development and the decisions we’ll need to make.

One of the strategies Peter suggested was to focus on a set of design principles, referring us to a blog post by Luke Wroblewski. This has helped organize our project early on, and spurred a series of related thoughts I felt were worth writing down and sharing.

In his blog post about Design Principles at Microsoft, Luke cites his own definition of design principles:

Design principles are the guiding light for any software application. They define and communicate the key characteristics of the product to a wide variety of stakeholders including clients, colleagues, and team members. Design principles articulate the fundamental goals that all decisions can be measured against and thereby keep the pieces of a project moving toward an integrated whole.

This definition captures several important things. First of all, the audience of design principles can be diverse, simultaneously resonating with programmers, designers, project managers, and sponsors. Particularly on large teams, communication breakdowns between team members can fragment a project and sometimes force revisions to work that has already been done. Additionally, principles can help teams make better decisions.

Peter’s suggestion was to write the principles on paper and hang them from the walls of the workplace. When making a design decision, does it align with the principles? Principles can become a filter through which team decisions are made.

Principles for our project

Here’s a sampling of the design principles Michael and I identified for our final project:

  1. Delight and surprise. The smallest details matter for a great experience. Use well-placed animations when you can.
  2. Notify users sparingly. Notifications are disruptive. Only bother users when the alert is meaningful and significant.
  3. Have fun when you can. A beer mug filling up is more fun than a spinning circle, or beach ball of death.
  4. Keep it simple. People love brevity. Keep things short and meaningful.
  5. Get to know the user. Learn about our users over time. Don’t make them input the same thing over and over.

These principles capture our user experience goals, and begin to define the personality of the project itself. Simplicity, delight and surprise, and having fun clarify both what the project is and what it is not. I’ll update this post later in the semester to measure their effectiveness.

Luke’s blog post shares examples of principles that have persisted through product development at Microsoft, including Windows 7 and the Microsoft Surface. There’s some interesting crossover between their principles and ours, but also some divergence – I think it’s important to not be dogmatic about the use and definition of principles, but to think of them as a management tool. No two projects will have the same principles, nor the same quantity or granularity.

Design principles in an agile process

Principles create an artifact that persists across the development cycle for a product, sometimes in the absence of a predefined plan and before a working version of software has been delivered. In an agile development model, decisions are made throughout the life of a project, instead of up front as they would be in a waterfall process. After all, the Manifesto for Agile Software Development itself is organized around a series of principles. Principles offer flexibility to make decisions later, while keeping them focused on larger goals.

They can also establish an understanding among stakeholders about the experience of a product, which sometimes is difficult to grasp until a prototype exists. There are obviously examples of when this is less crucial – strong corporate or team cultures are other ways to focus the passion and ideas of employees, establishing consensus regarding style and qualities of a work. But the reality of many teams is one where not everyone is centrally located, possibly involving differing stakeholders including contractors, content partners, and sponsors. Principles can provide cohesiveness and certainty to what otherwise may seem distributed and chaotic.


Lastly, principles can be applied more broadly than to a single project, and instead to a larger body of work. One of the inspiring talks I watched last year was given by Bret Victor, Inventing on Principle. Bret asked us to define what the guiding principle of our individual work is, and I spent a lot of time trying to define my own guiding principle as a result. I’ve come to decide that I lack a single principle, although I admire Bret’s focus. I could name half a dozen that guide me, but not only one. That is fine. The purpose of principles is simply to help you focus, provide something to look back at in the future to evaluate your progress, and offer a way to clearly communicate your purpose to others.

I’d love to discuss how principles are used in other ways than I’ve described in this blog post. Please let me know what you think.

Why We Should Be Discussing Bots

When I was recently asked what I’ve been thinking about in grad school, my response left some friends scratching their heads: “bots,” I said. I wrote this post in an attempt to describe some reasons to investigate them, examples of their roles, and the design features that make them unique and desirable.

What is a bot?

There’s no great definition out there, but the vagueness of the term should not overshadow their awesomeness. Bots are automated or semi-automated software agents that perform a set of actions. Often, these scripts read and write data on the web, running continuously to monitor changes or actions. Each bot can be created for a single purpose rather than a complex set of interactions, and easily replaced by better bots that may come along. Bots are not new or unique to the Web, but are increasingly visible. They can be time-saving, spammy, unexpected, or even destructive.

Reasons to Investigate Bots

Bots can help automate mundane or time-intensive tasks.

During the API Workshop, I was impressed by George Oates’ presentation on the Open Library project and how they use bots to perform mundane tasks like batch-importing catalog records from the Library of Congress. Every morning, their ImportBot checks the LoC for new records, and adds them to the Open Library database. No interns or software developers are required. Tools like IFTTT are similarly beginning to make it easier for non-technical users to automate parts of how we interact with the web.

They are increasingly easy to run in the cloud.

Deploying applications to cloud services like Heroku, Windows Azure, and Amazon EC2 has become significantly easier for developers, and I expect that similar services will be more accessible to non-technical users in the near future. This opens up new opportunities to continuously run bots and smaller pieces of software in the cloud, while in the past you may have only considered firing up a server for a more persistent function like your personal website.

The ability to run software in the cloud becomes particularly profound when it’s possible to increase the number of servers and software simultaneously running a task; this is something bots can take advantage of, and a method of solving problems that is unique to the cloud.

They can be fun to build and use.

One of the fun outcomes of THATCamp Prime last year was @horse_thatbooks, a Twitter bot created by @boone that has resulted in some funny tweets including my favorite description of the unconference: “it’s a great time despite the humanities geek alert.” @horse_thatbooks reads a stream of tweets that include the #thatcamp hashtag, stores them in a cache, and periodically runs them through a Markov chain process to create new tweets. While not all bots are as playful as @horse_thatbooks, their design features create increased opportunities for experimentation, and yes – fun.
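
For anyone curious what running tweets through a Markov chain might look like, here is a minimal Python sketch. It is not @boone’s actual implementation: the function names (`build_chain`, `generate`), the first-order word-level model, and the sample tweets are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(tweets):
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    for tweet in tweets:
        words = tweet.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, start, max_words=20, rng=random):
    """Walk the chain from `start`, picking a random successor at each step."""
    words = [start]
    while len(words) < max_words:
        successors = chain.get(words[-1])
        if not successors:  # dead end: no word ever followed this one
            break
        words.append(rng.choice(successors))
    return " ".join(words)


# Hypothetical cached tweets; a real bot would collect these from the
# #thatcamp stream instead of hard-coding them.
cached_tweets = [
    "it's a great time at #thatcamp",
    "a great session on bots at #thatcamp",
    "humanities geek alert at the unconference",
]

chain = build_chain(cached_tweets)
print(generate(chain, start="a"))
```

Because each next word is drawn only from words that actually followed the current one, the output tends to sound plausible locally while drifting into nonsense overall, which is most of the charm.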

Networks of bots can perform functions that would otherwise require single/unified platforms.

In many of the examples I’ve looked at where bots have thrived, it’s possible that robust and centralized software could have been adopted to provide similar functionality. Bots can be powerful examples of how core functionality of a system can be limited in scope, and extended in an ecosystem through the creation of software by third parties. I think @substack summed up this idea with this tweet: “instead of writing a big app, compose together lots of tiny apps split out over separate processes that talk to each other over the network.”

Many related questions and concerns came up a few years ago during the planning process of Bamboo Corpora Space, where we grappled with the question of how distributed or centralized a system should be to establish interoperability among tools and collections for humanities researchers. On one end of the spectrum was a model that was mostly decentralized and run by software agents (similar to bots), and the other was a large service model that was controlled by a single entity. Many of the design features of bots (listed below) make a decentralized system desirable.

Monitoring, notification, and extending oneself through software.

The most far-out of my bot interests and ideas include the possibility that we’re entering a time in mobile computing where our own interactions with the world are continuously monitored, and our daily decision making is effectively augmented by software agents that are looking out for us. Scoble has referred to a similar trend (although he is interested more in social and sensor data) as the rise of contextual systems. Part of my Triggers project had similar goals in mind, best summed up by the Google Now tagline: “the right information at just the right time.”

I’d like to see platforms like Google Now play a role in our future, but built on bots, or software with the design features of bots, rather than black boxes. Looking at the roles and design features of bots, I think there’s a lot to like.

Roles and Examples of Bots

Some great examples of bots in action are on Wikipedia, where 1,638 bot tasks are currently approved for use [1] and bots make 9% of edits to the English site. Bots play an active role in updating the site, and perform a variety of algorithmic functions that include importing new articles, vandal fighting and reverting changes, identifying copyright violations, ban enforcements, and recommending pages for users to contribute to. Other languages have much higher rates of bot participation, such as Spanish with 21%, and Afrikaans with 61%. [2]

One of the earliest Wikipedia bots was RamBot, which imported public domain census data in October 2002 to create approximately 30,000 U.S. city articles. Unlike the Open Library ImportBot that I previously mentioned, RamBot was built around a finite corpus and didn’t need to run continuously. In seeding Wikipedia with content and stubs, RamBot played an important role. Another example is ClueBot, a vandal-fighting bot that examines edit and contribution history, and is capable of reverting edits in seconds. ClueBot will then notify the offending user about the content that was modified. These scripts are created and operated by members of the community, and interact with the MediaWiki software that powers Wikipedia. If I were trying to build a crowdsourced or community-powered site right now, I would want to be thinking about ways that bots could help out.

Sometimes bots interface with humans in the form of a command, notification, or message. For quite some time there have been IRC bots that respond to commands or greet users, and more recently we’ve seen the rapid growth of Twitter bots. A funny example of a Twitter bot that I stumbled upon is called @BronyRT, which retweets any message that uses the term “brony.” [For those that are not already aware, a brony is a male viewer of the television show My Little Pony.] If you tweet the message “STATS” at @BronyRT, it will return your ranking in terms of brony-related tweets, based upon the number of times your message has been retweeted. It’s a very strange system of karma that I don’t entirely understand.

There are a few other bots worth mentioning: @paperbot is a Twitter bot created by Ed Summers that tweets 100-year-old news from OCR’ed newspapers using the Chronicling America API. I love how this is an example of a bot that both creates new awareness of content and serves a very specific purpose. There’s also an interesting CHI article about GetLostBot, a bot that tries to facilitate serendipity by using Foursquare location data.

Design Features of Bots

Several design features of bots jump out at me:

  1. Bots are cheap. You only need a client with a web connection to do HTTP requests, on a server or a desktop. Beyond this basic hardware, there is nothing about the software that makes them expensive.
  2. Diversity of languages. Unlike many systems that assume a certain technology stack, bots can be written in any language, which enables a diverse set of libraries and languages to be used. You can write a bot in PHP, Python, Java, Clojure, JavaScript, or any other language.
  3. They move agency to the edges of an ecosystem. By allowing outside developers to add functionality through the creation of bots, core software can be lighter and allow for greater participation. For example, I have a lot more control over how a Wikipedia bot works than I do the MediaWiki software that powers the entire site. The ability to create a bot gives me more agency in that situation.
  4. They are modular. Bots are typically built around a set of actions and problems. When addressing problems of interoperability, bots can act as a lightweight solution to be the glue between different APIs.
  5. Replicability. As software, they can be copied, which makes it possible to share them with relative ease. The ability to have many copies of the same bot also means that they can be run in parallel on multiple machines, increasing computing power and reducing processing time.

Let’s Build Bots

The best part about bots is that they don’t rely on technical specifications, and their implementation can take a number of forms. We can build bots today, and don’t need funding or to ask for permission. There are also some interesting design patterns among bots, and several frameworks exist today to create bots for various platforms.

I’d love feedback on the ideas mentioned here, and hope we can build some awesome bots together.



Help Hacking a Bluetooth Helicopter

I was given a cheap toy helicopter for Christmas that is controlled with a mobile app. Since the helicopter uses Bluetooth to communicate, I am hoping to connect my laptop to the device and figure out a way to control it outside of the app. Anyone have experience with Bluetooth or device hacking that could let me know if I’m on the right path or hopeless?

I began by pairing the helicopter to my laptop, which I was able to do using the Bluetooth Explorer utility. I collected device info, and posted screenshots on Flickr. What caught my attention was that the device is listed as a headset. I’m not sure what devices like toys are typically listed as, but it appears that the helicopter may be using the Bluetooth headset profile for at least some of its communication. And so I jumped to a conclusion which is probably wrong: if I could find a way to simulate or execute the same commands a Bluetooth headset does, and successfully connect it to the helicopter, maybe I could control it.

At this point, I see the following paths moving forward:

  1. Try to communicate directly with the helicopter, using a Bluetooth headset or writing code of my own. (Assumes that I can just use the headset profile.)
  2. Bluetooth sniffing to monitor the communication between the helicopter and my phone to discern the basic commands being sent. (Assumes that I have no idea how the two devices communicate.)
  3. Something brilliant that I haven’t thought of.
  4. Giving up.

I’m sure there are layers of complexity that I’m not even aware of, but I’m having fun hacking around. Any advice would be great!

Update 12/31/2012: I’m experimenting with using Wireshark to pursue the second path.

Recent Projects (December 2012 Edition)


I wrapped up the Fall semester with a few new projects. Here are the highlights:

Coffee On Me

Coffee on Me, originally hacked together for the Information Design Hackathon in late October, is a recommendation service to meet professionals outside your network. I’m an avid coffee drinker and often end up networking with other professionals over a cup; hopefully a service like Coffee on Me can make the discovery and invitation process easier and more enjoyable.

We are currently accepting signups for the service and have a working version of our recommendation algorithm, but the service is still in private testing. My role in the project was backend dev, and the team includes Michael Hintze, Rui Dai, and Aijia Yan. If you’re interested in the service, drop me a line and let’s grab coffee, on me.


Triggers

Triggers is the first step toward building the mobile application of my dreams: a web application that makes it simple to create triggers based upon my location, time, and other data to notify me of things that I otherwise would have forgotten or not been aware of. Some friends have referred to the idea as a mobile IFTTT. As the final project for my database management class, Triggers was meant to primarily showcase MySQL 5 triggers executing customized and repeatable queries that a user generates using mobile web forms. As new data is added to the system, these queries are run, and when they are true a notification is sent to the user.

What would it be used for? There are a variety of uses for parameterized and custom queries that guided my project, ranging from something as basic as an alarm clock notification that goes off at a specified time, to a notification that reminds me the next time I’m within a certain proximity of a location, to a notification reading data from the National Weather Service that alerts me when it’s going to rain tomorrow so that I’ll remember to carry an umbrella.
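
The basic loop described above (new data arrives, each saved query is checked, and matches produce a notification) can be sketched in a few lines. This is only an illustration in plain Python rather than the MySQL triggers the project actually uses; the `Trigger` class, the `evaluate` function, and the field names `time` and `rain_tomorrow` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trigger:
    """A user-defined rule: a name, a condition, and the message to send."""
    name: str
    condition: Callable[[dict], bool]
    message: str


def evaluate(triggers, reading):
    """Return the messages of every trigger whose condition holds for `reading`."""
    return [t.message for t in triggers if t.condition(reading)]


# Hypothetical triggers modeled on the alarm clock and umbrella examples.
triggers = [
    Trigger("alarm", lambda r: r.get("time") == "07:00", "Time to wake up"),
    Trigger("umbrella", lambda r: r.get("rain_tomorrow", False), "Carry an umbrella tomorrow"),
]

# Each new piece of data is checked against all triggers as it arrives.
print(evaluate(triggers, {"time": "07:00", "rain_tomorrow": False}))
print(evaluate(triggers, {"time": "18:30", "rain_tomorrow": True}))
```

In the real project the condition would be a parameterized SQL query rather than a Python function, but the shape is the same: a stored rule, a stream of incoming data, and a notification when the rule evaluates to true.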

I also used this project as an excuse to write the web application in Node.js, and it ended up becoming the largest Node app that I’ve ever built from scratch. It needs some refactoring before I officially release the codebase, but I hope to share it on GitHub over winter break.


InStore

I spent a ton of time this semester researching the mobile coupon space, both for the Designing Mobile Experiences class I took, as well as the Startup Lab course where we planned and pitched startup ideas. My mobile team, including Aijia Yan, Michael Hintze, and Gaurav Shetti, prototyped a mobile application called InStore for the discovery and redemption of coupons, deals, and events for retail stores. A prototype is online, along with our assignments outlining some of our research, presentation, and competitive analysis.

In the end, we focused on two scenarios when designing our prototype:

  1. A user visits a local shopping district, and opens the app to locate deals at nearby stores. They choose a particular store and redeem a deal.
  2. A user opens the app to find deals that their close friends have recently claimed using the app, and redeems a coupon that they find.

Mobile coupons and deals are in a huge and quickly changing space, and one of the takeaways I had from the InStore project was that our scope and ambition for the app was too broad, which made it difficult to design and refine throughout the semester. We also began to identify some mistakes that a lot of applications are making, which will be useful down the line. Michael and I plan to take lessons learned from this semester’s work on InStore as we work on our final master’s project, which will be in a related space.

Recent Projects (October 2012 Edition)

I’ve made it a priority to work on more group projects this semester, and I’m not surprised that I’m having fun and learning a lot. Here’s a sampling of what I’ve been up to:

Who Is Who

Screen capture of Who Is Who

Who Is Who is a web app to learn the names and faces of current students at the UC Berkeley School of Information. When loading the page, three student headshots are randomly displayed with the question: “who is _______?” This little web app was hacked together in a few hours, and gave me an excuse to try out Python’s Flask microframework. I was pleasantly surprised to see another I School student fork the project.
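
The core of the quiz is small enough to sketch here. The snippet below is an illustration rather than the code from the project: it leaves out Flask entirely, and the `build_question` function, the roster structure, and the placeholder names and file paths are assumptions.

```python
import random


def build_question(students, rng=random):
    """Pick three distinct students and choose one of them as the answer."""
    choices = rng.sample(students, 3)
    answer = rng.choice(choices)
    return {
        "prompt": f"who is {answer['name']}?",
        "headshots": [s["photo"] for s in choices],
        "answer_photo": answer["photo"],
    }


# Hypothetical roster; names and photo paths are placeholders.
students = [
    {"name": "Student A", "photo": "a.jpg"},
    {"name": "Student B", "photo": "b.jpg"},
    {"name": "Student C", "photo": "c.jpg"},
    {"name": "Student D", "photo": "d.jpg"},
]

question = build_question(students)
print(question["prompt"], question["headshots"])
```

Sampling without replacement guarantees three different faces, and choosing the answer from those same three guarantees the right headshot is always on the page.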


UC Berkeley Scavenger Hunt

The UC Berkeley Scavenger Hunt was a group project with Michael Hintze and Kate Rushton as part of our Information Organization Lab. After rereading Vannevar Bush’s Memex paper in class, we were assigned a fairly open-ended project: develop a web application that uses the Delicious API, and organize the data around trails and steps within trails. Most groups in class used this prompt to pursue projects around organizing web bookmarks, but we took the trail metaphor more literally and created a scavenger hunt that is powered by Delicious.

The web application is a scavenger hunt, meant to be played using smartphones. Groups are shown a set of hints of locations around campus, and when they reach that physical location they can “check in”. The application uses HTML5 geolocation to pinpoint a user, and if a group is within a specified distance of the target location, they can move on to the next hint. This project was also the first time I played with HTML5’s localStorage, which we used to save some basic session and group info. We had to use some hacks in how we saved data with Delicious in order to satisfy the assignment, but overall it was a fun project.
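
The “within a specified distance” check is the interesting bit, so here is a rough sketch of one way to do it. The app itself runs in the browser; this version is written in Python for illustration, uses the haversine formula (which may not be what we used at the time), and the function names, the 50-meter radius, and the coordinates are all assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points, via haversine."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def can_check_in(user, target, radius_m=50):
    """True if the user's (lat, lon) is within `radius_m` of the target."""
    return distance_m(*user, *target) <= radius_m


# Illustrative coordinates only, not the real hint locations.
target = (37.8721, -122.2578)
print(can_check_in((37.8722, -122.2579), target))  # a few meters away
print(can_check_in((37.8821, -122.2578), target))  # roughly a kilometer away
```

Phone GPS readings are noisy, so the radius needs to be generous enough that a group standing at the right spot isn’t refused.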


Grad Badge

Last month I had the awesome opportunity to spend a day at Facebook in Menlo Park for the Gates Foundation hackathon, where we were developing Facebook applications for college students. My team included Sunny Lee from Mozilla, Lia Siebert from Sheepdog Sciences, Charles Wright from the Gates Foundation, and Catalina Ruiz-Healy from Grad Guru. The hackathon itself was limited in terms of the time to write code, so we scoped our effort around a prototype and focused on the pitch we gave to judges in the afternoon.

Grad Guru is a mobile application that acts as a community college advisor in students’ pockets. Drawing on Catalina’s own experiences developing apps for community college students, we prototyped an app called Grad Badge that manages the deadlines a high school student faces when applying to community college, using the Facebook Open Graph to push badges and notifications to friends. The event itself was a kickoff for the College Knowledge Challenge, a $2.5 million fund to develop Facebook apps that will help students apply to, attend, and stay in college.


Spotifont

I’m now working in a new group for IOLab that is creating an alternate way to navigate and discover free fonts in Google’s Font Directory. We jokingly began to refer to it as “Spotify for Fonts”, so the name “Spotifont” has stuck. The solution we’re pursuing is a mix of a controlled vocabulary to describe fonts (like Serif, Sans-Serif, Handwritten, etc.) with user-generated tags (like Cool, Headline, Clean, Oldtimey) to make the font discovery process more personalized.
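
To make the mix of controlled vocabulary and free-form tags concrete, here is a small sketch of how a lookup could combine the two. Nothing here reflects our actual data model: the `find_fonts` function, the font records, and the font names are hypothetical.

```python
CATEGORIES = {"Serif", "Sans-Serif", "Handwritten"}  # controlled vocabulary


def find_fonts(fonts, category=None, tags=()):
    """Return names of fonts in `category` that carry every tag in `tags`."""
    if category is not None and category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    wanted = {t.lower() for t in tags}  # user tags are matched case-insensitively
    return [
        f["name"]
        for f in fonts
        if (category is None or f["category"] == category)
        and wanted <= {t.lower() for t in f["tags"]}
    ]


# Placeholder records; real entries would come from the font directory.
fonts = [
    {"name": "Font A", "category": "Serif", "tags": ["Headline", "Oldtimey"]},
    {"name": "Font B", "category": "Sans-Serif", "tags": ["Clean", "Headline"]},
    {"name": "Font C", "category": "Handwritten", "tags": ["Cool"]},
]

print(find_fonts(fonts, tags=["headline"]))
print(find_fonts(fonts, category="Sans-Serif", tags=["clean"]))
```

The category is validated against a fixed list because it comes from the controlled vocabulary, while tags are accepted as-is since users can invent whatever they like.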


Announcing WPBadger


Earlier this week I released WPBadger, a simple WordPress plugin for issuing badges and adding them to a user’s Open Badges backpack. Open Badges is a Mozilla project that is creating a platform to associate badges representing skills, achievements, and other data with people who earn them. I’m jazzed about the project as a whole, and particularly excited about how WPBadger has connected WordPress and the Open Badges infrastructure to make it easy for anyone to award badges.

WPBadger is open source, and uses some of the latest and greatest WordPress features like custom post types to manage and award badges. You can download WPBadger version 0.6.1 from the WordPress plugin directory, or watch it on GitHub. Both the plugin and GitHub repository include installation instructions, and earlier today Doug Belshaw blogged a how-to guide for using WPBadger. Special thanks to Doug, as well as Greg Wilson, Chris McAvoy, Les Orchard, and Boone Gorges for help testing and advising on this plugin.

I’m also working on a separate WordPress plugin called WPBadgeDisplay that creates a widget to show off your badges on your blog, and hope to release that next week.

Ideas for Improving THATCamp

This year’s THATCamp was a blast, and it was great to be back at CHNM for the weekend. Having finally caught up on rest, I began filling out the feedback form for this year’s event and my responses morphed into a handful of bloggable ideas about how to shake up the unconference. I’m not even sure THATCamp needs to be shaken up, because over the last five years it has matured into a consistently awesome event in itself. So, consider these ideas icing on the cake and some things that I hope to incorporate into the next event that I organize.

Setting The Vibe

  • I’m a big fan of kicking off workshops, meet-ups, or conferences by going around the room and allowing each person to introduce themselves to the entire group. We’re there to participate, and while not everyone will propose a session, these basic intros ultimately lead to engaging conversations, sessions, and future collaborations. It doesn’t take much time for each person to say: “I’m (blank) from (blank). I do (blank). I’d like to discuss (blank).” If there’s a time crunch, having each person say their name and three words to describe themselves as an alternative can result in some fun or odd responses. These individual introductions break down any remaining hesitancy to speak to the group, and help connect faces with Twitter handles. I also think that at events like this year’s THATCamp V, where over half the participants were new to the event, this may be welcoming.

  • Ditch the auditorium, and embrace chaos. I understand space constraints when organizing an event at a university, but the use of the auditorium for initial scheduling and sessions immediately results in the unconference vibe shifting from one that is participatory to one that is more controlled and less spontaneous. My urge is to resist that. Ideally, we would be facing one another with the ability to see the faces of our peers. 150 people is too large for a single circle, but I’ve seen large events like the LOD-LAM summit effectively divide the participants into two sides of chairs facing one another.

Sessions / Alt Sessions

  • Shorter session periods, smaller discussions, and smaller rooms. One of the interesting shifts in THATCamp sessions from year one to now is that individual session participation seems to have doubled, and they increasingly address general topics. One of the best “sessions” that I had at this year’s THATCamp was just Matt, Boone, and me discussing Twitter bots in the hallway, and later hacking away at code around a table. Adopting shorter 45-minute session periods would hopefully result in sessions that have greater focus with fewer participants, encourage spontaneous sessions, and engage participants with a variety of skills and backgrounds. I also think smaller sessions create more chances to make friends with your peers.

  • Encourage alt sessions. This is difficult to plan for and needs to be organic, but I’d like to think of THATCamp as an event where you’ll see things that you won’t at a normal conference. A few years ago David Staley created an art installation in the Research 1 tower, and I’d love to see similar exhibits or creative expressions. True, we now have @alienweedman and @horse_thatbooks to keep us entertained, but within our network of digital humanists we also have folks building interactive museum exhibits, large data visualizations, wearable Arduinos, and experimenting with weird hacks. Why are they not visible at THATCamp? Maybe next year we can organize the THATCamp “Hall of Hack” where showing off these projects is encouraged.


  • Start Friday night, and skip Sunday morning. There has always been a large drop-off of participation on Sunday for various reasons (travel, church, fatigue) and it seems like every year someone suggests getting rid of Sunday sessions completely. I’m pretty sure Dan Chudnov previously suggested moving scheduling to the night before THATCamp, and I’d like to second that idea. Instead of making the unconference one day, why don’t we do the majority of scheduling the night before on Friday? We could start off very informal over drinks and create a rough schedule, and give participants some breathing room to propose additional sessions in the time between Friday night and Saturday morning.

  • Counter-session proposals. I’m a little concerned that THATCamp sessions may sometimes be too broad (for example: scholarly publishing) to amount to much, and that may dissuade campers who have already been to a handful of THATCamps. Rehashing previous discussions should not be the point of THATCamp. I see counter-proposals as an experimental way to nurture new and radical topics by taking a basic session topic and proposing an additional or substitute session that makes an argument, or turns it on its head. For example, a counter session proposal on open access could be “boycotting Elsevier”, or something similarly provocative. My hope isn’t necessarily that THATCampers stir trouble, but that sessions have more of a point to them, which I think will ultimately lead to more interesting outcomes from the unconference. This idea is still rough, so please pitch in your own ideas.

What ideas do you have?

Initial Technical Questions About Mining a Million Syllabi

| Comments

I’m preparing for a conversation with Matt Gold and Amanda Licastro about mining the CHNM syllabus finder MySQL dump, and rather than keeping my notes and questions to myself I wanted to share them in a blog post. Earlier this year I blogged about some experiments that I’ve done to profile and process the dataset using Hadoop. I didn’t have the dedicated time to push my ideas further during the spring semester, but hope to change that this summer.

What would you do with 1.4 million syllabi? And with so many records, how would you know where to start? One of the unexpected discoveries within a few days of releasing the dataset was that while the mysqldump references the URLs of syllabi, most records do not have saved copies of their HTML. Doug Knox pointed out that the cached HTML only exists for about 17,000 websites. If the current uncompressed data dump is 500MB, can you imagine how large a completed dataset would be? 35GB uncompressed?
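As a rough sanity check on that guess, here is a back-of-envelope sketch. It assumes the size scales linearly with the number of cached pages and that the whole 500MB is cached HTML, which overstates things since part of the dump is metadata for every record. The function name is just for illustration.

```python
# Back-of-envelope estimate using the rough figures quoted above.
TOTAL_RECORDS = 1_400_000   # syllabi referenced in the dump
CACHED_RECORDS = 17_000     # records with saved HTML (approximate)
CURRENT_DUMP_MB = 500       # uncompressed size of the current dump


def estimate_full_size_gb(total, cached, current_mb):
    """Scale the current dump linearly, as if cached HTML accounted for all of it."""
    return current_mb * (total / cached) / 1000


print(round(estimate_full_size_gb(TOTAL_RECORDS, CACHED_RECORDS, CURRENT_DUMP_MB)))
```

Naive scaling lands a little above 40GB, so a guess in the tens of gigabytes seems like the right order of magnitude.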

There are two technical questions related to the dataset that I’m interested in addressing:

  1. How can we use data in the mysqldump to retrieve archived and cached copies of course syllabi, in order to establish a complete dataset?
  2. What technical platforms, standards, resources, or sample corpuses are necessary to make this approachable to the digital humanities community?

These are large questions that will take a lot of time to answer, require partnerships to address the needs of computing time and storage, and the expertise of a broad base of people. Consider this blog post a draft of some ideas, and the beginning of a conversation.

Completing the Dataset

Using data from the mysqldump, we have syllabus information including url, title, date_added, and chnm_visited. I assume that the difference between date_added and chnm_visited is that chnm_visited includes the date of the most-recently cached copy of a page. It may be possible to use the Internet Archive’s Wayback Machine to retrieve saved copies of those pages (if they exist). While this sounds great in concept, my initial look into this indicated that it may be difficult to write a scraper to do this. Recently, Ed Summers blogged about structured data the Wayback Machine can return and has given me new hope that we could use this structured data to locate archived copies of syllabi.

If this is possible, a variety of other questions follow: what methods should be used to retrieve this information, whether the Internet Archive is cool with this type of batch processing, and whether we would need a more robust datastore.
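To make the lookup idea concrete, here is a minimal sketch that asks the Wayback Machine's public availability endpoint for the snapshot closest to a given date. This may not be the same structured data Ed wrote about, and I'm assuming the chnm_visited value can be converted to a YYYYMMDD string; the function names are mine. Building the query and reading the response are kept apart from the network call so each piece can be tried offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the archived URL from a decoded JSON response, or None."""
    snapshot = response.get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        return snapshot["url"]
    return None


def lookup(url, timestamp=None):
    """Ask the Wayback Machine for one URL (makes a network request)."""
    with urlopen(build_query(url, timestamp)) as resp:
        return closest_snapshot(json.load(resp))
```

Anything run across 1.4 million URLs would need to be throttled and cleared with the Internet Archive first; this only shows the shape of a single lookup.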

Making The Dataset Approachable

When the syllabus corpus was originally released, before we realized that it had missing values, I expect that most of us only glanced at a few thousand lines on the command line (if that) and closed the file. What else could we do? It drives home the problem that I think many large datasets face: they lack transparency, the ability to be explored, and perhaps most importantly the ability to be hacked on. What if the dataset was 10,000 syllabi instead of 1.4 million; would it make a difference in how easily it could be investigated? What if there were only 1,000 syllabi? The question of how to make the dataset more approachable is interesting, and may involve both reducing the size of the dataset and working with platforms that lower the barrier to entry for users.
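One hedged way to get a 1,000-syllabus version is reservoir sampling, which draws a uniform random sample in a single pass without loading the whole dump into memory. This is a sketch of the general technique rather than anything built for this dataset; the records argument stands in for whatever iterator reads rows from the dump.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep each later record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


# Example: pick 1,000 row numbers out of 1.4 million.
subset = reservoir_sample(range(1_400_000), 1000, seed=42)
print(len(subset))
```

Fixing the seed keeps the sample reproducible, so everyone hacking on the small corpus would be looking at the same 1,000 syllabi.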

This summer I’m interning with Common Crawl, a non-profit that is creating an open crawl of the web and making the data accessible for developers and researchers. Common Crawl has a corpus of billions of webpages, and shares the same basic problem: it is hard to approach for people who may be interested in analyzing the data. There’s no way to browse the data without doing it in batch, and given the learning curve to use this data one may question their investment of time in the first place. My hunch is that a basic summary and visualization of the data would be an effective hook to interest people in using the data.

I think a barrier to doing big data analysis is that the data lacks the same hackability that is a property of many successful digital humanities tools and platforms. Most data mining tools and methods are heavy, and require a high level of technical knowledge to get started. For this dataset to be of use to a large community of DHers, I think we should be mindful of the patchwork approach that is taken when we hack on code using scripting languages, where coding becomes a method of investigating and understanding an idea or concept. The initial work I was doing with the corpus used MRJob, a Python module developed by Yelp for writing Hadoop streaming MapReduce jobs. MRJob does a great job simplifying the processing of data, and it would be possible to create an Amazon Machine Image for interested DHers to begin using it in the cloud. There may be a handful of these small steps that could be taken, making it much easier for people to use.
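To give a feel for how small a MapReduce job can be, here is a sketch of the map and reduce steps in plain Python, so it runs on a laptop without Hadoop or MRJob installed. It counts syllabi per hostname from the url field. This is an illustration and not the code from my earlier experiments; MRJob expresses the same two steps as mapper and reducer methods on a job class, and run_job is a stand-in for the shuffling a cluster would do.

```python
from collections import defaultdict
from urllib.parse import urlparse


def mapper(url):
    """Emit (hostname, 1) for each syllabus URL that parses to a host."""
    host = urlparse(url).netloc.lower()
    if host:
        yield host, 1


def reducer(host, counts):
    """Sum the counts emitted for one hostname."""
    yield host, sum(counts)


def run_job(urls):
    """Local stand-in for the shuffle step: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for url in urls:
        for key, value in mapper(url):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


# Hypothetical URLs, for illustration only.
print(run_job(["http://a.edu/hist101", "http://A.edu/eng200", "http://b.edu/cs50"]))
```

Because the mapper and reducer are ordinary functions, they can be poked at in an interpreter first and moved to a cluster later, which is the kind of hackability I have in mind.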

Next Steps

It’s clear that pushing this project forward would require a great investment of time, and the collaboration of a large number of people. This is definitely a project larger than an individual. I would love to hear feedback on the two technical questions that I posed, and learn who else may be interested in helping out.

If you’re interested in contributing, let me know in the comments. If anyone is interested, I will be at THATCamp CHNM this upcoming weekend and would love to discuss this in the hallways or at a session.


| Comments

Today is my birthday! One year older, I’ve been living in the Bay Area since August, attending graduate school, and assuming the role of Foursquare mayor at my favorite neighborhood coffee shop. I’m having a blast.

In what I hope will become a tradition of my own, I wanted to follow in the footsteps of Matt by writing something on this special day, and sharing my list of things to focus on in the coming year:

  • Blog again, including personal posts (like this) + technical posts sharing what I’ve been developing.

  • Push code to GitHub whenever possible, finished or unfinished.

  • Take my baking skills to the next level. 2011 was the year of bagels, but I stopped once I moved to Oakland. I want to get back into this.

  • Skip presenting, and attend more conferences, unconferences, and meetups. Focusing more on the work of others, I plan to continue taking advantage of all the great talks in the Bay Area and meet new folks.

  • Bike around Oakland and get to know different areas.

  • Learn a new language over the summer, and find an opportunity to interact with native speakers.

  • Devote time to exploring California’s great parks, and going off the grid more often.

  • Focus my energy and work around a guiding principle. I’ve been inspired by Bret Victor’s incredible talk “Inventing on Principle”.

I’m printing out this list to hang beside my desk, and hopefully next year I’ll be able to positively report on my progress. Let’s hope twenty-seven is my best year yet – I feel good about this.