As a future information professional, I would like readers to trust the content in this blog, and when Geoffrey Bilder spoke at the conference and began addressing the issue of trust in "informal" online publishing tools (like blogs and wikis), this topic resonated strongly with me. I have already written about his presentation and about our subsequent correspondence and his generous agreement to help me with the final part of my project. So, without further ado, let's get to the meat of the matter, as it were.
= = = = = # # # # # = = = = =
Ever since choosing this topic, I have been formulating my own definitions of the main terms in my project.
- What do I mean by user?
- What do I mean by publishing?
- What do I mean by user-generated e-content, in other words, user-generated e-publishing?
I also would like to introduce another key concept for my work, which is that of TRUST. Trust is crucial to what I want to talk about. So, let us start with the definitions:
USER : By user I mean an individual who may or may not be formally associated with a professional or scholarly institution, who possesses expertise and/or qualifications in a certain discipline or field of knowledge, but who is acting on his/her own behalf as an author, outside of his/her professional capacity.
PUBLISHING : The issuing of content to the public for free or for sale in any medium (print, electronic, audio, video, multimedia), preceded by an editorial process that lends authority to the final product.
USER-GENERATED E-CONTENT/E-PUBLISHING : The issuing of content to the public for free or for sale in any electronic medium (text, audio, video, mixed), authored (or in some way edited and reissued -- this is the case with SECONDARY PUBLISHING) by a USER, without going through the editorial process. This means that, unlike traditional publishing, there is no editor and no peer-review process. In order to establish authority for the published material, it is necessary to find alternative methods of building TRUST, from the reputation of the USER to the use of various rating/ranking/linking and other trust-building tools.
This morning, August 8, 2007, at 6:00 EST, Geoffrey called me from England. I had emailed him an outline with my questions. I recorded the interview and what follows is a faithful transcription of the conversation, minimally edited for fluidity.
Turtle = T
Geoffrey = G
G: I've been in California for the past week at a conference.
T: How was it?
G: It was very good.
T: What was the topic of the conference?
G: It was a conference sponsored by O'Reilly Media and it's called Sci Foo. They hold it once a year and they get a bunch of people who are doing interesting developments in the sciences together from all over the place; sort of cross-disciplinarian stuff. And they just sit around and talk about the future and what they think is coming down the road and what interesting stuff is happening. It's very loose but hugely interesting.
T: They're the ones that make the animal books, the Safari books, O'Reilly?
G: Yes. That's right. Tim O'Reilly runs a bunch of conferences and a lot of technical conferences as well. And he does these other ones which are more about social developments in science, and are just a little more general. So this was one of his broader-themed conferences. But he's an interesting -- and I'm sure you've heard a lot about him during your coursework -- his publishing organization is an interesting one to examine, because -- Well, let's just start from the fact that his books are probably some of the only books that you'll ever see that are shelved by publisher. He has that strong a brand presence. And that's almost unheard of in the industry. If you went into a bookstore that was shelved by publisher, you'd go out of your mind, in general, trying to find stuff.
But he's the exception, so it's interesting to try and figure out how he's accomplished that. Then the other thing that he's done is that he's been a real pioneer in two areas: one is electronic publishing, so his Safari book service has been doing online books now for upwards of six years. Clearly it's successful and it hasn't cannibalized his print sales. And it seems odd particularly in my industry, which is scholarly professional, how reluctant they've been to get into developing electronic books. They've just been so slow and they should probably look at him for a model.
The other thing that is interesting about his outfit is that he has really managed to tightly couple his conference business with his publishing business, which is another thing which I think my industry has been trying to do for a while with varying degrees of success. But he has really created quite a tight connection, I think, between his conference efforts and his publishing efforts; a very symbiotic relationship. So anyway, he's just an interesting character to look at when you're looking at our industry, when you're looking at the broader publishing industry. Anyway, this conference was held over the weekend. [...]
T: In your talk you said, "we must tell researchers what to look for." This was at the conference, and the "we" that you refer to, I assume, includes publishers and librarians. In the bio that they gave us for you it said that prior to CrossRef you had spent a number of years as a consultant to publishers and librarians to further the use of technology as a knowledge development tool. So, how do you see librarians in particular reaching out to researchers, in practical terms? What do you think would be the best way for librarians to start reaching out and advertising these new technologies and the ways in which they can be used effectively?
G: Let's start by saying that there are a number of places where what librarians do and what publishers do overlap. John Unsworth, who is the dean of the School of Information Science at UIUC, once gave a talk in which he talked about "lublishers and pubrarians," and how a lot of the things they do are very similar and likely to become more similar. And I think that's true with both publishers and librarians, as their roles start to change and they start to strip away what's clearly not as important in the electronic age, one of the things that you're going to find they both share is this role of identifiers of trustworthy, authoritative content.
This is something that they have both done, one as a pre-filter and the other as a post-filter, and they're probably going to both start converging and possibly even treading on each other's toes in trying to provide these services. Then the question was that we have to try and help researchers find stuff that's relevant, and both have done that, have played that role. And the reason that I think it's going to become an even more important role is that you are going to have -- you already have such an explosion of content that's out there. And that's a wonderful thing, and it's particularly wonderful if you're seeking entertainment, because you've got lots of other people out there who are identifying entertaining information or content. So things like a lot of these social recommender systems are fantastic in these environments. But they also have to be adapted for helping people to identify reliable and trustworthy content, because of course it's basic information theory: the more stuff you have out there, the harder it's going to be for you to find stuff that's relevant. And ironically, almost every trend that is benefiting researchers in their role as authors is making their lives as readers more difficult. The easier it is for them to reach a wider audience, to put stuff out there, to publish data, to publish working papers, to publish multiple versions of papers -- all of that makes their lives as a reader harder, because it means that it's harder to determine what's trustworthy; it's harder to determine what different versions of things are and how they relate to each other. And time is, from the researcher's point of view as a reader, of the essence, and there are studies that show that researchers are reading more things and spending less time reading each thing. And ideally they'd like to spend even less time reading. 
They would really love it if you could provide them with a tool that really did help them to only identify the stuff that was truly relevant and important, that would be a huge benefit. Because at the moment they spend an awful lot of time trolling through dross and filtering out stuff themselves. So this is a place where I think both publishers and librarians can play a role in helping them.
So what can they do? Well, one of the things is that they can use the very same tools that people are using for social bookmarking to do things like create annotated bibliographies for particular disciplines. For instance, the recommender systems can help to contextualize things that are out there. One of the problems that you have and that frustrates researchers in their role as readers is that in their role as authors they might publish five very closely related papers. As a reader, if you go out and find these five papers, it might not be immediately clear what the relationship is between these five papers. Does paper A expand on paper B? Is it a refinement or is it a correction? All of these things you don't know. So you end up with five papers that have a lot of seemingly overlapping content and how the heck do you determine what they are? You spend an awful lot of time doing this kind of work. So, again, if librarians and publishers were able to help researchers do some of that or understand what the context is, that would help researchers in their capacity as readers.
So there are all sorts of things that librarians and publishers can do, but the truth is I don't know. But the truth is that they have to experiment; they have to try different things. And I think librarians have probably been in a better position because they are more in contact with researchers in their capacity as consumers of information. Publishers in contrast, historically, have been removed. First they were removed because they worked through agents. Now they're removed because at least they're working through librarians, but they're still not generally talking to the researcher as consumer of information as much as they are talking to the researcher as producer of information. So publishers have got to do a lot more, probably, to understand what the challenges are of the researcher as reader.
T: The second question, as you can see, is about the "My Brain" button. I am really enamored with this idea. I think it's wonderful and I'd like to know more about the idea of the My Brain button. Was it your idea? Did you get it from somewhere else? Have you thought of setting up a kind of repository of brains where people could look for kindred spirits, as it were? In the sense of people who are interested in the same things they're interested in and what they're looking at, what they're reading? I'm still thinking of this as an aid in disseminating trustworthy material. I found myself repeatedly thinking it would be the ultimate dating service, this collection of brains. If you could find a brain that matched your own that would be your soul mate. This is more anecdotal thinking, of course. But tell me more about it.
G: I've encountered a lot of people who have discussed some of the same problems that led to my phrasing it this way. But the phraseology came to me when I was trying to explain to a researcher colleague of mine why I thought that the ability to subscribe to the RSS feed of somebody's bookmarks, the RSS feed from their blog, the RSS feed from their Wiki, the RSS feed from their calendar -- why I thought that that was interesting and powerful -- and for some reason at the time I said, it effectively lets me subscribe to their brain. And then he got it. All of a sudden he understood what it was that I had been tortuously trying to explain. So I tried the phrase out on a few other people and it seemed to resonate.
Another phrase that I've heard people use more recently for the same concept is "lifestreaming." You create all sorts of streams of information from what you're doing; so it might be your pictures from Flickr, it might be the music you're listening to; it might be that you're using Twitter to update people on your mood or the fact that you're getting up to get a cup of coffee or whatever. But the fact that all of a sudden you can almost dump everything that you're doing into some digital form that other people can consume. And, as I said, the phrase lifestreaming seems to be popular at the moment.
Personally, at least in this industry, I prefer brain subscription because when you think about a lot of what researchers, again, as consumers of information are concerned about, they want to know what their colleagues are doing; they want to know what the state is in their industry; they want to know what their research group is doing. And in meatspace there are a lot of barriers to sharing information, not the least of which is physical proximity. So these tools present people with the ability to share information in near real time about what it is that they're discovering, what it is that they find interesting, what directions they're taking. And you can imagine that if you are a researcher and you're collaborating with people all around the world, and your research group is scattered in labs everywhere -- it would be tremendously useful; it would hugely reduce the friction of collaborating, because you could see exactly what people were doing at any given time, and understand what was going on.
So I think that this is a very powerful way of doing things. I think that you're beginning to see -- again, let me get back to the use of social tools. One of the things that usually frustrates me, when generally people show tools like social bookmarking tools or recommender systems or blogs or something like that, they always show sort of the top level, the most popular stuff. If they show you del.icio.us, they say, look at this, you can see what everybody is bookmarking and what everybody finds interesting. And if they show you a recommender system like Digg they say, see, you can see what people are voting on and what the most popular stories are. And if they show you blogs they show you the most popular blogs, that are inevitably blogs about gadgets or news or something like that. So a researcher looks at something like that and goes, well, that's all very entertaining, but what the hell has this got to do with me, why should I care? And the answer, I think, is that they've shown us the most naive use of those tools. The truth is that in order to use them effectively, you subscribe to the blogs of people you already respect or who are interested in what you are doing. You only look at the bookmarks on del.icio.us of the people that you're interested in and that you think have something relevant to say. You only subscribe to the Flickr photos of people that you care about. So you immediately narrow it down.
T: And my question to you, then, is, how would you go about finding people whose brains you are interested in? In other words, they're not necessarily just people you know. As you said, the research community is large and many researchers who are working on the same things might be separated by oceans and continents. So have you thought of setting up a kind of repository of brains?
G: There are repositories out there, and they're kind of scattered and I think that you can make use of them. The brain subscription button was an approach to this. I generally don't like centralized -- I think there's a lot of evidence that centralized approaches don't work in these things. So the whole idea of creating a centralized brain repository would probably backfire, because people would say, I don't want to have to use your tool. I want my own tool. You use del.icio.us but I use Furl; or you use CiteULike but I use Connotea. You're immediately going to start falling into these problems with people who have different preferences for tools. So the idea behind the brain subscription was, all right, what if there was a way where you had a format where you could record where it was that you had various streams of information about yourself being collected? So I could create a little embeddable format that I could put on a webpage, that would say, if you want to know what I'm doing, this is where I do my bookmarks, this is where I do my blog; this is where I have a Wiki. And if you made that format machine readable, then you could build a whole bunch of tools that would allow you to go out and harvest this information, and different people could build different tools to harvest it and make use of it.
You could in theory say, I want to build a tool that goes out and finds everybody who has a brain button and who has in their del.icio.us bookmarks a category on Publishing 2.0. And that might be a good way of finding a bunch of people. So that was the idea behind the brain button. But there are other mechanisms, obviously, that are just more akin to techniques that we use already.
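The brain-file idea Geoffrey describes -- a small, machine-readable record of where your various streams live -- maps naturally onto OPML, the format he names later in the interview. Here is a minimal sketch of generating such a file; the owner name and feed URLs are hypothetical stand-ins, and the `type`/`xmlUrl` attributes follow the convention that feed readers generally expect:

```python
import xml.etree.ElementTree as ET

# Hypothetical feed URLs -- stand-ins for a real person's streams.
STREAMS = {
    "My bookmarks": "http://del.icio.us/rss/example",
    "My blog":      "http://example.org/blog/feed.rss",
    "My wiki":      "http://example.org/wiki/changes.rss",
}

def build_brain_opml(owner: str, streams: dict) -> str:
    """Serialize a 'brain file': one OPML <outline> entry per feed."""
    opml = ET.Element("opml", version="1.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = f"{owner}'s brain"
    body = ET.SubElement(opml, "body")
    for title, url in streams.items():
        # type="rss" and xmlUrl are the attributes harvesting tools look for.
        ET.SubElement(body, "outline", type="rss", text=title, xmlUrl=url)
    return ET.tostring(opml, encoding="unicode")

print(build_brain_opml("Geoffrey", STREAMS))
```

A harvesting tool of the kind he imagines would fetch this file from someone's web page and subscribe to each listed feed.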
One of the things is that generally you follow a path. You meet somebody, you say, all right, I've met this guy Danny Ayers, he writes a blog, I'll read his blog, it's very interesting. He cites these other people a lot, and I look at their blogs because he cites them a lot, and you know what, I think they're interesting too. I don't know them, but I'm going to subscribe to their blogs. And you know what, they cite other people. And then you start realizing that four of them are always referring to this other person, and that person suddenly you realize that they're probably a pretty big authority here... So you do a lot of the same things that you do when you're looking at a journal or a book, and when you're doing background research you get a sense of what the social network is and who is in authority in this area. It would be wonderful if things like the brain button would allow you to automate it a little bit. But some of the tools are already out there.
Let me address the last issue there, about the dating. Interestingly, if you look at something like Nature Network and if you read Charkin, he'll say stuff like, funnily enough, scientists have social lives too. So a lot of the social networking application stuff, if you look at Nature Network and some of the stuff they're doing, they're really thinking about the scientist as not just a scientist but as a person, a person who has to rent an apartment, who has to try and figure out what's going on in a city. So they're definitely combining this notion that professional and social might overlap. My observation is that I know a lot of people who share my interests professionally that I certainly wouldn't want to date. I'm sure it goes the other way as well. But I think that's a sensible way to go about it, to a certain degree. But Nature certainly is pursuing that line.
T: That's very interesting. This brings us to the third question, which you've already answered in part. I'm curious to know, would you call this "publishing" your brain? The creation of a "my brain" button, do you see that as publishing your brain to the world?
G: Yes, it's providing people with a place with all of your feeds that you're generating on what you're doing. So, to use the other phrase, it's sort of a collection of -- somebody else might call it a lifestream button, or something like that.
T: So you feel that rather than having a centralized repository, the electronic word of mouth, as it were, is a better way of disseminating this information.
G: Yes, we have a lot of tools that are very good at going out to websites and consuming information, harvesting information and pulling it together. So you don't need a centralized place where everybody puts their RSS feeds. You have one place where you read RSS feeds, but the RSS feeds are coming from all over the place. You don't need one place where all people's brain files are stored. They can just be stored on their own website and then you can have things go out and harvest them, or index them with search engines and things like that.
T: Let me ask you a technical question. I'm also interested in understanding a little bit more about how the technological side works. Currently, is there not an RSS reader that could decode the OPML and display it in a more readable format?
G: That's actually a problem with browsers. The brain file is basically just an OPML file, Outline Processor Markup Language. It's just a machine readable format that points you at different locations. The problem is that when you click on that at the moment you get a horrible mess.
T: That looks like an XML file.
G: It is an XML file. I don't know whether you remember this, but even a year or so ago, if you went into your web browser and clicked on an RSS button, you would also get a bunch of horrible XML, because the browser didn't know what to do with it. So in recent browsers now -- I'm talking about the latest versions of Firefox and Internet Explorer -- if you go and you see an RSS button and you click on it, all of a sudden the browser will say, ah, I know what this is. I'm not just going to show them this mass of XML; I'm going to offer them a choice of subscribing to this RSS feed via whatever their favorite RSS reader is, whether it's Google Reader or Bloglines or whatever.
So they modified browsers to deal with this more intelligently. Ultimately if something like the brain subscription were to take off, you'd want some sort of mechanism whereby if you click on an OPML file it would recognize it and it would say, okay, fine, I will import this OPML file into whatever your reader is. If you want to use the brain button at the moment, what you have to do is save that XML file onto your hard drive, and then go over to Google Reader or Bloglines, and import it. And it will import. They both support OPML. You can automate it. It's the browser that doesn't do it automatically. You have to do the manual step.
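The manual import step Geoffrey describes comes down to reading the OPML file and pulling out the feed URLs. A minimal sketch of what a reader does on import (the sample file and its URLs are invented for illustration):

```python
import xml.etree.ElementTree as ET

# An invented brain file, following standard OPML conventions.
SAMPLE_OPML = """\
<opml version="1.0">
  <body>
    <outline type="rss" text="Bookmarks" xmlUrl="http://del.icio.us/rss/example"/>
    <outline type="rss" text="Blog" xmlUrl="http://example.org/blog/feed.rss"/>
  </body>
</opml>"""

def feed_urls(opml_text: str) -> list:
    """Return every xmlUrl found on an <outline> element, at any depth."""
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]

print(feed_urls(SAMPLE_OPML))
```

Once it has that list, the reader simply subscribes to each URL -- which is why Google Reader and Bloglines could both support OPML import with so little ceremony.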
T: And if I did import it, what would then happen, would I just have a collection of links?
G: Yes, you would have a collection of RSS feeds in your RSS reader, and one of them would point to my del.icio.us bookmarks, and one of them would point to my blog, and the other one would point to perhaps my LastFM account, or something like that.
T: Let's move on to the next question, then. Thinking about these brain buttons as trust metric tools, I'm reminded also of the fact that you called links votes, which of course is self-explanatory. But how would you tie in these votes with the brain subscriptions, in other words to apply trust metrics to brain subscriptions? Meaning: I'm looking for things that I trust, that I consider to be trustworthy, I'm getting these different feeds from different people, I want to be sure that I'm not misled.
Since you suggested that I look for things that talk about trust metrics, I have been, and I've been reading about people who intentionally -- as usual -- abuse tools to create chaos rather than being helpful.
G: Okay, let's start with the linking as votes. Yes, a link right now is treated pretty much as a vote by things like search engines. Google, for instance, its PageRank is treating a link as a vote. But the other thing I said is that's really a very naive thing to assume. Because you will often link to things to say, look at this, it's as stupid as dirt. And we do this with citations as well. When you cite something, you might cite it because it supports what you're saying; you might cite something because you're arguing against it. You might cite something as background material; you might cite something as a counter example. There are all sorts of reasons that you might cite something. Ultimately I think that people are going to want to be able to add some sort of semantic hint to any link, so that they can differentiate between these kinds of links or citations. And you already see that to a certain extent with the attempts to deal with blog spam that search engines came up with. They said, okay, there are web links where people allow you to put an attribute with a value of "nofollow" on it. And if we see a link that has an attribute with a value "nofollow" on it, we're not going to count that link as a vote. So now, for instance, blogging software will automatically put a "nofollow" attribute on any links that are included in comments. So people who were using comments to deliver spam, with links back to the sites that they wanted people to go to, they can't use this mechanism anymore, because search engines don't care if there are links in comments because they don't treat them as votes anymore.
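The nofollow mechanism Geoffrey describes is just a rewrite the blogging software applies to comment HTML before publishing it. A rough sketch of that rewrite (the regex approach is a simplification; real software uses a proper HTML parser, and the example comment is invented):

```python
import re

def nofollow_comment_links(html: str) -> str:
    """Add rel="nofollow" to every <a> tag in comment HTML, so search
    engines stop counting those links as votes. Tags that already carry
    a rel attribute are left alone (a simplification)."""
    def fix(match):
        tag = match.group(0)
        if "rel=" in tag:
            return tag
        return tag[:-1] + ' rel="nofollow">'
    return re.sub(r"<a\b[^>]*>", fix, html)

comment = 'Nice post! Visit <a href="http://spam.example">my site</a>.'
print(nofollow_comment_links(comment))
# The anchor now carries rel="nofollow", so PageRank ignores it.
```

This is the "all or nothing" semantic hint he mentions next: the link either counts as a vote or it doesn't, with no finer gradation.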
Now, of course that could be refined quite extensively, if you had more sophisticated ways of indicating the relative importance of the link. At the moment it's pretty much all or nothing. So this issue with trust metrics, and how it could be applied to brain subscriptions. If you subscribe to my brain and one of the elements of my brain is my bookmarks or perhaps the blogs that I follow, you see that there's already a way that you might be able to traverse that, and say, okay, Geoffrey reads Lee Dodds; Lee Dodds reads Danny Ayers; Danny Ayers reads Clay Shirky, and all of them seem to read Jon Udell. Therefore, I'm thinking that since I read three of these people and they all read Jon Udell, Jon Udell might actually be somebody that I should be paying attention to and who I consider to be trustworthy.
There are two problems that you often see in trust metrics, and I'm sure you've already seen this, and one is: to what degree does transitivity work? How far should it go? And the other is context. I might trust Lee Dodds on anything having to do with technology and publishing, but I certainly don't trust his taste in clothing, or as far as music or anything like that. So you also need an ability to create some sort of a context for whatever trust metric you're using. And that's an important element of any trust metric that's going to be successful.
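The traversal Geoffrey walks through -- "I read three people and they all read Jon Udell, so maybe I should too" -- can be sketched as a one-hop transitive trust metric over a who-reads-whom graph, with the topic key standing in for the context problem he raises. The graph below is invented, seeded with the names from his example:

```python
from collections import Counter

# Hypothetical who-reads-whom graph, keyed by (person, topic).
# The topic key is the "context" a usable trust metric needs.
READS = {
    ("me",          "tech"): ["Lee Dodds", "Danny Ayers", "Clay Shirky"],
    ("Lee Dodds",   "tech"): ["Jon Udell", "Danny Ayers"],
    ("Danny Ayers", "tech"): ["Jon Udell", "Clay Shirky"],
    ("Clay Shirky", "tech"): ["Jon Udell"],
    ("Lee Dodds",   "music"): ["Some DJ"],  # trust doesn't carry across contexts
}

def recommend(person, topic, min_votes=3):
    """One hop of transitivity: recommend anyone whom at least
    `min_votes` of the people I already read (on this topic) also read.
    Limiting to one hop is one crude answer to 'how far should it go?'"""
    votes = Counter()
    for trusted in READS.get((person, topic), []):
        for candidate in READS.get((trusted, topic), []):
            if candidate != person:
                votes[candidate] += 1
    return [name for name, n in votes.items() if n >= min_votes]

print(recommend("me", "tech"))   # Jon Udell collects three votes
print(recommend("me", "music"))  # different context: no recommendations
```

The two knobs here, hop depth and `min_votes`, are exactly the transitivity and thresholding questions that make real trust metrics hard, and that gaming attacks target.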
And then the last thing you brought up is this issue of people trying to game trust metrics. Well, that's no different than the analog world. People "salami slice publish" now. They take something that could easily be written up in one paper, and they split it into five papers, because this somehow gets them better citation counts and stuff like that. People do this kind of stuff all the time. With computer-based trust metrics, the interesting thing about them, and the challenge and the reason it is very hard, is that people are trying to develop techniques to make them self-balancing, to make it very hard for people to game the system. That's the challenge in creating an electronic trust metric.
I don't know whether anybody will ever -- if somebody is ever able to create a completely self-balancing trust metric that's calculated, that would be phenomenal. I somehow doubt that that will happen. I think that there will probably always be elements of people having to go in and hand tweak them and monitor and administrate them. But we'll see how that works.
T: We can skip the next two questions, since you've already answered them. I had been reading about "attack resistance" and you have just been answering that. So let's move on to question 8 then.
In your podcast with Jon Udell you touch upon the things I'm most interested in exploring: that user-generated e-publishing and scholarly communication are becoming closer and intertwined; that there is original material like blogging and secondary material like social bookmarking, a form of secondary publishing as we have already talked about in our correspondence. As a close to our interview, let's talk a little bit about this: how can we as librarians create e-published material that has legitimacy -- how do we go about implementing trust metrics so that the resulting reliability will be visible to the readers? How do we get the word out there? How do we show that our brains, where we aggregate our output -- original, primary and secondary -- are trustworthy?
G: That's a big one. Let's start with the intersection between user-generated e-publishing and scholarly communication. I think at the beginning of this you defined user-generated content as being people generating content outside of their primary -- in their personal capacity as opposed to their professional capacity.
T: Within their professional interests, but not as official spokespeople -- because that obviously is regular scholarly publishing; in other words journals or the proceedings of conferences and things like that. But this would be something more akin to blogs, or things like that. That are still being used as methods of disseminating their scholarly research, but maybe it's prepublication, maybe it's in the course of their research they're divulging some of the things that they're discovering and so forth.
G: I don't think that people generally try to define -- they should try to define what they mean by user-generated content. My guess is that if you ask most people what their definition of it was, you would get something along the lines of this: that in a traditional media company like NBC, television or radio, you have a group of people who are professionally paid to create content and send it out to other people who consume it. The audience for the content and the people producing it are different.
Whereas with user-generated content, the audience is probably generating as much as anyone. So the issue of professionalism or the affiliation of the thing I think can probably be separated from that. As an example, the one thing that I like to point out is that unlike a traditional publisher where they have a small group of professional writers who they identify and then they help them to disseminate their content to as wide an audience as possible, from the publisher's point of view, they could never make a living if they only sold to the people who are writing the content. If their entire audience for novels was novelists, that would be a problem for them.
Now, contrast this with scholarly publishing, where their entire audience for research or research papers is also producing research papers. That's a big difference. So what I would say is that scholarly publishing has been in the business of user-generated content forever; whereas other media industries have not. We have always had this bizarre situation where our reading audience is also our producing audience. Almost, not quite, because there are an awful lot of faculty members who no longer do research, they just teach. But by and large, you've got a far higher percentage of your audience also being people who produce content.
There's a classic system that's used for analyzing the competitiveness of any particular industry, and it's a guy named Michael Porter who I think teaches at Harvard Business School. He has this concept called the Five Forces Analysis, where if you look at an industry and you look at these -- the five forces in an industry being the bargaining power of suppliers, the bargaining power of customers, threat of new entrants and threat of substitute products and then the competitive rivalry within the industry -- if you actually look at that and you look at the scholarly publishing industry, you realize that the suppliers, substitute products, new entrants and power of customers, all of these people are the same people in the scholarly industry. Any faculty member can go out there and say, you know what, I want to create a new journal, or we're going to try and create a substitute product, we're going to create this open access archive. They are also the suppliers of the content that the publishers are publishing. Effectively you've taken any traditional industry's pretty distinct entities and you've mashed them all together. It's no wonder it's kind of hard to figure out this industry. It's very strange in some ways.
So I think our industry, scholarly and professional publishing, has always been in the user-generated content business. Now, one of the issues I was talking to Jon about, which I think is interesting, is that we've always -- and this has been a constraint of physical printing -- only wanted to invest the money in disseminating the thing that had the highest level of authority. Because you had to print this up, because you had to mail it out to all these places, you wanted to make sure that whatever you were printing was super-super highly reliable, that it had gone through an amazing process of quality control. Because it was very difficult to retract it, it was very difficult to correct it, it was very time-consuming and expensive to do all of those things.
Now, in the electronic world that changes a bit. All of a sudden if something is wrong it can be corrected pretty quickly, it can be clarified pretty quickly, and it's not that expensive to disseminate. So a lot of the rationale behind original dissemination strategies has probably disappeared, but I don't think our industry has quite adapted yet. Researchers haven't adapted yet either, so it's not just publishers and librarians. If it doesn't cost you that much to put out an idea that isn't completely formed, but that idea still might be useful to other people, then put it out. That's fine. But what I think needs to be done then is that we have to start thinking about different gradations of trustworthiness of the content that we're putting out there.
So some scientist's musings on their blog should probably be treated differently from a working paper, and that in turn should be treated differently from a paper that's been submitted to a journal, and that in turn should be treated differently from a paper that's been accepted and published by a journal, because each goes through a different layer of authority checking, trustworthiness checking.
And then likewise, even after that, an article that's been published by a journal should probably be treated differently from an article that has been published by a journal and that has been extensively commented on by lots of other people publishing. Either commented on by other articles through citations, or commented on by other scientists through less formal means like blogs or Wikis.
So I think one of the big things that the publishing industry has to do is to figure out, all right, there is demand for these different levels of trustworthy information, how are we going to supply it and make it clear what the relationship is between them, so that we don't treat everything as having the same degree of trustworthiness. Yet we don't stymie communication because we don't want to put something out there until it's absolutely been through every process imaginable.
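The gradations of trustworthiness that Geoffrey describes could be sketched as an ordered scale. This is my own toy illustration, not anything he or CrossRef proposes; the level names are invented:

```python
# A hypothetical ordered scale of trustworthiness for scholarly content,
# following the gradations described in the interview. The names and
# numeric values are my own invention for illustration only.
from enum import IntEnum

class TrustLevel(IntEnum):
    BLOG_MUSING = 1          # a scientist's informal musings on a blog
    WORKING_PAPER = 2        # circulated but not yet submitted
    SUBMITTED = 3            # submitted to a journal, under review
    PUBLISHED = 4            # accepted and published by a journal
    PUBLISHED_AND_CITED = 5  # published and extensively commented on

def more_trustworthy(a: TrustLevel, b: TrustLevel) -> TrustLevel:
    """Return the item with the higher gradation of checking behind it."""
    return a if a >= b else b

# A published paper has passed through more layers of checking
# than a working paper:
print(more_trustworthy(TrustLevel.WORKING_PAPER, TrustLevel.PUBLISHED).name)
```

The point of the sketch is simply that the levels are ordered: each step up reflects an additional layer of authority checking, so readers (and tools) can treat content at different levels differently instead of treating everything the same.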
So how can librarians and publishers create stuff that has legitimacy? A lot of it is just building up reputation. Librarians and publishers -- we're in a bit of a circular argument. Trust has a time component to it. If you have two people, one of whom you've known for twenty years and who has been trustworthy those twenty years, and you compare them to somebody you've known for three days and who has been trustworthy for three days, you probably have more trust in the person who's been around for twenty years. So one of the things that I think librarians and publishers have to make clearer is their track record of trustworthiness. They have to be very careful about making sure that they build a very solid track record, and once they've done that they've got to advertise it. They've got to think of some sort of mechanism for making the distinction between a publisher who's been publishing pretty much reliable information for hundreds of years and somebody who's just jumped into the game and is putting stuff up on their website.
They also have to create other metrics that allow people to evaluate the relative trustworthiness of content. Even right now I think that librarians and publishers -- some librarians and some publishers -- have a degree of trustworthiness that they could exploit. I think that for instance librarians creating -- or even publishers -- saying, all right, so you don't know what to trust in science blogs out there. We're going to create some guides, we're going to create you some tools that allow you to identify what we deem to be trustworthy blogs. Or we're going to have people review Wikipedia articles, and then perhaps the Institute of Physics or something like that will say, these are Wikipedia article entries that we've actually checked and that we think are pretty good. All of these kinds of things would help, all of these things are useful trust metrics that I think both librarians and publishers could start providing. And they will help people focus, when they're looking at websites, say, oh, actually this website has gotten some sort of little semi-endorsement from the Institute of Physics, so I'll treat it with a little more respect than I would some of the others, or something like that. I think there are a lot of things that they can do.
T: We're pretty much winding down now, and I really can't thank you enough for your kindness and generosity. But I'm also interested a little more in the practical aspects of blogging, for instance, as a means of e-publishing, since I've been involved in this experience. I had never done a blog before this one. Of course if I'd known what I was getting myself into I never would have done it; but at the same time I'm so glad I did. Now that I have done it, it's given me great satisfaction and it's also given me insight into the process -- for instance, the first few days I was putting tags in, and then afterwards it became such a monumental effort that I stopped putting tags in. Now that I'm finished I do want to go back and add tags, so that people can search through my blog.
I'm also interested in exploring things like: at what point do you think it's acceptable for people to make money with ads off of their blogs? I'm asking myself that question. Once my blog started getting some hits -- fellow students, professors, family and friends -- I got the popup from Google that said, make your blog make money for you. At first I thought, my goodness, they have to turn everything into a commercial venture. But then after a while I was putting so much work into it that I thought, gee, maybe I should be making money off of this, because it's just so much work. All these considerations are my reflections on what this experience has been like for me. However, I have also come across some technological challenges which I find important to think and talk about, and they tie in with what you're doing at CrossRef. It became clear quite quickly that my blog was going to be a kind of one-stop-shopping resource for my fellow students in the class for deciding what to do their papers on, because they could go there and see links to every single lecture we've had in the two weeks leading up to the conference. All the speakers gave their PowerPoints and their PDFs to Andy Dawson, he put them on the UCL website, and I linked to them in my blog. So anybody who just read my blog would get every possible link that was in the school, with the addition of links to everything that was mentioned: every company, every company website, and even concepts, people, and so forth.
I'm very interested in DOI's because I can't quite understand what it is about DOI's that makes them persistent. Does that mean, to create a DOI, you yourself have to have a place where you permanently keep the things that are being linked to, the papers or whatever, that is going to permanently reside in one place so their URL never changes? Or how is that actually implemented?
G: Let me start with the amount of work needed to do a blog, because I think you've done something interesting which I think a lot of people do, myself included, when they start a blog. And it's inevitably the way to stop blogging. And that is that what we do is we have great ideas about big things we want to blog about, and they're too big. It starts becoming a real writing project. And then it becomes so much work that we abandon it. I think a lot of us are not used to the notion that -- The most successful bloggers that you see out there, I think are really good at -- they're far more comfortable putting out half-thought-out ideas in a very informal style.
T: And of course I couldn't bring myself to push the "publish" button until I had read it over a hundred times.
G: Exactly. And that's a cultural difference, and one that's a really hard thing for people who are used to writing in that way to get over. I keep trying to force myself, every time I think of writing something for the blog, I think, I've got this long thing I want to talk about. And the truth is that if I just broke it down into lots of short little entries, and if I stopped obsessing about the wording and phraseology and all of that stuff, I'd be able to post. And the people who I know who have really gotten over that and have adopted a far less formal style and are far happier just posting short things and then linking them together later, they turn out to be the most successful bloggers. So my advice, and it's advice that I wish I followed myself, would be: get less formal, post shorter things.
T: Of course I give myself this advice every day. It's just hard to do it.
G: Then there's the DOI question. DOI's are actually pretty easy to deal with, and unfortunately there isn't much technology magic behind it. Let's just start with the problem, which is that web links break. Linkrot is a huge problem. And even when the web first started out, we all realized that linkrot was a big problem. The simple reason for this is that the strength of the web is also its weakness. It's totally distributed. One web server doesn't have to know of the existence of another web server. If you host a web server and I point to it, your web server doesn't have to know that. It doesn't have to approve my creating a link to your server, it doesn't have to do any of that stuff. That's really powerful and it has all sorts of scalability aspects to it that have contributed to its success. The problem is that it also means that if I link to something on your site and you move the thing that I linked to, the link will break. Or if you decide to change sites, the link will break. And there's no way for you to know that I'm linking to your content so that you can inform me, you know what, I'm moving this stuff, so you've got to update all your information.
So this is the fundamental structural problem of the web, and publishers recognized very early on that this was going to be a problem, particularly for them, because if they wanted to create an electronic environment that included electronic citations that would allow you to follow the link to the source material, they didn't want that stuff breaking, because citations are the building blocks of scholarship.
So they thought about this and they realized that really the only mechanism that they could build was to create an organization where publishers who were serious about maintaining citation links could join, and in joining this organization they promised to do some things. They effectively are saying, we will adhere to certain terms: we will submit unique identifiers for all of our content, and when people use these unique identifiers they will be able to locate our content, no matter where it is. But there is no real technical magic behind it.
T: Does that mean that the DOI is actually a miniature searcher, that searches for it wherever it is?
G: No, it's not a searcher: it's a pointer. All it is, is a pointer. The concept of a pointer is -- are you a computer programmer of any sort?
T: No, but I have a basic understanding of some programming concepts.
G: Think of a pointer as, if you have a post office box, that's a pointer. You can say to people send mail to my post office box, and it doesn't matter where you physically live. You can always get your mail but it's going to this post office box instead. The post office box turns into a pointer for you. Anyone can send mail to that post office box, they don't have to know where you physically live. They'll know you'll get the mail. A DOI is a very similar concept. We're saying, when you cite something, don't cite the location of the thing, cite this number instead. And this number, or this string -- it's not really a number -- this identifier, when you cite this identifier, what we will do is we will go look up the most recent physical location of that place, and then take you there.
T: And is that string embedded in the object?
G: The DOI, that string you see, is just an identifier. You click on that and it passes that identifier to a website that looks at the identifier and says, okay, someone is trying to link to this, where does it live now? And it returns the URI or the place where that content lives currently.
T: And to discover where it lives currently, is the string also embedded in the object itself? I mean, how do you find it?
G: No, it's not. So all you're doing, you see the DOI. The DOI is just an identifier that can be assigned by the publisher. What CrossRef keeps is a huge database that maps a DOI to a URI, and if the URI changes it's up to the publisher to tell us that the URI has changed, and then they can update the URI in our table and we'll continue to find the content.
What this means from a practical point of view is, let's say you go out and you cite three articles that are published in Wiley-Blackwell journals. Then Wiley-Blackwell decides to sell two of those journals and then they change where they're posting the third journal. What they will do is they will send updated information about where those DOI's point to, to us at CrossRef. And you don't have to worry about a thing, because you've cited the DOI's instead of the URI's. So when somebody clicks on those DOI's they'll come to CrossRef and say, okay, now where are these located, because they're not at Wiley-Blackwell anymore. And we'll tell them where they're located now and we'll resolve to where they're located now.
The distinctly un-exotic bit about it is that persistence is not -- we don't have some magic technical solution to persistence. Persistence is a social construct. We are a membership organization, and in joining, our members are agreeing to adhere to certain principles, one of which is that they will always update where DOI's currently point. If they don't do that, we have ways we can find them, we can do all sorts of stuff to try and get them to adhere to the principles behind CrossRef. So it's very much just an organizational mechanism for persisting citation links. And the problem is, a lot of people think, well isn't there a technical solution for this? And the answer is, there isn't a technical solution with an architecture like the Web. If we had a hypertext system where everything was centralized and every document knew about every other document, then you could create a technical means for making sure that links never broke. So if you read early hypertext pioneers like Ted Nelson, who had this concept, his Xanadu project and all of these things -- These are early hypertext systems where everything was controlled fairly centrally, and therefore they could do things like make sure links never broke, and make sure links were always bidirectional and not unidirectional. The Web architecture doesn't support that easily, so we had to create a social construct that allows us to preserve persistent citation links.
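The way Geoffrey describes it, the core of the service is a central lookup table mapping each DOI to the current location of the content, which the publisher updates when things move. Here is a minimal sketch of that idea -- my own, with made-up DOIs and URLs, and in no way CrossRef's actual implementation:

```python
# Toy sketch of the idea behind DOI resolution: a central table maps a
# persistent identifier to the content's current location. The DOIs and
# URLs below are hypothetical; this is not CrossRef's real system.

class DoiResolver:
    def __init__(self):
        self.table = {}  # DOI string -> current URI

    def register(self, doi, uri):
        """Publisher deposits a DOI along with the content's current location."""
        self.table[doi] = uri

    def update(self, doi, new_uri):
        """When content moves, the publisher repoints the DOI; every
        citation that used the DOI keeps working unchanged."""
        self.table[doi] = new_uri

    def resolve(self, doi):
        """A reader clicking a DOI link is sent to the current URI."""
        return self.table[doi]

resolver = DoiResolver()
resolver.register("10.1000/example.123", "https://journal-a.example/article123")
# The journal is sold and the article moves to a new site:
resolver.update("10.1000/example.123", "https://journal-b.example/archive/123")
print(resolver.resolve("10.1000/example.123"))
```

The point of the sketch is that a citation stores only the identifier, never the location; the pointer can be repointed behind the scenes without any citing document having to change. The "social construct" part is everything the code cannot do: getting publishers to actually call `update` when their content moves.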
So it is abstruse. But the simple way to put it is that we fight linkrot. And we make sure that citation links, which are very important, don't break. And that's one thing we're doing. We're going to be branching out and providing other kinds of services like that.
T: Well, that's wonderful. I think it's a very valuable service. And it also ties in with the concept of trust, because if somebody goes to your website looking for content, and clicks on the links and they don't go anyplace, that erodes their trust instantaneously. Even if it's not something as crucial as following a citation, even if it's more banal or mundane, still, when you click on links and they don't go anyplace, that immediately lowers the degree of respect you have for whatever resource you're using.
G: And this is the root of the conversation that I had with Jon Udell, where he's saying that for a long time, really, only scholarly authors were really concerned about citations. Now all of a sudden bloggers everywhere are concerned about citations. Jon Udell, part of his professional life is his blog, and if he moves from one organization to another -- for example he recently moved to Microsoft -- and he wants to take his content with him, his URI is going to change and all the links to his content are going to break. And that's not acceptable. And in your case, you're blogging at turtleinlondon.blogspot.com. You started that website and named it that largely because it started off because you were going on this course in London and you wanted to blog about it. All of this material that you've recorded here might have a more permanent value, and you might decide that you want to move off of blogspot, or you might decide that you want to make this part of a more general site on the publishing industry. So you might start another website and you might want to move all of this content there. As soon as you do that, anybody who's linked to the content, all of those links are going to break. So Jon is interested in trying to figure out whether there's a way that the concept behind CrossRef can be extended into the wider web, for people who are concerned about links to their content not breaking.
T: Not to mention the fact that I have no control over what the links within my blog do. In other words, if I link to your paper and then you move your paper, how am I going to know that.
G: Absolutely. And I agree with him. I think that this issue of persistence of links is really big, and we have to start thinking about how we can provide mechanisms for people to ensure it. The problem, again, is that there is no technology magic that can be applied to it. So it's going to probably require an organization like CrossRef providing a similar service for a wider audience. And immediately you get into some difficult questions there. For instance, right now we're a membership organization. If publisher X joins CrossRef and then doesn't update their URI's, as I said, we can find them and we can sort of enforce norms of behavior once they've joined our organization. But how do you do that if you have millions of individuals? You can't enforce the same norms of behavior, so you probably have to create a different kind of mechanism.
T: Well, I have one final question, which you can choose to cut very short if you like. When you were at the conference, you talked about vertical and horizontal trust. And what we've been discussing this morning, in the questions of how do we spread the word, how do we get things out, a lot of the things you've described sounded to me, and correct me if I'm wrong, horizontal. In other words from one to another to another, they have been translating laterally from one person to another. Do you think that in these more "informal" technologies, as you said to Jon Udell, like blogs and Wikis and so forth, there is any room for a more vertical structure, or do you think that the horizontal structure of electronic word of mouth, as it were, is the best way to disseminate this information? Please tell me your thoughts on that.
G: The horizontal / vertical, global / local axis is this trust model that I first read about in the book Trust from Socrates to Spin by Kieron O'Hara. The short answer is that I don't think that horizontal/local trust works. It just doesn't work. We have so much evidence of it. We have spam, we have phishing, we have people stealing people's content.
T: It's too vulnerable to attack, you're saying.
G: It's too vulnerable. On the other hand, I also don't think that vertical/global will work anymore, particularly not on a distributed structure like the Internet. So the short answer, I guess, is that I think that the promise of a lot of the social tools that we see, and particularly the promise of trust metrics, is that we might be able to create a mechanism that mitigates, that allows you to transcend the dichotomy between local and global and vertical and horizontal. And say, you know what, we can overcome the limits of these in some way.
For instance, again, Kieron O'Hara, when he talks about local trust he talks about trust that's established through some sort of personal knowledge. So I trust this person because they're my friend or my neighbor, they're related to me. That kind of trust, in the analog world, in meatspace, doesn't scale very well. It has zero geographic scalability. But all of a sudden, with the Internet, I can, through long acquaintance with someone online, develop a trust profile of them. It would have been very, very difficult for me -- I could have done it through letter writing or some other means in the old days, but now all of a sudden my local trust network ironically is no longer geographically constrained as it was in the old days.
So that's one example of how the Internet can allow you to overcome this obstacle.
So global trust, which is trust that has a transitive quality -- that is, I trust an auditing company and therefore I trust anybody that they audit -- carries its own intrinsic, systemic risks of failure, and those too can be mitigated using social networking tools. So I think there are a lot of interesting developments out there that promise to breach the divide between Internet trust, which is local and horizontal, and scholarly trust, which is vertical and global.
= = = = = # # # # # = = = = =
Thus ends my interview with Geoffrey Bilder. He is an amazing person. Very generous with his time and expertise, and truly passionate about what he does.
I would like to close with a few of my personal thoughts on this experience of blogging for the very first time -- or rather, with some questions, which are almost always more important and more interesting than the answers.
1. If I had known what I was getting into, I never would have embarked on this adventure.
2. Since I didn't know what I was getting into, I did. Once I realized how hard it was going to be, it was too late, and I had no choice but to forge ahead. Having said that, I'm very glad that I could not foresee the vastness of this project, because it has been one of the most gratifying experiences as a semi-professional writer that I have ever had.
3. I agree wholeheartedly with Geoffrey about blogging: if you want to be successful, you have to feel comfortable publishing incomplete thoughts, poorly phrased sentences, with a few typos here and there. If I ever start another blog, I will try very hard to take his and my own advice on this topic.
4. My interests over the course of my degree program have gradually become focused on cataloging and metadata, and yet the first "casualty" of my blog was the metadata tags, the very objects I should be focusing on the most. If someone were to search the Web for blogs that talk about e-publishing and all things related to it, without the tags they might not find mine. The metadata, as we've heard many times before, can make all the difference between discovery and invisibility.
5. It is very interesting, and important I think, to highlight something that Geoffrey said in the course of this interview: a) in the business of scholarly publishing, the readers and the writers are all part of the same community, unlike almost all other types of publishing, in which a small number of authors writes for a large audience of non-writers; b) because the writers and the readers are one and the same, they are all engaged in producing user-generated content (whether print or electronic).
6. What this means is that these methods of building trust become crucial in the context of online scholarly communication. With care and attention to maintaining a strong reputation, both publishers and librarians can make valuable contributions to the scholarly community in the difficult process of disseminating scholarship and untangling the mass of output, which may range from informal ramblings to peer-reviewed, published articles and monographs.
7. The open questions that remain are those of translating into practical measures the guidelines that have emerged from this interview:
- how to create trust metrics that will give weight and authority to the words of librarians and publishers?
- which tools are most suited to create guides and "maps" for scholars and students to wade through the volume of material that is available?
8. It is very encouraging and exciting, however, to find that there is widespread agreement that there is great value in using social software like bookmarking tools to sort through scholarly output and make distinctions between various versions of papers and articles, which can save vast amounts of time for those who have to read them. This means that there is great potential for growth in the librarian community for people to perform this type of task. The librarian of the 21st century can become a meta-librarian and continue to uphold the old values of the profession, and shepherd them, as it were, into the future.
And now, for the last time signing off on this London blog, to all my friends and loved ones, good night!