Using LaTeX for a paper for Science Advances

I recently wrote a paper for the new AAAS journal Science Advances using LaTeX (as opposed to their Microsoft Word template), and have some things to share with others interested in sending their beautifully typeset work to that journal. [1]

First, Science Advances uses a bibliography style that is slightly different from that of Science, which means that the Science.bst file available from AAAS for submissions to Science is not suitable. Specifically, Science Advances wants full titles to be listed and wants full page ranges (rather than just the first page). My reading of the detailed information for authors suggests that these are the only differences. Here is a modified version of the Science.bst file, called ScienceAdvances.bst, that conforms to the required bibliographic style. [2]

Second, Science Advances uses a slightly different format for the manuscript itself than Science, and so again, the existing LaTeX template is not quite suitable. One difference is that Science Advances requires section headings. Here is a zip file containing a Science Advances LaTeX template, modified from the Science template distributed by AAAS, that you can use (note: this zip includes the bst file listed above). [2]
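For anyone wiring the style file into an existing manuscript by hand, the following is a minimal sketch rather than the contents of the template itself. It assumes that ScienceAdvances.bst sits in the same directory as the .tex file. The bibliography file name refs, the citation key somekey, and the choice of unnumbered section headings are all placeholders for illustration.

```latex
% Minimal sketch, not the distributed template.
% Assumes ScienceAdvances.bst is in the same directory as this file.
% "refs" (i.e., refs.bib) and "somekey" are hypothetical placeholders.
\documentclass[12pt]{article}

\begin{document}

% Science Advances requires section headings; using the unnumbered
% form here is an assumption, not a stated journal requirement.
\section*{Introduction}
A sentence that needs a citation \cite{somekey}.

% Select the modified style (full titles, full page ranges)
% and point BibTeX at the reference database.
\bibliographystyle{ScienceAdvances}
\bibliography{refs}

\end{document}
```

Compiling this follows the usual latex, bibtex, latex, latex cycle.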

Finally, there are a few little things that make Science Advances different from Science. SA has a much longer (effective) length limit, being 15,000 words compared to Science's 4500 words. The Reference list in SA is comprehensive, meaning that references cited only in the Supplementary Material should be included in the main text's reference list. There is also no limit on the number of references (compared to Science's limit of 40). And, SA places the acknowledgements after the Reference section, and the acknowledgements include information about funding, contributions, and conflicts of interest. Otherwise, the overall emphasis on articles being of broad interest across the sciences and of being written in plain English [3] remains the same as Science.

-----

[1] Full disclosure: I am currently serving as an Associate Editor for Science Advances. Adapting Science's LaTeX files to Science Advances's requirements, and sharing them online, was not a part of my duties as an AE.

[2] The files are provided as-is, with no guarantees. They compile for me, which was good enough at the time.

[3] Of course, biology articles in Science are hardly written in "plain English", so there is definitely some degree of a double-standard at AAAS for biology vs. non-biology articles. Often, it seems that biology, and particularly molecular biology, can be written in dense jargon, while non-biology, but especially anything with mathematical concepts or quantities in it, has to be written without jargon. This is almost surely related to the fact that the majority of articles published in Science (apparently by design) are biomedical in nature. AAAS is claiming that Science Advances will be different, having a broader scope and a greater representation of non-biomedical articles (for instance, SA specifically says it wants articles from the social sciences, the computer sciences, and engineering, which I think is a really great stance). Whether they can pull that off remains to be seen, since they need to get the buy-in from the best people in these other fields to send their high-quality work to SA rather than to disciplinary venues.

Posted on December 06, 2014 in Simply Academic

Grants and fundraising (Advice to young scholars, part 4 of 4)

These notes are an adapted summary of the 4th of 4 professional development panels for young scholars, as part of the American Mathematical Society (AMS) Mathematics Research Community (MRC) on Network Science, held in June 2014. Their focus was on mathematics, computer science, and networks, but many of the comments generalize to other fields. [1,2]

Panel 4. Grants and Fundraising

Opening remarks: In general, only around 10% of grant proposals are successful. But, roughly 60% of submitted proposals are crap. Your competition for getting funded is the non-crappy 40%. Therefore, work hard to polish your proposals, and take as much time as you would on a serious or flagship paper. Get feedback from colleagues on your proposals before submitting, and try as hard as possible to get that feedback at least one month before the deadline. (Many institutions offer "mock panels" for this purpose, and they are incredibly useful, especially for early career scientists.) Practice makes the master, so consider writing a grant proposal as a postdoc. Having some success as a postdoc will also make you look more attractive as a faculty candidate. Know when the annual deadlines are for the regular grant competitions, and plan ahead. Try to avoid the last-minute crush of writing proposals in two weeks or less.
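To put rough numbers on the opening remarks, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that essentially all funded proposals come from the competitive 40%; the panel did not state this explicitly.

```latex
% Rough estimate only; assumes every funded proposal comes from the
% competitive 40% of submissions (an illustrative assumption).
\[
  P(\text{funded} \mid \text{competitive})
  \approx \frac{P(\text{funded})}{P(\text{competitive})}
  = \frac{0.10}{0.40}
  = 0.25
\]
```

Under that assumption, a well-polished proposal faces odds closer to 1 in 4 than 1 in 10.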

  • What should be in a proposal?
    Really exciting research. But, try to propose to do more than just really exciting research. Consider organizing workshops, creating new classes, creating notes, giving public lectures, hosting undergraduates, working with underrepresented groups, running a podcast series, and even teaching in a local high school.

  • What kinds of proposals should an early-career person write?
    In your first few years as faculty, apply to all the early-career fellowships and competitions that you can comfortably manage. That includes the Sloan, McDonnell, Packard, etc., along with the NSF CAREER award, and the various "early investigator" competitions at the DoD and other places. Figure out what people do in your field and do that too. These awards are sometimes for sizable amounts of funding, but even if they are not, they are often very prestigious.

  • How many grants do I need?
    This depends on the size of your preferred research group. Many faculty try to keep 2-3 active grants at once, and write approximately 1-2 new proposals per year. As a rough calculation, a "normal sized" grant from many parts of NSF will support 1 graduate student for its duration (plus modest summer salary, travel, and computing equipment).

  • Can I propose work that I have already partially completed?
    Yes. This is common, and often even recommended. "Preliminary results" make a proposal sound less risky, and basically the reviewers are looking for proposals that are exciting, will advance the state-of-the-art, well written, and exceedingly likely to succeed. If you've already worked out many of the details of the work itself, it is much easier to write a compelling proposal.

  • Proposals are often required to be understandable by a broad audience but also include technical details, so how do you balance these requirements?
    An advanced undergraduate should be able to understand your proposal with some training. Most panels have some experts who can judge the technical details. A good strategy for learning how to balance technical material versus accessibility is to read other people's proposals, especially successful ones, even outside your field. The first pages of any proposal should be more broadly understandable, while the final pages may be decodable by experts only.

  • Can you reuse the same material for multiple grants?
    It's best not to double dip. If a grant is rejected, you can usually resubmit it, often to the same agency (although sometimes not more than once). Because you have some feedback and you have already written the first proposal, it's often less work to revise and resubmit a rejected proposal. (But, the goal posts may move with the resubmission because the review panel may be composed of different people with different opinions, e.g., at NSF.) Small amounts of overlap are usually okay, but if you don't have anything new to propose, don't submit a proposal.

Pro tips:

  • Calls For Proposals (CFPs) are often difficult to decode, so don't hesitate to ask for help to translate, either from your colleagues or from the cognizant program officer. Usually, the specific words and pitch of a program have been shaped by other researchers' interests, and knowing what those words really mean can help in deciding if your ideas are a good match for the program.
  • Proposals are reviewed differently depending on the agency. NSF proposals are reviewed by ad hoc committees of practicing scientists (drawn essentially at random from a particular broad domain). NIH proposals are reviewed by study panels whose membership is fairly stable over time. DoD proposals are reviewed internally, but sometimes with input from outside individuals (who may or may not be academics).
  • Don't write the budget yourself. Use the resources of your department. You will eventually learn many things about budgeting, but your time is better spent writing about the science. That being said, you will need to think about budgets a lot because they are what pay for the research to get done (and universities and funding agencies really love to treat them like immutable, sacred documents). Familiarize yourself with the actual expenses associated with your kind of research, and with the projects that you currently do and aim to do in the future.
  • For NSF, don't budget for funding to support undergraduates during the summer; instead, assume that you will apply for (and receive) an REU Supplement to your award to cover them. The funding rate for these is well above 50%.
  • NSF (and some other agencies) have byzantine rules about the structure, format, and set of documents included in a proposal. They now routinely reject without review proposals that don't follow these rules to the letter. Don't be one of those people.
  • Ending up with leftover money is not good. Write an accurate budget and spend it. Many agencies (e.g., NSF and NIH) will allow you to do a 1-year "no cost extension" to spend the remaining money.
  • Program officers at NSF are typically professors, on leave for 2-3 years, so speak to them at conferences. Program officers at DoD agencies and private foundations are typically professionals (not academics). NSF program officers exert fairly little influence over the review and scoring process of proposals. DoD and foundation program officers exert enormous influence over their process.

-----

[1] Panelists were Mason Porter (Oxford), David Kempe (Southern California), and me (the MRC organizers), along with an ad hoc assortment of individuals from the MRC itself, as per their expertise. The notes were compiled by MRC participants, and I then edited and expanded upon them for clarity and completeness, and to remove identifying information. Notes made public with permission.

[2] Here is a complete copy of the notes for all four panels (PDF).

Posted on December 03, 2014 in Simply Academic

Doing interdisciplinary work (Advice to young scholars, part 3 of 4)

These notes are an adapted summary of the 3rd of 4 professional development panels for young scholars, as part of the American Mathematical Society (AMS) Mathematics Research Community (MRC) on Network Science, held in June 2014. Their focus was on mathematics, computer science, and networks, but many of the comments generalize to other fields. [1,2]

Panel 3. Interdisciplinary Research

Opening remarks: Sometimes, the most interesting problems come from interdisciplinary fields, and interdisciplinary researchers are becoming more and more common. As network scientists, we tend to fit in with many disciplines. That said, the most important thing you have is time; therefore, choose your collaborations wisely. Interdisciplinary work can be divided into collaboration and publication, and each of these has its own set of difficulties. A common experience with interdisciplinary work is this:

Any paper that aims for the union of two fields will appeal mainly to the intersection. -- Jon Kleinberg

  • What's the deal with interdisciplinary collaborations? How do they impact your academic reputation?
    There are three main points to consider when choosing interdisciplinary collaborations, and how they impact perceptions of your academic reputation.
    First, academia is very tribal, and the opinions of these tribes with regards to your work can have a huge impact on your career. Some departments won't value work outside their scope. (Some even have a short list of sanctioned publication venues, with work outside these venues counting literally as zero for your assessments.) Other departments are more open minded. In general, it's important to signal to your hopefully-future-colleagues that you are "one of them." This can mean publishing in certain places, or working on certain classes of problems, or using certain language in your work, etc. If you value interdisciplinary work, then you want to end up in a department that also values it.
    Second, it's strategically advantageous to be "the person who is the expert on X," where X might be algorithms or statistics or models for networks, or whatever. Your research specialty won't necessarily align completely with any particular department, but it should align well with a particular external research community. In the long run, it is much more important to fit into your community than to fit into your department, research-wise. This community will be the group of people who review your papers, who write your external letters when you go up for tenure, who review your grant proposals, who hire your students as postdocs, etc. The worst possible situation is to be community-less. You don't have to choose your community now, but it helps to choose far enough ahead of your tenure case that you have time to build a strong reputation with them.
    Third, make sure the research is interesting to you. If your contribution in some interdisciplinary collaboration is to point out that an off-the-shelf algorithm solves the problem at hand, it's probably not interesting to you, even if it's very interesting to the collaborator. Even if it gives you an easy publication, it won't have much value to your reputation in your community. Your work will be compared to the work of people who do only one type of research in both fields, and might not look particularly good to any field.
    Be very careful about potentially complicated collaborations in the early stages of your career. Be noncommittal until you're sure that your personalities and tastes in problems match. (Getting "divorced" from a collaborator, once a project has started, can be exhausting and complicated.) Being able to recognize cultural differences is an important first step to good collaborations, and moving forward effectively. Don't burn bridges, but don't fall into the trap of saying yes to too many things. Be open to writing for an audience that is not your primary research community, and be open to learning what makes an interesting question and a satisfying answer in another field.

  • What's the deal with publishing interdisciplinary work? Where should it go?
    As a mathematical or computer or data scientist doing work in a domain, be sure to engage with that domain's community. This helps ensure that you're doing relevant good work, and not reinventing wheels. Attend talks at other departments at your university, attend workshops/conferences in the domain, and discuss your results with people in the domain audience.

    When writing, vocabulary is important. Knowing how to speak another discipline's language will help you write in a way that satisfies reviewers from that community. Less cynically, it also helps the audience of that journal understand your results, which is the real goal. If publishing in the arena of a collaborator, trust your collaborator on the language/writing style.

    In general, know what part of the paper is the most interesting, e.g., the mathematics, or the method or algorithm, or the application and relationship to scientific hypotheses, etc., and send the paper to a venue that primarily values that thing. This can sometimes be difficult, since academic tribes are, by their nature, fairly conservative, and attempting to publish a new or interdisciplinary idea can meet with knee-jerk resistance. Interdisciplinary journals like PLOS ONE, which try not to consider domain, can be an okay solution for early work that has trouble finding a home. But, don't overuse these venues, since they tend also to not have a community of readers built in the way regular venues do.

    Note: When you interview for a faculty position, among the many questions that you should ask the interviewing department are: "In practice, how is your department interdisciplinary? How do you consider interdisciplinary work when evaluating young faculty (e.g., at tenure time)?"

-----

[1] Panelists were Mason Porter (Oxford), David Kempe (Southern California), and me (the MRC organizers), along with an ad hoc assortment of individuals from the MRC itself, as per their expertise. The notes were compiled by MRC participants, and I then edited and expanded upon them for clarity and completeness, and to remove identifying information. Notes made public with permission.

[2] Here is a complete copy of the notes for all four panels (PDF).

Posted on December 02, 2014 in Simply Academic

Balancing work and life (Advice to young scholars, part 2 of 4)

These notes are an adapted summary of the 2nd of 4 professional development panels for young scholars, as part of the American Mathematical Society (AMS) Mathematics Research Community (MRC) on Network Science, held in June 2014. Their focus was on mathematics, computer science, and networks, but many of the comments generalize to other fields. [1,2]

Panel 2. Life / Work Balance

Opening remarks: "Academia is like art because we're all a little crazy."

Productivity often scales with time spent. A good strategy is to find enough of a balance so that you don't implode or burn out or become bitter. The best way to find that balance is to experiment! Social norms in academia are slowly shifting to be more sensitive about work/life balance issues, but academia changes slowly and sometimes you will feel judged. Often, those judging are senior faculty, possibly because of classical gender roles in the family and the fact that their children (if any) are usually grown. Telling people you're unavailable is uncomfortable, but you will get used to it. Pressure will be constant, so if you want a life and/or a family, you just have to do it. Routines can be powerful--make some rules about when your non-work hours are during the week and stick to them.

  • What if I want to have children?
    Most institutions have a standard paternity/maternity leave option: one semester off of research/teaching/service plus a one-year pause on your tenure clock. If you think you will have children while being faculty, ask about the parental leave policy during your job interview. Faculty with small children often have to deal with scheduling constraints driven by day care hours, or at-home responsibilities for child care; they are often simply unavailable nights and evenings, so be sensitive to that (don't assume they will be available for work stuff then). Juggling a brand new faculty job and a new baby in the same year can be done, but it can also burn you out.

  • Burnout, what?
    It's hard to get numbers on burnout rate, in part because there are varying degrees of "burnout" and different people burn out in different ways. Most tenured faculty are not completely burned out; true burnout often turns into leaving academia. On the other hand, some faculty have real breakdowns and then get back on the horse. Other faculty give up on the "rat race" of fundraising and publishing in highly competitive venues and instead focus on teaching or service. There are many ways to stop being productive and lose the passion.

    One strategy is to promise yourself that once it stops being fun, leave and go get a satisfying 9-5 job (that pays better).

  • What about all this service stuff I'm getting asked to do?
    Service (to your department, to your university, and to your research community) is an important part of being a professor. You will get asked to do many things, many of which you've never done before, some of which will sound exciting. As an early-career person, you should learn to say "no" to things and feel comfortable with that decision. Until you have tenure, it's okay to be fairly selfish about your service--think about whether saying "yes" will have a benefit to your own research efforts. If the benefit is marginal, then you should probably say no.

    There are a lot of factors that go into whether or not you say yes to something. It's important to learn to tell the difference between something you should say no or yes to. A key part of this is having one or more senior faculty mentors you can ask. Ideally, have one inside your department and one outside your department but within your research community.

  • What happens during summers?
    If you're willing to set yourself up for it, then you can readily take a month-long vacation with absolutely no contact. Tell your department head that you're not bringing your laptop. That being said, summer is often the time when many faculty try to focus exclusively on research, since they're not teaching. At most institutions, it's normal for regular departmental committees to not meet, so you often get a break from your departmental service obligations then, too.

  • How many hours should I work each week?
    How much you work each week is really up to you. Some people work 80-85 hours during terms, and 70 between terms. A common number kicked around is 60, and relatively few people work a lot less than that. For the most part, faculty work these hours by choice. The great advantage of faculty life is that your schedule is pretty flexible, which allows you to carve out specific time for other things (e.g., life / family). Many faculty work 9-5 on campus, and then add other hours at home or otherwise off campus. Some others work long hours during the week and then are offline on the weekends.

  • Do I have to spend all those hours on campus?
    If you don't get "face time" with your institution and the people evaluating your tenure case, then they will form negative opinions about you. So go into work often. And, spend time "in your lab," with your students. It's also a good idea to have lunch with every one of your fellow tenure-track faculty during your early faculty career.

  • I have a two body problem.
    Solving the two-body problem (marriage with another academic or other professional career type) can be tricky. Start talking about it with your partner long before you start applying to jobs. One solution: make a list and let your partner cross off the things that don't make sense. In job negotiations, there are things that the department can do, such as interview/hire your spouse (or encourage/fund another department to do so). If your partner is not an academic there are few things the university can do, but often the more senior people have contacts and that can help.

    One strategy is to always go for the interview, get the offer first, and think about it later. Departments often want to know ahead of time whether they'll need to help with the two-body problem in order to get you to say yes. (But, they are legally not allowed to ask you if you have a partner, so you have to bring it up.) This can, but will not necessarily, hurt your offer. Also, when women interview, they get assumptions imposed on them, such as the existence of a two-body problem. Some women don't wear a wedding ring to an interview in order to avoid those assumptions. One possibility is to consider saying something in advance along the lines of "my husband is excited and there's no problem."

  • How much should I travel?
    Many strategies. Mostly depends on your personal preferences. A popular strategy is to travel no more than once a month. Also consider picking trips on which you can bring your family and/or do some extra traveling. As a junior person, however, traveling is in part about reputation-building, and is a necessary part of academic success.

-----

[1] Panelists were Mason Porter (Oxford), David Kempe (Southern California), and me (the MRC organizers), along with an ad hoc assortment of individuals from the MRC itself, as per their expertise. The notes were compiled by MRC participants, and I then edited and expanded upon them for clarity and completeness, and to remove identifying information. Notes made public with permission.

[2] Here is a complete copy of the notes for all four panels (PDF).

Posted on December 01, 2014 in Simply Academic

The faculty market (Advice to young scholars, part 1 of 4)

These notes are an adapted summary of the 1st of 4 professional development panels for young scholars, as part of the American Mathematical Society (AMS) Mathematics Research Community (MRC) on Network Science, held in June 2014. Their focus was on mathematics, computer science, and networks, but many of the comments generalize to other fields. [1,2]

Panel 1. The Academic Job Market

Opening remarks: The faculty hiring process is much more personal than the process to get into grad school. Those who are interviewing you are evaluating whether you should be a coworker for the rest of their careers! The single most-important thing about preparing to apply for faculty jobs is to have a strong CV for the type of job you're applying for. If you're on the tenure-track, that nearly always means being able to show good research productivity for your field (publications) and having them be published in the right places for your field.

  • Where did you find job postings? Where did you search?
    It depends on the type of job and the field. For math: AMS weekly mailings, back of SIAM News. For physics: the back of Physics Today. For computer science: CRA.org/jobs, Communications of the ACM. For liberal arts colleges: Chronicle Vitae. In general: mathjobs.org, academicjobs.org, and ask your supervisor(s) or coauthors.

  • When do you apply?
    The U.S. market is, for the most part, seasonal. The seasonality differs by field. Biology searches may start in September, with interviews in November and December. Math and computer science tend to have applications due in November, December, and maybe even January. In the U.K., institutions tend to hire whenever, regardless of season. Timing for interdisciplinary positions may be a little strange. It is worth figuring out 6 months ahead of time what the usual timeline is for your field.

  • What kind of department should you apply to?
    If you're in department X, you will be expected to teach courses in department X. (At most institutions, you will teach a mixture of undergraduate and graduate-level courses, but not always within your research speciality.) It may be better to have your publications match the departments to which you apply; for instance, if you're interested in jobs in math departments, you should be publishing in the SIAM journals. You should also get letter writers in that field, since their name will be more recognizable to the hiring committee (and thus carry more weight).

  • What should you put in a cover letter?
    The cover letter is the first (and sometimes only) thing the hiring committee sees. In some fields, the cover letter is 1 full page of text and serves as a complete abstract of your application packet (i.e., it describes your preparation, major research areas and achievements, and intended future research agenda). If you have a specific interest in a department / location, say it in the cover letter (e.g., "I have family living in X and want to be close to them") since this signals to the hiring committee that you're genuinely interested in their institution. Also, mention the people in the department whom you would like to look at your application. Mention a few specific things about the individual advertisement (no one likes to feel spammed). Finally, the cover letter is your one chance to explain anything that might look like a red flag to the committee.

  • What should you put in a teaching statement?
    At research universities, teaching statements are usually the last thing that is read. For junior-level positions, their contents often cannot help your chances, but a bad statement can hurt them. At liberal arts / teaching colleges, a compelling teaching statement is very important.

  • What about letters of recommendation?
    Letters of recommendation are the second most important thing in your packet (the most important being your publication record). The best letters are those that can state firmly that you are in the top whatever percent of students or postdocs. Their description of you is the most important, and their own fame is second. There are some cultural differences between the U.K., U.S., and other places in terms of how glowing they will be. An excited letter from an unknown writer is more important than a mediocre letter from a famous person. The absence of a letter from a PhD or postdoc advisor will be interpreted as a red flag.

  • Are software and blogs good or bad?
    Sometimes good, sometimes not. Don't do these things at the cost of your own research. If you have specific reasons for doing these things, emphasize them in your research statement as "sweeteners" to your strong publication record. For tenure-track faculty jobs, these things generally cannot compensate for a poor or mediocre publication record. The research itself is the most important thing.

  • How does the hiring committee work?
    At most institutions today, the ratio of candidates to faculty jobs is roughly 100:1. At major research institutions, about 60% of those candidates are not competitive to begin with; it's the other 40% you have to beat. This means the hiring committee has an enormous job to do just to narrow the pool to the 10-20% or so that they'll scrutinize closely. Your goal is to make it into that group, so that your star qualities can be properly noticed.

    A common strategy that hiring committees take is to progressively pay more attention to a progressively smaller pool. Your goal is to get through the first few waves of filtering until you can get a serious look by the committee. Two very common reasons a candidate is dropped from the pool during these early evaluations are (i) their area of research is not a good match to the areas of interest in the search (including not looking like the kind of candidate the committee thinks they want, e.g., because their work appears in unusual places), and (ii) their research productivity is not good enough (usually controlled for time since PhD). Both are subjective criteria and vary by search. In general, the more prestigious your PhD, the more prestigious your publication venues, and the more prestigious your letter writers, the better you will fare.

  • What about the interview itself?
    Usually 1-2 days of intense, back-to-back meetings with faculty, plus a meeting with a group of graduate students, plus 1-2 dinners with faculty, plus a job talk (about your research), and sometimes also a "teaching talk." In your job talk, you need to convince them of why they should hire someone doing exactly the research you're doing. Make the audience excited. Make it related to things they know about. Be sure to look at the webpage of every person that might be in the room. Be sure to ask for your meeting schedule in advance, and then read up a little about each person you will meet.
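To make the hiring-committee numbers above concrete, here is a rough worked calculation per advertised job. It reads the 10-20% as a fraction of the whole applicant pool, which is an interpretation rather than something the panel spelled out.

```latex
% Rough per-job estimate; requires \usepackage{amsmath}.
% Treats the 10-20% as a share of all applicants (an assumption).
\begin{align*}
  \text{applicants per job} &\approx 100 \\
  \text{competitive applicants} &\approx 0.40 \times 100 = 40 \\
  \text{applications read closely} &\approx (0.10 \text{ to } 0.20) \times 100
    = 10 \text{ to } 20
\end{align*}
```

Under these assumptions, getting a close read means landing in roughly the top quarter to top half of the competitive applicants.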

Pro Tips:

  • "Exploding offers" (offers that expire after a few weeks) may be used by lower-tier institutions when trying to hire a person likely to have offers from higher-tier institutions. But, deadlines are often negotiable. Play nice. It's often not malicious, but rather just to proceed quickly down the ranked list of candidates. Moreover, if you turn it down in a friendly conversation, you may be able to negotiate an "if you are still interested in me in a month, please let me know."

  • During the year before you apply, figure out what departments you'll be applying to, and be sure to have some publications and talks at major conferences for that type of field or department.

  • Don't pad your CV. Put all preprint and in-prep publications in a separate, clearly-labelled section. CV readers will look at your PhD, your research interests, and then your publications. Awards (e.g. Best Paper) and high quality venues are more important than quantity.

  • You could email people at the target department(s) saying "Here's a paper, btw: I'll be applying soon." If you're uncomfortable with that, your advisor could do it.

  • If you are applying to a school in a lower tier than your pedigree would suggest, tailor the application well. You must send a very strong signal that you are serious. (Otherwise, they may not even interview you.)

-----

[1] Panelists were Mason Porter (Oxford), David Kempe (Southern California), and me (the MRC organizers), along with an ad hoc assortment of individuals from the MRC itself, as per their expertise. The notes were compiled by MRC participants, and I then edited and expanded upon them for clarity and completeness, and to remove identifying information. Notes made public with permission.

[2] Here is a complete copy of the notes for all four panels (PDF).

Posted on November 27, 2014 in Simply Academic

PLOS mandates data availability. Is this a good thing?

The Public Library of Science, aka PLOS, recently announced a new policy on the availability of the data used in all papers published in all PLOS journals. The mandate is simple: all data must be either already publicly available, or be made publicly available before publication, except under certain circumstances [1].

On the face of it, this is fantastic news. It is wholly in line with PLOS’s larger mission of making the outcomes of science open to all, and supports the morally correct goal of making all scientific knowledge accessible to every human. It should also help preserve data for posterity, as apparently a paper’s underlying data becomes increasingly hard to find as the paper ages [2]. But, I think the truth is more complicated.

PLOS claims that it has always encouraged authors to make their data publicly available, and I imagine that in the vast majority of cases, those data are in fact available. But the policy does change two things: (i) data availability is now a requirement for publication, and (ii) the data are supposed to be deposited in a third-party repository that makes them available without restriction or attached to the paper as supplementary files. The first part ensures that authors who would previously decline or ignore the request for open data must now fall into line. The second part means that a mere promise by the authors to share the data with others is now insufficient. It is the second part where things get complicated, and the first part is meaningless without practical solutions to the second part.

First, the argument for wanting all data associated with scientific papers to be publicly available is a good one, and I think it is also the right one. If scientific papers are in the public domain [3], but the data underlying their results are not, then have we really advanced human knowledge? In fact, it depends on what kind of knowledge the paper is claiming to have produced. If the knowledge is purely conceptual or mathematical, then the important parts are already contained in the paper itself. This situation covers only a smallish fraction of papers. The vast majority report figures, tables or values derived from empirical data, taken from an experiment or an observational study. If those underlying data are not available to others, then the claims in the paper cannot be exactly replicated.

Some people argue that if the data are unavailable, then the claims of a paper cannot be evaluated at all, but that is naive. Sometimes it is crucial to use exactly the same data, for instance, if you are trying to understand whether the authors made a mistake, whether the data are corrupted in some way, or how a particular method works. For these efforts, data availability is clearly helpful.

But, science aspires to general knowledge and understanding, and thus getting results that are consistent with the original claims using different data of the same type is actually a larger step forward than simply following exactly the same steps of the original paper. Making all data available may thus have an unintended consequence of reducing the amount of time scientists spend trying to generalize, because it will be easier and faster to simply reuse the existing data rather than work out how to collect a new, slightly different data set or understand the details that went into collecting the original data in the first place. As a result, data availability is likely to increase the rate at which erroneous claims are published. In fields like network science, this kind of data reuse is the norm, and thus gives us some guidance about what kinds of issues other fields might encounter as data sharing becomes more common [4].

Of course, reusing existing data really does have genuine benefits, and in most cases these almost surely outweigh the more nebulous costs I just described. For instance, data availability means that errors can be more quickly identified because we can look at the original data to find them. Science is usually self-correcting anyway, but having the original data available is likely to increase the rate at which erroneous claims are identified and corrected [5]. And, perhaps more importantly, other scientists can use the original data in ways that the original authors did not imagine.

Second, and more critically for PLOS’s new policy, there are practical problems associated with passing research data to a third party for storage. The first problem is deciding who counts as an acceptable third party. If there is any lesson from the Internet age, it is that third parties have a tendency to disappear, in the long run, taking all of their data with them [6]. This is true both for private and public entities, as continued existence depends on continued funding, and continued funding, when that funding comes from users or the government, is a big assumption. For instance, the National Science Foundation is responsible for funding the first few years of many centers and institutes, but NSF makes it a policy to make few or no long-term commitments on the time scales PLOS’s policy assumes. Who then should qualify as a third party? In my mind, there is only one possibility: university libraries, who already have a mandate to preserve knowledge, should be tapped to also store the data associated with the papers they already store. I can think of no other type of entity with as long a time horizon, as stable a funding horizon, and as strong a mandate for doing exactly this thing. PLOS’s policy does not suggest that libraries are an acceptable repository (perhaps because libraries themselves fulfill this role only rarely right now), and only provides the vague guidance that authors should follow the standards of their field and choose a reasonable third party. This kind of statement seems fine for fields with well-developed standards, but it will likely generate enormous confusion in all other fields.

This brings us to another major problem with the storage of research data. Most data sets are small enough to be included as supplementary files associated with the paper, and this seems right and reasonable. But, some data sets are really big, and these pose special problems. For instance, last year I published an open access paper in Scientific Reports that used a 20TB data set of scoring dynamics in a massive online game. Data sets of that scale might be uncommon today, but they still pose a real logistical problem for passing them to a third party for storage and access. If someone requests a copy of the entire data set, who pays for the stack of hard drives required to send it to them? What happens when the third party has hundreds or thousands of such data sets, and receives dozens or more requests per day? These are questions that the scientific community is still trying to answer. Again, PLOS’s policy only pays lip service to this issue, saying that authors should contact PLOS for guidance on “datasets too large for sharing via repositories.”

The final major problem is that not all data should be shared. For instance, data from human-subjects research often includes sensitive information about the participants, e.g., names, addresses, private behavior, etc., and it is unethical to share such data [7]. PLOS’s policy explicitly covers this concern, saying that data on humans must adhere to the existing rules about preserving privacy, etc.

But what about large data sets on human behavior, such as mobile phone call records? These data sets promise to shed new light on human behavior of many kinds and help us understand how entire populations behave, but should these data sets be made publicly available? I am not sure. Research has shown, for instance, that it is not difficult to uniquely distinguish individuals within these large data sets [8] because each of us has distinguishing features to our particular patterns of behavior. Several other papers have demonstrated that portions of these large data sets can be deanonymized, by matching these unique signatures across data sets. For such data sets, the only way to preserve privacy might be to not make the data available. Additionally, many of these really big data sets are collected by private companies, as the byproduct of their business, at a scale that scientists cannot achieve independently. These companies generally only provide access to the data if the data is not shared publicly, because they consider the data to be theirs [9]. If PLOS’s policy were universal, such data sets would seem to become inaccessible to science, and human knowledge would be unable to advance along any lines that require such data [10]. That does not seem like a good outcome.
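To make the uniqueness point concrete, here is a minimal toy sketch in Python. Everything in it is made up for illustration: the four users, their (tower, hour) observations, and the function names are hypothetical and are not taken from any of the studies mentioned above. It simply counts how often a handful of points from a person's own record matches that person and no one else.

```python
from itertools import combinations

# Hypothetical toy traces: each "anonymous" user maps to a set of
# (cell_tower, hour_of_day) observations. All values are invented.
TRACES = {
    "u1": {("A", 8), ("B", 12), ("C", 18)},
    "u2": {("A", 8), ("B", 12), ("D", 18)},
    "u3": {("A", 8), ("E", 12), ("C", 18)},
    "u4": {("F", 9), ("B", 12), ("C", 18)},
}


def matching_users(traces, points):
    """Return the users whose trace contains every point in `points`."""
    wanted = set(points)
    return [user for user, trace in traces.items() if wanted <= trace]


def unique_fraction(traces, k):
    """Fraction of k-point samples, each drawn from one user's own trace,
    that match exactly one user in the whole data set."""
    total = 0
    unique = 0
    for trace in traces.values():
        # Enumerate every k-point subset of this user's observations.
        for sample in combinations(sorted(trace), k):
            total += 1
            if len(matching_users(traces, sample)) == 1:
                unique += 1
    return unique / total if total else 0.0


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, unique_fraction(TRACES, k))
```

In this toy example, the fraction of samples that single out one person rises as more points are known, even though no names appear anywhere in the data. Real data sets are vastly larger and messier, so this is only a sketch of the idea, not a measurement of anything.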

PLOS does seem to acknowledge this issue, but in a very weak way, saying that “established guidelines” should be followed and privacy should be protected. For proprietary data sets, PLOS only makes this vague statement: “If license agreements apply, authors should note the process necessary for other researchers to obtain a license.” At face value, it would seem to imply that proprietary data sets are allowed, so long as other researchers are free to try to license them themselves, but the devil will be in the details of whether PLOS accepts such instructions or demands additional action as a requirement for publication. I’m not sure what to expect there.

On balance, I like and am excited about PLOS’s new data availability policy. It will certainly add some overhead to finalizing a paper for submission, but it will also make it easier to get data from previously published papers. And, I do think that PLOS put some thought into many of the issues identified above. I also sincerely hope they understand that some flexibility will go a long way in dealing with the practical issues of trying to achieve the ideal of open science, at least until we the community figure out the best way to handle these practical issues.

-----

[1] PLOS's Data Access for the Open Access Literature policy goes into effect 1 March 2014.

[2] See “The Availability of Research Data Declines Rapidly with Article Age” by Vines et al. Current Biology 24(1), 94-97 (2014).

[3] Which, if they are published at a regular “restricted” access journal, they are not.

[4] For instance, there is a popular version of the Zachary Karate Club network that has an error: a single edge is missing, relative to the original paper. Fortunately, no one makes strong claims using this data set, so the error is not terrible, but I wonder how many people in network science know which version of the data set they use.

[5] There are some conditions for self-correction: there must be enough people thinking about a claim that someone might question its accuracy, one of these people must care enough to try to identify the error, and that person must also care enough to correct it, publicly. These circumstances are most common in big and highly competitive fields. Less so in niche fields or areas where only a few experts work.

[6] If you had a profile on Friendster or Myspace, do you know where your data is now?

[7] Federal law already prohibits sharing such sensitive information about human participants in research, and that law surely trumps any policy PLOS might want to put in place. I also expect that PLOS does not mean their policy to encourage the sharing of that sensitive information. That being said, their policy is not clear on what they would want shared in such cases.

[8] And, thus perhaps easier, although not easy, to identify specific individuals.

[9] And the courts seem to agree, with recent rulings deciding that a “database” can be copyrighted.

[10] It is a fair question as to whether alternative approaches to the same questions could be achieved without the proprietary data.

Posted on January 29, 2014 in Scientifically Speaking

2013: a year in review

This is it for the year, so here's a look back at 2013, by the numbers.

Papers published or accepted: 10 (journals or equivalent)
Number coauthored with students: 4
Number of papers that used data from a video game: 3 (this, that, and the other)
Pre-prints posted on the arxiv: 6
Other publications: 2 workshop papers, and 1 invited comment
Number coauthored with students: 2
Papers currently under review: 2
Manuscripts near completion: 8
Rejections: 4
New citations to past papers: 1722 (+15% over 2012)
Projects in-the-works: too many to count
Half-baked projects unlikely to be completed: already forgotten
Papers read: >200 (about 4 per week)

Research talks given: 15
Invited talks: 13
Visitors hosted: 2
Presentations to school teachers about science and data: 1 (at the fabulous Denver Museum of Nature and Science)
Conferences, workshops organized: 2
Conferences, workshops, summer schools attended: 7
Number of those at which I delivered a research talk: 5
Number of times other people have written about my research: >17
Number of interviews given about my research: 10

Students advised: 9 (6 PhD, 1 MS, 1 BS; 1 rotation student)
Students graduated: 1 PhD (my first: Dr. Sears Merritt), 1 MS
Thesis/dissertation committees: 10
Number of recommendation letters written: 5
Summer school faculty positions: 2
University courses taught: 2
Students enrolled in said courses: 69 grad
Number of problems assigned: 120
Number of pages of lecture notes written: >150 (a book, of sorts)
Pages of student work graded: 7225 (roughly 105 per student, with 0.04 graders per student)
Number of class-related emails received: >1624 (+38% over 2012)
Number of conversations with the university honor council: 0
Guest lectures for colleagues: 1

Proposals refereed for grant-making agencies: 1
Manuscripts refereed for various journals, conferences: 23 (+44% over 2012)
Fields covered: Network Science, Computer Science, Machine Learning, Physics, Ecology, Political Science, and some tabloids
Manuscripts edited for various journals: 2
Conference program committees: 2
Words written per report: 921 (-40% over 2012)
Referee requests declined: 68 (+36% over 2012)
Journal I declined the most: PLoS ONE (12 declines, 3 accepts)

Grant proposals submitted: 7 (totaling $6,013,669)
Number on which I was PI: 3
Proposals rejected: 2
New grants awarded: 3 (totaling $1,438,985)
Number on which I was PI: 1
Proposals pending: 2
New proposals in the works: 3

Emails sent: >8269 (+3% over 2012, and about 23 per day)
Emails received (non-spam): >16453 (+6% over 2012, and about 45 per day)
Fraction about work-related topics: 0.87 (-0.02 over 2012)
Emails received about power-law distributions: 157 (3 per week, same as 2012)

Unique visitors to my professional homepage: 31,000 (same as 2012)
Hits overall: 87,000 (+10% over 2012)
Fraction of visitors looking for power-law distributions: 0.52 (-11% over 2012)
Fraction of visitors looking for my course materials: 0.16
Unique visitors to my blog: 11,300 (-2% over 2012)
Hits overall: 17,300 (-4% over 2012)
Most popular blog post among those visitors: Our ignorance of intelligence (from 2005)
Blog posts written: 6 (-57% over 2012)
Most popular 2013 blog post: Small science for the win? Maybe not.

Number of twitter accounts: 1
Tweets: 235 (+82% over 2012; mostly in lieu of blogging)
Retweets: >930 (+281% over 2012)
Most popular tweet: a tweet about professors having little time to think
New followers on Twitter: >700 (+202% over 2012)

Number of computers purchased: 2
Netflix: 72 dvds, >100 instant (mostly TV episodes during lunch breaks and nap times)
Books purchased: 3 (-73% over 2012)
Songs added to iTunes: 140 (-5% over 2012)
Photos added to iPhoto: 2357 (+270% over 2012)
Jigsaw puzzle pieces assembled: >2,000
Major life / career changes: 0
Photos taken of my daughter: >1821 (about 5 per day)

Fun trips with friends / family: 10
Half-marathons completed: 0.76 (Coal Creek Crossing 10 mile race)
Trips to Las Vegas, NV: 0
Trips to New York, NY: 1
Trips to Santa Fe, NM: 9
States in the US visited: 8
States in the US visited, ever: 49
Foreign countries visited: 6 (Switzerland, Denmark, Sweden, Norway, United Kingdom, Canada)
Foreign countries visited, ever: 30
Number of those I drove to: 1 (Canada, 10 hours from Washington DC after United canceled my flight to Montreal for the JSM; I arrived with a few hours to spare before my invited talk)
Other continents visited: 1
Other continents visited, ever: 5
Airplane flights: 39

Here's to a great year, and hoping that 2014 is even better.

Update 23 December 2013: Mason reminded me that I forgot a foreign country this year.

Posted on December 22, 2013 in Self Referential