Project:
Rephetio: Repurposing drugs on a hetnet [rephetio]

Integrating resources with disparate licensing into an open network


Network overview

We recently released the first version of our network containing 10 node types and 27 edge types. The network contains data (nodes and edges) extracted from 28 resources. Many of these 28 resources have themselves compiled data from disparately-licensed resources. Of the 28 resources:

  • 12 lack any licensing information
  • 10 use standard licenses
  • 6 use custom licenses
  • 3 resources are publication supplements
  • 6 forbid commercial use
  • 2 forbid any redistribution of the data

Why an open network

We are committed to performing an open project, where all code, data, analyses, and results are maximally reproducible and reusable. The foundation of our research is that datasets are more informative when placed in a broader context. Through integration, we create a resource that is more informative and versatile than the 28 separate sources.

However, data integration is challenging and time-intensive. Thus far, our integration effort consists of an 8-month time investment, 41 Thinklab discussions, and 35 GitHub repositories. By making our network public and extensible, other researchers can avoid this laborious process while harnessing the benefits of integration.

The licensing problem

We initially released our network under the CC0 (public domain) license, but @larsjuhljensen pointed out that this may violate many sources' licensing. While we used only publicly available resources, funded primarily by the public, many resources are burdened by restrictive licenses. We now must integrate data with incompatible licenses that require legal expertise to understand and that operate in jurisdiction-dependent ways.

Compliance and caveats

We are seeking expert advice on how to proceed. We would like to achieve the following:

  • a network that is publicly available in full and maximally unrestricted
  • public domain findings. Foremost, unencumbered predictions of drug efficacy
  • legal compliance
  • normative compliance that respects the intent of the data creators
  • minimal pruning of the current network to preserve our investment

We plan to add node/edge-specific attribution and license information to our network, but will await expert advice before proceeding.

You may also want to consider splitting the network into multiple files. For example, you may have a base file that includes, for example, only public domain and CC-BY content. The edges with SA and/or NC clauses could be in "add-on" files. This partially avoids the problem that your complete network file becomes subject to the lowest common denominator.

Having multiple files will allow people to "pick their poison", so to speak. If they need the most permissive license, they will get a less complete network. If they want the most complete network, they will have to live with a less permissive license.

  • Daniel Himmelstein: Great suggestion. This way users can avoid having to subset the network and reconcile various licenses.
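As a rough illustration of the base-plus-add-on split suggested above, here is a minimal Python sketch. It assumes each edge already carries a license label; the tier names, the license-to-tier mapping, and the toy edges are all invented for illustration and are not taken from the actual network.

```python
from collections import defaultdict

# Hypothetical mapping from license label to output file ("tier").
# Both the tier names and the grouping are assumptions for this sketch.
TIER_BY_LICENSE = {
    "CC0": "base",
    "CC-BY": "base",
    "CC-BY-SA": "addon-sa",
    "CC-BY-NC": "addon-nc",
    "CC-BY-NC-SA": "addon-nc-sa",
}


def split_by_license(edges):
    """Group edges into per-file lists keyed by license tier.

    Edges with a missing or unrecognized license go to 'unlicensed',
    so they are never silently mixed into a more permissive file.
    """
    files = defaultdict(list)
    for edge in edges:
        tier = TIER_BY_LICENSE.get(edge.get("license"), "unlicensed")
        files[tier].append(edge)
    return dict(files)


# Toy edges; identifiers and license assignments are invented.
edges = [
    {"source": "compound:1", "target": "gene:1", "kind": "binds", "license": "CC-BY-SA"},
    {"source": "compound:1", "target": "disease:1", "kind": "treats", "license": "CC0"},
    {"source": "compound:2", "target": "effect:1", "kind": "causes", "license": "CC-BY-NC-SA"},
    {"source": "gene:1", "target": "disease:1", "kind": "associates", "license": None},
]

files = split_by_license(edges)
for tier, tier_edges in sorted(files.items()):
    print(tier, len(tier_edges))
```

A user who needs the most permissive terms would load only the base group, while a user who wants the fullest network would load every group and accept the most restrictive terms among them.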

Regarding the 12 that completely lack any licensing information, I would contact the authors. For academic databases/datasets, this is usually due to people not knowing that when it comes to copyright, the default is "all rights reserved". Academics often put things on the internet, thinking that this makes it "public domain". If you ask them, they will likely be happy to put a CC0 waiver or CC-BY license on it.

Workflow details

Our data workflow consists of three major stages. Each stage invokes various aspects of copyright, as described below.

1) Resource processing

Most resources require processing before they can be added to the network. Common steps include terminology conversion, quality control, subsetting, and record merging.

Our general procedure is to create a public GitHub repository for each resource (examples 1, 2, 3, 4, 5). Separate repositories help keep the project modular and reusable. Each repository contains a download directory where we store the unmodified input. Having the local copy is important for reproducibility because the original download location may become unavailable or serve an updated dataset. Therefore, our download directory redistributes unmodified data.

Next, we process data from the download directory and save the resulting datasets in the data directory. The processing steps generally change the database model and field names and include a substantial portion of the original data. However, the original data has usually been transformed in some regard.

Proposed action: apply the source's license to the contents of download. For the contents of data, apply either the source's license or CC0 if the underlying data is not subject to copyright or the derivative work qualifies as fair use. Resources without a license or that explicitly forbid redistribution are problematic. We propose contacting the creators of these resources for permission or licensing clarification. Components in these repositories that do not derive from protected resources will be released as CC0.

2) Integrative network

Our integrate repository combines the resource-specific data from stage 1 into a single network. The compile directory merges resources with the same type of information. The creation of the network is performed by integrate.ipynb. We have compiled the licenses for each resource. The network is saved as text files in the data directory with hetnet.json.gz as the main release. In this integrated network, the database model and field names from the original resource are not present, just derived data.

Proposed action: Adopt a per node/edge licensing framework. Identify which nodes and edges, if any, are eligible for CC0 release. CC0 release may be possible if the creators chose a permissive license or give us permission, the network is fair use, or if the underlying content is not subject to copyright.
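One way to picture a per node/edge licensing framework is to attach a small provenance record to every edge and derive CC0 eligibility from it. The Python sketch below is only an assumption about how this might look: the `Provenance` fields, the `ASSUMED_CC0_COMPATIBLE` set, and the resource names are hypothetical. The fair use criterion is deliberately left out, since it is a legal judgment that a flag cannot settle.

```python
from dataclasses import dataclass
from typing import Optional

# Licenses assumed, for this sketch only, to allow re-release as CC0.
ASSUMED_CC0_COMPATIBLE = {"CC0", "Public Domain"}


@dataclass(frozen=True)
class Provenance:
    source: str                       # resource the edge was derived from
    license: Optional[str]            # None when no license was posted
    permission_granted: bool = False  # creators explicitly allowed CC0 release
    not_copyrightable: bool = False   # content judged not subject to copyright


def cc0_eligible(p: Provenance) -> bool:
    """True if any of the criteria modeled here allows a CC0 release."""
    return (
        p.license in ASSUMED_CC0_COMPATIBLE
        or p.permission_granted
        or p.not_copyrightable
    )


# Toy edges as (source node, target node, provenance); all values invented.
edges = [
    ("compound:1", "gene:1", Provenance("ResourceA", "CC0")),
    ("compound:1", "effect:1", Provenance("ResourceB", "CC-BY-NC-SA")),
    ("gene:1", "disease:1", Provenance("ResourceC", None)),
    ("gene:2", "disease:1", Provenance("ResourceD", "CC-BY-NC", permission_granted=True)),
]

eligible = [edge for edge in edges if cc0_eligible(edge[2])]
print(f"{len(eligible)} of {len(edges)} edges eligible for CC0")
```

Keeping the source name on each record also gives a natural place to store the attribution that many licenses require.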

3) Network analyses

Next, we use the integrated network from stage 2 for data mining. The purpose of the data mining is to evaluate methods, extract insights, and make predictions. As an example, see this analysis [1] of the network that @caseygreene and I recently did for a separate project.

Here, it is crucial that findings from analyses on the network are fair use and can be placed in the public domain. Since, the network contains data with incompatible licenses such as CC-BY-SA and CC-BY-NC, data mining will be impossible if not considered fair use. In the US, precedent implies our network analyses qualify as fair use.

Proposed action: Identify whether our network analyses qualify as fair use and whether our results can be released as CC0. Evaluate when and if we are subject to European copyright laws, which are less favorable for content users.

Expert feedback requested

We are seeking expert advice. Specifically, are the proposed actions compliant with copyright law? Regarding the three stages, are we on the right track? Will network analyses count as fair use?

Your analysis of the situation looks great — you've correctly described the difficulty of combining incompatible licenses and the data they cover, and the potential of fair use (for extracting data subsets and data mining) for what you're trying to do. And for the datasets that lack a license, you know that in many cases they aren't protected by copyright so you're free to do what you want. Federal government agencies are notorious for refusing to assign licenses or rights waivers to the data they release, claiming that everything they have and do is in the public domain and we should all just know that, so sometimes no license means you're fine. Your goal of making it clear to users what rights and licenses apply to which datasets is laudable.

The one thing I didn't see you covering is liability. I can't figure out who actually owns the work that you're doing — you want to put it in the public domain, which is great, but do you personally have the right to do that? Are you working on a grant project or employed by a university that might claim "ownership" of your results? This is usually dealt with by the licenses. Apache open source software licenses include the language "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." Even CC licenses include language like "No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material." So if you're using CC0 wherever you can, you might want a separate statement of warranty (or lack thereof) unless you want to be liable, or implicate your institution, if you do accidentally screw up (easy enough to do, in such a complex project, even when you've done everything you can).

Regarding the problem of incompatible licenses, it is very important that you are clear on the difference between redistribution and data mining.

You write that "Since, the network contains data with incompatible licenses such as CC-BY-SA and CC-BY-NC, data mining will be impossible if not considered fair use". This is to my knowledge simply not true. There is no problem whatsoever in combining material from these incompatible licenses and mining it in any way that you want. The reason is that copyright purely has to do with how you are allowed to redistribute things. And if the data mining leads to some results that are substantially different and not effectively a copy of the original material, there is also no problem in redistributing the results.

The problem comes when you want to make what is effectively a meta-resource that combines material from a lot of databases and redistributes it. In this case, you are redistributing something that is effectively a reformatted version of the material. In my opinion, your network falls squarely in that category.

However, the solution is very simple. As I have suggested before, you can split the network into subnetworks, that are all redistributed under their respective licenses. You can bundle everything CC-BY-SA in one file and redistribute it under CC-BY-SA. You can bundle everything CC-BY-NC in another file and redistribute it under CC-BY-NC. And as described above, nothing prevents anyone in the world from legally downloading both files, combining them, and mining the data as they please.

To make it simpler, let me make an analogy from the world of text mining, where the situation is somewhat more clearcut, since there is no doubt that articles are subject to copyright law. I can download some articles under CC-BY-SA and some others under CC-BY-NC. I can run text mining on all of them despite the licenses being incompatible, and I can redistribute the results of my efforts under any license I please, because the results are my results, which are not simply a reformatting of the original text. However, I cannot take all the articles, combine them into a text corpus, and release it under CC0.

Caveats: I am not a lawyer, this does not constitute legal advice etc.

  • Daniel Himmelstein: I am interested in subnetworks for user convenience — if a user wants only CC-BY-SA content, then the subnetwork saves them time. However, I disagree that subnetworks are a substitute for a complete network with mixed licensing. Merging networks will be burdensome to my hypothetical user who is interested only in analyzing rather than redistributing the network. Additionally, I am not sure the entire network can be represented by copyright uniform subnetworks. For example, DrugBank forbids commercial reuse. However, our drug–gene binding edges derive from the CC-BY-SA ChEMBL resource. Thus the CC-BY-SA subnetwork would contain edges whose nodes cannot be included.
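The concern about edges whose nodes cannot be included can be checked mechanically once nodes and edges both carry license labels. The Python sketch below is a toy version of that check. The node identifiers, the `DrugBank-NC` label, the `CC0` label on the gene node, and the set of node licenses allowed in the subnetwork are all assumptions made up for illustration.

```python
def dangling_edges(edges, node_license, edge_license, allowed_node_licenses):
    """Return edges under `edge_license` that touch a node whose license
    is not in `allowed_node_licenses` (or is unknown)."""
    result = []
    for source, target, lic in edges:
        if lic != edge_license:
            continue
        endpoints = (source, target)
        if any(node_license.get(n) not in allowed_node_licenses for n in endpoints):
            result.append((source, target, lic))
    return result


# Toy data: a compound node from a noncommercial source, a gene node
# assumed to be CC0, and a binding edge from a CC-BY-SA source.
node_license = {"compound:1": "DrugBank-NC", "gene:1": "CC0", "gene:2": "CC0"}
edges = [
    ("compound:1", "gene:1", "CC-BY-SA"),
    ("gene:1", "gene:2", "CC-BY-SA"),
]

# Node licenses assumed acceptable inside a CC-BY-SA subnetwork.
allowed = {"CC0", "CC-BY", "CC-BY-SA"}

dangling = dangling_edges(edges, node_license, "CC-BY-SA", allowed)
print(dangling)
```

Counting such edges per license would show how much of the network cannot be expressed as license-uniform subnetworks.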

@mackenziesmith makes a very good point about liability, which in my opinion is why you should not attempt to take copyrighted material, claim fair use under US law, and slap a CC0 waiver on it.

Imagine someone in Europe were to download your network, assume that everything was free of copyright (which is what CC0 effectively promises), take all the SIDER data, and redistribute it under CC0. Since SIDER is covered by European sui generis database rights, they could get sued and would likely lose. Subsequently, they could choose to sue you for liabilities.

Caveats: I am not a lawyer, this does not constitute legal advice etc.

Who owns the created work

After some background reading [1] and video watching, I have found that who owns the work we're creating is not straightforward.

I am a graduate student at UCSF and my PI, @sergiobaranzini, is a professor at UCSF. I am largely, but not completely, funded by an NSF Graduate Research Fellowship whose conditions state:

The National Science Foundation claims no rights to any inventions or writings that might result from its fellowship or traineeship grants.

Copyright and the UC

The UC's 1992 'copyright ownership' policy stipulates ownership by category of work. Several of these categories may apply:

  • academic appointee originator ownership of "scholarly/aesthetic work"

    A scholarly/aesthetic work is a work originated by a designated academic appointee resulting from independent academic effort. Ownership of copyrights to scholarly/aesthetic works shall reside with the designated academic appointee originator, unless they are also sponsored works or contracted facilities works, or unless the designated academic appointee agrees to participate in a project which has special provisions on copyright ownership pursuant to Section V.C. of this Policy.

  • originator ownership of "personal work"

    A personal work is a work that is prepared outside the course and scope of University employment (except for permissible non-University consulting activities) without the use of University Resources. Ownership of copyrights to Personal works shall reside with the originator.

  • originator ownership of "student work":

    A student work is a work produced by a registered student without the use of University funds (other than Student Financial Aid), that is produced outside any University employment, and is not a sponsored, contracted facilities, or commissioned work. Ownership of copyrights to student works shall reside with the originator.

  • university ownership of "institutional work":

    Except as otherwise provided in this Policy, the University shall own all copyrights to works made by University employees in the course and scope of their employment and shall own all copyrights to works made with the use of University resources.

Therefore, UC's asserted ownership is dependent on which categories our work falls under. Additional guidance states:

University staff who create works within the scope of their employment generally do not own the copyright to the work. A work prepared by an employee within the scope of his or her employment is considered a "work made for hire." When a work qualifies as a work made for hire, the employer or commissioning party is considered its author. Under UC policies, some written works created by certain categories of UC faculty, graduate students, and staff are considered works made for hire.

Thus, the University's assertion of ownership may be contradicted by the strong argument that graduate students, such as myself, are not employees and do not produce "work made for hire". Furthermore, the policies and guidelines are outdated and not well tailored towards the collaborative, digital, online, and open approach our project takes. The work I perform goes beyond the sole purposes of studentship, employment, and institutional work. And the academic community has established norms and precedent for allowing creators to transfer copyright and choose licensing — the foremost examples being academic publishing and open source software contribution.

Data and the UC

The UC 'copyright ownership' policy explicitly states that it only:

addresses ownership of copyright; it does not address ownership or access to the underlying research results or data, as covered in Academic Personnel Manual Section 020.

The Academic Personnel Manual Section 020, dated 1953, provides little clarification:

All such research shall be conducted so as to be as generally useful as possible. To this end, the right of publication is reserved by the University. The University may itself publish the material or may authorize, in any specific case, a member or members of the faculty to publish it through some recognized scientific or professional medium of publication. A report detailing the essential data and presenting the final results must be filed with the University. Notebooks and other original records of the research are the property of the University.

Outside of official policy, UC appears to claim ownership of results and data. Quoting from a talk by @mackenziesmith:

The University of California posits that it actually has a contractual obligation to maintain the ownership of all research data produced from grant funded projects by any researcher at UC, especially federally funded grants. So they claim that the university owns the data.

Additionally, a UCSD guide states:

  • Data produced by UC researchers belong to the Regents of the University of California.
  • To promote sharing and unlimited use of your data, make your data available under a Creative Commons CC0 Declaration.

These seemingly contradictory statements imply that UC may own the data but that its creators are free to release it into the public domain.

Resolutions

We are looking for suggested courses of action to address the ambiguity and potential multiplicity of claims regarding ownership. Two possible actions are:

  • applying a without warranty clause to our licensing to limit our liability.
  • identifying all potential parties that may claim ownership and request permission to release the work as freely as possible given the aforementioned considerations.

At a workshop I organized at UC Davis last year, Data Rights and Data Wrongs, senior counsel from the UC Office of General Counsel (i.e., the university's lawyers) was very clear that UC retains ownership rights to original data as the official 'grantee' and to ensure compliance with federal laws for research conduct, etc. I think the relevant policy is here http://www.ucop.edu/raohome/cgmemos/84-31.html (old but still in effect). So I think your assessment is right that UC asserts ownership but allows you to release the data under reasonable terms, including CC0. However, most of the data you're working with isn't original to you, so what UC 'owns' is your own findings and if you did something wrong, the university is liable to some extent.

Of course, finding all the rights holders and getting their explicit permission to do what you're doing would be ideal, but is that practical? Do you even know who holds the rights to all the data sources? I disagree with the point that you can't rely on Fair Use and release your results under a CC0 waiver — I believe that's what Fair Use is for, if it's truly transformative — but you might want to be explicit about the waiver of liability. Especially given how gray the area you're working in is, legally speaking.

  • Lars Juhl Jensen: The big problem I see here is the term "results". If the results are actually new results, then I agree that they can be released whichever way you like. But if your "results" are in fact other people's databases mapped to different identifiers, bundled, and reformatted in JSON format, then claiming fair use is in my opinion a very risky proposition.

  • Daniel Himmelstein: @larsjuhljensen and @mackenziesmith, I think we may all be on the same page. Results from network analyses are truly transformative and thus eligible for CC0 licensing due to fair use. However, the resource processing and integrative network steps, which distribute unmodified downloads as well as network-coerced versions of databases, should generally transmit the source licensing.

  • Lars Juhl Jensen: @dhimmel, we almost agree. The one point where I disagree is that, in my opinion, fair use has nothing to do with it. If you do a network analysis that truly produces new results (e.g. predicting new edges based on the imported ones), then those edges are yours. You are free to do with them whatever you want, not because of fair use, but simply because you are the original creator :-)

Hi all,
I am a lawyer, but not your lawyer (or UC’s lawyer), and this isn’t legal advice. Also, I’m not yet familiar with the data sources or the project at a high level of detail - but here’s what I can say about the general issues.

I. U.S. law

A. Layers of copyright

  1. I notice that you’ve generally got a single assessment of copyright/licensing issues associated with each data source. I could see each one having up to three. For instance, you could have facts that both the original distributor and the downstream user agree are in the public domain - layer 1, you can do anything with those if you’ve extracted them and rearranged them. They could be collected and shared in a database that’s licensed under something like CC BY-SA, and the terms of that license would need to be followed when distributing the whole database, or parts of it, in such a way that you were copying & distributing the licensor’s copyrightable arrangement/selection/original authorship. The database is layer 2. Then you might have special software created by the data distributor to access and manipulate the data and/or the database, and that might be licensed separately, either with a CC license or with an open source software license like MIT or BSD. That’s layer 3. Without being an expert on these particular databases, I’m guessing 2 and 3 are often going to be the same thing, but it’s best not to just assume that.
  2. If no license terms are posted, the underlying facts are in the public domain, and any copyrightable expression like software, or creative arrangement, falls under copyright’s default of “all rights reserved.”
  3. Why am I bothering to spell this out? Depending on the terms of these things and how you want to use/redistribute them, it’s possible that something like the GSEA/MIT terms that look really restrictive may not be a hurdle. I read that one to limit what you can do with layers 2 (“the DATABASE”) and 3 (“the PROGRAM”), but less so layer 1. If you’re committed to redistribution of layer 2 wholesale, then yeah, we’ve got barriers.

B. Particular licenses

  1. Software licenses are more commonly used for software than CC licenses are, although either would theoretically work. UC recommends BSD and MIT licenses in particular, because they don’t say anything about patents.
  2. There’s some interesting stuff in the fine print of the CC licenses that might be helpful. For instance, the SA requirement has to be retained by the original material, and has to be attached to any “Adapted Material.” But not every use of a work is “Adapted Material.” Compilations generally aren’t an adaptation, so maybe there’s some creative thinking to be done around that. Attribution requirements can be satisfied in “any reasonable manner based on the medium, means, and context,” and maybe we could do some thinking about what’s a reasonable manner in this context.

C. The CC0 dedication

  1. Like the CC licenses, the CC0 dedication only applies to … what it can apply to. Just the things the licensor has the ability to waive rights to. Here’s the language:
    Affirmer disclaims responsibility for clearing rights of other persons that may apply to the Work or any use thereof, including without limitation any person's Copyright and Related Rights in the Work. Further, Affirmer disclaims responsibility for obtaining any necessary consents, permissions or other rights required for any use of the Work.
    On the bright side this means that theoretically, you can just release your own layer/contributions/authorship as CC0, without affecting the things you reference or incorporate. Unfortunately, this isn’t so helpful for downstream users who have to try to figure out what the CC0 applies to and what other rights are lurking there. Lots of explanation, labeling, help pages, etc. can be useful if people read them.

D. UC and data “ownership”

  1. I’m going to keep putting “ownership” in quotes until something official explains to me, to my satisfaction, exactly what UC is claiming to own. The APM policy they seem to rely on from the 50s talks about records, like notebooks. Data can only be owned to the extent there’s intellectual property involved, like patent, copyright, or trade secrets. If none of those are present, there’s nothing to own. There may be contractual restrictions about what you can or must do with something, that you’ve agreed to as part of an employment agreement or a grant agreement, but that’s a different animal, and will be more explicit than the automatic rights involved in copyright.
  2. Depending on how this project is funded I think any copyrightable work here - the software, for instance - could be student work, personal work, or institutional work, under the 1992 Copyright Ownership Policy. It’s unlikely to be a scholarly/aesthetic work because of the definition of “designated academic appointee,” but I don’t know who the co-authors are.
  3. UC’s lawyers - OGC and general counsel - will generally weigh in to assess legal risk to the university or disposition of university intellectual property. They will not/cannot provide advice about liability to an individual, or assessment of their personal intellectual property.
  4. Each campus has a designated authority who is authorized to approve licensing decisions and the like on that campus. I believe UCSF’s is Karin Immergluck. In my experience, if we get to a place where we decide “well, this project includes copyrights owned by UCSF, but we want to license them CC BY or dedicate them to the public domain,” an email to the relevant campus person explaining the rationale (and preferably why this isn’t something the university would make money off of) results in a quick approval.

E. Contracts

  1. U.S. copyright law includes all kinds of rights for users, including fair use, and the fact that certain things are in the public domain. But you can sign a contract giving away any of these rights. To the extent that you have to agree to restrictive terms to get access to a data set, those terms may effectively limit your rights to reuse even factual data. It’s like when libraries sign a license for a ProQuest database and promise not to make any copies of newspaper articles from the 1800s.

II. International law

A. Database protection generally

Lots of countries protect a database, but not the underlying facts, with copyright law. I see you found the Bitlaw page on this, which is where I would have sent you.

B. European database directive

I’ve never had occasion to deal with this before, but there’s a parallel thing in some countries like Italy for, e.g. digitizing old manuscripts. Limited protection as an incentive to create the thing or make it accessible. It sounds like enough of a pain that it’s probably worth figuring out which of the proposed sources are covered. That may be time consuming and difficult - so, something for further discussion/research.

C. International liability for potential copyright infringement

This is a tricky issue, and a fun subject for law review articles. Most of them revolve around selling things internationally, for a couple reasons. First, that’s when you’re likely to make people mad enough to bother with suing you. Second, there are jurisdictional issues about how much you have to do in a country to subject yourself to a lawsuit there. All I can say is that internet plus free distribution doesn’t automatically equal global legal risk. But that may not matter much because...

III. There’s law, and then there’s politics.

If we were looking at hundreds of sources, contacting them individually would be a horrible thing to contemplate. With a couple dozen, it might be worth it to put together a form letter to let people know about the project, to avoid burning bridges with current colleagues and potential future collaborators. This could address the things these folks are most likely to be concerned about: what is this project doing with the data sources? How will downstream users be able to tell the source of the data? What things will facilitate or burden commercial use? And there could be a few different versions depending on the legal assessment of the underlying rights and which ones the project implicates - maybe a letter to US sources is more of an FYI, and one to European sources asks them to reply granting permission. Or maybe if the project really wants everything to be as open as possible, you just actually get permission from everyone to a release of some version of their data, in this context, under your chosen license. Just because they make it available to the world under, e.g., CC BY-SA doesn’t mean they can’t make it available to you under different terms. There are options. None of them are as easy as “just use it,” but if people have tried to restrict how their stuff is used you have to decide the relative value you place on maximizing your rights under the law vs. maintaining goodwill.

  • Lars Juhl Jensen: Thanks a lot for also pointing out the politics aspect. In my opinion, the risk of actually getting sued in academia is probably fairly low. However, if you were to systematically take resources with restrictive licenses, integrate them, and redistribute the complete data under CC0, you would almost for sure be burning bridges.

Mixed copyright licensing

As explained above, we have created resources (mostly GitHub repositories) that contain content with varying licenses and restrictions. Therefore, we need to:

  • license different files from the same repository under different licenses
  • license different portions within a single file under different licenses

It appears that there is not a rigid formula for how to specify mixed copyright. I found a few examples including the neo4j source code and a license the UCSF Office of Innovation, Technology & Alliances created for my classmate.

In the latter case, my classmate asked the ITA to assist him in creating an open source license. As @mackenziesmith predicted, UC asserted ownership of the content and forbade any for-profit usage. As an aside, I am highly confident that UC does not own my work, because it is not work made for hire, and I never agreed to any transfer of ownership.

Proposed license for the SIDER4 repository

SIDER 4 is a resource we're using for drug side effects. I propose the following license for the repository:

SIDER 4 data is released under a CC-BY-NC-SA license. Therefore, all redistributed and derived content from SIDER 4 is CC-BY-NC-SA. All original content is released under CC0.

Accordingly, the following files are CC-BY-NC-SA:

  • download/meddra_all_indications.tsv.gz
  • download/meddra_all_se.tsv.gz
  • download/meddra_freq.tsv.gz
  • data/indication.tsv
  • data/side-effects.tsv

Disclaimer: The repository is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the repository or the use or other dealings in the repository.
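The file-level split in this proposed license could also be recorded in machine-readable form, so that scripts and downstream users can look up the terms for any file. The following is a minimal Python sketch under that assumption. The function name `license_for` and the `README.md` lookup are hypothetical, while the five listed paths are the ones named above.

```python
# Files redistributed or derived from SIDER 4, as listed in the proposed license.
SIDER_DERIVED = {
    "download/meddra_all_indications.tsv.gz",
    "download/meddra_all_se.tsv.gz",
    "download/meddra_freq.tsv.gz",
    "data/indication.tsv",
    "data/side-effects.tsv",
}


def license_for(path: str) -> str:
    """Return the license label for a repository-relative path.

    SIDER-derived files keep the source license; everything else is
    treated as original content and labeled CC0.
    """
    return "CC-BY-NC-SA" if path in SIDER_DERIVED else "CC0"


print(license_for("data/side-effects.tsv"))
print(license_for("README.md"))  # hypothetical file, used only as an example
```

Note that defaulting unlisted files to CC0 means any newly added SIDER-derived file must be added to the set, otherwise it would be mislabeled.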

We added the disclaimer to limit our liability as suggested by @mackenziesmith. Does the proposed license seem adequate? Is it clear? @katiefortney, any suggestions?

  • Katie Fortney: Generally with a CC license you should say how the author wants to be credited - if not specific as to manner, then at least with a name of person or organization. That's all so far. I'll keep an eye out now that I'm back in the office...

Chronicling licensing and permission requests

Inspired by the story of Max Haeussler [1] who publicly documented his permission requests to publishers to text mine their corpora, I will be chronicling our licensing efforts that require contact. We will therefore release summaries and statistics pertaining to three types of requests:

  • permission requests to resources with licenses that forbid redistribution or derivatives. We have begun by posting our requests to MSigDB and LINCS L1000.
  • requests to post licenses for resources without licensing information. Resources for which we could not find license information are available here. We have already emailed the creators of these resources and will report back with progress.
  • for datasets obtained from publication supplements, license clarification or permission requests to the journal.

Can you add into the table the corporate or institutional affiliation of the project and the funding agency to each of the data sources?

  • Daniel Himmelstein: I've added institutional affiliations. I did not specify affiliations for community driven projects or multi-affiliated projects. In general, is funding information available? Where can I find it?

  • Caty Chung: Funding is available on the project site or in the citation. For example, for DO's last publication, the funding is from several grants from NIH, EMBL and the Department of Energy.

    When sources are looking into licensing or terms of use, do they bind to the affiliation, funding or make the best educated guess?

  • Daniel Himmelstein: I updated the table to include funding information. Without seeing the specific contractual arrangements between funders and Universities and between Universities and their researchers, it's difficult to know whether there are any binding obligations regarding data licensing.

Data licensing and compliance report

As a refresher, we released an initial version of our network built from publicly available resources. I had assumed that as long as a resource was public, we could use it for our research. In addition, we're committed to open science: releasing our network and intermediate data, both for reproducibility and to allow others to build on our research. However, as @larsjuhljensen pointed out, legal issues arise when using public data that isn't specifically licensed to permit reuse.

It has now been 212 days since Lars's alert and 199 days since I started this discussion seeking expert advice. Here I'll report on the strategy we chose. Our goals were: to bring us into compliance with copyright law and license agreements; to respect the intent of resource creators; to preserve our sunk time investment; and to retain the scientific value of our network. Unfortunately, no one solution satisfied every objective. We were left to choose between several imperfect ways forward.

Compliance efforts

First, I compiled the licenses for all of the resources we included in our network. Of the 28 resources we integrated, only 12 had licenses that met the criteria for open knowledge. As a result, our project would not be a possibility under a paradigm of absolute compliance.

Resources fell into four categories regarding their licensing:

  1. Resources that are in the public domain.
  2. Resources with a license that allows use, redistribution, and modification.
  3. Resources with a license that forbids use, redistribution, or modification.
  4. Resources that do not have a license.

While I retrospectively assigned these categories while writing this post, the approach we pursued for a given resource aligned with its category. We approached category 1 & 2 resources by specifying their license wherever we use, redistribute, or modify them. We approached category 3 & 4 resources by requesting permission from their creators or owners. I attempted attribution for all resources, regardless of category, to maintain data provenance.

Category 1 & 2 resources

There were 4 category 1 resources (Entrez Gene, MEDLINE, LabeledIn, and MeSH), all because works created by the US federal government are not entitled to copyright protection. These resources were easy to integrate: I could proceed without restriction and released derivative works under CC0.

There were 14 category 2 resources. If the resource used a standard license, such as a license by Creative Commons or Open Data Commons, I used the same license, including version, for redistribution and derivative works. Examples include Disease Ontology, DISEASES, Gene Ontology, TISSUES, Uberon, WikiPathways, BindingDB, and DisGeNET. If the resource used a custom license, then I applied a Creative Commons license that abided by the custom stipulations. For example, I applied CC BY 4.0 for custom licenses that require attribution (GWAS Catalog and LINCS L1000) and CC BY-NC 4.0 for custom licenses that forbid commercial use or specify academic use only (DrugBank).
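
The custom-license rule of thumb can be sketched as a small Python function. This is only an illustration: the function name and the two boolean flags are hypothetical, and it encodes just the two mappings stated above. Any other combination raises an error rather than guessing.

```python
def creative_commons_equivalent(requires_attribution: bool,
                                forbids_commercial_use: bool) -> str:
    """Pick a Creative Commons license that abides by custom stipulations.

    Hypothetical helper covering only the two cases described in this post.
    """
    if forbids_commercial_use:
        # Custom licenses that forbid commercial use or are academic-only
        # (for example DrugBank) were mapped to CC BY-NC 4.0.
        return "CC BY-NC 4.0"
    if requires_attribution:
        # Custom licenses that require attribution
        # (for example GWAS Catalog and LINCS L1000) were mapped to CC BY 4.0.
        return "CC BY 4.0"
    # The post does not describe any other case, so refuse to guess.
    raise ValueError("no mapping described for these stipulations")


print(creative_commons_equivalent(requires_attribution=True,
                                  forbids_commercial_use=False))
```

In practice each custom license was read by hand. Two flags cannot capture every stipulation a custom license might contain.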

I embedded licensing into the network as node/relationship properties. Therefore, users can filter to retain only specific licenses when querying or parsing our network. Before the network stage, when data for each resource still resides in separate repositories, I specified licensing via a LICENSE.md file or a section in the README.md file.
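
As a rough illustration of license filtering, the Python sketch below keeps only relationships whose license is in an allowed set. The property name `license`, the record layout, and the placeholder identifiers are assumptions made for this example. They are not the actual Hetionet schema or data.

```python
from typing import Any, Dict, Iterable, List, Set

# Made-up relationship records; a missing "license" key stands for
# a relationship without a license attribute.
edges: List[Dict[str, Any]] = [
    {"source": "node_a", "target": "node_b", "license": "CC0 1.0"},
    {"source": "node_a", "target": "node_c", "license": "CC BY 4.0"},
    {"source": "node_b", "target": "node_d", "license": "CC BY-NC 4.0"},
    {"source": "node_c", "target": "node_d"},
]


def filter_by_license(records: Iterable[Dict[str, Any]],
                      allowed: Set[str]) -> List[Dict[str, Any]]:
    """Keep records whose license property is in the allowed set.

    Records with no license property are dropped, because
    record.get("license") returns None, which is not in the set.
    """
    return [record for record in records if record.get("license") in allowed]


# Example: retain only relationships that permit commercial reuse.
commercial_ok = filter_by_license(edges, {"CC0 1.0", "CC BY 4.0"})
print(len(commercial_ok))
```

The same idea applies to nodes, and to a graph database query that matches on the license property instead of a Python list.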

Category 3 & 4 resources

Originally I identified 3 category 3 resources — MSigDB, Incomplete Interactome, LINCS L1000. I chronicled these permission requests on Thinklab. Through our permission requests, we learned that the Incomplete Interactome was actually category 4 and LINCS L1000 was actually category 2. Our permission request to MSigDB is ongoing.

There were 9 category 4 resources — ADEPTUS, Bgee, DOAF, ehrlink, ERC, hetio-dag, Incomplete Interactome, Human Interactome Database, STARGEO. Since I am the creator of hetio-dag and our STARGEO analysis, these resources did not require any action. For the remaining resources, I sent permission requests.

For category 3 & 4 resources, I opted to continue including the resource in our network regardless of whether we affirmatively received permission. I deemed these resources too critical from a scientific perspective to justify their removal. Several factors shaped my decision: many scientists who post their data assume it will automatically be reusable; the resources were publicly funded with the intent to be used for science; copyright may not apply if our network is fair use or the underlying data is factual; and reuse of scientific data despite all rights reserved is prevalent throughout academia.

There are several unpleasant consequences to my decision to include category 3 & 4 works. First, I risk the legal consequences of infringement. Second, we could have to purge content from our network if a data creator/owner requests that we discontinue use of their resource. Third, anyone who wants to use or build off of our network will have to revisit the same issues we're facing here.

Permission requests by outcome

For category 3 & 4 resources, I requested permission to use the resource for our project. I've organized my requests into four outcomes:

  • EXST — We received a response referring us to an existing license. In the four instances, we had overlooked the license because it was difficult to find or unclear whether it applied.
  • PERM — We received a response granting us permission to use the resource. In both cases, the authors granted their permission but acknowledged that they may not be the rights holder.
  • INC — We received an inconclusive response. In all three cases, the authors indicated they would take licensing actions which have yet to happen.
  • NORESP — No response.

Each resource for which we requested permissions is below. Days indicates the time till first response. When present, public documentation of our request is linked to in Contact Method. The table is sorted by outcome and then by days.

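| Resource | Outcome | Days | Contact Method |
|---|---|---|---|
| Uberon | EXST | 0 | GitHub Issue |
| Entrez Gene | EXST | 2 | helpdesk |
| LINCS L1000 | EXST | 16 | email |
| GWAS Catalog | EXST | 19 | email |
| Incomplete Interactome | PERM | 0 | email |
| Evolutionary Rate Covariation | PERM | 16 | email |
| DOAF | INC | 2 | email |
| Bgee | INC | 9 | email/note |
| MSigDB | INC | 129 | email |
| Human Interactome Database | NORESP | 189+ | email |
| ADEPTUS | NORESP | 198+ | email |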
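
For anyone who wants summary statistics from these requests, here is a small Python sketch that tallies outcomes and computes the median time to first response. The values are transcribed from the rows above. Representing the two unanswered requests with `None` is my own choice, since "189+" and "198+" are waiting times rather than response times.

```python
from collections import Counter
from statistics import median

# (resource, outcome, days until first response); None means no response yet.
requests = [
    ("Uberon", "EXST", 0),
    ("Entrez Gene", "EXST", 2),
    ("LINCS L1000", "EXST", 16),
    ("GWAS Catalog", "EXST", 19),
    ("Incomplete Interactome", "PERM", 0),
    ("Evolutionary Rate Covariation", "PERM", 16),
    ("DOAF", "INC", 2),
    ("Bgee", "INC", 9),
    ("MSigDB", "INC", 129),
    ("Human Interactome Database", "NORESP", None),
    ("ADEPTUS", "NORESP", None),
]

# Number of requests per outcome.
outcome_counts = Counter(outcome for _, outcome, _ in requests)

# Median days to first response, among requests that received one.
response_days = [days for _, _, days in requests if days is not None]
median_days = median(response_days)

print(dict(outcome_counts))
print(median_days)
```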

Conclusion

We've gone to great lengths and invested substantial time in complying with data copyright and licensing. However, under a strict interpretation, our project may infringe upon the rights of publicly-funded scholarly resources.

Additional references

I'll try to keep the following list up to date with webpages or papers I come across that provide relevant data licensing information.

  • Legal confusion threatens to slow data science [1] — which discusses our licensing struggles in creating Hetionet
  • Sharing Research Data and Intellectual Property Law: A Primer [2]
  • Legal Interoperability of Research Data: Principles and Implementation Guidelines [3]
  • Who “owns” your data? by @katiefortney published on the University of California, Office of Scholarly Communication Blog.
  • Create guidelines for OBO maintainers who want to be included in Wikidata: Issue #285 on the OBOFoundry GitHub, discussing licensing options for biomedical ontologies.

Final resource counts

Since stats help provide context, we often mention how many resources Hetionet integrates. Our licensing table lists 31 resources, but 2 were removed. Hence, we claim Hetionet v1.0 integrates 29 resources. This is the number of resources that directly contributed data that was encoded as nodes or edges in the hetnet.

Caveats: The number 29 underestimates the extent of integration required for a project such as Rephetio. First, we used additional resources, such as UniChem, to help standardize and integrate these 29 resources. Additionally, several resources were themselves compilations of other resources, such as BindingDB and the resources for protein-interactions. Finally, we rely on several other databases to interpret our findings, such as ATC Codes and HGNC Gene Families.

Nonetheless, here are the 31 resources divided by their copyright and licensing situation. Note that this stratification requires some subjectivity. In other words, I used my best judgement to help simplify complex legal considerations into 6 categories. The categories are slightly different than above. Furthermore, this post reflects the removal of MSigDB and the corresponding addition of two pathway resources (Reactome & Pathway Interaction Database).

5 public domain resources

Five resources were created by the United States Government. Hence I consider them not subject to copyright and part of the public domain. Nodes and edges from these resources are CC0 licensed in Hetionet v1.0.

  1. Entrez Gene
  2. LabeledIn
  3. MEDLINE
  4. MeSH
  5. Pathway Interaction Database

Caveats: The public domain status of these resources is complicated. None of them adopted a CC0 license to unambiguously place them in the public domain, including outside of the United States. Furthermore, they often come with custom legal statements and terms of use. See for example, the MeSH Memorandum of Understanding. These custom terms make one question whether these are actually public domain resources. It's a mess. @andrewsu is a leading expert on this mess and reform efforts.

12 openly licensed resources

Twelve resources had licenses that met the Open Definition version 2.1, which is summarized as:

Knowledge is open if anyone is free to access, use, modify, and share it — subject, at most, to measures that preserve provenance and openness.

Those resources are:

  1. Disease Ontology
  2. DISEASES
  3. DrugCentral
  4. Gene Ontology
  5. GWAS Catalog
  6. Reactome
  7. LINCS L1000
  8. TISSUES
  9. Uberon
  10. WikiPathways
  11. BindingDB
  12. DisGeNET

Caveats: Not all of these resources used a standard license that officially conforms with the Open Definition. Therefore, I used my best judgement whether custom license terms were compatible with open licensing. For the most part, nodes and edges from these resources are openly licensed in Hetionet v1.0.

4 resources that allow non-commercial reuse

Four resources had licenses that allowed non-commercial reuse only. Nodes and edges from these resources use the least-restrictive compatible Creative Commons license in Hetionet v1.0.

  1. DrugBank 4.2
  2. MEDI
  3. PREDICT
  4. SIDER 4

Besides DrugBank, these resources did use standard Creative Commons licenses, which while not being open are at least legally straightforward. And DrugBank switched to a Creative Commons license part of the way through Project Rephetio, based in part on our feedback.

9 unlicensed resources

Nine resources did not have a license. For the most part, nodes and edges from these resources don't have a license attribute in Hetionet v1.0.

  1. ADEPTUS (removed)
  2. Bgee
  3. DOAF
  4. ehrlink
  5. Evolutionary Rate Covariation
  6. hetio-dag
  7. Incomplete Interactome
  8. Human Interactome Database
  9. STARGEO

Caveats: In these cases, I believe the researchers generally put the data online for others to use but are unaware of the legal barriers to data reuse. Or in other instances, they would like to openly license their work but are not the data owners or are unsure of the legal considerations of doing so.

MSigDB explicitly forbids redistribution

Ultimately, one resource explicitly forbade redistribution.

  1. MSigDB (removed)

Conclusion

$$5 + 12 + 4 + 9 + 1 - 2 = 29$$

Phew!
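
As a sanity check on the arithmetic, the tally can be reproduced with a short Python sketch. The resource names are copied from the lists above. The dictionary keys are just shorthand labels I chose for the section headings.

```python
categories = {
    "public domain": [
        "Entrez Gene", "LabeledIn", "MEDLINE", "MeSH",
        "Pathway Interaction Database",
    ],
    "openly licensed": [
        "Disease Ontology", "DISEASES", "DrugCentral", "Gene Ontology",
        "GWAS Catalog", "Reactome", "LINCS L1000", "TISSUES", "Uberon",
        "WikiPathways", "BindingDB", "DisGeNET",
    ],
    "non-commercial": ["DrugBank 4.2", "MEDI", "PREDICT", "SIDER 4"],
    "unlicensed": [
        "ADEPTUS", "Bgee", "DOAF", "ehrlink",
        "Evolutionary Rate Covariation", "hetio-dag",
        "Incomplete Interactome", "Human Interactome Database", "STARGEO",
    ],
    "forbids redistribution": ["MSigDB"],
}

# Resources marked "(removed)" in the lists above.
removed = {"ADEPTUS", "MSigDB"}

listed = [name for names in categories.values() for name in names]
retained = [name for name in listed if name not in removed]

print(len(listed), len(retained))
```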

 
Status: In Progress
Labels
  licensing
Cite this as
Daniel Himmelstein, Lars Juhl Jensen, MacKenzie Smith, Katie Fortney, Caty Chung (2015) Integrating resources with disparate licensing into an open network. Thinklab. doi:10.15363/thinklab.d107
License

Creative Commons License