Following the FAIR principles, the environment for content in a Text and Data Mining (TDM) context will necessarily be distributed, with data residing in multiple locations simultaneously. Without going into technical detail about the environment itself, it is safe to say that it must be a federated system: one in which different organizations work together under a defined agreement that lets them invoke and share public services and content.
Terms of Art
We must start with a set of “terms of art” definitions about content in today’s machine-readable world. An examination of the various Creative Commons licenses shows definitions for Licensed Material and Adapted Material, as shown below.
- licensed material - The artistic or literary work, database, or other material to which the Licensor applied this Public License.
- adapted material - Material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor.
These definitions of content (material) are comprehensive, but they don’t consider the creation of structured data, such as that created by the legal data groups or the technology data groups. We reached out to Negar Rostamzadeh, one of the authors of an excellent document that we’ll draw on heavily, wherein they defined additional terms that should be included in any of today’s contribution environments. Here are the terms and definitions found in the document that delve deeper into content; the definition of tokenized data is the one exception and does not come from it:
- data - A subset of information in an electronic format, usually in a JSON or XML dataset, that allows it to be retrieved or transmitted.
- tokenized data - Data broken into small chunks (paragraphs, sentences, words). These tokens help in understanding the context or developing the NLP model. Tokenization allows interpretation of the text’s meaning by analyzing the sequence of the words.
- labeled data - Highly organized metadata in labels or tags that ascribe meaning or representations to the underlying data. These labels or tags can be created manually or automatically. Labeled data may or may not contain the formatting of the original data.
- representation - A different form, format, or model that mimics the effects of given data but does not contain any individual data points or allow third parties to infer individual data points with currently existing technology. A Trained Model need not necessarily require constant exposure to the Data itself, and a transposed input of Data that does not carry it over is referred to as a representation.
- output - The output is defined as the results of applying a Trained Model to data. Thus, the output depends on the Model and what is sought in a particular use case. For example, the output of temperature-predicting models would be temperature-prediction data or the output from a retailer’s pricing model could be the estimated price that consumers are willing to pay. However, within ML and AI, the outputs may also be far-reaching: automatically generated tags and images, even art, can be a model output. Output is essentially what is sought from using a Trained Model.
Why are these important distinctions? Use cases, mainly those for research and for Machine Learning (ML), Artificial Intelligence (AI), and Natural Language Processing (NLP). Don’t all the data definitions fall under licensed content? While data can be extracted with an attached citation, tokenized data and labeled data used within AI, ML, and NLP should be handled separately to avoid confusion. Certain representations of the data, along with output data, fall under the category of derivative content.
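To make the distinction between data, tokenized data, and labeled data concrete, here is a minimal Python sketch. The record layout, the `tokenize` and `label` helpers, and the toy vocabulary are all our own illustrative assumptions; real NLP pipelines use far more capable tokenizers and labeling schemes.

```python
import re

# Hypothetical "data" record; the field names are illustrative, not a standard.
data = {
    "id": "doc-1",
    "text": "Data mining needs clear licenses. Models learn from tokens.",
}


def tokenize(text):
    """Split text into sentences, then each sentence into lowercase word tokens."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [re.findall(r"[A-Za-z']+", s.lower()) for s in sentences]


def label(tokens, vocabulary):
    """Attach a tag to every token; this toy rule only marks known domain terms."""
    return [(tok, "TERM" if tok in vocabulary else "O") for tok in tokens]


# Tokenized data: the original text broken into sentences and words.
tokenized_data = tokenize(data["text"])

# Labeled data: tags that ascribe meaning to the underlying tokens.
vocabulary = {"data", "licenses", "models", "tokens"}  # assumed for illustration
labeled_data = [label(sentence, vocabulary) for sentence in tokenized_data]
```

Note that `labeled_data` no longer carries the formatting of the original text, which is one reason we suggest handling it separately from the licensed data itself.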
Additionally, here are a few terms from the same document regarding use that should be added for any AI, ML, or NLP use cases:
- access - To access, view, and/or download the data in order to view and evaluate it.
- evaluate - To tokenize and label data to train or apply it in a model.
- train - To expose an Untrained Model to the Data to adjust the weights, hyperparameters and structure.
- represent - To transform the data into a representation that mimics the effects of the data while also allowing for tokenization and labeling.
- model - This refers to ML or AI-based algorithms, or assemblies thereof, that, in combination with different techniques, may be used to obtain certain results. These results can be insights on past data patterns, predictions on future trends, or more abstract results. Different learning techniques exist and are used to accomplish different tasks. The term Model is thus inherently generic and is made to capture a wide breadth of techniques, though, admittedly, conceptual and practical differences are immense.
- untrained model - A model is deemed untrained when such a model has been devised, and adjustments have been made to its structure and components. These adjustments are made after exposing different model variations to the Data to find the optimal model, but that model has not been trained on the Data as such. On a technical level, regarding a given set of Data, the Untrained Model is the version of the model for which the weights and hyperparameters have not been adjusted using that Data.
- trained model - An Untrained Model modified by exposure to the Data. In common parlance, a trained model has learned from fulsome exposure to data. Many training phases may result in tweaks to the model or the Data.
These terms of art are supporting terms for understanding how data is taken through the process of tokenization, labeling, representation, and finally, output. They all fall under the adaptation clauses quite neatly.
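The difference between an Untrained Model, a Trained Model, and Output can be sketched in a few lines of Python. This is a deliberately tiny example under our own assumptions: the model has a single weight, the `train` function and the sample values are made up for illustration, and only the weight is adjusted (not hyperparameters or structure).

```python
def train(weight, samples, learning_rate=0.01, epochs=200):
    """Adjust one weight by gradient descent so that weight * x approximates y."""
    for _ in range(epochs):
        for x, y in samples:
            error = weight * x - y
            # Gradient of 0.5 * error**2 with respect to the weight is error * x.
            weight -= learning_rate * error * x
    return weight


# Untrained Model: the weight has not been adjusted using this data.
untrained_weight = 0.0

# Made-up training data that follows y = 2x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Trained Model: the Untrained Model modified by exposure to the data.
trained_weight = train(untrained_weight, samples)

# Output: the result of applying the Trained Model to new data.
output = trained_weight * 4.0
```

The trained weight and the output are new artifacts derived from the data, which is why the clauses below treat rights over the Trained Model and the Output separately from rights over the data.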
Creation of TDM Clauses
There should be a few general clauses about the data being used for tokenization and tagging.
1. The Service allows Consumers to access and consume the data subject to the terms hereof as well as for the purposes of aggregating the data into a corpus, tokenizing the data, and applying Natural Language Processing to the data.
2. Any data processed for TDM purposes and required to maintain attestation of the data shall have the attestation appended to the tokenized and tagged data.
3. Accessed data may be used as training data to evaluate the efficiency of different Untrained Models, algorithms, and structures, but this excludes the reuse of the Trained Model except to show the Training results. This includes the right to use the dataset to measure the performance of a Trained or Untrained Model, without the right to carry over weights, code, or architecture or to implement any modifications resulting from the Evaluation.
4. Accessed data may be used to create or improve Models, but without the right to use the Output or resulting Trained Model for any purpose other than evaluating the Model Research under the same terms.
5. Accessed data may be used to create or improve Models and resulting Output, but without the right to Output Commercialization or Model Commercialization. The Output can be used internally for any purpose but not made available to Consumers or Re-Distributors or for their benefit.
Or, if allowing non-commercial re-distribution of the content:
6. Accessed data may be used to create or improve Models and resulting Output, with the right to make the Output available to Consumers or to use it for their benefit, without the right to Model Commercialization. The Output cannot be made available to Re-Distributors or for their benefit.
Or, if allowing commercial re-distribution of the content:
7. Accessed data may be used to create or improve Models and resulting Output for distribution to Consumers but not to Re-Distributors or for their benefit. The Output can be used internally for any purpose and made available to Consumers, but not made available to Re-Distributors or for their benefit.
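One way to keep track of what each clause allows is to encode the clauses as data. The sketch below is our informal reading of clauses 2 through 7, not a legal interpretation: the `Use` names, the `GRANTED` table, and both helper functions are hypothetical, and the table records only the uses each clause explicitly grants.

```python
from enum import Enum


class Use(Enum):
    EVALUATE_MODELS = "evaluate models"
    CREATE_OR_IMPROVE_MODELS = "create or improve models"
    INTERNAL_OUTPUT_USE = "use output internally"
    OUTPUT_TO_CONSUMERS = "make output available to consumers"
    OUTPUT_TO_REDISTRIBUTORS = "make output available to re-distributors"


# Uses explicitly granted by each numbered clause (assumed mapping, for illustration).
GRANTED = {
    3: {Use.EVALUATE_MODELS},
    4: {Use.EVALUATE_MODELS, Use.CREATE_OR_IMPROVE_MODELS},
    5: {Use.CREATE_OR_IMPROVE_MODELS, Use.INTERNAL_OUTPUT_USE},
    6: {Use.CREATE_OR_IMPROVE_MODELS, Use.OUTPUT_TO_CONSUMERS},
    7: {
        Use.CREATE_OR_IMPROVE_MODELS,
        Use.INTERNAL_OUTPUT_USE,
        Use.OUTPUT_TO_CONSUMERS,
    },
}


def is_permitted(clause, use):
    """Return True only if the given clause explicitly grants the given use."""
    return use in GRANTED.get(clause, set())


def with_attestation(tokens, attestation):
    """Clause 2: keep the attestation attached to the tokenized and tagged data."""
    return {"tokens": list(tokens), "attestation": attestation}


# A consumer licensed under clause 5 asks whether Output may go to Consumers.
allowed_under_5 = is_permitted(5, Use.OUTPUT_TO_CONSUMERS)  # False
allowed_under_6 = is_permitted(6, Use.OUTPUT_TO_CONSUMERS)  # True
```

A table like this is no substitute for the license text, but it makes the differences between clauses easy to compare and easy to check in an automated pipeline.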
Here’s the path we recommend:
1. Make a license. We have worked directly with Perkins Coie to create a Federated Data License that you can fill out and download for your business. Fill it out and route it around so that you are protected!
2. Sign up for the CCH and start learning PlantUML (a simple text-based system) to create data flow diagrams to help you understand where your data is going. To help you get started, we’ve created a diagram that walks you through questions you need to ask about your content and what you need to think about regarding your content’s usage for AI purposes. Check out the diagram.
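If you want a feel for PlantUML before drawing your own diagrams, the short Python sketch below generates PlantUML text for a simple data flow. The `to_plantuml` helper and the stage names are our own illustrative assumptions, and the flow shown is a generic example rather than the diagram mentioned above.

```python
def to_plantuml(steps):
    """Render (source, action, target) triples as a PlantUML component diagram."""
    lines = ["@startuml"]
    for source, action, target in steps:
        # [Name] declares a component; --> draws a labeled arrow between two.
        lines.append(f"[{source}] --> [{target}] : {action}")
    lines.append("@enduml")
    return "\n".join(lines)


# Hypothetical flow following the terms of art defined earlier.
flow = [
    ("Licensed data", "tokenize", "Tokenized data"),
    ("Tokenized data", "label", "Labeled data"),
    ("Labeled data", "train", "Trained model"),
    ("Trained model", "apply", "Output"),
]

diagram = to_plantuml(flow)
print(diagram)
```

Paste the printed text into any PlantUML renderer to see the diagram, then replace the stages with the systems and parties that actually handle your content.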