THE FEVER DATASET 

This dataset contains detailed information about feature evolution from 5 Linux releases. 

1 - ABOUT

This dataset addresses the problem of feature evolution in the context of highly configurable software. 
Features are the units of variability in many systems: they are what we choose to include in a software product. 
As a result, features, as a concept, are spread across many development artefacts, not just the source code. 
The challenge in this context is to obtain a view of what features "are" and how they evolve when their definition, 
their behavior and their evolution are spread across many artefacts of different natures. 

This dataset provides such a feature-oriented view of changes for 5 releases of the Linux kernel. 

The data you are about to query can be seen as a different "view" on the official Linux kernel repository. 
Its content should match what you would find by running "git log"-style commands on the kernel repo. 

We generated this data by running the FEVER tool on this repository. 
As a result, the content of the database is closely related to the artefacts that you would find in Git.



2 - GETTING STARTED

To use the dataset, you need the Neo4j server, available here: www.neo4j.com. 
Then, unzip the dataset. It contains 5 folders, each of which contains a separate Neo4j database. Each database contains the 
detailed change information of features for a single release. 
Copy those folders into the Neo4j ./data/ folder. Then rename the graph.db folder (the default one) to anything you like, 
and rename one of the FEVER folders to graph.db depending on which release you want to query. 
(Re)start your server and you are good to go. 


3 - THE DATA 

In this section, we describe each node type and its properties, and provide some sample queries to help you get started. 

3.0 - general remarks on the data

Neo4j does not enforce a strict data model. However, we used a specific type of node for every concept we extract from 
the Linux history, as well as typed relationships between them. 
In general, when running queries, you don't need to specify all the node and relationship types. Doing so makes the queries
heavier than they need to be, and harder to read.

For instance the following queries would return the exact same results: 

- match (c:commit {hash:"aaaaaaaaaa"})-[:NEXT]->(c2:commit) return c2; 
- match (c:commit {hash:"aaaaaaaaaa"})-->(c2:commit) return c2; 
- match (c:commit {hash:"aaaaaaaaaa"})-[:NEXT]->(c2) return c2; 

They return the same results because there is only one type of relationship between 2 commit nodes. 

In a sense, we over-specified our data. However, this makes the model and the query results easier to read, and does not 
come with any major drawback, except that you can write many different queries returning the same results. 
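
To check which node and relationship types a given release database contains, a quick inventory can help. 
The two queries below are a minimal sketch relying only on the standard Cypher functions labels() and type(). 
The counts will differ from one release to the next, and depending on your Neo4j client you may need to run them one at a time.

```cypher
// count nodes per label (commit, ArtefactEdit, FeatureEdit, ...)
match (n) return labels(n) as node_labels, count(*) as total order by total desc;

// count relationships per type (NEXT, TOUCHES, CHANGES_VM, ...)
match ()-[r]->() return type(r) as relationship_type, count(*) as total order by total desc;
```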


3.1 - Node: commit

3.1.1  - overview

A commit node represents a commit as performed by developers in the kernel repository. 

The commit node has the following properties: 
- hash: the sha1 id of the commit. It consists of the first 10 characters of the official sha1 id of the commit. 
- author: name of the author of the commit as represented in Git
- commiter: name of the committer of this commit as represented in Git
- message: complete commit message, including all the ack/sign-off information. This is a multi line attribute. 

The commit node *may* have the following relationships - it depends on the artefacts edited during the commit
- NEXT : links commits together, as the parent/child relationship does in Git branches. 
- TOUCHES: links commits to ArtefactEdit nodes
- CHANGES_VM:  links commits to "FeatureEdit" nodes
- CHANGES_IMPLEMENTATION: links commits to "SourceEdit" nodes
- CHANGES_BUILD : links commits to "MappingEdit" nodes

3.1.2 - additional notes

You can use the "hash" property to find commits in the official git repo. They should always match. 

3.1.3 - Query sample: 

- to obtain all commits in the database : 
	match (c:commit) return c ; 

- to obtain all commits containing the word "bug" in the message (multi line property query example)
	match (c:commit) where c.message =~ "(?ism).* bug .*" return c ; 

	note: 
		you need the `=~` symbol to enable regexp matching. 
		the (?ism) string goes with the =~, and activates regexp flags. "m" is for multi-line, "i" is for case-insensitive, "s" makes sure that the "." wildcard can match any character, including line breaks.
		depending on your use case, this might not be the flags you need. Adjust accordingly.

- to get the commit following commit with hash "0fed2bcf17"
	match (c:commit {hash:"0fed2bcf17"})-[:NEXT]->(c2:commit) return c2;
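
The relationships listed in 3.1.1 can be followed from a commit in the same way. The queries below are a sketch: they assume that 
node labels are spelled exactly like the node types named in this document, and that relationships point from the commit to the edit node. 
If nothing is returned, try removing the arrowhead to match in both directions.

```cypher
// artefact edits performed by one commit; the hash is the one used in the example above
match (c:commit {hash:"0fed2bcf17"})-[:TOUCHES]->(a:ArtefactEdit) return a.name, a.type, a.change;

// feature-related edits of the same commit, whatever their kind
match (c:commit {hash:"0fed2bcf17"})-[r:CHANGES_VM|CHANGES_IMPLEMENTATION|CHANGES_BUILD]->(e) return type(r), e;
```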
	

3.2 - Node : ArtefactEdit

3.2.1 - overview

An ArtefactEdit node represents an edit to any artefact contained within the repository. 
If a file is touched in a commit, regardless of the nature of the file, it will be represented by an "ArtefactEdit" node. 

The ArtefactEdit node has the following attributes: 
- name : full name of the artefact, starting from the root folder of the repository
- type : the type of artefact being touched. the possible values are :
	"source", "vm", "build"
	"source" artefacts are considered to be source code, containing behavior
	"vm" files describe the variability model of the system
	"build" files are part of the build system, and are used to recover the mapping between features and source files.
	artefacts that are neither source, vm nor build files do NOT have this property.
- change : the change affecting the artefact. the possible values are :
	"MODIFIED", "ADDED", "REMOVED"

3.2.2 - additional notes
	the datamodel is currently being extended to include "data" artefact types. 


3.2.3 - query samples
	match (a:ArtefactEdit) where a.name =~"(?ism).*drivers.*" return a ; 
	returns all artefact edits referring to files with "drivers" in their path.
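
	The "type" and "change" attributes can be used for filtering and counting as well. The following is a minimal sketch, 
	assuming the attribute values are stored exactly as listed in 3.2.1:

```cypher
// number of artefact edits per type and change; rows with a null type are
// artefacts that are neither source, vm nor build files
match (a:ArtefactEdit) return a.type, a.change, count(*) as edits order by edits desc;

// variability model files that were added
match (a:ArtefactEdit) where a.type = "vm" and a.change = "ADDED" return a.name;
```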


3.3 - Node : FeatureEdit and FeatureDesc

3.3.1 - overview

A FeatureEdit node represents the change of a single feature definition in the variability model. 
A FeatureEdit always occurs in the context of an edit to a variability model file.
A change to a variability model file might affect more than one feature, in which case
multiple FeatureEdit nodes will be created for this file. 

FeatureDesc (for feature description) objects are more precise descriptions of the state of features
before and after the changes. FeatureDesc objects describe features as they are described in the 
system under study. In this instance, the attributes found on such objects match almost one to one
the attributes of features in the Linux kernel. 

FeatureEdit nodes have the following attributes:
- name : the name of the feature
- change: the change occurring to that feature 
	the possible values are : "Modify", "Add", "Remove"
- type : the type of the feature (specific to Linux). 
	the possible values are : "BOOLEAN", "TRISATE", "INT", "STRING", "HEX"
- optionality: (deprecated) 
- visibility : indicates if the feature is visible to the end user during the configuration process
	possible values are "visible" or "internal"

FeatureDesc nodes have the following attributes: 

- name: name of the feature
- type: type of the feature (same as FeatureEdit, but with this attribute you can keep track of feature type changes more precisely)	
- optionality: (deprecated)
- visibility: indicates if the feature is visible to the end user during the configuration process
- depends on: dependencies of the feature. This is obtained using Dumpconf in the Linux kernel. 
- presence condition : deprecated , rely on "depends on" attribute only.
- default_values: default values of the feature, if any, with their condition if any 
- selects: features selected when this feature is selected. This is Linux specific. 

The FeatureEdit node is connected to other nodes with the following relationships:
- IS : links the feature edit to a FeatureDesc object containing the description of the feature as it IS after the change
- WAS: links the feature edit to a FeatureDesc object containing the description of the feature as it WAS before the change
- IN: points to the ArtefactEdit object in which the feature change took place

3.3.2 - additional notes

The presence condition attribute is not usable and will be removed in future versions of the dataset. 
The optionality attribute is also quite unreliable, and has not been validated. It will be removed in future versions of the dataset. 

3.3.3 - query samples
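
The queries below sketch how the attributes and relationships described in 3.3.1 can be combined. They assume that node labels 
are spelled like the node types above, that the IS, WAS and IN relationships point away from the FeatureEdit node, and that 
attribute values are stored as listed. "FEATURE_NAME" is a placeholder to replace with a feature name found in the database. 
The backticks around IS, WAS and IN are only a precaution, since IS and IN are also Cypher keywords.

```cypher
// added features that are visible to the end user during configuration
match (f:FeatureEdit) where f.change = "Add" and f.visibility = "visible" return f.name, f.type;

// modified features whose type changed, comparing the descriptions before and after
match (before:FeatureDesc)<-[:`WAS`]-(f:FeatureEdit)-[:`IS`]->(after:FeatureDesc)
where f.change = "Modify" and before.type <> after.type
return f.name, before.type, after.type;

// variability model files in which a given feature was edited
match (f:FeatureEdit {name:"FEATURE_NAME"})-[:`IN`]->(a:ArtefactEdit) return a.name, a.change;
```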

3.4 - Node : MappingEdit

3.4.1 - overview
A MappingEdit node represents a change to the mapping between features and "assets" (of any kind).

The MappingEdit node has the following properties: 
- type: the type of item mapped to an asset 
	can only be: "FEATURE"  (maybe "SYMBOL" for non-feature variables, but those shouldn't appear in the database)
- feature: name of the feature mapped to the asset
- target: name of the asset mapped to the feature
- target_type: nature of the asset mapped to the feature. 
	can be one of : "CC_FLAG" for compilation flags, "FOLDER" for folders, "COMPILATION_UNIT" for compilation units, or "DATA" for binary/json files (in Soletta).
- mapping_change: change to the mapping to the feature, can be one of:
	"MODIFED" if the feature was already mapped to something, "ADDED" if the feature wasn't mapped in this file before, or "REMOVED" if the mapping for this feature is removed.
- target_change: change to the target in the context of this mapping. A target can only be "ADDED" or "REMOVED"
- artefact_change (EXPERIMENTAL): change to the artefact linked to the feature

The MappingEdit node is connected to the following nodes: 
- IN : points to the artefacts (Makefile) in which the mapping was modified.

3.4.2 - additional notes

3.4.3 - query samples
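
The following queries are a sketch based on the properties listed in 3.4.1. They assume that the node label is "MappingEdit", 
that the IN relationship points from the MappingEdit to the ArtefactEdit node, and that property values are stored as listed. 

```cypher
// features newly mapped to a compilation unit
match (m:MappingEdit) where m.mapping_change = "ADDED" and m.target_type = "COMPILATION_UNIT"
return m.feature, m.target;

// artefacts (Makefiles) in which a mapping changed, with the affected feature and target
match (m:MappingEdit)-[:`IN`]->(a:ArtefactEdit)
return a.name, m.feature, m.mapping_change, m.target limit 25;
```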

3.5 - Node: SourceEdit

3.5.1 - overview 
A SourceEdit node indicates an update of the implementation of a feature interaction. 
It captures a change to code contained within a conditionally compiled code block. 

The SourceEdit node has the following properties: 
- change : change to the code block. The possible values are: "New", "Modified", "Removed"
- interaction: the condition under which the block is included (with expanded nesting for the file)
- code_edit: the change of the implementation within the block. Can be one of: 
	- "Preserved", if the code has not been changed inside the block
	- "Added", "Removed" if the code has been entirely added or removed
	- "Edited" if the code has been partially touched

The SourceEdit node has the following relationships:
- IN: points to the artefact edit in which the change took place. 
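
As a starting point, SourceEdit nodes can be queried as sketched below. The queries assume that the node label is "SourceEdit", 
that the IN relationship points from the SourceEdit to the ArtefactEdit node, and that "FEATURE_NAME" is replaced by a feature 
name as it appears in the "interaction" property. 

```cypher
// new conditionally compiled blocks, with the file they belong to
match (s:SourceEdit)-[:`IN`]->(a:ArtefactEdit) where s.change = "New"
return a.name, s.interaction, s.code_edit limit 25;

// blocks whose condition mentions a given feature
match (s:SourceEdit) where s.interaction =~ "(?s).*FEATURE_NAME.*"
return s.interaction, s.change, s.code_edit;
```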


3.6 - Node: TimeLine

3.6.1 - overview

The TimeLine node aggregates all changes pertaining to the same feature over the studied period of time (a release for this dataset).

The TimeLine node has the following properties: 
- name: name of the feature whose evolution is captured by this node.

The TimeLine node has the following relationships: 
- FEATURE_CORE_UPDATE: points to ArtefactEdit, FeatureEdit and MappingEdit affecting this feature.
- FEATURE_INFLUENCE_UPDATE: points to SourceEdit, and FeatureEdit nodes related to this feature. 

A FEATURE_CORE_UPDATE indicates that the definition or behavior of the feature is being updated in the commit. 
A FEATURE_INFLUENCE_UPDATE indicates that the "influence" of the feature on other features is being updated. 
	This includes code scattering (pointers to SourceEdit nodes) or influence in feature declarations (pointers to FeatureEdit nodes).
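
The TimeLine node gives a per-feature entry point into the data. The queries below are a sketch, assuming that the node label is 
"TimeLine", that both relationships point away from the TimeLine node, and that "FEATURE_NAME" is replaced by a feature name 
found in the database. 

```cypher
// every edit that changed the definition or behavior of one feature
match (t:TimeLine {name:"FEATURE_NAME"})-[:FEATURE_CORE_UPDATE]->(e) return labels(e), e;

// the ten features with the most influence updates in this release
match (t:TimeLine)-[:FEATURE_INFLUENCE_UPDATE]->(e)
return t.name, count(e) as updates order by updates desc limit 10;
```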

