Flamenco Documentation

Preparing Your Data

For Flamenco to load your collection, the metadata about the collection must be provided in tab-delimited files (also known as TSV files, with a ".tsv" extension). TSV files are easy to edit in a spreadsheet program such as OpenOffice or Microsoft Excel. A sample collection, containing the winners of the Nobel Prize from 1901 to 2004, is provided in the example directory of the Flamenco distribution. You can load this collection into Flamenco and browse it, and you can examine the TSV files in the example directory to see how the data needs to be formatted.

A Flamenco collection is a set of items that are all of the same kind (for example, all items are books, or all items are songs). The metadata about any given item consists of its facet values and attribute values. The first step in preparing your collection is to decide which information will be in facets and which will be in attributes. Facet values are used to organize items into categories, whereas attribute values are only displayed with individual items.

In the sample collection, for instance, prize is a facet indicating the type of Nobel Prize won, whereas name is an attribute for the name of the winner. That's because it makes sense to group Nobel Prize winners into categories by the type of prize, but not by their names.

Facet values are associated with ID numbers, whereas attribute values are strings. When an item belongs to a category, and the category belongs to a particular facet, the item has that category term as a value for that facet. "Facet value" and "category term" mean the same thing. For example, since Mother Teresa won the Nobel Peace Prize, Mother Teresa has one value in the prize facet, the prize category named "peace". The value of the name attribute for Mother Teresa is the string "Mother Teresa".

The TSV files you need to provide are:

attrs.tsv
facets.tsv
items.tsv
facet_terms.tsv (for each facet)
facet_map.tsv (for each facet)
sortkeys.tsv (optional)
text.tsv (optional)

attrs.tsv

attrs.tsv gives the list of attributes. Each line in this file represents one attribute. The tab-separated fields in the line should be as follows.

Field 1	Field 2
attribute identifier	displayable name

The attribute identifier should be a short, unique name containing only letters or underscores (no spaces or punctuation). The displayable name is what will be shown in the user interface. The example below gives three attributes.

Example
name	Full Name
birthyear	Year of Birth
deathyear	Year of Death

facets.tsv

facets.tsv gives the list of facets. Each line in this file represents one facet. The tab-separated fields in the line should be as follows.

Field 1	Field 2	Field 3
facet identifier	displayable name	long description

The facet identifier should be a short, unique name containing only letters or underscores. (Identifiers share a single namespace: a facet identifier must not match any other facet identifier or any attribute identifier.) The displayable name is what will be shown in the user interface. The long description gives a more detailed description of the facet. The example below gives four facets.

Example
gender	Gender	gender
affiliation	Affiliation	affiliation at the time of the award
prize	Prize	type of the Nobel Prize won
year	Year	year that the Nobel Prize was won
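
The identifier rules for attrs.tsv and facets.tsv can be checked before loading. The following is a minimal sketch, not part of Flamenco: the function name check_identifiers is hypothetical, and it assumes that "letters" means the ASCII letters a-z and A-Z.

```python
import re

# Assumption: "letters" means ASCII letters; underscores are also allowed.
IDENTIFIER = re.compile(r"[A-Za-z_]+")

def check_identifiers(attr_ids, facet_ids):
    """Return a list of problems found in attribute and facet identifiers."""
    problems = []
    seen = set()
    # Attributes and facets share one namespace, so check them together.
    for ident in list(attr_ids) + list(facet_ids):
        if not IDENTIFIER.fullmatch(ident):
            problems.append(f"invalid identifier: {ident!r}")
        if ident in seen:
            problems.append(f"duplicate identifier: {ident!r}")
        seen.add(ident)
    return problems

# Identifiers from the attrs.tsv and facets.tsv examples above.
print(check_identifiers(["name", "birthyear", "deathyear"],
                        ["gender", "affiliation", "prize", "year"]))
```

An empty list means that every identifier is well formed and that no identifier is used twice.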

items.tsv

items.tsv gives the IDs and attribute values for all the items. Each line of the file represents one item. If there are n attributes, then each line should have n + 1 fields, as follows.

Field 1	Field 2	Field 3	...	Field `n` + 1
item identifier	value for attribute 1	value for attribute 2	...	value for attribute `n`

Each item must have a unique identifier, which can be any number or string. It's best to keep identifiers fairly short (fewer than 30 characters). The item identifier is followed by the values for each attribute, in the order that the attributes were given in attrs.tsv. The example below shows five items excerpted from a longer file, each with three attributes as given in the attrs.tsv example above.

Example
. . .
237	Alfred Werner	1866	1919
238	Marie Curie	1867	1934
239	Jody Williams	1950
240	Jack Steinberger	1921
241	Linus Pauling	1901	1994
. . .

It's fine to leave any of the attribute values blank, but note that each line still must have exactly n + 1 fields (that is, there must be exactly n tab characters). In this example, the lines for items 239 and 240 would each end in a tab character.
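
If you generate items.tsv from a script, it is easy to drop the trailing tab for a blank value. The following is a minimal sketch of one way to pad each line to exactly n + 1 fields; the function name format_item_line is hypothetical and not part of Flamenco.

```python
def format_item_line(item_id, values, n_attrs):
    """Build one items.tsv line with exactly n_attrs + 1 tab-separated fields."""
    if len(values) > n_attrs:
        raise ValueError(f"item {item_id}: too many attribute values")
    # Pad missing trailing values with empty strings so the tab count stays fixed.
    padded = list(values) + [""] * (n_attrs - len(values))
    fields = [str(item_id)] + [str(v) for v in padded]
    # A tab or newline inside a field would change the field or line count.
    if any("\t" in f or "\n" in f for f in fields):
        raise ValueError(f"item {item_id}: fields must not contain tabs or newlines")
    return "\t".join(fields)

# Item 239 from the example above has no value for deathyear.
line = format_item_line(239, ["Jody Williams", "1950"], 3)
print(repr(line))  # '239\tJody Williams\t1950\t'
```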

`facet`_terms.tsv

For each facet, the file named facet_terms.tsv (where facet is the facet identifier as specified in the first column of facets.tsv) gives the tree of category terms in the facet. This is the only file where each line can have a different number of fields. Each line represents one category, and gives the entire chain of ancestor categories leading down to that category. If the category is d levels deep, then the line has d + 1 fields.

Field 1	Field 2	...	Field `d` - 1	Field `d`	Field `d` + 1
term identifier	top-level term	...	grandparent term	parent term	category term

The term identifier must be a number unique within the facet. The tree structure is inferred by matching the category terms, so if two terms are subcategories of the same parent, make sure the parent term matches exactly.

prize is an example of a flat facet (disjoint categories with no subcategories). The prize_terms.tsv file might look like this.

Example
1	chemistry
2	economics
3	literature
4	medicine
5	peace
6	physics

affiliation is a hierarchical facet in the sample collection, arranging each Nobel Prize winner's affiliated organizations under the cities and countries to which they belong. Some of the lines in the affiliation_terms.tsv file might look like this.

Example
. . .
82	Switzerland
83	Switzerland	Geneva
84	Switzerland	Geneva	CERN
85	Switzerland	Zurich
86	Switzerland	Zurich	University of Zurich
. . .

As this example shows, categories at different levels are all distinct, and items can be assigned to them at any level. Also, two different categories can have the same category name as long as their parent categories are different.

Separate lines for each parent category (such as 82 and 83 in this example) are allowed but not required. If they are not present, Flamenco will automatically generate identifiers for the parent categories (for example, a line for the CERN category alone will generate three nested categories: Switzerland, Geneva, and CERN).
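
To illustrate how a tree can be inferred from these lines, the following minimal sketch maps each category path to its term identifier. It is not Flamenco's own loader: the function name parse_terms is hypothetical, and parent categories without a line of their own are recorded as None rather than given a generated identifier.

```python
def parse_terms(lines):
    """Map each category path (a tuple of term names) to its term identifier.

    Ancestor paths that have no line of their own are recorded with None,
    standing in for the identifier that would be generated for them.
    """
    paths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        term_id, path = int(fields[0]), tuple(fields[1:])
        # Register every ancestor prefix so the tree is complete.
        for depth in range(1, len(path)):
            paths.setdefault(path[:depth], None)
        paths[path] = term_id
    return paths

tree = parse_terms([
    "84\tSwitzerland\tGeneva\tCERN",
    "85\tSwitzerland\tZurich",
])
for path, term_id in sorted(tree.items()):
    print(" > ".join(path), term_id)
```

Keying on the full path rather than on the term name alone keeps two categories with the same name distinct when their parents differ.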

`facet`_map.tsv

For each facet, the file named facet_map.tsv (where facet is the facet identifier as specified in the first column of facets.tsv) assigns items to the category terms for that facet. Each line in this file has two fields.

Field 1	Field 2
item identifier	term identifier

The following example puts Alfred Werner in the category for the University of Zurich and Jack Steinberger in the category for CERN.

Example
237	82
237	86
240	84

The first line of this example is redundant but harmless. Whether or not the first line is present, item 237 (Alfred Werner) will automatically be assigned to category 82 (Switzerland), because Switzerland is an ancestor of category 86 (University of Zurich). The same item identifier can appear in multiple lines, which assigns the item to multiple categories in the facet.
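
The automatic assignment to higher-level categories can be pictured with the following minimal sketch. It only illustrates the rule described above and is not Flamenco's implementation; the function name expand_assignments and its data structures are hypothetical.

```python
def expand_assignments(term_paths, item_terms):
    """Return the set of (item, term) pairs, including every ancestor category.

    term_paths maps term identifier -> tuple of names from the top level down.
    item_terms is an iterable of (item identifier, term identifier) pairs.
    """
    id_by_path = {path: term_id for term_id, path in term_paths.items()}
    expanded = set()
    for item_id, term_id in item_terms:
        path = term_paths[term_id]
        # Walk from the top-level term down to the assigned category itself.
        for depth in range(1, len(path) + 1):
            ancestor = id_by_path.get(path[:depth])
            if ancestor is not None:
                expanded.add((item_id, ancestor))
    return expanded

# Term paths from the affiliation_terms.tsv example above.
term_paths = {
    82: ("Switzerland",),
    83: ("Switzerland", "Geneva"),
    84: ("Switzerland", "Geneva", "CERN"),
    85: ("Switzerland", "Zurich"),
    86: ("Switzerland", "Zurich", "University of Zurich"),
}
print(sorted(expand_assignments(term_paths, [(237, 86), (240, 84)])))
```

Adding the pair (237, 82) to the input leaves the result unchanged, which is why the first line of the example is redundant.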

sortkeys.tsv

sortkeys.tsv indicates which facets or attributes are to be used for sorting result lists. This file is optional. If it is present, each line corresponds to one sort key (either a facet or an attribute).

Field 1	Field 2
facet or attribute identifier	description

The first field is the identifier of an attribute or facet, as given in the first column of attrs.tsv or facets.tsv. The second field is the text that will be used for the link that the user selects in order to sort by that attribute or facet.

Example
name	name
birthyear	year of birth
country	country

text.tsv

text.tsv supports the text search feature of Flamenco. This file is optional. If it is present, each line corresponds to one item and provides the searchable text for the item.

Field 1	Field 2
item identifier	searchable text keywords

The following example shows some possible text keywords for the items in the items.tsv example above.

Example
. . .
237	professor chemistry molecule structure
238	professor sorbonne polonium radium
239	campaign to ban landmines
240	professor neutrino muon pion
241	professor chemistry molecule protein antibody
. . .

Searching on the term "professor" would then yield items 237 (Alfred Werner), 238 (Marie Curie), 240 (Jack Steinberger), and 241 (Linus Pauling).
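
The lookup in this example can be reproduced with the following minimal sketch. It assumes whole-word, case-insensitive matching, which may differ from how Flamenco's text search actually works; the function name search is hypothetical.

```python
def search(text_lines, word):
    """Return identifiers of items whose searchable text contains the word."""
    matches = []
    for line in text_lines:
        # The first tab separates the item identifier from its searchable text.
        item_id, _, text = line.rstrip("\n").partition("\t")
        if word.lower() in text.lower().split():
            matches.append(item_id)
    return matches

text_lines = [
    "237\tprofessor chemistry molecule structure",
    "238\tprofessor sorbonne polonium radium",
    "239\tcampaign to ban landmines",
    "240\tprofessor neutrino muon pion",
    "241\tprofessor chemistry molecule protein antibody",
]
print(search(text_lines, "professor"))  # ['237', '238', '240', '241']
```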