Flamenco Documentation
Preparing Your Data
For Flamenco to load your collection, the metadata about the collection has to be provided in tab-delimited files (also known as TSV files, with a ".tsv" extension). TSV files can be easily manipulated using OpenOffice or Microsoft Excel.
A sample collection, containing the winners of the Nobel Prize from 1901 to 2004, is provided in the example directory of the Flamenco distribution. You can load this collection into Flamenco and browse it, and you can examine the TSV files in the example directory to see how the data needs to be formatted.
A Flamenco collection is a set of items that are all the same kind (for example, all items are books, or all items are songs, and so on). The metadata about any given item consists of its facet values and attribute values. The first step in preparing your collection is to decide which information will be in facets and which will be in attributes. Facet values are used to organize items into categories, whereas attribute values are only displayed with individual items.
In the sample collection, for instance, prize is a facet indicating the type of Nobel Prize won, whereas name is an attribute for the name of the winner. That's because it makes sense to group Nobel Prize winners into categories by the type of prize, but not by their names.
Facet values are associated with ID numbers, whereas attribute values are strings. When an item belongs to a category, and the category belongs to a particular facet, the item has that category term as a value for that facet. "Facet value" and "category term" mean the same thing. For example, since Mother Teresa won the Nobel Peace Prize, Mother Teresa has one value in the prize facet, the prize category named "peace". The value of the name attribute for Mother Teresa is the string "Mother Teresa".
The TSV files you need to provide are:
attrs.tsv
facets.tsv
items.tsv
facet_terms.tsv (for each facet)
facet_map.tsv (for each facet)
sortkeys.tsv (optional)
text.tsv (optional)
attrs.tsv
attrs.tsv gives the list of attributes. Each line in this file represents one attribute. The tab-separated fields in the line should be as follows.
Field 1 | Field 2 |
---|---|
attribute identifier | displayable name |
The attribute identifier should be a short, unique name containing only letters or underscores (no spaces or punctuation). The displayable name is what will be shown in the user interface. The example below gives three attributes.
name | Full Name |
birthyear | Year of Birth |
deathyear | Year of Death |
facets.tsv
facets.tsv gives the list of facets. Each line in this file represents one facet. The tab-separated fields in the line should be as follows.
Field 1 | Field 2 | Field 3 |
---|---|---|
facet identifier | displayable name | long description |
The facet identifier should be a short, unique name containing only letters or underscores. (Facet and attribute identifiers must be unique among both facets and attributes.) The displayable name is what will be shown in the user interface. The long description gives a more detailed description of the facet. The example below gives four facets.
gender | Gender | gender |
affiliation | Affiliation | affiliation at the time of the award |
prize | Prize | type of the Nobel Prize won |
year | Year | year that the Nobel Prize was won |
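The identifier rules for attrs.tsv and facets.tsv can be checked before loading. The following is a minimal sketch, not part of Flamenco: the function name check_identifiers is hypothetical, and it assumes "letters" means ASCII letters.

```python
import re

# Assumption: "letters" means ASCII letters; underscores are also allowed.
IDENTIFIER_RE = re.compile(r"[A-Za-z_]+")

def check_identifiers(attr_rows, facet_rows):
    """Return a list of problems found in attribute and facet identifiers.

    attr_rows and facet_rows are lists of field lists, one per TSV line.
    """
    problems = []
    seen = set()
    # Identifiers share one namespace across attributes and facets.
    for kind, rows in (("attribute", attr_rows), ("facet", facet_rows)):
        for fields in rows:
            ident = fields[0]
            if not IDENTIFIER_RE.fullmatch(ident):
                problems.append(f"{kind} identifier {ident!r} has invalid characters")
            if ident in seen:
                problems.append(f"{kind} identifier {ident!r} is not unique")
            seen.add(ident)
    return problems

attrs = [["name", "Full Name"],
         ["birthyear", "Year of Birth"],
         ["deathyear", "Year of Death"]]
facets = [["gender", "Gender", "gender"],
          ["prize", "Prize", "type of the Nobel Prize won"]]
print(check_identifiers(attrs, facets))  # prints []
```

An empty list means every identifier is well formed and no name is shared between an attribute and a facet.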
items.tsv
items.tsv gives the IDs and attribute values for all the items. Each line of the file represents one item. If there are n attributes, then each line should have n + 1 fields, as follows.
Field 1 | Field 2 | Field 3 | ... | Field n + 1 |
---|---|---|---|---|
item identifier | value for attribute 1 | value for attribute 2 | ... | value for attribute n |
Each item must have a unique identifier, which can be any number or string. It's best to use identifiers that are fairly short (less than 30 characters). The item identifier is followed by the values for each attribute, in the order that the attributes were given in attrs.tsv. The example below shows five items excerpted from a longer file, each with three attributes as given in the attrs.tsv example above.
. . . |
237 | Alfred Werner | 1866 | 1919 |
238 | Marie Curie | 1867 | 1934 |
239 | Jody Williams | 1950 | |
240 | Jack Steinberger | 1921 | |
241 | Linus Pauling | 1901 | 1994 |
. . . |
It's fine to leave any of the attribute values blank, but note that each line still must have exactly n + 1 fields (that is, there must be exactly n tab characters). In this example, the lines for items 239 and 240 would each end in a tab character.
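The fixed field count can be enforced when generating items.tsv from a script. This is a sketch under stated assumptions: format_item_line is a hypothetical helper, not part of Flamenco, and it treats None as a blank attribute value.

```python
def format_item_line(item_id, values, n_attrs):
    """Build one items.tsv line with exactly n_attrs + 1 tab-separated fields."""
    if len(values) > n_attrs:
        raise ValueError("more values than attributes")
    # Pad missing trailing values with empty strings so the field count is fixed.
    padded = list(values) + [""] * (n_attrs - len(values))
    fields = [str(item_id)] + ["" if v is None else str(v) for v in padded]
    for field in fields:
        # A tab or newline inside a value would change the field count.
        if "\t" in field or "\n" in field:
            raise ValueError("field contains a tab or newline")
    return "\t".join(fields)

line = format_item_line(239, ["Jody Williams", 1950, None], 3)
print(repr(line))  # '239\tJody Williams\t1950\t'
```

The blank deathyear leaves the line ending in a tab character, so the line still contains exactly three tabs.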
facet_terms.tsv
For each facet, the file named facet_terms.tsv (where facet is the facet identifier as specified in the first column of facets.tsv) gives the tree of category terms in the facet. This is the only file where each line can have a different number of fields. Each line represents one category, and gives the entire chain of ancestor categories leading down to that category. If the category is d levels deep, then the line has d + 1 fields.
Field 1 | Field 2 | ... | Field d - 1 | Field d | Field d + 1 |
---|---|---|---|---|---|
term identifier | top-level term | ... | grandparent term | parent term | category term |
The term identifier must be a number unique within the facet. The tree structure is inferred by matching the category terms, so if two terms are subcategories of the same parent, make sure the parent term matches exactly.
prize is an example of a flat facet (disjoint categories with no subcategories). The prize_terms.tsv file might look like this.
1 | chemistry |
2 | economics |
3 | literature |
4 | medicine |
5 | peace |
6 | physics |
affiliation is a hierarchical facet in the sample collection, arranging each Nobel Prize winner's affiliated organizations under the cities and countries to which they belong. Some of the lines in the affiliation_terms.tsv file might look like this.
. . . |
82 | Switzerland |
83 | Switzerland | Geneva |
84 | Switzerland | Geneva | CERN |
85 | Switzerland | Zurich |
86 | Switzerland | Zurich | University of Zurich |
. . . |
As this example shows, categories at different levels are all distinct, and items can be assigned to them at any level. Also, two different categories can have the same category name as long as their parent categories are different.
Separate lines for each parent category (such as 82 and 83 in this example) are allowed but not required. If they are not present, Flamenco will automatically generate identifiers for the parent categories (for example, the CERN category will generate three nested categories, Switzerland, Geneva, and CERN).
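To illustrate how a tree can be inferred from these lines, here is a minimal sketch. It models the behavior described above and is not Flamenco's actual loader; the function name category_paths is hypothetical, and implied parents are marked with None rather than given generated identifiers.

```python
def category_paths(lines):
    """Map each category path (a tuple of terms) to its term identifier.

    Categories that appear only as ancestors of a listed category
    are recorded with None in place of an identifier.
    """
    paths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        term_id, path = int(fields[0]), tuple(fields[1:])
        # Register every ancestor prefix so implied parents exist in the tree.
        for depth in range(1, len(path)):
            paths.setdefault(path[:depth], None)
        paths[path] = term_id
    return paths

lines = [
    "84\tSwitzerland\tGeneva\tCERN",
    "86\tSwitzerland\tZurich\tUniversity of Zurich",
]
tree = category_paths(lines)
for path in sorted(tree):
    print(" > ".join(path), tree[path])
```

Because each category is keyed by its full path, two categories with the same name under different parents remain distinct, and a misspelled parent term would produce a separate branch.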
facet_map.tsv
For each facet, the file named facet_map.tsv (where facet is the facet identifier as specified in the first column of facets.tsv) assigns items to the category terms for that facet. Each line in this file has two fields.
Field 1 | Field 2 |
---|---|
item identifier | term identifier |
The following example puts Alfred Werner in the category for the University of Zurich and Jack Steinberger in the category for CERN.
237 | 82 |
237 | 86 |
240 | 84 |
The first line of this example is redundant but harmless. Whether or not the first line is present, item 237 (Alfred Werner) will automatically be assigned to category 82 (Switzerland), because Switzerland is an ancestor of category 86 (University of Zurich). The same item identifier can appear in multiple lines, which assigns the item to multiple categories in the facet.
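The automatic assignment to ancestor categories can be sketched as follows. This models the rule described above and is not Flamenco's own code; with_ancestors and parent_of are hypothetical names, and the parent table is written by hand from the affiliation_terms.tsv example.

```python
def with_ancestors(assignments, parent_of):
    """Expand (item, term) pairs so each item also belongs to every ancestor term."""
    expanded = set()
    for item_id, term_id in assignments:
        current = term_id
        # Walk up the tree until a top-level term (no parent) is reached.
        while current is not None:
            expanded.add((item_id, current))
            current = parent_of.get(current)
    return expanded

# Child term identifier -> parent term identifier, from the example above.
parent_of = {83: 82, 84: 83, 85: 82, 86: 85}
pairs = [(237, 82), (237, 86), (240, 84)]
result = with_ancestors(pairs, parent_of)
print(sorted(result))
# [(237, 82), (237, 85), (237, 86), (240, 82), (240, 83), (240, 84)]
```

Dropping the (237, 82) pair from the input gives the same result, which is why that line in the map file is redundant.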
sortkeys.tsv
sortkeys.tsv indicates which facets or attributes are to be used for sorting result lists. This file is optional. If it is present, each line corresponds to one sort key (either a facet or an attribute).
Field 1 | Field 2 |
---|---|
facet or attribute identifier | description |
The first field is the identifier of an attribute or facet, as given in the first column of attrs.tsv or facets.tsv. The second field is the text that will be used for the link that the user selects in order to sort by that attribute or facet.
name | name |
birthyear | year of birth |
country | country |
text.tsv
text.tsv supports the text search feature of Flamenco. This file is optional. If it is present, each line corresponds to one item and provides the searchable text for the item.
Field 1 | Field 2 |
---|---|
item identifier | searchable text keywords |
The following example shows some possible text keywords for the items in the items.tsv example above.
. . . |
237 | professor chemistry molecule structure |
238 | professor sorbonne polonium radium |
239 | campaign to ban landmines |
240 | professor neutrino muon pion |
241 | professor chemistry molecule protein antibody |
. . . |
Searching on the term "professor" would then yield items 237 (Alfred Werner), 238 (Marie Curie), 240 (Jack Steinberger), and 241 (Linus Pauling).
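The lookup in this example can be imitated with a few lines of code. This is only a sketch: how Flamenco actually matches search terms is not specified here, so the code assumes case-insensitive matching on whole whitespace-separated words, and the function name search is hypothetical.

```python
def search(text_rows, term):
    """Return the identifiers of items whose searchable text contains the term.

    Assumption: matching is case-insensitive and on whole words only.
    """
    term = term.lower()
    return [item_id for item_id, text in text_rows
            if term in text.lower().split()]

rows = [
    (237, "professor chemistry molecule structure"),
    (238, "professor sorbonne polonium radium"),
    (239, "campaign to ban landmines"),
    (240, "professor neutrino muon pion"),
    (241, "professor chemistry molecule protein antibody"),
]
print(search(rows, "professor"))  # [237, 238, 240, 241]
```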