Search Provider: Haystack, Elasticsearch

This section describes the search and indexing implementation.

Basics

The application relies on Django Haystack, a high-level framework that brokers between Django and a search backend. The backend is currently Elasticsearch, but it could be swapped for Apache Solr should the need arise.

Re-Indexing

For now, reindexing (or updating the index, for that matter) is only done manually. To index all data, run:

python manage.py rebuild_index

for a full rebuild (wipes the indices first), or:

python manage.py update_index

to perform a simple update. For either command to succeed, make sure Elasticsearch is up and running.

SearchViews

Searching is split between different contexts, represented by different Django views (cf. offenesparlament/search_views.py):

  1. Main Search (all indices), at /search
  2. Persons, at personen/search
  3. Laws, at gesetze/search
  4. Debates, at debatten/search

Each view determines the available facets. The Person view, for instance, returns faceting information for the person’s party (among others) in its results.

The views all inherit from JsonSearchView, an adaptation of Haystack’s SearchView that, instead of rendering a template, returns JSON data to be processed by the frontend.

Each accepts a query parameter, q, and a list of facet filters, named like the facets available for that view:

Main Search
  • No Facets
Persons
  • party: A person’s party, for instance, SPÖ
  • birthplace: A person’s birthplace
  • deathplace: A person’s deathplace
  • occupation: A person’s occupation
  • llps: The legislative period(s) a person was/is active during
  • ts: The timestamp at which the entry was last updated (from the parliament site)
Laws
  • category: A law’s category
  • keywords: A law’s assigned keywords
  • llp: The legislative period of a law
  • ts: The timestamp at which the entry was last updated (from the parliament site)
Debates
  • llp: The legislative period a debate fell into
  • debate_type: either NR or BR (Nationalrat/Bundesrat)
  • date: The date the debate happened

Each facet filter requires that every resulting entry contain the given term, but it does not specify an exact match. Filtering on a field that may hold multiple entries, such as a person’s active legislative periods, returns all persons whose list includes the period in question, not just persons whose list contains only that period.
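The containment semantics described above can be illustrated with a small sketch in plain Python. This is only a model of the behaviour, not the way Haystack or Elasticsearch evaluate the filter; the helper name matches_facets is hypothetical, and the sample entries are reduced versions of the persons shown in the example response further down.

```python
def matches_facets(entry, filters):
    """Return True if the entry contains every requested facet term."""
    for field, term in filters.items():
        value = entry.get(field)
        # Multi-valued fields (lists) match when the term is one of the
        # values; single-valued fields are wrapped so the same test applies.
        values = value if isinstance(value, list) else [value]
        if term not in values:
            return False
    return True


# Reduced sample entries (only the fields needed for the illustration).
persons = [
    {"full_name": "Franz Riepl", "party": "SPÖ",
     "llps": ["XXIV", "XXIII", "XXII", "XXI", "XX"]},
    {"full_name": "Franz Kirchgatterer", "party": "SPÖ",
     "llps": ["XXIV", "XXIII", "XXV"]},
]

# Filtering on llps=XXV keeps persons whose list includes XXV,
# even though the list holds other periods as well.
hits = [p["full_name"] for p in persons
        if matches_facets(p, {"llps": "XXV", "party": "SPÖ"})]
print(hits)
```

Both sample persons would match a filter of llps=XXIV, since each list includes that period alongside others.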

The query parameter searches in the index’s text field - an aggregate field containing most of the other fields to allow more specific searches.

All parameters have to be supplied as GET parameters. A typical request might look like this:

http://offenesparlament.vm:8000/personen/search?q=Franz&llps=XXIV&party=SP%C3%96

and would return the following JSON data:

{
   "facets":{
      "fields":{
         "party":[
            [
               "SP\u00d6",
               2
            ]
         ],
         "birthplace":[
            [
               " Wien",
               1
            ],
            [
               " Wels",
               1
            ]
         ],
         "llps":[
            [
               "XXIV",
               2
            ],
            [
               "XXIII",
               2
            ],
            [
               "XXV",
               1
            ],
            [
               "XXII",
               1
            ],
            [
               "XXI",
               1
            ],
            [
               "XX",
               1
            ]
         ],
         "deathplace":[
            [
               "",
               2
            ]
         ],
         "occupation":[
            [
               " Kaufmann",
               1
            ],
            [
               " Elektromechaniker",
               1
            ]
         ]
      },
      "dates":{

      },
      "queries":{

      }
   },
   "result":[
      {
         "birthplace":" Wien",
         "party_exact":"SP\u00d6",
         "llps_exact":[
            "XXIV",
            "XXIII",
            "XXII",
            "XXI",
            "XX"
         ],
         "text":"PAD_03599\nFranz Riepl\nRiepl Franz\n Wien\n\n Elektromechaniker",
         "birthdate":"1949-03-23T00:00:00",
         "llps":[
            "XXIV",
            "XXIII",
            "XXII",
            "XXI",
            "XX"
         ],
         "deathdate":null,
         "deathplace":"",
         "full_name":"Franz Riepl",
         "occupation_exact":" Elektromechaniker",
         "party":"SP\u00d6",
         "deathplace_exact":"",
         "birthplace_exact":" Wien",
         "reversed_name":"Riepl Franz",
         "source_link":"http://www.parlament.gv.at/WWER/PAD_03599/index.shtml",
         "occupation":" Elektromechaniker"
      },
      {
         "birthplace":" Wels",
         "party_exact":"SP\u00d6",
         "llps_exact":[
            "XXIV",
            "XXIII",
            "XXV"
         ],
         "text":"PAD_35495\nFranz Kirchgatterer\nKirchgatterer Franz\n Wels\n\n Kaufmann",
         "birthdate":"1953-09-24T00:00:00",
         "llps":[
            "XXIV",
            "XXIII",
            "XXV"
         ],
         "deathdate":null,
         "deathplace":"",
         "full_name":"Franz Kirchgatterer",
         "occupation_exact":" Kaufmann",
         "party":"SP\u00d6",
         "deathplace_exact":"",
         "birthplace_exact":" Wels",
         "reversed_name":"Kirchgatterer Franz",
         "source_link":"http://www.parlament.gv.at/WWER/PAD_35495/index.shtml",
         "occupation":" Kaufmann"
      }
   ]
}
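As a rough sketch of how a client might work with the request and response shown above, the following Python snippet builds the query string and turns the facet lists into dictionaries. The helper names build_search_url and facet_counts are hypothetical, and the sample response is a trimmed copy of the JSON above; no network request is made.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://offenesparlament.vm:8000/personen/search"


def build_search_url(base_url, query, **facet_filters):
    """Combine the q parameter and facet filters into a GET URL."""
    params = {"q": query}
    params.update(facet_filters)
    # urlencode percent-encodes non-ASCII values as UTF-8 (Ö -> %C3%96).
    return base_url + "?" + urlencode(params)


def facet_counts(response, field):
    """Map each facet value of a field to its count."""
    pairs = response["facets"]["fields"].get(field, [])
    return {value: count for value, count in pairs}


url = build_search_url(BASE_URL, "Franz", llps="XXIV", party="SPÖ")

# Trimmed copy of the response above, kept as raw JSON text.
sample = json.loads(r'''
{
   "facets": {
      "fields": {
         "party": [["SP\u00d6", 2]],
         "llps": [["XXIV", 2], ["XXIII", 2], ["XXV", 1]]
      },
      "dates": {},
      "queries": {}
   },
   "result": []
}
''')

print(url)
print(facet_counts(sample, "llps"))
```

Each facet entry is a two-element list of value and count, so converting it to a dictionary makes lookups by facet value straightforward.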

Paging

In addition to the query arguments for filtering and faceting, the search views also automatically limit the results to allow for smooth paging. Two parameters govern this behaviour: offset and limit.

Offset returns search results from the given integer on - so, for a search that produced 100 results, an offset value of ‘20’ would only return results 20 to 100. If no offset value is given, the view assumes ‘0’ and returns results starting with the first one.

Limit restricts the number of results per page; with the example above and a limit value of ‘50’, the query would only return results 20 through 70. If no limit is given, the view assumes a default of 50 results. This default can be changed in the offenesparlament/constants.py file.
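The interplay of offset and limit can be sketched as a simple list slice. This is an illustration only, assuming a zero-based offset; the function name page and the constant DEFAULT_LIMIT are hypothetical and do not necessarily match the names used in the views or in constants.py.

```python
DEFAULT_LIMIT = 50  # assumed to mirror the default of 50 described above


def page(results, offset=0, limit=DEFAULT_LIMIT):
    """Return the window of results selected by offset and limit."""
    return results[offset:offset + limit]


# Stand-in for a search that produced 100 results.
results = list(range(100))

window = page(results, offset=20, limit=50)
first_page = page(results)

print(len(window), window[0], window[-1])
```

Under this zero-based assumption, an offset of 20 with a limit of 50 covers positions 20 up to, but not including, 70.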

Fieldsets

Given the amount of data in the index (particularly the debate statements), returning an entire object including all of its fields is too slow for long lists of results. To address this, predefined fieldsets have been introduced. Each index class contains a FIELDSETS dictionary which defines the available fieldsets. The debate class, for instance, contains the following fieldsets:

FIELDSETS = {
    'all': ['text', 'date', 'title', 'debate_type', 'protocol_url', 'detail_url', 'nr', 'llp', 'statements'],
    'list': ['text', 'date', 'title', 'debate_type', 'protocol_url', 'detail_url', 'nr', 'llp'],
}

The dictionary key names the fieldset, and the value is the list of all fields that should be returned when that fieldset is requested.

By default, the search view only returns the ‘list’ fieldset; if a search request must return all available data, the ‘fieldset’ parameter allows requesting a specific fieldset:

http://offenesparlament.vm:8000/personen/search?parl_id=PAD_65677&fieldset=all
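How a fieldset narrows a result entry can be sketched as follows. This is a minimal illustration rather than the actual view code: the helper apply_fieldset is hypothetical, the FIELDSETS dictionary is copied from the debate example above, and the sample entry uses made-up placeholder values.

```python
FIELDSETS = {
    'all': ['text', 'date', 'title', 'debate_type', 'protocol_url',
            'detail_url', 'nr', 'llp', 'statements'],
    'list': ['text', 'date', 'title', 'debate_type', 'protocol_url',
             'detail_url', 'nr', 'llp'],
}


def apply_fieldset(entry, fieldset='list'):
    """Keep only the fields of an entry that the fieldset names."""
    fields = FIELDSETS[fieldset]
    return {name: entry[name] for name in fields if name in entry}


# Placeholder debate entry; the values are invented for illustration.
entry = {
    'title': 'Example debate',
    'debate_type': 'NR',
    'llp': 'XXV',
    'statements': ['first statement', 'second statement'],
}

short = apply_fieldset(entry)          # default 'list' fieldset
full = apply_fieldset(entry, 'all')    # includes the statements

print(sorted(short))
print(sorted(full))
```

In this sketch the ‘list’ fieldset drops the potentially large statements field, which is the reason the text gives for introducing fieldsets.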

Indices

WARNING: Currently, only three separate indices exist: one for the Laws, one for the Persons and one for the Debates. These are subject to heavy development and will still change a lot, so this documentation will remain mostly blank for now.

The indices are defined in op_scraper/search_indexes.py. Each index contains a text field, which aggregates the objects’ data into a single, text-based field, which Haystack uses as the default search field. The exact makeup of this field is defined in templates, located at offenesparlament/templates/search/indexes/op_scraper/*_text.html.