Contributions are welcome and encouraged.
Sources
Data on these quangos can be hard to come by, as many of them simply don't publish any statistics. However, it is possible to patch various sources together into a coherent picture.
- quangoes.csv
- quangocrats.csv
- decisions.csv
- graphs.json
- afuera.db
- queries.sql
- The government website lists all public bodies
- The Public Bodies Act 2011 worked for a while
- Colin Mackie's list of civil servants since 1900
- The Civil Service Yearbook has been published since 1972
- The Taxpayers' Alliance often publishes financial analysis
- The Cabinet Office produces its own summary
- Despite the mess, data.gov.uk can be useful
- Generally, government data is years out of date
- Occasionally good data can be found on GitHub
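As a starting point for exploring the files listed above, the following minimal sketch lists the tables in the database file. It assumes that afuera.db is a SQLite database, which this README does not state, and the helper name `list_tables` is hypothetical. It makes no assumptions about the schema.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file.

    Assumption: the file at db_path is a SQLite database.
    """
    # Open read-only through a file URI so that a mistyped path raises an
    # error instead of silently creating a new, empty database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("afuera.db")` from the repository root should then show which tables are available before running anything from queries.sql.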
Credits
This data was collected, cleaned, and processed for The Restorationist by Alex Coppen, an English engineer based in California.
Legal oversight was provided for The Restorationist by Michael Reiners, an English barrister based in London.
Open Source
This dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The copyright holder retains all intellectual property rights to this work; however, recipients are granted the following permissions:
- To reproduce, distribute, and communicate the material via any medium or format
- To adapt, transform, and build upon the material for any purpose, including commercial utilisation
The aforementioned permissions are contingent upon adherence to the following condition: appropriate attribution must be provided by clearly indicating the original creator (Alex Coppen), incorporating a link to the licence, and specifying whether modifications have been implemented. Such attribution shall be provided in a reasonable manner, but not in any way that suggests the licensor endorses you or your utilisation of the work.
For the complete terms and conditions of this licence, please refer to: https://creativecommons.org/licenses/by/4.0/legalcode
Disclaimer
Whilst considerable effort has been undertaken to ensure the accuracy and quality of this dataset, it is provided "as is" without any warranty, express or implied. The dataset may contain errors, omissions, or inconsistencies. Users are advised to exercise independent judgement when utilising this dataset and to verify any critical information independently. The copyright holder shall not be liable for any damages or losses arising from the use of, or inability to use, this data.
Usage in AI Systems
This dataset can easily be optimised for Retrieval-Augmented Generation (RAG) applications. To implement it effectively:
- Convert the data into vector embeddings using a model compatible with your RAG architecture. For optimal results, ensure documents are chunked appropriately (250-1000 tokens depending on your use case).
- Load the embeddings into a vector database such as Pinecone, Weaviate, or Chroma for efficient similarity searching.
- When implementing the retrieval component, experiment with different similarity metrics (cosine, dot product, Euclidean) to determine which works best with this dataset.
- When generating responses, provide the retrieved context alongside your query to your LLM, adjusting the number of retrieved documents based on your accuracy requirements.
- Monitor relevance metrics to ensure quality retrievals and continuously refine your embedding and chunking strategies as needed.
- For tracking people and relationships between data points, use a graph database such as Neo4j.
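The chunk, embed, retrieve, and prompt steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not a recommended implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, whitespace-separated words stand in for tokens, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_tokens=250):
    """Split text into chunks of at most max_tokens words.

    Assumption: whitespace-separated words approximate model tokens.
    """
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def embed(text):
    """Toy bag-of-words 'embedding'; swap in a real embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=3):
    """Return the top_k (score, chunk) pairs most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(query, retrieved):
    """Place the retrieved chunks ahead of the question for the LLM."""
    context = "\n\n".join(chunk for _, chunk in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

In practice each row of the CSV files would first be rendered as text and passed through `chunk_text`, and `top_k` would be tuned as described above. Replacing `cosine` with a dot-product or Euclidean variant is a one-function change, which makes it easy to compare metrics.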