Sustainable Impact Estimation: A Journey from Custom Models to LLMs

By Philipp Gross

As artificial intelligence becomes increasingly prevalent, its application to start-up evaluation can revolutionise industries and facilitate sustainable impacts. This article documents the journey of our software development project Impactulator, which aims to estimate the sustainable impact of start-ups by using AI.

Below, I’ll explore how the team applied natural language processing (NLP) techniques, starting with custom models and eventually adopting state-of-the-art large language models (LLMs) like GPT-4, despite the scarcity of data that analysts of start-ups commonly face.

Part of an impact report for an example company, Brim Explorer, including outcomes likely targeted by the organisation. All information is generated by GPT-4.

1. The challenge: estimating sustainable impact with limited data

There’s often limited data available about start-ups, especially in the early stages of their development. However, building machine learning models that estimate a company’s sustainable impact accurately requires substantial data. We addressed this hurdle with a straightforward approach: gathering initial data from the start-up’s own website, the primary source of information that is publicly available and uniformly accessible. The landing page of a company’s website usually gives a good summary of its main activities, although processing this information reliably is somewhat challenging.

2. Data collection and initial NLP analysis

To kickstart the sustainable impact estimation, the team used web-scraping techniques to extract relevant information from the start-up’s website. Our focus was on capturing the text content: either the main text of the website or secondary text that’s hidden in the meta tags, like Open Graph descriptions.
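As a rough illustration of this extraction step, here is a minimal sketch that uses only Python's standard library. It assumes the page HTML has already been downloaded. The class and function names (`LandingPageParser`, `extract_page_content`) and the sample HTML are hypothetical and not taken from the Impactulator codebase.

```python
from html.parser import HTMLParser


class LandingPageParser(HTMLParser):
    """Collects visible text and description-like meta tags from HTML."""

    # Content of these tags is never shown to a reader, so we skip it.
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.meta = {}          # e.g. {"og:description": "..."}
        self.text_parts = []    # visible text fragments, in document order
        self._skip_depth = 0    # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            # Open Graph uses `property`, the classic description uses `name`.
            key = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if key and content and (key.startswith("og:") or key == "description"):
                self.meta[key] = content.strip()

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_page_content(html: str) -> dict:
    """Return the meta descriptions and the main visible text of a page."""
    parser = LandingPageParser()
    parser.feed(html)
    parser.close()
    return {"meta": parser.meta, "text": " ".join(parser.text_parts)}


# Made-up sample page, for illustration only.
sample_html = """
<html><head>
  <title>Example Co</title>
  <meta property="og:description" content="Silent electric boat tours">
  <style>h1 { color: blue; }</style>
</head><body>
  <h1>Welcome</h1>
  <script>console.log("tracking");</script>
  <p>We run emission-free sightseeing trips.</p>
</body></html>
"""
content = extract_page_content(sample_html)
```

Real landing pages are messier than this (cookie banners, navigation menus, JavaScript-rendered content), which is part of why processing them reliably is challenging.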

3. The transition: custom models to LLMs

Initially, the team developed custom NLP models with spaCy to identify the products or services a company offers. These models were tailored specifically to the start-up’s domain. They yielded decent results there, but fell short on scalability and accuracy and were difficult to extend without costly data annotation. Recognising these limitations, the team decided to embrace more recent AI advances, which led us to LLMs like GPT-4. While access to these models is costly, they are an excellent tool for rapidly prototyping information-extraction techniques by experimenting with different prompts.

4. Leveraging GPT-4 for natural language processing

GPT-4, with its impressive language understanding and generation capabilities, was the LLM of choice. The model's pre-trained knowledge significantly reduced the need for labelled data, making it the ideal choice for a data-scarce start-up environment.

We restructured our NLP pipeline to incorporate GPT-4 for various tasks:

  • Basic information: identifying the name of the organisation and where it’s located and operates
  • Activity classification: mapping the organisation to the most likely activity with respect to the ISIC standard
  • SDG classification: identifying the most likely Sustainable Development Goals to which the organisation might contribute
  • Core activities: using GPT-4 to extract a list of core activities and generate a readable summary from the list, with a short one-liner description. These activities are a useful distillate: they can be checked quickly and are already rich enough to enable interesting downstream analyses, like Theory of Change assessment

All these tasks could have been solved with standard NLP classification algorithms, but building and training a separate model for each would have been quite tedious. See our tech blog post for a detailed explanation of prompt design for classification tasks.
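To give a flavour of what prompt-based classification can look like, here is a minimal sketch for the SDG task. It only builds the prompt and validates the model's answer. The call to the LLM itself is left out, and the prompt wording, function names and JSON answer format are assumptions made for illustration, not the prompts used in Impactulator.

```python
import json

# The 17 UN Sustainable Development Goals, keyed by their official number.
SDGS = {
    1: "No Poverty",
    2: "Zero Hunger",
    3: "Good Health and Well-being",
    4: "Quality Education",
    5: "Gender Equality",
    6: "Clean Water and Sanitation",
    7: "Affordable and Clean Energy",
    8: "Decent Work and Economic Growth",
    9: "Industry, Innovation and Infrastructure",
    10: "Reduced Inequalities",
    11: "Sustainable Cities and Communities",
    12: "Responsible Consumption and Production",
    13: "Climate Action",
    14: "Life Below Water",
    15: "Life on Land",
    16: "Peace, Justice and Strong Institutions",
    17: "Partnerships for the Goals",
}


def build_sdg_prompt(website_text: str, max_goals: int = 3) -> str:
    """Build a classification prompt that lists every allowed label."""
    options = "\n".join(f"{number}. {name}" for number, name in SDGS.items())
    return (
        "You are classifying an organisation based on its website text.\n"
        f"Pick at most {max_goals} Sustainable Development Goals that the "
        "organisation most likely contributes to.\n"
        "Answer with a JSON list of goal numbers only, for example [7, 13].\n\n"
        f"Goals:\n{options}\n\n"
        f"Website text:\n{website_text}\n"
    )


def parse_sdg_response(raw: str, max_goals: int = 3) -> list:
    """Keep only valid, unique goal numbers; return [] for unusable answers."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    goals = []
    for item in data:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(item, int) and not isinstance(item, bool)
        if is_number and item in SDGS and item not in goals:
            goals.append(item)
    return goals[:max_goals]
```

Listing the allowed labels in the prompt and validating the answer against the same table keeps the model's output within a closed label set, which is what makes it usable as a classifier.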

5. Advanced structured output generation: Theory of Change analysis

With GPT-4's insights and the features extracted from the website, we proceeded with a comprehensive Theory of Change analysis, a framework to understand and evaluate the start-up’s activities and their anticipated ability to create sustainable impacts. Doing this automatically requires consistent generation of graph data structures, which works more or less out of the box with LLMs.

For context, a Theory of Change analysis includes:

  • Situation analysis: understanding the current state of affairs and challenges faced by the start-up
  • Vision and goals: identifying the long-term vision and specific goals set by the organisation
  • Assumptions: recognising the underlying assumptions upon which the start-up’s strategies are built
  • Stakeholders: identifying key stakeholders and understanding their role in the start-up’s success
  • Inputs, activities, outputs: analysing the resources invested, core activities and the immediate outputs generated
  • Outcomes and indicators: assessing the potential outcomes of outputs, measured by indicators that signify the start-up’s progress towards achieving its goals
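The graph structure behind such an analysis can be sketched as follows. This is a minimal example that assumes the LLM has been asked to answer in a JSON format with `nodes` and `edges`. That format, the node kinds and all names in the code are assumptions made for illustration, and the sample data is made up.

```python
import json
from dataclasses import dataclass, field

# Node kinds taken from the causal chain in a Theory of Change.
NODE_KINDS = {"input", "activity", "output", "outcome", "indicator"}


@dataclass(frozen=True)
class Node:
    id: str
    kind: str
    label: str


@dataclass
class TheoryOfChangeGraph:
    nodes: dict = field(default_factory=dict)  # node id -> Node
    edges: list = field(default_factory=list)  # (source id, target id) pairs

    def successors(self, node_id: str) -> list:
        """Return the ids of all nodes that `node_id` points to."""
        return [target for source, target in self.edges if source == node_id]


def parse_graph(raw: str) -> TheoryOfChangeGraph:
    """Parse and validate a JSON answer; raise ValueError if inconsistent."""
    data = json.loads(raw)
    graph = TheoryOfChangeGraph()
    for item in data["nodes"]:
        node = Node(id=item["id"], kind=item["kind"], label=item["label"])
        if node.kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {node.kind!r}")
        if node.id in graph.nodes:
            raise ValueError(f"duplicate node id: {node.id!r}")
        graph.nodes[node.id] = node
    for edge in data["edges"]:
        source, target = edge["source"], edge["target"]
        # Every edge must connect two nodes that were actually declared.
        if source not in graph.nodes or target not in graph.nodes:
            raise ValueError(f"edge refers to unknown node: {source} -> {target}")
        graph.edges.append((source, target))
    return graph


# Made-up model answer, for illustration only.
sample_answer = json.dumps({
    "nodes": [
        {"id": "a1", "kind": "activity", "label": "Run training workshops"},
        {"id": "o1", "kind": "output", "label": "Workshops delivered"},
        {"id": "c1", "kind": "outcome", "label": "Participants gain new skills"},
    ],
    "edges": [
        {"source": "a1", "target": "o1"},
        {"source": "o1", "target": "c1"},
    ],
})
graph = parse_graph(sample_answer)
```

Validating the answer before using it matters because a model can occasionally return an edge that points to a node it never declared. Rejecting such answers, and asking again, is one simple way to keep the generated graphs consistent.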

This ongoing software development project has demonstrated the power of adopting advanced NLP techniques, specifically the transition from custom models to LLMs like GPT-4, in the face of limited data. By scraping data from the start-up’s website and leveraging GPT-4’s language understanding and generation capabilities, the team was able to outline the start-up’s approach to sustainable impact. Applying the Theory of Change framework also provided valuable insights into the start-up’s roadmap for creating a lasting positive influence on its users and community.

Moving towards actual impact estimation requires a much more thorough data collection process, with access to closed and proprietary data sources, which are difficult to come by. A possible next step would be to generate formulas that express the impact indicators in terms of business indicators.

Impactulator is a research project developed in collaboration with Katapult and The Factory.
