Site search has been a constant feature on every AEM project I have worked on. The first site search I implemented used the Query Builder API. It was a small site, and the queries were not complex. But at the end of the day, the search and indexing features that come with AEM are hard to optimize and do not scale.
For site search you should be using an external search engine. That was the case a few years ago when I worked at Code & Theory, where I created a custom integration between SOLR and AEM, which I blogged about. For this type of integration, you either create an EventHandler that runs on ReplicationAction.EVENT_TOPIC, or a custom TransportHandler that does the job of replicating the content. Either way, you create an entry in the search index when a page gets published and remove it when the page gets unpublished.
This approach is fine for AEM 6.x, and even AEMaaCS. But as I wrote in my previous post, AEM has been moving away from this type of customization. If you are on AEMaaCS, you should not extend the runtime with custom Java code or deploy third-party OSGi bundles. Instead, you should rely on the cloud-native eventing system, which allows external systems to subscribe to and process AEM Events.
Source Code
Usually, I do not share a source code repository containing the entire solution. This is because I use the AEM archetype to generate AEM projects, which contain a lot of boilerplate source code, and I only ever change a small set of files. On top of that, the archetype is always changing, which makes such repos likely to become outdated.
Adobe Developer App Builder projects are much simpler, and the ratio of my custom code to boilerplate is much greater, so it makes more sense to put the entire solution up on GitHub. You can find the project here: juan-ayala/az-aem-search-integration.
Choosing the Events & Consumption Model
The Adobe Experience Cloud products generate a lot of events. To name a few, AEM emits Sites events such as page published and unpublished, and Assets events when assets are created, updated, or deleted.
To achieve real-time indexing, we will build an Adobe Developer App Builder app to capture the AEM events. It will have actions (serverless functions) registered to process them: aem.sites.page.published will be routed to the action that adds an entry to the index, and aem.sites.page.unpublished to the action that deletes the entry.
There are two other ways to consume events. You might consider the journaling API, but it is better suited to a pull model of consumption, not real time. Or you could use a webhook deployed on any platform, e.g. Azure Functions. Although that is real time, the security considerations are a little more complicated because the webhook lives outside Adobe's cloud.
Create a New App Builder Project
First we need to create a templated project on the Adobe Developer Console. The App Builder template provides the tools you need to build an SPA and microservices, as well as to orchestrate APIs in Adobe Experience Cloud. Go to developer.adobe.com and click the Console button at the top right. Once there, create a project from a template and select the App Builder template. The new project will have two workspaces, Stage and Production, each containing a runtime. I captured this process in the video below.
With the project created on the developer console, we can now generate the source code. Install Node via NVM, or whichever way makes more sense in your case. Then install the @adobe/aio-cli package globally and create a new empty directory.
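On my machine that boiled down to something like this (the directory name is just a placeholder):

```sh
# install the Adobe I/O CLI globally, then create an empty working directory
npm install -g @adobe/aio-cli
mkdir aem-search-integration && cd aem-search-integration
```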
What follows is the list of CLI commands I used, their effects, and some caveats. This is also captured in the video below.
aio login : This uses OAuth to log you into Adobe IMS. An access token gets returned and stored on the local machine. Check ~/.config/aio. Or use the aio config plugin to view the stored configuration.
aio app init --standalone-app --yes : This will create a new App Builder app. You will not get prompted to select a template (--standalone-app). And default values will get used when possible (--yes). You will have to choose an organization and project. The default workspace will be Stage. I will talk about why I used --standalone-app later.
aio app delete action : When prompted, select the two sample actions that were generated. You could remove them by deleting the relevant files yourself, but this command deletes everything associated with them in one shot.
aio app add service : We need to create I/O Events registrations. To do this we will need to add the I/O Management API to the project, along with a server-to-server credential. Instead of using the CLI, you could do this on the developer console. When prompted to overwrite the .env and .aio files, you should. There should be no custom changes in these yet.
aio app add event : When prompted, select the @adobe/generator-add-events-generic template. This will generate the action and event registration. You will get prompted for the action name, display name and description. Run this command twice: once for the action that will upload a document to the index, and once for the one that will delete it.
⚠️ I used the --standalone-app flag to generate the content. This is because, as I tried the templates, I found the documentation and demos did not match what was going on for me. This caused me to spend time digging through the template source code to figure out its intentions. Unfortunately, this is a common problem with templated source code generation: the templates can't keep up with changes and become outdated. You should try some of these templates yourself, but keep your expectations low.
We now have a basic project with two event registrations and actions. But before we can start coding, we will need a search engine.
Setting up Azure AI Search
I have chosen Azure AI Search because it has a free tier available to me, and I can set it up with a few Azure CLI commands. But you can swap in your own search engine; this is composable architecture after all. Assuming you have logged in via the Azure CLI, first create the resource group and search service.
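Something along these lines; the resource group name, service name, and location are placeholders, and the free SKU is what I used:

```sh
# create a resource group and a free-tier search service (names are placeholders)
az group create --name aem-search-rg --location eastus
az search service create \
  --name aem-search-svc \
  --resource-group aem-search-rg \
  --sku Free \
  --auth-options aadOrApiKey   # allow Azure AD (RBAC) auth on the data plane
```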
Next, create the search index. There is no Azure CLI command for this, so we need to get an access token and call the HTTP API with curl. For this demo, I am creating a simple schema with only a handful of fields.
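Roughly like this, assuming role-based access is enabled on the service and reusing the service name from above; the index and field names are illustrative:

```sh
# get an Azure AD token for the search data plane
TOKEN=$(az account get-access-token --scope https://search.azure.com/.default \
  --query accessToken --output tsv)

# create a minimal index with a handful of fields
curl -X PUT "https://aem-search-svc.search.windows.net/indexes/aem-pages?api-version=2023-11-01" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "aem-pages",
    "fields": [
      { "name": "id",          "type": "Edm.String", "key": true },
      { "name": "url",         "type": "Edm.String" },
      { "name": "title",       "type": "Edm.String", "searchable": true },
      { "name": "description", "type": "Edm.String", "searchable": true },
      { "name": "body",        "type": "Edm.String", "searchable": true }
    ]
  }'
```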
Next, we create a service principal so our client application can authenticate, and give it the ability to read and write the index by assigning it the Search Index Data Contributor role.
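One way to do that with the CLI, reusing the resource names from above:

```sh
# create an app registration plus service principal for the client application
az ad sp create-for-rbac --name aem-search-client

# grant it read/write access to index data, scoped to the search service
SEARCH_SCOPE=$(az search service show --name aem-search-svc \
  --resource-group aem-search-rg --query id --output tsv)
az role assignment create \
  --assignee <appId from the previous command> \
  --role "Search Index Data Contributor" \
  --scope "$SEARCH_SCOPE"
```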
We now have an index and a service principal with role-based access to it. Next, we need the credentials our app will use to authenticate. You will need the app ID and tenant ID to run the az ad app credential reset command.
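For example:

```sh
# generates a new client secret for the app registration
az ad app credential reset --id <appId>
```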
This will give you the credentials in JSON format.
Put these in the .env file, along with the search service endpoint and index name.
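The variable names below are just the ones I use in the examples that follow; match them to whatever your actions expect:

```
# Azure service principal credentials
AZURE_TENANT_ID=<tenant>
AZURE_CLIENT_ID=<appId>
AZURE_CLIENT_SECRET=<password>

# Azure AI Search service and index
AZURE_SEARCH_ENDPOINT=https://aem-search-svc.search.windows.net
AZURE_SEARCH_INDEX=aem-pages

# where the scrapers will fetch published pages from
AEM_PUBLISH_URL=https://wknd.site
```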
Then pass the environment variables to the actions as inputs. This mapping is done in the app.config.yaml file.
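An excerpt from my app.config.yaml; the package name and action name are the ones I used, so adjust them to your own:

```yaml
application:
  runtimeManifest:
    packages:
      aem-search-integration:
        actions:
          aem-page-index-upload-doc:
            function: actions/aem-page-index-upload-doc/index.js
            runtime: nodejs:18
            inputs:
              LOG_LEVEL: debug
              AZURE_TENANT_ID: $AZURE_TENANT_ID
              AZURE_CLIENT_ID: $AZURE_CLIENT_ID
              AZURE_CLIENT_SECRET: $AZURE_CLIENT_SECRET
              AZURE_SEARCH_ENDPOINT: $AZURE_SEARCH_ENDPOINT
              AZURE_SEARCH_INDEX: $AZURE_SEARCH_INDEX
              AEM_PUBLISH_URL: $AEM_PUBLISH_URL
            annotations:
              final: true
```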
Three Ways to Scrape & Index the Content
There are three ways to scrape content, depending on your needs and situation. But before getting started, you'll need to add the SDK specific to your search engine. In this case, since I am using Azure AI Search, I will install its SDK along with the SDK for identity.
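Both come from NPM:

```sh
npm install @azure/search-documents @azure/identity
```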
We can also put the aem-page-index-upload-doc/index.js action in place. It will use a helper class called HTMLScraper.js, which I will share later on.
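In sketch form it looks roughly like this; the data.path field on the event payload and the input parameter names follow the .env mapping from earlier:

```javascript
// actions/aem-page-index-upload-doc/index.js
const { Core } = require('@adobe/aio-sdk')
const { SearchClient } = require('@azure/search-documents')
const { ClientSecretCredential } = require('@azure/identity')
const HTMLScraper = require('./HTMLScraper')

async function main (params) {
  const logger = Core.Logger('aem-page-index-upload-doc', { level: params.LOG_LEVEL || 'info' })

  // the published page path is assumed to arrive on the event payload
  const pagePath = params.data.path

  // scrape the public page into a flat search document
  const scraper = new HTMLScraper(params.AEM_PUBLISH_URL)
  const document = await scraper.scrape(pagePath)

  // authenticate with the Azure service principal and upload the document
  const credential = new ClientSecretCredential(
    params.AZURE_TENANT_ID, params.AZURE_CLIENT_ID, params.AZURE_CLIENT_SECRET)
  const client = new SearchClient(params.AZURE_SEARCH_ENDPOINT, params.AZURE_SEARCH_INDEX, credential)
  const result = await client.mergeOrUploadDocuments([document])

  logger.info(`indexed ${pagePath}`)
  return { statusCode: 200, body: { results: result.results } }
}

exports.main = main
```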
Scraping the HTML
The first way to scrape content is straightforward: you only need to parse the HTML! This assumes, first, that the page is public, and second, that your data requirements are simple. Because let's face it, parsing HTML can get real hairy real fast, unless your pages already expose a lot of metadata. For this task, we'll use cheerio, a library for parsing HTML. Here is the HTMLScraper.js file.
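In sketch form, assuming the pages expose a meta description and render their content inside a main element:

```javascript
// actions/aem-page-index-upload-doc/HTMLScraper.js
const cheerio = require('cheerio')
const fetch = require('node-fetch')

class HTMLScraper {
  constructor (publishUrl) {
    this.publishUrl = publishUrl
  }

  async scrape (pagePath) {
    const url = `${this.publishUrl}${pagePath}.html`
    const response = await fetch(url)
    if (!response.ok) {
      throw new Error(`could not fetch ${url}: ${response.status}`)
    }
    const $ = cheerio.load(await response.text())

    return {
      // the index key must be URL safe, so the page path is base64url encoded
      id: Buffer.from(pagePath).toString('base64url'),
      url,
      title: $('title').text().trim(),
      description: $('meta[name="description"]').attr('content') || '',
      body: $('main').text().replace(/\s+/g, ' ').trim()
    }
  }
}

module.exports = HTMLScraper
```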
Scraping the Model JSON
The second way to scrape the content is by parsing the model JSON. As with the first example, the page must be public. It also depends on the Sling Model Exporter framework. If you are using the WCM core components you should be fine, but if you have custom components, you will need a custom model exporter for each one whose content you need. For this example, I am reading from the page data model, another benefit of using the WCM core components. Here is the ModelJSONScraper.js class, which you can swap in for HTMLScraper.js.
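A sketch of it, assuming the page model exposes a title and description and that text components export a text property:

```javascript
// actions/aem-page-index-upload-doc/ModelJSONScraper.js
const fetch = require('node-fetch')

class ModelJSONScraper {
  constructor (publishUrl) {
    this.publishUrl = publishUrl
  }

  async scrape (pagePath) {
    const url = `${this.publishUrl}${pagePath}.model.json`
    const response = await fetch(url)
    if (!response.ok) {
      throw new Error(`could not fetch ${url}: ${response.status}`)
    }
    const model = await response.json()

    // walk the exported component tree and collect any text properties
    const text = []
    const walk = (item) => {
      if (!item) return
      if (item.text) text.push(item.text)
      Object.values(item[':items'] || {}).forEach(walk)
    }
    walk(model)

    return {
      id: Buffer.from(pagePath).toString('base64url'),
      url: `${this.publishUrl}${pagePath}.html`,
      title: model.title || '',
      description: model.description || '',
      body: text.join(' ').replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim()
    }
  }
}

module.exports = ModelJSONScraper
```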
With access to structured data, it becomes much easier to parse, and to get at individual component data if needed.
Scraping the JCR JSON From the Author
And finally, the third way to scrape content is to get the Sling JSON rendering from the author. This approach works best if you are not using Sling model exporters, or you can't get what you need from the HTML. Because you are accessing the author, you will need service credentials, which you then use to get an access token via the @adobe/jwt-auth NPM module.
I created the following entries in my .env file, which I then mapped in the app.config.yaml file as input parameters.
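The names are my own; the values come from the service credentials JSON you download from the developer console:

```
AEM_AUTHOR_URL=https://author-pXXXXX-eXXXXX.adobeaemcloud.com
AEM_AUTH_CLIENT_ID=...
AEM_AUTH_CLIENT_SECRET=...
AEM_AUTH_TECH_ACCOUNT_ID=...
AEM_AUTH_ORG_ID=...
AEM_AUTH_META_SCOPES=ent_aem_cloud_api
AEM_AUTH_PRIVATE_KEY=...
```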
I also created this helper class to encapsulate @adobe/jwt-auth and put these parameters into the config object required by its authorize function.
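Something along these lines, with the parameter names mirroring the .env entries above and the module's exported function doing the JWT exchange:

```javascript
// actions/aem-page-index-upload-doc/AEMServiceCredentials.js
const auth = require('@adobe/jwt-auth')

class AEMServiceCredentials {
  constructor (params) {
    // map the action inputs into the config object expected by @adobe/jwt-auth
    this.config = {
      clientId: params.AEM_AUTH_CLIENT_ID,
      clientSecret: params.AEM_AUTH_CLIENT_SECRET,
      technicalAccountId: params.AEM_AUTH_TECH_ACCOUNT_ID,
      orgId: params.AEM_AUTH_ORG_ID,
      metaScopes: params.AEM_AUTH_META_SCOPES,
      // the private key arrives with escaped newlines when read from .env
      privateKey: params.AEM_AUTH_PRIVATE_KEY.replace(/\\n/g, '\n')
    }
  }

  async getAccessToken () {
    // exchanges the signed JWT for an IMS access token
    const { access_token: accessToken } = await auth(this.config)
    return accessToken
  }
}

module.exports = AEMServiceCredentials
```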
Then I use AEMServiceCredentials in the SlingJSONScraper. Here I get a token and add it to the Authorization header before making the request to the author.
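A sketch, using the infinity selector as one way to pull the page subtree and reading a few jcr:content properties:

```javascript
// actions/aem-page-index-upload-doc/SlingJSONScraper.js
const fetch = require('node-fetch')
const AEMServiceCredentials = require('./AEMServiceCredentials')

class SlingJSONScraper {
  constructor (params) {
    this.authorUrl = params.AEM_AUTHOR_URL
    this.credentials = new AEMServiceCredentials(params)
  }

  async scrape (pagePath) {
    // the infinity selector returns the full JCR subtree of the page
    const url = `${this.authorUrl}${pagePath}.infinity.json`
    const token = await this.credentials.getAccessToken()
    const response = await fetch(url, {
      headers: { Authorization: `Bearer ${token}` }
    })
    if (!response.ok) {
      throw new Error(`could not fetch ${url}: ${response.status}`)
    }
    const node = await response.json()
    const content = node['jcr:content'] || {}

    return {
      id: Buffer.from(pagePath).toString('base64url'),
      url: `${this.authorUrl}${pagePath}.html`,
      title: content['jcr:title'] || '',
      description: content['jcr:description'] || '',
      // a real implementation would walk the component nodes for their text
      body: ''
    }
  }
}

module.exports = SlingJSONScraper
```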
Delete the Page From the Index
Deleting the page from the index is straightforward. There is no parsing involved, and in most, if not all, search engines you only need to provide the ID of the document you want to remove. Here is the action to remove the page from the index.
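In sketch form, deriving the document key from the page path the same way the upload action does:

```javascript
// actions/aem-page-index-delete-doc/index.js
const { Core } = require('@adobe/aio-sdk')
const { SearchClient } = require('@azure/search-documents')
const { ClientSecretCredential } = require('@azure/identity')

async function main (params) {
  const logger = Core.Logger('aem-page-index-delete-doc', { level: params.LOG_LEVEL || 'info' })

  // the unpublished page path is assumed to arrive on the event payload
  const pagePath = params.data.path
  const id = Buffer.from(pagePath).toString('base64url')

  const credential = new ClientSecretCredential(
    params.AZURE_TENANT_ID, params.AZURE_CLIENT_ID, params.AZURE_CLIENT_SECRET)
  const client = new SearchClient(params.AZURE_SEARCH_ENDPOINT, params.AZURE_SEARCH_INDEX, credential)
  const result = await client.deleteDocuments('id', [id])

  logger.info(`removed ${pagePath} from the index`)
  return { statusCode: 200, body: { results: result.results } }
}

exports.main = main
```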
Testing Locally
Up to this point, I have not deployed the actions to the workspace runtime. I have been writing Jest unit tests and running the actions locally. To run the local dev server, make sure you have all the needed environment variables in your local .env file and run the aio app dev command. This starts a server on port 9080. The first time you run it, it will prompt you to accept the SSL certificate in the browser; once you do that, stop and run aio app dev again. Now use the following curl command to invoke the index action. If you don't have an AEM publish instance, use wknd.site.
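The exact URL is printed when aio app dev starts; mine looked something like this, with my package and action names:

```sh
# the local dev certificate is self-signed, hence --insecure
curl --insecure -X POST "https://localhost:9080/api/v1/web/aem-search-integration/aem-page-index-upload-doc" \
  -H "Content-Type: application/json" \
  -d '{ "data": { "path": "/content/wknd/us/en" } }'
```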
Sample JSON payloads are in the API documentation.
Deleting the WEB-SRC
A sample SPA based on React Spectrum was generated with the project. There may be cases where your app requires a front end, but this is an event-driven, headless application, so you can run the command below to delete that folder, or do it manually.
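```sh
# removes the generated web-src folder and its configuration
aio app delete web-assets
```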
I can think of one reason to build a UI: for example, if you wanted a tool to manage the search index. The SPA is also useful during development, since it has a page that lets you execute the actions, saving you from having to run curl commands.
Deploying
Deploying is as simple as running the deploy command.
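```sh
aio app deploy
```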
By default, the CLI will deploy to the Stage environment. If you wish to change it, use the CLI command
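```sh
# prompts you to pick the org, project and workspace (e.g. Production)
aio app use
```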
You will be prompted to merge or override the .env file. Merge it, as it should contain your custom variables by now. You can overwrite the .aio file. Once deployed, you can use curl to test the action just as you did locally. But because these are non-web (private) actions, you will need an auth header. First, get the URL with the CLI command
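```sh
# replace with your own package and action name
aio runtime action get aem-search-integration/aem-page-index-upload-doc --url
```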
And then get the AIO_runtime_auth value which you will find in your .env file.
⚠️ The CLI has been managing and using these variables to communicate with the Adobe APIs all along! That is why you should not check in or share this file.
With the URL and auth value (which you must encode in base64) you can now call the action.
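```sh
# AUTH holds the AIO_runtime_auth value from the .env file
curl -X POST "<action URL from the previous step>" \
  -H "Authorization: Basic $(echo -n "$AUTH" | base64)" \
  -H "Content-Type: application/json" \
  -d '{ "data": { "path": "/content/wknd/us/en" } }'
```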
This will return an activation ID, which you then use to get the result.
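```sh
aio runtime activation result <activation id>
```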
And finally, if you navigate to the project on the developer console, you will see the event registrations. Click one to get to the debug tracing window. You should see a challenge probe, which I/O Events sends to verify the action is valid. And if you trigger the event, i.e. publish a page, you will see it show up.

Conclusion
What I’ve shown here is a basic way to keep your AEM page index updated in real-time. There are many ways to set up an indexing system, depending on your needs and the search engine you're using. For example, most websites generate XML sitemaps, which search engines like Coveo use to find and index pages.
As requirements become more complex, event-driven systems offer more flexibility. For example, if you need to index images, you can set up AEM to track when assets are published and automatically send those images to Azure Blob Storage. From there, an indexer can use OCR technology to read and index the text in the images.
By connecting AEM Events to search engines like Azure AI Search, you can keep content updated in real-time, ensuring search results remain accurate and reliable. This also creates a scalable process for managing large volumes of content.