
Wake Word Integration into VR with Azure Speech API and OutSystems

Owen Corstens
Tech Lead OutSystems
Owen Corstens is a self-taught OutSystems Tech Lead with a passion for smart, scalable solutions. What started as a hobby grew into a full-time vocation where he automates processes and manages to tackle complex challenges elegantly. He combines in-depth technical knowledge with a strong product mindset and smooth communication. In addition to his work as a developer and team lead, he shares his knowledge as a lecturer and speaker.

At Harmony Group, we believe cutting-edge tech like AI and VR should be accessible, not intimidating. That's why our tech lead Owen Corstens set out to prove just how easy it is to integrate these technologies using OutSystems' low-code platform.

In this blog series, Owen shares his experiments step-by-step.

Whether you're a developer, business leader, or tech enthusiast, you'll see how low-code platforms like OutSystems can accelerate innovation. Let's dive in!

---

When I first imagined this project, I had one clear goal: I wanted to create a Virtual Reality scene in OutSystems where I could build the entire world around me just by talking to it. And the best part? I didn't want anything to be predefined—not in the environment, not in the OutSystems database. AI should be doing all the heavy lifting for me. I mean, why would I bother doing it myself? 😉

One important thing to note: we're building a web application. Since we're using OutSystems, and we want this to work in VR, that just seemed like the most straightforward path.

Cool goal, but where do you even start?

I figured the first step was making sure my voice could actually be heard by the application.

So I started simple. The goal was just to talk to the app. There are loads of ways to make that happen, but when I began experimenting with speech listeners—about a year ago—I had almost no experience with any of it. I decided to go with something familiar: the built-in WebKitSpeechRecognition API. I'd used it before while prepping for a hackathon, so it felt like a safe starting point.

This web speech API is part of the browser, but here's something I didn't realize at the time (and only found out during implementation): it's still considered an experimental feature. That's going to matter soon.

After setting up the necessary listeners and methods, things actually seemed to work pretty smoothly. All I had to do was trigger it—and voilà, it started listening.

Snippet result of speech listener from the console logs
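For reference, here's a minimal sketch of what that first listener setup roughly looked like (the exact OutSystems client action differs, and webkitSpeechRecognition is only available in some browsers):

```javascript
// Minimal sketch of the first speech listener, e.g. inside an OutSystems client action.
// Note: this API is experimental and not available in every browser.
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (SpeechRecognition) {
    var recognition = new SpeechRecognition();
    recognition.lang = "en-US";         // recognition language
    recognition.interimResults = false; // only report final results

    recognition.onresult = function (event) {
        var transcript = event.results[event.results.length - 1][0].transcript;
        console.log("Heard:", transcript); // the kind of output shown in the console logs
    };

    recognition.onerror = function (event) {
        console.warn("Speech recognition error:", event.error);
    };

    // At this stage it's still triggered manually, e.g. from a button's on-click.
    recognition.start();
} else {
    console.warn("Speech recognition is not supported in this browser.");
}
```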

Cool, I was definitely on the right track—getting my voice to show up on the screen felt like a win 😎😎 Sadly, the Web Speech API didn't turn out to be the best option. It really struggled with my beautiful accent. I didn't let it bother me too much though, since I was mainly just aiming for a happy flow.

Still, I got a lot of weird and wrong results. Like this one time, I said “make the box with ID object one smaller,” and somehow it thought I said something about warmth. That was already my third attempt.

So yeah, not ideal.

Anyway, we kept going. The next step in our speech-enabled journey was figuring out how to trigger the speech listener automatically—no more needing to press a button or do something manually every single time. That meant I needed something called a Wake Word Function. Not gonna lie, that's a pretty cool term.

Some quick research showed me there are third-party tools like Picovoice that can handle this kind of thing for you. But of course, I wanted to build it myself. Easier said than done, though.

I started getting creative with WebKit Speech Recognition again. By using continuous listeners, I managed to keep the mic connection open. Then I added logic to only react to a specific command, so it wouldn't respond to every single word I said, just the command I chose.

After quite a few rounds of trial and error, it finally worked: I had an app that responded when I said the right command. As you can see in the title (and here), my trigger phrase was "Hey Daisy." Originally, I wanted it to be "Hey AIRY," since that's the name of the project, but the speech API just couldn't make sense of "AIRY," no matter how I pronounced it. So I had to swap to Daisy 😅
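In sketch form, the wake-word logic came down to continuous recognition plus a simple check on the transcript (the phrase and the placeholder handler below are just illustrative):

```javascript
// Rough sketch of the wake-word check on top of continuous recognition.
var recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.continuous = true;       // keep the microphone connection open
recognition.interimResults = false;

recognition.onresult = function (event) {
    var transcript = event.results[event.results.length - 1][0].transcript
        .trim()
        .toLowerCase();

    // Only react to the trigger phrase, ignore everything else I say.
    if (transcript.indexOf("hey daisy") !== -1) {
        console.log("Wake word detected, handle the actual command here...");
        // e.g. call an OutSystems client action with the rest of the transcript
    }
};

// The browser ends continuous sessions after a while, so restart when that happens.
recognition.onend = function () {
    recognition.start();
};

recognition.start();
```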

To make it a fully functioning Wake Word, I also needed it to listen even when I wasn't actively using the app. And that's where I hit a wall. After a bit of digging, I found out that both iOS and Android were blocking me from using background services with a continuously open audio connection. There are some shady workarounds, but I didn't want to go down that road just yet. It wasn't a deal breaker, so I left that part for what it was.

ALRIGHT! You'd think everything was set now, right? Yeah... me too. Ha ha.

Later in the project, I made a test application for a VR environment. That's when I hit a new problem: the environment didn't respond to any of my commands. It was like it couldn't even hear me. Which was weird, because everything worked fine in the regular browser during testing.

After a ton of testing and debugging, I finally figured it out: the environment really wasn't hearing me. And get this: the browser speech APIs weren't supported at all in the Meta browser on the Quest 2. I tried some unofficial browsers too, but none of them worked either. So, I had to start thinking about alternatives. Remember how I said this API is experimental? Yeah, it really shows here. Bad luck, I guess.

So I pivoted. For some reason, I started thinking: what if I pass the commands from another device? Since the speech stuff worked on both my phone and desktop, that didn't seem like a bad idea.

New challenge unlocked: how do I get the commands from that device into the headset?

Passing them to the server was easy with OutSystems. But getting them up to the client in the headset? That was the tricky part.

First, I tried polling. But that created way too much load on the client just to catch a few commands. Bad idea. Plus, I would've had to expose APIs, something I already got roasted for by an OutSystems tech expert in another project. Then I looked into timers, but that idea got scrapped quickly. It didn't really solve anything.

After some ChatGPT-ing (we're making it a verb, like "googling," okay?), I stumbled onto WebSockets. That got me excited. This could work!

So I dove in and started learning how to set up a WebSocket from scratch. And boom: I landed in the Amazon universe 😖 So much new stuff I had absolutely zero experience with. But hey, I love a challenge, right? We'll talk about how to set up a WebSocket in OutSystems in the next article; let's keep our focus on the speech stuff for now.

And yes, we got it working with a WebSocket! Now the environment could actually hear me. Perfect! But of course, fix one problem and two new ones show up.
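The OutSystems/AWS plumbing is for the next article, but conceptually the headset side is just a browser client listening on a socket. A minimal sketch (the URL and message format are placeholders, not the real setup):

```javascript
// Receiving side in the headset's browser: wait for commands sent from the other device.
var socket = new WebSocket("wss://example.com/airy-commands"); // placeholder endpoint

socket.onopen = function () {
    console.log("Connected, waiting for voice commands from the phone/desktop...");
};

socket.onmessage = function (event) {
    var command = event.data; // e.g. "make the box with ID object one smaller"
    console.log("Received command:", command);
    // hand the command over to the VR scene / AI logic here
};

socket.onclose = function () {
    console.warn("Command channel closed, consider reconnecting.");
};
```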

Problem #1 was one I already knew about but kinda ignored. The speech API didn't recognize my commands properly. I had to repeat the same sentence like ten times, no exaggeration. I got so fed up. And worst of all, I never knew what exactly failed. Was it the speech API? The WebSocket? The AI? Or just a misheard command?

So... I threw it all away:) Sometimes starting fresh is the way to go.

Yes, we scrapped the entire command-passing functionality (I almost cried 🥺). I decided to switch to the Azure Speech APIs. I'd used them during the hackathon and avoided them at first because they're not free, but honestly, it was worth it after all the headaches. I could have used the Amazon APIs too, especially since they offer free credits. But I stuck with Azure.

Just a side note: OutSystems handles library imports in a specific way, but I'll cover that later. For now, just remember to import the Azure Speech SDK webpack bundle into your OS project.
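Once the SDK bundle is loaded (it exposes a global SpeechSDK object), starting continuous recognition looks roughly like this; the key and region are placeholders you'd store safely in your own setup:

```javascript
// Sketch of continuous recognition with the Azure Speech SDK browser bundle.
var speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<your-key>", "<your-region>");
speechConfig.speechRecognitionLanguage = "en-US";

var audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();
var recognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);

// Fires for every final recognition result.
recognizer.recognized = function (sender, event) {
    if (event.result.reason === SpeechSDK.ResultReason.RecognizedSpeech) {
        console.log("Recognized:", event.result.text);
    }
};

recognizer.startContinuousRecognitionAsync();
```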

Alright, this article's getting long, but hang in there—we're almost there! And yes, you'll get to see the component 😉

After introducing the Azure Speech API, everything almost worked perfectly. Yup, I still stand by what I said: every time you fix a problem, another one pops up. In this case, I got two more.

First: the API is backed by a very smart AI model that tries to improve your sentence. Sounds nice, but it messed up my trigger phrase. It started adding commas between the words in my command, which made it unrecognizable. I fixed it in a super dirty way (again, just trying to keep my happy flow alive). I changed the command to just "Daisy" instead of "Hey Daisy." I still say "Hey Daisy," but the continuous listeners sometimes miss the first characters, and that just made the original version fail too often.
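In sketch form, the dirty fix is just normalizing the recognized text before checking for the wake word, so the extra commas and missed characters don't matter (the helper name is illustrative):

```javascript
// Strip the punctuation Azure adds and check for the single word "daisy"
// instead of the full "Hey Daisy" phrase.
function containsWakeWord(recognizedText) {
    var normalized = recognizedText
        .toLowerCase()
        .replace(/[.,!?]/g, " ")  // drop the commas/periods the model inserts
        .replace(/\s+/g, " ")
        .trim();

    return normalized.indexOf("daisy") !== -1;
}

console.log(containsWakeWord("Hey, Daisy, make the box smaller.")); // true, commas and all
```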

Second problem: autoplay policies in browsers. If you've ever tried to open a camera or microphone immediately after a redirect, you'll know what I mean—it doesn't work. Browsers are getting stricter for good reasons, but it does make things harder. The WebKit Speech Recognition used to work fine and kicked in as soon as the screen loaded. But the Azure Speech SDK? Not so much. It wouldn't start after being initialized, and it didn't even throw a proper warning.

To fix it, I needed a specific user gesture to trigger the whole thing. You can work around this in a SPA setup or with some trickery, but I didn't need to go that far. Instead, I just added a start screen with one nice, big button to launch the app.
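In code, that fix is simply tying the start of recognition to the button click (reusing the recognizer from the Azure sketch above; the element id is just an example):

```javascript
// Only start the microphone after a real user gesture, to satisfy the autoplay policy.
document.getElementById("StartButton").addEventListener("click", function () {
    recognizer.startContinuousRecognitionAsync(
        function () { console.log("Listening for 'Daisy'..."); },
        function (error) { console.error("Could not start recognition:", error); }
    );
});
```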

There we go—my final fix!

It's been an amazing ride to get the speech system working the way I wanted. After solving so many problems, I finally ended up with a working Wake Word Function that could turn my voice commands into real actions in the application.

The WakewordFunction Forge components are ready to download! You'll get both a library component and a demo application (ODC only for now). Try it out, improve it, and come back with feedback:) I'm 100% ready to keep making better WakewordFunctions! 💖

I had a few meltdowns along the way, but I'm really glad I stuck with it—this speech setup is a core part of my project.
