The coming depth camera explosion & some predictions


#1

For the past couple years, I’ve been watching the weird world of consumer depth cameras. Depth cameras are special cameras that record color information and also know how far each pixel is from the camera. They’re pretty common in the world – probably the most famous and widely distributed depth camera is the microsoft kinect (used kinects now on sale for $20 at a GameStop near you!).

People use depth cameras for all kinds of stuff. The Kinect’s killer app was skeleton tracking to allow people to use their whole body to interact with game worlds. We used the Structure sensor, which is basically the kinect’s guts mounted to an iPad, to develop our Holoflix app that records 2.5D video and can play it back in our volumetric/light field displays. People use them to get measurements of real-world scenes, to know where a camera is in space, and to scan and build 3D models of real-world objects and people. They’re really versatile, interesting tools, especially if you’re playing in the 3D space.

The problem is that depth cameras are expensive, kinda big, power-hungry (bad for mobile devices) and most of them have trouble in certain lighting conditions (direct sunlight kills kinect-style depth cams, and most depth cams have trouble with glassy or reflective surfaces). As someone who works with 3D video, scanning and display, I long for a world where the devices in our pockets are capable of this kind of holographic sorcery. How far away is that future?

One notable push to bring that future to us is Google’s project Tango, which is a mix of a depth camera and motion-tracking hardware built into an Android device. There are only two Tango phones in the wild right now, the Lenovo Phab 2 and the Asus ZenFone AR (is there a requirement for android depth cameras that you have to have a silly, misspelled name for your device? Is that really what millennials want?). My first experiments with the Tango dev kit 2 years ago were just ugh – it was soooo unstable, the google API and built-in demo apps crashed all the frickin time, and it burned through battery like a hot knife through butter. Tango’s depth camera also has a meh 320x180px resolution. Tango was released three years ago, and if, in that time, even a mobile/software giant like Google couldn’t push more than two devices to include the hardware, it doesn’t seem like the world is chomping at the bit to put Tango cameras into every phone on the market.

A simpler, more achievable approach is to just put two cameras into a phone. We humans do all of our depth sensing with just two eyes, so why can’t our phones do the same? Phone cameras only cost a couple bucks (depending on how fancy they are), so it’s not a big deal to add one more camera into the phone.
That logic appealed to basically every mobile phone maker in the world. “Hey! If I can cram another camera in there and have the intern port over some openCV code to make ‘depth’ work, I bet I could sell an extra million units! Hoo boy! Welp, glad we got that cleared up, looks like it’s my tee time!”

HTC led the charge into stereo phones, releasing the HTC One M8 in March 2014, a few months before Tango officially came out. An epic phonemaking party followed, with basically every big manufacturer making a model and about 35 different models coming out over the next three years (here’s my list of dual-camera phones, prices & links, if you’re interested). Over the past three years, the world has produced about 20 million stereo-camera android phones and another 20 million iphone 7 pluses. Stereo cameras: one, Tango: zero.

The problem is, the software didn’t really work right. Getting good depth information from stereo cameras is really, really hard, and needs really well-calibrated hardware and good software. Most of the phonemakers (except apple) didn’t really bother to calibrate the hardware, and didn’t have the software team to put real resources into developing decent depth algorithms, so they just grabbed the openCV example code, wrapped it up in a couple apps, locked down access to the underlying depth information and even the raw stereo image and called it a day. Super annoying, right? HTC made some noises about opening up a depth SDK, but it’s extremely broken and unmaintained. Everyone else just locked down all access to the hardware or basic depth information.
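To make that concrete, here’s roughly what the off-the-shelf “openCV example code” route looks like. This is a minimal sketch, not anyone’s actual shipping pipeline – it assumes you already have a rectified left/right pair saved as left.png and right.png (placeholder filenames), and the focal length and baseline numbers at the bottom are made up for illustration. The hard parts the phonemakers skipped – per-unit calibration and rectification – all have to happen before this snippet.

```python
# Minimal stereo-depth sketch using OpenCV's semi-global block matcher.
# Assumes left.png / right.png are an already-rectified stereo pair.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Stock "example code" quality settings; real products tune these per device.
stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,        # search range in pixels, must be divisible by 16
    blockSize=7,
    P1=8 * 7 * 7,             # smoothness penalties, per the OpenCV docs
    P2=32 * 7 * 7,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# compute() returns fixed-point disparity (scaled by 16); convert to pixels.
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# With real calibration you can turn disparity into metric depth:
# depth = focal_length_px * baseline_m / disparity
focal_length_px, baseline_m = 1000.0, 0.012   # hypothetical values
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = focal_length_px * baseline_m / disparity[valid]

# Save a viewable disparity image for eyeballing the result.
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite("disparity.png", vis)
```

Run something like this on an uncalibrated, unrectified pair straight off a phone and you get exactly the kind of noisy, hole-filled disparity maps people complain about – which is why the calibration and software work matter so much.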

It sounds like the hardware manufacturers are bad guys, but they’re not. This is an exceptionally hard problem, and none of the android phonemakers really have software teams with the scale and skills to solve it. Google does, but they’re sitting on their hands or betting on Tango right now. Apple does, too, and they’re the only ones who actually made a real solution.

And that’s why I’m excited to see the new iOS 11 release coming out in a week – for the first time, it’ll put an API around the depth information in a mass-market stereo camera phone (the iphone 7 plus). This will do a couple interesting, important things:
First, it proves that getting good depth info from mass-produced stereo cams is possible. Empirically, most phone manufacturers don’t seem to believe that it is right now.
Second, it will enable a bunch of interesting AR/VR app development in iOS 11/ARKit, and I think a bunch of really compelling apps will come out of that.
Third, all that Fear Of Missing Out will force the hand of the only other software player in the world that could conceivably solve the incredibly hard stereo depth problem: google. I don’t really know how Google quantifies the difficulty of problems like this, but I bet that adding a stereo depth API to android is a ~$100 million feature – not money they would spend unless they felt like they were at risk of losing real VR/AR market share. And Google desperately wants to keep that market share.

My guess is that it’ll go like this:

September 2017: Apple releases iOS 11. Also, the iphone 8 plus and iphone X, apple’s second and third stereo camera phones, come out.
October 2017: devs make a bunch of cool videos showing interesting AR/3D scanning/whatever apps they make using depth sensing, iphone 7 plus sales go up, apple starts planning a cheaper mass-market stereo cam device for next year
November 2017: chinese phonemakers say: “hey, look! Apple managed to put two cheap cameras into their phone and now they’re selling like hotcakes! We should do that, too!” They scream at their software peeps to SOFTWARE HARDER until they have something close to what apple’s stereo vision can do. Every smart Chinese master’s student doing a thesis on stereo vision suddenly gets a really good job offer.
March 2018: New wave of stereo-cam Android devices comes out, probably with better/decent depth performance and, sadly, each with their own horrific hardware-specific API because that’s what they all do right now. And by they all, I mean HTC, which is the only phonemaker who makes any effort to offer a depth API, and it’s a total garbagefire. The fancier snapdragon-y phones might be able to crunch depth in realtime. Other phones can only postprocess stereo stills/video.
July 2018: This is totally speculative — Google looks at all the stereo cams out there and puts together a group within Android to offer a device-independent depth API. It’s one of those problems that won’t be sane to develop for in android if you have to deal with device-specific APIs, and it really needs a google-ex-machina to make the whole thing work. I assume google is thinking about this stuff, because, you know, they’re google, and they obviously care about this space, given Tango, and they are totally thinking about stereo vision too. This is a hard problem that needs google/apple resources to solve well, and mayyybe google will see Apple/ARKit working with stereo cams, get some serious FOMO, and start pouring those sweet, sweet silicon valley bucks into some horrifically smart and expensive mountain view peeps.
August 2018: The google API sucks because it was rushed out the door before it was solid and because google doesn’t have control over the camera hardware, so after a flurry of disgusted press, they frantically revamp it for a month and it works a bit better. A bunch of people take a bunch of those fake-SLR shallow-depth-of-field photos that you can make with stereo cams and post them to google plus before remembering that nobody looks at google plus.
September 2018: we start seeing pretty good AR/VR apps on the first fancy stereo android phones. Also, apple announces the iphone 9 or 7c or whatever name they’ll dream up for a cheaper stereo iphone.
November 2018: the start of the stereo android explosion!!!

So what does it all mean? Well, it means that in the next ~30 days, there will be tens of millions of phones out there that can take decent depth video and interact well with the real world, and it will be open for anyone to create on this platform. Over the next twelve months, a lot of people will realize that there’s no magic to making stereo depth work on phones – it just needs better software – and the software will catch up to the hardware until we have ~100 million depth phones out in the world by this time next year: little devices in our pockets that can capture 3D movies, know where they are in space, 3D scan objects in the real world and blend real and virtual worlds together. This is an incredibly exciting time to be in the 3D game, and I can’t wait to see the next act of the play.

<33333
–alex


#2

Great post - very informative and inspiring, and a great list. I can confirm the hardware-specific API issue and disparity maps being quite poor on most Android phones - a few thoughts came to mind while reading:
a) Stereo cameras are inherently limited by the baseline between the cameras - the bigger the baseline, the more accurate your disparity maps (rough numbers sketched after this list). In general they are never as good as active depth cameras because they rely on the scene having features and texture variation.
b) Apple has over 400 people on their imaging team right now - an amazing resource - one key thing they are focusing on is using ML prior to image capture to classify pixels across frames. This seems like a key point for any system.
c) Multi-camera systems may extend beyond just two cameras - an optimal number of cameras might be 3 or 5 depending on the device.
d) Johnny Lee said in a talk at the Stanford AR event that they are having difficulty finding a killer app for 3D - even with pretty good software tools. He recommended patience and diligence for zealots based on his experience trying to market Tango.
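On (a), here’s a quick back-of-the-envelope sketch of how much the baseline matters. All numbers are illustrative, not measurements of any real device: with depth Z = f * B / d (focal length f in pixels, baseline B in meters, disparity d in pixels), a one-pixel disparity error turns into a depth error of roughly Z^2 / (f * B), so the error grows with the square of the distance and shrinks as the baseline grows.

```python
# Back-of-the-envelope depth error vs. baseline (illustrative numbers only).
# Depth Z = f * B / d, so a disparity error of ~1 px gives a depth error
# of roughly Z^2 / (f * B).
def depth_error(z_m, focal_px, baseline_m, disparity_err_px=1.0):
    return (z_m ** 2) * disparity_err_px / (focal_px * baseline_m)

focal_px = 1000.0                      # hypothetical focal length in pixels
for baseline_m in (0.010, 0.100):      # ~phone-sized vs. ~tablet-sized baseline
    for z_m in (0.5, 2.0, 5.0):
        err = depth_error(z_m, focal_px, baseline_m)
        print(f"baseline {baseline_m*100:.0f} cm, depth {z_m:.1f} m -> ~{err*100:.1f} cm error")
```

With these made-up numbers, a ~1 cm phone baseline is already off by tens of centimeters at a couple of meters, while a 10 cm baseline keeps the same scene to a few centimeters of error - which is why stereo phones struggle at range and why active depth cameras still win on accuracy.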