To be honest, this "unlimited details" thing looks like a scam to me.
If you notice, they have highly detailed objects... which are copy-pasted everywhere, because that way all the objects in the scene can share the same representation in memory. Which makes the "unlimited" aspect laughably pointless. It's all very well and good that you have a game with an elephant statue where you can zoom in to the atomic level and see procedurally generated stuff for it, but it's still a game that consists of a camera moving around an elephant statue. There could be twelve billion elephant statues; they're all clones of each other. You damage one? All the others mimic the damage exactly, because they share the same representation in memory.
It makes for a pretty tech demo, but it's useless for games. You can't build a game out of looking at elephant statues.
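To make the shared-representation point concrete, here is a minimal Python sketch. The class names (VoxelModel, Instance) and the "removed voxels" set are made up for illustration; this is not how the demo is actually implemented, just what instancing looks like in general.

```python
class VoxelModel:
    """One copy of the voxel data, shared by every instance (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.removed = set()  # voxels carved out by damage

class Instance:
    """A placement of a model in the scene: a reference plus a position."""
    def __init__(self, model, position):
        self.model = model        # a reference to the shared data, not a copy
        self.position = position

    def damage(self, voxel):
        # There is no per-instance data, so the edit lands on the shared model.
        self.model.removed.add(voxel)

elephant = VoxelModel("elephant statue")
statues = [Instance(elephant, (x * 10, 0, 0)) for x in range(1000)]

statues[0].damage((3, 7, 2))

# Every statue now shows the same hole.
everyone_damaged = all((3, 7, 2) in s.model.removed for s in statues)
print(everyone_damaged)
```

Giving each statue its own damage means giving each statue its own copy of the data (or at least its own delta), and then the memory bill comes back.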
The simple thing is that voxels are incredibly gluttonous. They gobble RAM like nothing else. They are, after all, 3D images. Take a 1024x1024 picture in RGBA: it takes up 1024x1024x4x8 bits in memory, or 4 megabytes. Now instead of a picture with pixels, make it a volume with voxels: 1024x1024x1024x4x8 bits, or 4 gigabytes. Ouch!
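The arithmetic above is easy to check. A quick Python sketch, assuming a dense, uncompressed grid with four 8-bit channels per sample (the function name is just for illustration):

```python
def dense_grid_bytes(side, dims, channels=4, bits_per_channel=8):
    """Bytes for a dense grid with `side` samples along each of `dims` axes."""
    return side ** dims * channels * bits_per_channel // 8

MIB = 1024 ** 2
GIB = 1024 ** 3

image_bytes = dense_grid_bytes(1024, dims=2)   # 1024x1024 RGBA picture
volume_bytes = dense_grid_bytes(1024, dims=3)  # 1024x1024x1024 RGBA volume

print(image_bytes // MIB, "MiB")   # 4 MiB
print(volume_bytes // GIB, "GiB")  # 4 GiB
```

Adding the third dimension multiplies the cost by the side length, so every doubling of resolution costs 8x the memory instead of 4x.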
Sure, there are ways to make voxels use up less memory. The approach generally used is octrees: you start with a rough, low-detail three-dimensional grid (like, 2x2x2), and each block in that grid is either entirely empty, in which case you stop there; entirely full, in which case you also stop there; or partly empty and partly full, in which case it has eight children of its own in another 2x2x2 grid. This is useful because it saves space in most cases, and it also gives you a quick, built-in way to vary the level of detail: you simply choose the depth at which you stop going down the branches and instead average a node from the values of its children. But doing so has a big disadvantage: the volumes can no longer be dynamically altered, unless you load them in full and recalculate them entirely. Which means either keeping the maximum level of detail quite low, and thereby giving up on any attempt at convincing "infinite detail"; or dropping that idea and keeping voxels for static geometry.
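Here is a rough Python sketch of the tree structure described above, just to show the empty/full/subdivide rule and the "stop early and use the average" trick. All the names (Node, build, count_nodes, blocks_at_depth) are made up for illustration, and it only stores occupancy, not color; real engines pack this far more tightly.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    fill: float                              # fraction of the block that is solid
    children: Optional[List["Node"]] = None  # None means a uniform block (leaf)

def build(is_solid, x, y, z, size):
    """Build the tree for the cube starting at (x, y, z); size is a power of two."""
    if size == 1:
        return Node(1.0 if is_solid(x, y, z) else 0.0)
    half = size // 2
    kids = [build(is_solid, x + dx * half, y + dy * half, z + dz * half, half)
            for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
    fill = sum(k.fill for k in kids) / 8     # average of the eight children
    # Entirely empty or entirely full: stop here and drop the children.
    if all(k.children is None for k in kids) and fill in (0.0, 1.0):
        return Node(fill)
    return Node(fill, kids)

def count_nodes(node):
    """Total number of nodes actually stored."""
    if node.children is None:
        return 1
    return 1 + sum(count_nodes(k) for k in node.children)

def blocks_at_depth(node, max_depth):
    """Blocks a renderer would touch if it stops descending at max_depth
    and uses each node's averaged `fill` instead of its children."""
    if node.children is None or max_depth == 0:
        return 1
    return sum(blocks_at_depth(k, max_depth - 1) for k in node.children)

# Example: a solid ball inside a 32x32x32 volume.
SIZE = 32
def in_ball(x, y, z, c=SIZE / 2, r=SIZE / 3):
    return (x - c) ** 2 + (y - c) ** 2 + (z - c) ** 2 <= r ** 2

tree = build(in_ball, 0, 0, 0, SIZE)
print("dense voxels:", SIZE ** 3)
print("tree nodes:  ", count_nodes(tree))
for depth in range(4):
    print("blocks at depth", depth, "=", blocks_at_depth(tree, depth))
```

Only the blocks that straddle the surface of the ball get subdivided all the way down; the uniform interior and the empty space around it collapse into a handful of big blocks. Note that this sketch rebuilds the whole tree from scratch, which is the "recalculate entirely" cost mentioned above.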
That second option is the approach Carmack is taking for id Tech 6, the next engine after the one used for Rage and Doom 4. Static geometry such as terrain (currently generally represented by height maps) will be voxels, which makes it possible to build cliffs and caves without having to use meshes for them. Everything dynamic, as well as smaller but more varied static objects, will stay as polygons.