
Our next guest, Julian Chibane, is a PhD student at the Real Virtual Humans group at the Max Planck Institute for Informatics in Germany.

His recent work centers around implicit functions for 3D reconstruction, and his most recent paper at NeurIPS is Neural Unsigned Distance Fields for Implicit Function Learning. He also introduced Implicit Feature Networks (IF-Nets) in Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion. Julian is open to collaborators interested in similar work, so feel free to reach out!

Results of Neural Unsigned Distance Fields (NDF)
Neural Unsigned Distance Fields (NDF) can represent and reconstruct complex open surfaces. Given a sparse test point cloud (left), it generates a detailed, completed scene (right).
Results of IF-Nets
IF-Nets can reconstruct point clouds where a part is occluded, e.g. the human's back (left), with continuous outputs (right).

Highlights from our conversation:

🖼️ How, surprisingly, the IF-Net architecture learned reasonable representations of humans & objects without priors

🔢 A simple observation that led to Neural Unsigned Distance Fields, which handle 3D scenes without a clear inside vs. outside (most scenes!)

📚 Navigating open questions in 3D representation, and the importance of focusing on what's working


Below is the full transcript. As always, please feel free to reach out with feedback, ideas, and questions!

Some quotes we loved

[08:34] What surprised him while working on IF-Nets: How strong neural networks can be! We started with simpler tasks for the paper, but then we started to do some crazy stuff, like just completing the human body where an arm is missing and things like that. The network really does something very reasonable, really learns how a human should look. That was surprising for me actually, that these convolutional layers really do learn something pretty reasonable.

[13:25] Why IF-Nets generated better reconstructions: One very important reason is that we include local information of the inputs, but also encode more global information. If you want to complete the backside of a human, you really need to know where you are relative to the input point cloud, which is maybe only of the front [of the human]. Then you need something very global. The main contribution of the paper is to show how to combine local as well as global information from the input.

[28:42] Why implicit functions have had such an impact: Representing 3D scenes is not so easy as representing a 2D image. For 2D images we know how to represent them: we use grids and for each point in the grid, we have the RGB value stored and that's it. We all agree that this is a good representation for an image, or at least it's a very common representation for an image. But it's different for representing 3D scenes, especially for learning. Mesh is a very dominant representation of 3D surfaces, but it's not so easy to learn with those when you're applying a neural network. That was why these implicit [function] papers had such an impact, because this representation wasn't so clear.

[44:42] Research as a discovery process: Having people who believe in you telling you that what you're doing is still worthy [is important]. After coming up with these IF-Nets, but having had a different goal, it wasn't so clear for me after working so much that this is still something useful. But then you have to be reminded again, that if you change the setting, this is really useful for other people or for a different setting. That was also something I'd learn. Many people forget that other stuff is really working well, and instead of pushing what's working, they try to engineer what's not working. Sometimes shifting the focus is very helpful. In research, if you already know that everything is working beforehand, then it's not really research. You have to be open for this discovery process, which is not always as easy as it sounds.

[45:00] On simplicity: I really tend to like small steps in research better than having highly engineered big architectures that solve one task very well, where it's not really interpretable where this is coming from. I think that's why NeRF is so popular, because it has a very simple representation. It uses stuff that's absolutely clear to readers, for example a fully connected network. Also, including volumetric rendering, which is known: they take key things that are known, but do something very important and helpful with it. It's very clear why it's working and why it's not working, instead of sometimes you have papers solving some task but they have four different networks in there, that do not have any real names, they're just doing something not interpretable.


Transcript

Kanjun Qiu: Julian is a PhD student at the Real Virtual Humans group at the Max Planck Institute for Informatics in Germany. His recent work centers around implicit functions for 3D reconstruction, and his most recent paper at NeurIPS is called Neural Unsigned Distance Fields for Implicit Function Learning.

Kanjun Qiu: Tell us how you got to where you are and developed your initial research interests.

Julian Chibane: Yeah. Already as a child I was very fascinated by image processing, for example 3D rendering. That was really something I loved and played around with at a really early age, and video editing, for example. I did this with computers. My mother was a computer teacher, and at university I had courses on statistics, computer vision, and 3D point cloud processing. I really felt that this was something really cool and was really drawn into it.

Kanjun Qiu: What did you find cool about it?

Julian Chibane: In 3D point cloud processing we learned that, from images, we can reconstruct the world in 3D, and it was pretty amazing for me to go from a 2D projection of the world into something that's interpretable in 3D, that you can render from different viewpoints, a representation of the world that is much more interactive than just an image. You can look at an image but that's it; you cannot really move the viewpoint or stuff like that. With a 3D representation, you can do just so much more, and you might even be able to let an agent walk in this environment or interact with the 3D environment and things like that. It was part of this course, and that was just so fascinating.

Kanjun Qiu: After you took this course, you decided to do your PhD in a similar field?

Julian Chibane: I also had another course. It was a computer vision course. It was not so much neural networks and deep learning, it was more of the traditional stuff, a probabilistic view on computer vision. The teacher came from the MPI. The course was very interesting for me and it worked out quite well, and then my professor told me about the MPI, which all over Germany is a known institution for doing high-quality research. After my masters, he pointed me towards the MPI and I followed his suggestion, and that's how I became a PhD student, I guess.

Kanjun Qiu: It seems like you've been pretty focused on 3D reconstruction this whole time so far. What do you feel like has been most exciting to you in the field?

Julian Chibane: Something like the output of a Kinect, which gives you a point cloud of the front of a person, for example, which might have some nice details; capturing these details, but also completing the shape of the full human body. That's something I really think is pretty cool.

Josh Albrecht: Yeah. Actually, going into, I think it was your first paper, that was one of the things that it looked like it was possible to do. As I was reading that paper, I thought Julian will probably do a better job than me at describing it, so I was going to let him describe that. But that was one of the things that I thought was the most interesting: it did such a good job on what is actually a really hard problem. You can't see anything in the back, you can't do the normal thing... If you have a surface with a little hole you can smooth over it and fill it in, but you can't smooth over the whole back and fill it in; it just looks really bad.

Josh Albrecht: Do you want to tell us a little bit about that first paper? The title of that one, if I remember correctly was Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion.

Julian Chibane: Exactly. It's about 3D reconstruction. You have some deficient input; maybe you have a point cloud captured from just one side and you're missing the back. Such data can come from the Xbox Kinect, for example, which is used for gaming and for a bunch of other applications. There are other scanners also, 3D scanners like LiDAR scanners, that are pretty exact but give you a sparse point cloud for objects which are further away, and then you're interested in getting the full surface out, things like that. That's the application scenario of the paper.

Josh Albrecht: What was the story behind that paper? When you started your PhD, did you join an existing project, or did you come there and wander around for a little while thinking about what you wanted to do, or did this idea just come from on high, like "Oh, do this"?

Kanjun Qiu: Yeah, how did this come about?

Julian Chibane: It was my Master thesis. I was going to the MPI and I had several different project proposals from which I could choose. I really liked that proposal. Initially it was a bit different. It developed during the course of the work.

Kanjun Qiu: What was it at the beginning and then what happened?

Julian Chibane: At the beginning, it was actually using implicit functions. It was a thing already. There were three papers on implicit functions, so it was a really early idea, I would say. These three papers were, if I remember correctly, at the same conference in the same year: occupancy networks, DeepSDF, and IM-NET. We wanted to build on this idea of representing 3D geometry in order to model garments on top of humans. That was the original idea. We have a human body model, which is parametrized so you can choose the parameters such that you have a specific pose or a specific shape of a person, but this person would not have any clothing on. It would be just a human body model without clothing.

Julian Chibane: Then we would like to enhance that and put clothing on, and having a nice representation for that would be very important. That was the start of that idea.

Kanjun Qiu: Then what happened? How did it change?

Julian Chibane: At the end of the day, it didn't change so much. I guess we had humans with clothing, but we didn't have this body model anymore in the formulation, and we needed to have a step in between. The problem was that these first three papers turned out not to really enable us to model highly detailed clothing, because you have so much information there; you have wrinkles, you have tiny things on top of the clothing and stuff. Also, humans are very different from the datasets that were used within these first three papers. They used mostly rigid objects, something like a car or an airplane. That's just so much easier to represent than a human body, which can have any pose or any shape. That was the first real problem we had to tackle at that point, and that's what we did with IF-Nets. Later on we could follow the path of integrating the human body model, but first we had to solve the problem of reconstructing humans.

Kanjun Qiu: The IF-Nets architecture is somewhat unique. How did you come up with that?

Julian Chibane: First, we applied these other papers, for example the occupancy networks paper, and we found that it really has a lot of trouble representing humans because of their variance in poses. We needed to come up with a formulation that somehow used the input data that we have, leveraging all the information that is in this input. We felt like the encoding of the inputs in the previous papers wasn't strong enough to really nail the reconstruction, so we had to change the encoding, and that's how we started.

Josh Albrecht: What are some of the sharp edges, or parts that you're a little bit uncomfortable with that are still in there? Or do you feel like it's pretty robust and you're pretty much done with it?

Julian Chibane: I would say actually it's really pretty robust, and that's also why we have quite a few follow-up papers on that. The network has to learn, for each point in 3D space, whether this point is inside or outside the object that you want to reconstruct. That's the basic idea of this implicit learning. Instead of predicting the full output at once, you predict per point whether this point in 3D space is inside or outside of the object.

Julian Chibane: It can happen that, if during training time of the network you do not sample enough points far away from the surface or far away from the object, the network can be uncertain about the further surroundings and then you might have some artifacts. That's something you can find, but it's simple to get rid of those. You just need to sample points there as well and then the network learns not to put artifacts there. That's the only hacky thing that you can find. Other than that, I would say it does what you expect it to do.
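
For readers who want to see the idea concretely, here is a minimal sketch of the point-wise occupancy training Julian describes, including the mixed sampling of points near and far from the surface. This is our illustration, not the authors' code: the toy decoder below only overfits a single shape, and a sphere stands in for the ground-truth geometry.

```python
import torch
import torch.nn as nn

# Toy ground truth: a sphere of radius 0.6 stands in for the scanned shape.
def gt_occupancy(pts):
    return (pts.norm(dim=-1, keepdim=True) < 0.6).float()      # 1 inside, 0 outside

surface_pts = torch.randn(5000, 3)
surface_pts = 0.6 * surface_pts / surface_pts.norm(dim=-1, keepdim=True)  # points on the surface

def sample_training_points(surface, n_near=2048, n_far=512, sigma=0.05):
    """Mix near-surface samples with points far from the object, so the
    network also learns that empty space is 'outside' (avoids artifacts)."""
    idx = torch.randint(len(surface), (n_near,))
    near = surface[idx] + sigma * torch.randn(n_near, 3)        # jittered surface points
    far = torch.rand(n_far, 3) * 2.0 - 1.0                      # uniform in the [-1, 1]^3 box
    return torch.cat([near, far], dim=0)

# Toy per-point classifier: (x, y, z) -> occupancy logit for this one shape.
decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                        nn.Linear(128, 128), nn.ReLU(),
                        nn.Linear(128, 1))
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):                                            # a few training steps
    pts = sample_training_points(surface_pts)
    loss = loss_fn(decoder(pts), gt_occupancy(pts))
    opt.zero_grad(); loss.backward(); opt.step()
```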

Kanjun Qiu: Is there anything you feel like you learned that was surprising from doing this work?

Julian Chibane: How strong the neural networks can be. Yes. We started with simpler tasks for the paper. We started with doing something like super resolution, where we had a low-resolution object and we just wanted to get a bit higher resolution of the object, and this worked well. But then we started to do some more crazy stuff, as I mentioned, where you just complete the human body when part of it was not there, in any pose of the human body, like an arm completely missing and things like that. The network really does something very reasonable there, so it really learns what a human looks like and should look like. That was really surprising for me actually, that these convolutional layers really do learn something pretty reasonable.

Kanjun Qiu: I was really surprised looking at that result and I was like, "Oh, I wonder was this something that took a long time for you to figure out and get working, or was it something that the architecture just produced?" It sounds like you were also surprised, and so it seems like the architecture actually just learned some reasonable, fundamental representations of the actual objects.

Julian Chibane: Exactly. At least not to that degree. I think it was also, for my supervisor, a bit surprising. In a lot of research papers you always use the prior of human body models, so you have a human body model and then you do something with this. This gives you a prior on how the human looks. I was thinking that this is something you really need in order to do some good completions, but it turns out the network can really learn a lot of things just from data, without using the parametric model.

Kanjun Qiu: Did you ever test it on non-human objects or weird objects like a leaf or nature trees or anything like that? I know those are notoriously hard to reconstruct.

Julian Chibane: We did have quite a few experiments on objects other than humans. That's actually something that many people looking at IF-Nets probably miss: it's not only for humans. It can basically work on a bunch of objects and object classes. We used it on the ShapeNet dataset, and the ShapeNet dataset consists of manmade objects; there are cars in it, airplanes and things like that. We used this dataset to show that on these rigid objects we can still outperform the prior work, and used the human data to show that our approach also works on non-rigid objects.

Julian Chibane: We took part in an ECCV challenge with IF-Nets. There were two challenges; one challenge was on human data and the other challenge was on arbitrary data. In both cases it was 3D data, of humans or of arbitrary objects, and the arbitrary object class was really anything, so there were some fuzzy toys in there, there were some computers in there, also some human statues, anything. We reconstructed those 3D scans. We applied the model, the input was colored 3D scans with holes, either of humans or of arbitrary objects, and the task was to complete the surfaces such that these scans do not have holes anymore, but also complete the color. That was new. That was an extension of the IF-Nets paper. We extended the reconstruction from only surface to surface and color.

Julian Chibane: Surprising to me was, first of all, that the color completion is very reasonable. If there's a hole between your shirt and your jeans it will not only complete the surface there but also complete the colors, the texture, and it will segment your t-shirt from your shorts, and it will look like a reasonable completion also in the texture space. We trained really on these arbitrary objects and we tested the model on objects it had never seen before, and they had holes, right? The reconstruction could still remove these holes and do a pretty reasonable job although the model had never seen these objects or categories. Even for me as a human it wasn't easy actually. That was also quite impressive and quite interesting.

Josh Albrecht: One of the things I was going to ask about the earlier paper: did you ever try training on the human dataset and then testing on the ShapeNet one, or reversed, to see how much the dataset bias actually impacts it?

Julian Chibane: Not in the IF-Net paper, but we did something along those lines for the NDF paper, and we used a very similar network architecture, so I guess it applies also, to some extent, to IF-Nets. We found that it does generalize quite a bit. We trained on ShapeNet and we tested on real scenes. That's not an experiment that actually made it into the paper, but we tried it in our experiments and we found that it does quite a reasonable job already.

Josh Albrecht: That lines up with what you were saying about the second paper as well, about working on unseen object classes. That's pretty challenging. I think a lot of these 3D reconstruction things that I've seen in the past are like, "Oh yeah, it looks pretty good," but you only train on planes or cars or something, so they're all the same anyways. It's a lot harder the other way.

Kanjun Qiu: Why do you think it works so much better than many of the methods out there? Or are there other methods that you feel like, "Oh, this is just as good"?

Julian Chibane: One very important reason is that we include local information of the inputs. If we have an input, for example this point cloud of a human, where you have points only on the front of the human, then if we want to reconstruct points near the surface of the human, we really have local information in the network: how close the point is to the input, and an encoding of this local region of the input. But we also encode more global information. If you want to complete the backside of a human, you really need to know where you are relative to the input point cloud, which is maybe only of the front. Then you need something very global. The main contribution of the paper is to show how to combine local as well as global information from the input.

Josh Albrecht: That's done with the different multi-level 3D convolutions as well. I think there was some sort of sampling along the axis or something nearby, but just one step in some direction. It's not just looking at the one point, but there are also some points near it or something.

Julian Chibane: Exactly. First of all, yes, we have a 3D convolutional encoding. The 3D input is encoded by using just a regular 3D convolutional network. For each layer of the convolution, this network gives you a feature grid. It's a 3D feature grid encoding your input. The cool thing about these feature grids is that they are aligned with the input, which means you know where to find the corresponding feature for one 3D point in space. For one 3D point in space you can look up the corresponding features within these deep feature grids. That's one trick. We can just look up the features that correspond to the point which you want to decode in the feature grids, and then we leverage our knowledge in the machine learning community that the receptive field grows the more layers you have. The early layers will have a low receptive field, which means you have very localized information about your input, and the further you go up the hierarchy, the more global it will be.

Julian Chibane: We extract features from all the layers, which then yields an encoding of your point which is global, but also local.
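
Here is a rough sketch, in PyTorch, of the multi-scale feature lookup described above. It is our paraphrase of the idea rather than the released IF-Net code; the layer sizes and names are made up for illustration. A small point-wise decoder (an MLP on the concatenated features) would then predict occupancy for each query point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEncoder(nn.Module):
    """Each conv layer yields a feature grid aligned with the input volume."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv3d(1, 16, 3, padding=1)               # fine grid, small receptive field
        self.conv2 = nn.Conv3d(16, 32, 3, padding=1, stride=2)    # coarser, more global
        self.conv3 = nn.Conv3d(32, 64, 3, padding=1, stride=2)    # coarsest, most global

    def forward(self, vox):                                       # vox: (B, 1, R, R, R) voxelized input
        f1 = F.relu(self.conv1(vox))
        f2 = F.relu(self.conv2(f1))
        f3 = F.relu(self.conv3(f2))
        return [f1, f2, f3]

def query_features(grids, points):
    """points: (B, P, 3) in [-1, 1]. Returns (B, P, C_total): for every query
    point, the trilinearly interpolated feature from every level, concatenated,
    so the decoder sees local detail and global context at once."""
    grid = points.view(points.shape[0], 1, 1, -1, 3)              # shape grid_sample expects
    feats = []
    for g in grids:
        f = F.grid_sample(g, grid, align_corners=True)            # (B, C, 1, 1, P)
        feats.append(f.squeeze(2).squeeze(2).transpose(1, 2))     # -> (B, P, C)
    return torch.cat(feats, dim=-1)

# tiny smoke test
enc = MultiScaleEncoder()
vox = torch.rand(1, 1, 32, 32, 32)
pts = torch.rand(1, 100, 3) * 2 - 1
print(query_features(enc(vox), pts).shape)                        # torch.Size([1, 100, 112])
```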

Kanjun Qiu: How are you thinking about follow-up work or building on this? How far can you take it and what are additional problems that you might want to solve with IF-Nets or modifications?

Julian Chibane: A colleague of mine made some follow-up work; he used IF-Nets and made a follow-up paper with a variant of them. What he did there is he segmented scans of human bodies into body parts, and this helped to register a parametric human body model to the scans. Why is it useful to have a registered parametric human body model? It's useful because if you have a good registration, you can then start to move your static scan, and instead of a static scan you then have a dynamic scan, which you can change the pose of, for example. IF-Net was really helpful there in order to predict which body part is present at what location in 3D space.

Julian Chibane: That's one extension of that. That was pretty cool. It was a paper at ECCV, and he really showed that from these static scans he could really make crazy poses, and that worked quite well.

Josh Albrecht: One other question on some of the earlier work, the IF-Nets, just in general: how long does it take to train? How stable is the training? How long does inference take? How much data is in the training set?

Julian Chibane: I see. For the human dataset, we used only around 3,000 scans, if I remember correctly.

Josh Albrecht: Not a gigantic dataset.

Julian Chibane: No, actually not, because it's quite expensive to get a lot of high-quality human scans. So there we used a relatively small one. The ShapeNet dataset is quite large, but the objects that you have are pretty rigid, pretty prototype-ish, and many objects are near duplicates of each other. It's a very different setup, actually. It worked on both; we trained one model separately for ShapeNet and for our human dataset. Training time is approximately two days if you want to have the best possible quality, or if you want to have the best number at the end of the day in your paper, but the results are reasonable way before that. You can see that it's converging to something very reasonable, where further improvements are minor, already after some hours.

Josh Albrecht: How many GPUs?

Julian Chibane: We train on a single GPU.

Josh Albrecht: Then similarly for inference: how long does it take to do the prediction for one model?

Julian Chibane: The prediction of one model is not real time, so it's around 30 seconds to get your full prediction, which is then meshed, and that's the final result. The reason for that is the model is implicit, which means you have to query a lot of points. If you want to have a high resolution (we use a resolution of 256^3), you have to query a lot of points. For each of these points the network has to decide if the point is inside or outside. The inference time depends highly on the output resolution that you want to have. That's really a matter of what you want.
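
As a rough sketch of what that inference loop looks like (our illustration, with `occupancy_fn` as a placeholder for the trained point-wise network), you evaluate the implicit function on a dense grid in chunks and then run Marching Cubes on the resulting volume; the cubic growth in query points is what makes resolution the dominant cost.

```python
import torch
from skimage.measure import marching_cubes

def extract_mesh(occupancy_fn, resolution=256, chunk=100_000, threshold=0.5):
    """Query the implicit function on a resolution^3 grid, then mesh it."""
    xs = torch.linspace(-1, 1, resolution)
    grid = torch.stack(torch.meshgrid(xs, xs, xs, indexing="ij"), dim=-1).reshape(-1, 3)
    values = []
    with torch.no_grad():
        for pts in torch.split(grid, chunk):                  # 256^3 is ~16.7M points, queried in chunks
            values.append(torch.sigmoid(occupancy_fn(pts)).squeeze(-1))
    volume = torch.cat(values).reshape(resolution, resolution, resolution).numpy()
    verts, faces, _, _ = marching_cubes(volume, level=threshold)
    return verts, faces

# usage with the toy decoder from the earlier sketch:
# verts, faces = extract_mesh(decoder, resolution=128)
```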

Josh Albrecht: But that's good to have as a ballpark order of magnitude. I think for some of the NeRF stuff, it took a surprisingly long time to do some of those things, maybe even hours or something, to make that little video or whatever. It looks so cool, but it took a lot of compute to actually even just do the inference part. Beyond inference, if that makes sense, do you think there are any other downsides of this particular approach?

Julian Chibane: No, actually, I would say that it's pretty stable. One limitation, and that's addressed with the follow-up work, is that occupancy is useful to represent objects where, for each point in 3D space, you can classify this point as being inside or outside of the shape. There might be some shapes that just do not separate 3D space into only two regions, inside and outside, but separate the space into more regions and have a layering of surfaces. There it's not really applicable. You might have data where you do not really have ground truth inside and outside. It might be that you have holes in your 3D scans, so it's not really easy to find out when you are inside of an object or outside of an object. Or maybe you have a thin wall or a very paper-thin object, so it might be very tricky to define inside and outside. That's one limitation that IF-Nets share with all the other approaches that predict occupancy or predict signed distance, because signed distance at the end of the day also has this binary classification into inside and outside.

Josh Albrecht: That idea I think is probably what led to your work at NeurIPS, the Neural Unsigned Distance Fields for Implicit Function Learning. Do you want to talk a little bit about that paper and the approach there and some of the results?

Julian Chibane: It addresses exactly this problem. We wanted to tackle 3D scenes, which often have holes. Often you do not have a fully complete, perfect scan of your 3D scene; you have holes or just a partial scan, and it's not really easy to train a model with occupancies, where you need inside and outside. That was the setting. What we did with neural unsigned distance fields, NDF for short, is switch from the SDF representation, which is pretty common, so signed distance, to unsigned distance, because the sign tells you if you're inside or outside of the object. If you want to get rid of this inside and outside because you do not really have it, the first idea, of course, is to just remove the sign.

Kanjun Qiu: That makes sense.

Julian Chibane: Of course, as trivial as this sounds, at the end of the day it brings some challenges if you want to use this representation, because for SDFs, the signed distances, and for occupancies you have this Marching Cubes algorithm. It converts your implicit representation, which tells you inside and outside for each point in 3D space, into something that's more applicable to real-world scenarios or use cases: a mesh. It converts from this implicit representation to a mesh. That's what you cannot really easily do with an unsigned distance field. In this paper, NDF, we showed some algorithms that still allow for visualization, for direct rendering into images, or to create very dense point clouds, so a lot of points describing the surface of the object that you're reconstructing.

Julian Chibane: Also, because we can predict these very dense point clouds, we can then use very classical algorithms that connect these dots and create a mesh from there. So we can also mesh at the end of the day.
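
One simple way to pull such a dense point cloud out of an unsigned distance field, sketched below under our own assumptions (not the authors' released code), is to move random samples against the field's gradient by the predicted distance and repeat a few times, so they land on the zero level set. `udf` is a placeholder for a trained network mapping (P, 3) points to (P,) distances; here a sphere stands in for it.

```python
import torch

def project_to_surface(udf, points, steps=5):
    """Iteratively move samples onto the surface: p <- p - d(p) * grad d(p)."""
    points = points.clone()
    for _ in range(steps):
        points.requires_grad_(True)
        dist = udf(points)                                      # predicted unsigned distance, (P,)
        grad = torch.autograd.grad(dist.sum(), points)[0]       # points away from the surface
        grad = grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
        points = (points - dist.unsqueeze(-1) * grad).detach()  # step toward the zero level set
    return points

# toy stand-in for a trained NDF: unsigned distance to a sphere of radius 0.5
udf = lambda p: (p.norm(dim=-1) - 0.5).abs()

seeds = torch.rand(100_000, 3) * 2 - 1                          # random samples in the bounding box
dense_cloud = project_to_surface(udf, seeds)
surface_points = dense_cloud[udf(dense_cloud) < 1e-3]           # keep points that reached the surface
```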

Josh Albrecht: For anyone who hasn't seen it, it's definitely worth checking out. The audio description of how cool this paper is doesn't quite do it justice. There are some really great images in there that show how well this works with things that are a lot more realistic. There's a drawing of a bus where there are some seats, like a cross-section of a bus. It's really hard to tell what's inside or outside for any of these objects, and it just does a great job of mapping out the contours in a way that feels intuitively correct.

Josh Albrecht: There was another really good example, too. I think you guys had one with a car that had a bunch of internal structure, the internal seats and the windows and the outside of the car, and you can see the surfaces for this entire thing. In a way, that's almost impossible to do with these other algorithms that are just looking at the outside of it. It seems much more robust to do things this way. I was very excited about it.

Julian Chibane: Thanks, yeah. Thanks. Appreciate it.

Josh Albrecht: Did you also find it to be actually a pretty useful robust representation and a pretty good extension of these IF-Nets?

Julian Chibane: Yes, I would say so. Especially if you have these scenarios where you do not really have inside and outside, there it's really applicable. But of course, if you do have inside and outside and you have this ground-truth data, then you can just go with the classical style of occupancies, and then you can rely on Marching Cubes, which is well developed and fast, and that's also a very valid approach. It depends on your scenario.

Josh Albrecht: I think the exciting thing for me is that in a lot of real-world scenarios it's really hard to get the inside and outside, like Kanjun's example earlier about the trees or leaves. Yeah, sure, there is an inside to a leaf, but you really don't want to be representing that.

Julian Chibane: Exactly.

Josh Albrecht: Or cloth, right? Or a scan of a room. Yes, there's technically an inside and outside, but you're not going to walk around the other side of every single wall to get the outside of the entire thing; it's so obnoxious. It's really nice to be able to not worry about that in a practical setting, in a lot of cases I think.

Kanjun Qiu: If I were a practitioner and I was deciding between these unsigned fields versus a signed field, when would you recommend I use one or the other?

Julian Chibane: As soon as you have ground truth for occupancies I would go for that, because you can really use this classical Marching Cubes, which does a great job in creating surfaces, and most application scenarios at the end are interested in having a mesh as an output because it's very lightweight. You can also get this with NDFs, but you have to do a step in between: first you predict a very dense point cloud and then you mesh. That takes more time. Therefore, if you can use occupancies, and if you have good data and no holes and high-quality data and everything is perfect, then of course go with the easy approach. But for a more realistic scenario, the NDF approach would be the thing to go with.

Kanjun Qiu: If I don't care about meshes at all and I maybe only care about the representation of objects, then the NDF approach is a good one to go with?

Julian Chibane: Yeah, then in any case. If you are interested in images, if you want to render, or if you want to have a point cloud, that's perfect, yes. Or again, if you want to have meshes but you do not have ground truth with occupancies, then NDFs are still the way to go.

Josh Albrecht: One other thing that I thought was pretty cool about the paper was the connection to regression. The experiments at the end with just random 2D point data and lines and things like that, I thought that was a really interesting idea. Do you think that's practical at all? Maybe it's also worth describing what exactly some of those setups were and how that worked. But then also the question is, is this practical or useful in any setting, or is it just a neat trick?

Julian Chibane: Yeah, I've actually also been very excited about this, but reviewers didn't like it so much. They didn't appreciate this idea of taking something that comes originally from computer graphics, which is ray tracing, and then using ray tracing for regression. It's just an interesting connection, right? As to the applicability, we didn't do very thorough experiments on all the kinds of regression or functions that you could regress. But representing functions as unsigned distance fields and doing this regression with this ray tracing idea is actually really very interesting. I think it's really worth looking into and maybe following up on, because I think with this formulation the neural network can really represent any function, or also other 2D surfaces.

Julian Chibane: Functions are not so complex; you could have a binary classification, everything above the function is inside, everything below is outside. You could do that, but there might be some curves that do not really define an inside and outside, like a spiral, for example. If you apply neural networks to learn to connect these surfaces using this implicit approach, that's very exciting.

Josh Albrecht: Are you currently doing any follow-up work on that?

Julian Chibane: Everybody can follow up on these, I guess.

Kanjun Qiu: Switching topics a little bit, if you're done with the papers: one thing I'm really curious about is, in this field, whose work has impacted you the most?

Julian Chibane: I would say it's these three papers that I mentioned: occupancy networks from Andreas Geiger's group, with Lars as the first author, DeepSDF, and IM-NET, and also PIFu, of course. Let's not forget about that one. They made it popular, but it's not really true that these are the first papers. There's a paper from one year before them, and it's never cited, but it actually does everything that we in this implicit realm are doing. It's called Deep Volumetric Video From Very Sparse Multi-View Performance Capture. It's a paper from Hao Li's group. They actually already do all these things: they predict inside and outside in a point-wise manner, then they do Marching Cubes to get a mesh, and they predict this from images.

Julian Chibane: The idea was there one year beforehand, but it got popular only with these three papers, I would say. I guess none of these papers actually cites that paper. At least, I would say it's underappreciated.

Kanjun Qiu: When you first digested these papers and felt like they were very impactful, what were the key salient points that you got out of it? How did it change your view?

Julian Chibane: I was doing my master thesis, so it was really the first time for me getting into this 3D vision research. It was not so easy to understand the impact this really has. I'm not sure if other people were so certain of the impact of these ideas, because I would say now, one or one and a half or two years after these papers, they had a really significant impact on representing 3D scenes. I'm not sure if this was clear at that point already. The many follow-up papers that use these ideas for other stuff and then show that they are applicable, of course our paper IF-Nets, and many others, really made a success out of this idea. And NeRF, right? That was this big paper everyone's talking about, and the next conferences will probably mainly be NeRF conferences. Yes, it also builds on this idea of predicting point-wise, so it was really impactful.

Kanjun Qiu: How has [inaudible 00:28:32] thinking impacted you? Is it just generally the methodology of predicting point-wise that really told you, "Oh, this is something I can take further"?

Julian Chibane: Yes. That was a breakthrough in the representation of 3D scenes, because representing 3D scenes is not so easy as representing a 2D image. For 2D images we know how to represent them; we use grids, and for each point in the grid, we have the RGB value stored and that's it. We all agree that this is a good representation for an image, or at least maybe not all of us, but it's a very common representation for an image. But that's a bit different for representing 3D scenes, especially for learning, because if you want to learn, it's not so easy to learn with a mesh, for example. Meshes are a very dominant representation of 3D surfaces, but it's not so easy to learn with those when you're applying a neural network and things like that.

Julian Chibane: That was why these implicit papers had such an impact, because this representation wasn't so clear, or is maybe still not so clear. And there's the major drawback of the volumetric approach, where you extend a 2D image to 3D by just concatenating a lot of images one after another and then interpreting that as a 3D volume, so you have a voxel grid. This approach is really memory intensive, so you have to store a lot of values and that's not so easy to do on our GPUs. Maybe some time in the future we will go back to voxels, but currently, although our hardware resources are increasing, that's not that easy, so that's why there is such a success of this idea.
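
To make the memory argument concrete, here is a back-of-the-envelope illustration (our numbers, not from the episode): a dense voxel grid stores one value per cell, so memory grows cubically with resolution.

```python
for res in (64, 128, 256, 512):
    mib = res ** 3 * 4 / 2 ** 20          # one float32 value per voxel
    print(f"{res}^3 grid: {mib:.0f} MiB per channel")
# 64^3: 1 MiB, 128^3: 8 MiB, 256^3: 64 MiB, 512^3: 512 MiB.
# With, say, 32 feature channels, the 512^3 grid already needs about 16 GiB,
# which is part of why implicit, point-wise representations are attractive.
```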

Kanjun Qiu: It seems like voxels, and I guess this is maybe just my opinion, these discrete representations are definitely non-ideal, and this is an issue with 2D as well.

Julian Chibane: Probably. They are not like the hierarchical view in our brain, where we focus on where more details are.

Kanjun Qiu: To your point there, I think there is something really interesting about how do we decide what to pay attention to, and then how do we register detail in the area that we choose to pay attention to. It's a little bit separate from the representation that we use.

Josh Albrecht: Some of the implicit priors are captured in the work that you did, even in unsigned distance fields. I noticed it's forcing the model to pay attention to the region that's very close to the surface and spend most of its resources there, as opposed to encoding, "Oh yep, still air. Yep, still lots of empty air."

Julian Chibane: Exactly.

Kanjun Qiu: One thing I was curious about in this field in general is, what are some of the most important open questions that you feel are important for the field to address?

Julian Chibane: It's not really real time. At inference time, we still discretize, so we still are creating a voxel grid at the end of the day. We just defer it; we do not do it in the learning phase, and we're not storing this voxel grid within the GPU, but at the end of the day we are predicting voxel grids point by point in order to then apply Marching Cubes, for example. That goes for this implicit function work. In NeRF, of course, it's done differently, but there also we have to query many points in order to render an image, for example. One crucial question for the coming time will be: how can we do this more efficiently? Coming back to our example, if we are in 3D space we do not need so many samples everywhere, right? We do not want to sample many points in the 3D space of air where nothing is, and we have to find something more clever, neglecting free space, for example, and paying more attention to surfaces.

Kanjun Qiu: Paying more attention to surfaces, and even more attention to where there's a lot of texture or something like that.

Julian Chibane: Exactly. It might not always be as easy as that: if you want to represent something that has no surface or no clear surface, something like fog or something volumetric, then you also need samples in free space, or then it becomes ambiguous what is free space and what is not. But maybe sampling more points where you have a higher density of objects.

Josh Albrecht: Or maybe it's simply more points where it's more important for some downstream task, like using attention to allocate toward whether you want to pay attention to the rough shape of this object or just this little particular piece over here, that sort of stuff as well.

Julian Chibane: Exactly. There's quite a bit of work to be done there, because of course it would be cool to have these upsides of predicting point-wise also in real-time applications, such that we can really use these things instead of just having research results. Maybe I should go back to the previous question, because there are, of course, more things that can be done, for example making these implicit representations more interpretable or more interactive, and people are starting to work on this already. If we think of NeRF, which is learning some continuous representation based on images, then it would be cool to have this scene representation be more interpretable, more interactive. Being able to relight a scene would be very cool; there are papers coming out in this direction.

Julian Chibane: But of course, also maybe changing the pose of objects or changing the pose of humans is interesting as well. Not only capturing the real world but also being able to manipulate it, maybe capturing the real world or the scene of interest at different timestamps: maybe an outdoor scene is rainy one time and sunny another, and then being able to interpolate between the settings of your scene and interactively manipulate the scene. I think there would be so many cool opportunities. It would have so many applications for gaming, for example; maybe creators can capture more of the scenes instead of recreating them artificially, things like that.

Kanjun Qiu: What is the current state of interactive scene representation and being able to modify some of these parameters?

Julian Chibane: There are follow-up papers on the NeRF representation allowing for relighting, and other papers looking into dynamic scenes. There is work on that, but we can still explore this direction much more in order to really elaborate on it.

Kanjun Qiu: What are you interested in here?

Julian Chibane: Actually, this relighting was something I also wanted to look into, before I saw the paper showing it's already being done. But using these representations for larger scenes, for example, is also something interesting, and for large scenes, making them as dynamic and as interactive as possible, and also doing augmentation, putting different objects into the scene.

Josh Albrecht: Thinking about how to make a larger-scale, practical version, where you can swap out the objects and change the location and change the general parameters of the scene, and things are able to deform. A really robust, good, large-scale scene representation would be really interesting.

Julian Chibane: Something like what we have for our human body models, which are of course hand-engineered, but maybe discovering some latent parameters of scenes and then manipulating those for different styles of the scene or object.

Kanjun Qiu: That'd be so fun as a designer.

Josh Albrecht: Yeah. That'd be super fun to play around with.

Kanjun Qiu: Find the line between the real-world and digital objects.

Julian Chibane: Exactly. Maybe effectively helping creators to focus less on redundant work and put more attention on the creative aspects of their work, where they can easily get a sunny scene or any scene. The rest is just handled by a cool mechanism in the background. That could be something very cool.

Kanjun Qiu: There was an Adobe paper demo where they're doing retexturing, relighting of nature scenes.

Josh Albrecht: That one specifically was separating the appearance of the scene and the structure-

Kanjun Qiu: From the structure of the scene.

Josh Albrecht: … of the scene for 2D images, so they were able to very quickly swap out [crosstalk 00:35:56] like mainly Sony-

Kanjun Qiu: That's true. You're right.

Josh Albrecht: … or take the mountains over here and put them over here, and controlling those things separately with a brush, being able to select the thing and paint it somewhere else. It's pretty interesting.

Kanjun Qiu: It is really cool. It'd be really cool to be able to do that in 3D.

Josh Albrecht: Exactly. Yeah, it'd be great to see in 3D.

Julian Chibane: I guess that's not the only thing that would be nice to transport from 2D to 3D. There are many cool ideas in 2D that we would like to have in 3D. This 3D field has a lot of potential to grow, and there are many applications that really use and really need these 3D representations. We should explore more and more how to get information from the real world into the 3D representation, and allow using this real-world information to design and alter 3D representations and to apply them to different use cases.

Kanjun Qiu: What are some controversial opinions you have about research or fields that you feel like other people don鈥檛 necessarily agree with?

Julian Chibane: One argument that can come up in a rebuttal, for example, when you have to defend your research or people around me defend their research, is that the approach is simple. Sometimes that's actually a criticism: the idea is simple, it's too simple, or something. That might not be that good actually, because simple approaches are not a bad thing; that's a good thing. It's a good thing to have something simple as long as it does something reasonable, as long as it really helps. We do not really want to have very elaborate architectures with a lot of parameters doing something in a black box we don't really understand. It's better to have something simple, to make simple ideas work, and to do things in smaller steps but with real understanding.

Josh Albrecht: I couldn't agree with you more; I've come around to this opinion over time. There's no need to add any extra complexity. We barely even understand how or why neural networks work. The idea of taking the sign off of a signed distance field and making an unsigned distance field, I think that's great [crosstalk 00:37:47].

Kanjun Qiu: That's brilliant. Why didn't the people who were working with signed distance fields think about that?

Julian Chibane: I'd be careful with the claim that we're the first using unsigned distance fields, so…

Kanjun Qiu: [crosstalk 00:37:56].

Josh Albrecht: Making it actually work and showing how well it works and where it works, and getting it to work for rendering and for meshing and all these other things, I think that's very valuable. It's a simple idea but there's a lot of work that goes into making it really actually work.

Kanjun Qiu: In product development we have a saying that there's this curve of how much effort you put in. When you have put in very little effort, your work is basic, and when you have put in a medium amount of effort, your work is complex. Then when you have put in an enormous amount of effort, your work is simple. Simple [inaudible 00:38:26] it's not at the beginning. That's true in a lot of design disciplines, and research is one, because it's a creative discipline.

Julian Chibane: Actually, this goes not only for the ideas themselves but also for the presentation, I would say. That's not so much an issue in the English language, actually, but for writing in German people often disagree. I really had to relearn how to write a bit when writing in English, because it's very different actually. In German, a sentence you state should already reflect almost all possible circumstances, it should reflect the full truth, and of course that makes the sentence very long, you have to use a lot of commas, and it's very hard. You have to have really complex sentences in order to nail down what you want to say. That's very different from writing in English, and I really enjoy that.

Julian Chibane: I just had this discussion with a friend of mine who's writing his master thesis. People in Germany sometimes really tend to have these really long sentences and make really hard-to-understand texts, which are, of course, very much to the point still; they are very exact and very expressive. These are definitely upsides, but for a reader, at the end of the day it's really just hard to parse. You really have a hard time understanding what the people are trying to tell you. I really like this idea of simple is better; it also goes for me in writing, and even if I write German now, I write very short, clear sentences, whether people like it or not.

Kanjun Qiu: Yeah. I have this theory that constructing good explanations of things is almost an entirely separate discipline from making forward research progress and doing novel work. The field needs both: to push the novel work forward and also to construct a ladder for other people to understand what the novel work is. Otherwise you end up with too much understanding debt, where it's very difficult for people to really know what's going on, and at some point it becomes very hard to make progress. What other controversial research opinions do you have?

Julian Chibane: Another thing is that I really tend to like small steps in research better than having highly engineered, really big architectures that solve one task very well, but where it's not really interpretable where this is coming from. I think that's why NeRF is so popular, actually. It's because it has a very simple representation; it uses stuff that's absolutely clear to readers, for example a fully connected network. That's very clear for readers, what that is, so they really just use this representation and show that it really is powerful. Also, including volumetric rendering, which is also known, so they really take key things that are known but do something very important and very helpful with it. They have something where it's very clear why it's working and why it's not working, instead of sometimes you have papers solving some task but they have four different networks in there that do not even have any real names; they're just doing something not interpretable, and yeah, I really tend to have a hard time parsing those papers.

Kanjun Qiu: Josh is really on this side of small modular networks, no giant big architectures.

Josh Albrecht: Mm-hmm (affirmative). I think it's also interesting to see what happens when you take all of these small things and put them together in some other larger way. Before you do that, you want to understand what each of these pieces is for and why we're really doing this; in the resulting work it can be really hard to interpret what the contributions of the different pieces are, unless you're just taking one fully connected network and something, "Okay, sure, we have [inaudible 00:41:48] what's contributing what at this?" I think my favorites, though, are the ones that are simple approaches and they get great performance, like the IF-Nets, for example. It's a pretty simple approach, and it works really well. I think it's always interesting to see when you can take something so simple and make it work so well.

Julian Chibane: But of course, it's not always easy to get there, and I guess it also depends highly on where the research of the field is. At some point, you just have so many papers already solving problem A, and then the next paper, of course, has to beat a very high baseline already, so it's much, much harder to make big steps there, and it's very valid to have larger-capacity models or more engineered models. I fully agree with what you said; it's good to start small, right? Try to have a good understanding of what you're doing or what your model is doing and then start to extend the complexity, but maybe not start with a very complex model at the beginning.

Kanjun Qiu: What do you think is something you've learned or something you've observed about what makes a great researcher great that would be unusual for other people to say?

Julian Chibane: Being patient with other people's ideas: being patient listening to other ideas and first trying to understand them before pushing your own or fighting for yours. Also having the patience to develop your own idea, taking your time to do the things that you want to do.

Kanjun Qiu: How did this come to you? Were there some experiences where you learned, or you were not as patient before?

Julian Chibane: Yeah. Starting with my master thesis, I was quite anxious to get some results, but then the course of the work changed; it wasn't as predictable as hoped that we could just engineer the architecture and everything would work out at the end. It's not really like that. I had to adapt to the changed setting, which wasn't so easy for me, but being patient helped, and of course also having people who believe in you, telling you that what you're doing is still worthy. After coming up with these IF-Nets, but having had a different goal, it wasn't so clear to me at that point, after working so much, that this was still something useful. But then you have to be reminded again that, if you change the setting, this is really useful for other people or for a different setting. That was also something I'd learned.

Kanjun Qiu: There are these troughs of sorrow: "Oh, I'm so excited about this. Oh, it doesn't work," and having someone to pull you out of the trough of sorrow is very valuable.

Julian Chibane: Yeah, maybe focusing on the positive sides, that's also something which can make great people great. Many people, if something is not working, tend to try to improve what is not working. They forget that other stuff is really working well, and instead of pushing what's working they try to engineer what's not working. Maybe sometimes shifting the focus is also very helpful.

Kanjun Qiu: [inaudible 00:44:27].

Josh Albrecht: That resonates with me. It's about being open to finding these new things. I think some of the most-

Kanjun Qiu: Focusing on strengths.

Josh Albrecht: Yeah. Some of the most interesting discoveries also come from, "Oh, we were trying to do this thing, didn't really work and it kind of worked but we noticed that this other thing over here was really interesting and so we just went off in this other direction."

Julian Chibane: That's also what research somehow should be, right? In research, if you already know that everything is working beforehand, then it's not really research.

Kanjun Qiu: It's a discovery process.

Julian Chibane: You have to be open for this discovery process, which is not always as easy as it sounds.

Kanjun Qiu: Totally. Yeah.

Josh Albrecht: What are some tips or tricks or hacks that you鈥檝e learned?

Julian Chibane: If you start a project, it's always good to start with the smallest possible architecture or the smallest possible portion of your idea. Even though you have a much higher goal that you want to achieve, A, B, C, and D, you start just by achieving A and trying to figure out what needs to be done to do that, instead of trying to hammer the full problem at once. If you do it like that, it's also much easier to test your assumptions and to test your code, because many ideas fail or do not really come out that well because maybe there's some implementation error or some error in your formulation, and you really need a small formulation or a small project in order to track down such issues and then develop your problem further based on what you've learned in the beginning. Simplifying your own idea at the beginning of the project is something very helpful.

Josh Albrecht: That also resonates with me; having something nice and simple to start with is a great place to start.

Kanjun Qiu: Did you apply this in either of your papers and how did that work?

Julian Chibane: I apply this strategy in every paper. I always try to start with the smallest possible portion of the idea, or with the easiest setup. Even if I want to tackle a very complicated setup, I would try to first use it on the easiest setup and check what difficulties might be there that I need to solve.

Kanjun Qiu: I was just going to say, for IF-Nets, for example, if you had to get very concrete, what was the easiest setup?

Julian Chibane: The easiest setup there was looking at the occupancy networks results, for example, that cannot really represent humans; these humans had missing arms, for example. From the input that we had, it was maybe a coarse human. If we just interpolate the information that we had there, with a very, very tiny network or no neural network at all, shouldn't this already give you a much more accurate pose of the person? In the input there is some arm, for example, that occupancy networks miss. If we do some simple interpolation, shouldn't we get something more reasonable there already, by just paying more attention to the data and relying more on local features, without even doing a very large learning machinery?

Julian Chibane: That was the place to start with there, and then we wanted to extend this. We wanted to complete shapes, so we had to also use more global features, and then we really used the 3D scene.
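
As a concrete illustration of that kind of no-learning starting point (our sketch; the resolution and threshold are arbitrary choices, not from the paper), you can voxelize the partial input and answer occupancy queries by trilinear interpolation of that coarse grid, which already preserves input details, like a visible arm, that a purely global encoding can lose.

```python
import torch
import torch.nn.functional as F

def voxelize(points, res=32):
    """points: (N, 3) in [-1, 1]. Returns a (1, 1, res, res, res) occupancy grid.
    Stored as grid[z, y, x] so it matches grid_sample's (x, y, z) convention."""
    idx = ((points + 1) / 2 * (res - 1)).round().long().clamp(0, res - 1)
    grid = torch.zeros(res, res, res)
    grid[idx[:, 2], idx[:, 1], idx[:, 0]] = 1.0
    return grid[None, None]

def query(grid, query_points, threshold=0.3):
    """Trilinearly interpolate the coarse grid at the query locations."""
    g = query_points.view(1, 1, 1, -1, 3)
    vals = F.grid_sample(grid, g, align_corners=True).flatten()
    return vals > threshold

# toy usage: a flat, partial point cloud standing in for the front of a scan
front = torch.rand(2000, 3) * torch.tensor([1.0, 2.0, 0.1]) - torch.tensor([0.5, 1.0, 0.05])
occ = query(voxelize(front), torch.rand(10, 3) * 2 - 1)
```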

Kanjun Qiu: That's a really simple starting point.

Josh Albrecht: Are there any ideas that you thought should work that did not work as you expected?

Julian Chibane: Yes. One further goal was to go from this unsigned distance field to a mesh directly, having a variant of Marching Cubes, which can do it for occupancies, but doing it on NDFs. You have the unsigned distance field and you still want to mesh it directly using some variant of Marching Cubes. We worked on that and we found some reasonable results, but it was still creating holes here and there and it wasn't really working out all that well. We didn't put it in the paper, so I guess this was something I imagined to be easier after having quite good predictions. This hand engineering should not be underestimated; having a good hand-engineered algorithm that really nails the meshing is not to be underestimated.

Josh Albrecht: One thing we generally ask is, are there any things that you would like other people to reach out to you about looking for collaborators or specific ideas or projects?

Julian Chibane: Yeah. The things that we talked about, these applications of 3D scene representations, their use cases or their further development, that's what I'm really interested in. Especially also incorporating some geometric or some physical principles within those ideas, that's something I really like. People interested in similar things are always welcome to contact me, and I'm very happy to look for collaborators to work on cool stuff.

Kanjun Qiu: Awesome.

Josh Albrecht: Great. I really enjoyed our conversation today. I think there was a lot of really cool stuff in there and the papers are great. I would encourage anyone listening to go check them out, because our descriptions don't necessarily do the pretty pictures justice. You have code online for all of them as well, I think.

Julian Chibane: Yes, yes.

Josh Albrecht: Which is also great; I always really appreciate that. If people want to go play around with them, it's all there. I really enjoyed this, and hopefully the people listening did too.

Kanjun Qiu: Thanks so much for chatting with us.

Josh Albrecht: Yeah. Thanks a ton, Julian. This is great.

Julian Chibane: Thanks. I enjoyed it too.

Kanjun Qiu: Thank you for listening to Generally Intelligent. We love feedback and questions, so please feel free to ping us on Twitter at @Kanjun and @JoshAlbrecht or email us with any ideas or corrections. If you enjoyed this podcast, please share it with fellow researchers, rate it on iTunes, and follow us on Spotify or your favorite podcast app. Thanks very much and we'll see you next time.