IC3 AI Residency - Part 1
A few months ago, I was delighted by the news that I had been selected as one of six people worldwide to participate in the biannual IC3 AI residency program. Today I would like to share with you what my experience has been like so far.
What is the IC3 AI residency program?
...you might be wondering. The program calls for internal employees to apply to a six-month program in which you join the teams behind the deep learning and machine learning optimizations for products such as Teams, ACS, and others. These optimizations come in the form of better noise suppression during calls, better bandwidth utilization, better camera blur filters, and many more, making the overall user experience easier, less expensive, more enjoyable, and more inclusive.
And what have you learned/done so far, Ruben?
I was fortunate enough to join a team focused on computer vision. We aim to create a video encoder using deep learning, which usually means the objective is to compress video sizes while preserving as much of the original's subjective quality as possible. However, the team has set the bar even higher: on top of the size compression, we aim to increase the video's resolution, e.g., from 480p to 720p.
The subjective quality
The subjective quality of an image or video corresponds to the quality perceived by the viewer; it is how video compression engineers measure whether a compressed video looks 'good', 'bad', or something in between. This measure is produced by so-called labelers in a controlled, purpose-built environment: watching a video under a given lighting condition, at a proper distance between viewer and screen, on a calibrated screen, with a labeler who has great eyesight, and some other conditions that you can read about in this paper (the authors of this paper are my current colleague and manager, how cool is that?).
We could go off on a tangent about the difficulty of setting up an environment like this, and about how trying to develop a tool to crowdsource this setup is super difficult but worth trying. Instead, I will just conclude with this: gathering reliable subjective scores is a difficult and expensive operation.
Wait, why not use an objective quality metric instead?
That is a great question. Historically, objective quality metrics have been the norm and they remain an active research topic. Metrics such as PSNR compare the pixel values of a reference (original) image with those of a distorted one, computing the pixel difference by means of MSE and RMSE. MSSIM, on the other hand, aims to address PSNR's shortcomings by incorporating structural information from scenes, as we humans do. Reference for PSNR and MSSIM. These objective metrics are purely analytical, and that is great; however, they have shortcomings and edge cases that prevent them from fully capturing the human visual system and the assessments humans make of an image or video. A subjective metric captures both the human visual system and the decision-making that we humans do because, of course, it is produced by humans. You could then argue that what we are really trying to measure is quality of experience, which is subjective by nature.
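To make the pixel-difference idea concrete, here is a minimal sketch (with synthetic placeholder frames) that computes PSNR from its definition and cross-checks it, together with SSIM, against scikit-image's implementations:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Synthetic 8-bit grayscale frames: a reference and a lightly distorted copy.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(480, 640), dtype=np.uint8)
noise = rng.integers(-10, 10, size=reference.shape)
distorted = np.clip(reference.astype(np.int16) + noise, 0, 255).astype(np.uint8)

# PSNR from first principles: 10 * log10(MAX^2 / MSE), MAX = 255 for 8-bit.
mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
psnr_manual = 10 * np.log10(255.0 ** 2 / mse)

# The same metric, plus SSIM, via scikit-image.
psnr = peak_signal_noise_ratio(reference, distorted, data_range=255)
ssim = structural_similarity(reference, distorted, data_range=255)

print(f"PSNR manual: {psnr_manual:.2f} dB, skimage: {psnr:.2f} dB, SSIM: {ssim:.3f}")
```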
Having videos subjectively scored by humans is a necessity, but it is also expensive and difficult. Once it is done, the next natural question is: can we define a function such that, given a video or videos, we get a subjective score?
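To make that question concrete, here is a hypothetical sketch (not our team's actual pipeline): represent each video with a feature vector and fit a regressor against human mean opinion scores. Fusing quality features with a support-vector regressor is essentially what VMAF, discussed below, does; the features and labels here are random placeholders.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Placeholder data: one feature vector per video and its human MOS label.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))    # e.g., per-video quality features
mos = rng.uniform(1.0, 5.0, size=200)    # mean opinion scores on a 1-5 scale

# Learn the mapping features -> subjective score and estimate its error.
model = SVR(kernel="rbf", C=1.0)
scores = cross_val_score(model, features, mos, cv=5,
                         scoring="neg_mean_absolute_error")
print(f"MAE across folds: {-scores.mean():.3f}")
```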
The process has been quite exciting and has led me to believe that applied research is a very interesting path for me to pursue. Going from reading and discussing papers, to implementing parts of them (or the whole thing), to testing results and iterating has been thrilling, to say the least. Fortunately, we have great minds leading this project, which helps us stay critical and objective about our own and others' results. We aim to answer every question there is instead of assuming answers, and we continuously work towards demonstrating whatever the case may be with our tests and experiments or by referencing other people's work. I believe that is what separates great applied research from the ordinary: answering every question.
To summarize my contributions so far: I have been reviewing the literature and presenting my findings and understanding to the team through our weekly reading group.

I am also supporting the benchmarking process, which translates to testing current SOTA methods out of the box and retraining them on current data. As an example, I have been playing a lot with Video Multimethod Assessment Fusion, or VMAF, a full-reference video quality assessment (FR-VQA) tool developed by Netflix to optimize the video encoding on their platform. If you want to read more about it, here is their repo. Some people might argue this is the SOTA for FR-VQA, and it is quite intriguing to me that their approach is to design handcrafted feature extractors and fuse them using machine learning algorithms. I have retrained VMAF on the current dataset and also tested it out of the box (a sketch of what a VMAF run looks like follows below). It seems to be the case that VMAF is great overall but not very good at discriminating between the best encoders, which is critical if you want to rank-order the best encoders you have.

I have also been experimenting a lot with feature extraction using pre-trained models. This is an exciting area to explore, and I think there is huge potential for improvement by leveraging transfer learning (the general recipe is sketched below). The main architecture I have been playing with is MoViNet, the SOTA for video action recognition developed by Google in 2021, but there are many other architectures I will be trying in the coming weeks.

Also, and this is a little more on the software engineering side of things, I have learned to set up AzureML compute jobs (a minimal example is sketched below as well) and have documented tips and tricks (as I call them) that have helped my team get unblocked. Lately, I have been experimenting with distributed training and inference, but that is something I still need to crack.

One last word: not exactly a contribution, but I have learned to appreciate good coding practices and documentation. They might slow you down a bit, but they will definitely allow you to keep going in the long run. I also feel it is super important in research to let everyone know exactly what happened in any given experiment, especially while working remotely.
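As promised, here is a hedged sketch of how one might score a single encoded clip with VMAF through ffmpeg's libvmaf filter. The file names are placeholders, this assumes an ffmpeg build compiled with libvmaf, and the actual benchmarking pipeline is of course more involved:

```python
import subprocess

# Placeholder file names for the encoded clip and its pristine original.
distorted, reference = "encoded_480p.mp4", "original_480p.mp4"

# With libvmaf, the first input is treated as the distorted video and the
# second as the reference; scores are written to the JSON log file.
cmd = [
    "ffmpeg", "-i", distorted, "-i", reference,
    "-lavfi", "libvmaf=log_fmt=json:log_path=vmaf.json",
    "-f", "null", "-",
]
subprocess.run(cmd, check=True)
```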
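And here is the transfer-learning recipe in its simplest form. MoViNet itself is a video model, so as a stand-in this sketch uses an image backbone (torchvision's ResNet-50) applied per frame; the idea of dropping the classification head and keeping the pooled features carries over:

```python
import torch
import torch.nn as nn
from torchvision import models

# Take a pretrained backbone and replace its classifier with a pass-through,
# so the forward pass returns pooled feature vectors instead of class logits.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()
backbone.eval()

frames = torch.randn(8, 3, 224, 224)  # 8 hypothetical video frames
with torch.no_grad():
    features = backbone(frames)       # shape (8, 2048)
print(features.shape)
```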
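Finally, a minimal sketch of submitting an AzureML compute job using the v1 Python SDK; the workspace config, environment file, cluster name, and script are all placeholders, not our actual setup:

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

# Assumes a config.json describing the workspace and an existing GPU cluster.
ws = Workspace.from_config()
env = Environment.from_conda_specification(name="vqa-env",
                                           file_path="environment.yml")

# Package a training script and point it at the remote compute target.
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target="gpu-cluster",  # hypothetical cluster name
    environment=env,
)
run = Experiment(workspace=ws, name="vqa-experiments").submit(src)
run.wait_for_completion(show_output=True)
```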
Lastly, I am humbled and privileged to be part of this team of great scientists, and I am grateful for this opportunity to contribute to the best of my ability to this super interesting project. This first month has been nothing but a confirmation of how much I love computer vision and deep learning; let's see what the coming months bring. If you liked this post, stay tuned for more updates like this and some technical deep dives that I will be releasing in the coming weeks.