Weekly AI ArXiv Talk, episode 55, 22.06.19

Disentangling visual and written concepts in CLIP
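- Written text and visual concepts are entangled in CLIP; this work studies whether they can be disentangled (from MIT)
- Trains separately for "learn to spell" and "forget to spell" to disentangle the two
- Compares text-controlled image generation results and image2text retrieval performance using each feature
- Spelling information is mixed into the features, and the goal is to separate it out.
  
  The question is whether separate features can be produced. Oh, how would you even disentangle this..!?
  
  Figure 4 shows the method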
- Blue is the degree of text understanding,
  
  red is the degree of image understanding
  
  So the CLIP feature can be disentangled.
- A spelling part and a visual feature part
  
  Against multimodal attacks, that part can be protected reliably.
- The fact that they can be separated at all is interesting

OmniMAE: Single Model Masked Pretraining on Images and Videos
- A Meta AI study that adds MAE to Omnivore (CVPR 2022 oral) for joint self-supervised learning on images and videos (MAE + VideoMAE)
- No special architecture changes: the usual MAE pipeline of spatial (+temporal) patches -> mask most of them -> encoder -> decoder -> pixel reconstruction
- Joint training reportedly performs better than training on each modality separately.
- Pretraining data is ImageNet-1k and SSv2; downstream tasks include iNaturalist-2018, Places365, K400, EPIC-Kitchens, etc.
- The masking ablations should be useful for other video and image research too

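The masking step in the pipeline above can be sketched in a few lines. This is a minimal illustration, not OmniMAE's actual code: the function name `random_patch_mask`, the patch sizes, and the 0.9 mask ratio are assumptions chosen for the example, and uniform random masking is only one strategy a masking ablation might compare. An image is treated here as a one-frame clip so the same routine serves both modalities.

```python
import random


def random_patch_mask(t, h, w, patch=(2, 16, 16), mask_ratio=0.9, seed=0):
    """Split a (t, h, w) clip into patches and randomly pick which stay visible.

    All defaults are illustrative assumptions, not values from the paper.
    Returns the patch grid shape plus sorted visible and masked patch indices.
    """
    pt, ph, pw = patch
    if t % pt or h % ph or w % pw:
        raise ValueError("clip size must be divisible by the patch size")

    # Number of patches along time, height, and width.
    grid = (t // pt, h // ph, w // pw)
    num_patches = grid[0] * grid[1] * grid[2]

    # Keep at least one patch so the encoder always has some input.
    num_keep = max(1, round(num_patches * (1 - mask_ratio)))

    # Shuffle flat patch indices and split them into visible / masked sets.
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return grid, visible, masked


# Image as a single-frame clip: purely spatial patches.
img_grid, img_visible, img_masked = random_patch_mask(
    1, 224, 224, patch=(1, 16, 16)
)

# Video clip: spatio-temporal patches spanning 2 frames each.
vid_grid, vid_visible, vid_masked = random_patch_mask(16, 224, 224)

print(img_grid, len(img_visible), len(img_masked))
print(vid_grid, len(vid_visible), len(vid_masked))
```

Only the visible indices would go to the encoder; the decoder would then reconstruct pixels for the masked ones.
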
Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning
- With so many SSL methods appearing, proposes a comparison methodology and two new evaluation metrics for assessing their performance properly (CVPR 2022)
- https://mgwillia.github.io/exploring-unsupervised/

Language Models are General-Purpose Interfaces
- An LM from MSR trained jointly on image + text + multilingual data (described as a semi-causal LM)
- Available from https://github.com/microsoft/unilm (only a skeleton so far)

Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction
- https://arxiv.org/abs/2206.07085
- A paper examining the effect of weight scale invariance
  
  Scale invariance: what you get from normalizing the scale. To study it, they project onto the unit sphere and look at the second derivative, i.e. the Hessian
- Confirms that sharpness is reduced
- Effective learning rate
- So it finds a smooth loss region?
- Momentum doesn't converge well, hence AdamP?
- With BN + SGD it converges,, so does momentum actually get in the way?
- Momentum and WD?
- AdamW & 1 cycle?
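
The scale-invariance point above can be checked numerically. The sketch below is a toy example, not the paper's setup: the loss, the vectors, and the helper names (`normalize`, `loss`, `numeric_grad`) are all made up for illustration. It uses a loss that only sees the direction of `w`, which is the property a normalization layer gives the weights feeding into it, and checks three standard facts: rescaling `w` leaves the loss unchanged, the gradient shrinks by the same factor, and the gradient is orthogonal to `w`.

```python
import math


def normalize(w):
    """Project w onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]


def loss(w, x, y):
    """Toy squared error that depends only on the direction of w."""
    u = normalize(w)
    pred = sum(ui * xi for ui, xi in zip(u, x))
    return (pred - y) ** 2


def numeric_grad(f, w, eps=1e-6):
    """Central-difference gradient of f at w."""
    grad = []
    for i in range(len(w)):
        hi, lo = list(w), list(w)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


# Hypothetical weights and a single made-up data point.
w = [0.3, -1.2, 0.8]
x = [1.0, 2.0, -0.5]
y = 0.7
c = 4.0  # rescaling factor


def f(v):
    return loss(v, x, y)


scaled_w = [c * wi for wi in w]
g1 = numeric_grad(f, w)
gc = numeric_grad(f, scaled_w)

# 1) Loss is unchanged by rescaling the weights.
print("loss gap:", abs(f(w) - f(scaled_w)))
# 2) Gradient at c*w is the gradient at w divided by c.
print("max |c*grad(cw) - grad(w)|:", max(abs(c * a - b) for a, b in zip(gc, g1)))
# 3) Gradient is orthogonal to w.
print("grad . w:", sum(gi * wi for gi, wi in zip(g1, w)))
```

Because the gradient scales as 1/||w||, an SGD step with learning rate `lr` moves the direction of `w` roughly as if the learning rate were `lr / ||w||^2`, which is the usual meaning of the effective learning rate mentioned in the notes.
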