Learning fastai part 2

the last four days i learned:

  • reimplemented the backward pass of ColumnParallelLinear and the forward pass of RowParallelLinear in Megatron-LM, VocabParallelEmbedding in GPT-NeoX, and 1/3 of visualizing attention patterns (see the row-parallel sketch after this list)

  • 3/10 of how to implement schedule execution order and calculation dependencies in TorchGPipe, and 2/5 of how to launch a training pipeline on Kubernetes
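
for the row-parallel part, here is a minimal sketch of the idea (my own toy class, not Megatron-LM's actual code; it assumes `torch.distributed` is already initialized with one process per GPU, and shows the forward pass only, since Megatron wraps the collectives in autograd functions so backward gets the matching communication):

```python
import torch
import torch.distributed as dist

class RowParallelLinearSketch(torch.nn.Module):
    """y = x @ W.T with W split along its input (row) dimension:
    each rank multiplies its input shard by its weight shard, and an
    all-reduce sums the partial outputs into the full result."""
    def __init__(self, in_features, out_features):
        super().__init__()
        world = dist.get_world_size()
        assert in_features % world == 0
        self.weight = torch.nn.Parameter(
            torch.empty(out_features, in_features // world))
        torch.nn.init.xavier_uniform_(self.weight)

    def forward(self, x_shard):
        # x_shard: (..., in_features // world); typically the output of
        # a preceding column-parallel layer, so no scatter is needed here
        partial = x_shard @ self.weight.t()
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```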

the last three days i learned: 2/5 of how to calculate head attribution and visualize attention patterns (sketch below), reimplemented 1/5 of GPU allocation in Megatron-LM, 3.5/10 of how to implement schedule execution order and backpropagation dependencies in TorchGPipe, and some basics of torch distributed RPC
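
for the attention-pattern part, a tiny sketch using HuggingFace's GPT-2 (the prompt and the layer/head choice are arbitrary):

```python
import torch
import matplotlib.pyplot as plt
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True).eval()

text = "When Mary and John went to the store, John gave a drink to"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

layer, head = 5, 1                         # arbitrary pick, try others
attn = out.attentions[layer][0, head]      # (seq, seq), rows are queries
labels = [tok.decode(t) for t in inputs.input_ids[0]]

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(labels)), labels, rotation=90)
plt.yticks(range(len(labels)), labels)
plt.title(f"layer {layer}, head {head}")
plt.colorbar()
plt.show()
```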


the last three days i learned: 4.0/10 of how to implement schedule execution order and backpropagation dependencies in TorchGPipe (see the schedule sketch below), reimplemented 2/5 of GPU allocation in Megatron-LM, how to orchestrate an ML workflow, some basics of AWS and memory management, and 2.5/5 of how to calculate head attribution and visualize attention patterns
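
the schedule itself is surprisingly small once you see it: micro-batch i on partition j can run at clock i + j. this is my own re-derivation in the spirit of torchgpipe's `clock_cycles`, not its exact code:

```python
def clock_cycles(num_microbatches: int, num_partitions: int):
    """Yield, per clock tick, the (microbatch, partition) tasks that may
    run concurrently: task (i, j) depends on (i, j - 1), so scheduling
    it at clock i + j respects the stage order and keeps every device
    busy on at most one micro-batch per tick."""
    m, n = num_microbatches, num_partitions
    for k in range(m + n - 1):
        yield [(i, k - i) for i in range(max(0, k - n + 1), min(m, k + 1))]

for tasks in clock_cycles(4, 3):
    print(tasks)
# [(0, 0)]
# [(0, 1), (1, 0)]
# [(0, 2), (1, 1), (2, 0)]
# [(1, 2), (2, 1), (3, 0)]
# [(2, 2), (3, 1)]
# [(3, 2)]
```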

the last four days i learned: reimplemented activation patching (sketch below), 3.5/5 of GPU allocation in Megatron-LM (close :wink:, will share code in 2-3 days), 1/3 of parameter partitioning in the ZeRO optimizer, 5.0/10 of how to implement schedule execution order and backpropagation dependencies in TorchGPipe, and some basics of operating systems and AWS VPC
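
activation patching in a nutshell, as i understand it: cache an activation from a clean run, overwrite it in a corrupted run, and see how much of the behavior comes back. a coarse sketch with HuggingFace GPT-2 that patches one whole block's output at every position (real IOI-style experiments patch per position and per head):

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("When John and Mary went to the store, John gave a drink to",
            return_tensors="pt").input_ids
corrupt = tok("When John and Mary went to the store, Mary gave a drink to",
              return_tensors="pt").input_ids

block, cache = model.transformer.h[8], {}   # layer 8: arbitrary choice

def save_hook(module, args, output):
    cache["resid"] = output[0].detach()     # this block's hidden states

def patch_hook(module, args, output):
    return (cache["resid"],) + output[1:]   # swap in the clean activation

h = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(clean)                            # clean run: fill the cache
h.remove()

h = block.register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(corrupt).logits          # corrupted run, patched
h.remove()

mary, john = tok(" Mary").input_ids[0], tok(" John").input_ids[0]
print("logit diff (Mary - John):",
      (logits[0, -1, mary] - logits[0, -1, john]).item())
```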

the last four days i learned: reimplemented 5/5 of GPU allocation in Megatron-LM, automatic head detection based on target patterns (sketch below), 0.5/5 of reversing an IOI circuit in GPT-2, 1/5 of IndexedCachedDataset and MMapIndexedDataset in fairseq, and scheduled an ML workflow using AWS Step Functions
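
head detection by target pattern, sketched for one simple target (previous-token heads): score every head by how much attention mass it puts on the immediately preceding token, then take the argmax. the scoring function and the threshold-free argmax are my simplifications:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True).eval()

ids = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    attns = model(**ids).attentions          # tuple of (1, heads, q, k)

def prev_token_score(attn):
    # attention mass each query puts on the immediately preceding token
    return attn.diagonal(offset=-1, dim1=-2, dim2=-1).mean(-1)

scores = torch.stack([prev_token_score(a[0]) for a in attns])  # (layers, heads)
layer, head = divmod(scores.argmax().item(), scores.shape[1])
print(f"most previous-token-ish head: L{layer}H{head}, "
      f"score {scores.max():.2f}")
```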

the last three days i learned: reimplemented cached datasets, 1/2 of discovering latent knowledge using CCS (sketch below), 1/5 of input swap graphs (discovering the role of neural network components), 2/5 of ParallelMLP in Megatron-LM, and some basics of batch processing using Apache Spark and of operating systems
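
the CCS objective is small enough to fit in a few lines. a toy sketch on random vectors (the real method from Burns et al. normalizes the hidden states per class and keeps the best of several random restarts):

```python
import torch

def ccs_loss(p_pos, p_neg):
    """CCS objective (Burns et al., 2022): a statement and its negation
    should get consistent (summing to 1) and confident probabilities."""
    consistency = (p_pos - (1 - p_neg)) ** 2
    confidence = torch.min(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# stand-in data: pretend these are hidden states of "X? yes" / "X? no"
d = 64
x_pos, x_neg = torch.randn(128, d), torch.randn(128, d)
probe = torch.nn.Linear(d, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    loss = ccs_loss(torch.sigmoid(probe(x_pos)), torch.sigmoid(probe(x_neg)))
    loss.backward()
    opt.step()
print("final CCS loss:", loss.item())
```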

the last three days i learned: reimplemented (3/5 of parameter partitioning and .step() in the ZeRO optimizer; sketch below), learned (how to move data from a data lake to a data catalog, and some basics of batch and stream processing using Apache Spark and Kafka)
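
the partitioning idea in toy form (my own sketch, nothing like DeepSpeed's real flat-buffer implementation, which uses reduce-scatter/all-gather instead of per-tensor broadcasts): each rank owns a slice of the parameters, steps only on its slice, then shares the fresh values:

```python
import torch
import torch.distributed as dist

class ShardedSGDSketch:
    """Rank r owns params[i] where i % world_size == r, keeps optimizer
    state only for those, and broadcasts the fresh values after step()."""
    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr
        self.rank, self.world = dist.get_rank(), dist.get_world_size()
        self.owned = [i for i in range(len(self.params))
                      if i % self.world == self.rank]

    @torch.no_grad()
    def step(self):
        for i in self.owned:                  # update only the owned shard
            p = self.params[i]
            if p.grad is not None:
                p.add_(p.grad, alpha=-self.lr)
        for i, p in enumerate(self.params):   # owners publish new values
            dist.broadcast(p, src=i % self.world)
```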

the last two days i learned: how to version data and the GPU memory hierarchy, built (1/3 of a distribution-drift monitoring and logging service for model inference, 1/3 of a data warehouse for training data using BigQuery); the drift metric is sketched below
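
for the drift part, a self-contained sketch of PSI, one common drift metric (bin edges come from training-time quantiles; the 0.2 threshold is just a rule of thumb, and this is my toy version, not the service's actual code):

```python
import numpy as np

def psi(reference, live, bins=10, eps=1e-6):
    """Population stability index between the training-time distribution
    of a feature and live inference traffic."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    ref = np.histogram(reference, edges)[0] / len(reference) + eps
    cur = np.histogram(np.clip(live, edges[0], edges[-1]), edges)[0] \
        / len(live) + eps
    return float(np.sum((cur - ref) * np.log(cur / ref)))

train_scores = np.random.normal(0.0, 1.0, 10_000)
prod_scores = np.random.normal(0.4, 1.2, 1_000)      # drifted traffic
print(f"PSI = {psi(train_scores, prod_scores):.3f}")  # > 0.2: investigate
```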

the last few days i learned: built (3/3 of the distribution-drift monitoring and logging service for model inference, 1/5 of a raw-data-to-data-warehouse pipeline), and some basics of (operating systems, and C++)

the last three days i learned: reimplemented (isolating the effect of a computational path in a neural network using path patching, 2/5 of synchronous data transfer between CUDA streams on GPUs in torchgpipe; the stream pattern is sketched below), built 2/5 of a raw-data-to-data-warehouse pipeline, and learned some basics of (Kubernetes, operating systems, and C++)
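
the CUDA-stream part boils down to: do the copy on a side stream, and make both sides synchronize with it. a sketch of the pattern (names are mine; torchgpipe's real version also threads this through autograd):

```python
import torch

def copy_on_side_stream(x: torch.Tensor, dst: torch.device) -> torch.Tensor:
    copy_stream = torch.cuda.Stream(device=dst)
    # the copy must not start before the producer has finished writing x
    copy_stream.wait_stream(torch.cuda.current_stream(x.device))
    with torch.cuda.stream(copy_stream):
        y = x.to(dst, non_blocking=True)
    # the consumer must not read y before the copy is done, and x must
    # not be freed or reused while the copy is still in flight
    torch.cuda.current_stream(dst).wait_stream(copy_stream)
    x.record_stream(copy_stream)
    return y

if torch.cuda.device_count() >= 2:
    a = torch.randn(1024, 1024, device="cuda:0")
    b = copy_on_side_stream(a, torch.device("cuda:1"))
    print(b.device)
```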

the last three days i learned: reimplemented [(schedule execution order, 2/5 of gradient checkpointing, 3.5/5 of synchronous data transfer between CUDA streams on GPUs in torchgpipe), 1/5 of auto circuit discovery in a neural network using ACDC, 2/5 of steering a language model at run-time by adding activation vectors (sketch below)], and learned some basics of (C++, Kubernetes, and operating systems)
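
the steering idea, sketched ActAdd-style with HuggingFace GPT-2 (my simplification adds the vector at every position and every decoding step; the paper is more careful about where it injects):

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, COEFF = 6, 5.0                       # arbitrary choices
block = model.transformer.h[LAYER]

def resid(prompt):
    cache = {}
    h = block.register_forward_hook(
        lambda m, args, out: cache.update(r=out[0].detach()))
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    h.remove()
    return cache["r"][0, -1]                # residual at the last token

# contrast pair -> steering vector
steer = COEFF * (resid(" Love") - resid(" Hate"))

def add_vector(module, args, output):
    return (output[0] + steer,) + output[1:]

h = block.register_forward_hook(add_vector)
ids = tok("I think you are", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=15, do_sample=False,
                     pad_token_id=tok.eos_token_id)
h.remove()
print(tok.decode(out[0]))
```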

the last three days i learned: reimplemented 2.5/5 of gradient checkpointing in torchgpipe (the core trade-off is sketched below), learned some basics of operating systems and C++ (currently stuck on the circuit and the data transfer, but hell no, i’m not gonna give up)
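
the whole point of gradient checkpointing in one runnable picture, using plain `torch.utils.checkpoint` rather than torchgpipe's internal version:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    *[torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
      for _ in range(8)])

x = torch.randn(64, 1024, requires_grad=True)
# forward keeps activations only at 4 segment boundaries; everything in
# between is recomputed during backward (compute traded for memory)
y = checkpoint_sequential(model, 4, x)
y.sum().backward()
print(x.grad.shape)
```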

the last three days i learned: reimplemented (spawning workers for executing tasks, enforcing virtual dependencies in the backward graph in torchgpipe’s pipeline parallelism; fork/join sketch below), learned some basics of operating systems (yep, still stuck on auto circuit discovery)
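
the virtual-dependency trick, re-sketched from torchgpipe's fork/join idea: split a zero-sized "phony" tensor off one branch and join it into another, so autograd's backward order is pinned even though no real data flows:

```python
import torch

class Fork(torch.autograd.Function):
    """Split a zero-sized 'phony' tensor off x without changing x."""
    @staticmethod
    def forward(ctx, x):
        return x.detach(), x.new_empty(0)
    @staticmethod
    def backward(ctx, grad_x, grad_phony):
        return grad_x

class Join(torch.autograd.Function):
    """Make y depend on phony: the backward graph now has an edge from
    y's branch to phony's producer, although no data flows through it."""
    @staticmethod
    def forward(ctx, y, phony):
        return y.detach()
    @staticmethod
    def backward(ctx, grad_y):
        return grad_y, None

# toy usage: backward below the fork point on a's path cannot proceed
# until backward on b's path has passed through the join
a = torch.randn(3, requires_grad=True) * 2
b = torch.randn(3, requires_grad=True) * 3
a, phony = Fork.apply(a)
b = Join.apply(b, phony)
(a.sum() + b.sum()).backward()
```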

the last four days i learned: reimplemented (3.5/5 of gradient checkpointing in torchgpipe, 4/5 of steering a language model at runtime by adding activation vectors, made a little progress on reversing an IOI circuit in GPT-2), learned some basics of (operating systems, and HPC)

the last three days i learned: reimplemented [0.1/5 of discovery of new nodes at runtime in elastic training and 0.1/5 of fault tolerance in Horovod (yep, from scratch, this is just a very first step), 4.5/5 of logit attribution (filling my gaps)], and some basics of (CUDA programming and torch rpc)

the last three days i learned: reimplemented (2/5 of the notification mechanism and 1.5/5 of fault tolerance in Horovod elastic training, 4.5/5 of gradient checkpointing in torchgpipe, 5/5 of logit attribution, sketched below), and learned some basics of torch rpc
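
direct logit attribution, sketched with HuggingFace GPT-2 and the usual frozen-LayerNorm approximation (i freeze ln_f's statistics from the full residual so each block's write decomposes linearly; the embedding and bias terms are left out, and TransformerLens does all of this more carefully):

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained(
    "gpt2", output_hidden_states=True).eval()

ids = tok("The Eiffel Tower is in the city of", return_tensors="pt")
target = tok(" Paris").input_ids[0]
with torch.no_grad():
    out = model(**ids)

hs = torch.stack([h[0, -1] for h in out.hidden_states])  # (13, d_model)
deltas = hs[1:] - hs[:-1]             # what each block wrote to the stream

final = hs[-1]                        # freeze ln_f's stats on the full sum
scale = (final - final.mean()).pow(2).mean() \
    .add(model.config.layer_norm_epsilon).rsqrt()
contrib = (deltas - deltas.mean(-1, keepdim=True)) * scale \
    * model.transformer.ln_f.weight
attribution = contrib @ model.lm_head.weight[target]     # one value per block
for l, a in enumerate(attribution):
    print(f"block {l:2d}: {a:+.3f}")
```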

the last four days i learned: reimplemented (1.5/5 of transferring data for skip connections in torchgpipe’s pipeline parallelism, 2.8/5 of the notification mechanism and 2.5/5 of discovering new nodes at run-time in Horovod elastic training, 5/5 of the logit lens (sketch below), 0.5/5 of reversing an induction circuit from a transformer’s weights), and some basics of CUDA programming
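
the logit lens fits in a loop: decode every intermediate residual stream with the final LayerNorm + unembedding, as if the model stopped there. sketch with HuggingFace GPT-2:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained(
    "gpt2", output_hidden_states=True).eval()

ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**ids)

# decode every intermediate residual stream at the last position
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d} ->", repr(tok.decode([logits.argmax().item()])))
```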

the last four days i learned: reimplemented (2.5/5 of syncing state across workers and 2.5/5 of the elastic sampler in Horovod elastic training (sampler sketch below), 1/5 of reversing an induction circuit from a transformer’s weights), and some basics of AI alignment
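
the elastic sampler boils down to: remember what has already been consumed, and re-shard only the rest when the worker set changes. my toy version of the idea (Horovod's real ElasticSampler does more bookkeeping):

```python
class ElasticSamplerSketch:
    """When workers join or leave, re-shard the not-yet-processed
    indices across the new (rank, world_size) so nothing is lost or
    duplicated."""
    def __init__(self, num_samples, rank, world_size):
        self.indices = list(range(num_samples))
        self.consumed = 0
        self.set_topology(rank, world_size)

    def set_topology(self, rank, world_size):
        # called again after the worker set changes
        self.rank, self.world_size = rank, world_size

    def record_batch(self, batch_size):
        # all ranks advance together: one step eats world_size * batch
        self.consumed += batch_size * self.world_size

    def __iter__(self):
        remaining = self.indices[self.consumed:]
        return iter(remaining[self.rank::self.world_size])
```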

the last four days i learned: reimplemented (4/5 of reversing an induction circuit from a transformer’s weights, 1.5/5 on superposition in neural networks (digging deeper this time; toy-model sketch below), 3/5 of syncing state across workers in Horovod elastic training)
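
the superposition toy model (in the spirit of Anthropic's toy-models paper; the hyperparameters are mine): squeeze n sparse features through h < n dimensions and look at W^T W:

```python
import torch

n, h, batch, sparsity = 20, 5, 1024, 0.05
W = torch.nn.Parameter(torch.randn(h, n) * 0.1)
b = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(2000):
    # sparse features: each one is active ~5% of the time
    x = torch.rand(batch, n) * (torch.rand(batch, n) < sparsity)
    x_hat = torch.relu(x @ W.t() @ W + b)   # encode to h dims, decode back
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# large off-diagonal entries of W^T W mean features share dimensions,
# i.e. they are stored in superposition
print((W.t() @ W).detach().round(decimals=2))
```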

![SCR-20230709-luap|690x316](upload://qBnAO10M2sZ3yrZTYBeVqMUIMiQ.jpeg)

the last three days i learned: reimplemented [1.5/5 of the elastic driver (this one controls worker nodes, executes jobs, monitors and collects results), 3.5/5 of the notification mechanism, 5/5 of restoring a synchronous state from its last backup state (but not triggering a sync across all workers yet) in Horovod, 5/5 of an induction circuit (detection sketch below), 0.5/5 of a modular arithmetic circuit, and 0.5/5 of a balanced bracket classifier from transformer weights]
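
for the induction circuit, the quickest sanity check i know: feed a repeated random sequence and score every head on how much it attends one step past the current token's first occurrence. sketch with HuggingFace GPT-2:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True).eval()

# repeated random tokens [t1..tT, t1..tT]: an induction head attends
# from each token in the second half to the token right after that
# token's first occurrence
T = 50
seq = torch.randint(100, 20_000, (1, T))
ids = torch.cat([seq, seq], dim=1)
with torch.no_grad():
    attns = model(ids).attentions

scores = torch.stack([
    a[0, :, T:, :].diagonal(offset=1, dim1=-2, dim2=-1).mean(-1)
    for a in attns])                               # (layers, heads)
layer, head = divmod(scores.argmax().item(), scores.shape[1])
print(f"most induction-like head: L{layer}H{head}, score {scores.max():.2f}")
```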