IR: optimize memory buffers by reusing tensor buffers of the same typ… #16
…e. This brings a major reduction in memory use: the resnet50 model goes from 600Mb to 400Mb.

This commit includes a few changes that had to go in together:
1. Change the optimizer interface to include the kind of optimization: kNone, kTrain, kInfer.
2. Add a method to all of the instructions that describes whether they can share memory buffers.
3. A new optimization for sharing the memory buffers.
4. A new dead-buffer optimization.
5. Fix the C2 loader, which did not construct the module in the right order.
6. Add a new method, UseDef::getNumUsers.
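As a rough illustration of what "reusing tensor buffers of the same type" can mean, here is a toy model. It is a sketch under stated assumptions, not the pass added in this PR: `Activation`, `shareBuffers`, and the live-range representation are hypothetical names invented for the example, and the real optimization works on Glow IR instructions rather than on precomputed intervals.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model (hypothetical, not Glow's IR): an activation with a type and the
// inclusive range of instruction indices over which it is live.
struct Activation {
  std::string type; // e.g. "float<1x64x56x56>"
  std::size_t firstUse;
  std::size_t lastUse;
};

// Assigns a buffer id to every activation, reusing a buffer of the same type
// once its previous owner is dead. Activations are expected to be sorted by
// firstUse. Returns one buffer id per activation, in input order.
std::vector<std::size_t> shareBuffers(const std::vector<Activation> &acts) {
  struct Buffer {
    std::size_t id;
    std::size_t busyUntil; // last instruction index that uses the buffer
  };
  std::map<std::string, std::vector<Buffer>> pool; // buffers grouped by type
  std::vector<std::size_t> assignment;
  std::size_t nextId = 0;

  for (const Activation &a : acts) {
    assert(a.firstUse <= a.lastUse && "Malformed live range");
    std::vector<Buffer> &candidates = pool[a.type];
    Buffer *reuse = nullptr;
    for (Buffer &b : candidates) {
      if (b.busyUntil < a.firstUse) { // previous owner is already dead
        reuse = &b;
        break;
      }
    }
    if (reuse == nullptr) {
      // No free buffer of this type: allocate a fresh one.
      candidates.push_back({nextId++, a.lastUse});
      assignment.push_back(candidates.back().id);
    } else {
      // Hand the dead buffer over to the new activation.
      reuse->busyUntil = a.lastUse;
      assignment.push_back(reuse->id);
    }
  }
  return assignment;
}
```

In this model, two activations of type "A" whose live ranges do not overlap end up with the same buffer id, while an activation of a different type, or one whose live range overlaps, gets its own buffer.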
include/glow/IR/UseDef.h
Outdated
/// \returns the number of users that the value has.
unsigned getNumUsers() const {
  return std::distance(users_.begin(), users_.end());
include/glow/IR/UseDef.h
Outdated
void addUse(Use U) { users_.push_back(U); }

/// \returns True if the value has some users.
bool hasUsers() const { return users_.size(); }
include/glow/IR/UseDef.h
Outdated
/// Returns true if the user \p I is in the list.
bool hasUser(const UserTy *I) const {
  for (auto &U : users_) {
void optimize(Module &M);
enum class OptimizationMode {
  kNone, // Don't optimize the module.
  kTrain, // Optimize the module but allow training.
src/glow/IR/IR.cpp
Outdated
bool Instruction::mayShareBuffers(const Instruction *I) {
#define DEF_INSTR(CLASS, NAME) \
  if (auto *X = dyn_cast<const CLASS>(I)) \
    }
    it++;
  }
}
src/glow/Optimizer/Optimizer.cpp
Outdated
static void replaceAllNonDeallocUsersWith(Value *val, Value *with) {
  assert(val != with && "Replacing value with self");
  auto &lst = val->getUsers();
  std::vector<Value::Use> users(lst.begin(), lst.end());
src/glow/Optimizer/Optimizer.cpp
Outdated
// IR.
static void replaceAllNonDeallocUsersWith(Value *val, Value *with) {
  assert(val != with && "Replacing value with self");
  auto &lst = val->getUsers();
for (unsigned op = 0, ope = I->getNumOperands(); op < ope; op++) {
  auto O = I->getOperand(op);
  auto ai = dyn_cast<AllocActivationInst>(O.first);
This is great! Glow is already doing non-trivial optimizations. Are there any plans to also reuse buffers of different types? I know this would require more sophisticated analysis, but it would significantly reduce the memory capacity requirement.
Summary:

**Description**
This commit fixes two bugs in the OpenCL implementation of `BatchedReduceAddInst` and adds a few comments for clarity.

The first is a segmentation fault caused by incorporating feedback on #2958. A suggestion was made to make the loop variable `i` in the loop that computes `batchSliceSizes` count down instead of count up, but this suggestion was taken without changing the type (which was `size_t`, an unsigned type), so the loop never terminates and eventually leads to a segmentation fault.

The second bug is an incorrect computation of `destSliceSizes`. Instead of multiplying the slice size at a dimension with the number of elements in that same dimension, the code was multiplying the former with the number of elements in the *adjacent* dimension. This was surfaced by the unit test added in #2958 for `axis = 2`.

**Test Plan**
1) `ninja check` with OpenCL enabled, DEBUG mode
```
Start 1: BackendCorrectnessTest
1/34 Test #1: BackendCorrectnessTest .............. Passed 21.28 sec
Start 2: BackendTest
2/34 Test #2: BackendTest ......................... Passed 1.97 sec
Start 3: BasicIRTest
3/34 Test #3: BasicIRTest ......................... Passed 0.05 sec
Start 4: Caffe2ImporterTest
4/34 Test #4: Caffe2ImporterTest .................. Passed 3.00 sec
Start 5: DeviceManagerTest
5/34 Test #5: DeviceManagerTest ................... Passed 0.76 sec
Start 6: ThreadPoolExecutorTest
6/34 Test #6: ThreadPoolExecutorTest .............. Passed 1.48 sec
Start 7: Float16Test
7/34 Test #7: Float16Test ......................... Passed 0.01 sec
Start 8: GemmTest
8/34 Test #8: GemmTest ............................ Passed 0.05 sec
Start 9: GlowOnnxifiManagerTest
9/34 Test #9: GlowOnnxifiManagerTest .............. Passed 0.06 sec
Start 10: GradCheckTest
10/34 Test #10: GradCheckTest ....................... Passed 4.72 sec
Start 11: GraphGradTest
11/34 Test #11: GraphGradTest ....................... Passed 0.06 sec
Start 12: GraphOptzTest
12/34 Test #12: GraphOptzTest ....................... Passed 0.03 sec
Start 13: GraphSchedulerTest
13/34 Test #13: GraphSchedulerTest .................. Passed 0.01 sec
Start 14: GraphTest
14/34 Test #14: GraphTest ........................... Passed 1.03 sec
Start 15: HostManagerTest
15/34 Test #15: HostManagerTest ..................... Passed 7.49 sec
Start 16: HyphenTest
16/34 Test #16: HyphenTest .......................... Passed 1.17 sec
Start 17: IROptTest
17/34 Test #17: IROptTest ........................... Passed 0.01 sec
Start 18: ImageTest
18/34 Test #18: ImageTest ........................... Passed 0.31 sec
Start 19: LLVMIRGenTest
19/34 Test #19: LLVMIRGenTest ....................... Passed 0.01 sec
Start 20: MLTest
20/34 Test #20: MLTest .............................. Passed 46.30 sec
Start 21: MemoryAllocatorTest
21/34 Test #21: MemoryAllocatorTest ................. Passed 0.03 sec
Start 22: OCLTest
22/34 Test #22: OCLTest ............................. Passed 0.24 sec
Start 23: OnnxImporterTest
23/34 Test #23: OnnxImporterTest .................... Passed 0.12 sec
Start 24: OperatorGradTest
24/34 Test #24: OperatorGradTest .................... Passed 0.05 sec
Start 25: OperatorTest
25/34 Test #25: OperatorTest ........................ Passed 14.47 sec
Start 26: PartitionerTest
26/34 Test #26: PartitionerTest ..................... Passed 0.05 sec
Start 28: ProvisionerTest
27/34 Test #28: ProvisionerTest ..................... Passed 1.00 sec
Start 29: QuantizationTest
28/34 Test #29: QuantizationTest .................... Passed 7.46 sec
Start 30: TensorsTest
29/34 Test #30: TensorsTest ......................... Passed 0.36 sec
Start 31: TensorPoolTest
30/34 Test #31: TensorPoolTest ...................... Passed 0.01 sec
Start 32: ThreadPoolTest
31/34 Test #32: ThreadPoolTest ...................... Passed 0.01 sec
Start 33: TraceEventsTest
32/34 Test #33: TraceEventsTest ..................... Passed 10.62 sec
Start 34: TypeAToTypeBFunctionConverterTest
33/34 Test #34: TypeAToTypeBFunctionConverterTest ... Passed 0.06 sec
Start 35: UtilsTest
34/34 Test #35: UtilsTest ........................... Passed 0.02 sec

100% tests passed, 0 tests failed out of 34

Total Test time (real) = 124.33 sec
```
2) `ninja check` with OpenCL enabled, RELEASE mode
```
Start 1: BackendCorrectnessTest
1/34 Test #1: BackendCorrectnessTest .............. Passed 11.51 sec
Start 2: BackendTest
2/34 Test #2: BackendTest ......................... Passed 1.53 sec
Start 3: BasicIRTest
3/34 Test #3: BasicIRTest ......................... Passed 0.02 sec
Start 4: Caffe2ImporterTest
4/34 Test #4: Caffe2ImporterTest .................. Passed 0.62 sec
Start 5: DeviceManagerTest
5/34 Test #5: DeviceManagerTest ................... Passed 0.83 sec
Start 6: ThreadPoolExecutorTest
6/34 Test #6: ThreadPoolExecutorTest .............. Passed 0.71 sec
Start 7: Float16Test
7/34 Test #7: Float16Test ......................... Passed 0.01 sec
Start 8: GemmTest
8/34 Test #8: GemmTest ............................ Passed 0.31 sec
Start 9: GlowOnnxifiManagerTest
9/34 Test #9: GlowOnnxifiManagerTest .............. Passed 0.33 sec
Start 10: GradCheckTest
10/34 Test #10: GradCheckTest ....................... Passed 1.90 sec
Start 11: GraphGradTest
11/34 Test #11: GraphGradTest ....................... Passed 0.32 sec
Start 12: GraphOptzTest
12/34 Test #12: GraphOptzTest ....................... Passed 0.03 sec
Start 13: GraphSchedulerTest
13/34 Test #13: GraphSchedulerTest .................. Passed 0.02 sec
Start 14: GraphTest
14/34 Test #14: GraphTest ........................... Passed 0.59 sec
Start 15: HostManagerTest
15/34 Test #15: HostManagerTest ..................... Passed 10.61 sec
Start 16: HyphenTest
16/34 Test #16: HyphenTest .......................... Passed 4.18 sec
Start 17: IROptTest
17/34 Test #17: IROptTest ........................... Passed 0.04 sec
Start 18: ImageTest
18/34 Test #18: ImageTest ........................... Passed 0.10 sec
Start 19: LLVMIRGenTest
19/34 Test #19: LLVMIRGenTest ....................... Passed 0.71 sec
Start 20: MLTest
20/34 Test #20: MLTest .............................. Passed 52.44 sec
Start 21: MemoryAllocatorTest
21/34 Test #21: MemoryAllocatorTest ................. Passed 0.03 sec
Start 22: OCLTest
22/34 Test #22: OCLTest ............................. Passed 0.96 sec
Start 23: OnnxImporterTest
23/34 Test #23: OnnxImporterTest .................... Passed 0.89 sec
Start 24: OperatorGradTest
24/34 Test #24: OperatorGradTest .................... Passed 0.76 sec
Start 25: OperatorTest
25/34 Test #25: OperatorTest ........................ Passed 33.00 sec
Start 26: PartitionerTest
26/34 Test #26: PartitionerTest ..................... Passed 0.79 sec
Start 28: ProvisionerTest
27/34 Test #28: ProvisionerTest ..................... Passed 3.00 sec
Start 29: QuantizationTest
28/34 Test #29: QuantizationTest .................... Passed 19.64 sec
Start 30: TensorsTest
29/34 Test #30: TensorsTest ......................... Passed 0.09 sec
Start 31: TensorPoolTest
30/34 Test #31: TensorPoolTest ...................... Passed 0.04 sec
Start 32: ThreadPoolTest
31/34 Test #32: ThreadPoolTest ...................... Passed 0.04 sec
Start 33: TraceEventsTest
32/34 Test #33: TraceEventsTest ..................... Passed 13.18 sec
Start 34: TypeAToTypeBFunctionConverterTest
33/34 Test #34: TypeAToTypeBFunctionConverterTest ... Passed 0.87 sec
Start 35: UtilsTest
34/34 Test #35: UtilsTest ........................... Passed 0.04 sec

100% tests passed, 0 tests failed out of 34

Total Test time (real) = 160.15 sec
```
3) `ninja check` with OpenCL enabled, ASAN+UBSAN mode
```
Start 1: BackendCorrectnessTest
1/34 Test #1: BackendCorrectnessTest .............. Passed 65.05 sec
Start 2: BackendTest
2/34 Test #2: BackendTest ......................... Passed 5.42 sec
Start 3: BasicIRTest
3/34 Test #3: BasicIRTest ......................... Passed 0.09 sec
Start 4: Caffe2ImporterTest
4/34 Test #4: Caffe2ImporterTest .................. Passed 11.51 sec
Start 5: DeviceManagerTest
5/34 Test #5: DeviceManagerTest ................... Passed 1.93 sec
Start 6: ThreadPoolExecutorTest
6/34 Test #6: ThreadPoolExecutorTest .............. Passed 5.08 sec
Start 7: Float16Test
7/34 Test #7: Float16Test ......................... Passed 0.03 sec
Start 8: GemmTest
8/34 Test #8: GemmTest ............................ Passed 0.22 sec
Start 9: GlowOnnxifiManagerTest
9/34 Test #9: GlowOnnxifiManagerTest .............. Passed 0.18 sec
Start 10: GradCheckTest
10/34 Test #10: GradCheckTest ....................... Passed 15.40 sec
Start 11: GraphGradTest
11/34 Test #11: GraphGradTest ....................... Passed 0.22 sec
Start 12: GraphOptzTest
12/34 Test #12: GraphOptzTest ....................... Passed 0.12 sec
Start 13: GraphSchedulerTest
13/34 Test #13: GraphSchedulerTest .................. Passed 0.03 sec
Start 14: GraphTest
14/34 Test #14: GraphTest ........................... Passed 3.00 sec
Start 15: HostManagerTest
15/34 Test #15: HostManagerTest ..................... Passed 13.79 sec
Start 16: HyphenTest
16/34 Test #16: HyphenTest .......................... Passed 3.47 sec
Start 17: IROptTest
17/34 Test #17: IROptTest ........................... Passed 0.05 sec
Start 18: ImageTest
18/34 Test #18: ImageTest ........................... Passed 1.08 sec
Start 19: LLVMIRGenTest
19/34 Test #19: LLVMIRGenTest ....................... Passed 0.05 sec
Start 20: MLTest
20/34 Test #20: MLTest .............................. Passed 141.01 sec
Start 21: MemoryAllocatorTest
21/34 Test #21: MemoryAllocatorTest ................. Passed 0.08 sec
Start 22: OCLTest
22/34 Test #22: OCLTest ............................. Passed 0.64 sec
Start 23: OnnxImporterTest
23/34 Test #23: OnnxImporterTest .................... Passed 0.51 sec
Start 24: OperatorGradTest
24/34 Test #24: OperatorGradTest .................... Passed 0.14 sec
Start 25: OperatorTest
25/34 Test #25: OperatorTest ........................ Passed 35.78 sec
Start 26: PartitionerTest
26/34 Test #26: PartitionerTest ..................... Passed 0.20 sec
Start 28: ProvisionerTest
27/34 Test #28: ProvisionerTest ..................... Passed 2.25 sec
Start 29: QuantizationTest
28/34 Test #29: QuantizationTest .................... Passed 17.17 sec
Start 30: TensorsTest
29/34 Test #30: TensorsTest ......................... Passed 1.28 sec
Start 31: TensorPoolTest
30/34 Test #31: TensorPoolTest ...................... Passed 0.03 sec
Start 32: ThreadPoolTest
31/34 Test #32: ThreadPoolTest ...................... Passed 0.05 sec
Start 33: TraceEventsTest
32/34 Test #33: TraceEventsTest ..................... Passed 32.11 sec
Start 34: TypeAToTypeBFunctionConverterTest
33/34 Test #34: TypeAToTypeBFunctionConverterTest ... Passed 0.15 sec
Start 35: UtilsTest
34/34 Test #35: UtilsTest ........................... Passed 0.07 sec

100% tests passed, 0 tests failed out of 34

Total Test time (real) = 358.24 sec
```

Pull Request resolved: #3118
Differential Revision: D15836207
Pulled By: SplitInfinity
fbshipit-source-id: 7bfa3c6ed5583d6a8f42b1f712f359e8e1d10b47
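Both bugs described in that commit message are easy to reproduce outside of Glow. The sketch below is a stand-alone illustration, not the OpenCL backend code: `sliceSizes` is a hypothetical helper, and it assumes a row-major layout in which the slice size at dimension `i` is the product of the extents of all dimensions after `i`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not Glow's implementation): sizes[i] is the number of
// elements covered by one step along dimension i of a row-major tensor.
std::vector<std::size_t> sliceSizes(const std::vector<std::size_t> &dims) {
  std::vector<std::size_t> sizes(dims.size(), 1);

  // Counting down with an unsigned index needs care. A loop written as
  //   for (std::size_t i = dims.size() - 1; i >= 0; --i)
  // never terminates, because `i >= 0` is always true for an unsigned type
  // and decrementing 0 wraps around to a huge value. Testing `i-- > 1`
  // performs the comparison before the decrement, so the loop stops.
  for (std::size_t i = dims.size(); i-- > 1;) {
    // Multiply the slice size at dimension i by the extent of that SAME
    // dimension. Using the extent of the adjacent dimension instead
    // (dims[i - 1]) is the kind of mistake the second bug describes.
    sizes[i - 1] = sizes[i] * dims[i];
  }

  // The innermost dimension always has a slice size of one element.
  assert(sizes.empty() || sizes.back() == 1);
  return sizes;
}
```

For a tensor with dimensions {2, 3, 4} this yields slice sizes {12, 4, 1}. An alternative to the `i-- > 1` idiom is to switch the loop variable to a signed type, at the cost of a cast on every index.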