How to avoid importing modules that a job doesn't make use of #28631
Replies: 4 comments
-
Any help would be greatly appreciated.
-
I've been looking into this lately as well. My first thought was to leverage lazy imports somehow, but I'm still trying to understand how to make that work with Dagster. One key to this problem is knowing which command Dagster actually uses when executing the run, which I think depends on the deployment setup used.
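For what it's worth, the lazy-import pattern itself is plain Python and independent of how Dagster launches the run: move the heavy import from module top level into the function (or resource method) that uses it, so merely loading the code stays cheap. A minimal dependency-free sketch, with stdlib csv/io standing in for a heavy dependency like pandas:

```python
def parse_rows(text):
    """Parse CSV text. The imports happen on first call, not when this
    module is loaded -- the deferred-import pattern in miniature."""
    import csv  # stands in for a heavy dependency such as pandas
    import io
    return list(csv.reader(io.StringIO(text)))
```

A process that imports this module but never calls `parse_rows` never pays for the heavy dependency.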
-
I'm interested in understanding this more as well. In particular, I have a repo with a large number of assets generated by an asset factory, which can take a long time to load.
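For readers unfamiliar with the pattern: an asset factory is just a function that builds and returns asset definitions in a loop, so one config list fans out into many assets. A dependency-free sketch of the shape (in real code the inner function would be decorated with Dagster's @asset; the names here are hypothetical):

```python
def make_asset(name):
    """Hypothetical factory: returns one 'asset' per name. In Dagster the
    inner function would be wrapped with @asset(name=name)."""
    def _compute():
        return f"data for {name}"
    _compute.__name__ = name
    return _compute

# One call site fans out into many definitions -- which is also why
# load time grows with the length of this list.
assets = [make_asset(n) for n in ["users", "orders", "events"]]
```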
-
@nsteins After discovering the
-
I commonly construct my Dagster codespaces with a file structure as follows:
The main __init__.py contains my Definitions object.
Imagine I have created multiple custom resource classes performing various functions. They each inherit from ConfigurableResource. Some resources make use of pandas, others make use of scikit-learn, and so on.
At the top of each resource file I import modules I will be making use of in that file.
I then create instances of those resources in the resources/__init__.py file, making use of env vars. To do this I obviously import from each [custom_resource].py file. I then import these instances in the main __init__.py file and add them to my Definitions object. The import looks something like this:
from .resources import resource_1, resource_2, resource_3
A very common scenario for me is that a run only makes use of one or two of the resources. But when executing these runs, all of the resources and their accompanying modules are imported. All of the accompanying imports might take up 300MB of memory, while the ones I actually need for a run might only take 150MB. With the multiprocess executor, if I am running 8 processes in parallel, the total memory usage is 300MB * 8 = ~2.4GB; if I only imported the modules I need, it would be half that. In a codespace containing many expensive modules and some degree of potential parallelisation, this leads to far too much memory being consumed.
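Spelling out the arithmetic (the per-process figures are the illustrative numbers from above, not measurements):

```python
procs = 8          # parallel processes under the multiprocess executor
all_mb = 300       # resident memory when every module is imported
needed_mb = 150    # resident memory with only the imports a run needs

total_all = all_mb * procs        # 2400 MB, i.e. ~2.4 GB
total_needed = needed_mb * procs  # 1200 MB, half of that
print(total_all, total_needed)
```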
One way I can get around this is to create a separate code repository that hosts the assets using the more obscure, expensive modules, preventing them from being imported in the more common case. However, this doesn't always feel natural. One natural separation would be one codespace for extract and load, another for transformation, and perhaps another for machine learning. This helps, but for a smaller project it would be nice to maintain a single codespace and prevent the unnecessary imports from happening.
Is there a way to set up my import chain, file structure, etc., to prevent these unnecessary imports of modules in runs that do not use them?
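One approach that should work regardless of file layout, assuming the cost really is module-level imports: keep the resource classes cheap to import and defer the heavy imports into the methods that use them. A sketch with a stdlib module as a placeholder for pandas/scikit-learn (the class below is a plain stand-in, not a real ConfigurableResource, and the names are hypothetical):

```python
class StatsResource:
    """Stand-in for a ConfigurableResource subclass. The 'heavy' dependency
    (statistics here, as a placeholder for pandas or scikit-learn) is only
    imported when a run actually calls the method."""

    def mean(self, values):
        import statistics  # deferred: importing this class costs nothing
        return statistics.mean(values)
```

With this shape, the main __init__.py can still import and register every resource instance, but each of the 8 worker processes only pays the import cost for the resources its run actually touches.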