Spark Job dependencies set using spark.submit.pyFiles cannot be loaded from HDFS #419

Closed
@Jimvin

Description

Affected Stackable version

24.3

Affected Apache Spark-on-Kubernetes version

3.5.0

Current and expected behavior

With the correct Kerberos and HDFS configuration in place, Spark jobs can be started successfully using a resource loaded from Kerberos-enabled HDFS by setting `mainApplicationFile` to an HDFS URL, e.g. `mainApplicationFile: hdfs://poc-hdfs/user/stackable/pi.py`. The same Spark job fails if the property `spark.submit.pyFiles` is configured to point to a resource stored on the same HDFS cluster, e.g. `hdfs://poc-hdfs/user/stackable/mybanner.py`.

```
2024-06-25T10:21:14,754 WARN [main] org.apache.hadoop.fs.FileSystem - Failed to initialize fileystem hdfs://poc-hdfs/user/stackable/mybanner.py: java.lang.IllegalArgumentException: java.net.UnknownHostException: poc-hdfs
```
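
The `UnknownHostException: poc-hdfs` suggests that whatever resolves `spark.submit.pyFiles` does not see the `hdfs-site.xml` defining the `poc-hdfs` nameservice, even though the resolution of `mainApplicationFile` evidently does. For reference, a minimal sketch of the failing setup, assuming a Stackable `SparkApplication` manifest along these lines (illustrative names; Kerberos and HDFS discovery config mounts omitted):

```yaml
# Minimal sketch of the failing setup described above (hypothetical
# metadata.name; volume mounts for krb5.conf / hdfs-site.xml omitted).
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-pi-hdfs
spec:
  mode: cluster
  sparkImage:
    productVersion: 3.5.0
  # Loading the main resource from Kerberos-enabled HDFS works:
  mainApplicationFile: hdfs://poc-hdfs/user/stackable/pi.py
  sparkConf:
    # ...but the same nameservice URL here fails with
    # UnknownHostException: poc-hdfs
    spark.submit.pyFiles: hdfs://poc-hdfs/user/stackable/mybanner.py
```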

Possible solution

No response

Additional context

No response

Environment

No response

Would you like to work on fixing this bug?

None

### Tasks
- [ ] Provide a workaround here, if one exists (a hedged sketch follows this list)
- [ ] Optional: report the bug upstream to Apache Spark
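
If the root cause is only that the nameservice mapping is unavailable when `spark.submit.pyFiles` is resolved, one workaround worth trying (untested here) is to address a namenode directly rather than via the HA nameservice. The hostname below is hypothetical and merely follows the Stackable HDFS operator's naming scheme; note this loses HA failover for that lookup and assumes the Kerberos client settings are still picked up:

```yaml
sparkConf:
  # Untested workaround sketch: bypass nameservice resolution by pointing
  # at a namenode host:port directly. Hostname, namespace and port are
  # hypothetical and must match the actual cluster.
  spark.submit.pyFiles: hdfs://poc-hdfs-namenode-default-0.poc-hdfs-namenode-default.default.svc.cluster.local:8020/user/stackable/mybanner.py
```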

Metadata

Labels

release-note, release/24.11.0, type/bug

Status

Done
