Improving DirectTcp config Defaults for Spark workloads (#37878)
* Improving DirectTcp config Defaults for Spark workloads

* Updating changelog
FabianMeiswinkel committed Dec 1, 2023
1 parent e688ccf commit c03ace2
Showing 6 changed files with 15 additions and 0 deletions.
1 change: 1 addition & 0 deletions sdk/cosmos/azure-cosmos-spark_3-1_2-12/CHANGELOG.md
@@ -12,6 +12,7 @@
#### Bugs Fixed

#### Other Changes
* Improved DirectTcp config defaults for Spark workloads. The SDK's default transit timeout health checks and its request and connect timeouts are too aggressive for Spark, because many Spark jobs, unlike latency-sensitive apps, are throughput-optimized and their executors often exceed 70% CPU usage. - See [PR 37878](https://github.com/Azure/azure-sdk-for-java/pull/37878)

### 4.23.0 (2023-10-09)

1 change: 1 addition & 0 deletions sdk/cosmos/azure-cosmos-spark_3-2_2-12/CHANGELOG.md
@@ -12,6 +12,7 @@
#### Bugs Fixed

#### Other Changes
* Improved DirectTcp config defaults for Spark workloads. The SDK's default transit timeout health checks and its request and connect timeouts are too aggressive for Spark, because many Spark jobs, unlike latency-sensitive apps, are throughput-optimized and their executors often exceed 70% CPU usage. - See [PR 37878](https://github.com/Azure/azure-sdk-for-java/pull/37878)

### 4.23.0 (2023-10-09)

1 change: 1 addition & 0 deletions sdk/cosmos/azure-cosmos-spark_3-3_2-12/CHANGELOG.md
@@ -11,6 +11,7 @@
#### Bugs Fixed

#### Other Changes
* Improved DirectTcp config defaults for Spark workloads. The SDK's default transit timeout health checks and its request and connect timeouts are too aggressive for Spark, because many Spark jobs, unlike latency-sensitive apps, are throughput-optimized and their executors often exceed 70% CPU usage. - See [PR 37878](https://github.com/Azure/azure-sdk-for-java/pull/37878)

### 4.23.0 (2023-10-09)

1 change: 1 addition & 0 deletions sdk/cosmos/azure-cosmos-spark_3-4_2-12/CHANGELOG.md
@@ -11,6 +11,7 @@
#### Bugs Fixed

#### Other Changes
* Improved DirectTcp config defaults for Spark workloads. The SDK's default transit timeout health checks and its request and connect timeouts are too aggressive for Spark, because many Spark jobs, unlike latency-sensitive apps, are throughput-optimized and their executors often exceed 70% CPU usage. - See [PR 37878](https://github.com/Azure/azure-sdk-for-java/pull/37878)

### 4.23.0 (2023-10-09)

@@ -359,4 +359,14 @@ private[cosmos] object SparkBridgeImplementationInternal extends BasicLoggingTra
  def configureSimpleObjectMapper(allowDuplicateProperties: Boolean) : Unit = {
    Utils.configureSimpleObjectMapper(allowDuplicateProperties)
  }

  def overrideDefaultTcpOptionsForSparkUsage(): Unit = {
    val overrideJson = "{\"timeoutDetectionEnabled\": true, \"timeoutDetectionDisableCPUThreshold\": 70.0," +
      "\"timeoutDetectionTimeLimit\": \"PT600S\", \"timeoutDetectionHighFrequencyThreshold\": 100," +
      "\"timeoutDetectionHighFrequencyTimeLimit\": \"PT30S\", \"timeoutDetectionOnWriteThreshold\": 10," +
      "\"timeoutDetectionOnWriteTimeLimit\": \"PT600s\", \"tcpNetworkRequestTimeout\": \"PT10S\", " +
      "\"connectTimeout\": \"PT10S\", \"connectionAcquisitionTimeout\": \"PT10S\"}"

    System.setProperty("azure.cosmos.directTcp.defaultOptions", overrideJson)
  }
}
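
The hand-escaped JSON string above is easy to misread, so here is a minimal standalone sketch that rebuilds the same option set from key/value pairs and checks that every duration literal is valid ISO-8601. The object name `DirectTcpOverrideSketch` and its helper names are hypothetical and not part of the commit. How the SDK itself parses the property is an assumption; the sketch only relies on `java.time.Duration.parse`, which accepts the lowercase `s` in `PT600s`. The key order in the generated JSON differs from the commit's string.

```scala
import java.time.Duration

// Hypothetical helper, not part of the commit: rebuilds the override JSON
// from key/value pairs so the individual settings are easier to review.
object DirectTcpOverrideSketch {
  // System property key used by the commit.
  val PropertyName = "azure.cosmos.directTcp.defaultOptions"

  // Non-duration options from the commit, already rendered as JSON literals.
  val scalarOptions: Seq[(String, String)] = Seq(
    "timeoutDetectionEnabled" -> "true",
    "timeoutDetectionDisableCPUThreshold" -> "70.0",
    "timeoutDetectionHighFrequencyThreshold" -> "100",
    "timeoutDetectionOnWriteThreshold" -> "10"
  )

  // Duration-valued options from the commit, as ISO-8601 strings
  // (the lowercase "s" in PT600s is copied from the commit as-is).
  val durationOptions: Seq[(String, String)] = Seq(
    "timeoutDetectionTimeLimit" -> "PT600S",
    "timeoutDetectionHighFrequencyTimeLimit" -> "PT30S",
    "timeoutDetectionOnWriteTimeLimit" -> "PT600s",
    "tcpNetworkRequestTimeout" -> "PT10S",
    "connectTimeout" -> "PT10S",
    "connectionAcquisitionTimeout" -> "PT10S"
  )

  // Renders all options as one flat JSON object.
  def buildOverrideJson(): String = {
    val scalars = scalarOptions.map { case (k, v) => "\"" + k + "\": " + v }
    val durations = durationOptions.map { case (k, v) => "\"" + k + "\": \"" + v + "\"" }
    (scalars ++ durations).mkString("{", ", ", "}")
  }

  // Parses every duration literal; throws DateTimeParseException on a bad one.
  def parsedDurations: Map[String, Duration] =
    durationOptions.map { case (k, v) => k -> Duration.parse(v) }.toMap

  def main(args: Array[String]): Unit = {
    System.setProperty(PropertyName, buildOverrideJson())
    parsedDurations.foreach { case (k, d) => println(k + " = " + d.getSeconds + "s") }
  }
}
```

Keeping the options as pairs makes a typo in a single duration fail fast at parse time instead of surfacing later inside the client.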
@@ -32,6 +32,7 @@ private[spark] object CosmosClientCache extends BasicLoggingTrait {

SparkBridgeImplementationInternal.setUserAgentWithSnapshotInsteadOfBeta()
System.setProperty("COSMOS.SWITCH_OFF_IO_THREAD_FOR_RESPONSE", "true")
SparkBridgeImplementationInternal.overrideDefaultTcpOptionsForSparkUsage()

// removing clients from the cache after 15 minutes
// The clients won't be disposed - so any still running task can still keep using it