Speeding up NS-2 scheduler

(Part of the NS-2 enhancement project)

David X. Wei

Netlab @ Caltech

Sep 2005

This is a patch that speeds up the NS-2 scheduler by improving the Calendar Queue data structure used in the scheduler. In particular, it helps many simulations of high-speed long-distance networks: some simulations that took a day with the original NS-2 scheduler can now finish within an hour. The patch is for NS-2.28; some modification may be needed to install it on later versions.


Patch for NS-2.28 / 2.29:

The patch can be downloaded from here. It works with NS-2.28 and NS-2.29. To install it, take the following steps, assuming you have already installed and run NS-2.28 or NS-2.29 on your computer:

  1. Copy the patch into ns-allinone-2.28/ns-2.28 (or ns-2.29).
  2. In the ns-allinone-2.28/ns-2.28 (or ns-2.29) directory, run: make clean
  3. In the same directory, run: patch -p1 < ns-scheduler.patch
  4. (Optional) Delete the patch file: rm ns-scheduler.patch
  5. Recompile NS-2 with make and make install.

The only files changed are scheduler.h and scheduler.cc. The patch was originally made for NS-2.28, but NS-2.29 changes little in these two files, so the patch applies to NS-2.29 with fuzzy matching.

The documentation files sim.tex and ns.bib are also changed, adding a short description of the change and references to the related papers.

The original NS-2.28 uses a Calendar Queue to implement the priority queue that stores the future events of the simulation. I made three changes to this data structure; see "Details of the data structures" below.

Simulation speed with the patch:

Summary: The original NS-2.28 implementation does not work consistently well. Sometimes it takes more than a day to finish a 200-second simulation of a high-speed long-distance network (see Tests 1-4). Its performance depends strongly on the distribution of simulation events. Tests 1-4 share the same scenario; the only difference among them is the monitoring interval (which introduces sampling events at different intervals, and hence different event distributions in the priority queue). When the monitoring interval is 0.1 second per sample, NS-2 finishes in 25 minutes; at the other sampling rates we tried, NS-2 runs for more than 15 hours. With the patch, the simulation speed is much more consistent.

Common setup: simulation time: 200 s; TCP variant: Sack1 with HS-TCP; all code compiled with -pg (gprof support).

Simulation machine: Sun Fire server with dual AMD Opteron(tm) 250 processors (2.4 GHz).

Details: click on a scenario to download the ns script (.tcl) file; click on a time to download the gprof profile.

Test | Scenario | NS-2.28 original | NS-2.28 with the patch* | NS-2.28 with Heap** | NS-2.28 with map***
-----|----------|------------------|-------------------------|---------------------|--------------------
1  | c=1Gbps, buffer=20000pkt, d=220ms, 40 flows, no output        | 17h 9m 12s  | 25m 6s  | 40m 15s | 1h 8m 2s
2  | c=1Gbps, buffer=20000pkt, d=220ms, 40 flows, monitor per 0.1s | 25m 14s     | 25m 58s | 42m 53s | 1h 7m 24s
3  | c=1Gbps, buffer=20000pkt, d=220ms, 40 flows, monitor per 1s   | 20h 10m 44s | 24m 17s | 41m 40s | 1h 6m 28s
4  | c=1Gbps, buffer=20000pkt, d=220ms, 40 flows, monitor per 10s  | 19h 0m 25s  | 24m 15s | 41m 57s | 1h 6m 56s
5  | c=1Gbps, buffer=10000pkt, d=220ms, 40 flows, no output        | 23m 37s     | 24m 30s | 40m 82s | 1h 4m 38s
6  | c=25Mbps, buffer=125pkt, d=220ms, 4 flows, tracing variable   | 11s         | 11s     | 12s     | 18s
7  | c=25Mbps, buffer=125pkt, d=220ms, 40 flows, tracing variable  | 27s         | 28s     | 31s     | 45s
8  | c=250Mbps, buffer=125pkt, d=220ms, 40 flows, tracing variable | 4m 40s      | 3m 55s  | 4m 43s  | 7m 1s
9  | c=250Mbps, buffer=125pkt, d=220ms, 40 flows, monitor per 0.1s | 3m 05s      | 2m 14s  | 3m 06s  | 5m 23s
10 | c=250Mbps, buffer=125pkt, d=220ms, 40 flows, no output        | 2m 52s      | 2m 9s   | 3m 0s   | 5m 28s

* In every scenario, the trace/monitor output of the patched NS-2 was compared against the output of the original NS-2. They are identical.

** A heap-based priority queue was also implemented for comparison (see details). This implementation is an ad-hoc hack and should not be used in serious simulations before further verification.

*** NS-map is the map scheduler in the NS-2 distribution, enabled by adding "$ns use-scheduler Map" to the Tcl script. It uses the STL map as the priority queue. Thanks to Prof. Riley for the suggestion.

Details of the data structures:

Calendar Queue (default in NS-2.28)

Randy Brown, "Calendar Queues: A Fast O(1) Priority Queue Implementation for the Simulation Event Set Problem", Communications of the ACM, vol. 31, no. 10 (October 1988), pp. 1220-1227. ISSN 0001-0782.

JongSuk Ahn, SeungHyun Oh, "Dynamic Calendar Queue," Proceedings of the Thirty-Second Annual Simulation Symposium, 1999, p. 20.

The calendar queue stores events in an array of buckets. A bucket corresponds to a "day" on a real calendar: it can hold multiple events, just as you can write multiple notes under one day on a real calendar. The whole bucket array corresponds to a "year", and events that fall on the same "day" of different "years" share the same bucket. When a new event is inserted, the right bucket can be calculated in O(1), and the event is placed into that bucket by linear search. The size of the array may be doubled if the number of events grows, or halved if it shrinks.

The efficiency of the calendar queue depends on the width of each bucket. If the bucket width is too large, many events may fall into one bucket, and inserting a new event requires a long linear search -- this is the case when I run NS-2. If the bucket width is too small, most events in the buckets belong to different "years", and dequeue incurs a large overhead (a linear search over the calendar days). The optimal bucket width, suggested by the paper, is the average interval between adjacent events; but computing it exactly requires traversing the whole queue, so the paper suggests sampling the intervals among the first N (N<=25) events in the queue. NS-2, however, does not use this width-calculation algorithm. Instead, it picks the fullest bucket (the one with the largest number of events), calculates the average interval within that bucket, as suggested in the Dynamic Calendar Queue paper, and sets the bucket width to 4 times that average interval. (4 is probably a heuristic factor that keeps the enqueue and dequeue costs balanced.)

But this heuristic can produce a width much larger than the suggested optimal value. If the fullest bucket happens to contain a few events that are "years" apart while most events in the whole queue are clustered within "seconds", every bucket will be "years" wide, and most of the events (clustered within seconds) will crowd into a few buckets.

Also, NS-2 does not adopt the Dynamic Calendar Queue's resize-triggering algorithm. Once the bucket width is set, it does not change until the number of buckets has to change (because of a significant change in the number of events in the queue).


SNOOPy Calendar Queue with average interval estimation

Kah Leong Tan, Li-Jin Thng, "SNOOPy Calendar Queue", Proceedings of the 32nd conference on Winter simulation, Orlando, Florida, Pages: 487 - 495, Year of Publication: 2000, ISBN:1-23456-789-0

The SNOOPy Calendar Queue is similar to the Dynamic Calendar Queue, but it calculates the new bucket width with a square-root formula:

new_bucket_width = k * bucket_width, where k = sqrt(cost of dequeue / cost of enqueue),

on the assumption that the new dequeue cost is 1/k of the old dequeue cost, while the new enqueue cost is k times the old enqueue cost.

Also, I added an average-interval estimator that differs from the previous literature: I use the average interval between consecutively dequeued events, measured over a window of one queue-size's worth of dequeued events, as the estimate of the average interval of the events in the queue. This greatly improves the estimation. Note, however, that the optimal bucket width is not the average interval suggested by the original Calendar Queue paper; it is usually 3 or 4 times the average interval.

Heap Scheduler

http://www.sgi.com/tech/stl/push_heap.html

http://www.sgi.com/tech/stl/pop_heap.html

I implemented the heap-based scheduler with the STL. NS-2 also requires a "cancel" operation that can delete an arbitrary event from the heap. Implemented straightforwardly, this requires traversing the whole heap and re-adjusting it after deleting the event, and "cancel" becomes the dominant overhead.

What I did instead is mark the event in the heap as "cancelled" (by setting uid = -100) without deleting it. When the dequeue process finds a "cancelled" event, it discards it and dequeues the next one. Since the number of "cancel" operations is about 7.5% of the number of "dequeue" operations, I estimate the heap size grows by about 7.5%, which is acceptable. The results show that this approach is much faster than the straightforward one.

Acknowledgment:

This work was inspired by Prof. Pei Cao at Stanford, who showed me how to use gprof to analyze NS-2's execution time. Many thanks to her!