A question came up in CDP about what a “large” dynamic array is, when used as a select list, and when lists should be processed as dimensioned arrays versus dynamic arrays. The answer can be more complex than one might think.
Before I continue, I’d like to state why I chose to publish this here instead of in CDP. I’m providing a lot of details here and each one is subject to being slightly incorrect or simply outdated. I’m not going into incredible detail, so each note may only scratch the surface of a deeper discussion. In a forum we’ll find a lot of assumptions and theories and nitpicking at each point, which only make it difficult for someone to come along later to get all of the facts in one spot. By publishing here we have a chance to get the info correct in one spot. As details are verified, corrected, or expanded upon, I’ll update the info here.
Let’s summarize why we care about list size
Lists can be pulled into a BASIC program as a dynamic array, delimited by attribute marks. Processing a dynamic array from start to end is handled by an internal assembler routine called SID, for Scan Incrementing to Delimiter. When you want attribute 6, for example, a pointer is set up at the beginning of the variable and SID loops until 6 attribute marks have been found. Retrieving an attribute therefore requires counting every character up to it, and that's perceived as a performance issue. If a list is converted to a dimensioned array, where every field is a separate variable (rather than one variable that must be scanned), each array element can be accessed directly, which is much faster than the scan process. So many developers prefer to MatParse "large" dynamic arrays into dimensioned arrays to get that performance benefit. The question is: at what point does the pain of converting from dynamic to dimensioned arrays exceed the benefits?
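To make the difference concrete, here's a minimal sketch in BASIC (the variable names and the array bound of 100000 are mine, not from any particular system). The first loop forces SID to rescan the string from the top on every iteration; the second pays the conversion cost once with MATPARSE and then indexes elements directly:

LIST = DYNAMIC.ARRAY ; * list of IDs, one per attribute
N = DCOUNT(LIST, CHAR(254))
*
* Scan version: LIST<I> makes SID count I attribute marks
* from the start of LIST on every pass through the loop.
FOR I = 1 TO N
   ID = LIST<I>
   GOSUB PROCESS.ID
NEXT I
*
* Dimensioned version: one parsing pass, then direct access.
DIM IDS(100000) ; * assumed upper bound - size for your data
MATPARSE IDS FROM LIST
FOR I = 1 TO N
   ID = IDS(I)
   GOSUB PROCESS.ID
NEXT I

Note that the scan version does roughly N*N/2 character comparisons in total, while the dimensioned version scans the string once.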
Factors for consideration
The answer to “what is a large list” isn’t a number of attributes like 10,000 or 100,000. This has less to do with a number of attributes and more to do with volume of data, or the LEN() of the variable containing the dynamic array.
Let’s look at attributes and their contents in terms of frame consumption, which is how storage is managed in D3 VME environments. User workspace is composed of frames, data is paged from filespace frames, and non-flashed BASIC makes use of frames to operate on data, so some discussion of frames is warranted.
Long strings cross frame boundaries, so each time you process through the attributes and hit the end of a frame, the system needs to go through the logic to check the next frame ID and read that frame in if it’s not already in memory. That process is known as “frame faulting”. Back when frames were 500 bytes, 1k, or 2k, that could be an issue. Pulling data into user workspace works the same way, as the data is paged from file space into user workspace. This has nothing to do with the number of attributes – you can have a single attribute that crosses frame boundaries, or you can have about 500 attributes with unique values in a single 2k frame.
Minor digression to explain the numbers: assume the numbers 1 to 500, one per attribute. The digits take (9*1 + 90*2 + 401*3) = 1392 bytes, plus 500 attribute marks, for about 1892 bytes, with the rest of a 2k frame used for housekeeping. In a real list with IDs of 4-12 characters the number of attributes per frame will of course be lower.
Note that I said “if it’s not already in memory”. There are many factors that determine whether data is in memory, not the least of which is “did we already read this frame?” If data has already been read then it’s in cache somewhere, whether in D3’s allocated memory, the CPU cache, the disk cache, etc. The point of the above is that at some point D3 must read a frame, then process instructions to follow the forward link for any linked frames. Where the data ultimately comes from is irrelevant; those instructions are always processed.
Many of us used to increase the separation to deal with large items like this. Keep the frames contiguous and the chances of having the frame you want in memory increase. This only relates to pulling data into memory from files and not really to operating on lists, unless you’re saving your lists to files, which is exactly what the pointer-file is. What’s better – larger frames or using a separation? And can or should we increase the separation of the pointer-file? Those are separate topics.
The provision of local overflow in early D3 improved the process of retrieving frames. (See the “set-ovf-local” command.) The global overflow pool provides frames to all users: each time a user needs a frame the overflow table is locked, a frame is retrieved, and the table is unlocked. D3 adds a local overflow table for each process, allocated at session start, so users generally dip into their local table rather than the global pool when they need frames. So when you’re building large lists, you’re using overflow space but you may not be impacting other users as much as in the old R83 days. The default local overflow allocation is rather small; you may wish to experiment with the settings over time to see if you get measurable performance changes.
The increase of frame size further reduces frame faulting. In D3NT 7.5 you can have a dict with a frame size up to 1MB and data frames up to 4MB. These frame-limit enhancements to D3 always follow improvements in disk/memory paging for common hardware. When common hardware became capable of paging 2k in a single disk read, it became effective to make frames 2k; prior to that, a request for 2k would cause 2-4 disk reads. These days hardware can stream a lot more data into memory, so larger blocks of contiguous data are often desirable. That said, the notion of “contiguous” in a virtual environment is open to question, as we usually have no idea how much work the hardware really has to do to import data from disk. Disk fragmentation may make a “contiguous” file not contiguous at all.
Again, to explain why we care about frames: let’s assume you don’t generate a list and then read the thing into BASIC for processing. What if you use a SELECT statement in BASIC? This generates a list in frame space with a persistent pointer, so each time you execute a READNEXT statement you don’t SID through the list, you simply get the next value in the list. That’s great but the list is still “somewhere”. It’s in frames linked from your user workspace. The same holds true if you EXECUTE “SELECT filename”, then READNEXT, because the list is in frames pointed to from BASIC workspace. But if you EXECUTE “SELECT filename” RTNLIST MYLIST, then the variable MYLIST in BASIC now contains the entire list. While it’s all still in frames and memory no matter how you slice it, this list is more like a dynamic array and it’s processed with SID like any other variable.
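The three patterns look like this in BASIC (file and variable names are mine, for illustration):

* 1) BASIC SELECT: persistent pointer; READNEXT walks the
*    list held in frames without SID scanning a variable.
OPEN "CUSTOMERS" TO F.CUST ELSE STOP 201, "CUSTOMERS"
SELECT F.CUST
LOOP
   READNEXT ID ELSE EXIT
   * ... process ID ...
REPEAT
*
* 2) EXECUTEd SELECT: same idea, the active list lives in
*    frames pointed to from BASIC workspace.
EXECUTE 'SELECT CUSTOMERS'
LOOP
   READNEXT ID ELSE EXIT
   * ... process ID ...
REPEAT
*
* 3) RTNLIST: the entire list lands in a BASIC variable and
*    is processed with SID like any other dynamic array.
EXECUTE 'SELECT CUSTOMERS' RTNLIST MYLIST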
Having said that, in D3 a RTNLIST variable is not exactly the same as a dynamic array you create yourself, and it’s not a good idea to RTNLIST to a variable and then operate on that variable just like any other. D3 has optimizations for these variables because it “knows” these are lists, and while these variables can almost always behave like normal vars, they really aren’t normal. Perhaps the optimizations include some sort of cursor for READNEXT handling – I don’t know at this point. My advice is to not take chances with how D3 or other platforms may treat RTNLIST vars, and just assume they’re not standard dynamic arrays.
With FlashBASIC, the manipulation of strings with C++ in memory is orders of magnitude faster than when the same operations are performed in virtual workspace. Also, as memory availability in modern systems increases, along with the amount of memory allocated to D3, there is less (or no) paging to disk of the user workspace which contains these large variables. Underlying OS functions page memory without us knowing about it, and the larger the string, the greater the chance that some or all of it will be paged, depending on many factors. But the point here is that if you do have long lists in dynamic arrays, you will get a major benefit simply from Flash-compiling with zero code changes, compared to reworking your code with dimensioned arrays or other mechanisms.
Contrary to what I said above, sure, the number of attributes can be a factor, since every time D3 needs an attribute it counts attributes starting at the beginning of the string. jBase, and maybe other platforms, keeps a pointer to where the last/highest attribute was, so that it can just move the cursor forward without repeating unnecessary work. I have no idea yet whether D3 has implemented an enhancement like this over the last 8 years, but that would also virtually negate the concept of “large” except when initially building strings or when writing to disk.
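Even without such a platform enhancement, BASIC itself offers a moving cursor: the REMOVE statement keeps an internal pointer into the variable, so walking a list front to back never rescans from the top. A minimal sketch (variable names are mine):

LIST = "A" : CHAR(254) : "B" : CHAR(254) : "C"
LOOP
   REMOVE ID FROM LIST SETTING CODE
   * ... process ID ...
UNTIL CODE = 0 DO REPEAT ; * CODE is 0 at end of string

One caveat: the pointer is tied to the variable, so assigning a new value to LIST resets the cursor to the beginning.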
If you do decide to convert a dynamic array to a dimensioned array, you need to be aware of what happens when you create a dimensioned array, and that includes use of MatParse. BASIC needs to carve out workspace for each variable, that is, for each element of the array. That can temporarily take up frame space. Try this for an experiment. Compile the following two-line program:
DIM ARRAY(1000000) ; * that's one million
EXECUTE "POVF"
Now POVF from TCL, note your total overflow, then run the program. You’ll see that BASIC has dipped into common overflow space for about 12,000 frames – depending on your frame size. If you ask for 10 million elements or 100 million, the program may abort with a “not enough disk space” error – again, depending on system size. The space is released at the end of the program, which should be confirmed by another POVF from TCL. What we see here is that in trying to avoid excessive paging operations you’ve only created a situation where overflow is required anyway. Sure, BASIC can access elements of a dimensioned array much faster than a dynamic array, but that doesn’t come without some price in the conversion process.
All of the above technical improvements over time have progressively raised the bar in terms of what “large” means. Compiling your code differently or moving your code from one system to another will change the definition to some extent. The rule of thumb is now more like “suck it and see”. The concept of “large” on your developer system will be vastly different on each of your end-user systems. I think the easiest thing to do is to just decide if you want to work with a dynamic array or a dimensioned array, and understand that for some systems, no matter what you do, it’s going to be the wrong decision.
The technician in me wants to write performance tests to benchmark many of the above factors. The numbers would be interesting. The businessman in me encourages you to do your own testing – or to commission me to do a performance analysis for you.