Loading classes, a matter of choice?
-------------------------------------
March 23rd, 2019

In this issue I'll discuss the notion of content generator objects and tag parsing. Big changes this time :), although not many of them are visual changes.

As you may recall from the previous posts, content has so far been generated by inserting PHP snippets such as <? TQuark::instance()->generateSomething(); ?> in the template. There were four or five functions in the CMS class that, when called, generated predefined HTML sequences based on the content definitions. I also mentioned catching the generated template in an output buffer in order to parse it for custom tags that would replace the PHP snippets, which also allows recursive content nesting by repeating the process on each piece of generated output. Now it's time to put that into practice and parse the output buffer :), yet I'll explain it in reverse, as I think it is easier to follow that way.

We'll start with the generateSomething() functions that resided in the TQuarkCMS class. In their place we'll define a number of classes that can be instantiated as we go and called to generate specific portions of the content. Not only that, but they don't have to be included in our script from the start; they can actually be used as plugins. For that I defined a generic ancestor class called TBaseGenerator and placed it in a file called BaseGenerator.php. You will notice that the naming convention is important: classes, like all types, need to have names starting with a capital T. This convention is borrowed from years of programming in Delphi, and I found it to be good practice. Generator classes in particular must have names ending in ...Generator. Therefore any foo generator class has to be named TFooGenerator, and its file FooGenerator.php. For now the file is expected to be on the same directory level as the quarkCMS.php file; most probably the generators will move to a plugins folder in the future. You'll see that this naming convention helps us load each class only when needed, using the autoload mechanism provided by the PHP interpreter.
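To make the convention concrete, here is a minimal sketch of how a generator name could map to a class name and a file name. The helper quark_generator_names_sketch() is hypothetical and only illustrates the rule; it is not part of the CMS.

```php
<?php
//  Hypothetical helper (not part of quarkCMS): derives the class and file names
//  that the naming convention implies for a given generator name.
function quark_generator_names_sketch(string $name): array
{
    $class = 'T'.ucfirst($name).'Generator';  //  foo -> TFooGenerator
    $file = substr($class, 1).'.php';         //  TFooGenerator -> FooGenerator.php
    return array('class' => $class, 'file' => $file);
}

$names = quark_generator_names_sketch('foo');
echo $names['class'].' lives in '.$names['file']."\n";
```

The point of keeping the mapping this mechanical is that the file name can always be computed from the class name alone, which is all an autoloader has to work with.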

The TBaseGenerator class has a very simple definition for the moment. It has two functions, render() and generate(), which sound like they do similar things and... they do. The generate() function is called by the processor function of the TQuarkCMS instance (the one that parses for tags), and it ensures the recursive call to the processor function for each piece of generated content. The render() function is called by generate() to create the HTML portion that needs to be inserted. It can either return a string containing the HTML or echo it directly (but not both: if a non-empty string is returned, anything echoed is discarded), as it is executed inside an ob_start() ... ob_end_clean() block, therefore any echoed text will be caught and processed. Here is the code for the TBaseGenerator class:

    class TBaseGenerator
    {
        function render($attr = null, $innerText = null)
        {
            return '';    
        }
        
        function generate($attr = null, $innerText = null)
        {
            ob_start();            
            $buffer = $this->render($attr, $innerText);
            if (empty($buffer)) $buffer = ob_get_contents();
            ob_end_clean();
            
            return TQuarkCMS::instance()->process($buffer);
        }
    }
That is almost the entire mechanism (besides the process() function in the main CMS class). To implement specific generators we need to extend the TBaseGenerator class and override the render() function for that specific content. For a very simple example, let's implement a generator for the page title:

    class TTitleGenerator extends TBaseGenerator
    {
        function render($attr = null, $innerText = null)
        {
            $cms = TQuarkCMS::instance();
            return '<h2>'.$cms->menu_items[$cms->idx_current_lang][$cms->idx_current_page].'</h2>';
        }
    }
The routine is basically identical to the one we previously had in the TQuarkCMS::GenerateTitle() function. The entire mechanism may seem overkill for this simple example, but it's meant to prove the functionality. More complex code will appear when dealing with actual content, menus, links and so on (for instance, I wrote a code listing generator that renders the code bits in my posts with a monospace font and line numbers, and in the future it will provide some syntax highlighting as well). By using content generator classes, we give the specific portions of the CMS room to grow and mature. Next, the autoload function:

    function quarkCMS_autoloader($class)
    {
        $filename = $class.'.php';
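        if ($filename[0] === 'T') $filename = substr($filename, 1); //  TFooGenerator -> FooGenerator.php
        include $filename; //  a missing file only raises a warning, so class_exists() can still return false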
    }
    
    spl_autoload_register('quarkCMS_autoloader');
The autoload function is called automatically by PHP whenever we try to instantiate a class that has not been defined yet. To avoid fatal errors when the requested class and/or its code file cannot be located, we'll use the class_exists() function, which checks whether a class exists and calls the autoloader if it doesn't (when the $autoload argument is true). If the autoloader fails to locate the file containing the class definition, class_exists() simply returns false, and we can gracefully skip generating that portion of the content instead of failing miserably with a fatal error.

    $GeneratorName = 'T'.ucfirst($name).'Generator';
    if (class_exists($GeneratorName, $autoload = true)) 
    {
        $Generator = new $GeneratorName();
        //  ... call $Generator->generate() or do something else ...
    }
This code is part of the TQuarkCMS::process($text) function we'll talk about in a minute. The autoload mechanism allows us to avoid including all the generator classes from the beginning and to include them only as they are needed. Because we'll always need some of them, the mechanism may not improve performance in the simplest cases, but in the long run and for more complex sites it can prove to be a lag killer.

Now, for the text processor, I analyzed a few approaches to parsing the custom quark tags. Regular expressions were the first thing on my mind because of the short code I could write to do the actual parsing, but I realized that I needed the regex function to return both the custom tags and the text in between them in order to reconstruct the output. Then I thought about an HTML DOM parser (you can find it on sourceforge at simplehtmldom.sourceforge.net - see what I did there ;) ?), but because it expects the text to always be formatted as HTML, and because it extracts every DOM element when I was only interested in replacing some of them, it proved unfeasible. Therefore, the last option was to do an old style search and replace with a twist :). Using strpos() is not necessarily bad, especially when you know what you are looking for. I needed it to look for substrings starting with <q: and then copy whatever was inside the tag into an array, marking the start and end positions for later replacement. In its simplest form, the processor looks for tags starting with <q: and copies everything until /> is reached, building an array of tags this way. It then steps through each of them, calls the respective generator object and reconstructs the output from the generated portions and the bits of $text located in between the tags.
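The search loop at the heart of this approach can be sketched in isolation. The function name quark_find_all_sketch() is made up for this illustration; it only shows how the offset argument of strpos() lets the scan move forward without revisiting text it has already seen.

```php
<?php
//  Hypothetical helper (illustration only): collects the start offset of every
//  occurrence of a marker in a single forward pass over the text.
function quark_find_all_sketch(string $text, string $needle = '<q:'): array
{
    $positions = array();
    if ($needle === '') return $positions; //  an empty marker would never advance

    $idx = strpos($text, $needle);
    while ($idx !== false)
    {
        $positions[] = $idx;
        //  resume right after the current hit so nothing is scanned twice
        $idx = strpos($text, $needle, $idx + strlen($needle));
    }
    return $positions;
}

print_r(quark_find_all_sketch('<h1><q:title /></h1><q:content />'));
```

For the sample string the two markers are found at offsets 4 and 20. The real processor below does the same walk, but also records where each tag ends.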

        function process(string $text)
        {
            //  parse the text for quark tags and collect them into an array alongside
            //  their start and end position
            $result = '';
            $tags = array();
            
            $idx_start = strpos($text, '<q:');
            while ($idx_start !== false)
            {
                $idx_stop = strpos($text, '/>', $idx_start);
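                if ($idx_stop === false)
                {
                    //  no self closing marker found: stop right before the next tag or at the end of the text
                    $idx_next = strpos($text, '<', $idx_start + 1);
                    $idx_stop = ($idx_next === false) ? strlen($text) - 1 : $idx_next - 1;
                }
                else $idx_stop++; //  point at the > of the /> marker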
                $len = $idx_stop - $idx_start + 1;
                
                $str_tag = substr($text, $idx_start + 3, $len - 3); //  skip the <q: part and avoid a second substr
                $str_tag = strtolower(trim($str_tag, " \t/>")); //  cut any space, tab, slash or greater signs
                
                $tag_rec = array('tag' => $str_tag, 'start' => $idx_start, 'stop' => $idx_stop);
                $tags[] = $tag_rec;
                
                $idx_start = strpos($text, '<q:', $idx_stop); //  get the position of the next quark tag
            }

            //  rebuild the output by processing each content placeholder in the array
            //  and copying the in between bits directly from the input text
            $offset = 0;
            foreach ($tags as $tag)
            {
                $result.= substr($text, $offset, $tag['start'] - $offset);
                
                //  search for a content generator based on the tag name
                $GeneratorName = 'T'.ucfirst($tag['tag']).'Generator';
                if (class_exists($GeneratorName, $autoload = true)) 
                {
                    $Generator = new $GeneratorName();
                    $result.= $Generator->generate();
                }
                
                $offset = $tag['stop'] + 1;                
            }
            $result.= substr($text, $offset); //  copy the rest of the output buffer
            
            return $result;
        }
This was the initial version of the processor. I copied it here as an example of how strpos() can be used to efficiently locate the substrings to be replaced without repeatedly scanning the entire text and replacing one custom tag at a time (which would have been the simpler but less efficient implementation of the processor routine). This version of the processor can only extract tags of the form <q:element />, which is useful but not always applicable. For instance, we'll need to generate links to other content pages and maybe include some custom text, in the form <q:link contentid=13>this is a link</q:link>, or have class and style attributes specified in the tag. This version of the processor definitely can't do that, so the next version had to be able to also extract attributes, and inner text when non self closing tags are used (like <q:element attributes list>inner text</q:element>).

To achieve this I needed a processTag function as well, and this is where I was really tempted to use the preg_split() or preg_match_all() PHP functions to extract the list of attributes. After more than a few tries I realized how quickly regular expressions can become overcomplicated, with little chance for me to come back later and change them if necessary. Not only that, but while testing various expressions on regex101.com I noticed that even for short strings of 20 or 30 characters some of these patterns required a few hundred iterations to correctly extract the attributes, mainly because of the back references needed to contextually parse the values for their respective identifiers. So I went back to straightforward parsing, one character at a time, which ensures that the number of iterations depends linearly on the number of parsed characters. In the end I got a pretty fast routine which only adds one millisecond or so to the previously measured time for an entire page render.
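For comparison only, here is a rough sketch of what a regex-based attribute extractor might look like. It is not one of the patterns tried at the time (those were not kept), and the function name quark_attrs_regex_sketch() is hypothetical. It takes just the attribute part of a tag, without the tag name.

```php
<?php
//  Hypothetical regex variant (illustration only, not the approach that was kept).
//  Group 1 is the attribute name; groups 2, 3 and 4 hold a double quoted,
//  single quoted or unquoted value respectively.
function quark_attrs_regex_sketch(string $s): array
{
    $pattern = '/([^\s=]+)(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|(\S+)))?/';
    preg_match_all($pattern, $s, $matches, PREG_SET_ORDER);

    $attr = array();
    foreach ($matches as $m)
    {
        //  at most one value group is non-empty; trailing unmatched groups are absent
        $attr[$m[1]] = ($m[2] ?? '').($m[3] ?? '').($m[4] ?? '');
    }
    return $attr;
}

print_r(quark_attrs_regex_sketch("contentid=13 class = 'big link' hidden"));
```

Even this fairly tame pattern takes a moment to read back, which is the maintenance concern described above. The character-by-character routine is longer, but each step can be followed on its own.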
The parsing itself is based on a simple state machine concept, although I didn't keep the intermediary form of the routine, which was much clearer than what I ended up with. Everything is done in two functions: process($text), which searches for the beginnings and ends of tags, and fasterTag($s) (named this way because I wrote it in parallel to the initial processTag function), which extracts tag properties such as the name and the list of attributes. And no, simplexml_load_string() wouldn't have worked as an alternative unless one can be sure the input is perfectly formed XML, which will probably not always be true. The fasterTag($s) function accepts any of the following forms for specifying attributes: element attr1 attr2=value1 attr3='value2' attr4="value3", with or without spaces before and after the equal sign. The only requirement, if you decide to use it yourself, is that the tag beginning and ending symbols (<, >, />) need to be stripped before passing the string to the fasterTag function. This is very easily handled by the process($text) function once it identifies a custom quark tag.
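Since the clearer intermediary form was not kept, here is a sketch of what an explicit state machine for the same job might look like. It is a reconstruction under my own assumptions, not the original routine: the function name quark_parse_tag_sketch() and the state names are invented for this example.

```php
<?php
//  Hypothetical explicit state machine (a reconstruction, not the original code).
//  States: tag | key | after_key | before_value | value | quoted
function quark_parse_tag_sketch(string $s): array
{
    $result = array('tag' => '');
    $state = 'tag';
    $key = ''; $value = ''; $quote = '';

    $s = trim($s, " \t").' '; //  the trailing space flushes the last token
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++)
    {
        $c = $s[$i];
        $space = ($c === ' ' || $c === "\t");
        switch ($state)
        {
            case 'tag': //  reading the element name
                if ($space) $state = 'key'; else $result['tag'].= $c;
                break;
            case 'key': //  reading an attribute name
                if ($c === '=') $state = 'before_value';
                elseif ($space) { if ($key !== '') $state = 'after_key'; }
                else $key.= $c;
                break;
            case 'after_key': //  spaces after a name: an equal sign or a new name may follow
                if ($c === '=') $state = 'before_value';
                elseif (!$space) { $result[$key] = ''; $key = $c; $state = 'key'; }
                break;
            case 'before_value': //  spaces after the equal sign
                if ($c === '"' || $c === "'") { $quote = $c; $state = 'quoted'; }
                elseif (!$space) { $value = $c; $state = 'value'; }
                break;
            case 'value': //  unquoted value, ends at the next space
                if ($space) { $result[$key] = $value; $key = $value = ''; $state = 'key'; }
                else $value.= $c;
                break;
            case 'quoted': //  quoted value, ends at the matching quote
                if ($c === $quote) { $result[$key] = $value; $key = $value = ''; $state = 'key'; }
                else $value.= $c;
                break;
        }
    }
    if ($key !== '') $result[$key] = $value; //  a valueless attribute left pending at the end
    return $result;
}

print_r(quark_parse_tag_sketch('element attr1 attr2=value1 attr3=\'value2\' attr4="value3"'));
```

Each character is visited exactly once, so the work grows linearly with the length of the tag, which is the property that made the hand-written parser attractive in the first place.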

        function fasterTag($s)
        {
            $result = array('tag' => '');
            $spaces = array(' ', "\t");
            $len = strlen($s); $i = 0;
            
            //  parse tag name
            while ($i < $len && in_array($s[$i], $spaces)) $i++; //  skip any space in the beginning
            while ($i < $len && !in_array($s[$i], $spaces)) { $result['tag'].= $s[$i]; $i++; }
            
            //  parse attributes
            while ($i < $len)
            {
                while ($i < $len && in_array($s[$i], $spaces)) $i++; //  skip any space before the attribute name
                if ($i >= $len) break;
                
                $key = '';
                while ($i < $len && !in_array($s[$i], $spaces) && $s[$i] != '=') { $key.= $s[$i]; $i++; }
                while ($i < $len && in_array($s[$i], $spaces)) $i++; //  skip any space before a possible equal sign
                
                $value = '';
                if ($i < $len && $s[$i] == '=')
                {
                    $i++;
                    while ($i < $len && in_array($s[$i], $spaces)) $i++; //  skip any space after the equal sign
                    
                    if ($i < $len && ($s[$i] == '"' || $s[$i] == "'"))
                    {
                        //  quoted value: copy everything up to the matching quote
                        $marker = $s[$i]; $i++;
                        while ($i < $len && $s[$i] != $marker) { $value.= $s[$i]; $i++; }
                        if ($i < $len) $i++; //  step over the closing quote
                    }
                    else
                        //  unquoted value: copy everything up to the next space
                        while ($i < $len && !in_array($s[$i], $spaces)) { $value.= $s[$i]; $i++; }
                }
                
                //  attributes without a value are stored with an empty string
                if ($key != '') $result[$key] = $value;
            }
            
            return $result;
        }
        
        function process(string $text)
        {
            //  parse the text for quark tags and collect them into an array alongside
            //  their start and end position
            $result = '';
            $tags = array();
            
            $idx_start = strpos($text, '<q:'); //  search for the first occurrence of a quark tag
            while ($idx_start !== false)
            {
                //  assume some properties of the found tag
                $selfclosed = true;
                $malformed = false;
                
                //  locate tag and determine if it is malformed, selfclosed or not
                $idxs = $idx_start + 3; //  avoids a few adds in the next lines
                $idx_next = strpos($text, '<', $idxs);  //  take the pos of the next tag opening to check for format errors
                $idx_stop = strpos($text, '/>', $idxs); //  locate the end of the tag as if it is an autoclosing one
                if ($idx_stop === false || ($idx_next !== false && $idx_next < $idx_stop))
                {
                    //  self closing marker not found, it might be an error or there could be a separate closing tag
                    $idx_stop = strpos($text, '>', $idxs);
                    
                    if ($idx_stop === false || ($idx_next !== false && $idx_next < $idx_stop))
                    {
                        //  a new tag is opened before closing the current one or we simply get to EOF
                        $malformed = true;
                        
                        //  we'll attempt to decode the unclosed tag: it ends right before the next tag or at EOF
                        if ($idx_next === false) $idx_stop = strlen($text) - 1;
                        else $idx_stop = $idx_next - 1;
                    }
                    else $selfclosed = false; //  we'll have to search for the closing tag
                }
                else $idx_stop++;
                
                //  extract tag info
                $len = $idx_stop - $idx_start - 2; // it's actually + 1 - 3
                $str_tag = substr($text, $idxs, $len); //  skip the <q: part and avoid a second substr
                $str_tag = trim($str_tag, " \t/>"); //  cut any space, tab, slash or greater signs
                
                $parts = $this->fasterTag($str_tag);
                if (sizeof($parts) >= 1)
                {
                    $attr = array();
                    foreach ($parts as $key => $value)
                    {
                        if ($key == 'tag') $str_tag = $value;
                        else $attr[$key] = $value;
                    }
                    
                    $str_inner = '';
                    if (!$selfclosed)
                    {
                        $search = '</q:'.$str_tag.'>';
                        $idx = strpos($text, $search, $idx_stop);
                        if ($idx !== false)
                        {
                            $str_inner = substr($text, $idx_stop + 1, $idx - $idx_stop - 1);
                            $idx_stop = $idx + strlen($search) - 1;
                        }
                    }
                    
                    //  save gathered information
                    $tag_rec = array('tag' => $str_tag, 'start' => $idx_start, 'stop' => $idx_stop);
                    if (sizeof($attr) > 0) $tag_rec['attr'] = $attr;
                    if (!empty($str_inner)) $tag_rec['inner'] = $str_inner;
                    $tags[] = $tag_rec;
                }
                
                //  get the position of the next quark tag and continue searching
                $idx_start = strpos($text, '<q:', $idx_stop);
            }
            
            //  rebuild the output by processing each content placeholder in the array
            //  and copying the in between bits directly from the input text
            $offset = 0;
            foreach ($tags as $tag)
            {
                $result.= substr($text, $offset, $tag['start'] - $offset);
                
                //  search for a content generator based on the tag name
                $GeneratorName = 'T'.ucfirst($tag['tag']).'Generator';
                if (class_exists($GeneratorName, $autoload = true)) 
                {
                    $Generator = new $GeneratorName();
                    $attr = null; if (isset($tag['attr'])) $attr = $tag['attr'];
                    $innerText = null; if (isset($tag['inner'])) $innerText = $tag['inner'];
                    $result.= $Generator->generate($attr, $innerText);
                }
                
                $offset = $tag['stop'] + 1;                
            }
            $result.= substr($text, $offset); //  copy the rest of the output buffer
            
            return $result;
        }
That's it, a tad long, but it will work for most of the scenarios I imagined. In addition to this, I wanted to continue writing posts in plain text format while still using the processing capability to nest content such as code listings. TContentGenerator will therefore assume by default that plain text is preformatted, which means that all special HTML/XML characters are filtered and replaced with character codes, yet I added a few lines of code to recognize and escape desired custom tags, which are marked by enclosing them in square brackets. Now it would be time to get back to finishing those menus I guess...
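The square bracket idea could be sketched roughly as follows. This is only my guess at one possible shape, not the actual TContentGenerator code: the function name quark_escape_plain_text_sketch() and the exact [q:...] syntax are assumptions made for this example.

```php
<?php
//  Hypothetical sketch (not the actual TContentGenerator code): escapes plain text
//  for HTML output while turning bracketed tags such as [q:code /] into <q:code />.
function quark_escape_plain_text_sketch(string $text): string
{
    //  with PREG_SPLIT_DELIM_CAPTURE the odd indexes hold the bracketed tags
    $parts = preg_split('/(\[\/?q:[^\]]*\])/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part)
    {
        if ($i % 2 === 1) $out.= '<'.substr($part, 1, -1).'>'; //  swap the brackets for angle brackets
        else $out.= htmlspecialchars($part);                   //  everything else is filtered
    }
    return $out;
}

echo quark_escape_plain_text_sketch('if (a < b) [q:code lang="php" /] & done')."\n";
```

Splitting first and escaping only the pieces in between keeps the attributes of the bracketed tags untouched, so the result can be handed straight to process().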