Full text search Draw.io (diagrams.net) diagram in Nextcloud

Full text search Draw.io (diagrams.net) diagram in Nextcloud

Full text search in Nextcloud is a great feature. It includes LibreOffice files. However, I found that Draw.io files are not included. This is just some notes on the idea to include .drawio files to Nextcloud full text search. It should be developed as a plugin later.

.drawio file type is an XML

Here’s a sample .drawio file opened in text editor:

<mxfile host="embed.diagrams.net" modified="2022-02-17T15:39:04.371Z" agent="5.0 (Windows)" etag="cR5UYiRQA5ij_U9vdlQz" version="16.5.6" type="embed"><diagram id="AuVbhCTzAwpeFgywoMie" name="Page-1">vLzHsuPKEiT4NW9PaGAJrSWhd9CaJDTAr59Mnnun+1lPm82mu4xWRYJgIkWEh3tEZv0H46dLXrJPa77LavwP+iiv/2DCf1CUIWnwN7xw/13ACfLvQrN05d8l5H9ceHbf6p+Lj3+u7l1Zrf914/Z+j1v3+e+Lxfv1qortv65ly/I+//u2+j3+91M/WVP9LxeeRTb+r1ejrtzaf64iJPM/vlCqrmn/eTSNUn9fTNm/N/8zkrXNyvf5P13CxP9g/PJ+b3/vpouvRjh3/87L3++k/823/2/Hluq1/f/5Afr3gyMb93/G9k+/tvvfwYIufuDbbvrNCndUy9aBuTCyvBqd99pt3fsFvs/f2/aewA0j/ILLiqFZ3vur5N/jewHfl1Wd7eP2P7XAjl0Df7m9P+Bqtn7+1qrurgp0j/s9kP336uPfK7CpbMv+g7F/H1Hp82r+g/JdyNne+dDl5s2CP9YzaMWgAe8o+FHjeDYB/3Lgfk566eBtw4qj6IYeHu9XyZVt4HnFf1CuVIq+T5rHc5jUUhw+Qql2YqN2uoPz4nsOOrZ5P4f51YqtpvNBg3mIXoJWbQ/8RX1f9E1gDHNUR32UmI99HbVfsX7DiM+DskH7BSNXdpp1niVyLHhJnMuxv5cbsK4IX4HEigJLsL0sntqHbiMPB4NqVPh666zKw5fONzz7e/Uq60bK4P1v2wLPaRqVU3OOW4Pzfop/1/4/789Xkdlzi1cWnBZEpj6uQORUeDsYnqqxjet9qwBYFVc9zY5QwBs+MBqXe9yiqz3wtWuHLzeEoiPkJSp7bDOwuHqxSzjcuvwtIiyH5ohKuna1gzEY7CBLrq6JxXP8PF6NfaF3fFokKQUyln/pAdzKI7+5qsXit975UDpKF+cfzD8JF2PkQAHPp2/ZVYeQfn7UQfk8iIKx6IYwPd7V2l4UAORwhvgOyHQo7W+y+ZzsZsnOs27MtsOt6flwVxblTewRLxcwfGgpZT2ujyqW1+39DS52AleZManQ3Jni7wel8krz7o0Fs6yw0enfRLAJUW5OHwYgAQdX+7NJmF7DFZNVjiU3V5vlJZTQF8k...
(...)
</diagram></mxfile>

Drawio format “mxfile” is an XML file with deflated content of the diagram.

https://drawio-app.com/extracting-the-xml-from-mxfiles/

Nextcloud files full text search app

App source is available at:

https://github.com/nextcloud/files_fulltextsearch.git

File types are handled in “FilesService.php” located in:

NEXTCLOUD_ROOT/apps/files_fulltextsearch/lib/Service/FilesService.php

Just for a test, add applicaiton/xml to be recognized as TEXT file, in parseMimeTypeText() function like below at line 861.

 847         /**
 848          * @param string $mimeType
 849          * @param string $extension
 850          * @param string $parsed
 851          *
 852          * @throws KnownFileMimeTypeException
 853          */
 854         private function parseMimeTypeText(string $mimeType, string $extension, string &$parsed) {
 855 
 856                 if (substr($mimeType, 0, 5) === 'text/') {
 857                         $parsed = self::MIMETYPE_TEXT;
 858                         throw new KnownFileMimeTypeException();
 859                 }
 860 
 861                 // 20220219 Parse XML files as TEXT files
 862                 if (substr($mimeType, 0, 15) === 'application/xml') {
 863                         $parsed = self::MIMETYPE_TEXT;
 864                         throw new KnownFileMimeTypeException();
 865                 }
 866 
 867                 $textMimes = [
 868                         'application/epub+zip'
 869                 ];
 870 
 871                 foreach ($textMimes as $mime) {
 872                         if (strpos($mimeType, $mime) === 0) {
 873                                 $parsed = self::MIMETYPE_TEXT;
 874                                 throw new KnownFileMimeTypeException();
 875                         }
 876                 }
 877 
 878                 $this->parseMimeTypeTextByExtension($mimeType, $extension, $parsed);
 879         }
 880 

Restart nextcloud-fulltext-elasticsearch-worker:

sudo systemctl restart nextcloud-fulltext-elasticsearch-worker

Put an xml file in Nextcloud (sample taken from simple.xml ):

Try search in Nextcloud.

Exact match should be indicated by bold letters.

So far, so good.

.drawio file is an XML file, but deflated.

Here is a very simple Drawio diagram:

and the .drawio file is like this:

<mxfile host="embed.diagrams.net" modified="2022-02-20T03:14:32.065Z" agent="5.0 (Macintosh)" etag="mZv-Ht9U0wXalCt4k_3Y" version="16.6.1" type="embed"><diagram id="bwsSY7Qtgjerim4sss-x" name="Page-1">jZLBboMwDIafJncgK+uuY1132YlDzxnxSKRAUGoK9OkXFqeA0KRdkP39NrH/hPGiGc9OdOrTSjAsS+TI+BvLsiM/+O8MpgDyJAmgdloGlC6g1HcgGMt6LeG6KURrDepuCyvbtlDhhgnn7LAt+7Zme2onatiBshJmTy9aoiKaH54W4QN0reLRaf4SlEbEalrlqoS0wwrxE+OFsxZD1IwFmNm8aEzoe/9DfUzmoMX/NGSh4SZMT8uJr0oCDYdTXNnZvpUwNyWMvw5KI5SdqGZ18HfsmcLG+Cz14X4ImusGDmFcIRrqDLYBdJMvIZU/k0H0RLIj5cPK8GiiWnmdExN0x/Xj14sLPiAjYroY/qutni0//QA=</diagram></mxfile>

So in draw.io file with .drawio extension, the diagram XML is in the “diagram” element of the XML, base64 encoded and deflated.

* Draw.io diagram has XML export function without compression. Then you can save a full XML file directly from diagrams.net.

Instead of parsing XML file, we need to specifically read and parse “.drawio” file.

Like with XML file, we make Nextcloud app to treat “.drawio” extension as TEXT file and parse later properly, by adding the latter part below code to the same FilesService.php:

(...)
                // 20220219 Parse XML files as TEXT files
                if (substr($mimeType, 0, 15) === 'application/xml') {
                        $parsed = self::MIMETYPE_TEXT;
                        throw new KnownFileMimeTypeException();
                }

                // 20220219 Parse .drawio file
                if ($extension  === 'drawio') {
                        $parsed = self::MIMETYPE_TEXT;
                        throw new KnownFileMimeTypeException();
                }
(...)

Parsing .drawio diagram XML

Diagram.net has a conversion tool at:

https://jgraph.github.io/drawio-tools/tools/convert.html

From the JavaScript code of the tool we can see what we need to convert the element to XML:

  • Base64 decode
  • Inflate
  • URL decode

So to do the indexing in ElasticSearch, follow the 3 steps to convert the “diagram” element and pass it to ElasticSearch, instead of the original content of .drawio file.

The same file in:

NEXTCLOUD_ROOT/apps/files_fulltextsearch/lib/Service/FilesService.php

Change extractContentFromFileText function accordingly:

  • Extract diagram element
  • Do the 3 steps above
  • Remove image tag from the diagram as the are base64 encoded binary which is meaningless for indexing
  • Pass the converted data to ElasticSearch when the file extension is “.drawio”

So it will be like this:

/**
 * @param FilesDocument $document
 * @param File $file
 */
private function extractContentFromFileText(FilesDocument $document, File $file) {
        if ($this->parseMimeType($document->getMimeType(), $file->getExtension())
                !== self::MIMETYPE_TEXT) {
                return;      
        }

        if (!$this->isSourceIndexable($document)) {
                return;      
        }

        // 20220219 Inflate drawio file
        $content = $file->getContent();
        if ( $file->getExtension() === 'drawio') {

                try {        
                        $xml = simplexml_load_string($content);
                        $deflated_content = (string)$xml->diagram;
                        $base64decoded = base64_decode($deflated_content, true);
                        $urlencoded_content = gzinflate($base64decoded);
                        $content = urldecode($urlencoded_content);

                        // Remove image tag, as they can be huge
                        $content = preg_replace('/style=\"shape=image[^"]*\"/', '//', $content);

                } catch (\Throwable $t) {
                        // Error handling here
                }            
        }

        try {        
                $document->setContent(

                        // 20220219 Pass content of file, or inflated drawio graph xml 
                        base64_encode($content), IIndexDocument::ENCODED_BASE64
                        // base64_encode($file->getContent()), IIndexDocument::ENCODED_BASE64
                );           
        } catch (NotPermittedException | LockedException $e) {
        }
}

Result

Now the Draw.io (diagrams.net) diagram can be searched in full text search:

Remove XML tags from search result

Better search result may be without XML tags.

Add a function to strip XML tags:

/**
 * @param SimpleXMLElement $element
 */
private function readXmlValue(\SimpleXMLElement $element) {
    $str = '';
    if( $element['value'] != null && trim(strval($element['value'])) !== '')
        $str = $str . " " .  trim(strval($element['value']));
    if( $element != null && trim(strval($element)) !== '')
        $str = $str . " " . trim(strval($element));
    try {
        foreach ($element->children() as $child) {
            $str = $str . " " . $this->readXmlValue($child);
        }
    } finally {
    }

    // Strip HTML tags
    $str_without_tags = preg_replace('/<[^>]*>/', ' ', $str);
    return $str_without_tags;
}

And call this function before passing the content of XML to ElasticSearch. So the function extractContentFromFileText will be:

/**
 * @param FilesDocument $document
 * @param File $file
 */
private function extractContentFromFileText(FilesDocument $document, File $file) {
        if ($this->parseMimeType($document->getMimeType(), $file->getExtension())
                !== self::MIMETYPE_TEXT) {
                return;
        }

        if (!$this->isSourceIndexable($document)) {
                return;
        }

        // 20220219 Inflate drawio file
        $content = $file->getContent();
        if ( $file->getExtension() === 'drawio') {

                try {
                        $xml = simplexml_load_string($content);

                        // Initialize $content
                        $content = '';

                        // Iterate all the 'diagram' tags
                        foreach ($xml->diagram as $child) {
                            $deflated_content = (string)$child;
                            $base64decoded = base64_decode($deflated_content);
                            $urlencoded_content = gzinflate($base64decoded);
                            $urldecoded_content = urldecode($urlencoded_content);

                            // Remove image tag
                            $diagram_str = preg_replace('/style=\"shape=image[^"]*\"/', '', $urldecoded_content);
                            $content = $diagram_str;

                            // Construct XML document of diagram
                            $diagram_xml = simplexml_load_string($diagram_str);
                            $content = $content . ' ' . $this->readXmlValue($diagram_xml);
                        }

                } catch (\Throwable $t) {
                    // Exception handling
                }
        }

        try {
                $document->setContent(
                        // 20220219 Pass content of file, or inflated drawio graph xml
                        base64_encode($content), IIndexDocument::ENCODED_BASE64
                        // base64_encode($file->getContent()), IIndexDocument::ENCODED_BASE64
                );
        } catch (NotPermittedException | LockedException $e) {
        }
}

This way, the retrieved diagram’s XML is:

資産 Assets サービス Producst /Services 価値 Value 売上 / 利益 Sales / Profit 技術力 対話能力 製品・ソリューションの知識 経験 仕組み 製品 教育 規定 インフラ 管理部 エンジニア・派遣 プロジェクトマネジメント 品質管理 経験 仕組み 製品 教育 規定 インフラ 管理部 自社製品 他社製品/OSS製品 自社サービス

No XML nor HTML tags are included, but only words used in the .drawio file are used to index the file.

Search result looks better now: