Recursive string multidecoder

When reverse engineering a piece of software or a network protocol, it's common to come across text that has been encoded multiple times, possibly with a mix of encodings.

Several tools can help with decoding, but manually decoding a string several times, guessing the correct encoding at each step, is pretty labor-intensive.

Wouldn't it be nice to have a tool that automates this process for us, exploring various combinations of encodings until a result is found?

TL;DR just gimme the tool

Sure! Get the source from GitHub or PyPI.

Install with pipx:

# Latest release
pipx install text-multidecoder

# From a git branch
pipx install git+https://github.com/rshk/multidecoder.git@main

Command line usage

multidecoder -t "string to decode"
multidecoder < decodeme.txt

Example:

% multidecoder -t "4a5464434a544979614756736247386c4d6a496c4d30456c4d6a416c4d6a4a5862334a735a4355794d69553352413d3d"
base64 -> hex[ e1 ae 78 eb 8e 37 e1 ae 78 e3 de fd eb 5e 3b e7 ae f7 eb 6e 3b df ce 9c e1 de 9a e3 de 9c e1 dd f4 e3 9e 9c e1 de 9a e3 5e 9c e1 de 9a e1 ae 7c eb 6d f7 e1 ae f7 e5 ae 37 e7 9e fd e1 de bd e7 9d f7 e7 6e 35 dd dd dd ]
    base64 -> hex[ c7 bc 67 9b b9 f9 ]
        unicode-chardet(Windows-1252) -> Ç¼g›¹ù
    ]base64_urlsafe -> same as base64
    unicode-chardet(Windows-1251) -> б®xлЋ7б®xгЮэл^;з®члn;ЯОњбЮљгЮњбЭфгћњбЮљг^њбЮљб®|лmчб®че®7зћэбЮЅзќчзn5ЭЭЭ
        ]base64 -> (seen before) hex[ c7 bc 67 9b b9 f9 ]
        ]base64_urlsafe -> (seen before) hex[ c7 bc 67 9b b9 f9 ]
]base64_urlsafe -> same as base64
hex -> b'JTdCJTIyaGVsbG8lMjIlM0ElMjAlMjJXb3JsZCUyMiU3RA=='
    base64 -> b'%7B%22hello%22%3A%20%22World%22%7D'
        url -> b'{"hello": "World"}'
            unicode-utf8 -> {"hello": "World"}
            ]unicode-chardet(ascii) -> same as unicode-utf8
        unicode-utf8 -> %7B%22hello%22%3A%20%22World%22%7D
            ]url -> (seen before) {"hello": "World"}
        ]unicode-chardet(ascii) -> same as unicode-utf8
    ]base64_urlsafe -> same as base64
    unicode-utf8 -> JTdCJTIyaGVsbG8lMjIlM0ElMjAlMjJXb3JsZCUyMiU3RA==
        ]base64 -> (seen before) b'%7B%22hello%22%3A%20%22World%22%7D'
        ]base64_urlsafe -> (seen before) b'%7B%22hello%22%3A%20%22World%22%7D'
    ]unicode-chardet(ascii) -> same as unicode-utf8
unicode-utf8 -> 4a5464434a544979614756736247386c4d6a496c4d30456c4d6a416c4d6a4a5862334a735a4355794d69553352413d3d
    ]base64 -> (seen before) hex[ e1 ae 78 eb 8e 37 e1 ae 78 e3 de fd eb 5e 3b e7 ae f7 eb 6e 3b df ce 9c e1 de 9a e3 de 9c e1 dd f4 e3 9e 9c e1 de 9a e3 5e 9c e1 de 9a e1 ae 7c eb 6d f7 e1 ae f7 e5 ae 37 e7 9e fd e1 de bd e7 9d f7 e7 6e 35 dd dd dd ]
    ]base64_urlsafe -> (seen before) hex[ e1 ae 78 eb 8e 37 e1 ae 78 e3 de fd eb 5e 3b e7 ae f7 eb 6e 3b df ce 9c e1 de 9a e3 de 9c e1 dd f4 e3 9e 9c e1 de 9a e3 5e 9c e1 de 9a e1 ae 7c eb 6d f7 e1 ae f7 e5 ae 37 e7 9e fd e1 de bd e7 9d f7 e7 6e 35 dd dd dd ]
    ]hex -> (seen before) b'JTdCJTIyaGVsbG8lMjIlM0ElMjAlMjJXb3JsZCUyMiU3RA=='
]unicode-chardet(ascii) -> same as unicode-utf8

As you can see, the program tried various encodings recursively, which led to two possible solutions. The first one appears to be a false positive that ends in garbled text. The second one looks far more promising: applying hex -> base64 -> url -> utf-8 decoding yields a JSON string.

Duplicate paths using different encodings have also been detected and skipped.
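To double-check the promising branch, the same chain can be replayed by hand with the Python standard library alone. This is only a sketch of the four steps shown in the output above, not how the tool itself is implemented; the variable names are made up for illustration.

```python
import base64
from urllib.parse import unquote_to_bytes

# The input string from the example above
encoded = "4a5464434a544979614756736247386c4d6a496c4d30456c4d6a416c4d6a4a5862334a735a4355794d69553352413d3d"

step_hex = bytes.fromhex(encoded)        # hex -> bytes
step_b64 = base64.b64decode(step_hex)    # base64 -> bytes
step_url = unquote_to_bytes(step_b64)    # url (percent-decoding) -> bytes
decoded = step_url.decode("utf-8")       # utf-8 -> str

print(decoded)
```

Doing this manually only works once you already know the order of the encodings, which is exactly the guesswork the tool takes over.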

As a Python library

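import sys

from multidecoder import multidecode, display_result

# Placeholder input: replace with the string you want to decode
text = "string to decode"
results = multidecode(text, max_depth=10)
display_result(results, sys.stdout)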

How does it work?

The multidecode() function goes through a list of decoders, attempting to decode the input text with each one in turn.

If a decoder's output is equal to its input, or the decoder errors out, that "branch" is skipped.
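The skip rule might be sketched roughly like this. The decoder table and the try_decoders name are assumptions made for illustration, not the tool's actual internals, and only three standard-library decoders are included.

```python
import base64
import binascii
from urllib.parse import unquote_to_bytes

# Hypothetical decoder table: each entry maps bytes to bytes and raises
# ValueError (binascii.Error is a subclass) on invalid input.
DECODERS = {
    "hex": binascii.unhexlify,
    "base64": lambda data: base64.b64decode(data, validate=True),
    "url": unquote_to_bytes,
}


def try_decoders(data):
    """Yield (decoder name, output) for every decoder that makes progress."""
    for name, decoder in DECODERS.items():
        try:
            output = decoder(data)
        except ValueError:
            continue  # the decoder errored out: skip this branch
        if output == data:
            continue  # nothing changed: skip this branch
        yield name, output


for name, output in try_decoders(b"%7B%7D"):
    print(name, output)
```

Here the url decoder never fails, so it is the "output equals input" check that prunes it whenever the text contains no percent escapes.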

Otherwise, a Result is yielded for each successful decoding operation, with the following fields:

decoder_id: internal identifier for the decoder
value: return value from the decoder
args: list of strings, used by decoders to return extra information about the decoding process. A common example is the chardet decoder, which attempts to guess automatically which character encoding was used to encode some text.
is_new_path: flag indicating whether this result value is new or has been seen before. Display and search algorithms may want to use it to avoid descending the same branch twice unnecessarily, or getting stuck in an infinite loop.
sub_results: iterator of Results obtained by recursively decoding this result.
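
Putting the fields together, a recursive decoder producing such a tree could look roughly like the sketch below. Everything in it is an assumption for illustration: the Result dataclass layout, the sketch_multidecode and new_paths names, the small decoder table, and the reading of is_new_path as True for values that have not been seen before. The real implementation may differ.

```python
import base64
import binascii
from dataclasses import dataclass
from typing import Iterator, List
from urllib.parse import unquote_to_bytes

# Hypothetical decoder table (bytes in, bytes out, ValueError on bad input)
DECODERS = {
    "hex": binascii.unhexlify,
    "base64": lambda data: base64.b64decode(data, validate=True),
    "url": unquote_to_bytes,
}


@dataclass
class Result:
    decoder_id: str
    value: bytes
    args: List[str]
    is_new_path: bool
    sub_results: Iterator["Result"]


def sketch_multidecode(data, max_depth=10, _seen=None):
    """Lazily yield a tree of Results, one per successful decoding step."""
    seen = {data} if _seen is None else _seen
    if max_depth <= 0:
        return
    for decoder_id, decoder in DECODERS.items():
        try:
            value = decoder(data)
        except ValueError:
            continue  # decoder errored out
        if value == data:
            continue  # decoder made no progress
        is_new = value not in seen
        seen.add(value)
        # Only recurse into values that have not been explored already
        sub = sketch_multidecode(value, max_depth - 1, seen) if is_new else iter(())
        yield Result(decoder_id, value, [], is_new, sub)


def new_paths(results, prefix=()):
    """Walk the tree depth-first, yielding (path, value) for new values only."""
    for result in results:
        path = prefix + (result.decoder_id,)
        if result.is_new_path:
            yield path, result.value
            yield from new_paths(result.sub_results, path)


# Build a small test input: url-encoded "{}", then base64, then hex
payload = binascii.hexlify(base64.b64encode(b"%7B%7D"))
for path, value in new_paths(sketch_multidecode(payload)):
    print(" -> ".join(path), value)
```

Because sub_results is a generator, nothing below a node is decoded until a consumer actually iterates it, so a display or search routine can stop early or ignore branches it is not interested in.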