Here is some code I found on the Internet:

class M‮{public static void main(String[]a‭){System.out.print(new char[]
{'H','e','l','l','o',' ','W','o','r','l','d','!'});}}    

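This code prints Hello World! to the screen; you can see it run here. I can clearly see public static void main written in it, but it is backwards. How does this code work? How does it even compile?

Edit: I tried this code in IntelliJ and it works fine. However, for some reason it doesn't work in Notepad++ together with cmd. I still haven't found a solution, so if anyone finds one, please comment below.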


Current answer

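Chapter 3 of the language specification provides an explanation by describing in detail how lexical translation is done for a Java program. What matters most for this question:

Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters.

So a program is written in Unicode characters, and if the file encoding does not support a given Unicode character, the author can escape it as \uXXXX, in which case it is translated to the appropriate character. One of the Unicode characters present in this case is \u202E. It is not visible in the snippet, but if you try switching your browser's encoding, the hidden characters may appear.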
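As a small illustration of that translation step, here is a minimal sketch in which an identifier is spelled with a Unicode escape. The class and method names are made up for this example; the only point is that \u0078 is turned into the letter x before the compiler tokenizes the line.

```java
class UnicodeEscapeDemo {
    // The declaration below spells the letter x as a Unicode escape.
    // After lexical translation the line reads: int x = 42;
    static int answer() {
        int \u0078 = 42;
        return x;
    }

    public static void main(String[] args) {
        System.out.println(answer());
    }
}
```

The escaped and unescaped spellings name the same variable, which is why the return statement can refer to it as plain x.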

Therefore, the lexical translation results in the class declaration:

class M\u202E{

which means that the class identifier is M\u202E. The specification considers this a valid identifier:

Identifier:
    IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
    JavaLetter {JavaLetterOrDigit}

A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.
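To check this claim for the two override characters yourself, a quick sketch along these lines should do. The class name and constants are invented for the example; the Character methods are standard JDK calls.

```java
class IdentifierPartCheck {
    static final int RLO = 0x202E; // RIGHT-TO-LEFT OVERRIDE
    static final int LRO = 0x202D; // LEFT-TO-RIGHT OVERRIDE

    public static void main(String[] args) {
        for (int cp : new int[] {RLO, LRO}) {
            // Print how the JDK classifies each code point.
            System.out.printf("U+%04X part=%b start=%b ignorable=%b whitespace=%b%n",
                cp,
                Character.isJavaIdentifierPart(cp),
                Character.isJavaIdentifierStart(cp),
                Character.isIdentifierIgnorable(cp),
                Character.isWhitespace(cp));
        }
    }
}
```

Both characters belong to the Unicode general category Cf (format), which the JDK treats as ignorable identifier parts. They may continue an identifier but cannot start one, and they do not count as whitespace.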

Other answers

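There are invisible characters here that alter how the code is displayed. In IntelliJ you can find them by copy-pasting the code into an empty string (""), which replaces them with Unicode escapes, removing their effect and revealing the order the compiler sees.

Here is the output of that copy-paste: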

"class M\u202E{public static void main(String[]a\u202D){System.out.print(new char[]\n"+
        "{'H','e','l','l','o',' ','W','o','r','l','d','!'});}}   "

The source code characters are stored in this order, and the compiler processes them in this order, but they are displayed differently.
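If you don't have IntelliJ at hand, a rough equivalent of that paste trick can be sketched in a few lines. The helper name escapeNonAscii is hypothetical, and the sample string only reproduces the start of the first line of the snippet.

```java
class RevealHidden {
    // Replaces every char outside printable ASCII with a backslash-u escape,
    // leaving the stored (logical) order of the characters untouched.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two override characters are written as escapes so they stay visible here.
        String line = "class M\u202E{public static void main(String[]a\u202D){";
        System.out.println(escapeNonAscii(line));
    }
}
```

Because the output is pure ASCII, no renderer reorders it, so you see the characters in the order the compiler reads them.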

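Note the \u202E character, which is a right-to-left override: it starts a block in which all characters are forced to be displayed right-to-left. The \u202D is a left-to-right override: it starts a nested block in which all characters are forced into left-to-right order, overriding the first override.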

Ergo, when it displays the original code, class M is displayed normally, but the \u202E reverses the display order of everything from there to the \u202D, which reverses everything again. (Formally, everything from the \u202D to the line terminator gets reversed twice, once due to the \u202D and once with the rest of the text reversed due to the \u202E, which is why this text shows up in the middle of the line instead of the end.) The next line's directionality is handled independently of the first's due to the line terminator, so {'H','e','l','l','o',' ','W','o','r','l','d','!'});}} is displayed normally.
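The double reversal described above can be imitated with a heavily simplified sketch. It is not the Unicode Bidirectional Algorithm: it assumes a single line, handles only RLO and LRO (no terminators, no implicit directionality, no bracket mirroring), and all names in it are invented for the example. Each override raises an embedding level, and runs are then reversed from the highest level down.

```java
class BidiOverrideSketch {
    static final char RLO = '\u202E';
    static final char LRO = '\u202D';

    // Returns a simplified display order for one line of logically ordered text.
    static String visualOrder(String logical) {
        StringBuilder text = new StringBuilder();
        int[] levels = new int[logical.length()];
        int n = 0;
        int level = 0;
        for (int i = 0; i < logical.length(); i++) {
            char c = logical.charAt(i);
            if (c == RLO) {
                level = (level % 2 == 0) ? level + 1 : level + 2; // next odd level
            } else if (c == LRO) {
                level = (level % 2 == 0) ? level + 2 : level + 1; // next even level
            } else {
                text.append(c);          // override characters themselves are not shown
                levels[n++] = level;
            }
        }
        char[] out = text.toString().toCharArray();
        int max = 0;
        for (int i = 0; i < n; i++) max = Math.max(max, levels[i]);
        // From the highest level down to 1, reverse each maximal run at that level or above.
        for (int lvl = max; lvl >= 1; lvl--) {
            int i = 0;
            while (i < n) {
                if (levels[i] < lvl) { i++; continue; }
                int j = i;
                while (j < n && levels[j] >= lvl) j++;
                for (int a = i, b = j - 1; a < b; a++, b--) {
                    char tmp = out[a]; out[a] = out[b]; out[b] = tmp;
                }
                i = j;
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        String line = "class M" + RLO + "{public static void main(String[]a"
                    + LRO + "){System.out.print(new char[]";
        System.out.println(visualOrder(line));
    }
}
```

For the first line of the snippet this sketch yields class M){System.out.print(new char[]a][gnirtS(niam diov citats cilbup{. A real renderer additionally swaps in mirrored glyphs for brackets in the right-to-left part, which this sketch leaves out.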

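For the full Unicode Bidirectional Algorithm (extremely complex, dozens of pages long), see Unicode Standard Annex #9.

This is actually because of Unicode bidirectional support.

U+202E RIGHT-TO-LEFT OVERRIDE
U+202D LEFT-TO-RIGHT OVERRIDE

So these are some tricky characters. They are actually defined to support right-to-left languages. The real code is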

class M<U+202E>{public static void main(String[]a<U+202D>){System.out.print(new char[]
    {'H','e','l','l','o',' ','W','o','r','l','d','!'});}}

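(I got this by pasting into cmd.exe.) I hope this answer helps you understand how it works.

It looks different because of the Unicode Bidirectional Algorithm. There are two invisible characters, RLO and LRO, that the Unicode Bidirectional Algorithm uses to change the visual appearance of the characters nested between these two metacharacters.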

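The result is that visually they appear in reverse order, but the actual characters in memory are not reversed. You can analyse the results here. The Java compiler does not reject RLO and LRO: they are format characters that are permitted, and ignored for comparison purposes, inside identifiers, so they simply become part of the identifiers M and a rather than being treated as whitespace. That is why the code compiles.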

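Note 1: Text editors and browsers use this algorithm to visually display LTR characters (English) and RTL characters (e.g. Arabic, Hebrew) together at the same time, hence "bi"-directional. You can read more about the Bidirectional Algorithm on the Unicode website.

Note 2: The exact behaviour of LRO and RLO is defined in Section 2.2 of the algorithm.

The character U+202E mirrors the code from right to left, which is very clever. It is hidden right after the M: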

"class M\u202E{..."

How did I find the magic behind this?

Well, at first when I saw the question I thought, "it's a kind of joke, to waste somebody else's time", but then I opened my IDE (IntelliJ), created a class, and pasted the code... and it compiled! So I took a closer look and saw that the "public static void" was backward, so I moved the cursor there and erased a few chars... And what happened? The chars started erasing backward, so I thought, hmm... strange... I have to execute it. So I went to run the program, but first I needed to save it... and that was when I found it! I couldn't save the file because my IDE said that some char had a different encoding, and it pointed me to where it was. So I started searching Google for special chars that could do the job, and that's it :)

A little about the Unicode Bidirectional Algorithm and how U+202E is involved, briefly explained:

The Unicode Standard prescribes a memory representation order known as logical order. When text is presented in horizontal lines, most scripts display characters from left to right. However, there are several scripts (such as Arabic or Hebrew) where the natural ordering of horizontal text in display is from right to left. If all of the text has a uniform horizontal direction, then the ordering of the display text is unambiguous. However, because these right-to-left scripts use digits that are written from left to right, the text is actually bi-directional: a mixture of right-to-left and left-to-right text. In addition to digits, embedded words from English and other scripts are also written from left to right, also producing bidirectional text. Without a clear specification, ambiguities can arise in determining the ordering of the displayed characters when the horizontal direction of the text is not uniform. This annex describes the algorithm used to determine the directionality for bidirectional Unicode text. The algorithm extends the implicit model currently employed by a number of existing implementations and adds explicit formatting characters for special circumstances. In most cases, there is no need to include additional information with the text to obtain correct display ordering. However, in the case of bidirectional text, there are circumstances where an implicit bidirectional ordering is not sufficient to produce comprehensible text. To deal with these cases, a minimal set of directional formatting characters is defined to control the ordering of characters when rendered. This allows exact control of the display ordering for legible interchange and ensures that plain text used for simple items like filenames or labels can always be correctly ordered for display.
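The JDK ships an implementation of this algorithm as java.text.Bidi, so the split between logical order and display direction can be inspected directly. The sketch below uses a made-up sample string of three Latin letters, a space, and three Hebrew letters, and the class and helper names are invented for the example.

```java
import java.text.Bidi;
import java.util.Arrays;

class BidiLevels {
    // Returns the resolved embedding level of every char, assuming a left-to-right paragraph.
    static int[] levels(String s) {
        Bidi bidi = new Bidi(s, Bidi.DIRECTION_LEFT_TO_RIGHT);
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = bidi.getLevelAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stored in logical order: Latin first, then alef, bet, gimel.
        String mixed = "abc \u05D0\u05D1\u05D2";
        System.out.println(Arrays.toString(levels(mixed)));
        System.out.println(new Bidi(mixed, Bidi.DIRECTION_LEFT_TO_RIGHT).isMixed());
    }
}
```

Even levels mean left-to-right and odd levels mean right-to-left, so the Latin letters resolve to level 0 and the Hebrew letters to level 1, while the string in memory keeps its logical order.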

Why create an algorithm like this?

The bidi algorithm can render a sequence of Arabic or Hebrew characters one after the other from right to left.