as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
  File "/usr/local/bin/wok", line 4, in <module>
    Engine()
  File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 104, in __init__
    self.load_pages()
  File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 238, in load_pages
    p = Page.from_file(os.path.join(root, f), self.options, self, renderer)
  File "/usr/local/lib/python2.7/site-packages/wok/page.py", line 111, in from_file
    page.meta['content'] = page.renderer.render(page.original)
  File "/usr/local/lib/python2.7/site-packages/wok/renderers.py", line 46, in render
    return markdown(plain, Markdown.plugins)
  File "/usr/local/lib/python2.7/site-packages/markdown/__init__.py", line 419, in markdown
    return md.convert(text)
  File "/usr/local/lib/python2.7/site-packages/markdown/__init__.py", line 281, in convert
    source = unicode(source)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 1: ordinal not in range(128). -- Note: Markdown only accepts unicode input!

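How can I fix this?

In some other Python-based static blog apps, Chinese posts can be published successfully. One example is this app: http://github.com/vrypan/bucket3. On my site, http://bc3.brite.biz/, Chinese posts are published without problems.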


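Current answer

encode converts a unicode object into a string object. I think you are trying to encode a string object. First convert the result to a unicode object, then encode that unicode object as 'utf-8'. For example: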

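    result = yourFunction()
    # decode() with no argument uses ascii and fails on non-ASCII bytes; name the real encoding (utf-8 assumed here)
    result = result.decode('utf-8').encode('utf-8')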

Other answers

"UnicodeDecodeError: 'ascii' codec can't decode byte"

Cause of this error: input_string must be unicode, but a str was given.

"TypeError: Decoding Unicode is not supported"

Cause of this error: trying to convert a unicode input_string to unicode.


So first check whether your input_string is a str, and convert it to unicode if necessary:

if isinstance(input_string, str):
   input_string = unicode(input_string, 'utf-8')

Second, the code above only changes the type; it does not remove non-ASCII characters. If you want to remove non-ASCII characters:

if isinstance(input_string, str):
   input_string = input_string.decode('ascii', 'ignore').encode('ascii') #note: this removes the character and encodes back to string.

elif isinstance(input_string, unicode):
   input_string = input_string.encode('ascii', 'ignore')
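
For readers on Python 3, where the Python 2 `str`/`unicode` pair became `bytes`/`str`, the same two steps might look roughly like the sketch below. The helper names `to_text` and `strip_non_ascii` are made up for this illustration, and UTF-8 is only an assumed default encoding.

```python
def to_text(value, encoding="utf-8"):
    """Return text (str); decode only if bytes were passed in."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def strip_non_ascii(value, encoding="utf-8"):
    """Drop every non-ASCII character and return plain ASCII text."""
    text = to_text(value, encoding)
    return text.encode("ascii", "ignore").decode("ascii")


print(to_text(b"caf\xc3\xa9"))         # bytes are decoded as UTF-8
print(to_text("café"))                 # text is returned unchanged
print(strip_non_ascii("café 试 ok"))   # non-ASCII characters are dropped
```

Checking the type first avoids both errors quoted above: bytes are decoded exactly once, and text is never decoded again.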

This is the classic "unicode issue". I believe that fully explaining what is happening is beyond the scope of a StackOverflow answer.

It is well explained here.

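In short, you have passed something that is being interpreted as a string of bytes to something that needs to decode it into Unicode characters, but the default codec (ascii) is failing.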
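
The failure can be reproduced in a few lines. This is a minimal sketch written for Python 3, where the decode has to be spelled out; in Python 2 the same ascii decode happens implicitly inside `unicode(source)`. The sample character is an assumption chosen because its first UTF-8 byte is 0xe8, the byte named in the traceback.

```python
raw = "试".encode("utf-8")    # three bytes: b'\xe8\xaf\x95'

try:
    raw.decode("ascii")       # what the implicit Python 2 conversion attempts
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

text = raw.decode("utf-8")    # naming the real encoding succeeds

print(ascii_error)
print(text)
```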

The presentation I pointed you to gives advice for avoiding this. Make your code a "unicode sandwich". In Python 2, using from __future__ import unicode_literals helps.
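
As a rough illustration of the "unicode sandwich" idea (decode bytes as early as possible, work only with text in the middle, encode as late as possible), here is a small sketch. The function names `shout` and `convert_file`, the file names, and the UTF-8 encoding are all assumptions made for the example; it runs on Python 3, and `io.open` accepts the same arguments on Python 2.7.

```python
import io
import os
import tempfile


def shout(text):
    """Pure text processing: no bytes appear in the middle of the sandwich."""
    return text.upper()


def convert_file(src_path, dst_path, encoding="utf-8"):
    # Bottom slice: decode bytes into text as soon as they are read.
    with io.open(src_path, "r", encoding=encoding) as src:
        text = src.read()
    # Filling: work only with text.
    result = shout(text)
    # Top slice: encode back to bytes only when writing out.
    with io.open(dst_path, "w", encoding=encoding) as dst:
        dst.write(result)
    return result


# Build a small UTF-8 input file in a temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "post.mkd")
dst_path = os.path.join(workdir, "post.out")
with io.open(src_path, "wb") as handle:
    handle.write("title: 测试 café\n".encode("utf-8"))

print(convert_file(src_path, dst_path))
```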

Update: how to fix the code:

OK - in your variable "source" you have some bytes. It is not clear from your question how they got in there - maybe you read them from a web form? In any case, they are not encoded with ascii, but python is trying to convert them to unicode assuming that they are. You need to explicitly tell it what the encoding is. This means that you need to know what the encoding is! That is not always easy, and it depends entirely on where this string came from. You could experiment with some common encodings - for example UTF-8. You tell unicode() the encoding as a second parameter:

source = unicode(source, 'utf-8')
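
If the encoding really is unknown, the "experiment with some common encodings" step could be sketched as below. This is Python 3, the helper name `decode_with_candidates` and the candidate list are assumptions for illustration, and a decode that succeeds does not prove the guess was right; it only rules out encodings that fail.

```python
def decode_with_candidates(data, candidates=("utf-8", "gb18030")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched: %r" % (candidates,))


print(decode_with_candidates("试".encode("utf-8")))
print(decode_with_candidates("试".encode("gb18030")))
```

Putting a strict encoding such as UTF-8 first matters, because permissive single-byte encodings such as latin-1 accept any byte sequence and would mask a wrong guess.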

I had the same problem with the string "PastelerÃ-a Mallorca", and I solved it with:

unicode("Pastelería Mallorca", 'latin-1')
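
For context, a garbled string of this shape typically appears when UTF-8 bytes are decoded as Latin-1. The following Python 3 sketch reproduces it and reverses it; whether this matches the exact path the data took in the answer above is an assumption.

```python
original = "Pastelería Mallorca"

# Decoding UTF-8 bytes as Latin-1 yields mojibake: "í" becomes "Ã" plus a soft hyphen.
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the wrong step recovers the text, provided no bytes were lost on the way.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```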

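Here is my solution: just add the encoding, i.e. with open(file, encoding='utf8') as f

Because reading the GloVe file takes a long time, I recommend converting it to a numpy file. That will save you time whenever you read the embedding weights.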

import numpy as np
from tqdm import tqdm


def load_glove(file):
    """Loads GloVe vectors in numpy array.
    Args:
        file (str): a path to a glove file.
    Return:
        dict: a dict of numpy arrays.
    """
    embeddings_index = {}
    with open(file, encoding='utf8') as f:
        for i, line in tqdm(enumerate(f)):
            values = line.split()
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs

    return embeddings_index

# EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
EMBEDDING_PATH = 'glove.840B.300d.txt'
embeddings = load_glove(EMBEDDING_PATH)

np.save('glove_embeddings.npy', embeddings) 

Gist link: https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227
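
Reading the saved file back is not shown above. A possible sketch follows, using a tiny stand-in dictionary instead of the real GloVe data; the words, vector sizes, and temporary path are made up for the example.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real GloVe dictionary: two made-up words with 3-d vectors.
embeddings = {
    "hello": np.asarray([0.1, 0.2, 0.3], dtype="float32"),
    "world": np.asarray([0.4, 0.5, 0.6], dtype="float32"),
}

path = os.path.join(tempfile.mkdtemp(), "glove_embeddings.npy")
np.save(path, embeddings)  # the dict is pickled inside a 0-d object array

# allow_pickle=True is required for object arrays; only use it on files you trust.
loaded = np.load(path, allow_pickle=True).item()

print(sorted(loaded))
print(loaded["hello"].dtype)
```

Because the dictionary is stored through pickle, `.item()` is needed to unwrap it from the 0-d array that `np.load` returns.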

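In short, to ensure proper unicode handling in Python 2:

- Use io.open for reading and writing files.
- Use from __future__ import unicode_literals.
- Configure other data inputs and outputs (e.g. databases, network) to use unicode.
- If you cannot configure an output to use utf-8, convert your output for it: print(text.encode('ascii', 'replace').decode())

For explanations, see @Alastair McCormack's detailed answer.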
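
The last fallback in the list, for outputs that can only take ASCII, can be tried out with a short sketch. The helper name `ascii_fallback` is hypothetical, and the example is written for Python 3.

```python
def ascii_fallback(text):
    """Replace every character that ASCII cannot represent with '?'."""
    return text.encode("ascii", "replace").decode("ascii")


print(ascii_fallback("测试 café"))
```

This loses information, so it is best kept as a last resort for displays or logs that cannot be reconfigured.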