如何使用Java从文件中获得媒体类型(MIME类型)?到目前为止,我已经尝试了JMimeMagic和Mime-Util。第一个给了我内存异常,第二个没有正确地关闭它的流。

您将如何探测该文件以确定其实际类型(而不仅仅是基于扩展名)?


当前回答

这是我发现的最简单的方法:

byte[] byteArray = ...
InputStream is = new BufferedInputStream(new ByteArrayInputStream(byteArray));
String mimeType = URLConnection.guessContentTypeFromStream(is);

其他回答

在Java 7中,你现在可以只使用Files.probeContentType(path)。

检测文件Media Type1的解决方案包括以下部分:

文件签名列表(参见Kessler列表、Wikipedia列表和Space Maker列表) 媒体类型列表 媒体类型到文件扩展名的映射 将文件签名与文件、路径或InputStream数据源进行比较

如果你复制了代码,请记得给学分。

StreamMediaType.java

在下面的代码中,-1表示跳过该索引处的字节比较;-2表示文件类型签名的结束。它检测二进制格式,主要是图像和一些纯文本格式变体(HTML、SVG、XML)。代码最多使用数据源头部的前11个“神奇”字节。缩短逻辑的优化和改进是受欢迎的。

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

import static com.keenwrite.io.MediaType.*;
import static java.lang.System.arraycopy;

public class StreamMediaType {
  private static final int FORMAT_LENGTH = 11;
  private static final int END_OF_DATA = -2;

  private static final Map<int[], MediaType> FORMAT = new LinkedHashMap<>();

  static {
    //@formatter:off
    FORMAT.put( ints( 0x3C, 0x73, 0x76, 0x67, 0x20 ), IMAGE_SVG_XML );
    FORMAT.put( ints( 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A ), IMAGE_PNG );
    FORMAT.put( ints( 0xFF, 0xD8, 0xFF, 0xE0 ), IMAGE_JPEG );
    FORMAT.put( ints( 0xFF, 0xD8, 0xFF, 0xEE ), IMAGE_JPEG );
    FORMAT.put( ints( 0xFF, 0xD8, 0xFF, 0xE1, -1, -1, 0x45, 0x78, 0x69, 0x66, 0x00 ), IMAGE_JPEG );
    FORMAT.put( ints( 0x49, 0x49, 0x2A, 0x00 ), IMAGE_TIFF );
    FORMAT.put( ints( 0x4D, 0x4D, 0x00, 0x2A ), IMAGE_TIFF );
    FORMAT.put( ints( 0x47, 0x49, 0x46, 0x38 ), IMAGE_GIF );
    FORMAT.put( ints( 0x8A, 0x4D, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A ), VIDEO_MNG );
    FORMAT.put( ints( 0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E ), APP_PDF );
    FORMAT.put( ints( 0x38, 0x42, 0x50, 0x53, 0x00, 0x01 ), IMAGE_PHOTOSHOP );
    FORMAT.put( ints( 0x25, 0x21, 0x50, 0x53, 0x2D, 0x41, 0x64, 0x6F, 0x62, 0x65, 0x2D ), APP_EPS );
    FORMAT.put( ints( 0x25, 0x21, 0x50, 0x53 ), APP_PS );
    FORMAT.put( ints( 0xFF, 0xFB, 0x30 ), AUDIO_MP3 );
    FORMAT.put( ints( 0x49, 0x44, 0x33 ), AUDIO_MP3 );
    FORMAT.put( ints( 0x3C, 0x21 ), TEXT_HTML );
    FORMAT.put( ints( 0x3C, 0x68, 0x74, 0x6D, 0x6C ), TEXT_HTML );
    FORMAT.put( ints( 0x3C, 0x68, 0x65, 0x61, 0x64 ), TEXT_HTML );
    FORMAT.put( ints( 0x3C, 0x62, 0x6F, 0x64, 0x79 ), TEXT_HTML );
    FORMAT.put( ints( 0x3C, 0x48, 0x54, 0x4D, 0x4C ), TEXT_HTML );
    FORMAT.put( ints( 0x3C, 0x48, 0x45, 0x41, 0x44 ), TEXT_HTML );
    FORMAT.put( ints( 0x3C, 0x42, 0x4F, 0x44, 0x59 ), TEXT_HTML );
    FORMAT.put( ints( 0x3C, 0x3F, 0x78, 0x6D, 0x6C, 0x20 ), TEXT_XML );
    FORMAT.put( ints( 0xFE, 0xFF, 0x00, 0x3C, 0x00, 0x3f, 0x00, 0x78 ), TEXT_XML );
    FORMAT.put( ints( 0xFF, 0xFE, 0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00 ), TEXT_XML );
    FORMAT.put( ints( 0x42, 0x4D ), IMAGE_BMP );
    FORMAT.put( ints( 0x23, 0x64, 0x65, 0x66 ), IMAGE_X_BITMAP );
    FORMAT.put( ints( 0x21, 0x20, 0x58, 0x50, 0x4D, 0x32 ), IMAGE_X_PIXMAP );
    FORMAT.put( ints( 0x2E, 0x73, 0x6E, 0x64 ), AUDIO_BASIC );
    FORMAT.put( ints( 0x64, 0x6E, 0x73, 0x2E ), AUDIO_BASIC );
    FORMAT.put( ints( 0x52, 0x49, 0x46, 0x46 ), AUDIO_WAV );
    FORMAT.put( ints( 0x50, 0x4B ), APP_ZIP );
    FORMAT.put( ints( 0x41, 0x43, -1, -1, -1, -1, 0x00, 0x00, 0x00, 0x00, 0x00 ), APP_ACAD );
    FORMAT.put( ints( 0xCA, 0xFE, 0xBA, 0xBE ), APP_JAVA );
    FORMAT.put( ints( 0xAC, 0xED ), APP_JAVA_OBJECT );
    //@formatter:on
  }

  private StreamMediaType() {
  }

  public static MediaType getMediaType( final Path path ) throws IOException {
    return getMediaType( path.toFile() );
  }

  public static MediaType getMediaType( final java.io.File file )
    throws IOException {
    try( final var fis = new FileInputStream( file ) ) {
      return getMediaType( fis );
    }
  }

  public static MediaType getMediaType( final InputStream is )
    throws IOException {
    final var input = new byte[ FORMAT_LENGTH ];
    final var count = is.read( input, 0, FORMAT_LENGTH );

    if( count > 1 ) {
      final var available = new byte[ count ];
      arraycopy( input, 0, available, 0, count );
      return getMediaType( available );
    }

    return UNDEFINED;
  }

  public static MediaType getMediaType( final byte[] data ) {
    assert data != null;

    final var source = new int[]{
      0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};

    for( int i = 0; i < source.length; i++ ) {
      source[ i ] = data[ i ] & 0xFF;
    }

    for( final var key : FORMAT.keySet() ) {
      int i = -1;
      boolean matches = true;

      while( ++i < FORMAT_LENGTH && key[ i ] != END_OF_DATA && matches ) {
        matches = key[ i ] == source[ i ] || key[ i ] == -1;
      }

      if( matches ) {
        return FORMAT.get( key );
      }
    }

    return UNDEFINED;
  }

  private static int[] ints( final int... data ) {
    final var magic = new int[ FORMAT_LENGTH ];
    int i = -1;
    while( ++i < data.length ) {
      magic[ i ] = data[ i ];
    }

    while( i < FORMAT_LENGTH ) {
      magic[ i++ ] = END_OF_DATA;
    }

    return magic;
  }
}

MediaType.java

根据IANA媒体类型列表定义文件格式。注意,文件扩展名映射在MediaTypeExtension中。Apache的FilenameUtils类依赖于它的getExtension函数。

import java.io.File;
import java.io.IOException;
import java.nio.file.Path;

import static MediaType.TypeName.*;
import static MediaTypeExtension.getMediaType;
import static org.apache.commons.io.FilenameUtils.getExtension;

public enum MediaType {
  APP_ACAD( APPLICATION, "acad" ),
  APP_JAVA_OBJECT( APPLICATION, "x-java-serialized-object" ),
  APP_JAVA( APPLICATION, "java" ),
  APP_PS( APPLICATION, "postscript" ),
  APP_EPS( APPLICATION, "eps" ),
  APP_PDF( APPLICATION, "pdf" ),
  APP_ZIP( APPLICATION, "zip" ),
  FONT_OTF( "otf" ),
  FONT_TTF( "ttf" ),
  IMAGE_APNG( "apng" ),
  IMAGE_ACES( "aces" ),
  IMAGE_AVCI( "avci" ),
  IMAGE_AVCS( "avcs" ),
  IMAGE_BMP( "bmp" ),
  IMAGE_CGM( "cgm" ),
  IMAGE_DICOM_RLE( "dicom_rle" ),
  IMAGE_EMF( "emf" ),
  IMAGE_EXAMPLE( "example" ),
  IMAGE_FITS( "fits" ),
  IMAGE_G3FAX( "g3fax" ),
  IMAGE_GIF( "gif" ),
  IMAGE_HEIC( "heic" ),
  IMAGE_HEIF( "heif" ),
  IMAGE_HEJ2K( "hej2k" ),
  IMAGE_HSJ2( "hsj2" ),
  IMAGE_X_ICON( "x-icon" ),
  IMAGE_JLS( "jls" ),
  IMAGE_JP2( "jp2" ),
  IMAGE_JPEG( "jpeg" ),
  IMAGE_JPH( "jph" ),
  IMAGE_JPHC( "jphc" ),
  IMAGE_JPM( "jpm" ),
  IMAGE_JPX( "jpx" ),
  IMAGE_JXR( "jxr" ),
  IMAGE_JXRA( "jxrA" ),
  IMAGE_JXRS( "jxrS" ),
  IMAGE_JXS( "jxs" ),
  IMAGE_JXSC( "jxsc" ),
  IMAGE_JXSI( "jxsi" ),
  IMAGE_JXSS( "jxss" ),
  IMAGE_KTX( "ktx" ),
  IMAGE_KTX2( "ktx2" ),
  IMAGE_NAPLPS( "naplps" ),
  IMAGE_PNG( "png" ),
  IMAGE_PHOTOSHOP( "photoshop" ),
  IMAGE_SVG_XML( "svg+xml" ),
  IMAGE_T38( "t38" ),
  IMAGE_TIFF( "tiff" ),
  IMAGE_WEBP( "webp" ),
  IMAGE_WMF( "wmf" ),
  IMAGE_X_BITMAP( "x-xbitmap" ),
  IMAGE_X_PIXMAP( "x-xpixmap" ),
  AUDIO_BASIC( AUDIO, "basic" ),
  AUDIO_MP3( AUDIO, "mp3" ),
  AUDIO_WAV( AUDIO, "x-wav" ),
  VIDEO_MNG( VIDEO, "x-mng" ),
  TEXT_HTML( TEXT, "html" ),
  TEXT_MARKDOWN( TEXT, "markdown" ),
  TEXT_PLAIN( TEXT, "plain" ),
  TEXT_XHTML( TEXT, "xhtml+xml" ),
  TEXT_XML( TEXT, "xml" ),
  TEXT_YAML( TEXT, "yaml" ),

  /*
   * When all other lights go out.
   */
  UNDEFINED( TypeName.UNDEFINED, "undefined" );

  public enum TypeName {
    APPLICATION,
    AUDIO,
    IMAGE,
    TEXT,
    UNDEFINED,
    VIDEO
  }

  private final String mMediaType;
  private final TypeName mTypeName;
  private final String mSubtype;

  MediaType( final String subtype ) {
    this( IMAGE, subtype );
  }

  MediaType( final TypeName typeName, final String subtype ) {
    mTypeName = typeName;
    mSubtype = subtype;
    mMediaType = typeName.toString().toLowerCase() + '/' + subtype;
  }

  public static MediaType valueFrom( final File file ) {
    assert file != null;
    return fromFilename( file.getName() );
  }

  public static MediaType fromFilename( final String filename ) {
    assert filename != null;
    return getMediaType( getExtension( filename ) );
  }

  public static MediaType valueFrom( final Path path ) {
    assert path != null;
    return valueFrom( path.toFile() );
  }

  public static MediaType valueFrom( String contentType ) {
    if( contentType == null || contentType.isBlank() ) {
      return UNDEFINED;
    }

    var i = contentType.indexOf( ';' );
    contentType = contentType.substring(
      0, i == -1 ? contentType.length() : i );

    i = contentType.indexOf( '/' );
    i = i == -1 ? contentType.length() : i;
    final var type = contentType.substring( 0, i );
    final var subtype = contentType.substring( i + 1 );

    return valueFrom( type, subtype );
  }

  public static MediaType valueFrom(
    final String type, final String subtype ) {
    assert type != null;
    assert subtype != null;

    for( final var mediaType : values() ) {
      if( mediaType.equals( type, subtype ) ) {
        return mediaType;
      }
    }

    return UNDEFINED;
  }

  public boolean equals( final String type, final String subtype ) {
    assert type != null;
    assert subtype != null;

    return mTypeName.name().equalsIgnoreCase( type ) &&
      mSubtype.equalsIgnoreCase( subtype );
  }

  public boolean isType( final TypeName typeName ) {
    return mTypeName == typeName;
  }

  public String getSubtype() {
    return mSubtype;
  }
   
  @Override
  public String toString() {
    return mMediaType;
  }
}

MediaTypeExtension.java

谜题的最后一部分是MediaTypes到其已知和常见/流行的文件扩展名的映射。这允许基于文件扩展名的双向查找。

import static MediaType.*;
import static java.util.List.of;

public enum MediaTypeExtension {
  MEDIA_APP_ACAD( APP_ACAD, of( "dwg" ) ),
  MEDIA_APP_PDF( APP_PDF ),
  MEDIA_APP_PS( APP_PS, of( "ps" ) ),
  MEDIA_APP_EPS( APP_EPS ),
  MEDIA_APP_ZIP( APP_ZIP ),

  MEDIA_AUDIO_MP3( AUDIO_MP3 ),
  MEDIA_AUDIO_BASIC( AUDIO_BASIC, of( "au" ) ),
  MEDIA_AUDIO_WAV( AUDIO_WAV, of( "wav" ) ),

  MEDIA_FONT_OTF( FONT_OTF ),
  MEDIA_FONT_TTF( FONT_TTF ),

  MEDIA_IMAGE_APNG( IMAGE_APNG ),
  MEDIA_IMAGE_BMP( IMAGE_BMP ),
  MEDIA_IMAGE_GIF( IMAGE_GIF ),
  MEDIA_IMAGE_JPEG( IMAGE_JPEG,
                    of( "jpg", "jpe", "jpeg", "jfif", "pjpeg", "pjp" ) ),
  MEDIA_IMAGE_PNG( IMAGE_PNG ),
  MEDIA_IMAGE_PSD( IMAGE_PHOTOSHOP, of( "psd" ) ),
  MEDIA_IMAGE_SVG( IMAGE_SVG_XML, of( "svg" ) ),
  MEDIA_IMAGE_TIFF( IMAGE_TIFF, of( "tiff", "tif" ) ),
  MEDIA_IMAGE_WEBP( IMAGE_WEBP ),
  MEDIA_IMAGE_X_BITMAP( IMAGE_X_BITMAP, of( "xbm" ) ),
  MEDIA_IMAGE_X_PIXMAP( IMAGE_X_PIXMAP, of( "xpm" ) ),

  MEDIA_VIDEO_MNG( VIDEO_MNG, of( "mng" ) ),

  MEDIA_TEXT_MARKDOWN( TEXT_MARKDOWN, of(
    "md", "markdown", "mdown", "mdtxt", "mdtext", "mdwn", "mkd", "mkdown",
    "mkdn" ) ),
  MEDIA_TEXT_PLAIN( TEXT_PLAIN, of( "txt", "asc", "ascii", "text", "utxt" ) ),
  MEDIA_TEXT_R_MARKDOWN( TEXT_R_MARKDOWN, of( "Rmd" ) ),
  MEDIA_TEXT_R_XML( TEXT_R_XML, of( "Rxml" ) ),
  MEDIA_TEXT_XHTML( TEXT_XHTML, of( "xhtml" ) ),
  MEDIA_TEXT_XML( TEXT_XML ),
  MEDIA_TEXT_YAML( TEXT_YAML, of( "yaml", "yml" ) ),

  MEDIA_UNDEFINED( UNDEFINED, of( "undefined" ) );

  private final MediaType mMediaType;
  private final List<String> mExtensions;

  MediaTypeExtension( final MediaType mediaType ) {
    this( mediaType, of( mediaType.getSubtype() ) );
  }

  MediaTypeExtension(
    final MediaType mediaType, final List<String> extensions ) {
    assert mediaType != null;
    assert extensions != null;
    assert !extensions.isEmpty();

    mMediaType = mediaType;
    mExtensions = extensions;
  }

  public String getExtension() {
    return mExtensions.get( 0 );
  }

  public static MediaTypeExtension valueFrom( final MediaType mediaType ) {
    for( final var type : values() ) {
      if( type.isMediaType( mediaType ) ) {
        return type;
      }
    }

    return MEDIA_UNDEFINED;
  }

  boolean isMediaType( final MediaType mediaType ) {
    return mMediaType == mediaType;
  }

  static MediaType getMediaType( final String extension ) {
    final var sanitized = sanitize( extension );

    for( final var mediaType : MediaTypeExtension.values() ) {
      if( mediaType.isType( sanitized ) ) {
        return mediaType.getMediaType();
      }
    }

    return UNDEFINED;
  }

  private boolean isType( final String sanitized ) {
    for( final var extension : mExtensions ) {
      if( extension.equalsIgnoreCase( sanitized ) ) {
        return true;
      }
    }

    return false;
  }

  private static String sanitize( final String extension ) {
    return extension == null ? "" : extension.toLowerCase();
  }

  private MediaType getMediaType() {
    return mMediaType;
  }
}

Usages:

// EXAMPLE -- Detect media type
//
final File image = new File( "filename.jpg" );
final MediaType mt = StreamMediaType.getMediaType( image );

// Tricky! The JPG could be a PNG in disguise.
if( mt.isType( MediaType.TypeName.IMAGE ) ) {

  if( mt == MediaType.IMAGE_PNG ) {
    // Nice try! Sneaky sneak.
  }
}

// EXAMPLE -- Get typical media type file name extension
//
final String ext = MediaTypeExtension.valueFrom( MediaType.IMAGE_SVG_XML ).getExtension();

// EXAMPLE -- Get media type from HTTP request
//
final var url = new URL( "https://localhost/path/file.ext" );
final var conn = (HttpURLConnection) url.openConnection();
final var contentType = conn.getContentType();
MediaType mediaType = valueFrom( contentType );

// Fall back to stream detection probe
if( mediaType == UNDEFINED ) {
  mediaType = StreamMediaType.getMediaType( conn.getInputStream() );
}

conn.disconnect();

你懂的。


简短的图书馆回顾:

Apache Tika——600kb的膨胀,需要多行配置和多个JAR文件。 jMimeMagic—未完成,需要多行配置。 MimeUtil2——相当大,不能开箱即用。 filettypedetector——与JDK捆绑在一起,比山松甲虫出没的森林还要多。 文件。probeContentType——检测是特定于平台的,被认为是不可靠的(来源)。 MimetypesFileTypeMap——与activation.jar绑定,使用文件扩展名。


用于测试的音频、视频和图像文件示例:

http://mirrors.standaloneinstaller.com/video-sample/ https://www.w3.org/People/mimasa/test/imgformat/


注意,几乎所有XML文档的开头都是一样的:

<?xml version="1.0" standalone="no"?>

由于SVG文档是XML文档,许多SVG文档将包含XML声明,并且可能还包含:

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN" "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">

可以通过将神奇的字节从11改为13来检测SVG文档类型。不过,doctype不是必需的,这意味着SVG文档也可以在XML声明之后开始,如下所示:

<svg xmlns="http://www.w3.org/2000/svg">

这意味着,在使用此代码检测SVG文件格式时要谨慎,因为它不可靠。相反,应该考虑使用HTTP内容类型或文件扩展名。

更复杂的问题是,可以在<svg标记之前插入任意长度的注释,这使得检测更加困难。


“MIME类型”是一个已弃用的术语。

使用Apache Tika,你只需要三行代码:

File file = new File("/path/to/file");
Tika tika = new Tika();
System.out.println(tika.detect(file));

如果你有一个groovy控制台,只需粘贴并运行以下代码即可:

@Grab('org.apache.tika:tika-core:1.14')
import org.apache.tika.Tika;

def tika = new Tika()
def file = new File("/path/to/file")
println tika.detect(file)

记住,它的api是丰富的,它可以解析“任何东西”。从tika-core 1.14开始,你有:

String  detect(byte[] prefix)
String  detect(byte[] prefix, String name)
String  detect(File file)
String  detect(InputStream stream)
String  detect(InputStream stream, Metadata metadata)
String  detect(InputStream stream, String name)
String  detect(Path path)
String  detect(String name)
String  detect(URL url)

有关更多信息,请参阅apidocs。

在Java中,URLConnection类有一个名为guessContentTypeFromName(String fileName)的方法,可以用来根据文件的文件名猜测文件的MIME媒体类型(也称为内容类型)。该方法使用文件名的扩展名来确定内容类型。

String fileName = "image.jpg";
String contentType = URLConnection.guessContentTypeFromName(fileName);
System.out.println(contentType); // "image/jpeg"

想要了解更多,请阅读这篇文章

不幸的是,

mimeType = file.toURL().openConnection().getContentType();

不工作,因为URL的这种使用会使文件被锁定,因此,例如,它是不可删除的。

然而,你有这个:

mimeType= URLConnection.guessContentTypeFromName(file.getName());

还有下面的内容,它的优点不仅仅是使用文件扩展名,还可以查看内容

InputStream is = new BufferedInputStream(new FileInputStream(file));
mimeType = URLConnection.guessContentTypeFromStream(is);
 //...close stream

然而,正如上面的评论所建议的那样,mime-types的内置表是非常有限的,例如,不包括MSWord和PDF。因此,如果您想要泛化,您将需要使用内置库,例如Mime-Util(这是一个很棒的库,同时使用文件扩展名和内容)。