大发体育娱乐在线-大发体育娱乐官方网站-大发体育娱乐登录网址
做最好的网站

Linux下Xpdf的布置和应用,PHP读取PDF内容格外Xpdf的

来源:http://www.dfwstonefabricators.com 作者:编程应用 人气:69 发布时间:2019-11-23
摘要:一.下载 首先,大家先把质感下下来先。尽管无需转中文的话,只须求下载它就可以:xpdf-bin-linux-3.03.tar,即便急需转中文,那你就还索要它了:xpdf-chinese-simplified.tar 二.安装 以后,下载

一.下载 首先,大家先把质感下下来先。 尽管无需转中文的话,只须求下载它就可以:xpdf-bin-linux-3.03.tar,即便急需转中文,那你就还索要它了:xpdf-chinese-simplified.tar 二.安装 以后,下载实现了吗,大家得以开展设置了。 [root@localhost ~]# mkdir -p /lcf/upan [root@localhost ~]# mkdir -p /lcf/cdrom [root@localhost ~]# mkdir -p /lcf/xpdf [root@localhost ~]# cd /lcf/upan/ [root@localhost upan]# cp xpdf/* ../xpdf/ [root@localhost upan]# cd ../xpdf/ [root@localhost xpdf]# tar -zxvf xpdfbin-linux-3.03.tar.gz [root@localhost xpdf]# cd xpdfbin-linux-3.03 [root@localhost xpdfbin-linux-3.03]# cat INSTALL [root@localhost xpdfbin-linux-3.03]# cd bin32/ [root@localhost bin32]# cp ./* /usr/local/bin/ [root@localhost bin32]# cd ../doc/ [root@localhost doc]# mkdir -p /usr/local/man/man1 [root@localhost doc]# mkdir -p /usr/local/man/man5 [root@localhost doc]# cp *.1 /usr/local/man/man1 [root@localhost doc]# cp *.5 /usr/local/man/man5 若是无需读取中文的话,到这边就足以甘休了,借使急需,那大家后续今后 [root@localhost doc]# cp sample-xpdfrc /usr/local/etc/xpdfrc [root@localhost xpdf]# cd /lcf/xpdf [root@localhost xpdf]# tar -zxvf xpdf-chinese-simplified.tar.gz [root@localhost xpdf]# cd xpdf-chinese-simplified [root@localhost xpdf]# mkdir -p/usr/local/share/xpdf/chinese-simplified [root@localhost xpdf]# cd xpdf-chinese-simplified/ [root@localhost xpdf-chinese-simplified]# cp Adobe-GB1.cidToUnicode ISO-2022-CN.unicodeMap EUC-CN.unicodeMap GBK.unicodeMap CMAP /usr/local/share/xpdf/chinese-simplified/ 把chinese-simplified里面文件add-to-xpdfrc 的内容复制到/usr/local/etc/xpdfrc文件中。记得里面的路子要科学。(注意,那当中的简体中文手袋括以下三种格式:ISO-2022-CN,EUC-CN,GBK ,看清楚哦,不辅助UTF-8,能够先转为GBK,然后进行转义卡塔 尔(英语:State of Qatar) 三.意义完成 至此,全数的配备达成,大家要从头选择它了。 假若是简简单单的PDF读取,那么直接用上边包车型大巴口舌就OK了。 $content = shell_exec('/usr/local/bin/pdftotext '.$filename.' -'); 即使急需转汉语,如此那般,加上参数。 $content = shell_exec('/usr/local/bin/pdftotext -layout -enc GBK '.$filename.' -'); 当然,加了参数之后仍然为不影响意国语的更改的,所以,放心使用吧。供给介怀的是,这里转出来的是GBK编码的啊,现在网址比很多用的是UTF-8,想要不突显乱码的话,供给重新转义一下啊。 $content = mb_convert_encoding($content, 'UTF-8','GBK'); 至此,就瓜熟蒂落了。读抽取来的源委,你想怎么样利用,再写代码管理吧。 最终加一下pdftotext 的参数表达给我们。 首要参数如下: OPTIONS Many of the following options can be set with configuration file com- mands. These are listed in square brackets with the description of the corresponding command line option. -f number Specifies the first page to convert. -l number Specifies the last page to convert. -layout Maintain the original physical layout of the text. The default is to 'undo' physical layout (columns, hyphenation, etc.) and output the text in reading order. -fixed number Assume fixed-pitch text, with the specified charac- ter width . This forces physical layout mode. -raw Keep the text in content stream order. This is a hack which often "undoes" column formatting, etc. Use of raw mode is no longer recommended. -htmlmeta Generate a simple HTML file, including the meta information. This simply wraps the text in

前天官员拍脑袋想出了一个需求,要笔者读取PDF里面的内容,并且入仓库储存为正文,用来搜寻。

 and 

一.下载

and prepends the meta headers. -enc encoding-name

首先,我们先把材料下下来先。下载地址在此:

纵然无需转汉语的话,只须求下载它就能够:xpdf-bin-linux-3.03.tar,假若急需转汉语,那你就还亟需它了:xpdf-chinese-simplified.tar

二.安装

现行反革命,下载达成了啊,大家得以扩充安装了。

      [root@localhost ~]# mkdir -p /lcf/upan
      [root@localhost ~]# mkdir -p /lcf/cdrom
      [root@localhost ~]# mkdir -p /lcf/xpdf
     
      [root@localhost ~]# cd /lcf/upan/

      [root@localhost upan]# cp xpdf/* ../xpdf/ (下载的文件放入/lcf/xpdf目录卡塔 尔(阿拉伯语:قطر‎
      [root@localhost upan]# cd ../xpdf/

      [root@localhost xpdf]# tar -zxvf xpdfbin-linux-3.03.tar.gz

      [root@localhost xpdf]# cd xpdfbin-linux-3.03

      [root@localhost xpdfbin-linux-3.03]# cat INSTALL

      [root@localhost xpdfbin-linux-3.03]# cd bin32/
      [root@localhost bin32]# cp ./* /usr/local/bin/

      [root@localhost bin32]# cd ../doc/

      [root@localhost doc]# mkdir -p /usr/local/man/man1
      [root@localhost doc]# mkdir -p /usr/local/man/man5
      [root@localhost doc]# cp *.1 /usr/local/man/man1
      [root@localhost doc]# cp *.5 /usr/local/man/man5

假设没有须要读取粤语的话,到此处就足以终结了,假使要求,那我们三回九转将来

      [root@localhost doc]# cp sample-xpdfrc /usr/local/etc/xpdfrc

      [root@localhost xpdf]# cd /lcf/xpdf

      [root@localhost xpdf]# tar -zxvf xpdf-chinese-simplified.tar.gz
      [root@localhost xpdf]# cd xpdf-chinese-simplified
      [root@localhost xpdf]# mkdir -p/usr/local/share/xpdf/chinese-simplified
      [root@localhost xpdf]# cd xpdf-chinese-simplified/

      [root@localhost xpdf-chinese-simplified]# cp Adobe-GB1.cidToUnicode ISO-2022-CN.unicodeMap EUC-CN.unicodeMap GBK.unicodeMap CMAP /usr/local/share/xpdf/chinese-simplified/     

        把chinese-simplified里面文件add-to-xpdfrc 的剧情复制到/usr/local/etc/xpdfrc文件中。记得里面的门道要精确。(注意,那中间的简体汉语手袋括以下二种格式:ISO-2022-CN,EUC-CN,GBK ,看清楚哦,不帮衬UTF-8,能够先转为GBK,然后进行转义卡塔尔国

      三.成效达成

      至此,全体的安排完结,大家要早先应用它了。

      假使是轻巧的PDF读取,那么直接用上面的讲话就OK了。

      $content = shell_exec('/usr/local/bin/pdftotext '.$filename.' -');   

  假使急需转普通话,如此那般,加上参数。

  $content = shell_exec('/usr/local/bin/pdftotext -layout -enc GBK '.$filename.' -');

  当然,加了参数之后如故是不影响捷克语的转变的,所以,放心使用吧。需求静心的是,这里转出来的是GBK编码的啊,以往网址超多用的是UTF-8,想要不显得乱码的话,必要再行转义一下啊。

  $content = mb_convert_encoding($content, 'UTF-8','GBK');

  至此,就马到成功了。读抽取来的源委,你想怎么着行使,再写代码管理吧。

  最后加一下pdftotext 的参数表达给大家。

首要参数如下:

OPTIONS
Many of the following options can be set with configuration file com-
mands. These are listed in square brackets with the description of the
corresponding command line option.

-f number
Specifies the first page to convert.

-l number
Specifies the last page to convert.

-layout
Maintain (as best as possible) the original physical layout of
the text. The default is to 'undo' physical layout (columns,
hyphenation, etc.) and output the text in reading order.

-fixed number
Assume fixed-pitch (or tabular) text, with the specified charac-
ter width (in points). This forces physical layout mode.

-raw Keep the text in content stream order. This is a hack which
often "undoes" column formatting, etc. Use of raw mode is no
longer recommended.

-htmlmeta
Generate a simple HTML file, including the meta information.
This simply wraps the text in <pre> and </pre> and prepends the
meta headers.

-enc encoding-name

图片 1

本文由大发体育娱乐在线发布于编程应用,转载请注明出处:Linux下Xpdf的布置和应用,PHP读取PDF内容格外Xpdf的

关键词:

上一篇:PHP怎么样赢得mssql的囤积进程的出口参数

下一篇:没有了

最火资讯