Converting PDF to Word completely free? How did I not know about this?
Published: 2019-05-26


"This post takes about three minutes to read."


First, let's look at a PDF file.

[Image: the sample PDF file]

Let's pick one of the paper abstracts in it:

[Image: the selected abstract in the PDF]

After converting it with our Python code:

[Image: the converted text in Word]

Pretty neat, isn't it?

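Most PDF-to-Word converters available online charge a fee, usually per page. With our Python code we can convert PDFs to Word completely free of charge (the script carries over the text only, not images or page layout). Let's see how it works!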

First, here are the modules we need to install:

attrs==17.4.0
lxml==4.1.1
pdfminer3k==1.3.1
pluggy==0.6.0
ply==3.11
py==1.5.2
pytest==3.4.1
python-docx==0.8.6
six==1.11.0

They can all be installed with pip, Python's package manager. Install each module in the list one by one.

Alternatively, put the list into a requirements.txt file and run

pip install -r requirements.txt

to install everything in one go.


Now let's start coding!

First, import the modules we need:

import os
from io import StringIO
from io import open
from concurrent.futures import ProcessPoolExecutor
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from docx import Document

Then define the folder to read the PDF files from and the folder to write the Word files to.

pdf_folder = r'/Users/wuyuqing/Desktop/Code/pdf2word/pdf'
word_folder = r'/Users/wuyuqing/Desktop/Code/pdf2word/word'
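The script later saves every .docx into word_folder, so that folder has to exist before the conversion starts. Here is a minimal sketch, using only the standard library, that creates the folder up front and derives the output path for one PDF. The helper name docx_path_for is hypothetical and not part of the original script.

```python
import os


def docx_path_for(pdf_name, word_folder):
    """Return the .docx path inside word_folder for a given PDF file name."""
    # Create the output folder if it is missing; exist_ok avoids an error on reruns.
    os.makedirs(word_folder, exist_ok=True)
    # Swap the .pdf extension for .docx and place the file in the output folder.
    base_name = os.path.splitext(pdf_name)[0]
    return os.path.join(word_folder, base_name + '.docx')
```

Calling docx_path_for('paper.pdf', word_folder) would give the path of paper.docx inside word_folder.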

Next, we define the functions we will use:

def read_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        resource_manager = PDFResourceManager()
        return_str = StringIO()
        lap_params = LAParams()
        device = TextConverter(
            resource_manager,
            return_str,
            laparams=lap_params)
        process_pdf(resource_manager, device, file)
        device.close()
        content = return_str.getvalue()
        return_str.close()
    return content

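We open the file in binary mode and read its contents. The real work is done by the process_pdf function. To see the steps involved, here is how the library implements it (this code ships with the library and is shown for reference only; you do not need to write it yourself):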

def process_pdf(rsrcmgr, device, fp, pagenos=None, maxpages=0, password='',
                caching=True, check_extractable=True):
    # Create a PDF parser object associated with the file object.
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure.
    doc = PDFDocument(caching=caching)
    # Connect the parser and document objects.
    parser.set_document(doc)
    doc.set_parser(parser)
    # Supply the document password for initialization.
    # (If no password is set, give an empty string.)
    doc.initialize(password)
    # Check if the document allows text extraction. If not, abort.
    if check_extractable and not doc.is_extractable:
        raise PDFTextExtractionNotAllowed(
            'Text extraction is not allowed: %r' % fp)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for (pageno, page) in enumerate(doc.get_pages()):
        if pagenos and (pageno not in pagenos): continue
        interpreter.process_page(page)
        if maxpages and maxpages <= pageno + 1: break
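The pagenos and maxpages parameters in the loop above decide which pages get processed. To make that logic easier to follow, here is a small sketch that mirrors just the loop on a plain list of stand-in pages. The function name select_pages is hypothetical and is not part of pdfminer.

```python
def select_pages(pages, pagenos=None, maxpages=0):
    """Yield the pages that the process_pdf loop would hand to the interpreter."""
    for pageno, page in enumerate(pages):
        # pagenos holds zero-based page indexes; None or an empty value keeps every page.
        if pagenos and (pageno not in pagenos):
            continue
        yield page
        # maxpages == 0 means no limit; otherwise stop once a processed page
        # sits at or beyond position maxpages in the document.
        if maxpages and maxpages <= pageno + 1:
            break
```

For example, select_pages(pages, pagenos={1, 3}) keeps only the second and fourth pages, and select_pages(pages, maxpages=2) stops after the first two. Note that maxpages is compared against the page's position in the document, not against the number of pages already processed.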

Next, we save the extracted text to a docx document:

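def remove_control_characters(content):
    # Not defined in the original excerpt; a minimal version that drops ASCII
    # control characters (code points 0-31), which can make python-docx raise an error.
    return content.translate(dict.fromkeys(range(32)))

def save_text_to_word(content, file_path):
    doc = Document()
    for line in content.split('\n'):
        paragraph = doc.add_paragraph()
        paragraph.add_run(remove_control_characters(line))
    doc.save(file_path)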
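One control character that tends to show up in the extracted text is the form feed ('\f'), which pdfminer's TextConverter typically writes at the end of each page. As a minimal sketch, assuming the text does contain these page markers, the hypothetical helper split_pages below (not part of the original script) separates the text into pages before it is written out.

```python
def split_pages(content):
    """Split extracted text into per-page strings at form feed characters."""
    pages = content.split('\f')
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1] == '':
        pages.pop()
    return pages
```

Each returned string could then be passed to save_text_to_word separately, or followed by a page break, if keeping the original page boundaries matters.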

 

# Wrap the two functions together
def pdf_to_word(pdf_file_path, word_file_path):
    content = read_from_pdf(pdf_file_path)
    save_text_to_word(content, word_file_path)

With that, the main functionality is complete.

Now let's call it to read the PDFs and generate the docx files:

# The guard is needed on platforms that start worker processes by spawning (e.g. Windows).
if __name__ == '__main__':
    tasks = []
    with ProcessPoolExecutor(max_workers=5) as executor:
        for file in os.listdir(pdf_folder):
            extension_name = os.path.splitext(file)[1]
            if extension_name != '.pdf':
                continue
            file_name = os.path.splitext(file)[0]
            pdf_file = pdf_folder + '/' + file
            word_file = word_folder + '/' + file_name + '.docx'
            print('Processing: ', file)
            result = executor.submit(pdf_to_word, pdf_file, word_file)
            tasks.append(result)
    # Leaving the with block already waits for every task, so this loop exits on its first pass.
    while True:
        exit_flag = True
        for task in tasks:
            if not task.done():
                exit_flag = False
        if exit_flag:
            print('Done')
            exit(0)
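The loop above only checks whether each task is done, so an exception raised inside pdf_to_word would pass unnoticed. Here is a minimal sketch of an alternative that uses as_completed from the standard library to wait for the tasks and collect any errors. The names run_all and fake_convert are hypothetical. fake_convert stands in for pdf_to_word so the example runs without PDF files, and a ThreadPoolExecutor keeps the demo lightweight (a ProcessPoolExecutor can be passed in the same way).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(executor, func, jobs):
    """Submit func(*job) for every job and return (job, error) pairs as tasks finish."""
    futures = {executor.submit(func, *job): job for job in jobs}
    results = []
    # as_completed yields each future once it is done, so no polling loop is needed.
    for future in as_completed(futures):
        job = futures[future]
        # exception() returns None on success, or the exception raised inside the task.
        results.append((job, future.exception()))
    return results


def fake_convert(pdf_file, word_file):
    """Stand-in for pdf_to_word: fails for anything that is not a .pdf file."""
    if not pdf_file.endswith('.pdf'):
        raise ValueError('not a PDF: ' + pdf_file)


if __name__ == '__main__':
    jobs = [('a.pdf', 'a.docx'), ('b.txt', 'b.docx')]
    with ThreadPoolExecutor(max_workers=2) as executor:
        for job, error in run_all(executor, fake_convert, jobs):
            status = 'failed: %s' % error if error else 'ok'
            print(job[0], status)
```

In the real script, the jobs list would hold the (pdf_file, word_file) pairs built in the os.listdir loop, and func would be pdf_to_word.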

That is all it takes to generate the docx files. Simple, isn't it?

Why not give it a try yourself?


Reposted from: http://ahnqi.baihongyu.com/
