Reading Word documents with Python: how to read the text content of a Word file and write it to a new txt file

❶ How to read information from a Word file with Python on Linux

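First download and install win32com (part of the pywin32 package). It drives a locally installed copy of Microsoft Word through COM, so this approach needs Windows with Word installed: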
from win32com import client as wc
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open('c:/test')
doc.SaveAs('c:/test.text', 2)
doc.Close()
word.Quit()

The text file produced this way cannot be read by Python in ordinary 'r' mode. To get a file that Python can read in 'r' mode, write this instead:

doc.SaveAs('c:/test', 4)

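Note: once the call completes, the .txt suffix is appended to the file name automatically, even though none was specified.
On Windows XP, the file should then be opened with
open(r'c:\text','r')

The second argument of SaveAs is one of Word's save-format constants: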
wdFormatDocument = 0
wdFormatDocument97 = 0
wdFormatDocumentDefault = 16
wdFormatDOSText = 4
wdFormatDOSTextLineBreaks = 5
wdFormatEncodedText = 7
wdFormatFilteredHTML = 10
wdFormatFlatXML = 19
wdFormatFlatXMLMacroEnabled = 20
wdFormatFlatXMLTemplate = 21
wdFormatFlatXMLTemplateMacroEnabled = 22
wdFormatHTML = 8
wdFormatPDF = 17
wdFormatRTF = 6
wdFormatTemplate = 1
wdFormatTemplate97 = 1
wdFormatText = 2
wdFormatTextLineBreaks = 3
wdFormatUnicodeText = 7
wdFormatWebArchive = 9
wdFormatXML = 11
wdFormatXMLDocument = 12
wdFormatXMLDocumentMacroEnabled = 13
wdFormatXMLTemplate = 14
wdFormatXMLTemplateMacroEnabled = 15
wdFormatXPS = 18

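Going by their names, the constants map to the corresponding file formats; Office 2003 may not support all of them. For converting a Word file to HTML there are two choices, wdFormatHTML and wdFormatFilteredHTML (8 and 10). The difference is that with wdFormatHTML, formulas and other OLE objects in the Word file are stored as WMF, whereas with wdFormatFilteredHTML the formula images are stored as GIF. On inspection, the HTML generated with wdFormatFilteredHTML is also noticeably cleaner than that generated with wdFormatHTML.
Of course, you can also call the Office API through COM from any other language, PHP for example.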
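As a small illustration of the constants listed above, the magic numbers can be given names on the Python side. This is only a sketch: the class name WdSaveFormat and the helper describe are hypothetical names chosen here, and only a handful of the codes from the list are included.

```python
from enum import IntEnum


class WdSaveFormat(IntEnum):
    """A few of the save-format codes from the list above (not the full set)."""
    wdFormatText = 2
    wdFormatDOSText = 4
    wdFormatRTF = 6
    wdFormatHTML = 8
    wdFormatFilteredHTML = 10
    wdFormatPDF = 17


def describe(code):
    """Return the constant name for a numeric code, or None if it is not in this subset."""
    try:
        return WdSaveFormat(code).name
    except ValueError:
        return None
```

With this in place, a call such as doc.SaveAs(path, int(WdSaveFormat.wdFormatDOSText)) reads more clearly than a bare 4.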
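from win32com import client as wc
import re

# Let Word convert the document to plain text (4 = wdFormatDOSText)
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open(r'c:/test1.doc')
doc.SaveAs('c:/test1.text', 4)
doc.Close()
word.Quit()

# An answer is a letter A-D inside half-width () or full-width （）.
# \xa1 is the byte that GBK uses (doubled) for a full-width space, so the
# patterns assume the file content is a GBK byte string, as in Python 2.
pattern = r'\(\s*[A-D]\s*\)|\(\xa1*[A-D]\xa1*\)|（\s*[A-D]\s*）|（\xa1*[A-D]\xa1*）'
strings = open(r'c:\test1.text', 'r').read()
result = re.findall(pattern, strings)
chan = re.sub(pattern, '()', strings)  # question text with the answers blanked

question = open(r'c:\question', 'a+')
question.write(chan)
question.close()

# Answer key: one "number letter" line per answer found
answer = open(r'c:\answeronly', 'a+')
for i, a in enumerate(result):
    m = re.search('[A-D]', a)
    answer.write(str(i+1) + ' ' + m.group() + '\n')
answer.close()

# The same substitution with the full-width parentheses written as their
# GBK bytes (\xa3\xa8 and \xa3\xa9) instead of literally.
chan = re.sub(r'\xa3\xa8\s*[A-D]\s*\xa3\xa9', '()', strings)
# Avoid literal (): they easily lead to ambiguity.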
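The answer-extraction step can also be sketched on its own, without Word, for Python 3 text strings. This is an assumption-laden variant rather than the original author's code: the names ANSWER and split_answers are hypothetical, the input is assumed to be already decoded to str, and it relies on \s matching the full-width space U+3000 in Python 3 str patterns, so no byte escapes are needed.

```python
import re

# (A) or （A）, with optional whitespace (including U+3000) around the letter
ANSWER = re.compile(r'\(\s*([A-D])\s*\)|（\s*([A-D])\s*）')


def split_answers(text):
    """Return (text with the answers blanked, list of 'n X' answer-key lines)."""
    # findall yields one (half_width, full_width) tuple per match; one side is empty
    letters = [a or b for a, b in ANSWER.findall(text)]
    blanked = ANSWER.sub('()', text)
    key = ['%d %s' % (i + 1, letter) for i, letter in enumerate(letters)]
    return blanked, key
```

The two return values correspond to what the script above writes to the question file and the answer file.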

❷ How to read information from a Word file with Python on Linux

Step 1: get the XML file that makes up the document (this applies to .docx files, which are zip archives)

import zipfile

def get_word_xml(docx_filename):
    # A .docx file is a zip archive; the document body is word/document.xml
    with zipfile.ZipFile(docx_filename) as archive:
        xml_content = archive.read('word/document.xml')
    return xml_content
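To try this step without a real Word file, a stand-in archive can be built in memory. The sketch below is illustrative only: read_document_xml and list_parts are hypothetical helper names, and the archive it builds contains a single dummy part, so it is not a complete, valid .docx.

```python
import io
import zipfile


def read_document_xml(docx_file):
    """Return the raw bytes of word/document.xml from a path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read('word/document.xml')


def list_parts(docx_file):
    """Return the names of all parts stored in the archive."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


# Build a tiny stand-in archive in memory (dummy content, not a valid .docx)
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr('word/document.xml', '<doc>hello</doc>')

xml_bytes = read_document_xml(buf)
```

Listing the parts of a real document is a quick way to see what else the archive holds besides the main body.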

Step 2: parse the XML into a tree data structure
from lxml import etree

def get_xml_tree(xml_string):
    return etree.fromstring(xml_string)

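Step 3: read the Word content:
# Both functions are methods of a reader class, hence the self parameter.
def _itertext(self, my_etree):
    """Iterator to go through xml tree's text nodes"""
    for node in my_etree.iter(tag=etree.Element):
        if self._check_element_is(node, 't'):
            yield (node, node.text)

def _check_element_is(self, element, type_char):
    # word_schema must hold the namespace URI of the WordprocessingML
    # elements for the comparison below to match
    word_schema = '99999'
    return element.tag == '{%s}%s' % (word_schema, type_char)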
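The text-node iteration above can be sketched as a standalone function. Several things here are assumptions rather than the original code: the standard library's xml.etree.ElementTree is swapped in for lxml, W_NS is taken to be the WordprocessingML main namespace URI that .docx files normally bind to the w: prefix, and iter_text and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: the WordprocessingML main namespace (the usual "w:" prefix)
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def iter_text(root):
    """Yield (node, text) for every w:t element, in document order."""
    for node in root.iter('{%s}t' % W_NS):
        yield node, node.text or ''


# Hand-written sample standing in for the content of word/document.xml
sample = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)
root = ET.fromstring(sample)
text = ''.join(t for _, t in iter_text(root))
```

Filtering by the fully qualified tag name does the same job as the _check_element_is comparison, since ElementTree also stores tags in the '{namespace}name' form.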

❸ How to read the text content of a Word file with Python and write it to a new txt file
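One possible way to do this for .docx files, using only the standard library, is sketched below. It is a minimal sketch under stated assumptions: docx_to_txt is a hypothetical name, the namespace is assumed to be the usual WordprocessingML one, each w:p element is treated as one output line, and tables, headers, footnotes and formatting are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumed WordprocessingML namespace, in ElementTree's '{uri}' tag-prefix form
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_to_txt(docx_path, txt_path):
    """Write one line of text per Word paragraph and return the lines."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    lines = []
    for para in root.iter(W_NS + 'p'):
        # A paragraph's text is split across runs; join their w:t nodes
        lines.append(''.join(t.text or '' for t in para.iter(W_NS + 't')))
    with open(txt_path, 'w', encoding='utf-8') as out:
        out.write('\n'.join(lines) + '\n')
    return lines
```

Legacy .doc files are not zip archives, so they would first need converting, for example with the Word COM approach from ❶.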
