How to edit Microsoft Word documents in Python (2024)

Published on 29th of August, 2021.

In preparation for the job market, I started polishing my CV. I try to keep the CV on my website as up-to-date as possible, but many recruiters and companies prefer a neat single-page CV in a Microsoft Word document. I used to always make my CVs in LaTeX, but it seems Word is often preferred since it’s easier for third parties to edit.

Keeping web, Word, and PDF versions all up-to-date and easy to edit seemed like an annoying task. I have plenty of experience with automatically generating PDF documents using LaTeX and Python, so I figured: why should a Word document be any different? Let’s dive into the world of editing Word documents in Python!

Fortunately there is a library for this: python-docx. It can be used to create Word documents from scratch, but styling a document is a bit tricky. Instead, its real power lies in editing pre-made documents. I went ahead and made a nice-looking CV in Word, and now let’s open this document in python-docx. A Word document is stored as XML under the hood, and a document can have a complicated tree structure. However, we can load the document and use its .paragraphs attribute for a list of the paragraphs in the document. Let’s take a paragraph and print its text content.

from docx import Document

document = Document("resume.docx")
paragraph = document.paragraphs[0]
print(paragraph.text)

Output

Rik Voorhaar

Turns out the first paragraph contains my name! Editing this text is very easy; we just need to assign a new value to the .text attribute. Let’s do this and save the document.

paragraph.text = "Willem Hendrik"
document.save("resume_edited.docx")

Below is a picture of the resulting change. Unfortunately, two additional things happened when editing this paragraph: the font of the edited paragraph changed, and the bar / text box on the right-hand side disappeared completely!

[Figure: the document after editing the first paragraph, with a changed font and the right-hand text box missing]

This is no good, but to understand what happened to the text box we need to dig into the XML of the document. We can dump the document’s XML to a file like so:

document = Document("resume.docx")
with open('resume.xml', 'w') as f:
    f.write(document._element.xml)
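The same XML can also be reached without python-docx, since a .docx file is an ordinary zip archive whose main body lives in word/document.xml. The sketch below is only an illustration using the standard library: it builds a tiny in-memory stand-in archive (not a valid Word file; the two paragraphs are made up to mirror my resume, and the helper names are hypothetical) and reads the paragraph text back out.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Made-up document body, mirroring the first two paragraphs of the resume.
DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Rik Voorhaar</w:t></w:r></w:p>
    <w:p><w:r><w:t>Resume</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def make_stand_in_docx():
    """Build an in-memory zip holding only word/document.xml (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", DOCUMENT_XML)
    buffer.seek(0)
    return buffer


def paragraph_texts(docx_file):
    """Return the text of every <w:p> element found in word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for p in root.iter(f"{{{W_NS}}}p"):
        texts.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return texts


texts = paragraph_texts(make_stand_in_docx())
print(texts)
```

Passing the path of a real .docx file to paragraph_texts should work the same way, although I have only tried to reason about it on this stand-in.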

It seems the problem was that the text box on the right was nested inside another object, which is apparently not handled properly. This issue was easy to fix by modifying the Word document. However, the bar on the right-hand side consists of two text boxes, and the top box with my contact information still disappears if I change the first paragraph. It does not disappear if I change the second paragraph; it only happens if I change paragraph 1 or 3 (and the latter is empty). I tried inserting two paragraphs before this particular paragraph, and changing its style, but the issue remains.

Looking at the XML, the issue is clear: the text box element is nested inside this paragraph! It turned out to be a bit tricky to avoid this, so for now let us try changing the second paragraph instead, replacing the word “Resume” with “Curriculum Vitae”.

document = Document("resume.docx")
paragraph = document.paragraphs[1]
print(paragraph.text)
paragraph.text = "Curriculum Vitae"
document.save("CV.docx")

Output

Resume

If we do this there are no problems with text boxes disappearing, but unfortunately the style of this paragraph is still reset. Let’s have a look at how the XML changes when we edit this paragraph. Ignoring irrelevant information, before the change it looks like this:

<w:p>
  <w:r>
    <w:t>R</w:t>
  </w:r>
  <w:r>
    <w:t>esume</w:t>
  </w:r>
</w:p>

And afterwards it looks like this:

<w:p>
  <w:r>
    <w:t>Curriculum Vitae</w:t>
  </w:r>
</w:p>

In Word, each paragraph (<w:p>) is split up into multiple runs (<w:r>). What we see here is that originally the paragraph consisted of two runs, and after modifying it, it became a single run. However, it seems that in both cases the style information is exactly the same, so I don’t understand why the style changes after modification. If I retype the word ‘Resume’ in the original Word document, this paragraph becomes a single run, but the style still changes after editing, and I still don’t see why this happens when looking at the XML.
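To make the paragraph/run split concrete, here is a small sketch that parses the "before" XML from above with the standard library and lists the text of each run. It only uses xml.etree.ElementTree, not python-docx, and the namespace declaration is something I added so that the snippet parses on its own.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# The "before" paragraph from above, with the namespace declared explicitly.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W_NS}">
  <w:r><w:t>R</w:t></w:r>
  <w:r><w:t>esume</w:t></w:r>
</w:p>
"""

paragraph = ET.fromstring(PARAGRAPH_XML)

# One entry per run: the concatenated text of that run's <w:t> children.
run_texts = [
    "".join(t.text or "" for t in run.findall("w:t", NS))
    for run in paragraph.findall("w:r", NS)
]

# The visible paragraph text is simply the runs glued together.
paragraph_text = "".join(run_texts)

print(run_texts)
print(paragraph_text)
```

So the word a reader sees as "Resume" is really two separate runs, and any code that edits a paragraph has to decide what to do with each of them.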

Looking at the source code of python-docx, I noticed that when we assign paragraph.text = ..., the contents of the paragraph get deleted, and then a new run is added with the desired text. It is not clear to me where exactly the style information is stored, but either way there is a simple workaround for what we’re trying to do: we can modify the text of an existing run in the paragraph, rather than clearing the entire paragraph and adding a new run. This in fact also works for editing the first paragraph, where before we had problems with disappearing text boxes:

document = Document("resume.docx")
with open('resume.xml', 'w') as f:
    f.write(document._element.xml)

# Change 'Rik Voorhaar' for 'Willem Hendrik Voorhaar'
paragraph = document.paragraphs[0]
run = paragraph.runs[1]
run.text = 'Willem Hendrik Voorhaar'

# Change 'Resume' for 'Curriculum Vitae'
paragraph = document.paragraphs[1]
run = paragraph.runs[0]
run.text = 'Curriculum Vitae'

document.save('CV.docx')

Doing this changes the text, but leaves all the style information the same. Alright, now we know how to edit text. It’s trickier than one might expect, but it does work!
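The run-level workaround can be wrapped in a small helper. This is a minimal sketch rather than anything that ships with python-docx: replace_text_keep_runs is a hypothetical name, and it only relies on a paragraph having a .runs list whose items have a writable .text attribute. It skips runs without text, on the assumption that those are the runs holding things like the nested text box, and it assumes that runs with text do not also hold drawings. To keep the example runnable without a Word file, it is exercised on simple stand-in objects instead of real python-docx paragraphs.

```python
from types import SimpleNamespace


def replace_text_keep_runs(paragraph, new_text):
    """Write new_text into the first run that has text; blank the other text runs.

    Runs with empty text are left untouched (assumption: they may hold
    non-text content such as a drawing or text box).
    """
    text_runs = [run for run in paragraph.runs if run.text]
    if not text_runs:
        raise ValueError("paragraph has no run with text to overwrite")
    text_runs[0].text = new_text
    for run in text_runs[1:]:
        run.text = ""
    return paragraph


# Stand-in for a paragraph: an empty run (e.g. holding a text box),
# followed by the two runs 'R' and 'esume'.
fake_paragraph = SimpleNamespace(
    runs=[
        SimpleNamespace(text=""),
        SimpleNamespace(text="R"),
        SimpleNamespace(text="esume"),
    ]
)

replace_text_keep_runs(fake_paragraph, "Curriculum Vitae")
result = [run.text for run in fake_paragraph.runs]
print(result)
```

The new text inherits whatever formatting the first text run had, which is exactly the behaviour we relied on above.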

Dealing with text boxes

Let’s say that next we want to edit the text box on the right-hand side of the document, and add a skill to our list of skills. We’ve been diving into the inner workings of Word documents, so it’s fair to say we know how to use Microsoft Word, so let’s add the skill “Microsoft Word” to the list.

To do this we first want to figure out in which paragraph this information is stored. We can do this by going through all the paragraphs in the document and looking for the text “Skills”.

import re

pattern = re.compile("Skills")
for p in document.paragraphs:
    if pattern.search(p.text):
        print("Found the paragraph!")
        break
else:
    print("Did not find the paragraph :(")

Output

Did not find the paragraph :(

Seems like there is unfortunately no matching paragraph! This is because the paragraph we want is inside a text box, and modifying text boxes is not supported in python-docx. This is a known issue, but instead of giving up I decided to add support for modifying text boxes to python-docx myself! It turned out not to be too difficult to implement, despite my limited knowledge of both the package and the inner structure of Word documents.
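The failed search makes more sense once you see where text box paragraphs live in the tree. As far as I can tell, .paragraphs only returns paragraphs that are direct children of the document body, while a text box paragraph sits several levels deeper inside another paragraph. The sketch below shows the difference with the standard library on a heavily simplified, made-up body (the real wrapper elements around the text box are left out, and direct_text is a hypothetical helper).

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Simplified body: one top-level paragraph whose first run contains a
# text box with its own nested paragraph.
BODY_XML = f"""
<w:body xmlns:w="{W_NS}">
  <w:p>
    <w:r>
      <w:txbxContent>
        <w:p><w:r><w:t>Skills</w:t></w:r></w:p>
      </w:txbxContent>
    </w:r>
    <w:r><w:t>Rik Voorhaar</w:t></w:r>
  </w:p>
</w:body>
"""


def direct_text(p):
    """Text of the runs that are direct children of paragraph p."""
    return "".join(t.text or "" for t in p.findall("w:r/w:t", NS))


body = ET.fromstring(BODY_XML)

# Only paragraphs directly under the body.
top_level = [direct_text(p) for p in body.findall("w:p", NS)]

# Every paragraph at any depth, in document order.
all_levels = [direct_text(p) for p in body.iter(f"{{{W_NS}}}p")]

print(top_level)
print(all_levels)
```

The “Skills” paragraph only shows up when searching at every depth, which is why looping over the top-level paragraphs never finds it.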

The first step is understanding how text boxes are encoded in the XML. It turns out that the structure is something like this:

<mc:AlternateContent>
  <mc:Choice Requires="wps">
    <w:drawing>
      <wp:anchor>
        <a:graphic>
          <a:graphicData>
            <wps:txbx>
              <w:txbxContent>
                ...
              </w:txbxContent>
            </wps:txbx>
          </a:graphicData>
        </a:graphic>
      </wp:anchor>
    </w:drawing>
  </mc:Choice>
  <mc:Fallback>
    <w:pict>
      <v:textbox>
        <w:txbxContent>
          ...
        </w:txbxContent>
      </v:textbox>
    </w:pict>
  </mc:Fallback>
</mc:AlternateContent>

The contents of the two <w:txbxContent> elements are identical. The information is stored twice, probably for legacy reasons. A quick Google reveals that wps is an XML namespace introduced in Office 2010, and WPS is short for Word Processing Shape. The text box is therefore stored twice to maintain backwards compatibility with older Word versions. Not sure many people still use Office 2007… Either way, this means that if we want to update the contents of the text box, we need to do it in two places.

Next we need to figure out how to manipulate these Word objects. My idea is to create a TextBox class that is associated to an <mc:AlternateContent> element, and which ensures that both <w:txbxContent> elements are always updated at the same time. First we make a class encoding a <w:txbxContent> element. For this we can build on the BlockItemContainer class already implemented in python-docx. Inheriting from this class gives automatic support for manipulating paragraphs inside of the container.

class TextBoxContent(BlockItemContainer):
    """Proxy for a <w:txbxContent> element."""

Given an <mc:AlternateContent> element, we can access the two <w:txbxContent> elements using the following XPath expressions:

XPATH_CHOICE = "./mc:Choice/w:drawing/wp:anchor/a:graphic/a:graphicData//wps:txbx/w:txbxContent"
XPATH_FALLBACK = "./mc:Fallback/w:pict//v:textbox/w:txbxContent"
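To check that these two paths really pick out one copy each, here is a sketch that evaluates them on a made-up <mc:AlternateContent> element. Note the assumptions: python-docx evaluates XPath through lxml with its own namespace map, whereas this snippet uses the standard library's ElementTree with an explicit namespace dictionary, and the <wps:wsp> and <v:shape> wrapper elements are my guess at what typically sits at the "//" steps.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used by the two paths.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "mc": "http://schemas.openxmlformats.org/markup-compatibility/2006",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "wps": "http://schemas.microsoft.com/office/word/2010/wordprocessingShape",
    "v": "urn:schemas-microsoft-com:vml",
}

XPATH_CHOICE = "./mc:Choice/w:drawing/wp:anchor/a:graphic/a:graphicData//wps:txbx/w:txbxContent"
XPATH_FALLBACK = "./mc:Fallback/w:pict//v:textbox/w:txbxContent"

# Made-up text box with the same content stored in both branches.
xmlns = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
CONTENT = "<w:txbxContent><w:p><w:r><w:t>Skills</w:t></w:r></w:p></w:txbxContent>"
ALT_CONTENT_XML = f"""
<mc:AlternateContent {xmlns}>
  <mc:Choice Requires="wps">
    <w:drawing><wp:anchor><a:graphic><a:graphicData>
      <wps:wsp><wps:txbx>{CONTENT}</wps:txbx></wps:wsp>
    </a:graphicData></a:graphic></wp:anchor></w:drawing>
  </mc:Choice>
  <mc:Fallback>
    <w:pict><v:shape><v:textbox>{CONTENT}</v:textbox></v:shape></w:pict>
  </mc:Fallback>
</mc:AlternateContent>
"""


def text_of(element):
    """All text below an element, concatenated."""
    return "".join(element.itertext()).strip()


alt_content = ET.fromstring(ALT_CONTENT_XML)
choice = alt_content.findall(XPATH_CHOICE, NS)
fallback = alt_content.findall(XPATH_FALLBACK, NS)

print(len(choice), len(fallback))
print(text_of(choice[0]), text_of(fallback[0]))
```

Each path matches exactly one <w:txbxContent> element, and both hold the same text, which is the invariant the rest of the code depends on.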

Then making a rudimentary TextBox class is very simple. We base it on the ElementProxy class in python-docx. This class is meant for storing and manipulating the children of an XML element.

class TextBox(ElementProxy):
    """Implements text boxes. Requires an `<mc:AlternateContent>` element."""

    def __init__(self, element, parent):
        super(TextBox, self).__init__(element, parent)
        try:
            (tbox1,) = element.xpath(XPATH_CHOICE)
            (tbox2,) = element.xpath(XPATH_FALLBACK)
        except ValueError as err:
            raise ValueError(
                "This element is not a text box; it should contain precisely two "
                "``<w:txbxContent>`` objects"
            ) from err
        self.tbox1 = TextBoxContent(tbox1, self)
        self.tbox2 = TextBoxContent(tbox2, self)
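The single-element unpacking `(tbox1,) = ...` in the constructor is what enforces “precisely one match”: Python raises a ValueError when the right-hand side has zero items or more than one. Here is the idiom on its own, with plain lists standing in for XPath results; exactly_one is a hypothetical helper name used only for this illustration.

```python
def exactly_one(matches):
    """Return the only item of `matches`, or raise ValueError otherwise."""
    try:
        (only,) = matches  # fails unless there is exactly one item
    except ValueError as err:
        raise ValueError(
            f"expected exactly one match, got {len(matches)}"
        ) from err
    return only


print(exactly_one(["txbxContent"]))

messages = []
for bad in ([], ["a", "b"]):
    try:
        exactly_one(bad)
    except ValueError as err:
        messages.append(str(err))
print(messages)
```

This keeps the check and the assignment in one line, at the cost of a somewhat cryptic built-in error message, which is why the constructor re-raises with a clearer one.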

So far this is just good for storing the text box; we still need some code to actually manipulate it. It would also be great to have a way to find all the text boxes in a document. This is as simple as finding all the <mc:AlternateContent> elements with precisely two <w:txbxContent> elements. We can use the following function:

def find_textboxes(element, parent):
    """
    List all text box objects in the document.

    Looks for all ``<mc:AlternateContent>`` elements, and selects those
    which contain a text box.
    """
    alt_cont_elems = element.xpath(".//mc:AlternateContent")
    text_boxes = []
    for elem in alt_cont_elems:
        tbox1 = elem.xpath(XPATH_CHOICE)
        tbox2 = elem.xpath(XPATH_FALLBACK)
        if len(tbox1) == 1 and len(tbox2) == 1:
            text_boxes.append(TextBox(elem, parent))
    return text_boxes

We then update the Document class with a new textboxes attribute:

@property
def textboxes(self):
    """
    List all text box objects in the document.
    """
    return find_textboxes(self._element, self)

Now let’s test this out:

document = Document("resume.docx")
document.textboxes

This gives output:

[<docx.oxml.textbox.TextBox at 0x7faf395c3bc0>,
 <docx.oxml.textbox.TextBox at 0x7faf395c3100>]

Now to manipulate the “Skills” section as we initially wanted, we first find the right paragraph. Since the two <w:txbxContent> elements contain the same paragraphs, we need to find in which text box the text occurs, and at which paragraph index:

import re

def find_paragraph(pattern):
    for textbox in document.textboxes:
        for i, p in enumerate(textbox.paragraphs):
            if pattern.search(p.text):
                return textbox, i

pattern = re.compile("Skills")
textbox, i = find_paragraph(pattern)
print(textbox.paragraphs[i].text)

Output

Skills
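Using one index for both copies only works as long as the two copies really contain the same paragraphs. A cheap sanity check before editing could look like the sketch below. It is not part of my fork: paired_paragraphs and fake_content are hypothetical names, and the function only assumes an object with tbox1 and tbox2 attributes that each expose .paragraphs, like the TextBox class above. Simple stand-in objects are used so the snippet runs without a Word file.

```python
from types import SimpleNamespace


def paired_paragraphs(textbox):
    """Pair up paragraph i of both copies, refusing to continue if they differ."""
    first = textbox.tbox1.paragraphs
    second = textbox.tbox2.paragraphs
    if [p.text for p in first] != [p.text for p in second]:
        raise ValueError("the two copies of the text box have diverged")
    return list(zip(first, second))


def fake_content(*texts):
    """Stand-in for a TextBoxContent object with the given paragraph texts."""
    return SimpleNamespace(paragraphs=[SimpleNamespace(text=t) for t in texts])


textbox = SimpleNamespace(
    tbox1=fake_content("Skills", "Research"),
    tbox2=fake_content("Skills", "Research"),
)

pairs = [(a.text, b.text) for a, b in paired_paragraphs(textbox)]
print(pairs)
```

If an earlier edit touched only one of the two copies, this raises instead of silently editing mismatched paragraphs.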

Now to insert a new skill, we need to create a new paragraph with the text “Microsoft Word”. For this we can find the paragraph right after it, and call this paragraph’s insert_paragraph_before method with the appropriate text and style information. The paragraph in question is the one containing the word “Research”. I want to copy the style of this paragraph to the new paragraph, but for some reason the style information is empty for this paragraph. However, I know that the style of this paragraph should be 'Skillsentries', so I can just use that directly.

style = document.styles['Skillsentries']
pattern = re.compile("Research")
textbox, i = find_paragraph(pattern)
p1 = textbox.tbox1.paragraphs[i]
p2 = textbox.tbox2.paragraphs[i]
for p in (p1, p2):
    # Pass the style looked up above; p.style is empty for this paragraph
    p.insert_paragraph_before("Microsoft Word", style)
document.save("CV.docx")

When now opening the Word document, we see the item “Microsoft Word” in my list of skills, with the right style and everything. I did cheat a little; I needed to make some additional technical changes to the code for this all to work, but the details are not super important. If you want to use this feature, you can use my fork of python-docx. My solution is still a little hacky, so I don’t think it will be added to the main repository, but it does work fine for my purposes.
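For readers who want to see what the double insertion amounts to at the XML level, here is a rough standard-library sketch. It is not the code from my fork: the helper names are hypothetical, the two <w:txbxContent> copies are made up, and paragraph styles are left out entirely to keep it short.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P_TAG = f"{{{W_NS}}}p"


def make_content(*texts):
    """Build a made-up <w:txbxContent> with one plain paragraph per text."""
    body = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    return ET.fromstring(f'<w:txbxContent xmlns:w="{W_NS}">{body}</w:txbxContent>')


def new_paragraph(text):
    """Create a <w:p> holding a single run with the given text."""
    p = ET.Element(P_TAG)
    r = ET.SubElement(p, f"{{{W_NS}}}r")
    t = ET.SubElement(r, f"{{{W_NS}}}t")
    t.text = text
    return p


def insert_before(content, marker, new_text):
    """Insert a new paragraph before the first paragraph containing marker."""
    for index, child in enumerate(content):
        if child.tag == P_TAG and marker in "".join(child.itertext()):
            content.insert(index, new_paragraph(new_text))
            return
    raise ValueError(f"no paragraph containing {marker!r}")


# The same content stored twice, as in the Choice and Fallback branches.
choice_copy = make_content("Skills", "Research")
fallback_copy = make_content("Skills", "Research")

# Apply the identical edit to both copies so they stay in sync.
for copy in (choice_copy, fallback_copy):
    insert_before(copy, "Research", "Microsoft Word")

choice_texts = ["".join(p.itertext()) for p in choice_copy]
fallback_texts = ["".join(p.itertext()) for p in fallback_copy]
print(choice_texts)
print(fallback_texts)
```

The important part is the loop over both copies: editing only one of them would leave the document showing different content depending on which branch the reader's Word version picks.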

Conclusion

In summary, we can use Python to edit Word documents. However, the python-docx package is not fully mature, and using it for editing highly stylized Word documents is a bit painful (but possible!). It is, however, quite easy to extend with new functionality, in case you do need to do this. On the other hand, there is quite extensive functionality in Visual Basic to edit Word documents, and the whole Word API is built around Visual Basic.

While I now have all the tools available to automatically update my CV using Python, I will actually refrain from doing it. It is a lot of work to set up properly, and needs active maintenance every time I would want to change the styling of my CV. It’s probably a better idea to just manually edit it every time I need to. Automation isn’t always worth it. But I wouldn’t be surprised if this newfound skill turns out to be useful at some point in the future.

