Programmatically Converting/Printing a Web Page to PDF
Posted by sapientcoder on August 8, 2008
Something that came up recently at work was the need to programmatically convert a web page (rendered, not raw HTML) to PDF. Since I was less than impressed with the documentation I found on the web explaining how to do this in a Windows environment, I thought I’d post how I did it to make it easier for the next person.
Let’s start with the overall architecture. The first thing I did was install PDFCreator on a server on our network. PDFCreator is a free “virtual printer”: you print documents to it, and it writes them out as PDF files. While I use CutePDF on my own computer for this purpose, PDFCreator has the advantage of a “batch print” capability, meaning it’s possible to print to it without a print dialog and have the files saved automatically to a specific location. To do this, I installed PDFCreator as a service (instructions here). I then created a console application (using VS.NET 2005) that grabs the web page I’m interested in, renders it, and sends it to the PDF printer. I set this application up as a Scheduled Task in Windows so it can run weekly.
Now let’s take a look at the code. (All code snippets include the relevant “using” statements.) Here’s my Main method (leaving out the exception handling for brevity’s sake):
using System;                 // STAThreadAttribute lives here
using System.Windows.Forms;

[STAThread]
static void Main(string[] args)
{
    // Starts a message loop without creating a form.
    Application.Run(new CustomApplicationContext());
}
Now, I know that two aspects of that code seem unusual for a console application. The first is the [STAThread] attribute, which I’ll explain later, and the second is the call to Application.Run(). That call is there because I need the console application to handle events and to continue executing until I explicitly tell it to quit, which means it needs a message loop (or message “pump”). In this regard, it acts more like a GUI application but without the overhead of creating forms I don’t need.
NOTE: Although I used a console application, the same code will work in a Windows (GUI) application by changing the call to Application.Run() to run a form rather than an ApplicationContext.
Next, here’s the majority of my CustomApplicationContext class definition:
using System;
using System.Windows.Forms;

class CustomApplicationContext : ApplicationContext
{
    WebBrowser _browser = new WebBrowser();

    // Passed to ExecWB as pvaIn: suppress prompts and block until printing finishes.
    Int16 PRINT_ARGS = (
        Constants.PRINT_DONTBOTHERUSER |
        Constants.PRINT_WAITFORCOMPLETION );

    public CustomApplicationContext()
    {
        _browser.DocumentCompleted += _browser_DocumentCompleted;
        _browser.Navigate(Constants.TARGET_URL);
    }

    void _browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // DocumentCompleted can fire more than once (for example, once per frame),
        // so only print when the whole page is ready.
        if (_browser.ReadyState == WebBrowserReadyState.Complete)
        {
            try
            {
                // Remember the current default printer so it can be restored afterwards.
                string defaultPrinter = WrappedNativeMethods.GetDefaultPrinter();
                if (NativeMethods.SetDefaultPrinter(Constants.PDF_PRINTER_NAME))
                {
                    object pvaIn = PRINT_ARGS;
                    object pvaOut = Type.Missing;
                    ((SHDocVw.IWebBrowser2)_browser.ActiveXInstance).ExecWB(
                        SHDocVw.OLECMDID.OLECMDID_PRINT,
                        SHDocVw.OLECMDEXECOPT.OLECMDEXECOPT_DONTPROMPTUSER,
                        ref pvaIn,
                        ref pvaOut);
                    NativeMethods.SetDefaultPrinter(defaultPrinter);
                }
            }
            finally
            {
                // End the message loop so the console application can exit.
                Application.ExitThread();
            }
        }
    }
}
Here’s the relevant code in my Constants class:
using System;

class Constants
{
    public const string PDF_PRINTER_NAME = "PDFCreator";
    public const Int16 PRINT_WAITFORCOMPLETION = 0x02;
    public const Int16 PRINT_DONTBOTHERUSER = 0x01;
    public const string TARGET_URL = @"http://url_of_web_page_to_print";
}
Lastly, here’s the relevant code in my NativeMethods class:
using System;
using System.Runtime.InteropServices;
using System.Text;

class NativeMethods
{
    [DllImport("winspool.drv", CharSet = CharSet.Auto, SetLastError = true)]
    public static extern bool GetDefaultPrinter(StringBuilder pszBuffer, ref int pcchBuffer);

    [DllImport("winspool.drv", CharSet = CharSet.Auto, SetLastError = true)]
    public static extern bool SetDefaultPrinter(string name);
}
The above Win32 function declarations were obtained from pinvoke.net.
If you’re wondering why I didn’t post the code for the WrappedNativeMethods class, it’s because there’s only one method in it: a method that wraps the above GetDefaultPrinter method. To get that code, click here and copy the “Sample Code” on the page.
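In case the link is unavailable, here is a minimal sketch of what such a wrapper might look like. To be clear, this is not the pinvoke.net sample, just one plausible way to write it. The class and method names come from the call in CustomApplicationContext; the private P/Invoke declaration repeats the one above only so the snippet compiles on its own, and the error handling is my assumption.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using System.Text;

static class WrappedNativeMethods
{
    const int ERROR_INSUFFICIENT_BUFFER = 122;

    // Same declaration as in NativeMethods, repeated so this sketch is self-contained.
    [DllImport("winspool.drv", CharSet = CharSet.Auto, SetLastError = true)]
    static extern bool GetDefaultPrinter(StringBuilder pszBuffer, ref int pcchBuffer);

    // Returns the name of the current default printer.
    // Throws Win32Exception if there is no default printer or the call fails.
    public static string GetDefaultPrinter()
    {
        int size = 0;

        // First call with no buffer: expected to fail with ERROR_INSUFFICIENT_BUFFER
        // and set `size` to the required length in characters.
        if (!GetDefaultPrinter(null, ref size))
        {
            int error = Marshal.GetLastWin32Error();
            if (error != ERROR_INSUFFICIENT_BUFFER)
                throw new Win32Exception(error);
        }

        // Second call with a buffer of the reported size.
        StringBuilder buffer = new StringBuilder(size);
        if (!GetDefaultPrinter(buffer, ref size))
            throw new Win32Exception(Marshal.GetLastWin32Error());

        return buffer.ToString();
    }
}
```

Calling the function twice avoids guessing a buffer size up front, at the cost of one extra native call.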
Now I’ll explain the [STAThread] attribute above the Main method. The attribute is there because in order to instantiate an instance of the WebBrowser control, the current thread must be running in a “single-threaded apartment” state. Click here for a short explanation of what that means.
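If you can’t put [STAThread] on the entry point (say, because the code is hosted inside something else), one alternative is to start a dedicated STA thread yourself. The helper below is a sketch of that idea; StaRunner is a hypothetical name of mine, not something from the project described in this post.

```csharp
using System;
using System.Threading;

// Hypothetical helper: runs a piece of work on a new single-threaded-apartment thread.
static class StaRunner
{
    public static void Run(ThreadStart work)
    {
        Thread thread = new Thread(work);
        // The apartment state must be set before the thread is started.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        // Block the caller until the work (e.g. a message loop) has finished.
        thread.Join();
    }
}

// Possible usage with the class from this post:
//   StaRunner.Run(delegate { Application.Run(new CustomApplicationContext()); });
```

Note that an unhandled exception on the new thread will still take down the process, so the work passed in should do its own exception handling.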
Now to explain a few quirks of printing with the WebBrowser control (i.e. Internet Explorer) and how I worked around them. First, the WebBrowser control has a single Print() method that takes no arguments. As a result, you can’t easily do the following: (a) send the output to any printer other than the system’s default printer, (b) suppress Internet Explorer’s “print” dialog (it’s displayed as though a user clicked “File => Print”), and (c) execute the print call synchronously (i.e. cause it to block until completion rather than firing off a thread and then returning).
To solve the default printer issue, I had to make Win32 API calls to get the current default printer, make my PDF printer the default right before printing, and reset the default printer right after printing.
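One weakness of the swap as written in my handler is that the original default printer is not restored if ExecWB throws. A variation that restores it no matter what is to wrap the swap in a small disposable object. The sketch below shows the idea; DefaultPrinterScope is a hypothetical helper, not part of the code I actually deployed, and it declares its own copy of SetDefaultPrinter only so that it compiles by itself.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

// Hypothetical helper: makes `temporaryPrinter` the default printer for the
// lifetime of the object and switches back to `originalPrinter` on Dispose.
sealed class DefaultPrinterScope : IDisposable
{
    [DllImport("winspool.drv", CharSet = CharSet.Auto, SetLastError = true)]
    static extern bool SetDefaultPrinter(string name);

    readonly string _originalPrinter;

    public DefaultPrinterScope(string originalPrinter, string temporaryPrinter)
    {
        _originalPrinter = originalPrinter;
        if (!SetDefaultPrinter(temporaryPrinter))
            throw new Win32Exception(Marshal.GetLastWin32Error());
    }

    public void Dispose()
    {
        // Best effort: the return value is ignored so Dispose never throws.
        SetDefaultPrinter(_originalPrinter);
    }
}

// Possible usage inside the DocumentCompleted handler:
//   string original = WrappedNativeMethods.GetDefaultPrinter();
//   using (new DefaultPrinterScope(original, Constants.PDF_PRINTER_NAME))
//   {
//       // call ExecWB here; the original printer is restored even on an exception
//   }
```

Keep in mind that the default printer is a per-user setting, so anything else printing under the same account during that window will also go to the PDF printer.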
Next, to work around the print dialog issue, I bypass the web browser’s Print() method and call ExecWB directly (which Print() calls under the hood) with my own arguments. Note that to call ExecWB, the web browser must be cast to a COM interface of type IWebBrowser2. To include that interface (and a few constants) in my code, I added a reference to the “SHDocVw.dll” file (found in the “C:\WINDOWS\system32” folder). The OLECMDEXECOPT_DONTPROMPTUSER value passed to ExecWB tells the browser to suppress the print dialog. Likewise, the pvaIn argument carries the PRINT_ARGS value, which includes a “don’t bother the user” flag.
By the way, both flags in PRINT_ARGS are documented here on MSDN. Likewise, the IDM_PRINT command (which is what calling ExecWB with the OLECMDID_PRINT argument maps to) is documented here. If you read the docs you’ll see that the IDM_PRINT command takes a VARIANT of type VT_I2 for the pvaIn argument. In .NET, that maps to an object that boxes an Int16 value (see this page on MSDN).
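The Int16 detail is easy to get wrong, because the | operator promotes its operands to int. The small sketch below shows the difference; the class and method names are made up for the illustration, and the two constants mirror the ones in my Constants class.

```csharp
using System;

static class VariantBoxing
{
    const short PRINT_DONTBOTHERUSER = 0x01;
    const short PRINT_WAITFORCOMPLETION = 0x02;

    // Boxes the combined flags as an Int16, which COM interop marshals as VT_I2.
    public static object BoxPrintArgs()
    {
        // short | short produces an int, so convert back to short before boxing.
        short flags = (short)(PRINT_DONTBOTHERUSER | PRINT_WAITFORCOMPLETION);
        return flags;
    }

    // Boxes the same flags without the conversion: the result is an Int32,
    // which would be marshaled as VT_I4 rather than the VT_I2 the docs ask for.
    public static object BoxPrintArgsAsInt()
    {
        return PRINT_DONTBOTHERUSER | PRINT_WAITFORCOMPLETION;
    }
}
```

This is why PRINT_ARGS is declared as an Int16 field before being assigned to the object variable, rather than combining the flags inline at the call site.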
Lastly, you’ll see that PRINT_ARGS includes a “wait for completion” flag. This is what tells the ExecWB call to block until printing is completed. Why is this important? Because we don’t want to reset the default printer and exit the application until we’ve actually printed something. In fact, exiting too soon can cause the print job never to happen.
I hope this post was helpful. Even if you’re not printing to PDF, I hope knowing a little more about how the WebBrowser control prints in general will make it easier to use.
Roby said
Thanks for your helpful article!
I have a question: there is a way via code to set the pdf file name generated by PDFCreator (without using Automatic Save option in PDFCreator menu)?
Thanks,
Roby
sapientcoder said
Good question. There are actually a few ways I can think of to tackle this issue, but probably the easiest is to programmatically change the value of the registry key PDFCreator uses for the filename when auto-saving.
This is the key it uses for the filename:
HKCU\Software\PDFCreator\Program\AutosaveFilename
And this is the key it uses for the directory:
HKCU\Software\PDFCreator\Program\AutosaveDirectory
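In code, that could look something like the sketch below. It assumes that AutosaveFilename and AutosaveDirectory are named values under the “Program” key (check with regedit for your PDFCreator version), and the class and method names are just placeholders I made up for the example.

```csharp
using Microsoft.Win32;

// Hypothetical helper: points PDFCreator's auto-save at a given directory and file name.
static class PdfCreatorAutosave
{
    // Assumption: both settings are values stored under this key.
    const string ProgramKey = @"HKEY_CURRENT_USER\Software\PDFCreator\Program";

    public static void SetTarget(string directory, string fileName)
    {
        // Registry.SetValue creates the key if it does not exist and
        // stores strings as REG_SZ values.
        Registry.SetValue(ProgramKey, "AutosaveDirectory", directory);
        Registry.SetValue(ProgramKey, "AutosaveFilename", fileName);
    }
}

// Possible usage, right before printing:
//   PdfCreatorAutosave.SetTarget(@"C:\Reports", "weekly-report");
```

Since these settings live under HKCU, the code has to run as the same user account whose PDFCreator settings you want to change.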
Hope that helps. If not or if I didn’t understand the question, let me know, and I’ll work with you some more offline (via e-mail) to see if I can help.
Roby said
Oh thanks Barts for your reply! I can apply your solution in my context. I’ve also other questions about what I’ve to do. Could I write you? I think you can see my email address.
Thanks,
Roby
Roby said
Hi Bart, another question for you.. I hope you can help me
I’ve created a functionality that convert a web page (aspx page) using PDFCreator and WebBrowser control directly from the same web page using a solution similar at your. In my machine (win xp sp 3, IIS 5) that works good, but on the server (win server 2003) that doesn’t work (the request stall and doesn’t arrive at the PDFCreator printer). I’ve tried to make the same with a console application on the server and it works!
I think could be a security problem, but how I could resolve this?
Thanks in advance..
Roby
Roby said
I’ve solved my problem! I had to set PDFCreator as default printer!!
Thanks!!!
Roby
idophir said
Hi sapientcoder,
I recently created a similar solution to yours but when I try to run my code as a service on a server in a domain, I get a COM Exception 80080005 when trying to initialize shdocvw.dll
Do you have any idea of environmental settings that might solve this?
Thanks!
hx said
I was trying to do the exact thing as you described in your blog entry “Programmatically Converting/Printing a Web Page to PDF”; however, my web page contains JavaScript (http://www.housingmap.com/?c=sfo), and _browser_DocumentCompleted doesn’t wait until the JavaScript is executed.
Could you suggest some solution for that?
Thanks a lot!