Introduction:
To understand how to read streams from an Office file, it is essential to understand how the file is stored.
A binary MS Office file (for example, a .doc) is not an ordinary flat file; it is a Compound File. Compound File is Microsoft's implementation of structured storage.
Structured Storage:
In one line: it is a file system within a file. What does that mean? Structured storage allows hierarchical storage of information within a single file. The elements of a structured storage object are storages and streams. Storages are analogous to directories, and streams are analogous to files. Every structured storage has a primary (root) storage object that may contain streams and substorages, and the substorages may themselves be nested. Storages provide the structure of the object, and streams contain the data.
Pic-1: Structured Storage
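The storage/stream hierarchy described above can be pictured with a small in-memory model. This is only a sketch to illustrate the idea: the Storage and Stream types, substorage() and listStreams() are hypothetical names invented for this example and are not the COM IStorage/IStream interfaces.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy model only: hypothetical types, NOT the COM IStorage/IStream interfaces.
struct Stream {
    std::vector<unsigned char> data;   // a stream holds raw bytes, like a file
};

struct Storage {
    std::map<std::string, Stream> streams;                     // like files
    std::map<std::string, std::unique_ptr<Storage>> storages;  // like directories

    // Create (or fetch) a nested substorage by name.
    Storage& substorage(const std::string& name) {
        assert(!name.empty());
        std::unique_ptr<Storage>& slot = storages[name];
        if (!slot) slot.reset(new Storage());
        return *slot;
    }
};

// Collect the full path of every stream under root, depth first:
// streams of the current storage first, then each substorage in turn.
void listStreams(const Storage& root, const std::string& prefix,
                 std::vector<std::string>& out) {
    for (const auto& s : root.streams)
        out.push_back(prefix + "/" + s.first);
    for (const auto& d : root.storages)
        listStreams(*d.second, prefix + "/" + d.first, out);
}
```

With a root storage holding one stream and one substorage that holds another stream, listStreams() returns one path per stream, much like listing the files under a directory tree.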
To read further about the storage structure of an Office file, see the links below.
http://msdn.microsoft.com/en-us/library/aa380363%28VS.85%29.aspx
http://msdn.microsoft.com/en-us/library/aa378938%28VS.85%29.aspx
http://en.wikipedia.org/wiki/COM_Structured_Storage
For example, I have a utility called stg that displays the structured storage of a file. Take a look at the storage structure of the sample document Test.doc.
1. Here is a sample code snippet showing how to open a compound storage object.
const WCHAR *inFilePtr = L"C:\\Test.doc";
HRESULT hr = S_OK;
IStorage *pStg = NULL;
// Open the compound file read-only; others may read it but not write to it.
hr = StgOpenStorageEx(
    inFilePtr,
    STGM_READ | STGM_SHARE_DENY_WRITE,
    STGFMT_ANY,
    0,
    NULL,
    NULL,
    IID_IStorage,
    reinterpret_cast<void**>(&pStg) );
if( FAILED(hr) ) {
    // Either the file is not a compound file or it does not exist.
    throw "couldn't open storage: invalid file type or file does not exist";
}
Read more about StgOpenStorageEx from MSDN.
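The open call fails when the file is not a compound file. On Windows, StgIsStorageFile can be used to test a file by name. As a portable alternative, here is a minimal sketch that checks the first eight bytes of the file against the Compound File Binary header signature (D0 CF 11 E0 A1 B1 1A E1). The names hasCompoundFileSignature and looksLikeCompoundFile are hypothetical helpers made up for this example, and a matching signature only suggests the format; it does not prove the rest of the file is valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>

// First eight bytes of a Compound File Binary header.
static const unsigned char kCompoundFileSignature[8] =
    { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

// Return true when the buffer starts with the compound file signature.
bool hasCompoundFileSignature(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size < sizeof(kCompoundFileSignature)) return false;
    return std::memcmp(data, kCompoundFileSignature,
                       sizeof(kCompoundFileSignature)) == 0;
}

// Hypothetical helper: read the first eight bytes of a file and test them.
// Returns false when the file cannot be opened or is shorter than eight bytes.
bool looksLikeCompoundFile(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return false;
    unsigned char header[8] = {0};
    in.read(reinterpret_cast<char*>(header), sizeof(header));
    return hasCompoundFileSignature(header,
                                    static_cast<std::size_t>(in.gcount()));
}
```

A caller could run looksLikeCompoundFile on the path first and report a clearer error ("not a compound file") before attempting StgOpenStorageEx.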
2. Here is a sample code snippet showing how to check whether a child object is a stream or a storage.
· Get a pointer to an enumerator object that can be used to enumerate the storage and stream objects within the storage object.
· Iterate through the enumerator and check the type of each element. If the element is a stream, get its name and open it with IStorage::OpenStream.
// Get the pointer to an enumerator object of
// root storage.
IEnumSTATSTG *penum = NULL;
STATSTG statstg;
hr = pStg->EnumElements( 0, NULL, 0, &penum );
if( FAILED(hr) ){ throw "failed to IStorage::EnumElements";}
// Iterate through the enumerator object.
hr = penum->Next( 1, &statstg, 0 );
while( S_OK == hr )
{
    if(statstg.type == STGTY_STREAM) // child object is a stream
    {
        IStream *pStm = NULL;
        HRESULT hrInn = S_OK;
        hrInn = pStg->OpenStream(
            statstg.pwcsName,
            NULL,
            STGM_READ|STGM_SHARE_EXCLUSIVE,
            0,
            &pStm);
        if(S_OK != hrInn)
            throw("Unable to open the Stream");
        // Read the data from stream
        char temp[1000];
        ULONG readBytes = 0;
        // Stop on failure or when no more bytes are returned.
        while(SUCCEEDED(pStm->Read(temp, sizeof(temp), &readBytes)) && readBytes)
        {
            // Stream data is binary, so write exactly readBytes bytes
            // instead of treating the buffer as a printf format string.
            fwrite(temp, 1, readBytes, stdout);
        }
        pStm->Release();
    } // End of processing current stream
    CoTaskMemFree(statstg.pwcsName); // the name is allocated by the enumerator
    hr = penum->Next( 1, &statstg, 0);
}
penum->Release();
Read further about IStorage::EnumElements, STATSTG structure, IStorage::OpenStream and IStream::Read from MSDN
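Stream contents in an Office file are mostly binary, so dumping them as text is rarely readable. One option is to print each buffer as hexadecimal. The following is a minimal, portable sketch; hexDump is a hypothetical helper written for this example and is not part of the COM API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the COM API): format a buffer as
// two-digit uppercase hex bytes, separated by spaces, 16 bytes per line.
std::string hexDump(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::string out;
    char cell[4];
    for (std::size_t i = 0; i < size; ++i) {
        std::snprintf(cell, sizeof(cell), "%02X",
                      static_cast<unsigned>(data[i]));
        out += cell;
        // Separator between bytes: newline after every 16th byte, else a space.
        if (i + 1 < size)
            out += ((i + 1) % 16 == 0) ? '\n' : ' ';
    }
    return out;
}
```

Inside the read loop, a buffer filled by IStream::Read could then be shown with something like printf("%s\n", hexDump(reinterpret_cast<const unsigned char*>(temp), readBytes).c_str()); assuming temp and readBytes are the buffer and byte count from that loop.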
References:
http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
http://www.securityfocus.com/infocus/1822
http://support.microsoft.com/kb/105763

