More sas7bdat progress

The development version of the read.sas7bdat function (in the sas7bdat package) now reads field labels and formats. In addition, errors of the type "found <x> <type> subheaders where 1 expected" are now a thing of the past. These improvements are largely due to work by Clint Cummins. The function also works on some files generated by 64-bit builds of SAS for Windows. Of 280 test files, the read.sas7bdat function works for all but 6, and those 6 were generated on a Linux platform. Here is a preview of the sas7bdat package, version 0.2: sas7bdat_0.2.tar.gz

For those interested in C programming, it appears that the sas7bdat header corresponds to a C structure. The evidence is that the header fields are aligned at 4-byte boundaries when the file was written by SAS for 32-bit platforms, but at 8-byte boundaries for 64-bit platforms. In addition, C compilers often pad structures with extra bytes so that each field, and the structure as a whole, meets its alignment requirement. The overall length of the sas7bdat header also differs between Windows and non-Windows platforms. This would make it difficult for the Windows version of SAS to read files written by the Linux version of SAS.
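
To make the alignment argument concrete, here is a minimal Python sketch of how a C compiler typically lays out a structure. The field names and sizes are hypothetical and are not taken from the real sas7bdat header; the helper `layout` is only an illustration of the usual rule (each field starts at a multiple of its alignment, and the total size is rounded up to the strictest alignment).

```python
# Hypothetical header fields as (name, size in bytes). These are NOT the
# real sas7bdat fields; they only illustrate the alignment effect.
FIELDS = [("magic", 4), ("created", 8), ("page_size", 4)]


def layout(fields, max_align):
    """Return ({field: offset}, total_size) under a simple C-like layout rule.

    Each field is aligned to min(its size, max_align), and the structure
    size is rounded up to the strictest alignment used by any field.
    """
    offsets = {}
    offset = 0
    struct_align = 1
    for name, size in fields:
        align = min(size, max_align)
        # Round the current offset up to the next multiple of `align`.
        offset = (offset + align - 1) // align * align
        offsets[name] = offset
        offset += size
        struct_align = max(struct_align, align)
    total = (offset + struct_align - 1) // struct_align * struct_align
    return offsets, total


offsets_4, size_4 = layout(FIELDS, max_align=4)  # 4-byte alignment
offsets_8, size_8 = layout(FIELDS, max_align=8)  # 8-byte alignment
print(offsets_4, size_4)
print(offsets_8, size_8)
```

With the same three fields, the 8-byte field moves from offset 4 to offset 8 and the structure grows from 16 to 24 bytes. A reader that assumes one layout will pick up the wrong bytes from a file written with the other, which is the kind of difference observed between the 32-bit and 64-bit files.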

Does anyone know whether SAS can read sas7bdat datasets independent of the platform where the file was written? For example, is it possible using SAS for 32 bit Windows to read a file written by SAS for 64 bit Windows?

6 thoughts on “More sas7bdat progress”

  1. Yes, SAS can read sas7bdat datasets created on any platform, but you get a warning about reduced performance when you’re reading a non-native dataset.

    1. Ian,

      Thanks for your interest. To remove the 'experimental' qualification, I think it would be necessary to replace it with a host of other, more specific qualifications. For instance, that the reader is only safe for files written by SAS for 32-bit Windows, without compression or encryption, and where character strings are ASCII (or compatible) encoded, etc. My thinking is that it's better (for myself and for package users) to wait until this list is a bit more manageable 🙂

      The scarier uncertainty is that the reader might fail in an unknown way. I think this uncertainty will subside as we progress and more feedback arrives. The good news is that we are making rapid progress, and confidence is building in proportion.

      Best,
      Matt

  2. Matt,

    thanks for the response. I feel that this is a very important project. Being able to get your data into R from common formats is absolutely key. R only recently got a decent package for importing xls/xlsx files (with XLConnect), and sas7bdat is the last major unsupported format.

    Personally I don't mind if the reader throws an error on some files, but I am concerned that the experimental designation could mean that the data could be read in incorrectly. Have you run into any cases where this has happened?

    1. Ian,

      We've been careful to throw errors when there is a significant possibility of misreading data, by requiring that the structure of each file conforms to structures we have already seen and know how to parse correctly. However, it is still possible (if unlikely) that data will be read incorrectly. I have only seen this happen when character strings were encoded using the Windows-1250 code page, which is a superset of ASCII. In that case, the non-ASCII characters are rendered incorrectly, or not rendered at all. Fortunately, the non-ASCII subset consists mostly of accents and symbols that don't occur very often in English, so I don't consider this a major problem. But if SAS were to use an encoding that is not ASCII-compatible, the text would be interpreted incorrectly and would be obviously garbled.

  3. I can provide SAS 32-bit and 64-bit test files, if needed. We use compress=no, compress=yes, and compress=binary. These work fine cross-platform (between SAS 9.1.3 32-bit Windows and SAS 9.2 Windows 64-bit), except that indexes do not work cross-platform.

Comments are closed.