# Splitting Internally Compressed sas7bdat
saurfang · scala, spark
There appears to be an issue reading internally compressed sas7bdat files, as discussed in #32. This is a recap of what we know so far about the issue and what is required to identify the root cause and a possible fix.
## Background
`sas7bdat` is a binary data storage file format used by [SAS](https://www.sas.com/en_us/home.html). There is no public documentation for this file format, and different versions of SAS appear to have evolved it over the years. The best documentation of `sas7bdat` can be found at [SAS7BDAT Database Binary Format](http://www2.uaem.mx/r-mirror/web/packages/sas7bdat/vignettes/sas7bdat.pdf). However, it should be taken with a grain of salt, since it does not accurately reflect the latest revision or the internal compression used in `sas7bdat`.
At a high level, `sas7bdat` stores data in `page`s, and each `page` contains rows of serialized data. This is the basis on which the `spark-sas7bdat` package splits a dataset for parallel processing. Internally, `spark-sas7bdat` delegates deserialization to the [`parso`](https://github.com/epam/parso) Java library, which does an amazing job deserializing a `sas7bdat` file sequentially.
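For orientation, here is a minimal, Spark-free sketch of the sequential read that `parso` performs. The file path is a placeholder and the exact `parso` method names are stated from memory, so treat them as assumptions rather than definitive usage of the library:

```scala
import java.io.FileInputStream
import com.epam.parso.impl.SasFileReaderImpl
import scala.collection.JavaConverters._

object SequentialParsoRead extends App {
  // Placeholder path to some sas7bdat file.
  val in = new FileInputStream("datasets/example.sas7bdat")
  try {
    // Parso reads the header/metadata up front, then rows one at a time.
    val reader = new SasFileReaderImpl(in)
    val props  = reader.getSasFileProperties
    println(s"pages=${props.getPageCount}, rows=${props.getRowCount}")

    // Column names, then every row as a comma-separated line.
    println(reader.getColumns.asScala.map(_.getName).mkString(","))
    (0L until props.getRowCount).foreach { _ =>
      println(reader.readNext().mkString(","))
    }
  } finally {
    in.close()
  }
}
```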
## Problem
`spark-sas7bdat` contains [unit tests](https://github.com/saurfang/spark-sas7bdat/blob/master/src/test/scala/com/github/saurfang/sas/SasRelationSpec.scala#L15) to verify that `sas7bdat` files can indeed be split and read correctly as a DataFrame. However, there have been reports that it fails for many datasets in the wild. See #32.
It has been verified that `parso` can read these problematic files just fine (https://github.com/saurfang/spark-sas7bdat/issues/32#issuecomment-419593441). Therefore the bug most likely lies in how we determine the splits used to divide the sas7bdat file for parallel processing, i.e. in https://github.com/saurfang/spark-sas7bdat/blob/master/src/main/scala/com/github/saurfang/sas/mapred/SasRecordReader.scala, since everything else is just a thin wrapper over Parso thanks to @mulya's contribution in #10.
Furthermore, it is likely this issue only happens with certain versions of sas7bdat, or with sas7bdat files that enable internal compression. Recall that only externally compressed files (e.g. gzip) are not splittable, and we don't support parallel reads of those in `spark-sas7bdat`.
## Proposal
### Build Test Case
We first need to collect datasets that exhibit the said issue.
@vivard has provided one: https://github.com/saurfang/spark-sas7bdat/issues/32#issuecomment-412535049
@nelson Where did you get your problematic dataset? Can you generate a dummy dataset that exhibits the same problem?
By setting a very [low block size](https://github.com/saurfang/spark-sas7bdat/blob/master/src/test/scala/com/github/saurfang/sas/SasRelationSpec.scala#L13) in the unit test, we can force Spark to split the input data and hopefully trigger the error, as sketched below.
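For illustration, a hedged sketch of forcing small splits in a local run; the `fs.local.block.size` property, the data source name, and the file path are assumptions based on common usage of Hadoop and this package, not necessarily what the linked spec does:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("sas7bdat-split-debug")
  .getOrCreate()

// Shrink the local file system block size so even a small test file is
// divided into several input splits (value is arbitrary, just "very low").
spark.sparkContext.hadoopConfiguration.setLong("fs.local.block.size", 1024 * 1024)

// Read through the spark-sas7bdat data source; path is a placeholder.
val df = spark.read
  .format("com.github.saurfang.sas.spark")
  .load("src/test/resources/problematic.sas7bdat")

// If the splitting logic is wrong, the row count (or the row contents) will
// differ from what parso reports when reading the same file sequentially.
println(s"rows read with tiny splits: ${df.count()}")
```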
### Debug Test Case
It can be fairly convoluted to debug the splitting logic inside Spark and Hadoop. One potential approach is to create a local multi-threaded Parso reader, without Spark, that replicates and validates the splitting logic.
See `parso`'s unit test [here](https://github.com/epam/parso/blob/caf2cbd948ac8ed623ec6eeee40c82caf081fc76/src/test/java/com/epam/parso/SasFileReaderUnitTest.java#L160), which reads a `sas7bdat` file row by row and writes it out to `csv`.
The idea would be to generalize this by looking at the number of pages in the `sas7bdat` file, splitting the pages into chunks, creating a separate input stream for each chunk, seeking each stream to its chunk's starting page, and processing the rows from all pages in the chunk. The splitting logic can be refactored from [here](https://github.com/saurfang/spark-sas7bdat/blob/master/src/main/scala/com/github/saurfang/sas/mapred/SasRecordReader.scala#L81) in this package. Since the issue is unlikely to be related to concurrency (each Spark executor creates its own input stream), one can run the above logic sequentially for each chunk, which should be easier to debug; see the sketch below.
This might help identify and isolate the issue. We might also discover functions, interfaces, and encapsulation that can be contributed back to Parso, which could greatly simplify this package.
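As a very rough sketch of the chunking arithmetic described above (purely illustrative: it assumes pages are fixed-length and laid out back to back after the header, uses a placeholder path, and leaves out the page-level row decoding that `SasRecordReader` actually performs):

```scala
import java.io.FileInputStream
import com.epam.parso.impl.SasFileReaderImpl

// Read only the metadata to learn the page layout of the file (placeholder path).
val path = "datasets/problematic.sas7bdat"
val headerIn = new FileInputStream(path)
val props = new SasFileReaderImpl(headerIn).getSasFileProperties
headerIn.close()

val headerLength = props.getHeaderLength // bytes before the first page
val pageLength   = props.getPageLength   // size of each page in bytes
val pageCount    = props.getPageCount    // total number of pages

// Split the pages into contiguous chunks, one per would-be "split".
val numChunks     = 4L
val pagesPerChunk = math.max(1L, math.ceil(pageCount.toDouble / numChunks).toLong)

val chunks = (0L until pageCount by pagesPerChunk).map { firstPage =>
  val lastPage  = math.min(firstPage + pagesPerChunk, pageCount) // exclusive
  val startByte = headerLength + firstPage * pageLength
  (firstPage, lastPage, startByte)
}

// For each chunk one would open a fresh stream, skip() to startByte, and hand
// it to a page-level reader (the part this proposal wants to replicate outside
// Spark, sequentially, so it can be debugged in isolation).
chunks.foreach { case (first, last, offset) =>
  println(s"pages [$first, $last) start at byte offset $offset")
}
```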