Skip to content

E.xcel P.erformant R.eader via I.nterfaces for M.emory E.fficiency. Without using any external libraries. Optimised for Range extraction

License

Notifications You must be signed in to change notification settings

Smurf-IV/Excel_PRIME

Repository files navigation

Excel_PRIME 🌟

  • Excel_Performant Reader via Interfaces for Memory Efficiency.
  • Without using any external libraries.
  • Optimised for Range extraction.
  • Excel.png

What does that mean?

  • Yet another Excel reader ?,
    • Starting with .Net 8 as the performant Runtime (See Benchmarks)
    • V9 gives an extra 5% boost,
    • V10 Another 5% ;-)

Lets take each of the above elements and explain:

Excel πŸ“ˆ

  • Open Large 2007 (Onwards) XLSX file formats (Binary later, maybe)

Performant πŸš€

  • Try to be as fast as possible, i.e.
    • Forward only Lazy loading
    • Only "Quick" decipher / convert of the cell(s) types to ease GC pressure
    • No attempt at "creating / using" datatables with headers etc.
    • Use IEnumerables with initial offset starts (Row / Column)
    • Allow CancellationTokens to be used to allow page transitioning cancellation (More on this later)
  • Now the fastest in Real world usage 2025-11-19

Q & A's

  • Q: There are others that are faster
  • A: Agreed, but then
    • They do not have range extraction.
    • Or optionally allow the use of the OS's TempFile System to store massive sheets
    • Or re-use of already extracted (massive) sheets
    • Or allow multiple sheets to be read at the same time
      • because others use global memory to represent the current row
      • Or have a single access into the Zip Excel file

Reader πŸ“‹

  • Read only
    • Therefore no calculations or updates to formula calls

Interfaces πŸ—οΈ

  • Will use the DotNet core functionality by default
  • But, if your target deployment allows for the use of native performant binaries, then via the use of interfaces these will be pluggable
    • i.e. Using Zlib.Net for getting the data streams out of the compressed Excel file faster. (Or SharpZipLib / PowerPlayZipper)
    • A faster / slimmer implementation for xml stream reading (i.e. TurboXml)
  • Allow the implementation of different source files (i.e. XLSB)

Q & A's

  • Q: Why?
  • A: As mentioned above, this is to allow a developer to replace with external nugets that might perform better XML speed etc.

Memory 🌐

  • The reason for this project, is to handle very large XSLX files (i.e. > 500K rows with > 180 columns per sheet, with multiple sheets of this size)
  • For ETL validation scenarios, i.e. make sure that the user modified data that has been transferred has interaction rules applied, before moving onto the T and L stages
  • Try not to hit / store in the LOH
  • No internal .Net memory of previously loaded sheets / rows.

Q & A's

  • Q: It appears that this uses more memory than other implementations
  • A: Currently yes, but it is being optimised for Range Extraction,
    • AND for allowing multiple rows (With cell data) to be stored in memory at the same time, (i.e. via ToList() call);
    • AND there is work in place to allow multiple sheets to be read at the same time (Unlike some to of the others that use global memory to represent a row)
    • And it appears that the current benchmarks do not extract unless a ToString and a check on the result is used (Otherwise the Jit removes the unassigned dead code)

Efficiency πŸ“¦

  • As hinted by the above statements, this is to be targetted at memory restricted environments (i.e. ASP Net VM's)
  • Use the OS's Temp File caching, so if the memory is tight then the Owner app will not have to worry about OOM exceptions, or having to use Swap Disk speeds.
  • Only unzip the sheet(s) when they are asked for
  • Only load the shared strings upto the current request number

Q & A's

  • Q: Sometimes the Async await s add too much overhead
  • A: true, that is why there are also the equivalent base interfaces that perform the same functionality without the need for the async await overheads.

Etc. πŸ”§

CancellationTokens

  • This is to allow the Large files to be Aborted
  • Make "Most" of the "Net Cores'" API's Asynchronous Tasks

IDisposable

  • Got to tidy up those Temp Files, and release the FileStream's

It will not β›”:

Be: Same sheet Thread safe πŸ“Š

  • It will Not be same sheet Instance thread safe, because the xml reader will be locked (Forward only) to the sheet in use.
    • but you can Open the sheet more than once, and have different threads running over it,
    • And you can have Parallel threads access the Excel file
    • Just remember to set Options{ AccessExcelFileInForwardOnlyMode = false}

Do: Dynamic Ranges ⚠️

  • i.e. Ones that contain formulas:
    • <definedName name="Prices">OFFSET(Sheet1!$A$1,0,0,COUNTA(Sheet1!$A:$A),1)</definedName>

Do: Poco πŸ€–

  • A POCO / Type populator (Extensions can be written for that later)

Be a: Writer / Modifier πŸ“š

  • Totally beyond the scope of this project remit

Badge πŸ”„ Area
.NET Release build and tests

Flag Counter

Targets 🎯

Phase 0

  • βœ… Setup this github
  • βœ… Create the main project
  • βœ… Add Unit Test project
  • βœ… Add simple Test Data

Phase Alpha

  • βœ… Use Net Core Interface(s)
    • βœ… Use ZipArchive
    • βœ… Use XDocument
  • βœ… Implement Open / Dispose (Async)
    • βœ… Sheet Names
    • βœ… Shared Strings
  • βœ… Implement Sheet loading (unzip and be ready for use)
    • βœ… Use XDocument as POC only
  • βœ… Implement Row extraction
    • βœ… Skip
    • βœ… Delayed read - until a cell is actually needed
    • βœ… Deal with Null / Empty cells (Utilise sparse array?)
    • βœ… Keep last used offset (i.e. no need to reload sheet if the next range API startRow call is later)

Phase Beta - Benchmarks ⏱️

  • βœ… Benchmarks
    • βœ… Add Other "Excel readers" to the Benchmark project(s)
    • πŸŽ‰ Now With Sylvan.Data.Excel
    • πŸŽ‰ Now With XlsxHelper
  • βœ… More UnitTests

Phase 1 - MVP πŸ”

  • βœ… Add IEnumerables and benchmark
    • ⚠️ Still not convinced whether to implement "all the way down"
  • βœ… Implement XmlReader.Create for
    • βœ… Loading sharedStrings
    • βœ… Sheet loading
    • βœ… Some Profiling Enahancements
  • βœ… More Benchmarks
    • Now With FastExcel
    • βœ… Some Profiling Enahancements
  • βœ… Better Storage of the SharedStrings
    • βœ… Use of LazyLoading Class
    • βœ… Use of Derived XmlNamedTable implementations
    • βœ… Locking for separate sheet thread reading
    • βœ… Restricted storage (i.e. do not return things that are not relevant)
      • πŸš€ Big Performance improvements 2025-10-26
  • βœ… Cell object type πŸ“… - πŸš€ Big Performance improvements 2025-11-01
    • βœ… Cell converted when read (i.e. you will know the type that you want, and you can convert it.)
      • πŸš€ Big Performance improvements 2025-11-04
  • βœ… Use internal ZipEntry rented buffer
    • βœ… Add and explain usage in options
      • πŸš€ Big Performance improvements 2025-11-07
  • βœ… Investigation into the smallest function πŸ’ͺ
    • πŸš€ More Performance improvements 2025-11-08
  • βœ… Optimise for CellConversion.None πŸ’ͺ
    • πŸš€ More Performance improvements 2025-11-12
  • βœ… Parallel Sheet threads Access
    • βœ… Multiple times (with locking)
  • βœ… Nuget
    • βœ… Beta etc.
    • 🎊 Released as Nuget V1.yyMM.dd -> 1.2511.14

Phase 2 - RC

  • βœ… Add IEnumerables All the way down ‡️
    • i.e. remove the need for Asynchronous awaits
    • πŸš€ Yielding More Performance improvements 2025-11-19
    • ⛓️‍πŸ’₯ Breaking Change πŸ”©
      • The Async classes now have Async appended to be distinct from the non async versions
      • But, Async inherit from the non, so they are interchangable
  • βœ… Nuget
    • βœ… Manual workflow deploy Release
    • βœ… Manual workflow deploy Beta
  • βœ… Read definedNames (Ranges / Cell / Value / Dynamic) πŸ“‡
    • βœ… Read from global
    • βœ… Handle Dynamics (i.e. do not fall over! 🀷)
  • βœ… Deal with blank rows in a sheet πŸ—‹
    • βœ… Return a null cell row
  • βœ… Deal with Empty cells in a row πŸ—…
    • βœ… Return a null cell (e.g. <c r="F12" s="8"/>)
  • βœ… Implement Sheet scoping of definedNames
    • βœ… i.e. <definedName name="OrderSize" localSheetId="0">'Try it Yourself'!$C$12:$E$12</definedName>
    • Note: The above will be referenced as OrderSize (Try it Yourself) as shown in LibreOffice.
  • βœ… Implement Row extraction πŸ“Ÿ
    • βœ… Allow ColumnHeader addressing (i.e. start -> end columns)
  • βœ… Implement RangeExtraction πŸ“²
    • βœ… Global rangeNames
    • βœ… Make DefinedName's work with localSheetIddefinitions
    • βœ… User defined, using the "A1:B10" or "$A$1:$B$10" syntax
  • βœ… Add Benchmarks for "Excel readers" That perform Range Extraction
  • βœ… Investigate memory usage(s) πŸ§‘β€πŸ’»
  • βœ… Release as Nuget V2.2512-10 πŸ’¨

V2 Changes

2025-12-14


Phase 3 - XLSB πŸ’Ύ - Alpha V3

  • Implement Open / Dispose (Async)
    • Sheet Names
    • Shared Strings
  • Implement Sheet loading
  • Implement Row extraction
    • Skip
    • Delayed read - until a cell is actually needed
    • Deal with Null / Empty cells (Utilise sparse array?)
  • Cell object type πŸ“…
  • Parallel Sheet threads Access
    • Multiple times (with locking)
  • Read definedNames (Ranges / Cell / Value / Dynamic) πŸ“‡
    • Read from global
  • Benchmarks πŸ–²οΈ
    • Add "Excel readers" That support XLSB Extraction
  • Release as Nuget V3.yyMM.dd

Phase 4 - Specific Cell value type(s) #️⃣

  • Cell object type πŸ“…
    • Deal with DateOnly / TimeOnly fields -> CellConversion.NumberAndDates πŸ’Ή
    • Use of user defined column schema (Excel Number Format nuget?)
    • Formatter applied -> CellConversion.ForceStyles
    • Operator based conversion
    • Investigate if the XmlConvert classes are efficient
  • Benchmarks

Phase 5 - Third Party Nugets πŸ“¦

  • Excercise the Implementation of Interfaces for other Libs (Xml / Zip)
    • Separate Nuget(s) ?
  • Benchmarks

Phase 6 - ideas πŸ’‘

  • Investigate a different way of storing the Shared strings to the Filesystem, when they are in the MB's

  • Investigate possibility of using "Pipelining" to get data for Next row / cell population after yield?

    • Locking
    • How to deal with rows that are completely blank
    • fibres ?
  • Indicate that things may be Hidden πŸ–Ί

    • Sheet
    • Row
    • Column
    • Cell ?
  • Indicate that things may be Readonly

    • Sheet
    • Row
    • Column
    • Cell ?
  • More ideas to be added later, Please suggest... ;-)

About

E.xcel P.erformant R.eader via I.nterfaces for M.emory E.fficiency. Without using any external libraries. Optimised for Range extraction

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages