Refactor flashloader #329 (Merged)
Commits (80)
58dbcd7 (zain-sohail) major refactor to flash code
f376387 (zain-sohail) update dataframe class to be able to use index and dataset keys
08e8d9f (zain-sohail) minor changes introduced
5c9a04c (zain-sohail) change majorly the class with a new initialize method. now save parqu…
ff5dd07 (zain-sohail) now uses a simpler notation and save_parquet method after loading dat…
7852aaf (zain-sohail) methods made more consistent and fixing the get_index_dataset_key
ac9abea (zain-sohail) include Steinn's proposed solution to pulse_id channel being empty
41fd70d (zain-sohail) include unit tests and fixtures. still many to be done. needs to move…
da00635 (zain-sohail) add more tests, simplify logic on dataframe class
8b39bdb (zain-sohail) remove the gmdTunnel channel because the datafile is not correct. Rep…
e1b9a9f (zain-sohail) major structure changes
cd85dfd (zain-sohail) docstrings etc.
f6ca14e (zain-sohail) updated buffer creation etc. tests won't work currently
c9f1fcc (zain-sohail) fix linting errors and comment out tests for now
1398bf2 (zain-sohail) fix the error of getting wrong attribute in loader, and fix parquet l…
eb72230 (zain-sohail) fix lint error
4d950db (zain-sohail) cleaning up the classes
b8bfdf0 (zain-sohail) add back easy access apis
1f95408 (zain-sohail) small fix
8f551d0 (zain-sohail) small fix
c85fdec (zain-sohail) small fix
0a7e836 (zain-sohail) fix error with pickling
4a787eb (zain-sohail) use old cfg
084f407 (zain-sohail) docstring fixes
73802fa (zain-sohail) fix tests
70a3c5b (zain-sohail) fix certain problems with df_electron and add comprehensive tests fo…
77bf46b (zain-sohail) add tests
d8cc6f6 (zain-sohail) buffer handler tests
09cffec (zain-sohail) ruff formatted
0f23ddb (zain-sohail) add parquethandler tests
ac4f8cd (zain-sohail) further tests
1519752 (zain-sohail) fixes
d31e6b1 (zain-sohail) fix the lint error
ed18a5c (zain-sohail) fix parse_metadata
ce8134f (zain-sohail) put everything in one file
08a2adc (zain-sohail) reorder
74b41dc (zain-sohail) update interface from suggestions
b937db8 (zain-sohail) limit the cores used
9dc69aa (zain-sohail) change interface of parquethandler to suggested
09a93d3 (zain-sohail) fix bug for df indexing
55cfa0c (zain-sohail) merge main branch
4b3e6f7 (zain-sohail) Merge branch 'main' into refactor-flashloader
d316137 (zain-sohail) lint fix
c00207d (zain-sohail) update dataframe saving and loading from parquet behavior
89130b0 (zain-sohail) remove saving/loading of parquets
dbef804 (zain-sohail) add instrument option
afd9772 (zain-sohail) fix tests
b9fce76 (zain-sohail) fix tests
6400878 (zain-sohail) fix tests
7129f57 (zain-sohail) fix tests
bc53214 (zain-sohail) Merge branch 'main' into refactor-flashloader
02aee6e (zain-sohail) added retrocompatibility for older buffer files that have sectorID…
2142c11 (zain-sohail) fix ruff settings
79922ef (zain-sohail) update tests
02ae74e (zain-sohail) make small change to check actions status
f520310 (zain-sohail) bring back types
4c7d069 (zain-sohail) fix small error
9f6a31b (zain-sohail) move utility func test to utility tests
f4a30e0 (zain-sohail) separate into different modules
08f8f13 (zain-sohail) add time_elapsed method
fa68746 (zain-sohail) fix test issues
cb884dd (zain-sohail) add tests for elapsed time
ae26555 (zain-sohail) fix main loader tests
04714bc (zain-sohail) fix sxp loader tests
6589595 (zain-sohail) fix tests
1b73b76 (zain-sohail) fix minor issue with repr html
852a867 (zain-sohail) add available runs property
f010a2e (zain-sohail) Merge branch 'main' into refactor-flashloader
8dd5e6a (zain-sohail) Merge branch 'main' into refactor-flashloader
cd6fbf0 (zain-sohail) Merge branch 'v1_feature_branch' into refactor-flashloader
147e913 (zain-sohail) add back annotations
f2a26b9 (zain-sohail) use index and dataset keys
ebd2b32 (rettigl) Merge remote-tracking branch 'origin/v1_feature_branch' into refactor…
d131fe4 (zain-sohail) remove nans from all electron channels
194c874 (zain-sohail) use pd import, load h5 file inside df creator
af33740 (zain-sohail) update comments to explain the code
50f7ee1 (zain-sohail) make review changes
65d909d (zain-sohail) fix tests with review comments
b7537a8 (zain-sohail) fix dropna
b0b090d (zain-sohail) fix minor stuff and add test to see if exception handling works in pa…
New file (+238 lines):
from __future__ import annotations

import os
from itertools import compress
from pathlib import Path

import dask.dataframe as dd
import pyarrow.parquet as pq
from joblib import delayed
from joblib import Parallel

from sed.core.dfops import forward_fill_lazy
from sed.loader.flash.dataframe import DataFrameCreator
from sed.loader.flash.utils import get_channels
from sed.loader.flash.utils import initialize_paths
from sed.loader.utils import get_parquet_metadata
from sed.loader.utils import split_dld_time_from_sector_id


class BufferHandler:
    """
    A class for handling the creation and manipulation of buffer files using DataFrameCreator.
    """

    def __init__(
        self,
        config: dict,
    ) -> None:
        """
        Initializes the BufferHandler.

        Args:
            config (dict): The configuration dictionary.
        """
        self._config = config["dataframe"]
        self.n_cores = config["core"].get("num_cores", os.cpu_count() - 1)

        self.buffer_paths: list[Path] = []
        self.missing_h5_files: list[Path] = []
        self.save_paths: list[Path] = []

        self.df_electron: dd.DataFrame = None
        self.df_pulse: dd.DataFrame = None
        self.metadata: dict = {}

    def _schema_check(self) -> None:
        """
        Checks the schema of the Parquet files.

        Raises:
            ValueError: If the schema of the Parquet files does not match the configuration.
        """
        existing_parquet_filenames = [file for file in self.buffer_paths if file.exists()]
        parquet_schemas = [pq.read_schema(file) for file in existing_parquet_filenames]
        config_schema_set = set(
            get_channels(self._config["channels"], formats="all", index=True, extend_aux=True),
        )

        for filename, schema in zip(existing_parquet_filenames, parquet_schemas):
            # for retro compatibility when sectorID was also saved in buffer
            if self._config["sector_id_column"] in schema.names:
                config_schema_set.add(
                    self._config["sector_id_column"],
                )
            schema_set = set(schema.names)
            if schema_set != config_schema_set:
                missing_in_parquet = config_schema_set - schema_set
                missing_in_config = schema_set - config_schema_set

                errors = []
                if missing_in_parquet:
                    errors.append(f"Missing in parquet: {missing_in_parquet}")
                if missing_in_config:
                    errors.append(f"Missing in config: {missing_in_config}")

                raise ValueError(
                    f"The available channels do not match the schema of file {filename}. "
                    f"{' '.join(errors)}. "
                    "Please check the configuration file or set force_recreate to True.",
                )
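        # Illustration (hypothetical scenario): if the config defines a channel
        # "gmdTunnel" but an existing buffer file was written before it was
        # added, the ValueError above reports "Missing in parquet: {'gmdTunnel'}".
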
    def _get_files_to_read(
        self,
        h5_paths: list[Path],
        folder: Path,
        prefix: str,
        suffix: str,
        force_recreate: bool,
    ) -> None:
        """
        Determines the list of files to read and the corresponding buffer files to create.

        Args:
            h5_paths (list[Path]): List of paths to H5 files.
            folder (Path): Path to the folder for buffer files.
            prefix (str): Prefix for buffer file names.
            suffix (str): Suffix for buffer file names.
            force_recreate (bool): Flag to force recreation of buffer files.
        """
        # Getting the paths of the buffer files, with subfolder as buffer and no extension
        self.buffer_paths = initialize_paths(
            filenames=[h5_path.stem for h5_path in h5_paths],
            folder=folder,
            subfolder="buffer",
            prefix=prefix,
            suffix=suffix,
            extension="",
        )
        # read only the files that do not exist or if force_recreate is True
        files_to_read = [
            force_recreate or not parquet_path.exists() for parquet_path in self.buffer_paths
        ]

        # Get the list of H5 files to read and the corresponding buffer files to create
        self.missing_h5_files = list(compress(h5_paths, files_to_read))
        self.save_paths = list(compress(self.buffer_paths, files_to_read))
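        # Example: for three buffer paths of which only the second exists (and
        # force_recreate=False), files_to_read is [True, False, True], so
        # compress() keeps only the first and third H5/buffer path pairs.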
        print(f"Reading files: {len(self.missing_h5_files)} new files of {len(h5_paths)} total.")

    def _save_buffer_file(self, h5_path: Path, parquet_path: Path) -> None:
        """
        Creates a single buffer file.

        Args:
            h5_path (Path): Path to the H5 file.
            parquet_path (Path): Path to the buffer file.
        """
        # Create a DataFrameCreator instance and get the dataframe for the h5 file
        df = DataFrameCreator(config_dataframe=self._config, h5_path=h5_path).df

        # Reset the index of the DataFrame and save it as a parquet file
        df.reset_index().to_parquet(parquet_path)

    def _save_buffer_files(self, debug: bool) -> None:
        """
        Creates the buffer files.

        Args:
            debug (bool): Flag to enable debug mode, which serializes the creation.
        """
        n_cores = min(len(self.missing_h5_files), self.n_cores)
        paths = zip(self.missing_h5_files, self.save_paths)
        if n_cores > 0:
            if debug:
                for h5_path, parquet_path in paths:
                    self._save_buffer_file(h5_path, parquet_path)
            else:
                Parallel(n_jobs=n_cores, verbose=10)(
                    delayed(self._save_buffer_file)(h5_path, parquet_path)
                    for h5_path, parquet_path in paths
                )
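    # Note: joblib's Parallel uses process-based workers by default (the loky
    # backend), so each _save_buffer_file call opens its own HDF5 file in its
    # own process; debug=True instead runs the loop serially in-process, which
    # keeps tracebacks and print output readable.
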
    def _fill_dataframes(self):
        """
        Reads all parquet files into one dataframe using dask and fills NaN values.
        """
        dataframe = dd.read_parquet(self.buffer_paths, calculate_divisions=True)
        file_metadata = get_parquet_metadata(
            self.buffer_paths,
            time_stamp_col=self._config.get("time_stamp_alias", "timeStamp"),
        )
        self.metadata["file_statistics"] = file_metadata

        fill_channels: list[str] = get_channels(
            self._config["channels"],
            ["per_pulse", "per_train"],
            extend_aux=True,
        )
        index: list[str] = get_channels(index=True)
        overlap = min(file["num_rows"] for file in file_metadata.values())

        dataframe = forward_fill_lazy(
            df=dataframe,
            columns=fill_channels,
            before=overlap,
            iterations=self._config.get("forward_fill_iterations", 2),
        )
        self.metadata["forward_fill"] = {
            "columns": fill_channels,
            "overlap": overlap,
            "iterations": self._config.get("forward_fill_iterations", 2),
        }
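        # Forward filling propagates the last valid per-pulse/per-train value
        # down to the following rows; before=overlap carries that many rows over
        # from the preceding partition so fills also cross partition boundaries,
        # which is why overlap is capped at the smallest file's row count.
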
        # Drop rows with nan values in electron channels
        df_electron = dataframe.dropna(
            subset=get_channels(self._config["channels"], ["per_electron"]),
        )

        # Set the dtypes of the channels here as there should be no null values
        channel_dtypes = get_channels(self._config["channels"], "all")
        config_channels = self._config["channels"]
        dtypes = {
            channel: config_channels[channel].get("dtype")
            for channel in channel_dtypes
            if config_channels[channel].get("dtype") is not None
        }

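        # Here the "3-bit shift" means the lowest 3 bits of the dld time channel
        # hold the detector sector ID (8 sectors), i.e. sectorID = value & 0b111,
        # with the remaining bits carrying the actual time steps.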
        # Correct the 3-bit shift which encodes the detector ID in the 8s time
        if self._config.get("split_sector_id_from_dld_time", False):
            df_electron, meta = split_dld_time_from_sector_id(
                df_electron,
                config=self._config,
            )
            self.metadata.update(meta)

        self.df_electron = df_electron.astype(dtypes)
        self.df_pulse = dataframe[index + fill_channels]

    def run(
        self,
        h5_paths: list[Path],
        folder: Path,
        force_recreate: bool = False,
        prefix: str = "",
        suffix: str = "",
        debug: bool = False,
    ) -> None:
        """
        Runs the buffer file creation process.

        Args:
            h5_paths (list[Path]): List of paths to H5 files.
            folder (Path): Path to the folder for buffer files.
            force_recreate (bool): Flag to force recreation of buffer files.
            prefix (str): Prefix for buffer file names.
            suffix (str): Suffix for buffer file names.
            debug (bool): Flag to enable debug mode.
        """
        self._get_files_to_read(h5_paths, folder, prefix, suffix, force_recreate)

        if not force_recreate:
            self._schema_check()

        self._save_buffer_files(debug)

        self._fill_dataframes()
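
For orientation, a minimal usage sketch of the class above. The config dict (only its "core" and "dataframe" sections are read), the import path, and all file locations are placeholder assumptions, and the channel definitions are elided, so this is schematic rather than a copy-paste recipe:

from pathlib import Path

from sed.loader.flash.buffer import BufferHandler  # import path assumed

# Placeholder config: "dataframe" must carry the channel definitions and
# aliases that DataFrameCreator and get_channels expect.
config = {
    "core": {"num_cores": 4},
    "dataframe": {"channels": {}},  # channel definitions elided
}

h5_paths = sorted(Path("/path/to/raw").glob("*.h5"))  # placeholder location

bh = BufferHandler(config=config)
bh.run(h5_paths=h5_paths, folder=Path("/path/to/processed"))

df_electron = bh.df_electron  # per-electron dask dataframe (NaN rows dropped)
df_pulse = bh.df_pulse        # per-pulse/per-train dask dataframe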