CLN: series to now inherit from NDFrame #3482
|
If you have a chance...this is my refactor of Series to inherit from NDFrame, like DataFrame and co. I wrote this a while back and just rebased to current master. Almost all passing, except for the ujson stuff. I took a brief look, but not easy for me to debug this.
can you have a look see... thanks Jeff |
pandas/core/common.py
FYI: the PEP 8 standard is two blank lines between top-level declarations
|
@jreback the issues were caused by The one remaining test failure is due to What's the etiquette for a pull request on a pull request? I've just pushed the code to my fork, presumably the best way is to cherry-pick the commits from there? |
|
thanks so much!!!! I will take a look at the I cc'd @cpcloud because not 100% sure how to push to this PR....my |
|
ok...seems that best way here is just to pull down your branch and cherry-pick.... @cpcloud suggest that you could submit a PR to my branch...I guess that's sort of the same thing |
|
Yeah that's what I was talking about with the cherry-pick, I can do a PR on your fork though. Might be cleaner that way, give me a sec. |
|
OK pull request created jreback#2 |
|
worked beautifully (well, after I accidentally merged ALL of your branch in, had to rebase it out) |
|
@cpcloud thanks...pep8d most of the major changes..easy |
|
np. really excited about this pr |
|
well...it technically doesn't change anything!!! (except for lots of code) |
|
Well, this is downright epic, Jeff. To start, can you post a test_perf.sh run of this versus master? I think we should discuss some big picture pandas things at some point. For example, I'm starting to become a bit down on the We also need to plot a way to place a layer between pandas and NumPy so we can have better control over the data representation. For example, I would like to have integer NAs. From my point of view at some point using NumPy at all won't continue to make sense (example: why are we forcing people to import the whole numpy library when all we need is an array object, basically). |
|
I had to optimize Series/Block/Index creation (basically short cut it) e.g. diff in |
|
@wesm here's my 2c on basic design: The row centric view of the arrays is a natural way of looking at things. However, you are right, the machinery to hold it really doesn't make it any faster or less complicated (and more complicated when dealing with mixed types as the blocks need to be split). I have seen various reasons why people go to column oriented structures.. (e.g. blaze, ctable), and then just combine or operate on demand. However, you pretty much need a an pros:
cons:
Here's what I see as the fundamental issue:: pandas has to be very general in that though there are better |
|
on removing some dependence on numpy. A lot of the bug fixes have been aimed at making pandas very consistent in spite of numpy issues/bugs. I think I once mentioned a possible solution to integer nan, just use a sentinel value, like
Not sure what goals you have for a numpy replacement though. |
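To make the sentinel idea concrete, here is a minimal sketch using only the standard library. The sentinel value in the comment above is cut off, so the choice of the smallest signed 64-bit integer is an assumption for illustration, and `INT_NA`, `make_int_column`, `isna` and `nansum` are hypothetical names, not pandas API.

```python
from array import array

# Assumed sentinel: the smallest signed 64-bit integer stands in for "missing".
INT_NA = -2**63


def make_int_column(values):
    """Store ints in a typed 64-bit array, mapping None to the sentinel."""
    return array("q", (INT_NA if v is None else v for v in values))


def isna(col):
    """Return a mask that is True wherever the sentinel is stored."""
    return [v == INT_NA for v in col]


def nansum(col):
    """Sum the column while skipping sentinel entries."""
    return sum(v for v in col if v != INT_NA)


col = make_int_column([1, None, 3])
print(isna(col))
print(nansum(col))
```

The trade-off is that the sentinel can no longer be stored as a real value, and every reduction has to know to skip it.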
|
related #816 |
|
Just took a more serious look through this PR and if you're all satisfied might as well pull the trigger on merging. Any API incompatibilities introduced (beyond the obvious, |
axis creation routines now commonized under _setup_axes
ENH: more methods added
PERF: was missing multi-take opportunity in reindex
was incorrectly passing to com._count_not_none
doing an extra copy in certain cases
BUG: reindex called with no args will by default return a copy (fixed bug)
ENH: moved filter and added axis arg
moved where,mask,align
TST: make reindex benchmarks longer
CLN: fixed up names for creation in panelnd.py
DOC: minor release notes changes
ENH: initial commit - attempt to reengineer series to inherit from NDFrame rather than ndarray
ENH: fixed SparseDataFrame constructor with scalar values
reindex still broken
removed refs to SparseSeries in internals (not all SparseArray)
TST: more fixed
TST: more fixes
TST: more tests
TST: fixed up indexing
TST: more sparse fixes
BUG: reindex with single block manager now correctly fills with a method
BUG: fixed pickle I think
BUG: fixed set in internals for sparse
fixed boolean indexing in series I think
BUG: fixed printing and inclusion of sparse series in DataFrame (now keeps its type),
converted to dense for printing
CLN: took out SeriesIndex, now uses regular indexing properties
BUG: fixed copy (was using series method, bad)
block filling for datetimes now ok (was filling with NaT, not iNaT)
NaN in boolean ops now correctly handled (was not working for Datetimes)
BUG: fixed set_item in SparseFrame if only a scalar is passed (needed index)
BUG: sparse join fixed, did I break something in merge?
BUG: consolidated block slicing under _slice
BUG: added Series to santize_array
all numeric methods now call get_values() rather than values
ENH: partial SparsePanel support
ENH: reverted SparsePanel changes, save for later
fixed up xs in SparseFrame
BUG: SparsePanel was using an inherited as_matrix(), bad
TST: fixed shift
default in class creation wrapper is to not pass existing fillers
added sanitize column for generality
fixed count (in series)
CLN: modify core/expressions to use get_values()
remove methods from SparseFrame (and use inherited):
combine_first,icol,as_matrix,get_dtype_counts
bug fix in core/internals/get_dtype_counts
CLN: use _values_from_object instead of direct call to get_values()
BUG: fixed set_value semantics, as it could possibly change the index
BUG: fixed tseries/period indexing
fixed some bugs showing up in 32-bit (in nanops)
BUG: fix incorrect exception raised in indexing (on 32-bit)
BUG: fixed get_merge_keys (add Series to ndarray testing)
BUG: fixed pivot table maybe???x
core/internals/_ref_locs will now set indexer if ref_items==items
TST: apply_reduce in tests/test_frame still failing
BUG: fixed getitem_boolean_object finally I think (was issue in set_value in Series)
BUG: fixed putmasking mess in Series, now in core/internals
BUG: more fixes
BUG: fixed core/internals/replace as choking on input
BUG: refixed groupby
BUG: fix test_where in series
BUG: fixed reindex on a sparse block (was not taking correctly)
BUG: fixed sparse filling!!!!!
BUG: fixed pivot, need to define __hash__ to raise TypeError in NDFrame
BUG: downcast argument not in SparseBlock or sparse/frame.py for fillna
BUG: fix apply_reduce?
BUG: fixes in reduce.pyx to deal with reconstructing a Series argument to the function
if needed
BUG: reducer now produces a Series with its index (to the called function)
ols converts to_dense to avoid some issues
ENH: fixed core/frame/apply to accept reduce argument (default True),
to allow turning off the reduction attempt (to preserve the column character)
if say self.values would change it
BUG: finally fixed reducer?
BUG: reduce on frame bug (showing in py3)
BUG: ols not working with sparse
TST: stats.tests.test_ols/test_wls is not testing for the correct version
of statsmodels (fails on 32-bit)
PTF
TST: make sure to skip the test_wls if our version isn't enough
PERF: some perf enhancements
BUG: fix sparse/array/make_sparse to take objects and extract the arrays
PERF: series construction now much faster
PERF: improvements in core/internals
MERGE: updated to master and merged in
MERGE: more merging fixes
PERF: fixed null tests to be MUCH faster
PERF: improvements in series construction via from_array
PERF: merge improvements by using _has_sparse in bms
PERF: some improvements
PERF: more internals optimizations
CLN: Index now subclassed off of PandasObject
BUG: fixed inheritance for core/index.py (Index), solves unicode issues
BUG: some merge errors in sparse
VB: modernize the sparse vb suite
BUG: fixed merging by single item (was broken for sparse for some reason)
names not propagating in Series constructor on _slice
BUG: add name back to series constructor
ENH: pickle compatibility for Series/SparseSeries prior to 0.12!
ENH: added pickle_compat to common/load
BUG: in core/series on fastpath and index is actually changed
(e.g. it's actually a datelike index, but is of type object),
need to set the axis in the BlockManager
BUG: _getitem__bool only is active for Index/Int64Index (issues with DatetimeIndex/PeriodIndex)
so default to having it call (slower) __getitem__
COMPAT: py3 compat fixes
TST: recover pickles in a particular order or names
MERGE: fixup merging with 0.11.0 final
BUG: set _subtyp in sparse (use main type of object)
BUG: fixed merging on need to reindex sparse
BUG: fixed consolidation issue prior to merge
BUG: construction of a series with another series odd bug
BUG: fix series constructor when passed a dtype (and no copy)
BUG: fixed sparse slicing via blocks (don't use a sparse block when slicing)
BUG: fixed remaining sparse issue (SparseDataFrame was converting SparseArray incorrectly)
BUG: dtypes in groupby nth fixed (converting on aggregation item_by_item)
BUG: partial fix on groupby?
BUG: restored groupby back to master (SeriesGrouper)
BUG: more fixes on groupby
BUG: fixed all groupbys!
BUG: get_median in core/nanops.py complaining
PERF: made constructions of SparseFrame have less redundant steps
PERF: minor series perf improvement
TST: trying to fix how_lambda in tseries/resample
PTF
PERF: addtl groupby multi_python perf improvements
PERF: speeds up for Series.__getitem__
PERF: some perf on groupby.....
added _block, _values in SingleBlockManager
PERF: more reducer improvements
BUG: fixed SeriesBinGrouper hopefully
BUG: tseries/index.py was missing __str__ = __repr__
BUG: groupby filter that return a series/ndarray truth testing
BUG: refixed GH3880, prop name index
BUG: not handling sparse block deletes in internals/_delete_from_block
BUG: refix generic/truncate
TST: refixed generic/replace (bug in core/internals/putmask) revealed as well
TST: fix spare_array to put up correct type exceptions rather than Exception
CLN: cleanups
BUG: fix stata dtype inference (error in core/internals/astype)
BUG: fix ujson handling of new series object
BUG: fixed scalar coercion (e.g. calling float(series)) to work
BUG: fixed astyping with and w/o copy
ENH: added _propogate_attributes method to generic.py to allow
subclasses to automatically propagate things like name
DOC: added v0.13.0.txt feature descriptions
CLN: pep8ish cleanups
BUG: fix 32-bit,numpy 1.6.1 issue with datetimes in astype_nansafe
PERF: speedup for groupby by passing a SNDArray (Series like ndarray) object to evaluation functions
if allowed, can avoid Series creation overhead
BUG: issue with older numpy (1.6.1) in SeriesGrouper, fallback to passing a Series
rather than SNDArray
DOC: release notes & doc updates
DOC: fixup doc build failures
DOC: change passing of direct ndarrays to cython doc functions (enhancedperformance.rst)
…cache based on
changes (GH4080)
BUG: Series not updating properly with object dtype (GH3217)
BUG: (GH3386) fillna same issue as (GH4080), not updating cacher
CLN: cleaned up internal block action routines, now always return a list of blocks
Instead of the `is_series`, `is_generic`, etc. methods, can use the ABC* methods to check for certain pandas types. This is useful because it helps decrease issues with circular imports (since they can be easily imported from core/common). The checks take advantage of the `_typ` and `_subtyp` attributes to handle checks (e.g. `DataFrame` now has `_typ` of `"dataframe"`, etc.). See the code for specifics.
PERF: register _cacher as an internal name
BUG: fixed abstract base class type checking bug in py2.6
DOC: updates for abc type checking
PERF: small perf gains in _get_item_cache
TST/BUG: test/bugfix for GH4463
BUG: fix core/internals/setitem to work for boolean types (weird numpy bug!)
BUG: partial frame setting with dtype change (GH4204)
BUG: Indexing with dtype conversions fixed GH4463 (int->float), GH4204(boolean->float)
BUG: provide better ndarray compat
CLN: removed some duped methods
MERGE: fix an issue cropping up on the rebase
TST: additional test for series dtype conversion with where (and fix!)
DOC: update docstrings in to_json/to_hdf/pd.read_hdf
BLD: ujson rebase issue fixed
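The `_typ`-based type checks mentioned in the commit notes above can be sketched roughly as follows. This is an illustration of the general technique (a metaclass with `__instancecheck__` that inspects a marker attribute), not the code from this PR; `make_abc_type`, `ABCSeriesLike`, `ABCFrameLike`, `FakeSeries` and `FakeFrame` are hypothetical names.

```python
class _TypCheckMeta(type):
    """Metaclass whose isinstance check compares a marker attribute."""

    def __instancecheck__(cls, inst):
        # No import of the concrete class is needed: only the marker is read.
        return getattr(inst, cls._attr, None) in cls._allowed


def make_abc_type(name, attr, allowed):
    """Hypothetical factory: build a checker class for the given marker values."""
    return _TypCheckMeta(name, (), {"_attr": attr, "_allowed": tuple(allowed)})


ABCSeriesLike = make_abc_type("ABCSeriesLike", "_typ", ("series",))
ABCFrameLike = make_abc_type("ABCFrameLike", "_typ", ("dataframe",))


class FakeSeries:
    _typ = "series"


class FakeFrame:
    _typ = "dataframe"


print(isinstance(FakeSeries(), ABCSeriesLike))
print(isinstance(FakeFrame(), ABCSeriesLike))
```

Because the checker classes depend only on an attribute name, they can live in a low-level module and be imported anywhere without pulling in the concrete types, which is how this style of check sidesteps circular imports.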
|
@wesm .... ok... bombs away shortly |
CLN: series to now inherit from NDFrame
|
thanks to @jtratner, @Komnomnomnom, @cpcloud, @jseabold, @wesm for assistance with various aspects of this PR! squash those bugs! |
|
@wesm I'm thinking of how to make the instance check work. Maybe we could |
|
@jtratner isinstance checking is not necessary. What is the purpose of this? Some sort of back compat? |
|
@wesm made the comment. It's actually impossible anyways. |
Major refactor primarily to make Series inherit from NDFrame
affects #4080, #3862, #816, #3217, #3386, #4463, #4204, #4118 , #4555
Preserves pickle compat
very few tests were changed (and only for compat on return objects)
a few performance enhancements, a couple of regressions (see bottom)
obviously this is a large change in terms of the codebase, but it brings more consistency between series/frame/panel (not all of this is there yet, but future changes are much easier)
Series is now like Frame in that it has a BlockManager (called SingleBlockManager), which holds a block (of any type we support). This introduced some overhead in doing certain operations, which I spent a lot of time optimizing away, further optimizations will come from cythonizing the core/internals, which should be straightforward at this point
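As a rough picture of the layering described above, the following toy sketch shows a series-like object that owns a manager, which in turn owns a single block. All `Toy*` classes are hypothetical stand-ins written for illustration; they are far simpler than pandas' internals and do not mirror their real signatures.

```python
class ToyBlock:
    """Holds one homogeneous chunk of values."""

    def __init__(self, values):
        self.values = list(values)


class ToySingleBlockManager:
    """Pairs exactly one block with one axis of labels."""

    def __init__(self, block, index):
        index = list(index)
        if len(block.values) != len(index):
            raise ValueError("block and index lengths differ")
        self.block = block
        self.axes = [index]


class ToyNDFrame:
    """Shared behavior lives here and only talks to the manager."""

    def __init__(self, manager):
        self._data = manager

    @property
    def axes(self):
        return self._data.axes

    @property
    def shape(self):
        return tuple(len(ax) for ax in self._data.axes)

    @property
    def empty(self):
        return any(n == 0 for n in self.shape)


class ToySeries(ToyNDFrame):
    """One-dimensional container built on the shared base class."""

    def __init__(self, values, index=None):
        values = list(values)
        if index is None:
            index = range(len(values))
        super().__init__(ToySingleBlockManager(ToyBlock(values), index))

    @property
    def values(self):
        return self._data.block.values


s = ToySeries([1.0, 2.0], index=["a", "b"])
print(s.shape, s.axes, s.empty)
```

The extra indirection (object, then manager, then block) is where construction overhead of the kind mentioned above would come from in a design like this.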
Highlights below:
In 0.13.0 there is a major refactor primarily to subclass `Series` from `NDFrame`, which is the base class currently for `DataFrame` and `Panel`, to unify methods and behaviors. Series formerly subclassed directly from `ndarray`.
`_setup_axes` to create generic NDFrame structures
`from_axes`, `_wrap_array`, `axes`, `ix`, `shape`, `empty`, `swapaxes`, `transpose`, `pop`
`__iter__`, `keys`, `__contains__`, `__len__`, `__neg__`, `__invert__`
`convert_objects`, `as_blocks`, `as_matrix`, `values`
`__getstate__`, `__setstate__` (though compat remains in frame/panel)
`__getattr__`, `__setattr__`
`_indexed_same`, `reindex_like`, `align`, `where`, `mask`, `replace`
`filter` (also added axis argument to selectively filter on a different axis)
`reindex`, `reindex_axis` (which was the biggest change to make generic)
`truncate` (moved to become part of `NDFrame`)
`Panel` more consistent with `DataFrame`
`DataFrame` `filter`
`NDFrame` rather than directly from `ndarray`.
There are several minor changes that affect the API.
return `ndarrays` rather than series, e.g. `np.diff` and `np.where`
`Series(0.5)` would previously return the scalar `0.5`, this is no longer supported
`NDFrame` (`convert_objects`, `where`, `mask`)
`TimeSeries` is now an alias for `Series`. The property `is_time_series` can be used to distinguish (if desired)
`SparseBlock`, which can hold multi-dtypes and is non-consolidatable.
`SparseSeries` and `SparseDataFrame` now inherit more methods from their hierarchy (Series/DataFrame), and no longer inherit from `SparseArray` (which instead is the object of the `SparseBlock`)
data is supportable (partially implemented)
merging type operations will convert to dense (and back to sparse), so might be somewhat inefficient
`SparseSeries` for boolean/integer/slices
`SparsePanels` implementation is unchanged (e.g. not using BlockManager, needs work)
`ftypes` method to Series/DataFrame, similar to `dtypes`, but indicates if the underlying is sparse/dense (as well as the dtype)
`NDFrame` objects now have a `_prop_attributes`, which can be used to indicate various values to propagate to a new object from an existing (e.g. name in `Series` will follow more automatically now)
Perf changed a bit, primarily in groupby, where a Series has to be reconstructed in order to be passed to the function (in some cases). I basically pass a Series-like class to the grouped function to see whether it raises; if it's ok, it is used rather than a full Series, which reduces the overhead of creating a Series for each group.
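The "try the cheap object first" strategy for groupby can be sketched like this. It is a simplified illustration under assumptions: `LightValues`, `FullSeries` and `apply_by_group` are hypothetical names, and the real groupby machinery is considerably more involved.

```python
class LightValues:
    """Hypothetical cheap stand-in exposing only the raw values."""

    def __init__(self, values):
        self.values = values

    def __iter__(self):
        return iter(self.values)

    def __len__(self):
        return len(self.values)


class FullSeries(LightValues):
    """Hypothetical 'expensive' object that also carries an index."""

    def __init__(self, values, index):
        super().__init__(values)
        self.index = index


def apply_by_group(groups, func):
    """Call func per group, preferring the cheap wrapper while it works."""
    use_light = True
    out = {}
    for key, (values, index) in groups.items():
        if use_light:
            try:
                out[key] = func(LightValues(values))
                continue
            except Exception:
                # The function needs more than raw values: stop trying the
                # cheap wrapper and fall back for this and later groups.
                use_light = False
        out[key] = func(FullSeries(values, index))
    return out


groups = {"a": ([1, 2], [0, 1]), "b": ([3], [2])}
print(apply_by_group(groups, sum))
print(apply_by_group(groups, lambda s: s.index[0]))
```

Functions that only touch the values never pay for building the full object, while functions that need the index still get correct results through the fallback.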