-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
Please add force_suffixes to pandas.merge() #17834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@soerendip : Thanks for reporting this! Could you explain why you would want the suffix to be forcefully added, even if duplicates don't exist? I'm having some difficulty understanding the use-case. |
@gfyoung: Say you have several tables with the identical column names. You want to combine them into one table while keeping track of the source. As in my example above, the third table did not get the label. You could use the |
Hi, |
@ash13 I would try a different issue; this request has not gotten much commentary; i am -0 on it nowhere |
This isn't coming from nowhere - there's a stackoverflow question on it. As of October 31, 2017, it has 10 up-votes. |
@ThomasProctor : Thanks for pointing that out. I'm still not convinced that we should do this. You can easily call |
Here's the kludge I've been using to do this. Maybe it will help with a fix. I make no claims to it being remotely well-engineered def merge_force_suffix(left, right, **kwargs):
on_col = kwargs['on']
suffix_tupple = kwargs['suffixes']
def suffix_col(col, suffix):
if col != on_col:
return str(col) + suffix
else:
return col
left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tupple[0]))
right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tupple[1]))
del kwargs['suffixes']
return pd.merge(left_suffixed, right_suffixed, **kwargs) |
@ThomasProctor : Thanks for sharing this. I think this would be a reasonable workaround for people to use if they so need to. |
@gfyoung My use case right now is merging together 10 or so DataFrames, all with possible redundant column names. Trying to keep track of where columns came from is just annoying. |
@gfyoung : +1 for this feature. Like others, I find having to rename dataframes to get suffixes annoying and unnatural. Either you have to modify in place (never the behavior I want), or you have to duplicate the frames. I'd actually argue that the default behavior of join should be to always append suffixes if they are supplied, but at least having the option would make merging multiple frames less unpleasant. |
This feature (enforcing suffixes) is definitely needed for all non-trivial merges in order to avoid ambiguity: currently I never know what column names I will get in the result frame. Ideally, it should be a default option: if I provide a suffix then it must be used. |
I agree - this very useful. |
Having a df 'links' with source and destination columns. I need to add details about the source and the destination from df 'cities'. Easy: juast merge links and cities on the source and city_id and then merge the result again with cities, only this time on destination and city_id. What do I get? A little mess because the details of the source city are NOT marked with the '_s' suffix after the first merge, but the second merge ends with the '_d' suffix on all the columns from df cities.
I don't really want to rename cities prior to the first merge and then again prior to the second. And I definitely do not want cities to have its column names altered after I do the double merge. |
The original request seems reasonable. I'm not sure about the name
`force_suffixes`, but I haven't come up with anything better.
…On Mon, May 27, 2019 at 9:29 PM Juan Pablo Tamayo ***@***.***> wrote:
Having a df 'links' with source and destination columns. I need to add
details about the source and the destination from df 'cities'. Easy: juast
merge links and cities on the source and city_id and then merge the result
again with cities, only this time on destination and city_id. What do I
get? A little mess because the details of the source city are NOT marked
with the '_s' suffix after the first merge, but the second merge ends with
the '_d' suffix on all the columns from df cities.
links = merge(links, cities, left_on='source', right_on='city_id', suffixes=('', '_s'))
links = merge(links, cities, left_on='destination', right_on='city_id', suffixes=('', '_d'))
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#17834?email_source=notifications&email_token=AAKAOIWN3KO7GN5FU2K5C5DPXSKH3A5CNFSM4D6OMT6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWKZLNI#issuecomment-496342453>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIS3UKTFJNBIJNKSG73PXSKH3ANCNFSM4D6OMT6A>
.
|
|
yeah this feature would be great . say I have one total dataframe, I break it out by column value and then merge to another. every time i do that in a for loop would be great to add that column value I selected and add it to new columnd name. say product X mean, product Y mean, product Z mean |
Hi ! Here is my use case : As I don't know which fields the user will load, and will be duplicate with my reference dataset, I'd like to add suffixes for the two dataframes while merging. This would fix the problem of correctly selecting the field as I don't know in advance which one will be duplicate. As reported above, I could suffixe all fields earlier, when loading the dataframes, but seems less convenient. |
Any update on this feature? It's been a while since it was first suggested, but as others have pointed out, there are plenty of good use cases for it. Could have certainly used it on my current project with multiple merges! |
pandas is all volunteer you are welcome to submit a PR |
Just a small remark... it's a nice implementation, thanks. However it only works for single column "on". You could make a small adaption: |
If you only replace Updated version that works with both single and multi-column "on": def merge_force_suffix(left, right, **kwargs):
on_col = kwargs['on']
on_col = [on_col] if not isinstance(on_col, (list, tuple)) else on_col
suffix_tupple = kwargs['suffixes']
def suffix_col(col, suffix):
if col not in on_col:
return str(col) + suffix
else:
return col
left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tupple[0]))
right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tupple[1]))
del kwargs['suffixes']
return pd.merge(left_suffixed, right_suffixed, **kwargs) |
I don't believe this is still open. need this feature so bad right now.. :( |
Hi, |
Uh oh!
There was an error while loading. Please reload this page.
Code Sample, a copy-pastable example if possible
Problem description
When using pandas.merge() suffixes are only added to the column-names if a column-name appears in both data frames. With more than two data frames that can lead to the above situation. The first merge adds suffixes. When the resulting table is merged with a third table the column-name does not appear twice and the suffix is not added. The suffix then needs to be added manually. This is overhead, especially when larger numbers of data-frames are merged. An option "force_suffixes" would be appreciated that ensures the suffix is added.
Expected Output
The text was updated successfully, but these errors were encountered: