diff --git a/blogs/.DS_Store b/blogs/.DS_Store
new file mode 100644
index 0000000..5008ddf
Binary files /dev/null and b/blogs/.DS_Store differ
diff --git a/blogs/d6tjoin-blogs/.DS_Store b/blogs/d6tjoin-blogs/.DS_Store
new file mode 100644
index 0000000..31c3c68
Binary files /dev/null and b/blogs/d6tjoin-blogs/.DS_Store differ
diff --git a/blogs/d6tjoin-blogs/Fuzzy joins in python with d6tjoin.md b/blogs/d6tjoin-blogs/Fuzzy joins in python with d6tjoin.md
new file mode 100644
index 0000000..0d93e6e
--- /dev/null
+++ b/blogs/d6tjoin-blogs/Fuzzy joins in python with d6tjoin.md
@@ -0,0 +1,130 @@
+# Fuzzy joins in python with d6tjoin
+
+## Combining different data sources is a time suck!
+Combining data from different sources is a big time sink for data scientists. [d6tjoin](https://github.com/d6t/d6tjoin) is a python library that lets you join pandas dataframes quickly and efficiently.
+
+Coauthored with [Haijing Li](https://www.linkedin.com/in/haijing-li-7b50a11b2/), Data Analyst in Financial Services, MS Business Analytics@Columbia University.
+
+
+## Example
+I have made up this example to illustrate what d6tjoin is capable of.
+
+Suppose several companies' stocks have caught my attention for a while and I have come up with a strategy that scores those companies' performance on a 1-5 point scale. Backtesting on historical data will help me evaluate whether stock prices really reflect those scores and figure out how I want to trade on them.
+The information I need for backtesting is contained in the following two datasets: df_price contains historical stock prices for 2019 and df_score contains scores that I update regularly.
+
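+To make this concrete, here is a minimal sketch of what the two datasets look like (the tickers, dates and values below are made up for illustration; the real data has daily 2019 prices and monthly score updates):
+```
+import pandas as pd
+
+# hypothetical sample of df_price: daily prices per ticker
+df_price = pd.DataFrame({
+    'ticker': ['BIIB-US', 'BIIB-US', 'REGN-US'],
+    'date': ['2019-01-02', '2019-01-03', '2019-01-02'],
+    'close': [305.4, 302.1, 377.8]})
+
+# hypothetical sample of df_score: monthly scores, different ticker convention
+df_score = pd.DataFrame({
+    'ticker': ['BIIB US EQUITY', 'REGN US EQUITY'],
+    'date': ['2019-01-10', '2019-01-10'],
+    'score': [4, 2]})
+```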
+To prepare for backtesting, I need to merge the "score" column onto df_price. Obviously, ticker name and date should be the merge keys. But there are two problems: 1. Values in "ticker" of df_price and of df_score are not identical; 2. Scores were recorded on a monthly basis, and I want each row in df_price to be assigned the most recent score, assuming the next score is not available until its update date.
+
+
+## Prejoin Analysis
+One of the best things about d6tjoin is that it provides easy pre-join diagnostics. This is particularly useful for detecting potential data problems even if you didn't intend to do a fuzzy join.
+```
+import d6tjoin
+import d6tjoin.top1
+
+j = d6tjoin.Prejoin([df_price,df_score],['ticker','date'])
+
+try:
+    assert j.is_all_matched()  # fails
+except AssertionError:
+    print('assert fails!')
+```
+
+After this, we know that the two tables are not fully matched on the merge keys. This is even more useful when you have larger datasets and messier string identifiers, where you cannot tell at a glance whether the two datasets share the same key values.
+
+For our case, a more useful method is `Prejoin.match_quality()`. It summarizes the number of matched/unmatched records for each join key. In our case, no ticker names match exactly and only a few dates match. This is why we need d6tjoin to help with a fuzzy join.
+```
+j.match_quality()
+```
+
+
+
+## Join with Misaligned Ids (Names) and Dates
+d6tjoin does best-match joins on strings, dates and numbers. The `MergeTop1()` class in the `d6tjoin.top1` module is very versatile and gives you the flexibility to define how you want to merge: exact or fuzzy, on multiple keys, using default or custom difference functions. By default, the difference between strings is calculated with Levenshtein distance.
+
+In our example, both 'ticker' and 'date' are misaligned in the two datasets, so we need to do fuzzy join on both.
+
+```
+result = d6tjoin.top1.MergeTop1(df_price, df_score,
+                                fuzzy_left_on=['ticker','date'],
+                                fuzzy_right_on=['ticker','date']).merge()
+```
+The result is stored in a dictionary with two keys: `{'merged': a pandas dataframe of the merged result, 'top1': performance statistics and summary for each join key}`
+
+Let's take a look at the result:
+```
+result['top1']['ticker']
+result['merged']
+```
+The first line returns the performance summary for the key 'ticker'. It shows not only the matched key values on the left and right but also the calculated difference score in the `__top1diff__` column. By default, each key value in the left table is matched to one from the right table. Therefore, even though "SRPT-US" and "VRTX-US" are not recorded in df_score, they still get matched to something. We can solve this problem by setting `top_limit=3`, which ignores any match whose difference is greater than 3.
+
+
+The second line gives back the merged dataset.
+
+
+The original df_price dataset contains only 1536 rows, so why does the merged result have 5209 rows?
+Well, that's because in the original datasets we parsed "date" as strings. d6tjoin uses Levenshtein distance for strings by default, so one date from the left gets matched with several dates from the right. The lesson: whenever you deal with dates, check their datatype first.
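+
+To see why string distance misbehaves on dates, here is a quick illustration (using the [jellyfish](https://github.com/jamesturk/jellyfish) library, with made-up dates): as strings, a date ten months away can be closer in edit distance than a date four weeks away.
+```
+import jellyfish
+
+jellyfish.levenshtein_distance('2019-01-04', '2019-11-04')  # 1, yet 10 months apart
+jellyfish.levenshtein_distance('2019-01-04', '2019-01-31')  # 2, yet only 27 days apart
+```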
+
+Change "date" datatype from string to datetime object and set `top_limit=3` for "ticker". Let's check the result again.
+```
+import datetime
+df_price["date"]=df_price["date"].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))
+df_score["date"]=df_score["date"].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))
+```
+```
+result = d6tjoin.top1.MergeTop1(df_price, df_score,
+                                fuzzy_left_on=['ticker','date'],
+                                fuzzy_right_on=['ticker','date'],
+                                top_limit=[3,None]).merge()
+```
+```
+result['top1']['ticker']
+result['merged']
+```
+
+
+
+
+
+Looks good! All the tickers on the left are perfectly matched, and each date on the left is matched to the closest date on the right.
+
+## Advanced Usage Option: Passing a Custom Difference Function
+Now we have one last problem. Remember that we want to assume a score is not available until it has been assigned. That means we want each row from df_price to be matched with the most recent prior score. But by default, d6tjoin doesn't consider the order of dates; it simply matches the closest date, whether before or after.
+
+To tackle this problem, we need to write a custom difference function: if the date from the left comes before the date from the right, the calculated difference should be made large enough for the match to be ignored.
+```
+import numpy as np
+
+def diff_customized(x, y):
+    # x is the left (price) date and y is the right (score) date;
+    # np.busday_count(x,y) > 0 means the left date comes before the right date,
+    # i.e. the score is not yet available, so make the difference huge
+    if np.busday_count(x, y) > 0:
+        diff_ = 10000000000  # large enough for the match to be ignored
+    else:
+        diff_ = np.busday_count(x, y)
+    return abs(diff_)
+```
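+
+As a quick sanity check of this logic (with hypothetical dates): a score dated after the price date gets pushed out of matching range, while a score dated before keeps its small business-day distance.
+```
+import numpy as np
+
+# score date after price date: huge difference, match will be ignored
+assert diff_customized(np.datetime64('2019-01-15'), np.datetime64('2019-02-01')) > 300
+# score date before price date: small difference, match kept
+assert diff_customized(np.datetime64('2019-02-01'), np.datetime64('2019-01-15')) <= 300
+```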
+
+Now let's pass the custom difference function in and see the result.
+
+```
+result = d6tjoin.top1.MergeTop1(df_price, df_score,
+                                fuzzy_left_on=['ticker','date'],
+                                fuzzy_right_on=['ticker','date'],
+                                fun_diff=[None,diff_customized],
+                                top_limit=[3,300]).merge()
+```
+```
+result['top1']['date']
+result['merged']
+```
+
+
+
+
+Now we have our final merged dataset: price rows without a previously assigned score are dropped, and every other row is assigned the most recent prior score.
+
+## Conclusion
+d6tjoin provides features including:
+- Pre-join diagnostics to identify mismatched join keys
+- Best-match joins that find the most similar values of misaligned ids, names and dates
+- The ability to customize difference functions, set a maximum difference and other advanced features
+
+Check out the [d6t library](https://github.com/d6t/d6t-python). It provides solutions to common data science problems including:
+- [d6tflow](https://github.com/d6t/d6tflow): build data science workflows
+- [d6tjoin](https://github.com/d6t/d6tjoin): quickly do fuzzy joins
+- [d6tpipe](https://github.com/d6t/d6tpipe): quickly share and distribute data
+
+
+
diff --git a/blogs/d6tjoin-blogs/examples-prejoin.ipynb b/blogs/d6tjoin-blogs/examples-prejoin.ipynb
new file mode 100644
index 0000000..69a58cc
--- /dev/null
+++ b/blogs/d6tjoin-blogs/examples-prejoin.ipynb
@@ -0,0 +1,592 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Data Engineering in Python with databolt - Identify and analyze join problems (d6tlib/d6tjoin.utils)\n",
+ "\n",
+ "## Introduction\n",
+ "\n",
+ "Joining datasets is a common data engineering operation. However, often there are problems merging datasets from different sources because of mismatched identifiers, date conventions etc. \n",
+ "\n",
+ "** `d6tjoin.utils` module allows you to test for join accuracy and quickly identify and analyze join problems. **\n",
+ "\n",
+ "Here are some examples which show you how to:\n",
+ "* do join quality analysis prior to attempting a join\n",
+ "* detect and analyze a string-based identifiers mismatch\n",
+ "* detect and analyze a date mismatch"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Generate sample data\n",
+ "\n",
+ "Let's generate some random respresentative data:\n",
+ "* identifier (string)\n",
+ "* date (np.datetime)\n",
+ "* values (flaot)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import uuid\n",
+ "import itertools\n",
+ "import importlib\n",
+ "\n",
+ "import d6tjoin.utils\n",
+ "importlib.reload(d6tjoin.utils)\n",
+ "\n",
+ "# ******************************************\n",
+ "# generate sample data\n",
+ "# ******************************************\n",
+ "nobs = 10\n",
+ "uuid1 = [str(uuid.uuid4()) for _ in range(nobs)]\n",
+ "dates1 = pd.date_range('1/1/2010','1/1/2011')\n",
+ "\n",
+ "df1 = pd.DataFrame(list(itertools.product(uuid1,dates1)),columns=['id','date'])\n",
+ "df1['v']=np.random.sample(df1.shape[0])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " id | \n",
+ " date | \n",
+ " v | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 | \n",
+ " 2010-01-01 | \n",
+ " 0.637035 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 | \n",
+ " 2010-01-02 | \n",
+ " 0.040725 | \n",
+ "
\n",
+ " \n",
+ " 366 | \n",
+ " 8062e9b6-b294-43e4-93d5-182d5432bcc3 | \n",
+ " 2010-01-01 | \n",
+ " 0.267829 | \n",
+ "
\n",
+ " \n",
+ " 367 | \n",
+ " 8062e9b6-b294-43e4-93d5-182d5432bcc3 | \n",
+ " 2010-01-02 | \n",
+ " 0.895737 | \n",
+ "
\n",
+ " \n",
+ " 732 | \n",
+ " 20701b42-bb47-49f0-add6-88fc98aaac99 | \n",
+ " 2010-01-01 | \n",
+ " 0.817187 | \n",
+ "
\n",
+ " \n",
+ " 733 | \n",
+ " 20701b42-bb47-49f0-add6-88fc98aaac99 | \n",
+ " 2010-01-02 | \n",
+ " 0.952077 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " id date v\n",
+ "0 bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 2010-01-01 0.637035\n",
+ "1 bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 2010-01-02 0.040725\n",
+ "366 8062e9b6-b294-43e4-93d5-182d5432bcc3 2010-01-01 0.267829\n",
+ "367 8062e9b6-b294-43e4-93d5-182d5432bcc3 2010-01-02 0.895737\n",
+ "732 20701b42-bb47-49f0-add6-88fc98aaac99 2010-01-01 0.817187\n",
+ "733 20701b42-bb47-49f0-add6-88fc98aaac99 2010-01-02 0.952077"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df1.groupby(['id']).head(2).head(6)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Use Case: assert 100% join accuracy for data integrity checks \n",
+ "\n",
+ "In data enginerring QA you want to test that data is joined correctly. This is particularly useful for detecting potential data problems in production."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df2 = df1.copy()\n",
+ "\n",
+ "j = d6tjoin.Prejoin([df1,df2],['id','date'])\n",
+ "assert j.is_all_matched() # succeeds\n",
+ "assert j.is_all_matched('id') # succeeds\n",
+ "assert j.is_all_matched('date') # succeeds\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Use Case: detect and analyze id mismatch \n",
+ "\n",
+ "When joining data from different sources, eg different vendors, often your ids don't match and then you need to manually analyze the situation. With databolt this becomes much easier."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 100% id mismatch\n",
+ "\n",
+ "Let's look at an example where say vendor 1 uses a different id convention than vendor 2 and none of the ids match."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "assert fails!\n"
+ ]
+ }
+ ],
+ "source": [
+ "# create mismatch\n",
+ "df2['id'] = df1['id'].str[1:-1]\n",
+ "\n",
+ "j = d6tjoin.Prejoin([df1,df2],['id','date'])\n",
+ "\n",
+ "try:\n",
+ " assert j.is_all_matched() # fails\n",
+ "except:\n",
+ " print('assert fails!')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The QA check shows there's a problem, lets analyze the issue with `Prejoin().match_quality()`. We can immediately see that none of the ids match."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " key left key right all matched inner left right outer unmatched total unmatched left unmatched right\n",
+ "0 id id False 0 10 10 20 20 10 10\n",
+ "1 date date True 366 366 366 366 0 0 0\n",
+ "2 __all__ __all__ False 0 3660 3660 7320 7320 3660 3660\n"
+ ]
+ }
+ ],
+ "source": [
+ "j.match_quality()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's look at some of the mismatched records with `Prejoin().show_unmatched()`. Looks like there might be a length problem."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " id date v\n",
+ "1830 bb517ed3-a2fb-49b3-ba46-1bf947f83188 2010-01-01 0.174712\n",
+ "1831 bb517ed3-a2fb-49b3-ba46-1bf947f83188 2010-01-02 0.858288\n",
+ "1832 bb517ed3-a2fb-49b3-ba46-1bf947f83188 2010-01-03 0.508282\n",
+ " id date v\n",
+ "0 ea5c9d3-2e88-4bc1-b9ab-dbce797dfee 2010-01-01 0.637035\n",
+ "1 ea5c9d3-2e88-4bc1-b9ab-dbce797dfee 2010-01-02 0.040725\n",
+ "2 ea5c9d3-2e88-4bc1-b9ab-dbce797dfee 2010-01-03 0.797620\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(j.show_unmatched('id')['left'])\n",
+ "print(j.show_unmatched('id')['right'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can show string length statistics using `d6tjoin.Prejoin().describe_str()` which confirms that the id string lenghts are different."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "dataframe #0\n",
+ " median min max nrecords\n",
+ "id 36.0 36.0 36.0 3660.0\n",
+ "dataframe #1\n",
+ " median min max nrecords\n",
+ "id 34.0 34.0 34.0 3660.0\n",
+ "None\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(j.describe_str())\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Partial id mismatch\n",
+ "\n",
+ "Let's look at another example where there is a partial mismatch. In this case let's say vendor 2 only has a certain percentage of ids covered."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "assert fails!\n"
+ ]
+ }
+ ],
+ "source": [
+ "# create partial mismatch\n",
+ "uuid_sel = np.array(uuid1)[np.random.choice(nobs, nobs//5, replace=False)].tolist()\n",
+ "df2 = df1[~df1['id'].isin(uuid_sel)]\n",
+ "\n",
+ "j = d6tjoin.Prejoin([df1,df2],['id','date'])\n",
+ "\n",
+ "try:\n",
+ " assert j.is_all_matched() # fails\n",
+ "except:\n",
+ " print('assert fails!')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Again we've quickly identified a problem. This would typically cause you to do manual and tedious manual QA work but with `Prejoin().match_quality()` you can quickly see how many ids were mismatched."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " key left key right all matched inner left right outer unmatched total unmatched left unmatched right\n",
+ "0 id id False 8 10 8 10 2 2 0\n",
+ "1 date date True 366 366 366 366 0 0 0\n",
+ "2 __all__ __all__ False 2928 3660 2928 3660 732 732 0\n"
+ ]
+ }
+ ],
+ "source": [
+ "j.match_quality()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Use Case: detect and analyze date mismatch \n",
+ "\n",
+ "Dates are another common sources of frustration for data engineers working with time series data. Dates come in a variety of different formats and conventions. Let's use databolt to analyze a date mismatch situation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dates2 = pd.bdate_range('1/1/2010','1/1/2011') # business instead of calendar dates\n",
+ "df2 = pd.DataFrame(list(itertools.product(uuid1,dates2)),columns=['id','date'])\n",
+ "df2['v']=np.random.sample(df2.shape[0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To highlight some different functionality for `Prejoin().match_quality()` The QA test for all matches fails.We can look at the dataframe to see 105 dates are not matched."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " key left key right all matched inner left right outer unmatched total unmatched left unmatched right\n",
+ "0 id id True 10 10 10 10 0 0 0\n",
+ "1 date date False 261 366 261 366 105 105 0\n",
+ "2 __all__ __all__ False 2610 3660 2610 3660 1050 1050 0\n",
+ "assert fails!\n"
+ ]
+ }
+ ],
+ "source": [
+ "j = d6tjoin.Prejoin([df1,df2],['id','date'])\n",
+ "dfr = j.match_quality()\n",
+ "try:\n",
+ " assert dfr['all matched'].all() # fails\n",
+ "except:\n",
+ " print('assert fails!')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can look at mismatched records using `Prejoin.show_unmatched()`. Here we will return all mismatched records into a dataframe you can analyze."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dft = j.show_unmatched('date',keys_only=False,nrecords=-1,nrows=-1)['left']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " id | \n",
+ " date | \n",
+ " v | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 1 | \n",
+ " bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 | \n",
+ " 2010-01-02 | \n",
+ " 0.040725 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 | \n",
+ " 2010-01-03 | \n",
+ " 0.797620 | \n",
+ "
\n",
+ " \n",
+ " 8 | \n",
+ " bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 | \n",
+ " 2010-01-09 | \n",
+ " 0.810554 | \n",
+ "
\n",
+ " \n",
+ " 9 | \n",
+ " bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 | \n",
+ " 2010-01-10 | \n",
+ " 0.372955 | \n",
+ "
\n",
+ " \n",
+ " 15 | \n",
+ " bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 | \n",
+ " 2010-01-16 | \n",
+ " 0.503245 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " id date v\n",
+ "1 bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 2010-01-02 0.040725\n",
+ "2 bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 2010-01-03 0.797620\n",
+ "8 bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 2010-01-09 0.810554\n",
+ "9 bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 2010-01-10 0.372955\n",
+ "15 bea5c9d3-2e88-4bc1-b9ab-dbce797dfee9 2010-01-16 0.503245"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "dft.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Looking at the weekdays of the mismatched entries, you can see they are all weekends. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([5, 6])"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "dft['date_wkday']=dft['date'].dt.weekday\n",
+ "dft['date_wkday'].unique()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusion\n",
+ "\n",
+ "Joining datasets from different sources can be a big time waster for data engineers! With databolt you can quickly do join QA and analyze problems without doing manual tedious work."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/blogs/d6tjoin-blogs/examples-top1.ipynb b/blogs/d6tjoin-blogs/examples-top1.ipynb
new file mode 100644
index 0000000..e3494e6
--- /dev/null
+++ b/blogs/d6tjoin-blogs/examples-top1.ipynb
@@ -0,0 +1,1553 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Data Engineering in Python with databolt - Fuzzy Joins (d6tlib/d6tjoin.utils)\n",
+ "\n",
+ "## Introduction\n",
+ "\n",
+ "Joining datasets is a common data engineering operation. However, often there are problems merging datasets from different sources because of mismatched identifiers, date conventions etc. \n",
+ "\n",
+ "** `d6tjoin.top1` module allows you to quickly join datasets even if they don't perfectly match. **\n",
+ "Easily join different datasets without writing custom code. Does fuzzy top1 similarity joins for strings, dates and numbers, for example you can quickly join similar but not identical stock tickers, addresses, names without manual processing. It will find the top 1 matched entry from the right dataframe to join onto the left dataframe.\n",
+ "\n",
+ "Here are some examples which show you how to:\n",
+ "1. join on mismatched identifiers\n",
+ "2. join on calendar vs business dates\n",
+ "3. join on both mismatched dates and identifiers"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " date | \n",
+ " id | \n",
+ " v | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 2010-01-01 | \n",
+ " e3e70682 | \n",
+ " 0.526 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 2010-01-01 | \n",
+ " f728b4fa | \n",
+ " 0.760 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2010-01-01 | \n",
+ " eb1167b3 | \n",
+ " 0.385 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 2010-01-01 | \n",
+ " f7c1bd87 | \n",
+ " 0.741 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 2010-01-01 | \n",
+ " e443df78 | \n",
+ " 0.397 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " date id v\n",
+ "0 2010-01-01 e3e70682 0.526\n",
+ "1 2010-01-01 f728b4fa 0.760\n",
+ "2 2010-01-01 eb1167b3 0.385\n",
+ "3 2010-01-01 f7c1bd87 0.741\n",
+ "4 2010-01-01 e443df78 0.397"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import itertools\n",
+ "from faker import Faker\n",
+ "import importlib\n",
+ "\n",
+ "import d6tjoin.top1\n",
+ "importlib.reload(d6tjoin.top1)\n",
+ "import d6tjoin.utils\n",
+ "\n",
+ "# *******************************************************\n",
+ "# generate sample time series data with id and value\n",
+ "# *******************************************************\n",
+ "nobs = 10\n",
+ "Faker.seed(0)\n",
+ "f1 = Faker()\n",
+ "uuid1 = [str(f1.uuid4()).split('-')[0] for _ in range(nobs)]\n",
+ "dates1 = pd.date_range('1/1/2010','1/1/2011')\n",
+ "\n",
+ "df1 = pd.DataFrame(list(itertools.product(dates1,uuid1)),columns=['date','id'])\n",
+ "df1['v']=np.round(np.random.sample(df1.shape[0]),3)\n",
+ "df1.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Example 1: join datasets on misaligned ids\n",
+ "\n",
+ "When joining data from different sources, eg different vendors, often your ids don't match perfect and then you need to manually analyze the situation. With databolt this becomes much easier.\n",
+ "\n",
+ "Let's create another dataset where the `id` is slightly different."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " date | \n",
+ " id | \n",
+ " v | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 2010-01-01 | \n",
+ " 3e7068 | \n",
+ " 0.526 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 2010-01-01 | \n",
+ " 728b4f | \n",
+ " 0.760 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2010-01-01 | \n",
+ " b1167b | \n",
+ " 0.385 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 2010-01-01 | \n",
+ " 7c1bd8 | \n",
+ " 0.741 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 2010-01-01 | \n",
+ " 443df7 | \n",
+ " 0.397 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " date id v\n",
+ "0 2010-01-01 3e7068 0.526\n",
+ "1 2010-01-01 728b4f 0.760\n",
+ "2 2010-01-01 b1167b 0.385\n",
+ "3 2010-01-01 7c1bd8 0.741\n",
+ "4 2010-01-01 443df7 0.397"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# create mismatch\n",
+ "df2 = df1.copy()\n",
+ "df2['id'] = df1['id'].str[1:-1]\n",
+ "df2.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`d6tjoin.Prejoin().match_quality()` shows you there is none of `id` match so a normal join won't work well."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " key left key right all matched inner left right outer unmatched total unmatched left unmatched right\n",
+ "0 id id False 0 10 10 20 20 10 10\n",
+ "1 date date True 366 366 366 366 0 0 0\n",
+ "2 __all__ __all__ False 0 3660 3660 7320 7320 3660 3660\n"
+ ]
+ }
+ ],
+ "source": [
+ "d6tjoin.Prejoin([df1,df2],['id','date']).match_quality()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Using `d6tjoin.top1.MergeTop1()` you can quickly merge this dataset without having to do any manual processing. It will find the closest matching id using the Levenstein string similarity metric. We want to look at the closest id by date so we will pass in date as an exact match key."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "result = d6tjoin.top1.MergeTop1(df1.head(),df2,fuzzy_left_on=['id'],fuzzy_right_on=['id'],exact_left_on=['date'],exact_right_on=['date']).merge()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Lets check what matches it found. Looking at the top1 match table, it shows the closest string with only 2 character difference in id, meaning it found the correct substring. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " date | \n",
+ " __top1left__ | \n",
+ " __top1right__ | \n",
+ " __top1diff__ | \n",
+ " __matchtype__ | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 20 | \n",
+ " 2010-01-01 | \n",
+ " e3e70682 | \n",
+ " 3e7068 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 14 | \n",
+ " 2010-01-01 | \n",
+ " e443df78 | \n",
+ " 443df7 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 42 | \n",
+ " 2010-01-01 | \n",
+ " eb1167b3 | \n",
+ " b1167b | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 31 | \n",
+ " 2010-01-01 | \n",
+ " f728b4fa | \n",
+ " 728b4f | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 2010-01-01 | \n",
+ " f7c1bd87 | \n",
+ " 7c1bd8 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " date __top1left__ __top1right__ __top1diff__ __matchtype__\n",
+ "20 2010-01-01 e3e70682 3e7068 2 top1 left\n",
+ "14 2010-01-01 e443df78 443df7 2 top1 left\n",
+ "42 2010-01-01 eb1167b3 b1167b 2 top1 left\n",
+ "31 2010-01-01 f728b4fa 728b4f 2 top1 left\n",
+ "3 2010-01-01 f7c1bd87 7c1bd8 2 top1 left"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['top1']['id']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Since the match results look good, you can use the merged dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " date | \n",
+ " id | \n",
+ " v | \n",
+ " id_right | \n",
+ " v_right | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 2010-01-01 | \n",
+ " e3e70682 | \n",
+ " 0.526 | \n",
+ " 3e7068 | \n",
+ " 0.526 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 2010-01-01 | \n",
+ " f728b4fa | \n",
+ " 0.760 | \n",
+ " 728b4f | \n",
+ " 0.760 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2010-01-01 | \n",
+ " eb1167b3 | \n",
+ " 0.385 | \n",
+ " b1167b | \n",
+ " 0.385 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 2010-01-01 | \n",
+ " f7c1bd87 | \n",
+ " 0.741 | \n",
+ " 7c1bd8 | \n",
+ " 0.741 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 2010-01-01 | \n",
+ " e443df78 | \n",
+ " 0.397 | \n",
+ " 443df7 | \n",
+ " 0.397 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " date id v id_right v_right\n",
+ "0 2010-01-01 e3e70682 0.526 3e7068 0.526\n",
+ "1 2010-01-01 f728b4fa 0.760 728b4f 0.760\n",
+ "2 2010-01-01 eb1167b3 0.385 b1167b 0.385\n",
+ "3 2010-01-01 f7c1bd87 0.741 7c1bd8 0.741\n",
+ "4 2010-01-01 e443df78 0.397 443df7 0.397"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['merged'].head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "assert not result['duplicates']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Example 2: join 2 datasets with misaligned dates\n",
+ "\n",
+ "As another example, instead of the ids not matching, lets look at an example where the dates don't match. We will look at calendar vs business month end dates."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dates2 = pd.bdate_range('1/1/2010','1/1/2011') # business instead of calendar dates\n",
+ "df2 = pd.DataFrame(list(itertools.product(dates2,uuid1)),columns=['date','id'])\n",
+ "df2['v']=np.round(np.random.sample(df2.shape[0]),3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`d6tjoin.Prejoin()` shows some but not all of the dates match. All the ids match."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " key left key right all matched inner left right outer unmatched total unmatched left unmatched right\n",
+ "0 id id True 10 10 10 10 0 0 0\n",
+ "1 date date False 261 366 261 366 105 105 0\n",
+ "2 __all__ __all__ False 2610 3660 2610 3660 1050 1050 0\n"
+ ]
+ }
+ ],
+ "source": [
+ "d6tjoin.Prejoin([df1,df2],['id','date']).match_quality()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "So we want to do a fuzzy match on dates but have the id match perfectly."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "result = d6tjoin.top1.MergeTop1(df1,df2,fuzzy_left_on=['date'],fuzzy_right_on=['date'],exact_left_on=['id'],exact_right_on=['id']).merge()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Again lets check if the fuzzy matches are correct. If either matches or is off by a day most, looks good!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " id | \n",
+ " __top1left__ | \n",
+ " __top1right__ | \n",
+ " __top1diff__ | \n",
+ " __matchtype__ | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 1846d424 | \n",
+ " 2010-01-01 | \n",
+ " 2010-01-01 | \n",
+ " 0 days | \n",
+ " exact | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " eb1167b3 | \n",
+ " 2010-01-01 | \n",
+ " 2010-01-01 | \n",
+ " 0 days | \n",
+ " exact | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " e443df78 | \n",
+ " 2010-01-01 | \n",
+ " 2010-01-01 | \n",
+ " 0 days | \n",
+ " exact | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " id __top1left__ __top1right__ __top1diff__ __matchtype__\n",
+ "0 1846d424 2010-01-01 2010-01-01 0 days exact\n",
+ "1 eb1167b3 2010-01-01 2010-01-01 0 days exact\n",
+ "2 e443df78 2010-01-01 2010-01-01 0 days exact"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['top1']['date'].head(3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " id | \n",
+ " __top1left__ | \n",
+ " __top1right__ | \n",
+ " __top1diff__ | \n",
+ " __matchtype__ | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 3657 | \n",
+ " 1846d424 | \n",
+ " 2011-01-01 | \n",
+ " 2010-12-31 | \n",
+ " 1 days | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 3658 | \n",
+ " f7c1bd87 | \n",
+ " 2011-01-01 | \n",
+ " 2010-12-31 | \n",
+ " 1 days | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 3659 | \n",
+ " fcbd04c3 | \n",
+ " 2011-01-01 | \n",
+ " 2010-12-31 | \n",
+ " 1 days | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " id __top1left__ __top1right__ __top1diff__ __matchtype__\n",
+ "3657 1846d424 2011-01-01 2010-12-31 1 days top1 left\n",
+ "3658 f7c1bd87 2011-01-01 2010-12-31 1 days top1 left\n",
+ "3659 fcbd04c3 2011-01-01 2010-12-31 1 days top1 left"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['top1']['date'].tail(3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Timedelta('1 days 00:00:00')"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['top1']['date']['__top1diff__'].max()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Again with very little effort we were able to join this dataset together."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " date | \n",
+ " id | \n",
+ " v | \n",
+ " date_right | \n",
+ " v_right | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 2010-01-01 | \n",
+ " e3e70682 | \n",
+ " 0.526 | \n",
+ " 2010-01-01 | \n",
+ " 0.467 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 2010-01-02 | \n",
+ " e3e70682 | \n",
+ " 0.845 | \n",
+ " 2010-01-01 | \n",
+ " 0.467 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2010-01-01 | \n",
+ " f728b4fa | \n",
+ " 0.760 | \n",
+ " 2010-01-01 | \n",
+ " 0.855 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 2010-01-02 | \n",
+ " f728b4fa | \n",
+ " 0.506 | \n",
+ " 2010-01-01 | \n",
+ " 0.855 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 2010-01-01 | \n",
+ " eb1167b3 | \n",
+ " 0.385 | \n",
+ " 2010-01-01 | \n",
+ " 0.485 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " date id v date_right v_right\n",
+ "0 2010-01-01 e3e70682 0.526 2010-01-01 0.467\n",
+ "1 2010-01-02 e3e70682 0.845 2010-01-01 0.467\n",
+ "2 2010-01-01 f728b4fa 0.760 2010-01-01 0.855\n",
+ "3 2010-01-02 f728b4fa 0.506 2010-01-01 0.855\n",
+ "4 2010-01-01 eb1167b3 0.385 2010-01-01 0.485"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['merged'].head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Example 3: join 2 datasets with misaligned dates AND ids\n",
+ "\n",
+ "In the final example, we combine the above cases. None of the ids match and some of the dates are mismatched. As before with little manual effort we are able to correctly merge the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dates2 = pd.bdate_range('1/1/2010','1/1/2011') # business instead of calendar dates\n",
+ "df2 = pd.DataFrame(list(itertools.product(dates2,uuid1)),columns=['date','id'])\n",
+ "df2['v']=np.round(np.random.sample(df2.shape[0]),3)\n",
+ "df2['id'] = df2['id'].str[1:-1]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " key left key right all matched inner left right outer unmatched total unmatched left unmatched right\n",
+ "0 id id False 0 10 10 20 20 10 10\n",
+ "1 date date False 261 366 261 366 105 105 0\n",
+ "2 __all__ __all__ False 0 3660 2610 6270 6270 3660 2610\n"
+ ]
+ }
+ ],
+ "source": [
+ "d6tjoin.Prejoin([df1,df2],['id','date']).match_quality()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "result = d6tjoin.top1.MergeTop1(df1,df2,['date','id'],['date','id']).merge()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " date | \n",
+ " id | \n",
+ " v | \n",
+ " date_right | \n",
+ " id_right | \n",
+ " v_right | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 2010-01-01 | \n",
+ " e3e70682 | \n",
+ " 0.526 | \n",
+ " 2010-01-01 | \n",
+ " 3e7068 | \n",
+ " 0.695 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 2010-01-02 | \n",
+ " e3e70682 | \n",
+ " 0.845 | \n",
+ " 2010-01-01 | \n",
+ " 3e7068 | \n",
+ " 0.695 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2010-01-01 | \n",
+ " f728b4fa | \n",
+ " 0.760 | \n",
+ " 2010-01-01 | \n",
+ " 728b4f | \n",
+ " 0.891 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 2010-01-02 | \n",
+ " f728b4fa | \n",
+ " 0.506 | \n",
+ " 2010-01-01 | \n",
+ " 728b4f | \n",
+ " 0.891 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 2010-01-01 | \n",
+ " eb1167b3 | \n",
+ " 0.385 | \n",
+ " 2010-01-01 | \n",
+ " b1167b | \n",
+ " 0.424 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " date id v date_right id_right v_right\n",
+ "0 2010-01-01 e3e70682 0.526 2010-01-01 3e7068 0.695\n",
+ "1 2010-01-02 e3e70682 0.845 2010-01-01 3e7068 0.695\n",
+ "2 2010-01-01 f728b4fa 0.760 2010-01-01 728b4f 0.891\n",
+ "3 2010-01-02 f728b4fa 0.506 2010-01-01 728b4f 0.891\n",
+ "4 2010-01-01 eb1167b3 0.385 2010-01-01 b1167b 0.424"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['merged'].head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " __top1left__ | \n",
+ " __top1right__ | \n",
+ " __top1diff__ | \n",
+ " __matchtype__ | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 361 | \n",
+ " 2010-12-28 | \n",
+ " 2010-12-28 | \n",
+ " 0 days | \n",
+ " exact | \n",
+ "
\n",
+ " \n",
+ " 362 | \n",
+ " 2010-12-29 | \n",
+ " 2010-12-29 | \n",
+ " 0 days | \n",
+ " exact | \n",
+ "
\n",
+ " \n",
+ " 363 | \n",
+ " 2010-12-30 | \n",
+ " 2010-12-30 | \n",
+ " 0 days | \n",
+ " exact | \n",
+ "
\n",
+ " \n",
+ " 364 | \n",
+ " 2010-12-31 | \n",
+ " 2010-12-31 | \n",
+ " 0 days | \n",
+ " exact | \n",
+ "
\n",
+ " \n",
+ " 365 | \n",
+ " 2011-01-01 | \n",
+ " 2010-12-31 | \n",
+ " 1 days | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " __top1left__ __top1right__ __top1diff__ __matchtype__\n",
+ "361 2010-12-28 2010-12-28 0 days exact\n",
+ "362 2010-12-29 2010-12-29 0 days exact\n",
+ "363 2010-12-30 2010-12-30 0 days exact\n",
+ "364 2010-12-31 2010-12-31 0 days exact\n",
+ "365 2011-01-01 2010-12-31 1 days top1 left"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['top1']['date'].tail()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " __top1right__date | \n",
+ " __top1left__ | \n",
+ " __top1right__ | \n",
+ " __top1diff__ | \n",
+ " __matchtype__ | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 19836 | \n",
+ " 2010-01-01 | \n",
+ " 1846d424 | \n",
+ " 846d42 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 24795 | \n",
+ " 2010-01-01 | \n",
+ " 23a7711a | \n",
+ " 3a7711 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 18009 | \n",
+ " 2010-01-01 | \n",
+ " 259f4329 | \n",
+ " 59f432 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 9918 | \n",
+ " 2010-01-01 | \n",
+ " b4862b21 | \n",
+ " 4862b2 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 13050 | \n",
+ " 2010-01-01 | \n",
+ " e3e70682 | \n",
+ " 3e7068 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " __top1right__date __top1left__ __top1right__ __top1diff__ __matchtype__\n",
+ "19836 2010-01-01 1846d424 846d42 2 top1 left\n",
+ "24795 2010-01-01 23a7711a 3a7711 2 top1 left\n",
+ "18009 2010-01-01 259f4329 59f432 2 top1 left\n",
+ "9918 2010-01-01 b4862b21 4862b2 2 top1 left\n",
+ "13050 2010-01-01 e3e70682 3e7068 2 top1 left"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['top1']['id'].head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": true
+ },
+ "source": [
+ "# Advanced Usage Options"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Passing a difference limit\n",
+ "By default every record in the left dataframe will be matched with a record in the right dataframe. Sometimes the difference is too large though to be considered a match. You can control this by passing the `top_limit` parameter."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dates2 = pd.bdate_range('1/1/2010','1/1/2011') # business instead of calendar dates\n",
+ "df2 = pd.DataFrame(list(itertools.product(dates2,uuid1[:-2])),columns=['date','id'])\n",
+ "df2['v']=np.random.sample(df2.shape[0])\n",
+ "df2['id'] = df2['id'].str[1:-1]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " __top1right__date | \n",
+ " __top1left__ | \n",
+ " __top1right__ | \n",
+ " __top1diff__ | \n",
+ " __matchtype__ | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 16182 | \n",
+ " 2010-01-01 | \n",
+ " 1846d424 | \n",
+ " 846d42 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 20097 | \n",
+ " 2010-01-01 | \n",
+ " 23a7711a | \n",
+ " 3a7711 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 14094 | \n",
+ " 2010-01-01 | \n",
+ " 259f4329 | \n",
+ " 846d42 | \n",
+ " 6 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 6786 | \n",
+ " 2010-01-01 | \n",
+ " b4862b21 | \n",
+ " b1167b | \n",
+ " 5 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 7830 | \n",
+ " 2010-01-01 | \n",
+ " b4862b21 | \n",
+ " 846d42 | \n",
+ " 5 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " __top1right__date __top1left__ __top1right__ __top1diff__ __matchtype__\n",
+ "16182 2010-01-01 1846d424 846d42 2 top1 left\n",
+ "20097 2010-01-01 23a7711a 3a7711 2 top1 left\n",
+ "14094 2010-01-01 259f4329 846d42 6 top1 left\n",
+ "6786 2010-01-01 b4862b21 b1167b 5 top1 left\n",
+ "7830 2010-01-01 b4862b21 846d42 5 top1 left"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result = d6tjoin.top1.MergeTop1(df1,df2,['date','id'],['date','id']).merge()\n",
+ "result['top1']['id'].head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We have some correct matches but also some bad matches with `__top1diff__`>2. We will restrict `top_limit` to be at most 2."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "result = d6tjoin.top1.MergeTop1(df1,df2,['date','id'],['date','id'], top_limit=[None,2]).merge()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " __top1right__date | \n",
+ " __top1left__ | \n",
+ " __top1right__ | \n",
+ " __top1diff__ | \n",
+ " __matchtype__ | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 16182 | \n",
+ " 2010-01-01 | \n",
+ " 1846d424 | \n",
+ " 846d42 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 20097 | \n",
+ " 2010-01-01 | \n",
+ " 23a7711a | \n",
+ " 3a7711 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 10440 | \n",
+ " 2010-01-01 | \n",
+ " e3e70682 | \n",
+ " 3e7068 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 17748 | \n",
+ " 2010-01-01 | \n",
+ " e443df78 | \n",
+ " 443df7 | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 2610 | \n",
+ " 2010-01-01 | \n",
+ " eb1167b3 | \n",
+ " b1167b | \n",
+ " 2 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " __top1right__date __top1left__ __top1right__ __top1diff__ __matchtype__\n",
+ "16182 2010-01-01 1846d424 846d42 2 top1 left\n",
+ "20097 2010-01-01 23a7711a 3a7711 2 top1 left\n",
+ "10440 2010-01-01 e3e70682 3e7068 2 top1 left\n",
+ "17748 2010-01-01 e443df78 443df7 2 top1 left\n",
+ "2610 2010-01-01 eb1167b3 b1167b 2 top1 left"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['top1']['id'].head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Passing a custom difference function\n",
+ "By default string matches are done using Levenstein edit distance. You can pass a custom function using `fun_diff`. For example lets pass Hamming distance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import jellyfish\n",
+ "result = d6tjoin.top1.MergeTop1(df1,df2,['date','id'],['date','id'], fun_diff=[None,jellyfish.hamming_distance]).merge()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " __top1right__date | \n",
+ " __top1left__ | \n",
+ " __top1right__ | \n",
+ " __top1diff__ | \n",
+ " __matchtype__ | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 15138 | \n",
+ " 2010-01-01 | \n",
+ " 1846d424 | \n",
+ " b1167b | \n",
+ " 7 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 15399 | \n",
+ " 2010-01-01 | \n",
+ " 1846d424 | \n",
+ " 7c1bd8 | \n",
+ " 7 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 20097 | \n",
+ " 2010-01-01 | \n",
+ " 23a7711a | \n",
+ " 3a7711 | \n",
+ " 6 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 12789 | \n",
+ " 2010-01-01 | \n",
+ " 259f4329 | \n",
+ " 728b4f | \n",
+ " 7 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ " 14094 | \n",
+ " 2010-01-01 | \n",
+ " 259f4329 | \n",
+ " 846d42 | \n",
+ " 7 | \n",
+ " top1 left | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " __top1right__date __top1left__ __top1right__ __top1diff__ __matchtype__\n",
+ "15138 2010-01-01 1846d424 b1167b 7 top1 left\n",
+ "15399 2010-01-01 1846d424 7c1bd8 7 top1 left\n",
+ "20097 2010-01-01 23a7711a 3a7711 6 top1 left\n",
+ "12789 2010-01-01 259f4329 728b4f 7 top1 left\n",
+ "14094 2010-01-01 259f4329 846d42 7 top1 left"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "result['top1']['id'].head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/blogs/d6tjoin-blogs/pic/.DS_Store b/blogs/d6tjoin-blogs/pic/.DS_Store
new file mode 100644
index 0000000..93c5932
Binary files /dev/null and b/blogs/d6tjoin-blogs/pic/.DS_Store differ
diff --git a/blogs/d6tjoin-blogs/pic/1attempt_result.png b/blogs/d6tjoin-blogs/pic/1attempt_result.png
new file mode 100644
index 0000000..f813b21
Binary files /dev/null and b/blogs/d6tjoin-blogs/pic/1attempt_result.png differ
diff --git a/blogs/d6tjoin-blogs/pic/1attempt_ticker_match_quality.png b/blogs/d6tjoin-blogs/pic/1attempt_ticker_match_quality.png
new file mode 100644
index 0000000..a3e2921
Binary files /dev/null and b/blogs/d6tjoin-blogs/pic/1attempt_ticker_match_quality.png differ
diff --git a/blogs/d6tjoin-blogs/pic/2attempt_result.png b/blogs/d6tjoin-blogs/pic/2attempt_result.png
new file mode 100644
index 0000000..36e1cd6
Binary files /dev/null and b/blogs/d6tjoin-blogs/pic/2attempt_result.png differ
diff --git a/blogs/d6tjoin-blogs/pic/2attempt_ticker_match_qualtiy.png b/blogs/d6tjoin-blogs/pic/2attempt_ticker_match_qualtiy.png
new file mode 100644
index 0000000..fdba220
Binary files /dev/null and b/blogs/d6tjoin-blogs/pic/2attempt_ticker_match_qualtiy.png differ
diff --git a/blogs/d6tjoin-blogs/pic/3attempt_date_match_quality.png b/blogs/d6tjoin-blogs/pic/3attempt_date_match_quality.png
new file mode 100644
index 0000000..c6826d9
Binary files /dev/null and b/blogs/d6tjoin-blogs/pic/3attempt_date_match_quality.png differ
diff --git a/blogs/d6tjoin-blogs/pic/3attempt_result.png b/blogs/d6tjoin-blogs/pic/3attempt_result.png
new file mode 100644
index 0000000..1ffff1f
Binary files /dev/null and b/blogs/d6tjoin-blogs/pic/3attempt_result.png differ
diff --git a/blogs/d6tjoin-blogs/pic/df_price.png b/blogs/d6tjoin-blogs/pic/df_price.png
new file mode 100644
index 0000000..e284b4e
Binary files /dev/null and b/blogs/d6tjoin-blogs/pic/df_price.png differ
diff --git a/blogs/d6tjoin-blogs/pic/df_score.png b/blogs/d6tjoin-blogs/pic/df_score.png
new file mode 100644
index 0000000..1a0f1e3
Binary files /dev/null and b/blogs/d6tjoin-blogs/pic/df_score.png differ
diff --git a/blogs/d6tjoin-blogs/pic/match_quality.png b/blogs/d6tjoin-blogs/pic/match_quality.png
new file mode 100644
index 0000000..dab4b18
Binary files /dev/null and b/blogs/d6tjoin-blogs/pic/match_quality.png differ
diff --git a/blogs/unit-test-your-data-solution.md b/blogs/unit-test-your-data-solution.md
new file mode 100644
index 0000000..231a434
--- /dev/null
+++ b/blogs/unit-test-your-data-solution.md
@@ -0,0 +1,63 @@
+# Unit Test Your Data Pipeline, You Will Thank Yourself Later
+One common mistake that data scientists, especially beginners, make is not writing unit tests. Data scientists sometimes argue that unit testing is not applicable because there is no correct answer that can be known ahead of time to test against. However, most data science projects start with data transformation. Even if you cannot test the model output, you should at least test that the inputs are correct. Compared to the time you invest in writing them, a few simple unit tests will save you much more time later, especially when working on large projects or big data.
+
+Coauthored with [Haijing Li](https://www.linkedin.com/in/haijing-li-7b50a11b2/), Data Analyst in Financial Services, MS Business Analytics@Columbia University.
+## Benefits of Unit Testing
+* Detect bugs earlier: Running big data projects is time consuming. You don't want to get an unexpected output after a 3-hour run when you could have easily avoided it.
+* Easier to update code: You will no longer be afraid of changing your code, because you know what to expect and can easily tell what is broken when it breaks.
+* Pushes you toward better-structured code: You will write cleaner code and prefer DAGs over linearly chained functions when you keep in mind that you are going to test isolated pieces of it. (Use [d6tflow](https://github.com/d6t/d6tflow) to build data science workflows easily.)
+* Gives you confidence in the outputs: Bad data leads to bad decisions. Running unit tests gives you confidence in data quality. You know your code outputs what you want it to output.
+
+## Pytest
+
+To improve testing efficiency, use Pytest. If you are looking for tutorials on Pytest, I recommend Dane Hillard's post [Effective Python Testing With Pytest](https://realpython.com/pytest-python-testing/). In his post you will find out how to use basic and advanced Pytest features.
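+
+As a minimal sketch (the file and function names below are hypothetical), a test module might look like this, run with `pytest test_transform.py`:
+```
+# test_transform.py
+import pandas as pd
+
+def drop_missing_ids(df):
+    return df.dropna(subset=['id'])
+
+def test_drop_missing_ids():
+    df = pd.DataFrame({'id': [1, 2, None], 'v': [0.1, 0.2, 0.3]})
+    out = drop_missing_ids(df)
+    assert out['id'].isna().sum() == 0
+    assert out.shape[0] == 2
+```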
+
+## Unit Testing for Data Science
+Depending on your project, what you want to check with unit testing will differ. But there are some common tests you will want to run for data science solutions.
+#### 1. Missing values
+```
+# catch missing values
+assert df['column'].isna().sum()<1
+```
+#### 2. Duplicates
+```
+# check there are no duplicates
+assert len(df['id'].unique())==df.shape[0]
+assert df.groupby(['date','id']).size().max()==1
+```
+#### 3. Shapes
+```
+# do we have data for all ids?
+assert df['id'].unique().shape[0] == len(ids)
+
+# do function returns have the shapes we expect?
+assert all([some_function(df).shape == df.shape for df in dfs])
+```
+#### 4. Value Ranges
+```
+# per-date percentages sum to 1
+assert (df.groupby('date')['percentage'].sum()==1).all()
+# each percentage is at most 1
+assert (df['percentage']<=1).all()
+# budgets stay within the per-name cap
+assert (df.groupby('name')['budget'].max()<=1000).all()
+```
+#### 5. Join Quality
+[d6tjoin](https://github.com/d6t/d6tjoin) has checks for join quality.
+```
+assert d6tjoin.Prejoin([df1,df2],['date','id']).is_all_matched()
+```
+#### 6. Preprocess Functions
+```
+assert preprocess_function("name\t10019\n")==["name",10019]
+assert preprocess_missing_name("10019\n") is None
+assert preprocess_text("Do you Realize those are genetically modified food?") == ["you","realize","gene","modify","food"]
+```
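+
+These asserts call your own preprocessing helpers. As a hedged sketch (a hypothetical implementation, not the author's actual code), a `preprocess_function` consistent with the first assert might look like:
+```
+def preprocess_function(line):
+    # hypothetical parser: split a tab-delimited record and cast the id to int
+    name, raw_id = line.strip().split("\t")
+    return [name, int(raw_id)]
+
+assert preprocess_function("name\t10019\n") == ["name", 10019]
+```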
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file