Magellan: Toward Building Entity Matching Management Systems



Pradap Konda (1), Sanjib Das (1), Paul Suganthan G.C. (1), AnHai Doan (1), Adel Ardalan (1), Jeffrey R. Ballard (1), Han Li (1), Fatemah Panahi (1), Haojun Zhang (1), Jeff Naughton (1), Shishir Prasad (3), Ganesh Krishnan (2), Rohit Deep (2), Vijay Raghavendra (2)
(1) University of Wisconsin-Madison, (2) WalmartLabs, (3) Instacart

ABSTRACT
Entity matching (EM) has been a long

The Magellan Solution: To address these limitations, we describe Magellan, a new kind of EM system currently being developed at UW-Madison in collaboration with WalmartLabs. Magellan (named after Ferdinand Magellan, the first end-to-end explorer of the globe) is novel in several important aspects.

First, Magellan provides how-to guides that tell users what to do in each EM scenario, step by step. Second, Magellan provides tools that help users do these steps. These tools seek to cover the entire EM pipeline (e.g., debugging, sampling), not just the matching and blocking steps.

Third, the tools are being built on top of the Python data analysis and Big Data stacks. Specifically, we propose that users solve an EM scenario in two stages. In the development stage, users find an accurate EM workflow using data samples. Then, in the production stage, users execute this workflow on the entirety of data. We observe that the development stage basically performs data analysis. So we develop tools for this stage on top of the well-known Python data analysis stack, which provides a rich set of tools such as pandas, scikit-learn, matplotlib, etc. Similarly, we develop tools for the production stage on top of the Python Big Data stack (e.g., Pydoop, mrjob, PySpark, etc.).

Thus, Magellan is well integrated with the Python data eco-system, allowing users to easily exploit a wide range of techniques in learning, mining, visualization, IE, etc.

Finally, an added benefit of integration with Python is that Magellan is situated in a powerful interactive scripting environment that users can use to prototype code to patch the system.

Challenges: Realizing the above novelties raises major challenges. First, it turns out that developing effective how-to guides, even for very simple EM scenarios such as applying supervised learning to match, is already quite difficult and complex, as we will show in Section 4.

Second, developing tools to support these guides is equally difficult. In particular, current EM work may have dismissed many steps in the EM pipeline as engineering. But here we show that many such steps (e.g., loading the data, sampling and labeling, debugging, etc.) do raise difficult research challenges.

Finally, while most current EM systems are stand-alone monoliths, Magellan is designed to be placed within an eco-system and is expected to play well with others (e.g., other Python packages). We distinguish this by saying that current EM systems are closed-world systems, whereas Magellan is an open-world system, because it relies on many other systems in the eco-system in order to provide the fullest amount of support to the user doing EM. It turns out that building open-world systems raises non-trivial challenges, such as designing the right data structures and managing metadata, as we discuss in Section 5.

In this paper we have taken the first steps in addressing the above challenges. We have also built and evaluated Magellan 0.1 in several real-world settings (e.g., at WalmartLabs, Johnson Control Inc., Marshfield Clinic) and in data science classes at UW-Madison. In summary, we make the following contributions:

  • We argue that far more efforts should be devoted to building EM systems, to significantly advance the field.
  • We discuss four limitations that prevent current EM systems from being used extensively in practice.
  • We describe the Magellan system, which is novel in several important aspects: how-to guides, tools to support all steps of the EM pipeline, tight integration with the Python data eco-system, easy access to an interactive scripting environment, and open-world vs. closed-world systems.
  • We describe significant challenges in realizing Magellan, including the novel challenge of designing open-world systems that operate in an eco-system.
  • We describe extensive experiments with 44 students and real users at various organizations that show the utility of Magellan, including improving the accuracy of an EM system in production.

This paper describes the most important aspects of Magellan, deferring details to [22]. Magellan will be released at sites.google.com/site/anhaidgroup/projects/magellan in Summer 2016, to serve research, development, and practical uses. Finally, the ideas underlying Magellan can potentially be applied to other types of DI problems (e.g., IE, schema matching, data cleaning, etc.), and an effort has been started to explore this direction and to foster an eco-system of open-source DI tools (see Magellan's website).

2 THE CASE FOR ENTITY MATCHING MANAGEMENT SYSTEMS

2.1 Entity Matching

This problem, also known as record linkage, data matching, etc., has received much attention in the past few decades [11, 16]. A common EM scenario finds all tuple pairs (a, b) that match, i.e., refer to the same real-world entity, between two tables A and B (see Figure 1). Other EM scenarios include matching tuples within a single table, matching into a knowledge base, matching XML data, etc. [11].

Most EM works have developed matching algorithms, exploiting rules, learning, clustering, crowdsourcing, among others [11, 16]. The focus is on improving the matching accuracy and reducing costs (e.g., run time). Trying to match all pairs in A × B often takes very long. So users often employ heuristics to remove obviously non-matched pairs (e.g., products with different colors), in a step called blocking, before matching the remaining pairs. Several works have studied this step, focusing on scaling it up to large amounts of data (see Section 7).

Table A
      Name              City        State
a1    Dave Smith        Madison     WI
a2    Joe Wilson        San Jose    CA
a3    Dan Smith         Middleton   WI

Table B
      Name              City        State
b1    David D. Smith    Madison     WI
b2    Daniel W. Smith   Middleton   WI

Matches: (a1, b1), (a3, b2)

Figure 1: An example of matching two tables.

2.2 Current Entity Matching Systems

In contrast to the extensive effort on matching algorithms (e.g., 96 papers were published on this topic in 2009-2014 alone, in SIGMOD, VLDB, ICDE, KDD, and WWW), there has been relatively little work on building EM systems. As of 2016 we counted 18 major non-commercial systems (e.g., D-Dupe, DuDe, Febrl, Dedoop, Nadeef) and 15 major commercial ones (e.g., Tamr, Data Ladder, IBM InfoSphere) [11]. Our examination of these systems (see [22]) reveals the following four major problems.

1. Systems Do Not Cover the Entire EM Pipeline: When performing EM, users often must execute many steps, e.g., blocking, matching, exploration, cleaning, extraction (IE), debugging, sampling, labeling, etc. Current systems provide support for only a few steps in this pipeline, while ignoring less well-known yet equally critical steps.

For example, all 33 systems that we have examined provide support for blocking and matching. Twenty systems provide limited support for data exploration and cleaning. There is no meaningful support for any other steps (e.g., debugging, sampling, etc.). Even for blocking, the systems merely provide a set of blockers that users can call; there is no support for selecting and debugging blockers, or for combining multiple blockers.

2. Difficult to Exploit a Wide Range of Techniques: Practical EM often requires a wide range of techniques, e.g., learning, mining, visualization, data cleaning, IE, SQL querying, crowdsourcing, keyword search, etc. For example, to improve matching accuracy, a user may want to clean the values of attribute Publisher in a table, or extract brand names from Product Title, or build a histogram for Price. The user may also want to build a matcher that uses learning, crowdsourcing, or some statistical techniques.

Current EM systems do not provide enough support for these techniques, and there is no easy way to do so. Incorporating all such techniques into a single system is extremely difficult. But the alternate solution of just moving data among a current EM system and systems that do cleaning, IE, visualization, etc. is also difficult and time consuming. A fundamental reason is that most current EM systems are stand-alone monoliths that are not designed from scratch to play well with other systems. For example, many current EM systems were written in C, C++, C#, and Java, using proprietary data structures. Since EM is often iterative, we need to repeatedly move data among these EM systems and the cleaning, IE, etc. systems. But this requires repeated reading and writing of data to disk, followed by complicated data conversion.

3. Difficult to Write Code to Patch the System: In practice, users often have to write code, either to implement a lacking functionality (e.g., to extract product weights, or to clean the dates) or to tie together system components. It is difficult to write such code correctly in one shot. Thus, ideally such coding should be done using an interactive scripting environment, to enable rapid prototyping and iteration. This code often needs access to the rest of the system, so ideally the system should be in such an environment too. Unfortunately, only 5 out of 33 systems provide such settings (using Python and R).

4. Little Guidance for Users on How to Match: In our experience this is by far the most serious problem with using current EM systems in practice. In many EM scenarios users simply do not know what to do: how to start, what to do next. Interestingly, even the simple task of taking a sample and labeling it (to train a learning-based matcher) can be quite complicated in practice, as we show in Section 4.3. Thus, it is not enough to just build a system consisting of a set of tools. It is also critical to provide step-by-step guidance to users on how to use the tools to handle a particular EM scenario. No EM system that we have examined provides such guidance.

2.3 Entity Matching Management Systems

To address the above limitations, we propose to build a new kind of EM system. In contrast to current EM systems, which mostly provide a set of implemented matchers and blockers, these new systems are far more advanced.

First and foremost, they seek to handle a wide variety of EM scenarios. These scenarios can use very different EM workflows. So it is difficult to build a single system to handle all EM scenarios. Instead, we should build a set of systems, each handling a well-defined set of similar EM scenarios. Each system should target the following goals:

1. How-to Guide: Users will have to be in the loop. So it is critical that the system provides a how-to guide that tells users what to do and how to do it.

2. User Burden: The system should minimize the user burden. It should provide a rich set of tools to help users easily do each EM step, and do so for all steps of the EM pipeline, not just matching and blocking. Special attention should be paid to debugging, which is critical in practice.

3. Runtime: The system should minimize tool runtimes and scale tools up to large amounts of data.

4. Expandability: It should be easy to extend the system with any existing or future techniques that can be useful for EM (e.g., cleaning, IE, learning, crowdsourcing). Users should be able to easily patch the system using an interactive scripting environment.

Of these goals, expandability deserves more discussion. If we can build a single super-system for EM, do we need expandability? We believe it is very difficult to build such a system. First, it would be immensely complex to build just an initial system that incorporates all of the techniques mentioned in Goal 4. Indeed, despite decades of development, today no EM system comes close to achieving this.

Second, it would be very time consuming to maintain and keep this initial system up-to-date, especially with the latest advances (e.g., crowdsourcing, deep learning).

Third, and most importantly, a generic EM system is unlikely to perform equally well for multiple domains (e.g., biomedicine, social media, payroll). Hence, we often need to extend and customize it to a particular target domain, e.g., adding a data cleaning package specifically designed for biomedical data (written by biomedical researchers). For the above three reasons, we believe that EM systems should be fundamentally expandable.

Clearly, systems that target the above goals seek to manage all aspects of the end-to-end EM process. So we refer to this kind of system as entity matching management systems (EMMSs). Building EMMSs is difficult and long-term, and will require a new kind of architecture compared to current EM systems. In the rest of this paper we describe Magellan, an attempt to build such an EMMS.

3 THE MAGELLAN APPROACH

Figure 2 shows the Magellan architecture. The system targets a set of EM scenarios. For each EM scenario it provides a how-to guide. The guide proposes that the user solve the scenario in two stages: development and production.

In the development stage, the user seeks to develop a good EM workflow (e.g., one with high matching accuracy).

[Figure 2: The Magellan architecture. Labels recoverable from the figure: facilities for lay users (GUIs, wizards); power users; EM scenarios; how-to guides; development stage (data samples, supporting tools as Python commands); EM workflow; production stage (original data, supporting tools as Python commands); Python interactive environment and script language; data analysis stack; Big Data stack.]

Clearly there is a wide variety of EM scenarios. So we will build Magellan to handle a few common scenarios, and then extend it to more similar scenarios over time. Specifically, for now we will consider the three scenarios that match two given relational tables A and B using (1) supervised learning, (2) rules, and (3) learning plus rules, respectively. These EM scenarios are very common. In practice, users often try Scenario 1 or 2, and if neither works, then a combination of them (Scenario 3).

EM Workflows: As discussed earlier, to handle an EM scenario, a user often has to execute many steps, such as cleaning, IE, blocking, matching, etc. The combination of these steps forms an EM workflow. Figure 5 shows a sample
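
To make the blocking-then-matching workflow of Section 2.1 concrete, the following is a minimal sketch of the rule-based scenario applied to the two tables of Figure 1. It is an illustration only and does not use Magellan's API: the function names (block_on_state, rule_match, run_workflow), the choice of State as the blocking attribute, and the matching rule (same last name and same city) are all assumptions made for this example, and plain Python lists of dicts stand in for the pandas tables that the development stage would normally use.

```python
from itertools import product

# Tables A and B from Figure 1, as lists of dicts (a stand-in for DataFrames).
TABLE_A = [
    {"id": "a1", "name": "Dave Smith", "city": "Madison", "state": "WI"},
    {"id": "a2", "name": "Joe Wilson", "city": "San Jose", "state": "CA"},
    {"id": "a3", "name": "Dan Smith", "city": "Middleton", "state": "WI"},
]
TABLE_B = [
    {"id": "b1", "name": "David D. Smith", "city": "Madison", "state": "WI"},
    {"id": "b2", "name": "Daniel W. Smith", "city": "Middleton", "state": "WI"},
]


def block_on_state(table_a, table_b):
    """Blocking: keep only the pairs of A x B that agree on state."""
    return [(a, b) for a, b in product(table_a, table_b)
            if a["state"] == b["state"]]


def last_name(name):
    """Return the lower-cased final token of a name string."""
    return name.split()[-1].lower()


def rule_match(a, b):
    """Hypothetical hand-written rule: same last name and same city."""
    return (last_name(a["name"]) == last_name(b["name"])
            and a["city"].lower() == b["city"].lower())


def run_workflow(table_a, table_b):
    """A two-step EM workflow: block first, then match the survivors."""
    candidates = block_on_state(table_a, table_b)
    matches = [(a["id"], b["id"]) for a, b in candidates if rule_match(a, b)]
    return candidates, matches


candidates, matches = run_workflow(TABLE_A, TABLE_B)
print(len(candidates), "of", len(TABLE_A) * len(TABLE_B), "pairs survive blocking")
print(matches)
```

On this toy input, blocking discards the two pairs involving a2 (the only CA tuple), and the rule then keeps the two pairs listed as matches in Figure 1. A rule this simple would be far too brittle on real data, which is precisely why the text argues for debugging support and for combining rules with learning.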
