COMBINING TEXT AND IMAGE ANALYSIS IN THE WEB FILTERING SYSTEM WEBGUARD

Mohamed Hammami
LIRIS, Ecole Centrale de Lyon, 36 Av Guy de Collongue, 69131 Ecully, France

Youssef Chahir
GREYC, URA CNRS 6072, Campus II, BP 5186, Université de Caen, 14032 Caen Cedex

Liming Chen
LIRIS, Ecole Centrale de Lyon, 36 Av Guy de Collongue, 69131 Ecully, France

Web applications increasingly utilize search techniques that rely heavily on content-based text and image analysis. For
example, for parental site filtering it is necessary to identify adult sites. These applications must rely on a semantic analysis of images in the process of identification, where text analysis alone is insufficient. In this article we describe our site filtering system WebGuard and show the importance of image analysis in such a system. Our results show that it can detect and filter adult content effectively.

Web filtering, Data mining, Image analysis, Skin color model, Text mining, Semantic web

1 INTRODUCTION

Nowadays the Internet takes an increasingly pivotal place in everyday life. The Internet community is not only growing steadily but is also getting younger. Children find it easier every day to access the Internet, which may cause socio-cultural problems. According to a study carried out in May 2000, 60% of the interviewed parents are anxious when their children navigate on the Internet, in particular because of the presence of adult material. In addition, according to research by Forrester, a company which examines Internet content, online sales of pornography correspond to 10% of the total amount of online sales. This problem concerns parents as much as companies. For example, in October 1999 the company Rank Xerox laid off forty employees who had visited pornographic sites during their working hours. To avoid this kind of abuse, the company installed software packages to supervise what its employees visit on the Net.

Some companies have proposed solutions to Web site filtering. Their products concentrate on IP-based filtering, and their classification of Web sites is mostly manual. But as we know, the Web is a highly dynamic information source. Not only do many Web sites appear every day while others disappear, but site content, including links, is updated frequently. Thus manual classification and filtering systems are largely
impractical. The highly dynamic character of the Web calls for new techniques designed to classify and filter Web sites and URLs automatically.

International Conference WWW Internet 2003

In this paper we propose an adult content detection and filtering system called WebGuard that improves adult content detection accuracy by using both image signatures and textual clues of adult material. Compared to other systems, WebGuard has the advantage of combining image analysis and text analysis. Image analysis complements text analysis by detecting adult content embedded in images.

The remainder of this paper is organized as follows. The WebGuard architecture is presented in Section 2. The extraction of feature vectors from Web pages is reviewed in Section 3. The classification of URLs through data mining techniques is discussed in Section 4. Fuzzy clustering and skin color image segmentation are presented in Section 5. An experimental evaluation and comparison results are presented in Section 6. Finally, Section 7 summarizes the WebGuard approach.

2 WEBGUARD ARCHITECTURE

The web filter system WebGuard aims to block those sites with pornographic or other nudity and sexually
explicit language. It provides Internet content filtering solutions and Internet blocking of pornography, adult material and many more categories. The Internet will thus become more controllable and therefore safer for both adults and children.

Figure 1 WebGuard architecture

The WebGuard approach is as follows:

  • Fully automated adult content detection and filtering.
  • Categorization into a black list (access denied) and a white list (access allowed) to speed up navigation.
  • If the site is not recorded on the black list or the white list, the engine analyses both the visual and the textual information and makes a further decision on whether access to the site is allowed or denied. The black list / white list file is then updated.

In order to rapidly detect and filter Web pages with adult sexual content in real time, we must first have some knowledge about adult sexual content, such as suspected URLs, stored in the knowledge base. Hence our Web-based Adult Content Detection and Filtering System is comprised of two parts. The first part is designed to Create and accurately Update the Knowledge Base (CUKB); the second part is designed to Detect and Filter (D&F) the Web pages with adult sexual content dynamically when younger users view them. Figure 2 is the overview of the system architecture.

In CUKB, as shown in Figure 3, we have four facilities used to create and update the Knowledge Base: the Web Crawler, the Temporary Database, the Data Mining Tools and the Updating Trigger. The Web Crawler is used to periodically search for adult sexual images and web pages on the Internet, download suspect images or web pages, put them in the temporary database, and then trigger the Data Mining Tool. The Data Mining Tool uses a data mining method to extract the features of adult sexual images or web pages stored in the temporary database, to discover the suspect URLs, to classify the features, and to trigger the Updating
Trigger. The Updating Trigger uses predefined strategies to add newly discovered adult sexual content and suspect URLs to the Knowledge Base. To date we have created the Knowledge Base and can periodically update it, and have established the fundamentals of our Web-based adult content detection and filtering.

Figure 2 The overview of the system architecture
Figure 3 The components of CUKB

In D&F, as shown in Figure 4, we have three facilities to detect and filter browsing activity: the Activity Monitor, the Decision Engine and the Knowledge Base. The Activity Monitor captures active users' URLs in real time and compares these URLs with the suspected URLs stored in the Knowledge Base. If such URLs are in the Knowledge Base, the Decision Engine is informed. According to the strategies stored in the Knowledge Base, the Decision Engine filters the adult content or disconnects the connection. Apart from classified features and suspected URLs, any anti-browse measures or management information which have been defined by ISPs or generated by the system are also stored in the Knowledge Base.

Figure 4 The components of D&F

3 WEB PAGE FEATURE VECTOR EXTRACTION

Before detecting and filtering the URLs with adult content, we need to know which URLs are sex-oriented and which are not. This is quintessentially a problem of URL classification.

In order to sort the URLs into two classes, sex-oriented and non-sex-oriented, we first decide which features of a URL can be used as its defining features. Considering that many sex-oriented Web pages have picture galleries with little or no text at all, we use both image signature and textual clues as the features of a URL. At the same time, many sex-oriented URLs open pop-up windows, and if a Web page links to another Web page, it is possible that this Web page also has sexual content. Consequently, the number of pop-up windows on a Web page and the nature of a Web page's links (sex-relevant or not) are also important features
of a URL. The URLs of many sex-oriented Web sites contain sexually explicit words, which is another clear indication that the site contains sexual content. To summarize the above, we give the feature vector of a
Web site as follows:

VoW = (bSEW, nWD, nWDwS, nLNK, nLNKwS, nIMG, nIMGwS, nPW)

where bSEW is a flag indicating whether or not the current URL contains sexually explicit words, nWD is the number of words on the current Web page, nWDwS is the number of sexually explicit words on the current Web page, nLNK is the number of links on the current Web page, nLNKwS is the number of the current Web page's links with adult sexual content, nIMG is the number of images on the current Web page, nIMGwS is the number of the current Web page's images with adult sexual content, and nPW is the number of pop-up windows on the current Web page.

Using the Web crawler, we create the feature vector VoW of a URL. It follows from the definition of VoW that, in order to set up the feature vector of a URL, we must first decide whether or not the Web pages that this Web page links to are sex-relevant. So we must traverse the Web site corresponding to this URL and get the leaf URLs of this Web site, then construct the feature vectors of all leaf URLs, then construct the feature vectors of their parent URLs, after which we construct the feature vectors of their grandparent URLs. Finally, we set up the feature vector of the given URL. This is a bottom-up computation, for which we use a stack as the data structure.

Figure 5 The preparation of feature (diagram labels: URLs, HTML Parse, Text Analysis, Image Analysis, Feature Vectors)

As shown in Figure 5, at each step in the computing process we first parse the HTML, deleting the HTML tags; after that we analyse the textual content of the HTML, gathering the textual information; and then we analyse the images appearing in the HTML, deciding whether or not they are sex-relevant. Finally, based on the obtained information, we create the feature vector of the given URL.

4 USING DATA MINING TECHNIQUES TO CLASSIFY URLS

Once the feature vectors of all the URLs have been constructed, the task is to construct a classifier to classify
these URLs into two classes: adult sexual URLs and other URLs.

A number of classification techniques from the statistics and machine learning communities have been proposed [6, 7, 8, 10]. A well-accepted method of classification is the induction of decision trees [2, 6, 10]. A decision tree is a flow-chart-like structure consisting of internal nodes, leaf nodes and branches. Each internal node represents a decision or test on a data attribute, and each outgoing branch corresponds to a possible outcome of the test. Each leaf node represents a class. In order to classify an unlabeled data sample, the classifier tests the attribute values of the sample against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that sample.

Let $\Omega$ be the set of Web sites, and let

$$C : \Omega \to \{\text{suspect URLs}, \text{normal URLs}\}$$

The observation of $C(\omega)$ is not easy; therefore we are looking for a means $f$ to describe the class $C$. The process of graph construction is as follows. We begin with a sample of sites, both suspect URLs and normal URLs, and look for the particular attribute which will produce the best partition. We repeat the process for each node of the new partitions. The best partitioning is obtained by maximizing the variation of uncertainty between the current partition and the previous partition. $I_\lambda(S_i)$ is a measure of entropy for partition $S_i$, and $I_\lambda(S_{i+1})$ is the measure of entropy of the following partition $S_{i+1}$. The variation of uncertainty is

$$\Delta I_\lambda(S_i) = I_\lambda(S_i) - I_\lambda(S_{i+1})$$

For $I_\lambda(S_i)$ we use the quadratic entropy (a) or the Shannon entropy (b):

$$I_\lambda(S_i) = \sum_{j=1}^{K} \frac{n_j}{n} \sum_{i=1}^{m} \frac{n_{ij} + \lambda}{n_j + m\lambda} \left(1 - \frac{n_{ij} + \lambda}{n_j + m\lambda}\right) \qquad \text{(a)}$$

$$I_\lambda(S_i) = -\sum_{j=1}^{K} \frac{n_j}{n} \sum_{i=1}^{m} \frac{n_{ij} + \lambda}{n_j + m\lambda} \log_2 \frac{n_{ij} + \lambda}{n_j + m\lambda} \qquad \text{(b)}$$

where $n_{ij}$ is the number of elements of class $i$ at the node $S_j$, with $i \in \{\text{suspect URLs}, \text{normal URLs}\}$; $n_i = \sum_{j=1}^{K} n_{ij}$ is the total number of elements of the class $i$; $n_j = \sum_{i=1}^{m} n_{ij}$ is the number of elements of the node $S_j$; $n = \sum_{i=1}^{m} n_i$ is the total number of elements; and $m = 2$ is the number of classes: suspect
URLs and normal URLs.

The parameter $\lambda$ is a variable controlling the effectiveness of graph construction. The algorithm stops if no changes in uncertainty occur.

In our system WebGuard we use several classification methods (ID3, C4.5, SIPINA) that can be combined in order to ensure a high degree of accuracy. In addition, the user can configure the blocking degree to a level that suits his or her cultural background. Furthermore, the user can protect his or her configuration with a password. Figure 7 shows the configuration interface.

Figure 7 Configuration interface

5 FUZZY CLUSTERING AND SKIN COLOR IMAGE SEGMENTATION

One of the driving applications for skin color model construction of large datasets is data mining. The main goal is to find an original structure of the data, select the most significant part of this structure, and describe it with a set of compact decision rules [4, 5].

Another important step in the image classification process is color segmentation of the image into skin color regions and non-skin color regions. Each skin color image in the database was processed in the following manner: first, similar regions were automatically labeled using fuzzy clustering; then the skin color regions were extracted and identified by the dedicated process.

The fuzzy algorithm automatically detects the number of classes. The local minima of the contrast in the image give a set of seed regions, which are subsequently exploited by a discrete segmentation method such as the continuous watershed method. The region growing process of the segment is simulated by linear diffusion with a diffusion coefficient that depends on local image properties similar to the boundary.
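
The black list / white list lookup described in Section 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the WebGuard implementation: the names `filter_url` and `analyse_page` are hypothetical, the lists are plain in-memory sets rather than a file, and `analyse_page` stands in for the combined text and image analysis.

```python
from typing import Callable, Set


def filter_url(
    url: str,
    black_list: Set[str],
    white_list: Set[str],
    analyse_page: Callable[[str], bool],
) -> bool:
    """Return True if access is allowed, False if it is denied.

    `analyse_page` is a hypothetical callback that returns True when the
    page is judged to contain adult content.
    """
    # Known sites are answered from the lists without any analysis.
    if url in black_list:
        return False
    if url in white_list:
        return True
    # Unknown site: analyse it, then record the decision for next time.
    if analyse_page(url):
        black_list.add(url)
        return False
    white_list.add(url)
    return True
```

Recording the outcome in one of the two sets is what lets a second request for the same URL skip the analysis step.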
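
The bottom-up construction of the feature vector VoW from Section 3 can be sketched with an explicit stack. The field names follow the paper; everything else is an assumption made for illustration: the site is given as a dictionary of links, `local_features` is a hypothetical callback returning the page-local counts, and `is_adult` is a hypothetical stand-in for the classifier of Section 4. The sketch assumes the site forms a tree, as in the leaf, parent, grandparent description.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass
class VoW:
    """Feature vector of a Web page, with the fields named in Section 3."""
    bSEW: bool    # URL contains sexually explicit words
    nWD: int      # number of words on the page
    nWDwS: int    # number of sexually explicit words on the page
    nLNK: int     # number of links on the page
    nLNKwS: int   # number of links to pages with adult content
    nIMG: int     # number of images on the page
    nIMGwS: int   # number of images with adult content
    nPW: int      # number of pop-up windows


def build_vectors(
    root: str,
    links: Dict[str, List[str]],
    local_features: Callable[[str], VoW],
    is_adult: Callable[[VoW], bool],
) -> Dict[str, VoW]:
    """Compute VoW for every page reachable from `root`, leaves first."""
    vectors: Dict[str, VoW] = {}
    seen = {root}              # only guards against revisiting a page
    stack = [(root, False)]    # (url, have its children been pushed?)
    while stack:
        url, children_done = stack.pop()
        children = links.get(url, [])
        if not children_done:
            # Revisit this page after all of its children are finished.
            stack.append((url, True))
            for child in children:
                if child not in seen:
                    seen.add(child)
                    stack.append((child, False))
        else:
            # Children already have vectors, so nLNKwS can be counted.
            adult_links = sum(
                1 for c in children if c in vectors and is_adult(vectors[c])
            )
            vectors[url] = replace(
                local_features(url), nLNK=len(children), nLNKwS=adult_links
            )
    return vectors
```

Each page is pushed twice: once to schedule its children and once to finalize its own vector, which gives the leaf-to-root order the text describes without recursion.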
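
The entropy-based split criterion of Section 4 can be sketched as below. This assumes the λ-smoothed form of the quadratic and Shannon entropies, in which each class frequency at a node is taken as (n_ij + λ) / (n_j + mλ); the function names and the list-of-counts representation of a partition are illustrative assumptions, not part of WebGuard.

```python
import math
from typing import Callable, List, Sequence

# partition[j][i] = n_ij, the number of elements of class i at node S_j
Partition = Sequence[Sequence[int]]


def _smoothed(node: Sequence[int], lam: float) -> List[float]:
    """Class frequencies (n_ij + lam) / (n_j + m * lam) at one node.

    Nodes must be non-empty when lam == 0, otherwise this divides by zero.
    """
    n_j, m = sum(node), len(node)
    return [(n_ij + lam) / (n_j + m * lam) for n_ij in node]


def quadratic_entropy(partition: Partition, lam: float = 1.0) -> float:
    """Quadratic entropy (a), weighted by node size n_j / n."""
    n = sum(sum(node) for node in partition)
    return sum(
        (sum(node) / n) * sum(p * (1 - p) for p in _smoothed(node, lam))
        for node in partition
    )


def shannon_entropy(partition: Partition, lam: float = 1.0) -> float:
    """Shannon entropy (b), weighted by node size n_j / n."""
    n = sum(sum(node) for node in partition)
    return -sum(
        (sum(node) / n)
        * sum(p * math.log2(p) for p in _smoothed(node, lam) if p > 0)
        for node in partition
    )


def uncertainty_variation(
    current: Partition,
    following: Partition,
    entropy: Callable[[Partition, float], float] = shannon_entropy,
    lam: float = 1.0,
) -> float:
    """I(S_i) - I(S_{i+1}); the best split maximizes this value."""
    return entropy(current, lam) - entropy(following, lam)
```

For example, `uncertainty_variation([[4, 4]], [[4, 0], [0, 4]], lam=0.0)` compares a single mixed node with a split into two pure nodes; candidate attributes would be ranked by this value.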