Year of publication: 1390 (Solar Hijri calendar)

Published in: National Conference on Electronic City

Number of pages: 5

Author(s):

Saman Bashbaghi – Computer Engineering Dept., Bu-Ali Sina University, Hamedan, Iran
Hassan Khotanlou – Computer Engineering Dept., Bu-Ali Sina University, Hamedan, Iran

Abstract:

With the number of documents on the internet growing daily, automatic language detection is becoming more important. In this paper we use a language detection system to classify and filter immoral web pages based on their content. The system detects the 10 languages most used in immoral web pages, including Farsi. We introduce a new combined method that consists of three parts: a URL processor, a page encoding processor, and a text processor. To produce the final result, a voter combines the outputs of these three parts. We used immoral web pages and labeled web pages as the input data set, both to build a linguistic model for each language and to evaluate the system. Our experiments show 95% accuracy. The methods are combined because, in this particular problem, the name used in the address may not indicate that a page is immoral. Another reason is that many web pages in different languages may use the same encoding. Consequently, no single method can solve the problem by itself. This paper shows that the combination of these three methods gives very promising results. The paper is organized as follows: related work, problem definition, introduction of the solution, interpretation of results, and conclusion and future work.
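
The abstract does not say how the voter combines the three parts, so the following is only a minimal sketch under stated assumptions: a weighted majority vote in which each part either proposes a language code or abstains. The function name `vote`, the part names, and the weights are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Optional

# Hypothetical weights: the abstract does not state how the parts are weighed.
# Here the text processor is assumed to be the most reliable signal.
WEIGHTS = {"url": 1.0, "encoding": 1.0, "text": 2.0}


def vote(predictions: Dict[str, Optional[str]],
         weights: Dict[str, float] = WEIGHTS) -> Optional[str]:
    """Return the language with the highest total weight, or None if no part voted."""
    scores = defaultdict(float)
    for part, language in predictions.items():
        if language is not None:  # a part may abstain when it has no evidence
            scores[language] += weights.get(part, 1.0)
    if not scores:
        return None
    # Highest score wins; ties are broken alphabetically so the result is deterministic.
    return min(scores, key=lambda lang: (-scores[lang], lang))


# The URL suggests English, the encoding is inconclusive, the text suggests Farsi.
print(vote({"url": "en", "encoding": None, "text": "fa"}))  # prints "fa"
```

Letting a part abstain reflects the two weaknesses the abstract names: an address may carry no useful signal, and one encoding may be shared by several languages, so neither should be forced to cast a vote.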