Analysing And Preventing The Challenging Issues Related To Web Scraping

The Internet is filled with personal websites, blogs, and other sites where millions of people post information about themselves. There is a technique, supported by a legitimate set of tools, through which you can search these accessible websites and retrieve relevant, accurate raw data with minimal effort. This technique is called Web Scraping. The tools can simulate user interactions and automatically store the resulting output. There are many valid reasons for using web scraping, for example by web analysts, ethical hackers, developers, and decision analysts. However, it has also been observed that web scraping is often used for illegal or illicit purposes, for instance to copy content in a way that violates someone else's copyright. Therefore, in this paper we analyse various issues related to web scraping and discuss some techniques for preventing them. The paper thus provides guidance both on scraping itself and on methods of preventing the issues discussed.

1.    Introduction: When developing a system that uses online data from external parties, many companies take a careful approach, meaning that they follow a formal, documented legal process in which permission is granted and a revenue package is agreed upon. However, these same companies tend to be frustrated when their own data is leveraged by other parties who have less to lose and who do not take this prudent approach.

These parties normally use web scraping to harvest the information. Web scraping can be defined as the act of going through the content of a website for the purpose of extracting information from it. It is typically implemented by authoring an automated agent that makes appropriate HTTP requests to the website with the desired content and 'scrapes' that content from the result of the HTTP request. The scraping (or extraction, or harvesting) is used to collect content such as user data, image links, user comments, email addresses, or any other data of potential value from the source website [1].

The main focus of this technique is the transformation of unstructured Web content, coded in HTML, into structured data that can be stored and analysed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Web scraping is used for many purposes, such as online price comparison, weather data monitoring, website change detection, Web research, Web content mashup, and Web data integration, and it can save hundreds of thousands of man-hours.
With web scraping services you can generate sales leads, harvest product pricing data, duplicate an online database, and capture financial data, real estate data, job postings, auction information, and more [2].

2.    Introduction to Web Scraping: Web Scraping refers to an application that processes the HTML of a Web page to extract data or to convert the page to another format (e.g. HTML to WML). Web-scraping scripts and applications simulate a person viewing a Web site with a browser.
a)     Categories of Web Scraping
Web scraping can be of three types:
i)     Manual scrapers: People often download data manually and use it in direct breach of the terms and conditions of the site. The scrapers can be single individuals or groups of people, for example a call centre using the site commercially.
ii)    Scripted scrapers: When transactions must be performed automatically, or large amounts of data must be retrieved quickly, it is most convenient for scrapers to use a script or program rather than work manually. Scripted web scrapers can use a single IP address or multiple IP addresses, making it seem that they are in fact a group of legitimate users.
iii)    Bots: If you have ever selected text or images from a web page and saved them onto your hard drive, you have performed a form of screen scraping. Pirates, however, do not use browsers; the term for the tool they use is "bot", short for robot. Bots pretend to be browsers, and your server cannot tell the difference. Instead of rendering a web page, bots extract the data and images and save them to the thief's hard drive. Technically, this process involves parsing the HTML supplied by the server and extracting the data and image elements. While searching for information on the Internet, bots retain only the data elements that they find, discarding the other "markup" elements of HTML, as sketched below.
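To make this concrete, the following is a minimal, hypothetical sketch (using only Python's standard library; the target URL is a placeholder) of how such a bot parses the HTML returned by a server and keeps only the elements it is interested in, in this case image links:

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ImageLinkCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, discarding all other markup."""

    def __init__(self):
        super().__init__()
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.image_links.append(value)


if __name__ == "__main__":
    # Hypothetical target page; a real bot would walk many pages of a site.
    html = urlopen("http://example.com/").read().decode("utf-8", errors="replace")
    collector = ImageLinkCollector()
    collector.feed(html)
    for link in collector.image_links:
        print(link)
```

A real bot would simply save the collected data and image files to disk instead of printing them.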
b)    Techniques for Web scraping: Web scraping is a field with active development. The process of automatically collecting Web information shares a common goal with the semantic Web vision, which is a more ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence, and human-computer interaction. Web scraping, instead, favours practical solutions based on existing technologies, even though some solutions are entirely ad hoc. Existing Web-scraping technologies therefore provide different levels of automation:
i)    Human copy-and-paste: Sometimes you have to copy and paste data from Web pages by hand. This may be the only workable solution when the target websites explicitly set up barriers to prevent machine automation; even the best Web-scraping technology cannot replace a human's manual examination and copy-and-paste.
ii)    Text grepping and regular expression matching: A simple and powerful approach to extracting information from Web pages is based on the Unix grep command or on regular expression matching in languages such as Perl (a combined sketch of this and the next technique follows this list).
iii)    HTTP programming: In this approach, static and dynamic Web pages are retrieved by sending HTTP requests to the remote Web server, either through socket programming or through an HTTP client library.
iv)    HTML parsers: Semi-structured data query languages, such as the XML query language (XQL) and the hyper-text query language (HTQL), can be used to parse HTML pages and to retrieve and transform Web content.
v)    DOM parsing: By embedding a full-fledged Web browser, such as Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse Web pages into a DOM tree, from which programs can retrieve parts of the pages.
vi)    Web-scraping software: Many Web-scraping software packages and services are available that can be used to build Web-scraping solutions. Such software may provide a Web recording interface that removes the need to write scraping code manually, scripting functions that can be used to extract and transform Web content, and database interfaces that can store the scraped data in local databases. Examples of such software include:
i.    Web Scraper Plus+
ii.    Web Scraper Lite
iii.    Web2DB
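As a concrete illustration of techniques ii) and iii) above, the sketch below retrieves a page with an HTTP request and then greps the result with a regular expression. It is a minimal example using Python's standard library; the URL, the User-Agent string, and the (deliberately naive) e-mail pattern are assumptions chosen for illustration, not taken from any particular site.

```python
import re
from urllib.request import Request, urlopen

# Technique iii) HTTP programming: fetch the raw HTML with an HTTP request.
# The URL is a placeholder for whatever page is being scraped.
url = "http://example.com/"
request = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; demo-scraper)"})
html = urlopen(request).read().decode("utf-8", errors="replace")

# Technique ii) text grepping / regular expression matching:
# a simple pattern that picks out anything resembling an e-mail address.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
for address in sorted(set(email_pattern.findall(html))):
    print(address)
```

In practice the extracted values would be written to a local database or spreadsheet rather than printed.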

c.    Process of Web Scraping: Web Scraping is essentially reverse engineering of HTML pages. It can also be thought of as parsing out chunks of information from a page. Web pages are coded in HTML, which uses a tree-like structure to represent the information; the actual data is mingled with layout and rendering information and is not readily available to a computer. Scrapers are programs that know how to get the data back out of a given HTML page. They work by learning the details of the particular markup and figuring out where the actual data is.
Fig. 1 describes the two-step process of Web Scraping:
i)    In step 1, the web page coded in HTML or XHTML is transformed into XML [2][3]. If the input document is already XML, this step is skipped.
In Fig. 2, target HTML pages are subjected to a sequence of data-extraction steps. Much of the HTML content on the Web today is ill-formed because it does not conform to the HTML specifications. Therefore, the first step in data extraction is to translate the content into well-formed XML syntax, because this helps the subsequent extraction steps. Since XHTML is based on XML, any XML tool can be used to further process the target HTML pages [3][4].

In Fig. 2, the URL of an XHTML document is used to determine which set of XSLT files to apply to it. The XHTML document is passed through the first XSLT file, and the output is pipelined through the other XSLT files defined for that URL. The final output is an XML file whose structure and content are determined by the last XSLT file [4].
ii)    In step 2, XML tools are used to extract the required information and store it in the required place, such as a local database or spreadsheet [4]. A minimal sketch of this pipeline follows.
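The following sketch illustrates the two-step pipeline using the third-party lxml library; the inline XSLT stylesheet and the <span class="price"> markup are assumptions chosen purely for illustration. The ill-formed HTML is first parsed into a well-formed (XHTML-like) tree, after which an XSLT transform extracts the wanted data as XML.

```python
from lxml import etree, html

# Step 1: turn (possibly ill-formed) HTML into a well-formed XML tree.
raw_html = "<html><body><h1>Listings<p>House A <span class=price>100,000</span>"
tree = html.fromstring(raw_html)
xhtml = etree.fromstring(etree.tostring(tree))

# Step 2: an XSLT stylesheet (hypothetical, written for the markup above)
# extracts the data of interest into a small XML document.
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <prices>
      <xsl:for-each select="//span[@class='price']">
        <price><xsl:value-of select="."/></price>
      </xsl:for-each>
    </prices>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)
result = transform(xhtml)
print(str(result))  # e.g. <prices><price>100,000</price></prices>
```

In a full pipeline, several such stylesheets would be chained per URL, and the final XML would be loaded into a local database or spreadsheet.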

d.    Benefits of web scraping: Web scraping offers several benefits:
i)    Web Scraping will help businessmen extract and collect market figures, product pricing data, or real estate data.

ii)    Web Scraping will help book lovers extract information about books, including their titles, authors, descriptions, ISBNs, images, and prices, from online booksellers.
iii)    Web Scraping will help hobbyists and collectors automate the extraction of bidding and auction information from auction sites.
iv)    Web Scraping will help journalists extract news and articles from news sites.
v)    Web Scraping will extract online information about vacation and holiday destinations, including their names, addresses, descriptions, images, and prices, from web sites.
vi)    Web Scraping will help people seeking a job extract job postings from online job websites, so they can find a new job faster and with minimal inconvenience.
vii)    Web Scraping will help developers create new concepts from existing ones.

3.    Challenges of Web Scraping: Although web scraping is a very useful technique for extracting web data, it also has some problematic issues. In this section we discuss these issues.
a.    Legal Issues: Sometimes there are legal restrictions on the data you retrieve with web scraping, and it may be against the terms of use of some websites. Web scraping can be perceived as stealing information owned by a web site, and it is a very complicated issue to understand where copy/paste ends and scraping begins. It is acceptable for people to copy and save information from web pages manually, but it might not be legal to have software do this automatically. Even so, it does not seem that scraping is going to stop: for example, the legal issues with Napster did not stop people from writing peer-to-peer sharing software, and the more recent YouTube lawsuit is not likely to stop people from posting copyrighted videos.

b.    Security issues: Scraping becomes problematic when an attacker steals web-based information that is available only under subscription and shares it free of charge. More complex attacks combine scrapes of intellectual property with probes for security holes that leave the company vulnerable to hacking. In addition to stealing intellectual property and uncovering security issues, a large-scale scrape may also cause performance problems similar to a denial-of-service attack.
For example, a hacker could log on to a research firm's website and launch an automated tool to extract volumes of information quickly and effortlessly. If the hacker were to make that information available free-of-charge on the web, it would render the firm's research library a valueless commodity overnight and destroy the business.
If launched against a business-networking site, such an attack could collect personal information intended to be available only with permission to other subscribers. By making the information publicly available, the hacker would not only negate the viral marketing model of the site, but also expose private contact information and activities, such as a job search.
c.    Business issues: From the point of view of the website owner there are a few negative aspects:
i.    The website is being accessed in a way that was not intended – most websites are developed for individuals browsing through web browsers.
ii.    The people doing the scraping are typically collecting the data for purposes that are not in line with the website owner's intent – either to re-present the data to others via their own websites or to gain intelligence about the company's business.
iii.    Because screen-scraping mimics users, but typically in large volumes, it can put significant pressure on a web server and either slow down the responsiveness of a website or cause it to crash altogether. While screen-scraping may be a handy way to cost-effectively get access to your competitors' data, you certainly don't want anyone doing it to your website.
 
All online businesses that share information on their websites as part of their business model are threatened by scraping. Examples include online directories, online property portals, airlines, and business-to-business portals.

Online property portals are common targets of the systematic scraping (data theft) that plagues online businesses around the world, and scraping can seriously harm them.

Online property portals usually get a major part of their revenue from advertising – both from real estate agents advertising property on the web site and from other companies advertising property-related businesses, such as building or gardening.
One of the biggest threats to an online property portal is an aggregator site that uses automated programs to scrape all the property listings from the portal and then publishes the data on its own site, together with data stolen from many other property portals.

i.    The scraper site has spent next to no money or time gathering and refining the material. Having no overhead costs means that it can offer its services at a much lower cost than you can.
ii.    Because the aggregator site has compiled property listings from many different portals, it can offer its users a wider, more attractive range of properties. When users and advertisers notice this, they will eventually steer clear of your site in favour of the one that offers the most attractive service.

4.    Prevention Against the Challenging Issues: This section describes prevention techniques for the various challenging issues of web scraping.

a.    Security issues: Web scraping is going on all of the time; it can be used to steal information from one website, publish the details on another, and redirect people to it, which can lead to fraud. We can address this problem by blacklisting IP addresses, but that requires a lot of time; another option is to use anti-scraping services. Below are four traditional anti-scraping methods, all of which have drawbacks when used to prevent scraping.
i.    Rate limiting: In this prevention technique you allow each IP address only a certain number of searches in a fixed timeframe before blocking it. In reality it is not a sure way to stop the worst offenders. The problem is that a large proportion of your users are likely to come through proxy servers or large corporate gateways, which they often share with thousands of other users. If you rate-limit a proxy's IP address, that limit will easily be triggered when different users behind the proxy use your site.
One solution is of course to use whitelists, but the problem is that you continually need to compile and maintain these lists manually, since IP addresses change over time. Needless to say, data scrapers will simply lower their request rates or distribute the searches over more IP addresses once they realise that you are rate-limiting certain addresses. For rate limiting to be effective, and not prohibitive for big users of the site, we usually recommend investigating everyone who exceeds the rate limit before blocking them. A minimal sketch of such a rate limiter is shown below.
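The sketch below shows one minimal way such a rate limiter could be implemented: an in-memory sliding window keyed by IP address. The limit of 100 requests per 60 seconds, the whitelisted proxy address, and the client address are arbitrary assumptions, and a real deployment would combine this with the whitelist maintenance and manual review discussed above.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60       # length of the sliding window
MAX_REQUESTS = 100        # allowed requests per IP address within the window
WHITELIST = {"10.0.0.5"}  # e.g. a known corporate proxy (hypothetical address)

_recent = defaultdict(deque)  # ip -> timestamps of recent requests


def allow_request(ip):
    """Return True if this request is within the rate limit for the given IP."""
    if ip in WHITELIST:
        return True
    now = time.time()
    window = _recent[ip]
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False  # a candidate for investigation rather than an instant block
    window.append(now)
    return True


if __name__ == "__main__":
    for i in range(105):
        if not allow_request("203.0.113.7"):
            print("request", i + 1, "from 203.0.113.7 exceeded the limit")
            break
```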

ii.    Captcha tests: CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart [2]. These tests are a common way of trying to block scraping at web sites. A Captcha identifies whether the party trying to access your site is a human or a computer program by generating challenges that only a human can answer correctly; computers still lag behind humans in the area of image and word recognition. Typically, a Captcha displays a distorted image of a word and challenges the party to enter the word correctly. You should be aware that academic algorithms have been published that can defeat some forms of Captcha. Captchas are used heavily by the financial and ticketing industries.
This method has two obvious drawbacks. Firstly, the Captcha tests may be annoying for users if they have to fill out more than one. Secondly, web scrapers can easily complete the test manually and then let their script run. Apart from this, a couple of big users of Captcha tests have had their implementations compromised. A minimal sketch of the server-side challenge-and-verify flow is shown below.
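A full Captcha implementation needs an image library to render and distort the word; the sketch below only illustrates, under that simplification, the server-side flow of issuing a challenge and verifying the answer. The word list and the token scheme are assumptions made for the example.

```python
import secrets

# Hypothetical pool of challenge words; a real Captcha would render these
# as distorted images rather than exposing them as plain text.
WORDS = ["harbour", "lantern", "gravity", "meadow"]

_pending = {}  # token -> expected answer, kept server-side only


def issue_challenge():
    """Create a challenge and return (token, word_to_render_as_image)."""
    token = secrets.token_hex(16)
    word = secrets.choice(WORDS)
    _pending[token] = word
    return token, word


def verify_answer(token, answer):
    """Check the user's answer; each token may be used only once."""
    expected = _pending.pop(token, None)
    return expected is not None and answer.strip().lower() == expected


if __name__ == "__main__":
    token, word = issue_challenge()
    # In a real site the word is shown only as a distorted image;
    # here we simulate a user typing it back correctly.
    print("human passes:", verify_answer(token, word))
    print("replayed token fails:", verify_answer(token, word))
```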

iii.    Obfuscating source code: Some solutions try to obfuscate the HTML source code to make it harder for machines to read. The problem with this method is that if a web browser can understand the obfuscated code, so can any other program. Obfuscating source code may also interfere with how search engines see and treat your website, so if you decide to implement it you should do so with great care. A small sketch of one common obfuscation trick is shown below.
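As an illustration only, the sketch below shows one common obfuscation trick: encoding text such as an e-mail address as numeric HTML character references, so that naive text-grepping scrapers miss it while browsers still render it normally. The address used is a placeholder.

```python
def obfuscate_as_entities(text):
    """Encode every character as a numeric HTML character reference."""
    return "".join("&#{};".format(ord(ch)) for ch in text)


if __name__ == "__main__":
    address = "info@example.com"  # placeholder address
    print(obfuscate_as_entities(address))
    # Browsers render the output as the original address, but a scraper
    # grepping for a plain "user@domain" pattern will not match it.
```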

iv.    Blacklists: Blacklists consisting of IP addresses known to scrape the site are not really a method in themselves, since you still need to detect a scraper first in order to blacklist it. Even then, it is a blunt weapon, since IP addresses tend to change over time, and in the end you will end up blocking legitimate users. If you still decide to implement blacklists, you should have a procedure to review them on at least a monthly basis. A minimal sketch of a blacklist with such a periodic review is shown below.
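The following sketch shows a minimal blacklist in which each blocked IP address carries the date it was added, so that a monthly review can expire stale entries. The 30-day review period mirrors the recommendation above, and the addresses and dates are placeholder example data.

```python
from datetime import datetime, timedelta

REVIEW_AFTER = timedelta(days=30)  # review/expire entries at least monthly

# ip -> date the address was blacklisted (placeholder example data)
_blacklist = {"198.51.100.23": datetime(2024, 1, 10)}


def is_blocked(ip):
    return ip in _blacklist


def add_to_blacklist(ip):
    _blacklist[ip] = datetime.now()


def review_blacklist():
    """Drop entries older than the review period and return the removed IPs."""
    now = datetime.now()
    expired = [ip for ip, added in _blacklist.items() if now - added > REVIEW_AFTER]
    for ip in expired:
        del _blacklist[ip]
    return expired


if __name__ == "__main__":
    add_to_blacklist("203.0.113.7")
    print("blocked:", is_blocked("203.0.113.7"))
    print("expired after review:", review_blacklist())
```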
b.    Business issues: To help online property portals and other businesses deal with scraping problems once and for all, Sentor has developed ASSASSIN, a managed anti-scraping service completely dedicated to detecting and blocking scraping. With ASSASSIN you have complete control over your data [7]. ASSASSIN analyses the traffic at your website around the clock and raises an alarm when scraping is suspected; Sentor's security operators evaluate the alarm and act according to a response plan [7].

c.    Problems with legal action against scraping: There are two major problems with using legal action to stop web scraping.

i.    The first is that, since the scraping is performed over the Internet, the scraper may be located anywhere in the world and may not be bound by the laws of the country where the site is located.
ii.    The second problem is the sheer scale of scraping and the fact that identifying the scrapers is usually not trivial. If you have a large site with valuable information or business logic that attracts scrapers, there will probably be hundreds of offenders each month, and pursuing legal action against them all will be very costly.

5.    Conclusion: There is no single solution that can prevent the theft of images and data from your website. The best defence is to develop a strategy that combines both reactive and proactive measures; these measures are most effective when a mix of legal and technical tactics is used.
As the use of web scraping keeps growing, it is impossible to stop it completely. In this paper we discussed web scraping, its benefits, and its various techniques. Although the technique is very useful, it also has problematic issues, namely legal, security, and business issues, and it mostly creates problems for online businesses whose data is very valuable. We analysed and identified those issues and proposed some prevention techniques; for online businesses, for example, we suggest the managed service ASSASSIN, through which the use of web scraping can be curbed to some extent.

6.    References:
1.    Varun Bhagwan and Tyrone Grandison, "Deactivation of Unwelcome Deep Web Extraction Services through Random Injection," IBM Almaden Research Center, 650 Harry Road, San Jose, California 95120, USA.
2.    CRT White Paper, "Screen Scraping Strategies," June 2004.
3.    XHTML: The Extensible HyperText Markup Language, W3C Recommendation, January 2000. http://www.w3.org/TR/xhtml1
4.    Extensible Markup Language (XML), W3C Recommendation, February 1998.
5.    XSL Transformations (XSLT), W3C Recommendation, November 1999. http://www.w3.org
6.    Jussi Myllymaki, "Effective Web Data Extraction with Standard XML Technologies," IBM Almaden Research Center.
7.    ASSASSIN, http://www.sentormss.com/assassin.html