Information leakage of the web application's directory or folder path

By
Sooraj V Nair
Published on
04 Jul 2018
2 min read

Web spiders, robots, or crawlers retrieve a web page and then recursively traverse the hyperlinks it contains to discover further web content. Well-behaved robots follow the Robots Exclusion Protocol, which is specified in the robots.txt file found in the web application's web root folder. The robots.txt file lists the folders and paths that crawlers are asked to ignore. However, a spider, robot, or crawler can intentionally ignore the Disallow directives in robots.txt, and such robots are common across social networks. For this reason, robots.txt is not a safe way to enforce restrictions on how third parties use your web content.
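
To see what compliance looks like in practice, here is a minimal sketch using Python's standard urllib.robotparser module (www.example.com and the user-agent string are placeholders). It shows how a polite crawler consults robots.txt before fetching a page, and why honouring the file is purely voluntary:

        # Minimal sketch: how a well-behaved crawler honours robots.txt.
        # Standard library only; www.example.com is a placeholder.
        from urllib import robotparser

        rp = robotparser.RobotFileParser()
        rp.set_url("http://www.example.com/robots.txt")
        rp.read()  # fetch and parse the robots.txt file

        # A compliant crawler checks every URL before requesting it...
        if rp.can_fetch("MyCrawler", "http://www.example.com/search"):
            print("allowed - fetch the page")
        else:
            print("disallowed - a polite crawler skips this URL")

        # ...but nothing technically prevents a rogue crawler from
        # requesting the disallowed URL anyway.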

An attacker can retrieve robots.txt using wget:

wget http://www.google.com/robots.txt

The output of the above command will be similar to the following:

        --2018-09-01 13:13:59--  http://www.google.com/robots.txt
        Resolving www.google.com (www.google.com)... 172.217.26.228
        Connecting to www.google.com (www.google.com)|172.217.26.228|:80... connected.
        HTTP request sent, awaiting response... 200 OK
        Length: unspecified [text/plain]
        Saving to: ‘robots.txt’

        robots.txt              [ <=>                ]   6.94K  --.-KB/s    in 0.002s

        2018-09-01 13:13:59 (3.39 MB/s) - ‘robots.txt’ saved [7106]


Example

The following is an example of the rules written in a robots.txt file:

        User-agent: *
        Disallow: /search
        Disallow: /sdch
        Disallow: /groups
        Disallow: /images
        Disallow: /catalogs
        ...
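
Once the file has been downloaded, extracting the disallowed paths takes only a few lines. A minimal sketch in Python, assuming the file was saved as robots.txt by the wget command above:

        # Minimal sketch: list every path the site asks crawlers to avoid.
        # Assumes the file fetched above was saved as 'robots.txt'.
        with open("robots.txt") as f:
            for line in f:
                line = line.strip()
                if line.lower().startswith("disallow:"):
                    # everything after 'Disallow:' is a hidden path
                    print(line.split(":", 1)[1].strip())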


Impact

  • By reading the Disallow entries, an attacker can easily enumerate the hidden folders and sensitive paths used by the application, as the sketch below shows.
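
A minimal sketch of how an attacker could turn those Disallow entries into a directory probe (www.example.com and the path list are placeholders; probe only systems you are authorised to test):

        # Minimal sketch: probe disallowed paths to see which ones exist.
        # www.example.com is a placeholder; only probe systems you are
        # authorised to test.
        from urllib import error, request
        from urllib.parse import urljoin

        base = "http://www.example.com/"
        paths = ["/search", "/groups", "/catalogs"]  # taken from robots.txt

        for path in paths:
            url = urljoin(base, path)
            req = request.Request(url, method="HEAD")
            try:
                resp = request.urlopen(req, timeout=5)
                print(url, resp.status)   # e.g. 200 - the folder exists
            except error.HTTPError as e:
                print(url, e.code)        # 403/404 responses still leak info
            except error.URLError as e:
                print(url, "request failed:", e.reason)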

Mitigation / Precaution

Beagle recommends the following fixes:

  • Make sure the robots.txt file does not reveal details of the application's directory and internal folder structure. Keep sensitive paths out of the file and protect them with server-side access controls instead, as illustrated below.
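
For instance, instead of advertising a sensitive folder in robots.txt (the /admin-backup path below is a hypothetical example), leave it out of the file entirely and protect it with server-side authentication or access controls:

        # Risky: advertises a sensitive folder to anyone who reads the file
        User-agent: *
        Disallow: /admin-backup

        # Safer: list only harmless paths, and protect sensitive folders
        # with server-side access controls instead
        User-agent: *
        Disallow: /search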