Learning from Open-Source Tools: Web Crawlers

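Today I analyzed several open-source website crawlers. Their main purpose is to help security testers exercise a site's functionality and find its vulnerabilities. In the spirit of learning, I read their source code to understand the core techniques, which helps when writing our own scripts and applying them in real work to improve efficiency.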

Reference tools

gospider

https://github.com/jaeles-project/gospider/blob/master/core/sitemap.go

Common sitemap paths:

sitemapUrls := []string{"/sitemap.xml", "/sitemap_news.xml", "/sitemap_index.xml", "/sitemap-index.xml", "/sitemapindex.xml",
    "/sitemap-news.xml", "/post-sitemap.xml", "/page-sitemap.xml", "/portfolio-sitemap.xml", "/home_slider-sitemap.xml", "/category-sitemap.xml",
    "/author-sitemap.xml"}

Fetching a site's sitemaps yields a list of links the site itself publishes.
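
As a rough illustration of the idea, the sketch below builds candidate sitemap URLs from a few of the paths listed above and pulls the `<loc>` entries out of a sitemap document. The function names and the sample XML are made up for this example, and no network request is made; this is not how gospider itself is implemented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# A subset of the sitemap paths listed above.
SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_news.xml", "/sitemap_index.xml"]


def candidate_sitemaps(base_url):
    """Return the sitemap URLs worth probing for a site (hypothetical helper)."""
    return [urljoin(base_url, path) for path in SITEMAP_PATHS]


def extract_locs(xml_text):
    """Collect the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    locs = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; keep only the local name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            locs.append(element.text.strip())
    return locs


# Made-up sitemap body used only to exercise the parser.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc> https://example.com/b </loc></url>"
    "</urlset>"
)

if __name__ == "__main__":
    print(candidate_sitemaps("https://example.com"))
    print(extract_locs(SAMPLE_SITEMAP))
```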

URL extraction regex. If an extracted URL does not include the site's host, it is repaired by resolving it against the site:

(?:"|')(((?:[a-zA-Z]{1,10}://|//)[^"'/]{1,}\.[a-zA-Z]{2,}[^"']{0,})|((?:/|\.\./|\./)[^"'><,;| *()(%%$^/\\\[\]][^"'><,;|()]{1,})|([a-zA-Z0-9_\-/]{1,}/[a-zA-Z0-9_\-/]{1,}\.(?:[a-zA-Z]{1,4}|action)(?:[\?|#][^"|']{0,}|))|([a-zA-Z0-9_\-/]{1,}/[a-zA-Z0-9_\-/]{3,}(?:[\?|#][^"|']{0,}|))|([a-zA-Z0-9_\-]{1,}\.(?:php|asp|aspx|jsp|json|action|html|js|txt|xml)(?:[\?|#][^"|']{0,}|)))(?:"|')
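
The repair step can be sketched with the standard library: relative and protocol-relative links are resolved against the page they were found on. `repair_links` is a hypothetical helper written for this example, not gospider's own function.

```python
from urllib.parse import urljoin


def repair_links(page_url, links):
    """Resolve extracted links against the page URL so each one is absolute."""
    # urljoin leaves absolute URLs untouched and resolves "/x", "../x",
    # "x/y" and "//host/x" forms relative to page_url.
    return [urljoin(page_url, link) for link in links]


if __name__ == "__main__":
    found = ["/login", "../api/v1", "//cdn.example.com/a.js", "api/users.json"]
    for url in repair_links("https://example.com/app/index.html", found):
        print(url)
```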

Blacklisted file extensions:

(?i)\.(png|apng|bmp|gif|ico|cur|jpg|jpeg|jfif|pjp|pjpeg|svg|tif|tiff|webp|xbm|3gp|aac|flac|mpg|mpeg|mp3|mp4|m4a|m4v|m4p|oga|ogg|ogv|mov|wav|webm|eot|woff|woff2|ttf|otf|css)(?:\?|#|$)
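
A minimal sketch of applying this blacklist in Python: the pattern is copied from above, while `is_static_asset` is a name invented for this example.

```python
import re

# Extension blacklist copied from the pattern above.
BLACKLIST_RE = re.compile(
    r"(?i)\.(png|apng|bmp|gif|ico|cur|jpg|jpeg|jfif|pjp|pjpeg|svg|tif|tiff|webp|xbm"
    r"|3gp|aac|flac|mpg|mpeg|mp3|mp4|m4a|m4v|m4p|oga|ogg|ogv|mov|wav|webm"
    r"|eot|woff|woff2|ttf|otf|css)(?:\?|#|$)"
)


def is_static_asset(url):
    """True when the URL ends in a blacklisted extension (before any ? or #)."""
    return BLACKLIST_RE.search(url) is not None


if __name__ == "__main__":
    urls = ["https://example.com/logo.png?v=1", "https://example.com/app.js"]
    print([u for u in urls if not is_static_asset(u)])
```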

Extensions handled as JavaScript-like resources:

if fileExt == ".js" || fileExt == ".xml" || fileExt == ".json" || fileExt == ".map" {

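If a JS file name ends in min.js (minified JS, which is compressed rather than encrypted), try to fetch the readable original by removing the min part and requesting that URL.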
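
One way to derive the candidate URL is sketched below. `unminified_candidate` is a hypothetical helper, and whether the unminified file actually exists depends on the server.

```python
from urllib.parse import urlsplit, urlunsplit


def unminified_candidate(url):
    """Return the URL with ".min.js" replaced by ".js", or None if not minified."""
    parts = urlsplit(url)
    suffix = ".min.js"
    if not parts.path.endswith(suffix):
        return None
    new_path = parts.path[: -len(suffix)] + ".js"
    # Keep scheme, host, query string and fragment unchanged.
    return urlunsplit(parts._replace(path=new_path))


if __name__ == "__main__":
    print(unminified_candidate("https://example.com/static/app.min.js?v=3"))
```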

Blacklisted status codes:

response.StatusCode == 404 || response.StatusCode == 429 || response.StatusCode < 100

When a response body is very large, it is split up before the regex is applied:

if len(source) > 1000000 {
    source = strings.ReplaceAll(source, ";", ";\r")
    source = strings.ReplaceAll(source, ",", ",\r")
}
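
The same step as a Python sketch, presumably done so that the regex scans many short lines instead of one enormous one. The function name and the `threshold` parameter are assumptions made for this example.

```python
SPLIT_THRESHOLD = 1_000_000  # same cutoff as the Go snippet above


def split_long_source(source, threshold=SPLIT_THRESHOLD):
    """Insert a line break after every ';' and ',' once the body is very large."""
    if len(source) > threshold:
        source = source.replace(";", ";\r").replace(",", ",\r")
    return source


if __name__ == "__main__":
    # A tiny threshold is used here only to make the effect visible.
    print(repr(split_long_source("a;b,c", threshold=3)))
```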

Domain-matching regex:

const SUBRE = `(?i)(([a-zA-Z0-9]{1}|[_a-zA-Z0-9]{1}[_a-zA-Z0-9-]{0,61}[a-zA-Z0-9]{1})[.]{1})+`
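
SUBRE only describes the labels in front of a domain, so the sketch below assumes it is meant to be combined with the escaped target domain to find subdomains in a response body. `find_subdomains` and the sample text are invented for this example.

```python
import re

# Label prefix copied from SUBRE above.
SUBRE = r"(?i)(([a-zA-Z0-9]{1}|[_a-zA-Z0-9]{1}[_a-zA-Z0-9-]{0,61}[a-zA-Z0-9]{1})[.]{1})+"


def find_subdomains(text, domain):
    """Return the sorted, de-duplicated subdomains of `domain` seen in `text`."""
    # Assumption: the prefix is followed directly by the literal target domain.
    pattern = re.compile(SUBRE + re.escape(domain))
    return sorted({match.group(0).lower() for match in pattern.finditer(text)})


if __name__ == "__main__":
    body = "Visit api.example.com or cdn.img.example.com, not example.org."
    print(find_subdomains(body, "example.com"))
```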

Regex for S3 buckets:

var AWSS3 = regexp.MustCompile(`(?i)[a-z0-9.-]+\.s3\.amazonaws\.com|[a-z0-9.-]+\.s3-[a-z0-9-]\.amazonaws\.com|[a-z0-9.-]+\.s3-website[.-](eu|ap|us|ca|sa|cn)|//s3\.amazonaws\.com/[a-z0-9._-]+|//s3-[a-z0-9-]+\.amazonaws\.com/[a-z0-9._-]+`)
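
For illustration, the same pattern can be run over a response body in Python. The pattern is copied unchanged from above; `find_s3_buckets` and the sample text are hypothetical.

```python
import re

# Pattern copied unchanged from the Go source above.
AWS_S3_RE = re.compile(
    r"(?i)[a-z0-9.-]+\.s3\.amazonaws\.com"
    r"|[a-z0-9.-]+\.s3-[a-z0-9-]\.amazonaws\.com"
    r"|[a-z0-9.-]+\.s3-website[.-](eu|ap|us|ca|sa|cn)"
    r"|//s3\.amazonaws\.com/[a-z0-9._-]+"
    r"|//s3-[a-z0-9-]+\.amazonaws\.com/[a-z0-9._-]+"
)


def find_s3_buckets(text):
    """Return every S3 bucket reference found in the text, in order."""
    # finditer + group(0) is used because the pattern contains a capture group.
    return [match.group(0) for match in AWS_S3_RE.finditer(text)]


if __name__ == "__main__":
    sample = "a https://mybucket.s3.amazonaws.com/x b //s3.amazonaws.com/other/file.txt"
    print(find_s3_buckets(sample))
```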

Extracting links from robots.txt and crawling them can reveal directories that search engines do not index.
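
A minimal sketch of that idea: read the Allow and Disallow rules and turn the concrete paths into crawlable URLs. The helper name, the decision to skip wildcard rules, and the sample file are choices made for this example rather than gospider's behavior.

```python
from urllib.parse import urljoin


def robots_to_urls(base_url, robots_txt):
    """Turn Allow/Disallow paths from a robots.txt body into absolute URLs."""
    urls = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() not in ("allow", "disallow"):
            continue
        path = value.strip()
        # Skip empty rules and wildcard patterns: they are not concrete paths.
        if not path or "*" in path or "$" in path:
            continue
        url = urljoin(base_url, path)
        if url not in urls:
            urls.append(url)
    return urls


# Made-up robots.txt body used only for demonstration.
SAMPLE_ROBOTS = (
    "User-agent: *\n"
    "Disallow: /admin/\n"
    "Allow: /public/index.html\n"
    "Disallow: /*.pdf$\n"
    "Disallow:\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)

if __name__ == "__main__":
    print(robots_to_urls("https://example.com", SAMPLE_ROBOTS))
```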

gau (getallurls)

https://github.com/lc/gau

The core idea is to pull information related to the target from several third-party sites:

1. http://index.commoncrawl.org/collinfo.json

2. https://otx.alienvault.com/

3. https://urlscan.io/

4. https://web.archive.org/cdx/search/cdx
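
To show what querying one of these sources involves, the sketch below builds a request URL for the Wayback Machine CDX endpoint and parses a JSON response. The query parameters (`url`, `output`, `fl`, `collapse`) and the helper names are assumptions chosen for this example, not necessarily the exact request gau sends, and the sample response is made up.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, include_subdomains=True):
    """Build a CDX query URL asking for the archived URLs of a domain."""
    pattern = f"*.{domain}/*" if include_subdomains else f"{domain}/*"
    params = {
        "url": pattern,
        "output": "json",       # rows come back as a JSON array of arrays
        "fl": "original",       # only return the original URL column
        "collapse": "urlkey",   # collapse repeated captures of the same URL
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(body):
    """Extract URLs from a CDX JSON body; the first row is the column header."""
    rows = json.loads(body) if body.strip() else []
    return [row[0] for row in rows[1:]]


if __name__ == "__main__":
    print(build_cdx_query("example.com"))
    print(parse_cdx_json('[["original"],["http://example.com/a?x=1"]]'))
```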

linkfinder

https://github.com/GerbenJavado/LinkFinder

The regex it uses to extract URLs from page content:

  (?:"|')                               # Start newline delimiter
  (
    ((?:[a-zA-Z]{1,10}://|//)           # Match a scheme [a-Z]*1-10 or //
    [^"'/]{1,}\.                        # Match a domainname (any character + dot)
    [a-zA-Z]{2,}[^"']{0,})              # The domainextension and/or path
    |
    ((?:/|\.\./|\./)                    # Start with /,../,./
    [^"'><,;| *()(%%$^/\\\[\]]          # Next character can't be...
    [^"'><,;|()]{1,})                   # Rest of the characters can't be
    |
    ([a-zA-Z0-9_\-/]{1,}/               # Relative endpoint with /
    [a-zA-Z0-9_\-/]{1,}                 # Resource name
    \.(?:[a-zA-Z]{1,4}|action)          # Rest + extension (length 1-4 or action)
    (?:[\?|#][^"|']{0,}|))              # ? or # mark with parameters
    |
    ([a-zA-Z0-9_\-/]{1,}/               # REST API (no extension) with /
    [a-zA-Z0-9_\-/]{3,}                 # Proper REST endpoints usually have 3+ chars
    (?:[\?|#][^"|']{0,}|))              # ? or # mark with parameters
    |
    ([a-zA-Z0-9_\-]{1,}                 # filename
    \.(?:php|asp|aspx|jsp|json|
         action|html|js|txt|xml)        # . + extension
    (?:[\?|#][^"|']{0,}|))              # ? or # mark with parameters
  )
  (?:"|')                               # End newline delimiter

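This is the same pattern that gospider uses, laid out in verbose form with a comment on each part, which makes it a good one to study.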
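
A commented pattern like this can be compiled in Python with `re.VERBOSE`, which ignores the layout whitespace and the `#` comments. The sketch below uses a deliberately simplified two-branch variant (absolute URLs and paths starting with /, ../ or ./) instead of the full expression, and the sample JavaScript is made up.

```python
import re

# Simplified two-branch variant of the pattern above, for illustration only.
LINK_RE = re.compile(
    r"""
    (?:"|')                          # opening quote
    (
      (?:[a-zA-Z]{1,10}://|//)       # scheme or protocol-relative prefix
      [^"'/]{1,}\.                   # host labels up to the last dot
      [a-zA-Z]{2,}[^"']{0,}          # top-level domain plus optional path
      |
      (?:/|\.\./|\./)                # relative path starting with /, ../ or ./
      [^"'><,;|()\s]{1,}             # rest of the path
    )
    (?:"|')                          # closing quote
    """,
    re.VERBOSE,
)


def extract_links(source):
    """Return every quoted URL or path found in the source text."""
    return LINK_RE.findall(source)


if __name__ == "__main__":
    js = 'fetch("/api/v1/users?id=1"); var u = "https://cdn.example.com/lib.js"; var s = "hello";'
    print(extract_links(js))
```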

The tool is written in Python. Its input can be a URL, a file, or a directory, so it also works for analyzing local data.

waybackurls

https://github.com/tomnomnom/waybackurls

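It does not access the target site directly. Like gau, it gathers information from several third-party sites:

1. http://web.archive.org/cdx/search/cdx

2. http://index.commoncrawl.org/

3. https://www.virustotal.com/

Relying on data from other platforms is fast, and it avoids sending any requests to the target site.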

hakrawler

https://github.com/hakluke/hakrawler

Its URL-matching regex is fairly simple and only matches URLs related to the target:

c.URLFilters = []*regexp.Regexp{regexp.MustCompile(".*(\\.|\\/\\/)" + strings.ReplaceAll(hostname, ".", "\\.") + "((#|\\/|\\?).*)?")}
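
The same filter expressed in Python, as a sketch: the hostname is escaped and must be preceded by a dot or by `//`. `build_scope_filter` is a hypothetical name, and `search` is used on the assumption that the filter is applied unanchored.

```python
import re


def build_scope_filter(hostname):
    """Compile a regex that accepts URLs on `hostname` or on its subdomains."""
    return re.compile(r".*(\.|//)" + re.escape(hostname) + r"((#|/|\?).*)?")


def in_scope(scope_filter, url):
    """True when the URL matches the scope filter anywhere in the string."""
    return scope_filter.search(url) is not None


if __name__ == "__main__":
    scope = build_scope_filter("example.com")
    for candidate in ["https://sub.example.com/?q=1", "https://notexample.com/"]:
        print(candidate, in_scope(scope, candidate))
```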

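The tool is simple and has few features. It only collects the URLs found in a page and does not automatically crawl on to other URLs, so it is suitable for testing a single site.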

paramspider

https://github.com/devanshbatham/ParamSpider

A tool written in Python that mainly matches URLs carrying parameters. The regex:

 regexp : r'.*?:\/\/.*\?.*\=[^$]'

It can only match URLs with GET parameters. There are two data sources. The first is a query against a third-party platform:

https://web.archive.org/cdx/search/cdx

The second is fetching the page content directly. The regex is about the only part worth borrowing.
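
A short sketch of using that regex as a filter follows. The pattern is the one quoted above, and `has_get_parameters` is a name invented for this example. Note that `[^$]` means "any character other than a literal `$`", so at least one character must follow the `=`.

```python
import re

# Pattern quoted above: scheme, "?", and "=" followed by at least one character.
PARAM_URL_RE = re.compile(r".*?:\/\/.*\?.*\=[^$]")


def has_get_parameters(url):
    """True when the URL carries a query parameter with a non-empty value."""
    return PARAM_URL_RE.match(url) is not None


if __name__ == "__main__":
    urls = ["https://example.com/p?id=1", "https://example.com/p", "https://example.com/p?id="]
    print([u for u in urls if has_get_parameters(u)])
```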

Summary

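These are the open-source tools I collected that extract URLs from web pages. URLs are gathered for two main purposes. One is to analyze whether the parameters in a URL are vulnerable. The other is to crawl content layer by layer to obtain more information, such as subdomains, parameterized URLs, and hidden functionality, which widens the coverage of website testing. Besides crawling, directory enumeration can also uncover hidden functionality.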