A Step-by-Step Approach to Implementing a Web Crawler with Regular Expressions

ns868 2018-12-12 17:23:55 4172

1. To simulate crawling a web page, first deploy a page named 1.html on our own Tomcat server. (Deployment steps: create 1.html in the ROOT directory under Tomcat's webapps directory, then edit it with Notepad++ so that it contains some email addresses.)

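2. Connect to the page through a URL.
3. Open an input stream to read the content of the page.
4. Define the regex rule. Since we are crawling the email addresses on the page, build a regular expression that matches email addresses: String regex="\\w+@\\w+(\\.\\w+)+";
5. Store the extracted data in a collection.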
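Before wiring the pattern to a URL, the rule from step 4 can be tried on an ordinary string. The following is a minimal sketch, not part of the original program: the class name EmailRegexSketch, the helper findEmails, and the sample text are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to try the email pattern in isolation.
public class EmailRegexSketch {

    // Same rule as in step 4: word chars, '@', word chars, then one or more ".word" groups.
    static final Pattern EMAIL = Pattern.compile("\\w+@\\w+(\\.\\w+)+");

    // Returns every substring of text that matches the email pattern, in order.
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample line standing in for one line of the page.
        String sample = "<p>Contact: alice@example.com or bob@mail.example.com.cn</p>";
        System.out.println(findEmails(sample));
        // prints [alice@example.com, bob@mail.example.com.cn]
    }
}
```

Note that \w does not match '-' or '+', so an address such as first-last@example.com would only be matched from "last" onward. The pattern is a simple rule for this exercise rather than a full email validator.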

Code:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Web crawler: a program that collects data matching a given rule from the internet.
 */
public class RegexDemo {

    public static void main(String[] args) throws Exception {
        List<String> list = getMailByWeb();
        for (String str : list) {
            System.out.println(str);
        }
    }

    private static List<String> getMailByWeb() throws Exception {
        // 1. Connect to the page through a URL.
        //    A single slash before the file name is enough; nothing needs escaping here.
        String path = "http://localhost:8080/1.html";
        URL url = new URL(path);

        // 3. Rule that matches email addresses, compiled into a Pattern object.
        String regex = "\\w+@\\w+(\\.\\w+)+";
        Pattern p = Pattern.compile(regex);

        // Collection that holds the extracted data.
        List<String> list = new ArrayList<String>();

        // 2. Open the input stream and wrap it in a buffered reader.
        //    try-with-resources closes both when reading is finished.
        try (InputStream is = url.openStream();
             BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
            String line = null;
            while ((line = br.readLine()) != null) {
                // Matcher for the current line.
                Matcher m = p.matcher(line);
                while (m.find()) {
                    // 4. Store every match in the collection.
                    list.add(m.group());
                }
            }
        }
        return list;
    }
}

Note: start the Tomcat server before running the program.
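If Tomcat is not running, the read-and-match loop can still be exercised against an in-memory page. The sketch below is an illustration under assumptions, not the original program: the class name OfflineCrawlSketch, the method extractEmails, and the page content are hypothetical, and a StringReader stands in for the stream returned by the URL.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class: same line-by-line matching as the crawler, but without a server.
public class OfflineCrawlSketch {

    // Reads the source line by line and collects every email-like match.
    static List<String> extractEmails(BufferedReader reader) throws IOException {
        Pattern p = Pattern.compile("\\w+@\\w+(\\.\\w+)+");
        List<String> list = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = p.matcher(line);
            while (m.find()) {
                list.add(m.group());
            }
        }
        return list;
    }

    public static void main(String[] args) throws IOException {
        // Made-up page content standing in for 1.html.
        String page = "<html><body>\n"
                + "<p>sales@example.com</p>\n"
                + "<p>no address on this line</p>\n"
                + "<p>support@example.org, admin@example.net</p>\n"
                + "</body></html>";
        try (BufferedReader reader = new BufferedReader(new StringReader(page))) {
            for (String mail : extractEmails(reader)) {
                System.out.println(mail);
            }
        }
    }
}
```

Because extractEmails only depends on a BufferedReader, the same method could be handed a reader built from url.openStream() once the server is available.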


Editor: ns868
