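I've been learning Java recently and wrote an example that collects all the links (hyperlinks) on a web page; practice is the best way to learn, after all. It uses the htmlparser package.
htmlparser is a package for parsing HTML pages, and it is very handy.
htmlparser project home page: http://htmlparser.sourceforge.net/ . Look for the download link there, which leads to this page:
http://sourceforge.net/projects/htmlparser/files/
Unpack the archive, then add htmlparser.jar and htmllexer.jar from the lib directory of HTMLParser to your classpath; the HTML Parser API can only be used once they are there.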
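To check that the jars really ended up on the classpath, a small sketch like the one below could be run first. ClasspathCheck and isAvailable are names made up for this illustration; org.htmlparser.Parser is the class the code in this post calls.

```java
// Hypothetical helper, not part of htmlparser: checks whether a class can be loaded.
final class ClasspathCheck {

    // Returns true if the named class can be loaded from the current classpath.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found but could not be linked or initialized.
            return false;
        }
    }

    public static void main(String[] args) {
        // Should report true once the htmlparser jars are on the classpath.
        System.out.println("org.htmlparser.Parser available: "
                + isAvailable("org.htmlparser.Parser"));
    }
}
```

If it reports false, the classpath given to java (and javac) is probably missing one of the jars.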
Here is the Java code that gets all the URL links on a web page:
import java.io.*;
import java.net.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.htmlparser.filters.*;
import org.htmlparser.*;
import org.htmlparser.nodes.*;
import org.htmlparser.tags.*;
import org.htmlparser.util.*;
import org.htmlparser.visitors.*;
public class GetUrl2 {
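    public static void main(String[] args) throws Exception {
        String mystr = getwangye2("https://www.lpfrx.com/");
        // Check the string's content; == would only compare object references
        if (mystr.isEmpty()) {
            System.out.println("Failed to fetch the page data, exiting");
            System.exit(1);
        }
        // Size of the page in bytes (platform default charset)
        System.out.println(mystr.getBytes().length);
        System.out.println(writetxt(mystr, "GetUrl2.txt"));
        mygeturl(mystr);
    }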
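    // Handles errors while fetching the page data: tries up to three times
    public static String getwangye2(String lasturl) {
        int kk = 3;
        String sss = "";
        while (kk > 0) {
            sss = getwangye(lasturl);
            // !sss.equals("null") means "sss is not the text null"; writing sss != "null"
            // would compare object references in Java, not the characters
            if (!sss.equals("null")) {
                System.out.println("Page data fetched OK");
                return sss;
            }
            System.out.println("Error fetching page data, trying again");
            try {
                Thread.sleep(1500);
            } catch (InterruptedException e) {
                System.out.println(e);
                System.out.println("Interrupted while sleeping");
            }
            kk--;
        }
        return "";
    }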
    // Fetches the page data; returns the string "null" if anything goes wrong
    public static String getwangye(String surl) {
        String fanstr = "null";
        try {
            URL url = new URL(surl);
            URLConnection conn = url.openConnection();
            conn.setRequestProperty(
                    "User-Agent",
                    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0");
            //conn.connect();
            InputStream is = conn.getInputStream();
            BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            String ss = "";
            StringBuffer sb = new StringBuffer("");
            // Read the response line by line, putting a CRLF back after each line
            while ((ss = br.readLine()) != null) {
                sb.append(ss + "\r\n");
            }
            br.close();
            // fanstr=sb.toString();
            return sb.toString();
        } catch (Exception e) {
            //System.out.println(e);
            // Print the full stack trace as text
            Writer w = new StringWriter();
            e.printStackTrace(new PrintWriter(w));
            String errorstr = "system error: " + w.toString();
            System.out.println(errorstr);
            return fanstr;
        }
    }
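    // Writes the text to a file; returns "ok" on success and "no" on failure
    public static String writetxt(String mystr, String xstr) {
        FileWriter fw = null;
        try {
            fw = new FileWriter(xstr);
            fw.write(mystr, 0, mystr.length());
            fw.flush();
            return "ok";
        } catch (IOException e) {
            System.out.println(e);
            return "no";
        } finally {
            // Close the file even if writing failed (JDK 6 has no try-with-resources)
            if (fw != null) {
                try {
                    fw.close();
                } catch (IOException ignored) {
                    // Nothing more can be done here
                }
            }
        }
    }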
    // Parses the HTML and prints the target of every link tag
    public static void mygeturl(String mysss) throws ParserException {
        Parser myParser;
        myParser = Parser.createParser(mysss, "UTF-8");
        NodeList list = myParser.parse(new NodeClassFilter(LinkTag.class));
        for (int i = 0; i < list.size(); i++) {
            // System.out.println(list.elementAt(i).toHtml());
            LinkTag link = (LinkTag) list.elementAt(i);
            String urlLink = link.extractLink();
            String linkName = link.getLinkText();
            // Print the link
            System.out.println(urlLink);
            //if (urlLink.indexOf("http") == 0)
            //System.out.println(linkName + ":" + urlLink);
            //System.out.println(urlLink);
        }
    }
}
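The comment in getwangye2 about equals versus != deserves a tiny demonstration. The sketch below is separate from GetUrl2; StringCompareDemo and its methods are hypothetical names used only to show that two strings with the same characters can still be different objects.

```java
// Hypothetical demo class, not part of the program above.
final class StringCompareDemo {

    // Builds a new String object holding the same characters as s (s must be non-empty).
    static String copyOf(String s) {
        return new StringBuilder(s).toString();
    }

    // Reference comparison: true only if a and b are the very same object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    // Content comparison: true if a and b contain the same characters.
    static boolean sameText(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "null";
        String built = copyOf(literal);
        System.out.println(sameObject(literal, built)); // false
        System.out.println(sameText(literal, built));   // true
    }
}
```

This is why the retry loop tests !sss.equals("null"), and why an empty result is safer to check with isEmpty() or equals("") than with ==.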
Build environment: Win7, JDK 6.
Relative paths are not handled yet.
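One possible way to handle relative paths is to resolve each extracted link against the URL of the page it came from, using java.net.URI from the standard library. This is only a sketch under assumptions: LinkResolver is a made-up helper, the page path /archives/index.html and the sample links are invented for the example, and links containing characters that are illegal in a URI (spaces, for instance) would make URI.create throw IllegalArgumentException.

```java
import java.net.URI;

// Hypothetical helper, not part of GetUrl2 or htmlparser.
final class LinkResolver {

    // Resolves a link found on a page against that page's URL.
    // Absolute links come back unchanged; relative ones are joined to the base.
    // The page URL should include a path (at least a trailing "/").
    static String resolve(String pageUrl, String link) {
        URI base = URI.create(pageUrl);
        return base.resolve(link.trim()).toString();
    }

    public static void main(String[] args) {
        // Assumed page URL and links, for illustration only.
        String page = "https://www.lpfrx.com/archives/index.html";
        String[] links = {
            "post.html",                          // https://www.lpfrx.com/archives/post.html
            "../about.html",                      // https://www.lpfrx.com/about.html
            "/tags/java",                         // https://www.lpfrx.com/tags/java
            "http://htmlparser.sourceforge.net/"  // unchanged
        };
        for (int i = 0; i < links.length; i++) {
            System.out.println(links[i] + " -> " + resolve(page, links[i]));
        }
    }
}
```

In mygeturl, the value returned by extractLink() could be passed through such a helper together with the URL that was fetched, before printing it.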
I'm a Java beginner, so if any experts pass by, please point out my mistakes.