Monday 13 January 2014

How to Extract Data From a Web Page in Java Using jsoup

Download jsoup
    -> Non-Maven users: download the jsoup API from here
For Maven users, add this dependency to pom.xml:
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.1</version>
  </dependency>
Crossword.java
package com.smartcode;
 
import java.io.IOException;
 
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
public class Crossword {

    public static void main(String[] args) {
        Document doc;
        try {
            // the URL must include the protocol (http://)
            doc = Jsoup.connect("http://www.crossword.in").get();

            // get page title
            String title = doc.title();
            System.out.println("title : " + title);

            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get the value from href attribute
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            // thrown when the page cannot be fetched (network error, bad HTTP status, ...)
            e.printStackTrace();
        }
    }
}
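The program above prints each href exactly as it is written in the page, so relative links such as /books come out without a host. As a minimal sketch of one way to get full URLs, the example below parses an HTML string together with a base URI and calls Element.absUrl("href"). The class name AbsoluteLinks, the helper absoluteLinks and the sample markup are made up for illustration; it parses a string instead of fetching the live site so that it runs without a network connection.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteLinks {

    // Hypothetical helper: returns every link in the HTML as an absolute URL,
    // resolved against baseUri.
    public static List<String> absoluteLinks(String html, String baseUri) {
        // The base URI tells jsoup how to resolve relative href values.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<String>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample markup (an assumption, not taken from the real site).
        String html = "<p><a href='/books'>Books</a> "
                    + "<a href='http://example.com/'>Example</a></p>";
        for (String url : absoluteLinks(html, "http://www.crossword.in/")) {
            System.out.println(url);
        }
    }
}
```

With the sample markup this prints http://www.crossword.in/books followed by http://example.com/. When the document comes from Jsoup.connect(...).get(), the base URI is already set from the page URL, so absUrl("href") can be called on the links directly.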
Jsoup Library

  • To get the value of an attribute, use the Node.attr(String key) method
  • For the text on an element (and its combined children), use Element.text()
  • For HTML, use Element.html(), or Node.outerHtml() as appropriate

For example:

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
String linkOuterH = link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

Working With CSS Selectors

Selector overview

  • tagname: find elements by tag, e.g. a
  • ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
  • #id: find elements by ID, e.g. #logo
  • .class: find elements by class name, e.g. .masthead
  • [attribute]: elements with attribute, e.g. [href]
  • [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
  • [attr=value]: elements with attribute value, e.g. [width=500]
  • [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
  • [attr~=regex]: elements with attribute values that match the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
  • *: all elements, e.g. *

Selector combinations

  • el#id: elements with ID, e.g. div#logo
  • el.class: elements with class, e.g. div.masthead
  • el[attr]: elements with attribute, e.g. a[href]
  • Any combination, e.g. a[href].highlight
  • ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
  • parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements that are direct children of div.content, and body > * finds the direct children of the body tag
  • siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
  • siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
  • el, el, el: group multiple selectors, find unique elements that match any of the selectors, e.g. div.masthead, div.logo

Pseudo selectors

  • :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n, e.g. td:lt(3)
  • :gt(n): find elements whose sibling index is greater than n, e.g. div p:gt(2)
  • :eq(n): find elements whose sibling index is equal to n, e.g. form input:eq(1)
  • :has(selector): find elements that contain elements matching the selector, e.g. div:has(p)
  • :not(selector): find elements that do not match the selector, e.g. div:not(.logo)
  • :contains(text): find elements that contain the given text. The search is case-insensitive, e.g. p:contains(jsoup)
  • :containsOwn(text): find elements that directly contain the given text
  • :matches(regex): find elements whose text matches the specified regular expression, e.g. div:matches((?i)login)
  • :matchesOwn(regex): find elements whose own text matches the specified regular expression

Note that the indexed pseudo-selectors above are 0-based: the first element is at index 0, the second at 1, and so on.

See the Selector API reference for the full supported list and details.
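To make a few of the selectors above concrete, here is a small sketch that runs them against an inline HTML snippet and counts the matches. The class name SelectorDemo, the helper count and the sample markup are assumptions made up for this illustration; only Jsoup.parse and Document.select come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {

    // Sample markup (hypothetical) that exercises several selector types.
    static final String HTML =
          "<div class='masthead'><a href='/path/home'>Home</a></div>"
        + "<div class='content'>"
        +   "<p>Parsing with jsoup</p>"
        +   "<blockquote><p>Nested paragraph</p></blockquote>"
        + "</div>"
        + "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>";

    // Hypothetical helper: number of elements in the sample that match the selector.
    public static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "div.masthead",        // el.class
            "a[href*=/path/]",     // attribute value contains
            "div.content p",       // any descendant
            "div.content > p",     // direct children only
            "td:lt(3)",            // sibling index 0, 1 and 2
            "p:contains(JSOUP)",   // case-insensitive text search
            "div:has(blockquote)", // contains a matching element
            "div:not(.masthead)"   // negation
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```

On this sample, div.content p matches both paragraphs while div.content > p matches only the one that is a direct child, and td:lt(3) matches the first three cells because the index is 0-based.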
keywords: java,spring,jsoup,extracting data from webpage, extracting data from website, data extraction etc.

1 comment:

  1. Anonymous, 7:44:00 pm

    Yet another absolutely worthless post.
