Smart Code: How To Extract Data From Webpage in Java Using JSOUP.

Download jsoup
Non-Maven users: download the jsoup JAR from the jsoup download page and add it to your classpath.
Maven users: add this dependency to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.1</version>
</dependency>

Crossword.java

package com.smartcode;
 
import java.io.IOException;
 
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
public class Crossword {

    public static void main(String[] args) {
        try {
            // the URL must include the protocol (http://)
            Document doc = Jsoup.connect("http://www.crossword.in").get();

            // get the page title
            String title = doc.title();
            System.out.println("title : " + title);

            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // print the value of the href attribute and the link text
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            // connect().get() throws IOException on network or HTTP errors
            e.printStackTrace();
        }
    }
}
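
To try the same link extraction without a network connection, the page can be replaced by an HTML string. The sketch below is only an illustration: the class name OfflineLinks, the method absoluteLinks and the sample markup are made up for this example. It assumes the two-argument Jsoup.parse(html, baseUri) so that relative href values can be resolved with absUrl("href").

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OfflineLinks {

    // Returns every link in the markup as an absolute URL,
    // resolved against baseUri (hypothetical helper for this example).
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<String>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample markup invented for the illustration.
        String html = "<title>Demo</title>"
                + "<a href='/books'>Books</a>"
                + "<a href='http://example.com/'>External</a>";
        for (String url : absoluteLinks(html, "http://www.crossword.in/")) {
            System.out.println("link : " + url);
        }
    }
}
```

link.attr("href") returns the attribute exactly as written in the page, so relative links stay relative; absUrl is the variant to use when the full address is needed.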

Jsoup Library


To get the value of an attribute, use the Node.attr(String key) method
For the text on an element (and its combined children), use Element.text()

For HTML, use Element.html(), or Node.outerHtml() as appropriate

For example:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

Working With CSS
Selector overview
tagname: find elements by tag, e.g. a
ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: elements with attribute, e.g. [href]
[^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: elements with attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
[attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
*: all elements, e.g. *
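
The basic selectors above can be tried on a small HTML string. This is a minimal sketch, not part of the original program: the class name BasicSelectors, the helper count and the sample markup are assumptions made for the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicSelectors {

    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
            "<div id='logo' class='masthead'><img src='logo.PNG' width='500'></div>"
          + "<p data-id='7'>See <a href='/path/page.html'>the page</a> or <a name='top'>top</a>.</p>"
          + "<img src='photo.jpeg'><img src='icon.gif'>";

    // Number of elements in the sample markup that match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a",                                // by tag
            "#logo",                            // by ID
            ".masthead",                        // by class
            "[href]",                           // has the attribute
            "[^data-]",                         // attribute name prefix
            "[width=500]",                      // attribute value
            "[href*=/path/]",                   // attribute value contains
            "img[src~=(?i)\\.(png|jpe?g)]"      // attribute value matches regex
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector) + " match(es)");
        }
    }
}
```

Note the doubled backslash in the regex selector: it is needed only because the selector is written as a Java string literal.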
Selector combinations
el#id: elements with ID, e.g. div#logo
el.class: elements with class, e.g. div.masthead
el[attr]: elements with attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
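
The difference between the combinators is easiest to see side by side. The following sketch assumes a small, made-up page; the class name CombinedSelectors and the helpers count and firstText are hypothetical and exist only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CombinedSelectors {

    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
            "<div id='logo' class='masthead'>Logo</div>"
          + "<div class='content body'>"
          +   "<h1>Title</h1>"
          +   "<p>One</p>"
          +   "<p>Two <a href='/x' class='highlight'>x</a></p>"
          +   "<blockquote><p>Nested</p></blockquote>"
          + "</div>"
          + "<div class='head'>Head</div>"
          + "<div>After head</div>";

    // Number of elements in the sample markup that match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    // Text of the first match; assumes the selector matches at least one element.
    static String firstText(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).first().text();
    }

    public static void main(String[] args) {
        // Descendant selector: includes the p nested inside the blockquote.
        System.out.println(".body p           -> " + count(".body p"));
        // Child selector: only p elements directly under div.content.
        System.out.println("div.content > p   -> " + count("div.content > p"));
        // Adjacent sibling: the div right after div.head.
        System.out.println("div.head + div    -> " + firstText("div.head + div"));
        // General sibling: p elements that follow h1 under the same parent.
        System.out.println("h1 ~ p            -> " + count("h1 ~ p"));
        // Tag, attribute and class combined.
        System.out.println("a[href].highlight -> " + count("a[href].highlight"));
        // Grouping: elements matching either selector.
        System.out.println("div.masthead, div.head -> " + count("div.masthead, div.head"));
    }
}
```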
Pseudo selectors
:lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
:gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
:eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
:has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
:not(selector): find elements that do not match the selector; e.g. div:not(.logo)
:contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
:containsOwn(text): find elements that directly contain the given text
:matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
:matchesOwn(regex): find elements whose own text matches the specified regular expression
Note that the indexed pseudo-selectors above (:lt, :gt and :eq) are 0-based: the first element is at index 0, the second at 1, and so on.
See the Selector API reference for the full supported list and details.
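
A short sketch of the pseudo selectors in action, including the 0-based indexing and the difference between :contains and :containsOwn. The class name PseudoSelectors, the helpers count and firstText, and the sample markup are assumptions made for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PseudoSelectors {

    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td></tr></table>"
          + "<div><p>jsoup is handy</p></div>"
          + "<div class='logo'>Login here</div>";

    // Number of elements in the sample markup that match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    // Text of the first match; assumes the selector matches at least one element.
    static String firstText(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).first().text();
    }

    public static void main(String[] args) {
        // Indexed selectors are 0-based, so :eq(1) is the second cell.
        System.out.println("td:lt(3) -> " + count("td:lt(3)"));
        System.out.println("td:gt(2) -> " + count("td:gt(2)"));
        System.out.println("td:eq(1) -> " + firstText("td:eq(1)"));

        // Structural filters.
        System.out.println("div:has(p) -> " + count("div:has(p)"));
        System.out.println("div:not(.logo) -> " + count("div:not(.logo)"));

        // :contains looks at the text of the element and its children,
        // :containsOwn only at the text the element holds directly.
        System.out.println("div:contains(jsoup) -> " + count("div:contains(jsoup)"));
        System.out.println("div:containsOwn(jsoup) -> " + count("div:containsOwn(jsoup)"));

        // Regex match on the element text, case-insensitive via (?i).
        System.out.println("div:matches((?i)login) -> " + count("div:matches((?i)login)"));
    }
}
```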



Monday, 13 January 2014
