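HanLP's functional extensions fall mainly into the following areas:
• Keyword extraction
• Automatic summarization
• Phrase extraction
• Pinyin conversion
• Simplified/Traditional Chinese conversion
• Text recommendation
Below is the code for a HanLP tokenizer.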
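Note: the code uses the following Maven dependency:
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.3.4</version>
</dependency>
The code is written for Java 8.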
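The Java 8 features the tokenizer class relies on are lambdas and the Stream API: each method collects the non-blank words and, in most cases, removes duplicates with `distinct()`. A minimal standalone sketch of that idiom follows. The class `TokenCleanup` and the method `cleanTokens` are hypothetical names used only for illustration, and the blank check is a plain-JDK approximation of `StringUtils.isNotBlank`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TokenCleanup {

    // Hypothetical helper, not part of HanLP: drops null or blank tokens and
    // removes duplicates while keeping first-occurrence order.
    public static List<String> cleanTokens(List<String> tokens) {
        return tokens.stream()
                // A token is blank if it is empty or whitespace only.
                .filter(t -> t != null && !t.chars().allMatch(Character::isWhitespace))
                // On an ordered stream, distinct() keeps the first occurrence.
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("测试", " ", "勿动", "测试", "12");
        System.out.println(cleanTokens(raw)); // prints [测试, 勿动, 12]
    }
}
```

Because a list stream is ordered, the result keeps the words in the order they first appear in the text, which is usually what a caller of a tokenizer expects.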
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// StringUtils comes from Apache Commons Lang 3, a separate dependency.
import org.apache.commons.lang3.StringUtils;

import com.hankcs.hanlp.seg.Segment;
import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment;
import com.hankcs.hanlp.seg.NShort.NShortSegment;
import com.hankcs.hanlp.tokenizer.IndexTokenizer;
import com.hankcs.hanlp.tokenizer.NLPTokenizer;
import com.hankcs.hanlp.tokenizer.SpeedTokenizer;
import com.hankcs.hanlp.tokenizer.StandardTokenizer;

public class HanLPTokenizer {

    // N-shortest-path segmenter with place and organization recognition enabled.
    private static final Segment N_SHORT_SEGMENT = new NShortSegment()
            .enableCustomDictionary(false)
            .enablePlaceRecognize(true)
            .enableOrganizationRecognize(true);

    // Shortest-path (Dijkstra) segmenter with the same options.
    private static final Segment DIJKSTRA_SEGMENT = new DijkstraSegment()
            .enableCustomDictionary(false)
            .enablePlaceRecognize(true)
            .enableOrganizationRecognize(true);

    /** Standard segmentation; blank words are dropped and duplicates removed. */
    public static List<String> standard(String text) {
        List<String> list = new ArrayList<>();
        StandardTokenizer.segment(text).forEach(term -> {
            if (StringUtils.isNotBlank(term.word)) {
                list.add(term.word);
            }
        });
        return list.stream().distinct().collect(Collectors.toList());
    }

    /** NLP segmentation; blank words are dropped and duplicates removed. */
    public static List<String> nlp(String text) {
        List<String> list = new ArrayList<>();
        NLPTokenizer.segment(text).forEach(term -> {
            if (StringUtils.isNotBlank(term.word)) {
                list.add(term.word);
            }
        });
        return list.stream().distinct().collect(Collectors.toList());
    }

    /** Index segmentation; blank words are dropped and duplicates removed. */
    public static List<String> index(String text) {
        List<String> list = new ArrayList<>();
        IndexTokenizer.segment(text).forEach(term -> {
            if (StringUtils.isNotBlank(term.word)) {
                list.add(term.word);
            }
        });
        return list.stream().distinct().collect(Collectors.toList());
    }

    /** Speed (dictionary) segmentation; blank words are dropped, duplicates are kept. */
    public static List<String> speed(String text) {
        List<String> list = new ArrayList<>();
        SpeedTokenizer.segment(text).forEach(term -> {
            if (StringUtils.isNotBlank(term.word)) {
                list.add(term.word);
            }
        });
        return list;
    }

    /** N-shortest-path segmentation; blank words are dropped and duplicates removed. */
    public static List<String> nShort(String text) {
        List<String> list = new ArrayList<>();
        N_SHORT_SEGMENT.seg(text).forEach(term -> {
            if (StringUtils.isNotBlank(term.word)) {
                list.add(term.word);
            }
        });
        return list.stream().distinct().collect(Collectors.toList());
    }

    /** Shortest-path segmentation; blank words are dropped and duplicates removed. */
    public static List<String> shortest(String text) {
        List<String> list = new ArrayList<>();
        DIJKSTRA_SEGMENT.seg(text).forEach(term -> {
            if (StringUtils.isNotBlank(term.word)) {
                list.add(term.word);
            }
        });
        return list.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "测试勿动12";
        System.out.println("Standard segmentation: " + standard(text));
        System.out.println("NLP segmentation: " + nlp(text));
        System.out.println("Index segmentation: " + index(text));
        System.out.println("N-shortest-path segmentation: " + nShort(text));
        System.out.println("Shortest-path segmentation: " + shortest(text));
        System.out.println("Speed dictionary segmentation: " + speed(text));
    }
}
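Of the modes above, the speed dictionary segmentation is the simplest to picture: HanLP's documentation describes it as dictionary-based longest matching. The following is only a toy sketch of that general idea (forward maximum matching over a plain `HashSet`), not HanLP's implementation, which is far more efficient. The class `LongestMatchDemo`, the two-word dictionary, and the `maxWordLen` parameter are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LongestMatchDemo {

    // Toy forward-maximum-matching segmenter (illustrative only).
    // At each position it takes the longest dictionary word starting there;
    // if no dictionary word matches, it emits a single character and moves on.
    public static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            String match = null;
            // Try the longest candidate first, then shrink.
            for (int j = end; j > i; j--) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                match = text.substring(i, i + 1);
            }
            out.add(match);
            i += match.length();
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("测试", "勿动"));
        System.out.println(segment("测试勿动12", dict, 4)); // prints [测试, 勿动, 1, 2]
    }
}
```

The toy splits "12" into single characters because its dictionary has no entry for it; a real tokenizer handles numbers and unknown words differently, so this output should not be read as what `SpeedTokenizer` returns.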
Source: 猴德华's blog
Reposted from: http://aicwo.baihongyu.com/