client Package

webLyzard web service clients

classifier Module

Created on Jan 16, 2013
class weblyzard_api.client.classifier.Classifier(url='http://localhost:8080', usr=None, pwd=None)
Bases: eWRT.ws.rest.MultiRESTClient

Provides support for text classification.

Parameters:
- url – URL of the classifier web service
- usr – optional user name
- pwd – optional password
CLASSIFIER_WS_BASE_PATH = '/joseph/rest/'
classify_v2(classifier_profile, weblyzard_xml, search_agents=None, num_results=1)
Classifies webLyzard XML documents based on the given classifier profile, using the new classifier interface.

Parameters:
- classifier_profile – the profile to use for classification (e.g. 'COMET', 'MK')
- weblyzard_xml – webLyzard XML representation of the document to classify
- search_agents – a list of search agent dictionaries, composed as follows:

  [{"name": "Axa Winterthur",
    "id": 9,
    "product_list": [{"name": "AXA WINTERTHUR VERS. PRODUKTE RP", "id": 300682},
                     {"name": "AXA WINTERTHUR FINANZ PERSONEN RP", "id": 300803},
                     {"name": "AXA WINTERTHUR FINANZ PRODUKTE RP", "id": 300804}]}]
- num_results – number of classes to return
Returns: the classification result
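A minimal usage sketch, assuming a reachable classifier endpoint; the URL, the 'COMET' profile, and the placeholder XML document are assumptions, not values defined by this API:

from weblyzard_api.client.classifier import Classifier

client = Classifier('http://localhost:8080')  # placeholder URL

# placeholder: a webLyzard XML document, e.g. produced by Jeremia
weblyzard_xml = '<wl:page ...>...</wl:page>'

# request the three most likely classes for the document
result = client.classify_v2('COMET', weblyzard_xml, num_results=3)
print(result)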
train(classifier_profile, weblyzard_xml, correct_category, incorrect_category=None, document_timestamp=None)
Trains (and corrects) the classifier's knowledge base.

Parameters:
- classifier_profile – the profile to use for classification (e.g. 'COMET', 'MK')
- weblyzard_xml – weblyzard_xml representation of the document to learn
- correct_category – the correct category for the document
- incorrect_category – optional information on the incorrect category returned for this document
- document_timestamp – an optional timestamp, specifying when the document has been classified (used for retraining temporal knowledge bases)
Returns: a response object with a status code and message.
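A hedged sketch of a training call; the endpoint, the category labels, and the placeholder document are illustrative assumptions:

from weblyzard_api.client.classifier import Classifier

client = Classifier('http://localhost:8080')  # placeholder URL

# placeholder: the webLyzard XML document whose classification is corrected
weblyzard_xml = '<wl:page ...>...</wl:page>'

# report that 'economy' was the correct category and 'sports' was not
response = client.train('COMET', weblyzard_xml,
                        correct_category='economy',    # assumed label
                        incorrect_category='sports')   # assumed label
print(response)  # response object with a status code and message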
domain_specificity Module
class weblyzard_api.client.domain_specificity.DomainSpecificity(url='http://localhost:8080', usr=None, pwd=None)
Bases: eWRT.ws.rest.MultiRESTClient

Domain Specificity Web Service

Determines whether documents are relevant for a given domain by searching for domain-relevant terms in these documents.

Workflow:
1. Submit a domain-specificity profile with add_profile().
2. Obtain the domain specificity of text documents with get_domain_specificity(), parse_documents() or search_documents() (a usage sketch follows the parameter list below).

Parameters:
- url – URL of the domain specificity web service
- usr – optional user name
- pwd – optional password
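A minimal workflow sketch, assuming a reachable service endpoint; the profile name, keyword weights, and document keys below are illustrative assumptions:

from weblyzard_api.client.domain_specificity import DomainSpecificity

client = DomainSpecificity('http://localhost:8080')  # placeholder URL

# step 1: register a profile mapping keywords to domain specificity values
profile_mapping = {'finance': 1.0, 'stock': 0.8, 'market': 0.6}  # illustrative
client.add_profile('finance_news', profile_mapping)

# step 2: score documents against the profile
documents = [{'content_id': '1', 'title': 'Market news',
              'content': 'The stock market rallied today.'}]  # assumed keys
if client.has_profile('finance_news'):
    result = client.get_domain_specificity('finance_news', documents,
                                           is_case_sensitive=False)
    print(result)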
URL_PATH = 'rest/domain_specificity'
add_profile(profile_name, profile_mapping)
Adds a domain-specificity profile to the web service.

Parameters:
- profile_name – the name of the domain specificity profile
- profile_mapping – a dictionary of keywords and their respective domain specificity values.
get_domain_specificity(profile_name, documents, is_case_sensitive=True)

Parameters:
- profile_name – the name of the domain specificity profile to use.
- documents – a list of dictionaries containing the document
- is_case_sensitive – whether to consider case or not (default: True)
has_profile(profile_name)
Returns whether the given profile exists on the server.

Parameters: profile_name – the name of the domain specificity profile to check.
Returns: True if the given profile exists on the server.
parse_documents(matview_name, documents, is_case_sensitive=False)

Parameters:
- matview_name – a comma-separated list of matview names to check for domain specificity.
- documents – a list of dictionaries containing the document
- is_case_sensitive – case sensitive or not
Returns: a dict of the form {profile_name: (content_id, domain_specificity)}
jeremia Module
class weblyzard_api.client.jeremia.Jeremia(url='http://localhost:8080', usr=None, pwd=None)
Bases: eWRT.ws.rest.MultiRESTClient

Jeremia Web Service

Pre-processes text documents and returns an annotated webLyzard XML document.
Blacklisting

Blacklisting is an optional service that removes sentences occurring multiple times across different documents, such as document headers or footers.

The following functions handle sentence blacklisting (a usage sketch follows this list):
clear_blacklist()
get_blacklist()
submit_document_blacklist()
update_blacklist()
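A minimal sketch of the blacklisting workflow, assuming a reachable Jeremia instance; the source id and batch id are placeholders:

from weblyzard_api.client.jeremia import Jeremia

client = Jeremia('http://localhost:8080')  # placeholder URL
source_id = 4711                           # placeholder source id

# inspect the current sentence blacklist for this source
print(client.get_blacklist(source_id))

# submit documents and strip blacklisted sentences in one step
documents = [{'id': '192292', 'title': 'The document title.',
              'body': 'This is the document text...',
              'format': 'text/html', 'header': {}}]
result = client.submit_documents_blacklist('batch-1', documents, source_id)

# drop the cached blacklist once it is stale
client.clear_blacklist(source_id)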
Jeremia returns a webLyzard XML document. The weblyzard_api package provides the XMLContent class for processing and manipulating webLyzard XML documents.

Note
Example usage
from weblyzard_api.client.jeremia import Jeremia
from pprint import pprint

docs = {'id': '192292',
        'title': 'The document title.',
        'body': 'This is the document text...',
        'format': 'text/html',
        'header': {}}
client = Jeremia()
result = client.submit_document(docs)
pprint(result)
Parameters: - url – URL of the jeremia web service
- usr – optional user name
- pwd – optional password
ATTRIBUTE_MAPPING = {'lang': 'lang', 'sentences_map': {'token': 'token', 'md5sum': 'id', 'pos': 'pos', 'value': 'value'}, 'content_id': 'id', 'sentences': 'sentence', 'title': 'title'}

URL_PATH = 'jeremia/rest'
clear_blacklist(source_id)
Empties the existing sentence blacklisting cache for the given source_id.

Parameters: source_id – the blacklist's source id
commit(batch_id, sentence_threshold=None)

Parameters: batch_id – the batch_id to retrieve
Returns: a generator yielding all the documents of that particular batch
get_blacklist(source_id)

Parameters: source_id – the blacklist's source id
Returns: the sentence blacklist for the given source_id
get_xml_doc(text, content_id='1')
Processes text and returns an XMLContent object.

Parameters:
- text – the text to process
- content_id – optional content id
submit(batch_id, documents, source_id=None, use_blacklist=False, sentence_threshold=None)
Convenience function for submitting documents: submits the list of documents and then calls commit() to retrieve the result.

Parameters:
- batch_id – ID of the batch
- documents – list of documents (dict)
- source_id –
- use_blacklist – use the blacklist or not
Returns: result as a list with dicts
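A usage sketch for submit(), assuming a reachable Jeremia instance; the batch id and the document values are placeholders:

from weblyzard_api.client.jeremia import Jeremia

client = Jeremia('http://localhost:8080')  # placeholder URL

documents = [{'id': '192292', 'title': 'The document title.',
              'body': 'This is the document text...',
              'format': 'text/html', 'header': {}}]

# submit() sends the batch and then calls commit() internally,
# returning the annotated documents as a list of dicts
for annotated in client.submit('batch-1', documents, use_blacklist=False):
    print(annotated)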
submit_document(document)
Processes a single document with Jeremia (annotates a single document).

Parameters: document – the document to be processed
submit_documents(batch_id, documents)

Parameters:
- batch_id – batch_id to use for the given submission
- documents – a list of dictionaries containing the document
submit_documents_blacklist(batch_id, documents, source_id)
Submits the documents and removes blacklisted sentences.

Parameters:
- batch_id – batch_id to use for the given submission
- documents – a list of dictionaries containing the document
- source_id – source_id for the documents, determines the blacklist
jesaja Module
class weblyzard_api.client.jesaja.Jesaja(url='http://localhost:8080', usr=None, pwd=None)
Bases: eWRT.ws.rest.MultiRESTClient

Provides access to the Jesaja keyword service.

Jesaja extracts associations (i.e. keywords) from text documents.

Parameters:
- url – URL of the Jesaja web service
- usr – optional user name
- pwd – optional password
ATTRIBUTE_MAPPING = {'lang': 'xml:lang', 'sentences_map': {'token': 'token', 'md5sum': 'id', 'pos': 'pos', 'value': 'value'}, 'content_id': 'id', 'sentences': 'sentence', 'title': 'title'}

URL_PATH = 'jesaja/rest'

VALID_CORPUS_FORMATS = ('xml', 'csv')
add_or_update_corpus(corpus_name, corpus_format, corpus, profile_name=None, skip_profile_check=False)
Adds or updates a corpus at Jesaja.

Parameters:
- corpus_name – the name of the corpus
- corpus_format – either 'csv', 'xml', or 'wlxml'
- corpus – the corpus in the given format.
- profile_name – the name of the profile used for tokenization (only used in conjunction with corpus_format 'doc').

Note
Supported corpus_format values:
- csv
- xml
- wlxml:

  # xml_content: the content in the weblyzard xml format
  corpus = [xml_content, ...]

Attention
Uploading documents (corpus_format 'doc' or 'wlxml') requires a call to finalize_corpora() to trigger the corpus generation!
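A hedged sketch of a 'wlxml' corpus upload; the endpoint, corpus name, profile name, and XML content are placeholders:

from weblyzard_api.client.jesaja import Jesaja

client = Jesaja('http://localhost:8080')  # placeholder URL

# placeholder: documents in the webLyzard XML format
xml_content = '<wl:page ...>...</wl:page>'
corpus = [xml_content]

client.add_or_update_corpus('reference_corpus', 'wlxml', corpus,
                            profile_name='default')  # assumed profile name

# required after 'doc'/'wlxml' uploads: triggers the token-count computation
client.finalize_corpora()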
add_or_update_stoplist(name, stoplist)
Deprecated since version 0.1: Use add_stoplist() instead.
add_profile(profile_name, keyword_calculation_profile)
Adds a keyword profile to the server.

Parameters:
- profile_name – the name of the keyword profile
- keyword_calculation_profile – the full keyword calculation profile (see below).
Note
Example keyword calculation profile
{'valid_pos_tags': ['NN', 'P', 'ADJ'],
 'corpus_name': reference_corpus_name,
 'min_phrase_significance': 2.0,
 'num_keywords': 5,
 'keyword_algorithm': 'com.weblyzard.backend.jesaja.algorithm.keywords.YatesKeywordSignificanceAlgorithm',
 'min_token_count': 5,
 'skip_underrepresented_keywords': True,
 'stoplists': []}
Note
Available keyword_algorithms
com.weblyzard.backend.jesaja.algorithm.keywords.YatesKeywordSignificanceAlgorithm
com.weblyzard.backend.jesaja.algorithm.keywords.LogLikelihoodKeywordSignificanceAlgorithm
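A sketch that registers the example profile from the note above; the endpoint, corpus name, and profile name are assumptions:

from weblyzard_api.client.jesaja import Jesaja

client = Jesaja('http://localhost:8080')  # placeholder URL

profile = {'valid_pos_tags': ['NN', 'P', 'ADJ'],
           'corpus_name': 'reference_corpus',  # assumed corpus name
           'min_phrase_significance': 2.0,
           'num_keywords': 5,
           'keyword_algorithm': ('com.weblyzard.backend.jesaja.algorithm.'
                                 'keywords.YatesKeywordSignificanceAlgorithm'),
           'min_token_count': 5,
           'skip_underrepresented_keywords': True,
           'stoplists': []}
client.add_profile('my_keyword_profile', profile)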
add_stoplist(name, stoplist)

Parameters:
- name – name of the stopword list
- stoplist – a list of stopwords for the keyword computation
change_log_level(level)
Changes the log level of the keyword service.

Parameters: level – the new log level to use.
classmethod convert_document(xml)
Converts an XML string to a dictionary with the correct parameters (ignoring non-sentences and adding the titles).

Parameters: xml – str representing the document
Returns: converted document
Return type: dict
finalize_corpora()

Note
This function needs to be called after uploading 'doc' or 'wlxml' corpora, since it triggers the computation of the token counts based on the 'valid_pos_tags' parameter.
classmethod get_documents(xml_content_dict)
Converts a list of webLyzard XML files to the JSON format required by the Jesaja web service.
get_keywords(profile_name, documents)

Parameters:
- profile_name – keyword profile to use
- documents – a list of webLyzard xml documents to annotate
Note
example documents list
documents = [
    {'title': 'Test document',
     'sentence': [
         {'id': '27150b5fae553ebab63332fe7b94d518',
          'pos': 'NNP VBZ VBN IN VBZ NNP . NNP VBZ NNP .',
          'token': '0,5 6,8 9,16 17,19 20,27 28,43 43,44 45,48 49,54 55,61 61,62',
          'value': 'CDATA is wrapped as follows <![CDATA[aha]]>. Ana loves Martin.'},
         {'id': 'f8ddd9b3c8cf4c7764a3348d14e84e79',
          'pos': "NN IN CD ' IN JJR JJR JJR JJR CC CC CC : : JJ NN .",
          'token': '0,4 5,7 8,9 10,11 12,16 17,18 18,19 19,20 20,21 22,23 23,24 25,28 29,30 30,31 32,39 40,45 45,46',
          'value': '10µm in € ” with <><> && and // related stuff.'}],
     'content_id': '123k233',
     'lang': 'en'}]
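A sketch of the corresponding call, reusing the documents list from the note above; the endpoint and profile name are assumptions:

from weblyzard_api.client.jesaja import Jesaja
from pprint import pprint

client = Jesaja('http://localhost:8080')  # placeholder URL
pprint(client.get_keywords('my_keyword_profile', documents))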
pos Module
Part-of-speech (POS) tagging service
recognize Module
class weblyzard_api.client.recognize.EntityLyzardTest(methodName='runTest')
Bases: unittest.case.TestCase
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
DOCS = [{'xml:lang': 'de', 'id': 99933, 'sentence': [{'token': '0,5 6,12 13,16 17,19 20,23 24,27 28,36 36,37', 'id': '50612085a00cf052d66db97ff2252544', 'value': u'Georg M\xfcller hat 10 Mio CHF gewonnen.', 'pos': 'NE NE VAFIN CARD NE NE VVPP $.'}, {'token': '0,4 5,12 13,19 20,23 24,27 28,35 36,39 40,42 43,46 47,50 50,51 52,55 56,59 60,65 66,72 73,84 84,85 86,92 93,101 101,102', 'id': 'a3b05957957e01060fd58af587427362', 'value': u'Herr Schmidt konnte mit dem Angebot von 10 Mio CHF, das ihm Georg M\xfcller hinterlegte, nichts anfangen.', 'pos': 'NN NE VMFIN APPR ART NN APPR CARD NE NE $, PRELS PPER NE NE VVFIN $, PIS VVINF $.'}]}, {'xml:lang': 'de', 'id': 99934, 'sentence': [{'token': '0,6 7,14 15,23 23,24 25,29 30,33 34,37 38,42 43,47 48,59 60,64 65,69 69,70', 'id': 'f98a0c4d2ddffd60b64b9b25f1f5657a', 'value': u'Rektor Kessler erkl\xe4rte, dass die HTW auch 2014 erfolgreich sein wird.', 'pos': 'NN NE VVFIN $, KOUS ART NN ADV CARD ADJD VAINF VAFIN $.'}]}]
DOCS_XML = ['\n <?xml version="1.0" encoding="UTF-8"?>\n <wl:page xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wl="http://www.weblyzard.com/wl/2013#" dc:title="" wl:id="99933" dc:format="text/html" xml:lang="de" wl:nilsimsa="030472f84612acc42c7206e07814e69888267530636221300baf8bc2da66b476" dc:related="http://www.heise.de http://www.kurier.at">\n <wl:sentence wl:id="50612085a00cf052d66db97ff2252544" wl:pos="NE NE VAFIN CARD NE NE VVPP $." wl:token="0,5 6,12 13,16 17,19 20,23 24,27 28,36 36,37" wl:sem_orient="0.0" wl:significance="0.0"><![CDATA[Georg M\xc3\xbcller hat 10 Mio CHF gewonnen.]]></wl:sentence>\n <wl:sentence wl:id="a3b05957957e01060fd58af587427362" wl:pos="NN NE VMFIN APPR ART NN APPR CARD NE NE $, PRELS PPER NE NE VVFIN $, PIS VVINF $." wl:token="0,4 5,12 13,19 20,23 24,27 28,35 36,39 40,42 43,46 47,50 50,51 52,55 56,59 60,65 66,72 73,84 84,85 86,92 93,101 101,102" wl:sem_orient="0.0" wl:significance="0.0"><![CDATA[Herr Schmidt konnte mit dem Angebot von 10 Mio CHF, das ihm Georg M\xc3\xbcller hinterlegte, nichts anfangen.]]></wl:sentence>\n </wl:page>\n ', '\n <?xml version="1.0" encoding="UTF-8"?>\n <wl:page xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wl="http://www.weblyzard.com/wl/2013#" dc:title="" wl:id="99934" dc:format="text/html" xml:lang="de" wl:nilsimsa="020ee211a20084bb0d2208038548c02405bb0110d2183061db9400d74c15553a" dc:related="http://www.heise.de http://www.kurier.at">\n <wl:sentence wl:id="f98a0c4d2ddffd60b64b9b25f1f5657a" wl:pos="NN NE VVFIN $, KOUS ART NN ADV CARD ADJD VAINF VAFIN $." wl:token="0,6 7,14 15,23 23,24 25,29 30,33 34,37 38,42 43,47 48,59 60,64 65,69 69,70" wl:sem_orient="0.0" wl:significance="0.0"><![CDATA[Rektor Kessler erkl\xc3\xa4rte, dass die HTW auch 2014 erfolgreich sein wird.]]></wl:sentence>\n </wl:page>\n ']
IS_ONLINE = True
TESTED_PROFILES = ['de.people.ng', 'en.geo.500000.ng', 'en.organization.ng', 'en.people.ng']
test_geo_swiss()
Tests the geo annotation service for Swiss media samples.
Note
de_CH.geo.5000.ng detects Swiss cities with more than 5,000 inhabitants and worldwide cities with more than 500,000 inhabitants.
xml = '\n <?xml version="1.0" encoding="UTF-8"?>\n <wl:page xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wl="http://www.weblyzard.com/wl/2013#" dc:title="" wl:id="99934" dc:format="text/html" xml:lang="de" wl:nilsimsa="020ee211a20084bb0d2208038548c02405bb0110d2183061db9400d74c15553a" dc:related="http://www.heise.de http://www.kurier.at">\n <wl:sentence wl:id="f98a0c4d2ddffd60b64b9b25f1f5657a" wl:pos="NN NE VVFIN $, KOUS ART NN ADV CARD ADJD VAINF VAFIN $." wl:token="0,6 7,14 15,23 23,24 25,29 30,33 34,37 38,42 43,47 48,59 60,64 65,69 69,70" wl:sem_orient="0.0" wl:significance="0.0"><![CDATA[Rektor Kessler erkl\xc3\xa4rte, dass die HTW auch 2014 erfolgreich sein wird.]]></wl:sentence>\n </wl:page>\n '
class weblyzard_api.client.recognize.Recognize(url='http://localhost:8080', usr=None, pwd=None)
Bases: eWRT.ws.rest.MultiRESTClient
Provides access to the Recognize Web Service.
Workflow:
1. Pre-load the Recognize profiles you need using the add_profile() call.
2. Submit the text or documents to analyze using one of the following calls: search_document() or search_documents() for document dictionaries, or search_text() for plain text.
Note
Example usage
from weblyzard_api.client.recognize import Recognize
from pprint import pprint

url = 'http://triple-store.ai.wu.ac.at/recognize/rest/recognize'
profile_names = ['en.organization.ng', 'en.people.ng', 'en.geo.500000.ng']
text = ('Microsoft is an American multinational corporation headquartered in '
        'Redmond, Washington, that develops, manufactures, licenses, supports '
        'and sells computer software, consumer electronics and personal '
        'computers and services. It was founded by Bill Gates and Paul Allen '
        'on April 4, 1975.')

client = Recognize(url)
result = client.search_text(profile_names, text, output_format='compact',
                            max_entities=40, buckets=40, limit=40)
pprint(result)
Parameters:
- url – URL of the Recognize web service
- usr – optional user name
- pwd – optional password
ATTRIBUTE_MAPPING = {'lang': 'xml:lang', 'sentences_map': {'token': 'token', 'md5sum': 'id', 'pos': 'pos', 'value': 'value'}, 'content_id': 'id', 'sentences': 'sentence'}

OUTPUT_FORMATS = ('standard', 'minimal', 'annie', 'compact')

URL_PATH = 'recognize/rest/recognize'
add_profile(profile_name, force=False)
Pre-loads the given profile.

Parameters: profile_name – name of the profile to load.
classmethod convert_document(xml)
Converts an XML string to a document dictionary suitable for transmitting the document to Recognize.

Parameters: xml – webLyzard XML representation of the document
Returns: the converted document
Return type: dict

Note
Non-sentences are ignored and titles are added based on the XMLContent's interpretation of the document.
get_focus(profile_names, doc_list, max_results=1)

Parameters:
- profile_names – a list of profile names
- doc_list – a list of documents to analyze based on the weblyzardXML format
- max_results – maximum number of results to include
Returns: the focus and annotation of the given document
Note
Corresponding web call
http://localhost:8080/recognize/focus?profiles=ofwi.people&profiles=ofwi.organizations.context
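A hedged sketch of a get_focus() call, reusing the profile names from the web call above; the endpoint is a placeholder, and doc_list must hold documents in the weblyzardXML dictionary representation:

from weblyzard_api.client.recognize import Recognize

client = Recognize('http://localhost:8080')  # placeholder URL

# fill with weblyzardXML dictionary representations,
# e.g. XMLContent('<?xml version="1.0"...').as_list()
doc_list = []

result = client.get_focus(['ofwi.people', 'ofwi.organizations.context'],
                          doc_list, max_results=3)
print(result)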
get_xml_document(document)

Returns: the correct XML representation required by the Recognize service
list_configured_profiles()

Returns: a list of all profiles supported in the current configuration
list_profiles()

Returns: a list of all pre-loaded profiles

>>> r = Recognize()
>>> r.list_profiles()
[u'Cities.DACH.10000.de_en', u'People.DACH.de']
search_document(profile_names, document, debug=False, max_entities=1, buckets=1, limit=1, output_format='minimal')

Parameters:
- profile_names – a list of profile names
- document – a single document to analyze (see example documents below)
- debug – compute and return an explanation
- buckets – only return n buckets of hits with the same score
- max_entities – number of results to return (removes the top hit's tokens and rescores the result list subsequently)
- limit – only return that many results
- output_format – the output format to use (‘standard’, ‘minimal’, ‘annie’)
Return type: the tagged dictionary
Note
Example document
# option 1: document dictionary
{'content_id': 12, 'content': u'the text to analyze'}

# option 2: weblyzardXML
XMLContent('<?xml version="1.0"...').as_list()
search_documents(profile_names, doc_list, debug=False, max_entities=1, buckets=1, limit=1, output_format='annie')

Parameters:
- profile_names – a list of profile names
- doc_list – a list of documents to analyze (see example below)
- debug – compute and return an explanation
- buckets – only return n buckets of hits with the same score
- max_entities – number of results to return (removes the top hit's tokens and rescores the result list subsequently)
- limit – only return that many results
- output_format – the output format to use (‘standard’, ‘minimal’, ‘annie’)
Return type: the tagged dictionary
Note
Example document
# option 1: list of document dictionaries
({'content_id': 12, 'content': u'the text to analyze'},)

# option 2: list of weblyzardXML dictionary representations
(XMLContent('<?xml version="1.0"...').as_list(),
 XMLContent('<?xml version="1.0"...').as_list(),)
search_text(profile_names, text, debug=False, max_entities=1, buckets=1, limit=1, output_format='minimal')
Searches the text for entities specified in the given profiles.

Parameters:
- profile_names – the profiles to search in
- text – the text to search in
- debug – compute and return an explanation
- buckets – only return n buckets of hits with the same score
- max_entities – number of results to return (removes the top hit's tokens and rescores the result list subsequently)
- limit – only return that many results
- output_format – the output format to use (‘standard’, ‘minimal’, ‘annie’)
Return type: the tagged text
sentiment_analysis Module
class weblyzard_api.client.sentiment_analysis.SentimentAnalysis(url='http://voyager.srv.weblyzard.net/ws', usr=None, pwd=None)
Bases: eWRT.ws.rest.RESTClient

Sentiment Analysis Web Service

Parameters:
- url – URL of the sentiment analysis web service
- usr – optional user name
- pwd – optional password
parse_document(text, lang)
Returns the sentiment of the given text for the given language.

Parameters:
- text – the input text
- lang – the text’s language
Returns: sv; n_pos_terms; n_neg_terms; list of tuples, where each tuple contains two dicts:
- tuple[0]: the ambiguous terms and their sentiment values after disambiguation
- tuple[1]: the context terms with their number of occurrences in the document.
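A minimal sketch, assuming the default endpoint is reachable; the sample text is illustrative, and the shape of the returned structure follows the description above:

from weblyzard_api.client.sentiment_analysis import SentimentAnalysis
from pprint import pprint

client = SentimentAnalysis()  # defaults to the voyager endpoint

result = client.parse_document(
    'The service was excellent, but the food was disappointing.', 'en')
# result: sentiment value (sv), positive/negative term counts, and the
# per-term disambiguation tuples described above
pprint(result)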
parse_document_list(document_list, lang)
Returns the sentiment of the given documents for the given language.

Parameters:
- document_list – the list of input texts
- lang – the text’s language
Returns: sv; n_pos_terms; n_neg_terms; list of tuples, where each tuple contains two dicts:
- tuple[0]: the ambiguous terms and their sentiment values after disambiguation
- tuple[1]: the context terms with their number of occurrences in the document.
reset(lang)
Restores the default data files for the given language (if available).

Parameters: lang – the used language

Note
Currently this operation is only supported for German and English.
update_context(context_dict, lang)
Uploads the given context dictionary to the web service.

Parameters:
- context_dict – a dictionary containing the context information
- lang – the used language