started iso reader

bscheibel 5 years ago
Parent
Current commit
36792a2cc6
2 files changed, 12 insertions and 0 deletions
  1. 0 0
      iso_documents/ISO1101.PDF
  2. 12 0
      read_isos.py

iso_documents/ISO 1101.PDF → iso_documents/ISO1101.PDF


+ 12 - 0
read_isos.py

@@ -0,0 +1,12 @@
+import nltk
+nltk.download('punkt')
+from tika import parser
+
+raw = parser.from_file('iso_documents/ISO1101.PDF')
+#print(raw['content'])
+text = raw['content']  # from_file returns a dict; the extracted text is under 'content'
+sent_text = nltk.sent_tokenize(text)
+#tokenized_text = nltk.word_tokenize(sent_text.split)
+#tagged = nltk.pos_tag(tokenized_text)
+#match = text.concordance('Toleranz')
+#print(sent_text)
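
The commented-out lines in read_isos.py suggest the next step is to find where 'Toleranz' occurs in the extracted text. Below is a minimal sketch of that idea, not part of this commit. The helper names `extract_text` and `sentences_mentioning` are hypothetical, and the sentence splitter is a naive regex swapped in for `nltk.sent_tokenize` so the filtering step can be tried without downloading NLTK data.

```python
import re


def extract_text(pdf_path):
    """Return the plain text Tika extracts from a PDF, or '' if there is none."""
    # Imported lazily so sentences_mentioning() works without tika installed.
    from tika import parser

    parsed = parser.from_file(pdf_path)  # dict; the text is under 'content'
    return parsed.get('content') or ''


def sentences_mentioning(text, keyword):
    """Return the sentences of text that contain keyword, ignoring case.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    This is a rough stand-in for nltk.sent_tokenize, not an equivalent;
    abbreviations such as 'z. B.' will be split incorrectly.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    needle = keyword.lower()
    return [s for s in sentences if needle in s.lower()]


# Usage (requires the tika package and a Java runtime):
#   text = extract_text('iso_documents/ISO1101.PDF')
#   for sentence in sentences_mentioning(text, 'Toleranz'):
#       print(sentence)
```

Keeping the PDF extraction and the text filtering in separate functions means the filter can be checked on a plain string before it is run against the full ISO document.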