Topc を使用したトピック モデリング: ドレフュス、AI、ワードクラウド

表紙 > プログラミング > Topc を使用したトピックモデリング: ドレフュス、AI、ワードクラウド

Topc を使用したトピックモデリング: ドレフュス、AI、ワードクラウド

2024 年 7 月 30 日に公開

ブラウズ：804

Python を使用して PDF から洞察を抽出する: 包括的なガイド

このスクリプトは、PDF の処理、テキストの抽出、文章のトークン化、および視覚化を伴うトピックモデリングの実行のための強力なワークフローを示し、効率的で洞察力に富んだ分析を実現します。

ライブラリの概要

os: オペレーティングシステムと対話するための機能を提供します。
matplotlib.pyplot: Python で静的、アニメーション、インタラクティブな視覚化を作成するために使用されます。
nltk: Natural Language Toolkit、自然言語処理用のライブラリとプログラムのスイート。
pandas: データ操作および分析ライブラリ。
pdftotext: PDF ドキュメントをプレーンテキストに変換するためのライブラリ。
re: 正規表現の一致操作を提供します。
seaborn: matplotlib.
nltk.tokenize.sent_tokenize: 文字列を文にトークン化する NLTK 関数。
top2vec: トピックモデリングとセマンティック検索のためのライブラリ。
wordcloud: テキストデータからワードクラウドを作成するためのライブラリ。

初期設定

モジュールのインポート

import os
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import pdftotext
import re
import seaborn as sns
from nltk.tokenize import sent_tokenize
from top2vec import Top2Vec
from wordcloud import WordCloud
from cleantext import clean

次に、punkt トークナイザーがダウンロードされていることを確認します:

nltk.download('punkt')

テキストの正規化

def normalize_text(text):
    """Normalize text by removing special characters and extra spaces,
    and applying various other cleaning options."""

    # Apply the clean function with specified parameters
    cleaned_text = clean(
        text,
        fix_unicode=True,  # fix various unicode errors
        to_ascii=True,  # transliterate to closest ASCII representation
        lower=True,  # lowercase text
        no_line_breaks=False,  # fully strip line breaks as opposed to only normalizing them
        no_urls=True,  # replace all URLs with a special token
        no_emails=True,  # replace all email addresses with a special token
        no_phone_numbers=True,  # replace all phone numbers with a special token
        no_numbers=True,  # replace all numbers with a special token
        no_digits=True,  # replace all digits with a special token
        no_currency_symbols=True,  # replace all currency symbols with a special token
        no_punct=False,  # remove punctuations
        lang="en",  # set to 'de' for German special handling
    )

    # Further clean the text by removing any remaining special characters except word characters, whitespace, and periods/commas
    cleaned_text = re.sub(r"[^\w\s.,]", "", cleaned_text)
    # Replace multiple whitespace characters with a single space and strip leading/trailing spaces
    cleaned_text = re.sub(r"\s ", " ", cleaned_text).strip()

    return cleaned_text

PDFテキスト抽出

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    all_text = "\n\n".join(pdf)
    return normalize_text(all_text)

文のトークン化

def split_into_sentences(text):
    return sent_tokenize(text)

複数のファイルの処理

def process_files(file_paths):
    authors, titles, all_sentences = [], [], []
    for file_path in file_paths:
        file_name = os.path.basename(file_path)
        parts = file_name.split(" - ", 2)
        if len(parts) != 3 or not file_name.endswith(".pdf"):
            print(f"Skipping file with incorrect format: {file_name}")
            continue

        year, author, title = parts
        author, title = author.strip(), title.replace(".pdf", "").strip()

        try:
            text = extract_text_from_pdf(file_path)
        except Exception as e:
            print(f"Error extracting text from {file_name}: {e}")
            continue

        sentences = split_into_sentences(text)
        authors.append(author)
        titles.append(title)
        all_sentences.extend(sentences)
        print(f"Number of sentences for {file_name}: {len(sentences)}")

    return authors, titles, all_sentences

データをCSVに保存する

def save_data_to_csv(authors, titles, file_paths, output_file):
    texts = []
    for fp in file_paths:
        try:
            text = extract_text_from_pdf(fp)
            sentences = split_into_sentences(text)
            texts.append(" ".join(sentences))
        except Exception as e:
            print(f"Error processing file {fp}: {e}")
            texts.append("")

    data = pd.DataFrame({
        "Author": authors,
        "Title": titles,
        "Text": texts
    })
    data.to_csv(output_file, index=False, quoting=1, encoding='utf-8')
    print(f"Data has been written to {output_file}")

ストップワードのロード

def load_stopwords(filepath):
    with open(filepath, "r") as f:
        stopwords = f.read().splitlines()
    additional_stopwords = ["able", "according", "act", "actually", "after", "again", "age", "agree", "al", "all", "already", "also", "am", "among", "an", "and", "another", "any", "appropriate", "are", "argue", "as", "at", "avoid", "based", "basic", "basis", "be", "been", "begin", "best", "book", "both", "build", "but", "by", "call", "can", "cant", "case", "cases", "claim", "claims", "class", "clear", "clearly", "cope", "could", "course", "data", "de", "deal", "dec", "did", "do", "doesnt", "done", "dont", "each", "early", "ed", "either", "end", "etc", "even", "ever", "every", "far", "feel", "few", "field", "find", "first", "follow", "follows", "for", "found", "free", "fri", "fully", "get", "had", "hand", "has", "have", "he", "help", "her", "here", "him", "his", "how", "however", "httpsabout", "ibid", "if", "im", "in", "is", "it", "its", "jstor", "june", "large", "lead", "least", "less", "like", "long", "look", "man", "many", "may", "me", "money", "more", "most", "move", "moves", "my", "neither", "net", "never", "new", "no", "nor", "not", "notes", "notion", "now", "of", "on", "once", "one", "ones", "only", "open", "or", "order", "orgterms", "other", "our", "out", "own", "paper", "past", "place", "plan", "play", "point", "pp", "precisely", "press", "put", "rather", "real", "require", "right", "risk", "role", "said", "same", "says", "search", "second", "see", "seem", "seems", "seen", "sees", "set", "shall", "she", "should", "show", "shows", "since", "so", "step", "strange", "style", "such", "suggests", "talk", "tell", "tells", "term", "terms", "than", "that", "the", "their", "them", "then", "there", "therefore", "these", "they", "this", "those", "three", "thus", "to", "todes", "together", "too", "tradition", "trans", "true", "try", "trying", "turn", "turns", "two", "up", "us", "use", "used", "uses", "using", "very", "view", "vol", "was", "way", "ways", "we", "web", "well", "were", "what", "when", "whether", "which", "who", "why", "with", "within", "works", "would", "years", "york", "you", "your", "suggests", "without"]
    stopwords.extend(additional_stopwords)
    return set(stopwords)

トピックからのストップワードのフィルタリング

def filter_stopwords_from_topics(topic_words, stopwords):
    filtered_topics = []
    for words in topic_words:
        filtered_topics.append([word for word in words if word.lower() not in stopwords])
    return filtered_topics

ワードクラウドの生成

def generate_wordcloud(topic_words, topic_num, palette='inferno'):
    colors = sns.color_palette(palette, n_colors=256).as_hex()
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return colors[random_state.randint(0, len(colors) - 1)]

    wordcloud = WordCloud(width=800, height=400, background_color='black', color_func=color_func).generate(' '.join(topic_words))
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Topic {topic_num} Word Cloud')
    plt.show()

メイン実行

file_paths = [f"/home/roomal/Desktop/Dreyfus-Project/Dreyfus/{fname}" for fname in os.listdir("/home/roomal/Desktop/Dreyfus-Project/Dreyfus/") if fname.endswith(".pdf")]

authors, titles, all_sentences = process_files(file_paths)

output_file = "/home/roomal/Desktop/Dreyfus-Project/Dreyfus_Papers.csv"
save_data_to_csv(authors, titles, file_paths, output_file)

stopwords_filepath = "/home/roomal/Documents/Lists/stopwords.txt"
stopwords = load_stopwords(stopwords_filepath)

try:
    topic_model = Top2Vec(
        all_sentences,
        embedding_model="distiluse-base-multilingual-cased",
        speed="deep-learn",
        workers=6
    )
    print("Top2Vec model created successfully.")
except ValueError as e:
    print(f"Error initializing Top2Vec: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

num_topics = topic_model.get_num_topics()
topic_words, word_scores, topic_nums = topic_model.get_topics(num_topics)
filtered_topic_words = filter_stopwords_from_topics(topic_words, stopwords)

for i, words in enumerate(filtered_topic_words):
    print(f"Topic {i}: {', '.join(words)}")

keywords = ["heidegger"]
topic_words, word_scores, topic_scores, topic_nums = topic_model.search_topics(keywords=keywords, num_topics=num_topics)
filtered

_search_topic_words = filter_stopwords_from_topics(topic_words, stopwords)

for i, words in enumerate(filtered_search_topic_words):
    generate_wordcloud(words, topic_nums[i])

for i in range(reduced_num_topics):
    topic_words = topic_model.topic_words_reduced[i]
    filtered_words = [word for word in topic_words if word.lower() not in stopwords]
    print(f"Reduced Topic {i}: {', '.join(filtered_words)}")
    generate_wordcloud(filtered_words, i)

Topic Wordcloud

トピックの数を減らす

reduced_num_topics = 5
topic_mapping = topic_model.hierarchical_topic_reduction(num_topics=reduced_num_topics)

# Print reduced topics and generate word clouds
for i in range(reduced_num_topics):
    topic_words = topic_model.topic_words_reduced[i]
    filtered_words = [word for word in topic_words if word.lower() not in stopwords]
    print(f"Reduced Topic {i}: {', '.join(filtered_words)}")
    generate_wordcloud(filtered_words, i)

Hierarchical Topic Reduction Wordcloud

リリースステートメントこの記事は次の場所に転載されています: https://dev.to/roomals/topic-modeling-with-top2vec-dreyfus-ai-and-wordclouds-1ggl?1 侵害がある場合は、削除するために[email protected]に連絡してください。それ

最新のチュートリアルもっと>

Pythonを使用して、大きなファイルを逆の順序で効率的に読み取るにはどうすればよいですか？
Python でファイルを逆順序で読み取る必要があり、最後の行から最初の行までの内容を読み取る必要がある場合、Pythonの組み込み機能は適切ではないかもしれません。このタスクに取り組むための効率的なソリューションは次のとおりです。バッファベースのアプローチを使用してパフォーマンスを最...

プログラミング 2025-04-17に投稿されました
PostgreSQLの各一意の識別子の最後の行を効率的に取得するにはどうすればよいですか？
postgresql：各一意の識別子の最後の行を抽出します。次のデータを検討してください： select distinct on (id) id, date, another_info from the_table order by id, date desc; データセット内の一...

プログラミング 2025-04-17に投稿されました
交換指令を使用して、GO modのモジュールパスの不一致を解決する方法は？
go mod のモジュールパスの不一致を克服するgo modを利用する場合、輸入パッケージと実際の輸入パスの間のパスミスマッチとのパスミスマッチで、第三者パッケージが別のパッケージをインポートする紛争に遭遇する可能性があります。エコーされたメッセージで示されているように、これはGo M...

プログラミング 2025-04-17に投稿されました
なぜ有効なコードにもかかわらず、PHPで入力をキャプチャするリクエストを要求するのはなぜですか？
アドレス指定Php action='' を使用して、フォームの提出後に$ _POSTアレイの内容を確認します。適切に： if（empty（$ _ server ['content_type']）） { $ _Server ['content_typ...

プログラミング 2025-04-17に投稿されました
PHPの2つの等しいサイズの配列から値を同期して反復して印刷するにはどうすればよいですか？
同じサイズの2つの配列の2つの配列から値を同期して反復して印刷する場合、同サイズの2つの配列を使用してselectboxを作成する場合、1つは対応する名前を含む1つを使用して、困難が不適切なsyntaxに起因する可能性があります。アレイ： foreach（$ codes as $ code、...

プログラミング 2025-04-17に投稿されました
Laravel Bladeテンプレートの変数をエレガントに定義するにはどうすればよいですか？
Laravel Bladeテンプレートの変数を優雅さで定義するブレードテンプレートに変数を割り当てる方法を理解することは、後で使用するためにデータを保存するために重要です。「{{{{}}}」を使用して変数を割り当てるのは簡単ですが、常に最もエレガントなソリューションであるとは限りませ...

プログラミング 2025-04-17に投稿されました
Codeigniterがmysqliに切り替えた後にmysqlデータベースに接続する理由
MySQLデータベースに接続できません：エラーメッセージのトラブルシューティングは、MySQLドライバーからMySQLIドライバーのコードジニターのMySQLIドライバーに切り替えようとする場合、ユーザーは、設定を使用してデータベースサーバーを接続できます。このエラーは、誤ったPHP構...

プログラミング 2025-04-17に投稿されました
3つのMySQLテーブルのデータを新しいテーブルに組み合わせる方法は？
mysql：3つのテーブルのデータと列から新しいテーブルを作成する質問：人々、詳細、および分類表の表？ P。*、d.contentを年齢として選択します psとしての人々から D.Person_id = p.idのDとして詳細を結合します t.id = d.detail_idでt...

プログラミング 2025-04-17に投稿されました
$mysqlが絵文字を挿入するときに\\ "string値エラー\\"例外を解きます$
mysqlが絵文字を挿入するときに\\ "string値エラー\\"例外を解きます
誤った文字列値例外を解決する絵文字を挿入するときに絵文字を含む文字列をMySQLデータベースに挿入しようとするときに、次の例外を遭遇する可能性があります： Java.SQL.SQL.SQL.SQL.SQL.SQL.SQL.SQL.SQL.SQL.SQL.SQL.SQL.SQL.SQL.SQL...

プログラミング 2025-04-17に投稿されました
多次元アレイのためにPHPでのJSONの解析を簡素化する方法は？
jsonをphp でphpで解析しようとする場合、特に多次元配列を扱う場合は困難な場合があります。プロセスを簡素化するには、JSONをオブジェクトではなく配列として解析することをお勧めします。 print_r（$ json）を使用して配列構造を探索することは、目的の情報へのアクセス方法を決...

プログラミング 2025-04-17に投稿されました
PHPのUnicode文字列からURLに優しいナメクジを効率的に生成するにはどうすればよいですか？
効率的なナメクジ生成のための関数を作成するスラッグの作成、URLで使用されるユニコード文字列の単純化された表現は挑戦的な作業になります。この記事では、スラッグを効率的に生成し、特殊文字と非ASCII文字をURLに優しい形式に変換するための簡潔なソリューションを紹介します。一連の操作を使用...

プログラミング 2025-04-17に投稿されました
CSS「コンテンツ」プロパティを使用してFirefoxが画像を表示しないのはなぜですか？
firefox のコンテンツURLを使用して画像を表示します。これは、提供されたCSSクラスで見ることができます： .googlePic { content: url('../../img/googlePlusIcon.PNG'); margin-top: -6.5%;...

プログラミング 2025-04-17に投稿されました
C＃でインデントのために文字列文字を効率的に繰り返す方法は？
インデンテーションのために文字列を繰り返すアイテムの深さに基づいて文字列をインデントするとき、文字列を繰り返します。 Constructor 同じ文字を繰り返すだけの場合、文字を受け入れる文字列コンストラクターを使用してそれを繰り返すことができます： string indent = ...

プログラミング 2025-04-17に投稿されました
JavaScriptのグローバル変数に動的にアクセスする方法は？
javascriptの名前で動的にグローバル変数にアクセスするランタイム中にグローバル変数にアクセスすることは、一般的な要件になる可能性があります。通常、グローバル変数はウィンドウオブジェクトを介してアクセスできます。ただし、これは、異なるスクリプトにわたってローカル変数にアクセスしようと...

プログラミング 2025-04-17に投稿されました
CSSを使用してChromeとFirefoxのコンソール出力を着色できますか？
javaScriptコンソールの色の表示は、クロムのコンソールを使用してエラー用の赤、警告用のオレンジ、コンソール用グリーンなどの色のテキストを表示することは可能です。メッセージ？回答はい、CSSを使用して、ChromeとFirefox（バージョン31以降）のコンソールに表示さ...

プログラミング 2025-04-17に投稿されました