نمذجة المواضيع باستخدام Topc: Dreyfus، وAI، وWordclouds

الصفحة الأمامية > برمجة > نمذجة المواضيع باستخدام Topc: Dreyfus، وAI، وWordclouds

نمذجة المواضيع باستخدام Topc: Dreyfus، وAI، وWordclouds

تم النشر بتاريخ 2024-07-30

تصفح:864

استخراج الرؤى من ملفات PDF باستخدام بايثون: دليل شامل

يوضح هذا البرنامج النصي سير عمل قويًا لمعالجة ملفات PDF، واستخراج النص، ورموز الجمل، وتنفيذ نماذج الموضوع باستخدام التصور، وهو مصمم خصيصًا للتحليل الفعال والثاقب.

نظرة عامة على المكتبات

os: يوفر وظائف للتفاعل مع نظام التشغيل.
matplotlib.pyplot: يستخدم لإنشاء تصورات ثابتة ومتحركة وتفاعلية في بايثون.
nltk: مجموعة أدوات اللغة الطبيعية، وهي مجموعة من المكتبات والبرامج لمعالجة اللغة الطبيعية.
الباندا: مكتبة معالجة البيانات وتحليلها.
pdftotext: مكتبة لتحويل مستندات PDF إلى نص عادي.
re: يوفر عمليات مطابقة التعبير العادي.
seaborn: مكتبة تصور البيانات الإحصائية بناءً على matplotlib.
nltk.tokenize.sent_tokenize: وظيفة NLTK لتحويل سلسلة إلى جمل.
top2vec: مكتبة لنمذجة المواضيع والبحث الدلالي.
wordcloud: مكتبة لإنشاء سحب الكلمات من البيانات النصية.

الإعداد الأولي

وحدات الاستيراد

import os
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import pdftotext
import re
import seaborn as sns
from nltk.tokenize import sent_tokenize
from top2vec import Top2Vec
from wordcloud import WordCloud
from cleantext import clean

بعد ذلك، تأكد من تنزيل أداة punkt tokenizer:

nltk.download('punkt')

تطبيع النص

def normalize_text(text):
    """Normalize text by removing special characters and extra spaces,
    and applying various other cleaning options."""

    # Apply the clean function with specified parameters
    cleaned_text = clean(
        text,
        fix_unicode=True,  # fix various unicode errors
        to_ascii=True,  # transliterate to closest ASCII representation
        lower=True,  # lowercase text
        no_line_breaks=False,  # fully strip line breaks as opposed to only normalizing them
        no_urls=True,  # replace all URLs with a special token
        no_emails=True,  # replace all email addresses with a special token
        no_phone_numbers=True,  # replace all phone numbers with a special token
        no_numbers=True,  # replace all numbers with a special token
        no_digits=True,  # replace all digits with a special token
        no_currency_symbols=True,  # replace all currency symbols with a special token
        no_punct=False,  # remove punctuations
        lang="en",  # set to 'de' for German special handling
    )

    # Further clean the text by removing any remaining special characters except word characters, whitespace, and periods/commas
    cleaned_text = re.sub(r"[^\w\s.,]", "", cleaned_text)
    # Replace multiple whitespace characters with a single space and strip leading/trailing spaces
    cleaned_text = re.sub(r"\s ", " ", cleaned_text).strip()

    return cleaned_text

استخراج نص PDF

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    all_text = "\n\n".join(pdf)
    return normalize_text(all_text)

ترميز الجملة

def split_into_sentences(text):
    return sent_tokenize(text)

معالجة ملفات متعددة

def process_files(file_paths):
    authors, titles, all_sentences = [], [], []
    for file_path in file_paths:
        file_name = os.path.basename(file_path)
        parts = file_name.split(" - ", 2)
        if len(parts) != 3 or not file_name.endswith(".pdf"):
            print(f"Skipping file with incorrect format: {file_name}")
            continue

        year, author, title = parts
        author, title = author.strip(), title.replace(".pdf", "").strip()

        try:
            text = extract_text_from_pdf(file_path)
        except Exception as e:
            print(f"Error extracting text from {file_name}: {e}")
            continue

        sentences = split_into_sentences(text)
        authors.append(author)
        titles.append(title)
        all_sentences.extend(sentences)
        print(f"Number of sentences for {file_name}: {len(sentences)}")

    return authors, titles, all_sentences

حفظ البيانات إلى CSV

def save_data_to_csv(authors, titles, file_paths, output_file):
    texts = []
    for fp in file_paths:
        try:
            text = extract_text_from_pdf(fp)
            sentences = split_into_sentences(text)
            texts.append(" ".join(sentences))
        except Exception as e:
            print(f"Error processing file {fp}: {e}")
            texts.append("")

    data = pd.DataFrame({
        "Author": authors,
        "Title": titles,
        "Text": texts
    })
    data.to_csv(output_file, index=False, quoting=1, encoding='utf-8')
    print(f"Data has been written to {output_file}")

جارٍ تحميل كلمات الإيقاف

def load_stopwords(filepath):
    with open(filepath, "r") as f:
        stopwords = f.read().splitlines()
    additional_stopwords = ["able", "according", "act", "actually", "after", "again", "age", "agree", "al", "all", "already", "also", "am", "among", "an", "and", "another", "any", "appropriate", "are", "argue", "as", "at", "avoid", "based", "basic", "basis", "be", "been", "begin", "best", "book", "both", "build", "but", "by", "call", "can", "cant", "case", "cases", "claim", "claims", "class", "clear", "clearly", "cope", "could", "course", "data", "de", "deal", "dec", "did", "do", "doesnt", "done", "dont", "each", "early", "ed", "either", "end", "etc", "even", "ever", "every", "far", "feel", "few", "field", "find", "first", "follow", "follows", "for", "found", "free", "fri", "fully", "get", "had", "hand", "has", "have", "he", "help", "her", "here", "him", "his", "how", "however", "httpsabout", "ibid", "if", "im", "in", "is", "it", "its", "jstor", "june", "large", "lead", "least", "less", "like", "long", "look", "man", "many", "may", "me", "money", "more", "most", "move", "moves", "my", "neither", "net", "never", "new", "no", "nor", "not", "notes", "notion", "now", "of", "on", "once", "one", "ones", "only", "open", "or", "order", "orgterms", "other", "our", "out", "own", "paper", "past", "place", "plan", "play", "point", "pp", "precisely", "press", "put", "rather", "real", "require", "right", "risk", "role", "said", "same", "says", "search", "second", "see", "seem", "seems", "seen", "sees", "set", "shall", "she", "should", "show", "shows", "since", "so", "step", "strange", "style", "such", "suggests", "talk", "tell", "tells", "term", "terms", "than", "that", "the", "their", "them", "then", "there", "therefore", "these", "they", "this", "those", "three", "thus", "to", "todes", "together", "too", "tradition", "trans", "true", "try", "trying", "turn", "turns", "two", "up", "us", "use", "used", "uses", "using", "very", "view", "vol", "was", "way", "ways", "we", "web", "well", "were", "what", "when", "whether", "which", "who", "why", "with", "within", "works", "would", "years", "york", "you", "your", "suggests", "without"]
    stopwords.extend(additional_stopwords)
    return set(stopwords)

تصفية كلمات التوقف من المواضيع

def filter_stopwords_from_topics(topic_words, stopwords):
    filtered_topics = []
    for words in topic_words:
        filtered_topics.append([word for word in words if word.lower() not in stopwords])
    return filtered_topics

توليد سحابة الكلمات

def generate_wordcloud(topic_words, topic_num, palette='inferno'):
    colors = sns.color_palette(palette, n_colors=256).as_hex()
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return colors[random_state.randint(0, len(colors) - 1)]

    wordcloud = WordCloud(width=800, height=400, background_color='black', color_func=color_func).generate(' '.join(topic_words))
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Topic {topic_num} Word Cloud')
    plt.show()

التنفيذ الرئيسي

file_paths = [f"/home/roomal/Desktop/Dreyfus-Project/Dreyfus/{fname}" for fname in os.listdir("/home/roomal/Desktop/Dreyfus-Project/Dreyfus/") if fname.endswith(".pdf")]

authors, titles, all_sentences = process_files(file_paths)

output_file = "/home/roomal/Desktop/Dreyfus-Project/Dreyfus_Papers.csv"
save_data_to_csv(authors, titles, file_paths, output_file)

stopwords_filepath = "/home/roomal/Documents/Lists/stopwords.txt"
stopwords = load_stopwords(stopwords_filepath)

try:
    topic_model = Top2Vec(
        all_sentences,
        embedding_model="distiluse-base-multilingual-cased",
        speed="deep-learn",
        workers=6
    )
    print("Top2Vec model created successfully.")
except ValueError as e:
    print(f"Error initializing Top2Vec: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

num_topics = topic_model.get_num_topics()
topic_words, word_scores, topic_nums = topic_model.get_topics(num_topics)
filtered_topic_words = filter_stopwords_from_topics(topic_words, stopwords)

for i, words in enumerate(filtered_topic_words):
    print(f"Topic {i}: {', '.join(words)}")

keywords = ["heidegger"]
topic_words, word_scores, topic_scores, topic_nums = topic_model.search_topics(keywords=keywords, num_topics=num_topics)
filtered

_search_topic_words = filter_stopwords_from_topics(topic_words, stopwords)

for i, words in enumerate(filtered_search_topic_words):
    generate_wordcloud(words, topic_nums[i])

for i in range(reduced_num_topics):
    topic_words = topic_model.topic_words_reduced[i]
    filtered_words = [word for word in topic_words if word.lower() not in stopwords]
    print(f"Reduced Topic {i}: {', '.join(filtered_words)}")
    generate_wordcloud(filtered_words, i)

Topic Wordcloud

تقليل عدد المواضيع

reduced_num_topics = 5
topic_mapping = topic_model.hierarchical_topic_reduction(num_topics=reduced_num_topics)

# Print reduced topics and generate word clouds
for i in range(reduced_num_topics):
    topic_words = topic_model.topic_words_reduced[i]
    filtered_words = [word for word in topic_words if word.lower() not in stopwords]
    print(f"Reduced Topic {i}: {', '.join(filtered_words)}")
    generate_wordcloud(filtered_words, i)

Hierarchical Topic Reduction Wordcloud

بيان الافراج تم إعادة نشر هذه المقالة على: https://dev.to/roomals/topic-modeling-with-top2vec-dreyfus-ai-and-wordclouds-1ggl?1 إذا كان هناك أي انتهاك، يرجى الاتصال بـ [email protected] للحذف هو - هي

أحدث البرنامج التعليمي أكثر>

$كيفية التغلب على قيود وظائف PHP \؟$
كيفية التغلب على قيود وظائف PHP \؟
التغلب على قيود دالة PHP إن محاولة القيام بذلك ، كما هو موضح في مقتطف الرمز المقدم ، ستؤدي إلى خطأ "لا يمكن redeclare". $ b) { return ...

برمجة نشر في 2025-05-09
لماذا تتطلب تعبيرات Lambda متغيرات "نهائية" أو "نهائية صالحة" في جافا؟
النهائي. في قصاصة الكود المقدمة: // الرمز المفقود cal.getcomponents (). getComponents ("VTimeZone"). // الرمز المفقود ...

برمجة نشر في 2025-05-09
هل يمكن تكديس عناصر لزجة متعددة فوق بعضها البعض في CSS النقي؟
هنا: https://webthemez.com/demo/sticky-multi-header-scroll/index.html فقط أفضل استخدام CSS النقي ، بدلاً من تنفيذ JavaScript. لقد جربت قليلاً ...

برمجة نشر في 2025-05-09
كيف تحدد بدقة ثابتة قوة 10 في لغة الذهاب
كتابة قوى 10 كطوائف مضغوطة في GO في لغة البرمجة GO ، وتحديد القوى 10 كثانوية هي مهمة شائعة. توفر آلية IOTA ، التي تم تقديمها في الفصل 3 ، طريق...

برمجة نشر في 2025-05-09
لماذا لا تزال الصور لديها حدود في الكروم؟ `الحدود: لا شيء ؛` حل غير صالح
إزالة حدود الصورة باللغة الكروم و "الحدود: لا شيء ؛" في CSS. لحل هذه المشكلة ، ضع في اعتبارك الأساليب التالية: تحيز خلل الكروم ل...

برمجة نشر في 2025-05-09
كيفية إزالة معالجات أحداث JavaScript المجهولة بشكل نظيف؟
العنصر؟ لا توجد طريقة لإزالة معالج الأحداث المجهول بشكل نظيف ما لم يتم تخزين إشارة إلى المعالج في الإنشاء. لمعالجة هذه المشكلة ، فكر في تخزين معا...

برمجة نشر في 2025-05-09
التنفيذ الديناميكي العاكس لواجهة GO لاستكشاف طريقة RPC
أحد الأسئلة التي أثيرت هو ما إذا كان من الممكن استخدام الانعكاس لإنشاء وظيفة جديدة تنفذ واجهة محددة. بيان مشكلة على سبيل المثال ، فكر في واجهة...

برمجة نشر في 2025-05-09
الإدخال: لماذا "تحذير: mysqli_query () يتوقع أن تكون المعلمة 1 mysqli ، مورد معطى" يحدث خطأ وكيفية إصلاحه؟ الإخراج: يجب أن تكون معلمة التحليل والتثبيت لحل الخطأ "تحذير: mysqli_query () mysqli بدلاً من المورد"
mysqli ، المورد المعطى "يمكن أن يحدث خطأ. يشير هذا الخطأ إلى عدم تطابق بين نوع المعلمة الأولى ونوع المورد المتوقع. لا يمكن تبديل هاتين الامتد...

برمجة نشر في 2025-05-09
هل تسمح Java بأنواع عائدات متعددة: نظرة فاحصة على الطرق العامة؟
أنواع عائدات متعددة في java: تم الكشف عن المفاهيم الخاطئة getResult (String s) ؛ حيث Foo فئة مخصصة. يبدو أن إعلان الطريقة يضم نوعين من الإرج...

برمجة نشر في 2025-05-09
كيف يمكنك استخدام مجموعة من خلال محور البيانات في MySQL؟
هنا ، نتعامل مع تحد شائع: تحويل البيانات من الصف إلى الصفوف المستندة إلى الأعمدة باستخدام. لننظر في الاستعلام التالي: حدد البيانات مجموعة بوا...

برمجة نشر في 2025-05-09
كيف تسترجع أحدث مكتبة jQuery من Google APIs؟
لاسترداد أحدث إصدار ، كان هناك سابقًا بديلًا لاستخدام رقم إصدار معين ، والذي كان لاستخدام بناء الجملة التالي: /latest/jquery.js Budaps &&. للحصول...

برمجة نشر في 2025-05-09
كيفية تشغيل العمليات غير المتزامنة بشكل متزامن والتعامل مع الأخطاء بشكل صحيح في JavaScript؟
متزامنة في انتظار تنفيذ العملية getValue2Async () ؛ ينتظر هذا التنفيذ بشكل متتابع الانتهاء من كل عملية قبل بدء التشغيل التالي. لتمكين التنف...

برمجة نشر في 2025-05-09
كيفية التقاط وتدفق stdout في الوقت الحقيقي لتنفيذ أوامر chatbot؟
ومع ذلك ، تنشأ التحديات عند محاولة استرداد STDOUT في الوقت الفعلي. للتغلب على هذا ، نحتاج إلى طريقة لالتقاط ودفق STDOUT بشكل مستمر مع تنفيذ البرنا...

برمجة نشر في 2025-05-09
كيفية إعادة توجيه أنواع متعددة للمستخدمين (الطلاب والمعلمين والمسؤولين) إلى أنشطتهم في تطبيق Firebase؟
تسجيل الدخول. يدير الكود الحالي بنجاح إعادة التوجيه لنوعين للمستخدمين ، ولكنه يواجه تحديات عند محاولة دمج النوع الثالث (المسؤول). يخطط المخطط الأص...

برمجة نشر في 2025-05-09
خطأ المترجم "usr/bin/ld: لا يمكن العثور على -L" حل
-l يشير هذا الخطأ إلى أن الرابط لا يمكنه تحديد موقع المكتبة المحددة أثناء ربطك القابل للتنفيذ. لحل هذه المشكلة ، سوف نتعمق في تفاصيل كيفية تحدي...

برمجة نشر في 2025-05-09