Introduction:
Comprehensive testing is essential for data-driven applications, but it often relies on having the right datasets, which may not always be available. Whether you are developing web applications, machine learning models, or backend systems, realistic and structured data is crucial for proper validation and ensuring robust performance. Acquiring real-world data may be limited due to privacy concerns, licensing restrictions, or simply the unavailability of relevant data. This is where synthetic data becomes valuable.
In this blog, we will explore how Python can be used to generate synthetic data for different scenarios, including:
We’ll leverage the faker and pandas libraries to create realistic datasets for these use cases.
Example 1: Creating Synthetic Data for Customers and Orders (One-to-Many Relationship)
In many applications, data is stored in multiple tables with foreign key relationships. Let’s generate synthetic data for customers and their orders. A customer can place multiple orders, representing a one-to-many relationship.
Generating the Customers Table
The Customers table contains basic information such as CustomerID, name, and email address.
import pandas as pd from faker import Faker import random fake = Faker() def generate_customers(num_customers): customers = [] for _ in range(num_customers): customer_id = fake.uuid4() name = fake.name() email = fake.email() customers.append({'CustomerID': customer_id, 'CustomerName': name, 'Email': email}) return pd.DataFrame(customers) customers_df = generate_customers(10)
This code generates 10 random customers using Faker to create realistic names and email addresses.
Generating the Orders Table
Now, we generate the Orders table, where each order is associated with a customer through CustomerID.
def generate_orders(customers_df, num_orders): orders = [] for _ in range(num_orders): order_id = fake.uuid4() customer_id = random.choice(customers_df['CustomerID'].tolist()) product = fake.random_element(elements=('Laptop', 'Phone', 'Tablet', 'Headphones')) price = round(random.uniform(100, 2000), 2) orders.append({'OrderID': order_id, 'CustomerID': customer_id, 'Product': product, 'Price': price}) return pd.DataFrame(orders) orders_df = generate_orders(customers_df, 30)
In this case, the Orders table links each order to a customer using the CustomerID. Each customer can place multiple orders, forming a one-to-many relationship.
Example 2: Generating Hierarchical Data for Departments and Employees
Hierarchical data is often used in organizational settings, where departments have multiple employees. Let’s simulate an organization with departments, each of which has multiple employees.
Generating the Departments Table
The Departments table contains each department's unique DepartmentID, name, and manager.
def generate_departments(num_departments): departments = [] for _ in range(num_departments): department_id = fake.uuid4() department_name = fake.company_suffix() manager = fake.name() departments.append({'DepartmentID': department_id, 'DepartmentName': department_name, 'Manager': manager}) return pd.DataFrame(departments) departments_df = generate_departments(10)
Generating the Employees Table
Next, we generate theEmployeestable, where each employee is associated with a department via DepartmentID.
def generate_employees(departments_df, num_employees): employees = [] for _ in range(num_employees): employee_id = fake.uuid4() employee_name = fake.name() email = fake.email() department_id = random.choice(departments_df['DepartmentID'].tolist()) salary = round(random.uniform(40000, 120000), 2) employees.append({ 'EmployeeID': employee_id, 'EmployeeName': employee_name, 'Email': email, 'DepartmentID': department_id, 'Salary': salary }) return pd.DataFrame(employees) employees_df = generate_employees(departments_df, 100)
This hierarchical structure links each employee to a departmentthrough DepartmentID, forming a parent-child relationship.
Example 3: Simulating Many-to-Many Relationships for Course Enrollments
In certain scenarios, many-to-many relationships exist, where one entity relates to many others. Let’s simulate this with students enrolling in multiple courses, where each course has multiple students.
Generating the Courses Table
def generate_courses(num_courses): courses = [] for _ in range(num_courses): course_id = fake.uuid4() course_name = fake.bs().title() instructor = fake.name() courses.append({'CourseID': course_id, 'CourseName': course_name, 'Instructor': instructor}) return pd.DataFrame(courses) courses_df = generate_courses(20)
Generating the Students Table
def generate_students(num_students): students = [] for _ in range(num_students): student_id = fake.uuid4() student_name = fake.name() email = fake.email() students.append({'StudentID': student_id, 'StudentName': student_name, 'Email': email}) return pd.DataFrame(students) students_df = generate_students(50) print(students_df)
Generating the Course Enrollments Table
The CourseEnrollments table captures the many-to-many relationship between students and courses.
def generate_course_enrollments(students_df, courses_df, num_enrollments): enrollments = [] for _ in range(num_enrollments): enrollment_id = fake.uuid4() student_id = random.choice(students_df['StudentID'].tolist()) course_id = random.choice(courses_df['CourseID'].tolist()) enrollment_date = fake.date_this_year() enrollments.append({ 'EnrollmentID': enrollment_id, 'StudentID': student_id, 'CourseID': course_id, 'EnrollmentDate': enrollment_date }) return pd.DataFrame(enrollments) enrollments_df = generate_course_enrollments(students_df, courses_df, 200)
In this example, we create a linking table to represent many-to-many relationships between students and courses.
Conclusion:
Using Python and libraries like Faker and Pandas, you can generate realistic and diverse synthetic datasets to meet a variety of testing needs. In this blog, we covered:
These examples lay the foundation for generating synthetic data tailored to your needs. Further enhancements, such as creating more complex relationships, customizing data for specific databases, or scaling datasets for performance testing, can take synthetic data generation to the next level.
These examples provide a solid foundation for generating synthetic data. However, further enhancements can be made to increase complexity and specificity, such as:
If you like the article, please share it with your friends and colleagues. You can connect with me on LinkedIn to discuss any further ideas.
Disclaimer: All resources provided are partly from the Internet. If there is any infringement of your copyright or other rights and interests, please explain the detailed reasons and provide proof of copyright or rights and interests and then send it to the email: [email protected] We will handle it for you as soon as possible.
Copyright© 2022 湘ICP备2022001581号-3