Sharan Initiatives — AI, Finance, Photography & More

Introduction to Pandas

Pandas is the go-to library for data manipulation in Python. It provides two main data structures: Series (1D) and DataFrame (2D), making it easy to work with structured data.
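
As a minimal sketch of how the two structures relate (the names `ages` and `people` and their values are illustrative assumptions, not part of a real dataset), a DataFrame can be thought of as a set of Series sharing one index:

```python
import pandas as pd

# A Series is a one-dimensional labeled array
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"], name="Age")

# A DataFrame is a two-dimensional table; each column is a Series
people = pd.DataFrame({"Age": ages, "City": ["New York", "London", "Paris"]})

print(ages.ndim, people.ndim)        # number of dimensions of each structure
print(type(people["Age"]).__name__)  # selecting one column gives back a Series
```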

Why Pandas?

- **Easy data manipulation**: Filter, group, merge, and transform data effortlessly
- **Handles missing data**: Built-in functions for dealing with NaN values
- **Time series support**: Excellent for temporal data
- **Integration**: Works seamlessly with NumPy, Matplotlib, and scikit-learn
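
To make the missing-data and time series points concrete, here is a small sketch. The `readings` table and its values are hypothetical, and filling with the column mean is just one of several possible strategies:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with one missing value
readings = pd.DataFrame(
    {"value": [1.0, np.nan, 3.0, 4.0]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Missing data: count NaN entries, then replace them with the column mean
missing_count = int(readings["value"].isna().sum())
filled = readings["value"].fillna(readings["value"].mean())

# Time series: aggregate the daily values into 2-day bins
two_day_totals = filled.resample("2D").sum()

print(missing_count)
print(two_day_totals)
```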

Creating and Exploring DataFrames

Let's create and explore data using Pandas:

```python
import pandas as pd

# Create a DataFrame from a dictionary of columns
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Berlin'],
    'Salary': [70000, 80000, 90000, 75000, 85000]
}
df = pd.DataFrame(data)

print("DataFrame:")
print(df)
print(f"\nShape: {df.shape}")
print("\nInfo:")
df.info()  # info() prints its report directly and returns None
print("\nBasic Statistics:")
print(df.describe())
print("\nFirst 3 rows:")
print(df.head(3))
```

Output:

```
DataFrame:
      Name  Age       City  Salary
0    Alice   25   New York   70000
1      Bob   30     London   80000
2  Charlie   35      Paris   90000
3    David   28      Tokyo   75000
4      Eve   32     Berlin   85000

Shape: (5, 4)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Age     5 non-null      int64
 2   City    5 non-null      object
 3   Salary  5 non-null      int64
dtypes: int64(2), object(2)

Basic Statistics:
             Age        Salary
count   5.000000      5.000000
mean   30.000000  80000.000000
std     3.807887   7905.694150
min    25.000000  70000.000000
25%    28.000000  75000.000000
50%    30.000000  80000.000000
75%    32.000000  85000.000000
max    35.000000  90000.000000

First 3 rows:
      Name  Age      City  Salary
0    Alice   25  New York   70000
1      Bob   30    London   80000
2  Charlie   35     Paris   90000
```

Data Selection and Filtering

Select, filter, and manipulate data:

```python
# Select a single column (Series) and multiple columns (DataFrame)
print("Names:", df['Name'].tolist())
print("\nName and Age:")
print(df[['Name', 'Age']])

# Filter rows with a boolean mask
high_salary = df[df['Salary'] > 80000]
print("\nHigh Salary Employees:")
print(high_salary)

# Combine conditions with & (and); each condition needs its own parentheses
young_high_earners = df[(df['Age'] < 32) & (df['Salary'] > 70000)]
print("\nYoung High Earners:")
print(young_high_earners)

# Sort rows by a column, highest value first
sorted_df = df.sort_values('Salary', ascending=False)
print("\nSorted by Salary (desc):")
print(sorted_df)
```

Output:

```
Names: ['Alice', 'Bob', 'Charlie', 'David', 'Eve']

Name and Age:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   28
4      Eve   32

High Salary Employees:
      Name  Age    City  Salary
2  Charlie   35   Paris   90000
4      Eve   32  Berlin   85000

Young High Earners:
    Name  Age     City  Salary
1    Bob   30   London   80000
3  David   28    Tokyo   75000

Sorted by Salary (desc):
      Name  Age      City  Salary
2  Charlie   35     Paris   90000
4      Eve   32    Berlin   85000
1      Bob   30    London   80000
3    David   28     Tokyo   75000
0    Alice   25  New York   70000
```
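
The same kinds of selection can also be written with `.loc`, `.isin`, and `.query`. The sketch below rebuilds the sample table so it runs on its own; the variable names `young_names`, `europe`, and `query_result` and the specific filters are illustrative choices, not part of the original walkthrough:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, 28, 32],
    "City": ["New York", "London", "Paris", "Tokyo", "Berlin"],
    "Salary": [70000, 80000, 90000, 75000, 85000],
})

# .loc selects rows by boolean mask and columns by label in one step
young_names = df.loc[df["Age"] < 30, ["Name", "Salary"]]

# .isin tests each value for membership in a list
europe = df[df["City"].isin(["London", "Paris", "Berlin"])]

# .query expresses a row filter as a string
query_result = df.query("Age < 32 and Salary > 70000")

print(young_names)
print(europe)
print(query_result)
```

Whether `.query` or a boolean mask reads better is largely a matter of taste; the mask form is easier to build programmatically, while the string form can be easier to scan.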

Grouping and Aggregation

Perform group-wise operations:

```python
# Add a department column
df['Department'] = ['IT', 'IT', 'Finance', 'Finance', 'IT']

# Group rows by department
grouped = df.groupby('Department')
print("Average Salary by Department:")
print(grouped['Salary'].mean())

# Apply several aggregations per column in one call
print("\nMultiple Aggregations:")
agg_result = grouped.agg({
    'Salary': ['mean', 'min', 'max'],
    'Age': ['mean', 'count']
})
print(agg_result)
```

Output:

```
Average Salary by Department:
Department
Finance    82500.000000
IT         78333.333333
Name: Salary, dtype: float64

Multiple Aggregations:
            Salary                 Age
              mean    min    max  mean count
Department
Finance    82500.0  75000  90000  31.5     2
IT         78333.3  70000  85000  29.0     3
```
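
The dictionary form of `agg` produces two-level column headers. If flat column names are preferred, named aggregation is one alternative. The sketch below rebuilds the sample table so it runs on its own, and the output column names (`avg_salary`, `max_salary`, `headcount`) are arbitrary labels chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, 28, 32],
    "Salary": [70000, 80000, 90000, 75000, 85000],
    "Department": ["IT", "IT", "Finance", "Finance", "IT"],
})

# Named aggregation: output_column=(input_column, aggregation_function)
summary = df.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Name", "count"),
)

print(summary)
```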
