Master data manipulation and analysis with Pandas
Pandas is the go-to library for data manipulation in Python. It provides two main data structures: Series (1D) and DataFrame (2D), making it easy to work with structured data.
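To make the two structures concrete, here is a minimal sketch. The `people` table and its values are made up for illustration; the point is only that a DataFrame column is itself a Series.

```python
import pandas as pd

# A DataFrame is a two-dimensional labeled table (illustrative values).
people = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["New York", "London", "Paris"]},
    index=["Alice", "Bob", "Charlie"],
)

# Selecting a single column returns a one-dimensional Series.
ages = people["Age"]

print(ages.ndim)                  # 1
print(people.ndim)                # 2
print(type(ages).__name__)        # Series
print(people.loc["Bob", "City"])  # London
```

Both structures carry an index (here, the names), which is what makes label-based lookups such as `people.loc["Bob", "City"]` possible.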
- **Easy data manipulation**: Filter, group, merge, and transform data effortlessly
- **Handles missing data**: Built-in functions for dealing with NaN values
- **Time series support**: Excellent for temporal data
- **Integration**: Works seamlessly with NumPy, Matplotlib, and scikit-learn
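As a quick illustration of the missing-data point, the sketch below counts, fills, and drops NaN values. The `scores` table is hypothetical, and filling with the column mean is just one possible strategy, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score (values are for illustration only).
scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [88.0, np.nan, 92.0],
})

# Count missing values per column.
missing_per_column = scores.isna().sum()

# Option 1: replace NaN with the column mean (mean() skips NaN by default).
filled = scores.fillna({"Score": scores["Score"].mean()})

# Option 2: drop rows that contain any NaN.
dropped = scores.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```

Neither `fillna` nor `dropna` modifies `scores` here; each returns a new DataFrame.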
Let's create and explore data using Pandas:
import pandas as pd
import numpy as np
# Create DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Berlin'],
    'Salary': [70000, 80000, 90000, 75000, 85000]
}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print(f"\nShape: {df.shape}")
print("\nInfo:")
df.info()  # info() prints its report directly and returns None
print("\nBasic Statistics:")
print(df.describe())
print("\nFirst 3 rows:")
print(df.head(3))

Output:
DataFrame:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 London 80000
2 Charlie 35 Paris 90000
3 David 28 Tokyo 75000
4 Eve 32 Berlin 85000
Shape: (5, 4)
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Age 5 non-null int64
2 City 5 non-null object
3 Salary 5 non-null int64
dtypes: int64(2), object(2)
Basic Statistics:
Age Salary
count 5.000000 5.000000
mean 30.000000 80000.000000
std 3.807887 7905.694150
min 25.000000 70000.000000
25% 28.000000 75000.000000
50% 30.000000 80000.000000
75% 32.000000 85000.000000
max 35.000000 90000.000000
First 3 rows:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 London 80000
2 Charlie 35 Paris 90000

Select, filter, and manipulate data:
# Select columns
print("Names:", df['Name'].tolist())
print("\nName and Age:")
print(df[['Name', 'Age']])
# Filter rows
high_salary = df[df['Salary'] > 80000]
print("\nHigh Salary Employees:")
print(high_salary)
# Multiple conditions
young_high_earners = df[(df['Age'] < 32) & (df['Salary'] > 70000)]
print("\nYoung High Earners:")
print(young_high_earners)
# Sorting
sorted_df = df.sort_values('Salary', ascending=False)
print("\nSorted by Salary (desc):")
print(sorted_df)

Output:
Names: ['Alice', 'Bob', 'Charlie', 'David', 'Eve']
Name and Age:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
3 David 28
4 Eve 32
High Salary Employees:
Name Age City Salary
2 Charlie 35 Paris 90000
4 Eve 32 Berlin 85000
Young High Earners:
Name Age City Salary
1 Bob 30 London 80000
3 David 28 Tokyo 75000
Sorted by Salary (desc):
Name Age City Salary
2 Charlie 35 Paris 90000
4 Eve 32 Berlin 85000
1 Bob 30 London 80000
3 David 28 Tokyo 75000
0 Alice 25 New York 70000

Perform group-wise operations:
# Add department column
df['Department'] = ['IT', 'IT', 'Finance', 'Finance', 'IT']
# Group by department
grouped = df.groupby('Department')
print("Average Salary by Department:")
print(grouped['Salary'].mean())
print("\nMultiple Aggregations:")
agg_result = grouped.agg({
    'Salary': ['mean', 'min', 'max'],
    'Age': ['mean', 'count']
})
print(agg_result)

Output:
Average Salary by Department:
Department
Finance 82500.000000
IT 78333.333333
Name: Salary, dtype: float64
Multiple Aggregations:
             Salary                 Age
               mean    min    max  mean count
Department
Finance     82500.0  75000  90000  31.5     2
IT          78333.3  70000  85000  29.0     3
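The dictionary form of `agg` produces two-level column headers, which can be awkward to work with downstream. One alternative is named aggregation, sketched below. The output column names (`avg_salary`, `max_salary`, `headcount`) are illustrative choices, and the DataFrame is rebuilt here only so the snippet runs on its own.

```python
import pandas as pd

# Same data as above, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, 28, 32],
    "Salary": [70000, 80000, 90000, 75000, 85000],
    "Department": ["IT", "IT", "Finance", "Finance", "IT"],
})

# Named aggregation: new_column=(source_column, function).
# The result has flat, single-level column names.
summary = df.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Name", "count"),
)

print(summary)
```

Because the columns are flat, a value can be read with a plain lookup such as `summary.loc["IT", "avg_salary"]`, with no tuple keys required.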