Creating a log parser in Python can help you extract meaningful insights from server logs, such as errors, response times, and traffic patterns. For this example, let’s assume we’re working with Apache or Nginx server logs, which typically look like this:
127.0.0.1 - - [10/Sep/2024:13:35:48 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"
Log Parsing Python Program
Here’s how you can develop a log parser to extract meaningful insights:
Step-by-Step Breakdown:
- Parse the log entries: Extract elements like IP address, date, HTTP method, status code, and user-agent.
- Analyze the logs: Count status codes, track response times, and find patterns like the most visited pages.
Python Program to Parse Logs
import re
from collections import Counter
from datetime import datetime

# Regular expression to parse Apache or Nginx log format
log_pattern = re.compile(
    r'(?P<ip>\S+) - - \[(?P<date>.*?)\] "(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+) "(?P<referer>.*?)" "(?P<user_agent>.*?)"'
)

# Function to parse a single log line
def parse_log_line(line):
    match = log_pattern.match(line)
    if match:
        return match.groupdict()
    return None

# Function to parse the log file
def parse_log_file(file_path):
    log_data = []
    with open(file_path, 'r') as file:
        for line in file:
            parsed_line = parse_log_line(line)
            if parsed_line:
                log_data.append(parsed_line)
    return log_data

# Function to analyze logs and extract insights
def analyze_logs(log_data):
    # Count occurrences of different HTTP status codes
    status_codes = Counter(entry['status'] for entry in log_data)

    # Count requests per IP
    ip_counts = Counter(entry['ip'] for entry in log_data)

    # Count most frequently accessed URLs
    url_counts = Counter(entry['url'] for entry in log_data)

    # Analyze logs by time (hourly traffic analysis)
    hourly_traffic = Counter(
        datetime.strptime(entry['date'], "%d/%b/%Y:%H:%M:%S %z").hour
        for entry in log_data
    )

    # Most common user agents
    user_agents = Counter(entry['user_agent'] for entry in log_data)

    # Extract insights
    insights = {
        "total_requests": len(log_data),
        "status_codes": status_codes,
        "top_ips": ip_counts.most_common(5),
        "top_urls": url_counts.most_common(5),
        "hourly_traffic": hourly_traffic,
        "top_user_agents": user_agents.most_common(5)
    }
    return insights

# Function to display insights
def display_insights(insights):
    print("Total Requests:", insights['total_requests'])

    print("\nStatus Code Distribution:")
    for status, count in insights['status_codes'].items():
        print(f"  {status}: {count}")

    print("\nTop 5 IP Addresses:")
    for ip, count in insights['top_ips']:
        print(f"  {ip}: {count} requests")

    print("\nTop 5 Requested URLs:")
    for url, count in insights['top_urls']:
        print(f"  {url}: {count} requests")

    print("\nHourly Traffic:")
    for hour, count in insights['hourly_traffic'].items():
        print(f"  Hour {hour}: {count} requests")

    print("\nTop 5 User Agents:")
    for agent, count in insights['top_user_agents']:
        print(f"  {agent}: {count} requests")

# Example Usage
log_file_path = 'server.log'  # Path to your server log file
parsed_logs = parse_log_file(log_file_path)
insights = analyze_logs(parsed_logs)
display_insights(insights)
Program Explanation
- Regex to Parse Logs:
- The regular expression is designed to extract meaningful parts of a server log line, such as:
- ip: The client’s IP address.
- date: The timestamp of the request.
- method: HTTP method (GET, POST, etc.).
- url: The requested URL.
- status: The HTTP status code (200, 404, etc.).
- size: The size of the response.
- referer: The referring URL.
- user_agent: The browser or device making the request.
- Parsing a Single Log Line:
- The function parse_log_line() applies the regular expression to extract information from a single log line (a quick demo follows this list).
- Parsing the Log File:
- The function parse_log_file() reads the log file line by line and parses each line using the parse_log_line() function.
- Analyzing the Logs:
- Status Codes: Counts how many requests returned each HTTP status code (e.g., 200, 404).
- Top IPs: Counts the number of requests made by each IP address.
- Top URLs: Identifies the most requested URLs.
- Hourly Traffic: Tracks how many requests were made each hour of the day.
- User Agents: Lists the most common user agents.
- Displaying Insights:
- The function display_insights() neatly prints the extracted data, such as total requests, the distribution of status codes, top IPs, top requested URLs, and hourly traffic patterns.
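For example, calling parse_log_line() on the sample line from the introduction returns a dictionary like this:
sample = '127.0.0.1 - - [10/Sep/2024:13:35:48 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"'
print(parse_log_line(sample))
# {'ip': '127.0.0.1', 'date': '10/Sep/2024:13:35:48 +0000', 'method': 'GET',
#  'url': '/index.html', 'status': '200', 'size': '512', 'referer': '-',
#  'user_agent': 'Mozilla/5.0'}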
Example Log Insights
Total Requests: 12345
Status Code Distribution:
200: 9500
404: 850
500: 45
Top 5 IP Addresses:
192.168.1.10: 450 requests
203.0.113.54: 320 requests
172.16.254.1: 290 requests
Top 5 Requested URLs:
/index.html: 1200 requests
/contact.html: 800 requests
/about.html: 700 requests
Hourly Traffic:
Hour 13: 1230 requests
Hour 14: 1150 requests
Hour 15: 980 requests
Top 5 User Agents:
Mozilla/5.0 (Windows NT 10.0): 8000 requests
curl/7.68.0: 1500 requests
Extensions to the Program
- Filter by Date: You can add a filter to analyze only the logs that fall within a specific date range (first sketch below).
- Response Time Analysis: If your logs contain response times, you can extend the program to calculate the average response time per URL (second sketch below).
- Error Log Analysis: You can specifically track and analyze error responses, e.g. all 4xx and 5xx status codes (third sketch below).
- Visualization: Use libraries like matplotlib or seaborn to visualize the data, such as traffic by hour or the distribution of status codes (fourth sketch below).
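A minimal sketch of the date filter, reusing parsed_logs and analyze_logs() from the program above; the filter_by_date helper and the date bounds are illustrative, not part of the original program:
from datetime import datetime, timezone

# Keep only entries whose timestamp falls within [start, end]
def filter_by_date(log_data, start, end):
    filtered = []
    for entry in log_data:
        ts = datetime.strptime(entry['date'], "%d/%b/%Y:%H:%M:%S %z")
        if start <= ts <= end:
            filtered.append(entry)
    return filtered

# Example: analyze only the traffic from 10 September 2024
start = datetime(2024, 9, 10, tzinfo=timezone.utc)
end = datetime(2024, 9, 11, tzinfo=timezone.utc)
insights = analyze_logs(filter_by_date(parsed_logs, start, end))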
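For response times, the sketch below assumes your log format ends with a response-time field (for example Nginx’s $request_time) and that the regex has been extended with an extra (?P<response_time>\S+) group to capture it; both are assumptions, not part of the log format shown above:
from collections import defaultdict

# Average response time per URL; assumes each entry has a 'response_time' field
def average_response_times(log_data):
    totals = defaultdict(lambda: [0.0, 0])  # url -> [sum of times, request count]
    for entry in log_data:
        totals[entry['url']][0] += float(entry['response_time'])
        totals[entry['url']][1] += 1
    return {url: total / count for url, (total, count) in totals.items()}

# Print the five slowest URLs by average response time
for url, avg in sorted(average_response_times(parsed_logs).items(), key=lambda x: -x[1])[:5]:
    print(f"  {url}: {avg:.3f}s average")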
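Error analysis can reuse the parsed entries directly; this sketch (the analyze_errors helper is illustrative) counts 4xx/5xx responses grouped by status code and URL:
from collections import Counter

# Collect 4xx/5xx entries and count them by (status, url)
def analyze_errors(log_data):
    errors = [e for e in log_data if e['status'].startswith(('4', '5'))]
    return Counter((e['status'], e['url']) for e in errors)

for (status, url), count in analyze_errors(parsed_logs).most_common(10):
    print(f"  {status} {url}: {count} occurrences")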
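And a visualization sketch for the hourly traffic, assuming matplotlib is installed (pip install matplotlib):
import matplotlib.pyplot as plt

# Bar chart of requests per hour of day, from the insights dict built earlier
hours = sorted(insights['hourly_traffic'])
counts = [insights['hourly_traffic'][h] for h in hours]

plt.bar(hours, counts)
plt.xlabel("Hour of day")
plt.ylabel("Requests")
plt.title("Hourly Traffic")
plt.show()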
This log parser is flexible enough to provide meaningful insights into your server’s traffic and performance, and it can be extended further based on specific requirements.