Data processing

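GitHub social-collaboration data lets us analyze Jia Tan's activity patterns and their potential impact on the project in depth. We focus in particular on the push records, which include the timestamp of each commit, the commit message, and the code changes before and after the commit. This information helps us understand Jia Tan's contribution frequency, commit habits, and the scope of the code changes.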

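To analyze Jia Tan's contributions to the xz project on GitHub, we preprocessed the downloaded push records, which are in JSON format. Using Python's json and pandas libraries, we first load the JSON data into a DataFrame and convert its timestamps to datetime values, which makes the data easier to analyze. We then select all push records related to the xz project and use matplotlib to visualize their frequency, comparing the xz-related push frequency with the frequency of all pushes so as to focus more directly on Jia Tan's contributions to this project. This time-series view of the records shows Jia Tan's push frequency in different periods, from which we assess how much attention Jia Tan paid to the xz project.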

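Building on the data analysis, we systematically summarize the concept of social engineering attacks and the common attack techniques, and propose a comprehensive defense framework intended as a useful reference for future cybersecurity research and practice. For details, see the report "Report_基于 XZ 事件的社会工程攻击策略及其防范措施" (Social Engineering Attack Strategies and Countermeasures Based on the XZ Incident).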

How to set up the runtime environment

Programming language and version

Python 3.x.

Installing the required libraries

json: part of the Python standard library, so no installation is needed. It is used to handle data in JSON format.

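pandas: install with the command pip install pandas. It provides efficient data structures and data analysis tools, and is used in this code to read, process, and transform the data.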

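matplotlib: install with the command pip install matplotlib. It is a library for drawing many kinds of charts, and is used in this code to draw the line charts and set their properties.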
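The scripts further down also import requests, bs4, and textblob, which are not listed above. A minimal sketch for checking which third-party packages are available before running anything; the pip package names in the mapping are the usual ones, and the helper name missing_packages is hypothetical:

```python
import importlib.util

# Third-party packages imported somewhere in this README's scripts,
# mapped from import name to the usual pip package name.
REQUIRED = {
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "textblob": "textblob",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

Only the first two packages are needed for data_clean.py and data_process.py; the others are used by the mailing-list scripts.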

How to run the code and reproduce the results presented in the report

1. Prepare Jia Tan's GitHub push records (PushEvent.json) together with data_clean.py and data_process.py, and place them all in the same directory.

2. Run data_clean.py to clean the data; this generates xzPushEvent.json.

3. Run data_process.py to obtain the chart that visualizes Jia Tan's push frequency.
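Before starting with step 2, it may help to confirm that the three files from step 1 really are in one directory. A small sketch, assuming the file names given above; the helper name missing_inputs is hypothetical:

```python
from pathlib import Path

# Files that step 1 says must sit in the same directory.
REQUIRED_FILES = ("PushEvent.json", "data_clean.py", "data_process.py")


def missing_inputs(directory="."):
    """Return the required files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


print("Missing files:", missing_inputs() or "none")
```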

Code

Data cleaning

Keep only the records whose repository name contains "xz", and save them as a separate JSON file.

data_clean.py

import json

with open('PushEvent.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

filtered_data = [item for item in data if "xz" in item["repo"]["name"]]

with open('xzPushEvent.json', 'w') as f:
    json.dump(filtered_data, f, indent=2)
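To see what the filter keeps, here is a small sketch that applies the same substring test to a few made-up records. It assumes that repo.name has the "owner/repository" form used by GitHub event data, so the test also matches when "xz" appears in the owner part. The function name filter_by_repo and the sample records are hypothetical:

```python
def filter_by_repo(events, keyword="xz"):
    """Keep events whose repo name contains `keyword` (same test as data_clean.py)."""
    return [event for event in events if keyword in event["repo"]["name"]]


# Made-up sample records; only the fields the filter reads are included.
sample = [
    {"repo": {"name": "example-owner/xz"}},
    {"repo": {"name": "example-owner/other-project"}},
    {"repo": {"name": "xzfan/unrelated"}},  # matched through the owner part
]

for event in filter_by_repo(sample):
    print(event["repo"]["name"])
```

If matches on the owner name are not wanted, one option is to compare only the part after the slash.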

Data mining

Read the data and count the push records by month.

data_process.py

import json
import pandas as pd
import matplotlib.pyplot as plt

# Process the first JSON file
with open('xzPushEvent.json', 'r', encoding='utf-8') as file:
    events1 = json.load(file)

df1 = pd.DataFrame(events1)
df1['created_at'] = pd.to_datetime(df1['created_at'])
df1['year'] = df1['created_at'].dt.year
df1['month'] = df1['created_at'].dt.month

monthly_counts1 = df1.groupby(['year', 'month']).size()
monthly_df1 = monthly_counts1.reset_index(name='count')

# Process the second JSON file
with open('PushEvent.json', 'r', encoding='utf-8') as file:
    events2 = json.load(file)

df2 = pd.DataFrame(events2)
df2['created_at'] = pd.to_datetime(df2['created_at'])
df2['year'] = df2['created_at'].dt.year
df2['month'] = df2['created_at'].dt.month

monthly_counts2 = df2.groupby(['year', 'month']).size()
monthly_df2 = monthly_counts2.reset_index(name='count')
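The same monthly aggregation can be sketched without pandas, which is handy for checking the counts on a few records by hand. This is not the code used for the report. It assumes created_at strings of the form "2000-01-05T10:00:00Z"; the function name monthly_counts and the sample records are hypothetical:

```python
from collections import Counter
from datetime import datetime


def monthly_counts(events, fmt="%Y-%m-%dT%H:%M:%SZ"):
    """Count events per (year, month) and return the pairs in chronological order."""
    counts = Counter()
    for event in events:
        created = datetime.strptime(event["created_at"], fmt)
        counts[(created.year, created.month)] += 1
    return sorted(counts.items())


# Made-up sample records, for illustration only.
sample = [
    {"created_at": "2000-01-05T10:00:00Z"},
    {"created_at": "2000-01-20T18:30:00Z"},
    {"created_at": "2000-03-02T08:15:00Z"},
]
print(monthly_counts(sample))
```

As with groupby(...).size(), months without any event do not appear in the result, so the plotted line connects only the months that have data.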

Data visualization

Use matplotlib to visualize the frequency of the push records.

data_process.py

# Draw the chart
plt.figure(figsize=(10, 6))

# Data from the second file
plt.plot(monthly_df2['year'].astype(str) + '-' + monthly_df2['month'].astype(str),
         monthly_df2['count'], marker='o', label='allPushEventByJiaTan')


# Data from the first file
plt.plot(monthly_df1['year'].astype(str) + '-' + monthly_df1['month'].astype(str),
         monthly_df1['count'], marker='o', label='xzPushEventByJiaTan')

plt.xlabel('Month')
plt.ylabel('Number of Events')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)

plt.annotate('Jia Tan becomes maintainer',
            xy=(13, 24),
            xytext=(3,30),
            color="red",
            fontsize=15,
            arrowprops=dict(facecolor='black', shrink=0.05))

plt.annotate('xz-5.6.0 released',
            xy=(29, 123),
            xytext=(20,123),
            color="red",
            fontsize=15,
            arrowprops=dict(facecolor='black', shrink=0.05))

plt.show()
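Because the x-values are strings, matplotlib places them on a categorical axis at positions 0, 1, 2, ... in order of first appearance, so the hard-coded xy values in the annotations are label positions, not dates. A sketch of how such a position could be looked up from a year and month instead of being counted by hand; the helper names and the sample values are hypothetical:

```python
def month_label(year, month):
    """Build the same 'year-month' label that the plotting code puts on the x-axis."""
    return f"{year}-{month}"


def annotation_xy(labels, counts, year, month):
    """Return (x, y) for plt.annotate, where x is the label's position on the axis."""
    index = labels.index(month_label(year, month))
    return index, counts[index]


# Made-up labels and counts, for illustration only.
labels = ["2000-11", "2000-12", "2001-1", "2001-2"]
counts = [4, 7, 24, 9]
print(annotation_xy(labels, counts, 2001, 1))
```

In the script above, the labels would come from the series that is plotted first, since that series determines the order of the positions on the axis.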

Results

Jia Tan GitHub push frequency statistics

Additional code

Crawling the XZ mailing list

Crawl the XZ mailing list hosted at mail-archive.com/xz-devel@tukaani.org and save each message as a txt file.

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time

# Format of the original date-time string
original_format = "%a, %d %b %Y %H:%M:%S %z"
# Format of the new date-time string
new_format = "%Y.%m.%d %H%M%S"
headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0"
    }
ulrstd='https://www.mail-archive.com/xz-devel@tukaani.org/'

email_cnt=391
while(email_cnt<=694):
    email_cnt+=1
    ulr = ulrstd+'msg'+str(email_cnt).zfill(5)+'.html'
    resp = requests.get(ulr,headers=headers)
    soup = BeautifulSoup(resp.text,'html.parser')
    txt_title = ''
    txt_body = ''
    sisters = soup.select("span.date")
    for sister in sisters:
        # Convert the original date-time string to a datetime object
        dt = datetime.strptime(sister.get_text(), original_format)
        # Convert the datetime object to the new format
        txt_title = dt.strftime(new_format)

    name_spans = soup.find_all('span', {'itemprop': 'name'})
    for span in name_spans:
        #print(span.get_text())
        txt_title+=' '+span.get_text()

    print(txt_title)
    pre_tags = soup.find_all('pre')
    for pre in pre_tags:
        #print(pre.get_text())
        txt_body+=pre.get_text()+'\n'
    txt_title=txt_title.replace(':',' ')
    txt_title=txt_title.replace('\\',' ')
    txt_title=txt_title.replace('/',' ')
    txt_title=txt_title.replace('?',' ')
    txt_title=txt_title.replace('*',' ')
    txt_title=txt_title.replace('"',' ')
    txt_title+='.txt'

    with open(txt_title, mode="w", encoding="utf-8") as file:
        file.write(txt_body)
    time.sleep(4)
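The file-naming part of the crawler can be tried out on its own, without any network access. The sketch below reuses the two date formats from the script and replaces the chain of replace calls with one regular expression. As an extension that the script does not have, it also replaces <, > and |, which Windows rejects in file names as well. The function name build_filename and the sample header values are hypothetical:

```python
import re
from datetime import datetime

ORIGINAL_FORMAT = "%a, %d %b %Y %H:%M:%S %z"
NEW_FORMAT = "%Y.%m.%d %H%M%S"
# Characters that Windows does not allow in file names.
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def build_filename(date_text, names):
    """Return '<date> <names...>.txt' with illegal characters replaced by spaces."""
    stamp = datetime.strptime(date_text, ORIGINAL_FORMAT).strftime(NEW_FORMAT)
    title = " ".join([stamp, *names])
    return ILLEGAL_CHARS.sub(" ", title) + ".txt"


# Made-up header values, for illustration only.
print(build_filename("Sat, 01 Jan 2000 12:00:00 +0000",
                     ["Re: example subject", "Example Sender"]))
```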

Text analysis

Before running the program, select the txt emails from Jigar Kumar and Denis Excoffier out of the crawled mailing list and place them in the same folder as the program.

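By analyzing these key emails, we can not only identify the main points and positions they express but also better understand the sentiment of each party during the exchange, thereby revealing how Jia Tan obtained developer permissions and the social engineering behind it.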
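The selection step described above could be sketched as a filter on file names. This assumes that the crawler's file names contain the sender's name, which depends on what the name spans on the archive page hold, so the result should be checked by hand. The function name select_by_sender and the sample file names are hypothetical:

```python
def select_by_sender(filenames, senders):
    """Return the .txt files whose name mentions one of the given senders."""
    wanted = [sender.lower() for sender in senders]
    return sorted(
        name
        for name in filenames
        if name.endswith(".txt") and any(s in name.lower() for s in wanted)
    )


# Made-up file names in the crawler's '<date> <title>.txt' style.
files = [
    "2000.01.01 120000 Re  example Jigar Kumar.txt",
    "2000.01.02 101500 Re  example Lasse Collin.txt",
    "2000.01.03 090000 example Denis Excoffier.txt",
    "notes.md",
]
print(select_by_sender(files, ["Jigar Kumar", "Denis Excoffier"]))
```

The selected files can then be copied into the folder that holds the analysis script.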

# Must be prepared in advance
import matplotlib.pyplot as plt
import os
from textblob import TextBlob
import re

def clean_text(text):
    """Clean the input text: convert it to lowercase and remove everything except letters and whitespace."""
    return re.sub(r'[^a-zA-Z\s]', '', text.lower())

def analyze_sentiment(text):
    """Analyze the sentiment of the given text with TextBlob."""
    analysis = TextBlob(text)
    polarity = analysis.sentiment.polarity
    subjectivity = analysis.sentiment.subjectivity
    return polarity, subjectivity

def plot_sentiment(polarity_list, subjectivity_list, short_filenames):
    """Draw a line chart of each file's sentiment polarity and subjectivity; the x-axis shows the first ten characters of the file name."""
    fig, ax1 = plt.subplots(figsize=(10, 5))

    # Plot subjectivity
    ax1.plot(short_filenames, subjectivity_list, marker='o', linestyle='-', color='skyblue', label='主观性')
    ax1.set_xlabel('日期')
    ax1.set_ylabel('主观性值（0是客观，1是非常主观）', color='skyblue')
    ax1.tick_params(axis='y', labelcolor='skyblue')
    ax1.set_ylim(0,1)

    # Create a second y-axis
    ax2 = ax1.twinx()
    # Plot sentiment polarity
    ax2.plot(short_filenames, polarity_list, marker='s', linestyle='-', color='red', label='情感极性')
    ax2.set_ylabel('情感极性值（-1是最消极，1是最积极）', color='red')
    ax2.tick_params(axis='y', labelcolor='red')
    ax2.set_ylim(-1,1)

    plt.title('Jigar Kumar和Denis Excoffier邮件的情感极性和主观性')
    plt.xticks(rotation=45)  # Rotate the x-axis labels in case the file names are long
    plt.tight_layout()  # Adjust the layout to fit all elements
    fig.legend(loc="upper right")  # Add the legend
    plt.show()

if __name__ == "__main__":
    directory = '.'
    polarity_list = []
    subjectivity_list = []
    short_filename_list = []

    plt.rcParams['font.family'] = 'SimHei'

    for filename in sorted(os.listdir(directory)):  # sorted so the date-prefixed files are plotted in chronological order
        if filename.endswith('.txt'):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r', encoding='utf-8') as file:
                content = file.read()
                polarity, subjectivity = analyze_sentiment(content)
                polarity_list.append(polarity)
                subjectivity_list.append(subjectivity)
                short_filename_list.append(filename[:10])  # Keep only the first ten characters of the file name, i.e. the date

    plot_sentiment(polarity_list, subjectivity_list, short_filename_list)

Sentiment polarity and subjectivity of the emails from Jigar Kumar and Denis Excoffier

Lasse Collin email frequency analysis

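Before running the program, select Lasse Collin's txt emails from the crawled mailing list and place them in the same folder as the program. The result shows how much attention Lasse Collin paid to the XZ project.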

import os
import matplotlib.pyplot as plt
from collections import defaultdict

def count_files_by_year():
    # Create a dictionary to store the number of files per year
    yearly_counts = defaultdict(int)

    # Get the current working directory
    current_directory = os.getcwd()

    # Iterate over all files in the current directory
    for filename in os.listdir(current_directory):
        if filename.endswith('.txt'):
            # Extract the date part of the file name
            try:
                date_part = filename[:10]
                year = int(date_part.split('.')[0])

                # Update the count
                yearly_counts[year] += 1
            except ValueError:
                print(f"Skipping file {filename}: Invalid date format.")

    return dict(yearly_counts)

def plot_years(yearly_counts):
    # Sort by year so the line is drawn in chronological order
    years = sorted(yearly_counts)
    counts = [yearly_counts[year] for year in years]

    plt.figure(figsize=(10, 6))
    plt.plot(years, counts, color='skyblue')
    plt.xlabel('Year')
    plt.ylabel('Number of Emails')
    plt.title('Number of Emails Sent by Lasse Collin per Year')
    plt.xticks(years)  # Make sure every year has a label
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()

# Call the functions and draw the chart
yearly_counts = count_files_by_year()
plot_years(yearly_counts)

Frequency of emails sent by Lasse Collin