CMU15445P0实验笔记:实现基于copy on write tree的k-v存储引擎

发表于 2024-03-15 更新于 2024-03-27 分类于数据库， CMU15445 阅读次数：本文字数： 13k 阅读时长 ≈ 12 分钟

写在前面：因为对C++ 17的特性不太了解，写的很痛苦,感觉对C++的智能指针，模板等新特性有了更深入的了解，收获颇多。

task1：Copy-On-Write Trie

一开始的树节结构是一个前缀树的结构（详见数据结构与算法前缀树的笔记）

修改 trie.h 和 trie.cpp来实现copy-on-write trie. 什么是 copy-on-write trie？对树的操作不需要直接修改原来的树节点，而是创建新的节点并返回新的root节点。例如插入：("ad", 2) ，先创建一个新的节点： Node2 ，NODE2会重用原来树的孩子节点，返回新的根节点

插入 ("a", "abc") 并且移除("ab", 1)：

尝试完成三种操作

Get(key): 获取key对应的value值
Put(key, value): 设置key对应的value值.如果key值已经存在则覆盖. Note that the type of the value might be non-copyable (i.e., std::unique_ptr<int>). This method returns a new trie.

在 C++ 中，有一些类型是不允许直接进行复制的，而是通过移动或者转移所有权的方式来操作。std::unique_ptr 就是这样的一个例子。它表示一个独占所有权的指针，意味着同一时刻只有一个指针可以拥有它所指向的资源。由于其独占性质，它不允许直接进行复制，而是通过移动（move）操作来传递所有权。
Delete(key): Delete the value for the key. This method returns a new trie.

创建新的node节点,应该使用 TrieNode class里面的Clone 函数. To reuse an existing node in the new trie, you can copy std::shared_ptr<TrieNode>: copying a shared pointer doesn’t copy the underlying data.

不能通过new和delete人工的分配内存 std::shared_ptr will deallocate the object when no one has a reference to the underlying object.

实现过程中的学习笔记

tree class中根节点的定义：std::shared_ptr<const TrieNode> root_{nullptr};

这行代码是在 C++ 中声明了一个类成员变量 root_，它是一个 std::shared_ptr 智能指针，指向 const TrieNode 类型的对象，初始值为 nullptr。

std::shared_ptr：std::shared_ptr 是 C++ 标准库提供的一个智能指针类，用于管理动态分配的内存。它提供了自动内存管理，可以确保在不再需要时正确释放内存，以避免内存泄漏。std::shared_ptr 允许多个指针共享对同一对象的所有权，并且会在所有指向对象的 std::shared_ptr 都被销毁时自动释放内存。
root_{nullptr}：这是成员变量 root_ 的初始化方式，使用了 C++11 中的成员初始化列表。nullptr 表示空指针，即初始时 root_ 不指向任何有效对象。

1 2	// Create a new trie with the given root. explicit Trie(std::shared_ptr<const TrieNode> root) : root_(std::move(root)) {}

explicit Trie(std::shared_ptr<const TrieNode> root)：这是一个类 Trie 的构造函数的声明。它接受一个 std::shared_ptr<const TrieNode> 类型的参数 root。构造函数前面的 explicit 关键字指示该构造函数是显式的，即不允许隐式转换调用。

explicit 这个关键字用来说明这个类的构造不支持隐式类型转换。举个例子：

CPP
class Person
{
public:
    int age;
    std::string name;
    explicit Person(const std::string &name_) : name(name_) {}
    Person(int age_) : age(age_) {}
};

Person a = 22; // 隐式构造，存在这样的构造函数
// Person b = std::string("ceyewan"); wrong，明确了不能使用隐式构造
Person b("ceyewan"); // 只能这样构造
std::cout << a.age << "\n" << b.name << std::endl;

同样，这样初始化不仅代码风格简单，而且可以避免性能浪费，如果不这样的话，“ceyewan“（const char* 类型）会调用 string 的构造函数得到 name_，然后再次调用 string 的构造函数得到 name。

: root_(std::move(root))：这是成员初始化列表，用于初始化 Trie 类的成员变量 root_。在这里，std::move(root) 使用了 C++11 中的 std::move 函数，将参数 root 转移（move）为右值引用，以便进行移动构造。移动构造可以避免不必要的内存拷贝，提高效率。

移动构造函数：C++的移动构造函数是一种特殊的构造函数，用于将资源从一个对象转移到另一个对象而不进行深拷贝。移动构造函数通常用于支持移动语义，以提高代码的效率和性能。

move

参考博客连接

左值和右值

首先区分左值和右值

左值是表达式结束后依然存在的持久对象(代表一个在内存中占有确定位置的对象)

右值是表达式结束时不再存在的临时对象(不在内存中占有确定位置的表达式）

便携方法：对表达式取地址，如果能，则为左值，否则为右值

1
2
3

int val;
val = 4; // 正确 ①
4 = val; // 错误 ②

上述例子中，由于在之前已经对变量val进行了定义，故在栈上会给val分配内存地址，运算符=要求等号左边是可修改的左值，4是临时参与运算的值，一般在寄存器上暂存，运算结束后在寄存器上移除该值，故①是对的，②是错的

std::move作用主要可以将一个左值转换成右值引用，从而可以调用C++11右值引用的拷贝构造函数

智能指针的使用

智能指针可以像普通指针那样使用：

// 定义智能指针
auto_ptr<Test> test(new Test);

cout << "test->debug：" << test->getDebug() << endl;
cout << "(*test).debug：" << (*test).getDebug() << endl;

智能指针的常用函数：

get() 获取智能指针托管的指针地址

auto_ptr<Test> test(new Test);

Test *tmp = test.get();		// 获取指针返回
cout << "tmp->debug：" << tmp->getDebug() << endl;

但我们一般不会这样使用，因为都可以直接使用智能指针去操作，除非有一些特殊情况。

shared_ptr和unique_ptr

std::auto_ptr 是 C++98 中引入的智能指针，用于管理动态分配的内存，但它已经在 C++11 中被标记为废弃（deprecated），并且在 C++17 中被完全移除。

std::shared_ptr 和 std::unique_ptr 是 C++11 引入的两种智能指针，它们都位于 <memory> 头文件中，用于管理动态分配的内存。虽然它们都提供了自动内存管理，但它们之间有一些关键的区别。

std::shared_ptr：

std::shared_ptr 允许多个指针共享对同一对象的所有权。它会维护一个引用计数，记录指向对象的共享指针数量，直到所有的 std::shared_ptr 都被销毁时才会释放对象。
可以通过 std::make_shared 来创建 std::shared_ptr，这会将对象和引用计数放在同一个分配的内存块中，提高效率。
std::shared_ptr 的拷贝构造和拷贝赋值是线程安全的，因为引用计数的增减是原子操作。
使用 std::shared_ptr 时要小心循环引用的问题，即两个或多个 std::shared_ptr 相互持有对方的指针，导致引用计数无法降为零，造成内存泄漏。

std::unique_ptr：

std::unique_ptr 代表对唯一对象的所有权，即同一时刻只有一个 std::unique_ptr 可以拥有一个对象。它提供了移动语义，支持转移所有权而不需要拷贝。
由于它的独占性质，std::unique_ptr 没有引用计数的开销，因此更加轻量级，且无需担心循环引用问题。
不能直接拷贝构造或拷贝赋值 std::unique_ptr，但可以通过移动构造和移动赋值来转移所有权。
可以使用 std::make_unique 来创建 std::unique_ptr。

if you want to convert unique_ptr into shared_ptr, you can use std::shared_ptr<T>(std::move(ptr))

std::string_view

C++17中我们可以使用std::string_view来获取一个字符串的视图，字符串视图并不真正的创建或者拷贝字符串，而只是拥有一个字符串的查看功能。std::string_view比std::string的性能要高很多，因为每个std::string都独自拥有一份字符串的拷贝，而std::string_view只是记录了自己对应的字符串的指针和偏移位置。当我们在只是查看字符串的函数中可以直接使用std::string_view来代替std::string。

创建过程：

int main()
{
 
    const char* cstr = "yangxunwu";
    std::string_view stringView1(cstr);
    std::string_view stringView2(cstr, 4);
    std::cout << "stringView1: " << stringView1 << ", stringView2: " << stringView2 << std::endl;
 
    std::string str = "yangxunwu";
    std::string_view stringView3(str.c_str());
    std::string_view stringView4(str.c_str(), 4);
    std::cout << "stringView3: " << stringView1 << ", stringView4: " << stringView2 << std::endl;
}

分析一下tree.h里面的各种类

模板函数

1 2	template <class T> auto Get(std::string_view key) const -> const T *;

这行代码定义了一个模板成员函数 Get，它接受一个 std::string_view 类型的参数 key，并返回一个指向类型 T 的常量指针。

template <class T>：这是一个模板声明，声明了一个模板函数，其中 <class T> 表示 T 是一个模板参数，可以在函数内部作为一种类型来使用。
auto Get(std::string_view key) const -> const T \*;：这是函数的声明部分。
- auto：返回类型的占位符类型，表示编译器将根据函数体中的返回语句的表达式推断返回类型。
- Get：函数名。
- (std::string_view key)：函数参数列表，接受一个类型为 std::string_view 的参数 key，它是一个轻量级的字符串视图，用于传递字符串的引用和长度。
- const：表示该函数是一个常量成员函数，即在函数内部不能修改对象的成员变量。
- -> const T *：箭头符号 -> 用于指定函数的返回类型，这里返回一个指向类型 T 的常量指针。

get

如果trie为空，则直接返回nullptr。

如果key为空，先找根节点，如果根节点是一个存储value的节点，则返回value。

如果key不为空，让cur指向根节点。遍历key的字符，如果当前字符在cur的子节点map中，则让cur等于当前字符在cur的子节点中的映射节点继续遍历；否则不存在该key，直接返回nullptr即可。最后把找到的value指针返回。

auto pointer= dynamic_cast<const bustub::TrieNodeWithValue<T> *>(root_.get());

沿着字符串找到最后的值节点。然后使用dynamic_cast转换。dynamic_cast允许指针安全地向下转换（即由基类转为派生类指针），失败则返回 nullptr 。但这里的问题是智能指针不支持这个转换，故我们使用智能指针的 get( ) 方法取得裸指针，然后再进行转换。

1
2
3

//B是A的派生类
//b是B类型的对象
std::shared_ptr<A> a = std::make_shared<A>(b);

make_shared不会保留多态性质生成一个指向b的基类指针。make_shared会先截断b，只留下A部分的值，然后生成指针。在本题中就是TrieNodeWithValue会被截断为TrieNode。正确做法是

1	std::shared_ptr<A> a = std::make_shared<B>(b);

template <class T>
auto Trie::Get(std::string_view key) const -> const T * {
  //throw NotImplementedException("Trie::Get is not implemented.");

  // You should walk through the trie to find the node corresponding to the key. If the node doesn't exist, return
  // nullptr. After you find the node, you should use `dynamic_cast` to cast it to `const TrieNodeWithValue<T> *`. If
  // dynamic_cast returns `nullptr`, it means the type of the value is mismatched, and you should return nullptr.
  // Otherwise, return the value.
  //如果树为空则返回空
  if(this->root_==nullptr){
     return nullptr;
  }
  //如果key为空pointerpointer
  if(key.empty()){
    //判断根节点是否为value节点
    if(root_->is_value_node_){
      //转为子类指针
      auto pointer= dynamic_cast<const bustub::TrieNodeWithValue<T> *>(root_.get());
      return  pointer==nullptr?nullptr:pointer->value_.get();
    }
    if(!root_->is_value_node_) {
      return nullptr;
    }    
  }
  else{
    std:: shared_ptr<const bustub :: TrieNode> cur=this->root_;
    //遍历字符串
    for( char ch: key){
       //使用map映射找到字符串对应的节点
       auto node=cur->children_.find(ch);
       //没找到
       if(node==cur->children_.end()){
         return nullptr;
       }
       cur=node->second;
    }   
    if(cur->is_value_node_){
          std::shared_ptr<const bustub::TrieNodeWithValue<T>> pointer =std::dynamic_pointer_cast<const bustub::TrieNodeWithValue<T>>(cur); // 父类智能指针转子类    
         return  pointer==nullptr?nullptr:pointer->value_.get();
    }
    
  }
  return nullptr;
}

put

设置键对应的值。如果键已经存在，则覆盖现有值。注意，值的类型可能是不可复制的（即， std::unique_ptr<int> 因此需要使用移动语义）。这个方法返回一个新的trie，也就是说，实现写时拷贝

template <class T>
void SubPut(std::shared_ptr<bustub::TrieNode> &root,std::shared_ptr<T>& valptr,std::string_view key){
    char ch=key.at(0);  
    auto it = root->children_.find(ch);
    if (it != root->children_.end()) {
        //找到了
        //判断是否为叶子节点
        if(key.size()<=1){
          //覆盖原来的数据
          it->second=std::make_shared<bustub::TrieNodeWithValue<T>>(it->second->children_,valptr);
        }
        //否则复制该节点，并复用下面的节点
        else{
         std::shared_ptr<bustub::TrieNode> newnode=it->second->Clone();
         //递归写入
         SubPut(newnode,valptr,key.substr(1,key.size()-1));
         it->second=std::move(newnode);
        }
    } else {
       //没找到
       char ch=key.at(0);
       //新建节点
       if(key.size()<=1){
         root->children_.insert({ch,std::make_shared<bustub::TrieNodeWithValue<T>>(valptr)}) ;   
        }
       else{
        std::shared_ptr tmpptr=std::make_shared<bustub::TrieNode>();
        //递归插入
        SubPut(tmpptr,valptr,key.substr(1,key.size()-1));
        root->children_.insert({ch,std::move(tmpptr)});
       }
    }
}
template <class T>
auto Trie::Put(std::string_view key, T value) const -> Trie {
  // Note that `T` might be a non-copyable type. Always use `std::move` when creating `shared_ptr` on that value.
  //throw NotImplementedException("Trie::Put is not implemented.");

  // You should walk through the trie and create new nodes if necessary. If the node corresponding to the key already
  // exists, you should create a new `TrieNodeWithValue`.
  //key为空在根节点插入
  std::shared_ptr<T> val_ptr=std::make_shared<T>(std::move(value));
  if(key.empty()){
    std::shared_ptr<bustub::TrieNodeWithValue<T>> newroot=nullptr;
    
   //如果根节点没有孩子直接插入
   if(this->root_->children_.empty()){
    newroot=std::make_shared<bustub::TrieNodeWithValue<T>>(val_ptr);
   }
   //如果根节点有孩子复制完再插入
   else{
    newroot=std::make_shared<bustub::TrieNodeWithValue<T>>(this->root_->children_,std::move(val_ptr));
   }
   //返回树
   return Trie(newroot);
  }
  //key不为空，首先复制树然后递归插入
  std::shared_ptr<bustub::TrieNode> newroot=nullptr;
  if(this->root_!=nullptr){
     newroot=this->root_->Clone();
  }
  else{
    newroot=std::make_shared<TrieNode>(TrieNode());
  }
  //递归插入
  SubPut(newroot,val_ptr,key);
  return Trie(newroot);
}

remove

Delete the value for the key. This method returns a new trie

auto subremove(std::shared_ptr<TrieNode> &root,std::string_view key)->bool{
     char ch=key.at(0);
     auto it =root->children_.find(ch);
     if(it!=root->children_.end()){
      //找到了
      //是否为最后一个节点
      if(key.size()==1){
        if(!it->second->is_value_node_){
            return false;
        }
        else{
            if(it->second->children_.empty()){
                root->children_.erase(it->first);
            }
            else{
                it->second=std::make_shared<TrieNode>(TrieNode(it->second->children_));
            }
            return true;
        }
      }
      else{
        std::shared_ptr ptr=it->second->Clone();//克隆，因为原来是常量指针
        bool ret=subremove(ptr,key.substr(1,key.size()-1));
        if(!ret){
            return false;
        }
        if(ptr->children_.empty()&&!ptr->is_value_node_){
          root->children_.erase(it->first);
        }
        else{
            //转为常量指针
            it->second=std::shared_ptr<const TrieNode>(ptr);
        }
        return true;
      }
     }
     else{
     //没找到
    return false;
     }
}

auto Trie::Remove(std::string_view key) const -> Trie {
  //throw NotImplementedException("Trie::Remove is not implemented.");

  // You should walk through the trie and remove nodes if necessary. If the node doesn't contain a value any more,
  // you should convert it to `TrieNode`. If a node doesn't have children any more, you should remove it.
  if(key.empty()){
    //如果根节点没有孩子
    if(this->root_->children_.empty()){
        return {};
    }
    else{
          std::shared_ptr<TrieNode> newroot=std::make_shared<TrieNode>(TrieNode(this->root_->children_));
          return Trie(newroot);
    }
  }
  else{
        std::shared_ptr newroot=this->root_->Clone();//拷贝转为非常量
        bool flag=subremove(newroot,key);
        if(!flag){
            return *this;
        }
        //判断根节点是否需要删除
        if(newroot->children_.empty()&&!newroot->is_value_node_){
           newroot=nullptr;
        }
        if(flag){
            return Trie(newroot);
        }
  }
  return {};
}

通过本地测试：

bugs

bug1

auto it = root->children_.find(ch); 这句会发生错误。

错误发生在 std::_Rb_tree 的内部，具体是在 std::_Rb_tree::_M_begin() 函数内部，这个函数是红黑树（std::map）的成员函数，用于获取树的起始位置迭代器报错信息中指出错误发生在读取内存地址的操作（READ memory access），并且地址指向未知的地址（unknown address），提示地址指向了零页（zero page）。发现主要是指针没初始化

父类智能指针转换为子类智能指针：

std::shared_ptr<const bustub::TrieNodeWithValue<T>> pointer =std::dynamic_pointer_cast<const bustub::TrieNodeWithValue<T>>(cur); // 父类智能指针转子类

bug2

没法通过CopyOnWriteTest1

定位主要是在这一句出问题：

初步判断原因是ASSERT_EQ(*trie3.Get<uint32_t>("test"), 2333); 在删除test的时候把est节点都删掉了导致的问题

定位到错误：if(ptr->children_.empty()&&ptr->is_value_node_)应该是if(ptr->children_.empty()&&!ptr->is_value_node_)

Task2：Concurrent Key-Value Store

完成copy-on-write trie 后，这个树可以在单一线程的环境使用, 这个实验是implement a concurrent key-value store for a multithreaded environment. 修改 trie_store.h和 trie_store.cpp完成实验

对于传统的 Trie class, 每次我们修改trie的时候, we need to get the new root to access the new content.但是对于现在的 key-value store来说, the put and delete methods 不需要返回值. This requires you to use concurrency primitives（并发原语） to synchronize （同步）reads and writes so that no data is lost through the process.

需求： concurrent key-value store可以并发的服务多个读和单个写。也就是说, when someone is modifying the trie, reads can still be performed on the old root. When someone is reading, writes can still be performed without waiting for reads.

如果我们要获得一个树节点的引用值,我们应该无论是否现在正在修改这个节点我们都能得到他。 get函数只返回一个指针。如果存储此值的trie节点已被删除，则指针将悬空。因此，在TrieStore中，我们返回一个ValueGuard，它存储对该值的引用和与trie结构的根对应的TrieNode，以便在存储ValueGuard时可以访问该值。