作为我对 Land Registry price paid 数据集 探索的一部分,我想比较连续销售房产之间的差异。
这意味着我们需要按属性标识符对销售进行分组,然后将之前的销售价格放入每一行的列中,除非是第一次销售,在这种情况下我们将得到“NA”。我们可以通过创建 滞后 变量来做到这一点。
我将使用一个在结构上与土地注册处非常相似的更简单的数据集来演示:
> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"),
price = c(10000, 12500, 18000, 245000, 512000, 1000000))
> blogDT
name price
1: Property 1 10000
2: Property 1 12500
3: Property 1 18000
4: Property 2 245000
5: Property 2 512000
6: Property 2 1000000
我们想按“名称”列分组,然后将第 1 行的价格显示在第 2 行,第 2 行的价格显示在第 3 行,第 4 行的价格显示在第 5 行,第 5 行的价格显示在第 6 行。为此,我们将引入一个“lag.price”列:
> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"),
price = c(10000, 12500, 18000, 245000, 512000, 1000000))
> blogDT
name price
1: Property 1 10000
2: Property 1 12500
3: Property 1 18000
4: Property 2 245000
5: Property 2 512000
6: Property 2 1000000
接下来让我们计算两个价格之间的差异:
> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"),
price = c(10000, 12500, 18000, 245000, 512000, 1000000))
> blogDT
name price
1: Property 1 10000
2: Property 1 12500
3: Property 1 18000
4: Property 2 245000
5: Property 2 512000
6: Property 2 1000000
最后让我们按照最大的价格涨幅对数据表进行排序:
> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"),
price = c(10000, 12500, 18000, 245000, 512000, 1000000))
> blogDT
name price
1: Property 1 10000
2: Property 1 12500
3: Property 1 18000
4: Property 2 245000
5: Property 2 512000
6: Property 2 1000000