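Continuing to explore the Land Registry open data set with SparkR, I wanted to see which road in the UK has had the most property sales over the last 20 years.
As a refresher, this is what the data frame looks like: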
./spark-1.5.0-bin-hadoop2.6/bin/sparkR --packages com.databricks:spark-csv_2.11:1.2.0
> sales <- read.df(sqlContext, "pp-complete.csv", "com.databricks.spark.csv", header="false")
> head(sales)
C0 C1 C2 C3 C4 C5
1 {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000 2003-02-21 00:00 UB5 4PJ T N
2 {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500 2005-07-15 00:00 TA19 9DD D N
3 {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000 2010-09-10 00:00 W4 1DZ F N
4 {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000 1997-08-27 00:00 NE61 2BH D N
5 {47B60101-B64C-413D-8F60-000002F1692D} 147995 2003-05-02 00:00 PE33 0RU D N
6 {51F797CA-7BEB-4958-821F-000003E464AE} 110000 2013-03-22 00:00 NR35 2SF T N
C6 C7 C8 C9 C10 C11
1 F 106 READING ROAD NORTHOLT NORTHOLT
2 F 58 ADAMS MEADOW ILMINSTER ILMINSTER
3 L 58 WHELLOCK ROAD LONDON
4 F 17 WESTGATE MORPETH MORPETH
5 F 4 MASON GARDENS WEST WINCH KING'S LYNN
6 F 5 WILD FLOWER WAY DITCHINGHAM BUNGAY
C12 C13 C14
1 EALING GREATER LONDON A
2 SOUTH SOMERSET SOMERSET A
3 EALING GREATER LONDON A
4 CASTLE MORPETH NORTHUMBERLAND A
5 KING'S LYNN AND WEST NORFOLK NORFOLK A
6 SOUTH NORFOLK NORFOLK A
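This document explains the data stored in each field; for this particular query we're interested in fields C9-C12. The plan is to group the data frame by those fields and then sort by frequency in descending order.
When grouping by multiple fields, the easiest approach is often to create a new field that concatenates them and then group by that single field.
I started with the following: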
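A natural first attempt in R is base `paste`. The sketch below assumes that is the kind of call involved; the `address` column name is my own choice for illustration. Since `sales$C9` and friends are SparkR `Column` objects rather than character vectors, a call like this would be expected to raise an error instead of building the column.

```r
# Assumes the sparkR shell session above, with `sales` already loaded via read.df.
# Hypothetical first attempt: base R's paste() tries to coerce its arguments
# to character vectors, which does not work for SparkR Column objects.
sales$address <- paste(sales$C9, sales$C10, sales$C11, sales$C12, sep = ", ")
```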
Not so successful! Next I went even more primitive:
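A more primitive approach might be to join the columns with `+`, as in the sketch below (again, `address` is an illustrative name). SparkR defines `+` on `Column` objects as arithmetic addition, so the strings are likely treated as numbers rather than concatenated, which would explain a column full of missing values.

```r
# Assumes the sparkR shell session above, with `sales` already loaded via read.df.
# `+` on Columns is numeric addition, not string concatenation.
sales$address <- sales$C9 + ", " + sales$C10 + ", " + sales$C11 + ", " + sales$C12

# Inspect only the new column.
head(select(sales, "address"))
```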
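That at least ran without an error, but every address came back as "NA", which isn't what we want. After a bit of searching I realised there is a concat function that can be used for this job: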
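As a minimal sketch, assuming the function in question is SparkR's `concat_ws` (concatenate with a separator, available from Spark 1.5) and using `address` as an illustrative column name:

```r
# Assumes the sparkR shell session above, with `sales` already loaded via read.df.
# concat_ws joins string Columns using the given separator.
sales$address <- concat_ws(sep = ", ", sales$C9, sales$C10, sales$C11, sales$C12)

# Inspect only the new column.
head(select(sales, "address"))
```

Because the concatenation happens inside Spark rather than in local R, nothing needs to be pulled back to the driver until `head` is called.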
That's more like it! Now let's see which streets have sold the most properties:
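The group-then-sort plan described earlier can be sketched as follows. It assumes the data frame already has a concatenated `address` column like the one built above; `byAddress` and `count` are names picked for illustration.

```r
# Assumes `sales` has an `address` column holding the concatenated C9-C12 fields.
# Count the number of sales per address.
byAddress <- summarize(groupBy(sales, sales$address), count = n(sales$address))

# Sort by frequency, highest first, and show the top 10.
head(arrange(byAddress, desc(byAddress$count)), 10)
```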
Next we'll drill further into the data, but that's for another post.