Troy Hunt recently introduced HIBP Passwords, a freely downloadable list of over 300 million passwords that have been pwned in the various breaches the site records. There is an API for auditing and checking passwords against the list, but it's rate limited, and I thought it would be friendlier to import the passwords into a database we control. It looks like HIBP uses Azure Table Storage to make the data quickly accessible. I do most of my work on AWS, so I thought I'd take a look at importing the hashes into DynamoDB, which is relatively cheap and easy to use.
The first thing I tried was to follow a tutorial using AWS Data Pipeline, which seemed to be exactly what I needed. In the end, though, I found the tutorial made a few assumptions that I either missed or that it failed to mention, mostly around the expected data format for the CSV files. Thankfully, this led me towards using AWS Elastic MapReduce (EMR) directly, and that turned out to be the winning formula.
To start with, I needed to get the data from HIBP and upload it to S3. I think this method would have worked if the source files had been gzipped, but I wasn't too sure about 7zipped, so I decompressed them before uploading them to S3.
mkdir hibp-passwords
cd hibp-passwords
wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-1.0.txt.7z
wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-update-1.txt.7z
wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-update-2.txt.7z
7zr e pwned-passwords-1.0.txt.7z
7zr e pwned-passwords-update-1.txt.7z
7zr e pwned-passwords-update-2.txt.7z
rm *.7z
aws s3 mb s3://hibp-passwords-123
aws s3 sync ./ s3://hibp-passwords-123

I created a DynamoDB table to hold the data. It only needs the one column, which will also be the partition key.
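For reference, the AWS CLI equivalent would be something like the following. I've used hibp_passwords as the table name and sha1 as the key to match the commands later in the post; the capacity values are just placeholders, since we'll raise the write capacity in a moment.

aws dynamodb create-table \
    --table-name hibp_passwords \
    --attribute-definitions AttributeName=sha1,AttributeType=S \
    --key-schema AttributeName=sha1,KeyType=HASH \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5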

Now adjust the write capacity to allow our import to go reasonably quickly. I set the capacity to 10000 units, which I think was the maximum without having to ask AWS to lift the limits.
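Again, this can be done from the CLI if you prefer, along these lines (update-table wants both values, so I've kept the read capacity at the 5 units from the create step):

aws dynamodb update-table \
    --table-name hibp_passwords \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=10000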

We then need to create our EMR cluster. I experimented with different sizes to begin with, but settled on a cluster of 16 c4.8xlarge instances, leaving the rest of the settings at their defaults. Between this and the write capacity we set on the DynamoDB table, the import took around 10 hours. There's probably a better combination of cluster size and write capacity, but it was good enough for me.
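If you'd rather script the cluster creation than click through the console, a rough CLI equivalent would be something like this (the cluster name, key pair name, and release label here are illustrative; use whichever EMR release is current):

aws emr create-cluster \
    --name "hibp-import" \
    --release-label emr-5.7.0 \
    --applications Name=Hive \
    --instance-type c4.8xlarge \
    --instance-count 16 \
    --ec2-attributes KeyName=your-key \
    --use-default-roles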

Now we're all good to go. Log in to the EMR master node, install tmux, and fire up Hive.
ssh -i ~/.ssh/your-key.pem hadoop@your-master-node-public-url
sudo yum install tmux
tmux
hive

We need to tell Hive to use as much DynamoDB write capacity as it sees fit, since no other resources are currently trying to access the table.
> SET dynamodb.throughput.write.percent=1.5;

This setting is the fraction of the table's provisioned write throughput the connector will consume; 1.5 is the highest value it accepts. We then create our first Hive table by telling it where to find our data.
> CREATE EXTERNAL TABLE s3_hibp_passwords(a_col string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 's3://hibp-passwords-123/';
OK
Time taken: 3.371 seconds

The first time I tried running the import, I ran into trouble with it trying to insert duplicate items into the DynamoDB table. A bit of searching led me to a stackoverflow question and a means to create another table containing only unique hashes.
> CREATE TABLE s3_hibp_passwords_dedup AS
  SELECT a_col FROM (
    SELECT a_col, rank() OVER (PARTITION BY a_col) AS col_rank
    FROM s3_hibp_passwords
  ) t
  WHERE t.col_rank = 1
  GROUP BY a_col;
Query ID = hadoop_20170810144127_818de1ea-0c47-4a16-a5bf-5b77e3336d69
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1502375367082_0202)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED    257        257        0        0       0       0
Reducer 2 ...... container     SUCCEEDED     53         53        0        0       0       0
Reducer 3 ...... container     SUCCEEDED     27         27        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 105.63 s
----------------------------------------------------------------------------------------------
Moving data to directory hdfs://ip-172-31-43-51.eu-west-1.compute.internal:8020/user/hive/warehouse/s3_hibp_passwords_dedup
OK
Time taken: 112.395 seconds

Now we need to create another Hive table; this one will be backed by our DynamoDB table.
> CREATE EXTERNAL TABLE ddb_hibp_passwords (col1 string)
  STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
  TBLPROPERTIES ("dynamodb.table.name" = "hibp_passwords",
                 "dynamodb.column.mapping" = "col1:sha1");
OK
Time taken: 0.925 seconds

That's it, we're all ready to go. This is going to take some time, so set it running and go to bed.
> INSERT OVERWRITE TABLE ddb_hibp_passwords SELECT * FROM s3_hibp_passwords_dedup;
Query ID = hadoop_20170810144407_872afa4b-ca88-4b57-a284-5496fcb0d30f
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1502375367082_0002)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED    254        254        0        0      18       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 59056.94 s
----------------------------------------------------------------------------------------------
OK
Time taken: 59058.426 seconds
hive>

I don't know what happened with the failures; I didn't bother checking. I know I ended up with 319,169,449 hashes in the database, so that's good enough for me.
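If you want a rough sanity check on the count without scanning the whole table, describe-table reports an approximate item count, though DynamoDB only refreshes it every six hours or so. It's also worth dialling the write capacity back down at this point, otherwise the table is expensive to keep around:

# Approximate item count (refreshed roughly every six hours)
aws dynamodb describe-table --table-name hibp_passwords \
    --query 'Table.ItemCount'

# Drop the write capacity back down now the bulk import is done
aws dynamodb update-table \
    --table-name hibp_passwords \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=1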

Back in the terminal, we can quickly check to see how this is going to work. Note that I shifted the hash to uppercase, since the HIBP source hashes are all uppercase.
davem@wes:~$ SHA1DAVE=$(echo -n dave | sha1sum | tr "[a-z]" "[A-Z]" | awk '{ print $1 }')
davem@wes:~$ echo $SHA1DAVE
BFCDF3E6CA6CEF45543BFBB57509C92AEC9A39FB
davem@wes:~$ aws dynamodb get-item --table-name hibp_passwords --key "{\"sha1\": {\"S\": \"$SHA1DAVE\"}}"
{
    "Item": {
        "sha1": {
            "S": "BFCDF3E6CA6CEF45543BFBB57509C92AEC9A39FB"
        }
    }
}
real 0.46
user 0.30
sys 0.04
davem@wes:~$

Have fun auditing your passwords.