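Common Data Download Methods
1. HTTP link download
If clicking a link in the browser automatically triggers a download dialog, it is generally an HTTP link. Right-click the download location to copy the download URL.
Downloading by clicking in the browser has serious limitations, however. For example:
- To download a large number of small files you have to click the links one by one, and you may also need to rename the files.
- The place where a click actually starts the download can be hard to find, and multi-level directories require repeated clicking.
- Browser downloads can run into a maximum size limit, for example a single file of at most 4 GB.
Download method 1: Xunlei (Thunder)
For multiple datasets, HTTP URLs of this kind usually follow a pattern. For example, right-clicking each download location and copying the download link gives: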
http://apdrc.soest.hawaii.edu/erddap/griddap/hawaii_soest_c4f8_7ed4_2d75.nc?iicethic[(2006-01-01):1:(2006-12-31)][(-89.5):1:(89.5)][(0.0):1:(359.0)]
http://apdrc.soest.hawaii.edu/erddap/griddap/hawaii_soest_c4f8_7ed4_2d75.nc?iicethic[(2007-01-01):1:(2008-12-31)][(-89.5):1:(89.5)][(0.0):1:(359.0)]
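The parts that change are the time range and the latitude/longitude ranges in the brackets (in the two links above, only the time range differs). Following this pattern you can generate such links yourself for the times and coordinates you need, copy them all at once, create a batch download task in Xunlei, and paste them in.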
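Generating the links can be sketched in a few lines of Python. This is a minimal sketch that assumes one link per calendar year and reuses the variable name and coordinate ranges from the example URLs above; the function names and the output file name `links.txt` are hypothetical.

```python
from datetime import date

# Base URL and variable name are taken from the example links above.
BASE = "http://apdrc.soest.hawaii.edu/erddap/griddap/hawaii_soest_c4f8_7ed4_2d75.nc"


def build_url(start, end, variable="iicethic", lat=(-89.5, 89.5), lon=(0.0, 359.0)):
    """Return one download URL for a date range and a latitude/longitude box."""
    return (
        f"{BASE}?{variable}"
        f"[({start.isoformat()}):1:({end.isoformat()})]"
        f"[({lat[0]}):1:({lat[1]})]"
        f"[({lon[0]}):1:({lon[1]})]"
    )


def yearly_urls(first_year, last_year):
    """Return one URL per calendar year (an assumed splitting scheme)."""
    return [
        build_url(date(year, 1, 1), date(year, 12, 31))
        for year in range(first_year, last_year + 1)
    ]


def write_links(urls, path):
    """Write one URL per line so the file can be pasted or passed to a downloader."""
    with open(path, "w") as f:
        f.write("\n".join(urls) + "\n")


urls = yearly_urls(2006, 2008)
print(urls[0])
```

Calling `write_links(urls, "links.txt")` would then produce a text file with one link per line, ready to be pasted into a batch download task.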
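Download method 2: the Linux command wget -i
Copy the generated links into a file, then run on the command line:
wget -i links.txt
wget is a powerful downloader with many options. For more advanced uses such as recursive download, resuming interrupted downloads, or going through a proxy, consult the wget manual.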
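If the download is driven from a Python script, the wget call can be assembled as an argument list. The sketch below is only an illustration: the helper name `wget_command`, the file name `links.txt`, and the directory `data` are hypothetical. It uses the wget options `-i` (read URLs from a file), `-c` (continue partially downloaded files), `-P` (target directory), and `-e` (apply a wgetrc-style setting, used here for the proxy).

```python
import shlex


def wget_command(link_file, out_dir=None, resume=True, proxy=None):
    """Build the argument list for a batch wget download."""
    cmd = ["wget"]
    if resume:
        cmd.append("-c")  # resume partially downloaded files
    if out_dir:
        cmd += ["-P", out_dir]  # save files under this directory
    if proxy:
        # wgetrc-style settings passed on the command line
        cmd += ["-e", "use_proxy=yes", "-e", f"http_proxy={proxy}"]
    cmd += ["-i", link_file]  # read the URLs from this file
    return cmd


cmd = wget_command("links.txt", out_dir="data")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where wget is installed. Passing a list rather than a single string avoids shell quoting problems with the brackets and parentheses in the URLs.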
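2. FTP download
Some data sites provide an FTP address, for example BGC-Argo as shown in the figure. With such an FTP server, downloading is fairly easy: connect to the server address with an FTP client such as FileZilla. If the site does not give a username and password, an anonymous connection usually works.
3. OPeNDAP download
OPeNDAP (Open-source Project for a Network Data Access Protocol) is the developer of the client/server software of the same name, which makes it easier for scientists to share data over the internet.
If the data site you are looking at offers an OPeNDAP address, I recommend using it first, because OPeNDAP works seamlessly with xarray and is very convenient.
Copy the link and read it directly with xarray. You can inspect the dataset's metadata and even plot it without downloading the files first. In my case this took only 0.7 s.
Downloading HYCOM OPeNDAP data with xarray
HYCOM time values are stored as the number of hours since 2000-01-01 00:00:00, so the time axis has to be decoded. The advantage of OPeNDAP is that you can open the dataset first, cut out the sea region you want, and then download only that subset.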
import os
from datetime import timedelta

import pandas as pd
import xarray as xr


def decode_times(hours, date_start='2000-01-01 00:00:00'):
    """Convert 'hours since date_start' values into timestamps."""
    start = pd.to_datetime(date_start)
    return [start + timedelta(hours=float(h)) for h in hours]


# Open the remote dataset lazily; decode_times=False because the time axis is decoded by hand
data_global = xr.open_dataset('http://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0', decode_times=False, chunks={"time": 100})
# Select the required sea region and depth range
data_latest = data_global.sel(lat=slice(2, 42), lon=slice(104, 132), depth=slice(0, 1001))
date_time = pd.to_datetime(decode_times(data_latest.time.data))
data_latest['time'] = date_time  # replace the raw time axis with decoded timestamps

# Run this on a schedule every day. Only the most recent 00:00 and 12:00 fields are fetched;
# a file is downloaded again if it is missing or smaller than the expected size.
for date in date_time[-58:]:
    if date.hour == 0 or date.hour == 12:
        path = "/data/hycom_2018_latest/{}.nc".format(str(date))
        if not os.path.exists(path) or os.path.getsize(path) < 96320000:
            data_latest.sel(time=date).to_netcdf(path)
            print(date, "download complete")
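The time decoding step can be checked on its own, without touching the server. The sketch below uses only the standard library and a few made-up hour values; `decode_hours` is a hypothetical helper name, and the origin is the 2000-01-01 00:00:00 start time described above.

```python
from datetime import datetime, timedelta

# Origin of the time axis, as described in the text above.
ORIGIN = datetime(2000, 1, 1)


def decode_hours(hours, origin=ORIGIN):
    """Convert 'hours since origin' values into datetime objects."""
    return [origin + timedelta(hours=float(h)) for h in hours]


# Made-up sample values: 0 h, 12 h, 24 h, and 175320 h (20 years including 5 leap days)
decoded = decode_hours([0, 12, 24, 175320])
for value in decoded:
    print(value.isoformat())
```

The last sample value corresponds to 7305 days after the origin, so it should decode to the start of 2020.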
Parallel download with dask:
from datetime import timedelta

import pandas as pd
import xarray as xr


def decode_times(hours, date_start='2000-01-01 00:00:00'):
    """Convert 'hours since date_start' values into timestamps."""
    start = pd.to_datetime(date_start)
    return [start + timedelta(hours=float(h)) for h in hours]


# Passing chunks makes xarray open the dataset with dask
data_global = xr.open_dataset('http://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0', decode_times=False, chunks={"time": 100})
# Select the required sea region and depth range
data_latest = data_global.sel(lat=slice(2, 42), lon=slice(104, 132), depth=slice(0, 1001))
# Replace the raw time axis of the source data with decoded timestamps
date_time = pd.to_datetime(decode_times(data_latest.time.data))
data_latest['time'] = date_time

# Build the dataset list and the matching file list for save_mfdataset
data_latest_now = []
data_latest_path = []
for date in date_time:
    if (date.hour == 0 or date.hour == 12) and date.year > 2019:
        data_latest_now.append(data_latest.sel(time=date))
        data_latest_path.append("/data/hycom_2020_latest/{}.nc".format(str(date)))
xr.save_mfdataset(data_latest_now, data_latest_path)
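The bookkeeping around the download (which time steps to keep, what to name the files, whether a file needs to be fetched again) can be separated from the xarray calls. The sketch below is an assumption-laden variant, not the script above: the helper names are hypothetical, the timestamps are made-up 3-hourly values, and the file names use `strftime` so that they contain no spaces or colons, unlike `str(date)`.

```python
import os
from datetime import datetime, timedelta


def keep_timestamp(ts, min_year=2020):
    """Keep only the 00:00 and 12:00 fields from min_year onward."""
    return ts.hour in (0, 12) and ts.year >= min_year


def target_path(out_dir, ts):
    """Build a file name without spaces or colons, e.g. 20200101_00.nc."""
    return out_dir.rstrip("/") + "/" + ts.strftime("%Y%m%d_%H") + ".nc"


def needs_download(path, min_bytes):
    """A file is fetched again if it is missing or smaller than expected."""
    return not os.path.exists(path) or os.path.getsize(path) < min_bytes


# Made-up 3-hourly timestamps covering one day
timestamps = [datetime(2020, 1, 1) + timedelta(hours=3 * i) for i in range(8)]
selected = [ts for ts in timestamps if keep_timestamp(ts)]
paths = [target_path("/data/hycom_2020_latest", ts) for ts in selected]
print(paths)
```

Lists built this way pair up one to one, which is what `xr.save_mfdataset` expects for its datasets and paths arguments.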
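4. FTP from the Linux command line
In method 2 I recommended an FTP client, which has the advantage of a graphical interface. It also has a drawback: often we do not want the data on a local machine but directly on a Linux server. What then?
In that case, run the FTP download from a remote Linux terminal. It is also simple; see this article for details: https://linux.cn/article-6746-1.html.
For example, downloading Argo data over FTP: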
(base) msdc@msdc-virtual-machine:~/hycom_predict_temp_3D$ ftp data.argo.org.cn
Connected to data.argo.org.cn.
220 (vsFTPd 3.0.2)
Name (data.argo.org.cn:msdc): anonymous # anonymous means logging in anonymously
331 Please specify the password.
Password:
230-$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
230- Welcome to the FTP site of the China Argo Real-time Data Centre (CARDC).
230- The site is maintained by the Second Institute of Oceanography, Ministry
230- of Natural Resources.
230- CARDC website: http://www.argo.org.cn/
230-$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> ls
200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
drwxr-xr-x 3 0 0 26 Nov 10 2019 pub
226 Directory send OK.
ftp> cd pub
250 Directory successfully changed.
ftp> ls
200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
drwxr-xr-x 12 1000 1000 246 Apr 23 2020 ARGO
226 Directory send OK.
ftp> cd ARGO
250-$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
250- Welcome to the FTP site of the China Argo Real-time Data Centre (CARDC).
250- The site is maintained by the Second Institute of Oceanography, Ministry
250- of Natural Resources. All data contained on this site is produced by CARDC.
250- Users are permitted to download and make use of all the data.
250-$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
250 Directory successfully changed.
ftp> ls
200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
drwxr-xr-x 2 1000 1000 131 Nov 10 2019 ArgoQuerySystem
drwxr-xr-x 6 1000 1000 158 Nov 10 2019 Argo_derived
drwxr-xr-x 5 1000 1000 58 Nov 10 2019 BOA_Argo
drwxr-xr-x 4 1000 1000 53 Nov 10 2019 G-argo
drwxr-xr-x 2 1000 1000 12288 Nov 10 2019 GDCSM
drwxr-xr-x 2 1000 1000 8192 Nov 10 2019 ROSWPOA
drwxr-xr-x 2 1000 1000 4096 Sep 08 08:14 argo-index
drwxrwxr-x 2 1000 1000 144 Apr 23 2020 etopo
drwxr-xr-x 13 1000 1000 4096 Oct 11 02:22 raw_argo_data
drwxr-xr-x 2 1000 1000 142 Nov 10 2019 surface_current
226 Directory send OK.
ftp> cd BOA_Argo
250 Directory successfully changed.
ftp> ls
200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
drwxr-xr-x 2 1000 1000 8192 May 22 04:33 MAT
drwxr-xr-x 2 1000 1000 8192 May 22 04:33 NetCDF
drwxr-xr-x 2 1000 1000 171 Apr 30 08:27 doc
226 Directory send OK.
ftp> cd NetCDF
250 Directory successfully changed.
ftp> ls
200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
-rw-r--r-- 1 1000 1000 54151116 Apr 06 2021 BOA_Argo_2004_01.nc
-rw-r--r-- 1 1000 1000 54151116 Apr 01 2021 BOA_Argo_2004_02.nc
-rw-r--r-- 1 1000 1000 54151116 Apr 01 2021 BOA_Argo_2004_03.nc
*********************
*********************
226 Directory send OK.
ftp> lcd ARGO
Local directory now /home/msdc/Downloads/ARGO
ftp> prompt off
Interactive mode off.
ftp> mget BOA_Argo_2*.nc
local: BOA_Argo_2004_01.nc remote: BOA_Argo_2004_01.nc
200 PORT command successful. Consider using PASV.
150 Opening BINARY mode data connection for BOA_Argo_2004_01.nc (54151116 bytes).
226 Transfer complete.
.......................
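The same session can also be scripted with Python's standard `ftplib`, which is convenient for scheduled jobs on a server. This is a minimal sketch, not a tested mirror of the session above: the function names `match_names` and `mget` are hypothetical, and the host, directory, and file pattern in the usage comment are the ones from the transcript.

```python
import fnmatch
import os
from ftplib import FTP


def match_names(names, pattern):
    """Return the file names that match a shell-style pattern, sorted."""
    return sorted(n for n in names if fnmatch.fnmatchcase(n, pattern))


def mget(host, remote_dir, pattern, local_dir):
    """Anonymous login, change directory, and download every matching file."""
    os.makedirs(local_dir, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login()  # no arguments: anonymous login
        ftp.cwd(remote_dir)
        names = match_names(ftp.nlst(), pattern)
        for name in names:
            with open(os.path.join(local_dir, name), "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)
    return names


# Usage corresponding to the session above (requires network access):
# mget("data.argo.org.cn", "pub/ARGO/BOA_Argo/NetCDF", "BOA_Argo_2*.nc", "ARGO")

demo = match_names(
    ["BOA_Argo_2004_02.nc", "readme.txt", "BOA_Argo_2004_01.nc"],
    "BOA_Argo_2*.nc",
)
print(demo)
```

`ftplib` uses passive mode by default, which may behave differently from the PORT-mode transfers shown in the transcript on networks with restrictive firewalls.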