Installing the GPU driver
Check the GPU model
lspci | grep -i vga
>>> 64:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
>>> 81:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) # GPU model
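Download the driver
Pick the driver that matches your GPU model:
https://www.nvidia.cn/Download/index.aspx?lang=cn
Copy the download link and fetch the file from the terminal.
Copied link: https://cn.download.nvidia.com/XFree86/Linux-x86_64/550.67/NVIDIA-Linux-x86_64-550.67.run
wget @download_link
wget https://cn.download.nvidia.com/XFree86/Linux-x86_64/550.67/NVIDIA-Linux-x86_64-550.67.run
To save the file somewhere specific, cd into that directory before running wget.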
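The link above contains the driver version twice, so the download step can be sketched as a small helper. This is only a sketch: `driver_url` is a hypothetical name, and the URL pattern is inferred from the single link shown above, so it may not hold for other versions or mirrors.

```shell
# Hypothetical helper: build the download URL from a driver version.
# The pattern is inferred from the one link in these notes (an assumption).
driver_url() {
  printf 'https://cn.download.nvidia.com/XFree86/Linux-x86_64/%s/NVIDIA-Linux-x86_64-%s.run\n' "$1" "$1"
}

# Example usage (commented out so nothing is downloaded by accident):
# cd ~/Downloads && wget "$(driver_url 550.67)"
```

Keeping the version in one place makes it harder to end up with a directory and file name that disagree.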
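Install the driver
Stop the X server
The X server provides the graphical desktop session on CentOS and must be stopped before the driver is installed.
Find the X server processes
ps aux | grep X
Kill the process by PID
kill @X_service_pid # usually /usr/bin/X or /usr/bin/Xvnc; they start again after a reboot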
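Note that `ps aux | grep X` also matches the grep process itself and any command line containing an uppercase X, so picking the PID by eye is error-prone. A narrower filter might look like the sketch below. `x_server_pids` is a hypothetical helper, and the process name Xorg is an assumption (the notes above only mention X and Xvnc).

```shell
# Hypothetical helper: read "PID COMMAND" lines on stdin and print the PIDs
# whose command name is exactly X, Xorg, or Xvnc.
# Xorg is an assumption; the notes above only mention X and Xvnc.
x_server_pids() {
  awk '$2 == "X" || $2 == "Xorg" || $2 == "Xvnc" { print $1 }'
}

# Example usage (review the PIDs before killing anything):
# ps -eo pid=,comm= | x_server_pids
# sudo kill @X_service_pid
```

Matching on the exact command name avoids false hits such as the grep process or unrelated programs with an X in their arguments.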
Install
Go to the download directory and locate the driver file NVIDIA-Linux-x86_64-550.67.run
cd Downloads
Run the installer
sudo sh NVIDIA-Linux-x86_64-550.67.run
Prompts you will see
- Install NVIDIA's 32-bit compatibility libraries?
Whether to install the 32-bit compatibility libraries; either answer is fine.
- Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up.
Whether to let nvidia-xconfig update the X configuration file (any existing file is backed up first). Answering No is recommended.
**Apart from the prompts above, the default choice is fine for everything else.**
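After the installation finishes, reboot the machine
sudo reboot
Verify the installation and check the CUDA version
nvidia-smi
Note the CUDA Version shown in the header of the output. It is the highest CUDA version this driver supports, and it is needed later when installing torch.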
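To avoid reading the value off the table by hand, the CUDA Version field can be pulled out of the nvidia-smi header with a small filter. This is a sketch under the assumption that the header contains text of the form `CUDA Version: 12.4`; `cuda_version_from_smi` is a hypothetical helper name.

```shell
# Hypothetical helper: extract the "CUDA Version" value from nvidia-smi
# header text supplied on stdin. Prints nothing if the field is missing
# or not numeric.
cuda_version_from_smi() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Example usage on a machine with the driver installed:
# nvidia-smi | cuda_version_from_smi
```

Reading from stdin keeps the helper usable on saved output as well as on a live `nvidia-smi` call.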
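Possible problems
If the driver was installed and working but later stops working, check it with nvidia-smi.
If the output is:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
then the driver has to be uninstalled and installed again.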
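- Press Ctrl+Alt+F2 (or F1) to switch to a text console
- Log in with your username and password
- Locate the .run file that was used for the installation and run it with --uninstall
sudo sh NVIDIA-Linux-x86_64-535.98.run --uninstall
The uninstaller shows the prompt below. Answer Yes if the X configuration was backed up earlier (that is, nvidia-xconfig was run during installation); otherwise answer No. Answering No to nvidia-xconfig at installation time is the recommended choice.
If you plan to no longer use the NVIDIA driver, you should make sure that no X screens are configured to use the NVIDIA X driver in your X configuration file. If you used nvidia-xconfig to configure X, it may have created a backup of your original configuration. Would you like to run 'nvidia-xconfig --restore-original-backup' to attempt restoration of the original X configuration file?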
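- Other ways to uninstall
sudo yum remove "nvidia-*"
To remove everything cleanly, list the related packages and remove them as well
rpm -qa | grep -i nvid | sort
sudo yum remove "kmod-nvidia-*"
- After uninstalling, reboot the server:
sudo reboot
- Repeat the installation steps above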
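Installing conda
Use miniconda3 to manage Python environments. Follow the installation steps from the official site, then run conda init for your shell once the installation finishes.
Download and install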
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
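Initialize
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Check that the installation succeeded
conda --version # run in a new terminal; it should print a version number
Activate the environment
Open a new terminal; the prompt should now start with (base)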
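Installing torch
Create and activate a conda environment
conda create -n @env_name python=@python_version -y
For example, to create an environment named dl with Python 3.9:
conda create -n dl python=3.9 -y
Activate the environment
conda activate @env_name
Install torch
Reference: https://pytorch.org
First activate the environment that PyTorch should be installed into, dl in this example
conda activate dl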
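Install PyTorch with pip according to the CUDA version
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu@cuda_version
For CUDA 12.4, for example, cuda_version would be 124 (the version with the decimal point removed). PyTorch may not have published wheels for that exact version, though. In that case, lower cuda_version step by step until pip can install it (cu121 here). An exact match is not required: the pip wheels bundle their own CUDA runtime, and the driver also supports CUDA versions older than the one nvidia-smi reports.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121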
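The mapping from a CUDA version to the index suffix described above (drop the decimal point, prefix with cu) can be sketched as a one-line helper. `cuda_tag` is a hypothetical name, and the helper does not check whether wheels actually exist for the resulting tag; that still has to be confirmed on pytorch.org.

```shell
# Hypothetical helper: turn a CUDA version such as 12.1 into the
# wheel index suffix cu121 by removing the decimal point.
cuda_tag() {
  printf 'cu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Example usage (commented out; check pytorch.org for the tags that exist):
# pip3 install torch torchvision torchaudio --index-url "https://download.pytorch.org/whl/$(cuda_tag 12.1)"
```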
After the installation, start Python and check whether CUDA is available
python
import torch
torch.cuda.is_available()
>>> True